# Data Challenge: Predicting the SWI Uniforme from SWI (France)

## Context

### Introduction to SWI and SWI Uniforme

In this project, we would like to explore the relationship between **SWI (Soil Wetness Index)** and **SWI uniforme** in the context of forecasting **natural catastrophes**, particularly **droughts** (*sécheresse*) in France.

#### What are SWI and SWI uniforme?

- **SWI (Soil Wetness Index)** is a numerical indicator designed to capture the **moisture content in soils**. It reflects how saturated the soil is, which is crucial for understanding the potential for **droughts or floods**.
- It is computed from hydrological models that integrate **precipitation**, **evapotranspiration**, and, crucially, **soil properties**, especially the **clay content** (in French: *argile*), which influences the soil's **water retention capacity**.
- **SWI uniforme** (Soil Wetness Index under uniform soil assumption) is a *standardized version* of the SWI that assumes **uniform soil characteristics** across France (same soil texture, composition everywhere). It removes spatial variability due to soil heterogeneity and is used as a **neutral benchmark**.


#### Who produces them and why?

- The **SWI** (Soil Wetness Index) is produced by **Météo-France** with the **SIM (SAFRAN-ISBA-MODCOU)** model chain and is made available **daily** (_SIM quotidienne_). It reflects actual soil moisture and accounts for local soil properties such as clay content.

- The **SWI uniforme** is also produced by **Météo-France**, but under the assumption of a **homogeneous soil profile across France**. It is released **once a year**, typically in **April–May of the following year**, and serves as the **official reference** for drought-related policy and insurance decisions.

- The **interministerial commission** uses the SWI uniforme as part of the **CatNat (Catastrophes Naturelles)** compensation process. This uniform soil assumption ensures **fairness and comparability** across regions, especially given the limited spatial resolution and coverage of available soil data (e.g., BRGM maps).

#### Why are these indices important?

These indices are widely used by researchers, insurers, and public authorities to:

- **Forecast droughts** and estimate their impact on infrastructure and buildings.
- **Assess financial risk** and plan insurance reserves.
- Trigger or reject **CatNat eligibility** for affected communes.
- Compare actual drought impacts (SWI) with reference drought levels (SWI uniforme).

#### Real-world challenge behind this project

This project simulates the practical challenge faced by **insurers** (e.g., Allianz) and **public bodies**:

>  **How can we estimate the SWI Uniforme months before it is officially published**, using available SWI data and other spatial indicators?

Such a forecast is critical for:

- Anticipating the **number of eligible communes**,
- Estimating potential **financial losses** in the year ahead,
- Helping in **portfolio planning** and **risk management** decisions by September of the current year, not April of the next.

This challenge will also allow you to reflect on the **influence of soil composition** (particularly clay content) on SWI values and drought modeling.

### Download the Daily Soil Wetness Index (_SIM quotidienne_ - Météo-France)

You can download the **daily SWI values** produced by Météo-France's hydrological model (SIM SAFRAN) here:

👉 [https://www.data.gouv.fr/datasets/donnees-changement-climatique-sim-quotidienne/](https://www.data.gouv.fr/datasets/donnees-changement-climatique-sim-quotidienne/)

This dataset includes time series of **SWI and other hydrometeorological variables** for each point of the SAFRAN grid, available at **daily resolution**. It is essential for analyzing drought dynamics and comparing against the SWI uniforme.

**Metadata in the Data Files:**
Each data file contains, in addition to the values of the climatological parameters, the following fields:

- **LAMBX**: X coordinate of the grid point in **Lambert II étendu** projection (in hectometres, hm)
- **LAMBY**: Y coordinate of the grid point in **Lambert II étendu** projection (in hectometres, hm)

> **Remark:** the CRS corresponding to _Lambert II étendu_ is _27572_, i.e., `crs='EPSG:27572'`
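
Because the coordinates are given in hectometres while EPSG:27572 is expressed in metres, a unit conversion is needed before any spatial operation. The sketch below shows one way to do it with `pandas`; the sample rows and the output column names `x_m` / `y_m` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

HM_TO_M = 100  # 1 hectometre = 100 metres


def add_metric_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with x_m / y_m columns in metres (hypothetical names)."""
    out = df.copy()
    out["x_m"] = out["LAMBX"] * HM_TO_M
    out["y_m"] = out["LAMBY"] * HM_TO_M
    return out


# Hypothetical sample rows, for illustration only (not real grid values).
sample = pd.DataFrame(
    {"LAMBX": [600, 680], "LAMBY": [24010, 24010], "SWI": [0.52, 0.47]}
)
grid = add_metric_coordinates(sample)
```

From there, `geopandas.points_from_xy(grid["x_m"], grid["y_m"], crs="EPSG:27572")` should give point geometries that can be reprojected or joined with other layers.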

### Download the SWI Uniforme dataset

👉 [https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=301&id_rubrique=40](https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=301&id_rubrique=40)

The dataset contains:
- Monthly values of SWI uniforme,
- At the same spatial resolution as the SAFRAN grid,
- For the full French territory,
- In formats suitable for analysis (CSV, NetCDF, etc.).

## The BRGM Soil Map and Its Role in SWI Modeling

The **BRGM map** refers to the geoscientific datasets produced by the **Bureau de Recherches Géologiques et Minières** (BRGM), the French national geological survey. These datasets include detailed information about the **composition of soils** across France, particularly the **percentage of clay (argile)**, which plays a central role in soil behavior during wet and dry periods.

### Why is the BRGM map important for SWI?

The **Soil Wetness Index (SWI)** depends on how soils retain and release water. Soils with a high clay content can **retain moisture longer**, which affects the evolution of the SWI after precipitation events. 

In contrast, the **SWI uniforme** is computed with the assumption of a **uniform soil composition across the country**, which removes local variations like those captured in the BRGM map.

As a result, the **difference between SWI and SWI uniforme** can often be explained by the local variability in soil clay content, making the BRGM map a key dataset for interpreting and modeling drought risk indicators.

### Download Soil Clay Content Data (RGA - BRGM 2020)

You can download the official dataset showing the **percentage of clay in soils** across France from the French open data portal:

👉 [https://www.data.gouv.fr/datasets/tiles-rga-brgm-2020/](https://www.data.gouv.fr/datasets/tiles-rga-brgm-2020/)

This dataset comes from the BRGM mapping of ***retrait-gonflement des argiles* (RGA)**, i.e., clay shrink-swell, and can be used to spatially join soil clay content with SWI grids or administrative units (communes, départements).

## Administrative Boundaries of France

The **administrative division of the French territory** is structured into several hierarchical levels, each serving different roles in governance, statistics, and spatial analysis. Understanding these units is essential when working with geographic data such as **SWI**, **SWI uniforme**, or **soil composition**, since analyses often need to be aggregated or compared across these boundaries.

### Main Administrative Levels

- **Commune** → The smallest administrative unit in France (over 34,000 in total). Each commune has a mayor and a local council.  
- **Arrondissement départemental** → A subdivision of a department, mainly used for administrative management.  
- **Département** → Intermediate level of governance (101 in total, including overseas). Each department has a prefect and a council.  
- **Région** → The largest administrative division (18 in total). Regions handle economic planning, transport, and regional development.  
- **Intercommunalité** (optional level) → Groupings of communes that cooperate on shared projects and services.

### The Admin Express Dataset

All these administrative layers are available in geospatial format through the **Admin Express** dataset published by the **_Institut national de l'information géographique et forestière_ (IGN)**.

This dataset provides official boundaries for:
- Communes (`COMMUNE`),
- Arrondissements (`ARRONDISSEMENT`),
- Départements (`DEPARTEMENT`),
- Régions (`REGION`),
- Intercommunalities (`EPCI`).

Each level includes identifiers compatible with **INSEE codes**, making it easy to join with demographic or environmental data.

#### 📥 Download the Admin Express dataset here:

👉 [https://geoservices.ign.fr/adminexpress](https://geoservices.ign.fr/adminexpress)



---

## Project task, Part 1: SWI vs SWI uniforme

**Predict the SWI Uniforme** (delayed, official drought index) from the **daily SWI (Safran)** and complementary data sources such as:

- **Soil composition** (e.g., clay content from BRGM maps),
- **Geographic location** (commune polygons, coordinates),
- **Historical SWI vs SWI Uniforme pairs** for past years.


### Tasks and Research Questions

#### 1. Data Understanding & Exploratory Analysis

Before moving to prediction, your first task is to perform a **comprehensive exploratory analysis** of the available data. This should include both **quantitative** and **qualitative** insights on drought behavior across space and time.

Your goals:

#### A. Load & Prepare the Data

- Load historical **SWI (Safran)** and **SWI Uniforme** datasets.
- Load **soil composition data** (clay percentage from BRGM).
- Load **administrative boundaries** (communes, départements, régions).
- Link the SWI and SWI uniforme data to **commune polygons** and **soil maps**, so you can analyze at regional scale.
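
As a rough illustration of the linking step, the sketch below attaches grid points to commune polygons with a `geopandas` spatial join and averages the index per commune. The two square "communes", the point values, and the column names `x_m`, `y_m`, `swi` are invented; `INSEE_COM` is assumed to be the commune code column and should be checked against the Admin Express files actually downloaded.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical commune polygons (two squares); real ones would come from Admin Express.
communes = gpd.GeoDataFrame(
    {"INSEE_COM": ["00001", "00002"]},
    geometry=[box(0, 0, 10_000, 10_000), box(10_000, 0, 20_000, 10_000)],
    crs="EPSG:27572",
)

# Hypothetical grid points, with coordinates already converted to metres.
points = pd.DataFrame(
    {"x_m": [2_000, 8_000, 15_000], "y_m": [5_000, 5_000, 5_000], "swi": [0.4, 0.6, 0.3]}
)
grid = gpd.GeoDataFrame(
    points,
    geometry=gpd.points_from_xy(points["x_m"], points["y_m"]),
    crs="EPSG:27572",
)

# Attach each grid point to the commune polygon that contains it.
joined = gpd.sjoin(grid, communes, how="inner", predicate="within")

# Average the index over all grid points falling inside each commune.
swi_by_commune = joined.groupby("INSEE_COM")["swi"].mean()
```

Both layers must share the same CRS before the join, so one of them may need `to_crs(...)` first. Since communes can be smaller than a grid cell, some polygons may contain no grid point at all; a nearest-neighbour join such as `geopandas.sjoin_nearest` is one possible fallback for those.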

#### B. Quantitative Analysis

- Compute **summary statistics** (mean, std, min, max) for both SWI and SWI Uniforme.
- Evaluate the **temporal evolution** of SWI:
  - Identify trends across **years** (e.g., are droughts more frequent after 2010?),
  - Explore **monthly patterns** (e.g., which months are driest).
- Perform **regional aggregation** (mean SWI by region, month, or year).
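
A minimal sketch of these aggregations with `pandas`, run on a synthetic daily series. The column names `date` and `swi` are placeholders (the real file's column names should be taken from its documentation), and using the monthly mean as the aggregation is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for a single grid point (values are made up).
dates = pd.date_range("2020-01-01", "2021-12-31", freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame({"date": dates, "swi": rng.uniform(0.2, 0.9, len(dates))})

daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month

# Monthly mean per (year, month), i.e. the same time step as SWI uniforme.
monthly = daily.groupby(["year", "month"], as_index=False)["swi"].mean()

# Seasonal profile: average of each calendar month across years.
seasonal = monthly.groupby("month")["swi"].mean()
driest_month = int(seasonal.idxmin())

# Overall summary statistics.
summary = daily["swi"].agg(["mean", "std", "min", "max"])
```

The same `groupby` pattern extends to regional aggregation by adding a region or département key once the grid has been linked to administrative boundaries.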

#### C. Qualitative Regional Drought Insights

- Identify **which regions are most and least affected by drought**:
  - Expect the **South and Southwest** (e.g., Bordeaux, Montpellier) to show more frequent or more severe drought,
  - **Brittany** (Bretagne) is likely less affected, possibly because its soils contain little clay (to be checked),
  - **Grand Est** may show **increasing drought trends** in recent years.
- Visualize regional variability using **choropleth maps** or **ranking plots**.

#### D. SWI vs. SWI Uniforme: Relationship Analysis

- Quantify how strongly **SWI and SWI Uniforme are correlated** (per region or grid point).
- Study the **spatial divergence**:
  - Map where the two indicators **differ the most** across France.
- Investigate how **soil properties** (e.g., clay content) **explain this divergence**:
  - Are differences larger in clay-rich soils?
  - Do differences align with regions that vary strongly in soil composition?

> This analysis will help you understand the **behavioral difference between the indicators**, which is crucial for modeling and for interpreting why a commune may or may not be declared in drought.
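
One possible way to organize this comparison is sketched below on synthetic data: a per-cell correlation between the two indices, the mean absolute gap, and the link between that gap and clay content. All values are simulated (the gap is built to grow with clay, purely to make the example readable), and the column names `cell`, `clay_pct`, `swi`, `swi_unif` are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frames = []
for cell, clay in [("A", 5.0), ("B", 20.0), ("C", 40.0)]:
    swi_unif = rng.uniform(0.2, 0.9, 36)  # 36 synthetic monthly values
    # Synthetic assumption: the gap between the indices grows with clay content.
    swi = swi_unif + 0.005 * clay + rng.normal(0, 0.02, 36)
    frames.append(
        pd.DataFrame({"cell": cell, "clay_pct": clay, "swi": swi, "swi_unif": swi_unif})
    )
df = pd.concat(frames, ignore_index=True)
df["gap"] = df["swi"] - df["swi_unif"]

# Pearson correlation between the two indices, computed cell by cell.
corr = df.groupby("cell")[["swi", "swi_unif"]].apply(
    lambda g: g["swi"].corr(g["swi_unif"])
)

per_cell = pd.DataFrame(
    {
        "corr": corr,
        "mean_abs_gap": df.groupby("cell")["gap"].apply(lambda g: g.abs().mean()),
        "clay_pct": df.groupby("cell")["clay_pct"].first(),
    }
)

# Does the divergence increase with clay content across cells?
clay_vs_gap = per_cell["clay_pct"].corr(per_cell["mean_abs_gap"])
```

On real data, `per_cell` can be merged back onto the grid geometry to map where the two indicators differ most.
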

#### 2. Modeling the SWI Uniforme
- Build a predictive model (e.g., linear regression, random forest, neural network) that estimates **SWI Uniforme** from:
  - SWI (Safran),
  - Soil characteristics,
  - Climatic or geographic covariates.

- Evaluate the performance using past years where both SWI and SWI Uniforme are known.
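
As a starting point, the sketch below fits an ordinary least-squares baseline with `numpy` and evaluates it on held-out recent years, which mimics predicting a year whose SWI Uniforme is not yet published. The data are simulated and the column names (`swi`, `clay_pct`, `swi_unif`) and the 2021 cut-off are arbitrary choices for illustration; any of the models listed above could replace the linear fit.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame(
    {
        "year": rng.integers(2010, 2023, n),  # years 2010..2022
        "swi": rng.uniform(0.2, 0.9, n),
        "clay_pct": rng.uniform(0, 40, n),
    }
)
# Synthetic target, built only so that the example has something to recover.
df["swi_unif"] = (
    0.9 * df["swi"] - 0.002 * df["clay_pct"] + 0.05 + rng.normal(0, 0.01, n)
)

features = ["swi", "clay_pct"]
train = df[df["year"] < 2021]   # past years: both indices known
test = df[df["year"] >= 2021]   # held-out years used for evaluation


def design(frame: pd.DataFrame) -> np.ndarray:
    """Design matrix with an intercept column followed by the features."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])


coef, *_ = np.linalg.lstsq(design(train), train["swi_unif"].to_numpy(), rcond=None)
pred = design(test) @ coef
rmse = float(np.sqrt(np.mean((pred - test["swi_unif"].to_numpy()) ** 2)))
```

Splitting by year rather than at random avoids leaking information from the year being predicted into the training set.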

#### 3. Interpretation & Communication
- Map and interpret areas with the largest errors.
- Discuss the **potential policy implications** (e.g., fairness in drought declarations, bias due to soil heterogeneity).
- Reflect on how this model could help anticipate **CatNat eligibility** for droughts.

---

## Project task, Part 2: Defining and Predicting Extreme Drought Using SWI Uniforme

In this project, we aim to predict whether a location (commune or grid cell) will experience an **extreme geotechnical drought**, based on **SWI uniforme** values. The challenge is to simulate the eligibility rules used by the **interministerial commission** (CatNat), which are based not on fixed values but on **statistical thresholds**.

We will approach this problem in **three steps**, using increasingly realistic definitions of drought:

### 1. Drought Definition Using the 25% Quantile

We begin by computing the **25th percentile** (first quartile) of historical SWI uniforme values for each location and calendar quarter (e.g., the second quarter, April–June).  
Any year where the SWI uniforme falls **below this 25% threshold** is marked as an **extreme drought**.

> This first task defines drought in a broad sense and helps initialize your binary classification model.


### 2. Drought Definition Using the 5% Quantile

Next, we apply a **stricter criterion**: we compute the **5th percentile** of SWI uniforme for each location and quarter.  
Years below this threshold are classified as **very extreme droughts**.

> This step simulates rare but impactful events, refining your model's sensitivity to extreme low moisture conditions.
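
The two quantile-based definitions above differ only in the quantile level, so a single helper can produce both labels. The sketch below assumes a long table with one row per (cell, quarter, year); the column names and the synthetic values are placeholders.

```python
import numpy as np
import pandas as pd


def quantile_drought_label(
    df: pd.DataFrame, q: float, value_col: str = "swi_unif"
) -> pd.Series:
    """Return 1 where the value is below the q-quantile of its (cell, quarter) history."""
    threshold = df.groupby(["cell", "quarter"])[value_col].transform(
        lambda s: s.quantile(q)
    )
    return (df[value_col] < threshold).astype(int)


# Synthetic history: one cell, one quarter, 20 evenly spaced yearly values.
history = pd.DataFrame(
    {
        "cell": "A",
        "quarter": 2,
        "year": np.arange(2001, 2021),
        "swi_unif": np.linspace(0.10, 0.86, 20),
    }
)
history["drought_q25"] = quantile_drought_label(history, 0.25)
history["drought_q05"] = quantile_drought_label(history, 0.05)
```

Here the threshold is computed on the full history, including the year being labelled. For a forecasting setup it may be preferable to compute it on a fixed reference period only.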


### 3. Official Rule (“Règle Métier”) from Météo-France

Finally, we implement the official drought definition from the **Météo-France 2019 report**:

> *“On considère que la durée de retour de la sécheresse géotechnique est supérieure ou égale à 25 ans si l’indicateur du trimestre considéré se classe au 1er ou au 2ᵉ rang parmi les indicateurs calculés sur les 50 dernières années pour ce trimestre.”*

Translated: the **return period** of a geotechnical drought is considered to be **at least 25 years** when the **SWI uniforme** indicator for the quarter ranks **1st or 2nd** (i.e., is among the **two lowest values**) among the indicators computed over the **past 50 years** for that quarter.

📄 Full report:  
👉 [Météo-France, Rapport 2019: Sécheresse géotechnique](https://meteofrance.fr/sites/meteofrance.fr/files/files/editorial/RapportM%C3%A9t%C3%A9orologique_CatNat-S%C3%A9cheresse_2019-2-compressed%20%281%29.pdf)

> This is the most realistic and policy-relevant definition. Your final binary label will be:
> - `1` if the SWI uniforme is in the **lowest 2 values** over 50 years,
> - `0` otherwise.
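
A minimal sketch of this rank rule for a single location and quarter is given below. It assumes that the 50-year window ends with, and includes, the year being evaluated, and that ties share the better rank; both are interpretation choices to verify against the report. Years without a full 50-year history are left as `NaN` rather than labelled.

```python
import numpy as np
import pandas as pd

WINDOW = 50     # number of years in the ranking window
TOP_RANKS = 2   # ranks 1 and 2 (the two driest) trigger the label


def rank_rule_label(values: pd.Series) -> pd.Series:
    """Label each year of one (location, quarter) series, sorted by year.

    Returns 1.0 if the value ranks among the TOP_RANKS lowest of the last
    WINDOW years (current year included), 0.0 otherwise, NaN if the window
    is incomplete.
    """
    arr = values.to_numpy(dtype=float)
    labels = np.full(len(arr), np.nan)
    for i in range(WINDOW - 1, len(arr)):
        window = arr[i - WINDOW + 1 : i + 1]
        rank = 1 + int(np.sum(window < arr[i]))  # 1 = lowest value in the window
        labels[i] = float(rank <= TOP_RANKS)
    return pd.Series(labels, index=values.index)


# Synthetic series: 55 steadily wetter years, then a very dry final year.
years = np.arange(1970, 2025)
series = pd.Series(np.linspace(0.3, 0.8, len(years)), index=years)
series.iloc[-1] = 0.1
labels = rank_rule_label(series)
```

On a long table, the function can be applied per group, for example with `df.groupby(["cell", "quarter"])["swi_unif"].transform(rank_rule_label)` after sorting by year (column names are hypothetical).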


#### Prediction Task Summary

You will build and evaluate **binary classification models** using these different drought definitions as targets.  
Your predictors may include:
- **SWI (Safran)** values,
- **Soil clay content** (BRGM),
- **Geographic location** or grid cell ID,
- **Temporal features** (e.g., month, year, quarter).

This stepwise approach lets you explore how different drought definitions affect model performance and policy relevance.

## Project task, Part 3: Prediction of CatNat Recognition for Drought at National Scale

This is an **advanced, optional task** for students who wish to go further in the project.

The objective is to **predict which communes in France will be officially recognized as being in a state of drought** under the **CatNat (Catastrophes Naturelles)** regime, not just based on environmental indicators, but by reproducing the actual **decision process** of the interministerial commission.

#### 🎯 What needs to be predicted?

A binary outcome:
- `1` → Commune is recognized as eligible for drought compensation by CatNat in a given year,
- `0` → Commune is not recognized.

This goes beyond modeling just SWI or SWI uniforme, and includes **institutional, procedural, and possibly socio-political factors**.

#### Why is it hard?

Because CatNat recognition depends on:
- Environmental indicators (SWI uniforme thresholds, soil composition),
- Technical criteria (e.g., $\geq 3\%$ sensitive clay soils, damage declarations),
- Administrative requests and delays,
- Human/political variability (some communes never apply; others always do).

You will need to:
- Align multiple datasets across **space (communes)** and **time (years)**
- Reconstruct the **target variable**: recognized/not recognized per commune/year
- Engineer meaningful **predictors** (e.g., SWI, soil, past recognition, spatial neighbors…)

### Downloading the CatNat Drought Recognition Data

The dataset of **official drought recognition decrees (arr√™t√©s CatNat)** can be obtained from two sources:

1. **Directly from the CCR official portal:**

   👉 [https://www.ccr.fr/portail-catastrophes-naturelles/liste-arretes/?select_type=55](https://www.ccr.fr/portail-catastrophes-naturelles/liste-arretes/?select_type=55&year_type=&departement-select=&commune-select=#wp-block-themeisle-blocks-advanced-column-131151bd)

   - On this page, you can filter by **phenomenon = “Sécheresse”**,  
     then click **“Télécharger le tableau”** to export the list of all recognized communes and dates.  
   - The table includes information on the **date of the decree**, **département**, **commune**, and **type of disaster**.

2. **From the project materials:**

   A pre-downloaded version of the dataset is included in the **ZIP file** provided with the course presentation.  
   You can unzip it into your working directory (e.g. `data/catnat_drought/`) and use it directly in your analysis.

> These data represent the official recognition of drought events under the **CatNat (Catastrophes Naturelles)** system in France, as published by the **Caisse Centrale de Réassurance (CCR)**.

---

### Suggested Approach

- Construct a **panel dataset**: one row per (commune, year), with columns like:
  - SWI Uniforme minimum
  - Drought label (quantile / règle métier)
  - Soil clay %
  - Past recognition (lag variables)
  - Department or region-level indicators
- Build a **binary classifier** to estimate the probability of recognition
- Analyze errors: false positives (eligible but not recognized?) vs false negatives (recognized but not predicted)
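
The panel construction can be sketched as follows: build the full (commune, year) grid, mark recognized pairs, and add a one-year lag of past recognition. The commune codes, years, and column names (`insee`, `recognized`, `recognized_lag1`) are invented for illustration; the real target would be reconstructed from the decree list.

```python
import pandas as pd

# Hypothetical extract of recognition decrees: one row per recognized (commune, year).
recog = pd.DataFrame(
    {"insee": ["00001", "00001", "00002"], "year": [2018, 2020, 2019], "recognized": 1}
)

communes = ["00001", "00002"]
years = range(2018, 2021)

# Full panel: every commune crossed with every year.
panel = pd.MultiIndex.from_product(
    [communes, years], names=["insee", "year"]
).to_frame(index=False)

# Pairs absent from the decree list are treated as not recognized.
panel = panel.merge(recog, on=["insee", "year"], how="left")
panel["recognized"] = panel["recognized"].fillna(0).astype(int)

# Lag feature: was the commune recognized the previous year?
panel = panel.sort_values(["insee", "year"]).reset_index(drop=True)
panel["recognized_lag1"] = panel.groupby("insee")["recognized"].shift(1)
```

The first year of each commune has no lag value (`NaN`); those rows can be dropped or imputed depending on the model. Other predictors (SWI Uniforme minimum, clay share, drought labels) can be merged onto the same keys.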

> ⚠️ This task is optional but **highly valuable**. It simulates a real-world challenge faced by insurers, policy-makers, and disaster risk modelers.

## What Is Expected

Your final submission should include **both analytical and technical components**, combining a clear presentation of findings with a well-documented implementation.

### 1. Written Report

We expect a concise report, similar in style to this one:

👉 [France Assureurs – Master Class Sécheresse (Oct. 2023)](https://www.franceassureurs.fr/master-class-secheresse/pdf/20231017%20Master%20class%20Secheresse.pdf)

The report should:
- Summarize the goals, methods, and key findings of your analysis,
- Present your maps and results in a clean and readable way,
- Highlight policy or operational implications (e.g., regions at risk, model usefulness),
- Stay within a few well-organized pages (not a thesis).

### 2. Jupyter Notebook

You must submit a well-documented **Jupyter notebook** containing:
- All your data processing and cleaning steps,
- Exploratory data analysis (EDA) using `geopandas`,
- Aggregation and correlation studies (SWI vs SWI Uniforme),
- Model training, evaluation, and explanation,
- Visualizations and intermediate outputs.

> All code should be clean, reproducible, and commented clearly.

### 3. Heat Maps and Spatial Visualizations

You are expected to include:
- **Heat maps** showing spatial variation (e.g., SWI levels, drought classifications),
- Continuous gradients (not just binary maps),
- Maps at multiple scales: regional, départemental, or commune-level when relevant,
- Comparisons across time (e.g., drought evolution from 2010 to 2024).

### 4. Brief Model Explanation

If you use a predictive model (e.g., logistic regression, random forest, etc.), we expect:
- A **brief explanation** of the model you used,
- What variables you used as input,
- How you evaluated performance (e.g., accuracy, ROC AUC),
- Any limitations or insights from the model.
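
To make the two example metrics concrete, the sketch below computes accuracy at a 0.5 threshold and ROC AUC from predicted scores. The AUC uses the rank-sum (Mann-Whitney) formulation so that only `numpy` and `pandas` are needed; the labels and scores are toy values, and a library implementation can be used instead.

```python
import numpy as np
import pandas as pd


def roc_auc(y_true, scores) -> float:
    """ROC AUC via the rank-sum formula; tied scores receive average ranks."""
    y = np.asarray(y_true).astype(int)
    ranks = pd.Series(np.asarray(scores, dtype=float)).rank(method="average").to_numpy()
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")
    rank_sum_pos = ranks[y == 1].sum()
    return float((rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


# Toy labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
accuracy = sum(int(s >= 0.5) == t for s, t in zip(scores, y_true)) / len(y_true)
```

Accuracy depends on the chosen threshold, whereas AUC summarizes how well the scores rank positives above negatives across all thresholds.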

This is not a modeling competition: the goal is **interpretability and relevance**, not just performance.

> Your work should reflect both rigorous analysis and clarity of communication, with a focus on actionable insights into drought risk and soil-climate interactions.

### To Know More

Below are some resources to deepen your understanding of the **Soil Wetness Index (SWI)** and **SWI Uniforme**, and their role in natural catastrophe forecasting in France:

- 🌍 [SWI Soil Wetness Index – France (Google Earth Engine Datasets)](https://geodatafr.github.io/METEO_FRANCE/SWI_Soil_Wetness_Index/)  
  Overview and interactive tools for exploring SWI in France.

- 🛰️ [Evolution of the Soil Wetness Index (SWI) in France – A Google Earth Engine Study](https://medium.com/googledeveloperseurope/evolution-of-the-soil-wetness-index-swi-in-france-analysis-with-google-earth-engine-3b541583883b)  
  Article by G. Attard & J. Bardonnet explaining the use of Earth Engine to analyze SWI evolution.

- 📄 [A new approach for drought index adjustment to clay-shrinkage-induced subsidence in France (2024)](https://nhess.copernicus.org/articles/24/999/2024/nhess-24-999-2024.pdf)  
  Scientific article showing how SWI and SWI uniforme are used in natural catastrophe (CatNat) modeling.

- 📊 [Météo-France SWI Uniforme Dataset – Monthly Indicators for CatNat](https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=301&id_rubrique=40)  
  Official data product used for assessing drought risk in the CatNat framework.

- 🔍 [Open Data – Indice SWI Uniforme (data.gouv.fr)](https://www.data.gouv.fr/datasets/donnees-mensuelles-dindice-dhumidite-des-sols-pour-le-dispositif-catastrophes-naturelles/)  
  Collection of public datasets and reuse cases related to SWI Uniforme.

---

These resources provide both theoretical background and practical datasets to help you explore the links between soil moisture, clay content, and climate-related natural risks.