<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction

Describe the challenge, in particular:

- Where does the data come from?
- What is the task this challenge aims to solve?
- Why does it matter?

### Predicting Unemployment in France by Year and Department Using Machine Learning

Where does the data come from?\
The data used in this challenge comes from various datasets provided by INSEE (the French National Institute of Statistics and Economic Studies). The primary dataset is "Population active et chômage", which provides information on employment status by year, age and department. Additional datasets are integrated to enrich the analysis:

- `DS_RP_EMPLOI_LR_PRINC.csv`: employment data categorized by year, department, and gender.
- `DS_DIPLOMES.csv`: number of graduates per diploma level, categorized by year, department, and gender.
- `DS_SRCV_SATISFACTION.csv`: population satisfaction levels (low, medium, high) over time.

By merging these datasets, we construct a more comprehensive feature set to model unemployment trends.
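
As a minimal sketch of what such a merge could look like, the snippet below joins three tiny, made-up tables with pandas. The key columns `GEO`, `TIME_PERIOD` and `SEX` match the column names visible in the training data; the value columns `employed` and `graduates`, the numbers themselves, and the assumption that satisfaction is available as one value per year are all hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the INSEE files (made-up values).
employment = pd.DataFrame({
    "GEO": ["01", "01", "02"],
    "TIME_PERIOD": [2010, 2010, 2010],
    "SEX": ["F", "M", "F"],
    "employed": [1200.0, 1350.0, 980.0],
})
diplomas = pd.DataFrame({
    "GEO": ["01", "01", "02"],
    "TIME_PERIOD": [2010, 2010, 2010],
    "SEX": ["F", "M", "F"],
    "graduates": [300.0, 280.0, 150.0],
})
# Assumption: satisfaction is a single value per year, not per department.
satisfaction = pd.DataFrame({"TIME_PERIOD": [2010], "SATISF__T": [7.2]})

keys = ["GEO", "TIME_PERIOD", "SEX"]

# validate= raises an error if the keys are not unique as expected,
# which guards against silently duplicated rows after the join.
merged = employment.merge(diplomas, on=keys, how="left", validate="one_to_one")
merged = merged.merge(satisfaction, on="TIME_PERIOD", how="left",
                      validate="many_to_one")
print(merged)
```

Using `how="left"` keeps every employment row even when a department or year is missing from another table, so gaps surface as missing values rather than dropped rows.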

What is the task this challenge aims to solve?\
The objective of this challenge is to build a machine learning model capable of predicting the number of unemployed individuals aged between 15 and 24 in France at the departmental level for each year between 2010 and 2021. The prediction is based on demographic, educational, and satisfaction-related variables.

The task involves:

- Data Preprocessing & Feature Engineering: merging and cleaning datasets to extract meaningful variables.
- Model Training & Evaluation: training regression models to estimate unemployment figures.
- Cross-Validation & Optimization: ensuring robustness and generalization of predictions.

Why does it matter?\
Understanding and predicting unemployment trends is crucial for economic planning, policymaking, and social programs. By analyzing factors such as education levels and population satisfaction, this project can help:

- Government agencies anticipate and mitigate unemployment trends.
- Local policymakers implement targeted economic policies.
- Researchers explore socioeconomic factors influencing employment.

Ultimately, this challenge contributes to improving labor market analysis and decision-making in France.

# Exploratory data analysis

The goal of this section is to show what's in the data, and how to play with it.
This is the first step in any data science project, and here, you should give a sense of the data the participants will be working with.

You can first load and describe the data, and then show some interesting properties of it.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

# Load the data

import problem
X_df, y = problem.get_train_data()

In [2]:
X_df

| Unnamed: 0 | GEO | GEO_OBJECT | FREQ | SEX | AGE | EDUC | RP_MEASURE | TIME_PERIOD | EDUC_001T003_RP | EDUC_001T100_RP | EDUC_001T200_RP | EDUC_100_RP | EDUC_200_RP | EDUC_300_RP | EDUC_350T351_RP | EDUC_500T702_RP | EDUC_500_RP | EDUC_600T702_RP | EDUC_600_RP | EDUC_700_RP | EDUC__T | SATISF__T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 145 | 02 | DEP | A | F | Y15T24 | `_T` | POP | 2010 | 49976.17430 |  |  | 33568.56099 | 15874.15810 | 42873.64589 | 29283.81368 |  | 22534.85925 | 12362.99954 |  |  | 206474.21175 | 7.2 |
| 9 | 03 | DEP | A | M | Y15T24 | `_T` | POP | 2010 | 20749.49021 |  |  | 16163.87863 | 7045.95623 | 43466.53015 | 18618.45586 |  | 10592.42694 | 9223.68089 |  |  | 125860.41892 | 7.2 |
| 375 | 01 | DEP | A | F | Y15T24 | `_T` | POP | 2015 |  |  | 73830.28856 |  |  | 51745.09726 | 41673.89834 | 66799.67071 |  |  |  |  | 234048.95487 | 7.8 |
| 523 | 53 | DEP | A | F | Y15T24 | `_T` | POP | 2021 |  | 31966.82949 |  |  | 7104.22376 | 27683.48431 | 20298.18102 |  | 12999.66096 |  | 10317.04480 | 6280.70884 | 116650.13317 |  |
| 188 | 01 | DEP | A | F | Y15T24 | `_T` | POP | 2010 | 37315.94512 |  |  | 28646.59286 | 14321.36989 | 45803.37461 | 37965.51273 |  | 31549.99383 | 23191.70266 |  |  | 218794.49171 | 7.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 60 | DEP | A | M | Y15T24 | `_T` | POP | 2010 | 58432.62272 |  |  | 23592.99829 | 14682.45282 | 85111.56052 | 41940.05572 |  | 27667.90000 | 28229.23963 |  |  | 279656.82970 | 7.2 |
| 106 | 86 | DEP | A | M | Y15T24 | `_T` | POP | 2010 | 24443.11543 |  |  | 16552.80413 | 7020.70198 | 45928.74595 | 22602.95364 |  | 14573.47126 | 16398.62483 |  |  | 147520.41722 | 7.2 |
| 270 | 971 | DEP | A | F | Y15T24 | `_T` | POP | 2015 |  |  | 68983.01906 |  |  | 27203.35725 | 27960.13977 | 33385.32355 |  |  |  |  | 157531.83963 | 7.8 |
| 435 | 91 | DEP | A | M | Y15T24 | `_T` | POP | 2021 |  | 77440.90041 |  |  | 22022.29829 | 97856.34333 | 78865.76250 |  | 53464.47456 |  | 39685.89263 | 71384.41766 | 440720.08938 |  |
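
In the preview above, different `EDUC_*` columns are filled depending on the year: for example, `EDUC_001T003_RP` has values only in the 2010 rows shown. A quick way to check whether this holds across the whole table is to compute the share of non-missing values per column and per year. The sketch below does this on a small toy frame that mimics the preview (values rounded); applying the same expression to `X_df` is left as an assumption about how the real data behaves.

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the missing-value pattern seen in the preview.
toy = pd.DataFrame({
    "TIME_PERIOD": [2010, 2010, 2015, 2021],
    "EDUC_001T003_RP": [49976.2, 20749.5, np.nan, np.nan],
    "EDUC_001T200_RP": [np.nan, np.nan, 73830.3, np.nan],
    "EDUC_001T100_RP": [np.nan, np.nan, np.nan, 31966.8],
    "EDUC__T": [206474.2, 125860.4, 234049.0, 116650.1],
})

# Fraction of non-missing values for each column, within each year.
coverage = (
    toy.drop(columns="TIME_PERIOD")
    .notna()
    .groupby(toy["TIME_PERIOD"])
    .mean()
)
print(coverage)
```

A coverage of 0 for a given year means the column carries no information for that year, which matters when choosing an imputation strategy.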


# Challenge evaluation

A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric that will be used to evaluate the participants' submissions, as well as your evaluation strategy, in particular if there is some complexity in the way the data should be split to ensure valid results.
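
The cross-validation cells later in this notebook score models with `neg_median_absolute_error`. Assuming this is also the metric used for the challenge (the text does not state it explicitly), the short sketch below shows how it is computed on made-up predictions.

```python
import numpy as np
from sklearn.metrics import median_absolute_error

# Made-up targets and predictions, for illustration only.
y_true = np.array([100.0, 250.0, 400.0, 1000.0])
y_pred = np.array([110.0, 240.0, 460.0, 700.0])

# Absolute errors are 10, 10, 60 and 300; their median is (10 + 60) / 2.
score = median_absolute_error(y_true, y_pred)
print(score)  # 35.0

# Same value computed by hand.
manual = np.median(np.abs(y_true - y_pred))
```

Because the median ignores the size of the largest errors, one badly predicted department (the error of 300 here) barely moves the score. scikit-learn's `neg_median_absolute_error` scorer returns the negated value so that higher is better, which is why the cells below print `-scores`.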

# Submission format

Here, you should describe the submission format. This is the format the participants should follow to submit their predictions on the RAMP platform.

This section also shows how to use the `ramp-workflow` library to test the submission locally.

## The pipeline workflow

The input data are stored in a dataframe. To go from a dataframe to a numpy array, we use a scikit-learn column transformer. This first example selects a subset of columns, one-hot encodes the categorical ones, imputes missing values, and fits a linear regression.

In [5]:
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

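# Columns used by the starting kit (kept for reference; not used below)
cols = ['GEO', 'SEX', 'TIME_PERIOD', 'EMPSTA_ENQ_1', 'EMPSTA_ENQ_31', 'EMPSTA_ENQ_33',
       'EMPSTA_ENQ_35', 'EMPSTA_ENQ_36']

# Categorical columns are one-hot encoded; numerical columns are passed through
categorical_cols = ['GEO', 'SEX', 'TIME_PERIOD']
numerical_cols = ['EMPSTA_ENQ_1', 'EMPSTA_ENQ_31', 'EMPSTA_ENQ_33', 'EMPSTA_ENQ_35', 'EMPSTA_ENQ_36']

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('passthrough', numerical_cols)
)

def get_estimator():
    # Encode, fill missing values with each column's most frequent value,
    # then fit a linear regression
    pipe = make_pipeline(
        transformer,
        SimpleImputer(strategy='most_frequent'),
        LinearRegression()
    )

    return pipe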
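
The estimator above passes the numerical columns through unchanged and imputes after encoding. One possible variation, sketched here on toy data, imputes and scales inside the numerical branch so that the encoded categories are left untouched. The column names `num_a` and `num_b`, the toy values, and the choice of median imputation are assumptions made for the illustration, not part of the starting kit.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with missing numerical values (hypothetical columns).
toy_X = pd.DataFrame({
    "GEO": ["01", "02", "01", "03", "02", "03"],
    "SEX": ["F", "M", "M", "F", "F", "M"],
    "num_a": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "num_b": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0],
})
toy_y = np.array([5.0, 7.0, 9.0, 11.0, 13.0, 15.0])

# Numerical branch: fill gaps with the column median, then standardize.
numeric_branch = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["GEO", "SEX"]),
    (numeric_branch, ["num_a", "num_b"]),
)

model = make_pipeline(preprocess, LinearRegression())
model.fit(toy_X, toy_y)
preds = model.predict(toy_X)
print(preds.shape)
```

Keeping imputation in the numerical branch makes each preprocessing step apply only to the columns it was designed for, which tends to be easier to reason about when more features are added.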


In [4]:
from skrub import tabular_learner
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tabular_learner('regressor'), X_df, y, scoring='neg_median_absolute_error')
print(-scores)

[206.17068348 233.20006134 273.51356183 328.795814   335.56495011]


## Testing using a scikit-learn pipeline

In [3]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(get_estimator(), X_df, y, cv=3, scoring='neg_median_absolute_error')
print(-scores)

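NameError: name 'get_estimator' is not defined

This error appears because this cell was executed (`In [3]`) before the cell that defines `get_estimator` (`In [5]`). Run the estimator cell first, then re-run this one.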

## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).