<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction

<p style="text-align: justify;">
    At the start of the year, the Court of Auditors published its report on the evaluation of the Skills Investment Plan, implemented during Emmanuel Macron’s first term from 2017 to 2022. This plan allocated 15 billion euros over five years to tackle youth and long-term unemployment. However, the report highlights an uneven distribution of funds across regions and years, with some areas receiving surplus funding while others faced shortages. <br>
</p>
<p style="text-align: justify;">
    Our project proposes <b>a solution for reallocating resources annually based on predicted funding needs at the departmental level.</b> At the end of each year N-1, we aim to forecast the number of young job seekers for year N. This forecast would help the State assess regional needs for training funding, which should scale with the number of unemployed individuals. It would in turn help optimize the allocation of financial resources between regions and improve the efficiency of future public policies. <br>
</p>
<p style="text-align: justify;">
    For this project, we focus on young job seekers at the departmental level. Our target variable is therefore <b>"the number of job seekers under 25 years old in department D for year N".</b> To build this prediction, we rely on the Workforce Needs Survey conducted by France Travail in the last quarter of each year. This survey collects data from 2 million private-sector companies, asking about their expected job creations for the coming year and the challenges they anticipate in filling these positions (such as skill shortages or job difficulty). In addition, we integrate other key indicators from year N-1, including the number of job postings, the number of completed training programs, and control variables such as the department's population.
</p>
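
The alignment between year N-1 indicators and the year N target can be illustrated with a small pandas sketch. This is not necessarily how the kit's data was assembled: the values are invented, and `young_job_seekers` and the `_prev` suffix are hypothetical names chosen for the example (only `GEO`, `TIME_PERIOD` and `job_offer` follow the kit's column naming).

```python
import pandas as pd

# Toy data with made-up values; "young_job_seekers" is a hypothetical
# column name standing in for the target.
raw = pd.DataFrame({
    "GEO": ["01", "01", "01", "02", "02", "02"],
    "TIME_PERIOD": [2015, 2016, 2017, 2015, 2016, 2017],
    "job_offer": [100, 120, 90, 300, 310, 280],
    "young_job_seekers": [50, 55, 60, 150, 140, 145],
})

raw = raw.sort_values(["GEO", "TIME_PERIOD"])

# Shift each indicator by one year within each department, so that the
# row for year N carries the value observed in year N-1.
lagged = raw.groupby("GEO")[["job_offer", "young_job_seekers"]].shift(1)
lagged.columns = [f"{c}_prev" for c in lagged.columns]

dataset = pd.concat(
    [raw[["GEO", "TIME_PERIOD", "young_job_seekers"]], lagged], axis=1
)
# The first year of each department has no N-1 value and is dropped.
dataset = dataset.dropna().reset_index(drop=True)
print(dataset)
```

Each remaining row pairs the year N target with indicators from year N-1 for the same department, which is the setup described above.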



## Dataset Description

This dataset is designed to predict youth unemployment (ages 15-24) in France using indicators from the French labor market. The data comes from *France Travail*, specifically from its *Statistiques et Analyses* section, and covers the period from 2015 to 2023.

### The dataset includes the following key features:

- Year
- Department
- Number of workforce needs declared by companies, indicating recruitment demand across different sectors for this year and department.
- Recruitment difficulty index (0-100%), showing the percentage of difficulty companies face when hiring for this year and by department.
- Number of unemployed youth (15-24 years old) in France for the **previous** year and by department.
- Number of training programs offered for job seekers for the previous year and by department, providing insights into workforce skill development.
- Number of job offers available for the previous year and by department, reflecting labor market demand.
- Number of people entering and exiting the unemployment lists in France for the previous year and by department, providing a dynamic view of job market inflows and outflows.
- The size of the population for the previous year and by department (*Insee*).

# Exploratory data analysis

The goal of this section is to show what's in the data, and how to play with it.
This is the first step in any data science project, and here, you should give a sense of the data the participants will be working with.

You can first load and describe the data, and then show some interesting properties of it.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

# Load the data
import problem
X_df, y = problem.get_train_data()

In [10]:
X_df.GEO.unique()

array(['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11',
       '12', '13', '14', '15', '16', '17', '18', '19', '21', '22', '23',
       '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
       '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45',
       '46', '47', '49', '50', '51', '52', '53', '54', '55', '56', '57',
       '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68',
       '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79',
       '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90',
       '91', '92', '93', '94', '95'], dtype=object)

In [3]:
y.head()

0     7440.833333
1    11160.833333
2     5078.333333
3     2411.666667
4     1920.833333
Name: OBS_VALUE, dtype: float64

In [4]:
len(y)

744

In [5]:
print(sorted(X_df.TIME_PERIOD.unique()))

[2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]


# Challenge evaluation

A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric that will be used to evaluate the participants' submissions, as well as your evaluation strategy, in particular if there is some complexity in the way the data should be split to ensure valid results.
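
As an illustration only: the cells later in this notebook score models with scikit-learn's `neg_median_absolute_error`, so the sketch below assumes the median absolute error is the metric of interest. It also shows one possible year-aware split, where the most recent year is held out so that a model is never validated on a year it was trained on. The helper names `median_absolute_error` and `last_year_split` and all numbers are hypothetical, not part of the kit.

```python
import numpy as np
import pandas as pd


def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between targets and predictions."""
    return float(np.median(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def last_year_split(years):
    """Return (train_idx, valid_idx) with the latest year held out."""
    years = np.asarray(years)
    last = years.max()
    return np.flatnonzero(years < last), np.flatnonzero(years == last)


# Toy example with made-up numbers.
years = pd.Series([2020, 2020, 2021, 2021, 2022, 2022])
y_true = np.array([100.0, 200.0, 110.0, 210.0, 120.0, 220.0])

train_idx, valid_idx = last_year_split(years)

# Naive baseline: predict the training mean for every held-out row.
y_pred = np.full(len(valid_idx), y_true[train_idx].mean())
score = median_absolute_error(y_true[valid_idx], y_pred)
print(score)
```

Because the median ignores the largest errors, this metric is less sensitive than the mean absolute error to a few very populous departments with large absolute deviations.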

# Submission format

Here, you should describe the submission format. This is the format the participants should follow to submit their predictions on the RAMP platform.

This section also shows how to use the `ramp-workflow` library to test the submission locally.

## The pipeline workflow

The input data are stored in a dataframe. To go from a dataframe to a numpy array, we use a scikit-learn column transformer. The first example simply selects the subset of columns we want to work with.
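
A minimal sketch of such a column selection, on a toy dataframe with made-up values (the column names follow the kit, the choice of subset is arbitrary):

```python
import pandas as pd
from sklearn.compose import make_column_transformer

# Toy dataframe with invented values.
toy = pd.DataFrame({
    "GEO": ["01", "02", "03"],
    "job_offer": [100.0, 300.0, 50.0],
    "population": [650000.0, 530000.0, 335000.0],
    "need_for_manpower": [10.0, 20.0, 30.0],
})

# Keep two numerical columns unchanged and drop everything else.
selector = make_column_transformer(
    ("passthrough", ["job_offer", "population"]),
    remainder="drop",
)
X = selector.fit_transform(toy)
print(X.shape)
```

The result is a plain numpy array whose columns follow the order given to the transformer.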

In [11]:
X_df.columns

Index(['TIME_PERIOD', 'GEO', 'number_courses', 'job_offer',
       'need_for_manpower', 'difficult_recruitment', 'out_of_list',
       'entry_on_list', 'population'],
      dtype='object')

In [12]:
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

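# All columns available in X_df (kept for reference).
cols = ['TIME_PERIOD', 'GEO', 'number_courses', 'job_offer',
        'need_for_manpower', 'difficult_recruitment', 'out_of_list',
        'entry_on_list', 'population']

# Department and year are treated as categories; the rest are numerical.
categorical_cols = ['GEO', 'TIME_PERIOD']
numerical_cols = ['number_courses', 'job_offer',
                  'need_for_manpower', 'difficult_recruitment', 'out_of_list',
                  'entry_on_list', 'population']

# One-hot encode the categorical columns (categories unseen during fit are
# encoded as all zeros) and pass the numerical columns through unchanged.
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('passthrough', numerical_cols)
)

def get_estimator():
    # Encode, fill missing values with the most frequent value of each
    # column, then fit an ordinary least-squares regression.
    pipe = make_pipeline(
        transformer,
        SimpleImputer(strategy='most_frequent'),
        LinearRegression()
    )

    return pipe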

In [13]:
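from skrub import tabular_learner
from sklearn.model_selection import cross_val_score
# Baseline: skrub's default tabular pipeline for regression, scored with
# scikit-learn's default 5-fold cross-validation.
# Scores are negated because 'neg_*' scorers return negative errors.
scores = cross_val_score(tabular_learner('regressor'), X_df, y, scoring='neg_median_absolute_error')
print(-scores)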

[362.25196214 261.46333993 207.06763228 254.56361181 498.42648144]


## Testing using a scikit-learn pipeline

In [14]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(get_estimator(), X_df, y, cv=3, scoring='neg_median_absolute_error')
print(-scores)

[ 834.98732666  984.5303425  2197.00422415]


In [15]:
# Repeat the 3-fold cross-validation, this time on the held-out test set.
X_df_test, y_test = problem.get_test_data()
scores = cross_val_score(get_estimator(), X_df_test, y_test, cv=3, scoring='neg_median_absolute_error')
print(-scores)

[ 629.32549198  814.16355456 1181.18506594]


## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).