<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction

Describe the challenge, in particular:

- Where does the data come from?
- What is the task this challenge aims to solve?
- Why does it matter?

## Dataset Description: Youth Unemployment Prediction in France
This dataset is designed to predict youth unemployment (ages 15-24) in France using various socioeconomic indicators. The data comes from France Travail, specifically from the Statistiques et Analyses section, covering the period from 1996 to 2024.

### The dataset includes the following key features:

- Number of unemployed youth (15-24 years old) in France by year and department.
- Number of training programs offered for job seekers by year and department, providing insights into workforce skill development.
- Number of job offers available each year, reflecting labor market demand.
- Number of workforce needs declared by companies, indicating recruitment demand across different sectors.
- Recruitment difficulty index (0-100%), indicating how difficult companies find it to hire, expressed as a percentage.
- Number of people entering and exiting the unemployment lists in France, providing a dynamic view of job market inflows and outflows.

### Why Predict Youth Unemployment?
Predicting youth unemployment is crucial for policymakers, businesses, and educational institutions. High youth unemployment can lead to long-term economic and social consequences, such as increased poverty, social exclusion, and reduced economic growth. By forecasting unemployment trends, authorities can implement targeted policies, such as improving training programs, adapting job market strategies, and addressing skill mismatches. This proactive approach helps ensure a smoother transition for young people into the workforce, fostering economic stability and social cohesion.

# Exploratory data analysis

The goal of this section is to show what's in the data, and how to play with it.
This is the first step in any data science project, and here, you should give a sense of the data the participants will be working with.

You can first load and describe the data, and then show some interesting properties of it.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

# Load the data

import problem
X_df, y = problem.get_train_data()

In [2]:
X_df

| Unnamed: 0 | TIME_PERIOD | GEO | number_courses | job_offer | need_for_manpower | difficult_recruitment | out_of_list | entry_on_list |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 363 | 2018 | 87 | 354.166667 | 3652.5 | 9870.0 | 49.0 | 2072900 | 2077300 |
| 250 | 2017 | 67 | 842.500000 | 14082.5 | 30950.0 | 40.0 | 2101700 | 2160900 |
| 465 | 2020 | 01 | 634.166667 | 5077.5 | 20940.0 | 59.0 | 1874400 | 1974600 |
| 346 | 2018 | 70 | 233.333333 | 1592.5 | 5330.0 | 37.0 | 2072900 | 2077300 |
| 585 | 2021 | 29 | 1427.500000 | 11745.0 | 37760.0 | 54.0 | 2118300 | 1994800 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 2015 | 74 | 558.333333 | 9055.0 | 24125.0 | 33.0 | 1904800 | 1999900 |
| 106 | 2016 | 14 | 881.666667 | 6930.0 | 17520.0 | 32.0 | 2160300 | 2172000 |
| 270 | 2017 | 87 | 332.500000 | 3087.5 | 7830.0 | 37.0 | 2101700 | 2160900 |
| 435 | 2019 | 66 | 660.833333 | 4787.5 | 26150.0 | 33.0 | 2139700 | 2089700 |
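One property visible in the preview is that `out_of_list` and `entry_on_list` repeat across departments within the same year (for example, both 2018 rows share the same values). The sketch below rebuilds a few columns of the nine preview rows by hand and checks this. The variable names are illustrative, and reading these two columns as national-level figures is an assumption, not something the data source confirms here.

```python
import pandas as pd

# Selected columns of the nine preview rows, copied by hand from the table above.
preview = pd.DataFrame({
    "TIME_PERIOD": [2018, 2017, 2020, 2018, 2021, 2015, 2016, 2017, 2019],
    "GEO": ["87", "67", "01", "70", "29", "74", "14", "87", "66"],
    "out_of_list": [2072900, 2101700, 1874400, 2072900, 2118300,
                    1904800, 2160300, 2101700, 2139700],
    "entry_on_list": [2077300, 2160900, 1974600, 2077300, 1994800,
                      1999900, 2172000, 2160900, 2089700],
})

# Count the distinct flow values observed per year: a count of 1 means every
# department in the preview carries the same figure for that year.
distinct_per_year = (
    preview.groupby("TIME_PERIOD")[["out_of_list", "entry_on_list"]].nunique()
)
print(distinct_per_year)

# Net inflow per year (entries minus exits), using one row per year.
yearly = preview.drop_duplicates("TIME_PERIOD").set_index("TIME_PERIOD").sort_index()
net_inflow = yearly["entry_on_list"] - yearly["out_of_list"]
print(net_inflow)
```

If the same pattern holds on the full training set, these two columns carry no department-level information and behave like year-level features.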


# Challenge evaluation

A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric that will be used to evaluate the participants' submissions, as well as your evaluation strategy, in particular if there is some complexity in the way the data should be split to ensure valid results.
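The cross-validation cells later in this notebook score models with scikit-learn's `neg_median_absolute_error`. Assuming the challenge is evaluated with the same metric, the following minimal sketch shows what it computes, using only the standard library. The function name and the toy values are illustrative and are not taken from the challenge data.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return median([abs(t - p) for t, p in zip(y_true, y_pred)])


# Toy values for illustration only (not real unemployment counts).
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 400]

# Absolute errors are 10, 10, 30, 0; their median is 10.0.
print(median_absolute_error(y_true, y_pred))  # → 10.0
```

scikit-learn scorers follow a "higher is better" convention, so the `neg_` variant returns the negated error. This is why the cells below print `-scores`.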

# Submission format

Here, you should describe the submission format. This is the format the participants should follow to submit their predictions on the RAMP platform.

This section also shows how to use the `ramp-workflow` library to test the submission locally.

## The pipeline workflow

The input data are stored in a dataframe. To go from a dataframe to a numpy array, we will use a scikit-learn column transformer. The first example simply selects the subset of columns we want to work with.

In [5]:
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# NOTE: these column names must exist in X_df. The preview of X_df above shows
# different columns (e.g. 'number_courses', 'job_offer'), so adapt the two
# lists below to the data actually returned by problem.get_train_data().
categorical_cols = ['GEO', 'SEX', 'TIME_PERIOD']
numerical_cols = ['EMPSTA_ENQ_1', 'EMPSTA_ENQ_31', 'EMPSTA_ENQ_33', 'EMPSTA_ENQ_35', 'EMPSTA_ENQ_36']

# One-hot encode the categorical columns; pass the numerical ones through unchanged.
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('passthrough', numerical_cols)
)

def get_estimator():
    """Return the pipeline: column transformer, imputation, linear regression."""
    pipe = make_pipeline(
        transformer,
        # Fill remaining missing values with the most frequent value of each feature.
        SimpleImputer(strategy='most_frequent'),
        LinearRegression()
    )

    return pipe
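As a hedged illustration of the same workflow on the columns shown in the data preview, the sketch below builds a column transformer and a linear model, then fits them on a toy frame made of the first five preview rows. The name `get_estimator_sketch`, the choice to treat `GEO` as categorical and `TIME_PERIOD` as numerical, the median imputation, and the target values are all assumptions made for illustration. This is not the starting-kit estimator.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Column names as they appear in the data preview (assumed to match X_df).
categorical_cols = ["GEO"]
numerical_cols = [
    "TIME_PERIOD", "number_courses", "job_offer", "need_for_manpower",
    "difficult_recruitment", "out_of_list", "entry_on_list",
]


def get_estimator_sketch():
    """Hypothetical variant of get_estimator using the preview's columns."""
    transformer = make_column_transformer(
        # Department codes become one-hot indicators; unseen codes are ignored.
        (OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Missing numerical values are replaced by the column median.
        (SimpleImputer(strategy="median"), numerical_cols),
    )
    return make_pipeline(transformer, LinearRegression())


# Toy frame: the first five preview rows, copied by hand.
toy_X = pd.DataFrame({
    "TIME_PERIOD": [2018, 2017, 2020, 2018, 2021],
    "GEO": ["87", "67", "01", "70", "29"],
    "number_courses": [354.166667, 842.5, 634.166667, 233.333333, 1427.5],
    "job_offer": [3652.5, 14082.5, 5077.5, 1592.5, 11745.0],
    "need_for_manpower": [9870.0, 30950.0, 20940.0, 5330.0, 37760.0],
    "difficult_recruitment": [49.0, 40.0, 59.0, 37.0, 54.0],
    "out_of_list": [2072900, 2101700, 1874400, 2072900, 2118300],
    "entry_on_list": [2077300, 2160900, 1974600, 2077300, 1994800],
})
# PLACEHOLDER targets, invented for this sketch: NOT real unemployment counts.
toy_y = [100.0, 200.0, 300.0, 400.0, 500.0]

model = get_estimator_sketch().fit(toy_X, toy_y)
predictions = model.predict(toy_X)
print(predictions.shape)
```

Building the transformer inside the function keeps each call independent, so cross-validation receives a fresh, unfitted pipeline every time.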


In [3]:
from skrub import tabular_learner
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tabular_learner('regressor'), X_df, y, scoring='neg_median_absolute_error')
print(-scores)

[538.3485782  692.07410525 566.72776386 568.07751541 549.95466346]


## Testing using a scikit-learn pipeline

In [2]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(get_estimator(), X_df, y, cv=3, scoring='neg_median_absolute_error')
print(-scores)

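NameError: name 'get_estimator' is not defined

This error appears when the cell defining `get_estimator` has not been run in the current session. Run that cell first, then re-run this one.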

## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).