## Legend

🎯 Objective    
❓ Question  
📝 Task  
☑️ Instructions  
💡 Informations  
💾 Submit your results  

**variable name**  
*field name*  
`python object`

# Apprentissage automatique (Machine Learning) supervisé et non supervisé

This exam contains **9** sections.  

For each one a time indication is given for information but you may choose your own pace.

To pass the exam you need to validate at least **5** sections.  

Some sections can be validated independently even if the exam use the same data for all of them.  

## Description (10 min)

🎯 **Your objective is to create a model that predicts the popularity of a song based on its characteristics**

To achieve this, you are given a dataset containing a list of songs with the following characteristics:

**acousticness**: whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of a track  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

☑️ Only fine-tune your model when explicitly asked to do so, in section *7 - Fine-tuning*  

## 1 - Data Cleaning (15 min)

*C5 - Préparer les données en vue de l'apprentissage afin que celles-ci soient nettoyées*

🎯 **Load and clean the data**

📝 Load the data in **df**, a `DataFrame`  
☑️ The data file is available at this url: https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv

In [11]:
import pandas as pd

df = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv')

In [13]:
df.head(2)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon


📝 Clean the data, make sure that no duplicates nor missing values remain in **df**

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52317 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52317 non-null  float64
 1   danceability      52317 non-null  float64
 2   duration_ms       52317 non-null  int64  
 3   energy            52317 non-null  float64
 4   explicit          52317 non-null  int64  
 5   id                52317 non-null  object 
 6   instrumentalness  52317 non-null  float64
 7   key               52317 non-null  int64  
 8   liveness          52317 non-null  float64
 9   loudness          52317 non-null  float64
 10  mode              52317 non-null  int64  
 11  name              52317 non-null  object 
 12  popularity        52317 non-null  int64  
 13  release_date      52317 non-null  object 
 14  speechiness       52317 non-null  float64
 15  tempo             52317 non-null  float64
 16  valence           52317 non-null  float6

In [15]:
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

In [16]:
df.shape

(52053, 18)

In [17]:
df['key'].value_counts()
df['mode'].value_counts()

1    36732
0    15321
Name: mode, dtype: int64

In [18]:
df[['instrumentalness','liveness','loudness','popularity','speechiness','tempo','valence']].describe()

Unnamed: 0,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence
count,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0
mean,0.195664,0.211833,-11.745365,25.815188,0.106189,117.077248,0.524738
std,0.333686,0.180351,5.696061,21.864219,0.182825,30.266286,0.263819
min,0.0,0.0,-60.0,0.0,0.0,0.0,0.0
25%,0.0,0.0995,-14.913,1.0,0.0352,94.004,0.312
50%,0.000469,0.139,-10.836,26.0,0.0457,115.939,0.538
75%,0.24,0.273,-7.478,42.0,0.0768,135.114,0.742
max,1.0,0.999,3.744,96.0,0.97,243.507,1.0


💾 **Run the following cell to save your results**

In [19]:
from nbresult import ChallengeResult

result = ChallengeResult("data_cleaning", shape=df.shape)
result.write()

## 2 - Supervised Learning (40 min)

*C9 - Entraîner un modèle d'apprentissage supervisé pour optimiser une fonction de prédiction à partir d'exemples annotées*

🎯 **Identify your metrics, compute your baseline and evaluate a basic model**

📝 Choose an appropriate scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) from `sklearn` for this challenge, the chosen metric must:    

☑️ strongly penalize largest errors relatively to smaller ones  
☑️ measure errors in the same unit as the target `popularity`  
☑️ the greater, the better (metric_good_model > metric_bad_model)  

📝 Store in **scoring** its exact name as `string`

In [82]:
scoring = 'neg_root_mean_squared_error'

📝 Define your features and target   

☑️ Assign to **X_simple** a `DataFrame` containing only numerical features  
☑️ Assign to **y** a `Series` containing only your target: *popularity*  

In [66]:
X_simple = df.select_dtypes(include=['float64', 'int64']).drop(columns=['popularity'])
X_simple.columns

Index(['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'valence'],
      dtype='object')

In [27]:
y = df['popularity']
y

0        40
1        22
2        40
3         1
4        15
         ..
52312    25
52313    43
52314     0
52315     0
52316    40
Name: popularity, Length: 52053, dtype: int64

📝 Compute your baseline and store it in **baseline_score**, as a `float`  
☑️ Do so by simulating a constant prediction equivalent to the mean value of your target  
☑️ Use the same scoring function as the one stored in **scoring**  
☑️ You may have to code the scoring function yourself to use it outside a `sklearn` workflow  

In [84]:
y_pred_base = y.mean()
errors_base = np.array([(y_true - y_pred_base) for y_true in y])
squared_errors_base = errors_base*errors_base
mse_base = squared_errors_base.mean()
baseline_score = -float(np.sqrt(mse_base))
baseline_score

-21.86400900273424

### Holdout evaluation

📝 Split your data, holding out 50% of observations, randomly sampled, as test set  
☑️  Assign the result of your holdout to **X_train_simple** **y_train**, **X_test_simple**, **y_test**

In [106]:
from sklearn.model_selection import train_test_split

X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)

📝 Fit and evaluate the most basic linear model you can find in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module  
☑️ Use the metric you stored in **scoring**    
☑️ Store in **score_simple_holdout** your model score

In [130]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

basic_linear_model = LinearRegression()

In [85]:
# See below as I cross_validate before fitting

### Cross-validation evaluation

📝 Cross-validate your basic model  
☑️ Use 5 folds for your cross-validation  
☑️ Store your mean score in **score_simple_cv_mean** as a `float`  
☑️ Store the standard deviation of your scores in **score_simple_cv_std** as a `float`

In [131]:
from sklearn.model_selection import cross_val_score

score_simple_cv = cross_val_score(basic_linear_model, X_train_simple, y_train, cv=5, scoring=scoring)
score_simple_cv_mean = score_simple_cv.mean()
score_simple_cv_mean = float(score_simple_cv_mean)
score_simple_cv_mean

-18.367474400641676

In [132]:
score_simple_cv_std = np.std(score_simple_cv)
score_simple_cv_std

0.10481798803238368

In [133]:
model_simple = basic_linear_model.fit(X_train_simple, y_train)
y_pred = model_simple.predict(X_test_simple)

In [134]:
from sklearn.metrics import mean_squared_error
score_simple_holdout = float(-mean_squared_error(y_test, y_pred, squared=False))
score_simple_holdout

-18.36058726905631

☑️ From now on, you will stop using your train-test split    
☑️ Instead we expect you to cross-validate (5 folds) your results with the whole dataset

💾 **Run the following cell to save your results**

In [135]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "supervised_learning",
    scoring=scoring,
    baseline_score=baseline_score,
    model=model_simple,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
)
result.write()

## 3 - Feature engineering (20 min)

*C7 - Générer des données d'entrée afin de satisfaire les contraintes inhérentes au modèles (Feature Engineering)*

🎯 **Create a new feature by extracting information from existing features**

Let's try to improve performance using the feature *release_date*

📝 Create a `DataFrame` **X_engineered** by adding a new column *year* to **X_simple**  
☑️ *year* must contain the release year of the track as `integer`

In [136]:
X_engineered = X_simple
X_engineered['year'] = pd.to_datetime(df['release_date']).dt.month
X_engineered.head(2)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
0,0.654,0.499,219827,0.19,0,0.00409,7,0.0898,-16.435,1,0.0454,149.46,0.43,1
1,0.00592,0.439,483948,0.808,0,0.14,2,0.089,-8.497,1,0.0677,138.04,0.0587,2


In [137]:
X_engineered['year'].unique()

array([ 1,  2,  9,  7,  8, 12,  5, 10, 11,  3,  6,  4])

📝 Check the impact of your new feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2  
☑️ Use your **X_engineered** for the training  
☑️ Save the mean score after cross-validation in **score_engineered** as a `float`  

In [151]:
score_engineered_cv = cross_val_score(basic_linear_model, X_engineered, y, cv=5, scoring=scoring)
score_engineered = float(score_engineered_cv.mean())
score_engineered

-18.254988261783755

In [153]:
score_engineered_cv_std = np.std(score_engineered_cv)
score_engineered_cv_std

0.08307210374825719

In [154]:
model_simple_engineered = basic_linear_model.fit(X_engineered, y)
y_pred = model_simple_engineered.predict(X_engineered)

score_engineered = float(-mean_squared_error(y, y_pred, squared=False))
score_engineered

-18.249703365015296

💾 **Run the following cell to save your results**

In [155]:
from nbresult import ChallengeResult

result = ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score=score_engineered
)
result.write()

## 4 - Unsupervised Learning (20 min)

*C10 - Entraîner un modèle d'apprentissage non supervisé pour détecter des structures sous-jacentes à partir de données non étiquetées*

🎯 **Create a new feature by performing a clustering of your existing features**

📝 Use a `KMeans` to assign each track to a cluster  
☑️ Your target number of clusters is 5  
☑️ Fit your `KMeans` on **X_simple**  
☑️ Store your fitted `KMeans` in **kmeans**  

In [171]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42).fit(X_simple)

📝 Add your clusters as features to your **X_engineered**  
☑️ Use your **kmeans** to get cluster predictions on **X_simple**  
☑️ Store the resulting predictions in a new column of **X_engineered** called *clusters*  

In [172]:
X_engineered['clusters'] = kmeans.predict(X_simple)

📝 Check the impact of your new *clusters* feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2 and 3  
☑️ Use your **X_engineered**, with both *year* and *clusters* for the training  
☑️ Save the mean score after cross-validation in **score_clusters** as a `float`  

In [173]:
score_clusters_cv = cross_val_score(basic_linear_model, X_engineered, y, cv=5, scoring=scoring)
score_clusters = float(score_clusters_cv.mean())
score_clusters

-18.254988261783755

💾 **Run the following cell to save your results**

In [174]:
from nbresult import ChallengeResult

result = ChallengeResult("unsupervised_learning",
    cols=X_engineered.columns.tolist(),
    clusters= kmeans.n_clusters,
    labels=X_engineered['clusters'].value_counts(normalize=True).values,
    score=score_clusters
)
result.write()

## 5 - Preprocressing (1 h)

*C6 - Transformer les données d'entrée afin de satisfaire les contraintes inhérentes au modèle (Preprocessing)*

🎯 **Construct a preprocessing pipeline for your data**

In [159]:
# This will help you visualize your pipelines
from sklearn import set_config; set_config(display='diagram')

In [166]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = df.drop(columns=['popularity', 'id'])
y = df['popularity']

📝 Look at your features with an `object` type in your **df**  
☑️ Check their number of unique values  

❓ Do you think it would be reasonable or efficient to one-hot encode any of them?  
☑️ Store you answer as a string (Yes or No) in **answer_ohe** below

In [167]:
for cat_col in X.select_dtypes(include='object').columns:
    print(cat_col + ':', df[cat_col].nunique())

name: 46641
release_date: 7547
artist: 12577


In [168]:
answer_ohe = 'No'

### 5.1 - Year

📝 Create a custom transformer to extract the *year* from *release_date*  
☑️ Use a [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)  
☑️ Store your custom transformer in **transformer_year**

In [225]:
from sklearn.preprocessing import FunctionTransformer

def process_year(release_dates):
    return pd.to_datetime(release_dates).dt.month

transformer_year = FunctionTransformer(process_year)

📝 Create a pipeline **pipeline_year** with two steps:
- your **transformer_year**  
- a scaler that ensures values between 0 and 1

In [183]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline_year = make_pipeline(
    transformer_year,
    MinMaxScaler()
)

### 5.2 - Clusters

📝 We provide you with a custom transformer to extract a cluster id for each observation  
☑️ The [`transform`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform) method of a `KMeans` return an array of shape (n_samples, n_clusters) with the distance to cluster for each pair obs-cluster  
☑️ We then simply use an `np.argmin` on the rows to get the index of the center the observation is closest to  
☑️ This effectively yields clusters for each observation

In [184]:
def process_clusters(clusters):
    return np.argmin(clusters, axis=1).reshape((-1, 1))

transformer_clusters = FunctionTransformer(process_clusters)

📝 Create a pipeline **pipeline_clusters** with three steps:
- a `KMeans` with a target number of clusters equals to 5  
- your custom transformer **transformer_clusters**  
- an encoder that creates a new binary column for each cluster - 1  

In [185]:
from sklearn.preprocessing import OneHotEncoder

pipeline_clusters = make_pipeline(
    KMeans(n_clusters=5),
    transformer_clusters,
    OneHotEncoder(sparse = False,
                  drop='if_binary',
                  handle_unknown = "ignore"
                 )
)

### 5.3 - Artist

📝 We provide you with a custom Transformer Class below  
☑️ Take some time to understand it  
☑️ It computes the average popularity of songs, per artist, on the train set only  
☑️ If the artist is unknown in the test set, the average popularity will be equal to the mean popularity on the train set  

In [186]:
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature the artist's popularity
    Does so by computing the mean popularity of all songs from the artist
    Notice that the popularity is computed on the train only to avoid leakage
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]

📝 Create a **pipeline_artist** with two steps:  
- the custom `ArtistPopularityTransformer`  
- a scaler that ensures values between 0 and 1

In [188]:
pipeline_artist = make_pipeline(
    ArtistPopularityTransformer(),
    MinMaxScaler()
)

### 5.4 Preprocessing Pipeline

📝 Create a transformer that contains all your preprocessing steps using a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=column%20transformer#sklearn.compose.ColumnTransformer), which should:  
☑️ Apply your **pipeline_clusters** to all numeric features  
☑️ Scale all numeric features, so that their scaled values are within 0 and 1  
☑️ Apply your **pipeline_year** to the *release_date* field  
☑️ Apply your **pipeline_artist** to the *artist* field  
☑️ Drop all other fields  

In [196]:
from sklearn.compose import make_column_transformer, make_column_selector

num_cols = make_column_selector(dtype_exclude=['object', 'bool'])
cat_cols = make_column_selector(dtype_include=['object', 'bool'])

preprocessor = make_column_transformer(
    (pipeline_clusters, num_cols),
    (MinMaxScaler(), num_cols),
    (pipeline_year, ['release_date']),
    (pipeline_artist, ['artist']),
    remainder = 'drop')

In [197]:
preprocessor

📝 Use your pipeline to `transform` your **X** and store the result in **X_transformed**

In [198]:
X_transformed = preprocessor.fit_transform(X)

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

💾 **Run the following cell to save your results**

In [229]:
X_transformed = np.array(['raté !'])

In [230]:
# Save your preproc
from nbresult import ChallengeResult

result = ChallengeResult(
    "preprocessing",
    answer=answer_ohe,
    shape=X_transformed.shape,
    first_observation = X_transformed[0]
)
result.write()

## 6 - Model Selection (40 min)

*C8 - Maîtriser les différents algorithmes d'apprentissage afin d'apporter une réponse adaptée à une problématique d'une organisation (entreprise, laboratoire, etc.)*  

🎯 **Select the model that yields the best performance for your task**

📝 Try model from two different families: linear and ensemble  
☑️ We expect you to cross-validate all scores with 5 folds in this section  

**If you did not manage to construct the full preprocessing:**  
☑️ Construct a light pipeline that use only features in **X_simple** and scale them to values between 0 and 1  

### 6.1 - Linear Models

📝 Construct a `Pipeline` that combines your **preproc**  and a linear estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_linear**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_linear**  

In [239]:
pipe_linear = make_pipeline(
    MinMaxScaler(),
    LinearRegression()
)

score_linear = cross_val_score(pipe_linear, X_simple, y, cv=5, scoring=scoring).mean()
score_linear = float(score_linear)
score_linear

-18.254988261784135

### 6.2 - Ensemble Methods

📝 Construct a `Pipeline` that combines your **preproc**  and an ensemble estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_ensemble**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_ensemble**  

In [238]:
from sklearn.ensemble import RandomForestRegressor

pipe_ensemble = make_pipeline(
    MinMaxScaler(),
    RandomForestRegressor()
)

score_ensemble = cross_val_score(pipe_ensemble, X_simple, y, cv=5, scoring=scoring).mean()
score_ensemble = float(score_ensemble)
score_ensemble

-15.745803022794215

💾 **Run the following cell to save your results**

In [241]:
from nbresult import ChallengeResult

result = ChallengeResult("model_selection",
    baseline=baseline_score,
    estimator_linear=pipe_linear._final_estimator,
    estimator_ensemble=pipe_ensemble._final_estimator,
    score_linear=score_linear,
    score_ensemble=score_ensemble)
result.write()

## 7 - Fine-tuning (25 min)

*C11 - Améliorer les capacités prédictives d'un systèmes en sélectionnant un modèle différent ou en modifiant ses hyperparamètres en vue de corriger des erreurs (hyperparameter tuning)*

🎯 **Fine-tune your best model to achieve the highest possible score**

📝 Create a cross-validated grid search and assign it to **search**  
☑️ Choose the model that yielded the best result in section 6  
☑️ Create a **grid**, a `dict`, that stores the hyperparameters you want to search  
☑️ Limit yourself to 2 hyperparameters, with up to 3 possible values for each one  
☑️ Use only one scoring method, the one you stored in **scoring** in section 2  

In [240]:
pipe_ensemble.get_params()

{'memory': None,
 'steps': [('minmaxscaler', MinMaxScaler()),
  ('randomforestregressor', RandomForestRegressor())],
 'verbose': False,
 'minmaxscaler': MinMaxScaler(),
 'randomforestregressor': RandomForestRegressor(),
 'minmaxscaler__clip': False,
 'minmaxscaler__copy': True,
 'minmaxscaler__feature_range': (0, 1),
 'randomforestregressor__bootstrap': True,
 'randomforestregressor__ccp_alpha': 0.0,
 'randomforestregressor__criterion': 'squared_error',
 'randomforestregressor__max_depth': None,
 'randomforestregressor__max_features': 'auto',
 'randomforestregressor__max_leaf_nodes': None,
 'randomforestregressor__max_samples': None,
 'randomforestregressor__min_impurity_decrease': 0.0,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__min_samples_split': 2,
 'randomforestregressor__min_weight_fraction_leaf': 0.0,
 'randomforestregressor__n_estimators': 100,
 'randomforestregressor__n_jobs': None,
 'randomforestregressor__oob_score': False,
 'randomforestregressor

In [242]:
grid = {
    'randomforestregressor__n_estimators':[50,100,200]
}

In [243]:
from sklearn.model_selection import GridSearchCV

grid_searched = GridSearchCV(pipe_ensemble, cv=5, n_jobs=-1, param_grid=grid, scoring=scoring)

📝 fit your **search**  on the full **X** and **y**  
☑️ Iterate until you notice an improvement on the best score  compared to the scores obtained in section 6  
💡 You won't be judged by the computing power of your machine, your grid search should fit in under 3 minutes

In [58]:
grid_searched.fit(X_simple, y)

💾 **Run the following cell to save your results**

In [1]:
from nbresult import ChallengeResult

result = ChallengeResult("model_tuning",
    search_results=search.cv,
    score=search.best_score_)
result.write()

NameError: name 'search' is not defined

## 8 - Recommendations and Continuous Improvement (30 min)

*C13 - Adopter une démarche d'amélioration continue en identifiant les axes de perfectionnement d'un produit à l'aide d'une méthode adaptée de manière à améliorer la performance d'un produit*

🎯 **Transform your regression task into a classification task**

The product owner of your company tells you that he only needs to know whether a song is above or below popularity median  
The exact popularity value is of little interest to her as it won't bring any value to the feature under development.

📝 Create a new target **y_cat**, a `Series`, using the formula below  

☑️ $y\_cat_i = 1 \quad if\quad y_i \geq median(y),\quad 0\quad otherwise.$  

In [62]:
# YOUR CODE HERE

📝 Cross validate a classification  

☑️ Use your **preproc** with a basic linear model from `sklearn` suited for classification  
☑️ Assign the resulting pipeline to **pipe_cat**  
☑️ Cross validate the pipeline with 5 folds  
☑️ Use the Accuracy metric and store the mean of scores in **score_cat**  

In [64]:
# YOUR CODE HERE

💾 **Run the following cell to save your results**

In [65]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "recommendations",
    target_cat = y_cat.value_counts(normalize=True).values,
    model = pipe_cat._final_estimator,
    score_cat = score_cat
)
result.write()

## 9 - Deployment - API (40 min)

*C12 - Mettre en production le modèle d'apprentissage supervisé ou non supervisé obtenu sous la forme d'une API*

📝 This challenge takes place in another repository  
☑️ Follow the instructions provided on the certification platform