## Legend

🎯 Objective    
❓ Question  
📝 Task  
☑️ Instructions  
💡 Information  
💾 Submit your results  

**variable name**  
*field name*  
`python object`

# Supervised and Unsupervised Machine Learning

This exam contains **9** sections.  

Each section comes with an indicative time, but you may work at your own pace.

To pass the exam you need to validate at least **5** sections.  

Some sections can be validated independently, even though the exam uses the same data for all of them.  

## Description (10 min)

🎯 **Your objective is to create a model that predicts the popularity of a song based on its characteristics**

To achieve this, you are given a dataset containing a list of songs with the following characteristics:

**acousticness**: whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of a track  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

☑️ Only fine-tune your model when explicitly asked to do so, in section *7 - Fine-tuning*  

## 1 - Data Cleaning (15 min)

*C5 - Prepare the data for training so that it is clean*

🎯 **Load and clean the data**

📝 Load the data in **df**, a `DataFrame`  
☑️ The data file is available at this url: https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv

In [144]:
import pandas as pd

In [145]:
# Load directly from the URL given above so the cell runs on any machine
df = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv')

📝 Clean the data, make sure that no duplicates nor missing values remain in **df**

In [146]:
df = df.drop_duplicates() 
df = df.dropna()

In [147]:
#df.isnull().sum()
#df.duplicated().sum()
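
The commented-out checks above can be turned into a quick sanity check. The sketch below runs the same cleaning steps on a hypothetical toy frame (`toy` is an illustrative stand-in, not the exam data) and confirms that no duplicates or missing values remain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one duplicated row and one missing value
toy = pd.DataFrame({
    "energy": [0.5, 0.5, 0.8, np.nan],
    "tempo": [120.0, 120.0, 98.0, 140.0],
})

# Same cleaning steps as applied to df above
toy_clean = toy.drop_duplicates().dropna()

# Both counts should be zero once cleaning is done
n_missing = int(toy_clean.isnull().sum().sum())
n_duplicates = int(toy_clean.duplicated().sum())
print(toy_clean.shape, n_missing, n_duplicates)
```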

In [148]:
df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand
3,0.429,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.0,11,0.394,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1,1926,0.946,145.333,0.288,Georgette Heyer
4,0.562,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,4e-06,2,0.127,-7.374,1,Away from You,15,2008-02-11,0.0265,139.272,0.801,Gerry & The Pacemakers


💾 **Run the following cell to save your results**

In [149]:
from nbresult import ChallengeResult

result = ChallengeResult("data_cleaning", shape=df.shape)
result.write()

## 2 - Supervised Learning (40 min)

*C9 - Train a supervised learning model to optimize a prediction function from annotated examples*

🎯 **Identify your metrics, compute your baseline and evaluate a basic model**

📝 Choose an appropriate scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) from `sklearn` for this challenge. The chosen metric must:    

☑️ penalize large errors much more strongly than small ones  
☑️ measure errors in the same unit as the target `popularity`  
☑️ the greater, the better (metric_good_model > metric_bad_model)  

📝 Store in **scoring** its exact name as `string`

In [150]:
from sklearn.metrics import make_scorer, mean_squared_error

# Custom scorer equivalent to sklearn's built-in 'neg_root_mean_squared_error'
rmse_neg = make_scorer(lambda y_true, y_pred: -1 * mean_squared_error(y_true, y_pred)**0.5)
scoring = 'neg_root_mean_squared_error'

In [151]:
from sklearn.metrics import mean_squared_error
def compute_rmse_neg(y_true, y_pred):
    return -1 * mean_squared_error(y_true, y_pred)**0.5
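
To see why a root-mean-squared error meets the first requirement, the sketch below compares it with the mean absolute error on two hypothetical sets of predictions that have the same total absolute error. All arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([50.0, 50.0, 50.0, 50.0])
# Two hypothetical predictions with the same total absolute error (20)
y_spread = np.array([55.0, 55.0, 55.0, 55.0])   # four errors of 5
y_outlier = np.array([50.0, 50.0, 50.0, 70.0])  # one error of 20

def neg_rmse(y_true, y_pred):
    # Negated so that greater is better, like 'neg_root_mean_squared_error'
    return -np.sqrt(mean_squared_error(y_true, y_pred))

# MAE cannot tell the two cases apart
mae_spread = mean_absolute_error(y_true, y_spread)
mae_outlier = mean_absolute_error(y_true, y_outlier)

# Negated RMSE scores the single large error as clearly worse
score_spread = neg_rmse(y_true, y_spread)
score_outlier = neg_rmse(y_true, y_outlier)
print(mae_spread, mae_outlier, score_spread, score_outlier)
```

Both scores stay in the unit of the target, and the model with one large miss gets the lower (worse) score.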

📝 Define your features and target   

☑️ Assign to **X_simple** a `DataFrame` containing only numerical features  
☑️ Assign to **y** a `Series` containing only your target: *popularity*  

In [152]:
X_simple = df.select_dtypes(include=['float64','int64']).drop(columns='popularity')
y = df['popularity']

📝 Compute your baseline and store it in **baseline_score**, as a `float`  
☑️ Do so by simulating a constant prediction equivalent to the mean value of your target  
☑️ Use the same scoring function as the one stored in **scoring**  
☑️ You may have to code the scoring function yourself to use it outside a `sklearn` workflow  

In [153]:
import numpy as np

# Constant prediction equal to the mean of the target
y_baseline = np.ones(len(y)) * y.mean()
baseline_score = compute_rmse_neg(y, y_baseline)

In [154]:
baseline_score

-21.86400900273424
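
The same constant-mean baseline can be cross-checked with `DummyRegressor`. This is a minimal sketch on hypothetical values (`X_toy` and `y_toy` are illustrative, not the exam data).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical target values, standing in for the popularity column
y_toy = np.array([10.0, 20.0, 30.0, 40.0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyRegressor

# Manual baseline: predict the mean everywhere, then negate the RMSE
y_mean_pred = np.full(len(y_toy), y_toy.mean())
baseline_manual = -np.sqrt(mean_squared_error(y_toy, y_mean_pred))

# Same baseline through sklearn's DummyRegressor
dummy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
baseline_dummy = -np.sqrt(mean_squared_error(y_toy, dummy.predict(X_toy)))
print(baseline_manual, baseline_dummy)
```

For a constant mean prediction, the RMSE equals the population standard deviation of the target, so the baseline is also `-y.std(ddof=0)`.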

### Holdout evaluation

📝 Split your data, holding out 50% of observations, randomly sampled, as test set  
☑️ Assign the result of your holdout to **X_train_simple**, **y_train**, **X_test_simple**, **y_test**

In [155]:
from sklearn.model_selection import train_test_split
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5, random_state=42)

📝 Fit and evaluate the most basic linear model you can find in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module  
☑️ Use the metric you stored in **scoring**    
☑️ Store in **score_simple_holdout** your model score

In [156]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Instantiate the model and train it on the training data
model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)
# Note: this score is a 5-fold cross-validation on the training half, not a score on X_test_simple
score_simple_holdout = cross_val_score(model_simple, X_train_simple, y_train, cv=5, scoring='neg_root_mean_squared_error').mean()

In [157]:
score_simple_holdout 

-18.33297304347687
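
A strict hold-out evaluation fits the model on the training half and scores it once on the held-out half. The sketch below shows this on hypothetical synthetic data (not the Spotify dataset), using `get_scorer` to reuse the scoring name as a callable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=42)

# Fit on the training half only, then score once on the held-out half
model = LinearRegression().fit(X_tr, y_tr)
scorer = get_scorer("neg_root_mean_squared_error")
score_holdout = scorer(model, X_te, y_te)
print(score_holdout)
```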

### Cross-validation evaluation

📝 Cross-validate your basic model  
☑️ Use 5 folds for your cross-validation  
☑️ Store your mean score in **score_simple_cv_mean** as a `float`  
☑️ Store the standard deviation of your scores in **score_simple_cv_std** as a `float`

In [158]:
# Run the cross-validation once and reuse the fold scores
cv_scores = cross_val_score(model_simple, X_train_simple, y_train, cv=5, scoring=scoring)
score_simple_cv_mean = cv_scores.mean()

In [159]:
score_simple_cv_std = cv_scores.std()

In [160]:
score_simple_cv_mean

-18.33297304347687

In [161]:
score_simple_cv_std 

0.11667427590263478

☑️ From now on, you will stop using your train-test split    
☑️ Instead we expect you to cross-validate (5 folds) your results with the whole dataset

💾 **Run the following cell to save your results**

In [162]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "supervised_learning",
    scoring=scoring,
    baseline_score=baseline_score,
    model=model_simple,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
)
result.write()

## 3 - Feature engineering (20 min)

*C7 - Generate input data to satisfy the constraints inherent to the model (Feature Engineering)*

🎯 **Create a new feature by extracting information from existing features**

Let's try to improve performance using the feature *release_date*

📝 Create a `DataFrame` **X_engineered** by adding a new column *year* to **X_simple**  
☑️ *year* must contain the release year of the track as `integer`

In [163]:
X_engineered = X_simple.copy()
# release_date mixes formats ('1971', '2015-02-02'); the first four characters always hold the year
X_engineered['year'] = df['release_date'].str[:4].astype(int)

In [164]:
X_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300,1971
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587,2015
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640,1968
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880,1926
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150,1977
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140,1965
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380,2020
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100,1952


📝 Check the impact of your new feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2  
☑️ Use your **X_engineered** for the training  
☑️ Save the mean score after cross-validation in **score_engineered** as a `float`  

In [165]:
X_train_simple, X_test_simple, y_train, y_test= train_test_split(X_engineered, y, test_size=0.5,random_state=42)
score_engineered = cross_val_score(model_simple, X_train_simple, y_train, cv=5, scoring='neg_root_mean_squared_error').mean()

In [166]:
score_engineered

-17.27716564677127

💾 **Run the following cell to save your results**

In [167]:
from nbresult import ChallengeResult

result = ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score=score_engineered
)
result.write()

## 4 - Unsupervised Learning (20 min)

*C10 - Train an unsupervised learning model to detect underlying structures in unlabeled data*

🎯 **Create a new feature by performing a clustering of your existing features**

📝 Use a `KMeans` to assign each track to a cluster  
☑️ Your target number of clusters is 5  
☑️ Fit your `KMeans` on **X_simple**  
☑️ Store your fitted `KMeans` in **kmeans**  

In [168]:
from sklearn.cluster import KMeans
import numpy as np
kmeans = KMeans(n_clusters=5).fit(X_simple)

📝 Add your clusters as features to your **X_engineered**  
☑️ Use your **kmeans** to get cluster predictions on **X_simple**  
☑️ Store the resulting predictions in a new column of **X_engineered** called *clusters*  

In [169]:
clusters = kmeans.predict(X_simple)
X_engineered['clusters'] = clusters
X_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year,clusters
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300,1971,3
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587,2015,4
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640,1968,3
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880,1926,0
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010,2008,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150,1977,0
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140,1965,0
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380,2020,3
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100,1952,3


📝 Check the impact of your new *clusters* feature on the performance of your model  
☑️ Retrain the same basic linear model you used in sections 2 and 3  
☑️ Use your **X_engineered**, with both *year* and *clusters* for the training  
☑️ Save the mean score after cross-validation in **score_clusters** as a `float`  

In [170]:
X_train_simple, X_test_simple, y_train, y_test= train_test_split(X_engineered, y, test_size=0.5,random_state=42)

In [171]:
score_clusters = cross_val_score(model_simple, X_train_simple, y_train, cv=5, scoring='neg_root_mean_squared_error').mean()
score_clusters

-17.24726980691602

💾 **Run the following cell to save your results**

In [172]:
from nbresult import ChallengeResult

result = ChallengeResult("unsupervised_learning",
    cols=X_engineered.columns.tolist(),
    clusters= kmeans.n_clusters,
    labels=X_engineered['clusters'].value_counts(normalize=True).values,
    score=score_clusters
)
result.write()

## 5 - Preprocessing (1 h)

*C6 - Transform input data to satisfy the constraints inherent to the model (Preprocessing)*

🎯 **Construct a preprocessing pipeline for your data**

In [173]:
# This will help you visualize your pipelines
from sklearn import set_config; set_config(display='diagram')

In [174]:
df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand
3,0.429,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.0,11,0.394,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1,1926,0.946,145.333,0.288,Georgette Heyer
4,0.562,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,4e-06,2,0.127,-7.374,1,Away from You,15,2008-02-11,0.0265,139.272,0.801,Gerry & The Pacemakers


In [176]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = df.drop(columns='popularity')
y = df['popularity']

📝 Look at your features with an `object` type in your **df**  
☑️ Check their number of unique values  

❓ Do you think it would be reasonable or efficient to one-hot encode any of them?  
☑️ Store your answer as a string (Yes or No) in **answer_ohe** below

In [183]:
X.select_dtypes(include='object').head()

Unnamed: 0,id,name,release_date,artist
0,0B6BeEUd6UwFlbsHMQKjob,Back in the Goodle Days,1971,John Hartford
1,5Gpx4lJy3vKmIvjwbiR5c8,Worlds Which Break Us - Intro Mix,2015-02-02,Driftmoon
2,7MxuUYqrCIy93h1EEHrIrL,I'm The Greatest Star,1968-09-01,Barbra Streisand
3,4GeYbfIx1vSQXTfQb1m8Th,Kapitel 281 - Der Page und die Herzogin,1926,Georgette Heyer
4,2JPGGZwajjMk0vvhfC17RK,Away from You,2008-02-11,Gerry & The Pacemakers


In [184]:
answer_ohe = 'No'  # That would create too many features relative to the number of observations

### 5.1 - Year

📝 Create a custom transformer to extract the *year* from *release_date*  
☑️ Use a [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)  
☑️ Store your custom transformer in **transformer_year**

In [185]:
# Define the function wrapped by the FunctionTransformer
def transformer_year(X):
    # The first four characters of release_date always hold the year
    return pd.DataFrame(X['release_date'].str[:4].astype(int))

📝 Create a pipeline **pipeline_year** with two steps:
- your **transformer_year**  
- a scaler that ensures values between 0 and 1

In [186]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

pipeline_year = make_pipeline(FunctionTransformer(transformer_year), MinMaxScaler())

### 5.2 - Clusters

📝 We provide you with a custom transformer to extract a cluster id for each observation  
☑️ The [`transform`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform) method of a `KMeans` returns an array of shape (n_samples, n_clusters) holding the distance from each observation to each cluster center  
☑️ We then simply use an `np.argmin` on the rows to get the index of the center the observation is closest to  
☑️ This effectively yields clusters for each observation

In [42]:
# YOUR CODE HERE
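
The cell for the provided transformer is empty in this notebook. Based on the description above, it could look like the following sketch. The function name `closest_center` and the toy data are assumptions, not the original provided code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import FunctionTransformer

# Hypothetical 2-D points forming two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

def closest_center(distances):
    # distances has shape (n_samples, n_clusters);
    # argmin over axis 1 picks the index of the nearest center
    return np.argmin(distances, axis=1).reshape(-1, 1)

transformer_clusters_sketch = FunctionTransformer(closest_center)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
# KMeans.transform gives the distances, the transformer turns them into cluster ids
cluster_ids = transformer_clusters_sketch.fit_transform(km.transform(X_toy))
print(cluster_ids.ravel())
```

The result is reshaped to a single column so that an encoder placed after it in a pipeline receives a 2-D input.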

📝 Create a pipeline **pipeline_clusters** with three steps:
- a `KMeans` with a target number of clusters equal to 5  
- your custom transformer **transformer_clusters**  
- an encoder that creates one binary column per cluster, minus one (n_clusters - 1 columns in total)  

In [191]:
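from sklearn.preprocessing import OneHotEncoder

def transformer_clusters(distances):
    # KMeans.transform returns distances of shape (n_samples, n_clusters);
    # the closest center gives the cluster id, reshaped to a column for the encoder
    return np.argmin(distances, axis=1).reshape(-1, 1)

pipeline_clusters = make_pipeline(
    KMeans(n_clusters=5),
    FunctionTransformer(transformer_clusters),
    # drop='first' yields n_clusters - 1 binary columns
    # (on scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
    OneHotEncoder(drop='first', sparse=False),
)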

### 5.3 - Artist

📝 We provide you with a custom Transformer Class below  
☑️ Take some time to understand it  
☑️ It computes the average popularity of songs, per artist, on the train set only  
☑️ If the artist is unknown in the test set, the average popularity will be equal to the mean popularity on the train set  

In [192]:
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature the artist's popularity
    Do so by computing the mean popularity of all songs from the artist
    Notice that the popularity is computed on the train only to avoid leakage
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
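
The fallback behaviour described above can be illustrated without the class itself. The sketch below reproduces the same idea with plain pandas on hypothetical data: per-artist means are learned on the train set only, and an artist unseen in train receives the global train mean.

```python
import pandas as pd

# Hypothetical train and test sets
train = pd.DataFrame({"artist": ["A", "A", "B"]})
y_train_toy = pd.Series([10, 30, 60])
test = pd.DataFrame({"artist": ["A", "C"]})  # "C" never appears in train

# Mean popularity per artist and global mean, both computed on train only
artist_mean = y_train_toy.groupby(train["artist"]).mean()
global_mean = y_train_toy.mean()

# Known artists get their own mean; unknown artists fall back to the global mean
test_encoded = test["artist"].map(artist_mean).fillna(global_mean)
print(test_encoded.tolist())
```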

📝 Create a **pipeline_artist** with two steps:  
- the custom `ArtistPopularityTransformer`  
- a scaler that ensures values between 0 and 1

In [193]:
# ArtistPopularityTransformer is already a transformer: instantiate it directly
pipeline_artist = make_pipeline(ArtistPopularityTransformer(), MinMaxScaler())

### 5.4 Preprocessing Pipeline

📝 Create a transformer that contains all your preprocessing steps using a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=column%20transformer#sklearn.compose.ColumnTransformer), which should:  
☑️ Apply your **pipeline_clusters** to all numeric features  
☑️ Scale all numeric features, so that their scaled values are between 0 and 1  
☑️ Apply your **pipeline_year** to the *release_date* field  
☑️ Apply your **pipeline_artist** to the *artist* field  
☑️ Drop all other fields  

In [215]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

preproc = make_column_transformer(
    #(pipeline_clusters, make_column_selector(dtype_include=['int64','float64'])),
    (MinMaxScaler(), make_column_selector(dtype_include=['int64','float64'])),
    #(pipeline_year, ['year']),
    #(pipeline_artist, ['artist']),
    remainder='drop')
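
As a reference for how several branches combine in a `ColumnTransformer`, here is a minimal sketch on a hypothetical toy frame. It only covers numeric scaling and year extraction; the names `toy`, `extract_year`, `year_pipe` and `preproc_sketch` are illustrative and not part of the exam solution.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Hypothetical toy frame mimicking a few columns of the dataset
toy = pd.DataFrame({
    "energy": [0.2, 0.8, 0.5],
    "tempo": [100.0, 150.0, 125.0],
    "release_date": ["1971", "2015-02-02", "1993-06"],
    "name": ["a", "b", "c"],
})

def extract_year(X):
    # The first four characters hold the year, whatever the date precision
    return X[["release_date"]].apply(lambda col: col.str[:4].astype(int))

year_pipe = make_pipeline(FunctionTransformer(extract_year), MinMaxScaler())

preproc_sketch = make_column_transformer(
    (MinMaxScaler(), make_column_selector(dtype_include="number")),
    (year_pipe, ["release_date"]),
    remainder="drop",  # "name" is dropped
)
toy_transformed = preproc_sketch.fit_transform(toy)
print(toy_transformed)
```

Each branch receives only its own columns, and the outputs are concatenated side by side, here giving two scaled numeric columns followed by the scaled year.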

📝 Use your pipeline to `transform` your **X** and store the result in **X_transformed**

In [216]:
X_transformed = preproc.fit_transform(X)

💾 **Run the following cell to save your results**

In [217]:
# Save your preproc
from nbresult import ChallengeResult

result = ChallengeResult(
    "preprocessing",
    answer=answer_ohe,
    shape=X_transformed.shape,
    first_observation = X_transformed[0]
)
result.write()

## 6 - Model Selection (40 min)

*C8 - Master the different learning algorithms in order to provide a suitable answer to an organization's problem (company, laboratory, etc.)*  

🎯 **Select the model that yields the best performance for your task**

📝 Try models from two different families: linear and ensemble  
☑️ We expect you to cross-validate all scores with 5 folds in this section  

**If you did not manage to construct the full preprocessing:**  
☑️ Construct a light pipeline that uses only the features in **X_simple** and scales them to values between 0 and 1  

### 6.1 - Linear Models

📝 Construct a `Pipeline` that combines your **preproc**  and a linear estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_linear**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_linear**  

In [223]:
model = LinearRegression()
pipe_linear = make_pipeline(preproc, model)
score_linear = cross_val_score(pipe_linear, X, y, cv=5, scoring='neg_root_mean_squared_error').mean()
score_linear

-18.360558551569312

### 6.2 - Ensemble Methods

📝 Construct a `Pipeline` that combines your **preproc**  and an ensemble estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_ensemble**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_ensemble**  

In [None]:
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Candidate ensemble estimators (KNN and Ridge are not ensemble methods, so they are left out)
estimators = [
    # The base tree is passed positionally: the keyword was renamed across sklearn versions
    AdaBoostRegressor(DecisionTreeRegressor(max_depth=None)),
    RandomForestRegressor(max_depth=50, min_samples_leaf=20),
    GradientBoostingRegressor(max_depth=50, min_samples_leaf=20),
]

# Cross-validate each candidate and keep the best pipeline with its mean score
score_ensemble = -np.inf
for model in estimators:
    candidate = make_pipeline(preproc, model)
    score = cross_val_score(candidate, X, y, cv=5, scoring=scoring).mean()
    if score > score_ensemble:
        pipe_ensemble, score_ensemble = candidate, score

💾 **Run the following cell to save your results**

In [55]:
from nbresult import ChallengeResult

result = ChallengeResult("model_selection",
    baseline=baseline_score,
    estimator_linear=pipe_linear._final_estimator,
    estimator_ensemble=pipe_ensemble._final_estimator,
    score_linear=score_linear,
    score_ensemble=score_ensemble)
result.write()

## 7 - Fine-tuning (25 min)

*C11 - Improve a system's predictive capabilities by selecting a different model or modifying its hyperparameters in order to correct errors (hyperparameter tuning)*

🎯 **Fine-tune your best model to achieve the highest possible score**

📝 Create a cross-validated grid search and assign it to **search**  
☑️ Choose the model that yielded the best result in section 6  
☑️ Create a **grid**, a `dict`, that stores the hyperparameters you want to search  
☑️ Limit yourself to 2 hyperparameters, with up to 3 possible values for each one  
☑️ Use only one scoring method, the one you stored in **scoring** in section 2  

In [57]:
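from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Assumption: the random forest is treated as the best model from section 6
pipe = make_pipeline(preproc, RandomForestRegressor())
grid = {'randomforestregressor__max_depth': [10, 20, 50],
        'randomforestregressor__min_samples_leaf': [5, 20, 50]}
search = GridSearchCV(estimator=pipe, param_grid=grid, cv=5, scoring=scoring, n_jobs=-1)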

📝 fit your **search**  on the full **X** and **y**  
☑️ Iterate until you notice an improvement on the best score  compared to the scores obtained in section 6  
💡 You won't be judged on the computing power of your machine: your grid search should fit in under 3 minutes

In [58]:
# YOUR CODE HERE
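
The fitting step follows the usual `GridSearchCV` workflow. The sketch below shows it end to end on hypothetical synthetic data, with `Ridge` standing in for whichever model was selected; every name ending in `_toy` is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([3.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=100)

pipe_toy = make_pipeline(MinMaxScaler(), Ridge())

# Two hyperparameters, up to three values each;
# the "ridge__" prefix is the step name generated by make_pipeline
grid_toy = {
    "ridge__alpha": [0.1, 1.0, 10.0],
    "ridge__fit_intercept": [True, False],
}

search_toy = GridSearchCV(pipe_toy, param_grid=grid_toy, cv=5,
                          scoring="neg_root_mean_squared_error")
search_toy.fit(X_toy, y_toy)
print(search_toy.best_params_, search_toy.best_score_)
```

After fitting, `best_score_` is the mean cross-validated score of the best combination and can be compared directly with the scores from section 6.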

💾 **Run the following cell to save your results**

In [69]:
from nbresult import ChallengeResult

result = ChallengeResult("model_tuning",
    search_results=search.cv,
    score=search.best_score_)
result.write()

## 8 - Recommendations and Continuous Improvement (30 min)

*C13 - Adopt a continuous improvement approach by identifying a product's areas for improvement with a suitable method, so as to improve the product's performance*

🎯 **Transform your regression task into a classification task**

The product owner of your company tells you that she only needs to know whether a song is above or below the median popularity.  
The exact popularity value is of little interest to her, as it won't bring any value to the feature under development.

📝 Create a new target **y_cat**, a `Series`, using the formula below  

☑️ $y\_cat_i = \begin{cases} 1 & \text{if } y_i \geq \mathrm{median}(y) \\ 0 & \text{otherwise} \end{cases}$  

In [62]:
# YOUR CODE HERE
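
The formula above maps directly onto a boolean comparison. A minimal sketch on hypothetical popularity values (`y_toy` is illustrative, not the exam target):

```python
import pandas as pd

# Hypothetical popularity values
y_toy = pd.Series([10, 40, 40, 70, 90])

# 1 when the value is at or above the median, 0 otherwise
y_cat_toy = (y_toy >= y_toy.median()).astype(int)
print(y_cat_toy.tolist())
```

Because values equal to the median count as 1, the two classes are not necessarily perfectly balanced.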

📝 Cross validate a classification  

☑️ Use your **preproc** with a basic linear model from `sklearn` suited for classification  
☑️ Assign the resulting pipeline to **pipe_cat**  
☑️ Cross validate the pipeline with 5 folds  
☑️ Use the Accuracy metric and store the mean of scores in **score_cat**  

In [64]:
# YOUR CODE HERE
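
One way to cross-validate such a classifier is sketched below on hypothetical synthetic data, assuming `LogisticRegression` as the basic linear classifier and a `MinMaxScaler` in place of the full preprocessing.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data with a continuous target turned into a binary one
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_cont = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)
y_cat_toy = (y_cont >= np.median(y_cont)).astype(int)

# Scaling followed by a linear classifier, cross-validated with 5 folds on accuracy
pipe_cat_toy = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
score_cat_toy = cross_val_score(pipe_cat_toy, X_toy, y_cat_toy, cv=5, scoring="accuracy").mean()
print(score_cat_toy)
```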

💾 **Run the following cell to save your results**

In [65]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "recommendations",
    target_cat = y_cat.value_counts(normalize=True).values,
    model = pipe_cat._final_estimator,
    score_cat = score_cat
)
result.write()

## 9 - Deployment - API (40 min)

*C12 - Deploy the resulting supervised or unsupervised learning model to production as an API*

📝 This challenge takes place in another repository  
☑️ Follow the instructions provided on the certification platform