## Legend

In [142]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

from sklearn.model_selection import cross_validate, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error 

from sklearn.linear_model import LinearRegression, Lasso,SGDClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn.dummy import DummyRegressor
from sklearn.cluster import KMeans


🎯 Objective    
❓ Question  
📝 Task  
☑️ Instructions  
💡 Informations  
💾 Submit your results  

**variable name**  
*field name*  
`python object`

# Apprentissage automatique (Machine Learning) supervisé et non supervisé

This exam contains **9** sections.  

For each one a time indication is given for information but you may choose your own pace.

To pass the exam you need to validate at least **5** sections.  

Some sections can be validated independently even if the exam use the same data for all of them.  

## Description (10 min)

🎯 **Your objective is to create a model that predicts the popularity of a song based on its characteristics**

To achieve this, you are given a dataset containing a list of songs with the following characteristics:

**acousticness**: whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of a track  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

☑️ Only fine-tune your model when explicitly asked to do so, in section *7 - Fine-tuning*  

## 1 - Data Cleaning (15 min)

*C5 - Préparer les données en vue de l'apprentissage afin que celles-ci soient nettoyées*

🎯 **Load and clean the data**

📝 Load the data in **df**, a `DataFrame`  
☑️ The data file is available at this url: https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv

In [143]:
df=pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv")
df

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.65400,0.499,219827,0.190,0,0B6BeEUd6UwFlbsHMQKjob,0.004090,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.460,0.4300,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.140000,2,0.0890,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.040,0.0587,Driftmoon
2,0.73400,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.000000,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.2140,75.869,0.4640,Barbra Streisand
3,0.42900,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.000000,11,0.3940,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1,1926,0.9460,145.333,0.2880,Georgette Heyer
4,0.56200,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,0.000004,2,0.1270,-7.374,1,Away from You,15,2008-02-11,0.0265,139.272,0.8010,Gerry & The Pacemakers
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,2GJxRwFe8oLcbXgTw9P5of,0.004870,6,0.8010,-7.804,1,"Incidental CB Dialogue - Bandit, Smokey & Snowman",25,1977-01-01,0.6620,85.615,0.3150,Burt Reynolds
52313,0.77300,0.533,192838,0.659,0,0EtAPdqg7TysBXKDbnzuSO,0.773000,2,0.1130,-9.117,0,Samba De Verão,43,1965-06-01,0.0426,158.366,0.6140,Walter Wanderley
52314,0.45600,0.548,310840,0.568,0,1s78GLrkZT7rTKAEu056M8,0.000000,6,0.0892,-5.348,1,Kekkonnouta,0,2020-04-15,0.0275,77.495,0.3380,accel
52315,0.96500,0.360,216493,0.132,0,1LUBU2WI4z0dALUM16hoAH,0.000000,10,0.1260,-21.014,1,Die Meistersinger von Nürnberg - Act 1: Wohl M...,0,1952-01-01,0.0355,80.909,0.4100,Richard Wagner


📝 Clean the data, make sure that no duplicates nor missing values remain in **df**

In [144]:
df.isnull().sum()
df.shape

(52317, 18)

In [145]:
df.dropna(inplace=True)
df.shape

(52313, 18)

In [146]:
df=df.drop_duplicates()
df.shape

(52053, 18)

💾 **Run the following cell to save your results**

In [147]:
from nbresult import ChallengeResult

result = ChallengeResult("data_cleaning", shape=df.shape)
result.write()

## 2 - Supervised Learning (40 min)

*C9 - Entraîner un modèle d'apprentissage supervisé pour optimiser une fonction de prédiction à partir d'exemples annotées*

🎯 **Identify your metrics, compute your baseline and evaluate a basic model**

📝 Choose an appropriate scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) from `sklearn` for this challenge, the chosen metric must:    

☑️ strongly penalize largest errors relatively to smaller ones  
☑️ measure errors in the same unit as the target `popularity`  
☑️ the greater, the better (metric_good_model > metric_bad_model)  

📝 Store in **scoring** its exact name as `string`

In [148]:
#Use RMSE when you need to penalize large errors, but see it in the unit of the target.
scoring ="Root Mean squared error"

In [149]:
df.dtypes.value_counts()

float64    9
int64      5
object     4
dtype: int64

In [150]:
features_categorical = df.select_dtypes(include='object').columns
features_categorical

Index(['id', 'name', 'release_date', 'artist'], dtype='object')

In [151]:
features_numerical = df.select_dtypes(include=['int64','float']).columns
features_numerical

Index(['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'popularity',
       'speechiness', 'tempo', 'valence'],
      dtype='object')

📝 Define your features and target   

☑️ Assign to **X_simple** a `DataFrame` containing only numerical features  
☑️ Assign to **y** a `Series` containing only your target: *popularity*  

In [152]:
X_simple = df[features_numerical].drop(columns='popularity')
y = df['popularity']

📝 Compute your baseline and store it in **baseline_score**, as a `float`  
☑️ Do so by simulating a constant prediction equivalent to the mean value of your target  
☑️ Use the same scoring function as the one stored in **scoring**  
☑️ You may have to code the scoring function yourself to use it outside a `sklearn` workflow  

In [153]:
baseline_score=y.mean()
baseline_score

25.815188365704188

### Holdout evaluation

📝 Split your data, holding out 50% of observations, randomly sampled, as test set  
☑️  Assign the result of your holdout to **X_train_simple** **y_train**, **X_test_simple**, **y_test**


In [154]:
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)

📝 Fit and evaluate the most basic linear model you can find in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module  
☑️ Use the metric you stored in **scoring**    
☑️ Store in **score_simple_holdout** your model score

In [155]:
model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train) 

y_pred = model_simple.predict(X_test_simple)
score_simple_holdout = round(mean_squared_error(y_test, y_pred, squared=False),1)
score_simple_holdout

18.4

### Cross-validation evaluation

📝 Cross-validate your basic model  
☑️ Use 5 folds for your cross-validation  
☑️ Store your mean score in **score_simple_cv_mean** as a `float`  
☑️ Store the standard deviation of your scores in **score_simple_cv_std** as a `float`

In [156]:
# cross-validation with 5 folds
cv_results = cross_validate(
    model_simple,    
    X_train_simple, y_train,   
    scoring=['neg_root_mean_squared_error'],# negative RMSE
    cv=5,  
    verbose=1
)

cv_results
score_simple_cv_mean=round(cv_results['test_neg_root_mean_squared_error'].mean(),2)
score_simple_cv_mean

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


-18.3

In [157]:
import statistics
score_simple_cv_std = round(statistics.stdev(cv_results['test_neg_root_mean_squared_error']),2)
print(score_simple_cv_std)

0.12


In [158]:
#Double check my Standard deviation 

scores=cv_results['test_neg_root_mean_squared_error']
mean = np.mean(scores)
differences = [(score - mean)**2 for score in scores]
sum_of_squared_differences = sum(differences)
standard_deviation = np.sqrt(sum_of_squared_differences / len(scores))
score_simple_cv_std=round(standard_deviation,2)
score_simple_cv_std

0.1

☑️ From now on, you will stop using your train-test split    
☑️ Instead we expect you to cross-validate (5 folds) your results with the whole dataset

💾 **Run the following cell to save your results**

In [159]:
print(scoring)
print(baseline_score)
print(model_simple)
print(X_train_simple.shape)
print(score_simple_holdout)
print(score_simple_cv_mean)
print(score_simple_cv_std)

Root Mean squared error
25.815188365704188
LinearRegression()
(26026, 13)
18.4
-18.3
0.1


In [160]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "supervised_learning",
    scoring=scoring,
    baseline_score=baseline_score,
    model=model_simple,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
)
result.write()

## 3 - Feature engineering (20 min)

*C7 - Générer des données d'entrée afin de satisfaire les contraintes inhérentes au modèles (Feature Engineering)*

🎯 **Create a new feature by extracting information from existing features**

Let's try to improve performance using the feature *release_date*

📝 Create a `DataFrame` **X_engineered** by adding a new column *year* to **X_simple**  
☑️ *year* must contain the release year of the track as `integer`

In [161]:
X_simple.shape

(52053, 13)

In [162]:
print(df.shape)
date_series=df['release_date']
date_series

(52053, 18)


0              1971
1        2015-02-02
2        1968-09-01
3              1926
4        2008-02-11
            ...    
52312    1977-01-01
52313    1965-06-01
52314    2020-04-15
52315    1952-01-01
52316    1993-10-26
Name: release_date, Length: 52053, dtype: object

In [163]:
from datetime import datetime

list_date=[]
for i in range(len(date_series)) :
    val=date_series.iloc[i]
    char_length=len(val)

    if (char_length==4) :
        date = datetime.strptime(val, '%Y')
    elif (char_length==7) :
        date = datetime.strptime(val, '%Y-%m')
    elif (char_length==10) :
        date = datetime.strptime(val, '%Y-%m-%d')
    else :
        print("error for ",val)
    list_date.append(date.year)
print(len(list_date)) 
print(max(list_date)) 
print(min(list_date)) 


52053
2021
1920


In [164]:
zipped =zip(df['id'],list_date)
new_df=pd.DataFrame(zipped, columns=['id','year'])
new_df

Unnamed: 0,id,year
0,0B6BeEUd6UwFlbsHMQKjob,1971
1,5Gpx4lJy3vKmIvjwbiR5c8,2015
2,7MxuUYqrCIy93h1EEHrIrL,1968
3,4GeYbfIx1vSQXTfQb1m8Th,1926
4,2JPGGZwajjMk0vvhfC17RK,2008
...,...,...
52048,2GJxRwFe8oLcbXgTw9P5of,1977
52049,0EtAPdqg7TysBXKDbnzuSO,1965
52050,1s78GLrkZT7rTKAEu056M8,2020
52051,1LUBU2WI4z0dALUM16hoAH,1952


In [165]:
X_engineered = df.drop(columns='release_date',axis=1)
X_engineered = X_engineered.merge(new_df, how='left', on='id')
features_numerical = X_engineered.select_dtypes(include=['int64','float']).columns
X_engineered = X_engineered[features_numerical].drop(columns='popularity')

X_engineered.shape

(52053, 14)

📝 Check the impact of your new feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2  
☑️ Use your **X_engineered** for the training  
☑️ Save the mean score after cross-validation in **score_engineered** as a `float`  

In [166]:
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_engineered, y, test_size=0.5)
cv_results = cross_validate(
    model_simple,    
    X_train_simple, y_train,   
    scoring=['neg_root_mean_squared_error'],# negative RMSE
    cv=5,  
    verbose=1
)

score_engineered=round(cv_results['test_neg_root_mean_squared_error'].mean(),2)
score_engineered

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


-17.43

In [167]:
print(X_engineered.columns)
print(X_engineered.get("year"))
print(score_engineered)

Index(['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'valence', 'year'],
      dtype='object')
0        1971
1        2015
2        1968
3        1926
4        2008
         ... 
52048    1977
52049    1965
52050    2020
52051    1952
52052    1993
Name: year, Length: 52053, dtype: int64
-17.43


💾 **Run the following cell to save your results**

In [168]:
from nbresult import ChallengeResult

result = ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score=score_engineered
)
result.write()

## 4 - Unsupervised Learning (20 min)

*C10 - Entraîner un modèle d'apprentissage non supervisé pour détecter des structures sous-jacentes à partir de données non étiquetées*

🎯 **Create a new feature by performing a clustering of your existing features**

📝 Use a `KMeans` to assign each track to a cluster  
☑️ Your target number of clusters is 5  
☑️ Fit your `KMeans` on **X_simple**  
☑️ Store your fitted `KMeans` in **kmeans**  

In [169]:
# Fit K-means
kmeans = KMeans(n_clusters=5)
kmeans.fit(X_simple)
print(kmeans.cluster_centers_.shape)
kmeans.cluster_centers_

(5, 13)


array([[ 5.94073745e-01,  5.41335298e-01,  1.61381685e+05,
         4.34122027e-01,  6.68730871e-02,  1.93661798e-01,
         5.17246779e+00,  2.12261074e-01, -1.24003318e+01,
         7.30763755e-01,  1.28664834e-01,  1.16261400e+02,
         5.68328478e-01],
       [ 7.21560874e-01,  3.38623529e-01,  1.02381700e+06,
         3.23767961e-01,  2.24089636e-02,  4.85721476e-01,
         5.00840336e+00,  2.65350700e-01, -1.59640448e+01,
         7.03081232e-01,  1.17692717e-01,  1.06049263e+02,
         2.56196639e-01],
       [ 3.66982692e-01,  5.44474259e-01,  2.66276650e+05,
         5.56727296e-01,  8.42492742e-02,  1.51472250e-01,
         5.21256740e+00,  2.05334509e-01, -1.04980963e+01,
         6.86385317e-01,  7.89673217e-02,  1.18538884e+02,
         4.97320769e-01],
       [ 3.18975997e-01,  4.04722581e-01,  3.39373352e+06,
         6.14884523e-01,  1.61290323e-01,  4.28260316e-01,
         5.32258065e+00,  3.80058065e-01, -1.07172903e+01,
         8.06451613e-01,  1.35632258e

In [170]:
y_kmeans =kmeans.predict(X_simple)
clusters=pd.DataFrame(y_kmeans,columns=['clusters'])
clusters.max()
clusters.min()

clusters    0
dtype: int32

📝 Add your clusters as features to your **X_engineered**  
☑️ Use your **kmeans** to get cluster predictions on **X_simple**  
☑️ Store the resulting predictions in a new column of **X_engineered** called *clusters*  

In [171]:
X_engineered = pd.concat([X_engineered, clusters], axis=1)
X_engineered.shape
X_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year,clusters
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300,1971,2
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587,2015,4
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640,1968,2
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880,1926,0
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010,2008,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52048,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150,1977,0
52049,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140,1965,0
52050,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380,2020,2
52051,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100,1952,2


📝 Check the impact of your new *clusters* feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2 and 3  
☑️ Use your **X_engineered**, with both *year* and *clusters* for the training  
☑️ Save the mean score after cross-validation in **score_clusters** as a `float`  

In [172]:
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_engineered, y, test_size=0.5)
cv_results = cross_validate(
    model_simple,    
    X_train_simple, y_train,   
    scoring=['neg_root_mean_squared_error'],# negative RMSE
    cv=5,  
    verbose=1
)

score_clusters=round(cv_results['test_neg_root_mean_squared_error'].mean(),2)
score_clusters

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished


-17.29

💾 **Run the following cell to save your results**

In [173]:
from nbresult import ChallengeResult

result = ChallengeResult("unsupervised_learning",
    cols=X_engineered.columns.tolist(),
    clusters= kmeans.n_clusters,
    labels=X_engineered['clusters'].value_counts(normalize=True).values,
    score=score_clusters
)
result.write()

## 5 - Preprocressing (1 h)

*C6 - Transformer les données d'entrée afin de satisfaire les contraintes inhérentes au modèle (Preprocessing)*

🎯 **Construct a preprocessing pipeline for your data**

In [174]:
# This will help you visualize your pipelines
from sklearn import set_config; set_config(display='diagram')

In [175]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = df.drop(columns='popularity')
y = df['popularity']
print(X.shape); print(y.shape)

(52053, 17)
(52053,)


In [176]:
print(df.shape)
print(df['id'].nunique())
print(df['name'].nunique())
print(df['release_date'].nunique())
print(df['artist'].nunique())
print(X_engineered['year'].nunique())
print(X_engineered['clusters'].nunique())

(52053, 18)
52053
46641
7547
12577
102
5


📝 Look at your features with an `object` type in your **df**  
☑️ Check their number of unique values  

❓ Do you think it would be reasonable or efficient to one-hot encode any of them?  
☑️ Store you answer as a string (Yes or No) in **answer_ohe** below

In [177]:
#No on the original df but clusters on X_engineered are OK to be one encoded since only 5 values
answer_ohe = 'Yes'

### 5.1 - Year

📝 Create a custom transformer to extract the *year* from *release_date*  
☑️ Use a [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)  
☑️ Store your custom transformer in **transformer_year**

In [178]:
def ReleaseYearTransformer(date_series) : 
    
    list_date=[]
    for i in range(len(date_series)) :
        val=date_series.iloc[i]
        char_length=len(val)

        if (char_length==4) :
            date = datetime.strptime(val, '%Y')
        elif (char_length==7) :
            date = datetime.strptime(val, '%Y-%m')
        elif (char_length==10) :
            date = datetime.strptime(val, '%Y-%m-%d')
        else :
            print("error for ",val)
        list_date.append(date.year)
        
    list_date_series = pd.Series(list_date)
    date_np=list_date_series.to_numpy().reshape(-1,1)

    return date_np 

transformer_year = FunctionTransformer(ReleaseYearTransformer)


📝 Create a pipeline **pipeline_year** with two steps:
- your **transformer_year**  
- a scaler that ensures values between 0 and 1

### 5.2 - Clusters

📝 We provide you with a custom transformer to extract a cluster id for each observation  
☑️ The [`transform`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform) method of a `KMeans` return an array of shape (n_samples, n_clusters) with the distance to cluster for each pair obs-cluster  
☑️ We then simply use an `np.argmin` on the rows to get the index of the center the observation is closest to  
☑️ This effectively yields clusters for each observation

In [179]:
def process_clusters(clusters):
    return np.argmin(clusters, axis=1).reshape((-1, 1))

transformer_clusters = FunctionTransformer(process_clusters)

In [180]:
features_numerical = X.select_dtypes(include=['int64','float']).columns
features_numerical

Index(['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'valence'],
      dtype='object')

📝 Create a pipeline **pipeline_clusters** with three steps:
- a `KMeans` with a target number of clusters equals to 5  
- your custom transformer **transformer_clusters**  
- an encoder that creates a new binary column for each cluster - 1  

In [181]:
pipeline_year = Pipeline([
        ('transformer_year', transformer_year),
        ('MinMaxScaler', MinMaxScaler())])

In [182]:
pipeline_clusters = Pipeline([
        # First applying the transformation relative to the artist colum
        ('Kmeans', KMeans(n_clusters=5)),
        ('transformer_clusters', transformer_clusters),
        ('OHE',OneHotEncoder(drop='if_binary',handle_unknown='ignore')) ])

In [183]:
preprocessor = ColumnTransformer([
    ('pipeline_year', pipeline_year, 'release_date'),
    ('pipeline_clusters', pipeline_clusters, features_numerical)
],
    remainder='passthrough')
preprocessor

In [184]:
pd.DataFrame(preprocessor.fit_transform(X,y))

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.50495,0.0,0.0,1.0,0.0,0.0,0B6BeEUd6UwFlbsHMQKjob,Back in the Goodle Days,John Hartford
1,0.940594,0.0,1.0,0.0,0.0,0.0,5Gpx4lJy3vKmIvjwbiR5c8,Worlds Which Break Us - Intro Mix,Driftmoon
2,0.475248,0.0,0.0,1.0,0.0,0.0,7MxuUYqrCIy93h1EEHrIrL,I'm The Greatest Star,Barbra Streisand
3,0.059406,1.0,0.0,0.0,0.0,0.0,4GeYbfIx1vSQXTfQb1m8Th,Kapitel 281 - Der Page und die Herzogin,Georgette Heyer
4,0.871287,1.0,0.0,0.0,0.0,0.0,2JPGGZwajjMk0vvhfC17RK,Away from You,Gerry & The Pacemakers
...,...,...,...,...,...,...,...,...,...
52048,0.564356,1.0,0.0,0.0,0.0,0.0,2GJxRwFe8oLcbXgTw9P5of,"Incidental CB Dialogue - Bandit, Smokey & Snowman",Burt Reynolds
52049,0.445545,1.0,0.0,0.0,0.0,0.0,0EtAPdqg7TysBXKDbnzuSO,Samba De Verão,Walter Wanderley
52050,0.990099,0.0,0.0,1.0,0.0,0.0,1s78GLrkZT7rTKAEu056M8,Kekkonnouta,accel
52051,0.316832,0.0,0.0,1.0,0.0,0.0,1LUBU2WI4z0dALUM16hoAH,Die Meistersinger von Nürnberg - Act 1: Wohl M...,Richard Wagner


### 5.3 - Artist

📝 We provide you with a custom Transformer Class below  
☑️ Take some time to understand it  
☑️ It computes the average popularity of songs, per artist, on the train set only  
☑️ If the artist is unknown in the test set, the average popularity will be equal to the mean popularity on the train set  

In [185]:
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature the artist's popularity
    Do so by computing the mean popularity of all songs from the artist
    Notice that the popularity is computed on the train only to avoid leakage
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
    

📝 Create a **pipeline_artist** with two steps:  
- the custom `ArtistPopularityTransformer`  
- a scaler that ensures values between 0 and 1

In [186]:
pipeline_artist = Pipeline([
        ('transformer_artist', ArtistPopularityTransformer()),
        ('MinMaxScaler', MinMaxScaler())])

### 5.4 Preprocessing Pipeline

📝 Create a transformer that contains all your preprocessing steps using a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=column%20transformer#sklearn.compose.ColumnTransformer), which should:  
☑️ Apply your **pipeline_clusters** to all numeric features  
☑️ Scale all numeric features, so that their scaled values are within 0 and 1  
☑️ Apply your **pipeline_year** to the *release_date* field  
☑️ Apply your **pipeline_artist** to the *artist* field  
☑️ Drop all other fields  

In [187]:
preprocessor = ColumnTransformer([
    ('pipeline_year', pipeline_year, 'release_date'),
    ('pipeline_clusters', pipeline_clusters, features_numerical),
    ('pipeline_artist', pipeline_artist, ['artist','name'])],
    remainder='passthrough')
preprocessor

📝 Use your pipeline to `transform` your **X** and store the result in **X_transformed**

In [188]:
X_transformed=preprocessor.fit_transform(X,y)
pd.DataFrame(X_transformed)

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.50495,1.0,0.0,0.0,0.0,0.0,0.364583,0B6BeEUd6UwFlbsHMQKjob
1,0.940594,0.0,0.0,1.0,0.0,0.0,0.104167,5Gpx4lJy3vKmIvjwbiR5c8
2,0.475248,1.0,0.0,0.0,0.0,0.0,0.357639,7MxuUYqrCIy93h1EEHrIrL
3,0.059406,0.0,0.0,0.0,0.0,1.0,0.039278,4GeYbfIx1vSQXTfQb1m8Th
4,0.871287,0.0,0.0,0.0,0.0,1.0,0.138503,2JPGGZwajjMk0vvhfC17RK
...,...,...,...,...,...,...,...,...
52048,0.564356,0.0,0.0,0.0,0.0,1.0,0.260417,2GJxRwFe8oLcbXgTw9P5of
52049,0.445545,0.0,0.0,0.0,0.0,1.0,0.299479,0EtAPdqg7TysBXKDbnzuSO
52050,0.990099,1.0,0.0,0.0,0.0,0.0,0.0,1s78GLrkZT7rTKAEu056M8
52051,0.316832,1.0,0.0,0.0,0.0,0.0,0.006031,1LUBU2WI4z0dALUM16hoAH


💾 **Run the following cell to save your results**

In [189]:
print(answer_ohe)
print(X_transformed.shape)
print(X_transformed[0])

Yes
(52053, 8)
[0.5049504950495063 1.0 0.0 0.0 0.0 0.0 0.3645833333333333
 '0B6BeEUd6UwFlbsHMQKjob']


In [190]:
# Save your preproc
from nbresult import ChallengeResult

result = ChallengeResult(
    "preprocessing",
    answer=answer_ohe,
    shape=X_transformed.shape,
    first_observation = X_transformed[0]
)
result.write()

## 6 - Model Selection (40 min)

*C8 - Maîtriser les différents algorithmes d'apprentissage afin d'apporter une réponse adaptée à une problématique d'une organisation (entreprise, laboratoire, etc.)*  

🎯 **Select the model that yields the best performance for your task**

📝 Try model from two different families: linear and ensemble  
☑️ We expect you to cross-validate all scores with 5 folds in this section  

**If you did not manage to construct the full preprocessing:**  
☑️ Construct a light pipeline that use only features in **X_simple** and scale them to values between 0 and 1  

### 6.1 - Linear Models

📝 Construct a `Pipeline` that combines your **preproc**  and a linear estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_linear**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_linear**  

In [191]:
X=X.drop(columns=['id'])
X

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,release_date,speechiness,tempo,valence,artist
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,Back in the Goodle Days,1971,0.0454,149.460,0.4300,John Hartford
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,Worlds Which Break Us - Intro Mix,2015-02-02,0.0677,138.040,0.0587,Driftmoon
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,I'm The Greatest Star,1968-09-01,0.2140,75.869,0.4640,Barbra Streisand
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1926,0.9460,145.333,0.2880,Georgette Heyer
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,Away from You,2008-02-11,0.0265,139.272,0.8010,Gerry & The Pacemakers
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,"Incidental CB Dialogue - Bandit, Smokey & Snowman",1977-01-01,0.6620,85.615,0.3150,Burt Reynolds
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,Samba De Verão,1965-06-01,0.0426,158.366,0.6140,Walter Wanderley
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,Kekkonnouta,2020-04-15,0.0275,77.495,0.3380,accel
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,Die Meistersinger von Nürnberg - Act 1: Wohl M...,1952-01-01,0.0355,80.909,0.4100,Richard Wagner


In [192]:
pipe_linear = Pipeline([
        ('preprocessor', preprocessor),
        ('linear_reg', Lasso())
        ])

pipe_linear

In [193]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

cv_results = cross_validate(
    pipe_linear,    
    X, y,   
    scoring=['neg_root_mean_squared_error'],# negative RMSE
    cv=5,  
    verbose=1
)

score_linear=round(cv_results['test_neg_root_mean_squared_error'].mean(),2)
score_linear


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   12.5s finished


-14.14

### 6.2 - Ensemble Methods

📝 Construct a `Pipeline` that combines your **preproc**  and an ensemble estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_ensemble**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_ensemble**  

In [194]:
pipe_ensemble = Pipeline(
    [
        ('preprocessor', preprocessor),
        ('ensemble_model', RandomForestRegressor())
    ]
)
cv_results = cross_validate(
    pipe_ensemble,    
    X, y,   
    scoring=['neg_root_mean_squared_error'],# negative RMSE
    cv=5,  
    verbose=1)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   35.8s finished


In [195]:
score_ensemble=round(cv_results['test_neg_root_mean_squared_error'].mean(),2)
score_ensemble

-12.47

In [196]:
baseline=baseline_score
estimator_linear=pipe_linear._final_estimator
estimator_ensemble=pipe_ensemble._final_estimator
score_linear=score_linear
score_ensemble=score_ensemble
print(baseline)
print(estimator_linear)
print(estimator_ensemble)
print(score_linear)
print(score_ensemble)

25.815188365704188
Lasso()
RandomForestRegressor()
-14.14
-12.47


💾 **Run the following cell to save your results**

In [197]:
from nbresult import ChallengeResult

result = ChallengeResult("model_selection",
    baseline=baseline_score,
    estimator_linear=pipe_linear._final_estimator,
    estimator_ensemble=pipe_ensemble._final_estimator,
    score_linear=score_linear,
    score_ensemble=score_ensemble)
result.write()

## 7 - Fine-tuning (25 min)

*C11 - Améliorer les capacités prédictives d'un systèmes en sélectionnant un modèle différent ou en modifiant ses hyperparamètres en vue de corriger des erreurs (hyperparameter tuning)*

🎯 **Fine-tune your best model to achieve the highest possible score**

📝 Create a cross-validated grid search and assign it to **search**  
☑️ Choose the model that yielded the best result in section 6  
☑️ Create a **grid**, a `dict`, that stores the hyperparameters you want to search  
☑️ Limit yourself to 2 hyperparameters, with up to 3 possible values for each one  
☑️ Use only one scoring method, the one you stored in **scoring** in section 2  

In [198]:
pipe_ensemble = Pipeline(
    [
        ('preprocessor', preprocessor),
        ('ensemble_model', RandomForestRegressor())
    ]
)

param_distributions = {
    'ensemble_model__n_estimators': [50, 100, 500],
    'ensemble_model__max_depth': [5, 10, 50],
    'ensemble_model__max_features': ["auto", "sqrt", "log2"],
    'ensemble_model__min_samples_split' : [2,4,8],
    
}

search = RandomizedSearchCV(
    pipe_ensemble, 
    param_distributions=param_distributions,
    scoring='neg_root_mean_squared_error',
    cv=5, 
    verbose=1,
    n_jobs=-1
)                                  
              
search.fit(X, y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(


  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(
  warn(


📝 fit your **search**  on the full **X** and **y**  
☑️ Iterate until you notice an improvement on the best score  compared to the scores obtained in section 6  
💡 You won't be judged by the computing power of your machine, don't spend too much time waiting for your grid search to fit.  
⏳ For reference, a grid search cross validated with **3 folds** and **4 combinations of hyperparameters** should fit in under **10 minutes**.

In [199]:
search.best_params_

{'ensemble_model__n_estimators': 500,
 'ensemble_model__min_samples_split': 8,
 'ensemble_model__max_features': 'log2',
 'ensemble_model__max_depth': 10}

In [200]:
search.best_score_

-12.060249898890467

💾 **Run the following cell to save your results**

In [201]:
from nbresult import ChallengeResult

result = ChallengeResult("model_tuning",
    search_results=search.cv,
    score=search.best_score_)
result.write()

## 8 - Recommendations and Continuous Improvement (30 min)

*C13 - Adopter une démarche d'amélioration continue en identifiant les axes de perfectionnement d'un produit à l'aide d'une méthode adaptée de manière à améliorer la performance d'un produit*

🎯 **Transform your regression task into a classification task**

The product owner of your company tells you that he only needs to know whether a song is above or below popularity median  
The exact popularity value is of little interest to her as it won't bring any value to the feature under development.

📝 Create a new target **y_cat**, a `Series`, using the formula below  

☑️ $y\_cat_i = 1 \quad if\quad y_i \geq median(y),\quad 0\quad otherwise.$  

In [202]:
def threshold_function(x, median):
    if x < median:
        return 0
    else:
        return 1

median=y.median()
y_cat = y.apply(lambda x: threshold_function(x, median))

📝 Cross validate a classification  

☑️ Use your **preproc** with a basic linear model from `sklearn` suited for classification  
☑️ Assign the resulting pipeline to **pipe_cat**  
☑️ Cross validate the pipeline with 5 folds  
☑️ Use the Accuracy metric and store the mean of scores in **score_cat**  

In [203]:
pipe_cat = Pipeline(
    [
        ('preprocessor', preprocessor),
        ('SGDClassifier', SGDClassifier())
    ]
)

cv_results = cross_validate(
    pipe_cat,    
    X, y_cat,   
    scoring=['neg_root_mean_squared_error'],# negative RMSE
    cv=5,  
    verbose=1
)

cv_results
score_cat=round(cv_results['test_neg_root_mean_squared_error'].mean(),2)
score_cat

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   14.1s finished


-0.41

💾 **Run the following cell to save your results**

In [204]:
target_cat = y_cat.value_counts(normalize=True).values,
model = pipe_cat._final_estimator,
score_cat = score_cat
print(target_cat)
print(model)
print(score_cat)

(array([0.50417843, 0.49582157]),)
(SGDClassifier(),)
-0.41


In [205]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "recommendations",
    target_cat = y_cat.value_counts(normalize=True).values,
    model = pipe_cat._final_estimator,
    score_cat = score_cat
)
result.write()

## 9 - Deployment - API (40 min)

*C12 - Mettre en production le modèle d'apprentissage supervisé ou non supervisé obtenu sous la forme d'une API*

📝 This challenge takes place in another repository  
☑️ Follow the instructions provided on the certification platform