In [139]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import make_union
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

## Legend

🎯 Objective    
❓ Question  
📝 Task  
☑️ Instructions  
💡 Information  
💾 Submit your results  

**variable name**  
*field name*  
`python object`

# Supervised and Unsupervised Machine Learning

This exam contains **9** sections.  

For each one a time indication is given for information but you may choose your own pace.

To pass the exam you need to validate at least **5** sections.  

Some sections can be validated independently even though the exam uses the same data for all of them.  

## Description (10 min)

🎯 **Your objective is to create a model that predicts the popularity of a song based on its characteristics**

To achieve this, you are given a dataset containing a list of songs with the following characteristics:

**acousticness**: whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of a track  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

☑️ Only fine-tune your model when explicitly asked to do so, in section *7 - Fine-tuning*  

## 1 - Data Cleaning (15 min)

*C5 - Prepare the data for learning so that it is cleaned*

🎯 **Load and clean the data**

📝 Load the data in **df**, a `DataFrame`  
☑️ The data file is available at this url: https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv

In [6]:
df = pd.read_csv('https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv')
df.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand
3,0.429,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.0,11,0.394,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1,1926,0.946,145.333,0.288,Georgette Heyer
4,0.562,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,4e-06,2,0.127,-7.374,1,Away from You,15,2008-02-11,0.0265,139.272,0.801,Gerry & The Pacemakers


📝 Clean the data, making sure that neither duplicates nor missing values remain in **df**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52317 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52317 non-null  float64
 1   danceability      52317 non-null  float64
 2   duration_ms       52317 non-null  int64  
 3   energy            52317 non-null  float64
 4   explicit          52317 non-null  int64  
 5   id                52317 non-null  object 
 6   instrumentalness  52317 non-null  float64
 7   key               52317 non-null  int64  
 8   liveness          52317 non-null  float64
 9   loudness          52317 non-null  float64
 10  mode              52317 non-null  int64  
 11  name              52317 non-null  object 
 12  popularity        52317 non-null  int64  
 13  release_date      52317 non-null  object 
 14  speechiness       52317 non-null  float64
 15  tempo             52317 non-null  float64
 16  valence           52317 non-null  float6

In [8]:
df.duplicated().sum()

260

In [9]:
df = df.drop_duplicates()
df.duplicated().sum()

0

In [10]:
df.isnull().sum()

acousticness        0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
valence             0
artist              4
dtype: int64

In [11]:
df = df.dropna()
df.isnull().sum()

acousticness        0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
valence             0
artist              0
dtype: int64

In [12]:
df.shape

(52053, 18)
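
💡 The two cleaning calls used above can be tried on a tiny frame to see what each one removes. This is a minimal sketch on made-up rows; the `toy` frame and its values are hypothetical and are not part of the exam data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one missing artist
toy = pd.DataFrame({
    "name": ["a", "a", "b", "c"],
    "popularity": [10, 10, 20, 30],
    "artist": ["x", "x", np.nan, "z"],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
clean = toy.drop_duplicates().dropna()      # drop repeats, then rows containing NaN

print(n_duplicates, clean.shape)
```

In the notebook, the same two calls take **df** from 52317 rows to 52053: 260 duplicated rows and 4 rows with a missing *artist* are removed.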

💾 **Run the following cell to save your results**

In [21]:
from nbresult import ChallengeResult

result = ChallengeResult("data_cleaning", shape=df.shape)
result.write()

## 2 - Supervised Learning (40 min)

*C9 - Train a supervised learning model to optimize a prediction function from annotated examples*

🎯 **Identify your metrics, compute your baseline and evaluate a basic model**

📝 Choose an appropriate scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) from `sklearn` for this challenge, the chosen metric must:    

☑️ penalize large errors more strongly than small ones  
☑️ measure errors in the same unit as the target `popularity`  
☑️ the greater, the better (metric_good_model > metric_bad_model)  

📝 Store in **scoring** its exact name as `string`

In [13]:
scoring = "neg_root_mean_squared_error"
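
💡 To see why a root-mean-squared metric satisfies the first requirement, compare it with the mean absolute error on two sets of errors that have the same total size. This is a small sketch with made-up numbers; the helper names `rmse` and `mae` are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the unit of the target."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, in the unit of the target."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true = np.zeros(4)
spread_errors = np.array([5.0, 5.0, 5.0, 5.0])   # four medium errors
one_big_error = np.array([0.0, 0.0, 0.0, 20.0])  # same total error, concentrated

print(mae(y_true, spread_errors), mae(y_true, one_big_error))    # identical
print(rmse(y_true, spread_errors), rmse(y_true, one_big_error))  # the big error costs more
```

The `neg_` prefix in the `sklearn` scorer name flips the sign, so that a greater score is a better one.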

📝 Define your features and target   

☑️ Assign to **X_simple** a `DataFrame` containing only numerical features  
☑️ Assign to **y** a `Series` containing only your target: *popularity*  

In [14]:
X_simple = df.select_dtypes(include=['int64','float64']).drop(columns='popularity')
y = df['popularity']

In [15]:
X_simple

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100


📝 Compute your baseline and store it in **baseline_score**, as a `float`  
☑️ Do so by simulating a constant prediction equivalent to the mean value of your target  
☑️ Use the same scoring function as the one stored in **scoring**  
☑️ You may have to code the scoring function yourself to use it outside a `sklearn` workflow  

In [18]:
def nrmse(y_true, y_pred):
    return - np.sqrt(np.square(np.subtract(y_true, y_pred)).mean())

In [19]:
baseline_score = nrmse(y, np.full(y.shape, y.mean()))
baseline_score

-21.86400900273424
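
💡 A quick sanity check for this baseline: the RMSE of a constant prediction equal to the mean is the population standard deviation of the target. The sketch below verifies this on a hypothetical five-value target, reusing the `nrmse` definition from the cell above.

```python
import numpy as np
import pandas as pd

def nrmse(y_true, y_pred):
    return -np.sqrt(np.square(np.subtract(y_true, y_pred)).mean())

y_toy = pd.Series([0, 10, 20, 30, 40])  # hypothetical target values
baseline = nrmse(y_toy, np.full(y_toy.shape, y_toy.mean()))

# Population standard deviation (ddof=0), not pandas' default sample std (ddof=1)
print(baseline, -y_toy.std(ddof=0))
```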

### Holdout evaluation

📝 Split your data, holding out 50% of observations, randomly sampled, as test set  
☑️ Assign the result of your holdout to **X_train_simple**, **X_test_simple**, **y_train**, **y_test**

In [20]:
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)
X_train_simple.shape, X_test_simple.shape, y_train.shape, y_test.shape

((26026, 13), (26027, 13), (26026,), (26027,))

📝 Fit and evaluate the most basic linear model you can find in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module  
☑️ Use the metric you stored in **scoring**    
☑️ Store in **score_simple_holdout** your model score

In [21]:
model_simple = LinearRegression()
model_simple.fit(X_train_simple, y_train)
y_pred = model_simple.predict(X_test_simple)
score_simple_holdout = nrmse(y_test, y_pred)
score_simple_holdout

-18.397800203577937

### Cross-validation evaluation

📝 Cross-validate your basic model  
☑️ Use 5 folds for your cross-validation  
☑️ Store your mean score in **score_simple_cv_mean** as a `float`  
☑️ Store the standard deviation of your scores in **score_simple_cv_std** as a `float`

In [22]:
simple_cv_result = cross_val_score(model_simple, X_train_simple, y_train, scoring=scoring, cv=5)
score_simple_cv_mean = simple_cv_result.mean()
score_simple_cv_mean

-18.327940910513014

In [24]:
score_simple_cv_std = np.std(simple_cv_result)
score_simple_cv_std

0.21847858303251394

☑️ From now on, you will stop using your train-test split    
☑️ Instead we expect you to cross-validate (5 folds) your results with the whole dataset

💾 **Run the following cell to save your results**

In [25]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "supervised_learning",
    scoring=scoring,
    baseline_score=baseline_score,
    model=model_simple,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
)
result.write()

## 3 - Feature engineering (20 min)

*C7 - Generate input data to satisfy the constraints inherent to the models (Feature Engineering)*

🎯 **Create a new feature by extracting information from existing features**

Let's try to improve performance using the feature *release_date*

📝 Create a `DataFrame` **X_engineered** by adding a new column *year* to **X_simple**  
☑️ *year* must contain the release year of the track as `integer`

In [26]:
X_engineered = X_simple.copy()

In [27]:
X_engineered["year"] = pd.to_datetime(df["release_date"]).dt.year

In [28]:
X_engineered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52053 entries, 0 to 52316
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52053 non-null  float64
 1   danceability      52053 non-null  float64
 2   duration_ms       52053 non-null  int64  
 3   energy            52053 non-null  float64
 4   explicit          52053 non-null  int64  
 5   instrumentalness  52053 non-null  float64
 6   key               52053 non-null  int64  
 7   liveness          52053 non-null  float64
 8   loudness          52053 non-null  float64
 9   mode              52053 non-null  int64  
 10  speechiness       52053 non-null  float64
 11  tempo             52053 non-null  float64
 12  valence           52053 non-null  float64
 13  year              52053 non-null  int64  
dtypes: float64(9), int64(5)
memory usage: 6.0 MB


In [29]:
X_engineered.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
0,0.654,0.499,219827,0.19,0,0.00409,7,0.0898,-16.435,1,0.0454,149.46,0.43,1971
1,0.00592,0.439,483948,0.808,0,0.14,2,0.089,-8.497,1,0.0677,138.04,0.0587,2015
2,0.734,0.523,245693,0.288,0,0.0,0,0.0771,-11.506,1,0.214,75.869,0.464,1968
3,0.429,0.681,130026,0.165,0,0.0,11,0.394,-21.457,0,0.946,145.333,0.288,1926
4,0.562,0.543,129813,0.575,0,4e-06,2,0.127,-7.374,1,0.0265,139.272,0.801,2008
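
💡 The *release_date* field mixes year-only values (`1971`) with full dates (`2015-02-02`). Depending on the pandas version, `pd.to_datetime` may need extra arguments to parse such mixed formats. A version-independent alternative is sketched below, under the assumption that every value starts with a four-digit year (true for the rows shown, not checked here for the whole dataset).

```python
import pandas as pd

# Sample values copied from the first rows of the dataset
release_date = pd.Series(["1971", "2015-02-02", "1968-09-01", "1926"])

# Keep the first four characters and cast them to integers
year = release_date.str[:4].astype(int)

print(year.tolist())
```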


📝 Check the impact of your new feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2  
☑️ Use your **X_engineered** for the training  
☑️ Save the mean score after cross-validation in **score_engineered** as a `float`  

In [30]:
model_simple.fit(X_engineered, y)
score_engineered = cross_val_score(model_simple, X_engineered, y, scoring=scoring, cv=5).mean()
score_engineered

-17.301966769706553

💾 **Run the following cell to save your results**

In [89]:
from nbresult import ChallengeResult

result = ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score=score_engineered
)
result.write()

## 4 - Unsupervised Learning (20 min)

*C10 - Train an unsupervised learning model to detect underlying structures in unlabeled data*

🎯 **Create a new feature by performing a clustering of your existing features**

📝 Use a `KMeans` to assign each track to a cluster  
☑️ Your target number of clusters is 5  
☑️ Fit your `KMeans` on **X_simple**  
☑️ Store your fitted `KMeans` in **kmeans**  

In [31]:
kmeans = KMeans(n_clusters=5)
kmeans.fit(X_simple)

📝 Add your clusters as features to your **X_engineered**  
☑️ Use your **kmeans** to get cluster predictions on **X_simple**  
☑️ Store the resulting predictions in a new column of **X_engineered** called *clusters*  

In [32]:
X_engineered["clusters"] = kmeans.labels_

In [33]:
X_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year,clusters
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300,1971,3
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587,2015,1
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640,1968,3
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880,1926,0
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010,2008,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150,1977,0
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140,1965,0
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380,2020,3
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100,1952,3


📝 Check the impact of your new *clusters* feature on the performance of your model  
☑️ Retrain the same basic linear model you used in sections 2 and 3  
☑️ Use your **X_engineered**, with both *year* and *clusters* for the training  
☑️ Save the mean score after cross-validation in **score_clusters** as a `float`  

In [34]:
model_simple.fit(X_engineered, y)
score_clusters = cross_val_score(model_simple, X_engineered, y, scoring=scoring, cv=5).mean()
score_clusters

-17.225524008389584

💾 **Run the following cell to save your results**

In [35]:
from nbresult import ChallengeResult

result = ChallengeResult("unsupervised_learning",
    cols=X_engineered.columns.tolist(),
    clusters= kmeans.n_clusters,
    labels=X_engineered['clusters'].value_counts(normalize=True).values,
    score=score_clusters
)
result.write()

## 5 - Preprocessing (1 h)

*C6 - Transform input data to satisfy the constraints inherent to the model (Preprocessing)*

🎯 **Construct a preprocessing pipeline for your data**

In [61]:
# This will help you visualize your pipelines
from sklearn import set_config; set_config(display='diagram')

In [62]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = df.drop(columns='popularity')
y = df['popularity']

📝 Look at your features with an `object` type in your **df**  
☑️ Check their number of unique values  

❓ Do you think it would be reasonable or efficient to one-hot encode any of them?  
☑️ Store your answer as a string (Yes or No) in **answer_ohe** below

In [38]:
answer_ohe = 'No'

In [63]:
df.select_dtypes(include='object')

Unnamed: 0,id,name,release_date,artist
0,0B6BeEUd6UwFlbsHMQKjob,Back in the Goodle Days,1971,John Hartford
1,5Gpx4lJy3vKmIvjwbiR5c8,Worlds Which Break Us - Intro Mix,2015-02-02,Driftmoon
2,7MxuUYqrCIy93h1EEHrIrL,I'm The Greatest Star,1968-09-01,Barbra Streisand
3,4GeYbfIx1vSQXTfQb1m8Th,Kapitel 281 - Der Page und die Herzogin,1926,Georgette Heyer
4,2JPGGZwajjMk0vvhfC17RK,Away from You,2008-02-11,Gerry & The Pacemakers
...,...,...,...,...
52312,2GJxRwFe8oLcbXgTw9P5of,"Incidental CB Dialogue - Bandit, Smokey & Snowman",1977-01-01,Burt Reynolds
52313,0EtAPdqg7TysBXKDbnzuSO,Samba De Verão,1965-06-01,Walter Wanderley
52314,1s78GLrkZT7rTKAEu056M8,Kekkonnouta,2020-04-15,accel
52315,1LUBU2WI4z0dALUM16hoAH,Die Meistersinger von Nürnberg - Act 1: Wohl M...,1952-01-01,Richard Wagner


In [66]:
#too many unique values

df.id.nunique(), df.name.nunique(), df.release_date.nunique(), df.artist.nunique()

(52053, 46641, 7547, 12577)

### 5.1 - Year

📝 Create a custom transformer to extract the *year* from *release_date*  
☑️ Use a [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)  
☑️ Store your custom transformer in **transformer_year**

In [94]:
def process_year(date):
    year = pd.to_datetime(date).dt.year
    return year.to_numpy().reshape((-1, 1))

transformer_year = FunctionTransformer(process_year)
transformer_year

# np.datetime64('2005-02-25')

📝 Create a pipeline **pipeline_year** with two steps:
- your **transformer_year**  
- a scaler that ensures values between 0 and 1

In [95]:
pipeline_year = make_pipeline(
    transformer_year,
    MinMaxScaler()
)
pipeline_year

### 5.2 - Clusters

📝 We provide you with a custom transformer to extract a cluster id for each observation  
☑️ The [`transform`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform) method of a `KMeans` returns an array of shape (n_samples, n_clusters) containing the distance from each observation to each cluster center  
☑️ We then simply use an `np.argmin` on the rows to get the index of the center the observation is closest to  
☑️ This effectively yields clusters for each observation

In [67]:
def process_clusters(clusters):
    return np.argmin(clusters, axis=1).reshape((-1, 1))

transformer_clusters = FunctionTransformer(process_clusters)
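
💡 The argmin trick described above can be checked against `KMeans.predict`. This is a minimal sketch on random data; `X_toy`, the seed and `n_init=10` are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))        # hypothetical data: 200 observations, 3 features

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_toy)

distances = km.transform(X_toy)          # shape (n_samples, n_clusters)
closest = np.argmin(distances, axis=1)   # index of the nearest center for each row

print(distances.shape, np.array_equal(closest, km.predict(X_toy)))
```

Going through `transform` rather than `predict` is what lets the `KMeans` sit in the middle of a pipeline, since intermediate pipeline steps are called through `transform`.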

📝 Create a pipeline **pipeline_clusters** with three steps:
- a `KMeans` with a target number of clusters equal to 5  
- your custom transformer **transformer_clusters**  
- an encoder that creates a new binary column for each cluster - 1  

In [68]:
pipeline_clusters = make_pipeline(
    KMeans(n_clusters=5),
    transformer_clusters,
    OneHotEncoder(handle_unknown='ignore')
)
pipeline_clusters

### 5.3 - Artist

📝 We provide you with a custom Transformer Class below  
☑️ Take some time to understand it  
☑️ It computes the average popularity of songs, per artist, on the train set only  
☑️ If the artist is unknown in the test set, the average popularity will be equal to the mean popularity on the train set  

In [69]:
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature the artist's popularity
    Do so by computing the mean popularity of all songs from the artist
    Notice that the popularity is computed on the train only to avoid leakage
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
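
💡 The idea behind this transformer can be followed on a few rows. The sketch below is not the class itself: it reproduces the same logic with `map` and `fillna` instead of `merge` and `replace`, on hypothetical artists `A`, `B` and `C`.

```python
import pandas as pd

# Hypothetical train and test sets
train = pd.DataFrame({"artist": ["A", "A", "B"]})
y_train = pd.Series([10, 30, 60])
test = pd.DataFrame({"artist": ["A", "C"]})    # "C" never appears in the train set

# "fit": statistics computed on the train set only
artist_mean = y_train.groupby(train["artist"]).mean()   # A -> 20, B -> 60
global_mean = y_train.mean()                            # 100 / 3

# "transform": known artists get their mean, unknown ones the global mean
test_feature = test["artist"].map(artist_mean).fillna(global_mean)

print(test_feature.tolist())
```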

📝 Create a **pipeline_artist** with two steps:  
- the custom `ArtistPopularityTransformer`  
- a scaler that ensures values between 0 and 1

In [70]:
pipeline_artist = make_pipeline(
    ArtistPopularityTransformer(),
    MinMaxScaler()
)
pipeline_artist

### 5.4 Preprocessing Pipeline

📝 Create a transformer that contains all your preprocessing steps using a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=column%20transformer#sklearn.compose.ColumnTransformer), which should:  
☑️ Apply your **pipeline_clusters** to all numeric features  
☑️ Scale all numeric features, so that their scaled values are within 0 and 1  
☑️ Apply your **pipeline_year** to the *release_date* field  
☑️ Apply your **pipeline_artist** to the *artist* field  
☑️ Drop all other fields  

In [96]:
# cluster_col = make_column_selector(dtype_exclude="object")
num_col = make_column_selector(dtype_include=["int64", "float64"])

preproc = make_column_transformer(
    (pipeline_clusters, num_col),
    (MinMaxScaler(), num_col),
    (pipeline_year, "release_date"),
    (pipeline_artist, ["artist"]),
    remainder='drop'
)
preproc

📝 Use your pipeline to `transform` your **X** and store the result in **X_transformed**

In [97]:
X_transformed = preproc.fit_transform(X, y)
pd.DataFrame(X_transformed)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,1.0,0.0,0.0,0.0,0.656627,0.506085,0.044604,0.190,0.0,0.004090,0.636364,0.089890,0.683437,1.0,0.046804,0.613781,0.4300,0.504950,0.364583
1,0.0,0.0,0.0,0.0,1.0,0.005944,0.445233,0.099696,0.808,0.0,0.140000,0.181818,0.089089,0.807966,1.0,0.069794,0.566883,0.0587,0.940594,0.104167
2,0.0,1.0,0.0,0.0,0.0,0.736948,0.530426,0.049999,0.288,0.0,0.000000,0.000000,0.077177,0.760762,1.0,0.220619,0.311568,0.4640,0.475248,0.357639
3,1.0,0.0,0.0,0.0,0.0,0.430723,0.690669,0.025872,0.165,0.0,0.000000,1.000000,0.394394,0.604653,0.0,0.975258,0.596833,0.2880,0.059406,0.039278
4,1.0,0.0,0.0,0.0,0.0,0.564257,0.550710,0.025828,0.575,0.0,0.000004,0.181818,0.127127,0.825584,1.0,0.027320,0.571942,0.8010,0.871287,0.138503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52048,1.0,0.0,0.0,0.0,0.0,0.164659,0.519270,0.010484,0.907,0.0,0.004870,0.545455,0.801802,0.818838,1.0,0.682474,0.351592,0.3150,0.564356,0.260417
52049,1.0,0.0,0.0,0.0,0.0,0.776104,0.540568,0.038974,0.659,0.0,0.773000,0.181818,0.113113,0.798240,0.0,0.043918,0.650355,0.6140,0.445545,0.299479
52050,0.0,1.0,0.0,0.0,0.0,0.457831,0.555781,0.063588,0.568,0.0,0.000000,0.545455,0.089289,0.857367,1.0,0.028351,0.318245,0.3380,0.990099,0.000000
52051,0.0,1.0,0.0,0.0,0.0,0.968876,0.365112,0.043908,0.132,0.0,0.000000,0.909091,0.126126,0.611603,1.0,0.036598,0.332266,0.4100,0.316832,0.006031


💾 **Run the following cell to save your results**

In [98]:
# Save your preproc
from nbresult import ChallengeResult

result = ChallengeResult(
    "preprocessing",
    answer=answer_ohe,
    shape=X_transformed.shape,
    first_observation = X_transformed[0]
)
result.write()

## 6 - Model Selection (40 min)

*C8 - Master the different learning algorithms in order to provide a suitable answer to a problem faced by an organization (company, laboratory, etc.)*  

🎯 **Select the model that yields the best performance for your task**

📝 Try models from two different families: linear and ensemble  
☑️ We expect you to cross-validate all scores with 5 folds in this section  

**If you did not manage to construct the full preprocessing:**  
☑️ Construct a light pipeline that uses only the features in **X_simple** and scales them to values between 0 and 1  

### 6.1 - Linear Models

📝 Construct a `Pipeline` that combines your **preproc**  and a linear estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_linear**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_linear**  

In [99]:
pipe_linear = make_pipeline(
    MinMaxScaler(),
    LinearRegression()
)
pipe_linear

In [100]:
score_linear = cross_val_score(pipe_linear, X_transformed, y, scoring=scoring, cv=5).mean()
score_linear

-8.855849890946555
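
💡 The score above is cross-validated on **X_transformed**, which was produced by fitting **preproc** on the whole dataset. Because the artist feature is computed from the target, each validation fold has already contributed to its own features, so the score may be optimistic. The sketch below illustrates the effect on synthetic noise; `GroupMeanEncoder` is a hypothetical stand-in for a target-based feature, and all data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


class GroupMeanEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical encoder: replaces a group label by the mean target of that group."""

    def fit(self, X, y):
        y = pd.Series(np.asarray(y), index=X.index)
        self.means_ = y.groupby(X["group"]).mean()
        self.global_mean_ = y.mean()
        return self

    def transform(self, X):
        # Unknown groups fall back to the global mean seen during fit
        return X["group"].map(self.means_).fillna(self.global_mean_).to_frame()


rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"group": rng.integers(0, 150, size=300).astype(str)})
y_toy = pd.Series(rng.normal(size=300))  # pure noise: there is nothing to learn

scoring = "neg_root_mean_squared_error"

# Leaky: the encoder sees every target value before cross-validation starts
X_leaky = GroupMeanEncoder().fit(X_toy, y_toy).transform(X_toy)
score_leaky = cross_val_score(LinearRegression(), X_leaky, y_toy, scoring=scoring, cv=5).mean()

# Leak-free: the encoder is refit on the training part of each fold
pipe = make_pipeline(GroupMeanEncoder(), LinearRegression())
score_clean = cross_val_score(pipe, X_toy, y_toy, scoring=scoring, cv=5).mean()

print(score_leaky, score_clean)
```

Applied to this notebook, the leak-free variant would correspond to `make_pipeline(preproc, LinearRegression())` cross-validated on **X** rather than on **X_transformed**, which is also what the instructions of this section describe.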

### 6.2 - Ensemble Methods

📝 Construct a `Pipeline` that combines your **preproc**  and an ensemble estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_ensemble**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_ensemble**  

In [101]:
pipe_ensemble = make_pipeline(
    MinMaxScaler(),
    RandomForestRegressor()
)
pipe_ensemble

In [102]:
score_ensemble = cross_val_score(pipe_ensemble, X_transformed, y, scoring=scoring, cv=5).mean()
score_ensemble

-8.028154099794657

💾 **Run the following cell to save your results**

In [103]:
from nbresult import ChallengeResult

result = ChallengeResult("model_selection",
    baseline=baseline_score,
    estimator_linear=pipe_linear._final_estimator,
    estimator_ensemble=pipe_ensemble._final_estimator,
    score_linear=score_linear,
    score_ensemble=score_ensemble)
result.write()

## 7 - Fine-tuning (25 min)

*C11 - Improve the predictive capabilities of a system by selecting a different model or by modifying its hyperparameters in order to correct errors (hyperparameter tuning)*

🎯 **Fine-tune your best model to achieve the highest possible score**

📝 Create a cross-validated grid search and assign it to **search**  
☑️ Choose the model that yielded the best result in section 6  
☑️ Create a **grid**, a `dict`, that stores the hyperparameters you want to search  
☑️ Limit yourself to 2 hyperparameters, with up to 3 possible values for each one  
☑️ Use only one scoring method, the one you stored in **scoring** in section 2  

In [105]:
pipe_ensemble.get_params()

{'memory': None,
 'steps': [('minmaxscaler', MinMaxScaler()),
  ('randomforestregressor', RandomForestRegressor())],
 'verbose': False,
 'minmaxscaler': MinMaxScaler(),
 'randomforestregressor': RandomForestRegressor(),
 'minmaxscaler__clip': False,
 'minmaxscaler__copy': True,
 'minmaxscaler__feature_range': (0, 1),
 'randomforestregressor__bootstrap': True,
 'randomforestregressor__ccp_alpha': 0.0,
 'randomforestregressor__criterion': 'squared_error',
 'randomforestregressor__max_depth': None,
 'randomforestregressor__max_features': 1.0,
 'randomforestregressor__max_leaf_nodes': None,
 'randomforestregressor__max_samples': None,
 'randomforestregressor__min_impurity_decrease': 0.0,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__min_samples_split': 2,
 'randomforestregressor__min_weight_fraction_leaf': 0.0,
 'randomforestregressor__n_estimators': 100,
 'randomforestregressor__n_jobs': None,
 'randomforestregressor__oob_score': False,
 'randomforestregressor__r

In [None]:
RandomForestRegressor(
    n_estimators=100,
    max_depth=None,
    max_features=1.0,
)

In [115]:
grid={'randomforestregressor__n_estimators': [100, 400, 500],
      'randomforestregressor__max_features': [2, 4, 6],
      'randomforestregressor__max_depth': [10, 20, 30]
     }
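
💡 Before launching a grid search, it helps to count how many models it will fit. The sketch below counts the candidates of the grid defined above with `ParameterGrid`; the factor 5 assumes a 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "randomforestregressor__n_estimators": [100, 400, 500],
    "randomforestregressor__max_features": [2, 4, 6],
    "randomforestregressor__max_depth": [10, 20, 30],
}

n_candidates = len(ParameterGrid(grid))  # every combination of the listed values
n_fits = n_candidates * 5                # one fit per candidate and per fold (cv=5)

print(n_candidates, n_fits)
```

With the limit suggested in the instructions (2 hyperparameters, up to 3 values each), the grid would hold at most 9 candidates instead of 27.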

In [116]:
search = GridSearchCV(pipe_ensemble, param_grid=grid, cv=5, scoring=scoring, n_jobs=-1)
search.fit(X_transformed, y)



In [122]:
search.best_params_

{'randomforestregressor__max_depth': 20,
 'randomforestregressor__max_features': 6,
 'randomforestregressor__n_estimators': 500}

📝 Fit your **search** on the full **X** and **y**  
☑️ Iterate until you notice an improvement on the best score  compared to the scores obtained in section 6  
💡 You won't be judged by the computing power of your machine, don't spend too much time waiting for your grid search to fit.  
⏳ For reference, a grid search cross validated with **3 folds** and **4 combinations of hyperparameters** should fit in under **10 minutes**.

In [120]:
score = search.best_score_
score

-8.02598679618352

💾 **Run the following cell to save your results**

In [123]:
from nbresult import ChallengeResult

result = ChallengeResult("model_tuning",
    search_results=search.cv,
    score=search.best_score_)
result.write()

## 8 - Recommendations and Continuous Improvement (30 min)

*C13 - Adopt a continuous improvement approach by identifying a product's areas for improvement with a suitable method, so as to improve the product's performance*

🎯 **Transform your regression task into a classification task**

The product owner of your company tells you that she only needs to know whether a song is above or below the median popularity.  
The exact popularity value is of little interest to her, as it won't bring any value to the feature under development.

📝 Create a new target **y_cat**, a `Series`, using the formula below  

☑️ $y\_cat_i = 1 \quad if\quad y_i \geq median(y),\quad 0\quad otherwise.$  

In [138]:
y_cat = (df['popularity'] >= df['popularity'].median()).astype(int)
y_cat

0        1
1        0
2        1
3        0
4        0
        ..
52312    0
52313    1
52314    0
52315    0
52316    1
Name: popularity, Length: 52053, dtype: int64
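
💡 Because the formula uses $\geq$, every track whose popularity equals the median lands in class 1, so the two classes are not necessarily balanced. This is a small sketch on hypothetical values:

```python
import pandas as pd

y_toy = pd.Series([1, 5, 5, 5, 9])  # hypothetical popularity values, median = 5
y_cat_toy = (y_toy >= y_toy.median()).astype(int)

# Share of positives: ties at the median push it above one half
print(y_cat_toy.tolist(), y_cat_toy.mean())
```

Checking `y_cat.value_counts(normalize=True)` on the real target gives the reference accuracy of a classifier that always predicts the majority class.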

📝 Cross-validate a classification model  

☑️ Use your **preproc** with a basic linear model from `sklearn` suited for classification  
☑️ Assign the resulting pipeline to **pipe_cat**  
☑️ Cross validate the pipeline with 5 folds  
☑️ Use the Accuracy metric and store the mean of scores in **score_cat**  

In [142]:
pipe_cat = make_pipeline(preproc, LogisticRegression(max_iter=1000))
pipe_cat

In [143]:
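# NOTE: the task asks for accuracy (scoring="accuracy"); `scoring` still holds
# "neg_root_mean_squared_error" from section 2, hence the negative value below
score_cat = cross_val_score(pipe_cat, X, y_cat, scoring=scoring, cv=5).mean()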
score_cat

-0.4111210327815666
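
💡 An accuracy lies between 0 and 1, so a negative value indicates that a different scorer was used: here the `scoring` variable from section 2, which holds `"neg_root_mean_squared_error"`. For 0/1 labels the two metrics are linked, because each squared error is either 0 or 1, which gives RMSE = sqrt(1 - accuracy) on any given set of predictions. The sketch below checks this relation on made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # two mistakes out of eight

accuracy = accuracy_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(accuracy, rmse, np.sqrt(1 - accuracy))
```

Applied to the mean score above, this relation suggests an accuracy of roughly 0.83. This is only an approximation, since the scores are averaged over folds. Passing `scoring="accuracy"` to `cross_val_score` would report the requested metric directly.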

💾 **Run the following cell to save your results**

In [145]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "recommendations",
    target_cat = y_cat.value_counts(normalize=True).values,
    model = pipe_cat._final_estimator,
    score_cat = score_cat
)
result.write()

## 9 - Deployment - API (40 min)

*C12 - Deploy the resulting supervised or unsupervised learning model to production as an API*

📝 This challenge takes place in another repository  
☑️ Follow the instructions provided on the certification platform