## Legend

🎯 Objective    
❓ Question  
📝 Task  
☑️ Instructions  
💡 Informations  
💾 Submit your results  

**variable name**  
*field name*  
`python object`

# Apprentissage automatique (Machine Learning) supervisé et non supervisé

This exam contains **9** sections.  

For each one a time indication is given for information but you may choose your own pace.

To pass the exam you need to validate at least **5** sections.  

Some sections can be validated independently even if the exam use the same data for all of them.  

## Description (10 min)

🎯 **Your objective is to create a model that predicts the popularity of a song based on its characteristics**

To achieve this, you are given a dataset containing a list of songs with the following characteristics:

**acousticness**: whether the track is acoustic  

**danceability**: describes how suitable a track is for dancing  

**duration_ms**: duration of the track in milliseconds  

**energy**: represents a perceptual measure of intensity and activity  

**explicit**: whether the track has explicit lyrics  

**id**: id for the track  

**instrumentalness**: predicts whether a track contains no vocals  

**key**: the key the track is in  

**liveness**: detects the presence of an audience in the recording  

**loudness**: the overall loudness of a track in decibels  

**mode**: modality of a track  

**name**: name of the track  

**popularity**: popularity of the track  

**release_date**: release date  

**speechiness**: detects the presence of spoken words in a track  

**tempo**: overall estimated tempo of a track in beats per minute  

**valence**: describes the musical positiveness conveyed by a track  

**artist**: artist who performed the track

☑️ Only fine-tune your model when explicitly asked to do so, in section *7 - Fine-tuning*  

## 1 - Data Cleaning (15 min)

In [488]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

*C5 - Préparer les données en vue de l'apprentissage afin que celles-ci soient nettoyées*

🎯 **Load and clean the data**

📝 Load the data in **df**, a `DataFrame`  
☑️ The data file is available at this url: https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv

In [248]:
df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv")

In [249]:
df.shape

(52317, 18)

📝 Clean the data, make sure that no duplicates nor missing values remain in **df**

Duplicates

In [250]:
duplicate_count = len(df)-len(df.drop_duplicates()) # Original data lenght minus data length without duplicates

duplicate_count

260

In [251]:
df.drop_duplicates(inplace=True) # Drop duplicates in place

In [252]:
df.shape

(52057, 18)

Missing Values

In [253]:
df.isnull().sum().sort_values(ascending=False)/len(df) #NaN percentage for each column

artist              0.000077
danceability        0.000000
valence             0.000000
tempo               0.000000
speechiness         0.000000
release_date        0.000000
popularity          0.000000
name                0.000000
mode                0.000000
acousticness        0.000000
liveness            0.000000
key                 0.000000
instrumentalness    0.000000
id                  0.000000
explicit            0.000000
energy              0.000000
duration_ms         0.000000
loudness            0.000000
dtype: float64

In [254]:
df.dropna(inplace = True)

In [255]:
df.shape

(52053, 18)

💾 **Run the following cell to save your results**

In [256]:
from nbresult import ChallengeResult

result = ChallengeResult("data_cleaning", shape=df.shape)
result.write()

## 2 - Supervised Learning (40 min)

*C9 - Entraîner un modèle d'apprentissage supervisé pour optimiser une fonction de prédiction à partir d'exemples annotées*

🎯 **Identify your metrics, compute your baseline and evaluate a basic model**

📝 Choose an appropriate scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) from `sklearn` for this challenge, the chosen metric must:    

☑️ strongly penalize largest errors relatively to smaller ones  
☑️ measure errors in the same unit as the target `popularity`  
☑️ the greater, the better (metric_good_model > metric_bad_model)  

📝 Store in **scoring** its exact name as `string`

In [257]:
scoring = "neg_root_mean_squared_error"

📝 Define your features and target   

☑️ Assign to **X_simple** a `DataFrame` containing only numerical features  
☑️ Assign to **y** a `Series` containing only your target: *popularity*  

In [258]:
df.dtypes

acousticness        float64
danceability        float64
duration_ms           int64
energy              float64
explicit              int64
id                   object
instrumentalness    float64
key                   int64
liveness            float64
loudness            float64
mode                  int64
name                 object
popularity            int64
release_date         object
speechiness         float64
tempo               float64
valence             float64
artist               object
dtype: object

In [259]:
X_simple = df.select_dtypes(include=['float64','int64']).drop(columns = 'popularity')
y = df['popularity']

In [260]:
print(X_simple.shape, y.shape)

(52053, 13) (52053,)


📝 Compute your baseline and store it in **baseline_score**, as a `float`  
☑️ Do so by simulating a constant prediction equivalent to the mean value of your target  
☑️ Use the same scoring function as the one stored in **scoring**  
☑️ You may have to code the scoring function yourself to use it outside a `sklearn` workflow  

In [261]:
mean_popularity = y.mean()

rmse = mean_squared_error(y,np.full(len(y),mean_popularity),squared=False)

baseline_score = -rmse

In [262]:
baseline_score

-21.86400900273424

### Holdout evaluation

📝 Split your data, holding out 50% of observations, randomly sampled, as test set  
☑️  Assign the result of your holdout to **X_train_simple** **y_train**, **X_test_simple**, **y_test**

In [263]:
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)

In [264]:
print(X_train_simple.shape, X_test_simple.shape, y_train.shape, y_test.shape)

(26026, 13) (26027, 13) (26026,) (26027,)


📝 Fit and evaluate the most basic linear model you can find in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module  
☑️ Use the metric you stored in **scoring**    
☑️ Store in **score_simple_holdout** your model score

In [265]:
model_simple = LinearRegression()

model_simple.fit(X_train_simple,y_train)

y_pred = model_simple.predict(X_test_simple)

negrmse = -mean_squared_error(y_test,y_pred,squared=False)

score_simple_holdout = negrmse

score_simple_holdout

-18.39557025989475

### Cross-validation evaluation

📝 Cross-validate your basic model  
☑️ Use 5 folds for your cross-validation  
☑️ Store your mean score in **score_simple_cv_mean** as a `float`  
☑️ Store the standard deviation of your scores in **score_simple_cv_std** as a `float`

In [266]:
cv_results = cross_val_score(LinearRegression(), \
                                       X_train_simple, y_train, cv=5,\
                                       scoring = 'neg_root_mean_squared_error')

score_simple_cv_mean = cv_results.mean()
score_simple_cv_std = cv_results.std()

In [267]:
cv_results

array([-18.18495221, -18.44226419, -18.44946268, -18.40720878,
       -18.16786536])

In [268]:
score_simple_cv_mean

-18.330350642764824

In [269]:
score_simple_cv_std

0.12661884894055922

☑️ From now on, you will stop using your train-test split    
☑️ Instead we expect you to cross-validate (5 folds) your results with the whole dataset

💾 **Run the following cell to save your results**

In [270]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "supervised_learning",
    scoring=scoring,
    baseline_score=baseline_score,
    model=model_simple,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
)
result.write()

## 3 - Feature engineering (20 min)

*C7 - Générer des données d'entrée afin de satisfaire les contraintes inhérentes au modèles (Feature Engineering)*

🎯 **Create a new feature by extracting information from existing features**

Let's try to improve performance using the feature *release_date*

📝 Create a `DataFrame` **X_engineered** by adding a new column *year* to **X_simple**  
☑️ *year* must contain the release year of the track as `integer`

In [271]:
df_copy = df.copy()

In [272]:
df_copy['release_date'] = pd.to_datetime(df_copy['release_date'])

In [273]:
df_copy['year'] = df_copy['release_date'].dt.year

In [274]:
X_engineered = df_copy.select_dtypes(include=['float64','int64']).drop(columns = 'popularity')

In [275]:
X_engineered.dtypes

acousticness        float64
danceability        float64
duration_ms           int64
energy              float64
explicit              int64
instrumentalness    float64
key                   int64
liveness            float64
loudness            float64
mode                  int64
speechiness         float64
tempo               float64
valence             float64
year                  int64
dtype: object

📝 Check the impact of your new feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2  
☑️ Use your **X_engineered** for the training  
☑️ Save the mean score after cross-validation in **score_engineered** as a `float`  

In [276]:
cv_results_eng = cross_val_score(LinearRegression(), \
                                       X_engineered, y, cv=5,\
                                       scoring = 'neg_root_mean_squared_error')


In [277]:
score_engineered = cv_results_eng.mean()

In [278]:
score_engineered

-17.301966769706553

💾 **Run the following cell to save your results**

In [279]:
from nbresult import ChallengeResult

result = ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score=score_engineered
)
result.write()

## 4 - Unsupervised Learning (20 min)

*C10 - Entraîner un modèle d'apprentissage non supervisé pour détecter des structures sous-jacentes à partir de données non étiquetées*

🎯 **Create a new feature by performing a clustering of your existing features**

📝 Use a `KMeans` to assign each track to a cluster  
☑️ Your target number of clusters is 5  
☑️ Fit your `KMeans` on **X_simple**  
☑️ Store your fitted `KMeans` in **kmeans**  

In [280]:
kmeans = KMeans(n_clusters=5).fit(X_simple)

In [281]:
kmeans

In [282]:
kmeans.labels_

array([0, 2, 0, ..., 0, 0, 0], dtype=int32)

📝 Add your clusters as features to your **X_engineered**  
☑️ Use your **kmeans** to get cluster predictions on **X_simple**  
☑️ Store the resulting predictions in a new column of **X_engineered** called *clusters*  

In [283]:
X_engineered['clusters'] = kmeans.labels_

In [284]:
X_engineered.head(2)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year,clusters
0,0.654,0.499,219827,0.19,0,0.00409,7,0.0898,-16.435,1,0.0454,149.46,0.43,1971,0
1,0.00592,0.439,483948,0.808,0,0.14,2,0.089,-8.497,1,0.0677,138.04,0.0587,2015,2


📝 Check the impact of your new *clusters* feature on the performance of your model  
☑️ Retrain the same basic linear model you used in section 2 and 3  
☑️ Use your **X_engineered**, with both *year* and *clusters* for the training  
☑️ Save the mean score after cross-validation in **score_clusters** as a `float`  

In [285]:
cv_results_eng_clusters = cross_val_score(LinearRegression(), \
                                       X_engineered, y, cv=5,\
                                       scoring = 'neg_root_mean_squared_error')


In [286]:
score_clusters = cv_results_eng_clusters.mean()

In [287]:
score_clusters

-17.231566851984724

💾 **Run the following cell to save your results**

In [288]:
from nbresult import ChallengeResult

result = ChallengeResult("unsupervised_learning",
    cols=X_engineered.columns.tolist(),
    clusters= kmeans.n_clusters,
    labels=X_engineered['clusters'].value_counts(normalize=True).values,
    score=score_clusters
)
result.write()

## 5 - Preprocressing (1 h)

*C6 - Transformer les données d'entrée afin de satisfaire les contraintes inhérentes au modèle (Preprocessing)*

🎯 **Construct a preprocessing pipeline for your data**

In [289]:
# This will help you visualize your pipelines
from sklearn import set_config; set_config(display='diagram')

In [290]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = df.drop(columns = 'popularity')
y = df['popularity']

📝 Look at your features with an `object` type in your **df**  
☑️ Check their number of unique values  

❓ Do you think it would be reasonable or efficient to one-hot encode any of them?  
☑️ Store you answer as a string (Yes or No) in **answer_ohe** below

In [291]:
df.select_dtypes(include=['object']).nunique()

id              52053
name            46641
release_date     7547
artist          12577
dtype: int64

In [292]:
answer_ohe = 'No'

### 5.1 - Year

📝 Create a custom transformer to extract the *year* from *release_date*  
☑️ Use a [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)  
☑️ Store your custom transformer in **transformer_year**

In [379]:
def year_function(df):
    
    return pd.DataFrame(pd.to_datetime(df['release_date']).dt.year)

In [380]:
transformer_year = FunctionTransformer(year_function)

📝 Create a pipeline **pipeline_year** with two steps:
- your **transformer_year**  
- a scaler that ensures values between 0 and 1

In [381]:
pipeline_year = Pipeline([
    ('transformer_year', transformer_year),
    ('min_max_scaler',MinMaxScaler())
])

In [382]:
pipeline_year

### 5.2 - Clusters

📝 We provide you with a custom transformer to extract a cluster id for each observation  
☑️ The [`transform`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform) method of a `KMeans` return an array of shape (n_samples, n_clusters) with the distance to cluster for each pair obs-cluster  
☑️ We then simply use an `np.argmin` on the rows to get the index of the center the observation is closest to  
☑️ This effectively yields clusters for each observation

In [383]:
def process_clusters(clusters):
    return np.argmin(clusters, axis=1).reshape((-1, 1))

transformer_clusters = FunctionTransformer(process_clusters)

📝 Create a pipeline **pipeline_clusters** with three steps:
- a `KMeans` with a target number of clusters equals to 5  
- your custom transformer **transformer_clusters**  
- an encoder that creates a new binary column for each cluster - 1  

In [384]:
pipeline_clusters = Pipeline([
    ('Kmeans', KMeans(n_clusters = 5)),
    ('transformer_clusters',transformer_clusters),
    ('ohe', OneHotEncoder(sparse = False, handle_unknown = 'ignore', drop = 'first'))
])

In [385]:
pipeline_clusters

### 5.3 - Artist

📝 We provide you with a custom Transformer Class below  
☑️ Take some time to understand it  
☑️ It computes the average popularity of songs, per artist, on the train set only  
☑️ If the artist is unknown in the test set, the average popularity will be equal to the mean popularity on the train set  

In [386]:
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature the artist's popularity
    Do so by computing the mean popularity of all songs from the artist
    Notice that the popularity is computed on the train only to avoid leakage
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]

📝 Create a **pipeline_artist** with two steps:  
- the custom `ArtistPopularityTransformer`  
- a scaler that ensures values between 0 and 1

In [387]:
pipeline_artist = Pipeline([
    ('artist_popularity', ArtistPopularityTransformer()),
    ('MMscaler', MinMaxScaler())
])

In [388]:
pipeline_artist

### 5.4 Preprocessing Pipeline

📝 Create a transformer that contains all your preprocessing steps using a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=column%20transformer#sklearn.compose.ColumnTransformer), which should:  
☑️ Apply your **pipeline_clusters** to all numeric features  
☑️ Scale all numeric features, so that their scaled values are within 0 and 1  
☑️ Apply your **pipeline_year** to the *release_date* field  
☑️ Apply your **pipeline_artist** to the *artist* field  
☑️ Drop all other fields  

In [389]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

num_col = list(X.select_dtypes(include=numerics).columns)

num_col

['acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'explicit',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'mode',
 'speechiness',
 'tempo',
 'valence']

In [390]:
X.columns

Index(['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
       'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'name',
       'release_date', 'speechiness', 'tempo', 'valence', 'artist'],
      dtype='object')

In [391]:
preprocessor = ColumnTransformer([
    ('clusters_num_col', pipeline_clusters, num_col),
    ('MMscaler_num_col', MinMaxScaler(), num_col),
    ('year', pipeline_year, ['release_date']),
    ('artist', pipeline_artist, ['artist'])
],
    remainder='drop')

preprocessor

📝 Use your pipeline to `transform` your **X** and store the result in **X_transformed**

In [392]:
X_transformed = pd.DataFrame(preprocessor.fit_transform(X,y))

In [398]:
X_transformed.shape

(52053, 19)

💾 **Run the following cell to save your results**

In [399]:
# Save your preproc
from nbresult import ChallengeResult

result = ChallengeResult(
    "preprocessing",
    answer=answer_ohe,
    shape=X_transformed.shape,
    first_observation = X_transformed[0]
)
result.write()

## 6 - Model Selection (40 min)

*C8 - Maîtriser les différents algorithmes d'apprentissage afin d'apporter une réponse adaptée à une problématique d'une organisation (entreprise, laboratoire, etc.)*  

🎯 **Select the model that yields the best performance for your task**

📝 Try model from two different families: linear and ensemble  
☑️ We expect you to cross-validate all scores with 5 folds in this section  

**If you did not manage to construct the full preprocessing:**  
☑️ Construct a light pipeline that use only features in **X_simple** and scale them to values between 0 and 1  

### 6.1 - Linear Models

📝 Construct a `Pipeline` that combines your **preproc**  and a linear estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_linear**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_linear**  

In [403]:
pipe_linear = make_pipeline(preprocessor, LinearRegression())

In [404]:
pipe_linear

In [424]:
score_linear = cross_val_score(pipe_linear, X,y,cv=5, scoring = 'neg_root_mean_squared_error')

In [425]:
score_linear = score_linear.mean()

In [426]:
score_linear

-13.49329322132599

### 6.2 - Ensemble Methods

📝 Construct a `Pipeline` that combines your **preproc**  and an ensemble estimator from `sklearn`  
☑️ Assign your pipeline to a variable named **pipe_ensemble**  
☑️ We expect you to cross-validate all scores with 5 folds in this section  
☑️ Store the mean of the scores in **score_ensemble**  

In [435]:
pipe_ensemble = make_pipeline(preprocessor, RandomForestRegressor(max_depth = 8))

In [436]:
pipe_ensemble

In [437]:
score_ensemble = cross_val_score(pipe_ensemble, X,y,cv=5,scoring = 'neg_root_mean_squared_error')

In [438]:
score_ensemble = score_ensemble.mean()

In [439]:
score_ensemble

-12.142731817797385

💾 **Run the following cell to save your results**

In [440]:
from nbresult import ChallengeResult

result = ChallengeResult("model_selection",
    baseline=baseline_score,
    estimator_linear=pipe_linear._final_estimator,
    estimator_ensemble=pipe_ensemble._final_estimator,
    score_linear=score_linear,
    score_ensemble=score_ensemble)
result.write()

## 7 - Fine-tuning (25 min)

*C11 - Améliorer les capacités prédictives d'un systèmes en sélectionnant un modèle différent ou en modifiant ses hyperparamètres en vue de corriger des erreurs (hyperparameter tuning)*

🎯 **Fine-tune your best model to achieve the highest possible score**

📝 Create a cross-validated grid search and assign it to **search**  
☑️ Choose the model that yielded the best result in section 6  
☑️ Create a **grid**, a `dict`, that stores the hyperparameters you want to search  
☑️ Limit yourself to 2 hyperparameters, with up to 3 possible values for each one  
☑️ Use only one scoring method, the one you stored in **scoring** in section 2  

In [441]:
pipe_ensemble.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('clusters_num_col',
                                    Pipeline(steps=[('Kmeans',
                                                     KMeans(n_clusters=5)),
                                                    ('transformer_clusters',
                                                     FunctionTransformer(func=<function process_clusters at 0x7feff9a4c820>)),
                                                    ('ohe',
                                                     OneHotEncoder(drop='first',
                                                                   handle_unknown='ignore',
                                                                   sparse=False))]),
                                    ['acousticness', 'danceability', 'duration_ms',
                                     'energy', 'explicit', 'instrumentalnes...
                                     'key', 'liveness', 'loudness', 'm

In [445]:
grid = {'randomforestregressor__max_depth' : [2,4,8],
       'randomforestregressor__min_samples_split' : [2,4,8]}

In [450]:
search = GridSearchCV(
    pipe_ensemble, 
    param_grid = grid,
    cv=5,
    n_jobs = -1,
    verbose = 3,
    scoring="neg_root_mean_squared_error")

In [451]:
search.fit(X,y)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


📝 fit your **search**  on the full **X** and **y**  
☑️ Iterate until you notice an improvement on the best score  compared to the scores obtained in section 6  
💡 You won't be judged by the computing power of your machine, your grid search should fit in under 3 minutes

In [455]:
search.best_score_

-12.137131447220522

[CV 3/5] END randomforestregressor__max_depth=2, randomforestregressor__min_samples_split=4;, score=-14.654 total time=  12.7s
[CV 5/5] END randomforestregressor__max_depth=2, randomforestregressor__min_samples_split=8;, score=-14.660 total time=  13.7s
[CV 2/5] END randomforestregressor__max_depth=4, randomforestregressor__min_samples_split=4;, score=-13.279 total time=  24.1s
[CV 5/5] END randomforestregressor__max_depth=4, randomforestregressor__min_samples_split=8;, score=-13.200 total time=  25.1s
[CV 3/5] END randomforestregressor__max_depth=8, randomforestregressor__min_samples_split=4;, score=-12.141 total time=  41.8s
[CV 4/5] END randomforestregressor__max_depth=2, randomforestregressor__min_samples_split=2;, score=-14.628 total time=  13.0s
[CV 4/5] END randomforestregressor__max_depth=2, randomforestregressor__min_samples_split=8;, score=-14.633 total time=  13.6s
[CV 1/5] END randomforestregressor__max_depth=4, randomforestregressor__min_samples_split=4;, score=-13.514 tot

💾 **Run the following cell to save your results**

In [481]:
from nbresult import ChallengeResult

result = ChallengeResult("model_tuning",
    search_results=search.cv,
    score=search.best_score_)
result.write()

## 8 - Recommendations and Continuous Improvement (30 min)

*C13 - Adopter une démarche d'amélioration continue en identifiant les axes de perfectionnement d'un produit à l'aide d'une méthode adaptée de manière à améliorer la performance d'un produit*

🎯 **Transform your regression task into a classification task**

The product owner of your company tells you that he only needs to know whether a song is above or below popularity median  
The exact popularity value is of little interest to her as it won't bring any value to the feature under development.

📝 Create a new target **y_cat**, a `Series`, using the formula below  

☑️ $y\_cat_i = 1 \quad if\quad y_i \geq median(y),\quad 0\quad otherwise.$  

In [482]:
median = y.median()

In [483]:
y_cat = y.apply(lambda x : 1 if x > median else 0)

In [484]:
y_cat.value_counts()

0    26591
1    25462
Name: popularity, dtype: int64

In [485]:
y_cat.shape, y.shape

((52053,), (52053,))

📝 Cross validate a classification  

☑️ Use your **preproc** with a basic linear model from `sklearn` suited for classification  
☑️ Assign the resulting pipeline to **pipe_cat**  
☑️ Cross validate the pipeline with 5 folds  
☑️ Use the Accuracy metric and store the mean of scores in **score_cat**  

In [493]:
pipe_cat = make_pipeline(preprocessor, LogisticRegression(max_iter = 1000))

In [494]:
pipe_cat

In [495]:
score_cat = cross_val_score(pipe_cat, X, y_cat , cv=5,scoring = 'accuracy')

In [496]:
score_cat = score_cat.mean()

In [497]:
score_cat

0.8255816988072635

💾 **Run the following cell to save your results**

In [498]:
from nbresult import ChallengeResult

result = ChallengeResult(
    "recommendations",
    target_cat = y_cat.value_counts(normalize=True).values,
    model = pipe_cat._final_estimator,
    score_cat = score_cat
)
result.write()

## 9 - Deployment - API (40 min)

*C12 - Mettre en production le modèle d'apprentissage supervisé ou non supervisé obtenu sous la forme d'une API*

📝 This challenge takes place in another repository  
☑️ Follow the instructions provided on the certification platform