# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

📝 **Load the `spotify_popularity_train.csv` dataset from the provided URL**
- Display the first few rows
- Perform basic cleaning (remove duplicate rows, as well as rows with missing values)
- Store the result in a `DataFrame` named `data`

In [1]:
# Canonical imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"

In [3]:
data = pd.read_csv(url)

In [4]:
data.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand
3,0.429,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.0,11,0.394,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1,1926,0.946,145.333,0.288,Georgette Heyer
4,0.562,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,4e-06,2,0.127,-7.374,1,Away from You,15,2008-02-11,0.0265,139.272,0.801,Gerry & The Pacemakers


In [5]:
data.shape

(52317, 18)

In [6]:
# removing duplicates

data = data.drop_duplicates()
len(data)

52057

In [7]:
# handling missing values

data.isnull().sum().sort_values(ascending=False)/len(data)

artist              0.000077
valence             0.000000
danceability        0.000000
duration_ms         0.000000
energy              0.000000
explicit            0.000000
id                  0.000000
instrumentalness    0.000000
key                 0.000000
liveness            0.000000
loudness            0.000000
mode                0.000000
name                0.000000
popularity          0.000000
release_date        0.000000
speechiness         0.000000
tempo               0.000000
acousticness        0.000000
dtype: float64

In [8]:
# removing rows with no artist

data = data.dropna(subset=['artist'])

In [9]:
# checking 
data.isnull().sum().sort_values(ascending=False)/len(data)

artist              0.0
valence             0.0
danceability        0.0
duration_ms         0.0
energy              0.0
explicit            0.0
id                  0.0
instrumentalness    0.0
key                 0.0
liveness            0.0
loudness            0.0
mode                0.0
name                0.0
popularity          0.0
release_date        0.0
speechiness         0.0
tempo               0.0
acousticness        0.0
dtype: float64

In [10]:
data.shape

(52053, 18)

🧪 **Run the following cell to save your results**

In [11]:
from nbresult import ChallengeResult

ChallengeResult(
    "data_cleaning",
    shape=data.shape).write()

## Simple model

📝 **Which sklearn scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use if we want it to:**
- **Strongly penalize** the largest errors
- Measure errors **in the same unit** as `popularity`
- Be better when greater (metric_good_model > metric_bad_model)

👉 Store its exact name as a `string` in the variable `scoring` below

🚨 You must use this metric for the rest of the challenge

In [12]:
# defining the scoring metric for the rest of the exercise

scoring = 'neg_root_mean_squared_error'
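
To make the three criteria above concrete, here is a minimal sketch on made-up popularity values (not taken from the dataset). The helper names `rmse` and `mae` are illustrative. Both prediction sets have the same mean absolute error, but the root mean squared error doubles when the error is concentrated in a single large miss. Negating it gives a score that is better when greater.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, expressed in the same unit as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values, made up for illustration
y_true = np.array([40, 22, 40, 1])
pred_spread = np.array([35, 27, 35, 6])    # four errors of 5 points each
pred_outlier = np.array([40, 22, 40, 21])  # a single error of 20 points

print(mae(y_true, pred_spread), mae(y_true, pred_outlier))    # 5.0 5.0
print(rmse(y_true, pred_spread), rmse(y_true, pred_outlier))  # 5.0 10.0

# Negated, so that a better model gets a greater score
neg_rmse_spread = -rmse(y_true, pred_spread)
neg_rmse_outlier = -rmse(y_true, pred_outlier)
print(neg_rmse_spread > neg_rmse_outlier)  # True
```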

**📝 Let's build a first simple linear model using only the numerical features in our dataset to start with**
- Build `X_simple` keeping only numerical features
- Build `y`, your target, containing the `popularity`

In [13]:
X_simple = data.select_dtypes(include=np.number).drop(columns=['popularity'])
X_simple

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100


In [14]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_simple_scaled = scaler.fit_transform(X_simple)

In [15]:
X_simple_scaled

array([[0.65662651, 0.50608519, 0.04460374, ..., 0.04680412, 0.61378112,
        0.43      ],
       [0.00594378, 0.44523327, 0.09969636, ..., 0.06979381, 0.56688309,
        0.0587    ],
       [0.73694779, 0.53042596, 0.04999909, ..., 0.22061856, 0.31156805,
        0.464     ],
       ...,
       [0.45783133, 0.55578093, 0.06358801, ..., 0.02835052, 0.31824547,
        0.338     ],
       [0.9688755 , 0.36511156, 0.04390831, ..., 0.03659794, 0.3322656 ,
        0.41      ],
       [0.54518072, 0.836714  , 0.04434217, ..., 0.03278351, 0.40749547,
        0.67      ]])

In [16]:
y = data['popularity']
y

0        40
1        22
2        40
3         1
4        15
         ..
52312    25
52313    43
52314     0
52315     0
52316    40
Name: popularity, Length: 52053, dtype: int64

### Holdout evaluation

**📝 Create the 4 variables `X_train_simple`, `y_train`, `X_test_simple`, `y_test` with a 50% split with random sampling**

In [17]:
from sklearn.model_selection import train_test_split

X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple_scaled, y, test_size=0.5)

**📝 Fit and evaluate a basic linear model (do not fine tune it) with this holdout method**
- Store your model's true performance in a float variable `score_simple_holdout`

In [18]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

In [19]:
model.fit(X_train_simple, y_train)

LinearRegression()

In [20]:
y_pred_base = model.predict(X_test_simple)

In [21]:
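from sklearn.metrics import mean_squared_error

# Negated RMSE, to match the 'neg_root_mean_squared_error' convention (greater is better).
# np.sqrt(MSE) replaces squared=False, which recent scikit-learn releases no longer accept.
score_simple_holdout = -np.sqrt(mean_squared_error(y_test, y_pred_base))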

In [22]:
score_simple_holdout

-18.422146314648934
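
The holdout cells above fit the scaler on the whole dataset before splitting, and the split is not seeded, so the score changes from run to run. The sketch below is a variant, not the notebook's method. It uses synthetic data, and every name ending in `_demo` is hypothetical. The split is made reproducible with `random_state`, and the scaler is fitted on the training half only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data standing in for (X_simple, y)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# A fixed random_state makes the 50% split identical on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Min/max statistics come from the training half only
scaler_demo = MinMaxScaler().fit(X_tr)
model_demo = LinearRegression().fit(scaler_demo.transform(X_tr), y_tr)

# Evaluate on the untouched test half, with the same sign convention as `scoring`
y_hat = model_demo.predict(scaler_demo.transform(X_te))
rmse_demo = float(np.sqrt(np.mean((y_te - y_hat) ** 2)))
score_demo = -rmse_demo
print(score_demo)
```

For an ordinary least-squares model with an intercept, rescaling the features does not change the predictions, so the difference is mostly a matter of habit here. It matters more once regularized or distance-based models are involved.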

### Cross-validation evaluation

📝 **Let's be sure our score is representative**: 
- Run a 5-fold cross-validation of a basic linear model on the whole numeric dataset (`X_simple`, `y`)
- Do not fine tune your model
- Store your mean performance in a variable `score_simple_cv_mean` as a `float`
- Store the standard deviation of your performances in a float variable `score_simple_cv_std`

In [23]:
from sklearn.model_selection import cross_validate

model = LinearRegression()

cv_results = cross_validate(model,
                            X_simple_scaled,
                            y,
                            cv=5, 
                            scoring=scoring)

In [24]:
score_simple_cv_mean = cv_results['test_score'].mean()
score_simple_cv_mean

-18.360558551569312

In [25]:
score_simple_cv_std = cv_results['test_score'].std()
score_simple_cv_std

0.09730190039196857
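
As a complement, here is a minimal sketch of the same 5-fold evaluation with the scaler placed inside a pipeline, so that it is re-fitted on the training folds of each split. The data is synthetic and the `_demo` names are hypothetical. It only illustrates how `cross_validate` exposes the per-fold scores from which the mean and standard deviation are taken.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data standing in for (X_simple, y)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(250, 3))
y_demo = X_demo @ np.array([1.0, 0.5, -1.5]) + rng.normal(scale=0.5, size=250)

# Scaler + model are cloned and fitted from scratch on each training fold
cv_demo = cross_validate(
    make_pipeline(MinMaxScaler(), LinearRegression()),
    X_demo,
    y_demo,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

fold_scores = cv_demo["test_score"]  # one negated RMSE per fold
mean_demo, std_demo = fold_scores.mean(), fold_scores.std()
print(fold_scores.round(3), mean_demo, std_demo)
```

A small standard deviation relative to the mean, as in the cells above, suggests that the score does not depend much on which rows land in which fold.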

🧪 **Run the following cell to save your results**

In [26]:
from nbresult import ChallengeResult

ChallengeResult(
    "simple_model",
    scoring=scoring,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
).write()

## Feature engineering

(From now on, we will stop using the train/test split and use cross-validation on the whole dataset instead)  

Let's try to improve performance using the feature `release_date`

**📝 Create `X_engineered` by adding a new column `year` to `X`, containing the release year of the track as `integer`**

In [27]:
X_engineered = data.drop(columns=['popularity'])
# release_date mixes 'YYYY' and 'YYYY-MM-DD' strings (see the rows displayed earlier),
# so the leading four characters are taken as the year; this avoids mixed-format date parsing
X_engineered['year'] = X_engineered['release_date'].astype(str).str[:4].astype(int)
X_engineered = X_engineered.drop(columns=['release_date'])
X_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,speechiness,tempo,valence,artist,year
0,0.65400,0.499,219827,0.190,0,0B6BeEUd6UwFlbsHMQKjob,0.004090,7,0.0898,-16.435,1,Back in the Goodle Days,0.0454,149.460,0.4300,John Hartford,1971
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.140000,2,0.0890,-8.497,1,Worlds Which Break Us - Intro Mix,0.0677,138.040,0.0587,Driftmoon,2015
2,0.73400,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.000000,0,0.0771,-11.506,1,I'm The Greatest Star,0.2140,75.869,0.4640,Barbra Streisand,1968
3,0.42900,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.000000,11,0.3940,-21.457,0,Kapitel 281 - Der Page und die Herzogin,0.9460,145.333,0.2880,Georgette Heyer,1926
4,0.56200,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,0.000004,2,0.1270,-7.374,1,Away from You,0.0265,139.272,0.8010,Gerry & The Pacemakers,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,2GJxRwFe8oLcbXgTw9P5of,0.004870,6,0.8010,-7.804,1,"Incidental CB Dialogue - Bandit, Smokey & Snowman",0.6620,85.615,0.3150,Burt Reynolds,1977
52313,0.77300,0.533,192838,0.659,0,0EtAPdqg7TysBXKDbnzuSO,0.773000,2,0.1130,-9.117,0,Samba De Verão,0.0426,158.366,0.6140,Walter Wanderley,1965
52314,0.45600,0.548,310840,0.568,0,1s78GLrkZT7rTKAEu056M8,0.000000,6,0.0892,-5.348,1,Kekkonnouta,0.0275,77.495,0.3380,accel,2020
52315,0.96500,0.360,216493,0.132,0,1LUBU2WI4z0dALUM16hoAH,0.000000,10,0.1260,-21.014,1,Die Meistersinger von Nürnberg - Act 1: Wohl M...,0.0355,80.909,0.4100,Richard Wagner,1952


📝 **Let's see how this impacts the performance of our model.**
- Retrain the same simple linear model on numerical values only, adding the new feature `year`
- Save the mean cross-validated performance metric in a variable named `score_engineered` as a `float`
- Do not fine tune the model yet

In [28]:
X_engineered_simple = X_engineered.select_dtypes(include=np.number)

In [29]:
scaler = MinMaxScaler()

X_engineered_simple_scaled = scaler.fit_transform(X_engineered_simple)

In [30]:
cv_results_2 = cross_validate(model,
                            X_engineered_simple_scaled,
                            y,
                            cv=5, 
                            scoring=scoring)

In [31]:
score_engineered = cv_results_2['test_score'].mean()
score_engineered

-17.301966769706333

🧪 **Run the following cell to save your results**

In [32]:
from nbresult import ChallengeResult

ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score_engineered=score_engineered
).write()

## Pipelining

Let's now look for maximum performance by creating a solid preprocessing pipeline.

**📝 Create a sklearn preprocessing [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and store it as `preproc`**

- Feel free to add any preprocessing steps you think of
- You may want to integrate your feature engineering for `year`
- You may also further improve it using the `ArtistPopularityTransformer` class given to you below
- Don't add any model to it yet

🚨 Advice: It is better for you to have a working pipeline (even a simple one) rather than NO pipeline at all

In [33]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = X_engineered
y = data['popularity']

In [34]:
X

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,speechiness,tempo,valence,artist,year
0,0.65400,0.499,219827,0.190,0,0B6BeEUd6UwFlbsHMQKjob,0.004090,7,0.0898,-16.435,1,Back in the Goodle Days,0.0454,149.460,0.4300,John Hartford,1971
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.140000,2,0.0890,-8.497,1,Worlds Which Break Us - Intro Mix,0.0677,138.040,0.0587,Driftmoon,2015
2,0.73400,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.000000,0,0.0771,-11.506,1,I'm The Greatest Star,0.2140,75.869,0.4640,Barbra Streisand,1968
3,0.42900,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.000000,11,0.3940,-21.457,0,Kapitel 281 - Der Page und die Herzogin,0.9460,145.333,0.2880,Georgette Heyer,1926
4,0.56200,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,0.000004,2,0.1270,-7.374,1,Away from You,0.0265,139.272,0.8010,Gerry & The Pacemakers,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,2GJxRwFe8oLcbXgTw9P5of,0.004870,6,0.8010,-7.804,1,"Incidental CB Dialogue - Bandit, Smokey & Snowman",0.6620,85.615,0.3150,Burt Reynolds,1977
52313,0.77300,0.533,192838,0.659,0,0EtAPdqg7TysBXKDbnzuSO,0.773000,2,0.1130,-9.117,0,Samba De Verão,0.0426,158.366,0.6140,Walter Wanderley,1965
52314,0.45600,0.548,310840,0.568,0,1s78GLrkZT7rTKAEu056M8,0.000000,6,0.0892,-5.348,1,Kekkonnouta,0.0275,77.495,0.3380,accel,2020
52315,0.96500,0.360,216493,0.132,0,1LUBU2WI4z0dALUM16hoAH,0.000000,10,0.1260,-21.014,1,Die Meistersinger von Nürnberg - Act 1: Wohl M...,0.0355,80.909,0.4100,Richard Wagner,1952


In [35]:
# Below is a custom transformer that you may want to use in your pipeline (make sure you understand it)

from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature of the test set, the mean popularity of 
    all songs made by the artist on the train set.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
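
To check your understanding of the transformer, here is a minimal sketch of the same idea on a toy train/test pair. The artists and values are made up. It uses `map` and `fillna` rather than the `merge` and `replace` calls of the class above, so it is an equivalent illustration, not the provided implementation.

```python
import pandas as pd

# Toy data, made up for illustration
train_demo = pd.DataFrame(
    {"artist": ["A", "A", "B"], "popularity": [40, 20, 10]}
)
test_demo = pd.DataFrame({"artist": ["A", "B", "C"]})  # "C" is unseen in train

# "fit": mean popularity per artist, plus the global mean as a fallback
artist_mean = train_demo.groupby("artist")["popularity"].mean()  # A -> 30, B -> 10
global_mean = train_demo["popularity"].mean()                    # 70 / 3

# "transform": look up each artist, fall back to the global mean when unknown
artist_popularity_demo = test_demo["artist"].map(artist_mean).fillna(global_mean)
print(artist_popularity_demo.tolist())
```

Because the feature is derived from the target, it should only be fitted on training data. Inside `cross_validate` this happens automatically when the transformer is part of the pipeline.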

In [36]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_features = list(X.drop(columns=['id', 'name', 'key', 'artist']).columns)
cat_features = ['key']
artists = ['artist']


num_preproc = Pipeline([
    ('scaler', MinMaxScaler())
     ])


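cat_preproc = Pipeline([
    # note: `sparse` was renamed `sparse_output` in scikit-learn 1.2; use sparse_output=False on recent versions
    ('OHE', OneHotEncoder(sparse=False))
     ])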


artist_preproc = Pipeline([
    ('artist_popularity', ArtistPopularityTransformer())
     ])


preproc = ColumnTransformer([
    ('num_preproc', num_preproc, num_features),
    ('cat_preproc', cat_preproc, cat_features),
    ('artist_preproc', artist_preproc, artists)])

preproc

ColumnTransformer(transformers=[('num_preproc',
                                 Pipeline(steps=[('scaler', MinMaxScaler())]),
                                 ['acousticness', 'danceability', 'duration_ms',
                                  'energy', 'explicit', 'instrumentalness',
                                  'liveness', 'loudness', 'mode', 'speechiness',
                                  'tempo', 'valence', 'year']),
                                ('cat_preproc',
                                 Pipeline(steps=[('OHE',
                                                  OneHotEncoder(sparse=False))]),
                                 ['key']),
                                ('artist_preproc',
                                 Pipeline(steps=[('artist_popularity',
                                                  ArtistPopularityTransformer())]),
                                 ['artist'])])

**📝 Store the number of columns/feature after preprocessing your inputs in a variable `col_number`**

In [38]:
pd.DataFrame(preproc.fit_transform(X,y)).head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,0.656627,0.506085,0.044604,0.19,0.0,0.00409,0.08989,0.683437,1.0,0.046804,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,35.0
1,0.005944,0.445233,0.099696,0.808,0.0,0.14,0.089089,0.807966,1.0,0.069794,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0
2,0.736948,0.530426,0.049999,0.288,0.0,0.0,0.077177,0.760762,1.0,0.220619,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,34.333333


In [39]:
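# 13 scaled numeric columns + 12 one-hot `key` columns + 1 artist popularity column = 26
col_number = preproc.fit_transform(X, y).shape[1]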

🧪 **Run the following cells to save your results**

In [40]:
# Visually print your preproc
from sklearn import set_config; set_config(display='diagram')
preproc

In [41]:
# Save your preproc
from nbresult import ChallengeResult

ChallengeResult(
    "preprocessing",
    col_number=col_number,
    first_observation = preproc.fit_transform(X, y)[0]
).write()

## Training

📝 **Time to fine tune your models**

- Add an **estimator** to your pipeline (only from Scikit-learn) 

- Train your pipeline and **fine-tune** (optimize) your estimator to maximize prediction score

- You must try to fine tune at least 2 different models: 
    - create one pipeline with a **linear model** of your choice
    - create one pipeline with an **ensemble model** of your choice

Then, 

- Save your two best 5-fold cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`

### Linear

In [43]:
pipe_linear = Pipeline([
    ('preprocessing', preproc),
    ('linear_regression', LinearRegression())])

In [44]:
cv_results_linear = cross_validate(pipe_linear, X, y,
                            scoring=scoring,
                            cv=5)

In [45]:
score_linear = cv_results_linear['test_score'].mean()
score_linear

-13.505084892043874

### Ensemble

In [46]:
from xgboost import XGBRegressor

pipe_ensemble = Pipeline([
    ('preprocessing', preproc),
    ('XGBoost', XGBRegressor())])

In [47]:
cv_results_xgboost = cross_validate(pipe_ensemble, X, y,
                            scoring=scoring,
                            cv=5)

In [48]:
score_ensemble = cv_results_xgboost['test_score'].mean()
score_ensemble

-12.076024785872765
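
Both pipelines above are evaluated with default hyperparameters. As a rough sketch of what the fine-tuning step could look like, the example below runs a small grid search over a linear model and an ensemble model, both from scikit-learn. The synthetic data, the `tune` helper, the `_demo` names, the choice of `Ridge` and `RandomForestRegressor` and the grids themselves are all assumptions made for illustration. They are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data standing in for (X, y)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.3, size=150)

def tune(estimator, param_grid):
    """Grid-search one estimator behind a (here trivial) preprocessing step."""
    pipe = Pipeline([("scaler", MinMaxScaler()), ("model", estimator)])
    search = GridSearchCV(
        pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5
    )
    return search.fit(X_demo, y_demo)

# Step parameters are addressed as "<step name>__<parameter>"
search_linear = tune(Ridge(), {"model__alpha": [0.01, 0.1, 1.0, 10.0]})
search_ensemble = tune(
    RandomForestRegressor(n_estimators=50, random_state=0),
    {"model__max_depth": [3, 6, None]},
)

# best_score_ is the mean cross-validated score of the best parameter set;
# best_estimator_ is the pipeline refitted on all the data with those parameters
score_linear_demo = search_linear.best_score_
pipe_linear_demo = search_linear.best_estimator_
score_ensemble_demo = search_ensemble.best_score_
pipe_ensemble_demo = search_ensemble.best_estimator_
print(search_linear.best_params_, score_linear_demo)
print(search_ensemble.best_params_, score_ensemble_demo)
```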

🧪 **Run the following cells to save your results**

In [49]:
# Print your best pipeline below for grading purposes
pipe_ensemble

In [50]:
from nbresult import ChallengeResult

ChallengeResult("model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow the instructions

**This final part is independent from the above notebook**

In [65]:
# example

columns = list(data.drop(columns=['popularity']).columns)
columns

['acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'explicit',
 'id',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'mode',
 'name',
 'release_date',
 'speechiness',
 'tempo',
 'valence',
 'artist']

In [60]:
track_test = X.iloc[2,:]

track_test_df = pd.DataFrame(track_test).T
track_test_df

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,speechiness,tempo,valence,artist,year
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0,0,0.0771,-11.506,1,I'm The Greatest Star,0.214,75.869,0.464,Barbra Streisand,1968


In [62]:
pipe_ensemble.fit(X,y)

In [63]:
pipe_ensemble.predict(track_test_df)

array([35.56709], dtype=float32)
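
The linked repository has its own instructions, which are not reproduced here. As a generic illustration only, a fitted pipeline is often saved to disk with `joblib` and reloaded by the serving code. The sketch below uses a small synthetic pipeline and a hypothetical file name, and checks that the reloaded object predicts the same value as the original.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic pipeline standing in for the fitted `pipe_ensemble`
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo.sum(axis=1)
pipe_demo = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"  # hypothetical file name
    joblib.dump(pipe_demo, model_path)           # serialize the fitted pipeline
    reloaded = joblib.load(model_path)           # what the serving code would do

# The reloaded pipeline applies the same preprocessing and model
pred_original = pipe_demo.predict(X_demo[:1])
pred_reloaded = reloaded.predict(X_demo[:1])
print(np.allclose(pred_original, pred_reloaded))  # True
```

A pipeline that contains a custom class such as `ArtistPopularityTransformer` can only be reloaded where that class is importable under the same module path.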