# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

📝 **Load the `spotify_popularity_train.csv` dataset from the provided URL**
- Display the first few rows
- Perform the basic cleaning operations (remove redundant lines, as well as those with missing values)
- Store the result in a `DataFrame` named `data`

In [1]:
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"

In [2]:
import pandas as pd
import requests
import io

s=requests.get(url).content
data = pd.read_csv(io.StringIO(s.decode('utf-8')))

In [3]:
data.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand
3,0.429,0.681,130026,0.165,0,4GeYbfIx1vSQXTfQb1m8Th,0.0,11,0.394,-21.457,0,Kapitel 281 - Der Page und die Herzogin,1,1926,0.946,145.333,0.288,Georgette Heyer
4,0.562,0.543,129813,0.575,0,2JPGGZwajjMk0vvhfC17RK,4e-06,2,0.127,-7.374,1,Away from You,15,2008-02-11,0.0265,139.272,0.801,Gerry & The Pacemakers


In [4]:
data.shape

(52317, 18)

In [5]:
data = data.drop_duplicates()

In [6]:
data.shape

(52057, 18)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52057 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52057 non-null  float64
 1   danceability      52057 non-null  float64
 2   duration_ms       52057 non-null  int64  
 3   energy            52057 non-null  float64
 4   explicit          52057 non-null  int64  
 5   id                52057 non-null  object 
 6   instrumentalness  52057 non-null  float64
 7   key               52057 non-null  int64  
 8   liveness          52057 non-null  float64
 9   loudness          52057 non-null  float64
 10  mode              52057 non-null  int64  
 11  name              52057 non-null  object 
 12  popularity        52057 non-null  int64  
 13  release_date      52057 non-null  object 
 14  speechiness       52057 non-null  float64
 15  tempo             52057 non-null  float64
 16  valence           52057 non-null  float6

In [8]:
data.isnull().sum().sort_values(ascending=False)

artist              4
valence             0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
acousticness        0
dtype: int64

In [9]:
data = data[~data.artist.isnull()]

In [10]:
data.shape

(52053, 18)

🧪 **Run the following cell to save your results**

In [11]:
from nbresult import ChallengeResult

ChallengeResult(
    "data_cleaning",
    shape=data.shape).write()

## Simple model

📝 **Which sklearn's scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use if we want to:**
- **Strongly penalize** largest errors
- Measure errors **in the same unit** than `popularity` 
- Is better when greater (metric_good_model > metric_bad_model)

👉 Store its exact name as `string` in the variable `scoring` below

🚨 You must use this metric for the rest of the challenge

In [12]:
scoring = "neg_root_mean_squared_error"

**📝 Let's build a first simple linear model using only the numerical features in our dataset to start with**
- Build `X_simple` keeping only numerical features
- Build `y` your target containing the `popularity`

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52053 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52053 non-null  float64
 1   danceability      52053 non-null  float64
 2   duration_ms       52053 non-null  int64  
 3   energy            52053 non-null  float64
 4   explicit          52053 non-null  int64  
 5   id                52053 non-null  object 
 6   instrumentalness  52053 non-null  float64
 7   key               52053 non-null  int64  
 8   liveness          52053 non-null  float64
 9   loudness          52053 non-null  float64
 10  mode              52053 non-null  int64  
 11  name              52053 non-null  object 
 12  popularity        52053 non-null  int64  
 13  release_date      52053 non-null  object 
 14  speechiness       52053 non-null  float64
 15  tempo             52053 non-null  float64
 16  valence           52053 non-null  float6

In [14]:
X_simple = data.copy().drop(columns=['id','name','popularity','release_date','artist'])

In [15]:
y = data.copy().popularity

### Holdout evaluation

**📝 Create the 4 variables `X_train_simple` `y_train`, `X_test_simple`, `y_test` with a 50% split with random sampling**

In [16]:
from sklearn.model_selection import train_test_split

X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)

**📝 Fit and evaluate a basic linear model (do not fine tune it) with this holdout method**
- Store your model true performance in a float variable `score_simple_holdout`

In [17]:
def NRMSE(pred, true):
    return -((sum((pred-true)**2)/len(true))**0.5)

In [18]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_simple, y_train)
y_pred = model.predict(X_test_simple)
score_simple_holdout = NRMSE(y_pred, y_test)
score_simple_holdout

-18.346287430371856

### Cross-validation evaluation

📝 **Let's be sure our score is representative**: 
- 5-times cross validate a basic linear model on the whole numeric dataset (`X_simple`, `y`)
- Do not fine tune your model
- Store your mean performance in a variable `score_simple_cv_mean` as a `float`
- Store the standard deviation of your performances in a float variable `score_simple_cv_std`

In [19]:
from sklearn.model_selection import cross_val_score
score_simple_cv_mean = cross_val_score(model, X_simple, y, cv=5, scoring=scoring).mean()
score_simple_cv_mean

-18.360558551569586

In [20]:
score_simple_cv_std = -score_simple_cv_mean
score_simple_cv_std 

18.360558551569586

🧪 **Run the following cell to save your results**

In [21]:
from nbresult import ChallengeResult

ChallengeResult(
    "simple_model",
    scoring=scoring,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
).write()

## Feature engineering

(From now on, we will stop using train/test split but cross-validation on the whole dataset instead)  

Let's try to improve performance using the feature `release_date`

**📝 Create `X_engineered` by adding a new column `year` to `X`, containing the release year of the track as `integer`**

In [22]:
X_engineered = X_simple.copy()

In [23]:
X_engineered['year'] = [pd.to_datetime(date).year for date in data.release_date]

In [24]:
X_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300,1971
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587,2015
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640,1968
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880,1926
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150,1977
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140,1965
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380,2020
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100,1952


📝 **Let's see how this impact the performance of our model.**
- Retrain the same simple linear model on numerical values only, adding the new feature `year`
- Save the mean cross-validated performance metric in a variable named `score_engineered` as a `float`
- Do not fine tune the model yet

In [25]:
score_engineered = cross_val_score(model, X_engineered, y, cv=5, scoring=scoring).mean()
score_engineered

-17.301966769706546

🧪 **Run the following cell to save your results**

In [26]:
from nbresult import ChallengeResult

ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score_engineered=score_engineered
).write()

## Pipelining

Let's now look for maximum performance by creating a solid preprocessing pipeline.

**📝 Create a sklearn preprocessing [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and store it as `preproc`**

- Feel free to add any preprocessing steps you think of
- You may want to integrate your feature engineering for `year`
- You may also further improve it using the `ArtistPopularityTransformer` class given to you below
- Don't add any model to it yet

🚨 Advice: It is better for you to have a working pipeline (even simple one) rather than NO pipeline at all

In [27]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X = data.copy().drop(columns='popularity')
y = data.copy().popularity

In [28]:
# We are giving you below a custom transformer that you may want to use in your pipeline (make sure you understanding it)

from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature of the test set, the mean popularity of 
    all songs made by the artist on the train set.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]

In [29]:
X.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
count,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0,52053.0
mean,0.498218,0.536523,232497.0,0.483881,0.069698,0.195664,5.191536,0.211833,-11.745365,0.705665,0.106189,117.077248,0.524738
std,0.379814,0.176418,143321.2,0.273028,0.25464,0.333686,3.526759,0.180351,5.696061,0.455747,0.182825,30.266286,0.263819
min,0.0,0.0,5991.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0
25%,0.0867,0.413,166400.0,0.249,0.0,0.0,2.0,0.0995,-14.913,0.0,0.0352,94.004,0.312
50%,0.516,0.548,206213.0,0.468,0.0,0.000469,5.0,0.139,-10.836,1.0,0.0457,115.939,0.538
75%,0.893,0.669,266254.0,0.713,0.0,0.24,8.0,0.273,-7.478,1.0,0.0768,135.114,0.742
max,0.996,0.986,4800118.0,1.0,1.0,1.0,11.0,0.999,3.744,1.0,0.97,243.507,1.0


In [30]:
X.key.nunique()

12

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np

year_constructor = FunctionTransformer(lambda df: pd.DataFrame\
                                        ([pd.to_datetime(date).year for date in df.release_date]))
drop_some_col = FunctionTransformer(lambda df: df.drop(columns=['id','name','release_date','artist']))

artist_preproc = Pipeline([
    ('artist_popularity', ArtistPopularityTransformer()),
    ('scaler', MinMaxScaler())
])

year_preproc = Pipeline([
    ('year', year_constructor),
    ('scaler', MinMaxScaler())
])

other_preproc = ColumnTransformer([
    ('scaling', RobustScaler(), ['duration_ms','loudness','tempo']),
    ('encoding', OneHotEncoder(sparse = False), ['key'])],
    remainder = 'passthrough')

preproc = ColumnTransformer([
    ('artist_preproc', artist_preproc, ['artist']),
    ('year_preproc', year_preproc, ['release_date']),
    ('other_preproc', other_preproc, ['acousticness', 'danceability', 'duration_ms', 'energy', 'explicit',
               'instrumentalness', 'key', 'liveness', 'loudness', 'mode','speechiness', 'tempo', 'valence'])
])


In [32]:
preproc.fit_transform(X,y)

array([[0.36458333, 0.5049505 , 0.13633906, ..., 1.        , 0.0454    ,
        0.43      ],
       [0.10416667, 0.94059406, 2.78141086, ..., 1.        , 0.0677    ,
        0.0587    ],
       [0.35763889, 0.47524752, 0.39537725, ..., 1.        , 0.214     ,
        0.464     ],
       ...,
       [0.        , 0.99009901, 1.04779979, ..., 1.        , 0.0275    ,
        0.338     ],
       [0.0060307 , 0.31683168, 0.10295031, ..., 1.        , 0.0355    ,
        0.41      ],
       [0.41354167, 0.72277228, 0.12378072, ..., 0.        , 0.0318    ,
        0.67      ]])

In [33]:
import numpy as np

np.shape(preproc.fit_transform(X, y))

(52053, 26)

**📝 Store the number of columns/feature after preprocessing your inputs in a variable `col_number`**

In [34]:
col_number = 26

🧪 **Run the following cells to save your results**

In [35]:
# Visually print your preproc
from sklearn import set_config; set_config(display='diagram')
preproc

In [36]:
# Save your preproc
from nbresult import ChallengeResult

ChallengeResult(
    "preprocessing",
    col_number=col_number,
    first_observation = preproc.fit_transform(X, y)[0]
).write()

## Training

📝 **Time to fine tune your models**

- Add an **estimator** to your pipeline (only from Scikit-learn) 

- Train your pipeline and **fine-tune** (optimize) your estimator to maximize prediction score

- You must try to fine tune at least 2 different models: 
    - create one pipeline with a **linear model** of your choice
    - create one pipeline with an **ensemble model** of your choice

Then, 

- Save your two best 5-time cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`

### Linear

In [37]:
from sklearn.linear_model import Ridge
pipe_linear = Pipeline([
    ('preprocessor', preproc),
    ('model', Ridge())
])
score_linear = cross_val_score(pipe_linear, X, y, cv=5, scoring=scoring).mean()
score_linear

-13.502779339585604

In [38]:
pipe_linear.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(transformers=[('artist_preproc',
                                    Pipeline(steps=[('artist_popularity',
                                                     ArtistPopularityTransformer()),
                                                    ('scaler', MinMaxScaler())]),
                                    ['artist']),
                                   ('year_preproc',
                                    Pipeline(steps=[('year',
                                                     FunctionTransformer(func=<function <lambda> at 0x7fbb0d8b00d0>)),
                                                    ('scaler', MinMaxScaler())]),
                                    ['release_date']),
                                   ('other_preproc',
                                    ColumnTransformer(remainder='passthrough',
                                                      transformers=[('scaling',
                              

In [39]:
from sklearn.model_selection import GridSearchCV

# Instanciate grid search
grid_search = GridSearchCV(
    pipe_linear, 
    param_grid={
        'model__alpha': [10]}, #[0.1, 0.5, 1, 5, 10]
    cv=5,
    scoring=scoring)
grid_search.fit(X,y);

In [40]:
grid_search.best_params_ , grid_search.best_score_

({'model__alpha': 10}, -13.482773081902653)

In [41]:
score_linear = grid_search.best_score_

### Ensemble

In [42]:
from sklearn.tree import DecisionTreeRegressor
pipe_ensemble = Pipeline([
    ('preprocessor', preproc),
    ('model', DecisionTreeRegressor())
])
score_ensemble = cross_val_score(pipe_ensemble, X, y, cv=5, scoring=scoring).mean()
score_ensemble

-15.357681399264777

In [43]:
pipe_ensemble.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(transformers=[('artist_preproc',
                                    Pipeline(steps=[('artist_popularity',
                                                     ArtistPopularityTransformer()),
                                                    ('scaler', MinMaxScaler())]),
                                    ['artist']),
                                   ('year_preproc',
                                    Pipeline(steps=[('year',
                                                     FunctionTransformer(func=<function <lambda> at 0x7fbb0d8b00d0>)),
                                                    ('scaler', MinMaxScaler())]),
                                    ['release_date']),
                                   ('other_preproc',
                                    ColumnTransformer(remainder='passthrough',
                                                      transformers=[('scaling',
                              

In [44]:
grid_search2 = GridSearchCV(
    pipe_ensemble, 
    param_grid={
        'model__min_samples_split': [20], #[1,2,10]
        'model__max_depth': [8]},#[4,8,16] [6, 8, 10]
    cv=5,
    scoring=scoring)
grid_search2.fit(X,y);
grid_search2.best_params_ , grid_search2.best_score_

({'model__max_depth': 8, 'model__min_samples_split': 20}, -12.525848169272146)

In [45]:
score_ensemble = grid_search2.best_score_

🧪 **Run the following cells to save your results**

In [46]:
# Print below your best pipe for correction purpose
from sklearn import set_config; set_config(display='diagram')
pipe_linear

In [47]:
# Print below your best pipe for correction purpose
pipe_ensemble

In [48]:
from nbresult import ChallengeResult

ChallengeResult("model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow instructions

**This final part is independent from the above notebook**

Wana try the api ?

https://data-certification-aco7pbafca-ew.a.run.app/predict?acousticness=0.654&danceability=0.499&duration_ms=219827&energy=0.19&explicit=0&id=0B6BeEUd6UwFlbsHMQKjob&instrumentalness=0.00409&key=7&liveness=0.0898&loudness=-16.435&mode=1&name=Back%20in%20the%20Goodle%20Days&release_date=1971&speechiness=0.0454&tempo=149.46&valence=0.43&artist=John%20Hartford