# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

📝 **Load the `spotify_popularity_train.csv` dataset from the provided URL**
- Display the first few rows
- Perform the basic cleaning operations (remove redundant lines, as well as those with missing values)
- Store the result in a `DataFrame` named `data`
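
The cleaning steps listed above can be sketched on a tiny in-memory sample. The CSV content and the names `csv_text`, `raw` and `cleaned` are made up for illustration; only `drop_duplicates` and `dropna` are the actual pandas calls the task relies on.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the real CSV (values are illustrative only)
csv_text = """name,artist,popularity
Song A,Artist 1,40
Song A,Artist 1,40
Song B,,22
Song C,Artist 2,35
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Drop exact duplicate rows, then drop rows that contain any missing value
cleaned = raw.drop_duplicates().dropna()

print(raw.shape)      # (4, 3)
print(cleaned.shape)  # (2, 3)
```

Checking the shape before and after each step, as the notebook does below, is a cheap way to confirm how many rows each operation removed.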

In [1]:
import pandas as pd

In [2]:
import requests
import io
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"
s = requests.get(url).content
database = pd.read_csv(io.StringIO(s.decode('utf-8')))

In [3]:
database.head(2)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon


In [4]:
database.shape

(52317, 18)

In [5]:
databasenoduplicate=database.drop_duplicates()

In [6]:
databasenoduplicate.shape

(52057, 18)

In [7]:
databasenoduplicate.isnull().sum().sort_values(ascending=False)

artist              4
danceability        0
valence             0
tempo               0
speechiness         0
release_date        0
popularity          0
name                0
mode                0
acousticness        0
liveness            0
key                 0
instrumentalness    0
id                  0
explicit            0
energy              0
duration_ms         0
loudness            0
dtype: int64

In [8]:
dataclean=databasenoduplicate.dropna(subset=['artist']) 

In [9]:
data=dataclean

🧪 **Run the following cell to save your results**

In [10]:
from nbresult import ChallengeResult

ChallengeResult(
    "data_cleaning",
    shape=data.shape).write()

## Simple model

📝 **Which sklearn scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use if we want it to:**
- **Strongly penalize** the largest errors
- Measure errors **in the same unit** as `popularity`
- Be better when greater (metric_good_model > metric_bad_model)

👉 Store its exact name as `string` in the variable `scoring` below

🚨 You must use this metric for the rest of the challenge
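
To make the three criteria concrete, here is a minimal pure-Python sketch comparing a root mean squared error with a mean absolute error on made-up predictions. The helper names `rmse` and `mae` and all the numbers are illustrative assumptions, not part of the challenge.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, expressed in the same unit as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [50, 50, 50, 50]
small_errors = [45, 55, 45, 55]   # four errors of 5 points
one_big_error = [50, 50, 50, 30]  # a single error of 20 points

# Both sets of predictions have the same mean absolute error...
print(mae(y_true, small_errors), mae(y_true, one_big_error))    # 5.0 5.0
# ...but the squared term makes the single large error weigh more
print(rmse(y_true, small_errors), rmse(y_true, one_big_error))  # 5.0 10.0

# Negating the error turns it into a "greater is better" score
neg_rmse_small = -rmse(y_true, small_errors)
neg_rmse_big = -rmse(y_true, one_big_error)
print(neg_rmse_small > neg_rmse_big)  # True
```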

In [11]:
scoring = "neg_root_mean_squared_error"

**📝 Let's build a first simple linear model using only the numerical features in our dataset to start with**
- Build `X_simple` keeping only numerical features
- Build `y` your target containing the `popularity`

In [12]:
y=data['popularity']

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52053 entries, 0 to 52316
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52053 non-null  float64
 1   danceability      52053 non-null  float64
 2   duration_ms       52053 non-null  int64  
 3   energy            52053 non-null  float64
 4   explicit          52053 non-null  int64  
 5   id                52053 non-null  object 
 6   instrumentalness  52053 non-null  float64
 7   key               52053 non-null  int64  
 8   liveness          52053 non-null  float64
 9   loudness          52053 non-null  float64
 10  mode              52053 non-null  int64  
 11  name              52053 non-null  object 
 12  popularity        52053 non-null  int64  
 13  release_date      52053 non-null  object 
 14  speechiness       52053 non-null  float64
 15  tempo             52053 non-null  float64
 16  valence           52053 non-null  float64

In [14]:
X_simple = data.select_dtypes(include=['int64','float64']).drop(columns='popularity')

In [15]:
X_simple.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52053 entries, 0 to 52316
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   acousticness      52053 non-null  float64
 1   danceability      52053 non-null  float64
 2   duration_ms       52053 non-null  int64  
 3   energy            52053 non-null  float64
 4   explicit          52053 non-null  int64  
 5   instrumentalness  52053 non-null  float64
 6   key               52053 non-null  int64  
 7   liveness          52053 non-null  float64
 8   loudness          52053 non-null  float64
 9   mode              52053 non-null  int64  
 10  speechiness       52053 non-null  float64
 11  tempo             52053 non-null  float64
 12  valence           52053 non-null  float64
dtypes: float64(9), int64(4)
memory usage: 5.6 MB


In [16]:
X_simple

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0.65400,0.499,219827,0.190,0,0.004090,7,0.0898,-16.435,1,0.0454,149.460,0.4300
1,0.00592,0.439,483948,0.808,0,0.140000,2,0.0890,-8.497,1,0.0677,138.040,0.0587
2,0.73400,0.523,245693,0.288,0,0.000000,0,0.0771,-11.506,1,0.2140,75.869,0.4640
3,0.42900,0.681,130026,0.165,0,0.000000,11,0.3940,-21.457,0,0.9460,145.333,0.2880
4,0.56200,0.543,129813,0.575,0,0.000004,2,0.1270,-7.374,1,0.0265,139.272,0.8010
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,56253,0.907,0,0.004870,6,0.8010,-7.804,1,0.6620,85.615,0.3150
52313,0.77300,0.533,192838,0.659,0,0.773000,2,0.1130,-9.117,0,0.0426,158.366,0.6140
52314,0.45600,0.548,310840,0.568,0,0.000000,6,0.0892,-5.348,1,0.0275,77.495,0.3380
52315,0.96500,0.360,216493,0.132,0,0.000000,10,0.1260,-21.014,1,0.0355,80.909,0.4100


### Holdout evaluation

**📝 Create the 4 variables `X_train_simple`, `y_train`, `X_test_simple`, `y_test` with a 50% split with random sampling**

In [17]:
from sklearn.model_selection import train_test_split
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)

**📝 Fit and evaluate a basic linear model (do not fine tune it) with this holdout method**
- Store your model's true performance in a float variable `score_simple_holdout`

In [18]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_simple, y_train)

LinearRegression()

In [19]:
y_pred=model.predict(X_test_simple)

In [20]:
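from sklearn.metrics import mean_squared_error
# Negated RMSE, to match the "neg_root_mean_squared_error" convention (greater is better).
# The square root is taken explicitly because `squared=False` is deprecated in recent scikit-learn releases.
result = -(mean_squared_error(y_test, y_pred) ** 0.5)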

In [21]:
result

-18.39489216460111

In [22]:
score_simple_holdout=result

### Cross-validation evaluation

📝 **Let's be sure our score is representative**: 
- Cross-validate a basic linear model with 5 folds on the whole numeric dataset (`X_simple`, `y`)
- Do not fine tune your model
- Store your mean performance in a variable `score_simple_cv_mean` as a `float`
- Store the standard deviation of your performances in a float variable `score_simple_cv_std`

In [23]:
from sklearn.model_selection import cross_validate
model = LinearRegression()
cv_results = cross_validate(model, X_simple, y, cv=5, scoring="neg_root_mean_squared_error")
cv_results

{'fit_time': array([0.01509523, 0.01495337, 0.0186739 , 0.01528478, 0.01505613]),
 'score_time': array([0.00249791, 0.00614977, 0.00278473, 0.00268364, 0.00272584]),
 'test_score': array([-18.41383767, -18.45203104, -18.30188799, -18.43790327,
        -18.19713279])}

In [24]:
score_simple_cv_mean=cv_results['test_score'].mean()

In [25]:
score_simple_cv_std=cv_results['test_score'].std()

🧪 **Run the following cell to save your results**

In [26]:
from nbresult import ChallengeResult

ChallengeResult(
    "simple_model",
    scoring=scoring,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
).write()

## Feature engineering

(From now on, we will stop using a train/test split and use cross-validation on the whole dataset instead)  

Let's try to improve performance using the feature `release_date`

**📝 Create `X_engineered` by adding a new column `year` to `X`, containing the release year of the track as `integer`**
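
The preview of the dataset shows `release_date` in at least two formats (`1971` and `2015-02-02`), and how `pd.to_datetime` handles such mixed strings can depend on the pandas version. A minimal alternative sketch is shown here, under the assumption that every value starts with a four-digit year. The two sample values mirror the preview above, and the names `release_date` and `year` are illustrative.

```python
import pandas as pd

# Illustrative values mirroring the two formats visible in the preview above
release_date = pd.Series(["1971", "2015-02-02"])

# Assumption: every value starts with a four-digit year,
# so the first four characters can be converted to an integer directly
year = release_date.str[:4].astype(int)

print(year.tolist())  # [1971, 2015]
```

This avoids date parsing altogether, at the cost of relying on that leading-year assumption, so it is worth checking on the full column before using it.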

In [27]:
import datetime

In [28]:
# Copy so that adding `year` does not also modify `X_simple`
X_engineered = X_simple.copy()

In [75]:
X_engineered['year'] = pd.to_datetime(data['release_date']).dt.year

In [30]:
X_engineered.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
0,0.654,0.499,219827,0.19,0,0.00409,7,0.0898,-16.435,1,0.0454,149.46,0.43,1971
1,0.00592,0.439,483948,0.808,0,0.14,2,0.089,-8.497,1,0.0677,138.04,0.0587,2015
2,0.734,0.523,245693,0.288,0,0.0,0,0.0771,-11.506,1,0.214,75.869,0.464,1968
3,0.429,0.681,130026,0.165,0,0.0,11,0.394,-21.457,0,0.946,145.333,0.288,1926
4,0.562,0.543,129813,0.575,0,4e-06,2,0.127,-7.374,1,0.0265,139.272,0.801,2008


📝 **Let's see how this impacts the performance of our model.**
- Retrain the same simple linear model on numerical values only, adding the new feature `year`
- Save the mean cross-validated performance metric in a variable named `score_engineered` as a `float`
- Do not fine tune the model yet

In [31]:
model = LinearRegression()
cv_results = cross_validate(model, X_engineered, y, cv=5, scoring="neg_root_mean_squared_error")
cv_results

{'fit_time': array([0.01417422, 0.01538706, 0.01657796, 0.01872802, 0.01713729]),
 'score_time': array([0.00209236, 0.0033412 , 0.00294948, 0.00325751, 0.00356793]),
 'test_score': array([-17.4412204 , -17.4317139 , -17.2453717 , -17.32187397,
        -17.06965388])}

In [32]:
score_engineered=cv_results['test_score'].mean()

🧪 **Run the following cell to save your results**

In [33]:
from nbresult import ChallengeResult

ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score_engineered=score_engineered
).write()

## Pipelining

Let's now look for maximum performance by creating a solid preprocessing pipeline.

**📝 Create a sklearn preprocessing [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and store it as `preproc`**

- Feel free to add any preprocessing steps you think of
- You may want to integrate your feature engineering for `year`
- You may also further improve it using the `ArtistPopularityTransformer` class given to you below
- Don't add any model to it yet

🚨 Advice: It is better for you to have a working pipeline (even a simple one) rather than NO pipeline at all

In [124]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
X=data.drop(columns='popularity')
y=data['popularity']

In [125]:
# We are giving you below a custom transformer that you may want to use in your pipeline (make sure you understand it)

import numpy as np  # needed by `transform`, which replaces `np.nan` values
from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature of the test set, the mean popularity of 
    all songs made by the artist on the train set.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        Compute each artist's mean popularity from the popularity of their songs,
        and the global mean popularity over all songs.
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        Attach the artist's mean popularity to each song, falling back to the
        global mean popularity for artists not seen during `fit`.
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]
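
To help understand the transformer above, here is a minimal pandas-only sketch of the same idea on made-up data: learn the mean popularity per artist on a training set, then look it up for new songs and fall back to the global mean for unknown artists. The frames `train` and `test` and their values are illustrative assumptions, and `Series.map` is used here instead of the `merge` call in the class.

```python
import pandas as pd

# Made-up training songs (artist and popularity) and new songs to encode
train = pd.DataFrame({"artist": ["A", "A", "B"], "popularity": [40, 60, 20]})
test = pd.DataFrame({"artist": ["A", "C"]})  # "C" was never seen in training

# What `fit` learns: one mean per artist, plus the global mean
artist_mean = train.groupby("artist")["popularity"].mean()
global_mean = train["popularity"].mean()

# What `transform` produces: the artist's mean, or the global mean if unknown
artist_popularity = test["artist"].map(artist_mean).fillna(global_mean)

print(artist_popularity.tolist())  # [50.0, 40.0]
```

Because the means are learned in `fit`, cross-validation recomputes them on each training fold only, which keeps the validation fold's target values out of the feature.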

In [126]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler


# Scale numerical variables (no imputer here: rows with missing values were dropped during cleaning)
num_transformer = Pipeline([('scaler', StandardScaler())])

preprocnum = ColumnTransformer([
    ('num_transformer', num_transformer, make_column_selector(dtype_include=['float64', 'int64']))])


In [132]:
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Create a transformer that adds the artist popularity
artistpoptrans=ArtistPopularityTransformer()

# Create a transformer that adds the release year
year_calc2=FunctionTransformer(lambda data: pd.DataFrame(pd.to_datetime(data['release_date']).dt.year))


preproc_inter = FeatureUnion([
    ('artistpoptrans', artistpoptrans),
    ('year_calc',year_calc2),
    ('preprocnum', preprocnum),   
])

In [138]:
preproc = Pipeline([('preproc_inter', preproc_inter),('scaler', StandardScaler())])

**📝 Store the number of columns/feature after preprocessing your inputs in a variable `col_number`**

In [140]:
col_number=15

🧪 **Run the following cells to save your results**

In [141]:
# Visually print your preproc
from sklearn import set_config; set_config(display='diagram')
preproc

In [142]:
# Save your preproc
from nbresult import ChallengeResult

ChallengeResult(
    "preprocessing",
    col_number=col_number,
    first_observation = preproc.fit_transform(X, y)[0]
).write()

## Training

📝 **Time to fine tune your models**

- Add an **estimator** to your pipeline (only from Scikit-learn) 

- Train your pipeline and **fine-tune** (optimize) your estimator to maximize prediction score

- You must try to fine tune at least 2 different models: 
    - create one pipeline with a **linear model** of your choice
    - create one pipeline with an **ensemble model** of your choice

Then, 

- Save your two best 5-fold cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`
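
The tuning workflow described above can be sketched on synthetic data as follows. The dataset, the `Ridge` estimator, the step names and the parameter grid are assumptions chosen only to keep the example small and fast; they are not the models tuned below.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, only to make the sketch runnable
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Preprocessing and estimator chained in a single pipeline
toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Hyperparameters of a step are addressed as "<step name>__<parameter>"
search = GridSearchCV(
    toy_pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_toy, y_toy)

best_score = float(search.best_score_)  # mean cross-validated score of the best candidate
best_pipe = search.best_estimator_      # pipeline refit on all the data with the best parameters
print(search.best_params_, best_score)
```

With the default `refit=True`, `best_estimator_` is already a trained `Pipeline`, so it can be stored directly as the tuned pipeline.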

### Linear

In [147]:
from sklearn.linear_model import ElasticNet

In [170]:
final_pipe = Pipeline([
    ('preprocessing', preproc),
    ('linear_regression', ElasticNet())])
final_pipe

In [171]:
cv_results = cross_validate(final_pipe, X, y, scoring = "neg_root_mean_squared_error", cv=5)
print('Neg RMSE: ',cv_results['test_score'].mean())

Neg RMSE:  -14.005544853728978


In [172]:
final_pipe.get_params()

{'memory': None,
 'steps': [('preprocessing', Pipeline(steps=[('preproc_inter',
                    FeatureUnion(transformer_list=[('artistpoptrans',
                                                    ArtistPopularityTransformer()),
                                                   ('year_calc',
                                                    FunctionTransformer(func=<function <lambda> at 0x7f63f120ff70>)),
                                                   ('preprocnum',
                                                    ColumnTransformer(transformers=[('num_transformer',
                                                                                     Pipeline(steps=[('scaler',
                                                                                                      StandardScaler())]),
                                                                                     <sklearn.compose._column_transformer.make_column_selector object at 0x7f63f147d880>)]))])),
 

In [173]:
from sklearn.model_selection import GridSearchCV

# Instantiate grid search
grid_search = GridSearchCV(
    final_pipe, 
    param_grid={
        'linear_regression__alpha': [0.05, 0.1, 0.15],
        'linear_regression__l1_ratio': [0.1, 0.2, 0.3]},
    cv=5,
    scoring="neg_root_mean_squared_error")

grid_search.fit(X, y)
grid_search.best_params_

{'linear_regression__alpha': 0.15, 'linear_regression__l1_ratio': 0.1}

In [174]:
grid_search.best_score_

-13.309793641587657

In [175]:
score_linear=grid_search.best_score_

### Ensemble

In [159]:
from sklearn.ensemble import GradientBoostingRegressor

In [160]:
final_pipe_2 = Pipeline([
    ('preprocessing', preproc),
    ('GradientBoost', GradientBoostingRegressor())])
final_pipe_2

In [161]:
cv_results = cross_validate(final_pipe_2, X, y, scoring = "neg_root_mean_squared_error", cv=5)
print('Neg RMSE: ',cv_results['test_score'].mean())

Neg RMSE:  -12.478929695470399


In [162]:
final_pipe_2.get_params()

{'memory': None,
 'steps': [('preprocessing', Pipeline(steps=[('preproc_inter',
                    FeatureUnion(transformer_list=[('artistpoptrans',
                                                    ArtistPopularityTransformer()),
                                                   ('year_calc',
                                                    FunctionTransformer(func=<function <lambda> at 0x7f63f120ff70>)),
                                                   ('preprocnum',
                                                    ColumnTransformer(transformers=[('num_transformer',
                                                                                     Pipeline(steps=[('scaler',
                                                                                                      StandardScaler())]),
                                                                                     <sklearn.compose._column_transformer.make_column_selector object at 0x7f63f147d880>)]))])),
 

In [164]:
from sklearn.model_selection import GridSearchCV

# Instantiate grid search
grid_search = GridSearchCV(
    final_pipe_2, 
    param_grid={
        'GradientBoost__n_estimators': [50, 100, 150],
        'GradientBoost__learning_rate': [0.05, 0.1, 0.2]},
    cv=5,
    scoring="neg_root_mean_squared_error")

grid_search.fit(X, y)
grid_search.best_params_

{'GradientBoost__learning_rate': 0.2, 'GradientBoost__n_estimators': 150}

In [168]:
grid_search.best_score_

-12.25585050949678

In [169]:
score_ensemble=grid_search.best_score_

🧪 **Run the following cells to save your results**

In [176]:
score_linear

-13.309793641587657

In [177]:
score_ensemble

-12.25585050949678

In [165]:
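# Keep the tuned pipelines: set the best hyperparameters found by each grid search, then refit on the full dataset
pipe_linear = final_pipe.set_params(linear_regression__alpha=0.15, linear_regression__l1_ratio=0.1).fit(X, y)
pipe_ensemble = final_pipe_2.set_params(GradientBoost__learning_rate=0.2, GradientBoost__n_estimators=150).fit(X, y)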

In [166]:
# Print below your best pipe for correction purpose
from sklearn import set_config; set_config(display='diagram')
pipe_linear

In [167]:
# Print below your best pipe for correction purpose
pipe_ensemble

In [178]:
from nbresult import ChallengeResult

ChallengeResult("model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow instructions

**This final part is independent from the above notebook**

In [None]:
https://certif-33vjx73zaq-ew.a.run.app/predict?acousticness=0.654&danceability=0.499&duration_ms=219827&energy=0.19&explicit=0&id=0B6BeEUd6UwFlbsHMQKjob&instrumentalness=0.00409&key=7&liveness=0.0898&loudness=-16.435&mode=1&name=Back%20in%20the%20Goodle%20Days&release_date=1971&speechiness=0.0454&tempo=149.46&valence=0.43&artist=John%20Hartford
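
The request URL above can also be assembled programmatically rather than typed by hand. The sketch below only builds the string with the standard library and does not send any request; whether the endpoint is still reachable is not something this example assumes. The names `base_url`, `params` and `url` are illustrative.

```python
from urllib.parse import quote, urlencode

base_url = "https://certif-33vjx73zaq-ew.a.run.app/predict"

# Feature values of the first track shown at the top of the notebook
params = {
    "acousticness": 0.654,
    "danceability": 0.499,
    "duration_ms": 219827,
    "energy": 0.19,
    "explicit": 0,
    "id": "0B6BeEUd6UwFlbsHMQKjob",
    "instrumentalness": 0.00409,
    "key": 7,
    "liveness": 0.0898,
    "loudness": -16.435,
    "mode": 1,
    "name": "Back in the Goodle Days",
    "release_date": 1971,
    "speechiness": 0.0454,
    "tempo": 149.46,
    "valence": 0.43,
    "artist": "John Hartford",
}

# quote_via=quote encodes spaces as %20 (the default would use "+")
url = f"{base_url}?{urlencode(params, quote_via=quote)}"
print(url)
```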