# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

**📝 Load the `spotify_popularity_train.csv` dataset from the provided URL. Display the first few rows. Perform the usual cleaning operations. Store the result in a `DataFrame` named `data`.**

👉 Do not forget to clean the dataset

In [1]:
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"

In [2]:
import pandas as pd
data = pd.read_csv(url)
data.head(3)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand


In [3]:
data.dropna(inplace=True)

In [14]:
data.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence
count,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0
mean,0.497207,0.536677,0.047244,0.484487,0.069581,0.196626,5.193489,0.211652,0.757173,0.705274,25.732571,0.106073,0.480855,0.524564
std,0.380006,0.17653,0.029892,0.273274,0.254442,0.334348,3.526633,0.180203,0.089356,0.455924,21.862463,0.182482,0.124215,0.263814
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0852,0.413,0.033446,0.249,0.0,0.0,2.0,0.0995,0.707471,0.0,1.0,0.0352,0.386063,0.312
50%,0.515,0.548,0.041761,0.469,0.0,0.000475,5.0,0.139,0.771429,1.0,26.0,0.0457,0.476266,0.537
75%,0.893,0.669,0.054298,0.714,0.0,0.247,8.0,0.272,0.824156,1.0,42.0,0.0768,0.554871,0.742
max,0.996,0.986,1.0,1.0,1.0,1.0,11.0,0.999,1.0,1.0,96.0,0.97,1.0,1.0


In [5]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

In [6]:
data['duration_ms'] = mms.fit_transform(data[['duration_ms']])
data['loudness'] = mms.fit_transform(data[['loudness']])
data['tempo'] = mms.fit_transform(data[['tempo']])

In [16]:
data.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence
count,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0,52313.0
mean,0.497207,0.536677,0.047244,0.484487,0.069581,0.196626,5.193489,0.211652,0.757173,0.705274,25.732571,0.106073,0.480855,0.524564
std,0.380006,0.17653,0.029892,0.273274,0.254442,0.334348,3.526633,0.180203,0.089356,0.455924,21.862463,0.182482,0.124215,0.263814
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0852,0.413,0.033446,0.249,0.0,0.0,2.0,0.0995,0.707471,0.0,1.0,0.0352,0.386063,0.312
50%,0.515,0.548,0.041761,0.469,0.0,0.000475,5.0,0.139,0.771429,1.0,26.0,0.0457,0.476266,0.537
75%,0.893,0.669,0.054298,0.714,0.0,0.247,8.0,0.272,0.824156,1.0,42.0,0.0768,0.554871,0.742
max,0.996,0.986,1.0,1.0,1.0,1.0,11.0,0.999,1.0,1.0,96.0,0.97,1.0,1.0


### Save your results

Run the following cell to save your results.

In [17]:
from nbresult import ChallengeResult

ChallengeResult(
    "c5_data_cleaning",
    data=data).write()

## Baseline

**📝 We want to use a metric that measures the prediction error in the same unit than `popularity`. In addition, it should strongly penalize largest errors. Which sklearn's [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use? Store its exact name as string below**

In [39]:
scoring = "neg_mean_squared_error"

**📝 Let's build a baseline model using only the numerical features in our dataset.**
- Build `X_baseline` with only numerical features
- Build `y` your target containing the `popularity`
- Then 5 times cross validate the baseline linear model of your choice (do not fine tune it)
- Store your mean performance in a `float` variable named `baseline_score`

In [40]:
# all columns list here
baseline_cols = list(data)

In [41]:
X_baseline = data.drop(columns=['id','name','popularity','release_date','artist','key'])

In [42]:
X_baseline

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,liveness,loudness,mode,speechiness,tempo,valence
0,0.65400,0.499,0.044604,0.190,0,0.004090,0.0898,0.683437,1,0.0454,0.613781,0.4300
1,0.00592,0.439,0.099696,0.808,0,0.140000,0.0890,0.807966,1,0.0677,0.566883,0.0587
2,0.73400,0.523,0.049999,0.288,0,0.000000,0.0771,0.760762,1,0.2140,0.311568,0.4640
3,0.42900,0.681,0.025872,0.165,0,0.000000,0.3940,0.604653,0,0.9460,0.596833,0.2880
4,0.56200,0.543,0.025828,0.575,0,0.000004,0.1270,0.825584,1,0.0265,0.571942,0.8010
...,...,...,...,...,...,...,...,...,...,...,...,...
52312,0.16400,0.512,0.010484,0.907,0,0.004870,0.8010,0.818838,1,0.6620,0.351592,0.3150
52313,0.77300,0.533,0.038974,0.659,0,0.773000,0.1130,0.798240,0,0.0426,0.650355,0.6140
52314,0.45600,0.548,0.063588,0.568,0,0.000000,0.0892,0.857367,1,0.0275,0.318245,0.3380
52315,0.96500,0.360,0.043908,0.132,0,0.000000,0.1260,0.611603,1,0.0355,0.332266,0.4100


In [43]:
y = data['popularity']

In [44]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

In [45]:
baseline_model = DummyRegressor(strategy="median")

In [46]:
cv_results = cross_validate(baseline_model,X_baseline, y, cv=5, scoring=scoring) # cross validate baseline

cv_results['test_score'].mean()

-478.27094112282094

In [47]:
cv_results

{'fit_time': array([0.00951004, 0.00563216, 0.00596476, 0.00662684, 0.00779891]),
 'score_time': array([0.00081396, 0.00082374, 0.00063896, 0.00074196, 0.0010581 ]),
 'test_score': array([-480.54257861, -474.01271146, -477.51457517, -481.27499522,
        -478.00984515])}

In [48]:
baseline_score = float(cv_results['test_score'].mean())
baseline_score

-478.27094112282094

### Save your results

Run the following cell to save your results.

In [49]:
from nbresult import ChallengeResult

ChallengeResult(
    "baseline",
    scoring=scoring,
    baseline_score=baseline_score).write()

## Feature engineering

Let's now use the features that we left aside: `release_date` and `artist` to improve the performance of our model. We'll create them manually in a train vs. test context first (and pipeline them later)

### holdout
**📝 Create the 4 variables `X_train` `y_train`, `X_test`, `y_test` with a 50% split with random sampling**

In [111]:
X = data.drop(columns=['name','id'])
y = data['popularity']

In [112]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

In [113]:
X_train.shape, y_train.shape

((26156, 16), (26156,))

### year

**📝 Create `X_train_year` and `X_test_year` by adding the new column `year` containing the release year of the track as integer**

In [114]:
def str_year(date_string):
    year = date_string.split('-')[0]
    return int(year)

In [115]:
X_train['release_date']

14404    1985-01-01
8033     1990-09-16
47905    1936-12-23
30990    2019-07-16
42236    1970-03-23
            ...    
24003    1937-01-01
24612    2017-11-17
37578    1944-12-31
28659          1954
49009    1980-09-29
Name: release_date, Length: 26156, dtype: object

In [117]:
X_train['year'] = X_train['release_date'].apply(lambda x: str_year(x))
X_test['year'] = X_test['release_date'].apply(lambda x: str_year(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['year'] = X_train['release_date'].apply(lambda x: str_year(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['year'] = X_test['release_date'].apply(lambda x: str_year(x))


In [118]:
X_train_year = X_train.drop(columns='release_date')

In [119]:
X_test_year = X_test.drop(columns='release_date')

In [126]:
X_train_year 

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,artist,year
14404,0.13200,0.837,0.045733,0.377,0,0.010500,7,0.1700,0.710530,0,48,0.1110,0.500487,0.6350,Fito Paez,1985
8033,0.16000,0.791,0.051963,0.805,0,0.058300,2,0.0563,0.781219,1,36,0.0464,0.501912,0.8820,Gloria Estefan,1990
47905,0.22900,0.718,0.025494,0.247,0,0.000000,1,0.2110,0.618850,1,0,0.9490,0.519866,0.5760,Tadeusz Dolega Mostowicz,1936
30990,0.00003,0.621,0.121735,0.691,0,0.774000,1,0.1130,0.805582,1,2,0.0314,0.521632,0.0395,Suffused,2019
42236,0.62900,0.566,0.034469,0.318,0,0.000060,7,0.1070,0.708114,1,21,0.0344,0.472233,0.7040,Conway Twitty,1970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24003,0.99500,0.280,0.018628,0.219,0,0.001860,4,0.3620,0.714483,1,0,0.0378,0.358914,0.5590,Κος Χρήστος,1937
24612,0.00473,0.795,0.099804,0.992,0,0.807000,1,0.2230,0.817693,1,0,0.0419,0.533849,0.3030,Gravity One,2017
37578,0.99300,0.679,0.039179,0.356,0,0.855000,8,0.2610,0.817897,1,0,0.0346,0.429959,0.7870,Amirbai Karnataki,1944
28659,0.50300,0.464,0.030428,0.511,0,0.010700,3,0.1500,0.810492,1,21,0.0422,0.617407,0.7270,The Modern Jazz Quartet,1954


### artist

How could we use the `artist` column? There are too many artists to one hot encode it.  
We could instead create an `artist_popularity` feature containing the mean popularity of an artist, computed as the mean popularity of all tracks the artist released _on the train set_.

#### Process artist popularity from the Training set

**📝 Compute and store the `artist_popularity` as a new pandas `Series`**  

In [127]:
artist_pop = X_train[['artist','popularity']].groupby(['artist']).mean()

In [128]:
artist_pop

Unnamed: 0_level_0,popularity
artist,Unnamed: 1_level_1
"""Test for Victor Young""",3.0
"""Weird Al"" Yankovic",32.4
$tar$eed,0.0
$uicideBoy$,68.0
*NSYNC,44.0
...,...
羅大佑,38.0
葛蘭,0.0
鈴木 弘,35.0
須田景凪,68.0


In [129]:
artist_popularity = artist_pop['popularity']

#### Apply the artist popularity to `X_train_year`

**📝 Create a new DataFrame `X_train_engineered` which adds a new column to the existing `X_train_year` with the `artist_popularity` corresponding to the song's artist.** 

🚨 Make sure that the target `popularity` does not end up in `X_train_engineered` 

In [131]:
X_train_engineered = X_train_year.drop(columns='popularity').join(artist_pop, on='artist')

In [135]:
X_train_engineered.rename(columns={'popularity':'artist_popularity'}, inplace=True)

In [136]:
X_train_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,artist,year,artist_popularity
14404,0.13200,0.837,0.045733,0.377,0,0.010500,7,0.1700,0.710530,0,0.1110,0.500487,0.6350,Fito Paez,1985,55.666667
8033,0.16000,0.791,0.051963,0.805,0,0.058300,2,0.0563,0.781219,1,0.0464,0.501912,0.8820,Gloria Estefan,1990,40.166667
47905,0.22900,0.718,0.025494,0.247,0,0.000000,1,0.2110,0.618850,1,0.9490,0.519866,0.5760,Tadeusz Dolega Mostowicz,1936,0.000000
30990,0.00003,0.621,0.121735,0.691,0,0.774000,1,0.1130,0.805582,1,0.0314,0.521632,0.0395,Suffused,2019,5.142857
42236,0.62900,0.566,0.034469,0.318,0,0.000060,7,0.1070,0.708114,1,0.0344,0.472233,0.7040,Conway Twitty,1970,24.909091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24003,0.99500,0.280,0.018628,0.219,0,0.001860,4,0.3620,0.714483,1,0.0378,0.358914,0.5590,Κος Χρήστος,1937,0.000000
24612,0.00473,0.795,0.099804,0.992,0,0.807000,1,0.2230,0.817693,1,0.0419,0.533849,0.3030,Gravity One,2017,0.000000
37578,0.99300,0.679,0.039179,0.356,0,0.855000,8,0.2610,0.817897,1,0.0346,0.429959,0.7870,Amirbai Karnataki,1944,0.000000
28659,0.50300,0.464,0.030428,0.511,0,0.010700,3,0.1500,0.810492,1,0.0422,0.617407,0.7270,The Modern Jazz Quartet,1954,16.500000


#### Apply the artist popularity to `X_test_year`

**📝 Similarily, create a new DataFrame `X_test_engineered` which also adds a new column to the existing `X_test_year` with the `artist_popularity` corresponding to the song's artist, computed from the training set.**

🚨**If an artist has never been seen in the training set, use the global mean popularity of all the tracks of `X_train`**

In [143]:
X_test_engineered = X_test_year.drop(columns='popularity').join(artist_pop, on='artist').rename(columns={'popularity':'artist_popularity'})

In [145]:
X_test_engineered = X_test_engineered.drop(columns='artist')
X_test_engineered['artist_popularity'] = X_test_engineered['artist_popularity'].fillna(X_train_engineered.artist_popularity.mean())
X_test_engineered

KeyError: "['artist'] not found in axis"

### Save your results

Run the following cell to save your results.

In [None]:
from nbresult import ChallengeResult

_ = pd.concat([X_train_engineered, X_test_engineered])

ChallengeResult("c7_feature_engineering",
    shape = _.shape,
    cols = _.columns,
    years = _.get("year"),
    popularities = _.get("artist_popularity"),
).write()

### Performance

**📝 Let's see how these features impact the performance of our model. Retrain the same baseline model on numerical values only, but adding the new features `year` and `artist_popularity`, and see how the performance is impacted. Save the performance in a `float` variable named `score_engineered`**

👉 Do not fine tune the model yet

### Save your results

Run the following cell to save your results.

In [None]:
from nbresult import ChallengeResult

ChallengeResult(
    "c7_score_engineering",
    scoring=scoring,
    score_engineered=score_engineered).write()

## Pipelining

**📝 Let's create a full sklearn preprocessing pipeline called `preproc`. It should integrate our feature engineering for `year` and `artist_popularity`, as well as any other preprocessing of your choice**

**Store also the number of columns/feature after preprocessing your inputs in a variable `col_number`**

**🚨⚠️ Advice: SKIP the `ArtistPopularityTransformer` if you don't have time to do it. It is better for you to have a working pipeline rather than NO pipeline at all**

In [None]:
# 👉 Do not hesitate to reload clean new dataset if you need a fresh start
y = data.popularity
X = data.drop("popularity", axis=1)

In [None]:
# Run this cell to visualize your pipeline as you build it
from sklearn import set_config; set_config(display='diagram')

In [None]:
# We give you below the skeleton of the custom ArtistPopularityTransformer to complete

from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity

        # process mean popularity

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity

        # fills popularity of unknown artists with song global mean popularity

        return # TODO return X_copy

#### Save your results

Run the following cell to save your results.

In [None]:
# Print below your preproc here for the correctors
from sklearn import set_config; set_config(display='diagram')
preproc

In [None]:
from nbresult import ChallengeResult

ChallengeResult(
    "c6_preprocessing",
    col_number=col_number
).write()

## Training

📝 Time to optimize 

- **Add an estimator to your pipeline (only from scikit-learn)** 

- **Train your pipeline and fine-tune (optimize) your estimator to get the best prediction score**

- **You must create 2 pipelines (one with a linear model, one with an ensemble model)**

Then, 

- Save your two best 5-time cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`

### Linear

### Ensemble

### Save your results

Run the following cell to save your results.

In [None]:
# Print below your best pipe for correction purpose
from sklearn import set_config; set_config(display='diagram')
pipe_linear

In [None]:
# Print below your best pipe for correction purpose
pipe_ensemble

In [None]:
from nbresult import ChallengeResult

ChallengeResult("c8_c9_c11_c13_model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow instructions

**This final part is independent from the above notebook**