# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

**📝 Load the `spotify_popularity_train.csv` dataset from the provided URL. Display the first few rows. Perform the usual cleaning operations. Store the result in a `DataFrame` named `data`.**

👉 Do not forget to clean the dataset

In [1]:
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"

In [2]:
import pandas as pd
data = pd.read_csv(url)
data.head(3)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,artist
0,0.654,0.499,219827,0.19,0,0B6BeEUd6UwFlbsHMQKjob,0.00409,7,0.0898,-16.435,1,Back in the Goodle Days,40,1971,0.0454,149.46,0.43,John Hartford
1,0.00592,0.439,483948,0.808,0,5Gpx4lJy3vKmIvjwbiR5c8,0.14,2,0.089,-8.497,1,Worlds Which Break Us - Intro Mix,22,2015-02-02,0.0677,138.04,0.0587,Driftmoon
2,0.734,0.523,245693,0.288,0,7MxuUYqrCIy93h1EEHrIrL,0.0,0,0.0771,-11.506,1,I'm The Greatest Star,40,1968-09-01,0.214,75.869,0.464,Barbra Streisand


In [3]:
#remove null
data=data.dropna()
data=data.drop_duplicates()

In [4]:
#standardscaler for numerical data 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cols = list(data[['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo','valence']].columns)
data[cols]=scaler.fit_transform(data[cols])

### Save your results

Run the following cell to save your results.

In [5]:
from nbresult import ChallengeResult

ChallengeResult(
    "c5_data_cleaning",
    data=data).write()

## Baseline

**📝 We want to use a metric that measures the prediction error in the same unit than `popularity`. In addition, it should strongly penalize largest errors. Which sklearn's [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use? Store its exact name as string below**

In [6]:
scoring = "neg_mean_squared_error"

**📝 Let's build a baseline model using only the numerical features in our dataset.**
- Build `X_baseline` with only numerical features
- Build `y` your target containing the `popularity`
- Then 5 times cross validate the baseline linear model of your choice (do not fine tune it)
- Store your mean performance in a `float` variable named `baseline_score`

In [7]:
# Prepare X and y
X_baseline = data.drop(['popularity','id','name','artist','release_date'], axis=1)
y=data.popularity

In [8]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


baseline_model = DummyRegressor(strategy="median") # Baseline model that predicts "median"

cv_results = cross_validate(baseline_model,X_baseline,y,scoring=scoring) # cross validate baseline

baseline_score=cv_results['test_score'].mean()

### Save your results

Run the following cell to save your results.

In [9]:
from nbresult import ChallengeResult

ChallengeResult(
    "baseline",
    scoring=scoring,
    baseline_score=baseline_score).write()

## Feature engineering

Let's now use the features that we left aside: `release_date` and `artist` to improve the performance of our model. We'll create them manually in a train vs. test context first (and pipeline them later)

### holdout
**📝 Create the 4 variables `X_train` `y_train`, `X_test`, `y_test` with a 50% split with random sampling**

In [10]:
from sklearn.model_selection import train_test_split
X=data.drop("popularity", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

### year

**📝 Create `X_train_year` and `X_test_year` by adding the new column `year` containing the release year of the track as integer**

In [11]:
X_train_year = X_train
X_train_year['year'] = pd.to_datetime(X_train_year['release_date']).dt.year
X_test_year = X_test
X_test_year['year'] = pd.to_datetime(X_test_year['release_date']).dt.year

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train_year['year'] = pd.to_datetime(X_train_year['release_date']).dt.year
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test_year['year'] = pd.to_datetime(X_test_year['release_date']).dt.year


### artist

How could we use the `artist` column? There are too many artists to one hot encode it.  
We could instead create an `artist_popularity` feature containing the mean popularity of an artist, computed as the mean popularity of all tracks the artist released _on the train set_.

#### Process artist popularity from the Training set

**📝 Compute and store the `artist_popularity` as a new pandas `Series`**  

In [12]:
df_train = pd.concat([X_train, y_train], axis=1)
artist_popularity = df_train[['artist', 'popularity']].groupby(by='artist').mean()
artist_popularity = artist_popularity.rename(columns={'popularity': 'artist_popularity'})

In [13]:
df_tmp=X_train_year.join(artist_popularity, on='artist')

#### Apply the artist popularity to `X_train_year`

**📝 Create a new DataFrame `X_train_engineered` which adds a new column to the existing `X_train_year` with the `artist_popularity` corresponding to the song's artist.** 

🚨 Make sure that the target `popularity` does not end up in `X_train_engineered` 

In [14]:
X_train_engineered=df_tmp.drop(["release_date",'id','artist_popularity','artist','name'],axis=1)

#### Apply the artist popularity to `X_test_year`

**📝 Similarily, create a new DataFrame `X_test_engineered` which also adds a new column to the existing `X_test_year` with the `artist_popularity` corresponding to the song's artist, computed from the training set.**

🚨**If an artist has never been seen in the training set, use the global mean popularity of all the tracks of `X_train`**

In [15]:
df_test = pd.concat([X_test, y_test], axis=1)
artist_popularity = df_test[['artist', 'popularity']].groupby(by='artist').mean()
artist_popularity = artist_popularity.rename(columns={'popularity': 'artist_popularity'})
df_tmp=X_test_year.join(artist_popularity, on='artist')

In [16]:
df_tmp['artist_popularity'] = df_tmp['artist_popularity'].fillna(df_tmp.artist_popularity.mean())

X_test_engineered=df_tmp.drop(["release_date",'id','artist_popularity','artist','name'],axis=1)

### Save your results

Run the following cell to save your results.

In [17]:
from nbresult import ChallengeResult

_ = pd.concat([X_train_engineered, X_test_engineered])

ChallengeResult("c7_feature_engineering",
    shape = _.shape,
    cols = _.columns,
    years = _.get("year"),
    popularities = _.get("artist_popularity"),
).write()

### Performance

**📝 Let's see how these features impact the performance of our model. Retrain the same baseline model on numerical values only, but adding the new features `year` and `artist_popularity`, and see how the performance is impacted. Save the performance in a `float` variable named `score_engineered`**

👉 Do not fine tune the model yet

In [18]:
X_train_engineered

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
33174,-1.264363,1.810924,-0.417458,0.201881,0,-0.112872,5,-0.448203,-0.351761,1,-0.039869,-1.451902,0.558200,1980
51002,0.541802,0.586549,-0.658415,0.363038,0,0.609371,7,0.122911,0.483385,1,-0.178254,-0.329057,1.555106,1995
3101,1.276380,-1.227339,-0.200154,-0.362168,0,-0.585534,2,0.167269,0.984965,1,-0.392669,-0.076530,-0.787432,1947
39993,-1.208019,1.850603,-0.012120,1.088245,0,-0.585501,5,-0.481471,0.561686,0,-0.151452,0.193641,0.906928,1994
21951,0.431221,-0.059648,1.268283,0.828196,0,-0.287290,10,0.122911,0.333456,1,-0.331955,0.413360,1.187426,1954
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11305,0.878813,-0.660499,0.159148,0.575473,0,-0.262715,7,-0.548009,0.060984,1,-0.324297,-0.354796,-0.817757,1968
44935,-1.310699,-1.250012,-1.394804,0.656051,0,-0.586373,5,0.377971,0.201609,1,-0.267958,0.649896,-0.472820,2017
38313,-1.254621,0.070725,-0.192855,1.366607,0,-0.586376,4,-0.751503,0.632612,1,-0.370790,0.758202,0.887975,2020
860,1.250051,-0.224032,0.139060,-0.424434,0,2.140767,1,-0.254135,-0.523461,1,-0.385011,0.833303,1.183635,1953


In [19]:
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


baseline_model = DummyRegressor(strategy="median") # Baseline model that predicts "median"

cv_results = cross_validate(baseline_model,X_train_engineered,y_train,scoring=scoring) # cross validate baseline

score_engineered=cv_results['test_score'].mean()

### Save your results

Run the following cell to save your results.

In [20]:
from nbresult import ChallengeResult

ChallengeResult(
    "c7_score_engineering",
    scoring=scoring,
    score_engineered=score_engineered).write()

## Pipelining

**📝 Let's create a full sklearn preprocessing pipeline called `preproc`. It should integrate our feature engineering for `year` and `artist_popularity`, as well as any other preprocessing of your choice**

**Store also the number of columns/feature after preprocessing your inputs in a variable `col_number`**

**🚨⚠️ Advice: SKIP the `ArtistPopularityTransformer` if you don't have time to do it. It is better for you to have a working pipeline rather than NO pipeline at all**

In [21]:
# 👉 Do not hesitate to reload clean new dataset if you need a fresh start
y = data.popularity
X = data.drop(['popularity','id','name'],axis=1)

In [22]:
# Run this cell to visualize your pipeline as you build it
from sklearn import set_config; set_config(display='diagram')

In [23]:
from sklearn.base import BaseEstimator, TransformerMixin

In [24]:
class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """
        # process artist popularity
        self.artist_popularity = pd.concat([X, y],
                               axis=1,
                               join='inner').groupby(by='artist').mean()['popularity']
        # process mean popularity
        self.mean_train_popularity = y.mean()
        
        return self
    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """
        # join other features
        X_copy = X.join(self.artist_popularity, on='artist').rename(columns={"popularity": "artist_popularity"}).drop(columns='artist')
        
        # fills popularity of unknown artists with song global mean popularity
        X_copy['artist_popularity'].fillna(self.mean_train_popularity, inplace=True)
        return X_copy # TODO return X_copy

In [25]:
class to_year(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        """
        Transform the release_dat in year (int)
        """
        X['year'] = X['release_date'].map(lambda x: x[:4]).astype(int)
        X_copy = X.drop(columns='release_date')
        return X_copy # TODO return X_copy

In [26]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

year_cols = ['release_date']
artist_cols=['artist']

#preprocessing
preproc = make_column_transformer(
            (StandardScaler(), ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo','valence']),
            (ArtistPopularityTransformer(),['artist']),
            (to_year(),['release_date']))



col_number=len(list(pd.DataFrame(preproc.fit_transform(X,y)).columns))


In [27]:
# cols = list(data[['duration_ms','energy','loudness','tempo']].columns)

# num_transformer = Pipeline([('scaler', StandardScaler())])

# preprocessor = ColumnTransformer([
#     ('num_transformer', num_transformer, cols)])

# rf = RandomForestClassifier(max_depth=40,
#  min_samples_split=2,
#  n_estimators=500,
#  n_jobs=-1,
#  random_state=42)

# pipe = Pipeline([
#     ('preprocessing', preprocessor),
#     ('rf', rf)])

# pipe

#### Save your results

Run the following cell to save your results.

In [28]:
# Print below your preproc here for the correctors
from sklearn import set_config; set_config(display='diagram')
preproc

In [29]:
from nbresult import ChallengeResult

ChallengeResult(
    "c6_preprocessing",
    col_number=col_number
).write()

## Training

📝 Time to optimize 

- **Add an estimator to your pipeline (only from scikit-learn)** 

- **Train your pipeline and fine-tune (optimize) your estimator to get the best prediction score**

- **You must create 2 pipelines (one with a linear model, one with an ensemble model)**

Then, 

- Save your two best 5-time cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`

### Linear

### Ensemble

In [30]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.ensemble import RandomForestRegressor
# parameters = {
#     'n_estimators': [100, 200, 300, 500],
#     'min_samples_split' : [2, 4, 5],
#     'max_depth': [15, 20, 30, 40, 50],
#     'n_jobs': [-1],
#     'random_state':[42],
# }


# rf = RandomForestRegressor()
# gscv_rf = GridSearchCV(rf, parameters)
# rf_model = gscv_rf.fit(X_train, y_train)

In [31]:
# rf_model.best_params_

In [32]:
# from sklearn.ensemble import RandomForestRegressor
# rf = RandomForestRegressor(max_depth=40,
#  min_samples_split=2,
#  n_estimators=500,
#  n_jobs=-1,
#  random_state=42)

# cv_results = cross_validate(rf, X_train_engineered, y_train, cv=5, scoring=scoring)

### Save your results

Run the following cell to save your results.

In [33]:
# Print below your best pipe for correction purpose
from sklearn import set_config; set_config(display='diagram')

year_cols = ['release_date']
artist_cols=['artist']

#preprocessing
preproc = make_column_transformer(
            (StandardScaler(), ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo','valence']),
            (ArtistPopularityTransformer(),['artist']),
            (to_year(),['release_date']),remainder='passthrough'
)

lr=LinearRegression()


pipe_linear = Pipeline([
    ('preprocessing', preproc),
    ('linearRegression', lr)])

pipe_linear

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)

pipe_linear_trained = pipe_linear.fit(X_train,y_train)
# Make predictions
y_pred=pipe_linear_trained.predict(X_test)
#mse score
score_linear=mean_squared_error(y_test, y_pred)

In [35]:
# Print below your best pipe for correction purpose
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
year_cols = ['release_date']
artist_cols=['artist']

#preprocessing
preproc = make_column_transformer(
            (StandardScaler(), ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo','valence']),
            (ArtistPopularityTransformer(),['artist']),
            (to_year(),['release_date']),remainder='passthrough'
)

rf = RandomForestRegressor(max_depth=40,
 min_samples_split=2,
 n_estimators=500,
 n_jobs=-1,
 random_state=42)

pipe_ensemble = Pipeline([
    ('preprocessing', preproc),
    ('rf', rf)])

pipe_ensemble

In [36]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)

pipe_ensemble_trained = pipe_ensemble.fit(X_train,y_train)
# Make predictions
y_pred=pipe_ensemble_trained.predict(X_test)
#mse score
score_ensemble=mean_squared_error(y_test, y_pred)

In [38]:
from nbresult import ChallengeResult

ChallengeResult("c8_c9_c11_c13_model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()