**Author**: Inès Multrier  
**Contact**: ines.multrier@polytechnique.edu

In this Notebook, I will try to predict the popularity of songs in the Spotify Dataset, available [here](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks). 

My approach is based on splitting the dataset by year, to stay as close to a real use case as possible and to avoid any data leakage.

This work is a part of the recruitment process for an internship at [Illuin Technology](https://www.illuin.tech/en/).

All intermediary data should be placed in a sub-folder called 'data' if you want to run the notebook as it is.

The outline of the document is: 
1. Imports and data cleaning
2. EDA and data visualisation 
3. Model construction
4. Further explorations (model fine-tuning)
5. Evaluation on the test set
6. Conclusion
7. Bonus!

# 1. Imports and data cleaning

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

In [None]:
# Data visualization 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Scikit-Learn imports

## Preprocessing and feature extraction
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

## Models
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor

## Model selection and fine tuning
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error

In [None]:
# XGBoost import
from xgboost import XGBRegressor

In [None]:
# Tensorflow Keras imports
from tensorflow.keras import Sequential, layers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
# Utils 
from scipy.stats import uniform, randint

In [None]:
# Each row represents a single track, each column represents a field of the track (audio features and identifiers)
data = pd.read_csv("../input/spotify-dataset-19212020-160k-tracks/data.csv") 

# This file is an extension to the "data_by_artist.csv" file with genres implementation for each artist. 
# Each row represents a single artist, each column represents an audio feature.
data_w_genres = pd.read_csv('../input/spotify-dataset-19212020-160k-tracks/data_w_genres.csv') 

In [None]:
data.head()

In [None]:
data.info()

We sense that `release_date` will not have much more added value than the year songs were released. We will thus drop this column. We have to find out whether there may be duplicates in the dataset in order to understand if we can only keep the name as a primary key. 

**We thus define a duplicate as several songs appearing in the dataset, for which the song title and the artist are strictly identical.** 

In order to identify these possible duplicates, we create the `artists+name` column.

In [None]:
# Vectorized string concatenation, equivalent to a row-wise apply but much faster
data['artists+name'] = data['artists'] + data['name']

In [None]:
df = data[data['artists+name'].duplicated()]

In [None]:
df.shape

There are thus 14,948 duplicates. Let us look at a concrete example.

In [None]:
data[data['name']=='champagne problems']

This is problematic. These four lines obviously represent the same song ("Champagne problems" by Taylor Swift, issued around the end of 2020), but **they have very different popularity ratings**: the latter vary from 54 to 85!

**Info on popularity score**

(according to Spotify for Developers - Documentation / https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-tracks/)

The `popularity` of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

According to this information, one solution is to replace the popularity of duplicates by the maximum of their popularity ratings. Indeed, popularity is a "positive" metric: a song's score only decreases when the song stops being played for a while. Thus, if a song has three identical duplicates and one of them has a higher popularity score, that entry has been played more and, in particular, more recently. Its score is therefore more relevant than the other two for monitoring this song's popularity.
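The "keep the most popular entry" rule can also be expressed with a `groupby`. The block below is only a sketch on a toy dataframe (the three rows for `'song'` and their scores are made up), not the code used in the rest of this notebook.

```python
import pandas as pd

# Toy data: one artist/title pair appears three times, another one once.
toy = pd.DataFrame({
    "artists": ["['A']", "['A']", "['A']", "['B']"],
    "name": ["song", "song", "song", "other"],
    "popularity": [54, 85, 60, 10],
})

# Same key as in the notebook: artist string concatenated with the title.
toy["artists+name"] = toy["artists"] + toy["name"]

# For each key, idxmax returns the row label of the highest popularity;
# selecting those labels keeps exactly one row per artist/title pair.
best_rows = toy.groupby("artists+name")["popularity"].idxmax()
deduplicated = toy.loc[best_rows].reset_index(drop=True)

print(deduplicated[["artists", "name", "popularity"]])
```

When several duplicates share the maximum score, `idxmax` keeps the first one, which matches the `index[0]` choice made in the loop below.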

In [None]:
# We gather the list of indices corresponding to the maximum popularity for each duplicated artist/song pair.
indices = []

for name in df['artists+name'].unique():
    subset = data[data['artists+name'] == name].copy()
    m = subset['popularity'].max()
    index = subset[subset['popularity'] == m].index[0]
    indices.append(index)

In [None]:
data_bis = data.loc[indices].copy()
data_bis['artists+name'].duplicated().sum()

In [None]:
# Keep only the songs whose artist/title pair is not duplicated
# (a boolean mask avoids dropping rows one by one inside a Python loop).
data_ter = data[~data['artists+name'].isin(df['artists+name'])].copy()

In [None]:
frames = [data_bis, data_ter]
data_four = pd.concat(frames)

In [None]:
data_four['artists+name'].duplicated().sum()

In [None]:
data_four.shape

We can easily check that: 

$14948 + 159441 = 174389$

where the right-hand side corresponds to the number of entries in the initial dataset. We have thus succeeded in removing duplicates. Now that each song is unique, we will remove the `id`, which is unnecessary, along with `release_date`. We also do not need `artists+name` anymore, as it was only meant to look for duplicates.

In [None]:
data_four.drop(columns=['release_date','id','artists+name'], inplace=True)

Another area of concern is that **some songs have a null tempo**. It does not make any sense to have `tempo=0`. If we print the corresponding data subset, we can see that tempo is not the only null metric for these songs: `danceability`, `speechiness`, `valence` are also equal to 0. 

This is the case for approximately 100 songs. We will thus drop them.

In [None]:
data_four = data_four[data_four['tempo'] != 0].copy()

In [None]:
data_four.reset_index(inplace=True, drop=True)

In [None]:
data_four.to_csv('data/songs.csv', index=False)

In [None]:
# del df
# del data_bis
# del data_ter
# del data_four

# 2. EDA and data visualisation

You can start from the deduplicated, cleaned dataset by loading it with the cell below.

This analysis of the dataset was inspired by [this notebook](https://www.kaggle.com/anatpeled/spotify-popularity-prediction) on Kaggle. We figured that plotting several features against popularity could give us a first insight into which ones are relevant for predicting it.

Thanks to Guy Kahana & Anat Peled for enabling me to have a quick, clear vision of the dataset! 

In [None]:
songs=pd.read_csv('data/songs.csv')

In [None]:
songs.describe()

Here, we can see that our metric of interest (`popularity`) has values that range from 0 to 100. 25% of the songs have a popularity of 1/100 or less, whereas only 25% of the songs achieve more than 43/100.  

Since we know that, for an equal number of plays, Spotify gives a higher popularity score to tracks that have been played recently than to those that have been played earlier, we might want to plot popularity according to time. 

In [None]:
fig, ax = plt.subplots(figsize=(15, 4))
songs.groupby('year')['popularity'].mean().plot(ax=ax)  # draw on the axes created above
ax.set_title('Mean Popularity over the years')
ax.set_ylabel('Mean Popularity', weight='bold')
ax.set_xlabel('Year', weight='bold')
ax.set_xticks(range(1920, 2021, 5))
plt.show()

The drop around the year 2021 is not surprising as we have read that popularity "lags" by a few days/weeks. The rest of the trend is interesting, as popularity rises until the year 2000 and then falls. 

### **Acousticness**

The `acousticness` ranges from 0 to 1. Acousticness of the majority of tracks is either close to 0 or 1.

In [None]:
fig, ax = plt.subplots(figsize=(16, 4))
sns.histplot(songs['acousticness'],  bins=30)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15, 6))
ax1_data =  songs.groupby('acousticness')['popularity'].mean().to_frame().reset_index()
ax = sns.scatterplot(x=ax1_data['acousticness'], y=ax1_data['popularity'], color='blue', ax=ax)
ax.set_title('Acousticness vs. Mean Popularity')
ax.set_ylabel('Mean Popularity', fontsize=12)
plt.tight_layout()
plt.show()

### **Danceability**

The `danceability` ranges from 0 to 1. Danceability seems normally distributed.

In [None]:
sns.histplot(songs['danceability'], color='green', bins=30)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(15, 6), sharey=True, sharex = True)
ax_data =  songs.groupby('danceability')['popularity'].mean().to_frame().reset_index()
ax = sns.scatterplot(x='danceability', y='popularity', data=ax_data, color='green', ax=ax)
ax.set_title('danceability')
ax.set_ylabel('Mean Popularity', fontsize=12)
plt.tight_layout()
plt.show()

### **Duration_ms**

Tracks last from 5 seconds to 90 minutes.

In [None]:
fig, ax = plt.subplots(figsize = (15, 4))
ax = sns.histplot(songs['duration_ms']/60000, color='orange')
ax.set_title('Length of Tracks (in minutes!)')
ax.set_xticks(range(0,25,1))
ax.set_xlim(0,25)
plt.show()

### **Energy**

`energy` measures the intensity and activity: energetic tracks feel faster and louder.

Intuitively, we could suspect `energy` and `danceability` or `energy` and `loudness` would be similar. 

Let us check this by using Pearson's correlation coefficient. This correlation coefficient ranges from −1 to 1: a value of 1 implies that a linear equation describes the relationship between X and Y perfectly; a value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.
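For reference, the coefficient is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The short sketch below computes it from this definition on made-up values (the arrays are illustrative, not taken from the dataset) to show the two extreme cases described above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered x
    yc = y - y.mean()  # centered y
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = np.array([0.1, 0.4, 0.5, 0.9])

r_up = pearson_r(x, 2 * x + 1)   # perfectly linear, increasing
r_down = pearson_r(x, -3 * x)    # perfectly linear, decreasing
r_other = pearson_r(x, [1, 3, 2, 5])

print(round(r_up, 3), round(r_down, 3), round(r_other, 3))
```

`Series.corr`, used in the cells below, computes the same quantity by default.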

In [None]:
a = songs['energy'].corr(songs['danceability']).round(3)
print(f'The Pearson correlation coefficient is: {a:^10}')

In [None]:
a = songs['energy'].corr(songs['loudness']).round(3)
print(f'The Pearson correlation coefficient is: {a:^10}')

`energy` and `loudness` are strongly correlated, so we will probably have to abandon one of the two. Given how the two plots "Popularity vs. energy" and "Popularity vs. loudness" (not shown here) look, we will keep `energy` and discard `loudness`.

In [None]:
songs.drop(columns=['loudness'], inplace=True)

In [None]:
fig, ax = plt.subplots(1, figsize=(15, 6), sharey=True, sharex = True)
ax_data =  songs.groupby('energy')['popularity'].mean().to_frame().reset_index()
ax = sns.scatterplot(x='energy', y='popularity', data=ax_data, color='pink', ax=ax)
ax.set_title('energy')
ax.set_ylabel('Mean Popularity', fontsize=12)
plt.tight_layout()
plt.show()

### **Instrumentalness**

The `instrumentalness` being close to 1 means there are no vocals.

In [None]:
fig, ax = plt.subplots(1, figsize=(15, 6), sharey=True, sharex = True)
ax_data =  songs.groupby('instrumentalness')['popularity'].mean().to_frame().reset_index()
ax = sns.scatterplot(x='instrumentalness', y='popularity', data=ax_data, color='grey', ax=ax)
ax.set_title('instrumentalness')
ax.set_ylabel('Mean Popularity', fontsize=12)
plt.tight_layout()
plt.show()

### **Liveness**

`liveness` detects the presence of an audience. High liveness suggests the track was live.

In [None]:
fig, ax = plt.subplots(figsize=(15, 4))
sns.histplot(songs['liveness'], color='purple', bins=30)
plt.show()

### **Speechiness**
When studying `speechiness`, we can see that songs that are too "speechy" are less popular. We will thus create a binary variable: "Speech over 0.57".

In [None]:
fig, ax = plt.subplots(1, figsize=(15, 6), sharey=True, sharex = True)
ax_data =  songs.groupby('speechiness')['popularity'].mean().to_frame().reset_index()
ax = sns.scatterplot(x='speechiness', y='popularity', data=ax_data, color='brown', ax=ax)
ax.axvline(x=0.57, ymin=0, ymax=1, color='red', linestyle='dashed')
ax.set_title('speechiness')
ax.set_ylabel('Mean Popularity', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
songs['speech_over.57']=1*(songs['speechiness']>=0.57)

In [None]:
songs.drop(columns='speechiness',inplace=True)

### **Artists and genres?**

Using the artists' popularity without any data leakage is going to be tough but makes sense. Intuitively, **a song's popularity is largely based on its artist's popularity** but according to the Spotify API documentation, artists' popularity is itself computed based on the popularity of their tracks! 

Wondering how to pre-process the artist, we realized that we had a dataframe linking artists to their music genre. This could be a good popularity indicator. 

We thus tried to use the songs' genre, thanks to the `data_w_genres` dataframe that was provided.

In [None]:
genres = data_w_genres[['artists', 'genres']].copy()
genres['artists'] = genres['artists'].map(lambda x: '['+x+']')
genres.head()

After some investigation, it seems that the artists present in the `data_w_genres` dataframe are not the same as the ones present in the main file. 

We thus drop this approach with genres and will develop another method to take artists into account.
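For readers who want to reproduce this kind of check, here is one possible way to measure the overlap between the two files. It is a sketch on hypothetical toy frames and it assumes that the track file stores `artists` as a stringified Python list while the genre file stores one plain artist name per row; neither assumption is verified here.

```python
import ast
import pandas as pd

# Hypothetical toy frames; the real column layouts are an assumption.
tracks = pd.DataFrame({
    "artists": ["['Artist A']", "['Artist A', 'Artist B']", "['Artist C']"],
})
artist_genres = pd.DataFrame({
    "artists": ["Artist A", "Artist B", "Artist D"],
    "genres": ["['pop']", "['rock']", "[]"],
})

# Parse each stringified list into a real list, then flatten to one artist per row.
track_artists = set(tracks["artists"].map(ast.literal_eval).explode())
genre_artists = set(artist_genres["artists"])

# Share of track artists for which a genre entry exists.
shared = track_artists & genre_artists
coverage = len(shared) / len(track_artists)
print(f"{coverage:.0%} of track artists have a genre entry")
```

A low coverage on the real data would support dropping the genre approach, as done here.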

# 3. Model construction

## Splitting the data

First, we need to split the data. 

If we consider Spotify's use case, the company certainly wants to predict the popularity of future songs. Thus, we should split the dataset between songs released before a certain year and songs released after it. 

We will make a train set of all the songs released before 1996 (72% of the whole dataset), a validation set composed of the songs released between 1996 and 2014 to fine-tune and select our models (18% of the whole dataset), and a test set composed of the songs released in 2014 and after (10% of the whole dataset). The split is done by position in the year-sorted dataset; the year boundaries follow from these proportions.
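To make the proportions concrete, here is a small sketch of the same two-step chronological split on synthetic data (100 years with 10 rows each; the numbers are illustrative only). With `shuffle=False`, `train_test_split` simply cuts the sorted frame, so every validation year is later than every train year.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the songs: 10 rows per year from 1921 to 2020.
toy = pd.DataFrame({"year": np.repeat(np.arange(1921, 2021), 10)})
toy = toy.sort_values(by="year")

# Step 1: the last 10% of the sorted rows become the test set.
rest, test = train_test_split(toy, test_size=0.1, shuffle=False)
# Step 2: the last 20% of what remains (18% overall) becomes the validation set.
train, val = train_test_split(rest, test_size=0.2, shuffle=False)

fractions = {name: len(part) / len(toy)
             for name, part in [("train", train), ("val", val), ("test", test)]}
print(fractions)
print(train["year"].max(), val["year"].min(), val["year"].max(), test["year"].min())
```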

In [None]:
X = songs.sort_values(by='year').drop(columns=['popularity']).copy()
y = songs.sort_values(by='year')['popularity'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, shuffle=False)

For illustration purposes, here are the last songs of the train and val dataset. These songs were indeed issued in 1996 and 2014, respectively. 

In [None]:
X_train.tail(5)

In [None]:
X_val.tail(5)

## Model explorations

### Basic preprocessing - First iterations


We use `ColumnTransformer` to perform several preprocessing tasks at once. We scale the year, tempo and duration (unlike the other features, they are not on a 0-1 scale), and we one-hot encode the key, a categorical variable taking integer values from 0 to 11 (more on the music key [here](https://en.wikipedia.org/wiki/Key_(music))): this creates 12 columns taking 0/1 values.

We will use the artists and the song titles later on; for now, we drop these columns. 
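As a quick sanity check of the number of columns produced, the sketch below runs the same kind of transformer on a toy frame. The frame is made up and assumes that all 12 key values (0 to 11) occur; `energy` stands in for the features that are passed through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with 12 rows, one per key value (assumption: keys are 0 to 11).
toy = pd.DataFrame({
    "year": np.arange(2000, 2012),
    "tempo": np.linspace(60, 180, 12),
    "duration_ms": np.linspace(120_000, 300_000, 12),
    "key": np.arange(12),
    "energy": np.linspace(0, 1, 12),  # passed through unchanged
})

ct_toy = ColumnTransformer(
    [("minmax", MinMaxScaler(), ["year", "tempo", "duration_ms"]),
     ("categorical", OneHotEncoder(), ["key"])],
    remainder="passthrough",
)

out = ct_toy.fit_transform(toy)
n_key_columns = len(ct_toy.named_transformers_["categorical"].categories_[0])

# 3 scaled columns + one 0/1 column per key value + 1 passthrough column
print(out.shape, n_key_columns)
```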

In [None]:
ct = ColumnTransformer([('minmax', MinMaxScaler(), ['year', 'tempo', 'duration_ms']),
                        ('categorical', OneHotEncoder(), ['key']),
                        ('drop_cols', 'drop', ['artists', 'name'])],
                       remainder='passthrough')

ct.fit(X_train)

X_train_preprocessed = ct.transform(X_train)
X_val_preprocessed = ct.transform(X_val)
X_test_preprocessed = ct.transform(X_test)

In [None]:
X_train_preprocessed.shape

**We are now ready to explore various models!**
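All the models below are compared with the root mean squared error (RMSE) on the validation set, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. As a reminder of what this number means, here is a minimal sketch with made-up values; `rmse` is an illustrative helper, not a function used elsewhere in the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# The errors are 3, 4 and 0 popularity points.
example = rmse([10, 20, 30], [13, 16, 30])
print(round(example, 3))
```

Since the RMSE is expressed in the same unit as the target, it can be read directly as a typical error in popularity points.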

In [None]:
# Usual linear regression

## We instantiate the model
lin_reg = LinearRegression()

# We fit the model on preprocessed train data
lin_reg.fit(X_train_preprocessed, y_train)

# We make predictions on the validation set, also preprocessed
y_pred = lin_reg.predict(X_val_preprocessed)

# We output the root mean squared error on the validation set
# (np.sqrt is used because recent scikit-learn releases no longer accept `squared=False`)
np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
# Lasso
lasso=Lasso()

lasso.fit(X_train_preprocessed,y_train)

y_pred = lasso.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
# XGBoost regressor
xgb_regressor = XGBRegressor(n_estimators=100, max_depth=20, learning_rate=0.01)

xgb_regressor.fit(X_train_preprocessed, y_train)

y_pred = xgb_regressor.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
# K-Nearest-Neighbors (KNN) regressor
neigh = KNeighborsRegressor(n_neighbors=7)

neigh.fit(X_train_preprocessed, y_train)

y_pred=neigh.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

### Introducing song names

Does the `name` of the song influence its popularity?

I tried to lemmatize the title and use a TF-IDF Vectorization afterwards, but song titles in the dataset are in multiple languages. They could not be used as such.

Thus, I imagined that a song could have more audience if its title were in English. I decided to use the [enchant](https://pypi.org/project/pyenchant/) library to **check whether at least half the words in the song's title are in English**. 
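The "at least half of the words" rule can be illustrated without installing any dictionary. The sketch below replaces the `enchant` dictionary with a tiny hypothetical vocabulary (`TOY_ENGLISH_WORDS` is made up for the example) and, unlike the cells below, lowercases the title first.

```python
# Hypothetical stand-in vocabulary; the notebook relies on an enchant dictionary instead.
TOY_ENGLISH_WORDS = {"love", "me", "do", "the", "night", "champagne", "problems"}

def mostly_in_vocabulary(title, vocabulary=TOY_ENGLISH_WORDS):
    """Return 1 if at least half of the words of the title are in the vocabulary, else 0."""
    words = title.lower().split()
    if not words:  # an empty title cannot be classified
        return 0
    share = sum(word in vocabulary for word in words) / len(words)
    return int(share >= 0.5)

for title in ["Love Me Do", "La vie en rose", "the noche"]:
    print(title, "->", mostly_in_vocabulary(title))
```

The threshold is inclusive: a two-word title with exactly one known word is counted as English.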

In [None]:
import enchant

In [None]:
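# The dictionary is built once: creating it again for every title is slow.
EN_DICT = enchant.Dict("en_US")

def check_if_english(text):
    # split() without argument ignores repeated spaces, so no empty string is passed to check()
    words = text.split()
    if not words:
        return 0
    # 1 if at least half of the words are found in the English dictionary, 0 otherwise
    return int(np.mean([EN_DICT.check(word) for word in words]) >= 0.5)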

In [None]:
X_train['name_en'] = X_train['name'].map(check_if_english)
X_val['name_en'] = X_val['name'].map(check_if_english)
X_test['name_en'] = X_test['name'].map(check_if_english)

At that stage, I wanted to find out whether the language was relevant to predict popularity. To do so, I re-ran models with the additional binary variable, indicating if the song title is in English or not.

In [None]:
ct = ColumnTransformer([('minmax', MinMaxScaler(), ['year', 'tempo', 'duration_ms']),
                        ('categorical', OneHotEncoder(), ['key']),
                        ('drop_cols', 'drop', ['artists','name'])],
                       remainder='passthrough')

ct.fit(X_train)

X_train_preprocessed = ct.transform(X_train)
X_val_preprocessed = ct.transform(X_val)
X_test_preprocessed = ct.transform(X_test)

In [None]:
lin_reg = LinearRegression()

lin_reg.fit(X_train_preprocessed, y_train)

y_pred = lin_reg.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
lasso = Lasso()

lasso.fit(X_train_preprocessed,y_train)

y_pred = lasso.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
xgb_regressor = XGBRegressor(n_estimators=100, max_depth=20, learning_rate=0.01)

xgb_regressor.fit(X_train_preprocessed, y_train)

y_pred = xgb_regressor.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
neigh = KNeighborsRegressor(n_neighbors=7)

neigh.fit(X_train_preprocessed, y_train)

y_pred = neigh.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

It did reduce the root mean squared error! We will therefore dig deeper into this approach and add binary features that tell us whether the song's name is in French, Spanish, Russian or Arabic (almost all of the world's most spoken languages, except Chinese and Hindi).

**Warning**: the column creation takes a long time to run (my guess is that `enchant` can be capricious), so I would advise you to run it only once and, on later runs, to load `X_train`, `X_test` and `X_val` directly from the `csv` files with the cell at the end of this section.

In [None]:
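_DICTS = {}  # one enchant.Dict per language tag, created on first use

def check_language(text, abbreviation):
    '''
    abbreviation is a string that corresponds to the tag of the language you want to check. 
    
    For example: 
    - French is "fr_FR", 
    - Spanish is "es", 
    - Arabic is "ar"
    - etc.
    '''
    if abbreviation not in _DICTS:
        _DICTS[abbreviation] = enchant.Dict(abbreviation)
    # split() without argument ignores repeated spaces, so no empty string is passed to check()
    words = text.split()
    if not words:
        return 0
    # 1 if at least half of the words are found in the dictionary, 0 otherwise
    return int(np.mean([_DICTS[abbreviation].check(word) for word in words]) >= 0.5)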

In [None]:
X_train['name_fr'] = X_train['name'].map(lambda text: check_language(text, abbreviation='fr_FR'))
X_val['name_fr'] = X_val['name'].map(lambda text: check_language(text, abbreviation='fr_FR'))
X_test['name_fr'] = X_test['name'].map(lambda text: check_language(text, abbreviation='fr_FR'))

In [None]:
X_train['name_sp'] = X_train['name'].map(lambda text: check_language(text, abbreviation='es'))
X_val['name_sp'] = X_val['name'].map(lambda text: check_language(text, abbreviation='es'))
X_test['name_sp'] = X_test['name'].map(lambda text: check_language(text, abbreviation='es'))

In [None]:
X_train['name_ar'] = X_train['name'].map(lambda text: check_language(text, abbreviation='ar'))
X_val['name_ar'] = X_val['name'].map(lambda text: check_language(text, abbreviation='ar'))
X_test['name_ar'] = X_test['name'].map(lambda text: check_language(text, abbreviation='ar'))

In [None]:
X_train['name_ru'] = X_train['name'].map(lambda text: check_language(text, abbreviation='ru'))
X_val['name_ru'] = X_val['name'].map(lambda text: check_language(text, abbreviation='ru'))
X_test['name_ru'] = X_test['name'].map(lambda text: check_language(text, abbreviation='ru'))

The computations are quite lengthy, so I would advise you to save your train, validation and test sets once and for all, as I did in the cell below.

In [None]:
X_train.to_csv('data/X_train.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
X_val.to_csv('data/X_val.csv', index=False)

Uncomment and run this cell to avoid the lengthy computations with the `enchant` library.

In [None]:
# X_train = pd.read_csv('data/X_train.csv')
# X_test = pd.read_csv('data/X_test.csv')
# X_val = pd.read_csv('data/X_val.csv')

### Introducing the artists' popularity

After our unsuccessful first approach with the genres, we will now try another one, based on what we would do "in real life". 

- If the artist has done **more than one song in the train set**, we will compute the artist's mean popularity in the train set and replace the artist's name by his/her popularity.
- Otherwise, we will replace the artist's name by the mean popularity of the train dataset. We need this distinction to avoid training the model to look for a popularity score that is already included in the artist's popularity.
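The two rules above can be sketched with a `groupby` on toy data. The frames and scores below are made up, and the names `artist_pop`, `train_encoded` and `val_encoded` are illustrative; the notebook builds the same mapping with an explicit loop and a dictionary.

```python
import pandas as pd

# Toy train set: A has two songs, B a single song, C three songs.
train = pd.DataFrame({
    "artists": ["A", "A", "B", "C", "C", "C"],
    "popularity": [10, 30, 80, 40, 50, 60],
})
# Toy validation set: D never appears in the train set.
val = pd.DataFrame({"artists": ["A", "B", "D"]})

train_mean_pop = train["popularity"].mean()

# Mean popularity and number of songs per artist, computed on the train set only.
stats = train.groupby("artists")["popularity"].agg(["mean", "count"])

# Artists with a single train song fall back to the global train mean.
artist_pop = stats["mean"].where(stats["count"] > 1, train_mean_pop)

train_encoded = train["artists"].map(artist_pop)
# Artists unseen during training also receive the global train mean.
val_encoded = val["artists"].map(artist_pop).fillna(train_mean_pop)

print(artist_pop.to_dict())
print(val_encoded.tolist())
```

Only train-set popularity enters the mapping, so nothing from the validation or test years leaks into the feature.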

In [None]:
# Defining a new dataframe that corresponds to X_train, but where popularity has not been dropped.
# X_train holds the first rows of the same year-sorted frame, so we select them by position.
songs_train = songs.sort_values(by='year').iloc[:len(X_train)].copy()

In [None]:
# This cell aims at creating a dictionary that gives, for each artist in the train set, its mean popularity
# or the mean popularity in the whole train set if the artist is only present once.
artists_and_pop = {}

train_mean_pop = songs_train['popularity'].mean()

for artist in X_train['artists'].unique():
    temp = songs_train[songs_train['artists'] == artist]['popularity'].copy()
    if len(temp) > 1:
        artists_and_pop[artist] = temp.mean()
    elif len(temp) == 1:
        artists_and_pop[artist] = train_mean_pop
    else:
        print('Stopping iteration due to unexpected result.')
        break

In [None]:
# We map this dictionary upon the "artists" column.
X_train['artists'] = X_train['artists'].map(artists_and_pop)

In [None]:
# For the validation set, we also map the dictionary upon the "artists" column.
# If an artist, active after 1996 only, is not in the train set, we attribute the mean popularity to him/her.
X_val['artists'] = X_val['artists'].map(lambda artist: artists_and_pop.get(artist, train_mean_pop))

In [None]:
ct = ColumnTransformer([('minmax', MinMaxScaler(), ['year', 'tempo', 'duration_ms', 'artists']),
                        ('categorical', OneHotEncoder(), ['key']),
                        ('drop_cols', 'drop', ['name'])],
                       remainder='passthrough')

ct.fit(X_train)

X_train_preprocessed = ct.transform(X_train)
X_val_preprocessed=ct.transform(X_val)

The `len(songs_train[songs_train['artists'] == artist]['popularity'])>1` condition is supposed to ensure that if an artist has only done one song in the train set, then the song's popularity will not be counted as the artist's popularity (the model would be impossible to train!).

Still, what might happen with this approach is that the artists' popularity becomes a crucial piece of information for the model: during training, the model will give it more importance than it should. For the validation and the test set, we will use the popularity of artists that are already present in the train set. 

In [None]:
lin_reg = LinearRegression()

lin_reg.fit(X_train_preprocessed, y_train)

y_pred = lin_reg.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
lasso=Lasso()

lasso.fit(X_train_preprocessed,y_train)

y_pred = lasso.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
xgb_regressor = XGBRegressor(n_estimators=100, max_depth=20, learning_rate=0.01)

xgb_regressor.fit(X_train_preprocessed, y_train)

y_pred = xgb_regressor.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
neigh = KNeighborsRegressor(n_neighbors=7)

neigh.fit(X_train_preprocessed, y_train)

y_pred=neigh.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

# 4. Further explorations

## Fine-tuning KNN hyperparameters

In [None]:
neigh = KNeighborsRegressor()

params = {
    'weights': ['uniform', 'distance'], 
    'n_neighbors': randint(2, 15),
    'algorithm': ['ball_tree', 'kd_tree', 'brute']
}

rnd_search = RandomizedSearchCV(estimator=neigh, 
                                param_distributions=params,
                                n_iter=10, 
                                cv=5,
                                verbose=1,
                                n_jobs=-1)

rnd_search.fit(X_train_preprocessed, y_train)

rnd_search.best_score_

The Random Search process was running indefinitely, probably because of a problem with the multiprocessing package (mentioned in [this issue](https://github.com/jupyter/notebook/issues/5261)). 
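When a randomized search does not complete, a plain loop over a few values of `n_neighbors`, scored on the validation set, is one simple alternative. The block below is a sketch on synthetic data (the arrays `X_tr`, `X_va` and the grid of values are assumptions made for the example), not the exact procedure followed in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_all = rng.uniform(size=(300, 3))
y_all = 50 * X_all[:, 0] + rng.normal(scale=5, size=300)

# Fixed split: first 200 rows for training, last 100 for validation.
X_tr, y_tr = X_all[:200], y_all[:200]
X_va, y_va = X_all[200:], y_all[200:]

# Validation RMSE for each candidate number of neighbors.
val_rmse = {}
for k in range(2, 21, 4):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    errors = y_va - knn.predict(X_va)
    val_rmse[k] = float(np.sqrt(np.mean(errors ** 2)))

best_k = min(val_rmse, key=val_rmse.get)
print(val_rmse, best_k)
```

Scoring on a fixed, later validation set also respects the chronological split, which the shuffled folds of `cv=5` do not.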

Having tried several values for `n_neighbors`, performance seems to improve with the number of neighbors used, i.e. the root mean squared error on the validation set decreases. For instance, with `n_neighbors=18`:

In [None]:
neigh = KNeighborsRegressor(n_neighbors=18)

neigh.fit(X_train_preprocessed, y_train)

y_pred=neigh.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

## Simple neural network

In [None]:
def build_nn_model():
    # We instantiate the sequential model.
    model = Sequential()

    # We add several Dense layers with ReLU activation, and 1 Dropout layer to prevent overfitting.
    model.add(layers.Dense(100, activation = 'relu',input_dim=30))
    model.add(layers.Dense(50, activation = 'relu'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(30, activation = 'relu'))

    # Finally, the last layer will count 1 neuron with linear activation since we are dealing with a regression model. 
    model.add(layers.Dense(1, activation = 'linear'))
    return model
    
model = build_nn_model()

model.summary()

In [None]:
adam = Adam(learning_rate=0.00001)
mse = MeanSquaredError()

# We compile the model with mean squared error as loss and root mean squared error as metric. 
model.compile(loss=mse, optimizer=adam, metrics=[RootMeanSquaredError()])

In [None]:
es = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

history = model.fit(X_train_preprocessed, y_train, 
                    validation_data=(X_val_preprocessed, y_val),
                    epochs = 1000, 
                    batch_size = 32, 
                    callbacks = [es], 
                    verbose = 2)

Let's have a more visual look at the training of the model.

In [None]:
def plot_history(history):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(17, 5))
    
    axes[0].plot(history.history['loss'], color='darkred', label='Train - Loss')
    axes[0].plot(history.history['val_loss'], color='darkblue', label='Validation - Loss')
    axes[0].legend()
    axes[0].set_title('Loss (MSE) on train and validation sets')    
    
    axes[1].plot(history.history['root_mean_squared_error'], color='darkred', label='Train - RMSE')
    axes[1].plot(history.history['val_root_mean_squared_error'], color='darkblue', label='Validation - RMSE')
    axes[1].legend()
    axes[1].set_title('RMSE on train and validation sets')

plot_history(history)

We can see that after a few epochs, neither of the two losses/metrics is decreasing anymore. If we kept training the network, the `val_loss` would probably start increasing again as we would be overfitting the train set. 

We can now make predictions on the validation set with our model. In fact, we already know what the root mean squared error on the validation set will be, as we used `X_val` as a validation set during the training of our model, so the `val_RMSE` was computed at each step. We can verify this by running the cell below.
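The stopping rule behind this behaviour can be illustrated in a few lines. The function below is a simplified sketch of patience-based early stopping, written for this explanation; it is not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # new best value: reset the counter
            wait = 0
        else:
            wait += 1    # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# The loss improves for three epochs, then plateaus slightly above its best value.
stop = early_stopping_epoch([5.0, 4.0, 3.0, 3.1, 3.2, 3.3], patience=2)
print(stop)
```

With `patience=10`, as in the cell above, the network gets ten extra epochs to improve on its best validation loss before training ends.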

In [None]:
y_pred = model.predict(X_val_preprocessed)

np.sqrt(mean_squared_error(y_val, y_pred))

# 5. Evaluation on the test set

## Recomposing the train set

Now we will train our model on a larger train set, composed of the former `X_train` and `X_val`, and test its performance on `X_test`. 

In [None]:
# Recovering our train and test dataset with the languages of the song titles
# Will work only if you have saved the output of the language check function (for the song name) 
X_train_old = pd.read_csv('data/X_train.csv')
X_val = pd.read_csv('data/X_val.csv')
X_test = pd.read_csv('data/X_test.csv')

X_train = pd.concat([X_train_old, X_val])

In [None]:
y_train_old = y_train.copy()
y_train = pd.concat([y_train_old, y_val])

In [None]:
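# Defining a new dataframe that corresponds to X_train, but where popularity has not been dropped.
# The new X_train (former train + validation) holds the first rows of the year-sorted frame.
songs_train = songs.sort_values(by='year').iloc[:len(X_train)].copy()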

As we saw with the plot of mean popularity over the years, there is a sudden drop in popularity around the end of 2020 that might be due to a change in the metric (cf. [this thread](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks/discussion/214183)). Thus, and since we know popularity scores tend to lag on the Spotify API, we drop the songs from 2021. 

In [None]:
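# Positional mask: it also works when X_test was reloaded from CSV and lost y_test's index.
not_2021 = (X_test['year'] != 2021).to_numpy()
X_test_no_2021 = X_test[not_2021].copy()
y_test_no_2021 = y_test[not_2021].copy()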

In [None]:
# Preprocessing the 'artists' column as seen before
artists_and_pop = {}

train_mean_pop = songs_train['popularity'].mean()

for artist in X_train['artists'].unique():
    temp = songs_train[songs_train['artists'] == artist]['popularity'].copy()
    if len(temp) > 1:
        artists_and_pop[artist] = temp.mean()
    elif len(temp) == 1:
        artists_and_pop[artist] = train_mean_pop

X_train['artists'] = X_train['artists'].map(artists_and_pop)
X_test_no_2021['artists'] = X_test_no_2021['artists'].map(lambda artist: artists_and_pop.get(artist, train_mean_pop))

In [None]:
# Adding usual preprocessing steps
ct = ColumnTransformer([('minmax', MinMaxScaler(), ['year', 'tempo', 'duration_ms', 'artists']),
                        ('categorical', OneHotEncoder(), ['key']),
                        ('drop_cols', 'drop', ['name'])],
                       remainder='passthrough')

ct.fit(X_train)

X_train_preprocessed = ct.transform(X_train)
X_test_preprocessed=ct.transform(X_test_no_2021)

## Scoring the KNN regressor

Now that we have reconstituted our new train set, we can score our best model (KNN with 18 neighbors) on the test set which it has "never seen before".

In [None]:
neigh = KNeighborsRegressor(n_neighbors=18)

neigh.fit(X_train_preprocessed, y_train)

y_pred=neigh.predict(X_test_preprocessed)

np.sqrt(mean_squared_error(y_test_no_2021, y_pred))

This is our **root mean squared error on the test set**, i.e. on songs released roughly between 2014 and 2020 (10% of the whole dataset). It is not very impressive; we will try to explain why later. 

## Scoring the neural network

Let's do the same with our neural network.

In [None]:
# We re-instantiate the model thanks to the function defined previously.
model = build_nn_model()

adam = Adam(learning_rate=0.00001)
mse = MeanSquaredError()
model.compile(loss=mse, optimizer=adam, metrics=[RootMeanSquaredError()])

es = EarlyStopping(monitor = 'val_loss', patience = 10, verbose = 1)

history = model.fit(X_train_preprocessed, y_train, 
                    validation_split=0.3,
                    epochs = 1000, 
                    batch_size = 32, 
                    callbacks = [es], 
                    verbose = 0)

plot_history(history)

And we can score the model on the test set. 

In [None]:
y_pred_nn = model.predict(X_test_preprocessed)
# Same explicit square root as for the KNN regressor
np.sqrt(mean_squared_error(y_test_no_2021, y_pred_nn))

The K-Nearest Neighbors regressor seems to perform better than the neural network on our test set.

## Visualising our result

Here, I plot the predicted popularity on the test set and compare it to the true popularity on both the train and test sets. I start by sorting the songs by year.

In [None]:
songs_sorted = songs.sort_values(by='year').copy()
# Take an explicit copy: new columns are added to this frame in the next cells
songs_sorted_no_2021 = songs_sorted[songs_sorted['year'] < 2021].copy()

I then add two columns that copy the popularity, except on the test set, where they hold the values predicted by our two best models.

In [None]:
# Cast to float so that the (float) predictions can be written into the columns
songs_sorted_no_2021['y_pred_knn'] = songs_sorted_no_2021['popularity'].astype(float)
# I found the value 130,012 by displaying the first song of the test set... Not the best way to do it, I know!
# A single .loc call avoids chained assignment, which pandas may silently ignore
songs_sorted_no_2021.loc[130012:, 'y_pred_knn'] = np.ravel(y_pred)
songs_sorted_no_2021['y_pred_nn'] = songs_sorted_no_2021['popularity'].astype(float)
songs_sorted_no_2021.loc[130012:, 'y_pred_nn'] = np.ravel(y_pred_nn)
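
A possible way to avoid the hard-coded label `130012` is to reuse the index of the test set itself. The sketch below runs on a made-up toy frame and assumes that the test frame kept the index labels of the original songs frame and that the predictions are in the same row order as the test frame; `toy_songs`, `toy_test_index` and `toy_pred` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sorted songs frame (made-up values, shuffled index labels)
toy_songs = pd.DataFrame(
    {'year': [2012, 2013, 2015, 2016, 2017],
     'popularity': [40.0, 42.0, 50.0, 55.0, 60.0]},
    index=[7, 3, 9, 1, 4],
)

# Assumption: the test rows are identified by their index labels,
# e.g. what X_test_no_2021.index would give in the notebook
toy_test_index = pd.Index([9, 1, 4])
toy_pred = np.array([[48.0], [51.0], [58.0]])  # shape (n, 1), like a Keras output

# Start from the true popularity, then overwrite only the test rows
toy_songs['y_pred'] = toy_songs['popularity']
toy_songs.loc[toy_test_index, 'y_pred'] = np.ravel(toy_pred)
```

With this approach the assignment no longer depends on where the test set happens to start in the sorted frame.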

In [None]:
fig, ax = plt.subplots(figsize=(15, 4))
ax.plot(songs_sorted_no_2021.groupby('year')['y_pred_knn'].mean(),label='Predicted Popularity - K-Nearest Neigh.')
ax.plot(songs_sorted_no_2021.groupby('year')['y_pred_nn'].mean(),color="green",label='Predicted Popularity - Neural Net.')
ax.plot(songs_sorted_no_2021.groupby('year')['popularity'].mean(),color="orange",label='True Popularity')
ax.legend()
ax.set_title('Songs Popularity - Historic and Predictions')
ax.set_ylabel('Popularity', weight='bold')
ax.set_xlabel('Year', weight='bold')
ax.set_xticks(range(1920, 2021, 5))  # the upper bound is exclusive, so the last tick is 2020

plt.show()

We clearly see the better performance of the K-Nearest Neighbors regressor on this graph.
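
The gap between the curves can also be put into numbers by averaging the signed error per year. This is only a sketch on made-up values (the frame `toy_results` and its `y_pred` column are hypothetical); in the notebook, the same idea could be applied to the test rows of the frame used for the plot.

```python
import pandas as pd

# Made-up true and predicted popularity for a few test songs
toy_results = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019, 2020],
    'popularity': [50.0, 60.0, 40.0, 50.0, 30.0],
    'y_pred': [55.0, 65.0, 50.0, 60.0, 45.0],
})

# Signed error: positive values mean that the model over-predicts
toy_results['error'] = toy_results['y_pred'] - toy_results['popularity']

# Mean signed error (bias) for each year
bias_per_year = toy_results.groupby('year')['error'].mean()
```

A bias that stays positive and grows with the year would indicate that the model over-predicts more and more on recent songs.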

# 6. Conclusion

- We can see that our predictions are systematically higher than the real values. The origin of this "problem" may lie in the fact that the metric seems to have changed (as raised in [this thread](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks/discussion/214183)). Our models fail to adapt to the drop that occurs after the beginning of the 2000s, and it is even worse when including 2021!

- If I had more than a week, I would have liked to find a way to use the lyrics of the songs. Several APIs offer this service, but they require specific credentials and are generally not free. I tried to scrape a lyrics website, but the result was not fully satisfying and I was banned after too many queries 😢 With clean lyrics, we could translate the dataset and introduce it into our analysis/modelling.

- The preprocessing could also be improved in this notebook: all of its steps should be implemented as a single pipeline.
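
To illustrate the last point, here is a minimal sketch of what a single scikit-learn `Pipeline` could look like. It runs on made-up toy data with only a few of the notebook's columns, and it leaves out the `artists` encoding, which would probably need a custom transformer. The number of neighbors is reduced to 5 only because the toy set is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with a subset of the notebook's columns (values are made up)
rng = np.random.default_rng(0)
n = 40
toy_X = pd.DataFrame({
    'year': rng.integers(1990, 2014, size=n),
    'tempo': rng.uniform(60, 180, size=n),
    'key': np.arange(n) % 12,
    'name': ['song_%d' % i for i in range(n)],
})
toy_y = rng.uniform(0, 100, size=n)

# Same kind of preprocessing as in the notebook, declared once
preprocess = ColumnTransformer(
    [('minmax', MinMaxScaler(), ['year', 'tempo']),
     ('categorical', OneHotEncoder(handle_unknown='ignore'), ['key']),
     ('drop_cols', 'drop', ['name'])],
    remainder='passthrough',
)

# Preprocessing and model chained together: fit() and predict() apply both steps
pipe = Pipeline([
    ('preprocess', preprocess),
    ('knn', KNeighborsRegressor(n_neighbors=5)),
])

pipe.fit(toy_X.iloc[:30], toy_y[:30])
toy_pred = pipe.predict(toy_X.iloc[30:])
```

The main benefit is that the scalers and the encoder are always fitted on the training rows only, so the test set cannot leak into the preprocessing by mistake.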

# 7. A small bonus...

A data scientist should always know their dataset well!

Run the following cells to play a random song in the dataset 😉

In [None]:
# !pip install youtube-search-python
# !pip install IPython

In [None]:
import random
from youtubesearchpython import VideosSearch
from IPython.display import YouTubeVideo

In [None]:
songs = pd.read_csv('data/data.csv')

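# randrange excludes its upper bound, whereas randint(0, len(songs)) could return an out-of-range label
random_song = songs['name'].loc[random.randrange(len(songs))]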

videosSearch = VideosSearch(random_song, limit=1)

song_yt_id = videosSearch.result()['result'][0]['id']
YouTubeVideo(song_yt_id, width=1000, height=500)