<a href="https://colab.research.google.com/github/AstridSerruto/Projects/blob/master/Predicting_Spotify_Song_Popularity_with_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Can we predict a song's popularity with Machine Learning?

### Table of Contents

1. Introduction

2. Data Source

3. Setup

4. Preprocessing

5. Model Training

6. Decisions

 1. Introduction

Occasionaly, unknown singers have a smash hit. With this project I want create a model that can predict a song's popularity by looking at feature like danceability, liveness and valence.

2. Data Source

Dataset obtained from: https://www.kaggle.com/datasets/subhaskumarray/spotify-tracks-data?select=tracks.csv

This dataset contains data on tracks released from 1921 to 2020. The data entries include the following features:

*   id: ID of track generated by Spotify
*   id_artists: ID of artist generated by Spotify
*   acousticness: ranges from 0 to 1
*   danceability: ranges from 0 to 1
*   energy: ranges from 0 to 1
*   duration_ms: duration of track in milliseconds; integer   typically ranges from 200k to 300k
*   instrumentalness: ranges from 0 to 1
*   valence: how happy the song is; ranges from 0 to 1
*   popularity: ranges from 0 to 100
*   tempo: float typically ranging from 50 to 150
*   liveness: ranges from 0 to 1
*   loudness: float typically ranging from -60 to 0
*   peechiness: ranges from 0 to 1
*   year: ranges from 1921 to 2020
*   mode: 0 = minor, 1 = major
*   explicit: 0 = no explicit content, 1 = explicit content
*   key: all keys on octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, etc.
*   artists: artist’s name
*   release_date: date of release
*   name: name of song

    
    
    
    





3. Setup

Importing Python libraries that are needed throughout the project and load dataset into Pandas dataframe

In [17]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re

from numpy import random
from sklearn.neighbors import KNeighborsRegressor
from sklearn import preprocessing

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [19]:
from google.colab import drive
drive.mount('/content/drive')

	
df = pd.read_csv('/content/drive/My Drive/Datasets/Kaggle Data Sets/tracks.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
df.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.177,1,-21.18,1,0.0512,0.994,0.0218,0.212,0.457,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918,0.104,0.397,169.98,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.158,3,-16.9,0,0.039,0.989,0.13,0.311,0.196,103.22,4


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586672 entries, 0 to 586671
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                586672 non-null  object 
 1   name              586601 non-null  object 
 2   popularity        586672 non-null  int64  
 3   duration_ms       586672 non-null  int64  
 4   explicit          586672 non-null  int64  
 5   artists           586672 non-null  object 
 6   id_artists        586672 non-null  object 
 7   release_date      586672 non-null  object 
 8   danceability      586672 non-null  float64
 9   energy            586672 non-null  float64
 10  key               586672 non-null  int64  
 11  loudness          586672 non-null  float64
 12  mode              586672 non-null  int64  
 13  speechiness       586672 non-null  float64
 14  acousticness      586672 non-null  float64
 15  instrumentalness  586672 non-null  float64
 16  liveness          58

We have 586672 rows and 20 columns, 15 numeric columns and 5 text or mixed. There are not null values in the set.

For this project, I’ll be using the k-nearest neighbors (KNN) algorithm. This algorithm first finds the k closest points to the data point of interest. The predicted popularity of this point of interest will then be the average of the popularity of the k closest points.

4. Preprocessing

We will drop the non-quantitative features, as the k-nearest neighbors regressor algorithm requires quantitative features. 

In [5]:
df_quantitative = df
cols_to_drop = []
for column in df:
    if df[column].dtype == 'object':
        cols_to_drop.append(column)
df_quantitative = df.drop(columns=cols_to_drop)

In [6]:
df_quan_nm=(df_quantitative-df_quantitative.min())/(df_quantitative.max()-df_quantitative.min())

We will randomly split our data into training and testing sets in an 8:2 ratio, which will allow us to determine the final error of our model when predicting on previously unseen data. We will also make a validation set from the training set. The set will be used to determine the optimal value of k for our model.

In [7]:
#To replicate results set seed
np.random.seed(111)

#Create a mask and randomly select 0.8 of the songs for training.
df_train_full = df_quan_nm.sample(frac=0.8,random_state=1)
df_test = df_quan_nm.drop(df_train_full.index)

df_validation = df_train_full.sample(frac=0.2,random_state=2)
df_train = df_train_full.drop(df_validation.index)

We will separate the sets into X and Y.

    X: all the features our model will use to predict popularity (i.e. valence, acousticness, loudness, etc.)

    Y: popularity column, our target

In [8]:
predict = 'popularity'
X_train = df_train.drop(columns=[predict])
X_validation = df_validation.drop(columns=[predict])
X_test = df_test.drop(columns=[predict])
Y_train = df_train[[predict]].values.ravel()
Y_validation = df_train[[predict]].values.ravel()
Y_test = df_train[[predict]].values.ravel()

 To determine the error between predicted and actual popularity, we will use the mean square error as our prediction variable is continous.

In [9]:
def calculate_error(Y_pred, Y_actual):
  error = 0
  for i in range(len(Y_pred)):
    error += abs(Y_pred[i] - Y_actual[i])**2
  return error / len(Y_pred)

5. Model Training

We will determine the optimal value of k by using the training validations sets.

In [10]:
k_errors = [np.inf]
for k in range(1,50):
  model = KNeighborsRegressor(n_neighbors=k)
  model.fit(X_train, Y_train)
  Y_val_pred = model.predict(X_validation)
  k_errors.append(calculate_error(Y_val_pred, Y_validation))

We will use the results from the previous code to plot the k values against their respective error.

In [11]:
if not os.path.exists('figs'):
    os.makedirs('figs')

plt.scatter(x=range(len(k_errors)), 
            y=k_errors)
plt.xlabel('Value of k')
plt.ylabel('Error')
plt.title('Different k values on a KNN classifier')
plt.grid(axis='both',alpha=0.5)

fname = os.path.join(".", "figs", "k_values.png")
plt.savefig(fname)
print(f"\nFigure saved as {fname}\n")
plt.close()


Figure saved as ./figs/k_values.png



We can see the error consistently decreases as the value of k goes up, with the lowest error corresponding to a k value of 49. If we choose this lowest error value however, we would be overfitting our model on our dataset.

Overfitting means that we’re training our model to be overly specific to our dataset, which would make it a poor predictor for unseen data.

A rule of thumb, we find the point where the error stops decreasing rapidly, alson known as loking for the elbos on the plot.

In our case, the elbow is approximately at k = 6  or k=8, either value should perform well for our purposes.

We will try both values to train our final model using the training set and determine our model’s final error using the test set.

In [12]:
k=6
model = KNeighborsRegressor(n_neighbors=k)
model.fit(X_train, Y_train) 
Y_pred = model.predict(X_test)
print(f"Our testing error is {calculate_error(Y_pred, Y_test)}\n\n")

Our testing error is 0.047973880569801966




In [13]:
k=8
model = KNeighborsRegressor(n_neighbors=k)
model.fit(X_train, Y_train) 
Y_pred = model.predict(X_test)
print(f"Our testing error is {calculate_error(Y_pred, Y_test)}\n\n")

Our testing error is 0.04681148681328441




The best final error for our predictor is k=8O 0.04681, which is a pretty good model!