## IN3062 Introduction to AI Project
### Spotify Machine Learning Experiments

Being a fan of music and an avid user of Spotify, I decided to use a dataset of over 160,000 songs, each with granular information about its characteristics, such as acousticness, energy, danceability and tempo.

Each song also has a popularity score between 0 and 100 indicating how popular it is with Spotify users worldwide. This interested me and led to my hypothesis: a song's popularity can be predicted purely from its characteristics.

Let's explore this hypothesis...

### Load The Dataset

The dataset can be located at https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks. 

I am using the main data.csv file to start. I will explore the additional datasets, data_by_genre.csv and data_by_artists, later in the notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn import linear_model

df = pd.read_csv('data.csv')


### Exploring The Dataset

Taking a look at the columns of the dataset we can see that each song has:

'valence' = The positiveness of the song.

'year' = The release year of the song.

'acousticness' = The relative metric of the song being acoustic (not having electrical amplification).

'artists' = An array containing the name(s) of the artist(s).

'danceability' = The relative measurement of the song being danceable.

'duration_ms' = The duration of the song in milliseconds.

'energy' = The relative measurement of the song's energy.

'explicit' = A Boolean value indicating if the song contains explicit lyrics.

'id' = The unique identifier for the song.

'instrumentalness' = The relative ratio of the song being instrumental.

'key' = The musical key that the song is recorded in. Represented as Integers between 0 and 11.

'liveness' = The relative duration of the song sounding like a live performance.

'loudness' = The relative loudness of the song in the range of [-60, 0] decibels (dB).

'mode' = A Boolean value indicating if the song starts with a major chord progression or not.

'name' = The name of the song.

'popularity' = The current popularity of the song measured by Spotify. Represented as Integers between 0 and 100.

'release_date' = The date of release of the song in yyyy-mm-dd, yyyy-mm or yyyy format.

'speechiness' = The relative length of the song containing any kind of vocals.

'tempo' = The musical tempo (speed) of the song.


In [2]:
df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,['Dennis Day'],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665


Let's see what the top five songs are according to popularity, to judge if there is a trend in their characteristics.

In [3]:
df.nlargest(5, 'popularity')

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
19611,0.145,2020,0.401,"['Bad Bunny', 'Jhay Cortez']",0.731,205090,0.573,1,47EiUVwUp4C9fGccaPuUCS,5.2e-05,4,0.113,-10.059,0,Dakiti,100,2020-10-30,0.0544,109.928
19606,0.756,2020,0.221,"['24kGoldn', 'iann dior']",0.7,140526,0.722,1,3tjFYV6RSFtuktYl3ZtYcq,0.0,7,0.272,-3.558,0,Mood (feat. iann dior),99,2020-07-24,0.0369,90.989
19618,0.737,2020,0.0112,['BTS'],0.746,199054,0.765,0,0t1kP63rueHleOhQkYSXFY,0.0,6,0.0936,-4.41,0,Dynamite,97,2020-08-28,0.0993,114.044
19608,0.357,2020,0.0194,"['Cardi B', 'Megan Thee Stallion']",0.935,187541,0.454,1,4Oun2ylbjFKMPTiaSbbCih,0.0,1,0.0824,-7.509,1,WAP (feat. Megan Thee Stallion),96,2020-08-07,0.375,133.073
19610,0.682,2020,0.468,['Ariana Grande'],0.737,172325,0.802,1,35mvY5S1H3J2QZyna3TFe0,0.0,0,0.0931,-4.771,1,positions,96,2020-10-30,0.0878,144.015


Although I don't enjoy any of these songs myself, we can see that they share some similarities: they all have zero or very low instrumentalness, a danceability of at least 0.7, and they were all released in 2020.

So as a generalisation, we can say that 'dancy' songs released in 2020 with no instrumentals are likely to be popular. Let's see if the training can pick up on this later.
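
Five rows are a very small sample, so one way to check this impression across a whole dataframe is to look at how each numeric column correlates with popularity. This is a minimal sketch, assuming a dataframe shaped like `df`; the helper name `popularity_correlations` and the `toy` frame are made up for illustration, and the toy values are invented rather than taken from the dataset.

```python
import pandas as pd


def popularity_correlations(frame: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each numeric column with 'popularity', strongest first."""
    numeric = frame.select_dtypes(include="number")
    correlations = numeric.corr()["popularity"].drop("popularity")
    # Sort by absolute strength so strong negative correlations are not buried.
    order = correlations.abs().sort_values(ascending=False).index
    return correlations.loc[order]


# Toy frame with invented values, only to show the call.
toy = pd.DataFrame({
    "year": [1921, 1950, 1980, 2000, 2020],
    "danceability": [0.30, 0.45, 0.50, 0.65, 0.75],
    "instrumentalness": [0.90, 0.50, 0.20, 0.05, 0.00],
    "name": ["a", "b", "c", "d", "e"],
    "popularity": [2, 10, 35, 60, 95],
})
result = popularity_correlations(toy)
print(result)
```

On the real data the call would be `popularity_correlations(df)`; a correlation close to zero for a column would suggest that the pattern seen in the top five does not hold in general.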

### Preparing The Dataset

If we take a look at the info for the dataframe, we can see that no column contains null values. This is great, as we won't need to remove any rows, which would have meant losing some data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      17

We can also see that the majority of the columns hold either Integer or Float values, which are ready for training. However, 'artists', 'id', 'name' and 'release_date' are not numeric. I will touch on 'artists' and 'name' later when I attempt to increase the accuracy of the predictions, but for now we can drop them from the dataframe along with 'id' and 'release_date'. Note that the cell below also drops the numeric 'mode' column.

In [5]:
baselineDf = df.drop(columns=['id', 'artists', 'release_date', 'name', 'mode']).copy()
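
The non-numeric columns can also be found programmatically instead of being read off the `df.info()` output. This is a small sketch, assuming a frame with the same kinds of columns; `toy` is invented sample data, not rows from data.csv.

```python
import pandas as pd

toy = pd.DataFrame({
    "valence": [0.1, 0.9],
    "artists": ["['A']", "['B']"],
    "id": ["x1", "x2"],
    "popularity": [4, 5],
})

# Every column whose dtype is not an integer or float type.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)

# Dropping them leaves only columns a linear model can consume directly.
numeric_only = toy.drop(columns=non_numeric_columns)
print(numeric_only.dtypes)
```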

### Baseline Training

Now that the dataframe has had the non-numeric objects removed we can start with the baseline training.
To start I am going to create two dataframes, one for the dataIn (X) and one for the dataOut (y). 

Then, using Scikit-learn's train_test_split, I am going to split the training and testing data 80:20.

Finally, I will run the training data through a Linear Regression model, then test the predictions against the test data to get a score (printed as 'Accuracy', which for a regression model is the R^2 value returned by model.score).

This will be run 10 times then a mean average baseline accuracy will be given.

In [6]:
def train(current, data):
    
    dataIn = data.drop(columns=['popularity'])
    dataOut = data['popularity']

    dataInTrain, dataInTest, dataOutTrain, dataOutTest = train_test_split(dataIn, dataOut, test_size=0.2)

    model = linear_model.LinearRegression()
    model.fit(dataInTrain, dataOutTrain)

    predictions = model.predict(dataInTest)

    print("Test: " + str(current + 1))
    
    print("MSE: " + str(mean_squared_error(dataOutTest, predictions)))

    print("R^2: " + str(r2_score(dataOutTest, predictions)))

    # For a regressor, score() returns R^2, so this matches the value printed above.
    accuracy = model.score(dataInTest, dataOutTest)
    
    print("Accuracy: {:.5f}%\n".format(accuracy * 100))
    
    return accuracy


temp = 0

for x in range(10):
  
    temp += train(x, baselineDf)

print("=== Baseline Average Accuracy: {:.5f}% ===".format((temp / 10) * 100))


Test: 1
MSE: 119.00396552931943
R^2: 0.7493200446097539
Accuracy: 74.93200%

Test: 2
MSE: 117.45329580860924
R^2: 0.7538573994896848
Accuracy: 75.38574%

Test: 3
MSE: 117.17888027851846
R^2: 0.7527903838692405
Accuracy: 75.27904%

Test: 4
MSE: 117.48560325180048
R^2: 0.7532636539575905
Accuracy: 75.32637%

Test: 5
MSE: 115.98699692277275
R^2: 0.7559633247361626
Accuracy: 75.59633%

Test: 6
MSE: 116.8600398633487
R^2: 0.7540966215685551
Accuracy: 75.40966%

Test: 7
MSE: 117.64116198727967
R^2: 0.7536506411731694
Accuracy: 75.36506%

Test: 8
MSE: 114.91520831115808
R^2: 0.7592856689396691
Accuracy: 75.92857%

Test: 9
MSE: 116.44494740066978
R^2: 0.755336074060079
Accuracy: 75.53361%

Test: 10
MSE: 116.00623107604838
R^2: 0.756449347947416
Accuracy: 75.64493%

=== Baseline Average Accuracy: 75.44013% ===


Just from this initial baseline we can see that the model scores on average 75.44013% (an R^2 of about 0.754) at predicting the popularity of a song based on its characteristics.
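
It is worth being explicit about what the number printed as 'Accuracy' measures. For scikit-learn regressors, `model.score` returns the coefficient of determination R^2 and not a classification accuracy, which is why the two printed values match in every run. The sketch below checks this on synthetic data (generated with `make_regression`, so unrelated to the Spotify dataset) and also reports the RMSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem, used only to compare the metrics.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

score = model.score(X_test, y_test)          # what the notebook prints as "Accuracy"
r2 = r2_score(y_test, predictions)           # what the notebook prints as "R^2"
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(np.isclose(score, r2))
print(rmse)
```

Applied to the baseline runs above, an MSE of about 117 corresponds to an RMSE of roughly 10.8, meaning a typical prediction is off by around 11 popularity points.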

Let's see if we can improve this by adjusting the dataset.

### Adjusting The Dataset

The first thing I want to try is to add artists to the training model, as some artists may be so popular that any song they release is likely to have a high popularity score.

To achieve this I have decided to use a LabelEncoder to convert each array of artist names to a unique Integer value so it can be trained in the model alongside the rest of the data.

In [7]:
from sklearn.preprocessing import LabelEncoder

adjustmentDf = df.drop(columns=['id', 'release_date', 'name', 'mode']).copy()

le = LabelEncoder()
adjustmentDf['artists'] = le.fit_transform(df['artists'].astype('str'))

temp = 0

for x in range(10):
  
    temp += train(x, adjustmentDf)

print("=== Artist Adjustment Average Accuracy: {:.5f}% ===".format((temp / 10) * 100))


Test: 1
MSE: 120.03609920130587
R^2: 0.74837246918173
Accuracy: 74.83725%

Test: 2
MSE: 115.52941722726285
R^2: 0.7577552236049949
Accuracy: 75.77552%

Test: 3
MSE: 116.48370938413115
R^2: 0.7579538373727256
Accuracy: 75.79538%

Test: 4
MSE: 117.64813082896518
R^2: 0.75158221945133
Accuracy: 75.15822%

Test: 5
MSE: 120.58927556336545
R^2: 0.7455874052051069
Accuracy: 74.55874%

Test: 6
MSE: 118.63804773319791
R^2: 0.7525625802208334
Accuracy: 75.25626%

Test: 7
MSE: 120.50282995176086
R^2: 0.7477449897743973
Accuracy: 74.77450%

Test: 8
MSE: 117.81924824360762
R^2: 0.7527585600776023
Accuracy: 75.27586%

Test: 9
MSE: 116.82618217900124
R^2: 0.7541219486177408
Accuracy: 75.41219%

Test: 10
MSE: 118.32879206480881
R^2: 0.7532749421875686
Accuracy: 75.32749%

=== Artist Adjustment Average Accuracy: 75.21714% ===


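Compared with the baseline average of 75.44013%, the artist adjustment average of 75.21714% is actually 0.22299 percentage points lower. Individual runs vary by more than that (from 74.55874% to 75.79538% here), so the encoded artists column has no measurable effect on the score.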
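
To see what the model actually receives from this encoding, here is a small sketch of `LabelEncoder` on a few artist strings taken from the rows shown earlier. The codes follow the sorted order of the strings, so their numeric size carries no information about popularity, which may be one reason a linear model gains little from the column.

```python
from sklearn.preprocessing import LabelEncoder

artists = ["['Dennis Day']", "['BTS']", "['Ariana Grande']", "['BTS']"]

encoder = LabelEncoder()
codes = encoder.fit_transform(artists)

# classes_ holds the distinct labels in sorted order; a label's code is its index there.
print(list(encoder.classes_))
print(list(codes))
```

Note also that a list of several artists is encoded as one string, so a collaboration gets a code unrelated to the codes of the individual artists.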

Next I want to modify the dataframe even further, this time by applying a function to some of the columns to generalise the data and make matches more likely.

I am going to define two functions:

normaliseMetrics: Rounds the given column's values to two decimal places.

normaliseTempo: Rounds the tempo to the nearest ten.


In [8]:
def normaliseMetrics(value):
    return round(value, 2)


def normaliseTempo(value):
    return round(abs(value), -1)
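
The same rounding can also be expressed with pandas' vectorised `Series.round`, which avoids calling a Python function once per row. This is a brief sketch using a few values from the first rows of the dataset shown earlier; results should match the two functions above apart from possible differences in exact tie-breaking cases.

```python
import pandas as pd

valence = pd.Series([0.0594, 0.963, 0.0394])
tempo = pd.Series([80.954, 60.936, 110.339])

# Counterpart of normaliseMetrics: two decimal places.
rounded_valence = valence.round(2)
print(rounded_valence.tolist())

# Counterpart of normaliseTempo: a negative decimals argument rounds to the nearest ten.
rounded_tempo = tempo.abs().round(-1)
print(rounded_tempo.tolist())
```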


I will now apply these modifier functions to the relevant columns in the dataframe.

In [9]:
metricColumns = ['valence', 'acousticness', 'danceability', 'energy',
                 'instrumentalness', 'liveness', 'loudness', 'speechiness']

# Round each metric column to two decimal places.
for column in metricColumns:
    adjustmentDf[column] = adjustmentDf[column].apply(normaliseMetrics)

# Round the tempo to the nearest ten.
adjustmentDf['tempo'] = adjustmentDf['tempo'].apply(normaliseTempo)

adjustmentDf.head(5)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,popularity,speechiness,tempo
0,0.06,1921,0.98,26839,0.28,831667,0.21,0,0.88,10,0.67,-20.1,4,0.04,80.0
1,0.96,1921,0.73,7382,0.82,180533,0.34,0,0.0,7,0.16,-12.44,5,0.41,60.0
2,0.04,1921,0.96,16378,0.33,500062,0.17,0,0.91,3,0.1,-14.85,5,0.03,110.0
3,0.17,1921,0.97,10077,0.28,210000,0.31,0,0.0,5,0.38,-9.32,3,0.04,100.0
4,0.25,1921,0.96,23719,0.42,166693,0.19,0,0.0,3,0.23,-10.1,2,0.04,100.0


As you can see, each of the relevant columns in the dataframe has now been modified to make the data more general.

Now we can run the updated dataframe through the training model.

In [10]:
temp = 0

for x in range(10):
  
    temp += train(x, adjustmentDf)

print("=== Adjustment Average Accuracy: {:.5f}% ===".format((temp / 10) * 100))

Test: 1
MSE: 116.36055523981716
R^2: 0.7543405711638276
Accuracy: 75.43406%

Test: 2
MSE: 118.15155312709508
R^2: 0.7511972590148117
Accuracy: 75.11973%

Test: 3
MSE: 119.10631030500161
R^2: 0.7482853345935057
Accuracy: 74.82853%

Test: 4
MSE: 118.56997984098334
R^2: 0.7516343805838517
Accuracy: 75.16344%

Test: 5
MSE: 115.59587978411409
R^2: 0.7582463869806112
Accuracy: 75.82464%

Test: 6
MSE: 116.48722714889439
R^2: 0.7554306563070602
Accuracy: 75.54307%

Test: 7
MSE: 115.58621101086439
R^2: 0.7568042101074692
Accuracy: 75.68042%

Test: 8
MSE: 117.6005228239975
R^2: 0.752570153201906
Accuracy: 75.25702%

Test: 9
MSE: 118.57005567587032
R^2: 0.7498087156524886
Accuracy: 74.98087%

Test: 10
MSE: 118.68606580402653
R^2: 0.7501231888753777
Accuracy: 75.01232%

=== Adjustment Average Accuracy: 75.28441% ===


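At 75.28441%, the average is 0.06727 percentage points higher than the artist adjustment run but still below the baseline of 75.44013%. Both differences are smaller than the variation between individual runs, so the rounding has no measurable effect either.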
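
Because each call to `train` uses a fresh random split, the averages from two sets of ten runs will differ slightly even when nothing has changed. One way to judge whether a change matters is to report the spread alongside the mean. This is a minimal sketch on synthetic data; the helper `repeated_scores` is a hypothetical name and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def repeated_scores(X, y, runs=10, test_size=0.2):
    """Fit on `runs` different random splits and return the R^2 of each."""
    scores = []
    for seed in range(runs):
        # A different seed per run gives a different, but reproducible, split.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.array(scores)


# Synthetic data, used only to demonstrate the helper.
X, y = make_regression(n_samples=1000, n_features=8, noise=25.0, random_state=0)
scores = repeated_scores(X, y)
print("mean R^2: {:.5f}, std: {:.5f}".format(scores.mean(), scores.std(ddof=1)))
```

Fixing the seeds also means two feature sets can be compared on identical splits, which removes the split itself as a source of difference.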