## IN3062 Introduction to AI Project
### Spotify Machine Learning Experiments

Being a fan of music and an avid user of Spotify, I decided to use a dataset of over 160,000 songs, each with granular information about its characteristics, such as acousticness, energy, danceability and tempo.

Each song also has a popularity score between 0 and 100 indicating how popular the song is with Spotify users worldwide. This interested me and led to my hypothesis: it is possible to predict how popular a song could be based purely on its characteristics.

Let's explore this hypothesis...

### Load The Dataset

The dataset can be located at https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks. 

I am using the main data.csv file to start. I will explore the additional datasets, data_by_genre.csv and data_by_artists, later in the notebook.

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from sklearn import linear_model
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')


### Exploring The Dataset

Taking a look at the columns of the dataset we can see that each song has:

'valence' = The positiveness of the song.

'year' = The release year of the song.

'acousticness' = The relative metric of the song being acoustic (not having electrical amplification).

'artists' = A list of the names of the artist(s), stored as a string.

'danceability' = The relative measurement of the song being danceable.

'duration_ms' = The duration of the song in milliseconds.

'energy' = The relative measurement of the song's energy.

'explicit' = A Boolean value indicating if the song contains explicit lyrics.

'id' = The unique identifier for the song.

'instrumentalness' = The relative ratio of the song being instrumental.

'key' = The musical key that the song is recorded in. Represented as Integers between 0 and 11.

'liveness' = The relative duration of the song sounding like a live performance.

'loudness' = The relative loudness of the song in the range of [-60, 0] decibels (dB).

'mode' = A Boolean value indicating if the song starts with a major chord progression or not.

'name' = The name of the song.

'popularity' = The current popularity of the song measured by Spotify. Represented as Integers between 0 and 100.

'release_date' = The date of release of the song in yyyy-mm-dd, yyyy-mm or yyyy format.

'speechiness' = The relative measure of spoken words in the song.

'tempo' = The musical tempo (speed) of the song.
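
Because 'release_date' mixes three formats, one simple way to get a consistent value out of it is to take the first four characters as the year. The snippet below is a minimal sketch, assuming the column holds strings. The sample values are illustrative rather than read from data.csv, and 'release_year' is a hypothetical column name.

```python
import pandas as pd

# Illustrative values covering the three formats (yyyy, yyyy-mm, yyyy-mm-dd).
sample = pd.DataFrame({"release_date": ["1921", "2012-11", "2020-10-30"]})

# The year is always the first four characters, whichever format is used.
sample["release_year"] = sample["release_date"].astype(str).str[:4].astype(int)

print(sample)
```

The dataset already carries a 'year' column, so this is mainly useful as a consistency check.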


In [2]:
df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,['Dennis Day'],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665


Let's see what the top five songs are according to popularity, to judge if there is a trend in their characteristics.

In [3]:
df.nlargest(5, 'popularity')

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
19611,0.145,2020,0.401,"['Bad Bunny', 'Jhay Cortez']",0.731,205090,0.573,1,47EiUVwUp4C9fGccaPuUCS,5.2e-05,4,0.113,-10.059,0,Dakiti,100,2020-10-30,0.0544,109.928
19606,0.756,2020,0.221,"['24kGoldn', 'iann dior']",0.7,140526,0.722,1,3tjFYV6RSFtuktYl3ZtYcq,0.0,7,0.272,-3.558,0,Mood (feat. iann dior),99,2020-07-24,0.0369,90.989
19618,0.737,2020,0.0112,['BTS'],0.746,199054,0.765,0,0t1kP63rueHleOhQkYSXFY,0.0,6,0.0936,-4.41,0,Dynamite,97,2020-08-28,0.0993,114.044
19608,0.357,2020,0.0194,"['Cardi B', 'Megan Thee Stallion']",0.935,187541,0.454,1,4Oun2ylbjFKMPTiaSbbCih,0.0,1,0.0824,-7.509,1,WAP (feat. Megan Thee Stallion),96,2020-08-07,0.375,133.073
19610,0.682,2020,0.468,['Ariana Grande'],0.737,172325,0.802,1,35mvY5S1H3J2QZyna3TFe0,0.0,0,0.0931,-4.771,1,positions,96,2020-10-30,0.0878,144.015


Apart from my not enjoying any of these songs, the main similarities they exhibit are that they all have zero or very low instrumentalness, a danceability of at least 0.7, and a 2020 release date.

So, as a generalisation, we can say that 'dancy' songs released in 2020 with no instrumentals are likely to be popular. Let's see if the training can pick up on this later.
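
Rather than comparing the rows by eye, the same check can be done by summarising the relevant columns. This is a small sketch using only the five rows printed above, typed in by hand, so it runs without data.csv.

```python
import pandas as pd

# Values copied from the five most popular rows shown above.
top5 = pd.DataFrame({
    "name": ["Dakiti", "Mood (feat. iann dior)", "Dynamite",
             "WAP (feat. Megan Thee Stallion)", "positions"],
    "year": [2020, 2020, 2020, 2020, 2020],
    "danceability": [0.731, 0.7, 0.746, 0.935, 0.737],
    "instrumentalness": [5.2e-05, 0.0, 0.0, 0.0, 0.0],
})

# Minimum and maximum of each column across the five songs.
summary = top5[["year", "danceability", "instrumentalness"]].agg(["min", "max"])
print(summary)
```

On the full dataset the same summary could be taken from `df.nlargest(5, 'popularity')` instead of the hand-typed rows.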

### Preparing The Dataset

If we take a look at the info for the DataFrame, we can see that no column contains null values. This is great, as we won't need to remove any rows, which would have meant losing some data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      17

We can also see that the majority of the columns hold Integer or Float values, which are ready for training. The exceptions are 'artists', 'id', 'name' and 'release_date', which are not numeric. I will touch on 'artists' and 'name' later when I attempt to increase the accuracy of the predictions, but for now we can drop them from the DataFrame along with 'id' and 'release_date'. The cell below also drops the 'mode' column.

In [5]:
baselineDf = df.drop(columns=['id', 'artists', 'release_date', 'name', 'mode']).copy()
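
The two checks described above (no nulls, and which columns are not numeric) can also be made programmatically instead of by reading the `df.info()` output. This is a minimal sketch on a tiny stand-in DataFrame built from rows shown earlier. `sample`, `null_counts` and `non_numeric` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# A tiny stand-in for the real DataFrame (values taken from rows shown earlier).
sample = pd.DataFrame({
    "valence": [0.963, 0.165],
    "artists": ["['Dennis Day']", "['Frank Parker']"],
    "id": ["7xPhfUan2yNtyFG0cUWkt8", "3ftBPsC5vPBKxYSee08FDH"],
    "name": ["Clancy Lowered the Boom", "Danny Boy"],
    "popularity": [5, 3],
})

# Count missing values per column.
null_counts = sample.isna().sum()

# List the columns that are not numeric and so cannot be fed to the model as-is.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print(non_numeric)
```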

### Baseline Training

Now that the non-numeric columns have been removed from the DataFrame, we can start with the baseline training.
To start, I am going to separate the data into the inputs, dataIn (X), and the target, dataOut (y).

Then, using Scikit-learn's train_test_split, I am going to split the data 80:20 into training and testing sets.

Finally, I will run the training data through a Linear Regression model, then test the predictions against the test data to get an accuracy score.

The accuracy score is derived from the following measures:

Mean Squared Error (MSE) - The mean of the squared differences between the predicted and actual popularity values. Lower is better.

R Squared (R^2) - The proportion of the variance in the actual popularity values that is explained by the model's predictions.
The closer to 1 the better.

The R^2 score (which is what model.score returns for a scikit-learn regressor) is multiplied by 100 and shown to five decimal places as the overall accuracy score.
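
To make the two measures concrete, here is a small worked example that computes both by hand with NumPy. The four actual and predicted values are hypothetical and chosen only to keep the arithmetic easy to follow. They give an MSE of 0.75 and an R^2 of 0.85.

```python
import numpy as np

# Hypothetical actual and predicted popularity scores, for illustration only.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 10.0])

# MSE: mean of the squared differences between actual and predicted values.
mse = np.mean((actual - predicted) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R^2:", r2)
```

These are the same quantities that `mean_squared_error` and `r2_score` report in the training function.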

In [6]:
# Defines a function that takes a given DataFrame, splits the data into testing and training sets then applies Linear Regression.

def train(data):
    
    # Sets the subject and target DataFrames. 
    
    dataIn = data.drop(columns=['popularity'])
    dataOut = data['popularity']

    dataInTrain, dataInTest, dataOutTrain, dataOutTest = train_test_split(dataIn, dataOut, test_size=0.2, random_state=0)

    model = linear_model.LinearRegression()
    model.fit(dataInTrain, dataOutTrain)

    predictions = model.predict(dataInTest)
    
    print("MSE: " + str(mean_squared_error(dataOutTest, predictions)))

    print("R^2: " + str(r2_score(dataOutTest, predictions)))

    accuracy = model.score(dataInTest, dataOutTest)
    
    print("Accuracy: {:.5f}%\n".format(accuracy * 100))
    
    return accuracy

print("=== Baseline Accuracy: {:.5f}% ===".format((train(baselineDf)) * 100))


MSE: 113.74587287017013
R^2: 0.7611216828203751
Accuracy: 76.11217%

=== Baseline Accuracy: 76.11217% ===


Just from this initial baseline we can see that the model scores 76.11217%, meaning it explains roughly 76% of the variance in song popularity from the songs' characteristics.
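
The MSE is in squared popularity points, which is hard to read directly. Taking its square root gives the root mean squared error (RMSE) in the same units as the popularity score. A quick sketch using the MSE printed above (`baseline_mse` and `baseline_rmse` are illustrative names):

```python
import math

# MSE reported by the baseline run above.
baseline_mse = 113.74587287017013

# RMSE is the square root of the MSE, expressed in popularity points.
baseline_rmse = math.sqrt(baseline_mse)

print("RMSE: {:.2f}".format(baseline_rmse))
```

So the baseline's typical error is roughly 10.7 points on the 0 to 100 popularity scale.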

Let's see if we can improve this by adjusting the dataset.

### Adjusting The Dataset

The first thing I want to try is adding artists to the training data, as some artists may be so popular that any song they release is likely to have a high popularity score.

To achieve this I have decided to use a LabelEncoder to convert each song's list of artist names into a unique Integer value so it can be trained in the model alongside the rest of the data.
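
To show what the encoder actually produces, here is a minimal sketch on four artist strings in the format used by the 'artists' column. The list is a hand-picked illustration, not the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Artist strings as they appear in the 'artists' column shown earlier.
artists = ["['BTS']", "['Ariana Grande']", "['BTS']", "['Dennis Day']"]

le = LabelEncoder()
codes = le.fit_transform(artists)

# Classes are stored in sorted order, and each code is the index in that order.
print(list(le.classes_))
print(list(codes))
```

The integers reflect alphabetical position, not popularity, and a linear model treats them as a numeric scale. This may be one reason the encoding adds little to the score.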

In [22]:
# Imports sklearn's LabelEncoder class.

from sklearn.preprocessing import LabelEncoder

# Creates a new DataFrame for the adjustments.

adjustmentDf = df.drop(columns=['id', 'release_date', 'name', 'mode']).copy()

# A function to label-encode a given string column of a DataFrame.

def encodeStrings(data, column):

    le = LabelEncoder()
    return le.fit_transform(data[column].astype('str'))

# Applies function to the artists column.

adjustmentDf['artists'] = encodeStrings(adjustmentDf, 'artists')

# Trains the new adjustment DataFrame and outputs the results.

print("=== Artist Adjustment Accuracy: {:.5f}% ===".format((train(adjustmentDf)) * 100))


MSE: 113.67924490171757
R^2: 0.7612616085739821
Accuracy: 76.12616%

=== Artist Adjustment Accuracy: 76.12616% ===


With this adjustment we can see a gain of 0.01399 percentage points in accuracy. Not the greatest improvement, but an improvement nonetheless.

Next I want to modify the DataFrame even further, this time by applying a function to some of the columns to generalise the data and make matches more likely.

I am going to define two functions:

normaliseMetrics: Rounds the given column's values to two decimal places.

normaliseTempo: Rounds the tempo column's values to the nearest ten.


In [8]:
# Function to round a given column's values to two decimal places.

def normaliseMetrics(value):
    return round(value, 2)

# Function to round the tempo column's values to the nearest ten.

def normaliseTempo(value):
    return round(abs(value), -1)


I will now apply these modifier functions to the relevant columns in the DataFrame.

In [9]:
# Applies the normaliseMetrics function to each of the following columns.

adjustmentDf['valence'] = adjustmentDf['valence'].apply(normaliseMetrics)
adjustmentDf['acousticness'] = adjustmentDf['acousticness'].apply(normaliseMetrics)
adjustmentDf['danceability'] = adjustmentDf['danceability'].apply(normaliseMetrics)
adjustmentDf['energy'] = adjustmentDf['energy'].apply(normaliseMetrics)
adjustmentDf['instrumentalness'] = adjustmentDf['instrumentalness'].apply(normaliseMetrics)
adjustmentDf['liveness'] = adjustmentDf['liveness'].apply(normaliseMetrics)
adjustmentDf['loudness'] = adjustmentDf['loudness'].apply(normaliseMetrics)
adjustmentDf['speechiness'] = adjustmentDf['speechiness'].apply(normaliseMetrics)

# Applies the normaliseTempo function to the tempo column.

adjustmentDf['tempo'] = adjustmentDf['tempo'].apply(normaliseTempo)

adjustmentDf.head(5)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,popularity,speechiness,tempo
0,0.06,1921,0.98,26839,0.28,831667,0.21,0,0.88,10,0.67,-20.1,4,0.04,80.0
1,0.96,1921,0.73,7382,0.82,180533,0.34,0,0.0,7,0.16,-12.44,5,0.41,60.0
2,0.04,1921,0.96,16378,0.33,500062,0.17,0,0.91,3,0.1,-14.85,5,0.03,110.0
3,0.17,1921,0.97,10077,0.28,210000,0.31,0,0.0,5,0.38,-9.32,3,0.04,100.0
4,0.25,1921,0.96,23719,0.42,166693,0.19,0,0.0,3,0.23,-10.1,2,0.04,100.0


As you can see, each of the relevant columns in the DataFrame has now been modified to make the data more general.
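
The same rounding can be written without calling `apply` once per column, by using the rounding methods pandas already provides. This is a sketch of an alternative, not the code used for the results in this notebook. It uses three rows and three columns copied from the dataset preview shown earlier.

```python
import pandas as pd

# Values from the first three rows of the dataset shown earlier.
sample = pd.DataFrame({
    "valence": [0.0594, 0.963, 0.0394],
    "loudness": [-20.096, -12.441, -14.85],
    "tempo": [80.954, 60.936, 110.339],
})

# Round the metric columns to two decimal places in one call.
sample = sample.round({"valence": 2, "loudness": 2})

# Round the absolute tempo to the nearest ten.
sample["tempo"] = sample["tempo"].abs().round(-1)

print(sample)
```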

Now we can run the updated DataFrame through the training model.

In [10]:
print("=== Adjustment Average Accuracy: {:.5f}% ===".format((train(adjustmentDf)) * 100))

MSE: 113.68077577361474
R^2: 0.7612583935817069
Accuracy: 76.12584%

=== Adjustment Average Accuracy: 76.12584% ===


As it turns out, this adjustment has actually decreased the accuracy by 0.00032 percentage points, suggesting that the regression works almost as effectively even when the data is more general.

### Genres

Next I will use the data_w_genres.csv dataset to provide each song in the existing data.csv dataset with the genres associated with each accredited artist.

If we take Lana Del Rey, Royal Blood and Bill Withers as example artists, we can see their associated genres.

In [11]:
# Loads the data_w_genres.csv as a new DataFrame.

genresDf = pd.read_csv('data_w_genres.csv')

genresDf.loc[genresDf['artists'] ==  "Lana Del Rey"]

Unnamed: 0,genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
14100,"['art pop', 'pop']",Lana Del Rey,0.44669,0.423221,256964.019608,0.542764,0.020926,0.17142,-8.448186,0.055098,115.393554,0.254493,57.892157,0,0,204


In [12]:
genresDf.loc[genresDf['artists'] ==  "Royal Blood"]

Unnamed: 0,genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
21432,"['brighton indie', 'garage rock', 'modern alte...",Royal Blood,0.04634,0.5242,212152.8,0.876,0.00208,0.15664,-4.4422,0.0574,107.982,0.5894,54.6,0,1,10


In [13]:
genresDf.loc[genresDf['artists'] ==  "Bill Withers"]

Unnamed: 0,genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
2701,"['funk', 'motown', 'quiet storm', 'soul']",Bill Withers,0.471254,0.653619,257137.321429,0.468529,0.065161,0.209877,-12.269083,0.064256,111.738393,0.629061,40.52381,9,1,168


The code below iterates over each song in the data.csv dataset and, for each of the accredited artists, looks up that artist's genres in the data_w_genres.csv dataset. It then exports the new dataset as a CSV file called songs_w_genres.csv.

#### Warning: 

This process requires a lot of data processing and will be an intensive task on your computer. It took my 2020 MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of Memory over 25 minutes at 100% CPU to process this. So if you don't wish to run this I have provided an already processed songs_w_genres.csv in the repository.
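
Much of the cost comes from the lookup inside the loop, which scans the whole genres DataFrame once for every artist on every song. One possible way to reduce this is to build a dictionary from artist to genres once and reuse it. The sketch below is an assumption-laden alternative, not the code that produced songs_w_genres.csv, and I have not timed it on the full dataset. It uses tiny stand-in DataFrames, and `lookup` and `combined_genres` are hypothetical names. It also differs in storing a sorted list rather than a string and in ignoring artists that are missing from the genres data.

```python
import ast
import pandas as pd

# Tiny stand-ins for data.csv and data_w_genres.csv (values from this notebook).
songs = pd.DataFrame({
    "name": ["Endless Love", "Summertime Sadness"],
    "artists": ["['Lionel Richie', 'Diana Ross']", "['Lana Del Rey']"],
})
artist_genres = pd.DataFrame({
    "artists": ["Lionel Richie", "Diana Ross", "Lana Del Rey"],
    "genres": [
        "['adult standards', 'disco', 'mellow gold', 'motown', 'quiet storm', 'soft rock']",
        "['adult standards', 'disco', 'funk', 'mellow gold', 'motown', 'quiet storm', 'soft rock', 'soul']",
        "['art pop', 'pop']",
    ],
})

# Build the artist -> genres lookup once, parsing each list string safely.
lookup = {
    artist: ast.literal_eval(genres)
    for artist, genres in zip(artist_genres["artists"], artist_genres["genres"])
}

def combined_genres(artists_string):
    """Return the sorted, de-duplicated genres for every artist on a song."""
    genres = set()
    for artist in ast.literal_eval(artists_string):
        genres.update(lookup.get(artist, []))  # unknown artists add nothing
    return sorted(genres)

songs["genres"] = songs["artists"].apply(combined_genres)
print(songs[["name", "genres"]])
```

Using `ast.literal_eval` instead of chained `replace` calls also copes with artist names that contain commas or apostrophes.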

In [14]:
# Creates a new DataFrame based on the existing data.csv.

genreAdjustmentDf = df.copy()

# Adds a 'genres' column to the new DataFrame.

genreAdjustmentDf['genres'] = ''

# Iterates over every row in the new DataFrame.

for index, row in genreAdjustmentDf.iterrows():
    
    # Parses the string data in the 'artists' column into a list of the individual artists that are accredited to the given song.
    
    artists = row.get('artists').replace('[', '').replace(']', '').replace("'", '').split(', ')
    
    genresList = []
    
    # Iterates over each artist in the accredited list.
    
    for artist in artists:
        
        # Searches the artists with genres DataFrame for a match with the given artist. Returns the row as a Series.
        
        target = genresDf.loc[genresDf['artists'] == artist].squeeze()
        
        # Parses the string data in the 'genres' column into a list of the individual genres that are associated with the given artist. Appends this list to the existing list if there are multiple given artists.
        
        genresList = genresList + str(target.get('genres')).replace('[', '').replace(']', '').replace("'", '').split(', ')
        
    # Removes any duplicate genres in the list and appends it as a string to the new genres column for the given song.

    genreAdjustmentDf.loc[index, 'genres'] = str(list(set(genresList)))
    
    print('Parsed ' + str(index + 1) + ' of 170653')
    
# Exports the new DataFrame as a new csv file called songs_w_genres.csv so it can be used without needing to parse again.
                         
genreAdjustmentDf.to_csv(r'songs_w_genres.csv', index = False)
        

Parsed 1 of 170653
Parsed 2 of 170653
Parsed 3 of 170653
...
Parsed 880 of 170653

KeyboardInterrupt: 

If we now look up songs by the previously given artists, we can see that there is now a 'genres' column at the end of the DataFrame, which contains the genres of the accredited artists.

In [15]:
songsGenresDf = pd.read_csv('songs_w_genres.csv')

songsGenresDf.loc[songsGenresDf['name'] == 'Summertime Sadness'].head(1)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,genres
18137,0.235,2012,0.0542,['Lana Del Rey'],0.565,265427,0.654,0,2dBwB667LHQkLhdYlwLUZK,2e-06,1,0.122,-6.826,0,Summertime Sadness,63,2012-11-12,0.0335,111.968,"['art pop', 'pop']"


In [16]:
songsGenresDf.loc[songsGenresDf['name'] == 'Out of the Black'].head(1)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,genres
74002,0.633,2014,0.000867,['Royal Blood'],0.52,240572,0.981,1,2nIulsLMiP2SMTDrtxfQXX,0.00177,2,0.197,-2.75,1,Out of the Black,53,2014-08-25,0.0906,158.937,"['modern blues rock', 'garage rock', 'modern a..."


In [17]:
songsGenresDf.loc[songsGenresDf['name'] == 'Lovely Day'].head(1)

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,genres
11013,0.706,1977,0.292,['Bill Withers'],0.69,254560,0.651,0,0bRXwKfigvpKZUurwqAlEh,0.00241,9,0.105,-8.267,1,Lovely Day,76,1977-10-29,0.0324,97.918,"['motown', 'funk', 'quiet storm', 'soul']"


Also if we take a look at a song that features multiple accredited artists we can see that the genres of both artists have been combined, without duplicating genres.

In [18]:
target = songsGenresDf.loc[songsGenresDf['name'] == 'Endless Love'].head(1).squeeze()

print('Artists: ' + str(target.get('artists')))

print('Lionel Richie - Genres: ' + str(genresDf.loc[genresDf['artists'] == 'Lionel Richie'].squeeze().get('genres'))) 
print('Diana Ross - Genres: ' + str(genresDf.loc[genresDf['artists'] == 'Diana Ross'].squeeze().get('genres')))

print('Combined Genres for Endless Love: ' + str(target.get('genres')))

Artists: ['Lionel Richie', 'Diana Ross']
Lionel Richie - Genres: ['adult standards', 'disco', 'mellow gold', 'motown', 'quiet storm', 'soft rock']
Diana Ross - Genres: ['adult standards', 'disco', 'funk', 'mellow gold', 'motown', 'quiet storm', 'soft rock', 'soul']
Combined Genres for Endless Love: ['soft rock', 'funk', 'quiet storm', 'motown', 'mellow gold', 'disco', 'adult standards', 'soul']


So now that we have the genres featured for each song, let's see if this will improve the accuracy of the Linear Regression model.

Firstly we need to drop the irrelevant columns from the new DataFrame. Then we need to encode the genres into unique integer values, the same way we did with the artists earlier, so we can use the same function: 'encodeStrings()'.


In [26]:
songsGenresDf = pd.read_csv('songs_w_genres.csv')

songsGenresDf = songsGenresDf.drop(columns=['id', 'release_date', 'name', 'mode'])

songsGenresDf['artists'] = encodeStrings(songsGenresDf, 'artists')
songsGenresDf['genres'] = encodeStrings(songsGenresDf, 'genres')

print("=== Genre Addition Accuracy: {:.5f}% ===".format((train(songsGenresDf)) * 100))

MSE: 113.38291980221359
R^2: 0.7618839225034588
Accuracy: 76.18839%

=== Genre Addition Accuracy: 76.18839% ===


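This has added 0.06223 percentage points to the accuracy compared with the artist adjustment. That is the largest improvement so far, although still a small one in absolute terms.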

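To check whether Linear Regression is the best regressor for this task, I will try a different linear model. Note that although the function below is named ridgeTrain, the model it actually fits is OrthogonalMatchingPursuit rather than RidgeCV, and its final print statement is still labelled 'Baseline Accuracy'.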

In [40]:
def ridgeTrain(data):
    
    # Sets the subject and target DataFrames. 
    
    dataIn = data.drop(columns=['popularity'])
    dataOut = data['popularity']

    dataInTrain, dataInTest, dataOutTrain, dataOutTest = train_test_split(dataIn, dataOut, test_size=0.2, random_state=0)

    model = linear_model.OrthogonalMatchingPursuit()
    model.fit(dataInTrain, dataOutTrain)

    predictions = model.predict(dataInTest)
    
    print("MSE: " + str(mean_squared_error(dataOutTest, predictions)))

    print("R^2: " + str(r2_score(dataOutTest, predictions)))

    accuracy = model.score(dataInTest, dataOutTest)
    
    print("Accuracy: {:.5f}%\n".format(accuracy * 100))
    
    return accuracy

print("=== Baseline Accuracy: {:.5f}% ===".format((ridgeTrain(songsGenresDf)) * 100))



MSE: 117.94334271441588
R^2: 0.7523065539062024
Accuracy: 75.23066%

=== Baseline Accuracy: 75.23066% ===
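
Comparing regressors is easier when the same split is reused and the models are looped over, rather than copying the training function for each one. The sketch below shows that pattern on synthetic data from `make_regression`, so it runs without songs_w_genres.csv. The scores it prints say nothing about the Spotify data, and the `alphas` passed to RidgeCV are arbitrary illustrative values.

```python
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the songs DataFrame.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LinearRegression": linear_model.LinearRegression(),
    "RidgeCV": linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0]),
    "OrthogonalMatchingPursuit": linear_model.OrthogonalMatchingPursuit(),
}

# Fit each model on the same split and record its R^2 on the test set.
scores = {}
for label, model in models.items():
    model.fit(X_train, y_train)
    scores[label] = model.score(X_test, y_test)
    print("{}: R^2 = {:.5f}".format(label, scores[label]))
```

To use it on the real data, `X` and `y` would be replaced by the feature columns and the 'popularity' column of the encoded DataFrame.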


## Content Based Filtering

Starting work on content based filtering.

In [60]:
from sklearn.feature_extraction.text import CountVectorizer

reccomendationDF = pd.read_csv('songs_w_genres.csv')

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(reccomendationDF['genres'])



IndexError: index 154165 is out of bounds for axis 0 with size 38000
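
As a small illustration of what the count matrix contains and how it could be used, the sketch below vectorises three genre strings and compares them with cosine similarity. Cosine similarity is my assumption for the comparison step, since the notebook does not get that far. The second genre string is a hypothetical example. The other two appear earlier in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Genre strings in the same format as the 'genres' column. The second
# entry is a hypothetical example; the others appear in this notebook.
genres = [
    "['art pop', 'pop']",
    "['dance pop', 'pop']",
    "['funk', 'motown', 'quiet storm', 'soul']",
]

# Count how often each genre word appears for each song.
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(genres)

# Pairwise cosine similarity between the songs' genre-word counts.
similarity = cosine_similarity(count_matrix)
print(similarity.round(2))
```

The two pop entries score as similar to each other and share nothing with the soul entry. For the full dataset, a complete song-by-song similarity matrix would be very large, so similarities would probably need to be computed for one query song at a time.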