# Machine Learning

## Linear Regression Model - Predicting Popularity

This section builds a Linear Regression model to predict the popularity of a song. We first analyze the following quantitative features to find the optimal subset to train a model on: tempo, energy, danceability, loudness, speechiness, instrumentalness, duration, and valence. After finding the model that made the best popularity prediction with the features at hand, we predicted popularity for newly released songs.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

data_dir = "https://raw.githubusercontent.com/ShainaBagri/SpotifyDataAnalysis/main/"
top_songs_df = pd.read_csv(data_dir + "spotify_2010sHits.csv")
top_songs_df

Unnamed: 0,index,topYear,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
0,0,2009,1,4kLLWz7srcuLKA7Et40PQR,I Gotta Feeling,Black Eyed Peas,1dgbFU08pXJXZhGPlybdMX,THE E.N.D. (THE ENERGY NEVER DIES) [Deluxe Ver...,"['dance pop', 'pop', 'pop dance', 'pop rap']",81,127.960,0.766,0.743,-6.375,0.0265,0.000000,289133,0.610,False,5.0,pop
1,1,2009,2,1QV6tiMFM6fSOKOGLMHYYg,Poker Face,Lady Gaga,1qwlxZTNLe1jq3b0iidlue,The Fame,"['dance pop', 'pop', 'pop dance']",77,118.999,0.806,0.851,-4.620,0.0787,0.000002,237200,0.787,False,4.0,pop
2,2,2009,3,4kgTdThcDHfuDS2kKxB7Lc,You Belong With Me,Taylor Swift,2gP2LMVcIFgVczSJqn340t,Fearless (Platinum Edition),"['dance pop', 'pop', 'pop dance']",61,129.964,0.771,0.687,-4.424,0.0384,0.000025,231146,0.445,False,4.0,pop
3,3,2009,4,0iGckQFyv6svOfAbAY9aWJ,Hot N Cold,Katy Perry,3OALgjCs6Lqw41853v4wEQ,One Of The Boys,"['dance pop', 'pop', 'pop dance', 'post-teen p...",71,132.032,0.841,0.706,-3.956,0.0418,0.000000,220226,0.861,False,4.0,pop
4,4,2009,5,3GpbwCm3YxiWDvy29Uo3vP,Right Round,Flo Rida,2vBLKFrI1rZqB7VtGxcsR5,R.O.O.T.S. (Route of Overcoming the Struggle),"['dance pop', 'miami hip hop', 'pop', 'pop dan...",74,124.986,0.672,0.720,-6.852,0.0551,0.000000,204640,0.705,False,3.0,pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1084,94,2019,95,1wJRveJZLSb1rjhnUHQiv6,Swervin (feat. 6ix9ine),A Boogie Wit da Hoodie,3r5hf3Cj3EMh1C2saQ8jyt,Hoodie SZN,"['melodic rap', 'pop rap', 'rap', 'trap']",80,93.023,0.662,0.581,-5.239,0.3030,0.000000,189486,0.434,True,3.0,rap
1085,95,2019,96,5wIjM4q7oIgiLqn8Qfoyxh,Keisha & Becky - Remix,Russ Millions,5zab8YLQV8MOXSTpcK6mT3,Keisha & Becky (Remix),"['uk drill', 'uk hip hop']",68,140.969,0.471,0.863,-9.545,0.4780,0.000000,252906,0.644,True,4.0,hip hop
1086,96,2019,97,4SSnFejRGlZikf02HLewEF,bury a friend,Billie Eilish,0S0KGZnfBGSIssfF54WSJh,"WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?","['electropop', 'pop']",79,120.046,0.389,0.905,-14.505,0.3320,0.162000,193143,0.196,False,3.0,pop
1087,97,2019,98,6UnCGAEmrbGIOSmGRZQ1M2,Light On,Maggie Rogers,5AHWNPo3gllDmixgAoFru4,Heard It In A Past Life,"['indie pop', 'pop']",73,102.054,0.569,0.657,-6.287,0.0542,0.000014,233880,0.399,False,4.0,pop


### Manual Handling of Categorical Variables

The dummy variables are added manually to fix an error that occurred when a level of the categorical variable "topGenre" was present in the training set but not in the validation set. Giving every song a column for each unique top *genre* resolves this, because every split then has the same set of columns.
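As a minimal sketch of the idea, using a small made-up frame rather than the project data, `pd.get_dummies` produces one indicator column per genre level, so each row carries the full set of genre columns no matter which split it later lands in. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real songs frame
toy = pd.DataFrame({"popularity": [81, 68, 73],
                    "topGenre": ["pop", "hip hop", "pop"]})

# One 0/1 indicator column per unique genre level
dummies = pd.get_dummies(toy["topGenre"], dtype=int)
toy_encoded = pd.concat([toy.drop(columns="topGenre"), dummies], axis=1)

print(toy_encoded.columns.tolist())
```

Because the encoding is done once on the whole frame, a genre that appears in only a few rows still gets its own column in every later subset of the data.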

In [None]:
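# Append one indicator column per topGenre level
top_songs_train = pd.concat([top_songs_df, pd.get_dummies(top_songs_df["topGenre"])], axis=1)
top_songs_train = top_songs_train.drop(labels=["topGenre", "duration_ms"], axis=1)
# Keep only the feature columns, from "tempo" onward
top_songs_train = top_songs_train.loc[:, "tempo":].copy()
# Encode the boolean "explicit" flag as 0/1
top_songs_train["explicit"] = top_songs_train["explicit"].astype(int)
# The first nine columns are the non-genre features; the rest are the genre dummies
top_songs_train.columns[9:]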

### Model Selection

We ran a linear regression model on the most popular songs dataframe with various combinations of features and chose the best combination by comparing the coefficient of determination ($R^2$) of each.



The following function takes in a set of features to train the linear regression model on and then calculates the average coefficient of determination across all the folds in the 10-Fold Cross Validation used.

Note: The reasons for choosing k = 10 are given in the final paper.
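To make the averaged score concrete, here is a hand-rolled sketch of roughly what `cross_val_score(..., scoring="r2", cv=10)` computes for a regressor: the data are split into 10 contiguous, unshuffled folds, a least-squares model is fit on nine of them, and $R^2$ is measured on the held-out fold. The helper name `mean_kfold_r2` and the synthetic data are assumptions for illustration, not part of the project code.

```python
import numpy as np

def mean_kfold_r2(X, y, k=10):
    """Average out-of-fold R^2 of ordinary least squares over k contiguous folds."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)   # unshuffled, contiguous folds
    scores = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit intercept + slopes by least squares on the training folds
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Score on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        resid = y[test_idx] - A_test @ beta
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

# Synthetic, noise-free linear data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0])
demo_score = mean_kfold_r2(X_demo, y_demo)
```

On noise-free linear data the mean score is essentially 1. Because the folds are contiguous, rows sorted by year end up grouped by period, which is worth keeping in mind when reading the scores.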

In [None]:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

def calc_r2(ts_training_features):
  # cross_val_score clones and refits the estimator on each fold,
  # so no separate fit on the full data is needed here
  model = LinearRegression()
  scores = cross_val_score(model,
                           X=top_songs_train[ts_training_features],
                           y=top_songs_df['popularity'],
                           scoring="r2",
                           cv=10)
  # Mean R^2 over the 10 held-out folds
  return scores.mean()

To select the features that give the best cross-validated score, every combination of at least 3 of the 9 non-genre features (the 8 listed above plus "explicit") was added to the list `combs_to_test`, once on its own and once with the genre dummy columns appended. This gives 2 × 466 = 932 feature combinations to examine.



In [None]:
from itertools import combinations
genres_list = ['country', 'edm', 'hip hop', 'metal', 'other', 'pop', 'punk', 'r&b',
       'rap', 'reggae', 'reggaeton', 'rock', 'singer-songwriter']
lst = top_songs_train.columns[0:9]
combs_to_test = []
for size in range(3, len(lst) + 1, 1):
  for comb in combinations(lst, size):
    combs_to_test.append(list(comb))
    combs_to_test.append(list(comb) + genres_list)
len(combs_to_test)

932
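The count of 932 can be checked by hand: there are 9 non-genre feature columns, subsets of size 3 through 9 are enumerated, and each subset is added twice (with and without the genre dummies). A short sketch of that arithmetic:

```python
from math import comb

n_features = 9   # the nine non-genre columns, tempo through duration_m
# Number of subsets of size 3 through 9
n_subsets = sum(comb(n_features, size) for size in range(3, n_features + 1))
# Each subset is tested twice: alone, and with the genre dummies appended
n_models = 2 * n_subsets
print(n_subsets, n_models)   # prints "466 932"
```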

Then, the $R^2$ values for each combination are collected into a DataFrame and sorted to find the combination that maximizes the coefficient of determination (the results are analyzed in the final paper).

In [None]:
r2_test_error = []
for comb in combs_to_test:
  r2_test_error.append(calc_r2(comb))

df_vals = pd.DataFrame()
df_vals = df_vals.assign(Combination = combs_to_test, r2TestError = r2_test_error)
df_vals.sort_values(by = "r2TestError", ascending=False)[["Combination", "r2TestError"]]

Unnamed: 0,Combination,r2TestError
56,"[energy, danceability, loudness]",-0.085810
72,"[energy, loudness, valence]",-0.086158
66,"[energy, danceability, duration_m]",-0.086364
288,"[energy, danceability, loudness, duration_m]",-0.086431
92,"[energy, valence, explicit]",-0.086584
...,...,...
541,"[tempo, loudness, speechiness, explicit, durat...",-0.113669
779,"[tempo, loudness, speechiness, valence, explic...",-0.114582
927,"[tempo, danceability, loudness, speechiness, i...",-0.114916
777,"[tempo, loudness, speechiness, instrumentalnes...",-0.115463
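All of the scores in this table are negative. Under the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$, a negative value means the model's squared error on the held-out data is larger than that of simply predicting the held-out mean. A small sketch with made-up numbers (the helper `r2` is written here for illustration only):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration
y_true = [60.0, 70.0, 80.0]
r2_mean = r2(y_true, [70.0, 70.0, 70.0])   # predicting the mean everywhere -> 0
r2_bad = r2(y_true, [75.0, 75.0, 75.0])    # a constant that misses the mean -> negative
```

So the ranking above compares models that are all slightly worse than a mean-only baseline on held-out folds; the "best" combination is the least negative one.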


According to our analysis, the best features to use out of the 10 chosen are energy, danceability, and loudness. How each one affects popularity is shown by the sign and magnitude of its coefficient.

In [None]:
model = LinearRegression()
model.fit(X = top_songs_train[["energy", "danceability", "loudness"]], 
          y = top_songs_df['popularity'])

print("Coefficients are: ", model.coef_)

Coefficients are:  [-13.54135281   8.05181036   0.65391566]


Note:

(+) --> popularity increases when the feature increases

(-) --> popularity decreases when the feature increases


Energy: -

Danceability: +

Loudness: +



In [None]:
print("The Intercept is : ", model.intercept_)

The Intercept is :  71.4711948770525
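With the printed coefficients and intercept, a prediction is just the intercept plus the weighted sum of the three features. The sketch below reproduces that by hand; the helper `predict_popularity` is hypothetical, and the example feature values are those of the first row of the New Releases table.

```python
# Coefficients and intercept as printed above (order: energy, danceability, loudness)
coef = [-13.54135281, 8.05181036, 0.65391566]
intercept = 71.4711948770525

def predict_popularity(energy, danceability, loudness):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    features = [energy, danceability, loudness]
    return intercept + sum(c * x for c, x in zip(coef, features))

# Feature values of the first New Releases row
pred = predict_popularity(energy=0.641, danceability=0.787, loudness=-7.376)

# Raising energy by 0.1 lowers the prediction, because its coefficient is negative
pred_more_energy = predict_popularity(energy=0.741, danceability=0.787, loudness=-7.376)
```

Here `pred` comes out near 64.3, and `pred_more_energy` is about 1.35 points lower, matching the sign and size of the energy coefficient.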


In [None]:
top_songs_df['popularity'].describe()

count    1089.000000
mean       63.691460
std        23.213411
min         0.000000
25%        63.000000
50%        71.000000
75%        77.000000
max        90.000000
Name: popularity, dtype: float64

### Chosen LR Model for New Releases to predict top 5 songs


We use the best regression model from the previous analysis to predict which 5 songs from the "New Releases" playlist will be most popular going forward.

In [None]:
new_songs_df = pd.read_csv(data_dir + "spotify_NewMusic.csv")
new_songs_df

Unnamed: 0,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
0,1,36CiGk9oRdwTnBDMgKEfjl,Dámelo To’ (feat. Myke Towers),Selena Gomez,2jGa3OwXatFYQAIS7OV7k9,Revelación - EP,"['dance pop', 'pop', 'pop dance', 'post-teen p...",73,182.003,0.641,0.787,-7.376,0.315,0.00534,184134,0.437,False,3.0,pop
1,2,45kgqMRkC29qjBlzeaJcad,Street Runner,Rod Wave,1aYz6lWTYHwEz9sfA1Cvrw,Street Runner,['florida rap'],69,160.004,0.61,0.613,-8.633,0.246,0.000361,252021,0.433,True,4.0,rap
2,3,1NsoJ2lSWD61hD4hRY5Qby,"CINDERELLA, Pt. 2",CHIKA,22UE2Lc7VdTqbkGmNBtMDu,ONCE UPON A TIME,"['alabama rap', 'alternative r&b']",54,176.959,0.317,0.58,-10.769,0.137,3e-06,130882,0.338,False,2.0,r&b
3,4,5JycxhApZmzbA4xSwvqh6k,All To Me,Giveon,1otOJAtgvO5VCBL4Gykrrd,When It's All Said And Done... Take Time,"['alternative r&b', 'pop']",66,116.161,0.543,0.523,-8.39,0.156,0.0,127807,0.318,False,2.0,r&b
4,5,2eFjKl5cyPPYElDByCh6Tb,First Time,ILLENIUM,6GwqbWxgikcrhZn8M2M7sc,First Time,"['edm', 'electropop', 'melodic dubstep', 'pop'...",66,155.085,0.667,0.526,-5.451,0.0437,0.0,165779,0.43,False,3.0,pop
5,6,33o0xXMPY41CWwDTnxyM5Z,2Drunk,Nick Jonas,3FTjOu2zQLWcl1NVos4eAq,Spaceman,"['dance pop', 'pop', 'pop dance', 'pop rock', ...",64,156.023,0.798,0.648,-6.088,0.179,0.0,192319,0.607,False,3.0,pop
6,7,2pn8dNVSpYnAtlKFC8Q0DJ,On The Ground,ROSÉ,5BQcoDfcZ8aBcikYX9B7Ob,R,[],76,188.7,0.607,0.311,-6.578,0.11,0.0,168085,0.286,False,3.0,other
7,8,7FdUvDkaE24o3FPIWTvzv2,Follow You,Imagine Dragons,1nz0PWfAcTQVbFtpU6u1UY,Follow You / Cutthroat,"['modern rock', 'rock']",73,124.912,0.732,0.542,-5.956,0.0521,7.9e-05,175643,0.489,False,3.0,rock
8,9,00selpxxljfn9n5Pf4K3VR,Show U Off,Brent Faiyaz,4vmD2mzd6e6UCvuQgKT00m,Show U Off,"['dmv rap', 'rap']",63,84.997,0.405,0.583,-11.295,0.0534,0.00391,251132,0.549,True,4.0,rap
9,10,3AO2MYgrCiTorCUura1szR,Fck Boys,Blxst,2c8AZI8aUhi4zPNkcV44NE,Just for Clarity,['pop r&b'],62,98.368,0.491,0.63,-5.549,0.41,0.0,163200,0.497,True,3.0,r&b


In [None]:
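popularity_predictions = model.predict(X = new_songs_df[['energy', 'danceability', 'loudness']])

# np.argpartition(a, 5) puts the 5 smallest predictions first, so ind[:5]
# selects the five LOWEST predicted popularities (the rows shown below).
# The five highest would be np.argpartition(popularity_predictions, -5)[-5:].
ind = np.argpartition(popularity_predictions, 5)
new_songs_df.loc[ind[:5]]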

Unnamed: 0,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
6,7,2pn8dNVSpYnAtlKFC8Q0DJ,On The Ground,ROSÉ,5BQcoDfcZ8aBcikYX9B7Ob,R,[],76,188.7,0.607,0.311,-6.578,0.11,0.0,168085,0.286,False,3.0,other
41,42,3nZD0cDcEiz8rqK48SYhgn,Thanksgiving,Benny The Butcher,6slT6nHyQrMQaUj2jMl52i,Thanksgiving,"['alternative hip hop', 'boom bap', 'buffalo h...",49,118.528,0.848,0.52,-5.508,0.363,0.0,150532,0.218,True,3.0,hip hop
44,45,1toNKayLMeCcVlsLGXJl7n,Haunted,Laura Les,2iguPTaSTwtx4MiAkj6w5O,Haunted,"['hyperpop', 'transpop']",52,169.481,0.817,0.51,-8.666,0.0373,0.0344,102007,0.547,False,2.0,pop
17,18,0G5KlXqssFee9PDQYJ8PqL,Sade - Spotify Singles,D Smoke,7gpKIWEXYkDbsVhJnJItYb,Spotify Singles,[],50,134.985,0.542,0.592,-11.329,0.389,3.4e-05,146756,0.393,True,2.0,other
35,36,76djDZcj6Og5HcVAaRjQHu,Like A Lady,Lady A,6wRC37A3ZrIxGZpk5ZHTqN,Like A Lady,"['contemporary country', 'country', 'country d...",52,95.511,0.933,0.637,-3.519,0.0343,3e-06,181198,0.797,False,3.0,country
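Since `np.argpartition(a, k)[:k]` returns the indices of the k smallest values, selecting the k largest predictions needs a negative `kth` and the tail of the result. A minimal sketch with made-up predictions (not the model's actual output):

```python
import numpy as np

# Made-up predictions for illustration
preds = np.array([61.4, 66.3, 58.0, 64.3, 70.1, 59.9, 63.2, 67.8])
k = 3

lowest_k = np.argpartition(preds, k)[:k]      # indices of the k smallest values
highest_k = np.argpartition(preds, -k)[-k:]   # indices of the k largest values

# argpartition does not order within each group; sort explicitly if rank matters
highest_k_sorted = highest_k[np.argsort(preds[highest_k])[::-1]]
```

Applied to `popularity_predictions`, the `highest_k_sorted` pattern would give the songs with the highest predicted popularity, ordered from most to least popular.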


## KNN Classifier - Predicting Genre

This section creates a KNN Classifier model to predict the genre of a song after analyzing the following quantitative features: tempo, energy, danceability, loudness, speechiness, instrumentalness, duration, and valence. We tried different models that took in different subsets of the above quantitative features and different values of k in order to find the model that made the best genre prediction.

### Recreate Dataframes

First, we recreate our dataframes from the csv files we created in Data Collection and Cleaning.

In [None]:
import pandas as pd
import ast

data_dir = "https://raw.githubusercontent.com/ShainaBagri/SpotifyDataAnalysis/main/"
df2010Hits = pd.read_csv(data_dir + "spotify_2010sHits.csv", converters={'genres': ast.literal_eval})
df2010Hits.head()

Unnamed: 0,index,topYear,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
0,0,2009,1,4kLLWz7srcuLKA7Et40PQR,I Gotta Feeling,Black Eyed Peas,1dgbFU08pXJXZhGPlybdMX,THE E.N.D. (THE ENERGY NEVER DIES) [Deluxe Ver...,"[dance pop, pop, pop dance, pop rap]",81,127.96,0.766,0.743,-6.375,0.0265,0.0,289133,0.61,False,5.0,pop
1,1,2009,2,1QV6tiMFM6fSOKOGLMHYYg,Poker Face,Lady Gaga,1qwlxZTNLe1jq3b0iidlue,The Fame,"[dance pop, pop, pop dance]",77,118.999,0.806,0.851,-4.62,0.0787,2e-06,237200,0.787,False,4.0,pop
2,2,2009,3,4kgTdThcDHfuDS2kKxB7Lc,You Belong With Me,Taylor Swift,2gP2LMVcIFgVczSJqn340t,Fearless (Platinum Edition),"[dance pop, pop, pop dance]",61,129.964,0.771,0.687,-4.424,0.0384,2.5e-05,231146,0.445,False,4.0,pop
3,3,2009,4,0iGckQFyv6svOfAbAY9aWJ,Hot N Cold,Katy Perry,3OALgjCs6Lqw41853v4wEQ,One Of The Boys,"[dance pop, pop, pop dance, post-teen pop]",71,132.032,0.841,0.706,-3.956,0.0418,0.0,220226,0.861,False,4.0,pop
4,4,2009,5,3GpbwCm3YxiWDvy29Uo3vP,Right Round,Flo Rida,2vBLKFrI1rZqB7VtGxcsR5,R.O.O.T.S. (Route of Overcoming the Struggle),"[dance pop, miami hip hop, pop, pop dance, pop...",74,124.986,0.672,0.72,-6.852,0.0551,0.0,204640,0.705,False,3.0,pop


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

dfNewReleases = pd.read_csv(data_dir + "spotify_NewMusic.csv", converters={'genres': ast.literal_eval})
dfNewReleases.head()

Unnamed: 0,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
0,1,36CiGk9oRdwTnBDMgKEfjl,Dámelo To’ (feat. Myke Towers),Selena Gomez,2jGa3OwXatFYQAIS7OV7k9,Revelación - EP,"[dance pop, pop, pop dance, post-teen pop]",73,182.003,0.641,0.787,-7.376,0.315,0.00534,184134,0.437,False,3.0,pop
1,2,45kgqMRkC29qjBlzeaJcad,Street Runner,Rod Wave,1aYz6lWTYHwEz9sfA1Cvrw,Street Runner,[florida rap],69,160.004,0.61,0.613,-8.633,0.246,0.000361,252021,0.433,True,4.0,rap
2,3,1NsoJ2lSWD61hD4hRY5Qby,"CINDERELLA, Pt. 2",CHIKA,22UE2Lc7VdTqbkGmNBtMDu,ONCE UPON A TIME,"[alabama rap, alternative r&b]",54,176.959,0.317,0.58,-10.769,0.137,3e-06,130882,0.338,False,2.0,r&b
3,4,5JycxhApZmzbA4xSwvqh6k,All To Me,Giveon,1otOJAtgvO5VCBL4Gykrrd,When It's All Said And Done... Take Time,"[alternative r&b, pop]",66,116.161,0.543,0.523,-8.39,0.156,0.0,127807,0.318,False,2.0,r&b
4,5,2eFjKl5cyPPYElDByCh6Tb,First Time,ILLENIUM,6GwqbWxgikcrhZn8M2M7sc,First Time,"[edm, electropop, melodic dubstep, pop, pop da...",66,155.085,0.667,0.526,-5.451,0.0437,0.0,165779,0.43,False,3.0,pop


### Finding Best Subset of Quantitative Features

Then, we created different models, one for each possible subset of the list of quantitative features, and found which one had the highest accuracy on the New Releases dataframe. We calculated accuracy by comparing how many predicted topGenres matched the actual topGenres. 

NOTE: The models are trained on the 2010's Hits dataframe, but are used to predict data from the New Releases dataframe. This way, our training data (2010's Hits) is separate from our test data (New Releases).
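The train-on-one-set, score-on-another pattern can be sketched on a tiny made-up dataset. The arrays below are assumptions for illustration; the pipeline pieces (`StandardScaler`, `KNeighborsClassifier`, `make_pipeline`) are the same scikit-learn classes used in this section.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up two-feature data: two well-separated "genres"
X_train = np.array([[100, 0.2], [105, 0.3], [98, 0.25],
                    [160, 0.8], [165, 0.9], [158, 0.85]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
X_test = np.array([[102, 0.22], [162, 0.88]])
y_test = np.array(["a", "b"])

# Scaling is fit on the training data only, then reused for the test data
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)

pred = knn.predict(X_test)
accuracy = float((pred == y_test).mean())   # fraction of matching labels
```

Predicting the whole test set in one call and comparing arrays gives the same count as checking songs one at a time.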

In [None]:
import itertools

# Counts how many topGenre predictions on df2 match the actual labels
def get_accuracy(df1, df2, features):
  y_train = df1['topGenre']
  X_train = df1[features]
  pipeline = make_pipeline(
      StandardScaler(),
      KNeighborsClassifier(n_neighbors=7)
  )
  pipeline.fit(X_train, y_train)

  # Predict every row of df2 (the 50 ranked New Releases) at once
  # and count the predictions that match the actual topGenre
  preds = pipeline.predict(df2[features])
  matches = int((preds == df2['topGenre'].to_numpy()).sum())
  return matches

featureList = ['tempo', 'energy', 'danceability', 'loudness', 
                  'speechiness', 'instrumentalness', 'duration_m', 'valence']

# Finds all non-empty subsets of the features list, largest subsets first
allFeatures = []
for size in range(len(featureList), 0, -1):
    for subset in itertools.combinations(featureList, size):
        allFeatures.append(list(subset))

# Calculates accuracy for each subset of features list
errs = pd.Series()
for features in allFeatures:
  errs[str(features)] = get_accuracy(df2010Hits, dfNewReleases, features)

# This stores the count of data that was predicted correctly
errs




['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'instrumentalness', 'duration_m', 'valence']    14
['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'instrumentalness', 'duration_m']               15
['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'instrumentalness', 'valence']                  14
['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'duration_m', 'valence']                        14
['tempo', 'energy', 'danceability', 'loudness', 'instrumentalness', 'duration_m', 'valence']                   16
                                                                                                               ..
['loudness']                                                                                                   16
['speechiness']                                                                                                15
['instrumentalness']                                                                    

In [None]:
# This stores the percentage of data that was predicted correctly
errs = errs.astype(float)/len(dfNewReleases)
errs

['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'instrumentalness', 'duration_m', 'valence']    0.28
['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'instrumentalness', 'duration_m']               0.30
['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'instrumentalness', 'valence']                  0.28
['tempo', 'energy', 'danceability', 'loudness', 'speechiness', 'duration_m', 'valence']                        0.28
['tempo', 'energy', 'danceability', 'loudness', 'instrumentalness', 'duration_m', 'valence']                   0.32
                                                                                                               ... 
['loudness']                                                                                                   0.32
['speechiness']                                                                                                0.30
['instrumentalness']                                                    

In [None]:
# Finds highest accuracy (percentage)
errs.max()

0.38

In [None]:
# Finds subset that produces the highest accuracy
errs.idxmax()

"['tempo', 'energy', 'danceability', 'instrumentalness', 'duration_m', 'valence']"

### Finding Best K

After finding which subset of quantitative features produced the best model, we created more models with that subset of quantitative features but differing values of k, in order to find the best model taking into account both quantitative features and the parameter k. We found the best model by finding the one with the highest accuracy, which we calculated the same way as above.

In [None]:
# This is the best subset of features, as found above
X_train_found = df2010Hits[['tempo', 'energy', 'danceability', 'instrumentalness', 'duration_m', 'valence']]
y_train = df2010Hits['topGenre']

# This finds the accuracy for each value of k
for k in range(1, 21):
  pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=k)
  )
  pipeline.fit(X_train_found, y_train)

  matches = 0.0
  for i in range(1, 51):
    X_new = dfNewReleases.loc[dfNewReleases['rank']==i, ['tempo', 'energy', 'danceability', 'instrumentalness', 'duration_m', 'valence']]
    pred = pipeline.predict([X_new.iloc[0]])[0]
    actual = dfNewReleases.loc[dfNewReleases['rank']==i, 'topGenre']
    if(pred==actual.iloc[0]):
      matches += 1.0
  
  # This is the count of data that was predicted correctly
  print(k, ": ", matches)
  # This is the percentage of data that was predicted correctly
  print(k, ": ", matches/len(dfNewReleases))
  print()

1 :  13.0
1 :  0.26

2 :  12.0
2 :  0.24

3 :  15.0
3 :  0.3

4 :  15.0
4 :  0.3

5 :  17.0
5 :  0.34

6 :  19.0
6 :  0.38

7 :  19.0
7 :  0.38

8 :  19.0
8 :  0.38

9 :  19.0
9 :  0.38

10 :  18.0
10 :  0.36

11 :  17.0
11 :  0.34

12 :  17.0
12 :  0.34

13 :  17.0
13 :  0.34

14 :  17.0
14 :  0.34

15 :  17.0
15 :  0.34

16 :  17.0
16 :  0.34

17 :  17.0
17 :  0.34

18 :  16.0
18 :  0.32

19 :  16.0
19 :  0.32

20 :  17.0
20 :  0.34



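With the features ['tempo', 'energy', 'danceability', 'instrumentalness', 'duration_m', 'valence'], k = 6 through 9 tie for the highest accuracy on the New Releases dataframe (19 of 50 correct, or 0.38). We use the KNN Classifier model with 7 nearest neighbors, the value already used during feature selection, to predict the topGenre for the New Releases dataframe.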
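The per-k counts printed above can be tabulated to see which values of k share the best score. The dictionary below simply copies those printed counts; the variable names are chosen for illustration.

```python
# Correct-prediction counts per k, copied from the printed output (50 test songs)
matches_by_k = {1: 13, 2: 12, 3: 15, 4: 15, 5: 17, 6: 19, 7: 19, 8: 19,
                9: 19, 10: 18, 11: 17, 12: 17, 13: 17, 14: 17, 15: 17,
                16: 17, 17: 17, 18: 16, 19: 16, 20: 17}

best_count = max(matches_by_k.values())
# All values of k that reach the best count, in increasing order
best_ks = [k for k, m in matches_by_k.items() if m == best_count]
best_accuracy = best_count / 50
```

Listing every tied k, rather than only the first maximum, makes it explicit that the choice among them is a judgment call.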

In [None]:
# Testing the above model on the 2010's Hits dataframe which was used to train the model
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=7)
)
pipeline.fit(X_train_found, y_train)

matches = 0
for i in range(2009, 2020):
  for j in range(1, 100):
    X_new = df2010Hits.loc[df2010Hits['rank']==j].loc[df2010Hits['topYear']==i, ['tempo', 'energy', 'danceability', 'instrumentalness', 'duration_m', 'valence']]
    pred = pipeline.predict([X_new.iloc[0]])[0]
    actual = df2010Hits.loc[df2010Hits['rank']==j].loc[df2010Hits['topYear']==i, 'topGenre']
    if(pred==actual.iloc[0]):
      matches += 1

# This is the count of data that was predicted correctly
print(matches)

764


In [None]:
# This is the percentage of data that was predicted correctly
matches/len(df2010Hits)

0.7015610651974288

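This checks whether the best model found for the New Releases data retains a high accuracy on the 2010's Hits data, which it was trained on. It does: about 70% of the 2010's Hits songs (764 of 1089) are classified correctly. Because these are the same songs the model was trained on, this figure is an optimistic estimate rather than a measure of generalization.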

## KMeans Clustering - Predicting Which New Releases Will Be Hits

This section uses KMeans Clustering to find the ideal characteristics of the types of songs that are in the current Top 50. For example, if we had 3 clusters, we would find 3 types of songs whose characteristics are ideal. Then, we use a distance metric to see which of the songs from the New Releases best exemplify the ideal characteristics found through KMeans Clustering. For example, if we had 3 clusters, we would pick 3 songs from the New Releases, where each one is the song closest to the centroid of a specific cluster.

### Recreate Dataframes

First, we recreate our dataframes from the csv files we created in Data Collection and Cleaning.

In [None]:
import pandas as pd

data_dir = "https://raw.githubusercontent.com/ShainaBagri/SpotifyDataAnalysis/main/"
top50 = pd.read_csv(data_dir + "spotify_top50.csv")
top50.head()

Unnamed: 0,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
0,1,3aQem4jVGdhtg116TmJnHz,What’s Next,Drake,5LuoozUhs2pl3glZeAJl89,Scary Hours 2,"['canadian hip hop', 'canadian pop', 'hip hop'...",89,129.895,0.594,0.781,-6.959,0.0485,0.0,178153,0.0628,True,3.0,rap
1,2,65OVbaJR5O1RmwOQx0875b,Wants and Needs (feat. Lil Baby),Drake,5LuoozUhs2pl3glZeAJl89,Scary Hours 2,"['canadian hip hop', 'canadian pop', 'hip hop'...",88,136.006,0.449,0.578,-6.349,0.286,2e-06,192956,0.1,True,3.0,rap
2,3,7lPN2DXiMsVn7XUKtOW1CS,drivers license,Olivia Rodrigo,66FPnVL9G4CMKy3wvaGTcr,drivers license,"['pop', 'post-teen pop']",100,143.874,0.436,0.585,-8.761,0.0601,1.3e-05,242013,0.132,True,4.0,pop
3,4,6tDDoYIxWvMLTdKpjFkc1B,telepatía,Kali Uchis,00wSTrFxoSzA7eeS1UxHgd,Sin Miedo (del Amor y Otros Demonios) ∞,"['colombian pop', 'pop']",93,83.97,0.524,0.653,-9.016,0.0502,0.0,160191,0.553,False,3.0,pop
4,5,5Kskr9LcNYa0tpt5f0ZEJx,Calling My Phone,Lil Tjay,1QhKOq11hGEoNA42rV2IHp,Calling My Phone,"['brooklyn drill', 'melodic rap', 'nyc rap']",94,104.949,0.393,0.907,-7.636,0.0539,1e-06,205458,0.202,True,3.0,rap


In [None]:
newMusic = pd.read_csv(data_dir + "spotify_NewMusic.csv")
newMusic.head()

Unnamed: 0,rank,trackId,trackName,artistName,albumId,albumName,genres,popularity,tempo,energy,danceability,loudness,speechiness,instrumentalness,duration_ms,valence,explicit,duration_m,topGenre
0,1,36CiGk9oRdwTnBDMgKEfjl,Dámelo To’ (feat. Myke Towers),Selena Gomez,2jGa3OwXatFYQAIS7OV7k9,Revelación - EP,"['dance pop', 'pop', 'pop dance', 'post-teen p...",73,182.003,0.641,0.787,-7.376,0.315,0.00534,184134,0.437,False,3.0,pop
1,2,45kgqMRkC29qjBlzeaJcad,Street Runner,Rod Wave,1aYz6lWTYHwEz9sfA1Cvrw,Street Runner,['florida rap'],69,160.004,0.61,0.613,-8.633,0.246,0.000361,252021,0.433,True,4.0,rap
2,3,1NsoJ2lSWD61hD4hRY5Qby,"CINDERELLA, Pt. 2",CHIKA,22UE2Lc7VdTqbkGmNBtMDu,ONCE UPON A TIME,"['alabama rap', 'alternative r&b']",54,176.959,0.317,0.58,-10.769,0.137,3e-06,130882,0.338,False,2.0,r&b
3,4,5JycxhApZmzbA4xSwvqh6k,All To Me,Giveon,1otOJAtgvO5VCBL4Gykrrd,When It's All Said And Done... Take Time,"['alternative r&b', 'pop']",66,116.161,0.543,0.523,-8.39,0.156,0.0,127807,0.318,False,2.0,r&b
4,5,2eFjKl5cyPPYElDByCh6Tb,First Time,ILLENIUM,6GwqbWxgikcrhZn8M2M7sc,First Time,"['edm', 'electropop', 'melodic dubstep', 'pop'...",66,155.085,0.667,0.526,-5.451,0.0437,0.0,165779,0.43,False,3.0,pop


### KMeans Clustering Model

Then, we create the KMeans Clustering model, fit the top50 songs to it, and find the centroids.

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

In [None]:
X_train = top50[["topGenre", "explicit", "tempo", "energy", "danceability", "loudness", "speechiness", "instrumentalness", "valence"]]

ct = make_column_transformer(
    (OneHotEncoder(), ["topGenre", "explicit"]),
    (StandardScaler(), ["tempo", "energy", "danceability", "loudness", "speechiness", "instrumentalness", "valence"]),
    remainder="drop"
)

pipeline=make_pipeline(
    ct,
    KMeans(n_clusters=4)
)

pipeline.fit(X_train)

clusters = pd.Series(pipeline.steps[1][1].labels_)
clusters

0     2
1     0
2     2
3     2
4     2
5     1
6     2
7     1
8     2
9     1
10    1
11    2
12    0
13    1
14    1
15    1
16    1
17    2
18    0
19    2
20    2
21    1
22    1
23    1
24    0
25    1
26    0
27    0
28    2
29    2
30    2
31    0
32    0
33    0
34    1
35    0
36    1
37    0
38    0
39    2
40    3
41    1
42    2
43    2
44    1
45    1
46    1
47    1
48    1
49    1
dtype: int32

In [None]:
# Getting the column names that correspond to the array of values for each centroid
oneHotFeatures = list(pipeline.named_steps['columntransformer'].named_transformers_['onehotencoder'].get_feature_names())
actualFeatures = oneHotFeatures + ["tempo", "energy", "danceability", "loudness", "speechiness", "instrumentalness", "valence"]
actualFeatures

['x0_country',
 'x0_edm',
 'x0_hip hop',
 'x0_other',
 'x0_pop',
 'x0_r&b',
 'x0_rap',
 'x0_rock',
 'x1_False',
 'x1_True',
 'tempo',
 'energy',
 'danceability',
 'loudness',
 'speechiness',
 'instrumentalness',
 'valence']

In [None]:
# x0 = topGenre
# x1 = explicit

# Transforming the arrays of centroid data into a dataframe
# (built directly from the (4, n_features) centroid array, one row per cluster)
centroids = pipeline.steps[1][1].cluster_centers_
centroidsDF = pd.DataFrame(centroids, columns=actualFeatures).round(4)
centroidsDF

Unnamed: 0,x0_country,x0_edm,x0_hip hop,x0_other,x0_pop,x0_r&b,x0_rap,x0_rock,x1_False,x1_True,tempo,energy,danceability,loudness,speechiness,instrumentalness,valence
0,0.0,0.0,0.0833,0.1667,0.0833,0.0,0.6667,0.0,0.0,1.0,0.2138,-0.5004,0.3502,-0.2364,1.5219,-0.2162,-0.4527
1,0.0476,-0.0,0.0952,0.0952,0.4762,-0.0,0.2381,0.0476,0.5238,0.4762,0.2932,0.8725,-0.1257,0.8072,-0.4974,-0.1174,0.4752
2,0.0,0.0625,0.125,0.0,0.4375,0.0625,0.3125,0.0,0.25,0.75,-0.4861,-0.6401,-0.1933,-0.8432,-0.4541,-0.097,-0.3574
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,-0.9446,-2.0759,1.5291,-0.6215,-0.5518,6.6135,1.1714


### Scaling New Releases Data

Then, we scale/transform the New Releases data the same way we scaled/transformed the top50 data.
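"The same way" here means reusing the statistics learned from the Top 50 data rather than recomputing them on the New Releases. A sketch of that idea for a single numeric column, with made-up tempo values:

```python
import numpy as np

# Made-up tempo values standing in for the Top 50 (train) and New Releases (new)
train_tempo = np.array([90.0, 120.0, 150.0])
new_tempo = np.array([120.0, 180.0])

# Statistics come from the training data only
mu = train_tempo.mean()
sigma = train_tempo.std()          # population std (ddof=0), as StandardScaler uses

# New data is standardized with the training mean and std
new_scaled = (new_tempo - mu) / sigma
```

Calling `transform` on an already fitted transformer does exactly this reuse, which keeps the new songs and the centroids in the same coordinate system.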

In [None]:
import numpy as np 

featureCols = ["topGenre", "explicit", "tempo", "energy", "danceability", "loudness", "speechiness", "instrumentalness", "valence"]

# Scales New Releases data with the same column transformer used to scale Top 50 data.
# Rows are taken in rank order (1-50), giving one transformed row per song.
rows = newMusic.sort_values("rank")[featureCols]
newMusicTransformed = pd.DataFrame(ct.transform(rows), columns=actualFeatures)

newMusicTransformed.head()

Unnamed: 0,x0_country,x0_edm,x0_hip hop,x0_other,x0_pop,x0_r&b,x0_rap,x0_rock,x1_False,x1_True,tempo,energy,danceability,loudness,speechiness,instrumentalness,valence
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,2.138603,0.287365,0.637849,-0.380875,1.623115,0.061377,-0.045947
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.367777,0.043968,-0.882503,-1.142626,1.011348,-0.200319,-0.0634
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.961866,-2.256518,-1.170846,-2.437056,0.044934,-0.219112,-0.477925
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.168445,-0.482081,-1.668892,-0.995366,0.213391,-0.219293,-0.565194
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.195419,0.491503,-1.642679,0.785688,-0.782281,-0.219293,-0.076491


### Find Songs with Smallest Distance to Each Centroid

After scaling the data, we calculate the distance from each song in New Releases to each centroid, and find the songs with the smallest distance to each centroid. We should end up with one song found per centroid.
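The nearest-song lookup can be sketched with plain NumPy: compute the Manhattan (L1) distance from each centroid to each song, then take the argmin along the song axis. The arrays below are made-up two-feature examples, not the project's centroids.

```python
import numpy as np

# Made-up scaled feature vectors: 2 centroids, 4 candidate songs
centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
songs = np.array([[0.5, -0.5], [1.8, 2.1], [3.0, 0.0], [0.1, 0.2]])

# Manhattan (L1) distance from every centroid to every song -> shape (2, 4)
dist = np.abs(centroids[:, None, :] - songs[None, :, :]).sum(axis=2)

closest_song = dist.argmin(axis=1)   # index of the nearest song per centroid
closest_dist = dist.min(axis=1)      # the corresponding smallest distances
```

Note that nothing in this procedure prevents the same song from being nearest to more than one centroid, so the selected songs are only guaranteed to be distinct if the data happen to work out that way.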

In [None]:
from sklearn.metrics.pairwise import manhattan_distances

# Find distances from all New Releases songs (after being scaled) to each of the 4 centroids
distancesCent0 = manhattan_distances(centroidsDF.iloc[[0]], newMusicTransformed)
distancesCent1 = manhattan_distances(centroidsDF.iloc[[1]], newMusicTransformed)
distancesCent2 = manhattan_distances(centroidsDF.iloc[[2]], newMusicTransformed)
distancesCent3 = manhattan_distances(centroidsDF.iloc[[3]], newMusicTransformed)

In [None]:
# Find the New Releases song closest to the first centroid
min0Ind = distancesCent0[0].argmin()
print("Min: ", distancesCent0[0].min())
print("Ind: ", min0Ind)
newMusicTransformed.iloc[min0Ind]

Min:  4.429708692363194
Ind:  40


x0_country          0.000000
x0_edm              0.000000
x0_hip hop          0.000000
x0_other            1.000000
x0_pop              0.000000
x0_r&b              0.000000
x0_rap              0.000000
x0_rock             0.000000
x1_False            0.000000
x1_True             1.000000
tempo               0.667905
energy             -0.356458
danceability        0.620373
loudness           -0.699029
speechiness         1.481256
instrumentalness   -0.219293
valence             0.935823
Name: 40, dtype: float64

In [None]:
# Look up the original (unscaled) row for the song closest to the first centroid
newMusic.iloc[min0Ind]

rank                                    41
trackId             6qZIy961yxJZpFkgsq8Vm2
trackName                         Jacknife
artistName                       Belaganas
albumId             3kybax0X96QAkAWWeoGO54
albumName                       Smile More
genres                   ['phoenix indie']
popularity                              40
tempo                               140.03
energy                               0.559
danceability                         0.785
loudness                            -7.901
speechiness                          0.299
instrumentalness                         0
duration_ms                         163714
valence                              0.662
explicit                              True
duration_m                               3
topGenre                             other
Name: 40, dtype: object

In [None]:
# Find the New Releases song closest to the second centroid
min1Ind = distancesCent1[0].argmin()
print("Min: ", distancesCent1[0].min())
print("Ind: ", min1Ind)
newMusicTransformed.iloc[min1Ind]

Min:  4.423846933950945
Ind:  38


x0_country          0.000000
x0_edm              0.000000
x0_hip hop          0.000000
x0_other            0.000000
x0_pop              1.000000
x0_r&b              0.000000
x0_rap              0.000000
x0_rock             0.000000
x1_False            1.000000
x1_True             0.000000
tempo               0.596214
energy              1.292355
danceability       -0.786389
loudness            1.165048
speechiness        -0.394829
instrumentalness   -0.219293
valence             0.953276
Name: 38, dtype: float64

In [None]:
# Look up the original (unscaled) row for the song closest to the second centroid
newMusic.iloc[min1Ind]

rank                                              39
trackId                       5KYcDLH4KmNTOt3vUZcQHo
trackName           Part Time Psycho (with Two Feet)
artistName                                     SHAED
albumId                       4mKAasEvSuDzx0jEn1gup1
albumName           Part Time Psycho (with Two Feet)
genres                         ['electropop', 'pop']
popularity                                        53
tempo                                        137.984
energy                                         0.769
danceability                                   0.624
loudness                                      -4.825
speechiness                                   0.0874
instrumentalness                                   0
duration_ms                                   154493
valence                                        0.666
explicit                                       False
duration_m                                         3
topGenre                                      

In [None]:
# Find the New Releases song closest to the third centroid
min2Ind = distancesCent2[0].argmin()
print("Min: ", distancesCent2[0].min())
print("Ind: ", min2Ind)
newMusicTransformed.iloc[min2Ind]

Min:  4.823598284638777
Ind:  43


x0_country          0.000000
x0_edm              0.000000
x0_hip hop          1.000000
x0_other            0.000000
x0_pop              0.000000
x0_r&b              0.000000
x0_rap              0.000000
x0_rock             0.000000
x1_False            0.000000
x1_True             1.000000
tempo               0.144630
energy             -1.220122
danceability       -0.061164
loudness           -1.528046
speechiness        -0.848778
instrumentalness   -0.219293
valence            -0.386294
Name: 43, dtype: float64

In [None]:
# Look up the original (unscaled) row for the song closest to the third centroid
newMusic.iloc[min2Ind]

rank                                                               44
trackId                                        1DPLs8cVyL1Trlg5Kayua7
trackName                                                TIME FOR YOU
artistName                                                 FRVRFRIDAY
albumId                                        7kdjlaRQdB1u6PhdTSgXMC
albumName                                                TIME FOR YOU
genres              ['canadian contemporary r&b', 'canadian hip ho...
popularity                                                         49
tempo                                                         125.096
energy                                                          0.449
danceability                                                    0.707
loudness                                                       -9.269
speechiness                                                    0.0362
instrumentalness                                                    0
duration_ms         

In [None]:
# Find the New Releases song closest to the fourth centroid
min3Ind = distancesCent3[0].argmin()
print("Min: ", distancesCent3[0].min())
print("Ind: ", min3Ind)
newMusicTransformed.iloc[min3Ind]

Min:  9.231731003373577
Ind:  48


x0_country          0.000000
x0_edm              0.000000
x0_hip hop          0.000000
x0_other            0.000000
x0_pop              1.000000
x0_r&b              0.000000
x0_rap              0.000000
x0_rock             0.000000
x1_False            1.000000
x1_True             0.000000
tempo              -1.085806
energy             -1.950310
danceability        0.961142
loudness           -1.546226
speechiness        -0.611164
instrumentalness   -0.219100
valence             0.591112
Name: 48, dtype: float64

In [None]:
# Look up the original (unscaled) row for the song closest to the fourth centroid
newMusic.iloc[min3Ind]

rank                                                               49
trackId                                        72xzw4sk7CcgYIyQsj0pxn
trackName                                           I'll Be Home Soon
artistName                                                     Shoffy
albumId                                        4RQN3Cqkj3dHhEa5wa7Ag8
albumName                                                    Marathon
genres              ['alternative r&b', 'chill pop', 'chill r&b', ...
popularity                                                         50
tempo                                                           89.98
energy                                                          0.356
danceability                                                    0.824
loudness                                                       -9.299
speechiness                                                     0.063
instrumentalness                                             3.68e-06
duration_ms         

The 4 songs above, each the closest New Releases song to one of the centroids, are the ones we predict are most likely to become hits and make it into the Top 50.
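
To view the selected songs side by side, the per-centroid row lookups can be gathered into one table. This is only a sketch under assumptions: `toy_new_music` is a hypothetical stand-in for `newMusic` with made-up rows, and `closest_positions` plays the role of `[min0Ind, min1Ind, min2Ind, min3Ind]`.

```python
import pandas as pd

# Hypothetical stand-in for the newMusic DataFrame (all values are made up)
toy_new_music = pd.DataFrame({
    "trackName": ["Song A", "Song B", "Song C", "Song D", "Song E"],
    "artistName": ["Artist 1", "Artist 2", "Artist 3", "Artist 4", "Artist 5"],
    "popularity": [40, 53, 49, 50, 45],
})

# Hypothetical positional indices, standing in for min0Ind..min3Ind
closest_positions = [4, 1, 0, 3]

# Select the rows by position and keep only the human-readable columns
picks = toy_new_music.iloc[closest_positions][["trackName", "artistName", "popularity"]].copy()

# Record which centroid each pick belongs to
picks.insert(0, "centroid", list(range(len(closest_positions))))
print(picks)
```

Using `iloc` here matches the earlier cells, where `argmin` returns a position rather than an index label.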