# Predicting Song Popularity using Machine Learning

This Jupyter Notebook uses several machine learning algorithms to predict the popularity of a song. The dataset used in this notebook is a cleaned and preprocessed version of the original dataset containing the audio features of songs. We define the top 25% popular songs as "popular", and the bottom 75% popular songs as "not popular".

In [1]:
import numpy as np 
import pandas as pd # for working with songDatas

In [2]:
songData = pd.read_csv('datasets/cleaned-song-dataset.csv')
songData.head()

Unnamed: 0.1,Unnamed: 0,name,artists,popularity,release_date,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0,Keep A Song In Your Soul,['Mamie Smith'],12,1920-01-01,0.991,0.598,168333,0.224,0,0.000522,5,0.379,-12.628,0,0.0936,149.976,0.634
1,1,I Put A Spell On You,"[""Screamin' Jay Hawkins""]",7,1920-05-01,0.643,0.852,150200,0.517,0,0.0264,5,0.0809,-7.261,0,0.0534,86.889,0.95
2,2,Golfing Papa,['Mamie Smith'],4,1920-01-01,0.993,0.647,163827,0.186,0,1.8e-05,0,0.519,-12.098,1,0.174,97.6,0.689
3,3,True House Music - Xavier Santos & Carlos Gomi...,['Oscar Velazquez'],17,1920-01-01,0.000173,0.73,422087,0.798,0,0.801,2,0.128,-7.311,1,0.0425,127.997,0.0422
4,4,Xuniverxe,['Mixe'],2,1920-01-10,0.295,0.704,165224,0.707,1,0.000246,10,0.402,-6.036,0,0.0768,122.076,0.299


In [3]:
songData.describe()

Unnamed: 0.1,Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
count,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0
mean,86738.448623,33.566892,0.445756,0.537559,232895.6,0.517069,0.064457,0.152212,5.198271,0.208644,-11.092275,0.711703,0.079103,118.397793,0.533009
std,50840.922522,18.992977,0.360302,0.173297,127336.8,0.266594,0.245566,0.301002,3.510869,0.183613,5.358354,0.452972,0.118517,30.009354,0.263969
min,0.0,1.0,0.0,0.0,14708.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0
25%,43371.25,20.0,0.0707,0.421,169665.8,0.299,0.0,0.0,2.0,0.0964,-13.922,0.0,0.0339,95.35775,0.319
50%,86632.5,33.0,0.412,0.547,212400.0,0.519,0.0,0.000179,5.0,0.134,-10.2665,1.0,0.0429,116.4635,0.543
75%,132120.25,47.0,0.805,0.663,267973.0,0.737,0.0,0.0618,8.0,0.265,-7.144,1.0,0.0668,136.567,0.754
max,174387.0,100.0,0.996,0.988,4892761.0,1.0,1.0,1.0,11.0,1.0,3.744,1.0,0.971,243.507,1.0


## Data Preparation
First, we load the preprocessed dataset using pandas and explore it using the head() and describe() methods. We then preprocess the dataset by converting the popularity score to a binary classification problem using a threshold of 47 for popularity (75th percentile).


In [4]:
songData.loc[songData['popularity'] < 47, 'popularity'] = 0
songData.loc[songData['popularity'] >= 47, 'popularity'] = 1
songData.loc[songData['popularity'] == 1]


Unnamed: 0.1,Unnamed: 0,name,artists,popularity,release_date,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
312,1062,Ain't Misbehavin',['Fats Waller'],1,1926-01-01,0.82100,0.515,237773,0.2220,0,0.001930,0,0.1900,-16.918,0,0.0575,98.358,0.350
524,1462,"Sing, Sing, Sing",['Benny Goodman'],1,1928-01-01,0.84700,0.626,520133,0.7440,0,0.892000,2,0.1450,-9.189,0,0.0662,113.117,0.259
663,1662,Mack the Knife,['Louis Armstrong'],1,1929-01-01,0.58600,0.673,201467,0.3770,0,0.000000,0,0.3320,-14.141,1,0.0697,88.973,0.713
689,1862,"Hungarian Rhapsody No. 2 in C-Sharp Minor, S. ...","['Franz Liszt', 'Vladimir Horowitz']",1,1930-01-01,0.98700,0.349,541600,0.3260,0,0.886000,1,0.7840,-15.347,1,0.0551,80.233,0.168
952,2462,All of Me (with Eddie Heywood & His Orchestra),"['Billie Holiday', 'Eddie Heywood']",1,1933-01-01,0.97200,0.504,181440,0.0644,0,0.000004,2,0.1740,-14.754,0,0.0408,106.994,0.403
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133475,174351,Waiting On A War,['Foo Fighters'],1,2021-01-14,0.00984,0.530,253840,0.7590,0,0.000000,7,0.3190,-7.067,1,0.0351,131.999,0.502
133476,174353,Precious' Tale,['Jazmine Sullivan'],1,2021-08-01,0.71500,0.734,43320,0.3460,0,0.000000,2,0.3940,-11.722,1,0.3550,88.849,0.930
133477,174355,Connexion,['ZAYN'],1,2021-01-15,0.49800,0.597,196493,0.3680,0,0.000000,2,0.1090,-10.151,0,0.0936,171.980,0.590
133479,174361,Little Boy,['Ashnikko'],1,2021-01-15,0.10500,0.781,172720,0.4870,1,0.000000,1,0.0802,-7.301,0,0.1670,129.941,0.327


## Model Training and Evaluation
We use the following machine learning algorithms to predict the popularity of a song:

**Logistic Regression**

**Random Forest Classifier**

**K-Nearest Neighbors Classifier**

**Decision Tree Classifier**

**Linear Support Vector Classification**

**XGBoost**

For each algorithm, we train a model using the training set, and evaluate its performance using the validation set. We use the accuracy_score and roc_auc_score metrics for evaluation.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier

from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split


In [6]:
features = ["acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "key", "liveness", 
            "mode", "speechiness", "tempo", "valence"]

In [7]:
training = songData.sample(frac = 0.8)
X_train = training[features]
y_train = training['popularity']
X_test = songData.drop(training.index)[features]

In [8]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.2)

**Logistic Regression**

In [9]:
LR_Model = LogisticRegression()
LR_Model.fit(X_train, y_train)
LR_Predict = LR_Model.predict(X_valid)
LR_Accuracy = accuracy_score(y_valid, LR_Predict)
print("Accuracy: " + str(LR_Accuracy))

LR_AUC = roc_auc_score(y_valid, LR_Predict) 
print("AUC: " + str(LR_AUC))

Accuracy: 0.7403783125760839
AUC: 0.5


**Random Forest Classifier**

In [10]:
RFC_Model = RandomForestClassifier()
RFC_Model.fit(X_train, y_train)
RFC_Predict = RFC_Model.predict(X_valid)
RFC_Accuracy = accuracy_score(y_valid, RFC_Predict)
print("Accuracy: " + str(RFC_Accuracy))

RFC_AUC = roc_auc_score(y_valid, RFC_Predict) 
print("AUC: " + str(RFC_AUC))

Accuracy: 0.7809720011237007
AUC: 0.6420585224618864


**K-Nearest Neighbors Classifier**

In [11]:
KNN_Model = KNeighborsClassifier()
KNN_Model.fit(X_train, y_train)
KNN_Predict = KNN_Model.predict(X_valid)
KNN_Accuracy = accuracy_score(y_valid, KNN_Predict)
print("Accuracy: " + str(KNN_Accuracy))

KNN_AUC = roc_auc_score(y_valid, KNN_Predict) 
print("AUC: " + str(KNN_AUC))

  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


Accuracy: 0.6957112089146924
AUC: 0.5279183037412518


**Decision Tree Classifier**

In [12]:
DT_Model = DecisionTreeClassifier()
DT_Model.fit(X_train, y_train)
DT_Predict = DT_Model.predict(X_valid)
DT_Accuracy = accuracy_score(y_valid, DT_Predict)
print("Accuracy: " + str(DT_Accuracy))

DT_AUC = roc_auc_score(y_valid, DT_Predict) 
print("AUC: " + str(DT_AUC))

Accuracy: 0.6965539844554733
AUC: 0.6089961421863749


**Linear Support Vector Classification**

In [13]:
training_LSVC = training
X_train_LSVC = X_train
y_train_LSVC = y_train
X_test_LSVC = songData.drop(training_LSVC.index)[features]
X_train_LSVC, X_valid_LSVC, y_train_LSVC, y_valid_LSVC = train_test_split(
    X_train_LSVC, y_train_LSVC, test_size = 0.2, random_state = 420)


In [14]:
LSVC_Model = DecisionTreeClassifier()
LSVC_Model.fit(X_train_LSVC, y_train_LSVC)
LSVC_Predict = LSVC_Model.predict(X_valid_LSVC)
LSVC_Accuracy = accuracy_score(y_valid_LSVC, LSVC_Predict)
print("Accuracy: " + str(LSVC_Accuracy))

LSVC_AUC = roc_auc_score(y_valid_LSVC, LSVC_Predict) 
print("AUC: " + str(LSVC_AUC))

Accuracy: 0.6920870888446682
AUC: 0.6041936529889785


**XGBOOST**

In [27]:
XGB_Model = XGBClassifier(objective = "binary:logistic", n_estimators = 10)
XGB_Model.fit(X_train, y_train)
XGB_Predict = XGB_Model.predict(X_valid)
XGB_Accuracy = accuracy_score(y_valid, XGB_Predict)
print("Accuracy: " + str(XGB_Accuracy))

XGB_AUC = roc_auc_score(y_valid, XGB_Predict) 
print("AUC: " + str(XGB_AUC))

Accuracy: 0.7759153478790148
AUC: 0.6210195273124801


**Model Performance Summary**

In [28]:
model_performance_accuracy = pd.DataFrame({'Model': ['LogisticRegression', 
                                                      'RandomForestClassifier', 
                                                      'KNeighborsClassifier',
                                                      'DecisionTreeClassifier',
                                                      'LinearSVC',
                                                      'XGBClassifier'],
                                            'Accuracy': [LR_Accuracy,
                                                         RFC_Accuracy,
                                                         KNN_Accuracy,
                                                         DT_Accuracy,
                                                         LSVC_Accuracy,
                                                         XGB_Accuracy]})

model_performance_AUC = pd.DataFrame({'Model': ['LogisticRegression', 
                                                      'RandomForestClassifier', 
                                                      'KNeighborsClassifier',
                                                      'DecisionTreeClassifier',
                                                      'LinearSVC',
                                                      'XGBClassifier'],
                                            'AUC': [LR_AUC,
                                                         RFC_AUC,
                                                         KNN_AUC,
                                                         DT_AUC,
                                                         LSVC_AUC,
                                                         XGB_AUC]})

In [29]:
model_performance_accuracy.sort_values(by = "Accuracy", ascending = False)

Unnamed: 0,Model,Accuracy
1,RandomForestClassifier,0.780972
5,XGBClassifier,0.775915
0,LogisticRegression,0.740378
3,DecisionTreeClassifier,0.696554
2,KNeighborsClassifier,0.695711
4,LinearSVC,0.692087


In [30]:
model_performance_AUC.sort_values(by = "AUC", ascending = False)

Unnamed: 0,Model,AUC
1,RandomForestClassifier,0.642059
5,XGBClassifier,0.62102
3,DecisionTreeClassifier,0.608996
4,LinearSVC,0.604194
2,KNeighborsClassifier,0.527918
0,LogisticRegression,0.46103


## Results
We present the accuracy and AUC values for each model in a summary table.
The Random Forest Classifier and XGBoost algorithms perform the best in terms of accuracy and AUC, with the RandomForestClassifier algorithm achieving the highest accuracy of 0.783734 and AUC of 0.644808