# Predicting Song Popularity using Machine Learning

This Jupyter Notebook uses several machine learning algorithms to predict whether a song is popular. The dataset is a cleaned and preprocessed version of an original dataset containing the audio features of songs. We label the top 25% of songs by popularity score as "popular" and the remaining 75% as "not popular".

In [18]:
import numpy as np
import pandas as pd  # for loading and manipulating the song data

In [19]:
songData = pd.read_csv('cleaned-song-dataset.csv')
songData.head()

Unnamed: 0.1,Unnamed: 0,name,artists,popularity,release_date,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0,Keep A Song In Your Soul,['Mamie Smith'],12,1920-01-01,0.991,0.598,168333,0.224,0,0.000522,5,0.379,-12.628,0,0.0936,149.976,0.634
1,1,I Put A Spell On You,"[""Screamin' Jay Hawkins""]",7,1920-05-01,0.643,0.852,150200,0.517,0,0.0264,5,0.0809,-7.261,0,0.0534,86.889,0.95
2,2,Golfing Papa,['Mamie Smith'],4,1920-01-01,0.993,0.647,163827,0.186,0,1.8e-05,0,0.519,-12.098,1,0.174,97.6,0.689
3,3,True House Music - Xavier Santos & Carlos Gomi...,['Oscar Velazquez'],17,1920-01-01,0.000173,0.73,422087,0.798,0,0.801,2,0.128,-7.311,1,0.0425,127.997,0.0422
4,4,Xuniverxe,['Mixe'],2,1920-01-10,0.295,0.704,165224,0.707,1,0.000246,10,0.402,-6.036,0,0.0768,122.076,0.299


In [20]:
songData.describe()

Unnamed: 0.1,Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
count,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0,133484.0
mean,86738.448623,33.566892,0.445756,0.537559,232895.6,0.517069,0.064457,0.152212,5.198271,0.208644,-11.092275,0.711703,0.079103,118.397793,0.533009
std,50840.922522,18.992977,0.360302,0.173297,127336.8,0.266594,0.245566,0.301002,3.510869,0.183613,5.358354,0.452972,0.118517,30.009354,0.263969
min,0.0,1.0,0.0,0.0,14708.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0
25%,43371.25,20.0,0.0707,0.421,169665.8,0.299,0.0,0.0,2.0,0.0964,-13.922,0.0,0.0339,95.35775,0.319
50%,86632.5,33.0,0.412,0.547,212400.0,0.519,0.0,0.000179,5.0,0.134,-10.2665,1.0,0.0429,116.4635,0.543
75%,132120.25,47.0,0.805,0.663,267973.0,0.737,0.0,0.0618,8.0,0.265,-7.144,1.0,0.0668,136.567,0.754
max,174387.0,100.0,0.996,0.988,4892761.0,1.0,1.0,1.0,11.0,1.0,3.744,1.0,0.971,243.507,1.0


## Data Preparation
We load the preprocessed dataset with pandas and inspect it with the head() and describe() methods. We then turn popularity into a binary target: songs with a popularity score of 47 or higher (the 75th percentile reported by describe()) are labeled 1 ("popular"), and all other songs are labeled 0 ("not popular").
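The threshold of 47 is hard-coded in the next cell. As a minimal sketch, the same cut-off could instead be derived from the data with `quantile(0.75)`. The `demo` frame and the `is_popular` column below are hypothetical stand-ins built from random numbers, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the popularity column (hypothetical data, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"popularity": rng.integers(1, 101, size=1000)})

# 75th percentile of the popularity scores
threshold = demo["popularity"].quantile(0.75)

# 1 = "popular" (at or above the threshold), 0 = "not popular"
demo["is_popular"] = (demo["popularity"] >= threshold).astype(int)

print("threshold:", threshold)
print("share labeled popular:", demo["is_popular"].mean())
```

Writing the label to a new column, rather than overwriting `popularity`, keeps the original score available for later checks.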


In [21]:
songData.loc[songData['popularity'] < 47, 'popularity'] = 0
songData.loc[songData['popularity'] >= 47, 'popularity'] = 1
songData.loc[songData['popularity'] == 1]

Unnamed: 0.1,Unnamed: 0,name,artists,popularity,release_date,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
312,1062,Ain't Misbehavin',['Fats Waller'],1,1926-01-01,0.82100,0.515,237773,0.2220,0,0.001930,0,0.1900,-16.918,0,0.0575,98.358,0.350
524,1462,"Sing, Sing, Sing",['Benny Goodman'],1,1928-01-01,0.84700,0.626,520133,0.7440,0,0.892000,2,0.1450,-9.189,0,0.0662,113.117,0.259
663,1662,Mack the Knife,['Louis Armstrong'],1,1929-01-01,0.58600,0.673,201467,0.3770,0,0.000000,0,0.3320,-14.141,1,0.0697,88.973,0.713
689,1862,"Hungarian Rhapsody No. 2 in C-Sharp Minor, S. ...","['Franz Liszt', 'Vladimir Horowitz']",1,1930-01-01,0.98700,0.349,541600,0.3260,0,0.886000,1,0.7840,-15.347,1,0.0551,80.233,0.168
952,2462,All of Me (with Eddie Heywood & His Orchestra),"['Billie Holiday', 'Eddie Heywood']",1,1933-01-01,0.97200,0.504,181440,0.0644,0,0.000004,2,0.1740,-14.754,0,0.0408,106.994,0.403
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133475,174351,Waiting On A War,['Foo Fighters'],1,2021-01-14,0.00984,0.530,253840,0.7590,0,0.000000,7,0.3190,-7.067,1,0.0351,131.999,0.502
133476,174353,Precious' Tale,['Jazmine Sullivan'],1,2021-08-01,0.71500,0.734,43320,0.3460,0,0.000000,2,0.3940,-11.722,1,0.3550,88.849,0.930
133477,174355,Connexion,['ZAYN'],1,2021-01-15,0.49800,0.597,196493,0.3680,0,0.000000,2,0.1090,-10.151,0,0.0936,171.980,0.590
133479,174361,Little Boy,['Ashnikko'],1,2021-01-15,0.10500,0.781,172720,0.4870,1,0.000000,1,0.0802,-7.301,0,0.1670,129.941,0.327


## Model Training and Evaluation
We use the following machine learning algorithms to predict the popularity of a song:

**Logistic Regression**

**Random Forest Classifier**

**K-Nearest Neighbors Classifier**

**Decision Tree Classifier**

**Linear Support Vector Classification**

**XGBoost**

**LightGBM + Hyperparameter Tuning**

**Voting Ensemble - LGBM, XGB, MLP**

**Deep Learning - Neural Networks**

**Deeper Neural Network**

We train one model per algorithm on the training set and assess it on the validation set. The evaluation metrics are accuracy_score and roc_auc_score.
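One detail matters when reading the AUC values in this notebook: `roc_auc_score` accepts either hard class labels or continuous scores, and the two generally give different numbers. The cells from Logistic Regression through XGBoost pass the output of `predict()`, while the LightGBM, voting, and neural-network cells pass predicted probabilities. The toy arrays below are made up purely to illustrate the difference.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up ground truth and predicted probabilities (illustration only)
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.15, 0.80])

# Hard labels obtained with the usual 0.5 cut-off
y_label = (y_score >= 0.5).astype(int)

acc = accuracy_score(y_true, y_label)
auc_from_scores = roc_auc_score(y_true, y_score)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)  # only one operating point

print("accuracy:", acc)
print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:", auc_from_labels)
```

Because thresholded labels discard the ranking information, AUC values computed from labels and from probabilities should not be compared side by side.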

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier

from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split


In [23]:
features = ["acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "key", "liveness",
            "mode", "speechiness", "tempo", "valence"]

In [24]:
training = songData.sample(frac = 0.8)
X_train = training[features]
y_train = training['popularity']
X_test = songData.drop(training.index)[features]

In [25]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.2)
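Neither `sample` nor `train_test_split` is given a `random_state` above, so each run of the notebook produces a different split and slightly different scores. The sketch below shows a seeded, stratified split on synthetic data. The names `X_demo`, `y_demo`, and the seed 42 are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (about 25% positives)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series((rng.random(1000) < 0.25).astype(int), name="popularity")

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable across runs
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train positives:", y_tr.mean())
print("valid positives:", y_va.mean())
```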

**Logistic Regression**

In [26]:
LR_Model = LogisticRegression()
LR_Model.fit(X_train, y_train)
LR_Predict = LR_Model.predict(X_valid)
LR_Accuracy = accuracy_score(y_valid, LR_Predict)
print("Accuracy: " + str(LR_Accuracy))

LR_AUC = roc_auc_score(y_valid, LR_Predict)
print("AUC: " + str(LR_AUC))

Accuracy: 0.7421106845210226
AUC: 0.5
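An AUC of exactly 0.5 on hard labels, together with an accuracy close to the share of negatives in the training data (63213 of 85429, about 0.74, per the LightGBM log further down), is consistent with the model predicting "not popular" for nearly every song. One possible cause, stated here as an assumption rather than a diagnosis, is that the features are unscaled: `duration_ms` is several orders of magnitude larger than the other inputs. The sketch below shows a scaled pipeline on synthetic data with one large-scale feature; all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
small = rng.normal(size=n)                    # informative feature on a small scale
large = rng.normal(230_000, 120_000, size=n)  # uninformative, duration_ms-like scale
X_demo = np.column_stack([small, large])
y_demo = (small + rng.normal(scale=0.5, size=n) > 0.7).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Standardize every feature before the linear model sees it
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# Score with probabilities, not hard labels
proba = scaled_lr.predict_proba(X_va)[:, 1]
auc_scaled_lr = roc_auc_score(y_va, proba)
print("AUC (scaled pipeline, synthetic data):", auc_scaled_lr)
```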


**Random Forest Classifier**

In [27]:
RFC_Model = RandomForestClassifier()
RFC_Model.fit(X_train, y_train)
RFC_Predict = RFC_Model.predict(X_valid)
RFC_Accuracy = accuracy_score(y_valid, RFC_Predict)
print("Accuracy: " + str(RFC_Accuracy))

RFC_AUC = roc_auc_score(y_valid, RFC_Predict)
print("AUC: " + str(RFC_AUC))

Accuracy: 0.7875737428598183
AUC: 0.6473758273025298


**K-Nearest Neighbors Classifier**

In [28]:
KNN_Model = KNeighborsClassifier()
KNN_Model.fit(X_train, y_train)
KNN_Predict = KNN_Model.predict(X_valid)
KNN_Accuracy = accuracy_score(y_valid, KNN_Predict)
print("Accuracy: " + str(KNN_Accuracy))

KNN_AUC = roc_auc_score(y_valid, KNN_Predict)
print("AUC: " + str(KNN_AUC))

Accuracy: 0.6958984923681992
AUC: 0.5307018411991505


**Decision Tree Classifier**

In [29]:
DT_Model = DecisionTreeClassifier()
DT_Model.fit(X_train, y_train)
DT_Predict = DT_Model.predict(X_valid)
DT_Accuracy = accuracy_score(y_valid, DT_Predict)
print("Accuracy: " + str(DT_Accuracy))

DT_AUC = roc_auc_score(y_valid, DT_Predict)
print("AUC: " + str(DT_AUC))

Accuracy: 0.6940724786965071
AUC: 0.6095522887271511


**Linear Support Vector Classification**

In [30]:
training_LSVC = training
X_train_LSVC = X_train
y_train_LSVC = y_train
X_test_LSVC = songData.drop(training_LSVC.index)[features]
X_train_LSVC, X_valid_LSVC, y_train_LSVC, y_valid_LSVC = train_test_split(
    X_train_LSVC, y_train_LSVC, test_size = 0.2, random_state = 420)


In [31]:
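# NOTE: despite the LSVC_ naming, this cell fits a DecisionTreeClassifier, so the
# scores printed below are decision-tree results, not LinearSVC results.
LSVC_Model = DecisionTreeClassifier()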
LSVC_Model.fit(X_train_LSVC, y_train_LSVC)
LSVC_Predict = LSVC_Model.predict(X_valid_LSVC)
LSVC_Accuracy = accuracy_score(y_valid_LSVC, LSVC_Predict)
print("Accuracy: " + str(LSVC_Accuracy))

LSVC_AUC = roc_auc_score(y_valid_LSVC, LSVC_Predict)
print("AUC: " + str(LSVC_AUC))

Accuracy: 0.6882242771860002
AUC: 0.6015530956570263
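The cell above instantiates `DecisionTreeClassifier`, so the reported numbers do not come from a linear SVM. A minimal sketch of fitting an actual `LinearSVC` follows, on synthetic data with hypothetical names. `LinearSVC` has no `predict_proba`, so the sketch ranks samples with `decision_function` for the AUC. The scaling step and `dual=False` are choices made for this sketch, not settings taken from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: 11 features, target driven linearly by the first two
rng = np.random.default_rng(420)
X_demo = rng.normal(size=(1000, 11))
signal = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=1000)
y_demo = (signal > 0.9).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=420
)

# Linear SVMs are sensitive to feature scale, hence the StandardScaler
svc = make_pipeline(StandardScaler(), LinearSVC(dual=False, max_iter=5000))
svc.fit(X_tr, y_tr)

svc_pred = svc.predict(X_va)
svc_scores = svc.decision_function(X_va)  # signed distance to the hyperplane

svc_accuracy = accuracy_score(y_va, svc_pred)
svc_auc = roc_auc_score(y_va, svc_scores)
print("Accuracy:", svc_accuracy)
print("AUC:", svc_auc)
```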


**XGBoost**

In [32]:
XGB_Model = XGBClassifier(objective = "binary:logistic", n_estimators = 10)
XGB_Model.fit(X_train, y_train)
XGB_Predict = XGB_Model.predict(X_valid)
XGB_Accuracy = accuracy_score(y_valid, XGB_Predict)
print("Accuracy: " + str(XGB_Accuracy))

XGB_AUC = roc_auc_score(y_valid, XGB_Predict)
print("AUC: " + str(XGB_AUC))

Accuracy: 0.7798483004026594
AUC: 0.624579080843694


**LightGBM**

In [33]:
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.05)
lgbm.fit(X_train, y_train)

y_pred_lgbm = lgbm.predict(X_valid)
y_proba_lgbm = lgbm.predict_proba(X_valid)[:, 1]

accuracy_lgbm = accuracy_score(y_valid, y_pred_lgbm)
auc_lgbm = roc_auc_score(y_valid, y_proba_lgbm)

print("LightGBM Accuracy:", accuracy_lgbm)
print("LightGBM AUC:", auc_lgbm)

[LightGBM] [Info] Number of positive: 22216, number of negative: 63213
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003721 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2306
[LightGBM] [Info] Number of data points in the train set: 85429, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.260052 -> initscore=-1.045697
[LightGBM] [Info] Start training from score -1.045697
LightGBM Accuracy: 0.7856072665979961
LightGBM AUC: 0.7829418007417945


**Hyperparameter Tuning**

In [34]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'num_leaves': [31, 61, 91],
    'max_depth': [10, 20, 30],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [50, 100, 200]
}

lgbm = LGBMClassifier()

grid_search = GridSearchCV(estimator=lgbm, param_grid=param_grid, cv=3, scoring='roc_auc', verbose=1)

grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
print("Best AUC found: ", grid_search.best_score_)

y_pred_lgbm = grid_search.best_estimator_.predict(X_valid)
y_proba_lgbm = grid_search.best_estimator_.predict_proba(X_valid)[:, 1]

accuracy_lgbm = accuracy_score(y_valid, y_pred_lgbm)
auc_lgbm = roc_auc_score(y_valid, y_proba_lgbm)

print("Optimized LightGBM Accuracy:", accuracy_lgbm)
print("Optimized LightGBM AUC:", auc_lgbm)

Fitting 3 folds for each of 81 candidates, totalling 243 fits
[LightGBM] [Info] Number of positive: 14810, number of negative: 42142
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002064 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2306
[LightGBM] [Info] Number of data points in the train set: 56952, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.260044 -> initscore=-1.045742
[LightGBM] [Info] Start training from score -1.045742
[LightGBM] [Info] Number of positive: 14811, number of negative: 42142
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001948 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2306
[LightGBM] [Info] Number of data points in the tr
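The "81 candidates, totalling 243 fits" line in the log follows directly from the grid: four hyperparameters with three values each, evaluated on three folds. The short check below reproduces that arithmetic with scikit-learn's `ParameterGrid`. The variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid = {
    "num_leaves": [31, 61, 91],
    "max_depth": [10, 20, 30],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [50, 100, 200],
}

cv_folds = 3
n_candidates = len(ParameterGrid(param_grid))  # every combination of the four lists
n_fits = n_candidates * cv_folds               # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` additionally fits the best configuration once on the full training data; that extra fit is not part of the count above.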

**Voting Ensemble - LGBM, XGB, MLP**

In [35]:
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

estimators = [
    ('lgbm', LGBMClassifier(n_estimators=200, learning_rate=0.05)),
    ('xgb', make_pipeline(StandardScaler(), XGBClassifier(use_label_encoder=False, eval_metric='logloss'))),
    ('mlp', make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)))
]

voting = VotingClassifier(estimators=estimators, voting='soft')

voting.fit(X_train, y_train)

y_pred_voting = voting.predict(X_valid)
y_proba_voting = voting.predict_proba(X_valid)[:, 1]

accuracy_voting = accuracy_score(y_valid, y_pred_voting)
auc_voting = roc_auc_score(y_valid, y_proba_voting)

print("Voting Ensemble Accuracy:", accuracy_voting)
print("Voting Ensemble AUC:", auc_voting)

[LightGBM] [Info] Number of positive: 22216, number of negative: 63213
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.011383 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2306
[LightGBM] [Info] Number of data points in the train set: 85429, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.260052 -> initscore=-1.045697
[LightGBM] [Info] Start training from score -1.045697
Voting Ensemble Accuracy: 0.7894465773948872
Voting Ensemble AUC: 0.7890296992730962
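With `voting='soft'` and no weights, the ensemble's predicted probability is the plain average of its members' probabilities, and the predicted class is the one with the highest average. The sketch below checks this on synthetic data. It swaps in three lightweight scikit-learn models as stand-ins for LGBM, XGB, and MLP so that it runs without extra dependencies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

# Stand-in members (assumption: any classifiers with predict_proba behave the same way)
voting_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
voting_demo.fit(X_demo, y_demo)

# Average the fitted members' probabilities by hand
member_probas = [est.predict_proba(X_demo) for est in voting_demo.estimators_]
manual_average = np.mean(member_probas, axis=0)

ensemble_proba = voting_demo.predict_proba(X_demo)
print("matches manual average:", np.allclose(manual_average, ensemble_proba))
```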


**Deep Learning - Neural Networks**

In [37]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, accuracy_score
from tensorflow.keras.utils import to_categorical

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

num_classes = y_train.nunique()
y_train_cat = to_categorical(y_train, num_classes)
y_valid_cat = to_categorical(y_valid, num_classes)

# Neural network architecture
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(num_classes, activation='softmax')  # Use 'sigmoid' if it's binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # Use 'binary_crossentropy' for binary classification
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, y_train_cat, epochs=50, batch_size=32, validation_data=(X_valid_scaled, y_valid_cat), verbose=1)

# Evaluate the model
y_pred_prob = model.predict(X_valid_scaled)
y_pred = y_pred_prob.argmax(axis=1)  # Use (y_pred_prob > 0.5).astype(int) for binary classification

# Calculate accuracy and AUC
accuracy_dl1 = accuracy_score(y_valid, y_pred)
auc_dl1 = roc_auc_score(y_valid_cat, y_pred_prob)  # Ensure y_valid_cat is used for multiclass AUC

print("Deep Learning Model Accuracy:", accuracy_dl1)
print("Deep Learning Model AUC:", auc_dl1)

Epoch 1/50
...
Epoch 50/50
Deep Learning Model Accuracy: 0.7784904953647346
Deep Learning Model AUC: 0.7766162324259065
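The inline comments in the cell above mention a sigmoid output as the binary alternative to a two-unit softmax. For two classes the two formulations are mathematically equivalent: the softmax probability of class 1 equals the sigmoid of the difference between the two logits. The NumPy check below illustrates this with made-up logits and does not involve the trained model.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits for three samples: column 0 = "not popular", column 1 = "popular"
logits = np.array([[0.2, -1.3],
                   [1.5, 2.0],
                   [-0.7, 0.4]])

p_softmax = softmax(logits)[:, 1]                 # class-1 probability from softmax
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])  # sigmoid of the logit difference

print(np.allclose(p_softmax, p_sigmoid))
```

In practice a single sigmoid unit with `binary_crossentropy` uses fewer parameters in the output layer, while the two-unit softmax used above generalizes directly to more than two classes.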


**Deeper Neural Network**

In [38]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization, Activation
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, accuracy_score
from tensorflow.keras.utils import to_categorical

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

# Convert labels to categorical
num_classes = y_train.nunique()
y_train_cat = to_categorical(y_train, num_classes)
y_valid_cat = to_categorical(y_valid, num_classes)

# More complex neural network architecture
model = Sequential([
    Dense(256, input_shape=(X_train_scaled.shape[1],)),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.3),
    Dense(128),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.3),
    Dense(64),
    BatchNormalization(),
    Activation('relu'),
    Dropout(0.3),
    Dense(num_classes, activation='softmax')  # Use 'sigmoid' if it's binary classification and change last layer to Dense(1)
])

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # Change to 'binary_crossentropy' for binary classification
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, y_train_cat, epochs=100, batch_size=32, validation_data=(X_valid_scaled, y_valid_cat), verbose=1)

# Evaluate the model
y_pred_prob = model.predict(X_valid_scaled)
y_pred = y_pred_prob.argmax(axis=1)  # Use (y_pred_prob > 0.5).astype(int) for binary classification

# Calculate accuracy and AUC
accuracy_dl2 = accuracy_score(y_valid, y_pred)
auc_dl2 = roc_auc_score(y_valid_cat, y_pred_prob)  # Ensure y_valid_cat is used for multiclass AUC

print("Deep Learning Model Accuracy:", accuracy_dl2)
print("Deep Learning Model AUC:", auc_dl2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

**Model Performance Summary**

In [42]:
import pandas as pd
model_performance_accuracy = pd.DataFrame({'Model': ['LogisticRegression',
                                                      'RandomForestClassifier',
                                                      'KNeighborsClassifier',
                                                      'DecisionTreeClassifier',
                                                      'LinearSVC',
                                                      'XGBClassifier',
                                                      'LGBMClassifier',
                                                      'Voting_Ensemble',
                                                      'NeuralNetwork1',
                                                      'NeuralNetwork2'
                                                     ],
                                            'Accuracy': [LR_Accuracy,
                                                         RFC_Accuracy,
                                                         KNN_Accuracy,
                                                         DT_Accuracy,
                                                         LSVC_Accuracy,
                                                         XGB_Accuracy,
                                                         accuracy_lgbm,
                                                         accuracy_voting,
                                                         accuracy_dl1,
                                                         accuracy_dl2
                                                         ]})

model_performance_AUC = pd.DataFrame({'Model': ['LogisticRegression',
                                                 'RandomForestClassifier',
                                                 'KNeighborsClassifier',
                                                 'DecisionTreeClassifier',
                                                 'LinearSVC',
                                                 'XGBClassifier',
                                                 'LGBMClassifier',
                                                 'Voting_Ensemble',
                                                 'NeuralNetwork1',
                                                 'NeuralNetwork2'
                                                ],
                                      'AUC': [LR_AUC,
                                              RFC_AUC,
                                              KNN_AUC,
                                              DT_AUC,
                                              LSVC_AUC,
                                              XGB_AUC,
                                              auc_lgbm,
                                              auc_voting,
                                              auc_dl1,
                                              auc_dl2
                                             ]})


In [43]:
model_performance_accuracy.sort_values(by = "Accuracy", ascending = False)

| | Model | Accuracy |
|---|---|---|
| 7 | Voting_Ensemble | 0.789447 |
| 1 | RandomForestClassifier | 0.787574 |
| 6 | LGBMClassifier | 0.787574 |
| 9 | NeuralNetwork2 | 0.782985 |
| 5 | XGBClassifier | 0.779848 |
| 8 | NeuralNetwork1 | 0.77849 |
| 0 | LogisticRegression | 0.742111 |
| 2 | KNeighborsClassifier | 0.695898 |
| 3 | DecisionTreeClassifier | 0.694072 |
| 4 | LinearSVC | 0.688224 |


In [44]:
model_performance_AUC.sort_values(by = "AUC", ascending = False)

| | Model | AUC |
|---|---|---|
| 7 | Voting_Ensemble | 0.78903 |
| 6 | LGBMClassifier | 0.787249 |
| 9 | NeuralNetwork2 | 0.782739 |
| 8 | NeuralNetwork1 | 0.776616 |
| 1 | RandomForestClassifier | 0.647376 |
| 5 | XGBClassifier | 0.624579 |
| 3 | DecisionTreeClassifier | 0.609552 |
| 4 | LinearSVC | 0.601553 |
| 2 | KNeighborsClassifier | 0.530702 |
| 0 | LogisticRegression | 0.5 |


## Results
We provide a summary table with the accuracy and AUC values for every model.
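As a convenience, the two tables can be merged into a single frame so that accuracy and AUC sit side by side. The sketch below copies the scores reported above. The `AUC input` column is a hypothetical annotation added here to record whether each AUC was computed from hard labels or from predicted probabilities in the corresponding cell.

```python
import pandas as pd

models = ["LogisticRegression", "RandomForestClassifier", "KNeighborsClassifier",
          "DecisionTreeClassifier", "LinearSVC", "XGBClassifier", "LGBMClassifier",
          "Voting_Ensemble", "NeuralNetwork1", "NeuralNetwork2"]

# Scores copied from the two result tables in this notebook
accuracy = pd.DataFrame({
    "Model": models,
    "Accuracy": [0.742111, 0.787574, 0.695898, 0.694072, 0.688224,
                 0.779848, 0.787574, 0.789447, 0.778490, 0.782985],
})
auc = pd.DataFrame({
    "Model": models,
    "AUC": [0.5, 0.647376, 0.530702, 0.609552, 0.601553,
            0.624579, 0.787249, 0.789030, 0.776616, 0.782739],
})

summary = (accuracy.merge(auc, on="Model")
           .sort_values("AUC", ascending=False)
           .reset_index(drop=True))

# Hypothetical annotation: models whose cells scored AUC with predict_proba output
proba_based = {"LGBMClassifier", "Voting_Ensemble", "NeuralNetwork1", "NeuralNetwork2"}
summary["AUC input"] = summary["Model"].map(
    lambda m: "probabilities" if m in proba_based else "hard labels"
)

print(summary)
```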

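Among the single models whose AUC was computed from hard class predictions (Logistic Regression, Random Forest, K-Nearest Neighbors, Decision Tree, Linear SVC, and XGBoost), the Random Forest Classifier and XGBoost perform best on both metrics, with RandomForestClassifier reaching an accuracy of 0.787574 and an AUC of 0.647376. These AUC values are not directly comparable with those of LightGBM, the Voting Ensemble, and the neural networks, which were computed from predicted probabilities.

The Voting Ensemble scores highest overall, with an accuracy of 0.789447 and an AUC of 0.789030. This is consistent with it being a collective model that averages the predicted probabilities of LGBM, XGB, and MLP.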