<h2>Model Selection - Most Streamed Spotify Songs 2024</h2>

<h3>Importing libraries</h3>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

<h3>Loading the dataset</h3>

In [2]:
df = pd.read_csv("./data/train_spotify_data.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   All Time Rank                      4600 non-null   int64  
 1   Spotify Streams                    4600 non-null   float64
 2   Spotify Playlist Count             4600 non-null   float64
 3   Spotify Playlist Reach             4600 non-null   float64
 4   YouTube Views                      4600 non-null   float64
 5   YouTube Likes                      4600 non-null   float64
 6   TikTok Posts                       4600 non-null   float64
 7   TikTok Likes                       4600 non-null   float64
 8   TikTok Views                       4600 non-null   float64
 9   YouTube Playlist Reach             4600 non-null   float64
 10  AirPlay Spins                      4600 non-null   float64
 11  Deezer Playlist Reach              4600 non-null   float

In [4]:
# Features: every column except the target. Target: the all-time rank.
X = df.drop(columns="All Time Rank").values
y = df["All Time Rank"].values

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32)

<h3>Selecting the model</h3>

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

# Candidate regressors, all with default hyperparameters.
models = [
    RandomForestRegressor(),
    DecisionTreeRegressor(),
    LinearRegression(),
    XGBRegressor(),
]

In [7]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

for model in models:
    # Fit on the training split, then score on the held-out test split.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print(type(model), "is done with training...")
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("MSE:", mean_squared_error(y_test, y_pred))
    print("R2:", r2_score(y_test, y_pred), "\n")

<class 'sklearn.ensemble._forest.RandomForestRegressor'> is done with training...
MAE: 7.251445652173915
MSE: 943.79911902174
R2: 0.999452821727787 

<class 'sklearn.tree._classes.DecisionTreeRegressor'> is done with training...
MAE: 8.056521739130435
MSE: 868.5413043478261
R2: 0.9994964533016821 

<class 'sklearn.linear_model._base.LinearRegression'> is done with training...
MAE: 831.9542578469311
MSE: 983083.7600225242
R2: 0.4300460104186631 

<class 'xgboost.sklearn.XGBRegressor'> is done with training...
MAE: 9.740742313343546
MSE: 694.174952879996
R2: 0.9995975494384766 
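
For a side-by-side view, the scores printed above can be collected into a table. This is a minimal sketch: the numbers are copied (rounded) from the output above, and the names `results`, `best_by_mae`, and `best_by_mse` are introduced here for illustration only.

```python
import pandas as pd

# Scores copied (rounded) from the printed output above. `results` is a
# name introduced for this sketch, not part of the original notebook.
results = pd.DataFrame(
    {
        "MAE": [7.2514, 8.0565, 831.9543, 9.7407],
        "MSE": [943.7991, 868.5413, 983083.7600, 694.1750],
        "R2": [0.999453, 0.999496, 0.430046, 0.999598],
    },
    index=[
        "RandomForestRegressor",
        "DecisionTreeRegressor",
        "LinearRegression",
        "XGBRegressor",
    ],
)

# Sort by MAE so the model with the smallest average error comes first.
print(results.sort_values("MAE"))

# The "best" model depends on which metric is used.
best_by_mae = results["MAE"].idxmin()
best_by_mse = results["MSE"].idxmin()
print("Lowest MAE:", best_by_mae)
print("Lowest MSE:", best_by_mse)
```

Random Forest has the lowest MAE while XGB Regressor has the lowest MSE, so the ordering of the tree-based models depends on how heavily large individual errors should be penalized.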



<h3>Conclusion</h3>

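We tested four regression models to find which is most suitable for predicting a song's all-time rank from its reach across several platforms. On the held-out test set, the three tree-based models, ***Random Forest***, ***Decision Tree***, and ***XGB Regressor***, all reached an R2 above 0.999, while Linear Regression reached only about 0.43. Among the three, Random Forest had the lowest MAE (about 7.25) and XGB Regressor the lowest MSE (about 694.2).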
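
The scores of the tree-based models come from a single train/test split and are close to each other, so their ordering might change with a different split. One common way to check this is k-fold cross-validation. The sketch below is illustrative only: it runs on synthetic data from `make_regression` (an assumption, since the CSV is not bundled here), the names `X_demo`, `y_demo`, `candidates`, and `cv_mae` are hypothetical, and `XGBRegressor` is left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for this sketch). In the notebook,
# the X and y arrays built from the Spotify CSV would be used instead.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=32
)

# Fixed seeds so repeated runs give the same scores.
candidates = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=32),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=32),
    "LinearRegression": LinearRegression(),
}

# Five shuffled folds: each model is trained and scored five times.
cv = KFold(n_splits=5, shuffle=True, random_state=32)

cv_mae = {}
for name, model in candidates.items():
    # scikit-learn reports negated MAE so that higher is always better;
    # flip the sign to get ordinary (non-negative) MAE values.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores
    print(f"{name}: MAE {cv_mae[name].mean():.2f} +/- {cv_mae[name].std():.2f}")
```

The ranking this prints applies only to the synthetic data and says nothing about the Spotify dataset. To use it in the notebook, pass the real `X` and `y` and add `XGBRegressor()` to `candidates`.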