# Baseline Model and Pipeline for Hit-Predict Project

## Table of Contents
1. [Setup](#setup)
    - [Importing Libraries](#lib)
    - [Loading Data](#data)
2. [Baseline Models](#baseline)
    - [Regression Problem](#sub1)
    - [Classification Problem with Classes by 20](#sub2)
    - [Classification Problem with Classes by 5](#sub3)
3. [Final Model Pipeline](#pipeline)

## 1. Setup
<a id="setup"></a>

### Importing Libraries
<a id="lib"></a>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

### Loading Data
<a id="data"></a>

In [2]:
DATA_PATH_nothing = "../data/processed_spotify_songs.csv"
df_nothing = pd.read_csv(DATA_PATH_nothing)

DATA_PATH_0mean = "../data/0mean_data.csv"
df_0mean = pd.read_csv(DATA_PATH_0mean)

DATA_PATH_no0 = "../data/no0_data.csv"
df_no0 = pd.read_csv(DATA_PATH_no0)

In [3]:
df_0mean.head()

Unnamed: 0,track_artist,track_popularity,track_album_id,track_album_release_date,playlist_id,playlist_subgenre,danceability,energy,key,loudness,...,rap,latin,rock,pop,artist_track_encoded,playlist_id_encoded,track_album_id_encoded,release_year,release_month,release_day
0,Barbie's Cradle,41.0,1srJQ0njEQgd8w4XSqI4JQ,2001-01-01,37i9dQZF1DWYDQ8wBxd7xt,classic rock,0.481351,0.160801,2.0,0.588413,...,0,0,1,0,43.5,43.1,41.0,2001,1,1
1,RIKA,15.0,1ficfUnZMaY1QkNp15Slzm,2018-01-26,0JmBB9HfrzDiZoPVRdv8ns,neo soul,0.350541,0.495616,5.0,0.715122,...,0,0,0,0,15.0,26.206186,15.0,2018,1,26
2,Steady Rollin,28.0,3z04Lb9Dsilqw68SHt6jLB,2017-11-21,3YouF0u7waJnolytf9JCXf,hard rock,0.095012,0.7744,9.0,0.768273,...,0,0,1,0,29.5,31.697917,28.0,2017,11,21
3,The.madpix.project,24.0,1Z4ANBVuhTlS6DprlP0m1q,2015-08-07,5TiiHps0hNCyQ6ijVkNZQs,electropop,0.449432,0.630436,10.0,0.736041,...,0,0,0,1,12.2,39.436364,24.0,2015,8,7
4,YOSA & TAAR,38.0,2BuYm9UcKvI0ydXs5JKwt0,2018-11-16,37i9dQZF1DXdOtZGKonFlM,dance pop,0.453533,0.702244,1.0,0.713109,...,0,0,0,1,38.0,35.673469,38.0,2018,11,16


In [4]:
print(df_nothing.columns)

Index(['track_artist', 'track_popularity', 'track_album_id',
       'track_album_release_date', 'playlist_id', 'playlist_subgenre',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms', 'edm', 'r&b', 'rap', 'latin', 'rock', 'pop',
       'artist_track_encoded', 'playlist_id_encoded', 'track_album_id_encoded',
       'release_year', 'release_month', 'release_day'],
      dtype='object')


In [5]:
# We noticed that we forgot to drop the following columns 
# (that we had encoded and added in our dataset during preprocessing, so there is no risk in dropping them)
col_id_drop = ['track_album_id', 'playlist_id', 'track_album_release_date']
df_nothing = df_nothing.drop(col_id_drop, axis=1)
df_0mean = df_0mean.drop(col_id_drop, axis=1)
df_no0 = df_no0.drop(col_id_drop, axis=1)

# Furthermore, there are remaining columns that we still need to one-hot encode
non_num_col = ['track_artist', 'playlist_subgenre']
df_nothing = pd.get_dummies(df_nothing, columns=non_num_col)
df_0mean = pd.get_dummies(df_0mean, columns=non_num_col)
df_no0 = pd.get_dummies(df_no0, columns=non_num_col)

In [6]:
X_nothing = df_nothing.drop(columns=['track_popularity'])
Y_nothing = df_nothing['track_popularity']

X_0mean = df_0mean.drop(columns=['track_popularity'])
Y_0mean = df_0mean['track_popularity']

X_no0 = df_no0.drop(columns=['track_popularity'])
Y_no0 = df_no0['track_popularity']

## 2. Baseline Models
<a id="baseline"></a>

Recall from the EDA notebook that the target variable is highly imbalanced, mainly because a very large number of tracks have a popularity score of 0. In previous notebooks we discussed several ways to address this imbalance during preprocessing and saved the resulting datasets in three CSV files (`processed_spotify_songs.csv`, `0mean_data.csv` and `no0_data.csv`). We therefore evaluate every model on all three preprocessed datasets to see whether each preprocessing choice improves performance.
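As a quick check of the imbalance described above, the share of zero scores can be measured directly. The sketch below uses a small made-up `scores` series (an assumption for illustration, not project data); on the real data one would presumably pass `df_nothing['track_popularity']` instead.

```python
import pandas as pd

# Hypothetical popularity scores; the real values live in the project's CSV files.
scores = pd.Series([0, 0, 0, 0, 12, 35, 35, 58, 71, 90], name="track_popularity")

# Fraction of tracks whose popularity is exactly 0
zero_share = (scores == 0).mean()
print(f"Share of tracks with popularity 0: {zero_share:.0%}")

# Number of tracks per distinct score, sorted by score
counts = scores.value_counts().sort_index()
print(counts)
```

A large `zero_share` would confirm that plain accuracy is dominated by the zero-popularity tracks.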

## Regression Problem
<a id="sub1"></a>

## Classification Problem with Classes by 20
<a id="sub2"></a>

## Classification Problem with Classes by 5
<a id="sub3"></a>

We start by binning the popularity scores (0 to 100) into 20 classes, each 5 points wide.

In [7]:
bins = range(0, 105, 5)
labels = range(len(bins)-1)

Y_nothing_clustered = pd.cut(Y_nothing, bins=bins, labels=labels, include_lowest=True)
Y_0mean_clustered = pd.cut(Y_0mean, bins=bins, labels=labels, include_lowest=True)
Y_no0_clustered = pd.cut(Y_no0, bins=bins, labels=labels, include_lowest=True)

# We will append all final models in this model dictionary  
models = {}
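To make the binning concrete, here is a minimal sketch of how `pd.cut` with these settings assigns a few boundary values. The `sample` series is made up for illustration; the bin edges and labels mirror the cell above (written as lists rather than `range` objects).

```python
import pandas as pd

bins = list(range(0, 105, 5))        # edges 0, 5, 10, ..., 100
labels = list(range(len(bins) - 1))  # class ids 0..19

# Hypothetical scores chosen to sit on or next to bin edges
sample = pd.Series([0, 5, 6, 10, 11, 99, 100])
classes = pd.cut(sample, bins=bins, labels=labels, include_lowest=True)

print(classes.tolist())
# → [0, 0, 1, 1, 2, 19, 19]
```

Intervals are closed on the right, so a score of 5 falls in class 0 and a score of 6 in class 1; `include_lowest=True` is what keeps a score of 0 in class 0 instead of turning it into a missing value.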

In [8]:
## Helper function

def evaluate_models(X, Y, models, test_size=0.2, random_state=42):
    """Fit each model on one train/test split and report its accuracies.

    `models` maps a display name to an unfitted scikit-learn classifier.
    Returns a dict mapping each name to its train, 5-fold cross-validation
    and test accuracy.
    """
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=test_size, random_state=random_state
    )

    results = {}
    for name, model in models.items():
        print(f"Running {name}")
        model.fit(X_train, Y_train)

        train_acc = accuracy_score(Y_train, model.predict(X_train))
        # cross_val_score works on clones of the model, so the fit above is unaffected
        cv_acc = np.mean(cross_val_score(model, X_train, Y_train, cv=5))
        test_acc = accuracy_score(Y_test, model.predict(X_test))

        results[name] = {
            "Train Accuracy": train_acc,
            "Cross-Validation Accuracy": cv_acc,
            "Test Accuracy": test_acc
        }
        print(f"Results for {name} saved!")
    return results

### K-Nearest Neighbors

In [9]:
# 1-nearest-neighbor baseline on the unprocessed dataset.
# Note: the target used here is the raw score Y_nothing, not the binned Y_nothing_clustered.
X_train, X_test, Y_train, Y_test = train_test_split(X_nothing, Y_nothing, test_size=0.2, random_state=42)
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, Y_train)
train_acc = accuracy_score(Y_train, model.predict(X_train))
cv_acc = np.mean(cross_val_score(model, X_train, Y_train, cv=5))
test_acc = accuracy_score(Y_test, model.predict(X_test))



In [10]:
print(train_acc)
print(cv_acc)
print(test_acc)

1.0
0.18522141482981982
0.17915711514724034


In [11]:
n_list = [2, 3, 5, 10, 20]
knn_models = {}
for n in n_list:
    knn_models[f"knn with {n} neighb"] = KNeighborsClassifier(n_neighbors=n)

nothing_knn = evaluate_models(X_nothing, Y_nothing_clustered, knn_models)
zero_mean_knn = evaluate_models(X_0mean, Y_0mean_clustered, knn_models)
no_zero_knn = evaluate_models(X_no0, Y_no0_clustered, knn_models)

print("Nothing")
print(nothing_knn)
print("0mean")
print(zero_mean_knn)
print("No0")
print(no_zero_knn)

Running knn with 2 neighb
Results for knn with 2 neighb saved!
Running knn with 3 neighb
Results for knn with 3 neighb saved!
Running knn with 5 neighb
Results for knn with 5 neighb saved!
Running knn with 10 neighb
Results for knn with 10 neighb saved!
Running knn with 20 neighb
Results for knn with 20 neighb saved!
Running knn with 2 neighb
Results for knn with 2 neighb saved!
Running knn with 3 neighb
Results for knn with 3 neighb saved!
Running knn with 5 neighb
Results for knn with 5 neighb saved!
Running knn with 10 neighb
Results for knn with 10 neighb saved!
Running knn with 20 neighb
Results for knn with 20 neighb saved!
Running knn with 2 neighb
Results for knn with 2 neighb saved!
Running knn with 3 neighb
Results for knn with 3 neighb saved!
Running knn with 5 neighb
Results for knn with 5 neighb saved!
Running knn with 10 neighb
Results for knn with 10 neighb saved!
Running knn with 20 neighb
Results for knn with 20 neighb saved!
Nothing
{'knn with 2 neighb': {'Train Accur

For the next part, we gathered in a single cell, for simplicity, the definitions and grid-search spaces (scored on cross-validation accuracy) of the following models:
- Logistic Regression
- Ridge Classifier
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Tree Classifier

We also implemented a helper function to train and evaluate the models.

In [28]:
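models = {
        "Logistic Regression": {
            "model": LogisticRegression(max_iter=1000),
            # lbfgs only supports the l2 penalty, so l1 is searched with saga only
            "params": [
                {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"], "solver": ["lbfgs", "saga"]},
                {"C": [0.01, 0.1, 1, 10], "penalty": ["l1"], "solver": ["saga"]}
            ]
        },
        "Ridge Classifier": {
            "model": RidgeClassifier(),
            "params": {
                "alpha": [0.1, 1, 10]
            }
        },
        "Decision Tree Classifier": {
            "model": DecisionTreeClassifier(),
            "params": {
                "max_depth": [2, 5, 10],
                "min_samples_split": [2, 10, 20]
            }
        },
        # Needed by the Random Forest section below, which looks up this key.
        # The grid values here are an assumption (illustrative only); adjust as needed.
        "Random Forest Classifier": {
            "model": RandomForestClassifier(),
            "params": {
                "n_estimators": [50, 100, 200],
                "max_depth": [5, 10, None]
            }
        },
        "Gradient Boosted Trees": {
            "model": GradientBoostingClassifier(),
            "params": {
                "n_estimators": [50, 100, 200],
                "learning_rate": [0.01, 0.1, 0.2],
                "max_depth": [3, 5, 10]
            }
        }
    }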

In [24]:
def evaluate_models_with_params(X, Y, config, name, test_size=0.2, random_state=42):
    """Grid-search one model configuration and report its accuracies.

    `config` is a single entry of the `models` dictionary, i.e. a dict with
    a "model" (unfitted estimator) and its "params" grid.
    """
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=test_size, random_state=random_state
    )

    print(f"Evaluating: {name}")
    # 5-fold grid search on the training split, scored by accuracy
    clf = GridSearchCV(config["model"], config["params"], cv=5, scoring="accuracy", n_jobs=-1)
    clf.fit(X_train, Y_train)

    # clf.predict uses the best estimator, refit on the whole training split
    train_acc = accuracy_score(Y_train, clf.predict(X_train))
    test_acc = accuracy_score(Y_test, clf.predict(X_test))

    return {
        name: {
            "Best Parameters": clf.best_params_,
            "Train Accuracy": train_acc,
            "Cross-Validation Accuracy": clf.best_score_,
            "Test Accuracy": test_acc
        }
    }
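The grid-search pattern used by this helper can be tried quickly on a small synthetic problem before launching it on the full data. The sketch below is only an illustration: the dataset comes from `make_classification` (an assumption, not the project data) and the grid is the decision-tree one from the `models` dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features and binned target
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

config = {
    "model": DecisionTreeClassifier(random_state=0),
    "params": {"max_depth": [2, 5, 10], "min_samples_split": [2, 10, 20]},
}

# 5-fold grid search over the 9 parameter combinations, scored by accuracy
search = GridSearchCV(config["model"], config["params"], cv=5, scoring="accuracy")
search.fit(X_train, y_train)

summary = {
    "Best Parameters": search.best_params_,
    "Train Accuracy": accuracy_score(y_train, search.predict(X_train)),
    "Cross-Validation Accuracy": search.best_score_,
    "Test Accuracy": accuracy_score(y_test, search.predict(X_test)),
}
print(summary)
```

Running a small case like this first gives a rough idea of the cost per combination, which matters because the number of fits grows as grid size times the number of folds.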

### Logistic Regression

In [25]:
name = "Logistic Regression"
model_logreg = models[name]

In [27]:
nothing_logreg = evaluate_models_with_params(X_nothing, Y_nothing_clustered, model_logreg, name)
print(nothing_logreg)

Evaluating: Logistic Regression


KeyboardInterrupt: 

In [None]:
zero_mean_logreg = evaluate_models_with_params(X_0mean, Y_0mean_clustered, model_logreg, name)
print(zero_mean_logreg)


In [None]:
no_zero_logreg = evaluate_models_with_params(X_no0, Y_no0_clustered, model_logreg, name)
print(no_zero_logreg)

### Ridge Classifier

In [29]:
name = "Ridge Classifier"
model_ridge = models[name]

In [None]:
nothing_ridge = evaluate_models_with_params(X_nothing, Y_nothing_clustered, model_ridge, name)
print(nothing_ridge)

Evaluating: Ridge Classifier


In [None]:
zero_mean_ridge = evaluate_models_with_params(X_0mean, Y_0mean_clustered, model_ridge, name)
print(zero_mean_ridge)

In [None]:
no_zero_ridge = evaluate_models_with_params(X_no0, Y_no0_clustered, model_ridge, name)
print(no_zero_ridge)

### Decision Tree Classifier

In [None]:
name = "Decision Tree Classifier"
model_dtc = models[name]

In [None]:
nothing_tree = evaluate_models_with_params(X_nothing, Y_nothing_clustered, model_dtc, name)
print(nothing_tree)

In [None]:
zero_mean_tree = evaluate_models_with_params(X_0mean, Y_0mean_clustered, model_dtc, name)
print(zero_mean_tree)

In [None]:
no_zero_tree = evaluate_models_with_params(X_no0, Y_no0_clustered, model_dtc, name)
print(no_zero_tree)

In [None]:
# Best performing 

### Random Forest Classifier

In [None]:
name = "Random Forest Classifier" 
model_rfc = models[name]

In [None]:
nothing_forest = evaluate_models_with_params(X_nothing, Y_nothing_clustered, model_rfc, name)
print(nothing_forest)

In [None]:
zero_mean_forest = evaluate_models_with_params(X_0mean, Y_0mean_clustered, model_rfc, name)
print(zero_mean_forest)

In [None]:
no_zero_forest = evaluate_models_with_params(X_no0, Y_no0_clustered, model_rfc, name)
print(no_zero_forest)

### Gradient Boosting Tree Classifier

In [None]:
name = "Gradient Boosted Trees"
model_gbt = models[name]

In [None]:
nothing_boost = evaluate_models_with_params(X_nothing, Y_nothing_clustered, model_gbt, name)
print(nothing_boost)

In [None]:
zero_mean_boost = evaluate_models_with_params(X_0mean, Y_0mean_clustered, model_gbt, name)
print(zero_mean_boost)

In [None]:
no_zero_boost = evaluate_models_with_params(X_no0, Y_no0_clustered, model_gbt, name)
print(no_zero_boost)

### Models Performance Comparison and Discussion

Let us first combine all of our dictionaries of results into a single dictionary for easier manipulation/comparison.

In [None]:
all_results = {}
all_results["nothing"] = nothing_knn | nothing_logreg | nothing_ridge | nothing_tree | nothing_forest | nothing_boost
all_results["zero_mean"] = zero_mean_knn | zero_mean_logreg | zero_mean_ridge | zero_mean_tree | zero_mean_forest | zero_mean_boost
all_results["no_zero"] = no_zero_knn | no_zero_logreg | no_zero_ridge | no_zero_tree | no_zero_forest | no_zero_boost
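The `|` operator used above merges dictionaries and requires Python 3.9 or later; when two operands share a key, the right-hand value wins. A minimal sketch with placeholder entries (the zero values below are illustrative, not results):

```python
# Placeholder result dictionaries shaped like the ones returned by the helpers
knn_results = {"knn with 2 neighb": {"Test Accuracy": 0.0}}
tree_results = {"Decision Tree Classifier": {"Test Accuracy": 0.0}}

# Union keeps the insertion order of the operands, left to right
merged = knn_results | tree_results
print(list(merged))
# → ['knn with 2 neighb', 'Decision Tree Classifier']

# On a key collision the right-hand operand overrides the left-hand one
overridden = {"a": 1} | {"a": 2}
print(overridden)
# → {'a': 2}
```

Because every model name is distinct here, no entry is overwritten when the per-model dictionaries are combined.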

We can now inspect the performance of the different models on each dataset.

In [None]:
# The keys must match those used when building all_results
separator = "-" * 108
dataset_keys = ["nothing", "zero_mean", "no_zero"]

for i, key in enumerate(dataset_keys):
    if i > 0:
        print(separator)
        print(separator)
    print(f"Results for '{key}' dataset:")
    for model, metrics in all_results[key].items():
        print(f"{model}: {metrics}")

From the results above, we can build the following table. 
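One way to assemble such a table from a nested dictionary like `all_results` is to flatten it into a pandas DataFrame. This is only a sketch: the dictionary below is a stand-in with placeholder model names and made-up numbers, not the notebook's results.

```python
import pandas as pd

# Placeholder structure: dataset -> model -> metrics (numbers are invented)
example_results = {
    "nothing": {
        "model A": {"Train Accuracy": 0.9, "Cross-Validation Accuracy": 0.5, "Test Accuracy": 0.4},
        "model B": {"Train Accuracy": 0.7, "Cross-Validation Accuracy": 0.6, "Test Accuracy": 0.5},
    },
    "no_zero": {
        "model A": {"Train Accuracy": 0.8, "Cross-Validation Accuracy": 0.4, "Test Accuracy": 0.3},
    },
}

# One row per (dataset, model) pair, one column per metric
rows = []
for dataset, per_model in example_results.items():
    for model_name, metrics in per_model.items():
        rows.append({"Dataset": dataset, "Model": model_name, **metrics})

table = pd.DataFrame(rows).set_index(["Dataset", "Model"])
print(table.round(3))
```

With the real `all_results`, grid-searched models would also contribute a "Best Parameters" column, which can be dropped or kept depending on how compact the table should be.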







We observe that: