# CS985/6 Spotify Regression Problem 2024

## Group 14

|Name |Student No.|
|-----|-----------|
|[Ishita Namdeo](ishita.namdeo.2023@uni.strath.ac.uk)| 202353325|
|[Mohamed Tarek Mokhtar Omara Ahmed](mohamed.t.ahmed.2023@uni.strath.ac.uk)|202356621|
|[Ramandeep Gil](ramandeep.gill.2023@uni.strath.ac.uk) ||
|[Riley Simpson](riley.simpson.2019@uni.strath.ac.uk) | 202363053|
|[S A Nawash Akhtar](nawash.akhtar.2023@uni.strath.ac.uk) |202352528|
---

# Regressors Selection  

### Importing the proper modules 


In [59]:
#Import necessery libraries & pre-processing 
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures , StandardScaler
from sklearn.model_selection import train_test_split


# Import all relevant regressors from sklearn  
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import BayesianRidge,ElasticNet,LogisticRegression,SGDRegressor,LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import LinearSVR,SVR

### Load the dataset 
In order to train the regressors , we will use the cleaned/processed training datasets created in the [Preprocessing.ipynb](Preprocessing.ipynb) notebook. 

In [60]:

df = pd.read_csv("Training_Cleaned.csv")
df.head()
x = df.drop(columns=['pop','Id'])
y = df["pop"]

x_train,x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)


### Setting the Random state
>"Using random_state is important for reproducibility, debugging, and comparison of results. By setting this parameter, you can >ensure that your experiments are reproducible, debug problems more effectively, and compare the performance of different models >more accurately" -[What Is 'random_state' in sklearn.model_selection.train_test_split Example?](https://saturncloud.io/blog/what-is-randomstate-in-sklearnmodelselectiontraintestsplit-example/#:~:text=Using%20random_state%20is%20important%20for%20reproducibility%2C%20debugging%2C%20and%20comparison%20of,of%20different%20models%20more%20accurately.)

In [61]:
random_state = 42

### Enabling the regressors 
From here we can adjust the hyperparameters adjust the performance of each one.

#### GradientBoostingRegressor (GBR)
- Gradient Boosting is an ensemble learning technique that builds models sequentially, with each model trying to correct the    errors of its predecessor. The default settings are often a good starting point.
KernelRidge (KR):

#### Kernel Ridge Regression
- Combines Ridge Regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear data, this can be very effective.
BayesianRidge (BR):

#### Bayesian Ridge Regression 
- implements Bayesian linear regression. It is particularly useful when the size of the data is not too large and you want to include regularization parameters that are tuned to the data.

#### ElasticNet (EN)
- ElasticNet is a linear regression model trained with both l1 and l2-norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge.

#### DecisionTreeRegressor (DT)
- 'friedman_mse' criterion measures the quality of a split based on mean squared error with improvement score by Friedman. It's often a good choice for regression.
- 'max_depth=20' sets a limit on the depth of the tree. Deep trees can lead to overfitting.

Other parameters like 'max_features', 'max_leaf_nodes', and 'min_impurity_decrease' are set to their default values, which are generally a good starting point.
- 'min_samples_leaf' and 'min_samples_split' control the size of the leaf nodes and splits, influencing model complexity.
- 'random_state' ensures reproducibility.
- 'splitter=best' chooses the best split at each node.

#### LinearSVR
- 'epsilon=8' defines the margin of tolerance where no penalty is given to errors. The choice of this value can significantly affect the fit.
- 'random_state' for reproducibility.

#### SVR with Polynomial Kernel (Poly_SVR)
- 'kernel="poly"' specifies the use of a polynomial kernel.
- 'degree=2' for the polynomial kernel. A degree of 2 usually captures non-linear relationships well without overfitting.
- 'C=1' sets the regularization parameter. A smaller value of C means more regularization.
- 'epsilon=0.1' sets the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
- 'gamma="scale"' automatically adjusts gamma based on the number of features, which often leads to better performance.

#### LogisticRegression
Used for classification problems. The default parameters are often a reasonable starting point.
- 'random_state' for reproducibility.

#### SGDRegressor
Stochastic Gradient Descent is a simple yet very efficient approach to fitting linear models.
- 'random_state' for reproducibility.

### LinearRegression
Standard linear regression without regularization. It's the most basic form of regression.

In [62]:
GBR = GradientBoostingRegressor()
KR = KernelRidge()
BR = BayesianRidge()
EN = ElasticNet()
DT = DecisionTreeRegressor()
Lin_SVR = LinearSVR(epsilon = 8 , random_state=random_state )
Poly_SVR = SVR(kernel="poly", degree=2, C=1, epsilon=0.1, gamma="scale")
Log_reg = LogisticRegression(random_state=random_state)
SGD = SGDRegressor(random_state=random_state)
Lin_reg = LinearRegression()

# Combining these models into a list allows for a simple training/testing function to be used for each of them. 
reg_list = [GBR,KR,BR,DT, Lin_SVR, Lin_reg, Log_reg, SGD,Poly_SVR]


### Scaling the Data 
Scaling our dataset before training / testing presents a few advantages:

- **Feature Standardization**: Our dataset contains numerical features like `bpm`, `nrgy`, `dnce`, etc., with varying scales. StandardScaler() will standardize these features to have a mean of 0 and a standard deviation of 1, ensuring that no single feature will dominate others during the learning process.

- **Model Performance**: Some of the regressor models are sensitive to scale (such as linear and non-linear SVR) , standardizing the features will likely improve model performance.

- **Gradient Descent Optimization**: For models that use gradient descent (like linear regression or neural networks), having features on the same scale can speed up convergence.

- **Handling Skewed Distributions**: If any of your numerical features are not normally distributed, `StandardScaler()` can mitigate the effect of skewness.

In [63]:
std_scaler = StandardScaler()
x_train = std_scaler.fit_transform(x_train)
x_test = std_scaler.fit_transform(x_test)

### The Regressor Function 
In order to determine which regressor is most suitable for the task , we need to compare different regression models and evaluate their mean squared error. To do this the following steps must be taken:

1. Train the regressor on the training dataset 
2. Predict the popularity of the songs in the testing dataset
3. Compare these predicted values with the real values via the mean squared error: $$RMSD = \sqrt{\frac{\sum\limits_{i=1}^{N} (x_{i} - \hat{x_{i}})}{N}}$$

In order to streamline the selection process these steps were summarised in the function `use_regressor()`.

In [64]:
def use_regressor(reg):
    regressor = reg.fit(x_train,y_train)
    y_pred = regressor.predict(x_test)
    mse=mean_squared_error(y_test,y_pred)
    print(f"MSE:{np.sqrt(mse)} for {reg.__class__.__name__}")
    return y_pred

for reg in reg_list:
    use_regressor(reg)

MSE:11.470727253839497 for GradientBoostingRegressor
MSE:63.54870272274179 for KernelRidge
MSE:10.746770268144891 for BayesianRidge
MSE:14.941644730508996 for DecisionTreeRegressor
MSE:12.22028355474507 for LinearSVR
MSE:572248032970038.9 for LinearRegression
MSE:11.797746106219323 for LogisticRegression
MSE:10.97351120861915 for SGDRegressor
MSE:12.112180434207856 for SVR




### Results 

| Model                  | MSE                  |
|------------------------|----------------------|
| **GradientBoostingRegressor** | <mark>9.478579817350175</mark>   |
| KernelRidge            | 61.455887080158426  |
| BayesianRidge          | 9.561217353361643   |
| DecisionTreeRegressor  | 14.408642349609984  |
| LinearSVR              | 10.509947931975468  |
| LinearRegression       | 335209587401381.75  |
| LogisticRegression     | 13.643015360684315  |
| SGDRegressor           | 9.864147214378303   |
| SVR                    | 12.962918698414367  |

Gradient Boosting is an ensemble learning technique that combines multiple weak prediction models (typically decision trees) to create a strong predictive model. By sequentially correcting the mistakes of previous models. 

The training dataset consisted of 453 rows and 122 columns, featuring a diverse range of features. Key features include 'bpm' (beats per minute), 'nrgy' (energy), 'dnce' (danceability), 'dB' (loudness), 'live' (liveness), 'val' (valence), 'dur' (duration), 'acous' (acousticness), 'spch' (speechiness), and 'pop' (popularity), along with several decade-related binary features (indicating the song's decade) and various statistics about the song title and artist name (like length and word count). 

These wide range of values in the features indicate a rich and varied collection of songs with different characteristics. This diversity, combined with the presence of both numerical and categorical data, likely contributed to the effective performance of the GradientBoostingRegressor, which excels in handling complex and non-linear relationships, is robust against overfitting, and can leverage the benefits of ensemble learning for predictive accuracy.

> As such Gradient Boosting Regressor was chosen for our predictions.

### Tuning the Regressor
Before applying a trained regressor to the new testing dataset , the model must be optimised first based on the data it's being tested on. [Grid Search CV](https://www.mygreatlearning.com/blog/gridsearchcv/) will be used for this. 

Parameters like `loss`, `learning_rate`, `n_estimators`, m`ax_depth, `min_samples_split`, and `min_samples_leaf` are chosen due to their significant impact on the model's learning dynamics and generalization capabilities. 

The loss function, including 'squared_error' and 'huber', is crucial as it dictates how the model penalizes errors, with 'huber' being particularly useful for handling outliers. 

The learning_rate influences the speed and robustness of the learning process, and a range of values is selected to explore the trade-off between fast learning and the risk of overfitting. 

The n_estimators parameter, determining the number of boosting stages, is pivotal in defining the model complexity and is set to moderate values to balance accuracy and overfitting.

The choice of max_depth, `min_samples_split`, and `min_samples_leaf` is aimed at controlling the model structure to prevent overfitting while capturing sufficient data patterns. The max_depth of the trees strikes a balance between learning detailed data relationships and maintaining model simplicity. 

Similarly, `min_samples_split` and `min_samples_leaf` ensure that splits and leaf nodes do not cater to overly specific or minute data samples, thereby smoothing the model and enhancing its ability to generalize. These parameters are set to a range that allows exploration of different model behaviors while keeping the computational load manageable, ensuring an efficient yet thorough search for the optimal model configuration.

In [48]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 3]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(GradientBoostingRegressor(random_state=random_state), 
                           param_grid, 
                           scoring='neg_mean_squared_error', 
                           cv=5, 
                           verbose=1)

# Fit the grid search to the data
grid_search.fit(x_train, y_train)

# Print the best parameters and the corresponding score
print("Best parameters found: ", grid_search.best_params_)
print("Best MSE: ",np.sqrt( -grid_search.best_score_))

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Best parameters found:  {'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 300}
Best MSE:  102.54963136970562


# Predicting the Popularity with training dataset
This next step is similar to previous ones with the addition of a predictions dataframe which we use to store our predictions.

In [71]:
training_url =  "https://raw.githubusercontent.com/Riley-Simpson/Postgraduate-Work-/main/Semester%202/Machine%20Learning%20for%20Data%20Analysis/Coursework/Training_cleaned.csv"
testing_url = "https://raw.githubusercontent.com/Riley-Simpson/Postgraduate-Work-/main/Semester%202/Machine%20Learning%20for%20Data%20Analysis/Coursework/Testing_Cleaned.csv"
df = pd.read_csv(training_url)
df.head()
x_train = df.drop(columns=['pop'])
y_train = df["pop"]
x_test = pd.read_csv(testing_url)
predictions = pd.DataFrame(x_test['Id']) 

std_scaler = StandardScaler()
x_train = std_scaler.fit_transform(x_train)
x_test = std_scaler.fit_transform(x_test)



In [72]:

# Import Gradient Boosting Regressor from sklearn.
from sklearn.ensemble import GradientBoostingRegressor

# Set a fixed random state for reproducibility.
random_state = 42

# Initialize Gradient Boosting Regressor with specified settings.
GBR = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, min_samples_leaf=1, min_samples_split=2, n_estimators=100)

# Train the model on the training data.
regressor = GBR.fit(x_train, y_train)

# Predict values using the model on the test data.
y_pred = regressor.predict(x_test)

# Convert predictions to a list.
predictions_list = list(y_pred)

# Add predictions to a DataFrame.
predictions['pop'] = predictions_list

# Save the predictions to a CSV file.
predictions.to_csv(f'{GBR.__class__.__name__}_pop_predictions.csv', index=False)


# Kaggle Competition Results 

As of 18/02/2024 , 15:50 : CS985 Group 14 are `7th` with a score of `7.14804`
    