# Model Selection (& fine tuning)

In [None]:
# DS18 ML Essentials project
# Module 6: Model selection & hyperparameter tuning

# Submitted by: Tzvi Eliezer Nir
# mail: tzvienir@gmail.com
# First submission: 29/03/2025

## In this notebook

It is time to put all previous work into action. In this model we will fit, test and evaluate the model that will be selected for our system. The process can be divided into two parts:

1. Model Selection: by fitting multiple regression models on our dataset, and comparing the results.
2. Fine tuning: for the best-performance model (in terms of lowest error) we will do a `RandomSearch` to find the best hyperparameters.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics as metrics 
import warnings
warnings.filterwarnings("ignore")

## Model Selection

### Import dataset

Lets import the dataset from the previous chapter:

In [2]:
df = pd.read_pickle('pickle/05_feature_selection/feature_selection.pkl')

This dataset has the 22 selected features + the target variable:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28356 entries, 0 to 28355
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   danceability            28356 non-null  float64
 1   energy                  28356 non-null  float64
 2   key                     28356 non-null  int64  
 3   loudness                28356 non-null  float64
 4   acousticness            28356 non-null  float64
 5   instrumentalness        28356 non-null  float64
 6   liveness                28356 non-null  float64
 7   tempo                   28356 non-null  float64
 8   duration_ms             28356 non-null  int64  
 9   playlist_count          28356 non-null  int64  
 10  edm                     28356 non-null  bool   
 11  pop                     28356 non-null  bool   
 12  r&b                     28356 non-null  bool   
 13  rap                     28356 non-null  bool   
 14  rock                    28356 non-null

In [4]:
df.head()

Unnamed: 0,danceability,energy,key,loudness,acousticness,instrumentalness,liveness,tempo,duration_ms,playlist_count,...,rap,rock,year,month,day,decade,feat,Remix,track_artist_followers,track_popularity
0,0.682,0.401,2,-10.068,0.279,0.0117,0.0887,97.091,235440,1,...,False,True,2001,1,1,2000,False,False,103090.0,41
1,0.582,0.704,5,-6.242,0.0651,0.0,0.212,150.863,197286,1,...,False,False,2018,1,26,2010,False,False,366482.0,15
2,0.303,0.88,9,-4.739,0.0117,0.00994,0.347,135.225,373512,1,...,False,True,2017,11,21,2010,False,False,4132.0,28
3,0.659,0.794,10,-5.644,0.000761,0.132,0.322,128.041,228565,1,...,False,False,2015,8,7,2010,False,False,557.0,24
4,0.662,0.838,1,-6.3,0.114,0.000697,0.0881,129.884,236308,1,...,False,False,2018,11,16,2010,False,False,2913.0,38


### Define error metrics

To compare the models' performance, we choose four known error metrics:

1. **Mean Squared Error (MSE)**  
    $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

2. **Root Mean Squared Error (RMSE)**  
    $$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

3. **Root Mean Squared Logarithmic Error (RMSLE)**  
    $$\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2}$$

4. **Mean Absolute Error (MAE)**  
    $$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$


In [5]:
def regressionMetrics(y, yhat):
    res = {'MSE': metrics.mean_squared_error(y,yhat),
           'RMSE': np.sqrt(metrics.mean_squared_error(y,yhat)),
           'MAE': metrics.mean_absolute_error(y,yhat),
           
          }
    # Calculate RMSLE using absolute values to avoid negative issues
    res['RMSLE'] = np.sqrt(metrics.mean_squared_log_error(np.abs(y), np.abs(yhat)))
    
    return res

### Split the dataset to train, test and validation

Define the target variable y, and the feature set X:

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
y = df['track_popularity']
X = df.drop(columns=['track_popularity'])

We are gonna split the dataset to (train, test). Than split the train again to (train, validation) data:

In [8]:
# Split into train+val and test sets (80% train+val, 20% test)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split train+val into train and val sets (75% train, 25% val from the train+val set)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)

### Train the models

We are gonna try and fit seven different regression models on our data and see who gets the best results (lowest error) of them all:

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
#!pip install xgboost
import xgboost as xgb

In [10]:
# List of models to evaluate
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'AdaBoostRegressor': AdaBoostRegressor(),
    'GradientBoostingRegressor': GradientBoostingRegressor(),
    'SVR': SVR(),
    'XGBoost': xgb.XGBRegressor()
}

In [11]:
# Dictionary to store the results
results = {}

# Fit and predict using each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    results[name] = regressionMetrics(y_val, y_pred)

# Display the results
for name, metrics in results.items():
    print(f"Model: {name}")
    for metric, value in metrics.items():
        print(f"  {metric}: {value}")
    print()

Model: LinearRegression
  MSE: 463.23831507532424
  RMSE: 21.522971799343235
  MAE: 17.86629886651494
  RMSLE: 1.312507683594363

Model: DecisionTreeRegressor
  MSE: 759.5683741844472
  RMSE: 27.560268035424603
  MAE: 20.61320754716981
  RMSLE: 1.6026709564217243

Model: RandomForestRegressor
  MSE: 391.70682921340887
  RMSE: 19.791584808029114
  MAE: 15.693209856328354
  RMSLE: 1.2345863538688304

Model: AdaBoostRegressor
  MSE: 455.8584219251971
  RMSE: 21.350841246311518
  MAE: 18.140785519924062
  RMSLE: 1.2956978186481694

Model: GradientBoostingRegressor
  MSE: 400.26336156962054
  RMSE: 20.00658295585782
  MAE: 16.19303648927646
  RMSLE: 1.268597243258889

Model: SVR
  MSE: 570.4613591516761
  RMSE: 23.88433292247611
  MAE: 18.778443352101235
  RMSLE: 1.4014299937135324

Model: XGBoost
  MSE: 402.0318666748082
  RMSE: 20.050732322656152
  MAE: 15.905198201890546
  RMSLE: 1.2345041437195001



Below are the results of comparing the various models:

In [15]:
# Convert the results dictionary to a DataFrame
results_df = pd.DataFrame(results).T

# Sort the DataFrame by MAE
results_df_sorted = results_df.sort_values(by='MAE')

# Display the sorted DataFrame
results_df_sorted

Unnamed: 0,MSE,RMSE,MAE,RMSLE
RandomForestRegressor,391.706829,19.791585,15.69321,1.234586
XGBoost,402.031867,20.050732,15.905198,1.234504
GradientBoostingRegressor,400.263362,20.006583,16.193036,1.268597
LinearRegression,463.238315,21.522972,17.866299,1.312508
AdaBoostRegressor,455.858422,21.350841,18.140786,1.295698
SVR,570.461359,23.884333,18.778443,1.40143
DecisionTreeRegressor,759.568374,27.560268,20.613208,1.602671


As can be seen in the table, `RandomForestRegressor` is the big winner, with the lowest *Mean Absolute Error*.

In [16]:
models['RandomForestRegressor']

## Fine Tuning

Lets try to improve the best model! we will use `RandomizedSearchCV` for finding the best hyperparameters for the regression model, and see if the new model is indeed having better results than the original:

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [20]:
# Reduced number of options for each hyperparameter
n_estimators = [100, 200, 300]  # Fewer values for the number of trees
max_features = ['auto','sqrt']  #  # Number of features to consider at each split
max_depth = [10, 20, 30, 40, None]  # Fewer values for max depth
min_samples_split = [2, 5, 10]  # Keep essential options only
min_samples_leaf = [1, 2, 4]  # Reduced options for leaf samples
bootstrap = [True, False]  # Keep as is

# Create a lighter random grid
lighter_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap
}

print(lighter_grid)

# Reduced number of iterations and cross-validation folds
rf_random = RandomizedSearchCV(estimator=models['RandomForestRegressor'], param_distributions=lighter_grid, n_iter=25, cv=3, 
                               verbose=2, random_state=42, n_jobs=1)

# Fit the random search model
rf_random.fit(X_train, y_train)

{'n_estimators': [100, 200, 300], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END bootstrap=True, max_depth=20, max_features=auto

Lets evaluate both the original `RandomForestRegressor` and the RandomSearch best estimator, and see which one we should choose:

In [21]:
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mae = 100 * np.mean(errors)
    print('Model Performance')
    print('Mean Absolute Error: {:0.4f}'.format(np.mean(errors)))
    return mae

In [22]:
base_accuracy = evaluate(models['RandomForestRegressor'], X_test, y_test)

Model Performance
Mean Absolute Error: 15.6069


In [23]:
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)

Model Performance
Mean Absolute Error: 15.6327


In [24]:
print('Improvement of {:0.2f}%.'.format( 100 * (base_accuracy - random_accuracy) / base_accuracy))

Improvement of -0.17%.


No Improvment from on Finetuning, we will stay with the base model.

![alt text](assets/theend.avif)