# A Little Regression Challenge

In [1]:
# importing general python libraries
import numpy as np
import pandas as pd

# importing libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# importing libraries for data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# importing libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# importing libraries for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error,mean_absolute_percentage_error

# importing libraries for model tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# importing tensorflow libraries for deep learning
import tensorflow as tf
from tensorflow import keras


## The Dataset - Spotify Songs

### Description:

In this task, we use a sample of 150K records from the ["Spotify Dataset 1921-2020, 600k+ Tracks"](https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv), which is available on Kaggle.

### The columns:

>**Target Column** we will predict the following column:
- `popularity` (Ranges from 0 to 100), integer, representing the popularity of the song on the Spotify platform.

>**Numerical Columns**:
- `acousticness` (Ranges from 0 to 1)
- `danceability` (Ranges from 0 to 1)
- `energy` (Ranges from 0 to 1)
- `duration_ms` (Integer typically ranging from 200k to 300k)
- `instrumentalness` (Ranges from 0 to 1)
- `valence` (Ranges from 0 to 1)
- `animality` (Ranges from 0 to 1; not present in the sample loaded below)
- `tempo` (Float typically ranging from 50 to 150)
- `liveness` (Ranges from 0 to 1)
- `loudness` (Float typically ranging from -60 to 0)
- `speechiness` (Ranges from 0 to 1)
- `release_year` (A column extracted from the release date; note that the preprocessing code below drops the date column, so the year does not appear among the features)


> **Categorical Columns**:
- `explicit` (Whether the song is explicit, i.e. contains swearing or inappropriate language; stored as 0/1)
  
> The following columns will be removed to simplify the task (identifiers, or too many categories):
- `id` (ID of the track, generated by Spotify; a string identifier)
- `artists` (List of artists mentioned)
- `name` (Name of the song)
- `genre` (Genre of the song; string type, multiclass; not present in the sample loaded below)
- `key` (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1, and so on…)
- `time_signature` (A notational convention to specify how many beats are in each bar, or measure. For example, rock music often has a time signature of 4/4, while classical music often has a time signature of 3/4 or 4/4.)
- `release_date` (The date on which the song was released)



## Loading and Preprocessing

In [2]:
reg_url = 'https://raw.githubusercontent.com/FreeDataSets/DataPool/main/tracks_150000.csv' # this is the url for the dataset
reg_df = pd.read_csv(reg_url).sample(10000, random_state=42) # To reduce the size of the dataset, we take a random sample of 10,000 rows

# a preview of the dataframe
reg_df.info() 
reg_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 59770 to 93473
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                10000 non-null  object 
 1   name              9998 non-null   object 
 2   popularity        10000 non-null  int64  
 3   duration_ms       10000 non-null  int64  
 4   explicit          10000 non-null  int64  
 5   artists           10000 non-null  object 
 6   release_date      10000 non-null  object 
 7   danceability      10000 non-null  float64
 8   energy            10000 non-null  float64
 9   key               10000 non-null  int64  
 10  loudness          10000 non-null  float64
 11  speechiness       10000 non-null  float64
 12  acousticness      10000 non-null  float64
 13  instrumentalness  10000 non-null  float64
 14  liveness          10000 non-null  float64
 15  valence           10000 non-null  float64
 16  tempo             10000 non-null  fl

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,release_date,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
59770,0vIDmQVcdN5xoKXCTzTciu,L'ombre sur la mesure,34,198613,0,['La Rumeur'],2002-04-12,0.695,0.746,1,-5.119,0.392,0.404,4.8e-05,0.0979,0.72,169.813,4
21362,6Kv5DGGNsTmMPwWHDPdquI,Roshni Apni Umangon Ki,0,189013,0,['Noor Jehan'],1943-01-01,0.434,0.225,10,-16.287,0.0418,0.993,0.93,0.239,0.814,72.649,3
127324,6FRugAcwKEgQ3r2MkWC5Mm,Kiss the Dirt (Falling Down the Mountain),26,236160,0,['INXS'],1985,0.697,0.695,6,-8.211,0.0465,0.0325,0.00828,0.114,0.605,115.813,4
140509,0GeAGyjTjCufXGm9Q2DNLa,Bisa,36,227657,0,['Billfold'],2013-01-01,0.372,0.945,2,-2.431,0.0484,8.4e-05,0.0253,0.38,0.624,92.478,4
144297,2rvDPgJnBxnkIRpYwVFrY2,Eclipse Total del Amor (Total Eclipse of the H...,55,324493,0,"['Yuridia', 'Patricio Borghetti']",2006-10-27,0.688,0.508,8,-6.725,0.0286,0.218,0.0,0.0979,0.331,134.927,4


In [3]:
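# Parse release_date and extract the year.
# Note: the column is dropped on the last line, so the release year is not used as a feature below.
# (On pandas >= 2.0, mixed date formats such as '1985' and '2002-04-12' may need format='mixed'.)
reg_df['release_date'] = pd.to_datetime(reg_df['release_date'])
reg_df['release_date'] = reg_df['release_date'].dt.year
reg_df.drop('release_date', axis=1, inplace=True)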
reg_df.head()


Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
59770,0vIDmQVcdN5xoKXCTzTciu,L'ombre sur la mesure,34,198613,0,['La Rumeur'],0.695,0.746,1,-5.119,0.392,0.404,4.8e-05,0.0979,0.72,169.813,4
21362,6Kv5DGGNsTmMPwWHDPdquI,Roshni Apni Umangon Ki,0,189013,0,['Noor Jehan'],0.434,0.225,10,-16.287,0.0418,0.993,0.93,0.239,0.814,72.649,3
127324,6FRugAcwKEgQ3r2MkWC5Mm,Kiss the Dirt (Falling Down the Mountain),26,236160,0,['INXS'],0.697,0.695,6,-8.211,0.0465,0.0325,0.00828,0.114,0.605,115.813,4
140509,0GeAGyjTjCufXGm9Q2DNLa,Bisa,36,227657,0,['Billfold'],0.372,0.945,2,-2.431,0.0484,8.4e-05,0.0253,0.38,0.624,92.478,4
144297,2rvDPgJnBxnkIRpYwVFrY2,Eclipse Total del Amor (Total Eclipse of the H...,55,324493,0,"['Yuridia', 'Patricio Borghetti']",0.688,0.508,8,-6.725,0.0286,0.218,0.0,0.0979,0.331,134.927,4
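
The dataset description mentions a `release_year` column, while the cell above drops `release_date` right after extracting the year. Below is a minimal sketch of how the year could be kept instead. It assumes every `release_date` value is a string that starts with a four-digit year (as in the preview rows, e.g. `1985` and `2002-04-12`); the toy frame and the `release_year` column name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking the mixed-precision dates seen in the preview (illustrative values)
toy_df = pd.DataFrame({"release_date": ["2002-04-12", "1943-01-01", "1985", "2013-01-01"]})

# Assumption: every value starts with a 4-digit year, so slicing avoids date-format parsing issues
toy_df["release_year"] = toy_df["release_date"].astype(str).str[:4].astype(int)

# Drop the raw date only after the year has been stored in its own column
toy_df = toy_df.drop(columns="release_date")
print(toy_df["release_year"].tolist())
```

Slicing the string sidesteps date parsing entirely, which may be convenient when some rows carry only a year.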


In [4]:

# Remove identifiers and categorical features with more than 10 unique values
cols_to_drop = ['name', 'artists', 'id', 'release_date', 'artists_id', 'genre', 'key', 'time_signature']
reg_df.drop(cols_to_drop, axis=1, inplace=True, errors='ignore')
reg_df.info()
reg_df.head(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 59770 to 93473
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   popularity        10000 non-null  int64  
 1   duration_ms       10000 non-null  int64  
 2   explicit          10000 non-null  int64  
 3   danceability      10000 non-null  float64
 4   energy            10000 non-null  float64
 5   loudness          10000 non-null  float64
 6   speechiness       10000 non-null  float64
 7   acousticness      10000 non-null  float64
 8   instrumentalness  10000 non-null  float64
 9   liveness          10000 non-null  float64
 10  valence           10000 non-null  float64
 11  tempo             10000 non-null  float64
dtypes: float64(9), int64(3)
memory usage: 1015.6 KB


Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
59770,34,198613,0,0.695,0.746,-5.119,0.392,0.404,4.8e-05,0.0979,0.72,169.813
21362,0,189013,0,0.434,0.225,-16.287,0.0418,0.993,0.93,0.239,0.814,72.649
127324,26,236160,0,0.697,0.695,-8.211,0.0465,0.0325,0.00828,0.114,0.605,115.813


In [5]:
# split the data into features and target variable 
Xreg = reg_df.drop('popularity', axis=1) # features
yreg = reg_df['popularity'] # target variable

In [6]:
# We standardize the data as asked in question 9. Fitting the scaler on the full dataset (before the split)
# causes data leakage, because test-set statistics are used to fit the scaler.
scaler_reg = StandardScaler().fit(Xreg)
Xreg_scaled = scaler_reg.transform(Xreg)


In [7]:
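# split the data into train and test sets
# NOTE: Xreg (not Xreg_scaled) is passed here, so despite their names the arrays below hold
# unscaled features; the outputs recorded in this notebook come from the call as written.
Xreg_train_scaled, Xreg_test_scaled, yreg_train, yreg_test = train_test_split(Xreg, yreg, test_size=0.2, random_state=42)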
# end of Q9
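
The leakage mentioned in the comment above can be avoided by splitting first and fitting the scaler on the training rows only. The following is a minimal sketch on synthetic data, not the notebook's actual pipeline; the names `X_demo`, `y_demo`, and `demo_scaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(1000, 3))
y_demo = rng.uniform(0, 100, size=1000)

# 1) Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# 2) Fit the scaler on the training rows only
demo_scaler = StandardScaler().fit(X_tr)

# 3) Apply the same fitted transformation to both splits
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # approximately 0 for every column
print(X_te_scaled.mean(axis=0).round(3))  # close to, but generally not exactly, 0
```

The test-set means are not exactly zero because the scaler has never seen those rows, which is the point of fitting on the training split alone.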

## Applying a simple Linear Regression

In [8]:
# Simple linear regression
from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
lin_reg = LinearRegression()

# Train the model using the training sets
lin_reg.fit(Xreg_train_scaled, yreg_train)


## Evaluating the linear regression model

In [9]:
### Helper function to save and compare regression metrics 

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error,mean_absolute_percentage_error

def calculate_and_append_metrics(model_name, model, X_train, y_train, X_test, y_test, train_results_df, test_results_df):
    def _metrics(X, y):
        # Predict once and reuse the predictions for every metric
        y_pred = model.predict(X)
        return pd.DataFrame({
            'Model': [model_name],
            'R2 Score': [r2_score(y, y_pred)],
            # RMSE as sqrt(MSE); avoids the `squared=False` argument, which is deprecated in recent scikit-learn releases
            'RMSE': [np.sqrt(mean_squared_error(y, y_pred))],
            'MAE': [mean_absolute_error(y, y_pred)],
            'MAPE': [mean_absolute_percentage_error(y, y_pred)]
        })

    # Concatenate the train and test metrics to the respective DataFrames
    train_results_df = pd.concat([train_results_df, _metrics(X_train, y_train)], ignore_index=True)
    test_results_df = pd.concat([test_results_df, _metrics(X_test, y_test)], ignore_index=True)

    return train_results_df, test_results_df

In [18]:
# evaluate the model using the train and test set and different metrics

# Create empty DataFrames to store the results
train_results_df = pd.DataFrame()
test_results_df = pd.DataFrame()

# Calculate metrics for the Linear Regression model
train_results_df, test_results_df = calculate_and_append_metrics('Linear Regression', lin_reg, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

# display the results
print("Train:")
display(train_results_df)
print("-"*70,"\n")
print("Test:")
display(test_results_df)

Train:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.214264,16.390692,13.333271,7207390000000000.0


---------------------------------------------------------------------- 

Test:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.201381,16.912657,13.620896,7177582000000000.0


## Polynomial Regression

In [11]:
""" Ridge Regression """
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV

X = Xreg_train_scaled
y = yreg_train

# Define the degrees to consider in the polynomial features
degrees = range(1, 5)

# Create a RidgeCV model with cross-validation
ridge_cv = RidgeCV([0.01, 0.1, 1, 10])

# Create a PolynomialFeatures transformer
poly = PolynomialFeatures()

# Perform a grid search over polynomial degrees
param_grid = {'poly__degree': degrees}

# Create a pipeline that combines PolynomialFeatures and RidgeCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('poly', poly),('ridge_cv', ridge_cv)])

# Use GridSearchCV to find the best polynomial degree
ridge_grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error',verbose=3)

import warnings
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=RuntimeWarning)
ridge_grid_search.fit(X, y)
warnings.filterwarnings("default", category=RuntimeWarning)


# Get the best polynomial degree and the best alpha for RidgeCV
best_degree = ridge_grid_search.best_params_['poly__degree']
best_alpha = ridge_grid_search.best_estimator_.named_steps['ridge_cv'].alpha_

# Print the results
print("RidgeCV Results:")
print("Best Polynomial Degree:", best_degree)
print("Best Alpha for RidgeCV:", best_alpha)
print("Best RMSE:", (-ridge_grid_search.best_score_)**0.5)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5] END .................poly__degree=1;, score=-263.523 total time=   0.0s
[CV 2/5] END .................poly__degree=1;, score=-276.441 total time=   0.0s
[CV 3/5] END .................poly__degree=1;, score=-266.625 total time=   0.0s
[CV 4/5] END .................poly__degree=1;, score=-289.154 total time=   0.0s
[CV 5/5] END .................poly__degree=1;, score=-252.761 total time=   0.0s
[CV 1/5] END poly__degree=2;, score=-314267376111512230821888.000 total time=   0.0s
[CV 2/5] END poly__degree=2;, score=-932116276046215774208.000 total time=   0.0s
[CV 3/5] END poly__degree=2;, score=-1321455849821151833358336.000 total time=   0.0s
[CV 4/5] END poly__degree=2;, score=-127048632761362356895744.000 total time=   0.0s
[CV 5/5] END poly__degree=2;, score=-167432140955028553728.000 total time=   0.0s
[CV 1/5] END poly__degree=3;, score=-3613012821950220115477845421432131996246178004992.000 total time=   0.8s
[CV 2
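
The cross-validation scores above worsen by many orders of magnitude as the degree increases. One contributing factor may be that polynomial expansion multiplies raw feature values together, so a column on the scale of `duration_ms` (around 2e5) yields squared terms around 4e10 unless the features are standardized first. The sketch below uses synthetic data with 11 columns (the number of features left after preprocessing) to show how fast the number of generated features grows, and how large the expanded values become with and without scaling. The synthetic column ranges are assumptions chosen to resemble the dataset.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# 10 columns in [0, 1] plus one column on a duration_ms-like scale
X_toy = np.column_stack([rng.uniform(0, 1, size=(500, 10)),
                         rng.uniform(200_000, 300_000, size=500)])
X_toy_scaled = StandardScaler().fit_transform(X_toy)

feature_counts = {}
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree)
    # Largest absolute value among the expanded features, raw vs. standardized input
    raw_max = np.abs(poly.fit_transform(X_toy)).max()
    scaled_max = np.abs(poly.fit_transform(X_toy_scaled)).max()
    feature_counts[degree] = poly.n_output_features_
    print(f"degree={degree}: {poly.n_output_features_} features, "
          f"max |value| raw={raw_max:.2e}, scaled={scaled_max:.2e}")
```

With 11 input columns and the default bias term, the expansion produces 12, 78, 364, and 1365 features for degrees 1 through 4, which also explains why fitting time rises steeply with the degree.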

In [12]:
""" Lasso Regression"""
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

X = Xreg_train_scaled
y = yreg_train

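# Define the degrees to consider in the polynomial features
# (range excludes its upper bound, so this tries degrees 1 and 2)
last_degree = 3
degrees = range(1, last_degree)

# Create a LassoCV model with cross-validation
# (a larger max_iter, e.g. 100000, can be passed if convergence warnings appear)
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1.0, 10.0])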

# Create a PolynomialFeatures transformer
poly = PolynomialFeatures()

# Perform a grid search over polynomial degrees
param_grid = {'poly__degree': degrees}

# Create a pipeline that combines PolynomialFeatures and LassoCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('poly', poly),
    ('lasso_cv', lasso_cv)
])

# Use GridSearchCV to find the best polynomial degree
lasso_grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter out ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

lasso_grid_search.fit(X, y)

# Optionally, you can reset the warning filters to their original state
warnings.filterwarnings("default", category=ConvergenceWarning)


# Get the best polynomial degree and the best alpha for LassoCV
best_degree = lasso_grid_search.best_params_['poly__degree']
best_alpha = lasso_grid_search.best_estimator_.named_steps['lasso_cv'].alpha_

# Print the results
print("\nLassoCV Results:")
print("Best Polynomial Degree:", best_degree)
print("Best Alpha for LassoCV:", best_alpha)
print("Best RMSE:", (-lasso_grid_search.best_score_)**0.5)



Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END .................poly__degree=1;, score=-263.400 total time=   0.0s
[CV 2/5] END .................poly__degree=1;, score=-276.630 total time=   0.0s
[CV 3/5] END .................poly__degree=1;, score=-266.686 total time=   0.0s
[CV 4/5] END .................poly__degree=1;, score=-289.086 total time=   0.0s
[CV 5/5] END .................poly__degree=1;, score=-253.006 total time=   0.0s
[CV 1/5] END .................poly__degree=2;, score=-241.196 total time=   0.7s
[CV 2/5] END .................poly__degree=2;, score=-247.375 total time=   0.7s
[CV 3/5] END .................poly__degree=2;, score=-251.092 total time=   0.6s
[CV 4/5] END .................poly__degree=2;, score=-270.894 total time=   0.6s
[CV 5/5] END .................poly__degree=2;, score=-236.734 total time=   0.6s

LassoCV Results:
Best Polynomial Degree: 2
Best Alpha for LassoCV: 0.01
Best RMSE: 15.79424077243227


## A Few Words About Regularization

The problem with a complex model of second order or higher is the risk of **overfitting**: the model fits the *noise* and random fluctuations in the training data rather than the underlying patterns that are truly representative of the target population.

A *solution* to the overfitting risk is **regularization**: adding a penalty term to the model's *loss function* encourages smaller parameter values or simpler parameter patterns, which discourages overfitting.

- **Lasso (Least Absolute Shrinkage and Selection Operator)** adds a penalty term $||β||_1$, which is the sum of the absolute values of the coefficients.
- **Ridge** adds a penalty term $||β||_2^2$, which is the sum of the squared values of the coefficients.

Lasso is better suited to feature selection, because it tends to drive the coefficients of irrelevant features to exactly zero. Ridge shrinks coefficients toward zero without eliminating them, which makes it the more stable choice for datasets with multicollinearity.
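
The difference between the two penalties can be seen on a small synthetic problem. The sketch below is illustrative only: the data come from `make_regression`, and the `alpha` values are arbitrary assumptions rather than the values tuned in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually drive the target
X_syn, y_syn = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=5.0, random_state=0)

# Fit both penalized models with the same (arbitrary) regularization strength
lasso = Lasso(alpha=2.0).fit(X_syn, y_syn)
ridge = Ridge(alpha=2.0).fit(X_syn, y_syn)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Coefficients set exactly to zero by Lasso:", n_zero_lasso)
print("Coefficients set exactly to zero by Ridge:", n_zero_ridge)
```

Lasso removes uninformative features outright, while Ridge keeps every feature with a small non-zero weight.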




## Evaluating polynomial regressions

In [13]:
# evaluate the model using the train and test set and different metrics
train_results_df, test_results_df = calculate_and_append_metrics('RidgeCV', ridge_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

train_results_df, test_results_df = calculate_and_append_metrics('LassoCV', lasso_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

# display the results
print("Train:")
display(train_results_df)
print("-"*70,"\n")
print("Test:")
display(test_results_df)

Train:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.214264,16.390692,13.333271,7207390000000000.0
1,RidgeCV,0.214053,16.392894,13.33609,7210138000000000.0
2,LassoCV,0.281635,15.672253,12.666545,6073200000000000.0


---------------------------------------------------------------------- 

Test:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.201381,16.912657,13.620896,7177582000000000.0
1,RidgeCV,0.201787,16.908353,13.62278,7154488000000000.0
2,LassoCV,0.268373,16.187763,12.983852,6464504000000000.0


## Applying RandomForestRegressor and XGBRegressor
We will use untuned (default-parameter) XGBoost and random forest models, as well as hyperparameter-tuned versions of both.


In [14]:
# run RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor()
rf_reg.fit(Xreg_train_scaled, yreg_train)

# run Xgboost regressor
import xgboost as xgb
xgb_reg = xgb.XGBRegressor()
xgb_reg.fit(Xreg_train_scaled, yreg_train)


In [15]:
"""
Hyperparameter Tuning the XGBoost and Random Forest Regressors
"""

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Define the parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Initialize the XGBoost regressor
xgb_reg_s = XGBRegressor()

# Create a GridSearchCV instance for XGBoost (using the fresh estimator defined above)
xgb_grid_search = GridSearchCV(estimator=xgb_reg_s, param_grid=xgb_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

# Fit the GridSearchCV on your training data
xgb_grid_search.fit(Xreg_train_scaled, yreg_train)

# Print the best parameters and the corresponding RMSE
print("Best parameters for XGBoost:")
print(xgb_grid_search.best_params_)
print("Best RMSE for XGBoost:", (-xgb_grid_search.best_score_) ** 0.5)



Fitting 5 folds for each of 18 candidates, totalling 90 fits


[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-269.112 total time=   0.1s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-281.307 total time=   0.4s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-272.632 total time=   0.1s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-289.229 total time=   0.2s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-269.671 total time=   0.1s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-251.122 total time=   0.6s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-261.373 total time=   0.3s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-257.736 total time=   0.3s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-276.017 total time=   0.3s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-250.944 total time=   0.3s
[CV 1/5] E

In [16]:
"""Random Forest Regressor Hyperparameter Tuning""" 
from sklearn.ensemble import RandomForestRegressor

# Define the parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [ 3,4, 5],
    'min_samples_split': [2, 5, 10],
}

# Initialize the Random Forest regressor
rf_reg_s = RandomForestRegressor()

# Create a GridSearchCV instance for Random Forest
rf_grid_search = GridSearchCV(estimator=rf_reg_s, param_grid=rf_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

# Fit the GridSearchCV on your training data
rf_grid_search.fit(Xreg_train_scaled, yreg_train)
                   

# Print the best parameters and the corresponding RMSE
print("\nBest parameters for Random Forest:")
print(rf_grid_search.best_params_)
print("Best RMSE for Random Forest:", (-rf_grid_search.best_score_) ** 0.5)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-257.184 total time=   1.6s
[CV 2/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-267.335 total time=   1.4s
[CV 3/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-266.972 total time=   1.6s
[CV 4/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-282.835 total time=   1.2s
[CV 5/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-258.359 total time=   1.3s
[CV 1/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-257.731 total time=   2.9s
[CV 2/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-267.403 total time=   2.7s
[CV 3/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-265.829 total time=   2.8s
[CV 4/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-282.787 total time=   2.5s
[CV 5/5] END max_depth=3, min_samples_s

## Evaluating new regression models 

In [17]:
# Calculate metrics for the 4 latest models and append them to the results DataFrame 
train_results_df, test_results_df = calculate_and_append_metrics('RandomForestRegressor', rf_reg, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)
train_results_df, test_results_df = calculate_and_append_metrics('XGBRegressor', xgb_reg, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)
train_results_df, test_results_df = calculate_and_append_metrics('RandomForestRegressor_tuned', rf_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)
train_results_df, test_results_df = calculate_and_append_metrics('XGBRegressor_tuned', xgb_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

# display the results
print("Train:")
display(train_results_df)
print("-"*70,"\n")
print("Test:")
display(test_results_df)

Train:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.214264,16.390692,13.333271,7207390000000000.0
1,RidgeCV,0.214053,16.392894,13.33609,7210138000000000.0
2,LassoCV,0.281635,15.672253,12.666545,6073200000000000.0
3,RandomForestRegressor,0.899715,5.85566,4.616772,2030969000000000.0
4,XGBRegressor,0.817735,7.894246,5.984327,2177899000000000.0
5,RandomForestRegressor_tuned,0.302973,15.437744,12.434532,5660471000000000.0
6,XGBRegressor_tuned,0.419903,14.083454,11.266502,4835413000000000.0


---------------------------------------------------------------------- 

Test:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.201381,16.912657,13.620896,7177582000000000.0
1,RidgeCV,0.201787,16.908353,13.62278,7154488000000000.0
2,LassoCV,0.268373,16.187763,12.983852,6464504000000000.0
3,RandomForestRegressor,0.307665,15.74708,12.36713,5435473000000000.0
4,XGBRegressor,0.245117,16.443025,12.831641,5371123000000000.0
5,RandomForestRegressor_tuned,0.26943,16.176066,12.916685,5821930000000000.0
6,XGBRegressor_tuned,0.300365,15.829881,12.51383,5475917000000000.0


## Choosing the best model

The untuned RandomForestRegressor has the best test-set scores ($R^2$ 0.308, RMSE 15.75, MAE 12.37), but it overfits heavily (train $R^2$ 0.90). XGBRegressor_tuned is a close second on the test set ($R^2$ 0.300, RMSE 15.83, MAE 12.51) with a much smaller gap between train and test scores, which makes it the more trustworthy choice.
Both untuned tree models (RandomForestRegressor and XGBRegressor) overfit; the remaining models show similar train and test scores. Nevertheless, all of the models have low goodness of fit.

* $R^2$ quantifies the proportion of the variance in the dependent variable that is explained by the independent variables in our model. Here the best test $R^2$ is about 0.31, which is a poor fit.
* Root Mean Square Error (RMSE) measures the typical magnitude of the errors between predicted and actual values, penalizing large errors more heavily; lower values indicate better accuracy. The grid searches used negative MSE (`neg_mean_squared_error`) as the scoring function, which ranks models the same way as RMSE.
* MAE provides a straightforward measure of how far, on average, the model's predictions are from the actual values, treating overestimations and underestimations equally. Our test MAE values range from about 12.4 to 13.6 popularity points on a 0 to 100 scale, so the predictions are only roughly accurate.
* MAPE measures the accuracy of predictions in relative terms: how much, on average, the predictions deviate from the actual values as a fraction of the actual values. The MAPE values in the tables are on the order of $10^{15}$, which is not meaningful. MAPE becomes unstable when actual values are zero, and `popularity` contains zeros, so it should not be used to compare the models here.
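
To illustrate how zero targets affect MAPE, here is a small sketch with made-up numbers (the arrays are illustrative and not taken from the dataset). scikit-learn's `mean_absolute_percentage_error` divides each absolute error by the larger of the absolute actual value and machine epsilon, so a single zero target is enough to dominate the average.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Illustrative targets; the first one is zero, like some popularity scores
y_true = np.array([0.0, 50.0, 100.0])
y_pred = np.array([10.0, 45.0, 90.0])

# A zero target is divided by machine epsilon, inflating the result enormously
mape_all = mean_absolute_percentage_error(y_true, y_pred)

# Restricting to non-zero targets gives an interpretable value
nonzero = y_true != 0
mape_nonzero = mean_absolute_percentage_error(y_true[nonzero], y_pred[nonzero])

print(f"MAPE with the zero target:    {mape_all:.3e}")
print(f"MAPE without the zero target: {mape_nonzero:.3f}")
```

When the target can be zero, scale-dependent metrics such as MAE or RMSE are usually easier to interpret than MAPE.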