# A Little Regression Challenge

In [1]:
# importing general python libraries
import numpy as np
import pandas as pd

# importing libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# importing libraries for data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# importing libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# importing libraries for model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error,mean_absolute_percentage_error

# importing libraries for model tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# importing tensorflow libraries for deep learning
import tensorflow as tf
from tensorflow import keras


## The Dataset - Spotify Songs

### Description:

In this task, we will use a sample of 150K records from the ["Spotify Dataset 1921-2020, 600k+ Tracks"](https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv) dataset, which is available on Kaggle.

### The columns:

>**Target Column** we will predict the following column:
- `popularity` (Integer ranging from 0 to 100), representing the popularity of the song on the Spotify platform.

>**Numerical Columns**:
- `id` (Id of tracks generated by Spotify)
- `acousticness` (Ranges from 0 to 1)
- `danceability` (Ranges from 0 to 1)
- `energy` (Ranges from 0 to 1)
- `duration_ms` (Integer typically ranging from 200k to 300k)
- `instrumentalness` (Ranges from 0 to 1)
- `valence` (Ranges from 0 to 1)
- `animality` (Ranges from 0 to 1)
- `tempo` (Float typically ranging from 50 to 150)
- `liveness` (Ranges from 0 to 1)
- `loudness` (Float typically ranging from -60 to 0)
- `speechiness` (Ranges from 0 to 1)
- `release_year` (Integer; extracted from the `release_date` column and used as a feature)


> **Categorical Columns** (string types):
- `explicit` (Whether the song is explicit (contains swearing or inappropriate language) or not)
  
> The following categorical columns will be removed to simplify the task (too many categories):
- `artists` (List of artists mentioned)
- `track_name` (Name of the song)
- `genre` (Genre of the song; string type, multiclass)
- `key` (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1, and so on…)
- `time_signature` (A notational convention to specify how many beats are in each bar, or measure. For example, rock music often has a time signature of 4/4, while classical music often has a time signature of 3/4 or 4/4.)
- `release_date` (The date on which the song was released)



## Loading and Preprocessing

In [2]:
reg_url = 'https://raw.githubusercontent.com/FreeDataSets/DataPool/main/tracks_150000.csv' # this is the url for the dataset
reg_df = pd.read_csv(reg_url)#.sample(100000,random_state=42) # the full 150K-row file is loaded; re-enable the commented-out .sample(...) call to work on a smaller random subset

# a preview of the dataframe
reg_df.info() 
reg_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                150000 non-null  object 
 1   name              149984 non-null  object 
 2   popularity        150000 non-null  int64  
 3   duration_ms       150000 non-null  int64  
 4   explicit          150000 non-null  int64  
 5   artists           150000 non-null  object 
 6   release_date      150000 non-null  object 
 7   danceability      150000 non-null  float64
 8   energy            150000 non-null  float64
 9   key               150000 non-null  int64  
 10  loudness          150000 non-null  float64
 11  speechiness       150000 non-null  float64
 12  acousticness      150000 non-null  float64
 13  instrumentalness  150000 non-null  float64
 14  liveness          150000 non-null  float64
 15  valence           150000 non-null  float64
 16  tempo             15

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,release_date,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,6ot2x31QPJlJ4f6AM2yHlT,Amigo Mío (Homenaje a Juan Gabriel),25,219907,0,['Ana Gabriel'],1991,0.673,0.282,7,-14.247,0.0428,0.454,0.0,0.118,0.549,126.041,4
1,5ooilm0qewfPaWkY93uEQ4,Ca C'est Gentil Ca C'est Pas Mal,0,181427,0,['Pierrette Mad'],1925,0.52,0.359,0,-11.863,0.0817,0.988,0.0,0.106,0.861,89.963,4
2,1FlNozP5jet4DrGU2Ava1l,เหมือนไม่เคย,8,181560,0,['สุเทพ วงศ์กำแหง'],1992-04-01,0.654,0.312,8,-17.238,0.0305,0.662,0.598,0.106,0.526,95.032,4
3,2KP3zqq9MQarh2WwsuonoM,Menino Bonito,50,166333,0,['Rita Lee'],1974-01-01,0.247,0.426,0,-7.775,0.0316,0.725,0.0,0.16,0.253,178.251,4
4,4mKNBGNpDA7Ldu0fSyb7MX,Lekkerkry,31,191000,0,['ZAK VAN NIEKERK'],2014-03-28,0.663,0.844,7,-4.548,0.0528,0.08,5e-06,0.176,0.911,168.093,4


In [3]:
# convert release_date to datetime, then extract the year and month from it
reg_df['release_date'] = pd.to_datetime(reg_df['release_date'])
reg_df['release_year'] = reg_df['release_date'].dt.year
reg_df['release_month'] = reg_df['release_date'].dt.month
reg_df.drop('release_date', axis=1, inplace=True) 
reg_df.head()


Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,release_year,release_month
0,6ot2x31QPJlJ4f6AM2yHlT,Amigo Mío (Homenaje a Juan Gabriel),25,219907,0,['Ana Gabriel'],0.673,0.282,7,-14.247,0.0428,0.454,0.0,0.118,0.549,126.041,4,1991,1
1,5ooilm0qewfPaWkY93uEQ4,Ca C'est Gentil Ca C'est Pas Mal,0,181427,0,['Pierrette Mad'],0.52,0.359,0,-11.863,0.0817,0.988,0.0,0.106,0.861,89.963,4,1925,1
2,1FlNozP5jet4DrGU2Ava1l,เหมือนไม่เคย,8,181560,0,['สุเทพ วงศ์กำแหง'],0.654,0.312,8,-17.238,0.0305,0.662,0.598,0.106,0.526,95.032,4,1992,4
3,2KP3zqq9MQarh2WwsuonoM,Menino Bonito,50,166333,0,['Rita Lee'],0.247,0.426,0,-7.775,0.0316,0.725,0.0,0.16,0.253,178.251,4,1974,1
4,4mKNBGNpDA7Ldu0fSyb7MX,Lekkerkry,31,191000,0,['ZAK VAN NIEKERK'],0.663,0.844,7,-4.548,0.0528,0.08,5e-06,0.176,0.911,168.093,4,2014,3
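The `release_date` column mixes year-only strings such as `1991` with full dates such as `1992-04-01`. Depending on the pandas version, `pd.to_datetime` may not accept such mixed formats without extra arguments, and year-only values end up with a month of 1 that was never in the data. The following is a minimal sketch of an alternative that slices the strings directly; the `dates` series is a hypothetical mini-sample, not the real column.

```python
import pandas as pd

# Hypothetical mini-sample mimicking the mixed formats seen in `release_date`
dates = pd.Series(["1991", "1925", "1992-04-01", "2014-03-28"])

# The first four characters are the year, whatever the precision of the date
years = dates.str[:4].astype(int)

# The month is only known when the string contains it; otherwise keep it missing
months = pd.to_numeric(dates.str[5:7], errors="coerce")

print(years.tolist())  # [1991, 1925, 1992, 2014]
print(months.isna().tolist())
```

Keeping the month missing for year-only dates avoids treating an unknown month as January, at the cost of having to impute or drop that feature later.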


In [4]:
reg_df.drop(['name', 'artists', 'id', 'release_date', 'artists_id', 'genre'], axis=1, inplace=True, errors='ignore') # Removing categorical features with more than 10 unique values; errors='ignore' skips listed columns that are not present
reg_df.info()
reg_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   popularity        150000 non-null  int64  
 1   duration_ms       150000 non-null  int64  
 2   explicit          150000 non-null  int64  
 3   danceability      150000 non-null  float64
 4   energy            150000 non-null  float64
 5   key               150000 non-null  int64  
 6   loudness          150000 non-null  float64
 7   speechiness       150000 non-null  float64
 8   acousticness      150000 non-null  float64
 9   instrumentalness  150000 non-null  float64
 10  liveness          150000 non-null  float64
 11  valence           150000 non-null  float64
 12  tempo             150000 non-null  float64
 13  time_signature    150000 non-null  int64  
 14  release_year      150000 non-null  int64  
 15  release_month     150000 non-null  int64  
dtypes: float64(9), int64

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,release_year,release_month
0,25,219907,0,0.673,0.282,7,-14.247,0.0428,0.454,0.0,0.118,0.549,126.041,4,1991,1
1,0,181427,0,0.52,0.359,0,-11.863,0.0817,0.988,0.0,0.106,0.861,89.963,4,1925,1
2,8,181560,0,0.654,0.312,8,-17.238,0.0305,0.662,0.598,0.106,0.526,95.032,4,1992,4
3,50,166333,0,0.247,0.426,0,-7.775,0.0316,0.725,0.0,0.16,0.253,178.251,4,1974,1
4,31,191000,0,0.663,0.844,7,-4.548,0.0528,0.08,5e-06,0.176,0.911,168.093,4,2014,3


In [5]:
# split the data into features and target variable 
Xreg = reg_df.drop('popularity', axis=1) # features
yreg = reg_df['popularity'] # target variable

In [6]:
# split the data into train and test sets
Xreg_train, Xreg_test, yreg_train, yreg_test = train_test_split(Xreg, yreg, test_size=0.2, random_state=42)

In [7]:
scaler_reg = StandardScaler().fit(Xreg_train)
Xreg_train_scaled = scaler_reg.transform(Xreg_train)
Xreg_test_scaled = scaler_reg.transform(Xreg_test)

## Applying a simple Linear Regression

In [8]:
# Simple linear regression
from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
lin_reg = LinearRegression()

# Train the model using the training sets
lin_reg.fit(Xreg_train_scaled, yreg_train)


## Evaluating the linear regression model

In [9]:
### Helper function to save and compare regression metrics 

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error,mean_absolute_percentage_error

def calculate_and_append_metrics(model_name, model, X_train, y_train, X_test, y_test, train_results_df, test_results_df):
    # Calculate metrics for the training dataset
    train_metrics = pd.DataFrame({
        'Model': model_name,
        'R2 Score': [r2_score(y_train, model.predict(X_train))],
        'RMSE': [mean_squared_error(y_train, model.predict(X_train), squared=False)],
        'MAE': [mean_absolute_error(y_train, model.predict(X_train))],
        'MAPE': [mean_absolute_percentage_error(y_train, model.predict(X_train))]
    })

    # Calculate metrics for the test dataset
    test_metrics = pd.DataFrame({
        'Model': model_name,
        'R2 Score': [r2_score(y_test, model.predict(X_test))],
        'RMSE': [mean_squared_error(y_test, model.predict(X_test), squared=False)],
        'MAE': [mean_absolute_error(y_test, model.predict(X_test))],
        'MAPE': [mean_absolute_percentage_error(y_test, model.predict(X_test))]
    })

    # Concatenate metrics to the respective DataFrames
    train_results_df = pd.concat([train_results_df, train_metrics], ignore_index=True)
    test_results_df = pd.concat([test_results_df, test_metrics], ignore_index=True)

    return train_results_df, test_results_df
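The helper above relies on `mean_squared_error(..., squared=False)`. That argument was deprecated in scikit-learn 1.4 and is not accepted by more recent releases, so the cell may fail depending on the installed version. A version-independent sketch is to take the square root of the MSE explicitly; the `rmse` function name is an assumption introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error computed without the `squared` keyword."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Errors are 2 and 4, so RMSE = sqrt((4 + 16) / 2) = sqrt(10)
example_rmse = rmse([3.0, 5.0], [1.0, 9.0])
print(example_rmse)
```

On scikit-learn 1.4 or later, `sklearn.metrics.root_mean_squared_error` provides the same value directly.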

In [10]:
# evaluate the model using the train and test set and different metrics

# Create empty DataFrames to store the results
train_results_df = pd.DataFrame()
test_results_df = pd.DataFrame()

# Calculate metrics for the Linear Regression model
train_results_df, test_results_df = calculate_and_append_metrics('Linear Regression', lin_reg, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

# display the results
print("Train:")
display(train_results_df)
print("-"*70,"\n")
print("Test:")
display(test_results_df)

Train:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.379368,14.476721,11.146972,4020286000000000.0


---------------------------------------------------------------------- 

Test:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.364608,14.69266,11.293555,4288415000000000.0
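The MAPE values in both tables are on the order of $10^{15}$. This happens because some songs have a popularity of exactly 0, and MAPE divides each absolute error by the actual value. The sketch below reproduces the effect on a tiny made-up example and shows one possible workaround, computing MAPE only on rows with a non-zero target. The arrays are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up targets: the first "track" has popularity 0
y_true = np.array([0.0, 25.0, 50.0])
y_pred = np.array([10.0, 20.0, 55.0])

# The zero target makes the first relative error explode
raw_mape = mean_absolute_percentage_error(y_true, y_pred)

# Possible workaround (an assumption, not the notebook's method):
# evaluate MAPE only where the target is non-zero
mask = y_true != 0
masked_mape = mean_absolute_percentage_error(y_true[mask], y_pred[mask])

print(raw_mape)
print(masked_mape)  # (5/25 + 5/50) / 2 = 0.15
```

Because of this behavior, RMSE and MAE are the more informative metrics for this target.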


## Polynomial Regression

In [11]:
""" Ridge Regression """
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV

X = Xreg_train_scaled
y = yreg_train

# Define the degrees to consider in the polynomial features
degrees = range(1, 3)

# Create a RidgeCV model with cross-validation
ridge_cv = RidgeCV([0.01, 0.1, 1, 10])

# Create a PolynomialFeatures transformer
poly = PolynomialFeatures()

# Perform a grid search over polynomial degrees
param_grid = {'poly__degree': degrees}

# Create a pipeline that combines PolynomialFeatures and RidgeCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('poly', poly),('ridge_cv', ridge_cv)])

# Use GridSearchCV to find the best polynomial degree
ridge_grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error',verbose=3)

import warnings
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=RuntimeWarning)
ridge_grid_search.fit(X, y)
warnings.filterwarnings("default", category=RuntimeWarning)


# Get the best polynomial degree and the best alpha for RidgeCV
best_degree = ridge_grid_search.best_params_['poly__degree']
best_alpha = ridge_grid_search.best_estimator_.named_steps['ridge_cv'].alpha_

# Print the results
print("RidgeCV Results:")
print("Best Polynomial Degree:", best_degree)
print("Best Alpha for RidgeCV:", best_alpha)
print("Best RMSE:", (-ridge_grid_search.best_score_)**0.5)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[CV 1/5] END .................poly__degree=1;, score=-206.539 total time=   0.2s
[CV 2/5] END .................poly__degree=1;, score=-206.059 total time=   0.3s
[CV 3/5] END .................poly__degree=1;, score=-214.474 total time=   0.2s
[CV 4/5] END .................poly__degree=1;, score=-212.714 total time=   0.2s
[CV 5/5] END .................poly__degree=1;, score=-208.377 total time=   0.2s
[CV 1/5] END .................poly__degree=2;, score=-193.044 total time=   4.1s
[CV 2/5] END .................poly__degree=2;, score=-192.724 total time=   3.4s
[CV 3/5] END .................poly__degree=2;, score=-201.650 total time=   3.5s
[CV 4/5] END .................poly__degree=2;, score=-199.318 total time=   3.2s
[CV 5/5] END .................poly__degree=2;, score=-194.684 total time=   3.4s
RidgeCV Results:
Best Polynomial Degree: 2
Best Alpha for RidgeCV: 10.0
Best RMSE: 14.010135583142546
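The last printed value is the square root of the mean negated MSE across the five folds, so it is an RMSE rather than a negative MSE. As a quick check, the sketch below recomputes it from the degree-2 fold scores in the log above, and also shows that the mean of the per-fold RMSEs is a slightly different quantity.

```python
import numpy as np

# Fold scores copied from the poly__degree=2 lines of the log (negated MSE)
neg_mse_folds = np.array([-193.044, -192.724, -201.650, -199.318, -194.684])

# Square root of the mean MSE: this is what the notebook prints
rmse_of_mean = float(np.sqrt(-neg_mse_folds.mean()))

# Mean of the per-fold RMSEs: never larger than the value above
mean_of_rmse = float(np.sqrt(-neg_mse_folds).mean())

print(round(rmse_of_mean, 3))  # 14.01
print(round(mean_of_rmse, 3))
```

Passing `scoring='neg_root_mean_squared_error'` to `GridSearchCV` would report the per-fold RMSE directly and avoid the manual square root.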


In [12]:
""" Lasso Regression"""
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

X = Xreg_train_scaled
y = yreg_train

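# Define the degrees to consider in the polynomial features
# Note: range(1, last_degree) excludes last_degree, so with last_degree = 2 only degree 1 is searched.
# Use range(1, last_degree + 1) to also try degree 2, as in the Ridge cell (at a higher fitting cost).
last_degree = 2
degrees = range(1, last_degree)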

# Create a LassoCV model with cross-validation
lasso_cv = LassoCV(alphas=[0.01,0.1, 1.0, 10.0]
                #    max_iter=100000
                   )

# Create a PolynomialFeatures transformer
poly = PolynomialFeatures()

# Perform a grid search over polynomial degrees
param_grid = {'poly__degree': degrees}

# Create a pipeline that combines PolynomialFeatures and LassoCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('poly', poly),
    ('lasso_cv', lasso_cv)
])

# Use GridSearchCV to find the best polynomial degree
lasso_grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

import warnings
from sklearn.exceptions import ConvergenceWarning

# Filter out ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

lasso_grid_search.fit(X, y)

# Optionally, you can reset the warning filters to their original state
warnings.filterwarnings("default", category=ConvergenceWarning)


# Get the best polynomial degree and the best alpha for LassoCV
best_degree = lasso_grid_search.best_params_['poly__degree']
best_alpha = lasso_grid_search.best_estimator_.named_steps['lasso_cv'].alpha_

# Print the results
print("\nLassoCV Results:")
print("Best Polynomial Degree:", best_degree)
print("Best Alpha for LassoCV:", best_alpha)
print("Best RMSE:", (-lasso_grid_search.best_score_)**0.5)



Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END .................poly__degree=1;, score=-206.537 total time=   0.5s
[CV 2/5] END .................poly__degree=1;, score=-206.062 total time=   0.4s
[CV 3/5] END .................poly__degree=1;, score=-214.484 total time=   0.3s
[CV 4/5] END .................poly__degree=1;, score=-212.711 total time=   0.3s
[CV 5/5] END .................poly__degree=1;, score=-208.379 total time=   0.3s

LassoCV Results:
Best Polynomial Degree: 1
Best Alpha for LassoCV: 0.01
Best RMSE: 14.478770678641188


## A Few Words About Regularization

The problem with a complex model of second order or higher is the risk of **overfitting**: the model fits the *noise* and random fluctuations in the training data rather than capturing the underlying patterns that are truly representative of the target population.

A *solution* to the overfitting risk is **regularization**: adding a penalty term to the model's *loss function* encourages smaller parameter values or simpler parameter patterns, which discourages overfitting.

**Lasso (Least Absolute Shrinkage and Selection Operator)** adds a penalty term $\|\beta\|_1$, which is the sum of the absolute values of the coefficients.

**Ridge** adds a penalty term $\|\beta\|_2^2$, which is the sum of the squared values of the coefficients.

Lasso is better suited to feature selection, because it tends to drive the coefficients of irrelevant features to exactly zero. Ridge shrinks coefficients toward zero without eliminating them, which makes it the more stable choice for datasets with multicollinearity.
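To make the difference between the two penalties concrete, here is a small sketch on synthetic data in which only two of ten features influence the target. The data, the `alpha` values and the variable names are all assumptions chosen for illustration; they are unrelated to the Spotify dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 10 features, but only the first two drive the target
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that were set to exactly zero by each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

Lasso typically zeroes out the irrelevant features here while shrinking the two informative ones, whereas Ridge keeps small non-zero weights on every feature.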




## Evaluating polynomial regressions

In [13]:
# evaluate the model using the train and test set and different metrics
train_results_df, test_results_df = calculate_and_append_metrics('RidgeCV', ridge_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

train_results_df, test_results_df = calculate_and_append_metrics('LassoCV', lasso_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

# display the results
print("Train:")
display(train_results_df)
print("-"*70,"\n")
print("Test:")
display(test_results_df)

Train:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.379368,14.476721,11.146972,4020286000000000.0
1,RidgeCV,0.420357,13.990506,10.809304,3630854000000000.0
2,LassoCV,0.379361,14.476797,11.146772,4021165000000000.0


---------------------------------------------------------------------- 

Test:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.364608,14.69266,11.293555,4288415000000000.0
1,RidgeCV,0.408121,14.180653,10.941115,3851846000000000.0
2,LassoCV,0.364629,14.692417,11.293287,4289202000000000.0


## Applying RandomForestRegressor and XGBRegressor
We will use XGBoost and random forest models with their default hyperparameters, and then hyperparameter-tuned versions of both.


In [14]:
# run RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor()
rf_reg.fit(Xreg_train_scaled, yreg_train)

# run Xgboost regressor
import xgboost as xgb
xgb_reg = xgb.XGBRegressor()
xgb_reg.fit(Xreg_train_scaled, yreg_train)


In [15]:
"""
Hyperparameter Tuning the XGBoost and Random Forest Regressors
"""

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Define the parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
}

# Initialize the XGBoost regressor
xgb_reg_s = XGBRegressor()

# Create a GridSearchCV instance for XGBoost, using the fresh estimator initialized above
xgb_grid_search = GridSearchCV(estimator=xgb_reg_s, param_grid=xgb_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

# Fit the GridSearchCV on your training data
xgb_grid_search.fit(Xreg_train_scaled, yreg_train)

# Print the best parameters and the corresponding RMSE
print("Best parameters for XGBoost:")
print(xgb_grid_search.best_params_)
print("Best RMSE for XGBoost:", (-xgb_grid_search.best_score_) ** 0.5)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-218.508 total time=   1.3s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-219.373 total time=   1.1s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-225.924 total time=   1.2s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-222.784 total time=   0.9s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-220.407 total time=   0.8s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-195.326 total time=   1.4s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-195.743 total time=   1.5s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-203.230 total time=   1.3s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-200.308 total time=   1.3s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=200;, score=-197.688 total time=   1.3s
[CV 1/5] E

In [16]:
"""Random Forest Regressor Hyperparameter Tuning""" 
from sklearn.ensemble import RandomForestRegressor

# Define the parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [ 3,4, 5],
    'min_samples_split': [2, 5, 10],
}

# Initialize the Random Forest regressor
rf_reg_s = RandomForestRegressor()

# Create a GridSearchCV instance for Random Forest
rf_grid_search = GridSearchCV(estimator=rf_reg_s, param_grid=rf_param_grid, cv=5, scoring='neg_mean_squared_error', verbose=3)

# Fit the GridSearchCV on your training data
rf_grid_search.fit(Xreg_train_scaled, yreg_train)
                   

# Print the best parameters and the corresponding RMSE
print("\nBest parameters for Random Forest:")
print(rf_grid_search.best_params_)
print("Best RMSE for Random Forest:", (-rf_grid_search.best_score_) ** 0.5)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-205.082 total time=  27.5s
[CV 2/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-203.947 total time=  28.9s
[CV 3/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-212.118 total time=  22.1s
[CV 4/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-209.800 total time=  25.3s
[CV 5/5] END max_depth=3, min_samples_split=2, n_estimators=100;, score=-206.726 total time=  23.4s
[CV 1/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-205.211 total time=  46.9s
[CV 2/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-203.881 total time=  56.4s
[CV 3/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-212.173 total time=  45.9s
[CV 4/5] END max_depth=3, min_samples_split=2, n_estimators=200;, score=-209.787 total time=  54.6s
[CV 5/5] END max_depth=3, min_samples_s
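Each random forest fit in the log above takes tens of seconds, and the full grid requires 90 of them. `RandomizedSearchCV`, which is imported at the top of the notebook but not used, samples only a fixed number of combinations. The sketch below shows the idea on a small synthetic dataset so that it runs quickly; the data, the reduced `n_estimators` values and `n_iter=4` are assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (stand-in for the real training data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Same kind of search space as the grid above, with smaller forests for speed
param_distributions = {
    'n_estimators': [20, 50],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
}

# Evaluate only 4 of the 18 possible combinations
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=0,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best RMSE:", -search.best_score_)
```

The trade-off is that a random subset may miss the best combination, so `n_iter` should grow with the size of the search space.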

## Evaluating new regression models 

In [17]:
# Calculate metrics for the 4 latest models and append them to the results DataFrame 
train_results_df, test_results_df = calculate_and_append_metrics('RandomForestRegressor', rf_reg, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)
train_results_df, test_results_df = calculate_and_append_metrics('XGBRegressor', xgb_reg, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)
train_results_df, test_results_df = calculate_and_append_metrics('RandomForestRegressor_tuned', rf_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)
train_results_df, test_results_df = calculate_and_append_metrics('XGBRegressor_tuned', xgb_grid_search.best_estimator_, Xreg_train_scaled, yreg_train, Xreg_test_scaled, yreg_test, train_results_df, test_results_df)

# display the results
print("Train:")
display(train_results_df)
print("-"*70,"\n")
print("Test:")
display(test_results_df)

Train:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.379368,14.476721,11.146972,4020286000000000.0
1,RidgeCV,0.420357,13.990506,10.809304,3630854000000000.0
2,LassoCV,0.379361,14.476797,11.146772,4021165000000000.0
3,RandomForestRegressor,0.929776,4.869643,3.629849,840067400000000.0
4,XGBRegressor,0.582,11.880669,8.986569,1976237000000000.0
5,RandomForestRegressor_tuned,0.428732,13.889061,10.503918,2890572000000000.0
6,XGBRegressor_tuned,0.558086,12.215797,9.252229,2099336000000000.0


---------------------------------------------------------------------- 

Test:


Unnamed: 0,Model,R2 Score,RMSE,MAE,MAPE
0,Linear Regression,0.364608,14.69266,11.293555,4288415000000000.0
1,RidgeCV,0.408121,14.180653,10.941115,3851846000000000.0
2,LassoCV,0.364629,14.692417,11.293287,4289202000000000.0
3,RandomForestRegressor,0.501104,13.019212,9.800499,2439984000000000.0
4,XGBRegressor,0.487759,13.192184,9.928907,2537362000000000.0
5,RandomForestRegressor_tuned,0.419436,14.044447,10.610486,3087203000000000.0
6,XGBRegressor_tuned,0.495152,13.096637,9.885771,2508373000000000.0


## Choosing the best model

The best-performing model on the test set is the RandomForestRegressor with default hyperparameters: it has the lowest RMSE (13.02) and MAE (9.80) and the highest $R^2$ score (0.50). Its training $R^2$ of 0.93 is far above its test score, however, so it overfits heavily. The tuned XGBRegressor reaches nearly the same test performance ($R^2$ of 0.495) with a much smaller gap between train and test.

* $R^2$ quantifies the proportion of the variance in the dependent variable that is explained by the independent variables in our model. Even the best model explains only about half of the variance in popularity, which is a modest fit.
* Root Mean Square Error (RMSE) measures the typical magnitude of the errors between predicted and actual values, with lower values indicating better accuracy. All grid searches above were scored on negative MSE, so RMSE is effectively the metric the models were selected on.
* MAE is the average absolute distance between the model's predictions and the actual values, treating overestimations and underestimations equally. Our test MAE values lie between about 9.8 and 11.3 points on the 0 to 100 popularity scale.
* MAPE expresses the average deviation of the predictions as a fraction of the actual values. The MAPE values in the tables are on the order of $10^{15}$ and are not meaningful here: the target contains songs with a popularity of 0, and dividing by these zero actual values inflates the metric. MAPE should therefore not be used to compare models on this dataset.
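Test scores alone do not show how much each model overfits. A simple way to see it is to subtract the test $R^2$ from the train $R^2$ for every model. The sketch below does this with the values copied from the two results tables above; the `Gap` column name is introduced here for illustration.

```python
import pandas as pd

# R2 scores copied from the train and test results tables
scores = pd.DataFrame({
    'Model': ['Linear Regression', 'RidgeCV', 'LassoCV', 'RandomForestRegressor',
              'XGBRegressor', 'RandomForestRegressor_tuned', 'XGBRegressor_tuned'],
    'R2 train': [0.379368, 0.420357, 0.379361, 0.929776, 0.582, 0.428732, 0.558086],
    'R2 test': [0.364608, 0.408121, 0.364629, 0.501104, 0.487759, 0.419436, 0.495152],
})

# A large positive gap means the model fits the training data much better than unseen data
scores['Gap'] = scores['R2 train'] - scores['R2 test']

print(scores.sort_values('Gap', ascending=False).to_string(index=False))
```

The default RandomForestRegressor shows by far the largest gap (about 0.43), while the linear models and the tuned random forest stay close to zero.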