<span style="font-size: 40px; color: Red">Problem Statement</span>


## Problem Definition
* The goal of this project is to develop a class attendance predictor for gym classes, leveraging historical attendance data and relevant features. The predictor will assist gym management in estimating the number of participants likely to attend a given class. By utilizing machine learning techniques, the project aims to provide accurate attendance predictions that can facilitate resource allocation, class scheduling, and overall gym operation optimization.


## Data Source
https://www.kaggle.com/datasets/nithilaa/fitness-analysis

<span style="font-size: 30px; color: Green">Importing Libaries</span>

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

import joblib

<span style="font-size: 30px; color: Green">Loading the Dataset</span>

In [2]:
df = pd.read_csv("Crowdedness_at_Campus_Gym.csv")

<span style="font-size: 30px; color: Green">Data Preprocessing</span>

### a). Data Inspection

In [3]:
df.head(3)

Unnamed: 0,number_people,date,timestamp,day_of_week,is_weekend,is_holiday,temperature,is_start_of_semester,is_during_semester,month,hour
0,37,2015-08-14 17:00:11-07:00,61211,4,0,0,71.76,0,0,8,17
1,45,2015-08-14 17:20:14-07:00,62414,4,0,0,71.76,0,0,8,17
2,40,2015-08-14 17:30:15-07:00,63015,4,0,0,71.76,0,0,8,17


In [4]:
# rows and columns( or no. of records)
df.shape

(62184, 11)

In [None]:
df.hist(figsize=(12, 10))
plt.show()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   number_people         62184 non-null  int64  
 1   date                  62184 non-null  object 
 2   timestamp             62184 non-null  int64  
 3   day_of_week           62184 non-null  int64  
 4   is_weekend            62184 non-null  int64  
 5   is_holiday            62184 non-null  int64  
 6   temperature           62184 non-null  float64
 7   is_start_of_semester  62184 non-null  int64  
 8   is_during_semester    62184 non-null  int64  
 9   month                 62184 non-null  int64  
 10  hour                  62184 non-null  int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 5.2+ MB


In [6]:
df.describe()

Unnamed: 0,number_people,timestamp,day_of_week,is_weekend,is_holiday,temperature,is_start_of_semester,is_during_semester,month,hour
count,62184.0,62184.0,62184.0,62184.0,62184.0,62184.0,62184.0,62184.0,62184.0,62184.0
mean,29.072543,45799.437958,2.982504,0.28287,0.002573,58.557108,0.078831,0.660218,7.439824,12.23646
std,22.689026,24211.275891,1.996825,0.450398,0.05066,6.316396,0.269476,0.473639,3.445069,6.717631
min,0.0,0.0,0.0,0.0,0.0,38.14,0.0,0.0,1.0,0.0
25%,9.0,26624.0,1.0,0.0,0.0,55.0,0.0,0.0,5.0,7.0
50%,28.0,46522.5,3.0,0.0,0.0,58.34,0.0,1.0,8.0,12.0
75%,43.0,66612.0,5.0,1.0,0.0,62.28,0.0,1.0,10.0,18.0
max,145.0,86399.0,6.0,1.0,1.0,87.17,1.0,1.0,12.0,23.0


### b). Data Cleaning

In [7]:
# Checking for dublicate rows
duplicate_rows = df[df.duplicated()]
duplicate_rows

Unnamed: 0,number_people,date,timestamp,day_of_week,is_weekend,is_holiday,temperature,is_start_of_semester,is_during_semester,month,hour


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   number_people         62184 non-null  int64  
 1   date                  62184 non-null  object 
 2   timestamp             62184 non-null  int64  
 3   day_of_week           62184 non-null  int64  
 4   is_weekend            62184 non-null  int64  
 5   is_holiday            62184 non-null  int64  
 6   temperature           62184 non-null  float64
 7   is_start_of_semester  62184 non-null  int64  
 8   is_during_semester    62184 non-null  int64  
 9   month                 62184 non-null  int64  
 10  hour                  62184 non-null  int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 5.2+ MB


In [9]:
df.head(2)

Unnamed: 0,number_people,date,timestamp,day_of_week,is_weekend,is_holiday,temperature,is_start_of_semester,is_during_semester,month,hour
0,37,2015-08-14 17:00:11-07:00,61211,4,0,0,71.76,0,0,8,17
1,45,2015-08-14 17:20:14-07:00,62414,4,0,0,71.76,0,0,8,17


### C). Feature Engineering

In [10]:
# Column Year consisting of Year alone
# df['Year'] = df['date'].str[:4]

# Column Date consisting of Date of the month alone
df['Date'] = df['date'].str[8:11]

# Column Time
df['minutes'] = df['date'].str[14:-9]

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   number_people         62184 non-null  int64  
 1   date                  62184 non-null  object 
 2   timestamp             62184 non-null  int64  
 3   day_of_week           62184 non-null  int64  
 4   is_weekend            62184 non-null  int64  
 5   is_holiday            62184 non-null  int64  
 6   temperature           62184 non-null  float64
 7   is_start_of_semester  62184 non-null  int64  
 8   is_during_semester    62184 non-null  int64  
 9   month                 62184 non-null  int64  
 10  hour                  62184 non-null  int64  
 11  Date                  62184 non-null  object 
 12  minutes               62184 non-null  object 
dtypes: float64(1), int64(9), object(3)
memory usage: 6.2+ MB


In [12]:
df.head(2)

Unnamed: 0,number_people,date,timestamp,day_of_week,is_weekend,is_holiday,temperature,is_start_of_semester,is_during_semester,month,hour,Date,minutes
0,37,2015-08-14 17:00:11-07:00,61211,4,0,0,71.76,0,0,8,17,14,0
1,45,2015-08-14 17:20:14-07:00,62414,4,0,0,71.76,0,0,8,17,14,20


In [13]:
# Drop columns
df = df.drop(columns=["date","timestamp"], axis=1)

In [14]:
df.head(2)

Unnamed: 0,number_people,day_of_week,is_weekend,is_holiday,temperature,is_start_of_semester,is_during_semester,month,hour,Date,minutes
0,37,4,0,0,71.76,0,0,8,17,14,0
1,45,4,0,0,71.76,0,0,8,17,14,20


In [15]:
# Rearrange the columns
columns = ["month", "Date", "day_of_week", "hour", "minutes", "temperature", "is_weekend", "is_holiday", "is_start_of_semester", "is_during_semester", "number_people"]
df = df[columns]

In [16]:
# Rename columns
df.columns = [cols.capitalize() for cols in df.columns]

In [17]:
df.head(2)

Unnamed: 0,Month,Date,Day_of_week,Hour,Minutes,Temperature,Is_weekend,Is_holiday,Is_start_of_semester,Is_during_semester,Number_people
0,8,14,4,17,0,71.76,0,0,0,0,37
1,8,14,4,17,20,71.76,0,0,0,0,45


In [18]:
df.shape

(62184, 11)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Month                 62184 non-null  int64  
 1   Date                  62184 non-null  object 
 2   Day_of_week           62184 non-null  int64  
 3   Hour                  62184 non-null  int64  
 4   Minutes               62184 non-null  object 
 5   Temperature           62184 non-null  float64
 6   Is_weekend            62184 non-null  int64  
 7   Is_holiday            62184 non-null  int64  
 8   Is_start_of_semester  62184 non-null  int64  
 9   Is_during_semester    62184 non-null  int64  
 10  Number_people         62184 non-null  int64  
dtypes: float64(1), int64(8), object(2)
memory usage: 5.2+ MB


<span style="font-size: 30px; color: Green">Encode categorical variables</span>

In [22]:
# label encoding for year, since this column has an ordinal r/ship(In order)
# label_encoder = LabelEncoder()
# df["Year"] = label_encoder.fit_transform(df["Year"])

In [20]:
# Converting Date and Minutes columns as int datatype
df["Date"] = df["Date"].astype(int)
df["Minutes"] = df["Minutes"].astype(int)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Month                 62184 non-null  int64  
 1   Date                  62184 non-null  int64  
 2   Day_of_week           62184 non-null  int64  
 3   Hour                  62184 non-null  int64  
 4   Minutes               62184 non-null  int64  
 5   Temperature           62184 non-null  float64
 6   Is_weekend            62184 non-null  int64  
 7   Is_holiday            62184 non-null  int64  
 8   Is_start_of_semester  62184 non-null  int64  
 9   Is_during_semester    62184 non-null  int64  
 10  Number_people         62184 non-null  int64  
dtypes: float64(1), int64(10)
memory usage: 5.2 MB


<span style="font-size: 30px; color: Green">Data Scaling and Normalization</span>

In [22]:
# Using Normalization since the data have outliers
# Normalizing all the columns except the target column
data = df.drop(["Number_people"], axis=1)
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

In [23]:
data["Number_people"] = df["Number_people"]

In [24]:
data.head(2)

Unnamed: 0,Month,Date,Day_of_week,Hour,Minutes,Temperature,Is_weekend,Is_holiday,Is_start_of_semester,Is_during_semester,Number_people
0,0.636364,0.433333,0.666667,0.73913,0.0,0.685703,0.0,0.0,0.0,0.0,37
1,0.636364,0.433333,0.666667,0.73913,0.338983,0.685703,0.0,0.0,0.0,0.0,45


In [25]:
data.to_csv("newdf.csv", index=False)

<span style="font-size: 30px; color: Green">Reload the Dataset</span>

In [26]:
df = pd.read_csv("newdf.csv")
df.head(2)

Unnamed: 0,Month,Date,Day_of_week,Hour,Minutes,Temperature,Is_weekend,Is_holiday,Is_start_of_semester,Is_during_semester,Number_people
0,0.636364,0.433333,0.666667,0.73913,0.0,0.685703,0.0,0.0,0.0,0.0,37
1,0.636364,0.433333,0.666667,0.73913,0.338983,0.685703,0.0,0.0,0.0,0.0,45


<span style="font-size: 30px; color: Green">Data Splitting</span>

In [27]:
# Training and Testing data
X = df.drop(["Number_people"], axis=1)

# Target column
y = df["Number_people"]

In [28]:
# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<span style="font-size: 30px; color: Green">Model Selection and Training</span>

## Simple Linear Regression

In [29]:
# Simple Linear Regression on Day_of_week
X_train_ind_var = X_train["Day_of_week"].values.reshape(-1, 1)
X_test_ind_var = X_test["Day_of_week"].values.reshape(-1, 1)

# fit model on training data
slr_model = LinearRegression()
slr_model.fit(X_train_ind_var, y_train)

# predict on testing data
slr_y_pred = slr_model.predict(X_test_ind_var)

### SLR Model evaluation

In [30]:
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, slr_y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, slr_y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, slr_y_pred)}")

Mean Squared Error(MSE): 504.3150135497321
Mean Absolute Error(MAE): 18.188769881696313
R2_Score(R-Squared Score): 0.024248315186191194


In [31]:
# Simple Linear Regression on Temperature
X_train_ind_var = X_train["Temperature"].values.reshape(-1, 1)
X_test_ind_var = X_test["Temperature"].values.reshape(-1, 1)

# fit model on training data
slr_model = LinearRegression()
slr_model.fit(X_train_ind_var, y_train)

# predict on testing data
slr_y_pred = slr_model.predict(X_test_ind_var)

print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, slr_y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, slr_y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, slr_y_pred)}")

Mean Squared Error(MSE): 439.2203062328981
Mean Absolute Error(MAE): 16.77973429289268
R2_Score(R-Squared Score): 0.15019394169012834


In [32]:
# Simple Linear Regression on Hour
X_train_ind_var = X_train["Hour"].values.reshape(-1, 1)
X_test_ind_var = X_test["Hour"].values.reshape(-1, 1)

# fit model on training data
slr_model = LinearRegression()
slr_model.fit(X_train_ind_var, y_train)

# predict on testing data
slr_y_pred = slr_model.predict(X_test_ind_var)

print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, slr_y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, slr_y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, slr_y_pred)}")

Mean Squared Error(MSE): 360.5633548602126
Mean Absolute Error(MAE): 14.67565188781765
R2_Score(R-Squared Score): 0.302379878579051


## Multiple Linear Regression

In [33]:
mlr_model = LinearRegression()

# fit the model on training data
mlr_model.fit(X_train, y_train)

# Predict on testing data
mlr_y_pred = mlr_model.predict(X_test)

### MLR Model Evaluation

In [34]:
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, mlr_y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, mlr_y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, mlr_y_pred)}")

Mean Squared Error(MSE): 250.27428523617226
Mean Absolute Error(MAE): 12.09120651327613
R2_Score(R-Squared Score): 0.5157678258161058


## Polynomial Regression

In [38]:
poly_ft = PolynomialFeatures()

# Transform input features to poly features
X_train_poly = poly_ft.fit_transform(X_train)
X_test_poly = poly_ft.transform(X_test)

# fit on Linear regression model
lr_model = LinearRegression()
lr_model.fit(X_train_poly, y_train)

# predict on testing data
poly_y_pred = lr_model.predict(X_test_poly)

### PR Model Evaluation

In [39]:
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, poly_y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, poly_y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, poly_y_pred)}")

Mean Squared Error(MSE): 158.4265352575406
Mean Absolute Error(MAE): 9.320230343594968
R2_Score(R-Squared Score): 0.6934753982264392


## Ridge regression (L2 regularization)

In [40]:
ridge = Ridge()
# fit the model on training data
ridge.fit(X_train, y_train)
# predict the model on testing data
y_ridge_pred = ridge.predict(X_test)

### L2 Model Evaluation

In [41]:
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_ridge_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_ridge_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_ridge_pred)}")

Mean Squared Error(MSE): 202.22331735698506
Mean Absolute Error(MAE): 10.631339907427238
R2_Score(R-Squared Score): 0.6087371239836044


## Lasso Regression(L1 regularization)

In [42]:
lasso = Lasso()
# fit the model on training data
lasso.fit(X_train, y_train)
# predict the model on testing data
y_lasso_pred = lasso.predict(X_test)

### L1 Model Evaluation

In [43]:
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_lasso_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_lasso_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_lasso_pred)}")

Mean Squared Error(MSE): 294.20955092063457
Mean Absolute Error(MAE): 13.273204394432952
R2_Score(R-Squared Score): 0.4307616127101205


## Support Vector Regression (SVR)

In [44]:
svr = SVR()
# fit the model on training data
svr.fit(X_train, y_train)
# predict the model on testing data
y_svr_pred = svr.predict(X_test)

# model evaluation
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_svr_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_svr_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_svr_pred)}")

Mean Squared Error(MSE): 143.88256784440776
Mean Absolute Error(MAE): 8.594186756555255
R2_Score(R-Squared Score): 0.7216151527964112


## Decision Tree Regression

In [45]:
dtree = DecisionTreeRegressor()

# fit the model on training data
dtree.fit(X_train, y_train)

# predict the model on testing data
y_dtree_pred = dtree.predict(X_test)

# model evaluation
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_dtree_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_dtree_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_dtree_pred)}")

Mean Squared Error(MSE): 51.888035699927634
Mean Absolute Error(MAE): 4.8435313982471655
R2_Score(R-Squared Score): 0.8996067202133956


## Random Forest Regression

In [35]:
rfr = RandomForestRegressor()
# fit the model on training data
rfr.fit(X_train, y_train)
# predict on testing data
y_rfr_pred = rfr.predict(X_test)

# model evaluation
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_rfr_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_rfr_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_rfr_pred)}")

Mean Squared Error(MSE): 40.97559699434984
Mean Absolute Error(MAE): 4.283123824073329
R2_Score(R-Squared Score): 0.920720171461749


<span style="font-size: 30px; color: Green">Model Validation and Tuning</span>

### Random Forest Regression- Grid Search

In [36]:
# Hyperparameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}
# Cross-Validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Model
model = RandomForestRegressor()
# Grid Search Object
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='neg_mean_squared_error')
# Fit on Training Data
grid_search.fit(X_train, y_train)
# best hyperparameters and best score
best_params = grid_search.best_params_
best_score = -grid_search.best_score_ 
# Best model
best_model = grid_search.best_estimator_

# Predict on the test set using the best model
y_pred = best_model.predict(X_test)

# Evaluate the model's performance on the test set
print(f"Best Hyperparameters: {best_params}")
print(f"Best Mean Squared Error (MSE): {best_score}")
print(f"Mean Squared Error(MSE) on Test Data: {mean_squared_error(y_test, y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_pred)}")

540 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
540 fits failed with the following error:
Traceback (most recent call last):
  File "/home/jane/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jane/.local/lib/python3.8/site-packages/sklearn/base.py", line 1144, in wrapper
    estimator._validate_params()
  File "/home/jane/.local/lib/python3.8/site-packages/sklearn/base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "/home/jane/.local/lib/python3.8/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise 

Best Hyperparameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Mean Squared Error (MSE): 38.551638991542404
Mean Squared Error(MSE) on Test Data: 33.744080368761445
Mean Absolute Error(MAE): 4.018918791662359
R2_Score(R-Squared Score): 0.9347117527980067


### Random Forest Regression Randomized Search

In [49]:
# Hyperparameters
param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
# Cross-Validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# Model
model = RandomForestRegressor()
# RRandomized Search Object
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=100, scoring='neg_mean_squared_error', cv=cv, random_state=42)
# Fit on Training Data
random_search.fit(X_train, y_train)
# best hyperparameters and best score
best_params = random_search.best_params_
best_score = -random_search.best_score_ 
# Best model
best_model = random_search.best_estimator_

# Predict on the test set using the best model
y_pred = best_model.predict(X_test)

# Evaluate the model's performance on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Best Hyperparameters: {best_params}")
print(f"Best Mean Squared Error (MSE): {best_score}")
print(f"Mean Squared Error (MSE) on Test Data: {mse}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_pred)}")

Best Hyperparameters: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 30, 'bootstrap': True}
Best Mean Squared Error (MSE): 32.699236963579565
Mean Squared Error (MSE) on Test Data: 30.06516827357286


Best Hyperparameters: {'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Mean Squared Error (MSE): 33.168489516243554
Mean Squared Error(MSE) on Test Data: 30.22181752515541
Mean Absolute Error(MAE): 3.833016853513135
R2_Score(R-Squared Score): 0.9415266478768078


Best Hyperparameters: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 30, 'bootstrap': True}
Best Mean Squared Error (MSE): 32.699236963579565
Mean Squared Error (MSE) on Test Data: 30.06516827357286

<span style="font-size: 30px; color: Green">Model Evaluation</span>

In [43]:
# From Grid Search
# Best Hyperparameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
best_model = RandomForestRegressor(n_estimators=200, min_samples_split=2, min_samples_leaf=1, max_depth=None)
# Fit the model on training data
best_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluation scores
print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_pred)}")
print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_pred)}")
print(f"R2_Score(R-Squared Score): {r2_score(y_test, y_pred)}")

Mean Squared Error(MSE): 40.60656049070724
Mean Absolute Error(MAE): 4.271315094106041
R2_Score(R-Squared Score): 0.9214341854817809


In [46]:
joblib.dump(best_model, "best_model.pkl")

['best_model.pkl']