# **Modelling**

**In this notebook we will be building models for the prepared dataset. I've used a range of regression algorithms, such as linear regression and gradient boosting, and I've done hyperparameter tuning to get the best model possible.**



In [58]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [59]:
# Load the cleaned feature data
df = pd.read_csv('/content/cleaned_data.csv')
df.head()

# Print the library versions so the exact environment can be recreated in the application
import sklearn

print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)
print("scikit-learn version:", sklearn.__version__)


numpy version: 1.25.2
pandas version: 2.0.3
scikit-learn version: 1.2.2


In [60]:
# reading the home price index data (CSUSHPISA), which is the target variable
df_prices = pd.read_csv('/content/CSUSHPISA.csv')
df_prices.head()

Unnamed: 0,DATE,CSUSHPISA
0,2004-01-01,141.646
1,2004-02-01,143.191
2,2004-03-01,145.058
3,2004-04-01,146.592
4,2004-05-01,148.185


In [61]:
# Concatenating the prices with the feature dataframe column-wise gives us the complete dataframe
df = pd.concat([df,df_prices],axis = 1)
df.drop(columns = ['DATE'],inplace = True)
df.head()

Unnamed: 0,FEDFUNDS,construction_materials,Income,Total Units,Inflation,Unemployment,Year,Month,CSUSHPISA
0,1.0,150.0,11051.2,1709.0,1.949813,5.1,2004,1,141.646
1,1.01,153.4,11071.0,1718.0,2.037157,5.0,2004,2,143.191
2,1.0,156.5,11115.6,1794.0,2.126567,5.2,2004,3,145.058
3,1.0,160.1,11153.3,1938.0,2.247883,5.0,2004,4,146.592
4,1.0,162.7,11208.9,1893.0,2.228612,5.0,2004,5,148.185


In [62]:
# checking for null values
df.isna().sum()

FEDFUNDS                  0
construction_materials    0
Income                    0
Total Units               0
Inflation                 0
Unemployment              0
Year                      0
Month                     0
CSUSHPISA                 1
dtype: int64

In [63]:
# removing all the null values
df.dropna(subset=['CSUSHPISA'], inplace=True)

**Now we will proceed to building models, since we have already sourced the relevant data and cleaned it.**

## Model Building



```
Algorithms that we'll use in this case are

1. Linear Regression
2. Lasso Regression
3. Ridge Regression
4. Elastic Net Regression
5. Decision Tree Regressor
6. Random Forest Regressor
7. Gradient Boosting
8. Adaboost
9. XG Boost
10. Voting Regressor
11. Stacking Regressor
```





### Linear Regression

Based on the evaluation metrics, the linear regression model performs well on this dataset, with a test-set R-squared score of 0.99 and a mean squared error of 28.04. A high score alone does not rule out overfitting, so we also try regularization techniques such as Lasso and Ridge regression. These add a penalty term to the model's objective function, which shrinks the coefficients, reduces the complexity of the model, and helps prevent overfitting.
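To make the effect of the penalty term concrete, here is a minimal sketch for the simplest possible case: one feature, no intercept, and an L2 (ridge) penalty. The toy data and the helper name `ridge_slope` are assumptions made for illustration and are not part of the notebook.

```python
# One-feature, no-intercept least squares with an L2 (ridge) penalty.
# Minimising sum((y - w*x)**2) + alpha * w**2 gives
#     w = sum(x*y) / (sum(x*x) + alpha)
def ridge_slope(x, y, alpha):
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

x = [0.0, 0.25, 0.5, 0.75, 1.0]   # toy feature, already scaled to [0, 1]
y = [1.0, 2.1, 2.9, 4.2, 5.0]     # toy target, made up for illustration

# A larger alpha pulls the fitted slope towards zero.
for alpha in (0.0, 0.1, 1.0, 10.0):
    print(f"alpha={alpha:>5}: slope={ridge_slope(x, y, alpha):.3f}")
```

With alpha=0 this is ordinary least squares; as alpha grows the slope shrinks, which is the complexity reduction described above.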

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score

In [65]:
X = df.drop('CSUSHPISA', axis=1)
y = df['CSUSHPISA']

In [66]:
scalar = MinMaxScaler()
X = scalar.fit_transform(X)

In [67]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
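In the cells above the scaler is fitted on all of X before the split, so the test rows influence the scaling statistics. A common alternative is to learn the minimum and maximum from the training rows only and reuse them on the test rows. The sketch below shows that idea in plain Python; `fit_min_max` and `apply_min_max` are hypothetical helpers, and the rows are a few FEDFUNDS and construction_materials values from the table above.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the given rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, stats):
    """Scale rows with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, (lo, hi) in zip(row, stats)
        ])
    return scaled

train_rows = [[1.0, 150.0], [1.01, 153.4], [1.0, 156.5]]  # treated as training rows
test_rows = [[1.0, 160.1]]                                 # treated as a held-out row

stats = fit_min_max(train_rows)              # statistics come from training data only
train_scaled = apply_min_max(train_rows, stats)
test_scaled = apply_min_max(test_rows, stats)
print(test_scaled)
```

Held-out values can then fall outside the [0, 1] range, which is expected: the test set is scaled the way genuinely unseen data would be.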

In [68]:
# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

In [69]:
predictions = model.predict(X_test)

# Evaluate the model using mean squared error and R-squared score
mse_linear = mean_squared_error(y_test, predictions)
r2_linear = r2_score(y_test, predictions)

print(f"Mean squared error: {mse_linear:.2f}")
print(f"R-squared score: {r2_linear:.2f}")

Mean squared error: 28.04
R-squared score: 0.99
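For reference, the two metrics printed above can be computed by hand. The sketch below uses plain Python; the function names are hypothetical, the true values are the first five index values shown earlier, and the predictions are made up for illustration.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [141.646, 143.191, 145.058, 146.592, 148.185]  # index values from the table above
y_pred = [142.0, 143.0, 145.5, 146.0, 148.5]            # made-up predictions

print(f"MSE: {mean_squared_error_manual(y_true, y_pred):.4f}")
print(f"R2:  {r2_score_manual(y_true, y_pred):.4f}")
```

MSE is expressed in squared units of the target, so it depends on the scale of the index, while R-squared compares the model against always predicting the mean of the target.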


In [70]:
#Printing the regression coefficients of different variables
coefs = model.coef_
cols = df.columns
for i in range(len(coefs)):
    print(f"The coefficient for {cols[i]} is {coefs[i]}")

The coefficient for FEDFUNDS is 21.611703956964384
The coefficient for construction_materials is 141.25732448424975
The coefficient for Income is 16.251554506415026
The coefficient for Total Units is 65.40903758886513
The coefficient for Inflation is 0.011712845878577127
The coefficient for Unemployment is 11.60050967905157
The coefficient for Year is 28.700896880463954
The coefficient for Month is 1.8430341972027984


## Ridge Regression

The ridge regression model achieved a mean squared error of 42.24 and an R-squared score of 0.98, indicating strong predictive performance and ability to explain 98% of the target variable's variance.

In [71]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Ridge regression model with alpha=1
ridge_model = Ridge(alpha=1)

# Fit the model to the training data
ridge_model.fit(X_train, y_train)


In [73]:
# Make predictions on the scaled testing data
predictions = ridge_model.predict(X_test)

# Evaluate the model using mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Ridge regression - Mean squared error: {mse:.2f}")
print(f"Ridge regression - R-squared score: {r2:.2f}")

Ridge regression - Mean squared error: 42.24
Ridge regression - R-squared score: 0.98


In [74]:
#Printing the regression coefficients of different variables
coefs = ridge_model.coef_
cols = df.columns
for i in range(len(coefs)):
    print(f"The coefficient for {cols[i]} is {coefs[i]}")

The coefficient for FEDFUNDS is 20.36951309567287
The coefficient for construction_materials is 94.52338394383426
The coefficient for Income is 35.23277139748283
The coefficient for Total Units is 52.89125646067446
The coefficient for Inflation is 26.720619832278164
The coefficient for Unemployment is 4.611068195775025
The coefficient for Year is 37.47645875495033
The coefficient for Month is 2.1181018787078663


### Hyperparameter Tuning for Ridge Regression

The alpha grid contains only 9 values, so although n_iter=100 is requested, the random search evaluates 9 candidates with 5-fold cross-validation. The optimal hyperparameter for the ridge regression model was found to be alpha=0.1, with a best cross-validated R-squared score of 0.9819. Refit with this alpha, the model achieved a mean squared error of 27.80 and an R-squared score of 0.99 on the test set, an improvement over the mean squared error of 42.24 obtained with alpha=1.
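The number of candidates a randomized search can try over finite lists is capped by the size of the grid. The sketch below reproduces that arithmetic in plain Python for the alpha grid used here and for the decision tree grid used later in this notebook; `n_candidates` is a hypothetical helper, not a scikit-learn function.

```python
import math

# np.logspace(-4, 4, 9) yields the nine values 10**-4, 10**-3, ..., 10**4.
alpha_grid = [10.0 ** exponent for exponent in range(-4, 5)]

def n_candidates(param_grid, n_iter):
    """Number of distinct settings a randomized search over finite lists can try."""
    grid_size = math.prod(len(values) for values in param_grid.values())
    return min(n_iter, grid_size)

ridge_grid = {"alpha": alpha_grid}
tree_grid = {
    "max_depth": [None, 10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 8],
}

print(n_candidates(ridge_grid, n_iter=100))  # 9
print(n_candidates(tree_grid, n_iter=100))   # 96
```

When the grid is this small, the search is effectively exhaustive, so GridSearchCV would evaluate the same candidates.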

In [75]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge

In [76]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter space for the hyperparameter tuning
param_dist = {'alpha': np.logspace(-4, 4, 9)}

# Create a Ridge regression model
ridge_model = Ridge()

# Create a randomized search cross-validation object
random_search = RandomizedSearchCV(ridge_model, param_dist, n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding R-squared score
print("Best hyperparameters: ", random_search.best_params_)
print("Best R-squared score: ", random_search.best_score_)

# Fit the model with the best hyperparameters to the training data
best_ridge_model = Ridge(alpha=random_search.best_params_['alpha'])
best_ridge_model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = best_ridge_model.predict(X_test)

# Evaluate the model using mean squared error and R-squared score
mse_ridge = mean_squared_error(y_test, predictions)
r2_ridge = r2_score(y_test, predictions)

print(f"Ridge regression with best hyperparameters - Mean squared error: {mse_ridge:.2f}")
print(f"Ridge regression with best hyperparameters - R-squared score: {r2_ridge:.2f}")


Fitting 5 folds for each of 9 candidates, totalling 45 fits




Best hyperparameters:  {'alpha': 0.1}
Best R-squared score:  0.9818581983006922
Ridge regression with best hyperparameters - Mean squared error: 27.80
Ridge regression with best hyperparameters - R-squared score: 0.99


Next, we try Lasso regression. Lasso adds an L1 penalty on the model coefficients, which encourages sparsity: some coefficients can be set exactly to zero, so the model performs feature selection and can further reduce overfitting.
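The way an L1 penalty sets a coefficient exactly to zero can be seen in the one-feature, no-intercept case, where the solution is a soft-thresholding step. The sketch below assumes the objective form (1/(2n))·sum((y - w·x)²) + alpha·|w|; the helper names and the toy data are made up for illustration.

```python
def soft_threshold(z, threshold):
    """Shrink z towards zero by `threshold`, returning 0.0 inside the dead zone."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0

def lasso_slope(x, y, alpha):
    """Closed-form one-feature Lasso slope for the objective stated above."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n    # feature-target correlation term
    scale = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, alpha) / scale

x = [0.0, 0.25, 0.5, 0.75, 1.0]
y_weak = [0.1, -0.1, 0.05, 0.0, 0.1]   # toy target only weakly related to x

for alpha in (0.0, 0.01, 0.1):
    print(f"alpha={alpha}: slope={lasso_slope(x, y_weak, alpha):.4f}")
```

A weakly related feature is dropped entirely once alpha exceeds its correlation term, whereas a ridge penalty would only shrink it.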

## Lasso Regression

The Lasso regression model with alpha=0.1 achieved a mean squared error of 33.52 and an R-squared score of 0.99, indicating excellent predictive performance and ability to explain nearly all of the variance in the target variable.

In [77]:
# Import necessary libraries
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [78]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Lasso regression model with alpha=0.1
lasso_model = Lasso(alpha=0.1)

# Fit the model to the training data
lasso_model.fit(X_train, y_train)


In [79]:
predictions = lasso_model.predict(X_test)

# Evaluate the model using mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Lasso regression with alpha=0.1 - Mean squared error: {mse:.2f}")
print(f"Lasso regression with alpha=0.1 - R-squared score: {r2:.2f}")

Lasso regression with alpha=0.1 - Mean squared error: 33.52
Lasso regression with alpha=0.1 - R-squared score: 0.99


### Hyperparameter Tuning with Lasso Regression

After tuning, the optimal hyperparameter for the Lasso regression model was found to be alpha=0.01, with a best cross-validated R-squared score of 0.9817. Refit with this alpha, the model achieved a mean squared error of 28.32 and an R-squared score of 0.99 on the test set.
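The "best R-squared score" reported by the search is the mean score over the cross-validation folds, not a test-set score. As a rough illustration of how 5-fold splitting partitions the training rows, here is a minimal sketch with contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper and the sample count is arbitrary.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds, without shuffling."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: validate on {val_idx}, train on {len(train_idx)} rows")
```

Each row is used for validation exactly once, and each candidate's score is the average of its five validation scores.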

In [80]:
from sklearn.model_selection import RandomizedSearchCV

In [81]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter space for the hyperparameter tuning
param_dist = {'alpha': np.logspace(-4, 4, 9)}

# Create a Lasso regression model
lasso_model = Lasso()


In [82]:
random_search = RandomizedSearchCV(lasso_model, param_dist, n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding R-squared score
print("Best hyperparameters: ", random_search.best_params_)
print("Best R-squared score: ", random_search.best_score_)

# Fit the model with the best hyperparameters to the training data
best_lasso_model = Lasso(alpha=random_search.best_params_['alpha'])
best_lasso_model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = best_lasso_model.predict(X_test)

# Evaluate the model using mean squared error and R-squared score
mse_lasso = mean_squared_error(y_test, predictions)
r2_lasso = r2_score(y_test, predictions)

print(f"Lasso regression with best hyperparameters - Mean squared error: {mse_lasso:.2f}")
print(f"Lasso regression with best hyperparameters - R-squared score: {r2_lasso:.2f}")


Fitting 5 folds for each of 9 candidates, totalling 45 fits




Best hyperparameters:  {'alpha': 0.01}
Best R-squared score:  0.9817421302831981
Lasso regression with best hyperparameters - Mean squared error: 28.32
Lasso regression with best hyperparameters - R-squared score: 0.99


## Elastic Net

After conducting a random search with 100 candidates and 5-fold cross-validation, the optimal hyperparameters for the Elastic Net regression model were found to be l1_ratio=0.62 and alpha≈0.00105. This resulted in a best cross-validated R-squared score of 0.9820, indicating excellent predictive performance. Note that r2_en stores this cross-validated score rather than the test-set score.
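The two hyperparameters interact through the penalty term: alpha sets the overall strength and l1_ratio sets the mix between the L1 and L2 parts. The sketch below computes that penalty in the form scikit-learn documents for ElasticNet; the function name and the example coefficients are made up for illustration.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2**2"""
    l1_norm = sum(abs(w) for w in coefs)
    l2_norm_squared = sum(w * w for w in coefs)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared

coefs = [3.0, -4.0]   # made-up coefficients

print(elastic_net_penalty(coefs, alpha=2.0, l1_ratio=1.0))   # pure L1 (Lasso-like)
print(elastic_net_penalty(coefs, alpha=2.0, l1_ratio=0.0))   # pure L2 (Ridge-like)
print(elastic_net_penalty(coefs, alpha=2.0, l1_ratio=0.62))  # a mix of both
```

With l1_ratio=0.62, as selected here, the penalty leans towards the L1 part while keeping some L2 shrinkage.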

In [83]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import ElasticNet

In [84]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [85]:
# Define hyperparameter grid
param_dist = {
    'alpha': np.logspace(-5, 5, 100, endpoint=True),
    'l1_ratio': np.arange(0, 1, 0.01)
}

# Create an instance of the Elastic Net Regressor
regressor = ElasticNet()

# Perform RandomizedSearchCV
elastic_net = RandomizedSearchCV(regressor,param_dist,n_iter=100, cv=5,scoring='r2', random_state=42, n_jobs=-1, verbose=1)
elastic_net.fit(X_train, y_train)

Y_pred = elastic_net.predict(X_test)
r2 = elastic_net.score(X_test, y_test)
RMSE_score = np.sqrt(mean_squared_error(y_test, Y_pred))

# Print the best hyperparameters and the corresponding R-squared score
print("Best hyperparameters: ", elastic_net.best_params_)
print("Best R-squared score: ", elastic_net.best_score_)
r2_en =  elastic_net.best_score_


Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best hyperparameters:  {'l1_ratio': 0.62, 'alpha': 0.001047615752789665}
Best R-squared score:  0.9819736628629068


## Decision Tree Regressor

The Decision Tree Regressor with tuned hyperparameters (max_depth=50, min_samples_split=2, min_samples_leaf=1) achieved a best cross-validated R-squared score of 0.9948 and, on the test set, a mean squared error of 19.58 and an R-squared score of 0.99, indicating a very close fit to the data.

In [86]:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score


In [87]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter space for the hyperparameter tuning
param_dist = {'max_depth': [None, 10, 20, 30, 40, 50],
              'min_samples_split': [2, 5, 10, 20],
              'min_samples_leaf': [1, 2, 4, 8]}

# Create a Decision Tree Regressor model
decision_tree_model = DecisionTreeRegressor()

# Create a randomized search cross-validation object
random_search = RandomizedSearchCV(decision_tree_model, param_dist, n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding R-squared score
print("Best hyperparameters: ", random_search.best_params_)
print("Best R-squared score: ", random_search.best_score_)

# Fit the model with the best hyperparameters to the training data
best_decision_tree_model = DecisionTreeRegressor(**random_search.best_params_)
best_decision_tree_model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = best_decision_tree_model.predict(X_test)


Fitting 5 folds for each of 96 candidates, totalling 480 fits




Best hyperparameters:  {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 50}
Best R-squared score:  0.994769327570927


In [88]:
# Evaluate the model using mean squared error and R-squared score
mse_dtr = mean_squared_error(y_test, predictions)
r2_dtr = r2_score(y_test, predictions)

print(f"Decision Tree Regressor with best hyperparameters - Mean squared error: {mse_dtr:.2f}")
print(f"Decision Tree Regressor with best hyperparameters - R-squared score: {r2_dtr:.2f}")

Decision Tree Regressor with best hyperparameters - Mean squared error: 19.58
Decision Tree Regressor with best hyperparameters - R-squared score: 0.99


## Random Forest

After tuning, the optimal hyperparameters for the Random Forest Regressor were found to be n_estimators=50, min_samples_split=2, min_samples_leaf=1, max_depth=10, and bootstrap=True. This resulted in a best cross-validated R-squared score of 0.9970, indicating excellent predictive performance.

In [89]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [90]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest Regressor model
random_forest_model = RandomForestRegressor()

# Fit the model to the training data
random_forest_model.fit(X_train, y_train)


In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter space for the hyperparameter tuning
param_dist = {'n_estimators': [10, 50, 100, 200],
              'max_depth': [None, 10, 20, 30, 40, 50],
              'min_samples_split': [2, 5, 10, 20],
              'min_samples_leaf': [1, 2, 4, 8],
              'bootstrap': [True, False]}

# Create a Random Forest Regressor model
random_forest_model = RandomForestRegressor()



In [92]:
# Create a randomized search cross-validation object
random_search = RandomizedSearchCV(random_forest_model, param_dist, n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

# Fit the model to the training data
random_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding R-squared score
print("Best hyperparameters: ", random_search.best_params_)
print("Best R-squared score: ", random_search.best_score_)
r2_rfr = random_search.best_score_
# Fit the model with the best hyperparameters to the training data
best_random_forest_model = RandomForestRegressor(**random_search.best_params_)
best_random_forest_model.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best hyperparameters:  {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 10, 'bootstrap': True}
Best R-squared score:  0.9969695156142355


## Gradient Boosting

After tuning, the optimal hyperparameters for the Gradient Boosting Regressor were found to be subsample=0.9, n_estimators=400, min_samples_split=5, min_samples_leaf=1, max_depth=4, and learning_rate=0.1. This resulted in a test-set R2 score of 0.9988, the highest reported in this notebook. From this model onward the features are used unscaled, which tree-based models do not require.

In [93]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

In [94]:
X = df.drop('CSUSHPISA', axis=1)
y = df['CSUSHPISA']

In [95]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [96]:
param_dist = {
    'n_estimators': np.arange(50, 500, 50),
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': np.arange(3, 11),
    'min_samples_split': np.arange(2, 21),
    'min_samples_leaf': np.arange(1, 11),
    'subsample': [0.8, 0.9, 1.0]
}

# Create an instance of the Gradient Boosting Regressor
regressor = GradientBoostingRegressor()

# Perform RandomizedSearchCV
grad_boost = RandomizedSearchCV(regressor, param_dist,n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

In [97]:
grad_boost.fit(X_train, y_train)

# Get results
Y_pred = grad_boost.predict(X_test)
r2_gb = grad_boost.score(X_test, y_test)

print(f"Best hyperparameters: {grad_boost.best_params_}")
print(f"R2 Score: {r2_gb:.4f}")

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best hyperparameters: {'subsample': 0.9, 'n_estimators': 400, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_depth': 4, 'learning_rate': 0.1}
R2 Score: 0.9988


## XGBoost

After tuning, the optimal hyperparameters for the XGBoost Regressor were found to be subsample=1.0, n_estimators=400, min_child_weight=10, max_depth=5, and learning_rate=0.05. This resulted in an exceptional R2 score of 0.9983, indicating outstanding predictive performance.

In [98]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

In [99]:
X = df.drop('CSUSHPISA', axis=1)
y = df['CSUSHPISA']

In [100]:
param_dist = {
    'n_estimators': np.arange(50, 500, 50),
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': np.arange(3, 11),
    'min_child_weight': np.arange(1, 11),
    'subsample': [0.8, 0.9, 1.0]
}

regressor = XGBRegressor()

# Perform RandomizedSearchCV
xgb = RandomizedSearchCV(regressor, param_distributions=param_dist,n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

xgb.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [101]:
Y_pred = xgb.predict(X_test)
r2_xgb = xgb.score(X_test, y_test)
mse = mean_squared_error(y_test, Y_pred)
print(f"Best hyperparameters: {xgb.best_params_}")
print(f"R2 Score: {r2_xgb:.4f}")

Best hyperparameters: {'subsample': 1.0, 'n_estimators': 400, 'min_child_weight': 10, 'max_depth': 5, 'learning_rate': 0.05}
R2 Score: 0.9983


## Adaptive Boosting

After tuning, the optimal hyperparameters for the AdaBoost Regressor were found to be n_estimators=400 and learning_rate=0.2. This resulted in a test-set R2 score of 0.9924, indicating strong predictive performance.

In [102]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import AdaBoostRegressor

In [103]:
X = df.drop('CSUSHPISA', axis=1)
y = df['CSUSHPISA']

In [104]:
param_dist = {
    'n_estimators': np.arange(50, 500, 50),
    'learning_rate': [0.01, 0.05, 0.1, 0.2]
}

# Create an instance of the AdaBoost Regressor
regressor = AdaBoostRegressor()

# Perform RandomizedSearchCV
adb = RandomizedSearchCV(regressor, param_distributions=param_dist,n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)

adb.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits




In [105]:
# Get results
Y_pred = adb.predict(X_test)
r2_agb = adb.score(X_test, y_test)

print(f"Best hyperparameters: {adb.best_params_}")
print(f"R2 Score: {r2_agb:.4f}")

Best hyperparameters: {'n_estimators': 400, 'learning_rate': 0.2}
R2 Score: 0.9924


## Voting Regressor

After tuning, the optimal hyperparameters for the Voting Regressor were found to be xgb__n_estimators=400, xgb__learning_rate=0.2, gb__n_estimators=400, gb__learning_rate=0.2, ada__n_estimators=250, and ada__learning_rate=0.2. The random forest member is not part of the search space and keeps its default settings. This resulted in a test-set R2 score of 0.9981, indicating outstanding predictive performance.
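A voting regressor combines its members by averaging their predictions for each sample, optionally with weights. The sketch below shows that averaging step in plain Python; `voting_predict` is a hypothetical helper and the per-model predictions are made up for illustration.

```python
def voting_predict(predictions_per_model, weights=None):
    """Average per-sample predictions across models, optionally weighted."""
    n_models = len(predictions_per_model)
    if weights is None:
        weights = [1.0] * n_models
    total_weight = sum(weights)
    n_samples = len(predictions_per_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_per_model)) / total_weight
        for i in range(n_samples)
    ]

# Made-up predictions from three base models for two samples.
base_predictions = [
    [150.0, 200.0],
    [152.0, 198.0],
    [154.0, 205.0],
]

print(voting_predict(base_predictions))                       # plain average
print(voting_predict(base_predictions, weights=[2, 1, 1]))    # first model counts double
```

Averaging tends to cancel out errors that the individual models do not share, which is why the ensemble can be more stable than any single member.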

In [106]:
from sklearn.ensemble import VotingRegressor

In [107]:
X =  df.drop('CSUSHPISA', axis=1)
y = df['CSUSHPISA']

In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [109]:
param_dist = {
    'gb__n_estimators': np.arange(50, 500, 50),
    'gb__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'xgb__n_estimators': np.arange(50, 500, 50),
    'xgb__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'ada__n_estimators': np.arange(50, 500, 50),
    'ada__learning_rate': [0.01, 0.05, 0.1, 0.2]
}

# Create instances of the base regressors
gb_regressor = GradientBoostingRegressor()
xgb_regressor = XGBRegressor()
ada_regressor = AdaBoostRegressor()
rf_regressor = RandomForestRegressor()

# Create the Voting Regressor
voting_regressor = VotingRegressor(
    estimators=[
        ('gb', gb_regressor),
        ('xgb', xgb_regressor),
        ('ada', ada_regressor),
        ('rf', rf_regressor)
    ]
)

In [110]:
vr = RandomizedSearchCV(voting_regressor, param_distributions=param_dist, n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)
vr.fit(X_train, y_train)


Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [111]:
# Get results
Y_pred = vr.predict(X_test)
r2_vr = vr.score(X_test, y_test)

print(f"Best hyperparameters: {vr.best_params_}")
print(f"R2 Score: {r2_vr:.4f}")

Best hyperparameters: {'xgb__n_estimators': 400, 'xgb__learning_rate': 0.2, 'gb__n_estimators': 400, 'gb__learning_rate': 0.2, 'ada__n_estimators': 250, 'ada__learning_rate': 0.2}
R2 Score: 0.9981


In [112]:
# Plotting Voting Regressor
# Use a separate name so the modelling dataframe `df` is not overwritten
df_vote = pd.DataFrame({
    'Value': list(y_test) + list(Y_pred),
    'Type': ['Actual'] * len(y_test) + ['Predicted'] * len(Y_pred)
})

# Now create the scatter plot
fig = px.scatter(df_vote, x='Value', y='Value', color='Type',
                 color_discrete_sequence=['yellow', 'red'],
                 labels={'Value': 'Value', 'Type': 'Type'})

fig.update_traces(marker=dict(size=10), selector=dict(mode='markers'))

# Set a dark background color (matches the Google Colab dark theme)
fig.update_layout(
    plot_bgcolor='#1e1e1e',  # Background color
    paper_bgcolor='#1e1e1e',  # Outside plot color
)

# Remove gridlines
fig.update_xaxes(title=dict(text='Value', font=dict(color='white')),showgrid = False,tickfont=dict(color='white'))
fig.update_yaxes(title=dict(text='Value', font=dict(color='white')),showgrid = False,tickfont=dict(color='white'))

# Set label text to white
fig.update_layout(
    title=dict(font=dict(color='white')),
    legend_title=dict(font=dict(color='white')),
    legend=dict(font=dict(color='white'))
)

# Show the figure
fig.show()

## Stacking Regressor

After tuning, the optimal hyperparameters for the Stacking Regressor's final estimator (a Gradient Boosting Regressor) were found to be final_estimator__n_estimators=350 and final_estimator__learning_rate=0.1. This resulted in a test-set R2 score of 0.9976, indicating strong predictive performance.
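Stacking differs from voting in that a final estimator is trained on the base models' out-of-fold predictions instead of simply averaging them. The sketch below is a heavily simplified illustration of that idea, not the notebook's method: the two base models, the single blend weight used as the final estimator, and the toy data are all assumptions chosen to keep the example short.

```python
def fit_mean(x, y):
    """Base model 1: always predict the training mean."""
    mean_y = sum(y) / len(y)
    return lambda xi: mean_y

def fit_slope(x, y):
    """Base model 2: no-intercept least squares on one feature."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return lambda xi: slope * xi

def out_of_fold_predictions(fit, x, y, k):
    """Predict each sample with a model that never saw it during fitting."""
    n = len(x)
    oof = [0.0] * n
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in val_idx:
            oof[i] = model(x[i])
    return oof

def sse(y, pred):
    return sum((t - p) ** 2 for t, p in zip(y, pred))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2]   # toy data, roughly 2x + 1

oof_mean = out_of_fold_predictions(fit_mean, x, y, k=4)
oof_slope = out_of_fold_predictions(fit_slope, x, y, k=4)

# Final estimator: one blend weight b, fitted by least squares on the
# out-of-fold predictions, so that stacked = oof_mean + b * (oof_slope - oof_mean).
diff = [s - m for s, m in zip(oof_slope, oof_mean)]
b = sum(d * (t - m) for d, t, m in zip(diff, y, oof_mean)) / sum(d * d for d in diff)
stacked = [m + b * d for m, d in zip(oof_mean, diff)]

print(f"blend weight: {b:.3f}")
print(f"SSE mean-only: {sse(y, oof_mean):.2f}, "
      f"slope-only: {sse(y, oof_slope):.2f}, stacked: {sse(y, stacked):.2f}")
```

Training the final estimator on out-of-fold predictions keeps it from simply rewarding whichever base model memorised the training rows best.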

In [113]:
from sklearn.ensemble import StackingRegressor

In [114]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [115]:

param_dist = {
    'final_estimator__n_estimators': np.arange(50, 500, 50),
    'final_estimator__learning_rate': [0.01, 0.05, 0.1, 0.2]
}

# Create instances of the base regressors
gb_regressor = GradientBoostingRegressor()
xgb_regressor = XGBRegressor()
ada_regressor = AdaBoostRegressor()
rf_regressor = RandomForestRegressor()

# Create the Stacking Regressor
stacking_regressor = StackingRegressor(
    estimators=[
        ('gb', gb_regressor),
        ('xgb', xgb_regressor),
        ('ada', ada_regressor),
        ('rf', rf_regressor)
    ],
    final_estimator=GradientBoostingRegressor()
)

In [116]:
sr = RandomizedSearchCV(stacking_regressor, param_distributions=param_dist, n_iter=100, cv=5, scoring='r2', random_state=42, n_jobs=-1, verbose=1)
sr.fit(X_train, y_train)


The total space of parameters 36 is smaller than n_iter=100. Running 36 iterations. For exhaustive searches, use GridSearchCV.



Fitting 5 folds for each of 36 candidates, totalling 180 fits


In [117]:
# Get results
Y_pred = sr.predict(X_test)
r2_sr = sr.score(X_test, y_test)

print(f"Best hyperparameters: {sr.best_params_}")
print(f"R2 Score: {r2_sr:.4f}")

Best hyperparameters: {'final_estimator__n_estimators': 350, 'final_estimator__learning_rate': 0.1}
R2 Score: 0.9976


In [118]:
# Plotting Stacking Regressor
df_stack = pd.DataFrame({
    'Value': list(y_test) + list(Y_pred),
    'Type': ['Actual'] * len(y_test) + ['Predicted'] * len(Y_pred)
})

# Now create the scatter plot for the Stacking Regressor
fig_stack = px.scatter(df_stack, x='Value', y='Value', color='Type',
                       color_discrete_sequence=['yellow', 'red'],
                       labels={'Value': 'Value', 'Type': 'Type'})

fig_stack.update_traces(marker=dict(size=10), selector=dict(mode='markers'))

# Set the background color to match Google Colab theme
fig_stack.update_layout(
    plot_bgcolor='#1e1e1e',  # Background color
    paper_bgcolor='#1e1e1e',  # Outside plot color
)

# Remove gridlines and set tick labels to white
fig_stack.update_xaxes(title=dict(text='Value', font=dict(color='white')),
                       showgrid=False, tickfont=dict(color='white'))
fig_stack.update_yaxes(title=dict(text='Value', font=dict(color='white')),
                       showgrid=False, tickfont=dict(color='white'))

# Set label text to white
fig_stack.update_layout(
    title=dict(font=dict(color='white')),
    legend_title=dict(font=dict(color='white')),
    legend=dict(font=dict(color='white'))
)

# Show the figure
fig_stack.show()


# **Comparison of all models by their R2 scores**

In [119]:
models = ['Linear', 'Elastic Net', 'Lasso', 'XGBoost', 'Gradient Boosting', 'Decision Tree', 'Random Forest', 'Ridge', 'AdaBoost', 'Stacking', 'Voting Regressor']

r2_values = [r2_linear, r2_en, r2_lasso, r2_xgb, r2_gb,r2_dtr, r2_rfr, r2_ridge, r2_agb, r2_sr, r2_vr]

fig = go.Figure(go.Scatter(x=models, y=r2_values, line=dict(color='#ADD8E6')))
fig.update_layout(
    title='R-squared Score of Regression Models',
    xaxis_title='Model',
    yaxis_title='R-squared Score',
    plot_bgcolor='#1e1e1e',
    paper_bgcolor='#1e1e1e',
    font=dict(color='white'),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False)
)
fig.show()
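The r2_values list mixes two kinds of scores: r2_en and r2_rfr hold cross-validated scores, while the others are test-set scores. As a small sketch of a like-for-like ranking, the block below takes the four-decimal scores printed in the outputs above, tags each with its kind, and ranks only the test-set ones. The list structure and variable names are assumptions made for illustration.

```python
# (model, R2 as printed in the outputs above, kind of score)
reported_r2 = [
    ("Elastic Net", 0.9820, "cv"),
    ("Random Forest", 0.9970, "cv"),
    ("Gradient Boosting", 0.9988, "test"),
    ("XGBoost", 0.9983, "test"),
    ("AdaBoost", 0.9924, "test"),
    ("Voting Regressor", 0.9981, "test"),
    ("Stacking", 0.9976, "test"),
]

# Compare only scores measured the same way.
test_only = [(name, score) for name, score, kind in reported_r2 if kind == "test"]
ranking = sorted(test_only, key=lambda item: item[1], reverse=True)

for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {score:.4f}")
```

The differences among the boosted and ensemble models are in the third decimal place, so the ordering could plausibly change with a different train/test split.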

In [120]:
# Downloading the model
import pickle
from google.colab import files

# Save the tuned XGBoost search object to disk, closing the file when done
filename = 'finalized_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(xgb, f)

files.download(filename)
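On the application side the pickled object has to be loaded back, ideally with the same library versions printed at the top of this notebook. The sketch below shows the save and load round trip using only the standard library; a plain dictionary stands in for the fitted model, and the temporary path is an assumption made so the example runs anywhere.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for the fitted model in this sketch.
model_like = {"name": "toy-model", "params": {"n_estimators": 400, "max_depth": 5}}

path = os.path.join(tempfile.mkdtemp(), "finalized_model.pkl")

# Save: the context manager closes the file even if dumping fails.
with open(path, "wb") as handle:
    pickle.dump(model_like, handle)

# Load: this is what the consuming application would do.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == model_like)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.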


In [121]:
# installing kaleido for exporting the charts
!pip install --upgrade "kaleido==0.2.0"

Collecting kaleido==0.2.0
  Downloading kaleido-0.2.0-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Installing collected packages: kaleido
Successfully installed kaleido-0.2.0


# THANK YOU!!!