# Abstract

This Colab notebook provides two worked examples on regularization techniques in machine learning. 

The notebook is structured in a step-by-step manner, starting with loading and preprocessing the data, followed by building a baseline model without regularization. Then, each regularization technique is introduced, and its effect on the model performance is evaluated through experiments. The examples are provided in Python using popular libraries such as NumPy, TensorFlow, and Keras.

By the end of this notebook, the reader will have a clear understanding of how regularization works, how to implement it in Python, and how it affects the performance of a machine learning model. The examples presented in the notebook will serve as a good starting point for further exploration and experimentation with regularization techniques in various machine learning tasks.





# Example - 1 

Kaggle Dataset - https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction?select=Clean_Dataset.csv


# Dataset
The dataset contains details about flight fares. There are 10683 rows and 11 columns. Each row of the file represents a single flight's information.

The columns in the dataset are:

- **Airline:** The name of the airline company
- **Date_of_Journey:** The date of the journey
- **Source:** The starting location of the flight
- **Destination:** The final location of the flight
- **Route:** The route of the flight
- **Dep_Time:** The departure time of the flight
- **Arrival_Time:** The arrival time of the flight
- **Duration:** The total duration of the flight
- **Total_Stops:** The total number of stops in the flight
- **Additional_Info:** Any additional information about the flight
- **Price:** The price of the flight in Indian Rupees (INR)


Here the dependent and the target variable is "Price".

## Importing all the required libraries

In [1]:
# !kaggle datasets download -d brllrb/uber-and-lyft-dataset-boston-ma

In [2]:
# !pip install kaggle
!pip install h2o
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
# Setting up maximum runtime for the AutoML
min_mem_size = 6
run_time = 222

In [4]:
#psutil library to gather information about the system's virtual memory and uses that information to calculate a minimum memory size.
import psutil
pct_memory = 0.5
virtual_memory = psutil.virtual_memory()
min_mem_size = int(round(int(pct_memory * virtual_memory.available) / 1073741824, 0))
print(min_mem_size)

6


In [5]:
# 65535 Highest port no
# Start the H2O server on a random port
import random, os, sys
port_no = random.randint(5555, 55555)

#  h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o
try:
    h2o.init(
        strict_version_check=False, min_mem_size_GB=min_mem_size, port=port_no
    )  # to initialize h2o
except:
    logging.critical("h2o.init")
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
    sys.exit(2)
        
# h2o.init(ip="localhost", port=54323)

Checking whether there is an H2O instance running at http://localhost:53862..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.18" 2023-01-17; OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1); OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.9/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp2cdp89ph
  JVM stdout: /tmp/tmp2cdp89ph/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmp2cdp89ph/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:53862
Connecting to H2O server at http://127.0.0.1:53862 ... successful.


0,1
H2O_cluster_uptime:,06 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.3
H2O_cluster_version_age:,"14 days, 14 hours and 48 minutes"
H2O_cluster_name:,H2O_from_python_unknownUser_tit4ia
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,6 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


In [6]:
import pandas as pd
flight = pd.read_csv('https://raw.githubusercontent.com/Shreyasi632/CrashCourse/main/Flight_Fare.csv')


In [7]:
# #Rearraging the response varaible i.e Price column to the end of the dataframe
# column_to_move = flight.pop("price")

# # insert column with insert(location, column_name, column_value)

# flight.insert(15, "price", column_to_move)

In [8]:
#Data Cleaning
# df.drop(['Unnamed: 0'], axis=1)
flight.drop(["Unnamed"], axis=1)
# "market_segment_type"

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955
...,...,...,...,...,...,...,...,...,...,...,...
300148,Vistara,UK-822,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,49,69265
300149,Vistara,UK-826,Chennai,Afternoon,one,Night,Hyderabad,Business,10.42,49,77105
300150,Vistara,UK-832,Chennai,Early_Morning,one,Night,Hyderabad,Business,13.83,49,79099
300151,Vistara,UK-828,Chennai,Early_Morning,one,Evening,Hyderabad,Business,10.00,49,81585


In [9]:
#checking the percentage of null values
percentage_missing = flight.isnull().sum()*100 / len(flight)
percentage_missing

Unnamed             0.0
airline             0.0
flight              0.0
source_city         0.0
departure_time      0.0
stops               0.0
arrival_time        0.0
destination_city    0.0
class               0.0
duration            0.0
days_left           0.0
price               0.0
dtype: float64

There are no missing values in the data

In [10]:
#Checking categorical data with object datatype
categorical = flight.select_dtypes("object").columns
categorical

Index(['airline', 'flight', 'source_city', 'departure_time', 'stops',
       'arrival_time', 'destination_city', 'class'],
      dtype='object')

## LabelEncoder
LabelEncoder is used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels. Converting categorical data into numerical data using LabelEncoder.

In [11]:
#converting categorical data into numerical data using LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for item in categorical:
    le.fit(flight[item])
    flight[item] = le.transform(flight[item])


for cat in categorical:
    print(f"The current column is : {cat}\n")
    print(flight[cat].value_counts())
    print("-" *100 +"\n\n")

The current column is : airline

5    127859
1     80892
3     43120
2     23173
0     16098
4      9011
Name: airline, dtype: int64
----------------------------------------------------------------------------------------------------


The current column is : flight

1442    3235
1454    2741
1445    2650
1490    2542
1477    2468
        ... 
1426       1
487        1
647        1
1083       1
927        1
Name: flight, Length: 1561, dtype: int64
----------------------------------------------------------------------------------------------------


The current column is : source_city

2    61343
5    60896
0    52061
4    46347
3    40806
1    38700
Name: source_city, dtype: int64
----------------------------------------------------------------------------------------------------


The current column is : departure_time

4    71146
1    66790
2    65102
5    48015
0    47794
3     1306
Name: departure_time, dtype: int64
------------------------------------------------------------------

In [12]:
flight.isnull().sum()

Unnamed             0
airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64

## Imputation
Imputation using mean is a technique for handling missing values in a dataset. In this technique, the missing values are replaced with the mean value of the respective column.

This method is useful when the number of missing values is small compared to the total size of the dataset. It is also useful when the distribution of the data is approximately normal and the missing values are missing at random.



In [13]:
# mean = nyc_airbnb["reviews_per_month"].mean()
# nyc_airbnb["reviews_per_month"].fillna(mean, inplace=True)
# nyc_airbnb.isnull().sum()

In [14]:
# from sklearn.impute import KNNImputer
# imputer = KNNImputer(n_neighbors=10)
# df_imputed = imputer.fit_transform(nyc_airbnb)
# df_imputed = pd.DataFrame(df_imputed, columns=nyc_airbnb.columns)

In [15]:
#Checking for null values
flight.isnull().sum().sum()

0

## Initializing H2O using the below code

In [16]:
df = h2o.H2OFrame(
   flight
) 

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


**Splitting the dataset into training and test dataset**

In [17]:
# Create a 80/20 train/test split
pct_rows=0.80
df_train, df_test = df.split_frame([pct_rows])

In [18]:
# Checking the shape of both training an dtest dataset
print(df_train.shape)
print(df_test.shape)

(240043, 12)
(60110, 12)


In [19]:
X=df.columns
print(X)

['Unnamed', 'airline', 'flight', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class', 'duration', 'days_left', 'price']


In [20]:
#Seperating Dependent variable from Independent variable
y_numeric ='price'
X.remove(y_numeric)
print(X)

['Unnamed', 'airline', 'flight', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class', 'duration', 'days_left']


## H20 AutoML Execution
Run AutoML. The max_runtime_secs argument provides a way to limit the AutoML run by time.

In [21]:
# Setting of AutoML
aml = H2OAutoML(max_runtime_secs=run_time, seed=1)

In [22]:
# Training the dataset on different models thereby passing the data through H20AutoML
aml.train(x=X,y=y_numeric,training_frame=df_train)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


key,value
Stacking strategy,blending
Number of base models (used / total),3/8
# GBM base models (used / total),0/4
# XGBoost base models (used / total),2/2
# DRF base models (used / total),1/1
# GLM base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,AUTO
Metalearner nfolds,0
Metalearner fold_column,


## Identifing predictor significance using OLS regression
Ordinary least squares (OLS) regression is a statistical method of analysis used for multivariate model that estimates the relationship between one or more independent variables and a dependent variable; the method estimates the relationship by minimizing the sum of the squares in the difference between the observed and predicted values of the dependent variable.

In [23]:
flight.columns

Index(['Unnamed', 'airline', 'flight', 'source_city', 'departure_time',
       'stops', 'arrival_time', 'destination_city', 'class', 'duration',
       'days_left', 'price'],
      dtype='object')

In [24]:
flight.drop(["Unnamed"], axis=1, inplace=True)

In [25]:
flight.columns

Index(['airline', 'flight', 'source_city', 'departure_time', 'stops',
       'arrival_time', 'destination_city', 'class', 'duration', 'days_left',
       'price'],
      dtype='object')

In [26]:
#Using OLS for finding the p value and t statistics 
import statsmodels.api as sm
model = sm.OLS(flight['price'], flight[['airline', 'flight', 'source_city', 'departure_time', 'stops',
       'arrival_time', 'destination_city', 'class', 'duration', 'days_left']]).fit()

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,price,R-squared (uncentered):,0.861
Model:,OLS,Adj. R-squared (uncentered):,0.861
Method:,Least Squares,F-statistic:,186700.0
Date:,"Wed, 19 Apr 2023",Prob (F-statistic):,0.0
Time:,02:44:44,Log-Likelihood:,-3231900.0
No. Observations:,300153,AIC:,6464000.0
Df Residuals:,300143,BIC:,6464000.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
airline,1511.0466,15.046,100.429,0.000,1481.557,1540.536
flight,9.4411,0.064,147.949,0.000,9.316,9.566
source_city,1879.1570,11.587,162.183,0.000,1856.447,1901.867
departure_time,1173.6111,11.752,99.862,0.000,1150.577,1196.645
stops,606.8024,34.706,17.484,0.000,538.780,674.825
arrival_time,1760.3595,11.538,152.565,0.000,1737.744,1782.974
destination_city,2095.6178,11.470,182.711,0.000,2073.138,2118.098
class,-3.581e+04,42.650,-839.613,0.000,-3.59e+04,-3.57e+04
duration,599.7285,3.223,186.088,0.000,593.412,606.045

0,1,2,3
Omnibus:,1913.228,Durbin-Watson:,0.526
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2227.793
Skew:,0.141,Prob(JB):,0.0
Kurtosis:,3.314,Cond. No.,2440.0


This above output shows the results of an OLS and here are some observations:

- **R-Squared:** In this case, the uncentered R-squared value is 	0.861, indicating that the model explains 86.1% of the variation in SalePrice
- **Adj. R-squared:** The adjusted R-squared is also 	0.861 (86.1%), which means that the model is not overfitting the data
- **F-statistic:** The F-statistic value, which tests the overall significance of the model. In this case, the F-statistic is 1290 with a probability (p-value) of 0.00, indicating that the model is statistically significant.
- **AIC and BIC** - These values indicate a better model fit. In this case, the AIC is 3.532e+04 and the BIC is 3.546e+04, indicating that the model has a good fit.

## GLM Model
Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions. In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions. Each serves a different purpose, and depending on distribution and link function choice, can be used either for prediction or classification.

In [27]:
#Build Simple GLM Model
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

data_glm = H2OGeneralizedLinearEstimator(family="gaussian", standardize=True)
data_glm.train(x=X,
               y=y_numeric,
               training_frame  =df_train,
               validation_frame=df_test)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,gaussian,identity,"Elastic Net (alpha = 0.5, lambda = 42.319 )",11,11,1,py_2_sid_b1f0

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_deviance,training_mae,training_r2,validation_rmse,validation_deviance,validation_mae,validation_r2
,2023-04-19 02:44:44,0.000 sec,0,123638914745700.92,515069861.4235821,,,,,,,,
,2023-04-19 02:44:44,0.355 sec,1,,,21047.8096948,443010292.9468412,18220.9609525,0.1399025,21059.2984808,443494052.5020397,18237.3553961,0.1399374

variable,relative_importance,scaled_importance,percentage
class,925.3551025,1.0,0.3362423
Unnamed,737.6430054,0.7971459,0.2680341
flight,285.4485168,0.3084746,0.1037222
airline,226.6506653,0.2449337,0.0823571
stops,192.6266022,0.2081651,0.0699939
duration,189.5926514,0.2048864,0.0688915
days_left,93.0922623,0.1006017,0.0338265
departure_time,51.3892326,0.0555346,0.0186731
arrival_time,39.6668739,0.0428667,0.0144136
source_city,7.1626368,0.0077404,0.0026027


In [28]:
# LASSO Regularization
data_glm_regularization_lasso = H2OGeneralizedLinearEstimator(
    family="gaussian", alpha=1, nfolds=5
)

# RIDGE Regularization 
data_glm_regularization_ridge = H2OGeneralizedLinearEstimator(
    family="gaussian", alpha=0, nfolds=5
)

We have built a GLM model and have got th above metrics. We will now use Lasso regularization to check the difference between a model without regularization and and one with regularization

## Lasso Regularization

In [29]:
data_glm_regularization_lasso.train(x=X, y=y_numeric, training_frame=df_train)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,gaussian,identity,Lasso (lambda = 21.16 ),11,11,1,py_2_sid_b1f0

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,4628.162,36.33194,4596.85,4582.269,4659.2515,4642.1787,4660.2617
mean_residual_deviance,49158668.0,682958.7,48782620.0,48362620.0,49977236.0,48912300.0,49758564.0
mse,49158668.0,682958.7,48782620.0,48362620.0,49977236.0,48912300.0,49758564.0
null_deviance,24728100100000.0,182373581000.0,24649180100000.0,24892546200000.0,24808888700000.0,24445678800000.0,24844209000000.0
r2,0.9045597,0.0009678,0.9043316,0.9061823,0.9035906,0.9042203,0.9044739
residual_deviance,2359914140000.0,23171076100.0,2358005470000.0,2335285710000.0,2391810510000.0,2341334160000.0,2373135370000.0
rmse,7011.189,48.679962,6984.4556,6954.3237,7069.458,6993.733,7053.975
rmsle,,0.0,,,,,

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_deviance,training_mae,training_r2
,2023-04-19 02:44:51,0.000 sec,0,123638914745700.92,515069861.4235821,,,,
,2023-04-19 02:44:51,0.344 sec,1,,,7010.5409948,49147685.039806,4629.4691118,0.9045805

variable,relative_importance,scaled_importance,percentage
class,20456.3417969,1.0,0.7400665
stops,2154.6977539,0.1053315,0.0779523
days_left,1768.432251,0.0864491,0.0639781
airline,1644.7015381,0.0804006,0.0595018
duration,640.1790161,0.0312949,0.0231603
arrival_time,308.8533936,0.0150982,0.0111737
destination_city,178.3216705,0.0087172,0.0064513
Unnamed,177.0115509,0.0086531,0.0064039
source_city,162.0771637,0.0079231,0.0058636
flight,118.3498764,0.0057855,0.0042816


On comparing GLM model details without regularization and model details with Lasso Regularization we can see that the Lasso Regularization has significantly improved the model's performance compared to the model without regularization.

In particular, we can observe the following changes:

- The MSE, RMSE, MAE, and Mean Residual Deviance values of the GLM model with Lasso regularization are significantly lower than those of the GLM model without regularization. This suggests that the Lasso regularization has helped to improve the model's accuracy and reduce the error.

- The R-squared value of the GLM model with Lasso regularization is much higher than that of the GLM model without regularization. This indicates that the Lasso regularization has resulted in a model that explains a larger proportion of the variance in the dependent variable.

- The null degrees of freedom, residual degrees of freedom, null deviance, residual deviance, and AIC values of the two models are different, reflecting the different characteristics of the models.

**In summary, the GLM model with Lasso regularization outperforms the GLM model without regularization in terms of accuracy and predictive power.**

## Ridge Regularization

In [30]:
data_glm_regularization_ridge.train(x=X, y=y_numeric, training_frame=df_train)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,gaussian,identity,Ridge ( lambda = 21.16 ),11,11,1,py_2_sid_b1f0

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,18502.75,36.6042,18490.963,18524.812,18478.559,18464.7,18554.719
mean_residual_deviance,455760320.0,2097749.5,455903552.0,456514688.0,453524704.0,454071424.0,458787232.0
mse,455760320.0,2097749.5,455903552.0,456514688.0,453524704.0,454071424.0,458787232.0
null_deviance,24727924000000.0,137978348000.0,24752202700000.0,24903323500000.0,24635005500000.0,24551125700000.0,24797964700000.0
r2,0.1151416,0.0003288,0.1148528,0.114974,0.1151375,0.1156996,0.1150443
residual_deviance,21880390000000.0,128452420000.0,21908900800000.0,22040073500000.0,21798211000000.0,21710516500000.0,21944252500000.0
rmse,21348.498,49.112553,21351.898,21366.205,21296.12,21308.951,21419.318
rmsle,1.2162515,0.0018561,1.2188153,1.216439,1.2169784,1.2139673,1.2150571

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_deviance,training_mae,training_r2
,2023-04-19 02:44:55,0.000 sec,0,123638914745700.92,515069861.4235821,,,,
,2023-04-19 02:44:55,0.382 sec,1,,,21045.3067572,442904936.5031524,18218.8211365,0.1401071

variable,relative_importance,scaled_importance,percentage
class,926.2459717,1.0,0.3353233
Unnamed,738.5368042,0.7973441,0.2673681
flight,286.3410645,0.3091415,0.1036624
airline,227.5623016,0.2456824,0.082383
stops,193.5469055,0.2089584,0.0700686
duration,190.5017242,0.2056708,0.0689662
days_left,94.0447388,0.1015332,0.0340465
departure_time,52.331768,0.0564988,0.0189454
arrival_time,40.616745,0.0438509,0.0147042
source_city,8.1302376,0.0087776,0.0029433


The comparison between these two models based on the given summary statistics:

- MSE: The mean squared error (MSE) of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher prediction error.

- RMSE: The root mean squared error (RMSE) of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher average prediction error.

- MAE: The mean absolute error (MAE) of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher average absolute prediction error.

- RMSLE: The root mean squared logarithmic error (RMSLE) of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher prediction error when the outcome variable is expressed in logarithmic terms.

- Mean Residual Deviance: The mean residual deviance of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher average squared difference between the predicted and observed values of the outcome variable.

- R-squared: The R-squared value of the model without regularization is higher than that of the Ridge-regularized model, indicating that the former model is better at explaining the variance in the outcome variable. However, both models have relatively low R-squared values, indicating that they do not explain a large proportion of the variance in the outcome variable.

- Null deviance: The null deviance of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher sum of squared differences between the observed outcome variable and its mean.

- Residual deviance: The residual deviance of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher sum of squared differences between the observed outcome variable and the predicted values from the model.

- AIC: The Akaike Information Criterion (AIC) of the Ridge-regularized model is higher than that of the model without regularization, indicating that the former model has a higher information loss when used for model selection.

**Overall, we can see that the Ridge-regularized model has higher prediction errors, worse performance in terms of explaining the variance in the outcome variable, and higher information loss compared to the model without regularization.**

## Which Regularization Method Helps ?

Based on the above information, it appears that the GLM model with Lasso regularization is performing better than the GLM model with Ridge regularization for this particular dataset. This is indicated by the lower values for the MSE, RMSE, MAE, and Mean Residual Deviance for the Lasso regularized model compared to the Ridge regularized model.

Additionally, the R^2 value for the Lasso regularized model is much higher than that for the Ridge regularized model, indicating that the Lasso model is able to explain more of the variance in the data.

In [31]:
h2o.cluster().shutdown()

H2O session _sid_b1f0 closed.


# Example 2

Kaggle dataset - https://www.kaggle.com/datasets/afsaja/workout-supplements-and-nutrition-products

# Dataset

The dataset contains data related to nutrition products commonly used in bodybuilding. There are a total of 840 rows and 14 columns.

Here's a brief version of what you'll find in the dataset below:

- **Variable Description:** average_flavor_rating : The average of the flavor rated by the user

- **brand_name:** The name of the brand

- **link:** The link of the product

- **number_of_flavors:** The total number of flavors that brand has

- **number_of_reviews:** The total number of reviews for a product

- **overall_rating:** The overall rating of the product

- **price:** The actual price of the product

- **price_per_serving:** The cost of the supplement per serving

- **product_category:** The category of the supplement

- **product_description:** The description of the product

- **product_name:** The name of the product

- **top_flavor_rated:** Top flavor of a particular supplement

- **verified_buyer_number:** The number of the verified buyer

- **verified_buyer_rating:** The rating of the verified buyer

**Price_per_serving is the target variable**

In [32]:
# Setting up maximum runtime for the AutoML
min_mem_size = 6
run_time = 222

In [33]:
#psutil library to gather information about the system's virtual memory and uses that information to calculate a minimum memory size.
import psutil
pct_memory = 0.5
virtual_memory = psutil.virtual_memory()
min_mem_size = int(round(int(pct_memory * virtual_memory.available) / 1073741824, 0))
print(min_mem_size)

3


In [34]:
# 65535 Highest port no
# Start the H2O server on a random port
import random, os, sys
port_no = random.randint(5555, 55555)

#  h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o
try:
    h2o.init(
        strict_version_check=False, min_mem_size_GB=min_mem_size, port=port_no
    )  # to initialize h2o
except:
    logging.critical("h2o.init")
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
    sys.exit(2)
        
# h2o.init(ip="localhost", port=54323)

Checking whether there is an H2O instance running at http://localhost:17981..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.18" 2023-01-17; OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1); OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.9/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpq01hnynk
  JVM stdout: /tmp/tmpq01hnynk/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpq01hnynk/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:17981
Connecting to H2O server at http://127.0.0.1:17981 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.3
H2O_cluster_version_age:,"14 days, 14 hours and 53 minutes"
H2O_cluster_name:,H2O_from_python_unknownUser_ibndih
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.172 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


# Importing and cleaning the Data

In [35]:
import pandas as pd
nutrition = pd.read_csv('https://raw.githubusercontent.com/Shreyasi632/CrashCourse/main/bodybuilding_nutrition_products.csv')

# #Data Cleaning
# nutrition.drop(["MSSubClass", "MSZoning", "LotFrontage","LotShape","LandContour","Utilities","LotConfig","LandSlope", "Condition1", "Condition2", "BldgType","OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd","RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType","MasVnrArea", "ExterQual", "ExterCond", "Foundation", "BsmtQual","BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1","BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "Heating", "CentralAir", "Electrical","LowQualFinSF", "BsmtFullBath", "BsmtHalfBath","TotRmsAbvGrd", "Functional","GarageYrBlt", "GarageFinish", "GarageArea", "GarageQual","PavedDrive", "WoodDeckSF","3SsnPorch", "MiscFeature", "YrSold","SaleCondition"], axis=1, inplace = True )
nutrition.drop(columns=["link"], axis=1, inplace=True)

In [36]:
#checking the percentage of null values
percentage_missing = nutrition.isnull().sum()*100 / len(nutrition)
percentage_missing

average_flavor_rating    53.928571
brand_name                0.000000
number_of_flavors        54.642857
number_of_reviews         2.261905
overall_rating            2.261905
price                     0.000000
price_per_serving         0.000000
product_category         20.357143
product_description       0.000000
product_name              0.000000
top_flavor_rated         54.642857
verified_buyer_number    39.404762
verified_buyer_rating    39.404762
dtype: float64

# Imputation



In [37]:
#Imputing using the Mode function for categorical data

nutrition['product_category'] = nutrition['product_category'].fillna(nutrition['product_category'].mode()[0])
nutrition['top_flavor_rated'] = nutrition['top_flavor_rated'].fillna(nutrition['top_flavor_rated'].mode()[0])
nutrition['verified_buyer_number'] = nutrition['verified_buyer_number'].fillna(nutrition['verified_buyer_number'].mode()[0])


# Imputing the NULL Values in a column for using the Mean function
nutrition['average_flavor_rating'] = nutrition['average_flavor_rating'].fillna((nutrition['average_flavor_rating'].mean()))
nutrition['number_of_flavors'] = nutrition['number_of_flavors'].fillna((nutrition['number_of_flavors'].mean()))

#Imputing using padding

nutrition['verified_buyer_rating'].fillna(method='pad', inplace=True)

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)

# Fit and transform the data using the imputer

nutrition['number_of_reviews'] = imputer.fit_transform(nutrition['number_of_reviews'].values.reshape(-1, 1))
nutrition['overall_rating'] = imputer.fit_transform(nutrition['overall_rating'].values.reshape(-1, 1))

## Label Encoder

In [38]:
label_encoder = LabelEncoder()

print (label_encoder)

nutrition["brand_name"] = label_encoder.fit_transform(nutrition["brand_name"])

nutrition.head()

LabelEncoder()


Unnamed: 0,average_flavor_rating,brand_name,number_of_flavors,number_of_reviews,overall_rating,price,price_per_serving,product_category,product_description,product_name,top_flavor_rated,verified_buyer_number,verified_buyer_rating
0,9.1,26,29.0,2575.0,9.4,19.99,0.67,BCAAs,BCAA Powder with Natural Energizers Sourced fr...,BCAA Energy,Pink Starblast,1594,9.0
1,8.4,62,43.0,9926.0,9.3,57.99,0.79,Build Muscle Products,24g of Whey Protein with Amino Acids for Muscl...,Gold Standard 100% Whey,Unflavored,3932,9.0
2,8.3,36,9.0,3947.0,9.1,48.99,1.63,Improve Workout Products,Pre-Workout Powder Powerhouse Packed with 13-H...,Pre JYM,Raspberry Lemonade,3471,9.0
3,8.66615,62,6.288714,2466.0,9.1,18.99,0.63,Amino Acids,Amino Acid Powder with Caffeine from Natural S...,Essential AmiN.O. Energy,Unflavored,1,9.0
4,8.7,36,14.0,2506.0,9.2,56.98,1.1,Whey Protein Isolate,"24g of Pure, Quality Protein in Every Scoop wi...",Pro JYM,S'mores,2275,9.0


In [39]:
#Checking for null values
nutrition.isnull().sum().sum()

0

## Initializing H2O using the below code

In [40]:
df1 = h2o.H2OFrame(
   nutrition
) 

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


**Splitting the dataset into training and test dataset**

In [41]:
# Create a 80/20 train/test split
pct_rows=0.80
df1_train, df1_test = df1.split_frame([pct_rows])

In [42]:
# Checking the shape of both training an dtest dataset
print(df1_train.shape)
print(df1_test.shape)

(680, 13)
(160, 13)


In [43]:
X1=df1.columns
print(X)

['Unnamed', 'airline', 'flight', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class', 'duration', 'days_left']


In [44]:
#Seperating Dependent variable from Independent variable
y1_numeric ='price_per_serving'
X1.remove(y_numeric)
print(X)

['Unnamed', 'airline', 'flight', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class', 'duration', 'days_left']


## H20 AutoML Execution
Run AutoML. The max_runtime_secs argument provides a way to limit the AutoML run by time.

In [45]:
# Setting of AutoML
aml = H2OAutoML(max_runtime_secs=run_time, seed=1)

In [46]:
# Training the dataset on different models thereby passing the data through H20AutoML
aml.train(x=X1,y=y1_numeric,training_frame=df1_train)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


key,value
Stacking strategy,cross_validation
Number of base models (used / total),5/6
# GBM base models (used / total),1/1
# XGBoost base models (used / total),1/1
# GLM base models (used / total),1/1
# DRF base models (used / total),2/2
# DeepLearning base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.38922,0.059953,0.4719174,0.3291771,0.3604937,0.3529507,0.4315609
mean_residual_deviance,0.3451613,0.1219603,0.4334089,0.2091109,0.2887004,0.2864406,0.5081459
mse,0.3451613,0.1219603,0.4334089,0.2091109,0.2887004,0.2864406,0.5081459
null_deviance,100.44998,25.79637,99.79257,81.627945,86.355194,89.44474,145.02948
r2,0.5366154,0.0913987,0.3919653,0.6427287,0.5602464,0.5608267,0.5273101
residual_deviance,46.877834,16.406862,60.677246,29.066416,37.81975,39.24236,67.583405
rmse,0.5801956,0.1032862,0.658338,0.4572865,0.5373085,0.5352014,0.7128435
rmsle,0.2347033,0.0320528,0.2818045,0.2096398,0.2173958,0.2105199,0.2541565


# Identifing predictor significance using OLS regression

In [47]:
nutrition.columns

Index(['average_flavor_rating', 'brand_name', 'number_of_flavors',
       'number_of_reviews', 'overall_rating', 'price', 'price_per_serving',
       'product_category', 'product_description', 'product_name',
       'top_flavor_rated', 'verified_buyer_number', 'verified_buyer_rating'],
      dtype='object')

In [48]:
# #Using OLS for finding the p value and t statistics 
# import statsmodels.api as sm
# model = sm.OLS(nutrition['price_per_serving'], nutrition[['average_flavor_rating', 'brand_name', 'link', 'number_of_flavors',
#        'number_of_reviews', 'overall_rating', 'price',
#        'product_category', 'product_description', 'product_name',
#        'top_flavor_rated', 'verified_buyer_number', 'verified_buyer_rating']]).fit()

# # Print out the statistics
# model.summary()

# GLM Model

In [49]:
#Build Simple GLM Model
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

data_glm1 = H2OGeneralizedLinearEstimator(family="gaussian", standardize=True)
data_glm1.train(x=X1,
               y=y1_numeric,
               training_frame  =df1_train,
               validation_frame=df1_test)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,gaussian,identity,"Elastic Net (alpha = 0.5, lambda = 0.03397 )",1450,11,2,py_8_sid_8161

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_deviance,training_mae,training_r2,validation_rmse,validation_deviance,validation_mae,validation_r2
,2023-04-19 02:48:53,0.000 sec,0,501.2702376,0.7371621,,,,,,,,
,2023-04-19 02:48:53,0.023 sec,2,,,0.7375064,0.5439157,0.5255275,0.2621492,0.7605416,0.5784235,0.5602883,0.245535

variable,relative_importance,scaled_importance,percentage
top_flavor_rated.Unflavored,0.5783988,1.0,0.3724489
product_category.Creatine Monohydrate,0.2963899,0.5124317,0.1908547
product_category.Whey Protein,0.2906271,0.5024684,0.1871438
top_flavor_rated.Sem sabor,0.1339930,0.2316619,0.0862822
number_of_flavors,0.0800770,0.1384461,0.0515641
brand_name,0.0514357,0.0889277,0.0331210
average_flavor_rating,0.0353280,0.0610789,0.0227488
verified_buyer_number,0.0312463,0.0540221,0.0201205
verified_buyer_rating,0.0229276,0.0396398,0.0147638
number_of_reviews,0.0170620,0.0294986,0.0109867


In [50]:
# LASSO Regularization
data_glm1_regularization_lasso = H2OGeneralizedLinearEstimator(
    family="gaussian", alpha=1, nfolds=5
)

# RIDGE Regularization 
data_glm1_regularization_ridge = H2OGeneralizedLinearEstimator(
    family="gaussian", alpha=0, nfolds=5
)

We have built a GLM model and have got th above metrics. We will now use Lasso regularization to check the difference between a model without regularization and and one with regularization

# Lasso Regularization

In [51]:
data_glm1_regularization_lasso.train(x=X1, y=y1_numeric, training_frame=df1_train)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,gaussian,identity,Lasso (lambda = 0.01698 ),1450,11,2,py_8_sid_8161

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.5425274,0.0637993,0.4740872,0.5816492,0.4978881,0.5278376,0.6311747
mean_residual_deviance,0.5727012,0.2293339,0.3897725,0.5554106,0.4632907,0.4859503,0.9690822
mse,0.5727012,0.2293339,0.3897725,0.5554106,0.4632907,0.4859503,0.9690822
null_deviance,100.64282,35.783554,71.524414,94.70941,85.757835,88.393234,162.82922
r2,0.2270907,0.0826722,0.2714689,0.2313915,0.1938197,0.3292166,0.1095569
residual_deviance,78.78296,36.015038,51.83974,72.75878,69.03031,58.799984,141.486
rmse,0.7463504,0.139921,0.6243176,0.7452587,0.6806546,0.6971013,0.9844197
rmsle,0.3212326,0.0271408,0.2858759,0.3343574,0.3082489,0.3196639,0.3580172

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_deviance,training_mae,training_r2
,2023-04-19 02:48:54,0.000 sec,0,501.2702376,0.7371621,,,,
,2023-04-19 02:48:54,0.005 sec,2,,,0.7313424,0.5348618,0.5215246,0.2744313

variable,relative_importance,scaled_importance,percentage
top_flavor_rated.Unflavored,0.6310681,1.0,0.3537629
product_category.Creatine Monohydrate,0.3931770,0.6230342,0.2204064
product_category.Whey Protein,0.2945803,0.4667963,0.1651352
top_flavor_rated.Sem sabor,0.2118687,0.3357304,0.1187689
number_of_flavors,0.0809235,0.1282326,0.0453639
brand_name,0.0490588,0.0777393,0.0275013
verified_buyer_number,0.0414688,0.0657121,0.0232465
average_flavor_rating,0.0354386,0.0561565,0.0198661
verified_buyer_rating,0.0218823,0.0346751,0.0122668
overall_rating,0.0132832,0.0210487,0.0074463


On comparing GLM model details without regularization and model details with Lasso Regularization we got to know following things:

- The GLM model without regularization and the GLM model with Lasso regularization have different model details, as shown by the different values of their evaluation metrics.

- The Lasso regularization appears to have improved the performance of the model, as the MSE, RMSE, MAE, and RMSLE have all decreased in value compared to the model without regularization. Additionally, the R^2 value has increased, indicating that the model with Lasso regularization is better able to explain the variance in the data.

- The Null deviance and residual deviance also show improvements in the Lasso regularization model, with a decrease in both values. The AIC value, which is a measure of the model's goodness-of-fit while taking into account the number of model parameters, is higher for the Lasso regularization model, indicating that it has more parameters but also better performance.

**Overall, we can understand that the Lasso regularization has provided better model performance and reduced overfitting compared to the GLM model without regularization.**

# Ridge Regularization

In [52]:
data_glm1_regularization_ridge.train(x=X1, y=y1_numeric, training_frame=df1_train)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
,gaussian,identity,Ridge ( lambda = 0.01698 ),1450,1242,1,py_8_sid_8161

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.4706992,0.0372914,0.4622189,0.493725,0.4495162,0.5213979,0.4266378
mean_residual_deviance,0.4529226,0.1420174,0.4107626,0.6503722,0.3534915,0.5442941,0.3056924
mse,0.4529226,0.1420174,0.4107626,0.6503722,0.3534915,0.5442941,0.3056924
null_deviance,100.67359,26.389181,97.731476,144.06277,83.36978,101.889694,76.31424
r2,0.3870768,0.0557738,0.4254221,0.3575597,0.4431766,0.3053877,0.403838
residual_deviance,61.604362,20.133204,54.63143,92.352844,45.953903,70.75823,44.3254
rmse,0.6665146,0.1041681,0.6409077,0.8064566,0.5945516,0.7377629,0.5528946
rmsle,0.2774564,0.0148629,0.2656692,0.2855299,0.2735189,0.2990966,0.2634671

Unnamed: 0,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_deviance,training_mae,training_r2
,2023-04-19 02:48:57,0.000 sec,0,501.2702376,0.7371621,,,,
,2023-04-19 02:48:57,0.249 sec,1,,,0.5401496,0.2917615,0.3772906,0.6042098

variable,relative_importance,scaled_importance,percentage
top_flavor_rated.Unflavored,0.6064063,1.0,0.0103461
"product_description.50G Of Ultra-Premium, Fast And Slow Release Protein Matrix",0.4914008,0.8103491,0.0083840
product_name.Stacked Protein Gainer,0.4914008,0.8103491,0.0083840
top_flavor_rated.Sem sabor,0.4302544,0.7095151,0.0073407
product_category.Creatine Monohydrate,0.4246346,0.7002478,0.0072448
product_category.Protein,0.4121689,0.6796911,0.0070322
top_flavor_rated.Vanilla Ice Cream,0.3102536,0.5116267,0.0052933
product_category.Whey Protein,0.3082081,0.5082534,0.0052584
"product_description.Pre-Mix Pre-Workout for Energy, Focus and Ultimate Convenience!*",0.2666955,0.4397967,0.0045502
product_name.C4 On The Go,0.2666955,0.4397967,0.0045502


On comparing GLM model details without regularization and model details with Lasso Regularization we got to know following things:


- The mean squared error (MSE) of the GLM model with ridge regularization is lower than that of the GLM model without regularization, which means that it is making fewer errors in its predictions. Similarly, the root mean squared error (RMSE), mean absolute error (MAE), and root mean squared logarithmic error (RMSLE) are all lower for the model with ridge regularization.

- Furthermore, the R-squared value for the GLM model with ridge regularization is significantly higher than that for the GLM model without regularization. This indicates that the model with ridge regularization is able to explain a larger proportion of the variance in the response variable.

- The residual degrees of freedom for the GLM model with ridge regularization is negative which can be an indication of overfitting, so it is recommended to check the model for overfitting before making any conclusion. Additionally, it is worth noting that the AIC value for the GLM model with ridge regularization is much higher than that for the model without regularization. This is because AIC penalizes the complexity of the model, and the regularization term adds complexity to the model.


**Based on the above observations, we can see that the GLM model with ridge regularization outperforms the GLM model without regularization in terms of predictive accuracy.**

# Which Regularization Method Helps ?

Based on the given metrics, the GLM model with Ridge regularization appears to be more useful compared to the GLM model with Lasso regularization.

The Ridge regularization model has lower values for all the metrics, including MSE, RMSE, MAE, and RMSLE, indicating better performance in terms of predictive accuracy. Additionally, the R-squared value for the Ridge model is higher, indicating a better fit of the model to the data.

Furthermore, the residual degrees of freedom for the Lasso model are higher, which suggests that the Lasso model is overfitting the data. In contrast, the Ridge model has negative residual degrees of freedom, indicating that the model is underfitting the data. However, underfitting can be corrected by increasing the complexity of the model, which can be achieved by adjusting the regularization parameter.

Therefore, in this case, the Ridge regularization model is preferred over the Lasso regularization model as it shows better performance in terms of predictive accuracy and a better fit to the data.


# Conclusion

# Reference 


1. H20.ai- https://docs.h2o.ai/

2. OLS Model- http://net-informations.com/ds/mla/ols.html

3. Github Notebooks- https://github.com/aiskunks/YouTube/blob/main/A_Crash_Course_in_Statistical_Learning/AutoML/CC_Kaggle_AutoML_Regression_Melbourne_Housing.ipynb
    https://github.com/aiskunks/YouTube/blob/main/A_Crash_Course_in_Statistical_Learning/AutoML/6105_AutoML_The_World_Happiness_Data.ipynb
    https://github.com/aiskunks/YouTube/blob/main/A_Crash_Course_in_Statistical_Learning/AutoML/AutoML_Wine_Quality.ipynb

4. Youtbube - https://youtu.be/21TgKhy1GY4

5. https://chat.openai.com/

# License 

MIT License

Copyright (c) 2023 Shreyasi Wakankar

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.