### About the Dataset

Food delivery is a courier service in which a restaurant, store, or independent food-delivery company delivers food to a customer. An order is typically made either through a restaurant or grocer's website or mobile app, or through a food ordering company. The delivered items can include entrees, sides, drinks, desserts, or grocery items and are typically delivered in boxes or bags. The delivery person will normally drive a car, but in bigger cities where homes and restaurants are closer together, they may use bikes or motorized scooters.




### Importing the necessary libraries:

In [1]:
### Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

### ML Libraries:
## Preprocessing:
import sklearn
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import cross_val_score, train_test_split

## Models:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

## Pipeline
from sklearn.pipeline import Pipeline

## Tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

## Evaluation:
from sklearn.metrics import mean_squared_error,r2_score

import joblib

import xgboost
from xgboost import XGBRegressor
import lightgbm
from lightgbm import LGBMRegressor

In [2]:
!pip install lightgbm



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**General steps to follow:**

1. Define the problem: Clearly articulate the problem you want to solve using machine learning. Identify whether it is a classification, regression, clustering, or other types of tasks.

2. Preprocessing: Handle missing values, outliers, and formatting; apply label encoding or one-hot encoding.
- Feature removal: simple correlation

3. Split the data: train_test_split

4. Feature Selection:
- Feature importance from tree-based models
- Feature selection --> statistical methods using SelectKBest
- Principal Component Analysis (PCA)

5. Choose candidate algorithms: Baseline models

6. Initial training and evaluation: Train each candidate algorithm on the training set and evaluate its performance on the testing set using appropriate metrics (e.g., accuracy, F1-score, mean squared error, etc.).

7. Hyperparameter tuning: Grid Search/ Randomised Search

8. Model comparison: Compare the performance of each algorithm based on the evaluation metrics. Consider factors such as accuracy, interpretability, training time, and ease of implementation.

9. Fine-tuning: If necessary, fine-tune the top-performing algorithms by making further adjustments to hyperparameters or exploring different feature sets.

10. Ensemble methods: bagging, boosting, or stacking

11. Final evaluation: Evaluate the top-performing models using k-fold cross-validation to obtain a more robust estimate of their performance.

12. Select the best algorithm: Based on the evaluation results, choose the algorithm that performs the best on the validation set or cross-validation.

13. Test on unseen data: Once you have selected the best algorithm, test it on completely unseen data to ensure its generalization to new samples.
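The steps above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: `make_regression` stands in for the delivery data, and the two candidate models and their settings are arbitrary choices made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (an assumption; stands in for the real features/target)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Step 3: hold out a test set that is only touched at the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 5-6: score each candidate with cross-validation on the training set only
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
cv_rmse = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X_train, y_train, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

# Step 12: pick the candidate with the lowest cross-validated RMSE
best_name = min(cv_rmse, key=cv_rmse.get)

# Step 13: refit on the full training set and evaluate once on the held-out data
best_model = candidates[best_name].fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, best_model.predict(X_test)) ** 0.5
print(cv_rmse, best_name, test_rmse)
```

Keeping the test set out of model selection is what makes the final number an honest estimate of performance on unseen data.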

### Preprocessing the data:

In [4]:
### import the data set
df=pd.read_csv("/content/drive/MyDrive/Kural's Project/archive 2/train.csv")
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,(min) 24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,(min) 33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,(min) 26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,(min) 21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,(min) 30


In [5]:
df.shape

(45593, 20)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45593 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           45593 non-null  object 
 1   Delivery_person_ID           45593 non-null  object 
 2   Delivery_person_Age          45593 non-null  object 
 3   Delivery_person_Ratings      45593 non-null  object 
 4   Restaurant_latitude          45593 non-null  float64
 5   Restaurant_longitude         45593 non-null  float64
 6   Delivery_location_latitude   45593 non-null  float64
 7   Delivery_location_longitude  45593 non-null  float64
 8   Order_Date                   45593 non-null  object 
 9   Time_Orderd                  45593 non-null  object 
 10  Time_Order_picked            45593 non-null  object 
 11  Weatherconditions            45593 non-null  object 
 12  Road_traffic_density         45593 non-null  object 
 13  Vehicle_conditio

In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Restaurant_latitude,45593.0,17.017729,8.185109,-30.905562,12.933284,18.546947,22.728163,30.914057
Restaurant_longitude,45593.0,70.231332,22.883647,-88.366217,73.17,75.898497,78.044095,88.433452
Delivery_location_latitude,45593.0,17.465186,7.335122,0.01,12.988453,18.633934,22.785049,31.054057
Delivery_location_longitude,45593.0,70.845702,21.118812,0.01,73.28,76.002574,78.107044,88.563452
Vehicle_condition,45593.0,1.023359,0.839065,0.0,0.0,1.0,2.0,3.0


In [8]:
### checking missing values
df.isnull().sum()

ID                             0
Delivery_person_ID             0
Delivery_person_Age            0
Delivery_person_Ratings        0
Restaurant_latitude            0
Restaurant_longitude           0
Delivery_location_latitude     0
Delivery_location_longitude    0
Order_Date                     0
Time_Orderd                    0
Time_Order_picked              0
Weatherconditions              0
Road_traffic_density           0
Vehicle_condition              0
Type_of_order                  0
Type_of_vehicle                0
multiple_deliveries            0
Festival                       0
City                           0
Time_taken(min)                0
dtype: int64

In [9]:
df.columns

Index(['ID', 'Delivery_person_ID', 'Delivery_person_Age',
       'Delivery_person_Ratings', 'Restaurant_latitude',
       'Restaurant_longitude', 'Delivery_location_latitude',
       'Delivery_location_longitude', 'Order_Date', 'Time_Orderd',
       'Time_Order_picked', 'Weatherconditions', 'Road_traffic_density',
       'Vehicle_condition', 'Type_of_order', 'Type_of_vehicle',
       'multiple_deliveries', 'Festival', 'City', 'Time_taken(min)'],
      dtype='object')

In [10]:
df["ID"].nunique()  ## can be removed

45593

In [11]:
df['Delivery_person_ID'].nunique()

1320

In [12]:
df.Delivery_person_ID

0          INDORES13DEL02 
1          BANGRES18DEL02 
2          BANGRES19DEL01 
3         COIMBRES13DEL02 
4          CHENRES12DEL01 
               ...        
45588       JAPRES04DEL01 
45589       AGRRES16DEL01 
45590      CHENRES08DEL03 
45591     COIMBRES11DEL01 
45592    RANCHIRES09DEL02 
Name: Delivery_person_ID, Length: 45593, dtype: object

In [13]:
### Missing values are stored as the string "NaN " rather than as real NaN values.
### With regex=True, every cell that contains "NaN" is replaced by np.nan.
df.replace('NaN', np.nan, regex=True, inplace=True)
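A small sketch of what this replacement does, using a hypothetical two-column frame (the cell values are illustrative, not taken from the dataset). Because the replacement value is not a string, pandas swaps out the whole cell whenever the pattern is found anywhere inside it.

```python
import numpy as np
import pandas as pd

# Hypothetical cells: the missing marker may carry extra text or whitespace
toy = pd.DataFrame({
    "Delivery_person_Age": ["37", "NaN ", "23"],
    "Weatherconditions": ["conditions Sunny", "conditions NaN", "conditions Cloudy"],
})
missing_before = int(toy.isna().sum().sum())

# With regex=True, any cell containing the pattern is replaced as a whole
cleaned = toy.replace("NaN", np.nan, regex=True)
missing_after = int(cleaned.isna().sum().sum())
print(missing_before, missing_after)
```

This is why `isnull().sum()` reports zeros before the replacement and non-zero counts afterwards.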

In [14]:
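### City codes are embedded in the delivery person ID (e.g. INDORES13DEL02 -> INDO),
### so we isolate them by dropping the last 11 characters (RESxxDELxx plus a trailing space).
### Note: this overwrites the original City column (Urban / Metropolitian).
df["City"]=df.Delivery_person_ID.apply(lambda x:x[:-11])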
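Fixed-width slicing works as long as every ID ends in the same 11-character suffix. A pattern-based extraction is a possible alternative; the sketch below assumes the layout `<CITY>RES<digits>DEL<digits>` with optional trailing whitespace, which is inferred from the sample IDs printed above rather than from any documentation.

```python
import pandas as pd

# Sample IDs copied from the output above, including the trailing space
ids = pd.Series(["INDORES13DEL02 ", "COIMBRES13DEL02 ", "RANCHIRES09DEL02 ", "AGRRES16DEL01 "])

# Fixed-width slicing: drop the last 11 characters
by_slice = ids.apply(lambda x: x[:-11])

# Pattern-based alternative (assumed ID layout, see lead-in)
by_regex = ids.str.extract(r"^(.*?)RES\d+DEL\d+\s*$")[0]

print(by_slice.tolist())
print(by_regex.tolist())
```

Both approaches agree on these samples; the regex version would return NaN instead of a wrong prefix if an ID did not follow the assumed layout.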

In [16]:
## analysing the target variable:
df["Time_taken(min)"]

0        (min) 24
1        (min) 33
2        (min) 26
3        (min) 21
4        (min) 30
           ...   
45588    (min) 32
45589    (min) 36
45590    (min) 16
45591    (min) 26
45592    (min) 36
Name: Time_taken(min), Length: 45593, dtype: object

In [17]:
df["Time_taken(min)"]=df["Time_taken(min)"].apply(lambda x:x.split(" ")[1])

In [18]:
### Converting types:
num_cols = ['Delivery_person_Age','Delivery_person_Ratings','Restaurant_latitude','Restaurant_longitude',
            'Delivery_location_latitude','Delivery_location_longitude','Vehicle_condition',
            'multiple_deliveries','Time_taken(min)']
for col in num_cols:
    df[col]=df[col].astype('float64')

df['Order_Date']=pd.to_datetime(df['Order_Date'],format="%d-%m-%Y")   ### bringing date into format

In [19]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37.0,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,11:30:00,11:45:00,conditions Sunny,High,2.0,Snack,motorcycle,0.0,No,INDO,24.0
1,0xb379,BANGRES18DEL02,34.0,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,19:45:00,19:50:00,conditions Stormy,Jam,2.0,Snack,scooter,1.0,No,BANG,33.0
2,0x5d6d,BANGRES19DEL01,23.0,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,08:30:00,08:45:00,conditions Sandstorms,Low,0.0,Drinks,motorcycle,1.0,No,BANG,26.0
3,0x7a6a,COIMBRES13DEL02,38.0,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,18:00:00,18:10:00,conditions Sunny,Medium,0.0,Buffet,motorcycle,1.0,No,COIMB,21.0
4,0x70a2,CHENRES12DEL01,32.0,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,13:30:00,13:45:00,conditions Cloudy,High,1.0,Snack,scooter,1.0,No,CHEN,30.0


In [21]:
df.columns

Index(['ID', 'Delivery_person_ID', 'Delivery_person_Age',
       'Delivery_person_Ratings', 'Restaurant_latitude',
       'Restaurant_longitude', 'Delivery_location_latitude',
       'Delivery_location_longitude', 'Order_Date', 'Time_Orderd',
       'Time_Order_picked', 'Weatherconditions', 'Road_traffic_density',
       'Vehicle_condition', 'Type_of_order', 'Type_of_vehicle',
       'multiple_deliveries', 'Festival', 'City', 'Time_taken(min)'],
      dtype='object')

In [22]:
## Calculating the difference (in minutes) between order time and pickup time:
diff=pd.to_datetime(df['Time_Order_picked'])-pd.to_datetime(df['Time_Orderd'])
diff=diff.dt.total_seconds()/60   ## rows with a missing order time yield NaN
## Negative gaps (pickup after midnight) are replaced by the mean of the positive gaps:
diff=diff.mask(diff<0,diff.where(diff>0).mean())
## Missing gaps are filled with the mean of the remaining values:
df['difference']=diff.fillna(diff.mean())
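The imputation logic can be traced on a toy frame. The four rows are hypothetical and chosen to cover a normal gap, a pickup after midnight (negative gap), and a missing order time. Note that NaN never compares equal to anything, so missing values have to be detected with `isna()`/`notna()` rather than with `!= np.nan`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Time_Orderd": ["11:30:00", "23:55:00", np.nan, "18:00:00"],
    "Time_Order_picked": ["11:45:00", "00:05:00", "13:45:00", "18:05:00"],
})

# NaN != NaN is True, so a `!= np.nan` filter keeps every row
nan_is_unequal = np.nan != np.nan

fmt = "%H:%M:%S"
ordered = pd.to_datetime(toy["Time_Orderd"], format=fmt)
picked = pd.to_datetime(toy["Time_Order_picked"], format=fmt)
diff = (picked - ordered).dt.total_seconds() / 60   # 15, -1430, NaN, 5

# Negative gaps -> mean of the positive gaps (15 and 5 give 10)
diff = diff.mask(diff < 0, diff[diff > 0].mean())
# Missing gaps -> mean of the remaining values
diff = diff.fillna(diff.mean())
print(diff.tolist())
```

A pickup shortly after midnight is really a short gap, so replacing it with the mean is an approximation; adding 24 hours to negative gaps would be another option.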

In [23]:
df['hour']=df.Time_Order_picked.apply(lambda x:str(x).split(":")[0])
df["month"]=df.Order_Date.apply(lambda x:str(x).split("-")[1])
df["day"]=df.Order_Date.apply(lambda x:str(x).split("-")[2][:2])
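As a possible alternative to string splitting, the `.dt` accessor returns these components as integers directly. The sketch below uses two dates and times copied from the sample rows and assumes the `HH:MM:SS` layout shown there.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["19-03-2022", "05-04-2022"]), format="%d-%m-%Y")
times = pd.to_datetime(pd.Series(["11:45:00", "08:45:00"]), format="%H:%M:%S")

# .dt exposes the calendar/clock components as integer columns
parts = pd.DataFrame({
    "hour": times.dt.hour,
    "month": dates.dt.month,
    "day": dates.dt.day,
})
print(parts)
```

With this approach the columns are numeric from the start, so they would not appear among the `object` columns that are label-encoded later.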

In [24]:
df.columns

Index(['ID', 'Delivery_person_ID', 'Delivery_person_Age',
       'Delivery_person_Ratings', 'Restaurant_latitude',
       'Restaurant_longitude', 'Delivery_location_latitude',
       'Delivery_location_longitude', 'Order_Date', 'Time_Orderd',
       'Time_Order_picked', 'Weatherconditions', 'Road_traffic_density',
       'Vehicle_condition', 'Type_of_order', 'Type_of_vehicle',
       'multiple_deliveries', 'Festival', 'City', 'Time_taken(min)',
       'difference', 'hour', 'month', 'day'],
      dtype='object')

In [25]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min),difference,hour,month,day
0,0x4607,INDORES13DEL02,37.0,4.9,22.745049,75.892471,22.765049,75.912471,2022-03-19,11:30:00,...,Snack,motorcycle,0.0,No,INDO,24.0,15.0,11,3,19
1,0xb379,BANGRES18DEL02,34.0,4.5,12.913041,77.683237,13.043041,77.813237,2022-03-25,19:45:00,...,Snack,scooter,1.0,No,BANG,33.0,5.0,19,3,25
2,0x5d6d,BANGRES19DEL01,23.0,4.4,12.914264,77.6784,12.924264,77.6884,2022-03-19,08:30:00,...,Drinks,motorcycle,1.0,No,BANG,26.0,15.0,8,3,19
3,0x7a6a,COIMBRES13DEL02,38.0,4.7,11.003669,76.976494,11.053669,77.026494,2022-04-05,18:00:00,...,Buffet,motorcycle,1.0,No,COIMB,21.0,10.0,18,4,5
4,0x70a2,CHENRES12DEL01,32.0,4.6,12.972793,80.249982,13.012793,80.289982,2022-03-26,13:30:00,...,Snack,scooter,1.0,No,CHEN,30.0,15.0,13,3,26


In [26]:
df.drop(['Order_Date','Time_Orderd','Time_Order_picked','Delivery_person_ID','ID'],axis=1,inplace=True)

In [27]:
df.select_dtypes("object")

Unnamed: 0,Weatherconditions,Road_traffic_density,Type_of_order,Type_of_vehicle,Festival,City,hour,month,day
0,conditions Sunny,High,Snack,motorcycle,No,INDO,11,03,19
1,conditions Stormy,Jam,Snack,scooter,No,BANG,19,03,25
2,conditions Sandstorms,Low,Drinks,motorcycle,No,BANG,08,03,19
3,conditions Sunny,Medium,Buffet,motorcycle,No,COIMB,18,04,05
4,conditions Cloudy,High,Snack,scooter,No,CHEN,13,03,26
...,...,...,...,...,...,...,...,...,...
45588,conditions Windy,High,Meal,motorcycle,No,JAP,11,03,24
45589,conditions Windy,Jam,Buffet,motorcycle,No,AGR,20,02,16
45590,conditions Cloudy,Low,Drinks,scooter,No,CHEN,00,03,11
45591,conditions Cloudy,High,Snack,motorcycle,No,COIMB,13,03,07


In [28]:
## converting into int
df=df.astype({'hour':'int','month':'int','day':'int'})

In [29]:
df.isnull().sum()

Delivery_person_Age            1854
Delivery_person_Ratings        1908
Restaurant_latitude               0
Restaurant_longitude              0
Delivery_location_latitude        0
Delivery_location_longitude       0
Weatherconditions               616
Road_traffic_density            601
Vehicle_condition                 0
Type_of_order                     0
Type_of_vehicle                   0
multiple_deliveries             993
Festival                        228
City                              0
Time_taken(min)                   0
difference                        0
hour                              0
month                             0
day                               0
dtype: int64

In [30]:
df.Delivery_person_Age=df.Delivery_person_Age.replace(np.nan,df.Delivery_person_Age.mode()[0])
df.Delivery_person_Ratings=df.Delivery_person_Ratings.replace(np.nan,df.Delivery_person_Ratings.mode()[0])
df.Weatherconditions=df.Weatherconditions.replace(np.nan,df.Weatherconditions.mode()[0])
df.Road_traffic_density=df.Road_traffic_density.replace(np.nan,df.Road_traffic_density.mode()[0])
df.Festival=df.Festival.replace(np.nan,df.Festival.mode()[0])
df.multiple_deliveries=df.multiple_deliveries.replace(np.nan,df.multiple_deliveries.mode()[0])
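The same mode imputation can be written once for all columns. This is a sketch on a hypothetical toy frame; `mode()` returns one row per tied mode, so the first row is taken.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Delivery_person_Ratings": [4.9, 4.5, np.nan, 4.5],
    "Festival": ["No", np.nan, "No", "Yes"],
})

# First row of mode() holds the most frequent value of each column
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value under its own label
filled = toy.fillna(modes)
print(filled)
```

Mode imputation is a reasonable default for categorical columns; for numeric columns such as age or ratings, the median is a common alternative.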

In [None]:
df.isnull().sum()

In [31]:
df.head()

Unnamed: 0,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min),difference,hour,month,day
0,37.0,4.9,22.745049,75.892471,22.765049,75.912471,conditions Sunny,High,2.0,Snack,motorcycle,0.0,No,INDO,24.0,15.0,11,3,19
1,34.0,4.5,12.913041,77.683237,13.043041,77.813237,conditions Stormy,Jam,2.0,Snack,scooter,1.0,No,BANG,33.0,5.0,19,3,25
2,23.0,4.4,12.914264,77.6784,12.924264,77.6884,conditions Sandstorms,Low,0.0,Drinks,motorcycle,1.0,No,BANG,26.0,15.0,8,3,19
3,38.0,4.7,11.003669,76.976494,11.053669,77.026494,conditions Sunny,Medium,0.0,Buffet,motorcycle,1.0,No,COIMB,21.0,10.0,18,4,5
4,32.0,4.6,12.972793,80.249982,13.012793,80.289982,conditions Cloudy,High,1.0,Snack,scooter,1.0,No,CHEN,30.0,15.0,13,3,26


In [32]:
### Label Encoding:
for i in df.select_dtypes("object").columns:
  le=LabelEncoder()
  df[i]=le.fit_transform(df[i])
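`LabelEncoder` assigns integer codes in sorted order of the labels, and the mapping can be read back from `classes_`. The sketch below uses the traffic-density labels seen in the data (without their surrounding whitespace); since the loop above reuses one variable for every column, keeping each encoder in a dictionary would be needed to decode values later.

```python
from sklearn.preprocessing import LabelEncoder

traffic = ["High", "Jam", "Low", "Medium", "High"]

le = LabelEncoder()
codes = le.fit_transform(traffic)

# classes_ is sorted; the position in the array is the integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
decoded = le.inverse_transform(codes).tolist()
print(mapping)
print(decoded)
```

The resulting integers follow alphabetical order, not the natural order Low < Medium < High < Jam, which matters more for linear models than for tree-based ones.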

### Splitting:

In [33]:
df.head()

Unnamed: 0,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min),difference,hour,month,day
0,37.0,4.9,22.745049,75.892471,22.765049,75.912471,4,0,2.0,3,2,0.0,0,20,24.0,15.0,11,3,19
1,34.0,4.5,12.913041,77.683237,13.043041,77.813237,3,1,2.0,3,3,1.0,0,6,33.0,5.0,19,3,25
2,23.0,4.4,12.914264,77.6784,12.924264,77.6884,2,2,0.0,1,2,1.0,0,6,26.0,15.0,8,3,19
3,38.0,4.7,11.003669,76.976494,11.053669,77.026494,4,3,0.0,0,2,1.0,0,12,21.0,10.0,18,4,5
4,32.0,4.6,12.972793,80.249982,13.012793,80.289982,0,0,1.0,3,3,1.0,0,10,30.0,15.0,13,3,26


In [34]:
### Separate the features and the target:
y=df['Time_taken(min)']
X=df.drop(['Time_taken(min)'],axis=1)

In [35]:
### Splitting the data set into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [36]:
X_train.shape,X_test.shape

((36474, 18), (9119, 18))
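The split sizes follow from the row count: scikit-learn rounds the test share up, and the training set receives the rest. A quick check with placeholder arrays of the same shape (the zeros are dummies, not the real data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows, n_features = 45593, 18
X_dummy = np.zeros((n_rows, n_features))
y_dummy = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, test_size=0.20, random_state=42)

# ceil(0.20 * 45593) = ceil(9118.6) = 9119 test rows, 45593 - 9119 = 36474 training rows
expected_test = math.ceil(0.20 * n_rows)
print(X_tr.shape, X_te.shape, expected_test)
```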

### Scaling:

In [37]:
### Standard scaling for the features:
sc=StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train

array([[ 1.24173218,  0.48234318,  0.64412877, ..., -1.14979807,
         0.03717623,  0.82442691],
       [-0.31140178, -1.33875114, -0.57526239, ...,  0.91681282,
         0.03717623, -1.4747513 ],
       [-1.00168355, -0.12468826,  0.69959353, ..., -0.02255577,
         0.03717623, -1.35979239],
       ...,
       [ 1.58687306,  0.7858589 ,  0.23067665, ..., -0.58617692,
         0.03717623, -1.4747513 ],
       [-0.82911311,  0.7858589 , -0.4837223 , ..., -0.02255577,
         0.03717623,  1.39922146],
       [ 0.20630954, -0.12468826,  0.04511454, ...,  0.16531795,
         0.03717623, -0.09524437]])
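When the scaler is fitted on the whole training set and cross-validation is run afterwards, each validation fold has already influenced the scaling statistics. Wrapping the scaler and the model in a `Pipeline` refits the scaler inside every fold. The sketch below uses synthetic data and `Ridge` purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (an assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# The scaler is fitted on the training part of each fold only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge()),
])
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error")
print(-scores.mean())
```

For standard scaling the effect is usually small, and tree-based models are largely insensitive to feature scaling, but the pipeline form also keeps the scaler and model together when the model is saved.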

### Baseline Model Testing:

In [38]:
%%time
model=[LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), Ridge(), Lasso(), GradientBoostingRegressor()]
dict_model_results={}
for i in model:
  # scores are negated RMSE values, so closer to zero is better
  neg_rmse=cross_val_score(i,X_train,y_train,cv=10,scoring='neg_root_mean_squared_error')
  dict_model_results[str(i)]=np.array(neg_rmse).mean()

CPU times: user 6min 25s, sys: 1.94 s, total: 6min 27s
Wall time: 7min


In [39]:
dict_model_results

{'LinearRegression()': -7.055105448005174,
 'DecisionTreeRegressor()': -6.4682490437239135,
 'RandomForestRegressor()': -4.721740545043749,
 'Ridge()': -7.055104841442541,
 'Lasso()': -7.499202770043441,
 'GradientBoostingRegressor()': -5.1507721821429495}
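Since the scorer returns negated RMSE, flipping the sign and sorting makes the comparison easier to read. The values below are copied from the output above.

```python
# Mean cross-validated scores copied from the output above
dict_model_results = {
    'LinearRegression()': -7.055105448005174,
    'DecisionTreeRegressor()': -6.4682490437239135,
    'RandomForestRegressor()': -4.721740545043749,
    'Ridge()': -7.055104841442541,
    'Lasso()': -7.499202770043441,
    'GradientBoostingRegressor()': -5.1507721821429495,
}

# Flip the sign to get RMSE in minutes, then sort ascending (lower is better)
ranking = sorted((-score, name) for name, score in dict_model_results.items())
for rmse_value, name in ranking:
    print(f"{name:<30} {rmse_value:.3f}")
```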

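Based on the cross-validated RMSE above, RandomForestRegressor gives the best baseline result (about 4.72 minutes), followed by GradientBoostingRegressor (about 5.15). The linear models (LinearRegression, Ridge, Lasso) trail at roughly 7.06 to 7.50, so the random forest is tuned next.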

### Hyperparameter tuning the best models:

In [44]:
param_grid={'bootstrap': [True, False],
 'max_depth': [5, 8, 10, 15, 20,],
 'max_features': [1.0, 'sqrt'],  # 'auto' (all features for regressors) was removed in scikit-learn 1.3; 1.0 is the equivalent
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [70,90,100,120,150]}
model_rf=RandomForestRegressor()
rs_rf=RandomizedSearchCV(model_rf,param_distributions=param_grid,cv=5,verbose=2, n_jobs =-1)
rs_rf.fit(X_train,y_train)
rs_rf.best_params_

model=RandomForestRegressor(**rs_rf.best_params_)
model.fit(X_train,y_train)

## calculating predictions:
y_pred=model.predict(X_test)
rmse=mean_squared_error(y_test,y_pred)**0.5
rmse

Fitting 5 folds for each of 10 candidates, totalling 50 fits


KeyboardInterrupt: ignored
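The grid above contains 900 combinations, of which `RandomizedSearchCV` samples 10 by default (hence "10 candidates, totalling 50 fits"). A smaller, reproducible search can be sketched as follows; the synthetic data, `n_iter=3`, `cv=3`, and the seeds are assumptions chosen to keep the example fast, and `1.0` is used for `max_features` in place of `'auto'`, which recent scikit-learn releases no longer accept.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

param_grid = {
    "bootstrap": [True, False],
    "max_depth": [5, 8, 10, 15, 20],
    "max_features": [1.0, "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [70, 90, 100, 120, 150],
}
n_combinations = len(ParameterGrid(param_grid))  # 2*5*2*3*3*5

X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=3,                               # small budget for the demo
    cv=3,
    scoring="neg_root_mean_squared_error",  # same metric as the baselines
    random_state=0,
)
search.fit(X_demo, y_demo)
best_forest = search.best_estimator_        # already refit on all of X_demo
print(n_combinations, search.best_params_, -search.best_score_)
```

With the default `refit=True`, `best_estimator_` is already trained on the full data passed to `fit`, so building and fitting a new model from `best_params_` is optional.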

In [None]:
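filename = "/content/drive/MyDrive/Kural's Project/Delivery_baseline.sav"
## `model` is the tuned, fitted random forest from the previous cell (`model_rf` itself is never fitted)
joblib.dump(model, filename)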

["/content/drive/MyDrive/Kural's Project/Shares_baseline.sav"]

### Trying out the Ensemble Techniques

In [45]:
%%time
model=[XGBRegressor(), LGBMRegressor()]
dict_model_results={}
for i in model:
  # scores are negated RMSE values, so closer to zero is better
  neg_rmse=cross_val_score(i,X_train,y_train,cv=5,scoring='neg_root_mean_squared_error')
  dict_model_results[str(i)]=np.array(neg_rmse).mean()

CPU times: user 42 s, sys: 154 ms, total: 42.1 s
Wall time: 26.6 s


In [46]:
dict_model_results

{'XGBRegressor(base_score=None, booster=None, callbacks=None,\n             colsample_bylevel=None, colsample_bynode=None,\n             colsample_bytree=None, early_stopping_rounds=None,\n             enable_categorical=False, eval_metric=None, feature_types=None,\n             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,\n             interaction_constraints=None, learning_rate=None, max_bin=None,\n             max_cat_threshold=None, max_cat_to_onehot=None,\n             max_delta_step=None, max_depth=None, max_leaves=None,\n             min_child_weight=None, missing=nan, monotone_constraints=None,\n             n_estimators=100, n_jobs=None, num_parallel_tree=None,\n             predictor=None, random_state=None, ...)': -4.347755969592011,
 'LGBMRegressor()': -4.363244703573304}

### Hyperparameter Tuning the best ensemble model:

In [47]:
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'min_child_samples': [10, 20, 30],
    'feature_fraction': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'bagging_fraction': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],  # only takes effect when bagging_freq > 0
    'reg_lambda': [0, 0.1, 0.5, 1.0],
    'reg_alpha': [0, 0.1, 0.5, 1.0]
}

lgb_model=LGBMRegressor()
random_search = RandomizedSearchCV(lgb_model, param_grid, cv=5)

# Fit the randomized search to the training data
random_search.fit(X_train, y_train)



In [49]:
# Get the best hyperparameters and the corresponding model
best_params = random_search.best_params_

model_lgb=LGBMRegressor(**best_params)
model_lgb.fit(X_train,y_train)

pred=model_lgb.predict(X_test)
rmse=mean_squared_error(y_test,pred)**0.5
rmse

4.296780465530134
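For reference, the RMSE used throughout is the square root of the mean squared error. A worked example with three hypothetical predictions shows the arithmetic and checks it against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([24.0, 33.0, 26.0])
y_hat = np.array([22.0, 35.0, 27.0])   # hypothetical predictions

errors = y_true - y_hat                # 2, -2, -1
mse = np.mean(errors ** 2)             # (4 + 4 + 1) / 3 = 3
rmse_manual = np.sqrt(mse)             # square root of 3
rmse_sklearn = mean_squared_error(y_true, y_hat) ** 0.5
print(rmse_manual, rmse_sklearn)
```

Because the target is in minutes, an RMSE of about 4.3 means the typical prediction error is on the order of four minutes, with large misses weighted more heavily than small ones.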

In [50]:
params = {
    'n_estimators':[500],
    'min_child_weight':[4,5],
    'gamma':[i/10.0 for i in range(3,6)],
    'subsample':[i/10.0 for i in range(6,11)],
    'colsample_bytree':[i/10.0 for i in range(6,11)],
    'max_depth': [2,3,4,6,7],
    'objective': ['reg:squarederror', 'reg:tweedie'],
    'booster': ['gbtree', 'gblinear'],  # gblinear ignores the tree parameters above, which triggers the "not used" warnings
    'eval_metric': ['rmse'],
    'eta': [i/10.0 for i in range(3,6)],
}


xgb_model=XGBRegressor()
random_search_xgb = RandomizedSearchCV(xgb_model, params, cv=5)

# Fit the randomized search to the training data
random_search_xgb.fit(X_train, y_train)

Parameters: { "colsample_bytree", "gamma", "max_depth", "min_child_weight", "subsample" } are not used.

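These warnings appear whenever the sampled booster is `gblinear`, which has no trees and therefore ignores tree-specific parameters. One way to avoid such combinations is to give each booster its own sub-space; scikit-learn accepts a list of dictionaries for this. The sketch below only samples parameter sets with `ParameterSampler` (no model is trained), and the specific values are illustrative assumptions.

```python
from sklearn.model_selection import ParameterSampler

# One sub-space per booster, so tree-only parameters are never paired with gblinear
param_spaces = [
    {
        "booster": ["gbtree"],
        "max_depth": [2, 3, 4, 6, 7],
        "subsample": [0.6, 0.8, 1.0],
        "learning_rate": [0.3, 0.4, 0.5],
    },
    {
        "booster": ["gblinear"],
        "reg_lambda": [0, 0.1, 1.0],      # hypothetical values for illustration
        "learning_rate": [0.3, 0.4, 0.5],
    },
]

samples = list(ParameterSampler(param_spaces, n_iter=8, random_state=0))
for params in samples:
    print(params)
```

The same list-of-dictionaries form can be passed as `param_distributions` to `RandomizedSearchCV`.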

In [51]:
# Get the best hyperparameters and the corresponding model
best_params = random_search_xgb.best_params_

model_xgb=XGBRegressor(**best_params)
model_xgb.fit(X_train,y_train)

pred=model_xgb.predict(X_test)
rmse=mean_squared_error(y_test,pred)**0.5
rmse

4.3790641835956965

In [54]:
### Saving the best model:
filename = "/content/drive/MyDrive/Kural's Project/Food_ensemble_lgb.sav"
joblib.dump(model_lgb, filename)

["/content/drive/MyDrive/Kural's Project/Food_ensemble_lgb.sav"]

In [55]:
### Saving the best model:
filename = "/content/drive/MyDrive/Kural's Project/Food_ensemble_xgb.sav"
joblib.dump(model_xgb, filename)

["/content/drive/MyDrive/Kural's Project/Food_ensemble_xgb.sav"]

### Comparing Model Results:

In [56]:
# load the model from disk
filename1="/content/drive/MyDrive/Kural's Project/Food_ensemble_xgb.sav"
xgb_model = joblib.load(filename1)

xgb_model.fit(X_train,y_train)
pred_xgb=xgb_model.predict(X_test)

rmse_ensemble=mean_squared_error(y_test,pred_xgb)**0.5

In [57]:
# load the model from disk
filename2="/content/drive/MyDrive/Kural's Project/Food_ensemble_lgb.sav"
lgb_model = joblib.load(filename2)

lgb_model.fit(X_train,y_train)
pred_lgb=lgb_model.predict(X_test)

rmse_lgb=mean_squared_error(y_test,pred_lgb)**0.5
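A model saved with `joblib.dump` keeps its fitted state, so it can predict right after loading; calling `fit` again retrains it with the same hyperparameters. A minimal round-trip sketch, using a temporary directory, a hypothetical file name, and a small scikit-learn model in place of the XGBoost/LightGBM models:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)
fitted = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.sav")   # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored estimator predicts immediately; no call to fit() is needed
same = bool(np.allclose(fitted.predict(X_demo), restored.predict(X_demo)))
print(same)
```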



In [58]:
print(f"The RMSE obtained from XGBoost is:{rmse_ensemble}")
print(f"The RMSE obtained from LightGBM is:{rmse_lgb}")

The RMSE obtained from XGBoost is:4.3790641835956965
The RMSE obtained from LightGBM is:4.296780465530134
