**BIG MART SALES PREDICTION**

---

BigMart, a leading retail store, has collected data about its sales across different stores and various products. The dataset contains information about product attributes, store attributes, and historical sales data. The challenge is to build a predictive model that can accurately forecast the sales of each product in each store. The objective is to understand the factors that influence sales and create a model that can assist in optimizing inventory management and increasing sales.

**Benefits of solving the problem.**

1.   **Optimized Inventory**: Minimize excess stock and prevent stockouts.
2.   **Increased Revenue**: Improve sales through data-driven decisions.
3.   **Cost Reduction**: Lower storage and carrying costs.
4.   **Customer Satisfaction**: Ensure product availability.
5.   **Data-Driven Decisions**: Use analytics for insights.






# **Importing Data**

In [None]:
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [None]:
df = pd.read_csv('/content/Train.csv')

# **Analysis: EDA & Feature Engineering**

In [None]:
df.shape

(8523, 12)

In [None]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
df.tail()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.38,Regular,0.046982,Baking Goods,108.157,OUT045,2002,,Tier 2,Supermarket Type1,549.285
8520,NCJ29,10.6,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.21,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976
8522,DRG01,14.8,Low Fat,0.044878,Soft Drinks,75.467,OUT046,1997,Small,Tier 1,Supermarket Type1,765.67


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Item_Weight,7060.0,12.857645,4.643456,4.555,8.77375,12.6,16.85,21.35
Item_Visibility,8523.0,0.066132,0.051598,0.0,0.026989,0.053931,0.094585,0.328391
Item_MRP,8523.0,140.992782,62.275067,31.29,93.8265,143.0128,185.6437,266.8884
Outlet_Establishment_Year,8523.0,1997.831867,8.37176,1985.0,1987.0,1999.0,2004.0,2009.0
Item_Outlet_Sales,8523.0,2181.288914,1706.499616,33.29,834.2474,1794.331,3101.2964,13086.9648


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
df.nunique()

Item_Identifier              1559
Item_Weight                   415
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     3
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
dtype: int64

In [None]:
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [None]:
df.corr(numeric_only=True)  # restrict to numeric columns; object columns cannot be correlated

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
Item_Weight,1.0,-0.014048,0.027141,-0.011588,0.014123
Item_Visibility,-0.014048,1.0,-0.001315,-0.074834,-0.128625
Item_MRP,0.027141,-0.001315,1.0,0.00502,0.567574
Outlet_Establishment_Year,-0.011588,-0.074834,0.00502,1.0,-0.049135
Item_Outlet_Sales,0.014123,-0.128625,0.567574,-0.049135,1.0


In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='RdYlGn', vmin=-1, vmax=1)

<Axes: title={'center': 'Wordcloud for Item_Identifier'}>

Fill the missing values in Item_Weight with the column mean.




In [None]:
mean_weight = df['Item_Weight'].mean()

In [None]:
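# Assign the result back; in-place fillna on a selected column is deprecated in recent pandas
df['Item_Weight'] = df['Item_Weight'].fillna(mean_weight)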

In [None]:
df['Item_Weight'].isnull().sum()

0
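A possible refinement, sketched here on a toy frame rather than on the real data: if an item's weight is the same in every outlet where it is sold (an assumption that is not verified in this notebook), the missing weights can be filled from the same `Item_Identifier` first, with the overall mean as a fallback. The frame `toy` and the column `Item_Weight_filled` are illustrative names only.

```python
import pandas as pd

# Toy data: FDA15 has one known and one missing weight; NCD19 has no known weight.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, None, 5.92, 5.92, None],
})

# Mean weight per item, aligned row by row with the original frame.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# First fill from the same item, then fall back to the overall mean.
filled = toy["Item_Weight"].fillna(item_mean)
filled = filled.fillna(toy["Item_Weight"].mean())

toy["Item_Weight_filled"] = filled
print(toy)
```

Rows whose item has at least one known weight receive that item's mean; only items with no known weight at all fall back to the global mean.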

Fill the missing values in Outlet_Size with a placeholder category.




In [None]:
df['Outlet_Size'] = df['Outlet_Size'].fillna("missing")

In [None]:
df['Outlet_Size'].value_counts()

Medium     2793
missing    2410
Small      2388
High        932
Name: Outlet_Size, dtype: int64

In [None]:
df['Outlet_Size'] = df['Outlet_Size'].replace({'missing': 'Missing'})

In [None]:
df['Outlet_Size'].isnull().sum()

0

In [None]:
df.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [None]:
df['Item_Outlet_Sales'].describe()

count     8523.000000
mean      2181.288914
std       1706.499616
min         33.290000
25%        834.247400
50%       1794.331000
75%       3101.296400
max      13086.964800
Name: Item_Outlet_Sales, dtype: float64

In [None]:
q1 = df['Item_Outlet_Sales'].quantile(0.25)
q3 = df['Item_Outlet_Sales'].quantile(0.75)
Outlet_sale_IQR = q3-q1

In [None]:
upper_limit= q3 + 1.5 * Outlet_sale_IQR
lower_limit= q1 - 1.5 * Outlet_sale_IQR

In [None]:
print(upper_limit)
print(lower_limit)

6501.8699
-2566.3261


In [None]:
# Cap the outliers
df['Item_Outlet_Sales'] = df['Item_Outlet_Sales'].clip(lower=lower_limit, upper=upper_limit)

# Check that no values remain outside the limits after capping
outliers_count = len(df[(df['Item_Outlet_Sales'] > upper_limit) | (df['Item_Outlet_Sales'] < lower_limit)])
print(f"{outliers_count} values remain outside the limits after capping.")

0 values remain outside the limits after capping.


In [None]:
df['Item_Outlet_Sales'].describe()

count    8523.000000
mean     2156.313016
std      1624.863069
min        33.290000
25%       834.247400
50%      1794.331000
75%      3101.296400
max      6501.869900
Name: Item_Outlet_Sales, dtype: float64

In [None]:
df.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                8523 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                8523 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
df.nunique()

Item_Identifier              1559
Item_Weight                   416
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     4
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3323
dtype: int64

In [None]:
cat_columns = ['Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Establishment_Year','Outlet_Size','Outlet_Location_Type','Outlet_Type']
con_columns = ['Item_Weight','Item_Visibility','Item_MRP','Item_Outlet_Sales']

In [None]:
df[cat_columns] = df[cat_columns].astype(object)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                8523 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   object 
 8   Outlet_Size                8523 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), object(8)
memory usage: 799.2+ KB


In [None]:
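# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
plt.figure(figsize = (16, 4))
plt.subplot(141)
sns.histplot(df['Item_Weight'], kde=True, stat='density')
plt.title('Item_Weight')
plt.subplot(142)
sns.histplot(df['Item_Visibility'], kde=True, stat='density')
plt.title('Item_Visibility')
plt.subplot(143)
sns.histplot(df['Item_MRP'], kde=True, stat='density')
plt.title('Item_MRP')
plt.subplot(144)
sns.histplot(df['Item_Outlet_Sales'], kde=True, stat='density')
plt.title('Item_Outlet_Sales')
plt.tight_layout()
plt.show()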

In [None]:
plt.figure(figsize = (20, 15))
plt.subplot(231)
sns.histplot(df['Item_Fat_Content'])
plt.xticks()
plt.title('Item_Fat_Content')
plt.subplot(232)
sns.histplot(df['Item_Type'])
plt.xticks(rotation=90)
plt.title('Item_Type')
plt.subplot(233)
sns.histplot(df['Outlet_Identifier'])
plt.xticks(rotation=90)
plt.title('Outlet_Identifier')
plt.subplot(234)
sns.histplot(df['Outlet_Size'])
plt.xticks()
plt.title('Outlet_Size')
plt.subplot(235)
sns.histplot(df['Outlet_Location_Type'])
plt.xticks()
plt.title('Outlet_Location_Type')
plt.subplot(236)
sns.histplot(df['Outlet_Type'])
plt.xticks(rotation=30)
plt.title('Outlet_Type')
plt.tight_layout()
plt.show()

In [None]:
# Pass each column explicitly as x so the orientation does not depend on the seaborn version
plt.figure(figsize = (16, 4))
plt.subplot(141)
sns.boxplot(x=df['Item_Weight'])
plt.title('Item_Weight')
plt.subplot(142)
sns.boxplot(x=df['Item_Visibility'])
plt.title('Item_Visibility')
plt.subplot(143)
sns.boxplot(x=df['Item_MRP'])
plt.title('Item_MRP')
plt.subplot(144)
sns.boxplot(x=df['Item_Outlet_Sales'])
plt.title('Item_Outlet_Sales')
plt.tight_layout()
plt.show()

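After capping, Item_Outlet_Sales shows no outliers. Item_Visibility still does: its maximum (0.328) lies above Q3 + 1.5 × IQR ≈ 0.196, computed from the summary statistics printed earlier.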
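The box-plot reading can be checked numerically with the same IQR rule used for capping. The sketch below takes the quartiles and maxima printed by `df.describe()` earlier in the notebook; the helper name `tukey_fences` is illustrative and not part of the notebook.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits of the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maxima copied from the df.describe() output (before capping).
summary = {
    "Item_Visibility": {"q1": 0.026989, "q3": 0.094585, "max": 0.328391},
    "Item_Outlet_Sales": {"q1": 834.2474, "q3": 3101.2964, "max": 13086.9648},
}

fences = {}
for name, s in summary.items():
    lower, upper = tukey_fences(s["q1"], s["q3"])
    fences[name] = (lower, upper, s["max"] > upper)
    print(f"{name}: upper limit = {upper:.4f}, max above limit: {s['max'] > upper}")
```

The Item_Outlet_Sales limit reproduces the `upper_limit` used for capping, while the Item_Visibility maximum sits above its own limit, so that column was not capped and still contains high-end outliers.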

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df)

<seaborn.axisgrid.PairGrid at 0x7dd46584dab0>

In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='RdYlGn', vmin=-1, vmax=1)

<Axes: >

In [None]:
plt.figure(figsize = (12, 6))
sns.countplot(data=df, x='Outlet_Establishment_Year')
plt.title('Outlet_Establishment_Year')

Text(0.5, 1.0, 'Outlet_Establishment_Year')

In [None]:
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Missing,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
Df = df.copy()  # work on a copy; plain assignment would only create a second name for df

In [None]:
Df['Outlet_Establishment_Year'] = Df['Outlet_Establishment_Year'].astype(int)

In [None]:
Df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                8523 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                8523 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
cat_col= ['Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Size','Outlet_Location_Type','Outlet_Type','Item_Identifier']
con_col= ['Item_Outlet_Sales','Outlet_Establishment_Year','Item_MRP','Item_Visibility']

In [None]:
Df

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,Missing,Tier 3,Grocery Store,732.3800
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
...,...,...,...,...,...,...,...,...,...,...,...,...
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,2002,Missing,Tier 2,Supermarket Type1,549.2850
8520,NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976


In [None]:
Df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Missing,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [None]:
Df[cat_col] = Df[cat_col].apply(encoder.fit_transform)
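For intuition about what the label encoding does to each column, here is a small pure-Python sketch of the same idea: distinct labels are sorted and numbered from 0. The helper `label_codes` is hypothetical and only mimics the behaviour of `LabelEncoder` on string labels.

```python
def label_codes(values):
    """Map each distinct label to an integer following sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# The four Outlet_Size categories present after filling missing values.
sizes = ["Medium", "Missing", "Small", "High", "Medium"]
codes, mapping = label_codes(sizes)

print(mapping)
print(codes)
```

Note that the integer order is alphabetical, not by store size: "High" gets the smallest code and "Small" the largest. Tree-based models are largely indifferent to this, but linear and distance-based models treat the codes as magnitudes, which may be one reason to try one-hot encoding for those models.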

In [None]:
Df = Df.drop(columns='Item_Identifier')

In [None]:
X= Df.drop(columns = ['Item_Outlet_Sales'])
y= Df['Item_Outlet_Sales']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

In [None]:
X_train_scaled = scale.fit_transform(X_train)
X_test_scaled = scale.transform(X_test)

In [None]:
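# Keep the original column names and row index so the scaled frames stay readable
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)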

# **MODEL BUILDING**

# **Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [None]:
lr.fit(X_train_scaled,y_train)
lr_pred = lr.predict(X_test_scaled)

In [None]:
print("R squared Value")
print("LR model :", r2_score(y_test,lr_pred))

print("\nMSE")
print("LR model :", mean_squared_error(y_test,lr_pred))

print("\nMAE")
print("LR model :", mean_absolute_error(y_test,lr_pred))

R squared Value
LR model : 0.5301299803628092

MSE
LR model : 1176548.4571170164

MAE
LR model : 834.6658700741752
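MSE is expressed in squared sales units, which makes it hard to compare with MAE. Taking its square root gives RMSE in the same units as Item_Outlet_Sales. A minimal sketch using the MSE value printed above (the variable names are illustrative):

```python
import math

# MSE copied from the linear regression evaluation output.
lr_mse = 1176548.4571170164

# RMSE is the square root of MSE and is in the units of the target.
lr_rmse = math.sqrt(lr_mse)
print(f"LR model RMSE: {lr_rmse:.2f}")
```

The same conversion applies to the MSE reported for every other model in this notebook.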


In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
tuning_params = {
    'fit_intercept': [True, False]
}

# 'accuracy' is a classification metric and yields nan for a regressor; score with R² instead
GS_LR = GridSearchCV(lr, tuning_params, scoring='r2', cv=10)
GS_LR.fit(X_train_scaled, y_train)

In [None]:
GS_LR.best_score_
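Accuracy is defined for class labels, so a grid search over a regressor needs a regression scorer such as `'r2'` or `'neg_mean_squared_error'` to produce a usable score. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and the coefficients are made up for the example and have nothing to do with the BigMart data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem with a clearly non-zero intercept.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=200)

# Score each candidate with cross-validated R², a regression metric.
search = GridSearchCV(
    LinearRegression(),
    {"fit_intercept": [True, False]},
    scoring="r2",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With a regression scorer, `best_score_` is a finite cross-validated value and `best_params_` reflects a real comparison between the candidates.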

# **Random Forest Model**

In [None]:
from sklearn.ensemble import RandomForestRegressor
rr = RandomForestRegressor()

In [None]:
rr.fit(X_train_scaled,y_train)
rr_pred = rr.predict(X_test_scaled)

In [None]:
print("R squared Value")
print("RR model :", r2_score(y_test,rr_pred))

print("\nMSE")
print("RR model :", mean_squared_error(y_test,rr_pred))

print("\nMAE")
print("RR model :", mean_absolute_error(y_test,rr_pred))

R squared Value
RR model : 0.5770454668600534

MSE
RR model : 1059072.6852091851

MAE
RR model : 735.2189555800587


In [None]:
# Grid search for the random forest (disabled; uncomment to run)
# param_grid = {
#     'n_estimators': [100, 200, 300],
#     'max_depth': [None, 10, 20],
#     'min_samples_split': [2, 5, 10],
#     'min_samples_leaf': [1, 2, 4]
# }
#
# grid_search = GridSearchCV(rr, param_grid, scoring='neg_mean_squared_error', cv=5)
# grid_search.fit(X_train_scaled,y_train)

In [None]:
#grid_search.best_score_

In [None]:
#best_par_rf =grid_search.best_params_
#best_par_rf

In [None]:
# Refit with the best parameters (disabled; requires best_par_rf from the grid search)
# rr2 = RandomForestRegressor(**best_par_rf)
# rr2.fit(X_train, y_train)
# rr2_pred = rr2.predict(X_test)
# r2_score(y_test,rr2_pred)

# **AdaBoost**

In [None]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor()
ada.fit(X_train_scaled,y_train)

In [None]:
ada_pred = ada.predict(X_test_scaled)

In [None]:
#Model Evaluation

print("R squared Value")
print("ADA Boosting model :", r2_score(y_test,ada_pred))

print("\nMSE")
print(" ADA Boosting model :", mean_squared_error(y_test,ada_pred))


print("\nMAE")  # mean_absolute_error returns MAE, not RMSE
print(" ADA Boosting model :", mean_absolute_error(y_test,ada_pred))

R squared Value
ADA Boosting model : 0.5881108600680347

MSE
 ADA Boosting model : 1031365.0836126923

MAE
 ADA Boosting model : 765.6675445592849


# **Gradient Boosting**

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor()
gb.fit(X_train_scaled, y_train)

In [None]:
gb_pred = gb.predict(X_test_scaled)

In [None]:
#Model Evaluation

print("R squared Value")
print("GD Boosting model :", r2_score(y_test,gb_pred))

print("\nMSE")
print(" GD Boosting model :", mean_squared_error(y_test,gb_pred))


print("\nMAE")  # mean_absolute_error returns MAE, not RMSE
print(" GD Boosting model :", mean_absolute_error(y_test,gb_pred))

R squared Value
GD Boosting model : 0.6185862942742965

MSE
 GD Boosting model : 955054.9901893359

MAE
 GD Boosting model : 703.3403612033519


In [None]:
# Grid search for gradient boosting (disabled; uncomment to run)
# param_grid = {
#     'n_estimators': [50, 100, 200],  # Number of boosting stages (trees)
#     'learning_rate': [0.01, 0.1, 0.2],  # Step size shrinkage to prevent overfitting
#     'max_depth': [3, 4, 5],  # Maximum depth of individual trees
#     'min_samples_split': [2, 5, 10]  # Minimum samples required to split an internal node
# }
#
# grid_search = GridSearchCV(gb, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
#
# grid_search.fit(X_train_scaled, y_train)

# **Fitting XGBoost Model**

In [None]:
from xgboost import XGBRegressor
xg = XGBRegressor()
xg.fit(X_train_scaled, y_train)

In [None]:
xg_pred = xg.predict(X_test_scaled)

In [None]:
#Model Evaluation

print("R squared Value")
print("XG Boosting model :", r2_score(y_test,xg_pred))

print("\nMSE")
print(" XG Boosting model :", mean_squared_error(y_test,xg_pred))


print("\nMAE")  # mean_absolute_error returns MAE, not RMSE
print(" XG Boosting model :", mean_absolute_error(y_test,xg_pred))

R squared Value
XG Boosting model : 0.5515123147951506

MSE
 XG Boosting model : 1123007.3680188942

MAE
 XG Boosting model : 760.3852606826142


# **Fitting SVM Model**

In [None]:
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train_scaled, y_train)

In [None]:
svr_pred = svr.predict(X_test_scaled)

In [None]:
#Model Evaluation

print("R squared Value")
print("SVM model :", r2_score(y_test,svr_pred))

print("\nMSE")
print(" SVM model :", mean_squared_error(y_test,svr_pred))


print("\nMAE")  # mean_absolute_error returns MAE, not RMSE
print(" SVM model :", mean_absolute_error(y_test,svr_pred))

R squared Value
SVM model : 0.08015927899505837

MSE
 SVM model : 2303269.278443037

MAE
 SVM model : 1172.5459863216342


# **Fitting KNN Model**

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train)

In [None]:
knn_pred = knn.predict(X_test_scaled)

In [None]:
#Model Evaluation

print("R squared Value")
print(" KNN model :", r2_score(y_test,knn_pred))

print("\nMSE")
print(" KNN model :", mean_squared_error(y_test,knn_pred))


print("\nMAE")  # mean_absolute_error returns MAE, not RMSE
print(" KNN model :", mean_absolute_error(y_test,knn_pred))


R squared Value
 KNN model : 0.5221211055226445

MSE
 KNN model : 1196602.5762193906

MAE
 KNN model : 775.8212436832844
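
To compare the seven models side by side, the scores printed above can be gathered into one ranked table. The sketch below copies the R² values (rounded to four decimals) and the values returned by `mean_absolute_error` (rounded to two decimals) from the evaluation outputs; the list name `model_scores` is illustrative.

```python
# (model, R², MAE) copied from the evaluation outputs in this notebook.
model_scores = [
    ("Linear Regression", 0.5301, 834.67),
    ("Random Forest", 0.5770, 735.22),
    ("AdaBoost", 0.5881, 765.67),
    ("Gradient Boosting", 0.6186, 703.34),
    ("XGBoost", 0.5515, 760.39),
    ("SVR", 0.0802, 1172.55),
    ("KNN", 0.5221, 775.82),
]

# Rank from highest to lowest R².
ranked = sorted(model_scores, key=lambda row: row[1], reverse=True)

print(f"{'Model':<20}{'R2':>8}{'MAE':>10}")
for name, r2, mae in ranked:
    print(f"{name:<20}{r2:>8.4f}{mae:>10.2f}")
```

On this single train/test split, gradient boosting has both the highest R² and the lowest MAE, while SVR with default settings trails far behind. Because the random forest and boosting models were fitted without a fixed `random_state`, their scores may shift slightly between runs.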
