# Sale Prediction Project
## Problem Statement:
Nowadays, shopping malls and Big Marts keep track of individual item sales data in
order to forecast future client demand and adjust inventory management. In a data
warehouse, these data stores hold a significant amount of consumer information and
particular item details. By mining the data store from the data warehouse, more
anomalies and common patterns can be discovered.

### Approach: 
The classical machine learning tasks like Data Exploration, Data Cleaning,
Feature Engineering, Model Building and Model Testing. Try out different machine
learning algorithms that’s best fit for the above case.

### Results: 
You have to build a solution that should able to predict the sales of the
different stores of Big Mart according to the provided dataset.

### Datasrt link: https://www.kaggle.com/datasets/brijbhushannanda1979/bigmart-sales-data

## Dataset Background:
We have train (8523) and test (5681) data set, train data set has both input and output
variable(s). We need to predict the sales for test data set.

**Item_Identifier:** Unique product ID 2

**Item_Weight:** Weight of product

**Item_Fat_Content:** Whether the product is low fat or not

**Item_Visibility:** The % of total display area of all products in a store allocated to the
particular product

**Item_Type:** The category to which the product belongs

**Item_MRP:** Maximum Retail Price (list price) of the product

**Outlet_Identifier:** Unique store ID

**Outlet_Establishment_Year:** The year in which store was established

**Outlet_Size:** The size of the store in terms of ground area covered

**Outlet_Location_Type:** The type of city in which the store is located

**Outlet_Type:** Whether the outlet is just a grocery store or some sort of supermarket

**Item_Outlet_Sales:** Sales of the product in the particulat store. This is the outcome
variable to be predicted.


In [18]:
#importing basics libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Modelling-
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge,Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split,GridSearchCV, RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

import pickle

In [19]:
df_train = pd.read_csv('data/Train.csv')
df_test = pd.read_csv('data/Test.csv')

In [3]:
df_train.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
df_train.shape

(8523, 12)

In [5]:
df_train.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

In [127]:
df_test.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3


In [128]:
df_test.shape

(5681, 11)

In [6]:
df_train.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [130]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [131]:
df_train.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [132]:
df_test.isnull().sum()

Item_Identifier                 0
Item_Weight                   976
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1606
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

In [4]:
df_train.duplicated().sum()

0

In [134]:
df_train.head(3)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27


In [7]:
df_train['Item_Identifier'].value_counts()

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: count, Length: 1559, dtype: int64

In [8]:
df_train['Item_Identifier'].value_counts()

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: count, Length: 1559, dtype: int64

In [9]:
df_train['Outlet_Establishment_Year'].unique()

array([1999, 2009, 1998, 1987, 1985, 2002, 2007, 1997, 2004], dtype=int64)

In [10]:
df_train.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [5]:
# Display rows with null values
null_rows = df_train[df_train.isnull().any(axis=1)]
null_rows

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.20,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
7,FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,FDH17,16.20,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.20,Regular,0.094450,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.5350
18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.6680
...,...,...,...,...,...,...,...,...,...,...,...,...
8504,NCN18,,Low Fat,0.124111,Household,111.7544,OUT027,1985,Medium,Tier 3,Supermarket Type3,4138.6128
8508,FDW31,11.35,Regular,0.043246,Fruits and Vegetables,199.4742,OUT045,2002,,Tier 2,Supermarket Type1,2587.9646
8509,FDG45,8.10,Low Fat,0.214306,Fruits and Vegetables,213.9902,OUT010,1998,,Tier 3,Grocery Store,424.7804
8514,FDA01,15.00,Regular,0.054489,Canned,57.5904,OUT045,2002,,Tier 2,Supermarket Type1,468.7232


In [6]:
# define numerical & categorical columns in train data
numeric_features = [feature for feature in df_train.columns if df_train[feature].dtype != 'O']
categorical_features = [feature for feature in df_train.columns if df_train[feature].dtype == 'O']

# print numerical & categorical columns in train data
print('We have {} numerical features in train data and they as as follows : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features train data and they as as follows: {}'.format(len(categorical_features), categorical_features))

We have 5 numerical features in train data and they as as follows : ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year', 'Item_Outlet_Sales']

We have 7 categorical features train data and they as as follows: ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']


In [13]:
print('Number of unique data points in categorical features in Train data')
print('Number of unique data points in Item_Identifier:', df_train['Item_Identifier'].unique())
print('Number of unique data points in Item_Fat_Content:', df_train['Item_Fat_Content'].unique())
print('Number of unique data points in Item_Type:',df_train['Item_Type'].unique())
print('Number of unique data points in Outlet_Identifier:', df_train['Outlet_Identifier'].unique())
print('Number of unique data points in Outlet_Size:', df_train['Outlet_Size'].unique())
print('Number of unique data points in Outlet_Location_Type:', df_train['Outlet_Location_Type'].unique())
print('Number of unique data points in Outlet_Type:', df_train['Outlet_Type'].unique())

Number of unique data points in categorical features in Train data
Number of unique data points in Item_Identifier: ['FDA15' 'DRC01' 'FDN15' ... 'NCF55' 'NCW30' 'NCW05']
Number of unique data points in Item_Fat_Content: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']
Number of unique data points in Item_Type: ['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']
Number of unique data points in Outlet_Identifier: ['OUT049' 'OUT018' 'OUT010' 'OUT013' 'OUT027' 'OUT045' 'OUT017' 'OUT046'
 'OUT035' 'OUT019']
Number of unique data points in Outlet_Size: ['Medium' nan 'High' 'Small']
Number of unique data points in Outlet_Location_Type: ['Tier 1' 'Tier 3' 'Tier 2']
Number of unique data points in Outlet_Type: ['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3']


## Data Preprocessing 
1. Remove Outliers as discover from EDA file
2. Fill Features with null values with median and mode
3. Drop redundant features 
4. Feature encoding 

#### 1. Remove Outliers as discover from EDA file 
##### Winsorization:
Winsorization replaces the extreme values with the nearest non-outlier value. You can choose to replace them with the maximum or minimum non-outlier value.

In [20]:
def find_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

outliers_weight = find_outliers_iqr(df_train, 'Item_Weight')
outliers_visibility = find_outliers_iqr(df_train, 'Item_Visibility')
outliers_mrp = find_outliers_iqr(df_train, 'Item_MRP')
outliers_sales = find_outliers_iqr(df_train, 'Item_Outlet_Sales')

print("Number of outliers in Item_Weight:", len(outliers_weight))
print("Number of outliers in Item_Visibility:", len(outliers_visibility))
print("Number of outliers in Item_MRP:", len(outliers_mrp))
print("Number of outliers in Item_Outlet_Sales:", len(outliers_sales))

Number of outliers in Item_Weight: 0
Number of outliers in Item_Visibility: 144
Number of outliers in Item_MRP: 0
Number of outliers in Item_Outlet_Sales: 186


In [21]:
def winsorize(series, limits):
    series[series < limits[0]] = limits[0]
    series[series > limits[1]] = limits[1]
    return series

# Calculate lower and upper bounds based on df_train
lower_bound = df_train['Item_Visibility'].quantile(0.25) - 1.5 * (df_train['Item_Visibility'].quantile(0.75) - df_train['Item_Visibility'].quantile(0.25))
upper_bound = df_train['Item_Visibility'].quantile(0.75) + 1.5 * (df_train['Item_Visibility'].quantile(0.75) - df_train['Item_Visibility'].quantile(0.25))

# Apply winsorization to df_train
df_train['Item_Visibility'] = winsorize(df_train['Item_Visibility'], [lower_bound, upper_bound])
df_train['Item_Outlet_Sales'] = winsorize(df_train['Item_Outlet_Sales'], [lower_bound, upper_bound])


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  series[series < limits[0]] = limits[0]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  series[series > limits[1]] = limits[1]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  series[series < limits[0]] = limits[0]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  series[series > limits[1]] = limits[1]


In [22]:
# Creating a new column for Outlet_Age
df_train['Outlet_Age'] = df_train['Outlet_Establishment_Year'].apply(lambda year: 2023 - year)

# Standardize values in the 'Item_Fat_Content' column
df_train['Item_Fat_Content'] = df_train['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# Drop unnecessary columns
df_train.drop(['Item_Identifier', 'Outlet_Identifier', 'Item_Visibility', 'Outlet_Establishment_Year'], axis=1, inplace=True)

In [24]:
from sklearn.impute import SimpleImputer

# Create a SimpleImputer for 'Item_Weight' with median strategy
item_weight_imputer = SimpleImputer(strategy='median')

# Fill missing values in 'Item_Weight' column with the median
df_train['Item_Weight'] = item_weight_imputer.fit_transform(df_train[['Item_Weight']])

# Create a SimpleImputer for 'Outlet_Size' with 'most_frequent' strategy
outlet_size_imputer = SimpleImputer(strategy='most_frequent')

# Fill missing values in 'Outlet_Size' column with the mode (most frequent value)
df_train['Outlet_Size'] = outlet_size_imputer.fit_transform(df_train[['Outlet_Size']])


ValueError: 2

In [144]:
df_train['Outlet_Type'].value_counts()

Outlet_Type
Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: count, dtype: int64

In [145]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Item_Weight           8523 non-null   float64
 1   Item_Fat_Content      8523 non-null   object 
 2   Item_Type             8523 non-null   object 
 3   Item_MRP              8523 non-null   float64
 4   Outlet_Size           6113 non-null   object 
 5   Outlet_Location_Type  8523 non-null   object 
 6   Outlet_Type           8523 non-null   object 
 7   Item_Outlet_Sales     8523 non-null   float64
 8   Outlet_Age            8523 non-null   int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 599.4+ KB


In [None]:
df_train['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)

In [None]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,Low Fat,Dairy,249.8092,Medium,Tier 1,Supermarket Type1,3735.138,24
1,5.92,Regular,Soft Drinks,48.2692,Medium,Tier 3,Supermarket Type2,443.4228,14
2,17.5,Low Fat,Meat,141.618,Medium,Tier 1,Supermarket Type1,2097.27,24
4,8.93,Low Fat,Household,53.8614,High,Tier 3,Supermarket Type1,994.7052,36
5,10.395,Regular,Baking Goods,51.4008,Medium,Tier 3,Supermarket Type2,556.6088,14


In [None]:
df_train['Item_Outlet_Sales'].max()

10256.649

In [None]:
df_train['Item_Outlet_Sales'].min()

69.2432

In [None]:
df_train.shape

(4650, 9)

In [None]:
df_train['Item_Fat_Content'].value_counts()

Item_Fat_Content
Low Fat    3004
Regular    1646
Name: count, dtype: int64

In [None]:
df_train[df_train['Outlet_Type'] == 'Grocery Store']

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,1998,,Tier 3,Grocery Store,732.3800
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344
29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
30,FDV38,19.250,Low Fat,0.170349,Dairy,55.7956,OUT010,1998,,Tier 3,Grocery Store,163.7868
...,...,...,...,...,...,...,...,...,...,...,...,...
8473,DRI47,14.700,Low Fat,0.035016,Hard Drinks,144.3128,OUT010,1998,,Tier 3,Grocery Store,431.4384
8480,FDQ58,,Low Fat,0.000000,Snack Foods,154.5340,OUT019,1985,Small,Tier 1,Grocery Store,459.4020
8486,FDR20,20.000,Regular,0.000000,Fruits and Vegetables,46.4744,OUT010,1998,,Tier 3,Grocery Store,45.2744
8490,FDU44,,Regular,0.102296,Fruits and Vegetables,162.3552,OUT019,1985,Small,Tier 1,Grocery Store,487.3656


I will not be dropping Null values because this will remove every information about Grocery Store and 'Supermarket Type3 which important as we cannot dis regard other branches of the company.	

In [None]:
# define numerical & categorical columns in train data
numeric_features = [feature for feature in df_train.columns if df_train[feature].dtype != 'O']
categorical_features = [feature for feature in df_train.columns if df_train[feature].dtype == 'O']

# print numerical & categorical columns in train data
print('We have {} numerical features in train data and they as as follows : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features train data and they as as follows: {}'.format(len(categorical_features), categorical_features))

We have 5 numerical features in train data and they as as follows : ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year', 'Item_Outlet_Sales']

We have 7 categorical features train data and they as as follows: ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']


## Feature Encoding 
1. Label Encoding 
2. One-Hot-Encoding


Ordinal variables:

Item_Fat_Content
Outlet_Size
Outlet_Location_Type

Nominal variables:

Item_Identifier
Item_Type
Outlet_Identifier
Outlet_Type

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Define the lists of categorical and numerical features
numerical_features = ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales']
categorical_features = ['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

le = LabelEncoder()
Label = ['Item_Fat_Content','Outlet_Size','Outlet_Location_Type']

for i in Label:
    df_train[i] = le.fit_transform(df_train[i])

In [None]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,0,Dairy,249.8092,1,0,Supermarket Type1,3735.138,24
1,5.92,1,Soft Drinks,48.2692,1,2,Supermarket Type2,443.4228,14
2,17.5,0,Meat,141.618,1,0,Supermarket Type1,2097.27,24
4,8.93,0,Household,53.8614,0,2,Supermarket Type1,994.7052,36
5,10.395,1,Baking Goods,51.4008,1,2,Supermarket Type2,556.6088,14


In [None]:
#one hot encoding
cols = ['Item_Type','Outlet_Type']
# Apply one-hot encoder
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False, drop = 'first')
oh_encoder_df_train = pd.DataFrame(oh_encoder.fit_transform(df_train[cols])).astype('int64')

#get feature columns
oh_encoder_df_train.columns = oh_encoder.get_feature_names_out(cols)


# One-hot encoding removed index; put it back
oh_encoder_df_train.index = df_train.index

# Add one-hot encoded columns to our main df new name: tr_fe, te_fe (means feature engeenired) 
df_train = pd.concat([df_train, oh_encoder_df_train], axis=1)

df_train = df_train.drop(['Item_Type', 'Outlet_Type'], axis = 1)



In [None]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_MRP,Outlet_Size,Outlet_Location_Type,Item_Outlet_Sales,Outlet_Age,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,...,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Supermarket Type2
0,9.3,0,249.8092,1,0,3735.138,24,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5.92,1,48.2692,1,2,443.4228,14,0,0,0,...,0,0,0,0,0,0,0,1,0,1
2,17.5,0,141.618,1,0,2097.27,24,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,8.93,0,53.8614,0,2,994.7052,36,0,0,0,...,0,0,1,0,0,0,0,0,0,0
5,10.395,1,51.4008,1,2,556.6088,14,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

col_to_scale = ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales', 'Outlet_Age']

for col in col_to_scale:
    # Reshape the column to a 2D array with a single column
    col_data = df_train[col].values.reshape(-1, 1)
    
    # Fit and transform the scaler on the reshaped data
    df_train[col] = scaler.fit_transform(col_data)


In [None]:
X = df_train.drop(['Item_Outlet_Sales'], axis = 1)
y = df_train['Item_Outlet_Sales']

In [None]:
# Splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

In [None]:
len(X_train)


3720

In [None]:
len(X_test)

930

In [None]:
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square

In [None]:
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(), 
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list =[]

for i in range(len(list(models))):
    model = list(models.values())[i]
    model.fit(X_train, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Evaluate Train and Test dataset
    model_train_mae , model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)

    model_test_mae , model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    
    print(list(models.keys())[i])
    model_list.append(list(models.keys())[i])
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.7107
- Mean Absolute Error: 0.5273
- R2 Score: 0.4868
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.7871
- Mean Absolute Error: 0.5729
- R2 Score: 0.4171


Lasso
Model performance for Training set
- Root Mean Squared Error: 0.9921
- Mean Absolute Error: 0.7950
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1.0312
- Mean Absolute Error: 0.8057
- R2 Score: -0.0003


Ridge
Model performance for Training set
- Root Mean Squared Error: 0.7107
- Mean Absolute Error: 0.5273
- R2 Score: 0.4868
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.7870
- Mean Absolute Error: 0.5728
- R2 Score: 0.4173


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 0.6424
- Mean Absolute Error: 0.4823
- R2 Score: 0.5807
----------------------

In [None]:
import pandas as pd

def compare_models(models, X_train, y_train, X_test, y_test):
    results = []
    for model_name, model in models.items():
        model_results = evaluate_model(model, X_train, y_train, X_test, y_test)
        model_results['Model'] = model_name
        results.append(model_results)

    return pd.DataFrame(results)


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X_train, y_train, X_test, y_test):
    # Train the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the training and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate evaluation metrics for training data
    mae_train = mean_absolute_error(y_train, y_train_pred)
    mse_train = mean_squared_error(y_train, y_train_pred)
    rmse_train = np.sqrt(mse_train)
    r2_train = r2_score(y_train, y_train_pred)

    # Calculate evaluation metrics for test data
    mae_test = mean_absolute_error(y_test, y_test_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    rmse_test = np.sqrt(mse_test)
    r2_test = r2_score(y_test, y_test_pred)

    # Calculate cross-validation RMSE
    cross_val_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cross_val_rmse = np.sqrt(-cross_val_scores)

    return {
        'MAE_train': mae_train,
        'MSE_train': mse_train,
        'RMSE_train': rmse_train,
        'R^2_train': r2_train,
        'MAE_test': mae_test,
        'MSE_test': mse_test,
        'RMSE_test': rmse_test,
        'R^2_test': r2_test,
        'Cross_Val_RMSE': cross_val_rmse.mean()
    }


In [None]:
# Use the compare_models function
results_df = compare_models(models, X_train, y_train, X_test, y_test)

# Sort the results by MAE
results_df.sort_values(by='MAE_test', ascending=True, inplace=True)

# Save the results to a CSV file
results_df.to_csv("model_comparison_results.csv", index=False)


In [None]:
result = pd.read_csv('model_comparison_results.csv')

In [None]:
result

Unnamed: 0,MAE_train,MSE_train,RMSE_train,R^2_train,MAE_test,MSE_test,RMSE_test,R^2_test,Cross_Val_RMSE,Model
0,0.5273,0.505072,0.710684,0.486815,0.572765,0.619341,0.786982,0.41734,0.714451,Ridge
1,0.52734,0.505069,0.710682,0.486818,0.572898,0.619546,0.787113,0.417147,0.7145,Linear Regression
2,0.404316,0.286481,0.535239,0.708917,0.593641,0.682691,0.826251,0.357742,0.753268,CatBoosting Regressor
3,0.541418,0.518625,0.720156,0.473044,0.59668,0.658464,0.811458,0.380534,0.731138,AdaBoost Regressor
4,0.209739,0.083443,0.288866,0.915216,0.599848,0.685928,0.828208,0.354697,0.759834,Random Forest Regressor
5,0.282617,0.145752,0.381775,0.851907,0.620428,0.727764,0.853091,0.315338,0.799782,XGBRegressor
6,0.482295,0.412717,0.642431,0.580653,0.635614,0.762039,0.872949,0.283093,0.789499,K-Neighbors Regressor
7,0.0,0.0,0.0,1.0,0.767132,1.074625,1.036641,-0.01098,1.048164,Decision Tree
8,0.794974,0.984191,0.992064,0.0,0.805717,1.063307,1.031168,-0.000332,0.991993,Lasso
