The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. <br>
Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.<br>
Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.<br>
Note that the data may have missing values, as some stores might not report all the data due to technical glitches. <br>These missing values will need to be treated accordingly.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,Ridge, Lasso, ElasticNet
from sklearn.metrics import r2_score,mean_squared_error
from math import sqrt

In [2]:
df = pd.read_csv('bigmart.csv')
df.head(10)

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
5,FDP36,10.395,Regular,0.0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
6,FDO10,13.65,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
7,FDP10,,Low Fat,0.12747,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,FDH17,16.2,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.2,Regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.535


In [3]:
df.shape

(8523, 12)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [5]:
df.Item_Fat_Content.dtype

dtype('O')

Let us check the unique values of each categorical column

In [6]:
for col in df.columns:
    if df[col].dtype == "object":
        print(f"The unique values for the column {col} are: \n\t\t{df[col].unique()} and the no of unique values are {df[col].nunique()}")

The unique values for the column Item_Identifier are: 
		['FDA15' 'DRC01' 'FDN15' ... 'NCF55' 'NCW30' 'NCW05'] and the no of unique values are 1559
The unique values for the column Item_Fat_Content are: 
		['Low Fat' 'Regular' 'low fat' 'LF' 'reg'] and the no of unique values are 5
The unique values for the column Item_Type are: 
		['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood'] and the no of unique values are 16
The unique values for the column Outlet_Identifier are: 
		['OUT049' 'OUT018' 'OUT010' 'OUT013' 'OUT027' 'OUT045' 'OUT017' 'OUT046'
 'OUT035' 'OUT019'] and the no of unique values are 10
The unique values for the column Outlet_Size are: 
		['Medium' nan 'High' 'Small'] and the no of unique values are 3
The unique values for the column Outlet_Location_Type are: 
		['Tier 1' 'Tier 3' 'Tier 2'] and the no of unique v

**Observations**<br>
<li>The column Item_Fat_Content uses three different labels for the category Low Fat ("Low Fat", "low fat", "LF") and two different labels for the category Regular ("Regular", "reg").</li>

In [None]:
# Let us correct the categories in the column Item_Fat_Content 
category_mapping = {
    'Low Fat': 'Low Fat',
    'low fat': 'Low Fat',
    'LF': 'Low Fat',
    'Regular': 'Regular',
    'reg': 'Regular'
}
df['Item_Fat_Content'] = df['Item_Fat_Content'].map(category_mapping)

df["Item_Fat_Content"].unique()
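
The same normalization can be tried on a toy Series first. This is a minimal sketch: the name `toy` and its values are illustrative, not taken from the dataset. One detail to watch is that `Series.map` with a dict turns any label missing from the dict into NaN, so it helps to count NaNs after mapping.

```python
import pandas as pd

# Toy column with the same inconsistent labels seen in Item_Fat_Content
toy = pd.Series(["Low Fat", "low fat", "LF", "Regular", "reg"])

category_mapping = {
    "Low Fat": "Low Fat",
    "low fat": "Low Fat",
    "LF": "Low Fat",
    "Regular": "Regular",
    "reg": "Regular",
}

cleaned = toy.map(category_mapping)

# Labels absent from the dict would become NaN, so confirm nothing was lost
n_unmapped = int(cleaned.isna().sum())
unique_after = sorted(cleaned.unique())
print(unique_after, n_unmapped)
```

If `n_unmapped` were greater than zero, the mapping dict would be missing at least one label.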

Outlet_Establishment_Year can be converted into the age of the Outlet.

In [None]:
# Age is measured relative to 2024; note that the sales data itself was collected in 2013
df['Outlet_Age'] = 2024-df['Outlet_Establishment_Year']
df.head()

In [None]:
# Dropping the Outlet_Establishment_Year column as we have created the Outlet_Age column
df.drop(columns=["Outlet_Establishment_Year"],inplace=True)
df.columns

Let us check the missing values

In [None]:
df.isna().sum()  # Item_Weight -- 1463 and Outlet_Size -- 2410

In [None]:
df.describe()

Let us check the value counts of each category in each of the categorical columns

In [None]:
for col in df.columns:
    if df[col].dtype == "object":
        print(f"Value counts for the column {col} are: \n\n{df[col].value_counts()}\n")

Handling Missing Values

In [None]:
# Let us check the KDE of Item_Weight 
sns.kdeplot(x=df.Item_Weight)

In [None]:
{"Mean of Item Weight":df.Item_Weight.mean(), "Mode of Item Weight":df.Item_Weight.mode()[0], "Median of Item Weight":df.Item_Weight.median()}

**Observations**<br>
The mean, median and mode are approximately equal and give a true representation of the central tendency of the column Item_Weight.

**Conclusion**<br>
Therefore, we fill the missing values of the Item_Weight column with its mean

In [None]:
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())

For the Outlet_Size column, we replace the missing values with the mode of the column, which is "Medium".

In [None]:
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])

In [None]:
df.isna().sum()

### Univariate Analysis

In [None]:
df['Item_Visibility'].hist(bins=20)
plt.title('Item Visibility')

In [None]:
plt.boxplot(df['Item_Visibility'])

In [None]:
# Keep only the rows whose Item_Visibility lies within the IQR fences
Q1 = df['Item_Visibility'].quantile(0.25)
Q3 = df['Item_Visibility'].quantile(0.75)
IQR = Q3 - Q1
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
filt_df = df.query('(@Q1 - 1.5*@IQR) <= Item_Visibility <= (@Q3 + 1.5*@IQR)')
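
To see what the IQR rule does, here is a small sketch on made-up numbers. The helper `iqr_bounds` and the toy values are assumptions for illustration only; the fences follow the same Q1 - 1.5*IQR and Q3 + 1.5*IQR rule used in the cell above.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical visibility values with one obvious outlier (0.30)
toy = pd.DataFrame({"Item_Visibility": [0.01, 0.02, 0.03, 0.04, 0.05, 0.30]})

lower, upper = iqr_bounds(toy["Item_Visibility"])

# between() is inclusive on both ends, matching the <= comparisons in the query
kept = toy[toy["Item_Visibility"].between(lower, upper)]
print(lower, upper, len(kept))
```

On this toy data the upper fence is about 0.085, so only the value 0.30 is dropped.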

In [None]:
filt_df.head()

In [None]:
filt_df.shape , df.shape

In [None]:
df.shape[0] - filt_df.shape[0] # This gives the number of rows that will be removed on filtering

In [None]:
# We will continue with the filtered data 
df = filt_df.copy()  # copy so that later column assignments do not act on a view of the original frame

In [None]:
df.loc[:,'Item_Visibility_bins'] = pd.cut(df['Item_Visibility'],[0.000,0.065,0.13,0.2],labels=['Low Viz','Viz','High Viz'])

In [None]:
df.groupby(['Item_Visibility']).max()['Item_Visibility_bins']

In [None]:
# Checking the missing values in the Item_Visibility_bins column
df["Item_Visibility_bins"].isna().sum()

**Observations** - The Item_Visibility_bins column has missing values for the items with zero visibility, because the first bin, (0, 0.065], excludes its left edge.
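
The sketch below reproduces this behaviour on a few made-up values. By default `pd.cut` builds right-closed intervals, so a value equal to the lowest edge falls outside every bin. Passing `include_lowest=True` is one possible alternative to filling the gaps afterwards; it is shown here as an option and is not what the notebook does.

```python
import pandas as pd

# Hypothetical visibility values, including an exact zero
vis = pd.Series([0.0, 0.03, 0.10, 0.15])
edges = [0.000, 0.065, 0.13, 0.2]
labels = ["Low Viz", "Viz", "High Viz"]

# Default: intervals are (0, 0.065], (0.065, 0.13], (0.13, 0.2], so 0.0 is excluded
default_bins = pd.cut(vis, edges, labels=labels)

# include_lowest=True also closes the first interval on the left
closed_bins = pd.cut(vis, edges, labels=labels, include_lowest=True)

print(int(default_bins.isna().sum()), closed_bins.tolist())
```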

In [None]:
# Filling the missing values with 'Low Viz'
df.loc[:,'Item_Visibility_bins'] = df['Item_Visibility_bins'].fillna('Low Viz')

In [None]:
df["Item_Visibility_bins"].isnull().sum()

## Label Encoding

Now we perform Label Encoding for the categorical columns with ordinal categories

In [None]:
le = LabelEncoder()

In [None]:
df.loc[:,'Item_Fat_Content'] = le.fit_transform(df['Item_Fat_Content'])

In [None]:
df.loc[:,'Item_Visibility_bins'] = le.fit_transform(df['Item_Visibility_bins'].astype(str))

In [None]:
df.loc[:,'Outlet_Size'] = le.fit_transform(df['Outlet_Size'])

In [None]:
df.loc[:,'Outlet_Location_Type'] = le.fit_transform(df['Outlet_Location_Type'])
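
One caveat with `LabelEncoder` on ordinal columns: it sorts the class labels, so the integer codes follow alphabetical order and not the natural ranking. The sketch below shows this for the Outlet_Size labels and contrasts it with an explicit mapping. The dict `size_order` and the names `toy_le` and `sizes` are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(["Small", "Medium", "High", "Medium"])

# LabelEncoder sorts the classes, so the codes follow alphabetical order
toy_le = LabelEncoder()
alphabetical_codes = toy_le.fit_transform(sizes)
print(list(toy_le.classes_))
print(list(alphabetical_codes))

# Hypothetical explicit ranking that keeps Small < Medium < High
size_order = {"Small": 0, "Medium": 1, "High": 2}
ordinal_codes = sizes.map(size_order)
print(list(ordinal_codes))
```

With the alphabetical codes, "High" gets 0 and "Small" gets 2, which reverses the size ordering. Whether that matters depends on the model; for a linear model it only flips the sign of the coefficient as long as "Medium" stays in the middle.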

### One-hot encoding

Let's create dummy variables for categorical columns with nominal categories

In [None]:
## Create dummy variable for Outlet Type
dummy = pd.get_dummies(df['Outlet_Type'],drop_first=True)
dummy.head()
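
As a quick illustration of what `drop_first=True` does, the sketch below encodes four made-up rows. The first category in sorted order is dropped and acts as the baseline, so a row of that category is encoded as all zeros. The Series `outlet_type` is illustrative data only.

```python
import pandas as pd

# One hypothetical row per outlet type
outlet_type = pd.Series(
    ["Supermarket Type1", "Grocery Store", "Supermarket Type2", "Supermarket Type3"],
    name="Outlet_Type",
)

dummies = pd.get_dummies(outlet_type, drop_first=True)

# 'Grocery Store' sorts first, so it is dropped and becomes the baseline
print(list(dummies.columns))

# The Grocery Store row (position 1) has no indicator set
print(int(dummies.iloc[1].sum()))
```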

In [None]:
df = pd.concat([df,dummy],axis=1)

In [None]:
df

In [None]:
df.dtypes

### Dropping unnecessary columns

In [None]:
# Removing unnecessary column
df = df.drop(['Item_Identifier','Item_Type','Outlet_Identifier','Outlet_Type'],axis=1)
df.columns

In [None]:
df.head()

In [None]:
df.isnull().sum()

# Model Training

In [None]:
X= df.drop(columns=["Item_Outlet_Sales"])
y = df.Item_Outlet_Sales

### Train test split

In [None]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3,random_state=42)
print(xtrain.shape)
print(xtest.shape)
print(ytrain.shape)
print(ytest.shape)

### Standardization 

In [None]:
scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)
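
The scaler is fitted on the training split only and then reused for the test split, so no information from the test data leaks into the preprocessing. The sketch below checks this on synthetic data and reproduces the transform by hand. The arrays and the name `toy_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the BigMart features
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
x_test = rng.normal(loc=50.0, scale=10.0, size=(40, 2))

toy_scaler = StandardScaler()
x_train_scaled = toy_scaler.fit_transform(x_train)  # learns mean and std from train only
x_test_scaled = toy_scaler.transform(x_test)        # reuses the train statistics

# Same result by hand: StandardScaler uses the population std (ddof=0)
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print(np.allclose(x_train_scaled.mean(axis=0), 0.0))
print(np.allclose(x_test_scaled, manual))
```

The scaled test data will generally not have exactly zero mean and unit variance, which is expected because its statistics were never used for fitting.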

### Linear Regression Model

In [None]:
linear = LinearRegression()
linear.fit(xtrain,ytrain)

In [None]:
print(f"The intercept is {linear.intercept_}") # Gives the intercept
print(f"The coefficients are {linear.coef_}") # Gives the coefficient

In [None]:
preds_train = linear.predict(xtrain)
preds_test = linear.predict(xtest)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error
print(f"RMSE for train data is : {np.sqrt(mean_squared_error(ytrain,preds_train))}")
print(f"R2 for train data is {r2_score(ytrain,preds_train)}")
print(f"RMSE for test data is : {np.sqrt(mean_squared_error(ytest,preds_test))}")
print(f"R2 for test data is {r2_score(ytest,preds_test)}")
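
For reference, both metrics can be computed by hand. The sketch below uses four made-up target values and predictions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))          # root of the mean squared error
ss_res = np.sum(residuals ** 2)                  # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(rmse, r2)
```

Here the squared residuals sum to 2 and the total sum of squares is 20, so R2 is 0.9 and RMSE is the square root of 0.5.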


### Ridge Regression

In [None]:
ridge = Ridge() # alpha =1 by default
ridge.fit(xtrain,ytrain)

In [None]:
print(f"The intercept is {ridge.intercept_}") # Gives the intercept
print(f"The coefficients are {ridge.coef_}") # Gives the coefficient

In [None]:
preds_train = ridge.predict(xtrain)
preds_test = ridge.predict(xtest)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error
print(f"RMSE for train data is : {np.sqrt(mean_squared_error(ytrain,preds_train))}")
print(f"R2 for train data is {r2_score(ytrain,preds_train)}")
print(f"RMSE for test data is : {np.sqrt(mean_squared_error(ytest,preds_test))}")
print(f"R2 for test data is {r2_score(ytest,preds_test)}")


### Lasso Regression

In [None]:
lasso = Lasso() # alpha =1 by default
lasso.fit(xtrain,ytrain)

In [None]:
print(f"The intercept is {lasso.intercept_}") # Gives the intercept
print(f"The coefficients are {lasso.coef_}") # Gives the coefficient

In [None]:
preds_train = lasso.predict(xtrain)
preds_test = lasso.predict(xtest)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error
print(f"RMSE for train data is : {np.sqrt(mean_squared_error(ytrain,preds_train))}")
print(f"R2 for train data is {r2_score(ytrain,preds_train)}")
print(f"RMSE for test data is : {np.sqrt(mean_squared_error(ytest,preds_test))}")
print(f"R2 for test data is {r2_score(ytest,preds_test)}")


### ElasticNet

In [None]:
elastic = ElasticNet(alpha=1) # alpha =1 by default
elastic.fit(xtrain,ytrain)

In [None]:
print(f"The intercept is {elastic.intercept_}") # Gives the intercept
print(f"The coefficients are {elastic.coef_}") # Gives the coefficient

In [None]:
preds_train = elastic.predict(xtrain)
preds_test = elastic.predict(xtest)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error
print(f"RMSE for train data is : {np.sqrt(mean_squared_error(ytrain,preds_train))}")
print(f"R2 for train data is {r2_score(ytrain,preds_train)}")
print(f"RMSE for test data is : {np.sqrt(mean_squared_error(ytest,preds_test))}")
print(f"R2 for test data is {r2_score(ytest,preds_test)}")
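
Since the four models are fitted and scored in exactly the same way, the repeated cells could be folded into one helper. This is a sketch under stated assumptions: the function `evaluate` and the synthetic data are hypothetical and are not part of the notebook, and the scores it prints say nothing about the BigMart results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(model, xtr, ytr, xte, yte):
    """Fit the model and return RMSE and R2 on the train and test data."""
    model.fit(xtr, ytr)
    out = {}
    for name, x, y in [("train", xtr, ytr), ("test", xte, yte)]:
        preds = model.predict(x)
        out[f"rmse_{name}"] = float(np.sqrt(mean_squared_error(y, preds)))
        out[f"r2_{name}"] = float(r2_score(y, preds))
    return out

# Synthetic stand-in data, not the BigMart features
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
xtr, xte, ytr, yte = X_syn[:140], X_syn[140:], y_syn[:140], y_syn[140:]

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(),            # alpha=1 by default
    "Lasso": Lasso(),            # alpha=1 by default
    "ElasticNet": ElasticNet(),  # alpha=1 and l1_ratio=0.5 by default
}
results = {name: evaluate(m, xtr, ytr, xte, yte) for name, m in models.items()}
for name, scores in results.items():
    print(name, scores)
```

In the notebook, the same helper could be called with `xtrain`, `ytrain`, `xtest` and `ytest` to compare the four models side by side.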
