# **1)Problem Statement**

To enhance sales performance, this project focuses on building a predictive model that estimates product sales at individual BigMart stores. By analyzing a comprehensive dataset encompassing various product and store attributes, we employ machine learning techniques to develop a robust model. This model aims to provide predictions by identifying crucial factors and patterns that influence sales. Through this endeavor, we empower BigMart to make data-driven decisions and optimize their sales strategies for improved business outcomes.

# **2)Hypothesis Generation**

**Store Level Hypotheses:**

* *City Type*: Stores located in urban or Tier 1 cities are expected to have higher sales due to the higher income levels of residents.
* *Store Capacity*: Larger stores with ample space are anticipated to have higher sales as they offer a comprehensive shopping experience.
* *Competitors*: Stores in close proximity to similar establishments may face tougher competition, resulting in lower sales.

**Product Level Hypotheses:**

* *Brand*: Branded products should have higher sales because customers trust them more.
* *Utility*: Daily-use products should sell more than specific-use products.
* *Display Area*: Products given bigger shelves in the store are likely to catch attention first and sell more.
* *Visibility in Store*: The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before those at the back.
* *Promotional Offers*: Products accompanied by attractive offers and discounts will sell more.



# **3)Loading Packages and Data**

In this step, we're going to import the necessary libraries and load the dataset into our programming environment.

In [None]:
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder  
import seaborn as sns
import warnings
import ydata_profiling as pp
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

Now that all the necessary libraries are imported, we proceed to load our training and test datasets.

In [None]:
#loading training dataset
bigm_train=pd.read_csv("../input/big-mart-dataset/Train.csv")
#loading test dataset
bigm_test=pd.read_csv("../input/big-mart-dataset/Test.csv")

 # **4)Data Structure and Content**
Before processing the data, we need to understand its structure and explore its contents.

 **A- Training dataset**

In [None]:
#number of rows and columns in the dataset file
bigm_train.shape

In [None]:
bigm_train.info()

In [None]:
bigm_train.dtypes

In [None]:
#first 5 rows of the dataset
bigm_train.head()

 **B- Test dataset**

In [None]:
#number of rows and columns in the dataset file
bigm_test.shape

In [None]:
bigm_test.info()

In [None]:
bigm_test.dtypes

In [None]:
#first 5 rows of the dataset
bigm_test.head()

The training dataset has 12 columns; the test dataset has the same columns except the target, Item_Outlet_Sales:

•Item_Identifier(Categorical): Id of each item

•Item_Weight(Numerical): weight of the product.

•Item_Fat_Content(Categorical): fat content in the product.

•Item_Visibility(Numerical): proportion of the store's total display area dedicated to the specific product.

•Item_Type(Categorical): indicates to which category the item belongs. 

•Item_MRP(Numerical): maximum retail price (MRP) of the product.

•Outlet_Identifier(Categorical): Id of each store. 

•Outlet_Establishment_Year(Numerical): The year in which the store was established.

•Outlet_Size(Categorical): size of the store. 

•Outlet_Location_Type(Categorical): indicates in which type of city the store is located .

•Outlet_Type (Categorical): type of outlet (grocery store, supermarket, etc.).

•Item_Outlet_Sales(Numerical): Our target variable. Indicates the sales of the item in the store.
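
The column description above can be checked programmatically by comparing the column sets of the two frames. The sketch below is only an illustration: `toy_train` and `toy_test` are hypothetical miniature frames with made-up values, standing in for `bigm_train` and `bigm_test`.

```python
import pandas as pd

# Hypothetical miniature frames standing in for bigm_train and bigm_test
# (the values are made up for illustration)
toy_train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_MRP": [249.8, 48.3],
    "Item_Outlet_Sales": [3735.1, 443.4],
})
toy_test = pd.DataFrame({
    "Item_Identifier": ["FDW58", "NCN55"],
    "Item_MRP": [107.9, 241.8],
})

# Columns found in one frame but not the other;
# only the target is expected to be missing from the test frame
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))
print(only_in_train, only_in_test)
```

Running the same two set differences on the real frames should list the target column alone, assuming the files follow the schema described above.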

# **5)Exploratory Data Analysis**

**Outliers**

Let's check for outliers in our dataset.

In [None]:
numerical_features = [feature for feature in bigm_train.columns if bigm_train[feature].dtype != 'object']
for feature in numerical_features:
    # create a fresh figure per feature so every boxplot gets the intended size
    plt.figure(figsize=(10, 8))
    sns.boxplot(x=bigm_train[feature])
    plt.title(feature)
    plt.show()

Both Item_Visibility and Item_Outlet_Sales have outliers.
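
A boxplot flags points that fall more than 1.5 times the interquartile range beyond the quartiles, so the same rule can be used to count outliers numerically. The sketch below is a minimal illustration on a made-up series; `iqr_outlier_mask` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values with one obvious outlier at the end
toy = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])
mask = iqr_outlier_mask(toy)
print(int(mask.sum()), "value(s) flagged as outliers")
```

Applied to a real column, for example `iqr_outlier_mask(bigm_train['Item_Visibility'])`, the mask would give a count to go alongside the visual impression from the plot.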

**A- Training data**

In [None]:
bigm_train.describe()

In [None]:
#percentage of null values in each column
bigm_train.isnull().sum()/bigm_train.shape[0] *100

In [None]:
pp.ProfileReport(bigm_train)

**B- Test dataset**

In [None]:
bigm_test.describe()

In [None]:
#percentage of null values in each column
bigm_test.isnull().sum()/bigm_test.shape[0] *100

In [None]:
pp.ProfileReport(bigm_test)

# **6)Univariate Analysis**

* **Item_Fat_Content**

In [None]:
bigm_train['Item_Fat_Content'].value_counts()

In [None]:
bigm_test['Item_Fat_Content'].value_counts()

It seems that we need to fix the inconsistencies in this column since we only need two distinct entries: Low Fat and Regular. To do so, we need to replace the other entries with the right values.

In [None]:
#map the variant spellings to the two canonical labels and assign the result back to the column
fat_map = {'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'}
bigm_train['Item_Fat_Content'] = bigm_train['Item_Fat_Content'].replace(fat_map)
bigm_test['Item_Fat_Content'] = bigm_test['Item_Fat_Content'].replace(fat_map)
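
To make the effect of this cleanup concrete, here is a small sketch on a made-up series of raw labels. The `raw` series and the `canonical` mapping are illustrative assumptions that mirror the variants seen in the value counts above.

```python
import pandas as pd

# Made-up sample of raw labels as they appear before cleaning
raw = pd.Series(["Low Fat", "LF", "low fat", "Regular", "reg"])

# Map each variant spelling to its canonical label;
# values that are not keys of the mapping are left untouched
canonical = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = raw.replace(canonical)

print(cleaned.value_counts().to_dict())
```

After the replacement only the two canonical labels remain, which is what the value counts on the real columns should show as well.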

Now that we fixed the issues of our column, we can visualize our data in a barplot.

In [None]:

sns.countplot(x='Item_Fat_Content',data=bigm_train)

In [None]:
sns.countplot(x='Item_Fat_Content',data=bigm_test)

Low Fat products appear in more records than Regular products.

* **Item_Type**

In [None]:
bigm_train.Item_Type.unique()

The item types are all distinct and show no inconsistencies, so we can proceed to plot the data.

In [None]:
plt.figure(figsize=(25,17))
sns.countplot(y='Item_Type',data=bigm_train,order = bigm_train['Item_Type'].value_counts().index)

In [None]:
plt.figure(figsize=(25,17))
sns.countplot(y='Item_Type',data=bigm_test,order = bigm_test['Item_Type'].value_counts().index)

Fruits and Vegetables and Snack Foods are the most common item types in the data, while Seafood, Breakfast and other uncategorized types are the least common.

* **Outlet_Establishment_Year**

In [None]:
bigm_train.Outlet_Establishment_Year.unique()

The outlets were built between 1985 and 2009. Let's plot the data to get more information.

In [None]:
sns.countplot(data=bigm_train,x='Outlet_Establishment_Year')

Outlets established in 1985 account for more records than those of any other year, while outlets established in 1998 account for the fewest.

* **Outlet_Size**

In [None]:
sns.countplot(data=bigm_train,x='Outlet_Size')

Medium is the most common outlet size in the data, while High is the least common.

* **Outlet_Location_Type**

In [None]:
sns.countplot(data=bigm_train,x='Outlet_Location_Type')

Tier 3 cities account for the largest number of records.

* **Outlet_Type**

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(data=bigm_train,x='Outlet_Type')

Supermarket Type1 is the most common outlet type, with more than 5,000 records.

* **Item_Outlet_Sales**

In [None]:
bigm_train.Item_Outlet_Sales.unique()

This column has a high cardinality. We'll use a distribution plot to visualize the data in an efficient way.

In [None]:
sns.displot(bigm_train.Item_Outlet_Sales,kde=True)

# **7)Bivariate Analysis**

Since our target variable is Item_Outlet_Sales, we'll look at its correlation with the other numerical variables. Below is a table showing the correlation between each numerical variable and the target variable.

In [None]:
num_features = bigm_train.select_dtypes(include=[np.number])
corr=num_features.corr()
corr['Item_Outlet_Sales'].sort_values()

Item_MRP has the strongest positive correlation with Item_Outlet_Sales, while Item_Weight has the weakest correlation with the target variable.

In [None]:
plt.figure(figsize=(12,7))
plt.xlabel("Item_MRP")
plt.ylabel("Item_Outlet_Sales")
plt.title("Item_MRP and Item_Outlet_Sales Analysis")
plt.scatter(bigm_train.Item_MRP, bigm_train.Item_Outlet_Sales)

The scatterplot shows a positive relationship between the two variables: as the item's maximum retail price increases, item sales tend to increase as well.

In [None]:
#define response variable
y = bigm_train['Item_Outlet_Sales']

#define explanatory variable
x = bigm_train['Item_MRP']

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#view model summary
print(model.summary())

The fitted regression equation turns out to be:

Item Outlet Sales = 15.5530 * (Item_MRP) - 11.5751
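
For a single predictor, the coefficients in such an equation come from the closed-form least-squares solution: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below illustrates this on synthetic, exactly linear data (not the BigMart data) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Synthetic, exactly linear data: y = 3x + 2 (not the BigMart data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 2.0

# Closed-form simple linear regression
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Cross-check with a degree-1 polynomial fit, which returns [slope, intercept]
poly_slope, poly_intercept = np.polyfit(x, y, 1)
print(slope, intercept, poly_slope, poly_intercept)
```

On the real data the same formulas should reproduce the coefficients reported in the OLS summary, up to rounding.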

In [None]:
plt.figure(figsize=[20,9])
plt.scatter(bigm_train.Item_Visibility,bigm_train.Item_Outlet_Sales)
plt.xlabel('Item_Visibility')
plt.ylabel('Item_Outlet_Sales')

Low-visibility items tend to sell more than high-visibility items, which is consistent with the negative correlation found earlier. One possible explanation is that items taking up a large share of the store's display space tend to be pricey products that are not bought every day.

In [None]:
pd.pivot_table(bigm_train,'Item_Outlet_Sales',index='Item_Type',columns='Outlet_Size')

Looking at the table above, which shows average sales per item type and outlet size, medium-size outlets have the highest average sales while small outlets have the lowest.

In [None]:
plt.figure(figsize=(12,7))
plt.xlabel("Item_Weight")
plt.ylabel("Item_Outlet_Sales")
plt.title("Item_Weight and Item_Outlet_Sales Analysis")
plt.scatter(bigm_train.Item_Weight, bigm_train.Item_Outlet_Sales)

There seems to be no correlation between item weight and sales, which is expected since the correlation table shows the same thing.

# **8)Missing Value Treatment**

In [None]:
bigm_train.isnull().sum()

In [None]:
bigm_test.isnull().sum()

The same two columns, Item_Weight and Outlet_Size, have missing values in both the training and test sets. Let's proceed to clean our datasets and handle those missing values.

In [None]:
#Let's try dropping rows with missing values
train=bigm_train.dropna()
train.shape

Just by testing the method on our training dataset, it seems like almost half of the information is lost. We'll look for another method.

The Item_Weight column is numerical while Outlet_Size is categorical, which gives us an idea of how to handle the missing values in each.

In [None]:
print('Training set')
print(bigm_train.Item_Weight.describe())
print( '\n')
print('Test Set')
print(bigm_test.Item_Weight.describe())

In [None]:
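#replacing null values with the training-set mean (the same value is used for the test set)
item_weight_mean = bigm_train['Item_Weight'].mean()
bigm_train['Item_Weight'] = bigm_train['Item_Weight'].fillna(item_weight_mean)
bigm_test['Item_Weight'] = bigm_test['Item_Weight'].fillna(item_weight_mean)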

In [None]:
print(bigm_train.Item_Weight.isnull().sum())
print(bigm_test.Item_Weight.isnull().sum())

We filled all the missing values with the mean value of the column. Let's check the description of the column after imputation and compare it to the old values.

In [None]:
print('Training set')
print(bigm_train.Item_Weight.describe())
print( '\n')
print('Test Set')
print(bigm_test.Item_Weight.describe())

The mean is unchanged in the training set and changes only negligibly in the test set. The other statistics also changed only slightly, so the imputation was successful.
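
Why the mean stays put while the other statistics move can be seen on a tiny example. The sketch below uses a made-up series: filling with the mean leaves the mean unchanged by construction, but it pulls the standard deviation down because the filled values sit exactly at the centre.

```python
import numpy as np
import pandas as pd

# Made-up weights with two missing entries
weights = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan])
filled = weights.fillna(weights.mean())

# pandas skips NaN when computing mean and std
mean_before, mean_after = weights.mean(), filled.mean()
std_before, std_after = weights.std(), filled.std()
print(mean_before, mean_after, std_before, std_after)
```

The more values are imputed, the stronger this shrinkage becomes, so it is worth comparing the standard deviation before and after, as done with `describe()` above.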

Now let's move to the Outlet_Size column. We'll proceed with a mode imputation since we're dealing with a categorical column.

In [None]:
print(bigm_train['Outlet_Size'].mode())
print(bigm_test['Outlet_Size'].mode())

The column is unimodal so that makes things easier.

In [None]:
#replacing null values with the most frequent value of each set
bigm_train['Outlet_Size'] = bigm_train['Outlet_Size'].fillna(bigm_train['Outlet_Size'].mode()[0])
bigm_test['Outlet_Size'] = bigm_test['Outlet_Size'].fillna(bigm_test['Outlet_Size'].mode()[0])

In [None]:
print(bigm_test.Outlet_Size.isnull().sum())
print(bigm_train.Outlet_Size.isnull().sum())

No more missing values!!

# **9)Feature Engineering**

In [None]:
bigm_train.info()

In [None]:
bigm_train.head()

**Item_Fat_Content**

For this column, we only have two possible values: Low Fat or Regular. However, some item types are not edible, so they cannot meaningfully be classified as either. Since the column has no missing values, some inedible items must have been labeled anyway. The item in the 5th row above is an example: it is of type Household but is marked Low Fat, which doesn't make sense.

We can also see that item identifiers start with FD (Food), DR (Drinks) or NC (Non-Consumable).
We therefore create another category, "Non-Consumable", for Item_Fat_Content.

In [None]:
bigm_train.loc[bigm_train['Item_Identifier'].str.startswith('NC'), 'Item_Fat_Content'] = 'Non-Consumable'
bigm_test.loc[bigm_test['Item_Identifier'].str.startswith('NC'), 'Item_Fat_Content'] = 'Non-Consumable'

In [None]:

print('Training set')
print(bigm_train.Item_Fat_Content.value_counts())
print( '\n')
print('Test Set')
print(bigm_test.Item_Fat_Content.value_counts())

Issue resolved!!

**Item_Visibility**

In [None]:
n_zeros_train=(((bigm_train.Item_Visibility==0).sum())/bigm_train.shape[0])*100
print("Zeros rate in the column for training set:",n_zeros_train,'%')
n_zeros_test=(((bigm_test.Item_Visibility==0).sum())/bigm_test.shape[0])*100
print("Zeros rate in the column for test set:",n_zeros_test,'%')

Over 6% of the values in Item_Visibility are zeros, which doesn't make sense: a product that is on sale in a store can't have zero visibility.

In [None]:
#replacing zeros with the column mean
bigm_train['Item_Visibility'] = bigm_train['Item_Visibility'].replace(0, bigm_train['Item_Visibility'].mean())
bigm_test['Item_Visibility'] = bigm_test['Item_Visibility'].replace(0, bigm_test['Item_Visibility'].mean())
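
One possible variation, shown here only as a sketch on made-up values, is to compute the replacement from the non-zero entries. The column mean used above includes the zeros themselves, which pulls it down; treating zeros as missing avoids that. This is an alternative under that assumption, not what the notebook does.

```python
import pandas as pd

# Made-up visibility values where 0 stands for "not recorded"
vis = pd.Series([0.0, 0.02, 0.04, 0.0, 0.06])

# Average only the observed (non-zero) values
nonzero_mean = vis[vis > 0].mean()

# Replace the zeros with that average; other values are kept as they are
vis_fixed = vis.mask(vis == 0, nonzero_mean)
print(vis.mean(), nonzero_mean, vis_fixed.tolist())
```

Whether this matters in practice depends on how many zeros there are; with roughly 6% zeros the two means should differ only modestly.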

In [None]:
n_zeros_train=(((bigm_train.Item_Visibility==0).sum())/bigm_train.shape[0])*100
print("Zeros rate in the column for training set:",n_zeros_train,'%')
n_zeros_test=(((bigm_test.Item_Visibility==0).sum())/bigm_test.shape[0])*100
print("Zeros rate in the column for test set:",n_zeros_test,'%')

In [None]:
plt.figure(figsize=[20,9])
plt.scatter(bigm_train.Item_Visibility,bigm_train.Item_Outlet_Sales)
plt.xlabel('Item_Visibility')
plt.ylabel('Item_Outlet_Sales')

With the zeros replaced, the negative relationship is more apparent.

**Outlet_Establishment_Year**

Since we know that the data is from 2013, we can create a new feature **Outlet_Operation_Years** that gives the number of years each outlet has been operating.

In [None]:
bigm_train['Outlet_Operation_Years'] = 2013 - bigm_train['Outlet_Establishment_Year']
bigm_test['Outlet_Operation_Years'] = 2013 - bigm_test['Outlet_Establishment_Year']

In [None]:
bigm_train['Outlet_Operation_Years'].describe()

In [None]:
bigm_test['Outlet_Operation_Years'].describe()

**Item_Type**

As we can see, the Item_Type feature has high cardinality, so one-hot encoding it later on would not be a good idea. We therefore add a new feature that splits the items into 3 categories: 'Food', 'Drinks' and 'Non-Consumable'.

In [None]:
bigm_train['New_Item_Type'] = bigm_train['Item_Identifier'].apply(lambda x: x[:2])
bigm_train['New_Item_Type'] = bigm_train['New_Item_Type'].map({'FD':'Food', 'NC':'Non-Consumable', 'DR':'Drinks'})

In [None]:
bigm_test['New_Item_Type'] = bigm_test['Item_Identifier'].apply(lambda x: x[:2])
bigm_test['New_Item_Type'] = bigm_test['New_Item_Type'].map({'FD':'Food', 'NC':'Non-Consumable', 'DR':'Drinks'})

In [None]:
bigm_train.head()

In [None]:
bigm_test.shape

In [None]:
bigm_train.shape

# **10-11)Encoding Categorical Variables (Label Encoding)**

In [None]:
bigm_train.head()

In [None]:
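# Encode by column name rather than by position, so the code does not depend on column order
cat_cols = ['Item_Fat_Content', 'Item_Type', 'Outlet_Size',
            'Outlet_Location_Type', 'Outlet_Type', 'New_Item_Type']
for col in cat_cols:
    lencoder = LabelEncoder()
    # Fit on the training set and reuse the same mapping for the test set
    # (assumes every category in the test set also appears in the training set)
    bigm_train[col] = lencoder.fit_transform(bigm_train[col])
    bigm_test[col] = lencoder.transform(bigm_test[col])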

In [None]:
# Checking the Unique values for categorical data after label encoding
print("Item_Fat_Content\n ",bigm_train.Item_Fat_Content.unique())
print("Outlet_Size\n ",bigm_train.Outlet_Size.unique())
print("Outlet_Location_Type\n ",bigm_train.Outlet_Location_Type.unique())
print("Item_Type\n ",bigm_train.Item_Type.unique())
print("New_Item_Type\n ",bigm_train.New_Item_Type.unique())
print("Outlet_Type\n ",bigm_train.Outlet_Type.unique())

# **12)One Hot Encoding**

In [None]:
bigm_train.dtypes

In [None]:
bigm_test.columns

In [None]:
bigm_train = pd.get_dummies(bigm_train, columns=['Item_Fat_Content','Outlet_Type','New_Item_Type'])
bigm_test = pd.get_dummies(bigm_test, columns=['Item_Fat_Content','Outlet_Type','New_Item_Type'])
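
Because `pd.get_dummies` is applied to each frame separately, the two results only have matching columns if every category occurs in both frames. The sketch below, on hypothetical toy frames, shows one way to guard against a mismatch by reindexing the test dummies to the training columns. It is a precaution under the assumption that a category might be missing from the test set, not a step taken in the notebook.

```python
import pandas as pd

# Hypothetical toy frames; "Type2" is deliberately absent from the test frame
toy_train = pd.DataFrame({"Outlet_Type": ["Grocery", "Type1", "Type2"]})
toy_test = pd.DataFrame({"Outlet_Type": ["Type1", "Grocery"]})

train_d = pd.get_dummies(toy_train, columns=["Outlet_Type"], dtype=int)
test_d = pd.get_dummies(toy_test, columns=["Outlet_Type"], dtype=int)

# Give the test frame the training columns in the same order;
# dummies that never occurred in the test frame are filled with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```

On the real data the target column would have to be excluded from the column list used for reindexing, since the test set has no Item_Outlet_Sales.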

In [None]:
bigm_test.shape

In [None]:
bigm_test.columns

# **13)PreProcessing Data**

We start first by removing outliers from the Item_Visibility and Item_Outlet_Sales columns.

In [None]:
scaler = StandardScaler()

columns = scaler.fit_transform(bigm_train[['Item_Outlet_Sales', 'Item_Visibility']])

z_scores = np.abs(columns)

z_score_threshold = 2.5

outliers = np.where(z_scores > z_score_threshold)

samples_with_outliers = set(outliers[0])
print("Original data shape:", bigm_train.shape)
bigm_train = bigm_train.drop(samples_with_outliers)
print("Data shape after removing outliers:", bigm_train.shape)
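
The same z-score filter can also be written with a boolean mask, which does not depend on the row labels being 0, 1, 2 and so on. The sketch below uses a made-up column with a non-numeric index to illustrate this; `ddof=0` is chosen to match the population standard deviation that `StandardScaler` uses.

```python
import pandas as pd

# Made-up sales figures with one extreme value, indexed by letters on purpose
toy = pd.DataFrame(
    {"sales": [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0, 9.0, 200.0]},
    index=list("abcdefghij"),
)

# z-score with the population standard deviation (ddof=0), as StandardScaler does
z = (toy["sales"] - toy["sales"].mean()) / toy["sales"].std(ddof=0)

# Keep only the rows whose absolute z-score is within the threshold
kept = toy[z.abs() <= 2.5]
print(toy.shape, kept.shape)
```

With several columns, the per-column masks can be combined with `&` so that a row is kept only if all of its checked values are within the threshold.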

We can drop 'Item_Identifier' and 'Outlet_Identifier', since these identifier columns are not used as features here. 'Outlet_Establishment_Year' can be dropped too, as the new feature 'Outlet_Operation_Years' serves the same purpose.

In [None]:
X = bigm_train.drop(columns=['Outlet_Establishment_Year', 'Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales'])
y = bigm_train['Item_Outlet_Sales']
X_test= bigm_test.drop(columns=['Outlet_Establishment_Year', 'Item_Identifier', 'Outlet_Identifier'])

In [None]:
# Split the training data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=22)

# **14)Modeling**

 **A)Linear Regression**

In [None]:
lr= LinearRegression()
model_lr=lr.fit(X_train,y_train)

In [None]:
y_pred_lr=lr.predict(X_valid)

In [None]:
cv_score_lr = cross_val_score(model_lr, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
cv_score_lr = np.abs(np.mean(cv_score_lr))
    
print("Model Report")
print("MSE:",mean_squared_error(y_valid,y_pred_lr))
print("MAE:",mean_absolute_error(y_valid,y_pred_lr))
print("CV Score:", cv_score_lr)
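
The CV score printed here is a mean squared error, so it is in squared sales units and not directly comparable to the MAE. A sketch of how the `neg_mean_squared_error` scores can be turned into an RMSE is shown below on synthetic data; `X_toy` and `y_toy` are made-up stand-ins for the real training split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (stand-in for X_train / y_train)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn returns the negated MSE so that "higher is better" holds
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          scoring="neg_mean_squared_error", cv=5)

cv_mse = -neg_mse.mean()   # back to a positive mean squared error
cv_rmse = np.sqrt(cv_mse)  # same units as the target
print(cv_mse, cv_rmse)
```

Reporting the RMSE next to the MAE keeps both numbers in the units of Item_Outlet_Sales.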

 **B)Regularized Linear Regression**

In [None]:
# Initialize the Ridge regression model
ridge = Ridge(alpha=0.7)  # Adjust the alpha value as needed

# Fit the model on the training data
model_ridge=ridge.fit(X_train, y_train)

# Make predictions on the validation data
y_pred_ridge = ridge.predict(X_valid)

cv_score_ridge = cross_val_score(model_ridge, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
cv_score_ridge= np.abs(np.mean(cv_score_ridge))
    
print("Model Report")
print("MSE:",mean_squared_error(y_valid,y_pred_ridge))
print("MAE:",mean_absolute_error(y_valid,y_pred_ridge))
print("CV Score:", cv_score_ridge)

 **C)RandomForest**

In [None]:
# Define the parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [100, 200, 300],  # Adjust the number of estimators as needed
    'max_depth': [None, 5, 10],  # Adjust the maximum depth as needed
    'min_samples_split': [2, 5, 10]  # Adjust the minimum samples split as needed
}
# Initialize the Random Forest regression model
rf = RandomForestRegressor(random_state=42)  # hyperparameters are chosen by the grid search below

# Perform grid search to find the best parameters
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train, y_train)

# Get the best model and its parameters
best_model_rf = grid_search_rf.best_estimator_
best_params_rf = grid_search_rf.best_params_


# Make predictions on the validation data
y_pred_rf = best_model_rf.predict(X_valid)

# Cross-validated MSE of the tuned model
cv_score_rf = cross_val_score(best_model_rf, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
cv_score_rf= np.abs(np.mean(cv_score_rf))
    
print("Model Report")
print("MSE:",mean_squared_error(y_valid,y_pred_rf))
print("MAE:",mean_absolute_error(y_valid,y_pred_rf))
print("CV Score:", cv_score_rf)
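
A fitted `GridSearchCV` already stores the mean cross-validated score of the winning parameter combination in `best_score_`, so it can be read directly. The sketch below shows this on synthetic data, with `Ridge` used as a fast stand-in for the tree models; the data and the small grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for X_train / y_train)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=150)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

# best_score_ is the mean test score of the best parameter setting (negated MSE here)
best_cv_mse = -search.best_score_
print(search.best_params_, best_cv_mse)
```

Reading `best_score_` this way avoids refitting the tuned model five more times, which can save noticeable time for larger grids and slower estimators.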

 **D)XGBoost**

In [None]:
# Define the parameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],  # Adjust the number of estimators as needed
    'max_depth': [3, 5, 7],  # Adjust the maximum depth as needed
    'learning_rate': [0.1, 0.01, 0.001]  # Adjust the learning rate as needed
}

# Initialize the XGBoost regression model
xgb = XGBRegressor(random_state=42)

# Perform grid search to find the best parameters
grid_search_xgb = GridSearchCV(xgb, param_grid_xgb, cv=5, scoring='neg_mean_squared_error')
grid_search_xgb.fit(X_train, y_train)

# Get the best model and its parameters
best_model_xgb = grid_search_xgb.best_estimator_
best_params_xgb = grid_search_xgb.best_params_

# Make predictions on the validation data using the best model
y_pred_xgb = best_model_xgb.predict(X_valid)

# Cross-validated MSE of the tuned model
cv_score_xgb = cross_val_score(best_model_xgb, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
cv_score_xgb= np.abs(np.mean(cv_score_xgb))
    
print("Model Report")
print("MSE:",mean_squared_error(y_valid,y_pred_xgb))
print("MAE:",mean_absolute_error(y_valid,y_pred_xgb))
print("CV Score:", cv_score_xgb)

In [None]:
print("Linear regression               MAE:",mean_absolute_error(y_valid,y_pred_lr), "/    CV Score:",cv_score_lr)
print("Regularised linear regression   MAE:",mean_absolute_error(y_valid,y_pred_ridge), "/    CV Score:",cv_score_ridge)
print("Random Forest                   MAE:",mean_absolute_error(y_valid,y_pred_rf), "/    CV Score:",cv_score_rf)
print("XGBoost                         MAE:",mean_absolute_error(y_valid,y_pred_xgb), "/    CV Score:",cv_score_xgb)

# **15)Summary**

Judging from the results above, the XGBoost model gives the best combination of MAE and CV score, so we'll use it to predict sales on the test set.

# **Predictions on Test Set**

In [None]:

bigm_test_identifiers =pd.DataFrame(bigm_test[['Item_Identifier', 'Outlet_Identifier']])
bigm_test_predictions =pd.DataFrame(best_model_xgb.predict(X_test), columns=['Item_Outlet_Sales'])

final_result = pd.concat([bigm_test_identifiers,bigm_test_predictions], axis=1)
final_result

# **Saving The Test Predictions**

In [None]:
# index=False keeps the row index from being written as an extra unnamed column
final_result.to_csv('bigm_test_predictions.csv', index=False)