<a href="https://colab.research.google.com/github/ABDELLAH-Hallou/BigMart-outlet-sales-prediction/blob/master/BigMart_outlet_sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Description**: BigMart Outlet Sales Prediction







#### The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim of this data science project is to build a predictive model that estimates the sales of each product at a particular store.
## Our objective
#### Using this model, BigMart will try to understand which properties of products and stores play a key role in increasing sales.

## We will handle this problem in a structured way, following the table of contents below:

### 1.   Hypothesis Generation
### 2.   Loading Packages and Data
### 3.   Data Structure and Content
### 4.   Exploratory Data Analysis
### 5.   Univariate Analysis
### 6.   Bivariate Analysis
### 7.   Missing Value Treatment
### 8.   Feature Engineering
### 9.   Encoding Categorical Variables
### 10.   Label Encoding
### 11.   One Hot Encoding
### 12.   PreProcessing Data
### 13.   Modeling
### 14.   Summary

# **Hypothesis Generation**


## Features based on the Store:
#### 1.   The city in which the store is located:
*   Stores located in urban cities should have higher sales than stores in rural areas.
*   Stores located in big cities should have higher sales than stores located in small cities.

#### 2.   The location of the store in the city:
*   Stores located in the city center should have higher sales than stores on the outskirts of the city.

#### 3.   Competitor stores:
*   Stores close to competitor stores should sell less than stores far away from competitors.

#### 4.   Size of the store:
*   Large stores should have higher sales than medium and small stores.

#### 5.   Store design and architecture:
*   Well-designed stores can attract more customers.

#### 6.   Marketing:
*   Stores with a good marketing division can attract customers through the right offers.

## Features based on the Product:
#### 1.   Product Utility:
*   Daily-use products should sell more than other products.

#### 2.   Product Quality:
*   The quality of a product and its packaging can attract customers and increase sales.

#### 3.   Product Visibility in the store:
*   Products placed in an attention-catching spot should have higher sales.

#### 4.   Product Branding:
*   Branded products earn more customer trust, so they should have higher sales.

## Features based on the Customer:
#### 1.   Job profile and annual income:
*   A customer with a stable job and a high income should make more purchases.

#### 2.   Family size:
*   A customer with a large family should make more purchases.

# **Loading Packages and Data**

In [399]:
# Loading Packages
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [400]:
# Loading Data
filename = '/content/drive/MyDrive/Bigmart/Train.csv'
trainDf = pd.read_csv(filename)
trainDf.head(5)

,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Data Dictionary

#### The training set has 8523 rows containing both input and output variables; the test set has 5681 rows, for which the sales must be predicted.
*   **Item_Identifier**: Product ID
*   **Item_Weight**: Product Weight
*   **Item_Fat_Content**: Whether the product is Low Fat or not (Regular)
*   **Item_Visibility**: The percentage of the store's total display area that is allocated to the product
*   **Item_Type**: Product Category
*   **Item_MRP**: Maximum Retail Price of the Product
*   **Outlet_Identifier**: Store ID
*   **Outlet_Establishment_Year**: The year in which the Store was established
*   **Outlet_Size**: The area of ground space covered by the store (Small, Medium, or High)
*   **Outlet_Location_Type**: The type of city in which the store is located
*   **Outlet_Type**: The outlet Category (Grocery store or some sort of supermarket)
*   **Item_Outlet_Sales**: Sales of the product in the particular store. This is the outcome variable to be predicted.

# **Data Structure and Content**

In [401]:
# Data type of each column
trainDf.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

In [402]:
# Print the unique values of every object-typed column, skipping the two ID columns.
for col in trainDf.columns:
    if trainDf[col].dtype == object and col not in ["Outlet_Identifier", "Item_Identifier"]:
        print(col + ' : ', trainDf[col].unique())

Item_Fat_Content :  ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']
Item_Type :  ['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']
Outlet_Size :  ['Medium' nan 'High' 'Small']
Outlet_Location_Type :  ['Tier 1' 'Tier 3' 'Tier 2']
Outlet_Type :  ['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3']
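
The Item_Fat_Content values printed above appear to be five spellings of two categories. A minimal sketch of one way to collapse them is shown here; the mapping is an assumption (that 'LF' and 'low fat' mean 'Low Fat' and 'reg' means 'Regular'), and the names `fat`, `fat_content_map`, and `fat_clean` are illustrative rather than part of the original notebook.

```python
import pandas as pd

# Toy column reproducing the five spellings printed above.
fat = pd.Series(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], name='Item_Fat_Content')

# Assumed mapping from variant spellings to the two canonical labels.
fat_content_map = {'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'}

# replace() leaves values that are not keys of the mapping untouched.
fat_clean = fat.replace(fat_content_map)
print(fat_clean.unique())
```

On the real frame the same call would presumably be applied to `trainDf['Item_Fat_Content']`.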


In [403]:
# get the number of missing data points per column
missingData = trainDf.isnull().sum()
missingData

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [404]:
# how many total missing values do we have?
total_cells = np.prod(trainDf.shape)  # np.product was removed in NumPy 2.0
total_missing = missingData.sum()
# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(round(percent_missing,2),"%")

3.79 %
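
The overall figure hides which columns carry the gaps. As a rough sketch, the per-column share can be derived from the counts printed earlier (1463 for Item_Weight, 2410 for Outlet_Size), the 8523 training rows, and the 12 columns in the dtypes listing. The variable names below are illustrative and not part of the original notebook.

```python
import pandas as pd

# Training shape: rows from the text, columns from the dtypes listing.
n_rows, n_cols = 8523, 12
# Missing-value counts copied from the isnull().sum() output.
missing_counts = pd.Series({'Item_Weight': 1463, 'Outlet_Size': 2410})

# Share of missing values within each affected column.
percent_missing_by_col = (missing_counts / n_rows * 100).round(2)
print(percent_missing_by_col)

# Share of missing cells over the whole table.
percent_missing_total = round(missing_counts.sum() / (n_rows * n_cols) * 100, 2)
print(percent_missing_total, "%")
```

Roughly 17% of Item_Weight and 28% of Outlet_Size are missing, even though the table as a whole is under 4% empty. On the actual frame, `trainDf.isnull().mean() * 100` should give the per-column percentages directly.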


In [405]:
product_columns = ['Item_Identifier','Item_Weight','Item_Fat_Content','Item_Visibility','Item_Type','Item_MRP','Outlet_Identifier','Item_Outlet_Sales']
store_columns = ['Outlet_Identifier','Outlet_Establishment_Year','Outlet_Size','Outlet_Location_Type','Outlet_Type','Item_Outlet_Sales']

# drop_duplicates() compares whole rows, so Item_Outlet_Sales is part of the duplicate key here
productDf = trainDf[product_columns].drop_duplicates()
storeDf = trainDf[store_columns].drop_duplicates()
print('the size of product dataframe:',productDf.shape)
print('the size of store dataframe:',storeDf.shape)

the size of product dataframe: (8523, 8)
the size of store dataframe: (7075, 6)


In [406]:
# total sales per outlet (groupby returns the result sorted by Outlet_Identifier)
newItem_Outlet_Sales = pd.DataFrame(data=storeDf.groupby(['Outlet_Identifier'])['Item_Outlet_Sales'].sum())
# remove the per-row sales so that only one row per outlet remains
originalItem_Outlet_Sales = storeDf.pop('Item_Outlet_Sales')
storeDf = storeDf.drop_duplicates()
# reindexing after dropping duplicate rows
storeDf.reset_index(drop=True, inplace=True)
# attach the totals by position
storeDf['Item_Outlet_Sales']=newItem_Outlet_Sales.Item_Outlet_Sales.to_list()
storeDf.set_index('Outlet_Identifier')



Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
OUT049,1999,Medium,Tier 1,Supermarket Type1,143257.5
OUT018,2009,Medium,Tier 3,Supermarket Type2,1877975.0
OUT010,1998,,Tier 3,Grocery Store,1868512.0
OUT013,1987,High,Tier 3,Supermarket Type1,1612335.0
OUT027,1985,Medium,Tier 3,Supermarket Type3,135295.2
OUT045,2002,,Tier 2,Supermarket Type1,3109910.0
OUT017,2007,,Tier 2,Supermarket Type1,2023125.0
OUT046,1997,Small,Tier 1,Supermarket Type1,1770828.0
OUT035,2004,Small,Tier 2,Supermarket Type1,1819026.0
OUT019,1985,Small,Tier 1,Grocery Store,1912381.0
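
The totals above are attached by position. `groupby` sorts its result by key by default, while the rows of storeDf follow order of first appearance, so a positional assignment can pair a total with the wrong outlet. Dropping duplicates while Item_Outlet_Sales is still in the frame also discards repeated sale amounts before they are summed. A minimal sketch of a key-aligned alternative follows, using made-up toy rows and the hypothetical names `toyDf` and `storeToyDf`.

```python
import pandas as pd

# Toy rows (made-up values); OUT049 deliberately has two identical sales rows.
toyDf = pd.DataFrame({
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT010', 'OUT049', 'OUT010'],
    'Outlet_Type': ['Supermarket Type1', 'Supermarket Type2', 'Grocery Store',
                    'Supermarket Type1', 'Grocery Store'],
    'Item_Outlet_Sales': [100.0, 200.0, 10.0, 100.0, 20.0],
})

# One row per outlet: store attributes are constant within an outlet, so 'first'
# is enough, and sales are summed over every row, repeated amounts included.
storeToyDf = toyDf.groupby('Outlet_Identifier').agg(
    Outlet_Type=('Outlet_Type', 'first'),
    Item_Outlet_Sales=('Item_Outlet_Sales', 'sum'),
)
print(storeToyDf)
```

Because attributes and totals come out of the same `groupby`, each total stays on the row of the outlet it belongs to, whatever the row order of the input.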


In [407]:
# handle missing values manually: fill Outlet_Size according to the location tier
# ('Medium' is capitalised to match the existing category label)
storeDf.loc[(storeDf['Outlet_Location_Type'] == 'Tier 3') & (storeDf['Outlet_Size'].isnull()),'Outlet_Size'] = 'Medium'
storeDf.loc[(storeDf['Outlet_Location_Type'] == 'Tier 1') & (storeDf['Outlet_Size'].isnull()),'Outlet_Size'] = 'Small'
storeDf.loc[(storeDf['Outlet_Location_Type'] == 'Tier 2') & (storeDf['Outlet_Size'].isnull()),'Outlet_Size'] = 'Small'
storeDf

,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,OUT049,1999,Medium,Tier 1,Supermarket Type1,143257.5
1,OUT018,2009,Medium,Tier 3,Supermarket Type2,1877975.0
2,OUT010,1998,Medium,Tier 3,Grocery Store,1868512.0
3,OUT013,1987,High,Tier 3,Supermarket Type1,1612335.0
4,OUT027,1985,Medium,Tier 3,Supermarket Type3,135295.2
5,OUT045,2002,Small,Tier 2,Supermarket Type1,3109910.0
6,OUT017,2007,Small,Tier 2,Supermarket Type1,2023125.0
7,OUT046,1997,Small,Tier 1,Supermarket Type1,1770828.0
8,OUT035,2004,Small,Tier 2,Supermarket Type1,1819026.0
9,OUT019,1985,Small,Tier 1,Grocery Store,1912381.0


In [408]:
Supermarket = "Supermarket"
GroceryStore = "Grocery Store"

# collapse the three supermarket subtypes into a single 'Supermarket' label
storeDf.loc[storeDf['Outlet_Type'].str.startswith('S'), 'Outlet_Type'] = Supermarket
storeDf.loc[storeDf['Outlet_Type'].str.startswith('G'), 'Outlet_Type'] = GroceryStore


In [409]:
storeDf = storeDf.set_index('Outlet_Identifier')
storeDf

Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
OUT049,1999,Medium,Tier 1,Supermarket,143257.5
OUT018,2009,Medium,Tier 3,Supermarket,1877975.0
OUT010,1998,Medium,Tier 3,Grocery Store,1868512.0
OUT013,1987,High,Tier 3,Supermarket,1612335.0
OUT027,1985,Medium,Tier 3,Supermarket,135295.2
OUT045,2002,Small,Tier 2,Supermarket,3109910.0
OUT017,2007,Small,Tier 2,Supermarket,2023125.0
OUT046,1997,Small,Tier 1,Supermarket,1770828.0
OUT035,2004,Small,Tier 2,Supermarket,1819026.0
OUT019,1985,Small,Tier 1,Grocery Store,1912381.0


In [410]:
productDf = productDf.set_index('Item_Identifier')

In [411]:
productDf.loc[productDf['Item_Weight'].isnull()]

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Item_Outlet_Sales
FDP10,,Low Fat,0.127470,Snack Foods,107.7622,OUT027,4022.7636
DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,2303.6680
FDW12,,Regular,0.035400,Baking Goods,144.5444,OUT027,4064.0432
FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,214.3876
FDC14,,Regular,0.072222,Canned,43.6454,OUT019,125.8362
...,...,...,...,...,...,...,...
DRK37,,Low Fat,0.043792,Soft Drinks,189.0530,OUT027,6261.8490
DRG13,,Low Fat,0.037006,Soft Drinks,164.7526,OUT027,4111.3150
NCN14,,Low Fat,0.091473,Others,184.6608,OUT027,2756.4120
FDU44,,Regular,0.102296,Fruits and Vegetables,162.3552,OUT019,487.3656


In [412]:
pd.DataFrame(data=productDf.groupby(['Outlet_Identifier'])['Item_Weight'].mean())

Outlet_Identifier,Item_Weight
OUT010,12.913153
OUT013,13.006148
OUT017,12.826668
OUT018,12.873346
OUT019,
OUT027,
OUT035,12.829349
OUT045,12.649989
OUT046,12.866801
OUT049,12.917446


##### We can now see that stores OUT019 and OUT027 have no weight data for any of their products.

#### The average product weight is nearly the same across the other stores, so we will replace NaN with the mean of the Item_Weight column.
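
To put a number on how close the per-store averages are, the sketch below copies the eight non-missing means from the table above and measures their spread. The names `storeMeans`, `spread`, and `relative_spread` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Per-store mean Item_Weight copied from the table above
# (OUT019 and OUT027 are omitted because their means are NaN).
storeMeans = pd.Series({
    'OUT010': 12.913153, 'OUT013': 13.006148, 'OUT017': 12.826668,
    'OUT018': 12.873346, 'OUT035': 12.829349, 'OUT045': 12.649989,
    'OUT046': 12.866801, 'OUT049': 12.917446,
})

# Absolute gap between the heaviest and lightest store average.
spread = storeMeans.max() - storeMeans.min()
# The same gap relative to the typical store average.
relative_spread = spread / storeMeans.mean()
print(round(spread, 3), round(relative_spread * 100, 1), '%')
```

The gap between the highest and lowest store average is about 0.36, under 3% of the typical value, which supports using a single global mean for the two stores without weight data.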

In [413]:
productWeightMean = productDf.Item_Weight.mean()
productDf['Item_Weight'] = productDf['Item_Weight'].fillna(productWeightMean)
productDf

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Item_Outlet_Sales
FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,3735.1380
DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,443.4228
FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,2097.2700
FDX07,19.200,Regular,0.000000,Fruits and Vegetables,182.0950,OUT010,732.3800
NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,994.7052
...,...,...,...,...,...,...,...
FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,2778.3834
FDS36,8.380,Regular,0.046982,Baking Goods,108.1570,OUT045,549.2850
NCJ29,10.600,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,1193.1136
FDN46,7.210,Regular,0.145221,Snack Foods,103.1332,OUT018,1845.5976
