# MY PROJECT :
# Big Mart Sales Analysis
![title](https://thumbs.dreamstime.com/t/customer-shopping-supermarket-trolley-shift-motion-time-lapse-speed-up-50682151.jpg)

### INTRODUCTION
This project analyses the sales of products at Big Mart. The dataset covers sales in four types of stores: Supermarket Type 1, 2 and 3, and Grocery Stores.

The sales of these products depend on various factors. Data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities to relate these factors to sales. The dataset also defines certain attributes of each product and store.

This project aims to build a predictive model that estimates the sales of each product at a particular store.

### ADVANTAGES
Using this project, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

#### We will follow a step-by-step process that includes the following steps:
1. Getting the data
2. Data preprocessing
3. EDA on the data to get a good idea of the trends
4. Feature Engineering
5. Training the model
6. Evaluation of the model

# (1) Importing some Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import statistics
from scipy.stats import mode

## Importing the dataset through csv File

In [2]:
data = pd.read_csv("Big_mart.csv")

## Size and Shape of the dataset

In [3]:
print('Size of the dataset is :')
data.size

Size of the dataset is :


102276

In [4]:
print('Shape of the dataset :')
data.shape

Shape of the dataset :


(8523, 12)
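
As a quick sanity check, `size` is simply the number of rows multiplied by the number of columns. The sketch below shows this on a small stand-in frame (`demo` is a hypothetical frame, not the Big Mart data) and then on the figures reported above.

```python
import pandas as pd

# Hypothetical stand-in frame: 3 rows, 2 columns.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = demo.shape
# DataFrame.size counts every cell, so it equals rows * columns.
assert demo.size == n_rows * n_cols

# The same relation applied to the shape reported above.
reported_shape = (8523, 12)
reported_size = reported_shape[0] * reported_shape[1]
print(reported_size)
```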

## Information of the Attributes in the dataset
![title](https://ask.qcloudimg.com/http-save/yehe-1314998/biomwioypq.jpeg?imageView2/2/w/1620)

## Let's check the first five rows in the dataset

In [5]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [6]:
data.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [7]:
data.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

#### There are missing values in the attributes Item_Weight and Outlet_Size.
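
To judge how serious the gaps are, it can help to express the missing counts as a share of all rows. The sketch below reuses the counts and the row total printed above; the names `missing` and `missing_pct` are introduced here for illustration only.

```python
import pandas as pd

# Missing counts and row total copied from the outputs above.
n_rows = 8523
missing = pd.Series({"Item_Weight": 1463, "Outlet_Size": 2410})

# Share of rows with a missing value, in percent.
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct)
```

On the live DataFrame, `data.isnull().mean() * 100` should give the same percentages directly.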

# (2) Data Preprocessing

## Removing the Null Values

In [8]:
value = data.groupby(['Item_Identifier'])['Item_Weight'].mean()

In [9]:
data.loc[data['Item_Weight'].isnull(),'Item_Weight']=data.loc[data['Item_Weight'].isnull(),'Item_Identifier'].apply(lambda x:value[x])

In [10]:
data[data['Item_Weight'].isnull()]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
927,FDN52,,Regular,0.130933,Frozen Foods,86.9198,OUT027,1985,Medium,Tier 3,Supermarket Type3,1569.9564
1922,FDK57,,Low Fat,0.079904,Snack Foods,120.044,OUT027,1985,Medium,Tier 3,Supermarket Type3,4434.228
4187,FDE52,,Regular,0.029742,Dairy,88.9514,OUT027,1985,Medium,Tier 3,Supermarket Type3,3453.5046
5022,FDQ60,,Regular,0.191501,Baking Goods,121.2098,OUT019,1985,Small,Tier 1,Grocery Store,120.5098
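
The four rows above stay empty because the per-item mean is itself `NaN` when an item has no recorded weight in any row. A minimal sketch of the same group-mean imputation on hypothetical toy data, written with `groupby(...).transform` rather than the `apply`/lookup used above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: item "A" has one known weight, item "B" has none.
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B"],
    "Item_Weight": [10.0, np.nan, np.nan],
})

# Per-item mean weight, aligned back to the original rows.
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)

# "A" is filled with its group mean; "B" stays NaN because its group mean is NaN.
print(toy)
```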


In [11]:
# Drop the rows whose weight could not be imputed (indices 927, 1922, 4187, 5022)
data.dropna(subset=['Item_Weight'], inplace=True)

In [12]:
data.isnull().sum()

Item_Identifier                 0
Item_Weight                     0
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [13]:
data.loc[data['Outlet_Size'].isnull(),'Outlet_Size']='missing'

In [14]:
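# Mode of Outlet_Size per item; this result is overwritten by the next cell, which groups by Outlet_Type
values=data.groupby(['Item_Identifier'])['Outlet_Size'].apply(lambda x:mode(x).mode[0])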



In [15]:
values=data.groupby(['Outlet_Type'])['Outlet_Size'].apply(lambda x:mode(x).mode[0])



In [16]:
data.loc[data['Outlet_Size']=='missing','Outlet_Size']=data.loc[data['Outlet_Size']=='missing','Outlet_Type'].apply(lambda x:values[x])

In [17]:
values

Outlet_Type
Grocery Store        missing
Supermarket Type1      Small
Supermarket Type2     Medium
Supermarket Type3     Medium
Name: Outlet_Size, dtype: object
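
In the output above, the placeholder `'missing'` is itself the most frequent value for Grocery Store, so those rows are mapped back to `'missing'`. One possible alternative, sketched here on hypothetical toy data and not the method used in this notebook, is to keep the gaps as `NaN` and take the mode with pandas' `Series.mode`, which ignores `NaN` by default. The helper `first_mode` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: most Grocery Store sizes are unknown.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", np.nan, np.nan, "Small", "Small", np.nan],
})

def first_mode(sizes):
    """Return the most frequent non-null value, or NaN if the group has none."""
    modes = sizes.mode()  # Series.mode drops NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# Most frequent known size for each outlet type.
size_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(first_mode)

# Fill each gap with the mode of its outlet type.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(size_by_type))
print(toy)
```

Whether the most frequent known size is a sensible fill for Grocery Store outlets is a modelling choice that should be checked against the real data.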

In [18]:
data['Outlet_Size'].head(20)

0      Medium
1      Medium
2      Medium
3     missing
4        High
5      Medium
6        High
7      Medium
8       Small
9       Small
10     Medium
11      Small
12     Medium
13      Small
14       High
15      Small
16     Medium
17     Medium
18     Medium
19      Small
Name: Outlet_Size, dtype: object

In [19]:
data.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

### Null values are removed, although some Grocery Store rows still hold the placeholder 'missing' in Outlet_Size.

In [20]:
data['Item_Fat_Content'].value_counts()

Low Fat    5088
Regular    2886
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

### The problem here is inconsistent coding of the data: the same category appears under several labels.
### So, let us fix these labels.

In [21]:
data.loc[(data['Item_Fat_Content']=="LF") | (data['Item_Fat_Content']=='low fat') ,'Item_Fat_Content']="Low Fat"
data.loc[(data['Item_Fat_Content']=="reg"),'Item_Fat_Content']="Regular"

In [22]:
print(data['Item_Fat_Content'].unique())

['Low Fat' 'Regular']
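
The same clean-up can also be written as a single mapping from each variant spelling to its canonical label. A minimal sketch on a hypothetical toy column (`labels` and `canonical` are illustrative names):

```python
import pandas as pd

# Hypothetical toy column containing every spelling seen above.
labels = pd.Series(["Low Fat", "LF", "reg", "low fat", "Regular"])

# Map each variant spelling to its canonical label.
canonical = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}

# Values not listed in the mapping are left unchanged.
cleaned = labels.replace(canonical)
print(cleaned.unique())
```

Keeping the mapping in one dictionary makes it easy to extend if further spellings turn up.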


In [23]:
print(data['Item_Type'].unique())

['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']


In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8519 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              8519 non-null object
Item_Weight                  8519 non-null float64
Item_Fat_Content             8519 non-null object
Item_Visibility              8519 non-null float64
Item_Type                    8519 non-null object
Item_MRP                     8519 non-null float64
Outlet_Identifier            8519 non-null object
Outlet_Establishment_Year    8519 non-null int64
Outlet_Size                  8519 non-null object
Outlet_Location_Type         8519 non-null object
Outlet_Type                  8519 non-null object
Item_Outlet_Sales            8519 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 865.2+ KB


# (3) Exploratory Data Analysis (EDA) on the dataset :

## UNIVARIATE ANALYSIS

In [25]:
# countplot takes the DataFrame through the `data` keyword, not `df`
sns.countplot(x='Item_Fat_Content', data=data)
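
A count plot draws one bar per category, with the bar height equal to that category's frequency. The numbers behind such a plot can be tabulated without any plotting library. The sketch below rebuilds them from the `Item_Fat_Content` counts printed earlier, merging the variant spellings first; the variable names are illustrative.

```python
import pandas as pd

# Label counts copied from the value_counts() output in the preprocessing step.
raw_counts = pd.Series(
    {"Low Fat": 5088, "Regular": 2886, "LF": 316, "reg": 117, "low fat": 112}
)

# Merge the variant spellings into the two canonical labels.
canonical = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
counts = raw_counts.rename(index=canonical).groupby(level=0).sum()

# Share of each category in percent, which is what the bar heights convey.
shares = (counts / counts.sum() * 100).round(1)
print(counts)
print(shares)
```

On the live DataFrame, `data['Item_Fat_Content'].value_counts()` should return the same two counts after the clean-up.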