### iNeuron Internship
    1. Project Domain : Sales & Marketing
    2. Project Name   : Stores Sales Prediction
### Dataset Source 
    1. Link : https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data
### Features
    1. Item_Identifier       : Unique product ID 
    2. Item_Weight           : Weight of product 
    3. Item_Fat_Content      : Whether the product is low fat or not 
    4. Item_Visibility       : The % of total display area of all products in a store allocated to the particular product 
    5. Item_Type             : The category to which the product belongs 
    6. Item_MRP              : Maximum Retail Price (list price) of the product 
    7. Outlet_Identifier     : Unique store ID 
    8. Outlet_Establishment_Year : The year in which store was established 
    9. Outlet_Size           : The size of the store in terms of ground area covered 
    10. Outlet_Location_Type : The type of city in which the store is located 
    11. Outlet_Type          : Whether the outlet is just a grocery store or some sort of supermarket 
    12. Item_Outlet_Sales    : Sales of the product in the particular store. It is the outcome variable to be predicted.
### General Information
    1. This is a supervised learning problem: the goal is to predict the sales of a product in a given store. Because the target (Item_Outlet_Sales) is a continuous value, regression algorithms are the appropriate choice.

### 1. Import Needed Libraries & Dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.set_option('display.max_colwidth', None)

### 2. Import Needed Dataset

In [2]:
df = pd.read_csv("Train.csv")
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


### 3. Data Preprocessing

#### 3.1. Drop Duplicates

In [3]:
df.shape

df.drop_duplicates(inplace=True)

df.shape

(8523, 12)

(8523, 12)
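
The identical shapes before and after show that the training data has no fully duplicated rows. As a minimal sketch on hypothetical toy data (not the real dataset), the same check can be made explicit by counting duplicates before dropping them:

```python
import pandas as pd

# Toy frame (hypothetical values) containing one exact duplicate row.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Item_MRP": [249.8092, 48.2692, 249.8092],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(n_duplicates, deduped.shape)
```

Counting first makes it clear how many rows were removed, which the shape comparison only shows indirectly.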

#### 3.2. Missing Value Treatment

In [4]:
df.isnull().values.any()

True

In [5]:
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [6]:
# Item_Weight: MISSING VALUE TREATMENT

# sort data based on Item_Weight (NaN values are placed last)
sorted_df = df.sort_values(by=['Item_Weight'])

# get unique Item_Identifiers and their corresponding Item_Weight
uniqueItemIdentifiers = sorted_df.drop_duplicates(subset="Item_Identifier")[['Item_Identifier', 'Item_Weight']]

# convert it to a dictionary with identifier as key and weight as value
uniqueIdentifiers = uniqueItemIdentifiers.set_index('Item_Identifier')['Item_Weight'].to_dict()

# fill missing weights from the lookup (relies on the default 0..n-1 row index)
for i in range(0, len(df)):
    if np.isnan(df.at[i, 'Item_Weight']):
        df.at[i, 'Item_Weight'] = uniqueIdentifiers[str(df.at[i, 'Item_Identifier'])]

# fill with the mean item weight if any row still has NaN in Item_Weight after the treatment above
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
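
The row-by-row loop above can also be written without an explicit loop. The following is a sketch on hypothetical toy data, assuming each item has one consistent weight across stores; it takes the first non-null weight per item rather than the smallest one, so it is an alternative rather than an exact copy of the cell above:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): NCD19 has no recorded weight at all.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, np.nan, 5.92, np.nan],
})

# First non-null weight of each item, broadcast back to every row of that item.
per_item_weight = toy.groupby("Item_Identifier")["Item_Weight"].transform("first")
toy["Item_Weight"] = toy["Item_Weight"].fillna(per_item_weight)

# Items that never had a weight fall back to the column mean.
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())
print(toy)
```

This version does not depend on the row index being 0..n-1, which the `df.at[i, ...]` loop does.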

In [7]:
# Outlet_Size: MISSING VALUE TREATMENT

# Fill all NaN values with "Small" as the outlet size.
df['Outlet_Size'] = df['Outlet_Size'].fillna("Small")

# Instead of the above method, we can also fill with the mode of the Outlet_Size column.
# mode() returns a Series, so take its first element.
# df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])
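
A note on the mode-based alternative: `Series.mode()` returns a Series (there can be ties), so a single value has to be selected before passing it to `fillna`. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing sizes.
sizes = pd.Series(["Medium", np.nan, "Small", "Medium", np.nan], name="Outlet_Size")

# mode() ignores NaN by default; take the first entry in case of ties.
most_common = sizes.mode().iloc[0]
filled = sizes.fillna(most_common)
print(most_common, filled.tolist())
```

Whether a constant such as "Small" or the mode is the better fill is a modelling decision; this only shows the mechanics.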

In [8]:
df.isnull().values.any()

False

#### 3.3. Removing Redundant Names

In [9]:
feature = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 
                   'Outlet_Location_Type', 'Outlet_Type']

uniqueValue = lambda i: df[i].unique()
uniqueValuesCount = lambda i: len(df[i].value_counts())

uniqueValueCounts = [uniqueValuesCount(i) for i in feature]
uniqueValues = [str(uniqueValue(i).tolist()) for i in feature]

uniqueValues[0] = str(['...'])  # Item_Identifier has too many unique values; view it separately
data = {'Unique_Value_Counts': uniqueValueCounts, 'Unique_Values': uniqueValues}

pd.DataFrame(data=data, index=feature)

Unnamed: 0,Unique_Value_Counts,Unique_Values
Item_Identifier,1559,['...']
Item_Fat_Content,5,"['Low Fat', 'Regular', 'low fat', 'LF', 'reg']"
Item_Type,16,"['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables', 'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods', 'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned', 'Breads', 'Starchy Foods', 'Others', 'Seafood']"
Outlet_Identifier,10,"['OUT049', 'OUT018', 'OUT010', 'OUT013', 'OUT027', 'OUT045', 'OUT017', 'OUT046', 'OUT035', 'OUT019']"
Outlet_Size,3,"['Medium', 'Small', 'High']"
Outlet_Location_Type,3,"['Tier 1', 'Tier 3', 'Tier 2']"
Outlet_Type,4,"['Supermarket Type1', 'Supermarket Type2', 'Grocery Store', 'Supermarket Type3']"


In [10]:
df.Item_Fat_Content.value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [11]:
# normalize inconsistent spellings of the same category
itemFatContent = {'low fat': "Low Fat", 'LF': "Low Fat", 'reg': "Regular"}

df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(itemFatContent)

In [12]:
df.Item_Fat_Content.value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
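
The counts add up as expected: 5089 + 316 + 112 = 5517 for "Low Fat" and 2889 + 117 = 3006 for "Regular". As a small sketch on hypothetical toy data, the same cleanup can be followed by a check that no unexpected label survives:

```python
import pandas as pd

# Toy column (hypothetical values) covering every spelling seen above.
fat = pd.Series(["Low Fat", "LF", "reg", "low fat", "Regular"], name="Item_Fat_Content")

label_map = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
cleaned = fat.replace(label_map)

# Any label outside the two canonical categories would show up here.
unexpected = set(cleaned.unique()) - {"Low Fat", "Regular"}
print(cleaned.tolist(), unexpected)
```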

#### 3.4. Drop Unnecessary Columns

In [13]:
df.drop(columns=['Item_Identifier'], inplace=True)
df.head(2)

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


#### 3.5. Handling Categorical Variables (One-Hot Encoding)

In [30]:
# Get k dummies

df1 = pd.get_dummies(data=df, columns=['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'])
df1.head(2)

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,OUT049,1999,3735.138,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,OUT018,2009,443.4228,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0


In [31]:
# Get k-1 dummies

df2 = pd.get_dummies(data=df, columns=['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'], drop_first=True)
df2.head(2)

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Fat_Content_Regular,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,OUT049,1999,3735.138,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,OUT018,2009,443.4228,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0
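
The difference between the two encodings is easier to see on a single column. In this sketch on hypothetical toy data, `drop_first=True` removes the first category in sorted order ("High"), which is then represented by all remaining dummy columns being zero:

```python
import pandas as pd

# Toy frame (hypothetical values) with the three outlet sizes.
toy = pd.DataFrame({"Outlet_Size": ["Medium", "Small", "High", "Medium"]})

# k dummies: one indicator column per category.
k = pd.get_dummies(toy, columns=["Outlet_Size"])

# k-1 dummies: the first category is dropped and becomes the baseline.
k_minus_1 = pd.get_dummies(toy, columns=["Outlet_Size"], drop_first=True)

print(list(k.columns))
print(list(k_minus_1.columns))
```

With k dummies the indicator columns of each feature always sum to 1, which makes them perfectly collinear with an intercept term in linear models; dropping one column avoids that. Tree-based models are generally not affected by this choice.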


#### 3.6. Feature Scaling