#### Problem Statement:
Shopping malls and Big Marts track the sales of individual items in order to forecast customer demand and adjust inventory. These records, held in a data warehouse, contain a large amount of customer information and item-level detail. Mining the warehouse data can reveal anomalies and common patterns.


#### Approach: 
The approach follows the classical machine learning workflow: data exploration, data cleaning, feature engineering, model building, and model testing. Try out different machine learning algorithms to find the one that best fits this problem.
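As a rough illustration of how these steps chain together, here is a minimal sketch using only pandas and NumPy. The five rows are copied from the `train_data.head()` output in this notebook; the helper names (`explore`, `clean`, `add_features`, `fit_baseline`, `rmse`), the `"Unknown"` fill value, and the `Item_Category` feature are assumptions made for illustration, not the method used later in the notebook.

```python
import numpy as np
import pandas as pd

# Five rows copied from the train_data.head() output in this notebook.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07", "NCD19"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular", "Low Fat"],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095, 53.8614],
    "Outlet_Size": ["Medium", "Medium", "Medium", None, "High"],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

def explore(df):
    """Data exploration: count missing values per column."""
    return df.isna().sum()

def clean(df):
    """Data cleaning: fill missing Outlet_Size with a placeholder (assumed strategy)."""
    return df.assign(Outlet_Size=df["Outlet_Size"].fillna("Unknown"))

def add_features(df):
    """Feature engineering: use the ID prefix as a broad category (hypothetical feature)."""
    return df.assign(Item_Category=df["Item_Identifier"].str[:2])

def fit_baseline(df):
    """Model building: a naive baseline that always predicts the mean sales."""
    return df["Item_Outlet_Sales"].mean()

def rmse(y_true, y_pred):
    """Model testing: root mean squared error of the predictions."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - y_pred) ** 2)))

missing = explore(sample)
prepared = add_features(clean(sample))
baseline = fit_baseline(prepared)
# Scored on the training rows only to keep the sketch short;
# a real test would use held-out data.
baseline_rmse = rmse(prepared["Item_Outlet_Sales"], baseline)
```

A mean-only baseline like this is useful mainly as a reference point: any real model tried afterwards should beat its error.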


#### Results: 
You have to build a solution that can predict the sales of the different Big Mart stores from the provided dataset.


    Item_Identifier: Unique product ID
    
    Item_Weight: Weight of product 
    
    Item_Fat_Content: Whether the product is low fat or not 
    
    Item_Visibility: The % of total display area of all products in a store allocated to the particular product 
    
    Item_Type: The category to which the product belongs 
    
    Item_MRP: Maximum Retail Price (list price) of the product
    
    Outlet_Identifier: Unique store ID 
    
    Outlet_Establishment_Year: The year in which store was established
    
    Outlet_Size: The size of the store in terms of ground area covered 
    
    Outlet_Location_Type: The type of city in which the store is located 
    
    Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket 
    
    Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.
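The column list above can double as a quick schema check before any modelling. The sketch below is one possible way to do that; `EXPECTED_COLUMNS`, `TARGET`, and `missing_columns` are hypothetical names introduced here, and the single row is copied from the first line of the `train_data.head()` output in this notebook.

```python
import pandas as pd

# Column names taken from the data dictionary above.
EXPECTED_COLUMNS = [
    "Item_Identifier", "Item_Weight", "Item_Fat_Content", "Item_Visibility",
    "Item_Type", "Item_MRP", "Outlet_Identifier", "Outlet_Establishment_Year",
    "Outlet_Size", "Outlet_Location_Type", "Outlet_Type", "Item_Outlet_Sales",
]
TARGET = "Item_Outlet_Sales"  # the outcome variable to be predicted

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from df, in dictionary order."""
    return [col for col in expected if col not in df.columns]

# First row of the train_data.head() output in this notebook.
sample = pd.DataFrame([{
    "Item_Identifier": "FDA15", "Item_Weight": 9.3, "Item_Fat_Content": "Low Fat",
    "Item_Visibility": 0.016047, "Item_Type": "Dairy", "Item_MRP": 249.8092,
    "Outlet_Identifier": "OUT049", "Outlet_Establishment_Year": 1999,
    "Outlet_Size": "Medium", "Outlet_Location_Type": "Tier 1",
    "Outlet_Type": "Supermarket Type1", "Item_Outlet_Sales": 3735.138,
}])

train_gaps = missing_columns(sample)
# The test file has no target column, so the check should report exactly that.
test_gaps = missing_columns(sample.drop(columns=[TARGET]))
```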

In [1]:
import pandas as pd
train_data=pd.read_csv(r"D:\Full stack Data  Science by INEURON\Projects\Store sales prediction\Train.csv")
train_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [2]:
pd.set_option("display.max_rows",None)

In [3]:
train_data.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

In [5]:
# Drop into a copy: later cells still group train_data by the outlet columns.
item_data = train_data.drop(columns=['Outlet_Establishment_Year', 'Outlet_Size',
                                     'Outlet_Location_Type', 'Outlet_Type'])

In [6]:
item_data.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Item_Outlet_Sales'],
      dtype='object')

In [19]:
train_data["Item_Fat_Content"].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [4]:
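def normalize_fat_content(label):
    """Map the label variants ('LF', 'low fat', 'reg', ...) to 'low fat' or 'regular'."""
    label = label.lower()
    if label == 'lf':
        return 'low fat'
    if label == 'reg':
        return 'regular'
    return label

train_data["Item_Fat_Content"] = train_data["Item_Fat_Content"].apply(normalize_fat_content)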
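The same normalization can also be written without a row-wise `apply`. The sketch below is an alternative, not the approach used in this notebook: it rebuilds a Series from the label counts reported by `value_counts()` above and merges the variants with `Series.str.lower()` followed by `Series.replace()`, which with a dict matches whole values rather than substrings.

```python
import pandas as pd

# Label counts as reported by value_counts() above.
counts = {"Low Fat": 5089, "Regular": 2889, "LF": 316, "reg": 117, "low fat": 112}

# Rebuild a Series with one entry per row so the result can be checked by count.
fat_content = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    name="Item_Fat_Content",
)

# Lowercase everything, then map the abbreviations to the full labels.
normalized = fat_content.str.lower().replace({"lf": "low fat", "reg": "regular"})
result = normalized.value_counts()
```

After normalization only two labels should remain, with `low fat` at 5089 + 316 + 112 = 5517 rows and `regular` at 2889 + 117 = 3006 rows, which together give the 8523 rows of the training set.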

In [7]:
train_data.groupby(["Item_Type",'Item_Identifier','Outlet_Type', 'Outlet_Location_Type','Outlet_Identifier'])[[ 'Item_Visibility', 'Item_MRP','Item_Outlet_Sales','Outlet_Size']].first()

Item_Type,Item_Identifier,Outlet_Type,Outlet_Location_Type,Outlet_Identifier,Item_Visibility,Item_MRP,Item_Outlet_Sales,Outlet_Size
Baking Goods,FDA11,Supermarket Type1,Tier 1,OUT046,0.043239,92.5436,1701.7848,Small
Baking Goods,FDA11,Supermarket Type1,Tier 2,OUT017,0.043483,94.3436,2363.59,
Baking Goods,FDA11,Supermarket Type1,Tier 2,OUT045,0.043327,95.6436,1134.5232,
Baking Goods,FDA11,Supermarket Type2,Tier 3,OUT018,0.043415,93.1436,1418.154,Medium
Baking Goods,FDA11,Supermarket Type3,Tier 3,OUT027,0.043029,94.7436,1701.7848,Medium
Baking Goods,FDA23,Supermarket Type1,Tier 1,OUT046,0.047187,100.6016,1619.2256,Small
Baking Goods,FDA23,Supermarket Type1,Tier 1,OUT049,0.04726,102.8016,1922.8304,Medium
Baking Goods,FDA23,Supermarket Type1,Tier 2,OUT017,0.047454,101.7016,1518.024,
Baking Goods,FDA23,Supermarket Type1,Tier 2,OUT035,0.047178,99.4016,2428.8384,Small
Baking Goods,FDA23,Supermarket Type1,Tier 3,OUT013,0.047148,102.4016,1720.4272,High


In [75]:
train_data.groupby(['Outlet_Identifier','Item_Type','Item_Fat_Content'])[['Item_Identifier',"Item_MRP","Item_Outlet_Sales",'Item_Weight','Item_Visibility']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C183EEDEE0>

In [91]:
# Total sales per outlet, item type and fat content.
# Other ways to inspect the groups: .get_group(("OUT010", "Dairy", "regular")), .first(), .indices, .groups
train_data.groupby(['Outlet_Identifier','Item_Type','Item_Fat_Content'])[['Item_Outlet_Sales']].sum()

Outlet_Identifier,Item_Type,Item_Fat_Content,Item_Outlet_Sales
OUT010,Baking Goods,low fat,4948.2256
OUT010,Baking Goods,regular,5745.1882
OUT010,Breads,low fat,3847.6582
OUT010,Breads,regular,3809.7076
OUT010,Breakfast,low fat,1775.6886
OUT010,Breakfast,regular,2305.6654
OUT010,Canned,low fat,5137.9786
OUT010,Canned,regular,3881.614
OUT010,Dairy,low fat,9767.9518
OUT010,Dairy,regular,5539.456
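A groupby object can be inspected in several ways besides `sum()`. The sketch below illustrates a few of them on a small frame. The rows reuse the outlet IDs and sales figures from the `train_data.head()` output in this notebook, with the fat-content labels already lowercased; grouping by only two keys is a simplification chosen to keep the example short.

```python
import pandas as pd

# Outlet, fat content and sales from the five train_data.head() rows.
sample = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049", "OUT010", "OUT013"],
    "Item_Fat_Content": ["low fat", "regular", "low fat", "regular", "low fat"],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

grouped = sample.groupby(["Outlet_Identifier", "Item_Fat_Content"])

one_group = grouped.get_group(("OUT049", "low fat"))  # all rows of a single group
group_keys = list(grouped.groups)                     # one (outlet, fat content) tuple per group
first_rows = grouped.first()                          # first row of every group
totals = grouped[["Item_Outlet_Sales"]].sum()         # total sales per group
```

Here `("OUT049", "low fat")` is the only group with more than one row, so its total is the sum of the two sales figures while every other group keeps its single value.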


In [70]:
# Sanity check: the per-identifier counts add up to the number of rows in the training set.
train_data['Item_Identifier'].value_counts().sum()

8523

In [5]:
test_data=pd.read_csv(r"D:\Full stack Data  Science by INEURON\Projects\Store sales prediction\Test.csv")
test_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3


In [13]:
test_data["Outlet_Identifier"].value_counts()

OUT027    624
OUT013    621
OUT049    620
OUT046    620
OUT035    620
OUT045    619
OUT018    618
OUT017    617
OUT010    370
OUT019    352
Name: Outlet_Identifier, dtype: int64