# Prediction of sales

### Problem Statement
[The dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) represents sales data for 1559 products across 10 stores in different cities. Attributes of each product and store are also available. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. These missing values will need to be treated accordingly.
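
A quick way to see which columns are affected is to count the missing entries per column. The sketch below uses a small made-up frame: the values are illustrative and only the column names come from the table above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Count missing entries per column and keep only the affected columns.
missing_counts = toy.isna().sum()
affected = missing_counts[missing_counts > 0]
print(affected)
```

Columns that do not appear in `affected` have no missing entries and need no imputation.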



### In the following weeks, we will explore the problem in the following stages:

1. **Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome**
2. **Data Exploration – looking at categorical & continuous feature summaries and making inferences about the data**
3. **Data Cleaning – imputing missing values in the data and checking for outliers**
4. **Feature Engineering – modifying existing variables and/or creating new ones for analysis**
5. **Model Building – making predictive models on the data**
---------

In [48]:
import pandas as pd
import numpy as np
import datetime

import matplotlib.pyplot as plt
import seaborn as sns

# from pandas_profiling import ProfileReport

#Read files:
data = pd.read_csv("data/regression_exercise.csv", delimiter=',')

# prof = ProfileReport(data)
# prof.to_file(output_file='output.html')

In [49]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [50]:
# Fill in missing values for Item_Weight (column mean) and Outlet_Size ("Empty" placeholder)
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
data["Outlet_Size"] = data["Outlet_Size"].fillna("Empty")
data.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

## 4. Feature Engineering

1. Resolve the issues in the data to make it ready for analysis.
2. Create some new variables using the existing ones.





### Create a broad category of Type of Item

The `Item_Type` variable has many categories, which might prove very useful in analysis. The `Item_Identifier`, i.e. the unique ID of each item, starts with either FD, DR or NC. Compared against the item categories, these prefixes appear to stand for Food, Drinks and Non-Consumables.

**Task:** Use the `Item_Identifier` variable to create a new column.

In [51]:
data[["Item_Identifier", 'Item_Type']]

Unnamed: 0,Item_Identifier,Item_Type
0,FDA15,Dairy
1,DRC01,Soft Drinks
2,FDN15,Meat
3,FDX07,Fruits and Vegetables
4,NCD19,Household
...,...,...
8518,FDF22,Snack Foods
8519,FDS36,Baking Goods
8520,NCJ29,Health and Hygiene
8521,FDN46,Snack Foods


In [52]:
data['Broad_Item_Type'] = data['Item_Identifier'].astype(str).str[0:2]
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Empty,Tier 3,Grocery Store,732.38,FD
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,NC
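
If readable labels are preferred over the two-letter codes, the prefixes can be mapped to names. This is a minimal sketch on a few identifiers from the rows above. The dictionary `prefix_names` is an assumption based on the interpretation given earlier (Food, Drinks, Non-Consumables), not something documented with the dataset.

```python
import pandas as pd

# A few identifiers taken from the rows shown above.
ids = pd.Series(["FDA15", "DRC01", "NCD19"], name="Item_Identifier")

# Assumed meaning of the prefixes (inferred from the item types, not documented).
prefix_names = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

# Take the first two characters and translate them via the mapping.
broad_type = ids.str[:2].map(prefix_names)
print(broad_type.tolist())
```

Any prefix missing from the mapping would come out as `NaN`, which makes unexpected codes easy to spot.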


### Determine the years of operation of a store

**Task:** Make a new column depicting the years of operation of a store (i.e. how long the store has existed).

In [53]:
data.Outlet_Establishment_Year.dtypes
data.Outlet_Establishment_Year.head()
current_year = datetime.datetime.now().year
current_year

2021

In [54]:
data['Years_Opened'] = current_year - data.Outlet_Establishment_Year
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_Opened
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD,22
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR,12
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD,22
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Empty,Tier 3,Grocery Store,732.38,FD,23
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,NC,34
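
Because `current_year` is read from the system clock, `Years_Opened` changes depending on when the notebook is run. One possible alternative, sketched below on a toy frame, is to pin a reference year. The constant `REFERENCE_YEAR` is a hypothetical name, set here to the year shown in the output above.

```python
import pandas as pd

# Hypothetical constant: pinned to the year shown in the output above.
REFERENCE_YEAR = 2021

# Establishment years taken from the first rows of the dataset.
toy = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2009, 1987]})

# The store age is now reproducible regardless of the run date.
toy["Years_Opened"] = REFERENCE_YEAR - toy["Outlet_Establishment_Year"]
print(toy)
```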


### Modify categories of Item_Fat_Content

**Task:** The categories of the `Item_Fat_Content` variable are represented inconsistently. This should be corrected.

In [55]:
data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [56]:
data[data['Item_Fat_Content'] == 'low fat'].Item_Fat_Content 

27      low fat
74      low fat
82      low fat
108     low fat
111     low fat
         ...   
7925    low fat
8068    low fat
8295    low fat
8380    low fat
8404    low fat
Name: Item_Fat_Content, Length: 112, dtype: object

In [57]:
data = data.replace({'Item_Fat_Content' : {'low fat' : 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular' }})
data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)

**Task:** Some items are non-consumables, and a fat content should not be specified for them. Create a separate category for these observations.

In [61]:
# Displays a modified copy only; the result is not assigned back, so `data` is unchanged.
data[data['Broad_Item_Type'] == 'NC'].replace({'Item_Fat_Content' : {'Low Fat' : 'Not Applic'}})

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_Opened
4,NCD19,8.930000,Not Applic,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,NC,34
16,NCB42,11.800000,Not Applic,0.008596,Health and Hygiene,115.3492,OUT018,2009,Medium,Tier 3,Supermarket Type2,1621.8888,NC,12
22,NCB30,14.600000,Not Applic,0.025698,Household,196.5084,OUT035,2004,Small,Tier 2,Supermarket Type1,1587.2672,NC,17
25,NCD06,13.000000,Not Applic,0.099887,Household,45.9060,OUT017,2007,Empty,Tier 2,Supermarket Type1,838.9080,NC,14
31,NCS17,18.600000,Not Applic,0.080829,Health and Hygiene,96.4436,OUT018,2009,Medium,Tier 3,Supermarket Type2,2741.7644,NC,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8500,NCQ42,20.350000,Not Applic,0.000000,Household,125.1678,OUT017,2007,Empty,Tier 2,Supermarket Type1,1907.5170,NC,14
8502,NCH43,8.420000,Not Applic,0.070712,Household,216.4192,OUT045,2002,Empty,Tier 2,Supermarket Type1,3020.0688,NC,19
8504,NCN18,12.857645,Not Applic,0.124111,Household,111.7544,OUT027,1985,Medium,Tier 3,Supermarket Type3,4138.6128,NC,36
8516,NCJ19,18.600000,Not Applic,0.118661,Others,58.7588,OUT018,2009,Medium,Tier 3,Supermarket Type2,858.8820,NC,12


In [62]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_Opened
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD,22
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR,12
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD,22
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Empty,Tier 3,Grocery Store,732.38,FD,23
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,NC,34
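
In the `data.head()` output above, row 4 (NCD19) still shows `Low Fat`: filtering and then calling `replace` returns a new frame and leaves `data` untouched. A minimal sketch of one way to write the new category into the frame itself, using a boolean mask with `.loc` on a toy frame:

```python
import pandas as pd

# Toy frame built from rows shown above.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat"],
    "Broad_Item_Type": ["FD", "DR", "NC"],
})

# Boolean mask selecting the non-consumable rows.
is_nc = toy["Broad_Item_Type"] == "NC"

# Assign through .loc so the change is written into `toy` itself.
toy.loc[is_nc, "Item_Fat_Content"] = "Not Applic"
print(toy["Item_Fat_Content"].tolist())
```

Unlike the `replace` call, this sets the label for every non-consumable row regardless of its previous value. Applying it to `data` would also add an extra fat-content category to any later one-hot encoding.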


### Numerical and One-Hot Encoding of Categorical variables

Since most scikit-learn algorithms accept only numerical variables, we need to **convert all categorical variables into numeric types.**

- If the variable is ordinal, we can simply map its values to numbers.
- If the variable is nominal (its values have no natural order), we need to one-hot encode it, i.e. create dummy variables.
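
The two cases can be sketched on toy data as follows. The ordering in `size_order` and the use of 0 as an "unknown" code for the `Empty` placeholder are assumptions made for illustration. The notebook itself one-hot encodes `Outlet_Size`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "High", "Small", "Empty"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 3", "Tier 1"],
})

# Ordinal: map each level to a number that respects the assumed order.
size_order = {"Small": 1, "Medium": 2, "High": 3}
# Values outside the mapping (here "Empty") become NaN; 0 is used as an
# arbitrary "unknown" code.
size_code = toy["Outlet_Size"].map(size_order).fillna(0).astype(int)

# Nominal: one dummy column per category.
location_dummies = pd.get_dummies(toy[["Outlet_Location_Type"]])

print(size_code.tolist())
print(location_dummies.columns.tolist())
```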

In [59]:
data.select_dtypes(include=['object'])

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type,Broad_Item_Type
0,FDA15,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1,FD
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2,DR
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1,FD
3,FDX07,Regular,Fruits and Vegetables,OUT010,Empty,Tier 3,Grocery Store,FD
4,NCD19,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1,NC
...,...,...,...,...,...,...,...,...
8518,FDF22,Low Fat,Snack Foods,OUT013,High,Tier 3,Supermarket Type1,FD
8519,FDS36,Regular,Baking Goods,OUT045,Empty,Tier 2,Supermarket Type1,FD
8520,NCJ29,Low Fat,Health and Hygiene,OUT035,Small,Tier 2,Supermarket Type1,NC
8521,FDN46,Regular,Snack Foods,OUT018,Medium,Tier 3,Supermarket Type2,FD


In [69]:
cat_feats = ['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Broad_Item_Type']
# cat_feats = data[].index.tolist()
df_dummy = pd.get_dummies(data[cat_feats])
df_dummy.head()

Unnamed: 0,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,1,0,0,0,0,0,1,0,0,0,...,1,0,0,0,1,0,0,0,1,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,1,0,0
2,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,0,1,0,0,0,0,0,0,1,0,...,0,0,1,1,0,0,0,0,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,1


In [None]:
# TODO: decide whether to drop the identifier columns (Item_Identifier, Outlet_Identifier)

**By now, all variables should be numeric.**

---------
### Exporting Data

**Task:** You can save the processed data to your local machine as a csv file.

In [72]:
processed_data = pd.concat([data, df_dummy], axis=1)
processed_data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,...,1,0,0,0,1,0,0,0,1,0
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,...,0,0,1,0,0,1,0,1,0,0
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,...,1,0,0,0,1,0,0,0,1,0
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Empty,Tier 3,...,0,0,1,1,0,0,0,0,1,0
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,...,0,0,1,0,1,0,0,0,0,1


In [74]:
processed_data.dtypes

Item_Identifier                     object
Item_Weight                        float64
Item_Fat_Content                    object
Item_Visibility                    float64
Item_Type                           object
Item_MRP                           float64
Outlet_Identifier                   object
Outlet_Establishment_Year            int64
Outlet_Size                         object
Outlet_Location_Type                object
Outlet_Type                         object
Item_Outlet_Sales                  float64
Broad_Item_Type                     object
Years_Opened                         int64
Item_Fat_Content_Low Fat             uint8
Item_Fat_Content_Regular             uint8
Item_Type_Baking Goods               uint8
Item_Type_Breads                     uint8
Item_Type_Breakfast                  uint8
Item_Type_Canned                     uint8
Item_Type_Dairy                      uint8
Item_Type_Frozen Foods               uint8
Item_Type_Fruits and Vegetables      uint8
Item_Type_H

In [75]:
numerics = ['uint8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']

only_numeric_data = processed_data.select_dtypes(include=numerics)
only_numeric_data.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Years_Opened,Item_Fat_Content_Low Fat,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,9.3,0.016047,249.8092,1999,3735.138,22,1,0,0,0,...,1,0,0,0,1,0,0,0,1,0
1,5.92,0.019278,48.2692,2009,443.4228,12,0,1,0,0,...,0,0,1,0,0,1,0,1,0,0
2,17.5,0.01676,141.618,1999,2097.27,22,1,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,19.2,0.0,182.095,1998,732.38,23,0,1,0,0,...,0,0,1,1,0,0,0,0,1,0
4,8.93,0.0,53.8614,1987,994.7052,34,1,0,0,0,...,0,0,1,0,1,0,0,0,0,1
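
The dtype list above works here because the dummy columns are `uint8`. In pandas 2.0 and later, `get_dummies` returns `bool` columns by default, which that list would not match. A sketch of a version-independent variant on toy data, assuming an explicit `dtype` for the dummies and the generic `"number"` selector:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1"],
})

# Request an explicit integer dtype so the result does not depend on the
# default dummy dtype of the installed pandas version.
dummies = pd.get_dummies(toy[["Outlet_Type"]], dtype=np.uint8)
combined = pd.concat([toy, dummies], axis=1)

# "number" matches all integer and float dtypes, so no dtype list is needed.
numeric_only = combined.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```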


In [76]:
only_numeric_data.to_csv('fixed_data.csv')
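
By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below compares the default with `index=False` on a toy frame, writing to in-memory buffers instead of files to stay self-contained.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Item_MRP": [249.8, 48.3], "Years_Opened": [22, 12]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

# Rewind the buffers and read them back to compare the columns.
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```

Whether to keep the index depends on whether the row numbers carry meaning for later steps.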