# BIGMART_SALES DATASET
## In this notebook, we will look at preprocessing techniques for categorical variables. The need arises because libraries like scikit-learn expect numeric input and do not work with categorical values directly.

![](download.png)

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [17]:
df=pd.read_csv('train_bm.csv')
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [18]:
df.shape

(8523, 12)

In [19]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

*The features with "object" as their dtype are the categorical variables, and these are the ones we need to work on.*
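As a minimal sketch of how the categorical columns could be picked out programmatically, the snippet below builds a small toy frame (the values are copied from the first two rows above; the name `toy` is hypothetical) and keeps every non-numeric column. It assumes that every non-numeric column is categorical, which holds for this dataset but not for frames that contain dates or free text.

```python
import pandas as pd

# Toy frame with two numeric and two categorical columns (illustrative only).
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92],
    "Item_Fat_Content": ["Low Fat", "Regular"],
    "Outlet_Establishment_Year": [1999, 2009],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2"],
})

# Excluding numeric dtypes leaves the string-valued (categorical) columns.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

On the full dataset the same call would be `df.select_dtypes(exclude="number")`.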

### Outlet_Type

In [20]:
df['Outlet_Type'].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

*Here we count how many rows fall under each type of Outlet_Type.*

In [21]:
pd.get_dummies(df['Outlet_Type']).head()

| | Grocery Store | Supermarket Type1 | Supermarket Type2 | Supermarket Type3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |


*We have used one-hot encoding to convert the categorical variable into numeric indicator (0/1) columns, one per category.*

In [22]:
df_encoded=pd.get_dummies(df)
df_encoded.head()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales | Item_Identifier_DRA12 | Item_Identifier_DRA24 | Item_Identifier_DRA59 | Item_Identifier_DRB01 | Item_Identifier_DRB13 | ... | Outlet_Size_High | Outlet_Size_Medium | Outlet_Size_Small | Outlet_Location_Type_Tier 1 | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 | Outlet_Type_Grocery Store | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3 | 0.016047 | 249.8092 | 1999 | 3735.138 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 5.92 | 0.019278 | 48.2692 | 2009 | 443.4228 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 17.5 | 0.01676 | 141.618 | 1999 | 2097.27 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 19.2 | 0.0 | 182.095 | 1998 | 732.38 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 8.93 | 0.0 | 53.8614 | 1987 | 994.7052 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |


*In this way, we have converted all our categorical variables to numeric indicator variables.*

*When given a whole DataFrame, `pd.get_dummies` automatically selects the categorical columns and performs one-hot encoding on them; numeric columns pass through unchanged.*
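The behaviour described above can be sketched on a small toy frame (the name `toy` and its values are illustrative, not taken from the full dataset). The sketch also shows the `columns` parameter, which restricts encoding to chosen columns, and passes `dtype=int` so the output uses 0/1 as in the tables above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame: one numeric column and two categorical columns (illustrative only).
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Size": ["Medium", "Medium", "High"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2", "Grocery Store"],
})

# Encode every categorical column; Item_MRP is left untouched.
encoded_all = pd.get_dummies(toy, dtype=int)

# Encode only Outlet_Size; Outlet_Type stays as a plain string column.
encoded_one = pd.get_dummies(toy, columns=["Outlet_Size"], dtype=int)

print(encoded_all.columns.tolist())
print(encoded_one.columns.tolist())
```

Restricting the encoding to selected columns is one way to keep a high-cardinality column such as Item_Identifier out of the one-hot step.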

## Two problems arise from one-hot encoding

**Problem 1**

*Look at the newly created variables Outlet_Size_High, Outlet_Size_Medium and Outlet_Size_Small. The natural order between these categories (Small < Medium < High) is destroyed, so we lose important information.*

In [23]:
df_encoded[['Outlet_Size_High','Outlet_Size_Medium','Outlet_Size_Small']].head()

| | Outlet_Size_High | Outlet_Size_Medium | Outlet_Size_Small |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 |


**Problem 2**

*The number of columns in our dataset has increased from 12 to 1605, and most of the values in the new columns are 0.*

In [24]:
df_encoded.shape

(8523, 1605)
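The width of a one-hot encoded frame can be estimated before encoding: each numeric column is kept, and each categorical column contributes one indicator column per unique value. For this dataset, using the unique counts that `df.nunique()` reports later in the notebook, that is 5 numeric columns plus 1559 + 5 + 16 + 10 + 3 + 3 + 4 = 1600 indicator columns, or 1605 in total. A minimal sketch of the same arithmetic on a toy frame (names and values are illustrative):

```python
import pandas as pd

# Toy frame: one numeric column and two categorical columns (illustrative only).
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDA15"],
    "Item_MRP": [249.8, 48.3, 141.6, 250.1],
    "Outlet_Size": ["Medium", "Medium", "High", "Small"],
})

categorical = toy.select_dtypes(exclude="number")
numeric_cols = toy.shape[1] - categorical.shape[1]

# One indicator column per unique value of each categorical column.
expected_width = numeric_cols + int(categorical.nunique().sum())
actual_width = pd.get_dummies(toy).shape[1]

print(expected_width, actual_width)
```

Checking this number first makes it easy to spot which column, here Item_Identifier, is responsible for most of the growth.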

**Solution 1**
*Instead of one-hot encoding, we will use label encoding, which does not create new columns but replaces each category with an integer code in the same column.*

In [25]:
from sklearn.preprocessing import LabelEncoder

In [26]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [27]:
le=LabelEncoder()
le.fit_transform(['Small','Medium','High'])

array([2, 1, 0], dtype=int64)

*LabelEncoder assigns values in alphabetical order (High=0, Medium=1, Small=2), which does not match the natural order of the sizes. To set the order ourselves, we map the values explicitly:*

In [28]:
df['Outlet_Size']=df['Outlet_Size'].map({'Small':0,'Medium':1,'High':2})
df['Outlet_Size'].head()


0    1.0
1    1.0
2    1.0
3    NaN
4    2.0
Name: Outlet_Size, dtype: float64

*So that is how we deal with the first problem*
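Note that the explicit mapping leaves missing sizes as NaN (row 3 above), so they still need separate handling. As an alternative sketch, not the approach used in this notebook, an ordered pandas `Categorical` encodes the same Small < Medium < High order and marks missing values with the code -1. The series `sizes` below is illustrative toy data.

```python
import pandas as pd

# Toy column with one missing value (illustrative only).
sizes = pd.Series(["Medium", "Medium", None, "High", "Small"])

# Declare the category order explicitly instead of relying on alphabetical order.
ordered = pd.Categorical(
    sizes, categories=["Small", "Medium", "High"], ordered=True
)

# Integer codes follow the declared order; missing values get the code -1.
codes = pd.Series(ordered.codes)
print(codes.tolist())
```

Whether -1 or NaN is the more convenient marker depends on how missing values are imputed afterwards.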

**Solution 2**

In [30]:
df.nunique()

Item_Identifier              1559
Item_Weight                   415
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     3
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
dtype: int64

*Out of all the categorical variables, Item_Identifier has the maximum number of unique values.*

In [34]:
temp=df['Item_Identifier'].value_counts()
temp.head()

FDG33    10
FDW13    10
FDG09     9
FDX04     9
NCQ06     9
Name: Item_Identifier, dtype: int64

In [37]:
df['Item_Identifier_Count'] = df['Item_Identifier'].map(temp)  # look up each item's frequency
df[['Item_Identifier','Item_Identifier_Count']].head()

| | Item_Identifier | Item_Identifier_Count |
|---|---|---|
| 0 | FDA15 | 8 |
| 1 | DRC01 | 6 |
| 2 | FDN15 | 7 |
| 3 | FDX07 | 6 |
| 4 | NCD19 | 6 |


In [38]:
# Replace counts below 4 with 'other' in one vectorized step. Series.where returns
# a new Series, so there is no chained assignment and no SettingWithCopyWarning.
counts = df['Item_Identifier_Count']
df['Item_Identifier_Count'] = counts.where(counts >= 4, 'other')


In [42]:
df['Item_Identifier_Count'].value_counts()

6        2298
5        1975
7        1771
4         936
8         880
other     418
9         225
10         20
Name: Item_Identifier_Count, dtype: int64

*What we just applied is called combining sparse classes, i.e. the classes that did not have a significant number of values (here, identifiers that appear fewer than 4 times) have all been grouped into a single 'other' class.*
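The same idea can be wrapped in a small reusable helper. This is a sketch of a variant, not the exact steps above: it relabels the rare categories themselves rather than a count column. The function name `combine_sparse_classes`, its parameters, and the toy series are all hypothetical.

```python
import pandas as pd

def combine_sparse_classes(series, min_count, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else.
    return series.where(~series.isin(rare), other_label)

# Toy data: "C" and "D" each appear once, so they are merged.
items = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
grouped = combine_sparse_classes(items, min_count=2)
print(grouped.tolist())
```

The threshold is a judgment call; a value of 4 was used above, and a different cutoff may suit other columns better.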