# Variable Encoding
  
**Why do we need encoding?**  
Consider a dataset that has some categorical features.
 - As far as we know, scikit-learn cannot work with string data directly, so encoding is necessary
 - In this module, we will first look at One Hot Encoding and its disadvantages (in some cases)
 - Then we will look at Label Encoder
 - Finally, we will see how to combine sparse classes to reduce the number of unique classes in a feature without increasing the dimensionality of the DataFrame

In [1]:
#importing libraries
import pandas as pd
import numpy as np

In [2]:
#reading data
bigmart = pd.read_csv('train_bm.csv')

In [3]:
bigmart.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
bigmart.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

## Encoding a single variable

In [5]:
bigmart['Outlet_Type'].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [6]:
pd.get_dummies(bigmart['Outlet_Type']).head()

Unnamed: 0,Grocery Store,Supermarket Type1,Supermarket Type2,Supermarket Type3
0,0,1,0,0
1,0,0,1,0
2,0,1,0,0
3,1,0,0,0
4,0,1,0,0


## One Hot Encoding for all variables

In [7]:
bigmart_encoded = pd.get_dummies(bigmart)
bigmart_encoded.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,Item_Identifier_DRB01,Item_Identifier_DRB13,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,1999,3735.138,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,2009,443.4228,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
2,17.5,0.01676,141.618,1999,2097.27,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,19.2,0.0,182.095,1998,732.38,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
4,8.93,0.0,53.8614,1987,994.7052,0,0,0,0,0,...,1,0,0,0,0,1,0,1,0,0


### We encountered two problems:

**Problem 1**

Sometimes the order of the classes in a feature is important.  
e.g. in the Outlet_Size column, the classes Small, Medium and High have a natural order that can be valuable information,  
and one hot encoding destroys this order.

In [22]:
bigmart_encoded[['Outlet_Size_Small','Outlet_Size_Medium','Outlet_Size_High']].head()

Unnamed: 0,Outlet_Size_Small,Outlet_Size_Medium,Outlet_Size_High
0,0,1,0
1,0,1,0
2,0,1,0
3,0,0,0
4,0,0,1


**Problem 2**

The number of features has increased from 12 to 1605, and most cells contain 0.

In [24]:
bigmart.shape, bigmart_encoded.shape

((8523, 12), (8523, 1605))
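
One way to limit this growth is to one hot encode only selected low-cardinality columns instead of the whole DataFrame. The following is a minimal sketch on a made-up toy DataFrame (the column names `item_id`, `outlet_type` and `price` are assumptions for illustration, not columns from the BigMart data), using the `columns` argument of `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy data: one high-cardinality ID column, one low-cardinality column
toy = pd.DataFrame({
    'item_id': ['A1', 'B2', 'C3', 'D4'],
    'outlet_type': ['Grocery', 'Super', 'Super', 'Grocery'],
    'price': [10.0, 12.5, 9.0, 11.0],
})

# Encoding everything creates one new column per unique string value
all_encoded = pd.get_dummies(toy)

# Encoding only the chosen column leaves item_id untouched
selected = pd.get_dummies(toy, columns=['outlet_type'])

print(all_encoded.shape, selected.shape)
```

Here the ID column alone contributes one dummy column per row when everything is encoded, which is the same effect that Item_Identifier has on the full dataset.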

## Problem 1 Solution:

In [25]:
from sklearn.preprocessing import LabelEncoder

In [26]:
le = LabelEncoder()

In [27]:
bigmart['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [28]:
le.fit_transform(['Small', 'Medium', 'High'])

array([2, 1, 0], dtype=int64)

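LabelEncoder assigns codes in alphabetical order of the class names (High = 0, Medium = 1, Small = 2), which does not match the natural order Small < Medium < High. So we map the classes to integers manually instead.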

In [29]:
bigmart['Outlet_Size'] = bigmart['Outlet_Size'].map({'Small': 0,
                                                     'Medium': 1,
                                                     'High': 2})

In [30]:
bigmart['Outlet_Size'].head()

0    1.0
1    1.0
2    1.0
3    NaN
4    2.0
Name: Outlet_Size, dtype: float64
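
As an alternative to a hand-written dictionary, the same ordering can be expressed with an ordered pandas categorical. This is only a sketch on a small made-up Series (the variable names `sizes` and `size_order` are assumptions), not the approach used in this notebook. Note that missing values receive the code -1 here, whereas `map` leaves them as NaN.

```python
import pandas as pd

# Hypothetical sample of outlet sizes, including one missing value
sizes = pd.Series(['Medium', 'Medium', None, 'High', 'Small'])

# Explicit order from smallest to largest
size_order = ['Small', 'Medium', 'High']

# An ordered categorical assigns integer codes following size_order
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)
codes = pd.Series(ordered.codes, index=sizes.index)

print(codes.tolist())
```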

## Problem 2 Solution:

In [31]:
bigmart.nunique()

Item_Identifier              1559
Item_Weight                   415
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     3
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
dtype: int64

In [32]:
bigmart['Item_Identifier'].value_counts()

FDG33    10
FDW13    10
NCQ06     9
FDX31     9
NCF42     9
         ..
FDT35     1
FDN52     1
FDK57     1
DRF48     1
FDE52     1
Name: Item_Identifier, Length: 1559, dtype: int64

In [33]:
temp = bigmart['Item_Identifier'].value_counts()
temp.head()

FDG33    10
FDW13    10
NCQ06     9
FDX31     9
NCF42     9
Name: Item_Identifier, dtype: int64

Here temp is a pandas Series that stores the value count of each class in the Item_Identifier feature.  
To reduce the number of unique classes in Item_Identifier, we will convert the classes whose value count is less than 4 to an 'other' category.

In [39]:
bigmart['Item_identifier_count'] = bigmart['Item_Identifier'].map(temp)
bigmart[['Item_Identifier','Item_identifier_count']].head()

Unnamed: 0,Item_Identifier,Item_identifier_count
0,FDA15,8
1,DRC01,6
2,FDN15,7
3,FDX07,6
4,NCD19,6


In [40]:
# Use a boolean mask with .loc instead of chained indexing in a loop;
# this avoids the SettingWithCopyWarning and updates all matching rows at once
bigmart.loc[bigmart['Item_identifier_count'] < 4, 'Item_Identifier'] = 'other'


In [41]:
bigmart.head(10)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Item_identifier_count
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,1.0,Tier 1,Supermarket Type1,3735.138,8
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,1.0,Tier 3,Supermarket Type2,443.4228,6
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,1.0,Tier 1,Supermarket Type1,2097.27,7
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38,6
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,2.0,Tier 3,Supermarket Type1,994.7052,6
5,other,10.395,Regular,0.0,Baking Goods,51.4008,OUT018,2009,1.0,Tier 3,Supermarket Type2,556.6088,3
6,FDO10,13.65,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,2.0,Tier 3,Supermarket Type1,343.5528,8
7,FDP10,,Low Fat,0.12747,Snack Foods,107.7622,OUT027,1985,1.0,Tier 3,Supermarket Type3,4022.7636,7
8,FDH17,16.2,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986,7
9,FDU28,19.2,Regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.535,5


In [43]:
bigmart.nunique()

Item_Identifier              1403
Item_Weight                   415
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     3
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
Item_identifier_count          10
dtype: int64

In [44]:
bigmart['Item_Identifier'].value_counts()

other    418
FDG33     10
FDW13     10
FDF52      9
NCI54      9
        ... 
NCT05      4
FDF14      4
NCU53      4
NCU41      4
FDJ07      4
Name: Item_Identifier, Length: 1403, dtype: int64

**Sparse classes from the Item_Identifier column are successfully combined into a single 'other' class, reducing the number of unique classes from 1559 to 1403**
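
The steps above can also be wrapped in a small reusable helper. The function below is a hypothetical sketch (the name `combine_sparse_classes` and its parameters are assumptions, not part of pandas), shown on a made-up Series rather than the BigMart data:

```python
import pandas as pd

def combine_sparse_classes(series, min_count, other_label='other'):
    """Replace classes that occur fewer than min_count times with other_label."""
    # Look up, for every row, how often its class occurs in the whole Series
    counts = series.map(series.value_counts())
    # Keep values whose class is frequent enough, replace the rest
    return series.where(counts >= min_count, other_label)

# Hypothetical example: 'C' occurs only once and is merged into 'other'
items = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
combined = combine_sparse_classes(items, min_count=2)

print(combined.tolist())
```

With this sketch, missing values would also be replaced by `other_label`, because their count lookup yields NaN and fails the comparison, so they may need to be handled separately beforehand.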