### LabelEncoder and OneHotEncoder


One-hot encoding is a technique for converting a categorical value into a one-dimensional numerical vector.

The vector has one element per category. Exactly one element is 1 and all the others are 0.

The 1 is called "hot" and the 0s are called "cold", which is where the name one-hot encoding comes from.

#### X = [Dog, Cat, Bird]

After one hot encoding each element of our vector X, we end up with the following:

Dog = `[1 0 0]`

Cat = `[0 1 0]`

Bird = `[0 0 1]`
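The mapping above can be reproduced with a few lines of plain Python. This is only a minimal sketch to make the idea concrete: the helper name `one_hot` and the fixed category order are assumptions for illustration, not part of any library.

```python
# Category order decides which position becomes "hot" for each value.
categories = ["Dog", "Cat", "Bird"]
position = {name: i for i, name in enumerate(categories)}

def one_hot(value):
    """Return a list with a single 1 at the position of `value` (hypothetical helper)."""
    vector = [0] * len(categories)
    vector[position[value]] = 1
    return vector

encoded = {name: one_hot(name) for name in categories}
print(encoded)
```

Each vector has as many elements as there are categories, so a feature with many distinct values produces correspondingly wide vectors.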

#### Why one hot encode categorical variables?

Most machine learning algorithms can only work with numerical values, so categorical features have to be encoded as numbers before a model can use them.

### One Hot Encoding with Pandas

#### Data Set
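The commented-out cell below reads the chronic kidney disease data set from the UCI Machine Learning Repository, which is used to predict whether a patient has kidney disease from various blood indicators. The cells that follow work with the `tips` data set instead, which seaborn loads as a pandas DataFrame.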

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns
#df = pd.read_csv('chronic_kidney_disease.csv', header=None, 
# names=['age','bp','sg','al','su','rbc','pc','pcc','ba','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc','htn','dm','cad','appet','pe','ane','class'])
# head of df
#df.head(10)

In [14]:
tips = sns.load_dataset("tips")
tips.head()

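|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |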


In [15]:
tips.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

The data has several categorical features: `sex` (Male or Female), `smoker` (Yes or No), `day` (Thur, Fri, Sat, Sun) and `time` (Lunch or Dinner).

In [16]:
# Categorical boolean mask
categorical_feature_mask = tips.dtypes=='category'  #for Qualitative data dtype can be of type "object" also
categorical_feature_mask

total_bill    False
tip           False
sex            True
smoker         True
day            True
time           True
size          False
dtype: bool

In [17]:
# filter categorical columns using mask and turn it into a list
categorical_cols = tips.columns[categorical_feature_mask].tolist()
categorical_cols

['sex', 'smoker', 'day', 'time']
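As the comment on the mask notes, qualitative columns may also arrive with dtype `object`. One way to catch both cases in a single step is pandas' `select_dtypes`. The sketch below uses a tiny hand-built frame (the name `frame` and its layout are assumptions for illustration) rather than the full data set.

```python
import pandas as pd

# Small stand-in frame: one category column, one object column, two numeric columns.
frame = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "sex": pd.Categorical(["Female", "Male"]),
    "smoker": pd.Series(["No", "No"], dtype=object),
    "size": [2, 3],
})

# Select columns whose dtype is either category or object, keeping column order.
categorical_cols = frame.select_dtypes(include=["category", "object"]).columns.tolist()
print(categorical_cols)
```

This gives the same kind of list as the boolean mask, without having to compare dtypes by hand.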

`LabelEncoder` converts each class of the specified feature to an integer.

In [18]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

In [19]:
# apply le on categorical feature columns
tips[categorical_cols] = tips[categorical_cols].apply(lambda col: le.fit_transform(col))
tips[categorical_cols].head()

|   | sex | smoker | day | time |
|---|---|---|---|---|
| 0 | 0 | 0 | 2 | 0 |
| 1 | 1 | 0 | 2 | 0 |
| 2 | 1 | 0 | 2 | 0 |
| 3 | 1 | 0 | 2 | 0 |
| 4 | 0 | 0 | 2 | 0 |


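Of the categorical columns, `sex`, `smoker` and `time` are binary, so they are encoded as 0 and 1.

The `day` column is multi-class, so `LabelEncoder` assigns a different integer to each class (`Sun` becomes 2 in the output above). Those integers suggest an ordering between the days that does not really exist, which is the reason to one-hot encode the columns next.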
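To see which integer each class receives, the fitted encoder can be inspected through its `classes_` attribute. The following is a minimal sketch on a short hand-written list of days (the list and the name `le_day` are assumptions for illustration). `LabelEncoder` sorts the classes, so the codes follow alphabetical order rather than calendar order.

```python
from sklearn.preprocessing import LabelEncoder

# A short sample containing all four day values that occur in the tips data.
days = ["Sun", "Sat", "Thur", "Fri", "Sun"]

le_day = LabelEncoder()
codes = le_day.fit_transform(days)

# classes_ holds the sorted class labels; the index of a label is its integer code.
mapping = {str(label): code for code, label in enumerate(le_day.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = le_day.inverse_transform(codes).tolist()
```

Using a separate encoder per column keeps each mapping available for `inverse_transform`, whereas reusing one encoder across columns only retains the mapping of the last column it was fitted on.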

### One-Hot Encoding

In [20]:
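# import OneHotEncoder and ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument is no longer available in recent scikit-learn
# releases, so ColumnTransformer is used to pick the columns to encode.
ohe = ColumnTransformer(
    [('ohe', OneHotEncoder(), categorical_cols)],  # one-hot encode the categorical columns
    remainder='passthrough',  # keep the numerical columns, placed after the encoded ones
    sparse_threshold=0)       # always output a dense array, not a sparse matrix

# apply the encoder to the categorical feature columns
one_hot_enc = ohe.fit_transform(tips)  # returns a NumPy array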

In [21]:
tips.head()

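|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.5 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |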


In [10]:
pd.DataFrame(one_hot_enc)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,16.99,1.01,2.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,10.34,1.66,3.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,21.01,3.50,3.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,23.68,3.31,2.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,24.59,3.61,4.0
5,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,25.29,4.71,4.0
6,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,8.77,2.00,2.0
7,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,26.88,3.12,4.0
8,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,15.04,1.96,2.0
9,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,14.78,3.23,2.0


In [11]:
one_hot_enc.shape

(244, 13)

In [12]:
tips.shape

(244, 7)
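The growth from 7 to 13 columns follows from the class counts: `sex`, `smoker` and `time` contribute two indicator columns each and `day` contributes four, which gives ten indicator columns plus the three numerical columns. The sketch below checks the same arithmetic with `pd.get_dummies`, a pandas alternative to the scikit-learn encoder. The rows in `toy` are made-up values for illustration only, chosen so that every class appears at least once.

```python
import pandas as pd

# Made-up rows (not taken from the real data set) covering every class once or more.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "sex": ["Female", "Male", "Male", "Female"],
    "smoker": ["No", "Yes", "No", "Yes"],
    "day": ["Thur", "Fri", "Sat", "Sun"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "size": [2, 3, 2, 4],
})
categorical_cols = ["sex", "smoker", "day", "time"]

# get_dummies replaces each listed column with one indicator column per class.
encoded = pd.get_dummies(toy, columns=categorical_cols)

# Number of indicator columns = total columns minus the untouched numerical ones.
n_indicator_cols = encoded.shape[1] - (toy.shape[1] - len(categorical_cols))
print(encoded.shape, n_indicator_cols)
```

Unlike the NumPy array returned by the scikit-learn encoder, the result here is a DataFrame with named columns such as `day_Sun`, and the numerical columns come first rather than last.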