In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [12]:
# Read Datafile
data = pd.read_csv('tips.csv')
data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Date
0,16.99,1.01,Female,No,Sun,Dinner,2,08/28/2017
1,10.34,1.66,Male,No,Sun,Dinner,3,08/28/2017
2,21.01,3.5,Male,No,Sun,Dinner,3,08/28/2017
3,23.68,3.31,Male,No,Sun,Dinner,2,08/28/2017
4,24.59,3.61,Female,No,Sun,Dinner,4,08/28/2017


In [3]:
# Check the null values in the data
data.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
Date          0
dtype: int64

### Label Encoding or Ordinal Encoding
We use this categorical data encoding technique when the categorical feature is ordinal. In this case, retaining the order is important, so the encoding should reflect the sequence.     
In label encoding, each label is converted into an integer value. We will create a variable that contains categories representing a person's education qualification.

In [4]:
import category_encoders as ce
df=pd.DataFrame({'Degree':['High school', 'Masters', 'Diploma', 'Bachelors', 'Bachelors', 'Masters', 'Phd', 
                           'High school', 'High school']})

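# Create the ordinal encoder with an explicit category-to-integer mapping.
# Note: the mapping key 'phd' does not match the value 'Phd' in the data (keys are
# case-sensitive), so that row is treated as unknown and encoded as -1 below.
# Use 'Phd' as the key to encode it as 5.
encoder= ce.OrdinalEncoder(cols=['Degree'], return_df=True, mapping=[{'col':'Degree', 'mapping':
                                                                      {'None':0, 'High school':1, 'Diploma':2, 
                                                                       'Bachelors':3, 'Masters':4, 'phd':5}}])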

#Original data
df

Unnamed: 0,Degree
0,High school
1,Masters
2,Diploma
3,Bachelors
4,Bachelors
5,Masters
6,Phd
7,High school
8,High school


In [6]:
#fit and transform train data 
df_transformed = encoder.fit_transform(df)
df_transformed

Unnamed: 0,Degree
0,1.0
1,4.0
2,2.0
3,3.0
4,3.0
5,4.0
6,-1.0
7,1.0
8,1.0
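
To make the mechanics explicit, here is a minimal, library-free sketch of ordinal encoding with an explicit mapping. The names `degree_order`, `degrees`, and `encoded` are hypothetical, and using -1 for unmapped labels is an assumption chosen to mirror the -1.0 shown above. It does not reproduce how `category_encoders` works internally.

```python
# Hypothetical mapping from each education level to its rank (order matters).
degree_order = {'None': 0, 'High school': 1, 'Diploma': 2,
                'Bachelors': 3, 'Masters': 4, 'Phd': 5}

degrees = ['High school', 'Masters', 'Diploma', 'Bachelors', 'Bachelors',
           'Masters', 'Phd', 'High school', 'High school']

# Look up each label; labels missing from the mapping fall back to -1
# (an assumed convention for "unknown").
encoded = [degree_order.get(label, -1) for label in degrees]
print(encoded)

# Dictionary keys are case-sensitive, so 'phd' does not match 'Phd'.
print(degree_order.get('phd', -1))
```

Because the keys are case-sensitive, the spelling in the mapping has to match the data exactly for a label to receive its intended rank.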


### One Hot Encoding
We use this categorical data encoding technique when the features are nominal (do not have any order). In one-hot encoding, we create a new variable for each level of a categorical feature. Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 represents the presence of that category.    
These newly created binary features are known as dummy variables. The number of dummy variables depends on the number of levels in the categorical variable. This might sound complicated, so let us take an example. Suppose we have a dataset with a categorical column City containing city names such as Delhi, Mumbai, and Chennai. We now one-hot encode this column.

In [8]:
df1=pd.DataFrame({'City':['Delhi', 'Mumbai', 'Hydrabad', 'Chennai', 'Bangalore', 'Delhi', 'Hydrabad', 'Bangalore', 
                           'Delhi']})

#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols='City', handle_unknown='return_nan', return_df=True, use_cat_names=True)

#Original Data
df1

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hydrabad
3,Chennai
4,Bangalore
5,Delhi
6,Hydrabad
7,Bangalore
8,Delhi


In [9]:
#Fit and transform Data
df1_encoded = encoder.fit_transform(df1)
df1_encoded

Unnamed: 0,City_Delhi,City_Mumbai,City_Hydrabad,City_Chennai,City_Bangalore
0,1.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0
5,1.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0
7,0.0,0.0,0.0,0.0,1.0
8,1.0,0.0,0.0,0.0,0.0


In [18]:
#Create object for one-hot encoding
encoder=ce.OneHotEncoder(cols='day', handle_unknown='return_nan', return_df=True, use_cat_names=True)

data['day']

0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object

In [19]:
data_encoded = encoder.fit_transform(data['day'])
data_encoded

Unnamed: 0,day_Sun,day_Sat,day_Thur,day_Fri
0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0
...,...,...,...,...
239,0.0,1.0,0.0,0.0
240,0.0,1.0,0.0,0.0
241,0.0,1.0,0.0,0.0
242,0.0,1.0,0.0,0.0
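
As a rough illustration of what one-hot encoding does, the sketch below builds the 0/1 columns by hand. The function `one_hot` and the shortened city list are hypothetical. Ordering the columns by first appearance is an assumption that happens to match the column order in the outputs above.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    # dict.fromkeys keeps the order in which each category first appears.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


cities = ['Delhi', 'Mumbai', 'Chennai', 'Delhi']
categories, rows = one_hot(cities)

print(categories)
for city, row in zip(cities, rows):
    print(city, row)
```

Each row contains exactly one 1, in the column that belongs to that row's category.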


### Dummy Encoding
The dummy coding scheme is similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables (also known as dummy variables). One-hot encoding uses N binary variables for N categories in a variable. Dummy encoding is a small improvement over one-hot encoding: it uses N-1 features to represent N labels/categories, with the dropped category represented by a row of all 0s.      
For example, to code 3 categories, one-hot encoding uses 3 variables, whereas dummy encoding uses 2.

In [21]:
df2=pd.DataFrame({'City':['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})

#Original Data
df2

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [23]:
#encode the data
df2_encoded=pd.get_dummies(data=df2,drop_first=True)
df2_encoded

Unnamed: 0,City_Chennai,City_Delhi,City_Hyderabad,City_Mumbai
0,0,1,0,0
1,0,0,0,1
2,0,0,1,0
3,1,0,0,0
4,0,0,0,0
5,0,1,0,0
6,0,0,1,0


In [29]:
data['day']

0       Sun
1       Sun
2       Sun
3       Sun
4       Sun
       ... 
239     Sat
240     Sat
241     Sat
242     Sat
243    Thur
Name: day, Length: 244, dtype: object

In [31]:
data1_encoded=pd.get_dummies(data=data['day'], drop_first=True)
data1_encoded

Unnamed: 0,Sat,Sun,Thur
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
239,1,0,0
240,1,0,0
241,1,0,0
242,1,0,0
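
The N-1 idea can be sketched without any library. The helper `dummy_encode` below is hypothetical. It sorts the categories and drops the first one as the reference, an assumption made to resemble the `drop_first=True` outputs above, where Bangalore and Fri have no column of their own.

```python
def dummy_encode(values):
    """Return (kept_categories, rows) using N-1 columns for N categories."""
    categories = sorted(set(values))
    kept = categories[1:]  # drop the first category; it becomes the all-zero row
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows


cities = ['Delhi', 'Mumbai', 'Bangalore']
kept, rows = dummy_encode(cities)

print(kept)
for city, row in zip(cities, rows):
    print(city, row)
```

Three categories are represented with two columns, and the reference category (Bangalore here) is the row of all 0s.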


### Effect Encoding
This encoding technique is also known as Deviation Encoding or Sum Encoding. Effect encoding is very similar to dummy encoding, with one difference: dummy coding uses 0 and 1 to represent the data, whereas effect encoding uses three values, i.e. 1, 0, and -1.     
The row containing only 0s in dummy encoding is encoded as -1s in effect encoding. In the dummy encoding example, the city Bangalore at index 4 was encoded as 0, 0, 0, 0, whereas in effect encoding it is represented by -1, -1, -1, -1.

In [33]:
df3=pd.DataFrame({'City':['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']}) 

encoder=ce.SumEncoder(cols=['City'], verbose=False)

#Original Data
df3

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [34]:
encoder.fit_transform(df3)

Unnamed: 0,intercept,City_0,City_1,City_2,City_3
0,1,1.0,0.0,0.0,0.0
1,1,0.0,1.0,0.0,0.0
2,1,0.0,0.0,1.0,0.0
3,1,0.0,0.0,0.0,1.0
4,1,-1.0,-1.0,-1.0,-1.0
5,1,1.0,0.0,0.0,0.0
6,1,0.0,0.0,1.0,0.0
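
The same rule can be written out by hand. The sketch below uses a hypothetical helper, `effect_encode`, and assumes that the last category to appear is the reference that receives the row of -1s. It omits the intercept column shown in the output above.

```python
def effect_encode(values):
    """Return (kept_categories, rows) using 1, 0 and -1 (sum/deviation coding)."""
    categories = list(dict.fromkeys(values))  # order of first appearance
    reference = categories[-1]                # assumed reference category
    kept = categories[:-1]
    rows = []
    for value in values:
        if value == reference:
            rows.append([-1] * len(kept))     # reference row is all -1
        else:
            rows.append([1 if value == category else 0 for category in kept])
    return kept, rows


cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
kept, rows = effect_encode(cities)

for city, row in zip(cities, rows):
    print(city, row)
```

With five categories this gives four columns, and Bangalore is the only row made up entirely of -1s.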


### Hash Encoder

To understand hash encoding, it is necessary to know about hashing. Hashing is the transformation of an arbitrary-size input into a fixed-size value. We use hashing algorithms to perform hashing operations, i.e. to generate the hash value of an input. Further, hashing is a one-way process: one cannot recover the original input from the hash representation.     
Hashing has several applications, such as data retrieval, checking for data corruption, and cryptography. Many hash functions are available, for example the Message Digest family (MD2, MD5) and the Secure Hash Algorithm family (SHA-0, SHA-1, SHA-2).      
Just like one-hot encoding, the hash encoder represents categorical features using new dimensions. Here, the user can fix the number of dimensions after transformation using the n_components argument. In other words, a feature with 5 categories can be represented using N new features, and a feature with 100 categories can also be transformed using N new features. Doesn’t this sound amazing?
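
Before turning to the library, the idea can be sketched with the standard library alone. The helper `hash_encode` is hypothetical, and using MD5 with a modulo to pick the column is an assumption made for illustration, not a description of any particular encoder's internals.

```python
import hashlib


def hash_encode(value, n_components=6):
    """Map a category to a fixed-length 0/1 row using a hash of its text."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    index = int(digest, 16) % n_components  # column chosen by the hash value
    row = [0] * n_components
    row[index] = 1
    return row


months = ['January', 'April', 'March', 'April']
rows = [hash_encode(month) for month in months]

for month, row in zip(months, rows):
    print(month, row)
```

The row length depends only on `n_components`, not on how many categories exist. Identical categories always land in the same column, but two different categories can also collide in one column, which is the price of the fixed width.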

In [40]:
import category_encoders as ce

#Create the dataframe
df4=pd.DataFrame({'Month':['January', 'April', 'March', 'April', 'Februay', 'June', 'July', 'June', 'September']})

#Create object for hash encoder
encoder=ce.HashingEncoder(cols='Month', n_components=6)

df4

Unnamed: 0,Month
0,January
1,April
2,March
3,April
4,Februay
5,June
6,July
7,June
8,September


In [41]:
#Fit and Transform Data
encoder.fit_transform(df4)



### Binary Encoding

Binary encoding is a combination of hash encoding and one-hot encoding. In this encoding scheme, the categorical feature is first converted into integers using an ordinal encoder. The integers are then converted into binary numbers, and the binary digits are split into separate columns.      
Binary encoding works really well when there is a large number of categories, for example the cities in a country where a company supplies its products.

In [45]:
#Create the Dataframe
df5=pd.DataFrame({'City':['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad', 'Mumbai', 
                          'Agra']})

#Create object for binary encoding
encoder= ce.BinaryEncoder(cols=['City'],return_df=True)

#Original Data
df5

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Mumbai
8,Agra


In [47]:
data_encoded=encoder.fit_transform(df5) 
data_encoded

Unnamed: 0,City_0,City_1,City_2,City_3
0,0,0,0,1
1,0,0,1,0
2,0,0,1,1
3,0,1,0,0
4,0,1,0,1
5,0,0,0,1
6,0,0,1,1
7,0,0,1,0
8,0,1,1,0
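
The two steps (ordinal code, then binary digits) can be sketched by hand. The helper `binary_encode` is hypothetical. It assumes ordinal codes start at 1 in order of first appearance and uses the minimum number of bits, so it produces three columns rather than the four shown above, whose first column is all 0s.

```python
def binary_encode(values):
    """Ordinal-encode each category, then split its binary digits into columns."""
    categories = list(dict.fromkeys(values))                  # first-appearance order
    codes = {category: i + 1 for i, category in enumerate(categories)}
    width = max(codes.values()).bit_length()                  # bits needed for largest code
    return [[int(bit) for bit in format(codes[value], f'0{width}b')]
            for value in values]


cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
          'Delhi', 'Hyderabad', 'Mumbai', 'Agra']
rows = binary_encode(cities)

for city, row in zip(cities, rows):
    print(city, row)
```

Six distinct cities need only three binary columns, compared with six columns under one-hot encoding.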


### Base N Encoding

Before diving into BaseN encoding, let’s first understand what "base" means here.

In a numeral system, the base or radix is the number of unique digits (or digits and letters) used to represent numbers. The most common base in everyday life is 10, the decimal system, which uses 10 unique digits, 0 to 9, to represent all numbers. Another widely used system is binary, where the base is 2 and the two digits 0 and 1 express all numbers.    
For binary encoding, the base is 2, which means it converts the numerical value of a category into its binary form. If you want to change the base of the encoding scheme, you can use the Base N encoder. When the number of categories is large and binary encoding still produces too many columns, a larger base such as 4 or 8 can be used.

In [50]:
#Create the dataframe
df6=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

#Create an object for Base N Encoding
encoder= ce.BaseNEncoder(cols=['City'],return_df=True,base=5)

#Original Data
df6


Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Mumbai
8,Agra


In [51]:
#Fit and Transform Data
data_encoded=encoder.fit_transform(df6)
data_encoded

Unnamed: 0,City_0,City_1,City_2
0,0,0,1
1,0,0,2
2,0,0,3
3,0,0,4
4,0,1,0
5,0,0,1
6,0,0,3
7,0,0,2
8,0,1,1
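
The digit conversion behind this output can be sketched as follows. The helpers `to_base_n` and `base_n_encode` are hypothetical. The sketch assumes ordinal codes start at 1 in order of first appearance and uses the minimum number of digits, so it yields two columns where the output above shows an extra leading column of 0s.

```python
def to_base_n(number, base, width):
    """Return the digits of `number` in the given base, left-padded to `width`."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first


def base_n_encode(values, base):
    """Ordinal-encode each category, then write the code in the given base."""
    categories = list(dict.fromkeys(values))
    codes = {category: i + 1 for i, category in enumerate(categories)}
    width = 1
    while base ** width <= max(codes.values()):  # digits needed for largest code
        width += 1
    return [to_base_n(codes[value], base, width) for value in values]


cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
          'Delhi', 'Hyderabad', 'Mumbai', 'Agra']
rows = base_n_encode(cities, base=5)

for city, row in zip(cities, rows):
    print(city, row)
```

In base 5, Bangalore (code 5) becomes the digits 1, 0 and Agra (code 6) becomes 1, 1, matching the last two columns of the output above.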


### Target Encoding

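In target encoding, we calculate the mean of the target variable for each category and replace the category with that mean value. For a categorical target variable, the posterior probability of the target replaces each category. Note that the TargetEncoder used below also smooths each category mean toward the overall mean of the target, so the encoded values are not the raw per-class means.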

In [52]:
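#Create the Dataframe
# Note: the first value is 'A,' (with a stray comma), so it is treated as its own category, separate from 'A'
df7=pd.DataFrame({'class':['A,','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})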

#Create target encoding object
encoder=ce.TargetEncoder(cols='class') 

#Original Data
df7

Unnamed: 0,class,Marks
0,"A,",50
1,B,30
2,C,70
3,B,80
4,C,45
5,A,97
6,A,80
7,A,68


In [53]:
#Fit and Transform Train Data
encoder.fit_transform(df7['class'], df7['Marks'])

Unnamed: 0,class
0,65.0
1,57.689414
2,59.517061
3,57.689414
4,59.517061
5,79.679951
6,79.679951
7,79.679951
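
For comparison, the plain per-category mean described above can be computed by hand. The helper `target_mean_encode` is hypothetical, and the sketch assumes the first class label is 'A' with no stray comma. It applies no smoothing, so its values differ from the library output above.

```python
from collections import defaultdict


def target_mean_encode(categories, targets):
    """Replace each category with the mean of the target over that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means


classes = ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A']
marks = [50, 30, 70, 80, 45, 97, 80, 68]
encoded, means = target_mean_encode(classes, marks)

print(means)
print(encoded)
```

Because the encoding is derived from the target itself, it is normally fitted on training data only and then applied to unseen data, to avoid leaking target information.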


#### The End