# Categorical Variable

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/cat.png" width="2000">


A few algorithms, such as CatBoost and decision trees, can handle categorical values well, but most algorithms expect numerical inputs to achieve state-of-the-art results.
As you learn AI and machine learning, you will notice that most algorithms work better with numerical inputs. The main challenge for an analyst is therefore to convert text or categorical data into numerical data in a way that still lets the model make sense of it. Neural networks, the basis of deep learning, also expect numerical input values.

#### Label Encoding in Python
This approach is very simple: it converts each value in a column to a number. Consider a dataset of bridges with a column named Bridge_Types that holds the values created in the code below.

Using the category codes approach:
This approach requires the column to be of ‘category’ datatype. By default, a non-numerical column is of ‘object’ type, so you may have to convert it to ‘category’ before using this approach.

In [18]:
# import required libraries
import pandas as pd
import numpy as np
# creating initial dataframe
data = ('Arch','Beam','Truss','Cantilever','Tied_Arch','Suspension','Cable')
bridge_df = pd.DataFrame(data, columns=['Bridge_Types'])
bridge_df.dtypes

Bridge_Types    object
dtype: object

In [19]:
# converting type of columns to 'category'
bridge_df['Bridge_Types'] = bridge_df['Bridge_Types'].astype('category')
bridge_df.dtypes

Bridge_Types    category
dtype: object

In [20]:
# Assigning numerical values and storing in another column
bridge_df['Bridge_Cat'] = bridge_df['Bridge_Types'].cat.codes
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Cat
0,Arch,0
1,Beam,1
2,Truss,6
3,Cantilever,3
4,Tied_Arch,5
5,Suspension,4
6,Cable,2
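The codes in the table follow the alphabetical order of the categories rather than the row order, which is why Truss receives 6. As a minimal sketch of how to recover that mapping (the names `codes_to_labels` and `labels_to_codes` are illustrative and not part of the original notebook):

```python
import pandas as pd

bridge_types = pd.Series(
    ['Arch', 'Beam', 'Truss', 'Cantilever', 'Tied_Arch', 'Suspension', 'Cable'],
    dtype='category',
)

# cat.categories holds the sorted unique labels; a label's position is its code
codes_to_labels = dict(enumerate(bridge_types.cat.categories))
labels_to_codes = {label: code for code, label in codes_to_labels.items()}

print(codes_to_labels)
print(labels_to_codes['Truss'])
```

Keeping such a dictionary around makes it easier to translate model inputs or outputs back into readable labels.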


#### Using the scikit-learn library approach:
Another common way for data analysts to perform label encoding is to use the scikit-learn library.

In [21]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
# creating initial dataframe
data = ('Arch','Beam','Truss','Cantilever','Tied_Arch','Suspension','Cable')
bridge_df = pd.DataFrame(data, columns=['Bridge_Types'])

# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
bridge_df['Bridge_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Cat
0,Arch,0
1,Beam,1
2,Truss,6
3,Cantilever,3
4,Tied_Arch,5
5,Suspension,4
6,Cable,2


In [22]:
bridge_df.dtypes

Bridge_Types    object
Bridge_Cat       int64
dtype: object
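A fitted LabelEncoder also remembers the labels it saw, so the encoding can be inspected and reversed. The sketch below assumes the same bridge types as above; the variable names `mapping` and `decoded` are illustrative. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels, with OrdinalEncoder as the counterpart for input features.

```python
from sklearn.preprocessing import LabelEncoder

bridge_types = ['Arch', 'Beam', 'Truss', 'Cantilever', 'Tied_Arch', 'Suspension', 'Cable']

encoder = LabelEncoder()
codes = encoder.fit_transform(bridge_types)

# classes_ stores the sorted labels; the index of each label is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform converts the codes back to the original labels
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```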

## One-Hot Encoder
Label encoding is straightforward, but it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order. This ordering issue is addressed by a common alternative called ‘One-Hot Encoding’. In this strategy, each category value becomes a new column that holds 1 or 0 (true/false) for each row. Let’s revisit the previous bridge-type example with one-hot encoding.
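One way to see the ordering problem is to compare distances between encoded categories. The following sketch is an illustration only (the helpers `label_distance` and `one_hot_distance` are hypothetical names); it builds both encodings by hand with NumPy instead of using an encoder class.

```python
import numpy as np

categories = ['Arch', 'Beam', 'Cable', 'Cantilever', 'Suspension', 'Tied_Arch', 'Truss']

# label encoding: each category becomes its position in the sorted list
label_code = {name: i for i, name in enumerate(categories)}

# one-hot encoding: each category becomes a row of the identity matrix
identity = np.eye(len(categories))
one_hot = {name: identity[i] for i, name in enumerate(categories)}

def label_distance(a, b):
    """Absolute difference between two label codes."""
    return abs(label_code[a] - label_code[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

# label codes make Arch look much closer to Beam than to Truss
print(label_distance('Arch', 'Beam'), label_distance('Arch', 'Truss'))

# one-hot vectors keep every pair of distinct categories equally far apart
print(one_hot_distance('Arch', 'Beam'), one_hot_distance('Arch', 'Truss'))
```

A distance-based or linear model would treat the first pair of numbers as meaningful, even though bridge types have no natural order.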

Older releases of scikit-learn's OneHotEncoder accepted only numerical categorical values, so string values had to be label encoded first; since version 0.20 it also accepts string columns directly. Here we take the dataframe from the previous example and apply OneHotEncoder to the already label-encoded column Bridge_Cat.

In [23]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')

# passing bridge-types-cat column (label encoded values of bridge_types)
enc_df = pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Cat']]).toarray())

# merge with main df bridge_df on key values
bridge_df = bridge_df.join(enc_df)
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Cat,0,1,2,3,4,5,6
0,Arch,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Beam,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Truss,6,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,Cantilever,3,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,Tied_Arch,5,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,Suspension,4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,Cable,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0


#### Using dummies values approach with Pandas:
This approach is more flexible because it lets you encode as many categorical columns as you like and choose how to label the new columns with a prefix. Proper naming makes the rest of the analysis a little easier.

In [24]:
import pandas as pd
import numpy as np
# creating initial dataframe
data = ('Arch','Beam','Truss','Cantilever','Tied_Arch','Suspension','Cable')
bridge_df = pd.DataFrame(data, columns=['Bridge_Types'])

# generate binary values using get_dummies
# (optionally pass prefix=["Type_is"] to rename the generated columns)
dum_df = pd.get_dummies(bridge_df, columns=["Bridge_Types"])
# merge with main df bridge_df on key values
bridge_df = bridge_df.join(dum_df)
bridge_df

Unnamed: 0,Bridge_Types,Bridge_Types_Arch,Bridge_Types_Beam,Bridge_Types_Cable,Bridge_Types_Cantilever,Bridge_Types_Suspension,Bridge_Types_Tied_Arch,Bridge_Types_Truss
0,Arch,True,False,False,False,False,False,False
1,Beam,False,True,False,False,False,False,False
2,Truss,False,False,False,False,False,False,True
3,Cantilever,False,False,False,True,False,False,False
4,Tied_Arch,False,False,False,False,False,True,False
5,Suspension,False,False,False,False,True,False,False
6,Cable,False,False,True,False,False,False,False
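To illustrate the prefix option mentioned in this section, here is a minimal sketch that also requests integer 0/1 output through the `dtype` argument. The prefix string `Type_is` comes from the commented-out code in the notebook; everything else is an assumption for illustration.

```python
import pandas as pd

bridge_df = pd.DataFrame(
    {'Bridge_Types': ['Arch', 'Beam', 'Truss', 'Cantilever', 'Tied_Arch', 'Suspension', 'Cable']}
)

# prefix replaces the column name in the generated headers;
# dtype=int yields 0/1 instead of True/False
dum_df = pd.get_dummies(bridge_df, columns=['Bridge_Types'], prefix=['Type_is'], dtype=int)

print(dum_df.columns.tolist())
print(dum_df.head(3))
```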


### When to use a Label Encoding vs. One Hot Encoding
It is important to understand the various options for encoding categorical variables because each approach has its own pros and cons.

**We apply One-Hot Encoding when:**

- The categorical feature is not ordinal (like the bridge types above)
- The number of categories is small, so one-hot encoding can be applied without creating too many columns

**We apply Label Encoding when:**

- The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, High school)
- The number of categories is quite large, since one-hot encoding can then lead to high memory consumption
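For an ordinal feature, the codes should follow the real ranking rather than alphabetical order. The sketch below assumes a hypothetical column of school levels and states the order explicitly with an ordered pandas Categorical.

```python
import pandas as pd

# hypothetical column of school levels, in no particular row order
levels = pd.Series(['Primary school', 'Jr. kg', 'High school', 'Sr. kg', 'Jr. kg'])

# the order is stated explicitly; alphabetical sorting would give the wrong ranking
order = ['Jr. kg', 'Sr. kg', 'Primary school', 'High school']
ordered = pd.Categorical(levels, categories=order, ordered=True)

# codes now increase with the school level: Jr. kg=0 ... High school=3
codes = pd.Series(ordered.codes, index=levels.index)
print(codes.tolist())
```

With the default alphabetical categories, High school would receive the lowest code, which would misrepresent the ranking.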

In [25]:
import seaborn as sns
# Load an example dataset
df = sns.load_dataset("tips")
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


## Method1: pandas dummies

In [26]:
df1 = df.copy()
sex = pd.get_dummies(df['sex'])
sex

Unnamed: 0,Male,Female
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True
...,...,...
239,True,False
240,False,True
241,True,False
242,True,False


In [27]:
# Make sex a categorical with an explicit order, so that Male -> 0 and Female -> 1
df1['sex'] = pd.Categorical(df['sex'],['Male','Female'])

In [28]:
df1['sex']=df1['sex'].cat.codes
df1

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,1,No,Sun,Dinner,2
1,10.34,1.66,0,No,Sun,Dinner,3
2,21.01,3.50,0,No,Sun,Dinner,3
3,23.68,3.31,0,No,Sun,Dinner,2
4,24.59,3.61,1,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,0,No,Sat,Dinner,3
240,27.18,2.00,1,Yes,Sat,Dinner,2
241,22.67,2.00,0,Yes,Sat,Dinner,2
242,17.82,1.75,0,No,Sat,Dinner,2


In [29]:
# merge the two dataframes and drop the original sex column
data = pd.concat([df1,sex],axis=1).drop('sex', axis=1)
data.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size,Male,Female
0,16.99,1.01,No,Sun,Dinner,2,False,True
1,10.34,1.66,No,Sun,Dinner,3,True,False
2,21.01,3.5,No,Sun,Dinner,3,True,False
3,23.68,3.31,No,Sun,Dinner,2,True,False
4,24.59,3.61,No,Sun,Dinner,4,False,True
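For a binary column such as sex, the Male and Female indicators carry the same information, so one of them is often enough. As a sketch under that assumption, `drop_first=True` keeps a single indicator column; the small dataframe below is a hand-made stand-in that reuses the first three rows shown above instead of loading the full dataset.

```python
import pandas as pd

# stand-in for the tips data: first three rows of total_bill and sex
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'sex': pd.Categorical(['Female', 'Male', 'Male'], categories=['Male', 'Female']),
})

# drop_first=True drops the first level (Male here) and keeps one indicator column
encoded = pd.get_dummies(df, columns=['sex'], drop_first=True, dtype=int)
print(encoded)
```

This also does the encoding and the merge in one call, because `get_dummies` replaces the listed columns inside the dataframe it is given.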


## Method2: scikit-learn LabelEncoder

In [30]:
from sklearn.preprocessing import LabelEncoder

In [31]:
le = LabelEncoder()
label = le.fit_transform(df['sex'])
label

array([0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0])

In [32]:
# Label Encoder classes
le.classes_

array(['Female', 'Male'], dtype=object)

In [33]:
# remove sex column
data = df.drop('sex',axis='columns')

In [34]:
# store the LabelEncoder labels in a new sex column
data['sex'] = label
data

Unnamed: 0,total_bill,tip,smoker,day,time,size,sex
0,16.99,1.01,No,Sun,Dinner,2,0
1,10.34,1.66,No,Sun,Dinner,3,1
2,21.01,3.50,No,Sun,Dinner,3,1
3,23.68,3.31,No,Sun,Dinner,2,1
4,24.59,3.61,No,Sun,Dinner,4,0
...,...,...,...,...,...,...,...
239,29.03,5.92,No,Sat,Dinner,3,1
240,27.18,2.00,Yes,Sat,Dinner,2,0
241,22.67,2.00,Yes,Sat,Dinner,2,1
242,17.82,1.75,No,Sat,Dinner,2,1


In [35]:
sex = pd.get_dummies(df['sex'])
sex.head()

Unnamed: 0,Male,Female
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True


In [36]:
#merge two dataframes
data = pd.concat([data,sex],axis=1).drop('sex',axis=1)
data.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size,Male,Female
0,16.99,1.01,No,Sun,Dinner,2,False,True
1,10.34,1.66,No,Sun,Dinner,3,True,False
2,21.01,3.5,No,Sun,Dinner,3,True,False
3,23.68,3.31,No,Sun,Dinner,2,True,False
4,24.59,3.61,No,Sun,Dinner,4,False,True


## Method3: manual mapping with map

In [37]:
df2=df.copy()

In [38]:
df2['sex'] = df.sex.map({'Female':0, 'Male':1})
df2

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,No,Sun,Dinner,2
1,10.34,1.66,1,No,Sun,Dinner,3
2,21.01,3.50,1,No,Sun,Dinner,3
3,23.68,3.31,1,No,Sun,Dinner,2
4,24.59,3.61,0,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,1,No,Sat,Dinner,3
240,27.18,2.00,0,Yes,Sat,Dinner,2
241,22.67,2.00,1,Yes,Sat,Dinner,2
242,17.82,1.75,1,No,Sat,Dinner,2
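One thing to watch with a manual mapping is that `map` returns NaN for any value missing from the dictionary. The sketch below uses a hypothetical series with an extra 'Unknown' value to show how such gaps can be detected before training a model.

```python
import pandas as pd

# hypothetical column containing a value that the mapping does not cover
sex = pd.Series(['Female', 'Male', 'Unknown', 'Male'])
mapping = {'Female': 0, 'Male': 1}

encoded = sex.map(mapping)

# map leaves NaN wherever the key is missing from the dictionary
unmapped = sex[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```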


In [39]:
sex = pd.get_dummies(df['sex'])
sex.head()

Unnamed: 0,Male,Female
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True


In [40]:
df2 = pd.concat([df2,sex],axis=1)
df2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Male,Female
0,16.99,1.01,0,No,Sun,Dinner,2,False,True
1,10.34,1.66,1,No,Sun,Dinner,3,True,False
2,21.01,3.5,1,No,Sun,Dinner,3,True,False
3,23.68,3.31,1,No,Sun,Dinner,2,True,False
4,24.59,3.61,0,No,Sun,Dinner,4,False,True


In [41]:
df2 = df2.drop('sex',axis='columns')
df2

Unnamed: 0,total_bill,tip,smoker,day,time,size,Male,Female
0,16.99,1.01,No,Sun,Dinner,2,False,True
1,10.34,1.66,No,Sun,Dinner,3,True,False
2,21.01,3.50,No,Sun,Dinner,3,True,False
3,23.68,3.31,No,Sun,Dinner,2,True,False
4,24.59,3.61,No,Sun,Dinner,4,False,True
...,...,...,...,...,...,...,...,...
239,29.03,5.92,No,Sat,Dinner,3,True,False
240,27.18,2.00,Yes,Sat,Dinner,2,False,True
241,22.67,2.00,Yes,Sat,Dinner,2,True,False
242,17.82,1.75,No,Sat,Dinner,2,True,False
