# Categorical Encoding and One Hot Encoding

Categorical data are variables that contain labels rather than numeric values. They come in two forms: nominal data, which describe a quality with no inherent order (such as hair color), and ordinal data, which can be ranked (such as class grades). Because many machine learning models require numerical input, categorical columns must be converted to numbers. This transformation is called encoding, and each form of categorical data calls for a different method. Two of the most common encoding methods are:

1. **Label Encoding:** Each unique category value is assigned an integer. This approach is simple, but it is not suitable for nominal data, because the integers imply an ordering that the categories do not have.
2. **One-Hot Encoding:** Each unique category value is converted into its own binary column. This approach suits nominal data (those without an intrinsic order), but it can cause an explosion in the number of columns when the variable has many distinct categories.
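To make the contrast concrete, here is a minimal sketch that applies both methods to a small, made-up hair color column. The data and the variable names (`hair`, `label_encoded`, `one_hot`) are illustrative assumptions and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data: hair color is nominal (no natural order)
hair = pd.Series(["brown", "black", "red", "brown"], name="hair")

# Label encoding: one integer per category.
# pandas assigns codes in sorted category order: black=0, brown=1, red=2
label_encoded = hair.astype("category").cat.codes
print(label_encoded.tolist())  # [1, 0, 2, 1]

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(hair, prefix="hair", dtype=int)
print(one_hot)
```

The integer codes suggest that red (2) is somehow "greater than" black (0), which is meaningless for hair color. The one-hot version avoids that implication at the cost of three columns instead of one.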

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
df.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |


## Label Encoding

Label encoding is useful for ordinal data like grades in class. Transforming the grades to integers preserves the order, as long as the integers are assigned in rank order.
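As a rough illustration of why the assignment order matters, the sketch below encodes a few made-up letter grades with an explicit ranking. The grade values and the names `grade_order` and `grade_codes` are assumptions chosen for the example. Sorting letter grades alphabetically would give "A" the lowest code, so the rank is spelled out by hand with an ordered pandas `Categorical`.

```python
import pandas as pd

# Hypothetical letter grades (not from the Titanic dataset)
grades = pd.Series(["B", "A", "F", "C", "A"])

# Assumed ranking, listed from lowest to highest
grade_order = ["F", "D", "C", "B", "A"]

# An ordered Categorical assigns codes by position in grade_order
ordered = pd.Categorical(grades, categories=grade_order, ordered=True)
grade_codes = pd.Series(ordered.codes, index=grades.index)

print(grade_codes.tolist())  # [3, 4, 0, 2, 4]
```

With this mapping a higher integer always means a better grade, which is the property a model can exploit in ordinal data.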

In the Titanic dataset, the 'Pclass' feature is already stored as an integer, but it can be treated as ordinal because the passenger classes have a natural order (1st, 2nd, 3rd).

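We will demonstrate the use of scikit-learn's `LabelEncoder` class to re-encode the 'Pclass' column as integers. scikit-learn documents `LabelEncoder` as a tool for target labels and also provides `OrdinalEncoder` for feature columns, but `LabelEncoder` is enough to show the idea here. **Note: `LabelEncoder` numbers the sorted categories starting from 0, so 1st class is now labeled 0, 2nd class 1, and 3rd class 2.**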

In [None]:
# Initialize the LabelEncoder
le = LabelEncoder()

# Convert 'Pclass' to a categorical dtype to mark it as non-numeric data
df['Pclass'] = df['Pclass'].astype('category')
print('Ordinal Data Type: ' + str(df['Pclass'].dtype))

# Apply label encoding to 'Pclass' (sorted classes 1, 2, 3 become 0, 1, 2)
df['Pclass'] = le.fit_transform(df['Pclass'])
print('After LabelEncoder() Transformation: ' + str(df['Pclass'].dtype))

Ordinal Data Type: category
After LabelEncoder() Transformation: int64


In [None]:
df.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 2 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
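To check which integer each original class received, a fitted `LabelEncoder` exposes the sorted categories in its `classes_` attribute and can reverse the encoding with `inverse_transform`. The sketch below uses a small stand-in Series instead of the full DataFrame so that it runs on its own. The names `pclass`, `encoder`, and `mapping` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the 'Pclass' column (illustrative values only)
pclass = pd.Series([3, 1, 3, 2])

encoder = LabelEncoder()
encoded = encoder.fit_transform(pclass)

# classes_ holds the sorted original values; a value's position is its code
mapping = {int(original): code for code, original in enumerate(encoder.classes_)}
print(mapping)           # {1: 0, 2: 1, 3: 2}
print(encoded.tolist())  # [2, 0, 2, 1]

# inverse_transform recovers the original labels from the codes
recovered = encoder.inverse_transform(encoded)
print(recovered.tolist())  # [3, 1, 3, 2]
```

This matches the table above, where 3rd class passengers now show a 'Pclass' of 2 and the 1st class passenger shows 0.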


## One-Hot Encoding

One-hot encoding is for nominal data without an intrinsic order, like hair color. This method converts each category value into a new binary column of 0s (False) or 1s (True). This encoding is useful for models that cannot handle categorical data types directly, such as linear regression.

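In the Titanic dataset, the 'Sex' and 'Embarked' features are nominal. We will use the pandas [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function to expand our nominal data into new columns. Passing `drop_first=True` drops the first category of each feature, since it is implied whenever all of the remaining indicator columns are False.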



In [None]:
# One-hot encoding of 'Sex' and 'Embarked' columns
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
df.head(3)

| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | True | False | True |
| 1 | 2 | 1 | 0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | False | False | False |
| 2 | 3 | 1 | 2 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | False | False | True |
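The output has no `Sex_female` or `Embarked_C` column because of the `drop_first=True` argument in the call above. The following sketch compares the two settings on a tiny, made-up 'Embarked' column. The frame `ports` and the names `full` and `reduced` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mini column with the three embarkation codes
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Default: one indicator column per category
full = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
print(list(full.columns))     # ['Embarked_C', 'Embarked_Q', 'Embarked_S']

# drop_first=True: the first category (C) is dropped
reduced = pd.get_dummies(ports, columns=["Embarked"], drop_first=True, dtype=int)
print(list(reduced.columns))  # ['Embarked_Q', 'Embarked_S']

# A row of all zeros now stands for the dropped category
print(reduced.iloc[1].tolist())  # [0, 0]
```

Dropping one column loses no information and avoids perfectly correlated indicator columns, which matters for models such as linear regression. Tree-based models are usually less sensitive to this choice.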

