# One-Hot Encoding with Pandas

One-Hot Encoding is a technique used to transform categorical data into numeric values understandable by machine learning algorithms.

Because One-Hot Encoding converts each category into a separate column of binary values indicating whether that category applies, it is best suited to nominal categorical data (categories that do not have a natural rank order), e.g. the Australian states Victoria, Tasmania and Queensland.

One-Hot Encoding is a poor fit for ordinal categorical data (categories that have a natural rank order), because placing each category in its own column can prevent machine learning algorithms from identifying the rank order relationship. For example, with T-shirt sizes, Small is less than Medium, which is less than Large, so the sizes could instead be label encoded as 0, 1 and 2 respectively.
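As a minimal sketch of the label encoding described above, the T-shirt sizes can be mapped to 0, 1 and 2 with an ordered pandas Categorical. The sample values and the variable names (`sizes`, `size_codes`) are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical T-shirt size data (illustrative only)
sizes = pd.Series(['Medium', 'Small', 'Large', 'Small'])

# Declare the rank order explicitly: Small < Medium < Large
ordered = pd.Categorical(sizes, categories=['Small', 'Medium', 'Large'], ordered=True)

# .codes holds each value's position in the declared category order
size_codes = pd.Series(ordered.codes, index=sizes.index)
print(size_codes.tolist())  # [1, 0, 2, 0]
```

Declaring the categories by hand keeps the codes tied to the intended order rather than to alphabetical order.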

The following example uses the pandas get_dummies function to encode a dataset containing Australian states.

In [2]:
import pandas as pd

In [3]:
# Create a sample pandas DataFrame
data = {
    'Location': ['Victoria', 'Queensland', 'Tasmania', 'Queensland', 'Tasmania', 'Victoria'],
    'Person_ID': [0, 1, 2, 3, 4, 5]
}

# Create DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,Location,Person_ID
0,Victoria,0
1,Queensland,1
2,Tasmania,2
3,Queensland,3
4,Tasmania,4
5,Victoria,5


In [4]:
# Use pandas get_dummies to one-hot encode the categorical column (Location)
df = pd.get_dummies(df)
df

Unnamed: 0,Person_ID,Location_Queensland,Location_Tasmania,Location_Victoria
0,0,0,0,1
1,1,1,0,0
2,2,0,1,0
3,3,1,0,0
4,4,0,1,0
5,5,0,0,1
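The call above lets get_dummies decide which columns to encode. A slightly more explicit variant is sketched below, under the assumption that only `Location` should be encoded: the `columns` parameter names the columns to encode, and `dtype=int` requests 0/1 integers. Recent pandas releases (2.0 onward, as far as I am aware) return boolean True/False columns by default, so the `dtype` argument may be needed to reproduce the 0/1 output shown here. The frame name `people` is illustrative.

```python
import pandas as pd

# Small illustrative frame with the same columns as the example above
people = pd.DataFrame({
    'Location': ['Victoria', 'Queensland', 'Tasmania'],
    'Person_ID': [0, 1, 2],
})

# columns= limits encoding to the named columns; dtype=int asks for 0/1 integers
encoded = pd.get_dummies(people, columns=['Location'], dtype=int)
print(encoded.columns.tolist())
```

Naming the columns also guards against a numeric identifier such as Person_ID being encoded by accident if it is ever stored as text.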


## Exploring an Edge Case

The code below explores an edge case that can arise when merging data from disparate sources that have each been encoded separately with the pandas get_dummies function. The issue arises because get_dummies generates one column per category value present in the data it is given, so sources containing different sets of values can end up with inconsistent columns.

In [5]:
# Create DataFrames to represent data from disparate sources

df1 = pd.DataFrame({'Location': ['Victoria', 'Queensland'], 'Person_ID': [0, 1]})
df2 = pd.DataFrame({'Location': ['Tasmania', 'Queensland', 'Tasmania', 'Victoria'], 'Person_ID': [2, 3, 4, 5]})

In [6]:
# Use Pandas get dummies to perform category encoding on df1
df1 = pd.get_dummies(df1)
df1

Unnamed: 0,Person_ID,Location_Queensland,Location_Victoria
0,0,0,1
1,1,1,0


In [7]:
# Use Pandas get dummies to perform category encoding on df2
df2 = pd.get_dummies(df2)
df2

Unnamed: 0,Person_ID,Location_Queensland,Location_Tasmania,Location_Victoria
0,2,0,1,0
1,3,1,0,0
2,4,0,1,0
3,5,0,0,1


In [8]:
print(f"df1 column count:{len(df1.columns)}")
print(f"df2 column count:{len(df2.columns)}")
print("The column counts do not match!")

df1 column count:3
df2 column count:4
The column counts do not match!
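Comparing counts shows that the frames differ but not how. A minimal sketch of listing the differing columns with set differences follows; `source_a` and `source_b` are illustrative stand-ins for the two encoded sources, and `dtype=int` is an assumption made to keep 0/1 values across pandas versions.

```python
import pandas as pd

# Illustrative stand-ins for two separately encoded sources
source_a = pd.get_dummies(pd.DataFrame({'Location': ['Victoria', 'Queensland']}), dtype=int)
source_b = pd.get_dummies(pd.DataFrame({'Location': ['Tasmania', 'Queensland', 'Victoria']}), dtype=int)

# Columns that appear in one frame but not the other
missing_from_a = sorted(set(source_b.columns) - set(source_a.columns))
missing_from_b = sorted(set(source_a.columns) - set(source_b.columns))

print(missing_from_a)  # ['Location_Tasmania']
print(missing_from_b)  # []
```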


In [9]:
# Concatenate the DataFrames row-wise
df = pd.concat([df1, df2], axis=0)
df

Unnamed: 0,Person_ID,Location_Queensland,Location_Victoria,Location_Tasmania
0,0,0,1,
1,1,1,0,
0,2,0,0,1.0
1,3,1,0,0.0
2,4,0,0,1.0
3,5,0,1,0.0


In [10]:
# df1 had no Location_Tasmania column, so concat filled that column with NaN for df1's rows
# A missing dummy column means the category does not apply, so replace NaN with 0
df = df.fillna(0)
df

Unnamed: 0,Person_ID,Location_Queensland,Location_Victoria,Location_Tasmania
0,0,0,1,0.0
1,1,1,0,0.0
0,2,0,0,1.0
1,3,1,0,0.0
2,4,0,0,1.0
3,5,0,1,0.0
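Filling NaN after the fact works, but it leaves Location_Tasmania as a float column and the concatenated index repeats. One way to avoid the mismatch in the first place, sketched here under the assumption that the full list of states is known before encoding, is to convert Location to a pandas Categorical with fixed categories so that get_dummies emits the same columns for every source. `ALL_STATES`, `encode_locations`, `part1` and `part2` are hypothetical names introduced for this illustration.

```python
import pandas as pd

# Assumption: the complete set of category values is known up front
ALL_STATES = ['Queensland', 'Tasmania', 'Victoria']

def encode_locations(frame):
    """One-hot encode 'Location' against the fixed ALL_STATES list (hypothetical helper)."""
    frame = frame.copy()
    # Unobserved categories still produce a dummy column, filled with 0
    frame['Location'] = pd.Categorical(frame['Location'], categories=ALL_STATES)
    return pd.get_dummies(frame, columns=['Location'], dtype=int)

part1 = encode_locations(pd.DataFrame(
    {'Location': ['Victoria', 'Queensland'], 'Person_ID': [0, 1]}))
part2 = encode_locations(pd.DataFrame(
    {'Location': ['Tasmania', 'Queensland', 'Tasmania', 'Victoria'], 'Person_ID': [2, 3, 4, 5]}))

# Both parts now share the same columns, so concat introduces no NaN values;
# ignore_index=True renumbers the rows 0..5 instead of repeating each source's index
combined = pd.concat([part1, part2], ignore_index=True)
print(combined)
```

When the encoded frames already exist, `DataFrame.reindex(columns=..., fill_value=0)` is another option for aligning each frame to a known column list before concatenating.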
