# One-Hot Encoding with scikit-learn

One-Hot Encoding is a technique used to transform categorical data into numeric values understandable by machine learning algorithms.

One-Hot Encoding converts each category into a separate column of binary values, where 1 indicates that the category applies to a row and 0 indicates that it does not. It is therefore best suited to nominal categorical data (categories that do not have a natural rank order), e.g. the Australian states Victoria, Tasmania and Queensland.
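
To make the idea concrete, the following is a minimal sketch of one-hot encoding written in plain Python, without any library. The variable names and the short sample list are illustrative assumptions, not part of the worked example later in this notebook.

```python
# Hypothetical sample data; names are illustrative only
locations = ['Victoria', 'Queensland', 'Tasmania', 'Queensland']

# The sorted unique categories define the column order
categories = sorted(set(locations))

# Each row gets a 1 in the column matching its category and 0 elsewhere
one_hot_rows = [
    [1 if location == category else 0 for category in categories]
    for location in locations
]

print(categories)
for location, row in zip(locations, one_hot_rows):
    print(location, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.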

Applying One-Hot Encoding to ordinal categorical data (categories that have a natural rank order) can prevent machine learning algorithms from identifying the rank-order relationship. For example, with T-shirt sizes, Small is less than Medium, which is less than Large, so they could instead be label encoded as 0, 1 and 2 respectively.
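
As a rough sketch of the T-shirt example, the snippet below uses scikit-learn's `OrdinalEncoder` with an explicit category order. This is one possible way to produce the 0, 1, 2 encoding described above, not necessarily the only or intended one. The DataFrame, column names and variable names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data for illustration
sizes_df = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# Passing the categories explicitly fixes the rank order:
# Small -> 0, Medium -> 1, Large -> 2
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# fit_transform returns a 2D array, so flatten it before assigning to a column
sizes_df['Size_encoded'] = size_encoder.fit_transform(sizes_df[['Size']]).ravel()
print(sizes_df)
```

Without the explicit `categories` argument, the encoder would order the values alphabetically (Large, Medium, Small), which would not reflect the size ranking.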

The following example uses a scikit-learn OneHotEncoder to encode a dataset containing Australian states.

In [4]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

## Approach 1: One-Hot Encoding (with undefined categories)

In [5]:
# Create a sample pandas DataFrame
data = {
    'Location': ['Victoria', 'Queensland', 'Tasmania', 'Queensland', 'Tasmania', 'Victoria'],
    'Person_ID': [0, 1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Location,Person_ID
0,Victoria,0
1,Queensland,1
2,Tasmania,2
3,Queensland,3
4,Tasmania,4
5,Victoria,5


In [6]:
# Create a OneHotEncoder instance
onehotencoder = OneHotEncoder()

In [7]:
# Perform One-Hot Encoding on the Location column
# (the encoder infers the categories from the data and sorts them alphabetically)
cat_encoded_columns = onehotencoder.fit_transform(df[['Location']]).toarray()
cat_encoded_columns_df = pd.DataFrame(cat_encoded_columns)
cat_encoded_columns_df # DataFrame should contain 3 columns (0 = Queensland, 1 = Tasmania, 2 = Victoria)

Unnamed: 0,0,1,2
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,1.0,0.0,0.0
4,0.0,1.0,0.0
5,0.0,0.0,1.0


In [8]:
# Concatenate the DataFrame containing the encoded location columns with the original DataFrame
df = pd.concat([cat_encoded_columns_df, df], axis = 1)
df

Unnamed: 0,0,1,2,Location,Person_ID
0,0.0,0.0,1.0,Victoria,0
1,1.0,0.0,0.0,Queensland,1
2,0.0,1.0,0.0,Tasmania,2
3,1.0,0.0,0.0,Queensland,3
4,0.0,1.0,0.0,Tasmania,4
5,0.0,0.0,1.0,Victoria,5


In [9]:
# Drop the original 'Location' column from the dataframe
df = df.drop(['Location'], axis = 1)
df

Unnamed: 0,0,1,2,Person_ID
0,0.0,0.0,1.0,0
1,1.0,0.0,0.0,1
2,0.0,1.0,0.0,2
3,1.0,0.0,0.0,3
4,0.0,1.0,0.0,4
5,0.0,0.0,1.0,5


## Approach 2: One-Hot Encoding (with defined categories)

In [10]:
# Create a sample pandas DataFrame
data = {
    'Location': ['Victoria', 'Queensland', 'Tasmania', 'Queensland', 'Tasmania', 'Victoria'],
    'Person_ID': [0, 1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Location,Person_ID
0,Victoria,0
1,Queensland,1
2,Tasmania,2
3,Queensland,3
4,Tasmania,4
5,Victoria,5


In [11]:
# List of all possible state names
all_states = ['Victoria', 'Queensland', 'Tasmania', 'New South Wales', 'Western Australia', 'South Australia']

# Create a OneHotEncoder instance with specified categories
# (scikit-learn 1.2+ uses sparse_output; older versions use sparse=False instead)
encoder = OneHotEncoder(categories=[all_states], sparse_output=False)

In [12]:
# Fit and transform the 'Location' column
encoded_states = encoder.fit_transform(df[["Location"]])

# Create a new DataFrame with the encoded state columns
encoded_df = pd.DataFrame(encoded_states, columns=[f"Location_{state}" for state in all_states])
encoded_df

Unnamed: 0,Location_Victoria,Location_Queensland,Location_Tasmania,Location_New South Wales,Location_Western Australia,Location_South Australia
0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# Concatenate the original DataFrame with the encoded state columns
df_encoded = pd.concat([df, encoded_df], axis=1)

In [14]:
# Drop the original 'Location' column from the combined DataFrame
df_encoded = df_encoded.drop(['Location'], axis = 1)
df_encoded

Unnamed: 0,Person_ID,Location_Victoria,Location_Queensland,Location_Tasmania,Location_New South Wales,Location_Western Australia,Location_South Australia
0,0,1.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,1.0,0.0,0.0,0.0,0.0
2,2,0.0,0.0,1.0,0.0,0.0,0.0
3,3,0.0,1.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,1.0,0.0,0.0,0.0
5,5,1.0,0.0,0.0,0.0,0.0,0.0
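
One practical reason to define the categories up front is that later data may contain states that were absent from the data used to fit the encoder. The sketch below illustrates this under the assumption of a small, hypothetical `train_df` and `new_df`. Because every state has a reserved column, the new data is encoded with the same six-column layout.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

all_states = ['Victoria', 'Queensland', 'Tasmania',
              'New South Wales', 'Western Australia', 'South Australia']

# Hypothetical data: the fitting sample only contains three of the six states
train_df = pd.DataFrame({'Location': ['Victoria', 'Queensland', 'Tasmania']})
# Later data includes a state that never appeared in the fitting sample
new_df = pd.DataFrame({'Location': ['South Australia', 'Victoria']})

fixed_encoder = OneHotEncoder(categories=[all_states])
fixed_encoder.fit(train_df[['Location']])

# The column order follows all_states, so 'South Australia' maps to the last column
new_encoded = fixed_encoder.transform(new_df[['Location']]).toarray()
print(new_encoded)
```

Had the encoder inferred its categories from `train_df` alone, transforming `new_df` would raise an error under the default `handle_unknown='error'` setting, because 'South Australia' would be an unknown category.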
