# Label Encoding with scikit-learn

Label encoding is a technique for transforming categorical data into numeric values that machine learning algorithms can process.

Because label encoding converts categories into ascending integers, it is best suited to ordinal categorical data (categories that have a natural rank order). For example, with T-shirt sizes, Small is less than Medium, which is less than Large, so they could be label encoded as 0, 1 and 2 respectively.
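The mapping described above can be sketched in plain Python before bringing in any library. The names `size_order`, `size_to_code` and `sizes` are illustrative only and are not part of the original example.

```python
# Illustrative only: list the categories in their natural rank order.
size_order = ["Small", "Medium", "Large"]

# Each category's code is its position in the ordered list.
size_to_code = {size: code for code, size in enumerate(size_order)}

# Hypothetical sample values to encode.
sizes = ["Large", "Small", "Medium"]
encoded = [size_to_code[size] for size in sizes]

print(encoded)  # [2, 0, 1]
```

Because the codes follow the rank order, comparisons such as `size_to_code["Small"] < size_to_code["Large"]` remain meaningful after encoding.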

Applying label encoding to nominal categorical data (categories that do not have a natural rank order) can lead a machine learning algorithm to infer a rank order that does not exist. For example, the Australian states Victoria, Tasmania and Queensland have no natural rank order. In these cases, alternative encoding approaches such as one-hot encoding are preferred.
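As a rough illustration of the one-hot alternative, the sketch below uses `pandas.get_dummies` on the three states mentioned above. The DataFrame and its `State` column are made up for this illustration; `get_dummies` is one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical nominal data: the states have no natural rank order.
states = pd.DataFrame({"State": ["Victoria", "Tasmania", "Queensland"]})

# One indicator column per state; dtype=int gives 0/1 rather than booleans.
one_hot = pd.get_dummies(states, columns=["State"], dtype=int)

print(one_hot)
```

Each row has exactly one column set to 1, so no state is represented as "greater" than another.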

The following example uses scikit-learn's `LabelEncoder` to encode a dataset containing T-shirt sizes.

In [5]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [6]:
# Create a sample pandas DataFrame
data = {
    'Size': ['Large', 'Small', 'Medium', 'Medium', 'Large', 'Small'],
    'Person_ID': [0, 1, 2, 3, 4, 5]
}

# Create DataFrame
df = pd.DataFrame(data)
df

|   | Size | Person_ID |
|---|------|-----------|
| 0 | Large | 0 |
| 1 | Small | 1 |
| 2 | Medium | 2 |
| 3 | Medium | 3 |
| 4 | Large | 4 |
| 5 | Small | 5 |


In [7]:
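# Create a LabelEncoder instance and set the class order explicitly.
# Assigning classes_ directly (instead of calling fit) makes the codes
# follow list position: Small -> 0, Medium -> 1, Large -> 2.
label_encoder = LabelEncoder()
label_encoder.classes_ = ["Small", "Medium", "Large"]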
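For comparison, here is a small sketch of what happens when the encoder is fitted in the usual way. `fit` stores the classes in sorted order, which for these strings is alphabetical, so the resulting codes no longer reflect the size ranking. The names `default_encoder` and `default_codes` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["Large", "Small", "Medium", "Medium", "Large", "Small"]

# Fitting learns the classes from the data and sorts them.
default_encoder = LabelEncoder()
default_codes = default_encoder.fit_transform(sizes)

print(default_encoder.classes_.tolist())  # ['Large', 'Medium', 'Small']
print(default_codes.tolist())             # [0, 2, 1, 1, 0, 2]
```

Here Large is encoded as 0 and Small as 2, which is why the example sets the class order by hand.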

In [8]:
# Transform the "Size" column using the label encoder
df["Size"] = label_encoder.transform(df["Size"])
df

|   | Size | Person_ID |
|---|------|-----------|
| 0 | 2 | 0 |
| 1 | 0 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 2 | 4 |
| 5 | 0 | 5 |
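An alternative worth considering is scikit-learn's `OrdinalEncoder`, which accepts the category order through its `categories` parameter and is designed for feature columns. The sketch below rebuilds the same sample data and is not the approach used in the example above. The column name `Size_encoded` is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Size": ["Large", "Small", "Medium", "Medium", "Large", "Small"],
    "Person_ID": [0, 1, 2, 3, 4, 5],
})

# One list of ordered categories per encoded column.
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])

# OrdinalEncoder expects 2D input, hence the double brackets.
encoded = ordinal_encoder.fit_transform(df[["Size"]])

# The result is a float array of shape (n_rows, 1); flatten and cast to int.
df["Size_encoded"] = encoded.ravel().astype(int)
print(df["Size_encoded"].tolist())  # [2, 0, 1, 1, 2, 0]

# The fitted encoder can also map codes back to the original labels.
decoded = ordinal_encoder.inverse_transform(encoded)
print(decoded.ravel().tolist())
```

The codes match those produced above (Small = 0, Medium = 1, Large = 2), and the order is passed through a documented constructor parameter rather than by assigning to a fitted attribute.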
