In [1]:
%pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder  # the one-hot step below uses pd.get_dummies

# Sample categorical data
data = {
    "Color": ["Red", "Blue", "Green", "Red", "Blue"],
    "Size": ["S", "M", "L", "XL", "M"]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)

# --- Label Encoding (converts categories into numbers) ---
le = LabelEncoder()
df["Color_Label"] = le.fit_transform(df["Color"])
print("\nLabel Encoded Data:")
print(df)

# --- One Hot Encoding (creates binary columns for each category) ---
df_onehot = pd.get_dummies(df[["Color", "Size"]], drop_first=False)
print("\nOne Hot Encoded Data:")
print(df_onehot)


Original Data:
   Color Size
0    Red    S
1   Blue    M
2  Green    L
3    Red   XL
4   Blue    M

Label Encoded Data:
   Color Size  Color_Label
0    Red    S            2
1   Blue    M            0
2  Green    L            1
3    Red   XL            2
4   Blue    M            0

One Hot Encoded Data:
   Color_Blue  Color_Green  Color_Red  Size_L  Size_M  Size_S  Size_XL
0       False        False       True   False   False    True    False
1        True        False      False   False    True   False    False
2       False         True      False    True   False   False    False
3       False        False       True   False   False   False     True
4        True        False      False   False    True   False    False


In [None]:
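Label Encoding assigns an integer to each category. LabelEncoder sorts the categories first, so here Blue=0, Green=1, Red=2. The integers imply an order (Blue < Green < Red) that colors do not have, and a model that treats the column as numeric may learn a ranking that is not real.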
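
Where a column does have a real order, the integer codes can be made to follow it instead of the alphabet. The sketch below assumes that Size is ordered S < M < L < XL (the data itself does not say so) and uses scikit-learn's OrdinalEncoder with that ordering passed in explicitly. The names size_order and Size_Ordinal are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["S", "M", "L", "XL", "M"]})

# Assumed ordering, smallest to largest; adjust to the real meaning of the data
size_order = ["S", "M", "L", "XL"]

# categories takes one list per input column, in the order the codes should follow
encoder = OrdinalEncoder(categories=[size_order])

# fit_transform returns a 2-D float array, so flatten it and cast to int
df["Size_Ordinal"] = encoder.fit_transform(df[["Size"]]).ravel().astype(int)
print(df)
```

With this ordering S maps to 0 and XL to 3, so comparisons such as "larger than M" are meaningful on the encoded column.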

One Hot Encoding creates a separate binary column for each category (Color_Red, Color_Blue, and so on), which avoids any order assumption at the cost of one extra column per category.