Imports

In [30]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

Example Dataset

We will use a simple example with categorical features to demonstrate encoding.

In [31]:
data = {
    'color': ['red', 'blue', 'green', 'blue', 'green', 'red'],
    'size': ['S', 'M', 'L', 'M', 'S', 'L'],
    'category': ['A', 'B', 'A', 'B', 'A', 'B']
}

df = pd.DataFrame(data)
df


   color size category
0    red    S        A
1   blue    M        B
2  green    L        A
3   blue    M        B
4  green    S        A
5    red    L        B


Label Encoding

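Label Encoding converts categorical values into integer labels by assigning a unique integer to each category. scikit-learn's LabelEncoder is intended for target variables (y) and numbers the classes in sorted (alphabetical) order, so it does not by itself preserve a meaningful order for ordinal features; OrdinalEncoder with an explicit categories list is the usual choice for those. It is applied to the 'color' feature here purely to illustrate the mechanics.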

Using LabelEncoder


In [32]:
label_encoder = LabelEncoder()

df['color_encoded'] = label_encoder.fit_transform(df['color'])
df


   color size category  color_encoded
0    red    S        A              2
1   blue    M        B              0
2  green    L        A              1
3   blue    M        B              0
4  green    S        A              1
5    red    L        B              2
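
Because LabelEncoder numbers categories alphabetically, it can scramble a feature whose categories have a natural order. The sketch below contrasts it with OrdinalEncoder on the 'size' values, assuming the intended order is S < M < L (that ordering is an assumption made for illustration; the dataset itself does not state it).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'M', 'S', 'L']})

# LabelEncoder sorts the classes alphabetically: L -> 0, M -> 1, S -> 2
alphabetical = LabelEncoder().fit_transform(sizes['size'])

# OrdinalEncoder with an explicit category list keeps the assumed order:
# S -> 0, M -> 1, L -> 2
ordinal_encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
ordered = ordinal_encoder.fit_transform(sizes[['size']]).ravel()

print(alphabetical)  # [2 1 0 1 2 0]
print(ordered)       # [0. 1. 2. 1. 0. 2.]
```

Note that OrdinalEncoder expects a 2D input (hence the double brackets), while LabelEncoder works on a single 1D column.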


One-Hot Encoding

One-Hot Encoding transforms categorical variables into a binary matrix (one column per category) and is used when there is no ordinal relationship between categories (e.g., for nominal variables like colors, city names).

Using OneHotEncoder

In [33]:
# Create the OneHotEncoder with default settings; its output is a sparse matrix
one_hot_encoder = OneHotEncoder()

# Apply one-hot encoding on multiple columns: 'color', 'size', and 'category'
encoded_data = one_hot_encoder.fit_transform(df[['color', 'size', 'category']])

# Convert the sparse matrix into a dense array
encoded_data = encoded_data.toarray()
encoded_df = pd.DataFrame(encoded_data, columns=one_hot_encoder.get_feature_names_out(['color', 'size', 'category']))

# Concatenate the one-hot encoded columns with the original DataFrame
df_encoded = pd.concat([df, encoded_df], axis=1)

# Display the transformed dataset
print(df_encoded)

   color size category  color_encoded  color_blue  color_green  color_red  \
0    red    S        A              2         0.0          0.0        1.0   
1   blue    M        B              0         1.0          0.0        0.0   
2  green    L        A              1         0.0          1.0        0.0   
3   blue    M        B              0         1.0          0.0        0.0   
4  green    S        A              1         0.0          1.0        0.0   
5    red    L        B              2         0.0          0.0        1.0   

   size_L  size_M  size_S  category_A  category_B  
0     0.0     0.0     1.0         1.0         0.0  
1     0.0     1.0     0.0         0.0         1.0  
2     1.0     0.0     0.0         1.0         0.0  
3     0.0     1.0     0.0         0.0         1.0  
4     0.0     0.0     1.0         1.0         0.0  
5     1.0     0.0     0.0         0.0         1.0  


Handling Multiple Columns with ColumnTransformer

If you need to encode multiple categorical columns at once, you can use the ColumnTransformer to apply different preprocessing techniques to different columns.

Using ColumnTransformer

In [34]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import pandas as pd

# Create a ColumnTransformer to apply one-hot encoding to 'color' column
preprocessor = ColumnTransformer(
    transformers=[('color', OneHotEncoder(), ['color'])],
    remainder='passthrough'  # Keep other columns unchanged
)

# Fit the transformer and encode 'color'; the remaining columns pass through unchanged
df_encoded = preprocessor.fit_transform(df)

# Get the output column names from the ColumnTransformer
# (each name is prefixed with the transformer name or 'remainder')
encoded_columns = preprocessor.get_feature_names_out()

# Convert the result to DataFrame
df_encoded = pd.DataFrame(df_encoded, columns=encoded_columns)

# Apply LabelEncoder to the original 'category' column; this adds a new
# 'category' column next to the passed-through 'remainder__category'
label_encoder = LabelEncoder()
df_encoded['category'] = label_encoder.fit_transform(df['category'])

# Display the transformed dataset
print(df_encoded)


  color__color_blue color__color_green color__color_red remainder__size  \
0               0.0                0.0              1.0               S   
1               1.0                0.0              0.0               M   
2               0.0                1.0              0.0               L   
3               1.0                0.0              0.0               M   
4               0.0                1.0              0.0               S   
5               0.0                0.0              1.0               L   

  remainder__category remainder__color_encoded  category  
0                   A                        2         0  
1                   B                        0         1  
2                   A                        1         0  
3                   B                        0         1  
4                   A                        1         0  
5                   B                        2         1  


Inverse Transformation

After encoding, you may want to map the numerical data back to the original categorical values. You can use the inverse_transform() method of LabelEncoder or OneHotEncoder to do this.

Inverse Transformation (LabelEncoder)

In [35]:
# Re-fit the encoder on 'color' (it was last fitted on 'category' in the previous cell)
df['color_encoded'] = label_encoder.fit_transform(df['color'])

# Now, to get the original labels back, use inverse_transform on the encoded column
original_labels = label_encoder.inverse_transform(df['color_encoded'])

# Add the original labels to the DataFrame
df['color_original'] = original_labels

print(df[['color', 'color_encoded', 'color_original']])

   color  color_encoded color_original
0    red              2            red
1   blue              0           blue
2  green              1          green
3   blue              0           blue
4  green              1          green
5    red              2            red
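
OneHotEncoder supports the same round trip. A minimal, self-contained sketch on the 'color' values; the names colors, color_encoder, one_hot, and recovered are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue', 'green', 'red']})

color_encoder = OneHotEncoder()
one_hot = color_encoder.fit_transform(colors).toarray()

# inverse_transform maps each one-hot row back to its original category;
# the result is a 2D array with one column per encoded feature
recovered = color_encoder.inverse_transform(one_hot)

print(recovered.ravel())
```

The recovered values match the original 'color' column row for row.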


Summary of Encoding Techniques

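Label Encoding: Converts categorical values into integer labels. scikit-learn's LabelEncoder is designed for target variables and numbers the classes in sorted order, so for a feature with a meaningful order (like 'Low', 'Medium', 'High') the order has to be specified explicitly, for example with OrdinalEncoder and its categories parameter.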

One-Hot Encoding: Converts categorical values into a binary matrix. This method is appropriate when the categorical feature has no ordinal relationship (i.e., the categories are nominal).

ColumnTransformer: A convenient tool to apply different encoding strategies to different columns of the dataset at once.