# Categorical Encoding 

Most common types:
- Label encoding 
- One-hot encoding

#### Importing the required libraries

In [6]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

In [2]:
# Sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

#### #1 Label Encoding

In [8]:
# Make a copy of the original DataFrame
labelEnc_df = df.copy()

# Initialize a LabelEncoder instance from sklearn.preprocessing
label_encoder = LabelEncoder()

# Fit and transform the categorical column
labelEnc_df['Color_Encoded'] = label_encoder.fit_transform(labelEnc_df['Color'])

labelEnc_df

,Color,Color_Encoded
0,Red,2
1,Blue,0
2,Green,1
3,Blue,0
4,Red,2


##### **Notes on Code**

The **fit_transform()** method combines two steps, fitting and transforming, in a single call.
- Fitting: The fit() method scans the column (labelEnc_df['Color']) to learn its unique categories (here 'Red', 'Blue', and 'Green') and assigns each one a unique integer. LabelEncoder sorts the categories first, so the mapping is Blue = 0, Green = 1, Red = 2.
- Transforming: The transform() method replaces each value in the column with the integer learned during fitting.
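
The two steps can also be run separately. The cell below is a minimal sketch, assuming the same five colors as the sample data; the variable names (`colors`, `learned_classes`, `codes`, `decoded`) are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Same values as the sample 'Color' column (assumed for illustration)
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()

# Step 1: fit() learns the unique categories and stores them, sorted, in classes_
label_encoder.fit(colors)
learned_classes = label_encoder.classes_.tolist()

# Step 2: transform() replaces each category with its position in classes_
codes = label_encoder.transform(colors).tolist()

# inverse_transform() maps the integers back to the original categories
decoded = label_encoder.inverse_transform(codes).tolist()

print(learned_classes)  # ['Blue', 'Green', 'Red']
print(codes)            # [2, 0, 1, 0, 2]
print(decoded)          # ['Red', 'Blue', 'Green', 'Blue', 'Red']
```

Calling fit() once and transform() later is useful when the same mapping has to be applied to new data that was not available at fitting time.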

#### #2 One-hot Encoding

In [3]:
# Make copy of the original dataframe
oneHot_df_1 = df.copy()

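# Initialize the one-hot encoder, then fit and transform the Color column
# (double brackets are used because OneHotEncoder expects 2D input)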
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(oneHot_df_1[['Color']])

# Creating DataFrame for the encoded data
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))

# Concatenating the original DataFrame with the encoded DataFrame
oneHot_df_1 = pd.concat([oneHot_df_1, encoded_df], axis=1)

oneHot_df_1


,Color,Color_Blue,Color_Green,Color_Red
0,Red,0.0,0.0,1.0
1,Blue,1.0,0.0,0.0
2,Green,0.0,1.0,0.0
3,Blue,1.0,0.0,0.0
4,Red,0.0,0.0,1.0


##### **Notes on Code**

- The **sparse_output=False** parameter in OneHotEncoder:

By default, OneHotEncoder returns a SciPy sparse matrix. Setting **sparse_output=False** returns a dense matrix (a standard 2D NumPy array) instead. A sparse matrix is more memory-efficient for large datasets whose encoded columns are mostly zeros, but a dense array is often more convenient for small datasets or when the result is concatenated with other DataFrames. (In scikit-learn versions before 1.2, this parameter was named sparse.)

-------------------------------------------------------------------------------------------------------

- **encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))**

**encoded:** The result of the one-hot encoding, a 2D NumPy array with one binary (0.0 or 1.0) column per category.

**encoder.get_feature_names_out(['Color']):** Returns the column names for the one-hot encoded columns, built from the input feature name and each category. Here the names are Color_Blue, Color_Green, and Color_Red.

Each name is the prefix Color_ followed by a category from the original Color column, in the same sorted order as the columns of encoded.

In [4]:
# Here is another way of applying One-Hot Encoding!

# Make copy of the original dataframe
oneHot_df_2 = df.copy()

oneHot_df_2 = pd.get_dummies(oneHot_df_2, columns=['Color'])
oneHot_df_2

,Color_Blue,Color_Green,Color_Red
0,False,False,True
1,True,False,False
2,False,True,False
3,True,False,False
4,False,False,True


##### **Notes on Code**

**pd.get_dummies()** is a pandas function that converts categorical columns into one-hot encoded columns. The encoded columns replace the original Color column, and in pandas 2.0 and later they are boolean by default, which is why the output shows True/False rather than 1.0/0.0.

Arguments:
- oneHot_df_2: The DataFrame to encode.

- columns=['Color']: The column or columns to one-hot encode (in this case, Color).
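
If 0/1 integers are preferred over booleans, get_dummies() accepts a dtype argument. The following is a minimal sketch under the same sample data; `df_demo`, `dummies_int`, and `with_original` are names made up for this example.

```python
import pandas as pd

# Same values as the sample data (assumed for illustration)
df_demo = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# dtype=int produces 0/1 integers instead of booleans
dummies_int = pd.get_dummies(df_demo, columns=['Color'], dtype=int)

print(dummies_int.columns.tolist())       # ['Color_Blue', 'Color_Green', 'Color_Red']
print(dummies_int['Color_Red'].tolist())  # [1, 0, 0, 0, 1]

# get_dummies() drops the encoded column; concatenate to keep the original too
with_original = pd.concat([df_demo, dummies_int], axis=1)
print(with_original.columns.tolist())
```

Concatenating with the original DataFrame gives the same layout as the OneHotEncoder example, with Color alongside the three encoded columns.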