<div style ="font-family:Trebuchet MS; background-color : #f8f0fa; border-left: 5px solid #1b4332; padding: 12px; border-radius: 50px 50px;">
    <h2 style="color: #1b4332; font-size: 48px; text-align: center;">
        <b>Step 2 in Feature Engineering: Categorical Encoding</b>
        <hr style="border-top: 2px solid #264653;">
    </h2>
    <h3 style="font-size: 14px; color: #264653; text-align: left; "><strong>I hope you find this helpful. Let's get started.</strong></h3>
</div>

When working with machine learning models, categorical variables must be converted to numerical form because most algorithms require numerical inputs. This process is known as categorical encoding. Below, we'll explore four common methods: Label Encoding, One-Hot Encoding, Target Encoding, and Frequency/Count Encoding.

- We will practice with the [Titanic dataset](https://www.kaggle.com/datasets/brendan45774/test-file/data).

# 1. Label Encoding

Label Encoding converts each category into a unique integer. Because integers imply an order, this method is best suited to ordinal data, where the categories have a natural order (e.g., 'Low', 'Medium', 'High').
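
For ordinal data the integer codes should follow the intended ranking. `LabelEncoder` assigns integers in sorted order of the labels, which for 'Low', 'Medium', 'High' would give High=0, Low=1, Medium=2. A minimal sketch of one way to control the order, using scikit-learn's `OrdinalEncoder` on a hypothetical `priority` column (a toy example, not part of the Titanic data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, not part of the Titanic dataset
toy = pd.DataFrame({'priority': ['Low', 'High', 'Medium', 'Low']})

# Spell out the order explicitly so the codes follow Low < Medium < High
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

# fit_transform expects a 2D input and returns a 2D float array
toy['priority_encoded'] = encoder.fit_transform(toy[['priority']]).ravel()

print(toy)
```

With the order given explicitly, 'Low' maps to 0, 'Medium' to 1, and 'High' to 2.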

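Let's apply Label Encoding to the Sex column of the Titanic dataset, which has the categories 'male' and 'female'. With only two categories, the integer codes introduce no misleading order. `LabelEncoder` assigns codes in sorted order of the labels, so 'female' becomes 0 and 'male' becomes 1.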

In [5]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(r'..\Data\Titanic.csv')  # raw string so the backslashes are not treated as escape sequences
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [18]:
# Initialize the LabelEncoder
le = LabelEncoder()

# Apply the encoder to the Sex column
df['sex_le'] = le.fit_transform(df['Sex'])

# Display sex_le alongside the original Sex column
print(df[['Sex', 'sex_le']].head())

# Drop the original column now that it has been encoded
df = df.drop(columns=['Sex'])

      Sex  sex_le
0    male       1
1  female       0
2    male       1
3    male       1
4  female       0
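
To check which integer stands for which category, the fitted encoder can be inspected. The sketch below uses a small stand-in Series holding the same five values shown above (an assumption made so the example runs without the CSV file) and relies on the `classes_` attribute and `inverse_transform` method of `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the Sex column (same values as the five rows printed above)
sex = pd.Series(['male', 'female', 'male', 'male', 'female'])

le = LabelEncoder()
codes = le.fit_transform(sex)

# classes_ lists the categories in the order of their integer codes
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is useful if the codes later need to be translated back to labels.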


# 2. One-Hot Encoding

One-Hot Encoding creates a new binary column for each category of the variable. It is suitable for nominal data, where there is no inherent order among the categories (e.g., 'Red', 'Blue', 'Green').

Let's apply One-Hot Encoding to the Embarked column, which represents the port of embarkation with categories 'C', 'Q', and 'S'.

In [19]:
# One-hot encoding can be done with pd.get_dummies or scikit-learn's OneHotEncoder

# Using get_dummies; drop_first=True drops the column of the first category (Embarked_C)
df_one_hot = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

df_one_hot.head()

| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | sex_le | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | 1 | True | False |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | 0 | False | True |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | 1 | True | False |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | 1 | False | True |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | 0 | False | True |


In [25]:
# using OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# initialize the OneHotEncoder
one_hot = OneHotEncoder()

# Transform the column
embarked_one_hot = one_hot.fit_transform(df[['Embarked']])


# convert the sparse matrix to a data frame
embarked_one_hot_df = pd.DataFrame(embarked_one_hot.toarray(), columns=one_hot.get_feature_names_out())
embarked_one_hot_df.head()

| | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |


### Note
It is common practice to drop one of the columns generated by one-hot encoding to avoid perfect multicollinearity, a problem known as the "dummy variable trap". This matters most for linear models.

One-hot encoding creates a new binary column for each category, but these columns are not independent: any one of them can be derived from the others. For example, given the three columns generated from the Embarked feature (Embarked_C, Embarked_Q, Embarked_S), knowing the values of any two lets you infer the third. This dependency introduces multicollinearity, which can make coefficient estimates unstable or non-unique in models such as linear regression.
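
The redundancy described in this note can be checked numerically. The following is a minimal sketch on a hypothetical six-row toy column (not the Titanic file): it compares the rank of the design matrix, including an intercept column of ones, with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Toy column with the three ports of embarkation
toy = pd.DataFrame({'Embarked': ['C', 'Q', 'S', 'S', 'Q', 'C']})

full = pd.get_dummies(toy, columns=['Embarked'], dtype=int)
reduced = pd.get_dummies(toy, columns=['Embarked'], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so one column is redundant
row_sums = full.sum(axis=1).tolist()
print(row_sums)

# Add an intercept column of ones, as a linear model would
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

# Rank below the column count means the columns are linearly dependent
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(rank_full, X_full.shape[1])
print(rank_reduced, X_reduced.shape[1])
```

The full encoding has four columns but rank three, while the reduced encoding has full column rank.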

# 3. Frequency/Count Encoding

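Frequency/Count Encoding replaces each category with its frequency (proportion) or count in the dataset. This method is simple and can sometimes improve model performance when how often a category occurs is itself informative. Note that categories with the same count receive the same code and become indistinguishable.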

Let's apply Frequency Encoding to the Embarked column.

In [20]:
# Reload the data, since the earlier cells modified df
df = pd.read_csv(r'..\Data\Titanic.csv')

# Count the occurrences of each category in 'Embarked'
freq_encoding = df['Embarked'].value_counts()

# Map the counts to the 'Embarked' column
df['Embarked_FreqEncoded'] = df['Embarked'].map(freq_encoding)

print(df[['Embarked', 'Embarked_FreqEncoded']].head())


  Embarked  Embarked_FreqEncoded
0        Q                    46
1        S                   270
2        Q                    46
3        S                   270
4        S                   270
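
The cell above maps raw counts. A variant is to map relative frequencies, which `value_counts(normalize=True)` returns directly. The sketch below uses a hypothetical toy column in which 'C' and 'Q' are deliberately equally common, to show that categories with the same frequency end up with the same encoded value:

```python
import pandas as pd

# Hypothetical toy column; 'C' and 'Q' appear equally often on purpose
toy = pd.DataFrame({'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q']})

# normalize=True returns proportions instead of raw counts
freq = toy['Embarked'].value_counts(normalize=True)
toy['Embarked_Freq'] = toy['Embarked'].map(freq)

# One row per category: 'C' and 'Q' share the same encoded value
print(toy.drop_duplicates())
```

Proportions stay on the same scale when the dataset grows, whereas raw counts change with its size.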
