# Feature engineering encoding techniques

## Types of Encoding:
### Nominal encoding 
Nominal encoding applies to features whose categories have no meaningful order or arrangement. For example:
1. Gender: [Female, Male].
2. States: [New York, New Jersey, California, Georgia, etc.].

No category ranks above another in terms of the information it gives the model.

There are 5 types of nominal encoding:
1. One-hot encoding:
2. One-hot encoding for many categories:
3. Mean encoding:
4. -
5. -
### Ordinal encoding 
With ordinal encoding, the arrangement of the categories by rank is very important. For example:
1. Education level, for a model that tries to predict a person's salary: [Bachelors, Bcom, PhD, Masters].

In this case, a higher degree is assumed to correspond to a higher salary.

There are 4 types of ordinal encoding:
1. Label encoding:
2. Target guided ordinal encoding:
3. -
4. -

---
---

## Nominal Encoding:
### One-Hot Encoding:
**Example**:
| Countries | Mexico | Germany | France | Italy |
|-----------|--------|---------|--------|-------|
| Mexico    |   1    |    0    |    0   |   0   |
| Germany   |   0    |    1    |    0   |   0   |
| France    |   0    |    0    |    1   |   0   |
| Italy     |   0    |    0    |    0   |   1   |


### Dummy Variable Trap:
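With all four one-hot columns, any one column can be derived from the other three, because each row sums to 1. The columns are therefore perfectly collinear, and one of them can be dropped without losing information.

**Example**:
- 1 in the first column means Mexico
- 1 in the second column means Germany
- 1 in the third column means France
- Zeros in all three columns mean Italy (the last column can be excluded)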

| Countries | Mexico | Germany | France |
|-----------|--------|---------|--------|
| Mexico    |   1    |    0    |   0    |
| Germany   |   0    |    1    |   0    |
| France    |   0    |    0    |   1    |
| Italy     |   0    |    0    |   0    |


In [2]:
# Import necessary libraries
import pandas as pd

# Create a DataFrame from the given data
data = {'Countries': ['Mexico', 'Germany', 'France', 'Italy']}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Apply One-Hot Encoding using pandas.get_dummies
encoded_df = pd.get_dummies(df, columns=['Countries'])

# Display the DataFrame after One-Hot Encoding
print("\nOne-Hot Encoded DataFrame:")
print(encoded_df)

# Optional: Exclude one column to avoid the dummy variable trap
# (e.g., drop the last column "Countries_Italy")
encoded_df_no_trap = encoded_df.drop(columns=['Countries_Italy'])

# Display the DataFrame after avoiding dummy variable trap
print("\nDataFrame after avoiding Dummy Variable Trap:")
print(encoded_df_no_trap)


Original DataFrame:
  Countries
0    Mexico
1   Germany
2    France
3     Italy

One-Hot Encoded DataFrame:
   Countries_France  Countries_Germany  Countries_Italy  Countries_Mexico
0             False              False            False              True
1             False               True            False             False
2              True              False            False             False
3             False              False             True             False

DataFrame after avoiding Dummy Variable Trap:
   Countries_France  Countries_Germany  Countries_Mexico
0             False              False              True
1             False               True             False
2              True              False             False
3             False              False             False
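
The same result can be sketched in a single call. This is a minimal example, assuming the `drop_first` and `dtype` parameters of `pandas.get_dummies`; note that it drops the first category in sorted order (France here), not Italy as in the cell above. Any one column can be dropped to avoid the trap.

```python
import pandas as pd

df = pd.DataFrame({'Countries': ['Mexico', 'Germany', 'France', 'Italy']})

# drop_first=True removes the first category in sorted order ("France"),
# dtype=int produces 0/1 columns instead of booleans
encoded = pd.get_dummies(df, columns=['Countries'], drop_first=True, dtype=int)
print(encoded)
```

Here a row of all zeros stands for France, the dropped category.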


**Disadvantages**:

For a large number of categories, the dimensionality of the DataFrame grows by the number of categories minus one.

--> For 100 categories, 99 dimensions will be created/added.


This can lead to the curse of dimensionality.
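
The growth in columns can be checked with a small sketch. The feature below is synthetic (100 made-up labels named `cat_0` to `cat_99`), used only to count the resulting columns.

```python
import pandas as pd

# Synthetic feature with 100 distinct categories (hypothetical labels)
labels = pd.Series([f"cat_{i}" for i in range(100)], name="feature")

full = pd.get_dummies(labels)                      # one column per category
reduced = pd.get_dummies(labels, drop_first=True)  # one column dropped

print(full.shape[1], reduced.shape[1])
```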

---

### One-Hot Encoding for many categories (KDD Orange):
This type of encoding identifies only the top 10 categories (the most frequent) of a column. Once those top 10 categories are identified, only they are encoded.
For all other categories, every one-hot column is zero.

**Example**:



In [8]:
import pandas as pd
import numpy as np 

#Load the mercedes benz dataset.

data = pd.read_csv('Resources/Complete-Feature-Engineering-master/mercedesbenz.csv', usecols=['X1','X2','X3','X4','X5','X6'])
data.head()

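|   | X1 | X2 | X3 | X4 | X5 | X6 |
|---|----|----|----|----|----|----|
| 0 | v  | at | a  | d  | u  | j  |
| 1 | t  | av | e  | d  | y  | l  |
| 2 | w  | n  | c  | d  | x  | j  |
| 3 | t  | n  | f  | d  | x  | l  |
| 4 | v  | n  | f  | d  | h  | d  |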


In [9]:
for col in data.columns: 
    print(col, ': ', len(data[col].unique()), ' labels')

X1 :  27  labels
X2 :  44  labels
X3 :  7  labels
X4 :  4  labels
X5 :  29  labels
X6 :  12  labels


In [10]:
# Find the most frequent categories for the variable X2 (showing the top 20)
data.X2.value_counts().sort_values(ascending=False).head(20)

X2
as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
z       19
ag      19
Name: count, dtype: int64

In [12]:
# Store the top 10 most frequent categories in a list
top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [16]:
# Create dummy variables for the most frequent labels of a categorical variable

def one_hot_top_x(df, variable, top_x_labels):
    # Create one dummy column per label in top_x_labels.
    # The number of most frequent labels can be varied as convenient.
    # Note: the columns are added to df in place.

    for label in top_x_labels:
        # Use the df argument, not the global `data`
        df[variable + '_' + label] = np.where(df[variable] == label, 1, 0)

# Read the data again
data = pd.read_csv('Resources/Complete-Feature-Engineering-master/mercedesbenz.csv', usecols=['X1','X2','X3','X4','X5','X6'])

# Encode X2 into the top 10 most frequent categories
one_hot_top_x(data, 'X2', top_10)
data.head()

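|   | X1 | X2 | X3 | X4 | X5 | X6 | X2_as | X2_ae | X2_ai | X2_m | X2_ak | X2_r | X2_n | X2_s | X2_f | X2_e |
|---|----|----|----|----|----|----|-------|-------|-------|------|-------|------|------|------|------|------|
| 0 | v  | at | a  | d  | u  | j  | 0     | 0     | 0     | 0    | 0     | 0    | 0    | 0    | 0    | 0    |
| 1 | t  | av | e  | d  | y  | l  | 0     | 0     | 0     | 0    | 0     | 0    | 0    | 0    | 0    | 0    |
| 2 | w  | n  | c  | d  | x  | j  | 0     | 0     | 0     | 0    | 0     | 0    | 1    | 0    | 0    | 0    |
| 3 | t  | n  | f  | d  | x  | l  | 0     | 0     | 0     | 0    | 0     | 0    | 1    | 0    | 0    | 0    |
| 4 | v  | n  | f  | d  | h  | d  | 0     | 0     | 0     | 0    | 0     | 0    | 1    | 0    | 0    | 0    |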


**Advantages**:

- Straightforward to implement.
- Does not require hours of data exploration.
- Does not massively expand the feature space.

**Disadvantages**:

- Does not add any information that may make the variable more predictive.
- Does not keep the information of the ignored labels.
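
For readers without the CSV file, the technique can be tried on toy data. The sketch below is one possible variant, not the notebook's own function: `one_hot_top_n` and the `toy` DataFrame are hypothetical names, and the function returns a copy instead of modifying its input.

```python
import pandas as pd

def one_hot_top_n(df, column, n=10):
    """Return a copy of df with 0/1 columns for the n most frequent labels."""
    # value_counts() already sorts labels by descending frequency
    top_labels = df[column].value_counts().head(n).index
    out = df.copy()
    for label in top_labels:
        out[f"{column}_{label}"] = (out[column] == label).astype(int)
    return out

# Toy data: "as" appears 3 times, "ae" twice, "n" and "m" once each
toy = pd.DataFrame({"X2": ["as", "as", "as", "ae", "ae", "n", "m"]})
encoded = one_hot_top_n(toy, "X2", n=2)
print(encoded)
```

Rows holding `n` or `m` get zeros in every new column, which illustrates the second disadvantage above: the ignored labels become indistinguishable from one another.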


### Mean encoding:
**Example**:

Mean encoding replaces each categorical value with the mean of the target variable (usually a numeric variable) for that category.

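Some example use cases:
- Predicting conversion rates for marketing campaigns:
    - Replace campaign IDs with their historical average conversion rate.
- Predicting sales revenue for a product category:
    - Encode categories based on their average revenue.
- Customer churn models:
    - Replace demographic groups with their average churn rates.

The `Mean` values in the table below are illustrative only. Computed from just these eight rows, the means are A = 1.0, B = 0.5, R = 0.0 and T = 1.0, as the code output further down shows.
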
|    F1     |    o/p |  Mean  |
|-----------|--------|--------|
|     A     |   1    |  .73   |
|     B     |   0    |  .62   |
|     R     |   0    |  .50   |
|     T     |   1    |  .23   |
|     T     |   1    |  .23   |
|     B     |   1    |  .62   |
|     R     |   0    |  .50   |
|     T     |   1    |  .23   |

In [21]:
import pandas as pd

# Sample data based on the image
data = {
    'F1': ['A', 'B', 'R', 'T', 'T', 'B', 'R', 'T'],
    'o/p': [1, 0, 0, 1, 1, 1, 0, 1]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Compute the mean encoding for each category in 'F1' based on 'o/p'
mean_encoding = df.groupby('F1')['o/p'].mean()

# Map the mean encoding back to the 'F1' column and replace it
df['F1'] = df['F1'].map(mean_encoding)

# Add a 'Mean' column to display the mean values explicitly (optional, for comparison)
df['Mean'] = df['F1']

# Display the resulting DataFrame
print(df)


    F1  o/p  Mean
0  1.0    1   1.0
1  0.5    0   0.5
2  0.0    0   0.0
3  1.0    1   1.0
4  1.0    1   1.0
5  0.5    1   0.5
6  0.0    0   0.0
7  1.0    1   1.0
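
The cell above overwrites `F1` with its encoded values. A small alternative sketch, assuming the original labels should be kept for inspection, writes the means to a separate column with `groupby(...).transform('mean')`; the column name `F1_mean` is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    'F1': ['A', 'B', 'R', 'T', 'T', 'B', 'R', 'T'],
    'o/p': [1, 0, 0, 1, 1, 1, 0, 1]
})

# transform('mean') returns one value per row, aligned with the original index,
# so the original 'F1' labels stay untouched
df['F1_mean'] = df.groupby('F1')['o/p'].transform('mean')
print(df)
```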


---
---

## Ordinal Encoding:
### Label Encoding:

Take the education level again, for a model that tries to predict a person's salary: [Bachelor, Bcom, PhD, Master]. Each of the categories is assigned a rank: PhD as 4, Master as 3, Bcom as 2 and Bachelor as 1.

**Example**:
| Education | Label  |
|-----------|--------|
| Bachelor  |   1    |
| Bcom      |   2    |
| PhD       |   4    |
| Master    |   3    |


In [4]:
import pandas as pd

# Sample data
data = {'Education': ['Bachelor', 'PhD', 'Master', 'Bcom', 'Master', 'Bachelor', 'PhD']}

# Convert to DataFrame
df = pd.DataFrame(data)

# Define the mapping for label encoding
education_rank = {
    'Bachelor': 1,
    'Bcom': 2,
    'Master': 3,
    'PhD': 4
}

# Apply the mapping to encode the education levels
df['Education_Encoded'] = df['Education'].map(education_rank)

# Display the DataFrame
print(df)


  Education  Education_Encoded
0  Bachelor                  1
1       PhD                  4
2    Master                  3
3      Bcom                  2
4    Master                  3
5  Bachelor                  1
6       PhD                  4
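
The same ranking can also be expressed with an ordered pandas categorical instead of a hand-written dictionary. This is a sketch under the assumption that the rank order is Bachelor < Bcom < Master < PhD, as in the table above.

```python
import pandas as pd

df = pd.DataFrame({'Education': ['Bachelor', 'PhD', 'Master', 'Bcom',
                                 'Master', 'Bachelor', 'PhD']})

# Categories listed from lowest to highest rank
order = ['Bachelor', 'Bcom', 'Master', 'PhD']
education = pd.Categorical(df['Education'], categories=order, ordered=True)

# .codes is 0-based (and -1 for labels missing from `order`), so add 1
df['Education_Encoded'] = education.codes + 1
print(df)
```

A label that is not listed in `order` would end up encoded as 0 here, which makes unexpected values easy to spot.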


---

### Target guided ordinal encoding:

This type of encoding requires a categorical variable together with an output (target) variable. Given both, the encoding calculates the mean of the output for every category.

- This encoding technique can be useful when the categorical variable has an ordinal relationship with the output variable.

**Example**:

|    F1     |  O/P   |
|-----------|--------|
|     A     |   1    |
|     B     |   0    |
|     R     |   1    |
|     T     |   1    |
|     T     |   1    |
|     B     |   1    |
|     R     |   0    |
|     T     |   1    |


After the mean is calculated for each category, the categories are ranked by their means. In the code below, rank 0 goes to the category with the highest mean, and tied categories keep their alphabetical order.


In [20]:
import pandas as pd

# Sample data with categorical variable 'F1' and output reference 'O/P'
data = {'F1': ['A', 'B', 'R', 'T', 'T', 'B', 'R', 'T'],
        'O/P': [1, 0, 1, 1, 1, 1, 0, 1]}

df = pd.DataFrame(data)

# Calculate the mean value for each category
mean_values = df.groupby('F1')['O/P'].mean().to_dict()

# Rank the categories based on their mean values
ranks = {k: v for v, k in enumerate(sorted(mean_values, key=mean_values.get, reverse=True))}

# Create a new column with the assigned ranks
df['F1_Label'] = df['F1'].map(ranks)

print(df)

  F1  O/P  F1_Label
0  A    1         0
1  B    0         2
2  R    1         3
3  T    1         1
4  T    1         1
5  B    1         2
6  R    0         3
7  T    1         1


**Advantages**:

- Preserves Ordinal Information: It effectively captures the ordinal relationship between categories, especially when there's an inherent ranking or order.
- Improves Model Performance: In many cases, it can lead to improved model performance by providing more informative features to the model.
- Simple to Implement: It's relatively easy to implement, as demonstrated in the provided Python code.

**Disadvantages**:

- Sensitivity to Outliers: The encoding can be sensitive to outliers in the target variable, which might distort the ranking of categories.
- Data Leakage: If not handled carefully, it can introduce data leakage, especially when training and testing data are not properly separated.
- Overfitting: There's a risk of overfitting the training data, especially when the number of categories is large or the dataset is small.
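
One way to limit the data leakage mentioned above is to learn the ranks from the training split only and then apply them to unseen data. The sketch below is an illustration under assumptions, not a definitive recipe: the `train`/`test` split and the label `Z` are made up, tied categories share a rank (`method="dense"`), and 0 is an arbitrary placeholder for categories never seen in training.

```python
import pandas as pd

# Hypothetical training split (same values as the example above)
train = pd.DataFrame({'F1': ['A', 'B', 'R', 'T', 'T', 'B', 'R', 'T'],
                      'O/P': [1, 0, 1, 1, 1, 1, 0, 1]})
# Hypothetical test split; 'Z' never appears in training
test = pd.DataFrame({'F1': ['T', 'B', 'Z']})

# Learn the target means and ranks from the training data only
means = train.groupby('F1')['O/P'].mean()
# Dense ranking: a higher mean gets a higher rank, ties share the same rank
ranks = means.rank(method='dense').astype(int)

# Apply the learned ranks; unseen categories fall back to 0
test['F1_Label'] = test['F1'].map(ranks).fillna(0).astype(int)
print(test)
```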

---
---