
# Feature encoding 
Most machine learning models can't work with categorical data directly, so we use feature encoding to transform categorical data into numeric data.

---
### Label encoding
Transform categorical data into numerical data by assigning a number to each category, e.g. `male=1` and `female=0`. The downside is that your model may read an order or magnitude into those numbers and treat the higher values as 'more important'. Imagine `New York=7` and `London=112`: the model could treat London as having more weight/influence, even though those are just unique identifiers that we picked.
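A minimal sketch of the idea on a toy column. The `city` data and the integer codes are made up for illustration and are not part of the dataset used later:

```python
import pandas as pd

# Hypothetical toy data; the city names and codes are arbitrary choices.
toy = pd.DataFrame({"city": ["New York", "London", "New York", "Paris"]})

# Label encoding: every category gets one integer identifier.
city_codes = {"New York": 0, "London": 1, "Paris": 2}
toy["city_label"] = toy["city"].map(city_codes)

print(toy)
# The integers are only identifiers, but a model that does arithmetic on them
# will see Paris (2) as "twice" London (1), which has no real meaning.
```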

---
### One hot encoding
Use when independent variables are 'nominal', meaning the categories have no specific order (e.g. Gender, Location). It creates k columns, one per category; for each record one column is 1 (to indicate presence of that category) and the rest are 0 (to indicate absence). So if you had 3 locations, 'New York', 'Indiana', and 'Texas', and the current record is located in 'Texas', then `Texas=1`, but Indiana and New York are 0. This is popular because every category gets its own 0/1 column, so no artificial order is introduced between categories.
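Here is a small sketch of the three-location example using `pd.get_dummies`. The toy `location` column is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical toy data matching the example in the text.
toy = pd.DataFrame({"location": ["New York", "Indiana", "Texas", "Texas"]})

# dtype=int gives 0/1 values (newer pandas versions default to booleans).
one_hot_toy = pd.get_dummies(toy["location"], dtype=int)
print(one_hot_toy)

# Every row has exactly one column set to 1.
print((one_hot_toy.sum(axis=1) == 1).all())
```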

---
### Dummy encoding
A technique where we avoid redundancy by dropping a column whose value we're able to infer from the other columns. The reason we're removing a column is to avoid an issue in ML called multi-collinearity.
- **Multi-collinearity:** Happens when two or more independent variables are highly correlated. This leads to less reliable results, not so much in the predictions themselves, but in our ability to figure out how much each independent variable (input) individually influences the dependent variable (output), i.e. the 'weights' become inaccurate. This is pretty important, as some models assume each feature contributes independently to predicting the target variable. Some correlation between variables is inevitable (roughly 0.3-0.7), but anything above that is eyebrow raising.
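To see why one of the k columns is redundant, here is a rough sketch on a made-up `location` column (an illustrative assumption). It checks that the column removed by `drop_first=True` can be rebuilt from the remaining ones:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"location": ["New York", "Indiana", "Texas", "Texas"]})

full = pd.get_dummies(toy["location"], dtype=int)                      # k columns
reduced = pd.get_dummies(toy["location"], dtype=int, drop_first=True)  # k-1 columns

# The dropped column ("Indiana", first in sorted order) is recoverable:
# a row is Indiana exactly when every remaining column is 0.
inferred_indiana = 1 - reduced.sum(axis=1)
print((inferred_indiana == full["Indiana"]).all())
```

Because the full set of columns always sums to 1 in every row, keeping all k of them gives a model perfectly correlated inputs, which is the situation dummy encoding avoids.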

---
### Target Encoding
For each category, calculate the average of the dependent variable (y) over the rows that belong to that category, and replace the category with that mean value. Target encoding isn't commonly used, and there are some pros and cons you should know about:
- **Advantages:** Doesn't add dimensionality to the dataset. When you do target encoding and create the 'Gender Encoded' column, you probably aren't going to use the original 'Gender' column anymore, so you're adding one column and deleting one. With one-hot or dummy encoding, however, you could be adding many extra columns.
- **Disadvantages:** The encoding is computed from the target itself, so it depends on the target value's distribution. This means target encoding requires careful validation, as it can be prone to over-fitting.
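A minimal sketch of the core calculation, done by hand with a `groupby` on made-up data (the `gender` and `exited` values below are illustrative assumptions). Library encoders such as `category_encoders.TargetEncoder` also apply smoothing, so their numbers can differ from this plain per-category mean:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "exited": [0, 1, 1, 1, 0],
})

# Mean of the target within each category...
category_means = toy.groupby("gender")["exited"].mean()

# ...written back onto every row of that category.
toy["gender_encoded"] = toy["gender"].map(category_means)
print(toy)
```

Note that the means here are computed on the same rows they are written to, which is exactly where the over-fitting risk comes from.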

---
### Hash Encoder
Encodes categorical variables into numerical values using a hash function. This is how it works:
1. **Apply a hash function:** Each category is passed through a hash function (e.g. MD5, SHA-1) that converts the category into a fixed-length number.
2. **Map to a smaller vector:** The result of the hash function is mapped to a fixed number of bins/buckets, which is typically smaller than the total number of unique categories. This is done with the modulo operator, and it works just like a hashmap. For example, if the color red hashes to `12345`, then `12345 % 10 = 5`, so red is assigned to bucket 5.
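The two steps above can be sketched with the standard library. The helper name `hash_bucket` is hypothetical, and MD5 is used only because it gives the same result on every run (Python's built-in `hash()` for strings is randomized per process):

```python
import hashlib

def hash_bucket(category: str, n_bins: int = 10) -> int:
    """Map a category string to a bucket index in the range [0, n_bins)."""
    # Step 1: hash the category to a large integer.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    hashed = int(digest, 16)
    # Step 2: fold that integer into one of n_bins buckets.
    return hashed % n_bins

for color in ["red", "green", "blue"]:
    print(color, "->", hash_bucket(color))
```

The same category always lands in the same bucket, so the mapping never has to be stored.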

---
#### Example of hash encoding
Let's say we have a `Color` feature with 1,000 unique values, but we don't want to create 1,000 new columns via one-hot encoding. With hash encoding, we use a hash function to convert each unique value into a smaller, fixed-size vector, say with 10 bins. Each color is assigned to one of those bins based on its hashed value.

---
#### Why use it and when to?
Used when a feature has a high cardinality (a lot of unique values). It converts these features into numerical values efficiently, without creating something like a thousand new columns.
This is especially useful when there are many categorical features, since it keeps your system scalable and resource efficient if that's needed in your environment.

The main downside is the risk of collisions, which occur when two distinct categories are put in the same bin. You can reduce this by using more bins, at the cost of more memory. Another downside is that things aren't very interpretable. With one-hot encoding each column corresponds to a category you can read, but with hashed features you can't really tell what the original category was.
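As a rough illustration of the collision trade-off, the sketch below hashes 1,000 made-up category names into different numbers of bins and counts how many of them end up sharing a bin. The `hash_bucket` helper and the `color_<i>` names are assumptions made for the example:

```python
import hashlib
from collections import Counter

def hash_bucket(category: str, n_bins: int) -> int:
    """Hypothetical helper: MD5 hash of the category, folded into n_bins."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_bins

categories = [f"color_{i}" for i in range(1000)]  # 1,000 made-up categories

shared_by_bins = {}
for n_bins in (10, 100, 10_000):
    counts = Counter(hash_bucket(c, n_bins) for c in categories)
    # Number of categories sitting in a bin with at least one other category.
    shared_by_bins[n_bins] = sum(n for n in counts.values() if n > 1)
    print(f"{n_bins:>6} bins: {shared_by_bins[n_bins]} of {len(categories)} categories share a bin")
```

With only 10 bins almost every category shares a bin. Adding bins lowers that count but widens the encoded feature vector.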

---
### Def. Over-fitting:
When a model learns too much from the training data, including noise/irrelevant data. This causes the model to perform well on that specific training data, but poorly on new data. The model becomes too tailored to that specific set of data, capturing patterns that don't generalize well to other data sets.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
df = pd.read_csv("./data/Churn_Modelling.csv")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [None]:
# Obvious by now, but drop your un-needed columns
df.drop(columns=['CustomerId', 'RowNumber', 'Surname'], inplace=True)

# Let's pick gender for encoding, but first clean any missing values 
df["Gender"] = df["Gender"].fillna("Male")

# Okay after this we can start with encoding


In [None]:
'''
+ Label Encoding: Let's do label encoding by mapping 'Female' to 0 and 'Male' to 1
(LabelEncoder assigns the integers in sorted order of the classes).

NOTE: Always check the comments and documentation. For this sklearn function
it's not recommended to do it on x variables, like how we're doing it now. Ideally
you would do it on output/y/dependent variables such as "Exited", but for the example 
we'll ignore this.
'''

labelEncoder = preprocessing.LabelEncoder()
df['gender_label'] = labelEncoder.fit_transform(df['Gender'].values)

In [5]:
'''
+ One Hot Encoding: Create one indicator column per unique value of 'Geography'.

'''

one_hot = pd.get_dummies(df['Geography'])


In [None]:
'''
+ Dummy Encoding: The 3 columns we're getting from one-hot are called dummy columns.
We generate n dummy columns, but the idea is that we only need n-1 columns.

'''
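# With no column list given, get_dummies encodes every object column
# (here 'Geography' and 'Gender') and drops the first level of each.
df_dummies = pd.get_dummies(df, drop_first=True)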

In [None]:
'''
+ Target Encoding: Another encoding technique, but again notice how we're passing in
an independent variable gender, and a dependent/target variable 'Exited'.

'''
from category_encoders import TargetEncoder
encoder = TargetEncoder()

df['Gender Encoded'] = encoder.fit_transform(df['Gender'], df['Exited'])
