<a href="https://colab.research.google.com/github/BrianOuko/python_for_ML/blob/Encoding_categorical_data/encoding_categorical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Encoding Categorical Data for ML Algorithms

This exercise explores several ways to encode categorical data so that ML algorithms can use it.
Encoding lets us use categorical data as features for our ML models.
So what is categorical data?

Categorical data is data whose values are text labels rather than numbers.
Converting categorical variables to numbers before we train an ML model is called encoding.


1. Label Encoding: Assigns a unique integer to each category.
Suitable for ordinal data where there is a meaningful order among the categories. Example: Encoding "low," "medium," and "high" as 0, 1, and 2.

2. One-Hot Encoding: Creates binary columns for each category and represents the presence of a category with a 1 and its absence with a 0.
Suitable for nominal data where there is no inherent order among the categories. Example: Encoding colors like "red," "green," and "blue" as [1, 0, 0], [0, 1, 0], and [0, 0, 1].

3. Ordinal Encoding: Similar to label encoding, but the assigned integers represent the ordinal relationship between categories.
Suitable for ordinal data where the order matters.
Example: Encoding education levels like "high school," "college," and "graduate" as 1, 2, and 3.

4. Binary Encoding: Assigns each category an integer, writes that integer in binary, and creates one 0/1 column per binary digit.
Reduces dimensionality compared to one-hot encoding while preserving some information.
Example: Eight categories numbered 0-7 become 000, 001, 010, 011, 100, 101, 110, 111, so three columns replace eight one-hot columns.

5. Frequency Encoding: Encodes categories based on their frequency in the dataset.
Assigns a weight to each category based on its occurrence.
Useful when the frequency of a category is informative.
Example: A category that appears in 30 of 100 rows is encoded as 30 (its count) or 0.3 (its share of the rows).

6. Target Encoding (Mean Encoding): Uses the mean of the target variable for each category as the encoded value.
Can improve the model's predictive performance in some cases.
Should be used cautiously to avoid data leakage and overfitting.
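
Frequency and binary encoding from the list above are not demonstrated later in this notebook, so here is a minimal sketch of both using only pandas. The `colors` column and the `Color_bit*` column names are hypothetical and chosen just for illustration; dedicated libraries may number categories or name columns differently.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration.
colors = pd.Series(["red", "green", "blue", "green", "red", "green"], name="Color")

# Frequency encoding: replace each category with its share of the rows.
freq = colors.value_counts(normalize=True)
freq_encoded = colors.map(freq)

# Binary encoding: first give each category an integer code,
# then spread the bits of that code over separate 0/1 columns.
codes = colors.astype("category").cat.codes  # alphabetical: blue=0, green=1, red=2
n_bits = max(int(codes.max()).bit_length(), 1)
binary_encoded = pd.DataFrame(
    {f"Color_bit{b}": (codes.to_numpy() >> b) & 1 for b in reversed(range(n_bits))},
    index=colors.index,
)

print(pd.concat([colors, freq_encoded.rename("Color_freq"), binary_encoded], axis=1))
```

With three categories, two bit columns are enough, whereas one-hot encoding would need three columns.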

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

**Define the Data**

In [24]:
data = np.array([['good'], ['very good'], ['excellent']])
df = pd.DataFrame(data, columns=["Rating"], index=["Rater 1", "Rater 2", "Rater 3"])
print("Data before encoding:")
print(df)


Data before encoding:
            Rating
Rater 1       good
Rater 2  very good
Rater 3  excellent


**Define ordinal encoding**

In [25]:
encoder = OrdinalEncoder()
# With no arguments, the encoder infers the categories from the data and sorts them alphabetically.
# Alternatively, define the order yourself and pass it in: encoder = OrdinalEncoder(categories=categories)


In [26]:
# One inner list per column, ordered from lowest to highest rating
categories = [['good', 'very good', 'excellent']]
encoder = OrdinalEncoder(categories=categories)

**Transform data** The `fit_transform` call fits the OrdinalEncoder to the "Rating" column of the DataFrame `df` and, in the same step, transforms the original categorical values into numerical representations.

In [27]:
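# Select the "Rating" column explicitly so the cell still works if df gains more columns
df["Encoded Rating"] = encoder.fit_transform(df[["Rating"]]).ravel()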
print("\nData after encoding:")
print(df)


Data after encoding:
            Rating  Encoded Rating
Rater 1       good             0.0
Rater 2  very good             1.0
Rater 3  excellent             2.0
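
To illustrate why the explicit `categories` list matters, the sketch below compares the default encoder with the ordered one on the same three ratings. The `ratings` DataFrame is a stand-in for `df` above, and the variable names are chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings = pd.DataFrame({"Rating": ["good", "very good", "excellent"]})

# Without `categories`, the encoder sorts the labels alphabetically,
# so "excellent" gets 0 and the intended ranking is lost.
default_codes = OrdinalEncoder().fit_transform(ratings).ravel()

# With an explicit order, the integers follow the ranking we care about.
ordered = OrdinalEncoder(categories=[["good", "very good", "excellent"]])
ordered_codes = ordered.fit_transform(ratings).ravel()

print(default_codes)   # [1. 2. 0.]
print(ordered_codes)   # [0. 1. 2.]

# inverse_transform maps the integers back to the original labels.
print(ordered.inverse_transform([[2.0]]))
```

The default ordering is only safe when alphabetical order happens to match the real ranking, which is not the case here.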


**One-hot encoding is provided by the `OneHotEncoder` class in the scikit-learn library. The following is an example of its use.**

In [28]:
from sklearn.preprocessing import OneHotEncoder

# Define the data

data = np.array([['Miami'], ['Sydney'], ['New York']])
df = pd.DataFrame(data, columns=["City"], index=["Alex", "Joe", "Alice"])
print("Data before encoding:")
print(df)

# Define the one-hot encoder.
# categories='auto' infers the categories from the data (sorted alphabetically).
# sparse_output=False returns a dense array; scikit-learn versions before 1.2 call this parameter sparse.
encoder = OneHotEncoder(categories='auto', sparse_output=False)

# Transform the data
encoded_data = encoder.fit_transform(df)

# fit_transform returns a NumPy array, which is then converted to a pandas DataFrame
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['City']), index=df.index)
print("\nData after encoding:")
print(df_encoded)


Data before encoding:
           City
Alex      Miami
Joe      Sydney
Alice  New York

Data after encoding:
       City_Miami  City_New York  City_Sydney
Alex          1.0            0.0          0.0
Joe           0.0            0.0          1.0
Alice         0.0            1.0          0.0




**Target encoding**, on the other hand, replaces each category with the mean of the target variable over the rows that belong to that category.



In [31]:
!pip install category_encoders
from category_encoders import TargetEncoder

# Define the data
fruit = ['Apple', 'Banana', 'Banana', 'Tomato', 'Apple', 'Tomato', 'Apple', 'Banana', 'Tomato', 'Tomato']
target = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
df = pd.DataFrame(list(zip(fruit, target)), columns=["Fruit", "Target"])
# Equivalent, using a dictionary instead of zip: df = pd.DataFrame({"Fruit": fruit, "Target": target})

print("\nData before encoding:")
print(df)

# Define the encoder

encoder = TargetEncoder(smoothing=0.01)

# Smoothing helps reduce the risk of overfitting and improves generalisation.
# A smoothing value > 0 acts as regularisation: it blends each category's target mean with the overall mean.
# It is especially helpful for small datasets.

# Transform the data
df["Fruit Encoded"] = encoder.fit_transform(df["Fruit"], df["Target"])
print(df)


# Refit with a different smoothing value and print the encoded values.
# In the output below every row is 0.4, the overall mean of Target, not a per-fruit mean.
# A likely cause is the encoder's min_samples_leaf setting being larger than these tiny category counts.
encoder = TargetEncoder(smoothing=0.1)
encoded_values = encoder.fit(df["Fruit"], df["Target"]).transform(df["Fruit"])
print("Encoded values (smoothing=0.1):")
print(encoded_values)



Data before encoding:
    Fruit  Target
0   Apple       1
1  Banana       0
2  Banana       0
3  Tomato       0
4   Apple       1
5  Tomato       1
6   Apple       0
7  Banana       1
8  Tomato       0
9  Tomato       0
    Fruit  Target  Fruit Encoded
0   Apple       1            0.4
1  Banana       0            0.4
2  Banana       0            0.4
3  Tomato       0            0.4
4   Apple       1            0.4
5  Tomato       1            0.4
6   Apple       0            0.4
7  Banana       1            0.4
8  Tomato       0            0.4
9  Tomato       0            0.4
Encoded values (smoothing=0.1):
   Fruit
0    0.4
1    0.4
2    0.4
3    0.4
4    0.4
5    0.4
6    0.4
7    0.4
8    0.4
9    0.4
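
To see what per-category target means look like for this data, the sketch below computes them by hand with pandas. It also shows a simple additive smoothing toward the overall mean. Note the assumptions: the weight `m = 5` and the column names `Fruit Mean` and `Fruit Smoothed` are hypothetical, and this additive formula is a simpler scheme than the one `category_encoders` uses internally.

```python
import pandas as pd

fruit = ["Apple", "Banana", "Banana", "Tomato", "Apple",
         "Tomato", "Apple", "Banana", "Tomato", "Tomato"]
target = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
df = pd.DataFrame({"Fruit": fruit, "Target": target})

global_mean = df["Target"].mean()  # overall mean of the target
stats = df.groupby("Fruit")["Target"].agg(["mean", "count"])

# Unsmoothed target encoding: each fruit becomes its own target mean.
df["Fruit Mean"] = df["Fruit"].map(stats["mean"])

# Additive smoothing with a hypothetical weight m: small categories are
# pulled toward the overall mean, large ones stay close to their own mean.
m = 5
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["Fruit Smoothed"] = df["Fruit"].map(smoothed)

print(stats)
print(df)
```

Here the raw means are 2/3 for Apple, 1/3 for Banana, and 0.25 for Tomato, while the overall mean is 0.4. The smoothed values sit between each raw mean and 0.4.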
