
# What is One-Hot Encoding?

One-hot encoding converts a categorical feature into a binary matrix (or array): each unique category gets its own position, and every value is represented as a vector with a single “1” (hot) at its category's position and “0” (cold) everywhere else.

Unlike label encoding, which maps categories to integers and therefore implies an order (for example, Cherry = 2 > Apple = 0), one-hot encoding does not introduce unintended ordinal relationships between categories.
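
To make the difference concrete, here is a minimal sketch that compares integer label codes with one-hot vectors for three fruit categories. The category names, the variable names, and the use of NumPy are illustrative assumptions; this is not how any particular library implements its encoders.

```python
import numpy as np

categories = ["Apple", "Banana", "Cherry"]

# Label encoding: each category becomes an integer code (0, 1, 2).
label_codes = {name: code for code, name in enumerate(categories)}

# One-hot encoding: each category becomes one row of the identity matrix.
identity = np.eye(len(categories), dtype=int)
one_hot = {name: identity[code] for name, code in label_codes.items()}

# With label codes, Apple appears twice as far from Cherry as from Banana.
label_gap_banana = abs(label_codes["Apple"] - label_codes["Banana"])
label_gap_cherry = abs(label_codes["Apple"] - label_codes["Cherry"])

# With one-hot vectors, every pair of distinct categories is equally far apart.
onehot_gap_banana = np.linalg.norm(one_hot["Apple"] - one_hot["Banana"])
onehot_gap_cherry = np.linalg.norm(one_hot["Apple"] - one_hot["Cherry"])

print(label_gap_banana, label_gap_cherry)
print(onehot_gap_banana, onehot_gap_cherry)
```

A model that treats inputs as magnitudes would read the label codes as "Cherry is further from Apple than Banana is", which is an artifact of the numbering rather than a property of the data.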



## How Does One-Hot Encoding Work?

For example, consider the categorical feature Fruit:

Fruit | One-Hot Encoding
--- | ---
Apple | [1, 0, 0]
Banana | [0, 1, 0]
Cherry | [0, 0, 1]

*	Each category (Apple, Banana, Cherry) is represented as a separate column.
*	A value of 1 in a column indicates the presence of the category.
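
The mapping in the table can be reproduced by hand. The sketch below assumes the columns are ordered alphabetically by category name, which is a choice made here for illustration; library encoders may order columns differently.

```python
import numpy as np

fruits = ["Apple", "Banana", "Cherry", "Apple"]

# Sorted unique categories define the column order.
categories = sorted(set(fruits))
column_of = {name: i for i, name in enumerate(categories)}

# One row per sample, one column per category, all zeros to start.
encoded = np.zeros((len(fruits), len(categories)), dtype=int)
for row, fruit in enumerate(fruits):
    # Mark the single column that matches this sample's category.
    encoded[row, column_of[fruit]] = 1

print(categories)
print(encoded)
```

Each row of `encoded` contains exactly one 1, in the column for that sample's fruit.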



## When to Use One-Hot Encoding?

1.	No Ordinal Relationship Exists:
  *	For categorical variables without inherent order (e.g., fruits, colors), one-hot encoding is preferred over label encoding.
2.	Compatibility with Algorithms:
  *	Used for algorithms that interpret numerical values as having magnitude or distance, such as linear regression, support vector machines, or neural networks.
3.	Small Cardinality:
  *	Suitable when the number of unique categories is relatively small, as one-hot encoding increases the dimensionality of the dataset.
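
One way to act on the cardinality guideline is to count the distinct values in each column before encoding. The sketch below is only an illustration: the DataFrame, its column names, and the `MAX_CATEGORIES` threshold are all hypothetical, and a sensible threshold depends on the dataset and the model.

```python
import pandas as pd

# Hypothetical data: one low-cardinality and one high-cardinality column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "green", "blue"],
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
})

MAX_CATEGORIES = 3  # illustrative threshold, not a recommended value

# Count distinct values per column and keep only the small ones.
cardinality = df.nunique()
to_encode = cardinality[cardinality <= MAX_CATEGORIES].index.tolist()

# Encode only the selected columns; the others are left untouched.
encoded = pd.get_dummies(df, columns=to_encode, dtype=int)

print(cardinality.to_dict())
print(encoded.columns.tolist())
```

Here only `color` is encoded, while `user_id` (one distinct value per row) is left as-is for some other treatment.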



## Limitations of One-Hot Encoding

1.	High Dimensionality:
  *	For features with high cardinality (e.g., hundreds or thousands of unique categories), one-hot encoding significantly increases the number of columns, leading to memory inefficiency and slower computation.
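2.	Sparsity:
  *	The resulting matrix is sparse, as most values are 0. Stored densely, it wastes memory, and algorithms or libraries without sparse-matrix support may handle it inefficiently.
3.	Curse of Dimensionality:
  *	Distance-based algorithms like k-Nearest Neighbors may perform poorly on high-dimensional one-hot-encoded data because distances between points become less informative. Decision trees may also struggle, since each binary column can only separate one category from all the others.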
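
The memory cost of sparsity can be estimated directly. The sketch below uses a synthetic feature (the row count and number of category IDs are made up) and relies on `OneHotEncoder` returning a SciPy sparse matrix by default; the byte counts are rough estimates based on the underlying arrays, not a precise measurement of process memory.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature: 10,000 rows drawn from up to 1,000 distinct category IDs.
rng = np.random.default_rng(0)
values = rng.integers(0, 1000, size=(10_000, 1))

# With default settings, OneHotEncoder returns a SciPy sparse matrix.
sparse_encoded = OneHotEncoder().fit_transform(values)
n_rows, n_cols = sparse_encoded.shape

# Sparse storage keeps only the non-zero entries plus their index arrays.
sparse_bytes = (
    sparse_encoded.data.nbytes
    + sparse_encoded.indices.nbytes
    + sparse_encoded.indptr.nbytes
)

# A dense array would store every cell, zeros included.
dense_bytes = n_rows * n_cols * sparse_encoded.dtype.itemsize

print(f"shape: {n_rows} x {n_cols}")
print(f"sparse: {sparse_bytes:,} bytes, dense: {dense_bytes:,} bytes")
```

Because each row holds a single non-zero value, the sparse form grows with the number of rows, while the dense form grows with rows times categories.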



## Example Implementation

### Step-by-Step Example Using Python

```python3
import pandas as pd

# Example dataset
data = {'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple']}
df = pd.DataFrame(data)

# One-hot encoding using pandas
# dtype=int yields 0/1 columns; pandas 2.0+ returns True/False by default
df_encoded = pd.get_dummies(df, columns=['Fruit'], dtype=int)
print(df_encoded)
```

Output:

```text
   Fruit_Apple  Fruit_Banana  Fruit_Cherry
0            1             0             0
1            0             1             0
2            0             0             1
3            1             0             0
```

## Summary

*	One-hot encoding is a robust method to represent categorical data without introducing ordinal bias.
*	Best for categorical features with a small number of unique values.
*	Supported directly by common libraries such as pandas, scikit-learn, TensorFlow, and PyTorch; pairing it with a sparse representation helps reduce memory usage.

# Step-by-Step Guide

## Step 1: Import Necessary Libraries

```python3
from sklearn.datasets import load_iris
import pandas as pd
```

## Step 2: Load and Prepare the Iris Dataset

The Iris dataset labels each flower with one of three species: setosa, versicolor, and virginica. scikit-learn stores these labels as integer targets, so we map them back to their names to create a categorical column, Species.

```python3
# Load Iris dataset
iris = load_iris()

# Create a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Species'] = iris.target_names[iris.target]  # Add the categorical column
```

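```python3
df.head()
```

Output:

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |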


## Step 3: One-Hot Encode Using pandas

```python3
# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Species'])
```

```python3
df_encoded.head()
```

Output:

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species_setosa | Species_versicolor | Species_virginica |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | True | False | False |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | True | False | False |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | True | False | False |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | True | False | False |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | True | False | False |


In this output:

*	Species_setosa, Species_versicolor, and Species_virginica are the one-hot encoded columns.
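*	Each row has True in the column corresponding to its species and False in the other two. pandas 2.0+ returns booleans by default; pass dtype=int to pd.get_dummies to get 1 and 0 instead.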
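
If integer columns are preferred, the `dtype` argument of `pd.get_dummies` can request them. The sketch below uses a four-row stand-in for the Species column rather than the full Iris DataFrame, and adds a simple sanity check (an assumption of this example, not part of the original workflow) that every row contains exactly one 1.

```python
import pandas as pd

# Small stand-in for the Species column (values taken from the Iris example).
df = pd.DataFrame({"Species": ["setosa", "versicolor", "virginica", "setosa"]})

# dtype=int requests 0/1 integers rather than booleans.
encoded = pd.get_dummies(df, columns=["Species"], dtype=int)

# A valid one-hot encoding has exactly one 1 in every row.
row_sums = encoded.sum(axis=1)

print(encoded)
print(row_sums.tolist())
```

The row-sum check is a cheap way to confirm that no row was left unencoded or assigned to more than one category.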

## Step 4: One-Hot Encode Using sklearn

Alternatively, you can use sklearn’s OneHotEncoder:

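```python3
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense NumPy array
# (scikit-learn 1.2+; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df[['Species']])

# Wrap the array in a DataFrame with readable column names
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Species']))

# Replace the original Species column with the encoded columns
df_combined = pd.concat([df.drop('Species', axis=1), encoded_df], axis=1)
```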

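```python3
df_combined.head()
```

Output:

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species_setosa | Species_versicolor | Species_virginica |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 1.0 | 0.0 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 1.0 | 0.0 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1.0 | 0.0 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 1.0 | 0.0 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 1.0 | 0.0 | 0.0 |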


## Step 5: Explanation of the Columns

*	The Species_setosa, Species_versicolor, and Species_virginica columns are binary, with 1 indicating the presence of the corresponding species and 0 indicating its absence. They appear as 1.0 and 0.0 because OneHotEncoder produces floating-point values by default.

## Summary of Steps

1.	Load the dataset into a pandas DataFrame (here via sklearn's load_iris).
2.	Use pd.get_dummies for one-hot encoding in pandas (simple and direct).
3.	Use sklearn.preprocessing.OneHotEncoder if you want more control over the encoding or need sparse matrix output.

One-hot encoding is particularly useful for categorical variables without inherent order, ensuring they are represented correctly for machine learning models.
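
As one example of the extra control `OneHotEncoder` offers, the sketch below fits the encoder on one set of categories and then transforms data containing a label it has never seen. The `train` and `new` DataFrames and the `"unknown"` label are made up for illustration; `handle_unknown="ignore"` is the scikit-learn option that makes this work without raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Species": ["setosa", "versicolor", "virginica"]})
# "unknown" is a made-up label that never appears in the training data.
new = pd.DataFrame({"Species": ["versicolor", "unknown"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Species"]])

# The default output is sparse; toarray() converts it to a dense array for display.
encoded_new = encoder.transform(new[["Species"]]).toarray()

print(encoder.categories_[0].tolist())
print(encoded_new)
```

Fitting once and reusing the same encoder keeps the column layout identical between training and later data, which `pd.get_dummies` does not guarantee when the two sets contain different categories.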

