<a href="https://colab.research.google.com/github/AppleBoiy/intro-to-machine-learning/blob/main/preprocess/label_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Label Encoding?

Label encoding is a data preprocessing technique used in machine learning to convert categorical data (non-numeric) into numerical form. Each unique category value is assigned a unique integer value. For example:

Category | Encoded Value
--- | ---
Apple | 0
Banana | 1
Cherry | 2

Label encoding is useful when an algorithm requires numerical input or output and cannot handle categorical data directly.
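
As a minimal sketch of the idea, the snippet below encodes a small list of fruit names with scikit-learn's `LabelEncoder`. The `fruits` list is made up for illustration; the integers follow the sorted order of the unique categories.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative data (not from any real dataset)
fruits = ["Banana", "Apple", "Cherry", "Apple"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fruits)

# classes_ holds the sorted unique categories
print(encoder.classes_.tolist())  # ['Apple', 'Banana', 'Cherry']
# Each value is replaced by the position of its category in classes_
print(codes.tolist())             # [1, 0, 2, 0]
```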


## When Do We Use Label Encoding?

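1.	Tree-Based Models (e.g., Decision Trees, Random Forests):
  *	These models split on thresholds of a feature's values rather than treating the integers as magnitudes, so they are much less sensitive to the arbitrary order that label encoding introduces. The order can still affect which splits are available, but repeated splits can isolate individual categories.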

2.	When There Is No Ordinal Relationship Between Categories:
	*	Label encoding can be applied to categories that have no inherent order or ranking, but only with algorithms that are robust to arbitrary integer mappings.

3.	Memory Efficiency:
	*	Label encoding keeps a single integer column per feature, whereas one-hot encoding adds one column per unique category. For datasets with many categorical features, this makes label encoding considerably more memory-efficient.
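
To make the memory point concrete, here is a rough sketch that encodes the same column both ways and compares array sizes. The `colors` column is invented for illustration, and the exact byte counts depend on the dtype chosen (here `int8` for both).

```python
import numpy as np
import pandas as pd

# Illustrative column: 4 categories repeated 250 times (1000 rows)
colors = pd.Series(["red", "green", "blue", "yellow"] * 250, name="color")

# Label encoding: a single integer column
label_encoded = colors.astype("category").cat.codes.to_numpy().astype(np.int8)

# One-hot encoding: one column per unique category
one_hot = pd.get_dummies(colors, dtype=np.int8).to_numpy()

print(label_encoded.shape, label_encoded.nbytes)  # (1000,) 1000
print(one_hot.shape, one_hot.nbytes)              # (1000, 4) 4000
```

The gap grows with the number of unique categories, since one-hot encoding adds a column for each one.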



## Limitations of Label Encoding

1.	Introduces Unintended Ordinal Relationships:
	*	Machine learning algorithms that interpret numerical values as having a magnitude or order (e.g., linear regression, support vector machines) may misinterpret the encoded integers as ordinal values.
	*	For instance, the encoded values 0, 1, and 2 might imply a ranking or a distance between categories (e.g., Apple < Banana < Cherry), which may not exist.
2.	Bias in Distance-Based Models:
	*	Algorithms like k-Nearest Neighbors or clustering methods that rely on calculating distances can be biased if the categorical values are encoded numerically, as they will assume numerical similarity or dissimilarity between categories.
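3.	Not Suitable for High Cardinality:
	*	If a categorical feature has many unique values, label encoding produces a wide range of arbitrary integers. Models that treat these values as magnitudes then see large, meaningless differences between categories, which can make learning difficult and lead to suboptimal results.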
4.	Overfitting in Some Models:
  *	Models that are sensitive to numerical values may fit patterns in the arbitrary numbering itself rather than in the underlying data, which can increase the risk of overfitting.
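
The distance problem described above can be sketched with plain NumPy. The codes 0, 1, 2 for Apple, Banana, Cherry follow the earlier table; the one-hot vectors are the rows of an identity matrix.

```python
import numpy as np

# Label-encoded categories (Apple=0, Banana=1, Cherry=2)
apple, banana, cherry = 0, 1, 2
d_label_ab = abs(apple - banana)  # distance Apple-Banana
d_label_ac = abs(apple - cherry)  # distance Apple-Cherry

# One-hot encoded categories: each row is one category
one_hot = np.eye(3)
d_onehot_ab = np.linalg.norm(one_hot[0] - one_hot[1])
d_onehot_ac = np.linalg.norm(one_hot[0] - one_hot[2])

print(d_label_ab, d_label_ac)  # 1 2
print(np.isclose(d_onehot_ab, d_onehot_ac))  # True
```

With label encoding, Cherry appears twice as far from Apple as Banana does, purely because of the numbering. With one-hot encoding, every pair of categories is equally far apart.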



## Alternatives to Label Encoding

1.	One-Hot Encoding:
	*	Converts categories into binary vectors. Useful for algorithms sensitive to unintended ordinal relationships but increases dimensionality.
2.	Target/Mean Encoding:
	*	Replaces each category with the mean of the target variable for that category. Requires careful validation to prevent data leakage.
3.	Binary Encoding:
	*	Assigns each category an integer and writes that integer in binary, using one column per bit. This needs far fewer columns than one-hot encoding while avoiding a single integer scale.
4.	Ordinal Encoding (Explicit):
	*	If a feature has an inherent order (e.g., Low < Medium < High), explicitly encode it to reflect the order.
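
As a brief sketch of two of these alternatives, the snippet below applies one-hot encoding with `pd.get_dummies` and explicit ordinal encoding with scikit-learn's `OrdinalEncoder`. The `df_demo` frame and its column names (`fruit`, `level`) are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one nominal column and one ordered column
df_demo = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Cherry", "Apple"],
    "level": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(df_demo["fruit"], prefix="fruit", dtype=int)
print(one_hot.columns.tolist())
# ['fruit_Apple', 'fruit_Banana', 'fruit_Cherry']

# Explicit ordinal encoding: pass the order so Low < Medium < High
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df_demo["level_encoded"] = (
    ordinal.fit_transform(df_demo[["level"]]).ravel().astype(int)
)
print(df_demo["level_encoded"].tolist())  # [0, 2, 1, 0]
```

Passing `categories` explicitly matters here: without it, the categories would be sorted alphabetically (High, Low, Medium), which does not reflect the intended ranking.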

# Step-by-Step Guide: Label Encoding with the Iris Dataset


## Step 1: Import Necessary Libraries

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
import pandas as pd

## Step 2: Load and Prepare the Iris Dataset

The dataset contains four numeric features (sepal length, sepal width, petal length, petal width) and a numeric target. We map the target to the species names to create a categorical Species column.

In [None]:
# Load Iris dataset
iris = load_iris()

# Create a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Species'] = iris.target_names[iris.target]

In [None]:
df.head()

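|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |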


## Step 3: Initialize the LabelEncoder

In [None]:
label_encoder = LabelEncoder()

## Step 4: Encode the Categorical Column

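The Species column contains three categorical values (setosa, versicolor, virginica). `fit_transform` learns the unique categories and replaces each one with an integer. Note that scikit-learn's `LabelEncoder` is designed for target labels such as Species; for input features, `OrdinalEncoder` is the intended tool.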

In [None]:
df['Species_encoded'] = label_encoder.fit_transform(df['Species'])

In [None]:
df.head()

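|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species | Species_encoded |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0 |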
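
Since the first five rows are all setosa, the output above only shows the code 0. As a quick sanity check, the sketch below confirms that the encoded values match the original numeric `iris.target`, and that pandas' own categorical codes give the same result. The pandas route is shown as an alternative, not as part of the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
species = pd.Series(iris.target_names[iris.target], name="Species")

# scikit-learn: codes follow the sorted unique labels
sk_codes = LabelEncoder().fit_transform(species)

# pandas alternative: inferred categories are also sorted
pd_codes = species.astype("category").cat.codes.to_numpy()

print(sorted(set(sk_codes.tolist())))    # [0, 1, 2]
print((sk_codes == iris.target).all())   # True
print((sk_codes == pd_codes).all())      # True
```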


## Step 5: Check the Encoding Mapping

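The `classes_` attribute lists the categories in sorted order. Each category's position in the array is its encoded integer, so setosa is 0, versicolor is 1, and virginica is 2.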

In [None]:
label_encoder.classes_

array(['setosa', 'versicolor', 'virginica'], dtype=object)
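
If an explicit lookup table is more convenient than reading positions off the array, one possible approach is to build a dictionary from `classes_`. This sketch fits a fresh encoder on the three species names so that it runs on its own; the variable name `mapping` is just for illustration.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder().fit(["setosa", "versicolor", "virginica"])

# Pair each category with the integer that transform() assigns to it
mapping = dict(
    zip(
        label_encoder.classes_.tolist(),
        label_encoder.transform(label_encoder.classes_).tolist(),
    )
)
print(mapping)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
```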

## Step 6: Decode Numerical Values

To convert encoded values back to their original categories:

In [None]:
decoded = label_encoder.inverse_transform([0, 1, 2])

In [None]:
decoded

array(['setosa', 'versicolor', 'virginica'], dtype=object)
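
The same call works on a whole encoded column, not just on `[0, 1, 2]`. As a small self-contained sketch, the round trip below encodes all species names and checks that decoding restores them exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
species = iris.target_names[iris.target]  # array of species names

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(species)
restored = label_encoder.inverse_transform(encoded)

# Decoding the encoded column gives back the original labels
print(np.array_equal(restored, species))  # True
```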

## Summary

> `Input`: Categorical column Species with values like setosa, versicolor, virginica.

> `Output`: Numerical column Species_encoded with values like 0, 1, 2.