# Encoding Categorical Data

Most machine learning models work with **numerical data**, not text.

Categorical variables must be converted into numbers before training models.

In this notebook, we'll cover:
- Label Encoding
- One-Hot Encoding
- Ordinal Encoding
- Encoding with `pandas.get_dummies()`

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
import numpy as np

data = {
    'Country': ['India', 'USA', 'UK', 'India', 'USA'],
    'Purchased': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'Education': ['High School', 'Bachelors', 'Masters', 'PhD', 'Bachelors']
}

df = pd.DataFrame(data)
df

## 1. Label Encoding
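- Converts each category into an integer label (0, 1, 2, ...), assigned in sorted order of the category values.
- Intended for target variables; scikit-learn's `LabelEncoder` expects a 1-D array of labels.
- Usually unsuitable for nominal input features, because the integers imply an order that may not exist.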

In [None]:
le = LabelEncoder()
df['Purchased_Encoded'] = le.fit_transform(df['Purchased'])
df
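
To make the mapping explicit, here is a minimal sketch that inspects a fitted encoder. The names `le_demo`, `purchased`, `codes`, and `restored` are illustrative and not part of the notebook above; the data mirrors the `Purchased` column.

```python
from sklearn.preprocessing import LabelEncoder

# Same values as the 'Purchased' column (illustrative copy)
purchased = ['Yes', 'No', 'Yes', 'No', 'Yes']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(purchased)

# classes_ holds the sorted unique labels; a label's position is its code
print(le_demo.classes_)   # ['No' 'Yes']
print(codes)              # [1 0 1 0 1]

# inverse_transform maps the integer codes back to the original labels
restored = le_demo.inverse_transform(codes)
print(list(restored))
```

Because the codes follow sorted order, 'No' becomes 0 and 'Yes' becomes 1 here, regardless of which value appears first in the data.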

## 2. One-Hot Encoding
- Creates one binary (0/1) dummy column per category.
- Avoids implying a false order between categories.
- Common for nominal features like 'Country'.

In [None]:
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
# drop='first' drops one category per feature to avoid the dummy variable trap
ohe = OneHotEncoder(sparse_output=False, drop='first')
country_encoded = ohe.fit_transform(df[['Country']])
country_encoded_df = pd.DataFrame(country_encoded, columns=ohe.get_feature_names_out(['Country']))

df_ohe = pd.concat([df, country_encoded_df], axis=1)
df_ohe

## 3. Ordinal Encoding
- For categorical data that has an **order** (e.g., education levels).
- Assigns integers following an order you define explicitly; without one, scikit-learn sorts the categories alphabetically.

In [None]:
education_order = [['High School', 'Bachelors', 'Masters', 'PhD']]
ordinal_enc = OrdinalEncoder(categories=education_order)
df['Education_Encoded'] = ordinal_enc.fit_transform(df[['Education']]).ravel()  # flatten the (n, 1) output to 1-D
df
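
To see why the explicit order matters, the following sketch compares the default behaviour with a user-supplied order. The names `edu`, `auto_codes`, `explicit`, and `ordered_codes` are illustrative and not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same values as the 'Education' column (illustrative copy)
edu = pd.DataFrame(
    {'Education': ['High School', 'Bachelors', 'Masters', 'PhD', 'Bachelors']}
)

# Default: categories are inferred from the data and sorted alphabetically
auto_codes = OrdinalEncoder().fit_transform(edu).ravel()

# Explicit: categories are numbered in the order given
explicit = OrdinalEncoder(
    categories=[['High School', 'Bachelors', 'Masters', 'PhD']]
)
ordered_codes = explicit.fit_transform(edu).ravel()

print(auto_codes)     # [1. 0. 2. 3. 0.]
print(ordered_codes)  # [0. 1. 2. 3. 1.]
```

With the default, 'Bachelors' is ranked below 'High School' purely because of alphabetical sorting, which is not the intended meaning.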

## 4. Using `pandas.get_dummies()`
- Quick way to one-hot encode categorical features.
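- Simpler than scikit-learn for many cases, but it does not remember the categories it saw, so training data and new data can end up with different columns.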

In [None]:
df_dummies = pd.get_dummies(df, columns=['Country'], drop_first=True, dtype=int)  # dtype=int gives 0/1 rather than True/False
df_dummies
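
One caveat is worth illustrating: `get_dummies()` only creates columns for the categories present in the frame it is given. The sketch below uses two small hypothetical frames (`train` and `new`, not from the notebook) and shows one possible way to align them with `reindex`.

```python
import pandas as pd

# Hypothetical training data and later data; 'UK' never appears in `new`
train = pd.DataFrame({'Country': ['India', 'USA', 'UK']})
new = pd.DataFrame({'Country': ['USA', 'India']})

train_d = pd.get_dummies(train, columns=['Country'], dtype=int)
new_d = pd.get_dummies(new, columns=['Country'], dtype=int)

print(list(train_d.columns))  # ['Country_India', 'Country_UK', 'Country_USA']
print(list(new_d.columns))    # ['Country_India', 'Country_USA']

# Align the new frame to the training columns, filling missing dummies with 0
new_aligned = new_d.reindex(columns=train_d.columns, fill_value=0)
print(new_aligned)
```

A fitted scikit-learn `OneHotEncoder` avoids this manual step because it stores the categories learned during `fit`.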

## ✅ Summary
- **Label Encoding**: Converts categories to numbers, good for target variables.
- **One-Hot Encoding**: Expands categories into multiple binary columns.
- **Ordinal Encoding**: For ordered categories.
- **pandas.get_dummies()**: Fast and simple one-hot encoding.

👉 Always choose the encoding method based on the type of categorical feature!