# Handling Categorical Data: One-Hot Encoding 🏘️

Most machine learning models require numerical input features. However, real-world datasets often contain **categorical features**: variables that represent labels rather than numerical quantities (e.g., city names or product types). Before we can train a model, we must convert these text labels into numbers.

A common but incorrect approach for categories without a natural order (like city names) is to simply assign numbers (e.g., `Banjara Hills=0`, `Kollur=1`, `Mankhal=2`). This is a bad idea because it creates an artificial and misleading mathematical relationship. The model might incorrectly assume that `Mankhal` is somehow "greater than" `Kollur`, which doesn't make sense.
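To make the problem concrete, here is a minimal sketch of what integer codes look like to a model. The three-row `localities` series is a made-up stand-in, not the notebook's dataset.

```python
import pandas as pd

# Hypothetical three-row example, not the real dataset
localities = pd.Series(["Banjara Hills", "Kollur", "Mankhal"], name="locality")

# Integer codes are assigned in sorted order: Banjara Hills=0, Kollur=1, Mankhal=2
codes = localities.astype("category").cat.codes

# A linear model sees only the numbers, so the step from Banjara Hills to
# Mankhal looks exactly twice as large as the step from Banjara Hills to Kollur
gap_to_kollur = codes[1] - codes[0]
gap_to_mankhal = codes[2] - codes[0]

print(codes.tolist())
print(gap_to_kollur, gap_to_mankhal)
```

Nothing about the localities justifies that 1:2 ratio; it is purely a side effect of alphabetical ordering.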

The correct approach for such *nominal* categorical data is **One-Hot Encoding**.

## What is One-Hot Encoding?

One-Hot Encoding transforms each categorical feature into a set of new binary (0 or 1) columns. It creates a new column for each unique category. For any given row, the column corresponding to its original category will have a value of `1`, while all other new columns will have a value of `0`.
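As a small illustration of this definition, the sketch below encodes a four-row toy column. The `toy` frame is invented for the example, and `dtype=int` is passed only so the output shows 0/1 rather than the booleans that recent pandas versions return by default.

```python
import pandas as pd

# Hypothetical toy data, not the home prices dataset
toy = pd.DataFrame({"locality": ["Kollur", "Mankhal", "Banjara Hills", "Kollur"]})

# One new column per unique category; dtype=int yields 0/1 instead of True/False
one_hot = pd.get_dummies(toy, columns=["locality"], dtype=int)
print(one_hot)

# Every row has exactly one 1 across the new columns
row_sums = one_hot.sum(axis=1)
print(row_sums.tolist())
```

The new columns are named `<original column>_<category>`, which is why the encoded dataset later contains names such as `locality_Kollur`.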

This notebook demonstrates how to apply one-hot encoding to a home prices dataset to include the `locality` feature in a linear regression model.

---

## 1. The Dataset with a Categorical Feature

First, we load our dataset. Notice the `locality` column, which contains text data that our model cannot process directly.


In [1]:
import pandas as pd

df = pd.read_csv('home_prices.csv')
df

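| | locality | area_sqr_ft | price_lakhs | bedrooms |
|---|---|---|---|---|
| 0 | Kollur | 656 | 39.0 | 2 |
| 1 | Kollur | 1260 | 83.2 | 2 |
| 2 | Kollur | 1057 | 86.6 | 3 |
| 3 | Kollur | 1259 | 59.0 | 2 |
| 4 | Kollur | 1800 | 140.0 | 3 |
| 5 | Kollur | 1325 | 80.1 | 2 |
| 6 | Kollur | 1085 | 116.0 | 3 |
| 7 | Kollur | 1110 | 45.0 | 2 |
| 8 | Kollur | 1700 | 100.0 | 3 |
| 9 | Banjara Hills | 1650 | 200.0 | 3 |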


## 2. Applying One-Hot Encoding with Pandas

Pandas provides a very convenient function, `pd.get_dummies()`, to perform one-hot encoding.


In [2]:
df_encoded = pd.get_dummies(df, columns=["locality"], drop_first=True)
df_encoded.sample(5)

| | area_sqr_ft | price_lakhs | bedrooms | locality_Kollur | locality_Mankhal |
|---|---|---|---|---|---|
| 9 | 1650 | 200.0 | 3 | False | False |
| 21 | 1200 | 85.0 | 3 | False | True |
| 8 | 1700 | 100.0 | 3 | True | False |
| 1 | 1260 | 83.2 | 2 | True | False |
| 6 | 1085 | 116.0 | 3 | True | False |


**Explanation of `drop_first=True` (The Dummy Variable Trap):**

Our original `locality` column had three categories: 'Banjara Hills', 'Kollur', and 'Mankhal'. You might expect `get_dummies` to create three new columns. However, we only see two: `locality_Kollur` and `locality_Mankhal`.

This is because we set `drop_first=True`, which drops the column for the first category in sorted order (`locality_Banjara Hills`). We do this to avoid the **dummy variable trap**, where one feature can be computed exactly from the others. With all three columns, every row would contain exactly one `1`, so `locality_Banjara Hills` would always equal `1 - locality_Kollur - locality_Mankhal`. This perfect multicollinearity means a linear model with an intercept has no unique set of coefficients, which makes the individual coefficients unstable and hard to interpret.

By dropping one column, we make 'Banjara Hills' our **reference category**. Its effect is absorbed into the intercept, and the coefficient of each remaining dummy variable represents the difference in predicted price relative to this baseline, with the other features held fixed.
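The redundancy behind the dummy variable trap can be checked numerically. The sketch below is an illustration on made-up data: `toy` and the helper `with_intercept` are hypothetical names, and the column of ones stands in for the intercept that a linear model fits.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column covering all three localities
toy = pd.DataFrame({"locality": ["Banjara Hills", "Kollur", "Mankhal", "Kollur", "Mankhal"]})

full = pd.get_dummies(toy["locality"], dtype=float)                      # 3 dummy columns
reduced = pd.get_dummies(toy["locality"], drop_first=True, dtype=float)  # 2 dummy columns


def with_intercept(dummies):
    """Prepend a constant column of ones, mirroring the model's intercept term."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy()])


# 4 columns but one is redundant: the three dummies always sum to the ones column
rank_full = np.linalg.matrix_rank(with_intercept(full))

# 3 columns, all linearly independent
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))

print(rank_full, rank_reduced)
```

Both matrices have the same rank even though the first has one more column, which shows that the extra dummy column carries no additional information.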


## 3. Training a Linear Regression Model

Now that all our features are numerical, we can train a linear regression model to predict `price_lakhs`.


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Features: every column except the target
X = df_encoded.drop(columns=["price_lakhs"])
y = df_encoded["price_lakhs"]

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# R² on the held-out test set
model.score(X_test, y_test)

0.8558905263155381

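The model achieves an R² score of **0.856** on the held-out test set, meaning it explains roughly 86% of the variance in those test prices. Keep in mind that this score comes from a single 80/20 split, so it may change with a different `random_state`.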
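Beyond the score, it can help to look at the fitted coefficients, since each dummy coefficient is an offset from the reference category. The sketch below uses synthetic, noise-free data so the expected values are known in advance. The `toy` frame, its price formula, and the names `toy_model`, `X_toy`, and `y_toy` are assumptions for illustration and say nothing about the real dataset's coefficients.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data:
# price = 100 + 0.05 * area, minus 40 in Kollur, minus 45 in Mankhal
rows = []
for locality, offset in [("Banjara Hills", 0.0), ("Kollur", -40.0), ("Mankhal", -45.0)]:
    for area in (800, 1200, 1600):
        rows.append({
            "locality": locality,
            "area_sqr_ft": area,
            "price_lakhs": 100.0 + 0.05 * area + offset,
        })
toy = pd.DataFrame(rows)

# Same encoding as in the notebook; dtype=int keeps the dummies as 0/1
encoded = pd.get_dummies(toy, columns=["locality"], drop_first=True, dtype=int)
X_toy = encoded.drop(columns=["price_lakhs"])
y_toy = encoded["price_lakhs"]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its column name for readability
coefficients = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coefficients.round(2))
print(round(toy_model.intercept_, 2))
```

Here the dummy coefficients recover the offsets built into the data (relative to Banjara Hills), and the intercept recovers the baseline. Applying the same `pd.Series(model.coef_, index=X.columns)` pattern to the notebook's `model` would show its learned locality offsets.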

## 4. Making Predictions with Encoded Features

To make predictions on new data, we must structure it in the same one-hot encoded format the model was trained on. Let's predict the price of a 1600 sq. ft., 2-bedroom house in each of the three localities.

Here is how we represent each locality in the encoded format:

In [5]:
# For Banjara Hills (the reference category)
print("Banjara Hills: locality_Kollur=False, locality_Mankhal=False")

# For Mankhal
print("Mankhal:       locality_Kollur=False, locality_Mankhal=True")

# For Kollur
print("Kollur:        locality_Kollur=True,  locality_Mankhal=False")

Banjara Hills: locality_Kollur=False, locality_Mankhal=False
Mankhal:       locality_Kollur=False, locality_Mankhal=True
Kollur:        locality_Kollur=True,  locality_Mankhal=False


Now, let's pass this data to our model for prediction.

In [4]:
test = pd.DataFrame([
    {'area_sqr_ft': 1600, 'bedrooms': 2, 'locality_Kollur': False, 'locality_Mankhal': False},
    {'area_sqr_ft': 1600, 'bedrooms': 2, 'locality_Kollur': False, 'locality_Mankhal': True},
    {'area_sqr_ft': 1600, 'bedrooms': 2, 'locality_Kollur': True, 'locality_Mankhal': False}
])

model.predict(test)

array([157.03383393, 109.25104283, 113.96340695])

**Interpretation:**
The model predicts that a 1600 sq. ft., 2-bedroom house would cost:
* **~157 lakhs** in Banjara Hills
* **~109 lakhs** in Mankhal
* **~114 lakhs** in Kollur

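Because all three houses have the same area and number of bedrooms, the differences in predicted price come entirely from the locality dummy variables. One-hot encoding is what lets the linear model assign each locality its own price offset.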
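Writing the dummy values by hand works for three rows but is easy to get wrong for larger inputs. One possible alternative, sketched below, is to encode the raw `locality` labels and then align the result to the training columns with `reindex`. The list `train_columns` is a hypothetical stand-in for `X.columns` from the training step, and `new_homes` is made-up input.

```python
import pandas as pd

# Hypothetical stand-in for the training feature columns (X.columns in the notebook)
train_columns = ["area_sqr_ft", "bedrooms", "locality_Kollur", "locality_Mankhal"]

# New houses described with the raw locality label
new_homes = pd.DataFrame({
    "area_sqr_ft": [1600, 1600, 1600],
    "bedrooms": [2, 2, 2],
    "locality": ["Banjara Hills", "Mankhal", "Kollur"],
})

# Encode without drop_first: in a small batch, the "first" category may not be
# the one that was dropped at training time
encoded_new = pd.get_dummies(new_homes, columns=["locality"])

# Keep exactly the training columns, in training order. The reference-category
# column is discarded, and any dummy column absent from this batch is filled with False
aligned = encoded_new.reindex(columns=train_columns, fill_value=False)
print(aligned)
```

The aligned frame could then be passed to `model.predict(aligned)`. One caveat of this approach is that a locality never seen during training ends up with all dummy columns set to `False`, so it is silently treated as the reference category.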