# Feature Engineering - Bangalore House Price Prediction

This notebook derives new features from the cleaned data: ratios, log transforms, interaction terms, and squared terms.

## 1. Import Libraries and Load Data

In [None]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

In [None]:
# Load the cleaned data
df = pd.read_csv("../data/processed/cleaned_data.csv")

print(f"Dataset Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
df.head()

## 2. Feature Engineering

Each subsection below adds a group of derived columns to `df_fe`, a copy of the cleaned DataFrame.

### 2.1 Ratio Features

In [None]:
# Create a copy for feature engineering
df_fe = df.copy()

# 1. Bathroom to BHK ratio
df_fe['bath_per_bhk'] = df_fe['bath'] / df_fe['bhk']

# 2. Balcony to BHK ratio
df_fe['balcony_per_bhk'] = df_fe['balcony'] / df_fe['bhk']

# 3. Square feet per BHK (room size indicator)
df_fe['sqft_per_bhk'] = df_fe['total_sqft_int'] / df_fe['bhk']

# 4. Total amenities count (bath + balcony; a sum rather than a ratio)
df_fe['total_amenities'] = df_fe['bath'] + df_fe['balcony']

print("Ratio features created!")
df_fe[['bath_per_bhk', 'balcony_per_bhk', 'sqft_per_bhk', 'total_amenities']].head()
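
The ratios above divide by `bhk`, so any row with `bhk == 0` would produce `inf` (or `NaN` for 0/0) instead of raising an error. The following is a minimal sketch of one way to guard against that, assuming such rows could exist in the cleaned data; `safe_ratio` and the toy rows are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({"bath": [2, 1, 3], "bhk": [2, 0, 3]})
toy["bath_per_bhk"] = safe_ratio(toy["bath"], toy["bhk"])
print(toy)
```

Returning `NaN` rather than `inf` keeps the problem visible to a later `isnull()` check.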

### 2.2 Log Transformations

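Log transformations compress the long right tail of skewed features such as area and price. `np.log1p` computes log(1 + x), so it is defined at zero and can be inverted with `np.expm1`.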

In [None]:
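# Log transformation for skewed features (log1p = log(1 + x))
df_fe['log_total_sqft'] = np.log1p(df_fe['total_sqft_int'])
# log_price is derived from price; if price is the prediction target,
# keep log_price out of the model inputs to avoid leaking the target
df_fe['log_price'] = np.log1p(df_fe['price'])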

print("Log features created!")
df_fe[['total_sqft_int', 'log_total_sqft', 'price', 'log_price']].head()
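
If a model is later trained on `log_price`, its predictions need to be mapped back to the original price scale. The sketch below shows the `np.log1p` / `np.expm1` round trip on a few hypothetical values (the numbers are illustrative, not taken from the dataset).

```python
import numpy as np

# Hypothetical price values, including zero to show log1p is defined there
prices = np.array([0.0, 39.07, 120.0, 600.0])

log_prices = np.log1p(prices)    # log(1 + x)
recovered = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

print(np.allclose(recovered, prices))
```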

### 2.3 Interaction Features

In [None]:
# Interaction features
df_fe['sqft_x_bhk'] = df_fe['total_sqft_int'] * df_fe['bhk']
df_fe['sqft_x_bath'] = df_fe['total_sqft_int'] * df_fe['bath']
df_fe['bhk_x_bath'] = df_fe['bhk'] * df_fe['bath']

print("Interaction features created!")
df_fe[['sqft_x_bhk', 'sqft_x_bath', 'bhk_x_bath']].head()

### 2.4 Squared Features

In [None]:
# Squared features for non-linear relationships
df_fe['sqft_squared'] = df_fe['total_sqft_int'] ** 2
df_fe['bhk_squared'] = df_fe['bhk'] ** 2

print("Squared features created!")
df_fe[['total_sqft_int', 'sqft_squared', 'bhk', 'bhk_squared']].head()
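
To illustrate why a squared column helps with non-linear relationships, the sketch below fits ordinary least squares with and without a squared term on synthetic data. The data and the helper `max_abs_error` are hypothetical; the target is constructed to be exactly quadratic, which real prices will not be.

```python
import numpy as np

# Hypothetical data following a purely quadratic relationship
x = np.arange(1, 6, dtype=float)
y = 3.0 + 2.0 * x**2


def max_abs_error(features: np.ndarray, target: np.ndarray) -> float:
    """Fit ordinary least squares and return the largest absolute residual."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.max(np.abs(features @ coef - target)))


ones = np.ones_like(x)
err_linear = max_abs_error(np.column_stack([ones, x]), y)
err_squared = max_abs_error(np.column_stack([ones, x, x**2]), y)

print(f"max error without squared term: {err_linear:.3f}")
print(f"max error with squared term:    {err_squared:.3f}")
```

A linear model can only use the squared relationship if it is supplied as its own column, which is what `sqft_squared` and `bhk_squared` do.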

## 3. Review Engineered Features

In [None]:
# List all new features created
original_cols = df.columns.tolist()
new_cols = [col for col in df_fe.columns if col not in original_cols]

print(f"Original columns: {len(original_cols)}")
print(f"New columns created: {len(new_cols)}")
print(f"Total columns now: {len(df_fe.columns)}")
print(f"\nNew features: {new_cols}")

In [None]:
# Final dataset overview
print(f"Final Dataset Shape: {df_fe.shape}")
df_fe.head()

In [None]:
# Check for any null values in new features
print("Null values in new features:")
df_fe[new_cols].isnull().sum()
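
`isnull()` counts `NaN` but not infinite values, and pandas returns `inf` when a non-zero number is divided by zero. A minimal sketch of checking for both, using a hypothetical toy frame rather than `df_fe`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has bhk == 0
toy = pd.DataFrame({"sqft": [1200.0, 900.0], "bhk": [2, 0]})
toy["sqft_per_bhk"] = toy["sqft"] / toy["bhk"]  # second row becomes inf

null_count = int(toy["sqft_per_bhk"].isnull().sum())
inf_count = int(np.isinf(toy["sqft_per_bhk"]).sum())

print(f"NaN values: {null_count}")
print(f"Infinite values: {inf_count}")
```

The same `np.isinf(...)` call could be applied to `df_fe[new_cols]`, assuming all of those columns are numeric.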

## 4. Save Engineered Data

In [None]:
# Save the feature engineered dataset
df_fe.to_csv('../data/processed/feature_engineered_data.csv', index=False)

print("✅ Feature engineered data saved!")
print("   Location: ../data/processed/feature_engineered_data.csv")
print(f"   Shape: {df_fe.shape}")
print(f"   New features added: {len(new_cols)}")

## Summary

### New Features Created:

| Feature | Description |
|---------|-------------|
| `bath_per_bhk` | Bathroom to BHK ratio |
| `balcony_per_bhk` | Balcony to BHK ratio |
| `sqft_per_bhk` | Square feet per BHK (room size) |
| `total_amenities` | Total bath + balcony count |
| `log_total_sqft` | `log(1 + total_sqft_int)` via `np.log1p` |
| `log_price` | `log(1 + price)` via `np.log1p` |
| `sqft_x_bhk` | Sqft × BHK interaction |
| `sqft_x_bath` | Sqft × Bath interaction |
| `bhk_x_bath` | BHK × Bath interaction |
| `sqft_squared` | Sqft squared |
| `bhk_squared` | BHK squared |