# Housing Data EDA Template
**Student Name:** Mary
**Dataset:** Bangalore Housing Dataset

This notebook provides a skeleton for loading, cleaning, and exploring the Bangalore Housing dataset.

## 1. Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

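# Add the shared directory to the path so the common helper functions can be imported
sys.path.append(os.path.abspath('../../'))
# Import only the helpers this notebook uses instead of a wildcard import
from shared.templates.common_functions import (
    correlation_heatmap,
    missing_value_report,
    outlier_iqr_removal,
)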

%matplotlib inline

In [None]:
# Load your dataset
df = pd.read_csv('../data/bangalore_housing.csv')
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

## 2. Exploration
Check the basic structure, data types, and missing values.

In [None]:
df.info()
display(df.describe())

# Missing value report using shared function
missing_report = missing_value_report(df)
if missing_report.empty:
    print("No missing values found.")
else:
    display(missing_report)
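
The shared `missing_value_report` helper is not shown in this notebook, so its exact output is unknown. As a rough illustration of what such a report could contain, here is a minimal sketch under the assumption that it returns one row per column with missing values. The name `missing_value_report_sketch` and the column names `missing_count` and `missing_pct` are hypothetical, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def missing_value_report_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the shared helper: one row per column with missing values."""
    missing_count = frame.isna().sum()
    report = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy data for illustration only, not the real dataset
toy = pd.DataFrame({
    "bath": [2.0, np.nan, 3.0, np.nan],
    "balcony": [1.0, 1.0, np.nan, 2.0],
    "price": [50.0, 60.0, 70.0, 80.0],
})
toy_report = missing_value_report_sketch(toy)
print(toy_report)
```

An empty result means no column has missing values, which matches the `missing_report.empty` check in the cell above.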

## 3. Cleaning & Feature Engineering
Handle missing values, outliers, and data type conversions.

In [None]:
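# 1. Clean total_sqft: convert ranges (e.g. "1000 - 1200") to their midpoint
def convert_sqft_to_num(x):
    """Return total_sqft as a float; ranges map to their midpoint, unparseable values to NaN."""
    # Missing values have no .split(), so handle them first
    if pd.isna(x):
        return np.nan
    tokens = str(x).split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Non-numeric entries are treated as missing
        return np.nan

df['total_sqft'] = df['total_sqft'].apply(convert_sqft_to_num)
# .copy() avoids SettingWithCopyWarning on the column assignments below
df = df.dropna(subset=['total_sqft']).copy()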

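# 2. Clean size: extract the leading number (e.g. "2 BHK" -> 2) as an integer
# Rows with a missing size cannot be parsed, so drop them first
df = df.dropna(subset=['size']).copy()
df['BHK'] = df['size'].str.split(' ').str[0].astype(int)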

# 3. Handle other missing values
df['bath'] = df['bath'].fillna(df['bath'].median())
df['balcony'] = df['balcony'].fillna(df['balcony'].median())

# 4. Feature Engineering
df['Price_per_SqFt'] = (df['price'] * 100000) / df['total_sqft'] # Convert price (Lakhs) to Rupees

# 5. Outlier removal on Price_per_SqFt using the shared IQR helper
df_clean = outlier_iqr_removal(df, 'Price_per_SqFt')

# Save cleaned dataset
os.makedirs('../cleaned_data', exist_ok=True)
df_clean.to_csv('../cleaned_data/cleaned_bangalore_housing.csv', index=False)
print(f"Cleaned data saved. Removed {len(df) - len(df_clean)} outliers.")
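
The implementation of `outlier_iqr_removal` lives in the shared module and is not shown here. A common form of IQR-based filtering keeps rows within `k` times the interquartile range of the quartiles. The sketch below assumes that rule with `k = 1.5`; the function name `iqr_filter_sketch` and the toy values are hypothetical, and the shared helper may use different bounds.

```python
import pandas as pd


def iqr_filter_sketch(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `column` are dropped
    return frame[frame[column].between(lower, upper)].copy()


# Toy data for illustration only: one value is far above the rest
toy = pd.DataFrame({"price": [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 900.0]})
filtered = iqr_filter_sketch(toy, "price")
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

The same function works on any numeric column, so it can be pointed at `price` or at a derived column such as `Price_per_SqFt`.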

## 4. EDA (Exploratory Data Analysis)
Visualize distributions and relationships.

In [None]:
# 1. Histogram of Price
fig1 = px.histogram(df_clean, x='price', title='Bangalore House Price Distribution (in Lakhs)', template='plotly_white')
fig1.show()

# Correlation Heatmap
correlation_heatmap(df_clean)
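
A heatmap shows the full correlation matrix, but it can also help to rank features by their correlation with `price` directly. The sketch below uses only pandas and restricts the calculation to numeric columns; the toy values are made up, and it is not a description of what the shared `correlation_heatmap` helper does internally.

```python
import pandas as pd

# Toy data for illustration only, not the real dataset
toy = pd.DataFrame({
    "total_sqft": [600.0, 900.0, 1200.0, 1500.0, 2400.0],
    "bath": [1, 2, 2, 3, 4],
    "BHK": [1, 2, 2, 3, 4],
    "price": [30.0, 48.0, 62.0, 85.0, 150.0],
    "location": ["A", "B", "A", "C", "B"],
})

# Pearson correlation is only defined for numeric columns
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()

# Rank the remaining features by their correlation with price
price_corr = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Applied to `df_clean`, the same three lines would give a ranked list to read alongside the heatmap.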

## 5. Visualizations (Dashboards)
Create interactive plots for key findings.

In [None]:
# 2. Scatter: total_sqft vs price
fig2 = px.scatter(df_clean, x='total_sqft', y='price', color='BHK', 
                 title='Relationship: Total Square Feet vs Price', template='plotly_white')
fig2.show()

# 3. Boxplot: Price by area_type
fig3 = px.box(df_clean, x='area_type', y='price', 
             title='Price Distribution by Area Type', template='plotly_white')
fig3.show()

# 4. Bar: Median Price by Top 10 Locations
top_locations = df_clean['location'].value_counts().head(10).index
median_price_location = (
    df_clean[df_clean['location'].isin(top_locations)]
    .groupby('location')['price']
    .median()
    .reset_index()
    .sort_values('price', ascending=False)
)
fig4 = px.bar(median_price_location, x='location', y='price', 
             title='Median House Price in Top 10 Popular Locations', template='plotly_white')
fig4.show()

## 6. Insights Summary
1. **Area Correlation**: Total square feet is a primary driver of price in this dataset.
2. **BHK Impact**: Homes with a higher BHK count command clear price premiums, though price per square foot may vary by location.
3. **Location Variance**: Prices differ substantially across major tech hubs such as Whitefield and Electronic City.
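
The insights above can be backed with simple grouped summaries. The sketch below compares median price and median price per square foot by BHK and by location. It uses made-up toy values with placeholder location names, so the numbers illustrate the method only; running the same `groupby` calls on `df_clean` would give the actual figures.

```python
import pandas as pd

# Toy data for illustration only; prices are in Lakhs and entirely made up
toy = pd.DataFrame({
    "location": ["Location A", "Location A", "Location B", "Location B"],
    "BHK": [2, 3, 2, 3],
    "total_sqft": [1000.0, 1500.0, 1000.0, 1500.0],
    "price": [70.0, 120.0, 45.0, 75.0],
})
# Same conversion as in the cleaning step: Lakhs to Rupees, then per square foot
toy["Price_per_SqFt"] = toy["price"] * 100_000 / toy["total_sqft"]

# Location variance: median price per square foot, highest first
by_location = (
    toy.groupby("location")["Price_per_SqFt"].median().sort_values(ascending=False)
)

# BHK impact: median total price and median price per square foot per BHK
by_bhk = toy.groupby("BHK")[["price", "Price_per_SqFt"]].median()

print(by_location)
print(by_bhk)
```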