# Housing Data EDA Template
**Student Name:** Kelvin
**Dataset:** California Housing Dataset

This notebook provides a skeleton for loading, cleaning, and exploring your housing dataset.

## 1. Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

# Add shared directory to path to import common functions
sys.path.append(os.path.abspath('../../'))
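# Import only the helpers this notebook uses, so their origin stays visible
from shared.templates.common_functions import (
    missing_value_report,
    outlier_iqr_removal,
    correlation_heatmap,
)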

%matplotlib inline

In [None]:
# Load your dataset
df = pd.read_csv('../data/california_housing.csv')
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()
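
If the CSV is not available locally, the same dataset ships with scikit-learn. The sketch below is one possible fallback, not part of the template: `load_housing` is a hypothetical helper name, and it assumes your CSV uses the same column names as scikit-learn's copy (`MedInc`, `HouseAge`, ..., `MedHouseVal`).

```python
from pathlib import Path

import pandas as pd


def load_housing(csv_path: str) -> pd.DataFrame:
    """Read the CSV if it exists; otherwise fall back to scikit-learn's copy.

    Hypothetical helper for illustration. The fallback downloads the data
    on first use and caches it locally.
    """
    path = Path(csv_path)
    if path.exists():
        return pd.read_csv(path)
    # Imported lazily so scikit-learn is only required for the fallback
    from sklearn.datasets import fetch_california_housing
    return fetch_california_housing(as_frame=True).frame
```

Calling `load_housing('../data/california_housing.csv')` would then replace the plain `pd.read_csv` call above.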

## 2. Exploration
Check the basic structure, data types, and missing values.

In [None]:
df.info()
display(df.describe())

# Missing value report using shared function
missing_report = missing_value_report(df)
if missing_report.empty:
    print("No missing values found.")
else:
    display(missing_report)
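
The shared `missing_value_report` helper is not shown in this notebook. As a rough idea of what such a report could compute, here is a minimal sketch that assumes it returns one row per column with at least one missing value (which is what makes the `.empty` check above meaningful). The name `missing_value_report_sketch` and the output column names are illustrative, not the shared module's actual API.

```python
import numpy as np
import pandas as pd


def missing_value_report_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that contains missing values (illustrative)."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy frame for illustration only (not the housing data)
toy = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2, 5.6],
    "HouseAge": [41, 21, 52, 52],
    "AveRooms": [6.9, np.nan, np.nan, 5.8],
})
report = missing_value_report_sketch(toy)
print(report)
```

A frame with no missing values yields an empty report, so the "No missing values found." branch is taken.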

## 3. Cleaning & Feature Engineering
Handle missing values, outliers, and data type conversions. The target column is renamed to match the template's naming, and a few derived features are added to support the analysis in later sections.

In [None]:
# Rename columns for template consistency
df.rename(columns={'MedHouseVal': 'SalePrice'}, inplace=True)

# Feature Engineering
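df['Price_per_Income'] = df['SalePrice'] / df['MedInc']          # Price-to-income ratio (higher = less affordable)
# AveRooms is already a per-household average, so dividing by occupancy gives rooms per person
df['Rooms_per_Person'] = df['AveRooms'] / df['AveOccup']
# include_lowest=True keeps a HouseAge of exactly 0 in the 'New' bin instead of turning it into NaN
df['Age_Category'] = pd.cut(df['HouseAge'], bins=[0, 15, 30, 50, 100], labels=['New', 'Recent', 'Mid', 'Old'], include_lowest=True)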

# Example: Removing outliers from SalePrice (the target)
df_clean = outlier_iqr_removal(df, 'SalePrice')

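# Save cleaned dataset (create the output directory first if it does not exist)
os.makedirs('../cleaned_data', exist_ok=True)
df_clean.to_csv('../cleaned_data/cleaned_california_housing.csv', index=False)
print(f"Cleaned data saved. Removed {len(df) - len(df_clean)} outlier rows.")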
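
The shared `outlier_iqr_removal` helper is not shown here, so its exact rule is unknown. The sketch below illustrates the conventional IQR filter, assuming the usual 1.5 × IQR fences; `iqr_filter_sketch` and the parameter `k` are hypothetical names, and the shared function may behave differently.

```python
import pandas as pd


def iqr_filter_sketch(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `column` are dropped too
    return df[df[column].between(lower, upper)].copy()


# Toy values for illustration only (not the housing data)
toy = pd.DataFrame({"SalePrice": [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 9.0]})
filtered = iqr_filter_sketch(toy, "SalePrice")
print(f"Removed {len(toy) - len(filtered)} of {len(toy)} rows")
```

Because the fences are computed from the column's own quartiles, the filter adapts to the scale of the data, but it also removes legitimate extreme values. Report how many rows were dropped, as the cell above does.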

## 4. EDA (Exploratory Data Analysis)
Visualize distributions and relationships.

In [None]:
# Distribution of SalePrice
fig1 = px.histogram(df_clean, x='SalePrice', title='Distribution of Median House Value (SalePrice)', template='plotly_white')
fig1.show()

# Correlation Heatmap
correlation_heatmap(df_clean)
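
A heatmap is easy to scan but hard to read exact values from. As a complement, the sketch below ranks every numeric column by its Pearson correlation with the target. `rank_correlations` is a hypothetical helper, not part of the shared module, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    # Non-numeric columns such as Age_Category are excluded
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength so strong negative correlations are not buried
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Synthetic data for illustration only (not the housing data)
rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=200)
toy = pd.DataFrame({
    "MedInc": income,
    "SalePrice": 0.4 * income + rng.normal(0, 0.5, size=200),
    "Noise": rng.normal(size=200),
    "Age_Category": pd.Categorical(["New", "Old"] * 100),
})
ranking = rank_correlations(toy, "SalePrice")
print(ranking)
```

On the real data, `rank_correlations(df_clean, 'SalePrice')` would give the numbers behind the heatmap's `SalePrice` row.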

## 5. Visualizations (Dashboards)
Create interactive plots for key findings.

In [None]:
# 1. Median Income vs SalePrice (colored by HouseAge)
fig2 = px.scatter(df_clean, x='MedInc', y='SalePrice', color='HouseAge', 
                 title='Relationship: Median Income vs SalePrice (Colored by HouseAge)',
                 template='plotly_white')
fig2.show()

# 2. Geospatial Scatter (Latitude vs Longitude)
fig3 = px.scatter(df_clean, x='Longitude', y='Latitude', color='SalePrice',
                 size='AveRooms', hover_data=['MedInc', 'HouseAge'],
                 title='California House Prices by Location (Color = Price, Size = Rooms)',
                 color_continuous_scale='RdYlBu_r')
fig3.update_layout(height=700)
fig3.show()

# 3. Boxplot: SalePrice by Age_Category
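# category_orders fixes the x-axis order (New -> Old) instead of order of appearance
fig4 = px.box(df_clean, x='Age_Category', y='SalePrice',
             title='SalePrice Distribution by House Age Category',
             category_orders={'Age_Category': ['New', 'Recent', 'Mid', 'Old']},
             template='plotly_white')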
fig4.show()

## 6. Insights Summary
1. **Income vs Price**: Median Income (`MedInc`) is strongly and positively correlated with Median House Value (`SalePrice`); areas with higher incomes tend to have higher prices.
2. **Location**: Higher prices are concentrated along the coastal areas (Bay Area and Southern California).
3. **Outliers**: Removing IQR outliers in `SalePrice` trimmed the extreme values in the tail, leaving a less skewed distribution for analysis.
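
Insight 3 can be backed with a number by comparing skewness before and after filtering. The sketch below uses synthetic right-skewed prices and assumes a conventional 1.5 × IQR rule; the shared `outlier_iqr_removal` helper may use different fences.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices for illustration only (not the housing data)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=0.5, sigma=0.6, size=1000), name="SalePrice")

# Conventional 1.5 * IQR fences (assumed rule, written out inline)
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
trimmed = prices[prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = prices.skew()
skew_after = trimmed.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
print(f"rows removed: {len(prices) - len(trimmed)}")
```

In this notebook the equivalent comparison is `df['SalePrice'].skew()` against `df_clean['SalePrice'].skew()`.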