# Housing Data EDA Template
**Student Name:** Hillary
**Dataset:** Melbourne Housing Dataset

This notebook provides a skeleton for loading, cleaning, and exploring the Melbourne Housing dataset.

## 1. Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

# Add shared directory to path to import common functions
sys.path.append(os.path.abspath('../../'))
from shared.templates.common_functions import (
    missing_value_report,
    outlier_iqr_removal,
    correlation_heatmap,
)

%matplotlib inline

In [None]:
# Load your dataset
df = pd.read_csv('../data/melbourne_housing.csv')
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

## 2. Exploration
Check the basic structure, data types, and missing values.

In [None]:
df.info()
display(df.describe())

# Missing value report using shared function
missing_report = missing_value_report(df)
if missing_report.empty:
    print("No missing values found.")
else:
    display(missing_report)
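
`missing_value_report` comes from the shared module, and its implementation is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch. The function name `missing_value_report_sketch`, the output columns `missing_count` and `missing_pct`, and the toy data are all assumptions for illustration, not the shared code.

```python
import numpy as np
import pandas as pd


def missing_value_report_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the shared helper: one row per column that has gaps."""
    missing_count = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
    })
    # Keep only columns that actually contain missing values, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy frame for illustration only, not the real dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [80.0, np.nan, np.nan, 120.0],
    "YearBuilt": [1990.0, 2005.0, np.nan, 1970.0],
})
report = missing_value_report_sketch(toy)
print(report)
```

A helper shaped like this returns an empty frame when nothing is missing, which is what the `missing_report.empty` check in the cell above relies on.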

## 3. Cleaning & Feature Engineering
Handle missing values, outliers, and data type conversions.

In [None]:
# Rename columns for template consistency
df.rename(columns={'Price': 'SalePrice'}, inplace=True)

# Handle missing values
df['BuildingArea'] = df['BuildingArea'].fillna(df['BuildingArea'].median())
df['YearBuilt'] = df['YearBuilt'].fillna(df['YearBuilt'].median())
df['Car'] = df['Car'].fillna(df['Car'].mode()[0])

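# Feature Engineering
# Age is measured against a hard-coded reference year, not the sale date
df['HouseAge'] = 2026 - df['YearBuilt']
# Treat a zero building area as missing so the ratio is NaN rather than inf
df['Price_per_SqM'] = df['SalePrice'] / df['BuildingArea'].replace(0, np.nan)
df['TotalRooms'] = df['Rooms'] + df['Bathroom']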

# Outlier Removal for SalePrice
df_clean = outlier_iqr_removal(df, 'SalePrice')

# Save cleaned dataset
os.makedirs('../cleaned_data', exist_ok=True)
df_clean.to_csv('../cleaned_data/cleaned_melbourne_housing.csv', index=False)
print(f"Cleaned data saved. Removed {len(df) - len(df_clean)} rows as SalePrice outliers.")
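
The cell above relies on `outlier_iqr_removal` from the shared module, whose source is not part of this notebook. The usual IQR rule keeps values within 1.5 interquartile ranges of the first and third quartiles. The sketch below shows that rule under stated assumptions: the name `iqr_filter_sketch`, the `k` parameter, and the toy prices are hypothetical, and the shared helper may differ in its details.

```python
import pandas as pd


def iqr_filter_sketch(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical stand-in: keep rows whose value lies within the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive; NaN compares as False, so rows with a missing value are dropped too
    return df[df[column].between(lower, upper)].copy()


# Toy prices for illustration only: five typical sales and one extreme value
toy = pd.DataFrame({
    "SalePrice": [500_000, 520_000, 550_000, 580_000, 600_000, 5_000_000],
})
filtered = iqr_filter_sketch(toy, "SalePrice")
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

Because a filter like this also drops rows with a missing value in the target column, the difference `len(df) - len(df_clean)` can include more than strict outliers.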

## 4. EDA (Exploratory Data Analysis)
Visualize distributions and relationships.

In [None]:
# 1. Histogram of SalePrice
fig1 = px.histogram(df_clean, x='SalePrice', title='Sale Price Distribution', template='plotly_white')
fig1.show()

# Correlation Heatmap
correlation_heatmap(df_clean)
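
`correlation_heatmap` is another shared helper that is not shown here. A heatmap of this kind is typically drawn from a correlation matrix over the numeric columns, and the same matrix can be read directly to rank features against `SalePrice`. The sketch below assumes that approach; the toy values (and the choice of columns) are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only, not the real dataset
toy = pd.DataFrame({
    "SalePrice": [400_000, 650_000, 900_000, 1_200_000, 700_000],
    "BuildingArea": [60.0, 95.0, 140.0, 210.0, 110.0],
    "Rooms": [1, 2, 3, 4, 3],
    "Distance": [12.0, 9.0, 6.5, 3.0, 8.0],
    "Suburb": ["A", "B", "C", "D", "E"],
})

# Pearson correlation over numeric columns only; text columns such as Suburb are excluded
corr = toy.select_dtypes(include="number").corr()

# Rank the other features by the strength of their correlation with SalePrice
ranked = corr["SalePrice"].drop("SalePrice").sort_values(key=abs, ascending=False)
print(ranked)
```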

## 5. Visualizations (Dashboards)
Create interactive plots for key findings.

In [None]:
# 2. Scatter: BuildingArea vs SalePrice, color by Rooms
fig2 = px.scatter(df_clean, x='BuildingArea', y='SalePrice', color='Rooms',
                  title='Building Area vs Price by Rooms', template='plotly_white')
fig2.show()

# 3. Boxplot: SalePrice for the 10 suburbs with the most listings
top_suburbs = df_clean['Suburb'].value_counts().head(10).index
fig3 = px.box(df_clean[df_clean['Suburb'].isin(top_suburbs)], x='Suburb', y='SalePrice',
              title='Price Distribution by Top 10 Suburbs', template='plotly_white')
fig3.show()

# 4. Bar: Mean SalePrice by Type
mean_price_type = df_clean.groupby('Type')['SalePrice'].mean().reset_index()
fig4 = px.bar(mean_price_type, x='Type', y='SalePrice',
              title='Average Sale Price by Property Type', template='plotly_white')
fig4.show()

# 5. Geospatial Scatter
# 'Longtitude' and 'Lattitude' are spelled this way in the dataset's column names
fig5 = px.scatter(df_clean, x='Longtitude', y='Lattitude', color='SalePrice',
                  title='Melbourne House Prices by Location', template='plotly_white')
fig5.show()

## 6. Insights Summary
1. **Location Impact**: Central and established suburbs consistently show higher median prices.
2. **Size vs Price**: Building Area has a positive correlation with Sale Price, though other factors like property type also play a role.
3. **Property Type**: Different property types (houses, units, townhouses) show distinct price ranges and distributions.
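
The three observations above can be checked numerically rather than read off the plots alone. The sketch below shows one way to do that with grouped medians and a Pearson correlation. The toy values and suburb labels are invented for illustration and are not results from the Melbourne data; on the real data the same calls would run against `df_clean`.

```python
import pandas as pd

# Toy data for illustration only; the values are made up
toy = pd.DataFrame({
    "Suburb": ["Inner", "Inner", "Inner", "Outer", "Outer", "Outer"],
    "Type": ["h", "u", "h", "h", "u", "t"],
    "BuildingArea": [150.0, 70.0, 180.0, 140.0, 65.0, 110.0],
    "SalePrice": [1_500_000, 700_000, 1_800_000, 800_000, 400_000, 600_000],
})

# 1. Location: median price per suburb, highest first
median_by_suburb = toy.groupby("Suburb")["SalePrice"].median().sort_values(ascending=False)

# 2. Size vs price: Pearson correlation between area and price
area_price_corr = toy["BuildingArea"].corr(toy["SalePrice"])

# 3. Property type: count and price range per type
price_by_type = toy.groupby("Type")["SalePrice"].agg(["count", "median", "min", "max"])

print(median_by_suburb)
print(f"BuildingArea vs SalePrice correlation: {area_price_corr:.2f}")
print(price_by_type)
```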