# Housing Data EDA Template
**Student Name:** Jesse
**Dataset:** Boston Housing Dataset

This notebook provides a skeleton for loading, cleaning, and exploring the Boston Housing dataset.

## 1. Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

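# Add the repository root to the path so the shared helpers can be imported
sys.path.append(os.path.abspath('../../'))
# Import only the helpers this notebook uses, rather than a wildcard import
from shared.templates.common_functions import (
    missing_value_report,
    outlier_iqr_removal,
    correlation_heatmap,
)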

%matplotlib inline

In [None]:
# Load your dataset
df = pd.read_csv('../data/boston_housing.csv')
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

## 2. Exploration
Check the basic structure, data types, and missing values.

In [None]:
df.info()
display(df.describe())

# Missing value report using shared function
missing_report = missing_value_report(df)
if missing_report.empty:
    print("No missing values found.")
else:
    display(missing_report)
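
The shared `missing_value_report` helper is not shown in this notebook. As a rough illustration of what such a report could contain, here is a minimal sketch; the function name `missing_value_report_sketch`, its column names, and the toy data are assumptions for illustration, not the shared implementation.

```python
import numpy as np
import pandas as pd


def missing_value_report_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per column that has missing data."""
    missing_count = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (missing_count / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy frame for illustration only (not the Boston data)
toy = pd.DataFrame({
    "rm": [6.5, np.nan, 5.9, 7.1],
    "tax": [296, 242, 242, 222],
    "lstat": [4.98, np.nan, np.nan, 2.94],
})
toy_report = missing_value_report_sketch(toy)
print(toy_report)
```

Returning an empty frame when nothing is missing matches the `missing_report.empty` check used in the cell above.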

## 3. Cleaning & Feature Engineering
Handle missing values, outliers, and data type conversions.

In [None]:
# Rename the target column. Note: medv is the median value of owner-occupied
# homes (in $1000s), so 'SalePrice' here is a median, not an individual sale price.
df.rename(columns={'medv': 'SalePrice'}, inplace=True)

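# Feature Engineering
# Property-tax rate relative to the average number of rooms per dwelling
df['Tax_per_Room'] = df['tax'] / df['rm']
# Interaction between % lower-status population and average room count
df['Lower_Income_Impact'] = df['lstat'] * df['rm']
# Inverse-distance score in (0, 1]: larger means closer to employment centres
df['Distance_Index'] = 1 / (df['dis'] + 1)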

# Outlier Removal for SalePrice
df_clean = outlier_iqr_removal(df, 'SalePrice')

# Save cleaned dataset
os.makedirs('../cleaned_data', exist_ok=True)
df_clean.to_csv('../cleaned_data/cleaned_boston_housing.csv', index=False)
print(f"Cleaned data saved. Removed {len(df) - len(df_clean)} outliers.")
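
The shared `outlier_iqr_removal` helper is also not shown here. A common way to implement IQR-based filtering is the 1.5 × IQR rule, sketched below. The name `iqr_filter_sketch`, the multiplier `k`, and the toy values are assumptions; the shared helper may use different bounds.

```python
import pandas as pd


def iqr_filter_sketch(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical stand-in: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `column` are dropped
    return df[df[column].between(lower, upper)].copy()


# Toy values for illustration only (not the Boston data)
toy = pd.DataFrame({"SalePrice": [20.0, 21.0, 22.0, 23.0, 24.0, 50.0]})
toy_clean = iqr_filter_sketch(toy, "SalePrice")
print(f"Removed {len(toy) - len(toy_clean)} row(s)")
```

In the toy frame only the value 50.0 falls outside the bounds, so one row is removed.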

## 4. EDA (Exploratory Data Analysis)
Visualize distributions and relationships.

In [None]:
# 1. Histogram of SalePrice
fig1 = px.histogram(df_clean, x='SalePrice', title='Boston House Price Distribution (MEDV)', template='plotly_white')
fig1.show()

# Correlation Heatmap
correlation_heatmap(df_clean)
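
A heatmap can be hard to read when there are many columns. One complementary option is to rank features by their correlation with the target. The sketch below assumes Pearson correlation and uses a hypothetical helper name (`rank_correlations`) and toy data; it is not part of the shared functions.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute strength but keep the sign in the returned values
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame for illustration only (not the Boston data)
toy = pd.DataFrame({
    "rm": [5, 6, 7, 8],
    "lstat": [20, 15, 10, 5],
    "noise": [1, 3, 2, 4],
    "SalePrice": [15, 20, 25, 30],
})
ranked = rank_correlations(toy, "SalePrice")
print(ranked)
```

On the real data this would be called as `rank_correlations(df_clean, 'SalePrice')`.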

## 5. Visualizations (Dashboards)
Create interactive plots for key findings.

In [None]:
# 2. Scatter: RM vs SalePrice
# Note: trendline='ols' requires the statsmodels package to be installed
fig2 = px.scatter(df_clean, x='rm', y='SalePrice', trendline='ols',
                 title='Relationship: Number of Rooms vs Sale Price', template='plotly_white')
fig2.show()

# 3. Scatter: LSTAT vs SalePrice
fig3 = px.scatter(df_clean, x='lstat', y='SalePrice', color='age', 
                 title='Relationship: % Lower Status vs Sale Price (Colored by Age)', template='plotly_white')
fig3.show()

# 4. Boxplot: SalePrice by CHAS
fig4 = px.box(df_clean, x='chas', y='SalePrice', 
             title='Sale Price vs Proximity to Charles River (CHAS=1 if near)', template='plotly_white')
fig4.show()

## 6. Insights Summary
1. **Rooms**: The average number of rooms per dwelling (RM) is positively associated with price; the scatter plot with its OLS trendline shows prices rising as RM increases.
2. **Status**: Higher values of LSTAT (the percentage of the population classed as lower status) are associated with lower house prices.
3. **River Proximity**: Census tracts that bound the Charles River (CHAS=1) tend to have higher median home values than tracts that do not (CHAS=0).
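
To back the river-proximity insight with numbers rather than a visual impression from the boxplot, group medians can be compared directly. The sketch below uses toy values for illustration only; on the real data the same `groupby` would be applied to `df_clean`.

```python
import pandas as pd

# Toy frame for illustration only (not the Boston data)
toy = pd.DataFrame({
    "chas": [0, 0, 0, 0, 1, 1],
    "SalePrice": [18.0, 21.0, 22.0, 25.0, 24.0, 30.0],
})

# Group size and median price per CHAS value
river_summary = toy.groupby("chas")["SalePrice"].agg(["count", "median"])
median_gap = river_summary.loc[1, "median"] - river_summary.loc[0, "median"]

print(river_summary)
print(f"Median gap (CHAS=1 minus CHAS=0): {median_gap:.1f}")
```

Reporting the group counts alongside the medians helps show whether a gap rests on only a handful of observations.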