# Housing Data EDA Template
**Student Name:** Ekure
**Dataset:** Ames Housing Dataset

This notebook provides a skeleton for loading, cleaning, and exploring the Ames Housing dataset.

## 1. Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

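# Make the `shared` package (two directories up) importable
sys.path.append(os.path.abspath('../../'))
# Import only the helpers this notebook uses instead of a wildcard import
from shared.templates.common_functions import missing_value_report, outlier_iqr_removal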

%matplotlib inline

In [None]:
# Load your dataset
df = pd.read_csv('../data/ames_housing.csv')
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()
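
Before moving on, it can help to confirm that the file actually contains the columns the later cells rely on. Some distributions of the Ames data use spaced headers such as `Lot Frontage`, while this notebook expects the unspaced form. The sketch below is a minimal check under that assumption; `REQUIRED_COLUMNS` and `find_missing_columns` are hypothetical names introduced here, and the toy frame stands in for the real CSV.

```python
import pandas as pd

# Columns read by later cells of this notebook (unspaced, Kaggle-style names).
REQUIRED_COLUMNS = [
    "SalePrice", "LotFrontage", "YrSold", "YearBuilt", "FullBath", "HalfBath",
    "BsmtFullBath", "BsmtHalfBath", "OverallQual", "GrLivArea", "Neighborhood",
    "PoolQC", "Alley", "FireplaceQu", "Fence", "MiscFeature",
]


def find_missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for the real file; one header uses the spaced spelling.
toy = pd.DataFrame({"SalePrice": [200000], "Lot Frontage": [65.0], "YrSold": [2008]})
missing_cols = find_missing_columns(toy, required=["SalePrice", "LotFrontage", "YrSold"])
print(missing_cols)  # ['LotFrontage']
```

On the real data, calling `find_missing_columns(df)` right after loading should return an empty list; anything else points to a header mismatch worth fixing before the cleaning step.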

## 2. Exploration
Check the basic structure, data types, and missing values.

In [None]:
df.info()
display(df.describe())

# Missing value report using shared function
missing_report = missing_value_report(df)
if missing_report.empty:
    print("No missing values found.")
else:
    display(missing_report)
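
The body of `missing_value_report` lives in the shared module and is not shown in this notebook. As a rough idea of what such a helper might compute, here is a hypothetical stand-in (`missing_value_report_sketch`) that returns per-column counts and percentages and is empty when nothing is missing, which is the behavior the cell above relies on. The real helper may differ in column names and ordering.

```python
import pandas as pd


def missing_value_report_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first.

    Hypothetical stand-in; the actual shared helper may be implemented differently.
    """
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with gaps, so an empty result means "nothing missing".
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy data (made up) with gaps in two of three columns.
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, None],
    "PoolQC": [None, None, None, "Gd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})
report = missing_value_report_sketch(toy)
print(report)
```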

## 3. Cleaning & Feature Engineering
Handle missing values, outliers, and data type conversions.

In [None]:
# In these columns a missing value means the feature is absent (no pool, no alley access, etc.),
# so label it explicitly rather than leaving it as NaN
none_cols = ['PoolQC', 'Alley', 'FireplaceQu', 'Fence', 'MiscFeature']
df[none_cols] = df[none_cols].fillna('None')

# Impute numerical missing values using median
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

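# Feature engineering
# Clip at 0: a sale recorded before the build year would give a negative age
# and could make the Quality_Age denominator zero
df['HouseAge'] = (df['YrSold'] - df['YearBuilt']).clip(lower=0)
# Half baths count as 0.5 of a full bath
df['TotalBath'] = df['FullBath'] + 0.5 * df['HalfBath'] + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath']
df['Quality_Age'] = df['OverallQual'] / (df['HouseAge'] + 1)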

# Outlier Removal for SalePrice
df_clean = outlier_iqr_removal(df, 'SalePrice')

# Save cleaned dataset
os.makedirs('../cleaned_data', exist_ok=True)
df_clean.to_csv('../cleaned_data/cleaned_ames_housing.csv', index=False)
print(f"Cleaned data saved. Removed {len(df) - len(df_clean)} outliers.")
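
`outlier_iqr_removal` is imported from the shared module, so its exact rule is not visible here. A common convention, assumed in the sketch below, is to keep rows within 1.5 × IQR of the first and third quartiles. The function name `outlier_iqr_removal_sketch` and the multiplier parameter `k` are hypothetical; check the shared implementation before relying on the specific bounds.

```python
import pandas as pd


def outlier_iqr_removal_sketch(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical stand-in for the shared helper; k = 1.5 is an assumed default.
    """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `column` are dropped too.
    return frame[frame[column].between(lower, upper)].copy()


# Toy prices (made up): Q1 = 115, Q3 = 145, so the kept range is [70, 190].
toy = pd.DataFrame({"SalePrice": [100, 110, 120, 130, 140, 150, 1000]})
kept = outlier_iqr_removal_sketch(toy, "SalePrice")
print(len(toy) - len(kept))  # 1
```

Because sale prices are right-skewed, a rule like this mostly trims expensive homes, which is worth keeping in mind when reading the plots built from `df_clean`.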

## 4. EDA (Exploratory Data Analysis)
Visualize distributions and relationships.

In [None]:
# 1. Histogram of SalePrice
fig1 = px.histogram(df_clean, x='SalePrice', title='Sale Price Distribution', template='plotly_white')
fig1.show()

# Correlation heatmap: the 10 numeric features most positively correlated with SalePrice
# (head(11) because SalePrice itself ranks first with r = 1)
top_corr = df_clean.select_dtypes(include=[np.number]).corr()['SalePrice'].sort_values(ascending=False).head(11).index
plt.figure(figsize=(12, 10))
sns.heatmap(df_clean[top_corr].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap: SalePrice and Its Top 10 Positively Correlated Features')
plt.show()
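
The selection above sorts by signed correlation, so a feature with a strong negative relationship to price (house age is a plausible candidate) would never make the list. If that matters, one option is to rank by absolute correlation instead. The helper below is a sketch of that variant on made-up data; `top_correlated_features` is a hypothetical name, not part of the shared module.

```python
import numpy as np
import pandas as pd


def top_correlated_features(frame: pd.DataFrame, target: str, k: int = 10) -> list:
    """Names of the k numeric columns with the largest |Pearson r| against `target`."""
    corr = frame.select_dtypes(include=[np.number]).corr()[target].drop(target)
    # Rank by magnitude so strong negative relationships are not discarded.
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


# Toy data (made up) with one rising, one falling, and one weaker feature.
toy = pd.DataFrame({
    "SalePrice": [100, 150, 200, 250, 300],
    "GrLivArea": [800, 1100, 1500, 1900, 2400],   # rises with price
    "HouseAge": [60, 45, 30, 12, 3],              # falls as price rises
    "MoSold": [6, 6, 6, 7, 6],                    # weaker relationship to price
    "Neighborhood": ["A", "B", "A", "C", "B"],    # non-numeric, ignored
})
top2 = top_correlated_features(toy, "SalePrice", k=2)
print(top2)
```

To reuse it for the heatmap, pass `[target] + top_correlated_features(df_clean, 'SalePrice')` as the column selection.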

## 5. Visualizations (Dashboards)
Create interactive plots for key findings.

In [None]:
# 2. Scatter: GrLivArea vs SalePrice, color by OverallQual
fig2 = px.scatter(df_clean, x='GrLivArea', y='SalePrice', color='OverallQual', 
                 title='Living Area vs Price by Quality', template='plotly_white')
fig2.show()

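# 3. Boxplot: SalePrice by Neighborhood, ordered by median price
# (sorting rows by SalePrice does not order the neighborhoods by their medians)
nbhd_order = df_clean.groupby('Neighborhood')['SalePrice'].median().sort_values().index.tolist()
fig3 = px.box(df_clean, x='Neighborhood', y='SalePrice',
             category_orders={'Neighborhood': nbhd_order},
             title='Price Distribution by Neighborhood', template='plotly_white')
fig3.show()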

# 4. Bar: Mean SalePrice by OverallQual
mean_price_qual = df_clean.groupby('OverallQual')['SalePrice'].mean().reset_index()
fig4 = px.bar(mean_price_qual, x='OverallQual', y='SalePrice', 
             title='Average Sale Price by Overall Quality', template='plotly_white')
fig4.show()

## 6. Insights Summary
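1. **Quality Correlation**: Among the numeric features in the heatmap, `OverallQual` is expected to show the strongest correlation with `SalePrice`; confirm the value against the heatmap output.
2. **Living Area**: `GrLivArea` (above-ground living area) has a positive, roughly linear relationship with `SalePrice` in the scatter plot.
3. **Neighborhood**: Median sale price differs substantially between neighborhoods, as the box plot shows.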
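
Each of these statements can be backed by a number rather than read off a chart. The sketch below shows one way to do that, using a small made-up frame in place of `df_clean`; the values are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for df_clean (values are made up for illustration).
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NridgHt", "NridgHt", "OldTown", "OldTown"],
    "OverallQual": [5, 6, 9, 8, 4, 5],
    "GrLivArea": [1100, 1300, 2300, 2000, 1000, 1200],
    "SalePrice": [140000, 155000, 330000, 300000, 110000, 125000],
})

# Insights 1 and 2: Pearson correlation of each numeric feature with SalePrice.
corr_with_price = (
    toy.select_dtypes(include="number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

# Insight 3: median price per neighborhood, highest first.
median_by_nbhd = (
    toy.groupby("Neighborhood")["SalePrice"].median().sort_values(ascending=False)
)

print(corr_with_price)
print(median_by_nbhd)
```

Running the same two expressions on `df_clean` gives the figures to quote alongside each insight.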