# Housing Data EDA Template
**Student Name:** Ekure
**Dataset:** Ames Housing Dataset

This notebook loads, cleans, and explores the Ames Housing dataset.

## 1. Import Libraries & Load Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import sys
import os

# Add shared directory to path to import common functions
sys.path.append(os.path.abspath('../../'))
from shared.templates.common_functions import *

%matplotlib inline

In [3]:
# Load your dataset
df = pd.read_csv('../data/ames_housing.csv')
print(f"Loaded dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

Loaded dataset with 1460 rows and 81 columns.


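|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |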


## 2. Exploration
Check the basic structure, data types, and missing values.

In [4]:
df.info()
display(df.describe())

# Missing value report using shared function
missing_report = missing_value_report(df)
if missing_report.empty:
    print("No missing values found.")
else:
    display(missing_report)

<class 'pandas.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   str    
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   str    
 6   Alley          91 non-null     str    
 7   LotShape       1460 non-null   str    
 8   LandContour    1460 non-null   str    
 9   Utilities      1460 non-null   str    
 10  LotConfig      1460 non-null   str    
 11  LandSlope      1460 non-null   str    
 12  Neighborhood   1460 non-null   str    
 13  Condition1     1460 non-null   str    
 14  Condition2     1460 non-null   str    
 15  BldgType       1460 non-null   str    
 16  HouseStyle     1460 non-null   str    
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460

|   | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


|   | Missing Count | Missing Percentage |
|---|---|---|
| PoolQC | 1453 | 99.520548 |
| MiscFeature | 1406 | 96.30137 |
| Alley | 1369 | 93.767123 |
| Fence | 1179 | 80.753425 |
| MasVnrType | 872 | 59.726027 |
| FireplaceQu | 690 | 47.260274 |
| LotFrontage | 259 | 17.739726 |
| GarageType | 81 | 5.547945 |
| GarageYrBlt | 81 | 5.547945 |
| GarageFinish | 81 | 5.547945 |
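
The notebook does not show how the shared `missing_value_report` helper is implemented. The following is a minimal sketch of a function that would produce a report with the same two columns, assuming it counts `NaN` entries per column and sorts them in descending order. The name `missing_value_report_sketch` and the toy data are made up for illustration and are not the shared implementation.

```python
import numpy as np
import pandas as pd


def missing_value_report_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the shared missing_value_report helper."""
    missing_count = frame.isna().sum()
    report = pd.DataFrame({
        "Missing Count": missing_count,
        "Missing Percentage": missing_count / len(frame) * 100,
    })
    # Keep only columns that actually have gaps, largest gap first
    report = report[report["Missing Count"] > 0]
    return report.sort_values("Missing Count", ascending=False)


# Toy frame with made-up values, only to exercise the function
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
toy_report = missing_value_report_sketch(toy)
print(toy_report)
```

A frame with no gaps yields an empty report, which is the case the `missing_report.empty` check in the cell above handles.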


## 3. Cleaning & Feature Engineering
Handle missing values, outliers, and data type conversions.

In [5]:
# Fill missing entries with 'None' in the categorical columns that are mostly or largely empty
none_cols = ['PoolQC', 'Alley', 'FireplaceQu', 'Fence', 'MiscFeature']
df[none_cols] = df[none_cols].fillna('None')

# LotFrontage is missing in only about 18% of rows,
# so impute the missing values with the column median
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

# Feature Engineering
df['HouseAge'] = df['YrSold'] - df['YearBuilt']
df['TotalBath'] = df['FullBath'] + 0.5 * df['HalfBath'] + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath']
df['Quality_Age'] = df['OverallQual'] / (df['HouseAge'] + 1)

# Outlier Removal for SalePrice
df_clean = outlier_iqr_removal(df, 'SalePrice')

# Save cleaned dataset
os.makedirs('../cleaned_data', exist_ok=True)
df_clean.to_csv('../cleaned_data/cleaned_ames_housing.csv', index=False)
print(f"Cleaned data saved. Removed {len(df) - len(df_clean)} outliers.")

Cleaned data saved. Removed 61 outliers.
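
The implementation of `outlier_iqr_removal` lives in the shared module and is not shown here. As a rough sketch, assuming it follows the conventional 1.5 × IQR rule, the filter could look like the function below. The name `iqr_filter_sketch` and the multiplier `k` are assumptions for illustration. The quartiles in the arithmetic are taken from the `describe()` output in section 2.

```python
import pandas as pd


def iqr_filter_sketch(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Hypothetical stand-in for outlier_iqr_removal.

    Keeps rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR].
    """
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)].copy()


# Fences implied by the SalePrice quartiles reported by describe()
q1, q3 = 129975.0, 214000.0
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)  # 3937.5 340037.5

# Toy data (made up): one extreme value gets filtered out
toy = pd.DataFrame({"SalePrice": [100, 110, 120, 130, 1000]})
toy_clean = iqr_filter_sketch(toy, "SalePrice")
print(len(toy) - len(toy_clean))  # 1
```

Under this assumption the lower fence (3,937.5) sits below the minimum sale price of 34,900, so every removed row would be a high-priced sale.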


## 4. EDA (Exploratory Data Analysis)
Visualize distributions and relationships.

In [6]:
# 1. Histogram of SalePrice
fig1 = px.histogram(df_clean, x='SalePrice', title='Sale Price Distribution', template='plotly_white')
fig1.show()

This histogram shows a market dominated by affordable to mid-range housing, with most sales
concentrated between **$130,000** and **$203,500**.

It also shows a smaller group of sales at the high end of the range, approaching **$350,000**.
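
A reading taken from a histogram can be cross-checked numerically. The sketch below computes the share of sales inside a price band. The helper name `share_in_band` and the toy prices are made up for illustration; in the notebook the same call would be applied to `df_clean['SalePrice']`.

```python
import pandas as pd


def share_in_band(prices: pd.Series, low: float, high: float) -> float:
    """Return the fraction of prices that fall inside [low, high]."""
    return float(prices.between(low, high).mean())


# Toy prices (made up), not taken from the dataset
toy_prices = pd.Series(
    [90_000, 135_000, 150_000, 160_000, 180_000, 200_000, 250_000, 340_000]
)
band_share = share_in_band(toy_prices, 130_000, 203_500)
print(f"{band_share:.1%} of toy sales fall in the band")

# Quartiles give a second view of where the bulk of the prices sit
print(toy_prices.quantile([0.25, 0.5, 0.75]))
```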

## 5. Visualizations (Dashboards)
Create interactive plots for key findings.

In [13]:
# 2. Scatter: HouseAge vs SalePrice, color by OverallQual
fig2 = px.scatter(df_clean, x='HouseAge', y='SalePrice', color='OverallQual',
                  title='Age of the building in relation to how much it costs', template='plotly_white')
fig2.show()

This plot shows that newly built houses and houses around 20 to 60 years old are the most sought after and are moderately priced.
It also shows that houses between 0 and 25 years old tend to be of higher quality but are more expensive.
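
One way to back up a reading like this is to bin `HouseAge` and summarize each bin. The sketch below uses made-up toy values and assumed bin edges (0-25, 26-60, 61+) chosen to mirror the ranges mentioned above. In the notebook the same steps would run on `df_clean`.

```python
import pandas as pd

# Toy data (made up), not taken from the dataset
toy = pd.DataFrame({
    "HouseAge": [0, 3, 12, 30, 45, 58, 80, 95],
    "OverallQual": [8, 7, 7, 5, 5, 6, 4, 5],
    "SalePrice": [260_000, 230_000, 210_000, 150_000,
                  140_000, 155_000, 110_000, 120_000],
})

# Assumed bin edges; include_lowest=True keeps age 0 in the first bin
age_bins = [0, 25, 60, 200]
age_labels = ["0-25", "26-60", "61+"]
toy["AgeBand"] = pd.cut(
    toy["HouseAge"], bins=age_bins, labels=age_labels, include_lowest=True
)

# Number of sales, median price, and mean quality per age band
age_summary = toy.groupby("AgeBand", observed=True).agg(
    count=("SalePrice", "size"),
    median_price=("SalePrice", "median"),
    mean_quality=("OverallQual", "mean"),
)
print(age_summary)
```

The `count` column indicates which age bands account for most sales, while `median_price` and `mean_quality` show whether newer houses are in fact pricier and rated higher.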

In [14]:

# 3. Boxplot: SalePrice by Neighborhood
fig3 = px.box(df_clean.sort_values('SalePrice'), x='Neighborhood', y='SalePrice', 
             title='Price Distribution by Neighborhood', template='plotly_white')
fig3.show()


This box plot shows how the sale price of a house varies by neighborhood,
with 'NoRidge' being the most expensive neighborhood and 'IDOTRR' the least expensive.
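
The neighborhood ranking can also be computed directly rather than read off the chart. The sketch below ranks neighborhoods by median sale price on a made-up toy frame. The three neighborhood names appear in the dataset, but the prices are invented for illustration.

```python
import pandas as pd

# Toy data (made-up prices), only to show the ranking step
toy = pd.DataFrame({
    "Neighborhood": ["NoRidge", "NoRidge", "IDOTRR", "IDOTRR", "NAmes", "NAmes"],
    "SalePrice": [300_000, 320_000, 90_000, 110_000, 140_000, 150_000],
})

# Median price per neighborhood, most expensive first
median_by_hood = (
    toy.groupby("Neighborhood")["SalePrice"]
    .median()
    .sort_values(ascending=False)
)
hood_order = median_by_hood.index.tolist()
print(median_by_hood)
```

Passing `category_orders={'Neighborhood': hood_order}` to `px.box` would then draw the boxes in median order. Sorting the rows by `SalePrice` alone does not order the boxes by median, because the categories then appear in the order of each neighborhood's lowest sale.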

In [15]:

# 4. Bar: Mean SalePrice by OverallQual
mean_price_qual = df_clean.groupby('OverallQual')['SalePrice'].mean().reset_index()
fig4 = px.bar(mean_price_qual, x='OverallQual', y='SalePrice',
             title='Average Sale Price by Overall Quality', template='plotly_white')
fig4.show()

This bar chart shows how the average sale price changes with the overall quality rating of a house.
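
A bar chart of means hides how many houses sit in each quality level and how strong the relationship is. As a hedged sketch on made-up toy data, the summary below adds per-level counts and a Spearman rank correlation, computed here as the Pearson correlation of the ranks so that pandas alone is enough.

```python
import pandas as pd

# Toy data (made up), not taken from the dataset
toy = pd.DataFrame({
    "OverallQual": [4, 5, 5, 6, 7, 8],
    "SalePrice": [100_000, 120_000, 130_000, 160_000, 210_000, 280_000],
})

# Count, mean, and median price per quality level
qual_summary = toy.groupby("OverallQual")["SalePrice"].agg(["count", "mean", "median"])
print(qual_summary)

# Spearman rank correlation: Pearson correlation applied to the ranks
rank_corr = toy["OverallQual"].rank().corr(toy["SalePrice"].rank())
print(f"Spearman rank correlation: {rank_corr:.3f}")
```

A value close to 1 would indicate that price rises consistently with the quality rating. It measures association only and does not show that quality causes the higher price.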