# 🏠 House Price Predictor - Phase 1 & 2: EDA

**Dataset:** Ames Housing Dataset (~2,900 rows, 80 features)  
**Goal:** Understand the data, find patterns, and prepare for modeling  

### Sections
1. Setup & Data Loading
2. Price Distribution & Skew
3. Missing Values
4. Feature Correlations Heatmap
5. Key Features vs Price (Scatter Plots)
6. Categorical Features vs Price (Boxplots)
7. Outlier Detection
8. EDA Summary & Key Insights

---
## 1. Setup & Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Plot style
sns.set_theme(style='whitegrid', palette='muted')
plt.rcParams['figure.dpi'] = 120
plt.rcParams['font.size'] = 11

# Load dataset
URL = 'https://raw.githubusercontent.com/dsrscientist/dataset1/master/AmesHousing.csv'
df = pd.read_csv(URL)

print(f'✅ Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns')
df.head()

In [None]:
# Quick overview
print('── Data Types ──────────────────')
print(df.dtypes.value_counts())
print(f'\nNumeric columns:  {df.select_dtypes(include=np.number).shape[1]}')
print(f'Categorical columns: {df.select_dtypes(include="object").shape[1]}')
print(f'\nSalePrice range: ${df["SalePrice"].min():,} → ${df["SalePrice"].max():,}')
print(f'SalePrice mean:  ${df["SalePrice"].mean():,.0f}')
print(f'SalePrice median:${df["SalePrice"].median():,.0f}')

---
## 2. Price Distribution & Skew

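Regression models do not strictly require a **normally distributed** target, but a strongly right-skewed one lets a few expensive houses dominate a squared-error loss and tends to leave skewed residuals in linear models.  
If SalePrice is skewed, a log transform (`log1p`) usually reduces the skew. Let's check.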

In [None]:
from scipy import stats

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('SalePrice Distribution Analysis', fontsize=15, fontweight='bold', y=1.01)

# ── Raw histogram
axes[0, 0].hist(df['SalePrice'], bins=60, color='#4C72B0', edgecolor='white', alpha=0.85)
axes[0, 0].set_title(f'Raw SalePrice  (skew={df["SalePrice"].skew():.2f})')
axes[0, 0].set_xlabel('Price ($)')
axes[0, 0].set_ylabel('Count')
axes[0, 0].axvline(df['SalePrice'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].axvline(df['SalePrice'].median(), color='green', linestyle='--', label='Median')
axes[0, 0].legend()

# ── Log-transformed histogram
log_price = np.log1p(df['SalePrice'])
axes[0, 1].hist(log_price, bins=60, color='#55A868', edgecolor='white', alpha=0.85)
axes[0, 1].set_title(f'log1p(SalePrice)  (skew={log_price.skew():.2f})')
axes[0, 1].set_xlabel('log(Price)')
axes[0, 1].set_ylabel('Count')

# ── Raw Q-Q plot
stats.probplot(df['SalePrice'], plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot: Raw SalePrice')

# ── Log Q-Q plot
stats.probplot(log_price, plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot: log1p(SalePrice)')

plt.tight_layout()
plt.show()

print('💡 Insight: Raw SalePrice is right-skewed (skew > 1).')
print('   After log transform, skew drops close to 0, which is much better for modeling!')
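
The transform is only half of the workflow: a model trained on `log1p(SalePrice)` predicts in log space, so its predictions have to go back through `np.expm1` before they can be read as dollars. A minimal sketch on synthetic log-normal prices (an assumption for illustration, not the Ames data) checks both the skew reduction and the round trip:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the Ames data.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5_000)))

log_prices = np.log1p(prices)      # transform applied before training
restored = np.expm1(log_prices)    # inverse transform applied to predictions

raw_skew = prices.skew()
log_skew = log_prices.skew()
round_trip_ok = bool(np.allclose(restored, prices))

print(f'raw skew = {raw_skew:.2f}, log skew = {log_skew:.2f}')
print(f'round trip recovers prices: {round_trip_ok}')
```

In the notebook itself the same two calls would be applied to `df['SalePrice']` and to the model's predictions respectively.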

---
## 3. Missing Values

In [None]:
# Calculate missing %
missing = (df.isnull().sum() / len(df) * 100)
missing = missing[missing > 0].sort_values(ascending=False)

print(f'Columns with missing data: {len(missing)} / {df.shape[1]}')
print(f'Columns with >50% missing: {(missing > 50).sum()}')
print()

# Colour-coded bar chart
fig, ax = plt.subplots(figsize=(13, 6))
colors = ['#C44E52' if x > 50 else '#DD8452' if x > 20 else '#4C72B0' for x in missing.values]
bars = ax.bar(missing.index, missing.values, color=colors, edgecolor='white')
ax.axhline(50, color='red', linestyle='--', alpha=0.6, label='50% threshold')
ax.axhline(20, color='orange', linestyle='--', alpha=0.6, label='20% threshold')
ax.set_title('Missing Values by Column (%)', fontsize=13, fontweight='bold')
ax.set_ylabel('Missing (%)')
# Rotate the existing tick labels instead of overwriting them
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
ax.legend()
plt.tight_layout()
plt.show()

print('💡 Insight: Red bars = >50% missing → likely safe to DROP these columns.')
print('   Orange bars = 20-50% missing → fill with domain knowledge (e.g. None/0).')
print('   Blue bars = <20% missing → fill with median/mode.')
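
The colour-coded rule of thumb can be turned into a small helper. The sketch below is one possible reading of it, not the project's actual preprocessing: `impute_by_missing_rate` and its `drop_above` parameter are hypothetical names, and the orange tier (20-50% missing) is deliberately not special-cased because it depends on domain knowledge about each column, so those columns fall through to the median/mode fill here.

```python
import numpy as np
import pandas as pd

def impute_by_missing_rate(frame, drop_above=50.0):
    """Drop columns missing more than `drop_above` percent, then fill the rest.

    Numeric columns are filled with the median, all other columns with the mode.
    """
    out = frame.copy()
    pct_missing = out.isnull().mean() * 100
    out = out.drop(columns=pct_missing[pct_missing > drop_above].index)
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy frame with hypothetical columns, not the Ames data
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],         # 75% missing, dropped
    'area':           [100.0, np.nan, 300.0, 200.0],         # numeric, median fill
    'style':          ['1Story', None, '1Story', '2Story'],  # text, mode fill
})
clean = impute_by_missing_rate(toy)
print(clean)
```

Keep in mind that for several Ames columns a missing value means the feature is absent (no pool, no alley), so a blanket drop can discard real information.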

---
## 4. Feature Correlations Heatmap

Which numeric features correlate most strongly with SalePrice?

In [None]:
# Top 15 numeric columns by absolute correlation with SalePrice (SalePrice itself is included)
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()
top_features = corr_matrix['SalePrice'].abs().sort_values(ascending=False).head(15).index

fig, ax = plt.subplots(figsize=(13, 10))
sns.heatmap(
    corr_matrix.loc[top_features, top_features],
    annot=True, fmt='.2f', cmap='coolwarm',
    center=0, linewidths=0.5, ax=ax,
    annot_kws={'size': 9}
)
ax.set_title('Correlation Heatmap - Top 15 Features', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
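# Bar chart of the 15 strongest correlations with SalePrice, ranked by |r|
# so that strong negative correlations are not silently dropped
target_corr = corr_matrix['SalePrice'].drop('SalePrice')
top_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index).head(15)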

fig, ax = plt.subplots(figsize=(11, 5))
colors = ['#4C72B0' if v > 0 else '#C44E52' for v in top_corr.values]
ax.barh(top_corr.index[::-1], top_corr.values[::-1], color=colors[::-1], edgecolor='white')
ax.axvline(0, color='black', linewidth=0.8)
ax.set_title('Top 15 Features Correlated with SalePrice', fontsize=13, fontweight='bold')
ax.set_xlabel('Pearson Correlation')
plt.tight_layout()
plt.show()

print('💡 Top 5 numeric predictors:')
for feat, val in top_corr.head(5).items():
    print(f'   {feat:<25} r = {val:.3f}')
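
When ranking predictors, the strength of a correlation is its absolute value, while the sign only tells the direction. A minimal sketch of ranking by |r| while keeping the sign for display, using toy data and a hypothetical helper name `rank_by_abs_corr`:

```python
import numpy as np
import pandas as pd

def rank_by_abs_corr(frame, target, top_n=15):
    """Return signed Pearson r with `target`, ordered by |r| (largest first)."""
    corr = frame.select_dtypes(include=np.number).corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top_n)

# Toy data with hypothetical columns, not the Ames data
toy = pd.DataFrame({
    'price': [10, 20, 30, 40, 50],
    'size':  [1, 2, 3, 4, 5],        # perfectly positive relationship
    'age':   [50, 40, 30, 20, 10],   # perfectly negative relationship
    'noise': [3, 1, 4, 1, 5],        # weak relationship
})
ranked = rank_by_abs_corr(toy, 'price')
print(ranked)
```

On the notebook's data the call would be `rank_by_abs_corr(df, 'SalePrice')`. Identifier-like numeric columns, if the file has any, are worth excluding first since their correlation with price carries no meaning.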

---
## 5. Key Features vs Price (Scatter Plots)

Visualising the actual relationship between the top predictors and SalePrice.

In [None]:
scatter_features = [
    ('Gr Liv Area',    'Above Ground Living Area (sq ft)'),
    ('Total Bsmt SF',  'Total Basement Area (sq ft)'),
    ('1st Flr SF',     '1st Floor Area (sq ft)'),
    ('Garage Area',    'Garage Area (sq ft)'),
    ('Year Built',     'Year Built'),
    ('Overall Qual',   'Overall Quality (1-10)'),
]

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Key Features vs SalePrice', fontsize=15, fontweight='bold')
axes = axes.flatten()

for i, (col, label) in enumerate(scatter_features):
    if col in df.columns:
        axes[i].scatter(df[col], df['SalePrice'], alpha=0.35, s=18, color='#4C72B0')
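        # Trend line: fit only on rows where the feature is present
        # (filling NaN with 0 would invent points at x=0 and distort the slope)
        valid = df[[col, 'SalePrice']].dropna()
        z = np.polyfit(valid[col], valid['SalePrice'], 1)
        p = np.poly1d(z)
        x_line = np.linspace(valid[col].min(), valid[col].max(), 100)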
        axes[i].plot(x_line, p(x_line), color='red', linewidth=1.5, linestyle='--')
        r = df[[col, 'SalePrice']].corr().iloc[0, 1]
        axes[i].set_title(f'{label}  (r={r:.2f})')
        axes[i].set_xlabel(label)
        axes[i].set_ylabel('SalePrice ($)')

plt.tight_layout()
plt.show()

print('💡 Insight: Gr Liv Area and Overall Qual show the strongest linear relationships.')
print('   Year Built shows that newer homes command higher prices, but with more variance.')
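
A trend line is only as trustworthy as the rows it is fitted on. The sketch below uses toy data (hypothetical, with y exactly equal to 2x) to contrast a fit on complete rows with a fit where a missing x value is replaced by 0:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y = 2 * x exactly, with one missing x value.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, np.nan, 5.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Option 1: fit only on rows where both values are present.
valid = toy.dropna()
slope_clean, intercept_clean = np.polyfit(valid['x'], valid['y'], 1)

# Option 2: fill the missing x with 0, which invents a point at (0, 8).
slope_filled, intercept_filled = np.polyfit(toy['x'].fillna(0), toy['y'], 1)

print(f'complete rows: slope = {slope_clean:.2f}')
print(f'fillna(0):     slope = {slope_filled:.2f}')
```

A single invented point is enough to pull the fitted slope well away from the true value of 2, which is why dropping incomplete rows is the safer default for a purely visual trend line.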

---
## 6. Categorical Features vs Price (Boxplots)

How do categorical features like Neighborhood or House Style affect price?

In [None]:
# Neighborhood vs SalePrice
neighborhood_median = df.groupby('Neighborhood')['SalePrice'].median().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(14, 6))
df_sorted = df.copy()
df_sorted['Neighborhood'] = pd.Categorical(
    df_sorted['Neighborhood'],
    categories=neighborhood_median.index, ordered=True
)
sns.boxplot(
    data=df_sorted.sort_values('Neighborhood'),
    x='Neighborhood', y='SalePrice',
    palette='Blues_d', ax=ax
)
ax.set_title('SalePrice by Neighborhood', fontsize=13, fontweight='bold')
ax.set_xlabel('Neighborhood')
ax.set_ylabel('SalePrice ($)')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Overall Quality vs SalePrice (key feature)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Quality & Style vs SalePrice', fontsize=13, fontweight='bold')

# Overall Qual boxplot
sns.boxplot(data=df, x='Overall Qual', y='SalePrice', palette='Blues', ax=axes[0])
axes[0].set_title('Overall Quality vs SalePrice')
axes[0].set_xlabel('Overall Quality (1=Poor → 10=Excellent)')
axes[0].set_ylabel('SalePrice ($)')

# House Style boxplot
style_order = df.groupby('House Style')['SalePrice'].median().sort_values(ascending=False).index
sns.boxplot(data=df, x='House Style', y='SalePrice', order=style_order, palette='muted', ax=axes[1])
axes[1].set_title('House Style vs SalePrice')
axes[1].set_xlabel('House Style')
axes[1].set_ylabel('SalePrice ($)')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=30, ha='right')

plt.tight_layout()
plt.show()

print('💡 Insight: Overall Quality is the strongest single price predictor here: each step up')
print('   in quality adds roughly $30,000-$50,000 to the median sale price.')
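
The size of the per-step price increase can be read off the data directly instead of from the boxplot. A minimal sketch, with a hypothetical helper `median_step_by_level` and toy numbers that are not taken from the Ames data:

```python
import pandas as pd

def median_step_by_level(frame, level_col, price_col):
    """Median price per level and the change from the previous level."""
    medians = frame.groupby(level_col)[price_col].median().sort_index()
    return pd.DataFrame({'median': medians, 'step': medians.diff()})

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Overall Qual': [5, 5, 6, 6, 7, 7],
    'SalePrice':    [120_000, 140_000, 150_000, 170_000, 200_000, 220_000],
})
steps = median_step_by_level(toy, 'Overall Qual', 'SalePrice')
print(steps)
```

Calling `median_step_by_level(df, 'Overall Qual', 'SalePrice')` in the notebook would show whether the step is roughly constant or grows toward the top of the quality scale.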

---
## 7. Outlier Detection

Outliers can heavily distort model training. The Ames dataset has a few known ones.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Outlier Detection', fontsize=13, fontweight='bold')

# Scatter: Gr Liv Area vs SalePrice (known outliers visible here)
axes[0].scatter(df['Gr Liv Area'], df['SalePrice'], alpha=0.4, s=18, color='#4C72B0')
outliers = df[(df['Gr Liv Area'] > 4000) & (df['SalePrice'] < 300000)]
axes[0].scatter(outliers['Gr Liv Area'], outliers['SalePrice'],
                color='red', s=60, zorder=5, label=f'Outliers ({len(outliers)})')
axes[0].axvline(4000, color='red', linestyle='--', alpha=0.5)
axes[0].set_title('Gr Liv Area vs SalePrice')
axes[0].set_xlabel('Above Ground Living Area (sq ft)')
axes[0].set_ylabel('SalePrice ($)')
axes[0].legend()

# Boxplot of SalePrice to see high-end outliers
axes[1].boxplot(df['SalePrice'], vert=True, patch_artist=True,
                boxprops=dict(facecolor='#4C72B0', alpha=0.6))
axes[1].set_title('SalePrice Boxplot')
axes[1].set_ylabel('SalePrice ($)')
axes[1].set_xticks([])

plt.tight_layout()
plt.show()

# IQR outlier count
Q1 = df['SalePrice'].quantile(0.25)
Q3 = df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1
iqr_outliers = df[(df['SalePrice'] < Q1 - 1.5*IQR) | (df['SalePrice'] > Q3 + 1.5*IQR)]

print(f'💡 IQR method detects {len(iqr_outliers)} SalePrice outliers')
print(f'   Known problematic outliers (large house, cheap price): {len(outliers)}')
print('   → These will be REMOVED in preprocess.py before training')
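
Both outlier rules used above are easy to wrap in reusable helpers. The sketch below is an assumption about how that might look, not the contents of `preprocess.py`: the function names `iqr_bounds` and `drop_large_cheap` are hypothetical, and the thresholds are simply the ones used in this section.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the IQR rule: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_large_cheap(frame, area_col='Gr Liv Area', price_col='SalePrice',
                     min_area=4000, max_price=300_000):
    """Remove rows that are both very large and very cheap."""
    mask = (frame[area_col] > min_area) & (frame[price_col] < max_price)
    return frame.loc[~mask].reset_index(drop=True)

# Toy values (hypothetical, not the Ames data)
toy_prices = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 1_000])
low, high = iqr_bounds(toy_prices)
flagged = toy_prices[(toy_prices < low) | (toy_prices > high)]

toy_houses = pd.DataFrame({
    'Gr Liv Area': [4500, 4300, 1500],
    'SalePrice':   [180_000, 750_000, 150_000],
})
kept = drop_large_cheap(toy_houses)
print(flagged.tolist(), len(kept))
```

The IQR fences flag every expensive home, so they are better suited to reporting than to automatic removal; the targeted area/price rule removes only the rows that contradict the size-price relationship.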

---
## 8. EDA Summary & Key Insights

Everything we learned and what it means for modeling.

In [None]:
print('=' * 55)
print('  📋 EDA SUMMARY - House Price Predictor')
print('=' * 55)

print('''
1. TARGET (SalePrice)
   • Right-skewed → use log1p() transform before training
   • Range: ~$12,789 to $755,000  |  Median: ~$160,000

2. MISSING VALUES
   • Pool QC, Alley, Fence → >80% missing, drop or fill "None"
   • Garage/Basement cols  → NaN means "no garage/basement" → fill "None" (categorical) or 0 (numeric)
   • Remaining             → fill with median / mode

3. TOP NUMERIC PREDICTORS
   • Overall Qual   (r ≈ 0.80) ← strongest single feature
   • Gr Liv Area    (r ≈ 0.71)
   • Garage Cars    (r ≈ 0.65)
   • Total Bsmt SF  (r ≈ 0.64)
   • Year Built     (r ≈ 0.56)

4. TOP CATEGORICAL PREDICTORS
   • Neighborhood   → huge price variance between areas
   • Overall Qual   → nearly monotonic with price
   • House Style    → 2-story homes command premium

5. OUTLIERS
   • A few large houses (>4000 sqft) sold very cheaply → REMOVE (exact count is printed in Section 7)
   • A few extreme high-price homes → keep (real luxury homes)

6. FEATURE ENGINEERING IDEAS (for preprocess.py)
   • TotalSF = Bsmt + 1st + 2nd floor  ← most impactful
   • TotalBath = Full + Half*0.5
   • HouseAge = YrSold - YearBuilt
   • WasRemodeled = (YearBuilt != YearRemod)
''')

print('=' * 55)
print('  ▶️  Next: 02_training.ipynb - Preprocessing + Modeling')
print('=' * 55)
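
The feature engineering ideas in the summary can be sketched as a single function. This is an illustration under stated assumptions, not the actual `preprocess.py`: the function name `add_engineered_features` is hypothetical, and the column names `2nd Flr SF`, `Full Bath`, `Half Bath`, `Yr Sold`, and `Year Remod/Add` are assumed to follow the same space-separated naming as the columns used earlier in this notebook. A missing basement area is treated as 0, in line with the missing-value notes above.

```python
import numpy as np
import pandas as pd

def add_engineered_features(frame):
    """Add TotalSF, TotalBath, HouseAge, and WasRemodeled to a copy of `frame`."""
    out = frame.copy()
    # Assumption: NaN basement area means "no basement", so it counts as 0
    out['TotalSF'] = (out['Total Bsmt SF'].fillna(0)
                      + out['1st Flr SF'] + out['2nd Flr SF'])
    out['TotalBath'] = out['Full Bath'] + 0.5 * out['Half Bath']
    out['HouseAge'] = out['Yr Sold'] - out['Year Built']
    out['WasRemodeled'] = (out['Year Built'] != out['Year Remod/Add']).astype(int)
    return out

# Toy rows (hypothetical values, assumed column names)
toy = pd.DataFrame({
    'Total Bsmt SF':  [800.0, np.nan],
    '1st Flr SF':     [1000, 900],
    '2nd Flr SF':     [0, 700],
    'Full Bath':      [1, 2],
    'Half Bath':      [1, 0],
    'Yr Sold':        [2010, 2008],
    'Year Built':     [1960, 2005],
    'Year Remod/Add': [1995, 2005],
})
features = add_engineered_features(toy)
print(features[['TotalSF', 'TotalBath', 'HouseAge', 'WasRemodeled']])
```

Before relying on it, check the assumed names against `df.columns`, since other distributions of the Ames data use names without spaces.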