# UK Housing Price Prediction - Data Cleaning

**Author:** Abdul Salam Aldabik  
**Date:** November 2025  
**Course:** CloudAI - Machine Learning Project  

---

## Objective
Clean the merged dataset:
- Handle outliers using domain knowledge filtering
- Apply log transformation to normalize prices
- Create before/after visualizations
- Generate quality reports

## CloudAI Reference
- **Chapter 3:** Model Quality - Data leakage prevention
- **Chapter 5:** Data Augmentation - Outlier handling, transformations

---

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)

print("✓ Libraries loaded")

## 2. Setup Paths

In [None]:
DATA_DIR = Path('../Data')
OUTPUT_DIR = DATA_DIR / 'cleaning_output'
OUTPUT_DIR.mkdir(exist_ok=True)

INPUT_FILE = DATA_DIR / 'housing_with_economic_features.parquet'
OUTPUT_FILE = DATA_DIR / 'housing_cleaned.parquet'

print(f"✓ Output directory: {OUTPUT_DIR}")

## 3. Load Merged Data

In [None]:
print("Loading merged dataset...\n")

df = pd.read_parquet(INPUT_FILE)

print(f"✓ Data loaded")
print(f"  Records: {len(df):,}")
print(f"  Columns: {len(df.columns)}")
print(f"  Memory: {df.memory_usage(deep=True).sum() / 1024**3:.2f} GB")

## 4. Analyze Price Distribution (BEFORE Cleaning)

In [None]:
print("=== PRICE ANALYSIS (BEFORE CLEANING) ===")
print(f"\nTotal records: {len(df):,}")

price_stats = df['price'].describe(percentiles=[0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99])

print(f"\nPrice Statistics:")
print(f"  Mean:   £{price_stats['mean']:,.2f}")
print(f"  Median: £{price_stats['50%']:,.2f}")
print(f"  Std:    £{price_stats['std']:,.2f}")
print(f"\n  Min:    £{price_stats['min']:,.2f}")
print(f"  1%:     £{price_stats['1%']:,.2f}")
print(f"  5%:     £{price_stats['5%']:,.2f}")
print(f"  95%:    £{price_stats['95%']:,.2f}")
print(f"  99%:    £{price_stats['99%']:,.2f}")
print(f"  Max:    £{price_stats['max']:,.2f}")

# Count extreme values
below_10k = (df['price'] < 10000).sum()
above_5m = (df['price'] > 5000000).sum()

print(f"\nExtreme Values:")
print(f"  Below £10,000:  {below_10k:,} ({below_10k/len(df)*100:.3f}%)")
print(f"  Above £5M:      {above_5m:,} ({above_5m/len(df)*100:.3f}%)")
print(f"  Total to remove: {below_10k + above_5m:,} ({(below_10k + above_5m)/len(df)*100:.3f}%)")

## 5. Apply Domain Knowledge Filtering

**Decision:** Remove records with prices below £10,000 or above £5,000,000  
**Rationale:** Prices outside this range are considered unrealistic for the UK housing market, based on domain knowledge

In [None]:
print("\nApplying domain knowledge filtering...")
print("  Range: £10,000 to £5,000,000\n")

original_count = len(df)

df_cleaned = df[(df['price'] >= 10000) & (df['price'] <= 5000000)].copy()

removed = original_count - len(df_cleaned)

print(f"✓ Filtering complete")
print(f"  Original records: {original_count:,}")
print(f"  Removed: {removed:,} ({removed/original_count*100:.2f}%)")
print(f"  Remaining: {len(df_cleaned):,}")
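
The filter above keeps both bounds inclusive (`>=` and `<=`). As a minimal sketch of how the removals split between the two bounds, the cell below applies the same thresholds to a handful of made-up prices. The `toy` frame and the `summary` dictionary are illustrative assumptions, not part of the project data.

```python
import pandas as pd

# Bounds taken from the decision above; the toy prices are illustrative only.
LOWER, UPPER = 10_000, 5_000_000

toy = pd.DataFrame(
    {"price": [500, 9_999, 10_000, 250_000, 5_000_000, 5_000_001, 12_000_000]}
)

too_low = toy["price"] < LOWER    # strictly below the lower bound
too_high = toy["price"] > UPPER   # strictly above the upper bound

# Series.between is inclusive on both ends by default, matching >= and <=.
kept = toy[toy["price"].between(LOWER, UPPER)]

summary = {
    "below_lower": int(too_low.sum()),
    "above_upper": int(too_high.sum()),
    "kept": len(kept),
}
print(summary)
```

Reporting the two counts separately makes it easier to see whether the lower or the upper threshold is responsible for most of the removed records.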

## 6. Apply Log Transformation

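**Decision:** Transform price to log(price), stored as `price_transformed`  
**Rationale:** Prices are right-skewed; the natural log compresses the long upper tail and gives a more symmetric target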

In [None]:
print("\nApplying log transformation...\n")

df_cleaned['price_transformed'] = np.log(df_cleaned['price'])

print(f"✓ Log transformation applied")
print(f"  Original price range: £{df_cleaned['price'].min():,.0f} - £{df_cleaned['price'].max():,.0f}")
print(f"  Transformed range: {df_cleaned['price_transformed'].min():.3f} - {df_cleaned['price_transformed'].max():.3f}")
print(f"\n  Original mean: £{df_cleaned['price'].mean():,.2f}")
print(f"  Transformed mean: {df_cleaned['price_transformed'].mean():.3f}")
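
One way to check the rationale numerically is to compare skewness before and after the transform. The sketch below does this on synthetic log-normal prices, so the figures it prints say nothing about the real dataset; the distribution parameters and the name `synthetic_price` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed prices (log-normal); NOT the project's data.
synthetic_price = pd.Series(rng.lognormal(mean=12.3, sigma=0.6, size=100_000))

skew_raw = synthetic_price.skew()          # clearly positive for right-skewed data
skew_log = np.log(synthetic_price).skew()  # close to zero after the log transform

print(f"Skewness before log: {skew_raw:.2f}")
print(f"Skewness after log:  {skew_log:.2f}")
```

The same two `.skew()` calls could be run on `df_cleaned['price']` and `df_cleaned['price_transformed']` to quantify the effect on the actual data.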

## 7. Price Statistics (AFTER Cleaning)

In [None]:
print("=== PRICE ANALYSIS (AFTER CLEANING) ===")

price_stats_clean = df_cleaned['price'].describe(percentiles=[0.25, 0.50, 0.75])

print(f"\nPrice Statistics:")
print(f"  Mean:   £{price_stats_clean['mean']:,.2f}")
print(f"  Median: £{price_stats_clean['50%']:,.2f}")
print(f"  Std:    £{price_stats_clean['std']:,.2f}")
print(f"  Min:    £{price_stats_clean['min']:,.2f}")
print(f"  Max:    £{price_stats_clean['max']:,.2f}")

print(f"\nLog-Transformed Price:")
print(f"  Mean:   {df_cleaned['price_transformed'].mean():.3f}")
print(f"  Median: {df_cleaned['price_transformed'].median():.3f}")
print(f"  Std:    {df_cleaned['price_transformed'].std():.3f}")

## 8. Visualizations

### 8.1 Before vs After Distributions

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# BEFORE: Full distribution
axes[0, 0].hist(df['price'], bins=100, color='red', alpha=0.6, edgecolor='black')
axes[0, 0].axvline(df['price'].median(), color='darkred', linestyle='--', linewidth=2,
                   label=f'Median: £{df["price"].median():,.0f}')
axes[0, 0].set_xlabel('Price (£)', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0, 0].set_title(f'BEFORE: Price Distribution\n({len(df):,} transactions)', 
                     fontsize=12, fontweight='bold')
axes[0, 0].legend(fontsize=10)
axes[0, 0].grid(alpha=0.3)

# BEFORE: Box plot
axes[0, 1].boxplot(df['price'], vert=True, patch_artist=True,
                   boxprops=dict(facecolor='lightcoral', alpha=0.7),
                   medianprops=dict(color='darkred', linewidth=2))
axes[0, 1].set_ylabel('Price (£)', fontsize=11, fontweight='bold')
axes[0, 1].set_title('BEFORE: Outliers Visible', fontsize=12, fontweight='bold')
axes[0, 1].grid(alpha=0.3, axis='y')

# AFTER: Cleaned distribution
axes[1, 0].hist(df_cleaned['price'], bins=100, color='green', alpha=0.6, edgecolor='black')
axes[1, 0].axvline(df_cleaned['price'].median(), color='darkgreen', linestyle='--', linewidth=2,
                   label=f'Median: £{df_cleaned["price"].median():,.0f}')
axes[1, 0].set_xlabel('Price (£)', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[1, 0].set_title(f'AFTER: Price Distribution\n({len(df_cleaned):,} transactions)', 
                     fontsize=12, fontweight='bold')
axes[1, 0].legend(fontsize=10)
axes[1, 0].grid(alpha=0.3)

# AFTER: Box plot
axes[1, 1].boxplot(df_cleaned['price'], vert=True, patch_artist=True,
                   boxprops=dict(facecolor='lightgreen', alpha=0.7),
                   medianprops=dict(color='darkgreen', linewidth=2))
axes[1, 1].set_ylabel('Price (£)', fontsize=11, fontweight='bold')
axes[1, 1].set_title('AFTER: Fewer Outliers', fontsize=12, fontweight='bold')
axes[1, 1].grid(alpha=0.3, axis='y')

plt.suptitle('Data Cleaning: Before vs After\nDomain Filtering (£10K - £5M)',
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig(OUTPUT_DIR / '01_before_after_cleaning.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: 01_before_after_cleaning.png")

### 8.2 Log Transformation Effect

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Original price (after outlier removal)
axes[0].hist(df_cleaned['price'], bins=100, color='blue', alpha=0.7, edgecolor='black')
axes[0].axvline(df_cleaned['price'].median(), color='red', linestyle='--', linewidth=2,
               label=f'Median: £{df_cleaned["price"].median():,.0f}')
axes[0].set_xlabel('Price (£)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Original Price (After Outlier Removal)\nRight-Skewed', 
                 fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(alpha=0.3)

# Log-transformed price
axes[1].hist(df_cleaned['price_transformed'], bins=100, color='green', alpha=0.7, edgecolor='black')
axes[1].axvline(df_cleaned['price_transformed'].median(), color='red', linestyle='--', linewidth=2,
               label=f'Median: {df_cleaned["price_transformed"].median():.3f}')
axes[1].set_xlabel('Log(Price)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Log-Transformed Price\nMore Normally Distributed', 
                 fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / '02_log_transformation.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: 02_log_transformation.png")

### 8.3 Price by Property Type (After Cleaning)

In [None]:
# Sample for visualization
sample_df = df_cleaned.sample(n=min(50000, len(df_cleaned)), random_state=42)

fig, ax = plt.subplots(figsize=(12, 7))

property_order = df_cleaned.groupby('property_type')['price'].median().sort_values(ascending=False).index

sns.boxplot(data=sample_df, x='property_type', y='price', 
            order=property_order, ax=ax, palette='Set2')

ax.set_xlabel('Property Type', fontsize=12, fontweight='bold')
ax.set_ylabel('Price (£)', fontsize=12, fontweight='bold')
ax.set_title('Price Distribution by Property Type (After Cleaning)', 
             fontsize=14, fontweight='bold', pad=15)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'£{x/1000:.0f}K'))
ax.grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / '03_price_by_property_type.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: 03_price_by_property_type.png")

## 9. Save Cleaned Dataset

In [None]:
print(f"\nSaving cleaned dataset...\n")

df_cleaned.to_parquet(OUTPUT_FILE, compression='gzip', index=False)

file_size = OUTPUT_FILE.stat().st_size / 1024**2

print(f"✓ Cleaned dataset saved")
print(f"  File: {OUTPUT_FILE.name}")
print(f"  Size: {file_size:.2f} MB")
print(f"  Rows: {len(df_cleaned):,}")
print(f"  Columns: {len(df_cleaned.columns)}")

## 10. Create Cleaning Report

In [None]:
report_file = OUTPUT_DIR / 'cleaning_report.txt'

# Explicit UTF-8 so the '£' symbol is written correctly on every platform
with open(report_file, 'w', encoding='utf-8') as f:
    f.write("=" * 80 + "\n")
    f.write("DATA CLEANING REPORT\n")
    f.write("UK Housing Price Prediction Project\n")
    f.write("=" * 80 + "\n\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    
    f.write("CLEANING DECISIONS:\n")
    f.write("-" * 80 + "\n")
    f.write("Outlier Method: Domain Knowledge Filtering\n")
    f.write("  Range: £10,000 to £5,000,000\n")
    f.write("  Rationale: Based on realistic UK housing market\n\n")
    f.write("Transformation: Log Transformation\n")
    f.write("  Method: price_transformed = log(price)\n")
    f.write("  Rationale: Normalize right-skewed distribution\n\n")
    
    f.write("CLEANING RESULTS:\n")
    f.write("-" * 80 + "\n")
    f.write(f"Original records: {original_count:,}\n")
    f.write(f"Removed: {removed:,} ({removed/original_count*100:.2f}%)\n")
    f.write(f"Final records: {len(df_cleaned):,}\n\n")
    
    f.write("PRICE STATISTICS (BEFORE):\n")
    f.write("-" * 80 + "\n")
    f.write(f"Mean: £{df['price'].mean():,.2f}\n")
    f.write(f"Median: £{df['price'].median():,.2f}\n")
    f.write(f"Min: £{df['price'].min():,.2f}\n")
    f.write(f"Max: £{df['price'].max():,.2f}\n\n")
    
    f.write("PRICE STATISTICS (AFTER):\n")
    f.write("-" * 80 + "\n")
    f.write(f"Mean: £{df_cleaned['price'].mean():,.2f}\n")
    f.write(f"Median: £{df_cleaned['price'].median():,.2f}\n")
    f.write(f"Min: £{df_cleaned['price'].min():,.2f}\n")
    f.write(f"Max: £{df_cleaned['price'].max():,.2f}\n\n")
    
    f.write("LOG-TRANSFORMED STATISTICS:\n")
    f.write("-" * 80 + "\n")
    f.write(f"Mean: {df_cleaned['price_transformed'].mean():.4f}\n")
    f.write(f"Median: {df_cleaned['price_transformed'].median():.4f}\n")
    f.write(f"Std: {df_cleaned['price_transformed'].std():.4f}\n\n")
    
    f.write("NEXT STEPS:\n")
    f.write("-" * 80 + "\n")
    f.write("1. Feature engineering\n")
    f.write("2. Model training (use price_transformed as target)\n")
    f.write("3. Remember: Predictions need inverse transform: exp(log_price)\n")

print(f"✓ Cleaning report saved: {report_file.name}")

## 11. Summary

### Cleaning Applied:
- **Outlier Removal:** Domain filtering (£10K - £5M)
- **Transformation:** Log transformation
- **Records Removed:** Minimal (~0.03%)

### Key Results:
1. **Cleaner Distribution:** Removed extreme outliers
2. **Normalized Prices:** Log transformation reduces skewness
3. **Data Preserved:** 99.97% of data retained

### Data Quality:
- ✅ Realistic price range
- ✅ More normal distribution
- ✅ Ready for feature engineering

### Next Steps:
1. Feature engineering (categorical encoding, derived features)
2. Model training with `price_transformed` as target
3. Remember to inverse-transform predictions: `exp(log_price)` → actual price in pounds
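
The inverse transform mentioned in the last step can be sketched as a simple round trip. The prices below are illustrative values, and `log_prices` stands in for model predictions on the `price_transformed` scale.

```python
import numpy as np

prices = np.array([10_000.0, 250_000.0, 5_000_000.0])  # illustrative values only

log_prices = np.log(prices)     # same transform used for price_transformed
recovered = np.exp(log_prices)  # inverse transform back to pounds

# The round trip should reproduce the original prices up to floating-point error.
print(np.allclose(recovered, prices))
```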

---

**Notebook Complete**