# UK Housing Price Prediction - Feature Engineering

**Author:** Abdul Salam Aldabik  
**Date:** November 2025  
**Course:** CloudAI - Machine Learning Project  

---

## Objective
Create engineered features for modeling:
- Categorical encoding (one-hot, binary)
- Temporal features (seasonality, crisis indicators)
- Economic interactions (spreads, rate of change)
- Geographic encoding (label encoding)
- **Data leakage prevention throughout**

## CloudAI Reference
- **Chapter 3:** Model Quality - Preventing data leakage
- **Chapter 4:** Models - Feature engineering strategies
- **Chapter 5:** Data Augmentation - Feature creation techniques
- **Chapter 6:** Time Series - Temporal feature engineering

---

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime
from sklearn.preprocessing import LabelEncoder

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)

print("✓ Libraries loaded")

## 2. Setup Paths

In [None]:
DATA_DIR = Path('../Data')
OUTPUT_DIR = DATA_DIR / 'feature_output'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

INPUT_FILE = DATA_DIR / 'housing_cleaned.parquet'
OUTPUT_FILE = DATA_DIR / 'housing_features_final.parquet'

print(f"✓ Output directory: {OUTPUT_DIR}")

## 3. Load Cleaned Data

In [None]:
print("Loading cleaned dataset...\n")

df = pd.read_parquet(INPUT_FILE)

print(f"✓ Data loaded")
print(f"  Records: {len(df):,}")
print(f"  Columns: {len(df.columns)}")
print(f"  Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

original_columns = len(df.columns)

## 4. Categorical Encoding

### 4.1 One-Hot Encode Property Type

In [None]:
print("=== CATEGORICAL ENCODING ===")
print("\n1. Property Type (One-Hot Encoding)...")

if 'property_type' in df.columns:
    property_dummies = pd.get_dummies(df['property_type'], prefix='property', drop_first=True)
    df = pd.concat([df, property_dummies], axis=1)
    print(f"  Created {len(property_dummies.columns)} dummy columns:")
    for col in property_dummies.columns:
        print(f"    - {col}")
else:
    print("  ⚠ property_type column not found")
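
As a small illustration of what `drop_first=True` does, the sketch below applies the same call to a toy column. The category codes `D`, `S`, `T` and `F` are assumed for illustration and may not match the real data. The first category in sorted order (`D` here) is dropped and becomes the implicit baseline, encoded as all zeros.

```python
import pandas as pd

# Toy data: the category codes are illustrative assumptions
toy = pd.DataFrame({'property_type': ['D', 'S', 'T', 'F', 'D']})

# Same call as in the notebook, applied to the toy column
dummies = pd.get_dummies(toy['property_type'], prefix='property', drop_first=True)

print(dummies.columns.tolist())
print(dummies.astype(int))
```

Rows with property type `D` have zeros in every dummy column, so the remaining columns are read relative to that baseline.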

### 4.2 Binary Categorical Features

In [None]:
print("\n2. Binary Categoricals...")

# New vs Old
if 'old_new' in df.columns:
    df['is_new_build'] = (df['old_new'] == 'Y').astype(int)
    print("  ✓ Created: is_new_build")

# Freehold vs Leasehold
if 'duration' in df.columns:
    df['is_freehold'] = (df['duration'] == 'F').astype(int)
    print("  ✓ Created: is_freehold")

# Category A
if 'ppdcategory_type' in df.columns:
    df['is_category_a'] = (df['ppdcategory_type'] == 'A').astype(int)
    print("  ✓ Created: is_category_a")

## 5. Temporal Features

### 5.1 Basic Temporal Features

In [None]:
print("\n=== TEMPORAL FEATURES ===")
print("\n1. Basic Temporal...")

if 'date_of_transfer' in df.columns:
    if not pd.api.types.is_datetime64_any_dtype(df['date_of_transfer']):
        df['date_of_transfer'] = pd.to_datetime(df['date_of_transfer'])
    
    df['day_of_week'] = df['date_of_transfer'].dt.dayofweek
    df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
    print("  ✓ Created: day_of_week, is_weekend")

### 5.2 Seasonal Features

In [None]:
print("\n2. Seasonal Features...")

df['is_spring'] = df['month'].isin([3, 4, 5]).astype(int)
df['is_summer'] = df['month'].isin([6, 7, 8]).astype(int)
df['is_autumn'] = df['month'].isin([9, 10, 11]).astype(int)
df['is_winter'] = df['month'].isin([12, 1, 2]).astype(int)

# Cyclical encoding for smooth seasonality
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

print("  ✓ Created: season indicators (4)")
print("  ✓ Created: cyclical month encoding (sin, cos)")
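
To see why the sine/cosine pair gives a smooth seasonal signal, the sketch below places each month on the unit circle and compares distances. The helper names `month_to_circle` and `circle_distance` are hypothetical and exist only for this illustration.

```python
import numpy as np

def month_to_circle(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(month_a, month_b):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(month_to_circle(month_a) - month_to_circle(month_b)))

dec_jan = circle_distance(12, 1)   # adjacent across the year boundary
jan_feb = circle_distance(1, 2)    # adjacent within the year
dec_jun = circle_distance(12, 6)   # six months apart

print(f"Dec-Jan: {dec_jan:.4f}, Jan-Feb: {jan_feb:.4f}, Dec-Jun: {dec_jun:.4f}")
```

With the raw month number, December (12) and January (1) are 11 units apart; in the cyclical encoding they are as close as any other pair of neighbouring months.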

### 5.3 Crisis Period Features

In [None]:
print("\n3. Crisis Period Features...")

df['years_since_2008'] = df['year'] - 2008
df['is_crisis_period'] = ((df['year'] >= 2008) & (df['year'] <= 2009)).astype(int)
df['is_recovery_period'] = ((df['year'] >= 2010) & (df['year'] <= 2012)).astype(int)

print("  ✓ Created: years_since_2008, crisis indicators (2)")

## 6. Economic Interaction Features

### 6.1 Mortgage Rate Spreads

In [None]:
print("\n=== ECONOMIC FEATURES ===")
print("\n1. Mortgage Spreads...")

if all(col in df.columns for col in ['mortgage_10yr', 'mortgage_2yr', 'mortgage_5yr']):
    df['mortgage_spread_10_2'] = df['mortgage_10yr'] - df['mortgage_2yr']
    df['mortgage_spread_5_2'] = df['mortgage_5yr'] - df['mortgage_2yr']
    print("  ✓ Created: mortgage_spread_10_2, mortgage_spread_5_2")
    print("  (Term spread between fixed-rate products - a rough proxy for rate expectations)")

### 6.2 Rate of Change Features (Leakage-Safe)

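**CRITICAL:** Each change is the current month's mean minus the PREVIOUS month's mean (via `shift(1)`). No future months and no target values are used, which prevents leakage.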

In [None]:
print("\n2. Rate of Change (Leakage-Safe)...")

# Sort by time
df = df.sort_values('date_of_transfer').reset_index(drop=True)

# Calculate monthly averages
monthly_means = df.groupby(['year', 'month'])[['base_rate', 'mortgage_5yr', 'exchange_rate_index']].mean()
monthly_means = monthly_means.reset_index()
monthly_means['period'] = monthly_means['year'] * 12 + monthly_means['month']
monthly_means = monthly_means.sort_values('period')

# Calculate change from PREVIOUS month only
for col in ['base_rate', 'mortgage_5yr', 'exchange_rate_index']:
    monthly_means[f'{col}_prev'] = monthly_means[col].shift(1)
    monthly_means[f'{col}_change'] = monthly_means[col] - monthly_means[f'{col}_prev']

# Merge back
df = df.merge(
    monthly_means[['year', 'month', 'base_rate_change', 'mortgage_5yr_change', 'exchange_rate_index_change']],
    on=['year', 'month'],
    how='left'
)

# Fill first month NaNs
df['base_rate_change'] = df['base_rate_change'].fillna(0)
df['mortgage_5yr_change'] = df['mortgage_5yr_change'].fillna(0)
df['exchange_rate_index_change'] = df['exchange_rate_index_change'].fillna(0)

print("  ✓ Created: rate change features (3)")
print("  ✓ Leakage-safe: Uses shift(1) for previous month only")
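
The calculation above implicitly assumes that every month is present in the data. The sketch below uses made-up monthly values with a deliberately missing month to show how `shift(1)` behaves, and one possible way to mask a change that spans a gap so it is not mistaken for a one-month change. The masking step is an assumption for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical monthly means; February 2009 is deliberately missing
monthly = pd.DataFrame({
    'year': [2008, 2008, 2009, 2009],
    'month': [11, 12, 1, 3],
    'base_rate': [3.0, 2.0, 1.5, 0.5],
})
monthly['period'] = monthly['year'] * 12 + monthly['month']
monthly = monthly.sort_values('period').reset_index(drop=True)

# Current month minus the previous available row (same logic as the notebook)
monthly['base_rate_change'] = monthly['base_rate'] - monthly['base_rate'].shift(1)

# Mask changes where the previous row is not the immediately preceding month
has_gap = monthly['period'].diff() > 1
monthly.loc[has_gap, 'base_rate_change'] = float('nan')

print(monthly[['year', 'month', 'base_rate', 'base_rate_change']])
```

The first row has no predecessor and the last row follows a gap, so both end up as NaN and can be filled with whatever policy the pipeline uses.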

## 7. Geographic Encoding

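**Note:** Label encoding is used here; it assigns arbitrary integer codes with no meaningful order, which tree-based models generally tolerate. Target encoding is deferred to the model pipeline, where it can be fit on training data only, to prevent leakage.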

In [None]:
print("\n=== GEOGRAPHIC ENCODING ===")
print("\nLabel Encoding (Target encoding deferred to model pipeline)...")

if 'district' in df.columns:
    le_district = LabelEncoder()
    df['district_encoded'] = le_district.fit_transform(df['district'].astype(str))
    print(f"  ✓ Encoded district: {df['district'].nunique()} unique values")

if 'county' in df.columns:
    le_county = LabelEncoder()
    df['county_encoded'] = le_county.fit_transform(df['county'].astype(str))
    print(f"  ✓ Encoded county: {df['county'].nunique()} unique values")
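
For reference, the deferred target encoding could look roughly like the out-of-fold sketch below. The function name `out_of_fold_target_encode`, its parameters, and the toy data are hypothetical; in a real pipeline this would be applied to the training set only, after the train/test split.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(keys, target, n_splits=5, seed=42):
    """Encode each row with its category's mean target, computed on other folds only."""
    keys = pd.Series(keys).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)

    # Assign every row to one of n_splits folds at random
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(keys)) % n_splits

    encoded = pd.Series(np.nan, index=keys.index)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        # Category means come from rows outside the held-out fold
        means = target[~held_out].groupby(keys[~held_out]).mean()
        # Categories unseen outside the fold fall back to the out-of-fold mean
        values = keys[held_out].map(means).fillna(target[~held_out].mean())
        encoded.loc[held_out] = values.to_numpy()
    return encoded

toy = pd.DataFrame({
    'district': ['A', 'A', 'A', 'B', 'B', 'B'],
    'price_transformed': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
# n_splits equal to the row count gives leave-one-out on this tiny example
toy['district_te'] = out_of_fold_target_encode(
    toy['district'], toy['price_transformed'], n_splits=len(toy)
)
print(toy)
```

Because each row's own target never contributes to its encoding, the encoded value cannot leak that row's price into the feature.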

## 8. Drop Original Categorical Columns

In [None]:
print("\n=== CLEANUP ===")
print("\nDropping original categorical columns...")

cols_to_drop = []
for col in ['property_type', 'old_new', 'duration', 'ppdcategory_type', 
            'district', 'county', 'town_city', 'record_status_-_monthly_file_only']:
    if col in df.columns:
        cols_to_drop.append(col)

if cols_to_drop:
    df = df.drop(columns=cols_to_drop)
    print(f"  Dropped {len(cols_to_drop)} columns: {', '.join(cols_to_drop)}")

## 9. Feature Summary

In [None]:
print("\n=== FEATURE ENGINEERING SUMMARY ===")
print(f"\nOriginal columns: {original_columns}")
print(f"Final columns: {len(df.columns)}")
print(f"Net column change (features added minus originals dropped): {len(df.columns) - original_columns}")

print("\nFeature Categories:")
print(f"  Categorical encoding: ~7 features")
print(f"  Temporal features: 11 features")
print(f"  Economic interactions: 5 features")
print(f"  Geographic encoding: 2 features")

print(f"\nTotal records: {len(df):,}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
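
The column-count difference above also reflects the originals dropped in section 8, so it understates how many features were added. A minimal sketch, on toy data with hypothetical variable names, of tracking added and dropped columns separately with set differences:

```python
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({'old_new': ['Y', 'N'], 'price': [1, 2]})
before = set(toy.columns)

# Add one engineered feature, then drop its source column
toy['is_new_build'] = (toy['old_new'] == 'Y').astype(int)
toy = toy.drop(columns=['old_new'])
after = set(toy.columns)

added = sorted(after - before)
dropped = sorted(before - after)
net_change = len(after) - len(before)

print(f"Added: {added}, dropped: {dropped}, net change: {net_change}")
```

Here one feature is added and one column is dropped, so the net change is zero even though a new feature exists.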

## 10. Visualizations

### 10.1 Feature Correlations

In [None]:
# Sample for visualization
sample_df = df.sample(n=min(10000, len(df)), random_state=42)

# Select key features for correlation
numeric_cols = sample_df.select_dtypes(include=[np.number]).columns.tolist()

# Keep only candidate columns that exist and are numeric (including the target)
key_features = []
for col in ['price_transformed', 'base_rate', 'mortgage_2yr', 'mortgage_5yr', 'exchange_rate_index',
            'mortgage_spread_5_2', 'base_rate_change', 'district_encoded', 
            'county_encoded', 'is_new_build', 'is_freehold']:
    if col in numeric_cols:
        key_features.append(col)

key_features = key_features[:15]  # Limit to 15 for readability

if len(key_features) > 1:
    fig, ax = plt.subplots(figsize=(14, 12))
    
    corr_matrix = sample_df[key_features].corr()
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
    
    sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
                cmap='coolwarm', center=0, square=True, linewidths=0.5,
                cbar_kws={'shrink': 0.8}, vmin=-1, vmax=1, ax=ax)
    
    ax.set_title('Feature Correlation Matrix (Top Features)', 
                 fontsize=14, fontweight='bold', pad=20)
    
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / '01_feature_correlations.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("✓ Saved: 01_feature_correlations.png")

### 10.2 Temporal Features Visualization

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Crisis indicator
if 'is_crisis_period' in sample_df.columns:
    crisis_by_year = sample_df.groupby('year')['is_crisis_period'].mean()
    axes[0, 0].plot(crisis_by_year.index, crisis_by_year.values, 
                    marker='o', linewidth=2.5, markersize=7, color='darkred')
    axes[0, 0].fill_between(crisis_by_year.index, crisis_by_year.values, 
                             alpha=0.3, color='darkred')
    axes[0, 0].set_title('Crisis Period Indicator', fontsize=12, fontweight='bold')
    axes[0, 0].set_xlabel('Year')
    axes[0, 0].set_ylabel('Proportion in Crisis')
    axes[0, 0].grid(alpha=0.3)

# Seasonal patterns
if all(col in sample_df.columns for col in ['is_spring', 'is_summer', 'is_autumn', 'is_winter']):
    seasonal = sample_df.groupby('month')[['is_spring', 'is_summer', 'is_autumn', 'is_winter']].mean()
    seasonal.plot(kind='bar', ax=axes[0, 1], width=0.8)
    axes[0, 1].set_title('Seasonal Indicators', fontsize=12, fontweight='bold')
    axes[0, 1].set_xlabel('Month')
    axes[0, 1].legend(['Spring', 'Summer', 'Autumn', 'Winter'])
    axes[0, 1].grid(alpha=0.3, axis='y')

# Weekend effect
if 'is_weekend' in sample_df.columns:
    weekend_by_dow = sample_df.groupby('day_of_week')['is_weekend'].mean()
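    # Colour by the actual day index so colours stay correct if a weekday is missing from the sample
    bar_colors = ['coral' if day >= 5 else 'steelblue' for day in weekend_by_dow.index]
    axes[1, 0].bar(weekend_by_dow.index, weekend_by_dow.values, 
                   color=bar_colors, edgecolor='black')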
    axes[1, 0].set_title('Weekend Indicator', fontsize=12, fontweight='bold')
    axes[1, 0].set_xlabel('Day of Week')
    axes[1, 0].set_xticks(range(7))
    axes[1, 0].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
    axes[1, 0].grid(alpha=0.3, axis='y')

# Years since 2008
if 'years_since_2008' in sample_df.columns:
    years_dist = sample_df['years_since_2008'].value_counts().sort_index()
    axes[1, 1].bar(years_dist.index, years_dist.values, 
                   color='darkgreen', edgecolor='black', alpha=0.7)
    axes[1, 1].set_title('Years Since 2008', fontsize=12, fontweight='bold')
    axes[1, 1].set_xlabel('Years Since 2008')
    axes[1, 1].axvline(x=0, color='red', linestyle='--', linewidth=2)
    axes[1, 1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / '02_temporal_features.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: 02_temporal_features.png")

### 10.3 Economic Features Visualization

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Mortgage spreads
if 'mortgage_spread_5_2' in sample_df.columns:
    axes[0, 0].hist(sample_df['mortgage_spread_5_2'], bins=50, 
                    color='steelblue', edgecolor='black', alpha=0.7)
    axes[0, 0].set_title('Mortgage Spread (5yr - 2yr)', fontsize=12, fontweight='bold')
    axes[0, 0].set_xlabel('Spread (%)')
    axes[0, 0].grid(alpha=0.3, axis='y')

# Base rate changes
if 'base_rate_change' in sample_df.columns:
    axes[0, 1].hist(sample_df['base_rate_change'], bins=50, 
                    color='coral', edgecolor='black', alpha=0.7)
    axes[0, 1].set_title('Base Rate Change (Monthly)', fontsize=12, fontweight='bold')
    axes[0, 1].set_xlabel('Change (%)')
    axes[0, 1].axvline(x=0, color='red', linestyle='--', linewidth=2)
    axes[0, 1].grid(alpha=0.3, axis='y')

# Mortgage rate changes
if 'mortgage_5yr_change' in sample_df.columns:
    axes[1, 0].hist(sample_df['mortgage_5yr_change'], bins=50, 
                    color='darkgreen', edgecolor='black', alpha=0.7)
    axes[1, 0].set_title('Mortgage Rate Change', fontsize=12, fontweight='bold')
    axes[1, 0].set_xlabel('Change (%)')
    axes[1, 0].axvline(x=0, color='red', linestyle='--', linewidth=2)
    axes[1, 0].grid(alpha=0.3, axis='y')

# Exchange rate changes
if 'exchange_rate_index_change' in sample_df.columns:
    axes[1, 1].hist(sample_df['exchange_rate_index_change'], bins=50, 
                    color='purple', edgecolor='black', alpha=0.7)
    axes[1, 1].set_title('Exchange Rate Change', fontsize=12, fontweight='bold')
    axes[1, 1].set_xlabel('Change (Index Points)')
    axes[1, 1].axvline(x=0, color='red', linestyle='--', linewidth=2)
    axes[1, 1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / '03_economic_features.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: 03_economic_features.png")

## 11. Save Final Dataset

In [None]:
print("\nSaving feature-engineered dataset...\n")

df.to_parquet(OUTPUT_FILE, compression='gzip', index=False)

file_size = OUTPUT_FILE.stat().st_size / 1024**2

print(f"✓ Final dataset saved")
print(f"  File: {OUTPUT_FILE.name}")
print(f"  Size: {file_size:.2f} MB")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {len(df.columns)}")

## 12. Create Feature Report

In [None]:
report_file = OUTPUT_DIR / 'feature_engineering_report.txt'

# UTF-8 is set explicitly because the report contains non-ASCII characters (✓)
with open(report_file, 'w', encoding='utf-8') as f:
    f.write("=" * 80 + "\n")
    f.write("FEATURE ENGINEERING REPORT\n")
    f.write("UK Housing Price Prediction Project\n")
    f.write("=" * 80 + "\n\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    
    f.write("FEATURE ENGINEERING SUMMARY:\n")
    f.write("-" * 80 + "\n")
    f.write(f"Original columns: {original_columns}\n")
    f.write(f"Final columns: {len(df.columns)}\n")
    f.write(f"Net column change (added minus dropped): {len(df.columns) - original_columns}\n\n")
    
    f.write("FEATURE CATEGORIES:\n")
    f.write("-" * 80 + "\n")
    f.write("Categorical Encoding: ~7 features\n")
    f.write("  - One-hot: property_type\n")
    f.write("  - Binary: is_new_build, is_freehold, is_category_a\n\n")
    f.write("Temporal Features: 11 features\n")
    f.write("  - Basic: day_of_week, is_weekend\n")
    f.write("  - Seasonal: 4 season indicators + cyclical encoding\n")
    f.write("  - Crisis: years_since_2008, is_crisis_period, is_recovery_period\n\n")
    f.write("Economic Interactions: 5 features\n")
    f.write("  - Spreads: mortgage_spread_10_2, mortgage_spread_5_2\n")
    f.write("  - Rate changes: base_rate_change, mortgage_5yr_change, exchange_rate_index_change\n\n")
    f.write("Geographic Encoding: 2 features\n")
    f.write("  - Label encoded: district_encoded, county_encoded\n\n")
    
    f.write("DATA LEAKAGE PREVENTION:\n")
    f.write("-" * 80 + "\n")
    f.write("✓ Target encoding deferred to model pipeline\n")
    f.write("✓ Rate changes use shift(1) - previous month only\n")
    f.write("✓ No features use target variable before train/test split\n")
    f.write("✓ Features use only transaction fields and month-level economic aggregates\n\n")
    
    f.write("ALL COLUMNS:\n")
    f.write("-" * 80 + "\n")
    for i, col in enumerate(df.columns, 1):
        f.write(f"  {i:2d}. {col}\n")
    f.write("\n")
    
    f.write("NEXT STEPS:\n")
    f.write("-" * 80 + "\n")
    f.write("1. Train/test split (e.g., 80/20, ideally chronological)\n")
    f.write("2. Use 'price_transformed' as target variable\n")
    f.write("3. In model pipeline, optionally add:\n")
    f.write("   - Target encoding for district/county (with CV)\n")
    f.write("4. Model training (Random Forest, XGBoost, etc.)\n")
    f.write("5. Remember: Inverse transform predictions: exp(log_price)\n")

print(f"✓ Feature report saved: {report_file.name}")

## 13. Summary

### Features Created:
- **Categorical:** ~7 features (one-hot + binary encoding)
- **Temporal:** 11 features (basic + seasonal + crisis)
- **Economic:** 5 features (spreads + rate changes)
- **Geographic:** 2 features (label encoding)

### Data Leakage Prevention:
- ✅ Target encoding deferred to model pipeline
- ✅ Rate changes use shift(1) for previous month only
- ✅ No features use target variable before split
- ✅ Features use only transaction fields and month-level economic aggregates

### Dataset Ready:
- **Records:** 11M+ transactions
- **Features:** ~39 columns
- **Target:** price_transformed (log-transformed)
- **Quality:** Leakage checks applied as listed above

### Next Steps:
1. Train/test split (ideally chronological, so the test period follows the training period)
2. Model training (use CloudAI Chapter 4)
3. Inverse transform predictions: `exp(log_price)` → actual price
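
The first and last steps can be sketched as follows. The helper `chronological_split` and the toy data are hypothetical, and the example assumes `price_transformed` is a natural log of price; if the cleaning step used `log1p`, the inverse would be `np.expm1` instead.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, date_col, test_fraction=0.2):
    """Split so that every test row is at least as late as every training row."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy data standing in for the feature-engineered dataset
toy = pd.DataFrame({
    'date_of_transfer': pd.date_range('2020-01-01', periods=10, freq='D'),
    'price': [100_000 + 10_000 * i for i in range(10)],
})
toy['price_transformed'] = np.log(toy['price'])  # assumed natural-log target

train, test = chronological_split(toy, 'date_of_transfer')

# Inverse transform: back from log scale to actual prices
recovered = np.exp(test['price_transformed'])

print(f"Train rows: {len(train)}, test rows: {len(test)}")
print(recovered.round(2).tolist())
```

A chronological split keeps the evaluation closer to real use, where the model predicts prices for transactions later than anything it was trained on.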

---

**Notebook Complete**  
**Ready for Modeling!**