## 03 - Final Data Preparation Pipeline

**Project:** UK Housing Price Paid Records

**Purpose:** To serve as the final, clean, and efficient data pipeline. This notebook takes the raw CSV file and applies all the necessary cleaning, renaming, and feature engineering steps (discovered in 02\_eda.ipynb) to create the ultimate model-ready Parquet file. **NO GRAPHS or analytical prints should be present.**

**Team Member(s):** Tymo Verhaegen

**Output File:** `../data/housing/processed/price_paid_model_ready.parquet`

**Date Last Run:** 06/11/2025

---

In [3]:
import pandas as pd
import os

# --- 1. Define Paths and Settings (Confirmed correct via your screenshot) ---
print("--- Starting Final Data Preparation ---")
csv_file_path = '../data/housing/price_paid_records.csv' 
processed_dir = '../data/housing/processed/' 
final_parquet_file_path = os.path.join(processed_dir, 'price_paid_model_ready.parquet') 

# Ensure the target directory exists
os.makedirs(processed_dir, exist_ok=True)

# --- 2. Define Optimal Data Types (Essential for performance) ---
data_types = {
    'Transaction unique identifier': 'string',
    'Price': 'int32',  
    'Property Type': 'category',
    'Old/New': 'category',
    'Duration': 'category',
    'Town/City': 'category',
    'District': 'category',
    'County': 'category',
    'PPDCategory Type': 'category',
    'Record Status - monthly file only': 'category'
}

# --- 3. LOAD THE RAW CSV ---
print(f"1. Loading raw CSV from: {csv_file_path}")
try:
    df = pd.read_csv(
        csv_file_path, 
        dtype=data_types,
        parse_dates=['Date of Transfer'],
        low_memory=False, 
        engine='c'
    )
    print(f"   -> Load successful. Raw records: {len(df):,}")
except FileNotFoundError:
    print(f"   -> ERROR: File not found. Path: {csv_file_path}. Aborting.")
    exit()

# --- 4. APPLY CONSOLIDATED CLEANING AND FEATURE ENGINEERING ---
print("\n2. Applying cleaning and feature engineering...")

# A. Column Renaming (Mandatory Assignment Requirement)
def clean_col_name(col):
    return col.lower().replace(' ', '_').strip()

df.columns = [clean_col_name(col) for col in df.columns]

# B. Feature Engineering (Add Year and Rename Date)
df['sale_year'] = df['date_of_transfer'].dt.year.astype('int16')
df.rename(columns={'date_of_transfer': 'sale_date'}, inplace=True)

# C. Handle Outliers (Filtering out non-market sales)
count_low_price = len(df[df['price'] <= 1])
df = df[df['price'] > 1].copy()

print(f"   -> Removed {count_low_price:,} low-value transactions (Price <= £1).")
print(f"   -> Cleaned records remaining: {len(df):,}")

# D. Final Column Cleanup (Drop the unique ID column)
df.drop(columns=['transaction_unique_identifier'], inplace=True)
print("   -> Columns renamed, sale_year created, and ID dropped.")

# E Set the order for the 'duration' column (Leasehold vs Freehold)
duration_mapping = {'L': 'Leasehold', 'F': 'Freehold'}
duration_order = ['Leasehold', 'Freehold']

df['duration'] = df['duration'].map(duration_mapping)
df['duration'] = pd.Categorical(df['duration'], categories=duration_order, ordered=True)
print("\n- Duration column updated to ordered category: ['Leasehold', 'Freehold'].")

# F. Drop redundant columns
cols_to_drop = ['ppd_category_type', 'record_status_-_monthly_file_only']
df.drop(columns=cols_to_drop, inplace=True, errors='ignore')
print(f"\nFinal number of columns: {len(df.columns)}")

# --- 5. FINAL SAVE ---
print("\n3. Saving final model-ready file to Parquet...")
df.to_parquet(final_parquet_file_path, index=False)

print("\n--- Final Prep Complete! ---")
print(f"Model-ready data saved to: {final_parquet_file_path}")

--- Starting Final Data Preparation ---
1. Loading raw CSV from: ../data/housing/price_paid_records.csv
   -> Load successful. Raw records: 22,489,348

2. Applying cleaning and feature engineering...
   -> Removed 92 low-value transactions (Price <= £1).
   -> Cleaned records remaining: 22,489,256
   -> Columns renamed, sale_year created, and ID dropped.

- Duration column updated to ordered category: ['Leasehold', 'Freehold'].

Final number of columns: 10

3. Saving final model-ready file to Parquet...

--- Final Prep Complete! ---
Model-ready data saved to: ../data/housing/processed/price_paid_model_ready.parquet
