## 03 - Final Data Preparation Pipeline

**Project:** UK Housing Price Paid Records

**Purpose:** To serve as the final, clean, and efficient data pipeline. This notebook takes the raw CSV file and applies all the necessary cleaning, renaming, and feature engineering steps (discovered in 02\_eda.ipynb) to create the ultimate model-ready Parquet file. **NO GRAPHS or analytical prints should be present.**

**Team Member(s):** Tymo Verhaegen

**Output File:** `../data/housing/processed/price_paid_model_ready.parquet`

**Date Last Run:** 06/11/2025

---

In [3]:
import pandas as pd
import os

# --- 1. Define Paths and Settings ---
print("--- Starting Final Data Preparation ---")
input_file = "../data/housing/processed/price_paid_init.parquet" 
processed_dir = '../data/housing/processed/' 
final_parquet_file_path = os.path.join(processed_dir, 'price_paid_model_ready.parquet') 

# Ensure the target directory exists
os.makedirs(processed_dir, exist_ok=True)

# --- 2. LOAD THE PARQUET FILE ---
print(f"1. Loading Parquet from: {input_file}")
try:
    df = pd.read_parquet(input_file)
    print(f"   -> Load successful. Raw records: {len(df):,}")
except FileNotFoundError:
    print(f"   -> ERROR: File not found. Path: {input_file}. Aborting.")
    exit()
except Exception as e:
    print(f"   -> ERROR reading Parquet file: {e}")
    exit()

# --- 3. APPLY CLEANING AND FEATURE ENGINEERING ---
print("\n2. Applying cleaning and feature engineering...")

# A. Column Renaming
# We do this FIRST so we can reference easy lowercase names later
def clean_col_name(col):
    return col.lower().replace(' ', '_').replace('/', '_').replace('-', '_').strip()

df.columns = [clean_col_name(col) for col in df.columns]
print("   -> Columns renamed to snake_case.")

# B. OPTIMIZE DATA TYPES (Memory Management)
# Using 'category' for strings saves massive amounts of RAM.
cat_cols = [
    'property_type', 'old_new', 'duration', 'town_city', 
    'district', 'county', 'ppdcategory_type', 'record_status_monthly_file_only'
]

for col in cat_cols:
    if col in df.columns:
        df[col] = df[col].astype('category')

# C. Feature Engineering (Add Year and Fix Date)
if not pd.api.types.is_datetime64_any_dtype(df['date_of_transfer']):
    df['date_of_transfer'] = pd.to_datetime(df['date_of_transfer'])

df['sale_year'] = df['date_of_transfer'].dt.year.astype('int16')
df.rename(columns={'date_of_transfer': 'sale_date'}, inplace=True)

# ---------------------------------------------------------
# D. Handle Outliers (UPDATED LOGIC)
# ---------------------------------------------------------
# We filter for a realistic "Open Market" range: £5,000 to £20,000,000.
min_price = 5000
max_price = 20000000

count_low = len(df[df['price'] < min_price])
count_high = len(df[df['price'] > max_price])

# Apply the filter
df = df[(df['price'] >= min_price) & (df['price'] <= max_price)].copy()

print(f"   -> Removed {count_low:,} low-value transactions (< £{min_price:,}).")
print(f"   -> Removed {count_high:,} high-value outliers (> £{max_price:,}).")
print(f"   -> Remaining records: {len(df):,}")
# ---------------------------------------------------------

# E. Drop Unique ID (if it exists)
if 'transaction_unique_identifier' in df.columns:
    df.drop(columns=['transaction_unique_identifier'], inplace=True)

# F. Fix 'duration' (Leasehold vs Freehold)
duration_mapping = {'L': 'Leasehold', 'F': 'Freehold', 'U': 'Unknown'}
unique_vals = df['duration'].unique()
if 'L' in unique_vals or 'F' in unique_vals:
    df['duration'] = df['duration'].map(duration_mapping)

# Set order
duration_order = ['Leasehold', 'Freehold']
df['duration'] = pd.Categorical(df['duration'], categories=duration_order, ordered=True)
print("   -> Duration updated to ordered category.")

# G. Drop redundant columns
cols_to_drop = ['ppdcategory_type', 'record_status_monthly_file_only', 'record_status']
df.drop(columns=cols_to_drop, inplace=True, errors='ignore')

print(f"\nFinal number of columns: {len(df.columns)}")
print(f"Final columns: {list(df.columns)}")

# --- 4. FINAL SAVE ---
print("\n3. Saving final model-ready file to Parquet...")
df.to_parquet(final_parquet_file_path, index=False)

print("\n--- Final Prep Complete! ---")
print(f"Model-ready data saved to: {final_parquet_file_path}")

--- Starting Final Data Preparation ---
1. Loading Parquet from: ../data/housing/processed/price_paid_init.parquet
   -> Load successful. Raw records: 22,489,348

2. Applying cleaning and feature engineering...
   -> Columns renamed to snake_case.
   -> Removed 7,367 low-value transactions (< £5,000).
   -> Removed 1,159 high-value outliers (> £20,000,000).
   -> Remaining records: 22,480,822
   -> Duration updated to ordered category.

Final number of columns: 10
Final columns: ['price', 'sale_date', 'property_type', 'old_new', 'duration', 'town_city', 'district', 'county', 'record_status___monthly_file_only', 'sale_year']

3. Saving final model-ready file to Parquet...

--- Final Prep Complete! ---
Model-ready data saved to: ../data/housing/processed/price_paid_model_ready.parquet
