## 01 - Initial Data Load & Optimization

**Project:** UK Housing Price Paid Records

**Purpose:** Find an efficient way to load the 2.4 GB CSV file by declaring column types up front (`dtype`) and parsing the date column during the read (`parse_dates`), which reduces both memory use and load time. The output of this notebook is the first, basic Parquet file.

**Team Member(s):** Tymo Verhaegen

**Date Last Run:** 05/11/2025

---
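To illustrate why the explicit `dtype` mapping matters, the following is a minimal sketch that loads the same small CSV twice, once with inferred types and once with declared types, and compares the in-memory size. The sample data is invented for illustration (only the column names are borrowed from this notebook), and the helper `memory_mb` is hypothetical. The exact savings on the real file will differ.

```python
import io

import pandas as pd

# Invented sample data: only the column names match the real dataset.
rows = 10_000
property_types = ['D', 'S', 'T', 'F']
counties = ['KENT', 'ESSEX', 'SURREY']
lines = ["Price,Property Type,County"]
for i in range(rows):
    lines.append(f"{100_000 + i},{property_types[i % 4]},{counties[i % 3]}")
sample_csv = "\n".join(lines)


def memory_mb(df):
    """Total in-memory size of a DataFrame in megabytes (hypothetical helper)."""
    return df.memory_usage(deep=True).sum() / 1e6


# Load once with inferred types, once with the kind of mapping used in this notebook.
inferred = pd.read_csv(io.StringIO(sample_csv))
typed = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={'Price': 'int32', 'Property Type': 'category', 'County': 'category'},
)

inferred_mb = memory_mb(inferred)
typed_mb = memory_mb(typed)
print(f"Inferred dtypes: {inferred_mb:.3f} MB")
print(f"Declared dtypes: {typed_mb:.3f} MB")
```

Columns with few distinct values, such as `Property Type` or `County`, tend to benefit most from `category`, because each row then stores a small integer code instead of a full string.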

In [1]:
import pandas as pd
import os

# --- IMPORTANT SETUP ---
# pandas needs a Parquet engine to write the output file. One engine is enough;
# pyarrow is used by default when it is installed:
# !pip install pyarrow    # or: !pip install fastparquet

# --- 1. Define Paths ---
csv_file_path = '../data/housing_raw/price_paid_records.csv'
parquet_dir = '../data/housing/processed'
parquet_file_path = os.path.join(parquet_dir, 'price_paid_init.parquet')

# Create the processed directory if it doesn't exist
os.makedirs(parquet_dir, exist_ok=True)

# --- 2. Define Optimal Data Types (The Speed Hack) ---
# Declaring dtypes skips type inference for these columns and uses less memory.
# Note: 'Date of Transfer' is not listed here; it is handled by the parse_dates argument.
data_types = {
    'Transaction unique identifier': 'string',
    'Price': 'int32',  # Assumes every price is present and fits in int32 (max 2,147,483,647)
    'Property Type': 'category',
    'Old/New': 'category',
    'Duration': 'category',
    'Town/City': 'category',
    'District': 'category',
    'County': 'category',
    'PPDCategory Type': 'category',
    'Record Status - monthly file only': 'category'
}

# --- 3. LOAD THE RAW DATA ---
print("Starting initial (slow) CSV load...")

try:
    df_raw = pd.read_csv(
        csv_file_path, 
        dtype=data_types,
        parse_dates=['Date of Transfer'],
        low_memory=False, # Parse the file in one pass instead of internal chunks (uses more RAM while loading)
        engine='c'        # The C parser is the default; stated explicitly for clarity
    )
    print(f"CSV load complete! Loaded {len(df_raw):,} records.")

    # --- 4. SAVE AS OPTIMIZED PARQUET FILE ---
    print("Saving the optimized DataFrame to Parquet...")
    
    # Parquet is a columnar binary format that is fast and small.
    df_raw.to_parquet(parquet_file_path, index=False)
    
    print(f"\nSUCCESS! Optimized data saved to:\n{parquet_file_path}")
    
except FileNotFoundError:
    print(f"\nERROR: File not found at {csv_file_path}. Please check your path!")
except Exception as e:
    print(f"\nAn error occurred during loading/saving: {e}")

# If the load and save succeeded, the Parquet file is now available for fast loading during EDA.

Starting initial (slow) CSV load...

ERROR: File not found at ../data/housing_raw/price_paid_records.csv. Please check your path!
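The `File not found` message above is consistent with the relative path being resolved against a different working directory than expected, although the output alone does not confirm that cause. A minimal diagnostic sketch, using only the standard library, shows where the path actually points. The helper name `describe_path` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def describe_path(relative_path, base_dir=None):
    """Return (absolute path, exists flag) for a path resolved against base_dir.

    base_dir defaults to the current working directory, which is what
    pd.read_csv uses when it is given a relative path.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    resolved = (base / relative_path).resolve()
    return resolved, resolved.exists()


# Same relative path as in the cell above.
csv_file_path = '../data/housing_raw/price_paid_records.csv'
resolved, exists = describe_path(csv_file_path)

print(f"Working directory: {Path.cwd()}")
print(f"Resolves to:       {resolved}")
print(f"Exists:            {exists}")
```

If `Exists` is `False`, comparing the resolved path with the actual location of the CSV on disk should show whether the directory name or the notebook's working directory is the mismatch.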
