# 📊 Final Loading & Preparing of UK Housing Prices Paid

**Author: Tiebe Goossens**

This notebook contains all **necessary** steps to take the raw UK Housing Prices CSV  
and convert it into a **clean, consistent, analysis-ready dataset**.

No graphs. Only essential, auditable steps.
This notebook is intended to be the *final, reproducible cleaning script*.

## 1️⃣ Load Raw Dataset

We load the raw CSV exactly as it comes from the source.  
We avoid specifying dtypes because we will enforce them manually later.

In [1]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

df = pd.read_csv("../Data/housing_prices/price_paid_records.csv", low_memory=False)
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22489348 entries, 0 to 22489347
Data columns (total 11 columns):
 #   Column                             Dtype 
---  ------                             ----- 
 0   Transaction unique identifier      object
 1   Price                              int64 
 2   Date of Transfer                   object
 3   Property Type                      object
 4   Old/New                            object
 5   Duration                           object
 6   Town/City                          object
 7   District                           object
 8   County                             object
 9   PPDCategory Type                   object
 10  Record Status - monthly file only  object
dtypes: int64(1), object(10)
memory usage: 1.8+ GB
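
The `+` in `memory usage: 1.8+ GB` indicates that pandas reports only a lower bound: by default, the Python strings held in `object` columns are not counted. The following is a minimal sketch of how the exact figure can be measured, using a tiny made-up frame (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame(
    {
        "Price": [100000, 250000, 175000],
        "Town/City": pd.Series(["LONDON", "LEEDS", "YORK"], dtype=object),
    }
)

# Shallow: object columns are counted as pointers only.
shallow_bytes = toy.memory_usage(deep=False).sum()

# Deep: also counts the Python string objects behind each pointer.
deep_bytes = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

On the full frame, `df.info(memory_usage="deep")` prints the exact total, at the cost of a slower scan over all object columns.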


## 2️⃣ Standardize Column Names

We convert all column names to:
- lowercase  
- snake_case  
- free of spaces and special characters  

This ensures consistency across notebooks and avoids errors when referencing columns.

In [2]:
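# Normalise headers: trim, lowercase, spaces -> underscores, drop other symbols
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(" ", "_", regex=False)
      .str.replace(r"\W", "", regex=True)  # \w already covers "_"
)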
df.columns

Index(['transaction_unique_identifier', 'price', 'date_of_transfer',
       'property_type', 'oldnew', 'duration', 'towncity', 'district', 'county',
       'ppdcategory_type', 'record_status__monthly_file_only'],
      dtype='object')
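
The same renaming rule can be sketched as a small standalone function, which makes it easy to reuse in other notebooks and to test. The helper name `to_snake_case` is hypothetical; the input names are the raw headers listed in the `df.info()` output above. Because dropping characters can make two different headers collapse into the same name, the sketch also checks that the result is still unique:

```python
import re


def to_snake_case(name: str) -> str:
    """Apply the notebook's renaming rule to a single column name."""
    name = name.strip().lower().replace(" ", "_")
    # Drop everything that is not a letter, digit, or underscore.
    return re.sub(r"\W", "", name)


raw_columns = [
    "Transaction unique identifier", "Price", "Date of Transfer",
    "Property Type", "Old/New", "Duration", "Town/City", "District",
    "County", "PPDCategory Type", "Record Status - monthly file only",
]
clean_columns = [to_snake_case(c) for c in raw_columns]

# Removing characters can make two names collide, so verify uniqueness.
assert len(set(clean_columns)) == len(clean_columns), "duplicate column names"
print(clean_columns)
```

Note that `" - "` turns into a double underscore, which is why the last column is named `record_status__monthly_file_only`.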

## 3️⃣ Convert Columns to Correct Types

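We enforce:
- `price` → numeric  
- `date_of_transfer` → datetime  
- identifying fields and labels → category dtype  

Using categories reflects the discrete nature of these variables and reduces memory usage for columns with few distinct values, such as `property_type` or `duration`. A near-unique column like `transaction_unique_identifier` gains little from it, since almost every value becomes its own category.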

In [3]:
# Convert price
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Convert date
df["date_of_transfer"] = pd.to_datetime(df["date_of_transfer"], errors="coerce")

# Convert categorical columns
cat_cols = [
    "transaction_unique_identifier",
    "property_type",
    "oldnew",
    "duration",
    "towncity",
    "district",
    "county",
    "ppdcategory_type",
    "record_status__monthly_file_only",
]

for col in cat_cols:
    if col in df.columns:
        df[col] = df[col].astype("category")
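
How much memory the `category` dtype saves depends on how often values repeat. The sketch below compares a low-cardinality column with an all-unique one on made-up data; the labels, the identifier format, and the helper `deep_bytes` are assumptions for illustration only:

```python
import pandas as pd

n = 10_000

# Low cardinality: three labels repeated many times (like a property type).
low_card = pd.Series(["D", "S", "T"] * (n // 3), dtype=object)

# High cardinality: every value is distinct (like a transaction identifier).
high_card = pd.Series([f"id-{i:08d}" for i in range(n)], dtype=object)


def deep_bytes(s: pd.Series) -> int:
    """Memory footprint in bytes, including the underlying Python strings."""
    return int(s.memory_usage(deep=True))


low_obj, low_cat = deep_bytes(low_card), deep_bytes(low_card.astype("category"))
high_obj, high_cat = deep_bytes(high_card), deep_bytes(high_card.astype("category"))

print(f"low cardinality : object={low_obj:>9,}  category={low_cat:>9,}")
print(f"high cardinality: object={high_obj:>9,}  category={high_cat:>9,}")
```

Running the same comparison on the real columns is a quick way to decide which of them are worth converting.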

## 4️⃣ Check for Missing Values

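We check whether any column contains missing (`NaN` or `NaT`) values.  
Because the conversions in the previous step use `errors="coerce"`, any price or date that could not be parsed would also show up here as missing. The result tells us whether imputation or row removal is required.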


In [4]:
df.isna().sum()

transaction_unique_identifier       0
price                               0
date_of_transfer                    0
property_type                       0
oldnew                              0
duration                            0
towncity                            0
district                            0
county                              0
ppdcategory_type                    0
record_status__monthly_file_only    0
dtype: int64
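
All counts being zero also confirms that no value was silently discarded during type conversion. A minimal sketch on made-up values of how `errors="coerce"` turns unparseable entries into missing values, and how the affected rows could be recovered for inspection:

```python
import pandas as pd

# Made-up raw values, including one bad entry per column.
raw = pd.DataFrame(
    {
        "price": ["100000", "abc", "250000"],
        "date_of_transfer": ["2015-01-31", "2015-02-28", "not a date"],
    }
)

clean = pd.DataFrame(
    {
        "price": pd.to_numeric(raw["price"], errors="coerce"),
        "date_of_transfer": pd.to_datetime(raw["date_of_transfer"], errors="coerce"),
    }
)

# Each unparseable entry is now NaN (price) or NaT (date).
missing_per_column = clean.isna().sum()
print(missing_per_column)

# Rows that failed at least one conversion, shown with their original text.
bad_rows = raw[clean.isna().any(axis=1)]
print(bad_rows)
```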

## 5️⃣ Create a Subset for Training and Export

For model training later on, we draw a **10% random sample of the full cleaned dataset** 
(about 2.2 million rows). This gives the model enough data to learn realistic patterns 
without requiring the full 22M rows. The fixed `random_state` makes the sample reproducible.

In [45]:
df_subset = df.sample(frac=0.1, random_state=42)
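
The effect of `frac` and `random_state` can be verified on a small stand-in frame. This is a sketch with made-up data, not the real dataset:

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({"price": range(1_000)})

first = toy.sample(frac=0.1, random_state=42)
second = toy.sample(frac=0.1, random_state=42)

# Same seed, same rows: the subset can be regenerated exactly.
same_rows = first.index.equals(second.index)

# frac=0.1 keeps one row in ten, sampled without replacement.
print(len(first), same_rows, first.index.is_unique)
```

The sampled rows keep their original index labels, so they can still be traced back to the full frame.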

## 6️⃣ Exporting the Subset to CSV

To avoid re-sampling every time the notebook is executed, we export the 
10% subset to a CSV file.

Later exploration and training work can then load this file directly, 
without loading the full 22M-row dataset.

These files won't be pushed to GitHub because of their size; they are excluded via `.gitignore`.

In [46]:
df_subset.to_csv("../Data/housing_prices/price_paid_records_prepared_subset_10.csv", index=False)
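
One caveat of the CSV format is that it stores plain text only, so the `category` and datetime dtypes enforced earlier are not preserved in the file. A minimal sketch of how they could be restored on load, using a made-up frame and an in-memory buffer instead of the real file:

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype, is_datetime64_any_dtype

# Toy subset with the dtypes enforced earlier in this notebook.
toy = pd.DataFrame(
    {
        "price": [100000, 250000],
        "date_of_transfer": pd.to_datetime(["2015-01-31", "2016-07-15"]),
        "property_type": pd.Series(["D", "S"], dtype="category"),
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# CSV stores plain text, so the dtypes have to be declared again on load.
reloaded = pd.read_csv(
    buffer,
    dtype={"property_type": "category"},
    parse_dates=["date_of_transfer"],
)
print(reloaded.dtypes)
```

For the exported subset, the same `dtype` and `parse_dates` arguments would be passed to `pd.read_csv` together with the file path.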