# 02 – Data Cleaning & Feature Engineering

**Notebook Name:** `02_Data_Cleaning_Feature_Engineering.ipynb`

## Objectives
- Load raw data from `inputs/datasets/raw/HousePricesRecords.csv`.
- Handle nulls and duplicates.
- Parse dates into separate Year and Month features.
- Map and one-hot encode key categorical fields.
- Save the cleaned, feature-engineered dataset for modeling.

## Inputs
- `inputs/datasets/raw/HousePricesRecords.csv` (raw Price-Paid records).

## Outputs
- `outputs/datasets/collection/HousePricesRecords_clean.csv` (cleaned and feature-engineered).

## Additional Comments
- Follows CRISP-DM Phase 3: Data Preparation.
- This cleaned CSV is the source of truth for subsequent analysis notebooks and the Streamlit app.

## Import Required Libraries
Loading packages for data manipulation and file operations.

In [1]:
import pandas as pd
import numpy as np
import os

## Load Raw Data Sample
Read the raw file in chunks of 200,000 rows to manage memory, keeping only the 1,000 most recent entries overall for initial inspection.

In [None]:
input_path = "inputs/datasets/raw/HousePricesRecords.csv"
chunksize = 200000
keep_n = 1000
latest = None
for chunk in pd.read_csv(input_path,
                         parse_dates=["Date of Transfer"],
                         usecols=["Price","Date of Transfer","Property Type","Old/New","Duration","Town/City","County","PPDCategory Type"],
                         chunksize=chunksize,
                         low_memory=False):
    top = chunk.nlargest(keep_n, "Date of Transfer")
    latest = top if latest is None else pd.concat([latest, top]).nlargest(keep_n, "Date of Transfer")
# Final sample
df = latest.reset_index(drop=True)
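
The chunked "keep the newest N" pattern used in the cell above can be checked on a small in-memory CSV. This is a minimal sketch: the toy rows, the `newest_rows` helper name, and the tiny `chunksize` are assumptions for illustration, not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical toy CSV standing in for the raw file (not real records).
csv_text = """Price,Date of Transfer
100000,2020-01-15
250000,2023-06-01
180000,2021-03-10
320000,2024-02-20
90000,2019-11-05
210000,2022-08-30
"""

def newest_rows(source, keep_n, chunksize):
    """Return the keep_n most recent rows, reading `source` chunk by chunk."""
    latest = None
    for chunk in pd.read_csv(source,
                             parse_dates=["Date of Transfer"],
                             chunksize=chunksize):
        # Best candidates from this chunk only
        top = chunk.nlargest(keep_n, "Date of Transfer")
        # Merge with the running winners and trim back to keep_n rows
        latest = top if latest is None else pd.concat([latest, top]).nlargest(keep_n, "Date of Transfer")
    return latest.reset_index(drop=True)

sample = newest_rows(io.StringIO(csv_text), keep_n=2, chunksize=2)
print(sample)
```

Because each chunk is reduced to at most `keep_n` rows before merging, memory use is bounded by roughly `chunksize + keep_n` rows, regardless of file size.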


### Preview Sample
Confirm the data loaded correctly. The sample is limited to 1,000 rows to keep memory use low.

In [None]:
df.head()

## Inspect Data Types & Distributions
Verify dtypes and explore categorical value counts.

In [None]:
print("Column dtypes:\n", df.dtypes)

cat_cols = ["Property Type","Old/New","Duration","County","Town/City"]
for col in cat_cols:
    print(f"\nValue counts for {col}:")
    print(df[col].value_counts())
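
By default `value_counts()` leaves missing values out of the tally. As a small sketch on hypothetical toy data, passing `dropna=False` surfaces them, and `normalize=True` reports shares instead of counts:

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.Series(["D", "S", "D", None], name="Property Type")

counts = toy.value_counts(dropna=False)                  # includes the missing entry
shares = toy.value_counts(normalize=True, dropna=False)  # fractions of all rows

print(counts)
print(shares)
```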

## Clean Missing and Irrelevant Data
Remove rows with missing critical fields and drop unused columns.

In [None]:

print("Missing before:\n", df.isnull().sum())
df.dropna(subset=["Price","Date of Transfer"], inplace=True)
print("Shape after drop:", df.shape)

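# These columns are already excluded by usecols at load time;
# errors='ignore' keeps this a safe no-op if they are absent.
df.drop(columns=["Transaction unique identifier","District","Record Status - monthly file only"],
        errors='ignore', inplace=True)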
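
The objectives also list duplicate handling, which the cell above does not show. One possible way to combine both steps, sketched on hypothetical toy data (the use of `drop_duplicates()` on full rows is an assumption, not the notebook's confirmed method):

```python
import pandas as pd

# Hypothetical toy frame: one exact duplicate row and one row missing Price
toy = pd.DataFrame({
    "Price": [100000, 100000, None, 250000],
    "Date of Transfer": pd.to_datetime(
        ["2023-01-01", "2023-01-01", "2023-02-01", "2023-03-01"]
    ),
    "County": ["KENT", "KENT", "KENT", None],
})

cleaned = (
    toy.dropna(subset=["Price", "Date of Transfer"])  # only critical fields are required
       .drop_duplicates()                             # remove rows identical in every column
       .reset_index(drop=True)
)
print(cleaned)
```

Note that the row with a missing `County` survives, because only `Price` and `Date of Transfer` are treated as critical.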

## Date Feature Engineering
Convert the date column to datetime, extract year and month, and keep only the last 3 years of records.

In [None]:

df['Date of Transfer'] = pd.to_datetime(df['Date of Transfer'], errors='coerce')

df['Year'] = df['Date of Transfer'].dt.year
df['Month'] = df['Date of Transfer'].dt.month


max_date = df['Date of Transfer'].max()
cutoff = max_date - pd.DateOffset(years=3)
before = df.shape[0]
df = df[df['Date of Transfer'] >= cutoff]
after = df.shape[0]
print(f"Filtered {before-after} rows; {after} remain from {cutoff.date()} to {max_date.date()}")
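
To see how the 3-year cutoff behaves at the boundary, here is a minimal sketch on hypothetical toy dates. The cutoff is computed relative to the newest date in the data, and the `>=` comparison keeps a row that falls exactly on the cutoff:

```python
import pandas as pd

# Hypothetical toy dates: one just before the cutoff, one exactly on it
toy = pd.DataFrame({
    "Date of Transfer": pd.to_datetime(
        ["2020-05-31", "2020-06-01", "2022-01-15", "2023-06-01"]
    )
})
toy["Year"] = toy["Date of Transfer"].dt.year
toy["Month"] = toy["Date of Transfer"].dt.month

max_date = toy["Date of Transfer"].max()
cutoff = max_date - pd.DateOffset(years=3)        # calendar-aware subtraction
recent = toy[toy["Date of Transfer"] >= cutoff]   # inclusive of the cutoff day

print(cutoff.date(), len(recent))
```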

## Encode Categorical Features
Map the binary fields (`Old/New`, `Duration`) to 0/1 and one-hot encode `Property Type`.

In [None]:
# Binary maps
df['Old/New'] = df['Old/New'].map({'N':0,'Y':1})
df['Duration'] = df['Duration'].map({'F':1,'L':0})
# One-hot for Property Type
df = pd.get_dummies(df, columns=['Property Type'], prefix='Property')
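
One thing to watch with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN` without warning. The sketch below uses hypothetical toy data (the `"U"` code is included purely for illustration) to show how such values could be counted before one-hot encoding:

```python
import pandas as pd

# Hypothetical toy frame; "U" stands in for any code missing from the map
toy = pd.DataFrame({
    "Old/New": ["N", "Y", "N"],
    "Duration": ["F", "L", "U"],
    "Property Type": ["D", "S", "D"],
})

toy["Old/New"] = toy["Old/New"].map({"N": 0, "Y": 1})
toy["Duration"] = toy["Duration"].map({"F": 1, "L": 0})

# Values not present in the mapping are now NaN
unmapped = int(toy["Duration"].isna().sum())
print(f"Unmapped Duration values: {unmapped}")

# One-hot encode; the new Property_* columns are appended after the others
encoded = pd.get_dummies(toy, columns=["Property Type"], prefix="Property")
print(encoded.columns.tolist())
```

Whether to drop, impute, or separately encode unmapped rows depends on the modeling step, so that choice is left open here.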