# 01_scrape_uk_housing.ipynb

**Authors:** Natan Wojtowicz.

**Purpose / short description (explainable in oral exam):**
This notebook contains reproducible steps to obtain the HM Land Registry "Price Paid" dataset from Kaggle, or to load a local copy that has been uploaded to the project. Because the dataset is large (~2 GB), the notebook provides options to:

* download via the Kaggle API
* verify checksum and file size
* extract and inspect the CSV header
* create smaller reproducible subsets (time-based, location-based, stratified sampling by county & year)
* save the resulting subset(s) to Parquet/CSV for faster downstream processing

All code cells contain comments and a short markdown explanation above them so everyone in the group can explain what each cell does.

## Notebook content

### 1. Setup (packages, env notes)

In [1]:
# Who worked on this cell: Natan
# Purpose: install & import packages, notes about Kaggle credentials

# If running in a clean environment, install required packages
# !pip install kaggle pandas pyarrow dask[complete] tqdm

import os
import sys
from pathlib import Path
import hashlib
import json
import gzip
from tqdm import tqdm
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Set paths
ROOT = Path.cwd()
DATA_DIR = ROOT / "data" / "uk_housing"
DATA_DIR.mkdir(parents=True, exist_ok=True)
RAW_DIR = DATA_DIR / "raw"
RAW_DIR.mkdir(exist_ok=True)
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(exist_ok=True)

print("Data directory:", DATA_DIR)

Data directory: c:\Users\natan\Desktop\School\3eJaar\ML\Challenge\CloudAikes\dataset_1_uk_housing\data\uk_housing


**Notes for the exam answer:** We use pandas for small tests and PyArrow/Parquet for faster disk I/O on large datasets.

### 2. Downloading from Kaggle (automated)


In [2]:
# Who: Natan
# Purpose: download the Kaggle dataset programmatically if Kaggle credentials are provided

# Kaggle dataset: hm-land-registry/uk-housing-prices-paid
# Precondition: user has a kaggle.json file with API credentials in ~/.kaggle/kaggle.json
# If running in Binder / Colab you must upload kaggle.json or set env variables.

KAGGLE_DATASET = "hm-land-registry/uk-housing-prices-paid"
TARGET_ZIP = RAW_DIR / "uk_housing_price_paid.zip"

def kaggle_download(dataset, dest_folder):
    # Import the Kaggle API client (the Python class, not the CLI) only when needed
    try:
        from kaggle.api.kaggle_api_extended import KaggleApi
    except Exception as e:
        # chain the original error so the real cause stays visible in the traceback
        raise RuntimeError("kaggle package not available. Install with `pip install kaggle` and ensure kaggle.json is present.") from e

    api = KaggleApi()
    api.authenticate()
    print("Authenticated to Kaggle")
    # download by dataset
    api.dataset_download_files(dataset, path=str(dest_folder), unzip=True, quiet=False)
    print("Download complete")

# Uncomment to run
# kaggle_download(KAGGLE_DATASET, RAW_DIR)

print("If you cannot run the Kaggle API here, upload the dataset's CSV to data/uk_housing/raw/")

If you cannot run the Kaggle API here, upload the dataset's CSV to data/uk_housing/raw/


**Notes:** Many students cannot use the Kaggle API because they lack credentials, so we provide a manual upload fallback. The Kaggle download typically extracts to a CSV file or a folder; check `RAW_DIR` afterwards.
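
The overview lists checksum and file-size verification, but no cell in this notebook performs it. A minimal sketch of how it could look, using only the standard library, is given here. The helper names `sha256_of_file` and `describe_files` are hypothetical, and the sketch assumes no official checksum is available to compare against; the digest is simply recorded so that group members can confirm they all work from the same file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, block_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def describe_files(folder, pattern="*.csv"):
    """Return (name, size_in_bytes, sha256) for every matching file in folder."""
    return [
        (p.name, p.stat().st_size, sha256_of_file(p))
        for p in sorted(Path(folder).glob(pattern))
    ]
```

In this notebook it could be called as `describe_files(RAW_DIR)` after the download or manual upload. Reading in blocks keeps memory use constant, which matters for a file of roughly 2 GB.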

### 3. Inspect the raw CSV header and example rows


In [27]:
# Who: Natan
# Purpose: quickly inspect the header and first few lines without loading everything into memory

CSV_GUESS = RAW_DIR / "price_paid_records.csv"  # adjust to actual filename after download
if not CSV_GUESS.exists():
    # fall back to the first .csv found in raw (sorted for a deterministic choice)
    csvs = sorted(RAW_DIR.glob('*.csv'))
    if not csvs:
        # stop here instead of failing later with a less helpful error in open()
        raise FileNotFoundError("No CSV found in raw folder. Please download or upload the data.")
    CSV_GUESS = csvs[0]

print("Using:", CSV_GUESS)

# Print header + first 4 data rows
with open(CSV_GUESS, 'r', encoding='utf-8', errors='replace') as f:
    for _ in range(5):
        print(f.readline().strip())

Using: c:\Users\natan\Desktop\School\3eJaar\ML\Challenge\CloudAikes\dataset_1_uk_housing\data\uk_housing\raw\price_paid_records.csv
Transaction unique identifier,Price,Date of Transfer,Property Type,Old/New,Duration,Town/City,District,County,PPDCategory Type,Record Status - monthly file only
{81B82214-7FBC-4129-9F6B-4956B4A663AD},25000,1995-08-18 00:00,T,N,F,OLDHAM,OLDHAM,GREATER MANCHESTER,A,A
{8046EC72-1466-42D6-A753-4956BF7CD8A2},42500,1995-08-09 00:00,S,N,F,GRAYS,THURROCK,THURROCK,A,A
{278D581A-5BF3-4FCE-AF62-4956D87691E6},45000,1995-06-30 00:00,T,N,F,HIGHBRIDGE,SEDGEMOOR,SOMERSET,A,A
{1D861C06-A416-4865-973C-4956DB12CD12},43150,1995-11-24 00:00,T,N,F,BEDFORD,NORTH BEDFORDSHIRE,BEDFORDSHIRE,A,A


**Explain to the examiner:** This cell avoids reading 2 GB into memory; it just checks the schema.
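
The same idea can be taken one step further by checking the header against the columns we expect. This is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` is copied from the header printed above, and `read_header_and_rows` and `missing_columns` are hypothetical helper names.

```python
import csv
from itertools import islice

# Column names copied from the header printed above.
EXPECTED_COLUMNS = [
    "Transaction unique identifier", "Price", "Date of Transfer",
    "Property Type", "Old/New", "Duration", "Town/City", "District",
    "County", "PPDCategory Type", "Record Status - monthly file only",
]


def read_header_and_rows(csv_path, n_rows=3):
    """Read the header and the first n_rows records from the start of the file."""
    with open(csv_path, "r", encoding="utf-8", errors="replace", newline="") as f:
        reader = csv.reader(f)
        header = [name.strip() for name in next(reader)]
        # islice stops after n_rows, so the remainder of the file is never parsed
        rows = list(islice(reader, n_rows))
    return header, rows


def missing_columns(header, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the header."""
    return [name for name in expected if name not in header]
```

An empty result from `missing_columns(header)` suggests the file has the layout the later cells rely on, for example the `Date of Transfer` and `County` columns.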


### 4. Create reproducible subsets (time-based and stratified)

We provide two approaches: (A) a simple year-based subset (e.g., 2010-2017) and (B) stratified sampling by `County` + `year`, which preserves the distribution of rows across counties and years.
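
The proportional allocation behind approach (B) can be illustrated with a small worked example. This is a sketch under stated assumptions: the counts are made up for illustration and are not figures from the dataset, and `proportional_quotas` is a hypothetical helper. It uses integer floor division, which gives the same rounded-down quota as the rule in the cell below without relying on floating-point arithmetic.

```python
def proportional_quotas(counts, n_samples):
    """Allocate n_samples across strata in proportion to their row counts.

    Each stratum gets at least one row, so the quotas can sum to slightly
    more (or, through rounding down, slightly less) than n_samples.
    """
    total = sum(counts.values())
    return {key: max(1, count * n_samples // total)
            for key, count in counts.items()}


# Made-up counts for three (county, year) strata; not real figures.
example_counts = {
    ("GREATER LONDON", 2015): 6_000,
    ("SOMERSET", 2015): 3_000,
    ("RUTLAND", 2015): 1_000,
}
quotas = proportional_quotas(example_counts, n_samples=100)
print(quotas)
```

Here the three strata hold 60%, 30% and 10% of the rows, so a sample of 100 rows is split into quotas of 60, 30 and 10.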

In [30]:
# Who: Natan
# Purpose: create smaller files to speed up iterative analysis and modeling

SAMPLE_DIR = PROCESSED_DIR / "samples"
SAMPLE_DIR.mkdir(exist_ok=True)

# PARAMETERS
start_year = 2010
end_year = 2017
max_rows_target = 1_000_000  # choose a limit you can work with locally

# Approach A: filter by year using chunked read

def filter_by_year(input_csv, output_csv, year_from, year_to):
    chunks = pd.read_csv(input_csv, parse_dates=["Date of Transfer"], dayfirst=False,
                         chunksize=200_000, low_memory=False)
    written = False
    total = 0
    for c in chunks:
        c.columns = [col.strip() for col in c.columns]
        c['date'] = pd.to_datetime(c['Date of Transfer'], errors='coerce')
        c['year'] = c['date'].dt.year
        sel = c[(c['year'] >= year_from) & (c['year'] <= year_to)]
        if sel.empty:
            continue
        if not written:
            sel.to_csv(output_csv, index=False, mode='w')
            written = True
        else:
            sel.to_csv(output_csv, index=False, header=False, mode='a')
        total += len(sel)
    print(f"Wrote {total} rows to {output_csv}")

# Example usage
filter_by_year(CSV_GUESS, SAMPLE_DIR / 'pp_2010_2017.csv', start_year, end_year)

# Approach B: stratified sampling by county+year (chunked)
from collections import defaultdict
import random

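def stratified_sample(input_csv, output_csv, n_samples=200_000, stratify_cols=('County', 'year')):
    # the default is a tuple to avoid a shared mutable default; pandas needs a list for column selection
    stratify_cols = list(stratify_cols)
    # first pass: count rows per stratum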
    counts = defaultdict(int)
    reader = pd.read_csv(input_csv, parse_dates=['Date of Transfer'], chunksize=200_000, low_memory=False)
    for chunk in tqdm(reader, desc='Counting strata'):
        chunk.columns = [c.strip() for c in chunk.columns]
        chunk['year'] = pd.to_datetime(chunk['Date of Transfer'], errors='coerce').dt.year
        key_series = chunk[stratify_cols].fillna('NA').astype(str).agg('||'.join, axis=1)
        for key in key_series:
            counts[key] += 1

    # compute a row quota per stratum proportional to its count; the floor of
    # 1 row per stratum means the quotas can sum to slightly more than n_samples
    total = sum(counts.values())
    fractions = {k: max(1, int(v / total * n_samples)) for k, v in counts.items()}

    # second pass: sample rows matching each stratum up to target
    samples_written = 0
    written_header = False
    reader = pd.read_csv(input_csv, parse_dates=['Date of Transfer'], chunksize=200_000, low_memory=False)
    sampled_per_key = defaultdict(int)

    for chunk in tqdm(reader, desc='Sampling strata'):
        chunk.columns = [c.strip() for c in chunk.columns]
        chunk['year'] = pd.to_datetime(chunk['Date of Transfer'], errors='coerce').dt.year
        chunk['key'] = chunk[stratify_cols].fillna('NA').astype(str).agg('||'.join, axis=1)
        to_take = []
        for key, group in chunk.groupby('key'):
            remain = fractions.get(key, 0) - sampled_per_key.get(key, 0)
            if remain <= 0:
                continue
            # sample min(remain, len(group)) rows
            k = min(remain, len(group))
            sampled = group.sample(n=k, replace=False, random_state=42)
            to_take.append(sampled)
            sampled_per_key[key] += len(sampled)

        if to_take:
            out = pd.concat(to_take)
            if not written_header:
                out.to_csv(output_csv, index=False, mode='w')
                written_header = True
            else:
                out.to_csv(output_csv, index=False, header=False, mode='a')
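            samples_written += len(out)
            # stop once the target is reached; strata that only appear in later chunks may then stay under quota
            if samples_written >= n_samples:
                break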

    print(f"Wrote {samples_written} sampled rows to {output_csv}")

# Example usage (uncomment to run)
#stratified_sample(CSV_GUESS, SAMPLE_DIR / 'pp_stratified_200k.csv', n_samples=200_000)

Wrote 6200823 rows to c:\Users\natan\Desktop\School\3eJaar\ML\Challenge\CloudAikes\dataset_1_uk_housing\data\uk_housing\processed\samples\pp_2010_2017.csv


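**Explainable point for oral exam:** The stratified sample gives each county-year stratum a number of rows proportional to its size in the full file, so the sample's distribution across counties and years stays close to the original. This reduces the risk that a model trained on the sample learns a distribution skewed toward particular regions or periods. Within a stratum, the quota is filled from the earliest chunks that contain it, so rows are not drawn uniformly at random from the whole stratum.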
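
One way to check this claim is to compare each stratum's share of rows in the source data and in the sample. This is a minimal sketch, assuming both tables fit in memory and already contain the `County` and `year` columns; `stratum_shares` and `max_share_gap` are hypothetical helper names. For the full file the counts would have to be accumulated chunk by chunk, as the first pass of `stratified_sample` does.

```python
import pandas as pd


def stratum_shares(df, cols):
    """Return the fraction of rows falling into each combination of cols."""
    # groupby drops rows whose key is NaN, so such rows are not assigned to any stratum
    return df.groupby(list(cols)).size() / len(df)


def max_share_gap(full, sample, cols=("County", "year")):
    """Largest absolute difference in stratum share between full data and sample."""
    full_shares = stratum_shares(full, cols)
    sample_shares = stratum_shares(sample, cols)
    # Strata missing from one side count as a share of 0.
    diff = full_shares.subtract(sample_shares, fill_value=0)
    return float(diff.abs().max())
```

A gap close to 0 indicates that the sample follows the county-year distribution of the source; a large gap points to strata that are over- or under-represented.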

### 5. Save to Parquet for faster reads


In [33]:
# Who: Natan
# Purpose: convert the sample CSV to parquet for faster downstream processing

def csv_to_parquet(csv_path, parquet_path):
    df = pd.read_csv(csv_path, parse_dates=['Date of Transfer'])
    df.columns = [c.strip() for c in df.columns]
    # use pyarrow, the Parquet engine installed and imported in the setup cell
    df.to_parquet(parquet_path, index=False, engine='pyarrow')
    print('Saved', parquet_path)

# Example usage
csv_to_parquet(SAMPLE_DIR / 'pp_2010_2017.csv', SAMPLE_DIR / 'pp_2010_2017.parquet')

Saved c:\Users\natan\Desktop\School\3eJaar\ML\Challenge\CloudAikes\dataset_1_uk_housing\data\uk_housing\processed\samples\pp_2010_2017.parquet


### 6. Notes / next steps

* After this notebook you should have `data/uk_housing/processed/samples/*` containing manageable datasets. Run `02_clean_uk_housing.ipynb` on those files.
* If you plan to use the *full* dataset for final training, prefer to convert the CSV to Parquet and use Dask or PySpark to scale.