# Lesson 01: Data Filtering

In this lesson, you'll learn to filter NSW property sales data down to Sydney suburbs only and separate houses from units.

## Why Are We Doing This?

The end goal of this project is to create a simple webpage where users can:
- Select a city (starting with Sydney)
- View suburb-by-suburb analysis of property prices
- See how prices have changed over time
- Compare different suburbs

**Why separate houses and units?** Houses and units (apartments) are fundamentally different property types with distinct:
- Price ranges (houses typically cost more)
- Market dynamics (different supply/demand patterns)
- Buyer demographics (different target markets)
- Investment characteristics

Users will want to filter and compare houses separately from units, so we need to split them from the start. This separation is critical for meaningful analysis!

**Your Goal:**
- Load parquet files from `data/parquet/` (one file per year, 2005-2025)
- Combine all years into a single DataFrame
- Explore the data structure (district codes, postcodes, suburbs)
- Find or create a list of Sydney suburbs
- Filter the data to include only Sydney suburbs
- Filter to residential properties only
- **Split houses vs units** (using property_unit_number or strata_lot_number)
- Save separate files: `data/sydney/full_houses.parquet` and `data/sydney/full_units.parquet`

**IMPORTANT:**
You will run into bugs and roadblocks. I have annotated these in my own notebooks with `!#`. Some problems to watch for:
- The dataset contains records dated outside our 2005-2025 range.
- Sydney's postcodes may be split across multiple ranges, so check before filtering on a single range.


In [None]:
# Import necessary libraries
# You'll need pandas, numpy, matplotlib, seaborn, and Path from pathlib
# Don't forget to suppress warnings for cleaner output

# Set up configuration variables:
# - DATA_DIR: path to data directory
# - OUTPUT_DIR: path to parquet files (data/parquet)
# - YEARS: range(2005, 2026); the end is exclusive, so this covers 2005-2025

# Print a success message when imports are complete


In [None]:
# Load all parquet files and combine into a single DataFrame
# Hints:
# - Loop through each year in YEARS
# - For each year, check if the parquet file exists (use Path)
# - Try reading with fastparquet engine first, fall back to default if it fails
# - Add a 'year' column to each DataFrame before appending to a list
# - Print progress for each year loaded
# - Use pd.concat() to combine all DataFrames
# - Print summary statistics: total records, date range, shape, columns
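
If you get stuck, the hints above could be sketched roughly as follows. This is one possible approach, not the reference solution. The helper name `load_years` and the `<year>.parquet` file naming are assumptions; adjust them to match the files you actually have in `data/parquet/`.

```python
from pathlib import Path

import pandas as pd


def load_years(parquet_dir, years):
    """Read one parquet file per year and stack them into a single DataFrame."""
    frames = []
    for year in years:
        # Assumed file naming: data/parquet/2005.parquet, 2006.parquet, ...
        path = Path(parquet_dir) / f"{year}.parquet"
        if not path.exists():
            print(f"{year}: file not found, skipping")
            continue
        try:
            df_year = pd.read_parquet(path, engine="fastparquet")
        except Exception:
            # fastparquet is missing or could not read the file:
            # fall back to pandas' default engine
            df_year = pd.read_parquet(path)
        df_year["year"] = year
        frames.append(df_year)
        print(f"{year}: {len(df_year):,} records")
    if not frames:
        # pd.concat raises ValueError on an empty list
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling `load_years("data/parquet", range(2005, 2026))` would then give you the combined DataFrame to summarise.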


In [None]:
# Explore the data structure to understand what we're working with
# Hints:
# - Get unique values for: district_code, property_locality, property_post_code, nature_of_property, contract_date, settlement_date, purchase_price, etc
# - Print counts and sample values for each
# - This will help you understand the data and plan your filtering strategy


In [None]:
# Get Sydney suburbs list
# Option A: CHALLENGE (see readme.md)
# Option B: Load from sydney_burbs.json file (if you copied it from final/backend/notebooks/data/)

# Hints:
# - If using JSON: load with json module (import json), extract suburbs list, normalize to lowercase
# - If using postcodes: filter DataFrame by postcode ranges, get unique property_locality values
# - Normalize suburb names (lowercase, strip whitespace) for consistent matching
# - Print how many suburbs you found
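
For the postcode route, a minimal sketch might look like the one below. The function name and `EXAMPLE_POSTCODE_RANGES` are hypothetical, and the ranges are placeholders for illustration only. Working out the real Sydney ranges is part of the challenge.

```python
import pandas as pd

# Placeholder ranges for illustration only; verify the real Sydney ranges yourself.
EXAMPLE_POSTCODE_RANGES = [(2000, 2010), (2015, 2020)]


def suburbs_in_postcode_ranges(df, ranges):
    """Return sorted, normalized suburb names whose postcode falls in any range."""
    # Postcodes may be stored as strings or numbers; unparseable values become NaN
    postcodes = pd.to_numeric(df["property_post_code"], errors="coerce")
    in_any_range = pd.Series(False, index=df.index)
    for low, high in ranges:
        # between() is inclusive on both ends and is False for NaN
        in_any_range |= postcodes.between(low, high)
    names = df.loc[in_any_range, "property_locality"].dropna()
    return sorted(names.str.lower().str.strip().unique())
```

Passing several `(low, high)` tuples is what lets this handle a city whose postcodes are not one contiguous block.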


In [None]:
# Filter DataFrame to Sydney suburbs only
# Hints:
# - Use .str.lower().str.strip() on property_locality so it matches your normalized suburb list
# - Use .isin() to filter rows where suburb is in your Sydney suburbs list
# - Call .copy() on the filtered DataFrame so later edits don't trigger SettingWithCopyWarning
# - Filter again to keep only residential properties (nature_of_property == 'R')
# - Print summary: total records, percentage of original, date range, unique suburbs/postcodes/districts
# - Display a sample of the filtered data
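
As a rough sketch of the two filters described above (the function name `filter_sydney_residential` is made up for illustration, and your own version may well be structured differently):

```python
import pandas as pd


def filter_sydney_residential(df, sydney_suburbs):
    """Keep rows in the given suburbs, then keep only residential properties."""
    # Normalize both sides the same way so the comparison is consistent
    suburb_set = {name.lower().strip() for name in sydney_suburbs}
    locality = df["property_locality"].str.lower().str.strip()
    df_sydney = df[locality.isin(suburb_set)].copy()
    # 'R' marks residential properties in nature_of_property
    return df_sydney[df_sydney["nature_of_property"] == "R"].copy()
```

Comparing `len()` of the result against `len(df)` gives you the "percentage of original" figure for the summary.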


In [None]:
# Inspect data to understand how to split houses vs units
# Hints:
# - Check property_unit_number column: how many records have non-empty values?
# - Check strata_lot_number column: how many records have non-empty values?
# - Sample records WITH unit numbers vs WITHOUT unit numbers
# - Understand the logic: Units typically have unit_number OR strata_lot_number, Houses have neither
# - Print counts and sample records to verify your understanding


In [None]:
# Split houses and units
# Hints:
# - Convert property_unit_number and strata_lot_number to string (handle NaN properly)
# - Create boolean masks:
#   - has_unit_num: non-empty unit_number (not 'nan', not empty string)
#   - has_strata: non-empty strata_lot_number (not 'nan', not empty string)
# - Units mask = has_unit_num OR has_strata
# - Houses mask = NOT units_mask (neither unit_number nor strata_lot_number)
# - Create separate DataFrames: df_houses and df_units
# - Verify the split adds up correctly (houses + units = total)
# - Print counts and percentages for each type
# - Display sample records from each type to verify
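
The masks described above could be built along these lines. This is a sketch under the assumption that "empty" means missing, blank, or the literal string 'nan'. The helper names `has_value` and `split_houses_units` are hypothetical.

```python
import pandas as pd


def has_value(series):
    """True where the value is present, non-blank, and not the text 'nan'."""
    # The nullable string dtype keeps missing values as <NA>; turn those into ""
    text = series.astype("string").fillna("").str.strip()
    return ((text != "") & (text.str.lower() != "nan")).astype(bool)


def split_houses_units(df):
    """Units have a unit number OR a strata lot number; houses have neither."""
    has_unit_num = has_value(df["property_unit_number"])
    has_strata = has_value(df["strata_lot_number"])
    units_mask = has_unit_num | has_strata
    df_units = df[units_mask].copy()
    df_houses = df[~units_mask].copy()
    # Sanity check: every row lands in exactly one of the two groups
    assert len(df_houses) + len(df_units) == len(df)
    return df_houses, df_units
```

Because the houses mask is simply the negation of the units mask, the two DataFrames cannot overlap, which is why the totals check should always pass.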


In [None]:
# Save filtered Sydney data to separate parquet files
# Hints:
# - Create output directory if it doesn't exist (use Path.mkdir(parents=True, exist_ok=True), or os.makedirs with exist_ok=True if you import os)
# - Save houses to data/sydney/full_houses.parquet
# - Save units to data/sydney/full_units.parquet
# - Use to_parquet() with index=False for both
# - Print file paths and sizes in MB for both files
# - Optionally, also save the combined file to data/sydney/full.parquet (useful for exploration)
