# Lesson 03: Quarterly Analysis

In this lesson, you'll transform the Sydney housing data to match the database schema and create quarterly aggregations.

**Your Goal:**
- Transform data to match `properties` table schema (suburb, postcode, district, property_type, dates, prices)
- Split houses vs units (using property_unit_number or strata_lot_number)
- Calculate contract_to_settlement_days from dates
- Add quarter information (year, quarter, quarter_start) from settlement_date
- Create quarterly aggregations per suburb (num_sales, median/mean/min/max prices, percentiles, contract-to-settlement metrics)
- Calculate derived metrics (fast_settlements_percentage, liquidity_score)
- Prepare data for `suburb_quarterly` table structure

**Reference:** Check `src/db/schema.sql` for the exact table structure you need to match.

**Important:**
- I made a mistake with the liquidity score: it is meaningless here. Measuring liquidity requires the listing date, and this dataset only has contract and settlement dates.

In [None]:
# Import necessary libraries
# You'll need pandas, numpy, matplotlib, seaborn, Path, warnings, json, datetime
# Set up plotting style
# Suppress warnings


In [None]:
# Load Sydney housing data from parquet file
# Hints:
# - Load from data/sydney/full.parquet
# - Print number of records, date range, shape
# - Display available columns and sample data
# - Focus on: property_locality, property_post_code, district_code, contract_date, settlement_date, purchase_price
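
The loading step above could be sketched as follows. The helper names `load_sales` and `summarize_sales` are hypothetical, and using `settlement_date` for the date range is an assumption; swap in `contract_date` if that suits your analysis better.

```python
from pathlib import Path

import pandas as pd

# Columns the hints single out as the ones to focus on.
KEY_COLUMNS = [
    "property_locality", "property_post_code", "district_code",
    "contract_date", "settlement_date", "purchase_price",
]


def load_sales(path="data/sydney/full.parquet"):
    """Read the raw sales parquet file into a DataFrame."""
    return pd.read_parquet(Path(path))


def summarize_sales(df, date_col="settlement_date"):
    """Return record count, shape, date range and which key columns exist.

    date_col is an assumption of this sketch; the hints do not say which
    date column defines the range.
    """
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return {
        "num_records": len(df),
        "shape": df.shape,
        "date_min": dates.min(),
        "date_max": dates.max(),
        "key_columns_present": [c for c in KEY_COLUMNS if c in df.columns],
    }
```

In the notebook you would call `df = load_sales()`, print `summarize_sales(df)`, and then look at `df.columns` and `df.head()` for the sample data.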


In [None]:
# Transform data to match properties table schema
# Hints:
# - Create a copy of the DataFrame
# - Map columns:
#   - suburb = property_locality (strip whitespace)
#   - postcode = property_post_code (convert to string, strip)
#   - district = district_code (convert to string, strip)
#   - contract_date = convert to datetime
#   - settlement_date = convert to datetime
#   - sale_price = purchase_price (convert to float)
# - Calculate contract_to_settlement_days: (settlement_date - contract_date).dt.days
# - Filter out invalid records: negative days, missing dates, zero prices, missing suburbs
# - Print how many records were filtered out
# - Select only columns needed for properties table
# - Display sample and info()
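
One possible shape for this transformation is sketched below. `to_properties` and its `extra_columns` parameter are hypothetical names, and the column list is taken from the goal list at the top of the lesson rather than from `src/db/schema.sql`, so check it against the real schema.

```python
import pandas as pd

# Assumed subset of the properties table; confirm against src/db/schema.sql.
PROPERTY_COLUMNS = [
    "suburb", "postcode", "district", "contract_date",
    "settlement_date", "sale_price", "contract_to_settlement_days",
]


def to_properties(raw, extra_columns=()):
    """Map raw columns to the properties layout and drop invalid rows.

    Returns (clean_df, number_of_rows_dropped).
    """
    df = raw.copy()
    df["suburb"] = df["property_locality"].astype("string").str.strip()
    # Note: a float postcode such as 2000.0 would become "2000.0" here.
    df["postcode"] = df["property_post_code"].astype("string").str.strip()
    df["district"] = df["district_code"].astype("string").str.strip()
    df["contract_date"] = pd.to_datetime(df["contract_date"], errors="coerce")
    df["settlement_date"] = pd.to_datetime(df["settlement_date"], errors="coerce")
    df["sale_price"] = pd.to_numeric(df["purchase_price"], errors="coerce").astype(float)
    df["contract_to_settlement_days"] = (
        df["settlement_date"] - df["contract_date"]
    ).dt.days

    # Keep rows with both dates, non-negative days, a positive price, a suburb.
    valid = (
        df["contract_date"].notna()
        & df["settlement_date"].notna()
        & (df["contract_to_settlement_days"] >= 0)
        & (df["sale_price"] > 0)
        & (df["suburb"].fillna("") != "")
    ).astype(bool)

    columns = PROPERTY_COLUMNS + [c for c in extra_columns if c in df.columns]
    clean = df.loc[valid, columns].reset_index(drop=True)
    # Safe to cast: rows with missing dates were removed above.
    clean["contract_to_settlement_days"] = clean["contract_to_settlement_days"].astype(int)
    return clean, len(df) - int(valid.sum())
```

If you split houses from units after this step, pass `extra_columns=("property_unit_number", "strata_lot_number")` so those columns survive the column selection.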


In [None]:
# Split houses vs units
# Hints:
# - Inspect property_unit_number and strata_lot_number columns
# - Units typically have: non-empty unit_number OR non-empty strata_lot_number
# - Houses have neither (both are empty/null)
# - Handle NaN and empty strings properly (convert to string, check for 'nan' and '')
# - Create boolean masks: has_unit_num and has_strata
# - Units mask = has_unit_num OR has_strata
# - Houses mask = NOT units_mask
# - Verify the split adds up correctly
# - Print counts and percentages
# - Add property_type column: 'house' or 'unit'
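
The masks described above might look like this. `is_present` and `add_property_type` are hypothetical helper names, and treating the literal string `'none'` as empty is an extra assumption beyond the `'nan'` and `''` checks the hints mention.

```python
import numpy as np
import pandas as pd


def is_present(col):
    """True where a value is not null, not blank, and not the text 'nan'."""
    text = col.astype("string").str.strip().str.lower()
    return text.notna() & ~text.isin(["", "nan", "none"])


def add_property_type(df, unit_col="property_unit_number",
                      strata_col="strata_lot_number"):
    """Label each row 'unit' or 'house' from the unit and strata columns."""
    out = df.copy()
    has_unit_num = is_present(out[unit_col])
    has_strata = is_present(out[strata_col])
    units_mask = has_unit_num | has_strata
    houses_mask = ~units_mask
    # Sanity check: every row falls into exactly one of the two groups.
    assert int(units_mask.sum()) + int(houses_mask.sum()) == len(out)
    out["property_type"] = np.where(units_mask, "unit", "house")
    return out
```

`out["property_type"].value_counts()` gives the counts, and `value_counts(normalize=True) * 100` gives the percentages to print.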


In [None]:
# Add quarter information
# Hints:
# - Extract year from settlement_date: .dt.year
# - Extract quarter from settlement_date: .dt.quarter (returns 1, 2, 3, or 4)
# - Calculate quarter_start: convert to period 'Q', then get start_time
#   Example: df['settlement_date'].dt.to_period('Q').dt.start_time
# - Display sample with suburb, settlement_date, year, quarter, quarter_start
# - Print date range and quarters covered
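
A minimal sketch of the quarter columns, wrapping the `to_period('Q')` hint in a hypothetical helper named `add_quarter_columns`:

```python
import pandas as pd


def add_quarter_columns(df, date_col="settlement_date"):
    """Add year, quarter (1-4) and quarter_start derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["quarter"] = dates.dt.quarter
    # Convert to a quarterly period, then take the first instant of it.
    out["quarter_start"] = dates.dt.to_period("Q").dt.start_time
    return out
```

For the coverage printout, `out["quarter_start"].min()`, `out["quarter_start"].max()` and `out["quarter_start"].nunique()` give the range and the number of quarters.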


In [None]:
# Create quarterly aggregations per suburb
# Hints:
# - Group by: suburb, year, quarter, quarter_start
# - For sale_price, calculate: count, median, mean, min, max, std, 25th percentile, 75th percentile
# - For contract_to_settlement_days, calculate: median, mean
# - Use .agg() with named aggregations or list of tuples
# - Flatten column names if needed (they'll be multi-level after groupby)
# - Reset index to get suburb, year, quarter, quarter_start as columns
# - Print sample and summary statistics
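
Named aggregations avoid the multi-level column flattening altogether. The sketch below assumes output names such as `median_price` and `p25_price`; these are placeholders, so rename them to whatever `src/db/schema.sql` actually uses.

```python
import pandas as pd


def q25(s):
    """25th percentile of a Series."""
    return s.quantile(0.25)


def q75(s):
    """75th percentile of a Series."""
    return s.quantile(0.75)


def quarterly_stats(df):
    """Aggregate sale prices and settlement days per suburb and quarter."""
    keys = ["suburb", "year", "quarter", "quarter_start"]
    return (
        df.groupby(keys)
        .agg(
            num_sales=("sale_price", "count"),
            median_price=("sale_price", "median"),
            mean_price=("sale_price", "mean"),
            min_price=("sale_price", "min"),
            max_price=("sale_price", "max"),
            std_price=("sale_price", "std"),
            p25_price=("sale_price", q25),
            p75_price=("sale_price", q75),
            median_days=("contract_to_settlement_days", "median"),
            mean_days=("contract_to_settlement_days", "mean"),
        )
        .reset_index()  # bring the group keys back as ordinary columns
    )
```

Groups with a single sale get `NaN` for `std_price`, which is expected for a sample standard deviation.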


In [None]:
# Calculate fast settlements percentage
# Hints:
# - Typical settlement period for NSW is 42 days
# - Filter properties where contract_to_settlement_days <= 42
# - Group by suburb, year, quarter and count fast settlements
# - Also count total sales per suburb/year/quarter
# - Merge the two counts
# - Calculate percentage: (fast_settlement_count / total_sales_count) * 100
# - Handle division by zero (fillna(0))
# - Merge this back into your quarterly stats DataFrame
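
The count, merge and divide steps above can be sketched as below. `fast_settlement_pct` is a hypothetical helper name, and the left merge is a design choice so that suburbs with no fast settlements still appear with 0%.

```python
import pandas as pd

FAST_SETTLEMENT_DAYS = 42  # threshold taken from the hint above


def fast_settlement_pct(props, keys=("suburb", "year", "quarter")):
    """Share of sales per group that settled within FAST_SETTLEMENT_DAYS."""
    keys = list(keys)
    total = props.groupby(keys).size().reset_index(name="total_sales_count")
    fast = (
        props[props["contract_to_settlement_days"] <= FAST_SETTLEMENT_DAYS]
        .groupby(keys)
        .size()
        .reset_index(name="fast_settlement_count")
    )
    # Left merge keeps groups that had no fast settlements at all.
    merged = total.merge(fast, on=keys, how="left")
    merged["fast_settlement_count"] = (
        merged["fast_settlement_count"].fillna(0).astype(int)
    )
    merged["fast_settlements_percentage"] = (
        merged["fast_settlement_count"] / merged["total_sales_count"] * 100
    ).fillna(0)
    return merged
```

To attach the result to the quarterly stats, merge on the same keys, for example `stats.merge(pct[["suburb", "year", "quarter", "fast_settlements_percentage"]], on=["suburb", "year", "quarter"], how="left")`.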


In [None]:
# Calculate liquidity score (optional advanced metric)
# Hints:
# - Liquidity score combines volume and speed
# - Normalize volume: divide by max volume (0-1 scale)
# - Normalize speed: fast_settlements_percentage / 100 (0-1 scale, higher = faster)
# - Combine: volume_score * 0.6 + speed_score * 0.4
# - Add this as a new column to quarterly stats
# - This gives a composite score where higher = more liquid market
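
A sketch of the composite score follows. It assumes the speed component is `fast_settlements_percentage / 100`, so that a higher value means faster settlement, and it assumes the volume column is called `num_sales`. As the note at the top of the lesson says, without listing dates this is not a true liquidity measure, so treat it purely as an exercise.

```python
import pandas as pd


def add_liquidity_score(stats, volume_weight=0.6, speed_weight=0.4):
    """Add a weighted volume-plus-speed score in the range 0-1."""
    out = stats.copy()
    max_volume = out["num_sales"].max()
    # Guard against an empty frame or all-zero volumes.
    if max_volume > 0:
        volume_score = out["num_sales"] / max_volume
    else:
        volume_score = 0.0
    # Assumption of this sketch: higher fast-settlement share = higher speed.
    speed_score = out["fast_settlements_percentage"] / 100
    out["liquidity_score"] = (
        volume_score * volume_weight + speed_score * speed_weight
    )
    return out
```

Because volume is normalized by the maximum across the whole frame, the scores are only comparable within the same run of data.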


In [None]:
# Prepare data for database insertion
# Hints:
# - Review the suburb_quarterly table schema in src/db/schema.sql
# - Ensure column names match exactly (rename if needed)
# - Ensure data types match (REAL for floats, INTEGER for counts, DATE for dates, TEXT for strings)
# - Add any missing columns with None/NULL values
# - Split quarterly stats by property_type if needed
# - Save to parquet files for next lesson (optional, or you can regenerate)
# - Print final shape and sample of prepared data
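
One way to line a DataFrame up with a table definition is sketched below. `EXAMPLE_SCHEMA` is a made-up subset for illustration only; the real column names and types must come from `src/db/schema.sql`. `conform_to_schema` is likewise a hypothetical helper.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL subset of suburb_quarterly; replace with the real schema.
EXAMPLE_SCHEMA = {
    "suburb": "TEXT",
    "property_type": "TEXT",
    "year": "INTEGER",
    "quarter": "INTEGER",
    "quarter_start": "DATE",
    "num_sales": "INTEGER",
    "median_price": "REAL",
}


def conform_to_schema(df, schema):
    """Return df with exactly the schema's columns, order and dtypes."""
    out = pd.DataFrame(index=df.index)
    for col, sql_type in schema.items():
        if col in df.columns:
            values = df[col]
        else:
            # Missing column: fill with nulls so the insert writes NULL.
            values = pd.Series(np.nan, index=df.index)
        if sql_type == "INTEGER":
            out[col] = pd.to_numeric(values, errors="coerce").astype("Int64")
        elif sql_type == "REAL":
            out[col] = pd.to_numeric(values, errors="coerce").astype("float64")
        elif sql_type == "DATE":
            out[col] = pd.to_datetime(values, errors="coerce")
        else:  # TEXT
            out[col] = values.astype("string")
    return out
```

To split by property type, include `property_type` in the groupby keys when building the quarterly stats, then conform the result. Saving is a single call such as `prepared.to_parquet(path)`, with a path of your choosing.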
