# Client Data Update ETL - Test Notebook

**Purpose**: Test the client_data_update_etl pipeline before deploying to Airflow

**Pipeline Steps**:
1. Extract new pool listings (last 2 days)
2. Extract removed pool listings (last 2 days)
3. Match existing properties in master DB
4. Transform - Create property/pool objects
5. Transform - Mark removed listings
6. Load properties to master DB
7. Load pools to master DB
8. Load listings to master DB
9. Update removed listings in master DB

---

## Setup & Imports

In [1]:
import sys
import os
from datetime import datetime, timezone
import tempfile

# Add airflow directory to path
sys.path.insert(0, '/home/james/PDS/client_data_feeds/realestate/airflow')

import pandas as pd

# Import pipeline functions
from include.extract.client_data_extract import (
    extract_new_pool_listings,
    extract_removed_pool_listings,
)
from include.transform.client_data_transform import (
    match_existing_properties,
    transform_create_property_pool_objects,
    transform_mark_removed_listings,
)
from include.load.client_data_load import (
    upsert_properties,
    upsert_pools,
    upsert_listings,
    update_removed_listings,
)

print("✓ Imports successful")

# Create working directory
workdir = os.path.join(tempfile.gettempdir(), f"client_data_test_{datetime.now().strftime('%Y%m%d_%H%M%S')}")
os.makedirs(workdir, exist_ok=True)
print(f"✓ Working directory: {workdir}")

✓ Imports successful
✓ Working directory: /tmp/client_data_test_20260128_153428


## Step 1: Extract New Pool Listings

Extract all listings added in the last 2 days that have pools.
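
The internals of `extract_new_pool_listings` are not shown in this notebook. As a rough illustration of what a 2-day window could look like, the sketch below runs a parameterized query against an in-memory SQLite table. The table name `listings_raw`, the sample rows, and the helper `fetch_recent_pool_listings` are hypothetical; the real function presumably queries the client feed database and writes a parquet file.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def fetch_recent_pool_listings(conn, now, days=2):
    """Return (mls_id, pool_type) rows collected within the last `days` days."""
    # Timestamps are stored as ISO-like text, so string comparison orders them correctly.
    cutoff_date = (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    query = (
        "SELECT mls_id, pool_type FROM listings_raw "
        "WHERE date_collected >= ? AND pool_type IS NOT NULL "
        "ORDER BY mls_id"
    )
    return conn.execute(query, (cutoff_date,)).fetchall()


# Hypothetical in-memory table standing in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings_raw (mls_id TEXT, pool_type TEXT, date_collected TEXT)"
)
now = datetime(2026, 1, 28, 15, 0, 0, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO listings_raw VALUES (?, ?, ?)",
    [
        ("A1", "inground", "2026-01-27 09:00:00"),      # inside the window, has a pool
        ("A2", None, "2026-01-27 09:00:00"),            # inside the window, no pool
        ("A3", "above_ground", "2026-01-20 09:00:00"),  # older than the cutoff
    ],
)
recent = fetch_recent_pool_listings(conn, now)
print(recent)
```

Passing the cutoff as a bound parameter, rather than formatting it into the SQL string, keeps the query safe to reuse with other window sizes.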

In [2]:
result = extract_new_pool_listings(workdir)

print(f"\n✓ Extracted {result['row_count']} new pool listings")
print(f"✓ Saved to: {result['parquet_path']}")

# Load and inspect
if result['row_count'] > 0:
    new_pools_df = pd.read_parquet(result['parquet_path'])
    print(f"\nColumns: {list(new_pools_df.columns)}")
    print(f"\nSample data:")
    print(new_pools_df.head())
    print(f"\nPool type distribution:")
    print(new_pools_df['pool_type'].value_counts())
else:
    print("\n⚠️  No new pool listings found")




✓ Extracted 1007 new pool listings
✓ Saved to: /tmp/client_data_test_20260128_153428/new_pool_listings.parquet

Columns: ['mls_id', 'address_number', 'street_name', 'municipality', 'province_state', 'postal_code', 'lat', 'lon', 'date_collected', 'description', 'bedrooms', 'bathrooms', 'size_sqft', 'stories', 'house_cat', 'price', 'pool_type', 'pool_mentioned']

Sample data:
      mls_id  address_number          street_name          municipality  \
0  X12316017            2229       HOUCK CRESCENT     Fort Erie (Bowen)   
1   R3072849           20510           48A AVENUE               Langley   
2  X12507288             699   SEVILLA PARK PLACE  London East (East C)   
3  X12601652              77         SEVERN DRIVE  Guelph (Grange Road)   
4   R3080204             822  34909 OLD YALE ROAD            Abbotsford   

     province_state postal_code        lat         lon  \
0           Ontario      L2A5M4  42.948261  -78.965556   
1  British Columbia      V3A3P3  49.090194 -122.655

## Step 2: Extract Removed Pool Listings

Extract all listings removed in the last 2 days that have pools.

In [3]:
result_removed = extract_removed_pool_listings(workdir)

print(f"\n✓ Extracted {result_removed['row_count']} removed pool listings")
print(f"✓ Saved to: {result_removed['parquet_path']}")

# Load and inspect
if result_removed['row_count'] > 0:
    removed_pools_df = pd.read_parquet(result_removed['parquet_path'])
    print(f"\nColumns: {list(removed_pools_df.columns)}")
    print(f"\nSample data:")
    print(removed_pools_df.head())
else:
    print("\n⚠️  No removed pool listings found")


✓ Extracted 196 removed pool listings
✓ Saved to: /tmp/client_data_test_20260128_153428/removed_pool_listings.parquet

Columns: ['mls_id', 'address_number', 'street_name', 'municipality', 'province_state', 'postal_code', 'lat', 'lon', 'pool_type', 'removal_date']

Sample data:
      mls_id  address_number             street_name  \
0   40670516            8036          SHERIDAN Court   
1   40790101             150           BENZIGER Lane   
2  X12714248              49     66 - 49 RHONDA ROAD   
3  X12407922            1501  228 - 1501 LINE 8 ROAD   
4   NB128674              62    60-62 Ashworth Drive   

                                municipality province_state postal_code  \
0                                    Grassie        Ontario      L0R1M0   
1                               Stoney Creek        Ontario      L8E6G6   
2  Guelph (Willow West/Sugarbush/West Acres)        Ontario      N1H7A4   
3            Niagara-on-the-Lake (Queenston)        Ontario      L0S1L0   
4    



## Step 3: Match Existing Properties

Match new and removed listings against existing properties in master DB.
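
How `match_existing_properties` decides that two rows refer to the same property is not shown here. One plausible approach, suggested by the integer `address_id` column in the Step 4 output, is to derive a key from normalized address fields and look it up among existing properties. Everything in the sketch below is an assumption: `address_key`, the normalization rules, the hashing scheme, and the sample addresses are hypothetical and are not the pipeline's actual logic.

```python
import hashlib


def address_key(address_number, street_name, postal_code):
    """Build a stable integer key from normalized address parts (hypothetical scheme)."""
    parts = [
        str(address_number),
        " ".join(street_name.upper().split()),  # collapse whitespace, ignore case
        postal_code.replace(" ", "").upper(),   # "a1a 1a1" and "A1A1A1" compare equal
    ]
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).digest()
    # Keep 63 bits so the key fits in a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") >> 1


def match_listings(listings, existing):
    """Attach property_id (or None when unmatched) to each listing."""
    matched = []
    for row in listings:
        key = address_key(row["address_number"], row["street_name"], row["postal_code"])
        matched.append({**row, "property_id": existing.get(key)})
    return matched


# Made-up sample data: one existing property, two incoming listings.
existing = {address_key(1, "Example Street", "A1A 1A1"): "prop-001"}
listings = [
    {"mls_id": "M1", "address_number": 1, "street_name": "EXAMPLE  STREET", "postal_code": "a1a1a1"},
    {"mls_id": "M2", "address_number": 2, "street_name": "OTHER ROAD", "postal_code": "B2B2B2"},
]
matched_rows = match_listings(listings, existing)
for row in matched_rows:
    print(row["mls_id"], row["property_id"])
```

With a scheme like this, a run that reports zero matches means either that none of the incoming addresses exist in the master DB yet or that the two sides normalize addresses differently, so it is worth spot-checking a few known addresses by hand.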

In [4]:
match_result = match_existing_properties(
    result['parquet_path'],
    result_removed['parquet_path'],
    workdir
)

print(f"\n✓ New listings matched: {match_result['new_matched_count']}")
print(f"✓ Removed listings matched: {match_result['removed_matched_count']}")

# Load and inspect matched data
new_matched_df = pd.read_parquet(match_result['new_matched_path'])
removed_matched_df = pd.read_parquet(match_result['removed_matched_path'])

print(f"\nNew listings with property match: {new_matched_df['property_id'].notna().sum()}")
print(f"New listings without property match: {new_matched_df['property_id'].isna().sum()}")

print(f"\nRemoved listings with property match: {removed_matched_df['property_id'].notna().sum()}")
print(f"Removed listings without property match: {removed_matched_df['property_id'].isna().sum()}")

Removed listing 40670516 has no property match
Removed listing 40790101 has no property match
Removed listing X12714248 has no property match
Removed listing X12407922 has no property match
Removed listing NB128674 has no property match
Removed listing 40756588 has no property match
Removed listing 40761729 has no property match
Removed listing 40764452 has no property match
Removed listing 40766023 has no property match
Removed listing 40769220 has no property match
Removed listing R3071010 has no property match
Removed listing 40782535 has no property match
Removed listing R3075558 has no property match
Removed listing 40783881 has no property match
Removed listing 40785063 has no property match
Removed listing 40786193 has no property match
Removed listing 40787819 has no property match
Removed listing W12385750 has no property match
Removed listing NB129368 has no property match
Removed listing R2990216 has no property match
Removed listing R3068488 has no property match
Removed li


✓ New listings matched: 0
✓ Removed listings matched: 0

New listings with property match: 0
New listings without property match: 1007

Removed listings with property match: 0
Removed listings without property match: 196


## Step 4: Transform - Create Property/Pool Objects

Create property and pool records for new listings without existing properties.
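
As an illustration of this step, the sketch below splits each unmatched listing row into a property record, a pool record, and a listing record. The column names come from the Step 1 output; the helper names and the use of `mls_id` as a temporary join key between the three records are assumptions, not the behavior of `transform_create_property_pool_objects`.

```python
PROPERTY_FIELDS = (
    "address_number", "street_name", "municipality",
    "province_state", "postal_code", "lat", "lon",
)
LISTING_FIELDS = ("price", "bedrooms", "bathrooms")


def split_row(row):
    """Split one listing row into property, pool, and listing dicts."""
    link = {"mls_id": row["mls_id"]}  # hypothetical join key shared by all three
    prop = {**link, **{f: row.get(f) for f in PROPERTY_FIELDS}}
    pool = {**link, "pool_type": row.get("pool_type")}
    listing = {**link, **{f: row.get(f) for f in LISTING_FIELDS}}
    return prop, pool, listing


def build_objects(rows):
    """Build record lists only for rows that have no property_id yet."""
    properties, pools, listings = [], [], []
    for row in rows:
        if row.get("property_id") is not None:
            continue  # already matched, so no new property is needed
        prop, pool, listing = split_row(row)
        properties.append(prop)
        pools.append(pool)
        listings.append(listing)
    return properties, pools, listings


# Made-up sample rows: one unmatched, one already matched.
rows = [
    {"mls_id": "M1", "street_name": "EXAMPLE STREET", "pool_type": "inground",
     "price": 500000, "property_id": None},
    {"mls_id": "M2", "street_name": "OTHER ROAD", "pool_type": "above_ground",
     "price": 400000, "property_id": "prop-001"},
]
properties, pools, listings = build_objects(rows)
print(len(properties), len(pools), len(listings))
```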

In [5]:
transform_new_result = transform_create_property_pool_objects(
    match_result['new_matched_path'],
    workdir
)

print(f"\n✓ Properties to insert: {transform_new_result['property_count']}")
print(f"✓ Properties file: {transform_new_result['properties_path']}")
print(f"✓ Pools file: {transform_new_result['pools_path']}")
print(f"✓ Listings file: {transform_new_result['listings_path']}")

# Inspect created objects
if transform_new_result['property_count'] > 0:
    properties_df = pd.read_parquet(transform_new_result['properties_path'])
    pools_df = pd.read_parquet(transform_new_result['pools_path'])
    listings_df = pd.read_parquet(transform_new_result['listings_path'])
    
    print(f"\nProperties sample:")
    print(properties_df.head())
    
    print(f"\nPools sample:")
    print(pools_df.head())
    
    print(f"\nListings sample:")
    print(listings_df[['mls_id', 'price', 'bedrooms', 'bathrooms']].head())


✓ Properties to insert: 1007
✓ Properties file: /tmp/client_data_test_20260128_153428/properties_to_insert.parquet
✓ Pools file: /tmp/client_data_test_20260128_153428/pools_to_insert.parquet
✓ Listings file: /tmp/client_data_test_20260128_153428/listings_to_insert.parquet

Properties sample:
            address_id  address_number          street_name  \
0  8848708115112578671            2229       HOUCK CRESCENT   
1  4417291561433416372           20510           48A AVENUE   
2   661525098263364484             699   SEVILLA PARK PLACE   
3  1429104246350380139              77         SEVERN DRIVE   
4  7821956053121296324             822  34909 OLD YALE ROAD   

           municipality    province_state postal_code        lat         lon  \
0     Fort Erie (Bowen)           Ontario      L2A5M4  42.948261  -78.965556   
1               Langley  British Columbia      V3A3P3  49.090194 -122.655384   
2  London East (East C)           Ontario      N5Y4H9  43.012375  -81.235392   

## Step 5: Transform - Mark Removed Listings

Prepare updates for removed listings.
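
The output of this step shows that removed listings without a property match are skipped. A minimal sketch of that filtering, assuming each update record carries `mls_id`, an `is_removed` flag, and the `removal_date` from the extract, might look like the following. The function name and record shape are hypothetical.

```python
def build_removal_updates(rows):
    """Return (updates, skipped_count) for a list of removed-listing rows."""
    updates = []
    skipped = 0
    for row in rows:
        if row.get("property_id") is None:
            skipped += 1  # no matching property, so there is nothing to update
            continue
        updates.append(
            {
                "mls_id": row["mls_id"],
                "is_removed": True,
                "removal_date": row["removal_date"],
            }
        )
    return updates, skipped


# Made-up sample rows: one matched, one unmatched.
rows = [
    {"mls_id": "M1", "property_id": "prop-001", "removal_date": "2026-01-27"},
    {"mls_id": "M2", "property_id": None, "removal_date": "2026-01-27"},
]
updates, skipped = build_removal_updates(rows)
print(f"{len(updates)} to update, {skipped} skipped")
```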

In [6]:
transform_removed_result = transform_mark_removed_listings(
    match_result['removed_matched_path'],
    workdir
)

print(f"\n✓ Listings to mark as removed: {transform_removed_result['update_count']}")
print(f"✓ Updates file: {transform_removed_result['updates_path']}")

# Inspect updates
if transform_removed_result['update_count'] > 0:
    updates_df = pd.read_parquet(transform_removed_result['updates_path'])
    print(f"\nUpdates sample:")
    print(updates_df.head())

196 removed listings have no property match - skipping



✓ Listings to mark as removed: 0
✓ Updates file: /tmp/client_data_test_20260128_153428/listings_to_update.parquet


## Step 6: Load Properties to Master DB

Upsert properties to master database.
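
For readers unfamiliar with upserts, the sketch below shows the general insert-or-update pattern using `INSERT ... ON CONFLICT`. SQLite is used only so the example runs without a database server (it needs SQLite 3.24 or newer); the table layout, the unique `address_id` constraint, and the helper `upsert_property` are assumptions and may differ from what `upsert_properties` does against the master DB.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE properties ("
    "property_id TEXT PRIMARY KEY, "
    "address_id INTEGER UNIQUE, "
    "municipality TEXT)"
)


def upsert_property(conn, address_id, municipality):
    """Insert a property, or update it if address_id exists; return its property_id."""
    conn.execute(
        "INSERT INTO properties (property_id, address_id, municipality) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT(address_id) DO UPDATE SET municipality = excluded.municipality",
        (str(uuid.uuid4()), address_id, municipality),
    )
    # On conflict the freshly generated UUID is discarded, so read back the stored one.
    row = conn.execute(
        "SELECT property_id FROM properties WHERE address_id = ?", (address_id,)
    ).fetchone()
    return row[0]


first_id = upsert_property(conn, 1001, "Langley")
second_id = upsert_property(conn, 1001, "Langley (updated)")
total = conn.execute("SELECT COUNT(*) FROM properties").fetchone()[0]
stored = conn.execute("SELECT municipality FROM properties").fetchone()[0]
print(first_id == second_id, total, stored)
```

Running the same load twice should leave the row count unchanged, which is the property that makes re-running a test notebook like this one safe.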

In [7]:
load_properties_result = upsert_properties(transform_new_result['properties_path'])

property_records = load_properties_result['property_records']

print(f"\n✓ Properties upserted: {len(property_records)}")
print("✓ Property records created")

if property_records:
    # Show at most five sample records
    print("\nSample property records:")
    for rec in property_records[:5]:
        print(f"  Property ID: {rec['property_id']}")


✓ Properties upserted: 1007
✓ Property records created

Sample property records:
  Property ID: 12db370f-b11d-46ba-8bed-e6e2a0f03236
  Property ID: d934fbc4-a0c4-4905-99da-3f4ae1ff906e
  Property ID: e2bd4e9a-2ef1-4254-a0c7-fc91de8912fc
  Property ID: f70a0460-0bdc-467f-ac16-7b48acb10b00
  Property ID: 7c9921e2-35c9-4766-81f6-2a73298d4877


## Step 7: Load Pools to Master DB

Upsert pools to master database.
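
Both `upsert_pools` and `upsert_listings` receive `property_records`, presumably so that each child row can be linked to its parent property. A minimal sketch of that linking step follows. It assumes each property record carries an `address_id` alongside its `property_id`, which this notebook does not confirm; the helper `attach_property_ids` is hypothetical.

```python
def attach_property_ids(rows, property_records, key="address_id"):
    """Add property_id to each row; return (attached_rows, orphan_rows)."""
    lookup = {rec[key]: rec["property_id"] for rec in property_records}
    attached, orphans = [], []
    for row in rows:
        property_id = lookup.get(row[key])
        if property_id is None:
            orphans.append(row)  # no parent property; do not load this row
        else:
            attached.append({**row, "property_id": property_id})
    return attached, orphans


# Made-up sample data.
property_records = [{"address_id": 1001, "property_id": "prop-001"}]
pool_rows = [
    {"address_id": 1001, "pool_type": "inground"},
    {"address_id": 2002, "pool_type": "above_ground"},
]
attached, orphans = attach_property_ids(pool_rows, property_records)
print(len(attached), len(orphans))
```

Collecting orphans separately, instead of dropping them silently, makes it easier to explain a gap between the number of properties and the number of pools in the database.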

In [8]:
load_pools_result = upsert_pools(
    transform_new_result['pools_path'],
    property_records
)

print(f"\n✓ Pools upserted: {load_pools_result['pool_count']}")


✓ Pools upserted: 1007


## Step 8: Load Listings to Master DB

Upsert listings to master database.

In [9]:
load_listings_result = upsert_listings(
    transform_new_result['listings_path'],
    property_records
)

print(f"\n✓ Listings upserted: {load_listings_result['listing_count']}")


✓ Listings upserted: 1007


## Step 9: Update Removed Listings

Flag removed listings as removed in master database.
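
The update itself can be sketched as a parameterized `UPDATE` whose row count reports how many listings actually changed. The `is_removed` column appears in the verification query at the end of this notebook; the rest of the table layout, the `removal_date` column on `listings`, and the helper `mark_removed` are assumptions, and SQLite stands in for the master DB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "mls_id TEXT PRIMARY KEY, is_removed INTEGER DEFAULT 0, removal_date TEXT)"
)
conn.executemany("INSERT INTO listings (mls_id) VALUES (?)", [("M1",), ("M2",)])


def mark_removed(conn, updates):
    """Flag listings as removed; return how many rows were actually changed."""
    changed = 0
    for upd in updates:
        cur = conn.execute(
            "UPDATE listings SET is_removed = 1, removal_date = ? "
            "WHERE mls_id = ? AND is_removed = 0",
            (upd["removal_date"], upd["mls_id"]),
        )
        changed += cur.rowcount  # 0 when the listing is unknown or already removed
    return changed


updates = [
    {"mls_id": "M1", "removal_date": "2026-01-27"},
    {"mls_id": "M404", "removal_date": "2026-01-27"},  # not in the table
]
first_run = mark_removed(conn, updates)
second_run = mark_removed(conn, updates)
print(first_run, second_run)
```

An empty updates list, as in the run recorded here, simply yields a count of 0.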

In [10]:
update_removed_result = update_removed_listings(transform_removed_result['updates_path'])

print(f"\n✓ Listings marked as removed: {update_removed_result['update_count']}")


✓ Listings marked as removed: 0


## Summary & Verification

Check results in master database.
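
Beyond eyeballing the printed totals, the per-step counts can be cross-checked programmatically. The sketch below encodes two invariants that hold for this run but are assumptions about the pipeline in general: every unmatched new listing yields one property, and every new property yields one pool. The function `check_counts` and the dictionary keys are hypothetical; the numbers are the ones reported by the steps above.

```python
def check_counts(summary):
    """Return a list of discrepancy messages; an empty list means all checks passed."""
    problems = []
    expected_properties = summary["new_extracted"] - summary["new_matched"]
    if summary["properties_upserted"] != expected_properties:
        problems.append(
            f"expected {expected_properties} properties, "
            f"got {summary['properties_upserted']}"
        )
    if summary["pools_upserted"] != summary["properties_upserted"]:
        problems.append(
            f"pools ({summary['pools_upserted']}) != "
            f"properties ({summary['properties_upserted']})"
        )
    return problems


# Counts reported by this run.
run = {
    "new_extracted": 1007,
    "new_matched": 0,
    "properties_upserted": 1007,
    "pools_upserted": 1007,
}
problems = check_counts(run)
print("OK" if not problems else "\n".join(problems))
```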

In [11]:
from include.db.connections import get_master_db_connection

print("="*60)
print("PIPELINE EXECUTION SUMMARY")
print("="*60)

print(f"\n📊 Extraction:")
print(f"  • New pool listings: {result['row_count']}")
print(f"  • Removed pool listings: {result_removed['row_count']}")

print(f"\n🔍 Matching:")
print(f"  • New listings matched: {match_result['new_matched_count']}")
print(f"  • New listings unmatched: {transform_new_result['property_count']}")
print(f"  • Removed listings matched: {match_result['removed_matched_count']}")

print(f"\n💾 Database Updates:")
print(f"  • Properties upserted: {len(property_records)}")
print(f"  • Pools upserted: {load_pools_result['pool_count']}")
print(f"  • Listings upserted: {load_listings_result['listing_count']}")
print(f"  • Listings marked removed: {update_removed_result['update_count']}")

print(f"\n🗂️  Working directory: {workdir}")

# Verify in database
print("\n" + "="*60)
print("DATABASE VERIFICATION")
print("="*60)

conn = get_master_db_connection()
try:
    with conn.cursor() as cur:
        # Count properties
        cur.execute("SELECT COUNT(*) FROM properties;")
        print(f"\nTotal properties in DB: {cur.fetchone()[0]:,}")
        
        # Count pools
        cur.execute("SELECT COUNT(*) FROM pools;")
        print(f"Total pools in DB: {cur.fetchone()[0]:,}")
        
        # Count listings
        cur.execute("SELECT COUNT(*) FROM listings;")
        total_listings = cur.fetchone()[0]
        print(f"Total listings in DB: {total_listings:,}")
        
        # Count removed listings
        cur.execute("SELECT COUNT(*) FROM listings WHERE is_removed = true;")
        removed_listings = cur.fetchone()[0]
        print(f"Removed listings: {removed_listings:,}")
        print(f"Active listings: {total_listings - removed_listings:,}")
        
finally:
    conn.close()

print("\n" + "="*60)
print("✅ PIPELINE TEST COMPLETE")
print("="*60)

PIPELINE EXECUTION SUMMARY

📊 Extraction:
  • New pool listings: 1007
  • Removed pool listings: 196

🔍 Matching:
  • New listings matched: 0
  • New listings unmatched: 1007
  • Removed listings matched: 0

💾 Database Updates:
  • Properties upserted: 1007
  • Pools upserted: 1007
  • Listings upserted: 1007
  • Listings marked removed: 0

🗂️  Working directory: /tmp/client_data_test_20260128_153428

DATABASE VERIFICATION

Total properties in DB: 23,533
Total pools in DB: 23,443
Total listings in DB: 1,034
Removed listings: 13
Active listings: 1,021

✅ PIPELINE TEST COMPLETE
