# Dataset ingestion

This Jupyter notebook ingests the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) from [data.gov.au](https://data.gov.au). It also downloads the [land values for NSW][nswlv] and the [ABS shapefiles][abssf].

It loads all of this data into a PostgreSQL database in a Docker container, treating it like a disposable SQLite data store.

Here we ingest all the data needed to assess land by its land value and to filter it by address information.

### The Steps

1. Download static assets and datasets.
2. Set up a Docker container running PostgreSQL with GIS capabilities.
3. Ingest the [ABS shape files][abssf].
4. Ingest the latest [NSW Valuer General land values][nswlv].
5. Ingest the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) dataset.
6. Link the NSW Valuer General data with the GNAF dataset.

[gnaf]: https://data.gov.au/data/dataset/geocoded-national-address-file-g-naf
[nswlv]: https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php
[abssf]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### Note

- Make sure docker is running first.

### Warning

Do not point this notebook at another database unless you have taken the time to adapt it, because it drops the existing database. I suggest instead taking what you need from this script and disregarding the rest. DO NOT USE DATABASE CREDENTIALS HERE FOR ANY OTHER STORE (especially anything with drop permissions).

It also executes SQL from a zip file downloaded from an external source.


## Configuration

These are the fields to change if you wish to configure how the data is ingested.

In [1]:
from lib import notebook_constants as nc

# If you mark this as true, the table `nsw_valuer_general.raw_entries`
# will be dropped. If you have space limitations and no desire to debug
# the data, then dropping it makes sense. If you wish to debug some values,
# then keeping it around may be worthwhile.
GLOBAL_FLAGS = {
    'drop_raw_nsw_valuer_general_entries': True,
}

db_conf = nc.gnaf_dbconf_2
db_name = nc.gnaf_dbname_2

docker_container_name = 'gnaf_db_test'
docker_image_tag = "20240908_19_53"

## Download Static Files

Here we download the static files and fetch the most recently published land values from the Valuer General's website.
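The output of this cell prints a `Checking <file>` line for each target, which suggests files already on disk are not downloaded again. The sketch below shows one way such a check-then-fetch step could look. It is not the `StaticFileInitialiser` implementation: `fetch_if_missing` and its injected `download` callable are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(url: str, dst: Path, download: Callable[[str], bytes]) -> bool:
    """Download `url` to `dst` unless `dst` already exists.

    Returns True when a download happened and False when the
    file on disk was reused. `download` is a hypothetical callable
    that takes a URL and returns the response body as bytes.
    """
    print(f"Checking {dst.name}")
    if dst.exists():
        return False

    dst.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first, so an interrupted download is
    # never mistaken for a complete file on the next run.
    tmp = dst.with_name(dst.name + ".part")
    tmp.write_bytes(download(url))
    tmp.replace(dst)
    return True
```

Passing the downloader in as an argument keeps the caching logic easy to exercise without touching the network.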

In [2]:
from lib.gnaf.discovery import GnafPublicationDiscovery
from lib.nsw_vg.discovery import WeeklySalePriceDiscovery, AnnualSalePriceDiscovery, LandValueDiscovery
from lib.remote_resources import StaticFileInitialiser

initialiser = StaticFileInitialiser.create()

land_value = LandValueDiscovery()
w_sale_price = WeeklySalePriceDiscovery()
a_sale_price = AnnualSalePriceDiscovery()
gnaf_dis = GnafPublicationDiscovery.create()

gnaf_pub = gnaf_dis.get_publication()
if gnaf_pub:
    initialiser.add_target(gnaf_pub)

lv_target = land_value.get_latest()
if lv_target:
    initialiser.add_target(lv_target)

for sale_price_target in w_sale_price.get_links():
    initialiser.add_target(sale_price_target)

for sale_price_target in a_sale_price.get_links():
    initialiser.add_target(sale_price_target)

initialiser.setup_dirs()
initialiser.fetch_remote_resources()

Checking gnaf-2020.zip
Checking non_abs_shape.zip
Checking cities.zip
Checking g-naf_aug24_allstates_gda2020_psv_1016.zip
Checking nswvg_lv_01_Sep_2024.zip
Checking nswvg_wps_01_Jan_2024.zip
Checking nswvg_wps_08_Jan_2024.zip
Checking nswvg_wps_15_Jan_2024.zip
Checking nswvg_wps_22_Jan_2024.zip
Checking nswvg_wps_29_Jan_2024.zip
Checking nswvg_wps_05_Feb_2024.zip
Checking nswvg_wps_12_Feb_2024.zip
Checking nswvg_wps_19_Feb_2024.zip
Checking nswvg_wps_26_Feb_2024.zip
Checking nswvg_wps_04_Mar_2024.zip
Checking nswvg_wps_11_Mar_2024.zip
Checking nswvg_wps_18_Mar_2024.zip
Checking nswvg_wps_25_Mar_2024.zip
Checking nswvg_wps_01_Apr_2024.zip
Checking nswvg_wps_08_Apr_2024.zip
Checking nswvg_wps_15_Apr_2024.zip
Checking nswvg_wps_22_Apr_2024.zip
Checking nswvg_wps_29_Apr_2024.zip
Checking nswvg_wps_06_May_2024.zip
Checking nswvg_wps_13_May_2024.zip
Checking nswvg_wps_20_May_2024.zip
Checking nswvg_wps_27_May_2024.zip
Checking nswvg_wps_03_Jun_2024.zip
Checking nswvg_wps_10_Jun_2024.zip
Chec

## Create Container with Database

Here we create a Docker container from an image that builds on the postgres image and installs a few extensions.

### Note

This notebook is designed to be run more than once, so it throws away any existing container and database before creating new ones. After removing any container that uses the same identifier, it creates a new one, pulling the relevant image if it is not already installed. It then waits until the postgres instance is live before creating the database.
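Waiting for postgres to come up is essentially a polling loop. Below is a minimal sketch of that idea, not the code behind `wait_till_running`: `wait_until` and its `probe` argument are hypothetical, and in practice the probe would attempt a database connection and report whether it succeeded.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `probe` until it returns True and return the attempt count.

    Raises TimeoutError if `probe` has not succeeded once `timeout`
    seconds have elapsed. A ConnectionError from the probe is treated
    as "not ready yet" rather than as a failure.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            if probe():
                return attempts
        except ConnectionError:
            pass  # the service is not accepting connections yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {attempts} attempts")
        sleep(interval)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.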

In [3]:
from lib.gnaf_db import GnafDb, GnafContainer, GnafImage
from lib import notebook_constants as nc

image = GnafImage.create(tag=docker_image_tag)
image.prepare()

container = GnafContainer.create(container_name=docker_container_name, image=image)
container.clean()
container.prepare(db_conf, db_name)
container.start()

gnaf_db = GnafDb.create(db_conf, db_name)
gnaf_db.wait_till_running()
gnaf_db.init_schema(gnaf_pub)

running ./zip-out/g-naf_aug24_allstates_gda2020_psv_1016/G-NAF/Extras/GNAF_TableCreation_Scripts/create_tables_ansi.sql
running ./zip-out/g-naf_aug24_allstates_gda2020_psv_1016/G-NAF/Extras/GNAF_TableCreation_Scripts/add_fk_constraints.sql
running sql/move_gnaf_to_schema.sql


## Consume the ABS Shapefiles

The [ABS provides a number of shape files][all abs shape files]; we're going to focus on two main sets of shapes. The first is the **ABS Main Structures**, which covers SA1, SA2, SA3 and SA4 along with greater capital city areas (GCCSA), mesh blocks, and states. The second is the **Non ABS Main Structures**, which covers electoral divisions, suburbs, postcodes and so on.

[all abs shape files]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### ABS Main Structures 

We want to visualise any address or region we look up in the GNAF dataset. The ABS has a few different geographic groupings to visualise data against, and each address in the GNAF dataset has a mesh block ID. The mesh block is the smallest area the ABS divides addresses into, and it is the building block from which SA1, SA2, SA3 and SA4 areas are made up.

This dataset is pretty useful for visualising the GNAF data for that reason.
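Because every mesh block row carries the codes of the areas that contain it, anything keyed by mesh block can be rolled up to a coarser level with a simple lookup. The sketch below illustrates that with plain Python. The codes and address IDs are made-up placeholders, and only the `sa1_code` and `sa2_code` column names come from this notebook.

```python
from collections import Counter

# Placeholder mesh block rows keyed by mesh block code. Real codes are
# numeric strings; these values are invented for illustration.
meshblocks = {
    "MB-A": {"sa1_code": "SA1-1", "sa2_code": "SA2-1"},
    "MB-B": {"sa1_code": "SA1-2", "sa2_code": "SA2-1"},
    "MB-C": {"sa1_code": "SA1-3", "sa2_code": "SA2-2"},
}

# Placeholder (address_id, mesh_block_code) pairs.
addresses = [
    ("addr-1", "MB-A"),
    ("addr-2", "MB-B"),
    ("addr-3", "MB-C"),
    ("addr-4", "MB-Z"),  # refers to a mesh block that was not loaded
]


def count_by_level(addresses, meshblocks, level):
    """Count addresses per area at `level` (for example 'sa2_code')."""
    counts = Counter()
    for _, mb_code in addresses:
        mb = meshblocks.get(mb_code)
        if mb is None:
            continue  # skip addresses whose mesh block is unknown
        counts[mb[level]] += 1
    return counts


sa2_counts = count_by_level(addresses, meshblocks, "sa2_code")
```

In the database the same roll-up would be a join from addresses to the `meshblock` table followed by a `GROUP BY` on the level of interest.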

In [4]:
import geopandas as gpd
import pandas as pd
    
engine = gnaf_db.engine()

schema = 'abs_main_structures'
        
column_renames_for_table = {
    'SA1_2021_AUST_GDA2020': {
        'SA1_CODE_2021': 'sa1_code', 'SA2_CODE_2021': 'sa2_code', 'SA3_CODE_2021': 'sa3_code',
        'SA4_CODE_2021': 'sa4_code', 'GCCSA_CODE_2021': 'gcc_code', 'STATE_CODE_2021': 'state_code',
        'AREA_ALBERS_SQKM': 'area_sqkm', 'geometry': 'geometry'
    },
    'SA2_2021_AUST_GDA2020': {
        'SA2_CODE_2021': 'sa2_code', 'SA2_NAME_2021': 'sa2_name', 'SA3_CODE_2021': 'sa3_code',
        'SA4_CODE_2021': 'sa4_code', 'GCCSA_CODE_2021': 'gcc_code', 'STATE_CODE_2021': 'state_code',
        'AREA_ALBERS_SQKM': 'area_sqkm', 'geometry': 'geometry'
    },
    'SA3_2021_AUST_GDA2020': {
        'SA3_CODE_2021': 'sa3_code', 'SA3_NAME_2021': 'sa3_name', 'SA4_CODE_2021': 'sa4_code',
        'GCCSA_CODE_2021': 'gcc_code', 'STATE_CODE_2021': 'state_code', 'AREA_ALBERS_SQKM': 'area_sqkm',
        'geometry': 'geometry'
    },
    'SA4_2021_AUST_GDA2020': {
        'SA4_CODE_2021': 'sa4_code', 'SA4_NAME_2021': 'sa4_name', 'GCCSA_CODE_2021': 'gcc_code',
        'STATE_CODE_2021': 'state_code', 'AREA_ALBERS_SQKM': 'area_sqkm', 'geometry': 'geometry'
    },
    'GCCSA_2021_AUST_GDA2020': {
        'GCCSA_CODE_2021': 'gcc_code', 'GCCSA_NAME_2021': 'gcc_name', 'STATE_CODE_2021': 'state_code',
        'geometry': 'geometry'
    },
    'STE_2021_AUST_GDA2020': {
        'STATE_CODE_2021': 'state_code', 'STATE_NAME_2021': 'state_name', 'geometry': 'geometry'
    },
    'MB_2021_AUST_GDA2020': {
        'MB_CODE_2021': 'mb_code', 'MB_CATEGORY_2021': 'mb_cat',
        'SA1_CODE_2021': 'sa1_code', 'SA2_CODE_2021': 'sa2_code', 'SA3_CODE_2021': 'sa3_code',
        'SA4_CODE_2021': 'sa4_code', 'GCCSA_CODE_2021': 'gcc_code', 'STATE_CODE_2021': 'state_code',
        'AREA_ALBERS_SQKM': 'area_sqkm', 'geometry': 'geometry'
    }
}

# Destination table name for each layer
layers = {
    'STE_2021_AUST_GDA2020': 'state',
    'GCCSA_2021_AUST_GDA2020': 'gccsa',
    'SA4_2021_AUST_GDA2020': 'sa4',
    'SA3_2021_AUST_GDA2020': 'sa3',
    'SA2_2021_AUST_GDA2020': 'sa2',
    'SA1_2021_AUST_GDA2020': 'sa1',
    'MB_2021_AUST_GDA2020': 'meshblock'
}

with gnaf_db.connect() as conn:
    cursor = conn.cursor()

    # this really won't do anything unless you need to rerun this portion of the script
    for _, table in layers.items():
        cursor.execute(f"""
        DO $$ BEGIN
          IF EXISTS (
            SELECT 1 FROM information_schema.tables 
             WHERE table_name = '{table}' AND table_schema = '{schema}'
          ) THEN
            TRUNCATE TABLE {schema}.{table} RESTART IDENTITY CASCADE;
          END IF;
        END $$;
        """)
    
    with open('sql/abs_main_structures_create_tables.sql', 'r') as f:
        cursor.execute(f.read())
        
    cursor.close()

for layer_name, table_name in layers.items():
    column_renames = column_renames_for_table[layer_name]
    
    # Load each layer into corresponding tables
    df = gpd.read_file('zip-out/cities/ASGS_2021_MAIN_STRUCTURE_GDA2020.gpkg', layer=layer_name)
    df = df.rename(columns=column_renames)
    df = df[list(column_renames.values())]
    df.to_postgis(table_name, engine, schema=schema, if_exists='append', index=False)

    with engine.connect() as connection:
        result = pd.read_sql(f"SELECT COUNT(*) FROM {schema}.{table_name}", connection)
        print(f"Populated {schema}.{table_name} with {result.iloc[0, 0]}/{len(df)} rows.")


Populated abs_main_structures.state with 10/10 rows.
Populated abs_main_structures.gccsa with 35/35 rows.
Populated abs_main_structures.sa4 with 108/108 rows.
Populated abs_main_structures.sa3 with 359/359 rows.
Populated abs_main_structures.sa2 with 2473/2473 rows.
Populated abs_main_structures.sa1 with 61845/61845 rows.
Populated abs_main_structures.meshblock with 368286/368286 rows.


### Non ABS Main Structures 

We are mostly ingesting these to make it simpler to narrow down the data of interest. Typically, if you're looking at this data, you're doing so within some scope of relevance, such as a local government area or an electoral division.

In [5]:
import geopandas as gpd
import pandas as pd

schema = 'non_abs_main_structures'

column_renames_for_table = {
    'SAL_2021_AUST_GDA2020': {
        "SAL_CODE_2021": "locality_id",
        "SAL_NAME_2021": "locality_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'SED_2021_AUST_GDA2020': {
        "SED_CODE_2021": "electorate_id",
        "SED_NAME_2021": "electorate_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'SED_2022_AUST_GDA2020': {
        "SED_CODE_2022": "electorate_id",
        "SED_NAME_2022": "electorate_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'SED_2024_AUST_GDA2020': {
        "SED_CODE_2024": "electorate_id",
        "SED_NAME_2024": "electorate_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'CED_2021_AUST_GDA2020': {
        "CED_CODE_2021": "electorate_id",
        "CED_NAME_2021": "electorate_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'LGA_2021_AUST_GDA2020': {
        "LGA_CODE_2021": "lga_id",
        "LGA_NAME_2021": "lga_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'LGA_2022_AUST_GDA2020': {
        "LGA_CODE_2022": "lga_id",
        "LGA_NAME_2022": "lga_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'LGA_2023_AUST_GDA2020': {
        "LGA_CODE_2023": "lga_id",
        "LGA_NAME_2023": "lga_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'LGA_2024_AUST_GDA2020': {
        "LGA_CODE_2024": "lga_id",
        "LGA_NAME_2024": "lga_name",
        "STATE_CODE_2021": "state_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    'POA_2021_AUST_GDA2020': {
        "POA_CODE_2021": "post_code",
        "AUS_CODE_2021": "in_australia",
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry"
    },
    # https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/non-abs-structures/destination-zones
    'DZN_2021_AUST_GDA2020': {
        'DZN_CODE_2021': 'dzn_code', 
        'SA2_CODE_2021': 'sa2_code',
        'STATE_CODE_2021': 'state_code',
        "AUS_CODE_2021": 'in_australia',
        "AREA_ALBERS_SQKM": "area_sqkm",
        "geometry": "geometry",
    },
}

layers = {
    'SAL_2021_AUST_GDA2020': 'localities',
    'SED_2021_AUST_GDA2020': 'state_electoral_division_2021',
    'SED_2022_AUST_GDA2020': 'state_electoral_division_2022',
    'SED_2024_AUST_GDA2020': 'state_electoral_division_2024',
    'CED_2021_AUST_GDA2020': 'federal_electoral_division_2021',
    'LGA_2021_AUST_GDA2020': 'lga_2021',
    'LGA_2022_AUST_GDA2020': 'lga_2022',
    'LGA_2023_AUST_GDA2020': 'lga_2023',
    'LGA_2024_AUST_GDA2020': 'lga_2024',
    'POA_2021_AUST_GDA2020': 'post_code',
    'DZN_2021_AUST_GDA2020': 'dzn',
    # Unused
    # - australian drainage divisions, 'ADD_2021_AUST_GDA2020'
    # - tourism regions, 'TR_2021_AUST_GDA2020'
}

with gnaf_db.connect() as conn:
    cursor = conn.cursor()

    # this really won't do anything unless you need to rerun this portion of the script
    for _, table in layers.items():
        cursor.execute(f"""
        DO $$ BEGIN
          IF EXISTS (
            SELECT 1 FROM information_schema.tables 
             WHERE table_name = '{table}' AND table_schema = '{schema}'
          ) THEN
            TRUNCATE TABLE {schema}.{table} RESTART IDENTITY CASCADE;
          END IF;
        END $$;
        """)
    
    with open('sql/non_abs_main_structures_create_tables.sql', 'r') as f:
        cursor.execute(f.read())
        
    cursor.close()
    
for layer_name, table_name in layers.items():
    column_renames = column_renames_for_table[layer_name]
    
    df = gpd.read_file('zip-out/non_abs_structures_shapefiles/ASGS_Ed3_Non_ABS_Structures_GDA2020_updated_2024.gpkg', layer=layer_name)
    df = df.rename(columns=column_renames)
    df = df[list(column_renames.values())]

    if 'in_australia' in df:
        df['in_australia'] = df['in_australia'] == 'AUS'
    
    df.to_postgis(table_name, engine, schema=schema, if_exists='append', index=False)

    with engine.connect() as connection:
        result = pd.read_sql(f"SELECT COUNT(*) FROM {schema}.{table_name}", connection)
        print(f"Populated {schema}.{table_name} with {result.iloc[0, 0]}/{len(df)} rows.")



Populated non_abs_main_structures.localities with 15353/15353 rows.
Populated non_abs_main_structures.state_electoral_division_2021 with 452/452 rows.
Populated non_abs_main_structures.state_electoral_division_2022 with 452/452 rows.
Populated non_abs_main_structures.state_electoral_division_2024 with 452/452 rows.
Populated non_abs_main_structures.federal_electoral_division_2021 with 170/170 rows.
Populated non_abs_main_structures.lga_2021 with 566/566 rows.
Populated non_abs_main_structures.lga_2022 with 566/566 rows.
Populated non_abs_main_structures.lga_2023 with 566/566 rows.
Populated non_abs_main_structures.lga_2024 with 566/566 rows.
Populated non_abs_main_structures.post_code with 2644/2644 rows.
Populated non_abs_main_structures.dzn with 9329/9329 rows.


## Ingesting NSW Land Values

First let's just get the CSVs into the database, then we'll break them up into separate tables, and then we'll form links with the GNAF dataset.

#### Documentation on this dataset

The Valuer General website links to documentation on interpreting the data on [this page](https://www.nsw.gov.au/housing-and-construction/land-values-nsw/resource-library/land-value-information-user-guide). I didn't link to the PDF directly, as it is occasionally updated and a direct link is at risk of going stale.

It's useful getting the meaning behind the codes and terms used in the bulk data.


### Build the `nsw_valuer_general.raw_entries_lv` table

Here we load each file from the latest land value publication with minimal changes and a bit of sanitising.
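Two of the sanitising steps are small enough to show in isolation: recovering the publication date from the file name, and turning a postcode that was read as a float back into a string. The sketch below uses hypothetical helper names, and unlike the cell that follows it maps a missing postcode to `None` rather than leaving it as NaN.

```python
import math
from datetime import datetime
from pathlib import Path


def source_date_from_filename(file_name: str) -> datetime:
    """Parse the trailing YYYYMMDD part of a land value CSV file name."""
    date_str = Path(file_name).stem.rsplit("_", 1)[-1]
    return datetime.strptime(date_str, "%Y%m%d")


def normalise_postcode(value):
    """Convert a postcode read as a float (e.g. 2000.0) to a string.

    Missing values (None or NaN) become None.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return str(int(value))
```

Note that a postcode read as a number has already lost any leading zero, so `800.0` becomes `'800'`; this sketch does not try to restore it.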

In [6]:
from datetime import datetime
import os
import math
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import text
from psycopg2.errors import StringDataRightTruncation

from lib import notebook_constants as nc

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.raw_entries_lv CASCADE")
    with open('sql/nsw_lv_schema_1_raw.sql', 'r') as f:
        cursor.execute(f.read())
    cursor.close()
            
column_mappings = { **nc.lv_long_column_mappings, **nc.lv_wide_columns_mappings }

def count(table, source = None):
    c = pd.read_sql(f'SELECT count(*) FROM nsw_valuer_general.{table}', gnaf_db.engine())
    time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f'{time} {source and f"{source}, " or ""}{table} {c.iloc[0,0]}')

def process_file(file):
    if not file.endswith("csv"):
        return

    full_file_path = f"zip-out/{lv_target.zip_dst}/{file}"
    try:
        df = pd.read_csv(full_file_path, encoding='utf-8')
    except UnicodeDecodeError:
        # Fallback to ISO-8859-1 encoding if utf-8 fails
        df = pd.read_csv(full_file_path, encoding='ISO-8859-1')

    date_str = file.split('_')[-1].replace('.csv', '')
    
    df.index.name = 'source_file_position'
    df = df.drop(columns=['Unnamed: 34'])
    df = df.rename(columns=column_mappings).reset_index()
    df['source_file_name'] = file
    df['source_date'] = datetime.strptime(date_str, "%Y%m%d")
    df['postcode'] = [(n if math.isnan(n) else str(int(n))) for n in df['postcode']]
    
    try:
        df.to_sql('raw_entries_lv', gnaf_db.engine(), schema='nsw_valuer_general', if_exists='append', index=False)
    finally:
        count('raw_entries_lv', f'Consumed {full_file_path}')

files = sorted(os.listdir(f"zip-out/{lv_target.zip_dst}"))

with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
    futures = [executor.submit(process_file, file) for file in files]
    for future in as_completed(futures):
        future.result()

2024-09-08 19:56:21 Consumed zip-out/nswvg_lv_01_Sep_2024/052_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 4885
2024-09-08 19:56:21 Consumed zip-out/nswvg_lv_01_Sep_2024/043_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 8034
2024-09-08 19:56:21 Consumed zip-out/nswvg_lv_01_Sep_2024/054_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 11236
2024-09-08 19:56:21 Consumed zip-out/nswvg_lv_01_Sep_2024/061_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 11236
2024-09-08 19:56:21 Consumed zip-out/nswvg_lv_01_Sep_2024/051_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 16450
2024-09-08 19:56:22 Consumed zip-out/nswvg_lv_01_Sep_2024/065_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 24431
2024-09-08 19:56:22 Consumed zip-out/nswvg_lv_01_Sep_2024/066_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 24431
2024-09-08 19:56:22 Consumed zip-out/nswvg_lv_01_Sep_2024/070_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 28257
2024-09-08 19:56:23 Consumed zip-out/nswvg_lv_01_Sep_2024/083_LAND_VALUE_DATA_20240901.csv, raw_en

2024-09-08 19:56:28 Consumed zip-out/nswvg_lv_01_Sep_2024/002_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 45287
2024-09-08 19:56:30 Consumed zip-out/nswvg_lv_01_Sep_2024/012_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 56349
2024-09-08 19:56:31 Consumed zip-out/nswvg_lv_01_Sep_2024/018_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 75832
2024-09-08 19:56:31 Consumed zip-out/nswvg_lv_01_Sep_2024/085_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 83621
2024-09-08 19:56:33 Consumed zip-out/nswvg_lv_01_Sep_2024/087_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 90528
2024-09-08 19:56:33 Consumed zip-out/nswvg_lv_01_Sep_2024/042_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 98073


2024-09-08 19:56:39 Consumed zip-out/nswvg_lv_01_Sep_2024/008_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 157081
2024-09-08 19:56:41 Consumed zip-out/nswvg_lv_01_Sep_2024/001_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 213056
2024-09-08 19:56:41 Consumed zip-out/nswvg_lv_01_Sep_2024/090_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 213056
2024-09-08 19:56:42 Consumed zip-out/nswvg_lv_01_Sep_2024/050_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 237852


2024-09-08 19:56:45 Consumed zip-out/nswvg_lv_01_Sep_2024/074_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 258957
2024-09-08 19:56:45 Consumed zip-out/nswvg_lv_01_Sep_2024/098_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 258957
2024-09-08 19:56:46 Consumed zip-out/nswvg_lv_01_Sep_2024/088_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 269298
2024-09-08 19:56:47 Consumed zip-out/nswvg_lv_01_Sep_2024/082_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 310782


2024-09-08 19:56:50 Consumed zip-out/nswvg_lv_01_Sep_2024/109_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 318226
2024-09-08 19:56:51 Consumed zip-out/nswvg_lv_01_Sep_2024/092_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 335574
2024-09-08 19:56:53 Consumed zip-out/nswvg_lv_01_Sep_2024/117_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 345446
2024-09-08 19:56:54 Consumed zip-out/nswvg_lv_01_Sep_2024/118_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 349380
2024-09-08 19:56:54 Consumed zip-out/nswvg_lv_01_Sep_2024/123_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 349380
2024-09-08 19:56:56 Consumed zip-out/nswvg_lv_01_Sep_2024/116_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 357841
2024-09-08 19:56:59 Consumed zip-out/nswvg_lv_01_Sep_2024/010_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 389550
2024-09-08 19:57:01 Consumed zip-out/nswvg_lv_01_Sep_2024/097_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 448270
2024-09-08 19:57:02 Consumed zip-out/nswvg_lv_01_Sep_2024/084_LAND_VALUE_DATA_20240901.c

2024-09-08 19:57:17 Consumed zip-out/nswvg_lv_01_Sep_2024/150_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 708307
2024-09-08 19:57:19 Consumed zip-out/nswvg_lv_01_Sep_2024/157_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 723444


2024-09-08 19:57:22 Consumed zip-out/nswvg_lv_01_Sep_2024/164_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 739914
2024-09-08 19:57:23 Consumed zip-out/nswvg_lv_01_Sep_2024/187_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 739914
2024-09-08 19:57:23 Consumed zip-out/nswvg_lv_01_Sep_2024/199_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 743113
2024-09-08 19:57:28 Consumed zip-out/nswvg_lv_01_Sep_2024/004_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 896169
2024-09-08 19:57:28 Consumed zip-out/nswvg_lv_01_Sep_2024/192_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 896169
2024-09-08 19:57:28 Consumed zip-out/nswvg_lv_01_Sep_2024/101_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 896169


2024-09-08 19:57:30 Consumed zip-out/nswvg_lv_01_Sep_2024/159_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 924013
2024-09-08 19:57:30 Consumed zip-out/nswvg_lv_01_Sep_2024/188_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 924013
2024-09-08 19:57:31 Consumed zip-out/nswvg_lv_01_Sep_2024/209_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 937161
2024-09-08 19:57:34 Consumed zip-out/nswvg_lv_01_Sep_2024/152_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 979638
2024-09-08 19:57:34 Consumed zip-out/nswvg_lv_01_Sep_2024/210_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 979638
2024-09-08 19:57:38 Consumed zip-out/nswvg_lv_01_Sep_2024/103_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1050686
2024-09-08 19:57:39 Consumed zip-out/nswvg_lv_01_Sep_2024/230_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1052331
2024-09-08 19:57:40 Consumed zip-out/nswvg_lv_01_Sep_2024/231_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1068892
2024-09-08 19:57:40 Consumed zip-out/nswvg_lv_01_Sep_2024/222_LAND_VALUE_DATA_2024090

2024-09-08 19:57:53 Consumed zip-out/nswvg_lv_01_Sep_2024/244_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1308208
2024-09-08 19:57:54 Consumed zip-out/nswvg_lv_01_Sep_2024/216_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1320891
2024-09-08 19:57:54 Consumed zip-out/nswvg_lv_01_Sep_2024/251_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1320891
2024-09-08 19:57:54 Consumed zip-out/nswvg_lv_01_Sep_2024/247_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1320891
2024-09-08 19:57:54 Consumed zip-out/nswvg_lv_01_Sep_2024/252_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1328523
2024-09-08 19:57:55 Consumed zip-out/nswvg_lv_01_Sep_2024/254_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1328523
2024-09-08 19:57:55 Consumed zip-out/nswvg_lv_01_Sep_2024/250_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1328523
2024-09-08 19:57:57 Consumed zip-out/nswvg_lv_01_Sep_2024/217_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1379818
2024-09-08 19:57:57 Consumed zip-out/nswvg_lv_01_Sep_2024/253_LAND_VALUE_DATA_20

2024-09-08 19:58:01 Consumed zip-out/nswvg_lv_01_Sep_2024/262_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1389636
2024-09-08 19:58:02 Consumed zip-out/nswvg_lv_01_Sep_2024/257_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1402825
2024-09-08 19:58:03 Consumed zip-out/nswvg_lv_01_Sep_2024/263_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1410736
2024-09-08 19:58:04 Consumed zip-out/nswvg_lv_01_Sep_2024/218_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1468660
2024-09-08 19:58:06 Consumed zip-out/nswvg_lv_01_Sep_2024/265_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1475403
2024-09-08 19:58:07 Consumed zip-out/nswvg_lv_01_Sep_2024/266_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1487180
2024-09-08 19:58:08 Consumed zip-out/nswvg_lv_01_Sep_2024/220_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1541238
2024-09-08 19:58:09 Consumed zip-out/nswvg_lv_01_Sep_2024/270_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1543834


2024-09-08 19:58:10 Consumed zip-out/nswvg_lv_01_Sep_2024/269_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1552745


2024-09-08 19:58:16 Consumed zip-out/nswvg_lv_01_Sep_2024/274_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1562106
2024-09-08 19:58:17 Consumed zip-out/nswvg_lv_01_Sep_2024/273_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1643859
2024-09-08 19:58:18 Consumed zip-out/nswvg_lv_01_Sep_2024/224_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1643859
2024-09-08 19:58:20 Consumed zip-out/nswvg_lv_01_Sep_2024/223_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1711459
2024-09-08 19:58:24 Consumed zip-out/nswvg_lv_01_Sep_2024/300_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1714789


2024-09-08 19:58:25 Consumed zip-out/nswvg_lv_01_Sep_2024/301_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1755276
2024-09-08 19:58:26 Consumed zip-out/nswvg_lv_01_Sep_2024/264_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1755276
2024-09-08 19:58:29 Consumed zip-out/nswvg_lv_01_Sep_2024/302_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1760742


2024-09-08 19:58:35 Consumed zip-out/nswvg_lv_01_Sep_2024/272_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1835425
2024-09-08 19:58:35 Consumed zip-out/nswvg_lv_01_Sep_2024/260_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1835425
2024-09-08 19:58:38 Consumed zip-out/nswvg_lv_01_Sep_2024/275_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1868357
2024-09-08 19:58:39 Consumed zip-out/nswvg_lv_01_Sep_2024/511_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1876808
2024-09-08 19:58:39 Consumed zip-out/nswvg_lv_01_Sep_2024/528_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1883804
2024-09-08 19:58:40 Consumed zip-out/nswvg_lv_01_Sep_2024/526_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1883804


2024-09-08 19:58:41 Consumed zip-out/nswvg_lv_01_Sep_2024/261_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1935100
2024-09-08 19:58:44 Consumed zip-out/nswvg_lv_01_Sep_2024/267_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 1986619
2024-09-08 19:58:45 Consumed zip-out/nswvg_lv_01_Sep_2024/538_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 2029915
2024-09-08 19:58:45 Consumed zip-out/nswvg_lv_01_Sep_2024/537_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 2029915
2024-09-08 19:58:45 Consumed zip-out/nswvg_lv_01_Sep_2024/276_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 2029915
2024-09-08 19:58:50 Consumed zip-out/nswvg_lv_01_Sep_2024/560_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 2037088
2024-09-08 19:58:52 Consumed zip-out/nswvg_lv_01_Sep_2024/529_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 2053665
2024-09-08 19:58:54 Consumed zip-out/nswvg_lv_01_Sep_2024/214_LAND_VALUE_DATA_20240901.csv, raw_entries_lv 2178410
2024-09-08 19:58:56 Consumed zip-out/nswvg_lv_01_Sep_2024/620_LAND_VALUE_DATA_20

### Break CSV data into separate relations

To break the data up into a more efficient representation that is also easier to query, we're going to run a series of queries against the raw land value data and use the results to populate the tables we care about.
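The general shape of this step is `INSERT ... SELECT` from the raw table into narrower tables. The sketch below shows the idea using an in-memory SQLite database in place of PostgreSQL. The table layout, column names and rows are all hypothetical and much simpler than what `sql/nsw_lv_schema_2_structure.sql` and `sql/nsw_lv_from_raw.sql` define.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; every table, column and
# value here is invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_entries (
        district_code INTEGER, district_name TEXT,
        property_id INTEGER, suburb_name TEXT
    );
    CREATE TABLE district (
        district_code INTEGER PRIMARY KEY, district_name TEXT
    );
    CREATE TABLE property (
        property_id INTEGER PRIMARY KEY,
        district_code INTEGER REFERENCES district(district_code),
        suburb_name TEXT
    );
""")
conn.executemany(
    "INSERT INTO raw_entries VALUES (?, ?, ?, ?)",
    [
        (1, "DISTRICT A", 100, "SUBURB X"),
        (1, "DISTRICT A", 101, "SUBURB X"),
        (2, "DISTRICT B", 102, "SUBURB Y"),
    ],
)

# Repeated district details collapse to one row per district, while
# each property keeps only a reference to its district.
conn.executescript("""
    INSERT INTO district
        SELECT DISTINCT district_code, district_name FROM raw_entries;
    INSERT INTO property
        SELECT property_id, district_code, suburb_name FROM raw_entries;
""")


def table_count(table: str) -> int:
    """Return the number of rows in `table`."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```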

In [7]:
from datetime import datetime
import os
import math
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import text
from psycopg2.errors import StringDataRightTruncation

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.source_file CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.source CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.district CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.suburb CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.street CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.property CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.property_description CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.valuations CASCADE")
    
    with open('sql/nsw_lv_schema_2_structure.sql', 'r') as f:
        cursor.execute(f.read())
        
    with open('sql/nsw_lv_from_raw.sql', 'r') as f:
        cursor.execute(f.read())
        
    cursor.close()
    
count('district')
count('suburb')
count('street')
count('property')
count('property_description')
count('valuations')

2024-09-08 20:01:01 district 128
2024-09-08 20:01:01 suburb 5075
2024-09-08 20:01:01 street 128422
2024-09-08 20:01:01 property 2702450
2024-09-08 20:01:01 property_description 2702450
2024-09-08 20:01:02 valuations 13512250


### Parse contents of the property description

The `property_description` from the original valuer general data contains a lot of information. The most important part is the land parcel, or `lot/plan`, information, but there is other information in there as well.
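To give a feel for the kind of parsing involved, here is a minimal sketch. It is not the implementation of `parse_land_parcel_ids`: the `ParcelSketch` class, the regular expression, and the assumed description format (tokens shaped like `LOT/PLAN`, optionally prefixed with `PT` for a partial lot) are all assumptions made for illustration. Only the `id` and `part` attribute names mirror what the cell below reads from each parcel.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParcelSketch:
    """Hypothetical stand-in for the parcel objects used in the notebook."""
    id: str     # assumed "lot/plan" identifier
    part: bool  # assumed: True when the description marks the lot as partial

# Assumed token shape: optional "PT", a lot number, "/", then a plan number.
PARCEL_RE = re.compile(r'\b(PT\s+)?(\d+[A-Z]?)/((?:DP|SP)?\d+)\b')

def parse_parcels_sketch(description):
    """Return every lot/plan token found in a free-text description."""
    return [
        ParcelSketch(id=f"{lot}/{plan}", part=prefix is not None)
        for prefix, lot, plan in (m.groups() for m in PARCEL_RE.finditer(description))
    ]

parcels = parse_parcels_sketch("PT 1/DP12345, 2/DP12345 Licence 123")
for parcel in parcels:
    print(parcel)
```

A description can mention several parcels, which is why the notebook explodes the parsed list into one row per parcel before writing to `land_parcel_link`.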

In [8]:
import numpy as np
import pandas as pd
from lib.nsw_vg.property_description import parse_land_parcel_ids

engine = gnaf_db.engine()

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.land_parcel_link")
    with open('sql/nsw_lv_schema_3_property_description_meta_data.sql', 'r') as f:
        cursor.execute(f.read())
    cursor.close()

def land_parcels(desc):
    desc, parcels = parse_land_parcel_ids(desc)
    return parcels 

query = "SELECT * FROM nsw_valuer_general.property_description"
for df_chunk in pd.read_sql(query, engine, chunksize=10000):
    df_chunk = df_chunk.dropna(subset=['property_description'])
    df_chunk['parcels'] = df_chunk['property_description'].apply(land_parcels)
    df_chunk_ex = df_chunk.explode('parcels')
    df_chunk_ex = df_chunk_ex.dropna(subset=['parcels'])
    df_chunk_ex['land_parcel_id'] = df_chunk_ex['parcels'].apply(lambda p: p.id)
    df_chunk_ex['part'] = df_chunk_ex['parcels'].apply(lambda p: p.part)
    df_chunk_ex = df_chunk_ex.drop(columns=['property_description', 'parcels'])
    df_chunk_ex.to_sql(
        'land_parcel_link',
        con=engine,
        schema='nsw_valuer_general',
        if_exists='append',
        index=False,
    )

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    for t in ['property', 'land_parcel_link']:
        cursor.execute(f'SELECT COUNT(*) FROM nsw_valuer_general.{t}')
        # Named row_count so it does not shadow the count() helper used in earlier cells.
        row_count = cursor.fetchone()[0]
        print(f"Table nsw_valuer_general.{t} has {row_count} rows")
    cursor.close()


Table nsw_valuer_general.property has 2702450 rows
Table nsw_valuer_general.land_parcel_link has 4245674 rows


### Get rid of `raw_entries_lv` table

We no longer need the raw entries table, so deleting it should make the database a bit more efficient in terms of storage.

In [9]:
with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    if GLOBAL_FLAGS['drop_raw_nsw_valuer_general_entries']:
        cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.raw_entries_lv")
        print("Dropping raw entries table")
    else:
        print("Keeping raw entries table")
    cursor.close()

Dropping raw entries table


## Ingest NSW Sales data

This data is also from the NSW valuer general.

#### Documentation on this dataset

You can find that [here](https://www.nsw.gov.au/housing-and-construction/land-values-nsw/resource-library/property-sales-data-guide).

### Build the `nsw_valuer_general.raw_entries_ps` table

First let's populate the raw sales information into the 

## GNAF Ingestion

The main thing to consider when ingesting this data is the order in which it is ingested. You could add the foreign key constraints after populating the database and go nuts (that might even be faster than what I've got here). But after a day of trying different variants of this script, juggling correctness of the ingested data against speed, I'm settling for this.

So 90% of the code here just coordinates the dependencies and the order in which everything is ingested, while doing as much in parallel as possible. My earlier approach of doing everything sequentially took 6 hours; this one takes between 1 and 2 hours.
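For comparison, the same idea (start a table only once the tables it references are loaded, and run everything else in parallel) can be sketched with the standard library's `graphlib.TopologicalSorter`, available from Python 3.9. This is not how the cell in this notebook does it; `run_in_dependency_order` and `example_deps` are hypothetical names, and the `task` here just returns its argument where the real thing would load a file.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_in_dependency_order(deps, task, max_workers=4):
    """Run task(node) for each node once all of its dependencies have finished.

    deps maps each node to the set of nodes it depends on.
    Returns the nodes in the order they completed.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}
        while sorter.is_active():
            # Submit everything whose dependencies are already done.
            for node in sorter.get_ready():
                running[pool.submit(task, node)] = node
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any error from the task
                node = running.pop(future)
                sorter.done(node)  # unblocks the nodes waiting on this one
                finished.append(node)
    return finished

# A small subset of the table dependencies, for illustration only.
example_deps = {
    'STATE': set(),
    'LOCALITY': {'STATE'},
    'STREET_LOCALITY': {'LOCALITY'},
    'ADDRESS_SITE': {'STATE'},
    'ADDRESS_DETAIL': {'ADDRESS_SITE', 'STATE', 'LOCALITY', 'STREET_LOCALITY'},
}
order = run_in_dependency_order(example_deps, lambda name: name)
print(order)
```

The exact completion order can vary between runs, but a table never starts before the tables it depends on have finished.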

In [10]:
import csv
import glob
import os
import psycopg2
import concurrent.futures

from collections import defaultdict, deque
from datetime import datetime
from lib import notebook_constants as nc
from threading import Lock

schema = 'gnaf'
# os.cpu_count() can return None, so fall back to a single worker.
WORKER_COUNT = os.cpu_count() or 1
# Rows per executemany call; integer division keeps this a whole number.
BATCH_SIZE = 64000 // WORKER_COUNT # idk what I'm doing here tbh.

def get_table_name(file):
    file = os.path.splitext(os.path.basename(file))[0]
    sidx = 15 if file.startswith('Authority_Code') else file.find('_')+1
    return file[sidx:file.rfind('_')]

def get_batches(batch_size, reader):
    batch = []
    for row in reader:
        row = [(None if v == "" else v) for v in (v.strip() for v in row)]
        batch.append(row)
        
        if len(batch) >= batch_size:
            yield batch
            batch = [] 
    if batch:
        yield batch

def populate_file(file):
    table_name = get_table_name(file)
    with gnaf_db.connect() as conn:
        cursor = conn.cursor()
        with open(file, 'r') as f:
            time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            print(f"{time} Populating from {os.path.basename(file)}")
            reader = csv.reader(f, delimiter='|')
            headers = next(reader)
            insert_query = f"""
                INSERT INTO {schema}.{table_name} ({', '.join(headers)}) 
                VALUES ({', '.join(['%s'] * len(headers))})
                ON CONFLICT DO NOTHING
            """
            
            for batch_index, batch in enumerate(get_batches(BATCH_SIZE, reader)):
                try:
                    cursor.executemany(insert_query, batch)
                except Exception as e:
                    print(f"Error inserting batch {batch_index + 1} into {table_name}: {e}")
                    raise e
            conn.commit()

def get_ordered_files(dependencies, all_files):
    file_sizes = { file: os.path.getsize(file) for file in all_files }
    total_blocked_sizes = defaultdict(int)
    
    def dfs(file, visited):
        if file in visited:
            return 0
        visited.add(file)
        total_size = file_sizes[file]
        for dep in dependencies[file]:
            total_size += dfs(dep, visited)
        return total_size
    
    for file in dependencies:
        visited = set()
        total_blocked_sizes[file] = dfs(file, visited)
        
    # Sort files based on the total blocked sizes, with higher sizes first
    return sorted(all_files, key=lambda f: total_blocked_sizes[f], reverse=True)

def worker(file_queue, all_files, dependency_count, lock, dependency_completed):
    while True:
        with lock:
            if not file_queue:
                break
            file = file_queue.popleft()
        
        populate_file(file)
        
        with lock:
            dependency_completed.add(file)
            
            for d in dependency_count:
                if file in dependency_count[d]:
                    dependency_count[d].remove(file)
                    
            ready_files = [f for f in all_files if not dependency_count[f]]
            
            for ready_file in ready_files:
                if ready_file not in file_queue:
                    file_queue.append(ready_file)
                    all_files.remove(ready_file)

authority_files = glob.glob(f'{gnaf_pub.psv_dir}/Authority Code/*.psv')

standard_prefix = f'{gnaf_pub.psv_dir}/Standard'
standard_files = [
    f'{standard_prefix}/{s}_{t}_psv.psv' 
    for t in [
        'STATE', 'ADDRESS_SITE', 'MB_2016', 'MB_2021', 'LOCALITY',
        'LOCALITY_ALIAS', 'LOCALITY_NEIGHBOUR', 'LOCALITY_POINT',
        'STREET_LOCALITY', 'STREET_LOCALITY_ALIAS', 'STREET_LOCALITY_POINT',
        'ADDRESS_DETAIL', 'ADDRESS_SITE_GEOCODE', 'ADDRESS_ALIAS', 
        'ADDRESS_DEFAULT_GEOCODE', 'ADDRESS_FEATURE', 
        'ADDRESS_MESH_BLOCK_2016', 'ADDRESS_MESH_BLOCK_2021',
        'PRIMARY_SECONDARY',
    ] 
    for s in ['NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'NT', 'OT', 'ACT'] 
]

standard_deps = { f: set() for f in authority_files } | { 
    f'{p}/{s}_{t}_psv.psv': { f'{p}/{s}_{d}_psv.psv' for d in ds } | set(authority_files)
    
    for t, ds in ({
        'STATE': [],
        'ADDRESS_SITE': ['STATE'],
        'MB_2016': [],
        'MB_2021': [],
        'LOCALITY': ['STATE'],
        'LOCALITY_ALIAS': ['LOCALITY'],
        'LOCALITY_NEIGHBOUR': ['LOCALITY'],
        'LOCALITY_POINT': ['LOCALITY'],
        'STREET_LOCALITY': ['LOCALITY'],
        'STREET_LOCALITY_ALIAS': ['STREET_LOCALITY'],
        'STREET_LOCALITY_POINT': ['STREET_LOCALITY'],
        'ADDRESS_DETAIL': ['ADDRESS_SITE', 'STATE', 'LOCALITY', 'STREET_LOCALITY'],
        'ADDRESS_SITE_GEOCODE': ['ADDRESS_SITE'],
        'ADDRESS_ALIAS': ['ADDRESS_DETAIL'],
        'ADDRESS_DEFAULT_GEOCODE': ['ADDRESS_DETAIL'],
        'ADDRESS_FEATURE': ['ADDRESS_DETAIL'],
        'ADDRESS_MESH_BLOCK_2016': ['ADDRESS_DETAIL', 'MB_2016'],
        'ADDRESS_MESH_BLOCK_2021': ['ADDRESS_DETAIL', 'MB_2021'],
        'PRIMARY_SECONDARY': ['ADDRESS_DETAIL'],
    }).items()
    for s in ['NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'NT', 'OT', 'ACT'] 
    for p in [standard_prefix]
}

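lock = Lock()
all_files = { *authority_files, *standard_files }
dependency_count = {k: set(v) for k, v in standard_deps.items()}
dependency_completed = set()
file_queue = deque()
file_queue.extend(f for f in get_ordered_files(dependency_count, all_files) if not dependency_count[f])
# Files that are already queued must leave `all_files`, otherwise `worker`
# treats them as newly ready once they are popped and queues them again.
all_files.difference_update(file_queue)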

with concurrent.futures.ThreadPoolExecutor(max_workers=WORKER_COUNT) as executor:
    futures = [executor.submit(worker, file_queue, all_files, dependency_count, lock, dependency_completed) for _ in range(WORKER_COUNT)]
    for future in concurrent.futures.as_completed(futures):
        future.result()


2024-09-08 20:02:13 Populating from Authority_Code_ADDRESS_CHANGE_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_FLAT_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_GEOCODE_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_GEOCODE_RELIABILITY_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_STREET_CLASS_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_GEOCODED_LEVEL_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_LOCALITY_CLASS_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_PS_JOIN_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_MB_MATCH_CODE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_ADDRESS_ALIAS_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_STREET_SUFFIX_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_ADDRESS_TYPE_AUT_psv.psv
2024-09-08 20:02:13 Populating from Authority_Code_LEVEL_TYP

## Done

We've now built up the dataset, so let's analyse what we've got and show the contents of the database.

In [11]:
with gnaf_db.connect() as conn:
    cursor = conn.cursor()

    for schema in ['nsw_valuer_general', 'gnaf', 'abs_main_structures', 'non_abs_main_structures']:
        # Get the list of all tables
        cursor.execute(f"""
            SELECT table_name
            FROM information_schema.tables
            WHERE table_schema = '{schema}'
        """)
        tables = cursor.fetchall()
    
        # Get row count for each table
        for table in tables:
            cursor.execute(f'SELECT COUNT(*) FROM {schema}.{table[0]}')
            # Named row_count so it does not shadow the count() helper used in earlier cells.
            row_count = cursor.fetchone()[0]
            print(f"Table {schema}.{table[0]} has {row_count} rows")
    
    cursor.close()

Table nsw_valuer_general.source_file has 128 rows
Table nsw_valuer_general.source has 2702450 rows
Table nsw_valuer_general.district has 128 rows
Table nsw_valuer_general.suburb has 5075 rows
Table nsw_valuer_general.street has 128422 rows
Table nsw_valuer_general.property has 2702450 rows
Table nsw_valuer_general.property_description has 2702450 rows
Table nsw_valuer_general.valuations has 13512250 rows
Table nsw_valuer_general.land_parcel_link has 4245674 rows
Table gnaf.address_alias has 861514 rows
Table gnaf.address_detail has 16491363 rows
Table gnaf.geocode_type_aut has 29 rows
Table gnaf.state has 9 rows
Table gnaf.street_locality has 754249 rows
Table gnaf.address_alias_type_aut has 8 rows
Table gnaf.address_change_type_aut has 511 rows
Table gnaf.address_default_geocode has 16491363 rows
Table gnaf.address_feature has 211931 rows
Table gnaf.address_mesh_block_2016 has 16491363 rows
Table gnaf.address_mesh_block_2021 has 16491363 rows
Table gnaf.address_site has 16491363 rows
