# Dataset ingestion

This Jupyter notebook ingests the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) from [data.gov.au](https://data.gov.au). It also downloads the [land values for NSW][nswlv] and the [ABS shapefiles][abssf].

It loads all of this data into a PostgreSQL database in a Docker container, treating it like a disposable SQLite data store.

The goal is to ingest all the data needed to assess land by its land value and to filter it by address information.

### The Steps

1. Download static assets and datasets
2. Set up a Docker container running PostgreSQL with GIS capabilities.
3. Ingest the [ABS shape files][abssf]
4. Ingest the latest [NSW valuer general land values][nswlv].
5. Ingest the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) dataset
6. Link NSW Valuer General data with GNAF dataset

[gnaf]: https://data.gov.au/data/dataset/geocoded-national-address-file-g-naf
[nswlv]: https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php
[abssf]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### Note

- Make sure docker is running first.

### Warning

Do not connect this to another database unless you've taken the time to update it, as it will drop the existing database. I suggest you instead take what you need from this script and disregard the rest. DO NOT USE DATABASE CREDENTIALS HERE FOR ANY OTHER STORE (especially anything with drop permissions).

It also executes SQL from a zip file downloaded from an external source.
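
One way to reduce the risk described above is to refuse to run against anything that does not look like the local throwaway instance. The following is a minimal sketch, assuming the connection settings are a plain dict with `host` and `dbname` keys. The real shape of `nc.gnaf_dbconf_2` is not shown in this notebook, and `assert_disposable` and both allowlists are hypothetical.

```python
# Hypothetical guard: the key names and allowlists below are assumptions,
# not part of this repository.
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}
ALLOWED_DB_NAMES = {"gnaf_db"}

def assert_disposable(conf: dict) -> None:
    """Raise if the target does not look like the local throwaway database."""
    host = conf.get("host")
    dbname = conf.get("dbname")
    if host not in ALLOWED_HOSTS:
        raise RuntimeError(f"refusing to run against non-local host {host!r}")
    if dbname not in ALLOWED_DB_NAMES:
        raise RuntimeError(f"refusing to drop unexpected database {dbname!r}")

# Passes silently for a local, expected database name.
assert_disposable({"host": "localhost", "dbname": "gnaf_db"})
```

Calling something like this before any `DROP` statement turns a misconfigured connection into an immediate error rather than data loss.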


## Configuration

These are the fields to change if you wish to configure how the data is ingested.

In [1]:
from lib import notebook_constants as nc

# If you mark this as true, the table `nsw_valuer_general.raw_entries_lv`
# will be dropped. If you have space limitations and no desire to debug
# the data, then dropping this makes sense. If you wish to debug some values,
# then keeping this around may make some sense.
GLOBAL_FLAGS = {
    'drop_raw_nsw_valuer_general_entries': True,
    'reinitialise_container': True,
}

db_conf = nc.gnaf_dbconf_2
db_name = nc.gnaf_dbname_2

docker_container_name = 'gnaf_db_test'
docker_image_tag = "20240908_19_53"

## Download Static Files

Here we download the static files, as well as fetch the most recently published land values from the Valuer General's website.
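
The weekly sales archives that appear in this cell's log output follow a regular naming pattern, one file per Monday. As a rough illustration only, the sketch below regenerates those names for a given year. The `nswvg_wps_DD_Mon_YYYY.zip` pattern is inferred from the log lines, `weekly_targets` is a hypothetical helper, and this is not how `lib.tasks.fetch_static_files` is known to work.

```python
from datetime import date, timedelta

def weekly_targets(year: int) -> list[str]:
    """List the archive names for every Monday in `year`.

    The naming pattern is inferred from this notebook's log output,
    not from any published specification.
    """
    day = date(year, 1, 1)
    # Advance to the first Monday of the year (weekday() == 0).
    day += timedelta(days=(7 - day.weekday()) % 7)
    names = []
    while day.year == year:
        # %b yields an abbreviated month name; this assumes an English locale.
        names.append(f"nswvg_wps_{day.strftime('%d_%b_%Y')}.zip")
        day += timedelta(days=7)
    return names

print(weekly_targets(2024)[:3])
```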

In [2]:
import logging
from lib.service.io import IoService
from lib.tasks.fetch_static_files import initialise, get_session

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

io = IoService.create(None)
async with get_session(io) as session:
    environment = await initialise(io, session)

land_value_dis = environment.land_value
w_sale_price = environment.sale_price_weekly
a_sale_price = environment.sale_price_annual
gnaf_dis = environment.gnaf

2024-10-01 15:07:00,957 - INFO - Checking Target "non_abs_shape.zip"
2024-10-01 15:07:00,958 - INFO - Checking Target "cities.zip"
2024-10-01 15:07:00,958 - INFO - Checking Target "g-naf_aug24_allstates_gda2020_psv_1016.zip"
2024-10-01 15:07:00,958 - INFO - Checking Target "nswvg_lv_01_Oct_2024.zip"
2024-10-01 15:07:00,959 - INFO - Checking Target "nswvg_wps_01_Jan_2024.zip"
2024-10-01 15:07:00,959 - INFO - Checking Target "nswvg_wps_08_Jan_2024.zip"
2024-10-01 15:07:00,959 - INFO - Checking Target "nswvg_wps_15_Jan_2024.zip"
2024-10-01 15:07:00,959 - INFO - Checking Target "nswvg_wps_22_Jan_2024.zip"
2024-10-01 15:07:00,960 - INFO - Checking Target "nswvg_wps_29_Jan_2024.zip"
2024-10-01 15:07:00,960 - INFO - Checking Target "nswvg_wps_05_Feb_2024.zip"
2024-10-01 15:07:00,960 - INFO - Checking Target "nswvg_wps_12_Feb_2024.zip"
2024-10-01 15:07:00,960 - INFO - Checking Target "nswvg_wps_19_Feb_2024.zip"
2024-10-01 15:07:00,961 - INFO - Checking Target "nswvg_wps_26_Feb_2024.zip"
2024-1

## Create Container with Database

Here we create a Docker container from an image that is based on the postgres image and also installs a few extensions.

### Note

This notebook is designed to be run more than once, so it'll throw away any existing container and database before creating a new one. After getting rid of any container using the same identifier, it'll create a new one and pull the relevant image if it's not already installed. It'll wait until the postgres instance is live, then create the database.
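
The "wait until the instance is live" step is essentially a polling loop with a timeout. The sketch below shows that general pattern in isolation. It is not the implementation of `GnafDb.wait_till_running`, which is not shown here; `wait_until` and its parameters are hypothetical, and the readiness check is a stand-in for a real connection attempt.

```python
import time
from typing import Callable

def wait_until(is_ready: Callable[[], bool], timeout: float = 60.0, interval: float = 0.5) -> None:
    """Poll `is_ready` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if is_ready():
                return
        except ConnectionError:
            # The server is not accepting connections yet; keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds")
        time.sleep(interval)

# Stand-in for a real connection attempt: becomes ready on the third call.
attempts = iter([False, False, True])
wait_until(lambda: next(attempts), timeout=5.0, interval=0.01)
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline stable if the system clock is adjusted while waiting.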

In [3]:
from lib.gnaf_db import GnafDb, GnafContainer, GnafImage
from lib import notebook_constants as nc

if GLOBAL_FLAGS['reinitialise_container']:
    image = GnafImage.create(tag=docker_image_tag)
    image.prepare()
    
    container = GnafContainer.create(container_name=docker_container_name, image=image)
    container.clean()
    container.prepare(db_conf, db_name)
    container.start()
else:
    print('skipping container initialisation')

gnaf_db = GnafDb.create(db_conf, db_name)
gnaf_db.wait_till_running()

if GLOBAL_FLAGS['reinitialise_container']:
    gnaf_db.init_schema(gnaf_dis.publication)
else:
    print('skipping DB initialisation')
    raise Exception()

running ./_out_zip/g-naf_aug24_allstates_gda2020_psv_1016/G-NAF/Extras/GNAF_TableCreation_Scripts/create_tables_ansi.sql
running ./_out_zip/g-naf_aug24_allstates_gda2020_psv_1016/G-NAF/Extras/GNAF_TableCreation_Scripts/add_fk_constraints.sql
running sql/move_gnaf_to_schema.sql


## Consume the ABS Shapefiles

The [ABS provides a number of shape files][all abs shape files]; we're going to focus on 2 main sets of shapes. The first is the **ABS Main Structures**, which covers things like SA1, 2, 3 & 4 along with greater cities, meshblocks, and states. The second is the **Non ABS Main Structures**, which covers things like electoral divisions, suburbs, post codes, etc.

[all abs shape files]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### ABS Main Structures 

We want to visualise any address or region we look up in the GNAF dataset. The ABS has a few different geographic groupings we can visualise the data against, and each address in the GNAF dataset has a meshblock id. Meshblocks are the smallest unit in the ABS hierarchy, and they aggregate up into SA1s, SA2s, SA3s and SA4s.

This dataset is pretty useful for visualising the GNAF data for that reason.
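
To illustrate why the meshblock id is a handy join key, here is a small sketch that rolls address counts up to SA2 level. All codes, names, and addresses are invented, and `addresses_per_sa2` is a hypothetical helper; in practice the lookup would come from the ABS main structures tables.

```python
from collections import Counter

# Hypothetical lookup: meshblock code -> SA2 name. These codes are invented.
MESH_BLOCK_TO_SA2 = {
    "MB001": "Example SA2 A",
    "MB002": "Example SA2 A",
    "MB003": "Example SA2 B",
}

# Hypothetical addresses, each tagged with the meshblock it falls in.
addresses = [
    ("1 First St", "MB001"),
    ("2 First St", "MB001"),
    ("9 Second Ave", "MB002"),
    ("4 Third Rd", "MB003"),
]

def addresses_per_sa2(rows, lookup):
    """Count addresses in each SA2 by joining on the meshblock code."""
    return Counter(lookup[mesh_block] for _, mesh_block in rows)

print(addresses_per_sa2(addresses, MESH_BLOCK_TO_SA2))
```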

### Non ABS Main Structures

We are mostly ingesting these to make it simpler to narrow down the data of interest. Typically, if you're looking at this data, you're doing so within some scope of relevance, such as a local government area or an electoral division.

In [None]:
from lib.tasks.ingest_abs import ingest

await ingest(gnaf_db)

## Ingesting NSW Land Values

First let's just get the CSVs into the database, then we'll break the data up into separate tables, then we'll form links with the GNAF dataset.

#### Documentation on this dataset

The Valuer General website links to documentation on interpreting this data on [this page](https://www.nsw.gov.au/housing-and-construction/land-values-nsw/resource-library/land-value-information-user-guide). I didn't link to the PDF directly, as it is occasionally updated and a direct link is at risk of going stale.

It's useful for understanding the codes and terms used in the bulk data.


### Build the `nsw_valuer_general.raw_entries_lv` table

Here we just load each file from the latest land value publication with minimal changes and a bit of sanitising.
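
Two of the sanitising steps in the cell below are easy to check in isolation: reading the publication date out of the file name, and converting postcodes back to strings. pandas reads an integer column with missing values as floats, so a postcode arrives as `2000.0` rather than `2000`. The sketch below mirrors that logic with two standalone helpers; the function names are hypothetical and introduced only for illustration.

```python
import math
from datetime import datetime

def parse_source_date(file_name: str) -> datetime:
    """Extract the publication date from a name like
    `052_LAND_VALUE_DATA_20241001.csv`."""
    date_str = file_name.split("_")[-1].replace(".csv", "")
    return datetime.strptime(date_str, "%Y%m%d")

def normalise_postcode(value):
    """Turn a float postcode such as 2000.0 into '2000', leaving NaN untouched."""
    if isinstance(value, float) and math.isnan(value):
        return value
    return str(int(value))

print(parse_source_date("052_LAND_VALUE_DATA_20241001.csv"))
print(normalise_postcode(2000.0))
```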

In [6]:
from datetime import datetime
import os
import math
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import text

from lib import notebook_constants as nc

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.raw_entries_lv CASCADE")
    with open('sql/nsw_lv_schema_1_raw.sql', 'r') as f:
        cursor.execute(f.read())
    cursor.close()
            
column_mappings = { **nc.lv_long_column_mappings, **nc.lv_wide_columns_mappings }

def count(table, source = None):
    c = pd.read_sql(f'SELECT count(*) FROM nsw_valuer_general.{table}', gnaf_db.engine())
    time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f'{time} {source and f"{source}, " or ""}{table} {c.iloc[0,0]}')

def process_file(file):
    if not file.endswith("csv"):
        return

    full_file_path = f"_out_zip/{land_value_dis.latest.zip_dst}/{file}"
    try:
        df = pd.read_csv(full_file_path, encoding='utf-8')
    except UnicodeDecodeError:
        # Fallback to ISO-8859-1 encoding if utf-8 fails
        df = pd.read_csv(full_file_path, encoding='ISO-8859-1')

    date_str = file.split('_')[-1].replace('.csv', '')
    
    df.index.name = 'source_file_position'
    df = df.drop(columns=['Unnamed: 34'])
    df = df.rename(columns=column_mappings).reset_index()
    df['source_file_name'] = file
    df['source_date'] = datetime.strptime(date_str, "%Y%m%d")
    df['postcode'] = [(n if math.isnan(n) else str(int(n))) for n in df['postcode']]
    
    try:
        df.to_sql('raw_entries_lv', gnaf_db.engine(), schema='nsw_valuer_general', if_exists='append', index=False)
    finally:
        count('raw_entries_lv', f'Consumed {full_file_path}')

files = sorted(os.listdir(f"_out_zip/{land_value_dis.latest.zip_dst}"))

with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
    futures = [executor.submit(process_file, file) for file in files]
    for future in as_completed(futures):
        future.result()

2024-10-01 15:11:03 Consumed _out_zip/nswvg_lv_01_Oct_2024/052_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 5253
2024-10-01 15:11:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/054_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 8453
2024-10-01 15:11:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/061_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 8453
2024-10-01 15:11:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/043_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 11240
2024-10-01 15:11:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/051_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 16451
2024-10-01 15:11:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/066_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 19136
2024-10-01 15:11:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/070_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 22968
2024-10-01 15:11:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/065_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 28267
2024-10-01 15:11:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/083_LAND_VALUE_DATA_20241001.csv

2024-10-01 15:11:16 Consumed _out_zip/nswvg_lv_01_Oct_2024/050_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 161320


2024-10-01 15:11:19 Consumed _out_zip/nswvg_lv_01_Oct_2024/010_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 193061


2024-10-01 15:11:22 Consumed _out_zip/nswvg_lv_01_Oct_2024/088_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 210924
2024-10-01 15:11:22 Consumed _out_zip/nswvg_lv_01_Oct_2024/042_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 210924
2024-10-01 15:11:30 Consumed _out_zip/nswvg_lv_01_Oct_2024/084_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 244529
2024-10-01 15:11:30 Consumed _out_zip/nswvg_lv_01_Oct_2024/005_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 282169
2024-10-01 15:11:31 Consumed _out_zip/nswvg_lv_01_Oct_2024/008_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 341180
2024-10-01 15:11:32 Consumed _out_zip/nswvg_lv_01_Oct_2024/090_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 367657
2024-10-01 15:11:34 Consumed _out_zip/nswvg_lv_01_Oct_2024/087_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 374564
2024-10-01 15:11:35 Consumed _out_zip/nswvg_lv_01_Oct_2024/118_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 378840
2024-10-01 15:11:36 Consumed _out_zip/nswvg_lv_01_Oct_2024/123_LAND_VALUE_DATA_2

2024-10-01 15:11:40 Consumed _out_zip/nswvg_lv_01_Oct_2024/117_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 436069
2024-10-01 15:11:41 Consumed _out_zip/nswvg_lv_01_Oct_2024/098_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 436069
2024-10-01 15:11:42 Consumed _out_zip/nswvg_lv_01_Oct_2024/109_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 443516
2024-10-01 15:11:43 Consumed _out_zip/nswvg_lv_01_Oct_2024/137_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 451236
2024-10-01 15:11:44 Consumed _out_zip/nswvg_lv_01_Oct_2024/143_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 458628
2024-10-01 15:11:48 Consumed _out_zip/nswvg_lv_01_Oct_2024/102_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 483635
2024-10-01 15:11:51 Consumed _out_zip/nswvg_lv_01_Oct_2024/124_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 520344
2024-10-01 15:11:52 Consumed _out_zip/nswvg_lv_01_Oct_2024/092_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 520344
2024-10-01 15:11:52 Consumed _out_zip/nswvg_lv_01_Oct_2024/149_LAND_VALUE_DATA_2

2024-10-01 15:11:57 Consumed _out_zip/nswvg_lv_01_Oct_2024/151_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 669677
2024-10-01 15:11:59 Consumed _out_zip/nswvg_lv_01_Oct_2024/150_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 708740
2024-10-01 15:12:00 Consumed _out_zip/nswvg_lv_01_Oct_2024/097_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 714240
2024-10-01 15:12:00 Consumed _out_zip/nswvg_lv_01_Oct_2024/158_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 714240


2024-10-01 15:12:01 Consumed _out_zip/nswvg_lv_01_Oct_2024/148_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 731153
2024-10-01 15:12:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/164_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 762777
2024-10-01 15:12:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/157_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 762777
2024-10-01 15:12:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/187_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 762777
2024-10-01 15:12:06 Consumed _out_zip/nswvg_lv_01_Oct_2024/188_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 774912
2024-10-01 15:12:06 Consumed _out_zip/nswvg_lv_01_Oct_2024/199_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 781720
2024-10-01 15:12:06 Consumed _out_zip/nswvg_lv_01_Oct_2024/192_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 781720


2024-10-01 15:12:13 Consumed _out_zip/nswvg_lv_01_Oct_2024/159_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 800634
2024-10-01 15:12:15 Consumed _out_zip/nswvg_lv_01_Oct_2024/209_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 842929
2024-10-01 15:12:15 Consumed _out_zip/nswvg_lv_01_Oct_2024/152_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 842929
2024-10-01 15:12:16 Consumed _out_zip/nswvg_lv_01_Oct_2024/210_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 856274
2024-10-01 15:12:19 Consumed _out_zip/nswvg_lv_01_Oct_2024/081_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 919214
2024-10-01 15:12:25 Consumed _out_zip/nswvg_lv_01_Oct_2024/171_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 951146
2024-10-01 15:12:27 Consumed _out_zip/nswvg_lv_01_Oct_2024/230_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 952792
2024-10-01 15:12:29 Consumed _out_zip/nswvg_lv_01_Oct_2024/222_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 964930
2024-10-01 15:12:31 Consumed _out_zip/nswvg_lv_01_Oct_2024/207_LAND_VALUE_DATA_2

2024-10-01 15:12:47 Consumed _out_zip/nswvg_lv_01_Oct_2024/216_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1304265
2024-10-01 15:12:48 Consumed _out_zip/nswvg_lv_01_Oct_2024/252_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1310670
2024-10-01 15:12:49 Consumed _out_zip/nswvg_lv_01_Oct_2024/244_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1310670
2024-10-01 15:12:51 Consumed _out_zip/nswvg_lv_01_Oct_2024/251_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1316363
2024-10-01 15:12:51 Consumed _out_zip/nswvg_lv_01_Oct_2024/254_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1316363
2024-10-01 15:12:53 Consumed _out_zip/nswvg_lv_01_Oct_2024/247_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1334244
2024-10-01 15:12:54 Consumed _out_zip/nswvg_lv_01_Oct_2024/253_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1334244
2024-10-01 15:12:54 Consumed _out_zip/nswvg_lv_01_Oct_2024/250_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1338834
2024-10-01 15:12:54 Consumed _out_zip/nswvg_lv_01_Oct_2024/255_LAND_VALU

2024-10-01 15:13:00 Consumed _out_zip/nswvg_lv_01_Oct_2024/262_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1461379
2024-10-01 15:13:00 Consumed _out_zip/nswvg_lv_01_Oct_2024/257_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1461379
2024-10-01 15:13:01 Consumed _out_zip/nswvg_lv_01_Oct_2024/218_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1461379
2024-10-01 15:13:04 Consumed _out_zip/nswvg_lv_01_Oct_2024/263_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1523367
2024-10-01 15:13:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/220_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1530112
2024-10-01 15:13:05 Consumed _out_zip/nswvg_lv_01_Oct_2024/265_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1530112
2024-10-01 15:13:06 Consumed _out_zip/nswvg_lv_01_Oct_2024/266_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1541892
2024-10-01 15:13:07 Consumed _out_zip/nswvg_lv_01_Oct_2024/270_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1544489


2024-10-01 15:13:11 Consumed _out_zip/nswvg_lv_01_Oct_2024/269_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1553432
2024-10-01 15:13:16 Consumed _out_zip/nswvg_lv_01_Oct_2024/274_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1630430
2024-10-01 15:13:16 Consumed _out_zip/nswvg_lv_01_Oct_2024/223_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1630430
2024-10-01 15:13:18 Consumed _out_zip/nswvg_lv_01_Oct_2024/224_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1697780
2024-10-01 15:13:18 Consumed _out_zip/nswvg_lv_01_Oct_2024/273_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1712189
2024-10-01 15:13:25 Consumed _out_zip/nswvg_lv_01_Oct_2024/300_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1715522
2024-10-01 15:13:26 Consumed _out_zip/nswvg_lv_01_Oct_2024/301_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1720087


2024-10-01 15:13:29 Consumed _out_zip/nswvg_lv_01_Oct_2024/264_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1761482
2024-10-01 15:13:31 Consumed _out_zip/nswvg_lv_01_Oct_2024/302_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1761482
2024-10-01 15:13:32 Consumed _out_zip/nswvg_lv_01_Oct_2024/272_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1787094


2024-10-01 15:13:38 Consumed _out_zip/nswvg_lv_01_Oct_2024/260_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1836184
2024-10-01 15:13:40 Consumed _out_zip/nswvg_lv_01_Oct_2024/275_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1861077
2024-10-01 15:13:42 Consumed _out_zip/nswvg_lv_01_Oct_2024/526_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1868074


2024-10-01 15:13:43 Consumed _out_zip/nswvg_lv_01_Oct_2024/261_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1927427
2024-10-01 15:13:43 Consumed _out_zip/nswvg_lv_01_Oct_2024/511_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1927427
2024-10-01 15:13:46 Consumed _out_zip/nswvg_lv_01_Oct_2024/528_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1935882
2024-10-01 15:13:47 Consumed _out_zip/nswvg_lv_01_Oct_2024/537_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1942270
2024-10-01 15:13:50 Consumed _out_zip/nswvg_lv_01_Oct_2024/538_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1946180
2024-10-01 15:13:55 Consumed _out_zip/nswvg_lv_01_Oct_2024/267_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 1997702
2024-10-01 15:13:58 Consumed _out_zip/nswvg_lv_01_Oct_2024/560_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 2021683
2024-10-01 15:13:59 Consumed _out_zip/nswvg_lv_01_Oct_2024/529_LAND_VALUE_DATA_20241001.csv, raw_entries_lv 2021683
2024-10-01 15:14:00 Consumed _out_zip/nswvg_lv_01_Oct_2024/303_LAND_VALU

### Break CSV data into separate relations

To break the data up into representations that are more efficient and easier to query, we're going to run a series of queries against the raw entries and use the results to populate the tables we care about.
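
The general idea of splitting one wide raw table into several narrower relations can be sketched as follows. The contents of `sql/nsw_lv_from_raw.sql` are not shown in this notebook, so the table and column names here are a hypothetical, much-reduced stand-in, and the example uses an in-memory SQLite database rather than PostgreSQL so that it runs without the container.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, much-reduced stand-in for the raw land value table.
    CREATE TABLE raw_entries (district_code INTEGER, district_name TEXT, property_id INTEGER);
    INSERT INTO raw_entries VALUES
        (1, 'EXAMPLE DISTRICT A', 100),
        (1, 'EXAMPLE DISTRICT A', 101),
        (2, 'EXAMPLE DISTRICT B', 200);

    CREATE TABLE district (district_code INTEGER PRIMARY KEY, district_name TEXT);
    CREATE TABLE property (
        property_id INTEGER PRIMARY KEY,
        district_code INTEGER REFERENCES district(district_code)
    );

    -- Each distinct district is stored once; properties point at it by code.
    INSERT INTO district SELECT DISTINCT district_code, district_name FROM raw_entries;
    INSERT INTO property SELECT property_id, district_code FROM raw_entries;
""")

district_count = conn.execute("SELECT COUNT(*) FROM district").fetchone()[0]
property_count = conn.execute("SELECT COUNT(*) FROM property").fetchone()[0]
print(district_count, property_count)
```

Storing each repeated value once and referring to it by key is what makes the resulting tables both smaller and simpler to filter.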

In [7]:
from datetime import datetime
import os
import math
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import text

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.source_file CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.source CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.district CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.suburb CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.street CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.property CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.property_description CASCADE")
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.valuations CASCADE")
    
    with open('sql/nsw_lv_schema_2_structure.sql', 'r') as f:
        cursor.execute(f.read())
        
    with open('sql/nsw_lv_from_raw.sql', 'r') as f:
        cursor.execute(f.read())
        
    cursor.close()
    
count('district')
count('suburb')
count('street')
count('property')
count('property_description')
count('valuations')

2024-10-01 15:16:17 district 128
2024-10-01 15:16:17 suburb 5074
2024-10-01 15:16:17 street 128458
2024-10-01 15:16:17 property 2703626
2024-10-01 15:16:17 property_description 2703626
2024-10-01 15:16:18 valuations 13518130


### Parse contents of the property description

The `property_description` from the original Valuer General data contains a lot of information, the most important of which is the land parcel or `lot/plan` information. There is other information in there as well.
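
As a rough picture of what extracting `lot/plan` identifiers involves, the sketch below pulls them out with a regular expression. It is not the implementation of `parse_land_parcel_ids`, which lives in `lib.nsw_vg.property_description` and is not shown here. The description format (an optional `PT` marker, a lot number, and a `DP` or `SP` plan) is a simplifying assumption, and `LandParcel`, `PARCEL_RE`, and `parse_parcels` are hypothetical names.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class LandParcel:
    id: str     # e.g. "1/DP123456"
    part: bool  # True when the description marks only part of the lot

# Simplified pattern: an optional "PT" marker, a lot number, then a plan
# such as DP123456 or SP1234. Real descriptions have more variants.
PARCEL_RE = re.compile(r"\b(PT\s+)?(\d+)/((?:DP|SP)\d+)\b")

def parse_parcels(description: str) -> list[LandParcel]:
    """Return every lot/plan pair found in a property description."""
    return [
        LandParcel(id=f"{lot}/{plan}", part=bool(pt))
        for pt, lot, plan in PARCEL_RE.findall(description)
    ]

print(parse_parcels("1/DP123456 PT 2/SP7890"))
```

A single description can yield several parcels, which is consistent with `land_parcel_link` ending up with more rows than `property`.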

In [8]:
import numpy as np
import pandas as pd
from lib.nsw_vg.property_description import parse_land_parcel_ids

engine = gnaf_db.engine()

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.land_parcel_link")
    with open('sql/nsw_lv_schema_3_property_description_meta_data.sql', 'r') as f:
        cursor.execute(f.read())
    cursor.close()

def land_parcels(desc):
    desc, parcels = parse_land_parcel_ids(desc)
    return parcels 

query = "SELECT * FROM nsw_valuer_general.property_description"
for df_chunk in pd.read_sql(query, engine, chunksize=10000):
    df_chunk = df_chunk.dropna(subset=['property_description'])
    df_chunk['parcels'] = df_chunk['property_description'].apply(land_parcels)
    df_chunk_ex = df_chunk.explode('parcels')
    df_chunk_ex = df_chunk_ex.dropna(subset=['parcels'])
    df_chunk_ex['land_parcel_id'] = df_chunk_ex['parcels'].apply(lambda p: p.id)
    df_chunk_ex['part'] = df_chunk_ex['parcels'].apply(lambda p: p.part)
    df_chunk_ex = df_chunk_ex.drop(columns=['property_description', 'parcels'])
    df_chunk_ex.to_sql(
        'land_parcel_link',
        con=engine,
        schema='nsw_valuer_general',
        if_exists='append',
        index=False,
    )

with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    for t in ['property', 'land_parcel_link']:
        cursor.execute(f'SELECT COUNT(*) FROM nsw_valuer_general.{t}')
        # Named `row_count` so it does not shadow the `count` helper defined earlier.
        row_count = cursor.fetchone()[0]
        print(f"Table nsw_valuer_general.{t} has {row_count} rows")


Table nsw_valuer_general.property has 2703626 rows
Table nsw_valuer_general.land_parcel_link has 4247405 rows


### Get rid of `raw_entries_lv` table

We no longer need the raw entries table; deleting it should make the database a bit more efficient in terms of storage.

In [9]:
with gnaf_db.connect() as conn:
    cursor = conn.cursor()
    if GLOBAL_FLAGS['drop_raw_nsw_valuer_general_entries']:
        cursor.execute("DROP TABLE IF EXISTS nsw_valuer_general.raw_entries_lv")
        print("Dropping raw entries table")
    else:
        print("Keeping raw entries table")
    cursor.close()

Dropping raw entries table


## Ingest NSW Sales data

This data is also from the NSW Valuer General.

#### Documentation on this dataset

You can find that [here](https://www.nsw.gov.au/housing-and-construction/land-values-nsw/resource-library/property-sales-data-guide).

### Build the `nsw_valuer_general.raw_entries_ps` table

First let's populate the raw sales information into the `nsw_valuer_general.raw_entries_ps` table.

## GNAF Ingestion

Here we ingest the GNAF dataset. This will take a while.

In [None]:
from lib.gnaf.ingestion import ingest
ingest(gnaf_dis.publication, gnaf_db)

2024-10-01 15:17:32 Populating from Authority_Code_STREET_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_ADDRESS_CHANGE_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_GEOCODE_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_LOCALITY_CLASS_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_STREET_SUFFIX_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_LOCALITY_ALIAS_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_STREET_CLASS_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_ADDRESS_ALIAS_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_STREET_LOCALITY_ALIAS_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_PS_JOIN_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_MB_MATCH_CODE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_FLAT_TYPE_AUT_psv.psv
2024-10-01 15:17:32 Populating from Authority_Code_GEO

## Done

We've now built up the dataset. Let's analyse what we've got and show the contents of the database.

In [None]:
with gnaf_db.connect() as conn:
    cursor = conn.cursor()

    for schema in ['nsw_valuer_general', 'gnaf', 'abs_main_structures', 'non_abs_main_structures']:
        # Get the list of all tables
        cursor.execute(f"""
            SELECT table_name
            FROM information_schema.tables
            WHERE table_schema = '{schema}'
        """)
        tables = cursor.fetchall()
    
        # Get row count for each table
        for table in tables:
            cursor.execute(f'SELECT COUNT(*) FROM {schema}.{table[0]}')
            # Named `row_count` so it does not shadow the `count` helper defined earlier.
            row_count = cursor.fetchone()[0]
            print(f"Table {schema}.{table[0]} has {row_count} rows")
    
    cursor.close()