# Dataset ingestion

This Jupyter notebook ingests the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) from [data.gov.au](https://data.gov.au). It also downloads the [land values for NSW][nswlv] and the ABS shapefiles.

It loads all of this data into a PostgreSQL database in a Docker container, treating it like a disposable SQLite data store.

Here we ingest all the data needed to assess land by its land value and to filter it by address information.

### The Steps

1. Download static assets and datasets
2. Set up a Docker container running PostgreSQL with GIS capabilities.
3. Ingest the [ABS shape files][abssf]
4. Ingest the latest [NSW valuer general land values][nswlv].
5. Ingest the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) dataset
6. Link NSW Valuer General data with GNAF dataset

[gnaf]: https://data.gov.au/data/dataset/geocoded-national-address-file-g-naf
[nswlv]: https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php
[abssf]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### Note

- Make sure docker is running first.

### Warning

Do not connect this to another database unless you've taken the time to update it, as it will drop the existing database. I suggest instead taking what you need from this script and disregarding the rest. DO NOT USE DATABASE CREDENTIALS HERE FOR ANY OTHER STORE (especially anything with drop permissions).

It also executes SQL from a zip file downloaded from an external source.
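
As a precaution in the spirit of the warning above, a guard like the following could run before anything is dropped. This is only a sketch: `assert_disposable`, `LOCAL_HOSTS`, and the plain-dict config shape are hypothetical and are not the structure of `DB_INSTANCE_1_CONFIG`.

```python
# Hypothetical guard. The config is assumed to be a plain dict with a
# 'host' key, which is NOT necessarily how this project stores it.
LOCAL_HOSTS = {'localhost', '127.0.0.1', '::1'}


def assert_disposable(config: dict) -> None:
    """Raise if the target does not look like a local throwaway database."""
    host = config.get('host')
    if host not in LOCAL_HOSTS:
        raise RuntimeError(
            f'refusing to drop a database on non-local host {host!r}'
        )


# Passes silently for a local instance.
assert_disposable({'host': 'localhost', 'dbname': 'gnaf_db'})
```

A check like this does not make the notebook safe to point at a shared database; it only catches the most obvious mistake.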


## Configuration

These are the fields to change if you wish to configure how the data is ingested.

In [1]:
from lib.service.docker.defaults import INSTANCE_1_IMAGE_CONF, INSTANCE_1_CONTAINER_CONF
from lib.service.database.defaults import DB_INSTANCE_1_CONFIG

GLOBAL_FLAGS = {
    # If you mark this as true, the table `nsw_valuer_general.raw_entries`
    # will be dropped. If you have space limitations and no desire to debug
    # the data, then dropping this makes sense. If you wish to debug some
    # values, then keeping this around may make some sense.
    'drop_raw_nsw_valuer_general_entries': True,
    'reinitialise_container': True,
}

db_service_config = DB_INSTANCE_1_CONFIG
docker_image_conf = INSTANCE_1_IMAGE_CONF
docker_container_conf = INSTANCE_1_CONTAINER_CONF

## Download Static Files

Here we download the static files and fetch the most recently published land values from the Valuer General's website.

In [2]:
import logging
from lib.service.io import IoService
from lib.tasks.fetch_static_files import initialise, get_session

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

io_service = IoService.create(None)
async with get_session(io_service) as session:
    environment = await initialise(io_service, session)

land_value_dis = environment.land_value
w_sale_price = environment.sale_price_weekly
a_sale_price = environment.sale_price_annual
gnaf_dis = environment.gnaf

2024-10-28 10:11:21,232 - INFO - Checking Target "abs_main_structures.zip"
2024-10-28 10:11:21,233 - INFO - Checking Target "non_abs_shape.zip"
2024-10-28 10:11:21,233 - INFO - Checking Target "g-naf_aug24_allstates_gda2020_psv_1016.zip"
2024-10-28 10:11:21,233 - INFO - Checking Target "nswvg_lv_01_Oct_2024.zip"
2024-10-28 10:11:21,234 - INFO - Checking Target "nswvg_wps_01_Jan_2024.zip"
2024-10-28 10:11:21,235 - INFO - Checking Target "nswvg_wps_08_Jan_2024.zip"
2024-10-28 10:11:21,235 - INFO - Checking Target "nswvg_wps_15_Jan_2024.zip"
2024-10-28 10:11:21,235 - INFO - Checking Target "nswvg_wps_22_Jan_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_29_Jan_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_05_Feb_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_12_Feb_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_19_Feb_2024.zip"
2024-10-28 10:11:21,237 - INFO - Checking Target "nswvg_wps_26_Feb_202
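
The `Checking Target` lines above suggest that each archive is checked before it is fetched. A minimal sketch of that download-if-missing pattern is below. It is an assumption about the general technique, not the code in `lib.tasks.fetch_static_files`; `ensure_target` and its `fetch` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def ensure_target(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write the target only if it is missing; return True when fetched."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write under a temporary name first so an interrupted download
    # never leaves a partial file under the final name.
    tmp = path.with_name(path.name + '.part')
    tmp.write_bytes(fetch())
    tmp.replace(path)
    return True
```

Because the function returns early when the file exists, re-running the cell costs one existence check per target rather than a fresh download.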

## Create Container with Database

Here we create a Docker container from an image that builds on the postgres image and installs a few extensions.

### Note

This notebook is designed to be run more than once, so it throws away any existing container and database before creating new ones. After removing any container that uses the same identifier, it creates a new one, pulling the relevant image if it's not already installed. It then waits until the postgres instance is live before creating the database.
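
The "wait until postgres is live" step described above is essentially a retry loop. The sketch below shows the idea under stated assumptions: `wait_until_ready` and `probe` are hypothetical names, and it catches the builtin `ConnectionError`, whereas a real probe would need to catch whatever error the database driver raises. It is not the implementation of `db_service.wait_till_running()`.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_ready(
    probe: Callable[[], Awaitable[None]],
    interval: float = 0.5,
    timeout: float = 30.0,
) -> int:
    """Call `probe` until it stops raising; return the number of attempts."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            await probe()
            return attempts
        except ConnectionError:
            # The server is still starting; retry until the deadline passes.
            if loop.time() >= deadline:
                raise TimeoutError('database did not come up in time')
            await asyncio.sleep(interval)
```

In a notebook cell this would be awaited directly, for example `await wait_until_ready(my_probe)`. Failed attempts while the server boots are expected, which is one plausible reason connection errors appear in the cell output before the database is ready.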

In [3]:
from lib.pipeline.gnaf.init_schema import init_target_schema
from lib.service.docker import DockerService
from lib.service.database import DatabaseService

docker_service = DockerService.create()

# Rebuild the image and container from scratch when requested.
if GLOBAL_FLAGS['reinitialise_container']:
    image = docker_service.create_image(docker_image_conf)
    image.prepare()

    # Remove any existing container with the same identifier,
    # then prepare and start a fresh one.
    container = docker_service.create_container(image, docker_container_conf)
    container.clean()
    container.prepare(db_service_config)
    container.start()
else:
    print('skipping container initialisation')

# Wait for the postgres instance to be live before using it.
db_service = DatabaseService.create(db_service_config, 32)
await db_service.wait_till_running()
await db_service.open()

# Create the GNAF tables using the scripts shipped with the publication.
if GLOBAL_FLAGS['reinitialise_container']:
    await init_target_schema(gnaf_dis.publication, io_service, db_service)
else:
    print('skipping DB initialisation')
    raise Exception()

	This probably means the server terminated abnormally
	before or while processing the request.

dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword


2024-10-28 10:11:27,080 - INFO - running ./_out_zip/g-naf_aug24_allstates_gda2020_psv_1016/G-NAF/Extras/GNAF_TableCreation_Scripts/create_tables_ansi.sql
2024-10-28 10:11:27,103 - INFO - running ./_out_zip/g-naf_aug24_allstates_gda2020_psv_1016/G-NAF/Extras/GNAF_TableCreation_Scripts/add_fk_constraints.sql
2024-10-28 10:11:27,145 - INFO - running sql/move_gnaf_to_schema.sql


## Create Schema

This initialises the schema used by the different tables.

In [4]:
from lib.tasks.schema.update import update_schema, UpdateSchemaConfig
from lib.tooling.schema.config import ns_dependency_order

await update_schema(
    UpdateSchemaConfig(
        packages=ns_dependency_order,
        range=None,
        apply=True,
    ),
    db_service,
    io_service,
)

2024-10-28 10:11:27,243 - INFO - initalising nsw_vg db schema
2024-10-28 10:11:27,243 - INFO - Command.Create(ns='meta', range=None, dryrun=False, omit_foreign_keys=False)
2024-10-28 10:11:27,262 - INFO - Command.Create(ns='abs', range=None, dryrun=False, omit_foreign_keys=False)
2024-10-28 10:11:27,328 - INFO - Command.Create(ns='nsw_lrs', range=None, dryrun=False, omit_foreign_keys=False)
2024-10-28 10:11:27,359 - INFO - Command.Create(ns='nsw_gnb', range=None, dryrun=False, omit_foreign_keys=False)
2024-10-28 10:11:27,377 - INFO - Command.Create(ns='nsw_planning', range=None, dryrun=False, omit_foreign_keys=False)
2024-10-28 10:11:27,381 - INFO - Command.Create(ns='nsw_vg', range=None, dryrun=False, omit_foreign_keys=False)
  'legacy_vg_2011',
  'ep&a_2006',
  'unknown'
)' contains unsupported syntax. Falling back to parsing as a 'Command'.
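
The log above creates the namespaces one at a time, and the name `ns_dependency_order` suggests they are sorted so that each is created after the ones it depends on. A minimal sketch of deriving such an order with the standard library follows. The dependency edges here are made up for illustration and are not read from the project.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each namespace maps to the namespaces it
# references. These edges are illustrative only.
dependencies = {
    'meta': set(),
    'abs': {'meta'},
    'nsw_lrs': {'meta'},
    'nsw_vg': {'meta', 'nsw_lrs'},
}

# Dependencies come first, so this order is safe for CREATE.
create_order = list(TopologicalSorter(dependencies).static_order())

# Dropping needs the reverse, so dependants go before what they reference.
drop_order = list(reversed(create_order))
```

The same ordering explains why foreign keys can be omitted during a bulk load and added afterwards: the constraint only has to hold once every referenced table is populated.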


## Consume the ABS Shapefiles

The [ABS provides a number of shape files][all abs shape files]; we're going to focus on two main sets of shapes. The first is the **ABS Main Structures**, which covers SA1, SA2, SA3 and SA4 along with greater capital cities, meshblocks, and states. The second is the **Non ABS Main Structures**, which covers electoral divisions, suburbs, post codes and so on.

[all abs shape files]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### ABS Main Structures 

We want to visualise any address or region we look up in the GNAF dataset. The ABS has a few different geographic groupings that the data can be visualised against, and each address in the GNAF dataset has a meshblock id. The meshblock is the smallest block the ABS breaks addresses up into, and meshblocks are what SA1s, SA2s, SA3s and SA4s are built from.

This dataset is pretty useful for visualising the GNAF data for that reason.
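
To make the meshblock link concrete, here is a small sketch of rolling addresses up to a larger area through their meshblock. All identifiers and the lookup tables are made up for illustration; in the database this would be a join across the `abs` tables rather than Python dictionaries.

```python
from collections import Counter

# Made-up identifiers, purely for illustration.
meshblock_to_sa1 = {'mb-1': 'sa1-A', 'mb-2': 'sa1-A', 'mb-3': 'sa1-B'}
sa1_to_sa2 = {'sa1-A': 'sa2-X', 'sa1-B': 'sa2-Y'}

addresses = [
    {'address': '1 Example St', 'meshblock': 'mb-1'},
    {'address': '2 Example St', 'meshblock': 'mb-2'},
    {'address': '9 Sample Rd', 'meshblock': 'mb-3'},
]


def addresses_per_sa2(addresses, meshblock_to_sa1, sa1_to_sa2):
    """Count addresses in each SA2 by following meshblock -> SA1 -> SA2."""
    counts = Counter()
    for entry in addresses:
        sa1 = meshblock_to_sa1[entry['meshblock']]
        counts[sa1_to_sa2[sa1]] += 1
    return counts


counts = addresses_per_sa2(addresses, meshblock_to_sa1, sa1_to_sa2)
```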

### Non ABS Main Structures

We are mostly ingesting these to make it simpler to narrow down the data of interest. Typically, if you're looking at this data, you're doing so within some scope of relevance, such as a local government area or an electoral division.

In [5]:
from lib.tasks.ingest_abs import ingest_all
from lib.pipeline.abs.defaults import ABS_MAIN_STRUCTURES, NON_ABS_MAIN_STRUCTURES
from lib.pipeline.abs.config import IngestionConfig, WorkerConfig, WorkerLogConfig

await ingest_all(
    IngestionConfig(
        ingest_sources=[ABS_MAIN_STRUCTURES, NON_ABS_MAIN_STRUCTURES],
        worker_count=4,
        worker_config=WorkerConfig(
            db_config=db_service_config,
            db_connections=2,
            log_config=WorkerLogConfig(
                level=logging.INFO,
                format='%(asctime)s - %(levelname)s - %(message)s',
                datefmt=None,
            ),
        ),
    ),
    db_service,
    io_service,
)

2024-10-28 10:11:27,425 - INFO - Command.Drop(ns='abs', range=None, dryrun=False, cascade=False)
2024-10-28 10:11:27,449 - INFO - Command.Create(ns='abs', range=None, dryrun=False, omit_foreign_keys=True)
[2]2024-10-28 10:11:44,862 - INFO - Populated abs.federal_electoral_division_2021 with 170/170 rows.
[2]2024-10-28 10:11:44,986 - INFO - Populated abs.state_electoral_division_2024 with 452/452 rows.
[2]2024-10-28 10:11:45,014 - INFO - Populated abs.lga_2022 with 566/566 rows.
[2]2024-10-28 10:11:45,015 - INFO - Populated abs.lga_2021 with 566/566 rows.


dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword


[0]2024-10-28 10:11:46,091 - INFO - Populated abs.state with 10/10 rows.
[0]2024-10-28 10:11:46,119 - INFO - Populated abs.gccsa with 35/35 rows.
[0]2024-10-28 10:11:46,460 - INFO - Populated abs.sa4 with 108/108 rows.
[0]2024-10-28 10:11:46,784 - INFO - Populated abs.sa3 with 359/359 rows.
[0]2024-10-28 10:11:47,154 - INFO - Populated abs.sa2 with 2473/2473 rows.


dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword


[3]2024-10-28 10:11:51,387 - INFO - Populated abs.lga_2024 with 566/566 rows.
[3]2024-10-28 10:11:51,387 - INFO - Populated abs.lga_2023 with 566/566 rows.
[3]2024-10-28 10:11:52,116 - INFO - Populated abs.post_code with 2644/2644 rows.
[3]2024-10-28 10:11:52,123 - INFO - Populated abs.dzn with 9329/9329 rows.


dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword


[1]2024-10-28 10:12:30,157 - INFO - Populated abs.state_electoral_division_2022 with 452/452 rows.
[1]2024-10-28 10:12:30,293 - INFO - Populated abs.state_electoral_division_2021 with 452/452 rows.
[1]2024-10-28 10:12:32,419 - INFO - Populated abs.localities with 15353/15353 rows.
[1]2024-10-28 10:12:32,648 - INFO - Populated abs.sa1 with 61845/61845 rows.
[1]2024-10-28 10:12:40,212 - INFO - Populated abs.meshblock with 368286/368286 rows.


dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword


2024-10-28 10:12:40,478 - INFO - Command.AddForeignKeys(ns='abs', range=None, dryrun=False)


## Ingesting NSW Land Values

First let's just get the CSVs into the database, then we'll break them up into separate tables, and then we'll form links with the GNAF dataset.

#### Documentation on this dataset

The Valuer General website links to documentation on interpreting the data on [this page](https://www.nsw.gov.au/housing-and-construction/land-values-nsw/resource-library/land-value-information-user-guide). I didn't link to the PDF directly as it is occasionally updated and a direct link risks going stale.

It's useful for getting the meaning behind the codes and terms used in the bulk data.

### Steps

1. **Build the `nsw_valuer_general.raw_entries_lv` table**: Here we are
   just loading each file from the latest land value publication
   with minimal changes and a bit of sanitising.
2. **Break CSV data into separate relations**: To break up the data
   into more efficient representations that are easier to query,
   we're going to perform a series of queries against
   the GNAF data before using it to populate the tables we care about.
3. **Parse contents of the property description**: The `property_description`
   from the original valuer general data contains a lot of information, the
   most important of which is the land parcel or `lot/plan` information.
   There is other information in there as well.
4. If specified, drop the raw entries once they have been consumed (the
   default is to do exactly that).
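
Step 3 above can be sketched with a regular expression. The token shape used here is an assumption made purely for illustration: parcels written as `lot/plan` or `lot/section/plan`, such as `12/DP1234`. The real format is described in the user guide linked above and may differ, and `Parcel`, `PARCEL_RE` and `parse_parcels` are hypothetical names rather than this project's parser.

```python
import re
from typing import NamedTuple, Optional


class Parcel(NamedTuple):
    lot: str
    section: Optional[str]
    plan: str


# Assumed token shape: lot/plan or lot/section/plan, e.g. "12/DP1234".
PARCEL_RE = re.compile(r'\b([A-Z0-9]+)/(?:([A-Z0-9]+)/)?([A-Z]{0,2}\d+)\b')


def parse_parcels(description: str) -> list[Parcel]:
    """Extract every parcel-shaped token from a property description."""
    return [
        # An unmatched optional group comes back as '', so map it to None.
        Parcel(lot, section or None, plan)
        for lot, section, plan in PARCEL_RE.findall(description.upper())
    ]


parcels = parse_parcels('12/DP1234, 3/A/SP567')
```

Anything in the description that does not match the pattern is ignored by this sketch, so a real parser would also need to account for the remaining text.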


In [6]:
from lib.service.clock import ClockService
from lib.pipeline.nsw_vg.property_sales.ingestion import NSW_VG_PS_INGESTION_CONFIG
from lib.tasks.nsw_vg.ingest import ingest_nswvg
from lib.tasks.nsw_vg.ingest import NswVgIngestionConfig, NswVgIngestionDedupConfig, NswVgLandValueIngestionConfig
from lib.tasks.nsw_vg.ingest_property_sales import PropertySaleIngestionConfig, ChildConfig, ParentConfig

await ingest_nswvg(
    environment,
    ClockService(),
    db_service,
    io_service,
    NswVgIngestionConfig(
        load_raw_land_values=NswVgLandValueIngestionConfig(
            truncate_raw_earlier=False,
        ),
        load_raw_property_sales=PropertySaleIngestionConfig(
            worker_count=6,
            worker_config=ChildConfig(
                db_config=db_service_config,
                db_pool_size=4,
                db_batch_size=1000,
                file_limit=None,
                ingestion_config=NSW_VG_PS_INGESTION_CONFIG,
                parser_chunk_size=8 * 2 ** 10,
                log_config=None,
            ),
            parent_config=ParentConfig(
                target_root_dir='./_out_zip',
                publish_min=None,
                publish_max=None,
                download_min=None,
                download_max=None,
            ),
        ),
        deduplicate=NswVgIngestionDedupConfig(
            run_from=1,
            run_till=6,
        ),
        load_parcels=False,
    ),
)

2024-10-28 10:12:46,282 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/052_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:46,308 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/043_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:47,853 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/230_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:48,173 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/233_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:49,734 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/243_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:49,858 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/235_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:50,734 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/066_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:50,884 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/083_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:50,909 - INFO - Parsed ./_out_zip/nswvg_lv_01_Oct_2024/061_LAND_VALUE_DATA_20241001.csv
2024-10-28 10:12:50,971 - INFO - Parsed ./_out_zip/nswv

dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword
dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword
dbname=gnaf_db port=5432 user=postgres host=localhost password=throwAwayPassword


CancelledError: 

## GNAF Ingestion

Here we ingest the GNAF dataset. This will take a while.

In [None]:
from lib.tasks.ingest_gnaf import ingest_gnaf

await ingest_gnaf(gnaf_dis.publication, db_service)

## Done

We've now built up the dataset, so let's analyse what we have and show the contents of the database.

In [None]:
from lib.tasks.schema.count import count
await count(db_service_config, ns_dependency_order)