# Dataset ingestion

This Jupyter notebook ingests the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) from [data.gov.au](https://data.gov.au). It also downloads the [land values for NSW][nswlv] and the [ABS shapefiles][abssf].

It loads all of this data into a PostgreSQL database in a docker container, treating it like a disposable SQLite data store.

Here we ingest all the data needed to assess land by its land value, and to filter it by address information.

### The Steps

1. Download static assets and datasets
2. Set up a docker container running PostgreSQL with GIS capabilities.
3. Ingest the [ABS shape files][abssf]
4. Ingest the latest [NSW valuer general land values][nswlv].
5. Ingest the [Geocoded National Address File][gnaf] ([GNAF][gnaf]) dataset
6. Link NSW Valuer General data with GNAF dataset

[gnaf]: https://data.gov.au/data/dataset/geocoded-national-address-file-g-naf
[nswlv]: https://www.valuergeneral.nsw.gov.au/land_value_summaries/lv.php
[abssf]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

### Note

- Make sure docker is running first.

### Warning

Do not connect this to another database unless you've taken the time to update this, as it'll drop the existing database. I suggest instead taking what you need from this script and disregarding the rest. DO NOT USE DATABASE CREDENTIALS HERE FOR ANY OTHER STORE (especially anything with drop permissions).

It also executes SQL from a zip file downloaded from an external source.


## Configuration

These are some fields you can change if you wish to configure how the data is ingested.

In [1]:
from lib.service.docker.defaults import INSTANCE_1_IMAGE_CONF, INSTANCE_1_CONTAINER_CONF
from lib.service.database.defaults import DB_INSTANCE_1_CONFIG

GLOBAL_FLAGS = {
    # If you mark this as true, the table `nsw_valuer_general.raw_entries`
    # will be dropped. If you have space limitations and no desire to debug
    # the data, then dropping this makes sense. If you wish to debug some values,
    # then keeping this around may make some sense.
    'drop_raw_nsw_valuer_general_entries': True,
    'reinitialise_container': True,
}

db_service_config = DB_INSTANCE_1_CONFIG
docker_image_conf = INSTANCE_1_IMAGE_CONF
docker_container_conf = INSTANCE_1_CONTAINER_CONF

## Steps

### Download Static Files

First we'll download all the static files. This has been set up to minimise network activity. This is done by:

1. Not fetching files that are already on disk.
2. Caching any HTML that is scraped to find links. The cached HTML is kept for no more than about a week.

This means the entire process can be run offline once it has run online successfully.
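The "skip what is already on disk" rule can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual downloader: `fetch_if_missing` and its `fetch` callback are hypothetical names, and the callback stands in for whatever performs the real HTTP request.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(target: Path, fetch: Callable[[], bytes]) -> bool:
    """Write the bytes returned by `fetch()` to `target` unless it already exists.

    Returns True when a download happened and False when the file on disk
    was reused. `fetch` is a hypothetical callback wrapping the real request.
    """
    if target.exists():
        # Already on disk: no network activity needed.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch())
    return True
```

Because the network call only happens inside `fetch`, a second run over the same targets never touches the network, which is what makes an offline rerun possible.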

### Initialise Docker Container

> Note: If it doesn't go without saying, rerunning this will basically nuke any changes you've made to the schema.

Next we create a docker image and container, using the constants `docker_image_conf` and `docker_container_conf`. The
image we're creating is defined in `./config/dockerfiles`; it's basically the latest official `postgres` image that runs on
Apple silicon, along with extensions for `GIS` (as of October 28th). This stage does the following:

1. Nuke any container that matches the one specified in `docker_container_conf`, _if it exists_.
2. If the image specified in `docker_image_conf` doesn't already exist, create it.
3. Create a new container using that image.
4. Wait for the database to start accepting connections before proceeding.
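The final step, waiting for the database to accept connections, is usually a polling loop. The sketch below shows one way to write it, under the assumption that readiness can be checked with a callable. `wait_until_ready` and `is_ready` are hypothetical names, not functions from this repository; in practice `is_ready` would attempt a connection and return `False` on failure.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll `is_ready` until it returns True or the attempts run out.

    `is_ready` is a hypothetical probe, e.g. a function that tries to open
    a database connection and reports whether it succeeded.
    """
    for attempt in range(attempts):
        if is_ready():
            return True
        # Only sleep when another attempt will follow.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Bounding the number of attempts means a container that never starts produces a clear failure instead of hanging the notebook.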


### Initialise the schema

You can do this step from the CLI by running the following:

```
python -m lib.tasks.schema.update --packages meta,abs,nsw_gnb,nsw_lrs,nsw_planning,nsw_vg --instance 1
```

What are the different symbols separated by commas? Those "packages" are just directories in `./sql`, most
of which have one schema, although some have multiple, mostly for staging prior to processing. For
the latest description of the different packages, I would suggest jumping to `./sql/README.md` and reading
through the descriptions of the different subdirectories.

Basically though, the database tries to place data in the schema closest to the government agency that is
actually responsible for maintaining that data, even if that data didn't come from them directly. The
main motivation behind that is to have a single place for repeating data, and that made sense to me.
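To make the package-to-directory relationship concrete, here is a small sketch of how a `--packages` value could be mapped onto directories under `./sql`. The `package_dirs` helper is hypothetical and only illustrates the naming convention described above; it is not how `lib.tasks.schema.update` is implemented.

```python
from pathlib import Path


def package_dirs(packages: str, sql_root: Path = Path("./sql")) -> list[Path]:
    """Map a comma separated package list onto directories under `sql_root`.

    Hypothetical helper: it only reflects the convention that each package
    name is a directory in `./sql`.
    """
    names = [name.strip() for name in packages.split(",") if name.strip()]
    return [sql_root / name for name in names]


for directory in package_dirs("meta,abs,nsw_gnb,nsw_lrs,nsw_planning,nsw_vg"):
    print(directory)
```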

### Consume the ABS Shapefiles

You can do this from the CLI by running the following:

```
python -m lib.tasks.ingest_abs --instance 1 --workers 4
```

The [ABS provides a number of shape files][all abs shape files]; we're going to focus on 2 main sets of shapes. The first is the **ABS Main Structures**, which covers things like SA1, 2, 3 & 4 along with greater cities, meshblocks, and states. The second is the **Non ABS Main Structures**, which covers things like electoral divisions, suburbs, post codes, etc.

[all abs shape files]: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files

#### ABS Main Structures 

We want to visualise any address or region we look up in the GNAF dataset. The ABS has a few different geographic groups which we can visualise the data against, but each address in the GNAF dataset has a meshblock id. The meshblock is the smallest area the ABS defines, and it is the building block for the SA1, SA2, SA3 and SA4 regions.

This dataset is pretty useful for visualising the GNAF data for that reason.

#### Non ABS Main Structures

We are mostly ingesting these to make it simpler to narrow down the data of interest. Typically, if you're looking at this data, you're doing it within some scope of relevance, such as a local government area or an electoral division.

### Ingesting NSW Land Values & PSI

Next we ingest the NSW land values and property sales data.

#### Documentation on this dataset

The valuer general website has a link to documentation on interpreting that data on [this page](https://www.nsw.gov.au/housing-and-construction/land-values-nsw/resource-library/land-value-information-user-guide). I didn't link to the PDF directly, as it is occasionally updated and a direct link is at risk of going stale.

It's useful for getting the meaning behind the codes and terms used in the bulk data.

#### Steps

1. **Build the `nsw_valuer_general.raw_entries_lv` table**: Here we are
   just loading each file from the latest land value publication
   with minimal changes and a bit of sanitising.
2. **Break CSV data into separate relations**: To break up the data
   into more efficient representations that are easier to query,
   we're going to perform a series of queries against
   the GNAF data before using it to populate the tables we care about.
3. **Parse contents of the property description**: The `property_description`
   from the original valuer general data contains a lot of information, the
   most important of which is the land parcel or `lot/plan` information.
   There is other information in there as well.
4. If specified, drop the raw entries consumed in step 1 (though the default is
   to do exactly that).
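As an illustration of step 3, the sketch below pulls `lot/plan` style fragments out of a free text description with a regular expression. The accepted shapes (`LOT/PLAN` and `LOT/SECTION/PLAN`, with an optional `DP` or `SP` prefix on the plan) are assumptions made for this example, and `find_parcels` is a hypothetical helper. The real `property_description` values may use other forms, so treat this as a starting point rather than the parser this project uses.

```python
import re
from typing import NamedTuple, Optional


class LandParcel(NamedTuple):
    lot: str
    section: Optional[str]
    plan: str


# Assumed shapes: LOT/PLAN or LOT/SECTION/PLAN, where the plan may carry
# a DP or SP prefix. Real data may contain forms this does not cover.
PARCEL_PATTERN = re.compile(
    r"\b(?P<lot>[A-Z0-9]+)/(?:(?P<section>[A-Z0-9]+)/)?(?P<plan>(?:DP|SP)?\d+)\b"
)


def find_parcels(description: str) -> list[LandParcel]:
    """Return every parcel-like fragment found in `description`."""
    return [
        LandParcel(match["lot"], match["section"], match["plan"])
        for match in PARCEL_PATTERN.finditer(description.upper())
    ]
```

Anything the pattern does not match is simply ignored, so the other information in the description would need its own handling.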

### GNAF Ingestion

Here we ingest the GNAF dataset. This will take a while.

## Running Everything

All of the above has been moved to this command.

In [2]:
from lib.tasks.ingest import ingest_all, IngestConfig

config = IngestConfig(
    io_file_limit=None,
    db_config=db_service_config,
    docker_image_config=docker_image_conf,
    docker_container_config=docker_container_conf
)

await ingest_all(config)

2024-10-28 10:11:21,232 - INFO - Checking Target "abs_main_structures.zip"
2024-10-28 10:11:21,233 - INFO - Checking Target "non_abs_shape.zip"
2024-10-28 10:11:21,233 - INFO - Checking Target "g-naf_aug24_allstates_gda2020_psv_1016.zip"
2024-10-28 10:11:21,233 - INFO - Checking Target "nswvg_lv_01_Oct_2024.zip"
2024-10-28 10:11:21,234 - INFO - Checking Target "nswvg_wps_01_Jan_2024.zip"
2024-10-28 10:11:21,235 - INFO - Checking Target "nswvg_wps_08_Jan_2024.zip"
2024-10-28 10:11:21,235 - INFO - Checking Target "nswvg_wps_15_Jan_2024.zip"
2024-10-28 10:11:21,235 - INFO - Checking Target "nswvg_wps_22_Jan_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_29_Jan_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_05_Feb_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_12_Feb_2024.zip"
2024-10-28 10:11:21,236 - INFO - Checking Target "nswvg_wps_19_Feb_2024.zip"
2024-10-28 10:11:21,237 - INFO - Checking Target "nswvg_wps_26_Feb_202

## In CLI

Note that you can also do all of this from the CLI by running the following:

```sh
python -m lib.tasks.ingest --instance 1
```

As of writing this, there are only 2 instances. The reason for having more than one is to allow for testing.