# Data joining
In this notebook, we download an additional external dataset, a postcode-to-SA2 correspondence file, so that SA2 regions in the ABS dataset can be linked to consumer postcodes.

Postcode (POA) and statistical area (SA2) regions do not correspond one-to-one: a single postcode can overlap several SA2 regions. We therefore assume that each postcode has similar demographics to the SA2 region that covers the largest share of it, as given by `RATIO_FROM_TO`. Under this assumption, we produce a dataset that assigns each postcode an SA2 code.
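
The assignment rule can be illustrated on a toy table. This is only a sketch: the values below are made up and are not ABS data, and it uses `idxmax` for brevity rather than the max-then-merge approach used in the code cells of this notebook.

```python
import pandas as pd

# Toy correspondence table (made-up values): each row gives the share of a
# postcode that falls inside one SA2 region.
toy = pd.DataFrame({
    "POSTCODE": ["3000", "3000", "3001", "3001", "3001"],
    "SA2_CODE_2021": ["A", "B", "A", "C", "D"],
    "RATIO_FROM_TO": [0.7, 0.3, 0.2, 0.5, 0.3],
})

# For each postcode, keep the row with the largest share.
largest = toy.loc[toy.groupby("POSTCODE")["RATIO_FROM_TO"].idxmax()]

# Postcode -> SA2 lookup built from the selected rows.
mapping = dict(zip(largest["POSTCODE"], largest["SA2_CODE_2021"]))
print(mapping)  # {'3000': 'A', '3001': 'C'}
```

Note that `idxmax` returns a single row per postcode even when two SA2 regions tie, whereas a merge on the maximum ratio keeps every tied row.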

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pandas as pd
import openpyxl  # engine pandas uses to read .xlsx files
import sys

sys.path.append('../scripts')
from preprocess_script import count_outliers
from download_zip import download_zip_file

In [None]:
# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("Dataset Joining")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "9g") 
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("OFF")

The correspondence dataset is downloaded from https://data.gov.au/dataset/ds-dga-2c79581f-600e-4560-80a8-98adb1922dfc/details?q=correspondence%20asgs; the data dictionary is available at the same location. The data is saved to `tables/correspondence/`.
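
The `download_zip_file` helper lives in `../scripts` and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using only the standard library. The names `extract_member` and `fetch_and_extract` are hypothetical and are not the project's actual functions; the real helper may behave differently.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def extract_member(zip_bytes: bytes, member_name: str, folder: str) -> Path:
    """Write the archive member whose file name is member_name into folder."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Match on the base name so files nested in sub-folders are found.
            if not info.is_dir() and Path(info.filename).name == member_name:
                target = out_dir / member_name
                target.write_bytes(archive.read(info))
                return target
    raise FileNotFoundError(f"{member_name} not found in archive")


def fetch_and_extract(member_name: str, folder: str, url: str) -> Path:
    """Download the zip at url into memory and extract one member (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return extract_member(response.read(), member_name, folder)
```

Holding the archive in memory avoids leaving the full zip on disk when only one file is needed; for very large archives, streaming to a temporary file would be the safer choice.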

In [None]:
# Download SA2 codes to postcodes correspondence file
fn = "CG_POSTCODE_2021_SA2_2021.xlsx"
folder = "../data/tables/correspondence"
src = "https://data.gov.au/data/dataset/2c79581f-600e-4560-80a8-98adb1922dfc/resource/33d822ba-138e-47ae-a15f-460279c3acc3/download/asgs2021_correspondences.zip"

download_zip_file(fn, folder, src)

In [None]:
# Read the xlsx into a DataFrame, keeping the codes as strings
col_types = {"POSTCODE": str, "SA2_CODE_2021": str, "RATIO_FROM_TO": float}
correspondence_df = pd.read_excel("../data/tables/correspondence/CG_POSTCODE_2021_SA2_2021.xlsx", converters=col_types)

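# Assign each postcode the SA2 code with the greatest ratio:
# find the largest ratio per postcode, then keep the row(s) that attain it
# (postcodes with tied ratios keep more than one row)
preferred_SA2_ratio = correspondence_df.groupby('POSTCODE')['RATIO_FROM_TO'].max().reset_index()
preferred_SA2 = correspondence_df.merge(preferred_SA2_ratio, on=['POSTCODE', 'RATIO_FROM_TO'])

# Rename the columns and select them by name rather than by position
column_names = {"POSTCODE": 'postcode', "SA2_CODE_2021": 'sa2_code', "SA2_NAME_2021": 'sa2_name'}
sa2_poa_codes = preferred_SA2.rename(columns=column_names)[['postcode', 'sa2_code', 'sa2_name']]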
sa2_poa_codes

In [7]:
# Saving the correspondence
sa2_poa_codes.to_parquet('../data/curated/correspondence.parquet')