## Processing of historical sales 
Process ZIP files of [Property Sales][sales] downloaded to `data/raw/not_synced`. 
* Files are saved in an inconsistent directory structure
* File names are prefixed with a three-digit code for the LGA. Only consider the following:
    * `090` - Ryde
    * `139` - Canada Bay
    * `260` - City of Parramatta
* The [Sales Data Format][sales_format_pdf] describes the format of the DAT files
* The layout of the ZIP files changes over time:
    * 2001 to 2014 - each ZIP file contains sub-folders
    * 2015 onwards - each ZIP file contains an inner ZIP file for each week
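
Both layouts can be handled by one walker that recurses whenever it meets an inner ZIP. The following is a minimal sketch, not the notebook's own implementation: `iter_dat_files` and `_make_zip` are hypothetical helper names, and the archive members used in the demo are made up for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def iter_dat_files(source, prefixes=("090", "139", "260")) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, contents) for matching DAT files at any nesting depth.

    `source` is a path or a binary file-like object holding a ZIP archive.
    """
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = PurePosixPath(info.filename).name
            lower = name.lower()
            if lower.endswith(".zip"):
                # Inner archive (2015 onwards): load it into memory and recurse.
                yield from iter_dat_files(io.BytesIO(zf.read(info)), prefixes)
            elif lower.endswith(".dat") and name.startswith(tuple(prefixes)):
                # DAT file in any sub-folder (2001 to 2014) or inside an inner ZIP.
                yield name, zf.read(info)


def _make_zip(members: dict) -> bytes:
    """Build a small in-memory ZIP from {member name: bytes} (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for member_name, data in members.items():
            zf.writestr(member_name, data)
    return buf.getvalue()


# Made-up archives that mimic the two layouts described above.
old_style = _make_zip({"week01/090_demo.DAT": b"old", "week01/999_demo.DAT": b"skip"})
new_style = _make_zip({"20240101.zip": _make_zip({"139_demo.DAT": b"new"})})

found = [
    name
    for raw in (old_style, new_style)
    for name, _ in iter_dat_files(io.BytesIO(raw))
]
print(found)
```

Because the walker is a generator, the caller decides what to do with each file, for example passing the name and bytes to a processing function one at a time.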



[sales]: https://valuation.property.nsw.gov.au/embed/propertySalesInformation
[sales_format_pdf]: https://www.valuergeneral.nsw.gov.au/__data/assets/pdf_file/0015/216402/Current_Property_Sales_Data_File_Format_2001_to_Current.pdf
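
Before writing a full parser, it can help to tally the kinds of record in a DAT file. The sketch below assumes each line is a delimited record whose first field is a record type, and that the delimiter is a semicolon; both assumptions should be checked against the [Sales Data Format][sales_format_pdf]. `count_record_types` is a hypothetical helper, and the sample bytes are made up, not real sales data.

```python
from collections import Counter


def count_record_types(file_bytes: bytes, delimiter: str = ";") -> Counter:
    """Count non-empty lines by their first field (assumed to be the record type)."""
    # Assumption: the file decodes as UTF-8; undecodable bytes are replaced.
    text = file_bytes.decode("utf-8", errors="replace")
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record_type = line.split(delimiter, 1)[0]
        counts[record_type] += 1
    return counts


# Made-up bytes for illustration only.
sample = b"A;demo header\nB;demo sale 1\nB;demo sale 2\nZ;demo footer\n"
print(count_record_types(sample))
```

A function like this could be called from `process_dat` as a first sanity check on each file.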

In [17]:
import zipfile
from pathlib import Path
import io

def process_dat(file_name: str, file_bytes: bytes):
    """
    Your existing processing function.
    file_name  -> name/path of the DAT file inside the ZIP
    file_bytes -> raw contents of the DAT file
    """
    # TODO: implement your logic here
    print(f"Processing {file_name}, size={len(file_bytes)} bytes")

def process_zip_folder(zip_folder: str, prefixes: tuple=("090","139","260")):
    """
    Process every ZIP file in the supplied folder.

    Each ZIP file may contain either more ZIP files or DAT files in a folder structure.

    Args:
        zip_folder (str): the folder containing the ZIP files
        prefixes (tuple, optional): The prefixes of DAT files to include.
            Defaults to ("090", "139", "260").
    """
    # Set the path to be the folder passed in
    zip_folder_path = Path(zip_folder)
    # convert prefixes to a tuple
    prefix_tuple = tuple(prefixes)

    # loop through the ZIP files in the folder (not subfolders) - to make it 
    #   recurse through subfolders use rglob not glob
    for zip_path in zip_folder_path.glob("*.zip"):
        print(f"Opening ZIP: {zip_path.name}")

        with zipfile.ZipFile(zip_path, "r") as zf:
            for zip_info in zf.infolist():
                # Skip directories
                if zip_info.is_dir():
                    continue 
                    
                file_name = Path(zip_info.filename).name

                # Check that the file is a DAT file and starts with one of the wanted prefixes
                # (str.startswith needs a tuple, so use prefix_tuple rather than prefixes)
                if file_name.lower().endswith(".dat") and file_name.startswith(prefix_tuple):
                    with zf.open(zip_info) as dat_file:
                        file_bytes = dat_file.read()
                        process_dat(file_name, file_bytes)

                # Check if the file is a ZIP and, if so, scan inside it
                if file_name.lower().endswith(".zip"):
                    # Read the inner ZIP into memory and scan it (one level deep, not recursive)
                    print(f"Opening contained ZIP: {file_name}")
                    inner_bytes = zf.read(zip_info)
                    with zipfile.ZipFile(io.BytesIO(inner_bytes), "r") as inner_zf:
                        for inner_info in inner_zf.infolist():
                            if inner_info.is_dir():
                                continue
                            inner_name = Path(inner_info.filename).name
                            if inner_name.lower().endswith(".dat") and inner_name.startswith(prefix_tuple):
                                file_bytes = inner_zf.read(inner_info)
                                process_dat(inner_name, file_bytes)
                                
if __name__ == "__main__":
    prefixes = ("090","139","260")
    process_zip_folder("../data/raw/not_synced",prefixes)

Opening ZIP: 2024.zip
Opening contained ZIP: 20240101.zip
Processing 090_SALES_DATA_NNME_01012024.DAT, size=6598 bytes
Processing 139_SALES_DATA_NNME_01012024.DAT, size=2440 bytes
Processing 260_SALES_DATA_NNME_01012024.DAT, size=19903 bytes
Opening contained ZIP: 20240108.zip
Processing 090_SALES_DATA_NNME_08012024.DAT, size=692 bytes
Processing 139_SALES_DATA_NNME_08012024.DAT, size=1302 bytes
Processing 260_SALES_DATA_NNME_08012024.DAT, size=12753 bytes
Opening contained ZIP: 20240115.zip
Processing 090_SALES_DATA_NNME_15012024.DAT, size=6635 bytes
Processing 139_SALES_DATA_NNME_15012024.DAT, size=5377 bytes
Processing 260_SALES_DATA_NNME_15012024.DAT, size=19701 bytes
Opening contained ZIP: 20240122.zip
Processing 090_SALES_DATA_NNME_22012024.DAT, size=19333 bytes
Processing 139_SALES_DATA_NNME_22012024.DAT, size=10904 bytes
Processing 260_SALES_DATA_NNME_22012024.DAT, size=50157 bytes
Opening contained ZIP: 20240129.zip
Processing 090_SALES_DATA_NNME_29012024.DAT, size=19482 bytes