#### Written by Franklin (Koquiun) Li Lin

## Parks/Reservations SA2 Matching

In this notebook, we match each park and reservation to its corresponding SA2 district, then map the parks and reservations that fall in metropolitan Victoria.
- Note: Run the `distrcit_boundaries.ipynb` notebook first to produce the Victoria district boundaries, which are used for the matching.

#### Import Libraries

In [1]:
import os
import sys

import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # filtering

### Inspect data

In [2]:
# Starting a Spark session
spark = (
    SparkSession.builder.appName('Parkres SA2 Matching')
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)



In [3]:
# Read parkres shapefile into GeoPandas (only when necessary for geospatial operations)
parkres_gdf = gpd.read_file("../data/landing/parkres/parkres.shp")

# Convert all column names to lowercase and clean the dataframe
parkres_gdf.columns = parkres_gdf.columns.str.lower()
parkres_gdf = parkres_gdf.dropna().drop_duplicates()

# Drop unnecessary columns
parkres_gdf = parkres_gdf.drop(['estab_date', 'last_mod', 'vers_date'], axis=1)

parkres_gdf

Unnamed: 0,prims_id,name,area_type,name_short,area_src,manager,veac_rec,veac_study,iucn,total_area,poly_src,areatypeid,srchname,hectares,areasqm,geometry
1,3223,Glenmaggie Regional Park,REGIONAL PARK - NOT SCHEDULED UNDER NATIONAL P...,Glenmaggie RP,GIS PARKRES,Parks Victoria,A4,Gippsland Lakes Hinterland (1983),Not a Protected Area,570.861,HOLDING,30,GLENMAGGIE RP REGIONAL PARK R.P.,18.80,188008.61,"MULTIPOLYGON (((146.75825 -37.9299, 146.75828 ..."
2,352,Yarrara Flora and Fauna Reserve,NATURE CONSERVATION RESERVE - FLORA AND FAUNA ...,Yarrara FFR,GIS PARKRES,Parks Victoria,G1,Mallee Review (1989),Ia,2267.804,HOLDING,21,YARRARA FFR FLORA AND FAUNA RESERVE F.F.R.,1.29,12893.57,"POLYGON ((141.42817 -34.39947, 141.42734 -34.3..."
3,2746,Murray River K15 Streamside Reserve,NATURAL FEATURES RESERVE - STREAMSIDE RESERVE,Murray River K15 SSR,GIS PARKRES,Parks Victoria,H4,Box-Ironbark Investigation (2001),III,3.702,VM PARCEL EDITED WITH 25K FEATURES,18,MURRAY RIVER K15 SSR STREAMSIDE RESERVE SS.R.,3.70,37022.24,"POLYGON ((146.61136 -36.00962, 146.61143 -36.0..."
4,5075,Nooramunga Marine and Coastal Park (addition) ...,PROPOSED NATIONAL PARKS ACT PARK OR PARK ADDITION,Nooramunga Marine and Coastal Park (addition) NHP,GIS PARKRES,DEECA,none,No LCC Recommendation,Not a Protected Area,8.633,HOLDING,33,NOORAMUNGA MARINE AND COASTAL PARK (ADDITION) ...,8.63,86329.22,"POLYGON ((146.79819 -38.64209, 146.79498 -38.6..."
5,3193,Phillip Island Coastal Reserve,COASTAL RESERVE,Phillip Island Coast Res,GIS PARKRES,Committee of Management,J3,Melbourne (1977),Not a Protected Area,147.009,VM PARCEL EDITED WITH 25K FEATURES,1,PHILLIP ISLAND COAST RES COASTAL RESERVE COAST...,15.06,150629.34,"POLYGON ((145.33998 -38.53203, 145.34001 -38.5..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8782,2857,Puffing Billy Historic & Cultural Features Res.,HISTORIC RESERVE,Puffing Billy HR,GIS PARKRES,Other Government Authority,F4,Melbourne 2 Review (1994),Not a Protected Area,88.696,VM PARCEL EDITED WITH 25K FEATURES,4,PUFFING BILLY HR HISTORIC & CULTURAL FEATURES ...,0.65,6476.31,"POLYGON ((145.50782 -37.94836, 145.50774 -37.9..."
8783,1109,Natya Bushland Reserve,NATURAL FEATURES RESERVE - BUSHLAND RESERVE,Natya BR,GIS PARKRES,Parks Victoria,I55,Mallee Review (1989),IV,30.963,VM PARCEL EDITED WITH 25K FEATURES,9,NATYA BR BUSHLAND RESERVE B.R.,0.83,8281.31,"POLYGON ((143.22939 -34.95553, 143.22926 -34.9..."
8784,1600,Toolangi Bushland Reserve,NATURAL FEATURES RESERVE - BUSHLAND RESERVE,Toolangi BR,GIS PARKRES,Parks Victoria,G59,Melbourne 2 Review (1994),IV,30.059,VM PARCEL EDITED WITH 25K FEATURES,9,TOOLANGI BR BUSHLAND RESERVE B.R.,12.56,125619.56,"POLYGON ((145.4715 -37.54081, 145.4715 -37.540..."
8785,1770,Crib Point G229 Bushland Reserve,NATURAL FEATURES RESERVE - BUSHLAND RESERVE,Crib Point G229 BR,GIS PARKRES,Committee of Management,G229,Melbourne 2 Review (1994),IV,0.693,VM PARCEL EDITED WITH 25K FEATURES,9,CRIB POINT G229 BR BUSHLAND RESERVE B.R.,0.69,6926.96,"POLYGON ((145.20117 -38.36092, 145.2012 -38.36..."


In [4]:
print(f"Shape of the parkres shapefile: {parkres_gdf.shape}")

Shape of the parkres shapefile: (8452, 16)


### Read Victoria district boundaries shapefile

In [5]:
# Read the Victoria district boundaries shapefile
victoria_gdf = gpd.read_file('../data/landing/boundaries/Victoria/vic_dist_boundaries.shp')

### Get the corresponding SA2 name for each park/reservation

In [6]:
# Define a projected, metre-based target CRS so that overlap areas and distances are measured in metres
target_crs = 'EPSG:3112'  # GDA94 / Geoscience Australia Lambert

# Project both GeoDataFrames to the target CRS
parkres_gdf = parkres_gdf.to_crs(target_crs)
victoria_gdf = victoria_gdf.to_crs(target_crs)
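
Areas and distances computed on longitude/latitude coordinates come out in degrees, which is why both layers are projected before the overlap step. A minimal sketch of the difference, assuming a hypothetical 0.01° square near Melbourne (the coordinates are made up for illustration and are not taken from the parkres data):

```python
import warnings

import geopandas as gpd
from shapely.geometry import box

# Hypothetical 0.01 x 0.01 degree square near Melbourne (illustrative only)
square = gpd.GeoSeries([box(145.00, -37.81, 145.01, -37.80)], crs="EPSG:4326")

with warnings.catch_warnings():
    # GeoPandas warns that areas in a geographic CRS are not meaningful
    warnings.simplefilter("ignore")
    area_deg2 = square.area.iloc[0]

# After projecting to EPSG:3112 the coordinates, and hence the area, are in metres
area_m2 = square.to_crs("EPSG:3112").area.iloc[0]

print(f"Area in EPSG:4326: {area_deg2:.6f} square degrees")
print(f"Area in EPSG:3112: {area_m2:,.0f} square metres")
```

EPSG:3112 is a conformal Lambert projection rather than an equal-area one, so absolute areas carry some distortion. That should matter little here, because the matching only compares how much of one park falls in each neighbouring district.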


In [7]:
# Function to find the district with the largest overlap
def find_largest_overlap(park_geometry, districts_gdf):
    """Return the sa2_name of the district that overlaps the park the most."""
    overlap_areas = districts_gdf.intersection(park_geometry).area
    if overlap_areas.max() > 0:
        # idxmax() returns an index label, so look it up with .loc rather than .iloc
        return districts_gdf.loc[overlap_areas.idxmax(), 'sa2_name']
    # If no overlap, fall back to the nearest district
    distances = districts_gdf.distance(park_geometry)
    return districts_gdf.loc[distances.idxmin(), 'sa2_name']
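
To make the selection rule concrete, here is a small sketch on toy data. The two square "districts" and the park rectangles are hypothetical and exist only to show the largest-overlap choice and the nearest-district fallback. The lookup uses `.loc` because `idxmax()` and `idxmin()` return index labels.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical square districts side by side (coordinates in metres)
districts = gpd.GeoDataFrame(
    {"sa2_name": ["West", "East"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:3112",
)

# A hypothetical park straddling the shared boundary:
# 2 m of its width lies in West and 6 m lies in East
park = box(8, 2, 16, 4)
overlap_areas = districts.intersection(park).area
best = districts.loc[overlap_areas.idxmax(), "sa2_name"]

# A hypothetical park that touches no district falls back to the nearest one
far_park = box(-5, 0, -3, 2)
nearest = districts.loc[districts.distance(far_park).idxmin(), "sa2_name"]

print(overlap_areas.tolist(), best, nearest)
```

Because the function is applied row by row, every park is intersected with every district. For larger datasets a spatial join or overlay would likely be faster, at the cost of a little more bookkeeping.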

In [8]:
# Add SA2 name to parkres_gdf based on largest overlap
parkres_gdf['sa2_name'] = parkres_gdf.apply(lambda row: find_largest_overlap(row['geometry'], victoria_gdf), axis=1)

parkres_gdf.head()

Unnamed: 0,prims_id,name,area_type,name_short,area_src,manager,veac_rec,veac_study,iucn,total_area,poly_src,areatypeid,srchname,hectares,areasqm,geometry,sa2_name
1,3223,Glenmaggie Regional Park,REGIONAL PARK - NOT SCHEDULED UNDER NATIONAL P...,Glenmaggie RP,GIS PARKRES,Parks Victoria,A4,Gippsland Lakes Hinterland (1983),Not a Protected Area,570.861,HOLDING,30,GLENMAGGIE RP REGIONAL PARK R.P.,18.8,188008.61,"MULTIPOLYGON (((1126585.815 -4322470.846, 1126...",Maffra
2,352,Yarrara Flora and Fauna Reserve,NATURE CONSERVATION RESERVE - FLORA AND FAUNA ...,Yarrara FFR,GIS PARKRES,Parks Victoria,G1,Mallee Review (1989),Ia,2267.804,HOLDING,21,YARRARA FFR FLORA AND FAUNA RESERVE F.F.R.,1.29,12893.57,"POLYGON ((679829.1 -3893333.379, 679752.495 -3...",Mildura Surrounds
3,2746,Murray River K15 Streamside Reserve,NATURAL FEATURES RESERVE - STREAMSIDE RESERVE,Murray River K15 SSR,GIS PARKRES,Parks Victoria,H4,Box-Ironbark Investigation (2001),III,3.702,VM PARCEL EDITED WITH 25K FEATURES,18,MURRAY RIVER K15 SSR STREAMSIDE RESERVE SS.R.,3.7,37022.24,"POLYGON ((1135072.166 -4108497.785, 1135077.68...",Rutherglen
4,5075,Nooramunga Marine and Coastal Park (addition) ...,PROPOSED NATIONAL PARKS ACT PARK OR PARK ADDITION,Nooramunga Marine and Coastal Park (addition) NHP,GIS PARKRES,DEECA,none,No LCC Recommendation,Not a Protected Area,8.633,HOLDING,33,NOORAMUNGA MARINE AND COASTAL PARK (ADDITION) ...,8.63,86329.22,"POLYGON ((1122004.523 -4402057.395, 1121728.61...",Yarram
5,3193,Phillip Island Coastal Reserve,COASTAL RESERVE,Phillip Island Coast Res,GIS PARKRES,Committee of Management,J3,Melbourne (1977),Not a Protected Area,147.009,VM PARCEL EDITED WITH 25K FEATURES,1,PHILLIP ISLAND COAST RES COASTAL RESERVE COAST...,15.06,150629.34,"POLYGON ((995644.224 -4377506.647, 995647.782 ...",Phillip Island


### Find the corresponding postcode for each park/reservation

In [9]:
# Read suburb parquet data into a Spark DataFrame
suburb_spark_df = spark.read.parquet('../data/landing/suburb_match/suburb_match.parquet')

                                                                                

In [10]:
# Filter the Spark DataFrame to only include rows where the state is 'VIC'
suburb_spark_df = suburb_spark_df.filter(suburb_spark_df['state'] == 'VIC')

In [11]:
# Match each SA2 name to a postcode using the suburb lookup table
def match_postcode(final_df, suburb_spark_df):
    manual_updates = {
        'Horsham Surrounds': 3400,
        'Bacchus Marsh Surrounds': 3340,
    }

    # Collect the VIC-only suburb table to pandas once. The matching below runs on
    # the driver through pandas .apply, so a Spark broadcast is not needed.
    suburb_df_pd = suburb_spark_df.toPandas()

    # Lowercase the two lookup columns once instead of on every call
    sa2_lower = suburb_df_pd['SA2_NAME_2021'].fillna('').str.lower()
    locality_lower = suburb_df_pd['locality'].fillna('').str.lower()

    def find_postcode(sa2_name):
        if pd.isna(sa2_name):
            return None

        # Check if there's a manual update for this sa2_name
        if sa2_name in manual_updates:
            return manual_updates[sa2_name]

        sa2_name_lower = sa2_name.lower()

        # Exact match on either SA2_NAME_2021 or locality
        match = suburb_df_pd[(sa2_lower == sa2_name_lower) | (locality_lower == sa2_name_lower)]
        if not match.empty:
            return match['postcode'].iloc[0]

        # Partial match: names starting with the text before the first '-' or '('
        main_part = sa2_name_lower.split('-')[0].strip().split('(')[0].strip()
        match = suburb_df_pd[
            sa2_lower.str.startswith(main_part) | locality_lower.str.startswith(main_part)
        ]
        if not match.empty:
            return match['postcode'].iloc[0]

        return None

    # Apply the find_postcode function to final_df (adds a 'postcode' column in place)
    final_df['postcode'] = final_df['sa2_name'].apply(find_postcode)
    final_df['postcode'] = pd.to_numeric(final_df['postcode'], errors='coerce').astype('Int64')

    # Identify and display unmatched SA2 names
    unmatched = final_df[final_df['postcode'].isna()]['sa2_name'].unique().tolist()
    print("Unique unmatched sa2_names:")
    for name in unmatched:
        print(name)

    return final_df[['name', 'sa2_name', 'postcode', 'geometry']]

# Perform postcode matching
result_df = match_postcode(parkres_gdf, suburb_spark_df)
result_df.head()



Unique unmatched sa2_names:


Unnamed: 0,name,sa2_name,postcode,geometry
1,Glenmaggie Regional Park,Maffra,3825,"MULTIPOLYGON (((1126585.815 -4322470.846, 1126..."
2,Yarrara Flora and Fauna Reserve,Mildura Surrounds,3489,"POLYGON ((679829.1 -3893333.379, 679752.495 -3..."
3,Murray River K15 Streamside Reserve,Rutherglen,3685,"POLYGON ((1135072.166 -4108497.785, 1135077.68..."
4,Nooramunga Marine and Coastal Park (addition) ...,Yarram,3844,"POLYGON ((1122004.523 -4402057.395, 1121728.61..."
5,Phillip Island Coastal Reserve,Phillip Island,3922,"POLYGON ((995644.224 -4377506.647, 995647.782 ..."
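
The matching works in two stages: an exact, case-insensitive name match, then a prefix match on the part of the SA2 name before the first `-` or `(`. The sketch below reproduces that logic in plain Python. The helper names `main_part` and `lookup_postcode`, the lookup rows, and the postcodes are hypothetical placeholders, not values from `suburb_match.parquet`.

```python
def main_part(sa2_name):
    """Lowercased text before the first '-' or '(' in an SA2 name."""
    return sa2_name.lower().split('-')[0].strip().split('(')[0].strip()


def lookup_postcode(sa2_name, records):
    """Two-stage lookup: exact name match first, then prefix match."""
    name = sa2_name.lower()
    # Stage 1: exact, case-insensitive match on the locality name
    for rec in records:
        if rec["locality"].lower() == name:
            return rec["postcode"]
    # Stage 2: first row whose locality starts with the main part of the name
    prefix = main_part(sa2_name)
    for rec in records:
        if rec["locality"].lower().startswith(prefix):
            return rec["postcode"]
    return None


# Hypothetical lookup rows; the postcodes are placeholders, not real data
records = [
    {"locality": "Plentyville", "postcode": 1111},
    {"locality": "Plenty", "postcode": 2222},
]

print(main_part("Wattle Glen - Diamond Creek"))       # wattle glen
print(lookup_postcode("Plenty", records))              # 2222 (exact match)
print(lookup_postcode("Plenty - Yarrambat", records))  # 1111 (first prefix match)
```

The last call shows a caveat of the prefix stage: it returns the first row whose name starts with the prefix, which may belong to a different locality. It can be worth spot-checking postcodes that were assigned through the fallback.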


In [12]:
# Filter the DataFrame for postcodes between 3000 and 3200 (metropolitan areas)
filtered_df = result_df[(result_df['postcode'] >= 3000) & (result_df['postcode'] <= 3200)]

# Reset the index and start from 1
filtered_df = filtered_df.reset_index(drop=True)
filtered_df.index += 1

filtered_df.head()

Unnamed: 0,name,sa2_name,postcode,geometry
1,Lilydale-Warburton Rail Trail,Yarra Valley,3139,"POLYGON ((1034153.618 -4293796.631, 1034157.80..."
2,Nangana Bushland Reserve,Yarra Valley,3139,"POLYGON ((1022203.959 -4305504.818, 1022103.44..."
3,Nillumbik G139 Bushland Reserve,Wattle Glen - Diamond Creek,3089,"POLYGON ((989912.759 -4282407.536, 989863.055 ..."
4,Lilydale-Warburton Rail Trail,Lilydale - Coldstream,3140,"POLYGON ((1005216.889 -4291459.212, 1005220.85..."
5,Plenty Gorge Parklands,Plenty - Yarrambat,3088,"POLYGON ((983018.706 -4280186.305, 982890.744 ..."


#### Save

In [13]:
output_dir = '../data/curated/parkres/'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Save the filtered data as a CSV file (geometries are written as WKT text)
filtered_df.to_csv(os.path.join(output_dir, 'parkres.csv'))
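
A CSV file keeps the geometries as WKT strings but does not record the CRS, so a notebook that reads the file back has to rebuild the GeoDataFrame itself. A minimal round-trip sketch, assuming a hypothetical one-row stand-in for `filtered_df` and an in-memory buffer instead of the real file:

```python
import io

import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical one-row stand-in for filtered_df (values are illustrative)
toy = gpd.GeoDataFrame(
    {"name": ["Toy Park"], "postcode": [3000]},
    geometry=[box(0, 0, 10, 10)],
    crs="EPSG:3112",
)

# to_csv writes each geometry as WKT text; the CRS is not stored
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Reading back gives a plain DataFrame, so geometry and CRS must be restored
loaded = pd.read_csv(buffer)
restored = gpd.GeoDataFrame(
    loaded.drop(columns="geometry"),
    geometry=gpd.GeoSeries.from_wkt(loaded["geometry"]),
    crs="EPSG:3112",  # must match the CRS the data was saved in
)
```

If downstream steps need the geometry and CRS preserved without this step, a geospatial format such as GeoPackage or GeoParquet may be a better fit than CSV.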

### Visualization

In [14]:
# Convert 'filtered_df' from EPSG:3112 back to WGS84 (EPSG:4326), the CRS folium expects
filtered_df_wgs84 = filtered_df.to_crs('EPSG:4326')

# Get the bounding box center of the filtered data to center the map
bounds = filtered_df_wgs84.total_bounds
centroid_lat = (bounds[1] + bounds[3]) / 2
centroid_lon = (bounds[0] + bounds[2]) / 2

# Create a folium map centered at the bounding box centroid
m = folium.Map(
    location=[centroid_lat, centroid_lon], 
    zoom_start=10, 
    tiles='OpenStreetMap',
    attr='Map data © OpenStreetMap contributors'
)

# Add the filtered_df geometries to the map, styling them with color and popup info
for _, row in filtered_df_wgs84.iterrows():
    # Create a GeoJson object for each geometry
    geo_json = folium.GeoJson(
        row['geometry'],
        style_function=lambda x: {'color': 'darkgreen', 'weight': 2},  # Style of the boundary
        highlight_function=lambda x: {'color': 'red', 'weight': 3},  # Highlight on hover
        tooltip=folium.Tooltip(f"Name: {row['name']}<br>SA2 Name: {row['sa2_name']}<br>Postcode: {row['postcode']}")
    )
    # Add the GeoJson to the map
    geo_json.add_to(m)

# Add a layer control to toggle the layers on/off
folium.LayerControl().add_to(m)

# Display the map (if in Jupyter Notebook)
m

Next, go to `parkres_domain_merge.ipynb` to build the merged dataset that combines the parks/reservations information with the domain data.