## SA2 to Rental Suburb Group join table

This Jupyter notebook joins the SA2 regions to the historical rental suburb groups.

The output dataframe schema is as follows:

| geometry | suburbs | regions | code |
|----------|---------|---------|------|
| polygon  | string  | list of strings | list of integers |

The geometry field allows for spatial joins on coordinates and for visualisations. The suburbs field is in the form 'Suburb 1-Suburb 2 (optional)-Suburb 3 (optional)' and should be joined with the suburb field in the historical rental dataset.

The regions field is a list of the SA2 region names that the historical rental suburb group covers. It is included for verification purposes and is not needed for a join. Instead, join on the list of codes, which are unique SA2 identifiers.

Each historical rental suburb group can contain many SA2 regions, but each SA2 region belongs to only one historical rental suburb group.
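
As a rough illustration of joining on the code lists, the sketch below explodes each list into one row per SA2 code and merges the result with a separate SA2-level table. The two small frames, the `population` column and its values are made up for the example; only the `suburbs` and `code` columns mirror the schema above.

```python
import pandas as pd

# Hypothetical slice of the join table: one row per rental suburb group,
# with the list of SA2 codes it covers.
groups = pd.DataFrame({
    "suburbs": ["Carlton-Parkville", "Armadale"],
    "code": [[206041117, 206041124], [206061135]],
})

# Hypothetical SA2-level dataset keyed by SA2 code (values are made up).
sa2_stats = pd.DataFrame({
    "code": [206041117, 206041124, 206061135],
    "population": [100, 200, 300],
})

# One row per (group, SA2 code); explode leaves an object column, so cast it.
exploded = groups.explode("code")
exploded["code"] = exploded["code"].astype("int64")

# validate="one_to_one" raises if a code appears twice on either side,
# which would signal an SA2 region assigned to two groups.
joined = exploded.merge(sa2_stats, on="code", how="left", validate="one_to_one")

# Aggregate the SA2-level values back up to the rental suburb group.
per_group = joined.groupby("suburbs")["population"].sum()
print(per_group)
```

Because each SA2 code appears in only one group, summing after the merge does not double count any region.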

### Steps to execute
1. Make sure the historical rental data is saved in '@/data/landing/housing/flat_1_bed.csv' and the SA2 shapefile is saved in '@/data/landing/sa2_shapefile'
2. Execute the cells below
3. The output dataframe will be saved in '@/data/raw/location/sa2_to_rental_suburb_groups.csv'
4. Feel free to explore the regions on the folium display in the second-last cell towards the bottom!

In [1]:
# IMPORTS
import pandas as pd
import geopandas as gpd
import itertools
import re
import os

In [2]:
PATH_TO_SUBURB_SHAPEFILE = '../data/landing/sa2_shapefile'
FINAL_OUTPUT_PATH = '../data/raw/location'
VICTORIA = 'Victoria'

os.makedirs(PATH_TO_SUBURB_SHAPEFILE, exist_ok=True)
os.makedirs(FINAL_OUTPUT_PATH, exist_ok=True)

In [3]:
gdf = gpd.read_file(PATH_TO_SUBURB_SHAPEFILE + '/SA2_2021_AUST_GDA2020.shp')

victoria_gdf = gdf[gdf['STE_NAME21'] == VICTORIA].rename(columns={'SA2_NAME21': 'suburb', 'SA2_CODE21': 'code'}, 
                                                         inplace=False)[['code', 'suburb', 'geometry']].copy()
victoria_gdf = victoria_gdf.dropna()

DIRECTIONS = ['North', 'East', 'South', 'West']
ALL_DIRECTIONS = DIRECTIONS + [f'{dir_one} {dir_two}' for (dir_one, dir_two) in itertools.permutations(DIRECTIONS, 2)]

def standardise_direction(suburb):
    for direction in DIRECTIONS:
        if suburb.endswith(direction):
            name = re.sub(fr'\s*{direction}$', '', suburb)
            return direction + " " + name.strip() 
    return suburb

def split_direction(suburb):
    # assume direction has been standardised
    for direction in DIRECTIONS:
        if suburb.startswith(direction):
            name = re.sub(fr'^{direction}\s*', '', suburb)
            return direction, name.strip() 
    return None, suburb
            
def clean_suburb(suburb):
    # remove brackets
    suburb = re.sub(r'\s*\(.*?\)', '', suburb)
    suburb = suburb.replace('Mt', 'Mount')

    # convert Melbourne CBD to CBD
    if suburb == 'Melbourne CBD':
        return 'CBD'

    return suburb

def extract_components(full_suburb):
    # schema of (suburb, secondary burb, direction)

    # if there is a dash (it has a direction or sub)
    if '-' in full_suburb:
        suburb, sub_region = (details.strip() for details in full_suburb.split('-', 1))

        # if it was a cardinal direction, add it to the directions
        if sub_region in ALL_DIRECTIONS:
            res = suburb, None, sub_region
        # otherwise it is a sub region or joint suburb
        else:
            res = suburb.strip(), sub_region.strip(), None
    # otherwise it is a plain suburb name with no direction or sub region
    else:
        res = full_suburb, None, None

    # clean the suburb component
    suburb = res[0]

    return clean_suburb(suburb), res[1], res[2]

victoria_gdf_with_breakdown = victoria_gdf.copy()
victoria_gdf_with_breakdown['suburb_breakdown'] = victoria_gdf_with_breakdown['suburb'].map(extract_components)

victoria_gdf_with_breakdown

Unnamed: 0,code,suburb,geometry,suburb_breakdown
644,201011001,Alfredton,"POLYGON ((143.78282 -37.56666, 143.75558 -37.5...","(Alfredton, None, None)"
645,201011002,Ballarat,"POLYGON ((143.81896 -37.55582, 143.81644 -37.5...","(Ballarat, None, None)"
646,201011005,Buninyong,"POLYGON ((143.84171 -37.61596, 143.84176 -37.6...","(Buninyong, None, None)"
647,201011006,Delacombe,"POLYGON ((143.7505 -37.59119, 143.75044 -37.59...","(Delacombe, None, None)"
648,201011007,Smythes Creek,"POLYGON ((143.73296 -37.62333, 143.73263 -37.6...","(Smythes Creek, None, None)"
...,...,...,...,...
1161,217031476,Otway,"MULTIPOLYGON (((143.40263 -38.78152, 143.40252...","(Otway, None, None)"
1162,217041477,Moyne - East,"POLYGON ((142.41438 -38.09303, 142.414 -38.072...","(Moyne, None, East)"
1163,217041478,Moyne - West,"MULTIPOLYGON (((142.0087 -38.41715, 142.00876 ...","(Moyne, None, West)"
1164,217041479,Warrnambool - North,"POLYGON ((142.43668 -38.35544, 142.43658 -38.3...","(Warrnambool, None, North)"
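
The breakdown column above can be sanity-checked on a few names. The snippet below is a condensed restatement of the helpers so that it runs on its own; it is not the exact code from the cell (it omits the 'Melbourne CBD' special case and only recognises single directions).

```python
import re

DIRECTIONS = ["North", "East", "South", "West"]

def clean_suburb(suburb):
    # Drop bracketed qualifiers such as "(South)" and expand "Mt".
    suburb = re.sub(r"\s*\(.*?\)", "", suburb)
    return suburb.replace("Mt", "Mount")

def standardise_direction(suburb):
    # Move a trailing direction to the front: "Brighton East" -> "East Brighton".
    for direction in DIRECTIONS:
        if suburb.endswith(" " + direction):
            return direction + " " + suburb[: -len(direction)].strip()
    return suburb

def extract_components(full_suburb):
    # Returns (suburb, secondary suburb, direction) for an SA2 name.
    if "-" not in full_suburb:
        return clean_suburb(full_suburb), None, None
    suburb, sub_region = (part.strip() for part in full_suburb.split("-", 1))
    if sub_region in DIRECTIONS:
        return clean_suburb(suburb), None, sub_region
    return clean_suburb(suburb), sub_region, None

print(extract_components("Moyne - East"))
print(extract_components("Clifton Hill - Alphington"))
print(extract_components("Richmond (South) - Cremorne"))
print(standardise_direction("Brighton East"))
```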


In [4]:
PATH_TO_HISTORICAL_RENTAL_DATA = '../data/landing/housing'

historical_rental_df = pd.read_csv(PATH_TO_HISTORICAL_RENTAL_DATA + '/flat_1_bed.csv')
historical_rental_df = historical_rental_df.rename(columns={"Unnamed: 1": "suburb"})

# also fix the two typos in the suburb names
historical_rental_df['suburbs'] = historical_rental_df['suburb'].replace({'Wanagaratta': 'Wangaratta', 'Newcombe': 'Newcomb'})

historical_rental_df[['suburb_1', 'suburb_2', 'suburb_3']] = historical_rental_df['suburbs'].str.split('-', n=2, expand=True)

all_suburbs_list_rental = set(list(historical_rental_df['suburb_1']) + list(historical_rental_df['suburb_2']) + list(historical_rental_df['suburb_3']))
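
To make the split step concrete, here is a small sketch of how `str.split` with `n=2, expand=True` behaves on group names of different lengths. The three names come from the rental data; the variable names are only for this example.

```python
import pandas as pd

groups = pd.Series(
    ["Albert Park-Middle Park-West St Kilda", "Carlton-Parkville", "Armadale"]
)

# n=2 allows at most two splits, so at most three columns are produced;
# shorter groups are padded with missing values.
parts = groups.str.split("-", n=2, expand=True)
parts.columns = ["suburb_1", "suburb_2", "suburb_3"]

# Collect the distinct, non-missing suburb names across all three columns.
all_names = {name for name in parts.to_numpy().ravel() if pd.notna(name)}
print(sorted(all_names))
```

The set built in the cell above does not filter out the missing values, which is why later cells guard with `if i` before using a name.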

In [5]:
all_suburbs_list_rental

{'Abbotsford',
 'Albert Park',
 'Alfredton',
 'Alphington',
 'Altona',
 'Armadale',
 'Ascot Vale',
 'Ashburton',
 'Aspendale',
 'Avondale Heights',
 'Bairnsdale',
 'Ballarat',
 'Balwyn',
 'Barwon Heads',
 'Bayswater',
 'Beaumaris',
 'Belmont',
 'Benalla',
 'Bendigo',
 'Bendigo East',
 'Bentleigh',
 'Berwick',
 'Blackburn',
 'Boronia',
 'Box Hill',
 'Brighton',
 'Brighton East',
 'Broadmeadows',
 'Brunswick',
 'Bulleen',
 'Bundoora',
 'Buninyong',
 'Burnley',
 'Burwood',
 'Burwood East',
 'CBD',
 'Camberwell',
 'Canterbury',
 'Carlton',
 'Carlton North',
 'Carnegie',
 'Carrum',
 'Carrum Downs',
 'Castlemaine',
 'Caulfield',
 'Chadstone',
 'Chelsea',
 'Cheltenham',
 'Clayton',
 'Clifton Hill',
 'Coburg',
 'Coburg North',
 'Collingwood',
 'Corio',
 'Craigieburn',
 'Cranbourne',
 'Croydon',
 'Dandenong',
 'Dandenong North',
 'Deer Park',
 'Delacombe',
 'Docklands',
 'Doncaster',
 'Doncaster East',
 'Donvale',
 'Dromana',
 'East Brunswick',
 'East Hawthorn',
 'East Melbourne',
 'East St Kil

In [6]:
historical_rental_df.sort_values('suburbs')

Unnamed: 0,1 bedroom flat,suburb,Mar 2000,Mar 2000.1,Jun 2000,Jun 2000.1,Sep 2000,Sep 2000.1,Dec 2000,Dec 2000.1,...,Dec 2023,Dec 2023.1,Mar 2024,Mar 2024.1,Jun 2024,Jun 2024.1,suburbs,suburb_1,suburb_2,suburb_3
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,352,165,347,165,378,170,369,175,...,224,400,194,425,187,426,Albert Park-Middle Park-West St Kilda,Albert Park,Middle Park,West St Kilda
55,Outer Western Melbourne,Altona,87,95,94,100,97,105,98,105,...,85,310,82,320,73,325,Altona,Altona,,
1,,Armadale,210,150,212,150,213,155,213,160,...,148,408,155,430,147,450,Armadale,Armadale,,
41,Southern Melbourne,Aspendale-Chelsea-Carrum,105,103,97,105,95,110,86,110,...,38,350,29,350,23,385,Aspendale-Chelsea-Carrum,Aspendale,Chelsea,Carrum
137,Other Regional Centres,Bairnsdale,21,90,22,88,16,90,16,88,...,12,273,12,283,13,300,Bairnsdale,Bairnsdale,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93,,Whittlesea,-,-,-,-,-,-,-,-,...,-,-,-,-,-,-,Whittlesea,Whittlesea,,
65,,Williamstown,51,120,49,120,53,125,53,130,...,62,383,59,385,58,413,Williamstown,Williamstown,,
157,,Wodonga,77,85,72,85,77,85,83,85,...,42,255,44,260,42,260,Wodonga,Wodonga,,
102,,Yarra Ranges,68,110,73,110,69,105,73,110,...,44,323,34,350,37,360,Yarra Ranges,Yarra Ranges,,


In [7]:
suburbs = {j for j in victoria_gdf['suburb'] if j}
print([i for i in victoria_gdf['suburb'] if 'Albert Park' in i])
[i for i in all_suburbs_list_rental if i and 'Albert' in i]# i not in suburbs]

['Albert Park']


['Mont Albert', 'Albert Park']

In [8]:
FORCED_MATCHES = {
    'Herne Hill': 'Geelong West - Hamlyn Heights',
    'Portsea': 'Mornington - West',
    'Burnley': 'Richmond (South) - Cremorne',
    'North Bendigo': 'Bendigo',
    'Spotswood': 'Newport',
    'Port Melbourne': 'Port Melbourne Industrial'
}

def match_suburbs(df_rent, df_sa2):
    matched_suburbs = []
    matched_codes = []

    matched_suburbs_set = set()
    
    for _, rent_row in df_rent.iterrows():
        rent_suburbs = [rent_row['suburb_1'], rent_row['suburb_2'], rent_row['suburb_3']]
        matched_suburbs_per_rent_suburb = []
        matched_codes_per_rent_suburb = []
        
        for rent_suburb in rent_suburbs:
            if rent_suburb:
                rent_suburb_cleaned = standardise_direction(clean_suburb(rent_suburb))

                for _, sa2_row in df_sa2.iterrows():

                    # check for forced/override matches (these are manual matches that 
                    # would not happen programmatically)
                    if rent_suburb in FORCED_MATCHES and FORCED_MATCHES[rent_suburb] == sa2_row['suburb']:
                        print(rent_suburb)
                        matched_suburbs_per_rent_suburb.append(sa2_row['suburb'])
                        matched_suburbs_set.add(sa2_row['suburb'])
                        matched_codes_per_rent_suburb.append(sa2_row['code'])
                        continue

                    # check for a normal match
                    sa2_suburb = sa2_row['suburb_breakdown']
                    sa2_suburb_cleaned = standardise_direction(clean_suburb(sa2_suburb[0]))
                    
                    # see if the suburb matches with the cleaned name, or the sub part of the suburb
                    if rent_suburb_cleaned in {sa2_suburb_cleaned, sa2_suburb[1]}:
                        matched_suburbs_per_rent_suburb.append(sa2_row['suburb'])
                        matched_suburbs_set.add(sa2_row['suburb'])
                        matched_codes_per_rent_suburb.append(sa2_row['code'])
                    else:

                        # check for an additional directional match (if taking the direction out makes it match)
                        direction, non_direction = split_direction(rent_suburb_cleaned)
                        # if rent_suburb_cleaned == 'West St Kilda':
                        #     print(direction, 'd', non_direction, sa2_suburb)
                        if direction and non_direction == sa2_suburb_cleaned and direction == sa2_suburb[2]:
                            matched_suburbs_per_rent_suburb.append(sa2_row['suburb'])
                            matched_suburbs_set.add(sa2_row['suburb'])
                            matched_codes_per_rent_suburb.append(sa2_row['code'])

        
        # remove duplicates
        matched_suburbs.append(list(set(matched_suburbs_per_rent_suburb)))
        matched_codes.append(list(set(matched_codes_per_rent_suburb)))


    # perform a second iteration for still unmatched suburbs but with looser criteria
    # specifically looking to match with east/south/west or more detailed descriptions
    for i, rent_row in df_rent.iterrows():
        rent_suburbs = [rent_row['suburb_1'], rent_row['suburb_2'], rent_row['suburb_3']]
        matched_suburbs_per_rent_suburb = []
        matched_codes_per_rent_suburb = []
        
        for rent_suburb in rent_suburbs:
            if rent_suburb:
                rent_suburb_cleaned = standardise_direction(clean_suburb(rent_suburb))

                for _, sa2_row in df_sa2.iterrows():
                    sa2_suburb = sa2_row['suburb_breakdown']
                    sa2_suburb_cleaned = standardise_direction(clean_suburb(sa2_suburb[0]))

                    # if there doesn't exist a normal match, and the suburb hasn't been matched
                    # check if there's a 'region' match

                    # that is check if it matches without the region modifier
                    _, non_direction = split_direction(sa2_suburb_cleaned)

                    if non_direction == rent_suburb_cleaned \
                        and sa2_row['suburb'] not in matched_suburbs_set:
                            matched_suburbs_per_rent_suburb.append(sa2_row['suburb'])
                            matched_codes_per_rent_suburb.append(sa2_row['code'])

        # remove duplicates
        matched_suburbs[i] = list(set(matched_suburbs[i] + matched_suburbs_per_rent_suburb))
        matched_codes[i] = (list(set(matched_codes[i] + matched_codes_per_rent_suburb)))

    
    return matched_suburbs, matched_codes

matched_suburbs, matched_codes = match_suburbs(historical_rental_df, victoria_gdf_with_breakdown)

joined_df = historical_rental_df.copy()
joined_df['regions'] = matched_suburbs
joined_df['codes'] = matched_codes

joined_df

Port Melbourne
Burnley
Spotswood
Portsea
Herne Hill
North Bendigo


Unnamed: 0,1 bedroom flat,suburb,Mar 2000,Mar 2000.1,Jun 2000,Jun 2000.1,Sep 2000,Sep 2000.1,Dec 2000,Dec 2000.1,...,Mar 2024,Mar 2024.1,Jun 2024,Jun 2024.1,suburbs,suburb_1,suburb_2,suburb_3,regions,codes
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,352,165,347,165,378,170,369,175,...,194,425,187,426,Albert Park-Middle Park-West St Kilda,Albert Park,Middle Park,West St Kilda,"[St Kilda - West, Albert Park]","[206051128, 206051514]"
1,,Armadale,210,150,212,150,213,155,213,160,...,155,430,147,450,Armadale,Armadale,,,[Armadale],[206061135]
2,,Carlton North,87,150,78,155,74,150,65,150,...,41,400,42,400,Carlton North,Carlton North,,,[Carlton North - Princes Hill],[206071140]
3,,Carlton-Parkville,298,165,297,170,312,175,346,180,...,1048,450,1092,470,Carlton-Parkville,Carlton,Parkville,,"[Carlton, Parkville]","[206041124, 206041117]"
4,,CBD-St Kilda Rd,755,250,861,250,934,250,952,250,...,6200,550,5962,550,CBD-St Kilda Rd,CBD,St Kilda Rd,,"[Melbourne CBD - East, Melbourne CBD - North, ...","[206041503, 206041505, 206041504]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154,,Wanagaratta,51,85,46,85,44,85,47,85,...,71,240,64,255,Wangaratta,Wangaratta,,,[Wangaratta],[204021066]
155,,Warragul,13,80,11,75,12,90,10,90,...,10,260,11,295,Warragul,Warragul,,,[Warragul],[205011079]
156,,Warrnambool,113,75,104,75,108,75,105,80,...,46,300,37,300,Warrnambool,Warrnambool,,,"[Warrnambool - South, Warrnambool - North]","[217041479, 217041480]"
157,,Wodonga,77,85,72,85,77,85,83,85,...,44,260,42,260,Wodonga,Wodonga,,,"[Wodonga, West Wodonga]","[204031073, 204031492]"
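
The two-pass idea in `match_suburbs` can be sketched on plain Python data. This is a simplified illustration, not the function above: `match_groups`, `strip_leading_direction` and the `(base, direction)` tuples are assumptions made for the example, and forced matches and sub-region matches are left out.

```python
DIRECTIONS = ("North", "East", "South", "West")

def strip_leading_direction(name):
    # "West Wodonga" -> "Wodonga"; names without a leading direction are unchanged.
    for direction in DIRECTIONS:
        if name.startswith(direction + " "):
            return name[len(direction) + 1:]
    return name

def match_groups(rent_names, sa2_regions):
    """sa2_regions maps an SA2 name to (base suburb, direction or None)."""
    matches = {rent: [] for rent in rent_names}
    claimed = set()

    # Pass 1: strict match on the base name, or on "<direction> <base>".
    for rent in rent_names:
        for sa2_name, (base, direction) in sa2_regions.items():
            directional = f"{direction} {base}" if direction else None
            if rent in (base, directional):
                matches[rent].append(sa2_name)
                claimed.add(sa2_name)

    # Pass 2: looser match for SA2 regions nobody claimed in pass 1,
    # comparing the SA2 base name with its leading direction removed.
    for rent in rent_names:
        for sa2_name, (base, _) in sa2_regions.items():
            if sa2_name not in claimed and strip_leading_direction(base) == rent:
                matches[rent].append(sa2_name)
    return matches

sa2_regions = {
    "St Kilda - West": ("St Kilda", "West"),
    "Wodonga": ("Wodonga", None),
    "West Wodonga": ("West Wodonga", None),
}
result = match_groups(["West St Kilda", "Wodonga"], sa2_regions)
print(result)
```

Restricting the second pass to unclaimed SA2 regions is what keeps the looser rule from stealing regions that already have a strict match.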


In [9]:
flattened_matched_suburbs = [item for sublist in matched_suburbs for item in sublist]

duplicates = []
seen = set()
for item in flattened_matched_suburbs:
    if item in seen:
        duplicates.append(item)
    else:
        seen.add(item)

# get all rows that contain duplicates
joined_df[joined_df['regions'].apply((lambda x: any(item in duplicates for item in x)))].sort_values('regions')

Unnamed: 0,1 bedroom flat,suburb,Mar 2000,Mar 2000.1,Jun 2000,Jun 2000.1,Sep 2000,Sep 2000.1,Dec 2000,Dec 2000.1,...,Mar 2024,Mar 2024.1,Jun 2024,Jun 2024.1,suburbs,suburb_1,suburb_2,suburb_3,regions,codes
132,Bendigo,Bendigo,84,85,78,85,84,88,84,90,...,34,290,37,280,Bendigo,Bendigo,,,[Bendigo],[202011018]
135,,North Bendigo,26,85,20,85,20,88,19,85,...,13,243,15,270,North Bendigo,North Bendigo,,,[Bendigo],[202011018]
28,,Camberwell-Glen Iris,171,135,168,140,171,140,167,140,...,210,430,211,435,Camberwell-Glen Iris,Camberwell,Glen Iris,,"[Camberwell, Glen Iris - East, Malvern - Glen ...","[207011150, 208041194, 207011149]"
84,,Fairfield-Alphington,281,120,279,120,266,125,265,128,...,203,360,191,368,Fairfield-Alphington,Fairfield,Alphington,,"[Clifton Hill - Alphington, Alphington - Fairf...","[206021110, 206071145]"
11,,Fitzroy North-Clifton Hill,163,130,176,135,181,135,191,140,...,199,425,195,435,Fitzroy North-Clifton Hill,Fitzroy North,Clifton Hill,,"[Clifton Hill - Alphington, Fitzroy North]","[206071143, 206071145]"
106,,Dandenong,78,100,78,100,79,105,73,105,...,189,320,179,320,Dandenong,Dandenong,,,"[Dandenong - North, Dandenong - South]","[212041564, 212041563]"
107,,Dandenong North-Endeavour Hills,12,103,12,108,10,108,12,108,...,56,305,51,310,Dandenong North-Endeavour Hills,Dandenong North,Endeavour Hills,,"[Dandenong North, Dandenong - North, Endeavour...","[212041312, 212041563, 212021453, 212021454]"
113,Mornington Peninsula,Dromana-Portsea,45,85,44,87,45,85,35,80,...,19,360,25,390,Dromana-Portsea,Dromana,Portsea,,"[Dromana, Mornington - West]","[214021592, 214021377]"
50,,Malvern,96,145,100,145,95,145,91,150,...,99,420,107,425,Malvern,Malvern,,,[Malvern - Glen Iris],[208041194]
116,,Mt Eliza-Mornington-Mt Martha,35,100,35,100,33,105,33,105,...,19,400,18,390,Mt Eliza-Mornington-Mt Martha,Mt Eliza,Mornington,Mt Martha,"[Mount Eliza, Mornington - East, Mount Martha,...","[214021381, 214021592, 214021591, 214021382]"
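
As an alternative to the manual `seen` loop, the duplicate check can be written with `collections.Counter`. The sketch below uses a tiny hypothetical `matched` dict in place of the notebook's lists; the Bendigo entries mirror the first two rows of the table above.

```python
from collections import Counter

# Hypothetical mapping of rental suburb group -> matched SA2 region names.
matched = {
    "Bendigo": ["Bendigo"],
    "North Bendigo": ["Bendigo"],
    "Armadale": ["Armadale"],
}

# Count how many groups claim each SA2 region.
counts = Counter(region for regions in matched.values() for region in regions)
duplicates = {region for region, n in counts.items() if n > 1}

# Groups that contain at least one region claimed more than once.
conflicting_groups = sorted(
    group for group, regions in matched.items() if duplicates & set(regions)
)
print(duplicates, conflicting_groups)
```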


There are 9 entities with duplicates

From inspection and additional research into the overlaps between SA2 regions and suburbs (https://maps.abs.gov.au/), the following changes are needed:

Mt Eliza-Mornington-Mt Martha should keep Mornington - East and drop Mornington - West, which stays with Dromana-Portsea.

Dromana-Portsea should remain the same.

Fairfield-Alphington should drop Clifton Hill - Alphington.

Malvern could be merged with Camberwell-Glen Iris (Camberwell-Glen Iris-Malvern). The alternative, used here, is to drop Malvern - Glen Iris from Camberwell-Glen Iris.

Bendigo and North Bendigo map to the same SA2 region, so they should be merged (Bendigo-North Bendigo). Here the North Bendigo row is dropped.

Dandenong North-Endeavour Hills should keep the SA2 region Dandenong North and drop Dandenong - North, which stays with Dandenong.

St Kilda should drop St Kilda - West, which stays with Albert Park-Middle Park-West St Kilda.

In [10]:
# Perform these overrides
joined_df_cleaned = joined_df.copy()

row_index = joined_df_cleaned[joined_df_cleaned['suburbs'] == 'Camberwell-Glen Iris'].index[0]
joined_df_cleaned.at[row_index, 'regions'] = ['Camberwell', 'Glen Iris - East']
joined_df_cleaned.at[row_index, 'codes'] = [207011149, 207011150]

row_index = joined_df_cleaned[joined_df_cleaned['suburbs'] == 'Fairfield-Alphington'].index[0]
joined_df_cleaned.at[row_index, 'regions'] = ['Alphington - Fairfield']
joined_df_cleaned.at[row_index, 'codes'] = [206021110]

row_index = joined_df_cleaned[joined_df_cleaned['suburbs'] == 'Mt Eliza-Mornington-Mt Martha'].index[0]
joined_df_cleaned.at[row_index, 'regions'] = ['Mount Eliza', 'Mornington - East', 'Mount Martha']
joined_df_cleaned.at[row_index, 'codes'] = [214021382, 214021591, 214021381]

row_index = joined_df_cleaned[joined_df_cleaned['suburbs'] == 'St Kilda'].index[0]
joined_df_cleaned.at[row_index, 'regions'] = ['St Kilda - Central']
joined_df_cleaned.at[row_index, 'codes'] = [206051513]

row_index = joined_df_cleaned[joined_df_cleaned['suburbs'] == 'Dandenong North-Endeavour Hills'].index[0]
joined_df_cleaned.at[row_index, 'regions'] = ['Endeavour Hills - North', 'Dandenong North', 'Endeavour Hills - South']
joined_df_cleaned.at[row_index, 'codes'] = [212041312, 212021454, 212021453]

joined_df_cleaned = joined_df_cleaned[joined_df_cleaned['suburbs'] != 'North Bendigo']
# drop any rental suburb groups that matched no SA2 codes
joined_df_cleaned = joined_df_cleaned[joined_df_cleaned['codes'].apply(lambda x: len(x) > 0)]
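
One possible way to keep these overrides easier to audit is to hold them in a single dict and apply them in a loop. The sketch below is an assumption about how that could look, not the notebook's code; `OVERRIDES` and the small `df` are hypothetical, with values copied from the duplicate table and the cell above.

```python
import pandas as pd

# Pre-override rows for two conflicting groups plus one untouched group.
df = pd.DataFrame({
    "suburbs": ["Camberwell-Glen Iris", "Fairfield-Alphington", "Armadale"],
    "regions": [
        ["Camberwell", "Glen Iris - East", "Malvern - Glen Iris"],
        ["Clifton Hill - Alphington", "Alphington - Fairfield"],
        ["Armadale"],
    ],
    "codes": [
        [207011150, 208041194, 207011149],
        [206021110, 206071145],
        [206061135],
    ],
})

# Hypothetical override table: group name -> (regions, codes).
OVERRIDES = {
    "Camberwell-Glen Iris": (["Camberwell", "Glen Iris - East"], [207011149, 207011150]),
    "Fairfield-Alphington": (["Alphington - Fairfield"], [206021110]),
}

for group, (regions, codes) in OVERRIDES.items():
    hits = df.index[df["suburbs"] == group]
    # Fail loudly if the group name is missing or ambiguous.
    if len(hits) != 1:
        raise ValueError(f"expected exactly one row for {group!r}, found {len(hits)}")
    df.at[hits[0], "regions"] = regions
    df.at[hits[0], "codes"] = codes

print(df[["suburbs", "codes"]])
```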

In [11]:
# verify there are no duplicates
flattened_cleaned_suburbs = [item for sublist in joined_df_cleaned['codes'] for item in sublist]
print('As list number of codes: ', len(flattened_cleaned_suburbs))
print('As set number of codes. Should be the same as the list count (otherwise there are duplicates): ', 
      len(set(flattened_cleaned_suburbs)))

As list number of codes:  295
As set number of codes. Should be the same as the list count (otherwise there are duplicates):  295
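
The same invariant (each SA2 code belongs to exactly one group) can also be enforced while building a reverse lookup from code to group. This is a minimal sketch; `build_code_to_group` and the sample `groups` dict are hypothetical names for illustration.

```python
def build_code_to_group(groups):
    """Map each SA2 code to its rental suburb group, rejecting duplicates."""
    code_to_group = {}
    for group, codes in groups.items():
        for code in codes:
            if code in code_to_group:
                raise ValueError(
                    f"SA2 code {code} is in both {code_to_group[code]!r} and {group!r}"
                )
            code_to_group[code] = group
    return code_to_group

groups = {
    "Carlton-Parkville": [206041117, 206041124],
    "Armadale": [206061135],
}
code_to_group = build_code_to_group(groups)
print(code_to_group[206041124])
```

Unlike the length comparison, the error message names the offending code and both groups, which makes the follow-up override easier to write.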


In [38]:
cleaned_exploded = joined_df_cleaned.explode('codes')

cleaned_exploded = cleaned_exploded.dropna(subset=['codes'])
cleaned_exploded['codes'] = cleaned_exploded['codes'].astype(int)

victoria_gdf_int_keys = victoria_gdf.copy()
victoria_gdf_int_keys['code'] = victoria_gdf_int_keys['code'].astype(int)

gdf_shapefile_joined = cleaned_exploded.merge(victoria_gdf_int_keys, left_on='codes', right_on='code', how='inner')

print('The number of rows in the dataframe (should be same as cell above) is:', len(gdf_shapefile_joined))

gdf_shapefile_joined

The number of rows in the dataframe (should be same as cell above) is: 295


Unnamed: 0,1 bedroom flat,suburb_x,Mar 2000,Mar 2000.1,Jun 2000,Jun 2000.1,Sep 2000,Sep 2000.1,Dec 2000,Dec 2000.1,...,Jun 2024.1,suburbs,suburb_1,suburb_2,suburb_3,regions,codes,code,suburb_y,geometry
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,352,165,347,165,378,170,369,175,...,426,Albert Park-Middle Park-West St Kilda,Albert Park,Middle Park,West St Kilda,"[Albert Park, St Kilda - West]",206051128,206051128,Albert Park,"POLYGON ((144.96767 -37.83737, 144.96789 -37.8..."
1,Inner Melbourne,Albert Park-Middle Park-West St Kilda,352,165,347,165,378,170,369,175,...,426,Albert Park-Middle Park-West St Kilda,Albert Park,Middle Park,West St Kilda,"[Albert Park, St Kilda - West]",206051514,206051514,St Kilda - West,"POLYGON ((144.97031 -37.86077, 144.97018 -37.8..."
2,,Armadale,210,150,212,150,213,155,213,160,...,450,Armadale,Armadale,,,[Armadale],206061135,206061135,Armadale,"POLYGON ((145.01167 -37.85357, 145.01177 -37.8..."
3,,Carlton North,87,150,78,155,74,150,65,150,...,400,Carlton North,Carlton North,,,[Carlton North - Princes Hill],206071140,206071140,Carlton North - Princes Hill,"POLYGON ((144.9594 -37.78471, 144.95955 -37.78..."
4,,Carlton-Parkville,298,165,297,170,312,175,346,180,...,470,Carlton-Parkville,Carlton,Parkville,,"[Parkville, Carlton]",206041117,206041117,Carlton,"POLYGON ((144.97488 -37.79794, 144.97477 -37.7..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
290,,Warragul,13,80,11,75,12,90,10,90,...,295,Warragul,Warragul,,,[Warragul],205011079,205011079,Warragul,"POLYGON ((145.89337 -38.14423, 145.89376 -38.1..."
291,,Warrnambool,113,75,104,75,108,75,105,80,...,300,Warrnambool,Warrnambool,,,"[Warrnambool - South, Warrnambool - North]",217041480,217041480,Warrnambool - South,"POLYGON ((142.45281 -38.39126, 142.4523 -38.39..."
292,,Warrnambool,113,75,104,75,108,75,105,80,...,300,Warrnambool,Warrnambool,,,"[Warrnambool - South, Warrnambool - North]",217041479,217041479,Warrnambool - North,"POLYGON ((142.43668 -38.35544, 142.43658 -38.3..."
293,,Wodonga,77,85,72,85,77,85,83,85,...,260,Wodonga,Wodonga,,,"[Wodonga, West Wodonga]",204031073,204031073,West Wodonga,"POLYGON ((146.77484 -36.12753, 146.77394 -36.1..."
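
An inner merge silently drops any code that has no shape, so the row count is the only signal that something went missing. A hedged alternative is a left merge with `indicator=True`, sketched below on made-up frames; the third row and its code are deliberately fake to trigger the check.

```python
import pandas as pd

# Hypothetical exploded join table; the last code does not exist.
exploded = pd.DataFrame({
    "suburbs": ["Armadale", "Warragul", "Made-up group"],
    "codes": [206061135, 205011079, 999999999],
})

# Hypothetical stand-in for the SA2 shapes, keyed by code.
shapes = pd.DataFrame({
    "code": [206061135, 205011079],
    "suburb": ["Armadale", "Warragul"],
})

checked = exploded.merge(
    shapes,
    left_on="codes",
    right_on="code",
    how="left",
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",  # raises if a code appears twice in shapes
)

# Groups whose code found no matching shape.
unmatched = checked.loc[checked["_merge"] == "left_only", "suburbs"].tolist()
print(unmatched)
```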


In [39]:
import geopandas as gpd
import folium
import random

# Keep only the columns needed for the map and the output table
gdf_shapefile_joined_simplified = gdf_shapefile_joined[['suburbs', 'regions', 'code', 'geometry']].copy()
gdf_shapefile_joined_simplified = gpd.GeoDataFrame(gdf_shapefile_joined_simplified, geometry='geometry')

gdf_shapefile_joined_simplified = gdf_shapefile_joined_simplified.dissolve(by='suburbs', aggfunc={
    'suburbs': 'first',
    'regions': 'first',
    'code': list
})
gdf_shapefile_joined_simplified

# gdf_shapefile_joined_simplified['geometry'] = gdf_shapefile_joined_simplified['geometry'].simplify(tolerance=0.001, preserve_topology=True)

# Convert the GeoDataFrame to GeoJSON format
geojson_data = gdf_shapefile_joined_simplified.to_json()

# Function to generate random color
def get_random_color():
    return "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])

# Create a dictionary to store random colors for each suburb
suburb_colors = {suburb: get_random_color() for suburb in gdf_shapefile_joined_simplified['suburbs'].unique()}

# Create a style function to assign each polygon a color based on the suburb
def style_function(feature):
    suburb = feature['properties']['suburbs']  # Get the suburb name
    return {
        'fillColor': suburb_colors[suburb],  # Assign the random color based on the suburb
        'color': 'black',  # Polygon boundary color
        'weight': 1,  # Polygon boundary weight
        'fillOpacity': 0.7  # Opacity of the fill color
    }

# Create a Folium Map centered on an appropriate location
m = folium.Map(location=[-37.8136, 144.9631], zoom_start=10)  # Example: center on Melbourne, Australia

# Add the GeoDataFrame to the map using GeoJSON with the custom style_function
folium.GeoJson(
    geojson_data,
    style_function=style_function
).add_to(m)

# If using Jupyter, display the map inline
m
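
Random colours change every time the cell is re-run, which makes two renders of the map hard to compare. One possible alternative, sketched below, derives the colour from a hash of the suburb group name so that it stays stable across runs. `stable_colour` is a hypothetical helper, not part of the notebook.

```python
import hashlib

def stable_colour(name):
    # Hash the name and use the first six hex digits as an RGB colour.
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return "#" + digest[:6].upper()

names = ["Carlton-Parkville", "Armadale", "Warrnambool"]
suburb_colours = {name: stable_colour(name) for name in names}
print(suburb_colours)
```

The resulting dict could be used in place of `suburb_colors` in the style function.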

Note: the source data contained two typos, Newcombe (should be Newcomb) and Wanagaratta (should be Wangaratta).

In [40]:

output_df = gdf_shapefile_joined_simplified.reset_index(drop=True)

os.makedirs(FINAL_OUTPUT_PATH, exist_ok=True)

output_df.to_csv(FINAL_OUTPUT_PATH + '/sa2_to_rental_suburb_groups.csv')
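
When the CSV is read back, list columns arrive as strings such as `"[206041117, 206041124]"`. The sketch below shows one way to restore them with `ast.literal_eval`, using an in-memory buffer and a one-row frame that are made up for the example.

```python
import ast
import io

import pandas as pd

original = pd.DataFrame({
    "suburbs": ["Carlton-Parkville"],
    "code": [[206041117, 206041124]],
})

# Write to an in-memory buffer instead of a file for this example.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# A plain read leaves the list as its string representation.
buffer.seek(0)
naive = pd.read_csv(buffer)

# A converter parses the string back into a Python list on load.
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"code": ast.literal_eval})

print(type(naive.loc[0, "code"]), type(parsed.loc[0, "code"]))
```

The same converter should work for the regions column, since it is also stored as a list literal.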

This final cell is just used for testing, feel free to ignore!

In [16]:
suburbs = {j for j in victoria_gdf['suburb'] if j}
print([i for i in victoria_gdf['suburb'] if 'St Kilda' in i])
[i for i in all_suburbs_list_rental if i and 'CBD' in i]# i not in suburbs]

['St Kilda East', 'St Kilda - Central', 'St Kilda - West']


['CBD']