# Historical Rent Preprocessing
This notebook outlines the preprocessing of historical rent data supplied by the Department of Families, Fairness and Housing (DFFH).

Sections:
1. Rental Suburb Mapping
2. Data Restructuring & Cleaning
3. SAL Code Retrieval

Before running this notebook, ensure that the file moving-annual-rent-suburb-march-quarter-2024.xlsx has been downloaded to the data/landing directory. This can be done by running the download datasets.py script.

In [665]:
import pandas as pd

### 1 Rental Suburb Mapping
An initial inspection of the historical rent dataset showed that it was poorly structured (see the initial reading of the data below). In particular, the suburb groups were ambiguously defined and followed no obvious pattern. Further inspection of the Homes Victoria Rental Report that accompanies this data (the March quarter 2024 rental report can be downloaded from [https://www.dffh.vic.gov.au/publications/rental-report]) showed that neighbouring suburbs with similar rent characteristics were grouped together. However, the DFFH website supplies no shapefile for these groupings, only a figure in the report showing how the suburbs of Melbourne were split.

We emailed the DFFH to ask whether they could supply a shapefile, or some other data, that defined these suburb groupings specifically and clearly. However, they did not respond.

This was unfortunate, because our study is directly concerned with rent in the suburbs of Victoria. It was crucial to have some measure of historical rent prices in Victoria, and since no other dataset on this could be found, this dataset was our best choice, so it was important to get these groupings correct. The report also stated that the suburb boundaries were defined from the gazetted localities shapefile supplied at [https://discover.data.vic.gov.au/dataset/vicmap-admin]. The first step of preprocessing was therefore to visually inspect the suburb group boundaries by hand and line them up with the gazetted localities to determine which suburbs are in each group. This is not ideal, but it was necessary since the DFFH could not supply the data directly.

A final important note is that the data comes from a rental report. Suburbs not found in the mapped lists are deemed out of scope for our study, as the DFFH found their rental data to be either non-existent or not relevant enough to report.



In [666]:
# initial reading of the historical rent data
historical_rent_df = pd.read_excel("../../data/landing/moving-annual-rent-suburb-march-quarter-2024.xlsx", sheet_name="All properties")
historical_rent_df

Unnamed: 0,Moving annual rent by suburb,Unnamed: 1,Lease commenced in year ending,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 186,Unnamed: 187,Unnamed: 188,Unnamed: 189,Unnamed: 190,Unnamed: 191,Unnamed: 192,Unnamed: 193,Unnamed: 194,Unnamed: 195
0,All properties,,Mar 2000,,Jun 2000,,Sep 2000,,Dec 2000,,...,Mar 2023,,Jun 2023,,Sep 2023,,Dec 2023,,Mar 2024,
1,,,Count,Median,Count,Median,Count,Median,Count,Median,...,Count,Median,Count,Median,Count,Median,Count,Median,Count,Median
2,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,796,545,740,550,730,600,720,600,671,650
3,,Armadale,733,200,737,200,738,205,739,210,...,757,490,687,500,639,525,594,560,566,560
4,,Carlton North,864,260,814,260,799,265,736,270,...,497,620,495,630,467,650,418,670,384,680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,,Wanagaratta,705,125,671,125,631,130,623,130,...,535,380,555,390,565,390,593,395,580,400
157,,Warragul,385,130,367,135,382,135,366,135,...,507,440,542,450,558,450,543,460,541,470
158,,Warrnambool,1266,130,1229,135,1204,135,1135,135,...,881,420,861,430,846,450,844,460,840,460
159,,Wodonga,1446,145,1439,145,1468,150,1449,150,...,1205,410,1187,420,1164,420,1155,430,1139,450


#### 1.1 The Mapping Method
The mapping was done in an Excel worksheet.

The suburb groups of the historical rent data were copied and pasted into one column, labelled "Rental suburbs". The column next to it, named "SAL suburbs (gazetted localities)", lists all suburbs found in that group. These were identified by comparing the suburb group borders in figure 2 of the report with the gazetted localities map, and were entered in Excel in the format specified below:

Suburb Cluster	| Rental suburbs | SAL suburbs (gazetted localities)

Inner Melbourne | Albert Park-Middle Park-West St Kilda | albert park - middle park - st kilda west

An additional column, Suburb Cluster, was used to separate the suburb groups geographically; this was also copied and pasted in from the historical rental data. Each suburb group was manually inspected as described above, following the boundaries in the gazetted localities data, which lined up almost exactly one-to-one with the figure in the report.

Since the figure covers only Melbourne and its outer regions, the "Other Regional Centres" cluster could not be expanded into smaller suburbs like the other suburb groups in the data, so it was left as it is.

Once the mapping was finished, the Excel file was saved with the file name "rental_suburbs_to_SAL_mapping" to the data/raw/ directory.

In [667]:
# read in the mapped out rental groups
rental_groups_df = pd.read_excel('../../data/raw/rental_suburbs_to_SAL_mapping.xlsx')
print(rental_groups_df.shape)
rental_groups_df

(159, 3)


Unnamed: 0,Suburb Cluster,Rental suburbs,SAL suburbs (gazetted localities)
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,albert park - middle park - st kilda west
1,,Armadale,armadale
2,,Carlton North,carlton north - princes hill
3,,Carlton-Parkville,parkville - carlton
4,,CBD-St Kilda Rd,melbourne cbd
...,...,...,...
154,,Wangaratta,Wangaratta
155,,Warragul,Warragul
156,,Warrnambool,Warrnambool
157,,Wodonga,Wodonga


### 2 Data Restructuring & Cleaning

In [668]:
# this cell builds new column titles for the historical rent df that make the dataframe more readable

# fill every NaN cell in the first row with the value of the cell to its left
column_list = list(historical_rent_df.columns)
for col_index in range(1, len(column_list)):
    col = column_list[col_index]
    prev_col = column_list[col_index - 1]
    if pd.isna(historical_rent_df.at[0, col]):
        historical_rent_df.at[0, col] = historical_rent_df.at[0, prev_col]

# merge first row and second row together
merged_row = historical_rent_df.iloc[0].astype(str) + ' ' + historical_rent_df.iloc[1].astype(str)
historical_rent_df.loc[1] = merged_row

# rename the first two columns
historical_rent_df.loc[1,"Moving annual rent by suburb"] = "Suburb Cluster"
historical_rent_df.loc[1,"Unnamed: 1"] = "Suburb(s)"

# set second row as the new columns titles
historical_rent_df.columns = historical_rent_df.iloc[1]

historical_rent_df

1,Suburb Cluster,Suburb(s),Mar 2000 Count,Mar 2000 Median,Jun 2000 Count,Jun 2000 Median,Sep 2000 Count,Sep 2000 Median,Dec 2000 Count,Dec 2000 Median,...,Mar 2023 Count,Mar 2023 Median,Jun 2023 Count,Jun 2023 Median,Sep 2023 Count,Sep 2023 Median,Dec 2023 Count,Dec 2023 Median,Mar 2024 Count,Mar 2024 Median
0,All properties,All properties,Mar 2000,Mar 2000,Jun 2000,Jun 2000,Sep 2000,Sep 2000,Dec 2000,Dec 2000,...,Mar 2023,Mar 2023,Jun 2023,Jun 2023,Sep 2023,Sep 2023,Dec 2023,Dec 2023,Mar 2024,Mar 2024
1,Suburb Cluster,Suburb(s),Mar 2000 Count,Mar 2000 Median,Jun 2000 Count,Jun 2000 Median,Sep 2000 Count,Sep 2000 Median,Dec 2000 Count,Dec 2000 Median,...,Mar 2023 Count,Mar 2023 Median,Jun 2023 Count,Jun 2023 Median,Sep 2023 Count,Sep 2023 Median,Dec 2023 Count,Dec 2023 Median,Mar 2024 Count,Mar 2024 Median
2,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,796,545,740,550,730,600,720,600,671,650
3,,Armadale,733,200,737,200,738,205,739,210,...,757,490,687,500,639,525,594,560,566,560
4,,Carlton North,864,260,814,260,799,265,736,270,...,497,620,495,630,467,650,418,670,384,680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,,Wanagaratta,705,125,671,125,631,130,623,130,...,535,380,555,390,565,390,593,395,580,400
157,,Warragul,385,130,367,135,382,135,366,135,...,507,440,542,450,558,450,543,460,541,470
158,,Warrnambool,1266,130,1229,135,1204,135,1135,135,...,881,420,861,430,846,450,844,460,840,460
159,,Wodonga,1446,145,1439,145,1468,150,1449,150,...,1205,410,1187,420,1164,420,1155,430,1139,450
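
The header flattening above can also be expressed with a forward fill. The block below is a minimal sketch on a made-up three-row frame (the values and the `raw`, `flat_names` and `tidy` names are illustrative, not part of the notebook); it assumes the same two-row header layout as the spreadsheet.

```python
import pandas as pd

# toy frame mimicking the two header rows of the spreadsheet (values are made up)
raw = pd.DataFrame(
    [
        ["All properties", None, "Mar 2000", None, "Jun 2000", None],
        [None, None, "Count", "Median", "Count", "Median"],
        ["Inner Melbourne", "Armadale", 733, 200, 737, 200],
    ]
)

# carry each quarter label rightwards over the empty cell beside it
quarters = raw.iloc[0].ffill()
measures = raw.iloc[1]

# join quarter and measure, then name the two leading columns by hand
flat_names = (quarters.astype(str) + " " + measures.astype(str)).tolist()
flat_names[:2] = ["Suburb Cluster", "Suburb(s)"]

# keep only the data rows and attach the flattened titles
tidy = raw.iloc[2:].reset_index(drop=True)
tidy.columns = flat_names
print(list(tidy.columns))
```

Dropping the header rows in the same step avoids carrying them along as data rows, which is what the next cell has to undo.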


In [669]:
# drop the two header rows, which are no longer needed
historical_rent_df.drop([0,1], inplace=True)
historical_rent_df.reset_index(drop=True, inplace=True)
historical_rent_df.shape

(159, 196)

In [670]:
# label clusters of suburbs for all rows
current_cluster = historical_rent_df.iloc[0]["Suburb Cluster"]
for i in historical_rent_df.index:
    cluster = historical_rent_df.loc[i, "Suburb Cluster"]
    if isinstance(cluster, str):
        current_cluster = cluster
    else:
        # write through .loc so the dataframe itself is updated, not a copy of the row
        historical_rent_df.loc[i, "Suburb Cluster"] = current_cluster
historical_rent_df

1,Suburb Cluster,Suburb(s),Mar 2000 Count,Mar 2000 Median,Jun 2000 Count,Jun 2000 Median,Sep 2000 Count,Sep 2000 Median,Dec 2000 Count,Dec 2000 Median,...,Mar 2023 Count,Mar 2023 Median,Jun 2023 Count,Jun 2023 Median,Sep 2023 Count,Sep 2023 Median,Dec 2023 Count,Dec 2023 Median,Mar 2024 Count,Mar 2024 Median
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,796,545,740,550,730,600,720,600,671,650
1,Inner Melbourne,Armadale,733,200,737,200,738,205,739,210,...,757,490,687,500,639,525,594,560,566,560
2,Inner Melbourne,Carlton North,864,260,814,260,799,265,736,270,...,497,620,495,630,467,650,418,670,384,680
3,Inner Melbourne,Carlton-Parkville,1303,251,1278,260,1280,260,1301,260,...,2953,500,2755,530,2687,550,2662,550,2543,570
4,Inner Melbourne,CBD-St Kilda Rd,2132,320,2264,320,2358,320,2361,320,...,13568,550,13505,580,13552,600,13564,620,13582,640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154,Other Regional Centres,Wanagaratta,705,125,671,125,631,130,623,130,...,535,380,555,390,565,390,593,395,580,400
155,Other Regional Centres,Warragul,385,130,367,135,382,135,366,135,...,507,440,542,450,558,450,543,460,541,470
156,Other Regional Centres,Warrnambool,1266,130,1229,135,1204,135,1135,135,...,881,420,861,430,846,450,844,460,840,460
157,Other Regional Centres,Wodonga,1446,145,1439,145,1468,150,1449,150,...,1205,410,1187,420,1164,420,1155,430,1139,450
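
Filling the blank cluster cells downwards is what a forward fill does, so the loop above could plausibly be replaced by a single call. This is a sketch on a toy frame (the `toy` name and its rows are illustrative), assuming blank cells are read in as missing values.

```python
import pandas as pd

# toy frame: the cluster name appears only on the first row of each cluster
toy = pd.DataFrame({
    "Suburb Cluster": ["Inner Melbourne", None, None, "Other Regional Centres", None],
    "Suburb(s)": ["Armadale", "Carlton North", "Carlton-Parkville", "Warragul", "Wodonga"],
})

# propagate the last seen cluster name down over the missing cells
toy["Suburb Cluster"] = toy["Suburb Cluster"].ffill()
print(toy)
```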


In [671]:
# remove group total rows for the suburb clusters - can perform a groupby if we need this data later
null_rows = historical_rent_df[historical_rent_df["Suburb(s)"] == "Group Total"].index
historical_rent_df = historical_rent_df.drop(null_rows)
historical_rent_df.shape

(146, 196)

In [672]:
# do the same as the cell above for the mapped-out suburbs file so both dataframes have the same number of rows
null_rows = rental_groups_df[rental_groups_df["SAL suburbs (gazetted localities)"].isnull()].index
rental_groups_df = rental_groups_df.drop(null_rows)
rental_groups_df.shape

(146, 3)

In [673]:
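# add the list of suburbs in each suburb group as a column to the historical rent data
# (pandas matches the rows of the two dataframes on their index, so both must share the same row labels)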
historical_rent_df['SAL suburbs (gazetted localities)'] = rental_groups_df['SAL suburbs (gazetted localities)']
historical_rent_df.shape

(146, 197)
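
Because the column assignment above pairs rows purely by index, a quick check that the group names agree on both sides can catch a misaligned mapping file. The helper below is a hypothetical sketch (`unmatched_groups` is not part of the notebook); it compares two name columns after aligning them on the index and reports the rows that differ. The toy values reuse the spelling difference visible in the outputs above.

```python
import pandas as pd

def unmatched_groups(rent_names, mapping_names):
    """Return (index, rent name, mapping name) for rows whose group names differ."""
    rent = rent_names.str.strip().str.lower()
    mapped = mapping_names.str.strip().str.lower()
    # align on the index, which is how column assignment pairs the rows
    rent, mapped = rent.align(mapped, join="outer")
    differs = rent.ne(mapped)
    return [(i, rent_names.get(i), mapping_names.get(i)) for i in differs[differs].index]

# toy columns standing in for "Suburb(s)" and "Rental suburbs"
rent_names = pd.Series(["Armadale", "Wanagaratta", "Warragul"], index=[1, 154, 155])
mapping_names = pd.Series(["Armadale", "Wangaratta", "Warragul"], index=[1, 154, 155])

print(unmatched_groups(rent_names, mapping_names))
```

A row that is reported only because of a spelling difference in the source data is harmless; a long list of mismatches would suggest the two files have drifted out of row order.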

In [674]:
historical_rent_df = historical_rent_df.reset_index(drop=True)

In [675]:
# split the mapped suburb groups into Python lists so that each suburb can be processed individually
historical_rent_df['Suburb_List'] = historical_rent_df['SAL suburbs (gazetted localities)'].apply(lambda x: x.split('-'))
historical_rent_df = historical_rent_df.drop(columns='SAL suburbs (gazetted localities)')
historical_rent_df = historical_rent_df.rename(columns={"Suburb_List":'SAL suburbs (gazetted localities)'})
historical_rent_df.head(5)

1,Suburb Cluster,Suburb(s),Mar 2000 Count,Mar 2000 Median,Jun 2000 Count,Jun 2000 Median,Sep 2000 Count,Sep 2000 Median,Dec 2000 Count,Dec 2000 Median,...,Mar 2023 Median,Jun 2023 Count,Jun 2023 Median,Sep 2023 Count,Sep 2023 Median,Dec 2023 Count,Dec 2023 Median,Mar 2024 Count,Mar 2024 Median,SAL suburbs (gazetted localities)
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,545,740,550,730,600,720,600,671,650,"[albert park , middle park , st kilda west]"
1,Inner Melbourne,Armadale,733,200,737,200,738,205,739,210,...,490,687,500,639,525,594,560,566,560,[armadale]
2,Inner Melbourne,Carlton North,864,260,814,260,799,265,736,270,...,620,495,630,467,650,418,670,384,680,"[carlton north , princes hill]"
3,Inner Melbourne,Carlton-Parkville,1303,251,1278,260,1280,260,1301,260,...,500,2755,530,2687,550,2662,550,2543,570,"[parkville , carlton]"
4,Inner Melbourne,CBD-St Kilda Rd,2132,320,2264,320,2358,320,2361,320,...,550,13505,580,13552,600,13564,620,13582,640,[melbourne cbd]


In [676]:
# for every row, strip leading and trailing whitespace from each SAL suburb in the suburb group
# and capitalise the first letter of every word
historical_rent_df['SAL suburbs (gazetted localities)'] = historical_rent_df['SAL suburbs (gazetted localities)'].apply(
    lambda suburbs: [s.strip().title() for s in suburbs]
)
historical_rent_df.head(5)

1,Suburb Cluster,Suburb(s),Mar 2000 Count,Mar 2000 Median,Jun 2000 Count,Jun 2000 Median,Sep 2000 Count,Sep 2000 Median,Dec 2000 Count,Dec 2000 Median,...,Mar 2023 Median,Jun 2023 Count,Jun 2023 Median,Sep 2023 Count,Sep 2023 Median,Dec 2023 Count,Dec 2023 Median,Mar 2024 Count,Mar 2024 Median,SAL suburbs (gazetted localities)
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,545,740,550,730,600,720,600,671,650,"[Albert Park, Middle Park, St Kilda West]"
1,Inner Melbourne,Armadale,733,200,737,200,738,205,739,210,...,490,687,500,639,525,594,560,566,560,[Armadale]
2,Inner Melbourne,Carlton North,864,260,814,260,799,265,736,270,...,620,495,630,467,650,418,670,384,680,"[Carlton North, Princes Hill]"
3,Inner Melbourne,Carlton-Parkville,1303,251,1278,260,1280,260,1301,260,...,500,2755,530,2687,550,2662,550,2543,570,"[Parkville, Carlton]"
4,Inner Melbourne,CBD-St Kilda Rd,2132,320,2264,320,2358,320,2361,320,...,550,13505,580,13552,600,13564,620,13582,640,[Melbourne Cbd]
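
The split and clean-up steps can be combined into one small function. The sketch below uses a hypothetical helper name, `parse_suburb_group`, and also shows why names such as Mccrae and Melbourne Cbd appear in the output: `str.title()` capitalises only the first letter of each word and lowercases the rest.

```python
def parse_suburb_group(text):
    """Split a hyphen-separated group into cleaned, title-cased suburb names."""
    return [name.strip().title() for name in text.split("-")]

print(parse_suburb_group("albert park - middle park - st kilda west"))
print(parse_suburb_group("melbourne cbd"))
print(parse_suburb_group("mccrae"))
```

Splitting on the hyphen assumes that no suburb name in the mapping file contains a hyphen itself; any such name would be cut in two and need handling by hand.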


In [677]:
# explode the df on the "SAL suburbs (gazetted localities)" column
# each row is now an individual suburb carrying the quarterly median rents of its suburb group for 2000-2024
# SIDENOTE: we assume here that the median rent of each suburb is approximately equal to the median rent of
# its suburb group
all_suburbs_rent_df = historical_rent_df.explode("SAL suburbs (gazetted localities)")
all_suburbs_rent_df.rename(columns=lambda x: x + ' (of suburb group)' if x.endswith('Count') else x, inplace=True)

# Rename the exploded column for clarity
all_suburbs_rent_df = all_suburbs_rent_df.rename(columns={
    "SAL suburbs (gazetted localities)": "SAL suburb",
    "Suburb(s)": "Suburb Group"
})
all_suburbs_rent_df[all_suburbs_rent_df["Suburb Group"]=="Dromana-Portsea"]

1,Suburb Cluster,Suburb Group,Mar 2000 Count (of suburb group),Mar 2000 Median,Jun 2000 Count (of suburb group),Jun 2000 Median,Sep 2000 Count (of suburb group),Sep 2000 Median,Dec 2000 Count (of suburb group),Dec 2000 Median,...,Mar 2023 Median,Jun 2023 Count (of suburb group),Jun 2023 Median,Sep 2023 Count (of suburb group),Sep 2023 Median,Dec 2023 Count (of suburb group),Dec 2023 Median,Mar 2024 Count (of suburb group),Mar 2024 Median,SAL suburb
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Portsea
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Sorrento
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Blairgowrie
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Rye
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,St Andrews Beach
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Tootgarook
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Capel Sound
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Rosebud
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Mccrae
105,Mornington Peninsula,Dromana-Portsea,1205,135,1187,140,1232,140,1260,140,...,520,1610,525,1692,530,1739,530,1712,545,Arthurs Seat


In [678]:
all_suburbs_rent_df = all_suburbs_rent_df.reset_index(drop=True)
all_suburbs_rent_df

1,Suburb Cluster,Suburb Group,Mar 2000 Count (of suburb group),Mar 2000 Median,Jun 2000 Count (of suburb group),Jun 2000 Median,Sep 2000 Count (of suburb group),Sep 2000 Median,Dec 2000 Count (of suburb group),Dec 2000 Median,...,Mar 2023 Median,Jun 2023 Count (of suburb group),Jun 2023 Median,Sep 2023 Count (of suburb group),Sep 2023 Median,Dec 2023 Count (of suburb group),Dec 2023 Median,Mar 2024 Count (of suburb group),Mar 2024 Median,SAL suburb
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,545,740,550,730,600,720,600,671,650,Albert Park
1,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,545,740,550,730,600,720,600,671,650,Middle Park
2,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,545,740,550,730,600,720,600,671,650,St Kilda West
3,Inner Melbourne,Armadale,733,200,737,200,738,205,739,210,...,490,687,500,639,525,594,560,566,560,Armadale
4,Inner Melbourne,Carlton North,864,260,814,260,799,265,736,270,...,620,495,630,467,650,418,670,384,680,Carlton North
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,Other Regional Centres,Traralgon,851,125,823,120,831,125,807,125,...,385,922,390,910,390,880,395,842,410,Traralgon
565,Other Regional Centres,Wanagaratta,705,125,671,125,631,130,623,130,...,380,555,390,565,390,593,395,580,400,Wangaratta
566,Other Regional Centres,Warragul,385,130,367,135,382,135,366,135,...,440,542,450,558,450,543,460,541,470,Warragul
567,Other Regional Centres,Warrnambool,1266,130,1229,135,1204,135,1135,135,...,420,861,430,846,450,844,460,840,460,Warrnambool
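
To make the effect of the explode step concrete, here is a minimal sketch on a two-row toy frame (the `groups` and `per_suburb` names are illustrative; the medians are the Mar 2024 values shown above). Each list element becomes its own row, and the group-level values are repeated for every suburb in the group.

```python
import pandas as pd

# toy frame: one row per suburb group, with the member suburbs held in a list
groups = pd.DataFrame({
    "Suburb Group": ["Carlton North", "Armadale"],
    "Mar 2024 Median": [680, 560],
    "SAL suburb": [["Carlton North", "Princes Hill"], ["Armadale"]],
})

# one row per suburb; reset the index because explode repeats the original row labels
per_suburb = groups.explode("SAL suburb").reset_index(drop=True)
print(per_suburb)
```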


### 3 SAL Code Retrieval
In this section we find the SAL (Suburbs and Localities) code for each suburb in the exploded dataframe.

In [679]:
# read the converter df for suburb name to the SAL code
SAL_converter_df = pd.read_csv("../../data/landing/CG_SSC_2016_SAL_2021.csv")
SAL_converter_df

Unnamed: 0,SSC_CODE_2016,SSC_NAME_2016,SAL_CODE_2021,SAL_NAME_2021,RATIO_FROM_TO,INDIV_TO_REGION_QLTY_INDICATOR,OVERALL_QUALITY_INDICATOR,BMOS_NULL_FLAG
0,10001.0,Aarons Pass,10001,Aarons Pass,1.0,Good,Good,0
1,10002.0,Abbotsbury,10002,Abbotsbury,1.0,Good,Good,0
2,10003.0,Abbotsford (NSW),10003,Abbotsford (NSW),1.0,Good,Good,0
3,10004.0,Abercrombie,10004,Abercrombie,1.0,Good,Good,0
4,10005.0,Abercrombie River,10005,Abercrombie River,1.0,Good,Good,0
...,...,...,...,...,...,...,...,...
15751,90005.0,West Island,90005,West Island,1.0,Good,Good,0
15752,99494.0,No usual address (OT),99494,No usual address (OT),1.0,Good,Good,0
15753,99797.0,Migratory - Offshore - Shipping (OT),99797,Migratory - Offshore - Shipping (OT),1.0,Good,Good,0
15754,,,51265,Prince Regent River,1.0,Good,Good,3


In [680]:
# search all suburbs in the historical rent data for a matching name in the SAL converter df
# and print each suburb along the way to see how many cannot be found
all_suburbs = all_suburbs_rent_df['SAL suburb']
mis_match = []
SAL_names_list = list(SAL_converter_df["SAL_NAME_2021"])

for suburb in all_suburbs:
    print(suburb)
    if suburb not in SAL_names_list:
        # perform a second check specifying the suburb is in Victoria
        adj_search_name = suburb + ' (Vic.)'
        if adj_search_name not in SAL_names_list:
            mis_match.append(suburb)
print(len(mis_match))
print(mis_match)
print(len(all_suburbs))

Albert Park
Middle Park
St Kilda West
Armadale
Carlton North
Princes Hill
Parkville
Carlton
Melbourne Cbd
Collingwood
Abbotsford
Docklands
East Melbourne
St Kilda East
Balaclava
Ripponlea
Elwood
Fitzroy
Fitzroy North
Clifton Hill
Flemington
Travancore
Kensington
North Melbourne
West Melbourne
Port Melbourne
Prahran
Windsor
Richmond
Cremorne
Burnley
South Melbourne
South Yarra
Southbank
South Wharf
St Kilda
Toorak
Balwyn
Balwyn North
Deepdene
Blackburn
Blackburn North
Blackburn South
Box Hill
Box Hill North
Box Hill South
Mont Albert North
Bulleen
Templestowe
Templestowe Lower
Doncaster
Burwood
Ashburton
Ashwood
Camberwell
Glen Iris
Canterbury
Surrey Hills
Mont Albert
Chadstone
Oakleigh
Oakleigh East
Huntingdale
Clayton
Notting Hill
Oakleigh South
Doncaster East
Donvale
Park Orchards
Warrandyte
Warrandyte South
Wonga Park
Hawthorn East
Glen Waverley
Wheelers Hill
Mulgrave
Hawthorn
Kooyong
Kew
Kew East
Mount Waverley
Nunawading
Mitcham
Vermont
Vermont South
Forest Hill
Burwood East
Aspen

Exception Handling
- 'Melbourne Cbd': different name in converter - handle as 'Melbourne'
- 'Mckinnon': incorrect capitalisation - handle as 'McKinnon'
- 'Hillside': different name in converter - handle as the Hillside in Melton
- 'Fieldstone': does not exist as an SAL - remove
- 'Bellfield': different name in converter - handle as the Bellfield in Banyule
- 'Mccrae': incorrect capitalisation - handle as 'McCrae'
- 'Newtown': different name in converter - handle as the Newtown in Greater Geelong
- 'Ballarat': different name in converter - handle as 'Ballarat Central'

In [681]:
# correct the capitalisation of the two names listed above so that they match the converter
row_id = all_suburbs_rent_df[all_suburbs_rent_df["SAL suburb"]=="Mccrae"].index[0]
print(row_id)
all_suburbs_rent_df.loc[row_id,"SAL suburb"] = "McCrae"
row_id = all_suburbs_rent_df[all_suburbs_rent_df["SAL suburb"]=="Mckinnon"].index[0]
all_suburbs_rent_df.loc[row_id,"SAL suburb"] = "McKinnon"

482


In [682]:
# retrieve the SAL code of every suburb from the converter and print those that are not found
# read the suburb column again here so that the McCrae/McKinnon corrections above are picked up
all_suburbs = all_suburbs_rent_df['SAL suburb']
SAL_names_list = list(SAL_converter_df['SAL_NAME_2021'])
# SAL codes entered by hand for suburbs listed under a different name in the converter
exception_codes = {
    "Melbourne Cbd": 21640,
    "Hillside": 21193,
    "Bellfield": 20198,
    "Newtown": 21938,
    "Ballarat": 20111,
}
SAL_codes = []
missing_count = 0
for suburb in all_suburbs:
    print(suburb)
    # suburb is in the converter
    if suburb in SAL_names_list:
        suburb_SAL_2021 = SAL_converter_df[SAL_converter_df['SAL_NAME_2021'] == suburb]['SAL_CODE_2021'].values[0]
        SAL_codes.append(suburb_SAL_2021)

    # suburb is in the converter with the Victorian suffix
    elif suburb + ' (Vic.)' in SAL_names_list:
        adj_search_name = suburb + ' (Vic.)'
        suburb_SAL_2021 = SAL_converter_df[SAL_converter_df['SAL_NAME_2021'] == adj_search_name]['SAL_CODE_2021'].values[0]
        SAL_codes.append(suburb_SAL_2021)

    # suburb is one of the manually handled exceptions
    elif suburb in exception_codes:
        SAL_codes.append(exception_codes[suburb])

    # both the search and the exception handling failed to find a SAL code for the suburb
    else:
        print(suburb + ' Not found')
        missing_count += 1
        SAL_codes.append(-1)

missing_count

Albert Park
Middle Park
St Kilda West
Armadale
Carlton North
Princes Hill
Parkville
Carlton
Melbourne Cbd
Collingwood
Abbotsford
Docklands
East Melbourne
St Kilda East
Balaclava
Ripponlea
Elwood
Fitzroy
Fitzroy North
Clifton Hill
Flemington
Travancore
Kensington
North Melbourne
West Melbourne
Port Melbourne
Prahran
Windsor
Richmond
Cremorne
Burnley
South Melbourne
South Yarra
Southbank
South Wharf
St Kilda
Toorak
Balwyn
Balwyn North
Deepdene
Blackburn
Blackburn North
Blackburn South
Box Hill
Box Hill North
Box Hill South
Mont Albert North
Bulleen
Templestowe
Templestowe Lower
Doncaster
Burwood
Ashburton
Ashwood
Camberwell
Glen Iris
Canterbury
Surrey Hills
Mont Albert
Chadstone
Oakleigh
Oakleigh East
Huntingdale
Clayton
Notting Hill
Oakleigh South
Doncaster East
Donvale
Park Orchards
Warrandyte
Warrandyte South
Wonga Park
Hawthorn East
Glen Waverley
Wheelers Hill
Mulgrave
Hawthorn
Kooyong
Kew
Kew East
Mount Waverley
Nunawading
Mitcham
Vermont
Vermont South
Forest Hill
Burwood East
Aspen

1
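
The three-stage lookup (plain name, name with the ' (Vic.)' suffix, manual override) can be sketched as a small function over a dictionary, which avoids filtering the converter once per suburb. The `lookup_sal_code` helper and the toy converter below are hypothetical; the codes are copied from the outputs in this notebook, while the name forms in the toy converter are assumed for illustration.

```python
import pandas as pd

def lookup_sal_code(suburb, name_to_code, overrides):
    """Try the plain name, then the ' (Vic.)' form, then the manual overrides."""
    for candidate in (suburb, f"{suburb} (Vic.)"):
        if candidate in name_to_code:
            return name_to_code[candidate]
    # -1 marks a suburb that could not be resolved, as in the cell above
    return overrides.get(suburb, -1)

# toy converter with assumed name forms
converter = pd.DataFrame({
    "SAL_NAME_2021": ["Warragul", "Armadale (Vic.)"],
    "SAL_CODE_2021": [22698, 20066],
})
# keep the first code for each name, matching the .values[0] behaviour above
name_to_code = (
    converter.drop_duplicates("SAL_NAME_2021")
    .set_index("SAL_NAME_2021")["SAL_CODE_2021"]
    .to_dict()
)
overrides = {"Melbourne Cbd": 21640}

suburbs = pd.Series(["Warragul", "Armadale", "Melbourne Cbd", "Fieldstone"])
codes = suburbs.map(lambda s: lookup_sal_code(s, name_to_code, overrides)).tolist()
print(codes)
```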

In [683]:
# append SAL codes to the dataset
all_suburbs_rent_df['SAL_CODE_2021'] = SAL_codes
all_suburbs_rent_df.shape

(569, 198)

In [684]:
list(all_suburbs_rent_df.columns)

['Suburb Cluster',
 'Suburb Group',
 'Mar 2000 Count (of suburb group)',
 'Mar 2000 Median',
 'Jun 2000 Count (of suburb group)',
 'Jun 2000 Median',
 'Sep 2000 Count (of suburb group)',
 'Sep 2000 Median',
 'Dec 2000 Count (of suburb group)',
 'Dec 2000 Median',
 'Mar 2001 Count (of suburb group)',
 'Mar 2001 Median',
 'Jun 2001 Count (of suburb group)',
 'Jun 2001 Median',
 'Sep 2001 Count (of suburb group)',
 'Sep 2001 Median',
 'Dec 2001 Count (of suburb group)',
 'Dec 2001 Median',
 'Mar 2002 Count (of suburb group)',
 'Mar 2002 Median',
 'Jun 2002 Count (of suburb group)',
 'Jun 2002 Median',
 'Sep 2002 Count (of suburb group)',
 'Sep 2002 Median',
 'Dec 2002 Count (of suburb group)',
 'Dec 2002 Median',
 'Mar 2003 Count (of suburb group)',
 'Mar 2003 Median',
 'Jun 2003 Count (of suburb group)',
 'Jun 2003 Median',
 'Sep 2003 Count (of suburb group)',
 'Sep 2003 Median',
 'Dec 2003 Count (of suburb group)',
 'Dec 2003 Median',
 'Mar 2004 Count (of suburb group)',
 'Mar 2004 Medi

In [685]:
# convert the quarterly median weekly rents into an average weekly rent for each year
# (explicitly, this is the average of the quarterly median weekly rents) and do the same for the counts
# NOTE: quarters reported as '-' are counted as 0, which lowers the average for that year
for year in range(2000, 2025):
    year_columns_rent = [col for col in all_suburbs_rent_df.columns if f"{year} Median" in col]
    year_columns_count = [col for col in all_suburbs_rent_df.columns if f"{year} Count" in col]
    for col in year_columns_rent + year_columns_count:
        all_suburbs_rent_df[col] = all_suburbs_rent_df[col].replace('-', 0)
    # average of the quarterly median weekly rents of the suburb group
    all_suburbs_rent_df[f"{year}_average_weekly_rent"] = all_suburbs_rent_df[year_columns_rent].astype(int).mean(axis=1)
    # average of the quarterly lease counts of the suburb group
    all_suburbs_rent_df[f"{year}_average_quarterly_count"] = all_suburbs_rent_df[year_columns_count].astype(int).mean(axis=1)
all_suburbs_rent_df
all_suburbs_rent_df

1,Suburb Cluster,Suburb Group,Mar 2000 Count (of suburb group),Mar 2000 Median,Jun 2000 Count (of suburb group),Jun 2000 Median,Sep 2000 Count (of suburb group),Sep 2000 Median,Dec 2000 Count (of suburb group),Dec 2000 Median,...,2020_average_weekly_rent,2020_average_quarterly_count,2021_average_weekly_rent,2021_average_quarterly_count,2022_average_weekly_rent,2022_average_quarterly_count,2023_average_weekly_rent,2023_average_quarterly_count,2024_average_weekly_rent,2024_average_quarterly_count
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,570.00,835.50,498.75,923.75,510.00,859.25,573.75,746.50,650.0,671.0
1,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,570.00,835.50,498.75,923.75,510.00,859.25,573.75,746.50,650.0,671.0
2,Inner Melbourne,Albert Park-Middle Park-West St Kilda,1143,260,1134,260,1177,270,1178,275,...,570.00,835.50,498.75,923.75,510.00,859.25,573.75,746.50,650.0,671.0
3,Inner Melbourne,Armadale,733,200,737,200,738,205,739,210,...,498.75,744.25,433.75,822.25,447.50,837.75,518.75,669.25,560.0,566.0
4,Inner Melbourne,Carlton North,864,260,814,260,799,265,736,270,...,590.00,539.00,577.50,597.75,593.75,552.25,642.50,469.25,680.0,384.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,Other Regional Centres,Traralgon,851,125,823,120,831,125,807,125,...,307.50,881.25,345.00,843.50,374.75,894.50,390.00,907.75,410.0,842.0
565,Other Regional Centres,Wanagaratta,705,125,671,125,631,130,623,130,...,299.50,646.75,340.00,560.50,377.50,538.00,388.75,562.00,400.0,580.0
566,Other Regional Centres,Warragul,385,130,367,135,382,135,366,135,...,360.00,572.75,383.75,536.00,412.50,520.00,450.00,537.50,470.0,541.0
567,Other Regional Centres,Warrnambool,1266,130,1229,135,1204,135,1135,135,...,342.50,983.50,365.00,864.50,410.00,857.50,440.00,858.00,460.0,840.0
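
The cell above replaces '-' with 0 before averaging. As a point of comparison only (this is not what the notebook does), the sketch below treats '-' as a missing value instead, so that a quarter with no reported median is left out of the yearly average. The second row of the toy frame is made up; the first row uses the year 2000 medians shown earlier.

```python
import pandas as pd

# toy quarterly medians; '-' marks a quarter with no reported value
quarters = pd.DataFrame({
    "Mar 2000 Median": [260, "-"],
    "Jun 2000 Median": [260, 130],
    "Sep 2000 Median": [270, 130],
    "Dec 2000 Median": [275, 136],
})

# approach used in the notebook: '-' becomes 0 and still counts as a quarter
as_zero = quarters.replace("-", 0).astype(int).mean(axis=1)

# alternative: '-' becomes NaN and is skipped by mean()
as_missing = quarters.apply(pd.to_numeric, errors="coerce").mean(axis=1)

print(as_zero.tolist())
print(as_missing.tolist())
```

For rows with four reported quarters the two approaches agree; they diverge only where a '-' is present, and there the zero-filled average is pulled downwards.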


In [686]:
# every quarterly count is above 50, so the average of the quarterly medians is reasonable; drop the raw count columns
all_suburbs_rent_df = all_suburbs_rent_df.drop(columns=[col for col in all_suburbs_rent_df.columns if ' Count' in col])

In [687]:
all_suburbs_rent_df = all_suburbs_rent_df.drop(columns=[col for col in all_suburbs_rent_df.columns if ' Median' in col])
all_suburbs_rent_df

1,Suburb Cluster,Suburb Group,SAL suburb,SAL_CODE_2021,2000_average_weekly_rent,2000_average_quarterly_count,2001_average_weekly_rent,2001_average_quarterly_count,2002_average_weekly_rent,2002_average_quarterly_count,...,2020_average_weekly_rent,2020_average_quarterly_count,2021_average_weekly_rent,2021_average_quarterly_count,2022_average_weekly_rent,2022_average_quarterly_count,2023_average_weekly_rent,2023_average_quarterly_count,2024_average_weekly_rent,2024_average_quarterly_count
0,Inner Melbourne,Albert Park-Middle Park-West St Kilda,Albert Park,20018,266.25,1158.00,281.25,1247.50,300.00,1382.00,...,570.00,835.50,498.75,923.75,510.00,859.25,573.75,746.50,650.0,671.0
1,Inner Melbourne,Albert Park-Middle Park-West St Kilda,Middle Park,21677,266.25,1158.00,281.25,1247.50,300.00,1382.00,...,570.00,835.50,498.75,923.75,510.00,859.25,573.75,746.50,650.0,671.0
2,Inner Melbourne,Albert Park-Middle Park-West St Kilda,St Kilda West,22345,266.25,1158.00,281.25,1247.50,300.00,1382.00,...,570.00,835.50,498.75,923.75,510.00,859.25,573.75,746.50,650.0,671.0
3,Inner Melbourne,Armadale,Armadale,20066,203.75,736.75,222.50,726.00,230.75,763.50,...,498.75,744.25,433.75,822.25,447.50,837.75,518.75,669.25,560.0,566.0
4,Inner Melbourne,Carlton North,Carlton North,20496,263.75,803.25,276.25,684.25,290.00,646.75,...,590.00,539.00,577.50,597.75,593.75,552.25,642.50,469.25,680.0,384.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,Other Regional Centres,Traralgon,Traralgon,22569,123.75,828.00,126.25,803.25,132.50,799.75,...,307.50,881.25,345.00,843.50,374.75,894.50,390.00,907.75,410.0,842.0
565,Other Regional Centres,Wanagaratta,Wangaratta,22680,127.50,657.50,131.25,585.00,135.00,636.00,...,299.50,646.75,340.00,560.50,377.50,538.00,388.75,562.00,400.0,580.0
566,Other Regional Centres,Warragul,Warragul,22698,133.75,375.00,137.50,342.50,152.50,354.00,...,360.00,572.75,383.75,536.00,412.50,520.00,450.00,537.50,470.0,541.0
567,Other Regional Centres,Warrnambool,Warrnambool,22710,133.75,1208.50,141.25,1079.50,150.00,1084.50,...,342.50,983.50,365.00,864.50,410.00,857.50,440.00,858.00,460.0,840.0


In [688]:
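# List the remaining columns (df is only assigned in the next cell, so use the full name here)
all_suburbs_rent_df.columns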

Index(['Suburb Cluster', 'Suburb Group', 'SAL suburb', 'SAL_CODE_2021',
       '2000_average_weekly_rent', '2000_average_quarterly_count',
       '2001_average_weekly_rent', '2001_average_quarterly_count',
       '2002_average_weekly_rent', '2002_average_quarterly_count',
       '2003_average_weekly_rent', '2003_average_quarterly_count',
       '2004_average_weekly_rent', '2004_average_quarterly_count',
       '2005_average_weekly_rent', '2005_average_quarterly_count',
       '2006_average_weekly_rent', '2006_average_quarterly_count',
       '2007_average_weekly_rent', '2007_average_quarterly_count',
       '2008_average_weekly_rent', '2008_average_quarterly_count',
       '2009_average_weekly_rent', '2009_average_quarterly_count',
       '2010_average_weekly_rent', '2010_average_quarterly_count',
       '2011_average_weekly_rent', '2011_average_quarterly_count',
       '2012_average_weekly_rent', '2012_average_quarterly_count',
       '2013_average_weekly_rent', '2013_average_quarterl

In [689]:
df = all_suburbs_rent_df
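# Look up Portsea. The comparison is exact and case-sensitive, and the names shown above are in title case.
df[df["SAL suburb"]=="portsea"]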

1,Suburb Cluster,Suburb Group,SAL suburb,SAL_CODE_2021,2000_average_weekly_rent,2000_average_quarterly_count,2001_average_weekly_rent,2001_average_quarterly_count,2002_average_weekly_rent,2002_average_quarterly_count,...,2020_average_weekly_rent,2020_average_quarterly_count,2021_average_weekly_rent,2021_average_quarterly_count,2022_average_weekly_rent,2022_average_quarterly_count,2023_average_weekly_rent,2023_average_quarterly_count,2024_average_weekly_rent,2024_average_quarterly_count
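
Because the equality test above is case-sensitive, an empty result for `"portsea"` does not by itself show that the suburb is absent. A minimal sketch of a case-insensitive lookup follows. The toy frame and the code `99999` are made up for illustration, and whether Portsea appears in the real dataset is not something this sketch can confirm.

```python
import pandas as pd

# Toy data: 99999 is a made-up code used only for this illustration.
toy = pd.DataFrame({
    "SAL suburb": ["Albert Park", "Middle Park", "Portsea"],
    "SAL_CODE_2021": [20018, 21677, 99999],
})

# Exact match: finds nothing because the stored name is 'Portsea'.
exact = toy[toy["SAL suburb"] == "portsea"]

# Case-insensitive match: lower-case both sides before comparing.
insensitive = toy[toy["SAL suburb"].str.lower() == "portsea".lower()]

print(len(exact), len(insensitive))
```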


In [690]:
all_suburbs_rent_df = all_suburbs_rent_df.rename(columns={"SAL_CODE_2021": "SAL_CODE"})
all_suburbs_rent_df = all_suburbs_rent_df.drop(columns=["Suburb Cluster", "Suburb Group", "SAL suburb"])
# Show rows that are exact duplicates of an earlier row
all_suburbs_rent_df[all_suburbs_rent_df.duplicated()]

1,SAL_CODE,2000_average_weekly_rent,2000_average_quarterly_count,2001_average_weekly_rent,2001_average_quarterly_count,2002_average_weekly_rent,2002_average_quarterly_count,2003_average_weekly_rent,2003_average_quarterly_count,2004_average_weekly_rent,...,2020_average_weekly_rent,2020_average_quarterly_count,2021_average_weekly_rent,2021_average_quarterly_count,2022_average_weekly_rent,2022_average_quarterly_count,2023_average_weekly_rent,2023_average_quarterly_count,2024_average_weekly_rent,2024_average_quarterly_count
94,20278,173.75,1398.75,185.0,1398.0,194.75,1514.0,200.0,1536.75,202.5,...,442.5,1495.5,452.5,1333.5,477.5,1293.0,518.75,1219.0,575.0,1115.0
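
`DataFrame.duplicated()` with no arguments only flags rows that match an earlier row in every column. A SAL code that appears twice with different rent values would not show up in the check above. The sketch below, on toy values, contrasts the two checks. The `subset` and `keep` arguments are standard pandas parameters, and the numbers are illustrative only.

```python
import pandas as pd

# Toy data: 20278 is repeated with identical values, 20496 with different ones.
toy = pd.DataFrame({
    "SAL_CODE": [20018, 20278, 20278, 20496, 20496],
    "2000_average_weekly_rent": [266.25, 173.75, 173.75, 263.75, 250.00],
})

# Full-row duplicates: only the second copy of 20278 is flagged.
full_row_dupes = toy[toy.duplicated()]

# Duplicates on SAL_CODE alone; keep=False marks every copy, not just the later ones.
code_dupes = toy[toy.duplicated(subset="SAL_CODE", keep=False)]

print(len(full_row_dupes), len(code_dupes))
```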


In [691]:
# Where a SAL code appears more than once, take the mean of its values
all_suburbs_rent_grouped = all_suburbs_rent_df.groupby('SAL_CODE', as_index=False).mean()
all_suburbs_rent_grouped = all_suburbs_rent_grouped.drop(0)  # get rid of the invalid suburb (Fieldstone), which sits in row 0
all_suburbs_rent_grouped

1,SAL_CODE,2000_average_weekly_rent,2000_average_quarterly_count,2001_average_weekly_rent,2001_average_quarterly_count,2002_average_weekly_rent,2002_average_quarterly_count,2003_average_weekly_rent,2003_average_quarterly_count,2004_average_weekly_rent,...,2020_average_weekly_rent,2020_average_quarterly_count,2021_average_weekly_rent,2021_average_quarterly_count,2022_average_weekly_rent,2022_average_quarterly_count,2023_average_weekly_rent,2023_average_quarterly_count,2024_average_weekly_rent,2024_average_quarterly_count
1,20111,137.00,979.50,141.25,857.00,151.25,896.00,161.25,948.25,171.25,...,317.50,1743.75,331.25,1632.25,355.00,1546.50,371.25,1473.25,380.0,1345.0
2,20198,190.00,505.75,207.75,502.50,216.25,529.00,221.25,525.75,226.25,...,466.25,720.00,457.00,643.25,432.50,872.25,495.00,764.75,550.0,729.0
3,21193,200.00,608.25,207.50,765.50,210.00,1059.50,210.00,1334.00,217.50,...,400.00,2389.25,400.00,2591.50,411.25,3047.50,437.50,3557.00,470.0,3777.0
4,21640,320.00,2278.75,320.00,2752.50,320.00,3382.75,305.00,3972.75,300.00,...,483.75,10206.25,366.25,16559.25,426.25,14627.75,587.50,13547.25,640.0,13582.0
5,21938,142.50,443.50,151.25,434.75,161.25,429.50,171.25,473.00,178.75,...,390.75,434.50,407.50,428.50,432.50,365.25,465.00,312.50,475.0,356.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
563,22916,185.00,1511.00,196.25,1579.75,208.75,1565.75,216.25,1703.50,222.50,...,402.50,2991.00,407.50,2879.00,423.75,3105.00,462.50,2728.75,500.0,2440.0
564,22917,178.75,947.25,187.50,939.00,200.00,947.00,210.00,1037.25,218.75,...,491.25,1268.25,477.50,1321.25,475.00,1391.75,521.25,1280.00,570.0,1070.0
565,22925,152.50,1285.00,162.50,1280.50,170.00,1258.50,180.00,1315.00,192.50,...,400.00,873.50,415.00,721.75,450.75,733.00,486.25,767.75,500.0,729.0
566,22930,152.50,1285.00,162.50,1280.50,170.00,1258.50,180.00,1315.00,192.50,...,400.00,873.50,415.00,721.75,450.75,733.00,486.25,767.75,500.0,729.0
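
Dropping by position (`drop(0)`) relies on the invalid suburb sorting first after the `groupby`. A sketch of an alternative that filters on the code itself is given below. `INVALID_SAL_CODE = 0` is a hypothetical placeholder chosen for this example, since the code actually assigned to the invalid suburb is not shown in this notebook.

```python
import pandas as pd

# Hypothetical placeholder; the real code for the invalid suburb is an assumption here.
INVALID_SAL_CODE = 0

# Toy data: 20278 appears twice with different rents.
toy = pd.DataFrame({
    "SAL_CODE": [0, 20278, 20278, 20496],
    "2000_average_weekly_rent": [100.0, 170.0, 180.0, 263.75],
})

# Collapse repeated SAL codes by averaging their values.
grouped = toy.groupby("SAL_CODE", as_index=False).mean()

# Remove the invalid code by value rather than by row position.
grouped = grouped[grouped["SAL_CODE"] != INVALID_SAL_CODE].reset_index(drop=True)

print(grouped)
```

Filtering by value keeps working if the row order changes, whereas a positional drop would silently remove a valid suburb.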


In [692]:
# Write the cleaned dataset to the curated directory and report its shape
all_suburbs_rent_grouped.to_csv("../../data/curated/historical_rent_cleaned.csv", index=False)
all_suburbs_rent_grouped.shape

(567, 51)
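
A quick sanity check on the written file can catch repeated SAL codes or missing values before later notebooks use it. The sketch below round-trips a toy frame through an in-memory buffer instead of the real `historical_rent_cleaned.csv`. The particular checks are suggestions, not part of the original pipeline.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned output.
toy = pd.DataFrame({
    "SAL_CODE": [20111, 20198],
    "2024_average_weekly_rent": [380.0, 550.0],
})

# Write to an in-memory buffer and read it back, mimicking the CSV round trip.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Each SAL code should appear once, and no cell should be missing.
codes_unique = reloaded["SAL_CODE"].is_unique
no_missing = not reloaded.isna().any().any()

print(reloaded.shape, codes_unique, no_missing)
```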