> # Data preprocessing

In this section we preprocess all of our landing data: the listings scraped from the domain.com.au website, the historical rental price data, and the social indicator data.

> ### Import libraries and functions

In [2]:
%run ../scripts/historical_data.py
import pandas as pd
import os
import re

> ## Preprocess property data

We will preprocess the property data in ../data/landing/properties.csv, which was scraped from the domain.com.au website.

In [None]:
print("Begin preprocessing property data")

In [None]:
# Get the csv file from scraping domain.com.au
property_df = pd.read_csv('../data/landing/properties.csv')

> ### Handling missing values

In [4]:
# Show the number of missing values in each column
property_df.isnull().sum()

price (AUD per week)    0
bedrooms                8
bathrooms               0
parkings                0
property type           0
address                 0
suburb                  0
postcode                0
additional features     0
property url            0
dtype: int64

We can see that the number of bedrooms is the only column that contains missing values. We will inspect these rows.

In [21]:
# Inspect rows with any NaN values
property_df[property_df.isna().any(axis=1)]

Unnamed: 0,price (AUD per week),bedrooms,bathrooms,parkings,property type,address,suburb,postcode,additional features,property url
142,450.0,,1,0,Studio,1403/325 Collins Street,MELBOURNE,3000,['Furnished'],https://www.domain.com.au/1403-325-collins-str...
226,230.0,,1,0,Studio,109/32 St Edmonds Road,PRAHRAN,3181,[],https://www.domain.com.au/109-32-st-edmonds-ro...
462,615.0,,1,0,Apartment / Unit / Flat,L204/8 Caulfield Boulevard,CAULFIELD NORTH,3161,"['Intercom', 'In ground pool', 'Balcony', 'Out...",https://www.domain.com.au/l204-8-caulfield-bou...
506,75.0,,1,1,Car space,Car Park/228 La Trobe St,MELBOURNE,3000,[],https://www.domain.com.au/car-park-228-la-trob...
625,435.0,,1,0,Studio,7/340 Beaconsfield Parade,ST KILDA WEST,3182,"['Split Cooling', 'Split Heating', 'Kitchen', ...",https://www.domain.com.au/7-340-beaconsfield-p...
701,535.0,,1,0,Apartment / Unit / Flat,202/12 Caulfield Blvd,CAULFIELD NORTH,3161,"['In ground pool', 'In ground spa', 'Split sys...",https://www.domain.com.au/202-12-caulfield-blv...
757,250.0,,1,0,Studio,24/677 Park Street,BRUNSWICK,3056,[],https://www.domain.com.au/24-677-park-street-b...
802,350.0,,1,1,Studio,2/631 Punt Road,SOUTH YARRA,3141,[],https://www.domain.com.au/2-631-punt-road-sout...
905,380.0,,1,0,Studio,10/1 Lawson Grove,SOUTH YARRA,3141,[],https://www.domain.com.au/10-1-lawson-grove-so...


We can see that most of the missing bedroom values belong to studios, and one belongs to an invalid property type (a car space).

We will look at which property types are included in our dataset.

In [5]:
property_df['property type'].unique()

array(['Townhouse', 'Apartment / Unit / Flat', 'House', 'Studio',
       'Car space', 'Villa'], dtype=object)

We will discard rows with type 'Car space' because this type is invalid in the scope of this project.

In [25]:
# Discard rows with type 'Car space'
property_df = property_df[property_df['property type'] != 'Car space']

Now we will fill in the missing bedroom values. We assume that a studio has 1 bedroom; for any other property type we assume that the number of bedrooms equals the number of bathrooms.

In [26]:
def fill_bedrooms(row):
    if pd.isnull(row['bedrooms']):
        if row['property type'] == 'Studio':    # assume number of bedrooms for studio is 1
            return 1
        else:
            return row['bathrooms']     # for other properties assume bedrooms = bathrooms
    return row['bedrooms']

property_df['bedrooms'] = property_df.apply(fill_bedrooms, axis=1)
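
The same rule can also be written without a row-wise `apply`. The following is a minimal sketch on toy data (the values are made up for illustration, not taken from properties.csv); it assumes the same two rules as above and is not the code used to produce the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mirroring the relevant columns
toy = pd.DataFrame({
    'property type': ['Studio', 'Apartment / Unit / Flat', 'House'],
    'bedrooms': [np.nan, np.nan, 3.0],
    'bathrooms': [1, 2, 2],
})

# Fallback per row: keep the bathroom count unless the property is a studio,
# in which case use 1
fallback = toy['bathrooms'].where(toy['property type'] != 'Studio', 1)

# Only rows with a missing bedroom count take the fallback value
toy['bedrooms'] = toy['bedrooms'].fillna(fallback)
print(toy)
```

Rows that already have a bedroom count are left untouched, because `fillna` only writes into missing entries.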

In [24]:
# Check the number of missing values after filling in
property_df.isnull().sum()

price (AUD per week)    0
bedrooms                0
bathrooms               0
parkings                0
property type           0
address                 0
suburb                  0
postcode                0
additional features     0
property url            0
dtype: int64

We confirm that there are no missing entries left.

> ### Descriptive statistics

We will look at the descriptive statistics of the number of bedrooms, bathrooms and parking spaces, and of the rental price per bedroom, to check that they fall in reasonable ranges. We use price per bedroom rather than total price because it is easier to compare across properties of different sizes.

In [27]:
# Compute price per bedroom
property_df['price per bedroom'] = property_df['price (AUD per week)'] / property_df['bedrooms']

In [17]:
property_df[['price per bedroom', 'bedrooms', 'bathrooms', 'parkings']].describe()

Unnamed: 0,price per bedroom,bedrooms,bathrooms,parkings
count,988.0,988.0,988.0,988.0
mean,361.057018,2.152834,1.461538,1.069838
std,125.107332,0.954869,0.639417,0.75605
min,112.5,1.0,1.0,0.0
25%,275.0,1.0,1.0,1.0
50%,337.5,2.0,1.0,1.0
75%,430.0,3.0,2.0,1.0
max,1100.0,6.0,4.0,6.0


Overall, the ranges of bedrooms, bathrooms and parking spaces are reasonable. Some properties have a very high price per bedroom (the maximum is 1100 AUD per week), but such prices are plausible in more expensive suburbs. We will therefore keep these properties and later classify them as 'Very High' in the classification task.
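
To make "very high" a little more concrete, one option is Tukey's upper fence, Q3 + 1.5 × IQR. This rule is an assumption introduced here for illustration; it is not the criterion this notebook uses. The sketch below plugs in the quartiles of 'price per bedroom' from the describe() output.

```python
# Quartiles and maximum of 'price per bedroom', copied from the describe() output
q1, q3, maximum = 275.0, 430.0, 1100.0

# Tukey's upper fence (an assumed rule of thumb, not the notebook's criterion)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(f"IQR = {iqr}, upper fence = {upper_fence}")
print(f"Maximum exceeds fence: {maximum > upper_fence}")
```

Under this rule the fence sits at 662.5 AUD per bedroom per week, so the most expensive listings would be flagged, which is consistent with keeping them but treating them as a separate class.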

In [28]:
# Save the final df
property_df.to_csv('../data/raw/preprocessed properties.csv', index=False)

print("Saved preprocessed property data to ../data/raw/preprocessed properties.csv")

> ## Median Price by Suburb

In [None]:
# Get the preprocessed property data
property_df = pd.read_csv('../data/raw/preprocessed properties.csv')

In [None]:
# Define the combinations of property type and bedroom count that we are interested in
combinations = [
    ('Apartment / Unit / Flat', 1),
    ('Apartment / Unit / Flat', 2),
    ('Apartment / Unit / Flat', 3),
    ('House', 2),
    ('House', 3),
    ('House', 4)
]

median_price_df = pd.DataFrame()

# Compute the median price for each combination above, grouped by postcode
for property_type, bedrooms in combinations:

    # Keep only the properties that match the current combination
    filtered_df = property_df[(property_df['property type'] == property_type) 
                              & (property_df['bedrooms'] == bedrooms)]
    
    # Compute the median rental price
    median_price = filtered_df.groupby('postcode')['price (AUD per week)'].median().rename(
                                            f'median {bedrooms} bedroom {property_type}')
    median_price_df = pd.concat([median_price_df, median_price], axis=1)
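
The loop above builds one column per combination. As a rough sketch of an alternative, `pivot_table` can produce the same wide layout in one call. The toy listings below are hypothetical, and unlike the loop this version produces a column for every (property type, bedrooms) pair present in the data rather than only the six chosen combinations.

```python
import pandas as pd

# Toy listings (hypothetical values) with the same column names as property_df
toy = pd.DataFrame({
    'postcode': [3000, 3000, 3000, 3002, 3002],
    'property type': ['Apartment / Unit / Flat'] * 3 + ['House', 'House'],
    'bedrooms': [1, 1, 2, 3, 3],
    'price (AUD per week)': [500.0, 550.0, 625.0, 780.0, 820.0],
})

# One row per postcode, one column per (property type, bedrooms) pair
wide = toy.pivot_table(index='postcode',
                       columns=['property type', 'bedrooms'],
                       values='price (AUD per week)',
                       aggfunc='median')

# Flatten the two-level column index into the naming used in this notebook
wide.columns = [f'median {beds} bedroom {ptype}' for ptype, beds in wide.columns]
print(wide)
```

Postcodes with no listing for a given combination end up with NaN in that column, just as in the loop-based result.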

In [None]:
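# Compute the rental price for all properties by postcode and suburb.
# Note: this uses .mean(), so the 'median all properties' column holds the mean price;
# replace .mean() with .median() if a true median is required.
median_price_all = property_df.groupby(['postcode', 'suburb'])['price (AUD per week)'].mean().rename(
                                                        'median all properties')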

# Combining the median price for all properties and for the properties of interest
result_df = median_price_all.to_frame().join(median_price_df, on='postcode', how='left')
result_df.reset_index(inplace=True)

In [None]:
# Show the final df
result_df

Unnamed: 0,postcode,suburb,median all properties,median 1 bedroom Apartment / Unit / Flat,median 2 bedroom Apartment / Unit / Flat,median 3 bedroom Apartment / Unit / Flat,median 2 bedroom House,median 3 bedroom House,median 4 bedroom House
0,3000,MELBOURNE,602.659574,527.5,625.0,,,,
1,3002,EAST MELBOURNE,718.214286,475.0,740.0,,625.0,800.0,
2,3003,WEST MELBOURNE,673.750000,520.0,650.0,1100.0,690.0,,
3,3004,MELBOURNE,788.181818,550.0,752.5,1025.0,,,
4,3006,SOUTHBANK,685.000000,540.0,700.0,1100.0,830.0,825.0,
...,...,...,...,...,...,...,...,...,...
151,3936,SAFETY BEACH,700.000000,,,,,,700.0
152,3939,ROSEBUD,720.000000,,,,,720.0,
153,3941,RYE,650.000000,,,,,625.0,
154,3941,TOOTGAROOK,600.000000,,,,,625.0,


In [None]:
# Save the final df
result_df.to_csv('../data/raw/median price per postcode.csv', index=False)

> ## Preprocessing Historical Rental price data

We will convert the historical data stored in ../data/landing/historical data.xlsx into a more usable format.

> ### Read Excel file into csv

Because the Excel file contains multiple sheets, one per property type, we will read each sheet and save it as an individual csv file under ../data/raw/historical split.

In [None]:
print("Begin preprocessing historical rental price data")

In [None]:
# Read the Excel file and get all sheet names
xls = pd.ExcelFile('../data/landing/historical data.xlsx')
sheet_names = xls.sheet_names  # Get all sheet names

# Folder to store the split sheets; create it if it does not exist
folder_path = '../data/raw/historical split'
os.makedirs(folder_path, exist_ok=True)

# Loop through each sheet and save it as a separate CSV file
for sheet in sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet)  # reuse the already opened workbook
    csv_file = f"{sheet}.csv"  # Name the CSV file based on the sheet name
    file_path = os.path.join(folder_path, csv_file)

    df.to_csv(file_path, index=False)   # save under ../data/raw/historical split
    print(f"Saved {csv_file} under {folder_path}")

Saved 1 bedroom flat.csv under ../data/raw/historical split
Saved 2 bedroom flat.csv under ../data/raw/historical split
Saved 3 bedroom flat.csv under ../data/raw/historical split
Saved 2 bedroom house.csv under ../data/raw/historical split
Saved 3 bedroom house.csv under ../data/raw/historical split
Saved 4 bedroom house.csv under ../data/raw/historical split
Saved All properties.csv under ../data/raw/historical split


> ### Load new csv files

In [None]:
one_bed_flat = pd.read_csv('../data/raw/historical split/1 bedroom flat.csv')
two_bed_flat = pd.read_csv('../data/raw/historical split/2 bedroom flat.csv')
three_bed_flat = pd.read_csv('../data/raw/historical split/3 bedroom flat.csv')
two_bed_house = pd.read_csv('../data/raw/historical split/2 bedroom house.csv')
three_bed_house = pd.read_csv('../data/raw/historical split/3 bedroom house.csv')
four_bed_house = pd.read_csv('../data/raw/historical split/4 bedroom house.csv')
all_properties = pd.read_csv('../data/raw/historical split/All properties.csv')

Now we will look at the csv file for the 1-bedroom flats data.

In [None]:
one_bed_flat

Unnamed: 0,Moving annual rent by suburb,Unnamed: 1,Lease commenced in year ending,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 186,Unnamed: 187,Unnamed: 188,Unnamed: 189,Unnamed: 190,Unnamed: 191,Unnamed: 192,Unnamed: 193,Unnamed: 194,Unnamed: 195
0,1 bedroom flat,,Mar 2000,,Jun 2000,,Sep 2000,,Dec 2000,,...,Mar 2023,,Jun 2023,,Sep 2023,,Dec 2023,,Mar 2024,
1,,,Count,Median,Count,Median,Count,Median,Count,Median,...,Count,Median,Count,Median,Count,Median,Count,Median,Count,Median
2,Inner Melbourne,Albert Park-Middle Park-West St Kilda,352,165,347,165,378,170,369,175,...,266,360,246,370,229,395,224,400,194,425
3,,Armadale,210,150,212,150,213,155,213,160,...,205,360,185,385,175,400,148,408,154,430
4,,Carlton North,87,150,78,155,74,150,65,150,...,65,370,64,380,58,380,53,380,41,400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
156,,Wanagaratta,51,85,46,85,44,85,47,85,...,46,215,52,220,58,220,65,230,70,240
157,,Warragul,13,80,11,75,12,90,10,90,...,-,-,-,-,-,-,10,260,10,260
158,,Warrnambool,113,75,104,75,108,75,105,80,...,60,250,57,250,54,260,45,300,46,300
159,,Wodonga,77,85,72,85,77,85,83,85,...,54,250,51,250,46,250,42,255,43,260


Through inspection, we can see that these csv files are messy: the first two rows hold the date and Count/Median headers, and every quarter has a Count column that we do not need. We will proceed to remove these redundant rows and columns.

We aim to get a cleaned csv file that contains only the median price by time for each suburb/area.
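
The actual cleaning is done by `remove_redundant` from ../scripts/historical_data.py, which is not shown in this notebook. The sketch below is only an illustration of what such a clean-up could look like: `tidy_rent_sheet_sketch` is a hypothetical name, and the toy frame reproduces just the first few cells of the 1-bedroom flat sheet shown above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout as the raw sheet (first two quarters only)
raw = pd.DataFrame(
    [
        ['1 bedroom flat', np.nan, 'Mar 2000', np.nan, 'Jun 2000', np.nan],
        [np.nan, np.nan, 'Count', 'Median', 'Count', 'Median'],
        ['Inner Melbourne', 'Albert Park-Middle Park-West St Kilda', '352', '165', '347', '165'],
        [np.nan, 'Armadale', '210', '150', '212', '150'],
    ],
    columns=['Moving annual rent by suburb', 'Unnamed: 1',
             'Lease commenced in year ending', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5'],
)

def tidy_rent_sheet_sketch(raw):
    """Hypothetical helper: keep the suburb name and the Median column of each quarter."""
    dates = raw.iloc[0, 2:].ffill()            # each date spans a Count and a Median column
    kinds = raw.iloc[1, 2:]                    # 'Count' or 'Median' for every data column
    median_cols = kinds[kinds == 'Median'].index
    out = raw.loc[2:, median_cols].copy()      # drop the two header rows
    out.columns = dates[median_cols].tolist()  # label each Median column with its date
    out.insert(0, 'suburb', raw.loc[2:, 'Unnamed: 1'])
    return out.reset_index(drop=True)

tidy = tidy_rent_sheet_sketch(raw)
print(tidy)
```

Working on an explicit `.copy()` of the selected rows also avoids the SettingWithCopyWarning that pandas raises when a slice is modified in place.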

In [None]:
cleaned_one_bed_flat = remove_redundant(one_bed_flat)
cleaned_two_bed_flat = remove_redundant(two_bed_flat)
cleaned_three_bed_flat = remove_redundant(three_bed_flat)
cleaned_two_bed_house = remove_redundant(two_bed_house)
cleaned_three_bed_house = remove_redundant(three_bed_house)
cleaned_four_bed_house = remove_redundant(four_bed_house)
cleaned_all_properties = remove_redundant(all_properties)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=['Moving annual rent by suburb'], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'Unnamed: 1': 'suburb'}, inplace=True)

In [None]:
# Create a list containing these dataframes to loop through
dataframe_list = [cleaned_one_bed_flat, cleaned_two_bed_flat, cleaned_three_bed_flat, cleaned_two_bed_house,
                  cleaned_three_bed_house, cleaned_four_bed_house, cleaned_all_properties]

Let us look at one dataframe after cleaning.

In [None]:
cleaned_one_bed_flat

Unnamed: 0,suburb,Mar 2000,Jun 2000,Sep 2000,Dec 2000,Mar 2001,Jun 2001,Sep 2001,Dec 2001,Mar 2002,...,Dec 2021,Mar 2022,Jun 2022,Sep 2022,Dec 2022,Mar 2023,Jun 2023,Sep 2023,Dec 2023,Mar 2024
0,Albert Park-Middle Park-West St Kilda,165,165,170,175,180,185,190,190,195,...,320,315,325,340,350,360,370,395,400,425
1,Armadale,150,150,155,160,160,160,165,165,165,...,315,310,320,338,350,360,385,400,408,430
2,Carlton North,150,155,150,150,160,160,160,160,165,...,300,300,320,320,330,370,380,380,380,400
3,Carlton-Parkville,165,170,175,180,185,190,195,185,180,...,300,300,320,340,350,400,420,430,446,450
4,CBD-St Kilda Rd,250,250,250,250,255,260,260,260,265,...,300,320,340,365,400,450,479,500,520,550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,Dandenong North-Endeavour Hills,103,108,108,108,105,-,-,-,-,...,270,260,270,275,275,280,280,290,290,305
108,Narre Warren-Hampton Park,96,-,128,135,150,153,156,156,156,...,280,290,300,320,321,340,320,348,360,400
109,Noble Park,100,100,100,100,105,105,105,105,105,...,260,260,260,260,270,270,276,300,300,330
110,Pakenham,98,105,105,108,105,110,110,110,110,...,280,290,270,275,265,260,275,-,295,320


In [None]:
# Check data type
cleaned_one_bed_flat.dtypes

suburb      object
Mar 2000    object
Jun 2000    object
Sep 2000    object
Dec 2000    object
             ...  
Mar 2022    object
Jun 2022    object
Sep 2022    object
Dec 2022    object
Mar 2023    object
Length: 94, dtype: object

We can see that the redundant rows and columns have been removed and the data is in a cleaner format.

However, there are still some missing values, denoted by '-'. We will replace each of them with the previous non-missing price.

Also, the rental prices are stored as object (string) type, so we will cast them to integers.

> ### Fill in missing values

We will first fill in the missing values in the first column (Mar 2000) using the next non-missing value in the same row, since there is no earlier value to copy. Then we will fill in every remaining missing value with the nearest previous non-missing value in its row.
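
The two steps can be sketched on a toy frame as follows. The suburb labels and prices are made up for illustration, and the sketch uses `mask`, `bfill` and `ffill` as one possible way to express the rule.

```python
import pandas as pd

# Toy prices (hypothetical); '-' marks a missing value, as in the real sheets
toy = pd.DataFrame(
    {'Mar 2000': ['-', '100'], 'Jun 2000': ['96', '-'],
     'Sep 2000': ['-', '-'], 'Dec 2000': ['128', '105']},
    index=pd.Index(['Suburb A', 'Suburb B'], name='suburb'),
)

# Turn '-' into real missing values
missing = toy.mask(toy == '-')

# Step 1: a missing first column takes the next available value in its row
first_col = missing.columns[0]
missing[first_col] = missing.bfill(axis=1)[first_col]

# Step 2: every remaining gap takes the previous value in its row
filled = missing.ffill(axis=1)
print(filled)
```

For 'Suburb A' the leading gap is filled from Jun 2000 and the Sep 2000 gap from the value before it, which mirrors what happens to Narre Warren-Hampton Park in the real data.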

In [None]:
# Fill in missing values in the first column

for i in range(len(dataframe_list)):
    df = dataframe_list[i].set_index('suburb')

    # Drop rows in which every entry is '-'; copy so that later assignments
    # do not trigger SettingWithCopyWarning
    df = df[~(df == '-').all(axis=1)].copy()

    # Treat '-' as missing and back-fill along each row, so that a missing
    # first entry takes the next non '-' value in that row
    first_col = df.columns[0]
    df[first_col] = df.mask(df == '-').bfill(axis=1)[first_col]

    dataframe_list[i] = df


In [None]:
# Fill in missing values in subsequent columns with the previous value in the same row
for i in range(len(dataframe_list)):
    dataframe_list[i] = dataframe_list[i].mask(dataframe_list[i] == '-').ffill(axis=1)



In [None]:
# Check a dataframe after filling in the missing values
dataframe_list[0]

Unnamed: 0_level_0,Mar 2000,Jun 2000,Sep 2000,Dec 2000,Mar 2001,Jun 2001,Sep 2001,Dec 2001,Mar 2002,Jun 2002,...,Dec 2021,Mar 2022,Jun 2022,Sep 2022,Dec 2022,Mar 2023,Jun 2023,Sep 2023,Dec 2023,Mar 2024
suburb,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Albert Park-Middle Park-West St Kilda,165,165,170,175,180,185,190,190,195,200,...,320,315,325,340,350,360,370,395,400,425
Armadale,150,150,155,160,160,160,165,165,165,170,...,315,310,320,338,350,360,385,400,408,430
Carlton North,150,155,150,150,160,160,160,160,165,163,...,300,300,320,320,330,370,380,380,380,400
Carlton-Parkville,165,170,175,180,185,190,195,185,180,180,...,300,300,320,340,350,400,420,430,446,450
CBD-St Kilda Rd,250,250,250,250,255,260,260,260,265,260,...,300,320,340,365,400,450,479,500,520,550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Dandenong North-Endeavour Hills,103,108,108,108,105,105,105,105,105,105,...,270,260,270,275,275,280,280,290,290,305
Narre Warren-Hampton Park,96,96,128,135,150,153,156,156,156,156,...,280,290,300,320,321,340,320,348,360,400
Noble Park,100,100,100,100,105,105,105,105,105,105,...,260,260,260,260,270,270,276,300,300,330
Pakenham,98,105,105,108,105,110,110,110,110,110,...,280,290,270,275,265,260,275,275,295,320


> ### Converting into integer type

In [None]:
for i in range(len(dataframe_list)):
    dataframe_list[i] = dataframe_list[i].astype(int)

In [None]:
# Check the data type after converting
dataframe_list[0].dtypes

Mar 2000    int64
Jun 2000    int64
Sep 2000    int64
Dec 2000    int64
Mar 2001    int64
            ...  
Mar 2022    int64
Jun 2022    int64
Sep 2022    int64
Dec 2022    int64
Mar 2023    int64
Length: 93, dtype: object

> ### Save dataframes

In [None]:
# Create a new folder to store the cleaned dataframes
folder_path = '../data/curated/historical without postcode'

# Save the df from dataframe list
save_dataframes(dataframe_list, folder_path, sheet_names)

Saved cleaned 1 bedroom flat.csv under ../data/curated/historical without postcode
Saved cleaned 2 bedroom flat.csv under ../data/curated/historical without postcode
Saved cleaned 3 bedroom flat.csv under ../data/curated/historical without postcode
Saved cleaned 2 bedroom house.csv under ../data/curated/historical without postcode
Saved cleaned 3 bedroom house.csv under ../data/curated/historical without postcode
Saved cleaned 4 bedroom house.csv under ../data/curated/historical without postcode
Saved cleaned All properties.csv under ../data/curated/historical without postcode


> ### Splitting suburbs

For visualisation purposes, we will add postcodes to the data. This requires splitting the combined areas into individual suburbs.

Some of the entries in these dataframes are actually a combination of 2 to 3 suburbs, such as Albert Park-Middle Park-West St Kilda. We will therefore split these entries and copy the prices to each individual suburb, so that the dataframe can be joined with other data on suburb.
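
The splitting itself is done by `split_suburbs` from ../scripts/historical_data.py, which is not shown here. As a minimal sketch of the idea, assuming the combined names are always joined by a hyphen, the hypothetical helper `split_suburbs_sketch` below uses `str.split` and `explode`:

```python
import pandas as pd

# Toy frame indexed by suburb, using two rows from the 1-bedroom flat data
toy = pd.DataFrame(
    {'Mar 2024': [425, 430]},
    index=pd.Index(['Albert Park-Middle Park-West St Kilda', 'Armadale'], name='suburb'),
)

def split_suburbs_sketch(df):
    """Hypothetical helper: one row per individual suburb, prices copied to each."""
    out = df.reset_index()                           # move 'suburb' back to a column
    out['suburb'] = out['suburb'].str.split('-')     # list of individual suburb names
    out = out.explode('suburb', ignore_index=True)   # one row per list element
    out['suburb'] = out['suburb'].str.strip()
    return out

split = split_suburbs_sketch(toy)
print(split)
```

One caveat of this approach is that a suburb whose own name contains a hyphen would also be split, so the result is worth checking by eye.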

In [None]:
# Start splitting
for i in range(len(dataframe_list)):
    dataframe_list[i] = split_suburbs(dataframe_list[i])

In [None]:
# Check a dataframe after splitting
dataframe_list[0]

Unnamed: 0,suburb,Mar 2000,Jun 2000,Sep 2000,Dec 2000,Mar 2001,Jun 2001,Sep 2001,Dec 2001,Mar 2002,...,Dec 2021,Mar 2022,Jun 2022,Sep 2022,Dec 2022,Mar 2023,Jun 2023,Sep 2023,Dec 2023,Mar 2024
0,Albert Park,165,165,170,175,180,185,190,190,195,...,320,315,325,340,350,360,370,395,400,425
1,Middle Park,165,165,170,175,180,185,190,190,195,...,320,315,325,340,350,360,370,395,400,425
2,West St Kilda,165,165,170,175,180,185,190,190,195,...,320,315,325,340,350,360,370,395,400,425
3,Armadale,150,150,155,160,160,160,165,165,165,...,315,310,320,338,350,360,385,400,408,430
4,Carlton North,150,155,150,150,160,160,160,160,165,...,300,300,320,320,330,370,380,380,380,400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,Narre Warren,96,96,128,135,150,153,156,156,156,...,280,290,300,320,321,340,320,348,360,400
153,Hampton Park,96,96,128,135,150,153,156,156,156,...,280,290,300,320,321,340,320,348,360,400
154,Noble Park,100,100,100,100,105,105,105,105,105,...,260,260,260,260,270,270,276,300,300,330
155,Pakenham,98,105,105,108,105,110,110,110,110,...,280,290,270,275,265,260,275,275,295,320


> ### Get postcode 

We will use the current median rental price data scraped from domain.com.au to add the postcodes to the historical dataframes.
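
Because the join is on suburb name, any historical suburb that does not appear in the scraped data is silently dropped by an inner merge. The sketch below, on toy data, shows one way to list such suburbs with the `indicator` option of `merge`; it is an optional check and not part of the pipeline in this notebook.

```python
import pandas as pd

# Toy historical and scraped frames (a few values taken from the tables in this notebook)
historical = pd.DataFrame({'suburb': ['Albert Park', 'West St Kilda', 'CBD'],
                           'Mar 2024': [425, 425, 550]})
scraped = pd.DataFrame({'postcode': [3206, 3000],
                        'suburb': ['ALBERT PARK', 'MELBOURNE']})

# Normalise the names the same way as this notebook does
historical['suburb'] = historical['suburb'].str.lower()
scraped['suburb'] = scraped['suburb'].str.lower().replace('melbourne', 'cbd')

# A left merge with indicator=True keeps unmatched rows and labels them
check = historical.merge(scraped, on='suburb', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'suburb'].tolist()
matched = check[check['_merge'] == 'both'].drop(columns='_merge')

print("Suburbs without a postcode:", unmatched)
print(matched)
```

In this toy example 'west st kilda' has no match, which is the same reason it is absent from the merged dataframe.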

In [None]:
# Read the CSV file for 2024 median price
median_2024 = pd.read_csv('../data/raw/median price per postcode.csv')

In [None]:
# Convert the suburb names to lowercase so they can be merged with the historical dataframes
median_2024['suburb'] = median_2024['suburb'].str.lower()

# The 'Melbourne' suburb is 'CBD' in the historical data
# So replace 'Melbourne' by 'CBD' to merge it with historical data
median_2024['suburb'] = median_2024['suburb'].replace('melbourne', 'cbd')

In [None]:
# Start merging and then cleaning the merged df
merged_dataframes = []
for i in range(len(dataframe_list)):
    dataframe_list[i]['suburb'] = dataframe_list[i]['suburb'].str.lower()
    merged_df = dataframe_list[i].merge(median_2024[['postcode', 'suburb']], on='suburb', how='inner')
    cleaned_df = clean_merged_df(merged_df)
    merged_dataframes.append(cleaned_df)

In [None]:
# Check a dataframe after merging
merged_dataframes[0]

Unnamed: 0,postcode,suburb,Mar 2000,Jun 2000,Sep 2000,Dec 2000,Mar 2001,Jun 2001,Sep 2001,Dec 2001,...,Dec 2021,Mar 2022,Jun 2022,Sep 2022,Dec 2022,Mar 2023,Jun 2023,Sep 2023,Dec 2023,Mar 2024
0,3206,albert park,165,165,170,175,180,185,190,190,...,320,315,325,340,350,360,370,395,400,425
1,3206,middle park,165,165,170,175,180,185,190,190,...,320,315,325,340,350,360,370,395,400,425
2,3143,armadale,150,150,155,160,160,160,165,165,...,315,310,320,338,350,360,385,400,408,430
3,3054,carlton north,150,155,150,150,160,160,160,160,...,300,300,320,320,330,370,380,380,380,400
4,3053,carlton,165,170,175,180,185,190,195,185,...,300,300,320,340,350,400,420,430,446,450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74,3072,preston,100,105,107,110,110,110,110,115,...,300,300,310,320,320,340,350,360,390,400
75,3073,reservoir,110,110,110,115,115,115,120,120,...,300,300,300,300,310,320,330,348,350,363
76,3071,thornbury,105,110,110,115,115,120,125,125,...,290,290,290,295,300,310,320,330,350,360
77,3174,noble park,100,100,100,100,105,105,105,105,...,260,260,260,260,270,270,276,300,300,330


> ### Save dataframes with postcode

In [None]:
# Create a new folder to store the cleaned dataframes
folder_path = '../data/curated/historical with postcode'

# Save the df from dataframe list
save_dataframes(merged_dataframes, folder_path, sheet_names)

Saved cleaned cleaned 1 bedroom flat.csv under ../data/curated/historical with postcode
Saved cleaned cleaned 2 bedroom flat.csv under ../data/curated/historical with postcode
Saved cleaned cleaned 3 bedroom flat.csv under ../data/curated/historical with postcode
Saved cleaned cleaned 2 bedroom house.csv under ../data/curated/historical with postcode
Saved cleaned cleaned 3 bedroom house.csv under ../data/curated/historical with postcode
Saved cleaned cleaned 4 bedroom house.csv under ../data/curated/historical with postcode
Saved cleaned cleaned All properties.csv under ../data/curated/historical with postcode


In [None]:
print("Saved preprocessed historical data under ../data/curated/")

> ## Preprocess Social Indicators Dataset

In [None]:
indicator_df = pd.read_csv('../data/landing/social_indicator.csv')

def get_postcodes(row):
    # Find every run of four digits (a postcode) in the string
    postcodes = re.findall(r'\d{4}', row)
    return postcodes


# Extract the list of postcodes mentioned in each respondent group
indicator_df['postcode'] = indicator_df['respondent_group'].apply(get_postcodes)

# Ensure each row has only one postcode
indicator_df = indicator_df.explode('postcode')

# Drop rows for which no postcode was found
indicator_df = indicator_df.dropna(subset=['postcode'])

# Output data to csv
indicator_df.to_csv('../data/raw/social_indicator_w_postcode.csv', index=False)
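
To illustrate what the extract, explode and drop steps do, here is a small sketch on toy data. The `respondent_group` strings and the `score` column are hypothetical, since the real format of social_indicator.csv is not shown in this notebook.

```python
import re
import pandas as pd

# Toy indicator data (hypothetical group labels and scores)
toy = pd.DataFrame({
    'respondent_group': ['Group A (3053)', 'Group B (3206, 3207)', 'Statewide'],
    'score': [1, 2, 3],
})

# A list of four-digit matches per row; an empty list when there is none
toy['postcode'] = toy['respondent_group'].apply(lambda s: re.findall(r'\d{4}', s))

# explode gives one row per postcode and NaN for empty lists, which dropna removes
toy = toy.explode('postcode').dropna(subset=['postcode'])
print(toy)
```

A group that covers two postcodes becomes two rows with the same score, while a group without any postcode disappears from the output.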