# Summary

In [1]:
import pandas as pd
import sys

sys.path.insert(0, "../scripts/")

from helper_functions import convert_census_to_postcode

<h3>Data & Methodology</h3>

**Web scraping**


Web scraping was performed on Domain.com. For each postcode sourced from postcode.csv, a maximum of 10 web pages of rental properties were scraped. The *download_property_data.py* script first scrapes the listing URLs from the main Domain.com page for each postcode, then loops through each URL to extract the features.

The main challenge with scraping was extracting the relevant features from the HTML through regex matching. In the initial stages, we also attempted to integrate Selenium to access all elements of the page, such as the description. However, this was dropped because the description was ultimately deemed unnecessary to the research and the process almost tripled the scraping time. Hence, the following features were scraped from Domain.com:
- Property Address/Name
- Cost
- Coordinates
- Number of Bedrooms, Bathrooms and Parking Spaces
- Property Type
- Agency that listed the property 

The following steps were taken to preprocess the scraped data:
1. Reformat all costs to numeric characters only without commas.
2. Null value imputations:
    - Bedrooms and parking: replace with numeric value zero
    - Bathrooms: replace with numeric value one
    - Postcodes and coordinates: invalid rows, removed from dataset
3. Convert numeric columns from string to integer or float values.
4. Remove any rows with cost <= $0 or cost > $20000.
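
The four steps above can be sketched in pandas as follows. This is a minimal illustration rather than the code in the project scripts: the function name `preprocess_properties` and the toy input rows are hypothetical, and the column names are taken from the sample output in this notebook.

```python
import pandas as pd


def preprocess_properties(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the four preprocessing steps."""
    df = raw.copy()

    # Steps 1 and 3 (cost): strip everything except digits and the decimal
    # point, then convert the remaining text to a number.
    cost_text = df["Cost"].astype(str).str.replace(r"[^0-9.]", "", regex=True)
    df["Cost"] = pd.to_numeric(cost_text, errors="coerce")

    # Step 2: rows without a postcode or coordinates are invalid.
    df = df.dropna(subset=["Postcode", "Coordinates"])

    # Steps 2 and 3 (counts): convert to numbers, then impute missing values
    # (0 for bedrooms and parking, 1 for bathrooms).
    for col, default in [("Bed", 0), ("Parking", 0), ("Bath", 1)]:
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(default).astype(int)
    df["Postcode"] = pd.to_numeric(df["Postcode"], errors="coerce").astype(int)

    # Step 4: keep only rows with 0 < cost <= 20000.
    df = df[(df["Cost"] > 0) & (df["Cost"] <= 20000)]
    return df.reset_index(drop=True)


# Toy rows (not real listings) covering each rule.
raw = pd.DataFrame({
    "Cost": ["$1,200 per week", "$450", "$0", "$25,000"],
    "Bed": ["2", None, "1", "3"],
    "Bath": [None, "1", "1", "2"],
    "Parking": [None, "1", None, "2"],
    "Postcode": ["3000", "3053", "3000", None],
    "Coordinates": ["[-37.81, 144.96]", "[-37.80, 144.97]",
                    "[-37.81, 144.95]", "[-37.82, 144.96]"],
})
clean = preprocess_properties(raw)
```

In this toy input, the third row is dropped for having a cost of zero and the fourth for a missing postcode, leaving two rows.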

This is a sample of the final dataset with its final features:


In [2]:
pd.read_csv("../data/curated/properties_processed.csv", index_col=0).head()

Unnamed: 0,Name,Cost,Coordinates,Bed,Bath,Parking,Property_Type,Agency,Postcode
0,5408/500 Elizabeth Street Melbourne VIC 3000,440.0,"[-37.8072443, 144.9602814]",1,1,0,Apartment / Unit / Flat,BRADY residential,3000
1,502/118 Russell Street Melbourne VIC 3000,620.0,"[-37.8135864, 144.9687232]",1,1,0,Apartment / Unit / Flat,Dingle Partners,3000
2,202A/441 Lonsdale Street Melbourne VIC 3000,300.0,"[-37.8134292, 144.9594445]",1,1,0,Apartment / Unit / Flat,Biggin & Scott Stonnington,3000
3,57/243 Collins Street Melbourne VIC 3000,400.0,"[-37.8159969, 144.9657956]",1,1,0,Apartment / Unit / Flat,Harcourts Melbourne City,3000
4,2311/601 Little Lonsdale Street Melbourne VIC ...,625.0,"[-37.8137564, 144.9537143]",2,2,1,Apartment / Unit / Flat,Harcourts Melbourne City,3000


**Open Route Service Data**

Data on nearby facilities and their distances was obtained from Open Route Service (ORS), as implemented in *preprocessing_distances.py* and *preprocessing_school_distances.py*. These scripts use the preprocessed property data to reduce the number of requests to the server.

Points of Interest (POIs) selected:
- Railway station - a marker of public transport accessibility
- Parks - a place of gathering for families, particularly with children
- Post office - a marker of the city or town centre
- Melbourne CBD - defined as the GPO on Bourke St (in accordance with Google Maps)
- Primary and secondary schools


ORS functions used with POIs:
- Points of Interest - to find the nearest railway, park and post office
- Directions - to find the driving route to the POIs
- Isochrones - 30 min driving distance around a point


Several limitations and challenges arose:
- ORS quotas on requests - limits applied over both a 24-hour period and a 60-second period, which slowed data collection
- Security - API tokens are sensitive information, so they were stored locally in .env files
- Radius limit in the POI function - a maximum radius of 2km led to a large number of zero values, particularly for rural properties
- Zero values - all values were categorised to limit the skew created by zero values
- OpenStreetMap (ORS Maps Client) - minor roads, long driveways and outdated construction work on roads meant some driving distances had to be imputed manually (using either Google Maps or OSM with some estimates)


A simulation of the ORS pipeline with visualisations can be found in *visualise_ors.ipynb*. This data was then merged with the preprocessed property dataframe in *merge_datasets.py*. A sample of the merged data is shown below:

In [3]:
pd.read_csv("../data/curated/all_distances.csv", index_col=0).head()

Unnamed: 0,Name,Cost,Coordinates,Bed,Bath,Parking,Property_Type,Agency,Postcode,CBD_Distance,...,Railway_Duration,Park_Distance,Park_Duration,Post_Office_Distance,Post_Office_Duration,Nearby_Schools,Primary_Distance,Primary_Duration,Secondary_Distance,Secondary_Duration
0,5408/500 Elizabeth Street Melbourne VIC 3000,440.0,"[-37.8072443, 144.9602814]",1,1,0,Apartment / Unit / Flat,BRADY residential,3000,749.2,...,93.1,423.1,68.7,327.5,41.7,612,1511.1,158.2,923.1,105.6
1,502/118 Russell Street Melbourne VIC 3000,620.0,"[-37.8135864, 144.9687232]",1,1,0,Apartment / Unit / Flat,Dingle Partners,3000,951.3,...,126.5,470.2,65.5,470.1,68.3,630,1652.9,183.0,438.8,68.5
2,202A/441 Lonsdale Street Melbourne VIC 3000,300.0,"[-37.8134292, 144.9594445]",1,1,0,Apartment / Unit / Flat,Biggin & Scott Stonnington,3000,577.3,...,74.2,841.9,105.1,557.1,71.1,612,2154.9,233.7,738.2,110.8
3,57/243 Collins Street Melbourne VIC 3000,400.0,"[-37.8159969, 144.9657956]",1,1,0,Apartment / Unit / Flat,Harcourts Melbourne City,3000,846.9,...,180.9,2019.2,294.8,184.3,47.4,625,2424.5,256.7,1211.2,160.1
4,2311/601 Little Lonsdale Street Melbourne VIC ...,625.0,"[-37.8137564, 144.9537143]",2,2,1,Apartment / Unit / Flat,Harcourts Melbourne City,3000,1052.5,...,81.5,607.6,89.2,324.1,62.8,608,1626.5,239.6,268.9,50.8


**Census Data**

Sourcing the census data was not as straightforward as initially thought: the historical census data on the Australian Bureau of Statistics (ABS) website is reported at Statistical Area Level 2 (SA2) granularity, whereas we needed postcode granularity.

Unfortunately, the only SA2 to postcode conversion dataset available was for 2011, and because SA2 and postcode areas change every census year, several mapping datasets had to be combined to create a '2021 SA2 to postcode' dataset. This was implemented in *preprocessing_sa2_postcode_mapping.py*, which merges the following datasets:
1. 2016 to 2021 SA2 mapping
2. 2011 to 2016 SA2 mapping
3. SA2 to postcode 2011 mapping
4. 2011 to 2016 postcode mapping
5. 2016 to 2021 postcode mapping

Because SA2 covers a larger area than postcodes, when converting the census data to postcode granularity, all features were grouped with mean aggregation (excluding zeros) except for population which was grouped by sum. 
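
The aggregation rule described above (mean excluding zeros for most features, sum for population) could look roughly like the sketch below. It is not the project's `convert_census_to_postcode` helper: the function name `aggregate_to_postcode` is hypothetical, the mapping table is assumed to have exactly the columns `sa2_2021` and `postcode_2021`, and the input values are toy numbers.

```python
import numpy as np
import pandas as pd


def aggregate_to_postcode(census, mapping, population_cols):
    """Hypothetical sketch: sum population, average other features without zeros."""
    merged = mapping.merge(census, on="sa2_2021", how="inner")
    feature_cols = [
        c for c in census.columns if c not in population_cols + ["sa2_2021"]
    ]

    # Treat zeros as missing so they are skipped by the mean.
    merged[feature_cols] = merged[feature_cols].replace(0, np.nan)

    agg_rules = {c: "sum" for c in population_cols}
    agg_rules.update({c: "mean" for c in feature_cols})
    return merged.groupby("postcode_2021").agg(agg_rules).reset_index()


# Toy data: two SA2 regions map to postcode 3000, one to 3001.
census = pd.DataFrame({
    "sa2_2021": [1, 2, 3],
    "Tot_persons_C21_P": [100, 200, 300],
    "Med_rent_weekly_C2021": [300, 0, 400],
})
mapping = pd.DataFrame({
    "sa2_2021": [1, 2, 3],
    "postcode_2021": [3000, 3000, 3001],
})
by_postcode = aggregate_to_postcode(census, mapping, ["Tot_persons_C21_P"])
```

For postcode 3000 in this toy example, the population is summed across both regions, while the zero rent is ignored and the mean rent comes from the single non-zero value.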

In [4]:
# census data by SA2
census_df = pd.read_csv("../data/curated/census_data.csv")
census_df.head()

Unnamed: 0,sa2_2021,Tot_persons_C11_P,Tot_persons_C16_P,Tot_persons_C21_P,Med_mortg_rep_mon_C2011,Med_mortg_rep_mon_C2016,Med_mortg_rep_mon_C2021,Med_person_inc_we_C2011,Med_person_inc_we_C2016,Med_person_inc_we_C2021,Med_rent_weekly_C2011,Med_rent_weekly_C2016,Med_rent_weekly_C2021,Med_tot_hh_inc_wee_C2011,Med_tot_hh_inc_wee_C2016,Med_tot_hh_inc_wee_C2021,Average_hh_size_C2011,Average_hh_size_C2016,Average_hh_size_C2021
0,201011001,8348,11658,16835,1565,1615,1698,607,702,865,250,311,370,1406,1585,1952,2.8,2.8,2.8
1,201011002,12076,12046,12131,1430,1500,1700,575,670,842,230,260,313,1120,1327,1573,2.3,2.3,2.2
2,201011005,6604,7153,7261,1517,1580,1662,532,638,805,260,300,330,1401,1634,1927,2.8,2.8,2.7
3,201011006,5749,7082,10661,1430,1466,1500,514,595,775,195,265,360,1073,1272,1627,2.7,2.6,2.6
4,201011007,3782,3935,4230,1558,1600,1733,565,646,802,240,260,350,1336,1583,2065,3.0,3.0,3.0


In [5]:
# census data by postcode
sa2_postcode_map = pd.read_csv("../data/curated/sa2_postcode_mapping_2021.csv")
convert_census_to_postcode(census_df, sa2_postcode_map, "mean_no_zero").head()

Unnamed: 0,postcode_2021,tot_population_11,tot_population_16,tot_population_21,avg_med_mortg_rep_11,avg_med_mortg_rep_16,avg_med_mortg_rep_21,avg_med_person_inc_11,avg_med_person_inc_16,avg_med_person_inc_21,avg_med_rent_16,avg_med_rent_11,avg_med_rent_21,avg_med_hh_inc_16,avg_med_hh_inc_11,avg_med_hh_inc_21,tot_avg_hh_size_16,tot_avg_hh_size_11,tot_avg_hh_size_21
0,3000,124551,167166,178424,2213.38,2040.38,2040.19,862.18,5483.82,6467.76,395.76,447.06,418.19,1482.53,1896.76,2159.41,1.88,1.97,1.86
1,3002,68729,82804,89023,2357.78,2173.67,2155.22,1091.8,8969.6,10432.9,398.0,460.33,449.67,1709.4,2415.0,2598.8,1.82,1.91,1.87
2,3003,15496,20633,23083,2200.0,2050.0,2085.0,701.5,716.0,1000.0,395.0,418.5,385.5,1466.0,1493.5,1751.0,2.15,2.15,1.95
3,3004,100879,123254,129273,2331.58,2155.67,2149.75,1066.08,7152.46,8339.46,391.15,446.83,440.75,1688.85,2270.46,2471.46,1.83,1.89,1.84
4,3006,21150,30239,36164,2477.25,2217.75,2079.0,1132.4,16783.0,19507.0,406.8,501.0,461.0,1637.2,2883.2,3088.8,1.8,1.92,1.92


**School Data**

Brief description

**Summary**

List of final datasets & all features?

<h3>Questions & Analysis</h3>

Sections should basically be a summary of each individual notebook including overall method, final results and graphs

**Question 1 - What are the most important internal and external features in predicting rental prices?**

Using the Pearson correlation coefficient, we created a correlation matrix covering the features from all of the processed datasets. This matrix was then used for feature selection:

<img src="../plots/corr_matrix.jpg" style="width: 700px;"/>

From this matrix, feature selection was performed with the following justifications:
- Postcode and SA2 are extremely correlated, so only postcode was included, as it has a slightly higher correlation with cost
- All distance/duration combinations - pick best from each category
    - keep duration: CBD, station
    - keep distance: park, secondary and primary schools (kids can't drive and people likely want to walk to parks, not drive)
- Postcode highly correlated with everything
- CBD duration: postcode and agency
- Median and total income - only include median
- Median rent and mortgage payments - makes sense for landlords
- Total household income and personal income - only keep household income
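
A Pearson correlation matrix of the kind used for this selection can be computed with `DataFrame.corr`. The sketch below uses toy numbers and only three of the numeric columns; how the categorical features (such as agency) were encoded in the actual analysis is not shown here.

```python
import pandas as pd

# Toy values for illustration only, not taken from the scraped dataset.
features = pd.DataFrame({
    "Cost": [300, 400, 500, 620],
    "Bed": [1, 1, 2, 3],
    "CBD_Distance": [900.0, 800.0, 600.0, 300.0],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr_matrix = features.corr(method="pearson")

# Rank the other features by the strength of their correlation with cost.
cost_corr = corr_matrix["Cost"].drop("Cost").abs().sort_values(ascending=False)
```

Ranking by absolute value keeps strongly negative relationships (for example, cost falling as distance grows) alongside strongly positive ones.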


Interesting notes and observations:
- The highest correlations are for features already shown on property websites - this makes sense, as people can easily compare them against cost
- Agency and postcode are highly correlated with each other and with everything else
- Schools have a fairly low correlation with cost, suggesting renters place little weight on schools. This may be because only a minority of renters have children, e.g. families are less likely to rent than buy

The following correlation table was produced between rental price and various attributes:

<img src="../plots/correlation_chart.jpg" alt="Drawing" style="width: 400px;"/>

**Question 2 - What are the top 10 suburbs with the highest predicted growth rate?**

*Methodology*

To begin forecasting, we used the property and census-by-postcode dataframes generated in the sections above. However, the census data was only available for census years. To fill in the years between census years, linear interpolation was first applied to all features of the census data. This required an assumption of linearity between consecutive census values.
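
As an illustration of the interpolation step, the sketch below fills the non-census years for a single feature. The three values are the median weekly rent of the first SA2 row in the census sample above (250, 311 and 370 for 2011, 2016 and 2021); the variable names are hypothetical.

```python
import pandas as pd

# One feature observed only in census years.
census_years = pd.Series({2011: 250.0, 2016: 311.0, 2021: 370.0},
                         name="Med_rent_weekly")

# Add the missing years as NaN, then fill them on a straight line
# between the neighbouring census values.
annual = census_years.reindex(range(2011, 2022)).interpolate(method="linear")
```

Each gap of five years is split into equal steps, so 2012 becomes 250 + (311 - 250) / 5 = 262.2.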


Each postcode’s features were then extrapolated to 2026 with a multivariate time series model, Vector Autoregression (VAR). VAR is a statistical model that captures the relationship between multiple quantities as they change over time. A separate model was fitted for each postcode to produce a forecast for 2023 to 2026.
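
To make the idea concrete, the sketch below fits the simplest VAR, a lag-1 model y_t = c + A y_(t-1), by ordinary least squares using NumPy, and then rolls it forward four steps. It is a hand-rolled illustration only: the library, lag order and variables used in the forecasting notebook are not stated here, the function names are hypothetical, and the series is a toy one generated from a known process so the fit can be checked.

```python
import numpy as np


def fit_var1(series):
    """Fit y_t = c + A @ y_{t-1} by least squares.

    series has shape (n_timesteps, n_variables).
    """
    lagged, target = series[:-1], series[1:]
    # Design matrix: a column of ones for the intercept, then the lagged values.
    design = np.hstack([np.ones((len(lagged), 1)), lagged])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    intercept = coefs[0]
    A = coefs[1:].T  # transpose so that A @ y_{t-1} gives the prediction
    return intercept, A


def forecast_var1(intercept, A, last_obs, steps):
    """Iterate the fitted model forward, feeding each forecast back in."""
    current = np.asarray(last_obs, dtype=float)
    preds = []
    for _ in range(steps):
        current = intercept + A @ current
        preds.append(current)
    return np.vstack(preds)


# Toy two-variable series generated from known parameters (no noise).
true_c = np.array([1.0, 2.0])
true_A = np.array([[0.5, 0.1], [0.0, 0.8]])
history = [np.array([0.0, 0.0])]
for _ in range(11):
    history.append(true_c + true_A @ history[-1])
history = np.vstack(history)

c_hat, A_hat = fit_var1(history)
forecast = forecast_var1(c_hat, A_hat, history[-1], steps=4)
```

Because every variable is regressed on the previous values of all variables, a change in one feature can feed into the forecast of another, which is what distinguishes VAR from fitting each series independently.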

The predicted growth rate was defined as the difference between the forecasted 2026 median rental price and the current (2022) median rental price from our scraped data, divided by the current median rental price.
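
The growth rate definition can be written as a small function. The function name and the example prices below are hypothetical and chosen only to show the arithmetic.

```python
def growth_rate(current_median, forecast_median):
    """Relative change from the current median rent to the forecast median rent."""
    if current_median <= 0:
        raise ValueError("current median rent must be positive")
    return (forecast_median - current_median) / current_median


# Toy example: a median rent of $400 in 2022 forecast to reach $500 in 2026.
example_rate = growth_rate(400.0, 500.0)
```

Here the rate is (500 - 400) / 400 = 0.25, i.e. 25% growth over the period.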


*Challenges*

When first calculating the growth rate, many of the top suburbs were outliers with abnormally high growth rates. On investigation, these areas often had only a small number of listings, or listings that were not actually properties. A minimum threshold of 10 properties per postcode was applied before rerunning the forecasting to minimise these outliers.
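
A threshold like this can be applied by counting listings per postcode and keeping only the rows whose postcode meets the minimum. The sketch below assumes a `Postcode` column as in the property sample; the function name and toy data are hypothetical.

```python
import pandas as pd


def filter_sparse_postcodes(properties, min_listings=10):
    """Keep only rows whose postcode has at least min_listings listings."""
    # Map every row to the number of listings in its own postcode.
    counts = properties["Postcode"].map(properties["Postcode"].value_counts())
    return properties[counts >= min_listings]


# Toy data: 12 listings in one postcode and 3 in another.
properties = pd.DataFrame({
    "Postcode": [3000] * 12 + [3999] * 3,
    "Cost": [400.0] * 12 + [5000.0] * 3,
})
filtered = filter_sparse_postcodes(properties)
```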


*Results*

The table below shows the final results after forecasting and calculating the growth rate for each postcode: the top 10 postcodes by growth rate, as per our definition.


In [8]:
growth_rates_df = pd.read_csv("../data/curated/top_growth_rates.csv", index_col=0)
growth_rates_df.head(10)

Postcode,Growth Rate
3028,0.788013
3027,0.677426
3737,0.65901
3125,0.543594
3071,0.537571
3085,0.432888
3025,0.407147
3196,0.39593
3915,0.379317
3129,0.37116


Below is the model's forecast for the postcode with the highest growth rate (3028).

<img src="../plots/3028_forecast.jpg" alt="Drawing" style="width: 400px;"/>

**Question 3 - What are the most liveable and affordable suburbs according to your chosen metrics?**

*Affordability*

The following features were used when calculating affordability:
- Average number of parking spaces, bedrooms, and bathrooms
    - The number of parking spaces, bedrooms, and bathrooms for each property was obtained from the property dataframe
    - The average number per postcode was calculated by grouping the property data by postcode and taking the mean for each facility type
    - The averages were then weighted by the correlation with the median rental property price
- Cost as a percentage of income
    - The median rental property price was divided by the median weekly household income per postcode and multiplied by 100 
    - The median rental property cost per postcode was calculated during the preprocessing stage 
    - The median weekly household income per postcode was extracted from the preprocessed census data
- Mortgage as a percentage of income
    - The median monthly mortgage repayment was extracted from the preprocessed census data


The average number of facilities for a rental property per postcode was included in the affordability assessment because the numbers of bedrooms, bathrooms and parking spaces are positively correlated with rental cost. A smaller number of facilities is therefore associated with a more affordable rental cost. Cost as a percentage of income was included because areas with higher household incomes are more likely to be able to afford higher rental prices. If cost as a percentage of income is low, the area is therefore considered more affordable. Similarly, if the mortgage as a percentage of income is low, the area is again considered more affordable.

This metric was calculated by summing all the aforementioned features for each postcode. The resulting values were then standardised to a scale of 0 to 1 for easier comparison. The standardised value was then subtracted from 1, so that 0 indicates the least affordable postcode and 1 the most affordable.
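
The scaling and inversion can be sketched as below, assuming that "standardised to a scale of 0 to 1" means min-max scaling (an assumption, although it is consistent with the top postcode scoring exactly 1.0). The function name, postcode labels and raw scores are hypothetical.

```python
import pandas as pd


def standardised_affordability(raw_scores):
    """Min-max scale the summed scores to [0, 1], then invert them."""
    scaled = (raw_scores - raw_scores.min()) / (raw_scores.max() - raw_scores.min())
    # A high raw score means high cost, so invert: 1 is the most affordable.
    return 1 - scaled


# Toy summed scores for three hypothetical postcodes.
raw_scores = pd.Series({"A": 80.0, "B": 20.0, "C": 50.0})
affordability = standardised_affordability(raw_scores)
```

Postcode "B" has the lowest raw score and so receives 1.0, while "A" has the highest and receives 0.0.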


In [34]:
most_affordable_df = pd.read_csv("../data/curated/most_affordable.csv", index_col=0)
most_affordable_df.head(10)

Postcode,Standardised Affordability
3424,1.0
3409,0.947093
3414,0.946619
3418,0.942966
3006,0.936327
3396,0.929125
3393,0.909012
3490,0.906518
3390,0.90594
3318,0.901626


*Liveability*

The following features were used when calculating liveability:
- Average number of parking spaces, bedrooms, and bathrooms
    - The average number of facilities per postcode was calculated in the same way as for affordability
    - The resulting average was then divided by the average household size per postcode for each facility type
- Proportion of schools which are located within a 30-minute driving radius
    - The number of nearby schools for each property was obtained from the property dataframe 
    - The average number of nearby schools per postcode was calculated by grouping the property data by postcode and taking the mean
    - The resulting average was then divided by the total number of schools in Victoria (calculated by counting the number of rows in the schools dataset)
- Proportion of available properties belonging to the postcode
    - The number of rental properties listed for each postcode was obtained by grouping the property dataframe by postcode and counting the number of records per postcode 
    - The property counts for each postcode were divided by the total number of available rental properties (calculated by counting the number of rows in the property dataframe)
- Average duration to the nearest train station, post office, and CBD
    - The average duration to these amenities per postcode was obtained by grouping the property dataframe by postcode and taking the mean of the respective columns
- Average distance to the nearest park 
    - Calculated similarly to the durations above


The average number of facilities per person for a rental property in each postcode was included because it is a quantifiable measure of the luxuries afforded to the average individual in that postcode. The proportion of schools nearby, together with the average durations and distances to other nearby amenities, indicates how accessible these services are within the community. The proportion of available properties indicates how accessible the postal region is for prospective tenants. A larger number of luxuries, easily accessible amenities and a higher capacity to live in the region all result in an increased measure of liveability.


The metric was calculated by summing up all the aforementioned features for each postcode. The resulting values were then standardised to be on a scale of 0 to 1 for easier comparison. 

In [33]:
most_liveable_df = pd.read_csv("../data/curated/most_liveable.csv", index_col=0)
most_liveable_df.head(10)

Postcode,Standardised Liveability
3213,1.0
3688,0.997138
3808,0.96244
3670,0.950135
3799,0.934989
3249,0.922225
3659,0.921298
3461,0.895837
3498,0.894708
3285,0.886109


<h3>Assumptions & Business Context</h3>

**Terminal tool prototype**

In addition to answering the three main questions, we created a tool that renters, investors, landlords and businesses can use to view a snapshot of the ten postcodes nearest to a postcode they enter, along with the associated growth rate, standardised affordability and standardised liveability. This tool was created in two stages.

*Stage 1: Database creation [tool.ipynb]*

Each postcode was linked to its derived growth rate, liveability and affordability metrics from our research. In addition, an external dataset, *australian_postcodes.csv*, which associates each postcode with coordinates, was used. The first suburb listed for a postcode in the CSV was taken as the representative coordinate for the whole postcode and used when calculating the distances between postcodes.

Following this, the Euclidean distance from each postcode to every other postcode was calculated and ranked, and the ten closest neighbouring postcodes were saved for the tool. Euclidean distance was used because straight-line distance was deemed a sufficient measure of ‘closeness’.
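
The nearest-neighbour ranking could be sketched as below. The function name, the `lat` and `long` column names and the coordinates are all assumptions for illustration (the coordinates are toy grid points, not real locations), so this is not the code in *tool.ipynb*.

```python
import numpy as np
import pandas as pd


def nearest_postcodes(coords, postcode, k=10):
    """Return the k postcodes closest to `postcode` by Euclidean distance.

    coords is indexed by postcode with 'lat' and 'long' columns (assumed names).
    """
    target = coords.loc[postcode, ["lat", "long"]].to_numpy(dtype=float)
    others = coords.drop(index=postcode)
    diffs = others[["lat", "long"]].to_numpy(dtype=float) - target
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Sort by distance and keep the k closest.
    order = np.argsort(distances)[:k]
    return others.index[order].tolist()


# Toy coordinates on a simple grid.
coords = pd.DataFrame(
    {"lat": [0.0, 0.0, 3.0, 0.0], "long": [0.0, 1.0, 4.0, 2.0]},
    index=[3000, 3002, 3003, 3004],
)
closest = nearest_postcodes(coords, 3000, k=2)
```

Treating degrees of latitude and longitude as flat coordinates is an approximation, but it preserves the ordering of neighbours reasonably well over an area the size of Victoria.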

Finally, all of this information was combined into a single dataframe. Any missing information is represented by NaN values.


*Stage 2: Python script to access database and output answer [tooluse.ipynb]*

To use this tool, the user can input any Victorian postcode and the tool will output the following:
- The inputted postcode’s growth rate, standardised affordability and standardised liveability 
- The growth rate, standardised affordability and standardised liveability of the ten nearest postcodes to the inputted postcode

This tool can be leveraged to aid the decision making of renters, landlords, and real-estate businesses as well as serve as a guide for those wanting to enter the investment property market.

In future research, the differences between the rental property markets in metropolitan and rural areas of Victoria should be investigated further, and a wider range of features should be considered for assessing affordability and liveability. The tool could also be expanded with more data and developed for use in government sectors for the planning and development of services.