In [6]:
import pandas as pd
import geopandas as gpd


# Introduction
### Background
The RBA (Reserve Bank of Australia) has been raising interest rates since May 2022, which has significantly affected the real estate market, and house prices have been dropping since then. Many property investors are considering whether to sell to cut their losses or to lease out their property and wait for a possible market recovery. Renters, on the other hand, may consider buying a property to settle down instead of continuing to rent. In both situations, rental prices are relevant because they represent a substantial opportunity cost.

### Target and Methodology
This project aims to provide an overview of the rental market and predictions of how it will develop. We mainly use machine learning methods because of their predictive accuracy and their ability to capture trends.



### Value
The results may help ordinary renters identify properties that meet their needs. They may also help investors calculate revenue or opportunity costs.

# Dataset information

## Methodology
The data were collected by direct download, manual request, web crawling, and API requests.

## Dataset detail
Here is a list of datasets we used and their brief summary:

### Time Related
| Name | Duration | Time unit | Area unit | Extra note |
| --- | --- | --- | --- | --- |
| ERP | 2010 - 2021 | Year | SA2 | Estimated Resident Population |
| Residential Property Price Index | Q1/2021 - Q4/2021 | Quarter | AUS | Represents house prices |
| 3-year bond | 07/2013 - 08/2022 | Month | AUS | Close to the risk-free rate |
| Median Rent | Q2/1999 - Q1/2021 | Quarter | SA2 | Median rent and deal count of each LGA district |
| Exchange Rate | 03/2010 - 06/2022 | Month | AUS | Exchange rate from AUD to USD |
| Immigration data | 2004 - 2019 | Year | Victoria | Victorian immigration data |
| Debt income ratio | 2009 - 2019 | 2 years | AUS | Measured every two years |


### District / Area Related
| Name | Unit | Extra note |
| --- | --- | --- |
| School location | Longitude / Latitude | Longitude and latitude of each school |
| ERP | SA2 | Estimated Resident Population |
| Median Household Income | SA2 | - |
| Median Rent | SA2 | Median rent and deal count of each SA2 district |
| Distance to CBD | SA2 | Distance from the centroid of each SA2 district to the CBD |
| PTV Station | SA2 | Number of stations (train, bus, coach, ...) |

### Property Related
| Name | Extra note |
| --- | --- |
| Position | Latitude and longitude of the property |
| Number of Rooms | Room types are bedroom, bathroom, and parking spot |
| Distance to school | Distance to the nearest school, measured in meters |
| Distance to Station | Distance to the nearest train station or CBD, measured in meters |


## Time related

For time-related data, we use the SA2 code as part of the index.

#### Assumptions:
- **Constant increase/decrease rate** throughout the year. This applies to the ERP, immigration, and debt-income ratio data, which are reported yearly or less often.

<img src="../plots/Constant Growth.svg" alt="Constant Growth Rate" style="width: 800px;" title="Constant Growth Rate" />
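The constant-rate assumption can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: it reads "constant rate" as equal absolute steps between consecutive yearly observations (linear interpolation), which is consistent with the fixed quarterly increments in the `population` column of the sample output in this section. The yearly values and the helper name `yearly_to_quarterly` are hypothetical.

```python
import pandas as pd

# Hypothetical yearly ERP values for one SA2 area (illustrative numbers only).
yearly = pd.Series([9000.0, 9400.0, 10200.0], index=[2019, 2020, 2021])


def yearly_to_quarterly(series):
    """Spread each year's change evenly over its four quarters."""
    rows = {}
    years = list(series.index)
    for this_year, next_year in zip(years[:-1], years[1:]):
        # Equal step per quarter between two consecutive yearly observations.
        step = (series.loc[next_year] - series.loc[this_year]) / 4
        for q in range(4):
            rows[(this_year, q + 1)] = series.loc[this_year] + q * step
    # The last observed year only contributes its first quarter.
    rows[(years[-1], 1)] = series.loc[years[-1]]
    index = pd.MultiIndex.from_tuples(list(rows), names=["year", "quarter"])
    return pd.Series(list(rows.values()), index=index)


quarterly = yearly_to_quarterly(yearly)
print(quarterly)
```

If the growth were meant to be a constant percentage rate instead, the step would be replaced by a fixed multiplicative factor per quarter.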


#### Analysis
<img src="../plots/History_rent.png" alt="Median Rent & Deal Count" style="width: 500px;" title="Median Rent & Deal Count" />

From the above graph of historical median rent and deal count in Victoria, we can observe that both the rent price and the deal count keep **increasing** at a relatively constant rate.

<img src="../plots/Part_his_rent.png" alt="Part of History Rent" style="width: 500px;" title="Part of History Rent" />

From the segment graph, we can observe that both the deal count and the rental price are at their **lowest** in **quarter 2** of each year.
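One way to check a seasonal pattern like this numerically is to compare each quarter with the mean of its own year, which removes the long-term upward trend, and then average by quarter. The sketch below uses hypothetical numbers, not the project's data; only the column names `year`, `quarter`, `median_rent`, and `deal_count` follow the curated dataset.

```python
import pandas as pd

# Hypothetical quarterly history (illustrative values only).
history = pd.DataFrame({
    "year": [2019] * 4 + [2020] * 4,
    "quarter": [1, 2, 3, 4] * 2,
    "median_rent": [300.0, 295.0, 305.0, 310.0, 315.0, 308.0, 320.0, 325.0],
    "deal_count": [1200, 1050, 1150, 1180, 1260, 1100, 1210, 1240],
})

cols = ["median_rent", "deal_count"]
# Express each quarter relative to the mean of its own year to remove the trend.
relative = history[cols] / history.groupby("year")[cols].transform("mean")
# Average the relative level of each quarter across years.
seasonal = relative.groupby(history["quarter"]).mean()
# Quarter with the lowest relative level, per column.
lowest_quarter = seasonal.idxmin()
print(seasonal)
print(lowest_quarter)
```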


In [7]:
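# Load the curated time-related data and drop the leftover index column.
his_df = pd.read_csv("../data/curated/history_info.csv").drop(columns=["Unnamed: 0"])
# Keep a single SA2 area as an example.
his_df = his_df.query("SA2 == 201011001")
his_df.head()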

Unnamed: 0,SA2,year,quarter,population,bond,price_index,deal_count,median_rent,to_USD,immi_count,debt_ratio
0,201011001,2013,3,9550,2.9,105.0,1027,280.0,0.9309,30375,0.85875
1,201011001,2013,4,9714,2.96,109.0,1050,290.0,0.8948,30562,0.86
2,201011001,2014,1,9870,2.97,110.0,1251,295.0,0.9221,30932,0.85875
3,201011001,2014,2,10026,2.8,112.0,1069,280.0,0.942,31302,0.8575
4,201011001,2014,3,10182,2.8,113.0,1035,300.0,0.8752,31672,0.85625


## Related to SA2 Area standard

For SA2-level data, when a dataset covers several years we use the most recent year available.

#### Assumptions:
- We use a **right join** when converting from the LGA area standard to the SA2 standard

<img src="../plots/right join.svg" alt="Right Join" style="width: 800px;"/>

- The **minimum distance** between any two points is **350 m** (graph retrieved from ../plots/Min_Distance.html)

<img src="../plots/min_distance.png" alt="Minimal Distance" style="width: 300px;"/>

- We shift the centroid of some SA2 areas by up to ±0.04 degrees of **latitude** when we cannot find any waypoints near the original position (graph retrieved from ../plots/Distance_shift.html)

<img src="../plots/distance_shift.png" alt="Centroid Shift" style="width: 300px;"/>
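The right-join assumption above can be illustrated with `pandas.DataFrame.merge`. This is a sketch under stated assumptions: the LGA names, SA2 codes, and the lookup table `sa2_to_lga` are hypothetical placeholders, and the notebook does not show the actual conversion code. The point is that every SA2 row is kept, even when no LGA-level value exists for it.

```python
import pandas as pd

# Hypothetical LGA-level values (left table).
lga_rent = pd.DataFrame({
    "LGA": ["A", "B"],
    "median_rent": [350.0, 378.0],
})

# Hypothetical SA2-to-LGA lookup (right table): one row per SA2 area.
sa2_to_lga = pd.DataFrame({
    "SA2": [1, 2, 3, 4],
    "LGA": ["A", "A", "B", "C"],
})

# how="right" keeps every SA2 row; SA2 areas whose LGA has no data get NaN.
sa2_rent = lga_rent.merge(sa2_to_lga, on="LGA", how="right")
print(sa2_rent)
```

Rows with missing values after the join (here the SA2 mapped to LGA "C") still need to be handled explicitly, for example by dropping or imputing them.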


#### Analysis

<img src="../plots/Top_10.png" alt="Top 10 Rent" style="width: 800px;"/>

- (graph retrieved from ../plots/Top_10_rent.html)

From the above distribution of median rent across all SA2 areas in Q1 2021, we can observe that properties **close to city centres** generally have high rents. Similarly high rents can be found in some **seashore towns** and **East Melbourne**.



In [8]:
# Load the curated SA2-level data and drop the leftover index column.
sa2_df = pd.read_csv("../data/curated/sa2_info.csv").drop(columns=["Unnamed: 0"])
sa2_df.head()

Unnamed: 0,SA2,school_count,ERP_population,median_income,metrobus_count,metrotrain_count,metrotram_count,regbus_count,regcoach_count,regtrain_count,skybus_count,recr_count,comm_count,deal_count,median_rent,cbd_dis
0,202011018,13,14951,1267,0,0,0,142,2,1,0,1,1,709,350.0,152998.1
1,202011022,9,21060,1238,0,0,0,130,3,1,0,0,0,709,350.0,144471.1
2,203011035,6,8065,1898,0,0,0,3,4,0,0,1,0,2478,378.0,123256.7
3,203031048,6,16716,1424,0,0,0,74,0,0,0,0,0,1970,380.0,98849.4
4,204011062,4,4142,1222,0,0,0,0,10,0,0,3,0,360,331.666667,350.0


## Related to Property conditions

We use web crawling to retrieve basic property attributes and API requests to obtain distance-related data.

#### Assumptions:
- We approximate the distance from each property to the CBD by the distance between the CBD and the centroid of the property's SA2 area (graph retrieved from ../plots/Route_Appro.html)

<img src="../plots/route_appro.png" alt="Route Approximation" style="width: 600px;"/>
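The centroid approximation can be sketched as a lookup: every property in an SA2 area receives the distance computed once for that area's centroid. The example below is only an illustration. It uses the straight-line (haversine) distance as a stand-in for the route distance returned by the API, and the CBD coordinates, the centroid value, and the names `sa2_centroids` and `approx_cbd_distance` are assumptions introduced here.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate coordinates of the Melbourne CBD (assumption for illustration).
CBD = (-37.8136, 144.9631)

# Hypothetical centroid lookup: SA2 code -> (latitude, longitude).
sa2_centroids = {201011001: (-37.56, 143.79)}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    earth_radius_m = 6_371_000
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def approx_cbd_distance(sa2_code):
    """Distance to the CBD shared by every property in the given SA2 area."""
    lat, lon = sa2_centroids[sa2_code]
    return haversine_m(lat, lon, *CBD)


print(round(approx_cbd_distance(201011001)))
```

Because the distance depends only on the SA2 code, the `cbd_dis` column is identical for all properties in the same area, as in the sample output of this section. A straight-line distance will generally be shorter than the route distance.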

In [9]:
# Load the curated property-level data and drop the leftover index column.
rent_df = pd.read_csv("../data/curated/rent_distance.csv").drop(columns=["Unnamed: 0"])
rent_df.head()

Unnamed: 0,SA2,rent,bedroom,baths,parking,Latitude,Longitude,school_dis,station_dis,cbd_dis
0,201011001,490.0,4,2,2,-37.563073,143.793875,1651.7,5895.5,125495.6
1,201011001,420.0,4,2,2,-37.547241,143.770106,1249.5,7529.9,125495.6
2,201011001,520.0,4,2,2,-37.566319,143.800328,2094.0,6864.2,125495.6
3,201011001,440.0,4,2,2,-37.563453,143.789489,3988.4,7111.3,125495.6
4,201011001,440.0,4,2,2,-37.550549,143.786038,1120.8,6272.3,125495.6


# Feature Selection

For feature selection, we investigated the random forest method.

Random forest is an ensemble technique for prediction and behaviour analysis that is built on decision trees. A forest contains many decision trees, each of which produces its own prediction for a given input. The forest then combines these predictions, taking the majority vote for classification (or the average for regression) as the final prediction.

Each tree is trained on a sample drawn from the initial dataset. At each node, a random subset of features is considered when growing the tree. The trees are grown fully, without pruning. In this way, the random forest combines many weakly correlated trees into a single strong predictor.

![](../plots/sa2_predict/RF_str.png)

Because random forests are highly interpretable and **provide an importance score for each feature**, we select them for model training and report the importance of each feature.

![](../plots/sa2_predict/RF.png)
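Feature importances from a random forest can be obtained as in the following sketch. It assumes scikit-learn's `RandomForestRegressor` and synthetic data. The notebook does not show which implementation or hyperparameters were used, so the settings and the `noise` column are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: rent depends strongly on income, weakly on schools, not on noise.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "median_income": rng.normal(1400, 250, n),
    "school_count": rng.integers(0, 15, n),
    "noise": rng.normal(0, 1, n),
})
y = 0.2 * X["median_income"] + 3.0 * X["school_count"] + rng.normal(0, 5, n)

# Fit the forest and read the impurity-based importance of each feature.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

The importances sum to 1, so they can be read as each feature's relative share of the forest's total impurity reduction.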

The RF structure and the feature selection result are shown below.

![](../models/rf_data/tree.png)

![](../plots/sa2_predict/result.png)

# Liveable and Affordable Suburbs

Based on the feature selection results and our analysis, we assign the following weights to the selected features:

$liveable\_suburbs = 0.2 \times median\_income + 0.2 \times deal\_count + 0.2 \times ERP\_population + 0.1 \times metrotrain\_count + 0.1 \times school\_count + 0.1 \times recr\_count + 0.1 \times regbus\_count$
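The weighted score can be computed as in the sketch below, which uses the first three rows of the `sa2_info.csv` sample shown earlier. One step is an assumption added here: each feature is min-max scaled to [0, 1] before weighting so that large-valued columns such as `ERP_population` do not dominate. The notebook does not state whether the features were scaled, and the helper name `liveability_score` is hypothetical.

```python
import pandas as pd

# Weights taken from the formula above.
WEIGHTS = {
    "median_income": 0.2,
    "deal_count": 0.2,
    "ERP_population": 0.2,
    "metrotrain_count": 0.1,
    "school_count": 0.1,
    "recr_count": 0.1,
    "regbus_count": 0.1,
}

# First three rows of the sa2_info sample, indexed by SA2 code.
sa2 = pd.DataFrame(
    {
        "median_income": [1267, 1238, 1898],
        "deal_count": [709, 709, 2478],
        "ERP_population": [14951, 21060, 8065],
        "metrotrain_count": [0, 0, 0],
        "school_count": [13, 9, 6],
        "recr_count": [1, 0, 1],
        "regbus_count": [142, 130, 3],
    },
    index=pd.Index([202011018, 202011022, 203011035], name="SA2"),
)


def liveability_score(df, weights=WEIGHTS):
    """Weighted sum of min-max scaled features (scaling is an assumption)."""
    features = df[list(weights)].astype(float)
    # Avoid division by zero for columns that are constant.
    span = (features.max() - features.min()).replace(0, 1.0)
    scaled = (features - features.min()) / span
    return (scaled * pd.Series(weights)).sum(axis=1)


scores = liveability_score(sa2).sort_values(ascending=False)
print(scores)
```

Since the weights sum to 1 and every scaled feature lies in [0, 1], the resulting score also lies in [0, 1].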

![](../plots/sa2_predict/live.png)

Finally, the following figure shows the top 10 liveable and affordable suburbs.

![](../plots/sa2_predict/suburbs.png)

# SA2 Growth Rate Prediction

When making predictions for the SA2 regions, we need to address two issues:
1. What algorithm to use to predict the growth rate of SA2
2. How to proceed with the analysis

## What algorithm to use to predict the growth rate of SA2

Initially, we considered both LSTM and RNN as candidate models. Our analysis found LSTM to be more suitable.

An LSTM (Long Short-Term Memory) network is an extension of the recurrent neural network (RNN), introduced mainly to handle situations where RNNs fail. An RNN processes the current input while taking the previous output into account (feedback), keeping it in memory only briefly (short-term memory). Its most popular applications are in speech processing, non-Markovian control, and music composition.

RNNs have several drawbacks. First, they cannot retain information for long, so they struggle with “long-term dependencies”, where predicting the current output requires information seen much earlier. Second, they offer no fine control over which part of the context is carried forward and how much of the past is ‘forgotten’. Third, they suffer from exploding and vanishing gradients when the network is trained through backpropagation.

LSTM was designed to address these issues: the vanishing gradient problem is almost completely removed, while the training model is left unaltered. LSTMs can bridge long time lags and handle noise, distributed representations, and continuous values. Unlike the hidden Markov model (HMM), they do not require a finite number of states to be fixed beforehand. They also work over a large range of parameters such as learning rates and input and output biases, so little fine adjustment is needed, and the complexity of updating each weight is O(1), similar to that of Back Propagation Through Time (BPTT).

Since LSTM handles time-series tasks better than a plain RNN, in this section we use LSTM to predict the growth rate of each SA2 region.


![](../plots/sa2_predict/LSTM.png)

## How to proceed with the analysis

Having chosen the algorithm, and since we need predictions for each SA2 region, we divide this chapter into two parts:
- The first part uses a single SA2 as an example to make a prediction
- The second part ranks all SA2 regions after prediction


![](../plots/sa2_predict/analysis_part.png)

### Single SA2 Analysis

In this section, we take SA2 201011001 as an example and work through the following four parts:
1. Load Dataset And Show Base Info
2. Data visualization
3. Feature Engineering
4. Model Prediction

#### Load Dataset And Show Base Info

First, we gain a basic understanding of the data by loading the historical data and inspecting summary statistics such as variance and null-value counts.
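A quick overview of this kind can be produced with pandas, as sketched below. The values are the first five rows of three columns from the `history_info.csv` sample shown earlier. The choice of statistics is illustrative and may differ from what the notebook actually computed.

```python
import pandas as pd

# First five rows of three columns from the history_info sample.
sample = pd.DataFrame({
    "population": [9550, 9714, 9870, 10026, 10182],
    "median_rent": [280.0, 290.0, 295.0, 280.0, 300.0],
    "bond": [2.9, 2.96, 2.97, 2.8, 2.8],
})

# One row per column: mean, sample variance, and number of missing values.
summary = pd.DataFrame({
    "mean": sample.mean(),
    "variance": sample.var(),
    "null_count": sample.isna().sum(),
})
print(summary)
```

`DataFrame.describe()` and `DataFrame.info()` give a similar overview with less code when a custom layout is not needed.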

![](../plots/sa2_predict/base_info.png)

#### Data visualization

Then, we analysed the relationship between each feature and the label, which roughly follows a trend of fluctuation within a certain range.

![](../plots/sa2_predict/visualization.png)

#### Feature Engineering

There are two main difficulties in predicting the future:
- How to get future features
- How to predict house prices

We start with how to get future features.

Because our data is strongly time-dependent, we explored the autoregressive (AR) model.

Autoregressive (AR) modeling is one of the techniques used for time-series analysis. An autoregressive model describes how a variable's past values influence its current value. In other words, an AR model predicts the next value in a series by using the most recent past values as input data. It is based on the idea that past events can help us predict future events. For example, if we know that the stock market has been going up for the past few days, we might expect it to continue going up. Or, if we know that there has been a lot of rain lately, we might expect more rain in the future.

The name combines "auto" (self) and "regressive": the model is a linear regression of the response variable on its own past values. In time-series forecasting, this means the response variable Y depends on the previous values of Y at a pre-determined constant time lag, which can be daily (or 2, 3, 4… days), weekly, monthly, etc. For example, a model predicting the stock price at 12 pm tomorrow could use the stock price today as its input. AR models can be applied to any series that has some degree of autocorrelation, i.e. a correlation between observations at adjacent time steps. The most common use case is stock market prices, where the price today (t) is highly correlated with the price one day ago (t-1).

![](../plots/sa2_predict/autoregressive-model.jpg)

Since our data is quarterly, we set the order to 4, i.e. AR(4), so each forecast draws on the previous four quarters (one full year).
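An AR(4) model has the form $y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \phi_4 y_{t-4} + \varepsilon_t$. The sketch below fits the coefficients by ordinary least squares with NumPy and then forecasts recursively, feeding each prediction back in as input. It is a stand-in for illustration: the notebook does not show which library was used, and the toy series and the helper names `fit_ar` and `forecast_ar` are hypothetical.

```python
import numpy as np


def fit_ar(series, order=4):
    """Fit y_t = c + sum_i phi_i * y_{t-i} by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # Column i holds lag i+1, i.e. the value (i+1) steps before each target.
    lags = np.column_stack(
        [y[order - i - 1 : len(y) - i - 1] for i in range(order)]
    )
    design = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(design, y[order:], rcond=None)
    return coef  # [c, phi_1, ..., phi_order]


def forecast_ar(series, coef, steps):
    """Forecast recursively, reusing earlier forecasts as lagged inputs."""
    order = len(coef) - 1
    history = list(np.asarray(series, dtype=float))
    for _ in range(steps):
        recent = history[-1 : -order - 1 : -1]  # most recent value first
        history.append(coef[0] + float(np.dot(coef[1:], recent)))
    return np.array(history[-steps:])


# Toy quarterly series: linear trend plus a repeating four-quarter pattern.
season = [0.0, -5.0, 3.0, 1.0]
series = [100.0 + 2.0 * t + season[t % 4] for t in range(24)]

coef = fit_ar(series, order=4)
forecast = forecast_ar(series, coef, steps=4)
print(forecast)
```

Because forecasts are fed back as inputs, errors can accumulate over longer horizons, so forecasts far into the future should be treated with more caution than the next quarter or two.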

The figure below shows the change in population over time as predicted by our AR model. The **red part** is our forecast.

![](../plots/sa2_predict/population.png)

We use the same method to forecast the other 6 features.

#### Feature Engineering and Model Prediction

In this section, we first split the dataset.

We then define a batching helper (`create_batch_dataset`) to improve performance:

In [5]:
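import tensorflow as tf


def create_batch_dataset(X, y, train=True, buffer_size=1000, batch_size=128):
    """Wrap features X and labels y in a batched tf.data.Dataset."""
    batch_data = tf.data.Dataset.from_tensor_slices((tf.constant(X), tf.constant(y)))
    if train:
        # Cache in memory and shuffle before batching for training.
        return batch_data.cache().shuffle(buffer_size).batch(batch_size)
    # Validation/test data keeps its original order.
    return batch_data.batch(batch_size)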
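The notebook does not show how the dataset is split before batching. A common approach for time series, assumed here purely for illustration, is to cut the history into sliding windows and then split chronologically so that the test data always follows the training data. The window length of 4 and the helper names `make_windows` and `chronological_split` are hypothetical.

```python
import numpy as np


def make_windows(features, target, window=4):
    """Build (samples, window, n_features) inputs and next-step targets."""
    features = np.asarray(features, dtype="float32")
    target = np.asarray(target, dtype="float32")
    # Sample i covers time steps i .. i+window-1 and predicts step i+window.
    X = np.stack([features[i : i + window] for i in range(len(features) - window)])
    y = target[window:]
    return X, y


def chronological_split(X, y, train_fraction=0.8):
    """Split without shuffling so the test set lies after the training set."""
    cut = int(len(X) * train_fraction)
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])


# Toy data: 20 time steps with 2 features each.
toy_features = np.arange(40).reshape(20, 2)
toy_target = np.arange(20)

X, y = make_windows(toy_features, toy_target, window=4)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
print(X.shape, train_X.shape, test_X.shape)
```

The resulting arrays have the three-dimensional shape an LSTM layer expects and can be passed directly to `create_batch_dataset`.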

Then we built the model with `tensorflow.keras`. The model structure is as follows:

![](../plots/sa2_predict/model.png)

With both historical data and forecast features available, we can train the LSTM and make predictions. As the figure shows, the model uses two LSTM layers followed by a Dense layer that outputs the prediction.
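A model with this overall structure could be defined as in the sketch below. Only the layer types (two LSTM layers and a Dense output) come from the description above. The layer sizes, the input window of 4 time steps with 7 features, the optimizer, and the loss are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

# Assumed input shape: 4 time steps (quarters) with 7 features each.
WINDOW, N_FEATURES = 4, 7

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, N_FEATURES)),
    # The first LSTM returns the full sequence so a second LSTM can follow.
    tf.keras.layers.LSTM(32, return_sequences=True),
    # The second LSTM returns only its final state.
    tf.keras.layers.LSTM(16),
    # A single output unit for the predicted value.
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Sanity check on a dummy batch of 2 samples.
dummy = np.zeros((2, WINDOW, N_FEATURES), dtype="float32")
prediction = model(dummy)
print(prediction.shape)
```

Setting `return_sequences=True` on the first layer is what allows the LSTM layers to be stacked, because the second layer needs one input vector per time step.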

Training produces the following outputs:
- model structure: `models/model.png`
- model training logs: `models/logs`
- the best model: `models/best_model.hdf5`

The figure below shows the model loss, which declines over the course of training.

![](../plots/sa2_predict/output.png)

### All SA2 Prediction

Based on the above analysis, we can forecast the growth rate for all SA2 regions.

![](../plots/sa2_predict/all_predict.png)
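Ranking the regions after prediction can be sketched as follows. The growth rate is taken here as the relative change from the last observed rent to the predicted rent, which is an assumption, since the notebook does not give its exact definition. The rent values and the column names `last_rent` and `predicted_rent` are hypothetical. The SA2 codes are taken from the samples above.

```python
import pandas as pd

# Hypothetical last observed and predicted median rents per SA2.
forecast_df = pd.DataFrame({
    "SA2": [201011001, 202011018, 203011035],
    "last_rent": [300.0, 350.0, 378.0],
    "predicted_rent": [315.0, 357.0, 408.24],
})

# Relative change between the last observation and the prediction.
forecast_df["growth_rate"] = forecast_df["predicted_rent"] / forecast_df["last_rent"] - 1

# Sort all SA2 regions from highest to lowest predicted growth.
ranking = forecast_df.sort_values("growth_rate", ascending=False).reset_index(drop=True)
print(ranking)
```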