# Real Estate Industry Project

## Introduction

Data science plays an essential role in supporting decision-making in real estate. It can be used to analyze economic trends and predict real estate performance. Our goal is to find the most important features of residential properties, predict rental prices, and identify the most livable and affordable suburbs in Victoria with the aid of big data. This notebook summarises our key findings and further discussion of the project.

## Data
### Internal Data

- The number of bedrooms
- The number of bathrooms
- Parking spaces
- Property types (such as apartments/houses)
- Property features (such as balcony, garden, and laundry)
- Address
- Weekly rent

All internal data was scraped from the Domain website.

### External Data

- Bus stations
- Shopping centers
- Crime rate
- Population
- Hospitals
- Schools
- Income

All external data was scraped from reliable websites, including the Australian Bureau of Statistics, the Crime Statistics Agency, the Victorian Government Data Directory, and Public Transport Victoria.

## Data Preprocessing and Outlier Detection

The original datasets use different geographic units, such as SA2, LGA, and postcode. We chose postcode as the common unit and used a correspondence table to map the other units onto it. We then converted each postcode to the latitude and longitude of its centroid and used an API to derive the travel distance and duration from the centroid to the nearest public facilities. Income data was only available for 2016-2019, so we projected income for the following six years using the growth rate. For outlier handling, we dropped properties with no bedrooms and bathrooms and properties with a weekly rent greater than 5000, and filled all missing values with 0.
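The income projection and cleaning rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the dictionary keys, and the choice of a compound annual growth rate are assumptions, as is the reading that a listing is dropped only when it has neither bedrooms nor bathrooms.

```python
def project_income(income_by_year, years_ahead=6):
    """Extend an income series using its average annual growth rate.

    Assumption: the growth rate is the compound annual rate between the
    first and last observed years.
    """
    years = sorted(income_by_year)
    first, last = years[0], years[-1]
    rate = (income_by_year[last] / income_by_year[first]) ** (1 / (last - first)) - 1
    projected = dict(income_by_year)
    for step in range(1, years_ahead + 1):
        projected[last + step] = income_by_year[last] * (1 + rate) ** step
    return projected


def clean_listings(listings, max_weekly_rent=5000):
    """Drop outlier listings, then fill remaining missing values with 0."""
    cleaned = []
    for row in listings:
        # Assumed reading: drop only when both bedrooms and bathrooms are 0.
        if row.get("bedrooms") == 0 and row.get("bathrooms") == 0:
            continue
        rent = row.get("weekly_rent")
        if rent is not None and rent > max_weekly_rent:
            continue
        cleaned.append({key: (0 if value is None else value) for key, value in row.items()})
    return cleaned


# Hypothetical inputs for illustration only.
income = {2016: 50000, 2017: 51000, 2018: 52020, 2019: 53060.4}
projected_income = project_income(income)

listings = [
    {"bedrooms": 2, "bathrooms": 1, "weekly_rent": 450, "parking": None},
    {"bedrooms": 0, "bathrooms": 0, "weekly_rent": 300, "parking": 1},
    {"bedrooms": 5, "bathrooms": 4, "weekly_rent": 9000, "parking": 2},
]
cleaned_listings = clean_listings(listings)
```

Filling missing values with 0 treats "unknown" the same as "none", which is simple but can bias features such as parking spaces.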

## Feature Importance / The most important internal and external features in predicting rental prices

After retrieving all the internal and external data we required, the first question we set out to answer was "what are the important features in predicting rental prices?".

To answer this question, we ranked the features by their permutation feature importance. The graph shows that the top 5 features are income, the number of bathrooms, the number of bedrooms, the count of shopping centers, and the number of parking spaces. These features are used later in our models.
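As a rough illustration of permutation feature importance: each feature column is shuffled in turn, and the resulting increase in prediction error is taken as that feature's importance. The sketch below uses synthetic data, a least-squares linear model as a stand-in, and MAE as the error measure; all of these are assumptions, not the setup used in the project. scikit-learn also offers `sklearn.inspection.permutation_importance` for fitted estimators.

```python
import numpy as np


def permutation_importance_mae(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(np.abs(predict(X) - y))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this feature and the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            increases.append(np.mean(np.abs(predict(X_perm) - y)) - baseline)
        importances[col] = np.mean(increases)
    return importances


# Synthetic data: the third feature has no influence on the target.
rng = np.random.default_rng(42)
feature_names = ["income", "bathrooms", "noise"]
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares fitted with NumPy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importances = permutation_importance_mae(lambda data: data @ coef, X, y)
ranking = [feature_names[i] for i in np.argsort(importances)[::-1]]
```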

![](../plots/2.jpeg)

The three plots show a generally positive linear relationship between the three internal features and the rental price. This indicates that rental prices increase as the number of bedrooms, bathrooms, and parking spaces increases.

![](../plots/5.png)

![](../plots/3.png)

![](../plots/4.png)

Rent depends mostly on income per person in each suburb. The two geomaps show similar distributions of income and weekly rent across Victoria: to a large extent, the regions with high rental prices also have high incomes.

## Models

### Random Forest Regression Model

The first model we chose is random forest regression, a regression model that uses ensemble learning. A regression model provides a function that describes the relationship between one or more independent variables and a response variable. Ensemble learning is a general meta-approach to machine learning that seeks better predictive performance by combining the predictions from multiple models. In the same way, a random forest regression model combines the predictions of numerous decision trees.

We used a random holdout to split the data into a 70% training set and a 30% testing set. While building the model, we ran a grid search three times and analyzed the cross-validation error. We set the number of trees to 300, the maximum number of features the model may consider at each split to the square root of the total number of features, and the maximum depth of each tree to 13. A fixed value of `random_state` ensures the model gives the same results when given the same parameters.
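A minimal sketch of this workflow with scikit-learn, assuming that library was used (the mention of `random_state` suggests it, but the source does not confirm it). The data is synthetic, and the candidate grid, the 3-fold cross-validation, and the seed values are illustrative assumptions; only the final values of 300 trees, square-root features, and depth 13 come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the property features and weekly rents.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 400 + 80 * X[:, 0] + 40 * X[:, 1] + rng.normal(scale=10, size=300)

# Random holdout: 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical candidate grid that contains the values reported in the text.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 13],
    "max_features": ["sqrt"],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),  # fixed seed for reproducibility
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_mae = float(np.mean(np.abs(best_model.predict(X_test) - y_test)))
```

`search.best_params_` holds the combination with the lowest cross-validated MAE, and `search.cv_results_` exposes the per-candidate errors for closer inspection.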

![](../plots/7.png)

We calculated the MAE of the random forest regression model and of a baseline model that simply predicts the average for every property. However, the results were not competitive: there are still many significant gaps between the true values and the predictions. We therefore decided to use a neural network model.
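The comparison against a mean-predicting baseline can be sketched as below. The rent values and the model predictions are made up for illustration; they are not results from the project.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def baseline_mae(y_train, y_test):
    """MAE of a baseline that always predicts the training-set average."""
    mean_rent = sum(y_train) / len(y_train)
    return mean_absolute_error(y_test, [mean_rent] * len(y_test))


# Hypothetical weekly rents and model predictions.
y_train = [400, 450, 500, 650]
y_test = [420, 600]
model_predictions = [430, 560]

baseline_error = baseline_mae(y_train, y_test)
model_error = mean_absolute_error(y_test, model_predictions)
```

A model is only worth keeping if its MAE is clearly below the baseline's on the held-out test set.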

![](../plots/8.png)

### Neural Network Model

A neural network uses data to approximate the true underlying function, here the behavior of price changes, and makes the best guess from limited resources. Our neural network model is made up of 6 layers. Each layer is made up of nodes, and each node is connected to the nodes in the next layer. Each connection has a specific weight, which is the impact that node has on the next-layer node. During training, a loss function evaluates the performance of the current model, the optimizer uses the result to improve the model, and metrics report the model's performance.
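To make the roles of layers, weights, loss function, and optimizer concrete, here is a small NumPy sketch of a fully connected network trained with plain gradient descent. It is not the project's model: the layer widths, the ReLU activation, the MSE loss, the learning rate, and the reading of "6 layers" as six layers of nodes (input, four hidden, output) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layers(sizes):
    """One (weights, bias) pair per set of connections between layers."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(layers, X):
    """Activations of every layer: ReLU on hidden layers, linear output."""
    activations = [X]
    for i, (W, b) in enumerate(layers):
        z = activations[-1] @ W + b
        is_output = i == len(layers) - 1
        activations.append(z if is_output else np.maximum(z, 0.0))
    return activations


def mse(pred, y):
    """Loss function: mean squared error."""
    return float(np.mean((pred - y) ** 2))


def train_step(layers, X, y, lr=0.005):
    """Optimizer: one full-batch gradient descent step on the MSE loss."""
    acts = forward(layers, X)
    grad = 2.0 * (acts[-1] - y) / len(X)  # gradient of loss w.r.t. output
    updated = []
    for i in reversed(range(len(layers))):
        W, b = layers[i]
        if i < len(layers) - 1:
            grad = grad * (acts[i + 1] > 0)  # ReLU derivative
        grad_W = acts[i].T @ grad
        grad_b = grad.sum(axis=0)
        grad = grad @ W.T  # pass the gradient back to the previous layer
        updated.append((W - lr * grad_W, b - lr * grad_b))
    return updated[::-1]


# Synthetic data and hypothetical layer widths (six layers of nodes).
X = rng.normal(size=(200, 4))
y = (X @ np.array([1.0, 0.5, -0.3, 0.2])).reshape(-1, 1)
sizes = [4, 16, 16, 8, 8, 1]

layers = init_layers(sizes)
initial_loss = mse(forward(layers, X)[-1], y)
for _ in range(300):
    layers = train_step(layers, X, y)
final_loss = mse(forward(layers, X)[-1], y)
```

In practice a framework such as Keras or PyTorch would handle the gradient computation and offer more capable optimizers; the manual version only shows what happens in each training step.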

## Findings

### Top 10 suburbs with the highest predicted growth rate

The growth rate was calculated as the difference between next year's price and the current price, divided by the current price. The ranking is sorted by the growth rate in 2023. According to the geomap, high-growth suburbs are geographically dispersed, with no specific high-growth region. We also predicted growth rates for 2024 and 2025. According to the table, prices would fluctuate somewhat in the future but would generally remain higher than in 2022.
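The growth rate and the resulting ranking can be expressed in a few lines. The suburb names and prices below are placeholders, and taking 2022 as the base year for the 2023 growth rate is an assumption.

```python
def growth_rate(current_price, next_price):
    """(next year's price - current price) / current price."""
    return (next_price - current_price) / current_price


def top_growth_suburbs(prices, base_year=2022, n=10):
    """Rank suburbs by predicted growth from base_year to base_year + 1."""
    rates = {
        suburb: growth_rate(by_year[base_year], by_year[base_year + 1])
        for suburb, by_year in prices.items()
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical predicted weekly rents per suburb and year.
prices = {
    "Suburb A": {2022: 500, 2023: 550},
    "Suburb B": {2022: 400, 2023: 460},
    "Suburb C": {2022: 600, 2023: 606},
}
top_two = top_growth_suburbs(prices, n=2)
```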

![](../plots/9.png)

![](../plots/10.png)

### The most livable and affordable suburbs

We evaluated regions on six factors: safety level, rent-income ratio, and the number of bus stations, schools, hospitals, and shopping centers. Since a higher crime count means a lower safety level, the safety level was calculated as the inverse of the crime count per person. Likewise, a higher rent-income ratio means lower affordability, so we took its inverse as well. We believe the safety level is the most important factor when evaluating liveability, so we only considered regions with a safety level above the median. Moreover, we define rent above 50% of income as unaffordable. A radar chart is useful for an overall assessment of multivariate data, so the final performance ranking was measured by the area of each region's radar plot.
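A rough sketch of this scoring follows. The field names, the min-max scaling of each factor to the range 0 to 1, the conversion of weekly rent to an annual figure, and the order of the radar axes are assumptions introduced for the example; the radar area also depends on the axis order, so a fixed order must be used for all suburbs. The sketch assumes every suburb has a nonzero crime count.

```python
import math
from statistics import median

FACTORS = ["safety", "affordability", "bus_stations",
           "schools", "hospitals", "shopping_centers"]


def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def radar_area(scores):
    """Area of the radar polygon with equally spaced axes."""
    n = len(scores)
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))


def rank_suburbs(suburbs):
    """Return (name, radar area) for eligible suburbs, best first."""
    rows = []
    for s in suburbs:
        rent_ratio = s["weekly_rent"] * 52 / s["annual_income"]
        rows.append({
            **s,
            "safety": s["population"] / s["crime_count"],  # inverse of crime per person
            "rent_ratio": rent_ratio,
            "affordability": 1 / rent_ratio,
        })
    # Scale each factor across all suburbs so the axes are comparable.
    scaled = {f: min_max_scale([r[f] for r in rows]) for f in FACTORS}
    safety_median = median(r["safety"] for r in rows)
    ranked = []
    for i, r in enumerate(rows):
        # Keep only suburbs that are safer than the median and affordable.
        if r["safety"] > safety_median and r["rent_ratio"] <= 0.5:
            ranked.append((r["name"], radar_area([scaled[f][i] for f in FACTORS])))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical suburbs for illustration only.
suburbs = [
    {"name": "A", "population": 1000, "crime_count": 10, "weekly_rent": 400,
     "annual_income": 60000, "bus_stations": 10, "schools": 5, "hospitals": 2,
     "shopping_centers": 3},
    {"name": "B", "population": 1000, "crime_count": 20, "weekly_rent": 500,
     "annual_income": 80000, "bus_stations": 8, "schools": 4, "hospitals": 1,
     "shopping_centers": 2},
    {"name": "C", "population": 1000, "crime_count": 100, "weekly_rent": 300,
     "annual_income": 40000, "bus_stations": 2, "schools": 1, "hospitals": 0,
     "shopping_centers": 0},
    {"name": "D", "population": 1000, "crime_count": 200, "weekly_rent": 700,
     "annual_income": 50000, "bus_stations": 0, "schools": 0, "hospitals": 0,
     "shopping_centers": 1},
]
ranking = rank_suburbs(suburbs)
```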

Top 1: Brighton

![](../plots/a.jpeg)

Top 2: Warranwood

![](../plots/b.jpeg)

Top 3: Briar Hill

![](../plots/c.jpeg)

## Suggestions
1. Build an online platform where people can access our findings and make decisions based on our price predictions and suburb liveability ranking, and create a website tool that calculates a suburb's liveability score when the user enters its name.
2. Share our findings with real estate developers to help them better assess locations for new apartment construction.
3. Provide a rent evaluation system that helps property owners set a reasonable rent for their property.

## Limitations and Assumptions

### Limitations
1. We do not have enough historical data to train the models.
2. Because we used one-hot encoding, our neural network model has too many features, which may lead to overfitting. 
3. We do not have future features of properties.
4. Our income prediction may be inaccurate, affecting our price prediction.
5. Time series models, such as ARIMA and LSTM, would probably produce better results.

### Assumptions
1. We assumed COVID-19 has no impact.
2. We assumed there are only three property types: home, townhouse, and apartment.
3. We ignored bills and pet allowances.
4. We assumed that the closest distance to a hospital, school, bus station, or shopping center can be measured from the postcode centroid.
