# 1. Introduction
This project aims to predict rental prices across Victoria, Australia, using machine learning models that integrate property features and geospatial data. By analyzing current and historical rental price trends, the goal is to provide predictions and highlight the key factors influencing rental demand across metropolitan and regional areas.

The project leverages property data scraped from domain.com.au and external datasets covering proximity to public transport and essential services. It provides a 3-year rental price forecast that real estate agencies, property investors, and renters can use to make informed decisions.

# 2. Data Collection and Preprocessing
### Data Sources
**Property data:** Scraped from domain.com.au, including rental prices, property characteristics (bedrooms, bathrooms, parking), and descriptions.

**Historical data:** Acquired from the Department of Families, Fairness and Housing, covering March 2000 to June 2024 for 231 suburbs.

**Essential services:** PTV dataset: closest public transport stations. OPM dataset: essential services (healthcare, groceries). The openrouteservice API was used to get the coordinates of each property in the dataset.

**Income data:** Covers 2016 to 2021, collected from abs.gov.au, spatially divided by the SA2 (Statistical Area Level 2) system.

**Population data:** Covers 2001 to 2023, collected from abs.gov.au, spatially divided by the SA2 (Statistical Area Level 2) system.

### Data Cleaning
Records with missing values were removed, irregular price formats were standardized to a weekly rate, and duplicates were dropped.
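The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names (`address`, `price`), the keyword patterns, and the conversion factors (52 weeks and 12 months per year) are assumptions.

```python
import re
from typing import Optional

WEEKS_PER_YEAR = 52    # assumed conversion factor
MONTHS_PER_YEAR = 12

# Assumed keywords that mark a price as monthly or annual; anything else is treated as weekly.
MONTHLY = re.compile(r"pcm|per month|monthly|/\s*month")
ANNUAL = re.compile(r"per annum|per year|annually|\bp\.?a\b|/\s*year")


def to_weekly_rent(raw: str) -> Optional[float]:
    """Parse a listing price string and return a weekly rate, or None if no number is found."""
    text = raw.lower().replace(",", "")
    match = re.search(r"(\d+(?:\.\d+)?)", text)
    if match is None:
        return None
    amount = float(match.group(1))  # for a range such as "$500 - $550", the first figure is used
    if MONTHLY.search(text):
        return amount * MONTHS_PER_YEAR / WEEKS_PER_YEAR
    if ANNUAL.search(text):
        return amount / WEEKS_PER_YEAR
    return amount


def clean_listings(listings):
    """Drop rows with missing fields, standardize the price, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in listings:
        weekly = to_weekly_rent(row.get("price") or "")
        if weekly is None or not row.get("address"):
            continue  # missing or unparseable values are removed
        key = (row["address"].strip().lower(), weekly)
        if key in seen:
            continue  # duplicate listing
        seen.add(key)
        cleaned.append({**row, "weekly_rent": weekly})
    return cleaned
```

For example, `to_weekly_rent("$2,600 pcm")` converts a monthly figure to 600.0 per week, while a string with no number (such as "Contact agent") yields `None` and the row is dropped.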

### Feature Engineering
New features, such as distance to the nearest tram station, train station, hospital, grocery store, and number of nearby schools, were added to improve prediction accuracy.
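As an illustration of how such distance features might be computed, the sketch below uses the straight-line (haversine) distance between coordinates. This is an assumption: the project may instead have used route distances from openrouteservice. The function names and the search radius are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def nearest_distance_km(prop, places):
    """Distance from a property (lat, lon) to the closest place in a list of (lat, lon) pairs."""
    return min(haversine_km(prop[0], prop[1], lat, lon) for lat, lon in places)


def count_within_km(prop, places, radius_km):
    """Number of places (e.g. schools) within radius_km of the property."""
    return sum(
        1 for lat, lon in places
        if haversine_km(prop[0], prop[1], lat, lon) <= radius_km
    )
```

`nearest_distance_km` would be called once per property and per place type (tram stops, train stations, hospitals, grocery stores), and `count_within_km` gives a "nearby schools" style count.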

The number of bedrooms and bathrooms was extracted from property descriptions to serve as key predictors.
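A minimal sketch of this kind of extraction is shown below, assuming descriptions mention counts as digits or number words directly before "bed"/"bedroom" or "bath"/"bathroom". The patterns and function names are illustrative, not the project's actual rules.

```python
import re
from typing import Optional

# Assumed vocabulary of spelled-out counts.
_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
_COUNT = r"\b(\d+|" + "|".join(_WORDS) + r")[\s-]*"


def extract_count(description: str, noun_pattern: str) -> Optional[int]:
    """Return the first count that directly precedes the given noun pattern, or None."""
    match = re.search(_COUNT + noun_pattern, description.lower())
    if match is None:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else _WORDS[token]


def extract_rooms(description: str) -> dict:
    """Extract bedroom and bathroom counts from free-text property descriptions."""
    return {
        "bedrooms": extract_count(description, r"bed(?:room)?s?\b"),
        "bathrooms": extract_count(description, r"bath(?:room)?s?\b"),
    }
```

Descriptions that do not state a count (for example, a studio listing) return `None` for that field, so they can be handled by the missing-value step.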

# 3. Modelling and Predictions

## 3.1 Rental Price Prediction 
### Model Selection
We experimented with two machine learning models: Linear Regression and Random Forest Regressor. Based on performance metrics, Random Forest Regressor was chosen for its better accuracy in handling the non-linear relationships present in the dataset.
### Training and Testing
Stratified sampling by suburb was used, with a threshold of 50 properties for each suburb.

The data was split 80-20 into training and test sets.
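The sketch below shows one way a per-suburb stratified 80-20 split could look. It assumes the threshold means that only suburbs with at least 50 listings are kept; if the threshold is instead a cap on listings sampled per suburb, the filter would become a truncation. The field name `suburb` and the fixed seed are also assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="suburb", min_per_group=50, test_ratio=0.2, seed=42):
    """Split rows into train/test sets so each retained suburb keeps the same split ratio."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for name in sorted(groups):
        members = groups[name]
        if len(members) < min_per_group:
            continue  # assumed interpretation: suburbs below the threshold are excluded
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_ratio)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test
```

Splitting within each suburb, rather than over the whole dataset, keeps every retained suburb represented in both the training and test sets.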
### Evaluation Metrics
Mean absolute error (MAE).

Root mean square error (RMSE).

$R^{2}$: The proportion of variance in rental price explained by the independent variables.
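For reference, the three metrics can be computed as in the following sketch, which uses their standard definitions. The sample values in the usage note are made up for illustration and are not project results.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: like MAE, but large errors are penalized more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

With hypothetical weekly rents `[400, 500, 600, 700]` and predictions `[410, 490, 600, 800]`, the single large error of 100 gives an MAE of 30 but an RMSE of about 50.5, which shows how a few large misses pull RMSE above MAE.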
### Feature Importance
Feature importance analysis revealed that proximity to tram stations, number of bedrooms, and bathrooms were the most critical features in predicting rental prices.

## 3.2 Future Median Rental Price Prediction
### Model selection
After visualising the price trend from 2000 to 2023, we decided on Linear Regression.
### Dataset
The historical data was filtered to the period from March 2021 to June 2024 and combined with predicted data, which was used as the data for September 2024.
### Prediction
The model predicts the future median rental price for 2025, 2026 and 2027.
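A linear trend forecast of this kind can be sketched with an ordinary least-squares fit of median price against time. The example below is a minimal illustration with made-up numbers; representing time as a calendar year (fractions for quarters) is an assumption, not necessarily how the project encoded dates.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, future_xs):
    """Extrapolate the fitted trend line to future time points."""
    slope, intercept = fit_line(xs, ys)
    return [intercept + slope * x for x in future_xs]


# Hypothetical median weekly rents for one suburb (not project data).
years = [2021, 2022, 2023, 2024]
medians = [400, 420, 440, 460]
predicted = forecast(years, medians, [2025, 2026, 2027])
```

In this made-up series the rent rises by 20 per year, so the extrapolated values continue that trend. One fit per suburb would be needed to produce suburb-level forecasts.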


# 4. Assumptions
The project assumes that current infrastructure and population growth trends will continue in the forecast period, and there are no significant shifts due to policy changes, economic downturns, or other unforeseen events. 

# 5. Limitations
The model does not account for external economic conditions, such as changes in interest rates or unemployment, which could significantly impact rental prices.

There may still be regional variability in the predictions, as factors such as future infrastructure developments and migration patterns can be unpredictable.

# 6. Issues & Challenges
## Dataset integrity
The property data from domain.com.au is not verified, since it was manually entered by the listing providers.

No data is available for properties that are currently being rented.

Income and jobs data have not been updated recently.
## Model evaluation
The threshold of 50 properties per suburb makes the dataset quite small and possibly insufficient, as reflected in the high RMSE and MAE.

Additionally, the MAEs are much smaller than the RMSEs for the Random Forest model, suggesting outliers in the data.
