# Summary of Project

This notebook walks through the general methodology and the key findings of the project, supported by visual aids in the form of graphs, screenshots of notebook activity, tables and code cells. Our aim was to develop a model that predicts future rental prices for suburbs in Victoria, so that we can deliver insights on rental properties when consulting for an online real estate company.

The notebook is split into the six parts of our project pipeline:
1. Data Gathering
2. Preprocessing
3. Preliminary Analysis
4. Modelling
5. Key Findings
6. Conclusion

Each section includes a summary of the overall approach, limitations encountered and assumptions made.

## 1. Data Gathering

All datasets used are listed in the readme file. Of the 19 datasets gathered, 17 were used in our final analysis and modelling.
Some of these datasets, such as the crime and ABS data, were deemed useful and gathered in the initial planning phase, whilst others, such as the coastline data, were only found to be useful after the preliminary analysis. Two datasets proved not to be useful further down the pipeline: the Domain dataset and the internet speed data (not included in the readme; the data could not be preprocessed effectively, as discussed further in preprocessing).
Perhaps the largest hiccup at this stage was discovering that domain.com.au lacks historical rental data and that we did not have permission to scrape realestate.com. Hence, we resorted to rental data from the Department of Families, Fairness and Housing (DFFH). We would have preferred to scrape individual housing data and aggregate it into suburb rental values ourselves to ensure high accuracy, but the DFFH data served as a reasonable workaround.


## 2. Preprocessing

The DFFH data did not use individual suburbs in every case; instead, some suburbs were grouped into self-defined suburb clusters. We emailed the agency directly to obtain the shapefile and ascertain the exact list of suburbs in each cluster, but unfortunately they did not respond. So we mapped the suburbs manually, using the ABS map of suburbs (denoted SAL) and the following map of suburb clusters:

<img src='../plots/Summary_Suburb_Clusters_DFFH.png' alt="My Image" width="600"/>


To merge all the features together, we left-joined the values based on the suburbs. 
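
As a rough illustration of what a left join on suburb does, the sketch below uses plain Python dictionaries. The table names and values are hypothetical, not taken from the project data; with pandas, the equivalent operation would be `DataFrame.merge(..., how="left")`.

```python
# Hypothetical suburb-level tables keyed by suburb name (values are made up).
rent = {"Carlton": 520.0, "Sale": 380.0, "Moe": 310.0}
crime = {"Carlton": 1450, "Sale": 610}  # no crime record for Moe


def left_join(left, right, right_col):
    """Keep every suburb in `left`; attach the `right` value, or None if absent."""
    return {
        suburb: {"median_rent": value, right_col: right.get(suburb)}
        for suburb, value in left.items()
    }


merged = left_join(rent, crime, "crime_count")

# Suburbs that found no match in the right-hand table keep their row
# but carry a missing value, which must be handled before modelling.
missing = [suburb for suburb, row in merged.items() if row["crime_count"] is None]
print(missing)
```

A left join keeps every row of the left-hand table, so features with incomplete coverage show up as missing values rather than silently dropping suburbs.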

We transformed the response variable into the median rental price divided by both inflation and the number of people per household. Inflation had a significant impact on rent through time, and because it affects the value of the currency itself rather than the underlying value of rent, we decided to build it into the response variable.
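
A minimal sketch of this transformation, assuming inflation is expressed as a price index (such as CPI) relative to a base period. The function name, parameter names and numbers are illustrative and are not taken from the project code.

```python
def adjusted_rent_per_person(median_rent, cpi, base_cpi, people_per_household):
    """Deflate nominal rent to base-period dollars, then divide by household size."""
    real_rent = median_rent * base_cpi / cpi  # assumption: CPI-style deflator
    return real_rent / people_per_household


# Hypothetical suburb: $500 nominal weekly rent, prices 25% above the base
# period, and an average of 2.5 people per household.
target = adjusted_rent_per_person(
    median_rent=500.0, cpi=125.0, base_cpi=100.0, people_per_household=2.5
)
print(target)
```

To report a prediction in nominal weekly rent, the transformation is reversed: multiply by household size and re-inflate by the price index.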


## 3. Preliminary Analysis

In our preliminary analysis, we uncovered key insights that significantly shaped our modeling approach. One of the most important discoveries was the impact of inflation and average household size on rental prices. Inflation was found to influence rent independently of a house’s intrinsic value, reflecting the changing value of the dollar over time. Additionally, household size had a notable effect on property prices, but this was not directly correlated with the underlying value of properties within a suburb. To account for these factors, we adjusted our target variable to estimate inflation-adjusted rent per person, which resulted in a much cleaner distribution for making accurate predictions over time, as demonstrated by the graphs.

<div style="display: flex; justify-content: space-between;">

  <!-- First Image -->
  <div style="width: 45%;">
    <img src="../plots/prel1.png" alt="First Image" style="width: 300px; height: auto;">
  </div>

  <!-- Second Image -->
  <div style="width: 45%;">
    <img src="../plots/prel2.png" alt="Second Image" style="width: 1300px; height:425px">
  </div>

</div>


Further feature engineering was conducted to refine our model, including converting crime statistics into a per-person metric to account for varying population sizes across suburbs. We also discarded suburbs with populations lower than 800, using the elbow method to determine this threshold. Ravenhall was specifically removed because the majority of its population are prison inmates, which distorted the data. After these transformations, significant patterns began to emerge. For instance, commercial density showed a strong positive correlation with rental prices, indicating that more commercial buildings in an area were linked to higher rents. Similarly, proximity to the CBD displayed a clear pattern, with closer suburbs experiencing higher rents. These insights, formed through detailed feature analysis, laid the foundation for constructing an accurate predictive model for our stakeholders.
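
The filtering and per-person conversion described above could look roughly like the following. The record layout and field names are hypothetical; only the 800-person threshold and the Ravenhall exclusion come from the text.

```python
MIN_POPULATION = 800      # threshold reported in the text
EXCLUDED = {"Ravenhall"}  # removed by name, as described in the text

# Hypothetical suburb records (names and numbers are made up, except Ravenhall).
suburbs = [
    {"name": "Suburb A", "population": 12000, "crime_count": 600},
    {"name": "Suburb B", "population": 450, "crime_count": 30},
    {"name": "Ravenhall", "population": 5000, "crime_count": 900},
]


def prepare(rows):
    """Drop small or excluded suburbs and add a per-person crime rate."""
    kept = []
    for row in rows:
        if row["population"] < MIN_POPULATION or row["name"] in EXCLUDED:
            continue
        rate = row["crime_count"] / row["population"]
        kept.append({**row, "crime_per_person": rate})
    return kept


cleaned = prepare(suburbs)
print([row["name"] for row in cleaned])
```

Filtering before dividing also avoids unstable rates in very small suburbs, where a handful of incidents would dominate the per-person figure.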

<div style="display: flex; justify-content: space-between;">

  <!-- First Image -->
  <div style="width: 45%;">
    <img src="../plots/cbd1.png" alt="First Image" style="width: 300px; height: auto;">
  </div>

  <!-- Second Image -->
  <div style="width: 45%;">
    <img src="../plots/cbd2.png" alt="Second Image" style="width: 1300px; height:425px">
  </div>

</div>


## 4. Modelling

We focused on three main models for our analysis: Linear Regression (LR), Random Forest Regression (RFR), and XGBoost. Linear Regression served as a simple and interpretable baseline, relying on the assumption of basic linear relationships to gauge the predictive capability. We then moved to Random Forest, which was chosen for its robustness and ability to capture more complex trends beyond linear relationships. Finally, we implemented XGBoost, a model known for its efficiency and accuracy, to enhance prediction performance.

To evaluate these models, we split the data into training and testing sets by time, so that each model was trained only on data preceding the period it was tested on, simulating real-world forecasting. We measured performance using R-squared (R²) and Root Mean Squared Error (RMSE) on average weekly rent. XGBoost emerged as the best-performing model, with an RMSE of 32.5, indicating that the forecasted weekly rent typically deviated from the actual value by about $32.50 per suburb. The comparison graph highlights the strong performance of all models over time, with an upward trend in rental prices. Using ARIMA for future estimates, we extended our predictions from 2024 to 2028, showing a continued rise in rent prices, and demonstrated that our models remained robust, even capturing irregular events like the impact of COVID-19.
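
To make the evaluation setup concrete, here is a small sketch of a time-based split and an RMSE calculation. The yearly rents, the cutoff year and the naive "last observed value" baseline are all hypothetical stand-ins; the project's actual models and split points are not reproduced here.

```python
import math


def temporal_split(rows, cutoff_year):
    """Train on years up to and including the cutoff, test on later years."""
    train = [r for r in rows if r["year"] <= cutoff_year]
    test = [r for r in rows if r["year"] > cutoff_year]
    return train, test


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical weekly rents for one suburb, sorted by year.
rows = [
    {"year": 2019, "rent": 400.0},
    {"year": 2020, "rent": 410.0},
    {"year": 2021, "rent": 405.0},
    {"year": 2022, "rent": 430.0},
    {"year": 2023, "rent": 470.0},
]

train, test = temporal_split(rows, cutoff_year=2021)

# Naive baseline: carry the last training observation forward.
baseline = train[-1]["rent"]
error = rmse([r["rent"] for r in test], [baseline] * len(test))
print(round(error, 2))
```

Because RMSE squares each error before averaging, large misses weigh more heavily than small ones, so it is usually somewhat larger than the mean absolute error on the same predictions.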

<img src='../plots/model_strength.png' alt="My Image" width="800"/>


While our models made successful predictions, we faced various limitations and had to make several assumptions to overcome them. Firstly, we were restricted to publicly available data, which possibly limited our model’s potential. If we had access to better data (e.g. private data from real estate firms), a more refined model could have been produced. Next, the main historical rent dataset, obtained from the Department of Families, Fairness and Housing (DFFH), listed its suburbs as suburb clusters, which meant that we had to manually map suburbs to each suburb cluster to obtain a more granular dataset.

We also used the centroid (the geographical centre) of each suburb in calculations involving average distance or travel time, to avoid the data privacy concerns of using the actual coordinates of an individual's home without consent. In addition, although we had scraped rental data from domain.com.au, it was deemed unusable because its rental prices were consistently higher than the DFFH data for the same period, so the DFFH data was used instead.

API call limits also heavily constrained daily analysis; a higher quota would have allowed more varied analysis, which may have resulted in more findings. Finally, because we were forecasting, we needed to predict future feature values using ARIMA, which models future values as a linear function of historical data. This assumption may not always hold for the real estate market.


## 5. Key Findings

#### 5.1 'What are the most important internal and external features in predicting rental prices? (This can be at the granularity of the groups’ choosing)'

Feature importance analysis reveals that both internal and external factors play a crucial role in determining rental prices. Internal factors, such as proximity to the CBD, reflect aspects intrinsic to the property, while external factors, like food establishment density, relate to the surrounding environment. Understanding how these features interact allows our model to make more accurate predictions based on a comprehensive assessment of what influences land value.
The graph visually reinforces this, highlighting the most influential predictors in our XGBoost and Random Forest models. Factors like distance to the CBD and proximity to the beach top the list, showcasing the importance of location-based variables. Together, these insights help our model deliver precise rental price predictions, making it valuable for stakeholders seeking data-driven decision-making tools.


<img src='../plots/feature_importance_combined.png' alt="My Image" width="800"/>


#### 5.2 What are the top 10 suburbs with the highest predicted growth rate?

Our model has demonstrated good accuracy in predicting historical growth from 2019-2023: nine of the ten suburbs with the highest actual growth also appear in the model's predicted top ten, and the tenth (Sale) is ranked 11th.

### Suburb Growth Comparison 2019-2023:

| Rank | Suburb        | Actual Growth | Predicted Growth (Predicted Rank) |
|------|---------------|---------------|-----------------------------------|
| 1.   | Sebastopol    | 74.50%        | 73.46% (1)                        |
| 2.   | Wodonga       | 64.67%        | 65.48% (2)                        |
| 3.   | Moe           | 63.29%        | 58.88% (4)                        |
| 4.   | Newborough    | 62.56%        | 58.94% (3)                        |
| 5.   | Maffra        | 59.28%        | 54.39% (8)                        |
| 6.   | Portland      | 58.00%        | 57.34% (6)                        |
| 7.   | Benalla       | 57.23%        | 58.79% (5)                        |
| 8.   | Delacombe     | 57.23%        | 55.63% (7)                        |
| 9.   | Sale          | 52.56%        | 49.65% (11)                       |
| 10.  | Morwell       | 51.30%        | 53.89% (9)                        |


Building on this success, the model's forecast for the 2024-2028 period shows a continuation of similar trends, with strong growth expected in outer Melbourne suburbs. Notably, many of the top 10 forecasted high-growth suburbs are concentrated in the west, both in Melbourne's outer western suburbs and in the regional west, indicating a significant shift towards growth in these regions. The model's performance on the 2019-2023 period gives us reasonable confidence in these forecasts, particularly in its identification of the top 10 growth suburbs.
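
The backtest behind this confidence can be checked directly from the 2019-2023 comparison table. The sketch below copies the table's figures and computes two simple summaries; the choice of summaries (top-10 overlap and mean absolute gap) is ours for illustration and is not necessarily how the project scored itself.

```python
# (actual growth %, predicted growth %, predicted rank), one tuple per row
# of the 2019-2023 comparison table, in order of actual rank.
rows = [
    (74.50, 73.46, 1),
    (64.67, 65.48, 2),
    (63.29, 58.88, 4),
    (62.56, 58.94, 3),
    (59.28, 54.39, 8),
    (58.00, 57.34, 6),
    (57.23, 58.79, 5),
    (57.23, 55.63, 7),
    (52.56, 49.65, 11),
    (51.30, 53.89, 9),
]

# How many of the actual top 10 did the model also place in its top 10?
top10_hits = sum(1 for _, _, rank in rows if rank <= 10)

# Average size of the gap between actual and predicted growth.
mean_abs_gap = sum(abs(actual - predicted) for actual, predicted, _ in rows) / len(rows)

print(f"{top10_hits} of {len(rows)} actual top-10 suburbs are in the predicted top 10")
print(f"mean absolute gap: {mean_abs_gap:.2f} percentage points")
```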

<div style="display: flex; justify-content: space-between;">

<!-- Table on the left side -->
<div style="width: 45%;">
<h3>Predicted Growth for Suburbs:</h3>
<table>
  <tr>
    <th>Rank</th>
    <th>Suburb</th>
    <th>Predicted Growth</th>
  </tr>
  <tr><td>1.</td><td>Cairnlea</td><td>34.01%</td></tr>
  <tr><td>2.</td><td>Taylors Hill</td><td>33.77%</td></tr>
  <tr><td>3.</td><td>Weir Views</td><td>32.86%</td></tr>
  <tr><td>4.</td><td>Plenty</td><td>31.14%</td></tr>
  <tr><td>5.</td><td>Cobblebank</td><td>30.83%</td></tr>
  <tr><td>6.</td><td>Ballarat</td><td>29.61%</td></tr>
  <tr><td>7.</td><td>Taylors Lakes</td><td>28.71%</td></tr>
  <tr><td>8.</td><td>Kilsyth South</td><td>26.61%</td></tr>
  <tr><td>9.</td><td>Buninyong</td><td>22.26%</td></tr>
  <tr><td>10.</td><td>Brookfield</td><td>26.24%</td></tr>
</table>
</div>

<!-- Image on the right side -->
<div style="width: 45%;">
    <img src='../plots/forecastgrowth.png' alt="Image Description" width="800">
</div>

</div>

#### 5.2 'What are the most liveable and affordable suburbs according to your chosen metrics?'

We adapted the Global Liveability Index from the Economist Intelligence Unit (EIU), applying weights at our own discretion to the features relevant to each of its categories (stability, healthcare, culture and environment, education, and infrastructure) and summing the weighted features to form an index of our own. A higher score means higher liveability, and a lower score means lower liveability. Some of the biggest contributors to the score were crime, education, and beach proximity.
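
A weighted-sum index of this kind can be sketched as follows. The feature names, the weights and the assumption that features are already on comparable (for example standardised) scales are all hypothetical; the weights actually used in the project are not listed in this summary.

```python
# Hypothetical weights: negative for features that reduce liveability.
WEIGHTS = {
    "crime_per_person": -2.0,  # stability
    "schools_nearby": 1.0,     # education
    "beach_proximity": 0.5,    # culture and environment
}


def liveability(features, weights=WEIGHTS):
    """Sum each feature value multiplied by its weight."""
    return sum(weight * features[name] for name, weight in weights.items())


# Two made-up suburbs with features on a comparable scale.
suburb_a = {"crime_per_person": 0.5, "schools_nearby": 2.0, "beach_proximity": 1.0}
suburb_b = {"crime_per_person": 3.0, "schools_nearby": 1.0, "beach_proximity": 0.0}

score_a = liveability(suburb_a)
score_b = liveability(suburb_b)
print(score_a, score_b)
```

With negative weights on undesirable features, a single heavily weighted feature such as crime can push a score well below zero, which is consistent with the negative scores in the bottom 10 table.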

<img src='../plots/Summary_liveability.png' alt="My Image" width="800"/>


We find that the top 10 most liveable suburbs, such as Canterbury, Brighton, and Middle Park, are centred around eastern Melbourne, while the least liveable suburbs, such as Bendigo and Geelong, are mostly scattered outside the city and across regional Victoria. The full lists are given below along with their respective liveability scores. Interestingly, many of the suburbs deemed least liveable score poorly because of a high number of drug-related crimes.

<div style="display: flex; justify-content: space-between;">

<div style="width: 45%;">
<h3>Top 10 Most Liveable Suburbs:</h3>
<table>
  <tr>
    <th>Suburb Name</th>
    <th>Liveability Score</th>
  </tr>
  <tr><td>Canterbury (Vic.)</td><td>14.773933</td></tr>
  <tr><td>Brighton (Vic.)</td><td>13.177023</td></tr>
  <tr><td>Middle Park (Vic.)</td><td>12.863727</td></tr>
  <tr><td>Armadale (Vic.)</td><td>12.730818</td></tr>
  <tr><td>McKinnon</td><td>12.727558</td></tr>
  <tr><td>Hawthorn (Vic.)</td><td>12.672912</td></tr>
  <tr><td>Toorak</td><td>12.233129</td></tr>
  <tr><td>Glen Iris (Vic.)</td><td>12.088864</td></tr>
  <tr><td>Ormond</td><td>11.981344</td></tr>
  <tr><td>Caulfield North</td><td>11.308802</td></tr>
</table>
</div>

<div style="width: 45%;">
<h3>Bottom 10 Least Liveable Suburbs:</h3>
<table>
  <tr>
    <th>Suburb Name</th>
    <th>Liveability Score</th>
  </tr>
  <tr><td>Sunshine (Vic.)</td><td>-23.082828</td></tr>
  <tr><td>Seymour (Vic.)</td><td>-23.650013</td></tr>
  <tr><td>Melton (Vic.)</td><td>-23.669923</td></tr>
  <tr><td>Bairnsdale</td><td>-24.691274</td></tr>
  <tr><td>Mildura</td><td>-27.376520</td></tr>
  <tr><td>Ballarat Central</td><td>-30.366898</td></tr>
  <tr><td>Morwell</td><td>-32.793555</td></tr>
  <tr><td>Caulfield East</td><td>-36.055177</td></tr>
  <tr><td>Geelong</td><td>-36.753450</td></tr>
  <tr><td>Bendigo</td><td>-39.830659</td></tr>
</table>
</div>

</div>


#### 5.3 Affordability

Affordability was calculated as the ratio of median weekly family income to adjusted average weekly rent. On this measure, suburbs near the CBD such as Southbank and Docklands have the lowest affordability, as they command a higher weekly rent than other suburbs, while suburbs like Cobblebank and Aintree command a lower weekly rent and are therefore more affordable. Another possible reason these suburbs are more affordable is that they are part of newer affordable housing developments. Interestingly, some of these more affordable suburbs also line up with the suburbs that have the highest growth based on our predictions.
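
A minimal sketch of an income-to-rent ratio, assuming income is the numerator so that a higher score means a more affordable suburb. The function name and inputs are hypothetical, and the scale of the scores in the tables below suggests additional scaling or adjustment that is not described here, so these numbers are illustrative only.

```python
def affordability(median_weekly_income, adjusted_weekly_rent):
    """Income-to-rent ratio: higher values mean rent takes a smaller share of income."""
    if adjusted_weekly_rent <= 0:
        raise ValueError("rent must be positive")
    return median_weekly_income / adjusted_weekly_rent


# Two made-up suburbs with the same income but different rents.
cheaper = affordability(median_weekly_income=2000.0, adjusted_weekly_rent=400.0)
dearer = affordability(median_weekly_income=2000.0, adjusted_weekly_rent=800.0)
print(cheaper, dearer)
```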

<img src='../plots/Summary_affordability_map.png' alt="My Image" width="800"/>


<div style="display: flex; justify-content: space-between;">

<div style="width: 45%;">
<h3>Top 10 Most Affordable Suburbs:</h3>
<table>
  <tr>
    <th>Suburb Name</th>
    <th>Affordability Score</th>
  </tr>
  <tr><td>Cobblebank</td><td>0.089716</td></tr>
  <tr><td>Aintree</td><td>0.088354</td></tr>
  <tr><td>Weir Views</td><td>0.084913</td></tr>
  <tr><td>Strathtulloh</td><td>0.084124</td></tr>
  <tr><td>Kalkallo</td><td>0.077302</td></tr>
  <tr><td>Truganina</td><td>0.077170</td></tr>
  <tr><td>Derrimut</td><td>0.075595</td></tr>
  <tr><td>Burnside Heights</td><td>0.074719</td></tr>
  <tr><td>Manor Lakes</td><td>0.074032</td></tr>
  <tr><td>Lysterfield South</td><td>0.074009</td></tr>
</table>
</div>

<div style="width: 45%;">
<h3>Bottom 10 Least Affordable Suburbs:</h3>
<table>
  <tr>
    <th>Suburb Name</th>
    <th>Affordability Score</th>
  </tr>
  <tr><td>Abbotsford (Vic.)</td><td>0.028743</td></tr>
  <tr><td>East Melbourne</td><td>0.028685</td></tr>
  <tr><td>South Melbourne</td><td>0.028651</td></tr>
  <tr><td>West Melbourne</td><td>0.028018</td></tr>
  <tr><td>Carlton (Vic.)</td><td>0.028002</td></tr>
  <tr><td>St Kilda West</td><td>0.027873</td></tr>
  <tr><td>South Yarra</td><td>0.027245</td></tr>
  <tr><td>Southbank</td><td>0.025384</td></tr>
  <tr><td>Docklands</td><td>0.023707</td></tr>
  <tr><td>Melbourne</td><td>0.021520</td></tr>
</table>
</div>

</div>


#### 5.4 Assumptions
- The Global Liveability Index was an appropriate scoring metric on which to model our own weights.

#### 5.5 Limitations
- The data that contributed to the weightings of the liveability scoring categories (stability, healthcare, culture and environment, education, and infrastructure) was limited to what we scraped. This means we had more data on some categories than others, although we do not believe this had a dramatic effect on the results.

## 6. Conclusion

Finally, based on our findings, we believe these insights can be transformed into valuable consulting services for various stakeholders, particularly online real estate firms. These firms are constantly seeking data-driven insights to inform smarter, strategic decisions, and we are confident they would be willing to pay a premium for services like ours. With our model, they can estimate rental prices and identify growth opportunities in their property portfolios. We could offer customized reports on rental price forecasts, growth potential, and liveability insights, helping firms stay ahead of the market and capitalize on high-growth opportunities.

Given the progress we've made in just six weeks and the accuracy of our model, we believe this project is well worth pursuing. We could begin offering custom reports within 1-3 months and scale to a subscription-based service in 3-6 months by automating our processes. By partnering with real estate firms, we could also gain access to better data, further refining our model and enhancing the accuracy of our insights, creating a mutually beneficial opportunity for both us and our clients.
