# Assumptions & Approach - Group 15

1. **Data Collection and Processing:**

    We sourced data from multiple platforms that were essential in deciding features that could possibly affect the rental price of a property.
    
    * **Rental Property Data from Domain:**
    We scraped data from the [domain.com.au](https://www.domain.com.au) website. This provided us with information on properties currently available on the market, including their availability, pricing, and other features of particular interest to renters, property managers, and investors.
    
    * **Historical Median Rent Pricing:**
    We accessed this data from the [Department of Families, Fairness, and Housing](https://www.dffh.vic.gov.au/), which provided historical median rent prices for different suburbs that we were unable to access from Domain. This was essential for forecasting rental prices, as it allowed us to understand long-term trends.

    * **Crime Data:**
    Crime severity and crime rates were factored in, as safety is a critical concern for both renters and property managers alike. This data was obtained from the [Crime Statistics Agency](https://www.crimestatistics.vic.gov.au/).

    * **Population and Income Data for Demographics:**
    Population and income data was taken from the 2016 and 2021 Census as we wanted to account for the population growth, affordability, and rental demand across different suburbs, which are essential in determining future rental conditions.

    * **Urban Landmarks and Public Transport:**
    Using openrouteservice, we calculated the distances between rental properties and public transport hubs, as well as key urban landmarks such as shopping centers and employment hubs. These factors are crucial for assessing the desirability of rental properties for future tenants. Data for the landmarks and public transport options were sourced from OpenStreetMap.

    * **Schools and Education:**
    The proximity and rankings for educational institutions were considered, as rental demand from students and families wanting to live near a well-regarded school is a significant factor for stakeholders. School data was sourced from Data Victoria and University/TAFE data was sourced from Uni Reviews. 

<br>

2. **Data Imputation and Forecasting:**

    Some suburbs lacked median rental prices for some quarters of a given year, so we imputed those NaN values with the suburb's median rental price for the corresponding year.

    Some datasets, such as population, income and crime rates, lacked future projections, so we extrapolated these values for use in our model. We implemented statistical forecasting techniques such as ARIMA, exponential smoothing, linear interpolation and linear regression to project them. This enables our stakeholders to anticipate changes that may influence rental trends.
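As a minimal sketch of the two steps in this section, the snippet below fills missing quarterly medians with the median of the quarters observed for the same suburb and year, and extrapolates a yearly series with an ordinary least-squares line. The function names, the data layout and all values are hypothetical, and the line fit stands in for only the simplest of the forecasting techniques listed above (ARIMA and exponential smoothing are not shown).

```python
from statistics import median


def impute_quarterly_medians(rents):
    """Fill missing quarters with the median of the observed quarters.

    `rents` maps (suburb, year) -> list of four quarterly medians, with None
    marking a missing quarter. Returns a new dict; the input is not modified.
    """
    filled = {}
    for key, quarters in rents.items():
        observed = [q for q in quarters if q is not None]
        if not observed:
            # No observed quarter to impute from; leave the year untouched.
            filled[key] = list(quarters)
            continue
        fill_value = median(observed)
        filled[key] = [fill_value if q is None else q for q in quarters]
    return filled


def linear_trend_forecast(years, values, future_years):
    """Fit a least-squares line to (years, values) and extrapolate it."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in future_years]


# Hypothetical values for illustration only.
rents = {
    ("Suburb A", 2019): [450, None, 470, 490],
    ("Suburb A", 2020): [None, None, None, None],
}
imputed = impute_quarterly_medians(rents)

population_forecast = linear_trend_forecast(
    [2021, 2022, 2023, 2024], [100, 110, 120, 130], [2025, 2026, 2027]
)
```

A suburb-year with no observed quarters is left unfilled here; how such cases were handled in the project is not stated, so that choice is an assumption of the sketch.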

<br>

3. **Feature Engineering:**

    Feature engineering was a crucial step in preparing the data for modelling and ensuring that key stakeholder priorities were reflected. For example, real estate investors and property managers are highly interested in a property's proximity to public transport as well as crime rates, as these factors significantly influence rental demand. 
    
    **Urban Landmarks**
    - We created features to store the counts of each considered urban landmark present in each suburb. 
    - We also found the mode feature for each category. For example, for places of worship (category), the mode feature for many suburbs was 'Churches'.
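The two landmark features can be sketched as follows, assuming the landmark data is available as (suburb, category, feature) records. The record layout, the function name and the sample rows are hypothetical and only illustrate the count and mode calculations.

```python
from collections import Counter, defaultdict

# Hypothetical landmark records: (suburb, category, feature type).
landmarks = [
    ("Suburb A", "place_of_worship", "Church"),
    ("Suburb A", "place_of_worship", "Church"),
    ("Suburb A", "place_of_worship", "Mosque"),
    ("Suburb A", "shopping", "Supermarket"),
    ("Suburb B", "place_of_worship", "Temple"),
]


def summarise_landmarks(records):
    """Return the landmark count and mode feature per (suburb, category)."""
    counts = defaultdict(Counter)
    for suburb, category, feature in records:
        counts[(suburb, category)][feature] += 1

    summary = {}
    for key, counter in counts.items():
        # most_common(1) returns the single most frequent (feature, count) pair.
        mode_feature, _ = counter.most_common(1)[0]
        summary[key] = {"count": sum(counter.values()), "mode": mode_feature}
    return summary


summary = summarise_landmarks(landmarks)
```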

<br>

4. **Model Fitting:**

   After trialling several models such as XGBoost, lasso regression and linear regression, we found that the random forest model performed the best in terms of our evaluation results (RMSE, MAE & $R^2$). 

   Please note that each unique property type underwent its own separate feature selection and model development process. As a result, a distinct random forest model was developed for each property type.
   
   Before training the random forest model, we applied recursive feature elimination (RFE) with a random forest estimator for feature selection. The model was initially trained on data from 2016 to 2021. We then performed hyperparameter tuning using cross-validation on the selected feature set, with the scoring metric set to 'neg_mean_squared_error', and using data from 2022 to 2024. After identifying the optimal random forest model, we retrained it on the combined training and validation sets (2016-2024). Finally, the model was used to predict median rental prices for the years 2025 to 2027.

   All predictions for each property type and each year were saved into the '/data/curated/predictions' directory. We also saved the mean absolute error for each property type into the same directory in a file called 'mae.csv'
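The fitting procedure described above can be sketched with scikit-learn as shown below. This is an illustration on synthetic data, not our actual pipeline: the feature matrix, the number of features kept by RFE, the hyperparameter grid and the number of folds are all assumptions made for the example, and we read "cross-validation ... using data from 2022 to 2024" as running the search on the validation years.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in data: six features, only the first two drive the target.
X_train = rng.normal(size=(120, 6))  # stands in for the 2016-2021 rows
y_train = 3 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(scale=0.1, size=120)
X_val = rng.normal(size=(60, 6))  # stands in for the 2022-2024 rows
y_val = 3 * X_val[:, 0] - 2 * X_val[:, 1] + rng.normal(scale=0.1, size=60)

# Step 1: recursive feature elimination with a random forest estimator.
selector = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=2,  # hypothetical choice for this example
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_val_sel = selector.transform(X_val)

# Step 2: cross-validated hyperparameter search on the selected features.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},  # hypothetical grid
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_val_sel, y_val)

# Step 3: retrain the best configuration on training and validation data combined.
X_all = np.vstack([X_train_sel, X_val_sel])
y_all = np.concatenate([y_train, y_val])
final_model = RandomForestRegressor(random_state=0, **search.best_params_)
final_model.fit(X_all, y_all)

# Evaluation metrics named in this section, computed in-sample for brevity.
pred = final_model.predict(X_all)
rmse = mean_squared_error(y_all, pred) ** 0.5
mae = mean_absolute_error(y_all, pred)
r2 = r2_score(y_all, pred)
```

The metrics at the end are computed on the same rows the final model was fitted on, so they are optimistic; they are included only to show how RMSE, MAE and $R^2$ are obtained.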

<br>

5. **Predicted Median Rental Price Analysis:**

    In the 'analysis_rental_preictions.ipynb' notebook, one can view our growth rate predictions and Rent Vision Pro tool. 

    The growth rate predictions were calculated using the following formula: 

    $$
    \text{growth\_rate} = \frac{\text{2027 December Median Rental Price} - \text{2024 March Median Rental Price}}{\text{2024 March Median Rental Price}} \times 100
    $$

    Our Rent Vision Pro tool also allows clients to manually specify which suburb and which property type they would like to view the past (2022-2024) and forecasted (2025-2027) median rental prices for. Two graphical visualisations are then presented for the suburb, one showing the median rental prices for flats and the other for houses. 
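As a small worked example of the growth rate formula in this section, the function below computes the percentage change between the March 2024 median and the December 2027 forecast. The function name and the prices are hypothetical.

```python
def growth_rate(price_march_2024, price_december_2027):
    """Percentage growth from the March 2024 median to the December 2027 forecast."""
    if price_march_2024 <= 0:
        raise ValueError("baseline price must be positive")
    return (price_december_2027 - price_march_2024) / price_march_2024 * 100


# Hypothetical prices for illustration only: a rise from $500 to $575 per week.
example_growth = growth_rate(500, 575)
```

With these illustrative prices, the growth rate is about 15%.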



---

## Approach

The main goal of our project was to forecast rental prices across Victoria for the next 3 years. Our approach was tailored to meet the needs of our primary stakeholders, which included potential renters, real estate investors, property managers, and urban planners. We recognized that these groups have a significant interest in rental price predictions to support informed decision-making related to livelihood, investment, development, and urban infrastructure planning. With this in mind, we took deliberate steps to ensure that our modelling addressed their specific concerns.

----

## Assumptions

### General Assumption
- Internal features of all the rental properties (number of baths, number of parking spaces etc.) advertised in Victoria were scraped from the domain.com.au website. We then calculated the median of each internal feature for each suburb. These medians were then generalised to be the median internal features for each suburb irrespective of the year.

- From domain.com.au, we scraped all the rental properties advertised in Victoria during the first week of September 2024. Since some data was missing in the rental history dataset from the Department of Families, Fairness and Housing, specifically in the September 2024 column, we used our scraped data to calculate and impute the median rental price for September for each suburb. We assumed that this would be reflective of the median rental price for that quarter.

- SA2 levels were assumed to be equal to suburb names. This was a generalisation we made to avoid issues with inconsistent data reporting across regions and to streamline the analysis. However, we do acknowledge that SA2 boundaries may encompass multiple suburbs and may not perfectly align with the suburb names.


### Education

1. School density representing education quality: we assume that a higher density of education institutions (primary, secondary and tertiary) indicates better educational opportunities, regardless of the actual quality or performance of those institutions.

2. Relatively static analysis over time: our analysis is based on data from 2023, making the assumption that the educational landscape remains mostly stable. We presume that key factors such as funding, policies, and infrastructure development will not experience significant changes during the analysis period. 

3. Influence of non-academic factors: it is assumed that residents of the suburb have full access to schools within that suburb. That is, external factors such as catchment zones and private versus public school policies are not taken into consideration.

### Domain Data

1. We considered all properties in terms of weekly rent
    * Monthly to weekly rent: multiply by 84 and divide by 365 
        * Monthly rent is calculated as follows: weekly rent divided by 7 (days) x 365 (days) divided by 12 (months)
    * Yearly to weekly rent: multiply by 7 and divide by 365.

2. We assumed that the minimum monthly rent should be $500; anything below that is most likely a car space or storage option.


3. We assumed the maximum weekly rent would be $5000 and removed those above that.
    * If the rental price does not specify weekly or monthly, and if it is less than $5000, we assume that it is weekly. 
    * As a result we removed properties that rented 'by the season'. 

4. We determine whether or not a property is furnished by the description or the initial feature list. From the feature list we only say it's furnished if it's verified (i.e. no asterisk). This is due to the nature of the data from the domain website itself, wherein features with * are said to be unverified. 


5. We assumed that if the feature list did not specify 'Pets Allowed', then a property would not allow pets.


6. When a listing specifies prices for both unfurnished and furnished options, we take the higher number and treat it as the furnished price (that is, we assume that furnished properties are always more expensive than unfurnished ones).


7. For weekly rent that specified a range (e.g. 650 - 700 a week), we took the average of the two numbers. 


8. We also removed properties that rented per night, as we are only considering properties that are rented for long-term stays.


9. We decided to map different property types into just 2 categories, House and Apartment. 
    * Houses: House, Townhouse, Villa, New House & Land, Semi-Detached, Duplex, Terrace
    * Apartments: Apartment / Unit / Flat, Studio, New Apartments / Off the Plan


10. We removed properties that do not specify rental price, and do not have any information on the number of rooms (bed and bath). We also removed properties with 0 beds or 0 baths. 


11. We removed properties that had more than 12 parking spaces. We manually inspected the properties with more than 5 parking spaces and found that most were still within reasonable ranges, whereas those with more than 12 parking spaces were outliers.


12. For rental prices that had no dollar sign and were less than 5000, we assumed a weekly rate.
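The rent-related rules above can be sketched as a small normalisation routine. This is not our scraping code: the function names and period labels are hypothetical, and applying the $500 monthly minimum to every listing through its monthly equivalent is an assumption made for the example.

```python
import re

MAX_WEEKLY_RENT = 5000  # listings above this weekly rent are removed
MIN_MONTHLY_RENT = 500  # listings below this monthly rent are removed


def parse_amount(text):
    """Extract a rent amount from listing text, averaging a stated range."""
    numbers = [
        float(n.replace(",", ""))
        for n in re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    ]
    if not numbers:
        return None  # no price given, so the listing is dropped
    if len(numbers) >= 2:
        return (numbers[0] + numbers[1]) / 2  # e.g. "650 - 700" becomes 675
    return numbers[0]


def to_weekly(amount, period=None):
    """Convert a rent amount to weekly rent, or return None if excluded."""
    if period in ("night", "season"):
        return None  # short-term and seasonal rentals are not considered
    if period == "month":
        weekly = amount * 84 / 365
    elif period == "year":
        weekly = amount * 7 / 365
    elif period in ("week", None):
        weekly = amount  # an unspecified period is assumed to be weekly
    else:
        raise ValueError(f"unknown period: {period!r}")

    # Assumption: the monthly minimum is checked on the monthly equivalent.
    monthly_equivalent = weekly * 365 / 84
    if weekly > MAX_WEEKLY_RENT or monthly_equivalent < MIN_MONTHLY_RENT:
        return None
    return weekly


range_price = parse_amount("650 - 700 a week")
monthly_as_weekly = to_weekly(3650, "month")
yearly_as_weekly = to_weekly(36500, "year")
```

Here `parse_amount` reads every number in the text as a price, so it would need refining for descriptions that contain other figures.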

### Income

We interpolated the income data between the 2016 and 2021 census data to get the data for 2017 to 2020, assuming a linear trend.
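A minimal sketch of this linear interpolation is shown below. The function name and the income figures are hypothetical; only the two census years come from the text above.

```python
def interpolate_linear(year_start, value_start, year_end, value_end):
    """Linearly interpolate values for the years strictly between two endpoints."""
    step = (value_end - value_start) / (year_end - year_start)
    return {
        year: value_start + step * (year - year_start)
        for year in range(year_start + 1, year_end)
    }


# Hypothetical census incomes for one suburb, for illustration only.
incomes = interpolate_linear(2016, 1000.0, 2021, 1250.0)
```

With these illustrative endpoints the yearly step is 50, giving 1050, 1100, 1150 and 1200 for 2017 to 2020.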

### Modelling

We assumed that our forecasted features of income, population and crime for 2025-2027 are reflective of the data that we may expect in those years. Hence, we used this forecasted data to train our model and predict median rental prices for 2025-2027.

----

## Limitations

### General Limitations 
- Broader economic factors were not taken into consideration.
- Our use of openrouteservice was constrained by its limit on API calls.

### Education
1. Ranking criteria differences across school stages: ranking primary and secondary schools together was challenging due to the differences in their educational objectives and outcomes. A secondary school's ranking might influence real estate more significantly due to its relevance to ATAR/VCE scores, while primary schools often lack performance-based rankings. Tertiary institutions are often highly selective and profession-specific in the courses they offer, making it difficult to rank them due to the diversity of the content provided. Hence, education density was used to identify top suburbs instead.

2. Quality vs. Quantity dilemma: the use of "education density" measures the quantity of schools but fails to account for the quality of education provided by each institution. A suburb with many low-performing schools could appear more favorable than one with fewer but higher-performing schools.