# **Real Estate Industry Project**



## **Data Collection and Pre-processing**

The first step in our analysis was to scrape rental listings from domain.com.au. We then carried out basic preprocessing: joining each listing to its corresponding SA2 zone, removing outliers, imputing missing values (primarily for beds, baths and parking spaces) and extracting prices from inconsistently formatted listings using RegEx.
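
As an illustration of the RegEx step, the sketch below pulls a dollar amount out of free-text price strings. The function name `extract_price`, the pattern and the sample strings are assumptions made for illustration; the listing formats and rules actually used in the project may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: a dollar amount such as "$550", "$1,200" or "$480.50".
PRICE_PATTERN = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?")


def extract_price(text: str) -> Optional[float]:
    """Return the first dollar amount found in `text`, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")  # strip thousands separators
    cents = match.group(2) or "0"
    return float(f"{whole}.{cents}")


# Illustrative (made-up) listing strings.
samples = ["$550 per week", "From $1,200 pw", "$480.50 weekly", "Contact agent"]
prices = [extract_price(s) for s in samples]
print(prices)
```

Listings with no recoverable price come back as `None`, so they can be dropped or imputed in a later step. This sketch does not convert between weekly and monthly rents.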

## **Contextual Data Collection**
In parallel, we gathered data on various socio-economic factors, utilities and services to provide context for each property's location within its Statistical Area Level 2 (SA2) zone. This data gave a detailed picture of each property's surroundings. The data explored covered public transport (train stations), schools, income, parks, crime and shopping centres.


## **Question 1:**
### What are the most important internal and external factors affecting rental prices?

To answer this question, we trained two machine learning models: a Random Forest Regressor and XGBoost. We started from our final dataset, in which we combined various features relating to public transport (train stations), schools, income, parks, crime and shopping centres. We then ran a correlation analysis to check for linear relationships between features, identified pairs of features that were highly correlated with one another (Pearson correlation coefficient > 0.9) and used this information to remove redundant features. After fitting both models on the reduced feature set, we took each model's feature importance ranking and averaged the importances across the two models. The 10 features with the highest average importance are the ones we identify as most important for predicting rental prices.
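
The two steps described above, pruning highly correlated features and averaging importances across models, can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names, the toy data and the use of the absolute correlation (so strong negative correlations also count as redundant) are all hypothetical. In practice the two importance series would typically come from the `feature_importances_` attribute of each fitted model.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation exceeds `threshold`."""
    corr = features.corr(method="pearson").abs()
    # Keep only the upper triangle so that each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


def top_features_by_average(rf_importance: pd.Series, xgb_importance: pd.Series, n: int = 10) -> pd.DataFrame:
    """Average two importance series (indexed by feature name) and keep the top `n`."""
    combined = pd.DataFrame({"rf": rf_importance, "xgb": xgb_importance})
    combined["avg_importance"] = combined[["rf", "xgb"]].mean(axis=1)
    top = combined.sort_values("avg_importance", ascending=False).head(n)
    return top.rename_axis("Feature").reset_index()[["Feature", "avg_importance"]]


# Toy data (made up): `rooms` is exactly 2 * `beds`, so the pair is perfectly correlated.
demo = pd.DataFrame({
    "beds": [1, 2, 3, 4, 5],
    "rooms": [2, 4, 6, 8, 10],
    "dist_cbd_km": [12.0, 3.5, 8.0, 20.0, 5.5],
})
pruned = drop_highly_correlated(demo)

# Made-up importances standing in for the two fitted models.
rf = pd.Series({"beds": 0.6, "dist_cbd_km": 0.2, "median_income": 0.2})
xgb = pd.Series({"beds": 0.4, "dist_cbd_km": 0.4, "median_income": 0.2})
top_2 = top_features_by_average(rf, xgb, n=2)
```

The returned DataFrame has a `Feature` column and an `avg_importance` column, which are the columns the plotting code in this notebook reads from `top_10_features`.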




```python
import matplotlib.pyplot as plt

# `top_10_features` is assumed to be a DataFrame from the modelling step,
# with a 'Feature' column and an 'avg_importance' column.
plt.figure(figsize=(10, 6))
plt.barh(top_10_features['Feature'], top_10_features['avg_importance'], color='teal')
plt.xlabel('Average Importance')
plt.ylabel('Feature')
plt.title('Feature Importance (Average)')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.tight_layout()

# Show the plot
plt.show()
```

![Top 10 Features Importance](../plots/top_10_features.jpg)

The top 10 most important features from our modelling fall into three key categories:
- Location of the property
- Property structure, specifically its size
- Suburb population demographics

Based on these insights, we have a few key recommendations to share.

For renters:
- Understand what you can afford by comparing your earnings with the income levels of the population in the area.
- Consider rental price growth: if you plan to stay long-term, aim for suburbs with slower rental price increases, which our further modelling can help identify.
- Balance location and affordability by deciding which trade-offs between the two you are willing to make.

For investors: 
- Unlike renters, target suburbs with high rent growth. When deciding where to invest, consider factors such as proximity to the CBD, the nearest schools and the size of the property, in line with your investment capacity.

For policymakers:
- If the goal is to maintain a healthy and growing rental market, one approach is to invest in school infrastructure, as this can help drive growth in areas with potential.


## **Question 2:**

### Where are the most liveable and affordable suburbs?

The liveability and affordability metrics are each displayed as a pie chart (insert pie charts).

The liveability metrics, on the left, comprise six key elements. Mobility has the largest share, reflecting how important it is to be close to public transport and the CBD. Next is Safety, which is based on crime rates. Community Amenities covers the availability of green spaces, shopping centres and entertainment facilities.

The affordability metrics, on the right, consist of four factors. The Income-to-Price Ratio is significant because it compares a person's income with their housing expenses; the higher the ratio, the more affordable the housing.
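
To make the idea of a weighted index concrete, here is a minimal sketch of one way such an index could be computed. It assumes that each component is min-max normalised and then combined as a weighted sum, and that the Income-to-Price Ratio is income divided by rent. The function names, the weights and the three example suburbs are hypothetical and are not the values behind the pie charts.

```python
from typing import Dict, List


def income_to_price_ratio(weekly_income: float, weekly_rent: float) -> float:
    """Income divided by rent: a higher value means rent takes a smaller share of income."""
    return weekly_income / weekly_rent


def min_max(values: List[float]) -> List[float]:
    """Rescale values to the range [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def weighted_index(components: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of min-max normalised components, giving one score per suburb."""
    total_weight = sum(weights.values())
    scaled = {name: min_max(vals) for name, vals in components.items()}
    n = len(next(iter(components.values())))
    return [
        sum(weights[name] * scaled[name][i] for name in components) / total_weight
        for i in range(n)
    ]


# Made-up scores for three suburbs; every component is oriented so that higher is better
# (for example, safety would be derived from crime rates with the sign flipped).
components = {
    "mobility": [0.2, 0.8, 0.5],
    "safety": [0.9, 0.4, 0.6],
    "amenities": [10, 30, 20],
}
# Hypothetical weights, with mobility given the largest share.
weights = {"mobility": 0.5, "safety": 0.3, "amenities": 0.2}
liveability = weighted_index(components, weights)

ratio = income_to_price_ratio(weekly_income=1500.0, weekly_rent=500.0)
```

Normalising first keeps components measured in different units (counts, rates, distances) from dominating the index purely because of their scale.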

The geospatial visualisation shows the liveability and affordability indexes across Victoria: the liveability index distribution is on the left and the affordability index distribution is on the right. As the colour bar indicates, lighter colours represent higher index values. Both indexes vary across the region.


## **Key Assumptions:**

### 1) Independence:
We've assumed that each property in the dataset is independent, meaning that the features of one property do not affect those of the others. Furthermore, we have assumed that the various features of an individual property are independent of one another.

### 2) Linear Relationships: 
We've performed correlation analysis under the assumption that the relationships between features are linear. However, we acknowledge that this approach may overlook non-linear relationships in the data.
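
A small, self-contained example shows why this matters. The data below is synthetic and the helper `pearson` is written out only for illustration: a feature that is fully determined by another through a quadratic relationship can still have a Pearson correlation of zero, so a correlation filter would not flag the pair.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = list(range(-5, 6))
linear = [2 * v + 1 for v in x]   # perfectly linear in x
quadratic = [v ** 2 for v in x]   # fully determined by x, but not linear

r_linear = pearson(x, linear)        # close to 1
r_quadratic = pearson(x, quadratic)  # close to 0 despite the exact dependence
print(r_linear, r_quadratic)
```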

### 3) Absence of Major Macroeconomic Factors:
In our model, we’ve assumed that there won’t be any major disruptions, such as economic downturns or events like COVID-19, when predicting future rental growth.

### 4) Data Representativeness: 
We’ve assumed that the dataset is a good representation of the broader property market, ensuring that our model's predictions are generalisable to other properties beyond our dataset.



## **Limitations:**

### 1) Historical growth data limited to a few suburbs

**Impact:**
- Cyclical patterns over months and years are missed.
- Decreased accuracy in long-term predictions.

**For Future:**
- Acquire historical data through different vendors.
- Perform feature engineering using similar suburb data determined by more comprehensive datasets.


### 2) Government-provided dataset found online is incomplete and does not provide all details up to the present

**Impact:**
- Biased predictions that favour suburbs with complete information.

**For Future:**
- Manually populate missing data if required.
- Perform feature engineering using similar suburb data determined by more comprehensive datasets.
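
One possible reading of "feature engineering using similar suburb data" is to fill a suburb's missing value with the average from its most similar suburbs. The sketch below is only that: an assumption about how the idea might be implemented. The function name `impute_from_similar`, the similarity measure (Euclidean distance on fully observed features) and the choice of `k` are hypothetical, and the suburb data is made up.

```python
import math
from typing import Dict, List, Optional


def impute_from_similar(
    profiles: Dict[str, List[float]],
    values: Dict[str, Optional[float]],
    k: int = 2,
) -> Dict[str, Optional[float]]:
    """Fill each missing value with the mean of the k most similar suburbs that have one.

    `profiles` maps suburb name to a vector of fully observed features used for similarity;
    `values` maps suburb name to the (possibly missing) quantity to impute.
    """
    filled = dict(values)
    donors = [name for name, v in values.items() if v is not None]
    for name, v in values.items():
        if v is not None:
            continue
        # Rank donor suburbs by Euclidean distance between feature vectors.
        nearest = sorted(donors, key=lambda d: math.dist(profiles[name], profiles[d]))[:k]
        if nearest:
            filled[name] = sum(values[d] for d in nearest) / len(nearest)
    return filled


# Made-up suburbs: D is missing a value and is closest to A and B.
profiles = {"A": [1.0, 1.0], "B": [1.2, 0.9], "C": [5.0, 5.0], "D": [1.1, 1.1]}
values = {"A": 400.0, "B": 420.0, "C": 900.0, "D": None}
filled = impute_from_similar(profiles, values, k=2)
```

For the distances to be meaningful, the features in `profiles` would need to be on comparable scales, for example after normalisation.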
