# **Real Estate Industry Project**



## **Data Collection and Pre-processing**

The first step in our analysis was to scrape rental listings from domain.com.au. We then carried out basic preprocessing: joining each listing to its corresponding SA2 zone, removing outliers, imputing missing values (primarily for beds, baths and parking spaces) and extracting prices from inconsistently formatted listings using regular expressions (RegEx).
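
As a rough illustration of the RegEx price extraction step, the sketch below pulls the first dollar amount out of a free-text price field. The pattern, the function name `extract_price` and the sample strings are assumptions made for this example; they are not the expressions used in the project.

```python
import re

# Optional "$", then digits with optional thousands separators and optional cents.
PRICE_PATTERN = re.compile(r"\$?\s*(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{1,2}))?")


def extract_price(text):
    """Return the first numeric amount found in a listing's price text, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None  # e.g. "Contact agent" carries no usable price
    whole = match.group(1).replace(",", "")
    cents = match.group(2) or "0"
    return float(f"{whole}.{cents}")


# Hypothetical price strings, for illustration only.
for raw in ["$450 per week", "$1,200pw", "From $399.50 weekly", "Contact agent"]:
    print(raw, "->", extract_price(raw))
```

Note that this sketch simply takes the first number it finds, so a string such as "2 bed unit $450" would need a stricter pattern (for example, one that requires the dollar sign).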

## **Contextual Data Collection**
In parallel, we gathered data on various socio-economic factors, utilities and services to provide context for each property’s location within its Statistical Area Level 2 (SA2). This data gave a detailed picture of each property's surroundings. The data explored included public transport (train stations), schools, income, parks, crime and shopping centres.


## **Question 1:**
### What are the most important internal and external factors for predicting rental prices?

To answer this question, we trained two machine learning models: a Random Forest Regressor and XGBoost.
We started with our final dataset, which combines features relating to public transport (train stations), schools, income, parks, crime and shopping centres. We first performed correlation analysis to check for linear relationships between features, identified pairs of features that were highly correlated with one another (Pearson correlation coefficient > 0.9), and used this information to remove redundant features. After this preprocessing, we fitted both models and extracted a feature importance ranking from each. Finally, we averaged each feature's importance across the two models and took the 10 features with the highest average importance as the most important features for predicting rental prices.
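
The two steps described above (dropping one feature from each highly correlated pair, then averaging importance scores across the two models) can be sketched in plain Python as follows. This is a minimal sketch under our own assumptions: the function names, the toy feature columns and the importance values are hypothetical, and we apply the 0.9 threshold to the absolute correlation, which the text does not specify. In a real notebook, pandas' `DataFrame.corr()` and each fitted model's `feature_importances_` attribute would normally supply these numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        # Assumption: absolute correlation, so strong negative pairs are also pruned.
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def average_importance(rf_importance, xgb_importance, top_n=10):
    """Rank features by the mean of the two models' importance scores."""
    shared = rf_importance.keys() & xgb_importance.keys()
    averaged = {f: (rf_importance[f] + xgb_importance[f]) / 2 for f in shared}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical toy data: the two distance columns are perfectly correlated.
columns = {
    "dist_cbd_km": [1, 2, 3, 4, 5],
    "dist_cbd_m": [1000, 2000, 3000, 4000, 5000],
    "beds": [2, 1, 3, 2, 1],
}
print(drop_correlated(columns))  # the redundant "dist_cbd_m" column is dropped

# Hypothetical importance scores from the two models.
rf_scores = {"dist_cbd_km": 0.6, "beds": 0.4}
xgb_scores = {"dist_cbd_km": 0.5, "beds": 0.5}
print(average_importance(rf_scores, xgb_scores))
```

Averaging only makes sense when both models report importance on a comparable scale, so it is worth checking that each set of scores is normalised before combining them.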




In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.barh(top_10_features['Feature'], top_10_features['avg_importance'], color='teal')
plt.xlabel('Average Importance')
plt.ylabel('Feature')
plt.title('Feature Importance (Average)')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.tight_layout()

# Show the plot
plt.show()

![Top 10 Features Importance](../plots/top_10_features.jpg)

The top 10 features from our modelling fall into three key categories:
- Location of the property
- Property structure, specifically its size
- Suburb population demographics

Based on these insights, we have a few key recommendations.

For renters:
- Understand your affordability by comparing your earnings with the income levels of the population in the area.
- Consider rental price growth: if you plan to stay long-term, aim for suburbs with slower rental price increases, which our further modelling can help identify.
- Balance location and affordability by weighing the trade-offs you are willing to make between the two.

For investors: 
- Unlike renters, target suburbs with high rent growth. When deciding where to invest, consider factors such as proximity to the CBD, the nearest schools and the size of the property, in line with your investment capacity.

For policymakers:
- If the goal is to maintain a healthy and growing rental market, one approach is to invest in school infrastructure, as this can help drive growth in areas with potential.


## **Question 2:**

### Where are the most liveable and affordable suburbs?

The liveability and affordability metrics are displayed in the following two pie charts.

<img src="../plots/pie_chart.png" alt="Metrics Pie Chart" width="900" />

### Liveability Metrics
On the left, we have the *liveability metrics*, which consist of six key elements. Each segment highlights an essential aspect of what makes a place enjoyable and sustainable to live in. 
- **Mobility** holds the largest share at 35%, emphasizing the importance of transportation options and accessibility to the Central Business District (CBD). This metric reflects how well residents can move around the city, impacting their overall quality of life.
- **Safety** accounts for 20%, highlighting the need for secure neighborhoods as measured by crime rates, specifically incidents involving loss of property. A safer environment contributes significantly to residents' peace of mind and overall satisfaction with their living conditions.
- **Community Amenities** contributes 15%, focusing on the availability of green spaces, shopping centers, and entertainment facilities. Access to these amenities enhances residents' recreational options and promotes community engagement.
- **Healthcare** also takes up 15%, indicating the proximity and availability of local hospitals and medical facilities. Access to healthcare services is crucial for residents' well-being and emergency care.
- **Education** is next at 10%, emphasizing the number of independent schools available in the area. Quality educational options are vital for families and can significantly influence a neighborhood's attractiveness.
- **Price** accounts for 5%, underscoring how housing costs impact liveability. This metric highlights the financial aspect of living in an area and its effect on residents' ability to maintain a good quality of life.


### Affordability Metrics
On the right, we have the *affordability metrics*, consisting of four factors.
- **Income-to-Price Ratio (IPR)** stands out with 40%, indicating the relationship between income levels and housing costs. A higher ratio suggests that residents are less burdened by housing expenses, making it easier for them to afford their homes.
- **Housing Pressure Index (HPI)** follows at 30%, measuring the stress on housing markets. This index helps assess whether the demand for housing exceeds the available supply, which can lead to increased prices and reduced affordability. 
- **Population Density Inverse (PDI)** accounts for 20%. This metric reflects the inverse of population density; lower density can imply more spacious living conditions and potentially more affordable housing options.
- **Gini Inequality Inverse (GII)** contributes 10%, underscoring economic disparity in housing access. A higher GII value indicates greater income equality, suggesting that more individuals have a fair chance to access affordable housing.

In summary, these metrics help us understand the balance between what makes a community liveable and what makes it affordable, guiding our efforts in urban planning and policy development.
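
The weightings above can be combined into a single score per suburb with a weighted sum. The sketch below is one plausible way to do this and is not necessarily the calculation used in the project: it assumes every metric has already been scaled to the range 0 to 1 and oriented so that higher is better (for example, housing pressure inverted). The dictionary keys and the example suburb's scores are made up for illustration; only the weights come from the pie charts.

```python
# Weights taken from the liveability and affordability pie charts.
LIVEABILITY_WEIGHTS = {
    "mobility": 0.35,
    "safety": 0.20,
    "community_amenities": 0.15,
    "healthcare": 0.15,
    "education": 0.10,
    "price": 0.05,
}
AFFORDABILITY_WEIGHTS = {"ipr": 0.40, "hpi": 0.30, "pdi": 0.20, "gii": 0.10}


def weighted_index(scores, weights):
    """Weighted sum of metric scores; assumes each score is scaled to [0, 1]."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing metric scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical suburb with made-up, pre-scaled scores.
example_liveability_scores = {
    "mobility": 0.8,
    "safety": 0.6,
    "community_amenities": 0.5,
    "healthcare": 0.7,
    "education": 0.4,
    "price": 0.3,
}
liveability = weighted_index(example_liveability_scores, LIVEABILITY_WEIGHTS)
print(f"Liveability index: {liveability:.3f}")
```

Because each set of weights sums to 1, the resulting index stays within the same 0 to 1 range as the input scores, which keeps suburbs directly comparable.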

### Geospatial Visualization
Next, we look at the geospatial visualization of the liveability and affordability indexes. On the left, we have the *liveability index distribution in Victoria*, while the right displays the *affordability index distribution*. As the color bar beside each map shows, lighter colors signify higher index values. Both distributions show significant variation across the region.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
import geopandas as gpd
import contextily as ctx
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.ticker import FormatStrFormatter

def plot_map_with_colorbar(sf, column, ax, cmap_name='viridis', n_colors=8, title="", crs_string="", alpha=0.5, bounds=None):
    '''
    Plots a geospatial map with a discrete color bar on the side.

    Parameters:
        sf: The GeoDataFrame containing the spatial data to be plotted
        column: The name of the column in `sf` to be visualized on the map
        ax: The matplotlib axis object on which the map will be plotted
        cmap_name: The name of the colormap to use (default is 'viridis')
        n_colors: The number of discrete colors in the colormap (default is 8)
        title: The title of the map (default is an empty string)
        crs_string: The Coordinate Reference System (CRS) as a string for adding the basemap (default is "")
        alpha: The transparency level for the map elements (default is 0.5)
        bounds: The boundaries for the color bar (default is None)

    Returns:
        ax: The axis object with the plotted map
        cbar: The color bar associated with the plot
    '''
    if bounds is None:
        bounds = np.linspace(sf[column].min(), sf[column].max(), n_colors + 1)
    
    cmap = plt.get_cmap(cmap_name, n_colors)
    norm = BoundaryNorm(bounds, cmap.N)

    im = sf.plot(column=column, ax=ax, legend=False, edgecolor='black', alpha=alpha, cmap=cmap, norm=norm)

    ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron, crs=crs_string, alpha=1, attribution=False)

    ax.set_title(title, weight="bold", fontsize=14)
    ax.set_xlabel("")  
    ax.set_ylabel("")  
    
    ax.tick_params(axis='both', which='major', labelsize=7)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="3%", pad=0.1)
    cbar = plt.colorbar(im.collections[0], cax=cax, norm=norm, boundaries=bounds, ticks=bounds)
    cbar.ax.yaxis.set_major_formatter(FormatStrFormatter('%.1f'))

    cbar.outline.set_linewidth(0)
    cbar.ax.tick_params(size=0)  

    return ax, cbar

<img src="../plots/avg_liveability_affordability_metropolitan_plot_solid.png" alt="Liveability and Affordability Index Maps" width="1100" />

The map below shows the combined index of liveability and affordability.

<img src="../plots/combined_index_metropolitan_plot_solid.png" alt="Combined Liveability and Affordability Index Map" width="900" />


## **Key Assumptions:**

### 1) Independence:
We've assumed that each property in the dataset is independent, implying that the features of one property do not affect those of the others. Furthermore, we have assumed that the various features of an individual property are independent of one another.

### 2) Linear Relationships: 
We've performed correlation analysis under the assumption that the relationships between features are linear. However, we acknowledge that this approach may overlook non-linear relationships in the data.
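
A small worked example shows why this matters. In the sketch below (toy numbers of our own, not project data), `y` is completely determined by `x`, yet the Pearson correlation is zero because the relationship is quadratic rather than linear.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]
y_linear = [2 * value + 1 for value in x]   # linear in x
y_quadratic = [value ** 2 for value in x]   # fully determined by x, but not linearly

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print(f"linear: {r_linear:.2f}, quadratic: {r_quadratic:.2f}")
```

A correlation filter would therefore treat `x` and `y_quadratic` as unrelated, which is the kind of non-linear dependence this assumption can miss.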

### 3) Absence of Major Macroeconomic Disruptions:
In our model, we’ve assumed that there won’t be any major disruptions, such as economic downturns or events like COVID-19, when predicting future rental growth.

### 4) Data Representativeness: 
We’ve assumed that the dataset is a good representation of the broader property market, ensuring that our model's predictions are generalisable to other properties beyond our dataset.



## **Limitations:**

### 1) Historical growth data limited to a few suburbs

**Impact:**
- Cyclical patterns over months and years are missed.
- Decreased accuracy in long-term predictions.

**For Future:**
- Acquire historical data through different vendors.
- Perform feature engineering using similar suburb data determined by more comprehensive datasets.


### 2) Government-provided datasets found online are incomplete and not fully up to date

**Impact:**
- Biased predictions that favour suburbs with complete information.

**For Future:**
- Manually populate missing data where required.
- Perform feature engineering using similar suburb data determined by more comprehensive datasets.
