# OFFICIAL SUMMARY NOTEBOOK

## Abstract

In this project, we scraped roughly 11k rental property listings from domain.com.au (about 6k remained after preprocessing) and combined them with point-of-interest (POI) data queried from an API to answer the three fundamental questions. We ran statistical tests on the scraped data, together with external data such as crime rate and income, to determine which features were relevant to our model. We used the POI data to gain insight into what makes certain properties valuable, but it offers no predictive value because POI counts do not fluctuate much over time.

We fit linear models to the dataset and used correlation metrics to identify useful features. Unfortunately, only income was found to correlate with rent price, so our model was not very accurate. However, the model still showed a general trend for the future, which allowed us to answer the question of predictive growth.

For livability, we used POI data to create a metric based on external reports of what Victorians consider to be signs that a place is livable. For affordability, we used income data and rent prices in each SA2 area to estimate the percentage of salary that would go to rent. From external reports, we found that most Australians are only willing to spend up to 30% of their salary on rent, so we consider any SA2 area whose estimated percentage falls below that threshold to be affordable.
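The affordability rule can be illustrated with a minimal sketch. The area names, column names, and dollar figures below are hypothetical placeholders rather than values from our dataset; only the 30% threshold comes from the text above.

```python
import pandas as pd

# Hypothetical per-SA2 figures: names, columns and values are illustrative only.
sa2 = pd.DataFrame({
    "sa2_name": ["Area A", "Area B", "Area C"],
    "median_weekly_rent": [400.0, 550.0, 300.0],
    "median_weekly_income": [1600.0, 1500.0, 900.0],
})

AFFORDABILITY_THRESHOLD = 0.30  # the 30% rule described in the abstract

# Estimated share of income that goes to rent in each area.
sa2["rent_to_income"] = sa2["median_weekly_rent"] / sa2["median_weekly_income"]

# An area counts as affordable when the share is below the threshold.
sa2["affordable"] = sa2["rent_to_income"] < AFFORDABILITY_THRESHOLD

print(sa2[["sa2_name", "rent_to_income", "affordable"]])
```

In this toy data only the first area falls under the threshold (400 / 1600 = 25%).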

## How to navigate the notebook

Please run the code cells under 'Preliminary Code' in the next section. They run the skeleton notebook we compiled, which defines all the variables required to demonstrate our results and findings. Please hide these cells to avoid overflowing output. They may take a while to run; thank you for your patience.

Once that is done, please continue to the 'Analysis and Presentation of findings' section, where we will walk you through the internal and external feature analysis and modelling, as well as our forecasts and key findings.

(Please run code cells where necessary to view specific results and visuals)

## Preliminary Code

In [4]:
# import packages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%run summary_notebook.ipynb

  model = cd_fast.enet_coordinate_descent(

SyntaxError: invalid syntax (<unknown>, line 1)


## Analysis and Presentation of findings

### Property Internal features analysis

In this section we examine how internal property features, such as property type and the number of bedrooms, bathrooms and parking spaces, correlate with rental price. Relevant features will be selected for modelling in the next stage.

Please run the cell to view the matrix of Pearson's correlation coefficients between the response variable and the numerical internal variables we investigated (scraped from the original listing).

In [2]:
# correlation matrix output
corr.style.background_gradient(cmap='coolwarm')

NameError: name 'corr' is not defined

The output shows only weak correlation between rental prices and neighbourhood demographics (the last four columns, which give the percentage of people from each age group within the property's neighbourhood). The number of parking spaces is also only weakly correlated with rent price (<0.2). The numbers of bedrooms and bathrooms are both quite strongly correlated with rent cost, but they are even more strongly correlated with each other. To avoid modelling two collinear predictors, we keep only one of them: the number of bathrooms, which is the variable most strongly correlated with rent cost.
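As a rough illustration of this selection step, the sketch below builds a Pearson correlation matrix on synthetic data and ranks the features by their correlation with rent. The data is randomly generated, and the names `listings`, `corr_demo`, `beds` and `parking` are assumptions made for the example; only `baths` and `cost_text` appear in our model output.

```python
import numpy as np
import pandas as pd

# Synthetic listings, generated only to illustrate the procedure.
rng = np.random.default_rng(0)
n = 500
beds = rng.integers(1, 6, size=n)
# Bathrooms are constructed to track bedrooms, which mimics collinearity.
baths = np.clip(np.round(beds / 2 + rng.normal(0, 0.5, size=n)), 1, None)
parking = rng.integers(0, 3, size=n)
cost_text = 200 + 80 * beds + 120 * baths + rng.normal(0, 150, size=n)

listings = pd.DataFrame(
    {"beds": beds, "baths": baths, "parking": parking, "cost_text": cost_text}
)

# Pairwise Pearson correlation coefficients between all numeric columns.
corr_demo = listings.corr(method="pearson")

# Rank candidate features by their correlation with the response.
rent_corr = corr_demo["cost_text"].drop("cost_text").sort_values(ascending=False)
print(rent_corr)

# A high value here signals that beds and baths carry overlapping information.
print(corr_demo.loc["beds", "baths"])
```

When two candidate features are highly correlated with each other, keeping the one with the larger correlation to the response is one simple way to reduce redundancy in a linear model.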

Pearson's correlation cannot be used for categorical variables, of which we have two: property type and a boolean flag for whether the property is shared. Hence we fit a simple linear model of rent price on the number of bathrooms and these two variables to allow further investigation. Please run the cell below for a summary of this model.

In [19]:
fit_OLS.summary() # model summary

| | | | |
|---|---|---|---|
| Dep. Variable: | cost_text | R-squared: | 0.19 |
| Model: | OLS | Adj. R-squared: | 0.189 |
| Method: | Least Squares | F-statistic: | 128.7 |
| Date: | Mon, 10 Oct 2022 | Prob (F-statistic): | 5.41e-290 |
| Time: | 13:59:57 | Log-Likelihood: | -45628.0 |
| No. Observations: | 6583 | AIC: | 91280.0 |
| Df Residuals: | 6570 | BIC: | 91370.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 436.2214 | 111.535 | 3.911 | 0.000 | 217.577 | 654.866 |
| share_flag[T.1] | -98.9554 | 28.306 | -3.496 | 0.000 | -154.444 | -43.467 |
| property_type[T.Apartment / Unit / Flat] | -200.3164 | 111.110 | -1.803 | 0.071 | -418.129 | 17.496 |
| property_type[T.Duplex] | -344.3286 | 207.486 | -1.660 | 0.097 | -751.068 | 62.411 |
| property_type[T.House] | -220.8326 | 111.023 | -1.989 | 0.047 | -438.474 | -3.191 |
| property_type[T.New Apartments / Off the Plan] | -261.5250 | 207.475 | -1.261 | 0.208 | -668.243 | 145.193 |
| property_type[T.Rural] | 166.7071 | 271.604 | 0.614 | 0.539 | -365.725 | 699.140 |
| property_type[T.Semi-Detached] | -0.9357 | 150.209 | -0.006 | 0.995 | -295.394 | 293.522 |
| property_type[T.Studio] | -368.7591 | 113.854 | -3.239 | 0.001 | -591.950 | -145.568 |

| | | | |
|---|---|---|---|
| Omnibus: | 7612.628 | Durbin-Watson: | 0.338 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1266862.916 |
| Skew: | 5.873 | Prob(JB): | 0.0 |
| Kurtosis: | 69.938 | Cond. No. | 246.0 |


The results indicate that the predictor variables explain around 19% of the variation in rent prices. We will also conduct an ANOVA test for statistical significance. Please run the next cell to view the test results.

In [20]:
anova_table # anova table for model

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| share_flag | 1.0 | 1300651.0 | 1300651.0 | 21.158082 | 4.308443e-06 |
| property_type | 10.0 | 12968730.0 | 1296873.0 | 21.096616 | 3.84559e-39 |
| baths | 1.0 | 80658020.0 | 80658020.0 | 1312.087931 | 4.147619e-262 |
| Residual | 6570.0 | 403877800.0 | 61473.03 | | |


The p-values of all three variables indicate that they are significant predictors of rent price at the 0.05 level. However, the R^2 value computed previously indicates that, on their own, these internal features are too simplistic to explain rent prices well; rent could fluctuate due to a variety of complex factors that we have not considered here.
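As a consistency check, the R^2 value can be recovered from the ANOVA table itself. The sketch below copies the sums of squares from the table and assumes they are sequential (Type I), so that the terms and the residual add up to the total sum of squares.

```python
# Sums of squares copied from the ANOVA table above.
sum_sq = {
    "share_flag": 1300651.0,
    "property_type": 12968730.0,
    "baths": 80658020.0,
}
ss_residual = 403877800.0

# Under the Type I assumption, the total is the sum of all terms plus residual.
ss_total = sum(sum_sq.values()) + ss_residual

# R^2 = 1 - SS_residual / SS_total
r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))

# Share of total variation attributed to each term, in table order.
shares = {name: ss / ss_total for name, ss in sum_sq.items()}
for name, share in shares.items():
    print(name, round(share, 3))
```

The result agrees with the R-squared of about 0.19 reported in the model summary, and most of the explained variation is attributed to the number of bathrooms.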