In [1]:
import pandas as pd

The walkthrough of this project can be summarised in the following four stages:
- Data Retrieval 
- Data Preprocessing 
- Feature Engineering
- Problem Solving

## Data Retrieval:
Three methods are used to obtain the data. The raw data is stored under "data/raw".
#### Web Scraping: 
Scraping www.domain.com.au yielded around 9000 rental records. These cover the internal features of each property:
- rental price
- number of bedrooms/bathrooms/parking
- size
- address
- location coordinates
- type (house, studio, apartment, etc.)

Some external features are also obtained by web scraping because no ready-made dataset exists:
- location of shopping centres (by suburb)
- distance of properties to CBD
- Suburb-SA2-LGA mapping (since data are recorded under different scales)

#### Direct Download: 
Some datasets are already available online in a machine-readable format and can be downloaded directly:
- location of hospitals (by suburb)
- location of entertainment facilities (by suburb)
- population record from 2001~2021 (by SA2)
- income record from 2005~2018 (by SA2)
- location of schools (by suburb)
- crime rate (by LGA)
- location of train stations (by suburb)
- median rental price from 2000~2021 (by suburb)

_Note that the last two datasets were added to the data folder manually rather than downloaded by script, because no direct download link was available._

#### API: 
We requested a collaborative key from openrouteservice and used it to obtain the distance from each property to:
- Melbourne CBD
- nearest school
- nearest shopping centre
- nearest entertainment facility
- nearest train station
- nearest hospital
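
The notebooks obtain these distances from the openrouteservice API, and that call is not reproduced here. As a rough stand-in, the sketch below swaps in straight-line (haversine) distance to show the "nearest facility per property" step. The coordinates and the names `haversine_km` and `nearest_facility` are hypothetical, and straight-line distance will generally be shorter than a route distance.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_facility(prop, facilities):
    """Return (name, distance_km) of the facility closest to the property."""
    return min(
        ((name, haversine_km(prop[0], prop[1], lat, lon))
         for name, (lat, lon) in facilities.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical coordinates, for illustration only
property_coord = (-37.8000, 144.9600)
stations = {"station_a": (-37.8100, 144.9600), "station_b": (-37.9000, 145.0000)}
name, dist = nearest_facility(property_coord, stations)
```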



## Preprocessing:
#### Approach:
- remove missing values 
- extract useful attributes from the original dataset
- detect and remove outliers

The preprocessed dataset is stored under the "data/curated" folder.
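
A minimal sketch of the three preprocessing steps, assuming pandas and an interquartile-range rule for outliers. The source does not say which outlier rule was used, so the IQR rule, the function name `preprocess` and the toy data are assumptions; only `weekly_cost` is a column name that appears in this report.

```python
import pandas as pd


def preprocess(df, price_col="weekly_cost", keep_cols=None, k=1.5):
    """Keep selected columns, drop rows with missing values, remove IQR outliers."""
    if keep_cols is not None:
        df = df[keep_cols]
    df = df.dropna()
    q1, q3 = df[price_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Assumed rule: keep prices within k * IQR of the quartiles
    in_range = df[price_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[in_range].reset_index(drop=True)


# Toy data, for illustration only
raw = pd.DataFrame({
    "weekly_cost": [400, 450, 420, 480, None, 9000],
    "beds": [2, 3, 2, 3, 1, 2],
    "agent": ["a", "b", "c", "d", "e", "f"],
})
clean = preprocess(raw, keep_cols=["weekly_cost", "beds"])
```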
#### Difficulties and Assumptions:
1. **Difficulty**: We cannot directly scrape a property's suburb from www.domain.com.au.<br>
 **Solution**: A regular expression extracts the suburb from the raw address. <br> 
 **Flaw**: Due to time and technical limitations, a very small portion (around 0.5%) of suburbs are extracted incorrectly.<br>
 **Evaluation**: Although imperfect, this step is necessary because the suburb is needed to analyse rental prices by suburb.
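
The extraction could look roughly like the sketch below. It assumes addresses end in the form `<street part>, <Suburb> VIC <postcode>`. That format, the pattern and the name `extract_suburb` are assumptions, not the notebook's actual expression.

```python
import re

# Assumed address format: "<street part>, <Suburb> VIC <4-digit postcode>"
SUBURB_RE = re.compile(r",\s*([A-Za-z' -]+?)\s+VIC\s+\d{4}\s*$", re.IGNORECASE)


def extract_suburb(address):
    """Return the upper-cased suburb, or None if the address does not match."""
    match = SUBURB_RE.search(address)
    return match.group(1).strip().upper() if match else None
```

Addresses that do not follow the assumed format return `None`, which makes the small share of failed extractions easy to count and inspect.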

## Feature Engineering:
#### Approach:
- Group all data to suburb scale (detailed strategy described in the notebooks)
- Derive new attributes that help our research, for example counting the number of schools in each suburb and calculating distances.
- Do further preprocessing to prepare the data for the three questions
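
As an illustration of grouping to suburb scale and deriving a count attribute, here is a small pandas sketch. The column names (`suburb`, `school`, `num_schools`, `median_weekly_cost`) and the toy rows are hypothetical; the notebooks describe the actual strategy.

```python
import pandas as pd

# Toy inputs, for illustration only
schools = pd.DataFrame({
    "school": ["s1", "s2", "s3"],
    "suburb": ["CARLTON", "CARLTON", "RICHMOND"],
})
rentals = pd.DataFrame({
    "suburb": ["CARLTON", "CARLTON", "RICHMOND", "DROMANA"],
    "weekly_cost": [400, 500, 450, 380],
})

# Count schools per suburb
num_schools = schools.groupby("suburb").size().rename("num_schools")

# Aggregate rentals to suburb scale, then attach the count (0 where no school is listed)
suburb_features = (
    rentals.groupby("suburb")["weekly_cost"].median().rename("median_weekly_cost").to_frame()
    .join(num_schools, how="left")
    .fillna({"num_schools": 0})
    .astype({"num_schools": int})
)
```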

#### Difficulties and Assumptions:

1. **Difficulty**: The datasets come from different sources and cover different numbers of suburbs. For example, around 600 suburbs are extracted from the rental properties and around 2000 from the income dataset. <br>
 **Solution**: Since the objective is rental analysis, we use the suburbs in the rental price dataset as the standard. If a dataset lacks a standard suburb, we estimate its value; if a dataset contains suburbs outside the standard, we discard them. <br>
 **Flaw**: Since only around 600 suburbs are recorded in the rental price dataset, our later research cannot cover most of Victoria, as shown in the geographic visualisations. <br>
 **Evaluation**: If we preserved all suburbs, there would be many missing values across all datasets, and we would need to estimate all of them. There is therefore a trade-off between accuracy of analysis and completeness of data. We assume accuracy is more important in a business context and made this decision accordingly.
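
The alignment rule can be sketched as a reindex against the standard suburb list. The source does not say how missing suburbs are estimated at this stage, so filling with the dataset mean is an assumption, as are the name `align_to_standard` and the toy values.

```python
import pandas as pd


def align_to_standard(df, standard_suburbs, value_col):
    """Keep only standard suburbs; fill standard suburbs missing from df.

    Assumption: a missing suburb is estimated by the mean of the dataset.
    """
    aligned = df.set_index("suburb")[value_col].reindex(standard_suburbs)
    return aligned.fillna(df[value_col].mean())


# Toy data, for illustration only
standard = ["CARLTON", "DROMANA", "RICHMOND"]
income = pd.DataFrame({
    "suburb": ["CARLTON", "RICHMOND", "GEELONG"],
    "income": [60000.0, 70000.0, 50000.0],
})
aligned_income = align_to_standard(income, standard, "income")
```

Here GEELONG is discarded because it is outside the standard list, and DROMANA receives the estimated value.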


## Problem Solving:
### Q1 Predicting Rental Price:
#### Approach:
1. Plot the distributions and correlation of attributes
2. Apply data transformations
3. Add dummy variables for categorical attributes
4. Use backward elimination to build a statistical model using OLS
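
Step 4 can be sketched as follows. This is a numpy-only illustration, not the notebook's code: p-values use a normal approximation to the t distribution (reasonable with thousands of rows, as here), and the names `ols_pvalues` and `backward_eliminate`, the 0.05 threshold and the synthetic data are assumptions.

```python
import math

import numpy as np


def ols_pvalues(X, y):
    """Fit OLS and return (coefficients, approximate two-sided p-values)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = beta / np.sqrt(np.diag(cov))
    # Normal approximation to the t distribution (assumes many rows)
    p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])
    return beta, p


def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all remaining have p < alpha.

    Column 0 of X is assumed to be the intercept and is never dropped.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        _, p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p[1:])) + 1  # skip the intercept
        if p[worst] < alpha:
            break
        keep.pop(worst)
    return [names[i] for i in keep]


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n = 500
beds = rng.integers(1, 5, n).astype(float)
baths = rng.integers(1, 4, n).astype(float)
noise_col = rng.normal(size=n)
X = np.column_stack([np.ones(n), beds, baths, noise_col])
y = 200 + 50 * beds + 20 * baths + rng.normal(0, 10, n)
selected = backward_eliminate(X, y, ["(Intercept)", "beds", "baths", "noise"])
```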

In [2]:
# Final fitted model
# Prob (F-statistic) is 0.00, so the model is significant overall, though R-squared is only 0.210
with open("../models/OLSsummary.txt") as f:
    contents = f.read()
    print(contents)

                            OLS Regression Results                            
Dep. Variable:            weekly_cost   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     240.4
Date:                Mon, 17 Oct 2022   Prob (F-statistic):               0.00
Time:                        19:35:52   Log-Likelihood:                -58923.
No. Observations:                9082   AIC:                         1.179e+05
Df Residuals:                    9071   BIC:                         1.179e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
(Intercept)               

In [3]:
# Simplified result, with variable transformed back
q1result = pd.read_csv('../data/curated/q1_result.csv', index_col=0)
q1result.round(2)

| Unnamed: 0 | Estimate |
|---|---|
| (Intercept) | 216.8 |
| Apartment / Unit / Flat | 34.37 |
| House | 72.2 |
| Studio | -74.01 |
| Townhouse | 115.4 |
| Villa | 68.85 |
| Beds | 46.07 |
| Baths | 19.51 |
| numStation_1km | 4.96 |
| numShopping_3km | 4.29 |


#### Text Summary:
- The base price for a property is 216.80 dollars per week
- The price increases or decreases depending on the property type
- More bedrooms and more bathrooms both increase the rental price
- Stations, shopping centres and entertainment facilities in the surrounding area increase the price
- The further a property is from the CBD, the lower the price

#### Difficulties and Assumptions:
1. **Difficulty**: The log transformation produces implausible estimates, for example a coefficient of 5 for cbd_route.<br>
 **Solution**: We traced the problem to the transformation and used a square root transformation instead. <br>
 **Flaw**: The log transformation gives better overall distributions because the variables become more normal-like, which suits the assumptions of a linear model. <br>
 **Evaluation**: We could not determine the exact cause of the coefficient problem under the log transformation; we suspect it is because the range of the log function spans negative to positive infinity. The coefficients are more reasonable with the square root transformation.

#### Related Plots:
"The Number of Property.html": choropleth of the number of rental properties in each SA2 region, see plots folder

### Q2 Suburb Growth Rate:
#### Approach:
1. Use rental price, population and income to measure suburb growth
2. Estimate missing values using the state average for that time point
3. Build a time series model using autoregression, using data from the earliest recorded time up to 2018 for consistency
4. Predict these values for each suburb in 2025 and rank them
5. Use the average rank across the three attributes as the final suburb ranking
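
Steps 3 and 4 can be illustrated with a small autoregressive fit by least squares. The source does not state the lag order or the library used, so this numpy version, the names `fit_ar` and `forecast_ar`, and the toy series are assumptions.

```python
import numpy as np


def fit_ar(series, lags):
    """Least-squares fit of y[t] = c + a1*y[t-1] + ... + a_lags*y[t-lags]."""
    y = np.asarray(series, dtype=float)
    rows = [np.concatenate(([1.0], y[t - lags:t][::-1])) for t in range(lags, len(y))]
    coef, *_ = np.linalg.lstsq(np.array(rows), y[lags:], rcond=None)
    return coef


def forecast_ar(series, coef, steps):
    """Iterate the fitted recursion forward by `steps` periods."""
    lags = len(coef) - 1
    history = list(np.asarray(series, dtype=float))
    for _ in range(steps):
        recent = history[-lags:][::-1]  # most recent value first
        history.append(coef[0] + float(np.dot(coef[1:], recent)))
    return history[len(series):]


# Toy yearly series for 2005..2018 (14 points), for illustration only
observed = [100 + 10 * i for i in range(14)]
coef = fit_ar(observed, lags=1)
forecast = forecast_ar(observed, coef, steps=7)  # 2019..2025
```

With only 14 points per series, each extra lag removes a training row, which is one reason long forecasts from such a model should be read with caution.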

In [4]:
# Result
suburbRank = pd.read_csv('../data/curated/suburb_ranking.csv')
# Top 10 suburbs
suburbRank[:10]

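| Unnamed: 0 | Suburb | rental rank | income rank | population rank | average rank |
|---|---|---|---|---|---|
| 0 | DROMANA | 11 | 29 | 10 | 16.666667 |
| 1 | PORTSEA | 12 | 54 | 16 | 27.333333 |
| 2 | ASPENDALE | 24 | 16 | 50 | 30.0 |
| 3 | GROVEDALE | 54 | 34 | 3 | 30.333333 |
| 4 | WERRIBEE | 83 | 10 | 2 | 31.666667 |
| 5 | PORT MELBOURNE | 96 | 1 | 1 | 32.666667 |
| 6 | CHELSEA | 23 | 13 | 62 | 32.666667 |
| 7 | SPOTSWOOD | 19 | 31 | 54 | 34.666667 |
| 8 | NEWPORT | 20 | 30 | 55 | 35.0 |
| 9 | TORQUAY | 6 | 91 | 12 | 36.333333 |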


In [5]:
# There is also another version of the result showing the growth rate for each field
# We forecast over a 7-year period, but the 5-year growth is more accurate because it extrapolates less far
suburbGrowth = pd.read_csv('../data/curated/suburb_growth_5_years.csv')
suburbGrowth[:10]

| Unnamed: 0 | Suburb | % rental price Growth in 5 years | % income Growth in 5 years | % population Growth in 5 years | average |
|---|---|---|---|---|---|
| 0 | PORT MELBOURNE | 0.042947 | 0.453935 | 599.293951 | 199.930277 |
| 1 | WERRIBEE | 0.048609 | 0.186259 | 1.491042 | 0.575304 |
| 2 | GROVEDALE | 0.068275 | 0.134769 | 0.756235 | 0.31976 |
| 3 | ALFREDTON | 0.070253 | 0.093947 | 0.492408 | 0.218869 |
| 4 | DOCKLANDS | -0.043949 | 0.006437 | 0.647795 | 0.203428 |
| 5 | DELACOMBE | 0.057713 | 0.128732 | 0.389239 | 0.191895 |
| 6 | BARWON HEADS | 0.045979 | 0.117344 | 0.390832 | 0.184718 |
| 7 | WARRAGUL | 0.233166 | 0.097255 | 0.218381 | 0.182934 |
| 8 | DROMANA | 0.114168 | 0.145394 | 0.276231 | 0.178598 |
| 9 | TORQUAY | 0.152949 | 0.099504 | 0.256099 | 0.169517 |


These rankings differ considerably from the suburb ranking above; the next section discusses why.

#### Difficulties and Assumptions:
1. **Difficulty**: Because some attribute values are very small at the start of the series, some suburb growth rates are unreasonable, for example the population growth rate of Port Melbourne.<br>
 **Solution**: Use the average rank across attributes as the suburb rank, instead of calculating an overall growth rate. <br>
 **Flaw**: The differences between suburbs become smaller, so some fast-growing suburbs do not stand out.<br>
 **Evaluation**: This decision is necessary. As described above, Port Melbourne's predicted growth is massive, and one unreasonable prediction overwhelms all other attributes. Our method reduces the effect of such outliers.
 
2. **Difficulty**: Some attributes cover too short a time period, for example only 14 time points from 2005 to 2018, which makes it difficult to tune reliable model hyperparameters.<br>
 **Solution**: We were unable to solve this problem. We did our best to choose a hyperparameter by computing the mean squared error between the test split and the prediction. <br>
 **Flaw**: The problem is not solved, and the overall accuracy of prediction decreases.<br>
 **Evaluation**: This problem stems from the limited data available online. We also considered fitting a simple regression, which needs no hyperparameter tuning, but we think a time series model is more reliable overall because the data is time-based.
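
To illustrate the first difficulty, the sketch below compares averaging growth rates with averaging ranks. The values are rounded from the 5-year growth table above for three suburbs only, so the ranks shown are relative to these three rows and differ from the full ranking. Ranking the growth rates directly is an assumption about how the ranks were formed.

```python
import pandas as pd

# Growth values rounded from the 5-year growth table above (three suburbs only)
growth = pd.DataFrame(
    {
        "rental": [0.043, 0.114, 0.153],
        "income": [0.454, 0.145, 0.100],
        "population": [599.294, 0.276, 0.256],
    },
    index=["PORT MELBOURNE", "DROMANA", "TORQUAY"],
)

# Averaging raw growth lets one extreme value dominate
avg_growth = growth.mean(axis=1)

# Ranking each attribute first (1 = fastest growth) bounds that influence
avg_rank = growth.rank(ascending=False).mean(axis=1)
```

The average growth of Port Melbourne is driven almost entirely by its population figure, while the average ranks of the three suburbs stay within one rank of each other.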

#### Related Plots:
"growth_rate_Map.html": growth rates of the top 10 suburbs from our prediction \
"Income Growth Rate in 5 years.html" \
"Population Growth Rate in 5 years.html" \
"Rental Price Growth Rate in 5 years.html" \
"Dromana_rental_growth.png": Example growth rate of fastest-growing suburb

### Q3 Suburb Liveability/Affordability:
#### Approach - Liveability:
1. Use the number of shopping centres, population, crime rate, number of train stations, number of entertainment facilities, number of hospitals and number of schools in the suburb to measure liveability.
2. Normalise each attribute to the same scale.
3. Assign a weight to each attribute according to its importance; the weights are custom-defined.
4. The score of each suburb is the sum of its subscores over all attributes, where each subscore is calculated from the suburb's attribute value and the mean value of that attribute across all suburbs.
5. Sort the final scores of the suburbs to obtain the ranking.
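
One possible reading of steps 2 to 5 is a weighted sum of mean-normalised attributes. The exact subscore formula and the weights are not given in this report, so the formula (weight times value divided by the attribute mean), the negative weight on crime rate, the column names and the toy values are all assumptions.

```python
import pandas as pd


def liveability_scores(features, weights):
    """Weighted sum of mean-normalised attributes; higher means more liveable."""
    relative = features / features.mean()  # each attribute relative to its mean
    w = pd.Series(weights)
    return (relative[w.index] * w).sum(axis=1).sort_values(ascending=False)


# Toy data and hypothetical weights, for illustration only
features = pd.DataFrame(
    {
        "num_schools": [4, 2, 0],
        "num_stations": [2, 1, 0],
        "crime_rate": [0.02, 0.01, 0.03],
    },
    index=["A", "B", "C"],
)
weights = {"num_schools": 2.0, "num_stations": 1.0, "crime_rate": -1.0}
scores = liveability_scores(features, weights)
```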

#### Approach - Affordability:
1. Affordability = rental price / income; the result represents the proportion of income spent on rent each year
2. The lower the score, the more money people can save, and therefore the higher the affordability
3. Take the inverse of the score to follow the convention that a higher score means higher affordability
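
The three steps amount to the small calculation sketched below. It assumes rent is weekly and income is annual, so rent is multiplied by 52 before dividing. Whether the notebooks apply that conversion is not stated here, and the function name and toy values are hypothetical.

```python
import pandas as pd

WEEKS_PER_YEAR = 52  # assumption: rent is weekly, income is annual


def affordability(median_weekly_rent, annual_income):
    """Inverse of the share of income spent on rent; higher means more affordable."""
    rent_share = median_weekly_rent * WEEKS_PER_YEAR / annual_income
    return (1 / rent_share).sort_values(ascending=False)


# Toy data, for illustration only
rent = pd.Series({"A": 500.0, "B": 400.0})
income = pd.Series({"A": 52000.0, "B": 83200.0})
scores = affordability(rent, income)
```

In this toy case suburb A spends half of income on rent (score 2) and suburb B a quarter (score 4), so B ranks as more affordable.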

In [6]:
# Top 10 suburbs with the highest liveability
liveability = pd.read_csv("../data/curated/Liveability_rank.csv", index_col=0)
liveability[:10]

| Unnamed: 0 | Suburb | Liveability |
|---|---|---|
| 0 | FRANKSTON | 32.732122 |
| 1 | FOOTSCRAY | 26.074431 |
| 2 | CAMBERWELL | 19.12437 |
| 3 | DANDENONG | 18.127342 |
| 4 | MORNINGTON | 17.775018 |
| 5 | WODONGA | 17.247748 |
| 6 | RICHMOND | 17.012713 |
| 7 | BENDIGO | 16.560723 |
| 8 | BERWICK | 16.372245 |
| 9 | RINGWOOD | 15.638218 |


In [7]:
# Top 10 suburbs with the highest affordability
affordability = pd.read_csv("../data/curated/Affordability_rank.csv", index_col=0)
affordability[:10]

| Unnamed: 0 | Suburb | Affordability |
|---|---|---|
| 0 | TOORAK | 366.807767 |
| 1 | EAST MELBOURNE | 293.622222 |
| 2 | HAWTHORN | 267.6 |
| 3 | MALVERN | 265.315556 |
| 4 | GLEN IRIS | 248.733333 |
| 5 | ARMADALE | 248.064444 |
| 6 | MALVERN EAST | 242.078571 |
| 7 | ALBERT PARK | 240.824 |
| 8 | MIDDLE PARK | 240.824 |
| 9 | IVANHOE EAST | 239.909677 |


#### Difficulties and Assumptions:
1. **Difficulty**: Different attributes may affect liveability differently.<br>
 **Solution**: Use custom-defined weights to balance the importance of the attributes.<br>
 **Flaw**: The weights are not exact and rest on subjective decisions.<br>
 **Evaluation**: We discussed the weights among ourselves and interviewed people about how much weight they would give each attribute, which helps make the weights more accurate.
 
2. **Difficulty**: There is no commodity price data to help us rank affordability.<br>
 **Solution**: Use the median rental price to represent commodity prices. <br>
 **Flaw**: Rental price can be affected by many factors (discussed in part 1), so the relationship is not intuitive enough.<br>
 **Evaluation**: Using rental price may reduce the accuracy of our measurements. However, the rental price comes from a reliable source, and the sample size should be large enough to reduce the effect of any individual property.

#### Related Plots:
"affordable_Map.html": top affordable suburbs \
"liveable_Map.html": top liveable suburbs