In [1]:
import pandas as pd

The walkthrough of this project can be summarised in the following four stages:
- Data Retrieval 
- Data Preprocessing 
- Feature Engineering
- Problem Solving

## Data Retrieval:
Three methods are used to obtain the data. The raw data is stored under "data/raw".
#### Web Scraping: 
By scraping www.domain.com.au, we obtained around 9000 rental records. These cover all internal features of a property:
- rental price
- number of bedrooms/bathrooms/parking
- size
- address
- location coordinates
- type (house, studio, apartment, etc.)

Some external features were also obtained by web scraping because no ready-made dataset exists for them:
- location of shopping centres (by suburb)
- distance of properties to CBD
- Suburb-SA2-LGA mapping (since data are recorded under different scales)

#### Direct Download: 
Some datasets are already available online in a machine-readable format, so we download them directly. These include:
- location of hospitals (by suburb)
- location of entertainment facilities (by suburb)
- population record from 2001~2021 (by SA2)
- income record from 2005~2018 (by SA2)
- location of schools (by suburb)
- crime rate (by LGA)
- location of train stations (by suburb)
- median rental price from 2000~2021 (by suburb)

_Note that the last two datasets were added to the data folder manually rather than downloaded by script, because no direct download link was available._

#### API: 
We requested a collaborative key from openrouteservice and obtained the distance from each property to:
- Melbourne CBD
- nearest school
- nearest shopping centre
- nearest entertainment facility
- nearest train station
- nearest hospital
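
To illustrate what "nearest" means in the list above, here is a minimal sketch that picks the closest facility by straight-line (haversine) distance. This is a stand-in for illustration only: it does not call openrouteservice, and a straight-line distance is generally shorter than a routed one. The coordinates and the names `haversine_km`, `nearest`, `station_a` and `station_b` are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest(prop, facilities):
    """Return (name, distance_km) of the facility closest to `prop`."""
    distances = (
        (name, haversine_km(prop[0], prop[1], lat, lon))
        for name, (lat, lon) in facilities.items()
    )
    return min(distances, key=lambda pair: pair[1])

# Hypothetical coordinates, not taken from the project data
property_location = (-37.80, 144.96)
stations = {
    "station_a": (-37.81, 144.97),
    "station_b": (-37.90, 145.10),
}

name, dist = nearest(property_location, stations)
print(name, round(dist, 2))
```

The same `nearest` helper could be applied once per facility type (schools, hospitals and so on) to fill one distance column per type.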



## Preprocessing:
In this stage, we mainly did the following:
- remove missing values
- extract useful attributes from the original datasets
- detect and remove outliers

The preprocessed datasets are stored under the "data/curated" folder.
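
The three preprocessing steps can be sketched with pandas as below. The column names and values are hypothetical, and the 1.5 × IQR rule is only one common way to flag outliers; the notebooks may use a different criterion.

```python
import pandas as pd

# Hypothetical raw listings; column names are assumptions
raw = pd.DataFrame({
    "price": [450, 520, None, 480, 9000, 500],
    "beds": [2, 3, 2, None, 3, 2],
    "description": ["a", "b", "c", "d", "e", "f"],
})

# 1. Remove rows with missing values
df = raw.dropna()

# 2. Keep only the attributes needed downstream
df = df[["price", "beds"]]

# 3. Drop rows whose price lies outside 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].reset_index(drop=True)

print(clean)
```

In this toy data the two rows with missing values and the 9000 listing are removed, leaving three rows.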

## Feature Engineering:
In this stage, we mainly did the following:
- Group all data at suburb scale (the detailed strategy and assumptions are described in the notebooks)
- Derive new attributes that are helpful for our research, for example counting the number of schools in each suburb and calculating distances
- Do further preprocessing to make the data usable for the three questions
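
As an example of a derived attribute, the sketch below counts schools per suburb and joins the count onto the property records. The frames and the column name `numSchool` are hypothetical (the name is modelled on `numStation_1km` from the results table).

```python
import pandas as pd

# Hypothetical inputs; real column names may differ
schools = pd.DataFrame({
    "school": ["s1", "s2", "s3"],
    "suburb": ["CARLTON", "CARLTON", "KEW"],
})
properties = pd.DataFrame({
    "address": ["1 A St", "2 B St", "3 C St"],
    "suburb": ["CARLTON", "KEW", "RICHMOND"],
})

# Count schools in each suburb
num_schools = (
    schools.groupby("suburb").size().rename("numSchool").reset_index()
)

# Left join so that suburbs without any school are kept, with a count of 0
features = properties.merge(num_schools, on="suburb", how="left")
features["numSchool"] = features["numSchool"].fillna(0).astype(int)

print(features)
```

A left join is used so that properties in suburbs with no recorded school are not dropped.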

## Problem Solving:
#### Q1:
- Plot the distributions and correlation of attributes
- Apply data transformations
- Use stepwise selection to build a statistical model
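
Stepwise selection can be sketched as follows. The summary does not state the direction or the criterion used, so this sketch assumes forward selection by AIC on an ordinary least squares fit; the data is synthetic and the function names are hypothetical.

```python
import numpy as np

def aic(y, X):
    """AIC of an OLS fit with Gaussian errors, up to an additive constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(y, features):
    """Greedily add the feature that lowers AIC most; stop when none does."""
    intercept = np.ones((len(y), 1))
    selected, remaining = [], list(features)
    best = aic(y, intercept)
    while remaining:
        scores = {}
        for name in remaining:
            cols = [features[c] for c in selected + [name]]
            scores[name] = aic(y, np.column_stack([intercept] + cols))
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break  # no remaining feature improves the model
        best = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected

# Synthetic data: price depends on beds and baths, not on `noise`
rng = np.random.default_rng(0)
features = {
    "beds": rng.integers(1, 5, size=200).astype(float),
    "baths": rng.integers(1, 3, size=200).astype(float),
    "noise": rng.normal(size=200),
}
price = (200 + 46 * features["beds"] + 20 * features["baths"]
         + rng.normal(scale=10, size=200))

selected = forward_stepwise(price, features)
print(selected)
```

Features are added in order of how much they improve the fit, so the strongest predictor enters first.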

In [10]:
# Result
q1result = pd.read_csv('../data/curated/q1_result.csv',  index_col=0)
q1result.round(2)

| Term | Estimate |
|---|---|
| (Intercept) | 216.8 |
| Apartment / Unit / Flat | 34.37 |
| House | 72.2 |
| Studio | -74.01 |
| Townhouse | 115.4 |
| Villa | 68.85 |
| Beds | 46.07 |
| Baths | 19.51 |
| numStation_1km | 4.96 |
| numShopping_3km | 4.29 |


- The base price of a property (the model intercept) is 216.8039 dollars
- The price increases or decreases depending on the property type
- More bedrooms and more bathrooms both increase the rental price
- Stations, shopping centres and entertainment facilities in the surrounding area increase the price
- The farther a property is from the CBD, the lower its price
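
To make the reading of the estimates concrete, the sketch below adds up the rounded coefficients from the table for one hypothetical listing. It assumes the model is purely additive on the price scale and uses only the terms shown in the table, so treat the result as an illustration of how the coefficients combine rather than as an actual prediction.

```python
# Rounded estimates copied from the Q1 result table
coef = {
    "(Intercept)": 216.8,
    "House": 72.2,
    "Beds": 46.07,
    "Baths": 19.51,
    "numStation_1km": 4.96,
    "numShopping_3km": 4.29,
}

# Hypothetical listing: a 3-bed, 2-bath house with one station
# within 1 km and two shopping centres within 3 km
listing = {
    "House": 1,
    "Beds": 3,
    "Baths": 2,
    "numStation_1km": 1,
    "numShopping_3km": 2,
}

# Intercept plus coefficient * value for every term of the listing
estimate = coef["(Intercept)"] + sum(coef[k] * v for k, v in listing.items())
print(round(estimate, 2))  # 479.77
```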

The number of rental properties is plotted as a choropleth map; see the plots folder.

#### Q2:
- Use rental price, population, income to measure suburb growth
- Build a time series model using autoregression, with data from the earliest recorded time up to 2018 for consistency
- Predict values of these data for each suburb in 2025 and rank them
- Use the average rank of three attributes as the final ranking of suburb
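
The forecasting step can be sketched as below. The summary does not give the model order, so this assumes an AR(1) model with intercept, fitted by least squares and iterated forward from 2018 to 2025. The series is synthetic and the function names are hypothetical.

```python
import numpy as np

def fit_ar1(series):
    """Least squares fit of y_t = c + phi * y_{t-1}; returns (c, phi)."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return c, phi

def forecast(series, steps):
    """Iterate the fitted AR(1) recursion `steps` periods past the last value."""
    c, phi = fit_ar1(series)
    value = float(series[-1])
    for _ in range(steps):
        value = c + phi * value
    return value

# Synthetic yearly series for 2005..2018 with a steady increase of 10 per year
years = list(range(2005, 2019))
values = [100 + 10 * i for i in range(len(years))]

# 2025 is 7 steps after the last observed year, 2018
pred_2025 = forecast(values, steps=2025 - years[-1])
print(round(pred_2025, 2))
```

Running `forecast` once per suburb and per attribute (rental price, population, income) gives the 2025 values that are then ranked.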

In [7]:
# Result
q2result = pd.read_csv('../data/curated/suburb_ranking.csv')

In [9]:
# Top 10 suburbs
q2result[:10]

|  | Suburb | rental rank | income rank | population rank | average rank |
|---|---|---|---|---|---|
| 0 | DROMANA | 11 | 29 | 10 | 16.666667 |
| 1 | PORTSEA | 12 | 54 | 16 | 27.333333 |
| 2 | ASPENDALE | 24 | 16 | 50 | 30.0 |
| 3 | GROVEDALE | 54 | 34 | 3 | 30.333333 |
| 4 | WERRIBEE | 83 | 10 | 2 | 31.666667 |
| 5 | PORT MELBOURNE | 96 | 1 | 1 | 32.666667 |
| 6 | CHELSEA | 23 | 13 | 62 | 32.666667 |
| 7 | SPOTSWOOD | 19 | 31 | 54 | 34.666667 |
| 8 | NEWPORT | 20 | 30 | 55 | 35.0 |
| 9 | TORQUAY | 6 | 91 | 12 | 36.333333 |
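
The "average rank" column is consistent with the plain mean of the three individual ranks. The sketch below recomputes it for the first three suburbs, using rank values copied from the table; the variable names are hypothetical.

```python
import pandas as pd

# Rank values copied from the first three rows of the table
ranks = pd.DataFrame({
    "Suburb": ["DROMANA", "PORTSEA", "ASPENDALE"],
    "rental rank": [11, 12, 24],
    "income rank": [29, 54, 16],
    "population rank": [10, 16, 50],
})

# Average the three ranks row by row, then sort so the best suburb comes first
rank_cols = ["rental rank", "income rank", "population rank"]
ranks["average rank"] = ranks[rank_cols].mean(axis=1)
ranked = ranks.sort_values("average rank").reset_index(drop=True)

print(ranked)
```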


For Q2, choropleth maps are drawn for suburb growth rates; see the plots folder.\
We also visualised the growth of Dromana, the highest-ranked suburb.