# 1. data downloading and preprocessing

## 1.1: domain data 
First, we retrieved historical data from the Victorian Government website [1] and saved the suburb-to-postcode mapping as a JSON file. We then scraped property data, including address, price, property type, number of bedrooms and bathrooms, and geographic coordinates, and saved it as a CSV file named all_properties_combined.csv. After handling outliers and duplicates, we saved the final file as all_properties_preprocessed.csv in the path: /notebooks/1_domain_data/.


```python
import pandas as pd

# Load the preprocessed listings and preview the first row.
file_path = '../data/raw/domain/all_properties_preprocessed.csv'
df = pd.read_csv(file_path)
print(df.head(1))
```

```text
   Unnamed: 0                                      Address   Cost  \
0           0  8/90 Hambleton Street, Middle Park VIC 3206  410.0

  Property Type  Bedrooms  Bathrooms   Latitude   Longitude  \
0     Apartment         1          1 -37.847553  144.960477

  Closest Gov Secondary School Gov Secondary Distance Age under 20 Age 20-39  \
0          Albert Park College            1.2 km away          36%       15%

  Age 40-59 Age 60+  Postcode
0       30%     19%      3206
```
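The outlier and duplicate handling mentioned above is not shown in this summary. As a minimal sketch of one common approach, the block below assumes duplicates are identified by `Address` and outliers are rows whose `Cost` falls outside the 1.5 × IQR fences. The helper name `clean_listings`, the dedup key, and the IQR rule are all assumptions for illustration and may differ from what the project notebook does.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop duplicate addresses and rows whose Cost lies outside the IQR fences.

    Hypothetical helper: the dedup key and the k * IQR rule are assumptions.
    """
    # Keep the first listing seen for each address.
    deduped = df.drop_duplicates(subset="Address", keep="first")

    # Compute the interquartile-range fences on the Cost column.
    q1 = deduped["Cost"].quantile(0.25)
    q3 = deduped["Cost"].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # Keep only rows inside the fences (rows with a missing Cost are dropped too).
    in_range = deduped["Cost"].between(lower, upper)
    return deduped[in_range].reset_index(drop=True)


# Made-up sample rows, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "Address": ["1 A St", "1 A St", "2 B St", "3 C St", "4 D St", "5 E St"],
        "Cost": [410.0, 410.0, 450.0, 430.0, 470.0, 9000.0],
    }
)
cleaned = clean_listings(sample)
print(cleaned)
```

The IQR rule is only one option; a fixed price range or a percentile cut-off would slot into the same place.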


## 1.2 external dataset

### 1.2.1 ABS dataset 

We retrieved the SA2 boundary shapefile from the Australian Bureau of Statistics (ABS) [2]. It is used primarily to map and align all external ABS data with the selected level of granularity.

The summary population data for Australia from 2001 to 2023 was obtained from abs.gov.au [3]. From this, we renamed the columns and extracted the population of each SA2 region in Victoria, then saved the result as a CSV file in the path: /data/raw/ABS_population/.

Population forecasts for the years 2026, 2031, and 2036 were obtained from planning.vic.gov.au [4]. The extracted data includes forecasts at the SA2 level for regions in Victoria and is also saved in the path: /data/raw/ABS_population/. 

The income dataset [5] contains the number of earners and the sum, median, and mean income at each regional level for the financial years 2016-17 to 2020-21. From it we extracted the average income per person at the SA2 level and saved it as a CSV file in the path: /data/raw/Income_Statistics/.

All datasets are sorted by SA2 region and stored in their corresponding paths, ready for further preprocessing.
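Extracting the Victorian rows from a national SA2 table can be sketched as follows. The sketch relies on the ASGS convention that the first digit of a 9-digit SA2 code is the state code, with Victoria being 2. The column name `SA2_code`, the helper name `victorian_sa2`, and the sample values are hypothetical.

```python
import pandas as pd


def victorian_sa2(df: pd.DataFrame, code_col: str = "SA2_code") -> pd.DataFrame:
    """Return only rows whose SA2 code belongs to Victoria (leading digit 2)."""
    # Cast to string first: codes are often read from Excel or CSV as integers.
    codes = df[code_col].astype(str)
    return df[codes.str.startswith("2")].reset_index(drop=True)


# Made-up codes and counts, for illustration only.
population = pd.DataFrame(
    {
        "SA2_code": [101011001, 206011001, 213011001, 305011001],
        "SA2_name": ["Region A", "Region B", "Region C", "Region D"],
        "population_2023": [4300, 54000, 10000, 13000],
    }
)
vic_population = victorian_sa2(population)
print(vic_population)
```

If the table also carries a state name column, filtering on that column is an equally valid alternative.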

Notebook script: notebook/1_ABS_data/


### 1.2.2 Coles_WWS, Hospital data, PTV, Three external, and Electricity_Infrastructure.

1: Coles_WWS

This supermarket dataset contains location information for Woolworths and Coles supermarkets in Victoria. The address information for Woolworths supermarkets was scraped from a page on Seibertron.com [6], while the address database for Coles supermarkets was manually downloaded from the store lookup page on the Coles website [7]. After filtering, these data were correctly stored in a CSV file at the path /data/raw/Coles_WWS/.

2: Hospital data

This hospital dataset contains information about Australian healthcare facilities (including geographic location, facility name, etc.) obtained from Data.gov.au [8]. We filtered it down to Victorian hospitals and saved the result as a CSV file in the path /data/raw/Hospital/.

3: PTV

We retrieved the PTV dataset [11] via URL. It is a zip archive containing the coordinates and names of 11 different types of transit stops. From the unpacked dataset we extracted the six most representative stop types (Regional Train, Metropolitan Train, Metropolitan Tram, Metropolitan Bus, Regional Coach, Regional Bus) and converted them into six parquet files. We then checked the data to ensure there were no exact duplicate rows (the preprocessing step) and saved the files at the path /data/raw/PTV/un_preprocess.
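The extraction and duplicate check can be sketched with the standard library alone. This sketch assumes the archive holds one numbered folder per mode, each containing a `google_transit.zip` with a standard GTFS `stops.txt`; that layout, the folder-to-mode mapping in `MODES`, and the helper name `read_stops` are assumptions to verify against the downloaded file. Writing the parquet files is left out.

```python
import csv
import io
import zipfile

# Assumed folder-number-to-mode mapping; verify against the downloaded archive.
MODES = {
    "1": "Regional Train",
    "2": "Metropolitan Train",
    "3": "Metropolitan Tram",
    "4": "Metropolitan Bus",
    "5": "Regional Coach",
    "6": "Regional Bus",
}


def read_stops(outer_zip, modes=MODES):
    """Return {mode name: list of unique stop rows} from a nested GTFS archive.

    `outer_zip` is a path or a binary file-like object. The layout
    `<folder>/google_transit.zip` -> `stops.txt` is an assumption.
    """
    stops_by_mode = {}
    with zipfile.ZipFile(outer_zip) as outer:
        for folder, mode in modes.items():
            inner_bytes = outer.read(f"{folder}/google_transit.zip")
            with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
                text = inner.read("stops.txt").decode("utf-8-sig")
            seen, unique_rows = set(), []
            for row in csv.DictReader(io.StringIO(text)):
                key = tuple(row.items())  # identical rows share the same key
                if key not in seen:
                    seen.add(key)
                    unique_rows.append(row)
            stops_by_mode[mode] = unique_rows
    return stops_by_mode
```

Calling `read_stops("gtfs.zip")` (a hypothetical file name) would return one list of stop dictionaries per mode, each with the GTFS fields such as `stop_name`, `stop_lat`, and `stop_lon`.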

4: Three external

We downloaded and extracted shapefiles [12] via URL, containing geographic location data for libraries, tourist attractions, and parks. Before saving them as three additional datasets, we checked that the data contained no exact duplicates or missing values, and then saved them in the path /data/raw/three_external.

5: Electricity_Infrastructure

We retrieved the Electricity Infrastructure data from two URLs (Transmission Substations [13] and Major Power Stations [14]) and stored them in the path /data/landing/Foundation_Electricity_Infrastructure. Afterward, we merged the two datasets and applied preprocessing and filtering to obtain the currently operational foundation electricity infrastructure in Victoria.
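A minimal sketch of the merge-and-filter step is given below. The column names `state` and `operationalstatus`, the values "Victoria" and "Operational", and the helper name `operational_vic` are assumptions; the real attribute names should be checked in the downloaded files.

```python
import pandas as pd


def operational_vic(substations, power_stations,
                    state_col="state", status_col="operationalstatus"):
    """Stack the two tables and keep operational Victorian facilities.

    Column names and the values "Victoria" / "Operational" are assumptions.
    """
    # Tag each row with its source so the facility type survives the merge.
    combined = pd.concat(
        [
            substations.assign(facility_type="Transmission Substation"),
            power_stations.assign(facility_type="Major Power Station"),
        ],
        ignore_index=True,
    )
    # Normalise case and whitespace before comparing.
    state = combined[state_col].str.strip().str.lower()
    status = combined[status_col].str.strip().str.lower()
    keep = (state == "victoria") & (status == "operational")
    return combined[keep].reset_index(drop=True)


# Made-up rows, for illustration only.
substations = pd.DataFrame(
    {
        "name": ["Substation A", "Substation B"],
        "state": ["Victoria", "New South Wales"],
        "operationalstatus": ["Operational", "Operational"],
    }
)
power_stations = pd.DataFrame(
    {
        "name": ["Station C", "Station D"],
        "state": ["Victoria", "Victoria"],
        "operationalstatus": ["Decommissioned", "Operational"],
    }
)
vic_infrastructure = operational_vic(substations, power_stations)
print(vic_infrastructure)
```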

#### The corresponding notebook scripts for each dataset are as follows:

Coles_WWS: /notebook/1_Coles_WWS_data/

Hospital data: /notebook/1_Hospital_data/

PTV: /notebooks/1_PTV_data/

Three external: /notebooks/1_three_external_data/

Electricity_Infrastructure: /notebooks/1_Electricity_Infrastructure_data/


### 1.2.3 Crime dataset

### 1.2.4 LGA/Postal shapefile (not used before Section 3)

We obtained two additional external datasets: crime data and an LGA/Postal shapefile. Both datasets provide crucial contextual information that supports further analysis on crime incidents and spatial mapping within Victoria.

Crime data

Detailed statistics on the number of criminal incidents were collected from the Crime Statistics Agency [9]. We then selected the criminal incidents and rate per 100,000 population by police region and local government area from April 2014 to March 2024, and saved these data as a CSV file in the path: /data/raw/Crime/.

LGA/Postal shapefile

This dataset comes from the ABS website [10], from which we downloaded and extracted shapefiles containing postal district codes and names, area details, and geometry data. The LGA section is used to process the crime data, while the postcode section is used to calculate the livability score index.


## 2.1 api distance

We call the OSRM API (https://project-osrm.org/) to calculate the driving distance from each property to public facilities such as hospitals. The results are saved as CSV files in the path: /data/raw/domain/ for further modeling.



Notebook script: /notebooks/2_api_distance/
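A minimal sketch of a single driving-distance lookup against the OSRM `route` service is shown here. It assumes the public demo server at `router.project-osrm.org`; the helper names `route_url`, `distance_km`, and `driving_distance_km` are hypothetical, and the project notebook may batch requests or use a self-hosted instance instead.

```python
import json
import urllib.request

# Public demo server (an assumption); a self-hosted OSRM instance works the same way.
OSRM_BASE = "https://router.project-osrm.org"


def route_url(origin, destination, base=OSRM_BASE):
    """Build an OSRM route request. Points are (latitude, longitude) tuples."""
    (lat1, lon1), (lat2, lon2) = origin, destination
    # OSRM expects coordinates as longitude,latitude pairs separated by semicolons.
    return (f"{base}/route/v1/driving/"
            f"{lon1},{lat1};{lon2},{lat2}?overview=false")


def distance_km(response):
    """Extract the driving distance in km from a parsed OSRM route response."""
    if response.get("code") != "Ok" or not response.get("routes"):
        return None  # no route found
    return response["routes"][0]["distance"] / 1000.0  # OSRM reports metres


def driving_distance_km(origin, destination, timeout=10):
    """Query the server and return the driving distance in km, or None."""
    url = route_url(origin, destination)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return distance_km(json.load(resp))
```

Calling `driving_distance_km((lat, lon), (hospital_lat, hospital_lon))` for each property and facility pair yields the values to store. The demo server is rate limited, so a short pause between requests or a local OSRM instance is advisable for a full dataset.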

## 2.2 forecasting model

In this part, we use an ARIMA model and a simple linear model to forecast population, income, and crime, then save the results as CSV files.

Save path of forecasted population data: /data/raw/ABS_population/population_forecast_2024_2027.csv

Save path of forecasted income data: /data/raw/Past_income_population_preprocessed/income_forecast_2015_2021to2027.csv

Save path of forecasted crime data: /data/raw/ABS_population/crime_forecast_2024_2027.csv

Notebook scripts: /notebooks/4_forecasting_model/
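The simple linear model can be illustrated with a least-squares trend fitted to one region's yearly values and extrapolated to 2024 through 2027. This is a sketch only: the helper name `linear_forecast` and the sample series are made up, and the ARIMA variant, which would typically rely on a time-series library, is not shown.

```python
def linear_forecast(years, values, future_years):
    """Fit y = a + b * year by ordinary least squares and extrapolate."""
    n = len(years)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (year, value) pairs")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    # Slope = covariance(year, value) / variance(year).
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {year: intercept + slope * year for year in future_years}


# Made-up yearly series for a single region.
history_years = [2019, 2020, 2021, 2022, 2023]
history_values = [1000.0, 1020.0, 1040.0, 1060.0, 1080.0]
forecast = linear_forecast(history_years, history_values, range(2024, 2028))
for year, value in forecast.items():
    print(year, round(value, 1))
```

In practice the fit would be repeated per region (SA2 or LGA), with the results collected into one table before saving.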

## 2.3 merge past and future datasets (income, population, crime)

In this part, we performed three merges, combining historical and forecast data into a unified format with three columns: Year, region (either Local Government Area or SA2 Code), and the relevant metric (crime, population, or income). The data is then sorted by year and region in ascending order and saved in the path: ../../data/raw/merge_past_forcasting_data/.


Notebook script: /notebooks/2_merge_past_future_data/
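One of the three merges can be sketched as below. The helper name `merge_past_future`, the sample values, and the rule of keeping the historical row when a year and region appear in both inputs are assumptions; only the three-column layout and the ascending sort come from the description above.

```python
import pandas as pd


def merge_past_future(past, future, region_col, metric_col):
    """Stack historical and forecast rows into Year / region / metric format."""
    columns = ["Year", region_col, metric_col]
    merged = pd.concat([past[columns], future[columns]], ignore_index=True)
    # Assumption: if a (Year, region) pair appears in both inputs,
    # keep the historical row, which comes first after the concat.
    merged = merged.drop_duplicates(subset=["Year", region_col], keep="first")
    # Sort by year, then region, both ascending.
    return merged.sort_values(["Year", region_col]).reset_index(drop=True)


# Made-up region codes and values, for illustration only.
past = pd.DataFrame(
    {
        "Year": [2023, 2022, 2023, 2022],
        "SA2 Code": [2002, 2001, 2001, 2002],
        "population": [210, 100, 110, 200],
    }
)
future = pd.DataFrame(
    {
        "Year": [2024, 2024, 2023],
        "SA2 Code": [2001, 2002, 2001],
        "population": [120, 220, 999],
    }
)
merged_population = merge_past_future(past, future, "SA2 Code", "population")
print(merged_population)
```

The same function covers the crime and income merges by passing a different region column and metric column.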

## 3 individual data

## 4 prediction model

## 5 livability score

Source of Data

[1] https://www.dffh.vic.gov.au/moving-annual-rents-suburb-march-quarter-2023-excel

[2] https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SA2_2021_AUST_SHP_GDA2020.zip

[3] https://www.abs.gov.au/statistics/people/population/regional-population/2022-23/32180DS0003_2001-23.xlsx

[4] https://www.planning.vic.gov.au/__data/assets/excel_doc/0028/691660/VIF2023_SA2_Pop_Hhold_Dwelling_Projections_to_2036_Release_2.xlsx

[5] https://www.abs.gov.au/statistics/labour/earnings-and-working-conditions/personal-income-australia/2020-21-financial-year/Table%201%20-%20Total%20income%2C%20earners%20and%20summary%20statistics%20by%20geography%2C%202016-17%20to%202020-21.xlsx

[6] https://www.seibertron.com/sightings/stores/stores.php?chain_id=35&country=AU&state=101

[7] https://sites.coles.com.au/Sites/StoreSearch.aspx

[8] https://data.gov.au/dataset/ds-ga-696d12c2-38c6-4afa-96b6-309a1ac9a50b/details?q=VIC%20health

[9] https://www.crimestatistics.vic.gov.au/crime-statistics/latest-victorian-crime-data/download-data

[10] https://www.abs.gov.au/

[11] https://discover.data.vic.gov.au/dataset/timetable-and-geographic-information-gtfs

[12] https://datashare.maps.vic.gov.au/search?q=title:vicmap

[13] https://digital.atlas.gov.au/datasets/digitalatlas::transmission-substations/about

[14] https://digital.atlas.gov.au/datasets/digitalatlas::major-power-stations/about



