# Data Scraping

In this notebook, we fetch and download all the necessary datasets and store them in their associated files within the "data" directory.

### Importing Libraries and Functions

In [2]:
import os
from json import dump
import sys
sys.path.append('../')
from scripts.external_scrape_functions import get_xlsx, get_zip
from scripts.parallelised_scrape import generate_url_list, fetch_all_rental_data
from scripts.scrape_oldlistings import scrape_postcodes, get_oldlisting_data, get_remaining_oldlisting_data


### Defining Global Variables

In [15]:
BASE_URL = "https://www.domain.com.au"
LANDING_PATH = "data/landing"
RAW_PATH = "data/raw"

### *domain.com.au* Rental Property Data

1. Fetching all of the *domain.com.au* rental property URLs

In [3]:
# Generating all links
url_links = generate_url_list(BASE_URL)


Generating the list of links...


Fetching listings for price range: 150-200

Found 53 listings for price range 150-200

Fetching listings for sub-range: 150-200

Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=1
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=2
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=3
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=4
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=5
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=6
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittaken=1&sort=price-asc&state=vic&page=7
Fetching https://www.domain.com.au/rent/?price=150-200&excludedeposittake

In [None]:
# Checking we have the correct number of properties
print(len(url_links))

13823
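
Beyond the raw count, it can be worth checking for duplicate URLs, since a listing could in principle be returned more than once across price ranges. The following is a minimal sketch, assuming `url_links` is a flat list of URL strings; the helper name `summarise_links` is hypothetical and not part of the project's scripts.

```python
from collections import Counter


def summarise_links(links):
    """Return (total, unique, duplicates) for a list of URL strings."""
    counts = Counter(links)
    # URLs that appear more than once, sorted for stable output
    duplicates = sorted(url for url, n in counts.items() if n > 1)
    return len(links), len(counts), duplicates


# Placeholder URLs for illustration; in the notebook, pass `url_links` instead.
total, unique, dupes = summarise_links([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
])
print(total, unique, dupes)
```

If `unique` is smaller than `total`, de-duplicating before scraping would avoid fetching the same page twice.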


2. Fetching the data from each of the URLs

In [None]:
# Scraping all data
rental_data = fetch_all_rental_data(url_links)

  6%|▌         | 794/13823 [01:56<26:10,  8.30it/s]  

Issue with https://www.domain.com.au/3-jean-street-lalor-vic-3075-17181912: 'NoneType' object has no attribute 'find'


 23%|██▎       | 3177/13823 [08:16<30:00,  5.91it/s]

Issue with https://www.domain.com.au/22-200-wattletree-road-malvern-vic-3144-17185487: 'NoneType' object has no attribute 'text'


 37%|███▋      | 5145/13823 [12:54<15:56,  9.07it/s]  

Issue with https://www.domain.com.au/2601-65-dudley-street-west-melbourne-vic-3003-15760997: 'NoneType' object has no attribute 'find'


 46%|████▌     | 6348/13823 [15:23<16:09,  7.71it/s]

Issue with https://www.domain.com.au/226-shoreline-drive-golden-beach-vic-3851-16718849: 'NoneType' object has no attribute 'find'


 62%|██████▏   | 8544/13823 [20:45<18:05,  4.86it/s]  

Issue with https://www.domain.com.au/2-mylson-avenue-broadford-vic-3658-17187643: 'NoneType' object has no attribute 'find'


 70%|███████   | 9739/13823 [23:55<13:40,  4.98it/s]

Issue with https://www.domain.com.au/116-300-victoria-street-brunswick-vic-3056-16183098: 'NoneType' object has no attribute 'text'


 72%|███████▏  | 9975/13823 [24:31<07:26,  8.61it/s]

Issue with https://www.domain.com.au/30-leger-street-manor-lakes-vic-3024-17182455: 'NoneType' object has no attribute 'text'
Issue with https://www.domain.com.au/5-2-elata-street-oakleigh-south-vic-3167-17176641: 'NoneType' object has no attribute 'text'


 72%|███████▏  | 9979/13823 [24:32<07:33,  8.47it/s]

Issue with https://www.domain.com.au/3-devon-court-mount-martha-vic-3934-17168700: 'NoneType' object has no attribute 'text'


 81%|████████▏ | 11248/13823 [27:48<06:22,  6.74it/s]

Issue with https://www.domain.com.au/21-nyah-st-keilor-east-vic-3033-17078424: 'NoneType' object has no attribute 'find'


100%|██████████| 13823/13823 [33:22<00:00,  6.90it/s]
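
The `'NoneType' object has no attribute ...` messages above are the kind of error raised when an expected page element is missing and the code then accesses an attribute on `None`. One way to keep track of such pages for later inspection or retry is to collect the failures alongside the successes. This is only a sketch: `parse_all` and `fake_parse` are hypothetical names, and the real parsing logic inside `fetch_all_rental_data` is not shown in this notebook.

```python
def parse_all(urls, parse_one):
    """Apply parse_one to each URL, separating successes from failures."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = parse_one(url)
        except AttributeError as exc:
            # Raised when an expected element is missing (i.e. is None)
            failures[url] = str(exc)
    return results, failures


def fake_parse(url):
    """Stand-in parser that fails when the looked-up field is None."""
    pages = {"https://example.com/ok": "  $450 per week "}
    field = pages.get(url)
    return field.strip()  # AttributeError if field is None


results, failures = parse_all(
    ["https://example.com/ok", "https://example.com/missing"], fake_parse
)
print(results)
print(failures)
```

The `failures` dictionary could then be written to disk or passed back through the scraper once the cause is understood.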


In [14]:
# Create the landing directory for the rental data if it doesn't exist
os.makedirs(f"../{LANDING_PATH}", exist_ok=True)

# Save the data in the data/landing directory
with open(f'../{LANDING_PATH}/all_properties_metadata.json', 'w') as f:
    dump(rental_data, f)
print(f"File saved in the {LANDING_PATH} directory")

NameError: name 'LANDING_PATH' is not defined
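
The `NameError` above is most likely an execution-order issue: this cell's execution count (14) precedes that of the cell defining `LANDING_PATH` (15), so the globals cell needs to be run first. Independently of that, the save step can be sketched with `pathlib`, which creates the directory and writes the file in a few lines. The helper name `save_json` and the placeholder data are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, directory, filename):
    """Write data as JSON under directory, creating the directory if needed."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(data, f)
    return out_path


# Demonstration in a temporary directory with placeholder data
with tempfile.TemporaryDirectory() as tmp:
    path = save_json(
        {"example-url": {"rent": 450}},
        Path(tmp) / "data" / "landing",
        "all_properties_metadata.json",
    )
    saved = json.loads(path.read_text(encoding="utf-8"))
    print(path.name, saved)
```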

### *oldlistings.com* Rental Data Scrape

In [None]:
# Get postcode data:
scrape_postcodes(url="https://www.onlymelbourne.com.au/melbourne-postcodes")

Do not run the following two cells: oldlistings.com has changed since this notebook was written, so the calls have been commented out.

In [None]:
#get_oldlisting_data()

This function is meant to be run repeatedly until the whole website has been scraped.

In [None]:
# Run this function repeatedly to scrape additional listings until the whole website has been scraped
#get_remaining_oldlisting_data()

### Scraping Additional Data

* SA2 Districts in Victoria shapefile

In [4]:
get_zip(url="https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SA2_2021_AUST_SHP_GDA2020.zip", 
        output_dir = '../data/landing/SA2/SA2')
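
The implementation of `get_zip` lives in `scripts/external_scrape_functions.py` and is not shown in this notebook. As a rough idea of what a helper of this kind may involve, the sketch below downloads an archive and extracts it into a target directory using only the standard library. The function names `extract_zip_bytes` and `download_and_extract` are hypothetical and are not the project's code.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen


def extract_zip_bytes(content, output_dir):
    """Extract an in-memory zip archive into output_dir; return member names."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(out)
        return archive.namelist()


def download_and_extract(url, output_dir, timeout=60):
    """Download a zip archive from url and extract it (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return extract_zip_bytes(response.read(), output_dir)


# Offline demonstration with an archive built in memory
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "hello")

with tempfile.TemporaryDirectory() as tmp:
    names = extract_zip_bytes(buffer.getvalue(), tmp)
    extracted_text = (Path(tmp) / "example.txt").read_text()
    print(names, extracted_text)
```

Separating the extraction from the download keeps the extraction step testable without network access.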

* Suburbs shapefile

In [5]:
get_zip(url='https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SAL_2021_AUST_GDA2020_SHP.zip',
        output_dir='../data/landing/SAL_2021_AUST_GDA2020_SHP')

* Historical Population by SA2 district

In [6]:
get_zip(url='https://www.abs.gov.au/statistics/people/population/regional-population/2022-23/32180_ERP_2023_SA2_GDA2020.zip',
        output_dir='../data/landing/population/population')

* Victorian Homelessness data

In [7]:
# 2021
get_xlsx(url='https://www.abs.gov.au/statistics/people/housing/estimating-homelessness-census/2021/20490do005_2021.xlsx', 
         output_dir='../data/landing/homelessness/homelessness21')

Data successfully written to ../data/landing/homelessness/homelessness21


In [8]:
# 2016
get_xlsx(url='https://www.abs.gov.au/statistics/people/housing/estimating-homelessness-census/2016/20490do005_2016.xls', 
         output_dir='../data/landing/homelessness/homelessness16')

Data successfully written to ../data/landing/homelessness/homelessness16


In [9]:
# 2011
get_xlsx(url='https://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&20490_2011%20statistical%20area%20level%202.xls&2049.0&Data%20Cubes&62A1E2D9A1BE3660CA257C4600154B66&0&2011&20.12.2013&Latest', 
         output_dir='../data/landing/homelessness/homelessness11')

Data successfully written to ../data/landing/homelessness/homelessness11


* Socio-economic Indexes per SA2 district

In [10]:
# 2021
get_xlsx(url='https://www.abs.gov.au/statistics/people/people-and-communities/socio-economic-indexes-areas-seifa-australia/2021/Statistical%20Area%20Level%202%2C%20Indexes%2C%20SEIFA%202021.xlsx', 
         output_dir='../data/landing/socioeconomic/socioeconomic21')

Data successfully written to ../data/landing/socioeconomic/socioeconomic21


In [11]:
# 2016
get_xlsx(url='https://www.abs.gov.au/ausstats/subscriber.nsf/log?openagent&2033055001%20-%20sa2%20indexes.xls&2033.0.55.001&Data%20Cubes&C9F7AD36397CB43DCA25825D000F917C&0&2016&27.03.2018&Latest', 
         output_dir='../data/landing/socioeconomic/socioeconomic16')

Data successfully written to ../data/landing/socioeconomic/socioeconomic16


In [12]:
# 2011
get_xlsx(url='https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&2033.0.55.001%20SA2%20Indexes.xls&2033.0.55.001&Data%20Cubes&76D0BC44356DC34ACA257B3B001A4913&0&2011&12.11.2014&Latest', 
         output_dir='../data/landing/socioeconomic/socioeconomic11')

Data successfully written to ../data/landing/socioeconomic/socioeconomic11


* 5-Yearly Projected Population size per SA2 district

In [16]:
get_xlsx(url="https://www.planning.vic.gov.au/__data/assets/excel_doc/0036/691659/VIF2023_SA2_Pop_Age_Sex_Projections_to_2036_Release_2.xlsx", 
         output_dir=f"../{LANDING_PATH}/5yearpopproj_perSA2")


Data successfully written to ../data/landing/5yearpopproj_perSA2


* Yearly Projected Population size per SA2 district

In [17]:
get_xlsx(url="https://www.gen-agedcaredata.gov.au/getmedia/564291d5-8e25-4b2e-90e6-171f554cfce9/Victoria.csv", 
         output_dir=f"../{LANDING_PATH}/yearly_pop_projection_perSA.csv", 
         headers={"User-Agent":
           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"})


Data successfully written to ../data/landing/yearly_pop_projection_perSA.csv
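
The call above passes a browser-like `User-Agent` header, presumably because the server does not respond well to requests without one. How `get_xlsx` forwards `headers` is not shown here; the sketch below illustrates the general idea with the standard library's `urllib.request`. The names `build_request` and `download` are hypothetical, and the example URL is a placeholder.

```python
from urllib.request import Request, urlopen


def build_request(url, headers=None):
    """Create a GET request carrying optional custom headers."""
    return Request(url, headers=headers or {})


def download(url, output_path, headers=None, timeout=60):
    """Fetch url and write the response body to output_path."""
    with urlopen(build_request(url, headers), timeout=timeout) as response:
        body = response.read()
    with open(output_path, "wb") as f:
        f.write(body)
    return len(body)


# Building the request does not touch the network
req = build_request(
    "https://example.com/data.csv", {"User-Agent": "Mozilla/5.0"}
)
# urllib normalises header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```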


* Melbourne Inflation data

In [18]:
get_xlsx(url='https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/consumer-price-index-australia/jun-quarter-2024/640107.xlsx', 
         output_dir='../data/landing/inflation/inflation')

Data successfully written to ../data/landing/inflation/inflation


* Population Distribution Data

In [19]:
get_xlsx(url='https://www.abs.gov.au/methodologies/data-region-methodology/2011-23/14100DO0001_2011-23.xlsx', 
         output_dir='../data/landing/population_dist/population_dist')

Data successfully written to ../data/landing/population_dist/population_dist


* Business Data

In [20]:
get_xlsx(url='https://www.abs.gov.au/methodologies/data-region-methodology/2011-23/14100DO0003_2011-23.xlsx', 
         output_dir='../data/landing/business/business')

Data successfully written to ../data/landing/business/business


* Income Data

In [21]:
get_xlsx(url='https://www.abs.gov.au/methodologies/data-region-methodology/2011-23/14100DO0004_2011-23.xlsx', 
         output_dir='../data/landing/income/income')

Data successfully written to ../data/landing/income/income


* Unemployment Data

In [22]:
get_xlsx(url='https://www.abs.gov.au/methodologies/data-region-methodology/2011-23/14100DO0005_2011-23.xlsx', 
         output_dir='../data/landing/unemployment/unemployment')

Data successfully written to ../data/landing/unemployment/unemployment


* Community Data

In [23]:
get_xlsx(url='https://www.abs.gov.au/methodologies/data-region-methodology/2011-23/14100DO0007_2011-23.xlsx', 
         output_dir='../data/landing/community/community')

Data successfully written to ../data/landing/community/community


* Crime Data

In [24]:
get_xlsx(url='https://files.crimestatistics.vic.gov.au/2024-09/Data_Tables_LGA_Recorded_Offences_Year_Ending_June_2024.xlsx',
         output_dir='../data/landing/recorded_offences_data')

Data successfully written to ../data/landing/recorded_offences_data
