# Data Scraping

In this notebook, we fetch and download all the necessary datasets and store them in the appropriate subdirectories of the `data` directory.

### Importing Libraries and Functions

In [1]:
import sys
sys.path.append('../')
import os
from json import dump
from scripts.scrape_sa2 import get_shapefile
from scripts.parallelised_scrape import generate_url_list, fetch_all_rental_data
import asyncio

### Defining Global Variables - changes pending

In [2]:
# CONSTANTS
BASE_URL = "https://www.domain.com.au"
LANDING_PATH = "data/landing"
RAW_PATH = "data/raw"

### Fetching All The '*domain.com.au*' Rental Property Links

In [None]:
# Generating all links
url_links = generate_url_list(BASE_URL)

In [4]:
# Check how many property links were collected
print(len(url_links))

13823
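A raw count does not show whether any listing was collected more than once. The sketch below is one way to check for repeated links before scraping. `find_duplicates` and `sample_links` are hypothetical names introduced for illustration, and the sample URLs are placeholders rather than real listings; in the notebook, `url_links` would be passed instead.

```python
from collections import Counter


def find_duplicates(urls):
    """Return the URLs that appear more than once, in first-seen order."""
    counts = Counter(urls)
    return [url for url, n in counts.items() if n > 1]


# Hypothetical sample standing in for url_links (placeholder URLs)
sample_links = [
    "https://www.domain.com.au/example-a-1",
    "https://www.domain.com.au/example-b-2",
    "https://www.domain.com.au/example-a-1",
]

duplicates = find_duplicates(sample_links)
# dict.fromkeys drops repeats while keeping the original order
unique_links = list(dict.fromkeys(sample_links))
print(len(sample_links), len(unique_links), duplicates)
```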


### Fetching The Data From Each Of The URLs 

In [5]:
# Scraping all data
rental_data = fetch_all_rental_data(url_links)

  6%|▌         | 794/13823 [01:56<26:10,  8.30it/s]  

Issue with https://www.domain.com.au/3-jean-street-lalor-vic-3075-17181912: 'NoneType' object has no attribute 'find'


 23%|██▎       | 3177/13823 [08:16<30:00,  5.91it/s]

Issue with https://www.domain.com.au/22-200-wattletree-road-malvern-vic-3144-17185487: 'NoneType' object has no attribute 'text'


 37%|███▋      | 5145/13823 [12:54<15:56,  9.07it/s]  

Issue with https://www.domain.com.au/2601-65-dudley-street-west-melbourne-vic-3003-15760997: 'NoneType' object has no attribute 'find'


 46%|████▌     | 6348/13823 [15:23<16:09,  7.71it/s]

Issue with https://www.domain.com.au/226-shoreline-drive-golden-beach-vic-3851-16718849: 'NoneType' object has no attribute 'find'


 62%|██████▏   | 8544/13823 [20:45<18:05,  4.86it/s]  

Issue with https://www.domain.com.au/2-mylson-avenue-broadford-vic-3658-17187643: 'NoneType' object has no attribute 'find'


 70%|███████   | 9739/13823 [23:55<13:40,  4.98it/s]

Issue with https://www.domain.com.au/116-300-victoria-street-brunswick-vic-3056-16183098: 'NoneType' object has no attribute 'text'


 72%|███████▏  | 9975/13823 [24:31<07:26,  8.61it/s]

Issue with https://www.domain.com.au/30-leger-street-manor-lakes-vic-3024-17182455: 'NoneType' object has no attribute 'text'
Issue with https://www.domain.com.au/5-2-elata-street-oakleigh-south-vic-3167-17176641: 'NoneType' object has no attribute 'text'


 72%|███████▏  | 9979/13823 [24:32<07:33,  8.47it/s]

Issue with https://www.domain.com.au/3-devon-court-mount-martha-vic-3934-17168700: 'NoneType' object has no attribute 'text'


 81%|████████▏ | 11248/13823 [27:48<06:22,  6.74it/s]

Issue with https://www.domain.com.au/21-nyah-st-keilor-east-vic-3033-17078424: 'NoneType' object has no attribute 'find'


100%|██████████| 13823/13823 [33:22<00:00,  6.90it/s]
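The log above shows that a handful of pages raise `'NoneType' object has no attribute ...` errors while the rest of the run continues. The internals of `fetch_all_rental_data` are not shown in this notebook, so the following is only a minimal sketch of one way to isolate per-URL failures with `asyncio.gather(..., return_exceptions=True)`. `parse_listing` and `fetch_all` are hypothetical stand-ins, and the failure is simulated rather than produced by a real request.

```python
import asyncio


async def parse_listing(url):
    """Hypothetical stand-in for the real per-page scraper."""
    await asyncio.sleep(0)  # placeholder for the network request
    if url.endswith("-bad"):
        # Simulates calling .find on an element that was not on the page
        raise AttributeError("'NoneType' object has no attribute 'find'")
    return {"url": url}


async def fetch_all(urls):
    """Scrape every URL, collecting failures instead of aborting the run."""
    results = await asyncio.gather(
        *(parse_listing(url) for url in urls), return_exceptions=True
    )
    scraped, failed = {}, {}
    # gather returns results in the same order as the input URLs
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = str(result)
        else:
            scraped[url] = result
    return scraped, failed


demo_urls = ["listing-1", "listing-2-bad", "listing-3"]  # placeholder identifiers
scraped, failed = asyncio.run(fetch_all(demo_urls))
for url, reason in failed.items():
    print(f"Issue with {url}: {reason}")
```

Inside a running notebook kernel an event loop is already active, so `await fetch_all(demo_urls)` would be used in place of `asyncio.run`. Keeping the failed URLs in a dictionary makes it straightforward to retry or inspect them later.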


In [6]:
# Create the directory for the rental data if it doesn't exist
os.makedirs(f"../{LANDING_PATH}", exist_ok=True)

# Save the data in the data/landing directory
with open(f'../{LANDING_PATH}/all_properties_metadata.json', 'w') as f:
    dump(rental_data, f)
    print(f"File saved in the {LANDING_PATH} directory")

File saved in the data/landing directory
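Since the scrape takes over half an hour, it may be worth confirming that the saved file reloads cleanly before moving on. The sketch below writes a dictionary to JSON and reads it back. `save_and_verify` and `sample_data` are hypothetical, and a temporary directory is used in place of `data/landing`; the real structure of `rental_data` is assumed to be JSON-serialisable with string keys.

```python
import json
import os
import tempfile


def save_and_verify(data, path):
    """Write data as JSON, reload it, and report whether it round-trips."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f)
    with open(path) as f:
        reloaded = json.load(f)
    return reloaded == data


# Hypothetical sample standing in for rental_data
sample_data = {"listing-1": {"beds": 2, "price": "$500 per week"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "landing", "all_properties_metadata.json")
    round_trip_ok = save_and_verify(sample_data, target)

print(round_trip_ok)
```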


## Scraping Suburb Info

In [8]:
# Import the postcode scraper
from scripts.scrape_oldlistings import scrape_postcodes_from_file

In [9]:
postcode_df = scrape_postcodes_from_file()
postcode_df.head()

| Unnamed: 0 | suburb | postcode |
|---|---|---|
| 0 | Abbotsford | 3067 |
| 1 | Aberfeldie | 3040 |
| 2 | Aberfeldy | 3825 |
| 3 | Acheron | 3714 |
| 4 | Addington | 3352 |


### Scraping Additional Data

In [3]:
# Scrape the SA2 shapefile and download it to the data/SA2 directory
get_shapefile(url="https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SA2_2021_AUST_SHP_GDA2020.zip", output_dir='../data/SA2/')
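The body of `get_shapefile` lives in `scripts/scrape_sa2.py` and is not shown here. As a rough illustration of what a download-and-extract helper of this kind might do, the sketch below uses only the standard library. `extract_zip_bytes` and `download_and_extract` are hypothetical names, not the project's functions, and the demo runs on an in-memory archive so that no network request is made.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen


def extract_zip_bytes(payload, output_dir):
    """Extract a zip archive held in memory and return its member names."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(out)
        return sorted(archive.namelist())


def download_and_extract(url, output_dir):
    """Download a zip file from url and extract it into output_dir."""
    with urlopen(url) as response:
        return extract_zip_bytes(response.read(), output_dir)


# Demo on an in-memory archive with a placeholder file name
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.shp", b"placeholder bytes")

with tempfile.TemporaryDirectory() as tmp_dir:
    members = extract_zip_bytes(buffer.getvalue(), Path(tmp_dir) / "SA2")
    extracted_exists = (Path(tmp_dir) / "SA2" / "example.shp").exists()

print(members, extracted_exists)
```

Separating the extraction step from the download keeps the archive handling testable offline; `download_and_extract` would be called with the ABS URL above and `'../data/SA2/'` to mirror the cell.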