Domain Crawler crawls rent/sale information (address, price, number of bedrooms, bathrooms, and parking spaces) for a list of Sydney suburbs from https://www.domain.com.au.

Tools: requests, BeautifulSoup, re, csv, pandas

In [1]:
import requests

Test requests.get() on the first result page of a single suburb, 'Ashfield NSW 2131'.

In [2]:
url = 'https://www.domain.com.au'
r = requests.get('%s/%s/%s' % (url, 'rent', 'ashfield-nsw-2131'))
r.content[:1000]

'<!DOCTYPE html><html lang="en-AU" data-tag="v1.1.0" data-commit-id="833fcbe19ed5bdb1f3195230036c5e8493ca438b"><head><meta charset="utf-8"/><meta http-equiv="content-language" content="en-au"/><meta name="format-detection" content="telephone=no"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no"/><link rel="dns-prefetch" href="//static.domain.com.au"/><link rel="dns-prefetch" href="//images.domain.com.au"/><link rel="dns-prefetch" href="//fonts.gstatic.com"/><link rel="dns-prefetch" href="//b.domainstatic.com.au"/><link rel="dns-prefetch" href="//mt0.googleapis.com"/><link rel="dns-prefetch" href="//mt1.googleapis.com"/><link rel="dns-prefetch" href="//assets.adobedtm.com"/><link rel="dns-prefetch" href="//renderizr-assets.domainstatic.com.au"/><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="apple-itunes-app" content="app-id=319908646, affiliate-data=27021
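The request URL above is assembled from three parts: the site root, the listing type ('rent' or 'sale'), and a suburb slug. As a minimal sketch, the helper below makes that pattern explicit. The name `build_search_url` is hypothetical (it is not part of requests or of the site), and the `?page=` query parameter mirrors the one the crawl function in this notebook uses.

```python
def build_search_url(base_url, rent_or_sale, suburb, page=1):
    """Return the search-results URL for one suburb and one result page.

    build_search_url is a hypothetical helper used only for illustration.
    """
    if rent_or_sale not in ('rent', 'sale'):
        raise ValueError("rent_or_sale must be 'rent' or 'sale'")
    # Strip stray slashes so the parts always join with exactly one '/'.
    return '%s/%s/%s/?page=%d' % (
        base_url.rstrip('/'), rent_or_sale, suburb.strip('/'), page)


print(build_search_url('https://www.domain.com.au', 'rent', 'ashfield-nsw-2131'))
# https://www.domain.com.au/rent/ashfield-nsw-2131/?page=1
```

Stripping slashes from each part means a slug copied with a trailing '/' still produces a well-formed URL.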

Parse the HTML text using BeautifulSoup.

If the raw HTML is hard to read, run it through an online formatter such as https://www.freeformatter.com/html-formatter.html.

In [3]:
from bs4 import BeautifulSoup

In [4]:
soup = BeautifulSoup(r.content, 'lxml')

In [5]:
items = soup.find_all('li', {'class': 'search-results__listing'})
test_item = items[5]

Extract price.

In [6]:
price = test_item.find('p')
price.get_text()

u'$820 PER WEEK '

Extract address.

In [7]:
address = test_item.find('a')
address.get_text()

u'43 Service Avenue,ASHFIELD NSW 2131'

Extract the number of bedrooms, bathrooms, and parking spaces.

In [8]:
facilities = test_item.find_all('span', {'class': 'property-feature__feature'})
bedroom_num = facilities[0].get_text()
bathroom_num = facilities[1].get_text()
parking_num = facilities[2].get_text()
bedroom_num, bathroom_num, parking_num

(u'3 Beds', u'2 Baths', u'4 Parkings')
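The three extraction steps above can be combined into one function and tried on a small inline snippet, which avoids hitting the site while experimenting. This is only a sketch: `extract_listing` is a hypothetical name, the markup is made up (only the tag and class names come from the selectors above), and it uses the standard-library 'html.parser' backend instead of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the tag names and class names are taken from the
# selectors used above; the real page contains far more structure.
SAMPLE_HTML = """
<ul>
  <li class="search-results__listing">
    <p>$820 PER WEEK </p>
    <a href="#">43 Service Avenue,ASHFIELD NSW 2131</a>
    <span class="property-feature__feature">3 Beds</span>
    <span class="property-feature__feature">2 Baths</span>
    <span class="property-feature__feature">4 Parkings</span>
  </li>
  <li class="search-results__listing">
    <p>Contact agent</p>
  </li>
</ul>
"""


def extract_listing(item):
    """Return [address, price, beds, baths, parking] as text, or None."""
    address = item.find('a')
    price = item.find('p')
    features = item.find_all('span', {'class': 'property-feature__feature'})
    # Skip listings that lack an address, a price, or exactly three features.
    if address is None or price is None or len(features) != 3:
        return None
    return [tag.get_text() for tag in [address, price] + features]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
items = soup.find_all('li', {'class': 'search-results__listing'})
rows = [extract_listing(item) for item in items]
print(rows)
```

The second listing has no address and no features, so it yields None rather than raising an error.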

Extract all rent info.

In [9]:
info_list = []
for item in items:
    address = item.find('a')
    price = item.find('p')
    facilities = item.find_all('span', {'class': 'property-feature__feature'})
    if len(facilities) != 3:
        continue
    [bedroom_num, bathroom_num, parking_num] = facilities
    if address and price and bedroom_num and bathroom_num and parking_num:
        info_list.append([
            address.get_text(),
            price.get_text(),
            bedroom_num.get_text(),
            bathroom_num.get_text(),
            parking_num.get_text()
        ])
info_list[:10]

[[u'406/1 Victoria Street,ASHFIELD NSW 2131',
  u'$560 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'G07/1 Victoria St,ASHFIELD NSW 2131',
  u'$575 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'12/61-63 Frederick Street,ASHFIELD NSW 2131',
  u'DEPOSIT TAKEN ',
  u'3 Beds',
  u'1 Bath',
  u'1 Parking'],
 [u'B101/11-13 Hercules Street,ASHFIELD NSW 2131',
  u'DEPOSIT TAKEN ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'43 Service Avenue,ASHFIELD NSW 2131',
  u'$820 PER WEEK ',
  u'3 Beds',
  u'2 Baths',
  u'4 Parkings'],
 [u'1/90 Victoria Street,ASHFIELD NSW 2131',
  u'$600 per week ',
  u'2 Beds',
  u'1 Bath',
  u'1 Parking'],
 [u'3/53 Gower Street,ASHFIELD NSW 2131',
  u'$550 ',
  u'2 Beds',
  u'1 Bath',
  u'1 Parking'],
 [u'4/37 Alt Street,ASHFIELD NSW 2131',
  u'$880 to $900 ',
  u'3 Beds',
  u'2 Baths',
  u'2 Parkings'],
 [u'9/2A Brown Street,ASHFIELD NSW 2131',
  u'$670 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'143/18-20 Knocklayde Street,ASHFIELD NSW 2131',
  u'$65

Now let's crawl rent info for a list of popular suburbs.

In [10]:
suburbs = [
    'ashfield-nsw-2131',
    'auburn-nsw-2144',
    'burwood-nsw-2134',
    'campsie-nsw-2194',
    'chatswood-nsw-2067',
    'eastwood-nsw-2122',
    'epping-nsw-2121',
    'haymarket-nsw-2000',
    'hurstville-nsw-2220',
    'kingsford-nsw-2032',
    'marsfield-nsw-2122',
    'mascot-nsw-2020',
    'parramatta-nsw-2150',
    'rhodes-nsw-2138',
    'strathfield-nsw-2135',
    'sydney-nsw-2000',
    'ultimo-nsw-2007',
    'waterloo-nsw-2017',
    'zetland-nsw-2017'
]
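Each entry is a URL slug of the form name-state-postcode. As a minimal sketch, a slug can be taken apart by splitting from the right. Taking the first piece of `split('-')` is enough for the one-word suburbs listed here, but it would truncate a multi-word suburb. `split_slug` is a hypothetical helper, and 'north-sydney-nsw-2060' is an illustrative slug that is not in the list above.

```python
def split_slug(slug):
    """Split 'name-state-postcode' into its three parts.

    Splitting from the right keeps multi-word names such as 'north-sydney' whole.
    """
    name, state, postcode = slug.strip('/').rsplit('-', 2)
    return name, state, postcode


print(split_slug('ashfield-nsw-2131'))      # ('ashfield', 'nsw', '2131')
print(split_slug('north-sydney-nsw-2060'))  # ('north-sydney', 'nsw', '2060')
```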

Note that each suburb can span multiple result pages, so the crawler requests successive pages until one returns no listings.

Write a function crawl to crawl rent/sale info.

In [11]:
def crawl(url, rent_or_sale, suburbs):
    info = {}
    for suburb in suburbs:
        suburb_name = suburb.split('-')[0]
        print('Crawling %s information of %s...' % (rent_or_sale, suburb_name))
        suburb_info = []

        page = 1
        while True:
            r = requests.get('%s/%s/%s/?page=%d' % (url, rent_or_sale, suburb, page))
            soup = BeautifulSoup(r.content, 'lxml')
            items = soup.find_all('li', {'class': 'search-results__listing'})
            if not items:
                break
            for item in items:
                address = item.find('a')
                price = item.find('p')
                facilities = item.find_all('span', {'class': 'property-feature__feature'})
                if len(facilities) != 3:
                    continue
                [bedroom_num, bathroom_num, parking_num] = facilities
                if address and price and bedroom_num and bathroom_num and parking_num:
                    suburb_info.append([
                        address.get_text(),
                        price.get_text(),
                        bedroom_num.get_text(),
                        bathroom_num.get_text(),
                        parking_num.get_text()
                    ])
            page += 1

        info[suburb_name] = suburb_info
    
    return info
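The pagination logic inside crawl can be isolated and tested without any network access. The sketch below is an illustration under stated assumptions: `crawl_pages`, `fake_fetch`, and `FAKE_SITE` are hypothetical names, and the `max_pages` cap is an added safety limit that the original loop does not have.

```python
def crawl_pages(fetch_items, max_pages=50):
    """Collect items page by page until a page comes back empty.

    fetch_items(page) is a caller-supplied function returning a list of
    items for that page number; max_pages is a safety cap (an assumption,
    not something the site documents).
    """
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_items(page)
        if not items:
            break  # an empty page means we are past the last page
        results.extend(items)
    return results


# Stand-in for the network: three pages of fake listings, then nothing.
FAKE_SITE = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


print(crawl_pages(fake_fetch))  # ['a', 'b', 'c', 'd', 'e']
```

Passing the fetch step in as a function keeps the stop condition separate from the HTTP and parsing code, and the cap guards against a site that never returns an empty page.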

Crawl rent info into rent_info. Crawl sale info into sale_info.

In [12]:
rent_info = crawl(url, 'rent', suburbs)
sale_info = crawl(url, 'sale', suburbs)

Crawling rent information of ashfield...
Crawling rent information of auburn...
Crawling rent information of burwood...
Crawling rent information of campsie...
Crawling rent information of chatswood...
Crawling rent information of eastwood...
Crawling rent information of epping...
Crawling rent information of haymarket...
Crawling rent information of hurstville...
Crawling rent information of kingsford...
Crawling rent information of marsfield...
Crawling rent information of mascot...
Crawling rent information of parramatta...
Crawling rent information of rhodes...
Crawling rent information of strathfield...
Crawling rent information of sydney...
Crawling rent information of ultimo...
Crawling rent information of waterloo...
Crawling rent information of zetland...
Crawling sale information of ashfield...
Crawling sale information of auburn...
Crawling sale information of burwood...
Crawling sale information of campsie...
Crawling sale information of chatswood...
Crawling sale informati

In [13]:
for suburb in suburbs:
    suburb_name = suburb.split('-')[0]
    print('%s has %d rent results and %d sale results.' %
          (suburb_name, len(rent_info[suburb_name]), len(sale_info[suburb_name])))

ashfield has 108 rent results and 95 sale results.
auburn has 84 rent results and 171 sale results.
burwood has 89 rent results and 121 sale results.
campsie has 116 rent results and 134 sale results.
chatswood has 207 rent results and 117 sale results.
eastwood has 63 rent results and 46 sale results.
epping has 166 rent results and 180 sale results.
haymarket has 26 rent results and 24 sale results.
hurstville has 70 rent results and 172 sale results.
kingsford has 88 rent results and 18 sale results.
marsfield has 47 rent results and 15 sale results.
mascot has 110 rent results and 115 sale results.
parramatta has 226 rent results and 235 sale results.
rhodes has 98 rent results and 92 sale results.
strathfield has 107 rent results and 116 sale results.
sydney has 206 rent results and 174 sale results.
ultimo has 27 rent results and 44 sale results.
waterloo has 87 rent results and 108 sale results.
zetland has 79 rent results and 95 sale results.


In [14]:
rent_info['rhodes'][:10]

[[u'79/38 Shoreline Drive,RHODES NSW 2138',
  u'$600 per week ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'601/36 Shoreline Drive,RHODES NSW 2138',
  u'$550 wk ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'10/24 Walker St,RHODES NSW 2138',
  u'$540 wk ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'606/52-54 Walker st,RHODES NSW 2138',
  u'$850 ',
  u'3 Beds',
  u'2 Baths',
  u'2 Parkings'],
 [u'721/4 Marquet Street,RHODES NSW 2138',
  u'$800 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'712/3 Timbrol Ave,RHODES NSW 2138',
  u'$730 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'705/52-54 WALKER STREET,RHODES NSW 2138',
  u'$720 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'113/56-58 Walker Street,RHODES NSW 2138',
  u'$700 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'205/11 Lewis Ave,RHODES NSW 2138',
  u'$700 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'405/44 Shoreline Drive,RHODES NSW 2138',
  u'$695 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parki

The results show that the format is not uniform.

E.g., some prices carry a suffix such as 'per week' or 'wk' while others do not, and some entries are not prices at all (e.g. 'DEPOSIT TAKEN').

Write a function data_format to format them.

In [15]:
import re

In [16]:
def data_format(info, rent_or_sale):
    formatted_info = []
    for suburb_name, suburb_info in info.items():
        for item in suburb_info:
            # Replace the comma between street and suburb with a space.
            # Encode to UTF-8 because Python 2's csv module expects byte strings.
            address = ' '.join(item[0].split(',')).encode('utf-8')

            # Drop commas and spaces, then take the first numeric token,
            # optionally suffixed with 'm' or 'M' for millions.
            price = re.sub(r'[, ]', '', item[1])
            price = re.search(r'[\d.]+[mM]?', price)
            try:
                if not price:
                    price = None
                else:
                    price = price.group()
                    if not any(c in '0123456789' for c in price):
                        price = None
                    else:
                        if price[-1] in 'mM':
                            # Scale before converting to int so '1.5m' gives 1500000.
                            price = int(round(float(price[:-1]) * 1000000))
                        else:
                            price = int(float(price))
                        # Discard values outside a plausible range.
                        if rent_or_sale == 'rent' and not 10 < price < 5000:
                            price = None
                        if rent_or_sale == 'sale' and not 100000 < price < 50000000:
                            price = None
            except ValueError as e:
                # The token was not a valid number (e.g. '1.2.3').
                print(e)
                price = None

            # '3 Beds' -> 3; anything without a leading integer becomes None.
            counts = []
            for text in item[2:5]:
                tokens = text.split()
                token = tokens[0] if tokens else ''
                counts.append(int(token) if token.isdigit() else None)
            bedroom_num, bathroom_num, parking_num = counts

            formatted_info.append((suburb_name, address, price, bedroom_num, bathroom_num, parking_num))

    return formatted_info

In [17]:
formatted_rent_info = data_format(rent_info, 'rent')
formatted_sale_info = data_format(sale_info, 'sale')
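To see what the price clean-up does to individual strings, the same idea can be tried in isolation. This is a minimal sketch rather than the notebook's exact function: `parse_price` is a hypothetical name, the bounds are passed in as parameters, and '$1.25M' is a made-up input used to show the millions suffix.

```python
import re


def parse_price(text, low, high):
    """Return the first number in text as an int, or None if unusable."""
    cleaned = re.sub(r'[, ]', '', text)
    match = re.search(r'[\d.]+[mM]?', cleaned)
    if not match:
        return None
    token = match.group()
    try:
        if token[-1] in 'mM':
            # Scale before converting to int so fractions of a million survive.
            value = int(round(float(token[:-1]) * 1000000))
        else:
            value = int(float(token))
    except ValueError:
        return None  # e.g. a lone '.' that is not a number
    # Keep only values strictly inside the plausible range.
    return value if low < value < high else None


print(parse_price('$820 PER WEEK ', 10, 5000))   # 820
print(parse_price('$880 to $900 ', 10, 5000))    # 880
print(parse_price('DEPOSIT TAKEN ', 10, 5000))   # None
print(parse_price('$1.25M', 100000, 50000000))   # 1250000
```

Note that a range such as '$880 to $900' resolves to its lower bound, because only the first numeric token is used.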

Write the formatted data to CSV files.

In [18]:
import csv

In [19]:
headers = ['Suburb','Address','Price','Bedrooms','Bathrooms','Parkings']

with open('rent_info.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(formatted_rent_info)

with open('sale_info.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(formatted_sale_info)
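It helps to know how missing values travel through the CSV step. The sketch below (written for Python 3, using an in-memory buffer instead of a file on disk) shows that csv.writer writes None as an empty field. The two rows are adapted from listings shown earlier, with the unparseable price already replaced by None.

```python
import csv
import io

rows = [
    ('rhodes', '79/38 Shoreline Drive RHODES NSW 2138', 600, 2, 2, 1),
    ('ashfield', '12/61-63 Frederick Street ASHFIELD NSW 2131', None, 3, 1, 1),
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.writer(buffer)
writer.writerow(['Suburb', 'Address', 'Price', 'Bedrooms', 'Bathrooms', 'Parkings'])
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines[2])
# ashfield,12/61-63 Frederick Street ASHFIELD NSW 2131,,3,1,1
```

When pandas reads such a file back, an empty field becomes NaN, and a numeric column containing NaN is stored as float. That is why counts appear as 2.0 rather than 2 in the tables that follow.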

Now we are ready to analyse the data.

In [20]:
import pandas as pd

In [21]:
rent_data = pd.read_csv('rent_info.csv')
sale_data = pd.read_csv('sale_info.csv')

In [22]:
rent_data.head(5)

Unnamed: 0,Suburb,Address,Price,Bedrooms,Bathrooms,Parkings
0,haymarket,Level 18/178 Thomas Street HAYMARKET NSW 2000,1200.0,2.0,2.0,2.0
1,haymarket,303 Castlereagh Street HAYMARKET NSW 2000,1050.0,2.0,2.0,1.0
2,haymarket,N8.06/33 Ultimo Road HAYMARKET NSW 2000,755.0,1.0,1.0,1.0
3,haymarket,S909/178 Thomas Street HAYMARKET NSW 2000,,2.0,2.0,1.0
4,haymarket,Level 32/2 Quay Street HAYMARKET NSW 2000,580.0,1.0,1.0,


In [23]:
sale_data.head(5)

Unnamed: 0,Suburb,Address,Price,Bedrooms,Bathrooms,Parkings
0,haymarket,1009/2 Quay Street HAYMARKET NSW 2000,,2.0,2.0,1.0
1,haymarket,2 Hay St HAYMARKET NSW 2000,2000000.0,3.0,2.0,1.0
2,haymarket,Darling Square 61 Harbour Street HAYMARKET NSW...,,2.0,1.0,1.0
3,haymarket,Darling Square Tumbalong Boulevard HAYMARKET N...,850000.0,1.0,1.0,
4,haymarket,LVL 6/NE3 Darling North Harbour Street Darlin...,,1.0,1.0,1.0


In [24]:
rent_bedroom1_data = rent_data[rent_data['Bedrooms'] == 1]
rent_bedroom2_data = rent_data[rent_data['Bedrooms'] == 2]
rent_bedroom3_data = rent_data[rent_data['Bedrooms'] == 3]

sale_bedroom1_data = sale_data[sale_data['Bedrooms'] == 1]
sale_bedroom2_data = sale_data[sale_data['Bedrooms'] == 2]
sale_bedroom3_data = sale_data[sale_data['Bedrooms'] == 3]

rent_bedroom1_price = rent_bedroom1_data.groupby(['Suburb'])['Price'].mean().round(0)
rent_bedroom2_price = rent_bedroom2_data.groupby(['Suburb'])['Price'].mean().round(0)
rent_bedroom3_price = rent_bedroom3_data.groupby(['Suburb'])['Price'].mean().round(0)

sale_bedroom1_price = sale_bedroom1_data.groupby(['Suburb'])['Price'].mean().round(0)
sale_bedroom2_price = sale_bedroom2_data.groupby(['Suburb'])['Price'].mean().round(0)
sale_bedroom3_price = sale_bedroom3_data.groupby(['Suburb'])['Price'].mean().round(0)

Compare the mean rent and sale prices of 1-, 2-, and 3-bedroom properties in each suburb.

In [25]:
compare = pd.DataFrame({
    'rent 1 bedroom': rent_bedroom1_price,
    'rent 2 bedrooms': rent_bedroom2_price,
    'rent 3 bedrooms': rent_bedroom3_price,
    'buy 1 bedroom': sale_bedroom1_price,
    'buy 2 bedrooms': sale_bedroom2_price,
    'buy 3 bedrooms': sale_bedroom3_price
})
compare

Suburb,buy 1 bedroom,buy 2 bedrooms,buy 3 bedrooms,rent 1 bedroom,rent 2 bedrooms,rent 3 bedrooms
ashfield,648545.0,846478.0,1582600.0,394.0,531.0,654.0
auburn,529273.0,560635.0,967255.0,346.0,471.0,553.0
burwood,642777.0,899526.0,1491750.0,393.0,625.0,722.0
campsie,552615.0,692418.0,796333.0,371.0,500.0,643.0
chatswood,633947.0,1325500.0,1693333.0,594.0,719.0,1107.0
eastwood,750000.0,840000.0,1375000.0,310.0,495.0,619.0
epping,688848.0,871019.0,1646087.0,475.0,550.0,628.0
haymarket,785667.0,1377500.0,2000000.0,716.0,1098.0,
hurstville,625059.0,804859.0,910111.0,433.0,518.0,664.0
kingsford,,760000.0,,402.0,660.0,1021.0
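The per-bedroom filtering and grouping can also be written as a single pivot. The sketch below is an alternative formulation, not the notebook's own code, and it runs on a tiny made-up sample with the same column names as rent_info.csv. `pivot_table` with `aggfunc='mean'` produces one row per suburb and one column per bedroom count, skipping missing prices.

```python
import pandas as pd

# Tiny made-up sample with the same columns as rent_info.csv.
sample = pd.DataFrame({
    'Suburb':   ['ashfield', 'ashfield', 'ashfield', 'rhodes', 'rhodes'],
    'Bedrooms': [1, 1, 2, 2, 2],
    'Price':    [400.0, 500.0, 600.0, 700.0, None],
})

# One row per suburb, one column per bedroom count, mean price in each cell.
table = sample.pivot_table(index='Suburb', columns='Bedrooms',
                           values='Price', aggfunc='mean')
print(table)
```

A suburb with no listings for a given bedroom count gets NaN in that cell, just as kingsford and haymarket do in the table above.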


Sort rent price for 2 bedrooms.

In [26]:
rent_bedroom2_price.sort_values(ascending=False)

Suburb
sydney         1140.0
haymarket      1098.0
ultimo          870.0
zetland         817.0
waterloo        755.0
chatswood       719.0
mascot          711.0
rhodes          674.0
kingsford       660.0
burwood         625.0
strathfield     580.0
epping          550.0
ashfield        531.0
hurstville      518.0
parramatta      517.0
marsfield       513.0
campsie         500.0
eastwood        495.0
auburn          471.0
Name: Price, dtype: float64

Sort sale price for 2 bedrooms.

In [27]:
sale_bedroom2_price.sort_values(ascending=False)

Suburb
haymarket      1377500.0
sydney         1374032.0
chatswood      1325500.0
waterloo       1185100.0
zetland         992080.0
rhodes          965250.0
mascot          937150.0
burwood         899526.0
ultimo          877627.0
epping          871019.0
ashfield        846478.0
eastwood        840000.0
hurstville      804859.0
strathfield     800714.0
kingsford       760000.0
marsfield       710000.0
parramatta      697007.0
campsie         692418.0
auburn          560635.0
Name: Price, dtype: float64

Take a look at 2-bedroom rentals in Rhodes.

In [28]:
rhodes = rent_data[(rent_data['Suburb'] == 'rhodes') &
                   (rent_data['Bedrooms'] == 2)].sort_values('Price', ascending=False)
rhodes

Unnamed: 0,Suburb,Address,Price,Bedrooms,Bathrooms,Parkings
414,rhodes,2308/46 Walker St RHODES NSW 2138,900.0,2.0,2.0,1.0
401,rhodes,721/4 Marquet Street RHODES NSW 2138,800.0,2.0,2.0,1.0
436,rhodes,79 Shoreline Drive RHODES NSW 2138,750.0,2.0,2.0,1.0
402,rhodes,712/3 Timbrol Ave RHODES NSW 2138,730.0,2.0,2.0,1.0
430,rhodes,402/42 Shoreline Drive RHODES NSW 2138,720.0,2.0,1.0,1.0
403,rhodes,705/52-54 WALKER STREET RHODES NSW 2138,720.0,2.0,2.0,1.0
439,rhodes,1412/43 Shoreline Drive RHODES NSW 2138,700.0,2.0,2.0,1.0
404,rhodes,113/56-58 Walker Street RHODES NSW 2138,700.0,2.0,2.0,1.0
405,rhodes,205/11 Lewis Ave RHODES NSW 2138,700.0,2.0,2.0,1.0
455,rhodes,205 2 Lewis Avenue RHODES NSW 2138,700.0,2.0,2.0,1.0


Take a look at gross rental yield for each suburb.

Gross rental yield = (Annual rental income / Property value) * 100, where annual rental income is the weekly rent multiplied by 52.

In [29]:
gry_bedroom1 = compare['rent 1 bedroom'] * 52 / compare['buy 1 bedroom'] * 100
gry_bedroom2 = compare['rent 2 bedrooms'] * 52 / compare['buy 2 bedrooms'] * 100
gry_bedroom3 = compare['rent 3 bedrooms'] * 52 / compare['buy 3 bedrooms'] * 100
gry = pd.DataFrame({
    '1 bedroom': gry_bedroom1.round(2),
    '2 bedrooms': gry_bedroom2.round(2),
    '3 bedrooms': gry_bedroom3.round(2)
})
gry

Suburb,1 bedroom,2 bedrooms,3 bedrooms
ashfield,3.16,3.26,2.15
auburn,3.4,4.37,2.97
burwood,3.18,3.61,2.52
campsie,3.49,3.75,4.2
chatswood,4.87,2.82,3.4
eastwood,2.15,3.06,2.34
epping,3.59,3.28,1.98
haymarket,4.74,4.14,
hurstville,3.6,3.35,3.79
kingsford,,4.52,
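As a worked check of the formula, the yield for a single cell can be recomputed by hand from the mean prices in the comparison table. `gross_rental_yield` is a hypothetical helper written for this illustration; it assumes the rent figure is weekly, as in the cell above.

```python
def gross_rental_yield(weekly_rent, property_value):
    """Gross rental yield in percent, from a weekly rent and a sale price."""
    annual_rent = weekly_rent * 52  # 52 weeks per year, as in the cell above
    return annual_rent / float(property_value) * 100


# Ashfield, 1 bedroom: mean rent 394 per week, mean sale price 648545.
print(round(gross_rental_yield(394, 648545), 2))  # 3.16
# Auburn, 2 bedrooms: mean rent 471 per week, mean sale price 560635.
print(round(gross_rental_yield(471, 560635), 2))  # 4.37
```

Both values agree with the corresponding entries in the table above.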
