My first crawler: crawl all rent info (address, price, number of bedrooms, bathrooms and parking spaces) for a list of Sydney suburbs from https://www.domain.com.au.

Tools: requests, bs4

In [169]:
import requests

Test requests.get() on the first result page of a single suburb, 'Ashfield NSW 2131'.

In [170]:
url = 'https://www.domain.com.au/rent/ashfield-nsw-2131'
r = requests.get(url)
r.text

u'<!DOCTYPE html><html lang="en-AU" data-tag="v1.0.1" data-commit-id="927b694a4ec0916eb8b5565a0ee6671ead215883"><head><meta charset="utf-8"/><meta http-equiv="content-language" content="en-au"/><meta name="format-detection" content="telephone=no"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no"/><link rel="dns-prefetch" href="//static.domain.com.au"/><link rel="dns-prefetch" href="//images.domain.com.au"/><link rel="dns-prefetch" href="//fonts.gstatic.com"/><link rel="dns-prefetch" href="//b.domainstatic.com.au"/><link rel="dns-prefetch" href="//mt0.googleapis.com"/><link rel="dns-prefetch" href="//mt1.googleapis.com"/><link rel="dns-prefetch" href="//assets.adobedtm.com"/><link rel="dns-prefetch" href="//renderizr-assets.domainstatic.com.au"/><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="apple-itunes-app" content="app-id=319908646, affiliate-data=2702

Parse the HTML text using BeautifulSoup.

If the HTML text is hard to read, try an online formatter such as https://www.freeformatter.com/html-formatter.html.

In [171]:
from bs4 import BeautifulSoup

In [172]:
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly so results do not depend on what is installed

In [173]:
items = soup.findAll('li', {'class': 'search-results__listing'})
test_item = items[3]

Extract price.

In [174]:
price = test_item.find('p')
price.get_text()

u'DEPOSIT TAKEN '

Extract address.

In [175]:
address = test_item.find('a')
address.get_text()

u'12/61-63 Frederick Street,ASHFIELD NSW 2131'

Extract the number of bedrooms, bathrooms and parking spaces.

In [176]:
facilities = test_item.findAll('span', {'class': 'property-feature__feature'})
bedroom_num = facilities[0].get_text()
bathroom_num = facilities[1].get_text()
parking_num = facilities[2].get_text()
print bedroom_num, bathroom_num, parking_num

3 Beds 1 Bath 1 Parking
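
The three extraction steps above can be combined into one helper and tried on a small inline HTML snippet, which avoids hitting the site while experimenting. This is only a sketch: `SAMPLE_HTML` and `parse_listing` are hypothetical, the markup merely mimics the class names used in this notebook, and the real page may be structured differently. It is written so that it also runs on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in this notebook;
# the real Domain page may differ or change over time.
SAMPLE_HTML = """
<ul>
  <li class="search-results__listing">
    <p>$560 </p>
    <a href="#">406/1 Victoria Street,ASHFIELD NSW 2131</a>
    <span class="property-feature__feature">1 Bed</span>
    <span class="property-feature__feature">1 Bath</span>
    <span class="property-feature__feature">1 Parking</span>
  </li>
  <li class="search-results__listing">
    <p>Contact agent</p>
    <a href="#">1 Example Street,ASHFIELD NSW 2131</a>
    <span class="property-feature__feature">2 Beds</span>
  </li>
</ul>
"""


def parse_listing(item):
    """Return [address, price, beds, baths, parking], or None if incomplete."""
    address = item.find('a')
    price = item.find('p')
    features = item.find_all('span', class_='property-feature__feature')
    if address is None or price is None or len(features) != 3:
        return None
    fields = [address, price] + list(features)
    return [tag.get_text().strip() for tag in fields]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
items = soup.find_all('li', class_='search-results__listing')
# Keep only the listings that have every field.
rows = [row for row in (parse_listing(item) for item in items) if row is not None]
print(rows)
```

The second listing has only one feature, so it is dropped, in the same way that the loop in the next cell skips listings without exactly three features.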


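Extract the rent info of every listing on the page. Listings that do not show exactly three features (bedrooms, bathrooms, parking) are skipped.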

In [177]:
info_list = []
for item in items:
    address = item.find('a')
    price = item.find('p')
    facilities = item.findAll('span', {'class': 'property-feature__feature'})
    if len(facilities) != 3:
        continue
    [bedroom_num, bathroom_num, parking_num] = facilities
    if address and price and bedroom_num and bathroom_num and parking_num:
        info_list.append([
            address.get_text(),
            price.get_text(),
            bedroom_num.get_text(),
            bathroom_num.get_text(),
            parking_num.get_text()
        ])
info_list

[[u'406/1 Victoria Street,ASHFIELD NSW 2131',
  u'$560 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'G07/1 Victoria St,ASHFIELD NSW 2131',
  u'$575 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'12/61-63 Frederick Street,ASHFIELD NSW 2131',
  u'DEPOSIT TAKEN ',
  u'3 Beds',
  u'1 Bath',
  u'1 Parking'],
 [u'B101/11-13 Hercules Street,ASHFIELD NSW 2131',
  u'DEPOSIT TAKEN ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'43 Service Avenue,ASHFIELD NSW 2131',
  u'$820 PER WEEK ',
  u'3 Beds',
  u'2 Baths',
  u'4 Parkings'],
 [u'1/90 Victoria Street,ASHFIELD NSW 2131',
  u'$600 per week ',
  u'2 Beds',
  u'1 Bath',
  u'1 Parking'],
 [u'3/53 Gower Street,ASHFIELD NSW 2131',
  u'$550 ',
  u'2 Beds',
  u'1 Bath',
  u'1 Parking'],
 [u'18/371-377 Liverpool Road,ASHFIELD NSW 2131',
  u'$530 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'4/37 Alt Street,ASHFIELD NSW 2131',
  u'$880 to $900 ',
  u'3 Beds',
  u'2 Baths',
  u'2 Parkings'],
 [u'9/2A Brown Street,ASHFIELD NSW 2131',
  u'$670 ',

Now let's crawl all rent info for a list of suburbs.

The following suburbs were chosen from the popular rental suburbs listed on http://www.sydneytoday.com/house_rent.

In [178]:
url = 'https://www.domain.com.au/rent/'

suburbs = [
    'ashfield-nsw-2131',
    'auburn-nsw-2144',
    'burwood-nsw-2134',
    'campsie-nsw-2194',
    'chatswood-nsw-2067',
    'eastwood-nsw-2122',
    'epping-nsw-2121',
    'haymarket-nsw-2000',
    'hurstville-nsw-2220',
    'kingsford-nsw-2032',
    'marsfield-nsw-2122',
    'mascot-nsw-2020',
    'parramatta-nsw-2150',
    'rhodes-nsw-2138',
    'strathfield-nsw-2135',
    'sydney-nsw-2000',
    'ultimo-nsw-2007',
    'waterloo-nsw-2017',
    'zetland-nsw-2017'
]

Note that each suburb may have multiple result pages, so we keep requesting the next page until one comes back with no listings.

In [190]:
suburb_rent_info = {}
for suburb in suburbs:
    suburb_name = suburb.split('-')[0]
    print 'Processing %s...' % suburb_name
    suburb_url = url + suburb
    info_list = []
    
    page = 1
    while True:
        url_full = suburb_url + '/?page=' + str(page)
        r = requests.get(url_full)
        soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly so results do not depend on what is installed
        items = soup.findAll('li', {'class': 'search-results__listing'})
        if not items:
            break
        for item in items:
            address = item.find('a')
            price = item.find('p')
            facilities = item.findAll('span', {'class': 'property-feature__feature'})
            if len(facilities) != 3:
                continue
            [bedroom_num, bathroom_num, parking_num] = facilities
            if address and price and bedroom_num and bathroom_num and parking_num:
                info_list.append([
                    address.get_text(),
                    price.get_text(),
                    bedroom_num.get_text(),
                    bathroom_num.get_text(),
                    parking_num.get_text()
                ])
        page += 1
    
    suburb_rent_info[suburb_name] = info_list

Processing ashfield...
Processing auburn...
Processing burwood...
Processing campsie...
Processing chatswood...
Processing eastwood...
Processing epping...
Processing haymarket...
Processing hurstville...
Processing kingsford...
Processing marsfield...
Processing mascot...
Processing parramatta...
Processing rhodes...
Processing strathfield...
Processing sydney...
Processing ultimo...
Processing waterloo...
Processing zetland...
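
The page loop above stops only when a page returns no listings, so it would run forever if the site kept serving results for any page number. A minimal sketch of the same loop as a generator with a safety cap is shown here. The names `iter_pages`, `fetch_items`, `max_pages` and `delay` are hypothetical, and the fake site below stands in for the real requests.get and findAll calls.

```python
import time


def iter_pages(fetch_items, max_pages=50, delay=0.0):
    """Yield (page_number, items) until a page has no items.

    fetch_items is any callable that takes a page number and returns a list;
    in the notebook it would wrap the request and the HTML parsing.
    """
    for page in range(1, max_pages + 1):
        items = fetch_items(page)
        if not items:
            break
        yield page, items
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the server


# Stand-in for the real site: three pages of fake listings, then nothing.
FAKE_SITE = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


collected = []
for page, items in iter_pages(fake_fetch):
    collected.extend(items)

print(collected)
```

Passing the fetch function in as an argument keeps the stopping rule separate from the network code, so the loop can be checked without making any requests.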


In [180]:
for suburb_name, info_list in suburb_rent_info.items():
    print '%s has %d results'% (suburb_name, len(info_list))

haymarket has 29 results
burwood has 92 results
waterloo has 89 results
kingsford has 87 results
strathfield has 111 results
rhodes has 97 results
parramatta has 224 results
zetland has 79 results
epping has 162 results
auburn has 84 results
marsfield has 45 results
chatswood has 202 results
ashfield has 116 results
sydney has 212 results
eastwood has 64 results
campsie has 112 results
ultimo has 27 results
hurstville has 74 results
mascot has 106 results


In [181]:
suburb_rent_info['rhodes']

[[u'402/12 Shoreline Dr,RHODES NSW 2138',
  u'$700 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'205/11 Lewis Ave,RHODES NSW 2138',
  u'$700 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'601/36 Shoreline Drive,RHODES NSW 2138',
  u'$550 wk ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'10/24 Walker St,RHODES NSW 2138',
  u'$540 wk ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking'],
 [u'L6 AND L11/52-54 Walker st,RHODES NSW 2138',
  u'$850 to $880 ',
  u'3 Beds',
  u'2 Baths',
  u'2 Parkings'],
 [u'721/4 Marquet Street,RHODES NSW 2138',
  u'$800 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'712/3 Timbrol Ave,RHODES NSW 2138',
  u'$730 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'705/52-54 WALKER STREET,RHODES NSW 2138',
  u'$720 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'113/56-58 Walker Street,RHODES NSW 2138',
  u'$700 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking'],
 [u'405/44 Shoreline Drive,RHODES NSW 2138',
  u'$695 ',
  u'2 Beds',
  u'2 Baths',
  u'

In [249]:
def data_format(suburb_rent_info):
    formatted_suburb_rent_info = []
    for suburb_name, info_list in suburb_rent_info.items():
        for rent_info in info_list:
            try:
                address = rent_info[0].split(',')
                address = ' '.join(address)
                
                price = filter(lambda c: c in '1234567890-.', rent_info[1])
                if len(price) <= 1:
                    price = None
                else:
                    if '-' in price:
                        price = price.split('-')[0]
                    price = price.split('.')[0]
                    if not price:
                        price = None
                    else:
                        price = int(price)
                        # Treat implausibly large values as missing (e.g. two prices run together).
                        if price > 10000:
                            price = None
                
                bedroom_num = rent_info[2].split()[0]
                if not bedroom_num.isdigit():
                    bedroom_num = None
                else:
                    bedroom_num = int(bedroom_num)
                
                bathroom_num = rent_info[3].split()[0]
                if not bathroom_num.isdigit():
                    bathroom_num = None
                else:
                    bathroom_num = int(bathroom_num)
                
                parking_num = rent_info[4].split()[0]
                if not parking_num.isdigit():
                    parking_num = None
                else:
                    parking_num = int(parking_num)
            except ValueError as e:
                print e
                print rent_info
                continue  # skip this record instead of appending stale values
            formatted_suburb_rent_info.append((suburb_name, address, price, bedroom_num, bathroom_num, parking_num))
    
    return formatted_suburb_rent_info
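
In `data_format`, the price filter keeps every digit in the string, so a range such as '$880 to $900 ' collapses to 880900 and is then discarded by the 10000 cap. As an alternative sketch (not what this notebook ran), a regular expression can take the first amount instead. `parse_weekly_price` and `parse_count` are hypothetical helpers. Taking the lower bound of a range is an assumption, and because the function takes the first number it finds, text such as '2 weeks free, $500' would be misread.

```python
import re


def parse_weekly_price(text, max_price=10000):
    """Return the first amount found in text as an int, or None."""
    match = re.search(r'\$?\s*(\d[\d,]*)(?:\.\d+)?', text)
    if match is None:
        return None
    # Drop thousands separators, e.g. '1,200' -> 1200.
    price = int(match.group(1).replace(',', ''))
    # Assumed cap, mirroring the 10000 limit used in data_format.
    return price if price <= max_price else None


def parse_count(text):
    """Return the leading integer of strings like '3 Beds', or None."""
    match = re.match(r'\s*(\d+)', text)
    return int(match.group(1)) if match else None


for raw in ['$560 ', 'DEPOSIT TAKEN ', '$880 to $900 ', '$820 PER WEEK ']:
    print(repr(raw), parse_weekly_price(raw))
```

With this version the range listings keep a usable price (880 here) rather than becoming missing values.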

In [251]:
formatted_suburb_rent_info = data_format(suburb_rent_info)

Write the formatted data to a CSV file.

In [252]:
import csv

In [306]:
headers = ['Suburb','Address','Price','Bedrooms','Bathrooms','Parkings']

with open('rent_info.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(formatted_suburb_rent_info)
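
A small sketch of what the writer does with missing values: csv.writer writes None as an empty field, which is why unparsed prices show up as empty cells in the file and as NaN once pandas reads it back. The rows below are made-up samples and the code assumes Python 3, where an in-memory io.StringIO buffer can replace the file. On Python 3 the csv documentation also recommends opening real files with newline=''.

```python
import csv
import io

headers = ['Suburb', 'Address', 'Price']
rows = [
    ('ashfield', '406/1 Victoria Street ASHFIELD NSW 2131', 560),
    ('ashfield', '12/61-63 Frederick Street ASHFIELD NSW 2131', None),
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines[2])  # the None price becomes an empty trailing field
```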

Now we are ready to analyse the data.

In [307]:
import pandas as pd

In [309]:
data = pd.read_csv('rent_info.csv')
data

   Suburb     Address                                             Price  Bedrooms  Bathrooms  Parkings
0  haymarket  Level 18/178 Thomas Street HAYMARKET NSW 2000      1200.0       2.0        2.0       2.0
1  haymarket  303 Castlereagh Street HAYMARKET NSW 2000          1050.0       2.0        2.0       1.0
2  haymarket  Level 32/2 Quay Street HAYMARKET NSW 2000           580.0       1.0        1.0       NaN
3  haymarket  S909/178 Thomas Street HAYMARKET NSW 2000             NaN       2.0        2.0       1.0
4  haymarket  Level 33/2 Quay Street HAYMARKET NSW 2000           920.0       2.0        2.0       1.0
5  haymarket  S16.10/178 Thomas Street Street HAYMARKET NSW ...   990.0       2.0        2.0       1.0
6  haymarket  N905/33 Ultimo Road HAYMARKET NSW 2000              790.0       1.0        1.0       1.0
7  haymarket  1708/2 Quay Street HAYMARKET NSW 2000              1000.0       2.0        2.0       1.0
8  haymarket  1601/6 Little Hay Street HAYMARKET NSW 2000         760.0       1.0        1.0       NaN
9  haymarket  178 Thomas st HAYMARKET NSW 2000                      NaN       2.0        2.0       NaN


In [338]:
target1 = data[data['Bedrooms'] == 1]
target2 = data[data['Bedrooms'] == 2]
target3 = data[data['Bedrooms'] == 3]
price1 = target1.groupby(['Suburb'])['Price'].mean()
price2 = target2.groupby(['Suburb'])['Price'].mean()
price3 = target3.groupby(['Suburb'])['Price'].mean()

In [339]:
compare = pd.DataFrame({'1 bedroom': price1, '2 bedrooms': price2, '3 bedrooms': price3})
compare

Suburb       1 bedroom  2 bedrooms   3 bedrooms
ashfield     385.78125  528.979592   639.333333
auburn      345.666667  464.459459   554.736842
burwood          393.0  613.513514        722.5
campsie       374.0625  502.714286   651.153846
chatswood   599.313725   722.28125  1107.096774
eastwood        313.75  484.772727   618.863636
epping      473.235294  547.685393   618.888889
haymarket        697.5      1080.0          NaN
hurstville  429.642857  517.592593   661.428571
kingsford     399.6875  652.413793     1000.625
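
The same comparison can also be built in one step with pivot_table rather than three separate filters and groupbys. This is a sketch on made-up numbers, not the notebook's data; `sample` and `table` are hypothetical names.

```python
import pandas as pd

# Made-up listings, only to show the shape of the result.
sample = pd.DataFrame({
    'Suburb': ['ashfield', 'ashfield', 'ashfield', 'auburn', 'auburn'],
    'Bedrooms': [1, 1, 2, 1, 3],
    'Price': [400.0, 420.0, 550.0, 350.0, 560.0],
})

# Keep 1 to 3 bedroom listings, then average the price per suburb and size.
subset = sample[sample['Bedrooms'].isin([1, 2, 3])]
table = subset.pivot_table(index='Suburb', columns='Bedrooms',
                           values='Price', aggfunc='mean')
print(table)
```

Combinations with no listings come out as NaN, as with haymarket's 3 bedroom column above. Since a mean is sensitive to a few very expensive listings, `aggfunc='median'` may be worth trying as well.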


In [342]:
price1.sort_values()

Suburb
eastwood       313.750000
auburn         345.666667
campsie        374.062500
ashfield       385.781250
burwood        393.000000
kingsford      399.687500
strathfield    428.500000
hurstville     429.642857
epping         473.235294
marsfield      480.000000
rhodes         529.444444
parramatta     565.016667
mascot         565.238095
waterloo       584.393939
ultimo         595.000000
chatswood      599.313725
zetland        610.000000
haymarket      697.500000
sydney         705.512500
Name: Price, dtype: float64

In [343]:
price2.sort_values()

Suburb
auburn          464.459459
eastwood        484.772727
campsie         502.714286
hurstville      517.592593
marsfield       519.565217
ashfield        528.979592
epping          547.685393
strathfield     564.800000
parramatta      570.518182
burwood         613.513514
kingsford       652.413793
rhodes          675.543478
mascot          716.250000
chatswood       722.281250
waterloo        759.117647
zetland         828.333333
ultimo          870.000000
haymarket      1080.000000
sydney         1142.924528
Name: Price, dtype: float64

In [344]:
price3.sort_values()

Suburb
auburn          554.736842
parramatta      575.185185
eastwood        618.863636
epping          618.888889
marsfield       630.909091
ashfield        639.333333
campsie         651.153846
hurstville      661.428571
strathfield     715.500000
burwood         722.500000
mascot          890.227273
rhodes         1000.000000
kingsford      1000.625000
waterloo       1057.857143
chatswood      1107.096774
zetland        1256.363636
ultimo         1490.000000
sydney         1828.823529
Name: Price, dtype: float64

In [348]:
rhodes = data[(data['Suburb'] == 'rhodes') & (data['Bedrooms'] == 2)].sort_values('Price')
rhodes

     Suburb  Address                                  Price  Bedrooms  Bathrooms  Parkings
494  rhodes  10 CAVELL AVENUE RHODES NSW 2138         410.0       2.0        1.0       2.0
480  rhodes  A811/40 Shoreline Drive RHODES NSW 2138  580.0       2.0        1.0       1.0
432  rhodes  31 Blaxland Road RHODES NSW 2138         600.0       2.0        1.0       1.0
477  rhodes  204/46 Walker Street RHODES NSW 2138     610.0       2.0        1.0       1.0
500  rhodes  326/60 Walker Street RHODES NSW 2138     620.0       2.0        2.0       1.0
474  rhodes  D208/10-16 Marquet RHODES NSW 2138       630.0       2.0        2.0       1.0
475  rhodes  79/38 Shoreline Drive RHODES NSW 2138    630.0       2.0        2.0       1.0
499  rhodes  1607/63 Shoreline Drive RHODES NSW 2138  640.0       2.0        1.0       1.0
472  rhodes  65/50 Walker St RHODES NSW 2138          640.0       2.0        2.0       1.0
473  rhodes  77/50 Walker Street RHODES NSW 2138      640.0       2.0        2.0       1.0
