Domain Crawler crawls rent/sale information for a list of Sydney suburbs from https://www.domain.com.au.

Target information includes:
- address
- price
- #bedroom
- #bathroom
- #parking
- latitude
- longitude
- distance to CBD
- distance to (nearest) train station

Tools:
1. requests
2. BeautifulSoup
3. re
4. csv
5. geopy

In [1]:
import requests

Test requests.get() on the first result page of a single suburb, 'Ashfield NSW 2131'.

In [2]:
url = 'https://www.domain.com.au'
r = requests.get('%s/%s/%s' % (url, 'rent', 'ashfield-nsw-2131'))
r.content[:1000]

'<!DOCTYPE html><html lang="en-AU" data-tag="v1.1.0" data-commit-id="833fcbe19ed5bdb1f3195230036c5e8493ca438b"><head><meta charset="utf-8"/><meta http-equiv="content-language" content="en-au"/><meta name="format-detection" content="telephone=no"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no"/><link rel="dns-prefetch" href="//static.domain.com.au"/><link rel="dns-prefetch" href="//images.domain.com.au"/><link rel="dns-prefetch" href="//fonts.gstatic.com"/><link rel="dns-prefetch" href="//b.domainstatic.com.au"/><link rel="dns-prefetch" href="//mt0.googleapis.com"/><link rel="dns-prefetch" href="//mt1.googleapis.com"/><link rel="dns-prefetch" href="//assets.adobedtm.com"/><link rel="dns-prefetch" href="//renderizr-assets.domainstatic.com.au"/><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="apple-itunes-app" content="app-id=319908646, affiliate-data=27021

Parse the HTML text using BeautifulSoup.

If the raw HTML is hard to read, run it through an online formatter such as https://www.freeformatter.com/html-formatter.html before looking for the tags and classes to extract.

In [3]:
from bs4 import BeautifulSoup

In [4]:
soup = BeautifulSoup(r.content, 'lxml')

In [5]:
items = soup.find_all('li', {'class': 'search-results__listing'})
test_item = items[5]

Extract price.

In [6]:
price = test_item.find('p')
price.get_text()

u'$820 PER WEEK '

Extract address.

In [7]:
address = test_item.find('a')
address.get_text()

u'43 Service Avenue,ASHFIELD NSW 2131'

Extract the number of bedrooms, bathrooms and parking spaces.

In [8]:
facilities = test_item.find_all('span', {'class': 'property-feature__feature'})
bedroom_num = facilities[0].get_text()
bathroom_num = facilities[1].get_text()
parking_num = facilities[2].get_text()
bedroom_num, bathroom_num, parking_num

(u'3 Beds', u'2 Baths', u'4 Parkings')
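The extracted labels are still text such as '3 Beds'. As a minimal sketch of how the leading count could be pulled out, assuming every label starts with its number, the helper below (leading_int is a hypothetical name, not part of the notebook) returns None when no number is present:

```python
import re

def leading_int(label):
    """Return the integer at the start of a label such as '3 Beds', or None."""
    match = re.match(r'\s*(\d+)', label)
    return int(match.group(1)) if match else None

# Labels copied from the output above.
counts = [leading_int(text) for text in ('3 Beds', '2 Baths', '4 Parkings')]
# counts == [3, 2, 4]
```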

Extract latitude and longitude.

In [9]:
lat = test_item.find('meta', {'itemprop': 'latitude'})
lng = test_item.find('meta', {'itemprop': 'longitude'})
if lat:
    lat = lat['content']
if lng:
    lng = lng['content']
lat, lng

('-33.90254', '151.12793')

Extract all rent info.

In [10]:
info_list = []

for item in items:
    address = item.find('a')
    price = item.find('p')
    facilities = item.find_all('span', {'class': 'property-feature__feature'})
    if len(facilities) != 3:
        continue
    [bedroom_num, bathroom_num, parking_num] = facilities
    lat = item.find('meta', {'itemprop': 'latitude'})
    lng = item.find('meta', {'itemprop': 'longitude'})
    if lat:
        lat = lat['content']
    if lng:
        lng = lng['content']
        
    if address and price and bedroom_num and bathroom_num and parking_num and lat and lng:
        info_list.append([
            address.get_text(),
            price.get_text(),
            bedroom_num.get_text(),
            bathroom_num.get_text(),
            parking_num.get_text(),
            lat,
            lng
        ])
info_list[:5]

[[u'406/1 Victoria Street,ASHFIELD NSW 2131',
  u'$550 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking',
  '-33.8910065',
  '151.129547'],
 [u'G07/1 Victoria St,ASHFIELD NSW 2131',
  u'$575 ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking',
  '-33.8910065',
  '151.129547'],
 [u'12/61-63 Frederick Street,ASHFIELD NSW 2131',
  u'DEPOSIT TAKEN ',
  u'3 Beds',
  u'1 Bath',
  u'1 Parking',
  '-33.8814774',
  '151.122482'],
 [u'B101/11-13 Hercules Street,ASHFIELD NSW 2131',
  u'DEPOSIT TAKEN ',
  u'1 Bed',
  u'1 Bath',
  u'1 Parking',
  '-33.888092',
  '151.125031'],
 [u'43 Service Avenue,ASHFIELD NSW 2131',
  u'$820 PER WEEK ',
  u'3 Beds',
  u'2 Baths',
  u'4 Parkings',
  '-33.90254',
  '151.12793']]

Now let's crawl rent info for a list of popular suburbs.

In [11]:
suburbs = [
    'ashfield-nsw-2131',
    'auburn-nsw-2144',
    'burwood-nsw-2134',
    'campsie-nsw-2194',
    'chatswood-nsw-2067',
    'eastwood-nsw-2122',
    'epping-nsw-2121',
    'haymarket-nsw-2000',
    'hurstville-nsw-2220',
    'marsfield-nsw-2122',
    'mascot-nsw-2020',
    'parramatta-nsw-2150',
    'rhodes-nsw-2138',
    'strathfield-nsw-2135',
    'sydney-nsw-2000',
    'ultimo-nsw-2007',
    'waterloo-nsw-2017',
    'zetland-nsw-2017'
]

Note that each suburb may have multiple result pages.

Write a function crawl that walks through the result pages of each suburb and collects the rent/sale info, stopping at the first page with no listings.

In [12]:
def crawl(url, rent_or_sale, suburbs):
    info = {}
    for suburb in suburbs:
        suburb_name = suburb.split('-')[0]
        print 'Crawling %s information of %s...' % (rent_or_sale, suburb_name)
        suburb_info = []

        page = 1
        while True:
            r = requests.get('%s/%s/%s/?page=%d' % (url, rent_or_sale, suburb, page))
            soup = BeautifulSoup(r.content, 'lxml')
            items = soup.find_all('li', {'class': 'search-results__listing'})
            if not items:
                break
            for item in items:
                address = item.find('a')
                price = item.find('p')
                facilities = item.find_all('span', {'class': 'property-feature__feature'})
                if len(facilities) != 3:
                    continue
                [bedroom_num, bathroom_num, parking_num] = facilities
                lat = item.find('meta', {'itemprop': 'latitude'})
                lng = item.find('meta', {'itemprop': 'longitude'})
                if lat:
                    lat = lat['content']
                if lng:
                    lng = lng['content']

                if address and price and bedroom_num and bathroom_num and parking_num and lat and lng:
                    suburb_info.append([
                        address.get_text(),
                        price.get_text(),
                        bedroom_num.get_text(),
                        bathroom_num.get_text(),
                        parking_num.get_text(),
                        lat,
                        lng
                    ])
            page += 1

        info[suburb_name] = suburb_info
    
    return info
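The page URLs requested by crawl follow the pattern url/rent_or_sale/suburb/?page=N. The sketch below isolates that step. Both function names and the max_pages cap are assumptions added for illustration; the cap is only a guard in case the site keeps returning listings for page numbers past the last one.

```python
def build_search_url(base_url, rent_or_sale, suburb, page=1):
    """Build the URL of one result page, tolerating stray slashes in the inputs."""
    return '%s/%s/%s/?page=%d' % (
        base_url.rstrip('/'), rent_or_sale, suburb.strip('/'), page)

def iter_page_urls(base_url, rent_or_sale, suburb, max_pages=50):
    """Yield result-page URLs for pages 1..max_pages (max_pages is a hypothetical cap)."""
    for page in range(1, max_pages + 1):
        yield build_search_url(base_url, rent_or_sale, suburb, page)

first_pages = list(iter_page_urls('https://www.domain.com.au', 'rent',
                                  'ashfield-nsw-2131', max_pages=2))
```

A caller would still break out of the loop as soon as a page yields no listings, exactly as crawl does.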

Crawl rent info into rent_info. Crawl sale info into sale_info.

In [13]:
rent_info = crawl(url, 'rent', suburbs)
sale_info = crawl(url, 'sale', suburbs)

Crawling rent information of ashfield...
Crawling rent information of auburn...
Crawling rent information of burwood...
Crawling rent information of campsie...
Crawling rent information of chatswood...
Crawling rent information of eastwood...
Crawling rent information of epping...
Crawling rent information of haymarket...
Crawling rent information of hurstville...
Crawling rent information of marsfield...
Crawling rent information of mascot...
Crawling rent information of parramatta...
Crawling rent information of rhodes...
Crawling rent information of strathfield...
Crawling rent information of sydney...
Crawling rent information of ultimo...
Crawling rent information of waterloo...
Crawling rent information of zetland...
Crawling sale information of ashfield...
Crawling sale information of auburn...
Crawling sale information of burwood...
Crawling sale information of campsie...
Crawling sale information of chatswood...
Crawling sale information of eastwood...
Crawling sale informatio

In [14]:
for suburb in suburbs:
    suburb_name = suburb.split('-')[0]
    print '%s has %d rent results and %d sale results.' % \
        (suburb_name, len(rent_info[suburb_name]), len(sale_info[suburb_name]))

ashfield has 106 rent results and 96 sale results.
auburn has 86 rent results and 173 sale results.
burwood has 86 rent results and 123 sale results.
campsie has 112 rent results and 133 sale results.
chatswood has 202 rent results and 114 sale results.
eastwood has 66 rent results and 46 sale results.
epping has 167 rent results and 175 sale results.
haymarket has 25 rent results and 24 sale results.
hurstville has 68 rent results and 175 sale results.
marsfield has 48 rent results and 15 sale results.
mascot has 104 rent results and 117 sale results.
parramatta has 236 rent results and 233 sale results.
rhodes has 95 rent results and 91 sale results.
strathfield has 108 rent results and 113 sale results.
sydney has 207 rent results and 175 sale results.
ultimo has 24 rent results and 45 sale results.
waterloo has 86 rent results and 107 sale results.
zetland has 78 rent results and 98 sale results.


In [15]:
rent_info['rhodes'][:10]

[[u'79/38 Shoreline Drive,RHODES NSW 2138',
  u'$600 per week ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking',
  '-33.8268051',
  '151.085068'],
 [u'606/52-54 Walker st,RHODES NSW 2138',
  u'$850 ',
  u'3 Beds',
  u'2 Baths',
  u'2 Parkings',
  '-33.8259621',
  '151.0871'],
 [u'721/4 Marquet Street,RHODES NSW 2138',
  u'$780 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking',
  '-33.8299522',
  '151.085068'],
 [u'712/3 Timbrol Ave,RHODES NSW 2138',
  u'$730 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking',
  '-33.8263474',
  '151.0856'],
 [u'705/52-54 WALKER STREET,RHODES NSW 2138',
  u'$720 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking',
  '-33.8259621',
  '151.0871'],
 [u'113/56-58 Walker Street,RHODES NSW 2138',
  u'$700 ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking',
  '-33.82507',
  '151.0873'],
 [u'205/11 Lewis Ave,RHODES NSW 2138',
  u'$700 wk ',
  u'2 Beds',
  u'2 Baths',
  u'1 Parking',
  '-33.8338661',
  '151.083771'],
 [u'405/44 Shoreline Drive,RHODES NSW 2138',
  u'$695 ',
  u'2 Beds',
 

From the result we can see that the format is not uniform.

E.g., some prices contain 'PW' or 'Per Week' while others do not.

Write a function data_format to normalize them.

In [16]:
import re

In [29]:
def data_format(info, rent_or_sale):
    formatted_info = []
    for suburb_name, suburb_info in info.items():
        for item in suburb_info:
            try:
                address = item[0].split(',')
                address = ' '.join(address).encode('utf-8')
                
                # Drop commas and spaces, then keep the first number and an optional 'm' suffix.
                price = re.sub(r'[, ]', '', item[1])
                price = re.search(r'[\d.]+[mM]*', price)
                if not price:
                    price = None
                else:
                    price = price.group()
                    if not any(c.isdigit() for c in price):
                        price = None
                    else:
                        if price[-1] in 'mM':
                            # Scale before truncating so that '1.5m' becomes 1500000, not 1000000.
                            price = int(float(price[:-1]) * 1000000)
                        else:
                            price = int(float(price))
                        if rent_or_sale == 'rent' and not 10 < price < 5000:
                            price = None
                        if rent_or_sale == 'sale' and not 100000 < price < 5000000:
                            price = None
                
                bedroom_num = item[2].split()[0]
                if not bedroom_num.isdigit():
                    bedroom_num = None
                else:
                    bedroom_num = int(bedroom_num)
                
                bathroom_num = item[3].split()[0]
                if not bathroom_num.isdigit():
                    bathroom_num = None
                else:
                    bathroom_num = int(bathroom_num)
                
                parking_num = item[4].split()[0]
                if not parking_num.isdigit():
                    parking_num = None
                else:
                    parking_num = int(parking_num)
                
                lat, lng = round(float(item[5]), 7), round(float(item[6]), 7)
                    
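            except ValueError as e:
                # Skip the malformed record; otherwise values left over from the
                # previous item would be appended in its place.
                print e
                continue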
                
            formatted_info.append([suburb_name,
                                   address,
                                   price,
                                   bedroom_num,
                                   bathroom_num,
                                   parking_num,
                                   lat,
                                   lng,
                                   None,
                                   None])
    
    return formatted_info
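The price-cleaning step is the part of data_format most likely to need tuning, so it can help to try it in isolation. The following is a minimal sketch, not the notebook's exact logic: parse_price is a hypothetical helper, its regular expression is slightly stricter than the one above, and the plausibility ranges are copied from data_format.

```python
import re

def parse_price(text, rent_or_sale):
    """Return the price in whole dollars, or None if no plausible price is found."""
    cleaned = re.sub(r'[, ]', '', text)
    # First number in the text, optionally followed by 'm' or 'M' for millions.
    match = re.search(r'(\d+(?:\.\d+)?)([mM]?)', cleaned)
    if not match:
        return None
    value = float(match.group(1))
    if match.group(2):
        value *= 1000000
    price = int(round(value))
    # Same plausibility ranges as data_format: weekly rent vs. sale price.
    low, high = (10, 5000) if rent_or_sale == 'rent' else (100000, 5000000)
    return price if low < price < high else None

weekly = parse_price(u'$820 PER WEEK ', 'rent')   # 820
taken = parse_price(u'DEPOSIT TAKEN ', 'rent')    # None
```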

In [30]:
formatted_rent_info = data_format(rent_info, 'rent')
formatted_sale_info = data_format(sale_info, 'sale')

Obtain the latitude/longitude of Sydney CBD and of each suburb's train station.

Use Sydney Town Hall as the representative address for Sydney CBD.

In [19]:
cbd_address = 'sydney town hall'

In [20]:
from geopy.geocoders import GoogleV3

In [21]:
geolocator = GoogleV3()
cbd_location = geolocator.geocode(cbd_address)
cbd_lat, cbd_lng = cbd_location.latitude, cbd_location.longitude
cbd_lat, cbd_lng

(-33.8731575, 151.2061157)

In [22]:
suburb_stations = {}
for suburb in suburbs:
    suburb_name = suburb.split('-')[0]
    suburb_station_address = suburb_name + ' station, sydney'
    try:
        suburb_station_location = geolocator.geocode(suburb_station_address)
    except Exception:
        # Reset on failure so a result from the previous suburb is never reused.
        suburb_station_location = None
        print 'Cannot find \'%s\'' % suburb_station_address
    suburb_station_lat = round(suburb_station_location.latitude, 7) if suburb_station_location else None
    suburb_station_lng = round(suburb_station_location.longitude, 7) if suburb_station_location else None
    suburb_stations[suburb_name] = (suburb_station_lat, suburb_station_lng)
    print suburb_name, suburb_station_lat, suburb_station_lng

ashfield -33.8876216 151.1257964
auburn -33.8492228 151.0329196
burwood -33.8770228 151.103898
campsie -33.9104555 151.1032614
chatswood -33.7970939 151.1808943
eastwood -33.7901317 151.0822352
epping -33.7727449 151.082121
haymarket -33.878496 151.2090002
hurstville -33.9676282 151.1023145
marsfield -33.7775092 151.1185824
mascot -33.9233749 151.1872747
parramatta -33.817225 151.0048271
rhodes -33.8305698 151.0870736
strathfield -33.871672 151.0942856
sydney -33.8613056 151.2108216
ultimo -33.8825462 151.2015489
waterloo -33.899479 151.211378
zetland -33.906269 151.2024366


Store the train station locations in a CSV file for future use.

In [23]:
import csv

In [24]:
headers = ['Suburb', 'Latitude', 'Longitude']

with open('station_locations.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows([(suburb_name, location[0], location[1]) for suburb_name, location in suburb_stations.items()])
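To reuse the stored locations later, the file can be read back into the same suburb-to-coordinates mapping. This is a sketch under the assumption that the file has the three columns written above; load_station_locations is a hypothetical helper, and an empty cell is taken to mean that geocoding found nothing for that suburb.

```python
import csv

def load_station_locations(csv_file):
    """Read rows of Suburb, Latitude, Longitude into a dict of (lat, lng) tuples."""
    stations = {}
    for row in csv.DictReader(csv_file):
        lat, lng = row['Latitude'], row['Longitude']
        if lat and lng:
            stations[row['Suburb']] = (float(lat), float(lng))
        else:
            # csv.writer stores None as an empty cell; map it back to (None, None).
            stations[row['Suburb']] = (None, None)
    return stations
```

It takes an open file object, for example the result of open('station_locations.csv').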

Write a function calculate_distance to calculate each property's distance to Sydney CBD and to its nearest train station, in meters.

As a simplification, assume a property's nearest train station is the station of its own suburb.

In [25]:
from geopy.distance import vincenty

In [26]:
def calculate_distance(data, cbd_lat, cbd_lng, suburb_stations):
    cbd_location = (cbd_lat, cbd_lng)
    for item in data:
        suburb_name, lat, lng = item[0], item[6], item[7]
        property_location = (lat, lng)
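        suburb_station_location = suburb_stations[suburb_name]
        distance_to_cbd = vincenty(property_location, cbd_location).meters
        # A station that could not be geocoded is stored as (None, None).
        if None in suburb_station_location:
            distance_to_station = None
        else:
            distance_to_station = vincenty(property_location, suburb_station_location).meters
        item[-2], item[-1] = distance_to_cbd, distance_to_station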
    return None
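For a quick sanity check without geopy, a great-circle distance can be computed with the haversine formula. Note that this is a different technique from the vincenty function used above: it treats the Earth as a sphere rather than an ellipsoid, so results can differ by a fraction of a percent. The function name and the radius constant are assumptions of this sketch; the coordinates are the CBD and Ashfield station values printed earlier.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)

def haversine_m(point_a, point_b):
    """Great-circle distance in metres between two (lat, lng) pairs in degrees."""
    lat1, lng1 = map(radians, point_a)
    lat2, lng2 = map(radians, point_b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

cbd = (-33.8731575, 151.2061157)
ashfield_station = (-33.8876216, 151.1257964)
distance = haversine_m(ashfield_station, cbd)  # roughly 7.6 km
```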

In [31]:
calculate_distance(formatted_rent_info, cbd_lat, cbd_lng, suburb_stations)
calculate_distance(formatted_sale_info, cbd_lat, cbd_lng, suburb_stations)

Write the formatted rent and sale data to CSV files.

In [32]:
headers = [
    'Suburb',
    'Address',
    'Price',
    'Bedrooms',
    'Bathrooms',
    'Parkings',
    'Latitude',
    'Longitude',
    'Distance_to_CBD',
    'Distance_to_station'
]

with open('rent_data.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(formatted_rent_info)

with open('sale_data.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(formatted_sale_info)