In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import json
from datetime import datetime
import requests

# Extracting data from single listing

Let's start by extracting the most interesting data from a single listing. Once that works, scraping multiple offers should not be an obstacle. Building a whole pipeline for multiple offers first and then failing at the last step would be a waste of time.

In [7]:
# url of first listing
listing_url = 'https://www.rightmove.co.uk/properties/97077389#/'

In [8]:
# The first request is the moment of truth: anything other than Response [200] probably means that we cannot scrape this website
requests.get(listing_url)

<Response [200]>

In [9]:
# Opening the url with BeautifulSoup and investigating where we can find the key data
page = requests.get(listing_url)
bs = BeautifulSoup(page.text, 'lxml')
bs.prettify

<bound method Tag.prettify of <!DOCTYPE html>
<html lang="en-GB">
<head>
<meta charset="utf-8"/>
<title>2 bedroom flat for rent in Whiteheads Grove, Chelsea, SW3</title>
<meta content="width=device-width, shrink-to-fit=no, initial-scale=1.0, user-scalable=yes" name="viewport"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="2 bedroom flat for rent in Whiteheads Grove, Chelsea, SW3 - Rightmove." name="description"/>
<!-- Favicons -->
<link href="//www.rightmove.co.uk/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="Rightmove" name="apple-mobile-web-app-title"/>
<meta content="Rightmove" name="application-name"/>
<meta content="#262637" name="theme-color"/>
<meta content="http://www.rightmove.co.uk/properties/97077389" property="og:url"/>
<meta content="Check out this 2 bedroom flat for rent on Rightmove" property="og:title"/>
<meta content="website" property="og:type"/>
<meta content="

A first encounter with the output can be quite intimidating. Let's try some reverse engineering: after looking at the offer online, we know that one of the key features we want to extract is the monthly price of '2,817'. I recommend searching the listing HTML for the value 2,817 to see where it appears.

Since the data we are looking for will be in JSON format, we should be most interested in something similar to "price":"2,817". The 5th instance of '2,817' looks similar: "primaryPrice":"£2,817 pcm".

Scrolling a few lines up we can see that this part is hidden within a script block with type="text/javascript". 
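The same search can be done in code rather than by eye. The following is a minimal sketch, assuming the raw HTML is available as a string; the short `sample_html` is a hypothetical stand-in for `page.text`, and the helper name `find_with_context` is illustrative only.

```python
import re

def find_with_context(text, needle, width=30):
    """Return every occurrence of `needle` with `width` characters of context on each side."""
    snippets = []
    for match in re.finditer(re.escape(needle), text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets

# Hypothetical stand-in for page.text
sample_html = (
    '<title>2 bedroom flat for rent</title>'
    '<script type="text/javascript">'
    'window.PAGE_MODEL = {"propertyData":{"prices":{"primaryPrice":"£2,817 pcm"}}}'
    '</script>'
)

# Print each hit with its surroundings to see which tag or key it lives in
for snippet in find_with_context(sample_html, "2,817"):
    print(snippet)
```

Printing the surrounding characters makes it easier to spot which occurrence sits inside a JSON-like structure and which ones are just display text.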

## Locating json with key data

In [19]:
# Extracting script with "text/javascript" type
bs.findAll("script",{'type':"text/javascript"})[0]

<script type="text/javascript">
    window.PAGE_MODEL = {"propertyData":{"id":"97077389","status":{"published":true,"archived":false},"text":{"description":"<p>An absolutely stunning two bedroom apartment located just off the sought after Sloane Avenue.</p> <p>This great value property comes unfurnished and consists of a spacious living room, a separate kitchen with Miele appliances, two double bedrooms and two recently refurbished bathrooms. The property also benefits from being in a very secure and well looked after development with a porter.</p> <p>The property is equidistant between South Kensington and Sloane Square underground stations. </p>","propertyPhrase":"2 bedroom flat","disclaimer":"<b>Disclaimer</b> - Property reference 13632. The information displayed about this property comprises a property advertisement. Rightmove.co.uk makes no warranty as to the accuracy or completeness of the advertisement or any linked or associated information, and Rightmove has no control over th

In [22]:
# Stripping the html tag and keeping only its string content
script_text=bs.findAll("script",{'type':"text/javascript"})[0].text

In [23]:
script_text

'\n    window.PAGE_MODEL = {"propertyData":{"id":"97077389","status":{"published":true,"archived":false},"text":{"description":"<p>An absolutely stunning two bedroom apartment located just off the sought after Sloane Avenue.</p> <p>This great value property comes unfurnished and consists of a spacious living room, a separate kitchen with Miele appliances, two double bedrooms and two recently refurbished bathrooms. The property also benefits from being in a very secure and well looked after development with a porter.</p> <p>The property is equidistant between South Kensington and Sloane Square underground stations.\xa0</p>","propertyPhrase":"2 bedroom flat","disclaimer":"<b>Disclaimer</b> - Property reference 13632. The information displayed about this property comprises a property advertisement. Rightmove.co.uk makes no warranty as to the accuracy or completeness of the advertisement or any linked or associated information, and Rightmove has no control over the content. This property a

Analyzing "script_text" above, we can see that everything after 'window.PAGE_MODEL =' has the form of a dictionary, which suggests that this is the json data we are looking for.

In [24]:
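# Use regex to extract json data from the script text
# re.DOTALL lets '.' match newlines; an inline (?s) flag in the middle of a pattern is an error in Python 3.11+
script_json=re.findall(r"(?<=window\.PAGE_MODEL = )(.*$)", script_text, flags=re.DOTALL)[0]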

In [27]:
# Transforming json data within string into dictionary
json_dict=json.loads(script_json)

In [29]:
# We managed to successfully extract json data into a Python dict
type(json_dict)

dict

In [28]:
json_dict.keys()

dict_keys(['propertyData', 'renderProperties', 'isAuthenticated', 'analyticsInfo'])
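The regular expression above takes everything after the marker up to the end of the script, so `json.loads` would fail if the script ever carried trailing content such as a semicolon. A hedged alternative sketch uses `json.JSONDecoder.raw_decode` from the standard library, which parses one JSON value and ignores whatever follows. The helper name `extract_page_model` and the `sample_script` string are assumptions made for illustration.

```python
import json

def extract_page_model(script_text, marker="window.PAGE_MODEL = "):
    """Parse the JSON object that follows `marker`, ignoring anything after it."""
    # Position of the first character after the marker (expected to be '{')
    start = script_text.index(marker) + len(marker)
    # raw_decode parses a single JSON value and also returns the index where it ended
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data

# Hypothetical script text with a trailing semicolon and newline
sample_script = '\n    window.PAGE_MODEL = {"propertyData": {"id": "97077389", "bedrooms": 2}};\n'

model = extract_page_model(sample_script)
print(model["propertyData"]["id"])
```

If the marker is missing, `str.index` raises a `ValueError`, which is a clearer failure than an `IndexError` from indexing an empty `findall` result.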

In [12]:
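def extract_listing_json(listing_url):
    # Download the listing page and parse the html
    page = requests.get(listing_url)
    bs = BeautifulSoup(page.text, 'lxml')
    
    # The first "text/javascript" script holds the window.PAGE_MODEL assignment
    js_script_text=bs.findAll("script",{'type':"text/javascript"})[0].text
    # re.DOTALL lets '.' match newlines; an inline (?s) flag mid-pattern is an error in Python 3.11+
    script_json=re.findall(r"(?<=window\.PAGE_MODEL = )(.*$)",js_script_text,flags=re.DOTALL)[0]
    json_data=json.loads(script_json)
    
    return json_data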
    

In [13]:
listing_json=extract_listing_json(listing_url)

  


## Extracting key data from listing json

Let's analyze the content of listing_json to see what data we can extract. For this part of the tutorial we will focus only on the 'propertyData' key of the dictionary, as it holds the most interesting data.

In [14]:
listing_json.keys()

dict_keys(['propertyData', 'renderProperties', 'isAuthenticated', 'analyticsInfo'])

In [30]:
listing_json["propertyData"].keys()

dict_keys(['id', 'status', 'text', 'prices', 'address', 'keyFeatures', 'images', 'floorplans', 'virtualTours', 'customer', 'industryAffiliations', 'rooms', 'location', 'streetView', 'nearestAirports', 'nearestStations', 'showSchoolInfo', 'countryGuide', 'channel', 'propertyUrls', 'sizings', 'brochures', 'epcGraphs', 'bedrooms', 'bathrooms', 'transactionType', 'tags', 'misInfo', 'dfpAdInfo', 'staticMapImgUrls', 'listingHistory', 'feesApply', 'broadband', 'contactInfo', 'lettings', 'infoReelItems'])

In [34]:
listing_json['propertyData']

{'id': '97077389',
 'status': {'published': True, 'archived': False},
 'text': {'description': '<p>An absolutely stunning two bedroom apartment located just off the sought after Sloane Avenue.</p> <p>This great value property comes unfurnished and consists of a spacious living room, a separate kitchen with Miele appliances, two double bedrooms and two recently refurbished bathrooms. The property also benefits from being in a very secure and well looked after development with a porter.</p> <p>The property is equidistant between South Kensington and Sloane Square underground stations.\xa0</p>',
  'propertyPhrase': '2 bedroom flat',
  'disclaimer': '<b>Disclaimer</b> - Property reference 13632. The information displayed about this property comprises a property advertisement. Rightmove.co.uk makes no warranty as to the accuracy or completeness of the advertisement or any linked or associated information, and Rightmove has no control over the content. This property advertisement does not co

In [31]:
listing_json['propertyData']["id"]

'68374505'

In [32]:
listing_json['propertyData']["prices"]

{'primaryPrice': '£1,000 pcm',
 'secondaryPrice': '£231 pw',
 'displayPriceQualifier': '',
 'exchangeRate': None}

In [33]:
listing_json['propertyData']["keyFeatures"]

[]

In [34]:
listing_json['propertyData']["location"]

{'latitude': 51.62912,
 'longitude': -0.12552,
 'circleRadiusOnMap': 0,
 'zoomLevel': 15,
 'pinType': 'APPROXIMATE_POINT'}

In [35]:
listing_json['propertyData']["lettings"]

{'letAvailableDate': None,
 'deposit': None,
 'letType': 'Long term',
 'furnishType': None}

In [30]:
listing_json['propertyData']["bedrooms"]

2

In [31]:
listing_json['propertyData']["bathrooms"]

2

In [32]:
listing_json['propertyData']["sizings"]

[]

In [35]:
# We will focus on extracting data stored within these keys for each listing
main_data_keys=['prices','address','rooms','bedrooms','bathrooms','location']

In [95]:
def extract_listing_data(listing_json):
    data_json={}
    property_data=listing_json['propertyData']
    
    #First extract the listing id (skipped if it is missing)
    if "id" in property_data:
        data_json["id"]=property_data["id"]
    for k in main_data_keys:
        
        #If a key contains a subdictionary, then extract all the key:value pairs
        if isinstance(property_data[k], dict):
            for k2, data_point in property_data[k].items():
                data_json[k2]=data_point
                    
        #If the base key value is not a dictionary, extract the value
        else:
            data_json[k]=property_data[k]
    
    #As 'sizings' key has more layers of subdictionaries, we will focus only on 2 values 
    #Listings without size information (e.g. an empty 'sizings' list) simply skip these fields
    try:
        data_json["property_size"]=property_data["sizings"][0]['minimumSize']
        data_json["area_unit"]=property_data["sizings"][0]['unit']
    except (KeyError, IndexError, TypeError):
        pass
    return data_json
    

    

In [96]:
# Output from a single listing
extract_listing_data(listing_json)

{'id': '72742005',
 'primaryPrice': '£1,000 pcm',
 'secondaryPrice': '£231 pw',
 'displayPriceQualifier': '',
 'exchangeRate': None,
 'displayAddress': 'Brockley Road, London, SE4',
 'countryCode': 'GB',
 'outcode': 'SE4',
 'rooms': [],
 'bedrooms': 1,
 'bathrooms': 1,
 'latitude': 51.455734,
 'longitude': -0.03676,
 'circleRadiusOnMap': 0,
 'zoomLevel': 15,
 'pinType': 'ACCURATE_POINT'}
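Because the sub-dictionary keys are merged into one flat dictionary, two sections that share a key name would silently overwrite each other. The following sketch shows one way to guard against that by prefixing each child key with its parent key. The function name `flatten_with_prefix` and the trimmed-down `sample` dictionary are hypothetical, and the resulting column names differ from those used in the rest of this notebook.

```python
def flatten_with_prefix(property_data, keys):
    """Flatten selected keys; sub-dictionary entries become '<parent>_<child>'."""
    flat = {}
    for key in keys:
        # .get returns None for keys that are absent from this listing
        value = property_data.get(key)
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

# Hypothetical, trimmed-down propertyData
sample = {
    "prices": {"primaryPrice": "£1,000 pcm", "secondaryPrice": "£231 pw"},
    "bedrooms": 1,
    "location": {"latitude": 51.455734, "longitude": -0.03676},
}

print(flatten_with_prefix(sample, ["prices", "bedrooms", "location", "rooms"]))
```

The trade-off is longer column names in exchange for knowing which section each value came from.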

# Extracting multiple offers from page

Having successfully extracted data from a single listing, we need to move up the page hierarchy and carry out two steps that will let us extract data from multiple offers:
 - Get the list of all listings on the first page by extracting the appropriate href
 - Iterate through multiple pages and repeat the previous step

## Locating listing href

In [44]:
# Let's start from first page - this is also where you want to apply all the filters and sorting options
first_page_url='https://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&propertyTypes=&includeLetAgreed=false&mustHave=&dontShow=&furnishTypes=&keywords='

In [45]:
requests.get(first_page_url)

<Response [200]>

In [46]:
page = requests.get(first_page_url)
bs = BeautifulSoup(page.text, 'lxml')
bs.prettify

<bound method Tag.prettify of <!DOCTYPE html>
<html class="is-not-modern property-to-rent channel-based-property-types channel--rent" lang="en-GB">
<head>
<meta charset="utf-8"/>
<title>Properties To Rent in London | Rightmove</title>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, shrink-to-fit=no, initial-scale=1.0, user-scalable=no" name="viewport"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="True" name="HandheldFriendly"/>
<meta content="Flats &amp; Houses To Rent in London - Find properties with Rightmove - the UK's largest selection of properties." name="description"/>
<meta content="origin-when-cross-origin" name="referrer"/><link crossorigin="" href="https://media.rightmove.co.uk:443" rel="preconnect"/>
<link crossorigin="" href="//product.rightmove.co.uk" rel="preconnect"/><link href="/pvw/images/favicons/rebranded/favicon.ico" rel="shortcut icon"/><link href="/pvw/images/favicons/rebranded/apple-touch-icon-7

In [47]:
# We know that the link to each listing will be stored within an href, so let's look at some of the references on the first page
for item in bs.findAll("a",href=True):

        print("{} \n".format(item))

<a aria-label="Go to Rightmove homepage" class="seo-logo" href="/">
<svg height="25" viewbox="0 0 118 25" width="118" xmlns="http://www.w3.org/2000/svg">
<g fill="none" fill-rule="evenodd">
<path d="M114.114 13.262h2.022V7.088l-5.56-4.662-5.56 4.662v6.174h2.021l3.554-2.993 3.523 2.993zm2.259 1.764h-2.907l-2.89-2.441-2.907 2.441h-2.906a1.507 1.507 0 0 1-1.517-1.528l.016-6.52c0-.41.158-.788.442-1.071l.063-.063 6.809-5.718 6.887 5.78c.284.284.442.662.442 1.072l.016 6.52c0 .41-.158.788-.442 1.087-.285.3-.695.441-1.106.441z" fill="#00DEB6"></path>
<path d="M60.185 8.631c-.695 0-1.359.126-2.006.378a4.505 4.505 0 0 0-1.627 1.071 4.517 4.517 0 0 0-.474-.504c-.19-.173-.411-.33-.68-.472a4.138 4.138 0 0 0-.884-.347c-.348-.094-.727-.126-1.153-.126-.711 0-1.327.126-1.849.394-.52.268-.963.599-1.295 1.008l-.11-1.15h-3.191V19.31h3.333v-5.607c0-.661.19-1.213.584-1.638.395-.425.9-.646 1.548-.646.569 0 .995.174 1.28.504.284.347.426.82.426 1.45v5.937h3.223v-5.607c0-.661.19-1.213.584-1.638.38-.425.9-.646 1

Again, we begin with quite a few hrefs and it's hard to extract any meaning from them. I would suggest looking at one of the offers and finding a keyword to search for within the href output.

In my case I looked up a listing directly on the website (avoid the first few, as they can change in a matter of seconds) and found a property with "Belvedere" in its name. This will surely be present in the desired href, which will take us to the offer.

After searching for "Belvedere" we can see that the name is found in an 'a' tag, defining a hyperlink with class="propertyCard-link".

In [58]:
# Let's extract only the 'a' tags with the desired class - we can see that the 5th one is the "Belvedere" property we used to find the reference
bs.findAll("a",{'class':'propertyCard-link'})[4]

<a class="propertyCard-link" data-test="property-details" href="/property-to-rent/property-82780999.html">
<h2 class="propertyCard-title" itemprop="name">
            1 bedroom flat share        </h2>
<address class="propertyCard-address" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<meta content="Large Double Room - Belvedere Road, SE1" itemprop="streetAddress"/>
<meta content="GB" itemprop="addressCountry"/>
<span>Large Double Room - Belvedere Road, SE1</span>
</address>
</a>

In [49]:
# We can get the direct hyperlink by calling the 'href' key from attrs dictionary 
bs.findAll("a",{'class':'propertyCard-link'})[0].attrs['href']

'/property-to-rent/property-97203149.html'

In [51]:
# Let's take a look at the initial 10 links
for item in bs.findAll("a",{'class':'propertyCard-link'})[:10]:

        listing_href=(item.attrs['href'])
        print("{} \n".format(listing_href))

/property-to-rent/property-72233850.html 

/property-to-rent/property-72233850.html 

/property-to-rent/property-96395564.html 

/property-to-rent/property-96395564.html 

/property-to-rent/property-94660964.html 

/property-to-rent/property-94660964.html 

/property-to-rent/property-81144568.html 

/property-to-rent/property-81144568.html 

/property-to-rent/property-84544456.html 

/property-to-rent/property-84544456.html 



The hyperlinks to the offers look OK. All we need to do is prepend the 'https://www.rightmove.co.uk' address, after which you can test them directly in the browser; they will lead you straight to the listings. Note that each link appears twice in the output, so duplicates have to be removed. Let's wrap all of this up into a single function.
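Plain string concatenation is enough for these hrefs, since they all start with a slash. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. The sketch below uses two hrefs from the output above; the constant name `BASE_URL` is an assumption.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.rightmove.co.uk"

hrefs = [
    "/property-to-rent/property-72233850.html",
    "/property-to-rent/property-96395564.html",
]

# urljoin resolves each relative href against the base address
full_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(full_urls[0])
```

An href that is already a full URL is returned unchanged by `urljoin`, whereas concatenation would produce a broken address.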

In [77]:
# The page_urls are fed as list so we can use same function to extract offers from single, or multiple pages
def extract_listing_url(page_urls_list):
    listings_list=[]
    for page_url in page_urls_list:
        # Open the page url (a single request per page)
        page = requests.get(page_url)
        bs = BeautifulSoup(page.text, 'lxml')

        # Extract hyperlinks to all offers from the page
        for item in bs.findAll("a",{'class':'propertyCard-link'}):
            listing_url='https://www.rightmove.co.uk'+(item.attrs['href'])
            listings_list.append(listing_url)
    
    # Get rid of duplicates by changing list into set and back to list
    listings_list=list(set(listings_list))
    
    return listings_list

In [78]:
first_page_listing=extract_listing_url([first_page_url])

In [79]:
# Check how many offers we managed to extract from a single page - 25 looks ok
len(first_page_listing)

25
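Converting to a set and back removes duplicates but does not keep the listings in page order. If the order matters, for example because the search results were sorted, an order-preserving variant is a small change. This is a sketch with a hypothetical helper name, `unique_in_order`, relying on the fact that Python dictionaries keep insertion order.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item in place."""
    # dict keys are unique and remember the order in which they were inserted
    return list(dict.fromkeys(items))

# Each link appears twice on the page, as in the output of the first 10 links
links = [
    "/property-to-rent/property-72233850.html",
    "/property-to-rent/property-72233850.html",
    "/property-to-rent/property-96395564.html",
    "/property-to-rent/property-96395564.html",
]

print(unique_in_order(links))
```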

## Iterate to next pages

We are on the home stretch: all we need now is to get multiple pages of listings. This turned out to be easier than for the majority of websites, as Rightmove uses a GET request to display new pages, which means that all the key variables are present in the url.

After comparing the urls of the first and second pages, we can see that movement between pages is controlled by the 'index' variable, which is the index of the first displayed offer. The second page has an index of 24, the third 48, and so on. We can create a list of the following pages by manipulating the index value.

In [80]:
def get_n_page_urls(first_page_url,n):
    
    pages_urls_list=[first_page_url]
    for i in range(1,n):
        listing_index=i*24
        nth_page_url=first_page_url+'&index='+str(listing_index)
        pages_urls_list.append(nth_page_url)
    
        
    return(pages_urls_list)

In [81]:
# Checking the first 5 urls we managed to build - I recommend pasting one of them into the browser address bar to check the outcome
get_n_page_urls(first_page_url,5)

['https://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&propertyTypes=&includeLetAgreed=false&mustHave=&dontShow=&furnishTypes=&keywords=',
 'https://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&propertyTypes=&includeLetAgreed=false&mustHave=&dontShow=&furnishTypes=&keywords=&index=24',
 'https://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&propertyTypes=&includeLetAgreed=false&mustHave=&dontShow=&furnishTypes=&keywords=&index=48',
 'https://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&propertyTypes=&includeLetAgreed=false&mustHave=&dontShow=&furnishTypes=&keywords=&index=72',
 'https://www.rightmove.co.uk/property-to-rent/find.html?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&propertyTypes=&includeLetAgreed=false&mustHave=&dontShow=
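Appending '&index=' works here because the first-page url has no 'index' parameter yet. A slightly more defensive sketch, using only the standard library, replaces or adds the parameter regardless of what the url already contains. The helper name `with_index` is hypothetical, and the example url is a shortened version of the one used above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_index(url, index):
    """Return `url` with its 'index' query parameter set to `index`."""
    parts = urlsplit(url)
    # Keep every existing parameter (including blank ones) except a previous 'index'
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "index"
    ]
    query.append(("index", str(index)))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Shortened example url that already points at the second page
second_page = (
    "https://www.rightmove.co.uk/property-to-rent/find.html"
    "?locationIdentifier=REGION%5E87490&minPrice=1000&sortType=1&index=24"
)

print(with_index(second_page, 48))
```

With this approach it does not matter whether the starting url was copied from the first page or from a later one.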

## Extract listings URLs from 10 first pages

As a last check, let's combine the two previous steps (locating the listing hrefs and iterating over pages) to extract listing urls for the first 10 pages.

In [82]:
# Create list of 10 initial pages
n=10
pages_urls_list=get_n_page_urls(first_page_url,n)

In [None]:
# Extract listings urls from 10 initial pages
listings_urls_list=extract_listing_url(pages_urls_list)

In [None]:
# We managed to extract 250 offers - 25 per page, which means everything works as expected
len(listings_urls_list)

In [None]:
# Let's see first 10
listings_urls_list[:10]

# Extract data from multiple offer listings

We have all the components we need:
- a list of n following page urls
- listings urls extracted from the page urls
- function to extract data from a single listing url

All that is left is to combine data from all the listings into a suitable format; a dataframe is preferred for further analysis.

We will do that in two steps. First, we append each listing's data to a list, so that the list holds one json-style dictionary per offer. Then we use json.dumps and pd.read_json to convert the list into a df.

In [None]:
def extract_multiple_listings_data(listings_urls_list):
    json_list = []
    total = len(listings_urls_list)
    for i, listing_url in enumerate(listings_urls_list):
        # Print progress every 100 listings
        if i % 100 == 0:
            print("Extraction: {}/{} at:{}".format(i+1,total,str(datetime.now())[:-7] ))
        # Extract data in json for each listing url
        listing_json = extract_listing_json(listing_url)
        listing_data = extract_listing_data(listing_json)
        
        # Append data in json to a list
        json_list.append(listing_data)
        
    print("\n------------------------------------------\n")
    print("Successfully Extracted: {}/{} offers".format(len(json_list),total))
    
    return json_list

In [None]:
# Test the function on first 100 urls
json_list=extract_multiple_listings_data(listings_urls_list[:100])
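The loop above stops at the first listing that fails to download or parse, and it sends requests as fast as it can. A hedged sketch of a more forgiving loop is shown below: it pauses between requests and records failures instead of raising. The names `scrape_all` and `fake_fetch` are hypothetical; in this notebook the `fetch` argument could be a small function that calls extract_listing_json and then extract_listing_data.

```python
import time

def scrape_all(urls, fetch, delay_seconds=1.0):
    """Call `fetch(url)` for each url, pausing between calls and collecting failures."""
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as error:  # broad on purpose: one bad listing should not stop the run
            failed.append((url, repr(error)))
        # Pause so the requests are spread out over time
        time.sleep(delay_seconds)
    return results, failed

# Hypothetical fetcher used only to demonstrate the control flow without any network access
def fake_fetch(url):
    if url.endswith("bad"):
        raise ValueError("no PAGE_MODEL found")
    return {"url": url}

results, failed = scrape_all(["a", "bad", "b"], fake_fetch, delay_seconds=0)
print(len(results), len(failed))
```

Returning the failed urls alongside the results makes it possible to retry only those listings later.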

In [None]:
# Each list element is a separate json dictionary containing the extracted data from one listing
json_list[:5]

In [None]:
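# The last step we need to take is to convert the list into json with json.dumps and then load it into a dataframe.
# orient='records' matches the list-of-dicts layout ('id' is not a valid orient value);
# StringIO avoids passing a raw json string to read_json, which newer pandas versions warn about.
from io import StringIO
data_json=json.dumps(json_list)
df=pd.read_json(StringIO(data_json), orient='records')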

In [None]:
df.head()

It seems we managed to successfully extract the property data and transform it into a df.

In [None]:
df.shape

In [None]:
df.id.unique().shape

# Wrap everything up into an easy-to-use class

The code above was written to make each step as clear as possible, which leaves it quite messy in terms of usability. We can, however, transform it into a simple class that lets us execute the whole scraping process in a few quick steps.

In [None]:
from io import StringIO

class Scrape_Rightmove:
    '''
    Class to extract data from listings featured on Rightmove and convert it to a DataFrame. 
    The methods should be called from top to bottom, as each one depends on the previous one. 
    '''
    
    def __init__(self, first_page_url, pages_count):
        self.pages_count=pages_count
        self.pages_urls_list=get_n_page_urls(first_page_url,pages_count)
        print("Initialized class to scrape {} pages starting with {}".format(pages_count,first_page_url))
        
    def extract_listings_ulrs(self):
        self.listings_urls_list=extract_listing_url(self.pages_urls_list)
        print("Extracted {} listings from {} pages".format(len(self.listings_urls_list),self.pages_count))
        
    
    def extract_listings_data(self):
        self.json_list=extract_multiple_listings_data(self.listings_urls_list)

        
    def convert_json_to_df(self):
        
        data_json=json.dumps(self.json_list)
        # orient='records' matches the list-of-dicts layout of json_list ('id' is not a valid orient value)
        self.df_data=pd.read_json(StringIO(data_json), orient='records')
        return self.df_data
        
    

In [None]:
scrapper = Scrape_Rightmove(first_page_url, 10)

In [None]:
scrapper.extract_listings_ulrs()

In [None]:
scrapper.extract_listings_data()

In [None]:
df=scrapper.convert_json_to_df()

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.id.unique().shape