# Web scraping

### Learning outcomes
- learn how we can get data from website content
- learn to get website HTML using Python's `requests` library
- learn to parse HTML using xpath


So far we've looked at how we can get data from APIs, databases and files that we have locally. But what if we don't have any data? 

Or what if we want to get data from some website which doesn't have an API?

In this case we could go to that website and copy the data into some workable format like a CSV file, one feature at a time. But this doesn't work well when you need a large dataset.

Instead, we can write Python code to do this for us.

## The challenge: Scraping house data

The problem we are faced with is the classic house price prediction task, for all of the houses on the property website Zoopla. Zoopla doesn't provide its data in a structured format for free; it has no API, and no structured files that we can download.

[This link shows houses listed for sale in London](https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes)

If we want to build a dataset of houses from Zoopla's website, then we would have to:

1. Get the webpage content.
2. Find the elements containing the features of each example house.
3. Format them into a dictionary per example.
    
Let's do this now.

## Step 1: Get website content

### What format does information on a website exist in?

We know that websites don't just print data in a nice CSV or JSON format. 
They have content, like buttons and text, that is laid out on the page in a way that makes sense to you. 
This content is defined in an HTML file.

They also have styling, which is defined separately using CSS.

#### What is HTML?

HTML stands for HyperText Markup Language. It consists of a tree structure of different types of web elements, like buttons, page divisions, images and more. This means that it is used to define what **content** is rendered on any webpage that you visit.

HTML markup contains elements (written as tags) that may contain other elements.

Here is an example of some HTML markup, and the page that it is rendered as.

![](./images/example_html.png)

### How can we get the website HTML, which contains data that we want?

When you enter a URL in a browser, here's what happens:
- Your browser makes a GET request to the computer (the server) that serves requests for that URL.
- This server knows what web content to send back, so it sends it in a response to the request. This content includes the HTML of the page that you want to view.
- Your browser gets the HTML, and knows how to present that type of data to you (it renders the webpage).

The point here being that you can get the HTML, which defines the content for any site, by making a GET request to that website.
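
To make the idea of a GET request more concrete, the sketch below builds one with `requests` without sending it, so that we can look at its parts. It assumes the London listings link from earlier; the query parameters are simply the ones visible in that link.

```python
import requests

# Build (but do not send) a GET request for the London listings page.
# The base URL and query parameters are taken from the link earlier in this section.
request = requests.Request(
    'GET',
    'https://www.zoopla.co.uk/new-homes/property/london/',
    params={
        'q': 'London',
        'results_sort': 'newest_listings',
        'search_source': 'new-homes',
    },
)
prepared = request.prepare()

print(prepared.method)  # the HTTP method of the request
print(prepared.url)     # the full URL, with the parameters encoded into the query string
```

Calling `requests.get` with the same URL and `params` builds and sends an equivalent request in one step.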

Let's try that!

We can use the `requests` library to get the HTML from a website.


In [13]:
import requests # import the requests library
r = requests.get('https://www.zoopla.co.uk') # make an HTTP GET request to this website
html_string = r.text # the text attribute of this response is the HTML as a string
print(html_string)


<!DOCTYPE html>
<html class="no-js" xmlns="https://www.w3.org/1999/xhtml" xml:lang="en" lang="en-GB" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/">
<head>

    
    <meta charset="utf-8" />
    <title>Zoopla &gt; Search Property to Buy, Rent, House Prices, Estate Agents</title>
    

        <meta name="description" content="Search for property with the UK's leading resource. Browse houses and flats for sale and to rent, and find estate agents in your area." />

        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

    <meta name="p:domain_verify" content="be06f9b9d493189acc5cb456ca9a68e1" />
    <meta name="verify-v1" content="5sGBRIBu2OitdItUbuozS0V22pWw4WXsMP8K15vIKD0=" />
    <meta name="msvalidate.01" content="20BD4007B12CDB4977964997AB208472" />
    <meta name="y_key" content="1a4dcbe953cadecb" />
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <link rel="shortcut icon" type="image/x-icon

## Step 2: Finding the data by Parsing HTML

Now that we have the page which we know contains the data, we need to find the information that we care about within it. 

The process of converting the string of text, which we know represents HTML, into a structured representation of that HTML, and often extracting the information that we want from it, is called **parsing**.

### Converting the string to an HTML tree

HTML is just a nested set of elements, all inside a single `<html>...</html>` tag. 
By looking at this top `html` node, and branching its child nodes from it recursively, we can visualise the webpage as a tree. This is shown below.

![](./images/html_tree.jpg)


We can use the `lxml` library to turn the flat string into a tree data structure. 
`lxml` has a module called `etree`, which is a simple and efficient API for parsing and creating XML or HTML data.
We will use its `HTML` function, which parses a string of HTML and returns the root element of the tree.

See more about `lxml.etree` [here](https://lxml.de/api/)

In [14]:
from lxml import etree

tree = etree.HTML(html_string) # use the HTML function to parse the string into a tree of elements
print(tree)
print(etree.tostring(tree)) # print the tree

<Element html at 0x18c01fc4648>
b'<html class="no-js" xmlns="https://www.w3.org/1999/xhtml" xml:lang="en" lang="en-GB" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/">\n<head>\n\n    \n    <meta charset="utf-8"/>\n    <title>Zoopla &gt; Search Property to Buy, Rent, House Prices, Estate Agents</title>\n    \n\n        <meta name="description" content="Search for property with the UK\'s leading resource. Browse houses and flats for sale and to rent, and find estate agents in your area."/>\n\n        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>\n\n    <meta name="p:domain_verify" content="be06f9b9d493189acc5cb456ca9a68e1"/>\n    <meta name="verify-v1" content="5sGBRIBu2OitdItUbuozS0V22pWw4WXsMP8K15vIKD0="/>\n    <meta name="msvalidate.01" content="20BD4007B12CDB4977964997AB208472"/>\n    <meta name="y_key" content="1a4dcbe953cadecb"/>\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"/>\n    <link rel="sho

This tree object that we've converted the string HTML into now has methods that can be used to find elements within it. 

Check out the `lxml.etree` tutorial [here](https://lxml.de/3.1/tutorial.html).
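
To see the tree structure for ourselves, we can walk through an element's children recursively. The sketch below uses a small made-up HTML string rather than the Zoopla page, and the helper name `tree_lines` is our own, not part of `lxml`.

```python
from lxml import etree

# A small made-up page, used only for illustration
html = "<html><head><title>Example</title></head><body><div><button>Search</button></div></body></html>"
tree = etree.HTML(html)

def tree_lines(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ['  ' * depth + element.tag]
    for child in element:  # iterating over an element yields its direct children
        lines.extend(tree_lines(child, depth + 1))
    return lines

print('\n'.join(tree_lines(tree)))
```

Each level of indentation in the printed output corresponds to one level of nesting in the HTML.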

### Finding tree elements within a `lxml.etree` using xpath

XPath is a query language for selecting nodes/branches/elements within a tree-like data structure like HTML or XML. 

Below is a very simple xpath expression. This one finds all of the button elements in the HTML.

## `//button` 

The `//` says "anywhere in the tree" and the `button` says find elements that have the tag type button. So this xpath expression says "find button tags anywhere within the tree"

The `xpath` method of an `lxml.etree` element takes in an xpath expression and returns a list of all elements in the tree that match it.

Below are more examples of how to use xpath

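`./button` - finds button tags that are direct **children** of the current element (not all descendants). Note that an expression starting with a single `/`, like `/button`, is matched from the root of the document instead.

`//div/button` - finds all of the button tags that are direct children of div tags, anywhere on the page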

`//div[@id='custom_id']` - finds all div tags with the attribute (`@`) `id` equal to `custom_id`, anywhere on the page 

If any of these don't make sense, let us know or look it up.
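
Here is a short sketch of some of these expressions in action. It runs on a small made-up HTML string, not on the Zoopla page, so the element ids and button texts are only illustrative.

```python
from lxml import etree

# A small made-up page, used only to illustrate the expressions above
html = """
<html>
  <body>
    <div id="custom_id">
      <button>Search</button>
      <span><button>Nested</button></span>
    </div>
    <button>Outside</button>
  </body>
</html>
"""
tree = etree.HTML(html)

all_buttons = tree.xpath('//button')                # every button anywhere in the tree
div_buttons = tree.xpath('//div/button')            # buttons that are direct children of a div
custom_divs = tree.xpath("//div[@id='custom_id']")  # divs whose id attribute is custom_id

print([b.text for b in all_buttons])
print([b.text for b in div_buttons])
print(len(custom_divs))
```

Notice that the "Nested" button is matched by `//button` but not by `//div/button`, because its direct parent is a span rather than a div.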

Use the `//button` xpath expression as an argument to find the button on the page

In [15]:
buttons = tree.xpath('//button')
print(buttons) # returned from xpath expression

[<Element button at 0x18c02233bc8>]


The elements of the tree that match the xpath expression are returned from the call to the `.xpath` method of the tree. To see the text that each of the buttons contain we can check out their `.text` attribute. Let's use a list comprehension to explore this, and a few other attributes.

In [16]:
btn_texts = [b.text for b in buttons] # map each button to their text
print(btn_texts)

['Search']


## Using the developer console to identify the right xpath

### How to open the console

Modern browsers come with tools to maximise web developers' productivity and help find bugs.

The developer console has a lot of different tools. 

Open your element inspector by pressing `CTRL + SHIFT + C`.
It should open on the right hand side of your screen as shown below.

The elements tab of the developer console shows you the HTML and CSS that make up the website code (actually it shows the DOM. Read more about what exactly the DOM is [here](https://css-tricks.com/dom/)).

You can always close the developer console by clicking the cross in the corner. 

Pressing `CTRL + SHIFT + J` opens the JavaScript console, and closes the developer console if it is already open.

![title](./images/dev_console_opened.png)

Check out the Zoopla website for yourself. Try using your selector to see the HTML structure of the page.
![](./images/form_selector.png)

Now use your selector to find the location of the button as shown below.

![](./images/button_selector.png)

As mentioned, the selector allows us to visualise the DOM and find elements within our webpage.


### We can find elements, and then search for elements within them!

Elements returned from finding them by xpath also have the same search methods. They are the same object type.
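
As a sketch of what this looks like, the example below first finds a div and then searches inside it. The HTML is a made-up string and the `search` id is only an assumption for illustration.

```python
from lxml import etree

# A small made-up page, used only for illustration
html = """
<html><body>
  <div id="search"><form><button>Search</button></form></div>
  <button>Menu</button>
</body></html>
"""
tree = etree.HTML(html)

search_div = tree.xpath("//div[@id='search']")[0]  # first find the div...
print(type(search_div) is type(tree))              # ...which is the same type of object as the tree

inside = search_div.xpath('.//button')     # the leading dot means "starting from this element"
everywhere = search_div.xpath('//button')  # without the dot, the whole document is searched

print([b.text for b in inside])
print([b.text for b in everywhere])
```

The leading `.` matters: without it, the expression is matched against the whole document, even though we called `xpath` on the div.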

### We can search for elements in more ways than just xpath

There are loads of ways to find elements within HTML.

Let's check what methods and properties of our tree object exist by calling the built-in `dir()` function.

In [17]:
print(dir(tree))

['__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '_init', 'addnext', 'addprevious', 'append', 'attrib', 'base', 'clear', 'cssselect', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'getnext', 'getparent', 'getprevious', 'getroottree', 'index', 'insert', 'items', 'iter', 'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind', 'itersiblings', 'itertext', 'keys', 'makeelement', 'nsmap', 'prefix', 'remove', 'replace', 'set', 'sourceline', 'tag', 'tail', 'text', 'values', 'xpath']


If we wanted to find the documentation which would explain all of these in detail, we should print its type and then find the corresponding page in the docs.

In [18]:
print(type(tree))

<class 'lxml.etree._Element'>


The type of our tree is currently `lxml.etree._Element`. 

If we find the documentation, we can see all of the methods [here](https://lxml.de/api/lxml.etree._Element-class.html)

Sometimes it's not easy, but use your element inspector, and consider the many ways to select elements.

In [19]:

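all_children = tree.getchildren() # direct children of the root <html> element
find = tree.find('.//div') # first div anywhere below the root; find('div') would only look at direct children
css = tree.cssselect('button') # find by CSS selector; needs a selector argument and the cssselect package

print('all_children:')
print(all_children)
print()
print('using "find" method:')
print(find)
print()
print('using "cssselect" method:')
print(css)
print()

# get_element_by_id and find_class are methods of lxml.html elements, not lxml.etree ones,
# so for this tree we use the equivalent xpath expressions
by_id = tree.xpath('//*[@id="mn-advice"]') # find by id
class_name = 'btn' # placeholder, replace with a class name from the page
by_class = tree.xpath(f'//*[contains(@class, "{class_name}")]') # find by (part of) the class attribute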

## Once you've got an element, you can do all kinds of things with it

A likely thing that you'll want to do is get the text inside an element. This might be a house name, for example. You can do this by getting the element's `.text` attribute.

Find the documentation [here](https://lxml.de/3.1/parsing.html)

In [20]:
element = tree.xpath('//button')[0]

# this element has a load of different attributes
as_a_string = etree.tostring(element) # get element as a (byte) string
tag_type = element.tag # get the type of the tag
element_text = element.text # text directly inside the element, before any child element
element_text2 = ''.join(element.itertext()) # all text inside the element, including text in child elements
# (text_content() only exists on lxml.html elements, so we use itertext() here)

print('as a string:')
print(as_a_string)
print()
print('type of element:', tag_type)
print()
print('text in element:', element_text)
print()
print('all text in element:', element_text2)
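
The sketch below shows how `.text` differs from collecting all of the text inside an element, and how to read an attribute. It uses a made-up link element; the `href` value and the text are assumptions for illustration only.

```python
from lxml import etree

# A made-up link with a nested span, used only for illustration
html = '<html><body><a href="/details/1"><span>3 bed</span> flat</a></body></html>'
tree = etree.HTML(html)
link = tree.xpath('//a')[0]

direct_text = link.text             # only the text before the first child element
all_text = ''.join(link.itertext()) # text of the element and everything nested inside it
href = link.get('href')             # value of the href attribute

print(direct_text)
print(all_text)
print(href)
```

Here `.text` is `None`, because the link starts with a child element rather than with text, so `itertext()` is the safer choice when elements are nested.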

## Beyond just GETTING static HTML


### Why might using requests to get the website content not work?

Some elements on webpages are inserted or manipulated by JavaScript code that runs only after the HTML is loaded by a browser.

Some information that you want may be shown only after interacting with certain elements.

A GET request to the website just gets the HTML file. It doesn't run the JavaScript code, or interact with the page after it renders. So parsing that HTML won't give us any data that only appears after those steps.
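
A small sketch of the problem, using a made-up page in which a price is written into the page by a script. The markup and the price are assumptions for illustration only.

```python
from lxml import etree

# A made-up page: the price div is empty in the HTML and is only filled in by the script
html = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById('price').textContent = '500000';
  </script>
</body></html>
"""
tree = etree.HTML(html)
price_div = tree.xpath('//div[@id="price"]')[0]

# lxml only parses the HTML; it never runs the script, so the div is still empty
print(price_div.text)
```

A browser would run the script and show the price, but the parsed HTML contains no text in that div.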

Again, there is a way around this. We can use a library called Selenium to take control of a browser that can then be programmatically instructed to fill in forms, click elements, and find data on any webpage.

## Selenium

Selenium is a tool for programmatically controlling a browser. It was originally intended for writing automated tests for websites, but it can be used for anything that needs a browser to be controlled.

Check out the docs [here](https://selenium-python.readthedocs.io/)

### Webdriving

Selenium can "drive" a web browser. This means it can take full control of it and find elements, click, scroll, execute JavaScript, etc.

You need to specify which browser the webdriver will drive, such as Chrome or Firefox. To drive a browser, you need to have its driver installed. We'll use the Chrome browser, whose driver is called chromedriver.

You should ensure that you download the correct version of chromedriver, which should match the version of Chrome that you wish to drive. 

[Check your chrome version here](https://help.zenplanner.com/hc/en-us/articles/204253654-How-to-Find-Your-Internet-Browser-Version-Number-Google-Chrome)

[Download chromedriver from here](https://chromedriver.chromium.org/downloads)

Add chromedriver to your `PATH`, so that Selenium can find it without being given its location.

### Getting pages

To start up Selenium automatically driving a browser, we need to instantiate a Selenium webdriver. 

`driver = webdriver.Chrome()`

This will create a webdriver instance and open a browser window that it controls.



### Finding elements in the HTML when using Selenium

Selenium has its own APIs for finding elements, including xpath and [many more]()

`driver.find_elements_by_xpath`


In [11]:
from selenium import webdriver
from time import sleep

class Scraper:
    driver = webdriver.Chrome('chrome_driver/chromedriver') # create webdriver, given the path to chromedriver
    #     driver = webdriver.Chrome() # use this instead if chromedriver is on your path
    
    def scrape(self):
        self.driver.get('https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes')
        sleep(1) # give the page a moment to load
        
        # find_elements_by_xpath returns a list of every element that matches the xpath expression
        # (newer versions of Selenium spell this driver.find_elements(By.XPATH, ...))
        # find_element_by_xpath would only get the first match, which is not interactable, so it doesn't work here
        accept_cookies = self.driver.find_elements_by_xpath('//button[@data-responsibility="acceptAll"]')
        
        # print the contents of all elements which match the xpath expression to find the right one
        for c in accept_cookies:
            print(c.text)

        # loop over the list of matches to select and click the correct element
        for c in accept_cookies:
            if c.text.lower() == 'accept all cookies':
                c.click()
                break        
        
s = Scraper()
s.scrape()


Accept all cookies



### Executing javascript

driver.

### Running headless

We can pass options to our webdriver when it starts up. One of these useful options is a flag that toggles whether Selenium should actually show you the Chrome tab which it is driving, or just run it silently in the background. 

You might not always have an output display. Maybe you want to run this script on a server that doesn't have a screen. Running Selenium without showing the browser is called running a headless browser (imagine the tab you would normally see being the head).



### Assembling our cleaned dataset
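
As a rough sketch of the last step, formatting the features into a dictionary per example, the code below parses a made-up listings page. The class names (`listing`, `price`, `address`) and the values are hypothetical and are not Zoopla's real markup or data.

```python
from lxml import etree

# Made-up listing markup: the class names and values here are hypothetical
html = """
<html><body>
  <div class="listing"><p class="price">450,000</p><p class="address"> 1 Example Road, London </p></div>
  <div class="listing"><p class="price">725,000</p><p class="address"> 2 Sample Street, London </p></div>
</body></html>
"""
tree = etree.HTML(html)

houses = []
for listing in tree.xpath('//div[@class="listing"]'):
    # search within each listing element, using ./ so that we stay inside it
    price_text = listing.xpath('./p[@class="price"]')[0].text
    address_text = listing.xpath('./p[@class="address"]')[0].text
    houses.append({
        'price': int(price_text.replace(',', '')),  # clean "450,000" into the number 450000
        'address': address_text.strip(),            # remove surrounding whitespace
    })

print(houses)
```

A list of dictionaries like this can then be saved to a file or loaded into whichever tool you use to work with datasets.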

## Scrapy

So far we have:
- used the `requests` library to get content from webpages
- used `lxml.etree` to parse our HTML using xpath

These tools work for many use-cases, not just web scraping. 
However, all of the tools that we need for scraping also come as part of a specialised Python scraping library called Scrapy.

Scrapy will help us to:
- get webpage HTML
- parse HTML
- crawl through many websites in a single run

### Scrapy spiders

Let's make our first spider in Scrapy

In [6]:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)