# Introduction to Webscraping in Python

This notebook provides an introduction to webscraping with the Python library `BeautifulSoup`. The full documentation of the package can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#modifying-the-tree). To install the package, type 
`pip install beautifulsoup4` in the Anaconda command line. In addition, install a parser by typing `pip install lxml` or `pip install html5lib`. 

To start, we need one package that allows us to download the HTML file (`requests`), one parser (`html5lib` or `lxml`), `BeautifulSoup` itself, which lets us use the structure of the file to search and navigate through it more easily, and `re` to search using regular expressions. We also import `pandas` to store the results and load the `snakeviz` extension for profiling.

In [1]:
import requests
import html5lib
import lxml
from bs4 import BeautifulSoup
import re
import pandas as pd
%load_ext snakeviz

### Background
The examples in this notebook are based on a research project of Matt Lowe. Final results of the scrape can be found in his paper [Now You See Me: The Returns to Visibility for Politicians, Lowe 2019 WP](https://mfr.osf.io/render?url=https://osf.io/c23tg/?action=download%26mode=render). We investigate a natural experiment that occurs in the British Parliament every week. MPs can enter their names for the chance to ask the Prime Minister a question. 10-15 of those who put their name down are randomly selected and get to ask a question in a (usually) well-attended chamber, and they also benefit from the popularity of this format in the media. Lowe uses the treatment of getting to ask a question as an exogenous shock to a politician's visibility to his or her party leaders and studies the effects of this visibility shock on future career performance. We therefore want to extract, for every *date* on which this Prime Minister Questions (PMQ) format took place, information at the *speech level* on the *speaker* and on whether or not the speech was a *question*. 

### Step 1: Fetch the HTML-page of interest

In [2]:
url = "https://hansard.parliament.uk/Commons/1982-07-27/debates/7a87ccce-4da2-42b2-96b8-718f36f651b0/Engagements"
page = requests.get(url)

In [3]:
#page.content

### Step 2: Brew the soup

In [4]:
soup = BeautifulSoup(page.content,'lxml')

In [5]:
#soup

In [6]:
def brew_the_soup(content_page,parser_str):
    soup = BeautifulSoup(content_page,parser_str)
    return soup

In [7]:
%timeit brew_the_soup(page.content,'html5lib')

213 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%timeit brew_the_soup(page.content,'lxml')

62.4 ms ± 6.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


`lxml` is much faster than `html5lib` (roughly 3x in the timings above) and should usually be the parser of choice. Very rarely, HTML files are formatted so oddly that `lxml` does not capture all of the source code (this has happened to me once so far). In that case `html5lib` can make sure that no content is lost. 
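
One way to check whether a parser has dropped content is to parse the same document with every installed parser and compare how many tags of interest each one finds. The sketch below is only an illustration: `count_tags_by_parser` is a hypothetical helper, the HTML snippet is a toy example, and parsers that are not installed are skipped by catching `FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_by_parser(content, tagname, parsers=("lxml", "html5lib", "html.parser")):
    """Count the tags called `tagname` that each installed parser finds (hypothetical helper)."""
    counts = {}
    for parser in parsers:
        try:
            parsed = BeautifulSoup(content, parser)
        except FeatureNotFound:
            # This parser is not installed, so skip it
            continue
        counts[parser] = len(parsed.find_all(tagname))
    return counts


# Toy document with an unclosed <p> tag
toy_html = "<div><p>One</p><p>Two</div>"
counts = count_tags_by_parser(toy_html, "p")

# A parser that reports fewer tags than the others deserves a closer look
suspicious = [name for name, n in counts.items() if n < max(counts.values())]
print(counts, suspicious)
```

If `lxml` shows up in `suspicious` for a real page, that is a hint to switch to `html5lib` for that page.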

### Step 3: Search & Navigate

This is the core of our scraping routine. Now we use (i) the structural commands of `BeautifulSoup` and (ii) customized regular expressions with `re` to extract the bits of information from the HTML file that we are after. 
In this case we want the links of the two PMQ sessions. 

#### I) Search

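There are two fundamental ways to search for elements of the HTML file:
`soup.find(tagname, attrs)`, which returns the first match (or `None` if there is none), and `soup.findAll(tagname, attrs)`, which returns a list of all matches. In current versions of `BeautifulSoup`, `find_all` is the preferred spelling of `findAll`, and the `text` argument used below is also available under the name `string`. 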
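
To see the difference between the two commands without fetching anything, here is a minimal sketch on a toy HTML snippet. The snippet and the names `toy_soup`, `first`, `all_pm` and `missing` are made up for illustration, and the built-in `html.parser` is used so that no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

toy_html = """
<div>
  <h2 class="memberLink">The Prime Minister</h2>
  <h2 class="memberLink">Mr. Smith</h2>
  <h2 class="memberLink">The Prime Minister</h2>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")
pattern = re.compile(r"Prime Minister")

# find returns only the first matching tag
first = toy_soup.find("h2", string=pattern)

# find_all returns a list with every matching tag
all_pm = toy_soup.find_all("h2", string=pattern)
n_pm = len(all_pm)

# A search without a match returns None (find) or an empty list (find_all)
missing = toy_soup.find("h3")
print(first.get_text(), n_pm, missing)
```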


In [9]:
soup.find('h2',text=re.compile(r'The Prime Minister'))

<h2 class="memberLink col-md-9" id="member-link-db588e6f-becc-46ef-9105-744edd82b68d">
                The Prime Minister
            </h2>

In [10]:
soup.findAll('h2',text=re.compile(r'The Prime Minister'))

[<h2 class="memberLink col-md-9" id="member-link-db588e6f-becc-46ef-9105-744edd82b68d">
                 The Prime Minister
             </h2>,
 <h2 class="memberLink col-md-9" id="member-link-2c72979e-590e-4734-b131-cff782d12521">
                 The Prime Minister
             </h2>,
 <h2 class="memberLink col-md-9" id="member-link-84e51054-cf58-4b84-a1dc-1ee2aba2108a">
                 The Prime Minister
             </h2>,
 <h2 class="memberLink col-md-9" id="member-link-d4ca3754-2040-4520-a829-aa6725123cd6">
                 The Prime Minister
             </h2>,
 <h2 class="memberLink col-md-9" id="member-link-b8ab825a-7513-43be-af16-c08271324f83">
                 The Prime Minister
             </h2>,
 <h2 class="memberLink col-md-9" id="member-link-397a793b-6328-4133-b145-03fbc00f0d74">
                 The Prime Minister
             </h2>,
 <h2 class="memberLink col-md-9" id="member-link-d536887a-fba8-4272-bc5d-ca6f20a498f9">
                 The Prime Minister
            

#### II) Navigating

You can not only search for tags *absolutely* using the two commands from above, but also move *relative* to a designated tag in all directions:

#### Going Upwards

In [11]:
soup.find(text='Q2.')

'Q2.'

In [12]:
for parent in soup.find(text='Q2.').parents:
    print(parent.name)

p
div
div
div
div
div
div
div
div
div
div
div
body
html
[document]


You can also combine navigation and searching:

In [13]:
soup.find(text='Q2.').find_parent('div',class_="content-item other-content")

<div class="content-item other-content">
<!-- START statement -->
<div class="statement col-md-9 content-container" id="contribution-7439638f-cfb2-4f61-8334-d9b09743eaf0">
<p class="">Q2.</p>
</div>
<div class="col-md-3 hidden-sm hidden-xs right-column">
</div>
<div class="clearfix"></div>
<!-- END statement -->
</div>
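
The same upward moves can be tried on a toy snippet. This is only a sketch: the HTML is a simplified imitation of the structure shown above, and the variable names are made up.

```python
from bs4 import BeautifulSoup

toy_html = """
<div class="content-item other-content">
  <div class="statement"><p>Q2.</p></div>
</div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# Start from the string itself ...
q_string = toy_soup.find(string="Q2.")

# ... walk up through all enclosing tags ...
parent_names = [parent.name for parent in q_string.parents]

# ... or jump directly to the first enclosing tag that matches a search
item = q_string.find_parent("div", class_="other-content")
print(parent_names, item["class"])
```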

#### Going Down

In [14]:
content_container = soup.find(class_="content-container")
#content_container

You can get tags that are *direct* children of the initial tag:

In [15]:
for item in content_container.children:
    if not isinstance(item,str):
        print(item['class'])

['content-item', 'other-content']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item', 'other-content']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item', 'other-content']
['content-item']
['content-item']
['content-item']
['content-item']


Or even *recursively iterate* through all tags that are nested within the initial tag:

In [16]:
for item in content_container.descendants:
    if not isinstance(item,str):
        # Uncomment the print to inspect the attributes of every nested tag
        #print(item.attrs)
        pass

#### Going Sideways 
Lastly, you can also go sideways, although this is a bit more restricted. To get all *siblings* of a tag, you need to use two commands, one for each direction.
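
The two commands I have in mind are `find_previous_siblings` and `find_next_siblings`. A minimal sketch on a toy list (the HTML and the names are made up) shows how to combine them into one list in document order:

```python
from bs4 import BeautifulSoup

toy_html = "<ul><li id='a'>A</li><li id='b'>B</li><li id='c'>C</li><li id='d'>D</li></ul>"
toy_soup = BeautifulSoup(toy_html, "html.parser")
start = toy_soup.find("li", id="c")

# Previous siblings come back nearest first, so reverse them for document order
before = list(reversed(start.find_previous_siblings("li")))
after = start.find_next_siblings("li")

sibling_ids = [tag["id"] for tag in before + after]
print(sibling_ids)
```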

In [17]:
content_item = content_container.find('div',class_="content-item",id=True)
content_item

<div class="content-item" id="contribution-d284e83d-8a4d-4b5d-bd45-d7ad24e78bca">
<div class="col-md-9 edit-fail-error">
<div class="alert alert-danger" id="alert-d284e83d-8a4d-4b5d-bd45-d7ad24e78bca">
<span class="glyphicon glyphicon-info-sign"></span>The edit just sent has not been saved.  The following error was returned:
        </div>
</div>
<span class="glyphicon glyphicon-info-sign"></span>This content has already been edited and is awaiting review.
        </div>
</div>
<div class="col-md-9 nohighlight member-container">
<h2 class="memberLink col-md-9" id="member-link-d284e83d-8a4d-4b5d-bd45-d7ad24e78bca">
                Mr. Delwyn Williams
            </h2>
</div>
<div class="col-md-3 hidden-sm hidden-xs right-column">
<a class="link-to-contribution link-text" data-hop-popover="http://hansard.parliament.uk/Commons/1982-07-27/debates/7a87ccce-4da2-42b2-96b8-718f36f651b0/Engagements#contribution-d284e83d-8a4d-4b5d-bd45-d7ad24e78bca" data-hop-url-shorten-url="/UrlShortener/Short

In [18]:
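# .next_sibling is called twice because the first sibling is usually the whitespace string between two tags
next_item = content_item.next_sibling.next_sibling
#next_item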
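
The double `.next_sibling` in the cell above is needed because whitespace between two tags is itself a sibling. A small sketch on a toy snippet (made-up HTML and names) shows this and how `find_next_sibling` skips strings that do not match the search:

```python
from bs4 import BeautifulSoup

toy_html = "<div>\n<p id='first'>one</p>\n<p id='second'>two</p>\n</div>"
toy_soup = BeautifulSoup(toy_html, "html.parser")
first = toy_soup.find("p", id="first")

# The immediate sibling is the line break between the two <p> tags
raw_next = first.next_sibling

# Searching for the next <p> sibling jumps over the whitespace
tag_next = first.find_next_sibling("p")
print(repr(raw_next), tag_next["id"])
```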

In [19]:
for item in content_item.find_next_siblings(id=True):
    print(item['class'])

['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']
['content-item']


### Step 4: Extract

Lastly, we want to extract the relevant information into a structured data object. In most cases we need the actual content, i.e. the string or bits of the string of an HTML tag, but in principle we can also extract all other attributes such as `class`, `id`, `href`, etc.
We will now construct a dataframe at the speech level with 4 variables: question number, question dummy, speaker name and speech. 
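
As a quick illustration of extracting strings and attributes, here is a sketch on a single toy tag. The tag, its attribute values and the `record` dictionary are invented for this example.

```python
from bs4 import BeautifulSoup

toy_html = '<a class="link-to-contribution link-text" id="c1" href="/Commons/example">  Engagements </a>'
tag = BeautifulSoup(toy_html, "html.parser").a

record = {
    "text": tag.get_text(strip=True),  # the string content without surrounding whitespace
    "href": tag.get("href"),           # .get returns None if the attribute is missing
    "id": tag["id"],                   # [] raises a KeyError if the attribute is missing
    "classes": tag["class"],           # class is returned as a list of class names
    "title": tag.get("title"),
}
print(record)
```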

In [20]:
question_number = list()
speaker = list()
speech = list()
question_indicator = list()

In [21]:
# Capture all digits of the question number; (\d)+ would keep only the last digit
q_ptrn = r'Q(\d+)\.?'
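
When a question number has more than one digit, it matters whether the quantifier sits inside or outside the capture group. The following sketch uses the made-up string `'Q12.'` to show the difference:

```python
import re

# Quantifier outside the group: the group is overwritten on every repetition
last_digit_only = re.search(r'Q(\d)+\.?', 'Q12.').group(1)

# Quantifier inside the group: the group captures the whole number
all_digits = re.search(r'Q(\d+)\.?', 'Q12.').group(1)

print(last_digit_only, all_digits)
```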

In [22]:
# Initialize a state variable that indicates whether the last item was of class "content-item" (0) or "content-item other-content" (1)
previous_item = 0
for item in content_container.children:
    if not isinstance(item,str):
        if len(item['class']) > 1:
            q_match = re.search(q_ptrn,item.text)
            if bool(q_match):
                question_number.append(int(q_match.group(1)))
            else:
                question_number.append(pd.NaT)
            previous_item = 1

        else:
            if previous_item == 1:
                previous_item = 0
                question_indicator.append(1)
            else:
                question_number.append(pd.NaT)
                question_indicator.append(0)


            speaker_container = item.find('div',class_="col-md-9 nohighlight member-container")
            speech_container = item.find('div', class_="col-md-9 contribution content-container")

            speaker.append(str.strip(speaker_container.text))
            speech.append(str.strip(speech_container.text))      

In [23]:
df = pd.DataFrame()
df['question'] = question_indicator
df['question_number'] = question_number
df['speaker'] = speaker
df['speech'] = speech
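
Filling four separate lists works, but the lists can get out of step if one branch forgets to append. An alternative that I find easier to keep consistent is to collect one dictionary per speech. The sketch below runs on a toy imitation of the page structure. The HTML, the class name `speech` and the speaker names are invented, and the logic is a simplified version of the loop above, not a replacement for it.

```python
import re
from bs4 import BeautifulSoup

toy_html = """
<div class="content-container">
  <div class="content-item other-content"><p>Q2.</p></div>
  <div class="content-item"><h2>Speaker A</h2><p class="speech">a question</p></div>
  <div class="content-item"><h2>Speaker B</h2><p class="speech">an answer</p></div>
</div>
"""
container = BeautifulSoup(toy_html, "html.parser").find(class_="content-container")

records = []
after_marker = False   # was the previous item a question marker?
pending_number = None  # question number found in that marker, if any

for item in container.find_all("div", class_="content-item", recursive=False):
    if "other-content" in item["class"]:
        match = re.search(r"Q(\d+)\.?", item.get_text())
        pending_number = int(match.group(1)) if match else None
        after_marker = True
        continue
    records.append({
        "question": int(after_marker),
        "question_number": pending_number,
        "speaker": item.h2.get_text(strip=True),
        "speech": item.find("p", class_="speech").get_text(strip=True),
    })
    after_marker, pending_number = False, None

print(records)
```

A list like `records` can be passed directly to `pd.DataFrame(records)`.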

In [24]:
df

| | question | question_number | speaker | speech |
|---|---|---|---|---|
| 0 | 1 | 2 | Mr. Delwyn Williams | asked the Prime Minister if she will list her ... |
| 1 | 0 | NaT | The Prime Minister | This morning I had meetings with ministerial c... |
| 2 | 0 | NaT | Mr. Williams | Is my right hon. Friend not dismayed by the de... |
| 3 | 0 | NaT | The Prime Minister | I understand that the suggestion of lowering e... |
| 4 | 0 | NaT | Mrs. Shirley Williams | Has the Prime Minister seen the report of the ... |
| 5 | 0 | NaT | The Prime Minister | The report is not mine in any way. I have seen... |
| 6 | 0 | NaT | Mr. Allen McKay | Is the right hon. Lady aware of the agricultur... |
| 7 | 0 | NaT | The Prime Minister | The hon. Gentleman knows that the agricultural... |
| 8 | 0 | NaT | Mr. Beaumont-Dark | Will my right hon. Friend express regret that ... |
| 9 | 0 | NaT | The Prime Minister | I agree with my hon. Friend. If allegations, s... |


## Some strategies for larger projects

### Make things robust

So far we were lucky: we always found what we were looking for where we were looking for it. In practice, you will need `try` and `except` a lot. In my experience it makes sense to build these into robust versions of the basic search commands from `BeautifulSoup`. 

In [25]:
def robust_tag_finder(start_soup, tagname, attribute="parent", **kwargs):
    """
    Try to find the next tag that satisfies the specified criteria and return a 
    None object if the search does not succeed.
    Args:
        start_soup(soup): The soup object from which the find algorithm starts.
        tagname(str): Specifies the name of the tags that should be searched for.
        attribute(str)(optional): Specifies the attribute of the found tag that is returned. 
        By default the parent of the tag is returned. 
    Output:
        found_soup(soup): The soup object that satisfies the stated criteria.

    """
    try:
        found_soup = getattr(start_soup.find(tagname, **kwargs), attribute)
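    # e.g. start_soup is None or no tag matches, so there is nothing to read the attribute from
    except (AttributeError, TypeError):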
        found_soup = None

    return found_soup
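
An alternative to wrapping the search in `try` and `except` is to check for `None` explicitly at every step. The following sketch is a hypothetical variant called `safe_find`, tried on a made-up snippet. It is not the function used in the rest of this notebook.

```python
import re
from bs4 import BeautifulSoup


def safe_find(start, tagname, attribute="parent", **kwargs):
    """Hypothetical variant of a robust finder that uses explicit None checks."""
    if start is None:
        return None
    found = start.find(tagname, **kwargs)
    if found is None:
        return None
    return getattr(found, attribute, None)


toy_html = "<ul><li class='has-children'><a href='/x'>Oral Answers to Questions</a></li></ul>"
toy_soup = BeautifulSoup(toy_html, "html.parser")

hit = safe_find(toy_soup, "a", string=re.compile(r"Oral\s*Answers", re.I))
no_hit = safe_find(toy_soup, "table")
no_start = safe_find(None, "a")
print(hit.name, no_hit, no_start)
```

The explicit checks make it visible which step failed, whereas the `try` and `except` version is shorter.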

In [26]:
def findAll_surrounding_tags(start_tag, tag_name, previous_steps=None, next_steps=None, **kwargs):
    """
    Find all previous and next steps that satisfy certain criteria starting from
    a start_tag and store them in a list.
    Args:
        start_tag(soup): A soup object from which surrounding tags are searched.
        previous_steps(int): Specifies the number of previous tags that are searched for.
        next_steps(int): Specifies the number of next tags that are searched for.
        tag_name(str): Specifies name of the tags that are searched for.
        **kwargs(optional): Further searching criteria can be passed.
    Output:
        tag_list(list): A list of all found previous and next tags that satisfy 
        the stated criteria.
    """
    tag_list = []

    # Previous steps
    if previous_steps is not None:
        following_tag = start_tag
        for n in range(1, previous_steps + 1):
            if following_tag is not None:
                previous_tag = following_tag.find_previous_sibling(tag_name, **kwargs)
                following_tag = previous_tag
                if previous_tag is not None:
                    tag_list.append(previous_tag)
            else:
                break
        # The tags need to be in chronological order but the loop returns them in the order
        # of ascending distance to the first engagement tag
        # Therefore they have to be reversed.
        tag_list.reverse()

    # Next steps
    if next_steps is not None:
        previous_tag = start_tag
        for n in range(1, next_steps + 1):
            if previous_tag is not None:
                next_tag = previous_tag.find_next_sibling(tag_name, **kwargs)
                previous_tag = next_tag
                if next_tag is not None:
                    tag_list.append(next_tag)
            else:
                break

    return tag_list
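
For simple cases, a similar result can be obtained with the `limit` argument of `find_previous_siblings` and `find_next_siblings`. The sketch below uses a made-up list of `li` tags and is meant as a cross-check for the function above, not as a replacement.

```python
from bs4 import BeautifulSoup

items = "".join("<li class='no-children' id='i{}'>x</li>".format(n) for n in range(6))
toy_soup = BeautifulSoup("<ul>" + items + "</ul>", "html.parser")
start = toy_soup.find("li", id="i2")

# Up to two matching tags before the start tag, reversed into document order
previous_tags = list(reversed(start.find_previous_siblings("li", class_="no-children", limit=2)))

# Up to two matching tags after the start tag
next_tags = start.find_next_siblings("li", class_="no-children", limit=2)

ids = [tag["id"] for tag in previous_tags + next_tags]
print(ids)
```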

For the example above we started from the final website that holds the content. In the general case, however, we need a procedure that produces the URLs of these websites for any given date. 
For this we use the website that has a regular URL and search for the items that carry the links to the final websites. 

This is also a good example of what seems to me the most efficient and robust way of scraping websites. After we have fetched the website, we narrow in on those bits of the content in which we expect the objects of interest to be, here `question_soup` and `pm_section`. We then search first in the narrowest location for what we want, and only if we don't find anything there do we move up. This should reduce, or even eliminate, false positives whilst allowing for (predictable) changes in the structure. 
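
The idea of searching the narrowest location first can be sketched in a few lines. The helper `find_narrow_first`, the toy HTML and the `id` values are all invented for illustration. The function further down handles the real page structure.

```python
from bs4 import BeautifulSoup


def find_narrow_first(scopes, tagname, **kwargs):
    """Return (label, hits) for the first scope that yields any hit (hypothetical helper)."""
    for label, scope in scopes:
        if scope is None:
            continue
        hits = scope.find_all(tagname, **kwargs)
        if hits:
            return label, hits
    return None, []


toy_html = """
<ul id="menu">
  <li id="questions"><a href="/q/engagements">Engagements</a>
    <ul id="pm"></ul>
  </li>
  <li><a href="/other">Unrelated</a></li>
</ul>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# Ordered from the narrowest to the broadest location
scopes = [
    ("pm_section", toy_soup.find(id="pm")),
    ("question_soup", toy_soup.find(id="questions")),
    ("whole_page", toy_soup),
]
label, hits = find_narrow_first(scopes, "a")
hrefs = [a["href"] for a in hits]
print(label, hrefs)
```

Because the search stops at the first location with a hit, the unrelated link elsewhere on the page is never picked up.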

In [27]:
def retrieve_pmq_links(date,base_url = "https://hansard.parliament.uk"):
    # 1) Fetch
    def fetch(date,base_url):
        url = base_url + "/Commons/" + date
        page = requests.get(url)
        return page
    
    
    soup = brew_the_soup(fetch(date,base_url).content,'lxml')

    # 2) Zoom in on the elements of the structure where the objects of interest CAN be
    def scrape_soup(soup):
        
        question_soup = robust_tag_finder(
            start_soup=soup,
            tagname="a",
            text=re.compile(r"Oral\s*Answer(s)?\s*to\s*Question(s)?\s*\n*", re.I),
        )

        if question_soup is not None:
            pm_section_temp = robust_tag_finder(
                start_soup=question_soup, 
                tagname="a", 
                text=re.compile(r"P?rime Minister\n*", re.I)
            )
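            # Guard against a missing tag or a tag without a class attribute
            tag_class = " ".join(pm_section_temp.get("class", [])) if pm_section_temp is not None else ""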

            if bool(re.search(r"has-children( open)?",tag_class)):
                pm_section = pm_section_temp
            else:
                pm_section = None
        else:
            pm_section = None

        # 3) Search from the most narrow location upwards (pm_section first, question_soup second)
        pmq_links = list()

        if question_soup is not None:
            if pm_section is not None:
                for bullet in pm_section.findAll('a'):
                    if not bool(re.search(r"P?rime Minister\n*", bullet.text, re.I)):
                        pmq_links.append(bullet.get("href"))

            # If no designated pm section was found, search in the whole question_soup
            else:                    
                engagement_tags = [tag.parent for tag in question_soup.findAll('a',text=re.compile(r'Engagements',re.I))]
                last_engagement_tag = engagement_tags[-1]
                following_tags = findAll_surrounding_tags(
                    start_tag=last_engagement_tag,
                    next_steps=3,
                    tag_name="li",
                    class_="no-children",
                )
                pmq_tags = engagement_tags + following_tags

                for bullet in pmq_tags:
                    pmq_links.append(bullet.a.get("href"))

        # 4) Convert the url-bits into links that we can scrape 
        pmq_links_final = list()
        for link in pmq_links:
            pmq_links_final.append(base_url + link)
        return pmq_links_final
    
    pmq_links_final = scrape_soup(soup)
    
    return pmq_links_final

In [28]:
date_to_link = dict()
for d in ['1979-01-18','1982-07-27']:
    date_to_link[d] = retrieve_pmq_links(d)

date_to_link

{'1979-01-18': ['https://hansard.parliament.uk/Commons/1979-01-18/debates/d7952bdb-a5f4-4510-abfa-4806295650ee/PrimeMinister(Engagements)',
  'https://hansard.parliament.uk/Commons/1979-01-18/debates/6cd5b624-da84-4747-abb7-699d785073c2/NationalEconomicdevelopmentCouncil'],
 '1982-07-27': ['https://hansard.parliament.uk/Commons/1982-07-27/debates/d8026e1d-5ea4-471f-8dc0-3d57056e4cc2/FactoryClosures',
  'https://hansard.parliament.uk/Commons/1982-07-27/debates/7a87ccce-4da2-42b2-96b8-718f36f651b0/Engagements']}

### Accelerating
Webscraping many pages can be extremely time consuming. The bottleneck is usually the process of fetching the HTML file (compare the timings at the end of this section). 
It is highly recommended to isolate this step from the rest of the scraping process and to store the HTML files locally, e.g. in `pickle` files. 
In addition, parallelization (e.g. with `waf`) yields large performance boosts when a large number of websites are fetched.

The base pages used in the examples are:

https://hansard.parliament.uk/Commons/1982-07-27

https://hansard.parliament.uk/Commons/1979-01-18
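
As a small sketch of fetching several pages in parallel, the example below uses a thread pool from the standard library's `concurrent.futures` rather than `waf`. To keep it runnable offline, `fake_fetch` is a stand-in that I made up. In a real run it would be replaced by a function that calls `requests.get` and returns the page content.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(dates, fetch, max_workers=4):
    """Apply `fetch` to every date in parallel threads and return {date: content}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the order of the inputs
        contents = list(pool.map(fetch, dates))
    return dict(zip(dates, contents))


def fake_fetch(date):
    # Stand-in for the real download, returns dummy bytes
    return ("<html>" + date + "</html>").encode()


dates = ["1979-01-18", "1982-07-27"]
pages = fetch_many(dates, fake_fetch)
print(list(pages))
```

Threads fit this task because the time is mostly spent waiting for the server, not computing.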

In [29]:
%snakeviz retrieve_pmq_links('1979-01-18')

 


In [30]:
import pickle

In [31]:
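import os

def pickle_html(date,base_url="https://hansard.parliament.uk"):
    # Fetch the base page of the given date and keep its raw content
    page = requests.get(base_url + "/Commons/" + date)
    page_dict = {date: page.content}
    
    # Create the target folder if it does not exist yet
    os.makedirs("./html_files", exist_ok=True)
    with open("./html_files/base_page_{}".format(date), "wb") as f:
        pickle.dump(page_dict, f)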

In [32]:
pickle_html("1979-01-18")

In [33]:
def load_pickle(date):
    with open("./html_files/base_page_{}".format(date), "rb") as f:
        content_page = pickle.load(f)[date]
    
    return content_page
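
The two functions above can also be combined into a single "load if stored, otherwise fetch and store" step. The sketch below is one possible way to do this. `load_or_fetch` and `fake_fetch` are hypothetical names, and a temporary folder replaces `./html_files` so that the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_fetch(date, fetch, cache_dir):
    """Return the stored page for `date` if it exists, else fetch and store it."""
    path = Path(cache_dir) / "base_page_{}.pkl".format(date)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    content = fetch(date)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(content, f)
    return content


calls = []

def fake_fetch(date):
    # Stand-in for the real download, records how often it is called
    calls.append(date)
    return b"<html>" + date.encode() + b"</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch("1979-01-18", fake_fetch, tmp)
    second = load_or_fetch("1979-01-18", fake_fetch, tmp)

print(len(calls), first == second)
```

The second call is served from disk, so the download function runs only once per date.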

In [34]:
def retrieve_pmq_links_pickle(date,base_url = "https://hansard.parliament.uk"):
    # 1) Load the locally stored page instead of fetching it
    content_page = load_pickle(date)
    soup = brew_the_soup(content_page,'lxml')

    # 2) Zoom in on the elements of the structure where the objects of interest CAN be
    def scrape_soup(soup):
        
        question_soup = robust_tag_finder(
            start_soup=soup,
            tagname="a",
            text=re.compile(r"Oral\s*Answer(s)?\s*to\s*Question(s)?\s*\n*", re.I),
        )

        if question_soup is not None:
            pm_section_temp = robust_tag_finder(
                start_soup=question_soup, 
                tagname="a", 
                text=re.compile(r"P?rime Minister\n*", re.I)
            )
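            # Guard against a missing tag or a tag without a class attribute
            tag_class = " ".join(pm_section_temp.get("class", [])) if pm_section_temp is not None else ""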

            if bool(re.search(r"has-children( open)?",tag_class)):
                pm_section = pm_section_temp
            else:
                pm_section = None
        else:
            pm_section = None

        # 3) Search from the most narrow location upwards (pm_section first, question_soup second)
        pmq_links = list()

        if question_soup is not None:
            if pm_section is not None:
                for bullet in pm_section.findAll('a'):
                    if not bool(re.search(r"P?rime Minister\n*", bullet.text, re.I)):
                        pmq_links.append(bullet.get("href"))

            # If no designated pm section was found, search in the whole question_soup
            else:                    
                engagement_tags = [tag.parent for tag in question_soup.findAll('a',text=re.compile(r'Engagements',re.I))]
                last_engagement_tag = engagement_tags[-1]
                following_tags = findAll_surrounding_tags(
                    start_tag=last_engagement_tag,
                    next_steps=3,
                    tag_name="li",
                    class_="no-children",
                )
                pmq_tags = engagement_tags + following_tags

                for bullet in pmq_tags:
                    pmq_links.append(bullet.a.get("href"))

        # 4) Convert the url-bits into links that we can scrape 
        pmq_links_final = list()
        for link in pmq_links:
            pmq_links_final.append(base_url + link)
        return pmq_links_final
    
    pmq_links_final = scrape_soup(soup)
    
    return pmq_links_final

In [35]:
links = retrieve_pmq_links_pickle(date="1979-01-18")
links

['https://hansard.parliament.uk/Commons/1979-01-18/debates/d7952bdb-a5f4-4510-abfa-4806295650ee/PrimeMinister(Engagements)',
 'https://hansard.parliament.uk/Commons/1979-01-18/debates/6cd5b624-da84-4747-abb7-699d785073c2/NationalEconomicdevelopmentCouncil']

In [36]:
%timeit retrieve_pmq_links_pickle(date="1979-01-18")

98.4 ms ± 4.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [37]:
%timeit retrieve_pmq_links(date="1979-01-18")

The slowest run took 7.83 times longer than the fastest. This could mean that an intermediate result is being cached.
619 ms ± 635 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
