# Scraping British Political Speech

Using `lxml` and `requests`, we can take the root URL (`http://www.britishpoliticalspeech.org/speech-archive.htm`) and a speech number to construct a query URL (`http://www.britishpoliticalspeech.org/speech-archive.htm?speech=6`), and then look up that URL. There's no clear pattern to the speech numbers, so it's probably simplest to query a wide range of speech numbers and then discard those that return an error message.

In [1]:
from lxml import html
import requests 

def scrape_speech_url(root, speech_num):
    # Let requests build the query string (e.g. ...?speech=6) and fetch the page
    params = {"speech": str(speech_num)}
    page = requests.get(root, params=params)
    # Parse the raw HTML bytes into an lxml element tree
    tree = html.fromstring(page.content)
    
    # Check whether speech was found
    error_msg = ["We're sorry, but the requested speech could not be found."]
    if tree.xpath('//h3/text()') == error_msg:
        title = None
        speaker = None
        location = None
        tags = None
        commentary = None
        content = None
    else: # Obtain variables of interest
        title = ' '.join(tree.xpath('//h3/text()'))
        speaker = ' '.join(tree.xpath('//p[@class="speech-speaker"]/text()'))
        location = ' '.join(tree.xpath('//p[@class="speech-location"]/text()'))
        tags = ' '.join(tree.xpath('//p[@class="speech-tags"]/text()'))
        commentary = ' '.join(tree.xpath('//div[@class="speech-commentary"]/text()'))
        content = ' '.join(tree.xpath('//div[@class="speech-content"]//p/text()'))
    
    return [title, speaker, location, tags, commentary, content]
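
To see what the XPath expressions and the error check do without touching the network, here is a minimal sketch that runs the same logic on two hand-written HTML fragments. The class names are taken from the function above, but the markup itself (`FOUND`, `MISSING`) and the helper `parse_fragment` are assumptions made for illustration, not copies of the site's pages.

```python
from lxml import html

# Hypothetical fragments: the class names follow the XPath expressions used in
# scrape_speech_url, but the markup is hand-written, not copied from the site.
FOUND = """
<html><body>
  <h3>Leader's speech, Blackpool 2005</h3>
  <p class="speech-speaker">Michael Howard (Conservative)</p>
  <p class="speech-location">Location: Blackpool</p>
  <div class="speech-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""
MISSING = """
<html><body>
  <h3>We're sorry, but the requested speech could not be found.</h3>
</body></html>
"""
ERROR_MSG = ["We're sorry, but the requested speech could not be found."]


def parse_fragment(markup):
    """Return a dict of fields, or None when the page is the 'not found' page."""
    tree = html.fromstring(markup)
    headings = tree.xpath('//h3/text()')
    if headings == ERROR_MSG:
        return None
    return {
        "title": ' '.join(headings),
        "speaker": ' '.join(tree.xpath('//p[@class="speech-speaker"]/text()')),
        # Each <p> contributes one text node; they are joined with single spaces
        "content": ' '.join(tree.xpath('//div[@class="speech-content"]//p/text()')),
    }


found = parse_fragment(FOUND)
missing = parse_fragment(MISSING)
print(found)
print(missing)
```

Note that `text()` only returns text sitting directly inside the matched element, so any text nested in child tags (for example inside `<em>`) would be skipped by these expressions.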
    

Then run that function across the first 1000 speech numbers (0 to 999), which results in 362 valid speeches.

In [2]:
import pandas as pd
speeches = []
bps = "http://www.britishpoliticalspeech.org/speech-archive.htm"

for i in range(1000):
    speeches.append(scrape_speech_url(bps, i))
    
speeches_df = pd.DataFrame(speeches, columns = ["title", "speaker", "location", "tags", "commentary", "content"])
# Remove rows for speech numbers that did not return a speech (title is None)
speeches_df = speeches_df[speeches_df["title"].notna()]
display(speeches_df)

Unnamed: 0,title,speaker,location,tags,commentary,content
1,"Leader's speech, Blackpool 2005",Michael Howard (Conservative),Location: Blackpool,,"In this speech, Howard stood down as leader a...",I was always taught that it doesn’t matter wha...
2,"Leader's speech, Manchester 2006",Tony Blair (Labour),Location: Manchester,,This conference speech was Blair’s last as ...,I’d like to start by saying something very sim...
5,"Leader's speech, Cardiff 1895",Earl of Rosebery (Liberal),Location: Cardiff,,This speech from Lord Rosebery closed the Nat...,"Mr. Bird, ladies and gentlemen, - I am deeply ..."
6,"Leader's speech, Huddersfield 1896",Earl of Rosebery (Liberal),Location: Huddersfield,,Some 4000 people crowded into Rowley’s musi...,"Mr. Walker, ladies and gentlemen. It is very ..."
7,"Leader's speech, Norwich 1897",Sir William Harcourt (Liberal),Location: Norwich,,In a departure from previous procedure Harcou...,"My Lords and Gentlemen, - I will say ‘My lords..."
...,...,...,...,...,...,...
366,"Leader's speech, Brighton 2017",Jeremy Corbyn (Labour),Location: Brighton,,,"We meet here this week as a united Party, adva..."
367,"Leader's speech, Manchester 2017",Theresa May (Conservative),Location: Manchester,,,A little over forty years ago in a small villa...
368,"Leader's speech, Brighton 2018",Vince Cable (Liberal Democrat),Location: Brighton,,,"Conference, we meet at an absolutely cruci..."
369,"Leader's speech, Liverpool 2018",Jeremy Corbyn (Labour),Location: Liverpool,,,Thank you for that welcome. I want to star...
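
The loop above sends 1000 requests back to back and filters afterwards. A gentler variant, sketched below, pauses between requests and drops missing speeches as it goes. The helper `collect_speeches`, the stand-in `fake_scrape`, the `example.invalid` URL and the `delay` parameter are all hypothetical names introduced here; the only assumption about the real scraper is that it returns a list whose first element is the title, or `None` when no speech was found.

```python
import time


def collect_speeches(scrape, root, speech_nums, delay=0.0):
    """Call scrape(root, n) for each n and keep rows whose title is not None.

    Results are keyed by speech number so that the number is not lost
    once the missing speeches have been dropped.
    """
    rows = {}
    for n in speech_nums:
        row = scrape(root, n)
        if row[0] is not None:  # the title is the first element of each row
            rows[n] = row
        time.sleep(delay)  # pause between requests; 0.0 disables the pause
    return rows


def fake_scrape(root, speech_num):
    """Stand-in for scrape_speech_url: even speech numbers are 'missing'."""
    if speech_num % 2 == 0:
        return [None] * 6
    return [f"Speech {speech_num}", "", "", "", "", ""]


rows = collect_speeches(fake_scrape, "http://example.invalid/archive", range(6))
print(sorted(rows))
```

With the real function this would be called as `collect_speeches(scrape_speech_url, bps, range(1000), delay=1.0)`, where the one-second delay is an arbitrary choice rather than anything the site documents.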


We can tidy up a few columns:

In [3]:
# Split the speaker column into speaker name and party name
speeches_df["party"] = speeches_df["speaker"].map(lambda x: x.rpartition(' (')[2].replace(')', ''))
speeches_df["speaker"] = speeches_df["speaker"].map(lambda x: x.rpartition(' (')[0])
# Tidy up location column
speeches_df["location"] = speeches_df["location"].map(lambda x: x.rpartition(': ')[2])
# Extract year from title column using regex for 4 digits in a row
import re
speeches_df["year"] = speeches_df["title"].map(lambda x: ''.join(re.findall(r"\d{4}", x)))
# Display dataframe
display(speeches_df)

Unnamed: 0,title,speaker,location,tags,commentary,content,party,year
1,"Leader's speech, Blackpool 2005",Michael Howard,Blackpool,,"In this speech, Howard stood down as leader a...",I was always taught that it doesn’t matter wha...,Conservative,2005
2,"Leader's speech, Manchester 2006",Tony Blair,Manchester,,This conference speech was Blair’s last as ...,I’d like to start by saying something very sim...,Labour,2006
5,"Leader's speech, Cardiff 1895",Earl of Rosebery,Cardiff,,This speech from Lord Rosebery closed the Nat...,"Mr. Bird, ladies and gentlemen, - I am deeply ...",Liberal,1895
6,"Leader's speech, Huddersfield 1896",Earl of Rosebery,Huddersfield,,Some 4000 people crowded into Rowley’s musi...,"Mr. Walker, ladies and gentlemen. It is very ...",Liberal,1896
7,"Leader's speech, Norwich 1897",Sir William Harcourt,Norwich,,In a departure from previous procedure Harcou...,"My Lords and Gentlemen, - I will say ‘My lords...",Liberal,1897
...,...,...,...,...,...,...,...,...
366,"Leader's speech, Brighton 2017",Jeremy Corbyn,Brighton,,,"We meet here this week as a united Party, adva...",Labour,2017
367,"Leader's speech, Manchester 2017",Theresa May,Manchester,,,A little over forty years ago in a small villa...,Conservative,2017
368,"Leader's speech, Brighton 2018",Vince Cable,Brighton,,,"Conference, we meet at an absolutely cruci...",Liberal Democrat,2018
369,"Leader's speech, Liverpool 2018",Jeremy Corbyn,Liverpool,,,Thank you for that welcome. I want to star...,Labour,2018
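
To make the string handling concrete, the sketch below applies the same three transformations to the values from a single row of the table. The variable names are illustrative only.

```python
import re

# Raw values as they appear in the first row of the scraped table
speaker_raw = "Michael Howard (Conservative)"
location_raw = "Location: Blackpool"
title = "Leader's speech, Blackpool 2005"

# rpartition splits on the LAST occurrence of the separator, so a name that
# itself contains " (" would still be split at the party bracket
name, _, party = speaker_raw.rpartition(" (")
party = party.replace(")", "")

# Keep only the text after "Location: "
location = location_raw.rpartition(": ")[2]

# Join every run of four digits found in the title
year = "".join(re.findall(r"\d{4}", title))

print(name, party, location, year, sep=" | ")
```

Two edge cases are worth keeping in mind. If the separator is absent, `rpartition` returns the whole string as its last element, so the name would come out empty and the party would hold the full text. If a title contained two four-digit runs, `''.join` would concatenate them into one eight-character string.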


There's still some tidying to do for the speech content column, but otherwise we should be good to go. Save the dataframe to a CSV file:

In [11]:
from pathlib import Path
path = Path.cwd()
speeches_df.to_csv(path / "speeches_df.csv")
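
Because `to_csv` writes the index (here, the speech numbers) as an unlabelled first column, reading the file back with default settings yields a column named `Unnamed: 0`. A minimal sketch of the round trip, using a toy dataframe and an in-memory buffer rather than the real file (both are assumptions for illustration):

```python
import io

import pandas as pd

# Toy stand-in for speeches_df, with speech numbers as the index
toy = pd.DataFrame({"title": ["A", "B"], "year": ["2005", "2006"]}, index=[1, 5])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unlabelled first column

# Default read: the unlabelled column becomes "Unnamed: 0"
buffer.seek(0)
naive = pd.read_csv(buffer)

# Telling pandas that the first column is the index restores the speech numbers
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```

Passing `index_col=0` when loading `speeches_df.csv` should therefore bring the speech numbers back as the index.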