# Notebook for scraping data from the British Political Speech Archive

Before dynamic topic modelling can begin, we need the speech data we want to model. We will use party leaders' speeches from party conferences, for all available parties, since 1945.

This notebook scrapes the text of those speeches into a format that we can subsequently explore and topic model.

In [None]:
# standard library
import os
import re

# data handling
import pandas as pd

# querying the website and parsing the returned HTML
import requests
from bs4 import BeautifulSoup

## Create initial table of links

First we need to create a table of speeches including links to the web pages containing the speeches themselves.

In [None]:
URL = 'http://www.britishpoliticalspeech.org/speech-archive.htm?q=&speaker=&party=&searchRangeFrom=1945&searchRangeTo=2018'

In [None]:
primary_url = 'http://www.britishpoliticalspeech.org/'

In [None]:
# first use pandas to scrape the initial table data
tbls = pd.read_html(URL)

main_tbl = tbls[1]

In [None]:
# keep only the rows whose title mentions a leader's speech
main_tbl = main_tbl[main_tbl.Title.str.contains("Leader's")].reset_index(drop=True)
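
The title filter above is case-sensitive, whereas the link filter used later lowercases each title before matching. As a minimal sketch of a filter that matches any capitalisation and tolerates missing titles, the example below uses invented rows (the titles and parties are placeholders, not archive data):

```python
import pandas as pd

# Hypothetical rows standing in for the scraped archive table.
sample = pd.DataFrame({
    "Title": ["Leader's speech, Town A", "leader's speech, Town B",
              "Chancellor's speech, Town C", None],
    "Party": ["Party X", "Party Y", "Party X", "Party Z"],
})

# case=False matches any capitalisation; na=False treats missing titles as non-matches.
mask = sample["Title"].str.contains("leader's", case=False, na=False)

# .copy() returns an independent frame, so later edits cannot affect `sample`.
leaders = sample[mask].copy().reset_index(drop=True)
print(leaders)
```

Whether the real table contains differently capitalised or missing titles is not known here; if it does, the same `case` and `na` arguments could be passed in the cell above.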

In [None]:
main_tbl.shape

In [None]:
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
results_table = soup.find('table', class_="results-table")

In [None]:
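# the guard matters because the matched string can be None for links that contain nested tags
tbl_links = results_table.find_all('a', string=lambda text: text is not None and "leader's" in text.lower(), href=True)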
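
To illustrate how the `string=` and `href=True` filters interact, here is a small sketch run against a made-up HTML fragment. The markup and `href` values are assumptions for illustration only; the real page may be structured differently. The helper `is_leader_link` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the real results page may differ.
html = """
<table class="results-table">
  <tr><td><a href="speech-a.htm">Leader's speech, Town A</a></td></tr>
  <tr><td><a href="speech-b.htm"><b>Bold</b> nested title</a></td></tr>
  <tr><td><a href="speech-c.htm">Chancellor's speech</a></td></tr>
  <tr><td><a>LEADER'S speech without a link target</a></td></tr>
</table>
"""

def is_leader_link(text):
    # A tag with several children has no single string, so text may be None.
    return text is not None and "leader's" in text.lower()

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="results-table")

# href=True keeps only anchors that actually carry an href attribute.
links = table.find_all("a", string=is_leader_link, href=True)
hrefs = [a["href"] for a in links]
print(hrefs)
```

Only the first anchor survives: the second has no single string to match, the third fails the title test, and the fourth has no `href`.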

In [None]:
# check that the table has the same number of rows as the scraped links
assert main_tbl.shape[0] == len(tbl_links)

In [None]:
main_tbl.shape

In [None]:
# create a named series of absolute urls
url_srs = pd.Series([primary_url + a['href'] for a in tbl_links], name='url')
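
Plain string concatenation gives the right URL only when every `href` is relative and has no leading slash. As a sketch of a more forgiving alternative, the standard library's `urljoin` resolves each link against the base URL. The `href` values below are hypothetical and chosen to cover the common cases:

```python
from urllib.parse import urljoin

primary_url = 'http://www.britishpoliticalspeech.org/'

# Hypothetical href values: relative, root-relative, and already absolute.
hrefs = [
    'speech-a.htm',
    '/speech-b.htm',
    'http://www.britishpoliticalspeech.org/speech-c.htm',
]

# urljoin resolves each href against the base, avoiding doubled slashes
# and leaving absolute URLs untouched.
urls = [urljoin(primary_url, h) for h in hrefs]

# For comparison, naive concatenation of the root-relative link.
naive = primary_url + hrefs[1]

for u in urls:
    print(u)
print(naive)
```

If the archive's links are all plain relative paths, both approaches give the same result.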

In [None]:
main_tbl = pd.concat([main_tbl, url_srs], axis=1)

In [None]:
main_tbl

## Scrape speech text from links

Now we'll scrape the speech text from each page listed in `main_tbl` and add it to the DataFrame as a new column.

In [None]:
main_tbl.url[0]

In [None]:
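def fetch_speech(url):
    """
    For a given url, return the text of the speech-content div element
    """
    # the timeout stops a stalled connection from hanging the whole scrape
    page = requests.get(url, timeout=30)
    # fail loudly on HTTP errors instead of parsing an error page
    page.raise_for_status()

    soup = BeautifulSoup(page.content, 'html.parser')

    return soup.find('div', class_='speech-content').get_text()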
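
`soup.find` returns `None` when a page has no matching element, in which case calling `.get_text()` raises an `AttributeError`. The sketch below separates parsing from fetching so that the missing-div case can be handled and tested offline. `extract_speech_text` is a hypothetical helper, and both HTML strings are invented for illustration:

```python
from bs4 import BeautifulSoup

def extract_speech_text(html):
    """Return the text of the speech-content div, or None if the div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="speech-content")
    if div is None:
        return None
    # Strip each text fragment and join the fragments with newlines.
    return div.get_text(separator="\n", strip=True)

# Hypothetical pages: one with the expected div, one without.
good = ('<html><body><div class="speech-content">'
        '<p>First paragraph.</p><p>Second paragraph.</p>'
        '</div></body></html>')
bad = "<html><body><p>Page not found</p></body></html>"

good_text = extract_speech_text(good)
bad_text = extract_speech_text(bad)
print(repr(good_text))
print(repr(bad_text))
```

Returning `None` lets the caller spot failed pages afterwards, for example with `isna()` on the resulting column, rather than losing the whole run to one bad page.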

In [None]:
main_tbl['speech-text'] = main_tbl.url.apply(fetch_speech)
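
Applying the fetch function directly requests every page back to back and stops at the first failure. One possible alternative, sketched below, pauses between requests and retries transient errors. `fetch_all` and its parameters are hypothetical names introduced here, and the fake fetcher only simulates failures so the example runs without network access:

```python
import time

def fetch_all(urls, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(url) for each url, pausing between requests and retrying failures."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause between pages to avoid hammering the server
        text = None
        for attempt in range(retries + 1):
            try:
                text = fetch(url)
                break
            except Exception:
                if attempt < retries:
                    sleep(delay)  # wait before trying the same page again
        results.append(text)  # None marks a page that never succeeded
    return results

# Simulated fetcher: page-b fails once, page-c fails every time.
attempts = {}

def fake_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == "page-b" and attempts[url] == 1:
        raise ConnectionError("simulated transient failure")
    if url == "page-c":
        raise ConnectionError("simulated permanent failure")
    return "text of " + url

texts = fetch_all(["page-a", "page-b", "page-c"], fake_fetch,
                  delay=0, sleep=lambda seconds: None)
print(texts)
```

In this notebook the call would look something like `main_tbl['speech-text'] = fetch_all(main_tbl.url, fetch_speech)`. A suitable delay depends on the site's terms of use, which are not covered here.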

In [None]:
main_tbl.to_csv(os.path.join('..','data','leaders-speeches.csv'), index=False)
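
`to_csv` does not create missing directories, so the save step fails if the `data` folder does not exist yet. The sketch below creates the folder first and checks that multi-line speech text survives a round trip through CSV. It writes to a temporary directory, and the one-row DataFrame is invented for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row table with text containing a comma, a newline and quotes.
df = pd.DataFrame({
    "Title": ["Leader's speech, Town A"],
    "speech-text": ['Line one,\nline two with "quotes".'],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    path = os.path.join(out_dir, "leaders-speeches.csv")
    df.to_csv(path, index=False)

    # Read the file back to confirm the text was quoted and restored intact.
    restored = pd.read_csv(path)

print(restored["speech-text"].iloc[0])
```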