# Hansard Analysis

This notebook contains code to analyse Hansard, the official record of every spoken or written contribution made in the Houses of Parliament from 1803 to the present day.

## Why bother?

Looking at changes in political discourse over time is fascinating. We can infer a lot about which issues were prioritised by different governments, how Members of Parliament responded to global events, when emerging technologies were first mentioned in Parliament, and how the tone surrounding different issues has changed.

<img src="https://assets3.parliament.uk/iv/main-large//ImageVault/Images/id_10860/scope_0/ImageVaultHandler.aspx.jpg" width=500px>

## Downloading the archives

The online Hansard archive (available at XXX) is an incredibly rich source of information, and is free to access. It's also a bit of a mess: the archives for 1803-2005, 2005-2010, and 2010-2017 are all held in slightly different places and in slightly different formats. The archive offers some basic search functionality, but not at the speed or level of detail we need to track, for example, usage of the word "Empire" over the past 200 years. It's faster and more efficient to download a local copy of Hansard and search that, so let's start by writing some code to automatically download one HTML file for every day's worth of Parliamentary debate.

Here, we cycle through every possible date in our time range and check whether a page corresponding to that date exists in the Hansard archive. If a page exists, that means there was a Parliamentary sitting that day and we can download the associated records. If a page doesn't exist, we can assume Parliament wasn't sitting. 

This is the programmatic equivalent of accessing the Hansard archive in any web browser, navigating to a particular date, and manually downloading all debates listed under the House of Commons.
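
One way to enumerate "every possible date" is to walk the calendar with `datetime`, so that only real dates are requested (no 30 February). The sketch below is a minimal illustration, not the notebook's own method: `iter_dates` and `sitting_url` are hypothetical helper names, and the URL pattern simply mirrors the one used in the 1803-2004 download function in this notebook.

```python
import calendar
from datetime import date, timedelta

# Base address copied from the 1803-2004 download function in this notebook
BASE_URL = "http://hansard.millbanksystems.com/sittings/"


def iter_dates(start, end):
    """Yield every real calendar date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def sitting_url(day):
    """Build the contents-page URL for one date, e.g. .../1803/nov/22."""
    # Three-letter lowercase month name, as in the notebook's monthName
    month_abbr = calendar.month_name[day.month].lower()[:3]
    return "{0}{1}/{2}/{3:02d}".format(BASE_URL, day.year, month_abbr, day.day)


# Every candidate URL for 1803; each one still has to be requested
# to find out whether Parliament sat on that day.
urls = [sitting_url(d) for d in iter_dates(date(1803, 1, 1), date(1803, 12, 31))]
```

Walking real dates only saves a handful of requests per year, but it also means a failed request can no longer be caused by an impossible date.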

In [None]:
# Import libraries 

import urllib2 # requests and grabs information from web pages
from bs4 import BeautifulSoup # parses and searches HTML files
import matplotlib.pyplot as plt # plots results
import re # uses regular expressions to efficiently search large bodies of text
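import os # handles file paths and directories
import math # provides basic mathematical functions
import time # parses and compares dates
from glob import glob # lists files matching a filename pattern
import calendar # converts month numbers to month names
from datetime import datetime # represents dates and times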

### Downloading 1803-2004 

In [None]:
# function to save daily Hansard records from 1803-2004 locally as HTML files
def saveOldContribs():

    for year in range(1803,2005):
        for month in range(1,13):
            for day in range(1,32):

                monthName = calendar.month_name[month].lower()[0:3]
                contentsURL = "http://hansard.millbanksystems.com/sittings/"+str(year)+"/"+monthName+"/"+str(day).zfill(2)

                try:
                    print "trying date: ", str(year)+' '+str(month)+' '+str(day).zfill(2)
                    contentsPage = urllib2.urlopen(contentsURL).read()
                    contentsSoup = BeautifulSoup(contentsPage,'lxml')
                    print "success!"
                    # take the first "xoxo first" list on the contents page, treated here as the House of Commons section
                    commonsSection = contentsSoup.find_all("ol", {"class": "xoxo first"})[0]
                    # open the output file only once the section has been found; 'with' closes it even if a download fails
                    with open('./XXXX'+ ' '+str(year)+' '+str(month)+' '+str(day).zfill(2)+'.html', 'w') as saveFile:
                        for link in commonsSection.find_all('a', href=True):
                            linkURL = "http://hansard.millbanksystems.com"+link['href']
                            print linkURL
                            linkPage = urllib2.urlopen(linkURL).read()
                            saveFile.write(linkPage)

                except Exception: # unlike a bare except, this still lets Ctrl-C interrupt the loop
                    print "page not found!"
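
The function above treats any failure as "page not found", which also hides network problems and server errors. The sketch below separates the two cases, on the assumption (not confirmed anywhere in this notebook) that the archive answers a non-sitting date with HTTP 404. `fetch_or_none` is a hypothetical helper, and the import fallback is there only so the sketch runs on both Python 2 and Python 3.

```python
try:
    from urllib2 import HTTPError, urlopen      # Python 2, as used in this notebook
except ImportError:
    from urllib.error import HTTPError          # Python 3 equivalents
    from urllib.request import urlopen


def fetch_or_none(url, opener=urlopen):
    """Return the page body, or None if the server answers 404.

    Assumption: a missing sitting day is reported as HTTP 404.
    Any other HTTP error is re-raised so it is not mistaken for
    "Parliament was not sitting".
    """
    try:
        return opener(url).read()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise
```

The `opener` parameter defaults to the real `urlopen`, but passing a stand-in makes the logic easy to check without touching the network.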

### Downloading 2005-2010

In [None]:
# function to save daily Hansard records from 2005-2010 locally as HTML files
def saveMidContribs():

    # zero-padded day strings '01' to '31'
    days = [str(d).zfill(2) for d in range(1,32)]

    validSessions = ['200405 2004 11','200405 2004 12', '200405 2005 01', '200405 2005 02','200405 2005 03','200405 2005 04',
                     '200506 2005 05','200506 2005 06','200506 2005 07','200506 2005 08','200506 2005 09','200506 2005 10',
                     '200506 2005 11','200506 2005 12',
                     '200506 2006 01','200506 2006 02','200506 2006 03','200506 2006 04','200506 2006 05','200506 2006 06',
                     '200506 2006 07','200506 2006 08','200506 2006 09','200506 2006 10','200506 2006 11',
                     '200607 2006 12','200607 2007 01','200607 2007 02','200607 2007 03','200607 2007 04','200607 2007 05','200607 2007 06',
                     '200607 2007 07','200607 2007 08','200607 2007 09','200607 2007 10',
                     '200708 2007 11','200708 2007 12',
                     '200708 2008 01','200708 2008 02','200708 2008 03','200708 2008 04','200708 2008 05','200708 2008 06',
                     '200708 2008 07','200708 2008 08','200708 2008 09','200708 2008 10','200708 2008 11',
                     '200809 2008 12',
                     '200809 2009 01','200809 2009 02','200809 2009 03','200809 2009 04','200809 2009 05','200809 2009 06',
                     '200809 2009 07','200809 2009 08','200809 2009 09','200809 2009 10','200809 2009 11',
                     '200910 2009 12',
                     '200910 2010 01','200910 2010 02','200910 2010 03','200910 2010 04',
                     '201011 2010 05','201011 2010 06','201011 2010 07','201011 2010 08','201011 2010 09','201011 2010 10',
                     '201011 2010 11','201011 2010 12'
                     ]

    for sesh in validSessions:
        for day in days:

            session, year, month = sesh.split(" ")

            date = year[2:4]+month+day

            print "trying date ",date

            try:
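                # strptime raises ValueError for impossible dates (e.g. 31 February), which jumps to the except clause below
                currentDate = time.strptime(day+'/'+month+'/'+year,'%d/%m/%Y')

                # for 2000-2009 the page filename drops the leading zero of the two-digit year
                if year[2] == '0':
                    dateShort = year[3:4]+month+day
                else:
                    dateShort = date

                # page numbers are zero-padded to four digits after 4 May 2006, and to two digits up to that date
                if time.strptime('04/05/2006','%d/%m/%Y') < currentDate:
                    pageFlag = "-0001.htm"
                else:
                    pageFlag = "-01.htm"

                # the folder prefix changes from "vo" to "cm" after 8 November 2006
                if time.strptime('08/11/2006','%d/%m/%Y') < currentDate:
                    volFlag = "cm"
                else:
                    volFlag = "vo"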

                contentsURL = "https://publications.parliament.uk/pa/cm"+session+"/cmhansrd/"+volFlag+date+"/debtext/"+dateShort+pageFlag
                contentsPage = urllib2.urlopen(contentsURL).read()
                contentsSoup = BeautifulSoup(contentsPage,'lxml')

                if contentsSoup.find("h1") is None:
                    print "success!"
                    print "..."
                    # 'with' closes the file even if a page request raises part-way through
                    with open('./'+session+' '+year+' '+month+' '+day+'.html', 'w') as saveFile:
                        for pageNum in range(1,2000):
                            print "trying page ",pageNum
                            if time.strptime('04/05/2006','%d/%m/%Y') < currentDate:
                                pageFlag = '-'+str(pageNum).zfill(4)+'.htm'
                            else:
                                pageFlag = '-'+str(pageNum).zfill(2)+'.htm'

                            searchURL = "https://publications.parliament.uk/pa/cm"+session+"/cmhansrd/"+volFlag+date+"/debtext/"+dateShort+pageFlag
                            searchPage = urllib2.urlopen(searchURL).read()
                            searchSoup = BeautifulSoup(searchPage,'lxml')
                            # as with the first page above, a page containing an <h1> is treated as the end of the day's record
                            if searchSoup.find("h1") is None:
                                saveFile.write(searchPage)
                            else:
                                break
                else:
                    print "no debates on this day!"

            except Exception: # reached for impossible dates (e.g. 31 February) and for failed page requests
                print "not a valid date!"
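
The URL rules buried in the function above (the shortened date, the page-number width, and the `vo`/`cm` prefix) can be pulled out into one small function that is easy to check by hand. This is a minimal sketch that restates the same rules and cutoff dates as the code above; `debate_page_url` and the two constant names are hypothetical, not part of the notebook.

```python
from datetime import date

BASE = "https://publications.parliament.uk/pa/cm"
PAGE_WIDTH_CHANGE = date(2006, 5, 4)      # after this date, page numbers are 4 digits wide
VOLUME_PREFIX_CHANGE = date(2006, 11, 8)  # after this date, the folder prefix is "cm"


def debate_page_url(session, day, page_num):
    """Build the URL of one debate page for a sitting day in 2004-2010."""
    full = day.strftime("%y%m%d")                      # e.g. '050601'
    # Years 2000-2009 drop the leading zero in the page filename
    short = full[1:] if full[0] == "0" else full       # e.g. '50601'
    width = 4 if day > PAGE_WIDTH_CHANGE else 2
    prefix = "cm" if day > VOLUME_PREFIX_CHANGE else "vo"
    page = str(page_num).zfill(width)
    return "{0}{1}/cmhansrd/{2}{3}/debtext/{4}-{5}.htm".format(
        BASE, session, prefix, full, short, page)


print(debate_page_url("200506", date(2005, 6, 1), 1))
# https://publications.parliament.uk/pa/cm200506/cmhansrd/vo050601/debtext/50601-01.htm
print(debate_page_url("200607", date(2007, 1, 15), 12))
# https://publications.parliament.uk/pa/cm200607/cmhansrd/cm070115/debtext/70115-0012.htm
```

Keeping the rules in one place means the first-page check and the page loop cannot drift apart, and the cutoff behaviour can be verified without making any requests.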

### Downloading 2011-2017