Code for the creation of the Hansard database

Hansard Speeches and Sentiment


Repository for a public dataset of speeches in the Hansard. The dataset provides information on each speech of ten words or longer made in the House of Commons between 1980 and 2016, including the speaking MP, their party, gender and age at the time of the speech. The dataset also includes all speeches of ten words or longer made from 1936 to 1980, for a total of 4,212,134 speeches and 773,585,770 words. More information on the dataset is available here. The dataset itself can be accessed through Zenodo.

The speeches have been classified for sentiment using a total of four libraries from the R package lexicon, one from syuzhet and one from this paper. All six scores used the method from the sentimentr package. The libraries are:

  1. The AFINN library by Finn Årup Nielsen, labelled afinn. The AFINN library was accessed through the syuzhet package.

  2. The Opinion Mining, Sentiment Analysis and Opinion Spam Detection dataset by Bing Liu, Minqing Hu and Junsheng Cheng, labelled bing. The Bing library was accessed through the syuzhet package.

  3. The NRC Word-Emotion Association Lexicon by Saif M. Mohammad, labelled nrc. The NRC library was accessed through the syuzhet package.

  4. The Sentiwords dataset, created by Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. The Sentiwords library was accessed through the lexicon package.

  5. The Hu & Liu dataset, by Minqing Hu and Bing Liu, labelled Hu. The Hu & Liu library was accessed through the sentimentr package.

  6. A modified version of the unnamed lexicon from the paper Measuring Emotion in Parliamentary Debates with Automated Textual Analysis, labelled rheault. As the method in sentimentr does not distinguish between uses of the same word in different lexical categories, I used the average polarity score assigned to such words.
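
The averaging step described for the rheault lexicon can be illustrated with a minimal sketch. This is not the code used to build the dataset (which relies on R packages); it is a Python illustration that assumes the lexicon is available as (word, lexical category, polarity) entries. The function name `collapse_pos_scores` and the example entries and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_pos_scores(entries):
    """Return one polarity score per word.

    `entries` is an iterable of (word, lexical_category, polarity) tuples.
    A word listed under several lexical categories receives the mean of
    its polarity scores; a word listed once keeps its score.
    """
    scores_by_word = defaultdict(list)
    for word, _category, polarity in entries:
        # Ignore the lexical category and group scores by the word itself.
        scores_by_word[word.lower()].append(polarity)
    return {word: mean(scores) for word, scores in scores_by_word.items()}


# Hypothetical entries, invented for illustration only.
example_entries = [
    ("support", "noun", 0.5),
    ("support", "verb", 0.25),
    ("abuse", "noun", -0.75),
]

collapsed = collapse_pos_scores(example_entries)
print(collapsed)
```

Here "support" appears as both a noun and a verb, so it ends up with a single score that is the mean of the two, while "abuse" keeps its only score.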


The data used to create this dataset was taken from the parlparse project operated by They Work For You and supported by mySociety.

The dataset is licensed under a Creative Commons Attribution 4.0 International License.

The code included in this repository is licensed under an MIT license.

Please contact me or open an issue here if you find any errors in the dataset. The integrity of the public Hansard record is questionable at times, and while I have improved it, the data is presented 'as is'.

New in 2.4.3

  • "Julia Dockerill" name changed to "Julia Lopez" to reflect MP's actual name change