danielharan / hansard_scraper
- Source
- Commits
- Network (0)
- Issues (0)
- Downloads (0)
- Wiki (1)
- Graphs
-
Branch:
master
Hansard scraper
Parliament’s minutes ("Hansard") are stuck in the age of dead trees. This project is a first step to making them really digital.
The objective of this scraper is to extract structured content from a Hansard page. Anyone can use it, although it was originally created for use by citizen-factory, "Hansard 2.0": github.com/danielharan/citizen-factory/tree/master
Try it
ruby output.rb > semantic.html
It’s effing slow: it takes around 2 minutes on a macbook pro
The code’s also not the prettiest.
Limitation
The project does not try to resolve / disambiguate members or bills. URLs are preserved so the importing application can do so.
About the data
The last session of parliament as of writing: www2.parl.gc.ca/housechamberbusiness/chambersittings.aspx?Parl=40&Ses=2&Language=E&Mode=2
A couple sample pages are saved in hansards/

