danielharan / hansard_scraper

get parl.gc.ca Hansard into a saner format

keep links inside interventions 
Daniel Haran (author)
Thu Apr 23 14:14:28 -0700 2009
commit  41b17296667bb4ed83a79585ea445afbb0f54e18
tree    051b40691aa7d1578eb3e6f1c0d6cffa3211f300
parent  67a6d883e4ca3e080f110d58dd6d057f81f3e9f0
hansard_scraper /
file .gitignore
file README.rdoc
file Rakefile
file TODO
file array_utils.rb
file division.rb
file extractor.rb
directory hansards/
file header.rb
file hpricot_extractor.rb
file intervention.rb
file output.rb
file semantic_out.html.erb
directory test/
file toc_link.rb
README.rdoc

Hansard scraper

Parliament’s minutes ("Hansard") are stuck in the age of dead trees. This project is a first step toward making them truly digital.

This scraper extracts structured content from a Hansard page. Anyone can use it, although it was originally written for citizen-factory ("Hansard 2.0"): github.com/danielharan/citizen-factory/tree/master

Try it

ruby output.rb > semantic.html

It’s effing slow: a run takes around 2 minutes on a MacBook Pro.

The code’s also not the prettiest.
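
To check how slow a run is on your own machine, something like the following could wrap the command above. This is only a sketch: timed_run is a hypothetical helper, not part of this project, and it uses nothing beyond Ruby's standard library.

```ruby
require "benchmark"
require "rbconfig"

# Hypothetical helper, not part of hansard_scraper.
# Runs +command+ (an array of program and arguments) with stdout redirected
# to +output_path+, and returns the elapsed wall-clock time in seconds.
def timed_run(command, output_path)
  Benchmark.realtime do
    ok = system(*command, out: output_path)
    raise "command failed: #{command.join(' ')}" unless ok
  end
end

# Usage, equivalent to `ruby output.rb > semantic.html`:
#   seconds = timed_run([RbConfig.ruby, "output.rb"], "semantic.html")
#   warn format("output.rb finished in %.1f s", seconds)
```

The shell's own `time ruby output.rb > semantic.html` gives the same information; the Ruby version is mainly useful if you want to log timings from a Rake task.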

Limitation

The project does not try to resolve or disambiguate members or bills. URLs are preserved so the importing application can do that itself.
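
As a rough illustration of what the importing application might do with the preserved URLs, the sketch below groups interventions by speaker URL, so one member who appears under several name spellings lands in a single bucket. Everything here is an assumption made for the example: SampleIntervention, its fields, and the URLs are invented, and the scraper's real output may be shaped differently.

```ruby
# Hypothetical record shape, for illustration only; the scraper's real
# classes and field names may differ.
SampleIntervention = Struct.new(:speaker_name, :speaker_url, :text)

# Groups interventions by speaker URL. The URL acts as a stable key even
# when the displayed name varies from one intervention to the next.
def group_by_speaker_url(interventions)
  interventions.group_by(&:speaker_url)
end

# Made-up sample data; these are not real parl.gc.ca URLs.
SAMPLE_INTERVENTIONS = [
  SampleIntervention.new("Hon. Jane Example", "/members/1", "Mr. Speaker, ..."),
  SampleIntervention.new("Ms. Example", "/members/1", "I thank the member."),
  SampleIntervention.new("Mr. John Sample", "/members/2", "A question.")
]

grouped = group_by_speaker_url(SAMPLE_INTERVENTIONS)
grouped.each do |url, items|
  names = items.map(&:speaker_name).uniq.join(", ")
  puts "#{url}: #{items.length} intervention(s), names seen: #{names}"
end
```

Keying on the URL rather than the name is the point: the name text is whatever Hansard printed, while the link identifies the member.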

About the data

The most recent session of Parliament at the time of writing: www2.parl.gc.ca/housechamberbusiness/chambersittings.aspx?Parl=40&Ses=2&Language=E&Mode=2

A couple of sample pages are saved in hansards/.
