A tool for parsing Australian government websites into a contact database

CodeProBono/polilist

This project is an effort to parse the contents of Australian parliament
websites into a database suitable for mail merging, filtering, etc.

Information for users
---------------------
To use this tool, you will need the following components installed:
 * python 2.6
 * python-httplib2
 * python-beautifulsoup (for some parsers)
While this tool is designed to work on any operating system, it has only been
tested on Linux. If you run into compatibility problems, please file a bug
report as described below.

To check the command line arguments for this tool run:
 ./polilist --help

More comprehensive information will be added to this README file as the tool
matures. If you come across a bug or would like a feature added please request
this at https://github.com/Smattr/polilist/issues.

Information for developers
--------------------------
For now, while this project is in its early stages, this README file will
contain a basic overview of the proposed design of this system.

Please follow these guidelines during development:
 - Unless you have a very good reason, always use the hardcache provider during
   development. If not, you will quickly upset website admins.
 - Don't commit uncommented code. I don't want to have to divine the
   meaning of your code.
 - If you are making a major architectural change, please speak to me about it
   before committing.
 - If you can see problems in the design that will cause issues down the road,
   please flag them with me as soon as possible.
 - If you come across an issue that you can't (or don't want to) fix, please
   raise it at https://github.com/Smattr/polilist/issues.
 - If possible, make sure your changes don't break any of the tests (ensure
   tests/runtests executes correctly).
Along with these, follow general DVCS-project politeness (informative commit
messages, <80 characters in first line of commit message, don't create remote
heads, don't commit code that doesn't compile, etc.). If you're about to do
something that you think might upset people, ask them before pushing.

Design
------
Apologies for the confusing manner in which this is described. It's not very
clear in my head right now.
                                                  +-----------+
                                               ++-|Base Parser|
                           +----+              || +-----------+
    +-------------+<-------|Main|------+     +--------+
    |Base Provider|        +----+      +---->|Parser 1|
    +-------------+             |      |     +--------+
        /       \               |      |       |
       /         \              |      |     +--------+
      /           \             |      +---->|Parser 2|
  +----------+   +----------+   |            +--------+
  |Provider 1|   |Provider 2|   |
  +----------+   +----------+   |
                                |
                               \ /
                           +-----------+
                           |Base Output|
                           +-----------+
                             /      \
                            /        \
                           /          \
                      +----------+   +-------------+
                      |CSV Output|   |SQLite Output|
                      +----------+   +-------------+

The main logic of this tool has a provider (an object with the ability to
retrieve documents over HTTP). The reason for having more than one type of
provider is to allow different providers to implement different caching
policies; in particular to allow a provider with a very lazy cache coherence
protocol for use in testing so as not to overload websites with unnecessary
HTTP requests.
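
As a rough illustration of the provider idea, here is a minimal sketch of a
base provider and a caching subclass. Everything in it is an assumption made
for the example: the class names, the get() method, the injected fetch
callable and the on-disk layout are hypothetical, not the project's actual
code. It is written for a current Python interpreter rather than Python 2.6,
and a stand-in fetcher replaces real HTTP so that no website is contacted.

```python
import hashlib
import os
import tempfile


class BaseProvider(object):
    """Hypothetical interface: anything that can return the body of a URL."""

    def get(self, url):
        raise NotImplementedError


class HardCacheProvider(BaseProvider):
    """Fetch each URL at most once, then serve later requests from disk."""

    def __init__(self, fetch, cache_dir):
        self.fetch = fetch          # callable(url) -> bytes; does the real HTTP work
        self.cache_dir = cache_dir  # directory holding one file per cached URL

    def _path(self, url):
        # Hash the URL so the cache file name is always safe to create.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def get(self, url):
        path = self._path(url)
        if not os.path.exists(path):
            # Cache miss: call the fetcher once and remember the result.
            with open(path, "wb") as f:
                f.write(self.fetch(url))
        with open(path, "rb") as f:
            return f.read()


# Stand-in fetcher (an assumption) that records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html>example</html>"


provider = HardCacheProvider(fake_fetch, tempfile.mkdtemp())
first = provider.get("http://example.invalid/members")
second = provider.get("http://example.invalid/members")
```

The second get() is answered from the cache file, so the fetcher runs only
once. A provider with a stricter coherence policy could subclass the same base
and re-fetch whenever it considers its copy stale.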

Main has a set of parsers, each specific to a given webpage. Each parser
encapsulates the format of a page by providing a function to retrieve the
contacts on that page (or pages linked from there). In this way, Main should
just be able to iterate through its known parsers and call each one.
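
The relationship between Main and its parsers could look something like the
sketch below. The names BaseParser, get_contacts and collect_contacts are
hypothetical, and the "page" is a toy comma-separated format served by a stub
provider; a real parser would pick apart HTML instead.

```python
class BaseParser(object):
    """Hypothetical interface: one parser per source web page."""

    url = None  # the page this parser understands

    def get_contacts(self, provider):
        raise NotImplementedError


class ExampleParser(BaseParser):
    """Toy parser for a made-up page with one 'name, email' pair per line."""

    url = "http://example.invalid/members"

    def get_contacts(self, provider):
        page = provider.get(self.url).decode("utf-8")
        contacts = []
        for line in page.splitlines():
            if "," in line:
                name, email = line.split(",", 1)
                contacts.append({"name": name.strip(), "email": email.strip()})
        return contacts


class StubProvider(object):
    """Stand-in provider (an assumption) that returns a fixed document."""

    def get(self, url):
        return (b"Jane Citizen, jane@example.invalid\n"
                b"John Citizen, john@example.invalid\n")


def collect_contacts(parsers, provider):
    # Main's loop: ask every known parser for the contacts on its page.
    contacts = []
    for parser in parsers:
        contacts.extend(parser.get_contacts(provider))
    return contacts


contacts = collect_contacts([ExampleParser()], StubProvider())
```

Because each parser hides the format of its own page, supporting a new website
under this sketch means adding one parser class and listing it alongside the
others.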

Main has a single Output that it uses when it comes time to write out the
contact information it has discovered. Each Output defines a format for the
file(s) it writes.
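
A minimal sketch of the output side, assuming a hypothetical write() method
and a contact represented as a dictionary (neither is taken from the project's
code). It uses the csv module of a current Python interpreter and writes to an
in-memory buffer rather than a file.

```python
import csv
import io


class BaseOutput(object):
    """Hypothetical interface: something that can write out contacts."""

    def write(self, contacts):
        raise NotImplementedError


class CSVOutput(BaseOutput):
    """Write contacts as CSV, one row per contact, with a header row."""

    def __init__(self, stream, fields):
        self.stream = stream  # any text file-like object
        self.fields = fields  # column order for the output

    def write(self, contacts):
        writer = csv.DictWriter(self.stream, fieldnames=self.fields,
                                lineterminator="\n")
        writer.writeheader()
        for contact in contacts:
            writer.writerow(contact)


buffer = io.StringIO()
CSVOutput(buffer, ["name", "email"]).write(
    [{"name": "Jane Citizen", "email": "jane@example.invalid"}])
csv_text = buffer.getvalue()
```

An SQLite output could implement the same write() method, which is what would
let Main stay unaware of the format it is writing.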

That should all be as clear as mud now.

Testing
-------
The directory tests contains all the unit and regression tests for this project.
To run them, execute the script runtests in this directory. Note that running
the tests may cause some data to be downloaded and you should expect some tests
to fail if you do not have an available connection to the internet.

If you look at runtests, you will see that it just executes every executable
file in the subdirectories. To create a new test just create an executable file
(can be any interpreted language) in one of the subdirectories. To decide where
your new test should live, refer to the following list:
 - tests/integration Tests that only interact with the project through its
                     external interfaces. These are tests designed to exercise
                     the project functionality that is exposed to end users.
 - tests/regression  Tests that prove the absence of some previously observed
                     bug. The purpose of these tests is to ensure a bug that has
                     been fixed in the past is not reintroduced by new changes.
                     When closing a bug you should also commit a regression test
                     that validates the fix.
 - tests/unit        Tests that probe a feature of a specific module in
                     isolation. These do not have to run through the same setup
                     as the main program logic.
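
As an illustration, a new file under tests/unit might look like the sketch
below. It assumes that runtests treats a non-zero exit status as a failure,
which this README does not state, and normalise_name is a made-up stand-in for
whatever project function a real test would import. An uncaught AssertionError
makes the Python interpreter exit with a non-zero status, so plain assert
statements are enough here.

```python
#!/usr/bin/env python
# Hypothetical unit test: remember to mark the file executable (chmod +x).


def normalise_name(raw):
    """Stand-in for project code under test: collapse stray whitespace."""
    return " ".join(raw.split())


def test_collapses_whitespace():
    assert normalise_name("  Jane   Citizen ") == "Jane Citizen"
    assert normalise_name("John\tCitizen") == "John Citizen"


# Run the checks when the file is executed; any failure aborts the script.
test_collapses_whitespace()
```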

Links
-----
Directory of NSW local councils: http://www.dlg.nsw.gov.au/dlg/dlghome/dlg_localgovdirectory.asp
