Scraper of Slovak National Council for Visegrad+ project.

scraper-sk_nrsr

Scraper of the Slovak National Council for the Visegrad+ project. It scrapes MPs, their memberships, votes and debates, and stores the data in the Visegrad+ parliament API.

Installation

Prerequisites

Requires:

  • lxml library to parse HTML documents,
  • LibreOffice core and unoconv to convert documents from RTF format,
  • some Python packages.

On Debian-based distributions install the libraries:

$ sudo apt-get install libxml2-dev libxslt1-dev zlib1g-dev libreoffice-core unoconv
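As a quick sanity check after installing the packages, a small shell function can report which of the command-line tools used later in this README are still missing. This is only a sketch: the function name missing_tools is hypothetical and not part of the scraper, and it checks for commands on PATH only, not for the development libraries.

```shell
# Hypothetical helper: print each given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call with the tools this README relies on:
#   missing_tools git wget virtualenv unoconv
```

An empty output means that all listed commands were found.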

Download

Get the scraper:

$ sudo mkdir -p /home/projects/scrapers
$ cd /home/projects/scrapers
$ sudo git clone https://github.com/KohoVolit/scraper-sk_nrsr.git sk_nrsr

Get the VPAPI client and the SSL certificate of the server:

$ cd sk_nrsr
$ sudo wget https://raw.githubusercontent.com/KohoVolit/api.parldata.eu/master/client/vpapi.py
$ sudo wget https://raw.githubusercontent.com/KohoVolit/api.parldata.eu/master/client/server_cert.pem

Create a virtual environment for the scraper and install the required packages into it:

$ sudo virtualenv /home/projects/.virtualenvs/scrapers/sk_nrsr --no-site-packages
$ source /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/activate
(sk_nrsr)$ sudo pip install -r requirements.txt
(sk_nrsr)$ deactivate

Configuration

Check that the SERVER_NAME and SERVER_CERT variables in vpapi.py have correct values.

Copy the file conf/private-example.json to conf/private.json and fill in your username and password for write access through the API. This sensitive data must not be committed to the repository.
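The copy step can be sketched as a small shell function that never overwrites an existing conf/private.json and warns when it cannot confirm that Git ignores the file. The function name create_private_conf is hypothetical, and whether the repository's .gitignore already covers conf/private.json is not stated here, so treat the check as a precaution rather than a description of the project's setup.

```shell
# Hypothetical helper: create private.json from the example, keeping any existing file.
create_private_conf() {
    conf_dir="${1:-conf}"
    if [ ! -e "$conf_dir/private.json" ]; then
        cp "$conf_dir/private-example.json" "$conf_dir/private.json"
    fi
    # git check-ignore exits with 0 only when the path is ignored;
    # it also fails outside a Git checkout, which triggers the same warning.
    if ! git check-ignore -q "$conf_dir/private.json" 2>/dev/null; then
        echo "warning: could not confirm that $conf_dir/private.json is ignored by Git" >&2
    fi
}

# Example call from the scraper directory:
#   create_private_conf conf
```

After running it, edit conf/private.json by hand to fill in the credentials.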

Running

Run the scraper in the virtual environment. See its help message for the parameters it accepts:

$ source /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/activate
$ python scrape.py --help

A unoconv listener must be running to scrape transcripts of former debates (election terms 1-4):

$ unoconv --listener &
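The listener may need a moment before it is ready, so a script that starts it and then scrapes right away could wait for the LibreOffice process first. The sketch below assumes that the listener shows up as a process named soffice.bin (the same name used further down to stop it); the function name wait_for_listener and the default of 10 attempts are illustrative choices.

```shell
# Hypothetical helper: wait up to N seconds (default 10) for a soffice.bin process.
wait_for_listener() {
    tries="${1:-10}"
    while [ "$tries" -gt 0 ]; do
        # pgrep -x matches the process name exactly.
        pgrep -x soffice.bin >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example call:
#   wait_for_listener 15 || echo "unoconv listener did not start" >&2
```

Note that this only checks that the process exists, not that it already accepts conversion requests.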

Scrape people and their memberships first, then debates and finally votes. Note that the initial scrape of debates deletes all existing sessions and sittings:

$ sudo -u visegrad python scrape.py --people initial --debates none --votes none
$ sudo -H -u visegrad python scrape.py --people none --debates initial --votes none
$ sudo -u visegrad python scrape.py --people none --debates none --votes initial

The -H option is needed when scraping debates because unoconv creates temporary files in HOME. Alternatively, scrape everything at once:

$ sudo -H -u visegrad python scrape.py --people initial --debates initial --votes initial
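If the staged variant is preferred, the three commands can be wrapped so that a later stage runs only when the previous one succeeded. This is a sketch, not part of the scraper: the function names run and initial_scrape and the DRY_RUN variable are hypothetical, and it assumes that scrape.py exits with a non-zero status on failure.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the three initial stages in order and stop at the first failure.
initial_scrape() {
    run sudo -u visegrad python scrape.py --people initial --debates none --votes none &&
    run sudo -H -u visegrad python scrape.py --people none --debates initial --votes none &&
    run sudo -u visegrad python scrape.py --people none --debates none --votes initial
}

# Example calls:
#   DRY_RUN=1; initial_scrape    # preview the commands
#   DRY_RUN=0; initial_scrape    # execute them
```

Previewing with DRY_RUN=1 first is a cheap way to confirm the order before the debates stage deletes existing sessions and sittings.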

You can stop the unoconv listener unless it is needed for other scrapers or conversions:

$ sudo killall soffice.bin

Then schedule a periodic scrape:

$ sudo -u visegrad python scrape.py --people recent --debates recent --votes recent

or, since recent is the default value, simply:

$ sudo -u visegrad python scrape.py
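One way to schedule the periodic scrape is a system cron entry that runs the scraper as the visegrad user with the Python interpreter of the virtual environment. The sketch below writes such an entry; the function name write_cron_entry, the file name /etc/cron.d/scraper-sk_nrsr and the daily 03:00 schedule are assumptions to adapt, while the paths are the ones used in this README.

```shell
# Hypothetical helper: write a cron.d entry for a daily "recent" scrape.
write_cron_entry() {
    target="${1:-/etc/cron.d/scraper-sk_nrsr}"
    cat > "$target" <<'EOF'
# m h dom mon dow user command
0 3 * * * visegrad cd /home/projects/scrapers/sk_nrsr && /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/python scrape.py
EOF
}

# Example calls:
#   write_cron_entry /tmp/scraper-sk_nrsr    # inspect the result first
#   write_cron_entry                         # write to /etc/cron.d (needs root)
```

Calling the interpreter inside the virtual environment directly avoids having to activate the environment from cron.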