avere-europarl

Scrapes the European Parliament's MEP directory at https://www.europarl.europa.eu/meps/

Usage

Run with Docker

./start.sh

This creates a data/ folder containing all the scraped information.

Run locally

Install requirements

  • Built on Python 3.6.8

sudo apt update
sudo apt install python3 python3-pip
pip3 install Scrapy==1.6.0 beautifulsoup4

Run the program

cd scraper
./main.py

This creates a ./data/ folder where the scraper dumps the data scraped for every MEP.

Arguments

./main.py --id 123456

This creates a folder named TEST - 123456 in ./data/ and saves there the data scraped for the MEP with ID 123456.
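
For illustration, here is how such an --id flag might be wired up with argparse. This is a minimal sketch, not the repo's actual main.py; scrape_mep and scrape_all are hypothetical stand-ins for the real scraping logic:

import argparse

def scrape_mep(mep_id):
    # Hypothetical stand-in for the single-MEP path (the TEST - <id> folder).
    print("would scrape MEP %d into data/TEST - %d/" % (mep_id, mep_id))

def scrape_all():
    # Hypothetical stand-in for the full directory crawl.
    print("would scrape every MEP from the directory")

def main():
    parser = argparse.ArgumentParser(description="Scrape europarl MEP pages")
    parser.add_argument("--id", type=int,
                        help="scrape only the MEP with this ID")
    args = parser.parse_args()
    if args.id is not None:
        scrape_mep(args.id)
    else:
        scrape_all()

if __name__ == "__main__":
    main()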

How it works

The scraper:

  1. gets the names and IDs of all MEPs from https://www.europarl.europa.eu/meps/en/directory/xml

  2. for each MEP page https://www.europarl.europa.eu/meps/en/$ID (see the sketch after this list):

    1. saves the data from the page in a folder with the following structure:

      • $root/scraper/data/
        • $MEP_FULL_NAME - $ID/
          • declaration1.pdf
          • declaration2.pdf
          • section1.html
          • section2.html
          • ...

    2. prints statistics about what was done
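
A minimal sketch of this flow as a Scrapy spider, not the repo's actual code: the XML element names mep, fullName, and id are assumptions about the directory format, and page.html stands in for the per-section files and declaration PDFs.

import os
from xml.etree import ElementTree

import scrapy

class MepSpider(scrapy.Spider):
    name = "meps"
    start_urls = ["https://www.europarl.europa.eu/meps/en/directory/xml"]

    def parse(self, response):
        # Step 1: extract (full name, ID) pairs from the XML directory.
        tree = ElementTree.fromstring(response.body)
        for mep in tree.iter("mep"):  # assumed element name
            full_name = mep.findtext("fullName")  # assumed element name
            mep_id = mep.findtext("id")           # assumed element name
            url = "https://www.europarl.europa.eu/meps/en/%s" % mep_id
            # Step 2: visit each MEP page.
            yield scrapy.Request(url, callback=self.save_page,
                                 meta={"name": full_name, "id": mep_id})

    def save_page(self, response):
        # Step 2.1: save the page under "$MEP_FULL_NAME - $ID/".
        folder = os.path.join("data", "%s - %s" % (response.meta["name"],
                                                   response.meta["id"]))
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, "page.html"), "wb") as f:
            f.write(response.body)

Saved to a file, a spider like this can be run with scrapy runspider mep_spider.py.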

Taking a look at the data

  • What is the total disk size of the output? 1.7 GB (a quick way to check is sketched below).

  • Do the results change when the scraper runs on three days in a row?

  • Do we need to run OCR on any 2019 declarations?
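
For the first question, one quick way to measure the output is to walk the output folder and sum file sizes. This is a minimal sketch, assuming the crawl has already run and produced the default data/ folder:

import os

# Sum the sizes of all files under data/, recursively.
total = 0
for root, _dirs, files in os.walk("data"):
    for name in files:
        total += os.path.getsize(os.path.join(root, name))
print("total output size: %.1f GB" % (total / 1e9))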