Scrape https://www.europarl.europa.eu/meps/
./start.sh
It will create a folder called data with all the scraped information in it.
- Built on Python 3.6.8
sudo apt update
sudo apt install python3 python3-pip
pip3 install Scrapy==1.6.0 beautifulsoup4
cd scraper
./main.py
Creates a folder ./data/ where it will dump all the scraped data from every MEP.
./main.py --id 123456
This will create a folder called TEST - 123456 in ./data/, where it will save the data scraped from the MEP with the id 123456.
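A minimal sketch of how the --id flag might be handled (hypothetical, not the actual contents of main.py):

```python
#!/usr/bin/env python3
# Hypothetical sketch of main.py's CLI handling -- not the actual implementation.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Scrape MEP data from europarl.europa.eu")
parser.add_argument("--id", type=int, default=None,
                    help="scrape only the MEP with this numeric id (test mode)")
args = parser.parse_args()

if args.id is not None:
    # Single-MEP test run: everything goes into ./data/TEST - <id>/
    out_dir = Path("data") / f"TEST - {args.id}"
else:
    # Full run: one "<FULL NAME> - <id>" folder per MEP is created under ./data/
    out_dir = Path("data")

out_dir.mkdir(parents=True, exist_ok=True)
print(f"Writing output to {out_dir}")
```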
- gets names and IDs from https://www.europarl.europa.eu/meps/en/directory/xml (a fetch/parse sketch follows this list)
- for each page https://www.europarl.europa.eu/meps/en/$ID (see the per-MEP sketch after this list):
  - gets all Declaration PDFs
  - for each section and subsection in the left pane, saves an HTML file
    - e.g. curriculum-vitae.html for https://www.europarl.europa.eu/meps/en/124831/ISABELLA_ADINOLFI/cv#mep-card-content
    - the HTML content runs from the section title, "Curriculum Vitae", to just above the next section, "Contact"
- saves data from each page in a folder with the following structure:
  - $root/scraper/data/
    - $MEP_FULL_NAME - $ID/
      - declaration1.pdf
      - declaration2.pdf
      - section1.html
      - section2.html
      - ...
    - $MEP_FULL_NAME - $ID/
      - ...
- prints stats about what was done:
  - how many MEPs were scraped?
  - how many of them had PDFs like http://www.europarl.europa.eu/mepdif/124831_DFI_LEG9_rev0_IT.pdf under "Declarations"?

Open questions:
- What is the total disk size of the output? The last full run came to about 1.7 GB (a sketch for recomputing this from ./data/ follows).
- Do the results change when the scraper is run on three consecutive days?
- Do we need to run OCR on any of the 2019 declarations?
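The scraping steps above can be illustrated with a few stand-alone sketches. First, fetching names and IDs from the directory XML; the element names (mep, fullName, id) are assumptions about that feed's layout, so check the actual XML before relying on them:

```python
# Sketch: pull (full name, id) pairs out of the MEP directory feed.
# The element names "mep", "fullName" and "id" are assumed, not verified.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

DIRECTORY_URL = "https://www.europarl.europa.eu/meps/en/directory/xml"

def fetch_mep_directory():
    with urlopen(DIRECTORY_URL, timeout=30) as resp:
        root = ET.fromstring(resp.read())
    meps = []
    for mep in root.iter("mep"):
        name = mep.findtext("fullName")
        mep_id = mep.findtext("id")
        if name and mep_id:
            meps.append((name.strip(), mep_id.strip()))
    return meps

if __name__ == "__main__":
    print(f"Found {len(fetch_mep_directory())} MEPs in the directory")
```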
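Next, the per-MEP step: download the Declaration PDFs and cut each left-pane section out of the page, writing everything into the "$MEP_FULL_NAME - $ID" folder. The selectors and the assumption that sections are introduced by h2 titles are guesses about the page markup, not the scraper's actual code:

```python
# Sketch of the per-MEP step: save declaration PDFs and one HTML file per section.
# CSS selectors and the <h2>-as-section-title assumption are guesses about the markup.
import os
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

def save_mep(full_name, mep_id, base_dir="data"):
    out_dir = os.path.join(base_dir, f"{full_name} - {mep_id}")
    os.makedirs(out_dir, exist_ok=True)

    url = f"https://www.europarl.europa.eu/meps/en/{mep_id}"
    with urlopen(url, timeout=30) as resp:
        soup = BeautifulSoup(resp.read(), "html.parser")

    # 1. Declaration PDFs: links such as .../mepdif/<id>_DFI_LEG9_rev0_IT.pdf
    for i, a in enumerate(soup.select('a[href$=".pdf"]'), start=1):
        pdf_url = urljoin(url, a["href"])
        urlretrieve(pdf_url, os.path.join(out_dir, f"declaration{i}.pdf"))

    # 2. One HTML file per section: keep everything from a section title
    #    up to (but not including) the next section title.
    for title in soup.find_all("h2"):
        chunk = [str(title)]
        for sib in title.find_next_siblings():
            if sib.name == "h2":  # stop just above the next section
                break
            chunk.append(str(sib))
        fname = title.get_text(strip=True).lower().replace(" ", "-") + ".html"
        with open(os.path.join(out_dir, fname), "w", encoding="utf-8") as fh:
            fh.write("\n".join(chunk))
```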
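Finally, the stats. The counts and the total disk size can also be recomputed straight from ./data/ after a run; a minimal sketch (not necessarily how main.py reports them):

```python
# Sketch: recompute basic stats from the scraped output folder.
import os

def output_stats(root="data"):
    mep_dirs = [d for d in os.listdir(root)
                if os.path.isdir(os.path.join(root, d))]
    with_pdfs = sum(
        1 for d in mep_dirs
        if any(f.endswith(".pdf") for f in os.listdir(os.path.join(root, d)))
    )
    total_bytes = sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, _dirs, files in os.walk(root) for f in files
    )
    print(f"MEP folders:                {len(mep_dirs)}")
    print(f"MEPs with declaration PDFs: {with_pdfs}")
    print(f"Total output size:          {total_bytes / 1024 ** 3:.2f} GB")  # ~1.7 GB on the last full run

if __name__ == "__main__":
    output_stats()
```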