In [14]:
# Import some useful libraries
%matplotlib inline
import pandas as pd
import urllib
import xml.etree.ElementTree as ET
from scraper import *
import numpy as np
%load_ext autoreload
%autoreload 2

# display all pandas columns
pd.set_option('display.max_columns', 100)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Scraping of http://www.parlament.ch

First, we need to scrape some data from the website http://parlament.ch. This notebook scrapes several tables. The data are stored in the folder *data* but **not pushed to the GitHub repo** because of their size.

### Scraper class

The scraping uses the `Scraper` class defined in `scraper.py`.

* `scraper.get(table_name)` downloads the whole table from parlament.ch, stores it in a CSV file, and returns a `pandas.DataFrame`.

Depending on the size of the table, `Scraper` uses different techniques to fetch it.

* `scraper.get('Party')`: the table is small (fewer than 1'000 rows) and can be fetched with a single GET request.
* `scraper.get('Person')`: the table is mid-sized (between 1'000 and 30'000 rows) and must be fetched iteratively using the `$skip` option, because the API returns at most 1'000 rows per request.
* `scraper.get('Transcript')`: the table is big (more than 30'000 rows). The server API times out for skip > 30'000, so the rows have to be fetched by intervals of IDs.
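
The three strategies differ only in the URLs they request. The sketch below rebuilds URLs of the same shape as those printed in the request log of this notebook. The helper names (`build_url`, `skip_urls`, `id_interval_urls`) are hypothetical and are not taken from `scraper.py`; the page size and interval width of 1'000 are read off the log.

```python
from urllib.parse import quote

BASE_URL = "https://ws.parlament.ch/odata.svc"  # service root seen in the request log


def build_url(table, language="FR", top=None, skip=None, id_range=None):
    """Build one request URL (hypothetical helper, not the Scraper's own code)."""
    filter_expr = f"Language eq '{language}'"
    if id_range is not None:
        low, high = id_range
        # half-open interval: low <= ID < high
        filter_expr += f" and ID ge {low} and ID lt {high}"
    params = []
    if top is not None:
        params.append(f"$top={top}")
    # encode spaces as %20 but keep the single quotes, as in the log
    params.append("$filter=" + quote(filter_expr, safe="'"))
    if skip is not None:
        params.append(f"$skip={skip}")
    return f"{BASE_URL}/{table}?" + "&".join(params)


def skip_urls(table, n_rows, page_size=1000):
    """Mid-sized tables: one URL per page, advancing $skip by page_size."""
    return [build_url(table, top=page_size, skip=s) for s in range(0, n_rows, page_size)]


def id_interval_urls(table, max_id, step=1000):
    """Big tables: one URL per ID interval [low, low + step)."""
    return [build_url(table, id_range=(low, low + step)) for low in range(1, max_id + 1, step)]


print(build_url("Party"))                   # small table: a single request
print(skip_urls("Person", 3640)[1])         # second page of a mid-sized table
print(id_interval_urls("Vote", 2500)[0])    # first ID interval of a big table
```

Filtering on ID intervals avoids large `$skip` offsets entirely, which is presumably why it does not hit the time-out; the cost is that empty ID ranges still trigger a request.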

**After each scrape, `Scraper` checks whether the entire table was properly scraped and prints a summary.** To do so, it uses the `$count` option to get the actual size of the online table and checks that it matches the size of the scraped table.
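
A minimal sketch of this check, under two assumptions: that the count is requested with the OData `/$count` path segment, and that it is filtered by language like the data requests. Both function names and the `[ERROR]` message are hypothetical; only the `[OK]` wording is modeled on the summary lines printed in this notebook.

```python
from urllib.parse import quote

BASE_URL = "https://ws.parlament.ch/odata.svc"


def count_url(table, language="FR"):
    """URL returning the number of rows of a table (assumed /$count form)."""
    filter_expr = quote(f"Language eq '{language}'", safe="'")
    return f"{BASE_URL}/{table}/$count?$filter={filter_expr}"


def check_scrape(table, n_scraped, n_expected):
    """Compare the number of scraped rows with the server-side count."""
    if n_scraped == n_expected:
        return f"[OK] table {table} correctly scraped, {n_scraped} rows as expected"
    # hypothetical failure message, not taken from scraper.py
    return f"[ERROR] table {table}: scraped {n_scraped} rows, expected {n_expected}"


print(count_url("Party"))
print(check_scrape("Party", 83, 83))
```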

### Metadata
URL of the metadata: https://ws.parlament.ch/odata.svc/$metadata

It can be browsed with a tool such as XOData: http://pragmatiqa.com/xodata/
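
The table names can also be read from the metadata document programmatically. The sketch below uses `xml.etree.ElementTree`, which is already imported at the top of this notebook, on a small hand-written sample. The sample is an assumption about the general shape of an OData metadata document, not a copy of the real one, and `entity_set_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of an OData $metadata document (assumed shape).
SAMPLE_METADATA = """<edmx:Edmx Version="1.0" xmlns:edmx="http://schemas.microsoft.com/ado/2007/06/edmx">
  <edmx:DataServices>
    <Schema Namespace="Demo" xmlns="http://schemas.microsoft.com/ado/2008/09/edm">
      <EntityContainer Name="DemoContainer">
        <EntitySet Name="Party" EntityType="Demo.Party" />
        <EntitySet Name="Person" EntityType="Demo.Person" />
      </EntityContainer>
    </Schema>
  </edmx:DataServices>
</edmx:Edmx>"""


def entity_set_names(metadata_xml):
    """Return the Name attribute of every EntitySet element, ignoring namespaces."""
    root = ET.fromstring(metadata_xml)
    return [
        element.attrib["Name"]
        for element in root.iter()
        # tags look like "{namespace}EntitySet"; keep only the local name
        if element.tag.rsplit("}", 1)[-1] == "EntitySet"
    ]


print(entity_set_names(SAMPLE_METADATA))
```

Matching on the local tag name keeps the helper independent of the exact schema namespace, which differs between OData versions.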

### More about the method

For the scraping, we use the `requests` library. The website's metadata are published and follow the OData convention. We get the table names using XOData (linked above) and reconstruct the request URLs. We fetch the XML with `requests` and convert it into a JSON-like dictionary using the `xmltodict` library.
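
To illustrate the parsing step, here is a sketch that turns an OData response into a list of row dictionaries. It uses the standard-library `xml.etree.ElementTree` instead of `xmltodict` so that it runs without extra dependencies, and it parses a hand-written sample rather than a live response. The Atom layout and namespaces are those commonly used by OData feeds (versions 1 to 3); that the parlament.ch service returns exactly this layout is an assumption, and `feed_to_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly used by OData Atom feeds (assumed for this service).
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# Hand-written sample feed with two entries (not real data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:ID>1</d:ID>
        <d:Language>FR</d:Language>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:ID>2</d:ID>
        <d:Language>FR</d:Language>
      </m:properties>
    </content>
  </entry>
</feed>"""


def feed_to_rows(feed_xml):
    """Return one dict per entry, mapping property names to their text values."""
    root = ET.fromstring(feed_xml)
    rows = []
    for props in root.iterfind("atom:entry/atom:content/m:properties", NS):
        # strip the "{namespace}" prefix from each property tag
        rows.append({child.tag.rsplit("}", 1)[-1]: child.text for child in props})
    return rows


rows = feed_to_rows(SAMPLE_FEED)
print(rows)
```

The resulting list can be passed directly to `pd.DataFrame(rows)`; all values are strings at this point, so numeric columns such as `ID` still need converting.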



# Examples

## Scrape


Tables: Party, Person, MemberCouncil

In [2]:
scrap = Scraper()
df_party = scrap.get('Party') # get with simple GET request
df_person = scrap.get('Person') # get with iterative GET requests using skip
df_member_council = scrap.get('MemberCouncil')

GET: https://ws.parlament.ch/odata.svc/Party?$filter=Language%20eq%20'FR'
[OK] table Party correctly scraped, df.shape =  83 as expected
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=0
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=1000
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=2000
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=3000
GET: https://ws.parlament.ch/odata.svc/Person?$top=1000&$filter=Language%20eq%20'FR'&$skip=4000
[OK] table Person correctly scraped, df.shape =  3640 as expected
GET: https://ws.parlament.ch/odata.svc/MemberCouncil?$top=1000&$filter=Language%20eq%20'FR'&$skip=0
GET: https://ws.parlament.ch/odata.svc/MemberCouncil?$top=1000&$filter=Language%20eq%20'FR'&$skip=1000
GET: https://ws.parlament.ch/odata.svc/MemberCouncil?$top=1000&$filter=Language%20eq%20'FR'&$skip=2000
GET: https://w

In [3]:
df_party.shape

(83, 8)

Count how many rows a table contains.
(This is used afterwards to check that the whole table was scraped.)

In [6]:
n_parties = scrap.count('Party')
n_parties

83

Let's also get the voting data

In [19]:
scrap = Scraper(time_out=100) # increase time_out
df_vote = scrap.get('Vote')

GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%201%20and%20ID%20lt%201001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%201001%20and%20ID%20lt%202001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%202001%20and%20ID%20lt%203001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%203001%20and%20ID%20lt%204001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%204001%20and%20ID%20lt%205001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%205001%20and%20ID%20lt%206001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%206001%20and%20ID%20lt%207001
GET: https://ws.parlament.ch/odata.svc/Vote?$filter=Language%20eq%20'FR'%20and%20ID%20ge%207001%2

## Scrape a big table

In [7]:
# Scraping the Transcript table is harder:
# the server times out with skip > 30'000, so Scraper loops over ID intervals instead of skipping rows
# SLOW (approx. 5 min)
scrap = Scraper(time_out=300) # increase time_out
df_transcript = scrap.get('Transcript')

GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%201%20and%20ID%20lt%201001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%201001%20and%20ID%20lt%202001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%202001%20and%20ID%20lt%203001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%203001%20and%20ID%20lt%204001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%204001%20and%20ID%20lt%205001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%205001%20and%20ID%20lt%206001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%206001%20and%20ID%20lt%207001
GET: https://ws.parlament.ch/odata.svc/Transcript

GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2065001%20and%20ID%20lt%2066001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2066001%20and%20ID%20lt%2067001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2067001%20and%20ID%20lt%2068001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2068001%20and%20ID%20lt%2069001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2069001%20and%20ID%20lt%2070001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2070001%20and%20ID%20lt%2071001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2071001%20and%20ID%20lt%2072001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%2072001%20and%20

GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20130001%20and%20ID%20lt%20131001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20131001%20and%20ID%20lt%20132001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20132001%20and%20ID%20lt%20133001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20133001%20and%20ID%20lt%20134001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20134001%20and%20ID%20lt%20135001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20135001%20and%20ID%20lt%20136001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20136001%20and%20ID%20lt%20137001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20

GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20194001%20and%20ID%20lt%20195001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20195001%20and%20ID%20lt%20196001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20196001%20and%20ID%20lt%20197001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20197001%20and%20ID%20lt%20198001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20198001%20and%20ID%20lt%20199001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20199001%20and%20ID%20lt%20200001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20200001%20and%20ID%20lt%20201001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20

GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20258001%20and%20ID%20lt%20259001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20259001%20and%20ID%20lt%20260001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20260001%20and%20ID%20lt%20261001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20261001%20and%20ID%20lt%20262001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20262001%20and%20ID%20lt%20263001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20263001%20and%20ID%20lt%20264001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20264001%20and%20ID%20lt%20265001
GET: https://ws.parlament.ch/odata.svc/Transcript?$filter=Language%20eq%20'FR'%20and%20ID%20ge%20

In [8]:
df_transcript_count = scrap.count('Transcript')
df_transcript_count

291240

In [None]:
df_votes = scrap.get('Votes')