# 1st Homework - data visualisation and web scraping (deadline October 18th)

  * In this homework you should download (*scrape*) data from the web, process it and visualise.
  * The objective is to download data from server https://www.psp.cz/en/sqw/hlasovani.sqw?o=8 regarding voting in the Chamber of Deputies of the Parliament of the Czech Republic, save it in a tabular form and create visualisations, which make exploration of the data easier and show interesting information about it.

> **Unfortunately, the web page is only in Czech. However, it should be possible to navigate it. Please contact your tutorial teacher in case of any questions.**

## Data

 * Download data from all votings of the current Chamber of Deputies (since 2017). Download details of voting of particular deputies.
 * Data should contain basic information about the voting - number of meeting, number of voting, point of the meeting and date.

## Instructions

> **Homework is assigned in a way that you have space for invention. Thinking of the _exact solution path_ is part of the assignment. Originality will be taken into account in the evaluation.**

**Basic points of the assignment (8 points)**:
  * Write Python script for downloading data. Download the data and save it in a suitable machine-readable format.
  * **Wait at least 1 second between two consecutive requests to the server, to not overload it.**
  * In the second part of the notebook, work with the data loaded from a local file. File(s) with downloaded data should be submitted as well (so the reviewer do not have to download the data again).
  * Create visualisations to show the following:
    * Deputies changing their parliamentary clubs.
    * Attendance of individual deputies in the votings. Attendance of parliamentary clubs in the votings.
    * How often individual parliamentary clubs vote the same and different.
    * Are deputies in the same parliamentary club voting the same? Who are the biggest rebels?

**Further points of assignment**, each for 2 points (maximum for the homework is 12):
  * Visualise some time development in the data (e.g. attendance, change of agreement in voting among individual parliamentary clubs, etc.)
  * Find individual deputies who have the most similar voting or attendance.
  * Try to find particular voting, where deputies voted the most differently than traditionally.
  
## Tips and tricks
  * Import libraries at the beginning of the notebook, or the beginning of scraping and visualisation parts.
  * Use markdown cells (like this one) and headings to make orientation in the notebook easier.
  * Select plots and visualisation matching the information you want to show. You can see galleries of libraries `matplotlib` and `seaborn` for inspiration.

## Submission notes

  * Follow instructions at https://courses.fit.cvut.cz/BIE-VZD/homeworks/index.html
  * Submit **Jupyter Notebook** (possibly with additional scripts) and **file(s)** with downloaded data
  * Reviewer may allow you to finish or correct your homework to achieve additional points. However, the first version is crucial.

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import lxml.html


class Deputy:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.__generate_info()
#         self.polit_group = ''
#         self.member_from = ''
#         self.vote_num = 0
    def __generate_info(self):
        self.url = 'https://public.psp.cz/en/sqw/detail.sqw?id=' + str(id) + '&o=8&l=cz'
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
        polit_group = [i for i in soup.find_all('a') if 'Political group' in str(i)]
        polit_group = str(polit_group[0])[-12:-4]
#         self.polit_group
        

In [30]:
url = 'https://public.psp.cz/en/sqw/detail.sqw?id=6254&o=8&l=cz'
r = requests.get(url)
soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')


'ANO 2011'

In [2]:
# url with a form
url = 'https://www.psp.cz/en/sqw/hlasovani.sqw?o=8'

r = requests.get(url)
r.encoding="windows-1252"

In [3]:
soup = BeautifulSoup(r.content, 'lxml')
ahrefs = []
for link in soup.find_all('a'):
    ahrefs.append(link.get('href'))

In [4]:
assembly_links = []
url = 'https://public.psp.cz/en/sqw/'
for elem in ahrefs:
    if elem.startswith('hl.sqw?o=8&s=') and len(elem) <= 15:
        assembly_links.append(url + elem)
        

In [5]:
r = requests.get(assembly_links[0])  # Here i have to change it for cycle
r.encoding="windows-1252"
    
soup = BeautifulSoup(r.content)

# soup.find_all('a', href=re.compile("phlasa.sqw"))
links = soup.select_one("a[href^='phlasa.sqw']") 
links = re.search(r'"([^"]*)"', str(links))
links = url + links.group()[1:-1]
links = re.sub('amp;', '', links)

In [6]:
r = requests.get(links)
r.encoding="windows-1252"
soup = BeautifulSoup(r.content, 'lxml')
# https://public.psp.cz/en/sqw/hlasy.sqw?g=67018&l=en is a link example

ahrefs.clear()
for link in soup.find_all('a'):
    ahrefs.append(link.get('href'))

In [7]:
votings = []
url = 'https://public.psp.cz/en/sqw/'
for elem in ahrefs:
    if elem.startswith('hlasy.sqw?g='):
        votings.append(url + elem)
        

In [8]:
deputes = []
r = requests.get(votings[0])
r.encoding="windows-1252"
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('li'):
    deputes.append(link)

In [9]:
votings[0]

'https://public.psp.cz/en/sqw/hlasy.sqw?g=67018&l=en'

In [16]:
# [href^="detail.sqw?id="] 
# soup.find_all('div', {'id':'main-content'})
deputes = []
r = requests.get(votings[0])
soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
uls = soup.find_all('ul', {'class':'results'})
for ul in uls:
    for li in ul.find_all('li'):
        name = li.find('a').text.replace('\u00E8', '\u010D').replace('\xa0', ' ').replace('\u00F8', '\u0159').replace('\u00EC', '\u00E8')
        id = li.find('a').get('href')
        deputes.append(Deputy(name,id))


None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

None

In [91]:
deputes[0].name

'Vèra Adámková'