# 1st Homework - data visualisation and web scraping (deadline October 18th)

  * In this homework you will download (*scrape*) data from the web, process it, and visualise it.
  * The objective is to download data on voting in the Chamber of Deputies of the Parliament of the Czech Republic from the server https://www.psp.cz/en/sqw/hlasovani.sqw?o=8, save it in tabular form, and create visualisations that make the data easier to explore and reveal interesting information about it.

> **Unfortunately, the web page is only in Czech. However, it should be possible to navigate it. Please contact your tutorial teacher in case of any questions.**

## Data

 * Download data on all votes of the current Chamber of Deputies (since 2017), including how each individual deputy voted.
 * The data should contain basic information about each vote: meeting number, vote number, agenda item of the meeting, and date.

## Instructions

> **The homework is deliberately open-ended so that you have room for invention. Working out the _exact solution path_ is part of the assignment. Originality will be taken into account in the evaluation.**

**Basic points of the assignment (8 points)**:
  * Write a Python script for downloading the data. Download the data and save it in a suitable machine-readable format.
  * **Wait at least 1 second between two consecutive requests to the server, to not overload it.**
  * In the second part of the notebook, work with the data loaded from a local file. The file(s) with downloaded data should be submitted as well (so the reviewer does not have to download the data again).
  * Create visualisations to show the following:
    * Deputies changing their parliamentary clubs.
    * Attendance of individual deputies at the votes. Attendance of parliamentary clubs at the votes.
    * How often individual parliamentary clubs vote the same way and how often they differ.
    * Do deputies in the same parliamentary club vote the same way? Who are the biggest rebels?

**Further points of assignment**, each for 2 points (maximum for the homework is 12):
  * Visualise some development over time in the data (e.g. attendance, change in voting agreement among individual parliamentary clubs, etc.)
  * Find the individual deputies whose voting or attendance is most similar.
  * Try to find particular votes in which deputies voted most differently from their usual pattern.
  
## Tips and tricks
  * Import libraries at the beginning of the notebook, or the beginning of scraping and visualisation parts.
  * Use markdown cells (like this one) and headings to make orientation in the notebook easier.
  * Select plots and visualisations that match the information you want to show. Browse the galleries of the `matplotlib` and `seaborn` libraries for inspiration.

## Submission notes

  * Follow instructions at https://courses.fit.cvut.cz/BIE-VZD/homeworks/index.html
  * Submit the **Jupyter Notebook** (possibly with additional scripts) and the **file(s)** with downloaded data.
  * The reviewer may allow you to finish or correct your homework to achieve additional points. However, the first version is crucial.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import lxml.html


class Deputy:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.polit_group = []
        self.member_from = []
        self.generate_info()
        
#         self.list_of_votes = []
#         create list of votes [yes, no, excused]
        
    def generate_info(self):
        time.sleep(1)  # the assignment requires at least 1 second between requests
        
        self.url = str('https://public.psp.cz/en/sqw/detail.sqw?id=' + str(self.id) + '&o=8&l=cz')
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')

        lis = [i for i in soup.find_all('li') if 'Political group' in str(i)]
        for content in lis:
            self.polit_group.append(content.find('a').text)
            member_from = str(str(str(content).split(' from ', 1)[1])[:-5]).replace('\xa0', '')
            if 'till ' in member_from:
                member_from = member_from.split('till ', 1)[1]
            self.member_from.append(member_from)
        
        
    def load_to_csv(self, path):
        out = str(self.name) + ';' + str(self.id) + ';' + str(self.polit_group) + ';' + str(self.member_from) + '\n'
        with open(path, 'a') as file:  # write to the path passed in, not a hard-coded file
            file.write(out)
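
The assignment asks for at least 1 second between consecutive requests. Rather than scattering `time.sleep` calls, one option is a small helper that remembers when the last request happened and sleeps only for the remaining time. The sketch below is an illustration, not part of the original solution: the `Throttle` class and its `clock` and `sleep` parameters are hypothetical names introduced here (the injectable clock only makes the helper easy to test without real waiting).

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # hypothetical hook: returns the current time in seconds
        self._sleep = sleep    # hypothetical hook: blocks for the given number of seconds
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval that has not elapsed yet.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would create one shared `throttle = Throttle(1.0)` and call `throttle.wait()` immediately before every `requests.get(...)`, so time already spent parsing counts toward the pause.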

In [2]:

def download_data(url, vote_list, deput_list, results_list):
    # Step one
    r = requests.get(url)
    soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
    ahrefs = []
    
    for link in soup.find_all('a', href=True):  # skip anchors without an href (get() would return None)
        ahrefs.append(link.get('href'))
    
    meeting_links = []
    url = 'https://public.psp.cz/en/sqw/'
    
    for elem in ahrefs:
        if elem.startswith('hl.sqw?o=8&s=') and len(elem) <= 15:
            meeting_links.append(url + elem)
    # Step two
    r = requests.get(meeting_links[0])
    r.encoding="windows-1252"
    soup = BeautifulSoup(r.content, 'lxml')

    # soup.find_all('a', href=re.compile("phlasa.sqw"))  -> alternative
    link = soup.select_one("a[href^='phlasa.sqw']")
    # Read the attribute directly; str(link) would HTML-escape '&' in the URL.
    link = url + link.get('href')  # link == 'https://public.psp.cz/en/sqw/phlasa.sqw?o=8&l=en&pg=1'
    
    r = requests.get(link)
    soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
    ahrefs.clear()
    
    for link in soup.find_all('a', href=True):  # skip anchors without an href (get() would return None)
        ahrefs.append(link.get('href'))

    pg_list = []
    for elem in ahrefs:
        if elem.startswith('phlasa.sqw?o=8&l=en&pg='):
            pg_list.append(int(elem.split('phlasa.sqw?o=8&l=en&pg=')[1]))

    for i in range(1, max(pg_list) + 1):
        url = 'https://public.psp.cz/en/sqw/phlasa.sqw?o=8&l=en&pg=' + str(i)
        time.sleep(1)  # the assignment requires at least 1 second between requests
        r = requests.get(url)
        soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
        ahrefs.clear()

        # Step three
        # Creating Votings data
        for tr in soup.find_all('table')[0].find_all('tr'):
            tds = tr.find_all('td')
            if not tds:
                continue
            # Meeting
            meeting = tds.pop(0)
            meeting = int(meeting.text)
            # ID
            id = tds.pop(0)
            id = str(str(id.find('a').get('href')).split('hlasy.sqw?g=', 1)[1])[:-5]
            # Topic
            tds.remove(tds[0])
            
            topic = tds.pop(0).text.replace('\u00D8', '\u0158').replace('\u00E8', '\u010D').replace('\xa0', ' ').replace('\u00F8', '\u0159').replace('\u00EC', '\u011B')
            # Date
            date = tds.pop(0).text.replace('\xa0', '')
            # Result
            result = tds.pop(0).text.replace('\u00D8', '\u0158').replace('\u00E8', '\u010D').replace('\xa0', ' ').replace('\u00F8', '\u0159').replace('\u00EC', '\u011B')
            if result == 'Přijato':
                result = 'Accepted'
            elif result == 'Zamítnuto':
                result = 'Denied'
            elif result == 'Přijato (zmatečné)':  
                result = 'Accepted (invalid)'
            elif result == 'Zamítnuto (zmatečné)':  
                result = 'Denied (invalid)'
            elif result == 'Přijato (přijata námitka)':  
                result = 'Accepted (objection received)'
            elif result == 'Zamítnuto (přijata námitka)':  
                result = 'Denied (objection received)'
#             vote_df.append([id, meeting, topic, date, result])
            vote_list.append([id, meeting, topic, date, result])
            # Step four
            # Creating result data
            vot_id = str(id)
#             display(id)
            url = 'https://public.psp.cz/en/sqw/hlasy.sqw?g=' + id + '&l=en'
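            try:
                time.sleep(1)  # the assignment requires at least 1 second between requests
                r = requests.get(url)
            except requests.RequestException:
                # Skip this vote if the request fails; other errors should still surface.
                continue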
            soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
            uls = soup.find_all('ul', {'class':'results'})

            for ul in uls:
                for li in ul.find_all('li'):
                    name = li.find('a').text.replace('\u00D8', '\u0158').replace('\u00E8', '\u010D').replace('\xa0', ' ').replace('\u00F8', '\u0159').replace('\u00EC', '\u011B')
                    id = str(li.find('a').get('href').split('id=', 1)[1])[:4]
                    result = li.find('span').text
                    if result == 'A':
                        result = 1
                    elif result == 'N':
                        result = -1
                    elif result == 'Z':
                        result = 0
                    elif result == 'M' or result == '0':
                        result = -2
                    results_list.append([id, vot_id, name, result])
    # Step five
    # Creating deputy data
    url = 'https://public.psp.cz/en/sqw/hlasy.sqw?g=' + vote_list[0][0] + '&l=en'
    r = requests.get(url)
    soup = BeautifulSoup(r.content.decode('windows-1252', 'ignore'), 'html.parser')
    uls = soup.find_all('ul', {'class':'results'})
        
    for ul in uls:
        for li in ul.find_all('li'):
            name = li.find('a').text.replace('\u00D8', '\u0158').replace('\u00E8', '\u010D').replace('\xa0', ' ').replace('\u00F8', '\u0159').replace('\u00EC', '\u011B')
            id = str(li.find('a').get('href').split('id=', 1)[1])[:4]
            if id == '6165':  # speaker exception: his page has changed
                deput_list.append([id, name, 'ANO 2011', '26.10.2013'])  
                continue
            dep = Deputy(str(name),str(id))
            if len(dep.polit_group) > 1:
#                 print([dep.id, dep.name, dep.polit_group, dep.member_from])
                # Pair each group with its own date; set() is unordered and would misalign them.
                for pgroup, member_from in zip(dep.polit_group, dep.member_from):
                    deput_list.append([dep.id, dep.name, pgroup, member_from])
            else:
#                 print([dep.id, dep.name, dep.polit_group[0], dep.member_from])
                deput_list.append([dep.id, dep.name, dep.polit_group[0], dep.member_from[0]])    
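
The chains of `.replace(...)` calls above repair text that was decoded as `windows-1252`. The characters being swapped (`Ø` to `Ř`, `è` to `č`, `ø` to `ř`) suggest the pages are actually encoded as `windows-1250`; this is an assumption inferred from the replacement table, not something the source states. Under that assumption, decoding with the right codec would make the manual replacements unnecessary. A minimal sketch, where `decode_page` is a hypothetical helper name and the byte string only simulates a response body:

```python
def decode_page(raw: bytes) -> str:
    """Decode a response body, ASSUMING the server sends windows-1250."""
    return raw.decode('windows-1250', errors='replace')


# Simulated response body; a real one would come from requests.get(url).content
raw = 'Přijato (zmatečné)'.encode('windows-1250')

garbled = raw.decode('windows-1252')  # what the scraper currently sees before replace()
clean = decode_page(raw)              # Czech characters come out intact
```

Characters such as `ů`, `ň` or `Č` are not covered by the replacement table, so decoding once with the correct codec is likely more robust than extending the table.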
    
    

In [None]:
# Here we start scraping the data. 
# url with a form
url = 'https://www.psp.cz/en/sqw/hlasovani.sqw?o=8'
vote_list = []
deput_list = []
results_list = []

download_data(url, vote_list, deput_list, results_list)

In [None]:
result_df = pd.DataFrame(results_list, columns=['ID', 'Vote_id', 'Name', 'Result'])
result_df.to_csv('result.csv', sep=';')

In [None]:
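# Rows are [vote id, meeting number, topic, date, result], so five column names are needed.
vote_df = pd.DataFrame(vote_list, columns=['ID', 'Vote_id', 'Name', 'Date', 'Result'])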
vote_df.to_csv('votes.csv', sep=';')

In [None]:
deput_df = pd.DataFrame(deput_list, columns=['ID', 'Name', 'Politic group', 'Member from'])
deput_df.to_csv('deputies.csv', sep=';')

### ===============================

In [15]:
vote_df.info()
# result_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6983 entries, 0 to 6982
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       6983 non-null   object
 1   Vote_id  6983 non-null   int64 
 2   Name     6983 non-null   object
 3   Date     6983 non-null   object
 4   Result   6983 non-null   object
dtypes: int64(1), object(4)
memory usage: 272.9+ KB


Here we can see the deputy who changed his parliamentary club:

In [30]:
duplicates = deput_df[deput_df.duplicated(['Name'], keep=False)]
duplicates

| ID | Name | Politic group | Member from |
|---|---|---|---|
| 5911 | Jaroslav Foldyna | Political group Czech Social Democratic Party | 25.2.2020 |
| 5911 | Jaroslav Foldyna | Political group Freedom and Direct Democracy | 7.4.2020 |


In [14]:
result = pd.merge(vote_df, result_df, on=['ID', 'Vote_id'])

ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
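
The error arises because `Vote_id` is `int64` in `vote_df` but `object` in `result_df`. Judging from how the lists are built in `download_data`, the two tables also use the column names differently: in `vote_df`, `ID` is the vote id and `Vote_id` is the meeting number, while in `result_df`, `ID` is the deputy id and `Vote_id` is the vote id. One possible fix, sketched below on made-up stand-in rows, is to rename the `vote_df` columns to what they hold, cast the key to a common type, and join on the vote id only. The new names `Meeting`, `Topic` and `Outcome` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-ins for the scraped tables; all values are made up for illustration.
vote_df = pd.DataFrame({
    'ID': ['67018', '67019'],                  # vote id (string)
    'Vote_id': [1, 1],                         # actually the meeting number (int)
    'Name': ['Topic A', 'Topic B'],            # actually the topic of the vote
    'Date': ['24.11.2017', '24.11.2017'],
    'Result': ['Accepted', 'Denied'],
})
result_df = pd.DataFrame({
    'ID': ['5911', '5911'],                    # deputy id
    'Vote_id': ['67018', '67019'],             # vote id
    'Name': ['Deputy X', 'Deputy X'],
    'Result': [1, -1],
})

# Rename vote_df columns so each name says what the column contains.
votes = vote_df.rename(columns={'ID': 'Vote_id', 'Vote_id': 'Meeting',
                                'Name': 'Topic', 'Result': 'Outcome'})
# Cast the join key to str on both sides so the dtypes agree.
votes['Vote_id'] = votes['Vote_id'].astype(str)
ballots = result_df.assign(Vote_id=result_df['Vote_id'].astype(str))

# One row per (deputy, vote), enriched with the vote's metadata.
merged = ballots.merge(votes, on='Vote_id', how='left')
```

Casting both keys also covers the case where the tables are reloaded from CSV and pandas infers the ids as integers.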