pip3 install lxml <br>
pip3 install beautifulsoup4

# WEB SCRAPING 

In [120]:
import numpy as np
import requests # to obtain html data
import bs4      # beautifulsoup4 to parse the html content

The first step is to get the content of a specific web page. Before collecting anything, you should read the site's privacy policy regarding its data. In this tutorial we are going to use the web page of the Chilean Chamber of Deputies.

![hola](./images/privacypolicies.png)

In [35]:
res = requests.get('https://www.camara.cl/trabajamos/sala_votaciones.aspx')

Once we have obtained the content of the page, we can access the HTML as plain text

In [37]:
res.text

'\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head id="ctl00_Head1"><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>\r\n\tCámara de Diputados de Chile\r\n</title><link rel="stylesheet" type="text/css" href="/common/styles/main.css" media="screen" /><link rel="stylesheet" type="text/css" href="/common/styles/print.css" media="print" />\r\n\r\n    <script type="text/javascript" src="/common/scripts/jquery-1.2.6.min.js"></script>\r\n\r\n    <script type="text/javascript" src="/common/scripts/main.js"></script>\r\n\r\n   \r\n\r\n    <link rel="shortcut icon" href="/media/images/favicon.ico" /><link href="/WebResource.axd?d=pPzMpXLOsVTlHg-dfa2oezlyxtxfyJankOAeiJ8eRUQd5BFIUN4aXG0dZiMTW-Yw2byc2_X86KJrRXtd5TCIrDvH5YM7Njz5P0X40whzQ-np6-cOtlR5C2kcYbmtRSYYpkEcVu5qnoKJJ2ihbz8ba02ZmBt_hpYh2AuzTUMGzl41&amp;t=634366558068332176" type="text/css"

You can see the entire HTML content (as plain text) returned by the request. To parse and extract information from this document we will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).<br>
Calling ```bs4.BeautifulSoup(text, parser)``` creates an object that gives us access to all the tools BeautifulSoup provides. The second argument names the parser; here we use ```'lxml'```, an easy-to-use library for processing XML and HTML in Python.

In [38]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

It is important to know the HTML syntax. To explore the structure, we need to **inspect** the web page using the browser's developer tools. A direct way is to right-click the item and choose inspect.
![a](./images/navigation.png)

Now we can select anything we need from the HTML tags. For example, we can select **the table** section. The ```select()``` method returns a list of objects that match the given CSS selector. We use ```.``` to refer to **class names** and ```#``` for **ids**

In [64]:
table = soup.select('.tabla') # From the HTML we selected the class name "tabla" 
print(type(table))

<class 'list'>
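
As a small offline illustration of the two selector prefixes, the sketch below parses a made-up HTML snippet. Only the class name `tabla` comes from the real page; the id `votaciones` and the cell contents are hypothetical. It uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
import bs4

# Hypothetical HTML snippet; only the class name "tabla" comes from the real page.
html = """
<div id="votaciones">
  <table class="tabla"><tr><td>APROBADO</td></tr></table>
  <table class="tabla"><tr><td>RECHAZADO</td></tr></table>
</div>
"""

soup_demo = bs4.BeautifulSoup(html, 'html.parser')

by_class = soup_demo.select('.tabla')      # "." selects by class name
by_id = soup_demo.select('#votaciones')    # "#" selects by id

print(len(by_class))   # number of elements with class "tabla"
print(len(by_id))      # number of elements with id "votaciones"
print([t.get_text(strip=True) for t in by_class])
```

Both calls return a list-like result even when only one element matches, so a single element still has to be taken by index.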


Alternatively, we can reach a tag directly as an attribute of the soup object (for example ```soup.thead```). This returns a single HTML object rather than a list, as we saw above. To strip the tags and recover the text, use the ```.getText()``` method or the ```.text``` attribute

In [41]:
header = soup.thead
print(header.getText())



Fecha
Documento
Materia
Artículo
Tipo
Resultado
Afir.
Neg.
Abst.
Detalle




The complete documentation of **BeautifulSoup** can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Feel free to explore and use all of its powerful methods!

### Extracting Data from Chilean Deputies Web Page

The ```find_all()``` method scans the entire document looking for results, but sometimes you only want to find one result; in those cases it is convenient to use the ```find()``` method

In [79]:
table = soup.find('table', attrs={'class': 'tabla'})
print(type(table))

<class 'bs4.element.Tag'>
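
To make the difference between the two methods concrete, here is a minimal sketch on a hypothetical snippet (the `id` values are invented, and `html.parser` is used so that `lxml` is not required). It also shows that ```find()``` returns `None` when nothing matches, which is worth checking before using the result.

```python
import bs4

# Hypothetical snippet with two tables that share the class "tabla".
html = '<table class="tabla" id="a"></table><table class="tabla" id="b"></table>'
soup_demo = bs4.BeautifulSoup(html, 'html.parser')

first = soup_demo.find('table', attrs={'class': 'tabla'})       # first match only
every = soup_demo.find_all('table', attrs={'class': 'tabla'})   # all matches
missing = soup_demo.find('ul')                                   # no match gives None

print(first['id'])
print([t['id'] for t in every])
print(missing is None)
```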


Now we can iterate over table elements.
![a](./images/tablehtml.png)

In [88]:
for bar in table.find_all('tr'):
    for foo in bar.find_all('td'):
        print(foo.text)
        print('-'*20)

13 de sep de 2018 - 12:47
--------------------
Boletín N° 11702-13
--------------------
Modifica el Código del Trabajo con el objeto de hacer...
--------------------

--------------------
GENERAL
--------------------
APROBADO
--------------------
128
--------------------
0
--------------------
0
--------------------
Ver
--------------------
13 de sep de 2018 - 12:45
--------------------
Boletín N° 11787-22
--------------------
Modifica la ley N°18.290, de Tránsito, para eximir...
--------------------

--------------------
GENERAL
--------------------
APROBADO
--------------------
126
--------------------
0
--------------------
0
--------------------
Ver
--------------------
13 de sep de 2018 - 12:44
--------------------
Boletín N° 11972-10
--------------------
Aprueba el Tercer Protocolo Adicional al Acuerdo por...
--------------------

--------------------
UNICA
--------------------
APROBADO
--------------------
94
--------------------
21
--------------------
11
--------------------
V

We also need to keep the table header

In [130]:
header_raw = table.thead.text

In [135]:
header = header_raw.split('\n')[2:-2]
print(header)

['Fecha', 'Documento', 'Materia', 'Artículo', 'Tipo', 'Resultado', 'Afir.', 'Neg.', 'Abst.', 'Detalle']
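
Slicing the split text with fixed indices depends on the exact number of newlines in the markup. A possible alternative, sketched here on hypothetical markup and assuming the header cells are `<th>` elements, is to read each cell separately:

```python
import bs4

# Hypothetical header markup; it assumes the header cells are <th> elements.
html = """
<table class="tabla">
  <thead><tr><th>Fecha</th><th>Documento</th><th> Materia </th></tr></thead>
</table>
"""
table_demo = bs4.BeautifulSoup(html, 'html.parser').find('table')

# One stripped string per header cell, independent of surrounding newlines.
header_demo = [th.get_text(strip=True) for th in table_demo.thead.find_all('th')]
print(header_demo)
```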


### Pre-processing text 

At this point, we've collected a lot of unstructured data (plain text). To turn this free text into structured fields and relations, we need to find patterns in it. In particular, we need a Python module that lets us work with regular expressions: [Regex](https://docs.python.org/3/library/re.html).<br><br>
The ```re``` module is part of the Python standard library, so nothing has to be installed. We can use it by writing ```import re``` 

In [136]:
import re
re.__version__

'2.2.1'

![](./images/regex_reference.png)

We'll write general regex patterns to find substrings inside the text. To create a pattern object, use ```re.compile(S)```, where S is a string written in regex syntax (see the regex quick reference image).
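
As a minimal sketch of compiling and using a pattern, the example below extracts the numeric code from one of the document labels printed earlier. The pattern itself is only an illustration and is not used elsewhere in this tutorial.

```python
import re

# Illustrative pattern: one or more digits, a hyphen, one or more digits.
bulletin_pattern = re.compile(r'\d+-\d+')

text = 'Boletín N° 11702-13'   # sample string from the table output
found = bulletin_pattern.search(text)
print(found.group())           # the matched substring
```

Writing the pattern as a raw string (the `r'...'` prefix) keeps Python from interpreting backslashes before the regex engine sees them.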

First, we save the strings in python lists

In [386]:
table_list = []
for bar in table.find_all('tr'):
    row = []
    for foo in bar.find_all('td'):
        row.append(foo.text)
    table_list.append(row)
table_list = table_list[1:]  # drop element 0, which is empty (the header row yields no <td> cells)
print(len(table_list))

20


We will divide the schedule element into date and time

In [148]:
schedules = [h[0] for h in table_list]

In [187]:
date_pattern = re.compile(r'\d\d?\s.*(20)\d{2}')  # raw strings keep the backslashes intact
time_pattern = re.compile(r'\d{2}:\d{2}')

There are two main methods to find substrings: 
- ```re.match()``` checks for a match only at the beginning of the string
- ```re.search()``` checks for a match anywhere in the string
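
The difference between the two can be seen on a single schedule string. This is a small sketch using the same time pattern and one value copied from the table:

```python
import re

time_pattern = re.compile(r'\d{2}:\d{2}')
sample = '13 de sep de 2018 - 12:47'

at_start = time_pattern.match(sample)    # None: the string does not start with HH:MM
anywhere = time_pattern.search(sample)   # finds the time later in the string

print(at_start)
print(anywhere.group())
```

Because ```match()``` returns `None` on failure, calling `.group()` on its result without a check raises an `AttributeError` when the pattern is not at the start.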

In [201]:
dates = []
times = []
for sch in schedules:
    date = date_pattern.match(sch).group()
    time = time_pattern.search(sch).group()
    dates.append(date)
    times.append(time)
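
An alternative sketch, not the approach used in this tutorial: a single pattern with named groups can capture the date and the time in one pass. It assumes every string follows the `D de mes de YYYY - HH:MM` layout seen in the output, and it skips strings that do not.

```python
import re

# Named groups: "date" and "time" are labels chosen for this example.
schedule_pattern = re.compile(
    r'(?P<date>\d{1,2} de \w+ de \d{4})\s*-\s*(?P<time>\d{2}:\d{2})'
)

samples = ['13 de sep de 2018 - 12:47', '13 de sep de 2018 - 12:45']

parsed = []
for s in samples:
    m = schedule_pattern.search(s)
    if m is None:          # skip strings that do not follow the assumed format
        continue
    parsed.append((m.group('date'), m.group('time')))

print(parsed)
```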

The [Regex library](https://docs.python.org/3/library/re.html) offers many more tools for matching, splitting and replacing text.

### Formatting pandas tables

In [204]:
import pandas as pd

In [205]:
header

['Fecha',
 'Documento',
 'Materia',
 'Artículo',
 'Tipo',
 'Resultado',
 'Afir.',
 'Neg.',
 'Abst.',
 'Detalle']

In [220]:
new_table = pd.DataFrame()
new_table['Fecha'] = dates
new_table['Hora'] = times
for i, he in enumerate(header[1:-1]):
    new_table[he] = [h[i+1] for h in table_list]

In [388]:
new_table.head()

Unnamed: 0,Fecha,Hora,Documento,Materia,Artículo,Tipo,Resultado,Afir.,Neg.,Abst.
0,13 de sep de 2018,12:47,Boletín N° 11702-13,Modifica el Código del Trabajo con el objeto d...,,GENERAL,APROBADO,128,0,0
1,13 de sep de 2018,12:45,Boletín N° 11787-22,"Modifica la ley N°18.290, de Tránsito, para ex...",,GENERAL,APROBADO,126,0,0
2,13 de sep de 2018,12:44,Boletín N° 11972-10,Aprueba el Tercer Protocolo Adicional al Acuer...,,UNICA,APROBADO,94,21,11
3,13 de sep de 2018,12:42,Boletín N° 11487-04,Autoriza a erigir un monumento en memoria del ...,,GENERAL,APROBADO,118,0,8
4,13 de sep de 2018,12:41,Boletín N° 11749-10,Aprueba el Protocolo de Modificación del Trata...,,UNICA,APROBADO,98,23,5


The last column has a link to the details of each vote, so we can open those links by looping over the structure

In [389]:
res = requests.get('https://www.camara.cl/camara/diputados.aspx#tab')
soup2 = bs4.BeautifulSoup(res.text, 'lxml')

In [313]:
foo = soup2.find('ul', {'class':'diputados'})
pattern_dip = re.compile(r'^\w*')
names_dip = {}
for index, value in enumerate(foo.find_all('h5')):
    name = value.find('a').text
    name = " ".join(name.split()[1:])
    names_dip[name] = index

In [331]:
len(names_dip)
keys = [value for key, value in names_dip.items()]

In [397]:
base = 'https://www.camara.cl/trabajamos/'
favor = []
contra = []
abstencion = []
for index, a in enumerate(table.find_all('a')):
    url = base+a['href']
    res = requests.get(url)
    soup2 = bs4.BeautifulSoup(res.text, 'lxml')
    tabla2 = soup2.find_all('div', {'class':'stress'})
    partial_vote = []
    for t in tabla2[1:-2]:
        partial = [] #to save favor - contra - abstencion
        votes_partial = np.zeros(len(names_dip)) # a vector with one 0 per deputy
        for aa in t.find_all('a'):
            list_name = aa.text.split()
            if list_name[0] == 'Del':
                m = list_name[-1]+' '+list_name[0]+' '+list_name[1]
            else:
                m = list_name[-1]+' '+list_name[0]
            try:
                iloc = names_dip[m]
                votes_partial[iloc] = 1
            except KeyError:  # name not found in names_dip; skip this voter
                continue
        partial_vote.append(votes_partial)
    favor.append(partial_vote[0])
    contra.append(partial_vote[1])
    abstencion.append(partial_vote[2])
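
The name handling inside the loop can be isolated in a small helper, which makes it easier to check on its own. The sketch below mirrors that logic with plain Python lists; `reorder_name`, `names_index` and the sample names other than the one from the table are hypothetical, and the `Surname Given` input format is an assumption about the vote-detail pages.

```python
def reorder_name(link_text):
    """Turn 'Surname Given' into 'Given Surname' (hypothetical helper).

    Mirrors the loop in this tutorial, including its special case for 'Del'.
    """
    parts = link_text.split()
    if parts[0] == 'Del':
        return parts[-1] + ' ' + parts[0] + ' ' + parts[1]
    return parts[-1] + ' ' + parts[0]

# Hypothetical index of deputies, shaped like names_dip.
names_index = {'Florcita Alarcón': 0, 'Ana Del Campo': 1}

# Link texts as they are assumed to appear on a vote-detail page.
voters = ['Alarcón Florcita', 'Unknown Person']

votes = [0] * len(names_index)   # one 0 per deputy
for text in voters:
    position = names_index.get(reorder_name(text))   # None when the name is not indexed
    if position is not None:
        votes[position] = 1

print(reorder_name('Del Campo Ana'))
print(votes)
```

Using `dict.get` avoids the `try`/`except` block, at the cost of an explicit `None` check.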

In [398]:
votes_info = pd.DataFrame(index=names_dip.keys())
for label, votes in zip(new_table['Documento'],favor):
    votes_info[label] = votes

In [399]:
votes_info.head()

Unnamed: 0,Boletín N° 11702-13,Boletín N° 11787-22,Boletín N° 11972-10,Boletín N° 11487-04,Boletín N° 11749-10,P. Resolución N° 166,P. Resolución N° 164,P. Resolución N° 313,Otro Documento,Boletín N° 8924-07,P. Resolución N° 322,P. Resolución N° 321,P. Resolución N° 320,Boletín N° 11317-21
Florcita Alarcón,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
Jorge Alessandri,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
René Alinco,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
Sebastián Álvarez,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0
Jenny Álvarez,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


## Dynamic Web Scraping

Many web pages use dynamic content to display data. In those cases the data may not be in the initial HTML, so we need to explore the JavaScript code. 
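
One situation that sometimes comes up is data written as a literal inside a `<script>` tag. The sketch below is only an illustration under that assumption: the page source, the variable name `votes` and its values are all invented, and the approach only works when the literal happens to be valid JSON.

```python
import json
import re

# Hypothetical page source: the variable name "votes" and its contents are invented.
page_source = """
<html><body>
<script type="text/javascript">
  var votes = {"afirmativos": 128, "negativos": 0, "abstenciones": 0};
</script>
</body></html>
"""

# Capture the object literal assigned to the hypothetical variable "votes".
match = re.search(r'var votes = (\{.*?\});', page_source, re.DOTALL)
data = json.loads(match.group(1)) if match else {}

print(data)
```

Pages that fetch their data through later requests, after the initial HTML has loaded, are not covered by this sketch.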