# How to get data

We produce a lot of data, and we need a lot of data for our analysis.
Our ultimate goal is to extract information from data.
We're searching for *informative patterns* within our data.

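The five V's of big data:
- *Volume*, i.e. the amount of data
- *Velocity*, i.e. the speed at which data is produced and must be processed
- *Variety*, i.e. the range of data types and formats
- *Veracity*, i.e. how reliable the data is
- *Value*, i.e. how useful the data is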

We have three classes of data:
- *structured*, i.e. standardized (from a computer-science point of view)
- *semi-structured*
- *unstructured*, i.e. essentially raw

The goal is to bring the data into a structured form.
To get there we pre-process it (the most expensive step in the analysis).
Every dataset needs ad hoc handling, so there is no universal way to do it.
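As a minimal sketch of what "bringing data into a structured form" can mean, the snippet below flattens a few semi-structured JSON records into a table with fixed columns. The records, the column names and the `;` separator are hypothetical toy choices, not part of any real dataset.

```python
import csv
import io
import json

# Semi-structured input: records share some keys but not all (hypothetical toy data)
raw = '[{"id": 1, "title": "first"}, {"id": 2, "title": "second", "tags": ["x", "y"]}]'
records = json.loads(raw)

# Structured output: every row has the same columns, missing values become ''
columns = ['id', 'title', 'tags']
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator='\n')
writer.writeheader()
for record in records:
    row = {col: record.get(col, '') for col in columns}
    # A list does not fit in a single cell, so join it into one string
    row['tags'] = ';'.join(row['tags']) if row['tags'] else ''
    writer.writerow(row)

table = buffer.getvalue()
print(table)
```

Even in this tiny case we had to decide how to handle a missing key and a nested list, which is why pre-processing rarely generalizes from one dataset to the next.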

The usual data formats are:
- *Tabular*
- *Text*
- *Images*

We'll focus on text in particular.

What about the sources of these data?
Where can we find them?

Usually the answer is the internet, but there isn't always a direct download button.
In that case we need an API, or a program that *scrapes* the data from a website.
Manual download may also be an option...

What's a web scraping application?
It is a program that downloads web pages and extracts data from them automatically.
Not all websites allow scraping, but we can sometimes fool them, eh eh.
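Many sites state what crawlers may access in a `robots.txt` file, so it is worth checking it before scraping. Here is a minimal sketch using the standard library's `urllib.robotparser`; the rules and the `example.org` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether any user agent ('*') may fetch each URL
allowed = parser.can_fetch('*', 'https://example.org/public/page.html')
blocked = parser.can_fetch('*', 'https://example.org/private/page.html')
print(allowed, blocked)
```

For a live site, calling `set_url(...)` and then `read()` on the parser downloads the file instead of passing the lines to `parse`.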

Let's look at an example.
All Italians complain about the complexity of the law.
Let's try to analyze the complexity of Italian law.

We notice that each *decreto* (decree) has hyperlinks to other documents: we can construct a network (or graph)!
First, we need to look at the webpage.
Each webpage is built with HTML, CSS and JavaScript.

To analyze a webpage in Python we mainly use three libraries:

In [10]:
import requests # make http requests
from bs4 import BeautifulSoup # parse html
import time # for sleep

Let's try to scrape something:

In [11]:
BASE_URL = 'https://parlamento18.camera.it/229'
year = 1988

URL = f'{BASE_URL}?tipo_ricerca=anno_tipo&anno={year}'

# get HTML page
page = requests.get(URL)
print(page) # 200 = success code

<Response [200]>
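The `time` import is there so that we can pause between requests when downloading many pages. A minimal sketch of such a loop over several years follows. The helper names `build_url` and `fetch_years`, the one-second default delay and the offline stand-in `fake_fetch` are assumptions for illustration; the URL pattern is the one used above.

```python
import time
from urllib.parse import urlencode

BASE_URL = 'https://parlamento18.camera.it/229'

def build_url(year):
    """Return the search URL for one year (same pattern as the f-string above)."""
    query = urlencode({'tipo_ricerca': 'anno_tipo', 'anno': year})
    return f'{BASE_URL}?{query}'

def fetch_years(years, fetch, delay=1.0):
    """Call fetch(url) for each year, sleeping `delay` seconds between calls."""
    pages = {}
    for i, year in enumerate(years):
        if i > 0:
            time.sleep(delay)  # be polite: do not hammer the server
        pages[year] = fetch(build_url(year))
    return pages

def fake_fetch(url):
    """Offline stand-in for a real HTTP call, so the sketch runs without network."""
    return f'<html>{url}</html>'

pages = fetch_years([1988, 1989], fake_fetch, delay=0.01)
print(build_url(1988))
```

With a real connection, `fetch` could be something like `lambda url: requests.get(url, timeout=10)`, and checking `status_code` on each response before parsing would catch failed downloads.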


In [12]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup) # entire HTML of the page


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="it" xml:lang="it" xmlns="http://www.w3.org/1999/xhtml">
<!-- view_groups/show -->
<head>
<title>Decreti legislativi</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="interact Xmanager" name="GENERATOR"/>
<meta content="index, follow" name="robots"/>
<meta content="no-referrer-when-downgrade" name="referrer"/>
<meta content="" name="description"/>
<meta content="Tue Jan 21 12:41:15 +0100 2020" name="last_modified"/>
<meta content="" name="keywords"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">
//<![CDATA[
var current_project_name = 'Parlamento18';
var current_mode = 'live';
var current_view = '253';
var language = 'it';
var current_group = '229';
var current_environment = 'camera_internet';

//]]>
</script>
<script src="/javascripts/cache/Parlamento18.7ca4cab8dd75f28366

To extract the links:

In [13]:
results = soup.find_all('div', class_='link_decreto')
print(results)

[<div class="link_decreto"><a href="http://www.normattiva.it/uri-res/N2Ls?urn:nir:stato:decreto.legislativo:1988;539" title="Visualizza il testo del decreto 22 dicembre 539/1988">Decreto legislativo 22 dicembre 539/1988</a></div>, <div class="link_decreto"><a href="http://www.normattiva.it/uri-res/N2Ls?urn:nir:stato:decreto.legislativo:1988;509" title="Visualizza il testo del decreto 23 novembre 509/1988">Decreto legislativo 23 novembre 509/1988</a></div>, <div class="link_decreto"><a href="http://www.normattiva.it/uri-res/N2Ls?urn:nir:stato:decreto.legislativo:1988;478" title="Visualizza il testo del decreto 9 novembre 478/1988">Decreto legislativo 9 novembre 478/1988</a></div>]


In [14]:
# get the visible text (the decree title) of each result
names = []
for result in results:
    name = result.getText()
    names.append(name)
print(names)

['Decreto legislativo 22 dicembre 539/1988', 'Decreto legislativo 23 novembre 509/1988', 'Decreto legislativo 9 novembre 478/1988']
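These names are still raw strings. A minimal sketch of turning them into structured records with a regular expression is given below. The pattern is inferred only from the three titles printed above, and reading it as "type, day, month, number/year" is an assumption; `parse_title` is a hypothetical helper.

```python
import re

# Pattern assumed from the titles above: "<kind> <day> <month> <number>/<year>"
TITLE_RE = re.compile(
    r'^(?P<kind>.+?) (?P<day>\d{1,2}) (?P<month>\w+) (?P<number>\d+)/(?P<year>\d{4})$'
)

def parse_title(title):
    """Split a decree title into fields, or return None if it does not match."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    record = match.groupdict()
    for key in ('day', 'number', 'year'):
        record[key] = int(record[key])
    return record

names = [
    'Decreto legislativo 22 dicembre 539/1988',
    'Decreto legislativo 23 novembre 509/1988',
    'Decreto legislativo 9 novembre 478/1988',
]
records = [parse_title(name) for name in names]
print(records[0])
```

Returning `None` for titles that do not match makes it easy to spot pages whose format differs from the one assumed here.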


In [16]:
# get the URL each result links to
links = []
for result in results:
    link = result.find('a', href=True)['href']
    links.append(link)
print(links)

['http://www.normattiva.it/uri-res/N2Ls?urn:nir:stato:decreto.legislativo:1988;539', 'http://www.normattiva.it/uri-res/N2Ls?urn:nir:stato:decreto.legislativo:1988;509', 'http://www.normattiva.it/uri-res/N2Ls?urn:nir:stato:decreto.legislativo:1988;478']
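Coming back to the network idea: once we know which documents each page links to, we can store the graph as a dictionary of sets. The sketch below uses placeholder node names and made-up edges, not real citations between decrees; `build_graph` and `in_degree` are hypothetical helpers.

```python
def build_graph(outgoing_links):
    """Build a directed graph {node: set of nodes it links to}."""
    graph = {}
    for source, targets in outgoing_links.items():
        graph.setdefault(source, set())
        for target in targets:
            if target == source:
                continue  # ignore self-links
            graph[source].add(target)
            graph.setdefault(target, set())  # targets are nodes too
    return graph

def in_degree(graph):
    """Count how many other nodes link to each node."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts

# Hypothetical input: for each document, the links found on its page
outgoing = {
    'decree_A': ['decree_B', 'decree_C'],
    'decree_B': ['decree_C'],
    'decree_C': [],
}
graph = build_graph(outgoing)
degrees = in_degree(graph)
print(degrees)
```

How often a document is cited, or how many others it cites, could serve as one rough proxy for complexity; which measure is appropriate is left open here.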


To go further: *selenium*.
It is a tool that drives a real browser automatically.
For example, it allows you to *click* on elements of a webpage.