# Web APIs, Scraping and Web services with Python

## Index

1. Web APIs
2. Web scraping
   - 2.1. Ultra easy scraping with pandas
3. Building web services with Flask
4. Annex I: exercises

# 1. Web APIs



An API, or application programming interface, is the way programs communicate with one another.

Web APIs are the way programs communicate with one another _over the internet_.

[RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) APIs respect a series of design principles that make them simple to use.

The basic tools we are going to use are `GET` and `POST` requests to URLs we'll specify, and JSON objects that we'll receive as the response or send as the payload (in a `POST` request, for example).

In [1]:
import pandas as pd
import numpy as np

In [4]:
import requests  # This is the library we will use.

requests?  # requests is an HTTP library.

We will use it to make requests to resources over HTTP (Hypertext Transfer Protocol). Each request returns a response.

In [3]:
response = requests.get('http://elpais.com')

`response` has many attributes and methods.

In [4]:
response.content  # This is the HTML content of that page.

b'<!DOCTYPE html>\n<html lang="es">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge"><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"VQEDUVdSCxAIVVVUBggHVw=="};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{c.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(20),c={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(c.console=!0,o.indexOf("dev")!==-1&&(c.dev=!0),o.indexOf("nr_dev")!==-1&&(c.nrDev=!0))}catch(s){}c.nrDev&&i.on("internal-error",function(t){r(t.stack)}),c.dev&&i.on("fn-err",function(t,e,n){r(n.stack)}),c.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(c,function(t

What a browser does is interpret this code and render it.

How does an API differ from a web page? Basically, an API typically returns JSON instead of HTML.

For example, at http://api.open-notify.org/ there is an API that gives us the position of the International Space Station at any given moment.

When using an API, the better its documentation the better, because it is not easy to inspect its source code. In practice, API documentation is often somewhat out of date.

In [14]:
# Call the API

r = requests.get('http://api.open-notify.org/iss-now.json')

r.status_code  # 200 means OK. 4xx codes are client errors, 5xx codes are server errors.

200
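As a small illustration of the status code families mentioned in the comment above, the sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code into its standard phrase. The helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    # HTTPStatus(code) raises ValueError for codes it does not know about.
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        family = 'success'
    elif 400 <= code < 500:
        family = 'client error'
    elif 500 <= code < 600:
        family = 'server error'
    else:
        family = 'other'
    return '%d %s (%s)' % (code, phrase, family)


for code in (200, 404, 503):
    print(describe_status(code))
```

With a real response, the same helper could be called as `describe_status(r.status_code)`.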

In [15]:
r.content  # It returns something that looks like a dictionary

b'{"message": "success", "timestamp": 1527927416, "iss_position": {"longitude": "73.0352", "latitude": "-5.8795"}}'

In [16]:
type(r.content)

bytes

It is a format that looks very much like a dictionary, but it is not one. It is JSON (JavaScript Object Notation, the syntax in which JavaScript object literals are written).

We can convert a json-formatted string such as the one we get in the response into a Python object with the json library:

In [19]:
import json  # We will only use two functions from this library: loads and dumps

json.loads(r.content)  # Parses the JSON string we pass in and returns a Python dictionary

{'iss_position': {'latitude': '-5.8795', 'longitude': '73.0352'},
 'message': 'success',
 'timestamp': 1527927416}

In [20]:
type(json.loads(r.content))

dict

In [24]:
json.loads('[1,2,3]')  # JSON also has lists


[1, 2, 3]

In [26]:
json.loads(r.content)

{'iss_position': {'latitude': '-5.8795', 'longitude': '73.0352'},
 'message': 'success',
 'timestamp': 1527927416}

In [27]:
json.loads(r.content)['iss_position']

{'latitude': '-5.8795', 'longitude': '73.0352'}
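Note that the coordinates in the parsed response are strings, not numbers. The sketch below works offline on the response body shown above (copied verbatim, so no request is made), converts the coordinates to `float`, and interprets the timestamp as seconds since the Unix epoch, which is an assumption based on its magnitude.

```python
import json
from datetime import datetime, timezone

# Body copied from the response shown above, so no network call is needed.
raw = (b'{"message": "success", "timestamp": 1527927416, '
       b'"iss_position": {"longitude": "73.0352", "latitude": "-5.8795"}}')

data = json.loads(raw)  # json.loads accepts bytes as well as str

# The coordinates arrive as strings, so convert them before doing arithmetic.
latitude = float(data['iss_position']['latitude'])
longitude = float(data['iss_position']['longitude'])

# Assumption: the timestamp is expressed in seconds since the Unix epoch (UTC).
when = datetime.fromtimestamp(data['timestamp'], tz=timezone.utc)

print(latitude, longitude, when.isoformat())
```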

We can also go in the other direction and generate JSON-formatted strings from Python objects:

In [28]:
my_dict = {'Acelgas': False, 'Bacon': True}

json.dumps(my_dict)

'{"Acelgas": false, "Bacon": true}'

In [29]:
type(json.dumps(my_dict))

str
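A quick way to convince ourselves that `dumps` and `loads` are inverses is a round trip. This is a minimal sketch; the `'Extras'` key is a made-up addition to show how `None` is serialized, and `indent` and `sort_keys` are optional arguments that only affect how the string looks.

```python
import json

# 'Extras' is a hypothetical extra key, added to show how None is serialized.
my_dict = {'Acelgas': False, 'Bacon': True, 'Extras': None}

# indent and sort_keys make the output easier to read; they do not change the data.
text = json.dumps(my_dict, indent=2, sort_keys=True)
print(text)

# Parsing the string again gives back an equal dictionary.
restored = json.loads(text)
print(restored == my_dict)
```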

This API took no parameters, but others do. For example, there is an API that tells us the next 'n' passes of the ISS over a given point.

How to pass the parameters is defined in the documentation.

#### Exercise:
Write a function that returns the duration of the next 5 overhead passes of the ISS for a given latitude and longitude. Use http://open-notify.org/Open-Notify-API/ISS-Pass-Times/. We are going to need to encode the parameters in the URL as per the specification.

For example, for Madrid:

http://api.open-notify.org/iss-pass.json?lat=40.4&lon=-3.7

In [55]:
def next_5_durations(lat, long):
    
    r2 = requests.get('http://api.open-notify.org/iss-pass.json?lat=%.2f&lon=%.2f' % (lat,long))
    r2_content = r2.content
    dict_r2 = json.loads(r2_content)
    
    print("Next 5 passes:")
    
    for i in np.arange(0,5):
        print(dict_r2['response'][i]['duration'])
    
next_5_durations(40.4, -3.7)    

Next 5 passes:
466
640
593
537
590


We could also do it with a list comprehension: we want to walk through a list and take one field from each of its elements.

In [58]:
def next_5_durations(lat, long):
    
    r2 = requests.get('http://api.open-notify.org/iss-pass.json?lat=%.2f&lon=%.2f' % (lat,long))
    r2_content = r2.content
    dict_r2 = json.loads(r2_content)
    
    durations = [iss_pass['duration'] for iss_pass in dict_r2['response']]
    
    return durations
    
next_5_durations(40.4, -3.7)

[466, 640, 593, 537, 590]

There are lots of APIs out there. To find them, search for **Public APIs**.

Although we managed to get the response, more complicated sets of parameters will be a complicated and error-prone thing to encode. Thankfully, the `requests` library can do that work for us.

In [62]:
madrid_coords = {'lat': 40.4 , 'lon': -3.7}

r = requests.get('http://api.open-notify.org/iss-pass.json', params = madrid_coords)

In [64]:
json.loads(r.content)

{'message': 'success',
 'request': {'altitude': 100,
  'datetime': 1527928098,
  'latitude': 40.4,
  'longitude': -3.7,
  'passes': 5},
 'response': [{'duration': 466, 'risetime': 1527950162},
  {'duration': 641, 'risetime': 1527955830},
  {'duration': 593, 'risetime': 1527961661},
  {'duration': 537, 'risetime': 1527967536},
  {'duration': 590, 'risetime': 1527973358}]}

Much easier this way.
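To see what kind of encoding is being done for us, here is a sketch that builds the same query string by hand with the standard library's `urllib.parse.urlencode`. It illustrates the encoding step only and is not how `requests` is implemented internally; the second dictionary is a made-up example to show how spaces are escaped.

```python
from urllib.parse import urlencode

base_url = 'http://api.open-notify.org/iss-pass.json'
madrid_coords = {'lat': 40.4, 'lon': -3.7}

# urlencode turns a dictionary into a 'key=value&key=value' query string.
query = urlencode(madrid_coords)
full_url = base_url + '?' + query
print(full_url)

# Hypothetical parameters, only to show that special characters get escaped.
print(urlencode({'q': 'space station', 'n': 5}))
```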

Even more complicated sets of parameters are sometimes required. When that is the case, API designers often decide to require them in json format, received via a `POST` request.

For example, take a look at the [QPX API from Google](https://developers.google.com/qpx-express/v1/trips/search). In the documentation, they define the body of the request, which we will have to provide, and of the response, which they'll provide back. **Note**: this API was shut down in April 2018.

We would build a JSON object and send it with `POST`.
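The sketch below shows what "build a JSON object and send it with `POST`" amounts to. It only constructs the request and does not send anything. The URL and the payload fields are hypothetical (they are not the QPX schema), and the standard library's `urllib.request` is used here only so the example runs without network access.

```python
import json
import urllib.request

# Hypothetical endpoint and payload, for illustration only.
url = 'https://example.com/api/search'
payload = {'origin': 'MAD', 'destination': 'LHR', 'passengers': {'adults': 1}}

# The body of the POST request is the JSON string, encoded as bytes.
body = json.dumps(payload).encode('utf-8')

# Build the request object; nothing is sent until it is passed to urlopen().
req = urllib.request.Request(
    url,
    data=body,
    headers={'Content-Type': 'application/json'},
    method='POST',
)

print(req.get_method())
print(req.data)

# With requests, the equivalent call would be: requests.post(url, json=payload)
```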

In [None]:
requests.post?

An important detail: **authentication**. Restricted or paid APIs require authentication. What usually happens is that you get a number of free requests, and beyond a certain number you pay.

When an API requires authentication, ...

# 2. Web scraping


![HTML to DOM](http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png)

![DOM TREE](http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png)



Here we will also work from the HTML.

The HTML text (the 'The Document' box) can be parsed into 'The DOM Tree'.

The tree figure shows the structure of a generic HTML document (not the one in the boxes above).

For web scraping we will use the following library:

In [65]:
from bs4 import BeautifulSoup

We will use the page https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts

It lists open letters from an American labor union.

In [66]:
url = 'https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts'

In [69]:
# The first step is to get the HTML

r = requests.get(url)

r.status_code

200

In [71]:
page = r.content

page[:100]

b'<!DOCTYPE html>\n<html lang="en" dir="ltr" xmlns:article="http://ogp.me/ns/article#" xmlns:book="http'

In [76]:
# How do we parse it?

soup = BeautifulSoup(page, 'html5lib')  # Create a soup

# 'html5lib' is the underlying HTML parser that BeautifulSoup uses here. It is a common choice.

This takes us from the box on the left (see the figure) to the one on the right. The `soup` is the page in `DOM` form.

In [75]:
print(soup.prettify()[:1000])  # This makes it pretty and readable

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# " xmlns:article="http://ogp.me/ns/article#" xmlns:book="http://ogp.me/ns/book#" xmlns:product="http://ogp.me/ns/product#" xmlns:profile="http://ogp.me/ns/profile#" xmlns:video="http://ogp.me/ns/video#">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[functi

With the soup we can now work on the page.

Now we can open the page in the browser, view its source, and inspect where the things we want are located in the code.

Side note: in modern pages, almost everything we want will be inside `<div>` tags.

Let's try to build a function that, when called, queries the union's page and generates a list of the visible legislative alerts. Each alert will be represented by the link to the letter, the title, and the date (in a dictionary).

In the page's HTML, each of the elements we want sits inside something like this:
```            
            <div class="block block-content col-12 col-lg-4">
			<div class="content-details  " >
	<a class="b-inner" href="/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk">
	  <div class="b-text">
              <h5 class="content-type">Legislative Alert</h5>
        <h2 class="content-title"><span>Letter Opposing Legislation That Would Put Consumers At Risk</span>
</h2>
              <time datetime="2018-05-22T10:37:18-0400">May 22, 2018</time>
          </div>
	</a>
  <div></div>
</div>

```

In [78]:
help(soup.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Extracts a list of Tag objects that match the given
    criteria.  You can specify the name of the Tag and any
    attributes you want the Tag to have.
    
    The value of a key-value pair in the 'attrs' map can be a
    string, a list of strings, a regular expression object, or a
    callable that takes a string and returns whether or not the
    string matches for some custom definition of 'matches'. The
    same is true of the tag name.



It extracts a list of tags that match the given criteria.

In [80]:
type(soup.find_all('div'))  # This returns every div, with everything nested inside it

bs4.element.ResultSet

Looking at the HTML, we notice that what the elements we want have in common is `class="content-details  "`.

In [84]:
alerts = soup.find_all('div', class_='content-details')
len(alerts)

18

Here we asked for every `div` whose `class` attribute matches that value.

We get 18 elements, which are the alerts we want.

In [85]:
alerts[0].find('a')

<a class="b-inner" href="/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk">
	  <div class="b-text">
              <h5 class="content-type">Legislative Alert</h5>
        <h2 class="content-title"><span>Letter Opposing Legislation That Would Put Consumers At Risk</span>
</h2>
              <time datetime="2018-05-22T10:37:18-0400">May 22, 2018</time>
          </div>
	</a>

In [87]:
alerts[0].find('a').get_text()

'\n\t  \n              Legislative Alert\n        Letter Opposing Legislation That Would Put Consumers At Risk\n\n              May 22, 2018\n          \n\t'

In [88]:
alerts[0].find('a')['href']

'/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk'

Let's build the function:

In [91]:
alerts[0]

<div class="content-details ">
	<a class="b-inner" href="/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk">
	  <div class="b-text">
              <h5 class="content-type">Legislative Alert</h5>
        <h2 class="content-title"><span>Letter Opposing Legislation That Would Put Consumers At Risk</span>
</h2>
              <time datetime="2018-05-22T10:37:18-0400">May 22, 2018</time>
          </div>
	</a>
  <div></div>
</div>

In [103]:
# Link

alerts[0].find('a')['href']

full_link = 'https://aflcio.org' + alerts[0].find('a')['href']
full_link

'https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk'

In [105]:
# Title

alerts[0].find('h2').get_text()

'Letter Opposing Legislation That Would Put Consumers At Risk'

In [106]:
alerts[0].find('span').get_text()

'Letter Opposing Legislation That Would Put Consumers At Risk'

In [101]:
# Date

alerts[0].find('time').get_text()

'May 22, 2018'

In [108]:
alerts[0].find('time')['datetime']

'2018-05-22T10:37:18-0400'

In [127]:
def function_alerts():
        
    url = 'https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts'
    r = requests.get(url)
    page = r.content
    soup = BeautifulSoup(page, 'html5lib') 
    
    alerts = soup.find_all('div', class_='content-details')
    
    alerts_list = []
    
    for alert in alerts:
        link = 'https://aflcio.org' + alert.find('a')['href']
        title = alert.find('h2').get_text()
        datetime = alert.find('time')['datetime']
        
        results = {'link': link, 'title': title, 'datetime': datetime}
        
        alerts_list.append(results)
        
    return alerts_list

In [130]:
function_alerts()

[{'datetime': '2018-05-22T10:37:18-0400',
  'link': 'https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk',
  'title': 'Letter Opposing Legislation That Would Put Consumers At Risk\n'},
 {'datetime': '2018-05-21T16:54:33-0400',
  'link': 'https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-bill-would-make-it-more-difficult-americans-feed',
  'title': 'Letter Opposing Bill That Would Make It More Difficult for Americans to Feed Their Families\n'},
 {'datetime': '2018-05-21T16:48:34-0400',
  'link': 'https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-legislation-would-help-privatize-va',
  'title': 'Letter Opposing Legislation That Would Help Privatize the VA\n'},
 {'datetime': '2018-05-21T16:43:40-0400',
  'link': 'https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-michael-truncales-nomination-eastern-district',
  'title': "Letter Opposing Michael Truncale's Nomination to the Easter

In [134]:
function_alerts()[0]

{'datetime': '2018-05-22T10:37:18-0400',
 'link': 'https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk',
 'title': 'Letter Opposing Legislation That Would Put Consumers At Risk\n'}
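Note that the titles above end with a stray `'\n'`. One possible refinement, sketched below under a few assumptions, is to split parsing from downloading so the logic can be tried offline on the HTML snippet shown earlier. The name `parse_alerts` is hypothetical, `'html.parser'` (the parser bundled with Python) is swapped in for `'html5lib'` only to avoid an extra dependency, `get_text(strip=True)` trims the surrounding whitespace, and `urljoin` replaces the manual string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://aflcio.org'

# One alert block, copied from the page HTML shown earlier.
SAMPLE_HTML = '''
<div class="content-details  ">
  <a class="b-inner" href="/about/advocacy/legislative-alerts/letter-opposing-legislation-would-put-consumers-risk">
    <div class="b-text">
      <h5 class="content-type">Legislative Alert</h5>
      <h2 class="content-title"><span>Letter Opposing Legislation That Would Put Consumers At Risk</span>
</h2>
      <time datetime="2018-05-22T10:37:18-0400">May 22, 2018</time>
    </div>
  </a>
</div>
'''


def parse_alerts(html):
    """Extract link, title and datetime from every alert block in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    alerts = []
    for block in soup.find_all('div', class_='content-details'):
        alerts.append({
            # urljoin resolves the relative href against the site root
            'link': urljoin(BASE_URL, block.find('a')['href']),
            # strip=True removes the leading/trailing whitespace and newlines
            'title': block.find('h2').get_text(strip=True),
            'datetime': block.find('time')['datetime'],
        })
    return alerts


alerts = parse_alerts(SAMPLE_HTML)
print(alerts[0])
```

With the live page, the same function could be fed `requests.get(url).content`.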

In [137]:
pd.DataFrame(function_alerts()).head()

Unnamed: 0,datetime,link,title
0,2018-05-22T10:37:18-0400,https://aflcio.org/about/advocacy/legislative-...,Letter Opposing Legislation That Would Put Con...
1,2018-05-21T16:54:33-0400,https://aflcio.org/about/advocacy/legislative-...,Letter Opposing Bill That Would Make It More D...
2,2018-05-21T16:48:34-0400,https://aflcio.org/about/advocacy/legislative-...,Letter Opposing Legislation That Would Help Pr...
3,2018-05-21T16:43:40-0400,https://aflcio.org/about/advocacy/legislative-...,Letter Opposing Michael Truncale's Nomination ...
4,2018-05-21T16:34:44-0400,https://aflcio.org/about/advocacy/legislative-...,Letter in Support of Amendment that Will Prote...


In [141]:
# We can serialize the result as JSON

json.dumps(function_alerts())[:100]

'[{"link": "https://aflcio.org/about/advocacy/legislative-alerts/letter-opposing-legislation-would-pu'

## 2.1. Ultra easy scraping with pandas

When the data we want is already formatted as a table, we can do it even more easily! Just use `pandas.read_html`:

https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll#Space_exploration

This page contains several tables.

In [149]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll',header=0)

# header=0 uses row 0 as the column names

In [153]:
tables[1].head(2)

Unnamed: 0,Deaths,Date,Attraction,Amusement park,Location
0,28,14 February 2004,Transvaal Park (entire facility affected); the...,Transvaal Park,"Yasenevo, Moscow, Russia[1]"
1,15,27 June 2015,Formosa Fun Coast music stage; a dust explosio...,Formosa Fun Coast,"Bali, New Taipei, Taiwan[2]"


So far we have consumed APIs and then scraped web pages; now **we will provide our own APIs.**

# 3. Building web services with Flask



[Flask](http://flask.pocoo.org/docs/1.0/) is a framework for building web applications.

Building a simple Web Service is extremely easy: you just create an app, define a function that generates the result and tie it to a route, and run the app.



In [165]:
from flask import Flask  # Import Flask
from werkzeug.serving import run_simple  # This is used to run the web service

In [166]:
# Create an app and give it a name
app = Flask('My first web service')


# To turn a function into a web service, we bind it to a route.
# The @ line is a decorator: it tells Flask to take the function
# defined right below it and serve it at that route.
@app.route('/saludamajo')
def hello_world():
    return 'hello!!'

In [167]:
run_simple('localhost', 5000, app)

 * Running on http://localhost:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [02/Jun/2018 13:09:14] "GET /saludamajo HTTP/1.1" 200 -


If we open http://localhost:5000/saludamajo in the browser, we will see the web service's response.

To stop the service, interrupt the cell in the notebook.

Now let's do the same with the function from before:

In [168]:
app = Flask('union-alerts')

@app.route('/latest-alerts')
def function_alerts():
        
    url = 'https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts'
    r = requests.get(url)
    page = r.content
    soup = BeautifulSoup(page, 'html5lib') 
    
    alerts = soup.find_all('div', class_='content-details')
    
    alerts_list = []
    
    for alert in alerts:
        link = 'https://aflcio.org' + alert.find('a')['href']
        title = alert.find('h2').get_text()
        datetime = alert.find('time')['datetime']
        
        results = {'link': link, 'title': title, 'datetime': datetime}
        
        alerts_list.append(results)
        
    return json.dumps(alerts_list)

run_simple('localhost',5000, app)

 * Running on http://localhost:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [02/Jun/2018 13:12:43] "GET /latest-alerts HTTP/1.1" 200 -


Visiting http://localhost:5000/latest-alerts returns the resulting JSON.

Accepting request parameters is easy too:

In [174]:
from flask import request

app = Flask('My first web service')

@app.route('/aritmetica')
def cuadrado():
    # Flask exposes the current request through `request`; query parameters arrive as strings.
    n = int(request.args.get('n'))
    return '%d squared is %d' % (n, n**2)

run_simple('localhost', 5000, app)

 * Running on http://localhost:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [02/Jun/2018 13:23:17] "GET /aritmetica?n=21 HTTP/1.1" 200 -


Opening http://localhost:5000/aritmetica?n=21 returns the result.
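In the cell above, `int(request.args.get('n'))` fails with an exception when `n` is missing or is not a number. A possible hardening is sketched below: `request.args.get('n', type=int)` returns `None` in those cases, so the handler can answer with a 400 status instead. The error message is made up, and Flask's built-in test client is used so the route can be exercised without starting a server.

```python
from flask import Flask, request

app = Flask(__name__)


@app.route('/aritmetica')
def cuadrado():
    # type=int makes get() return None when 'n' is missing or not an integer
    n = request.args.get('n', type=int)
    if n is None:
        # Returning a (body, status) tuple sets the HTTP status code
        return 'Parameter n must be an integer', 400
    return '%d squared is %d' % (n, n ** 2)


# The test client sends requests to the app in-process, without a running server.
client = app.test_client()

ok = client.get('/aritmetica?n=21')
bad = client.get('/aritmetica')

print(ok.status_code, ok.get_data(as_text=True))
print(bad.status_code, bad.get_data(as_text=True))
```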

# 4. Annex I: exercises

#### Exercise:

Get a random fact from the [Internet Chuck Norris Database](http://www.icndb.com/api/).

In [11]:
r = requests.get('https://api.icndb.com/jokes')
r.content



#### Exercise

Write a function that uses query parameters to get a Chuck Norris fact to talk about you.

#### Exercise:

Extract the date of the worst aviation disaster from: https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll

Prerequisites: pandas, pd.read_html

#### Exercise: 

Assuming the list is exhaustive, calculate how many people died in accidental explosions per decade in the 20th century. Plot it.

Data: 
https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll

Prerequisites: pandas, pd.read_html, pd.to_datetime, matplotlib or seaborn

#### Exercise

Build a small Flask app that serves the total number of deaths by accidental explosion and a list of accidents when given a decade in the 20th century as a parameter.

#### Exercise: 

Create a function that, given the two tables extracted from http://en.wikipedia.org/wiki/List_of_S%26P_500_companies and a date, returns the list of companies in the S&P 500 at that date.

# WORK IN PROGRESS