# Introduction to Data Analysis with Python III: APIs and Web Scraping

<img src="https://www.python.org/static/img/python-logo.png" alt="Python" style="width: 200px; float: right;"/>

# Web APIs



An API (Application Programming Interface) is the way programs communicate with one another. Web APIs are the way programs communicate with one another _over the internet_.

[RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) APIs respect a series of design principles that make them simple to use.

The basic tools we are going to use are:
- POST and GET requests to URLs we'll specify
- JSON objects that we'll receive as a response or send as a payload (in a POST request, for example).

We'll use the handy Python module [`requests`](https://requests.readthedocs.io/en/master/):

In [None]:
import requests

resp = requests.get('http://www.elpais.com/')
resp.content[:500]

In [None]:
type(resp.content)

In [None]:
resp.encoding

In [None]:
resp.text
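The cells above show two views of the same response: `resp.content` holds the raw bytes the server sent, while `resp.text` is those bytes decoded into a string with the encoding stored in `resp.encoding`. A minimal offline sketch of why the encoding matters, using a made-up byte string instead of a real response:

```python
# A made-up payload standing in for the raw bytes of a response body.
raw = "España".encode("utf-8")

# Decoding with the right encoding recovers the original text.
decoded_ok = raw.decode("utf-8")

# Decoding with the wrong encoding does not fail here, but garbles the text.
decoded_wrong = raw.decode("latin-1")

print(type(raw), type(decoded_ok))
print(decoded_ok)
print(decoded_wrong)
```

If the text of a page looks garbled like the last line, checking `resp.encoding` is a reasonable first step.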

## On dealing with HTTP requests in Python

`urllib` is the part of the Python standard library that deals with HTTP. It originated in Python 2 and was reorganized for Python 3. There is also a separate `urllib3` package that, despite its name, is not part of the standard library. The `requests` package is built on top of `urllib3` and aims to offer an API that is easier to use than either `urllib` or `urllib3`.

In [None]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

This command outputs the complete HTML code for `page1.html`, located at the URL http://pythonscraping.com/pages/page1.html. More precisely, it outputs the HTML file `page1.html`, found in the directory `<web root>/pages` on the server at the domain name `pythonscraping.com`.
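To make the anatomy of that URL concrete, here is a small sketch that splits it into its parts with the standard library's `urllib.parse.urlsplit`. It does not contact the server; it only takes the string apart:

```python
from urllib.parse import urlsplit

parts = urlsplit('http://pythonscraping.com/pages/page1.html')

print(parts.scheme)   # protocol used to talk to the server
print(parts.netloc)   # domain name of the server
print(parts.path)     # location of the file under the web root
```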

## Connecting reliably and handling exceptions

The web is messy. When fetching a file, two main things can go wrong:

- The page is not found on the server (or there was an error in retrieving it).
- The server is not found.

In the first situation, the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error". In all of these cases, the `urlopen` function raises the generic exception `HTTPError`. In the second situation, no server answers at all, and `urlopen` raises a `URLError` instead.

You can handle this exception in the following way using `urllib`:

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # The program continues. Note: if you return or break in the
    # except block, you do not need the "else" clause
    print('The request succeeded')

Or like this if you're using the `requests` package:

In [None]:
import requests

try:
    html = requests.get('http://www.pythonscraping.com/pages/page1.html')
except requests.exceptions.RequestException as e:
    raise SystemExit(e)
else:
    # Program continues
    print(html.status_code)

By default, `requests` does not raise an exception when the server answers with an error status. If you want HTTP errors (e.g. 401 Unauthorized) to raise exceptions, you can call `Response.raise_for_status()`. It raises an `HTTPError` if the response status is an HTTP error.

An example:

In [None]:
try:
    r = requests.get('http://www.google.com/nothere')
    r.raise_for_status()
except requests.exceptions.HTTPError as err:
    raise SystemExit(err)
else:
    # Program continues
    print(r.status_code)
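Both failure modes listed at the start of this section can be handled in one place. The following is a minimal sketch with `urllib`, assuming we simply want `None` back when anything goes wrong. The `opener` parameter and the `unreachable` helper are hypothetical additions made for this example so that it can run without touching the network:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the body of `url` as bytes, or None if the request fails.

    `opener` is a hypothetical hook added for this sketch so the function
    can be exercised without touching the network.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"The server returned an error: {e.code}")
        return None
    except URLError as e:
        # No server could be reached (bad domain, no connection, ...).
        print(f"The server could not be reached: {e.reason}")
        return None
    return response.read()


def unreachable(url):
    # Stand-in opener that simulates a server that cannot be found.
    raise URLError("simulated lookup failure")


print(fetch("http://example.invalid/", opener=unreachable))
```

Calling `fetch(url)` without the second argument uses the real `urlopen`.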

## Getting data on the International Space Station's orbit

This is an API that returns the current position of the ISS:

In [None]:
r = requests.get('http://api.open-notify.org/iss-now.json')
r.status_code

In [None]:
r.content

We can convert a JSON-formatted string, such as the one we get in the response, into a Python object with the `json` library:

In [None]:
import json 

pos = json.loads(r.content)
pos

Instead of using the `json` package, we can call the `json()` method of the response object from `requests`, which does the same:

In [None]:
r.json()

We can see that the `pos` object returned earlier by `json.loads()` is a dictionary as well:

In [None]:
type(pos)

Once we've got our data parsed into a Python dictionary, accessing parts of it should be familiar:

In [None]:
pos['iss_position']['latitude']
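Parsed values keep the types they had in the JSON document, so a little conversion is often needed before working with them. The sketch below parses a made-up sample that is assumed to mirror the shape of the API response (the values are illustrative, not real ISS data) and converts the coordinates and the timestamp:

```python
import json
from datetime import datetime, timezone

# Made-up sample, assumed to mirror the shape of the API response.
sample = (
    b'{"message": "success", "timestamp": 1600000000, '
    b'"iss_position": {"latitude": "12.5", "longitude": "-45.25"}}'
)

pos = json.loads(sample)

# In this sample the coordinates are strings, so convert them to numbers.
latitude = float(pos['iss_position']['latitude'])
longitude = float(pos['iss_position']['longitude'])

# The timestamp is assumed to be seconds since the Unix epoch (UTC).
when = datetime.fromtimestamp(pos['timestamp'], tz=timezone.utc)

print(latitude, longitude, when.isoformat())
```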


We can use pandas to import the JSON result of the request directly into a DataFrame object:

In [None]:
import pandas as pd

pd.read_json('http://api.open-notify.org/iss-now.json')

We can also go in the other direction and generate JSON-formatted strings from Python objects:

In [None]:
mi_diccionario = {'Chicago' : "Illinois", "Kansas City" : ["Kansas", "Missouri"]}

In [None]:
mi_diccionario

In [None]:
json.dumps(mi_diccionario)
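A round trip through JSON is not always an exact copy, because JSON has fewer types than Python. A short sketch with a dictionary similar to the one above (the tuple is an addition made for this example):

```python
import json

cities = {'Chicago': 'Illinois', 'Kansas City': ('Kansas', 'Missouri')}

# indent and sort_keys only affect readability, not the data itself.
encoded = json.dumps(cities, indent=2, sort_keys=True)
print(encoded)

decoded = json.loads(encoded)
# JSON has no tuple type, so the tuple comes back as a list.
print(decoded['Kansas City'])
```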

### Exercise


Write a function that returns the duration of the next 5 overhead passes of the ISS for a given latitude and longitude. Use http://open-notify.org/Open-Notify-API/ISS-Pass-Times/. We are going to need to encode the parameters in the URL as per the specification.

For example, for Madrid the URL would be http://api.open-notify.org/iss-pass.json?lat=40.4&lon=-3.7&n=5

Let's work our way towards a solution, taking advantage of IPython's quick feedback loop:

In [None]:
import sys
sys.version_info

In [None]:
positions = requests.get("http://api.open-notify.org/iss-pass.json?lat=40.4&lon=-3.7&n=5")
positions.text

Looks like this is a JSON-formatted document, just as we were expecting by the looks of the URL we're calling:

In [None]:
positions.json()

Once we've explored a bit and tested what we wanted to do, let's write a function to encapsulate it:

In [None]:
def get_iss(lat, lon, passes):
    # Encode the parameters in the URL by hand
    url = f"http://api.open-notify.org/iss-pass.json?lat={lat}&lon={lon}&n={passes}"
    response = requests.get(url)
    result = response.json()

    return result['response']

get_iss(40.0, 3.5, 5)

Although we managed to get the response, more complicated sets of parameters are tedious and error-prone to encode by hand. Thankfully, the `requests` library can do that work for us if we pass it a dictionary with all the parameters:

In [None]:
madrid_coords = {'lat': 40.4, 'lon': -3.7}

r = requests.get('http://api.open-notify.org/iss-pass.json', params=madrid_coords)
r.json()

In [None]:
resp = r.json()['response']

pd.DataFrame(resp)
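To see what that encoding step involves, the standard library's `urllib.parse.urlencode` performs a comparable conversion from a dictionary to a query string. This is a sketch of the idea, not a claim about how `requests` implements it internally, and `q` is a hypothetical parameter name used only for illustration:

```python
from urllib.parse import urlencode

# The Madrid coordinates used above, plus the number of passes.
params = {'lat': 40.4, 'lon': -3.7, 'n': 5}
query = urlencode(params)
print('http://api.open-notify.org/iss-pass.json?' + query)

# Characters with a special meaning in URLs are escaped for us.
escaped = urlencode({'q': 'Kansas City & Chicago'})
print(escaped)
```

The second call is where hand-written f-strings tend to go wrong: spaces and `&` inside a value have to be escaped, or the server will misread the parameters.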

Sometimes even more complicated sets of parameters are required. When that is the case, API designers often require them in JSON format, sent in the body of a `POST` request.

For example, take a look at the [QPX API from Google](https://developers.google.com/qpx-express/v1/trips/search). In the documentation, they define the body of the request, which we will have to provide, and of the response, which they'll provide back.

In [None]:
help(requests.post)
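As a rough illustration of what a `POST` request with a JSON body consists of, the sketch below builds one with the standard library and inspects it without sending anything. The endpoint and the payload are hypothetical; they do not correspond to any real API:

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, used only to show the mechanics.
url = 'https://api.example.com/v1/search'
payload = {'origin': 'MAD', 'destination': 'SFO', 'passengers': {'adults': 1}}

# The body of an HTTP request is bytes, so serialize and encode the payload.
body = json.dumps(payload).encode('utf-8')

request = Request(
    url,
    data=body,
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# Nothing is sent here: we only inspect the request we have built.
print(request.get_method())
print(request.get_header('Content-type'))
print(request.data)
```

With `requests`, the equivalent call would be along the lines of `requests.post(url, json=payload)`, which serializes the dictionary and sets the content type for us.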

# Web scraping

**TODO** Intro to the different kinds of web: public, deep and dark. Why scraping and crawling.


Let's now see how to use Python to request information from a web server, how to perform basic handling of the server’s response, and how to begin interacting with a website in an automated fashion. 

Let's now see what an HTML document looks like and what the DOM (Document Object Model) associated with it could look like:

![HTML to DOM](http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png)

![DOM TREE](http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png)

The DOM represents the document as nodes and objects. Browsers and programming languages use it to connect to the page. The HTML generates a DOM, but the browser and JavaScript can modify the latter, so the correspondence between HTML and DOM is not always one-to-one.
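To get a feel for the tree hidden in a piece of HTML, here is a small sketch that prints the tags of a made-up document indented by depth, using the standard library's `html.parser`. It assumes every tag in the document is explicitly closed, and it is only an approximation of the tree a browser would build:

```python
from html.parser import HTMLParser

# Made-up document, small enough to read the tree by eye.
html_doc = """
<html>
  <head><title>My page</title></head>
  <body>
    <h1>A heading</h1>
    <p>Some <a href="https://www.python.org">link</a></p>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is closed, so the depth stays balanced.
        self.depth -= 1


printer = TreePrinter()
printer.feed(html_doc)
print('\n'.join(printer.lines))
```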

We've got different libraries to tackle web scraping with Python:

- Beautiful Soup (leveraging Requests)
- Scrapy
- Selenium

Scrapy is a complete web scraping framework that takes care of everything from getting the HTML to processing the data. Selenium is a browser automation tool that can, for example, enable you to navigate between multiple pages. These two libraries have a steeper learning curve than Requests (used to get the HTML data) combined with Beautiful Soup (used as a parser for the HTML). So we'll start with Beautiful Soup, and we'll learn a bit about Selenium later in this notebook.

In [None]:
la_url = 'https://aflcio.org/what-unions-do/social-economic-justice/advocacy/legislative-alerts'

In [None]:
from IPython.display import display, HTML

display(HTML(la_url))

This will work on Jupyter, but won't on Colab:

In [None]:
from IPython.display import IFrame

IFrame(la_url, 800, 600)

Let's now use the Beautiful Soup library, named after the Beautiful Soup poem in Lewis Carroll's Alice in Wonderland (because its goal is to make sense of the nonsensical). Typically, you will need to install this library using `pip`, but Google Colab already includes it:

In [None]:
from bs4 import BeautifulSoup

r = requests.get(la_url)

page = r.content
page[:1000]

Let's talk about the parsers Beautiful Soup can use.

`html.parser` is a parser that is included with Python 3 and requires no extra installations in order to use. It is reasonably fast.

Another popular HTML parser is `html5lib`, which we're going to use here. `html5lib` is an extremely forgiving parser that takes even more initiative in correcting broken HTML. It requires an external dependency and is slower than `html.parser`. Despite this, it may be a good choice if you are working with messy or handwritten HTML sites.

There's also the `lxml` parser. It is also more forgiving with broken HTML than `html.parser` (though it does not autocorrect as much as `html5lib`), and it is the fastest of the three. Depending on what you'll be parsing and your goals, you can choose one or the other.

In [None]:
soup = BeautifulSoup(page, 'html5lib')
print(soup.prettify()[:1000])

In [None]:
print(soup.prettify()[28700:30500])

In [None]:
help(soup.find_all)

In [None]:
alerts = soup.find_all('div', class_='content-details')
print(len(alerts))
type(alerts)

The `ResultSet` class is a subclass of `list`. Looping through the results of `find_all()` is the most common way to process a `ResultSet`, as we'll see a few cells down.

Being a list-like structure, we can access the elements using the index:


In [None]:
alerts[0]

Let's process this first alert a bit more and extract exactly the information we want from the HTML:

In [None]:
first = alerts[0]
print(first.find('time').get_text())
print(first.a.find('span').get_text())
print(first.a['href'])
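Since the live page may change or be unavailable, it can help to try the same calls on a small inline snippet first. The markup below is made up for this example; it only imitates the structure the cell above relies on (a `div` with class `content-details` holding a `time` element and a link with a `span`):

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the structure of one alert on the page.
sample_html = """
<div class="content-details">
  <time>January 5, 2021</time>
  <a href="/sample/first-letter"><span>First sample letter</span></a>
</div>
<div class="content-details">
  <time>January 7, 2021</time>
  <a href="/sample/second-letter"><span>Second sample letter</span></a>
</div>
"""

# html.parser ships with Python, so no extra parser needs to be installed.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for alert in sample_soup.find_all('div', class_='content-details'):
    rows.append({
        'date': alert.find('time').get_text(strip=True),
        'title': alert.a.find('span').get_text(strip=True),
        'link': alert.a['href'],
    })

print(rows)
```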

Now, let's try to encapsulate this logic into a function that extracts all the information we want from the alerts in the page.

If you're curious, we're using Python type hints to tell our tooling that we're returning a list.

In [None]:
import typing

# Python 3.9 onwards, you can use list[dict] instead of typing.List[dict]
def get_aflcio_alerts() -> typing.List[dict]:
    # Initialize the results list to push elements when ready
    result = []
    r = requests.get('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts')
    soup = BeautifulSoup(r.content, 'html5lib')
    
    for alert in soup.find_all('div', class_='content-details'):
        # Initialize a dictionary to store our information
        dictionary = {}
        dictionary['date'] = alert.find('time').get_text()
        dictionary['title'] = alert.a.find('span').get_text()
        dictionary['link'] = 'http://www.aflcio.org' + alert.a['href']
        
        # Add the alert data to our list
        result.append(dictionary)
        
    return result

Ok, let's test our function and show some results:

In [None]:
letters = get_aflcio_alerts()
letters[:2]

And we come full circle! We encode the list we created as a JSON string. We could then provide that over the internet in our own API (something that is out of the scope of this class, but very interesting nevertheless).

In [None]:
json.dumps(letters)[:1000]

## Practice - Scraping IMDB movie data

**TODO** https://www.dataquest.io/blog/web-scraping-beautifulsoup/

## Ultra easy scraping with pandas!

When the data we want is already formatted as a table, we can do it even more easily! Just use `pandas.read_html`:

In [None]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll', header=0)

In [None]:
tables[4].head()

## Scraping with Selenium

Selenium is an open source tool to automate testing in web browsers. It does not need a full-fledged browser: we can use it with what's called a "headless" browser, as we'll be doing here.

A *headless* browser is just a regular browser with no visible UI elements. It can do far more than make requests: it can render HTML (though you will not see it), keep session information, and even perform asynchronous network communication by running JavaScript code. Headless browsers are widely used for automation tasks.

To use Selenium with a browser, you need to install a WebDriver for the browser of your choosing. Also, you'd need to install the Selenium Python package (Selenium exists independently of Python, but a module is available to use Selenium inside Python).

Instead of going through the installation of all this on your machine, we'll take advantage of a ready-made Python module that provides the Chromium driver for use inside Colab notebooks. Selenium installation is outside the scope of this class, but if you're interested there's plenty of information on the Internet about it.

Let's practice with Selenium over a scrape-friendly (licensing wise) site that lists some books. The site is [Books to scrape](http://books.toscrape.com/).

We'll scrape the details of each book on the page. Each page has 20 books, and the details of each book can be found at the URL on the book's card. We'll do this for all the books on the page and for all the pages the site has.

Pages follow a simple URL structure, and we'll use that. If the site you're scraping needs a button to be clicked, Selenium can do that as well.
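A minimal sketch of how that URL structure can be exploited, assuming the page number is the only part of the address that changes. The helper name `page_urls` is made up for this example, and the number of pages is passed in rather than assumed:

```python
BASE_URL = 'http://books.toscrape.com/catalogue/category/books_1/page-{}.html'


def page_urls(n_pages):
    """Build the URLs of the first `n_pages` catalogue pages."""
    return [BASE_URL.format(number) for number in range(1, n_pages + 1)]


# 3 is an arbitrary number of pages chosen for the example.
for url in page_urls(3):
    print(url)
```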


### Scraping one book

Open [Chrome Developer Tools](https://developers.google.com/web/tools/chrome-devtools?hl=es) and inspect one of the books, focusing on the `<article class="product_pod">` element. This HTML code contains the URL of the book detail page, which in turn contains what we want to scrape:

- Title
- Stock Status
- Rating
- Description
- Price
- Tax
- UPC

The first thing we need to do is install/import our webdriver and the Selenium Python package, as mentioned before. We'll be using the [Kora](https://github.com/airesearch-in-th/kora) package that will simplify our task here:

In [None]:
!pip install kora -q

Now, let's import the Chromium webdriver and use it to load the first page of the site we want to scrape:

In [None]:
from kora.selenium import wd
wd.get('http://books.toscrape.com/catalogue/category/books_1/page-1.html')

Inspecting the page with Chrome Dev Tools, we can see that we get the detailed book information from the book URL that's in the element `product_pod`. Let's use the webdriver to find all those elements in the page and select the first one:

In [None]:
product_pod = wd.find_elements_by_class_name("product_pod")[0]
product_pod

Got it! Now, let's dig into `product_pod` structure to extract the link we're interested in. Once more, what we're looking for is outlined in the source code of the page that we can navigate using Chrome Dev Tools:

In [None]:
book_link = product_pod.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
book_link

Now we'll use again the webdriver to load the book description page and continue our analysis so we can scrape the data we're interested in:

In [None]:
wd.get(book_link)

We'll navigate the document structure using `find_element_by_xpath`. XPath is convenient because it's quite powerful when you don’t have a suitable id or name attribute for the element you wish to locate. You can use XPath to either locate the element in absolute terms (not recommended), or relative to an element that does have an id or name attribute.
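The difference between absolute and relative paths can be tried without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, on a made-up, well-formed fragment loosely shaped like a book detail page. It illustrates the idea and is not a substitute for Selenium's own XPath support:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment loosely shaped like a book detail page.
fragment = """
<html>
  <body>
    <div id="content_inner">
      <article>
        <div>
          <div>cover image</div>
          <div>
            <h1>A Sample Book</h1>
            <p>10.00</p>
          </div>
        </div>
      </article>
    </div>
  </body>
</html>
"""
page = ET.fromstring(fragment.strip())

# Absolute: every step from the root is spelled out, so any change in the
# layout above the target breaks the path.
absolute = page.find("./body/div/article/div/div[2]/h1")

# Relative: jump straight to the element with a known id, then walk down.
relative = page.find(".//*[@id='content_inner']/article/div[1]/div[2]/h1")

print(absolute.text)
print(relative.text)
```

Both paths reach the same element here, but only the relative one would survive a new wrapper being added around `content_inner`.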

Use this [tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths) to understand the XPath syntax and the different wildcards that we're using here. The next cell locates each of the elements we want and prints them:

In [None]:
title = wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
price = wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
stock_status = wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
rating = wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")

description = wd.find_element_by_xpath("//*[@id='content_inner']/article/p")
upc = wd.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")
tax = wd.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")
category =  wd.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")

print(f"Title: {title}\n",
      f"Description: {description}\n",
      f"Rating: {rating}\n",
      f"Stock Status: {stock_status}\n",
      f"Price: {price}\n",
      f"Tax: {tax}\n",
      f"UPC: {upc}\n"
      )

Let's clean this up a bit by accessing the text inside the element in most of the attributes we want:

In [None]:

book = {
    'Title': title.text,
    'Description': description.text,
    'Rating': rating,
    'Stock Status': stock_status,
    'Price': price.text,
    'Tax': tax.text,
    'UPC': upc.text
}

book

OK, it looks like we almost have our book dictionary, but there are still some rough edges. Let's focus first on the 'Stock Status' entry and try to extract the number of items in stock from it:

In [None]:
import re

# Use a raw string so that \d is passed to the regex engine untouched
book['Stock Status'] = int(re.findall(r"\d+", stock_status.text)[0])

book
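Indexing the result of `re.findall` with `[0]` raises an `IndexError` when the text contains no digits at all. A slightly more defensive sketch follows; the sample strings are written for this example, and treating "no number found" as 0 is an assumption, not something the site guarantees:

```python
import re


def parse_stock(text):
    """Return the number of items in stock found in `text`, or 0 if none."""
    match = re.search(r'\d+', text)
    # Treating "no number found" as 0 is an assumption made for this sketch.
    return int(match.group()) if match else 0


# Sample strings written for the example.
print(parse_stock('In stock (22 available)'))
print(parse_stock('Out of stock'))
```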

OK, now let's clean up the rating so it's a number we can work with. We'll extract the number word from the class name and use a module to convert that word to an integer:

In [None]:
!pip install word2number

In [None]:
from word2number import w2n
book['Rating'] = w2n.word_to_num(rating.split()[1])

book
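Since a rating can only take a handful of values, a plain dictionary is a possible alternative to installing an extra package. This sketch assumes the class attribute looks like `star-rating Three`, that is, two words with the rating in second place, which is the shape the `split()` call above relies on:

```python
# Mapping for the five possible ratings; assumes class attributes such as
# "star-rating Three", with the rating word in second place.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def rating_from_class(class_attribute):
    """Turn a class attribute like 'star-rating Three' into the integer 3."""
    word = class_attribute.split()[1]
    return RATING_WORDS[word]


print(rating_from_class('star-rating Three'))
```

An unexpected word raises a `KeyError`, which makes a change in the page's markup easy to notice.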

Let's wrap all this book scraping logic into a Python function, so we can iterate easily over all the books in the page:

In [None]:
def scrape_book(book_link):
  wd.get(book_link)
  book = {
    'Title': wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1").text,
    'Description': wd.find_element_by_xpath("//*[@id='content_inner']/article/p").text,
    'Rating': w2n.word_to_num(wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class").split()[1]),
    # The stock status lives in the second paragraph, p[2]; p[1] is the price
    'Stock Status': int(re.findall(r"\d+", wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]").text)[0]),
    'Price': wd.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]").text,
    'Tax': wd.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td").text,
    'UPC': wd.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td").text
  }
  return book

In [None]:
scrape_book(book_link)

### Scraping all the books in one page

OK, time to get all the books on the first page. First, let's reset what the webdriver points to by reloading the first page of books:

In [None]:
wd.get('http://books.toscrape.com/catalogue/category/books_1/page-1.html')
wd

Let's find all the `product_pod` elements in the page. This contains all the links we're interested in:

In [None]:
product_pods = wd.find_elements_by_class_name("product_pod")

This is a list:

In [None]:
type(product_pods)

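...but its elements are not strings. They are `WebElement` objects tied to the page that is currently loaded. Be careful with this: once the webdriver navigates to another page, these elements are no longer usable, so we can't visit each book while iterating over them: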

In [None]:
type(product_pods[0])

We'll need to collect the book links first, and only then build the list of dictionaries containing the book info we want to scrape:

In [None]:
book_links = []
for product_pod in product_pods:
  book_link = product_pod.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
  book_links.append(book_link)

books=[]
for book_link in book_links:
  books.append(scrape_book(book_link))

books

Again, let's wrap this up in a nice Python function:

In [None]:
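def scrape_page(page_link):
  wd.get(page_link)

  # Find the book cards on the page we have just loaded
  product_pods = wd.find_elements_by_class_name("product_pod")

  book_links = []
  for product_pod in product_pods:
    book_link = product_pod.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
    book_links.append(book_link)

  # Collect every link before navigating away, then scrape each book
  books = []
  for book_link in book_links:
    books.append(scrape_book(book_link))

  return books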

Now with this new function you could scrape the whole site. Try to reuse this function and do it as an exercise.

To finish with Selenium, let's shut down the webdriver so we don't keep the headless browser open:

In [None]:
# quit() ends the whole browser session; close() only closes the current window
wd.quit()

# Exercises

### Exercise:

Extract the date of the worst aviation disaster from: https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll

Prerequisites: pandas, pd.read_html

In [None]:
aviation = pd.read_html('https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll', header=0)[1]
aviation.head(1)['Date']

### Exercise: 

Assuming the list is exhaustive, calculate how many people died in accidental explosions per decade in the XX century. Plot it.

Data: 
https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll

Prerequisites: pandas, pd.read_html, pd.to_datetime, matplotlib or seaborn

In [None]:
explosions = pd.read_html('https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll', header=0)[4]
explosions.head()

In [None]:
explosions['year'] = explosions['Date'].str[-4:]

In [None]:
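# regex=True is needed in recent pandas versions for the pattern to be treated as a regular expression
explosions['Deaths'] = pd.to_numeric(explosions.Deaths.str.replace('[^0-9]', '', regex=True))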

In [None]:
explosions['Decade'] = explosions['year'].str[:3] + '0s'

In [None]:
explosions.head()

In [None]:
# Filter on the 'year' column created above (there is no 'date' column)
twentieth_century = explosions[(explosions['year'] > '1900') & (explosions['year'] < '2000')]
per_decade = twentieth_century.groupby('Decade')['Deaths'].sum()

In [None]:
per_decade
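The cells above chain several small transformations: take the year from the date, clean the death counts, derive the decade, filter and sum. As a way to check that logic without downloading the table, here is a plain-Python sketch of the same steps on a few made-up rows (the values are invented for the example and are not taken from the Wikipedia table):

```python
import re
from collections import defaultdict

# Made-up rows standing in for the scraped table: (date, deaths).
rows = [
    ('1 January 1911', '1,200'),
    ('15 March 1918', '300+'),
    ('9 July 1944', '50'),
    ('3 May 2004', '75'),
]

totals = defaultdict(int)
for date, deaths in rows:
    year = int(date[-4:])
    if not 1900 < year < 2000:
        continue  # same bounds as the filter in the cell above
    decade = f'{year // 10 * 10}s'
    # Drop everything that is not a digit, as the pandas version does.
    totals[decade] += int(re.sub(r'[^0-9]', '', deaths))

for decade in sorted(totals):
    print(decade, totals[decade])
```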

In [None]:
import seaborn as sns
%matplotlib inline

# Plot the per-decade totals computed above; passing the raw rows to
# barplot would show the mean per decade rather than the sum
sns.barplot(x=per_decade.index,
            y=per_decade.values,
            color='darkgrey')

### Exercise: 

Create a function that, given the two tables extracted from http://en.wikipedia.org/wiki/List_of_S%26P_500_companies and a date, returns the list of companies in the S&P 500 at that date.