In [None]:
import json
!pip install -Uq requests
import requests
!mkdir -p data

# Using an API

Learning outcomes

- the difference between an API & web scraping
- what JSON is (and why it's like a Python `dict`)
- how to properly handle files in Python
- what a REST API is
- how to use the `requests` library

## API versus web-scraping

**Both are ways to sample data from the internet**

API
- structured
- provided as a service (you are talking to a server via a REST API)
- limited data / rate limits / paid / require auth (sometimes)
- most will give back JSON (maybe XML or CSV)

Web scraping
- less structure
- parsing HTML meant for your browser

Neither is better than the other

- API developer can limit what data is accessible through the API
- API developer may stop maintaining the API
- website page can change HTML structure
- website page can have dynamic (JavaScript) content that requires execution (usually done by the browser) before the correct HTML is available

Much of the work in using an API is figuring out how to properly construct URLs for `GET` requests
- requires looking at their documentation (& ideally a Python example!)

## Where to find APIs

- [ProgrammableWeb](https://www.programmableweb.com/apis/directory) - a collection of available APIs
- Look for the *Developer* or *For Developers* documentation on your favourite website
- [public-apis/public-apis](https://github.com/public-apis/public-apis)

## Using APIs

Most APIs require authentication

- so the API developer knows who you are
- can charge you
- can limit access
- commonly via key or OAuth (both of which may be free)

All the APIs we use here are unauthenticated - this saves you all the time of signing up

If the API requires authentication, it's usually done by passing your credentials along with the request (often as a header):

```python
response = requests.get(url, auth=auth)
```
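
As a rough illustration of what ends up in the request, the sketch below builds the `Authorization` header used by HTTP Basic authentication, which is what `requests` sends when `auth` is a `(username, password)` tuple. The username, password, key and the `X-API-Key` header name are all made up for the example - check the documentation of your API for the scheme it actually expects.

```python
import base64

# Hypothetical credentials, for illustration only
username, password = "alan", "enigma"

# HTTP Basic auth: base64-encode "username:password" and send it in a header
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
basic_headers = {"Authorization": f"Basic {token}"}

# Key-based APIs often expect the key in a header instead
# ("X-API-Key" is an assumed name - the real one is in the API docs)
key_headers = {"X-API-Key": "my-secret-key"}

print(basic_headers)
```

Note that base64 is an encoding, not encryption - credentials sent this way are only protected if the URL uses `https`.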

## JSON strings

JSON (JavaScript Object Notation) is a:
- lightweight data-interchange format (text)
- easy for humans to read and write 
- easy for machines to parse and generate
- based on key, value pairs

You can think of the Python `dict` as JSON-like:

In [None]:
data = {'name': 'alan-turing'}
data

But true JSON is just a string (only text).  We can use `json.dumps` from the standard library to turn the `dict` into a JSON string:

In [None]:
data = json.dumps(data)
data

In [None]:
type(data)

We can then use `json.loads` to turn this string back into a `dict`:

In [None]:
json.loads(data)
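
The round trip is not always exact, because JSON has fewer types than Python. A small sketch with a made-up record (the keys and values are invented for the example):

```python
import json

# A made-up record mixing Python types that JSON handles differently
record = {
    "name": "alan-turing",
    "born": 1912,
    "fields": ("maths", "computing"),  # a tuple
    "alive": None,
}

as_text = json.dumps(record)      # dict -> JSON string
round_trip = json.loads(as_text)  # JSON string -> dict

print(as_text)
print(round_trip["fields"])  # the tuple comes back as a list
```

Here `None` is written as `null` in the JSON text, and the tuple becomes a JSON array, which loads back as a Python `list`.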

In [None]:
!ulimit -n

## Opening, reading & writing to files

### Reading from a file

We open files using the Python `open` builtin function, followed by a `read`:

In [None]:
open('./readme.md', 'r').read()[:100]

(Note the `./path` - the `.` refers to the current working directory.  It's not straightforward to know where this is - is it where the notebook is, or where the notebook server is running?)

If we wanted to read the file as separate lines, we could use `readlines()` (note we would still need to manually strip off the `\n` characters later)

(Note `\n` is used as a newline indicator in text files - you never see it because your editor interprets it as a line break :)

In [None]:
open('./readme.md', 'r').readlines()[:8]
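
Stripping the `\n` characters could look something like the sketch below. To keep it independent of `readme.md`, it starts from a made-up list of the kind `readlines()` returns.

```python
# The same kind of list that readlines() returns (contents made up)
lines = ["# Title\n", "first line\n", "second line\n"]

# Option 1: strip the trailing newline from each line
clean = [line.rstrip("\n") for line in lines]

# Option 2: start from the whole text and split it in one go
text = "".join(lines)
also_clean = text.splitlines()

print(clean)
print(also_clean)
```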

## Using `open`

`open(path, mode)`

Common values for the mode:
- `r` read
- `rb` read binary
- `w` write (creates the file if it doesn't exist, overwrites it if it does)
- `w+` write and read (the `+` opens the file for both)
- `a` append

Note there are options for both reading & writing - we actually use `open` for both reading & writing.

We open a file using the Python builtin `open`, which is then followed by either a read or write stage

- open the file
- read the file OR write to the file

Notice that the file is read in as a single string, with the newline character `\n` separating lines

- this is how all text files are structured
- your editor does the line splitting for you

### Writing to a file (without context management)

We can write to a new file using the same `open` builtin

- open the file
- write to the file

In [None]:
open('./data/output.data', 'w').write('We make this file to show how not to do it\n')

Note that we can do the same file write by explicitly assigning the file object to a variable (note the `a` to append)

In [None]:
fi = open('./data/output.data', 'a')
fi.write('We make this file slightly differently to show how not to do it\n')

The issue with the code above is that we aren't closing the file - we can fix this by intentionally closing the file.  

One way to do this is to use `.close()` when we are done:

In [None]:
fi = open('./data/output.data', 'a')
fi.write('This time we close the file manually\n')
fi.close()

This requires us to remember to close (also an additional line).

### Reading files with context management

The Pythonic way of handling opening & closing of files is context management:

In [None]:
with open('./readme.md', 'r') as fi:
    data = fi.read()
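
A quick way to convince yourself that the `with` block closes the file is to check the `closed` attribute of the file object afterwards. The sketch below uses a throwaway file in a temporary directory (the path is only for this example) so it does not depend on `readme.md`.

```python
import os
import tempfile

# A throwaway file so the example does not depend on readme.md
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as fi:
    fi.write("hello\n")

# The file is closed as soon as the block ends...
with open(path, "r") as fi:
    contents = fi.read()
closed_after_block = fi.closed

# ...even if something goes wrong inside the block
try:
    with open(path, "r") as fi:
        raise ValueError("something went wrong while reading")
except ValueError:
    closed_after_error = fi.closed

print(closed_after_block, closed_after_error)
```

Both values are `True` - with a manual `.close()`, the error case would have left the file open unless we wrapped it in `try` / `finally` ourselves.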

### Writing to a file with context management

Now that we understand context management, we can save our `data` dict as JSON, using `json.dump` to write the dict to a file:

In [None]:
data = {'name': 'alan turing'}
with open('./data/output.json', 'w') as fi:
    json.dump(data, fi)

Let's check it worked by loading the file, here using `json.load` to load from the file object:

In [None]:
with open('./data/output.json', 'r') as fi:
    data = json.load(fi)
data
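
Now that both context management and the file modes have been covered, the difference between `w` and `a` can be checked directly. This is a small sketch using a throwaway file in a temporary directory (the path and contents are made up for the example).

```python
import os
import tempfile

# A throwaway file in a temporary directory (only for this example)
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as fi:   # "w" creates the file, or empties it if it exists
    fi.write("first\n")

with open(path, "a") as fi:   # "a" keeps the existing contents and adds to the end
    fi.write("second\n")

with open(path, "r") as fi:
    after_append = fi.read()

with open(path, "w") as fi:   # opening with "w" again discards what was there
    fi.write("third\n")

with open(path, "r") as fi:
    after_rewrite = fi.read()

print(repr(after_append))
print(repr(after_rewrite))
```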

## REST APIs

[REST - Wiki](https://en.wikipedia.org/wiki/Representational_state_transfer)

REST is a set of constraints that allow **stateless communication of text data on the internet**

- REST = REpresentational State Transfer
- API = Application Programming Interface

REST
- communication of resources (located at URLs / URIs)
- requests for a resource are responded to with a text payload (HTML, JSON etc)
- these requests are made using HTTP (determines how messages are formatted, what actions (methods) can be taken)
- common HTTP methods are `GET` and `POST`

HTTP methods
- GET - retrieve information about the REST API resource
- POST - create a REST API resource
- PUT - update a REST API resource
- DELETE - delete a REST API resource or related component

RESTful APIs enable you to develop any kind of web application having all possible CRUD (create, retrieve, update, delete) operations

- can do anything we would want to do with a database

*Further reading*
- [Web Architecture 101](https://engineering.videoblocks.com/web-architecture-101-a3224e126947) for more detail on how the web works

## Example - sunrise API

Docs - https://sunrise-sunset.org/api

First we need to form the url
- use `?` to separate the API server name from the parameters for our request
- use `&` to separate the parameters from each other
- use `+` instead of space in the parameter
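
The rules above can also be applied programmatically rather than by hand. The sketch below uses `urlencode` from the standard library. The `place` parameter is made up (it is not part of the sunrise API) and is only there to show how a space gets encoded.

```python
from urllib.parse import urlencode

base = "https://api.sunrise-sunset.org/json"

# "place" is a made-up parameter, included only to show the space -> "+" rule
params = {"lat": "36.7201600", "lng": "-4.4203400", "place": "costa del sol"}

query = urlencode(params)   # joins parameters with "&", replaces spaces with "+"
url = f"{base}?{query}"
print(url)
```

`requests.get` also accepts a `params` dict and builds the query string for you, which avoids most hand-written URL mistakes.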

In [None]:
res = requests.get("https://api.sunrise-sunset.org/json?lat=36.7201600&lng=-4.4203400")
data = res.json()
data

This response is JSON - the `.json()` method of the response (`res.json()`) turns it into a `dict`:

In [None]:
type(data)

It's common to have a top level hierarchy to dig through to get the data:

In [None]:
data.keys()

Here the interesting stuff is in `results`:

In [None]:
data['results']

## Example - Chronicling America API

Docs - https://chroniclingamerica.loc.gov/about/api/ 

In [None]:
term = "germany"
fmt = "json"
url = f"https://chroniclingamerica.loc.gov/search/pages/results/?proxtext={term}&format={fmt}"
url

We use the `requests` HTTP library to perform a `GET` request:

In [None]:
response = requests.get(url)

## HTTP response

What we received above is an *HTTP response*

[HTTP Response - Wiki](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Response_message)

The response message consists of the following:

- a status line which includes the status code and reason message (e.g., HTTP/1.1 200 OK, which indicates that the client's request succeeded)
- response header fields (e.g., Content-Type: text/html)
- an optional message body

## What can we do with this HTTP response in `requests`?

The Python builtin `dir` gives us all the attributes & methods of a Python object.

This also includes all the `__` dunder (literally double-under) methods - which we filter out using a list comprehension.

In [None]:
[o for o in dir(response) if '__' not in o]

We can get the HTTP status code (used to communicate things like everything OK (200), stop making requests etc - see [List of HTTP status codes - Wiki](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)):

In [None]:
response.status_code
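
If you want the standard meaning of a status code without leaving Python, the standard library knows the reason phrases. The sketch below looks up a few common codes. The `is_success` helper is a made-up name for this example, not part of `requests`.

```python
from http import HTTPStatus

# A few codes you are likely to meet when calling an API
for code in (200, 404, 429, 500):
    print(code, HTTPStatus(code).phrase)

def is_success(code):
    """Return True for 2xx codes, which signal that the request succeeded."""
    return 200 <= code < 300

print(is_success(200), is_success(404))
```

With `requests` you can also call `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses instead of letting them pass silently.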

The HTTP response headers:

In [None]:
response.headers

And the response body:

In [None]:
response.text[:1000]

In [None]:
import json
data = json.loads(response.text)

We can access the keys of the dictionary:

In [None]:
data.keys()

We can access the values using the square bracket indexing with a key:

In [None]:
data['totalItems']

While JSON is a simple text format, it can become complex due to

- nesting (JSON inside JSON)
- lists of JSON

An example is our `items`, which has been parsed as a Python `list`:

In [None]:
type(data['items'])

`items` is a list of dicts:

In [None]:
item = data['items'][0]
item.keys()
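
Digging through nested JSON is a matter of chaining indexes: a key for each dict and a position for each list. The payload below is made up and only mimics the overall shape of the response (a count plus a list of dicts) - the `title` field and all the values are assumptions for the example.

```python
import json

# A made-up payload: a count plus a list of dicts
payload = (
    '{"totalItems": 2, "items": ['
    '{"title": "paper a", "date": "19140801"}, '
    '{"title": "paper b", "date": "18990315"}]}'
)
parsed = json.loads(payload)

# Index into the outer dict, then the list, then the inner dict
first_title = parsed["items"][0]["title"]

# Pull one field out of every item
dates = [entry["date"] for entry in parsed["items"]]

print(first_title, dates)
```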

We can iterate over both the keys and values as a pair using `items()`.

Below we do a quick check that the value isn't too long before printing:

In [None]:
from collections.abc import Sized

for k, v in item.items():
    # only values with a length (str, list, dict) can be checked with len()
    if isinstance(v, Sized) and len(v) < 100:
        print(f'{k}: {v}')

Let's finish this exercise by only taking articles that appear after a given year, and save those to disk.

Normally you would apply this kind of filtering in the API request - we are going to filter in memory.

First we need a bit of data cleaning of our date, which is written as digits in year-month-day order (but is a `str`):

In [None]:
item['date']

Here we use `strptime` to convert the string into a proper datetime:
- ([Python's strftime directives](http://strftime.org/) are very useful!)

In [None]:
from datetime import datetime as dt
dt.strptime(item['date'], "%Y%m%d")
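
`strptime` parses a string into a datetime, and `strftime` goes the other way. A small sketch with a made-up date in the same layout:

```python
from datetime import datetime as dt

raw = "19140801"   # made-up value in the same year-month-day layout

parsed = dt.strptime(raw, "%Y%m%d")      # str -> datetime
pretty = parsed.strftime("%Y-%m-%d")     # datetime -> str, in a different layout

print(parsed.year, parsed.month, parsed.day)
print(pretty)
```

If the string does not match the format (for example it includes dashes), `strptime` raises a `ValueError`.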

Now let's put this data cleaning & filtering into a pipeline:

In [None]:
term = "germany"
fmt = "json"
url = f"https://chroniclingamerica.loc.gov/search/pages/results/?proxtext={term}&format={fmt}"
res = requests.get(url)
data = res.json()
items = data['items']

start = 1900
extract = []
for item in items:
    item['date'] = dt.strptime(item['date'], "%Y%m%d")
    
    if item['date'].year > start:
        extract.append(item)
        
len(extract)
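
If you wanted an upper bound on the year as well, the same loop could be wrapped in a function. This is only a sketch: `filter_by_year` is a made-up name, and the sample items are invented so it runs without calling the API.

```python
from datetime import datetime as dt

def filter_by_year(items, start, end):
    """Keep items whose date falls in the years start..end (inclusive)."""
    extract = []
    for item in items:
        date = dt.strptime(item["date"], "%Y%m%d")
        if start <= date.year <= end:
            # copy the item so the original keeps its string date
            extract.append({**item, "date": date})
    return extract

# Made-up items with the same date layout as the API response
sample = [
    {"title": "a", "date": "18990315"},
    {"title": "b", "date": "19140801"},
    {"title": "c", "date": "19451231"},
]
kept = filter_by_year(sample, start=1900, end=1920)
print([item["title"] for item in kept])
```

Copying each item (rather than overwriting `item['date']` in place) means the function can be re-run on the same list without `strptime` failing on an already-converted date.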

We have a list of dictionaries, which plays very nicely with `pandas`:

In [None]:
!pip install -q pandas
import pandas as pd
df = pd.DataFrame(extract)
df.head(2)

## Example - downloading images

We can also use `requests` to download things other than text - such as images.

Below we make a request and see that we get back bytes rather than readable text:

In [None]:
url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
res = requests.get(url)
res.content[:100]

We can use context management to dump the contents of the binary string into a file:

In [None]:
with open('./data/google-logo.png', 'wb') as fi:
    fi.write(res.content)
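
The `b` in the mode matters: it tells Python to write and read raw bytes rather than text. The sketch below shows the same write without needing the network. It uses the eight-byte signature that every PNG file starts with, followed by a few made-up bytes standing in for `res.content`.

```python
import os
import tempfile

# Every PNG file starts with these eight bytes
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Stand-in for res.content: the signature plus some made-up bytes
content = PNG_SIGNATURE + b"\x00\x01\x02"

path = os.path.join(tempfile.mkdtemp(), "example.png")
with open(path, "wb") as fi:      # "wb" writes the bytes exactly as given
    fi.write(content)

with open(path, "rb") as fi:      # "rb" reads them back as bytes, not str
    on_disk = fi.read()

print(type(on_disk), on_disk.startswith(PNG_SIGNATURE))
```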

We now have the Google logo locally (you may need to re-run this cell)

![](data/google-logo.png)

## Exercise (group) - earthquake API

As a group, let's write a program to get data from the USGS Earthquake Catalog - [documentation](https://earthquake.usgs.gov/fdsnws/event/1/#methods)

## Exercise (individual) - Wikipedia API

Now for an open-ended exercise for you! Your task is to:
- create a database of countries
- in a folder called `countries` (you will need to make the folder - you can do this in bash or Python)
- each country in its own folder
- start with Germany & New Zealand

V1 of your program should:
- save the url you use to request the data
- save the title
- save the `line` parameter of each section
- save all in a single JSON

V2 of your program should also:
- save all '.png' & '.jpg' images as images, with the url as the image name
- save all external links as CSV

Much of the work will be understanding how the Wikipedia API works - useful resources are below:
- [Main API page](https://www.mediawiki.org/wiki/API:Main_page)
- [What the actions are](https://www.mediawiki.org/w/api.php)
- [Python examples](https://github.com/wikimedia/mediawiki-api-demos/tree/master/python)

Please also feel free to work on another API - happy to assist you with this as well :)