In [None]:
import json

!pip install -q requests

import requests

## API versus web-scraping

Both are ways to sample data from the internet.

API
- structured
- limited data / rate limits
- parsing JSON

Web scraping
- less structure
- parsing HTML

This notebook covers using an API - see the []() notebook for web scraping.

Before we introduce using an API we will first cover **JSON** - a file format used often in API calls.  We will also take a look at handling files in Python.

## JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

You can think of a Python `dict` as JSON-like:

In [None]:
data = {'alan': 'turing'}

data

But true JSON is just a string.  We can use the Python standard library to turn the `dict` into a JSON string:

In [None]:
json_data = json.dumps(data)
json_data

In [None]:
type(json_data)

We can turn this string back into a `dict`:

In [None]:
json.loads(json_data)
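The round trip is not always exact, because JSON has fewer types than Python. A minimal sketch, using a hypothetical `record` made up for illustration:

```python
import json

# Hypothetical record, used only to illustrate the round trip
record = {"name": "alan turing", "born": (1912, 6, 23), 1: "integer key"}

round_tripped = json.loads(json.dumps(record))

# Tuples come back as lists, and the integer key 1 comes back as the string "1"
print(round_tripped)
print(round_tripped == record)
```

JSON only has arrays and string keys, so the tuple and the integer key cannot survive unchanged.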

Let's save this data to disk as a JSON file.  

We will use the **context management** feature of Python (see [Python with Context Managers - Jeff Knupp](https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/) for a deeper look).

## Why do we need to manage context?

When we are reading or writing to a file we consume an operating system resource known as a file descriptor.  

The OS limits the number of file descriptors a process can have open.  We can see this by running the following in a shell (we use the `!` shortcut to run it directly in a notebook):

In [None]:
!ulimit -n

We open files using the Python `open` builtin.  We specify the file path and the mode, most commonly:
- `r` read
- `w` write (creates the file if it doesn't exist and truncates it if it does; `w+` additionally allows reading)
- `a` append
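As a rough sketch of how these modes behave, the cell below writes to a throwaway file in a temporary directory. The file name `modes.txt` is hypothetical, and the `with` blocks it relies on are explained further down:

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes.txt")  # hypothetical file name

    with open(path, "w") as fi:   # creates the file, or truncates it if it exists
        fi.write("first\n")

    with open(path, "a") as fi:   # appends to the end, keeping existing content
        fi.write("second\n")

    with open(path, "r") as fi:   # read-only; the file must already exist
        after_append = fi.read()

    with open(path, "w+") as fi:  # truncates like "w", but also allows reading
        fi.write("third\n")
        fi.seek(0)                # move back to the start before reading
        after_w_plus = fi.read()

print(repr(after_append))
print(repr(after_w_plus))
```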

For example, we can read an existing file:

In [None]:
open('./readme.md', 'r').read()

We can write to a new file:

In [None]:
open('./dump.md', 'w+').write('hi')

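The issue with the code above is that we never close the file, so its file descriptor is not released promptly.  We can fix this by closing the file explicitly.  Note that the loop below opens the file 1024 times, which is at or above the default `ulimit` on many systems; it only works because each file is closed before the next one is opened: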

In [None]:
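#  how not to do it - we have to remember to close every file ourselves

data = []
for _ in range(1024):
    fi = open('./readme.md', 'r')
    #  read the contents before closing; a closed file object can't be read
    data.append(fi.read())
    fi.close()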

The Pythonic way of handling opening & closing of files is **context management**:

In [None]:
#  the pythonic way - the file is closed automatically at the end of the block

data = []
for _ in range(1024):
    with open('./readme.md', 'r') as fi:
        #  store the contents, not the file object
        data.append(fi.read())
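To check that the `with` block really closes the file, we can inspect the file object's `closed` attribute. This is a minimal sketch that uses a temporary file as a stand-in for `./readme.md`, so it runs anywhere:

```python
import os
import tempfile

# A temporary file stands in for './readme.md'
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as tmp:
    tmp.write("# readme\n")
    path = tmp.name

with open(path, "r") as fi:
    inside = fi.closed      # the file is open inside the block
    text = fi.read()

outside = fi.closed         # leaving the block closed the file

# The file is also closed when the block exits through an exception
try:
    with open(path, "r") as fi2:
        raise ValueError("simulated failure")
except ValueError:
    closed_after_error = fi2.closed

os.remove(path)
print(inside, outside, closed_after_error)
```

Closing on an exception is the main advantage over calling `close` by hand, where an error between `open` and `close` would leave the file open.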

Now that we understand context management, we can save our `data` dict as JSON:

In [None]:
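data = {'name': 'alan turing'}

#  option 1 - serialize to a string, then write the string
with open('./test.json', 'w') as fi:
    fi.write(json.dumps(data))

#  option 2 - serialize straight into the file (overwrites it with the same content)
with open('./test.json', 'w') as fi:
    json.dump(data, fi)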
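Reading the file back is the mirror image: `json.load` parses an open file, just as `json.loads` parses a string. A small sketch, writing to a temporary directory rather than the working directory (the use of `indent=2` for readability is our choice):

```python
import json
import os
import tempfile

data = {'name': 'alan turing'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.json')

    with open(path, 'w') as fi:
        json.dump(data, fi, indent=2)  # indent makes the file easier to read

    with open(path, 'r') as fi:
        loaded = json.load(fi)

print(loaded == data)
```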

## REST APIs

API = application programming interface

Take a look at [ProgrammableWeb](https://www.programmableweb.com/apis/directory) for a collection of available APIs.
- Also look for the *Developer* or *For Developers* documentation on your favourite website.

RESTful APIs let a client perform all four CRUD (create, retrieve, update, delete) operations on a resource over HTTP.

HTTP (HyperText Transfer Protocol) is the underlying protocol of the World Wide Web. It defines how messages are formatted and transmitted, and what actions web servers and browsers should take in response to various commands.
- Data that your business logic works on should go in the body (content); metadata should go in the headers.

HTTP methods:
- GET - retrieve information about the REST API resource
- POST - create a REST API resource
- PUT - update a REST API resource
- DELETE - delete a REST API resource or related component
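The mapping between CRUD operations and HTTP methods can be sketched without sending anything over the network. The cell below is illustrative only: it builds request objects with the standard library's `urllib.request.Request` instead of `requests`, and the endpoint `https://api.example.com/launches` and its payload are hypothetical.

```python
import json
from urllib.request import Request

BASE = "https://api.example.com/launches"  # hypothetical endpoint
payload = json.dumps({"mission_name": "demo"}).encode("utf-8")
headers = {"Content-Type": "application/json"}  # metadata goes in the headers

# One request object per CRUD operation; constructing them sends nothing
crud = {
    "create": Request(BASE, data=payload, headers=headers, method="POST"),
    "retrieve": Request(f"{BASE}/1", method="GET"),
    "update": Request(f"{BASE}/1", data=payload, headers=headers, method="PUT"),
    "delete": Request(f"{BASE}/1", method="DELETE"),
}

for operation, request in crud.items():
    has_body = request.data is not None
    print(f"{operation:<8} {request.get_method():<6} {request.full_url}  body={has_body}")
```

With `requests`, the corresponding calls are `requests.post`, `requests.get`, `requests.put` and `requests.delete`.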

## SpaceX API using Python

Now that we know a bit about APIs, let's use one.  We will use the [SpaceX API](https://github.com/r-spacex/SpaceX-API).

We use the `requests` HTTP library to perform a `GET` request:

In [None]:
response = requests.get("https://api.spacexdata.com/v3/launches/latest")

Use `dir` to see what we can do with this HTTP response:

In [None]:
dir(response)

We can get the HTTP status code and headers:

In [None]:
response.status_code

In [None]:
response.headers
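The status code is a three-digit number whose first digit gives the class of the outcome. As a rough sketch, using only the standard library's `http.HTTPStatus` (the helper name `describe_status` is ours and is not part of `requests`):

```python
from http import HTTPStatus

# The first digit of a status code identifies its class
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus only knows the registered codes
        phrase = "Unknown"
    status_class = STATUS_CLASSES.get(code // 100, "unknown class")
    return f"{code} {phrase} ({status_class})"


for code in (200, 404, 429, 503):
    print(describe_status(code))
```

A `429` is what an API typically returns when you exceed its rate limit. In `requests`, `response.ok` and `response.raise_for_status()` offer a quick way to check for an error status.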

We can get the response body, parsed from JSON, using the `json` method:

In [None]:
content = response.json()
content

The parsed body is a Python **dictionary**.  We can access the **keys** of the dictionary:

In [None]:
content.keys()

We can access the **values** using square-bracket indexing with a key:

In [None]:
content['links']

In [None]:
content['links'].keys()

In [None]:
content['links']['flickr_images']
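Square-bracket indexing raises a `KeyError` when a key is missing, and `[-1]` raises an `IndexError` on an empty list. A minimal sketch of a more defensive lookup, using a hypothetical, trimmed-down dictionary in place of the real response (the values and the `rocket` key are assumptions for illustration):

```python
# Hypothetical, trimmed-down stand-in for the launch dictionary
sample = {
    "flight_number": 1,
    "mission_name": "demo",
    "links": {"flickr_images": []},
}

# .get returns a default instead of raising KeyError when a key is missing
images = sample.get("links", {}).get("flickr_images", [])

# Guard against an empty list before taking the last element
last_image = images[-1] if images else None

# "rocket" is not in the sample, so the default is returned
rocket_name = sample.get("rocket", {}).get("rocket_name", "unknown")

print(images, last_image, rocket_name)
```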

In [None]:
image = content['links']['flickr_images'][-1]
print(image)
response = requests.get(image)

#  run a shell command to make a new directory
#  -p means no error if the directory already exists
!mkdir -p images

with open("./images/spacex.jpg", 'wb') as f:
    f.write(response.content)

We can run the shell command `ls` to see the new `images` directory:

In [None]:
!ls

We can now see this image (you may need to run this cell again).

![](images/spacex.jpg)

## Exercise

Download data from the SpaceX API into a folder named after the `mission_name`:
- images as `.png` files in an images folder
- the metadata (flight number etc.) in a `json` file

## Exercise

[Programmable Web API directory](https://www.programmableweb.com/apis/directory) - pick an API and grab some data

If you are stuck, try the Wikipedia API:
- [Main API page](https://www.mediawiki.org/wiki/API:Main_page)
- [What the actions are](https://www.mediawiki.org/w/api.php)
- [Python examples](https://github.com/wikimedia/mediawiki-api-demos/tree/master/python)
- `https://en.wikipedia.org/w/api.php?action=opensearch&search=germany&limit=2&format=json`
- `https://en.wikipedia.org/w/api.php?action=parse&page=germany&format=json`
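Rather than typing query strings by hand, the two URLs above can be assembled from a dictionary of parameters. A small sketch using the standard library's `urllib.parse.urlencode` (the helper name `build_url` is ours):

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wikipedia.org/w/api.php"


def build_url(**params):
    """Join the base URL with URL-encoded query parameters."""
    return f"{BASE_URL}?{urlencode(params)}"


search_url = build_url(action="opensearch", search="germany", limit=2, format="json")
parse_url = build_url(action="parse", page="germany", format="json")

print(search_url)
print(parse_url)
```

`requests` can do the encoding for you if you pass the same dictionary as `params`, for example `requests.get(BASE_URL, params={...})`.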