In [1]:
import json

!pip install -q requests

import requests

## API versus web-scraping

Both are ways to get data from the internet:

API
- structured
- limited data / rate limits
- parsing JSON

Web scraping
- less structure
- parsing HTML

This notebook covers using an API - see the [web scraping](web-scraping.ipynb) notebook for web scraping.

Before we introduce using an API we will first cover **JSON** - a file format used often in API calls.  We will also take a look at handling files in Python.

## JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

A Python `dict` looks a lot like JSON:

In [2]:
data = {'alan': 'turing'}

data

{'alan': 'turing'}

But true JSON is just a string.  We can use the Python standard library to turn the `dict` into a JSON string:

In [3]:
json_data = json.dumps(data)
json_data

'{"alan": "turing"}'

In [4]:
type(json_data)

str

We can turn this string back into a `dict`:

In [5]:
json.loads(json_data)

{'alan': 'turing'}
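The round trip also works for nested data, and it shows how Python types map onto JSON types. A minimal sketch using only the standard library (the contents of `record` are made up for illustration):

```python
import json

#  illustrative data - the values are made up for this example
record = {
    'name': 'alan turing',
    'born': 1912,
    'alive': False,
    'spouse': None,
    'fields': ('mathematics', 'computer science'),
}

#  indent and sort_keys make the string easier to read
as_json = json.dumps(record, indent=2, sort_keys=True)
print(as_json)

restored = json.loads(as_json)

#  False became false, None became null, and the tuple became a JSON array,
#  which loads back as a list rather than a tuple
print(restored['fields'])
print(restored == record)
```

The final comparison is `False` because JSON has no tuple type, so a round trip through JSON does not always return exactly the object you started with.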

Let's save this data to disk as a JSON file.  

We will use the **context management** feature of Python (see [Python with Context Managers - Jeff Knupp](https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/) for a deeper look).

## Why do we need to manage context?

When we are reading or writing to a file we consume an operating system resource known as a file descriptor.  

The OS limits the number of file descriptors a process can have open.  We can see this by running the following in a shell (we use the `!` shortcut to run it directly in a notebook):

In [6]:
!ulimit -n

256
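The same limit can be read from inside Python. A small sketch using the standard library `resource` module, which (like `ulimit`) is only available on Unix-like systems:

```python
import resource  #  Unix only

#  the soft limit is the value `ulimit -n` reports,
#  the hard limit is the ceiling the soft limit can be raised to
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft limit: {soft}, hard limit: {hard}')
```

The numbers depend on your operating system and shell configuration, so they may differ from the `256` shown above.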


We open files using the Python `open` builtin.  We specify the file path and the mode, most commonly:
- `r` read
- `w` write (creates the file if it doesn't exist, and truncates it if it does)
- `a` append
- adding `+` (for example `w+`) opens the file for both reading and writing

We can read a file:

In [7]:
open('./readme.md', 'r').read()

'A collection of notebooks that teach Python from the top down - introducing language features as they are needed in a real example.\n\nThey provide the student with a view of an entire Python program, immersing the student in a big picture and explaining concepts & components as they appear, in the context of using them in the program.\n\nThis approach can be directly contrasted with the more common bottom up approach, where language features are introduced out of context.  You can find a collections of notebooks that teach Python from the bottom up in [teaching-monolith/basics](https://github.com/ADGEfficiency/teaching-monolith/tree/master/python/basics) and [teaching-monolith/advanced](https://github.com/ADGEfficiency/teaching-monolith/tree/master/python/advanced).\n\nIt is recomended that the notebooks are taught in the following order:\n- AB testing\n    * collections.nametuple\n- using an API\n    * file read/write (I/O) management\n- web scraping\n    * HTML parsing'

We can write to a new file:

In [8]:
open('./dump.md', 'w+').write('hi')

2

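The issue with the code above is that we never close the file, so its file descriptor stays in use.  We can fix this by explicitly closing the file.  Note that in the code below we open the file far more times than the `ulimit` - this only works because each file is closed before the next one is opened:

In [11]:
#  how not to do it - closing by hand is easy to forget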

data = []
for _ in range(1024):
    fi = open('./readme.md', 'r')
    fi.close()
    data.append(fi)

The Pythonic way of handling opening & closing of files is **context management**:

In [12]:
#  the pythonic way - one less line
data = []
for _ in range(1024):
    with open('./readme.md', 'r') as fi:
        data.append(fi)
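A `with` block behaves roughly like a `try`/`finally` that always closes the file, even if an error occurs while the file is open. The sketch below compares the two forms; it writes to a temporary directory (an assumption made so the example is self-contained) rather than reading `readme.md`:

```python
import os
import tempfile

#  write a small throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'example.md')
with open(path, 'w') as fi:
    fi.write('hi')

#  roughly what `with` does for us - close the file even if reading fails
fi = open(path, 'r')
try:
    manual = fi.read()
finally:
    fi.close()

#  the same thing with a context manager
with open(path, 'r') as fi:
    managed = fi.read()

#  outside the block the file object still exists, but it is closed
print(fi.closed)
print(manual == managed)
```

This is also why the `data` list in the cell above ends up holding closed file objects: the name `fi` outlives the `with` block, but the underlying file descriptor has already been released.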

Now that we understand context management, we can save our `data` dict as JSON:

In [13]:
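data = {'name': 'alan turing'}

#  option 1 - serialize to a string, then write the string
with open('./test.json', 'w') as fi:
    fi.write(json.dumps(data))

#  option 2 - let json.dump write straight to the file (overwrites the output of option 1)
with open('./test.json', 'w') as fi:
    json.dump(data, fi)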
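Reading the file back is the mirror image: `json.load` takes an open file, just as `json.dump` does. A short sketch, writing to a temporary directory (an assumption, chosen so the example does not touch your working directory):

```python
import json
import os
import tempfile

data = {'name': 'alan turing'}

#  a temporary directory keeps the example away from your working directory
path = os.path.join(tempfile.mkdtemp(), 'test.json')

with open(path, 'w') as fi:
    json.dump(data, fi)

with open(path, 'r') as fi:
    loaded = json.load(fi)

print(loaded == data)
```

The pairs are easy to mix up: `dumps`/`loads` work with strings, while `dump`/`load` work with file objects.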

## REST APIs

API stands for application programming interface.

Take a look at [ProgrammableWeb](https://www.programmableweb.com/apis/directory) for a collection of available APIs.
- also look for the *Developer* or *For Developers* documentation on your favourite website

A RESTful API exposes resources over HTTP and typically supports the four CRUD operations (create, retrieve, update, delete).

HTTP stands for HyperText Transfer Protocol.  It is the underlying protocol of the World Wide Web: it defines how messages are formatted and transmitted, and what actions web servers and browsers should take in response to various commands.
- the data your business logic works on belongs in the body (content); metadata belongs in the headers

HTTP methods:
- GET - retrieve information about the REST API resource
- POST - create a REST API resource
- PUT - update a REST API resource
- DELETE - delete a REST API resource or related component
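The mapping between CRUD operations and HTTP methods can be sketched without sending anything over the network. The example below builds (but never sends) request objects with the standard library `urllib.request.Request`, rather than the `requests` library used in the rest of this notebook; the `api.example.com` endpoint and the payload are hypothetical:

```python
import json
from urllib.request import Request

#  hypothetical endpoint, used only for illustration - nothing is sent
base = 'https://api.example.com/launches'
payload = json.dumps({'mission_name': 'demo'}).encode('utf-8')
headers = {'Content-Type': 'application/json'}

#  one request object per CRUD operation
crud = {
    'create': Request(base, data=payload, headers=headers, method='POST'),
    'retrieve': Request(f'{base}/1', method='GET'),
    'update': Request(f'{base}/1', data=payload, headers=headers, method='PUT'),
    'delete': Request(f'{base}/1', method='DELETE'),
}

for operation, request in crud.items():
    has_body = request.data is not None
    print(f'{operation:<8} {request.get_method():<6} {request.full_url} body={has_body}')
```

Note how the data lives in the body of the `POST` and `PUT` requests, while the `Content-Type` header carries metadata describing that body.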

## SpaceX API using Python

Now that we know a bit about APIs, let's use one.  We will use the [SpaceX API](https://github.com/r-spacex/SpaceX-API).

We use the `requests` HTTP library to perform a `GET` request:

In [14]:
response = requests.get("https://api.spacexdata.com/v3/launches/latest")

Use `dir` to see what we can do with this HTTP response:

In [15]:
dir(response)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
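Most of that list is made up of names starting with an underscore, which are internal to the object. A small sketch of filtering them out; `public_names` is a hypothetical helper and `FakeResponse` is a made-up stand-in for the real response object, used so the example runs without a network call:

```python
def public_names(obj):
    """Return the names on obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    status_code = 200

    def __init__(self):
        self._content = b'{}'

    def json(self):
        return {}


print(public_names(FakeResponse()))
```

Calling `public_names(response)` on the real response would leave only the entries from `apparent_encoding` onwards.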

We can get the HTTP status code and the response headers:

In [16]:
response.status_code

200

In [17]:
response.headers

{'Date': 'Tue, 07 Jan 2020 08:48:52 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=d0f73d98b1775ca0f6b3d8b1ae34af9e61578386932; expires=Thu, 06-Feb-20 08:48:52 GMT; path=/; domain=.spacexdata.com; HttpOnly; SameSite=Lax; Secure', 'X-DNS-Prefetch-Control': 'off', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'X-Download-Options': 'noopen', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Vary': 'Origin', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'spacex-api-cache,spacex-api-count,spacex-api-response-time', 'spacex-api-cache': 'HIT', 'spacex-api-response-time': '1ms', 'Content-Encoding': 'gzip', 'CF-Cache-Status': 'DYNAMIC', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '5514acd67c2fc290-FRA'}
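The first digit of a status code tells you which class of outcome you got. A minimal sketch using the standard library `http.HTTPStatus` enum; the `describe` helper is hypothetical:

```python
from http import HTTPStatus

#  the first digit of a status code gives its class
CLASSES = {
    1: 'informational',
    2: 'success',
    3: 'redirect',
    4: 'client error',
    5: 'server error',
}


def describe(status_code):
    """Return a short description such as '200 OK (success)'."""
    status = HTTPStatus(status_code)
    return f'{status.value} {status.phrase} ({CLASSES[status.value // 100]})'


for code in (200, 404, 429, 500):
    print(describe(code))
```

Code 429 (Too Many Requests) is the one you are likely to meet when you hit an API rate limit. On a `requests` response, `response.ok` and `response.raise_for_status()` (both listed in the `dir` output) make this check for you.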

And the response body using the `json` method:

In [18]:
content = response.json()
content

{'flight_number': 87,
 'mission_name': 'Starlink 2',
 'mission_id': [],
 'launch_year': '2020',
 'launch_date_unix': 1578363540,
 'launch_date_utc': '2020-01-07T02:19:00.000Z',
 'launch_date_local': '2020-01-06T21:19:00-05:00',
 'is_tentative': False,
 'tentative_max_precision': 'hour',
 'tbd': False,
 'launch_window': 0,
 'rocket': {'rocket_id': 'falcon9',
  'rocket_name': 'Falcon 9',
  'rocket_type': 'FT',
  'first_stage': {'cores': [{'core_serial': 'B1049',
     'flight': 4,
     'block': 5,
     'gridfins': True,
     'legs': True,
     'reused': True,
     'land_success': True,
     'landing_intent': True,
     'landing_type': 'ASDS',
     'landing_vehicle': 'OCISLY'}]},
  'second_stage': {'block': 5,
   'payloads': [{'payload_id': 'Starlink 2',
     'norad_id': [],
     'reused': False,
     'customers': ['SpaceX'],
     'nationality': 'United States',
     'manufacturer': 'SpaceX',
     'payload_type': 'Satellite',
     'payload_mass_kg': 15400,
     'payload_mass_lbs': 33951.2,

This response is a Python **dictionary**.  We can access the **keys** of the dictionary:

In [19]:
content.keys()

dict_keys(['flight_number', 'mission_name', 'mission_id', 'launch_year', 'launch_date_unix', 'launch_date_utc', 'launch_date_local', 'is_tentative', 'tentative_max_precision', 'tbd', 'launch_window', 'rocket', 'ships', 'telemetry', 'launch_site', 'launch_success', 'links', 'details', 'upcoming', 'static_fire_date_utc', 'static_fire_date_unix', 'timeline', 'crew', 'last_date_update', 'last_ll_launch_date', 'last_ll_update', 'last_wiki_launch_date', 'last_wiki_revision', 'last_wiki_update', 'launch_date_source'])

We can access the **values** using square bracket indexing with a key:

In [21]:
content['links'].keys()

dict_keys(['mission_patch', 'mission_patch_small', 'reddit_campaign', 'reddit_launch', 'reddit_recovery', 'reddit_media', 'presskit', 'article_link', 'wikipedia', 'video_link', 'youtube_id', 'flickr_images'])

In [29]:
content['links']['flickr_images']

[]

In [25]:
content['links']['mission_patch']

'https://images2.imgbox.com/d2/3b/bQaWiil0_o.png'
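Square bracket indexing raises a `KeyError` as soon as one key is missing, which is common with API responses where fields come and go. One possible way to guard against that is a small helper that walks the nested dictionaries; `dig` is a hypothetical name, and `content` here is a trimmed copy of the launch response shown above:

```python
#  a trimmed copy of the launch response shown above
content = {
    'mission_name': 'Starlink 2',
    'rocket': {'rocket_name': 'Falcon 9'},
    'links': {
        'mission_patch': 'https://images2.imgbox.com/d2/3b/bQaWiil0_o.png',
        'flickr_images': [],
    },
}


def dig(data, *keys, default=None):
    """Follow keys into nested dictionaries, returning default if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


print(dig(content, 'rocket', 'rocket_name'))
print(dig(content, 'links', 'mission_patch'))
print(dig(content, 'links', 'no_such_key', default='missing'))
```

For a single level, the built-in `content.get('links')` does the same job.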

In [55]:
image = content['links']['mission_patch']
image

'https://images2.imgbox.com/d2/3b/bQaWiil0_o.png'

In [54]:
image.split('/')[-1]

'bQaWiil0_o.png'
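Splitting on `/` works for this URL, but it would keep any query string that follows the file name. A slightly more defensive sketch parses the URL first; `filename_from_url` is a hypothetical helper, and the `?size=small` variant is made up to show the difference:

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    path = urlparse(url).path
    return posixpath.basename(path)


image = 'https://images2.imgbox.com/d2/3b/bQaWiil0_o.png'
print(filename_from_url(image))

#  made-up variant with a query string
with_query = image + '?size=small'
print(with_query.split('/')[-1])
print(filename_from_url(with_query))
```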

In [51]:
image = content['links']['mission_patch']
print(image)
response = requests.get(image)

#  run a shell command to make a new directory
#  -p so the command also succeeds if the directory already exists
!mkdir -p images

with open("./images/spacex.png", 'wb') as f:
    f.write(response.content)

https://images2.imgbox.com/d2/3b/bQaWiil0_o.png


We can run the shell command `ls` to see the new `images` directory:

In [28]:
!ls

dump.md                  readme.md                web-scraping.ipynb
images                   test.json
linear-programming.ipynb using-an-api.ipynb


We can now see this image (you may need to run this cell again).

![](images/spacex.png)

## Exercise

Download data from the SpaceX API (for the latest launch) into a folder named after the `mission_name`:
- images as `.png` files in an `images` folder
- the metadata (flight number etc.) in a `json` file

Company info - https://api.spacexdata.com/v3/info

## Exercise

[Programmable Web API directory](https://www.programmableweb.com/apis/directory) - pick an API and grab some data

If you are stuck - the Wikipedia API:
- [Main API page](https://www.mediawiki.org/wiki/API:Main_page)
- [What the actions are](https://www.mediawiki.org/w/api.php)
- [Python examples](https://github.com/wikimedia/mediawiki-api-demos/tree/master/python)
- `https://en.wikipedia.org/w/api.php?action=opensearch&search=germany&limit=2&format=json`
- `https://en.wikipedia.org/w/api.php?action=parse&page=germany&format=json`

Exercise: get data for different countries.
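One way to start on this exercise is to build the query string from a dictionary instead of typing the URL by hand. A sketch using the standard library `urllib.parse.urlencode`; `parse_url` is a hypothetical helper and the list of countries is arbitrary:

```python
from urllib.parse import urlencode

BASE = 'https://en.wikipedia.org/w/api.php'


def parse_url(page):
    """Build the URL for the `parse` action on a given page."""
    params = {'action': 'parse', 'page': page, 'format': 'json'}
    return f'{BASE}?{urlencode(params)}'


#  arbitrary examples - note the space in the last one is encoded for us
for country in ['germany', 'france', 'new zealand']:
    print(parse_url(country))
```

With the `requests` library you can skip the string building and pass the same dictionary directly, as in `requests.get(BASE, params=params)`.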

In [60]:
res = requests.get('https://en.wikipedia.org/w/api.php?action=parse&page=germany&format=json')

In [62]:
requests.get('https://en.wikipedia.org/w/api.php?action=opensearch&search=germany&limit=2&format=json')

<Response [200]>

In [61]:
res.status_code

200

In [48]:
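#  assumes `res` holds the list of all launches (the 87 items counted below),
#  not the Wikipedia response fetched above
for fl in res.json():
    try:
        #  flickr_images is nested under the 'links' key
        if fl['links']['flickr_images']:
            print(fl)
    except (KeyError, TypeError):
        pass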

In [49]:
len(res.json())

87

In [63]:
import os

image = 'image.png'

#  hard-coded '/' separators are not portable across operating systems
print('./' + 'images/' + image)
print()

#  os.path.join uses the separator of the current operating system
print(os.path.join('.', 'images', image))

./images/image.png

./images/image.png


In [66]:
os.makedirs(
    os.path.join('.', 'new', 'dir'), exist_ok=True
)
!tree

.
├── dump.md
├── images
│   ├── spacex.jpg
│   └── spacex.png
├── linear-programming.ipynb
├── new
│   └── dir
├── readme.md
├── test.json
├── using-an-api.ipynb
└── web-scraping.ipynb

3 directories, 8 files
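The standard library `pathlib` module is an alternative to `os.path.join` and `os.makedirs` that many people find easier to read. A minimal sketch, working inside a temporary directory (an assumption, so the example leaves your working directory alone):

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

#  the / operator joins path components with the right separator for the OS
new_dir = root / 'new' / 'dir'

#  parents=True creates intermediate directories, exist_ok=True makes it safe to repeat
new_dir.mkdir(parents=True, exist_ok=True)
new_dir.mkdir(parents=True, exist_ok=True)

image_path = root / 'images' / 'image.png'
print(image_path.name, image_path.suffix, image_path.parent.name)
```

A `Path` can be passed straight to `open`, so it works with the file handling shown earlier in this notebook.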


In [87]:
r = requests.get('https://upload.wikimedia.org/wikipedia/commons/f/fb/Darts_in_a_dartboard.jpg')

In [88]:
with open("test.jpg", 'wb') as f:
    f.write(r.content)

In [83]:
#  check that the image request succeeded
r.status_code

200

In [114]:
r = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&list=allimages&aifrom=Darts_in_a_dartboard&ailimit=100000000')

In [116]:
for q in r.json()['query']['allimages']:
    if 'dartboard' in q['name']:
        print(q)

In [118]:
q['name']

'Daughter-of-the-forest.jpg'
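The loop only sees one page of results. When more results are available, the Wikipedia API includes a `continue` key in the response (visible in the raw `r.json()` output), and its values are meant to be sent back as parameters of the next request. A sketch of that loop under stated assumptions: `iter_allimages` is a hypothetical helper, and `fake_fetch` is a made-up two-page stand-in for the real API so the example runs without a network call:

```python
def iter_allimages(fetch, params):
    """Yield image records across result pages.

    fetch is any callable that takes a params dict and returns the
    decoded JSON response as a dict.
    """
    params = dict(params)
    while True:
        page = fetch(params)
        yield from page['query']['allimages']
        if 'continue' not in page:
            break
        #  send the continuation tokens back with the next request
        params.update(page['continue'])


def fake_fetch(params):
    """Made-up stand-in for the API - returns two pages of results."""
    if 'aicontinue' not in params:
        return {
            'continue': {'aicontinue': 'B.jpg', 'continue': '-||'},
            'query': {'allimages': [{'name': 'A.jpg'}]},
        }
    return {'batchcomplete': '', 'query': {'allimages': [{'name': 'B.jpg'}]}}


names = [image['name'] for image in iter_allimages(fake_fetch, {'action': 'query', 'list': 'allimages'})]
print(names)
```

Against the real API, `fetch` could be something like `lambda p: requests.get('https://en.wikipedia.org/w/api.php', params=p).json()`.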

In [108]:
r.json()

{'batchcomplete': '',
 'continue': {'aicontinue': 'DaughterCongo.gif', 'continue': '-||'},
 'query': {'allimages': [{'name': 'Darts_of_Pleasure.PNG',
    'timestamp': '2017-07-07T01:05:31Z',
    'url': 'https://upload.wikimedia.org/wikipedia/en/3/31/Darts_of_Pleasure.PNG',
    'descriptionurl': 'https://en.wikipedia.org/wiki/File:Darts_of_Pleasure.PNG',
    'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=13402008',
    'ns': 6,
    'title': 'File:Darts of Pleasure.PNG'},
   {'name': 'Daruchini_Dip.jpg',
    'timestamp': '2011-05-08T05:37:16Z',
    'url': 'https://upload.wikimedia.org/wikipedia/en/6/68/Daruchini_Dip.jpg',
    'descriptionurl': 'https://en.wikipedia.org/wiki/File:Daruchini_Dip.jpg',
    'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=31711797',
    'ns': 6,
    'title': 'File:Daruchini Dip.jpg'},
   {'name': 'Darucover.jpg',
    'timestamp': '2009-03-08T21:03:43Z',
    'url': 'https://upload.wikimedia.org/wikipedia/en/1/1a/Darucover.