# An API explainer notebook

We use this notebook to explain APIs. APIs make it easy for us to download data from webpages in a format that we can work with. Let's see how!


Let's say we want to download the Wikipedia page on J. Robert Oppenheimer: https://en.wikipedia.org/wiki/J._Robert_Oppenheimer

How can we download a webpage? If we ask the web for help, a [top reply](https://stackoverflow.com/questions/22676/how-to-download-a-file-over-http) suggests that the following code should work:

In [2]:
import urllib.request

url = 'https://example.com'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8')

Alternatively, you can use the [requests](https://requests.readthedocs.io/en/latest/) library.

In [3]:
import requests

response = requests.get(url)
# optionally, check response status code before proceeding further:
# response.raise_for_status()
# If the status 200 is returned, the request was successful.
print("Status code:", response.status_code)
print("Headers:", response.headers)
print("Content:", response.text[:20]) # Use slicing to limit the output

Status code: 200
Headers: {'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '600723', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Wed, 30 Aug 2023 08:48:32 GMT', 'Etag': '"3147526947+gzip"', 'Expires': 'Wed, 06 Sep 2023 08:48:32 GMT', 'Last-Modified': 'Thu, 17 Oct 2019 07:18:26 GMT', 'Server': 'ECS (dcb/7F83)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '648'}
Content: <!doctype html>
<htm
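The `Content-Type` header in the output above names the character set (`charset=UTF-8`), whereas the urllib cell simply assumes `'utf-8'` when decoding. Here is a minimal sketch of reading the encoding from that header instead, using only the standard library. The helper name `charset_from_content_type` and the sample bytes are made up for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Pick the charset out of a Content-Type header value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


# Stand-in for the bytes returned by response.read(); not real downloaded data.
sample_bytes = "<!doctype html>".encode("utf-8")

encoding = charset_from_content_type("text/html; charset=UTF-8")
decoded = sample_bytes.decode(encoding)
print(encoding, decoded)

# With a live urllib response, the same lookup is available directly as
# response.headers.get_content_charset().
```

With requests this step is usually unnecessary, since `response.text` already decodes the body using the encoding it infers from the headers.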


> Try replacing the example address with the address of the Wikipedia page we want to download.

> Now, print the _text_ variable below.

In [4]:
text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

There is some structure in this text, but it is difficult to see. If we want to work with what we just downloaded, we will spend _a lot_ of time cleaning the data.

APIs let us interact with websites in a _structured_ fashion. Let's use the Wikipedia API to get the page we are interested in.

We can get the page in an easy-to-work-with format by requesting a new address built from a few base ingredients, such as the API _baseurl_, an _action_, a _data format_, and more (see [the API quick start guide](https://www.mediawiki.org/wiki/API:Main_page)):

In [5]:
# urllib version
baseurl = "https://en.wikipedia.org/w/api.php?"
action = "action=query"
title = "titles=J._Robert_Oppenheimer"
content = "prop=revisions&rvprop=content"
dataformat = "format=json"

query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)
print(query)


https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=J._Robert_Oppenheimer&format=json
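Gluing the pieces together by hand works here because none of the values contain spaces or other special characters. As a sketch of a more forgiving alternative that stays within the standard library, `urllib.parse.urlencode` can build the same query string from key/value pairs and escape awkward characters along the way. The variable names `api_endpoint`, `api_params`, and `encoded_query` are chosen for this example only, and the second title is just an illustration of a value containing a space.

```python
from urllib.parse import urlencode

# Same ingredients as above, written as key/value pairs.
api_endpoint = "https://en.wikipedia.org/w/api.php"
api_params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "titles": "J._Robert_Oppenheimer",
    "format": "json",
}

# urlencode joins the pairs with "&" and escapes special characters.
encoded_query = api_endpoint + "?" + urlencode(api_params)
print(encoded_query)

# A title containing a space is escaped automatically (the space becomes "+").
spaced_title = urlencode({"titles": "Ada Lovelace"})
print(spaced_title)
```

The first printed address should match the `query` string built with `format` above.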


In [6]:
# requests version, using the params argument
# this way there's no need to manually format the query string
params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "format": "json",
    "titles": "J._Robert_Oppenheimer"
}

> Try following the _query_ link. This is a webpage, but structured in a way that makes it easy for us to work with when we download it with Python.
> Explore the structure of the page. How do you get to the actual content of the page?

Now, let's download the nicely-structured page with Python. We do exactly what we did at the top of this notebook.

In [7]:
# urllib version
import json

wikiresponse = urllib.request.urlopen(query)
wikidata = wikiresponse.read()
wikitext = wikidata.decode('utf-8')
wikijson = json.loads(wikitext)
wikijson

{'batchcomplete': '',
  'revisions': {'*': 'Because "rvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.'}},
 'query': {'normalized': [{'from': 'J._Robert_Oppenheimer',
    'to': 'J. Robert Oppenheimer'}],
  'pages': {'39034': {'pageid': 39034,
    'ns': 0,
    'title': 'J. Robert Oppenheimer',
    'revisions': [{'contentformat': 'text/x-wiki',
      'contentmodel': 'wikitext',

In [8]:
# requests version
wikiresponse = requests.get(baseurl, params=params)  # a Response object, not text
wikijson = wikiresponse.json()
wikijson

{'batchcomplete': '',
  'revisions': {'*': 'Because "rvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.'}},
 'query': {'normalized': [{'from': 'J._Robert_Oppenheimer',
    'to': 'J. Robert Oppenheimer'}],
  'pages': {'39034': {'pageid': 39034,
    'ns': 0,
    'title': 'J. Robert Oppenheimer',
    'revisions': [{'contentformat': 'text/x-wiki',
      'contentmodel': 'wikitext',

This might not look much better than what we had at first. But what we have now is a dictionary with the same structure as the ordered webpage provided by the API. See:

In [9]:
print("keys:", wikijson.keys())
print("one level deeper:", wikijson["query"])



> Now explore the dictionary structure. Can you find the page content again?

> Also download the source for your four favorite Wikipedia pages and explore their structure.
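If you want to check your exploration, here is one possible way to walk the dictionary down to the page text. It is a sketch that runs on a small hand-written stand-in for the API response, not on real downloaded data. The helper name `extract_wikitext` and the placeholder text are made up. The `"*"` key is where the legacy output format keeps the text, and the `slots` branch assumes the layout the API uses when `rvslots=main` is added to the request, as the warning in the output above suggests.

```python
def extract_wikitext(api_json):
    """Return {title: wikitext} for every page in a query response (hypothetical helper)."""
    contents = {}
    for page in api_json["query"]["pages"].values():
        revisions = page.get("revisions")
        if not revisions:
            # Pages that do not exist come back without any revisions.
            continue
        revision = revisions[0]
        # Legacy layout: the text sits under "*" directly on the revision.
        # With rvslots=main it sits one level deeper, under slots -> main.
        if "slots" in revision:
            revision = revision["slots"]["main"]
        contents[page["title"]] = revision["*"]
    return contents


# Hand-written stand-in that mirrors the structure printed above.
sample = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "39034": {
                "pageid": 39034,
                "ns": 0,
                "title": "J. Robert Oppenheimer",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "PLACEHOLDER WIKITEXT",
                    }
                ],
            }
        }
    },
}

pages = extract_wikitext(sample)
print(pages["J. Robert Oppenheimer"])
```

Calling `extract_wikitext(wikijson)` on the dictionary downloaded earlier should give the real page source. Looping over `pages.values()` avoids hard-coding the page id, which is different for every page.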

To sum up: 
- The web has _a lot_ of content that could be cool to work with. 
- APIs make it possible for us to download content in a structure that we can work with.
- APIs are great!