# Exploring APIs and data structures with Jupyter notebooks

Recently a colleague shared a very useful technique for exploring Web APIs with me: [Jupyter notebooks](https://jupyter.org/).

Previously I used Bash scripts and [curl](https://curl.haxx.se/) for tasks like this. Other colleagues preferred GUI tools like [Postman](https://www.postman.com/).

Jupyter brings both worlds together:

* You can write code using powerful Python libraries
  * [Requests](https://requests.readthedocs.io/en/master/) library for HTTP requests
  * [Pandas](https://pandas.pydata.org/) library for data analysis
* You get documentation to share with your co-workers (and your future self)
  * GitHub will [render Jupyter notebooks](https://help.github.com/en/github/managing-files-in-a-repository/working-with-jupyter-notebook-files-on-github) as static HTML
  * You can include images, tables, and even interactive elements like maps
 
![](./jupyter.png)
  
  
## Setting up

To get started, you need to install the Jupyter package first:

> pip install jupyterlab

Depending on when you read this, you might have to check whether `pip` belongs to Python 3.x or still to the [legacy Python 2.7](https://pythonclock.org/). On my machine I had to use `pip3`. If that's the case for you, the Python executable is most likely named `python3` as well.
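
If you're unsure which interpreter a command points at, you can ask Python itself. Here is a minimal sketch using only the standard library; the helper name `is_python3` is made up for illustration:

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    return version_info[0] >= 3


# Show which interpreter is running and whether it is a Python 3 release.
print(sys.executable)
print(sys.version.split()[0], "-> Python 3:", is_python3())
```

Running `python -m pip install jupyterlab` (or `python3 -m pip ...`) is another way to sidestep the question, because the package is then installed for exactly the interpreter you invoked.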

Next you can start Jupyter:

> python -m jupyterlab

## Getting started with Requests

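The first library I want to introduce is [Requests](https://requests.readthedocs.io/en/master/), the de facto standard HTTP library for Python. It is not part of the standard library, so you may need to `pip install requests` first.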

In [1]:
import requests

Let's request something simple to try out Requests:

In [2]:
response = requests.request('GET', 'http://httpbin.org/json')
response.status_code

200

To get a pretty output from the JSON data, a quick helper function comes in handy:

In [3]:
import json
def pp(item):
    print(json.dumps(item, indent=2))

In [4]:
pp(response.json())

{
  "slideshow": {
    "author": "Yours Truly",
    "date": "date of publication",
    "slides": [
      {
        "title": "Wake up to WonderWidgets!",
        "type": "all"
      },
      {
        "items": [
          "Why <em>WonderWidgets</em> are great",
          "Who <em>buys</em> WonderWidgets"
        ],
        "title": "Overview",
        "type": "all"
      }
    ],
    "title": "Sample Slide Show"
  }
}


[Headers](https://github.com/psf/requests/blob/fd13816d015c4c90ee65297fa996caea6a094ed1/requests/models.py#L445) is a `CaseInsensitiveDict`, which `json.dumps` cannot serialize, so you need to wrap it in `dict()` before passing it to `pp`.

In [5]:
pp(dict(response.headers))

{
  "Date": "Mon, 02 Mar 2020 20:33:42 GMT",
  "Content-Type": "application/json",
  "Content-Length": "429",
  "Connection": "keep-alive",
  "Server": "gunicorn/19.9.0",
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Credentials": "true"
}
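
To see why the `dict()` wrapper is needed, here is a small offline sketch. The header value is typed in by hand rather than fetched, so it is only an illustration of how `CaseInsensitiveDict` behaves, not a real response:

```python
import json

from requests.structures import CaseInsensitiveDict

# Hypothetical header values, built by hand so no network call is needed.
headers = CaseInsensitiveDict({"Content-Type": "application/json"})

# Lookups ignore the case of the key.
print(headers["content-type"])

# json.dumps rejects the object because it is not a plain dict ...
try:
    json.dumps(headers)
    serializable = True
except TypeError:
    serializable = False
print(serializable)

# ... while a dict() copy serializes fine and keeps the original key casing.
as_json = json.dumps(dict(headers), indent=2)
print(as_json)
```

The case-insensitive lookup is handy when exploring, since servers are free to spell header names in any casing.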


You can still get the curl version of your request by using the [curlify](https://github.com/ofw/curlify) package:

> pip install curlify

In [6]:
import curlify
print(curlify.to_curl(response.request))

curl -X GET -H 'Accept: */*' -H 'Accept-Encoding: gzip, deflate' -H 'Connection: keep-alive' -H 'User-Agent: python-requests/2.22.0' http://httpbin.org/json


## Using Pandas to explore JSON documents

[Pandas](https://pandas.pydata.org/) is a data analysis and manipulation library that's popular among [Data Science](https://github.com/jakevdp/PythonDataScienceHandbook) people. Its `json_normalize` function is very useful for analyzing JSON documents.

To get started, let's first install the package (you might need to use `pip3`):

> pip install pandas

First, let's look at how it would work without pandas:

In [7]:
import requests

r = requests.request('GET', 'https://api.github.com/users/janahrens/repos')
# Named `repos` rather than `json` so the json module used by pp() is not shadowed
repos = r.json()
type(repos)

list

We now know that the call returns a JSON list. Let's examine what items this list has by looking at the first one.

In [8]:
repos[0].keys()

dict_keys(['id', 'node_id', 'name', 'full_name', 'private', 'owner', 'html_url', 'description', 'fork', 'url', 'forks_url', 'keys_url', 'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url', 'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url', 'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url', 'languages_url', 'stargazers_url', 'contributors_url', 'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url', 'comments_url', 'issue_comment_url', 'contents_url', 'compare_url', 'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url', 'milestones_url', 'notifications_url', 'labels_url', 'releases_url', 'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url', 'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size', 'stargazers_count', 'watchers_count', 'language', 'has_issues', 'has_projects', 'has_downloads', 'has_wiki', 'has_pages', 'forks_count', 'mirror_url', 'archived', 'disabled', 'open_issues_count', 'lic

Now that we know what kinds of fields we have available, we can use Python to further explore the data. 
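
As an illustration of what exploring with plain Python can look like, here is a minimal sketch. The list `sample_repos` is a hypothetical stand-in with made-up values, using three of the field names from the output above:

```python
from collections import Counter

# Hypothetical stand-in for the parsed response: a list of dicts with a few
# of the fields listed above. The values are made up.
sample_repos = [
    {"name": "alpha", "language": "Haskell", "stargazers_count": 7},
    {"name": "beta", "language": "Shell", "stargazers_count": 5},
    {"name": "gamma", "language": "Haskell", "stargazers_count": 1},
    {"name": "delta", "language": None, "stargazers_count": 0},
]

# Count repositories per language, skipping entries without one.
languages = Counter(repo["language"] for repo in sample_repos if repo["language"])
print(languages.most_common())

# Pick the repository with the most stars.
top = max(sample_repos, key=lambda repo: repo["stargazers_count"])
print(top["name"])
```

This works, but every new question needs another loop or comprehension.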

This process gets a lot easier with [Pandas](https://pandas.pydata.org/) and its `json_normalize` function.
With `json_normalize` the data gets parsed into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which is a core data structure for "Two-dimensional, size-mutable, potentially heterogeneous tabular data". In other words: It represents the data as a table.

In [9]:
from pandas import json_normalize
df = json_normalize(r.json())
df.shape

(30, 95)

The `.shape` attribute is a good first thing to check. It shows that our DataFrame/table has 30 rows and 95 columns. Let's see what those columns are:

In [10]:
df.columns

Index(['id', 'node_id', 'name', 'full_name', 'private', 'html_url',
       'description', 'fork', 'url', 'forks_url', 'keys_url',
       'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url',
       'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url',
       'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url',
       'languages_url', 'stargazers_url', 'contributors_url',
       'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url',
       'comments_url', 'issue_comment_url', 'contents_url', 'compare_url',
       'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url',
       'milestones_url', 'notifications_url', 'labels_url', 'releases_url',
       'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url',
       'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size',
       'stargazers_count', 'watchers_count', 'language', 'has_issues',
       'has_projects', 'has_downloads', 'has_wiki', 'has_pages', 
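
Note that `owner` shows up in the raw keys earlier but not in the column list: `json_normalize` flattens nested objects into dotted column names, which is also why there are more columns than top-level keys. A minimal sketch with hypothetical records (the values are made up, only the nesting mirrors the response):

```python
from pandas import json_normalize

# Hypothetical records shaped like the API response: "owner" is a nested object.
sample = [
    {"name": "alpha", "owner": {"login": "example-user", "id": 1}},
    {"name": "beta", "owner": {"login": "example-user", "id": 1}},
]

flat = json_normalize(sample)

# Nested keys become dotted column names; there is no plain "owner" column.
print(sorted(flat.columns))
print(flat.shape)
```

If the dots get in the way, the `sep` parameter changes the separator, for example `json_normalize(sample, sep="_")`.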

The list of columns itself isn't a very good demonstration of Pandas' analysis capabilities. It gets more useful if we use its sorting and filtering capabilities.

Let's find out which GitHub repositories have the most stars and only select some of the columns:

In [11]:
df.sort_values(by='stargazers_count', ascending=False).head()[['name', 'created_at', 'size', 'language', 'stargazers_count']]

|    | name | created_at | size | language | stargazers_count |
|---:|:-----|:-----------|-----:|:---------|-----------------:|
| 24 | threema-protocol-analysis | 2014-03-16T14:38:56Z | 311 | TeX | 17 |
| 11 | ipconfig-http-server | 2014-05-12T06:15:38Z | 152 | C | 6 |
| 29 | yesod-oauth-demo | 2012-05-15T21:02:29Z | 216 | Haskell | 5 |
| 27 | xing-api-haskell | 2013-01-28T07:28:41Z | 508 | Haskell | 5 |
| 4 | dotfiles | 2011-09-05T09:39:29Z | 2337 | Shell | 5 |
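
The cell above only sorts. Filtering works with a boolean mask. The following sketch rebuilds three columns of the five rows from the output above by hand, so that it runs without a network call:

```python
import pandas as pd

# The five rows from the output above, typed in by hand for a
# self-contained example.
top = pd.DataFrame(
    {
        "name": [
            "threema-protocol-analysis",
            "ipconfig-http-server",
            "yesod-oauth-demo",
            "xing-api-haskell",
            "dotfiles",
        ],
        "language": ["TeX", "C", "Haskell", "Haskell", "Shell"],
        "stargazers_count": [17, 6, 5, 5, 5],
    }
)

# A boolean mask keeps only the rows where the condition holds.
haskell = top[top["language"] == "Haskell"]
print(haskell["name"].tolist())

# Conditions can be combined with & (and) and | (or); each needs parentheses.
popular = top[(top["stargazers_count"] > 5) & (top["language"] != "TeX")]
print(popular["name"].tolist())
```

The same masks can be applied to the full `df` from the notebook, for example `df[df['language'] == 'Haskell']`.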


Pandas can do a lot more, and it's definitely worth taking a look at the [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) guide.

## Bonus: Generating a blog post from a Jupyter notebook

My blog gets written by feeding Markdown files [into Jekyll](https://help.github.com/en/github/working-with-github-pages/setting-up-a-github-pages-site-with-jekyll). Using [nbconvert](https://nbconvert.readthedocs.io/en/latest/) I was able to convert this notebook into a Markdown file. The only thing I had to add manually was the header for Jekyll. The rest of this post is directly from the notebook.

First install the nbconvert package:
> pip install nbconvert

Then you can invoke nbconvert on this file:
> python -m nbconvert files/explore-apis.ipynb --to markdown --stdout > _posts/2020-03-02-explore-apis.markdown
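
The Jekyll header mentioned above is a YAML front matter block between two `---` lines. Adding it could be scripted as well. The following is only a sketch: the helper name `with_front_matter` and the choice of the `layout` and `title` keys are assumptions for illustration, not the exact header used for this post:

```python
import json


def with_front_matter(markdown, title, layout="post"):
    """Prepend a minimal Jekyll front matter block to a Markdown string."""
    header = "\n".join(
        [
            "---",
            f"layout: {layout}",
            # json.dumps yields a double-quoted string, which is also valid YAML.
            f"title: {json.dumps(title)}",
            "---",
        ]
    )
    return header + "\n\n" + markdown


post = with_front_matter(
    "# Hello\n",
    "Exploring APIs and data structures with Jupyter notebooks",
)
print(post)
```

Quoting the title guards against characters such as `:` that would otherwise confuse the YAML parser.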