# Web APIs and HTTP requests

In a nutshell, web APIs are publicly available interfaces through which third parties (this is us!) can access data resources in a remote, reliable and programmable manner. (There are also plenty of private APIs, but we do not care about them here since we cannot use them.)

What does it mean in practice?

* **Remote.** Users can access the resource from anywhere, provided they have an internet connection.
* **Reliable.** The interface exposed to users is independent of the internal details of the system that produces the data. In other words, the way a user communicates with the API is independent of the way the system works. In practice, this means that a user does not have to know anything about the system; it is enough to know the API interface.
* **Programmable.** An API can be interacted with through a predefined set of commands/methods (an interface) in a way that can be expressed in a programming language. This is usually achieved with the HTTP protocol, which is the standard communication protocol of the Web and for which utilities are available in every major programming language.

## Practical example -- Wikipedia

<center><img src="png/wiki_logo.png"/></center>

But what does all that really mean? Let's talk about [practice](https://youtu.be/eGDBR2L5kzI). We will use the public API of Wikipedia (we all know what it is, right?). The public Wikipedia API can be used for many purposes, and it makes almost all of the data stored within Wikipedia publicly available, such as page statistics, registered users, etc. We mentioned that an API is an interface that allows third parties to communicate with a platform and request various things from it in an orderly and programmable manner. Let us now see a real example of such an interface.

The Wikipedia API (for English Wikipedia) lives at this URL:

* [https://en.wikipedia.org/w/api.php](https://en.wikipedia.org/w/api.php)

The URL takes us to an ugly webpage that contains documentation on all so-called API endpoints exposed by the Wikipedia API. What are they? Endpoints are *commands/requests* that the API understands and that can be used to extract some data from it. They define exactly the interface through which one can communicate with some external system via API. To sum up, an API understood as an interface is:

* a publicly available *place* on the internet (associated with a particular URL)
* a set of endpoints (commands) that define possible interactions with the API.

Ok, so we have seen that the Wikipedia API lives at a particular URL. However, the URL by itself just leads us to documentation describing all the endpoints. So how can we use a particular endpoint to actually do something? Let us inspect the endpoint called [query](https://en.wikipedia.org/w/api.php?action=help&modules=query):

`https://en.wikipedia.org/w/api.php?action=help&modules=query`

Now we see the documentation for the endpoint `query`. It is quite complex, as it effectively defines another API nested within the top-level API. From now on we will work exclusively with this part of the Wikipedia API since this is the one we have to use to extract data from Wikipedia. Note that the URL already has a particular form:

`<URL>?<key-value pair>&<key-value pair> ...`

The part after the `?` sign is crucial here as it defines a so-called query string that can be passed with a URL. A query string does not specify a different location (like a URL does); instead, it attaches some additional data to a request sent to the location specified in the standard `<URL>` part. This additional data is crucial since it allows us to communicate with APIs through the HTTP protocol. Now it is clear that `https://en.wikipedia.org/w/api.php?action=help&modules=query` is still the same address as `https://en.wikipedia.org/w/api.php`, but enhanced with additional data that tells the Wikipedia API to take us to the help page of the module (endpoint) `query`. So let us now try to finally do something useful.
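To make the split between the location and the query string concrete, here is a small sketch that takes the help-page URL apart with Python's standard `urllib.parse` module. The variable names are ours and serve only as an illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://en.wikipedia.org/w/api.php?action=help&modules=query"

parts = urlsplit(url)

# The location: scheme, host and path
location = f"{parts.scheme}://{parts.netloc}{parts.path}"
# The query string: everything after the `?`
query_string = parts.query
# The query string parsed into a dict (every value is a list of strings)
params = parse_qs(query_string)

print(location)
print(query_string)
print(params)
```

Note that `parse_qs` returns a list for every key because, in general, the same key may appear more than once in a query string.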

### Extracting list of Wikiprojects from Wikipedia API

Now from the docs of the `query` endpoint, we select the `projects` [(sub)endpoint](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bprojects). The documentation gives us instructions on how to use the endpoint as well as some usage examples. When we click the link from the first example we see a long list of project names. These are so-called *Wikiprojects* which are registered semi-official groups of editors dedicated to working on a specific topic/theme. They can give us some basic insight into what kinds of topics are of most interest to Wikipedia editors (but do not base any claims solely on this simple information!)

The URL from the first example looks like this:

`api.php?action=query&list=projects`

Again, it has the URL part (some of it omitted) and the query string part, which specifies that we use the `query` endpoint and ask it to list all the projects. This is great! We can look at the list in our browsers. However, the list is too long to work with in this form, so we would like to process it in Python.
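The example URL omits the base address. As a rough sketch, the full URL can be reassembled from the base address and the key-value pairs with the standard `urllib.parse.urlencode` function. The variable names below are our own.

```python
from urllib.parse import urlencode

base_url = "https://en.wikipedia.org/w/api.php"
# The key-value pairs from the example in the documentation
params = {"action": "query", "list": "projects"}

# `urlencode` joins the pairs with `&` and escapes special characters
full_url = base_url + "?" + urlencode(params)
print(full_url)
```

Pasting the printed address into a browser should show the same list of projects as the link in the documentation.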

### Talking to API from Python

Fortunately, Python makes it very easy to build HTTP requests and talk to an API. Utilities for this kind of work can be found in the `requests` package.

In [None]:
import requests
## From now on we can refer to the `requests` module
## by its name (it is saved as a variable!)

How can we use it to get some data from an API? Let us decompose this problem into several steps.

In [None]:
## First define the base API url
URL = "https://en.wikipedia.org/w/api.php"
## Then define the query string parameters we want to pass with our request
## It is often called the 'payload'
payload = {
    'action': 'query',
    'list': 'projects',
    'format': 'json'
}
## The 'requests' package wants us to define the payload as a dict
## since this makes it easy to build GET requests dynamically
## in a program

In [None]:
## Now with the URL and the payload ready we can send a request
## to the Wikipedia API to kindly ask for the list of projects
## It is as simple as this
response = requests.get(URL, params=payload)

In [None]:
## Let's see the results
response

It is a bit underwhelming because apart from the status code we did not get anything. Therefore, we need to examine the object we got. For that, we will use the `dir()` function. In simple terms, it returns the names available on an object (what is inside it). For a module, it would return the names of all the functions and variables defined in the module. In our case, it will tell us what we can get from the response object apart from the status code.
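The output of `dir()` tends to be long because it also lists internal names that start with an underscore. A minimal sketch of how to keep only the public names is shown below. It uses a toy class of our own instead of a real response object, so that it runs without any network access.

```python
class Greeting:
    """A toy class used only to illustrate `dir()`."""
    language = "en"

    def say(self):
        return "hello"

greeting = Greeting()

# Names starting with an underscore are internal, so we filter them out
public_names = [name for name in dir(greeting) if not name.startswith("_")]
print(public_names)
```

The same filter can be applied to any object, including the `response` object from the previous cell.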

In [None]:
dir(response)

In [None]:
## By a time-honored tradition of countless generations of computer
## scientists we call the result of a web request a 'response'
## Now we would like to extract the actual data from it
data = response.json()
## What do we have in the response?
data.keys()

In [None]:
## Probably we want to focus on the 'query' part
data['query'].keys()
## And here are the projects

In [None]:
## Now let us save the projects in a variable
## to save us some typing
projects = data['query']['projects']
## Now we can easily count the projects
len(projects)

In [None]:
# and see them if we like ...
# ... but maybe not all of them at once
# maybe just the first ten
projects[:10]

In [None]:
# and last ten
projects[-10:]

# Working with Wikipedia API: part II

Above we learnt how to extract names of the projects from the Wikipedia API. This time we will try to do something a little bit more involved.

1. We are going to take a random sample of 10 Wikipedia articles. There is an endpoint in Wikipedia for doing just that. However, note that, because of the sampling, each of you will get different results.
2. The first step will give us only the id numbers and the titles of the pages. We will use them to extract the full text of the pages via a different endpoint of the Wikipedia API.
3. We will compute the word length distributions of the pages.

## Step 1.

First, we have to sample 10 random Wikipedia articles. This should not be too hard since we have a special method for this, so it should be just one simple API call. The method we are looking for is `list=random` and it is defined within the `query` endpoint (`action=query`). We can read more about it [here](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brandom). **HINT.** Remember that you can view the results of your queries directly in the browser.

A quick read of the doc page and we can decide that we need only two query parameters:

1. `rnnamespace=0` (which limits the results to the namespace `0`, which is the part of Wikipedia where actual encyclopedic articles live).
2. `rnlimit=10` (because we want to extract only 10 random articles).

In [None]:
## The above considerations lead us to the following payload
## we will want to attach to our query URL.

payload = {
    'action': 'query',  ## since we want to use the `query` endpoint
    'list': 'random',   ## because we want to use the `random` method
    ## But we also need to add arguments for the `random` method
    'rnnamespace': 0,
    'rnlimit': 10,
    'format': 'json'    ## we need to add it so the data can be read by Python
}

In [None]:
## Now we are ready to make the GET request
## But first we need to import the requests package
import requests as rq
## And define our base URL
BASE_URL = "https://en.wikipedia.org/w/api.php"

response = rq.get(BASE_URL, params=payload)
response

In [None]:
## We see that our response is OK (HTTP response code 200 means 'OK')
## So we can extract the data from the response object with `.json()`
## method defined on it.
data = response.json()
data
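A short aside on status codes: the code `200` mentioned in the cell above is one of many standard HTTP status codes. As a small sketch, Python's standard `http.HTTPStatus` enum can translate a numeric code into its official phrase. The helper `is_success` is our own illustrative function and not part of any library.

```python
from http import HTTPStatus

# Translate a few common numeric codes into their standard phrases
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

def is_success(code):
    """Return True for 2xx codes, which signal a successful request."""
    return 200 <= code < 300

print(is_success(200))
print(is_success(404))
```

With the `requests` package the numeric code of a response is available as `response.status_code`, and `response.raise_for_status()` raises an error when the request failed.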

Ok, so now we have the titles of the articles and their unique ids. As you can probably imagine, we are only interested in the unique ids at this point, because we will need to pass them to a different endpoint to extract the texts of the articles. But how exactly are we going to access them? It is a dictionary, so it should not be a major problem to access a single value, right?

In [None]:
data['query']['random'][0]['id']

So what exactly happened there? First, we got a mapping with three keys: `batchcomplete`, `continue`, and `query`. We were only interested in the `query` field. Therefore, we typed as follows:
```python
data['query']
```
However, it was again a mapping inside a mapping with only one field: `random`. Therefore:
```python
data['query']['random']
```
Inside this mapping, we had a ten-element list. So to access the first element we typed:
```python
data['query']['random'][0]
```
Every element of that list was also a mapping again with three keys: `id`, `ns`, and `title`. We were only interested in the `id` field. So, we just typed:
```python
data['query']['random'][0]['id']
```
We could access all the `ids` manually in the same way, but it is easier to use a `for` loop. We loop over the list stored under `random` because every element of that list has the same fields.

```python
for page in data['query']['random']:
    print(page['id'])
```
So a loop like this would work fine if we only wanted to print the `ids`. We could even modify it a bit to store the `ids` in a list (which is what we want to do), for example:
```python
list_ids = []
for page in data['query']['random']:
    list_ids.append(page['id'])
```
So first, we would create a list outside of the loop and then use the `append` method to add each value of `page['id']` as the last element of the list. It is doable. But Python offers a smarter way of saving the results of a loop in a list. It is called **list comprehension** and in this particular example looks like this:
```python
page_ids = [ page['id'] for page in data['query']['random'] ]
```
It does exactly the same as the previous example but in a neater way. The difference is that first you write what is happening in the loop `page['id']` and afterward you define the loop `for page in data['query']['random']`.
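The equivalence of the two approaches can be checked on a small example. The sketch below uses made-up data shaped like the response of `list=random` (the ids and titles are invented), so it runs without calling the API. It also shows that a comprehension can filter elements with an `if` clause.

```python
# Made-up data shaped like the response of `list=random`
sample = {
    "query": {
        "random": [
            {"id": 101, "ns": 0, "title": "First page"},
            {"id": 202, "ns": 0, "title": "Second page"},
            {"id": 303, "ns": 0, "title": "Third page"},
        ]
    }
}

# Version with an explicit loop
ids_loop = []
for page in sample["query"]["random"]:
    ids_loop.append(page["id"])

# Version with a list comprehension
ids_comprehension = [page["id"] for page in sample["query"]["random"]]

# A comprehension can also filter: keep only ids above 150
big_ids = [page["id"] for page in sample["query"]["random"] if page["id"] > 150]

print(ids_loop == ids_comprehension)
print(big_ids)
```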

In [None]:
## From the obtained relatively simple dictionary
## We can extract the list of page ids as follows:
page_ids = [ page['id'] for page in data['query']['random'] ]
page_ids

## Step 2.

Now we have a nice list of page ids, so we can use it to extract the content of the pages using a different method defined on the `query` endpoint. We will use the so-called _cirrus doc_ method. _Cirrus_ is a system for organizing and storing text documents used by Wikipedia. Its details do not really matter to us. What matters is that a method like this exists and that it expects input in a particular format. As we said, _cirrus doc_ is a method on the `query` endpoint and we can call it with `prop=cirrusdoc`. However, to obtain any data we also have to pass a list of page ids in a proper format. Remember that every piece of data we provide through URL parameters (the query string) is always treated as a string. Because of this, every API has to adopt some convention for defining lists of values. The Wikipedia API uses `|` as the separator, so it uses the following convention:

* `<item 1>|<item 2>| ... |<item n>`
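As a side note on how such a list travels inside a URL, the sketch below joins a few made-up ids with `|`, encodes them into a query string with the standard `urllib.parse` module and decodes them back. The ids and variable names are invented for illustration.

```python
from urllib.parse import urlencode, parse_qs

page_ids = [11, 22, 33]  # made-up ids

# Join the ids into one string using the `|` separator
joined = "|".join(str(p) for p in page_ids)

# In a URL the `|` character is escaped as `%7C`
encoded = urlencode({"pageids": joined})
print(encoded)

# Decoding gives back strings, not integers
decoded = parse_qs(encoded)["pageids"][0].split("|")
print(decoded)
```

The round trip returns strings rather than numbers, which illustrates that a query string carries only text.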

In [None]:
## Thus we have to join our page ids to form a single string
page_ids_string = "|".join(str(p) for p in page_ids) ## this loop is written similarly to the previous list comprehension
page_ids_string

In [None]:
## Now, the above considerations already enforce a particular form of a payload
## that we will have to attach to the request URL.

payload = {
    'action': 'query',
    'prop': 'cirrusdoc',
    'pageids': page_ids_string,
    'format': 'json'
}
payload

In [None]:
## And now we are ready to make a request
response = rq.get(BASE_URL, params=payload)
response

In [None]:
## And parse the response to a json dictionary
data = response.json()
## We can look at the top-level keys of the dict
data.keys()

In [None]:
## We should be interested in the query field, since judging by the name
## it should contain the results of our query
data['query'].keys()

In [None]:
## Great, now we have only one key on the lower level, so it has to store the data
pages = data['query']['pages']
pages.keys()

In [None]:
## We see that the pages dictionary stores all the pages we requested identified with their ids
## Let us look at the inner keys of the sub-dict with the data of a single page
key = list(pages)[0]
pages[key].keys()

In [None]:
## It seems that the main data is stored under the `cirrusdoc` key.
type(pages[key]['cirrusdoc'])

In [None]:
## Hmm, the cirrusdoc property is a list.
## So we have to extract data from it.
pages[key]['cirrusdoc'][0].keys()

In [None]:
## Okay, finally we see the source key, that must store the actual article content
pages[key]['cirrusdoc'][0]['source'].keys()

In [None]:
## Bingo!! We see the `text` field. It contains the article text.
## This is exactly what we want to extract.
pages[key]['cirrusdoc'][0]['source']['text']

We examined the anatomy of the response of the _cirrus doc_ method in the Wikipedia API. So now we understand it and we can use this new knowledge to automatically extract the content of all the articles.
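The chain of keys we followed can be wrapped in a small helper. The sketch below is one possible way to do it, under the assumption that some pages might come back without the `cirrusdoc` or `text` fields, in which case they are skipped instead of raising an error. The function name `extract_texts` and the sample data are our own inventions.

```python
def extract_texts(pages):
    """Map page ids to article text, skipping pages without text."""
    texts = {}
    for page_id, page in pages.items():
        # `get` returns a default instead of raising KeyError
        docs = page.get("cirrusdoc", [])
        if not docs:
            continue
        text = docs[0].get("source", {}).get("text")
        if text is not None:
            texts[page_id] = text
    return texts

# Made-up data shaped like `data['query']['pages']`
sample_pages = {
    "101": {"pageid": 101, "title": "First page",
            "cirrusdoc": [{"source": {"text": "Some article text."}}]},
    "202": {"pageid": 202, "title": "Second page"},  # no `cirrusdoc` key
}

print(extract_texts(sample_pages))
```

Returning a dictionary keeps the link between each text and its page id, which a plain list would lose.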

In [None]:
## Let's use a list comprehension to get the content of all pages
articles = [ p['cirrusdoc'][0]['source']['text'] for p in pages.values() ]
articles

In [None]:
## NOTE THAT THE PREVIOUS EXPRESSION
## DOES THE SAME AS THE FOLLOWING MORE VERBOSE EXPRESSION
articles = []
for page_id in pages.keys():
  page = pages[page_id]
  cirrus = page['cirrusdoc']
  page_data = cirrus[0]
  source = page_data['source']
  text = source['text']
  articles.append(text)

articles

Great!!! We finally extracted the data we want. Now we can focus on computing the distribution of word lengths in this data. So let's try to write a function that will return a dictionary with the distribution of word lengths. It will be somewhat similar to the function we wrote for the [N6](https://github.com/MikoBie/ppss/blob/main/notebooks/N6.ipynb) last year. It looked like this:

```python
def dict_count(L1):
	"""
	It returns a dictionary with the frequencies of elements of L1.

	Args:
		L1 (list): a list of values

	Returns:
		dict: dictionary with frequencies of elements of L1
	"""
	output = {}
	for item in L1:
		if item in output:
			output[item] += 1
		else:
			output[item] = 1
	return output
```
And it more or less worked as follows. If we input the following list:
```python
input_list = [1, 2, 3, 3, 2, 4]
```
It returned the following dictionary:
```python
output_dict = {1 : 1, 2 : 2, 3 : 2, 4 : 1}
```
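As an aside, the standard library offers `collections.Counter`, which counts elements in the same way. The short sketch below reproduces the example above with it.

```python
from collections import Counter

input_list = [1, 2, 3, 3, 2, 4]

# Counter is a dict subclass that counts how often each element occurs
counts = Counter(input_list)
print(dict(counts))
```

Writing the loop by hand, as in `dict_count`, is still a useful exercise because it shows what happens under the hood.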
Now we need to figure out how to get a list of word lengths from a list of strings, right? Any ideas?

In [None]:
## YOUR CODE.
def counter(l):
    """
    It takes a list of strings and returns a dictionary with the distribution of word
    lengths. Keys are the lengths and values are the frequencies.
    Args:
        l (list): a list of strings

    Returns:
        dict : a dictionary with the distribution of word lengths.
    """
    ## BODY OF THE FUNCTION
    
dist =  counter(articles)           
        

Let's see the results because they are very characteristic. They follow [Zipf's distribution](https://en.wikipedia.org/wiki/Zipf%27s_law#Related_laws). In simple terms, the shortest words tend to occur the most frequently, and the frequency of a word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. Let's see the results on a plot. We are not going to go too deep into plotting in _Python_ because there is no need for that. You can always plot something in _R_. In most cases it will be enough to extract some data in _Python_ and plot it later in _R_. That is because the syntax for plotting in _Python_ is not that straightforward at first; it is a bit similar to base plotting in _R_. You have to imagine a blank canvas on which you put layer after layer.
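To get a feeling for what an inverse relation between rank and frequency looks like, the sketch below computes the frequencies predicted by an idealized Zipf's law, $f(r) = f(1) / r$. The function name and the numbers are made up for illustration; real data will only approximately follow this pattern.

```python
def zipf_frequencies(top_frequency, n_ranks):
    """Frequencies predicted by an idealized Zipf's law: f(r) = f(1) / r."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]

# If the most frequent item occurred 1200 times ...
expected = zipf_frequencies(1200, 4)
# ... the next ones would occur about 600, 400 and 300 times
print(expected)
```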

In [None]:
## Let's import the matplotlib module for plotting
import matplotlib.pyplot as plt

## Sort the frequencies in descending order, so that the position
## of each frequency in the list (counted from 1) is its rank
## in the frequency table
frequencies = sorted(dist.values(), reverse=True)
ranks = range(1, len(frequencies) + 1)
## Create a plot with ranks on the x axis and frequencies on the y axis
plt.plot(ranks, frequencies)
## Add the name of the x axis
plt.xlabel('Rank')
## Add the name of the y axis
plt.ylabel('Frequency')
## Show the plot
plt.show()

## Exercise

Read about the `pageviews` method (`prop=pageviews`) in the `query` endpoint ([docpage](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageviews)). Use this method to extract page views data for the last 60 days for the pages from the previous exercise (if you want, you can sample 10 new pages with the `list=random` method). The results will be broken down by single days, so you have to aggregate (sum) them to get the total page views count for the entire period of 60 days. Remember that to select pages by page ids you pass `pageids=<id 1>|<id 2>|...|<id n>`. We did a very similar thing when we extracted article content through the `cirrusdoc` method in the Wikipedia API in the previous part of this notebook. Your final output should be a `dict` object that maps page ids to pageviews (total number of pageviews over 60 days). It should look something like this:

```python
results = {
    # page_id: pageviews
    153253: 10204,
    423423: 101,
    11012:  12,
    42435:  546,
    # and so on
}
```

If you want you can sample 10 pages yourself. Otherwise, you may use the following list of page ids that we prepared for you. Sampling pages yourself will give you extra credit (but it is possible to get maximum points without it as well).

In [None]:
## Import module requests
import requests

## Some page ids
page_ids = [
    19969580,
    39982842,
    25699035,
    52642931,
    53055349,
    24133565,
    1164662,
    40656459,
    12533026,
    47110862
]

## API URL
BASE_URL = 'https://en.wikipedia.org/w/api.php'

In [None]:
## YOUR CODE
payload = {}
