In [None]:
from cred import GUARDIAN_KEY
import requests
import pandas as pd

## Understanding APIs
When we visit websites online, we provide an address. The address specifies what website data we want to be sent back from the server for us to view in our browsers.

A simple way of thinking about an API is as a website that sends us data, where the data varies depending on the address we provide. The address has to be a little more complicated than simply www.google.com, but as long as we build it correctly, we will get what we ask for.
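
To make the idea of an "address with options" concrete, here is a minimal sketch that takes an API-style address apart using Python's standard library. The address is illustrative and `YOUR_KEY` is a placeholder, not a real credential.

```python
from urllib.parse import urlsplit, parse_qs

# An illustrative address: the key value is a placeholder, not a real credential.
address = 'https://content.guardianapis.com/search?q=crime&api-key=YOUR_KEY'

parts = urlsplit(address)
options = parse_qs(parts.query)

print(parts.netloc)  # the server we are contacting
print(parts.path)    # which part of the API we want
print(options)       # the options we are sending along
```

Everything after the `?` is a set of name and value pairs, and changing those pairs is how we change what the API sends back.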

### Guardian API - Interactive Exploration
The Guardian API provides a helpful tool for exploring how the address is built and what results we can get back. It is also useful for showing what options we have when requesting data.

[Explore the Guardian API](https://open-platform.theguardian.com/explore/)

## Communicating with the API in Python
So we can see how the web address is built using the API explorer, but how do we build that address using Python, and communicate with the API server so that it sends us back the data?

### A Very Basic Example
To begin, we will make the simplest query we can: contacting the API with our credential key to get it to send back *something*.

First we define the endpoint, which is essentially the root of the address we are going to start with. The Guardian API has a [few different endpoints](https://open-platform.theguardian.com/documentation/) but for our purposes, the *content* endpoint is the one we need.

In [None]:
# Use https so the API key is not sent over the network in plain text.
API_ENDPOINT = 'https://content.guardianapis.com/search'

In [None]:
# We first create a dictionary that has a parameter name of api-key, and then our key as its value.
parameters = {'api-key':GUARDIAN_KEY}

Now we're going to communicate with the Guardian API using `requests`. We pass in the address we want to contact, `API_ENDPOINT`, together with a dictionary of parameters. `requests` uses that dictionary to build the rest of the address before making its request for data.
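
As a rough illustration of what `requests` does with that dictionary, the sketch below builds the same kind of address by hand with the standard library's `urlencode`. The key shown is a placeholder, and in practice `requests` handles this step for us.

```python
from urllib.parse import urlencode

endpoint = 'https://content.guardianapis.com/search'
example_parameters = {'api-key': 'YOUR_KEY'}  # placeholder key for illustration

# Join the endpoint and the encoded parameters with a question mark.
full_address = endpoint + '?' + urlencode(example_parameters)
print(full_address)
```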

In [None]:
response = requests.get(API_ENDPOINT, params=parameters)

`requests` has now communicated with the server, and whatever the server sent back has been packaged up in a special `Response` object.

In [None]:
# We can see the type of object
type(response)

In [None]:
# and if we look at the object itself it doesn't tell us much.
response

In [None]:
# One useful check is to see how requests built the url for us...

response.url

Finally, we can see the data that was sent back by asking the response object to parse it as JSON.
Once parsed, JSON is essentially a set of nested dictionaries and lists.
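
The small example below uses a hand-written JSON string, shaped loosely like an API reply, to show how nested dictionaries and lists are navigated once parsed. The values are made up for illustration and are not real Guardian data.

```python
import json

# A made-up reply, for illustration only.
raw_text = '''
{"response": {"status": "ok",
              "total": 2,
              "results": [{"id": "example/one", "webTitle": "First headline"},
                          {"id": "example/two", "webTitle": "Second headline"}]}}
'''

data = json.loads(raw_text)   # text -> nested dictionaries and lists
inner = data['response']      # step into the top-level key
titles = [item['webTitle'] for item in inner['results']]
print(titles)
```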

In [None]:
response.json()

In [None]:
# The top level dictionary just has one key called 'response' which contains all the other information.
guardian_data = response.json()['response']
guardian_data

In [None]:
# The dictionary under response is what matters and has a few keys with associated values..
guardian_data.keys()

While the other keys hold useful information for later, for now `results` is the key that contains the news results we want.

In [None]:
guardian_data['results']

In [None]:
# Because the results are a list of dictionaries, pandas can restructure them into a table

results = pd.DataFrame(guardian_data['results'])
results

In [None]:
# To summarise the process
parameters = {'api-key':GUARDIAN_KEY}
response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])

results

## Customising your request with parameters
To customise our query we simply need to add to or adjust the parameters we pass to our request.

### Query
The search query is the primary way to filter our results.

In [None]:
parameters = {'api-key':GUARDIAN_KEY,
              'q':'crime'}

response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])

results

In [None]:
response.url

Queries can be more than one word. The Guardian API documentation explains a number of ways you might adjust your query.
- 'Crime AND Prison' - Search for articles where both the terms 'crime' and 'prison' are used.
- 'Crime OR Prison' - Search for articles where either 'crime' or 'prison' is used.
- '"Criminal justice"' - Using quote marks to search for a phrase.
- 'debate AND NOT immigration' - Search for articles that use the term debate, but not the term immigration.

See the [Guardian API documentation](https://open-platform.theguardian.com/documentation/) for more options.
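
To keep longer queries readable, one option is to build them from small helper functions. The sketch below is only an illustration: `phrase` and `all_of` are hypothetical helper names, not part of `requests` or the Guardian API, and `urlencode` is used to preview roughly how the query looks once it is encoded into an address.

```python
from urllib.parse import urlencode

def phrase(text):
    """Wrap text in double quotes so it is searched as an exact phrase."""
    return f'"{text}"'

def all_of(*terms):
    """Join terms with AND so every term must appear."""
    return ' AND '.join(terms)

query = all_of(phrase('criminal justice'), 'prison')
encoded = urlencode({'q': query})

print(query)    # "criminal justice" AND prison
print(encoded)  # the query as it would appear inside an address
```

The resulting `query` string can be used as the value of `'q'` in the parameters dictionary.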

In [None]:
# Phrases need double quotes inside the string: the outer single quotes belong to Python,
# while the inner double quotes are sent to the API as part of the query.

parameters = {'api-key':GUARDIAN_KEY,
              'q':'"human rights"'}

response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])
results

In [None]:
response.url

### Additional Filters
Several other filters are useful for narrowing down your search.

See the [Guardian API documentation - Filters](https://open-platform.theguardian.com/documentation/search) for more options.


In [None]:


parameters = {'api-key':GUARDIAN_KEY,
              'q':'"human rights"',
              'page-size':10, # controls how many results you get per request - max 200
              'production-office':'uk', # filter based on where the article was produced
              'lang':'en', # language
              'from-date':'2023-01-20', # only content published on or after this date
              'to-date':'2023-01-30', # only content published on or before this date
              'order-by':'oldest' # options - oldest, newest, relevance
              }

response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])
results

In [None]:
response.url

#### Should I use all of these?
No, these are options rather than requirements, and should be used to refine your data request depending on the question you are studying. However, for most projects about news reporting you will probably want to at least specify that the type of content should be an article.
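
Because the filters are optional, it can be convenient to build the parameters dictionary with a small function that leaves out anything we have not set. This is a sketch only: `build_parameters` is a hypothetical helper, and `YOUR_KEY` is a placeholder.

```python
def build_parameters(api_key, query, **filters):
    """Return a parameters dictionary, leaving out any filter set to None."""
    parameters = {'api-key': api_key, 'q': query}
    for name, value in filters.items():
        if value is not None:
            # Python keyword arguments cannot contain hyphens, so we accept
            # underscores and convert them: from_date -> from-date.
            parameters[name.replace('_', '-')] = value
    return parameters

example = build_parameters('YOUR_KEY', 'crime', from_date='2023-01-20', lang=None)
print(example)
```

Here `lang` is dropped because it was set to `None`, so only the filters we actually chose are sent to the API.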

#### Exercise
Examine the documentation for the Search section of the Guardian API. Can you find the filter that returns only results from the `"society"` section of the Guardian? Add the filter to the parameters dictionary below and run the cell to see what gets returned.

[Guardian API Documentation - Search](https://open-platform.theguardian.com/documentation/search)

In [None]:
# adjust the parameters dictionary
parameters = {'api-key':GUARDIAN_KEY,
              'q':'crime',
              'page-size':10,
              'production-office':'uk',
              'lang':'en',
              'section':'society',
              }

response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])
results

### Getting Additional Content
By default the API provides us with a limited range of information. Dates, titles and section categories can be useful data to analyse, but we may want additional content such as:
- Keyword tags - Human provided classification of articles, useful for a range of analysis techniques including network analysis.
- Content body - The actual article text, useful for text analysis.
- Article word counts

Again, the procedure is the same, we just need to adjust our parameters.
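
One thing to anticipate: extra fields and tags are likely to arrive nested inside each result (a `fields` dictionary and a `tags` list) rather than as flat columns. Treat that shape as an assumption to check against your own response. The sketch below uses a made-up record to show one way of flattening such a structure before handing it to pandas.

```python
# A made-up result with nested 'fields' and 'tags' (an assumed shape,
# for illustration only, not real Guardian data).
sample_results = [
    {'id': 'example/one',
     'webTitle': 'First headline',
     'fields': {'byline': 'A. Writer', 'wordcount': '512'},
     'tags': [{'id': 'society/drugs'}, {'id': 'uk/crime'}]},
]

flat_rows = []
for item in sample_results:
    row = {'id': item['id'], 'webTitle': item['webTitle']}
    row.update(item.get('fields', {}))  # lift nested fields to the top level
    row['tag_ids'] = [tag['id'] for tag in item.get('tags', [])]
    flat_rows.append(row)

print(flat_rows[0])
```

A list like `flat_rows` can be passed to `pd.DataFrame` in the same way as the raw results.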

In [None]:
parameters = {'api-key':GUARDIAN_KEY,
              'q':'crime',
              'page-size':200,
              'production-office':'uk',
              'lang':'en',
              'section':'news',
              'show-tags':'keyword',
              'show-fields':'body,byline,wordcount', # comma-separated, no spaces
              }

response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])
results

Later we will look at exploring tags, which may lead you to want to focus on a specific tag. If we find a tag or two we want to focus on, we can add them to our query...

In [None]:
parameters = {'api-key':GUARDIAN_KEY,
              'q':'crime',
              'page-size':200,
              'production-office':'uk',
              'lang':'en',
              'section':'news',
              'show-tags':'keyword',
              'show-fields':'body,byline,wordcount', # comma-separated, no spaces
              'tag':'society/drugs'
              }

response = requests.get(API_ENDPOINT, params=parameters)
guardian_data = response.json()['response']
results = pd.DataFrame(guardian_data['results'])
results

### Collecting more than 200 items
The maximum number of items sent back in a single call to the API is 200. This can be quite a large number for some projects, but what if we wanted to get a larger sample so we could...
- Do an exhaustive search of all content on a specific topic
- See trends over time - if the topic is frequently discussed 200 results may only cover a very short period of time.
- See large scale patterns across topics.

In this instance we need to make multiple calls to the API and add each set of results to our locally held data. However, we need to make sure the API always sends us data that we don't already have. This is where we need to work with some of the extra information in our response that isn't the results themselves.

In [None]:
parameters = {'api-key':GUARDIAN_KEY,
              'q':'crime',
              'page-size':200}

response = requests.get(API_ENDPOINT,params=parameters)
guardian_data = response.json()['response']
guardian_data

The key pieces of information here are `total`, `pages` and `currentPage`.
- `total` tells us how many records there are matching our parameters.
- `pages` tells us how many pages of results there are available to us given that there are `page-size` number of results per-page.
- `currentPage` tells us what page of results we've just received.
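
These numbers are related: we would expect `pages` to be `total` divided by `page-size`, rounded up. The figures below are made up, purely to show the arithmetic.

```python
import math

total = 24531      # made-up number of matching records
page_size = 200    # the page-size we asked for

pages = math.ceil(total / page_size)
print(pages)  # 123
```

With one request per second, that many pages would take a little over two minutes to collect, which is worth knowing before starting a large download.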

We can ask the API for a specific page of results using the `page` parameter.

In [None]:
parameters = {'api-key':GUARDIAN_KEY,
              'q':'crime',
              'page-size':200,
              'page':2}

response = requests.get(API_ENDPOINT,params=parameters)
guardian_data = response.json()['response']
guardian_data

The most direct way to gather multiple pages of data then is to...
- Make a call to the API
- Store the results in a list.
- Increment the value of `page` by 1
- Repeat...
- Eventually hit a maximum number of pages we set, or run out of data.

Initially you will need to make one request to the API to see how much data could be available to you, and then base your max number of pages etc on that information.
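
The steps above can be sketched as a small function and tried out without contacting the API at all. In this sketch `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real API call by pretending there are three pages of two items each.

```python
def collect_pages(fetch_page, max_pages):
    """Collect results page by page until the data or the limit runs out.

    fetch_page is any function that takes a page number and returns a
    dictionary with 'results' and 'pages' keys.
    """
    collected = []
    current_page = 1
    available_pages = 1  # updated after the first call
    while current_page <= available_pages and current_page <= max_pages:
        reply = fetch_page(current_page)
        collected.extend(reply['results'])
        available_pages = reply['pages']
        current_page += 1
    return collected

def fake_fetch(page):
    """Stand-in for a real API call: pretends there are 3 pages of 2 items."""
    return {'pages': 3, 'results': [f'item-{page}-a', f'item-{page}-b']}

limited_by_data = collect_pages(fake_fetch, max_pages=5)
limited_by_cap = collect_pages(fake_fetch, max_pages=2)
print(len(limited_by_data))  # 6
print(len(limited_by_cap))   # 4
```

The loop stops for whichever reason comes first: the data runs out, or we reach the maximum number of pages we set.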


In [None]:
# Let's just discuss the logic of how we handle the data collection here before we actually implement the real collection
from time import sleep


current_page = 1 # The page number we're requesting from the API. We start with page 1.
available_pages = 1 # We don't necessarily know how many pages the API call will be providing until we make our first call.

failsafe_pages = 5 # However many pages are available, we'll set our absolute limit to 5



# here we use a while loop that runs the code over and over until the expression is false

while (current_page <= available_pages) and (current_page <= failsafe_pages):
    parameters['page'] = current_page

    # We would do our data collection here
    print(parameters)

    # Here we pretend the API told us there were 124 pages available to us.
    available_pages = 124

    # We increment the value of current_page by 1
    current_page += 1

    # sleep stops our script for 1 second - we do this so we don't overload the Guardian's servers
    sleep(1)

In [None]:
from time import sleep

parameters = {'api-key':GUARDIAN_KEY,
              'q': 'crime',
              'page-size':200}

current_page = 1
available_pages = 1

failsafe_pages = 5

all_results = []

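while (current_page <= available_pages) and (current_page <= failsafe_pages):
    parameters['page'] = current_page

    response = requests.get(API_ENDPOINT, params=parameters)
    # Stop with an error if the request failed (e.g. a bad key or a rate limit)
    response.raise_for_status()
    guardian_data = response.json()['response']
    results = guardian_data['results']
    # Add this page of results to everything collected so far
    all_results += results

    # The API tells us how many pages exist for this query
    available_pages = guardian_data['pages']
    print(f'Collected page {current_page} of {available_pages}')
    current_page += 1
    # Pause for a second so we don't overload the Guardian's servers
    sleep(1)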

In [None]:
df = pd.DataFrame(all_results)
df
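
When collecting many pages, it is possible for the same article to appear twice, for instance if new articles are published while we are still paging through the results. A cautious extra step, sketched below with made-up records, is to drop repeated items by their `id` before analysis. `drop_duplicates_by_id` is a hypothetical helper name.

```python
def drop_duplicates_by_id(items):
    """Keep the first occurrence of each 'id', preserving order."""
    seen = set()
    unique = []
    for item in items:
        if item['id'] not in seen:
            seen.add(item['id'])
            unique.append(item)
    return unique

sample = [{'id': 'a'}, {'id': 'b'}, {'id': 'a'}]  # made-up records
deduplicated = drop_duplicates_by_id(sample)
print(deduplicated)
```

The same function could be applied to `all_results` before building the DataFrame.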