# Downloading Article Data from the Guardian API 

In this tutorial I will demonstrate how Python can be used to query the Guardian's API. 

### Accessing the API 

#### Registering for a key 

The first step to accessing the Guardian's API is to [register as a developer](https://open-platform.theguardian.com/access/).

Once registered, you will receive an email containing an API key, which has the form:

>a12b345c - 123a - 1234 - a1b2 - 123abcd45ef6


#### Installing requests

We will use `requests`, an HTTP library for Python, to access and query the API. We will also save the data in JSON format using Python's `json` module. The `json` module is part of the standard library and does not need to be installed, so only `requests` has to be installed using `pip`:

```
pip install requests
```


#### Using requests to query the API 

The first step to accessing the Guardian API is to decide which endpoint is most appropriate to use. This will depend on your aims and what you intend to do with the data. 

The available endpoints are:
- Content
- Tags
- Sections
- Editions
- Single Item

More information about each endpoint is available from the Guardian's [documentation](https://open-platform.theguardian.com/documentation/).
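As a rough orientation, most of the endpoints listed above correspond to a fixed path under the same base address. The sketch below is only an illustration: the paths follow the documentation linked above, and `BASE_URL`, `ENDPOINT_PATHS` and `endpoint_url` are hypothetical names introduced here.

```python
# Base address of the Guardian content API, as used in this tutorial
BASE_URL = 'http://content.guardianapis.com'

# Path for each endpoint with a fixed address (the content endpoint lives at /search)
ENDPOINT_PATHS = {
    'content': 'search',
    'tags': 'tags',
    'sections': 'sections',
    'editions': 'editions',
}

def endpoint_url(name):
    """Return the full URL for a named endpoint (hypothetical helper)."""
    return BASE_URL + '/' + ENDPOINT_PATHS[name]

print(endpoint_url('content'))
```

The single item endpoint is left out because it is addressed by the path of the individual item rather than by a fixed name.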

In the following I would like to download **all** articles published on 1st January 2018. These articles can have any tag and come from any section or edition, so I will use the content endpoint.

To get started, import `requests` and `json` into Python and define your endpoint and API key as strings:

```python
import requests
import json

# URL of the content (search) endpoint
endpoint = 'http://content.guardianapis.com/search'
# Replace this placeholder with your own API key
apikey = 'my-api-key'
```

Here, the string ``'my-api-key'`` should be replaced with your API key.
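If you would rather not type the key into a script that may be shared, one option is to read it from an environment variable. This is a minimal sketch, not part of the Guardian's instructions: the variable name `GUARDIAN_API_KEY` is an assumption made here for illustration.

```python
import os

# GUARDIAN_API_KEY is an assumed variable name; if it is not set,
# fall back to the placeholder string used in this tutorial.
apikey = os.environ.get('GUARDIAN_API_KEY', 'my-api-key')
```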

Now we need to define the parameters we want to query the API with. A full list of parameters that can be used with the search endpoint is available [here](https://open-platform.theguardian.com/documentation/search).

Parameters can be passed to the API by creating a dictionary of the parameters and their values: 

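```python
parameters = {
    'from-date': '2018-01-01',  # dates use the yyyy-mm-dd format
    'to-date': '2018-01-02',
    'show-fields': 'body',      # include the full text of each article
    'page-size': 200,           # number of results per page
    'api-key': apikey
}
```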

As I want all articles published on 1st January 2018, I have set `from-date` to `2018-01-01` (note the yyyy-mm-dd format) and `to-date` to `2018-01-02`.

For the `show-fields` parameter, I would like the body only; this will return the full text of the article.
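To see what such a dictionary amounts to, it can help to build the query URL by hand. The sketch below uses `urlencode` from the standard library purely for illustration; `requests` performs an equivalent encoding itself when it is given the dictionary, so this step is not needed in practice. The name `query_url` is introduced here.

```python
from urllib.parse import urlencode

endpoint = 'http://content.guardianapis.com/search'
parameters = {
    'from-date': '2018-01-01',
    'to-date': '2018-01-02',
    'show-fields': 'body',
    'page-size': 200,
    'api-key': 'my-api-key',  # placeholder key
}

# Join the endpoint and the encoded key=value pairs with '?'
query_url = endpoint + '?' + urlencode(parameters)
print(query_url)
```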

As our endpoint and our parameters are now defined, we can use `requests` to query the API.

In the above dictionary you will see that `page-size` is set to 200. This means that at most 200 results are returned per page, so if the total number of results is greater than 200 we need to loop through all of the pages to download all of the data.

We can do this by defining two variables: `current_page`, the page that is currently being requested, and `pages`, the total number of pages returned by the query. We can then loop through the pages by incrementing `current_page` until it exceeds `pages`.

As the data from each page is downloaded it needs to be both stored and saved. To store the data we define a list, `data`. To save the data we use `json` to write it to a file whose name is held in the variable `fname`.

```python
data = []  # accumulates the results from every page

current_page = 1
pages = 1  # placeholder; updated to the true page count after the first request

fname = 'guardian_data'

while current_page <= pages:

    parameters['page'] = current_page
    r = requests.get(endpoint, parameters).json()
    data.extend(r['response']['results'])
    pages = r['response']['pages']
    current_page += 1

    with open(fname, 'w') as f:
        json.dump(data, f)
```
    

You will notice that initially, the values of `current_page` and `pages` are both 1. This ensures that one iteration of the while loop is performed. The value of `pages` will be updated to its true value during this first loop.
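The way `pages` starts at 1 and is corrected during the first iteration can be tried out without touching the network. The sketch below replaces the API with a hypothetical `fake_get` function and made-up results; only the shape of the response (`results` and `pages` inside `response`) mirrors what the loop above relies on.

```python
# Stand-in for the API: three fake pages of results (illustrative data only)
FAKE_PAGES = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e']}

def fake_get(page):
    """Mimic the shape of a response: the page's results plus the total page count."""
    return {'response': {'results': FAKE_PAGES[page], 'pages': len(FAKE_PAGES)}}

data = []
current_page = 1
pages = 1  # placeholder until the first response arrives

while current_page <= pages:
    r = fake_get(current_page)
    data.extend(r['response']['results'])
    pages = r['response']['pages']  # true value is known after the first request
    current_page += 1

print(pages, data)
```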

Let's look at the while loop. The first line:

```python
parameters['page'] = current_page
```
sets the parameter `page` in our parameter dictionary to the value of `current_page`.

The second line:

```python
r = requests.get(endpoint,parameters).json()
```
uses the `requests` library to send a GET request to the Guardian's API, located at `endpoint`, passing the parameters defined in our dictionary `parameters`. Calling `.json()` on the response parses the JSON returned by the API into a Python dictionary, `r`.

The third line:
```python
data.extend(r['response']['results']) 
```
adds the results from the current page to our list `data`.

The next two lines:
```python
pages = r['response']['pages']
current_page += 1
```
update the value of `pages` to the number of pages returned by the query and increase the value of `current_page` by one.

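The final two lines:
```python
with open(fname, 'w') as f:
    json.dump(data, f)
```
open the file named by `fname` for writing and save our list `data` to it in JSON format. Because the file is opened in write mode on every iteration, it is overwritten each time with all of the results collected so far.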

Provided the value of `current_page` is less than or equal to `pages`, these steps will be repeated, ensuring that all data returned by the query is saved to our file.
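The loop assumes that every request succeeds. If you want a basic safeguard, one possibility is to check each parsed response before using it. The sketch below is an assumption-laden illustration: `extract_results` is a hypothetical helper, and it assumes that a successful response carries a `status` field equal to `'ok'` alongside `results`.

```python
def extract_results(payload):
    """Return the list of results, or raise if the payload looks wrong.

    Assumes the shape {'response': {'status': ..., 'results': [...]}}.
    """
    response = payload.get('response', {})
    if response.get('status') != 'ok':
        raise ValueError('Unexpected API response: {!r}'.format(payload))
    return response.get('results', [])

# Made-up payloads for illustration only
good = {'response': {'status': 'ok', 'pages': 1,
                     'results': [{'webUrl': 'https://example.com/article'}]}}
bad = {'message': 'something went wrong'}

print(len(extract_results(good)))
```

Inside the loop, `data.extend(extract_results(r))` could then take the place of the direct dictionary lookup.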

### All together 

```python
import requests
import json

endpoint = 'http://content.guardianapis.com/search'
apikey = 'my-api-key'  # replace with your own API key

parameters = {
    'from-date': '2018-01-01',
    'to-date': '2018-01-02',
    'show-fields': 'body',
    'page-size': 200,
    'api-key': apikey
}

data = []  # accumulates the results from every page

current_page = 1
pages = 1  # placeholder; updated to the true page count after the first request

fname = 'guardian_data'

while current_page <= pages:

    # Request the current page and parse the JSON response
    parameters['page'] = current_page
    r = requests.get(endpoint, parameters).json()
    data.extend(r['response']['results'])
    pages = r['response']['pages']
    current_page += 1

    # Save everything collected so far
    with open(fname, 'w') as f:
        json.dump(data, f)
```
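To pick the data up again in a later session, the saved file can be read back with `json.load`. The following is a small round-trip sketch using a made-up record and a temporary file, so that it runs without the API; in practice you would open the file named by `fname` instead.

```python
import json
import os
import tempfile

# A made-up record with the same shape as one entry of `data`
sample = [{'webUrl': 'https://example.com/article',
           'fields': {'body': '<p>Hello</p>'}}]

# Write to a temporary location so nothing in the working directory is touched
path = os.path.join(tempfile.mkdtemp(), 'guardian_data')
with open(path, 'w') as f:
    json.dump(sample, f)

# Read the file back into a Python list of dictionaries
with open(path) as f:
    loaded = json.load(f)

print(loaded == sample)
```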

### Exploring the data 

The variable `data` is now a Python list of dictionaries, with one dictionary per article.

Let's look at the first entry:

```python
data[0]
```

Displaying the keys using
```python
data[0].keys()
```
lists the fields available for each article, which include `webUrl` and `fields`.

So entering:
```python
data[0]['webUrl']
```
returns the web address of the first article in `data`. Likewise, entering:
```python
data[0]['fields']['body']
```
will return the full text of the article.
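The body text may contain HTML markup such as paragraph tags. If you need plain text, one option is to strip the tags with the standard library's `html.parser`. This is a minimal sketch on a made-up string: `TextExtractor` and `strip_tags` are hypothetical names, and a dedicated HTML library may handle real articles more gracefully.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, text):
        self.chunks.append(text)

def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

# Made-up example standing in for data[0]['fields']['body']
sample_body = '<p>Happy <strong>New Year</strong>!</p>'
print(strip_tags(sample_body))
```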

### Finally...

If you are interested in obtaining comments from the articles you have in `data`, save the URLs to a text file, installing `numpy` first if you need to (`pip install numpy`):
```python
import numpy as np
urls=[record['webUrl'] for record in data]
np.savetxt(fname+'urls.txt',urls,'%s')
```
and take a look at the tutorial [Scraping comments from the Guardian using BeautifulSoup](Scraping_comments_from_the_Guardian_using_BeautifulSoup.ipynb).
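If you would prefer not to depend on `numpy` for this step, the URLs can also be written with plain Python, one per line. The sketch below uses made-up records and a temporary file so that it runs on its own; the names `sample`, `path` and `read_back` are introduced here for illustration.

```python
import os
import tempfile

# Made-up records standing in for `data`
sample = [{'webUrl': 'https://example.com/a'},
          {'webUrl': 'https://example.com/b'}]
urls = [record['webUrl'] for record in sample]

# Write one URL per line
path = os.path.join(tempfile.mkdtemp(), 'guardian_dataurls.txt')
with open(path, 'w') as f:
    f.write('\n'.join(urls) + '\n')

# Read the file back to check its contents
with open(path) as f:
    read_back = f.read().splitlines()

print(read_back)
```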