## Introduction

This notebook uses the Trove API to introduce some core concepts for using Web APIs.
Topics covered include:

- important terms
- using API keys and safely managing secrets
- using the Requests library to make GET requests
- getting the data you want using header and query parameters
- working with JSON data
- where to go next...

Much of the information in this guide is drawn from the [Trove Data Guide](https://tdg.glam-workbench.net/home.html) and the [Trove API technical guide](https://trove.nla.gov.au/about/create-something/using-api/v3/api-technical-guide) - which have in-depth information on all aspects of the Trove API.



## What are APIs for?

Application Programming Interfaces (APIs) are interactive bits of code developers create to allow other programs to use their tools. You could think of them like USB ports, where many different devices can connect, so long as they are built with the USB design standard. This is a bit of an oversimplification, but the idea is that APIs make it easier to share and update code without breaking things that connect to them.

The Trove API is representative of a family of APIs broadly known as web APIs, which tend to follow a similar design pattern. The Trove API works similarly to a website URL in a browser: the URL points to the location of the webpage, the data is retrieved from that location, and the browser displays it. The Trove API follows a similar process, but instead of returning a web page it returns structured data. To see what this looks like in a browser click this link: [https://api.trove.nla.gov.au/v3/newspaper/titles](https://api.trove.nla.gov.au/v3/newspaper/titles).

Web APIs become powerful when you combine them with code to use the structured data in some way. To see an example, run the code block below. Don't worry about what it is doing yet - all the processes will be explained throughout the notebook.

In [None]:
import requests
import json
from IPython.display import HTML, display
import random

# a function to create a clickable thumbnail.
def render_image(source_url, thumbnail_url):
    display(HTML('<a href="{}"><img src="{}"></a>'.format(source_url, thumbnail_url)))

# a function to check that user input is a year
def is_year(string):
    try:
        if not 0 <= int(string) <= 2024:
            print("Year must be between 0 and 2024")
            return False
        return True
    except (TypeError, ValueError):
        print("Input must be an integer.")
        return False
    
def get_trove_images(word, year):
    if is_year(year):
        # create API request
        search_url = "https://api.trove.nla.gov.au/v3/result"
        query = {"category":"image", "q":word, "l-year":f"{year}", "n":"100"}
        encoding = {"accept":"application/json"}
        response = requests.get(search_url, params=query, headers=encoding)
        
        # Load the data as JSON
        data = response.json()
        images = data["category"][0].get("records")
        
        # Let the user know how many results the query has
        total = images["total"]
        print(f"Your query had {total} results.")
        if total == 0:
            print("Run the cell again to retry.")
            return
        
        # Only the records returned by this call (at most 100) can be displayed
        works = images.get("work", [])
        source = None
        
        # Check the returned records in random order for a thumbnail that is not
        # from SLV, keeping its title and date information.
        for work in random.sample(works, len(works)):
            try:
                thumbnail = work["identifier"][0]["value"]
                if thumbnail is not None and "slv" not in thumbnail:
                    title, date = work["title"], work["issued"]
                    source = work["troveUrl"]
                    break
            except (KeyError, IndexError, TypeError):
                # Skip records that are missing any of the expected fields
                continue
        if source is not None:
            render_image(source, thumbnail)
            print(f"{title}, {date}")

Now that we have set up the underlying logic, we can call `get_trove_images()` to search the Trove database. The code will prompt you for a year that will be used to filter the results. It is already set up to search for the word `computer`, but you can change this term to anything that interests you. If something goes wrong it will go back to 2000, as it was a simpler time.

Trove aggregates a lot of data, so some thumbnails won't display correctly. Running the code a few times will eventually find a nice thumbnail that can be rendered. The thumbnails also have a link to the Trove page for the resource, so you can click through to view more information.

In [None]:
print("Enter a year...")
try:
    year = input()
    get_trove_images("computer", year)
except Exception as e:
    print("Whoops, something went wrong... Reverting to y2k...")
    get_trove_images("computer","2000")

## Getting started

#### Important terms:
- API : application programming interface - a way for computers to communicate with each other.
- Web API : an API designed to allow communication to happen over the web.
- Endpoints : the locations of data that can be requested via the API. These generally look like URLs.
- Requests : a structured command sent to the web server that prompts it to return data. 
- Header : metadata about a request or response.
- Parameters : options that change the response. These are specific to each API and could include search terms, categories, or ids.
- Key : a string of characters used to authenticate a connection.

### Using keys and managing secrets

When working with APIs, a key is often required to authenticate the connection. You should think of API keys as a form of password, and as such keys should **never** be stored in published code or documentation as they might be exploited. 

To avoid this, keys are stored in a separate secure location and loaded into the code as variables.

For example, you may use a variable called `APIkey` in the code. Another file will have a line of code where the value is recorded. For example:

        APIkey = "The api key value"
        
Then the code will load the data each time the script is run. How this looks will depend on the virtual environment being used. In a local code repository this could involve using a [.env file](https://python.plainenglish.io/the-essentials-of-env-files-in-python-simplifying-environment-management-securing-your-secrets-14c51c411400) to store the variable name and value, and a [.gitignore file](https://www.w3schools.com/git/git_ignore.asp?remote=github) which excludes the .env file from tracking in the git repository.

Kaggle handles secrets using an Add-on called Secrets. You can read more about it here - [Feature Launch: User Secrets](https://www.kaggle.com/discussions/product-feedback/114053).

This notebook uses a public API that does not require a key, but it is important to remember to carefully store API keys when you begin to work with them.
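
As a minimal sketch of the pattern described in this section, a key can be read from an environment variable at run time instead of being written into the notebook. The variable name `TROVE_API_KEY` and the helper names `load_api_key` and `build_headers` are assumptions made for this illustration. The `X-API-KEY` header name is the one listed for Trove in the Headers section of this notebook.

```python
import os


def load_api_key(variable_name="TROVE_API_KEY"):
    """Return the key stored in an environment variable, or None if it is not set.

    TROVE_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    return os.environ.get(variable_name)


def build_headers(api_key=None):
    """Build request headers, adding the key only when one is available."""
    headers = {"accept": "application/json"}
    if api_key:
        headers["X-API-KEY"] = api_key
    return headers


headers = build_headers(load_api_key())
# Print the header names only, so the key itself never appears in the output
print(sorted(headers))
```

Keeping the lookup in one small function means the rest of the code never needs to know where the key is stored.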

## Requests in Python

### Python Requests library
    
The Python [Requests library](https://requests.readthedocs.io/en/latest/) provides an easy-to-use wrapper to make API requests. It handles authentication, [percent encoding](https://www.w3schools.com/tags/ref_urlencode.ASP), and other steps that are required to make a valid API request.

To use the library, it needs to be imported into the project.


In [None]:
import requests
# Also import the json library for printing JSON
import json

## Base URL and endpoints

Each Web API will have a base URL, which is consistent between different endpoints. The base URL for Trove is [https://api.trove.nla.gov.au/v3](https://api.trove.nla.gov.au/v3)

Different endpoints can be appended to the base URL to access different data. 
For example:

- `/result` - functions like a search in the Trove interface.
- `/newspaper/titles` - limits the search scope to Newspaper titles.
- `/newspaper/title/[id]` - limits the results to a single Newspaper title by its id.
- `/newspaper/[id]` - returns a single newspaper article by its id.

For a full list of Trove endpoints see [Endpoints in the Trove Data Guide](https://wragge.github.io/trove-data-guide/accessing-data/trove-api-intro.html#endpoints).
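
The way endpoints are appended to the base URL can be sketched with a small helper. The function name `build_url` and the id value `35` are placeholders invented for this example, not part of the Trove API.

```python
BASE_URL = "https://api.trove.nla.gov.au/v3"


def build_url(endpoint, record_id=None):
    """Join the base URL, an endpoint, and an optional record id."""
    url = BASE_URL + "/" + endpoint.strip("/")
    if record_id is not None:
        url += "/" + str(record_id)
    return url


print(build_url("/newspaper/titles"))
# 35 is a placeholder id used only to show where the id goes
print(build_url("/newspaper/title", 35))
```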

### Encoding

Most web APIs will return data encoded in either JSON (JavaScript Object Notation) or XML. 
To set the preferred encoding, a header needs to be submitted as part of the request. JSON is easier to work with in Python, so this tutorial requests JSON by passing a header with each request. Not all APIs provide both JSON and XML, so it is important to check the documentation to find out what is available.
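
To illustrate why JSON tends to be more convenient in Python, the sketch below reads the same title from a JSON string and from an XML string using only the standard library. Both payloads are invented for this example and do not reproduce the structure of real Trove responses.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads invented for this sketch; real Trove responses differ.
json_text = '{"total": 1, "newspaper": [{"id": "1", "title": "Example Gazette"}]}'
xml_text = (
    '<response total="1">'
    '<newspaper id="1"><title>Example Gazette</title></newspaper>'
    "</response>"
)

# JSON maps directly onto Python dictionaries and lists
json_title = json.loads(json_text)["newspaper"][0]["title"]

# XML is parsed into a tree of elements that has to be searched
xml_title = ET.fromstring(xml_text).find("newspaper/title").text

print(json_title, "|", xml_title)
```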

### Request methods

The `Requests` library supports a range of HTTP (Hypertext Transfer Protocol) request methods including:  
- `get` - used for requesting data.
- `put` - used for sending data to create/update something.

The Trove API is designed to give access to data, so it only supports GET requests. You can read more about HTTP request methods [here](https://www.w3schools.com/tags/ref_httpmethods.asp).

Run the code below to make a call for all newspaper titles in Trove using a GET request.


In [None]:
base_url = "https://api.trove.nla.gov.au/v3"
endpoint = "/newspaper/titles"

# Set the request to accept JSON
encoding = {"accept":"application/json"}

# calls the API by sending a GET request to the specified endpoint with encoding passed to headers
response = requests.get(base_url + endpoint, headers=encoding)

# An example bad request with a non-existent endpoint
bad_request = requests.get(base_url + "/something")

The response returned by the API request will be stored in each variable. The response object contains data that can be useful to view:

- `response.url` will return the URL sent to Trove to make the request
- `response.status_code` will return a number that indicates if the request was successful (read the [Status Code Documentation](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) for more information)
- `response.text` will return the full text of the response content

The code below will show a successful call to the valid endpoint `/newspaper/titles` and a not found response for the fake endpoint `/something`.

Try adding in `response.text` for the successful response below to see the difference.

In [None]:
# Prints the URL and status code for the request using a valid endpoint
print(response.url)
print(f"Status code for response: {response.status_code}")
if response.status_code == 200:
    print("OK!")
else:
    print("Something went wrong :(")
print("")

# Prints the URL and status code for the request using an invalid endpoint
print(bad_request.url)
print(f"Status code for bad response: {bad_request.status_code}")
if bad_request.status_code == 200:
    print("OK!")
else:
    print("Something went wrong :(")
print("Data returned:")
print(bad_request.text)
print("")

# Additional way to check if response code is ok
print("requests.codes.ok can also be used to test status codes:")
if response.status_code == requests.codes.ok:
    print(f"The response status code ({response.status_code}) is on the OK list.")
else:
    print(f"The response status code ({response.status_code}) indicates something went wrong...")
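
Status codes are easier to interpret with a description attached. The sketch below uses Python's standard `http.HTTPStatus` enumeration to look up the standard phrase for a code. The helper name `describe_status` and the "success"/"problem" labels are assumptions for this illustration, and the codes are typed in by hand rather than taken from a live response.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} (not a standard status code)"
    # Codes from 200 to 299 indicate that the request succeeded
    outcome = "success" if 200 <= code < 300 else "problem"
    return f"{code} {status.phrase}: {outcome}"


for code in (200, 404, 429):
    print(describe_status(code))
```

The same function could be called with `response.status_code` from any request made with Requests.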


The request can also return headers, which include useful metadata about the request such as the date and time, rate limit, and content encoding.

In [None]:
headers = response.headers
print(headers)

As the request is returning the data in JSON, it needs to be loaded before it can be accessed.

Depending on how the data has been structured by the developer of the API, different keys may be returned. You can use the `json.dumps()` function from Python's built-in `json` library to *pretty print* the output, which helps us to understand the way the returned data is structured.

Here is an example `json.dumps()` output for a simple JSON record, which includes the top level key `records`, with a list of three entries each containing `record_id`, `title`, `fun_fact` and in one case `url`.

In [None]:
# An example string with JSON syntax
json_string = r'{"records": [{"record_id" : "4627", "title":"Why JSON","fun_fact":"JSON is a standard designed to allow structured data to be exchanged on the internet."},{"record_id" : "42", "title":"JSON and the Argonauts","fun_fact":"JSON is pronounced like the human name Jason."},{"record_id" : "2002", "title":"JSON license","fun_fact":"The license to use JSON includes the line: The Software shall be used for Good, not Evil.", "url":"http://www.json.org/license.html"}]}'

# parse the JSON string into a Python dictionary
json_object = json.loads(json_string)

# pretty print the JSON
print(json.dumps(json_object, indent=4))

The code below uses the key `"total"`, which stores the total number of returned results, and `"newspaper"`, which stores a list of records returned by the API call. Some API endpoints, such as `/result`, will limit the number of records returned, but in this case the `/newspaper/titles` endpoint returns everything. As there are almost 2 thousand newspapers recorded in Trove, the list can be very long, so the code below only prints one item from the list.

In [None]:
# Load the data as JSON
data = response.json()

# Get the total number of results
total = data["total"]
print("Total records returned: " + str(total))

# Display the 50th result
record = data["newspaper"][49]

## Pretty print the JSON with json.dumps
print(json.dumps(record, indent=4))

## If you're feeling brave, uncomment this line to print all the records
#print(json.dumps(data, indent=4))

### Navigating the response

The `.json()` method converts the data into a Python dictionary with keys and values. Responses may contain data, nested dictionaries, and lists. If you need a refresher on data types check out the notebook [CC Data types & structures](https://www.kaggle.com/code/sotiriosalpanis/cc-data-types-structures).

To see all the keys contained in the JSON we can use `.keys()`

The response received from Trove will include two top level keys: 

- `total` - that returns a count of records, and 
- `newspaper` - that returns a list of records. 

To get a sense of the keys in each record, check the first newspaper record by selecting it.

In [None]:
# Store and print the top level keys
keys = data.keys()
print("The top level keys are:")
print(keys)

# Store and print the keys for the first newspaper in the list
print("The keys in the first newspaper record are:")
newspaper_keys = data['newspaper'][0].keys()
print(newspaper_keys)

The structure of the data means that we can select information using loops.

For example, the following code finds all newspapers with the word `Age` in the title and prints the title and id.

In [None]:
for record in data["newspaper"]:
    # Default to an empty string so records without a title are skipped safely
    title = record.get("title", "")
    if "Age" in title:
        print(f"{title}, with id: {record.get('id')}")

## Headers and parameters

### Headers  

Each request and response will have a **header** which contains metadata about the action and common data that needs to be sent for every request, such as date of request, rate limits, and authentication.

Header parameters for Trove API requests include:

- encoding: default is XML, or can be set to JSON using `"accept" : "application/json"` 
- authentication (API keys): set using `"X-API-KEY" : API_KEY`

Each API will have different parameters so it is important to check the documentation.

Header parameters can be combined in a dictionary as key, value pairs.

In [None]:
# Headers parameter with API key and encoding.
#API_KEY = APIkey
#headers = {"accept" : "application/json", "X-API-KEY": API_KEY}

# Headers with just encoding.
encoding = {"accept" : "application/json"}

### Query parameters

When requesting data from an API, other parameters called **query parameters** are often supplied to make it easier to find the data you want. These are all dependent on the specific API.

The Trove API has different parameters available when searching and when retrieving records.

**Searching**  
These parameters are related to the `/result` endpoint which acts like a search in the Trove catalogue, but with more granular options.

- `category` : Required repeatable parameter that reflects different resource types. Options include `all`, `newspaper`, `magazine`, `image`, `research`, `book`, `diary`, `music`, `people`, `list`.
- `q` : Optional parameter for supplying the search query.
- `l-<facet name>` : Optional parameter that limits results using a facet from a controlled list, which includes options such as format, date, language, availability, title, etc. See the full list at [Trove Technical Guide - facet values](https://trove.nla.gov.au/about/create-something/using-api/v3/api-technical-guide#facetValues-01)

Further parameters are available for navigating and sorting results and specifying what metadata should be returned. See more in the documentation: [Parameters available when searching](https://trove.nla.gov.au/about/create-something/using-api/v3/api-technical-guide#parameters-available-when-searching)
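
When a dictionary is passed to `params`, Requests encodes it into the query string of the URL. As a rough illustration of what that produces, the sketch below builds a comparable query string with `urlencode` from the standard library. It is an approximation for learning purposes, and the exact URL generated by Requests may differ in detail. Passing a list for `category` shows how a repeatable parameter appears more than once.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.trove.nla.gov.au/v3"

query = {
    "category": ["newspaper", "image"],
    "q": "fastest computer",
    "l-decade": "199",
}

# doseq=True writes one "category=" pair per item in the list
query_string = urlencode(query, doseq=True)
print(f"{BASE_URL}/result?{query_string}")
```

Note that the space in the search term is replaced, which is the percent encoding step mentioned earlier in this notebook.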


In [None]:
base_url = "https://api.trove.nla.gov.au/v3"
results_endpoint = "/result"

# Parameters added to a dictionary that will be passed to the API call.
# Searches for the fastest computers of the 1990s
query = {"q":"fastest computer", "category":"all", "l-decade":"199"}

# Make GET request
w_response = requests.get(base_url + results_endpoint, params=query, headers=encoding)

# Check if response was ok.
if w_response.status_code == requests.codes.ok:
    print("The response has a status code on the OK list.")
else:
    print("The response status code indicates something went wrong...")
    
# Show the URL used to make the request
print("API url: ")
print(w_response.url)


With the request completed, we can now get a sense for the data. 

This can involve trial and error, or looking at the documentation to find the data that interests you.

The code below prints out the number of categories, category keys and names, as well as the number of records returned for each category.

The code imports and uses methods from [IPython](https://ipython.readthedocs.io/en/stable/) to show the image thumbnail for the record returned.

In [None]:
# import libraries for displaying images
from IPython.display import HTML, display

# get the API response JSON
data = w_response.json()

keys = data.keys() # dict_keys(['query', 'category'])

categories = data["category"] # selects the data we want to work with

# print the keys
category_keys = categories[0].keys()
print(category_keys)

# loop through the list of keys and print out the details of each 
# category and how many records were returned in each
for index, category in enumerate(categories):
    print(f"{index} Category code {category['code']}: {category['name']}, " 
          f"records: {category['records']['total']}")
    
print("")

# first record in the image category (note that other categories use different keys)
first_image = categories[2]["records"]["work"][0]

## Get the thumbnail url and display it
image = categories[2]["records"]["work"][0]["identifier"][0]["value"]
display(HTML('<img src="{}">'.format(image)))

# pretty prints the JSON record
print(json.dumps(first_image,indent=4))
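
Selecting a category by its position in the list depends on the order in which categories are returned. A possible alternative, sketched below, is to look a category up by its `code` value. The helper name `find_category` and the trimmed sample data (including the category names) are invented for this illustration. Only the `code`, `name`, `records`, `total` and `work` keys are taken from the code above.

```python
def find_category(categories, code):
    """Return the first category dictionary whose "code" matches, or None."""
    for category in categories:
        if category.get("code") == code:
            return category
    return None


# Hypothetical, heavily trimmed stand-in for data["category"]
sample_categories = [
    {"code": "book", "name": "Sample books", "records": {"total": 0}},
    {
        "code": "image",
        "name": "Sample images",
        "records": {"total": 1, "work": [{"title": "Example photograph"}]},
    },
]

image_category = find_category(sample_categories, "image")
print(image_category["records"]["work"][0]["title"])
```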



## More fun with images

This is all very nice, but perhaps a bit dry if you're not a metadata librarian (like the author of this tutorial). The API can become really powerful when combined with a range of Python packages for returning different types of data. 

To help demonstrate this, I copied some code from the GLAM Workbench notebook [Save Trove newspaper articles as image](https://nbviewer.org/github/GLAM-Workbench/trove-newspapers/blob/master/Save-Trove-newspaper-article-as-image.ipynb) by [Tim Sherratt](https://updates.timsherratt.org/).

Running the code below will build Sherratt's method for saving newspaper articles. It has four functions which are explained in the blocks below...

In [None]:
import re
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from IPython.display import HTML, display
from PIL import Image


This function finds the outside bounding box of the article by iterating through sets of coordinates supplied from the `get_article_boxes()` function. It returns a dictionary of the page id and edge coordinates.

In [None]:
def get_box(zones):
    """
    Loop through all the zones to find the outer limits of each boundary.
    Return a bounding box around the article.
    """
    left = 10000
    right = 0
    top = 10000
    bottom = 0
    page_id = zones[0]["data-page-id"]
    #print(zones)
    for zone in zones:
        if int(zone["data-y"]) < top:
            top = int(zone["data-y"])
        if int(zone["data-x"]) < left:
            left = int(zone["data-x"])
        if (int(zone["data-x"]) + int(zone["data-w"])) > right:
            right = int(zone["data-x"]) + int(zone["data-w"])
        if (int(zone["data-y"]) + int(zone["data-h"])) > bottom:
            bottom = int(zone["data-y"]) + int(zone["data-h"])
    return {
        "page_id": page_id,
        "left": left,
        "top": top,
        "right": right,
        "bottom": bottom,
    }
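
To see the bounding box logic on concrete numbers, here is a compact sketch that uses `min()` and `max()` on two made-up zones. The function name `bounding_box` and the sample values are assumptions for this example. Unlike `get_box()`, it does not start from fixed limits of 10000 and 0, so it is an illustration of the idea rather than a drop-in replacement.

```python
def bounding_box(zones):
    """Return the smallest box that contains every zone."""
    left = min(int(zone["data-x"]) for zone in zones)
    top = min(int(zone["data-y"]) for zone in zones)
    right = max(int(zone["data-x"]) + int(zone["data-w"]) for zone in zones)
    bottom = max(int(zone["data-y"]) + int(zone["data-h"]) for zone in zones)
    return {
        "page_id": zones[0]["data-page-id"],
        "left": left,
        "top": top,
        "right": right,
        "bottom": bottom,
    }


# Two made-up zones on the same page
sample_zones = [
    {"data-page-id": "1", "data-x": "100", "data-y": "200", "data-w": "300", "data-h": "50"},
    {"data-page-id": "1", "data-x": "80", "data-y": "250", "data-w": "400", "data-h": "60"},
]
print(bounding_box(sample_zones))
```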

The function below processes the raw web page data to find the coordinates of parts of the article. It uses Requests to retrieve the article data, which returns HTML required to view the article in a browser instead of the record structure of the Trove API. 

To process the data, the function uses a library called [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) and parses the HTML using the [lxml](https://lxml.de/) parser. The data contains OCR article text, which uses `<div>` tags to provide structure. Each `<div>` includes information about which page the article part is on and the boundaries of each section.

The function returns a list of boxes structured based on the output of `get_box(zones)`.

In [None]:
def get_article_boxes(article_url):
    """
    Positional information about the article is attached to each line of the OCR output in data attributes.
    This function loads the HTML version of the article and scrapes the x, y, and width values for each line of text
    to determine the coordinates of a box around the article.
    """
    boxes = []
    response = requests.get(article_url)
    soup = BeautifulSoup(response.text, "lxml")
    # Lines of OCR are in divs with the class 'zone'
    # 'onPage' limits to those on the current page
    zones = soup.select("div.zone.onPage")
    boxes.append(get_box(zones))
    off_page_zones = soup.select("div.zone.offPage")
    if off_page_zones:
        current_page = off_page_zones[0]["data-page-id"]
        zones = []
        for zone in off_page_zones:
            if zone["data-page-id"] == current_page:
                zones.append(zone)
            else:
                boxes.append(get_box(zones))
                zones = [zone]
                current_page = zone["data-page-id"]
        boxes.append(get_box(zones))
    return boxes
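
The off-page handling in `get_article_boxes()` collects consecutive zones that share a page id. The same grouping idea can be sketched with `itertools.groupby` from the standard library. The function name `group_by_page` and the sample zones are invented for this example, and plain dictionaries stand in for the BeautifulSoup elements.

```python
from itertools import groupby

# Made-up zones: two on page "11" followed by one on page "12"
sample_zones = [
    {"data-page-id": "11", "data-x": "10"},
    {"data-page-id": "11", "data-x": "20"},
    {"data-page-id": "12", "data-x": "30"},
]


def group_by_page(zones):
    """Group consecutive zones that share a page id."""
    return [
        (page_id, list(group))
        for page_id, group in groupby(zones, key=lambda zone: zone["data-page-id"])
    ]


for page_id, zones in group_by_page(sample_zones):
    print(page_id, len(zones))
```

Like the loop above, `groupby` only merges neighbouring items, so it relies on zones from the same page appearing next to each other.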

The next function first builds the list of boxes using the functions above. It then requests each page image from the Trove image service using the page ID for each box and crops it using the coordinates. Each cropped image is saved as a file in the working directory (see `/kaggle/working` in the notebook when you run it), and the function returns a list of image filenames.

In [None]:
def get_page_images(article_id, size):
    """
    Extract an image of the article from the page image(s), save it, and return the filename(s).
    """
    images = []
    # Get position of article on the page(s)
    boxes = get_article_boxes("http://nla.gov.au/nla.news-article{}".format(article_id))
    for box in boxes:
        # print(box)
        # Construct the url we need to download the page image
        page_url = (
            "https://trove.nla.gov.au/ndp/imageservice/nla.news-page{}/level{}".format(
                box["page_id"], 7
            )
        )
        # Download the page image
        response = requests.get(page_url)
        # Open download as an image for editing
        img = Image.open(BytesIO(response.content))
        # Use the bounding box coordinates as the crop region
        points = (box["left"], box["top"], box["right"], box["bottom"])
        # Crop image to article box
        cropped = img.crop(points)
        # Resize if necessary
        if size:
            cropped.thumbnail((size, size), Image.LANCZOS)
        # Save the cropped image and record its filename
        cropped_file = "nla.news-article{}-{}.jpg".format(article_id, box["page_id"])
        cropped.save(cropped_file)
        images.append(cropped_file)
    return images

The final function below brings all these functions together to process an article into a cohesive image. 

First, it searches the `article_url` (equivalent to the `troveUrl` element in the API records) for the article id and uses it to get images. 

The size parameter limits the maximum size of the images. Use `None` to get full size. Once the image files have been downloaded by the `get_page_images()` function, it uses `HTML()` and `display()` from [IPython.display](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html) library to show the images in the output.

In [None]:
def get_article(article_url, size):
    # Extract the article id from the article URL
    article_id = re.search(r"article\/{0,1}(\d+)", article_url).group(1)
    # print(article_id)
    images = get_page_images(article_id, size)
    for image in images:
        display(HTML(f'<a href="{image}" download>Download {image}</a>'))
        display(HTML('<img src="{}">'.format(image)))
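
The regular expression in `get_article()` can be tried on its own. The sketch below uses an equivalent pattern and returns `None` when no id is found instead of raising an error. The function name `extract_article_id` and the id numbers in the sample URLs are made up for this illustration.

```python
import re

# "article", an optional "/", then the digits of the article id
ARTICLE_ID_PATTERN = re.compile(r"article/?(\d+)")


def extract_article_id(article_url):
    """Return the article id found in a URL, or None if there is none."""
    match = ARTICLE_ID_PATTERN.search(article_url)
    return match.group(1) if match else None


# Sample URLs with made-up ids
print(extract_article_id("http://nla.gov.au/nla.news-article12345"))
print(extract_article_id("https://trove.nla.gov.au/newspaper/article/67890"))
print(extract_article_id("https://trove.nla.gov.au/"))
```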

Now that the functions have been added, we can use `get_article` to display some images from the API response. 

The code below takes the records from the newspaper category, collects the Trove URL from each record into a sorted list, and runs `get_article` on the first 3.

You can try updating this by:

- Adjusting the existing inputs like size or the start and end of the range,
- Creating a new API call on a topic that interests you,
- Creating functions to work with images from other data categories. *This will require creating new functions using the structure of the URL and the different JSON elements for each category.*

In [None]:
# Get the results from the newspaper category
articles_list = categories[6]["records"]["article"]
to_get = []

# get all the TroveUrl values from the data
for article in articles_list:
    if 'troveUrl' in article.keys():
        to_get.append(article['troveUrl'])

# sort list
to_get.sort()

# calls get_article on the first three URLs in the list (or fewer if the list is shorter)
for url in to_get[:3]:
    # prints the Trove URL
    print(url)
    # calls the get_article function on the URL
    get_article(url, 400)




## Next steps

If you have made it this far, you now have the knowledge to be able to create a similar function to the one used at the start of this notebook. You have seen how to construct a valid GET request, including configuring header and query parameters, and how to navigate the data returned by the Trove API. 
Tweak the code to get familiar with how to use the API or explore one of the tutorials below.

- Check out the documentation for version 3 [Using APIs](https://trove.nla.gov.au/about/create-something/using-api)
- Explore tutorials from the [Trove Data guide](https://tdg.glam-workbench.net/home.html) such as:
    - Learn about retrieving multiple pages of results: [Harvest a complete set of search results using the Trove API](https://tdg.glam-workbench.net/accessing-data/how-to/harvest-complete-results.html)
    - Learn more about [Accessing data about newspaper and gazette articles](https://tdg.glam-workbench.net/newspapers-and-gazettes/data/articles.html)
    - Investigate other digitised resources [Understanding and using digitised resources](https://tdg.glam-workbench.net/other-digitised-resources/index.html) such as books and oral histories 
- Learn more about APIs through [API for Libraries at Library Carpentry](https://joshuadull.github.io/APIs-for-Libraries/)


## Projects built using Trove API

### Ideas and tools

[GLAM Workbench](https://glam-workbench.net/) has a lot of Jupyter notebooks covering Trove API projects, including [making a IIIF manifest for display in IIIF viewers](https://glam-workbench.net/trove-images/save-image-collection-iiif/#using-this-notebook) and [getting OCR text from digitised journals](https://glam-workbench.net/trove-journals/get-ocrd-text-from-digitised-journal/). Also check out [Tim Sherratt's blog](https://updates.timsherratt.org/) for updates on his work developing the GLAM Workbench.

### Projects that use the Trove API

[digitalpasifik.org/](https://digitalpasifik.org/) - a pilot project as part of the Pacific Virtual Museum project aiming to bring together digitised Pacific collections held at many institutions into one location to improve access. 

[Drifter](https://mtchl.net/drifter/) - 2016 work by Mitchell Whitelaw that brings together data relating to the Murrumbidgee river system, including newspapers from the Trove API with other tools and datasets.

[To Be Continued...](https://readallaboutit.com.au/) - the Australian Newspaper Fiction database, creating a searchable collection of 40,000 stories from the 19th and 20th centuries.

