# Working with zones

<p class="alert alert-info">New to Jupyter notebooks? Try <a href="getting-started/Using_Jupyter_notebooks.ipynb"><b>Using Jupyter notebooks</b></a> for a quick introduction.</p>

Trove's zones are important in constructing API requests and interpreting the results. So let's explore them a bit.

<p class="alert alert-warning">The update to the Trove web interface in 2020 removed 'zones' and replaced them with 'categories'. However, the API still uses the original zones and knows nothing about the new categories. There is no one-to-one correspondence between zones and categories, so this can make it hard to compare results between the web interface and the API.</p>

There are 10 zones used by the Trove API (11 if you regard the newspapers and gazettes as separate):

* Digitised newspapers and gazettes 
* Journals, articles and data sets
* Books
* Pictures, photos and objects
* Music, sound and video
* Maps
* Diaries, letters and archives
* People and organisations
* Archived websites
* Lists

However, data from the 'People and organisations' and 'Archived websites' zones are not available through the API. Well, sort of not...

Let's see what the API itself can tell us about the zones.

## Setting things up

We'll start by importing the modules we're going to need later on.

In [54]:
# Let's import the modules we need
import os
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

# Altair helps us make pretty charts
import altair as alt

# Pandas helps us analyse tabular data
import pandas as pd
import requests
from IPython.display import JSON

os.makedirs("data", exist_ok=True)

In [55]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

As usual we're going to need a Trove API key.

In [None]:
# This creates a variable called 'api_key', paste your key between the quotes
api_key = ""

# Use an api key value from environment variables if it is available (useful for testing)
if os.getenv("TROVE_API_KEY"):
    api_key = os.getenv("TROVE_API_KEY")

# This displays a message with your key
print("Your API key is: {}".format(api_key))

We'll also set the base url for our API requests.

In [57]:
# Create a variable called 'api_search_url' and give it a value
api_search_url = "https://api.trove.nla.gov.au/v2/result"

## Give us everything!

This time we're going to ask for **everything** from **all** the zones. (Don't worry, you won't break anything. Trove will only give us the first 20 results in each zone.)

To do this, we'll set the `q` parameter to a single space (a query with no search terms), and the `zone` parameter to 'all'.

In [58]:
# This creates a dictionary called 'params' and sets values for the API's parameters
params = {
    "q": " ",  # A space to search for everything
    "zone": "all",  # All zones thanks!
    "key": api_key,
    "encoding": "json",
}
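To see what a single-space query looks like once it is encoded into a URL, here is a minimal sketch. It uses `urllib.parse.urlencode` from Python's standard library instead of `requests`, so nothing is sent to Trove, and `YOUR_API_KEY` is a placeholder rather than a real key.

```python
from urllib.parse import urlencode

# Same parameters as above, with a placeholder instead of a real API key
example_params = {
    "q": " ",  # A single space, not an empty string
    "zone": "all",
    "key": "YOUR_API_KEY",  # Placeholder for illustration only
    "encoding": "json",
}

# urlencode() turns the dictionary into a query string; the space becomes '+'
query_string = urlencode(example_params)
print(query_string)
```

The `q=+` at the start of the query string is the encoded space.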

We can now send our request off to the Trove API. Because we're not applying any limits to our query, the API can take a little longer than normal to respond. Just wait for the asterisk in the square brackets to turn into a number, and then move on.

In [59]:
# This sends our request to the Trove API and stores the result in a variable called 'response'
response = requests.get(api_search_url, params=params)

# This shows us the url that's sent to the API
print(response.url)

# This checks the status code of the response to make sure there were no errors
if response.status_code == requests.codes.ok:
    print("All ok")
elif response.status_code == 403:
    print("There was an authentication error. Did you paste your API key above?")
else:
    print("There was a problem. Error code: {}".format(response.status_code))
    print("Try running this cell again.")

https://api.trove.nla.gov.au/v2/result?q=+&zone=all&key=gq29l1g1h75pimh4&encoding=json
All ok


As before we'll get the JSON results data from the API response.

In [60]:
# Get the Trove API's JSON results and make them available as a Python variable called 'data'
data = response.json()

If you'd like to have a look at the raw data, run the next cell.

In [61]:
JSON(data)

<IPython.core.display.JSON object>

## Looking into the zones

Now that we've got data from all of Trove's zones, let's see what it looks like!

The next code cell loops through each zone in the results and extracts the total number of results. Because we didn't apply any limits to our query, this will tell us how many items are currently in each zone.

In [62]:
# Create empty lists to store 'zones' and 'totals'
zones = []
totals = []

# Loop through the zones in the API results
for zone in data["response"]["zone"]:

    # Add the name and total values to the relevant list
    zones.append(zone["name"])
    totals.append(int(zone["records"]["total"]))

# Save the results as a dictionary
zone_totals = {"zones": zones, "totals": totals}
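The same extraction can also be written as a dictionary comprehension, which gives a direct lookup from zone name to total. This is only a sketch: `sample_data` is a cut-down stand-in for the real API response, with two totals copied from the results in this notebook, and the totals are assumed to arrive as strings because the loop above converts them with `int()`.

```python
# A cut-down stand-in for the API response, using the same nesting as 'data'.
# The totals are assumed to be strings, as the int() conversion above suggests.
sample_data = {
    "response": {
        "zone": [
            {"name": "map", "records": {"total": "363594"}},
            {"name": "list", "records": {"total": "106022"}},
        ]
    }
}

# Build a {zone name: total} dictionary in one step
totals_by_zone = {
    zone["name"]: int(zone["records"]["total"])
    for zone in sample_data["response"]["zone"]
}

print(totals_by_zone["map"])
```

With the real response you would replace `sample_data` with `data`.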

We're now going to convert the results into a Pandas dataframe. Pandas has lots of useful options for working with and displaying tabular data.

In [63]:
# Create a Pandas dataframe to work with the results
df = pd.DataFrame(zone_totals)

# Sort by zone name
df = df.sort_values(by="zones")

# Display as a table (formatting the numbers with comma separators for readability)
df[["zones", "totals"]].style.format({"totals": "{:,}"})

| | zones | totals |
|---|---|---|
| 8 | article | 13406104 |
| 9 | book | 18601177 |
| 6 | collection | 4748600 |
| 1 | gazette | 3535033 |
| 2 | list | 106022 |
| 4 | map | 363594 |
| 5 | music | 3066364 |
| 10 | newspaper | 236616092 |
| 3 | people | 1309873 |
| 7 | picture | 5618995 |


In [64]:
# We can even use Pandas to display the results table with a simple bar chart
df[["zones", "totals"]].style.format({"totals": "{:,}"}).bar(
    subset=["totals"], color="#d65f5f"
)

| | zones | totals |
|---|---|---|
| 8 | article | 13406104 |
| 9 | book | 18601177 |
| 6 | collection | 4748600 |
| 1 | gazette | 3535033 |
| 2 | list | 106022 |
| 4 | map | 363594 |
| 5 | music | 3066364 |
| 10 | newspaper | 236616092 |
| 3 | people | 1309873 |
| 7 | picture | 5618995 |


That's pretty cool, but let's take things one step further and use Altair to create a pretty interactive bar chart. As you can see from the cell below, Altair is very easy to use.

In [65]:
alt.Chart(df).mark_bar().encode(
    x="zones:N",
    y="totals:Q",
    # Add tooltips to the bars, format numbers with thousand separators
    tooltip=["zones:N", alt.Tooltip("totals:Q", format=",")],
)

Ok, let's stop making pictures and look at what the results tell us. It's probably no surprise that there are more digitised newspaper articles than anything else.
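To put a rough number on that, the sketch below works out each zone's share of the combined zone totals. The figures are copied from the results table above and will change over time. Because an item can appear in more than one zone, the combined figure is not a count of unique items, so treat the percentages as indicative only.

```python
# Zone totals copied from the results table above (a snapshot, not live data)
zone_totals_snapshot = {
    "article": 13406104,
    "book": 18601177,
    "collection": 4748600,
    "gazette": 3535033,
    "list": 106022,
    "map": 363594,
    "music": 3066364,
    "newspaper": 236616092,
    "people": 1309873,
    "picture": 5618995,
}

# Sum of all zone totals (items in several zones are counted more than once)
combined = sum(zone_totals_snapshot.values())

# Share of the combined total for each zone, largest first
shares = {
    name: total / combined
    for name, total in sorted(
        zone_totals_snapshot.items(), key=lambda item: item[1], reverse=True
    )
}

for name, share in shares.items():
    print(f"{name:<12}{share:.1%}")
```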

Most of the zone names seem straightforward, though it might not be immediately obvious that the 'article' zone corresponds to the 'Journals, articles and data sets' zone in the old web interface.

However, you might be wondering about 'collection'. It's not 'Lists' as there's already a 'list' zone in the results. It turns out that the 'collection' zone corresponds to the 'Diaries, letters and archives' zone in the old web interface. I suppose it sort of makes sense.
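If you need to present API results using the old web interface labels, a simple lookup table can help. The sketch below is an assumption-laden illustration: only the 'article' and 'collection' pairings are spelled out in the text, the other pairings are matched by name, and `ZONE_LABELS` and `label_for` are hypothetical names, not part of the Trove API.

```python
# API zone names mapped to the zone labels used in the old web interface.
# Only 'article' and 'collection' are confirmed in the text above; the other
# pairings are matched by name and should be treated as assumptions.
ZONE_LABELS = {
    "newspaper": "Digitised newspapers and gazettes",
    "gazette": "Digitised newspapers and gazettes",
    "article": "Journals, articles and data sets",
    "book": "Books",
    "picture": "Pictures, photos and objects",
    "music": "Music, sound and video",
    "map": "Maps",
    "collection": "Diaries, letters and archives",
    "people": "People and organisations",
    "list": "Lists",
}


def label_for(zone_name):
    """Return the old web interface label, or the name itself if unknown."""
    return ZONE_LABELS.get(zone_name, zone_name)


print(label_for("collection"))
```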

<p class="alert alert-warning">While the total numbers above give us an indication of how resources are distributed across Trove, remember that items can appear in multiple zones – for example, a book that includes maps might be in both the 'book' and 'map' zones. So if you try to add the totals to get the overall number of resources in Trove, you'll be including many duplicates.</p>

You might also have noticed that although I said that the API didn't include results for the 'People and organisations' zone, there is a result in the data above for 'people'. What's going on?

Basically, full support for the 'People and organisations' zone was never completed. Don't believe me? Let's have a look at the results.

First we'll extract the results for the 'people' zone from the API data.

In [66]:
# Create an empty list to store the results
people_results = []

# Loop through the zone results
for zone in data["response"]["zone"]:

    # When we find the people zone save the records data to 'people_results'
    if zone["name"] == "people":
        people_results = zone["records"]["people"]

Once again, we'll convert the results into a dataframe and display the first 5 rows as a table.

In [67]:
# Create a dataframe from 'people_results'
people_df = pd.DataFrame(people_results)

# Display the first 5 results as a table
people_df[:5]

| | id | url | troveUrl |
|---|---|---|---|
| 0 | 949358 | /people/949358 | https://trove.nla.gov.au/people/949358 |
| 1 | 897386 | /people/897386 | https://trove.nla.gov.au/people/897386 |
| 2 | 542626 | /people/542626 | https://trove.nla.gov.au/people/542626 |
| 3 | 461145 | /people/461145 | https://trove.nla.gov.au/people/461145 |
| 4 | 638889 | /people/638889 | https://trove.nla.gov.au/people/638889 |


You'll notice that there's not a lot of useful data – just identifiers and urls for the Trove web interface. If you try to use the identifier to get more information from the API you'll be out of luck – it returns a '404: Not Found' error.

As I said, full support for the 'People and organisations' zone was never completed. Hopefully it will be added in a future release.

## More zone peculiarities

There are a couple more peculiarities that you need to be aware of. The first is really more of an annoyance than a peculiarity. As you might have noticed above, to find the results for the 'people' zone I had to loop through all the zones until I found the one with the name 'people'. We can't just say, 'give me the people results!'. Of course, this is only an issue if you've asked for results from more than one zone. If you set the 'zone' parameter to a single zone, such as 'newspaper', the newspaper data would be the first (and only) set of results. You could find them at `data['response']['zone'][0]`.
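One way to avoid the repeated looping is to index the zones by name once and then look them up directly. This is a sketch, not something the API provides: `sample_data` is a cut-down stand-in for the real response, and `zones_by_name` is a name made up for this example.

```python
# A cut-down stand-in for the API response, using the same nesting as 'data'
sample_data = {
    "response": {
        "zone": [
            {"name": "list", "records": {"total": "106022"}},
            {"name": "people", "records": {"total": "1309873"}},
        ]
    }
}

# Index the zones by name so each one can be looked up directly
zones_by_name = {zone["name"]: zone for zone in sample_data["response"]["zone"]}

# Now we can ask for the 'people' zone without looping
people_zone = zones_by_name["people"]
print(people_zone["records"]["total"])
```

With the real response, building the dictionary from `data["response"]["zone"]` would give you one entry per zone returned.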

You might also have noticed that the individual records from the 'people' zone were found at `zone['records']['people']`. What's wrong with that? Well, it means that different zones use different keys to identify their records. So you have to know in advance what the key is in order to get the records data. Again, if you're only working with one zone it's not too hard. But if you're working across multiple zones, it's a bit of a pain.

At least we can use the data we've already gathered to create a mapping of zones to keys.

In [68]:
# Create an empty list to store the results
zone_keys = []

# Loop through the zones
for zone in data["response"]["zone"]:

    # Get the name of the zone
    zone_name = zone["name"]

    # Loop through the keys
    for key in zone["records"].keys():

        # Check the key against the keys that are always there
        if key not in ["s", "n", "total", "next", "nextStart"]:
            # If it's not one of the standard keys save it
            records_key = key
            # Append the zone name and records key to our list
            zone_keys.append({"zone_name": zone_name, "records_key": records_key})

# Convert the results to a dataframe
keys_df = pd.DataFrame(zone_keys)

# Sort and display the results
keys_df = keys_df.sort_values(by="records_key")
keys_df[["zone_name", "records_key"]]

| | zone_name | records_key |
|---|---|---|
| 0 | gazette | article |
| 9 | newspaper | article |
| 1 | list | list |
| 2 | people | people |
| 3 | map | work |
| 4 | music | work |
| 5 | collection | work |
| 6 | picture | work |
| 7 | article | work |
| 8 | book | work |


As you can see, the 'newspaper' and 'gazette' zones share the 'article' key, while the 'list' and 'people' zones have keys that match their names. Every other zone uses 'work'.
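If you would rather not remember which key goes with which zone, the same 'skip the standard keys' idea can be wrapped in a small helper. This is a sketch under a few assumptions: `get_records` and `STANDARD_KEYS` are names made up for this example, the sample zones are cut-down stand-ins for real results, and a zone without a records key is assumed to have no records.

```python
# Keys that appear in every zone's 'records' data, as listed in the cell above
STANDARD_KEYS = {"s", "n", "total", "next", "nextStart"}


def get_records(zone):
    """Return a zone's records, whatever key they are stored under."""
    for key, value in zone["records"].items():
        if key not in STANDARD_KEYS:
            return value
    # No records key found; assumed here to mean there are no records
    return []


# Cut-down sample zones; the 'people' record comes from the table above,
# the 'book' record is a made-up placeholder
sample_zones = [
    {
        "name": "people",
        "records": {
            "total": "1309873",
            "people": [{"id": "949358", "url": "/people/949358"}],
        },
    },
    {"name": "book", "records": {"total": "1", "work": [{"id": "example"}]}},
]

for zone in sample_zones:
    print(zone["name"], len(get_records(zone)))
```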

## Finally...

If you want to save the zones data, just run the cell below to create a CSV-formatted file.

In [69]:
# Save the zones data to a CSV file you can download
df.to_csv("data/trove_zones.csv", index=False)

Once you've created it, you can download this file from the workbench's [data directory](data).

----

Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM workbench](https://glam-workbench.net/). Support this project by [becoming a GitHub sponsor](https://github.com/sponsors/wragge?o=esb).