# Find urls of digitised finding aids

The 'Diaries, Letters, and Archives' category of Trove (the `collection` zone in the Trove API) includes a mix of digitised items, digitised finding aids, and metadata records for an assortment of collections and series. Unfortunately, there are no facets to help separate the different types of content, so the finding aids can get a bit swamped. For example, you might think that searching for `findingaid "nla.obj"` would help identify the NLA's digitised finding aids, but this search returns 153,345 results, most of which are items within finding aids rather than the finding aids themselves. I've yet to find a search in Trove that can reliably isolate the digitised finding aids. ([Searching for 'finding aid'](https://catalogue.nla.gov.au/Search/Home?lookfor=findingaid&type=all&limit%5B%5D=&submit=Find&limit%5B%5D=format%3AManuscript) in the manuscripts section of the NLA catalogue gets close.) Nor can I find a complete list of finding aids elsewhere on the NLA site. So how can we find the finding aids?

This notebook uses the Trove API to harvest finding aid urls from a search in the `collection` zone. The strategy goes something like this:

* search for all records in the `collection` zone containing "nla.obj" – this limits results to records containing links to Trove's delivery system for digitised items
* loop through the `identifier` field of each record, saving urls that contain the text 'findingaid'
* the resulting list of urls will contain many duplicates, as item records contain links to finding aids, but we can use Pandas to drop the duplicates and create a set of unique urls

The result is a list of finding aid urls. Additional information about each finding aid is extracted in [another notebook](get-info-finding-aids.ipynb).
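
As a rough illustration of that strategy, here is a minimal sketch that runs on a few made-up records rather than live API results. The `sample_records` list, its object ids, and the `?example=1` suffix are all invented for the example; only the shape (a list of `identifier` dicts with a `value` key) follows what the harvesting code in this notebook expects. It uses `dict.fromkeys` in place of Pandas for the de-duplication step, just to keep the sketch dependency-free.

```python
# Made-up records, shaped like the `work` records the harvest code reads
sample_records = [
    {"identifier": [{"value": "http://nla.gov.au/nla.obj-111/findingaid"}]},
    {"identifier": [{"value": "http://nla.gov.au/nla.obj-111/findingaid?example=1"}]},
    {"identifier": [{"value": "http://nla.gov.au/nla.obj-222"}]},
    {"title": "A record with no identifier field"},
]

found = []
for record in sample_records:
    # Records without an identifier field are simply skipped
    for ident in record.get("identifier", []):
        value = ident.get("value", "")
        if "findingaid" in value:
            # Keep everything up to and including 'findingaid'
            end = value.index("findingaid") + len("findingaid")
            found.append(value[:end])

# dict.fromkeys drops duplicates while keeping the original order
unique_urls = list(dict.fromkeys(found))
print(len(found), "urls found,", len(unique_urls), "unique")
```

Two of the four sample records link to the same finding aid, so the list collapses to a single url once duplicates are removed.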

## Set things up

In [80]:
import os
import re
import time

import pandas as pd
import requests_cache
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm

s = requests_cache.CachedSession()
retries = Retry(total=10, backoff_factor=0.2, status_forcelist=[500, 502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))

In [81]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

Insert your Trove API key where indicated below.

In [82]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

In [83]:
API_URL = "http://api.trove.nla.gov.au/v2/result"

# Define the API search parameters we'll be using
params = {
    "q": '"nla.obj"',
    "zone": "collection",
    "include": "links,workversions",
    "reclevel": "full",
    "encoding": "json",
    "n": 100,
    "bulkHarvest": "true",
    "key": API_KEY,
}

## Define some functions

In [98]:
def get_total(params):
    """
    Get the total number of results in a search using the supplied params.
    """
    response = s.get(API_URL, params=params)
    data = response.json()
    total = int(data["response"]["zone"][0]["records"]["total"])
    return total


def get_fa_link(identifiers):
    """
    Loop through the links in the `identifier` field, returning the first one that looks like a finding aid.
    Returns None if no finding aid link is found.
    """
    for ident in identifiers:
        if "findingaid" in ident["value"]:
            # Use re to trim off any search parameters appended to the fa url
            return re.search(r".*?findingaid", ident["value"]).group(0)


def harvest(max_records=None):
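    """
    Search for all items containing "nla.obj" in the 'collection' zone and
    save links to finding aids.

    Set `max_records` to stop early once roughly that many records have been
    inspected (mainly for testing).
    """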
    finding_aids = []
    total = get_total(params)
    start = "*"
    with tqdm(total=total) as pbar:
        while start:
            params["s"] = start
            response = s.get(API_URL, params=params)
            data = response.json()
            for record in data["response"]["zone"][0]["records"]["work"]:
                try:
                    fa_link = get_fa_link(record["identifier"])
                except KeyError:
                    # No identifier field -- ignore
                    pass
                else:
                    if fa_link:
                        # Standardise urls
                        fa_link = fa_link.replace("https", "http")
                        fa_link = fa_link.replace("http://http", "http")
                        # Add url
                        finding_aids.append(fa_link)
            if not response.from_cache:
                time.sleep(0.2)
            # Try to get the nextStart value
            try:
                start = data["response"]["zone"][0]["records"]["nextStart"]
            # If there's no nextStart then the harvest is finished
            except KeyError:
                start = None
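            # Advance by the number of records actually returned in this batch
            pbar.update(len(data["response"]["zone"][0]["records"]["work"]))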
            # Stop iteration once max number of records inspected (mainly for testing)
            if max_records and pbar.n >= max_records:
                break
    return finding_aids
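
The url clean-up steps used above (trimming anything after `findingaid`, then standardising the protocol) can be checked in isolation without calling the API. The sketch below bundles them into a hypothetical helper, `trim_fa_url`, which is not part of this notebook. The example urls are made up: the `?example=1` suffix is just a stand-in for whatever search parameters might be appended.

```python
import re


def trim_fa_url(url):
    """Hypothetical helper: trim and standardise a finding aid url."""
    match = re.search(r".*?findingaid", url)
    if match is None:
        # Not a finding aid link
        return None
    trimmed = match.group(0)
    # Same standardisation steps as in `harvest`
    trimmed = trimmed.replace("https", "http")
    trimmed = trimmed.replace("http://http", "http")
    return trimmed


examples = [
    "https://nla.gov.au/nla.obj-240918828/findingaid?example=1",
    "http://http://nla.gov.au/nla.obj-240918828/findingaid",
    "http://nla.gov.au/nla.obj-240918828",
]
for example in examples:
    print(example, "->", trim_fa_url(example))
```

The first two examples should both reduce to the same `http://` url, which is what makes the later de-duplication step work. The third has no `findingaid` segment, so the helper returns `None`.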

## Get the data

Use the `harvest` function to get a list of finding aid urls.

In [None]:
finding_aids = harvest()

Convert the list of urls into a dataframe.

In [72]:
df = pd.DataFrame(finding_aids, columns=["url"])

How many urls are there?

In [73]:
df.shape

(146192, 1)

Remove duplicate urls.

In [74]:
df = df.drop_duplicates()

How many are there now?

In [75]:
df.shape

(2337, 1)
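
The big drop in numbers comes from many item records pointing to the same finding aid. Here is a small sketch of the same de-duplication on made-up urls (the object ids are invented), with a `value_counts` call added to show how many records pointed to each url. The `reset_index` step is optional and is an addition of this sketch; it just renumbers the rows after duplicates are dropped.

```python
import pandas as pd

# Made-up urls standing in for harvested results
urls = [
    "http://nla.gov.au/nla.obj-1/findingaid",
    "http://nla.gov.au/nla.obj-1/findingaid",
    "http://nla.gov.au/nla.obj-2/findingaid",
    "http://nla.gov.au/nla.obj-1/findingaid",
]
sample = pd.DataFrame(urls, columns=["url"])

# How many records pointed to each finding aid?
counts = sample["url"].value_counts()

# Keep the first occurrence of each url and renumber the rows
unique = sample.drop_duplicates().reset_index(drop=True)

print(counts)
print(unique.shape)
```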

Let's display a few of the urls.

In [100]:
# Display a few urls
df[:10]["url"]

0    http://nla.gov.au/nla.obj-240918828/findingaid
1    http://nla.gov.au/nla.obj-241093353/findingaid
2    http://nla.gov.au/nla.obj-241096550/findingaid
3    http://nla.gov.au/nla.obj-252896402/findingaid
4    http://nla.gov.au/nla.obj-241098568/findingaid
5    http://nla.gov.au/nla.obj-294499964/findingaid
6    http://nla.gov.au/nla.obj-299897373/findingaid
7    http://nla.gov.au/nla.obj-241101566/findingaid
8    http://nla.gov.au/nla.obj-324174783/findingaid
9    http://nla.gov.au/nla.obj-241426022/findingaid
Name: url, dtype: object

Save the urls to a CSV file.

In [76]:
df.to_csv("finding-aids.csv", index=False)

In [None]:
# Ignore -- this is just used for testing in development
if os.getenv("GW_STATUS") == "dev":
    fa_test = harvest(200)
    df_test = pd.DataFrame(fa_test)
    assert not df_test.empty

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge).