# Collect information about digitised finding aids

In [another notebook](find-finding-aids.ipynb) we assembled a list of URLs pointing to the NLA's digitised finding aids. In this notebook we'll work through that list, extracting additional information about each finding aid. While I think the finding aids are created and stored as [EAD](https://www.loc.gov/ead/)-encoded XML files, they are delivered as HTML. As there's no machine-readable version readily available, we need to reassemble the original hierarchical structure by scraping the HTML.

This notebook assembles the following data:

- summary information for each finding aid, typically including title, collection number, creator, date range, extent, and repository
- the number of elements in each finding aid – this includes both containers (like series and boxes) and individual items
- the number of items listed in each finding aid – these are elements at the bottom of the hierarchy (so they have no children), such as documents and diaries
- the number of items digitised – by looking to see whether an item has digital images associated with it, we can find out if it's been digitised
- the number of items searchable through Trove – this information isn't scraped from the finding aid; it's found by searching Trove's 'Diaries, Letters, and Archives' category for records containing the finding aid's URL
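
To make these definitions concrete, here is a minimal sketch that counts elements, items, and digitised items in a small hand-built tree. The tree and the `walk` helper are invented for illustration and are not part of the notebook's code; the dictionaries simply mirror the `children` and `digitised` keys that the scraping functions produce.

```python
# Toy finding aid: one series containing a file (with two letters) and a diary.
# This structure is invented for illustration only.
toy_finding_aid = [
    {
        "title": "Series 1",
        "digitised": False,
        "children": [
            {
                "title": "File 1",
                "digitised": False,
                "children": [
                    {"title": "Letter A", "digitised": True, "children": []},
                    {"title": "Letter B", "digitised": False, "children": []},
                ],
            },
            {"title": "Diary", "digitised": True, "children": []},
        ],
    }
]


def walk(nodes):
    """Yield every node in the tree, containers and items alike."""
    for node in nodes:
        yield node
        yield from walk(node["children"])


# Elements are all nodes; items are nodes without children.
elements = list(walk(toy_finding_aid))
items = [node for node in elements if not node["children"]]
digitised_items = [node for node in items if node["digitised"]]

print(len(elements), len(items), len(digitised_items))
```

Here the series and the file are containers, so they count as elements but not as items.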


## Set things up

In [406]:
import os
import re
from copy import deepcopy

import pandas as pd
import requests_cache
from bs4 import BeautifulSoup, NavigableString, Tag
from tqdm.auto import tqdm

s = requests_cache.CachedSession(timeout=30)

In [407]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

Insert your Trove API key where indicated below.

In [408]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Define some functions

In [409]:
def is_digitised(node):
    """
    Check if a node has digitised images associated with it.
    """
    if "class" in node.parent.attrs and "image-and-title" in node.parent.attrs["class"]:
        return True
    return False


def get_first_page(node):
    """
    Get the first page image identifier from a set of digitised images associated with a node.
    """
    if img := node.parent.find("img", class_="thumbnail"):
        return img["data-pid"]


def extract_node_details(node, level):
    """
    Extract the details of the supplied finding aid node.
    """
    details = {
        "id": node.attrs["id"],
        "title": node.get_text().strip(),
        "description": get_text(node),
        "digitised": is_digitised(node),
        "first_page": "",
    }
    if details["digitised"]:
        try:
            details["first_page"] = get_first_page(node)
        except KeyError:
            pass
    details["children"] = get_children(node, level)
    return details


def get_text(node):
    """
    Combine any bare text in the node's parent element, plus the text of its <p> children,
    into a single string.
    """
    text = []
    for child in node.parent.children:
        if isinstance(child, NavigableString):
            if string := str(child).strip():
                text.append(string)
        elif isinstance(child, Tag) and child.name == "p":
            # Use get_text() because .string is None when the <p> contains nested tags
            text.append(child.get_text().strip())
    return "\n".join(text)


def get_children(parent, level):
    """
    Get all items at the next level down within the hierarchy starting from the current element's parent.
    """
    children = []
    for child in parent.parent.find_all(class_=f"ctag-heading c0{level + 1}"):
        children.append(extract_node_details(child, level + 1))
    return children


def get_string_by_id(soup, tag, id):
    """
    Get the string value of the required html tag/id.
    """
    try:
        return str(soup.find(tag, id=id).string)
    except AttributeError:
        return ""


def get_fa_details(soup):
    """
    Get basic details of the finding aid.
    """
    return {
        "title": get_string_by_id(soup, "h1", "contentrow"),
        "collection_number": get_string_by_id(soup, "h2", "collection-number"),
    }


def check_tag(node, tag):
    """
    Check if the supplied node is a tag of the specified type.
    """
    if isinstance(node, Tag) and node.name == tag:
        return True
    return False


def find_sibling_tag(node, tag):
    """
    Find the next sibling of the current node that is a tag of the specified type.
    """
    for sib in node.next_siblings:
        if check_tag(sib, tag):
            return sib


def get_fa_summary(soup):
    """
    Get the collection summary (title, creator, date range, etc.) as a dict keyed by normalised label.
    """
    summary_data = {}
    if summary_head := soup.find(id="collection-summary"):
        summary = find_sibling_tag(summary_head, "dl")
        for label in summary.find_all("dt"):
            value = find_sibling_tag(label, "dd")
            summary_data[
                str(label.string).lower().strip().replace(" ", "_")
            ] = value.get_text().strip()
    return summary_data


def convert_finding_aid(soup):
    """
    Extract the hierarchical structure of a html finding aid into a dict (which can be saved as JSON).
    """
    fa_data = get_fa_summary(soup)
    if "title" not in fa_data:
        fa_data = get_fa_details(soup)
    fa_items = []
    for node in soup.find_all(class_="ctag-heading c01"):
        fa_items.append(extract_node_details(node, 1))
    fa_data["items"] = fa_items
    return fa_data


def find_leaves(node):
    """
    Recurse through a collection of finding aid nodes
    to find those at the bottom of the hierarchy (the leaves).

    Leaves are identified by having no children.
    """
    if isinstance(node, list):
        for i in node:
            for x in find_leaves(i):
                yield x
    elif isinstance(node, dict):
        if node["children"] == []:
            yield node
        else:
            for j in node["children"]:
                for x in find_leaves(j):
                    yield x
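
The key normalisation used in `get_fa_summary` can be tried in isolation. This is a minimal sketch: the `label_to_key` helper and the label text are assumptions made for illustration (the values are borrowed from the first row of the results further down), and no HTML is parsed here.

```python
def label_to_key(label):
    """Lowercase a label, trim whitespace, and replace spaces with underscores."""
    return label.lower().strip().replace(" ", "_")


# Illustrative label/value pairs standing in for <dt>/<dd> pairs.
# The label wording is assumed, not scraped.
pairs = [
    ("Collection number", "MS 223"),
    ("Date range", " 1901-1926 "),
    (" Language of materials ", "English"),
]

summary = {label_to_key(label): value.strip() for label, value in pairs}
print(summary)
```

Normalising the labels this way is what gives the dataframe columns names like `collection_number` and `language_of_materials`.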

In [461]:
finding_aids = pd.read_csv("https://raw.githubusercontent.com/GLAM-Workbench/nla-finding-aids-data/main/finding-aids.csv").to_dict("records")

In [462]:
def add_fa_totals(finding_aids):
    """
    Add summary details and element, item, and digitised item totals to each finding aid.
    """
    for fa in tqdm(finding_aids):
        response = s.get(fa["url"])
        if response.ok:
            soup = BeautifulSoup(response.text)
            fa_data = convert_finding_aid(soup)
            fa.update(fa_data)
            # The nested items are only needed for counting, so drop them from the record
            del fa["items"]
            # Count every element in the hierarchy, containers and items alike
            fa["total_elements"] = len(
                soup.find_all(class_=re.compile(r"ctag-heading c0\d"))
            )
            # Items are the leaves of the hierarchy
            leaves = list(find_leaves(fa_data["items"]))
            df = pd.DataFrame(leaves)
            fa["total_items"] = df.shape[0]
            if not df.empty:
                fa["total_digitised_items"] = df.loc[df["digitised"]].shape[0]
    return finding_aids

In [None]:
# Copy the fa list and add in the additional info
fa_totals = add_fa_totals(deepcopy(finding_aids))

## How many are searchable at item-level within the Trove diaries category?

In [410]:
API_URL = "http://api.trove.nla.gov.au/v2/result"

params = {
    "zone": "collection",
    "reclevel": "full",
    "encoding": "json",
    "n": 0,
    "bulkHarvest": "true",
    "key": API_KEY,
}


def get_total(params):
    response = s.get(API_URL, params=params)
    data = response.json()
    total = int(data["response"]["zone"][0]["records"]["total"])
    return total


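def add_searchable_totals(finding_aids):
    """
    Add the number of Trove records that reference each finding aid's url.
    """
    for fa in tqdm(finding_aids):
        # Escape the dot so it only matches a literal "."
        fa_id = re.search(r"nla\.obj-\d+", fa["url"]).group(0)
        params["q"] = f'"{fa_id}/findingaid"'
        total = get_total(params)
        fa["total_searchable"] = total
    return finding_aids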
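
The two steps in these functions, building the query and reading the total, can be sketched without calling the API. The response below is a hand-written stand-in that contains only the keys `get_total` reads; its value is invented and it should not be taken as a complete picture of what the API returns.

```python
import json
import re

url = "http://nla.gov.au/nla.obj-240918828/findingaid"

# Pull the object identifier out of the finding aid url.
fa_id = re.search(r"nla\.obj-\d+", url).group(0)

# Quote the search string, mirroring the quoting in add_searchable_totals.
query = f'"{fa_id}/findingaid"'

# Hand-written stand-in for an API response (assumed shape, invented value).
sample_response = json.loads(
    '{"response": {"zone": [{"name": "collection", "records": {"total": "1"}}]}}'
)

# Follow the same path through the JSON as get_total, converting to an int.
total = int(sample_response["response"]["zone"][0]["records"]["total"])

print(query, total)
```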

In [None]:
fa_totals = add_searchable_totals(fa_totals)

In [418]:
df_fa = pd.DataFrame(fa_totals)

# Make totals ints
total_cols = ["total_elements", "total_items", "total_digitised_items", "total_searchable"]
df_fa[total_cols] = df_fa[total_cols].fillna(value=0).astype("Int64")

df_fa.head()

Unnamed: 0,url,creator,title,date_range,collection_number,extent,language_of_materials,repository,total_elements,total_items,total_digitised_items,total_searchable,sponsor,abstract,physical_description,physical_facet
0,http://nla.gov.au/nla.obj-240918828/findingaid,Thomas Vere Hodgson,Papers of Thomas Vere Hodgson,1901-1926,MS 223,0.36 metres (2 archive boxes),English,National Library of Australia,69,69,0,1,,,,
1,http://nla.gov.au/nla.obj-241093353/findingaid,Lawrence Hargrave,Papers of Lawrence Hargrave,1908-1915,MS 352,1.02 metres (4 archive boxes + 1 small folio box),English,National Library of Australia,37,37,0,1,,,,
2,http://nla.gov.au/nla.obj-241096550/findingaid,Samuel Mauger,Papers of Samuel Mauger,1895-1946,MS 403,0.15 metres (1 box),English,National Library of Australia,14,14,0,1,,,,
3,http://nla.gov.au/nla.obj-252896402/findingaid,George Mackaness,Papers of George Mackaness,1853-1962,MS 534,3.24 metres (18 archive boxes),English,National Library of Australia,16,16,0,1,,,,
4,http://nla.gov.au/nla.obj-241098568/findingaid,W. Farmer Whyte,Papers of W. Farmer Whyte,1766-1952,MS 563,0.78 metres (1 archives box + 1 large folio box),English,National Library of Australia,61,53,0,1,,,,


## Totals

How many finding aids are there?

In [444]:
total_fa = df_fa.shape[0]
total_fa

2337

An element is a component within the finding aid's hierarchical structure. It might be a container – like a series, box, or file – or it might be an individual document. Elements that are containers will have children. How many elements are included in the digitised finding aids?

In [437]:
total_elements = df_fa["total_elements"].sum()
total_elements

473096

An item is an element at the bottom of the finding aid's hierarchy (a leaf of the finding aid's tree). It might be a letter or a diary. Items don't have any children of their own, but can have digitised images associated with them in the finding aid. How many items are listed in the finding aids?

In [434]:
total_items = df_fa["total_items"].sum()
total_items

435537

If an item has digital images associated with it in the finding aid, it's counted as being digitised. How many items are digitised?

In [427]:
total_digitised = df_fa["total_digitised_items"].sum()
total_digitised

128619

What proportion of items listed in finding aids are digitised?

In [439]:
print(
    f"{total_digitised / total_items: 0.2%} of items listed in finding aids are digitised."
)

 29.53% of items listed in finding aids are digitised.


In [438]:
total_searchable = df_fa["total_searchable"].sum()
total_searchable

147228

In [440]:
print(
    f"{total_searchable / total_elements: 0.2%} of elements listed in finding aids can be found by Trove searches."
)

 31.12% of elements listed in finding aids can be found by Trove searches.
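
As a quick check on the arithmetic, the two proportions can be recomputed from the totals reported above. This is only a sketch with the numbers typed in by hand; note that the notebook measures digitisation against items and searchability against elements.

```python
# Totals copied from the outputs above.
total_elements = 473096
total_items = 435537
total_digitised = 128619
total_searchable = 147228

# Digitisation is measured against items (the leaves only).
digitised_share = total_digitised / total_items

# Searchability is measured against all elements (containers and items).
searchable_share = total_searchable / total_elements

print(f"{digitised_share:.2%} digitised, {searchable_share:.2%} searchable")
```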


In [443]:
fa_digitised = df_fa.loc[df_fa["total_digitised_items"] > 0].shape[0]
fa_digitised

846

In [445]:
print(
    f"{fa_digitised / total_fa: 0.2%} of finding aids have at least some items digitised."
)

 36.20% of finding aids have at least some items digitised.


In [449]:
# Assumption: a total of 1 is the record for the finding aid itself, so look for more than 1
fa_searchable = df_fa.loc[df_fa["total_searchable"] > 1].shape[0]
fa_searchable

692

In [450]:
print(
    f"{fa_searchable / total_fa: 0.2%} of finding aids have at least some items that can be found in a Trove search."
)

 29.61% of finding aids have at least some items that can be found in a Trove search.


In [424]:
df_fa.loc[df_fa["total_searchable"] <= 1].shape[0]

1645

The following finding aids either couldn't be retrieved or list no items:

In [388]:
df_fa.loc[df_fa["total_items"] == 0]["url"]

45       http://nla.gov.au/nla.obj-242432412/findingaid
820      http://nla.gov.au/nla.obj-712634851/findingaid
874      http://nla.gov.au/nla.obj-223908896/findingaid
886      http://nla.gov.au/nla.obj-242374146/findingaid
925      http://nla.gov.au/nla.obj-244365404/findingaid
1242    http://nla.gov.au/nla.obj-1065313126/findingaid
1396    http://nla.gov.au/nla.obj-2536380656/findingaid
1649    http://nla.gov.au/nla.obj-2973770245/findingaid
2059     http://nla.gov.au/nla.obj-253224791/findingaid
2088     http://nla.gov.au/nla.obj-415329506/findingaid
2316     http://nla.gov.au/nla.obj-342621178/findingaid
Name: url, dtype: object

In [352]:
df_fa["repository"].value_counts()

National Library of Australia        970
Special Collections (Manuscripts)    557
Australian Joint Copying Project     533
Printed Collections                  168
NLA                                   89
Naitonal Library of Australia          2
national Library of Australia          2
Published Collections                  1
Naitonal Library oF Australia          1
Natinal Library of Australia           1
National Libaray of Australia          1
National LIbrary of Australia          1
National Library of AUstralia          1
Name: repository, dtype: int64

In [448]:
# Number with searchable items
df_fa.loc[df_fa["total_searchable"] > 1].shape[0]

692

## AJCP

How many finding aids are part of the AJCP?

In [425]:
df_fa.loc[df_fa["repository"] == "Australian Joint Copying Project"].shape[0]

533

In [433]:
ajcp_items = df_fa.loc[df_fa["repository"] == "Australian Joint Copying Project"][
    "total_items"
].sum()
ajcp_items

103513

In [426]:
ajcp_digitised = df_fa.loc[df_fa["repository"] == "Australian Joint Copying Project"][
    "total_digitised_items"
].sum()
ajcp_digitised

100810

In [431]:
print(
    f"{ajcp_digitised / total_digitised: 0.2%} of items digitised are part of the AJCP."
)

 78.38% of items digitised are part of the AJCP.


In [451]:
ajcp_elements = df_fa.loc[df_fa["repository"] == "Australian Joint Copying Project"][
    "total_elements"
].sum()
ajcp_elements

118941

In [436]:
ajcp_searchable = df_fa.loc[df_fa["repository"] == "Australian Joint Copying Project"][
    "total_searchable"
].sum()
ajcp_searchable

114089

In [454]:
print(
    f"{ajcp_searchable / ajcp_elements: 0.2%} of elements in AJCP finding aids can be found in Trove searches."
)

 95.92% of elements in AJCP finding aids can be found in Trove searches.


In [455]:
print(
    f"{ajcp_searchable / total_searchable: 0.2%} of elements found in Trove searches are part of the AJCP."
)

 77.49% of elements found in Trove searches are part of the AJCP.


In [354]:
df_fa.loc[
    (df_fa["repository"] == "Australian Joint Copying Project")
    & (df_fa["total_searchable"] <= 1)
]

Unnamed: 0,url,creator,title,date_range,collection_number,extent,language_of_materials,repository,total_elements,total_items,total_digitised_items,total_searchable,sponsor,abstract,physical_description,physical_facet,ms_number
1086,http://nla.gov.au/nla.obj-752720615/findingaid,"Smyth, Arthur Bowes",Journal of Arthur Bowes Smyth (as filmed by th...,1787 - 1789,M933,1 item,English,Australian Joint Copying Project,1,1,1,1,The Australian Joint Copying Project (AJCP) on...,,,,
1228,http://nla.gov.au/nla.obj-836340466/findingaid,"Thomas, Robert (Llanfechain, Montgomeryshire)",Papers of Robert Thomas (as filmed by the AJCP),1863 - 1864,M2090,1 item,English,Australian Joint Copying Project,1,1,1,1,The Australian Joint Copying Project (AJCP) on...,,,,
2010,http://nla.gov.au/nla.obj-753102471/findingaid,"Lee, John",Correspondence of John Lee (as filmed by the A...,1840 - 1865,M678,1 items,English,Australian Joint Copying Project,48,45,45,1,The Australian Joint Copying Project (AJCP) on...,,,,


## Clean up the dataframe for saving

In [383]:
# Remove little used columns and change order
df_fa[
    [
        "url",
        "collection_number",
        "title",
        "creator",
        "date_range",
        "extent",
        "language_of_materials",
        "repository",
        "sponsor",
        "abstract",
        "physical_description",
        "total_elements",
        "total_items",
        "total_digitised_items",
        "total_searchable",
    ]
].sort_values("title").to_csv("finding-aids-totals.csv", index=False)

In [None]:
# Ignore -- this is just used for testing in development
if os.getenv("GW_STATUS") == "dev":
    fa_test = add_fa_totals(deepcopy(finding_aids[:10]))
    fa_test = add_searchable_totals(fa_test)
    df_test = pd.DataFrame(fa_test)
    assert not df_test.empty
    for col in [
        "title",
        "url",
        "collection_number",
        "total_elements",
        "total_items",
        "total_digitised_items",
        "total_searchable",
    ]:
        assert col in df_test.columns

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge).