# Convert an HTML finding aid to JSON

While I think the finding aids are created and stored as [EAD](https://www.loc.gov/ead/) encoded XML files, they are delivered as HTML. This means that to reassemble the finding aid hierarchy in a way that facilitates analysis, we have to scrape the HTML and make a few assumptions about the content.

This notebook:

- Scrapes data from the HTML of a finding aid, saving the hierarchy of series, sub-series, and items as a list of nested objects. The results can be saved as a JSON file.
- Extracts a list of items (or leaves) from the finding aid. These are the elements at the bottom of the hierarchy without any children of their own. They're the content rather than the containers – things like letters and diaries.
- Extracts a list of items together with the *paths* you follow through the hierarchy to get to them – sort of like breadcrumbs. This is useful because important contextual information can be included at other levels in the hierarchy.
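
To make the nested structure concrete, here is a minimal sketch of the shape of the data, assuming the node fields used by the functions in this notebook (`id`, `title`, `description`, `digitised`, `first_page`, `children`). The ids and titles are invented for illustration, and `count_nodes` is a hypothetical helper that isn't part of the notebook. The top-level keys of a real finding aid come from its collection summary, so they will vary.

```python
import json

# Hypothetical, hand-written data in the shape the scraper produces;
# the ids and titles are invented for illustration.
finding_aid = {
    "title": "Example papers",
    "items": [
        {
            "id": "example-1",
            "title": "Series 1. Correspondence",
            "description": "",
            "digitised": False,
            "first_page": "",
            "children": [
                {
                    "id": "example-2",
                    "title": "Letter (Item 1_1)",
                    "description": "",
                    "digitised": False,
                    "first_page": "",
                    "children": [],
                }
            ],
        }
    ],
}


def count_nodes(nodes):
    """Count every node in a list of nested nodes, containers and leaves alike."""
    return sum(1 + count_nodes(node["children"]) for node in nodes)


total = count_nodes(finding_aid["items"])
print(total)

# The structure contains only dicts, lists, strings, and booleans,
# so it can be serialised to JSON without any custom handling.
print(json.dumps(finding_aid, indent=4)[:80])
```

Here the series is a container because its `children` list is not empty, while the letter is a leaf.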


In [43]:
import json
import re
from pathlib import Path

import pandas as pd
import requests_cache
from bs4 import BeautifulSoup, NavigableString, Tag
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests_cache.CachedSession()
retries = Retry(total=10, backoff_factor=0.2, status_forcelist=[500, 502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))

In [46]:
def is_digitised(node):
    """
    Check if a node has digitised images associated with it.
    """
    if "class" in node.parent.attrs and "image-and-title" in node.parent.attrs["class"]:
        return True
    return False


def get_first_page(node):
    """
    Get the first page image identifier from a set of digitised images associated with a node.
    """
    if img := node.parent.find("img", class_="thumbnail"):
        return img["data-pid"]


def extract_node_details(node, level):
    """
    Extract the details of the supplied finding aid node.
    """
    details = {
        "id": node.attrs["id"],
        "title": node.get_text().strip(),
        "description": get_text(node),
        "digitised": is_digitised(node),
        "first_page": "",
    }
    if details["digitised"]:
        try:
            details["first_page"] = get_first_page(node)
        except KeyError:
            pass
    details["children"] = get_children(node, level)
    return details


def get_text(node):
    """
    Get any text that sits directly inside the node's parent, or inside its <p> tags,
    and combine it into a single string.
    """
    text = []
    for child in node.parent.children:
        if isinstance(child, NavigableString):
            if string := str(child).strip():
                text.append(string)
        elif isinstance(child, Tag) and child.name == "p":
            # get_text() also copes with <p> tags that contain nested tags,
            # where .string would be None
            if string := child.get_text().strip():
                text.append(string)
    return "\n".join(text)


def get_children(parent, level):
    """
    Get all items at the next level down within the hierarchy starting from the current element's parent.
    """
    children = []
    for child in parent.parent.find_all(class_=f"ctag-heading c0{level + 1}"):
        children.append(extract_node_details(child, level + 1))
    return children


def get_string_by_id(soup, tag, id):
    """
    Get the string value of the required html tag/id.
    """
    try:
        return str(soup.find(tag, id=id).string)
    except AttributeError:
        return ""


def get_fa_details(soup):
    """
    Get basic details of the finding aid.
    """
    return {
        "title": get_string_by_id(soup, "h1", "contentrow"),
        "collection_number": get_string_by_id(soup, "h2", "collection-number"),
    }


def check_tag(node, tag):
    """
    Check if the supplied node is a tag of the specified type.
    """
    if isinstance(node, Tag) and node.name == tag:
        return True
    return False


def find_sibling_tag(node, tag):
    """
    Find the next sibling of the current node that is a tag of the specified type.
    """
    for sib in node.next_siblings:
        if check_tag(sib, tag):
            return sib


def get_fa_summary(soup):
    """
    Get the collection summary of the finding aid as a dict of labels and values.
    """
    summary_data = {}
    if summary_head := soup.find(id="collection-summary"):
        summary = find_sibling_tag(summary_head, "dl")
        for label in summary.find_all("dt"):
            value = find_sibling_tag(label, "dd")
            summary_data[
                str(label.string).lower().strip().replace(" ", "_")
            ] = value.get_text().strip()
    return summary_data


def convert_finding_aid(url):
    """
    Extract the hierarchical structure of a html finding aid into a dict or JSON.
    """
    response = s.get(url)
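    # Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
    soup = BeautifulSoup(response.text, "html.parser")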
    fa_data = get_fa_summary(soup)
    if "title" not in fa_data:
        fa_data = get_fa_details(soup)
    fa_items = []
    for node in soup.find_all(class_="ctag-heading c01"):
        fa_items.append(extract_node_details(node, 1))
    fa_data["items"] = fa_items
    return fa_data

In [66]:
# Insert the url of the finding aid you want to convert
url = "http://nla.gov.au/nla.obj-225220821/findingaid"

# Get the id from the url
fa_id = re.search(r"nla\.obj-\d+", url).group(0)

# Extract the data from the HTML finding aid
fa = convert_finding_aid(url)

In [67]:
# Write the fa data to a JSON file
Path(f"finding-aid-{fa_id}.json").write_text(json.dumps(fa, indent=4))

3813065

## Extract a list of items (leaves)

In [68]:
def find_leaves(node):
    """
    Recurse through a collection of finding aid nodes
    to find those at the bottom of the hierarchy (the leaves).

    Leaves are identified by having no children.
    """
    if isinstance(node, list):
        for i in node:
            yield from find_leaves(i)
    elif isinstance(node, dict):
        if not node["children"]:
            yield node
        else:
            yield from find_leaves(node["children"])

In [69]:
leaves = list(find_leaves(fa["items"]))

In [77]:
df_leaves = pd.DataFrame(leaves)
df_leaves.head()

| | id | title | description | digitised | first_page | children |
|---|---|---|---|---|---|---|
| 0 | nla-obj-225228228 | From Edward Maitland to Alfred Deakin (Item 1_... | | False | | [] |
| 1 | nla-obj-225228253 | From Edward Maitland to Alfred Deakin (Item 1_... | | False | | [] |
| 2 | nla-obj-225228289 | Correspondence; author(s) include Mary Balmain... | | False | | [] |
| 3 | nla-obj-225228312 | From Joseph W. Evans to Alfred Deakin (Item 1_7) | | False | | [] |
| 4 | nla-obj-225228348 | From Sir Graham Berry to Alfred Deakin (Item 1_8) | | False | | [] |


In [79]:
df_leaves.to_csv(f"finding-aid-{fa_id}-leaves.csv", index=False)

## Paths to leaves

Because of the hierarchical structure of the finding aids, some descriptive information relating to individual leaves is included in the titles of their ancestors. To capture this context, we can save the paths followed to reach each leaf.
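
As an illustration of the idea, here is a minimal sketch that builds the context for each leaf in a tiny, hand-written hierarchy. The ids and titles are invented, and `walk` is a hypothetical helper rather than the function used in this notebook. It assumes each node has `id`, `title`, and `children` keys.

```python
# Hypothetical miniature hierarchy; the ids and titles are invented.
tree = [
    {
        "id": "s1",
        "title": "Series 1. Correspondence",
        "children": [
            {
                "id": "ss1",
                "title": "Subseries 1_1. Letters, 1900",
                "children": [
                    {"id": "i1", "title": "From A to B (Item 1_1_1)", "children": []},
                ],
            },
            {"id": "i2", "title": "Undated note (Item 1_2)", "children": []},
        ],
    }
]


def walk(nodes, ancestors=()):
    """Yield (leaf id, context) pairs, where context joins the ancestor titles."""
    for node in nodes:
        if node["children"]:
            # A container: descend, adding its title to the trail of ancestors
            yield from walk(node["children"], ancestors + (node["title"],))
        else:
            # A leaf: the trail of ancestor titles is its context
            yield node["id"], " / ".join(ancestors)


contexts = dict(walk(tree))
for leaf_id, context in contexts.items():
    print(leaf_id, "->", context)
```

The leaf `i1` sits two levels down, so its context combines the series and sub-series titles, while `i2` only has the series title as context.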

In [71]:
def get_paths(node, titles=None):
    """
    Recurse through a finding aid node to find its leaves, recording
    the titles of each leaf's ancestors as its context.
    """
    # Avoid a mutable default argument, which would be shared between calls
    titles = titles or []
    if not node["children"]:
        return [
            {"id": node["id"], "title": node["title"], "context": " / ".join(titles)}
        ]
    paths = []
    for child in node["children"]:
        # Pass down a new list so sibling branches don't share state
        paths.extend(get_paths(child, titles + [node["title"]]))
    return paths

In [72]:
paths = []
for series in fa["items"]:
    paths.extend(get_paths(series))

In [73]:
df_paths = pd.DataFrame(paths)
df_paths.head()

| | id | title | context |
|---|---|---|---|
| 0 | nla-obj-225228228 | From Edward Maitland to Alfred Deakin (Item 1_... | Series 1. General correspondence, 1878-1919 / ... |
| 1 | nla-obj-225228253 | From Edward Maitland to Alfred Deakin (Item 1_... | Series 1. General correspondence, 1878-1919 / ... |
| 2 | nla-obj-225228289 | Correspondence; author(s) include Mary Balmain... | Series 1. General correspondence, 1878-1919 / ... |
| 3 | nla-obj-225228312 | From Joseph W. Evans to Alfred Deakin (Item 1_7) | Series 1. General correspondence, 1878-1919 / ... |
| 4 | nla-obj-225228348 | From Sir Graham Berry to Alfred Deakin (Item 1_8) | Series 1. General correspondence, 1878-1919 / ... |


In [76]:
df_paths["context"].value_counts().to_frame()[:10]

| | context |
|---|---|
| Series 15. Prime Minister, 1901-1923 (bulk 1903-1910) / Subseries 15_1. General correspondence, 1903-1910 / Subseries 15_1_1. Congratulatory letters, 1903 | 287 |
| Series 15. Prime Minister, 1901-1923 (bulk 1903-1910) / Subseries 15_3. Imperial Conference, London, 1907 / Subseries 15_3_1. Correspondence, 1907 | 170 |
| Series 18. Post-retirement, 1913-1925 / Subseries 18_1. Correspondence, 1913 | 129 |
| Series 15. Prime Minister, 1901-1923 (bulk 1903-1910) / Subseries 15_1. General correspondence, 1903-1910 / Subseries 15_1_4. Correspondence, 1906 | 128 |
| Series 11. Federation delegate, 1884-1944 (bulk 1887-1900) / Subseries 11_13. London, 1900-1907 / Subseries 11_13_1. Correspondence, 1900-1907 | 118 |
| Series 16. Leader of the Opposition, 1902-1933 / Subseries 16_1. Correspondence, 1902-1913 / Subseries 16_1_2. Correspondence, 1904 | 114 |
| Series 1. General correspondence, 1878-1919 / Subseries 1_22. Correspondence, January-June 1908 | 107 |
| Series 1. General correspondence, 1878-1919 / Subseries 1_23. Correspondence, July-December 1908 | 106 |
| Series 16. Leader of the Opposition, 1902-1933 / Subseries 16_1. Correspondence, 1902-1913 / Subseries 16_1_3. Correspondence, 1905 | 102 |
| Series 1. General correspondence, 1878-1919 / Subseries 1_15. Correspondence, 1904 | 94 |


In [80]:
df_paths.to_csv(f"finding-aid-{fa_id}-paths.csv", index=False)

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/). Support this project by becoming a [GitHub sponsor](https://github.com/sponsors/wragge).