# Create a list of Trove's digital periodicals

Everyone knows about Trove's newspapers, but there is also a growing collection of digitised periodicals available in the 'Magazines & newsletters' category. They're not easy to find, however, which is why I created the [Trove Titles](https://trove-titles.herokuapp.com/) web app.

This notebook uses the Trove API to harvest metadata relating to periodicals available from Trove in digital form. As well as digitised publications, this includes born digital publications in formats like PDF and MOBI that have been made available through the edeposit program.

The search strategy to find digitised (and digital) periodicals takes advantage of the fact that Trove's digital resources (excluding the newspapers) all have an identifier that includes the string `nla.obj`. So we start by searching in the journals zone for records that include `nla.obj`. However, this search returns results for individual articles from periodicals, as well as the periodicals themselves, so to try to exclude the articles we add `NOT format:Article` to the query. To make sure that digital copies of the periodicals are actually available, we loop through all the search results, checking whether each record includes a `fulltext` link to a digital copy. If it does, the record is saved.
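
As a rough sketch of the two steps just described, the snippet below builds the query string and then filters one work's links for a digital copy. It uses only the standard library and makes no API request (so no key is needed). The `sample_links` list is made up for illustration: its keys mirror the ones this notebook's code reads (`linktype`, `linktext`, `value`), but the `linktext` wording and the thumbnail entry are assumptions, and real records may look different.

```python
from urllib.parse import urlencode

# Query described above: records mentioning "nla.obj", minus individual articles.
params = {
    "q": '"nla.obj" NOT format:Article',
    "zone": "article",
    "include": "links",
    "encoding": "json",
}
query_string = urlencode(params)
print(query_string)

# Illustrative (assumed) links for a single work record.
sample_links = [
    {
        "linktype": "fulltext",
        "linktext": "Digitised item (assumed wording)",
        "value": "https://nla.gov.au/nla.obj-875780662",
    },
    {"linktype": "thumbnail", "value": "https://example.org/thumb.jpg"},
]

# Keep only fulltext links that point at an nla.obj identifier.
digital_links = [
    link["value"]
    for link in sample_links
    if link.get("linktype") == "fulltext" and "nla.obj" in link.get("value", "")
]
print(digital_links)
```

A record with an empty `digital_links` list would be skipped by the harvest.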

You can see the results in [this CSV file](digital-journals-20220831.csv). Obviously you could extract additional metadata from each record if you wanted to.

The default fields are:

* `title` – the title of the periodical
* `contributor` – information about creator or publisher
* `issued` – publication date, or date range
* `format` – the type of publication; all entries should include 'Periodical', but may also include other types such as 'Government publication'
* `trove_id` – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital periodical
* `trove_url` – url of the periodical's metadata record in Trove
* `fulltext_url` – the url of the landing page of the digital version of the periodical
* `fulltext_url_type` – the type of digital periodical, one of 'digitised', 'edeposit', or 'other'
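
If you just want to work with these fields outside of pandas, something like the following sketch might help. It reads rows with the standard library's `csv` module and keeps the ones flagged as `digitised`. To keep it self-contained, the two rows are an in-memory stand-in modelled on records shown later in this notebook, not the full CSV file; it also assumes multiple formats are joined with ` | `, as the harvesting code does.

```python
import csv
import io

# Stand-in for the harvested CSV: two rows modelled on records in this notebook.
sample_csv = io.StringIO(
    "title,contributor,issued,format,fulltext_url,trove_url,trove_id,fulltext_url_type\n"
    '"Triad (Sydney, N.S.W.)",,1892-1927,"Periodical | Periodical/Journal, magazine, other",'
    "https://nla.gov.au/nla.obj-875780662,https://trove.nla.gov.au/work/27592184,"
    "nla.obj-875780662,digitised\n"
    '"New Triad (Online)",,1927-1928,"Periodical | Periodical/Journal, magazine, other",'
    "http://nla.gov.au/nla.obj-788254980,https://trove.nla.gov.au/work/238053099,"
    "nla.obj-788254980,other\n"
)

rows = list(csv.DictReader(sample_csv))

# Keep only the periodicals whose fulltext link was labelled as digitised.
digitised = [row for row in rows if row["fulltext_url_type"] == "digitised"]
for row in digitised:
    # Multiple formats are stored in one cell, separated by " | ".
    print(row["trove_id"], row["title"], row["format"].split(" | "))
```

To use the real data, you would replace `sample_csv` with an open file handle for the downloaded CSV.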

I've used this list to [harvest all the OCRd text from digitised periodicals](Download-text-for-all-digitised-journals.ipynb).

In [1]:
# Let's import the libraries we need.
import os
import re
import time
from datetime import datetime

import pandas as pd
import requests
from IPython.display import HTML, display
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# from slugify import slugify
from tqdm.notebook import tqdm

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

## Add your Trove API key

You can get a Trove API key by [following these instructions](https://help.nla.gov.au/trove/building-with-trove/api).

In [2]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [3]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

## Define some functions to do the work

In [4]:
def get_total_results(params):
    """
    Get the total number of results for a search.
    """
    these_params = params.copy()
    these_params["n"] = 0
    response = s.get("https://api.trove.nla.gov.au/v2/result", params=these_params)
    data = response.json()
    return int(data["response"]["zone"][0]["records"]["total"])


def get_fulltext_urls(links):
    """
    Loop through the identifiers to find a link to the digital version of the journal.
    """
    urls = []
    for link in links:
        if link["linktype"] == "fulltext" and "nla.obj" in link["value"]:
            url = link["value"]
            if "digitised" in link["linktext"].lower():
                url_type = "digitised"
            elif "edeposit" in link["linktext"].lower():
                url_type = "edeposit"
            else:
                url_type = "other"
            urls.append({"url": url, "url_type": url_type})
    return urls


def listify(value):
    if not isinstance(value, list):
        value = [value]
    return value


def format_list(record, field):
    value = record.get(field, [])
    value = listify(value)
    return " | ".join(value)


def get_titles():
    """
    Harvest metadata about digitised journals.
    With a little adaptation, this basic pattern could be used to harvest
    other types of works from Trove.
    """
    url = "https://api.trove.nla.gov.au/v2/result"
    titles = []
    params = {
        # We can 'NOT' the format facet in the query
        # Alternative query using the hyphenated form: '"nla.obj-" NOT format:Article'
        "q": '"nla.obj" NOT format:Article',
        "zone": "article",
        # "l-format": format_type,  # Journals only
        # 'l-format': 'Government publication',
        "include": "links",
        "bulkHarvest": "true",  # Needed to maintain a consistent order across requests
        "key": API_KEY,
        "n": 100,
        "encoding": "json",
    }
    start = "*"
    total = get_total_results(params)
    with tqdm(total=total) as pbar:
        while start:
            params["s"] = start
            response = s.get(url, params=params)
            data = response.json()
            try:
                works = data["response"]["zone"][0]["records"]["work"]
            except KeyError:
                # It seems that if the result set ends with a full page of results,
                # the nextStart value is included even though there are no more results.
                # Reset works so the previous page isn't processed a second time.
                works = []
            for work in works:
                # Check to see if there's a link to a digital version
                try:
                    fulltext_urls = get_fulltext_urls(work["identifier"])
                except (KeyError, TypeError):
                    pass
                else:
                    for fulltext_url in fulltext_urls:
                        # Skip any link without a recognisable nla.obj identifier
                        match = re.search(r"(nla\.obj-\d+)", fulltext_url["url"])
                        if not match:
                            continue
                        trove_id = match.group(1)
                        # Get basic metadata
                        # You could add more work data here
                        # Check the Trove API docs for work record structure
                        title = {
                            "title": work["title"],
                            "contributor": format_list(work, "contributor"),
                            "issued": work.get("issued", ""),
                            "format": format_list(work, "type"),
                            "fulltext_url": fulltext_url["url"],
                            "trove_url": work["troveUrl"],
                            "trove_id": trove_id,
                            "fulltext_url_type": fulltext_url["url_type"],
                        }
                        titles.append(title)
            # If there's a nextStart value then we use it to request the next page of results
            try:
                start = data["response"]["zone"][0]["records"]["nextStart"]
            except KeyError:
                start = None
            pbar.update(len(works))
            time.sleep(0.2)
    return titles
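
The cursor-based paging in `get_titles()` can be hard to follow in the middle of the metadata handling, so here is a stripped-down sketch of just that pattern. It makes no API calls: `fetch_page()` and `FAKE_PAGES` are hypothetical stand-ins for a request and its `records` payload, and the last page is deliberately empty to mimic the edge case where a `nextStart` value leads to a page with no works.

```python
# Stubbed pages keyed by cursor value. The page reached via "page3" has no
# works and no nextStart, mimicking the end of a result set.
FAKE_PAGES = {
    "*": {"work": ["a", "b"], "nextStart": "page2"},
    "page2": {"work": ["c", "d"], "nextStart": "page3"},
    "page3": {},
}


def fetch_page(cursor):
    """Hypothetical stand-in for an API request; returns the 'records' part."""
    return FAKE_PAGES[cursor]


def harvest():
    """Follow nextStart cursors until a page no longer supplies one."""
    harvested = []
    cursor = "*"  # The initial cursor value
    while cursor:
        records = fetch_page(cursor)
        # Default to an empty list so an empty page adds nothing
        works = records.get("work", [])
        harvested.extend(works)
        # No nextStart means we've reached the end
        cursor = records.get("nextStart")
    return harvested


print(harvest())
```

Using `.get()` with defaults is one way of making sure an empty final page contributes nothing and ends the loop.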

## Run the harvest

In [None]:
titles = get_titles()

In [7]:
df = pd.DataFrame(titles)

In [8]:
# Save as CSV and display a download link
csv_file = f'digital-journals-{datetime.now().strftime("%Y%m%d")}.csv'
df.to_csv(csv_file, index=False)
display(HTML(f'<a href="{csv_file}" download="{csv_file}">{csv_file}</a>'))

In [9]:
# How many journals are there?
df.shape

(8728, 8)

Let's have a look at the different formats in the dataset. Remember that most records have multiple formats.

In [10]:
df["format"] = df["format"].str.split(" | ", regex=False)
formats = df.explode("format")
formats["format"].value_counts()

Periodical                             8330
Periodical/Journal, magazine, other    8269
Government publication                 4462
Conference Proceedings                  411
Book                                    177
Archived website                        171
Microform                               130
Book/Illustrated                        101
Periodical/Newspaper                     81
Map                                       6
Book/Large print                          2
Thesis                                    2
Video                                     1
Audio book                                1
Sound                                     1
Sound/Other sound                         1
Name: format, dtype: int64

Seems that we've captured a few 'Books' as well as a lot of 'Government publications'. We could try to exclude them, but the metadata is a bit inconsistent, so I think it's safest to keep the dataset in its current form and manage oddities as we come across them. We can also look at the types of fulltext link.

In [11]:
df["fulltext_url_type"].value_counts()

edeposit     7184
digitised    1474
other          70
Name: fulltext_url_type, dtype: int64

So most of the periodicals in the dataset have come via the edeposit scheme. Even though these are in digital form and available online, there can be access restrictions – you might only be able to view them onsite at a library.

For some reason there are a number of duplicates in the dataset, where multiple Trove work records point to the same digitised journal. Again, I'm leaving them in the dataset just in case there's useful information in the duplicate records. If you need to, you can display and remove duplicates like this.

In [12]:
# Show dupes
df.loc[df.duplicated(subset=["trove_id"], keep=False)].sort_values(
    by=["trove_id", "fulltext_url_type"]
)

Unnamed: 0,title,contributor,issued,format,fulltext_url,trove_url,trove_id,fulltext_url_type
6855,"Wings (Sydney, N.S.W. Online)",,2019-2022,"[Periodical, Periodical/Journal, magazine, other]",https://nla.gov.au/nla.obj-1226109179,https://trove.nla.gov.au/work/248566050,nla.obj-1226109179,edeposit
7746,"Wings (Sydney, N.S.W. Print)",Royal Australian Air Force Association.,1946-2022,"[Periodical, Periodical/Journal, magazine, other]",https://nla.gov.au/nla.obj-1226109179,https://trove.nla.gov.au/work/30060307,nla.obj-1226109179,edeposit
1715,[Event programme] / Australian Festival of Cha...,,1990-2022,"[Periodical, Periodical/Journal, magazine, other]",https://nla.gov.au/nla.obj-1252107366,https://trove.nla.gov.au/work/205602387,nla.obj-1252107366,edeposit
4936,[Event programme] / Australian Festival of Cha...,Australian Festival of Chamber Music (Townsvil...,1990-2022,"[Periodical, Periodical/Journal, magazine, other]",https://nla.gov.au/nla.obj-1252107366,https://trove.nla.gov.au/work/237613201,nla.obj-1252107366,edeposit
508,Newsletter (Fassifern Field Naturalists Club),Newsletter (Fassifern Field Naturalists Club),1990-2022,"[Periodical, Periodical/Journal, magazine, other]",https://nla.gov.au/nla.obj-1252109161,https://trove.nla.gov.au/work/163410772,nla.obj-1252109161,edeposit
...,...,...,...,...,...,...,...,...
5013,"Bookfellow (Sydney, N.S.W. : 1911 : Online)",,1911-1925,"[Periodical, Periodical/Journal, magazine, other]",http://nla.gov.au/nla.obj-768936943,https://trove.nla.gov.au/work/238051629,nla.obj-768936943,other
8384,The New Triad,,1927-2022,"[Periodical, Periodical/Journal, magazine, other]",http://nla.gov.au/nla.obj-788254980,https://trove.nla.gov.au/work/5552221,nla.obj-788254980,digitised
5015,New Triad (Online),,1927-1928,"[Periodical, Periodical/Journal, magazine, other]",http://nla.gov.au/nla.obj-788254980,https://trove.nla.gov.au/work/238053099,nla.obj-788254980,other
7679,"Triad (Sydney, N.S.W.)",,1892-1927,"[Periodical, Periodical/Journal, magazine, other]",https://nla.gov.au/nla.obj-875780662,https://trove.nla.gov.au/work/27592184,nla.obj-875780662,digitised


In [15]:
# How many unique ids after duplicates are removed
df.drop_duplicates(subset=["trove_id"], keep="last").shape[0]

8662
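
Rather than dropping duplicates straight away, you might want to see which work records share an identifier. The sketch below groups work urls by `trove_id` using only the standard library. The three pairs are copied from the duplicate listing above; with the full dataset you would build `records` from the `trove_id` and `trove_url` columns instead.

```python
from collections import defaultdict

# (trove_id, trove_url) pairs taken from the duplicate listing above.
records = [
    ("nla.obj-788254980", "https://trove.nla.gov.au/work/5552221"),
    ("nla.obj-788254980", "https://trove.nla.gov.au/work/238053099"),
    ("nla.obj-875780662", "https://trove.nla.gov.au/work/27592184"),
]

# Collect every work url that points to the same digital object.
works_by_id = defaultdict(list)
for trove_id, trove_url in records:
    works_by_id[trove_id].append(trove_url)

# Keep only the identifiers shared by more than one work record.
shared = {tid: urls for tid, urls in works_by_id.items() if len(urls) > 1}
for tid, urls in shared.items():
    print(tid, len(urls), urls)
```

In this small sample only the 'New Triad' identifier is shared, because just one of the 'Triad' work records is visible in the truncated listing.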

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).

Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).