# Create a list of Trove's digitised journals

Everyone knows about Trove's newspapers, but there is also a growing collection of digitised journals available in the journals zone. They're not easy to find, however, which is why I created the [Trove Titles](https://trove-titles.herokuapp.com/) web app.

This notebook uses the Trove API to harvest metadata relating to digitised journals – or more accurately, journals that are freely available online in a digital form. This includes some born digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

The search strategy to find digitised (and digital) journals takes advantage of the fact that Trove's digitised resources (excluding the newspapers) all have an identifier that includes the string `nla.obj`. So we start by searching in the journals zone for records that include `nla.obj` and have the `format` 'Periodical'. By specifying 'Periodical' we exclude individual articles from digitised journals.

Then it's just a matter of looping through all the results and checking to see if a record includes a `fulltext` link to a digital copy. If it does, the record is saved.
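
As a rough illustration of that check, the sketch below runs the same test on a hand-made record. The sample identifier list is invented for the example (it only mimics the `linktype`, `linktext` and `value` keys that the harvesting code relies on), and `find_fulltext_link` is a hypothetical helper name, not part of the Trove API.

```python
# Invented sample of a work record's identifiers; real ones come from the Trove API
sample_identifiers = [
    {'linktype': 'thumbnail', 'value': 'https://example.org/thumb.jpg'},
    {
        'linktype': 'fulltext',
        'linktext': 'National Library of Australia digitised item',
        'value': 'http://nla.gov.au/nla.obj-657473276',
    },
]


def find_fulltext_link(links):
    '''
    Return (url, nla_digitised) for the first fulltext link containing 'nla.obj',
    or (None, False) if there isn't one.
    '''
    for link in links:
        if link.get('linktype') == 'fulltext' and 'nla.obj' in link.get('value', ''):
            nla_digitised = link.get('linktext') == 'National Library of Australia digitised item'
            return link['value'], nla_digitised
    return None, False


print(find_fulltext_link(sample_identifiers))
print(find_fulltext_link([]))
```

Returning a `(None, False)` pair when nothing matches is a design choice made for this sketch: it means the caller can always unpack two values without having to catch an error.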

You can see the results in [this CSV file](digital-journals.csv). Obviously you could extract additional metadata from each record if you wanted to.

The default fields are:

* `fulltext_url` – the url of the landing page of the digital version of the journal
* `title` – the title of the journal
* `trove_id` – the 'nla.obj' part of the fulltext_url, a unique identifier for the digital journal
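* `trove_url` – url of the journal's metadata record in Trove
* `nla_digitised` – `True` if the fulltext link is labelled 'National Library of Australia digitised item', otherwise `False`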
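
To make the relationship between `fulltext_url` and `trove_id` concrete, here's a minimal sketch that pulls the identifier out of a url with a regular expression. The function name `extract_trove_id` is made up for illustration, and the url is one of the journals in the harvested list.

```python
import re


def extract_trove_id(fulltext_url):
    '''
    Return the 'nla.obj-' identifier found in a url, or None if there isn't one.
    '''
    match = re.search(r'nla\.obj-\d+', fulltext_url)
    return match.group(0) if match else None


print(extract_trove_id('https://nla.gov.au/nla.obj-614066685'))
# nla.obj-614066685
```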

I've used this list to harvest all the OCRd text from digitised journals.

In [8]:
# Let's import the libraries we need.
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import json
import os
import re
from tqdm.auto import tqdm
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from slugify import slugify
from IPython.display import display, FileLink

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

## Add your Trove API key

You can get a Trove API key by [following these instructions](https://help.nla.gov.au/trove/building-with-trove/api).

In [9]:
# Add your Trove API key between the quotes
api_key = 'YOUR API KEY'
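
If you'd rather not paste your key into the notebook, one option is to read it from an environment variable instead. This is only a sketch: `TROVE_API_KEY` is a variable name assumed for the example, not something Trove or this notebook defines.

```python
import os

# TROVE_API_KEY is an assumed name; use whatever you've set in your own environment.
# Falls back to the placeholder if the variable isn't set.
api_key = os.environ.get('TROVE_API_KEY', 'YOUR API KEY')
```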

## Define some functions to do the work

In [35]:
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the digital version of the journal.
    '''
    nla_digitised = False
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            if link['linktext'] == 'National Library of Australia digitised item':
                nla_digitised = True
            return url, nla_digitised


def get_titles():
    '''
    Harvest metadata about digitised journals.
    With a little adaptation, this basic pattern could be used to harvest
    other types of works from Trove.
    '''
    url = 'https://api.trove.nla.gov.au/v2/result'
    titles = []
    params = {
        # We can 'NOT' the format facet in the query
        'q': '"nla.obj-" NOT format:"Government publication" NOT format:Article',
        'zone': 'article',
        'l-format': 'Periodical', # Journals only
        'include': 'links',
        'bulkHarvest': 'true', # Needed to maintain a consistent order across requests
        'key': api_key,
        'n': 100,
        'encoding': 'json'
    }
    start = '*'
    total = get_total_results(params)
    with tqdm(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            # If there's a nextStart value then we use it to request the next page of results
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Check to see if there's a link to a digital version
                try:
                    fulltext_url, nla_digitised = get_fulltext_url(work['identifier'])
                except (KeyError, TypeError):
                    pass
                else:
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get basic metadata
                        # You could add more work data here
                        # Check the Trove API docs for work record structure
                        title = {
                            'title': work['title'],
                            'fulltext_url': fulltext_url, 
                            'trove_url': work['troveUrl'],
                            'trove_id': trove_id,
                            'nla_digitised': nla_digitised
                        }
                        titles.append(title)
            time.sleep(0.2)
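            # Advance by the number of records actually returned (the last page may be short)
            pbar.update(len(data['response']['zone'][0]['records']['work']))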
    return titles
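
The `nextStart` loop in `get_titles()` is the part most worth reusing. The sketch below isolates it and runs it against a few canned pages rather than the live API, so `fake_pages`, `fetch_page` and `harvest_all` are stand-ins invented for the example. Only the response shape (`response`, `zone`, `records`, then `nextStart` and `work`) is taken from the code above.

```python
# Canned responses keyed by the 's' (start) value, made up to mimic the API's response shape
fake_pages = {
    '*': {'response': {'zone': [{'records': {'nextStart': 'abc', 'work': [{'title': 'A'}, {'title': 'B'}]}}]}},
    'abc': {'response': {'zone': [{'records': {'nextStart': 'def', 'work': [{'title': 'C'}]}}]}},
    # The last page has no nextStart, which is the signal to stop
    'def': {'response': {'zone': [{'records': {'work': [{'title': 'D'}]}}]}},
}


def fetch_page(start):
    '''
    Stand-in for s.get(url, params=params).json().
    '''
    return fake_pages[start]


def harvest_all():
    '''
    Follow nextStart values from page to page, collecting works as we go.
    '''
    works = []
    start = '*'
    while start:
        records = fetch_page(start)['response']['zone'][0]['records']
        works.extend(records.get('work', []))
        # A missing nextStart means there are no more pages
        start = records.get('nextStart')
    return works


print([work['title'] for work in harvest_all()])
# ['A', 'B', 'C', 'D']
```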

## Run the harvest

In [None]:
titles = get_titles()

## Convert to a dataframe and save as a CSV file

Let's convert the Python list to a Pandas DataFrame, have a peek inside, then save in CSV format.

In [37]:
df = pd.DataFrame(titles)
df.head()

| | title | fulltext_url | trove_url | trove_id | nla_digitised |
|---|---|---|---|---|---|
| 0 | The Silver stream songster | https://nla.gov.au/nla.obj-614066685 | https://trove.nla.gov.au/work/10087062 | nla.obj-614066685 | False |
| 1 | Stonequarry journal (Online) | https://nla.gov.au/nla.obj-862209995 | https://trove.nla.gov.au/work/10106079 | nla.obj-862209995 | False |
| 2 | Philament (Sydney, N.S.W. : Online) | https://nla.gov.au/nla.obj-749489295 | https://trove.nla.gov.au/work/10287808 | nla.obj-749489295 | False |
| 3 | Journal (Queensland Law Society) | https://www.nla.gov.au/nla.obj-2735787548 | https://trove.nla.gov.au/work/10321820 | nla.obj-2735787548 | False |
| 4 | The Order of service for the annual festival t... | http://nla.gov.au/nla.obj-657473276 | https://trove.nla.gov.au/work/10388163 | nla.obj-657473276 | True |


In [38]:
# How many journals are there?
df.shape

(2730, 5)

For some reason there are a number of duplicates in the list, where multiple Trove work records point to the same digitised journal. We can display the duplicates like this.

In [41]:
# Show all the rows
pd.set_option('display.max_rows', None)
# Show dupes
df.loc[df.duplicated(subset=['trove_id'], keep=False)].sort_values(by=['trove_id', 'nla_digitised'])

| | title | fulltext_url | trove_url | trove_id | nla_digitised |
|---|---|---|---|---|---|
| 1594 | Wings (Sydney, N.S.W. : Online) | https://nla.gov.au/nla.obj-1226109179 | https://trove.nla.gov.au/work/236307958 | nla.obj-1226109179 | False |
| 2379 | Wings (Sydney, N.S.W.) | https://nla.gov.au/nla.obj-1226109179 | https://trove.nla.gov.au/work/30060307 | nla.obj-1226109179 | False |
| 672 | [Event programme] / Australian Festival of Cha... | https://nla.gov.au/nla.obj-1252107366 | https://trove.nla.gov.au/work/205602387 | nla.obj-1252107366 | False |
| 1942 | [Event programme] / Australian Festival of Cha... | https://nla.gov.au/nla.obj-1252107366 | https://trove.nla.gov.au/work/237613201 | nla.obj-1252107366 | False |
| 632 | The Brisbane Bushwalker | https://nla.gov.au/nla.obj-1252263267 | https://trove.nla.gov.au/work/200641459 | nla.obj-1252263267 | False |
| 2314 | The Brisbane bushwalker : monthly magazine of ... | https://nla.gov.au/nla.obj-1252263267 | https://trove.nla.gov.au/work/24167477 | nla.obj-1252263267 | False |
| 1840 | The Shadowland newsletter | https://nla.gov.au/nla.obj-1771610885 | https://trove.nla.gov.au/work/237293619 | nla.obj-1771610885 | False |
| 2692 | The Atlas of the solar system / Patrick Moore ... | https://nla.gov.au/nla.obj-1771610885 | https://trove.nla.gov.au/work/7113005 | nla.obj-1771610885 | False |
| 1996 | Photographic review of reviews (Online) | http://nla.gov.au/nla.obj-389050007 | https://trove.nla.gov.au/work/238058947 | nla.obj-389050007 | False |
| 2471 | Photographic review of reviews | http://nla.gov.au/nla.obj-389050007 | https://trove.nla.gov.au/work/33565755 | nla.obj-389050007 | True |


In [43]:
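# How many journals are left if we keep one row per trove_id?
# Sorting first means keep='last' prefers rows where nla_digitised is True
df.sort_values(by=['trove_id', 'nla_digitised']).drop_duplicates(subset='trove_id', keep='last').shape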

(2698, 5)
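
The cell above sorts by `trove_id` and `nla_digitised`, then keeps the last row in each group, so a record flagged as NLA-digitised wins over one that isn't. To spell out that rule, here's the same logic as a plain-Python sketch, run on three rows copied from the outputs above. The `drop_duplicate_ids` function is a name invented for the example and isn't used elsewhere in this notebook.

```python
# Three rows taken from the outputs above; the first two share a trove_id
rows = [
    {'title': 'Photographic review of reviews (Online)', 'trove_id': 'nla.obj-389050007', 'nla_digitised': False},
    {'title': 'Photographic review of reviews', 'trove_id': 'nla.obj-389050007', 'nla_digitised': True},
    {'title': 'The Silver stream songster', 'trove_id': 'nla.obj-614066685', 'nla_digitised': False},
]


def drop_duplicate_ids(rows):
    '''
    Keep one row per trove_id, preferring rows where nla_digitised is True.
    '''
    kept = {}
    # False sorts before True, so within each trove_id the True row comes last
    for row in sorted(rows, key=lambda r: (r['trove_id'], r['nla_digitised'])):
        # Later rows overwrite earlier ones with the same trove_id
        kept[row['trove_id']] = row
    return list(kept.values())


for row in drop_duplicate_ids(rows):
    print(row['trove_id'], row['title'], row['nla_digitised'])
```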

In [40]:
# Save as CSV and display a download link
df.to_csv('digital-journals.csv', index=False)
display(FileLink('digital-journals.csv'))

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).

Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).