# Create a list of Trove's digitised periodicals

Everyone knows about Trove's newspapers, but there is also a growing collection of digitised periodicals available in the 'Magazines & newsletters' category. They're not easy to find, however, which is why I created the [Trove Titles](https://trove-titles.herokuapp.com/) web app.

This notebook uses the Trove API to harvest metadata relating to digitised periodicals – or more accurately, periodicals that are freely available online in a digital form. This includes some born digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

The search strategy for finding digitised (and born-digital) periodicals takes advantage of the fact that Trove's digitised resources (excluding the newspapers) all have an identifier that includes the string `nla.obj`. So we start by searching in the journals zone (named `article` in the API) for records that include `nla.obj` and have the `format` 'Periodical'. Specifying 'Periodical', and adding `NOT format:Article` to the query, excludes individual articles from digitised journals.

Then it's just a matter of looping through all the results and checking whether each record's identifiers include a `fulltext` link to a digital copy. If they do, the record is saved.
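
To make the check concrete, here is a minimal sketch that applies the same test to a hand-made record. The record is hypothetical: its keys (`identifier`, `linktype`, `linktext`, `value`) are the ones this notebook's code reads, but the `linktext` wording and the thumbnail entry are assumptions rather than values copied from the API. `has_digital_copy` is an illustrative name, not a function used elsewhere in the notebook.

```python
# A hypothetical work record, shaped like the fields this notebook reads.
sample_work = {
    'title': 'The Silver stream songster',
    'identifier': [
        {
            'linktype': 'fulltext',
            'linktext': 'National Library of Australia digitised item',  # assumed wording
            'value': 'https://nla.gov.au/nla.obj-614066685',
        },
        {
            'linktype': 'thumbnail',  # assumed: a non-fulltext link that should be ignored
            'value': 'https://nla.gov.au/nla.obj-614066685-t',
        },
    ],
}


def has_digital_copy(work):
    '''
    Return True if any identifier is a fulltext link containing 'nla.obj'.
    '''
    return any(
        link.get('linktype') == 'fulltext' and 'nla.obj' in link.get('value', '')
        for link in work.get('identifier', [])
    )


print(has_digital_copy(sample_work))
print(has_digital_copy({'title': 'A record without identifiers'}))
```

Records with no `identifier` field at all simply fail the check, which is why the harvesting code further down wraps the lookup in a `try`/`except`.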

You can see the results in [this CSV file](digital-journals.csv). Obviously you could extract additional metadata from each record if you wanted to.

The default fields are:

* `title` – the title of the periodical
* `contributor` – information about creator or publisher
* `issued` – publication date, or date range
* `format` – the type of publication; all entries should include 'Periodical', but may also include other types such as 'Government publication'
* `trove_id` – the 'nla.obj' part of the `fulltext_url`, a unique identifier for the digital periodical
* `trove_url` – URL of the periodical's metadata record in Trove
* `fulltext_url` – URL of the landing page of the digital version of the periodical
* `fulltext_url_type` – the type of digital periodical, one of 'digitised', 'edeposit', or 'other'
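
As a quick check on these fields, the following sketch counts rows by `fulltext_url_type` using only the standard library. It reads from an in-memory sample built from three rows of the harvest output shown later in this notebook, trimmed to three columns. The file name in the comment is a placeholder for whatever you saved the CSV as.

```python
import csv
import io
from collections import Counter

# Three rows taken from this notebook's harvest output, trimmed to three columns.
sample_csv = '''title,trove_id,fulltext_url_type
The Silver stream songster,nla.obj-614066685,digitised
Country web (Online),nla.obj-2337896017,edeposit
Report / Defence Force Remuneration Tribunal,nla.obj-2137302489,digitised
'''

# Swap io.StringIO(sample_csv) for open('your-file.csv', newline='') to use the real file.
reader = csv.DictReader(io.StringIO(sample_csv))
type_counts = Counter(row['fulltext_url_type'] for row in reader)

for url_type, count in type_counts.most_common():
    print(f'{url_type}: {count}')
```

Once the harvest is loaded in a dataframe, `df['fulltext_url_type'].value_counts()` gives the same tally.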

I've used this list to [harvest all the OCRd text from digitised periodicals](Download-text-for-all-digitised-journals.ipynb).

In [70]:
# Let's import the libraries we need.
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import json
import os
import re
from tqdm.auto import tqdm
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from slugify import slugify
from IPython.display import display, HTML
from datetime import datetime

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

## Add your Trove API key

You can get a Trove API key by [following these instructions](https://help.nla.gov.au/trove/building-with-trove/api).

In [71]:
# Add your Trove API key between the quotes
api_key = 'YOUR API KEY'

## Define some functions to do the work

In [80]:
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_urls(links):
    '''
    Loop through a work's identifiers and return any links to digital versions of the journal.
    '''
    urls = []
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            if 'digitised' in link['linktext'].lower():
                url_type = 'digitised'
            elif 'edeposit' in link['linktext'].lower():
                url_type = 'edeposit'
            else:
                url_type = 'other'
            urls.append({'url': url, 'url_type': url_type})
    return urls

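def listify(value):
    '''
    Wrap a single value in a list; return lists unchanged.
    '''
    if not isinstance(value, list):
        value = [value]
    return value

def format_list(record, field):
    '''
    Get a field's value(s) from a record and join them into a pipe-separated string.
    '''
    value = record.get(field, [])
    value = listify(value)
    return ' | '.join(value)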


def get_titles():
    '''
    Harvest metadata about digitised journals.
    With a little adaptation, this basic pattern could be used to harvest
    other types of works from Trove.
    '''
    url = 'https://api.trove.nla.gov.au/v2/result'
    titles = []
    params = {
        # Search for the digital object identifier prefix,
        # and use NOT to exclude individual articles
        'q': '"nla.obj-" NOT format:Article',
        'zone': 'article',  # The journals zone
        'l-format': ['Periodical'],  # Periodicals only
        # 'l-format': 'Government publication',  # Alternative: government publications only
        'include': 'links',
        'bulkHarvest': 'true', # Needed to maintain a consistent order across requests
        'key': api_key,
        'n': 100,
        'encoding': 'json'
    }
    start = '*'
    total = get_total_results(params)
    with tqdm(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            # If there's a nextStart value, use it to request the next page of results
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for work in data['response']['zone'][0]['records']['work']:
                # Check to see if there's a link to a digital version
                try:
                    fulltext_urls = get_fulltext_urls(work['identifier'])
                except (KeyError, TypeError):
                    pass
                else:
                    for fulltext_url in fulltext_urls:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url['url']).group(1)
                        # Get basic metadata
                        # You could add more work data here
                        # Check the Trove API docs for work record structure
                        title = {
                            'title': work['title'],
                            'contributor': format_list(work, 'contributor'),
                            'issued': work.get('issued', ''),
                            'format': format_list(work, 'type'),
                            'fulltext_url': fulltext_url['url'], 
                            'trove_url': work['troveUrl'],
                            'trove_id': trove_id,
                            'fulltext_url_type': fulltext_url['url_type']
                        }
                        titles.append(title)
            time.sleep(0.2)
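            # Update by the number of records on this page, as the last page may hold fewer than 100
            pbar.update(len(data['response']['zone'][0]['records']['work']))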
    return titles
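
The paging logic inside `get_titles()` can be separated from the Trove-specific details. The sketch below is one way to do that, and is not part of the original notebook: `harvest_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the API request so the loop can be run offline. It mimics only the parts of the response that the loop reads (`work` and `nextStart` within `records`).

```python
def harvest_pages(fetch_page, start='*'):
    '''
    Yield works from successive pages, following nextStart until it disappears.
    fetch_page is any function that takes a start value and returns
    the 'records' part of a response.
    '''
    while start:
        records = fetch_page(start)
        yield from records.get('work', [])
        # No nextStart means this was the last page
        start = records.get('nextStart')


# Hypothetical stand-in for the API: two pages keyed by their start value.
FAKE_PAGES = {
    '*': {'work': [{'title': 'A'}, {'title': 'B'}], 'nextStart': 'page-2'},
    'page-2': {'work': [{'title': 'C'}]},
}


def fake_fetch(start):
    return FAKE_PAGES[start]


harvested = [work['title'] for work in harvest_pages(fake_fetch)]
print(harvested)
```

To use this against Trove, `fetch_page` would set `params['s']` to the start value, make the request with the session, and return `data['response']['zone'][0]['records']`.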

## Run the harvest

In [81]:
titles = get_titles()

  0%|          | 0/10064 [00:00<?, ?it/s]

In [82]:
df = pd.DataFrame(titles)
df.head()

Unnamed: 0,title,contributor,issued,format,fulltext_url,trove_url,trove_id,fulltext_url_type
0,"Laws, etc. (Acts of the Parliament)",Victoria,1900-2021,Government publication | Periodical | Periodic...,http://nla.gov.au/nla.obj-54127737,https://trove.nla.gov.au/work/10078182,nla.obj-54127737,digitised
1,The Silver stream songster,,1890-1900,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-614066685,https://trove.nla.gov.au/work/10087062,nla.obj-614066685,digitised
2,Report / Defence Force Remuneration Tribunal,Australia. Defence Force Remuneration Tribunal,1980-2021,Government publication | Periodical | Periodic...,https://nla.gov.au/nla.obj-2137302489,https://trove.nla.gov.au/work/10096343,nla.obj-2137302489,digitised
3,Country web (Online),Rural Women's Network (N.S.W.),1993-2021,Government publication | Periodical | Periodic...,https://nla.gov.au/nla.obj-2337896017,https://trove.nla.gov.au/work/10100510,nla.obj-2337896017,edeposit
4,Territory of Papua : annual report for the per...,Australia. Department of External Territories,1940-1949,Government publication | Periodical | Periodic...,https://nla.gov.au/nla.obj-2060262652,https://trove.nla.gov.au/work/10103835,nla.obj-2060262652,digitised


In [83]:
# How many journals are there?
df.shape

(7318, 8)

For some reason there are a number of duplicates in the list, where multiple Trove work records point to the same digitised journal. We can display the duplicates like this.

In [84]:
# Show all the rows
pd.set_option('display.max_rows', None)
# Show dupes
df.loc[df.duplicated(subset=['trove_id'], keep=False)].sort_values(by=['trove_id', 'fulltext_url_type'])

Unnamed: 0,title,contributor,issued,format,fulltext_url,trove_url,trove_id,fulltext_url_type
3020,"Wings (Sydney, N.S.W. : Online)",,2019-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1226109179,https://trove.nla.gov.au/work/236307958,nla.obj-1226109179,edeposit
6479,"Wings (Sydney, N.S.W.)",Royal Australian Air Force Association,1946-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1226109179,https://trove.nla.gov.au/work/30060307,nla.obj-1226109179,edeposit
1454,[Event programme] / Australian Festival of Cha...,,1990-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1252107366,https://trove.nla.gov.au/work/205602387,nla.obj-1252107366,edeposit
4434,[Event programme] / Australian Festival of Cha...,Australian Festival of Chamber Music (Townsvil...,1990-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1252107366,https://trove.nla.gov.au/work/237613201,nla.obj-1252107366,edeposit
437,Newsletter (Redland Libraries),Redcliffe Art Society,1970-2021,"Periodical | Periodical/Journal, magazine, oth...",https://nla.gov.au/nla.obj-1252138665,https://trove.nla.gov.au/work/163411531,nla.obj-1252138665,edeposit
1003,Newsletter (Redland Libraries),Redland (Qld.). Council. Library,2000-2021,Government publication | Periodical | Periodic...,https://nla.gov.au/nla.obj-1252138665,https://trove.nla.gov.au/work/189939461,nla.obj-1252138665,edeposit
1361,The Brisbane Bushwalker,Brisbane Bushwalkers Club,1900-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1252263267,https://trove.nla.gov.au/work/200641459,nla.obj-1252263267,edeposit
5650,The Brisbane bushwalker : monthly magazine of ...,Brisbane Bushwalkers Club,1965-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1252263267,https://trove.nla.gov.au/work/24167477,nla.obj-1252263267,edeposit
444,The Queensland geologist,Geological Society of Australia. Queensland Di...,2000-2021,"Periodical | Periodical/Journal, magazine, other",https://nla.gov.au/nla.obj-1252284018,https://trove.nla.gov.au/work/163412146,nla.obj-1252284018,edeposit
521,The Queensland geologist : bimonthly newslette...,Geological Society of Australia. Queensland Di...,1900-2021,Government publication | Periodical | Periodic...,https://nla.gov.au/nla.obj-1252284018,https://trove.nla.gov.au/work/16824694,nla.obj-1252284018,edeposit


In [85]:
# How many unique journals are there once duplicates are removed?
df.sort_values(by=['trove_id', 'fulltext_url_type']).drop_duplicates(subset='trove_id', keep='last').shape

(7270, 8)
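
The cell above only reports the shape of the de-duplicated dataframe; `df` itself still contains the duplicates when it is saved. If you want one row per `trove_id`, one option is to assign the result to a new variable. This is a sketch rather than part of the original workflow: `sample` holds four rows from the duplicates listing above, trimmed to three columns, and `unique_titles` is a hypothetical name.

```python
import pandas as pd

# Four rows from the duplicates listing, trimmed to three columns.
sample = pd.DataFrame([
    {'title': 'Wings (Sydney, N.S.W. : Online)', 'trove_id': 'nla.obj-1226109179', 'fulltext_url_type': 'edeposit'},
    {'title': 'Wings (Sydney, N.S.W.)', 'trove_id': 'nla.obj-1226109179', 'fulltext_url_type': 'edeposit'},
    {'title': 'The Brisbane Bushwalker', 'trove_id': 'nla.obj-1252263267', 'fulltext_url_type': 'edeposit'},
    {'title': 'The Brisbane bushwalker : monthly magazine of ...', 'trove_id': 'nla.obj-1252263267', 'fulltext_url_type': 'edeposit'},
])

# Same sort and drop as the cell above, but keep the result
unique_titles = (
    sample.sort_values(by=['trove_id', 'fulltext_url_type'])
    .drop_duplicates(subset='trove_id', keep='last')
    .reset_index(drop=True)
)
print(unique_titles.shape)
```

Which of the duplicate records survives depends on the sort order, so it is worth checking the result where titles or contributors differ between records.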

In [86]:
# Save as CSV and display a download link
csv_file = f'digital-journals-{datetime.now().strftime("%Y%m%d")}.csv'
#csv_file = f'government-publications-periodicals-{datetime.now().strftime("%Y%m%d")}.csv'
df.to_csv(csv_file, index=False)
display(HTML(f'<a href="{csv_file}" download="{csv_file}">{csv_file}</a>'))

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).

Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).