# Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps:

* Harvest metadata of digitised books using the Trove API
* Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)
* Download the OCRd text for each book
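
To make the link between page counts and text downloads concrete, here's a minimal sketch of how a download URL can be put together once the number of pages is known. It follows the URL pattern used in the download step of this notebook. The `build_ocr_url` name is a hypothetical helper made up for this example, and the id and page count come from the sample output shown later on.

```python
def build_ocr_url(trove_id, pages):
    """
    Build the URL for downloading a work's OCRd text.
    Page indexes start at zero, so the last page is pages - 1.
    (build_ocr_url is a hypothetical helper, for illustration only.)
    """
    if pages < 1:
        raise ValueError('A work needs at least one page to have OCRd text')
    last_page = pages - 1
    return (
        'https://trove.nla.gov.au/{}/download'
        '?downloadOption=ocr&firstPage=0&lastPage={}'
    ).format(trove_id, last_page)


# A 246 page book is requested as pages 0 to 245
print(build_ocr_url('nla.obj-24357566', 246))
```

Without the page count there's no way to fill in `lastPage`, which is why the second step can't be skipped.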

It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a [search in the book zone](https://trove.nla.gov.au/search/category/books?keyword=%22nla.obj%22&l-availability=y&l-format=Book) for books that include the phrase `"nla.obj"` and are available online. This currently returns 57,046 results in the web interface. In amongst these are a range of government papers and reports, as well as publications where access to the digital copy is restricted. I think these are mostly recent books submitted in digital form under legal deposit. Things are made even more confusing by the fact that the contents of the 'Books & libraries' category in the web interface are not the same as the API's books zone. Anyway, I've used the new `fullTextInd` index to try and filter out works without any OCRd text and added `NOT format:"Government publication"` to the search query to try and filter out the government publications. The government publications are split across the journals and books zones, so I think I'll try and do a separate harvest that brings them all together. These extra parameters currently reduce the total to 32,149 results.

But some of those 32,149 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I'm also checking to see if the record is a 'Multi volume book' and has child works. If it does, I add the child works to the list of books. After this stage there are 33,505 works. However, not all of these 33,505 records have OCRd text. Parent records of multi volume works, and ebook formats like PDFs or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 29,086 works that might have some OCRd text to download, and text was actually saved for 27,367 of these records.

After downloading all the OCRd text files (ignoring any that were empty) I ended up with a grand total of **24,620 files**.

If you compare the number of downloaded files to the number in the CSV file that are identified as having OCRd text you'll notice a difference – 24,620 compared to 27,367. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. For example, both [this record](https://trove.nla.gov.au/work/192090169) and [this record](https://trove.nla.gov.au/work/31771096) point to [this digitised work](http://nla.gov.au/nla.obj-1874683). As they're not exact duplicates, I've left them in the results.
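
Here's a minimal sketch of how that difference shows up in the data. The first two rows use the pair of records linked above, the third comes from the sample output further down, and the `text_downloaded` values are assumed for the sake of the example. Counting distinct `trove_id` values, rather than rows, should give a figure closer to the number of files.

```python
import pandas as pd

# Illustrative rows: two metadata records point to the same digitised work
sample = pd.DataFrame([
    {'url': 'https://trove.nla.gov.au/work/192090169', 'trove_id': 'nla.obj-1874683', 'text_downloaded': True},
    {'url': 'https://trove.nla.gov.au/work/31771096', 'trove_id': 'nla.obj-1874683', 'text_downloaded': True},
    {'url': 'https://trove.nla.gov.au/work/10049667', 'trove_id': 'nla.obj-24357566', 'text_downloaded': True},
])

downloaded = sample.loc[sample['text_downloaded']]
# Number of metadata records marked as having text
records = downloaded.shape[0]
# Number of distinct digitised works behind those records
unique_works = downloaded['trove_id'].nunique()
print(records, unique_works)
```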

Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but of course you could filter later by the number of pages. It's also obvious we didn't manage to exclude all the government publications – not sure how to deal with these either.
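
As a rough sketch of filtering by page count, something like the following could be run over the harvested metadata. The ids and page counts are taken from the sample output in this notebook, but the `MIN_PAGES` threshold is an arbitrary value picked for the example, not a tested cut-off between books and ephemera.

```python
import pandas as pd

# Illustrative rows using ids and page counts from this notebook's sample output
sample = pd.DataFrame([
    {'trove_id': 'nla.obj-24357566', 'pages': 246},
    {'trove_id': 'nla.obj-101212925', 'pages': 8},
    {'trove_id': 'nla.obj-99727992', 'pages': 1},
])

MIN_PAGES = 20  # arbitrary threshold, chosen only for this example

# Works at or above the threshold are more likely to be books
likely_books = sample.loc[sample['pages'] >= MIN_PAGES]
# Works with a few pages are more likely to be pamphlets, posters and the like
likely_ephemera = sample.loc[(sample['pages'] > 0) & (sample['pages'] < MIN_PAGES)]

print(likely_books['trove_id'].tolist())
print(likely_ephemera['trove_id'].tolist())
```

Any threshold will misclassify some items (short books, long pamphlets), so it's worth checking a sample of the results by hand.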

Here's the metadata I've harvested in CSV format:

* [CSV formatted file with details of digitised books](trove_digitised_books_with_ocr.csv)

This file includes the following columns:

* `children` – pipe-separated ids of any child works
* `contributors` – pipe-separated names of contributors
* `date` – publication date
* `form` – work format
* `fulltext_url` – link to the digitised version
* `language` – main language of the work
* `pages` – number of pages
* `parent` – id of parent work (if any)
* `rights` – copyright status
* `text_downloaded` – True/False is there any OCRd text
* `text_file` – file name of the downloaded OCR text
* `title` – title of the work
* `trove_id` – unique identifier
* `url` – link to the metadata record in Trove
* `volume` – volume/part number
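
Because `children` and `contributors` are pipe-separated, you'll probably want to split them into lists after loading the CSV. Here's a minimal sketch using a tiny made-up CSV with the same column names. The ids and names in it are invented, and `split_pipes` is a hypothetical helper.

```python
import io
import pandas as pd

# A tiny made-up CSV that borrows column names from the harvested file
csv_text = (
    'trove_id,contributors,children,pages\n'
    'nla.obj-1,"Smith, Jane|Jones, Mary",,120\n'
    'nla.obj-2,,nla.obj-3|nla.obj-4,0\n'
)

# keep_default_na=False stops pandas turning empty fields into NaN
df_sample = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)


def split_pipes(value):
    # Return a list of values, or an empty list if the field is blank
    return value.split('|') if value else []


df_sample['contributors_list'] = df_sample['contributors'].apply(split_pipes)
df_sample['children_list'] = df_sample['children'].apply(split_pipes)
print(df_sample[['trove_id', 'contributors_list', 'children_list']])
```

To use this with the real data, replace the `io.StringIO(csv_text)` argument with the path to the downloaded CSV file.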

Browse and download text files from Cloudstor:

* **[24,620 text files](https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL) (about 3.3gb in total) downloaded from the books zone on 20 July 2020 ([1.1gb zip file](https://cloudstor.aarnet.edu.au/plus/s/k7Mv5FRR7rR3c61))** 


Previous harvests:

* April 2019 - everything ([400mb zip file](https://cloudstor.aarnet.edu.au/plus/s/XdAqbGoPpefhmj2))
* 11 March 2020 – books (and ephemera) ([565mb zip file](https://cloudstor.aarnet.edu.au/plus/s/0eYAJMSgf0YVLPU))
* 11 March 2020 – parliamentary papers ([471mb zip file](https://cloudstor.aarnet.edu.au/plus/s/Gg6orB4UkWg6Rij))

## Setting things up

In [3]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm
from IPython.display import display, FileLink
import pandas as pd
import json
import re
import time
import os
from copy import deepcopy
from bs4 import BeautifulSoup
from slugify import slugify
import requests_cache

In [4]:
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

In [5]:
# Add your Trove API key below
api_key = 'YOUR API KEY'

In [6]:
params = {
    'key': api_key,
    'zone': 'book',
    'q': '"nla.obj" fullTextInd:y NOT format:"Government publication"', # API v 2.1 added the full text indicator
    'bulkHarvest': 'true',
    'n': 100,
    'encoding': 'json',
    'l-availability': 'y',
    'l-format': 'Book',
    'include': 'links,workversions'
}

## Harvest metadata using the API

In [13]:
def get_total_results():
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the full text version of the book.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            break
    return url

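def get_version_record(record):
    '''
    Find the version record whose metadata source is 'ANL:DL'.
    Returns None if there isn't one.
    '''
    for version in record.get('version', []):
        for version_record in version['record']:
            try:
                if version_record['metadataSource'].get('value') == 'ANL:DL':
                    return version_record
            except (AttributeError, TypeError, KeyError):
                pass
    return None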
                
def join_list(record, key):
    # A field may have a single value or an array.
    # If it's an array, join the values into a string.
    string_list = ''
    if record:
        value = record.get(key)
        if value:
            try:
                string_list = '|'.join(value)
            except TypeError:
                string_list = value
    return string_list


def harvest_books():
    '''
    Harvest metadata relating to digitised books.
    '''
    books = []
    total = get_total_results()
    start = '*'
    these_params = params.copy()
    with tqdm(total=total) as pbar:
        while start:
            these_params['s'] = start
            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
            data = response.json()
            # The nextStart parameter is used to get the next page of results.
            # If there's no nextStart then it means we're on the last page of results.
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for record in data['response']['zone'][0]['records']['work']:
                # See if there's a link to the full text version.
                if 'identifier' in record:
                    fulltext_url = get_fulltext_url(record['identifier'])
                    # I'm making the assumption that if this is a booky book (not a map or music etc),
                    # then 'Book' will appear first in the list of types.
                    # This might not be a valid assumption.
                    # try:
                    #    format_type = record.get('type')[0]
                    # except (IndexError, TypeError):
                    #    format_type = None
                    # Save the record if there's a full text link.
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get the basic metadata.
                        book = {
                            'title': record.get('title'),
                            'url': record.get('troveUrl'),
                            'contributors': join_list(record, 'contributor'),
                            'date': record.get('issued'),
                            'fulltext_url': fulltext_url,
                            'trove_id': trove_id
                        }
                        # Add some extra info if available
                        version = get_version_record(record)
                        book['language'] = join_list(version, 'language')
                        book['rights'] = join_list(version, 'rights')
                        books.append(book)
                        # print(book)
            if not response.from_cache:
                time.sleep(0.2)
            pbar.update(100)
    return books

In [None]:
# Do the harvest!
books = harvest_books()

In [15]:
len(books)

32149

## Get the number of pages in each book

In order to download the OCRd text we need to know the number of pages in a work. This information is not available via the API, so we have to scrape it from the work's HTML page.
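
To show what the scraping involves, here's a minimal sketch that runs the same sort of regular expression over a made-up HTML fragment. The structure of `sample_html` (including the `pid` keys) is an assumption for illustration. The real pages are much larger and their layout could change.

```python
import json
import re

# A made-up fragment imitating the script embedded in a work's HTML page
sample_html = '''
<script>
var work = JSON.parse(JSON.stringify({"form": "Book", "children": {"page": [{"pid": "a"}, {"pid": "b"}]}}));
</script>
'''

# Capture the JSON object passed to JSON.stringify()
match = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', sample_html)
work = json.loads(match.group(1)) if match else {}

# Each entry in the page list represents one page of the work
pages = len(work.get('children', {}).get('page', []))
print(work.get('form'), pages)
```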

In [22]:
def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    if not response.from_cache:
        time.sleep(0.2)
    return json.loads(work_data)


def get_pages(work):
    '''
    Get the number of pages from the work data.
    '''
    try:
        pages = len(work['children']['page'])
    except KeyError:
        pages = 0
    return pages


def get_volumes(parent_id):
    '''
    Get the ids of volumes that are children of the current record.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    parts = []
    # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
    while n == 20:
        # Get the browse page
        response = s.get(start_url.format(parent_id, start))
        # Beautifulsoup turns the HTML into an easily navigable structure
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all the divs containing issue details and loop through them
        details = soup.find_all(class_='l-item-info')
        for detail in details:
            title = detail.find('h3')
            if title:
                issue_id = title.parent['href'].strip('/')
            else:
                issue_id = detail.find('a')['href'].strip('/')
            # Get the issue id
            parts.append(issue_id)
        if not response.from_cache:
            time.sleep(0.2)
        # Increment the startIdx
        start += n
        # Set n to the number of results on the current page
        n = len(details)
    return parts


def add_pages(books):
    '''
    Add the number of pages to the metadata for each book.
    Add volumes from multi volume books.
    '''
    books_with_pages = []
    for book in tqdm(books):
        # print(book['fulltext_url'])
        work = get_work_data(book['fulltext_url'])
        form = work.get('form')
        pages = get_pages(work)
        book['pages'] = pages
        book['form'] = form
        book['volume'] = ''
        book['parent'] = ''
        book['children'] = ''
        time.sleep(0.2)
        # Multi volume books are containers with child volumes
        # so we have to get the ids of each individual volume and process them
        if pages == 0 and form == 'Multi Volume Book':
            # Get child volumes
            volumes = get_volumes(book['trove_id'])
            # For each volume get details and add as a new book entry
            for index, volume_id in enumerate(volumes):
                volume = book.copy()
                # Add link up to the container
                volume['parent'] = book['trove_id']
                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)
                volume['trove_id'] = volume_id
                work = get_work_data(volume['fulltext_url'])
                form = work.get('form')
                pages = get_pages(work)
                volume['form'] = form
                volume['pages'] = pages
                volume['volume'] = str(index + 1)
                # print(volume)
                books_with_pages.append(volume)
            # Add links from container to volumes
            book['children'] = '|'.join(volumes)
        # print(book)
        books_with_pages.append(book)
    return books_with_pages

In [None]:
# Add number of pages to the book metadata
books_with_pages = add_pages(deepcopy(books))

## Convert and save results

Getting the page numbers takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory.

In [24]:
df = pd.DataFrame(books_with_pages)

In [25]:
df.head()

Unnamed: 0,title,url,contributors,date,fulltext_url,trove_id,language,rights,pages,form,volume,parent,children
0,The works of the Rev. Sydney Smith,https://trove.nla.gov.au/work/1004403,"Smith, Sydney, 1771-1845",1839-1900,https://nla.gov.au/nla.obj-630176596,nla.obj-630176596,English,No known copyright restrictions|http://rightss...,65,Book,,,
1,Nellie Doran : a story of Australian home and ...,https://trove.nla.gov.au/work/10049667,Miriam Agatha,1914-1923,http://nla.gov.au/nla.obj-24357566,nla.obj-24357566,English,Out of Copyright|http://rightsstatements.org/v...,246,Book,,,
2,Trefoil : the story of a girls' society / by M...,https://trove.nla.gov.au/work/10057400,"Macdonald, M. P. (Margaret P.)",1900-1920,http://nla.gov.au/nla.obj-19907304,nla.obj-19907304,English,Out of Copyright|http://rightsstatements.org/v...,388,Book,,,
3,Military report on the province of Chiang-su (...,https://trove.nla.gov.au/work/10068876,,1909,http://nla.gov.au/nla.obj-233089297,nla.obj-233089297,,,0,Picture,,,
4,Le Siege de Berlin : Drame en un Acte / Charle...,https://trove.nla.gov.au/work/10069391,,1915,http://nla.gov.au/nla.obj-509324870,nla.obj-509324870,,,33,Book,,,


In [26]:
# How many records?
df.shape

(33505, 13)

In [27]:
# How many have pages?
df.loc[df['pages'] != 0].shape

(29086, 13)

In [28]:
# How many of each format?
df['form'].value_counts()

Book                   26962
Digital Publication     3602
Multi Volume Book       2227
Picture                  452
Journal                  219
Manuscript                22
Other - General            5
Map                        2
Other - Australian         1
Name: form, dtype: int64

In [29]:
# Breakdown by language
df['language'].value_counts()

English                                       23769
                                               7631
Chinese                                        1203
Undetermined                                    192
French                                          181
German                                           77
Japanese                                         62
Australian languages                             53
Dutch                                            51
Austronesian (Other)                             43
Italian                                          29
Latin                                            26
Maori                                            20
Spanish                                          18
Korean                                           14
Portuguese                                       14
Swedish                                          13
Danish                                           12
Tahitian                                         11
Indonesian  

In [30]:
# Save as CSV
df.to_csv('trove_digitised_books.csv', index=False)
display(FileLink('trove_digitised_books.csv'))

## Download the OCRd texts

In [11]:
# Run this cell if you need to reload the books data from the CSV
df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)
books_with_pages = df.to_dict('records')

In [35]:
def save_ocr(books, output_dir='text'):
    '''
    Download the OCRd text for each book.
    '''
    os.makedirs(output_dir, exist_ok=True)
    for book in tqdm(books):
        # Default values
        book['text_downloaded'] = False
        book['text_file'] = ''
        if book['pages'] != 0:       
            # print(book['title'])
            # The index value for the last page of a book will be the total pages - 1
            last_page = book['pages'] - 1
            file_name = '{}-{}.txt'.format(slugify(str(book['title'])[:50]), book['trove_id'])
            file_path = os.path.join(output_dir, file_name)
            # Check to see if the file has already been harvested
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                # print('Already saved')
                book['text_file'] = file_name
                book['text_downloaded'] = True
            else:
                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)
                # print(url)
                # Get the file
                r = s.get(url)
                # Check there was no error
                if r.status_code == requests.codes.ok:
                    # Check that the file's not empty
                    r.encoding = 'utf-8'
                    if len(r.text) > 0 and not r.text.isspace():
                        # Check that the file isn't HTML (some not found pages don't return 404s)
                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:
                            # If everything's ok, save the file
                            with open(file_path, 'w', encoding='utf-8') as text_file:
                                text_file.write(r.text)
                            # print('Saved')
                            book['text_file'] = file_name
                            book['text_downloaded'] = True
                if not r.from_cache:
                    time.sleep(1)

In [38]:
save_ocr(books_with_pages)


## Convert and save updated results

The new books list includes the file name of the downloaded text file (if there is one),
and a boolean field indicating if the text has been downloaded.

In [39]:
# Convert this to df
df_downloaded = pd.DataFrame(books_with_pages)

In [40]:
df_downloaded.head()

Unnamed: 0,title,url,contributors,date,fulltext_url,trove_id,language,rights,pages,form,volume,parent,children,text_downloaded,text_file
0,The works of the Rev. Sydney Smith,https://trove.nla.gov.au/work/1004403,"Smith, Sydney, 1771-1845",1839-1900,https://nla.gov.au/nla.obj-630176596,nla.obj-630176596,English,No known copyright restrictions|http://rightss...,65,Book,,,,True,the-works-of-the-rev-sydney-smith-nla.obj-6301...
1,Nellie Doran : a story of Australian home and ...,https://trove.nla.gov.au/work/10049667,Miriam Agatha,1914-1923,http://nla.gov.au/nla.obj-24357566,nla.obj-24357566,English,Out of Copyright|http://rightsstatements.org/v...,246,Book,,,,True,nellie-doran-a-story-of-australian-home-and-sc...
2,Trefoil : the story of a girls' society / by M...,https://trove.nla.gov.au/work/10057400,"Macdonald, M. P. (Margaret P.)",1900-1920,http://nla.gov.au/nla.obj-19907304,nla.obj-19907304,English,Out of Copyright|http://rightsstatements.org/v...,388,Book,,,,True,trefoil-the-story-of-a-girls-society-by-m-p-nl...
3,Military report on the province of Chiang-su (...,https://trove.nla.gov.au/work/10068876,,1909,http://nla.gov.au/nla.obj-233089297,nla.obj-233089297,,,0,Picture,,,,False,
4,Le Siege de Berlin : Drame en un Acte / Charle...,https://trove.nla.gov.au/work/10069391,,1915,http://nla.gov.au/nla.obj-509324870,nla.obj-509324870,,,33,Book,,,,False,


In [41]:
# How many have been downloaded?
df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape

(27367, 15)

Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.

As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates.

In [42]:
df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False) == True].sort_values('trove_id')

Unnamed: 0,title,url,contributors,date,fulltext_url,trove_id,language,rights,pages,form,volume,parent,children,text_downloaded,text_file
18342,Three weeks in Southland : being the account o...,https://trove.nla.gov.au/work/237350529,"Reid, Stuart, active 1884-1885",1885,https://nla.gov.au/nla.obj-101207695,nla.obj-101207695,English,Out of Copyright|http://rightsstatements.org/v...,66,Book,,,,True,three-weeks-in-southland-being-the-account-of-...
6114,Three weeks in Southland : being the account o...,https://trove.nla.gov.au/work/19178390,"Reid, Stuart, active 1884-1885",1885,http://nla.gov.au/nla.obj-101207695,nla.obj-101207695,,,66,Book,2,nla.obj-477008239,,True,three-weeks-in-southland-being-the-account-of-...
6371,A recent visit to several of the Polynesian is...,https://trove.nla.gov.au/work/19241288,"Bennett, George, active 1830-1831",1831-1832,http://nla.gov.au/nla.obj-101212925,nla.obj-101212925,,,8,Book,,,,True,a-recent-visit-to-several-of-the-polynesian-is...
18344,A recent visit to several of the Polynesian is...,https://trove.nla.gov.au/work/237350531,"Bennett, George, active 1830-1831",1831,https://nla.gov.au/nla.obj-101212925,nla.obj-101212925,English,No known copyright restrictions|http://rightss...,8,Book,,,,True,a-recent-visit-to-several-of-the-polynesian-is...
18361,How Capt. Cook died : new light from an old book,https://trove.nla.gov.au/work/237350548,,1908,https://nla.gov.au/nla.obj-101227721,nla.obj-101227721,English,No known copyright restrictions|http://rightss...,10,Book,,,,True,how-capt-cook-died-new-light-from-an-old-book-...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18320,A Wonderful Illawarra waterfall : a rare beaut...,https://trove.nla.gov.au/work/237350507,,1895,https://nla.gov.au/nla.obj-99671695,nla.obj-99671695,English,No known copyright restrictions|http://rightss...,1,Book,,,,True,a-wonderful-illawarra-waterfall-a-rare-beauty-...
3340,The Results of the census of 1871 : supplement...,https://trove.nla.gov.au/work/17856108,,1873,http://nla.gov.au/nla.obj-99716940,nla.obj-99716940,,,2,Book,,,,True,the-results-of-the-census-of-1871-supplement-t...
18365,The Results of the census of 1871 : supplement...,https://trove.nla.gov.au/work/237350552,,1873,https://nla.gov.au/nla.obj-99716940,nla.obj-99716940,English,No known copyright restrictions|http://rightss...,2,Book,,,,True,the-results-of-the-census-of-1871-supplement-t...
727,Regular packets for Australia : emigration to ...,https://trove.nla.gov.au/work/12328620,,1850,http://nla.gov.au/nla.obj-99727992,nla.obj-99727992,,,1,Book,,,,True,regular-packets-for-australia-emigration-to-po...


In [43]:
# Save as CSV
df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)
display(FileLink('trove_digitised_books_with_ocr.csv'))

## Some leftover bits used for renaming the text files

In [None]:
# Rename files to include truncated title of book
for row in df.itertuples():
    try:
        os.rename(os.path.join('text', '{}.txt'.format(row.trove_id)), os.path.join('text', '{}-{}.txt'.format(slugify(str(row.title)[:50]), row.trove_id)))
    except FileNotFoundError:
        pass

In [None]:
# Convert all filenames back to just nla.obj- form
for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:
    try:
        objname = re.search(r'.*(nla\.obj.*)', filename).group(1)
    except AttributeError:
        # No nla.obj identifier in the file name, so report it and move on
        print(filename)
    else:
        os.rename(os.path.join('text', filename), os.path.join('text', objname))

----

Created by [Tim Sherratt](https://timsherratt.org) for the [GLAM Workbench](https://glam-workbench.github.io/).

Work on this notebook was supported by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).