# Harvesting data from the Bulletin (or any other digitised journal)

<div class="alert alert-warning">
    This notebook is outdated. For a more general approach to harvesting data from Trove journals see:
    <ul>
        <li><a href="Get-text-from-a-Trove-journal.ipynb">Get OCRd text from a digitised journal in Trove</a></li>
        <li><a href="Get-page-images-from-a-Trove-journal.ipynb">Get covers (or any other pages) from a digitised journal in Trove</a></li>
    </ul>
</div>

The National Library of Australia is digitising lots of interesting and useful journals like *The Bulletin*. These can be downloaded as images, PDFs or as plain text. However, as there's no API access, there's no obvious way of mechanising the download process to create large data sets. But with a little reverse engineering of the interface and some screen scraping it *is* possible. This notebook will show you how, providing all you need to download the metadata for every issue of *The Bulletin*, along with high-resolution images of the covers and the OCRd text.

The code here could be easily modified to download data from another journal.

You can download [pre-harvested text and metadata](https://github.com/wragge/ozglam-workbench/blob/master/Trove/Cookbook/Harvesting-data-from-the-Bulletin.ipynb) from the Trove texts repository.

In [1]:
# Let's import the libraries we need.
import requests
from bs4 import BeautifulSoup
import time
import json
import os
import zipfile
import io
import re
import arrow
from arrow.parser import ParserError

In [2]:
# Make sure data directory exists
data_dir = os.path.join('journals', 'bulletin')
os.makedirs(data_dir, exist_ok=True)

## Getting the issue data

Each issue of a journal like *The Bulletin* has its own unique identifier. You've probably noticed them in the urls of Trove resources. They look something like this: `nla.obj-188537163`. Once we have the identifier for an issue we can easily download the contents, but how do we get a complete list of identifiers?

This is where we need to do a bit of reverse engineering. One essential tool when you're doing this sort of work is your browser console. It varies a bit across browsers, but usually you can open the console by right clicking on a page and selecting the 'Inspect' option. Once it's open, choose the 'Network' tab, then go to the [parent page](https://nla.gov.au/nla.obj-68375465/) for *The Bulletin* in the Trove Digital Library. Now click on the 'Browse' option in the Trove menu. Look carefully through all the entries in the Network console, and you should find this link:

```
https://nla.gov.au/nla.obj-68375465/browse?startIdx=0&rows=20&op=c
```

This link retrieves the issue details that are then displayed in the browse pane, but it's just a normal url that delivers normal HTML. [Click here to open it.](https://nla.gov.au/nla.obj-68375465/browse?startIdx=0&rows=20&op=c)

As you may have noticed, the url contains a `startIdx` parameter. By increasing this value, you can navigate your way through the complete set of issues.
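
To make the paging logic concrete, here's a minimal sketch that runs without touching the network. The `harvest_all` and `fake_fetch` names are made up for illustration, and the pretend journal of 45 issues is just test data. The idea is to keep asking for pages of 20 and stop as soon as a page comes back short.

```python
def harvest_all(fetch_page, rows=20):
    """Collect items page by page until a page comes back with fewer than `rows` items."""
    start = 0
    items = []
    while True:
        page = fetch_page(start, rows)
        items.extend(page)
        # A short (or empty) page means we've reached the end
        if len(page) < rows:
            break
        # Otherwise move startIdx along to the next page
        start += rows
    return items


# Stand-in for the real request: pretend the journal has 45 issues
FAKE_ISSUES = ['issue-{}'.format(i) for i in range(45)]
requested = []


def fake_fetch(start, rows):
    # Record the startIdx values that were asked for
    requested.append(start)
    return FAKE_ISSUES[start:start + rows]


harvested = harvest_all(fake_fetch)
print(len(harvested), requested)
```

In a real harvest, `fetch_page` would request the browse url with the given `startIdx` and return the parsed issue details.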

The browser console is also useful for inspecting the HTML structure of pages. If you look at the contents of the browse page, you'll see that the details for each issue are presented as a definition list (`<dl>`), inside a `<div>` with the class of `l-item-info`. This information tells us the paths we need to follow to get to the issue metadata.
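
Here's a small sketch of how those paths translate into BeautifulSoup calls. The HTML fragment is a made-up, heavily simplified stand-in for one entry on the browse page (the real markup has more elements and attributes), and `parse_issues` is a hypothetical helper name. It uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single issue; not copied from Trove
SAMPLE_HTML = """
<div class="l-item-info">
  <dl>
    <dt><a href="/nla.obj-124654480">nla.obj-124654480</a></dt>
    <dd><p>The Bulletin</p></dd>
    <dd><p>Issue</p></dd>
    <dd><p>No. 1 (31 Jan 1880)</p></dd>
    <dd><a class="browse-child" href="#">8 pages</a></dd>
  </dl>
</div>
"""


def parse_issues(html):
    soup = BeautifulSoup(html, 'html.parser')
    issues = []
    # Each issue lives in a div with the class l-item-info
    for detail in soup.find_all(class_='l-item-info'):
        rows = detail.find_all('dd')
        pages_text = detail.find('a', class_='browse-child').text
        issues.append({
            # The identifier is the text of the link inside the <dt>
            'id': str(detail.dt.a.string),
            # The issue details are in the third <dd>
            'details': str(rows[2].p.string),
            # The page count is the number at the start of the link text
            'pages': re.search(r'^(\d+)', pages_text, flags=re.MULTILINE).group(1),
        })
    return issues


print(parse_issues(SAMPLE_HTML))
```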

That's really all we need to know to start harvesting!

In [3]:
# This is just the url we found above, with a slot into which we can insert the startIdx value
# If you want to download data from another journal, just change the nla.obj identifier to point to the journal.
start_url = 'https://nla.gov.au/nla.obj-68375465/browse?startIdx={}&rows=20&op=c'

In [4]:
# The initial startIdx value
start = 0
# Number of results per page
n = 20
issues = []
# If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
while n == 20:
    # Get the browse page
    response = requests.get(start_url.format(start))
    # Beautifulsoup turns the HTML into an easily navigable structure
    soup = BeautifulSoup(response.text, 'lxml')
    # Find all the divs containing issue details and loop through them
    details = soup.find_all(class_='l-item-info')
    for detail in details:
        issue = {}
        # Get the issue id
        issue['id'] = detail.dt.a.string
        rows = detail.find_all('dd')
        # Get the issue details
        issue['details'] = rows[2].p.string
        # Get the number of pages
        issue['pages'] = re.search(r'^(\d+)', detail.find('a', class_="browse-child").text, flags=re.MULTILINE).group(1)
        issues.append(issue)
        print(issue)
    # Increment the startIdx
    start += n
    # Set n to the number of results on the current page
    n = len(details)
    # Pause briefly between requests to go easy on the server
    time.sleep(0.2)

{'id': 'nla.obj-124654480', 'details': 'No. 1 (31 Jan 1880)', 'pages': '8'}
{'id': 'nla.obj-188284455', 'details': 'No. 2 (7 Feb 1880)', 'pages': '8'}
{'id': 'nla.obj-188537163', 'details': 'No. 3 (14 Feb 1880)', 'pages': '8'}
{'id': 'nla.obj-125527104', 'details': 'No. 4 (21 Feb 1880)', 'pages': '8'}
{'id': 'nla.obj-201081027', 'details': 'No. 5 (28 Feb 1880)', 'pages': '8'}
{'id': 'nla.obj-125534622', 'details': 'No. 6 (6 Mar 1880)', 'pages': '8'}
{'id': 'nla.obj-201334002', 'details': 'No. 7 (13 Mar 1880)', 'pages': '8'}
{'id': 'nla.obj-125535846', 'details': 'No. 8 (20 Mar 1880)', 'pages': '8'}
{'id': 'nla.obj-201347662', 'details': 'No. 9 (27 Mar 1880)', 'pages': '8'}
{'id': 'nla.obj-125537405', 'details': 'No. 10 (3 Apr 1880)', 'pages': '8'}
{'id': 'nla.obj-201352807', 'details': 'No. 11 (10 Apr 1880)', 'pages': '8'}
{'id': 'nla.obj-125539523', 'details': 'No. 12 (17 Apr 1880)', 'pages': '8'}
{'id': 'nla.obj-186292019', 'details': 'No. 13 (24 Apr 1880)', 'pages': '8'}
{'id': 'nla

In [5]:
len(issues)

4645

In [6]:
# Save the harvested results as a JSON file in case we need them later on
with open('{}/issues.json'.format(data_dir), 'w') as outfile:
    json.dump(issues, outfile)

In [7]:
# Open the saved JSON file
with open('{}/issues.json'.format(data_dir), 'r') as infile:
    issues = json.load(infile)

## Cleaning up the metadata

So far we've just grabbed the complete issue details as a single string. It would be good to parse this string so that we have the dates, volume and issue numbers in separate fields. As is always the case, there's a bit of variation in the way this information is recorded. The code below tries out different combinations and then saves the structured data in a Python list.

In [8]:
issues_data = []
# Loop through the issues
for issue in issues:
    issue_data = {}
    issue_data['id'] = issue['id']
    issue_data['pages'] = int(issue['pages'])
    print(issue['details'])
    try:
        # This pattern looks for details in the form: Vol. 2 No. 3 (2 Jul 1878)
        details = re.search(r'(.*)Vol\. (\d+) No\.* (\d+) \((.+)\)', issue['details'].strip())
        issue_data['label'] = details.group(1).strip()
        issue_data['volume'] = details.group(2)
        issue_data['number'] = details.group(3)
        date = details.group(4)
    except AttributeError:
        try:
            # This pattern looks for details in the form: No. 3 (2 Jul 1878)
            details = re.search(r'No\. (\d+) \((.+)\)', issue['details'].strip())
            issue_data['label'] = ''
            issue_data['volume'] = ''
            issue_data['number'] = details.group(1)
            date = details.group(2)
        except AttributeError:
            try:
                # This pattern looks for details in the form: Bulletin Christmas Edition (2 Jul 1878)
                details = re.search(r'(.*) \((.+)\)', issue['details'].strip())
                issue_data['label'] = details.group(1)
                issue_data['volume'] = ''
                issue_data['number'] = ''
                date = details.group(2)
            except AttributeError:
                try:
                    # This pattern looks for details in the form: Bulletin 1878 Jul 3
                    details = re.search(r'Bulletin (.+)', issue['details'].strip())
                    date_str = details.group(1)
                    # Date is wrong way round, split and reverse
                    date = ' '.join(reversed(date_str.split()))
                    issue_data['label'] = ''
                    issue_data['volume'] = ''
                    issue_data['number'] = ''
                except AttributeError:
                    # Coping with this... Special Number of The Bulletin, November 1, 1895
                    details = re.search(r'(.*), (.*) (\d+), (\d+)', issue['details'].strip())
                    issue_data['label'] = details.group(1).strip()
                    issue_data['volume'] = ''
                    issue_data['number'] = ''
                    date = '{} {} {}'.format(details.group(3), details.group(2), details.group(4))
    # Normalise months
    date = date.replace('June', 'Jun').replace('July', 'Jul').replace('Sept', 'Sep').replace('November', 'Nov').replace('  ', ' ')
    # Convert date to ISO format
    issue_data['date'] = arrow.get(date, 'D MMM YYYY').format('YYYY-MM-DD')
    issues_data.append(issue_data)
    

No. 1 (31 Jan 1880)
No. 2 (7 Feb 1880)
No. 3 (14 Feb 1880)
No. 4 (21 Feb 1880)
No. 5 (28 Feb 1880)
No. 6 (6 Mar 1880)
No. 7 (13 Mar 1880)
No. 8 (20 Mar 1880)
No. 9 (27 Mar 1880)
No. 10 (3 Apr 1880)
No. 11 (10 Apr 1880)
No. 12 (17 Apr 1880)
No. 13 (24 Apr 1880)
Vol. 2 No. 14 (1 May 1880)
Vol. 2 No. 15 (8 May 1880)
Vol. 2 No. 16 (15 May 1880)
Vol. 2 No. 17 (22 May 1880)
Vol. 2 No. 18 (29 May 1880)
Vol. 2 No. 19 (5 Jun 1880)
Vol. 2 No. 20 (12 Jun 1880)
Vol. 2 No. 21 (19 Jun 1880)
Vol. 2 No. 22 (26 June 1880)
Vol. 2 No. 23 (3 Jul 1880)
Vol. 2 No. 24 (10 Jul 1880)
Vol. 2 No. 25 (17 Jul 1880)
Vol. 2 No. 26 (24 Jul 1880)
Vol. 3 No. 27 (31 Jul 1880)
Vol. 3 No. 28 (7 Aug 1880)
Vol. 3 No. 29 (14 Aug 1880)
Vol. 3 No. 30 (21 Aug 1880)
Vol. 3 No. 31 (28 Aug 1880)
Vol. 3 No. 32 (4 Sept 1880)
Vol. 3 No. 33 (11 Sep 1880)
Vol. 3 No. 34 (18 Sep 1880)
Vol. 3 No. 35 (25 Sep 1880)
Vol. 3 No. 36 (2 Oct 1880)
Vol. 3 No. 37 (9 Oct 1880)
Vol. 3 No. 38 (16 Oct 1880)
Vol. 3 No. 39 (23 Oct 1880)
Vol. 4 No. 40 (30
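
The nested `try`/`except` blocks above can get hard to follow as more variations turn up. One alternative is an ordered list of patterns with named groups, sketched below. This is only an illustration, not a drop-in replacement: it covers the three parenthesised forms only, the `PATTERNS`, `normalise_date` and `parse_details` names are made up, and it uses the standard library's `datetime` rather than `arrow` (so it assumes an English locale for month abbreviations).

```python
import re
from datetime import datetime

# Ordered from most to least specific; the first pattern that matches wins
PATTERNS = [
    # Vol. 2 No. 3 (2 Jul 1878)
    re.compile(r'(?P<label>.*)Vol\. (?P<volume>\d+) No\.* (?P<number>\d+) \((?P<date>.+)\)'),
    # No. 3 (2 Jul 1878)
    re.compile(r'No\. (?P<number>\d+) \((?P<date>.+)\)'),
    # Bulletin Christmas Edition (2 Jul 1878)
    re.compile(r'(?P<label>.*) \((?P<date>.+)\)'),
]

LONG_MONTHS = {'June': 'Jun', 'July': 'Jul', 'Sept': 'Sep'}


def normalise_date(date):
    # Shorten the month names that don't fit the three-letter format
    date = re.sub(r'\b(June|July|Sept)\b', lambda m: LONG_MONTHS[m.group(1)], date)
    # Collapse any repeated whitespace
    date = ' '.join(date.split())
    return datetime.strptime(date, '%d %b %Y').date().isoformat()


def parse_details(details):
    for pattern in PATTERNS:
        match = pattern.search(details.strip())
        if match:
            # Start with empty values, then fill in whatever the pattern captured
            fields = {'label': '', 'volume': '', 'number': ''}
            fields.update(match.groupdict())
            fields['label'] = fields['label'].strip()
            fields['date'] = normalise_date(fields['date'])
            return fields
    # Nothing matched, leave it to the caller to decide what to do
    return None


print(parse_details('Vol. 3 No. 32 (4 Sept 1880)'))
```

Adding support for another variation then means adding a pattern to the list rather than another level of nesting.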

## Save as CSV

Now that the issues data is in a nice, structured form, we can load it into a Pandas dataframe. This allows us to do things like find the total number of pages digitised.

We can also save the metadata as a CSV.

In [9]:
import pandas as pd
# Convert issues metadata into a dataframe
df = pd.DataFrame(issues_data, columns=['id', 'label', 'volume', 'number', 'date', 'pages'])

In [10]:
# Find the total number of pages
df['pages'].sum()

200546

In [11]:
# Save metadata as a CSV.
df.to_csv('{}/bulletin_issues.csv'.format(data_dir), index=False)

## Download front pages

The covers of many of the digitised journals are pretty interesting. Here's some code to download images of the covers of *The Bulletin*. Unfortunately, at some point *The Bulletin* moved its cover artworks *inside* the journal, so the front page is mostly advertising. Of course, you can easily adjust this code to download a different page, or range of pages.

Once again, you can find the link to download an item by opening up your browser console's network tab, and then watching what happens when you click on the 'Start download' button in Trove.

You should see a url something like this:

```
https://trove.nla.gov.au/nla.obj-514230837/download?downloadOption=zip&firstPage=0&lastPage=27
```

There are four parameters we can change to control what we download and the format that it's downloaded in:

* the item id (the `nla.obj` bit)
* the `downloadOption` parameter – this can be `zip` (a zip file containing JPG images), `pdf`, or `ocr` (the OCRd text)
* the `firstPage` parameter – what page to start from (numbering starts from 0)
* the `lastPage` parameter – what page to stop at

So to download the first page of the issue with the id of `nla.obj-514230837`, you'd use the url:

```
https://trove.nla.gov.au/nla.obj-514230837/download?downloadOption=zip&firstPage=0&lastPage=0
```
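
If you're building lots of these urls, a small helper can save some string fiddling. The sketch below is one way of doing it. The `build_download_url` name and its default values are made up for illustration, and it assumes the parameters behave as described above.

```python
from urllib.parse import urlencode


def build_download_url(item_id, option='zip', first_page=0, last_page=0,
                       host='https://trove.nla.gov.au'):
    """Build a download url for a range of pages (page numbering starts at 0)."""
    params = urlencode({
        'downloadOption': option,
        'firstPage': first_page,
        'lastPage': last_page,
    })
    return '{}/{}/download?{}'.format(host, item_id, params)


# Just the first page, as a zip of JPGs
print(build_download_url('nla.obj-514230837'))
# The OCRd text of a whole 28 page issue
print(build_download_url('nla.obj-514230837', option='ocr', last_page=28 - 1))
```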

Note that the JPG and PDF files are likely to be very large, so downloading them will consume significant amounts of time and disk space.

The code below checks to see if an image has already been saved before downloading it, so if the process is interrupted you can just run it again to pick up where it stopped. If more issues are added to Trove you could run it again to pick up any new images.

In [16]:
# Prepare a directory to save the images into
output_dir = data_dir + '/images'
os.makedirs(output_dir, exist_ok=True)
# Loop through the issue metadata
for issue in issues_data:
    print(issue['id'])
    id = issue['id']
    # Check to see if the first page of this issue has already been downloaded
    if not os.path.exists('{}/{}-1.jpg'.format(output_dir, id)):
        url = 'https://nla.gov.au/{}/download?downloadOption=zip&firstPage=0&lastPage=0'.format(id)
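        # Get the file
        r = requests.get(url)
        # Only try to unzip the response if the request succeeded
        if r.status_code == requests.codes.ok:
            # The image is in a zip, so we need to extract the contents into the output directory
            with zipfile.ZipFile(io.BytesIO(r.content)) as z:
                z.extractall(output_dir)
        else:
            print('There was a problem: {}'.format(r.status_code))
        time.sleep(1)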

nla.obj-124654480
nla.obj-188284455
nla.obj-188537163
nla.obj-125527104
nla.obj-201081027
nla.obj-125534622
nla.obj-201334002
nla.obj-125535846
nla.obj-201347662
nla.obj-125537405
nla.obj-201352807
nla.obj-125539523
nla.obj-186292019
nla.obj-125539812
nla.obj-125541996
nla.obj-125542902
nla.obj-201374109
nla.obj-125569437
nla.obj-125586911
nla.obj-125604005
nla.obj-127616310
nla.obj-201397257
nla.obj-242158627
nla.obj-68379827
nla.obj-79812988
nla.obj-79851608
nla.obj-79833688
nla.obj-127616411
nla.obj-127618323
nla.obj-201414516
nla.obj-127621071
nla.obj-79866818
nla.obj-242204468
nla.obj-242208549
nla.obj-242211336
nla.obj-204414888
nla.obj-127622376
nla.obj-242213994
nla.obj-242221417
nla.obj-127624573
nla.obj-127626657
nla.obj-242222932
nla.obj-242229226
nla.obj-242235949
nla.obj-242242684
nla.obj-242246068
nla.obj-127630095
nla.obj-204461471
nla.obj-127632647
nla.obj-127633292
nla.obj-127634679
nla.obj-128328977
nla.obj-128329114
nla.obj-128330549
nla.obj-128333258
nla.obj-1283346
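
If you'd like to try the 'skip what's already downloaded' pattern without hitting Trove, here's a self-contained sketch. It builds a fake zip file in memory in place of the real download. The `make_fake_zip`, `save_first_page` and `fake_fetch` names are made up, and it assumes (as the check in the code above does) that the zip contains a file named `<id>-1.jpg`.

```python
import io
import os
import tempfile
import zipfile


def make_fake_zip(filename, data=b'not really a jpeg'):
    """Create the bytes of a zip file containing a single file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as z:
        z.writestr(filename, data)
    return buffer.getvalue()


def save_first_page(item_id, output_dir, fetch):
    """Extract the first page image unless it's already there. Returns True if fetched."""
    target = os.path.join(output_dir, '{}-1.jpg'.format(item_id))
    if os.path.exists(target):
        return False
    content = fetch(item_id)
    # Unzip straight from memory into the output directory
    with zipfile.ZipFile(io.BytesIO(content)) as z:
        z.extractall(output_dir)
    return True


calls = []


def fake_fetch(item_id):
    # Stand-in for requests.get(url).content
    calls.append(item_id)
    return make_fake_zip('{}-1.jpg'.format(item_id))


with tempfile.TemporaryDirectory() as tmp:
    first_run = save_first_page('nla.obj-124654480', tmp, fake_fetch)
    second_run = save_first_page('nla.obj-124654480', tmp, fake_fetch)

print(first_run, second_run, calls)
```

The second call finds the image already on disk, so nothing is fetched again.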

## Download texts

As noted above, you can download the OCRd text of an issue using exactly the same method. Just change the `downloadOption` to `ocr` and the `lastPage` to the number of pages in the issue minus one (because the numbering starts at zero).

Some issues do not have any OCRd text and the download link returns an empty file. The code below downloads and saves all non-empty files, and stores the ids of empty files in the `empty` list for further checking.

In [17]:
# Prepare a directory to save the texts into
output_dir = data_dir + '/text'
os.makedirs(output_dir, exist_ok=True)
empty = []
# Loop through the issues
for issue in issues_data:
    print(issue['id'])
    id = issue['id']
    # The index value for the last page of an issue will be the total pages - 1
    last_page = int(issue['pages']) - 1
    # Put the date in the file name for easy sorting and browsing
    filename = '{}/{}-{}.txt'.format(output_dir, issue['date'], id)
    # Check to see if the file has already been harvested
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        print('Already saved')
    else:
        url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(id, last_page)
        # Get the file
        r = requests.get(url)
        # Check there was no error
        if r.status_code == requests.codes.ok:
            # Check that the file's not empty
            if len(r.content) > 0:
                # If everything's ok, save the file
                with open(filename, 'wb') as text_file:
                    text_file.write(r.content)
                print('Saved')
            else:
                print('Empty')
                # Store details of empty files for later
                empty.append(id)
            time.sleep(1)
        else:
            print('There was a problem: {}'.format(r.status_code))
print(empty)
        

nla.obj-124654480
Already saved
nla.obj-188284455
Already saved
nla.obj-188537163
Already saved
nla.obj-125527104
Already saved
nla.obj-201081027
Already saved
nla.obj-125534622
Already saved
nla.obj-201334002
Already saved
nla.obj-125535846
Already saved
nla.obj-201347662
Already saved
nla.obj-125537405
Already saved
nla.obj-201352807
Already saved
nla.obj-125539523
Already saved
nla.obj-186292019
Already saved
nla.obj-125539812
Already saved
nla.obj-125541996
Already saved
nla.obj-125542902
Already saved
nla.obj-201374109
Already saved
nla.obj-125569437
Already saved
nla.obj-125586911
Already saved
nla.obj-125604005
Already saved
nla.obj-127616310
Already saved
nla.obj-201397257
Already saved
nla.obj-242158627
Already saved
nla.obj-68379827
Already saved
nla.obj-79812988
Already saved
nla.obj-79851608
Already saved
nla.obj-79833688
Already saved
nla.obj-127616411
Already saved
nla.obj-127618323
Already saved
nla.obj-201414516
Already saved
nla.obj-127621071
Already saved
nla.obj-7986

## Save details of empty files

In [18]:
# Turn the list of empty ids into a dataframe
df_empty = pd.DataFrame(empty, columns=['id'])

In [19]:
# merge empty ids with details dataframe to add the full details
empty_data = pd.merge(df_empty, df, on='id', how='left')

In [20]:
empty_data.to_csv('{}/bulletin_issues_empty.csv'.format(data_dir), index=False)

## Report on the harvest

In [21]:
import datetime
total_texts = len([f for f in os.listdir(os.path.join(data_dir, 'text')) if f.endswith('.txt')])
print('Report on harvest completed on {}: \n'.format(datetime.datetime.now().strftime('%d %b %Y')))
print('* metadata harvested for {} issues'.format(len(issues_data)))
print('* ocr text harvested for {} issues'.format(total_texts))
print('* {} issues contained no text'.format(len(empty)))

Report on harvest completed on 03 Aug 2018: 

* metadata harvested for 4635 issues
* ocr text harvested for 4465 issues
* 171 issues contained no text
