# Using TroveHarvester to get newspaper articles in bulk

The Trove Newspaper Harvester is a command-line tool that helps you download large quantities of digitised newspaper articles from [Trove](http://trove.nla.gov.au/).

Instead of making you work through page after page of search results in Trove’s web interface, the harvester saves the results of your search to a CSV (spreadsheet) file which you can then filter, sort, or analyse.

Even better, the harvester can save the full OCRd (and possibly corrected) text of each article to an individual file. You could, for example, collect the text of thousands of articles on a particular topic and then feed them to a text analysis engine like [Voyant](http://voyant-tools.org/) to look for patterns in the language.

If you'd like to install and run the TroveHarvester on your local system, see [the installation instructions](http://timsherratt.org/digital-heritage-handbook/docs/trove-newspaper-harvester/).

If you'd like to try before you buy, you can run a fully-functional version of the TroveHarvester from this very notebook!

## Getting started

If you were running TroveHarvester on your local system, you could access the basic help information by entering this on the command line:

``` bash
troveharvester -h
```

In this notebook you need to use the magic `%run` command to call the TroveHarvester script. Try this:

In [2]:
%run -m troveharvester -h

usage: troveharvester [-h] {start,restart,report} ...

positional arguments:
  {start,restart,report}
    start               Start a new harvest
    restart             Restart an unfinished harvest
    report              Report on a harvest

optional arguments:
  -h, --help            show this help message and exit


Before we go any further, make sure you have a Trove API key. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to [obtain your own Trove API Key](http://help.nla.gov.au/trove/building-with-trove/api).

Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.

Copy your API key now, and paste it in the cell below, between the quotes.

In [3]:
api_key = '6pi5hht0d2umqcro'
print('Your API key is: {}'.format(api_key))

Your API key is: 6pi5hht0d2umqcro


## What do you want to harvest?

The TroveHarvester translates queries from the Trove web interface into something that the API can understand. So all you need to do is construct your query using the web interface. Once you're happy with the results you're getting, just copy the URL.

It's important to note that there are currently a few differences between the indexes used by the web interface and the API, so some queries won't translate directly. For example, the `state` facet doesn't exist in the API index. If you use the `state` facet the TroveHarvester will try to replace it with a list of newspapers from that state, but there are now so many newspaper titles that this could fail. Similarly, the API index won't recognise `has:corrections`. However, most queries should translate without any problems.
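
To give a feel for what this translation involves, here is a minimal sketch, not the TroveHarvester's own code. It only carries the `q` parameter across and ignores facets such as `state`, which the real tool handles separately. The function name `web_url_to_api_url` is hypothetical; the endpoint and parameter names are copied from the request URLs that the harvester prints when it runs.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint as it appears in the request URLs printed by the harvester.
API_URL = 'http://api.trove.nla.gov.au/result'


def web_url_to_api_url(web_url, api_key, start=0, page_size=20):
    """Illustrative only: build an API request URL from a web interface URL.

    Only the `q` parameter is carried across; any facets in the web URL
    are ignored in this sketch.
    """
    web_params = parse_qs(urlparse(web_url).query)
    params = [
        ('q', web_params.get('q', [''])[0]),
        ('encoding', 'json'),
        ('zone', 'newspaper'),
        ('reclevel', 'full'),
        ('include', 'articleText'),  # ask for the OCRd text of each article
        ('n', page_size),            # number of results per request
        ('key', api_key),
        ('s', start),                # offset of the first result to return
    ]
    return '{}?{}'.format(API_URL, urlencode(params))


print(web_url_to_api_url(
    'https://trove.nla.gov.au/newspaper/result?q=wragge+cyclone',
    'YOUR_API_KEY'
))
```

A web URL that relies on `state` or `has:corrections` would need extra handling that this sketch does not attempt.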

Once you've constructed your query and copied the URL, paste it between the quotes in the cell below.

In [4]:
query = 'https://trove.nla.gov.au/newspaper/result?q=wragge+cyclone'

In [5]:
%run -m troveharvester start $query $api_key --text

http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=0
Harvested: 20
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=20
Harvested: 40
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=40
Harvested: 60
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=60
Harvested: 80
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=80
Harvested: 100
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=100
Harvested: 120
http://api.trove.nla.gov.au/result?q=w

Harvested: 1020
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=1020
Harvested: 1040
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=1040
Harvested: 1060
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=1060
Harvested: 1080
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=1080
Harvested: 1100
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=1100
Harvested: 1120
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=1120
Harvested: 1140


Harvested: 2020
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=2020
Harvested: 2040
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=2040
Harvested: 2060
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=2060
Harvested: 2080
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=2080
Harvested: 2100
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=2100
Harvested: 2120
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=2120
Harvested: 2140


Harvested: 3020
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=3020
Harvested: 3040
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=3040
Harvested: 3060
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=3060
Harvested: 3080
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=3080
Harvested: 3100
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=3100
Harvested: 3120
http://api.trove.nla.gov.au/result?q=wragge+cyclone&encoding=json&zone=newspaper&reclevel=full&include=articleText&n=20&key=6pi5hht0d2umqcro&s=3120
Harvested: 3140
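
The output above shows the harvester requesting results 20 at a time (`n=20`) and advancing the `s` offset by 20 with each request. The paging pattern can be sketched as below. This is an illustration of the pattern visible in the output rather than the harvester's actual implementation; the function name `harvest_progress` and the total of 70 results are made up for the example.

```python
def harvest_progress(total, page_size=20):
    """Yield (start_offset, harvested_so_far) for each request.

    `total` is the number of results in the search; how the real
    harvester learns this value is not shown here.
    """
    for start in range(0, total, page_size):
        # The final page may hold fewer than `page_size` results.
        harvested = min(start + page_size, total)
        yield start, harvested


# A hypothetical search with 70 results needs four requests.
for start, harvested in harvest_progress(70):
    print('s={} -> Harvested: {}'.format(start, harvested))
```

Each `start` value corresponds to the `s` parameter at the end of a request URL, and `harvested` to the running count printed after it.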
