# Harvest data from Papers Past

This notebook lets you harvest large amounts of data from Papers Past (via DigitalNZ) for further analysis. It saves the results as a CSV file that you can open in any spreadsheet program. It currently includes the OCRd text of all the newspaper articles, but I might make this optional in the future. Thoughts?

You can edit this notebook to harvest other collections in DigitalNZ; see the notes below for pointers. However, it currently saves only a small subset of the available metadata, so you'd probably want to adjust the fields as well. Add an [issue on GitHub](https://github.com/GLAM-Workbench/digitalnz/issues) if you need help creating a custom harvester.

There are only two things you **have** to change: you need to enter your API key and a search query. There are additional options for limiting your search results.

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

## Setting things up

Just run these cells to set up some things that we'll need later on.

In [12]:
# This cell just sets up some stuff that we'll need later

import logging
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import pandas as pd
from tqdm.notebook import tqdm as tqdm_notebook
import time
import re
from slugify import slugify
from time import strftime
from IPython.display import display, FileLink

logging.basicConfig(level=logging.ERROR)
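s = requests.Session()
# Retry failed requests up to 5 times, with increasing delays, on these server errors
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
# Mount the adapter for both schemes, as the harvester requests an http:// URL
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))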

In [32]:
# This cell sets the basic parameters that we'll send to the API
# You'll add your API key and search query to this below
# You could change the 'display_collection' value to something other than
# Papers Past to harvest other parts of DigitalNZ

params = {
    'and[display_collection][]': 'Papers Past',
    'per_page': '100'
}

## Add your API key

Go get yourself a [DigitalNZ API key](https://digitalnz.org/developers/getting-started), then paste it between the quotes below. You need a key to make API requests, but they're free and quick to obtain.

In [33]:
# Paste your API key between the quotes
# You might need to trim off any spaces at the beginning and end
api_key = '9yXNTynMDb3TUQws7QuD'
# Add the key to the parameters that will be sent with every request
params['api_key'] = api_key
print('Your API key is: {}'.format(api_key))

Your API key is: 9yXNTynMDb3TUQws7QuD


## Add your search query

This is where you specify your search. Just put in anything you might enter in the DigitalNZ search box.


In [34]:
params['text'] = 'possum'
#params['text'] = 'possum AND opossum'
#params['text'] = '"possum skins"'
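If you're curious how a query ends up in the request, here's a minimal sketch that encodes a set of parameters into a query string using Python's standard library. The `example_params` dictionary is just a stand-in for `params` (without the API key), and the encoding that `requests` produces may differ in small details.

```python
from urllib.parse import urlencode

# A stand-in for the notebook's params (no API key, so it's safe to print)
example_params = {
    'and[display_collection][]': 'Papers Past',
    'per_page': '100',
    'text': '"possum skins"',
}

# Square brackets, spaces, and quotes are encoded so they're safe in a URL
query_string = urlencode(example_params)
print(query_string)
```

The result should look a lot like the part of the request URL that follows the `?`.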

You can also limit your results to a particular newspaper. Just remove the '#' from the start of the line to add this parameter to your query.

In [None]:
#params['and[collection][]'] = 'Evening Post'

You can also limit your query by date, but it's a bit fiddly. 

Filtering by a single century, decade or year is simple. Just add the appropriate parameter as in the examples below. Remove the '#', edit the value, and run the cell.

In [None]:
#params['and[century][]'] = '1800'
#params['and[decade][]'] = '1850'
#params['and[year][]'] = '1853'
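The example values above suggest that decades and centuries are labelled by their first year (1853 falls in the decade `1850` and the century `1800`). Assuming that pattern holds, this sketch works out all three values from a single year. `date_filters` is a hypothetical helper I've made up for illustration; it isn't part of the DigitalNZ API or this notebook.

```python
def date_filters(year):
    """Return century, decade and year filter values for a given year.

    Assumes decades and centuries are labelled by their first year,
    as in the '1800' / '1850' / '1853' examples.
    """
    return {
        'and[century][]': str(year // 100 * 100),
        'and[decade][]': str(year // 10 * 10),
        'and[year][]': str(year),
    }

filters = date_filters(1853)
print(filters)
```

In practice you'd probably pick just one of these to add to `params`, depending on how broad you want the search to be.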

There's no direct way (I think) to search a range of years, but we can get around this by issuing a request for each year separately and then combining the results. If you want to do this, change the values below.

In [38]:
# These values set the range of years to harvest
# You need both a start and an end year, eg 1870 and 1880
# Set both to None if you don't want to harvest a range of years

start_year = 1870
end_year = 1880
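To make the year-by-year approach a bit more concrete, here's a rough sketch of how a range of years expands into one set of parameters per year. `params_by_year` is a hypothetical helper, and unlike the harvester below it copies the parameters rather than changing them in place.

```python
def params_by_year(base_params, start_year, end_year):
    """Build one copy of the parameters for each year in an inclusive range."""
    yearly = []
    for year in range(start_year, end_year + 1):
        year_params = dict(base_params)  # copy so the original isn't changed
        year_params['and[year][]'] = year
        year_params['page'] = 1  # each year starts again from the first page
        yearly.append(year_params)
    return yearly

yearly_params = params_by_year({'text': 'possum'}, 1870, 1880)
print(len(yearly_params))
```

Note that the range includes the end year, so 1870 to 1880 means eleven separate searches.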

## Set up some code

In [43]:
class Harvester():
    
    def __init__(self, params, start_year=None, end_year=None):
        self.params = params
        self.start_year = start_year
        self.end_year = end_year
        self.current_year = None
        self.total = 0
        self.more = True
        self.articles = []

    def process_results(self, data):
        results = data['search']['results']
        if results:
            self.articles += self.process_articles(results)
            return True
        else:
            return False
        
    def process_articles(self, results):
        articles = []
        for result in results:
            title = re.sub(r'(\([^)]*\))[^(]*$', '', result['title']).strip()
            articles.append({
                'id': result['id'],
                'title': title,
                'newspaper': result['publisher'][0],
                'date': result['date'],
                'text': result['fulltext'],
                'paperspast_url': result['landing_url'],
                'source_url': result['source_url']
            })
        return articles

    def get_data(self):
        response = s.get('http://api.digitalnz.org/v3/records.json', params=self.params)
        print(response.url)
        return response.json()
    
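    def harvest(self):
        # Start from page 1 unless a page has already been set (eg when restarting)
        self.params.setdefault('page', 1)
        data = self.get_data()
        total = data['search']['result_count']
        if total > 0:
            self.more = self.process_results(data)
            with tqdm_notebook(total=total) as pbar:
                # Update the bar by the number of results actually received
                pbar.update(len(data['search']['results']))
                while self.more:
                    self.params['page'] += 1
                    data = self.get_data()
                    self.more = self.process_results(data)
                    pbar.update(len(data['search']['results']))
                    # Pause briefly between requests to go easy on the API
                    time.sleep(0.2)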

    def start_harvest(self):
        if self.start_year and self.end_year:
            for year in range(self.start_year, self.end_year+1):
                self.params['page'] = 1
                self.more = True
                self.current_year = year
                self.params['and[year][]'] = year
                self.harvest()
        else:
            self.harvest()
            
    def restart_harvest(self):
        # Don't reset the page to 1 -- hopefully pick up where it stopped
        if self.current_year and self.end_year:
            for year in range(self.current_year, self.end_year+1):
                if year != self.current_year:
                    self.params['page'] = 1
                self.more = True
                self.current_year = year
                self.params['and[year][]'] = year
                self.harvest()
        else:
            self.harvest()
               
    def save_as_csv(self, filename=None):
        if not filename:
            filename = '{}-{}.csv'.format(slugify(self.params['text']), strftime("%Y%m%d"))
        df = pd.DataFrame(self.articles)
        df.to_csv(filename, index=False)
        display(FileLink(filename))
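The heart of the harvester is a simple loop: keep asking for the next page until one comes back empty. Here's a stripped-down sketch of that idea that you can run without an API key. `fake_get_page` and `harvest_all` are made-up names for illustration, and the fake function just returns numbers rather than real records.

```python
def fake_get_page(page, per_page=100, total=250):
    """Stand-in for an API call: returns one page of numbered records."""
    start = (page - 1) * per_page
    end = min(start + per_page, total)
    return list(range(start, end))

def harvest_all(get_page):
    """Request successive pages until one comes back empty."""
    records = []
    page = 1
    while True:
        results = get_page(page)
        if not results:
            break
        records += results
        page += 1
    return records

records = harvest_all(fake_get_page)
print(len(records))
```

One consequence of stopping on an empty page is that there's always one extra request at the end, which is why the harvester asks for a second page even when a year has only a handful of results.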

## Start your harvest

In [44]:
harvester = Harvester(params, start_year=start_year, end_year=end_year)
harvester.start_harvest()

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1870


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1870

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1871


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1871

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1872


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1872

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1873


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1873

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1874


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1874

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1875


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1875

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1876


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1876

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1877


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1877

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1878


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1878

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1879


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1879

http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=1&and%5Byear%5D%5B%5D=1880


http://api.digitalnz.org/v3/records.json?and%5Bdisplay_collection%5D%5B%5D=Papers+Past&per_page=100&api_key=9yXNTynMDb3TUQws7QuD&text=possum&page=2&and%5Byear%5D%5B%5D=1880



## Save your harvest

This cell generates a CSV file and creates a link that you can use to download it.

In [23]:
harvester.save_as_csv()
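Once you have a CSV file you can start exploring it. As a small taste, this sketch counts articles by newspaper using only Python's standard library. The sample rows are made up and only include a few of the harvester's columns; with a real harvest you'd open your downloaded file instead.

```python
import csv
import io
from collections import Counter

# Made-up sample rows using some of the harvester's column names
sample_csv = io.StringIO(
    'id,title,newspaper\n'
    '1,Example one,Evening Post\n'
    '2,Example two,Evening Post\n'
    '3,Example three,Press\n'
)

# With a real harvest, replace sample_csv with open('your-file.csv', newline='')
counts = Counter(row['newspaper'] for row in csv.DictReader(sample_csv))
print(counts.most_common())
```

You could do the same sort of thing in a spreadsheet, or with pandas if you want to go further.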