# Get a random newspaper article from Trove

Changes to the Trove API mean that the techniques I've previously used to select resources at random [will no longer work](https://updates.timsherratt.org/2019/10/09/creators-and-users.html). This notebook provides one alternative.

I wanted something that would work efficiently, but would also expose as much of the content as possible. Applying multiple facets together with a randomly-generated query seems to do a good job of getting the result set below 100 (the maximum available from a single API call). This should mean that *most* of the newspaper articles are reachable, but it's a bit hard to quantify.

Thanks to Mitchell Harrop for [suggesting I could use randomly selected stopwords](https://twitter.com/mharropesquire/status/1182175315860213760) as queries. I've supplemented the stopwords with letters and digits, and together they seem to do a good job of applying an initial filter and mixing up the relevance ranking.
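
As a rough illustration of that idea, here is a minimal sketch of how a pool of query terms could be assembled and sampled. The four sample words are taken from the `searchTerm` values in the example outputs further down; the real `stopwords.json` may well contain a different and longer list, and the variable names here are just for illustration.

```python
import json
import random
import string

# Sample only: these four words appear as search terms in the example outputs;
# the actual stopwords.json file is assumed to hold a longer list
sample_stopwords = ['they', 'are', 'now', 'above']

# Supplement the words with single letters and digits
query_terms = sample_stopwords + list(string.ascii_lowercase) + list(string.digits)

# Serialise to a JSON array, the form that json.load() reads back into a list
terms_json = json.dumps(query_terms)

# Pick a term at random and wrap it in double quotes to make the query
random_word = random.choice(json.loads(terms_json))
query = f'"{random_word}"'
print(query)
```

Wrapping the term in quotes matches what the function below does when no `query` is supplied.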

I've added some options that allow you to specify a value for `query`, `newspaper_id`, `category`, and `illustrated`, but note that some combinations of these parameters might yield no results, in which case the function returns `None`.

In [1]:
import random
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import json

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

with open('stopwords.json', 'r') as json_file:
    STOPWORDS = json.load(json_file)

In [11]:
API_KEY = 'YOUR API KEY'
API_URL = 'https://api.trove.nla.gov.au/v2/result'

In [12]:
def get_random_facet_value(params, facet):
    '''
    Get values for the supplied facet and choose one at random.
    '''
    these_params = params.copy()
    these_params['facet'] = facet
    response = s.get(API_URL, params=these_params)
    data = response.json()
    try:
        values = [t['search'] for t in data['response']['zone'][0]['facets']['facet']['term']]
    except TypeError:
        return None
    return random.choice(values)


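def get_random_article(query=None, newspaper_id=None, category=None, illustrated=None):
    '''
    Get a random newspaper article, optionally limited by query, newspaper, category, or illustrated status.
    Returns None if no article is found after repeated tries.
    '''
    total = 0
    tries = 0
    base_params = {
        'zone': 'newspaper',
        'encoding': 'json',
        'n': '0',
        # Uncomment these if you need more than the basic data
        #'reclevel': 'full',
        #'include': 'articleText',
        'key': API_KEY
    }
    while total == 0 and tries <= 10:
        # Start each try with fresh parameters so facets from a failed try don't carry over
        params = base_params.copy()
        if query: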
            params['q'] = query
        else:
            random_word = random.choice(STOPWORDS)
            params['q'] = f'"{random_word}"'
        if newspaper_id:
            params['l-title'] = newspaper_id
        else:
            params['l-title'] = get_random_facet_value(params, 'title')
        if category:
            params['l-category'] = category
        else:
            params['l-category'] = get_random_facet_value(params, 'category')
        if illustrated:
            params['l-illustrated'] = illustrated
        else:
            params['l-illustrated'] = get_random_facet_value(params, 'illustrated')
        # Select a word length at random
        params['l-word'] = get_random_facet_value(params, 'word')
        # Get decades
        decade = get_random_facet_value(params, 'decade')
        # Get years
        params['l-decade'] = decade
        year = get_random_facet_value(params, 'year')
        # Get months
        params['l-year'] = year
        month = get_random_facet_value(params, 'month')
        # Get articles for month
        params['l-month'] = month
        params['n'] = 100
        response = s.get(API_URL, params=params)
        data = response.json()
        total = int(data['response']['zone'][0]['records']['total'])
        #print(total)
        #print(response.url)
        tries += 1
    if total > 0:
        article = random.choice(data['response']['zone'][0]['records']['article'])
        return article
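
To make the facet parsing a little more concrete, here's a small offline sketch of the structure that `get_random_facet_value()` walks through. The sample data is made up: it only mirrors the keys the function indexes into, and the `None` used for the "no values" case is an assumption about one shape that would trigger the `TypeError` branch, not a record of what the API actually returns.

```python
import random

# Made-up response fragment that follows the path used in get_random_facet_value()
sample_data = {
    'response': {
        'zone': [{
            'facets': {
                'facet': {
                    'term': [
                        {'search': '190'},
                        {'search': '192'},
                    ]
                }
            }
        }]
    }
}

def extract_facet_values(data):
    '''
    Hypothetical helper: return the list of 'search' values,
    or None if the path hits something that can't be indexed.
    '''
    try:
        return [t['search'] for t in data['response']['zone'][0]['facets']['facet']['term']]
    except TypeError:
        return None

values = extract_facet_values(sample_data)
print(values)
print(random.choice(values))

# Assumed "nothing to choose from" shape: a None on the path raises TypeError
empty_data = {'response': {'zone': [{'facets': {'facet': None}}]}}
print(extract_facet_values(empty_data))
```

Because `requests` leaves out parameters whose value is `None`, a facet that comes back empty simply isn't applied to the final search.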

## Get any old article...

In [4]:
get_random_article()

{'id': '153742819',
 'url': '/newspaper/153742819',
 'heading': 'ALL ROLES IN LLOYD FILM FILLED.',
 'category': 'Article',
 'title': {'id': '742',
  'value': 'Daily Telegraph (Launceston, Tas. : 1883 - 1928)'},
 'date': '1926-11-13',
 'page': 15,
 'pageSequence': 15,
 'relevance': {'score': '0.7499819', 'value': 'likely to be relevant'},
 'snippet': 'All the principal roles in support of Harold Lloyd in the mountain story he is now filming, were filled when production manager John L. Murphy',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/153742819?searchTerm=%22they%22'}
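
The record returned above is a plain dictionary, so you can pull out whatever fields you need. Here's a minimal sketch using a trimmed copy of that record; `summarise_article()` is a hypothetical helper, not part of the notebook.

```python
# Trimmed copy of the article returned in the example above
article = {
    'id': '153742819',
    'heading': 'ALL ROLES IN LLOYD FILM FILLED.',
    'title': {'id': '742', 'value': 'Daily Telegraph (Launceston, Tas. : 1883 - 1928)'},
    'date': '1926-11-13',
    'page': 15,
    'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/153742819?searchTerm=%22they%22'
}

def summarise_article(article):
    ''' Hypothetical helper: build a one-line description of an article record. '''
    # Fall back to a placeholder if the newspaper title is missing
    newspaper = article.get('title', {}).get('value', 'Unknown newspaper')
    return f"{article['heading']} ({newspaper}, {article['date']}, page {article['page']})"

summary = summarise_article(article)
print(summary)

# Drop the random search term from the link
clean_url = article['troveUrl'].split('?')[0]
print(clean_url)
```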

## Get a random article about pademelons

In [5]:
get_random_article(query='pademelon')

{'id': '221484875',
 'url': '/newspaper/221484875',
 'heading': 'SYDNEY SQUIBS. THE MEANING OF "STANDING ORDERS"--POLITICS, ANCIENT AND MODERN--A DOMESTIC PROBLEM--ALPINE PERILS.',
 'category': 'Article',
 'title': {'id': '1176', 'value': 'Lithgow Mercury (NSW : 1898 - 1954)'},
 'date': '1903-09-04',
 'page': 7,
 'pageSequence': 7,
 'relevance': {'score': '1.2293992', 'value': 'likely to be relevant'},
 'snippet': 'I have had the honour of a chat with a member of the "alleged" collective wisdom of this State, and, amongst other things, I suggested it would be a happy thought to organise a',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/221484875?searchTerm=pademelon'}

## Get a random article from the _Sydney Morning Herald_

In [6]:
get_random_article(newspaper_id='35', category='Article')

{'id': '28089944',
 'url': '/newspaper/28089944',
 'heading': 'DOMAIN BATHS. WOMEN TO UNDERGO TRAINING.',
 'category': 'Article',
 'title': {'id': '35',
  'value': 'The Sydney Morning Herald (NSW : 1842 - 1954)'},
 'date': '1921-01-06',
 'page': 7,
 'pageSequence': 7,
 'relevance': {'score': '25.312206', 'value': 'very relevant'},
 'snippet': 'Arrangements are in hand by the City Council to make the Domain baths available on certain evenings each week for continental bathing. This has been done at the request',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/28089944?searchTerm=%22are%22'}

## Get a random illustrated article

In [7]:
get_random_article(illustrated='true')

{'id': '173097943',
 'url': '/newspaper/173097943',
 'heading': 'MACLEAY RIVER JOCKEY CLUB Influx of New Blood.',
 'category': 'Detailed lists, results, guides',
 'title': {'id': '881',
  'value': 'The Macleay Chronicle (Kempsey, NSW : 1899 - 1952)'},
 'date': '1932-12-07',
 'page': 4,
 'pageSequence': 4,
 'relevance': {'score': '0.14988357', 'value': 'may have relevance'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/173097943?searchTerm=%22now%22'}

## Get a random illustrated advertisement from the _Australian Women's Weekly_

In [8]:
get_random_article(newspaper_id='112', illustrated='true', category='Advertising')

{'id': '43598909',
 'url': '/newspaper/43598909',
 'heading': 'Advertising',
 'category': 'Advertising',
 'title': {'id': '112',
  'value': "The Australian Women's Weekly (1933 - 1982)"},
 'date': '1962-08-29',
 'page': 82,
 'pageSequence': 82,
 'relevance': {'score': '0.59381086', 'value': 'likely to be relevant'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/43598909?searchTerm=%22above%22'}
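
Restrictive combinations like this one are the most likely to come back empty, and when that happens `get_random_article()` returns `None` rather than raising an error. One way of coping is sketched below. `get_article_or_fallback()` and `fake_fetch()` are hypothetical names, and the stand-in fetch function is only there so the example runs without an API key.

```python
def get_article_or_fallback(fetch, attempts=3, **kwargs):
    '''
    Hypothetical wrapper: call fetch(**kwargs) up to `attempts` times and
    return the first result that isn't None, or None if every call is empty.
    '''
    for _ in range(attempts):
        article = fetch(**kwargs)
        if article is not None:
            return article
    return None

# Stand-in for get_random_article(): returns None on the first call
# and a minimal record on the second
calls = []

def fake_fetch(**kwargs):
    calls.append(kwargs)
    if len(calls) < 2:
        return None
    return {'id': '43598909', 'category': kwargs.get('category')}

result = get_article_or_fallback(
    fake_fetch, newspaper_id='112', illustrated='true', category='Advertising'
)
print(result)
```

With the real function you would pass `get_random_article` in place of `fake_fetch`, and check for `None` before trying to read any fields from the result.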

## Speed test

In [13]:
%%timeit
get_random_article()

2.13 s ± 761 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
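
The `%%timeit` magic only works inside IPython or Jupyter. If you want a similar measurement from a plain Python script, something like this sketch should do; `time_calls()` is a hypothetical helper, and the `time.sleep()` call is a stand-in workload so the example runs without hitting the API.

```python
import statistics
import time

def time_calls(func, runs=7):
    ''' Hypothetical helper: return (mean, stdev) in seconds over `runs` calls of func. '''
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)

# Stand-in workload; swap in get_random_article to time real API calls
mean, stdev = time_calls(lambda: time.sleep(0.01), runs=3)
print(f'{mean:.3f} s ± {stdev:.3f} s per call')
```

The numbers won't match `%%timeit` exactly, but they should be close enough to show how much the time per article varies between calls.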
