# Get a random newspaper article from Trove

Changes to the Trove API mean that the techniques I've previously used to select resources at random [will no longer work](https://updates.timsherratt.org/2019/10/09/creators-and-users.html). This notebook provides one alternative.

I wanted something that would work efficiently, but would also expose as much of the content as possible. Applying multiple facets together with a randomly-generated query seems to do a good job of getting the result set below 100 (the maximum available from a single API call). This should mean that *most* of the newspaper articles are reachable, but it's a bit hard to quantify.
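
To get a feel for why stacking facets shrinks the result set so quickly, here's a toy offline simulation. It doesn't touch the Trove API: the `records` list, its field values, and the `narrow()` helper are all made up for illustration. Each step picks a random value from those present in the current results and filters on it, which is roughly what the function further down does with the API's facets.

```python
import random

# Hypothetical stand-in for the Trove index: 5,000 fake records with a few facet fields
rng = random.Random(42)
records = [
    {
        'title': rng.choice(['35', '112', '336', '241']),
        'category': rng.choice(['Article', 'Advertising', 'Family Notices']),
        'year': rng.randint(1850, 1954),
        'month': rng.randint(1, 12),
    }
    for _ in range(5000)
]


def narrow(records, facets, rng):
    """Filter on one randomly chosen value per facet, in the order given."""
    chosen = {}
    for facet in facets:
        # Only values present in the current results are candidates,
        # so the filtered set can never end up empty
        values = sorted({r[facet] for r in records})
        chosen[facet] = rng.choice(values)
        records = [r for r in records if r[facet] == chosen[facet]]
    return chosen, records


chosen, subset = narrow(records, ['title', 'category', 'year', 'month'], rng)
print(chosen, len(subset))
```

The real index is vastly bigger and unevenly distributed, so this only shows the shape of the idea, not how well it works in practice.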

Thanks to Mitchell Harrop for [suggesting I could use randomly selected stopwords](https://twitter.com/mharropesquire/status/1182175315860213760) as queries. I've supplemented the stopwords with letters and digits, and together they seem to do a good job of applying an initial filter and mixing up the relevance ranking.
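
As a rough sketch of how such a list of query terms could be put together: the `sample_stopwords` list here is a small hypothetical sample (the notebook loads its own list from `stopwords.json`), and `random_query()` is an illustrative helper that wraps the chosen term in double quotes, as the function below does.

```python
import random
import string

# Hypothetical sample; the notebook loads its full list from stopwords.json
sample_stopwords = ['the', 'and', 'most', 'same', 'you', "hadn't"]

# Supplement the words with single letters and digits
query_terms = sample_stopwords + list(string.ascii_lowercase) + list(string.digits)


def random_query(terms):
    """Pick a term at random and wrap it in double quotes."""
    return f'"{random.choice(terms)}"'


print(random_query(query_terms))
```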

I've added some options that allow you to specify values for `query`, `newspaper_id`, `category`, `illustrated`, and `tag`, but note that some combinations of these parameters might yield no results.

In [15]:
import json
import random

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed requests up to 5 times, with backoff, on 502, 503, and 504 responses
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

# Load the list of terms used as random queries
with open('stopwords.json', 'r') as json_file:
    STOPWORDS = json.load(json_file)

In [16]:
API_KEY = 'YOUR API KEY'
API_URL = 'http://api.trove.nla.gov.au/v2/result'

In [24]:
def get_random_facet_value(params, facet):
    '''
    Get values for the supplied facet and choose one at random.
    '''
    these_params = params.copy()
    these_params['facet'] = facet
    response = s.get(API_URL, params=these_params)
    data = response.json()
    try:
        values = [t['search'] for t in data['response']['zone'][0]['facets']['facet']['term']]
    except (KeyError, TypeError):
        # No facet values came back for this combination of parameters
        return None
    return random.choice(values) if values else None


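def get_random_article(query=None, newspaper_id=None, category=None, illustrated=None, tag=None):
    '''
    Get a random newspaper article, optionally limited by query, newspaper, category, illustration status, or tag.
    Returns None if no article is found after repeated attempts.
    '''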
    total = 0
    tries = 0
    params = {
        'zone': 'newspaper',
        'encoding': 'json',
        'n': '0',
        # Uncomment these if you need more than the basic data
        #'reclevel': 'full',
        #'include': 'articleText',
        'key': API_KEY
    }
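    while total == 0 and tries <= 10:
        # Clear facet filters left over from a failed attempt so they don't skew this one,
        # and reset 'n' so the facet lookups don't fetch any records
        params = {k: v for k, v in params.items() if not k.startswith('l-')}
        params['n'] = '0'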
        if query:
            params['q'] = query
        else:
            random_word = random.choice(STOPWORDS)
            params['q'] = f'"{random_word}"'
        if tag:
            params['l-publictag'] = tag
        if newspaper_id:
            params['l-title'] = newspaper_id
        else:
            params['l-title'] = get_random_facet_value(params, 'title')
        if category:
            params['l-category'] = category
        else:
            params['l-category'] = get_random_facet_value(params, 'category')
        if illustrated:
            params['l-illustrated'] = illustrated
        else:
            params['l-illustrated'] = get_random_facet_value(params, 'illustrated')
        # Select a word length at random
        params['l-word'] = get_random_facet_value(params, 'word')
        # Drill down through the date facets: decade, then year, then month
        params['l-decade'] = get_random_facet_value(params, 'decade')
        params['l-year'] = get_random_facet_value(params, 'year')
        params['l-month'] = get_random_facet_value(params, 'month')
        # Request up to 100 articles (the maximum per API call) matching all the facets
        params['n'] = 100
        response = s.get(API_URL, params=params)
        data = response.json()
        total = int(data['response']['zone'][0]['records']['total'])
        #print(total)
        #print(response.url)
        tries += 1
    if total > 0:
        article = random.choice(data['response']['zone'][0]['records']['article'])
        return article
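
Because some combinations of parameters yield no results, `get_random_article()` can come back with `None`. Here's a minimal sketch of one way to handle that when displaying a result. The `describe_article()` helper is hypothetical, not part of the notebook, and the sample record is trimmed from the notebook's own output.

```python
def describe_article(article):
    """Return a one-line summary, or a hint when no article was found."""
    if article is None:
        return 'No article found; try loosening the parameters.'
    return f"{article['heading']} | {article['title']['value']} | {article['date']}"


# Sample record trimmed from the notebook's own output
sample = {
    'heading': 'Advertising',
    'title': {'id': '336', 'value': 'Queensland Figaro (Brisbane, Qld. : 1901 - 1936)'},
    'date': '1914-02-12',
}

print(describe_article(sample))
print(describe_article(None))
```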

## Get any old article...

In [18]:
get_random_article()

{'id': '84397893',
 'url': '/newspaper/84397893',
 'heading': 'Advertising',
 'category': 'Advertising',
 'title': {'id': '336',
  'value': 'Queensland Figaro (Brisbane, Qld. : 1901 - 1936)'},
 'date': '1914-02-12',
 'page': 17,
 'pageSequence': 17,
 'relevance': {'score': '5.8201594', 'value': 'very relevant'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/84397893?searchTerm=%22hadn%27t%22'}
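
The `searchTerm` parameter in `troveUrl` records the random query that produced this result. Here's a minimal sketch for pulling it out with the standard library's `urllib.parse`, using the URL from the output above:

```python
from urllib.parse import urlparse, parse_qs

trove_url = 'https://trove.nla.gov.au/ndp/del/article/84397893?searchTerm=%22hadn%27t%22'

# parse_qs decodes the percent-encoded value; the term keeps the
# double quotes that were wrapped around the random word
search_term = parse_qs(urlparse(trove_url).query)['searchTerm'][0]
print(search_term)
```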

## Get a random article about pademelons

In [19]:
get_random_article(query='pademelon')

{'id': '93274347',
 'url': '/newspaper/93274347',
 'heading': 'ACROSS THE ALPS TO THE BUCHAN CAVES. No. 2. (Continued.)',
 'category': 'Article',
 'title': {'id': '241', 'value': 'The Colac Herald (Vic. : 1875 - 1918)'},
 'date': '1908-02-26',
 'page': 3,
 'pageSequence': 3,
 'relevance': {'score': '15.130503', 'value': 'very relevant'},
 'snippet': "Now we have the telephone wire to keep as to the right track, for Sunnyside is connected by 'phone with Omeo. For the previous 40 miles we",
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/93274347?searchTerm=pademelon'}

## Get a random article from the _Sydney Morning Herald_

In [20]:
get_random_article(newspaper_id='35', category='Article')

{'id': '13666434',
 'url': '/newspaper/13666434',
 'heading': 'THE COURSE.',
 'category': 'Article',
 'title': {'id': '35',
  'value': 'The Sydney Morning Herald (NSW : 1842 - 1954)'},
 'date': '1887-11-28',
 'page': 3,
 'pageSequence': 3,
 'relevance': {'score': '0.83701843', 'value': 'likely to be relevant'},
 'snippet': 'The Nepean championship course, which will be found, illustrated above, is undoubtedly one of the finest rowing courses in the world, and this assertion is supported both by Beach and Hanlan, each of whom has, given his',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/13666434?searchTerm=%22most%22'}

## Get a random illustrated article

In [21]:
get_random_article(illustrated='true')

{'id': '226228030',
 'url': '/newspaper/226228030',
 'heading': 'SHEEP MERINOS',
 'category': 'Detailed lists, results, guides',
 'title': {'id': '664',
  'value': 'The Gundagai Independent (NSW : 1928 - 1939)'},
 'date': '1936-03-19',
 'page': 7,
 'pageSequence': 7,
 'relevance': {'score': '0.644695', 'value': 'likely to be relevant'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/226228030?searchTerm=%223%22'}

## Get a random illustrated advertisement from the _Australian Women's Weekly_

In [22]:
get_random_article(newspaper_id='112', illustrated='true', category='Advertising')

{'id': '46242638',
 'url': '/newspaper/46242638',
 'heading': 'Advertising',
 'category': 'Advertising',
 'title': {'id': '112',
  'value': "The Australian Women's Weekly (1933 - 1982)"},
 'date': '1974-07-24',
 'page': 78,
 'pageSequence': 78,
 'relevance': {'score': '0.19900703', 'value': 'may have relevance'},
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/46242638?searchTerm=%22same%22'}

## Get a random article tagged 'poem'

In [26]:
get_random_article(tag='poem')

{'id': '69569812',
 'url': '/newspaper/69569812',
 'heading': 'THE FLEATOWN STEEPLECHASE.',
 'category': 'Article',
 'title': {'id': '204',
  'value': 'Wodonga and Towong Sentinel (Vic. : 1885 - 1954)'},
 'date': '1902-04-25',
 'page': 4,
 'pageSequence': 4,
 'relevance': {'score': '2.2554219', 'value': 'likely to be relevant'},
 'snippet': "Too much about Kanaka Bills, too much about the Chow— A head line for Matilda, Dan, Murphy's brindle cow!",
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/69569812?searchTerm=%22you%22'}

## Speed test

In [23]:
%%timeit
get_random_article()

1.67 s ± 346 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
