# DIY QueryPic

QueryPic is a tool I created many years ago to visualise searches in Trove's digitised newspapers. It's been through a number of versions, but the basic idea has stayed the same. QueryPic shows you the number of articles each year that match your query -- instead of a page of search results, you see the complete result set. You can look for patterns and trends across time.

But under the hood, QueryPic is pretty simple. So simple that we can make a see-through, hackable version here in this notebook!

First let's import a few things that we'll need. We're going to use [Plotly](https://plot.ly/python/) to create charts.

In [None]:
import requests
from requests.exceptions import HTTPError, Timeout
from operator import itemgetter # used for sorting
import pandas as pd # makes manipulating the data easier
import plotly.offline as py # for charts
import plotly.graph_objs as go
from utilities import retry # a retry function for API requests

py.init_notebook_mode() # initialise plotly

## Set up some variables 

Insert your API key between the quotes.

In [None]:
api_key = ''
print('Your API key is: {}'.format(api_key))

In [None]:
start = 181
end = 195
queries = ['cat', 'dog']
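The `start` and `end` values are decade codes rather than years: as the `get_results` docstring notes, `191` stands for 1910-1919. Here's a small sketch of how those codes map to years. The helper name `decade_to_years` is made up for this illustration and isn't used elsewhere in the notebook.

```python
# The same values as the cell above
start = 181  # the 1810s
end = 195    # the 1950s

def decade_to_years(decade):
    '''Hypothetical helper: return the first and last year of a decade code.'''
    first_year = decade * 10
    return first_year, first_year + 9

# range() stops before its second argument, hence the +1 to include `end`
decades = list(range(start, end + 1))
print('{} decades to harvest'.format(len(decades)))
print('First decade covers {}-{}'.format(*decade_to_years(start)))
print('Last decade covers {}-{}'.format(*decade_to_years(end)))
```

Each decade needs two API requests per query (one for the query, one for everything), so widening the range increases the harvesting time accordingly.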

In [None]:
api_search_url = 'https://api.trove.nla.gov.au/result'

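Set up our query parameters. Setting `facet` to `year` asks Trove for the number of matching articles in each year, and setting `n` to `0` means we get those counts without any article records. To start with we want everything, so we set the `q` parameter to a single space.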

In [None]:
params = {
    'q': ' ', # A space to search for everything
    'facet': 'year',
    'zone': 'newspaper',
    'l-category': 'Article',
    'key': api_key,
    'encoding': 'json',
    'n': 0
}
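To get a feel for how a dictionary like this ends up in a request, here's a rough sketch that builds a query string with the standard library's `urlencode`. The values in `example_params` are illustrative only (the key is a placeholder), and `requests` does its own encoding, so the URL it actually sends may differ in small details.

```python
from urllib.parse import urlencode

# Illustrative values only: 'YOUR_API_KEY' is a placeholder, not a real key
example_params = {
    'q': 'cat',
    'facet': 'year',
    'zone': 'newspaper',
    'l-category': 'Article',
    'l-decade': 191,
    'key': 'YOUR_API_KEY',
    'encoding': 'json',
    'n': 0
}

# Turn the dictionary into 'name=value' pairs joined by '&'
query_string = urlencode(example_params)
example_url = 'https://api.trove.nla.gov.au/result?' + query_string
print(example_url)
```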

## Define a few handy functions

In [None]:
@retry((HTTPError, Timeout), tries=10, delay=1)
def get_results(decade, query):
    '''
    Get JSON response data from the Trove API.
    Parameters:
        decade  - eg 191 (for 1910-1919)
        query   - query string
    Returns:
        JSON formatted response data from Trove API 
    '''
    request_params = params.copy() # copy so the shared defaults aren't modified
    request_params['q'] = query
    request_params['l-decade'] = decade
    response = requests.get(api_search_url, params=request_params, timeout=30)
    response.raise_for_status()
    print(response.url) # This shows us the url that's sent to the API
    data = response.json()
    return data
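The `retry` decorator is imported from a separate `utilities` module that isn't shown in this notebook. As a rough idea of what a decorator like that might do, here's a minimal sketch. It's an assumption about the general technique, not the actual `utilities` code, and the `backoff` parameter is invented for this example.

```python
import time
from functools import wraps

def retry(exceptions, tries=4, delay=1, backoff=2):
    '''
    Hypothetical stand-in for utilities.retry.
    Calls the wrapped function up to `tries` times, sleeping between
    failed attempts and multiplying the wait by `backoff` each time.
    '''
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            remaining, wait = tries, delay
            while remaining > 1:
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    time.sleep(wait)
                    remaining -= 1
                    wait *= backoff
            # Last attempt: any exception is passed on to the caller
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Usage: a function that fails twice before succeeding
attempts = []

@retry(ValueError, tries=3, delay=0)
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError('Not yet!')
    return 'ok'

result = flaky()
print(result, len(attempts))
```

Retrying only on the listed exceptions means that unexpected errors still surface straight away.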

In [None]:
def get_facets(data):
    '''
    Loop through facets in Trove API response, saving terms and counts.
    Parameters:
        data  - JSON formatted response data from Trove API  
    Returns:
        A list of dictionaries containing: 'year', 'total_results'
    '''
    facets = []
    for term in data['response']['zone'][0]['facets']['facet']['term']:
        facets.append({'year': int(term['display']), 'total_results': int(term['count'])})
    facets.sort(key=itemgetter('year'))
    return facets
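To make the nesting in `get_facets` easier to follow, here's the same loop run over a hand-made dictionary. The shape of `sample_data` is inferred from the keys the function reads, and the counts are invented; a real API response will contain other fields as well.

```python
from operator import itemgetter

# Invented sample that mimics only the parts of the response used by get_facets
sample_data = {
    'response': {
        'zone': [{
            'facets': {
                'facet': {
                    'term': [
                        {'display': '1911', 'count': '25'},
                        {'display': '1910', 'count': '12'}
                    ]
                }
            }
        }]
    }
}

facets = []
for term in sample_data['response']['zone'][0]['facets']['facet']['term']:
    # Convert the string values to integers so they sort and plot properly
    facets.append({'year': int(term['display']), 'total_results': int(term['count'])})
facets.sort(key=itemgetter('year'))
print(facets)
```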

In [None]:
def combine_totals(query_data, total_data):
    '''
    Take facets data from the query search and a blank search (ie everything) for a decade and combine them.
    Parameters:
        query_data    - list of dictionaries containing facets data from a query search
        total_data    - list of dictionaries containing facets data from a blank search
    Returns:
        A list of dictionaries containing: 'year', 'total_results', 'total_articles' 
    '''
    combined_data = []
    query_data = get_facets(query_data)
    # Look up totals by year, as a query might not have results for every year
    totals_by_year = {row['year']: row['total_results'] for row in get_facets(total_data)}
    for query_row in query_data:
        query_row['total_articles'] = totals_by_year[query_row['year']]
        combined_data.append(query_row)
    return combined_data 

In [None]:
def year_totals(query):
    '''
    Generate a dataset for a search query.
    Parameters:
        query    - search query
    Returns:
        A Pandas dataframe with three columns -- year, total_results, total_articles -- and one row per year.
    '''
    totals = []
    for decade in range(start, end+1):
        print('Getting {}0'.format(decade))
        query_data = get_results(decade, query)
        total_data = get_results(decade, ' ')
        combined_data = combine_totals(query_data, total_data)
        totals.extend(combined_data)
    totals.sort(key=itemgetter('year'))
    return pd.DataFrame(totals)

## Get some data!

In [None]:
traces = {}
for query in queries:
    print('Searching for {}...'.format(query))
    traces[query] = year_totals(query)

In [None]:
traces[queries[0]]

## Plot raw number of articles per year

Let's make a chart showing the raw number of articles per year.

In [None]:
# Prepare data for Plotly
raw_plot_data = []
for query, trace in traces.items():
    raw_plot_data.append(
        go.Scatter (
            x=trace.year,
            y=trace.total_results,
            name=query
        )
    )
# Create the chart
py.iplot(raw_plot_data, filename='articles-by-year')

## Plot percentage of total articles per year

In most cases the raw number of articles isn't terribly useful, because it doesn't take into account how many newspaper articles were published that year. Let's divide the number of results by the total number of articles to look at the percentage of articles each year that match our queries.

In [None]:
# Prepare data for Plotly
av_plot_data = []
for query, trace in traces.items():
    av_plot_data.append(
        go.Scatter (
            x=trace.year,
            y=trace.total_results/trace.total_articles*100,
            name=query
        )
    )
# Create the chart
py.iplot(av_plot_data, filename='articles-by-year')
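To see why the percentage matters, here's the same calculation on a couple of rows of made-up numbers, in plain Python rather than pandas. Both years have the same raw number of results, but the share of all articles is quite different.

```python
# Invented figures for illustration only
rows = [
    {'year': 1910, 'total_results': 250, 'total_articles': 1000},
    {'year': 1911, 'total_results': 250, 'total_articles': 2000}
]

percentages = {}
for row in rows:
    # Matching articles as a percentage of all articles published that year
    percentages[row['year']] = row['total_results'] / row['total_articles'] * 100

for year, percentage in percentages.items():
    print('{}: {}%'.format(year, percentage))
```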

## Save query data as CSV files

One of the nifty things about pandas dataframes is that it's stupidly easy to save them as CSVs. Just call `.to_csv()`.

In [None]:
import os
os.makedirs('data', exist_ok=True) # make sure the output directory exists
for query, trace in traces.items():
    trace.to_csv('data/querypic-{}.csv'.format(query), index=False)