## Analysis of Trove Government Gazettes

This notebook attempts to reproduce the work described [on the NLA blog](https://www.nla.gov.au/blogs/trove/2018/07/23/digital-tools-for-big-research), where a collection of Certificates of Naturalisation was selected from the Trove Government Gazettes and analysed to give a picture of the number of arrivals over time.

In that exercise, names were identified and counted manually. I will attempt to implement an automated process to derive the same data.



Before you begin, please ensure that you have a `secret.json` file in the current working directory (generally this is your workspace).<br />If you don't have this file, run the ***Set up secrets*** notebook first, then return here.
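
A quick way to confirm the file is in place before running the rest of the notebook. This is a minimal sketch: `has_secret_file` is a hypothetical helper, and it only checks that the file exists, not what it contains.

```python
from pathlib import Path


def has_secret_file(directory="."):
    """Return True if a secret.json file exists in the given directory."""
    return (Path(directory) / "secret.json").is_file()


if not has_secret_file():
    print("secret.json not found; run the Set up secrets notebook first.")
```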

In [None]:
# Install dependencies first
!curl -s -O -L https://raw.githubusercontent.com/HASSCloud/TinkerStudio-Examples/master/{requirements.txt,utils.py}
!pip install -q -r requirements.txt

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
import seaborn as sns
import requests
import datetime
import utils
TROVE_API_KEY = utils.secret('trove')

In [None]:

def trove_query(q, n=100):
    """A simple Trove API interface.

    q is a query term; we search the newspaper zone and return
    up to n article records (a list of Python dictionaries)
    decoded from the JSON responses."""
    
    TROVE_API_URL = "http://api.trove.nla.gov.au/result"
    PAGE_SIZE = 100  # number of records requested per call
    qterms = {
        'zone': 'newspaper',
        'encoding': 'json',
        'include': 'articleText',
        's': 0,
        'n': PAGE_SIZE,
        'key': TROVE_API_KEY,
        'q': q
    }
    articles = []
    while len(articles) < n:
        # never ask for more than one page, or more than we still need
        qterms['n'] = min(PAGE_SIZE, n - len(articles))
        r = requests.get(TROVE_API_URL, params=qterms).json()
        art = r['response']['zone'][0]['records'].get('article', [])
        if not art:
            # no more articles
            break
        articles.extend(art)
        # advance the start offset by the number of records received
        qterms['s'] += len(art)
        
    return articles

#articles = trove_query('"Certificate of Naturalisation"', 110)
#len(articles)
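
The paging arithmetic behind a query like this can be checked without touching the network. The sketch below assumes a page size of 100, as in the function above; `page_requests` is a hypothetical helper, not part of the Trove API.

```python
def page_requests(total, page_size=100):
    """Yield (start, count) pairs that cover `total` records page by page."""
    start = 0
    while start < total:
        # the last page may be smaller than page_size
        count = min(page_size, total - start)
        yield start, count
        start += count


pages = list(page_requests(250))
print(pages)
```

Each pair corresponds to the `s` and `n` parameters of one request.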

In [None]:
articles = trove_query('"Certificates of Naturalisation"', 1000)

In [None]:
len(articles)

In [None]:
from IPython.display import display, HTML
display(HTML(articles[0]['articleText']))

In [None]:
articles[0]['date']

In [None]:
import re
lines = re.findall('<span>([^<]+)</span>', articles[5]['articleText'])
len(lines)

In [None]:
lines[:10]

## Trying spaCy NER

We'll try to use named entity recognition (NER) on this text to find names. However, given the lack of context in the text (it is just a list of names), it may not be very successful.

In [None]:
import spacy
from spacy import displacy
from IPython.display import display, HTML
# download the spacy models we need
model = 'en_core_web_md'
spacy.cli.download(model)
nlp = spacy.load(model)

Applying the NER model and displaying the output for this text, we see that while many names are highlighted (in purple), many are missed and there are many false positives. The lack of context in the text removes the usual cues to names and leaves the model guessing based on capitalisation.

In [None]:
doc = nlp("\n".join(lines))
# jupyter=False makes render() return the markup instead of displaying it itself
display(HTML(displacy.render(doc, style='ent', jupyter=False)))

## Regular Expression based Extraction

In this case the text is very structured as a list of names, addresses and dates.  We can try to use regular expressions to locate these fields in the text.

First, find the lines in the text that contain digits, since these are the lines likely to hold dates.

In [None]:
# raw strings keep the regex escapes (\W, \d) intact
lines = re.findall(r'<span>\W*([^<]+)\W*</span>', articles[5]['articleText'])
print(lines[:10])
datelines = [m for m in lines if re.search(r'\d+', m)]
datelines[:5]
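
To see what this filter keeps, here is a minimal sketch run on a small made-up HTML fragment. The fragment only imitates the `<span>` layout of the article text and is an assumption for illustration, not real Trove output.

```python
import re

# Made-up fragment: a heading line plus one record split over two lines.
sample_html = (
    "<p><span>  CERTIFICATES OF NATURALISATION</span>"
    "<span>  Cianetti, Carla, 68 West Street,</span></p>"
    "<p><span>  Mt Isa, 10.7.67.</span></p>"
)

# Pull the text out of each span, dropping leading non-word characters.
lines = re.findall(r'<span>\W*([^<]+)\W*</span>', sample_html)

# Keep only the lines that contain at least one digit.
datelines = [line for line in lines if re.search(r'\d', line)]

# Joining the kept lines puts the split record back together.
joined = " ".join(datelines)
print(joined)
```

The heading has no digits and is dropped, while both halves of the record survive.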

Now join all of these lines together into one big string, since records seem to flow over line breaks.

We can then look for the individual records. Each record looks like:

> Cianetti,  Carla,  68  West  Street,  Mt  Isa,  10.7.67.

which we can generalise to:

> Last, First, Address, Date

So let's write a regular expression pattern to match that.

In [None]:
text = " ".join(datelines)
pattern = r"\W+(.+?)(\d\d?[ .]+\d\d?[ .]+\d\d)[.;]?"
matches = re.findall(pattern, text)
matches[:3]

In [None]:
res = []
# use a separate loop variable so the joined `text` is not overwritten
for entry, date in matches:
    n = entry.split(',')
    if len(n) > 2:
        res.append({'first': n[1].strip(), 'last': n[0].strip(), 'addr': " ".join(n[2:]).strip()})
res[:3]
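
The two steps above can be tried on the sample record from the text. In this sketch the second record is made up for illustration. Note that the pattern begins with `\W+`, so a record at the very start of the string needs a preceding non-word character to be captured whole; the sample starts with a space for that reason.

```python
import re

# Same pattern as above: lazily captured text followed by a d.m.yy date.
PATTERN = r"\W+(.+?)(\d\d?[ .]+\d\d?[ .]+\d\d)[.;]?"

# First record is the sample from the text; the second is invented.
sample = (" Cianetti, Carla, 68 West Street, Mt Isa, 10.7.67."
          " Rossi, Mario, 12 High Street, Cairns, 3.11.66.")

records = []
for entry, date in re.findall(PATTERN, sample):
    parts = [p.strip() for p in entry.split(',')]
    if len(parts) > 2:
        records.append({
            'last': parts[0],
            'first': parts[1],
            # everything after the first name is treated as the address
            'addr': " ".join(p for p in parts[2:] if p),
            'date': date,
        })

for rec in records:
    print(rec)
```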

In [None]:
# turn all that into a function

def extract_names(document):
    """Extract a list of names from a Certificates of Naturalisation
    article in Trove Government Gazettes.

    Returns (result, badlines): the parsed records, and the raw text
    of any records whose date could not be parsed."""
    
    if 'articleText' in document:
        lines = re.findall(r'<span>\W*([^<]+)\W*</span>', document['articleText'])
        datelines = [m for m in lines if re.search(r'\d+', m)]

        text = " ".join(datelines)
        pattern = r"\W+(.+?)(\d\d?)[ .]+(\d\d?)[ .]+(\d\d)[.;]?"
        matches = re.findall(pattern, text)

        result = []
        badlines = []
        for entry, day, month, year in matches:
            n = entry.split(',')
            if len(n) > 2:
                try:
                    # two-digit years are assumed to fall in the 1900s
                    date = datetime.datetime(day=int(day), month=int(month), year=int("19"+year))
                    result.append({'article': document['url'],
                               'first': n[1].strip(), 
                               'last': n[0].strip(), 
                               'addr': " ".join(n[2:]).strip(),
                               'date': date,
                               'articledate': pd.to_datetime(document['date'])
                              })
                except ValueError:
                    # keep the unparseable record for later inspection
                    badlines.append(entry + ".".join([day, month, year]))
                    
        return result, badlines
    else:
        # the article has no text; show which fields it does have
        print(document.keys())
        return [], []
    
#extract_names(articles[30])
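
The date handling inside the function can be looked at in isolation. A minimal sketch, assuming (as the function does) that two-digit years belong to the 1900s; `parse_gazette_date` is a hypothetical helper name.

```python
import datetime


def parse_gazette_date(day, month, year, century="19"):
    """Build a datetime from day, month and two-digit year strings.

    Returns None when the parts do not form a valid calendar date,
    which usually points at a misread or mismatched record."""
    try:
        return datetime.datetime(day=int(day), month=int(month),
                                 year=int(century + year))
    except ValueError:
        return None


print(parse_gazette_date("10", "7", "67"))
print(parse_gazette_date("10", "13", "67"))
```

The second call returns `None` because there is no thirteenth month.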

In [None]:
# extract_names returns (result, badlines); show the first three records
extract_names(articles[0])[0][:3]

In [None]:
result = []
bad = []
counts = []
for art in articles:
    names, badlines = extract_names(art)
    #print("^^-----", art['heading'], art['url'], len(names), "------^^\n")
    result.extend(names)
    bad.extend(badlines)
    counts.append({'id': art['id'], 'date': art['date'], 'count': len(names), 'bad': len(badlines)})

counts = pd.DataFrame(counts)
print("Got error lines: ", len(bad))
names = pd.DataFrame(result)
print("Got ", names.shape[0], "names")
names.head()

In [None]:
names.groupby('last').size().sort_values(ascending=False).head()

In [None]:
names['ayear'] = names['articledate'].dt.year
names['month'] = names['articledate'].dt.month
# group on the article year so that each bar is one year
by_year = names.groupby('ayear').size().sort_index()
print(by_year.index.min(), by_year.index.max())
plt.figure(figsize=(15,6))
by_year.plot.bar()
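
A pandas-free sketch of a per-year count, useful for checking the grouping by hand on a few records. The records below are made up for illustration and only carry the fields needed here.

```python
import datetime
from collections import Counter

# Invented records with the same 'articledate' field as the extracted names.
records = [
    {'last': 'Cianetti', 'articledate': datetime.date(1967, 8, 3)},
    {'last': 'Rossi', 'articledate': datetime.date(1967, 11, 9)},
    {'last': 'Novak', 'articledate': datetime.date(1968, 2, 1)},
]

# Count how many records fall in each article year.
per_year = Counter(rec['articledate'].year for rec in records)

for year in sorted(per_year):
    print(year, per_year[year])
```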

In [None]:
names.head()

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(names['month'], rug=True)

In [None]:
counts[counts['count'] > 0].shape

In [None]:
import geocoder 
GEONAMES_KEY = utils.secret('geonames')
loc = names['addr'][5]
g = geocoder.geonames(loc, key=GEONAMES_KEY, countryBias=['AU'])
g.address, g.lat, g.lng, loc

In [None]:
GOOGLE_KEY = utils.secret('google')

for loc in names['addr'][10:20]:
    g = geocoder.google(loc, key=GOOGLE_KEY)
    print(g.address, g.lat, g.lng, loc)
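
The same address may appear more than once in the extracted names, so it can be worth looking up each distinct address only once. A minimal sketch: `geocode_unique` is a hypothetical helper, and `fake_geocode` is a stand-in that returns fixed made-up coordinates so the example runs without an API key.

```python
def geocode_unique(addresses, geocode):
    """Call `geocode` once per distinct address and return {address: result}."""
    cache = {}
    for addr in addresses:
        if addr not in cache:
            cache[addr] = geocode(addr)
    return cache


# Stand-in geocoder for illustration only; it records each call it receives.
calls = []

def fake_geocode(addr):
    calls.append(addr)
    return (-20.7, 139.5)  # made-up coordinates


addresses = ["68 West Street Mt Isa", "12 High Street Cairns", "68 West Street Mt Isa"]
located = geocode_unique(addresses, fake_geocode)
print(len(located), "distinct addresses,", len(calls), "lookups")
```

In the notebook, the `geocode` argument could be something like `lambda a: geocoder.google(a, key=GOOGLE_KEY)`.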