# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
 * [Look for NEs](#section-nes)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Most will come with your Python installation, but the following are crucial:
<ul>
    <li> pandas</li>
    <li> json</li>
    <li> nltk</li>
    <li> geopandas</li>
    <li> shapely</li>
    <li> spacy</li>
    <li> ipywidgets</li>
</ul>
</div>

In [45]:
# [TO DO] UPDATE
# Many of these are probably not needed.

import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

#import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS,
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
#import geopandas as gpd
#from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
#from nltk.tag import StanfordNERTagger
#from nltk.tokenize import word_tokenize

# spaCy is used for a pipeline of NLP functions
import spacy
from spacy.tokens import Span
from spacy import displacy

# Shapely is used to work with geometric shapes
#from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
#from fuzzywuzzy import fuzz

# used for the checklist
import ipywidgets as widgets

In [46]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 115)

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You also want to set up some directories for the import and output of data.

In [47]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
reference_directory = os.path.normpath("../Data")

## Create the data directories (exist_ok avoids an error if they already exist)
os.makedirs(text_directory, exist_ok=True)
os.makedirs(csv_directory, exist_ok=True)
os.makedirs(reference_directory, exist_ok=True)


For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained from [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is an unformatted text file. We have simplified it slightly by reducing it to standard ASCII characters only: accented characters were replaced with their unaccented forms, and the British pound sterling symbol with the word *pounds*.
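
The simplification described above can be sketched as follows. This is a minimal illustration, not the script that was actually used to prepare the workshop file: the function name `simplify_to_ascii` is hypothetical, and replacing the pound sign with the word *pounds* followed by a space is an assumption about how the substitution was made.

```python
import unicodedata

def simplify_to_ascii(raw_text):
    """Return an ASCII-only approximation of raw_text (hypothetical helper)."""
    # Assumption: the pound sign becomes the word "pounds" plus a space
    replaced = raw_text.replace("\u00a3", "pounds ")
    # Decompose accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Dropping everything outside ASCII removes the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "caf\u00e9 na\u00efve \u00a35"
print(simplify_to_ascii(sample))
```

Any character that has no ASCII decomposition is silently dropped by this approach, so it is worth comparing the result with the original before relying on it.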

The novel is divided into four books, each based in different regions of the world. You can start with part of the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [48]:
filename="FtToHNL_BOOK_2_CHAPTER_3.txt"
print("Working on | ", filename)

# set the specific path for the 'filename' 
text_location = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(text_location)

# read the file, closing it again once the text is in memory
with open(text_location, encoding="utf-8") as text_file:
    text = text_file.read()

Working on |  FtToHNL_BOOK_2_CHAPTER_3.txt


This is no more than a long string of characters. So far, you have done no processing. 

In [49]:
text[0:500] # look at the first 500 characters

'CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\nIn the house of Major Vickers, Commandant of Macquarie Harbour,\nthere was, on this evening of December 3rd, unusual gaiety.\n\nLieutenant Maurice Frere, late in command at Maria Island, had unexpectedly\ncome down with news from head-quarters.  The Ladybird, Government schooner,\nvisited the settlement on ordinary occasions twice a year, and such visits\nwere looked forward to with no little eagerness by the settlers.\nTo the convicts the arrival of the Ladybird mean'

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Look for NEs <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like the names of places, people or organisations.

Looking at the above text from Book 2 of FtToHNL, you can see there are the names of places, characters, and a ship. While you could manually extract them from the text, Natural Language Processing (NLP) technology allows this process to be semi-automated through software.

There are various packages that include Named Entity Recognition (NER), e.g., the [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb), the Stanford NER, and the spaCy library. They often combine machine learning and a rule-based system to identify NEs and classify them into categories.

For this notebook, you will be using the spaCy NER - https://spacy.io/usage/linguistic-features#morphology .  This is available as a Python library.

SpaCy allows you to load a language model that has been trained on various examples of the language of interest. 

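In spaCy, a language model is a package of trained components and their learned weights that predicts linguistic annotations, such as part-of-speech tags and named entities, for text it has not seen before.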

In [50]:
nlp = spacy.load("en_core_web_sm")

spaCy automatically runs any text you give the model through several stages of natural language processing, which it calls a processing pipeline. The pipeline starts by tokenising the text into individual tokens or terms, like words, values and punctuation.

<div class="alert alert-block alert-info">
The options for the spaCy pipeline include:
    <p>&nbsp;</p>
<table>
    <tr><th>NAME</th>	<th>COMPONENT</th>		<th>	DESCRIPTION</th>	
    </tr>
    <tr>
        <td><strong>tokenizer</strong></td><td>	Tokenizer</td><td>	Segment text into tokens.</td>
    </tr>
    <tr>

<td><strong>tagger</strong></td><td>	Tagger</td><td>	Assign part-of-speech tags.</td>
    </tr>
    <tr>

<td><strong>parser</strong></td><td>	DependencyParser</td><td>	Assign dependency labels.</td>
    </tr>
    <tr>

<td><strong>ner</strong></td><td>	EntityRecognizer</td><td>	Detect and label named entities.</td>
    </tr>
    <tr>

<td><strong>lemmatizer</strong></td><td>	Lemmatizer</td><td>	Assign base forms.</td>
    </tr>
    <tr>

<td><strong>textcat</strong></td><td>	TextCategorizer</td><td>	Assign document labels.</td>
    </tr>
    <tr>

<td><strong>custom</strong></td><td>	custom components</td><td>	Assign custom attributes, methods or properties.</td>
    </tr>
    </table>
    
   A full explanation can be found at <a href="https://spacy.io/usage/processing-pipelines">https://spacy.io/usage/processing-pipelines</a>
    </div>

![spaCy Language Processing Pipeline](./spaCy_pipeline.png)

For example, the default pipeline contains the following.

In [51]:
print("Pipeline:", nlp.pipe_names)

Pipeline: ['tagger', 'parser', 'ner']


Text sent to the spaCy model will be processed by the pipeline.

In [52]:
doc = nlp(text[0:500])
doc

CHAPTER III.

A SOCIAL EVENING.



In the house of Major Vickers, Commandant of Macquarie Harbour,
there was, on this evening of December 3rd, unusual gaiety.

Lieutenant Maurice Frere, late in command at Maria Island, had unexpectedly
come down with news from head-quarters.  The Ladybird, Government schooner,
visited the settlement on ordinary occasions twice a year, and such visits
were looked forward to with no little eagerness by the settlers.
To the convicts the arrival of the Ladybird mean

The output from the pipeline is then available in the document object that spaCy returns. Each word is regarded as a token.

<div class="alert alert-block alert-info">
From <a href="https://spacy.io/usage/linguistic-features">https://spacy.io/usage/linguistic-features</a>:
        <ul><li> <strong>Text:</strong> The original token text.</li>
<li> <strong>Dep:</strong> The syntactic relation connecting child to head.</li>
<li> <strong>Head text:</strong> The original text of the token head.</li>
<li> <strong>Head POS:</strong> The part-of-speech tag of the token head.</li>
<li> <strong>Children:</strong> The immediate syntactic dependents of the token.</li>
    </ul>
    </div>

In [53]:
for token in doc[9:59]:
    print(token.text," - ", 
          "\n   Dep: ",token.dep_,       
          "\n   Head: ",token.head.text, 
          "\n   Pos: ",token.head.pos_,  
          "\n   Child: ",[child for child in token.children]) 


In  -  
   Dep:  prep 
   Head:  was 
   Pos:  AUX 
   Child:  [house]
the  -  
   Dep:  det 
   Head:  house 
   Pos:  NOUN 
   Child:  []
house  -  
   Dep:  pobj 
   Head:  In 
   Pos:  ADP 
   Child:  [the, of]
of  -  
   Dep:  prep 
   Head:  house 
   Pos:  NOUN 
   Child:  [Vickers]
Major  -  
   Dep:  compound 
   Head:  Vickers 
   Pos:  PROPN 
   Child:  []
Vickers  -  
   Dep:  pobj 
   Head:  of 
   Pos:  ADP 
   Child:  [Major, ,, Commandant]
,  -  
   Dep:  punct 
   Head:  Vickers 
   Pos:  PROPN 
   Child:  []
Commandant  -  
   Dep:  appos 
   Head:  Vickers 
   Pos:  PROPN 
   Child:  [of]
of  -  
   Dep:  prep 
   Head:  Commandant 
   Pos:  PROPN 
   Child:  [Harbour]
Macquarie  -  
   Dep:  compound 
   Head:  Harbour 
   Pos:  PROPN 
   Child:  []
Harbour  -  
   Dep:  pobj 
   Head:  of 
   Pos:  ADP 
   Child:  [Macquarie]
,  -  
   Dep:  punct 
   Head:  was 
   Pos:  AUX 
   Child:  [
]

  -  
   Dep:   
   Head:  , 
   Pos:  PUNCT 
   Child:  []
there  -  
  

However, you might not want some of this pipeline processing as it may not be beneficial to your analysis. Any unneeded processing will also slow the system down and place a greater demand on the memory. This is particularly true of the parser. Luckily, it is easy to stipulate what you want excluded from the spaCy pipeline. 

In [54]:
doc=nlp(text[0:500], disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [55]:
for token in doc[9:59]:
    print(token.text," - ", 
          "\n   Dep: ",token.dep_,        
          "\n   Head: ",token.head.text,  
          "\n   Pos: ",token.head.pos_,   
          "\n   Child: ",[child for child in token.children])  

In  -  
   Dep:   
   Head:  In 
   Pos:   
   Child:  []
the  -  
   Dep:   
   Head:  the 
   Pos:   
   Child:  []
house  -  
   Dep:   
   Head:  house 
   Pos:   
   Child:  []
of  -  
   Dep:   
   Head:  of 
   Pos:   
   Child:  []
Major  -  
   Dep:   
   Head:  Major 
   Pos:   
   Child:  []
Vickers  -  
   Dep:   
   Head:  Vickers 
   Pos:   
   Child:  []
,  -  
   Dep:   
   Head:  , 
   Pos:   
   Child:  []
Commandant  -  
   Dep:   
   Head:  Commandant 
   Pos:   
   Child:  []
of  -  
   Dep:   
   Head:  of 
   Pos:   
   Child:  []
Macquarie  -  
   Dep:   
   Head:  Macquarie 
   Pos:   
   Child:  []
Harbour  -  
   Dep:   
   Head:  Harbour 
   Pos:   
   Child:  []
,  -  
   Dep:   
   Head:  , 
   Pos:   
   Child:  []

  -  
   Dep:   
   Head:  
 
   Pos:  SPACE 
   Child:  []
there  -  
   Dep:   
   Head:  there 
   Pos:   
   Child:  []
was  -  
   Dep:   
   Head:  was 
   Pos:   
   Child:  []
,  -  
   Dep:   
   Head:  , 
   Pos:   
   Child:  []
on 

Of course, what you are interested in is the NER. Any text sent down the pipeline with the NER will get a list of entities that have been found. 

In [56]:
for entity in doc.ents:
    print(entity.text, "[",entity.label_,"]")

Vickers [ PERSON ]
this evening [ TIME ]
December 3rd [ DATE ]
Maurice Frere [ PERSON ]
Maria Island [ LOC ]
Ladybird [ ORG ]


As you can see, each entity is labelled with a category.
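
To see at a glance which categories were found, the entities can be grouped by their label. The sketch below uses plain Python on (text, label) pairs copied from the output above, so it runs without spaCy; the variable names are illustrative only.

```python
from collections import defaultdict

# (text, label) pairs copied from the output of the cell above
found_entities = [
    ("Vickers", "PERSON"),
    ("this evening", "TIME"),
    ("December 3rd", "DATE"),
    ("Maurice Frere", "PERSON"),
    ("Maria Island", "LOC"),
    ("Ladybird", "ORG"),
]

# Collect the entity texts under each category label
entities_by_label = defaultdict(list)
for entity_text, label in found_entities:
    entities_by_label[label].append(entity_text)

for label, names in sorted(entities_by_label.items()):
    print(label, names)
```

With a real `doc`, the same grouping could be built from `(entity.text, entity.label_)` for each entity in `doc.ents`.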

<div class="alert alert-block alert-info">
    The NER categories assigned by the spaCy English models include:
   <ul>
<li><strong>People, groups and organisations:</strong>
PERSON, NORP, ORG</li>
<li><strong>Places:</strong>
GPE, LOC, FAC</li>
<li><strong>Other named things:</strong>
PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE</li>
<li><strong>Numeric and temporal expressions:</strong>
DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL</li>
</ul>
</div>

In [57]:
spacy.explain('LOC')

'Non-GPE locations, mountain ranges, bodies of water'

In [58]:
spacy.explain('FAC')

'Buildings, airports, highways, bridges, etc.'

In [59]:
spacy.explain('GPE')

'Countries, cities, states'

In [60]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

Most tokens in the text are not part of any named entity. Some, however, have been categorised. For instance, *Vickers* and *Maurice Frere* are both classified as PERSON named entities, *Maria Island* is a LOC, and the ship *Ladybird* has been labelled an ORG. These categories are specific to the spaCy model. Alongside the categories for people, organisations and places, the model also labels expressions like TIME, DATE and MONEY, which are not really named entities.


The data for the entities includes the character position for the start and the end of the NE.

In [61]:
for entity in doc.ents:
    print(entity.text, "("+str(entity.start_char)+","+str(entity.end_char)+") [",entity.label_,"]")

Vickers (57,64) [ PERSON ]
this evening (113,125) [ TIME ]
December 3rd (129,141) [ DATE ]
Maurice Frere (171,184) [ PERSON ]
Maria Island (205,217) [ LOC ]
Ladybird (487,495) [ ORG ]
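
The character positions can be checked against the text itself. The sketch below uses the opening of the chapter and three of the offsets printed above. Note that `end_char` is exclusive, so it can be used directly as the end of a Python slice.

```python
# The opening of the chapter, as shown earlier in the notebook
excerpt = (
    "CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\n"
    "In the house of Major Vickers, Commandant of Macquarie Harbour,\n"
    "there was, on this evening of December 3rd, unusual gaiety."
)

# (start_char, end_char, label) triples taken from the output above
offsets = [(57, 64, "PERSON"), (113, 125, "TIME"), (129, 141, "DATE")]

# Slicing the text with each pair of offsets recovers the entity text
recovered = [(excerpt[start:end], label) for start, end, label in offsets]
for entity_text, label in recovered:
    print(entity_text, "[", label, "]")
```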


Each token will also have a value that indicates whether it is part of an NE.

In [62]:
for token in doc[9:59]:
    print(token.text, "[", token.ent_type_, "]")

In [  ]
the [  ]
house [  ]
of [  ]
Major [  ]
Vickers [ PERSON ]
, [  ]
Commandant [  ]
of [  ]
Macquarie [  ]
Harbour [  ]
, [  ]

 [  ]
there [  ]
was [  ]
, [  ]
on [  ]
this [ TIME ]
evening [ TIME ]
of [  ]
December [ DATE ]
3rd [ DATE ]
, [  ]
unusual [  ]
gaiety [  ]
. [  ]


 [  ]
Lieutenant [  ]
Maurice [ PERSON ]
Frere [ PERSON ]
, [  ]
late [  ]
in [  ]
command [  ]
at [  ]
Maria [ LOC ]
Island [ LOC ]
, [  ]
had [  ]
unexpectedly [  ]

 [  ]
come [  ]
down [  ]
with [  ]
news [  ]
from [  ]
head [  ]
- [  ]
quarters [  ]
. [  ]


The token-level annotations show that *Maria* and *Island* are both LOC tokens, but not that *Maria Island* is actually the complete name of the location. However, spaCy does recognise multi-word expressions (MWEs): each entity in `doc.ents` is a span that can cover several tokens, which is why *Maria Island* and *Maurice Frere* each appeared as a single entry in the list of entities above.

__[TO DO] Remove any comments and code about hand-coding corrections for exceptions?__

It is also possible to hand-code entities after the NER has been done. This can help make up for any common irregularities with the NER for your input documents. 

For instance, this model doesn't recognise that FB is a NE.

The solution is to create a new entry for the list of entities.

Even the data for the tokens is updated. 

__[TO DO] Decide whether to keep this part about batch processing. This may be too spaCy specific.__

SpaCy also allows the input documents to be processed in batches. This helps better manage the processing demands of the system throughout the pipeline when there are multiple files or many sentences.

While you can process the output of the piped pipeline straight away, you can't print it unless you convert it into a list.
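
The reason for the conversion is that batched output is produced lazily, one result at a time, rather than being held in memory as a list. The sketch below illustrates that behaviour with an ordinary Python generator. `fake_pipe` is a hypothetical stand-in written for this example, not part of spaCy, and its "processing" is simply upper-casing each text.

```python
def fake_pipe(texts, batch_size=2):
    """Hypothetical stand-in for a batched pipeline: yields one result per text."""
    batch = []
    for item in texts:
        batch.append(item)
        if len(batch) == batch_size:
            # "Process" a full batch; here that just means upper-casing
            for processed in (t.upper() for t in batch):
                yield processed
            batch = []
    # Process whatever is left in the final, partial batch
    for processed in (t.upper() for t in batch):
        yield processed

results = fake_pipe(["chapter one", "chapter two", "chapter three"])
print(results)        # shows a generator object, not the results
print(list(results))  # converting to a list runs the processing
```

A generator can only be consumed once, so keep the list if you need to look at the results again.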

__[TO DO] This is about spaCy indicating the scope of a NE.
Probably best to omit as it is too technical and about the spaCy rather than the problem.__

<div class="alert alert-block alert-info">
    From <a href="https://spacy.io/usage/linguistic-features">https://spacy.io/usage/linguistic-features</a>:

<ul>
    <li>IOB SCHEME
        <ul>
<li>I – Token is inside an entity.</li>
<li>O – Token is outside an entity.</li>
<li>B – Token is the beginning of an entity.</li>
        </ul></li>
<li>BILUO SCHEME
    <ul>
<li>B – Token is the beginning of a multi-token entity.</li>
<li>I – Token is inside a multi-token entity.</li>
<li>L – Token is the last token of a multi-token entity.</li>
<li>U – Token is a single-token unit entity.</li>
<li>O – Token is outside an entity.</li>
    </ul></li>
    </ul>
    </div>
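
To make the BILUO scheme concrete, the sketch below derives the tags from the per-token entity types shown earlier. It is plain Python rather than spaCy's own implementation, the helper name `biluo_tags` is hypothetical, and it assumes that neighbouring tokens with the same type belong to the same entity.

```python
def biluo_tags(entity_types):
    """Derive BILUO tags from per-token entity types ("" = no entity).

    Assumption: neighbouring tokens with the same type form one entity.
    """
    tags = []
    for position, ent_type in enumerate(entity_types):
        if not ent_type:
            tags.append("O")
            continue
        starts = position == 0 or entity_types[position - 1] != ent_type
        ends = (position == len(entity_types) - 1
                or entity_types[position + 1] != ent_type)
        if starts and ends:
            tags.append("U-" + ent_type)   # single-token entity
        elif starts:
            tags.append("B-" + ent_type)   # first token of the entity
        elif ends:
            tags.append("L-" + ent_type)   # last token of the entity
        else:
            tags.append("I-" + ent_type)   # token inside the entity
    return tags

# Tokens and entity types copied from the token listing earlier in the notebook
tokens = ["Lieutenant", "Maurice", "Frere", ",", "late", "in",
          "command", "at", "Maria", "Island", ","]
types = ["", "PERSON", "PERSON", "", "", "", "", "", "LOC", "LOC", ""]

for token, tag in zip(tokens, biluo_tags(types)):
    print(token, tag)
```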

__[TO DO] talk about extracting just the NER types we want.__

__[TO DO] Run through a single chapter (variable: text) before doing the entire collection?
    This will allow all NEs to be shown, then the filter be introduced.
    This will put it more on topic.__

In [63]:
text[0:500]

'CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\nIn the house of Major Vickers, Commandant of Macquarie Harbour,\nthere was, on this evening of December 3rd, unusual gaiety.\n\nLieutenant Maurice Frere, late in command at Maria Island, had unexpectedly\ncome down with news from head-quarters.  The Ladybird, Government schooner,\nvisited the settlement on ordinary occasions twice a year, and such visits\nwere looked forward to with no little eagerness by the settlers.\nTo the convicts the arrival of the Ladybird mean'

In [64]:
doc = nlp(text)

# document level
entities = [(entity.text, entity.start_char, entity.end_char, entity.label_) for entity in doc.ents]
    
i=0 # entity counter
# token level
for entity in doc.ents:
    print("{:5}\t\t{:30s}\t{}".format(i+1,entity.text, entity.label_))
    i=i+1

    1		Vickers                       	PERSON
    2		this evening                  	TIME
    3		December 3rd                  	DATE
    4		Maurice Frere                 	PERSON
    5		Maria Island                  	LOC
    6		Ladybird                      	ORG
    7		Ladybird                      	PERSON
    8		Ladybird                      	ORG
    9		Tom                           	PERSON
   10		Dick                          	PERSON
   11		Harry                         	PERSON
   12		bush                          	PERSON
   13		Jack                          	PERSON
   14		Town Gaol                     	PERSON
   15		Ladybird                      	ORG
   16		one                           	CARDINAL
   17		Commandant                    	ORG
   18		Vickers                       	PERSON
   19		Arthur                        	PERSON
   20		Hobart Town                   	ORG
   21		Arthur                        	PERSON
   22		Tasman                        	PERSON
   23		Peninsula              

__[TO DO] Expand on this explanation with example context.
Talk about the issue with _Van Diemen's_ versus _Van Diemen's Land_ (and _Tasman's Head_)__

These categories are assigned according to the context in which the NE is used. For this reason, _Van Diemen's_ is considered an *ORG*, a *PERSON* and a *FAC*, depending on its linguistic context. Note also that _VAN DIEMEN'S LAND_ in the title of the chapter isn't recognised as a NE, probably due to its unconventional case.

__[TO DO] Expand on this explanation.__

Of course, not all of these NEs are placenames, so you will need to make a list of the categories that regularly contain placenames.

In [65]:
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]
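
As a small illustration of how this list is used, the sketch below filters the first entities found in the chapter, copied from the output above, down to the candidate placenames. It is plain Python on (text, label) pairs and the variable names are illustrative only.

```python
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]

# (text, label) pairs copied from the first entities listed for this chapter
chapter_entities = [
    ("Vickers", "PERSON"),
    ("this evening", "TIME"),
    ("December 3rd", "DATE"),
    ("Maurice Frere", "PERSON"),
    ("Maria Island", "LOC"),
    ("Ladybird", "ORG"),
]

# Keep only the entities whose label is one of the placename categories
candidate_placenames = [
    (entity_text, label)
    for entity_text, label in chapter_entities
    if label in PLACENAME_CATEGORIES
]
print(candidate_placenames)
```

Note that *Ladybird*, a ship, passes the filter because it was labelled an ORG. The filter narrows the candidates but does not replace a manual review.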

### Reviewing Candidate Placenames <a class="anchor" id="section-reviewplacenames"></a>

You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. They can all be collected in a single dataframe.

In [66]:
# Dataframe where we store the details about each instance of the placenames
placenames_df = pd.DataFrame(columns=['Book','Chapter',"NEIndex","Placename"])

In [67]:
# Define which chapters and books you want to annotate
CHAPTERS=[1,2,3] 
BOOKS=[1,2]

You should define what spaCy processing you do or don't want in the pipeline.

In [68]:
disabled_pipeline=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]

Now let's process FtToHNL.

In [69]:
nlp = spacy.load("en_core_web_sm")  # language model

i=0 # counter of the entities
for book in BOOKS:
    for chapter in CHAPTERS:
        filename = "FtToHNL_BOOK_"+str(book)+"_CHAPTER_"+str(chapter)+".txt" 
        # set the specific path for the 'filename'
        text_location = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(text_location)

        # read this chapter
        with open(text_location, encoding="utf-8") as text_file:
            text = text_file.read()
        print("Working on |",filename)
        
        # run spaCy    
        doc = nlp(text,disable=disabled_pipeline)

        # document level
        ents = [(entity.text, entity.start_char, entity.end_char, entity.label_) for entity in doc.ents]

        # token level
        for entity in doc.ents:
            if entity.label_ in PLACENAME_CATEGORIES: # filter out MONEY, DATE etc
                print("{:5}\t\t{:30s}\t{}".format(i+1,entity.text, entity.label_))
                # To help understand the context of the text, extract the occurrence
                # (max() stops a negative start index wrapping around to the end of the text)
                context_text=doc.text[max(0,entity.start_char-30):entity.end_char+30].replace("\n"," ")

                # Add the placenames according to spaCy
                new_placename = {'Book':book,             # The Book number
                                'Chapter':chapter,        # The Chapter number
                                'NEIndex':i,              # A reference number to the nth Named Entity 
                                'Placename':entity.text,  # The placename in the text
                                'Category':entity.label_, # The spaCy category
                                'Context':context_text,   # The textual context where the placename was found
                                'Approval':1}             # A flag for whether this is a suitable placename
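                # DataFrame.append was removed in pandas 2.0, so concatenate a one-row dataframe
                placenames_df = pd.concat([placenames_df, pd.DataFrame([new_placename])], ignore_index=True)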
                
            i=i+1 # entity counter

Working on | FtToHNL_BOOK_1_CHAPTER_1.txt
    2		Malabar                       	GPE
    6		Crown                         	ORG
   10		Crown                         	ORG
   11		Crown                         	ORG
   15		the sleepy sea                	FAC
   17		the Bay of Biscay             	LOC
   22		Lord Bellasis                 	ORG
   23		Heath                         	ORG
   25		London                        	GPE
   38		Van Diemen's Land             	ORG
   44		Vickers                       	ORG
   46		Vickers                       	ORG
   47		Sylvia                        	ORG
   50		Bath                          	ORG
   53		Julia Vickers's               	ORG
   54		Frere                         	ORG
   58		Sylvia                        	GPE
   64		Chatham                       	FAC
   81		Sylvia                        	GPE
   93		Frere                         	ORG
   96		Sylvia                        	LOC
Working on | FtToHNL_BOOK_1_CHAPTER_2.txt
   97		CHAPTER II                 
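
The `Context` value stored for each placename is a window of characters around the entity. A minimal sketch of such a window is given below as a helper. The name `context_window` and its `width` parameter are hypothetical. The helper clamps the start of the window at zero because Python treats a negative slice start as counting from the end of the string, which would otherwise give a wrong or empty context for entities near the beginning of a chapter.

```python
def context_window(full_text, start_char, end_char, width=30):
    """Return the entity plus up to `width` characters either side (hypothetical helper)."""
    # Clamp at 0: a negative slice start would count from the end of the string
    window_start = max(0, start_char - width)
    window_end = min(len(full_text), end_char + width)
    # Flatten line breaks so the context displays on one line
    return full_text[window_start:window_end].replace("\n", " ")

sample_text = ("CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\n"
               "In the house of Major Vickers, Commandant of Macquarie Harbour,")

print(context_window(sample_text, 57, 64))  # "Vickers", well inside the text
print(context_window(sample_text, 0, 7))    # "CHAPTER", at the very start
```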

__[TO DO] Save a copy of this data__

In [70]:
placenames_df[['Book','Chapter','NEIndex','Placename','Category']]

Unnamed: 0,Book,Chapter,NEIndex,Placename,Category
0,1,1,1,Malabar,GPE
1,1,1,5,Crown,ORG
2,1,1,9,Crown,ORG
3,1,1,10,Crown,ORG
4,1,1,14,the sleepy sea,FAC
5,1,1,16,the Bay of Biscay,LOC
6,1,1,21,Lord Bellasis,ORG
7,1,1,22,Heath,ORG
8,1,1,24,London,GPE
9,1,1,37,Van Diemen's Land,ORG


This shows that a lot more placenames were found in Book 2 than Book 1. This makes sense since Book 1 is set on board an ocean voyage, whereas Book 2 is at Macquarie Harbour in Australia.

However, there are a number of NEs that are unlikely to be placenames, regardless of how spaCy categorised them. It is best to consider the context in which the terms were used. Use the checkboxes to select which terms you do consider to be placenames.

In [74]:
# lists of the data required for displaying the checkboxes
placename_items=[]
context_items=[]
num_items=[]

# Make a checkbox for every candidate placename in each book and chapter
for book in BOOKS:
    for chapter in CHAPTERS:
        # Get the NEs from this book and chapter
        placenames_bookchapter=placenames_df[(placenames_df["Book"]==book) & 
                                             (placenames_df["Chapter"]==chapter)]
        # get the contextual data for each indexed NE
        for i in placenames_bookchapter["NEIndex"]:
            context_text=placenames_bookchapter[placenames_bookchapter["NEIndex"]==i]["Context"].values[0]
            category=placenames_bookchapter[placenames_bookchapter["NEIndex"]==i]["Category"].values[0]
            
        # Make lists of the candidate placenames, context text and index numbers 
        # Only the placenames are given a checkbox. 
        placename_items = placename_items + [widgets.Checkbox(True,description=i) for i in placenames_bookchapter["Placename"]]
        context_items = context_items + [widgets.Label(placenames_bookchapter[placenames_bookchapter["NEIndex"]==i]["Context"].values[0]) for i in placenames_bookchapter["NEIndex"]]
        num_items = num_items + [widgets.Label(str(i)) for i in placenames_bookchapter["NEIndex"]]

# create a display
num_placenames=len(placename_items)
left_box = widgets.VBox(placename_items)
right_box = widgets.VBox(context_items)
num_box = widgets.VBox(num_items)
whole_box = widgets.HBox([num_box, left_box, right_box])
        
print("Unselect any Named Entities (NEs) that you do not consider to be placenames.")
print("Each instance of an NE is listed, with the textual context in which it appeared.")
display(whole_box)

Unselect any Named Entities (NEs) that you do not consider to be placenames.
Each instance of an NE is listed, with the textual context in which it appeared.


HBox(children=(VBox(children=(Label(value='1'), Label(value='5'), Label(value='9'), Label(value='10'), Label(v…

You can now copy all the values from the checkboxes to the data, so you know which placenames you have approved.

In [76]:
# Transfer the status of each checklist item to the data
for i in range(num_placenames):

    NEIndex_num = int(num_items[i].value)
    approval_flag = placename_items[i].value
    
    # set the flag on every row with this NEIndex to match the checklist
    placenames_df.loc[placenames_df["NEIndex"] == NEIndex_num,"Approval"] = approval_flag

You can now visualise the result.

In [77]:
placenames_df[['NEIndex','Placename','Approval']]

Unnamed: 0,NEIndex,Placename,Approval
0,1,Malabar,False
1,5,Crown,False
2,9,Crown,False
3,10,Crown,False
4,14,the sleepy sea,True
5,16,the Bay of Biscay,True
6,21,Lord Bellasis,False
7,22,Heath,True
8,24,London,True
9,37,Van Diemen's Land,True


From this, you can extract the final list of distinct placenames that you have approved. The names are not sorted (though they could be), but scanning the list will help you find any NE that you forgot to unselect on the checklist. If you find one, go back to the checklist, unselect it, then rerun all the steps from there to here.

In [78]:
# Make a unique list of the approved placenames
approved_placenames = placenames_df[placenames_df["Approval"]==True]['Placename'].unique()
print(approved_placenames)

['the sleepy sea' 'the Bay of Biscay' 'Heath' 'London' "Van Diemen's Land"
 'Vickers' 'Sylvia' 'Bath' "Julia Vickers's" 'Frere' 'Chatham'
 'CHAPTER II' 'Surgeon Pine' 'Coromandel' 'Pine' 'India'
 'the Hydaspes for Calcutta' 'the poop guard' 'MONOTONY' "Three'll"
 "Van Diemen's" 'Tasman' 'Cape Pillar' "Pirates' Bay" 'east' 'west'
 'the Isle of Wight' 'the South-West Cape' 'Swan Port' 'Mediterranean'
 'Maria Island' 'the Three Thumbs' 'Peninsula' 'Storm Bay'
 'Storing Island' 'Italy' 'Sorrell' 'Bruny Island' 'Mount Royal'
 "D'Entrecasteaux Channel" 'Actaeon' 'the South Cape' 'New Norfolk'
 'Derwent' 'the Southern Ocean' 'Tamar' 'Victoria' 'Port Philip Bay'
 'Wellington' 'Dromedary' 'Mount Wellington' 'Launceston' 'Smyrna'
 'Pyramid Island' 'Rocky Point' 'Port Davey' 'Mount Direction'
 'Macquarie Harbour' 'Mount Heemskirk' 'Mount Zeehan' "King's River"
 'Sarah Island' "Philip's Island" 'Hobart Town' 'earth' 'south-east'
 'Ladybird' 'Commandant' 'Port Arthur' 'Honduras' 'Arthur' 'Hells Gat
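
The same idea can be sketched without pandas. The example below removes repeats while keeping the order in which names were first seen, then shows a sorted copy that is easier to scan for mistakes. The short list of names is illustrative and was chosen from the placenames above.

```python
# A few approved placenames, with repeats, in the order they were found
approved = ["Hobart Town", "Maria Island", "Hobart Town", "Sarah Island", "Maria Island"]

# dict.fromkeys keeps the first occurrence of each name, in order
distinct_placenames = list(dict.fromkeys(approved))
print(distinct_placenames)

# Sorting, ignoring case, makes it easier to scan for names that slipped through
print(sorted(distinct_placenames, key=str.lower))
```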

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [79]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to ",save_location)

Saving placename data to  ../ner_output/FtToHNL_placenames.csv


In [None]:
# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           approved_placenames,
           delimiter =", ", 
           fmt ='% s')
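
One thing to be aware of when saving names as plain text is that a placename containing a comma would be split into two columns when the file is read back as a CSV. A possible alternative, sketched below, is Python's built-in `csv` module, which quotes such values automatically. The file name `placenames_demo.csv`, the temporary directory and the comma-containing name are all hypothetical and used only for this illustration.

```python
import csv
import os
import tempfile

# The last name is a made-up example that contains a comma
placenames = ["Hobart Town", "Maria Island", "Washington, D.C."]

# Hypothetical output location, used only for this sketch
demo_path = os.path.join(tempfile.mkdtemp(), "placenames_demo.csv")

# Write one placename per row; csv.writer quotes any value containing a comma
with open(demo_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    for name in placenames:
        writer.writerow([name])

# Read the rows back to confirm nothing was split or lost
with open(demo_path, newline="", encoding="utf-8") as csv_file:
    read_back = [row[0] for row in csv.reader(csv_file)]
print(read_back)
```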

## Finding Locations for the Placenames <a class="anchor" id="section-findinglocs"></a>

Now that you have a list of placenames from the text, the next step is to work out their location on Earth. For this you can use a combination of specialised lists of locations, gazetteers and heuristics. The objective is to match every placename with the coordinates of a known location.

The first step is to read the file of your placenames.

In [603]:
filename="FtToHNL_placenames.csv"
print("Working on | ", filename)

# set the specific path for the 'filename'
data_location = os.path.normpath(os.path.join(csv_directory, filename))
data_filename = os.path.basename(data_location)

# Using pandas, read the csv file. This will place it in a dataframe format. 
placenames_df = pd.read_csv(data_location, encoding="utf-8",header=None)

Working on |  FtToHNL_placenames.csv


In [604]:
placenames_df = placenames_df.rename(columns={placenames_df.columns[0]: 'Placename'})

In [605]:
placenames_df

Unnamed: 0,Placename
0,the sleepy sea
1,the Bay of Biscay
2,Heath
3,London
4,Van Diemen's Land
5,Vickers
6,Sylvia
7,Bath
8,Julia Vickers's
9,Frere


### Identifying States and Capitals <a class="anchor" id="section-statescapitals"></a>

Some placenames, like *High Street* or *Maryborough*, may be very common across the world, or even in Australia. However, certain placenames refer to significant locations, like states, territories, large geographic features or capital cities. As such, if they are mentioned in a text, the placename is more likely to refer to the major location than a town or village in Tasmania.

These significant locations are a finite set. They can be defined in a reference file that can be reused when reviewing the placenames of any text.

A good point for you to start is a file about locations like modern capital cities and countries, combined with historical locations of significance.

In [606]:
filename="reference_location_data.csv"
print("Working on | ", filename)

# set the specific path for the reference file
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Working on |  reference_location_data.csv


Read the reference file into a dataframe so that you can compare its locations with your placenames.

In [607]:
# Place the reference data in a dataframe
locref_df = pd.read_csv(reference_location, encoding="utf-8", header=0)

In [608]:
locref_df

Unnamed: 0,LocationName,Category,Latitude,Longitude,PartOf
0,Melbourne,city,-37.814218,144.963161,VIC
1,Brisbane,city,-27.468968,153.023499,QLD
2,Perth,city,-31.955896,115.860580,WA
3,Darwin,city,-12.460440,130.841047,NT
4,Alice Springs,city,-23.698388,133.881289,NT
...,...,...,...,...,...
550,Zagreb,Capital,16.000000,45.800000,Croatia
551,Zambia,Country,28.283333,-15.416667,Africa
552,Zanzibar City,Capital,39.198914,-6.165193,Tanzania
553,Zimbabwe,Country,31.033333,-17.816667,Africa


__[TO DO] update this text chunk to suit the workshop__

Of course, if you are researching historical texts, then some of these contemporary locations may have had different names. Old New York was once New Amsterdam (and had the [nickname of Gotham](https://www.nypl.org/blog/2011/01/25/so-why-do-we-call-it-gotham-anyway), amongst others). Istanbul was Constantinople. Some locations had [romanized names](https://en.wikipedia.org/wiki/Chinese_postal_romanization), like Beijing being called Peking. They may be a long time gone but you might want to add them to the list of significant known locations.

Capital cities also change over time. This may be due to political decisions, like the movement of the Australian parliament from Melbourne to the new city of Canberra, or a consequence of war, like Bonn becoming the capital of West Germany after World War II. These older capitals may also have to be accommodated in your reference data.

Because FtToHNL is set in the 19th Century CE, the next step is to add various capital cities from then.

There are also larger geopolitical regions that may have been associated with placenames and cultures, for instance empires, dynasties and colonies like the British Empire or the Zulu Kingdom. Again, the borders and applicability of these political entities changed over time, so a contemporary reference list may not include them. 

The 19th Century CE was a time of many European Empires so for FtToHNL, you will need to add reference data associated with relevant entities.

When processing this reference file, you can add the old political entity, its capital (if known), the geographic region (like continent or part thereof) and the modern country it would be considered part of.  
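
As a minimal sketch of what such additions might look like, the snippet below appends two historical names to rows that use the same columns as the reference file (`LocationName`, `Category`, `Latitude`, `Longitude`, `PartOf`). The category label `historical` and the rounded coordinates are illustrative assumptions, not values taken from the project's reference file.

```python
import csv
import io

FIELDS = ["LocationName", "Category", "Latitude", "Longitude", "PartOf"]

# Two rows copied from the reference data shown above
reference_rows = [
    {"LocationName": "Melbourne", "Category": "city",
     "Latitude": -37.814218, "Longitude": 144.963161, "PartOf": "VIC"},
    {"LocationName": "Brisbane", "Category": "city",
     "Latitude": -27.468968, "Longitude": 153.023499, "PartOf": "QLD"},
]

# Hypothetical historical entries (approximate coordinates, assumed category label)
historical_rows = [
    {"LocationName": "Constantinople", "Category": "historical",
     "Latitude": 41.01, "Longitude": 28.98, "PartOf": "Turkey"},
    {"LocationName": "Peking", "Category": "historical",
     "Latitude": 39.90, "Longitude": 116.40, "PartOf": "China"},
]

# Only add names that are not already in the reference data
known_names = {row["LocationName"] for row in reference_rows}
for row in historical_rows:
    if row["LocationName"] not in known_names:
        reference_rows.append(row)
        known_names.add(row["LocationName"])

# Write the combined rows in the same CSV layout as the reference file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(reference_rows)
reference_csv = buffer.getvalue()
print(reference_csv)
```

In practice you would write to the reference file itself rather than to an in-memory buffer.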

The next step is to see if any of the placenames from our selected chapters of FtToHNL match these locations.

__[TO DO] Describe this without being technical__

If we match a placename, copy the geolocation data for the matching location. Otherwise, keep it empty so we know to keep looking for the placename.

In [609]:

geolocdata = [] # all the data about placenames and locations, once linked

for placename in placenames_df['Placename']:
    
    # create a new geoloc entry about this placename
    new_geolocdata={} 
    # start a record for a placename
    new_geolocdata['placename'] = placename
    new_geolocdata['locations'] = {} # start with no location details
    new_geolocdata['locations']['best_match'] = [] # start with no match
    
    # Exact match
    if(placename in list(locref_df['LocationName'])):
        
        print("*** Found", placename,"[Exact match]")
        # Copy the details from the reference file entry
        new_geolocdata['locations']['best_match'] = locref_df[locref_df['LocationName']==placename]
        
    else:
        print("Still looking for ", placename)
    
    geolocdata.append(new_geolocdata) # add the new placename data to the list

Still looking for  the sleepy sea
Still looking for  the Bay of Biscay
Still looking for  Heath
*** Found London [Exact match]
Still looking for  Van Diemen's Land
Still looking for  Vickers
Still looking for  Sylvia
Still looking for  Bath
Still looking for  Julia Vickers's
Still looking for  Frere
Still looking for  Chatham
Still looking for  CHAPTER II
Still looking for  Surgeon Pine
Still looking for  Coromandel
Still looking for  Pine
*** Found India [Exact match]
Still looking for  the Hydaspes for Calcutta
Still looking for  the poop guard
Still looking for  MONOTONY
Still looking for  Three'll
Still looking for  Van Diemen's
Still looking for  Tasman
Still looking for  Cape Pillar
Still looking for  Pirates' Bay
Still looking for  east
Still looking for  west
Still looking for  the Isle of Wight
Still looking for  the South-West Cape
Still looking for  Swan Port
Still looking for  Mediterranean
Still looking for  Maria Island
Still looking for  the Three Thumbs
Still looking fo

Check that you have recorded the matches (and mismatches).

In [610]:
geolocdata[:10]

[{'placename': 'the sleepy sea', 'locations': {'best_match': []}},
 {'placename': 'the Bay of Biscay', 'locations': {'best_match': []}},
 {'placename': 'Heath', 'locations': {'best_match': []}},
 {'placename': 'London',
  'locations': {'best_match':     LocationName Category  Latitude  Longitude          PartOf
   273       London  Capital -0.083333       51.5  United Kingdom}},
 {'placename': "Van Diemen's Land", 'locations': {'best_match': []}},
 {'placename': 'Vickers', 'locations': {'best_match': []}},
 {'placename': 'Sylvia', 'locations': {'best_match': []}},
 {'placename': 'Bath', 'locations': {'best_match': []}},
 {'placename': "Julia Vickers's", 'locations': {'best_match': []}},
 {'placename': 'Frere', 'locations': {'best_match': []}}]
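
The same exact-match idea can be sketched without a dataframe, using a dictionary keyed on the location name. This is an illustrative alternative rather than the notebook's method; the helper `find_exact_match` and the case-insensitive comparison via `str.casefold` are assumptions added for the example.

```python
# Two reference rows copied from the reference data shown earlier
reference_rows = [
    {"LocationName": "Melbourne", "Category": "city",
     "Latitude": -37.814218, "Longitude": 144.963161, "PartOf": "VIC"},
    {"LocationName": "Brisbane", "Category": "city",
     "Latitude": -27.468968, "Longitude": 153.023499, "PartOf": "QLD"},
]

# Index the reference rows by a normalised (case-insensitive) name
reference_index = {row["LocationName"].casefold(): row for row in reference_rows}

def find_exact_match(placename):
    """Return the matching reference row, or None to signal 'keep looking'."""
    return reference_index.get(placename.strip().casefold())

for placename in ["Melbourne", "brisbane", "the sleepy sea"]:
    match = find_exact_match(placename)
    status = "found" if match else "still looking"
    print(placename, "->", status)
```

A dictionary lookup avoids rebuilding the list of location names for every placename, which may matter for longer texts.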

What locations did you end up finding?

In [611]:
matchdata = [p['locations']['best_match'].to_string(index=False,header=False) for p in geolocdata 
             if len(p['locations']['best_match'])>0]
matchdata

['London Capital -0.083333 51.5 United Kingdom',
 'India Country 77.2 28.6 Asia',
 'Italy Country 12.483333 41.9 Europe',
 'Victoria Capital 55.45 -4.616667 Seychelles',
 'Wellington Capital 174.783333 -41.3 New Zealand',
 'Honduras Country -87.216667 14.1 Central America']
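
In the matches above, Wellington is listed with a latitude of 174.783333, which falls outside the valid range of -90 to 90 degrees. This suggests that some reference rows may have their latitude and longitude swapped. A minimal sketch of a range check follows; the helper name `coords_in_range` is hypothetical, and the check cannot catch swapped values that both happen to fall between -90 and 90.

```python
def coords_in_range(latitude, longitude):
    """Return True if the values are plausible as (latitude, longitude)."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180

# (name, latitude, longitude) as printed in the matches above
matches = [
    ("London", -0.083333, 51.5),
    ("Wellington", 174.783333, -41.3),
    ("Honduras", -87.216667, 14.1),
]

# Collect the names whose coordinates fail the range check
suspect = [name for name, lat, lon in matches if not coords_in_range(lat, lon)]
print(suspect)
```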

We can now forget about the dataframe with the complete set of reference data.

In [612]:
del locref_df

### Searching a Gazetteer for Locations <a class="anchor" id="section-searchgazetteer"></a>

Search [Open Street Map (OSM)](https://nominatim.org/release-docs/develop/api/Search/) for locations that match the unknown placenames.
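
A minimal sketch of how a search URL for this API might be assembled. `urllib.parse.urlencode` encodes the spaces and apostrophes in placenames such as *Van Diemen's Land*. The helper name `build_osm_url` is hypothetical. The Nominatim usage policy also asks clients to identify themselves, so the `User-Agent` value shown is a placeholder assumption to replace with your own details.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_osm_url(placename, limit=5):
    """Build a Nominatim search URL with a safely encoded query string."""
    params = {"q": placename, "format": "json", "limit": limit}
    return NOMINATIM_SEARCH + "?" + urlencode(params)

# Placeholder identification header (an assumption; use your own project details)
headers = {"User-Agent": "my-geolocation-notebook (contact@example.org)"}

url = build_osm_url("Van Diemen's Land")
print(url)
# The request itself could then be sent with, for example:
# requests.get(url, headers=headers)
```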

In [613]:
#install ratelimit
# [TO DO] Shift to head of Notebook
import requests
from IPython.display import JSON
import json
from pprint import pprint
from ratelimit import limits, RateLimitException, sleep_and_retry

In [614]:
# How many (max) results do we want for each name?
#[TO DO] Make this a user setting, defaulting to 5
# The normal is (Default: 10, Maximum: 50), according to https://nominatim.org/release-docs/develop/api/Search/
OSM_limit = 5

In [615]:
# Send rate-limited requests that stay within 1 request per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def osm_call_api(url):
    response = requests.get(url)
    return response

# Convert a postcode from a string into a state abbreviation
def postcode_to_state(postcodestr):
    postcode = int(postcodestr)
    
    if (1000 <= postcode <= 2599) or (2619 <= postcode <= 2898) or (2921 <= postcode <= 2999):
        return "NSW"
    elif (200 <= postcode <= 299) or (2600 <= postcode <= 2618) or (2900 <= postcode <= 2920):
        return "ACT"
    elif (3000 <= postcode <= 3999) or (8000 <= postcode <= 8999):
        return "VIC"
    elif (4000 <= postcode <= 4999) or (9000 <= postcode <= 9998):
        return "QLD"
    elif 5000 <= postcode <= 5999:
        return "SA"
    elif (6000 <= postcode <= 6797) or (6800 <= postcode <= 6999):
        return "WA"
    elif 7000 <= postcode <= 7999:
        return "TAS"
    elif 800 <= postcode <= 999:
        return "NT"
    # some postcodes are special cases
    elif postcode == 2899:
        return "Norfolk Island"  # coded as NSW
    elif postcode == 6798:
        return "Christmas Island"  # coded as WA
    elif postcode == 6799:
        return "Cocos (Keeling) Islands"  # coded as WA
    elif postcode == 9999:
        return "North Pole"  # coded as VIC for Santa mail
    
    # not a recognised postcode, so return the input unchanged
    return postcodestr

    
# Format the api response to make comparison easier
def osm_format_response(record):

    # Shorten the name and extract the country name, if any.
    # A display_name without commas yields a single term,
    # which is then used for both the short name and the country.
    namesplit = record["display_name"].split(',')
    # extract the leftmost and rightmost terms from the split
    shortlocation = namesplit[0].strip()
    hyperlocation = namesplit[-1].strip()

    # change to an Australian state name
    if (hyperlocation=="Australia" and len(namesplit)>2):
        hyperlocation = namesplit[-2].strip()
        if (hyperlocation.isdigit() and len(hyperlocation)==4):
            hyperlocation = postcode_to_state(hyperlocation)
        
    # for now, keep the names consistent between records 
    response = {"LocationName": str(shortlocation), 
              "Category": str(record["type"]),
              "Latitude": record["lat"], 
              "Longitude": record["lon"],
              "PartOf": str(hyperlocation),
              "Gazetteer": "OSM"
                }
    return response
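
To illustrate the address handling, the sketch below splits a sample `display_name` into its first and last components. The sample string and the helper name `split_display_name` are hypothetical; real Nominatim results vary in how many components they contain.

```python
def split_display_name(display_name):
    """Return (first component, last component) of a comma-separated address."""
    parts = [part.strip() for part in display_name.split(",")]
    return parts[0], parts[-1]

# Hypothetical display_name, for illustration only
sample = "Bath, Bath and North East Somerset, England, United Kingdom"
short_name, country = split_display_name(sample)
print(short_name, "|", country)
```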

__[TO DO] Talk more__

You can now search OSM for every placename that has not yet been matched to a reference location.

In [616]:
import urllib.parse

# For every placename in our list
for p in geolocdata:
    # Already found a location, so skip to the next placename
    if len(p['locations']['best_match']) > 0:
        continue
        
    placename = p['placename']
    print ("looking for",placename)

    # query the OSM database
    # (quote the placename so that spaces and apostrophes are safe in the url)
    safe_placename = urllib.parse.quote(placename)
    url = f"https://nominatim.openstreetmap.org/search?q={safe_placename}&format=json&limit={OSM_limit}"
    response = osm_call_api(url)
    response_dict = json.loads(response.text)

    p['locations']['candidates']=None
    
    # Handle no results found
    if len(response_dict) == 0:
        # skip to the next placename
        continue
        
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response_dict:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([osm_format_response(response_record)])

        # Add the data to the list of dataframes
        # The cleaned_response should be in the form we want to keep.
        data_frames.append(cleaned_response)

    # add the results to the geoloc data
    # review the outcomes later
    p['locations']['candidates'] = data_frames

    

looking for the sleepy sea
looking for the Bay of Biscay
looking for Heath
looking for Van Diemen's Land
looking for Vickers
looking for Sylvia
looking for Bath
looking for Julia Vickers's
looking for Frere
looking for Chatham
looking for CHAPTER II
looking for Surgeon Pine
looking for Coromandel
looking for Pine
looking for the Hydaspes for Calcutta
looking for the poop guard
looking for MONOTONY
looking for Three'll
looking for Van Diemen's
looking for Tasman
looking for Cape Pillar
looking for Pirates' Bay
looking for east
looking for west
looking for the Isle of Wight
looking for the South-West Cape
looking for Swan Port
looking for Mediterranean
looking for Maria Island
looking for the Three Thumbs
looking for Peninsula
looking for Storm Bay
looking for Storing Island
looking for Sorrell
looking for Bruny Island
looking for Mount Royal
looking for D'Entrecasteaux Channel
looking for Actaeon
looking for the South Cape
looking for New Norfolk
looking for Derwent
looking for the Sout

What placenames have you still not found?

In [617]:
unmatcheddata = [p['placename'] for p in geolocdata 
             if len(p['locations']['best_match'])==0 and p['locations']['candidates'] is None]
unmatcheddata

['the sleepy sea',
 "Julia Vickers's",
 'the Hydaspes for Calcutta',
 'the poop guard',
 'MONOTONY',
 'the Three Thumbs',
 'Storing Island',
 'Commandant',
 'verandah.-She',
 'Grummet Island']

Where have you found locations for placenames?

In [618]:
matchdata = [p['placename'] for p in geolocdata 
             if len(p['locations']['best_match'])>0 or p['locations']['candidates'] is not None]
matchdata

['the Bay of Biscay',
 'Heath',
 'London',
 "Van Diemen's Land",
 'Vickers',
 'Sylvia',
 'Bath',
 'Frere',
 'Chatham',
 'CHAPTER II',
 'Surgeon Pine',
 'Coromandel',
 'Pine',
 'India',
 "Three'll",
 "Van Diemen's",
 'Tasman',
 'Cape Pillar',
 "Pirates' Bay",
 'east',
 'west',
 'the Isle of Wight',
 'the South-West Cape',
 'Swan Port',
 'Mediterranean',
 'Maria Island',
 'Peninsula',
 'Storm Bay',
 'Italy',
 'Sorrell',
 'Bruny Island',
 'Mount Royal',
 "D'Entrecasteaux Channel",
 'Actaeon',
 'the South Cape',
 'New Norfolk',
 'Derwent',
 'the Southern Ocean',
 'Tamar',
 'Victoria',
 'Port Philip Bay',
 'Wellington',
 'Dromedary',
 'Mount Wellington',
 'Launceston',
 'Smyrna',
 'Pyramid Island',
 'Rocky Point',
 'Port Davey',
 'Mount Direction',
 'Macquarie Harbour',
 'Mount Heemskirk',
 'Mount Zeehan',
 "King's River",
 'Sarah Island',
 "Philip's Island",
 'Hobart Town',
 'earth',
 'south-east',
 'Ladybird',
 'Port Arthur',
 'Honduras',
 'Arthur',
 'Hells Gates',
 'England',
 'New Town'

While Open Street Map is a wonderful resource, it focuses on the current names of geographic locations. If the original source of your placenames was not written in recent decades, then OSM may not know the names that were in use when the document was written.

One solution is to also search a historical gazetteer, like TLCMap.

Like the OSM API, the TLCMap API has various options, such as which type of search to use and whether to also search data entered by the public, rather than only data that has been verified or entered by experts.

__[TO DO] Describe the TLCMap__

For this workshop, you will look for exact matches between the placenames and the locations, and not consider any publicly entered data.

In [619]:
# Which type of search to use when looking for known locations
search_type = 'exact' # possible values: 'exact', 'fuzzy', 'contains' 

# Flag whether to use data provided by the public
search_public_data = False # alt values = True, False

As with OSM, you can limit how many results you want to examine. This notebook uses a limit of 5 for TLCMap.

In [620]:
TLCMap_limit = 5

As with OSM, you will need a few functions to query the API.

In [621]:
import urllib.parse
from typing import Optional

def tlc_build_url(placename: str, search_type: str, search_public_data: bool = False) -> Optional[str]:
    """
    Build a url to query the tlcmap/ghap API.
    placename: the place we're trying to locate
    search_type: what search type to use (accepts one of ['contains','fuzzy','exact'])
    search_public_data: whether to also search data provided by the public
    Returns None if the search_type is not recognised.
    
    ref: https://www.tlcmap.org/guides/ghap/#ws
    """
    safe_placename = urllib.parse.quote(placename.strip().lower())

    url = "https://tlcmap.org/ghap/search?"

    if search_type == 'fuzzy':
        url += f"fuzzyname={safe_placename}"
    elif search_type == 'exact':
        url += f"name={safe_placename}"
    elif search_type == 'contains':
        url += f"containsname={safe_placename}"
    else:
        return None

    # Search Australian National Placenames Survey provided data
    url += "&searchausgaz=on"
    
    # Search public provided data, this data could be unreliable
    if search_public_data:
        url += "&searchpublicdatasets=on"
    else:
        url += "&searchpublicdatasets=off"
        
    # Retrieve data as JSON
    url += "&format=json"
    
    # limit the number of results
    url += "&paging=" + str(TLCMap_limit)

    return url

# Send rate-limited requests that stay within 1 request per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def tlc_call_api(url):
    r = requests.get(url)
    if r.url == 'https://tlcmap.org/ghap/maxpaging':
        return None

    # Default reply if the request fails
    response = None

    # If the reply says the placename wasn't found, customise the JSON data for the reply
    if r.content.decode() == "No search results to display.":
        # TLCMap replies with plain text here rather than JSON, so build an empty list of features
        response = json.loads('{"type": "FeatureCollection","metadata": {},"features": []}')
    # SUCCESS! Record the spatial data provided in the reply
    elif r.ok:
        response = r.json()    # get [lon, lat] for spatial matches

    return response


def tlc_query_name(placename: str, search_type: str):
    """
    Use the tlcmap/ghap API to look up a placename.
    Returns the parsed JSON reply, or None if the url could not be built or the request failed.
    """
    url = tlc_build_url(placename, search_type, search_public_data)
    if url:
        return tlc_call_api(url)
    return None

In [622]:
# Format the api response to make comparison easier
def tlc_format_response(feature):

    # nothing to format
    if not len(feature):
        return None

    # Gather the data for one of the placename's locations,
    # using default values for any details that are missing
    properties = feature['properties']

    # for now, keep the names consistent between records 
    response = {"LocationName": str(properties.get('placename', "Unknown Location")).strip(), 
              "Category": str(properties.get('feature_term')).strip(),
              "Latitude": properties.get('latitude', ""), 
              "Longitude": properties.get('longitude', ""),
              "PartOf": str(properties.get('state')).strip(),
              "Gazetteer": "TLCMap"
                }

    return response

You can now search the TLCMap for locations matching the same placenames you previously searched for in the OSM.

In [623]:
# For every placename in our list
for p in geolocdata:
    # Already found a location, so skip to the next placename
    if len(p['locations']['best_match']) > 0:
        continue
        
    placename = p['placename']
    print ("looking for",placename)

    # query the TLCMap gazetteer
    response = tlc_query_name(placename,search_type)
    
    # Handle no results found
    if response is None:
        # skip to the next placename
        continue
        
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response["features"]:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([tlc_format_response(response_record)])

        # Add the data to a dataframe
        data_frames.append(cleaned_response)
        
    # add the results to the geoloc dataframe
    # review the outcomes later
    # Make sure you don't write over any candidates previously added from another gazetteer.
    if p['locations']['candidates'] is None:
        p['locations']['candidates'] = data_frames
    else:
        p['locations']['candidates'] = p['locations']['candidates'] + data_frames   

looking for the sleepy sea
looking for the Bay of Biscay
looking for Heath
looking for Van Diemen's Land
looking for Vickers
looking for Sylvia
looking for Bath
looking for Julia Vickers's
looking for Frere
looking for Chatham
looking for CHAPTER II
looking for Surgeon Pine
looking for Coromandel
looking for Pine
looking for the Hydaspes for Calcutta
looking for the poop guard
looking for MONOTONY
looking for Three'll
looking for Van Diemen's
looking for Tasman
looking for Cape Pillar
looking for Pirates' Bay
looking for east
looking for west
looking for the Isle of Wight
looking for the South-West Cape
looking for Swan Port
looking for Mediterranean
looking for Maria Island
looking for the Three Thumbs
looking for Peninsula
looking for Storm Bay
looking for Storing Island
looking for Sorrell
looking for Bruny Island
looking for Mount Royal
looking for D'Entrecasteaux Channel
looking for Actaeon
looking for the South Cape
looking for New Norfolk
looking for Derwent
looking for the Sout

What placenames have you still not found?

In [624]:
unmatcheddata = [p['placename'] for p in geolocdata 
             if (len(p['locations']['best_match'])==0 and 
                 not p['locations'].get('candidates'))] # no candidates: None or an empty list
unmatcheddata

['the sleepy sea',
 "Julia Vickers's",
 'the Hydaspes for Calcutta',
 'the poop guard',
 'MONOTONY',
 'the Three Thumbs',
 'Storing Island',
 'Commandant',
 'verandah.-She']

You can now compare all the locations you have found.

In [625]:
matcheddata = [p for p in geolocdata 
             if (len(p['locations']['best_match'])!=0 or 
                 p['locations'].get('candidates'))] # at least one candidate
alllocations=[]
for p in matcheddata:
    placename = p['placename']
    if len(p['locations']['best_match'])!=0:
        locations=p['locations']['best_match']
    else:
        # combine the candidates from all gazetteers into one dataframe
        locations = pd.concat(p['locations']['candidates'],ignore_index=True) 
    print("===> ",placename)
    print(locations)
    alllocations.append(locations)

===>  the Bay of Biscay
        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM
===>  Heath
  LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath  administrative          32.8365147         -96.474987   United States       OSM
1        Heath  administrative          31.3607243        -86.4696811   United States       OSM
2        Heath  administrative          40.0228421        -82.4445991   United States       OSM
3        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM
4        Heath  administrative          42.6898311        -72.8178076   United States       OSM
5        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
6        Heath          parish               -19.6              143.6             QLD    TLCMap
===>  London
    L

4     Ladybird      clothes         -43.6020782         172.7198237  New Zealand / Aotearoa       OSM
===>  Port Arthur
  LocationName            Category      Latitude    Longitude           PartOf Gazetteer
0  Port Arthur      administrative    29.8988618  -93.9288723    United States       OSM
1  Port Arthur      administrative   -43.1435301  147.8440404              TAS       OSM
2          大连市                city    38.9181714  121.6282945               中国       OSM
3  Port Arthur              suburb    60.4468775   22.2444076  Suomi / Finland       OSM
4  Port Arthur              suburb    48.4348261  -89.2194423           Canada       OSM
5  Port Arthur  locality (bounded)     -34.14898    138.06368               SA    TLCMap
6  Port Arthur                 Bay  -43.13999939  147.8399963              TAS    TLCMap
7  Port Arthur     Suburb/Locality  -43.13999939  147.8399963              TAS    TLCMap
===>  Honduras
    LocationName Category   Latitude  Longitude           PartOf

In [626]:
pprint(alllocations[0:10])

[        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM,
   LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath  administrative          32.8365147         -96.474987   United States       OSM
1        Heath  administrative          31.3607243        -86.4696811   United States       OSM
2        Heath  administrative          40.0228421        -82.4445991   United States       OSM
3        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM
4        Heath  administrative          42.6898311        -72.8178076   United States       OSM
5        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
6        Heath          parish               -19.6              143.6             QLD    TLCMap,
     LocationName Category  Latitude  Longitude   

The task now is to work out which of these locations is most suitable.

A few heuristics can flag locations with key features, which can then be used to rank the locations and select the best one.

One heuristic is to note when multiple candidate locations have similar coordinates.

In [627]:
# Compare two sets of coordinates
# Return True if their whole-degree values are within 1 degree of each other for both Lat and Lon
def compare_coords(Lat1,Lon1,Lat2,Lon2):

    # cannot compare the coordinates if any of them is missing
    if Lat1.item() == "" or Lat2.item() == "" or Lon1.item() == "" or Lon2.item() == "":
        return False
        
    Lat1Int = int(float(Lat1.item()))
    Lon1Int = int(float(Lon1.item()))
    Lat2Int = int(float(Lat2.item()))
    Lon2Int = int(float(Lon2.item()))

    # Is Coord1 within 1 degree of Coord2
    return abs(Lat1Int-Lat2Int) <= 1 and abs(Lon1Int-Lon2Int) <= 1
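
As a usage illustration, the same truncate-and-compare logic can be applied to plain values. The helper `coords_close` below is a hypothetical scalar version written for this example; the coordinates are those of the Port Arthur candidates listed earlier.

```python
def coords_close(lat1, lon1, lat2, lon2):
    """True if the whole-degree parts of both coordinates differ by at most 1."""
    lat_diff = abs(int(float(lat1)) - int(float(lat2)))
    lon_diff = abs(int(float(lon1)) - int(float(lon2)))
    return lat_diff <= 1 and lon_diff <= 1

osm_tas = ("-43.1435301", "147.8440404")     # OSM, Tasmania
tlc_tas = ("-43.13999939", "147.8399963")    # TLCMap, Tasmania
osm_usa = ("29.8988618", "-93.9288723")      # OSM, United States

print(coords_close(*osm_tas, *tlc_tas))   # the two Tasmanian candidates
print(coords_close(*osm_tas, *osm_usa))   # Tasmania versus the United States
```

Two gazetteers agreeing on roughly the same spot is a useful signal that both describe the same location.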

However, sometimes there are no coordinates for a location, which can complicate the comparisons. The solution is to default to the value of 0 for missing data. It is not perfect but in most cases it is adequate.

In [628]:
# Given a coordinate as a number or a string, convert it to an integer
# Missing values (None, NaN or "") are set to 0.
def sorting_coord(Coord):
   
    # set to 0 if missing data
    if Coord is None or Coord == "" or pd.isna(Coord):
        return 0
    return int(float(Coord))
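
A small illustration of the effect of this default on sorting. The helper `coord_key` is a hypothetical stand-in for the function above that works on plain values, and the candidate with empty coordinates is invented for the example.

```python
def coord_key(value):
    """Whole-degree sort key, treating missing values as 0."""
    if value is None or value == "":
        return 0
    return int(float(value))

# (name, latitude, longitude); the last entry has no coordinates
candidates = [
    ("Heath, United Kingdom", "53.1908153", "-1.3499525"),
    ("Heath, NSW", "-32.13166666666667", "149.3011111111111"),
    ("Heath, no coordinates", "", ""),
]

# Sort by whole-degree latitude, then whole-degree longitude
ordered = sorted(candidates, key=lambda c: (coord_key(c[1]), coord_key(c[2])))
ordered_names = [name for name, _, _ in ordered]
print(ordered_names)
```

A location with missing coordinates sorts as if it sat at latitude 0 and longitude 0, so it may end up next to genuine locations near that point.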

Another consideration is to recognise which locations are in countries that you know are relevant to the original document. For this, you can focus on the values that may appear in the PartOf field.

In [629]:
# Flag locations in Australia or Britain
aus_states = {"AUSTRALIA","NSW","VIC","QLD","TAS","WA","NT","SA","ACT",
             "NEW SOUTH WALES", "VICTORIA", "QUEENSLAND","TASMANIA","WESTERN AUSTRALIA",
             "SOUTH AUSTRALIA", "NORTHERN TERRITORY","AUSTRALIAN CAPITAL TERRITORY"}
gb_states = {"BRITAIN","UK","GB","GREAT BRITAIN","UNITED KINGDOM","BRITISH ISLES",
            "ENGLAND","WALES","SCOTLAND",
            "IRELAND","NORTHERN IRELAND", "ÉIRE / IRELAND", "EIRE", "EIRE / IRELAND"}

Now you can go through each location and its candidate locations and flag whether they correspond to any of these criteria. A distinction is made between whether any two locations with similar coordinates were found in the same gazetteer or a different one.
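
Before flagging every candidate, the lookup for a single PartOf value can be sketched as follows. The sets here are reduced versions of those defined above, and the helper name `region_flag` is hypothetical.

```python
# Reduced versions of the sets above, for illustration only
aus_states = {"AUSTRALIA", "NSW", "TAS", "TASMANIA"}
gb_states = {"UNITED KINGDOM", "ENGLAND", "SCOTLAND"}

def region_flag(part_of):
    """Return 'Australia', 'Britain' or '' for a PartOf value."""
    key = part_of.strip().upper()   # compare in upper case, as the sets are upper case
    if key in aus_states:
        return "Australia"
    if key in gb_states:
        return "Britain"
    return ""

for value in ["TAS", "United Kingdom", "Canada"]:
    print(value, "->", repr(region_flag(value)))
```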

In [630]:
for p in matcheddata:
    placename = p['placename']
    locations = []
    if (len(p['locations']['best_match'])==0 and
        p['locations']['candidates']!= None and 
        len(p['locations']['candidates']) > 1):
        
        # sort the locations by Latitude & Longitude
        sorted_candidates = sorted(p['locations']['candidates'],
                                   key=lambda x:[sorting_coord(x['Latitude'].item()),
                                                 sorting_coord(x['Longitude'].item())] )

        prev_location = {} 

        # flag the candidates according to the heuristics
        for c in sorted_candidates:
            rank_flags = []
            partof_flags = ""
            
            # Flag locations in Australia or Britain
            partings=c['PartOf'].item().upper()
            if partings in aus_states:
                partof_flags='Australia'
            if partings in gb_states:
                partof_flags='Britain'
            if partof_flags!="":
                rank_flags.append(partof_flags)
 
            # Flag coords in multiple gazetteers
            if len(prev_location)>0:
                coord_flag = compare_coords(prev_location['Latitude'],
                                        prev_location['Longitude'],
                                        c['Latitude'],
                                        c['Longitude'])
                # Dupl_2Gaz: Matching coords in locations from 2 gazetteers 
                if coord_flag and prev_location['Gazetteer'].item()!=c['Gazetteer'].item():
                    rank_flags.append("Dupl_2Gaz")
                # Dupl_1Gaz: Matching coords in locations from 1 gazetteer
                elif coord_flag:
                    rank_flags.append("Dupl_1Gaz")

            prev_location = c
                    
            # rank_flags has to be converted to a single string, rather than a list 
            # because c is a Pandas dataframe
            c["RankFlags"]=pd.Series(','.join(rank_flags))
            print(" ** ",c['LocationName'].item(),",",partings,",[",(c["RankFlags"].item()),']') 

        # Update the geoloc data
        p['locations']['candidates']=sorted_candidates

 **  Heath , NSW ,[ Australia ]
 **  Heath , QLD ,[ Australia ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED KINGDOM ,[ Britain ]
 **  Tasmania , AUSTRALIA ,[ Australia ]
 **  Van Diemen's Land , TAS ,[ Australia ]
 **  Van Diemen's Land , UNITED KINGDOM ,[ Britain ]
 **  Vickers , UNITED STATES ,[  ]
 **  Vickers , UNITED STATES ,[ Dupl_1Gaz ]
 **  Vickers , UNITED STATES ,[  ]
 **  Vickers , UNITED STATES ,[  ]
 **  Vickers , UNITED KINGDOM ,[ Britain ]
 **  Sylvia , QLD ,[ Australia ]
 **  Sylvia , PHILIPPINES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Bath , SA ,[ Australia ]
 **  Bath County , UNITED STATES ,[  ]
 **  Bath , UNITED STATES ,[  ]
 **  Bath , UNITED STATES ,[  ]
 **  Bath , UNITED KINGDOM ,[ Britain ]
 **  Bath Abbey , UNITED KINGDOM ,[ Britain,Dupl_1Gaz ]
 **

 **  Mount Direction , QUEENSLAND ,[ Australia ]
 **  Mount Direction , QLD ,[ Australia,Dupl_2Gaz ]
 **  Macquarie Harbour , TASMANIA ,[ Australia ]
 **  Macquarie Harbour , TAS ,[ Australia,Dupl_2Gaz ]
 **  Mount Heemskirk , TASMANIA ,[ Australia ]
 **  Mount Heemskirk , TAS ,[ Australia,Dupl_2Gaz ]
 **  Mount Zeehan , TAS ,[ Australia ]
 **  Mount Zeehan , TAS ,[ Australia,Dupl_2Gaz ]
 **  King's River , ÉIRE / IRELAND ,[ Britain ]
 **  King's River , ÉIRE / IRELAND ,[ Britain,Dupl_1Gaz ]
 **  King's River , ÉIRE / IRELAND ,[ Britain,Dupl_1Gaz ]
 **  King's River , ÉIRE / IRELAND ,[ Britain,Dupl_1Gaz ]
 **  King's River , ÉIRE / IRELAND ,[ Britain,Dupl_1Gaz ]
 **  Sarah Island , TASMANIA ,[ Australia ]
 **  Sarah Island , TAS ,[ Australia,Dupl_2Gaz ]
 **  Sarah Island , UNITED STATES ,[  ]
 **  Sarah Island , UNITED STATES ,[  ]
 **  Sarah Island , CANADA ,[  ]
 **  ᐅᑲᓕᐊᕐᔪᒃ Sarah Island , CANADA ,[  ]
 **  Saint Philip's Baptist Church , UNITED STATES ,[  ]
 **  King Philip's Remain

You can now use the heuristic flags to re-sort the candidates and select the best one.

In [631]:
# Establish how to rank the heuristic flags
sortorder = ['Australia,Dupl_2Gaz',  # in Australia, found in 2 gazetteers
             'Australia,Dupl_1Gaz',  # in Australia, found more than once in 1 gazetteer
             'Australia',            # in Australia, found only once
             'Britain,Dupl_2Gaz',    # in Great Britain, found in 2 gazetteers
             'Dupl_2Gaz',            # not in Australia or Great Britain, found in 2 gazetteers
             'Britain,Dupl_1Gaz',    # in Great Britain, found more than once in 1 gazetteer
             'Britain',              # in Great Britain, found only once
             'Dupl_1Gaz',            # not in Australia or Great Britain, found more than once in 1 gazetteer
             ""]                     # not in Australia or Great Britain, found only once

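The same ranking can also be written as a sort key. The sketch below is a simplified illustration, not the notebook's own method: it assumes each candidate is a plain dict with a `RankFlags` string (in the notebook each candidate is a one-row DataFrame, so the flag would be read with `c['RankFlags'].item()`), and `rank_of` is a hypothetical helper. Flag combinations that are missing from `sortorder` are ranked last rather than lost.

```python
# Rank order copied from the notebook cell above
sortorder = ['Australia,Dupl_2Gaz', 'Australia,Dupl_1Gaz', 'Australia',
             'Britain,Dupl_2Gaz', 'Dupl_2Gaz', 'Britain,Dupl_1Gaz',
             'Britain', 'Dupl_1Gaz', '']

# Map each flag string to its position in the ranking
rank = {flags: position for position, flags in enumerate(sortorder)}

def rank_of(flags):
    """Return the rank of a flag string; unlisted combinations rank last."""
    return rank.get(flags, len(sortorder))

# Hypothetical, simplified candidates (plain dicts rather than DataFrames)
candidates = [
    {'LocationName': 'Heath', 'PartOf': 'United States', 'RankFlags': ''},
    {'LocationName': 'Heath', 'PartOf': 'United Kingdom', 'RankFlags': 'Britain'},
    {'LocationName': 'Heath', 'PartOf': 'NSW', 'RankFlags': 'Australia'},
]

# sorted() is stable, so candidates with equal rank keep their earlier order
sorted_candidates = sorted(candidates, key=lambda c: rank_of(c['RankFlags']))
print([c['PartOf'] for c in sorted_candidates])
# ['NSW', 'United Kingdom', 'United States']
```

Because the sort is stable, candidates that share a rank stay in the order produced by the earlier coordinate-based sort.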
In [632]:
# Sort the candidates
for p in geolocdata:
    placename = p['placename']
    pprint(placename)

    if (len(p['locations']['best_match'])==0 and # haven't found a winner yet
        p['locations']['candidates'] is not None and 
        len(p['locations']['candidates'])> 1): # more than one candidate
        
        candidates = p['locations']['candidates']
        #pprint(candidates)

        # Collect the candidates in the order given by sortorder
        sorted_candidates=[]
        for h in sortorder:
            for c in candidates:
                if c['RankFlags'].item() == h: 
                    sorted_candidates.append(c)
        # Keep any candidate whose flag combination is not in sortorder, ranked last
        sorted_candidates += [c for c in candidates 
                              if c['RankFlags'].item() not in sortorder]
        locations = pd.concat(sorted_candidates,ignore_index=True) 
        pprint(locations)
        
        # update the geoloc data
        p['locations']['candidates']=sorted_candidates

'the sleepy sea'
'the Bay of Biscay'
'Heath'
  LocationName        Category            Latitude          Longitude          PartOf Gazetteer  RankFlags
0        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap  Australia
1        Heath          parish               -19.6              143.6             QLD    TLCMap  Australia
2        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM    Britain
3        Heath  administrative          31.3607243        -86.4696811   United States       OSM           
4        Heath  administrative          32.8365147         -96.474987   United States       OSM           
5        Heath  administrative          40.0228421        -82.4445991   United States       OSM           
6        Heath  administrative          42.6898311        -72.8178076   United States       OSM           
'London'
"Van Diemen's Land"
        LocationName        Category     Latitude            Longitude

'Tasman'
               LocationName           Category      Latitude           Longitude                  PartOf  \
0                    Tasman  Municipality/City       -43.054             147.834                     TAS   
1                    Tasman     administrative   -43.0422943  147.69664144273847                Tasmania   
2                    Tasman          homestead     -32.46015           144.58254                     NSW   
3                    Tasman          homestead                                                       NSW   
4                    Tasman            village   -41.1891054         173.0526542  New Zealand / Aotearoa   
5  Mount Tasman / Rarakiroa               peak    -43.565698         170.1571884  New Zealand / Aotearoa   
6                    Tasman     administrative  -41.30222105  172.89453190955697  New Zealand / Aotearoa   
7                    Tasman            station    37.4086658        -121.9444674           United States   

  Gazetteer       

6              Britain  
'Smyrna'
        LocationName        Category    Latitude    Longitude         PartOf Gazetteer RankFlags
0             Smyrna  administrative   33.883887  -84.5147454  United States       OSM          
1             Smyrna            town  35.9824598  -86.5199492  United States       OSM          
2              İzmir            city  38.4224548   27.1310699        Türkiye       OSM          
3             Smyrna  administrative  39.2998339  -75.6046494  United States       OSM          
4  Village of Smyrna  administrative  42.6872921  -75.5707376  United States       OSM          
'Pyramid Island'
     LocationName Category             Latitude           Longitude            PartOf Gazetteer  RankFlags
0  Pyramid Island   Island         -42.59000015         145.7299957               TAS    TLCMap  Australia
1  Pyramid Island   Island         -39.81999969         147.2299957               TAS    TLCMap  Australia
2  Pyramid Island    islet  -62.41742569999999

1      Grummet     locality  50.3686748  10.8561277  Deutschland       OSM           
'Malabar'
  LocationName        Category             Latitude           Longitude         PartOf Gazetteer  \
0      Malabar       homestead           -35.564599         148.3241805            NSW    TLCMap   
1      Malabar       homestead          -34.2227815          148.163882            NSW    TLCMap   
2      Malabar       homestead           -35.718037          147.892788            NSW    TLCMap   
3      Malabar  administrative          -33.9629779         151.2480171            NSW       OSM   
4      Malabar       homestead          -29.5811455          147.689279            NSW    TLCMap   
5      Malabar            None  -27.583333333333332  152.58333333333334            QLD    TLCMap   
6      Malabar         village           -7.2880681         108.6960653      Indonesia       OSM   
7      Malabar         village           -6.9260791         107.6210172      Indonesia       OSM   
8   

In [633]:
# Outputting without the RankFlags column
matcheddata = [p for p in geolocdata 
             if (len(p['locations']['best_match'])!=0 or 
                 (p['locations']['candidates'] is not None and 
                  len(p['locations']['candidates'])>0))]
alllocations=[]
for p in matcheddata:
    placename = p['placename']
    locations = []
    if len(p['locations']['best_match'])!=0:
        locations=p['locations']['best_match']
    else:
        if p['locations']['candidates']!= None:
            short_candidates = []
            for c in p['locations']['candidates']:
                short_candidates = short_candidates + [c.loc[:, c.columns != 'RankFlags']]
            locations = pd.concat(short_candidates,ignore_index=True) 
    #print("===> ",placename)
    print(locations)


        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM
  LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
1        Heath          parish               -19.6              143.6             QLD    TLCMap
2        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM
3        Heath  administrative          31.3607243        -86.4696811   United States       OSM
4        Heath  administrative          32.8365147         -96.474987   United States       OSM
5        Heath  administrative          40.0228421        -82.4445991   United States       OSM
6        Heath  administrative          42.6898311        -72.8178076   United States       OSM
    LocationName Category  Latitude  Longitude        

7       OSM  
  LocationName         Category      Latitude    Longitude                             PartOf Gazetteer
0  Cape Pillar  Suburb/Locality  -43.18000031  147.9299927                                TAS    TLCMap
1  Cape Pillar             cape   -43.2215036  148.0107165                           Tasmania       OSM
2  Cape Pillar             Cape  -43.22000122  148.0099945                                TAS    TLCMap
3  Cape Pillar   administrative   -43.1817045  147.9383047                           Tasmania       OSM
4  Cape Pillar         locality   -53.1379539   73.4104006  Heard Island and McDonald Islands       OSM
  LocationName       Category     Latitude    Longitude     PartOf Gazetteer
0  Pirates Bay            bay   -43.021334  147.9384131        TAS       OSM
1  Pirates Bay            bay  -38.3792167  144.7708906        VIC       OSM
2  Pirates Bay  travel_agency   -8.3647299  116.0860116  Indonesia       OSM
3  Pirates Bay     restaurant   -8.7998639  115.235440

0  Tweedelig hert  artwork  51.9864345  5.6694395  Nederland       OSM
  LocationName    Category    Latitude            Longitude          PartOf Gazetteer
0         Cape  industrial  52.2840847  -1.9055396710979262  United Kingdom       OSM
  LocationName            Category      Latitude           Longitude    PartOf Gazetteer
0  New Norfolk  locality (bounded)  -42.77999878         147.0500031       TAS    TLCMap
1  New Norfolk             station   -42.7760355  147.05581769761721  Tasmania       OSM
2  New Norfolk      administrative   -42.7801998         147.0615332  Tasmania       OSM
  LocationName        Category             Latitude            Longitude          PartOf Gazetteer
0      Derwent      Electorate              -42.813              146.423             TAS    TLCMap
1      Derwent       homestead           -30.874125           150.650851             NSW    TLCMap
2      Derwent       homestead  -23.166666666666668               132.15              NT    TLCMap
3    

4        earth           track   49.7933447          13.8667052          Česko       OSM
                   LocationName        Category     Latitude           Longitude          PartOf Gazetteer
0  South East Constituency 1947       political   53.3262506  -6.236059861357543  Éire / Ireland       OSM
1           South-East District  administrative  -24.9902332  25.726402352052464        Botswana       OSM
2        Département du Sud-Est  administrative   18.2973566         -72.3745698           Ayiti       OSM
3                    South East     residential    40.006443          -77.113805   United States       OSM
4                       Sud-Est  administrative  44.97028655  28.282042663820434         România       OSM
  LocationName     Category            Latitude           Longitude                  PartOf Gazetteer
0     Ladybird         gift  50.233202399999996  -5.228724001225023          United Kingdom       OSM
1     Ladybird          bar          51.5360577          -0.10387

Now that the candidates are sorted, the best match can be selected. The most obvious choice is the highest ranked candidate.

In [634]:
for p in geolocdata:
    placename = p['placename']
    locations = []
    # Already have a best match
    if len(p['locations']['best_match'])!=0:
        locations=p['locations']['best_match']
    else:
        if p['locations']['candidates']!= None and len(p['locations']['candidates'])>0:
            # Presume the best match has the top rank
            topcand = p['locations']['candidates'][0]
            p['locations']['best_match'] = topcand.loc[:, ~topcand.columns.isin(['Gazetteer','RankFlags'])]

In [635]:
# Outputting without the RankFlags column
matcheddata = [p for p in geolocdata 
             if (len(p['locations']['best_match'])!=0 or 
                 (p['locations']['candidates'] is not None and 
                  len(p['locations']['candidates'])>0))]
alllocations=[]
for p in matcheddata:
    placename = p['placename']
    locations = []
    if len(p['locations']['best_match'])!=0:
        locations=p['locations']['best_match']
    else:
        # This shouldn't run because all entries with matches should now have best matches
        if p['locations']['candidates']!= None:
            short_candidates = []
            for c in p['locations']['candidates']:
                short_candidates = short_candidates + [c.loc[:, c.columns != 'RankFlags']]
            locations = pd.concat(short_candidates,ignore_index=True) 
    #print("===> ",placename)
    print(locations)


        LocationName Category     Latitude           Longitude          PartOf
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom
  LocationName      Category            Latitude          Longitude PartOf
0        Heath  trig station  -32.13166666666667  149.3011111111111    NSW
    LocationName Category  Latitude  Longitude          PartOf
273       London  Capital -0.083333       51.5  United Kingdom
  LocationName        Category    Latitude    Longitude     PartOf
0     Tasmania  administrative  -42.035067  146.6366887  Australia
  LocationName Category     Latitude            Longitude          PartOf
0      Vickers  primary  51.59845985  -1.7415535311162076  United Kingdom
  LocationName Category Latitude           Longitude PartOf
0       Sylvia   parish   -20.85  143.81666666666666    QLD
  LocationName             Category   Latitude  Longitude PartOf
0         Bath  locality unbounded)  -34.83706  138.49131     SA
  LocationName Category    Latitud

Unfortunately, the top-ranked candidate may still not be the best. The second-ranked candidate may carry exactly the same rank flags as the top one, and sit lower only because of the earlier sort on latitude and longitude values (which was done to find matching candidates). For this reason, the final stage of the processing is left to the user: the notebook shows the best candidate location, but allows the user to select an alternative candidate.
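One way to see when this matters is to list the candidates that tie with the top one. The sketch below assumes the rank flags have already been read from the sorted candidates into a list of strings; `top_ties` is a hypothetical helper and not part of the notebook.

```python
def top_ties(sorted_flags):
    """Return the indices of candidates that share the first candidate's flags."""
    if not sorted_flags:
        return []
    top = sorted_flags[0]
    return [i for i, flags in enumerate(sorted_flags) if flags == top]

# Flags of the sorted 'Heath' candidates shown earlier:
# two in Australia, one in Britain, four elsewhere
heath_flags = ['Australia', 'Australia', 'Britain', '', '', '', '']
print(top_ties(heath_flags))
# [0, 1]
```

For 'Heath', the NSW and QLD candidates tie, so only the user can decide which of the two is meant.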

There are several ways to build the user interface. One option is very verbose: it shows every candidate location for every placename, and the user ticks a checkbox next to the location they think is the best match.

Alternatively, the candidates could be presented as drop-down accordions.
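Whichever layout is chosen, each candidate needs a one-line label that the user can read. The sketch below shows one way to build those labels, assuming each candidate is a plain dict (in the notebook each candidate is a one-row DataFrame, so the fields would be read with `.item()`). The helpers `candidate_label` and `build_options` are hypothetical names.

```python
def candidate_label(candidate):
    """Format a candidate as 'Name, PartOf (lat,long)'.

    str.format() converts every value to text, so the coordinates
    may be stored as strings or as floats.
    """
    return "{}, {} ({},{})".format(candidate['LocationName'],
                                   candidate['PartOf'],
                                   candidate['Latitude'],
                                   candidate['Longitude'])

def build_options(candidates):
    """Return one label per candidate, plus an opt-out choice at the end."""
    return [candidate_label(c) for c in candidates] + ["None of the above"]

# Simplified copy of the single candidate for 'the Bay of Biscay'
candidates = [{'LocationName': 'The Bay of Biscay',
               'PartOf': 'United Kingdom',
               'Latitude': '52.95425415',
               'Longitude': '-1.159789836905746'}]
for option in build_options(candidates):
    print(option)
```

A list like this can be passed as the `options` of a selection widget. Because "None of the above" is always last, its index equals the number of candidates.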

In [758]:
# individually for each placename
for p in geolocdata:
    placename = p['placename']
    #print("Adding ",placename)

    # If there isn't a bestmatch, then there aren't any candidates
    if len(p['locations']['best_match'])!=0:
        
        best_match=p['locations']['best_match']
        best_match_index = 0
        
        # Must be at least one candidate
        if ('candidates' in p['locations'].keys() and 
            p['locations']['candidates']!= None and 
            len(p['locations']['candidates'])>=1):

            # Create a RadioButton for each candidate location
            candidates = p['locations']['candidates']
            candidate_buttons = [widgets.RadioButtons(layout={'width':'max-content'},
                                                    options = [c['LocationName'].item()+
                                                    ", "+
                                                    c['PartOf'].item()+
                                                    " ("+
                                                    str(c['Latitude'].item())+
                                                    ","+
                                                    str(c['Longitude'].item())+
                                                    ")" 
                                                    for c in candidates])]
            # Add a button for "None of the above"
            candidate_buttons[0].options = candidate_buttons[0].options+tuple(["None of the above"])
            p["locations"]["candidatebuttons"] = candidate_buttons
    
            # Select the button for the current bestmatch
            p['locations']['candidatebuttons'][0].index = best_match_index
    
            # Create a HBox for the Placename 
            placename_box = widgets.HBox([widgets.Label("'"+
                                                p['placename']+
                                                "' matches best with: "+
                                                best_match['LocationName'].item()+ ", "+
                                                best_match['PartOf'].item()+" ("+
                                                str(best_match['Latitude'].item())+ ", "+
                                                str(best_match['Longitude'].item())+ ")"
                                                )])
    
            # Create an Accordion (in a HBox) for the candidates 
            new_accordion = widgets.Accordion()
            new_accordion.set_title(0,"Expand to choose a different location")
            new_accordion.children = [widgets.HBox(p["locations"]["candidatebuttons"])]
            new_accordion.selected_index=None # close the accordion at startup

            # Put the HBoxes together in a VBox
            whole_box = widgets.VBox([placename_box,
                              new_accordion
                              ])
    
            # Show it all!
            # Note: this doesn't close one accordion if another is opened because they are in separate VBoxes
            display(whole_box)

VBox(children=(HBox(children=(Label(value="'the Bay of Biscay' matches best with: No suitable location selecte…

VBox(children=(HBox(children=(Label(value="'Heath' matches best with: Heath, NSW (-32.13166666666667, 149.3011…

VBox(children=(HBox(children=(Label(value="'Van Diemen's Land' matches best with: Tasmania, Australia (-42.035…

VBox(children=(HBox(children=(Label(value="'Vickers' matches best with: Vickers, United Kingdom (51.59845985, …

VBox(children=(HBox(children=(Label(value="'Sylvia' matches best with: Sylvia, QLD (-20.85, 143.81666666666666…

VBox(children=(HBox(children=(Label(value="'Bath' matches best with: Bath, SA (-34.83706, 138.49131)"),)), Acc…

VBox(children=(HBox(children=(Label(value="'Frere' matches best with: Frere, Italia (44.4701528, 7.004274)"),)…

VBox(children=(HBox(children=(Label(value="'Chatham' matches best with: Chatham, VIC (-37.82402802, 145.089035…

VBox(children=(HBox(children=(Label(value="'CHAPTER II' matches best with: No suitable location selected, No s…

VBox(children=(HBox(children=(Label(value="'Surgeon Pine' matches best with: Surgeon's Kitchen, Norfolk Island…

VBox(children=(HBox(children=(Label(value="'Coromandel' matches best with: Coromandel, SA (-35.0251942, 138.61…

VBox(children=(HBox(children=(Label(value="'Pine' matches best with: Pine, SA (-36.0, 135.0)"),)), Accordion(c…

VBox(children=(HBox(children=(Label(value="'Three'll' matches best with: Three, United States (43.9445278, -71…

VBox(children=(HBox(children=(Label(value="'Van Diemen's' matches best with: Van Gogh Court, TAS (-41.3799529,…

VBox(children=(HBox(children=(Label(value="'Tasman' matches best with: Tasman, TAS (-43.054, 147.834)"),)), Ac…

VBox(children=(HBox(children=(Label(value="'Cape Pillar' matches best with: Cape Pillar, TAS (-43.18000031, 14…

VBox(children=(HBox(children=(Label(value="'Pirates' Bay' matches best with: Pirates Bay, TAS (-43.021334, 147…

VBox(children=(HBox(children=(Label(value="'east' matches best with: East, WA (-30.59869, 122.5848)"),)), Acco…

VBox(children=(HBox(children=(Label(value="'west' matches best with: Western, Papua Niugini (-7.5, 142)"),)), …

VBox(children=(HBox(children=(Label(value="'the Isle of Wight' matches best with: Isle of Wight, United Kingdo…

VBox(children=(HBox(children=(Label(value="'the South-West Cape' matches best with: MOSAIC Lagoon Cafe in the …

VBox(children=(HBox(children=(Label(value="'Swan Port' matches best with: Swan, Mauritius (-20.163587, 57.5032…

VBox(children=(HBox(children=(Label(value="'Mediterranean' matches best with: Mediterranean, United States (29…

VBox(children=(HBox(children=(Label(value="'Maria Island' matches best with: Maria Island, TAS (-42.61999893, …

VBox(children=(HBox(children=(Label(value="'Peninsula' matches best with: Peninsula, WA (-33.92686, 116.0693)"…

VBox(children=(HBox(children=(Label(value="'Storm Bay' matches best with: Storm Bay, TAS (-43.16999817, 147.44…

VBox(children=(HBox(children=(Label(value="'Sorrell' matches best with: Sorrell, United Kingdom (53.9614506000…

VBox(children=(HBox(children=(Label(value="'Bruny Island' matches best with: Bruny Island, TAS (-43.29000092, …

VBox(children=(HBox(children=(Label(value="'Mount Royal' matches best with: Mount Royal, NSW (-32.213611111111…

VBox(children=(HBox(children=(Label(value="'D'Entrecasteaux Channel' matches best with: D'Entrecasteaux Channe…

VBox(children=(HBox(children=(Label(value="'Actaeon' matches best with: Tweedelig hert, Nederland (51.9864345,…

VBox(children=(HBox(children=(Label(value="'the South Cape' matches best with: Cape, United Kingdom (52.284084…

VBox(children=(HBox(children=(Label(value="'New Norfolk' matches best with: New Norfolk, TAS (-42.77999878, 14…

VBox(children=(HBox(children=(Label(value="'Derwent' matches best with: Derwent, TAS (-42.813, 146.423)"),)), …

VBox(children=(HBox(children=(Label(value="'the Southern Ocean' matches best with: ocean, Ελλάς (37.1034349, 2…

VBox(children=(HBox(children=(Label(value="'Tamar' matches best with: Tamar, NSW (-35.815, 144.56777777777776)…

VBox(children=(HBox(children=(Label(value="'Port Philip Bay' matches best with: Yarra Bay Bicentennial Park, N…

VBox(children=(HBox(children=(Label(value="'Dromedary' matches best with: Dromedary, TAS (-42.74000168, 147.16…

VBox(children=(HBox(children=(Label(value="'Mount Wellington' matches best with: Mount Wellington, TAS (-42.88…

VBox(children=(HBox(children=(Label(value="'Launceston' matches best with: Launceston, TAS (-41.361, 147.304)"…

VBox(children=(HBox(children=(Label(value="'Smyrna' matches best with: Smyrna, United States (33.883887, -84.5…

VBox(children=(HBox(children=(Label(value="'Pyramid Island' matches best with: Pyramid Island, TAS (-42.590000…

VBox(children=(HBox(children=(Label(value="'Rocky Point' matches best with: Rocky Point, TAS (-41.97000122, 14…

VBox(children=(HBox(children=(Label(value="'Port Davey' matches best with: Port Davey, TAS (-43.33000183, 145.…

VBox(children=(HBox(children=(Label(value="'Mount Direction' matches best with: Mount Direction, TAS (-42.7900…

VBox(children=(HBox(children=(Label(value="'Macquarie Harbour' matches best with: Macquarie Harbour, TAS (-42.…

VBox(children=(HBox(children=(Label(value="'Mount Heemskirk' matches best with: Mount Heemskirk, TAS (-41.8499…

VBox(children=(HBox(children=(Label(value="'Mount Zeehan' matches best with: Mount Zeehan, TAS (-41.91999817, …

VBox(children=(HBox(children=(Label(value="'King's River' matches best with: King's River, Éire / Ireland (52.…

VBox(children=(HBox(children=(Label(value="'Sarah Island' matches best with: Sarah Island, TAS (-42.38000107, …

VBox(children=(HBox(children=(Label(value="'Philip's Island' matches best with: St. Philip's Lane, United King…

VBox(children=(HBox(children=(Label(value="'Hobart Town' matches best with: Village of Hobart, United States (…

VBox(children=(HBox(children=(Label(value="'earth' matches best with: earth, Česko (49.5904859, 14.6651358)"),…

VBox(children=(HBox(children=(Label(value="'south-east' matches best with: South East Constituency 1947, Éire …

VBox(children=(HBox(children=(Label(value="'Ladybird' matches best with: Ladybird, United Kingdom (50.23320239…

VBox(children=(HBox(children=(Label(value="'Port Arthur' matches best with: Port Arthur, TAS (-43.13999939, 14…

VBox(children=(HBox(children=(Label(value="'Arthur' matches best with: Arthur, NSW (-32.363055555555555, 150.8…

VBox(children=(HBox(children=(Label(value="'Hells Gates' matches best with: Hells Gates, TAS (-42.20999908, 14…

VBox(children=(HBox(children=(Label(value="'England' matches best with: England, QLD (-27.433333333333334, 152…

VBox(children=(HBox(children=(Label(value="'New Town' matches best with: New Town, TAS (-42.84999847, 147.3000…

VBox(children=(HBox(children=(Label(value="'Grummet Island' matches best with: Grummet Island, TAS (-42.380001…

VBox(children=(HBox(children=(Label(value="'Grummet' matches best with: grummet, Deutschland (50.4723087, 11.9…

VBox(children=(HBox(children=(Label(value="'Malabar' matches best with: Malabar, NSW (-35.564599, 148.3241805)…

VBox(children=(HBox(children=(Label(value="'Dawes' matches best with: Dawes, VIC (-37.81597137, 146.4990234)")…

VBox(children=(HBox(children=(Label(value="'Sydney' matches best with: Sydney, NSW (-33.865, 151.2094444444444…

Record the selected candidates as the best matches. Account for placenames without any best match.
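The decision being recorded can be sketched on its own: an index that points at a candidate selects that candidate, and any other index (the "None of the above" option) yields a placeholder. This is a simplified illustration that assumes candidates are plain dicts rather than one-row DataFrames; `resolve_selection` and `NO_SELECTION` are hypothetical names.

```python
# Placeholder used when the user picks "None of the above"
NO_SELECTION = {'LocationName': "No suitable location selected",
                'Category': "No suitable location selected",
                'Latitude': "",
                'Longitude': "",
                'PartOf': "No suitable location selected"}

def resolve_selection(candidates, selected_index):
    """Return the chosen candidate without bookkeeping columns, or the placeholder."""
    if selected_index is not None and 0 <= selected_index < len(candidates):
        chosen = candidates[selected_index]
        return {key: value for key, value in chosen.items()
                if key not in ('Gazetteer', 'RankFlags')}
    return dict(NO_SELECTION)

# Hypothetical, simplified candidate list for 'the Bay of Biscay'
candidates = [{'LocationName': 'The Bay of Biscay', 'Category': 'park',
               'Latitude': '52.95425415', 'Longitude': '-1.159789836905746',
               'PartOf': 'United Kingdom', 'Gazetteer': 'OSM', 'RankFlags': ''}]

print(resolve_selection(candidates, 0)['PartOf'])  # index 0 is the candidate
print(resolve_selection(candidates, 1)['PartOf'])  # index 1 is "None of the above"
```

Treating a missing selection (`None`) the same way as "None of the above" avoids an error when nothing is selected.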

In [645]:
for p in geolocdata:

    # If there isn't a bestmatch, then there aren't any candidates
    if len(p['locations']['best_match'])!=0:
        
        best_match_index = 0

        # Must be at least one candidate
        if ('candidatebuttons' in p['locations'].keys()):

            # Extract the selection
            best_match_index = p['locations']['candidatebuttons'][0].index
            # Make sure the selection is a location (index is None if nothing is selected)
            if (best_match_index is not None and 
                best_match_index < len(p['locations']['candidates'])):
                # Record the best match
                topcand = p['locations']['candidates'][best_match_index]
                # Just record the important columns for the best match
                p['locations']['best_match'] = topcand.loc[:, ~topcand.columns.isin(['Gazetteer','RankFlags'])]
            else:
                # None of the above
                best_match = {'LocationName':"No suitable location selected",
                              'Category':"No suitable location selected",
                              'Latitude':"",
                              'Longitude': "",
                              'PartOf':"No suitable location selected"}
                p['locations']['best_match']= pd.DataFrame([best_match])
    else:
        # Note that there is no best match
        best_match = {'LocationName':"No location matched",
                      'Category':"No location matched",
                      'Latitude':"",
                      'Longitude': "",
                      'PartOf':"No location matched"}
        p['locations']['best_match']= pd.DataFrame([best_match])

            
    

Prepare the final geoloc data for output to file.

In [646]:
pprint(geolocdata[0:10])


[{'locations': {'best_match':           LocationName             Category Latitude Longitude               PartOf
0  No location matched  No location matched                     No location matched,
                'candidates': []},
  'placename': 'the sleepy sea'},
 {'locations': {'best_match':                     LocationName                       Category Latitude Longitude                         PartOf
0  No suitable location selected  No suitable location selected                     No suitable location selected,
                'candidatebuttons': [RadioButtons(index=1, layout=Layout(width='max-content'), options=('The Bay of Biscay, United Kingdom (52.95425415,-1.159789836905746)', 'None of the above'), value='None of the above')],
                'candidates': [        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM]},
  'placename': 'the Bay of Biscay'},

In [647]:
# Verbose output
alllocations=[] # the processed data
for p in geolocdata:
    placename = p['placename']
    pprint(placename)

    # Reformat the data about the best match
    bestlocations = []
    # All placenames should now have a best_match
    if len(p['locations']['best_match'])!=0:
        for c in p['locations']['best_match']:
            # Keep every column value, including empty strings, so the positions stay aligned
            bestlocations = bestlocations+[p['locations']['best_match'][c].item()]

    # Reformat the data about the candidate locations
    locations = []
    if ('candidates' in p['locations'].keys() and 
        p['locations']['candidates'] is not None and 
        len(p['locations']['candidates'])>0):
        short_candidates = []
        for c in p['locations']['candidates']:
            # select which columns to output
            short_candidates = short_candidates + [c.loc[:, c.columns != 'RankFlags']]
        # merge the dataframe values into a more human-readable format for now
        # though this is not a comma-separated format
        locations = pd.concat(short_candidates,ignore_index=True) 

    # Put this all together
    newrecord = [['placename',placename],
                ['best_match',bestlocations],
                ['candidates',locations]]
    alllocations = alllocations + [newrecord]

'the sleepy sea'
'the Bay of Biscay'
'Heath'
'London'
"Van Diemen's Land"
'Vickers'
'Sylvia'
'Bath'
"Julia Vickers's"
'Frere'
'Chatham'
'CHAPTER II'
'Surgeon Pine'
'Coromandel'
'Pine'
'India'
'the Hydaspes for Calcutta'
'the poop guard'
'MONOTONY'
"Three'll"
"Van Diemen's"
'Tasman'
'Cape Pillar'
"Pirates' Bay"
'east'
'west'
'the Isle of Wight'
'the South-West Cape'
'Swan Port'
'Mediterranean'
'Maria Island'
'the Three Thumbs'
'Peninsula'
'Storm Bay'
'Storing Island'
'Italy'
'Sorrell'
'Bruny Island'
'Mount Royal'
"D'Entrecasteaux Channel"
'Actaeon'
'the South Cape'
'New Norfolk'
'Derwent'
'the Southern Ocean'
'Tamar'
'Victoria'
'Port Philip Bay'
'Wellington'
'Dromedary'
'Mount Wellington'
'Launceston'
'Smyrna'
'Pyramid Island'
'Rocky Point'
'Port Davey'
'Mount Direction'
'Macquarie Harbour'
'Mount Heemskirk'
'Mount Zeehan'
"King's River"
'Sarah Island'
"Philip's Island"
'Hobart Town'
'earth'
'south-east'
'Ladybird'
'Commandant'
'Port Arthur'
'Honduras'
'Arthur'
'Hells Gates'
'England'
'

In [648]:
for p in alllocations[0:10]:
    pprint(p)

[['placename', 'the sleepy sea'],
 ['best_match',
  ['No location matched',
   'No location matched',
   '',
   '',
   'No location matched']],
 ['candidates', []]]
[['placename', 'the Bay of Biscay'],
 ['best_match',
  ['No suitable location selected',
   'No suitable location selected',
   '',
   '',
   'No suitable location selected']],
 ['candidates',
          LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM]]
[['placename', 'Heath'],
 ['best_match',
  ['Heath', 'trig station', '-32.13166666666667', '149.3011111111111', 'NSW']],
 ['candidates',
    LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
1        Heath          parish               -19.6              143.6             QLD    TLCMap
2        Heath            site   

In [649]:
# Save all the geolocdata as a messy combination of array rows and dataframes
filename = "FtToHNL_matchedlocations.data"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving to location data to ",save_location)

# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           geolocdata,
           delimiter =", ", 
           fmt ='%s')

Saving to location data to  ../ner_output/FtToHNL_matchedlocations.data
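If a file that can be read back reliably is needed, one alternative is to convert each entry to plain dicts and lists and write it as JSON. The sketch below is an assumption-laden illustration, not the notebook's method: it assumes every `best_match` is already a DataFrame (as is the case after the selection step above), it leaves out the widget objects stored under `candidatebuttons`, and `to_serialisable` is a hypothetical helper.

```python
import json
import pandas as pd

def to_serialisable(entry):
    """Convert one geolocdata-style entry into plain dicts and lists."""
    locations = entry['locations']
    candidates = locations.get('candidates') or []
    return {
        'placename': entry['placename'],
        # one dict per DataFrame row
        'best_match': locations['best_match'].to_dict(orient='records'),
        'candidates': [row for c in candidates
                       for row in c.to_dict(orient='records')],
    }

# Hypothetical entry shaped like the notebook's data
entry = {
    'placename': 'Heath',
    'locations': {
        'best_match': pd.DataFrame([{'LocationName': 'Heath',
                                     'Category': 'trig station',
                                     'Latitude': '-32.13166666666667',
                                     'Longitude': '149.3011111111111',
                                     'PartOf': 'NSW'}]),
        'candidates': [pd.DataFrame([{'LocationName': 'Heath',
                                      'Category': 'trig station',
                                      'Latitude': '-32.13166666666667',
                                      'Longitude': '149.3011111111111',
                                      'PartOf': 'NSW',
                                      'Gazetteer': 'TLCMap'}])],
    },
}

text = json.dumps(to_serialisable(entry), ensure_ascii=False, indent=2)
restored = json.loads(text)
print(restored['best_match'][0]['PartOf'])
```

Passing `ensure_ascii=False` keeps non-ASCII location names such as 'Éire / Ireland' readable in the saved text.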


In [650]:
# Final output
# Only use selected columns from the best match in csv format

# Make a new array of records
geoloc_output=[['Placename','PartOf','Latitude','Longitude']]
for p in geolocdata:
    # Only output those placenames with a best match location
    if len(p['locations']['best_match'])!=0:
        placename = p['placename']
        location = p['locations']['best_match']
        
        # Convert Lat/Long from strings to floats; keep "" where there is no value
        if isinstance(location['Latitude'], float):
            # a plain Python float has no .item() method, so use it directly
            latitude = location['Latitude']
        elif (location['Latitude'].item()!=""):
            latitude = float(location['Latitude'].item())
        else:
            latitude = ""
        if isinstance(location['Longitude'], float):
            # a plain Python float has no .item() method, so use it directly
            longitude = location['Longitude']
        elif (location['Longitude'].item()!=""):
            longitude = float(location['Longitude'].item())
        else:
            longitude = ""
            
        # add this record to the list
        geoloc_output.append([placename,
                              location['PartOf'].item(),
                              latitude, 
                              longitude])
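
The latitude and longitude conversions in the cell above follow the same pattern, so they could be folded into one small helper. The sketch below is only an illustration: `to_float_or_blank` is a hypothetical name that does not appear elsewhere in this notebook, and it assumes that a missing coordinate is stored as an empty string, as in the output shown earlier.

```python
def to_float_or_blank(value):
    """Return value as a float, or "" when no coordinate is available.

    Hypothetical helper, not part of the original notebook.
    """
    # Single-element containers (for example a one-row pandas Series or a
    # NumPy scalar) expose .item(); plain str and float values do not.
    if hasattr(value, "item"):
        value = value.item()
    # Assumption: unmatched placenames carry "" (or None) as their coordinate.
    if value is None or value == "":
        return ""
    return float(value)


# Example values in the same shapes the gazetteer data uses
example_latitude = to_float_or_blank("-32.13166666666667")
example_longitude = to_float_or_blank(149.3011111111111)
example_missing = to_float_or_blank("")
```

With a helper like this, each record would need only `to_float_or_blank(location['Latitude'])` and `to_float_or_blank(location['Longitude'])`.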

In [651]:
geoloc_output

[['Placename', 'PartOf', 'Latitude', 'Longitude'],
 ['the sleepy sea', 'No location matched', '', ''],
 ['the Bay of Biscay', 'No suitable location selected', '', ''],
 ['Heath', 'NSW', -32.13166666666667, 149.3011111111111],
 ['London', 'United Kingdom', -0.083333, 51.5],
 ["Van Diemen's Land", 'Australia', -42.035067, 146.6366887],
 ['Vickers', 'United Kingdom', 51.59845985, -1.7415535311162076],
 ['Sylvia', 'QLD', -20.85, 143.81666666666666],
 ['Bath', 'SA', -34.83706, 138.49131],
 ["Julia Vickers's", 'No location matched', '', ''],
 ['Frere', 'Italia', 44.4701528, 7.004274],
 ['Chatham', 'VIC', -37.82402802, 145.089035],
 ['CHAPTER II', 'No suitable location selected', '', ''],
 ['Surgeon Pine', 'Norfolk Island', -29.0569188, 167.9554764],
 ['Coromandel', 'SA', -35.0251942, 138.6143361],
 ['Pine', 'SA', -36.0, 135.0],
 ['India', 'Asia', 77.2, 28.6],
 ['the Hydaspes for Calcutta', 'No location matched', '', ''],
 ['the poop guard', 'No location matched', '', ''],
 ['MONOTONY', 'No l

In [652]:
filename = "FtToHNL_geolocdata.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving to location data to ",save_location)

# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           geoloc_output,
           delimiter =", ", 
           fmt ='%s')

Saving to location data to  ../ner_output/FtToHNL_geolocdata.csv
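
`np.savetxt` joins the fields with the delimiter and does not quote them, so a placename that itself contains a comma would shift the columns when the file is read back. As a hedged alternative, the sketch below writes the same kind of rows with Python's standard `csv` module, which quotes such fields automatically. The function name `write_rows_as_csv`, the temporary output path and the row containing a comma are all assumptions made for illustration; they are not taken from the project data.

```python
import csv
import os
import tempfile

# A few rows in the same layout as geoloc_output
example_rows = [
    ["Placename", "PartOf", "Latitude", "Longitude"],
    ["Heath", "NSW", -32.13166666666667, 149.3011111111111],
    ["the sleepy sea", "No location matched", "", ""],
    # Hypothetical row, included only to show how embedded commas are handled
    ["Hobart, Tasmania", "Australia", "", ""],
]


def write_rows_as_csv(rows, path):
    """Write a list of rows to path as CSV, quoting fields where needed."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Write to a temporary directory so the sketch does not touch project files
example_path = os.path.join(tempfile.mkdtemp(), "geoloc_example.csv")
write_rows_as_csv(example_rows, example_path)

# Read the file back to check that every row still has four columns
with open(example_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))
```

In the notebook itself, the call would take `geoloc_output` and `save_location` in place of the example values.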
