# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Identifying Placenames in Text](#section-identifyingplacenames)
 * [Named Entity Recognition](#section-ner)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Most will come with your Python installation, but the following are crucial:
<ul>
    <li> pandas</li> 
    <li> json</li> 
    <li> nltk</li> 
    <li> geopandas</li> 
    <li> shapely  </li> 
</ul>
</div>

In [15]:
# [TO DO] UPDATE
# Many of these are probably not needed.

import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

#import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS, 
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
#import geopandas as gpd
#from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
#from nltk.tag import StanfordNERTagger
#from nltk.tokenize import word_tokenize

# spaCy is used for a pipeline of NLP functions
import spacy
from spacy.tokens import Span
from spacy import displacy

# Shapely is used to work with geometric shapes
#from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
#from fuzzywuzzy import fuzz

# used for the checklist
import ipywidgets as widgets

In [16]:
# imports for the OSM
# [TO DO] Work out what is still required
import requests
from IPython.display import JSON
import json
from pprint import pprint
from ratelimit import limits, RateLimitException, sleep_and_retry

In [17]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 115)

# Data preparation <a class="anchor" id="section-datapreparation"></a>

You also want to set up some directories for the import and output of data.

In [18]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
reference_directory = os.path.normpath("../Data")

## Create the data directories (no error is raised if they already exist)
for directory in (text_directory, csv_directory, reference_directory):
    os.makedirs(directory, exist_ok=True)


For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained via [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is an unformatted text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing any accented characters with their unaccented forms and the British pound sterling symbol with the word *pounds*.
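The simplification described above might be done along the following lines. This is only a sketch, not the script that produced the workshop file: the function name `simplify_to_ascii`, the sample sentence and the choice to write "pounds " (with a trailing space) in place of the symbol are all assumptions.

```python
import unicodedata

def simplify_to_ascii(raw_text, pound_word="pounds "):
    """Reduce text to plain ASCII (hypothetical helper for illustration)."""
    # Swap the pound sterling symbol for a word before any characters are dropped
    replaced = raw_text.replace("\u00a3", pound_word)
    # NFKD splits an accented letter into the base letter plus a combining accent
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Encoding to ASCII with errors="ignore" then discards the accents
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "The fianc\u00e9e paid \u00a35 at the caf\u00e9."
simplified = simplify_to_ascii(sample)
print(simplified)
```

Note that this approach silently drops any other character that has no ASCII decomposition, such as curly quotation marks, so it is worth checking a sample of the output by eye.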

The novel is divided into four books, each set in a different region of the world. You can start with part of the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [19]:
filename="FtToHNL_BOOK_2_CHAPTER_3.txt"
print("Working on | ", filename)

# set the specific path for the 'filename' 
text_location = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(text_location)

# read the file, closing it again once the text has been loaded
with open(text_location, encoding="utf-8") as text_file:
    text = text_file.read()

Working on |  FtToHNL_BOOK_2_CHAPTER_3.txt


This is no more than a long string of characters. So far, you have done no processing. 

In [20]:
text[0:500] # look at the first 500 characters

'CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\nIn the house of Major Vickers, Commandant of Macquarie Harbour,\nthere was, on this evening of December 3rd, unusual gaiety.\n\nLieutenant Maurice Frere, late in command at Maria Island, had unexpectedly\ncome down with news from head-quarters.  The Ladybird, Government schooner,\nvisited the settlement on ordinary occasions twice a year, and such visits\nwere looked forward to with no little eagerness by the settlers.\nTo the convicts the arrival of the Ladybird mean'

# Identifying Placenames in Text <a class="anchor" id="section-identifyingplacenames"></a>
*This section provides tools for identifying placenames in textual data.*

## Named Entity Recognition <a class="anchor" id="section-ner"></a>

The following part of this notebook uses the Named Entity Recognition tool provided in the spaCy Python package. For an introduction to this tool and how it can be used to find Named Entities like placenames in text, go to the [spaCy NER notebook](https://github.com/Australian-Text-Analytics-Platform/geolocation-tools-workshop/blob/7d92664ac44f86b90a0c098bb3159793a4fe6c16/Notebooks/spacy_ner_introduction.ipynb).

You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. They can all be collected in a single dataframe.

In [21]:
# Dataframe where we store the details about each instance of the placenames
placenames_df = pd.DataFrame(columns=['Book','Chapter','NEIndex','Placename','Category','Context','Approval'])

In [22]:
# Define which chapters and books you want to annotate
CHAPTERS=[1,2,3] 
BOOKS=[1,2]

Load a spaCy model for English.

In [23]:
nlp = spacy.load("en_core_web_sm")  # language model

You should define which spaCy processing steps you do or don't want in the pipeline. You mainly need the tokenizer and the NER component. Others, like the parser, slow the processing down.

In [24]:
disabled_pipeline=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]

Of course, not all of the NE categories are suitable for placenames, so you will need to make a list of the categories that regularly contain placenames.

__[TO DO] Expand on this explanation.__

In [118]:
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]
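As a rough guide to these labels in spaCy's English models: `GPE` covers geopolitical entities such as countries, cities and states; `LOC` covers other locations such as mountain ranges and bodies of water; `FAC` covers built facilities such as buildings, bridges and roads; and `ORG` covers organisations. `ORG` is presumably kept because spaCy can label a placename as an organisation, as happens with *Van Diemen's Land* in the output further on. The sketch below shows the filtering on a hand-made list; the (text, label) pairs are hypothetical stand-ins for spaCy's `doc.ents`.

```python
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]

# Hypothetical (text, label) pairs standing in for the entities spaCy would return
sample_entities = [
    ("Malabar", "GPE"),
    ("December 3rd", "DATE"),
    ("the Bay of Biscay", "LOC"),
    ("Maurice Frere", "PERSON"),
    ("Van Diemen's Land", "ORG"),
]

# Keep only the entities whose label is in the list of placename categories
candidates = [text for text, label in sample_entities if label in PLACENAME_CATEGORIES]
print(candidates)
```

Keeping `ORG` casts a wider net at the cost of more false positives, which is why the candidates are reviewed by hand later on.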

Now let's process FtToHNL.

In [26]:
i=0 # counter of the entities
for book in BOOKS:
    for chapter in CHAPTERS:
        filename = "FtToHNL_BOOK_"+str(book)+"_CHAPTER_"+str(chapter)+".txt" 
        # set the specific path for the 'filename'
        text_location = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(text_location)

        # read this chapter
        text = open(text_location, encoding="utf-8").read()
        print("Working on |",filename)
        
        # run spaCy    
        doc = nlp(text,disable=disabled_pipeline)

        # document level
        ents = [(entity.text, entity.start_char, entity.end_char, entity.label_) for entity in doc.ents]

        # token level
        for entity in doc.ents:
            if entity.label_ in PLACENAME_CATEGORIES: # filter out MONEY, DATE etc
                print("{:5}\t\t{:30s}\t{}".format(i+1,entity.text, entity.label_))
                # To help understand the context of the text, extract the occurrence
                # (clamp the start so a negative index does not wrap around to the end of the text)
                context_start = max(0, entity.start_char - 30)
                context_text = doc.text[context_start:entity.end_char + 30].replace("\n", " ")

                # Add the placenames according to spaCy
                new_placename = {'Book':book,             # The Book number
                                'Chapter':chapter,        # The Chapter number
                                'NEIndex':i,              # A reference number to the nth Named Entity 
                                'Placename':entity.text,  # The placename in the text
                                'Category':entity.label_, # The spaCy category
                                'Context':context_text,   # The textual context where the placename was found
                                'Approval':1}             # A flag for whether this is a suitable placename
                # DataFrame.append was removed in pandas 2.0, so concatenate a one-row dataframe instead
                placenames_df = pd.concat([placenames_df, pd.DataFrame([new_placename])], ignore_index=True)
                
            i=i+1 # entity counter

Working on | FtToHNL_BOOK_1_CHAPTER_1.txt
    2		Malabar                       	GPE
    6		Crown                         	ORG
   10		Crown                         	ORG
   11		Crown                         	ORG
   15		the sleepy sea                	FAC
   17		the Bay of Biscay             	LOC
   22		Lord Bellasis                 	ORG
   23		Heath                         	ORG
   25		London                        	GPE
   38		Van Diemen's Land             	ORG
   44		Vickers                       	ORG
   46		Vickers                       	ORG
   47		Sylvia                        	ORG
   50		Bath                          	ORG
   53		Julia Vickers's               	ORG
   54		Frere                         	ORG
   58		Sylvia                        	GPE
   64		Chatham                       	FAC
   81		Sylvia                        	GPE
   93		Frere                         	ORG
   96		Sylvia                        	LOC
Working on | FtToHNL_BOOK_1_CHAPTER_2.txt
   97		CHAPTER II                 

__[TO DO] Save a copy of this data__

In [27]:
placenames_df[['Book','Chapter','NEIndex','Placename','Category']]

Unnamed: 0,Book,Chapter,NEIndex,Placename,Category
0,1,1,1,Malabar,GPE
1,1,1,5,Crown,ORG
2,1,1,9,Crown,ORG
3,1,1,10,Crown,ORG
4,1,1,14,the sleepy sea,FAC
5,1,1,16,the Bay of Biscay,LOC
6,1,1,21,Lord Bellasis,ORG
7,1,1,22,Heath,ORG
8,1,1,24,London,GPE
9,1,1,37,Van Diemen's Land,ORG


This shows that a lot more placenames were found in Book 2 than in Book 1. This makes sense since Book 1 is set on board a ship during an ocean voyage, whereas Book 2 is set at Macquarie Harbour in Australia.

## Reviewing Candidate Placenames <a class="anchor" id="section-reviewplacenames"></a>

However, there are a number of NEs that are unlikely to be placenames, regardless of how spaCy categorised them. It is best to consider the context in which the terms were used. Use the checkboxes to select which terms you do consider to be placenames.
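The context shown beside each checkbox is a window of characters on either side of the entity. A minimal sketch of that idea follows; the helper name `context_window` and its `width` parameter are illustrative, and the sample is the chapter opening shown earlier. The window is clamped to the bounds of the text because, in Python, a negative start index counts back from the end of the string.

```python
def context_window(text, start_char, end_char, width=30):
    """Return the entity plus up to `width` characters either side, on one line."""
    left = max(0, start_char - width)           # never step before the start of the text
    right = min(len(text), end_char + width)    # never step past the end of the text
    return text[left:right].replace("\n", " ")

sample = ("In the house of Major Vickers, Commandant of Macquarie Harbour,\n"
          "there was, on this evening of December 3rd, unusual gaiety.")

# Locate a placename by hand; spaCy would supply these offsets as
# entity.start_char and entity.end_char
start = sample.index("Macquarie Harbour")
end = start + len("Macquarie Harbour")

snippet = context_window(sample, start, end)
print(snippet)

# An entity at the very start of the text still gets a sensible window
opening = context_window(sample, 0, 6)
print(opening)
```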

__[TO DO] Expand on this explanation with example context.
Talk about the issue with _Van Diemen's_ versus _Van Diemen's Land_ (and _Tasman's Head_)__

In [28]:
def changed(b):
    k=b['new']

# lists of the data required for displaying the checkboxes
placename_items=[]
context_items=[]
num_items=[]

# Make a checkbox for every candidate placename
for book in BOOKS:
    for chapter in CHAPTERS:
        # Get the NEs from this book and chapter
        placenames_bookchapter=placenames_df[(placenames_df["Book"]==book) & 
                                             (placenames_df["Chapter"]==chapter)]

        # Make lists of the candidate placenames, context text and index numbers 
        # Only the placenames are given a checkbox. 
        placename_items += [widgets.Checkbox(True, description=name) for name in placenames_bookchapter["Placename"]]
        context_items += [widgets.Label(context) for context in placenames_bookchapter["Context"]]
        num_items += [widgets.Label(str(index)) for index in placenames_bookchapter["NEIndex"]]

# create a display
num_placenames=len(placename_items)
left_box = widgets.VBox(placename_items)
right_box = widgets.VBox(context_items)
num_box = widgets.VBox(num_items)
whole_box = widgets.HBox([num_box, left_box, right_box])
        
print("Unselect any Named Entities (NEs) that you do not consider to be placenames.")
print("Each instance of an NE is listed, with the textual context in which it appeared.")
display(whole_box)

for i in range(num_placenames):
    placename_items[i].observe(changed)

Unselect any Named Entities (NEs) that you do not consider to be placenames.
Each instance of an NE is listed, with the textual context in which it appeared.


HBox(children=(VBox(children=(Label(value='1'), Label(value='5'), Label(value='9'), Label(value='10'), Label(v…

You can now copy all the values from the checkboxes to the data, so you know which placenames you have approved.

In [30]:
# Transfer the status of each checklist item to the data
for i in range(num_placenames):

    NEIndex_num = int(num_items[i].value)
    approval_flag = placename_items[i].value
    
    # set the flag to match the checklist
    placenames_df.loc[placenames_df["NEIndex"] == NEIndex_num, "Approval"] = approval_flag

You can now visualise the result.

In [31]:
placenames_df[['NEIndex','Placename','Approval']]

Unnamed: 0,NEIndex,Placename,Approval
0,1,Malabar,False
1,5,Crown,False
2,9,Crown,False
3,10,Crown,False
4,14,the sleepy sea,True
5,16,the Bay of Biscay,True
6,21,Lord Bellasis,False
7,22,Heath,True
8,24,London,True
9,37,Van Diemen's Land,True


From this, you can extract the final list of distinct placenames that you have approved. While the names aren't sorted (though they could be), if you missed unselecting an NE on the checklist, this will help find it. All you need to do is go back to the checklist, unselect it, then run all other steps from there to here.

In [32]:
# Make a unique list of the approved placenames
approved_placenames = placenames_df[placenames_df["Approval"]==True]['Placename'].unique()
print(approved_placenames)

['the sleepy sea' 'the Bay of Biscay' 'Heath' 'London' "Van Diemen's Land"
 'Vickers' 'Sylvia' 'Bath' "Julia Vickers's" 'Frere' 'Chatham'
 'CHAPTER II' 'Surgeon Pine' 'Coromandel' 'Pine' 'India'
 'the Hydaspes for Calcutta' 'the poop guard' 'MONOTONY' "Three'll"
 "Van Diemen's" 'Tasman' 'Cape Pillar' "Pirates' Bay" 'east' 'west'
 'the Isle of Wight' 'the South-West Cape' 'Swan Port' 'Mediterranean'
 'Maria Island' 'the Three Thumbs' 'Peninsula' 'Storm Bay'
 'Storing Island' 'Italy' 'Sorrell' 'Bruny Island' 'Mount Royal'
 "D'Entrecasteaux Channel" 'Actaeon' 'the South Cape' 'New Norfolk'
 'Derwent' 'the Southern Ocean' 'Tamar' 'Victoria' 'Port Philip Bay'
 'Wellington' 'Dromedary' 'Mount Wellington' 'Launceston' 'Smyrna'
 'Pyramid Island' 'Rocky Point' 'Port Davey' 'Mount Direction'
 'Macquarie Harbour' 'Mount Heemskirk' 'Mount Zeehan' "King's River"
 'Sarah Island' "Philip's Island" 'Hobart Town' 'earth' 'south-east'
 'Ladybird' 'Commandant' 'Port Arthur' 'Honduras' 'Arthur' 'Hells Gat

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [33]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to", save_location)

Saving placename data to ../ner_output/FtToHNL_placenames.csv


In [34]:
# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           approved_placenames,
           delimiter =", ", 
           fmt ='% s')
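`np.savetxt` writes each name to the file exactly as it is, so a placename that itself contains a comma would be read back as two fields by a CSV reader. If that becomes a problem, one option is Python's `csv` module, which quotes such values. The sketch below writes to an in-memory buffer rather than to `save_location`, and the sample names (including the one with a comma) are hypothetical.

```python
import csv
import io

# Hypothetical placenames; the last one contains a comma
names = ["the Bay of Biscay", "Hobart Town", "Sorrell, Tasmania"]

# Write one name per row; csv.writer adds quotes wherever they are needed
buffer = io.StringIO()
writer = csv.writer(buffer)
for name in names:
    writer.writerow([name])

# Read the rows back and check that nothing was split or lost
buffer.seek(0)
restored = [row[0] for row in csv.reader(buffer)]
print(restored)
```

To write to disk instead, the same writer could be given a file opened with `open(save_location, "w", newline="", encoding="utf-8")`.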

## Finding Locations for the Placenames <a class="anchor" id="section-findinglocs"></a>

Now that you have a list of placenames from the text, the next step is to work out their location on Earth. For this you can use a combination of specialised lists of locations, gazetteers and heuristics. The objective is to match every placename with the coordinates of a known location.

The first step is to read the file of your placenames.

In [77]:
filename="FtToHNL_placenames.csv"
print("Working on | ", filename)

# set the specific path for the 'filename'
data_location = os.path.normpath(os.path.join(csv_directory, filename))
data_filename = os.path.basename(data_location)

# Using pandas, read the csv file. This will place it in a dataframe format. 
placenames_df = pd.read_csv(data_location, encoding="utf-8",header=None)

Working on |  FtToHNL_placenames.csv


In [78]:
placenames_df = placenames_df.rename(columns={placenames_df.columns[0]: 'Placename'})

In [79]:
placenames_df

Unnamed: 0,Placename
0,the sleepy sea
1,the Bay of Biscay
2,Heath
3,London
4,Van Diemen's Land
5,Vickers
6,Sylvia
7,Bath
8,Julia Vickers's
9,Frere


### Identifying States and Capitals <a class="anchor" id="section-statescapitals"></a>

Some placenames, like *High Street* or *Maryborough*, may be very common across the world, or even in Australia. However, certain placenames refer to significant locations, like states, territories, large geographic features or capital cities. As such, if they are mentioned in a text, the placename is more likely to refer to the major location than a town or village in Tasmania.

These significant locations are a finite set. They can be defined in a reference file that can be reused when reviewing the placenames of any text.

A good point for you to start is a file about locations like modern capital cities and countries, combined with historical locations of significance.

In [80]:
filename="reference_location_data.csv"
print("Working on | ", filename)

# set the specific path for the 'filename' 
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Working on |  reference_location_data.csv


The cell below reads the whole file into a dataframe in one step, although you could also process each line as you read it.

__[TO DO] Change this to a dictionary?__
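Processing each line as it is read, and storing the result in a dictionary as the note above suggests, could be sketched as follows. The sample rows copy the first few entries of the reference file shown below; the in-memory buffer stands in for the real file, and keeping the first entry when a name repeats is an assumption rather than a rule of the notebook.

```python
import csv
import io

# A few rows in the same layout as reference_location_data.csv
sample_csv = io.StringIO(
    "LocationName,Category,Latitude,Longitude,PartOf\n"
    "Melbourne,City,-37.814218,144.963161,VIC\n"
    "Brisbane,City,-27.468968,153.023499,QLD\n"
    "Perth,City,-31.955896,115.860580,WA\n"
)

location_lookup = {}
for row in csv.DictReader(sample_csv):
    # Convert the coordinates to numbers as each line is read
    row["Latitude"] = float(row["Latitude"])
    row["Longitude"] = float(row["Longitude"])
    # Keep the first entry if a name appears more than once (an assumption)
    location_lookup.setdefault(row["LocationName"], row)

print(location_lookup["Perth"]["PartOf"])
```

A dictionary keyed by location name makes each later lookup a single step, at the cost of the tabular display that a dataframe provides.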

In [81]:
# Place the reference data in a dataframe
locref_df = pd.read_csv(reference_location, encoding="utf-8", header=0)

In [82]:
locref_df

Unnamed: 0,LocationName,Category,Latitude,Longitude,PartOf
0,Melbourne,City,-37.814218,144.963161,VIC
1,Brisbane,City,-27.468968,153.023499,QLD
2,Perth,City,-31.955896,115.860580,WA
3,Darwin,City,-12.460440,130.841047,NT
4,Alice Springs,City,-23.698388,133.881289,NT
...,...,...,...,...,...
558,Zagreb,Capital,16.000000,45.800000,Croatia
559,Zambia,Country,28.283333,-15.416667,Africa
560,Zanzibar City,Capital,39.198914,-6.165193,Tanzania
561,Zimbabwe,Country,31.033333,-17.816667,Africa


As you can see, this allows some bias to be introduced into the data to suit your geolocation needs. For instance, _Perth_ is entered in this file as a city in the state of Western Australia, rather than one in Scotland. _Victoria_ is recorded as a state of Australia, rather than the capital of the Seychelles, or the capital of British Columbia, Canada.

__[TO DO] update this text chunk to suit the workshop__

Of course, if you are researching historical texts, then some of these contemporary locations may have had different names. Old New York was once New Amsterdam (and had the [nickname of Gotham](https://www.nypl.org/blog/2011/01/25/so-why-do-we-call-it-gotham-anyway), amongst others). Istanbul was Constantinople. Some locations had [romanized names](https://en.wikipedia.org/wiki/Chinese_postal_romanization), like Beijing being called Peking. They may be a long time gone but you might want to add them to the list of significant known locations.
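One simple way to handle such renamings is a table of aliases that maps each historical name to its modern equivalent before the reference data is searched. The sketch below uses only the examples from the paragraph above; `HISTORICAL_ALIASES` and `modern_name` are illustrative names, not part of the workshop code.

```python
# Historical name -> modern name, using the examples given in the text
HISTORICAL_ALIASES = {
    "New Amsterdam": "New York",
    "Constantinople": "Istanbul",
    "Peking": "Beijing",
}

def modern_name(placename, aliases=HISTORICAL_ALIASES):
    """Return the modern name for a historical placename, or the name unchanged."""
    return aliases.get(placename, placename)

for name in ["Peking", "Constantinople", "London"]:
    print(name, "->", modern_name(name))
```

The alternative is to add the historical names as rows of their own in the reference file, which keeps everything in one place but duplicates the coordinates.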

Another historical variant is a change in which city is the capital. This may be due to a political decision, like the movement of the Australian parliament from Melbourne to the new city of Canberra, or it may be a consequence of war, like Bonn becoming the capital of West Germany after World War II. These older capitals may also have to be accommodated in your reference data.

Because FtToHNL is set in the 19th Century CE, the next step is to add various capital cities from then.

There are also larger geopolitical regions that may have been associated with placenames and cultures, for instance empires, dynasties and colonies like the British Empire or the Zulu Kingdom. Again, the borders and applicability of these political entities changed over time, so a contemporary reference list may not include them. 

The 19th Century CE was a time of many European empires, so for FtToHNL you will need to add reference data associated with the relevant entities.

When processing this reference file, you can add the old political entity, its capital (if known), the geographic region (like continent or part thereof) and the modern country it would be considered part of.  

The next step is to see if any of the placenames from our selected chapters of FtToHNL match these locations.

__[TO DO] Describe this without being technical__

If we match a placename, copy the geolocation data for the matching location. Otherwise, keep it empty so we know to keep looking for the placename.
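In outline, the matching step looks like the sketch below. It uses a plain dictionary in place of the reference dataframe, and `link_placenames` is a hypothetical helper; the two reference entries reuse values that appear in the notebook's own output.

```python
# Hypothetical reference entries keyed by location name
reference = {
    "London": {"Category": "Capital", "PartOf": "United Kingdom"},
    "India": {"Category": "Country", "PartOf": "Asia"},
}

def link_placenames(placenames, reference):
    """Pair each placename with its reference entry, or None if there is no exact match."""
    return [{"placename": name, "best_match": reference.get(name)} for name in placenames]

linked = link_placenames(["the Bay of Biscay", "London", "Heath", "India"], reference)

# Separate the placenames that were matched from those that still need a location
found = [entry["placename"] for entry in linked if entry["best_match"] is not None]
still_looking = [entry["placename"] for entry in linked if entry["best_match"] is None]
print("Found:", found)
print("Still looking for:", still_looking)
```

Because the comparison is exact, a placename such as *the Bay of Biscay* will only match if the reference entry is spelt the same way, including the leading article.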

In [83]:

geolocdata = [] # all the data about placenames and locations, once linked, as a list of dataframes

for placename in placenames_df['Placename']:
    
    # create a new geoloc entry about this placename
    # [TO DO] Formally declare this as a dataframe?  Though this might make getting values awkward.
    new_geolocdata={} 
    # start a record for a placename
    new_geolocdata['placename'] = placename
    new_geolocdata['locations'] = {} # start with no location details
    new_geolocdata['locations']['best_match'] = [] # start with no matching location
    
    # Match found in the reference data
    if(placename in list(locref_df['LocationName'])):
        
        print("*** Found", placename)
        # Copy the details from the reference file entry
        new_geolocdata['locations']['best_match'] = locref_df[locref_df['LocationName']==placename]
        
    else:
        print("Still looking for", placename)
    
    geolocdata.append(new_geolocdata) # add the new placename data to the list

Still looking for the sleepy sea
Still looking for the Bay of Biscay
Still looking for Heath
*** Found London
Still looking for Van Diemen's Land
Still looking for Vickers
Still looking for Sylvia
Still looking for Bath
Still looking for Julia Vickers's
Still looking for Frere
Still looking for Chatham
Still looking for CHAPTER II
Still looking for Surgeon Pine
Still looking for Coromandel
Still looking for Pine
*** Found India
Still looking for the Hydaspes for Calcutta
Still looking for the poop guard
Still looking for MONOTONY
Still looking for Three'll
Still looking for Van Diemen's
Still looking for Tasman
Still looking for Cape Pillar
Still looking for Pirates' Bay
Still looking for east
Still looking for west
Still looking for the Isle of Wight
Still looking for the South-West Cape
Still looking for Swan Port
Still looking for Mediterranean
Still looking for Maria Island
Still looking for the Three Thumbs
Still looking for Peninsula
Still looking for Storm Bay
Still looking for 

Check that you have recorded the matches (and mismatches).

In [84]:
geolocdata[:10]

[{'placename': 'the sleepy sea', 'locations': {'best_match': []}},
 {'placename': 'the Bay of Biscay', 'locations': {'best_match': []}},
 {'placename': 'Heath', 'locations': {'best_match': []}},
 {'placename': 'London',
  'locations': {'best_match':     LocationName Category  Latitude  Longitude          PartOf
   282       London  Capital -0.083333       51.5  United Kingdom}},
 {'placename': "Van Diemen's Land", 'locations': {'best_match': []}},
 {'placename': 'Vickers', 'locations': {'best_match': []}},
 {'placename': 'Sylvia', 'locations': {'best_match': []}},
 {'placename': 'Bath', 'locations': {'best_match': []}},
 {'placename': "Julia Vickers's", 'locations': {'best_match': []}},
 {'placename': 'Frere', 'locations': {'best_match': []}}]

What locations did you end up finding?

In [85]:
matchdata = [place['locations']['best_match'].to_string(index=False,header=False) for place in geolocdata 
             if len(place['locations']['best_match'])>0]
matchdata

['London Capital -0.083333 51.5 United Kingdom',
 'India Country 77.2 28.6 Asia',
 'Italy Country 12.483333 41.9 Europe',
 'Victoria State -37.814218 144.963161 Australia',
 'Wellington Capital 174.783333 -41.3 New Zealand',
 'Hobart Town City -42.893851 147.272086 TAS',
 'Honduras Country -87.216667 14.1 Central America',
 'Sydney City -33.869844 151.208285 NSW']

We can now forget about the dataframe with the complete set of reference data.

In [86]:
del locref_df

### Searching a Gazetteer for Locations <a class="anchor" id="section-searchgazetteer"></a>

Search OpenStreetMap (OSM), using its [Nominatim search API](https://nominatim.org/release-docs/develop/api/Search/), for locations that match the unknown placenames.

In [87]:
# How many (max) results do we want for each name?
#[TO DO] Make this a user setting, defaulting to 5
# The normal is (Default: 10, Maximum: 50), according to https://nominatim.org/release-docs/develop/api/Search/
OSM_limit = 5
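Placenames often contain spaces, apostrophes and occasionally characters such as `&` that have a special meaning in a URL, so it is safer to encode them when building the query. The sketch below only builds the URL and makes no request. The `q`, `format` and `limit` parameters are the ones used in this notebook; `build_osm_url` and the sample placename are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

OSM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"

def build_osm_url(placename, limit=5):
    """Build a Nominatim search URL with the placename safely encoded."""
    query = urlencode({"q": placename, "format": "json", "limit": limit})
    return f"{OSM_SEARCH_URL}?{query}"

# A hypothetical placename containing an apostrophe, spaces and an ampersand
url = build_osm_url("King's River & Hells Gates")
print(url)

# Decoding the query string recovers the original placename intact
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Without the encoding, everything after the `&` would be treated as a separate parameter and dropped from the search.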

In [88]:
# Send rate-limited requests that stay within one request per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def osm_call_api(url):
    response = requests.get(url)
    return response

# Convert a postcode from a string into a state abbreviation
def postcode_to_state(postcodestr):
    postcode = int(postcodestr)
    
    if (postcode >=1000 and postcode <=2599) or (postcode >= 2619 and postcode < 2899) or (postcode >= 2921 and postcode <= 2999):
        return("NSW")
    elif (postcode >=200 and postcode <=299) or (postcode >= 2600 and postcode <= 2618) or (postcode >= 2900 and postcode <= 2920):
        return("ACT")
    elif (postcode >=3000 and postcode <=3999) or (postcode >= 8000 and postcode <= 8999):
        return("VIC")
    elif (postcode >=4000 and postcode <=4999) or (postcode >= 9000 and postcode < 9999):
        return("QLD")
    elif (postcode >=5000 and postcode <=5999):
        return("SA")
    elif (postcode >=6000 and postcode <=6797) or (postcode >= 6800 and postcode <= 6999):
        return("WA")
    elif (postcode >= 7000 and postcode <= 7999):
        return("TAS")
    elif (postcode >=800 and postcode <=999):
        return("NT")
    # some postcodes are special cases
    elif (postcode ==2899):
        return("Norfolk Island")  # coded as NSW
    elif (postcode ==6798):
        return("Christmas Island")  # coded as WA
    elif (postcode ==6799):
        return("Cocos (Keeling) Islands")  # coded as WA
    elif (postcode ==9999):
        return("North Pole")  # coded as VIC for Santa mail
    
    return(postcodestr)

    
# Format the api response to make comparison easier
def osm_format_response(input):

    # Shorten the name and extract the country name, if any
    hyperlocation = None;
    shortlocation = input["display_name"] # default to full address

    if input["display_name"].find(','):
        # break up the address
        namesplit = input["display_name"].split(',')
        shortlocation = namesplit[0].lstrip().rstrip()
        # extract the rightmost term from the split
        hyperlocation = namesplit[len(namesplit)-1].lstrip().rstrip()
        # change to a Australian state name, rather than Australia
        # [TO DO] Set this as an Australia-based function so to help separate it from the rest of the code
        if (hyperlocation=="Australia" and len(namesplit)>2):
            hyperlocation = namesplit[len(namesplit)-2].lstrip().rstrip()
            # Change postcodes into states
            if (hyperlocation.isdigit() and len(hyperlocation)==4):
                hyperlocation = postcode_to_state(hyperlocation)
        
    # for now, keep the names consistent between gazetteer records 
    response = {"LocationName": str(shortlocation), 
              "Category": str(input["type"]),
              "Latitude": input["lat"], 
              "Longitude": input["lon"],
              "PartOf": str(hyperlocation),
              "Gazetteer": "OSM"
                }
    return response
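To see what the formatting step does with an address, the sketch below applies the same splitting logic to two display names. Both names are hypothetical examples written in the comma-separated style that Nominatim returns, and `split_display_name` is a simplified stand-in for the function above that leaves out the postcode conversion.

```python
def split_display_name(display_name):
    """Return (short name, containing region) from a comma-separated address."""
    parts = [part.strip() for part in display_name.split(",")]
    if len(parts) == 1:
        return parts[0], None            # no comma, so no containing region
    region = parts[-1]                   # the rightmost term is usually the country
    if region == "Australia" and len(parts) > 2:
        region = parts[-2]               # prefer the state or territory for Australian places
    return parts[0], region

island = split_display_name("Sarah Island, West Coast, Tasmania, Australia")
sea = split_display_name("Mediterranean Sea")
print(island)
print(sea)
```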

__[TO DO] Talk more__

You can now move to the data that is needed for the geolocation project.

In [89]:
# For every placename in our list
for place in geolocdata:
    # Already found a matching location, so skip to the next placename
    if len(place['locations']['best_match']) > 0:
        continue
        
    placename = place['placename']
    print ("looking for",placename)

    # query the OSM database
    url = f"https://nominatim.openstreetmap.org/search?q={placename}&format=json&limit={OSM_limit}"
    response = osm_call_api(url)
    response_dict = json.loads(response.text)

    place['locations']['candidates']=None
    
    # Handle no results found
    if len(response_dict) == 0:
        # skip to the next placename
        continue
        
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response_dict:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([osm_format_response(response_record)])

        # Add the data to a dataframe
        # The cleaned_response should be in the form we want to keep.
        data_frames.append(cleaned_response)

    # add the results to the geoloc dataframe
    # review the outcomes later
    place['locations']['candidates'] = data_frames

    

looking for the sleepy sea
looking for the Bay of Biscay
looking for Heath
looking for Van Diemen's Land
looking for Vickers
looking for Sylvia
looking for Bath
looking for Julia Vickers's
looking for Frere
looking for Chatham
looking for CHAPTER II
looking for Surgeon Pine
looking for Coromandel
looking for Pine
looking for the Hydaspes for Calcutta
looking for the poop guard
looking for MONOTONY
looking for Three'll
looking for Van Diemen's
looking for Tasman
looking for Cape Pillar
looking for Pirates' Bay
looking for east
looking for west
looking for the Isle of Wight
looking for the South-West Cape
looking for Swan Port
looking for Mediterranean
looking for Maria Island
looking for the Three Thumbs
looking for Peninsula
looking for Storm Bay
looking for Storing Island
looking for Sorrell
looking for Bruny Island
looking for Mount Royal
looking for D'Entrecasteaux Channel
looking for Actaeon
looking for the South Cape
looking for New Norfolk
looking for Derwent
looking for the Sout

What placenames have you still not found?

In [90]:
unmatcheddata = [place['placename'] for place in geolocdata 
                 if len(place['locations']['best_match'])==0 and place['locations']['candidates'] is None]
unmatcheddata

['the sleepy sea',
 "Julia Vickers's",
 'the Hydaspes for Calcutta',
 'the poop guard',
 'MONOTONY',
 'the Three Thumbs',
 'Storing Island',
 'Commandant',
 'verandah.-She',
 'Grummet Island']
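At this point each placename is in one of three states: matched in the reference data, given candidate locations by the gazetteer, or still unmatched. The sketch below makes those states explicit. The records are hypothetical and use plain lists where the notebook stores dataframes, and `match_status` is an illustrative helper. Note that a placename matched in the reference data never receives a `candidates` entry, which is why the sketch reads that key with `.get()`.

```python
# Hypothetical records in the shape used by geolocdata, with lists in place of dataframes
records = [
    {"placename": "London", "locations": {"best_match": ["reference row"]}},
    {"placename": "Bath", "locations": {"best_match": [], "candidates": ["osm row 1", "osm row 2"]}},
    {"placename": "the sleepy sea", "locations": {"best_match": [], "candidates": None}},
]

def match_status(place):
    """Classify a record as 'reference', 'gazetteer' or 'unmatched'."""
    locations = place["locations"]
    if len(locations["best_match"]) > 0:
        return "reference"
    if locations.get("candidates") is not None:
        return "gazetteer"
    return "unmatched"

status_by_name = {place["placename"]: match_status(place) for place in records}
print(status_by_name)
```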

Where have you found locations for placenames?

In [91]:
matchdata = [place['placename'] for place in geolocdata 
             if len(place['locations']['best_match'])>0 or place['locations']['candidates'] is not None]
matchdata

['the Bay of Biscay',
 'Heath',
 'London',
 "Van Diemen's Land",
 'Vickers',
 'Sylvia',
 'Bath',
 'Frere',
 'Chatham',
 'CHAPTER II',
 'Surgeon Pine',
 'Coromandel',
 'Pine',
 'India',
 "Three'll",
 "Van Diemen's",
 'Tasman',
 'Cape Pillar',
 "Pirates' Bay",
 'east',
 'west',
 'the Isle of Wight',
 'the South-West Cape',
 'Swan Port',
 'Mediterranean',
 'Maria Island',
 'Peninsula',
 'Storm Bay',
 'Italy',
 'Sorrell',
 'Bruny Island',
 'Mount Royal',
 "D'Entrecasteaux Channel",
 'Actaeon',
 'the South Cape',
 'New Norfolk',
 'Derwent',
 'the Southern Ocean',
 'Tamar',
 'Victoria',
 'Port Philip Bay',
 'Wellington',
 'Dromedary',
 'Mount Wellington',
 'Launceston',
 'Smyrna',
 'Pyramid Island',
 'Rocky Point',
 'Port Davey',
 'Mount Direction',
 'Macquarie Harbour',
 'Mount Heemskirk',
 'Mount Zeehan',
 "King's River",
 'Sarah Island',
 "Philip's Island",
 'Hobart Town',
 'earth',
 'south-east',
 'Ladybird',
 'Port Arthur',
 'Honduras',
 'Arthur',
 'Hells Gates',
 'England',
 'New Town'

__[TO DO] Show the output and discuss from OSM, so the context of the processing can be understood__

While OpenStreetMap (OSM) is a wonderful resource, it focuses on the current names of geographic locations. If the source of your placenames was not written in recent decades, OSM may not know the names that were in use when the document was written. 

One solution is to also consult a historical gazetteer, like TLCMap. 

Like the OSM API, the TLCMap API has various options, such as which type of search to use and whether to include data entered by the public in addition to data that has been verified or entered by experts.

__[TO DO] Describe the TLCMap__

For this workshop, you will look for exact matches between the placenames and the locations, and not consider any publicly entered data.

In [92]:
# Which type of search to use when matching placenames to known locations
search_type = 'exact' # accepted values: 'exact', 'fuzzy', 'contains' 

# Flag whether to use data provided by the public
search_public_data = False # alt values = True, False

As with OSM, you can limit how many results you want to examine. This notebook sets the TLCMap limit to 5.

In [93]:
TLCMap_limit = 5

As with OSM, you will need a few functions to query the API.

In [94]:
def tlc_build_url(placename: str, search_type: str, search_public_data: bool = False) -> str:
    """
    Build a url to query the tlcmap/ghap API.
    placename: the place we're trying to locate
    search_type: what search type to use (accepts one of ['contains','fuzzy','exact'])
    search_public_data: whether to also search data contributed by the public
    Returns None if the search_type is not recognised.
    
    ref: https://www.tlcmap.org/guides/ghap/#ws
    """
    safe_placename = urllib.parse.quote(placename.strip().lower())

    url = "https://tlcmap.org/ghap/search?"

    if search_type == 'fuzzy':
        url += f"fuzzyname={safe_placename}"
    elif search_type == 'exact':
        url += f"name={safe_placename}"
    elif search_type == 'contains':
        url += f"containsname={safe_placename}"
    else:
        return None

    # Search Australian National Placenames Survey provided data
    url += "&searchausgaz=on"
    
    # Search public provided data, this data could be unreliable
    if search_public_data:
        url += "&searchpublicdatasets=on"
    else:
        url += "&searchpublicdatasets=off"
        
    # Retrieve data as JSON
    url += "&format=json"
    
    # limit the number of results
    url += "&paging=" + str(TLCMap_limit)

    return url

# Send rate-limited requests that stay within 1 request per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def tlc_call_api(url):
    r = requests.get(url)
    # The request ended up at TLCMap's 'maxpaging' page, so there is no data to read
    if r.url == 'https://tlcmap.org/ghap/maxpaging':
        return None

    # Default to None so that a failed request is treated as "no results"
    response = None

    # If the reply says the placename wasn't found, customise the JSON data for the reply
    if r.content.decode() == "No search results to display.":
        # TLCMap replies with plain text here rather than an empty list of features,
        # so build the empty FeatureCollection ourselves
        response = json.loads('{"type": "FeatureCollection","metadata": {},"features": []}')
    # SUCCESS! Record the spatial data provided in the reply
    elif r.ok:
        response = r.json()    # get [lon, lat] for spatial matches

    return response


def tlc_query_name(placename: str, search_type: str):
    """
    Use the tlcmap/ghap API to look up a placename (exact, fuzzy or contains search).
    Returns the API response, or None if no query URL could be built.
    """
    url = tlc_build_url(placename, search_type, search_public_data)
    if url:
        return tlc_call_api(url)
    return None
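
As a rough illustration of what these functions send to TLCMap, the sketch below assembles the same query string with `urllib.parse.urlencode`. The helper name `sketch_tlc_url` and its `public` and `limit` arguments are made up for this example; the query parameter names are the ones used in `tlc_build_url` above.

```python
from urllib.parse import quote, urlencode

# Map each search type to the query parameter used by tlc_build_url
SEARCH_PARAMS = {"exact": "name", "fuzzy": "fuzzyname", "contains": "containsname"}

def sketch_tlc_url(placename, search_type="exact", public=False, limit=5):
    """Hypothetical helper: build a TLCMap search URL, or None for an unknown search type."""
    if search_type not in SEARCH_PARAMS:
        return None
    params = {
        SEARCH_PARAMS[search_type]: placename.strip().lower(),
        "searchausgaz": "on",
        "searchpublicdatasets": "on" if public else "off",
        "format": "json",
        "paging": limit,
    }
    # quote_via=quote encodes spaces as %20, as urllib.parse.quote does
    return "https://tlcmap.org/ghap/search?" + urlencode(params, quote_via=quote)

print(sketch_tlc_url("Port Arthur"))
print(sketch_tlc_url("Port Arthur", search_type="nearby"))
```

The first call prints a full query URL for an exact search; the second prints `None` because `nearby` is not one of the three search types handled here.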

In [95]:
# Format the api response to make comparison easier
def tlc_format_response(feature):
    """
    Reduce one TLCMap feature record to the fields shared by all gazetteers.
    Returns None if the record is empty.
    """
    if not len(feature):
        return None

    properties = feature['properties']

    # Gather the data for one of the placename's locations.
    # If a value is missing, set a default or leave it empty
    # (a missing Category or PartOf becomes the string "None").
    # For now, keep the names consistent between gazetteer records.
    return {"LocationName": str(properties.get('placename', "Unknown Location")).strip(),
            "Category": str(properties.get('feature_term')).strip(),
            "Latitude": properties.get('latitude', ""),
            "Longitude": properties.get('longitude', ""),
            "PartOf": str(properties.get('state')).strip(),
            "Gazetteer": "TLCMap"
            }
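
To see what this formatting does, the sketch below applies the same field mapping to one feature record copied from a TLCMap reply for 'Dawes' (trimmed to the fields that are kept). `sketch_format_feature` is a condensed stand-in written for this example, not the notebook's own function.

```python
# One feature from a TLCMap reply, trimmed to the fields that are kept
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [146.4990234, -37.81597137]},
    "properties": {
        "placename": "Dawes",
        "feature_term": "neighbourhoold",
        "latitude": "-37.81597137",
        "longitude": "146.4990234",
        "state": "VIC",
    },
}

def sketch_format_feature(feature):
    """Condensed stand-in for tlc_format_response (assumption: same field mapping)."""
    props = feature.get("properties", {})
    return {
        "LocationName": str(props.get("placename", "Unknown Location")).strip(),
        "Category": str(props.get("feature_term")).strip(),
        "Latitude": props.get("latitude", ""),
        "Longitude": props.get("longitude", ""),
        "PartOf": str(props.get("state")).strip(),
        "Gazetteer": "TLCMap",
    }

record = sketch_format_feature(feature)
print(record)
```

Note that the coordinates stay as strings, which is why later cells convert them with `float()` before comparing them.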

You can now search the TLCMap for locations matching the same placenames you previously searched for in the OSM.

In [96]:
# For every placename in our list
for place in geolocdata:
    # Already found a location, so skip to the next placename
    if len(place['locations']['best_match']) > 0:
        continue
        
    placename = place['placename']
    print ("looking for",placename)

    # query the TLCMap database
    response = tlc_query_name(placename,search_type)
    
    # Handle no results found
    if response is None:
        # skip to the next placename
        continue
        
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response["features"]:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([tlc_format_response(response_record)])

        # Add the data to a dataframe
        data_frames.append(cleaned_response)
        
    # add the results to the geoloc dataframe
    # review the outcomes later
    # Make sure you don't write over any candidates previously added from another gazetteer.
    if place['locations']['candidates'] is None:
        place['locations']['candidates'] = data_frames
    else:
        place['locations']['candidates'] = place['locations']['candidates'] + data_frames   

looking for the sleepy sea
looking for the Bay of Biscay
looking for Heath
looking for Van Diemen's Land
looking for Vickers
looking for Sylvia
looking for Bath
looking for Julia Vickers's
looking for Frere
looking for Chatham
looking for CHAPTER II
looking for Surgeon Pine
looking for Coromandel
looking for Pine
looking for the Hydaspes for Calcutta
looking for the poop guard
looking for MONOTONY
looking for Three'll
looking for Van Diemen's
looking for Tasman
looking for Cape Pillar
looking for Pirates' Bay
looking for east
looking for west
looking for the Isle of Wight
looking for the South-West Cape
looking for Swan Port
looking for Mediterranean
looking for Maria Island
looking for the Three Thumbs
looking for Peninsula
looking for Storm Bay
looking for Storing Island
looking for Sorrell
looking for Bruny Island
looking for Mount Royal
looking for D'Entrecasteaux Channel
looking for Actaeon
looking for the South Cape
looking for New Norfolk
looking for Derwent
looking for the Sout

What placenames have you still not found?

In [97]:
unmatcheddata = [place['placename'] for place in geolocdata 
            # must have not found an unambiguous best match or any candidate locations 
             if (len(place['locations']['best_match'])==0 and 
                 (place['locations']['candidates'] is None or 
                  place['locations']['candidates']==[]))]
unmatcheddata

['the sleepy sea',
 "Julia Vickers's",
 'the Hydaspes for Calcutta',
 'the poop guard',
 'MONOTONY',
 'the Three Thumbs',
 'Storing Island',
 'Commandant',
 'verandah.-She']
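
The condition in this cell has to treat two "empty" cases alike: `candidates` may still be `None`, or it may be an empty list when a gazetteer replied with no features. A minimal sketch of that test, using a made-up `has_no_locations` helper and plain lists in place of the notebook's dataframes:

```python
def has_no_locations(place):
    """Hypothetical helper: True if there is no best match and no candidates."""
    locations = place["locations"]
    candidates = locations["candidates"]
    return len(locations["best_match"]) == 0 and (candidates is None or len(candidates) == 0)

# Toy records shaped like the entries in geolocdata (values are illustrative)
places = [
    {"placename": "the sleepy sea", "locations": {"best_match": [], "candidates": None}},
    {"placename": "MONOTONY", "locations": {"best_match": [], "candidates": []}},
    {"placename": "Heath", "locations": {"best_match": [], "candidates": ["row 1", "row 2"]}},
]

unmatched = [p["placename"] for p in places if has_no_locations(p)]
print(unmatched)
```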

You can now compare all the locations you have found.

In [98]:
matcheddata = [place for place in geolocdata 
             if (len(place['locations']['best_match'])!=0 or 
                 (place['locations']['candidates'] is not None and 
                  place['locations']['candidates']!=[]))]
alllocations=[]
for place in matcheddata:
    placename = place['placename']
    locations = []
    # unambiguous best match already known
    if len(place['locations']['best_match'])!=0:
        locations=place['locations']['best_match']
    # any candidate locations you have found
    else:
        if place['locations']['candidates']!= None:
            locations = pd.concat(place['locations']['candidates'],ignore_index=True) 
    print("===> ",placename)
    print(locations)
    alllocations=alllocations+[locations]

===>  the Bay of Biscay
        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM
===>  Heath
  LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath  administrative          32.8365147         -96.474987   United States       OSM
1        Heath  administrative          31.3607243        -86.4696811   United States       OSM
2        Heath  administrative          40.0228421        -82.4445991   United States       OSM
3        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM
4        Heath  administrative          42.6898311        -72.8178076   United States       OSM
5        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
6        Heath          parish               -19.6              143.6             QLD    TLCMap
===>  London
    L

__[TO DO] Not sure which version to show - the full "pretty version" or the short dirty one__

In [99]:
pprint(alllocations[0:10])

[        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM,
   LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath  administrative          32.8365147         -96.474987   United States       OSM
1        Heath  administrative          31.3607243        -86.4696811   United States       OSM
2        Heath  administrative          40.0228421        -82.4445991   United States       OSM
3        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM
4        Heath  administrative          42.6898311        -72.8178076   United States       OSM
5        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
6        Heath          parish               -19.6              143.6             QLD    TLCMap,
     LocationName Category  Latitude  Longitude   

__[TO DO] Show the output and discuss from OSM, so the context of the processing can be understood__

In [120]:
pprint(response)

{'features': [{'geometry': {'coordinates': [146.4990234, -37.81597137],
                            'type': 'Point'},
               'properties': {'TLCMapDataset': 'https://tlcmap.org/ghap/',
                              'TLCMapLinkBack': 'https://tlcmap.org/ghap/search?id=a1cf9b',
                              'description': 'official; 146.499027777777, '
                                             '-37.8159722222222',
                              'feature_term': 'neighbourhoold',
                              'id': 'a1cf9b',
                              'latitude': '-37.81597137',
                              'longitude': '146.4990234',
                              'name': 'Dawes',
                              'original_data_source': 'Australian Gazetteer',
                              'placename': 'Dawes',
                              'state': 'VIC'},
               'type': 'Feature'},
              {'geometry': {'coordinates': [151.21666666666667,
                        

The issue is now to work out which of these locations is most suitable.

A few heuristics can be used to flag locations with features that help to rank and select them. 

One heuristic is to note when multiple candidate locations have similar coordinates.  

In [100]:
# Compare two sets of coordinates
# Return True if they are within 1 degree of each other for both Lat and Lon,
# after truncating each value to a whole number of degrees
def compare_coords(lat1,lon1,lat2,lon2):

    if lat1.item() == "" or lat2.item() == "" or lon1.item() == "" or lon2.item() == "":
        return False
        
    lat1_int = int(float(lat1.item()))
    lon1_int = int(float(lon1.item()))
    lat2_int = int(float(lat2.item()))
    lon2_int = int(float(lon2.item()))

    # Is Coord1 within 1 degree of Coord2? 
    # e.g., -47.5 is close to -46.5 and -48.5
    return (lat1_int in range(lat2_int-1,lat2_int+2,1)) and (lon1_int in range(lon2_int-1,lon2_int+2,1))
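
A worked example may make the comparison concrete. The sketch below repeats the same whole-degree test on plain values (the `.item()` calls are dropped because no dataframes are involved); `coords_close` is a name invented for this example. Note that `int()` truncates towards zero, so this is a coarse test.

```python
def coords_close(lat1, lon1, lat2, lon2):
    """Sketch of the whole-degree comparison, taking plain numbers or numeric strings."""
    if "" in (lat1, lon1, lat2, lon2):
        return False
    lat1_int, lon1_int = int(float(lat1)), int(float(lon1))
    lat2_int, lon2_int = int(float(lat2)), int(float(lon2))
    return abs(lat1_int - lat2_int) <= 1 and abs(lon1_int - lon2_int) <= 1

# Two 'earth' tracks from the OSM results: 49 vs 49 and 14 vs 15 degrees
print(coords_close("49.5904859", "14.6651358", "49.802872", "15.5172618"))
# Two 'Pyramid Island' records: -42 vs -39 degrees of latitude
print(coords_close("-42.59000015", "145.7299957", "-39.81999969", "147.2299957"))
# A missing coordinate never matches
print(coords_close("", "14.6651358", "49.802872", "15.5172618"))
```

The three calls print `True`, `False` and `False` respectively.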

However, sometimes there are no coordinates for a location, which can complicate the comparisons. The solution used here is to default to 0 for missing data when sorting. It is not perfect, because 0 is itself a valid coordinate, but in most cases it is adequate.

In [101]:
# Given a coordinate as a number or a string, convert it to an integer
# This is required because of NA or "", which are set to 0.
def sorting_coord(coord):
   
    # set to 0 if missing data
    if type(coord)==str and coord.strip()=="":
        return 0
    if pd.isna(coord):
        return 0
    return int(float(coord))
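
The effect of the default can be seen by sorting a few toy records. This sketch uses a simplified `coord_key` helper (an assumption for this example, handling only strings, numbers and `None`) rather than the notebook's function.

```python
def coord_key(coord):
    """Simplified stand-in: whole-degree value, or 0 when the coordinate is missing."""
    if coord is None or str(coord).strip() == "":
        return 0
    return int(float(coord))

# Toy candidates; the last one has no coordinates
candidates = [
    {"name": "Heath (UK)", "Latitude": "53.1908153", "Longitude": "-1.3499525"},
    {"name": "Heath (NSW)", "Latitude": "-32.13166666666667", "Longitude": "149.3011111111111"},
    {"name": "Heath (unknown)", "Latitude": "", "Longitude": ""},
]

ordered = sorted(candidates,
                 key=lambda c: [coord_key(c["Latitude"]), coord_key(c["Longitude"])])
print([c["name"] for c in ordered])
```

The record without coordinates is sorted as if it were at latitude 0, longitude 0, so it lands between the southern and northern records. This is why the default is adequate rather than perfect.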

Another heuristic is to recognise locations in countries that you know are relevant to the original document. For this, you can check the values in the PartOf field.

In [102]:
# Flag locations in Australia or Britain
aus_states = {"AUSTRALIA","NSW","VIC","QLD","TAS","WA","NT","SA","ACT",
             "NEW SOUTH WALES", "VICTORIA", "QUEENSLAND","TASMANIA","WESTERN AUSTRALIA",
             "SOUTH AUSTRALIA", "NORTHERN TERRITORY","AUSTRALIAN CAPITAL TERRITORY"}
gb_states = {"BRITAIN","UK","GB","GREAT BRITAIN","UNITED KINGDOM","BRITISH ISLES",
            "ENGLAND","WALES","SCOTLAND",
            "IRELAND","NORTHERN IRELAND", "ÉIRE / IRELAND", "EIRE", "EIRE / IRELAND"}

Now you can go through each placename and its candidate locations and flag which of these criteria they meet. A distinction is made between locations with similar coordinates that were found in the same gazetteer and those found in different gazetteers.
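
The Australia/Britain part of this flagging could be expressed as a small lookup function, sketched below. `region_flag` is a hypothetical name, and the two sets are shortened copies of `aus_states` and `gb_states` so that the example runs on its own.

```python
# Shortened copies of the sets above, so this sketch is self-contained
AUS = {"AUSTRALIA", "NSW", "VIC", "QLD", "TAS", "TASMANIA"}
GB = {"UK", "GREAT BRITAIN", "UNITED KINGDOM", "ENGLAND"}

def region_flag(part_of):
    """Hypothetical helper: return 'Australia', 'Britain' or '' for a PartOf value."""
    value = str(part_of).strip().upper()
    if value in AUS:
        return "Australia"
    if value in GB:
        return "Britain"
    return ""

for part_of in ["NSW", "United Kingdom", "United States", "Tasmania"]:
    print(part_of, "->", repr(region_flag(part_of)))
```

Upper-casing the value before the lookup means that `TAS`, `Tasmania` and `TASMANIA` are all treated the same way.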

In [103]:
for place in matcheddata:
    placename = place['placename']
    locations = []
    if (len(place['locations']['best_match'])==0 and
        place['locations']['candidates']!= None and 
        len(place['locations']['candidates']) > 1):
        
        # sort the locations by Latitude & Longitude
        sorted_candidates = sorted(place['locations']['candidates'],
                                   key=lambda x:[sorting_coord(x['Latitude'].item()),
                                                 sorting_coord(x['Longitude'].item())] )

        prev_location = {} 

        # flag the candidates according to the heuristics
        for candidate in sorted_candidates:
            rank_flags = []
            partof_flags = ""
            
            # Flag locations in Australia or Britain
            # [TO DO] Make these lines into a function to pull the Aus/GB code out of the main code.
            partings=candidate['PartOf'].item().upper()
            if partings in aus_states:
                partof_flags='Australia'
            if partings in gb_states:
                partof_flags='Britain'
            if partof_flags!="":
                rank_flags.append(partof_flags)
 
            # Flag coords in multiple gazetteers
            if len(prev_location)>0:
                coord_flag = compare_coords(prev_location['Latitude'],
                                        prev_location['Longitude'],
                                        candidate['Latitude'],
                                        candidate['Longitude'])
                # Dupl_2Gaz: Matching coords in locations from 2 gazetteers 
                if coord_flag and prev_location['Gazetteer'].item()!=candidate['Gazetteer'].item():
                    rank_flags.append("Dupl_2Gaz")
                # Dupl_1Gaz: Matching coords in locations from 1 gazetteer
                elif coord_flag:
                    rank_flags.append("Dupl_1Gaz")

            prev_location = candidate
                    
            # rank_flags has to be converted to a single string, rather than a list 
            # because candidate is a Pandas dataframe
            candidate["RankFlags"]=pd.Series(','.join(rank_flags))
            print(" ** ",candidate['LocationName'].item(),",",partings,",[",(candidate["RankFlags"].item()),']') 

        # Update the geoloc data
        place['locations']['candidates']=sorted_candidates

 **  Heath , NSW ,[ Australia ]
 **  Heath , QLD ,[ Australia ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED STATES ,[  ]
 **  Heath , UNITED KINGDOM ,[ Britain ]
 **  Tasmania , AUSTRALIA ,[ Australia ]
 **  Van Diemen's Land , TAS ,[ Australia ]
 **  Van Diemen's Land , UNITED KINGDOM ,[ Britain ]
 **  Vickers , UNITED STATES ,[  ]
 **  Vickers , UNITED STATES ,[ Dupl_1Gaz ]
 **  Vickers , UNITED STATES ,[  ]
 **  Vickers , UNITED STATES ,[  ]
 **  Vickers , UNITED KINGDOM ,[ Britain ]
 **  Sylvia , QLD ,[ Australia ]
 **  Sylvia , PHILIPPINES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Sylvia , UNITED STATES ,[  ]
 **  Bath , SA ,[ Australia ]
 **  Bath County , UNITED STATES ,[  ]
 **  Bath , UNITED STATES ,[  ]
 **  Bath , UNITED STATES ,[  ]
 **  Bath , UNITED KINGDOM ,[ Britain ]
 **  Bath Abbey , UNITED KINGDOM ,[ Britain,Dupl_1Gaz ]
 **

 **  Rocky Point , QLD ,[ Australia ]
 **  Rocky Point , UNITED STATES ,[  ]
 **  Rocky Point , UNITED STATES ,[  ]
 **  Rocky Point , UNITED STATES ,[  ]
 **  Rocky Point , CANADA ,[  ]
 **  Rocky Point , UNITED STATES ,[  ]
 **  Port Davey , TASMANIA ,[ Australia ]
 **  Port Davey , TAS ,[ Australia,Dupl_2Gaz ]
 **  Port Davey , TAS ,[ Australia,Dupl_1Gaz ]
 **  Mount Direction , TAS ,[ Australia ]
 **  Mount Direction , TAS ,[ Australia,Dupl_2Gaz ]
 **  Mount Direction , TASMANIA ,[ Australia,Dupl_2Gaz ]
 **  Mount Direction , TASMANIA ,[ Australia,Dupl_1Gaz ]
 **  Mount Direction , TAS ,[ Australia,Dupl_2Gaz ]
 **  Mount Direction , TAS ,[ Australia,Dupl_1Gaz ]
 **  Mount Direction , VICTORIA ,[ Australia ]
 **  Mount Direction , NSW ,[ Australia ]
 **  Mount Direction , QUEENSLAND ,[ Australia ]
 **  Mount Direction , QLD ,[ Australia,Dupl_2Gaz ]
 **  Macquarie Harbour , TASMANIA ,[ Australia ]
 **  Macquarie Harbour , TAS ,[ Australia,Dupl_2Gaz ]
 **  Mount Heemskirk , TASMANIA ,

You can now use the heuristic flags to re-sort the candidates and select the best one.
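
The idea behind the re-sort can be sketched on plain dictionaries: each combination of flags is given a position in an ordered list, and candidates are sorted by that position. The list here is copied from the `sortorder` cell that follows; the records are toy data and `rank_of` is a name made up for this example. Python's `sorted` is stable, so candidates with the same flags keep their earlier coordinate-based order.

```python
sortorder = ['Australia,Dupl_2Gaz', 'Australia,Dupl_1Gaz', 'Australia',
             'Britain,Dupl_2Gaz', 'Dupl_2Gaz', 'Britain,Dupl_1Gaz',
             'Britain', 'Dupl_1Gaz', ""]

def rank_of(flags):
    """Position of a RankFlags string in sortorder; unknown flags sort last."""
    return sortorder.index(flags) if flags in sortorder else len(sortorder)

# Toy candidates in coordinate order
candidates = [
    {"LocationName": "Port Davey", "PartOf": "TASMANIA", "RankFlags": "Australia"},
    {"LocationName": "Port Davey", "PartOf": "TAS", "RankFlags": "Australia,Dupl_2Gaz"},
    {"LocationName": "Port Davey", "PartOf": "TAS", "RankFlags": "Australia,Dupl_1Gaz"},
]

ranked = sorted(candidates, key=lambda c: rank_of(c["RankFlags"]))
print([c["RankFlags"] for c in ranked])
```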

In [104]:
# Establish how to rank the heuristic flags
sortorder = ['Australia,Dupl_2Gaz',  # in Australia, found in 2 gazetteers
             'Australia,Dupl_1Gaz',  # in Australia, found more than once in 1 gazetteer
             'Australia',            # in Australia, found only once
             'Britain,Dupl_2Gaz',    # in Great Britain, found in 2 gazetteers
             'Dupl_2Gaz',            # not in Australia or Great Britain, found in 2 gazetteers
             'Britain,Dupl_1Gaz',    # in Great Britain, found more than once in 1 gazetteer
             'Britain',              # in Great Britain, found only once
             'Dupl_1Gaz',            # not in Australia or Great Britain, found more than once in 1 gazetteer
             ""]                     # not in Australia or Great Britain, found only once

In [105]:
# Sort the candidates
for place in geolocdata:
    placename = place['placename']
    pprint(placename)

    if (len(place['locations']['best_match'])==0 and # haven't found a winner yet
        place['locations']['candidates']!= None and 
        len(place['locations']['candidates'])> 1): # more than one candidate
        
        candidates = place['locations']['candidates']

        sorted_candidates=[]
        for heuristic in sortorder:
            # get the candidates matching this ranking heuristic
            for candidate in candidates:
                if candidate['RankFlags'].item() == heuristic:
                    # add the candidate to the sorted list
                    sorted_candidates.append(candidate)
        # prepare a version for printing
        locations = pd.concat(sorted_candidates,ignore_index=True) 
        pprint(locations)
        
        # update the geoloc data with the sorted candidates
        place['locations']['candidates']=sorted_candidates

'the sleepy sea'
'the Bay of Biscay'
'Heath'
  LocationName        Category            Latitude          Longitude          PartOf Gazetteer  RankFlags
0        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap  Australia
1        Heath          parish               -19.6              143.6             QLD    TLCMap  Australia
2        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM    Britain
3        Heath  administrative          31.3607243        -86.4696811   United States       OSM           
4        Heath  administrative          32.8365147         -96.474987   United States       OSM           
5        Heath  administrative          40.0228421        -82.4445991   United States       OSM           
6        Heath  administrative          42.6898311        -72.8178076   United States       OSM           
'London'
"Van Diemen's Land"
        LocationName        Category     Latitude            Longitude

4                     
'Hobart Town'
'earth'
  LocationName        Category     Latitude           Longitude         PartOf Gazetteer  RankFlags
0        earth           track   49.5904859          14.6651358          Česko       OSM  Dupl_1Gaz
1        earth           track    49.802872          15.5172618          Česko       OSM  Dupl_1Gaz
2        Earth      university  10.21990825  -83.59177371174943     Costa Rica       OSM           
3        Earth  administrative   34.2331373        -102.4107493  United States       OSM           
4        earth           track   49.7933447          13.8667052          Česko       OSM           
'south-east'
                   LocationName        Category     Latitude           Longitude          PartOf Gazetteer  \
0  South East Constituency 1947       political   53.3262506  -6.236059861357543  Éire / Ireland       OSM   
1           South-East District  administrative  -24.9902332  25.726402352052464        Botswana       OSM   
2        Dép

In [106]:
# Outputting without the RankFlags column
matcheddata = [place for place in geolocdata 
             if (len(place['locations']['best_match'])!=0 or 
                 (place['locations']['candidates'] is not None and 
                  place['locations']['candidates']!=[]))]
alllocations=[]
for place in matcheddata:
    placename = place['placename']
    locations = []
    # unambiguous best match already known
    if len(place['locations']['best_match'])!=0:
        locations=place['locations']['best_match']
    # any candidate locations you have found
    else:
        if place['locations']['candidates'] is not None:
            short_candidates = []
            for candidate in place['locations']['candidates']:
                # for output, ignore the RankFlags column
                short_candidates = short_candidates + [candidate.loc[:, candidate.columns != 'RankFlags']]
            locations = pd.concat(short_candidates,ignore_index=True) 
    print(locations)

        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM
  LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
1        Heath          parish               -19.6              143.6             QLD    TLCMap
2        Heath            site          53.1908153         -1.3499525  United Kingdom       OSM
3        Heath  administrative          31.3607243        -86.4696811   United States       OSM
4        Heath  administrative          32.8365147         -96.474987   United States       OSM
5        Heath  administrative          40.0228421        -82.4445991   United States       OSM
6        Heath  administrative          42.6898311        -72.8178076   United States       OSM
    LocationName Category  Latitude  Longitude        

6          Launceston     administrative    50.6365062          -4.3604432  United Kingdom       OSM
        LocationName        Category    Latitude    Longitude         PartOf Gazetteer
0             Smyrna  administrative   33.883887  -84.5147454  United States       OSM
1             Smyrna            town  35.9824598  -86.5199492  United States       OSM
2              İzmir            city  38.4224548   27.1310699        Türkiye       OSM
3             Smyrna  administrative  39.2998339  -75.6046494  United States       OSM
4  Village of Smyrna  administrative  42.6872921  -75.5707376  United States       OSM
     LocationName Category             Latitude           Longitude            PartOf Gazetteer
0  Pyramid Island   Island         -42.59000015         145.7299957               TAS    TLCMap
1  Pyramid Island   Island         -39.81999969         147.2299957               TAS    TLCMap
2  Pyramid Island    islet  -62.417425699999995  -60.10306847698165    Pyramid Island    

Now that the candidates are sorted, the best match can be selected. The most obvious choice is the highest ranked candidate.

In [107]:
for place in geolocdata:
    placename = place['placename']
    locations = []
    # Already have a best match
    if len(place['locations']['best_match'])!=0:
        locations=place['locations']['best_match']
    else:
        if place['locations']['candidates']!= None and len(place['locations']['candidates'])>0:
            # Presume the best match has the top rank
            topcandidate = place['locations']['candidates'][0]
            place['locations']['best_match'] = topcandidate.loc[:, ~topcandidate.columns.isin(['Gazetteer','RankFlags'])]

In [108]:
# Outputting just the best matching locations without the RankFlags column
matcheddata = [place for place in geolocdata 
             if (len(place['locations']['best_match'])!=0 or 
                 (place['locations']['candidates'] is not None and 
                  place['locations']['candidates']!=[]))]
alllocations=[]
for place in matcheddata:
    placename = place['placename']
    locations = []
    if len(place['locations']['best_match'])!=0:
        locations=place['locations']['best_match']
    else:
        # This shouldn't run because all entries with matches should now have best matches,
        # but just in case
        if place['locations']['candidates']!= None:
            short_candidates = []
            for candidate in place['locations']['candidates']:
                short_candidates = short_candidates + [candidate.loc[:, candidate.columns != 'RankFlags']]
            locations = pd.concat(short_candidates,ignore_index=True) 
    print(locations)


        LocationName Category     Latitude           Longitude          PartOf
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom
  LocationName      Category            Latitude          Longitude PartOf
0        Heath  trig station  -32.13166666666667  149.3011111111111    NSW
    LocationName Category  Latitude  Longitude          PartOf
282       London  Capital -0.083333       51.5  United Kingdom
  LocationName        Category    Latitude    Longitude     PartOf
0     Tasmania  administrative  -42.035067  146.6366887  Australia
  LocationName Category     Latitude            Longitude          PartOf
0      Vickers  primary  51.59845985  -1.7415535311162076  United Kingdom
  LocationName Category Latitude           Longitude PartOf
0       Sylvia   parish   -20.85  143.81666666666666    QLD
  LocationName             Category   Latitude  Longitude PartOf
0         Bath  locality unbounded)  -34.83706  138.49131     SA
  LocationName Category    Latitud

Unfortunately, the top-ranked candidate may still not be the best. The second-ranked candidate may carry the same rank flags as the top-ranked one and sit lower only because of the earlier sort by latitude and longitude (which was done to find candidates with matching coordinates). For this reason, the final stage of the processing is left to the user: they are shown the best candidate location and can select an alternative candidate.
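
One way to see whether this has happened is to list every candidate that shares the top candidate's flags. A minimal sketch on toy records follows; the helper name `tied_with_top` is invented here, and plain dictionaries stand in for the notebook's dataframes.

```python
def tied_with_top(candidates):
    """Return all candidates whose RankFlags equal those of the first candidate."""
    if not candidates:
        return []
    top_flags = candidates[0]["RankFlags"]
    return [c for c in candidates if c["RankFlags"] == top_flags]

# The first two 'Heath' candidates both carry the flag 'Australia'
candidates = [
    {"LocationName": "Heath", "PartOf": "NSW", "RankFlags": "Australia"},
    {"LocationName": "Heath", "PartOf": "QLD", "RankFlags": "Australia"},
    {"LocationName": "Heath", "PartOf": "United Kingdom", "RankFlags": "Britain"},
]

ties = tied_with_top(candidates)
print([c["PartOf"] for c in ties])
```

When more than one candidate is returned, the flags alone cannot separate them, and the choice is best made by someone who knows the text.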

There are a number of ways to build the user interface. One is very verbose: it shows all the candidate locations of all placenames, and the user selects what they think is the best match using checkboxes. 

Alternatively, the candidates could be presented as drop-down accordions.

In [109]:
# individually for each placename
for place in geolocdata:
    placename = place['placename']

    # If there isn't a bestmatch, then there aren't any candidates
    if len(place['locations']['best_match'])!=0:
        
        best_match=place['locations']['best_match']
        best_match_index = 0
        
        # Must be at least one candidate
        if ('candidates' in place['locations'].keys() and 
            place['locations']['candidates']!= None and 
            len(place['locations']['candidates'])>=1):

            # Create a RadioButton for each candidate location
            candidates = place['locations']['candidates']
            candidate_buttons = [widgets.RadioButtons(layout={'width':'max-content'},
                                                    options = [candidate['LocationName'].item()+
                                                    ", "+
                                                    candidate['PartOf'].item()+
                                                    " ("+
                                                    str(candidate['Latitude'].item())+
                                                    ","+
                                                    str(candidate['Longitude'].item())+
                                                    ")" 
                                                    for candidate in candidates])]
            # Add a button for "None of the above"
            candidate_buttons[0].options = candidate_buttons[0].options+tuple(["None of the above"])
            place["locations"]["candidatebuttons"] = candidate_buttons
    
            # Select the button for the current bestmatch
            place['locations']['candidatebuttons'][0].index = best_match_index
    
            # Create a HBox for the Placename 
            placename_box = widgets.HBox([widgets.Label("'"+
                                                place['placename']+
                                                "' matches best with: "+
                                                best_match['LocationName'].item()+ ", "+
                                                best_match['PartOf'].item()+" ("+
                                                str(best_match['Latitude'].item())+ ", "+
                                                str(best_match['Longitude'].item())+ ")"
                                                )])
    
            # Create an Accordion (in a HBox) for the candidates 
            new_accordion = widgets.Accordion()
            new_accordion.set_title(0,"Expand to choose a different location")
            new_accordion.children = [widgets.HBox(place["locations"]["candidatebuttons"])]
            new_accordion.selected_index=None # close the accordion at startup

            # Put the HBoxes together in a VBox
            whole_box = widgets.VBox([placename_box,
                              new_accordion
                              ])
    
            # Show it all!
            # Note: this doesn't close one accordion if another is opened because they are in separate VBoxes
            display(whole_box)

VBox(children=(HBox(children=(Label(value="'the Bay of Biscay' matches best with: The Bay of Biscay, United Ki…

VBox(children=(HBox(children=(Label(value="'Heath' matches best with: Heath, NSW (-32.13166666666667, 149.3011…

VBox(children=(HBox(children=(Label(value="'Van Diemen's Land' matches best with: Tasmania, Australia (-42.035…

VBox(children=(HBox(children=(Label(value="'Vickers' matches best with: Vickers, United Kingdom (51.59845985, …

VBox(children=(HBox(children=(Label(value="'Sylvia' matches best with: Sylvia, QLD (-20.85, 143.81666666666666…

VBox(children=(HBox(children=(Label(value="'Bath' matches best with: Bath, SA (-34.83706, 138.49131)"),)), Acc…

VBox(children=(HBox(children=(Label(value="'Frere' matches best with: Frere, Italia (44.4701528, 7.004274)"),)…

VBox(children=(HBox(children=(Label(value="'Chatham' matches best with: Chatham, VIC (-37.82402802, 145.089035…

VBox(children=(HBox(children=(Label(value="'CHAPTER II' matches best with: Chapter II, United Kingdom (52.6189…

VBox(children=(HBox(children=(Label(value="'Surgeon Pine' matches best with: Surgeon's Kitchen, Norfolk Island…

VBox(children=(HBox(children=(Label(value="'Coromandel' matches best with: Coromandel, SA (-35.0251942, 138.61…

VBox(children=(HBox(children=(Label(value="'Pine' matches best with: Pine, SA (-36.0, 135.0)"),)), Accordion(c…

VBox(children=(HBox(children=(Label(value="'Three'll' matches best with: Three, United States (43.9445278, -71…

VBox(children=(HBox(children=(Label(value="'Van Diemen's' matches best with: Van Gogh Court, TAS (-41.3799529,…

VBox(children=(HBox(children=(Label(value="'Tasman' matches best with: Tasman, TAS (-43.054, 147.834)"),)), Ac…

VBox(children=(HBox(children=(Label(value="'Cape Pillar' matches best with: Cape Pillar, TAS (-43.18000031, 14…

VBox(children=(HBox(children=(Label(value="'Pirates' Bay' matches best with: Pirates Bay, TAS (-43.021334, 147…

VBox(children=(HBox(children=(Label(value="'east' matches best with: East, WA (-30.59869, 122.5848)"),)), Acco…

VBox(children=(HBox(children=(Label(value="'west' matches best with: Western, Papua Niugini (-7.5, 142)"),)), …

VBox(children=(HBox(children=(Label(value="'the Isle of Wight' matches best with: Isle of Wight, United Kingdo…

VBox(children=(HBox(children=(Label(value="'the South-West Cape' matches best with: MOSAIC Lagoon Cafe in the …

VBox(children=(HBox(children=(Label(value="'Swan Port' matches best with: Swan, Mauritius (-20.163587, 57.5032…

VBox(children=(HBox(children=(Label(value="'Mediterranean' matches best with: Mediterranean, United States (29…

VBox(children=(HBox(children=(Label(value="'Maria Island' matches best with: Maria Island, TAS (-42.61999893, …

VBox(children=(HBox(children=(Label(value="'Peninsula' matches best with: Peninsula, WA (-33.92686, 116.0693)"…

VBox(children=(HBox(children=(Label(value="'Storm Bay' matches best with: Storm Bay, TAS (-43.16999817, 147.44…

VBox(children=(HBox(children=(Label(value="'Sorrell' matches best with: Sorrell, United Kingdom (53.9614506000…

VBox(children=(HBox(children=(Label(value="'Bruny Island' matches best with: Bruny Island, TAS (-43.29000092, …

VBox(children=(HBox(children=(Label(value="'Mount Royal' matches best with: Mount Royal, NSW (-32.213611111111…

VBox(children=(HBox(children=(Label(value="'D'Entrecasteaux Channel' matches best with: D'Entrecasteaux Channe…

VBox(children=(HBox(children=(Label(value="'Actaeon' matches best with: Tweedelig hert, Nederland (51.9864345,…

VBox(children=(HBox(children=(Label(value="'the South Cape' matches best with: Cape, United Kingdom (52.284084…

VBox(children=(HBox(children=(Label(value="'New Norfolk' matches best with: New Norfolk, TAS (-42.77999878, 14…

VBox(children=(HBox(children=(Label(value="'Derwent' matches best with: Derwent, TAS (-42.813, 146.423)"),)), …

VBox(children=(HBox(children=(Label(value="'the Southern Ocean' matches best with: ocean, Ελλάς (37.1034349, 2…

VBox(children=(HBox(children=(Label(value="'Tamar' matches best with: Tamar, NSW (-35.815, 144.56777777777776)…

VBox(children=(HBox(children=(Label(value="'Port Philip Bay' matches best with: Yarra Bay Bicentennial Park, N…

VBox(children=(HBox(children=(Label(value="'Dromedary' matches best with: Dromedary, TAS (-42.74000168, 147.16…

VBox(children=(HBox(children=(Label(value="'Mount Wellington' matches best with: Mount Wellington, TAS (-42.88…

VBox(children=(HBox(children=(Label(value="'Launceston' matches best with: Launceston, TAS (-41.43999863, 147.…

VBox(children=(HBox(children=(Label(value="'Smyrna' matches best with: Smyrna, United States (33.883887, -84.5…

VBox(children=(HBox(children=(Label(value="'Pyramid Island' matches best with: Pyramid Island, TAS (-42.590000…

VBox(children=(HBox(children=(Label(value="'Rocky Point' matches best with: Rocky Point, QLD (-27.816666666666…

VBox(children=(HBox(children=(Label(value="'Port Davey' matches best with: Port Davey, TAS (-43.33000183, 145.…

VBox(children=(HBox(children=(Label(value="'Mount Direction' matches best with: Mount Direction, TAS (-42.7900…

VBox(children=(HBox(children=(Label(value="'Macquarie Harbour' matches best with: Macquarie Harbour, TAS (-42.…

VBox(children=(HBox(children=(Label(value="'Mount Heemskirk' matches best with: Mount Heemskirk, TAS (-41.8499…

VBox(children=(HBox(children=(Label(value="'Mount Zeehan' matches best with: Mount Zeehan, TAS (-41.91999817, …

VBox(children=(HBox(children=(Label(value="'King's River' matches best with: King's River, Éire / Ireland (52.…

VBox(children=(HBox(children=(Label(value="'Sarah Island' matches best with: Sarah Island, TAS (-42.38000107, …

VBox(children=(HBox(children=(Label(value="'Philip's Island' matches best with: St. Philip's Lane, United King…

VBox(children=(HBox(children=(Label(value="'earth' matches best with: earth, Česko (49.5904859, 14.6651358)"),…

VBox(children=(HBox(children=(Label(value="'south-east' matches best with: South East Constituency 1947, Éire …

VBox(children=(HBox(children=(Label(value="'Ladybird' matches best with: Ladybird, United Kingdom (50.23320239…

VBox(children=(HBox(children=(Label(value="'Port Arthur' matches best with: Port Arthur, TAS (-43.13999939, 14…

VBox(children=(HBox(children=(Label(value="'Arthur' matches best with: Arthur, NSW (-32.363055555555555, 150.8…

VBox(children=(HBox(children=(Label(value="'Hells Gates' matches best with: Hells Gates, TAS (-42.20999908, 14…

VBox(children=(HBox(children=(Label(value="'England' matches best with: England, QLD (-27.433333333333334, 152…

VBox(children=(HBox(children=(Label(value="'New Town' matches best with: New Town, TAS (-42.84999847, 147.3000…

VBox(children=(HBox(children=(Label(value="'Grummet Island' matches best with: Grummet Island, TAS (-42.380001…

VBox(children=(HBox(children=(Label(value="'Grummet' matches best with: grummet, Deutschland (50.4723087, 11.9…

VBox(children=(HBox(children=(Label(value="'Malabar' matches best with: Malabar, NSW (-34.2227815, 148.163882)…

VBox(children=(HBox(children=(Label(value="'Dawes' matches best with: Dawes, VIC (-37.81597137, 146.4990234)")…

Record the selected candidates as the best matches. Account for placenames without any best match.

In [110]:
for place in geolocdata:

    # If there isn't a bestmatch, then there aren't any candidates
    if len(place['locations']['best_match'])!=0:
        
        best_match_index = 0

        # Must be at least one candidate
        if ('candidatebuttons' in place['locations'].keys()):

            # Extract the selection
            best_match_index = place['locations']['candidatebuttons'][0].index
            # Make sure the selection is a location
            if (best_match_index < len(place['locations']['candidates'])):
                # Record the best match
                topcandidate = place['locations']['candidates'][best_match_index]
                # Just record the important columns for the best match
                place['locations']['best_match'] = topcandidate.loc[:, ~topcandidate.columns.isin(['Gazetteer','RankFlags'])]
            else:
                # None of the above
                best_match = {'LocationName':"No suitable location selected",
                              'Category':"No suitable location selected",
                              'Latitude':"",
                              'Longitude': "",
                              'PartOf':"No suitable location selected"}
                place['locations']['best_match']= pd.DataFrame([best_match])
    else:
        # Note that there is no best match
        best_match = {'LocationName':"No location matched",
                      'Category':"No location matched",
                      'Latitude':"",
                      'Longitude': "",
                      'PartOf':"No location matched"}
        place['locations']['best_match']= pd.DataFrame([best_match])
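
The decision made for each placename in the cell above can be illustrated in isolation. This is a simplified sketch under stated assumptions, not the notebook's implementation: candidates are plain dictionaries, `resolve_selection` is a hypothetical helper, and treating a `None` index as "nothing selected" is an added assumption.

```python
# Placeholder record used when the user rejects every candidate
NO_SELECTION = {
    "LocationName": "No suitable location selected",
    "Category": "No suitable location selected",
    "Latitude": "",
    "Longitude": "",
    "PartOf": "No suitable location selected",
}


def resolve_selection(candidates, selected_index):
    """Return the chosen candidate, or a placeholder if none was chosen.

    An index equal to len(candidates) corresponds to the extra
    'None of the above' option appended after the real candidates.
    """
    if selected_index is not None and 0 <= selected_index < len(candidates):
        return candidates[selected_index]
    return dict(NO_SELECTION)  # copy, so callers cannot alter the template


candidates = [{"LocationName": "The Bay of Biscay", "Category": "park",
               "Latitude": "52.95425415", "Longitude": "-1.159789836905746",
               "PartOf": "United Kingdom"}]

print(resolve_selection(candidates, 0)["PartOf"])
print(resolve_selection(candidates, 1)["PartOf"])
```

Keeping the placeholder in one place makes it easier to ensure that every record has the same columns when the data is written out later.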

Prepare the final geoloc data for output to file.

In [111]:
pprint(geolocdata[0:10])

[{'locations': {'best_match':           LocationName             Category Latitude Longitude               PartOf
0  No location matched  No location matched                     No location matched,
                'candidates': []},
  'placename': 'the sleepy sea'},
 {'locations': {'best_match':                     LocationName                       Category Latitude Longitude                         PartOf
0  No suitable location selected  No suitable location selected                     No suitable location selected,
                'candidatebuttons': [RadioButtons(index=1, layout=Layout(width='max-content'), options=('The Bay of Biscay, United Kingdom (52.95425415,-1.159789836905746)', 'None of the above'), value='None of the above')],
                'candidates': [        LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM]},
  'placename': 'the Bay of Biscay'},

In [112]:
# Verbose output
alllocations=[] # the processed data
for place in geolocdata:
    placename = place['placename']
    pprint(placename)

    # Reformat the data about the best match
    bestlocations = []
    # All placenames should now have a best_match
    if len(place['locations']['best_match'])!=0:
        for column in place['locations']['best_match']:
            # Keep every column value, including empty strings, so the list
            # stays aligned with the best_match columns
            bestlocations.append(place['locations']['best_match'][column].item())

    # Reformat the data about the candidate locations
    locations = []
    if 'candidates' in place['locations'].keys() and place['locations']['candidates']!= []:
            short_candidates = []
            for candidate in place['locations']['candidates']:
                # select which columns to output
                short_candidates = short_candidates + [candidate.loc[:, candidate.columns != 'RankFlags']]
            # merge the dataframe values into a more human-readable format for now
            # though this is not a comma-separated format
            locations = pd.concat(short_candidates,ignore_index=True) 

    # Put this all together
    newrecord = [['placename',placename],
                ['best_match',bestlocations],
                ['candidates',locations]]
    alllocations = alllocations + [newrecord]

'the sleepy sea'
'the Bay of Biscay'
'Heath'
'London'
"Van Diemen's Land"
'Vickers'
'Sylvia'
'Bath'
"Julia Vickers's"
'Frere'
'Chatham'
'CHAPTER II'
'Surgeon Pine'
'Coromandel'
'Pine'
'India'
'the Hydaspes for Calcutta'
'the poop guard'
'MONOTONY'
"Three'll"
"Van Diemen's"
'Tasman'
'Cape Pillar'
"Pirates' Bay"
'east'
'west'
'the Isle of Wight'
'the South-West Cape'
'Swan Port'
'Mediterranean'
'Maria Island'
'the Three Thumbs'
'Peninsula'
'Storm Bay'
'Storing Island'
'Italy'
'Sorrell'
'Bruny Island'
'Mount Royal'
"D'Entrecasteaux Channel"
'Actaeon'
'the South Cape'
'New Norfolk'
'Derwent'
'the Southern Ocean'
'Tamar'
'Victoria'
'Port Philip Bay'
'Wellington'
'Dromedary'
'Mount Wellington'
'Launceston'
'Smyrna'
'Pyramid Island'
'Rocky Point'
'Port Davey'
'Mount Direction'
'Macquarie Harbour'
'Mount Heemskirk'
'Mount Zeehan'
"King's River"
'Sarah Island'
"Philip's Island"
'Hobart Town'
'earth'
'south-east'
'Ladybird'
'Commandant'
'Port Arthur'
'Honduras'
'Arthur'
'Hells Gates'
'England'
'

In [113]:
for place in alllocations[0:10]:
    pprint(place)

[['placename', 'the sleepy sea'],
 ['best_match',
  ['No location matched',
   'No location matched',
   '',
   '',
   'No location matched']],
 ['candidates', []]]
[['placename', 'the Bay of Biscay'],
 ['best_match',
  ['No suitable location selected',
   'No suitable location selected',
   '',
   '',
   'No suitable location selected']],
 ['candidates',
          LocationName Category     Latitude           Longitude          PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746  United Kingdom       OSM]]
[['placename', 'Heath'],
 ['best_match',
  ['Heath', 'trig station', '-32.13166666666667', '149.3011111111111', 'NSW']],
 ['candidates',
    LocationName        Category            Latitude          Longitude          PartOf Gazetteer
0        Heath    trig station  -32.13166666666667  149.3011111111111             NSW    TLCMap
1        Heath          parish               -19.6              143.6             QLD    TLCMap
2        Heath            site   

In [114]:
# Save all the geolocdata as a messy combination of array rows and dataframes
filename = "FtToHNL_matchedlocations.data"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving location data to", save_location)

# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           geolocdata,
           delimiter =", ", 
           fmt ='%s')

Saving location data to ../ner_output/FtToHNL_matchedlocations.data


In [115]:
# Final output
# Only use selected columns from the best match in csv format

# Make a new array of records
geoloc_output=[['Placename','PartOf','Latitude','Longitude']]
for place in geolocdata:
    # Only output those placenames with a best match location
    if len(place['locations']['best_match'])!=0:
        placename = place['placename']
        location = place['locations']['best_match']
        print("Formatting "+placename)
        pprint(location)
        
        # Convert Lat/Long to floats where present. The values may be stored
        # as strings or floats, and are empty strings when there is no match.
        latitude = location['Latitude'].item()
        if latitude != "":
            latitude = float(latitude)
        longitude = location['Longitude'].item()
        if longitude != "":
            longitude = float(longitude)
            
        # add this record to the list
        geoloc_output.append([placename,
                              location['PartOf'].item(),
                              latitude, 
                              longitude])

Formatting the sleepy sea
          LocationName             Category Latitude Longitude               PartOf
0  No location matched  No location matched                     No location matched
Formatting the Bay of Biscay
                    LocationName                       Category Latitude Longitude                         PartOf
0  No suitable location selected  No suitable location selected                     No suitable location selected
Formatting Heath
  LocationName      Category            Latitude          Longitude PartOf
0        Heath  trig station  -32.13166666666667  149.3011111111111    NSW
Formatting London
    LocationName Category  Latitude  Longitude          PartOf
282       London  Capital -0.083333       51.5  United Kingdom
Formatting Van Diemen's Land
  LocationName        Category    Latitude    Longitude     PartOf
0     Tasmania  administrative  -42.035067  146.6366887  Australia
Formatting Vickers
  LocationName Category     Latitude            Longitud

In [116]:
geoloc_output

[['Placename', 'PartOf', 'Latitude', 'Longitude'],
 ['the sleepy sea', 'No location matched', '', ''],
 ['the Bay of Biscay', 'No suitable location selected', '', ''],
 ['Heath', 'NSW', -32.13166666666667, 149.3011111111111],
 ['London', 'United Kingdom', -0.083333, 51.5],
 ["Van Diemen's Land", 'Australia', -42.035067, 146.6366887],
 ['Vickers', 'United Kingdom', 51.59845985, -1.7415535311162076],
 ['Sylvia', 'QLD', -20.85, 143.81666666666666],
 ['Bath', 'SA', -34.83706, 138.49131],
 ["Julia Vickers's", 'No location matched', '', ''],
 ['Frere', 'Italia', 44.4701528, 7.004274],
 ['Chatham', 'VIC', -37.82402802, 145.089035],
 ['CHAPTER II', 'No suitable location selected', '', ''],
 ['Surgeon Pine', 'Norfolk Island', -29.0569188, 167.9554764],
 ['Coromandel', 'SA', -35.0251942, 138.6143361],
 ['Pine', 'SA', -36.0, 135.0],
 ['India', 'Asia', 77.2, 28.6],
 ['the Hydaspes for Calcutta', 'No location matched', '', ''],
 ['the poop guard', 'No location matched', '', ''],
 ['MONOTONY', 'No l

In [117]:
filename = "FtToHNL_geolocdata.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving location data to", save_location)

# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           geoloc_output,
           delimiter =", ", 
           fmt ='%s')

Saving location data to ../ner_output/FtToHNL_geolocdata.csv
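
Writing rows with a plain `", "` delimiter does not quote fields, so a placename or region that itself contains a comma would be split across columns when the file is read back. As an alternative, the sketch below uses Python's standard `csv` module, which quotes such fields automatically. It is a minimal illustration rather than part of the project's pipeline: `write_geoloc_csv` and `to_float_or_blank` are hypothetical helpers, the output goes to an in-memory buffer instead of a file, and the last example row is invented to show the quoting.

```python
import csv
import io


def to_float_or_blank(value):
    """Convert a coordinate to float, leaving missing values as ''."""
    if value is None or value == "":
        return ""
    return float(value)


def write_geoloc_csv(rows, handle):
    """Write (placename, part_of, latitude, longitude) rows as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Placename", "PartOf", "Latitude", "Longitude"])
    for placename, part_of, latitude, longitude in rows:
        writer.writerow([placename, part_of,
                         to_float_or_blank(latitude),
                         to_float_or_blank(longitude)])


rows = [
    ("the sleepy sea", "No location matched", "", ""),
    ("Heath", "NSW", "-32.13166666666667", "149.3011111111111"),
    ("Hypothetical, Place", "TAS", "", ""),  # invented row containing a comma
]

buffer = io.StringIO()
write_geoloc_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function could be called with a file opened using `open(save_location, "w", newline="", encoding="utf-8")`.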
