# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Identifying Placenames in Text](#section-identifyingplacenames)
 * [Named Entity Recognition](#section-ner)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents as a step-by-step process. Various issues and considerations about such operations will be discussed during the process. Some choices will have to be made by you during certain steps in order to complete the process. The notebook uses the Python programming language but users are not expected to read or understand the Python code that is included. However, you will need a basic understanding of computer programming if you wish to edit aspects of the notebook.

<div class="alert alert-block alert-success">
This notebook will teach you how to:
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>
 

## Requirements <a class="anchor" id="section-requirements"></a>

This notebook uses various Python libraries. Some come with your Python installation, but the following need to be installed separately:
- pandas
- numpy
- ratelimit
- spaCy (and its `en_core_web_sm` English model)
- ipywidgets
- folium


In [1]:
%%capture
!pip install pandas
!pip install numpy
!pip install ratelimit
!pip install spacy
# You need the old version of this package due to issues setting titles with the latest version
!pip install ipywidgets=="7.6.2"
!pip install folium

!python -m spacy download en_core_web_sm

In [2]:
import os
import urllib
import pandas as pd
import numpy as np

# spaCy is used for a pipeline of NLP functions
import spacy

# ipywidgets is used for user interactive interfaces
import ipywidgets as widgets

# Imports for the map API requests and output formatting
import requests
import json
from pprint import pprint
from ratelimit import limits, sleep_and_retry

# Folium is used for displaying points on a map
import folium

In [3]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 115)

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You also want to set up some directories for the import and output of data.

In [4]:
# Declare the data directories
# This presumes that 'notebooks' is the current working directory
text_directory = os.path.normpath("../texts/")
csv_directory = os.path.normpath("../output/")
reference_directory = os.path.normpath("../data")

# Create the data directories
if not os.path.exists(text_directory):
    os.makedirs(text_directory)
if not os.path.exists(csv_directory):
    os.makedirs(csv_directory)
if not os.path.exists(reference_directory):
    os.makedirs(reference_directory)

For this workshop, you will be examining the text of *For the Term of His Natural Life* (abbreviated here as FtToHNL), an 1874 novel by Marcus Clarke that is in the public domain. The text was obtained from [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt) as an unformatted text file. For this workshop, it has been further simplified to contain only standard ASCII characters: accented characters have been replaced with their unaccented forms, and the British pound sterling symbol with the word *pounds*.

The novel is divided into four books, each based in different regions of the world. For instance, you might want to start with the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. Each book can be processed in its entirety, or as individual chapters.
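
The workshop supplies the chapters as separate files, but if you only had one file per book, you could split it into chapters yourself. The sketch below is one possible way to do this, not the method used to prepare the workshop files. It assumes that every chapter starts with a heading line such as `CHAPTER III.`, and the sample text and the function name `split_into_chapters` are made up for illustration.

```python
import re

# Hypothetical miniature of a book file, for illustration only
sample_book = (
    "CHAPTER I.\n\nFIRST TITLE.\n\nSome text of the first chapter.\n\n"
    "CHAPTER II.\n\nSECOND TITLE.\n\nSome text of the second chapter.\n"
)

# A chapter is assumed to start at a line such as 'CHAPTER III.'
chapter_heading = re.compile(r"^CHAPTER [IVXLC]+\.$", flags=re.MULTILINE)


def split_into_chapters(book_text):
    """Return a list of chapter strings, each starting with its heading."""
    starts = [match.start() for match in chapter_heading.finditer(book_text)]
    ends = starts[1:] + [len(book_text)]
    return [book_text[start:end].strip() for start, end in zip(starts, ends)]


chapters_text = split_into_chapters(sample_book)
for number, chapter_text in enumerate(chapters_text, start=1):
    print(number, repr(chapter_text[:25]))
```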

In [5]:
filename = "FtToHNL_BOOK_2_CHAPTER_3.txt"
print("Reading:", filename)

# Set the specific path for the 'filename'
text_location = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(text_location)

# Open the file, read all of it, then close it again
with open(text_location, encoding="utf-8") as text_file:
    text = text_file.read()

Reading: FtToHNL_BOOK_2_CHAPTER_3.txt


You have now read Chapter 3 of Book 2 into memory. This is currently no more than a long string of characters. So far, you have done no processing.

In [6]:
# Look at the first 500 characters
text[0:500]

'CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\nIn the house of Major Vickers, Commandant of Macquarie Harbour,\nthere was, on this evening of December 3rd, unusual gaiety.\n\nLieutenant Maurice Frere, late in command at Maria Island, had unexpectedly\ncome down with news from head-quarters.  The Ladybird, Government schooner,\nvisited the settlement on ordinary occasions twice a year, and such visits\nwere looked forward to with no little eagerness by the settlers.\nTo the convicts the arrival of the Ladybird mean'

## Identifying Placenames in Text <a class="anchor" id="section-identifyingplacenames"></a>
*This section provides tools for identifying placenames in textual data.*

### Named Entity Recognition <a class="anchor" id="section-ner"></a>

Texts like the start of Chapter 3, Book 2, mention a lot of places. Some of them are a little vague, like _the house of Major Vickers_, _head-quarters_ and _the settlement_, but some are very explicit, like _Maria Island_ and _Macquarie Harbour_. This notebook will show you how to use software to automatically identify the proper noun phrases that are placenames.

The following part of this notebook uses the Named Entity Recognition (NER) tool provided in the spaCy Python package. For an introduction to this tool and how it can be used to find Named Entities (NEs) like placenames in text, go to the [spaCy NER notebook](https://github.com/Australian-Text-Analytics-Platform/geolocation-tools-workshop/blob/7d92664ac44f86b90a0c098bb3159793a4fe6c16/Notebooks/spacy_ner_introduction.ipynb).

This NER need not be done one file at a time. You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. The results can all be collected in a single data structure, reviewed and saved to file.

In [7]:
# Dataframe where you store the details about each instance of the placenames
placenames_df = pd.DataFrame(columns=["Book", "Chapter", "NEIndex", "Placename", "Category", "Context", "Approval"])

In [8]:
# Define which chapters and books you want to examine
chapters = [1, 2, 3]
books = [1, 2]

Load a spaCy model for English.

In [9]:
nlp = spacy.load("en_core_web_sm")

You should define what spaCy processing you do or don't want in the pipeline. You mainly need the Tokenizer and the NER components. Others, like the Parser, slow the processing down.

In [10]:
disabled_pipeline = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]

Of course, not all NEs are placenames, so you will need to make a list of the categories that regularly contain placenames. This notebook focusses on the Location, Geo-Political Entity, Facility and Organisation categories. While these may not capture every instance of every placename, they have been observed to identify the majority of placenames without providing too many irrelevant NEs.

In [11]:
placename_categories = ["LOC", "GPE", "FAC", "ORG"]
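
To illustrate the effect of this list, here is a minimal sketch that filters (text, label) pairs of the kind spaCy produces. The sample entities and their labels are made up for illustration and are not output from the model.

```python
# Categories assumed to regularly contain placenames (same as above)
placename_categories = ["LOC", "GPE", "FAC", "ORG"]

# Hypothetical (text, label) pairs, standing in for spaCy's doc.ents
sample_entities = [
    ("Macquarie Harbour", "LOC"),
    ("December 3rd", "DATE"),
    ("Maria Island", "GPE"),
    ("Maurice Frere", "PERSON"),
]

# Keep only the entities whose label is one of the placename categories
candidate_placenames = [text for text, label in sample_entities if label in placename_categories]
print(candidate_placenames)
```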

Now you can start processing the chapters from FtToHNL.

In [12]:
i = 0  # Counter of the entities
placename_rows = []  # One record per candidate placename
for book in books:
    for chapter in chapters:
        # Construct the filename for this book and chapter
        filename = "FtToHNL_BOOK_" + str(book) + "_CHAPTER_" + str(chapter) + ".txt"

        # Set the specific path for the 'filename'
        text_location = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(text_location)

        # Read this chapter
        with open(text_location, encoding="utf-8") as text_file:
            text = text_file.read()
        print("Reading:", filename)

        # Run spaCy
        doc = nlp(text, disable=disabled_pipeline)

        # Examine every Named Entity found in this chapter
        for entity in doc.ents:
            # filter out MONEY, DATE, etc.
            if entity.label_ in placename_categories:

                # To help understand the context of the text, extract the occurrence
                # (max() stops the window from starting before the beginning of the text)
                context_start = max(0, entity.start_char - 30)
                context_text = doc.text[context_start : entity.end_char + 30].replace("\n", " ")

                # Add the placenames according to spaCy
                placename_rows.append(
                    {
                        "Book": book,  # The Book number
                        "Chapter": chapter,  # The Chapter number
                        "NEIndex": i,  # A reference number to the nth Named Entity
                        "Placename": entity.text,  # The placename in the text
                        "Category": entity.label_,  # The spaCy category
                        "Context": context_text,  # The textual context where the placename was found
                        "Approval": True,  # A flag for whether this is a suitable placename
                    }
                )

            i = i + 1  # Entity counter

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
placenames_df = pd.DataFrame(
    placename_rows,
    columns=["Book", "Chapter", "NEIndex", "Placename", "Category", "Context", "Approval"],
)

Reading: FtToHNL_BOOK_1_CHAPTER_1.txt
Reading: FtToHNL_BOOK_1_CHAPTER_2.txt
Reading: FtToHNL_BOOK_1_CHAPTER_3.txt
Reading: FtToHNL_BOOK_2_CHAPTER_1.txt
Reading: FtToHNL_BOOK_2_CHAPTER_2.txt
Reading: FtToHNL_BOOK_2_CHAPTER_3.txt

In [13]:
# Output what placenames have been found
placenames_df[["Book", "Chapter", "NEIndex", "Placename", "Category"]]

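|   | Book | Chapter | NEIndex | Placename | Category |
|---|------|---------|---------|-----------|----------|
| 0 | 1 | 1 | 0 | CHAPTER I.\n\nTHE PRISON SHIP | ORG |
| 1 | 1 | 1 | 6 | Crown | ORG |
| 2 | 1 | 1 | 10 | Crown | ORG |
| 3 | 1 | 1 | 11 | Crown | ORG |
| 4 | 1 | 1 | 16 | the Bay of Biscay | LOC |
| 5 | 1 | 1 | 19 | London | GPE |
| 6 | 1 | 1 | 33 | Van Diemen's | GPE |
| 7 | 1 | 1 | 34 | Vickers | ORG |
| 8 | 1 | 1 | 39 | Vickers | ORG |
| 9 | 1 | 1 | 41 | Vickers | ORG |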
9,1,1,41,Vickers,ORG


This shows that a lot more placenames were found in Book 2 than Book 1. This makes sense since Book 1 is set on board a ship during an ocean voyage, whereas Book 2 is set at Macquarie Harbour in Australia. You will also see that the same placename might be recognised multiple times but be categorised differently. This is because the spaCy NER model categorises each NE instance in isolation, based on the linguistic context in which that instance appears. It does not assign the category based on multiple instances of an NE.
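
If you want to see which placenames received more than one category, you can group the categories by placename. The following is a minimal sketch of that idea. The sample rows are hypothetical and only illustrate the shape of the data; they are not results from the model.

```python
from collections import defaultdict

# Hypothetical (placename, category) pairs, for illustration only
sample_rows = [
    ("Hobart Town", "GPE"),
    ("Vickers", "ORG"),
    ("Hobart Town", "ORG"),
    ("Vickers", "ORG"),
]

# Collect the set of categories assigned to each placename
categories_by_placename = defaultdict(set)
for placename, category in sample_rows:
    categories_by_placename[placename].add(category)

# Placenames that received more than one category
inconsistent = {
    placename: sorted(categories)
    for placename, categories in categories_by_placename.items()
    if len(categories) > 1
}
print(inconsistent)
```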

### Reviewing Candidate Placenames <a class="anchor" id="section-reviewplacenames"></a>

However, a number of these NEs are unlikely to be placenames, regardless of how spaCy categorised them. You want to be able to filter them out. Sometimes, though, it is hard to work out whether an NE is a person, an organisation or a location. For instance, spaCy identifies _Van Diemen's_ and _Tasman_ as NEs. Both could be the names of people (Dutch explorer Abel Tasman and Governor-General of the Dutch East Indies Anthony van Diemen), but they are actually parts of the larger NEs _Van Diemen's Land_ (the former name for the state of Tasmania) and _Tasman's Head_ (a headland in Tasmania). For this, it is best to consider the context in which the terms were used.

The following interface shows you each NE instance identified by spaCy's NER and the context in which it occurs. Use the checkboxes to select which terms you consider to be placenames.

In [14]:
# Function that is used when a checkbox changes
def changed(b):
    k = b["new"]

# Lists of the data required for displaying the checkboxes
placename_items = []
context_items = []
num_items = []

# Make checkboxes for every placename
for book in books:
    for chapter in chapters:

        # Get the NEs from this book and chapter
        placenames_book_chapter = placenames_df[(placenames_df["Book"] == book) & (placenames_df["Chapter"] == chapter)]

        # Make lists of the candidate placenames, context text and index numbers
        # Only the placenames are given a checkbox.
        placename_items = placename_items + [
            widgets.Checkbox(True, description=name) for name in placenames_book_chapter["Placename"]
        ]
        context_items = context_items + [widgets.Label(context) for context in placenames_book_chapter["Context"]]
        num_items = num_items + [widgets.Label(str(index)) for index in placenames_book_chapter["NEIndex"]]

# Create a display
num_placenames = len(placename_items)
left_box = widgets.VBox(placename_items)
right_box = widgets.VBox(context_items)
num_box = widgets.VBox(num_items)
whole_box = widgets.HBox([num_box, left_box, right_box])

print("Unselect any Named Entities (NEs) that you do not consider to be placenames.")
print("Each instance of an NE is listed, with the textual context in which it appeared.")

display(whole_box)

for n in range(num_placenames):
    placename_items[n].observe(changed)

Unselect any Named Entities (NEs) that you do not consider to be placenames.
Each instance of an NE is listed, with the textual context in which it appeared.


HBox(children=(VBox(children=(Label(value='0'), Label(value='6'), Label(value='10'), Label(value='11'), Label(…

You can now copy all the values from the checkboxes to the data, so you know which placenames you have approved.

In [15]:
# Transfer the status of each checklist item to the data
for n in range(num_placenames):

    NEIndex_num = int(num_items[n].value)
    approval_flag = placename_items[n].value

    # Set the flag of the matching NE instance to match the checklist
    placenames_df.loc[placenames_df["NEIndex"] == NEIndex_num, "Approval"] = approval_flag

You can now visualise the result.

In [16]:
placenames_df[["NEIndex", "Placename", "Approval"]]

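|   | NEIndex | Placename | Approval |
|---|---------|-----------|----------|
| 0 | 0 | CHAPTER I.\n\nTHE PRISON SHIP | True |
| 1 | 6 | Crown | True |
| 2 | 10 | Crown | True |
| 3 | 11 | Crown | True |
| 4 | 16 | the Bay of Biscay | True |
| 5 | 19 | London | True |
| 6 | 33 | Van Diemen's | True |
| 7 | 34 | Vickers | True |
| 8 | 39 | Vickers | True |
| 9 | 41 | Vickers | True |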


From this, you can extract the final list of distinct placenames that you have approved. The names aren't sorted (though they could be), but the list will help you find any NE that you missed unselecting on the checklist. If you find one, go back to the checklist, unselect it, then run all the steps from there to here again.

In [17]:
# Make a unique list of the approved placenames
approved_placenames = placenames_df[placenames_df["Approval"] == True]["Placename"].unique()
print(approved_placenames)

['CHAPTER I.\n\nTHE PRISON SHIP' 'Crown' 'the Bay of Biscay' 'London'
 "Van Diemen's" 'Vickers' 'Familiarity' 'Chatham' 'Frere' 'Sylvia'
 'Surgeon Pine' 'Coromandel' 'Pine' "the King's Regulations" 'India'
 'the Hydaspes for Calcutta' 'shell--"why' 'Modesty' 'CHAPTER I.\n\n'
 'Cape Pillar' 'Pirates' 'the Isle of Wight' 'South' 'Mediterranean'
 'Cape Bougainville' 'Maria Island' 'Peninsula' 'Pillar' 'Storm Bay'
 'Storing Island' 'Hobart' 'Sorrell' 'Pittwater' 'Bruny Island' 'Actaeon'
 'Recherche Bay' 'the South Cape' 'New Norfolk' 'the Southern Ocean'
 'Victoria' 'Port Philip Bay' 'Bay' "Wyld's Crag" 'Wellington' 'Dromedary'
 'Derwent' 'Mount Wellington' 'south bay' 'Smyrna' 'Cape Grim'
 'Pyramid Island' 'Rocky Point' 'Port Davey' 'Cape' 'Huon' 'Navigation'
 'Macquarie Harbour' 'Mount Heemskirk' "King's River" 'Gordon'
 'Sarah Island' 'Philip' 'Island' 'Nature' 'Ladybird'
 'the Hobart Town Gaol' 'Commandant' 'Honduras' 'Honest' 'Hobart Town'
 'Hells Gates' 'Arthur' 'England' 'New Town' 
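
If a sorted list is easier to scan, the names can be ordered alphabetically while ignoring upper and lower case. This is a minimal sketch that uses a handful of the names from the output above rather than the full list.

```python
# A few of the approved placenames shown above (with one repeated on purpose)
sample_placenames = ["the Bay of Biscay", "London", "Crown", "Hobart", "London"]

# Remove duplicates, then sort without regard to upper or lower case
sorted_placenames = sorted(set(sample_placenames), key=str.casefold)
print(sorted_placenames)
```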

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [18]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to", save_location)

Saving placename data to ../output/FtToHNL_placenames.csv


In [19]:
# Save the list, using savetxt from the numpy module
np.savetxt(save_location, approved_placenames, delimiter=", ", fmt="%s")

You now have a copy of what you consider to be a list of placenames from a select set of FtToHNL chapters. Each placename found and approved is only included once. For this notebook, this data does not include contextual data like how many times a placename was found in the original text, nor where it was found.
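
One thing to watch for: a placename that contains line breaks, such as `'CHAPTER I.\n\nTHE PRISON SHIP'`, is spread over several lines of a plain text file and comes back as separate rows when the file is read again. The sketch below shows one possible alternative (it is not what this notebook does) using Python's built-in `csv` module, which quotes such values so each placename stays in one row. It writes to an in-memory buffer instead of a real file.

```python
import csv
import io

# Two placenames from above; the first contains embedded line breaks
sample_placenames = ["CHAPTER I.\n\nTHE PRISON SHIP", "the Bay of Biscay"]

# Write one placename per row; csv.writer quotes values that contain newlines
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for placename in sample_placenames:
    writer.writerow([placename])

# Read the rows back to confirm that nothing was split
buffer.seek(0)
restored = [row[0] for row in csv.reader(buffer)]
print(restored == sample_placenames)
```

Whether you want this depends on your data. An entry such as a chapter heading is not a placename anyway, so unselecting it in the checklist avoids the issue altogether.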

You can easily change the notebook commands to search for placenames in different chapters or even look at a different single file.

## Finding Locations for the Placenames <a class="anchor" id="section-findinglocs"></a>

Now that you have a list of placenames from the text, the next step is to work out their location on Earth. For this you can use a combination of specialised lists of locations, gazetteers and heuristics. The objective is to match every placename with the coordinates of a known location.

The first step is to read the file of your placenames.

In [20]:
filename = "FtToHNL_placenames.csv"
print("Reading:", filename)

# Set the specific path for the 'filename'
data_location = os.path.normpath(os.path.join(csv_directory, filename))
data_filename = os.path.basename(data_location)

# Read the csv file using pandas. This will place it in a dataframe format.
placenames_df = pd.read_csv(data_location, encoding="utf-8", header=None)
placenames_df = placenames_df.rename(columns={placenames_df.columns[0]: "Placename"})
placenames_df

Reading: FtToHNL_placenames.csv


|   | Placename |
|---|-----------|
| 0 | CHAPTER I. |
| 1 | THE PRISON SHIP |
| 2 | Crown |
| 3 | the Bay of Biscay |
| 4 | London |
| 5 | Van Diemen's |
| 6 | Vickers |
| 7 | Familiarity |
| 8 | Chatham |
| 9 | Frere |


### Identifying States and Capitals <a class="anchor" id="section-statescapitals"></a>

Some placenames, like *High Street* or *Maryborough*, may be very common across the world, or even in Australia. However, certain placenames refer to significant locations, like states, territories, large geographic features or capital cities. As such, if they are mentioned in a text, the placename is more likely to refer to the major location than a small town or village.

These significant locations are a finite set. They can be defined in a reference file that can be reused when reviewing the placenames of any text.

A good point for you to start with is a file about locations like modern capital cities and countries. You can also add any historical locations of significance that are commonly relevant to your research.

In [21]:
filename = "reference_location_data.csv"
print("Reading:", filename)

# Set the specific path for the 'filename'
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Reading: reference_location_data.csv


In [22]:
# Place the reference data in a dataframe
locref_df = pd.read_csv(reference_location, encoding="utf-8", header=0)
locref_df

|     | LocationName | Category | Latitude | Longitude | PartOf |
|-----|--------------|----------|----------|-----------|--------|
| 0 | Melbourne | City | -37.814218 | 144.963161 | VIC |
| 1 | Brisbane | City | -27.468968 | 153.023499 | QLD |
| 2 | Perth | City | -31.955896 | 115.860580 | WA |
| 3 | Darwin | City | -12.460440 | 130.841047 | NT |
| 4 | Alice Springs | City | -23.698388 | 133.881289 | NT |
| ... | ... | ... | ... | ... | ... |
| 558 | Zagreb | Capital | 16.000000 | 45.800000 | Croatia |
| 559 | Zambia | Country | 28.283333 | -15.416667 | Africa |
| 560 | Zanzibar City | Capital | 39.198914 | -6.165193 | Tanzania |
| 561 | Zimbabwe | Country | 31.033333 | -17.816667 | Africa |


Each location is given a category (like _Continent_, _Country_, _Capital_ or _City_), latitude and longitude coordinates, and some indication of where in the world it is located. This _PartOf_ value may be a continent, a country or a region within a country (like a state or province). While you may only want to know the coordinates of any matching placename, the extra information will be vital in working out what is a relevant match.
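
To make the structure of a reference entry concrete, the sketch below reads the first few rows shown above and indexes them by location name. It uses Python's built-in `csv` module on an in-memory string rather than the real reference file, and the dictionary keys (`category`, `coordinates`, `part_of`) are names chosen for this example only.

```python
import csv
import io

# The first rows of the reference data shown above, as csv text
reference_csv = """LocationName,Category,Latitude,Longitude,PartOf
Melbourne,City,-37.814218,144.963161,VIC
Brisbane,City,-27.468968,153.023499,QLD
Perth,City,-31.955896,115.860580,WA
"""

# Index the rows by location name for quick lookups
reference_by_name = {}
for row in csv.DictReader(io.StringIO(reference_csv)):
    reference_by_name[row["LocationName"]] = {
        "category": row["Category"],
        "coordinates": (float(row["Latitude"]), float(row["Longitude"])),
        "part_of": row["PartOf"],
    }

print(reference_by_name["Perth"])
```

Here the `part_of` value tells you that this _Perth_ is the one in Western Australia, which the coordinates alone would not make obvious at a glance.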

As you can see, this allows some bias to be introduced into the data to suit your geolocation needs. For instance, _Perth_ is entered in this file as a city in the state of Western Australia, rather than one in Scotland. _Victoria_ is recorded as a state of Australia, rather than the capital of the Seychelles, or the capital of British Columbia, Canada. These locations are chosen because FtToHNL is mainly set in 19th century Australia.

If you are researching historical texts, then you might have to consider adding various locations to this list to accommodate changes that may have occurred over time. For instance:
* change of names, e.g., _New Amsterdam_ vs [_Gotham_](https://www.nypl.org/blog/2011/01/25/so-why-do-we-call-it-gotham-anyway) vs _New York_, _Constantinople_ vs _Istanbul_, _Ceylon_ vs _Sri Lanka_.
* change of spelling, e.g., _Peking_ is a [romanized form](https://en.wikipedia.org/wiki/Chinese_postal_romanization) of _Beijing_
* change of significance, like which city is a capital, e.g., _Bonn_ vs _Berlin_
* names of previous geopolitical realms, like countries, kingdoms and empires, e.g., the _British Empire_, the _Zulu Kingdom_
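
An alternative to adding a separate reference entry for every historical name is to translate historical names to their modern equivalents before looking them up. The sketch below illustrates that idea with the examples above. The `historical_aliases` table and the `modern_name` function are hypothetical and are not part of this notebook's reference file.

```python
# Hypothetical table of historical names and the modern names they map to,
# using the examples given above
historical_aliases = {
    "Constantinople": "Istanbul",
    "Ceylon": "Sri Lanka",
    "Peking": "Beijing",
    "New Amsterdam": "New York",
}


def modern_name(placename):
    """Return the modern name for a historical placename, or the name unchanged."""
    return historical_aliases.get(placename, placename)


print(modern_name("Ceylon"))
print(modern_name("London"))
```

A simple mapping like this only suits names with a single clear successor. Changes of significance, or realms that no longer exist, still need their own entries.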

Obviously, you can't record every location everywhere, but the role of this reference file is to provide you with information on the major locations that might be relevant to your research, regardless of what textual document you are analysing.

The next step is to see if any of the placenames from our selected chapters of FtToHNL match these locations.

For each placename, you will be looking for the location that is the best match. If a location from the reference file has a name that matches a placename, then the geolocation data for that location is recorded as the best match for this placename. These records are kept independent from the reference file. Otherwise, no geolocation data is recorded, so you know to keep looking for the placename.

In [23]:
# All the data about placenames and locations, once linked, as a list of dictionaries
geolocdata = []

for placename in placenames_df["Placename"]:

    # Create a new geoloc entry about this placename
    new_geolocdata = {}

    # Start a record for a placename
    new_geolocdata["placename"] = placename
    new_geolocdata["locations"] = {}  # Start with no location details
    new_geolocdata["locations"]["best_match"] = []  # Start with no matching location
    new_geolocdata["locations"]["candidates"] = []  # Start with no candidate locations
    
    # Match found in the reference data
    if placename in list(locref_df["LocationName"]):
        # Copy the details from the reference file entry
        new_geolocdata["locations"]["best_match"] = locref_df[locref_df["LocationName"] == placename]

        print("+++ Found", placename)
    else:
        print("--- Still looking for", placename)

    # Add the new placename data to the list
    geolocdata.append(new_geolocdata)

--- Still looking for CHAPTER I.
--- Still looking for THE PRISON SHIP
--- Still looking for Crown
--- Still looking for the Bay of Biscay
+++ Found London
--- Still looking for Van Diemen's
--- Still looking for Vickers
--- Still looking for Familiarity
--- Still looking for Chatham
--- Still looking for Frere
--- Still looking for Sylvia
--- Still looking for Surgeon Pine
--- Still looking for Coromandel
--- Still looking for Pine
--- Still looking for the King's Regulations
+++ Found India
--- Still looking for the Hydaspes for Calcutta
--- Still looking for shell--"why
--- Still looking for Modesty
--- Still looking for CHAPTER I.
--- Still looking for Cape Pillar
--- Still looking for Pirates
--- Still looking for the Isle of Wight
--- Still looking for South
--- Still looking for Mediterranean
--- Still looking for Cape Bougainville
--- Still looking for Maria Island
--- Still looking for Peninsula
--- Still looking for Pillar
--- Still looking for Storm Bay
--- Still looking for

For example, of the first ten placenames (depending on which ones you approved earlier), only _London_ matched a location in the reference file. It turns out that it is the capital of the United Kingdom! No confusion there! However, you are still trying to find locations for the other placenames.

In [24]:
# Show the resulting geolocation records for the first 10 placenames
geolocdata[:10]

[{'placename': 'CHAPTER I.',
  'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'THE PRISON SHIP',
  'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'Crown', 'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'the Bay of Biscay',
  'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'London',
  'locations': {'best_match':     LocationName Category  Latitude  Longitude          PartOf
   282       London  Capital -0.083333       51.5  United Kingdom,
   'candidates': []}},
 {'placename': "Van Diemen's",
  'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'Vickers', 'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'Familiarity',
  'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'Chatham', 'locations': {'best_match': [], 'candidates': []}},
 {'placename': 'Frere', 'locations': {'best_match': [], 'candidates': []}}]

There is nothing complex about this matching. It expects the spelling of the location and the placename to be exactly the same. It doesn't try to accommodate differences in spelling or whether the placename is in upper or lower case. Since these locations are supposed to be regarded as unambiguous, you don't want to try to accommodate such variations.
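
The behaviour of exact matching can be seen in a few lines. This is a minimal sketch using a short list of names in place of the reference file; the function name `exact_match` is made up for this example.

```python
# A few location names of the kind held in the reference file
reference_names = ["London", "India", "Hobart"]


def exact_match(placename):
    """True only if the placename is spelled and capitalised exactly as in the reference."""
    return placename in reference_names


for candidate in ["London", "london", "Londn"]:
    print(candidate, exact_match(candidate))
```

Only the first candidate matches. The lower-case and misspelled variants are left for later steps.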

What locations did you end up finding?

In [25]:
matched_data = [
    # Convert each best_match dataframe into a single csv string
    place["locations"]["best_match"].to_csv(index=False, header=False).strip('\n')
    for place in geolocdata
    if len(place["locations"]["best_match"]) > 0
]
matched_data

['London,Capital,-0.083333,51.5,United Kingdom',
 'India,Country,77.2,28.6,Asia',
 'Hobart,City,-42.893851,147.2720857717038,TAS',
 'Victoria,State,-37.8142176,144.9631608,Australia',
 'Wellington,Capital,174.783333,-41.3,New Zealand',
 'Honduras,Country,-87.216667,14.1,Central America',
 'Hobart Town,City,-42.893851,147.2720857717038,TAS',
 'Sydney,City,-33.8698439,151.2082848,NSW']

You can now discard the data from the reference file. All the geolocation details for any matches will stay with the placenames.

In [26]:
del locref_df

### Searching a Gazetteer for Locations <a class="anchor" id="section-searchgazetteer"></a>

Another source of possible matching locations is a gazetteer. Gazetteers are databases of geolocation records that can commonly be searched through the use of Application Programming Interfaces (APIs), a series of software functions that enable external software developers to send query commands to the databases. The database will then respond with the requested information.

The [Open Street Map (OSM)](https://www.openstreetmap.org/about) is one such gazetteer that freely provides such a service. This notebook uses the [Nominatim API](https://nominatim.org/release-docs/develop/api/Search/) to send queries to the OSM gazetteer. These API queries look like website addresses (URLs) but allow you to include information about what locations you are looking for and various parameters about how you want to search for it.

For instance, you can limit how many matching locations you want to know about. The default for the OSM is 10 records per query ([a maximum of 50](https://nominatim.org/release-docs/develop/api/Search/)) but let's limit it to only 5. 

In [27]:
# How many (max) results do we want for each name?
OSM_limit = 5

Some gazetteers limit how often you can send them queries. By default, the OSM API allows at most 1 query per second per user ([Nominatim policies](https://operations.osmfoundation.org/policies/nominatim/)), but because all the workshop's BinderHub sessions are regarded as coming from the same user, you will have to set a longer delay between queries. Likewise, you might want to limit how many queries the workshop sends. 

In [28]:
# number of seconds between queries
# DEFAULT = 1
OSM_query_delay = 5 

# Number of placenames to search for in the queries
workshop_OSM_query_limit = 25
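
The effect of a query delay can be pictured with a standard-library sketch. The `throttle` decorator and `fake_query` function below are hypothetical stand-ins, written only to show the idea of spacing out calls. The notebook itself relies on the `sleep_and_retry` and `limits` decorators instead.

```python
import time
from functools import wraps


def throttle(min_interval):
    """Decorator: wait so that calls are at least min_interval seconds apart."""
    def decorator(func):
        last_call = None

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal last_call
            if last_call is not None:
                remaining = min_interval - (time.monotonic() - last_call)
                if remaining > 0:
                    time.sleep(remaining)
            last_call = time.monotonic()
            return func(*args, **kwargs)

        return wrapper
    return decorator


# A very short delay is used here so the example runs quickly
@throttle(0.05)
def fake_query(name):
    return f"queried {name}"


start = time.monotonic()
results = [fake_query(name) for name in ["Crown", "Chatham", "Sylvia"]]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

Three calls with a 0.05 second delay take roughly 0.1 seconds in total, since only the second and third calls have to wait.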

You might also want to set up various functions that process the responses from the OSM, converting it into a more manageable form for your purposes.

In [29]:
# Set up proxy for OSM Nominatim API (if defined)
nominatim_proxy = os.getenv('NOMINATIM_PROXY')
proxies = {'https': nominatim_proxy} if nominatim_proxy else None
verify = not nominatim_proxy # Do not verify certificates if using proxy
if nominatim_proxy: # Suppress insecure request warning for caching proxy
    import urllib3
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Send rate-limited requests: at most 1 request every OSM_query_delay seconds
@sleep_and_retry
@limits(calls=1, period=OSM_query_delay)
def osm_call_api(url):
    response = requests.get(url, proxies=proxies, verify=verify)
    return response

# Convert a postcode from a string into a state abbreviation
def postcode_to_state(postcodestr):
    postcode = int(postcodestr)

    if (1000 <= postcode <= 2599) or (2619 <= postcode < 2899) or (2921 <= postcode <= 2999):
        return "NSW"
    elif (200 <= postcode <= 299) or (2600 <= postcode <= 2618) or (2900 <= postcode <= 2920):
        return "ACT"
    elif (3000 <= postcode <= 3999) or (8000 <= postcode <= 8999):
        return "VIC"
    elif (4000 <= postcode <= 4999) or (9000 <= postcode < 9999):
        return "QLD"
    elif 5000 <= postcode <= 5999:
        return "SA"
    elif (6000 <= postcode <= 6797) or (6800 <= postcode <= 6999):
        return "WA"
    elif 7000 <= postcode <= 7999:
        return "TAS"
    elif 800 <= postcode <= 999:
        return "NT"
    # Some postcodes are special cases
    elif postcode == 2899:
        return "Norfolk Island"  # Coded as NSW
    elif postcode == 6798:
        return "Christmas Island"  # Coded as WA
    elif postcode == 6799:
        return "Cocos (Keeling) Islands"  # Coded as WA
    elif postcode == 9999:
        return "North Pole"  # Coded as VIC for Santa mail

    # Fallback
    return postcodestr

# Format the api response to make comparison easier
def osm_format_response(input):

    # OSM tends to provide the full address as the name of a location
    # Shorten the name and extract the country name, if any
    hyper_location = None
    short_location = input["display_name"]  # default to full address

    # Only split the address when it contains a comma
    if "," in input["display_name"]:
        # Break up the address
        namesplit = input["display_name"].split(",")
        short_location = namesplit[0].strip()
        # Extract the rightmost term from the split
        hyper_location = namesplit[-1].strip()
        # Change to an Australian state name, rather than Australia
        if hyper_location == "Australia" and len(namesplit) > 2:
            hyper_location = namesplit[-2].strip()
            # Change postcodes into states
            if hyper_location.isdigit() and len(hyper_location) == 4:
                hyper_location = postcode_to_state(hyper_location)

    # Keep the field names consistent between gazetteer records
    response = {
        "LocationName": str(short_location),
        "Category": str(input["type"]),
        "Latitude": input["lat"],
        "Longitude": input["lon"],
        "PartOf": str(hyper_location),
        "Gazetteer": "OSM",
    }
    return response

You can now search for any unknown placename by sending the OSM a query.  You won't need to search for any placenames that you have already found an unambiguous location for.

In [30]:
# For every placename in our list
for place in geolocdata[0:workshop_OSM_query_limit]:

    # Already found an unambiguous location, so skip to the next placename
    if len(place["locations"]["best_match"]) > 0:
        continue

    placename = place["placename"]

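    # Query the OSM database
    # URL-encode the placename so spaces, quotes and ampersands cannot break the query
    safe_placename = urllib.parse.quote(placename)
    url = f"https://nominatim.openstreetmap.org/search?q={safe_placename}&format=json&limit={OSM_limit}"
    response = osm_call_api(url)
    response_dict = json.loads(response.text)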

    # Start a list of possible candidate locations for this placename
    place["locations"]["candidates"] = None

    # If no results found, then skip to the next placename
    if len(response_dict) == 0:
        print("--- Still looking for", placename)
        continue
        
    print("+++ Found", placename)
        
    # Save the candidate locations for later processing
    new_candidates = []

    # Handle results found
    for response_record in response_dict:
        # Tidy up the responses to the query, discarding any data you don't need
        # Add the data to a dataframe
        cleaned_response = pd.DataFrame([osm_format_response(response_record)])
        
        # The cleaned_response should be in the form you want to keep.
        new_candidates.append(cleaned_response)

    # Add the results for this placename to the geoloc dataframe
    place["locations"]["candidates"] = new_candidates

+++ Found CHAPTER I.
--- Still looking for THE PRISON SHIP
+++ Found Crown
+++ Found the Bay of Biscay
+++ Found Van Diemen's
+++ Found Vickers
--- Still looking for Familiarity
+++ Found Chatham
+++ Found Frere
+++ Found Sylvia
+++ Found Surgeon Pine
+++ Found Coromandel
+++ Found Pine
--- Still looking for the King's Regulations
--- Still looking for the Hydaspes for Calcutta
--- Still looking for shell--"why
+++ Found Modesty
+++ Found CHAPTER I.
+++ Found Cape Pillar
+++ Found Pirates
+++ Found the Isle of Wight
+++ Found South
+++ Found Mediterranean


The OSM provides a lot of information in response to each query. Not all of it is needed, so you will need to prune it, only keeping the data required for later processing.

For instance, this is the full response to the OSM query for the last placename. 

In [31]:
response_dict

[{'place_id': 244008172,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'way',
  'osm_id': 695975547,
  'boundingbox': ['35.2554556', '35.2559192', '33.903377', '33.9046383'],
  'lat': '35.255701349999995',
  'lon': '33.904011953206904',
  'display_name': 'Mediterranean, İskele, Kuzey Kıbrıs Türk Cumhuriyeti, Κύπρος - Kıbrıs',
  'class': 'landuse',
  'type': 'residential',
  'importance': 0.30999999999999994},
 {'place_id': 101362923,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'way',
  'osm_id': 7243263,
  'boundingbox': ['33.7438393', '33.7440801', '-116.3549966', '-116.3524485'],
  'lat': '33.7439509',
  'lon': '-116.3534839',
  'display_name': 'Mediterranean, Palm Desert, Riverside County, California, 92210, United States',
  'class': 'highway',
  'type': 'residential',
  'importance': 0.21},
 {'place_id': 230469276,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.

As you can see, the full address is given as the name for each location listed. The fields use different names to your previously matched geolocation data. Not all this data is beneficial to you, so the notebook has only kept the following data about each candidate location.

In [32]:
# The last location in the last OSM query response
cleaned_response

|   | LocationName | Category | Latitude | Longitude | PartOf | Gazetteer |
|---|---|---|---|---|---|---|
| 0 | Mediterranean | residential | 38.9919995 | -74.8279859 | United States | OSM |


Rather than keeping the entire address, the name of the matching location has been shortened to the leftmost phrase in the address. Likewise, the _PartOf_ value is generally extracted from the rightmost phrase in the address. However, for FtToHNL, you need to discriminate between locations in different states of Australia. For that reason, the notebook code uses the second rightmost phrase in the address for Australian locations. This is normally the name of the state, but sometimes it is an Australian postcode, so the notebook works out which state the postcode belongs to and uses that instead. For your own research, you may similarly need to tailor how you handle the output from each gazetteer you use. 
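
The splitting described above can be sketched on its own, away from the rest of the notebook code. The helper `split_display_name` is a hypothetical, simplified version of that logic. The first address comes from the OSM response shown earlier, while the two Australian addresses are invented for illustration.

```python
def split_display_name(display_name):
    """Return (short_name, part_of) from a comma-separated address string."""
    parts = [part.strip() for part in display_name.split(",")]
    short_name = parts[0]
    part_of = parts[-1] if len(parts) > 1 else None
    # For Australian addresses, prefer the state (or postcode) over the country
    if part_of == "Australia" and len(parts) > 2:
        part_of = parts[-2]
    return short_name, part_of


print(split_display_name(
    "Mediterranean, Palm Desert, Riverside County, California, 92210, United States"
))
# Invented addresses, for illustration only
print(split_display_name("Hobart, City of Hobart, Tasmania, Australia"))
print(split_display_name("Sandy Bay, Hobart, Tasmania, 7005, Australia"))
```

In the last case the second rightmost phrase is a postcode, which is why the notebook then converts postcodes into state abbreviations.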

You will also notice that the OSM output has a _class_ and a _type_ value that are both similar to the _Category_ value from our reference file and the NER, yet also have distinctly different values. This is a common issue. Just as there is no universal set of categories used by NER systems, there is no universal set for gazetteers. In fact, even the OSM doesn't have a clearly defined set of values. Some values are very general, e.g., _locality_ or _administrative_, but others are very specific, e.g., _cafe_ or _place_of_worship_. However, the category is important contextual information to use when evaluating which locations may be relevant. _type_ is more specific (or fine-grained) than _class_, so it is used as the _Category_ for any OSM matching locations.

The OSM also provides an _importance_ value for each location in a query response. This is not useful for judging matches because it relates to the importance of the location, based on [how often a related page is linked to in Wikipedia](https://github.com/osm-search/wikipedia-wikidata), rather than the quality of the search match. The ranking order of the OSM locations in a query response [can be customised](https://nominatim.org/release-docs/develop/customize/Ranking/), but this notebook just uses the default settings. 

It should also be noted that the Nominatim OSM queries don't try to find exact matches for the placenames. Nominatim isn't very clear about how exactly it matches the placenames, but partial matches seem to be made. For instance, _Mount Royal_ matches with locations called _Mount Royal_ as well as _Mont-Royal_. Likewise, _South Cape_ matches with _Cape_. These indicate that when searching for multi-word expressions, the Nominatim API will match both the entire expression and single words within the expression. There are no clear instructions on how you can narrow down the search to only look for exact matches. 
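
If partial matches are a problem for your research, one possible workaround is to filter the results yourself after they come back. The sketch below is an assumption about how that could be done, not a feature of Nominatim. The helper `keep_exact_matches` and the sample records are invented for illustration, and the comparison deliberately ignores case.

```python
def keep_exact_matches(placename, records):
    """Keep records whose leftmost address phrase equals the placename (ignoring case)."""
    wanted = placename.strip().casefold()
    return [
        record
        for record in records
        if record["display_name"].split(",")[0].strip().casefold() == wanted
    ]


# Invented records that mimic the shape of a query response
sample_records = [
    {"display_name": "Mount Royal, Example Region, Example Country"},
    {"display_name": "Mont-Royal, Example Region, Example Country"},
]

print(keep_exact_matches("Mount Royal", sample_records))
```

A filter like this discards near matches such as _Mont-Royal_, so it is only appropriate when you are confident about the spelling of the placename.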

How helpful was the OSM gazetteer for you? Where have you found locations for placenames?

In [33]:
# Which placenames have been matched with locations from the OSM or your reference file?
matched_data = [
    place["placename"]
    for place in geolocdata
    if (len(place["locations"]["best_match"]) != 0 
           or (place["locations"]["candidates"] is not None and place["locations"]["candidates"] != []))
]
matched_data

['CHAPTER I.',
 'Crown',
 'the Bay of Biscay',
 'London',
 "Van Diemen's",
 'Vickers',
 'Chatham',
 'Frere',
 'Sylvia',
 'Surgeon Pine',
 'Coromandel',
 'Pine',
 'India',
 'Modesty',
 'CHAPTER I.',
 'Cape Pillar',
 'Pirates',
 'the Isle of Wight',
 'South',
 'Mediterranean',
 'Hobart',
 'Victoria',
 'Wellington',
 'Honduras',
 'Hobart Town',
 'Sydney']

What placenames have you still not found?

In [34]:
unmatched_data = [
    place["placename"]
    for place in geolocdata
    if (
        # No non-ambiguous location
        len(place["locations"]["best_match"]) == 0
        # No candidate Locations
        and (place["locations"]["candidates"] is None or place["locations"]["candidates"] == [])
    )
]
unmatched_data

['THE PRISON SHIP',
 'Familiarity',
 "the King's Regulations",
 'the Hydaspes for Calcutta',
 'shell--"why',
 'Cape Bougainville',
 'Maria Island',
 'Peninsula',
 'Pillar',
 'Storm Bay',
 'Storing Island',
 'Sorrell',
 'Pittwater',
 'Bruny Island',
 'Actaeon',
 'Recherche Bay',
 'the South Cape',
 'New Norfolk',
 'the Southern Ocean',
 'Port Philip Bay',
 'Bay',
 "Wyld's Crag",
 'Dromedary',
 'Derwent',
 'Mount Wellington',
 'south bay',
 'Smyrna',
 'Cape Grim',
 'Pyramid Island',
 'Rocky Point',
 'Port Davey',
 'Cape',
 'Huon',
 'Navigation',
 'Macquarie Harbour',
 'Mount Heemskirk',
 "King's River",
 'Gordon',
 'Sarah Island',
 'Philip',
 'Island',
 'Nature',
 'Ladybird',
 'the Hobart Town Gaol',
 'Commandant',
 'Honest',
 'Hells Gates',
 'Arthur',
 'England',
 'New Town',
 'verandah.-She',
 'Barton',
 'Grummet Island',
 'Maria',
 'Malabar',
 'Dawes',
 'Gad',
 'Osprey',
 'nimbus',
 'Troke']

While Open Street Map is a wonderful resource, it focusses on the current names of geographic locations. If the original source of your placenames was not written in recent decades, then the OSM may not know the names that locations had at the time of the document. Furthermore, some of the locations may no longer exist. 

One solution is to also look up a historical gazetteer, like that provided by [TLCMap](https://tlcmap.org/about/devstrategy.php#whatwhy), rather than relying only on the OSM. You can then look at the top matching locations from both and decide if any seem appropriate for your placenames.

TLCMap is designed to enable humanities researchers to use, create and integrate datasets relating to spatial data and thematic mapping. The focus is on using Australian cultural and historical data. A core part of that is the establishment of the [Gazetteer of Historical Australian Places (GHAP, formerly 'Placenames')](https://tlcmap.org/ghap/), which includes data based on the [Australian National Placename Survey (ANPS)](https://www.anps.org.au/) and layers of cultural information contributed by researchers, institutions and the community.

Like the OSM API, the [TLCMap API](https://tlcmap.org/guides/ghap/#ws) has various options, such as which type of search to use and whether to include data entered by the public as well as data that has been verified or entered by experts.

For this workshop, you will look for exact matches between the placenames and the locations, and not consider any publicly entered data.

In [35]:
# Which type of search to use for known locations?
# Alt values: 'exact', 'fuzzy', 'contains'
search_type = "exact"

# Flag whether to use data provided by the public
search_public_data = False

Like for the OSM, you can limit how many results you want to examine. The TLCMap default for this notebook is 1 but for now, you should use the same limit as for the OSM.

In [36]:
# How many (max) results do we want for each name?
TLCMap_limit = 5

Like for the OSM API, you probably want to limit how often you send queries and how many you send. 

In [37]:
# number of seconds between queries
# DEFAULT = 1
TLC_query_delay = 5 

# Number of placenames to search for in the queries
workshop_TLC_query_limit = 25

Like for the OSM, you will need a few functions to query the API.

In [38]:
# Build a url to query the tlcmap/ghap API.
# - placename: the place we're trying to locate
# - search_type: what search type to use (accepts one of ['contains','fuzzy','exact'])
# ref: https://www.tlcmap.org/guides/ghap/#ws
def tlc_build_url(placename: str, search_type: str, search_public_data: bool = False) -> str:
    
    safe_placename = urllib.parse.quote(placename.strip().lower())

    url = "https://tlcmap.org/ghap/search?"

    if search_type == "fuzzy":
        url += f"fuzzyname={safe_placename}"
    elif search_type == "exact":
        url += f"name={safe_placename}"
    elif search_type == "contains":
        url += f"containsname={safe_placename}"
    else:
        return None

    # Search Australian National Placenames Survey provided data
    url += "&searchausgaz=on"

    # Search public provided data, this data could be unreliable
    if search_public_data:
        url += "&searchpublicdatasets=on"
    else:
        url += "&searchpublicdatasets=off"

    # Retrieve data as JSON
    url += "&format=json"

    # Limit the number of results
    url += "&paging=" + str(TLCMap_limit)

    return url


# Send rate-limited requests that stay within 1 query per n seconds
@sleep_and_retry
@limits(calls=1, period=TLC_query_delay)
def tlc_call_api(url):
    r = requests.get(url)
    if r.url == "https://tlcmap.org/ghap/maxpaging":
        return None

    # Default to None so that a failed request still returns a defined value
    response = None

    # If the reply says the placename wasn't found, customise the JSON data for the reply
    if r.content.decode() == "No search results to display.":
        # TLCMap replies with plain text here rather than an empty list of features,
        # so build the empty FeatureCollection yourself
        response = json.loads('{"type": "FeatureCollection","metadata": {},"features": []}')
    # SUCCESS! Record the spatial data provided in the reply
    elif r.ok:
        response = r.json()  # get [lon, lat] etc. for spatial matches

    return response


# Use TLCMap/GHAP API to check a placename
def tlc_query_name(placename: str, search_type: str):
    
    url = tlc_build_url(placename, search_type, search_public_data)
    if url:
        return tlc_call_api(url)
    
    return None

In [39]:
# Format the api response to make comparison easier
def tlc_format_response(location_features):

    locdata = {}  # formatted data

    # Gather the locdata for one of the placename's locations
    # If the value is missing, set a value or leave it empty.
    if len(location_features):
        if "placename" in location_features["properties"]:
            locdata["LocationName"] = location_features["properties"]["placename"].lstrip().rstrip()
        else:
            locdata["LocationName"] = "Unknown Location"

        if "feature_term" in location_features["properties"]:
            locdata["Category"] = location_features["properties"]["feature_term"].lstrip().rstrip()
        else:
            locdata["Category"] = None

        if "longitude" in location_features["properties"]:
            locdata["Longitude"] = location_features["properties"]["longitude"]
        else:
            locdata["Longitude"] = ""

        if "latitude" in location_features["properties"]:
            locdata["Latitude"] = location_features["properties"]["latitude"]
        else:
            locdata["Latitude"] = ""

        if "state" in location_features["properties"]:
            locdata["PartOf"] = location_features["properties"]["state"].lstrip().rstrip()
        else:
            locdata["PartOf"] = None

        locdata["Gazetteer"] = "TLCMap"

        # Keep the names consistent between gazetteer records
        response = {
            "LocationName": str(locdata["LocationName"]),
            "Category": str(locdata["Category"]),
            "Latitude": locdata["Latitude"],
            "Longitude": locdata["Longitude"],
            "PartOf": str(locdata["PartOf"]),
            "Gazetteer": "TLCMap",
        }
    else:
        response = None

    return response

You can now search the TLCMap for locations matching the same placenames you previously searched for in the OSM.

In [40]:
# For every placename in our list
for place in geolocdata[0:workshop_TLC_query_limit]:

    # Already found an unambiguous location, so skip to the next placename
    if len(place["locations"]["best_match"]) > 0:
        continue

    placename = place["placename"]
    
    # Query the TLCMap database
    response = tlc_query_name(placename, search_type)
    
    # Handle no results found, skip to the next placename
    if response is None or response["features"] == []:
        print("--- Still looking for", placename)
        continue

    print("+++ Found", placename)
    
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response["features"]:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([tlc_format_response(response_record)])

        # Add the data to a dataframe
        data_frames.append(cleaned_response)

    # Add the results to the geoloc dataframe
    # Review the outcomes later
    # Make sure you don't write over any candidates previously added from another gazetteer.
    if place["locations"]["candidates"] is None:
        place["locations"]["candidates"] = data_frames
    else:
        place["locations"]["candidates"] = place["locations"]["candidates"] + data_frames

--- Still looking for CHAPTER I.
--- Still looking for THE PRISON SHIP
+++ Found Crown
--- Still looking for the Bay of Biscay
--- Still looking for Van Diemen's
--- Still looking for Vickers
--- Still looking for Familiarity
+++ Found Chatham
--- Still looking for Frere
+++ Found Sylvia
--- Still looking for Surgeon Pine
+++ Found Coromandel
+++ Found Pine
--- Still looking for the King's Regulations
--- Still looking for the Hydaspes for Calcutta
--- Still looking for shell--"why
--- Still looking for Modesty
--- Still looking for CHAPTER I.
+++ Found Cape Pillar
--- Still looking for Pirates
--- Still looking for the Isle of Wight
--- Still looking for South
--- Still looking for Mediterranean


Like the OSM, TLCMap provides a lot of information in response to each query. Not all of it is needed, so you will need to prune it, only keeping the data required for later processing.

Each response has a set of metadata that is kept independent from the spatial data (called _features_) for each matched location. For instance, this is the location data in the TLCMap response for the last matched placename. 

In [41]:
response_record

{'type': 'Feature',
 'geometry': {'type': 'Point', 'coordinates': [147.9299927, -43.18000031]},
 'properties': {'name': 'Cape Pillar',
  'placename': 'Cape Pillar',
  'description': 'Official',
  'id': 'a17513',
  'state': 'TAS',
  'feature_term': 'Suburb/Locality',
  'original_data_source': 'Australian Gazetteer',
  'latitude': '-43.18000031',
  'longitude': '147.9299927',
  'TLCMapLinkBack': 'https://tlcmap.org/ghap/search?id=a17513',
  'TLCMapDataset': 'https://tlcmap.org/ghap/'}}

Again, the notebook code cleans up the data, just keeping what is needed. The _PartOf_ value can be taken from the _state_ field and the _feature_term_ is used as the _Category_ value. Once again, the TLCMap has its own set of _Category_ values. If you allow the search to include data entered by the general public, then this is prone to having spelling errors and being even more inconsistent. However, it is still helpful to provide context to you as a user.
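
The clean-up step can be sketched in a compact form using `dict.get`, which returns a default when a field is missing. The helper `summarise_feature` is a hypothetical simplification and not the notebook's own function. The sample feature reuses values from the response shown above.

```python
# A cut-down copy of the feature shown above
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [147.9299927, -43.18000031]},
    "properties": {
        "placename": "Cape Pillar",
        "state": "TAS",
        "feature_term": "Suburb/Locality",
        "latitude": "-43.18000031",
        "longitude": "147.9299927",
    },
}


def summarise_feature(feature):
    """Keep only the fields needed later, with defaults for missing values."""
    props = feature.get("properties", {})
    return {
        "LocationName": props.get("placename", "Unknown Location").strip(),
        "Category": props.get("feature_term"),
        "Latitude": props.get("latitude", ""),
        "Longitude": props.get("longitude", ""),
        "PartOf": props.get("state"),
        "Gazetteer": "TLCMap",
    }


print(summarise_feature(feature))
```

Note that the _geometry_ field lists the coordinates as longitude then latitude, which is the opposite of the order in which they are usually spoken.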

In [42]:
# The last matched location in the last TLCMap query response
cleaned_response

|   | LocationName | Category | Latitude | Longitude | PartOf | Gazetteer |
|---|---|---|---|---|---|---|
| 0 | Cape Pillar | Suburb/Locality | -43.18000031 | 147.9299927 | TAS | TLCMap |


How helpful have the gazetteers been for you? What placenames have you still not found?

In [43]:
unmatched_data = [
    place["placename"]
    for place in geolocdata
    if (
        # No non-ambiguous location
        len(place["locations"]["best_match"]) == 0
        # No candidate Locations
        and (place["locations"]["candidates"] is None or place["locations"]["candidates"] == [])
    )
]
unmatched_data

['THE PRISON SHIP',
 'Familiarity',
 "the King's Regulations",
 'the Hydaspes for Calcutta',
 'shell--"why',
 'Cape Bougainville',
 'Maria Island',
 'Peninsula',
 'Pillar',
 'Storm Bay',
 'Storing Island',
 'Sorrell',
 'Pittwater',
 'Bruny Island',
 'Actaeon',
 'Recherche Bay',
 'the South Cape',
 'New Norfolk',
 'the Southern Ocean',
 'Port Philip Bay',
 'Bay',
 "Wyld's Crag",
 'Dromedary',
 'Derwent',
 'Mount Wellington',
 'south bay',
 'Smyrna',
 'Cape Grim',
 'Pyramid Island',
 'Rocky Point',
 'Port Davey',
 'Cape',
 'Huon',
 'Navigation',
 'Macquarie Harbour',
 'Mount Heemskirk',
 "King's River",
 'Gordon',
 'Sarah Island',
 'Philip',
 'Island',
 'Nature',
 'Ladybird',
 'the Hobart Town Gaol',
 'Commandant',
 'Honest',
 'Hells Gates',
 'Arthur',
 'England',
 'New Town',
 'verandah.-She',
 'Barton',
 'Grummet Island',
 'Maria',
 'Malabar',
 'Dawes',
 'Gad',
 'Osprey',
 'nimbus',
 'Troke']

You now have information about locations for most of the early placenames. Some locations are not ambiguous and are already recorded as a best match. Others came from gazetteers and are just candidate locations - you haven't worked out yet if any are suitable.

For instance, this is the data you have found for the first ten matched placenames.

In [44]:
# Identify which placenames have matched locations
matched_data = [
    place
    for place in geolocdata
    if (
        # Non-ambiguous location
        len(place["locations"]["best_match"]) != 0
        # Candidate Locations
        or (place["locations"]["candidates"] is not None and place["locations"]["candidates"] != [])
    )
]

# For each of the first ten placenames
for place in matched_data[0:10]:
    print("\n+++ ", place["placename"])

    if len(place["locations"]["best_match"]) != 0:
        # Unambiguous best match already known
        pprint(place["locations"]["best_match"])
    else:
        # Any candidate locations you have found
        if place["locations"]["candidates"] is not None:
            pprint(pd.concat(place["locations"]["candidates"], ignore_index=True))


+++  CHAPTER I.
  LocationName     Category            Latitude          Longitude          PartOf Gazetteer
0      Chapter  arts_centre  51.483033649999996  -3.20352008370247  United Kingdom       OSM

+++  Crown
  LocationName      Category             Latitude           Longitude          PartOf Gazetteer
0        Crown        suburb           57.4779783          -4.2158751  United Kingdom       OSM
1        Crown        hamlet           37.7584406          -81.844007   United States       OSM
2        Crown        hamlet            39.583138         -80.1025722   United States       OSM
3        Crown        suburb          -26.2188623          28.0085834    South Africa       OSM
4        Crown        hamlet           28.9413607         -98.7400238   United States       OSM
5        Crown  trig station  -33.531666666666666  149.91777777777776             NSW    TLCMap

+++  the Bay of Biscay
        LocationName Category     Latitude           Longitude          PartOf Gazetteer


The task now is to work out which of these candidate locations is most suitable.

A few heuristics can be used to flag those candidate locations with key features that can be used to help rank and select the locations.

One heuristic is to check whether multiple candidate locations have similar coordinates. However, sometimes there are no coordinates for a location, which can complicate the comparisons. The solution used here is to default to the value of 0 for missing data when sorting. It is not perfect, but in most cases it is adequate.

In [45]:
# So lists of coordinates can be sorted,
# when given a coordinate as a string or a number, convert it to an integer.
# This is required when the value is sometimes not a number, like NA, None or "".
def sorting_coord(coord):

    # Set to 0 if missing data (empty string, None, NaN or pd.NA)
    if isinstance(coord, str):
        if coord.strip() == "":
            return 0
    elif pd.isna(coord):
        return 0
    # Convert to a float then an integer
    return int(float(coord))

In [46]:
# Compare two sets of coordinates
# Return True if they are within 1 degree of each other for both Lat and Lon
# (the values are truncated to whole degrees before they are compared)
def compare_coords(lat1, lon1, lat2, lon2):
    
    # Cannot compare missing values
    if lat1.item() == "" or lat2.item() == "" or lon1.item() == "" or lon2.item() == "":
        return False

    # Can only compare integers
    lat1_int = int(float(lat1.item()))
    lon1_int = int(float(lon1.item()))
    lat2_int = int(float(lat2.item()))
    lon2_int = int(float(lon2.item()))

    # Is Coord1 within 1 degree of Coord2?
    # e.g., -47.5 is close to -46.5 and -48.5
    return abs(lat1_int - lat2_int) <= 1 and abs(lon1_int - lon2_int) <= 1
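
To see how this comparison behaves, here is a stand-alone sketch that applies the same truncate-then-compare idea to plain values rather than dataframe cells. The helpers `whole_degrees` and `is_near` are hypothetical names used only for this illustration.

```python
def whole_degrees(value):
    """Truncate a coordinate (string or number) towards zero."""
    return int(float(value))


def is_near(lat1, lon1, lat2, lon2):
    """True if both truncated coordinates differ by no more than 1."""
    return (
        abs(whole_degrees(lat1) - whole_degrees(lat2)) <= 1
        and abs(whole_degrees(lon1) - whole_degrees(lon2)) <= 1
    )


print(is_near("-47.5", "147.2", "-46.5", "147.9"))  # -47 vs -46
print(is_near("-47.5", "147.2", "-49.5", "147.9"))  # -47 vs -49
print(is_near("-45.0", "10", "-46.9", "10"))        # -45 vs -46
print(is_near("-45.9", "10", "-47.0", "10"))        # -45 vs -47
```

Because the values are truncated first, two locations almost 2 degrees apart can still count as near (third call), while two locations only 1.1 degrees apart can be rejected (fourth call). This is a coarse check, which is acceptable for flagging candidates but not for measuring distance.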

Another heuristic is to recognise which locations are in countries or regions that you know are relevant to the original document. This should be indicated in the _PartOf_ field of the data you have matched. Most of FtToHNL is set in Australia but references are made to parts of Great Britain, so those are the regions you should focus on. A list of places of interest can be made for Australia and one for Great Britain. While these are small, fixed sets of places, you need to make sure you match the spellings used in the gazetteers you have queried.

In [47]:
# Flag locations in Australia or Britain
aus_states = {
    "AUSTRALIA",
    "NSW",
    "VIC",
    "QLD",
    "TAS",
    "WA",
    "NT",
    "SA",
    "ACT",
    "NEW SOUTH WALES",
    "VICTORIA",
    "QUEENSLAND",
    "TASMANIA",
    "WESTERN AUSTRALIA",
    "SOUTH AUSTRALIA",
    "NORTHERN TERRITORY",
    "AUSTRALIAN CAPITAL TERRITORY",
}
gb_states = {
    "BRITAIN",
    "UK",
    "GB",
    "GREAT BRITAIN",
    "UNITED KINGDOM",
    "BRITISH ISLES",
    "ENGLAND",
    "WALES",
    "CYMRU",
    "SCOTLAND",
    "IRELAND",
    "NORTHERN IRELAND",
    "ÉIRE / IRELAND",
    "EIRE",
    "EIRE / IRELAND",
}
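
As a small illustration of how such sets can be used, the sketch below checks a _PartOf_ value against two cut-down sets. The names `AUS_SAMPLE`, `GB_SAMPLE` and `region_flag` are hypothetical, and the sets hold only a few of the values listed above.

```python
# Cut-down, hypothetical versions of the sets of places of interest
AUS_SAMPLE = {"AUSTRALIA", "TAS", "TASMANIA", "NSW"}
GB_SAMPLE = {"UNITED KINGDOM", "ENGLAND"}


def region_flag(part_of):
    """Return 'Australia', 'Britain' or None for a PartOf value."""
    # Upper-case the value so it can be compared with the upper-case sets
    key = str(part_of).strip().upper()
    if key in AUS_SAMPLE:
        return "Australia"
    if key in GB_SAMPLE:
        return "Britain"
    return None


for value in ["Tas", "United Kingdom", "Nederland"]:
    print(value, "->", region_flag(value))
```

Converting the value to upper case before the lookup means the sets only need one spelling of each name, although alternative names still have to be listed separately.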

Now you can go through each placename and its candidate locations and flag whether they correspond to any of these criteria. A distinction is made between whether any two locations with similar coordinates were found in the same gazetteer or a different one.

In [48]:
for place in matched_data:
    placename = place["placename"]
    
    if (
        # No unambiguous location found
        len(place["locations"]["best_match"]) == 0
        # Candidate locations found
        and place["locations"]["candidates"] is not None
        and len(place["locations"]["candidates"]) > 1
    ):

        # Sort the locations by Latitude & Longitude
        sorted_candidates = sorted(
            place["locations"]["candidates"],
            key=lambda x: [
                sorting_coord(x["Latitude"].item()),
                sorting_coord(x["Longitude"].item()),
            ],
        )

        prev_location = {}

        # Flag the candidates according to the heuristics
        for candidate in sorted_candidates:
            rank_flags = [] # List of all heuristic flags

            # Flag locations in Australia or Britain
            partings = candidate["PartOf"].item().upper()
            if partings in aus_states:
                rank_flags.append("Australia")
            elif partings in gb_states:
                rank_flags.append("Britain")

            # Flag coords in multiple gazetteers
            if len(prev_location) > 0:
                # Is this candidate location near to the previous candidate location?
                coord_flag = compare_coords(
                    prev_location["Latitude"],
                    prev_location["Longitude"],
                    candidate["Latitude"],
                    candidate["Longitude"],
                )
                # Dupl_2Gaz: Matching coords in candidate locations from 2 gazetteers
                if coord_flag and prev_location["Gazetteer"].item() != candidate["Gazetteer"].item():
                    rank_flags.append("Dupl_2Gaz")
                # Dupl_1Gaz: Matching coords in candidate locations from 1 gazetteer
                elif coord_flag:
                    rank_flags.append("Dupl_1Gaz")

            prev_location = candidate

            # rank_flags has to be converted to a single string to print it, rather than a list, 
            # because candidate is a Pandas dataframe
            candidate["RankFlags"] = pd.Series(",".join(rank_flags))
            print(
                " +++ ",
                candidate["LocationName"].item(),
                ",",
                partings,
                ",[",
                candidate["RankFlags"].item(),
                "]",
            )

        # Update the geoloc data
        place["locations"]["candidates"] = sorted_candidates

 +++  Crown , NSW ,[ Australia ]
 +++  Crown , SOUTH AFRICA ,[  ]
 +++  Crown , UNITED STATES ,[  ]
 +++  Crown , UNITED STATES ,[  ]
 +++  Crown , UNITED STATES ,[  ]
 +++  Crown , UNITED KINGDOM ,[ Britain ]
 +++  Van Diemens Crescent , TAS ,[ Australia ]
 +++  Van Diemen Avenue , TAS ,[ Australia ]
 +++  Van Gogh Court , TAS ,[ Australia,Dupl_1Gaz ]
 +++  Van Diemen Quality Bulbs , TAS ,[ Australia ]
 +++  Amsterdam-Van Diemenstraat , NEDERLAND ,[  ]
 +++  Vickers , UNITED STATES ,[  ]
 +++  Vickers , UNITED STATES ,[ Dupl_1Gaz ]
 +++  Vickers , UNITED STATES ,[  ]
 +++  Vickers , UNITED STATES ,[  ]
 +++  Vickers , UNITED KINGDOM ,[ Britain ]
 +++  Chatham , VIC ,[ Australia ]
 +++  Chatham , VIC ,[ Australia,Dupl_1Gaz ]
 +++  Chatham , NSW ,[ Australia ]
 +++  Chatham , NSW ,[ Australia,Dupl_1Gaz ]
 +++  Chatham , QLD ,[ Australia ]
 +++  Chatham County , UNITED STATES ,[  ]
 +++  Chatham County , UNITED STATES ,[  ]
 +++  Chatham , UNITED STATES ,[  ]
 +++  Chatham , CANADA ,[  ]
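
To make the flagging logic easier to follow in isolation, here is a minimal sketch that uses plain dictionaries instead of Pandas dataframes. The names `AUS_STATES`, `GB_STATES`, `coords_close` and its `tolerance` value are assumptions made for illustration; they stand in for the notebook's `aus_states`, `gb_states` and `compare_coords`, whose exact behaviour may differ. The sample coordinates are also illustrative.

```python
# Hypothetical stand-ins for the notebook's aus_states and gb_states lists
AUS_STATES = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}
GB_STATES = {"UNITED KINGDOM"}


def coords_close(a, b, tolerance=0.01):
    """Assumed stand-in for compare_coords: both coordinates within `tolerance` degrees."""
    return abs(a["lat"] - b["lat"]) <= tolerance and abs(a["lon"] - b["lon"]) <= tolerance


def flag_candidates(candidates):
    """Return (name, flags) pairs, with candidates visited in coordinate order."""
    ordered = sorted(candidates, key=lambda c: (c["lat"], c["lon"]))
    results = []
    prev = None
    for cand in ordered:
        rank_flags = []
        part_of = cand["part_of"].upper()
        if part_of in AUS_STATES:
            rank_flags.append("Australia")
        elif part_of in GB_STATES:
            rank_flags.append("Britain")
        # Only neighbours in the sorted list are compared with each other
        if prev is not None and coords_close(prev, cand):
            if prev["gazetteer"] != cand["gazetteer"]:
                rank_flags.append("Dupl_2Gaz")
            else:
                rank_flags.append("Dupl_1Gaz")
        prev = cand
        results.append((cand["name"], ",".join(rank_flags)))
    return results


sample = [
    {"name": "Chatham", "part_of": "VIC", "lat": -37.824, "lon": 145.089, "gazetteer": "OSM"},
    {"name": "Chatham", "part_of": "VIC", "lat": -37.823, "lon": 145.090, "gazetteer": "TLCMap"},
    {"name": "Chatham", "part_of": "Canada", "lat": 42.405, "lon": -82.185, "gazetteer": "OSM"},
]
for name, flags in flag_candidates(sample):
    print(" +++ ", name, ",[", flags, "]")
```

Note that only the second candidate of a matching pair receives the duplicate flag, because each candidate is compared with the one before it. The same pattern is visible in the output above, where the first of two matching entries carries no `Dupl` flag.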

You can now use the heuristics flags to re-sort the candidates and select the best one. The first step is to rank the possible flag combinations in an order of importance.

In [49]:
# Establish how to rank the heuristic flags
sortorder = [
    "Australia,Dupl_2Gaz",  # In Australia, found in 2 gazetteers
    "Australia,Dupl_1Gaz",  # In Australia, found more than once in 1 gazetteer
    "Australia",  # In Australia, found only once
    "Britain,Dupl_2Gaz",  # In Great Britain, found in 2 gazetteers
    "Dupl_2Gaz",  # Not in Australia or Great Britain, found in 2 gazetteers
    "Britain,Dupl_1Gaz",  # In Great Britain, found more than once in 1 gazetteer
    "Britain",  # In Great Britain, found only once
    "Dupl_1Gaz",  # Not in Australia or Great Britain, found more than once in 1 gazetteer
    "",  # Not in Australia or Great Britain, found only once
]

Then you can use this order to rank the candidate locations against each other.

In [50]:
# Sort the candidates
for place in geolocdata:
    placename = place["placename"]

    if (
        # No unambiguous location
        len(place["locations"]["best_match"]) == 0
        # More than one candidate location
        and place["locations"]["candidates"] is not None
        and len(place["locations"]["candidates"]) > 1
    ):
        candidates = place["locations"]["candidates"]

        sorted_candidates = []
        # For each combination of heuristic flags
        for heuristic in sortorder:
            # Get the candidate locations matching this ranking heuristic
            for candidate in candidates:
                if candidate["RankFlags"].item() == heuristic:
                    # Add the candidate to the sorted list
                    sorted_candidates.append(candidate)
        # Prepare a version for printing
        locations = pd.concat(sorted_candidates, ignore_index=True)

        # Update the geoloc data with the sorted candidates
        place["locations"]["candidates"] = sorted_candidates
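
The same ranking can also be expressed as a single stable sort. The sketch below is an alternative illustration, not the notebook's own method: it works on plain dictionaries rather than dataframes, and the names `RANK` and `sort_by_rank` are hypothetical. Because Python's `sorted()` is stable, candidates with the same flags keep the order they had beforehand, just as they do in the nested loops above.

```python
SORTORDER = [
    "Australia,Dupl_2Gaz",
    "Australia,Dupl_1Gaz",
    "Australia",
    "Britain,Dupl_2Gaz",
    "Dupl_2Gaz",
    "Britain,Dupl_1Gaz",
    "Britain",
    "Dupl_1Gaz",
    "",
]

# Map each flag combination to its position in the ranking
RANK = {flags: position for position, flags in enumerate(SORTORDER)}


def sort_by_rank(candidates):
    """Order candidates by the rank of their flags, keeping ties in their prior order."""
    # Assumption: any flag combination missing from SORTORDER is placed last
    return sorted(candidates, key=lambda c: RANK.get(c["rank_flags"], len(SORTORDER)))


# Flags taken from the 'Crown' output above (one United States entry kept for brevity)
crown = [
    {"part_of": "NSW", "rank_flags": "Australia"},
    {"part_of": "South Africa", "rank_flags": ""},
    {"part_of": "United States", "rank_flags": ""},
    {"part_of": "United Kingdom", "rank_flags": "Britain"},
]
print([c["part_of"] for c in sort_by_rank(crown)])
```

One difference is worth noting: the loop version silently drops any candidate whose flags are not listed in `sortorder`, whereas this sketch keeps such candidates at the end of the list.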

Your first ten placenames now have better ranked candidate locations.

In [51]:
# Identify which placenames have matched candidate locations
matched_data = [
    place
    for place in geolocdata
    if (
        # No unambiguous location
        len(place["locations"]["best_match"]) == 0
        # Candidate Locations
        and (place["locations"]["candidates"] is not None and place["locations"]["candidates"] != [])
    )
]

# For each of the first ten placenames
for place in matched_data[0:10]:
    print("\n+++ ", place["placename"])

    # Any candidate locations you have found
    if place["locations"]["candidates"] is not None:
        pprint(pd.concat(place["locations"]["candidates"], ignore_index=True))


+++  CHAPTER I.
  LocationName     Category            Latitude          Longitude          PartOf Gazetteer
0      Chapter  arts_centre  51.483033649999996  -3.20352008370247  United Kingdom       OSM

+++  Crown
  LocationName      Category             Latitude           Longitude          PartOf Gazetteer  RankFlags
0        Crown  trig station  -33.531666666666666  149.91777777777776             NSW    TLCMap  Australia
1        Crown        suburb           57.4779783          -4.2158751  United Kingdom       OSM    Britain
2        Crown        suburb          -26.2188623          28.0085834    South Africa       OSM           
3        Crown        hamlet           28.9413607         -98.7400238   United States       OSM           
4        Crown        hamlet           37.7584406          -81.844007   United States       OSM           
5        Crown        hamlet            39.583138         -80.1025722   United States       OSM           

+++  the Bay of Biscay
        Loca

Now that the candidates are sorted, the best match can be selected. The most obvious choice is the highest ranked candidate.

In [52]:
for place in geolocdata:
    placename = place["placename"]
    locations = []
    # Already have the best match
    if len(place["locations"]["best_match"]) != 0:
        locations = place["locations"]["best_match"]
    else:
        if place["locations"]["candidates"] is not None and len(place["locations"]["candidates"]) > 0:
            # Presume the best match has the top rank
            top_candidate = place["locations"]["candidates"][0]
            # Don't include the values for Gazetteer or RankFlags
            place["locations"]["best_match"] = top_candidate.loc[
                :, ~top_candidate.columns.isin(["Gazetteer", "RankFlags"])
            ]

You now have geolocation data for most of the placenames you have searched for!

In [53]:
for place in geolocdata:
    placename = place["placename"]
    print("+++ "+placename)
    # Only output those placenames with a best match location
    if len(place["locations"]["best_match"]) != 0:
        best_match = place["locations"]["best_match"].to_csv(index=False,header=False)
        print(best_match)

+++ CHAPTER I.
Chapter,arts_centre,51.483033649999996,-3.20352008370247,United Kingdom

+++ THE PRISON SHIP
+++ Crown
Crown,trig station,-33.531666666666666,149.91777777777776,NSW

+++ the Bay of Biscay
The Bay of Biscay,park,52.95425415,-1.159789836905746,United Kingdom

+++ London
London,Capital,-0.083333,51.5,United Kingdom

+++ Van Diemen's
Van Gogh Court,residential,-41.3799529,147.1181359,TAS

+++ Vickers
Vickers,primary,51.59845985,-1.7415535311162076,United Kingdom

+++ Familiarity
+++ Chatham
Chatham,railway station,-37.82402802,145.089035,VIC

+++ Frere
Frere,hamlet,44.4701528,7.004274,Italia

+++ Sylvia
Sylvia,parish,-20.85,143.81666666666666,QLD

+++ Surgeon Pine
Surgeon's Kitchen,yes,-29.0569188,167.9554764,Norfolk Island

+++ Coromandel
Coromandel,station,-35.0251942,138.6143361,SA

+++ Pine
Pine,homestead,-36.0,135.0,SA

+++ the King's Regulations
+++ India
India,Country,77.2,28.6,Asia

+++ the Hydaspes for Calcutta
+++ shell--"why
+++ Modesty
Modesty,cafe,43.5127672,16.

Unfortunately, the top-ranked candidate may still not be the best. The second-ranked candidate may have exactly the same rank as the top one and sit below it only because of the earlier sort on latitude and longitude (which was used to find matching candidates). For this reason, the final stage of the processing is left to you as the user: the notebook shows you the current best candidate location, and lets you select an alternative candidate by expanding a list of options.
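
To see how often the top position hides a tie, you could count the candidates that share the top candidate's flags. The helper below is a hypothetical sketch (`tied_with_top` is not part of the notebook) that works on a list of `RankFlags` strings in ranked order. The example list is what the ranking produces for the flags printed earlier for 'Chatham'.

```python
def tied_with_top(ranked_flags):
    """Return the positions of all candidates whose flags equal the top candidate's."""
    if not ranked_flags:
        return []
    top = ranked_flags[0]
    return [i for i, flags in enumerate(ranked_flags) if flags == top]


# RankFlags for the 'Chatham' candidates after ranking
chatham_flags = [
    "Australia,Dupl_1Gaz",  # Chatham, VIC
    "Australia,Dupl_1Gaz",  # Chatham, NSW
    "Australia",
    "Australia",
    "Australia",
    "",
    "",
    "",
    "",
]
print(tied_with_top(chatham_flags))
```

Here the Victorian and New South Wales entries are tied, so the default choice between them depends only on their coordinates. This is why the next cell lets you override the default.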

In [54]:
print("\nSelect the most suitable candidate location for each placename. The current 'best match' is provided.")

for place in geolocdata:
    placename = place["placename"]

    # If there isn't a best match, then there aren't any candidates
    if len(place["locations"]["best_match"]) != 0:

        best_match = place["locations"]["best_match"]
        best_match_index = 0

        # Must be at least one candidate
        if (
            "candidates" in place["locations"].keys()
            and place["locations"]["candidates"] is not None
            and len(place["locations"]["candidates"]) >= 1
        ):

            # Create a RadioButton for each candidate location
            candidates = place["locations"]["candidates"]
            candidate_buttons = [
                widgets.RadioButtons(
                    layout={"width": "max-content"},
                    options=[
                        candidate["LocationName"].item()
                        + ", "
                        + candidate["PartOf"].item()
                        + " ("
                        + str(candidate["Latitude"].item())
                        + ","
                        + str(candidate["Longitude"].item())
                        + ")"
                        for candidate in candidates
                    ],
                )
            ]
            # Add a button for "None of the above"
            candidate_buttons[0].options = candidate_buttons[0].options + ("None of the above",)
            place["locations"]["candidate_buttons"] = candidate_buttons

            # Select the button for the current bestmatch
            place["locations"]["candidate_buttons"][0].index = best_match_index

            # Create a HBox for the Placename
            placename_box = widgets.HBox(
                [
                    widgets.Label(
                        "'"
                        + place["placename"]
                        + "' matches best with: "
                        + best_match["LocationName"].item()
                        + ", "
                        + best_match["PartOf"].item()
                        + " ("
                        + str(best_match["Latitude"].item())
                        + ", "
                        + str(best_match["Longitude"].item())
                        + ")"
                    )
                ]
            )

            # Create an Accordion (in a HBox) for the candidates
            new_accordion = widgets.Accordion()
            new_accordion.set_title(0, "Expand to choose a different location")
            new_accordion.children = [widgets.HBox(place["locations"]["candidate_buttons"])]
            new_accordion.selected_index = None  # close the accordion at startup

            # Put the HBoxes together in a VBox
            whole_box = widgets.VBox([placename_box, new_accordion])

            # Show it all!
            # Note: this doesn't close one accordion if another is opened because they are in separate VBoxes
            display(whole_box)


Select the most suitable candidate location for each placename. The current 'best match' is provided.


VBox(children=(HBox(children=(Label(value="'CHAPTER I.' matches best with: Chapter, United Kingdom (51.4830336…

VBox(children=(HBox(children=(Label(value="'Crown' matches best with: Crown, NSW (-33.531666666666666, 149.917…

VBox(children=(HBox(children=(Label(value="'the Bay of Biscay' matches best with: The Bay of Biscay, United Ki…

VBox(children=(HBox(children=(Label(value="'Van Diemen's' matches best with: Van Gogh Court, TAS (-41.3799529,…

VBox(children=(HBox(children=(Label(value="'Vickers' matches best with: Vickers, United Kingdom (51.59845985, …

VBox(children=(HBox(children=(Label(value="'Chatham' matches best with: Chatham, VIC (-37.82402802, 145.089035…

VBox(children=(HBox(children=(Label(value="'Frere' matches best with: Frere, Italia (44.4701528, 7.004274)"),)…

VBox(children=(HBox(children=(Label(value="'Sylvia' matches best with: Sylvia, QLD (-20.85, 143.81666666666666…

VBox(children=(HBox(children=(Label(value="'Surgeon Pine' matches best with: Surgeon's Kitchen, Norfolk Island…

VBox(children=(HBox(children=(Label(value="'Coromandel' matches best with: Coromandel, SA (-35.0251942, 138.61…

VBox(children=(HBox(children=(Label(value="'Pine' matches best with: Pine, SA (-36.0, 135.0)"),)), Accordion(c…

VBox(children=(HBox(children=(Label(value="'Modesty' matches best with: Modesty, Hrvatska (43.5127672, 16.4402…

VBox(children=(HBox(children=(Label(value="'CHAPTER I.' matches best with: Chapter, United Kingdom (51.4830336…

VBox(children=(HBox(children=(Label(value="'Cape Pillar' matches best with: Cape Pillar, TAS (-43.18000031, 14…

VBox(children=(HBox(children=(Label(value="'Pirates' matches best with: Pirates, South Africa (-34.0015814, 23…

VBox(children=(HBox(children=(Label(value="'the Isle of Wight' matches best with: Isle of Wight, United Kingdo…

VBox(children=(HBox(children=(Label(value="'South' matches best with: South, Éire / Ireland (52.34106825, -8.1…

VBox(children=(HBox(children=(Label(value="'Mediterranean' matches best with: Mediterranean, United States (29…

Once you have selected your preferred candidates, it is time to record them as the best matches. This is also a great time to identify any placenames that don't have a suitable location, either because the locations found were not contextually appropriate or because none could be found at all. For such placenames, the "best match" will have explanatory text as the _LocationName_, _Category_ and _PartOf_ values, with empty values instead of latitude and longitude coordinates.
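
The three possible outcomes (a chosen candidate, "None of the above", or no candidates at all) can be summarised in a small function. This is a simplified sketch using plain dictionaries rather than dataframes; `placeholder` and `resolve_selection` are hypothetical names, and treating a missing selection (`None`) as "no suitable location" is an assumption.

```python
KEEP_COLUMNS = ("LocationName", "Category", "Latitude", "Longitude", "PartOf")


def placeholder(reason):
    """Build a best-match record that explains why no location is recorded."""
    return {
        "LocationName": reason,
        "Category": reason,
        "Latitude": "",
        "Longitude": "",
        "PartOf": reason,
    }


def resolve_selection(candidates, selected_index):
    """Turn a selection index into a best-match record."""
    if not candidates:
        return placeholder("No location matched")
    if selected_index is None or selected_index >= len(candidates):
        # The extra 'None of the above' option sits just past the last candidate
        return placeholder("No suitable location selected")
    chosen = candidates[selected_index]
    # Keep only the columns needed for the best match
    return {column: chosen[column] for column in KEEP_COLUMNS}


candidates = [
    {
        "LocationName": "Crown",
        "Category": "trig station",
        "Latitude": "-33.531666666666666",
        "Longitude": "149.91777777777776",
        "PartOf": "NSW",
        "Gazetteer": "TLCMap",
        "RankFlags": "Australia",
    }
]
print(resolve_selection(candidates, 0)["PartOf"])
print(resolve_selection(candidates, 1)["LocationName"])
print(resolve_selection([], 0)["LocationName"])
```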

In [55]:
for place in geolocdata:

    # If there isn't a best match, then there aren't any candidates
    if len(place["locations"]["best_match"]) != 0:

        best_match_index = 0

        # Must be at least one candidate
        if "candidate_buttons" in place["locations"].keys():

            # Extract the selection
            best_match_index = place["locations"]["candidate_buttons"][0].index
            # Make sure the selection is a location
            if best_match_index < len(place["locations"]["candidates"]):
                # Record the best match
                top_candidate = place["locations"]["candidates"][best_match_index]
                # Just record the important columns for the best match
                place["locations"]["best_match"] = top_candidate.loc[
                    :, ~top_candidate.columns.isin(["Gazetteer", "RankFlags"])
                ]
            else:
                # None of the above
                best_match = {
                    "LocationName": "No suitable location selected",
                    "Category": "No suitable location selected",
                    "Latitude": "",
                    "Longitude": "",
                    "PartOf": "No suitable location selected",
                }
                place["locations"]["best_match"] = pd.DataFrame([best_match])
    else:
        # Note that there is no best match
        best_match = {
            "LocationName": "No location matched",
            "Category": "No location matched",
            "Latitude": "",
            "Longitude": "",
            "PartOf": "No location matched",
        }
        place["locations"]["best_match"] = pd.DataFrame([best_match])

You now have all the geolocation data for all the placenames. Have a look at what you have got for the first ten placenames.

In [56]:
for place in geolocdata[0:10]:
    placename = place["placename"]
    print("+++ "+placename)
    # Only output those placenames with a best match location
    if len(place["locations"]["best_match"]) != 0:
        best_match = place["locations"]["best_match"].to_csv(index=False,header=False)
        print(best_match)

+++ CHAPTER I.
Chapter,arts_centre,51.483033649999996,-3.20352008370247,United Kingdom

+++ THE PRISON SHIP
No location matched,No location matched,,,No location matched

+++ Crown
Crown,trig station,-33.531666666666666,149.91777777777776,NSW

+++ the Bay of Biscay
The Bay of Biscay,park,52.95425415,-1.159789836905746,United Kingdom

+++ London
London,Capital,-0.083333,51.5,United Kingdom

+++ Van Diemen's
Van Gogh Court,residential,-41.3799529,147.1181359,TAS

+++ Vickers
Vickers,primary,51.59845985,-1.7415535311162076,United Kingdom

+++ Familiarity
No location matched,No location matched,,,No location matched

+++ Chatham
Chatham,railway station,-37.82402802,145.089035,VIC

+++ Frere
Frere,hamlet,44.4701528,7.004274,Italia



While ultimately you only need the details for the best matched location, it may be helpful to keep all of this data as a record of your research workflow.

In [57]:
# Save all the geolocdata as a combination of array rows and dataframes
filename = "FtToHNL_matchedlocations.data"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving location data to ", save_location)

# Save the list
np.savetxt(save_location, geolocdata, delimiter=", ", fmt="%s")

Saving location data to  ../output/FtToHNL_matchedlocations.data
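
If you want to be able to reload this record later, a structured format such as JSON is one option. The sketch below is an alternative, not what the notebook does. It assumes each dataframe has already been converted to a list of dictionaries (for example with the Pandas method `DataFrame.to_dict("records")`), and the file name and sample record are purely illustrative.

```python
import json
import os
import tempfile

# Illustrative record: dataframes are assumed to be converted to lists of dicts
records = [
    {
        "placename": "Crown",
        "locations": {
            "best_match": [
                {
                    "LocationName": "Crown",
                    "Category": "trig station",
                    "Latitude": "-33.531666666666666",
                    "Longitude": "149.91777777777776",
                    "PartOf": "NSW",
                }
            ],
            "candidates": [
                {"LocationName": "Crown", "PartOf": "NSW", "RankFlags": "Australia"},
                {"LocationName": "Crown", "PartOf": "United Kingdom", "RankFlags": "Britain"},
            ],
        },
    }
]

# Write to a temporary directory so the sketch does not touch your output folder
path = os.path.join(tempfile.mkdtemp(), "matchedlocations.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(records, handle, indent=2, ensure_ascii=False)

# Read the file back and confirm nothing was lost
with open(path, encoding="utf-8") as handle:
    restored = json.load(handle)
print(restored == records)
```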


The final step is to prune the data down to what you need, so it can be presented in a comma-separated value (csv) format.

In [58]:
# Final output
# Only use selected columns from the best match in csv format

# Make a new array of records
geoloc_output = [["Placename", "PartOf", "Latitude", "Longitude"]]
for place in geolocdata:
    # Only output those placenames with a best match location
    if len(place["locations"]["best_match"]) != 0:
        placename = place["placename"]
        location = place["locations"]["best_match"]

        # Convert Lat/Long into floats, rather than strings.
        # Placenames without a location keep an empty string
        latitude = location["Latitude"].item()
        if latitude != "":
            latitude = float(latitude)
        longitude = location["Longitude"].item()
        if longitude != "":
            longitude = float(longitude)

        # Add this record to the list
        geoloc_output.append([placename, location["PartOf"].item(), latitude, longitude])

In [59]:
# Show the final formatted data for the first ten placenames
geoloc_output[0:10]

[['Placename', 'PartOf', 'Latitude', 'Longitude'],
 ['CHAPTER I.', 'United Kingdom', 51.483033649999996, -3.20352008370247],
 ['THE PRISON SHIP', 'No location matched', '', ''],
 ['Crown', 'NSW', -33.531666666666666, 149.91777777777776],
 ['the Bay of Biscay', 'United Kingdom', 52.95425415, -1.159789836905746],
 ['London', 'United Kingdom', -0.083333, 51.5],
 ["Van Diemen's", 'TAS', -41.3799529, 147.1181359],
 ['Vickers', 'United Kingdom', 51.59845985, -1.7415535311162076],
 ['Familiarity', 'No location matched', '', ''],
 ['Chatham', 'VIC', -37.82402802, 145.089035]]

This can then be written to file.

In [60]:
filename = "FtToHNL_geolocdata.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving geolocation data to ", save_location)

# Save the list
np.savetxt(save_location, geoloc_output, delimiter=", ", fmt="%s")

Saving geolocation data to  ../output/FtToHNL_geolocdata.csv
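
One thing to watch for when joining values with a plain delimiter is a placename that itself contains a comma, since it would be split into two columns when the file is read back. As a hedged alternative, Python's standard `csv` module quotes such values automatically. The sketch below writes to an in-memory buffer instead of a file, and the third data row is a hypothetical example added to show the quoting.

```python
import csv
import io

rows = [
    ["Placename", "PartOf", "Latitude", "Longitude"],
    ["Crown", "NSW", -33.531666666666666, 149.91777777777776],
    ["THE PRISON SHIP", "No location matched", "", ""],
    # Hypothetical placename containing a comma
    ["Sydney, New South Wales", "NSW", -33.87, 151.21],
]

# Write the rows; fields that contain commas are quoted automatically
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())

# Read the text back to confirm every row still has four columns
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(all(len(row) == 4 for row in parsed))
```

To write to disk instead, you could open the target file with `open(save_location, "w", newline="", encoding="utf-8")` and pass the file handle to `csv.writer`.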


Congratulations! You have now found placenames in a series of historical Australian texts and matched most of them with the coordinates of actual locations. You can now use this geolocation data for further research.

The data can also be displayed as points on a map. The following code will place a marker on a Folium map for each place that has a valid location.

In [61]:
# Create the map
m = folium.Map(location=[-37.2094444, 144.2094444], zoom_start=5)

# Load the final output into a dataframe, using the first row as the column names
data = pd.DataFrame(geoloc_output[1:], columns=geoloc_output[0])

# Add markers one by one to the map. Skip placenames that have no coordinates
for _, row in data.iterrows():
    if row["Latitude"] != "" and row["Longitude"] != "":
        folium.Marker(
            location=[row["Latitude"], row["Longitude"]],
            popup=row["Placename"],
        ).add_to(m)

# Show the map
m