# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Identifying Placenames in Text](#section-identifyingplacenames)
 * [Named Entity Recognition](#section-ner)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

This notebook uses various Python libraries. Some come with your Python installation, but the following need to be installed:
- pandas
- numpy
- nltk
- ratelimit
- spacy (with the `en_core_web_sm` model)
- ipywidgets


In [None]:
%%capture
!pip install pandas
!pip install numpy
!pip install nltk
!pip install ratelimit
!pip install spacy
# You need the old version due to issues setting titles with the latest
!pip install ipywidgets=="7.6.2"

!python -m spacy download en_core_web_sm

In [None]:
import os
import urllib.parse
import pandas as pd
import numpy as np

# spaCy is used for a pipeline of NLP functions
import spacy

# ipywidgets is used for user interactive interfaces
import ipywidgets as widgets

# Imports for the map API requests and output formatting
import requests
import json
from pprint import pprint
from ratelimit import limits, sleep_and_retry

In [None]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 115)

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You also want to set up some directories for the import and output of data.

In [None]:
# Declare the data directories
# This presumes that 'notebooks' is the current working directory
text_directory = os.path.normpath("../texts/")
csv_directory = os.path.normpath("../output/")
reference_directory = os.path.normpath("../data")

# Create the data directories (exist_ok avoids an error if they already exist)
for directory in (text_directory, csv_directory, reference_directory):
    os.makedirs(directory, exist_ok=True)

For this workshop, we will be examining the text of *For the Term of His Natural Life* (FtToHNL), an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained via [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is a plain text file. We have simplified it slightly by reducing it to standard ASCII characters, replacing accented characters with their unaccented forms and the pound sterling symbol with the word *pounds*.

The novel is divided into four books, each set in a different region of the world. For instance, you might want to start with the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. Each book can be processed in its entirety, or as individual chapters.

In [None]:
filename = "FtToHNL_BOOK_2_CHAPTER_3.txt"
print("Working on | ", filename)

# Set the specific path for the 'filename'
text_location = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(text_location)

# Use a context manager so the file is closed after reading
with open(text_location, encoding="utf-8") as text_file:
    text = text_file.read()

You have now read Chapter 3 of Book 2 into memory. This is currently no more than a long string of characters. So far, you have done no processing.

In [None]:
text[0:500]  # look at the first 500 characters

## Identifying Placenames in Text <a class="anchor" id="section-identifyingplacenames"></a>
*This section provides tools for identifying placenames in textual data.*

### Named Entity Recognition <a class="anchor" id="section-ner"></a>

Texts like the start of Chapter 3, Book 2, mention a lot of places. Some of them are a little vague, like _the house of Major Vickers_, _head-quarters_ and _the settlement_, but some are very explicit, like _Maria Island_ and _Macquarie Harbour_. This notebook will show you how to use software to automatically identify the proper noun phrases that relate to placenames.

The following part of this notebook uses the Named Entity Recognition (NER) tool provided in the spaCy Python package. For an introduction to this tool and how it can be used to find Named Entities (NEs) like placenames in text, go to the [spaCy NER notebook](https://github.com/Australian-Text-Analytics-Platform/geolocation-tools-workshop/blob/7d92664ac44f86b90a0c098bb3159793a4fe6c16/Notebooks/spacy_ner_introduction.ipynb).

This NER need not be done one file at a time. You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. The results can all be collected in a single data structure, reviewed and saved to file.
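
As a rough sketch of how the chapter files could be enumerated, the snippet below builds one filename per book and chapter with `itertools.product`. It assumes the `FtToHNL_BOOK_<n>_CHAPTER_<n>.txt` naming pattern used in this notebook; `chapter_filenames` is a hypothetical helper name, not part of the project.

```python
from itertools import product


def chapter_filenames(books, chapters):
    """Return one filename per (book, chapter) pair, grouped by book."""
    return [
        f"FtToHNL_BOOK_{book}_CHAPTER_{chapter}.txt"
        for book, chapter in product(books, chapters)
    ]


filenames = chapter_filenames(books=[1, 2], chapters=[1, 2, 3])
for name in filenames:
    print(name)
```

Building the list first makes it easy to check which files will be read before any processing starts.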

In [None]:
# Dataframe where we store the details about each instance of the placenames
placenames_df = pd.DataFrame(columns=["Book", "Chapter", "NEIndex", "Placename"])

In [None]:
# Define which chapters and books you want to examine
chapters = [1, 2, 3]
books = [1, 2]

Load a spaCy model for English.

In [None]:
nlp = spacy.load("en_core_web_sm")

You should define which spaCy components you do or don't want in the pipeline. You mainly need the Tokenizer and the NER components. Others, like the Parser, slow the processing down.

In [None]:
disabled_pipeline = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]

Of course, not all NEs are placenames, so you will need to list the categories that regularly contain placenames. This notebook focusses on the Location (LOC), Geo-Political Entity (GPE), Facility (FAC) and Organisation (ORG) categories. While they may not capture every instance of every placename, they have been observed to identify the majority of placenames without returning too many irrelevant NEs.
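
The filtering idea can be sketched without spaCy. The `(text, label)` pairs below are made-up stand-ins for the entities a model might return, and the labels are illustrative rather than real spaCy output.

```python
# Hypothetical (text, label) pairs standing in for the entities found by an NER model
entities = [
    ("Macquarie Harbour", "LOC"),
    ("1833", "DATE"),
    ("Maria Island", "GPE"),
    ("Major Vickers", "PERSON"),
]

# Categories that are treated as likely to hold placenames
placename_categories = {"LOC", "GPE", "FAC", "ORG"}

# Keep only the entities whose label is in the chosen categories
candidates = [text for text, label in entities if label in placename_categories]
print(candidates)
```

A set is used for the categories here only because membership tests are the sole operation needed; a list works just as well for four items.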

In [None]:
placename_categories = ["LOC", "GPE", "FAC", "ORG"]

Now you can start processing the chapters from FtToHNL.

In [None]:
i = 0  # Counter of the entities
for book in books:
    for chapter in chapters:
        # Construct the filename for this book and chapter
        filename = "FtToHNL_BOOK_" + str(book) + "_CHAPTER_" + str(chapter) + ".txt"

        # Set the specific path for the 'filename'
        text_location = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(text_location)

        # Read this chapter
        with open(text_location, encoding="utf-8") as text_file:
            text = text_file.read()
        print("Working on |", filename)

        # Run spaCy
        doc = nlp(text, disable=disabled_pipeline)

        # Document level
        ents = [(entity.text, entity.start_char, entity.end_char, entity.label_) for entity in doc.ents]

        # Token level
        for entity in doc.ents:
            # filter out MONEY, DATE, etc.
            if entity.label_ in placename_categories:

                # To help understand the context of the text, extract the occurrence
                # max(0, ...) stops a negative start index from wrapping around to the end of the text
                context_text = doc.text[max(0, entity.start_char - 30) : entity.end_char + 30].replace("\n", " ")

                # Add the placenames according to spaCy
                new_placename = {
                    "Book": book,  # The Book number
                    "Chapter": chapter,  # The Chapter number
                    "NEIndex": i,  # A reference number to the nth Named Entity
                    "Placename": entity.text,  # The placename in the text
                    "Category": entity.label_,  # The spaCy category
                    "Context": context_text,  # The textual context where the placename was found
                    "Approval": 1,  # A flag for whether this is a suitable placename
                }
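                # DataFrame.append was removed in pandas 2.0, so concatenate a one-row dataframe instead
                placenames_df = pd.concat([placenames_df, pd.DataFrame([new_placename])], ignore_index=True)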

            i = i + 1  # Entity counter

__[TO DO] Save a copy of this data__

In [None]:
# Output what placenames have been found
placenames_df[["Book", "Chapter", "NEIndex", "Placename", "Category"]]

This shows that a lot more placenames were found in Book 2 than Book 1. This makes sense since Book 1 is set on board a ship during an ocean voyage, whereas Book 2 is at Macquarie Harbour in Australia. You will also see that the same placename might be recognised multiple times but be categorised differently. This is because the spaCy model used for the NER depends on the linguistic context in which each NE instance appears, and it considers each instance in isolation. It does not assign the category based on multiple instances of an NE.
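
One way to spot placenames that received more than one category is to group the categories by name. This is a minimal sketch using made-up `(placename, category)` pairs, not real output from the chapters.

```python
from collections import defaultdict

# Hypothetical (placename, category) pairs; not real spaCy output
found = [
    ("Macquarie Harbour", "LOC"),
    ("Macquarie Harbour", "FAC"),
    ("Maria Island", "LOC"),
    ("Macquarie Harbour", "LOC"),
]

# Collect the set of categories seen for each placename
categories_by_name = defaultdict(set)
for name, category in found:
    categories_by_name[name].add(category)

# Placenames that were given more than one category
inconsistent = sorted(name for name, cats in categories_by_name.items() if len(cats) > 1)
print(inconsistent)
```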

### Reviewing Candidate Placenames <a class="anchor" id="section-reviewplacenames"></a>

However, there are a number of NEs that are unlikely to be placenames, regardless of how spaCy categorised them. You want to be able to filter them out. Sometimes, though, it is hard to work out whether an NE is a person, an organisation or a location. For instance, spaCy has said _Van Diemen's_ and _Tasman_ are both ORG NEs. While both could be the names of people (Dutch explorer Abel Tasman and Governor-General of the Dutch East Indies Anthony van Diemen), they are actually part of the larger NEs _Van Diemen's Land_ (the former name for the state of Tasmania) and _Tasman's Head_ (a headland in Tasmania). For this, it is best to consider the context in which the terms were used.
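
A context snippet of this kind can be cut from the text with a little index arithmetic. The sketch below assumes a window of a fixed number of characters on each side; `context_window` and the sample sentence are hypothetical. The start index is clamped at zero because a negative slice index in Python counts from the end of the string, which would return the wrong text for entities near the start of a chapter.

```python
def context_window(text, start, end, width=30):
    """Return text[start:end] with up to `width` characters of context on each side."""
    left = max(0, start - width)         # clamp so the index never goes negative
    right = min(len(text), end + width)  # clamp to the end of the text
    return text[left:right].replace("\n", " ")


# Made-up sentence, used only to show the slicing
sample = "He sailed for Van Diemen's Land\nin the spring."
start = sample.index("Van Diemen's")
end = start + len("Van Diemen's")

print(context_window(sample, start, end, width=10))
```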

The following interface shows you each NE instance identified by spaCy's NER and the context in which it occurs. Use the checkboxes to select which terms you consider to be placenames.

In [None]:
# Function that is used when a checkbox changes
def changed(b):
    k = b["new"]

# Lists of the data required for displaying the checkboxes
placename_items = []
context_items = []
num_items = []

# Make checkboxes for every placename
for book in books:
    for chapter in chapters:

        # Get the NEs from this book and chapter
        placenames_book_chapter = placenames_df[(placenames_df["Book"] == book) & (placenames_df["Chapter"] == chapter)]

        # Get the contextual data for each indexed NE
        for i in placenames_book_chapter["NEIndex"]:
            context_text = placenames_book_chapter[placenames_book_chapter["NEIndex"] == i]["Context"].values[0]
            category = placenames_book_chapter[placenames_book_chapter["NEIndex"] == i]["Category"].values[0]

        # Make lists of the candidate placenames, context text and index numbers
        # Only the placenames are given a checkbox.
        placename_items = placename_items + [
            widgets.Checkbox(True, description=i) for i in placenames_book_chapter["Placename"]
        ]
        context_items = context_items + [
            widgets.Label(placenames_book_chapter[placenames_book_chapter["NEIndex"] == i]["Context"].values[0])
            for i in placenames_book_chapter["NEIndex"]
        ]
        num_items = num_items + [widgets.Label(str(i)) for i in placenames_book_chapter["NEIndex"]]

# Create a display
num_placenames = len(placename_items)
left_box = widgets.VBox(placename_items)
right_box = widgets.VBox(context_items)
num_box = widgets.VBox(num_items)
whole_box = widgets.HBox([num_box, left_box, right_box])

print("Unselect any Named Entities (NEs) that you do not consider to be placenames.")
print("Each instance of an NE is listed, with the textual context in which it appeared.")

display(whole_box)

for n in range(num_placenames):
    placename_items[n].observe(changed)

You can now copy all the values from the checkboxes to the data, so you know which placenames you have approved.

In [None]:
# Transfer the status of each checklist item to the data
for n in range(num_placenames):

    NEIndex_num = int(num_items[n].value)
    approval_flag = placename_items[n].value

    # Set the flag to match the checklist
    placenames_df.loc[placenames_df["NEIndex"] == NEIndex_num, "Approval"] = approval_flag

You can now visualise the result.

In [None]:
placenames_df[["NEIndex", "Placename", "Approval"]]

From this, you can extract the final list of distinct placenames that you have approved. While the names aren't sorted (though they could be), if you missed unselecting an NE on the checklist, this will help find it. All you need to do is go back to the checklist, unselect it, then run all other steps from there to here.
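
If a sorted list would be easier to scan, the names can be de-duplicated and ordered in plain Python. This is only a sketch with a hand-written list; the names come from the examples earlier in this notebook.

```python
approved = [
    "Van Diemen's Land",
    "Macquarie Harbour",
    "Maria Island",
    "Macquarie Harbour",
    "Tasman's Head",
]

# dict.fromkeys keeps the first occurrence of each name, in order of appearance
unique_in_order = list(dict.fromkeys(approved))

# Sorting case-insensitively makes the list easier to scan for mistakes
unique_sorted = sorted(unique_in_order, key=str.casefold)

print(unique_in_order)
print(unique_sorted)
```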

In [None]:
# Make a unique list of the approved placenames
approved_placenames = placenames_df[placenames_df["Approval"] == True]["Placename"].unique()
print(approved_placenames)

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [None]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to ", save_location)

In [None]:
# Save the list, one placename per line, using savetxt from the numpy module
np.savetxt(save_location, approved_placenames, delimiter=", ", fmt="%s", encoding="utf-8")

You now have a copy of what you consider to be a list of placenames from a select set of FtToHNL chapters. Each placename found and approved is only included once. For this notebook, this data does not include contextual data like how many times a placename was found in the original text, nor where it was found.
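
One caveat with a one-name-per-line file is that a placename containing a comma could later be read back as several columns. As an alternative sketch (not what this notebook does), the standard `csv` module quotes such values automatically. The second placename below is invented to show the effect, and an in-memory buffer stands in for a file.

```python
import csv
import io

# The second name is a made-up example that contains a comma
placenames = ["Macquarie Harbour", "Hobart, Van Diemen's Land"]

# Write one placename per row; the writer quotes any value containing a comma
buffer = io.StringIO()
writer = csv.writer(buffer)
for name in placenames:
    writer.writerow([name])

# Read the rows back and take the single column from each
buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]
print(read_back)
```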

You can easily change the notebook commands to select different chapters or even look at a different single file.

## Finding Locations for the Placenames <a class="anchor" id="section-findinglocs"></a>

Now that you have a list of placenames from the text, the next step is to work out their location on Earth. For this you can use a combination of specialised lists of locations, gazetteers and heuristics. The objective is to match every placename with the coordinates of a known location.

The first step is to read the file of your placenames.

In [None]:
filename = "FtToHNL_placenames.csv"
print("Working on | ", filename)

# Set the specific path for the 'filename'
data_location = os.path.normpath(os.path.join(csv_directory, filename))
data_filename = os.path.basename(data_location)

# Read the csv file using pandas. This will place it in a dataframe format.
placenames_df = pd.read_csv(data_location, encoding="utf-8", header=None)
placenames_df = placenames_df.rename(columns={placenames_df.columns[0]: "Placename"})
placenames_df

### Identifying States and Capitals <a class="anchor" id="section-statescapitals"></a>

Some placenames, like *High Street* or *Maryborough*, may be very common across the world, or even in Australia. However, certain placenames refer to significant locations, like states, territories, large geographic features or capital cities. As such, if they are mentioned in a text, the placename is more likely to refer to the major location than a town or village in Tasmania.

These significant locations are a finite set. They can be defined in a reference file that can be reused when reviewing the placenames of any text.

A good point for you to start is a file about locations like modern capital cities and countries, combined with historical locations of significance.

In [None]:
filename = "reference_location_data.csv"
print("Working on | ", filename)

# Set the specific path for the 'filename'
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Rather than reading this and then processing it, you can process each line as you read it.

__[TO DO] Change this to a dictionary?__

In [None]:
# Place the reference data in a dataframe
locref_df = pd.read_csv(reference_location, encoding="utf-8", header=0)
locref_df

As you can see, this allows some bias to be introduced into the data to suit your geolocation needs. For instance, _Perth_ is entered in this file as a city in the state of Western Australia, rather than one in Scotland. _Victoria_ is recorded as a state of Australia, rather than the capital of the Seychelles, or the capital of British Columbia, Canada.

__[TO DO] update this text chunk to suit the workshop__

Of course, if you are researching historical texts, then some of these contemporary locations may have had different names. Old New York was once New Amsterdam (and had the [nickname of Gotham](https://www.nypl.org/blog/2011/01/25/so-why-do-we-call-it-gotham-anyway), amongst others). Istanbul was Constantinople. Some locations had [romanized names](https://en.wikipedia.org/wiki/Chinese_postal_romanization), like Beijing being called Peking. They may be a long time gone but you might want to add them to the list of significant known locations.
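
One simple way to handle such renamings is an alias table that maps a historical name to its modern equivalent before any lookup. The sketch below uses only the examples mentioned above; `historical_aliases` and `modern_name` are hypothetical names.

```python
# Historical name -> modern name, using the examples mentioned above
historical_aliases = {
    "New Amsterdam": "New York",
    "Constantinople": "Istanbul",
    "Peking": "Beijing",
}


def modern_name(placename):
    """Return the modern name for a historical placename, or the name unchanged."""
    return historical_aliases.get(placename, placename)


print(modern_name("Peking"))
print(modern_name("Perth"))
```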

Another historical variant is changing which cities are the capitals. These may be due to political decisions, like the movement of the Australian parliament from Melbourne to the new city of Canberra, or they could be a necessity due to the results of war, like Bonn becoming the capital of West Germany after World War II. These older capitals may also have to be accommodated in your reference data.

Because FtToHNL is set in the 19th Century CE, the next step is to add various capital cities from then.

There are also larger geopolitical regions that may have been associated with placenames and cultures, for instance empires, dynasties and colonies like the British Empire or the Zulu Kingdom. Again, the borders and applicability of these political entities changed over time, so a contemporary reference list may not include them. 

The 19th Century CE was a time of many European Empires so for FtToHNL, you will need to add reference data associated with relevant entities.

When processing this reference file, you can add the old political entity, its capital (if known), the geographic region (like continent or part thereof) and the modern country it would be considered part of.  

The next step is to see if any of the placenames from our selected chapters of FtToHNL match these locations.

__[TO DO] Describe this without being technical__

If we match a placename, copy the geolocation data for the matching location. Otherwise, keep it empty so we know to keep looking for the placename.
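
The matching step can be pictured as a dictionary lookup keyed on the location name. This is a minimal sketch, assuming the reference data has a `LocationName` column as in this notebook; the two rows and their approximate coordinates are illustrative and are not taken from the reference file.

```python
# Hypothetical reference rows; coordinates are approximate and for illustration only
reference_rows = [
    {"LocationName": "Perth", "PartOf": "Western Australia", "Latitude": -31.95, "Longitude": 115.86},
    {"LocationName": "Hobart", "PartOf": "Tasmania", "Latitude": -42.88, "Longitude": 147.33},
]

# Index the rows by name so each lookup is a single dictionary access
reference_by_name = {row["LocationName"]: row for row in reference_rows}


def match_placename(placename):
    """Return a record with the matching reference row, or None if there is no match."""
    return {"placename": placename, "best_match": reference_by_name.get(placename)}


print(match_placename("Perth"))
print(match_placename("Maria Island"))
```

A dictionary keeps only one row per name, so this approach suits a reference list where each significant location appears once.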

In [None]:
# All the data about placenames and locations, once linked, as a list of dictionaries
geolocdata = []

for placename in placenames_df["Placename"]:

    # Create a new geoloc entry about this placename
    # [TO DO] Formally declare this as a dataframe?
    new_geolocdata = {}

    # Start a record for a placename
    new_geolocdata["placename"] = placename
    new_geolocdata["locations"] = {}  # Start with no location details
    new_geolocdata["locations"]["best_match"] = []  # Start with no matching location

    # Match found in the reference data
    if placename in list(locref_df["LocationName"]):
        # Copy the details from the reference file entry
        new_geolocdata["locations"]["best_match"] = locref_df[locref_df["LocationName"] == placename]

        print("+++ Found", placename)
    else:
        print("--- Still looking for", placename)

    # Add the new placename data to the list
    geolocdata.append(new_geolocdata)

Check that you have recorded the matches (and mismatches).

In [None]:
geolocdata[:10]

What locations did you end up finding?

In [None]:
matched_data = [
    place["locations"]["best_match"].to_string(index=False, header=False)
    for place in geolocdata
    if len(place["locations"]["best_match"]) > 0
]
matched_data

We can now forget about the dataframe with the complete set of reference data.

In [None]:
del locref_df

### Searching a Gazetteer for Locations <a class="anchor" id="section-searchgazetteer"></a>

Search [OpenStreetMap (OSM)](https://nominatim.org/release-docs/develop/api/Search/) for locations that match the unknown placenames.
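
Placenames often contain spaces or apostrophes, which need escaping before they go into a URL. The sketch below builds a search URL with `urllib.parse.urlencode`, using the `q`, `format` and `limit` parameters that this notebook passes to the search endpoint. `osm_search_url` is a hypothetical helper, and no request is sent.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def osm_search_url(placename, limit=5):
    """Build a search URL; urlencode escapes spaces, apostrophes and other unsafe characters."""
    query = urlencode({"q": placename, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH}?{query}"


url = osm_search_url("Van Diemen's Land")
print(url)
```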

In [None]:
# How many (max) results do we want for each name?
# [TO DO] Make this a user setting, defaulting to 5
# The normal is (Default: 10, Maximum: 50), according to https://nominatim.org/release-docs/develop/api/Search/
OSM_limit = 5

__[TO DO] Talk more__
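
Public services like this one usually ask clients to space out their requests. This notebook relies on the `ratelimit` package for that. The standard-library sketch below shows the underlying idea only and is not how that package is implemented. `MinIntervalLimiter` is a hypothetical class name.

```python
import time


class MinIntervalLimiter:
    """Block until at least `interval` seconds have passed since the previous call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


limiter = MinIntervalLimiter(interval=1.0)
for name in ["Maria Island", "Macquarie Harbour"]:
    limiter.wait()  # the second pass pauses for roughly one second
    print("Would query for", name)
```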

In [None]:
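# Send rate-limited requests that stay within 1 request per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def osm_call_api(url):
    # The Nominatim usage policy asks clients to identify themselves.
    # The User-Agent value here is a placeholder; replace it with your own.
    headers = {"User-Agent": "atap-geolocation-notebook"}
    response = requests.get(url, headers=headers)
    return response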


# Convert a postcode from a string into a state abbreviation
def postcode_to_state(postcodestr):
    postcode = int(postcodestr)

    if (1000 <= postcode <= 2599) or (2619 <= postcode < 2899) or (2921 <= postcode <= 2999):
        return "NSW"
    elif (200 <= postcode <= 299) or (2600 <= postcode <= 2618) or (2900 <= postcode <= 2920):
        return "ACT"
    elif (3000 <= postcode <= 3999) or (8000 <= postcode <= 8999):
        return "VIC"
    elif (4000 <= postcode <= 4999) or (9000 <= postcode < 9999):
        return "QLD"
    elif 5000 <= postcode <= 5999:
        return "SA"
    elif (6000 <= postcode <= 6797) or (6800 <= postcode <= 6999):
        return "WA"
    elif 7000 <= postcode <= 7999:
        return "TAS"
    elif 800 <= postcode <= 999:
        return "NT"
    # Some postcodes are special cases
    elif postcode == 2899:
        return "Norfolk Island"  # Coded as NSW
    elif postcode == 6798:
        return "Christmas Island"  # Coded as WA
    elif postcode == 6799:
        return "Cocos (Keeling) Islands"  # Coded as WA
    elif postcode == 9999:
        return "North Pole"  # Coded as VIC for Santa mail

    # Fallback
    return postcodestr


# Format the api response to make comparison easier
def osm_format_response(record):

    # Shorten the name and extract the country name, if any
    hyper_location = None
    short_location = record["display_name"]  # default to full address

    # str.find() returns -1 (which is truthy) when there is no comma, so test membership instead
    if "," in record["display_name"]:
        # Break up the address
        namesplit = record["display_name"].split(",")
        short_location = namesplit[0].strip()
        # Extract the rightmost term from the split
        hyper_location = namesplit[-1].strip()
        # Change to an Australian state name, rather than Australia
        if hyper_location == "Australia" and len(namesplit) > 2:
            hyper_location = namesplit[-2].strip()
            # Change postcodes into states
            if hyper_location.isdigit() and len(hyper_location) == 4:
                hyper_location = postcode_to_state(hyper_location)

    # For now, keep the names consistent between gazetteer records
    response = {
        "LocationName": str(short_location),
        "Category": str(record["type"]),
        "Latitude": record["lat"],
        "Longitude": record["lon"],
        "PartOf": str(hyper_location),
        "Gazetteer": "OSM",
    }
    return response

You can now move to the data that is needed for the geolocation project.
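
Each OSM result carries a comma-separated `display_name`, and the formatting step keeps its first and last parts. The sketch below shows that split on its own. The address string is a made-up example in the same comma-separated style, and `split_display_name` is a hypothetical helper.

```python
def split_display_name(display_name):
    """Return (first part, last part) of a comma-separated address string."""
    parts = [part.strip() for part in display_name.split(",")]
    short_location = parts[0]
    # With no comma there is no wider region to report
    hyper_location = parts[-1] if len(parts) > 1 else None
    return short_location, hyper_location


# Made-up address string, for illustration only
example = "Maria Island, Tasmania, Australia"
print(split_display_name(example))
```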

In [None]:
# For every placename in our list
for place in geolocdata:

    # Already found a matching location, so skip to the next placename
    if len(place["locations"]["best_match"]) > 0:
        continue

    placename = place["placename"]
    print("Looking for", placename)

    # Query the OSM database
    # Escape spaces, apostrophes and other unsafe characters in the placename
    url = f"https://nominatim.openstreetmap.org/search?q={urllib.parse.quote(placename)}&format=json&limit={OSM_limit}"
    response = osm_call_api(url)
    response_dict = json.loads(response.text)

    place["locations"]["candidates"] = None

    # Handle no results found, skip to the next placename
    if len(response_dict) == 0:
        continue

    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response_dict:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([osm_format_response(response_record)])

        # Add the data to a dataframe
        # The cleaned_response should be in the form we want to keep.
        data_frames.append(cleaned_response)

    # Add the results to the geoloc dataframe
    place["locations"]["candidates"] = data_frames

What placenames have you still not found?

In [None]:
unmatched_data = [
    place["placename"]
    for place in geolocdata
    if len(place["locations"]["best_match"]) == 0 and place["locations"]["candidates"] is None
]
unmatched_data

Where have you found locations for placenames?

In [None]:
matched_data = [
    place["placename"]
    for place in geolocdata
    if len(place["locations"]["best_match"]) > 0 or place["locations"]["candidates"] is not None
]
matched_data

__[TO DO] Show the output and discuss from OSM, so the context of the processing can be understood__

In [None]:
# Output the content of the final response to an OSM query
pprint(response.json())

While OpenStreetMap is a wonderful resource, it focusses on the current names of geographic locations. If the original source of your placenames was not written in recent decades, then OSM may not know the names that locations had at the time of the document.

One solution is to also look up a historical gazetteer, like TLCMap.

Like the OSM API, the TLCMap API has various options, such as which type of search to use and whether to search data entered by the public, rather than only data that has been verified or entered by experts.

__[TO DO] Describe the TLCMap__

For this workshop, you will look for exact matches between the placenames and the locations, and not consider any publicly entered data.
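
To make the difference between the search types concrete, the sketch below compares two names locally. It illustrates the general ideas of exact, contains and fuzzy matching only and is not how TLCMap implements them. The `matches` helper and the 0.8 similarity threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def matches(placename, candidate, search_type="exact", threshold=0.8):
    """Compare two names, ignoring case and surrounding whitespace."""
    a, b = placename.strip().lower(), candidate.strip().lower()
    if search_type == "exact":
        return a == b
    if search_type == "contains":
        return a in b
    if search_type == "fuzzy":
        # ratio() is 1.0 for identical strings and falls towards 0.0 as they differ
        return SequenceMatcher(None, a, b).ratio() >= threshold
    raise ValueError(f"Unknown search type: {search_type}")


print(matches("Maria Island", "maria island"))                      # exact
print(matches("Maria", "Maria Island", "contains"))                 # contains
print(matches("Macquarie Harbor", "Macquarie Harbour", "fuzzy"))    # fuzzy
```

Exact matching returns the fewest results but misses spelling variants, which are common in historical texts.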

In [None]:
# Which type of search to use for known locations. Alt values: 'exact', 'fuzzy', 'contains'
search_type = "exact"

# Flag whether to use data provided by the public
search_public_data = False

Like for the OSM, you can limit how many results you want to examine. This notebook sets the TLCMap limit to 5.

In [None]:
TLCMap_limit = 5

Like for the OSM, you will need a few functions to query the API.

In [None]:
def tlc_build_url(placename: str, search_type: str, search_public_data: bool = False) -> str:
    """
    Build a url to query the tlcmap/ghap API.
    placename: the place we're trying to locate
    search_type: what search type to use (accepts one of ['contains','fuzzy','exact'])

    ref: https://www.tlcmap.org/guides/ghap/#ws
    """
    safe_placename = urllib.parse.quote(placename.strip().lower())

    url = "https://tlcmap.org/ghap/search?"

    if search_type == "fuzzy":
        url += f"fuzzyname={safe_placename}"
    elif search_type == "exact":
        url += f"name={safe_placename}"
    elif search_type == "contains":
        url += f"containsname={safe_placename}"
    else:
        return None

    # Search Australian National Placenames Survey provided data
    url += "&searchausgaz=on"

    # Search public provided data, this data could be unreliable
    if search_public_data:
        url += "&searchpublicdatasets=on"
    else:
        url += "&searchpublicdatasets=off"

    # Retrieve data as JSON
    url += "&format=json"

    # Limit the number of results
    url += "&paging=" + str(TLCMap_limit)

    return url


# Send rate-limited requests that stay within 1 request per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def tlc_call_api(url):
    r = requests.get(url)
    if r.url == "https://tlcmap.org/ghap/maxpaging":
        return None

    # Default to None so that a failed request does not leave 'response' undefined
    response = None

    # If the reply says the placename wasn't found, customise the JSON data for the reply
    if r.content.decode() == "No search results to display.":
        # TLCMap replies with plain text here, rather than an empty list of features
        response = json.loads('{"type": "FeatureCollection","metadata": {},"features": []}')
    # SUCCESS! Record the spatial data provided in the reply
    elif r.ok:
        response = r.json()  # get [lon, lat] for spatial matches

    return response


def tlc_query_name(placename: str, search_type: str):
    """
    Use the tlcmap/ghap API to look up a placename using the given search type.
    Returns None for an unrecognised search type.
    """
    url = tlc_build_url(placename, search_type, search_public_data)
    if url:
        return tlc_call_api(url)
    return None

In [None]:
# Format the api response to make comparison easier
def tlc_format_response(location_features):

    locdata = {}  # formatted data

    # Gather the locdata for one of the placename's locations
    # If the value is missing, set a value or leave it empty.
    if len(location_features):
        if "placename" in location_features["properties"]:
            locdata["LocationName"] = location_features["properties"]["placename"].lstrip().rstrip()
        else:
            locdata["LocationName"] = "Unknown Location"

        if "feature_term" in location_features["properties"]:
            locdata["Category"] = location_features["properties"]["feature_term"].lstrip().rstrip()
        else:
            locdata["Category"] = None

        if "longitude" in location_features["properties"]:
            locdata["Longitude"] = location_features["properties"]["longitude"]
        else:
            locdata["Longitude"] = ""

        if "latitude" in location_features["properties"]:
            locdata["Latitude"] = location_features["properties"]["latitude"]
        else:
            locdata["Latitude"] = ""

        if "state" in location_features["properties"]:
            locdata["PartOf"] = location_features["properties"]["state"].lstrip().rstrip()
        else:
            locdata["PartOf"] = None

        locdata["Gazetteer"] = "TLCMap"

        # For now, keep the names consistent between gazetteer records
        response = {
            "LocationName": str(locdata["LocationName"]),
            "Category": str(locdata["Category"]),
            "Latitude": locdata["Latitude"],
            "Longitude": locdata["Longitude"],
            "PartOf": str(locdata["PartOf"]),
            "Gazetteer": "TLCMap",
        }
    else:
        response = None

    return response

You can now search the TLCMap for locations matching the same placenames you previously searched for in the OSM.
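
The replies handled in this notebook are feature collections, where each feature has a `properties` dictionary that may be missing some keys. The sketch below shows one compact way to read such a record with `dict.get` defaults. The reply is hand-written for illustration, with approximate coordinates, and `summarise_feature` is a hypothetical helper.

```python
# Hand-written reply in the FeatureCollection shape; values are illustrative only
reply = {
    "type": "FeatureCollection",
    "metadata": {},
    "features": [
        {"properties": {"placename": " Maria Island ", "state": "TAS",
                        "latitude": -42.63, "longitude": 148.08}},
        {"properties": {}},  # a record with every value missing
    ],
}


def summarise_feature(feature):
    """Pick out the fields of interest, falling back to defaults when a key is missing."""
    props = feature.get("properties", {})
    return {
        "LocationName": props.get("placename", "Unknown Location").strip(),
        "Latitude": props.get("latitude", ""),
        "Longitude": props.get("longitude", ""),
        "PartOf": props.get("state"),
    }


rows = [summarise_feature(feature) for feature in reply["features"]]
for row in rows:
    print(row)
```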

In [None]:
# For every placename in our list
for place in geolocdata:

    # Already found a location, so skip to the next placename
    if len(place["locations"]["best_match"]) > 0:
        continue

    placename = place["placename"]
    print("Looking for", placename)

    # Query the TLCMap database
    response = tlc_query_name(placename, search_type)

    # Handle no results found, skip to the next placename
    if response is None:
        continue

    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response["features"]:
        #  Use this to look at a reduced set of data from the results
        cleaned_response = pd.DataFrame([tlc_format_response(response_record)])

        # Add the data to a dataframe
        data_frames.append(cleaned_response)

    # Add the results to the geoloc dataframe
    # Review the outcomes later
    # Make sure you don't write over any candidates previously added from another gazetteer.
    if place["locations"]["candidates"] is None:
        place["locations"]["candidates"] = data_frames
    else:
        place["locations"]["candidates"] = place["locations"]["candidates"] + data_frames

What placenames have you still not found?

In [None]:
unmatched_data = [
    place["placename"]
    for place in geolocdata
    # Must have not found an unambiguous best match or any candidate locations
    if (
        len(place["locations"]["best_match"]) == 0
        and (place["locations"]["candidates"] is None or place["locations"]["candidates"] == [])
    )
]
unmatched_data

You can now compare all the locations you have found.

In [None]:
matched_data = [
    place
    for place in geolocdata
    if (
        len(place["locations"]["best_match"]) != 0
        or (place["locations"]["candidates"] is not None and place["locations"]["candidates"] != [])
    )
]

all_locations = []

for place in matched_data:
    placename = place["placename"]
    locations = []

    if len(place["locations"]["best_match"]) != 0:
        # Unambiguous best match already known
        locations = place["locations"]["best_match"]
    else:
        # Any candidate locations you have found
        if place["locations"]["candidates"] is not None:
            locations = pd.concat(place["locations"]["candidates"], ignore_index=True)
    print("===> ", placename)
    print(locations)
    all_locations = all_locations + [locations]

__[TO DO] Not sure which version to show - the full "pretty version" or the short dirty one__

In [None]:
pprint(all_locations[0:10])

__[TO DO] Show the output and discuss from OSM, so the context of the processing can be understood__

In [None]:
pprint(response)

The next task is to work out which of these locations is most suitable.

A few heuristics can flag locations with features that help to rank and select them.

One heuristic is to check whether multiple locations have similar coordinates.

In [None]:
# Compare two sets of coordinates
# Return true if they are both within 1 point of each other for both Lat and Lon
def compare_coords(lat1, lon1, lat2, lon2):

    if lat1.item() == "" or lat2.item() == "" or lon1.item() == "" or lon2.item() == "":
        return False

    lat1_int = int(float(lat1.item()))
    lon1_int = int(float(lon1.item()))
    lat2_int = int(float(lat2.item()))
    lon2_int = int(float(lon2.item()))

    # Is Coord1 within 1 point of Coord2?
    # e.g., -47.5 is close to -46.5 and -48.5
    return (lat1_int in range(lat2_int - 1, lat2_int + 2, 1)) and (lon1_int in range(lon2_int - 1, lon2_int + 2, 1))
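
To see what "within 1 point" means in practice, here is a minimal sketch of the same comparison on plain floats instead of single-value pandas objects. The function name `coords_close` is hypothetical and is not part of the notebook, and the city coordinates are approximate. Because the values are truncated to whole degrees before they are compared, two points almost 2 degrees apart can still count as close.

```python
def coords_close(lat1, lon1, lat2, lon2):
    """Return True if the truncated whole-degree values differ by at most 1."""
    lat_close = abs(int(lat1) - int(lat2)) <= 1
    lon_close = abs(int(lon1) - int(lon2)) <= 1
    return lat_close and lon_close


# Sydney and Wollongong: whole degrees (-33, 151) and (-34, 150)
print(coords_close(-33.87, 151.21, -34.42, 150.89))
# Sydney and Melbourne: whole degrees (-33, 151) and (-37, 144)
print(coords_close(-33.87, 151.21, -37.81, 144.96))
# Truncation effect: 1.9 degrees apart, but the whole degrees are -46 and -47
print(coords_close(-46.0, 10.0, -47.9, 10.0))
```

If a stricter test is needed, comparing the untruncated floats with a tolerance would avoid the truncation effect shown in the last call.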

However, some locations have no coordinates, which complicates the comparison. A simple workaround is to default missing values to 0. This is not perfect, because 0 is itself a valid coordinate, but in most cases it is adequate.

In [None]:
# Given a float as a string value, convert it to an integer
# This is required because of NA or "".
def sorting_coord(coord):

    # Set to 0 if missing data (an empty string, None or NaN)
    if isinstance(coord, str) and coord.strip() == "":
        return 0
    elif not isinstance(coord, str) and pd.isna(coord):
        return 0
    else:
        return int(float(coord))
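
The effect of defaulting missing coordinates to 0 can be seen in a small standalone sketch. It uses only the standard library and plain dictionaries, not the notebook's dataframes. The helper name `coord_sort_key` and the sample records are made up for illustration.

```python
import math


def coord_sort_key(coord):
    """Map a coordinate (number, numeric string, "" or None) to an integer sort key."""
    if coord is None or coord == "":
        return 0
    value = float(coord)
    if math.isnan(value):
        return 0
    return int(value)


# Illustrative records only; the names and coordinates are made up for this sketch
records = [
    {"name": "A", "lat": "-33.9", "lon": "151.2"},
    {"name": "B", "lat": "", "lon": ""},
    {"name": "C", "lat": -37.8, "lon": 144.9},
    {"name": "D", "lat": None, "lon": None},
]

ordered = sorted(records, key=lambda r: (coord_sort_key(r["lat"]), coord_sort_key(r["lon"])))
print([r["name"] for r in ordered])
```

The records without coordinates (B and D) end up next to each other at (0, 0), so they would also sit beside any real location near the equator and the prime meridian.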

Another heuristic is to recognise locations in countries that you know are relevant to the original document. For this, you can check the values in the PartOf field.

In [None]:
# Flag locations in Australia or Britain
aus_states = {
    "AUSTRALIA",
    "NSW",
    "VIC",
    "QLD",
    "TAS",
    "WA",
    "NT",
    "SA",
    "ACT",
    "NEW SOUTH WALES",
    "VICTORIA",
    "QUEENSLAND",
    "TASMANIA",
    "WESTERN AUSTRALIA",
    "SOUTH AUSTRALIA",
    "NORTHERN TERRITORY",
    "AUSTRALIAN CAPITAL TERRITORY",
}
gb_states = {
    "BRITAIN",
    "UK",
    "GB",
    "GREAT BRITAIN",
    "UNITED KINGDOM",
    "BRITISH ISLES",
    "ENGLAND",
    "WALES",
    "SCOTLAND",
    "IRELAND",
    "NORTHERN IRELAND",
    "ÉIRE / IRELAND",
    "EIRE",
    "EIRE / IRELAND",
}
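
As a rough sketch of how such sets can be used, the lookup can be wrapped in a small function. The name `country_flag` is hypothetical, and the sets below are abbreviated copies so that the example runs on its own.

```python
# Abbreviated versions of the sets above, repeated so the sketch runs on its own
AUS_PARTS = {"AUSTRALIA", "NSW", "VIC", "NEW SOUTH WALES", "VICTORIA"}
GB_PARTS = {"BRITAIN", "UK", "ENGLAND", "WALES", "SCOTLAND"}


def country_flag(part_of):
    """Return "Australia", "Britain" or "" for a PartOf value."""
    if part_of is None:
        return ""
    key = str(part_of).strip().upper()
    if key in AUS_PARTS:
        return "Australia"
    if key in GB_PARTS:
        return "Britain"
    return ""


for value in ["nsw", " Victoria ", "Scotland", "None", "Texas"]:
    print(repr(value), "->", repr(country_flag(value)))
```

The match is on the whole value, so a combined value such as "New South Wales, Australia" would not be flagged unless it was split first.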

Now you can go through each location and its candidate locations and flag whether they correspond to any of these criteria. A distinction is made between whether any two locations with similar coordinates were found in the same gazetteer or a different one.

In [None]:
for place in matched_data:
    placename = place["placename"]
    locations = []
    if (
        len(place["locations"]["best_match"]) == 0
        and place["locations"]["candidates"] is not None
        and len(place["locations"]["candidates"]) > 1
    ):

        # Sort the locations by Latitude & Longitude
        sorted_candidates = sorted(
            place["locations"]["candidates"],
            key=lambda x: [
                sorting_coord(x["Latitude"].item()),
                sorting_coord(x["Longitude"].item()),
            ],
        )

        prev_location = {}

        # Flag the candidates according to the heuristics
        for candidate in sorted_candidates:
            rank_flags = []
            partof_flags = ""

            # Flag locations in Australia or Britain
            # [TO DO] Make these lines into a function to pull the Aus/GB code out of the main code.
            partings = candidate["PartOf"].item().upper()
            if partings in aus_states:
                partof_flags = "Australia"
            if partings in gb_states:
                partof_flags = "Britain"
            if partof_flags != "":
                rank_flags.append(partof_flags)

            # Flag coords in multiple gazetteers
            if len(prev_location) > 0:
                coord_flag = compare_coords(
                    prev_location["Latitude"],
                    prev_location["Longitude"],
                    candidate["Latitude"],
                    candidate["Longitude"],
                )
                # Dupl_2Gaz: Matching coords in locations from 2 gazetteers
                if coord_flag and prev_location["Gazetteer"].item() != candidate["Gazetteer"].item():
                    rank_flags.append("Dupl_2Gaz")
                # Dupl_1Gaz: Matching coords in locations from 1 gazetteer
                elif coord_flag:
                    rank_flags.append("Dupl_1Gaz")

            prev_location = candidate

            # rank_flags has to be converted to a single string, rather than a list, 
            # because candidate is a Pandas dataframe
            candidate["RankFlags"] = pd.Series(",".join(rank_flags))
            print(
                " ** ",
                candidate["LocationName"].item(),
                ",",
                partings,
                ",[",
                candidate["RankFlags"].item(),
                "]",
            )

        # Update the geoloc data
        place["locations"]["candidates"] = sorted_candidates
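
The duplicate check relies on sorting first, so that locations with similar coordinates become neighbours and only adjacent pairs need comparing. Here is a simplified sketch of that idea on plain dictionaries. The function name `flag_duplicates`, the lowercase keys and the approximate sample coordinates are assumptions for illustration, and every candidate is assumed to have coordinates.

```python
def flag_duplicates(candidates):
    """Sort candidates by whole-degree coordinates and flag neighbours that are close."""
    ordered = sorted(candidates, key=lambda c: (int(c["lat"]), int(c["lon"])))
    previous = None
    for current in ordered:
        flag = ""
        if previous is not None:
            close = (
                abs(int(previous["lat"]) - int(current["lat"])) <= 1
                and abs(int(previous["lon"]) - int(current["lon"])) <= 1
            )
            if close and previous["gazetteer"] != current["gazetteer"]:
                flag = "Dupl_2Gaz"  # similar coordinates, different gazetteers
            elif close:
                flag = "Dupl_1Gaz"  # similar coordinates, same gazetteer
        current["flag"] = flag
        previous = current
    return ordered


# Approximate, illustrative coordinates for three places called Richmond
candidates = [
    {"name": "Richmond", "lat": -37.82, "lon": 145.0, "gazetteer": "OSM"},
    {"name": "Richmond", "lat": 54.4, "lon": -1.74, "gazetteer": "OSM"},
    {"name": "Richmond", "lat": -37.8, "lon": 144.99, "gazetteer": "TLCMap"},
]

flagged = flag_duplicates(candidates)
for c in flagged:
    print(c["gazetteer"], c["lat"], c["lon"], repr(c["flag"]))
```

As in the notebook cell, only the later record of a matching pair receives the flag; the first record of the pair has nothing before it to compare against.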

You can now use the heuristic flags to re-sort the candidates and select the best one.

In [None]:
# Establish how to rank the heuristic flags
sortorder = [
    "Australia,Dupl_2Gaz",  # In Australia, found in 2 gazetteers
    "Australia,Dupl_1Gaz",  # In Australia, found more than once in 1 gazetteer
    "Australia",  # In Australia, found only once
    "Britain,Dupl_2Gaz",  # In Great Britain, found in 2 gazetteers
    "Dupl_2Gaz",  # Not in Australia or Great Britain, found in 2 gazetteers
    "Britain,Dupl_1Gaz",  # In Great Britain, found more than once in 1 gazetteer
    "Britain",  # In Great Britain, found only once
    "Dupl_1Gaz",  # Not in Australia or Great Britain, found more than once in 1 gazetteer
    "",  # Not in Australia or Great Britain, found only once
]
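
One possible way to apply such an ordering, sketched here on plain dictionaries and not on the notebook's dataframes, is to turn the list into a lookup from flag string to position and use it as a sort key. The names `SORT_ORDER` and `RANK`, the sample candidates and their flags are made up for illustration.

```python
SORT_ORDER = [
    "Australia,Dupl_2Gaz",
    "Australia,Dupl_1Gaz",
    "Australia",
    "Britain,Dupl_2Gaz",
    "Dupl_2Gaz",
    "Britain,Dupl_1Gaz",
    "Britain",
    "Dupl_1Gaz",
    "",
]

# Position of each flag string in the ranking; a lower number means a better rank
RANK = {flags: position for position, flags in enumerate(SORT_ORDER)}

# Illustrative candidates only; the flags are invented for this sketch
candidates = [
    {"name": "Perth (Scotland)", "flags": "Britain"},
    {"name": "Perth (Tasmania)", "flags": "Australia"},
    {"name": "Perth (WA)", "flags": "Australia,Dupl_2Gaz"},
    {"name": "Perth (Ontario)", "flags": ""},
]

# Unknown flag strings are placed after every known rank
ranked = sorted(candidates, key=lambda c: RANK.get(c["flags"], len(SORT_ORDER)))
print([c["name"] for c in ranked])
```

Python's `sorted` is stable, so candidates with the same flags keep their earlier relative order, which here would be the latitude and longitude order.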

In [None]:
# Sort the candidates
for place in geolocdata:
    placename = place["placename"]
    pprint(placename)

    if (
        len(place["locations"]["best_match"]) == 0
        and place["locations"]["candidates"] is not None
        and len(place["locations"]["candidates"]) > 1
    ):
        candidates = place["locations"]["candidates"]

        sorted_candidates = []
        for heuristic in sortorder:
            # Get the candidates matching this ranking heuristic
            for candidate in candidates:
                if candidate["RankFlags"].item() == heuristic:
                    # Add the candidate to the sorted list
                    sorted_candidates.append(candidate)
        # Prepare a version for printing
        locations = pd.concat(sorted_candidates, ignore_index=True)
        pprint(locations)

        # Update the geoloc data with the sorted candidates
        place["locations"]["candidates"] = sorted_candidates

In [None]:
# Outputting without the RankFlags column
matched_data = [
    place
    for place in geolocdata
    if (
        len(place["locations"]["best_match"]) != 0
        or (place["locations"]["candidates"] is not None and place["locations"]["candidates"] != [])
    )
]
all_locations = []
for place in matched_data:
    placename = place["placename"]
    locations = []
    if len(place["locations"]["best_match"]) != 0:
        # Unambiguous best match already known
        locations = place["locations"]["best_match"]
    else:
        # Any candidate locations you have found
        if place["locations"]["candidates"] is not None:
            short_candidates = []
            for candidate in place["locations"]["candidates"]:
                # For output, ignore the RankFlags column
                short_candidates = short_candidates + [candidate.loc[:, candidate.columns != "RankFlags"]]
            locations = pd.concat(short_candidates, ignore_index=True)
    print(locations)

Now that the candidates are sorted, the best match can be selected. The most obvious choice is the highest ranked candidate.

In [None]:
for place in geolocdata:
    placename = place["placename"]
    locations = []
    # Already have the best match
    if len(place["locations"]["best_match"]) != 0:
        locations = place["locations"]["best_match"]
    else:
        if place["locations"]["candidates"] is not None and len(place["locations"]["candidates"]) > 0:
            # Presume the best match has the top rank
            top_candidate = place["locations"]["candidates"][0]
            place["locations"]["best_match"] = top_candidate.loc[
                :, ~top_candidate.columns.isin(["Gazetteer", "RankFlags"])
            ]

In [None]:
# Outputting just the best matching locations without the RankFlags column
matched_data = [
    place
    for place in geolocdata
    if (
        len(place["locations"]["best_match"]) != 0
        or (place["locations"]["candidates"] is not None and place["locations"]["candidates"] != [])
    )
]
all_locations = []
for place in matched_data:
    placename = place["placename"]
    locations = []
    if len(place["locations"]["best_match"]) != 0:
        locations = place["locations"]["best_match"]
    else:
        # This shouldn't run because all entries with matches should now have best matches, but just in case
        if place["locations"]["candidates"] is not None:
            short_candidates = []
            for candidate in place["locations"]["candidates"]:
                short_candidates = short_candidates + [candidate.loc[:, candidate.columns != "RankFlags"]]
            locations = pd.concat(short_candidates, ignore_index=True)
    print(locations)

Unfortunately, the top-ranked candidate may still not be the best. The second-ranked candidate may have the same rank flags as the top one and sit lower only because of the earlier sort on latitude and longitude (which was needed to find candidates with matching coordinates). For this reason, the final stage of the processing is left to the user: they are shown the best candidate location but can select an alternative.

There are a number of ways the user interface could be built. The most verbose shows every candidate location for every placename, with checkboxes for the user to select what they think is the best match.

Alternatively, the candidates could be presented as drop-down accordions.

In [None]:
for place in geolocdata:
    placename = place["placename"]

    # If there isn't a best match, then there aren't any candidates
    if len(place["locations"]["best_match"]) != 0:

        best_match = place["locations"]["best_match"]
        best_match_index = 0

        # Must be at least one candidate
        if (
            "candidates" in place["locations"].keys()
            and place["locations"]["candidates"] is not None
            and len(place["locations"]["candidates"]) >= 1
        ):

            # Create a RadioButton for each candidate location
            candidates = place["locations"]["candidates"]
            candidate_buttons = [
                widgets.RadioButtons(
                    layout={"width": "max-content"},
                    options=[
                        candidate["LocationName"].item()
                        + ", "
                        + candidate["PartOf"].item()
                        + " ("
                        + str(candidate["Latitude"].item())
                        + ","
                        + str(candidate["Longitude"].item())
                        + ")"
                        for candidate in candidates
                    ],
                )
            ]
            # Add a button for "None of the above"
            candidate_buttons[0].options = candidate_buttons[0].options + tuple(["None of the above"])
            place["locations"]["candidate_buttons"] = candidate_buttons

            # Select the button for the current bestmatch
            place["locations"]["candidate_buttons"][0].index = best_match_index

            # Create a HBox for the Placename
            placename_box = widgets.HBox(
                [
                    widgets.Label(
                        "'"
                        + place["placename"]
                        + "' matches best with: "
                        + best_match["LocationName"].item()
                        + ", "
                        + best_match["PartOf"].item()
                        + " ("
                        + str(best_match["Latitude"].item())
                        + ", "
                        + str(best_match["Longitude"].item())
                        + ")"
                    )
                ]
            )

            # Create an Accordion (in a HBox) for the candidates
            new_accordion = widgets.Accordion()
            new_accordion.set_title(0, "Expand to choose a different location")
            new_accordion.children = [widgets.HBox(place["locations"]["candidate_buttons"])]
            new_accordion.selected_index = None  # close the accordion at startup

            # Put the HBoxes together in a VBox
            whole_box = widgets.VBox([placename_box, new_accordion])

            # Show it all!
            # Note: this doesn't close one accordion if another is opened because they are in separate VBoxes
            display(whole_box)

Record the selected candidates as the best matches. Account for placenames without any best match.

In [None]:
for place in geolocdata:

    # If there isn't a best match, then there aren't any candidates
    if len(place["locations"]["best_match"]) != 0:

        best_match_index = 0

        # Must be at least one candidate
        if "candidate_buttons" in place["locations"].keys():

            # Extract the selection
            best_match_index = place["locations"]["candidate_buttons"][0].index
            # Make sure the selection is a location
            if best_match_index < len(place["locations"]["candidates"]):
                # Record the best match
                top_candidate = place["locations"]["candidates"][best_match_index]
                # Just record the important columns for the best match
                place["locations"]["best_match"] = top_candidate.loc[
                    :, ~top_candidate.columns.isin(["Gazetteer", "RankFlags"])
                ]
            else:
                # None of the above
                best_match = {
                    "LocationName": "No suitable location selected",
                    "Category": "No suitable location selected",
                    "Latitude": "",
                    "Longitude": "",
                    "PartOf": "No suitable location selected",
                }
                place["locations"]["best_match"] = pd.DataFrame([best_match])
    else:
        # Note that there is no best match
        best_match = {
            "LocationName": "No location matched",
            "Category": "No location matched",
            "Latitude": "",
            "Longitude": "",
            "PartOf": "No location matched",
        }
        place["locations"]["best_match"] = pd.DataFrame([best_match])

Prepare the final geoloc data for output to file.

In [None]:
pprint(geolocdata[0:10])

In [None]:
# Verbose output
all_locations = []  # the processed data
for place in geolocdata:
    placename = place["placename"]
    pprint(placename)

    # Reformat the data about the best match
    best_locations = []
    # All placenames should now have a best_match
    if len(place["locations"]["best_match"]) != 0:
        for candidate in place["locations"]["best_match"]:
            if place["locations"]["best_match"][candidate].item() != "":
                best_locations = best_locations + [place["locations"]["best_match"][candidate].item()]

    # Reformat the data about the candidate locations
    locations = []
    # Skip placenames whose candidates are missing, None or an empty list
    if place["locations"].get("candidates"):
        short_candidates = []
        for candidate in place["locations"]["candidates"]:
            # Select which columns to output
            short_candidates = short_candidates + [candidate.loc[:, candidate.columns != "RankFlags"]]
        # Merge the dataframe values into a more human-readable format for now,
        # though this is not a comma-separated format
        locations = pd.concat(short_candidates, ignore_index=True)

    # Put this all together
    new_record = [["placename", placename], ["best_match", best_locations], ["candidates", locations]]
    all_locations = all_locations + [new_record]

In [None]:
for place in all_locations[0:10]:
    pprint(place)

In [None]:
# Save all the geolocdata as a messy combination of array rows and dataframes
filename = "FtToHNL_matchedlocations.data"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving location data to", save_location)

# Save the list
np.savetxt(save_location, geolocdata, delimiter=", ", fmt="%s")

In [None]:
# Final output
# Only use selected columns from the best match in csv format

# Make a new array of records
geoloc_output = [["Placename", "PartOf", "Latitude", "Longitude"]]
for place in geolocdata:
    # Only output those placenames with a best match location
    if len(place["locations"]["best_match"]) != 0:
        placename = place["placename"]
        location = place["locations"]["best_match"]
        print("Formatting " + placename)
        pprint(location)

        # Convert Lat/Long into floats, rather than strings
        # (location is a single-row dataframe, so .item() extracts the value)
        latitude = location["Latitude"].item()
        if latitude != "":
            latitude = float(latitude)
        longitude = location["Longitude"].item()
        if longitude != "":
            longitude = float(longitude)

        # Add this record to the list
        geoloc_output.append([placename, location["PartOf"].item(), latitude, longitude])

In [None]:
geoloc_output

In [None]:
filename = "FtToHNL_geolocdata.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving location data to", save_location)

# Save the list
np.savetxt(save_location, geoloc_output, delimiter=", ", fmt="% s")
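
`np.savetxt` writes each value as it is, without quoting, so a placename that contains a comma would be split into extra columns when the file is read back. One alternative is Python's standard `csv` module, which quotes such fields. The sketch below is an assumption about what could be done, not the notebook's method. It writes to an in-memory buffer instead of a file, and the sample rows are made up for illustration.

```python
import csv
import io

# Illustrative rows only; the second placename contains a comma on purpose
rows = [
    ["Placename", "PartOf", "Latitude", "Longitude"],
    ["Sydney", "New South Wales", -33.87, 151.21],
    ["Hobart Town, Van Diemen's Land", "Tasmania", "", ""],
]

# Write the rows as CSV; fields containing the delimiter are quoted automatically
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Read the text back to confirm every row still has four columns
recovered = list(csv.reader(io.StringIO(text)))
print([len(row) for row in recovered])
```

To write to disk, the buffer could be replaced by a file opened with `open(save_location, "w", newline="")`. Note that values read back from CSV are strings, so the coordinates would need converting to floats again.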