# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
 * [Look for NEs](#section-nes)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Some, like json, come with your Python installation, but the following need to be installed:
<ul>
    <li> pandas</li> 
    <li> numpy</li> 
    <li> spacy (and its en_core_web_sm model)</li> 
    <li> ipywidgets</li> 
    <li> nltk</li> 
    <li> geopandas</li> 
    <li> shapely  </li> 
</ul>
</div>

In [1]:
# TODO: UPDATE
# Many of these are probably not needed.

import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

#import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS, 
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
#import geopandas as gpd
#from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
#from nltk.tag import StanfordNERTagger
#from nltk.tokenize import word_tokenize

# spaCy is used for a pipeline of NLP functions
import spacy
from spacy.tokens import Span
from spacy import displacy

# Shapely is used to work with geometric shapes
#from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
#from fuzzywuzzy import fuzz

# used for the checklist
import ipywidgets as widgets

In [2]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You also want to set up some directories for the import and output of data.

In [3]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
reference_directory = os.path.normpath("../Data")
#maps_directory = os.path.normpath("../maps/")

## Create the data directories (nothing happens if they already exist)
os.makedirs(text_directory, exist_ok=True)
os.makedirs(csv_directory, exist_ok=True)
os.makedirs(reference_directory, exist_ok=True)

#if not os.path.exists(maps_directory):
#    os.makedirs(maps_directory)

For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained via [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is an unformatted text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing any accented characters with their unaccented forms and the British pound sterling symbol with the word *pounds*. 
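
The ASCII simplification described above could be reproduced along the following lines. This is only a sketch of one possible approach, not necessarily how our copy was prepared, and the function name `simplify_to_ascii` is hypothetical.

```python
import unicodedata

def simplify_to_ascii(raw_text):
    """Reduce text to plain ASCII (hypothetical helper, for illustration only)."""
    # Replace the pound sterling symbol with the word 'pounds'
    replaced = raw_text.replace("\u00a3", "pounds")
    # Split accented characters into a base letter plus a combining accent
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop everything that is not ASCII, which removes the combining accents
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(simplify_to_ascii("caf\u00e9 \u00a35"))
```

Note that any other non-ASCII character without an ASCII base form is silently dropped by this approach, so it is worth checking the result before relying on it.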

The novel is divided into four books, each set in a different region of the world. You can start with the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [None]:
#filename="FtToHNL_BOOK_1.txt"
filename="FtToHNL_BOOK_2.txt"
print("Working on | ", filename)

# build the full path to the file inside the text directory
textlocation = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(textlocation)

# read the whole file into a single string, closing the file afterwards
with open(textlocation, encoding="utf-8") as f:
    text = f.read()

This is no more than a long string of characters. So far, you have done no processing. 

In [None]:
text[0:500] # look at the first 500 characters

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Look for NEs <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like names of places, people or organisations.

There are various packages that provide Named Entity Recognition (NER), e.g., [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb), the Stanford NER, and the spaCy library. They often combine machine learning and a rule-based system to identify NEs and classify them into categories.

For this notebook, you will be using the spaCy NER - https://spacy.io/usage/linguistic-features#morphology .  This is available as a Python library.

SpaCy allows you to load a language model that has been trained on various examples of the language of interest. 

In [None]:
nlp = spacy.load("en_core_web_sm")

SpaCy will automatically run the text through various levels of natural language processing. This pipeline includes tokenising the text into individual tokens or terms, like words, values and punctuation.

TODO: Update

It is simple to use the model. You just pass it the text. However, spaCy also allows you to specify which components of the pipeline to run. For instance, you might want it to tokenise the terms and label their parts-of-speech and lemmatised forms, as well as recognising any NEs.

TODO: Update

<div class="alert alert-block alert-info">
The components of the spaCy en_core_web_sm pipeline include:<ul>
    <li> <strong>tok2vec - </strong> token-to-vector embeddings used by later components </li> 
      <li>    <strong>tagger - </strong> syntactic parts-of-speech</li> 
     <li>     <strong>parser -  </strong> parsing of dependencies (also used to split the text into sentences)</li> 
      <li>    <strong>attribute_ruler -  </strong> rule-based adjustments to token attributes</li> 
      <li>    <strong>lemmatizer - </strong> lemmatised form (not always a root form)</li> 
      <li>    <strong>ner -  </strong> named entity recognition</li> 
</ul>
A full explanation can be found at <a href="https://spacy.io/usage/processing-pipelines">https://spacy.io/usage/processing-pipelines</a></div>

For example, the default pipeline contains the following components.

In [None]:
print("Pipeline:", nlp.pipe_names)

Text sent to the spaCy model will be processed by the pipeline.

In [None]:
# sample text
sampletext = "Autonomous cars shift insurance liability toward manufacturers"

In [None]:
doc = nlp(sampletext)
doc

The output from the pipeline is then available in the document object that the spaCy model returns.

In [None]:
for token in doc:
    print(token.text," - ", "Morph: ",token.morph, 
          "\n   Dep: ",token.dep_, 
          "\n   Head: ",token.head.text, 
          "\n   Head Pos: ",token.head.pos_,
          "\n   Child: ",[child for child in token.children])


However, you might not want some of this pipeline processing as it may not be beneficial to your analysis. Any unneeded processing will also slow the system down and place a greater demand on the memory. This is particularly true of the parser. Luckily, it is easy to stipulate what you want excluded from the pipeline. 

In [None]:
doc=nlp(sampletext, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [None]:
for token in doc:
    print(token.text," - ", "Morph: ",token.morph, 
          "\n   Dep: ",token.dep_, 
          "\n   Head: ",token.head.text, 
          "\n   Head Pos: ",token.head.pos_,
          "\n   Child: ",[child for child in token.children])

Of course, what you are interested in is the NER. Each text sent down a pipeline that includes the ner component will get a list of the entities that have been found. 

In [None]:
sampletext = "Apple is looking at buying a U.K. startup based in London for $1 billion."

In [None]:
doc = nlp(sampletext)

for ent in doc.ents:
    print(ent.text, "[",ent.label_,"]")

As you can see, each entity is labelled with a category.

TODO: UPDATE

<div class="alert alert-block alert-info">
    The NER categories assigned by the en_core_web_sm model include:
   <ul>
<li><strong>People and groups:</strong>
PERSON, NORP, ORG</li>
<li><strong>Places: </strong>
GPE, LOC, FAC</li>
<li><strong>Other named things:</strong>
PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE</li>
<li><strong>Dates and numbers:</strong>
DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL</li>
</ul>
</div>

TODO: Update

Most of the sentence is not part of any NE, but some spans have been categorised. These categories are specific to the model that spaCy has loaded. As well as the names of people, organisations and places, this model labels other specialised phrases like TIME and MONEY, which are not really named entities.


The data for the entities includes the character position for the start and the end of the NE.

In [None]:
doc = nlp(sampletext)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Each token will also have a value that indicates whether it is part of an NE.

In [None]:
for token in doc:
    print(token.text, "[", token.ent_type_, "]")

TODO: Update

The entities listed so far can span more than one token. SpaCy groups the tokens of a multi-word expression (MWE) into a single entity, so each entry in the list of entities is the complete name rather than its separate words. The token-level values simply show which entity category, if any, each token belongs to.

It is also possible to hand-code entities after the NER has been done. This can help make up for any common irregularities with the NER for your input documents.

For instance, this model doesn't recognise that FB is an NE.

In [None]:
sampletext = "FB is hiring a new vice president of global policy"

doc = nlp(sampletext)
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print('Entities before:', ents)
# The model didn't recognize "fb" as an entity :-(

The solution is to create a new entry for the list of entities.

In [None]:
# Create a spaCy span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

ents = [(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents]
print('Entities after:', ents)

Even the data for the tokens is updated. 

In [None]:
for token in doc:
    print(token.text, "[", token.ent_type_, "]")

SpaCy also allows the input documents to be processed in batches. This helps better manage the processing demands of the system throughout the pipeline when there are multiple files or many sentences.

In [None]:
# multiple texts in a list
sampletexts = ["Autonomous cars shift insurance liability toward manufacturers","This is a text", "These are lots of texts", "..."]

# remove what elements you don't need from the pipeline
for doc in nlp.pipe(sampletexts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    print("Entities: ",[(ent.text, ent.label_) for ent in doc.ents])
    for token in doc:
        print("   ",token.text, "[", token.ent_type_, "]")


While you can loop over the output of nlp.pipe() straight away, it is a generator, so printing it won't show the results unless you first convert it into a list.

In [None]:
docs = nlp.pipe(sampletexts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
print(docs)

In [None]:
print(list(docs))
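
The behaviour above is that of any Python generator, which produces its results one at a time rather than holding them all in memory. The sketch below uses a plain generator function instead of spaCy (`fake_pipe` is a made-up stand-in for `nlp.pipe()`) to show why printing it directly is unhelpful, and that it can only be consumed once.

```python
def fake_pipe(texts):
    """Stand-in for nlp.pipe(): yields one 'processed' text at a time."""
    for t in texts:
        yield t.upper()

results = fake_pipe(["This is a text", "These are lots of texts"])
print(results)              # shows a generator object, not the results

as_list = list(results)     # consuming the generator gives the actual values
print(as_list)

leftover = list(results)    # a generator can only be consumed once
print(leftover)             # so this second list is empty
```

If you need the processed documents more than once, keep the list rather than the generator.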

This cell shows how entity annotations can also be set directly at the token level, using IOB (inside, outside, beginning) codes.

In [None]:
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
#doc = nlp.make_doc("London is a big city in the United Kingdom and New York is in the United States of America.")
doc = nlp("London is a big city in the United Kingdom and New York is in the United States of America.")

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print("Entities: ",ents)

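print("\nBefore:", doc.ents)  # the entities found by the model

# Overwrite the entity annotations token by token.
# ENT_IOB codes: 3 = B (begins an entity), 1 = I (inside), 2 = O (outside), 0 = not set
header = [ENT_IOB, ENT_TYPE]
attr_array = np.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B: the first token begins an entity
attr_array[0, 1] = doc.vocab.strings["GPE"]  # and that entity is a GPE
doc.from_array(header, attr_array)
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print("\nEntities: ",ents)

print("\nAfter", doc.ents)  # London only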
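
To make the IOB scheme itself more concrete, here is a minimal sketch in plain Python that does not use spaCy. Each token is tagged B (begins an entity), I (inside an entity) or O (outside any entity). The helper `spans_to_iob` is hypothetical, and the spans are hand-written examples rather than model output.

```python
def spans_to_iob(tokens, spans):
    """Label each token as B-<type>, I-<type> or O.

    spans is a list of (start, end, label) token positions, with end exclusive,
    which is the same convention as Span(doc, start, end, label=...).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for k in range(start + 1, end):
            tags[k] = "I-" + label          # remaining tokens of the entity
    return tags

tokens = ["New", "York", "is", "in", "the", "United", "States"]
spans = [(0, 2, "GPE"), (5, 7, "GPE")]      # hand-written example spans
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(token, tag)
```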

TODO: Placeholder in case we want to show off the displacy rendering.

In [None]:
sampletext = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(sampletext)

# displacy from spaCy
displacy.render(doc, style="ent")

TODO: talk about extracting just the NER types we want.

In [None]:
spacy.explain('LOC')

In [None]:
spacy.explain('FAC')

In [None]:
spacy.explain('GPE')

In [None]:
spacy.explain('ORG')

TODO: Run through a single chapter (variable: text) before doing the entire collection?
    This will allow all NEs to be shown, then the filter be introduced.
    This will put it more on topic.

In [None]:
text[0:499]

In [None]:
doc = nlp(text)

# list every entity that spaCy found, with a running number
for i, e in enumerate(doc.ents, start=1):
    print("{:5}\t\t{:30s}\t{}".format(i, e.text, e.label_))

TODO: Expand on this explanation with example context.
Talk about the issue with _Van Diemen's_ versus _Van Diemen's Land_ (and _Tasman's Head_)

These categories are assigned according to the context in which the NE is used. For this reason, _Van Diemen's_ is considered an *ORG*, a *PERSON* and a *FAC*, depending on its linguistic context. Note also that _VAN DIEMEN'S LAND_ in the title of the chapter isn't recognised as an NE, probably because it is written entirely in capitals.
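
One way to spot entities whose category changes with context is to count the labels given to each distinct piece of text. The sketch below does this with the standard library only. The `sample_ents` list is hand-written to mirror the observation above, standing in for `[(e.text, e.label_) for e in doc.ents]`.

```python
from collections import Counter, defaultdict

# Hand-written (text, label) pairs, not actual model output
sample_ents = [("Van Diemen's", "ORG"), ("Van Diemen's", "PERSON"),
               ("Van Diemen's", "FAC"), ("London", "GPE"), ("London", "GPE")]

# Count how often each label was given to each entity text
labels_by_text = defaultdict(Counter)
for ent_text, label in sample_ents:
    labels_by_text[ent_text][label] += 1

# Show only the entities that received more than one category
for ent_text, counts in labels_by_text.items():
    if len(counts) > 1:
        print(ent_text, dict(counts))
```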

TODO: Expand on this explanation.

Of course, not all of these NE categories are likely to contain placenames, so you will need to make a list of the categories that regularly do.

In [None]:
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]
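
As a small illustration of how such a list is used, the sketch below filters a handful of (text, label) pairs by category. The pairs are hand-written stand-ins for spaCy's output, not real results.

```python
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]

# Hand-written (text, label) pairs standing in for spaCy's output
sample_ents = [("Apple", "ORG"), ("U.K.", "GPE"), ("London", "GPE"), ("$1 billion", "MONEY")]

# Keep only the entities whose category may contain placenames
candidates = [(text, label) for text, label in sample_ents if label in PLACENAME_CATEGORIES]
print(candidates)
```

Including *ORG* lets through entities that are not places at all, which is why the candidates still need to be reviewed by hand.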

### Reviewing Candidate Placenames <a class="anchor" id="section-reviewplacenames"></a>

You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. They can all be collected in a single dataframe.

In [None]:
# where we store the details about each instance of the placenames
placenames_df = pd.DataFrame(columns=['Book','Chapter','NEIndex','Placename','Category','Context','Approval'])

In [None]:
# define which chapters and books you want to annotate
CHAPTERS=[1,2,3] 
BOOKS=[1,2]

In [None]:
disabledPipeline=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]

Now let's process FtToHNL.

In [None]:
nlp = spacy.load("en_core_web_sm")

rows = [] # one record for each candidate placename
i=0 # counter of the entities
for b in BOOKS:
    for c in CHAPTERS:
        filename = "FtToHNL_BOOK_"+str(b)+"_CHAPTER_"+str(c)+".txt" 
        # set the specific path for the 'filename'
        textlocation = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(textlocation)

        # read this chapter, closing the file afterwards
        with open(textlocation, encoding="utf-8") as f:
            text = f.read()
        print("Working on |",filename)
        
        # run spaCy    
        doc = nlp(text,disable=disabledPipeline)

        for e in doc.ents:
            if e.label_ in PLACENAME_CATEGORIES: # filter out MONEY, DATE etc
                print("{:5}\t\t{:30s}\t{}".format(i+1,e.text, e.label_))
                # To help understand the context of the text, extract the occurrence
                # with up to 30 characters either side. The start is clamped at 0
                # because a negative index would count from the end of the text.
                context_start = max(0, e.start_char-30)
                context_text=doc.text[context_start:e.end_char+30].replace("\n"," ")
                
                # Code to render with displacy
                #context_doc={"text":str(i+1)+" \t "+context_text,
                #             "ents":[{"start": len(str(i+1)+"   ")+30, 
                #                      "end":   len(str(i+1)+"   "+context_text)-30, 
                #                      "label": e.label_}],
                #             "title": None}
                #print(context_doc)
                #displacy.render(context_doc, style="ent", manual=True, jupyter=True)

                # record the placename according to spaCy
                rows.append({'Book':b,               # The Book number
                             'Chapter':c,            # The Chapter number
                             'NEIndex':i,            # A reference number to the nth Named Entity 
                             'Placename':e.text,     # The placename in the text
                             'Category':e.label_,    # The spaCy category
                             'Context':context_text, # The textual context where the placename was found
                             'Approval':True})       # A flag for whether this is a suitable placename
                
            i=i+1 # entity counter

# DataFrame.append() was removed in pandas 2.0, so build the dataframe in one step
placenames_df = pd.DataFrame(rows, columns=['Book','Chapter','NEIndex','Placename','Category','Context','Approval'])
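
The context extraction in the loop can be looked at in isolation. The sketch below is a hypothetical helper, `context_window`, that takes a fixed number of characters either side of an entity. It clamps the start of the window at zero, because a negative slice index in Python counts from the end of the string. The sample sentence is made up for the demonstration.

```python
def context_window(full_text, start_char, end_char, width=30):
    """Return the text from `width` characters before start_char to `width` after end_char."""
    begin = max(0, start_char - width)   # a negative index would count from the end
    return full_text[begin:end_char + width].replace("\n", " ")

sample = "Hobart Town lay\nquiet under the shadow of the mountain."
start = sample.index("Hobart Town")
end = start + len("Hobart Town")

print(context_window(sample, start, end, width=10))
# For comparison, the unclamped slice is empty when the entity is near the start
print(repr(sample[start - 10:end + 10]))
```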

In [None]:
placenames_df[['Book','Chapter','NEIndex','Placename','Category']]

This shows that a lot more placenames were found in Book 2 than Book 1. This makes sense since Book 1 is set on board a ship during an ocean voyage, whereas Book 2 is set at Macquarie Harbour in Australia.

However, there are a number of NEs that are unlikely to be placenames, regardless of what spaCy categorised them as. It is best to consider the context in which the terms were used. Use the checkboxes to select which terms you consider to be placenames.

In [None]:
#import ipywidgets as widgets

def changed(change):
    # Called once whenever the value of a checkbox changes.
    # 'change' is a dictionary that includes the checkbox ('owner') and its 'old' and 'new' values.

    # If you want to see any change to the checkbox value, uncomment this print() command
    #print(change['owner'].description, "->", change['new'])
    pass

placename_items=[]
context_items=[]
num_items=[]

# Make checkboxes for every candidate placename
for b in BOOKS:
    for c in CHAPTERS:
        # Get the NEs from this book and chapter
        pbc=placenames_df[(placenames_df["Book"]==b) & (placenames_df["Chapter"]==c)]

        # Make lists of the candidate placenames, context text and index numbers 
        # Only the placenames are given a checkbox. 
        placename_items = placename_items + [widgets.Checkbox(value=True,description=p) for p in pbc["Placename"]]
        context_items = context_items + [widgets.Label(ctx) for ctx in pbc["Context"]]
        num_items = num_items + [widgets.Label(str(i)) for i in pbc["NEIndex"]]

# create a display
num_placenames=len(placename_items)
left_box = widgets.VBox(placename_items)
right_box = widgets.VBox(context_items)
num_box = widgets.VBox(num_items)
whole_box = widgets.HBox([num_box, left_box, right_box])
        
display(whole_box)

# respond to any changes in the value of the checkboxes
for i in range(num_placenames):
    placename_items[i].observe(changed, names='value')

You can now copy all the values from the checkboxes to the data, so you know which placenames you have approved.

In [None]:
# Transfer the status of each checklist item to the data
for i in range(num_placenames):
    #print(num_items[i].value,placename_items[i].value,placename_items[i].description)

    NEIndex_num = int(num_items[i].value)
    approval_flag = placename_items[i].value
    
    #print("Looking for ["+str(NEIndex_num)+"]")
    
    # set the flag to match the checklist
    placenames_df.loc[placenames_df["NEIndex"] == NEIndex_num,"Approval"] = approval_flag

You can now visualise the result.

In [None]:
placenames_df[['NEIndex','Placename','Approval']]

From this, you can extract the final list of distinct placenames that you have approved. The names aren't sorted (though they could be), but reviewing the list will help you spot any NE that you forgot to deselect on the checklist. If you find one, go back to the checklist, deselect it, then rerun all of the steps from there to here.

In [None]:
# Make a unique list of the approved placenames
approved_placenames = placenames_df[placenames_df["Approval"]==True]['Placename'].unique()
approved_placenames

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [None]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to ",save_location)

In [None]:
# save the list, one placename per line
# pandas quotes any placename that contains a comma, so the file can be read back safely
pd.Series(approved_placenames).to_csv(save_location, index=False, header=False)
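
One thing to watch when saving a plain list of names as CSV is that a placename might itself contain a comma. The sketch below uses the standard `csv` module and an in-memory buffer in place of a real file to show that quoting keeps such names intact when the data is read back. The two placenames are made-up examples.

```python
import csv
import io

# The second name contains a comma
names = ["Macquarie Harbour", "Hobart Town, Van Diemen's Land"]

buffer = io.StringIO(newline="")    # stands in for a file on disk
writer = csv.writer(buffer)
for name in names:
    writer.writerow([name])         # one placename per row; commas are quoted automatically

buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]
print(read_back == names)
```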

TODO: Remove the MWE section as it is not needed for spaCy.
Stopped the code from executing for now, but kept for reference while updating the above steps.

### MWEs as Named Entities  <a class="anchor" id="section-mwes"></a>

The annotations so far show that *Cape* and *Pillar* are both a LOCATION NE, but not that *Cape Pillar* is actually the complete name of the location. However Stanza does also recognise multi-word expressions (MWEs), even though it recognises that they are part of an NE. Each *Sentence* annotated by Stanza has a list of [NE *Mentions*](https://stanfordnlp.github.io/CoreNLP/entitymentions.html) which are also given NER categories as their *Type*. 

Obviously, this isn't perfect. While *Van Diemen* is recognised as a *PERSON* NE, *Van Diemen's Land* (i.e., the former name for Tasmania) isn't recognised as a *LOCATION*. This is because Stanza is trained to only recognise certain combinations of words and categories as a new MWE category. Further rules can be added, but this workshop won't explore this aspect.

Each token is also annotated with an index number corresponding to any Mention it is part of. Each token can only be part of one Mention. While the Mentions may be annotated per Sentence, the index number is actually considering all Sentences and Mentions in the annotated text.

Of course, you are mainly interested in the Named Entities that relate to locations. The location-based NER categories used by Stanza are:
* *LOCATION*
* *CITY*
* *COUNTRY*
* *STATE_OR_PROVINCE*

*NATIONALITY* might also be considered but it may depend on whether you care about phrases like *student of English history* or *Frenchman's cap*, or not.
It is easy to filter out all other NEs.

You can now put all of this together and find the placenames that are identified by Stanza in each chapter of the text. They can all be collected in a single dataframe.

__\[MN Note\]__ Change this to a background server, rather than a server on demand?


## Finding Locations for the Placenames <a class="anchor" id="section-findinglocs"></a>

Now that you have a list of placenames from the text, the next step is to work out their location on Earth. For this you can use a combination of specialised lists of locations, gazetteers and heuristics. The objective is to match every placename with the coordinates of a known location.

The first step is to read the file of your placenames.

In [58]:
filename="FtToHNL_placenames.csv"
print("Working on | ", filename)

# build the full path to the file inside the csv directory
data_location = os.path.normpath(os.path.join(csv_directory, filename))
data_filename = os.path.basename(data_location)

# Using pandas, read the csv file. This will place it in a dataframe format. 
placenames_df = pd.read_csv(data_location, encoding="utf-8",header=None)

Working on |  FtToHNL_placenames.csv


In [59]:
placenames_df = placenames_df.rename(columns={placenames_df.columns[0]: 'Placename'})

In [60]:
placenames_df

Unnamed: 0,Placename
0,the sleepy sea
1,the Bay of Biscay
2,Heath
3,London
4,Van Diemen's Land
5,Vickers
6,Sylvia
7,Bath
8,Julia Vickers's
9,Frere


### Identifying States and Capitals <a class="anchor" id="section-statescapitals"></a>

Some placenames, like *High Street* or *Maryborough*, may be very common across the world, or even in Australia. However, certain placenames refer to significant locations, like states, territories, large geographic features or capital cities. As such, if they are mentioned in a text, the placename is more likely to refer to the major location than a town or village in Tasmania.

These significant locations are a finite set. They can be defined in a reference file that can be reused when reviewing the placenames of any text.

A good point for you to start is a file about locations like modern capital cities and countries, combined with historical locations of significance.

In [61]:
filename="reference_location_data.csv"
print("Working on | ", filename)

# build the full path to the file inside the reference directory
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Working on |  reference_location_data.csv


You can read this file straight into a dataframe.

In [62]:
# Place the reference data in a dataframe
locref_df = pd.read_csv(reference_location, encoding="utf-8", header=0)

In [63]:
locref_df

Unnamed: 0,LocationName,Category,Latitude,Longitude,PartOf
0,Abuja,Capital,7.533333,9.083333,Nigeria
1,Accra,Capital,-0.216667,5.550000,Ghana
2,Adamstown,Capital,-130.083333,-25.066667,Pitcairn Islands
3,Addis Ababa,Capital,38.700000,9.033333,Ethiopia
4,Aegina,Capital,23.501421,37.740882,Greece
...,...,...,...,...,...
543,Zagreb,Capital,16.000000,45.800000,Croatia
544,Zambia,Country,28.283333,-15.416667,Africa
545,Zanzibar City,Capital,39.198914,-6.165193,Tanzania
546,Zimbabwe,Country,31.033333,-17.816667,Africa
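
When loading reference data like this, it is worth checking that the coordinate columns hold plausible values: a latitude must lie between -90 and 90 degrees, and a longitude between -180 and 180. The sketch below applies that check to three rows copied from the table above, using a hypothetical helper. A row such as Adamstown, whose Latitude value is -130.083333, fails the check, which suggests that the two columns may be swapped in the file.

```python
# Three rows copied from the reference table shown above
rows = [
    {"LocationName": "Abuja", "Latitude": 7.533333, "Longitude": 9.083333},
    {"LocationName": "Adamstown", "Latitude": -130.083333, "Longitude": -25.066667},
    {"LocationName": "Addis Ababa", "Latitude": 38.700000, "Longitude": 9.033333},
]

def has_valid_coordinates(row):
    """True if the latitude and longitude are within their possible ranges."""
    return -90 <= row["Latitude"] <= 90 and -180 <= row["Longitude"] <= 180

# Names of the rows whose coordinates cannot be right as labelled
suspect = [r["LocationName"] for r in rows if not has_valid_coordinates(r)]
print(suspect)
```

A range check like this only catches the rows where the swap produces an impossible value, so a failure is a prompt to inspect the whole file rather than a complete test.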


\[TODO\]: update this text chunk to suit the workshop 

Of course, if you are researching historical texts, then some of these contemporary locations may have had different names. Old New York was once New Amsterdam (and had the [nickname of Gotham](https://www.nypl.org/blog/2011/01/25/so-why-do-we-call-it-gotham-anyway), amongst others). Istanbul was Constantinople. Some locations had [romanized names](https://en.wikipedia.org/wiki/Chinese_postal_romanization), like Beijing being called Peking. They may be a long time gone but you might want to add them to the list of significant known locations.
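
A simple way to record such historical names is a lookup from the old name to the modern one. The sketch below uses only the examples mentioned above. The dictionary and the `modern_name` helper are hypothetical, and are not part of the reference file.

```python
# Historical name -> modern name, using the examples mentioned above
HISTORICAL_NAMES = {
    "New Amsterdam": "New York",
    "Constantinople": "Istanbul",
    "Peking": "Beijing",
}

def modern_name(placename):
    """Return the modern name for a historical placename, or the name unchanged."""
    return HISTORICAL_NAMES.get(placename, placename)

print(modern_name("Peking"))
print(modern_name("London"))
```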

Another historical variant is changing which cities are the capitals. These may be due to political decisions, like the movement of the Australian parliament from Melbourne to the new city of Canberra, or they could be a necessity due to the results of war, like Bonn becoming the capital of West Germany after World War II. These older capitals may also have to be accommodated in your reference data.

Because FtToHNL is set in the 19th Century CE, the next step is to add various capital cities from then.

There are also larger geopolitical regions that may have been associated with placenames and cultures, for instance empires, dynasties and colonies like the British Empire or the Zulu Kingdom. Again, the borders and applicability of these political entities changed over time, so a contemporary reference list may not include them. 

The 19th Century CE was a time of many European Empires so for FtToHNL, you will need to add reference data associated with relevant entities.

When processing this reference file, you can add the old political entity, its capital (if known), the geographic region (like continent or part thereof) and the modern country it would be considered part of.  

The next step is to see if any of the placenames from our selected chapters of FtToHNL match these locations.

[TO DO] Describe this without being technical 

If we match a placename, copy the geolocation data for the matching location. Otherwise, keep it empty so we know to keep looking for the placename.

In [64]:

geolocdata = [] # all the data about placenames and locations, once linked

for placename in placenames_df['Placename']:
    
    # create a new geoloc entry about this placename
    new_geolocdata={} 
    ## start a record for a placename
    new_geolocdata['placename'] = placename
    new_geolocdata['locations'] = {} # start with no location details
    new_geolocdata['locations']['best_match'] = [] # start with no match
    
    # normalise its case and remove any leading whitespace
    # This will be needed later for the gazetteer
    #safe_placename = urllib.parse.quote(placename.strip().lower()) 

    # Exact match
    if placename in locref_df['LocationName'].values:
        
        print("*** Found", placename,"[Exact match]")
        # Copy the details from the reference file entry
        new_geolocdata['locations']['best_match'] = locref_df[locref_df['LocationName']==placename]
        
    else:
        print("Still looking for ", placename)
    
    # If you have a match, show it
    #if (len(new_geolocdata['locations']['best_match']) > 0):
    #    print(new_geolocdata)
        
    geolocdata.append(new_geolocdata) # add the new placename data to the list

Still looking for  the sleepy sea
Still looking for  the Bay of Biscay
Still looking for  Heath
*** Found London [Exact match]
Still looking for  Van Diemen's Land
Still looking for  Vickers
Still looking for  Sylvia
Still looking for  Bath
Still looking for  Julia Vickers's
Still looking for  Frere
Still looking for  Chatham
Still looking for  CHAPTER II
Still looking for  Surgeon Pine
Still looking for  Coromandel
Still looking for  Pine
*** Found India [Exact match]
Still looking for  the Hydaspes for Calcutta
Still looking for  the poop guard
Still looking for  MONOTONY
Still looking for  Three'll
Still looking for  Van Diemen's
Still looking for  Tasman
Still looking for  Cape Pillar
Still looking for  Pirates' Bay
Still looking for  east
Still looking for  west
Still looking for  the Isle of Wight
Still looking for  the South-West Cape
Still looking for  Swan Port
Still looking for  Mediterranean
Still looking for  Maria Island
Still looking for  the Three Thumbs
Still looking fo
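
The exact comparison above is sensitive to case, surrounding whitespace and a leading article, so a placename such as `the Bay of Biscay` cannot match a reference entry stored as `Bay of Biscay`. The following is a minimal sketch of a more forgiving comparison that uses only the standard library. The helpers `normalise_placename` and `find_reference_name`, and the sample reference names, are illustrative assumptions and are not part of the notebook or its reference file.

```python
import re

def normalise_placename(name):
    """Lower-case a name, trim it, collapse whitespace and drop a leading 'the'."""
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"^the\s+", "", cleaned)

# Hypothetical reference names, standing in for locref_df['LocationName']
reference_names = ["London", "India", "Bay of Biscay"]

# Map each normalised form back to the spelling used in the reference data
reference_lookup = {normalise_placename(n): n for n in reference_names}

def find_reference_name(placename):
    """Return the matching reference spelling, or None if there is no match."""
    return reference_lookup.get(normalise_placename(placename))

print(find_reference_name("the Bay of Biscay"))
print(find_reference_name("  LONDON "))
print(find_reference_name("Heath"))
```

Looking names up through a normalised key keeps the original spelling available for display, while tolerating small differences in how a placename was extracted from the text.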

Check that you have recorded the matches (and mismatches)

In [65]:
geolocdata[:10]

[{'placename': 'the sleepy sea', 'locations': {'best_match': []}},
 {'placename': 'the Bay of Biscay', 'locations': {'best_match': []}},
 {'placename': 'Heath', 'locations': {'best_match': []}},
 {'placename': 'London',
  'locations': {'best_match':     LocationName Category  Latitude  Longitude          PartOf
   266       London  Capital -0.083333       51.5  United Kingdom}},
 {'placename': "Van Diemen's Land", 'locations': {'best_match': []}},
 {'placename': 'Vickers', 'locations': {'best_match': []}},
 {'placename': 'Sylvia', 'locations': {'best_match': []}},
 {'placename': 'Bath', 'locations': {'best_match': []}},
 {'placename': "Julia Vickers's", 'locations': {'best_match': []}},
 {'placename': 'Frere', 'locations': {'best_match': []}}]

What locations did you end up finding?

In [66]:
matchdata = [p['locations']['best_match'].to_string(index=False,header=False) for p in geolocdata 
             if len(p['locations']['best_match'])>0]
matchdata

['London Capital -0.083333 51.5 United Kingdom',
 'India Country 77.2 28.6 Asia',
 'Italy Country 12.483333 41.9 Europe',
 'Victoria Capital 55.45 -4.616667 Seychelles',
 'Wellington Capital 174.783333 -41.3 New Zealand',
 'Honduras Country -87.216667 14.1 Central America']
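
In the output above, the values under Latitude and Longitude look transposed for at least some rows: London is listed with latitude -0.083333 and longitude 51.5, whereas the city lies at roughly 51.5° N, 0.08° W. A quick range check can flag some of these rows, because a latitude can never fall outside -90 to 90. This is only a sketch: the helper `has_valid_ranges` is a hypothetical name, and the check cannot detect swapped pairs where both numbers happen to be valid latitudes (as with London).

```python
def has_valid_ranges(latitude, longitude):
    """True when latitude is within [-90, 90] and longitude within [-180, 180]."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180

# (name, Latitude, Longitude) as printed in the matches above
rows = [("London", -0.083333, 51.5),
        ("India", 77.2, 28.6),
        ("Wellington", 174.783333, -41.3),
        ("Honduras", -87.216667, 14.1)]

# Rows whose coordinates are impossible as printed
suspect = [name for name, lat, lon in rows if not has_valid_ranges(lat, lon)]
print(suspect)
```

If any row is flagged, it is worth checking the column order of the whole reference file before using its coordinates on a map.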

We can now forget about the dataframe with the complete set of reference data.

In [67]:
del locref_df

### Searching a Gazetteer for Locations <a class="anchor" id="section-searchgazetteer"></a>

Search [OpenStreetMap (OSM)](https://nominatim.org/release-docs/develop/api/Search/), through its Nominatim search API, for locations that match the placenames you have not yet found.

In [68]:
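# Requires the ratelimit package (pip install ratelimit)
import requests
import urllib.parse  # used to URL-encode placenames in the queries
from IPython.display import JSON
import json
from pprint import pprint
from ratelimit import limits, RateLimitException, sleep_and_retry
#import pandas as pd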

In [69]:
# How many (max) results do we want for each name?
#[TO DO] Make this a user setting, defaulting to 5
# The normal is (Default: 10, Maximum: 50), according to https://nominatim.org/release-docs/develop/api/Search/
OSM_limit = 5
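
Placenames often contain spaces, apostrophes or other characters that need encoding before they go into a query string. As a minimal sketch, the standard library's `urlencode` can assemble the query from a dictionary. The helper name `osm_build_url` is hypothetical; the `q`, `format` and `limit` parameters are the ones this notebook sends to Nominatim.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def osm_build_url(placename, limit=5):
    """Build a Nominatim search URL with safely encoded query parameters."""
    params = {"q": placename.strip(), "format": "json", "limit": limit}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"

# No request is sent here; the URL is only built and printed
print(osm_build_url("Van Diemen's Land"))
```

Building the URL in one place also makes it easy to change the result limit without editing the query string by hand.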

In [70]:
# Send rate-limited requests: at most one call per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def osm_call_api(url):
    response = requests.get(url)
    return response

# Format the API response to make comparison easier
def osm_format_response(record):

    # Shorten the name and extract the country name, if any
    hyperlocation = None
    shortlocation = record["display_name"] # default to full address

    # str.find() returns -1 (which is truthy) when there is no comma,
    # so test for the comma directly
    if ',' in record["display_name"]:
        # break up the address
        namesplit = record["display_name"].split(',')
        # keep the leftmost term as the short name and
        # the rightmost term (usually the country) as the wider location
        shortlocation = namesplit[0]
        hyperlocation = namesplit[-1]

    # for now, keep the names consistent between records 
    response = {"LocationName": shortlocation, 
                "Category": record["type"],
                "Latitude": record["lat"], 
                "Longitude": record["lon"],
                "PartOf": hyperlocation,
                "Gazetteer": "OSM"
               }
    return response
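
The `@limits(calls=1, period=1)` decorator raises a `RateLimitException` when the function is called more than once per second, and `@sleep_and_retry` catches that exception, waits out the rest of the period and tries again. The public Nominatim service asks clients to stay at or below one request per second, which is why the limit matters here. The sketch below shows the same idea using only the standard library. It is not how the `ratelimit` package is implemented, and `min_interval` and `fake_api_call` are hypothetical names.

```python
import time
from functools import wraps

def min_interval(seconds):
    """Decorator that spaces successive calls at least `seconds` apart."""
    def decorator(func):
        last_call = [None]  # time of the previous call, if any

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = seconds - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@min_interval(0.2)  # use 1.0 for one call per second
def fake_api_call(name):
    # Stand-in for requests.get(url); no network traffic is generated
    return f"looked up {name}"

start = time.monotonic()
results = [fake_api_call(n) for n in ["Bath", "Chatham", "Tasman"]]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

Three calls with a 0.2 second interval take about 0.4 seconds in total, because only the second and third calls have to wait.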

[TO BE DONE]

You can now move to the data that is needed for the geolocation project.

In [71]:
# For every placename in our list
for p in geolocdata:
    # Already found a location, so skip to the next placename
    if len(p['locations']['best_match']) > 0:
        continue
        
    placename = p['placename']
    print ("looking for",placename)

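    # query the OSM database, URL-encoding the placename so that spaces,
    # apostrophes and ampersands cannot break the query string
    safe_placename = urllib.parse.quote(placename.strip())
    url = f"https://nominatim.openstreetmap.org/search?q={safe_placename}&format=json&limit={OSM_limit}"
    response = osm_call_api(url)
    response_dict = json.loads(response.text)
    #print(response.text)

    p['locations']['candidates']=None
    
    # Handle no results found
    # (compare with ==, because "is" tests identity rather than value)
    if len(response_dict) == 0:
        # skip to the next placename
        continue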
        
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response_dict:
        # Keep a reduced, consistently named set of fields from each result
        cleaned_response = pd.DataFrame([osm_format_response(response_record)])

        # Add the one-row dataframe to the candidates for this placename
        data_frames.append(cleaned_response)

    # add the results to the geoloc data
    # review the outcomes later
    p['locations']['candidates'] = data_frames
    

looking for the sleepy sea
looking for the Bay of Biscay
looking for Heath
looking for Van Diemen's Land
looking for Vickers
looking for Sylvia
looking for Bath
looking for Julia Vickers's
looking for Frere
looking for Chatham
looking for CHAPTER II
looking for Surgeon Pine
looking for Coromandel
looking for Pine
looking for the Hydaspes for Calcutta
looking for the poop guard
looking for MONOTONY
looking for Three'll
looking for Van Diemen's
looking for Tasman
looking for Cape Pillar
looking for Pirates' Bay
looking for east
looking for west
looking for the Isle of Wight
looking for the South-West Cape
looking for Swan Port
looking for Mediterranean
looking for Maria Island
looking for the Three Thumbs
looking for Peninsula
looking for Storm Bay
looking for Storing Island
looking for Sorrell
looking for Bruny Island
looking for Mount Royal
looking for D'Entrecasteaux Channel
looking for Actaeon
looking for the South Cape
looking for New Norfolk
looking for Derwent
looking for the Sout

What placenames have you still not found?

In [72]:
unmatcheddata = [p['placename'] for p in geolocdata 
                 if len(p['locations']['best_match'])==0 and p['locations']['candidates'] is None]
unmatcheddata

['the sleepy sea',
 "Julia Vickers's",
 'the Hydaspes for Calcutta',
 'the poop guard',
 'MONOTONY',
 'the Three Thumbs',
 'Storing Island',
 'Commandant',
 'verandah.-She',
 'Grummet Island']

Where have you found locations for placenames?

In [73]:
matchdata = [p['placename'] for p in geolocdata 
             if len(p['locations']['best_match'])>0 or p['locations']['candidates'] is not None]
matchdata

['the Bay of Biscay',
 'Heath',
 'London',
 "Van Diemen's Land",
 'Vickers',
 'Sylvia',
 'Bath',
 'Frere',
 'Chatham',
 'CHAPTER II',
 'Surgeon Pine',
 'Coromandel',
 'Pine',
 'India',
 "Three'll",
 "Van Diemen's",
 'Tasman',
 'Cape Pillar',
 "Pirates' Bay",
 'east',
 'west',
 'the Isle of Wight',
 'the South-West Cape',
 'Swan Port',
 'Mediterranean',
 'Maria Island',
 'Peninsula',
 'Storm Bay',
 'Italy',
 'Sorrell',
 'Bruny Island',
 'Mount Royal',
 "D'Entrecasteaux Channel",
 'Actaeon',
 'the South Cape',
 'New Norfolk',
 'Derwent',
 'the Southern Ocean',
 'Tamar',
 'Victoria',
 'Port Philip Bay',
 'Wellington',
 'Dromedary',
 'Mount Wellington',
 'Launceston',
 'Smyrna',
 'Pyramid Island',
 'Rocky Point',
 'Port Davey',
 'Mount Direction',
 'Macquarie Harbour',
 'Mount Heemskirk',
 'Mount Zeehan',
 "King's River",
 'Sarah Island',
 "Philip's Island",
 'Hobart Town',
 'earth',
 'south-east',
 'Ladybird',
 'Port Arthur',
 'Honduras',
 'Arthur',
 'Hells Gates',
 'England',
 'New Town'

While OpenStreetMap is a wonderful resource, it focusses on the current names of geographic locations. If the source of your placenames was not written in recent decades, OSM may not know the names that were in use when the document was written.

One solution is to also consult a historical gazetteer, such as TLCMap.
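
For a handful of well-known renamed places, a small hand-made alias table is another lightweight option that can be applied before querying a modern gazetteer. The sketch below is an illustration only: `HISTORICAL_ALIASES` and `modern_name` are hypothetical names, and the table holds just two examples (Van Diemen's Land became Tasmania, and Hobart Town became Hobart).

```python
# Illustrative aliases only: extend this with names relevant to your own text
HISTORICAL_ALIASES = {
    "van diemen's land": "Tasmania",
    "hobart town": "Hobart",
}

def modern_name(placename):
    """Return a modern equivalent if one is listed, else the name unchanged."""
    return HISTORICAL_ALIASES.get(placename.strip().lower(), placename)

for name in ["Van Diemen's Land", "Hobart Town", "Port Arthur"]:
    print(name, "->", modern_name(name))
```

An alias table only covers the names you already know about, so it complements a historical gazetteer rather than replacing it.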

Like the OSM API, the TLCMap API has various options, such as which type of search to use and whether to include data entered by the public in addition to data that has been verified or entered by experts.

[TO DO] Describe the TLCMap

For this workshop, you will look for exact matches between the placenames and the locations, and not consider any publicly entered data.

In [74]:
# Which type of search to use for known locations
search_type = 'exact' # possible values: 'exact', 'fuzzy', 'contains' 

# Flag whether to use data provided by the public
search_public_data = False # possible values: True, False
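
To get a feel for how the three search types differ, the sketch below mimics them on a small local list of names. It is an illustration under stated assumptions: the name list and the helper `local_search` are hypothetical, and the fuzzy mode uses the standard library's `difflib`, which may behave quite differently from whatever matching TLCMap performs on its server.

```python
from difflib import get_close_matches

# Hypothetical gazetteer names, used only to illustrate the three modes
gazetteer_names = ["Port Arthur", "Port Davey", "Arthur River", "Macquarie Harbour"]

def local_search(placename, search_type):
    """Mimic 'exact', 'contains' and 'fuzzy' matching on a local list of names."""
    query = placename.strip().lower()
    lowered = {name.lower(): name for name in gazetteer_names}
    if search_type == "exact":
        return [lowered[query]] if query in lowered else []
    if search_type == "contains":
        return [name for key, name in lowered.items() if query in key]
    if search_type == "fuzzy":
        close = get_close_matches(query, list(lowered), n=3, cutoff=0.8)
        return [lowered[key] for key in close]
    raise ValueError(f"unknown search type: {search_type}")

print(local_search("Port Arthur", "exact"))
print(local_search("Arthur", "contains"))
print(local_search("Macquarie Harbor", "fuzzy"))
```

Exact matching returns the fewest false positives, which suits a first pass. Contains and fuzzy searches find more candidates but need more manual review.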

As with OSM, you can limit how many results you want to examine. This notebook asks TLCMap for 1 result per placename.

In [75]:
TLCMap_limit = 1

As with OSM, you will need a few functions to query the API.

In [76]:
def tlc_build_url(placename: str, search_type: str, search_public_data: bool = False) -> str:
    """
    Build a url to query the tlcmap/ghap API.
    placename: the place we're trying to locate
    search_type: what search type to use (accepts one of ['contains','fuzzy','exact'])
    search_public_data: whether to include datasets contributed by the public
    Returns None if search_type is not one of the accepted values.
    
    ref: https://www.tlcmap.org/guides/ghap/#ws
    """
    safe_placename = urllib.parse.quote(placename.strip().lower())

    url = f"https://tlcmap.org/ghap/search?"

    if search_type == 'fuzzy':
        url += f"fuzzyname={safe_placename}"
    elif search_type == 'exact':
        url += f"name={safe_placename}"
    elif search_type == 'contains':
        url += f"containsname={safe_placename}"
    else:
        return None

    # Search Australian National Placenames Survey provided data
    url += "&searchausgaz=on"
    
    # Search public provided data, this data could be unreliable
    if search_public_data:
        url += "&searchpublicdatasets=on"
    else:
        url += "&searchpublicdatasets=off"
        
    # Retrieve data as JSON
    url += "&format=json"
    
    # limit the number of results
    url += "&paging=" + str(TLCMap_limit)

    return url

# Send rate-limited requests: at most one call per second
# [TO DO] add link to webpage about this
@sleep_and_retry
@limits(calls=1, period=1)
def tlc_call_api(url):
    r = requests.get(url)

    # Treat a redirect to the 'maxpaging' page as a failed query
    if r.url == 'https://tlcmap.org/ghap/maxpaging':
        return None

    # If the reply says the placename wasn't found, return an empty feature collection
    if r.content.decode() == "No search results to display.":
        # TLCMap replies with plain text here rather than an empty list of features
        return {"type": "FeatureCollection", "metadata": {}, "features": []}

    # SUCCESS! Return the spatial data provided in the reply
    if r.ok:
        return r.json()

    # Any other reply (for example an HTTP error) is treated as no result,
    # so the caller never sees an undefined response
    return None

def tlc_query_name(placename: str, search_type: str):
    """
    Use the tlcmap/ghap API to look up a placename.
    Returns the reply from tlc_call_api, or None if no URL could be built
    (for example, when search_type is not recognised).
    """
    url = tlc_build_url(placename, search_type, search_public_data)
    #print(url)
    if url:
        return tlc_call_api(url)
    return None

In [77]:
# Format the API response to make comparison easier
def tlc_format_response(feature):

    # An empty feature has nothing to format
    if not feature:
        return None

    properties = feature['properties']

    # for now, keep the names consistent between records
    # (missing fields become None, except for the location name)
    response = {"LocationName": properties.get('placename', "Unknown Location"),
                "Category": properties.get('feature_term'),
                "Latitude": properties.get('latitude'),
                "Longitude": properties.get('longitude'),
                "PartOf": properties.get('state'),
                "Gazetteer": "TLCMap"
               }

    return response

You can now search the TLCMap for locations matching the same placenames you previously searched for in the OSM.

In [78]:
# For every placename in our list
for p in geolocdata:
    # Already found a location, so skip to the next placename
    if len(p['locations']['best_match']) > 0:
        continue
        
    placename = p['placename']
    print ("looking for",placename)

    # query the TLCMap gazetteer
    response = tlc_query_name(placename,search_type)
    #response_dict = response.to_dict()
    #print(response.text)
    #print(response)
    
    #p['locations']['candidates']=None
    
    # Handle no results found
    if response is None:
        # skip to the next placename
        continue
        
    # Save the possible locations for later processing
    data_frames = []

    # Handle results found
    for response_record in response["features"]:
        #  Use this to look at a reduced set of data from the results
        #print(response_record['properties'])
        #cleaned_response = tlc_format_response(response_record)
        cleaned_response = pd.DataFrame([tlc_format_response(response_record)])

            
        #print(cleaned_response)
                        
        # Add the data to a dataframe
        data_frames.append(cleaned_response)
        
    # add the results to the geoloc data
    # review the outcomes later
    # Make sure you don't write over any candidates previously added from another gazetteer.
    if p['locations']['candidates'] is None:
        p['locations']['candidates'] = data_frames
    else:
        p['locations']['candidates'] = p['locations']['candidates'] + data_frames

    #print(p['locations']['candidates'])
        
    #exit()
   

looking for the sleepy sea
looking for the Bay of Biscay
looking for Heath
looking for Van Diemen's Land
looking for Vickers
looking for Sylvia
looking for Bath
looking for Julia Vickers's
looking for Frere
looking for Chatham
looking for CHAPTER II
looking for Surgeon Pine
looking for Coromandel
looking for Pine
looking for the Hydaspes for Calcutta
looking for the poop guard
looking for MONOTONY
looking for Three'll
looking for Van Diemen's
looking for Tasman
looking for Cape Pillar
looking for Pirates' Bay
looking for east
looking for west
looking for the Isle of Wight
looking for the South-West Cape
looking for Swan Port
looking for Mediterranean
looking for Maria Island
looking for the Three Thumbs
looking for Peninsula
looking for Storm Bay
looking for Storing Island
looking for Sorrell
looking for Bruny Island
looking for Mount Royal
looking for D'Entrecasteaux Channel
looking for Actaeon
looking for the South Cape
looking for New Norfolk
looking for Derwent
looking for the Sout

What placenames have you still not found?

In [79]:
# A placename is unmatched when it has no best match and
# its list of candidates is either missing (None) or empty
unmatcheddata = [p['placename'] for p in geolocdata 
                 if len(p['locations']['best_match'])==0 
                 and not p['locations']['candidates']]
unmatcheddata

['the sleepy sea',
 "Julia Vickers's",
 'the Hydaspes for Calcutta',
 'the poop guard',
 'MONOTONY',
 'the Three Thumbs',
 'Storing Island',
 'Commandant',
 'verandah.-She']

You can now compare all the locations you have found.

In [80]:
# A placename is matched when it has a best match or at least one candidate
matcheddata = [p for p in geolocdata 
               if len(p['locations']['best_match'])!=0 
               or p['locations']['candidates']]
alllocations=[]
for p in matcheddata:
    placename = p['placename']
    if len(p['locations']['best_match'])!=0:
        locations=p['locations']['best_match']
    else:
        # combine the candidates from every gazetteer into one dataframe
        locations = pd.concat(p['locations']['candidates'],ignore_index=True) 
    print("===> ",placename)
    print(locations)
    alllocations.append(locations)

===>  the Bay of Biscay
        LocationName Category     Latitude           Longitude           PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746   United Kingdom       OSM
===>  Heath
  LocationName        Category    Latitude    Longitude          PartOf Gazetteer
0        Heath  administrative  32.8365147   -96.474987   United States       OSM
1        Heath  administrative  31.3607243  -86.4696811   United States       OSM
2        Heath  administrative  40.0228421  -82.4445991   United States       OSM
3        Heath  administrative  42.6898311  -72.8178076   United States       OSM
4        Heath          hamlet   40.461147  -86.7333397   United States       OSM
5        Heath          parish       -19.6        143.6             QLD    TLCMap
===>  London
    LocationName Category  Latitude  Longitude          PartOf
266       London  Capital -0.083333       51.5  United Kingdom
===>  Van Diemen's Land
        LocationName        Category     Latitu

In [81]:
pprint(alllocations)

[        LocationName Category     Latitude           Longitude           PartOf Gazetteer
0  The Bay of Biscay     park  52.95425415  -1.159789836905746   United Kingdom       OSM,
   LocationName        Category    Latitude    Longitude          PartOf Gazetteer
0        Heath  administrative  32.8365147   -96.474987   United States       OSM
1        Heath  administrative  31.3607243  -86.4696811   United States       OSM
2        Heath  administrative  40.0228421  -82.4445991   United States       OSM
3        Heath  administrative  42.6898311  -72.8178076   United States       OSM
4        Heath          hamlet   40.461147  -86.7333397   United States       OSM
5        Heath          parish       -19.6        143.6             QLD    TLCMap,
     LocationName Category  Latitude  Longitude          PartOf
266       London  Capital -0.083333       51.5  United Kingdom,
         LocationName        Category     Latitude            Longitude           PartOf Gazetteer
0           Tas
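
When a placename has several candidates, one simple piece of context is distance from a place the text is known to be about. The sketch below ranks the candidates for `Heath` by great-circle (haversine) distance from an assumed anchor point. The anchor (approximate coordinates for Hobart, Tasmania) and the helper names are assumptions made for illustration; the candidate coordinates are copied from the output above. The nearest candidate is not necessarily the right one, and a name like `Heath` may not refer to a place at all, so treat the ranking as a prompt for manual review.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius

# Candidates for 'Heath', copied from the output above
candidates = [
    {"PartOf": "United States", "Gazetteer": "OSM", "Latitude": "32.8365147", "Longitude": "-96.474987"},
    {"PartOf": "United States", "Gazetteer": "OSM", "Latitude": "31.3607243", "Longitude": "-86.4696811"},
    {"PartOf": "United States", "Gazetteer": "OSM", "Latitude": "40.0228421", "Longitude": "-82.4445991"},
    {"PartOf": "United States", "Gazetteer": "OSM", "Latitude": "42.6898311", "Longitude": "-72.8178076"},
    {"PartOf": "United States", "Gazetteer": "OSM", "Latitude": "40.461147", "Longitude": "-86.7333397"},
    {"PartOf": "QLD", "Gazetteer": "TLCMap", "Latitude": -19.6, "Longitude": 143.6},
]

# Assumed anchor: approximate coordinates of Hobart, Tasmania
anchor_lat, anchor_lon = -42.88, 147.33

def distance_to_anchor(candidate):
    # Gazetteers may return coordinates as strings, so convert before comparing
    return haversine_km(anchor_lat, anchor_lon,
                        float(candidate["Latitude"]), float(candidate["Longitude"]))

# Closest candidates first
ranked = sorted(candidates, key=distance_to_anchor)
for c in ranked:
    print(c["PartOf"], c["Gazetteer"], round(distance_to_anchor(c)), "km")
```

With this anchor, the Queensland parish from TLCMap ranks ahead of the five United States results from OSM.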

In [83]:
filename = "FtToHNL_matchedlocations.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
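print("Saving location data to ",save_location)

# Combine the per-placename dataframes into a single table and save it as CSV.
# (np.savetxt cannot reliably write a list of dataframes with different shapes.)
found_locations = [loc for loc in alllocations if len(loc) > 0]
pd.concat(found_locations, ignore_index=True).to_csv(save_location, index=False)

Saving location data to  ../ner_output/FtToHNL_matchedlocations.csv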
