# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
 * [Look for NEs](#section-nes)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
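This notebook uses various Python libraries. Some, like json, come with your Python installation, but the following need to be installed and are crucial:
<ul>
    <li> pandas</li> 
    <li> nltk</li> 
    <li> geopandas</li> 
    <li> shapely</li> 
    <li> spaCy (with the en_core_web_sm model)</li> 
    <li> fuzzywuzzy</li> 
    <li> ipywidgets</li> 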
</ul>
</div>

In [None]:
# TODO: UPDATE
# Many of these are probably not needed.

import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS, 
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
import geopandas as gpd
from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
#from nltk.tag import StanfordNERTagger
#from nltk.tokenize import word_tokenize

# spaCy is used for a pipeline of NLP functions
import spacy
from spacy.tokens import Span
from spacy import displacy

# Shapely is used to work with geometric shapes
from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
from fuzzywuzzy import fuzz

# used for the checklist
import ipywidgets as widgets

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You also want to set up some directories for the import and output of data.

In [None]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
reference_directory = os.path.normpath("../Data")
#maps_directory = os.path.normpath("../maps/")

## Create the data directories (nothing happens if they already exist)
os.makedirs(text_directory, exist_ok=True)
os.makedirs(csv_directory, exist_ok=True)
os.makedirs(reference_directory, exist_ok=True)

#if not os.path.exists(maps_directory):
#    os.makedirs(maps_directory)

For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained via [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is an unformatted text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing any accented characters with their unaccented forms and the British pound sterling symbol with the word *pounds*.

The novel is divided into four books, each based in different regions of the world. You can start with the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [None]:
#filename="FtToHNL_BOOK_1.txt"
filename="FtToHNL_BOOK_2.txt"
print("Working on | ", filename)

# build the full path to the file inside the text directory
textlocation = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(textlocation)

# read the whole file into a single string
with open(textlocation, encoding="utf-8") as textfile:
    text = textfile.read()

This is no more than a long string of characters. So far, you have done no processing. 

In [None]:
text[0:500] # look at the first 500 characters

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Look for NEs <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like names of places, people or organisations.

Various packages offer Named Entity Recognition (NER), e.g., the [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb), the Stanford NER, and the spaCy library. They often combine machine learning and a rule-based system to identify NEs and classify them into categories.

For this notebook, you will be using the spaCy NER - https://spacy.io/usage/linguistic-features#morphology .  This is available as a Python library.

SpaCy allows you to load a language model that has been trained on various examples of the language of interest. 

In [None]:
nlp = spacy.load("en_core_web_sm")

SpaCy will automatically run the text through various levels of natural language processing. This pipeline includes tokenising the text into individual tokens or terms, like words, values and punctuation.

TODO: Update

It is simple to use the client. You just tell it to annotate the text. However, Stanza allows you to specify what to annotate the text with. For instance, you might want it to tokenise the terms, label their parts-of-speech and lemmatised forms as well as recognising any NEs.

TODO: Update

<div class="alert alert-block alert-info">
The options for the Stanza client include:<ul>
    <li> <strong>tokenize - </strong> split into words or terms </li> 
      <li>    <strong>ssplit - </strong> split into sentences or independent statements</li> 
     <li>     <strong>pos -  </strong> syntactic parts-of-speech</li> 
      <li>    <strong>lemma -  </strong> lemmatised form (not always a root form)</li> 
      <li>    <strong>ner - </strong> named entity recognition</li> 
      <li>    <strong>depparse -  </strong> parsing of dependencies</li> 
      <li>    <strong>coref - </strong> co-reference resolution</li> 
      <li>    <strong>kbp - </strong> KBP competition format</li> 
</ul>
A full explanation can be found at <a href="https://stanfordnlp.github.io/CoreNLP/pipeline.html">https://stanfordnlp.github.io/CoreNLP/pipeline.html</a></div>

For example, the default pipeline contains the following components.

In [None]:
print("Pipeline:", nlp.pipe_names)

Text sent to the spaCy model will be processed by the pipeline.

In [None]:
# sample text
sampletext = "Autonomous cars shift insurance liability toward manufacturers"

In [None]:
doc = nlp(sampletext)
doc

The output from the pipeline is then available in the document structure returned by the spaCy model.

In [None]:
for token in doc:
    print(token.text," - ", "Morph: ",token.morph, 
          "\n   Dep: ",token.dep_, 
          "\n   Head: ",token.head.text, 
          "\n   Head POS: ",token.head.pos_,
          "\n   Child: ",[child for child in token.children])


However, you might not want some of this pipeline processing as it may not be beneficial to your analysis. Any unneeded processing will also slow the system down and place a greater demand on the memory. This is particularly true of the parser. Luckily, it is easy to stipulate what you want excluded from the pipeline. 

In [None]:
doc=nlp(sampletext, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [None]:
for token in doc:
    print(token.text," - ", "Morph: ",token.morph, 
          "\n   Dep: ",token.dep_, 
          "\n   Head: ",token.head.text, 
          "\n   Head POS: ",token.head.pos_,
          "\n   Child: ",[child for child in token.children])

Of course, what you are interested in is the NER. Any text sent down a pipeline that includes the ner component will come back with a list of the entities that were found. 

In [None]:
sampletext = "Apple is looking at buying a U.K. startup based in London for $1 billion."

In [None]:
doc = nlp(sampletext)

for ent in doc.ents:
    print(ent.text, "[",ent.label_,"]")

As you can see, each entity is labelled with a category.

TODO: UPDATE

<div class="alert alert-block alert-info">
    The NER categories classified by Stanza include:
   <ul>
<li><strong>Default:</strong>
LOCATION, ORGANIZATION, PERSON</li>
<li><strong>High recall: </strong>
DATE, LOCATION, MONEY, ORGANIZATION, PERCENT, PERSON, TIME, MISC</li>
<li><strong>KBP fine-grained:</strong>
CAUSE_OF_DEATH, CITY, COUNTRY, CRIMINAL_CHARGE, EMAIL, HANDLE,
IDEOLOGY, NATIONALITY, RELIGION, STATE_OR_PROVINCE, TITLE, URL</li>
</ul>
</div>

TODO: Update

Most tokenised terms in the sentence have O as their NER value (that is the letter O not the number 0). Some however have been categorised. For instance, Van and Diemen are both classified as being PERSON named entities, Tasman is an ORGANIZATION and Cape and Pillar are in the LOCATION category. These categories are specific to Stanza. There are two key levels of processing available - the normal level will only identify the categories LOCATION, ORGANIZATION and PERSON, but the high recall processing will also consider other specialised phrases like TIME and MONEY which are not really named entities. Stanza can also look for the categories used in the KBP competition like CITY, COUNTRY and NATIONALITY, but this fine-grained processing will be slower.


The data for the entities includes the character position for the start and the end of the NE.

In [None]:
doc = nlp(sampletext)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Each token will also have a value that indicates whether it is part of an NE.

In [None]:
for token in doc:
    print(token.text, "[", token.ent_type_, "]")

TODO: Update

The annotations so far show that *Cape* and *Pillar* are both a LOCATION NE, but not that *Cape Pillar* is actually the complete name of the location. However Stanza does also recognise multi-word expressions (MWEs), even though it recognises that they are part of an NE. Each *Sentence* annotated by Stanza has a list of [NE *Mentions*](https://stanfordnlp.github.io/CoreNLP/entitymentions.html) which are also given NER categories as their *Type*. 

It is also possible to hand-code entities after the NER has been done. This can help make up for any common irregularities with the NER for your input documents.

For instance, this model doesn't recognise that FB is an NE.

In [None]:
sampletext = "FB is hiring a new vice president of global policy"

doc = nlp(sampletext)
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print('Entities before:', ents)
# The model didn't recognize "fb" as an entity :-(

The solution is to create a new entry for the list of entities.

In [None]:
# Create a spaCy span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

ents = [(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents]
print('Entities after:', ents)

Even the data for the tokens is updated. 

In [None]:
for token in doc:
    print(token.text, "[", token.ent_type_, "]")

SpaCy also allows the input documents to be processed in batches. This helps better manage the processing demands of the system throughout the pipeline when there are multiple files or many sentences.

In [None]:
# multiple texts in a list
sampletexts = ["Autonomous cars shift insurance liability toward manufacturers","This is a text", "These are lots of texts", "..."]

# remove what elements you don't need from the pipeline
for doc in nlp.pipe(sampletexts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    print("Entities: ",[(ent.text, ent.label_) for ent in doc.ents])
    for token in doc:
        print("   ",token.text, "[", token.ent_type_, "]")


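The piped pipeline returns a generator, so each document is only produced as you loop over it. You can process its output straight away, but printing the generator itself only shows the generator object. To see the documents, convert it into a list first.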

In [None]:
docs = nlp.pipe(sampletexts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
print(docs)

In [None]:
print(list(docs))
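
This behaviour comes from Python generators in general rather than from spaCy. The following is a minimal sketch in plain Python; `fake_pipe` is a hypothetical stand-in for the piped pipeline, assumed here to behave the same way. It also shows that a generator can only be consumed once.

```python
def fake_pipe(texts):
    """Hypothetical stand-in for a piped pipeline: yields one result per text, lazily."""
    for t in texts:
        yield t.upper()

results = fake_pipe(["a text", "another text"])
print(results)               # shows a generator object, not the results

first_pass = list(results)   # consuming the generator produces the results
second_pass = list(results)  # the generator is now exhausted

print(first_pass)            # ['A TEXT', 'ANOTHER TEXT']
print(second_pass)           # []
```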

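SpaCy records the entity annotation of every token using the IOB scheme: B marks the token that begins an entity, I marks a token inside an entity, and O marks a token outside any entity. The following cell writes these token-level values directly (the ENT_IOB value 3 stands for B), so that London is marked as the start of a GPE entity.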

In [None]:
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
#doc = nlp.make_doc("London is a big city in the United Kingdom and New York is in the United States of America.")
doc = nlp("London is a big city in the United Kingdom and New York is in the United States of America.")

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print("Entities: ",ents)

print("\nBefore:", doc.ents)  # not empty here, because nlp() has already run the NER

header = [ENT_IOB, ENT_TYPE]
attr_array = np.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print("\nEntities: ",ents)

print("\nAfter", doc.ents)  # [London]
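
To make the IOB idea more concrete, here is a small sketch in plain Python that turns a hand-written sequence of IOB tags back into entities. The helper `iob_to_spans` and the tag strings are assumptions made for illustration; they are not part of spaCy, which stores the same information as numeric token attributes.

```python
def iob_to_spans(tokens, tags):
    """Group tokens into (text, label) entities from tags like 'B-GPE', 'I-GPE', 'O'."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continue the entity that is currently open
            current_tokens.append(token)
        else:
            # 'O' (or a stray 'I-') ends any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

tokens = ["London", "is", "in", "the", "United", "Kingdom", "."]
tags = ["B-GPE", "O", "O", "B-GPE", "I-GPE", "I-GPE", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)  # [('London', 'GPE'), ('the United Kingdom', 'GPE')]
```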

TODO: Placeholder in case we want to show off the displacy rendering.

In [None]:
sampletext = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(sampletext)

# displacy from spaCy
displacy.render(doc, style="ent")

TODO: talk about extracting just the NER types we want.

In [None]:
spacy.explain('LOC')

In [None]:
spacy.explain('FAC')

In [None]:
spacy.explain('GPE')

In [None]:
spacy.explain('ORG')

TODO: Run through a single chapter (variable: text) before doing the entire collection?
    This will allow all NEs to be shown, then the filter be introduced.
    This will put it more on topic.

In [None]:
text[0:499]

In [None]:
doc = nlp(text)

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
    
i=0 # entity counter
# token level
for e in doc.ents:
    print("{:5}\t\t{:30s}\t{}".format(i+1,e.text, e.label_))
    i=i+1

TODO: Expand on this explanation with example context.
Talk about the issue with _Van Diemen's_ versus _Van Diemen's Land_ (and _Tasman's Head_)

These categories are assigned according to the context in which the NE is used. For this reason, _Van Diemen's_ is considered an *ORG*, a *PERSON* and a *FAC*, depending on its linguistic context. Note also that _VAN DIEMEN'S LAND_ in the title of the chapter isn't recognised as an NE, probably due to its unconventional case.

TODO: Expand on this explanation.

Of course, not all of these NEs are likely to be placenames, so you will need to make a list of the categories that regularly contain placenames.

In [None]:
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]
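
As a rough illustration of how such a list can be used, the sketch below filters some (text, label) pairs. The pairs are hand-written stand-ins for the entities of a document, not real model output, and `found_entities` and `candidates` are hypothetical names.

```python
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]

# Hand-written (text, label) pairs standing in for the entities of a document
found_entities = [
    ("Macquarie Harbour", "LOC"),
    ("1833", "DATE"),
    ("London", "GPE"),
    ("Marcus Clarke", "PERSON"),
]

# Keep only the entities whose category may hold a placename
candidates = [text for text, label in found_entities if label in PLACENAME_CATEGORIES]
print(candidates)  # ['Macquarie Harbour', 'London']
```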

### Reviewing Candidate Placenames <a class="anchor" id="section-reviewplacenames"></a>

You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. They can all be collected in a single dataframe.

In [None]:
# where we store the details about each instance of the placenames
placenames_df = pd.DataFrame(columns=['Book','Chapter','NEIndex','Placename','Category','Context','Approval'])

In [None]:
# define which chapters and books you want to annotate
CHAPTERS=[1,2,3] 
BOOKS=[1,2]

In [None]:
disabledPipeline=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]

Now let's process FtToHNL.

In [None]:
nlp = spacy.load("en_core_web_sm")

i=0 # counter of the entities
for b in BOOKS:
    for c in CHAPTERS:
        filename = "FtToHNL_BOOK_"+str(b)+"_CHAPTER_"+str(c)+".txt" 
        # set the specific path for the 'filename'
        textlocation = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(textlocation)

        # read this chapter
        with open(textlocation, encoding="utf-8") as textfile:
            text = textfile.read()
        print("Working on |",filename)
        
        # run spaCy    
        doc = nlp(text,disable=disabledPipeline)

        # document level
        ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]

        # token level
        for e in doc.ents:
            if e.label_ in PLACENAME_CATEGORIES: # filter out MONEY, DATE etc
                print("{:5}\t\t{:30s}\t{}".format(i+1,e.text, e.label_))
                # To help understand the context of the text, extract the occurrence
                # (max() stops the start of the slice going negative near the start of the text)
                context_text=doc.text[max(0, e.start_char-30):e.end_char+30].replace("\n"," ")
                
                # Code to render with displacy
                #context_doc={"text":str(i+1)+" \t "+context_text,
                #             "ents":[{"start": len(str(i+1)+"   ")+30, 
                #                      "end":   len(str(i+1)+"   "+context_text)-30, 
                #                      "label": e.label_}],
                #             "title": None}
                #print(context_doc)
                #displacy.render(context_doc, style="ent", manual=True, jupyter=True)

                # find the placenames according to spaCy
                new_placename = {'Book':b,              # The Book number
                                'Chapter':c,            # The Chapter number
                                'NEIndex':i,            # A reference number to the nth Named Entity 
                                'Placename':e.text,     # The placename in the text
                                'Category':e.label_,    # The spaCy category
                                'Context':context_text, # The textual context where the placename was found
                                'Approval':1}        # A flag for whether this is a suitable placename
                # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
                placenames_df = pd.concat([placenames_df, pd.DataFrame([new_placename])], ignore_index=True)
                
            i=i+1 # entity counter
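
Extracting the context needs a little care at the edges of the text: if a placename sits within the first 30 characters, subtracting 30 from its start gives a negative number, and Python then counts from the end of the string. A minimal sketch of a safer slice follows; `context_window` is a hypothetical helper and the sample sentence is made up.

```python
def context_window(text, start_char, end_char, width=30):
    """Return up to `width` characters either side of text[start_char:end_char]."""
    left = max(0, start_char - width)          # never let the start go negative
    right = min(len(text), end_char + width)   # never run past the end
    return text[left:right].replace("\n", " ")

sample = "Macquarie Harbour was reached\nafter a long voyage from London."
start = sample.find("Macquarie Harbour")
end = start + len("Macquarie Harbour")

naive = sample[start - 30:end + 30]   # negative start: slices from the wrong place
safe = context_window(sample, start, end)

print(repr(naive))
print(repr(safe))
```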

In [None]:
placenames_df[['Book','Chapter','NEIndex','Placename','Category']]

This shows that a lot more placenames were found in Book 2 than in Book 1. This makes sense since Book 1 is set during an ocean voyage, whereas Book 2 is set at Macquarie Harbour in Australia.

However, there are a number of NEs that are unlikely to be placenames, regardless of how spaCy categorised them. It is best to consider the context in which the terms were used. Use the checkboxes to select which terms you do consider to be placenames.

In [None]:
#import ipywidgets as widgets

def changed(b):
    # The system sets the _property_lock_, changes the value, then releases the _property_lock_.
    # The confusing thing is that there is a value for the checkbox and a value for the property lock.
    # changed() is called three times for every change to the checkbox.

    # If you want to see any change to the checkbox value, uncomment these print() commands
    if b['name']=='value':
        #print(b,"\n")
        #print("found value")
        k=b['new']
        #print("    ",k)
    #print(b,"\n")

placename_items=[]
context_items=[]
num_items=[]

# Make a checkbox for every candidate placename
for b in BOOKS:
    for c in CHAPTERS:
        # Get the NEs from this book and chapter
        pbc=placenames_df[(placenames_df["Book"]==b) & (placenames_df["Chapter"]==c)]

        # Make lists of the candidate placenames, context text and index numbers 
        # Only the placenames are given a checkbox. 
        placename_items = placename_items + [widgets.Checkbox(value=True,description=name) for name in pbc["Placename"]]
        context_items = context_items + [widgets.Label(context) for context in pbc["Context"]]
        num_items = num_items + [widgets.Label(str(idx)) for idx in pbc["NEIndex"]]

# create a display
num_placenames=len(placename_items)
left_box = widgets.VBox(placename_items)
right_box = widgets.VBox(context_items)
num_box = widgets.VBox(num_items)
whole_box = widgets.HBox([num_box, left_box, right_box])
        
display(whole_box)

# respond to any changes in the checkboxes
for i in range(num_placenames):
    placename_items[i].observe(changed)

You can now copy all the values from the checkboxes to the data, so you know which placenames you have approved.

In [None]:
# Transfer the status of each checklist item to the data
for i in range(num_placenames):
    #print(num_items[i].value,placename_items[i].value,placename_items[i].description)

    NEIndex_num = int(num_items[i].value)
    approval_flag = placename_items[i].value
    
    #print("Looking for ["+str(NEIndex_num)+"]")
    
    # set the flag to match the checklist
    placenames_df.loc[placenames_df["NEIndex"] == NEIndex_num,"Approval"] = approval_flag

You can now visualise the result.

In [None]:
placenames_df[['NEIndex','Placename','Approval']]

From this, you can extract the final list of distinct placenames that you have approved. The names aren't sorted (though they could be), but the list will help you spot any NE that you forgot to unselect on the checklist. If you find one, go back to the checklist, unselect it, then rerun all the steps from there to here.

In [None]:
# Make a unique list of the approved placenames
approved_placenames = placenames_df[placenames_df["Approval"]==True]['Placename'].unique()
approved_placenames
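
If you would prefer a sorted list, one possible approach in plain Python is sketched below. The names are taken from the sample output of this notebook, while `approved`, `distinct` and `distinct_sorted` are hypothetical variables used only for illustration.

```python
approved = ["London", "the Bay of Biscay", "Grummet Island", "London", "Malabar"]

# dict.fromkeys keeps the first occurrence of each name, in the original order
distinct = list(dict.fromkeys(approved))

# Sort alphabetically, ignoring upper and lower case
distinct_sorted = sorted(distinct, key=str.lower)
print(distinct_sorted)  # ['Grummet Island', 'London', 'Malabar', 'the Bay of Biscay']
```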

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [None]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to ",save_location)

In [None]:
# save the list 
# using the savetxt from the numpy module
np.savetxt(save_location, 
           approved_placenames,
           delimiter =", ", 
           fmt ='% s')
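
One thing to watch when saving a one-column list as plain text is that a placename containing a comma would be split into two columns when the file is read back as csv. A possible alternative, sketched here with Python's csv module, quotes such values automatically. The buffer stands in for a real file, and the third placename is a made-up example.

```python
import csv
import io

placenames = ["London", "Van Diemen's Land", "Hobart Town, Van Diemen's Land"]

buffer = io.StringIO()            # stands in for a file opened with newline=""
writer = csv.writer(buffer)
for name in placenames:
    writer.writerow([name])       # one placename per row; commas get quoted

buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]
print(read_back == placenames)    # True
```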

TODO: Remove the MWE section as it is not needed for spaCy.
Stopped the code from executing for now, but kept for reference while updating the above steps.

### MWEs as Named Entities  <a class="anchor" id="section-mwes"></a>

The annotations so far show that *Cape* and *Pillar* are both a LOCATION NE, but not that *Cape Pillar* is actually the complete name of the location. However Stanza does also recognise multi-word expressions (MWEs), even though it recognises that they are part of an NE. Each *Sentence* annotated by Stanza has a list of [NE *Mentions*](https://stanfordnlp.github.io/CoreNLP/entitymentions.html) which are also given NER categories as their *Type*. 

Obviously, this isn't perfect. While *Van Diemen* is recognised as a *PERSON* NE, *Van Diemen's Land* (i.e., the former name for Tasmania) isn't recognised as a *LOCATION*. This is because Stanza is trained to only recognise certain combinations of words and categories as a new MWE category. More rules can be added, but this workshop won't explore this aspect.

Each token is also annotated with an index number corresponding to any Mention it is part of. Each token can only be part of one Mention. While the Mentions may be annotated per Sentence, the index number is actually considering all Sentences and Mentions in the annotated text.

Of course, you are mainly interested in the Named Entities that relate to locations. The location-based NER categories used by Stanza are:
* *LOCATION*
* *CITY*
* *COUNTRY*
* *STATE_OR_PROVINCE*

*NATIONALITY* might also be considered but it may depend on whether you care about phrases like *student of English history* or *Frenchman's cap*, or not.
It is easy to filter out all other NEs.

You can now put all of this together and find the placenames that are identified by Stanza in each chapter of the text. They can all be collected in a single dataframe.

__\[MN Note\]__ Change this to a background server, rather than a server on demand?


## Finding Locations for the Placenames <a class="anchor" id="section-findinglocs"></a>

Now that you have a list of placenames from the text, the next step is to work out their location on Earth. For this you can use a combination of specialised lists of locations, gazetteers and heuristics. The objective is to match every placename with the coordinates of a known location.

The first step is to read the file of your placenames.

In [294]:
filename="FtToHNL_placenames.csv"
print("Working on | ", filename)

# build the full path to the file inside the csv directory
data_location = os.path.normpath(os.path.join(csv_directory, filename))
data_filename = os.path.basename(data_location)

# Using pandas, read the csv file. This will place it in a dataframe format. 
placenames_df = pd.read_csv(data_location, encoding="utf-8",header=None)

Working on |  FtToHNL_placenames.csv


In [295]:
placenames_df = placenames_df.rename(columns={placenames_df.columns[0]: 'Placename'})

In [296]:
placenames_df

Unnamed: 0,Placename
0,the sleepy sea
1,the Bay of Biscay
2,Heath
3,London
4,Van Diemen's Land
...,...
75,Grummet Island
76,Grummet
77,Malabar
78,Dawes


### Identifying States and Capitals <a class="anchor" id="section-statescapitals"></a>

Some placenames, like *High Street* or *Maryborough*, may be very common across the world, or even in Australia. However, certain placenames refer to significant locations, like states, territories, large geographic features or capital cities. As such, if they are mentioned in a text, the placename is more likely to refer to the major location than a town or village in Tasmania.

These significant locations are a finite set. They can be defined in a reference file that can be reused when reviewing the placenames of any text.
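
As a rough sketch of how such a reference set might be used, the example below looks up placenames in a small dictionary. The `reference` dictionary and the `lookup` helper are hypothetical; the two entries reuse coordinates printed later in this notebook.

```python
# Hypothetical reference entries: name -> (category, longitude, latitude)
reference = {
    "kabul": ("capital", 69.183333, 34.51666667),
    "tirana": ("capital", 19.816667, 41.31666667),
}

def lookup(placename, reference):
    """Return the reference entry for a placename, ignoring case and outer spaces."""
    return reference.get(placename.strip().lower())

print(lookup(" Kabul ", reference))          # ('capital', 69.183333, 34.51666667)
print(lookup("Grummet Island", reference))   # None
```

A placename that is not in the reference set simply returns None, which leaves it to be resolved by other means.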

A good point for you to start is a file about modern capital cities.

In [299]:
filename="Contemporary_Capital_Cities.csv"
print("Working on | ", filename)

# build the full path to the file inside the reference directory
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

# Using pandas, read the csv file. This will place it in a dataframe format. 
#locationreferences_df=pd.DataFrame(columns=['Contemporary'])

#contemporary_references_df = pd.read_csv(reference_location, encoding="utf-8", header=0)
#contemporary_references_dict = contemporary_references_df.to_dict()

Working on |  Contemporary_Capital_Cities.csv


Rather than reading this and then processing it, you can process each line as you read it.
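
The idea can be sketched on a tiny in-memory sample before working on the real file. The column names follow the ones used in this notebook, the two sample rows reuse values from its output, and `long_rows` and `seen_continents` are hypothetical names. The sketch also adds each continent only once, which is an optional refinement.

```python
import csv
import io

# In-memory stand-in for the reference file (one wide row per country)
sample = io.StringIO(
    "CountryName,CapitalName,CapitalLongitude,CapitalLatitude,ContinentName\n"
    "Afghanistan,Kabul,69.183333,34.51666667,Asia\n"
    "Bhutan,Thimphu,89.633333,27.46666667,Asia\n"
)

long_rows = []
seen_continents = set()
for row in csv.DictReader(sample):
    # Each wide row fans out into several long rows, built as it is read
    long_rows.append({"LocationName": row["CapitalName"], "Category": "capital",
                      "Longitude": row["CapitalLongitude"],
                      "Latitude": row["CapitalLatitude"],
                      "PartOf": row["CountryName"]})
    # Add each continent only once, however many countries it contains
    if row["ContinentName"] not in seen_continents:
        seen_continents.add(row["ContinentName"])
        long_rows.append({"LocationName": row["ContinentName"], "Category": "continent",
                          "Longitude": None, "Latitude": None, "PartOf": None})

print(len(long_rows))  # 3: two capitals and one continent
```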

In [300]:
# Design the new Dataframe to hold the Long data
locref_df= pd.DataFrame(columns=['LocationName','Category', 'Longitude','Latitude','PartOf'])

In [301]:
import csv

# Collect the long-format rows in a list, then build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0)
locref_rows = []

# read the file
with open(reference_location, newline='') as reffile:
    reader = csv.DictReader(reffile)
    # process each row
    for row in reader:

        # Get the Long Country data
        if row.get('CountryName'):
            locdata={'LocationName': row['CountryName'],
                   'Category': 'country',
                   'Longitude': row['Country_Longitude'],
                   'Latitude': row['Country_Latitude'],
                   'PartOf': row['ContinentName']}
            print(locdata)
            locref_rows.append(locdata)

        # Get the Long Country data from the 'Country' column, if it is filled in
        if row.get('Country'):
            locdata={'LocationName': row['Country'],
                   'Category': 'country',
                   'Longitude': row['Country_Longitude'],
                   'Latitude': row['Country_Latitude'],
                   'PartOf': row['ContinentName']}
            print(locdata)
            locref_rows.append(locdata)

        # Get the Long Capital data (rows without a capital name are skipped)
        if row.get('CapitalName'):
            locdata={'LocationName': row['CapitalName'],
                   'Category': 'capital',
                   'Longitude': row['CapitalLongitude'],
                   'Latitude': row['CapitalLatitude'],
                   'PartOf': row['CountryName']}
            print(locdata)
            locref_rows.append(locdata)

        # Get the Long Continent data (rows without a continent name are skipped)
        if row.get('ContinentName'):
            locdata={'LocationName': row['ContinentName'],
                   'Category': 'continent',
                   'Longitude': None,
                   'Latitude': None,
                   'PartOf': None}
            print(locdata)
            locref_rows.append(locdata)

# Build the dataframe using the columns defined earlier
locref_df = pd.DataFrame(locref_rows, columns=locref_df.columns)



{'LocationName': 'Afghanistan', 'Category': 'country', 'Longitude': '69.183333', 'Latitude': '34.51666667', 'PartOf': 'Asia'}
{'LocationName': 'Kabul', 'Category': 'capital', 'Longitude': '69.183333', 'Latitude': '34.51666667', 'PartOf': 'Afghanistan'}
{'LocationName': 'Asia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Aland Islands', 'Category': 'country', 'Longitude': '19.9', 'Latitude': '60.116667', 'PartOf': 'Europe'}
{'LocationName': 'Mariehamn', 'Category': 'capital', 'Longitude': '19.9', 'Latitude': '60.116667', 'PartOf': 'Aland Islands'}
{'LocationName': 'Europe', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Albania', 'Category': 'country', 'Longitude': '19.816667', 'Latitude': '41.31666667', 'PartOf': 'Europe'}
{'LocationName': 'Tirana', 'Category': 'capital', 'Longitude': '19.816667', 'Latitude': '41.31666667', 'PartOf': 'Albania'}
{'LocationName': 'Europe', 'Category': 'co

{'LocationName': 'Africa', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Bermuda', 'Category': 'country', 'Longitude': '-64.783333', 'Latitude': '32.28333333', 'PartOf': 'North America'}
{'LocationName': 'Hamilton', 'Category': 'capital', 'Longitude': '-64.783333', 'Latitude': '32.28333333', 'PartOf': 'Bermuda'}
{'LocationName': 'North America', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Bhutan', 'Category': 'country', 'Longitude': '89.633333', 'Latitude': '27.46666667', 'PartOf': 'Asia'}
{'LocationName': 'Thimphu', 'Category': 'capital', 'Longitude': '89.633333', 'Latitude': '27.46666667', 'PartOf': 'Bhutan'}
{'LocationName': 'Asia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Bolivia', 'Category': 'country', 'Longitude': '-68.15', 'Latitude': '-16.5', 'PartOf': 'South America'}
{'LocationName': 'La Paz', 'Category': 'capital', 'Lon

{'LocationName': 'Comoros', 'Category': 'country', 'Longitude': '43.233333', 'Latitude': '-11.7', 'PartOf': 'Africa'}
{'LocationName': 'Moroni', 'Category': 'capital', 'Longitude': '43.233333', 'Latitude': '-11.7', 'PartOf': 'Comoros'}
{'LocationName': 'Africa', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Cook Islands', 'Category': 'country', 'Longitude': '-159.766667', 'Latitude': '-21.2', 'PartOf': 'Australia'}
{'LocationName': 'Avarua', 'Category': 'capital', 'Longitude': '-159.766667', 'Latitude': '-21.2', 'PartOf': 'Cook Islands'}
{'LocationName': 'Australia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Costa Rica', 'Category': 'country', 'Longitude': '-84.083333', 'Latitude': '9.933333333', 'PartOf': 'Central America'}
{'LocationName': 'San Jose', 'Category': 'capital', 'Longitude': '-84.083333', 'Latitude': '9.933333333', 'PartOf': 'Costa Rica'}
{'LocationName': 'Central Ameri

{'LocationName': 'Australia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Guatemala', 'Category': 'country', 'Longitude': '-90.516667', 'Latitude': '14.61666667', 'PartOf': 'Central America'}
{'LocationName': 'Guatemala City', 'Category': 'capital', 'Longitude': '-90.516667', 'Latitude': '14.61666667', 'PartOf': 'Guatemala'}
{'LocationName': 'Central America', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Guernsey', 'Category': 'country', 'Longitude': '-2.533333', 'Latitude': '49.45', 'PartOf': 'Europe'}
{'LocationName': 'Saint Peter Port', 'Category': 'capital', 'Longitude': '-2.533333', 'Latitude': '49.45', 'PartOf': 'Guernsey'}
{'LocationName': 'Europe', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Guinea', 'Category': 'country', 'Longitude': '-13.7', 'Latitude': '9.5', 'PartOf': 'Africa'}
{'LocationName': 'Conakry', 'Category': 'ca

{'LocationName': 'Europe', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Madagascar', 'Category': 'country', 'Longitude': '47.516667', 'Latitude': '-18.91666667', 'PartOf': 'Africa'}
{'LocationName': 'Antananarivo', 'Category': 'capital', 'Longitude': '47.516667', 'Latitude': '-18.91666667', 'PartOf': 'Madagascar'}
{'LocationName': 'Africa', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Malawi', 'Category': 'country', 'Longitude': '33.783333', 'Latitude': '-13.96666667', 'PartOf': 'Africa'}
{'LocationName': 'British Central Africa Protectorate', 'Category': 'country', 'Longitude': '33.783333', 'Latitude': '-13.96666667', 'PartOf': 'Africa'}
{'LocationName': 'Lilongwe', 'Category': 'capital', 'Longitude': '33.783333', 'Latitude': '-13.96666667', 'PartOf': 'Malawi'}
{'LocationName': 'Africa', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Ma

{'LocationName': 'Oman', 'Category': 'country', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Asia'}
{'LocationName': 'Muscat', 'Category': 'capital', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Oman'}
{'LocationName': 'Asia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Pakistan', 'Category': 'country', 'Longitude': '73.05', 'Latitude': '33.68333333', 'PartOf': 'Asia'}
{'LocationName': 'Islamabad', 'Category': 'capital', 'Longitude': '73.05', 'Latitude': '33.68333333', 'PartOf': 'Pakistan'}
{'LocationName': 'Asia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Palau', 'Category': 'country', 'Longitude': '134.633333', 'Latitude': '7.483333333', 'PartOf': 'Australia'}
{'LocationName': 'Melekeok', 'Category': 'capital', 'Longitude': '134.633333', 'Latitude': '7.483333333', 'PartOf': 'Palau'}
{'LocationName': 'Australia', 'Category': 'continent', 'Long

{'LocationName': 'King Edward Point', 'Category': 'capital', 'Longitude': '-36.5', 'Latitude': '-54.283333', 'PartOf': 'South Georgia and South Sandwich Islands'}
{'LocationName': 'Antarctica', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'South Korea', 'Category': 'country', 'Longitude': '126.983333', 'Latitude': '37.55', 'PartOf': 'Asia'}
{'LocationName': 'Seoul', 'Category': 'capital', 'Longitude': '126.983333', 'Latitude': '37.55', 'PartOf': 'South Korea'}
{'LocationName': 'Asia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'South Sudan', 'Category': 'country', 'Longitude': '31.616667', 'Latitude': '4.85', 'PartOf': 'Africa'}
{'LocationName': 'Juba', 'Category': 'capital', 'Longitude': '31.616667', 'Latitude': '4.85', 'PartOf': 'South Sudan'}
{'LocationName': 'Africa', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Spain', 'Category':

{'LocationName': 'Vanuatu', 'Category': 'country', 'Longitude': '168.316667', 'Latitude': '-17.73333333', 'PartOf': 'Australia'}
{'LocationName': 'Port-Vila', 'Category': 'capital', 'Longitude': '168.316667', 'Latitude': '-17.73333333', 'PartOf': 'Vanuatu'}
{'LocationName': 'Australia', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Vatican City', 'Category': 'country', 'Longitude': '12.45', 'Latitude': '41.9', 'PartOf': 'Europe'}
{'LocationName': 'Vatican City', 'Category': 'capital', 'Longitude': '12.45', 'Latitude': '41.9', 'PartOf': 'Vatican City'}
{'LocationName': 'Europe', 'Category': 'continent', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Venezuela', 'Category': 'country', 'Longitude': '-66.866667', 'Latitude': '10.48333333', 'PartOf': 'South America'}
{'LocationName': 'Caracas', 'Category': 'capital', 'Longitude': '-66.866667', 'Latitude': '10.48333333', 'PartOf': 'Venezuela'}
{'LocationName': 'South A

In [302]:
locref_df

| | LocationName | Category | Longitude | Latitude | PartOf |
|---|---|---|---|---|---|
| 0 | Afghanistan | country | 69.183333 | 34.51666667 | Asia |
| 1 | Kabul | capital | 69.183333 | 34.51666667 | Afghanistan |
| 2 | Asia | continent | | | |
| 3 | Aland Islands | country | 19.9 | 60.116667 | Europe |
| 4 | Mariehamn | capital | 19.9 | 60.116667 | Aland Islands |
| ... | ... | ... | ... | ... | ... |
| 775 | Marshall Islands District | capital | 1885 | Marshall Islands | Jabor,ÊJaluit Atoll |
| 776 | moved toÊMajuro Atoll | continent | | | |
| 777 | Kolonia | country | 1986 | Federated States of Micronesia | moved toÊPalikir |
| 778 | Federated States of Micronesia | capital | 1986 | Federated States of Micronesia | Kolonia |


The next step is to reorganise this information into a format that is easier to work with. Rather than keeping country names and capital cities in a wide format (i.e., in different columns), you can reorganise them into a long format (i.e., in the same column, with a second column indicating which geopolitical category each name belongs to).
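As a rough illustration of the wide-to-long idea, the sketch below uses `pandas.DataFrame.melt` on a tiny made-up table. The column names `CountryName`, `CapitalName` and `ContinentName` are assumptions for this example only, and the sketch leaves out the coordinates and `PartOf` values that the real reference list carries.

```python
import pandas as pd

# Hypothetical wide-format data: one row per country, one column per kind of place
wide_df = pd.DataFrame({
    "CountryName": ["Afghanistan", "Albania"],
    "CapitalName": ["Kabul", "Tirana"],
    "ContinentName": ["Asia", "Europe"],
})

# melt() stacks the three name columns into one LocationName column and
# records which column each value came from in Category
long_df = wide_df.melt(
    value_vars=["CountryName", "CapitalName", "ContinentName"],
    var_name="Category",
    value_name="LocationName",
)

# Swap the source column names for the category labels used in the reference list
long_df["Category"] = long_df["Category"].map({
    "CountryName": "country",
    "CapitalName": "capital",
    "ContinentName": "continent",
})

# A continent is repeated once per country in wide data, so drop any repeats
long_df = long_df.drop_duplicates().reset_index(drop=True)
print(long_df)
```

Each name now sits in the same `LocationName` column, so a single lookup can test a placename against countries, capitals and continents at once.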

Of course, if you are researching historical texts, then some of these contemporary locations may have had different names. Old New York was once New Amsterdam (and had the [nickname of Gotham](https://www.nypl.org/blog/2011/01/25/so-why-do-we-call-it-gotham-anyway), amongst others). Istanbul was Constantinople. Some locations had [romanized names](https://en.wikipedia.org/wiki/Chinese_postal_romanization), like Beijing being called Peking. They may be a long time gone but you might want to add them to the list of significant known locations.
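One possible way to add such names is to copy the entry of the modern location and file it under the historical name. The sketch below is only an illustration: `locref_demo` stands in for the reference list, `historical_names` is a hypothetical mapping, and the coordinates are rounded, approximate values.

```python
import pandas as pd

# Stand-in for the reference list; coordinates are rounded and illustrative only
locref_demo = pd.DataFrame([
    {"LocationName": "Beijing", "Category": "capital",
     "Longitude": "116.4", "Latitude": "39.9", "PartOf": "China"},
])

# Hypothetical mapping of historical name -> modern name
historical_names = {"Peking": "Beijing", "New Amsterdam": "New York"}

alias_rows = []
for old_name, modern_name in historical_names.items():
    matches = locref_demo[locref_demo["LocationName"] == modern_name]
    if matches.empty:
        # No modern entry to copy from (New York is not in this demo list)
        continue
    alias = matches.iloc[0].to_dict()   # copy category, coordinates and PartOf
    alias["LocationName"] = old_name    # but file it under the historical name
    alias_rows.append(alias)

locref_demo = pd.concat([locref_demo, pd.DataFrame(alias_rows)], ignore_index=True)
print(locref_demo)
```

Copying the modern coordinates is a simplification; a historical settlement did not always sit exactly where its modern successor does.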

Capital cities have also changed over time. The change may be due to a political decision, like the movement of the Australian parliament from Melbourne to the new city of Canberra, or it may be a necessity following a war, like Bonn becoming the capital of West Germany after World War II. These older capitals may also have to be accommodated in your reference data.

Because FtToHNL is set in the 19th Century CE, the next step is to add various capital cities from then.

In [303]:
filename="Capital_Cities_of_19th_Century.csv"
print("Working on | ", filename)

# build the full, normalised path to this reference file inside the reference directory
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Working on |  Capital_Cities_of_19th_Century.csv


Again, you need to read the reference file and add any helpful data into your reference list.

In [304]:
# read the file
with open(reference_location, newline='') as reffile:
    reader = csv.DictReader(reffile)
    # process each row
    for row in reader:
        
        # Add each old capital as a 'capital' entry (this file has no coordinates)
        if row['Old Capital City']:
            locdata_df={'LocationName': row['Old Capital City'],
                   'Category': 'capital',
                   'Longitude': None,
                   'Latitude': None,
                   'PartOf': row['Today_A_Part_of']}
            print(locdata_df)
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            # (assumes pandas was imported as pd)
            locref_df = pd.concat([locref_df, pd.DataFrame([locdata_df])], ignore_index=True)
        

{'LocationName': 'Kelmis', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Belgium'}
{'LocationName': 'Tarnovo', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Bulgaria'}
{'LocationName': 'Plovdiv', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Bulgaria'}
{'LocationName': 'Phnom Penh', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Cambodia'}
{'LocationName': 'Grand-Bassam', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Ivory Coast'}
{'LocationName': 'Ragusa (Dubrovnik)', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Croatia'}
{'LocationName': 'Boma', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Congo, Democratic Republic of the'}
{'LocationName': 'Copenhagen', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Denmark'}
{'LocationName': 'Levuka', 'Category': 'capital', 'Longitude': None, 'Latitud

There are also larger geopolitical regions that may have been associated with placenames and cultures, for instance empires, dynasties and colonies like the British Empire or the Zulu Kingdom. Again, the borders and applicability of these political entities changed over time, so a contemporary reference list may not include them. 

The 19th Century CE was a time of many European empires, so for FtToHNL you will need to add reference data associated with the relevant entities.

In [305]:
filename="Political_Entities_of_19th_Century.csv"
print("Working on | ", filename)

# build the full, normalised path to this reference file inside the reference directory
reference_location = os.path.normpath(os.path.join(reference_directory, filename))
reference_filename = os.path.basename(reference_location)

Working on |  Political_Entities_of_19th_Century.csv


When processing this reference file, you can add the old political entity, its capital (if known), the geographic region (like continent or part thereof) and the modern country it would be considered part of.  

In [306]:
# read the file
with open(reference_location, newline='') as reffile:
    reader = csv.DictReader(reffile)
    # process each row
    for row in reader:
        
        # DataFrame.append was removed in pandas 2.0, so each entry is added
        # with pd.concat instead (assumes pandas was imported as pd)

        # Get the Long Capital data
        if row['Capital']:
            locdata_df={'LocationName': row['Capital'],
                   'Category': 'capital',
                   'Longitude': None,
                   'Latitude': None,
                   'PartOf': row['CountryName']}
            print(locdata_df)
            locref_df = pd.concat([locref_df, pd.DataFrame([locdata_df])], ignore_index=True)

        # Get the Long Political data
        if row['Name']:
            locdata_df={'LocationName': row['Name'],
                   'Category': 'political',
                   'Longitude': row['Country_Longitude'],
                   'Latitude': row['Country_Latitude'],
                   'PartOf': row['CountryName']}
            print(locdata_df)
            locref_df = pd.concat([locref_df, pd.DataFrame([locdata_df])], ignore_index=True)

        # Get the Long Country data (about the modern Country)
        # Both columns often name the same country, so duplicates are expected here
        if row['CountryName']:
            locdata_df={'LocationName': row['CountryName'],
                   'Category': 'country',
                   'Longitude': row['Country_Longitude'],
                   'Latitude': row['Country_Latitude'],
                   'PartOf': row['Region']}
            print(locdata_df)
            locref_df = pd.concat([locref_df, pd.DataFrame([locdata_df])], ignore_index=True)
        if row['Today Part of']:
            locdata_df={'LocationName': row['Today Part of'],
                   'Category': 'country',
                   'Longitude': row['Country_Longitude'],
                   'Latitude': row['Country_Latitude'],
                   'PartOf': row['Region']}
            print(locdata_df)
            locref_df = pd.concat([locref_df, pd.DataFrame([locdata_df])], ignore_index=True)

        # Get the Long Region/Continent data
        if row['Region']:
            locdata_df={'LocationName': row['Region'],
                   'Category': 'region',
                   'Longitude': None,
                   'Latitude': None,
                   'PartOf': None}
            print(locdata_df)
            locref_df = pd.concat([locref_df, pd.DataFrame([locdata_df])], ignore_index=True)
               

{'LocationName': 'Colony, Portuguese', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': ''}
{'LocationName': 'Luanda', 'Category': 'political', 'Longitude': '', 'Latitude': '', 'PartOf': ''}
{'LocationName': 'Angola', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'São Salvador', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Angola'}
{'LocationName': 'Kingdom of Kongo', 'Category': 'political', 'Longitude': '13.216667', 'Latitude': '-8.833333333', 'PartOf': 'Angola'}
{'LocationName': 'Angola', 'Category': 'country', 'Longitude': '13.216667', 'Latitude': '-8.833333333', 'PartOf': 'Africa: Central'}
{'LocationName': 'Angola', 'Category': 'country', 'Longitude': '13.216667', 'Latitude': '-8.833333333', 'PartOf': 'Africa: Central'}
{'LocationName': 'Africa: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Not specified', 'Category': 'capital', 'Lo

{'LocationName': 'Kingdom of Janjero', 'Category': 'political', 'Longitude': '38.7', 'Latitude': '9.033333333', 'PartOf': 'Ethiopia'}
{'LocationName': 'Ethiopia', 'Category': 'country', 'Longitude': '38.7', 'Latitude': '9.033333333', 'PartOf': 'Africa: East, Horn'}
{'LocationName': 'Ethiopia', 'Category': 'country', 'Longitude': '38.7', 'Latitude': '9.033333333', 'PartOf': 'Africa: East, Horn'}
{'LocationName': 'Africa: East, Horn', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Kingdom of Jimma', 'Category': 'political', 'Longitude': '38.7', 'Latitude': '9.033333333', 'PartOf': 'Ethiopia'}
{'LocationName': 'Ethiopia', 'Category': 'country', 'Longitude': '38.7', 'Latitude': '9.033333333', 'PartOf': 'Africa: East, Horn'}
{'LocationName': 'Ethiopia', 'Category': 'country', 'Longitude': '38.7', 'Latitude': '9.033333333', 'PartOf': 'Africa: East, Horn'}
{'LocationName': 'Africa: East, Horn', 'Category': 'region', 'Longitude': None, 'Latitude': 

{'LocationName': 'Lesotho', 'Category': 'country', 'Longitude': '27.483333', 'Latitude': '-29.31666667', 'PartOf': 'Africa: South'}
{'LocationName': 'Africa: South', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Maravi', 'Category': 'political', 'Longitude': '33.783333', 'Latitude': '-13.96666667', 'PartOf': 'Malawi'}
{'LocationName': 'Malawi', 'Category': 'country', 'Longitude': '33.783333', 'Latitude': '-13.96666667', 'PartOf': 'Africa: South'}
{'LocationName': 'Malawi', 'Category': 'country', 'Longitude': '33.783333', 'Latitude': '-13.96666667', 'PartOf': 'Africa: South'}
{'LocationName': 'Africa: South', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Gaza Empire', 'Category': 'political', 'Longitude': '32.583333', 'Latitude': '-25.95', 'PartOf': 'Mozambique'}
{'LocationName': 'Mozambique', 'Category': 'country', 'Longitude': '32.583333', 'Latitude': '-25.95', 'PartOf': 'Africa: South'}
{'Lo

{'LocationName': 'Ghana', 'Category': 'country', 'Longitude': '-0.216667', 'Latitude': '5.55', 'PartOf': 'Africa: West'}
{'LocationName': 'Africa: West', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Portuguese Guinea', 'Category': 'political', 'Longitude': '-13.7', 'Latitude': '9.5', 'PartOf': 'Guinea'}
{'LocationName': 'Guinea', 'Category': 'country', 'Longitude': '-13.7', 'Latitude': '9.5', 'PartOf': 'Africa: West'}
{'LocationName': 'Guinea', 'Category': 'country', 'Longitude': '-13.7', 'Latitude': '9.5', 'PartOf': 'Africa: West'}
{'LocationName': 'Africa: West', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Imamate of Futa Jallon', 'Category': 'political', 'Longitude': '-13.7', 'Latitude': '9.5', 'PartOf': 'Guinea'}
{'LocationName': 'Guinea', 'Category': 'country', 'Longitude': '-13.7', 'Latitude': '9.5', 'PartOf': 'Africa: West'}
{'LocationName': 'Guinea\xa0', 'Category': 'country', 'Long

{'LocationName': 'Senegal', 'Category': 'country', 'Longitude': '-17.633333', 'Latitude': '14.73333333', 'PartOf': 'Africa: West'}
{'LocationName': 'Senegal', 'Category': 'country', 'Longitude': '-17.633333', 'Latitude': '14.73333333', 'PartOf': 'Africa: West'}
{'LocationName': 'Africa: West', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Cayor', 'Category': 'political', 'Longitude': '-17.633333', 'Latitude': '14.73333333', 'PartOf': 'Senegal'}
{'LocationName': 'Senegal', 'Category': 'country', 'Longitude': '-17.633333', 'Latitude': '14.73333333', 'PartOf': 'Africa: West'}
{'LocationName': 'Senegal', 'Category': 'country', 'Longitude': '-17.633333', 'Latitude': '14.73333333', 'PartOf': 'Africa: West'}
{'LocationName': 'Africa: West', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Imamate of Futa Toro', 'Category': 'political', 'Longitude': '-17.633333', 'Latitude': '14.73333333', 'PartOf': 'Sen

{'LocationName': 'Honduras', 'Category': 'country', 'Longitude': '-87.216667', 'Latitude': '14.1', 'PartOf': 'Americas: Central'}
{'LocationName': 'Americas: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Honduras', 'Category': 'political', 'Longitude': '-87.216667', 'Latitude': '14.1', 'PartOf': 'Honduras'}
{'LocationName': 'Honduras', 'Category': 'country', 'Longitude': '-87.216667', 'Latitude': '14.1', 'PartOf': 'Americas: Central'}
{'LocationName': 'Honduras', 'Category': 'country', 'Longitude': '-87.216667', 'Latitude': '14.1', 'PartOf': 'Americas: Central'}
{'LocationName': 'Americas: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Mosquito Coast', 'Category': 'political', 'Longitude': '-86.25', 'Latitude': '12.13333333', 'PartOf': 'Nicaragua'}
{'LocationName': 'Nicaragua', 'Category': 'country', 'Longitude': '-86.25', 'Latitude': '12.13333333', 'PartOf': 'Americas: Cent

{'LocationName': 'Republic of Indian Stream', 'Category': 'political', 'Longitude': '-77', 'Latitude': '38.883333', 'PartOf': 'United States'}
{'LocationName': 'United States', 'Category': 'country', 'Longitude': '-77', 'Latitude': '38.883333', 'PartOf': 'Americas: North'}
{'LocationName': 'United States', 'Category': 'country', 'Longitude': '-77', 'Latitude': '38.883333', 'PartOf': 'Americas: North'}
{'LocationName': 'Americas: North', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'State of Muskogee', 'Category': 'political', 'Longitude': '-77', 'Latitude': '38.883333', 'PartOf': 'United States'}
{'LocationName': 'United States', 'Category': 'country', 'Longitude': '-77', 'Latitude': '38.883333', 'PartOf': 'Americas: North'}
{'LocationName': 'United States', 'Category': 'country', 'Longitude': '-77', 'Latitude': '38.883333', 'PartOf': 'Americas: North'}
{'LocationName': 'Americas: North', 'Category': 'region', 'Longitude': None, 'Latitude'

{'LocationName': 'Americas: South', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Liga Federal', 'Category': 'political', 'Longitude': '-56.166667', 'Latitude': '-34.85', 'PartOf': 'Uruguay'}
{'LocationName': 'Uruguay', 'Category': 'country', 'Longitude': '-56.166667', 'Latitude': '-34.85', 'PartOf': 'Americas: South'}
{'LocationName': 'Uruguay', 'Category': 'country', 'Longitude': '-56.166667', 'Latitude': '-34.85', 'PartOf': 'Americas: South'}
{'LocationName': 'Americas: South', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Uruguay', 'Category': 'political', 'Longitude': '-56.166667', 'Latitude': '-34.85', 'PartOf': 'Uruguay'}
{'LocationName': 'Uruguay', 'Category': 'country', 'Longitude': '-56.166667', 'Latitude': '-34.85', 'PartOf': 'Americas: South'}
{'LocationName': 'Uruguay', 'Category': 'country', 'Longitude': '-56.166667', 'Latitude': '-34.85', 'PartOf': 'Americas: South'}
{'LocationN

{'LocationName': 'India', 'Category': 'country', 'Longitude': '77.2', 'Latitude': '28.6', 'PartOf': 'Asia: South'}
{'LocationName': 'Asia: South', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Alirajpur State', 'Category': 'political', 'Longitude': '77.2', 'Latitude': '28.6', 'PartOf': 'India'}
{'LocationName': 'India', 'Category': 'country', 'Longitude': '77.2', 'Latitude': '28.6', 'PartOf': 'Asia: South'}
{'LocationName': 'India\xa0', 'Category': 'country', 'Longitude': '77.2', 'Latitude': '28.6', 'PartOf': 'Asia: South'}
{'LocationName': 'Asia: South', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Alwar State', 'Category': 'political', 'Longitude': '77.2', 'Latitude': '28.6', 'PartOf': 'India'}
{'LocationName': 'India', 'Category': 'country', 'Longitude': '77.2', 'Latitude': '28.6', 'PartOf': 'Asia: South'}
{'LocationName': 'India\xa0', 'Category': 'country', 'Longitude': '77.2', 'Latitude'

{'LocationName': 'Burma', 'Category': 'country', 'Longitude': '', 'Latitude': '', 'PartOf': 'Asia: Southeast'}
{'LocationName': 'Asia: Southeast', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Kingdom of Cambodia: Middle Period', 'Category': 'political', 'Longitude': '104.916667', 'Latitude': '11.55', 'PartOf': 'Cambodia'}
{'LocationName': 'Cambodia', 'Category': 'country', 'Longitude': '104.916667', 'Latitude': '11.55', 'PartOf': 'Asia: Southeast'}
{'LocationName': 'Cambodia', 'Category': 'country', 'Longitude': '104.916667', 'Latitude': '11.55', 'PartOf': 'Asia: Southeast'}
{'LocationName': 'Asia: Southeast', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Dutch East Indies', 'Category': 'political', 'Longitude': '', 'Latitude': '', 'PartOf': ''}
{'LocationName': 'Indonesia\xa0', 'Category': 'country', 'Longitude': '', 'Latitude': '', 'PartOf': 'Asia: Southeast'}
{'LocationName': 'Asia: Southe

{'LocationName': 'Muscat and Oman', 'Category': 'political', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Oman'}
{'LocationName': 'Oman', 'Category': 'country', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Asia: West'}
{'LocationName': 'Oman', 'Category': 'country', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Asia: West'}
{'LocationName': 'Asia: West', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Imamate of Oman', 'Category': 'political', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Oman'}
{'LocationName': 'Oman', 'Category': 'country', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Asia: West'}
{'LocationName': 'Oman', 'Category': 'country', 'Longitude': '58.583333', 'Latitude': '23.61666667', 'PartOf': 'Asia: West'}
{'LocationName': 'Asia: West', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Persia

{'LocationName': 'Europe: British Isles', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Vienna,\xa0Budapest', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Austria'}
{'LocationName': 'Austria-Hungary', 'Category': 'political', 'Longitude': '16.366667', 'Latitude': '48.2', 'PartOf': 'Austria'}
{'LocationName': 'Austria', 'Category': 'country', 'Longitude': '16.366667', 'Latitude': '48.2', 'PartOf': 'Europe: Central'}
{'LocationName': 'Austria', 'Category': 'country', 'Longitude': '16.366667', 'Latitude': '48.2', 'PartOf': 'Europe: Central'}
{'LocationName': 'Europe: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Vienna', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Austria'}
{'LocationName': 'Archduchy of Austria', 'Category': 'political', 'Longitude': '16.366667', 'Latitude': '48.2', 'PartOf': 'Austria'}
{'LocationName': 'Austria', 'C

{'LocationName': 'Hohenzollern-Sigmaringen', 'Category': 'political', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Germany'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Europe: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Duchy of Holstein', 'Category': 'political', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Germany'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Europe: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': N

{'LocationName': 'Principality of Schaumburg-Lippe', 'Category': 'political', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Germany'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Europe: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Schwarzburg-Rudolstadt', 'Category': 'political', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Germany'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Germany', 'Category': 'country', 'Longitude': '13.4', 'Latitude': '52.51666667', 'PartOf': 'Europe: Central'}
{'LocationName': 'Europe: Central', 'Category': 'region', 'Longitude': None, 'Latitude': None

{'LocationName': 'Europe: Nordic', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Cisalpine Republic', 'Category': 'political', 'Longitude': '12.483333', 'Latitude': '41.9', 'PartOf': 'Italy'}
{'LocationName': 'Italy', 'Category': 'country', 'Longitude': '12.483333', 'Latitude': '41.9', 'PartOf': 'Europe: Southcentral'}
{'LocationName': 'Italy', 'Category': 'country', 'Longitude': '12.483333', 'Latitude': '41.9', 'PartOf': 'Europe: Southcentral'}
{'LocationName': 'Europe: Southcentral', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Republic of Cospaia', 'Category': 'political', 'Longitude': '12.483333', 'Latitude': '41.9', 'PartOf': 'Italy'}
{'LocationName': 'Italy', 'Category': 'country', 'Longitude': '12.483333', 'Latitude': '41.9', 'PartOf': 'Europe: Southcentral'}
{'LocationName': 'Italy', 'Category': 'country', 'Longitude': '12.483333', 'Latitude': '41.9', 'PartOf': 'Europe: Southcentral'}

{'LocationName': 'Europe: Southcentral', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Andorra', 'Category': 'political', 'Longitude': '1.516667', 'Latitude': '42.5', 'PartOf': 'Andorra'}
{'LocationName': 'Andorra', 'Category': 'country', 'Longitude': '1.516667', 'Latitude': '42.5', 'PartOf': 'Europe: Southwest'}
{'LocationName': 'Andorra', 'Category': 'country', 'Longitude': '1.516667', 'Latitude': '42.5', 'PartOf': 'Europe: Southwest'}
{'LocationName': 'Europe: Southwest', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Various', 'Category': 'capital', 'Longitude': None, 'Latitude': None, 'PartOf': 'Portugal'}
{'LocationName': 'Kingdom of Portugal', 'Category': 'political', 'Longitude': '-9.133333', 'Latitude': '38.71666667', 'PartOf': 'Portugal'}
{'LocationName': 'Portugal', 'Category': 'country', 'Longitude': '-9.133333', 'Latitude': '38.71666667', 'PartOf': 'Europe: Southwest'}
{'LocationNa

{'LocationName': 'Easter Island', 'Category': 'country', 'Longitude': '', 'Latitude': '', 'PartOf': 'Oceania'}
{'LocationName': 'Oceania', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Colony of Fiji', 'Category': 'political', 'Longitude': '178.416667', 'Latitude': '-18.13333333', 'PartOf': 'Fiji'}
{'LocationName': 'Fiji', 'Category': 'country', 'Longitude': '178.416667', 'Latitude': '-18.13333333', 'PartOf': 'Oceania'}
{'LocationName': 'Fiji', 'Category': 'country', 'Longitude': '178.416667', 'Latitude': '-18.13333333', 'PartOf': 'Oceania'}
{'LocationName': 'Oceania', 'Category': 'region', 'Longitude': None, 'Latitude': None, 'PartOf': None}
{'LocationName': 'Kingdom of Fiji', 'Category': 'political', 'Longitude': '178.416667', 'Latitude': '-18.13333333', 'PartOf': 'Fiji'}
{'LocationName': 'Fiji', 'Category': 'country', 'Longitude': '178.416667', 'Latitude': '-18.13333333', 'PartOf': 'Oceania'}
{'LocationName': 'Fiji', 'Category': 'countr

You now have a lot of reference data from various sources. Some entries may be duplicated, and some may describe the same location without carrying all of its data. You will need to clean up the reference data so that there is only a single entry for each location.
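A minimal sketch of one way to do this clean-up is shown below. It is not the notebook's own method: `demo_df` is a small stand-in modelled on the printed output above (the complete Vienna row is hypothetical), and merging on `LocationName` plus `Category` is an assumption about what counts as "the same location".

```python
import pandas as pd

# Stand-in rows modelled on the output above; the last row is hypothetical
demo_df = pd.DataFrame([
    {"LocationName": "India", "Category": "country",
     "Longitude": "77.2", "Latitude": "28.6", "PartOf": "Asia: South"},
    {"LocationName": "India\xa0", "Category": "country",
     "Longitude": "77.2", "Latitude": "28.6", "PartOf": "Asia: South"},
    {"LocationName": "Vienna", "Category": "capital",
     "Longitude": None, "Latitude": None, "PartOf": "Austria"},
    {"LocationName": "Vienna", "Category": "capital",
     "Longitude": "16.366667", "Latitude": "48.2", "PartOf": "Austria"},
])

clean_df = demo_df.copy()

# Normalise names: non-breaking spaces become ordinary spaces, then trim
clean_df["LocationName"] = (
    clean_df["LocationName"].str.replace("\xa0", " ", regex=False).str.strip()
)

# Treat empty strings as missing so they never win over real values
clean_df = clean_df.replace("", pd.NA)

# Keep one row per (LocationName, Category); first() takes the first
# non-missing value in each remaining column
clean_df = clean_df.groupby(
    ["LocationName", "Category"], as_index=False, sort=False
).first()
print(clean_df)
```

Where two entries disagree (for example, different `PartOf` values), this approach silently keeps the first one, so conflicting rows are worth reviewing by hand.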

In [307]:
# Count the entries, then delete any completely empty ones (there shouldn't be any)
print("Start:",len(locref_df))
locref_df = locref_df.dropna(how='all')


Start: 2701


In [308]:
# For each location, review any entries
# sort_values(inplace=True) returns None, so sort without inplace and loop over the result
for location_name in locref_df.sort_values('LocationName', ascending=False)['LocationName']:
    print(location_name)

The next step is to see if any of the placenames from our selected chapters of FtToHNL match these locations.

In [309]:
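# Build the lookup set once instead of rebuilding a list for every placename
known_locations = set(locref_df['LocationName'].dropna())

for placename in placenames_df['Placename']:
    # Exact match
    if placename in known_locations:
        print("*** Found", placename,"[Exact match]")
    else:
        print("Still looking for ", placename)
    # TODO: partial match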

Still looking for  the sleepy sea
Still looking for  the Bay of Biscay
Still looking for  Heath
*** Found London [Exact match]
Still looking for  Van Diemen's Land
Still looking for  Vickers
Still looking for  Sylvia
Still looking for  Bath
Still looking for  Julia Vickers's
Still looking for  Frere
Still looking for  Chatham
Still looking for  CHAPTER II
Still looking for  Surgeon Pine
Still looking for  Coromandel
Still looking for  Pine
*** Found India [Exact match]
Still looking for  the Hydaspes for Calcutta
Still looking for  the poop guard
Still looking for  MONOTONY
Still looking for  Three'll
Still looking for  Van Diemen's
Still looking for  Tasman
Still looking for  Cape Pillar
Still looking for  Pirates' Bay
Still looking for  east
Still looking for  west
Still looking for  the Isle of Wight
Still looking for  the South-West Cape
Still looking for  Swan Port
Still looking for  Mediterranean
Still looking for  Maria Island
Still looking for  the Three Thumbs
Still looking fo
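Exact matching misses candidates such as "the Isle of Wight", where a leading article is the only difference. The sketch below shows one possible approach to partial matching using only the standard library. The `normalise` and `find_partial_match` helpers, the `known` list and the 0.85 cutoff are all assumptions for illustration and are not part of the notebook.

```python
import difflib
import re

def normalise(placename):
    """Lower-case, drop a leading 'the' and a trailing possessive."""
    name = placename.strip().lower()
    name = re.sub(r"^the\s+", "", name)
    name = re.sub(r"'s?$", "", name)
    return name

def find_partial_match(placename, locations, cutoff=0.85):
    """Return (location, match_type), or (None, None) if nothing matches."""
    by_norm = {normalise(loc): loc for loc in locations}
    target = normalise(placename)
    # 1. Same name once articles and possessives are removed
    if target in by_norm:
        return by_norm[target], "normalised"
    # 2. A known location appearing as whole words inside the candidate
    for norm, loc in by_norm.items():
        if re.search(r"\b" + re.escape(norm) + r"\b", target):
            return loc, "contained"
    # 3. Fall back to fuzzy string similarity
    close = difflib.get_close_matches(target, list(by_norm), n=1, cutoff=cutoff)
    if close:
        return by_norm[close[0]], "fuzzy"
    return None, None

# Hypothetical reference names, not the real contents of locref_df
known = ["London", "India", "Isle of Wight", "Calcutta", "Port-Vila"]

for candidate in ["the Isle of Wight", "the Hydaspes for Calcutta",
                  "Port Vila", "Surgeon Pine"]:
    print(candidate, "->", find_partial_match(candidate, known))
```

Containment and fuzzy matches can produce false positives, especially for short names, so they are better treated as suggestions to review than as confirmed locations.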

### Searching a Gazetteer for Locations <a class="anchor" id="section-searchgazetteer"></a>

[TO BE DONE]