# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
    * [Look for NEs](#section-nes)
    * [MWEs as Named Entities](#section-mwes)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWEs) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Most will come with your Python installation, but the following are crucial:
<ul>
    <li> pandas</li> 
    <li> json</li> 
    <li> nltk</li> 
    <li> geopandas</li> 
    <li> shapely  </li> 
</ul>
</div>

In [207]:
# TODO: UPDATE
# Many of these are probably not needed.

import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

import matplotlib.pyplot as plt
import pandas as pd

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS,
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
import geopandas as gpd
from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Shapely is used to work with geometric shapes
from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
from fuzzywuzzy import fuzz

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You will also want to set up some directories for the input and output of data.

In [208]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
#maps_directory = os.path.normpath("../maps/")

## Create the data directories (nothing happens if they already exist)
os.makedirs(text_directory, exist_ok=True)
os.makedirs(csv_directory, exist_ok=True)
#os.makedirs(maps_directory, exist_ok=True)

For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained from [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is a plain text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing any accented characters with their unaccented forms and the British pound sterling symbol with the word *pounds*.
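
The simplification described above can be approximated with Python's standard library. This is only a minimal sketch and not the exact procedure used to prepare the workshop file: the function name `simplify_to_ascii` is hypothetical, and it assumes that dropping accents (and any other non-ASCII character) is acceptable for your text.

```python
import unicodedata

def simplify_to_ascii(raw_text):
    """Reduce text to plain ASCII (hypothetical helper, for illustration only)."""
    # spell out the pound sign before the non-ASCII characters are dropped
    replaced = raw_text.replace("£", "pounds")
    # split each accented character into a base letter plus a combining accent
    decomposed = unicodedata.normalize("NFKD", replaced)
    # drop everything that is not ASCII, including the combining accents
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(simplify_to_ascii("The fiancée paid £20 at the café."))
```

Note that the pound sign is replaced exactly where it stands, so `£20` becomes `pounds20`. You may prefer a different replacement, such as adding a space.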

The novel is divided into four books, each set in a different region of the world. You can start with the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [209]:
#filename="FtToHNL_BOOK_1.txt"
filename="FtToHNL_BOOK_2.txt"
print("Working on | ", filename)

# build the full path to the chosen file inside the text directory
textlocation = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(textlocation)

# read the whole file into a single string
with open(textlocation, encoding="utf-8") as f:
    text = f.read()

Working on |  FtToHNL_BOOK_2.txt


This is no more than a long string of characters. So far, you have done no processing. 

In [210]:
text[0:499] # look at the first 499 characters

"BOOK II.--MACQUARIE HARBOUR.  1833.\n\n\n\n\n\n\nCHAPTER I.\n\nTHE TOPOGRAPHY OF VAN DIEMEN'S LAND.\n\n\n\nThe south-east coast of Van Diemen's Land, from the solitary Mewstone\nto the basaltic cliffs of Tasman's Head, from Tasman's Head to Cape Pillar,\nand from Cape Pillar to the rugged grandeur of Pirates' Bay, resembles\na biscuit at which rats have been nibbling.  Eaten away by the continual action\nof the ocean which, pouring round by east and west, has divided the peninsula\nfrom the mainland of the Austr"

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Look for NEs <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like names of places, people or organisations.

There are various packages that include Named Entity Recognition (NER), e.g., the [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb) interface, the Stanford NER, and the spaCy library. They often combine machine learning with a rule-based system to identify NEs and classify them into categories.

For this notebook, you will be using the NER provided by spaCy (https://spacy.io/usage/linguistic-features#morphology), which is available as a Python library.

In [211]:
import spacy
#import spacy_lookups_data

SpaCy allows you to load a language model that has been trained on various examples of the language of interest. 

In [212]:
nlp = spacy.load("en_core_web_sm")

SpaCy will automatically run the text through several stages of natural language processing. This pipeline includes tokenising the text into individual tokens or terms, like words, values and punctuation.

TODO: Update

It is simple to use the client. You just tell it to annotate the text. However, Stanza allows you to specify what to annotate the text with. For instance, you might want it to tokenise the terms, label their parts-of-speech and lemmatised forms as well as recognising any NEs.

TODO: Update

<div class="alert alert-block alert-info">
The options for the Stanza client include:<ul>
    <li> <strong>tokenize - </strong> split into words or terms </li> 
      <li>    <strong>ssplit - </strong> split into sentences or independent statements</li> 
     <li>     <strong>pos -  </strong> syntactic parts-of-speech</li> 
      <li>    <strong>lemma -  </strong> lemmatised form (not always a root form)</li> 
      <li>    <strong>ner - </strong> named entity recognition</li> 
      <li>    <strong>depparse -  </strong> parsing of dependencies</li> 
      <li>    <strong>coref - </strong> co-reference resolution</li> 
      <li>    <strong>kbppandas - </strong> KBP competition format</li> 
</ul>
A full explanation can be found at <a href="https://stanfordnlp.github.io/CoreNLP/pipeline.html">https://stanfordnlp.github.io/CoreNLP/pipeline.html</a></div>

For example, the default pipeline contains the following components.

In [213]:
print("Pipeline:", nlp.pipe_names)

Pipeline: ['tagger', 'parser', 'ner']


Text sent to the spaCy model will be processed by the pipeline.

In [214]:
# sample text
sampletext = "Autonomous cars shift insurance liability toward manufacturers"

In [215]:
doc = nlp(sampletext)
doc

Autonomous cars shift insurance liability toward manufacturers

The output from the pipeline is then available as attributes of the tokens in the resulting `doc`.

In [216]:
for token in doc:
    print(token.text," - ", "Morph: ",token.morph, 
          "\n   Dep: ",token.dep_, 
          "\n   Head: ",token.head.text, 
          "\n   Pos: ",token.head.pos_,  # part-of-speech of the head token, not of the token itself
          "\n   Child: ",[child for child in token.children])


Autonomous  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:  amod 
   Head:  cars 
   Pos:  NOUN 
   Child:  []
cars  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:  nsubj 
   Head:  shift 
   Pos:  VERB 
   Child:  [Autonomous]
shift  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:  ROOT 
   Head:  shift 
   Pos:  VERB 
   Child:  [cars, liability]
insurance  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:  compound 
   Head:  liability 
   Pos:  NOUN 
   Child:  []
liability  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:  dobj 
   Head:  shift 
   Pos:  VERB 
   Child:  [insurance, toward]
toward  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:  prep 
   Head:  liability 
   Pos:  NOUN 
   Child:  [manufacturers]
manufacturers  -  Morph:  <spacy

However, you might not want all of this pipeline processing, as some of it may not benefit your analysis. Any unneeded processing will also slow the system down and use more memory. This is particularly true of the parser. Luckily, it is easy to stipulate what you want excluded from the pipeline. 

In [217]:
doc=nlp(sampletext, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [218]:
for token in doc:
    print(token.text," - ", "Morph: ",token.morph, 
          "\n   Dep: ",token.dep_, 
          "\n   Head: ",token.head.text, 
          "\n   Pos: ",token.head.pos_,  # part-of-speech of the head token, not of the token itself
          "\n   Child: ",[child for child in token.children])

Autonomous  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  Autonomous 
   Pos:   
   Child:  []
cars  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  cars 
   Pos:   
   Child:  []
shift  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  shift 
   Pos:   
   Child:  []
insurance  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  insurance 
   Pos:   
   Child:  []
liability  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  liability 
   Pos:   
   Child:  []
toward  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  toward 
   Pos:   
   Child:  []
manufacturers  -  Morph:  <spacy.tokens.morphanalysis.MorphAnalysis object at 0x7fb5af0cfd50> 
   Dep:   
   Head:  manufacturers 
   

Of course, what you are interested in is the NER. Each text sent through a pipeline that includes the `ner` component comes back with a list of the entities that were found, available as `doc.ents`. 

In [219]:
sampletext = "Apple is looking at buying a U.K. startup based in London for $1 billion."

In [220]:
doc = nlp(sampletext)

for ent in doc.ents:
    print(ent.text, "[",ent.label_,"]")

Apple [ ORG ]
U.K. [ GPE ]
London [ GPE ]
$1 billion [ MONEY ]


As you can see, each entity is labelled with a category.

TODO: UPDATE

<div class="alert alert-block alert-info">
    The NER categories classified by Stanza include:
   <ul>
<li><strong>Default:</strong>
LOCATION, ORGANIZATION, PERSON</li>
<li><strong>High recall: </strong>
DATE, LOCATION, MONEY, ORGANIZATION, PERCENT, PERSON, TIME, MISC</li>
<li><strong>KBP fine-grained:</strong>
CAUSE_OF_DEATH, CITY, COUNTRY, CRIMINAL_CHARGE, EMAIL, HANDLE,
IDEOLOGY, NATIONALITY, RELIGION, STATE_OR_PROVINCE, TITLE, URL</li>
</ul>
</div>

TODO: Update

Most tokenised terms in the sentence have O as their NER value (that is the letter O not the number 0). Some however have been categorised. For instance, Van and Diemen are both classified as being PERSON named entities, Tasman is an ORGANIZATION and Cape and Pillar are in the LOCATION category. These categories are specific to Stanza. There are two key levels of processing available - the normal level will only identify the categories LOCATION, ORGANIZATION and PERSON, but the high recall processing will also consider other specialised phrases like TIME and MONEY which are not really named entities. Stanza can also look for the categories used in the KBP competition like CITY, COUNTRY and NATIONALITY, but this fine-grained processing will be slower.


The data for the entities includes the character position for the start and the end of the NE.

In [221]:
doc = nlp(sampletext)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 29 33 GPE
London 51 57 GPE
$1 billion 62 72 MONEY
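
The character positions are ordinary string offsets, so they can be used to slice the original text. The sketch below copies the offsets from the output above and uses them to recover each entity together with some surrounding context. The helper `context_window` and its `width` parameter are hypothetical names introduced only for this illustration.

```python
sampletext = "Apple is looking at buying a U.K. startup based in London for $1 billion."

# (text, start_char, end_char, label) tuples copied from the output above
ents = [("Apple", 0, 5, "ORG"),
        ("U.K.", 29, 33, "GPE"),
        ("London", 51, 57, "GPE"),
        ("$1 billion", 62, 72, "MONEY")]

def context_window(text, start, end, width=15):
    """Return the span plus up to `width` characters on either side (hypothetical helper)."""
    left = max(0, start - width)
    right = min(len(text), end + width)
    return text[left:right]

for ent_text, start, end, label in ents:
    # slicing with the offsets recovers exactly the entity text
    assert sampletext[start:end] == ent_text
    print(label, "|", context_window(sampletext, start, end))
```

A window like this is one simple way to inspect the context in which a placename is used.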


Each token will also have a value that indicates whether it is part of an NE.

In [222]:
for token in doc:
    print(token.text, "[", token.ent_type_, "]")

Apple [ ORG ]
is [  ]
looking [  ]
at [  ]
buying [  ]
a [  ]
U.K. [ GPE ]
startup [  ]
based [  ]
in [  ]
London [ GPE ]
for [  ]
$ [ MONEY ]
1 [ MONEY ]
billion [ MONEY ]
. [  ]
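
The token-level values can be turned back into whole entities by joining runs of adjacent tokens that share the same entity type. The sketch below works on plain (token, type) pairs copied from the output above, so it runs without spaCy. The function name `group_entities` is hypothetical.

```python
from itertools import groupby

# (token, entity type) pairs copied from the output above
tagged_tokens = [("Apple", "ORG"), ("is", ""), ("looking", ""), ("at", ""),
                 ("buying", ""), ("a", ""), ("U.K.", "GPE"), ("startup", ""),
                 ("based", ""), ("in", ""), ("London", "GPE"), ("for", ""),
                 ("$", "MONEY"), ("1", "MONEY"), ("billion", "MONEY"), (".", "")]

def group_entities(tagged):
    """Join runs of adjacent tokens that share a non-empty entity type."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label:  # skip runs of tokens that are outside any entity
            entities.append((" ".join(token for token, _ in run), label))
    return entities

print(group_entities(tagged_tokens))
```

This simple grouping has two limitations. Tokens are joined with a space, so the last entity comes out as `$ 1 billion`, and two separate entities of the same type that sit next to each other would be merged into one. The token attribute `ent_iob_` in spaCy marks where each entity begins, which avoids the second problem.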


TODO: Update

The annotations so far show that *Cape* and *Pillar* are both a LOCATION NE, but not that *Cape Pillar* is actually the complete name of the location. However Stanza does also recognise multi-word expressions (MWEs), even though it recognises that they are part of an NE. Each *Sentence* annotated by Stanza has a list of [NE *Mentions*](https://stanfordnlp.github.io/CoreNLP/entitymentions.html) which are also given NER categories as their *Type*. 

It is also possible to hand-code entities after the NER has been done. This can help make up for any common irregularities with the NER for your input documents.

For instance, this model doesn't recognise that FB is an NE.

In [223]:
#import spacy
from spacy.tokens import Span

sampletext = "FB is hiring a new vice president of global policy"

doc = nlp(sampletext)
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print('Entities before:', ents)
# The model didn't recognize "FB" as an entity :-(

Entities before: []


The solution is to create a new entry for the list of entities.

In [224]:
# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

ents = [(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents]
print('Entities after:', ents)

Entities after: [('FB', 0, 1, 'ORG')]


Even the data for the tokens is updated. 

In [225]:
for token in doc:
    print(token.text, "[", token.ent_type_, "]")

FB [ ORG ]
is [  ]
hiring [  ]
a [  ]
new [  ]
vice [  ]
president [  ]
of [  ]
global [  ]
policy [  ]
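
Hand-coding a span requires knowing where the missed name occurs. One simple way of locating names you already know is a case-insensitive string search, sketched below with the standard library only. This is a swapped-in technique rather than a spaCy feature: the helper `find_known_names` and the default label `GPE` are assumptions made for this illustration.

```python
import re

def find_known_names(text, known_names, label="GPE"):
    """Find every occurrence of each known name, ignoring case (hypothetical helper)."""
    matches = []
    for name in known_names:
        for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            matches.append((m.group(), m.start(), m.end(), label))
    # order the matches by where they start in the text
    return sorted(matches, key=lambda match: match[1])

sample = "THE TOPOGRAPHY OF VAN DIEMEN'S LAND.\n\nThe south-east coast of Van Diemen's Land"
for match in find_known_names(sample, ["Van Diemen's Land"]):
    print(match)
```

The resulting character offsets could then be converted into spans, for example with spaCy's `Doc.char_span`, provided they line up with token boundaries.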


SpaCy also allows the input documents to be processed in batches. This helps better manage the processing demands of the system throughout the pipeline when there are multiple files or many sentences.

In [226]:
# multiple texts in a list
sampletexts = ["Autonomous cars shift insurance liability toward manufacturers","This is a text", "These are lots of texts", "..."]

# disable the pipeline components you don't need
for doc in nlp.pipe(sampletexts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    print("Entities: ",[(ent.text, ent.label_) for ent in doc.ents])
    for token in doc:
        print("   ",token.text, "[", token.ent_type_, "]")


Entities:  []
    Autonomous [  ]
    cars [  ]
    shift [  ]
    insurance [  ]
    liability [  ]
    toward [  ]
    manufacturers [  ]
Entities:  []
    This [  ]
    is [  ]
    a [  ]
    text [  ]
Entities:  []
    These [  ]
    are [  ]
    lots [  ]
    of [  ]
    texts [  ]
Entities:  []
    ... [  ]


While you can loop over the output of `nlp.pipe` straight away, printing it directly only shows a generator object. To see the documents, you need to convert it into a list first.

In [227]:
docs = nlp.pipe(sampletexts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
print(docs)

<generator object Language.pipe at 0x7fb5b03b37d0>


In [228]:
print(list(docs))

[Autonomous cars shift insurance liability toward manufacturers, This is a text, These are lots of texts, ...]
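
A generator produces its items one at a time and can only be consumed once. The sketch below illustrates this with a stand-in function, `fake_pipe`, which is hypothetical and simply upper-cases each text instead of running a language model.

```python
def fake_pipe(texts):
    """Stand-in for nlp.pipe (hypothetical): yields one result at a time."""
    for t in texts:
        yield t.upper()

results = fake_pipe(["This is a text", "These are lots of texts"])

first_pass = list(results)   # consumes the generator
second_pass = list(results)  # nothing is left to consume

print(first_pass)
print(second_pass)
```

The second list is empty, so if you need the documents more than once, keep the list rather than the generator.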


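Entity annotations can also be set at the token level with `doc.from_array`, using the `ENT_IOB` and `ENT_TYPE` attributes. In the following example, the array marks only the first token as the beginning (B) of a *GPE* entity, so the entities found by the model are overwritten and only *London* remains.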

In [229]:
import numpy
#import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
#doc = nlp.make_doc("London is a big city in the United Kingdom and New York is in the United States of America.")
doc = nlp("London is a big city in the United Kingdom and New York is in the United States of America.")

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print("Entities: ",ents)

print("\nBefore:", doc.ents)  # the four entities found by the model

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print("\nEntities: ",ents)

print("\nAfter", doc.ents)  # [London]

Entities:  [('London', 0, 6, 'GPE'), ('the United Kingdom', 24, 42, 'GPE'), ('New York', 47, 55, 'GPE'), ('the United States of America', 62, 90, 'GPE')]

Before: (London, the United Kingdom, New York, the United States of America)

Entities:  [('London', 0, 6, 'GPE')]

After (London,)


TODO: Placeholder in case we want to show off the displacy rendering.

In [230]:
#import spacy
from spacy import displacy

sampletext = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(sampletext)
displacy.render(doc, style="ent")

TODO: talk about extracting just the NER types we want.

In [245]:
spacy.explain('LOC')

'Non-GPE locations, mountain ranges, bodies of water'

In [232]:
spacy.explain('FAC')

'Buildings, airports, highways, bridges, etc.'

In [233]:
spacy.explain('GPE')

'Countries, cities, states'

In [234]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

TODO: Run through a single chapter (variable: text) before doing the entire collection?
    This will allow all NEs to be shown, then the filter be introduced.
    This will put it more on topic.

In [235]:
text[0:499]

"BOOK II.--MACQUARIE HARBOUR.  1833.\n\n\n\n\n\n\nCHAPTER I.\n\nTHE TOPOGRAPHY OF VAN DIEMEN'S LAND.\n\n\n\nThe south-east coast of Van Diemen's Land, from the solitary Mewstone\nto the basaltic cliffs of Tasman's Head, from Tasman's Head to Cape Pillar,\nand from Cape Pillar to the rugged grandeur of Pirates' Bay, resembles\na biscuit at which rats have been nibbling.  Eaten away by the continual action\nof the ocean which, pouring round by east and west, has divided the peninsula\nfrom the mainland of the Austr"

In [236]:
doc = nlp(text)

# document level: collect the details of every entity
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]

# entity level: print a numbered list of the entities
for i, e in enumerate(doc.ents, start=1):
    print("{:5}\t\t{:30s}\t{}".format(i, e.text, e.label_))

    1		1833                          	DATE
    2		Van Diemen's                  	ORG
    3		Tasman                        	ORG
    4		Tasman                        	ORG
    5		Cape Pillar                   	PERSON
    6		Cape Pillar                   	GPE
    7		Pirates' Bay                  	LOC
    8		east                          	LOC
    9		west                          	LOC
   10		Australasian                  	NORP
   11		Van Diemen's                  	ORG
   12		the Isle of Wight             	GPE
   13		the South-West Cape           	LOC
   14		Swan Port                     	FAC
   15		Australian                    	NORP
   16		Van Diemen's                  	PERSON
   17		Mediterranean                 	LOC
   18		Cape Bougainville             	PERSON
   19		Maria Island                  	LOC
   20		the Three Thumbs              	ORG
   21		Tasman                        	ORG
   22		Peninsula                     	LOC
   23		Pillar                        	NORP
   24		Storm Bay     

 1550		Dawes                         	PERSON
 1551		eleven                        	CARDINAL
 1552		the night                     	TIME
 1553		the white face                	FAC
 1554		Rufus Dawes                   	PERSON
 1555		Come, Sylvia                  	WORK_OF_ART
 1556		two                           	CARDINAL
 1557		four                          	CARDINAL
 1558		first                         	ORDINAL
 1559		an hour                       	TIME
 1560		Frere                         	PERSON
 1561		evening                       	TIME
 1562		Gates                         	PERSON
 1563		Dawes                         	ORG
 1564		night                         	TIME
 1565		The night                     	TIME
 1566		Providence                    	ORG
 1567		Rufus Dawes                   	PERSON
 1568		Frere                         	ORG
 1569		Two                           	CARDINAL
 1570		nearly a fifth                	CARDINAL
 1571		two                           	CARDINAL
 1572		second 

TODO: Expand on this explanation with example context.
Talk about the issue with _Van Diemen's_ versus _Van Diemen's Land_ (and _Tasman's Head_)

These categories are assigned according to the context in which the NE is used. For this reason, _Van Diemen's_ is considered an *ORG*, a *PERSON* and a *FAC*, depending on its linguistic context. Note also that _VAN DIEMEN'S LAND_ in the title of the chapter isn't recognised as an NE, probably due to its unconventional case.
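
One way to see how much the categories vary is to tally the labels given to each name. The sketch below uses a handful of (text, label) pairs copied from the start of the output above, so it runs without spaCy. The variable names are chosen only for this illustration.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the start of the output above
found = [("Van Diemen's", "ORG"), ("Tasman", "ORG"), ("Tasman", "ORG"),
         ("Cape Pillar", "PERSON"), ("Cape Pillar", "GPE"),
         ("Van Diemen's", "ORG"), ("Van Diemen's", "PERSON"),
         ("Tasman", "ORG")]

# count how often each name received each category
labels_by_name = defaultdict(Counter)
for name, label in found:
    labels_by_name[name][label] += 1

for name, counts in labels_by_name.items():
    if len(counts) > 1:  # the same name received more than one category
        print(name, dict(counts))
```

Names that receive several categories are good candidates for manual checking or hand-coded corrections.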

TODO: Expand on this explanation.

Of course, not all of these NE categories are likely to contain placenames, so you will need to make a list of the categories that regularly do.

In [237]:
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]
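
The effect of this list can be checked on a few entities before running it over whole chapters. The sketch below filters (text, label) pairs copied from the start of the earlier output and needs only plain Python.

```python
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]

# (text, label) pairs copied from the start of the earlier output
found = [("1833", "DATE"), ("Van Diemen's", "ORG"), ("Tasman", "ORG"),
         ("Cape Pillar", "PERSON"), ("Cape Pillar", "GPE"),
         ("Pirates' Bay", "LOC"), ("east", "LOC"), ("Australasian", "NORP")]

# keep only the entities whose category is in the list
candidates = [(name, label) for name, label in found
              if label in PLACENAME_CATEGORIES]
print(candidates)
```

The filter is a trade-off. Including *ORG* keeps _Van Diemen's_, but it also lets through organisations that are not places, while the occurrence of _Cape Pillar_ labelled *PERSON* is lost.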

You can now put all of this together and find the placenames that are identified by spaCy in each chapter of the text. They can all be collected in a single dataframe.

In [238]:
# where we store the details about each instance of the placenames
placenames = pd.DataFrame(columns=['Book','Chapter','NEIndex','Placename','Category'])

In [239]:
# define which chapters and books you want to annotate
CHAPTERS=[1,2,3] 
BOOKS=[1,2]
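
Each chapter is assumed to be stored in its own file, named after its book and chapter number as in the processing loop of this notebook. As a quick check before running the NER, the expected filenames can be listed first. This is a minimal sketch using only the standard library.

```python
from itertools import product

BOOKS = [1, 2]
CHAPTERS = [1, 2, 3]

# one filename per (book, chapter) pair, following the naming pattern used in this notebook
filenames = [f"FtToHNL_BOOK_{b}_CHAPTER_{c}.txt" for b, c in product(BOOKS, CHAPTERS)]
for name in filenames:
    print(name)
```

Comparing this list with the contents of your text directory makes it easy to spot a missing chapter before the loop fails partway through.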

Now let's process FtToHNL.

In [240]:
import spacy

nlp = spacy.load("en_core_web_sm")

rows = [] # one dictionary per placename found
i=0 # counter of the entities
for b in BOOKS:
    for c in CHAPTERS:
        filename = "FtToHNL_BOOK_"+str(b)+"_CHAPTER_"+str(c)+".txt" 
        # set the specific path for the 'filename'
        textlocation = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(textlocation)

        # read this chapter
        with open(textlocation, encoding="utf-8") as f:
            text = f.read()
        print("Working on |",filename)
        
        # run spaCy    
        doc = nlp(text)

        # entity level
        for e in doc.ents:
            if e.label_ in PLACENAME_CATEGORIES: # filter out MONEY, DATE etc
                print("{:5}\t\t{:30s}\t{}".format(i+1,e.text, e.label_))

                # record the placename according to spaCy
                rows.append({'Book':b,
                             'Chapter':c,
                             'NEIndex':i,
                             'Placename':e.text,
                             'Category':e.label_})
                    
            i=i+1 # entity counter (counts every entity, not only the placenames)

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
placenames = pd.DataFrame(rows, columns=['Book','Chapter','NEIndex','Placename','Category'])

Working on | FtToHNL_BOOK_1_CHAPTER_1.txt
    2		Malabar                       	GPE
    6		Crown                         	ORG
   10		Crown                         	ORG
   11		Crown                         	ORG
   15		the sleepy sea                	FAC
   17		the Bay of Biscay             	LOC
   22		Lord Bellasis                 	ORG
   23		Heath                         	ORG
   25		London                        	GPE
   38		Van Diemen's Land             	ORG
   44		Vickers                       	ORG
   46		Vickers                       	ORG
   47		Sylvia                        	ORG
   50		Bath                          	ORG
   53		Julia Vickers's               	ORG
   54		Frere                         	ORG
   58		Sylvia                        	GPE
   64		Chatham                       	FAC
   81		Sylvia                        	GPE
   93		Frere                         	ORG
   96		Sylvia                        	LOC
Working on | FtToHNL_BOOK_1_CHAPTER_2.txt
   97		CHAPTER II                 

In [241]:
placenames

     Book  Chapter  NEIndex       Placename Category
0       1        1        1         Malabar      GPE
1       1        1        5           Crown      ORG
2       1        1        9           Crown      ORG
3       1        1       10           Crown      ORG
4       1        1       14  the sleepy sea      FAC
..    ...      ...      ...             ...      ...
154     2        3      590          Sydney      GPE
155     2        3      591         England      GPE
156     2        3      592           Troke      ORG
157     2        3      597           Troke      ORG


This shows that a lot more placenames were found in Book 2 than in Book 1. This makes sense since Book 1 is set during an ocean voyage, whereas Book 2 is set at Macquarie Harbour in Australia.

The final step is to save this new data to a csv file. You have already defined the directory for your data files.

In [242]:
filename = "FtToHNL_placenames.txt"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to ",save_location)

Saving placename data to  ../ner_output/FtToHNL_placenames.txt


In [243]:
placenames.to_csv(save_location)
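
By default, `DataFrame.to_csv` also writes the dataframe's index as an unnamed first column. Passing `index=False` leaves it out. If you want to write or check such a file without pandas, the standard library's `csv` module is an alternative. The sketch below writes two of the rows shown above to an in-memory buffer rather than to the workshop's output file, and reads them back.

```python
import csv
import io

rows = [{"Book": 1, "Chapter": 1, "NEIndex": 1, "Placename": "Malabar", "Category": "GPE"},
        {"Book": 1, "Chapter": 1, "NEIndex": 5, "Placename": "Crown", "Category": "ORG"}]

# write to an in-memory buffer here; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Book", "Chapter", "NEIndex", "Placename", "Category"])
writer.writeheader()
writer.writerows(rows)

# read the data back; the csv module returns every value as a string
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["Placename"], restored[0]["NEIndex"])
```

Because every value comes back as a string, numeric columns such as `NEIndex` need converting before they are used in calculations.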

TODO: Remove the MWE section as it is not needed for spaCy.
Stopped the code from executing for now, but kept it for reference while updating the above steps.

### MWEs as Named Entities  <a class="anchor" id="section-mwes"></a>

The annotations so far show that *Cape* and *Pillar* are both a LOCATION NE, but not that *Cape Pillar* is actually the complete name of the location. However Stanza does also recognise multi-word expressions (MWEs), even though it recognises that they are part of an NE. Each *Sentence* annotated by Stanza has a list of [NE *Mentions*](https://stanfordnlp.github.io/CoreNLP/entitymentions.html) which are also given NER categories as their *Type*. 

Obviously, this isn't perfect. While *Van Diemen* is recognised as a *PERSON* NE, *Van Diemen's Land* (i.e., the former name for Tasmania) isn't recognised as a *LOCATION*. This is because Stanza is trained to only recognise certain combinations of words and categories as a new MWE category. These rules can be extended, but this workshop won't explore that aspect.

Each token is also annotated with an index number corresponding to any Mention it is part of. Each token can only be part of one Mention. While the Mentions may be annotated per Sentence, the index number is actually considering all Sentences and Mentions in the annotated text.

Of course, you are mainly interested in the Named Entities that relate to locations. The location-based NER categories used by Stanza are:
* *LOCATION*
* *CITY*
* *COUNTRY*
* *STATE_OR_PROVINCE*

*NATIONALITY* might also be considered, depending on whether you care about phrases like *student of English history* or *Frenchman's cap*.
It is easy to filter out all other NEs.

You can now put all of this together and find the placenames that are identified by Stanza in each chapter of the text. They can all be collected in a single dataframe.

__\[MN Note\]__ Change this to a background server, rather than a server on demand?
