# ATAP Notebook for the Geolocation project

This notebook helps you access the Geolocation tools in a Python development environment.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
 * [Look for NEs](#section-nes)
 * [MWEs as Named Entities](#section-mwes)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the Stanza CoreNLP library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Most will come with your Python installation, but the following are crucial:
<ul>
    <li> pandas</li> 
    <li> json</li> 
    <li> nltk</li> 
    <li> geopandas</li> 
    <li> shapely  </li> 
</ul>
</div>

In [None]:
import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

import matplotlib.pyplot as plt
import pandas as pd

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS,
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
import geopandas as gpd
from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Shapely is used to work with geometric shapes
from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
from fuzzywuzzy import fuzz

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You will also want to set up some directories for the input and output data.

In [None]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
#maps_directory = os.path.normpath("../maps/")

## Create the data directories
if not os.path.exists(text_directory):
    os.makedirs(text_directory)
if not os.path.exists(csv_directory):
    os.makedirs(csv_directory)
#if not os.path.exists(maps_directory):
#    os.makedirs(maps_directory)

For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained via [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is a plain text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing accented characters with their unaccented forms and the British pound sterling symbol with the word *pounds*.
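
The simplification described above could be reproduced along the following lines. This is a minimal sketch, not the script that was actually used to prepare the workshop files; the function name `to_ascii` and the sample string are hypothetical.

```python
import unicodedata

def to_ascii(raw_text):
    """Reduce text to plain ASCII (hypothetical helper for illustration)."""
    # Replace the pound sign first, because it has no ASCII decomposition
    replaced = raw_text.replace("£", "pounds")
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop every character that is still outside ASCII, such as combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "A café near the château"
print(to_ascii(sample))
```

Any character with no ASCII equivalent is silently dropped by this approach, so it is worth checking the result before relying on it.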

The novel is divided into four books, each set in a different region of the world. You can start with the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [None]:
#filename="FtToHNL_BOOK_1.txt"
filename="FtToHNL_BOOK_2.txt"
print("Working on | ", filename)

# Build the full path to the chosen file inside the text directory
textlocation = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(textlocation)

# Read the whole file into memory and close it afterwards
with open(textlocation, encoding="utf-8") as text_file:
    text = text_file.read()

Prepare the data.

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Look for NEs <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like names of places, people or organisations.

There are various packages that can be used to recognise NEs. The [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb) interface gives Python access to Stanford CoreNLP, which is written in Java. CoreNLP combines machine learning and a rule-based system to identify NEs and classify them into categories.

The first step is to install the core Python interface of Stanza. You should only need to do this once per machine (unless you are using a virtual machine, like through Binder).

In [None]:
# Install stanza; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

You can then import the Python package.

In [None]:
# Import stanza
import stanza

However, you will also need to install the Stanford CoreNLP Java package. This will take longer. Again, you should only need to do this once, but you will have to remember where you installed it.

In [None]:
# Set the CORENLP_HOME environment variable to point to the installation location
import os
corenlp_dir = './corenlp'
os.environ["CORENLP_HOME"] = corenlp_dir

In [None]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
stanza.install_corenlp(dir=corenlp_dir)

In [None]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

Stanza works as a combination of a server, which does the bulk of the processing, and the client interface for the user. The client talks to the server, sending it queries that tell the server what to do. The server then sends back messages with the required output. 

In [None]:
# Import client module
from stanza.server import CoreNLPClient

In [None]:
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
testclient = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)
print(testclient)

Servers are normally run continually in the background. This is useful when client queries arrive often, perhaps from multiple clients. However, Stanza also allows the server to be run on demand, which is more suitable for your needs (and this workshop). This frees your computer's resources for other tasks when the server isn't doing anything.

It is simple to use the client. You just tell it to annotate the text. However, Stanza allows you to specify what to annotate the text with. For instance, you might want it to tokenise the terms, label their parts-of-speech and lemmatised forms, and recognise any NEs.

<div class="alert alert-block alert-info">
The options for the Stanza client include:<ul>
    <li> <strong>tokenize - </strong> split into words or terms </li> 
      <li>    <strong>ssplit - </strong> split into sentences or independent statements</li> 
     <li>     <strong>pos -  </strong> syntactic parts-of-speech</li> 
      <li>    <strong>lemma -  </strong> lemmatised form (not always a root form)</li> 
      <li>    <strong>ner - </strong> named entity recognition</li> 
      <li>    <strong>depparse -  </strong> parsing of dependencies</li> 
      <li>    <strong>coref - </strong> co-reference resolution</li> 
      <li>    <strong>kbp - </strong> KBP competition format</li> 
</ul>
A full explanation can be found at <a href="https://stanfordnlp.github.io/CoreNLP/pipeline.html">https://stanfordnlp.github.io/CoreNLP/pipeline.html</a></div>

In [None]:
print("Starting a server with the Python \"with\" statement...")
with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
                   memory='4G', 
                   endpoint='http://localhost:9001', 
                   be_quiet=True) as client:
    # Annotate the text
    document = client.annotate(text[0:99999])

print("\nThe server should be stopped upon exit from the \"with\" statement.")

Note that the client wasn't told to annotate all of the text. The default text limit is 100000 characters, so only the first 99999 characters were sent. If your text is larger than this, you might want to divide it up and process it in a series of batches, such as per section or chapter. 

The text can now be examined per sentence and per word for the three tags that have been annotated - Lemma, POS and NER.

In [None]:
# Iterate over all tokens in all sentences, and print out the word, lemma, POS and NER tags
print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

for i, sent in enumerate(document.sentence):
    print("[Sentence {}]".format(i+1))
    for t in sent.token:
        print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
    print("")

Sentence 6 is a good example to focus on. 


In [None]:
print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

sent = document.sentence[5]
for t in sent.token:
    print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))

Most tokenised terms in the sentence have O as their NER value (that is the letter O, not the number 0). Some, however, have been categorised. For instance, *Van* and *Diemen* are both classified as *PERSON* named entities, *Tasman* is an *ORGANIZATION*, and *Cape* and *Pillar* are in the *LOCATION* category. These categories are [specific to Stanza](https://stanfordnlp.github.io/CoreNLP/ner.html). There are two key levels of processing available: the normal level will only identify the categories *LOCATION*, *ORGANIZATION* and *PERSON*, while the high recall processing will also consider other specialised phrases like *TIME* and *MONEY*, which are not really named entities. Stanza can also look for the categories used in the KBP competition, like *CITY*, *COUNTRY* and *NATIONALITY*, but this fine-grained processing will be slower.

<div class="alert alert-block alert-info">
    The NER categories classified by Stanza include:
   <ul>
<li><strong>Default:</strong>
LOCATION, ORGANIZATION, PERSON</li>
<li><strong>High recall: </strong>
DATE, LOCATION, MONEY, ORGANIZATION, PERCENT, PERSON, TIME, MISC</li>
<li><strong>KBP fine-grained:</strong>
CAUSE_OF_DEATH, CITY, COUNTRY, CRIMINAL_CHARGE, EMAIL, HANDLE,
IDEOLOGY, NATIONALITY, RELIGION, STATE_OR_PROVINCE, TITLE, URL</li>
</ul>
</div>
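
To get a quick overview of which categories appear in a text, the tags can be tallied. The sketch below uses a short hand-written list of (word, tag) pairs that mirrors the tags described above, with one untagged word added for contrast. With real output, the pairs would come from `t.word` and `t.ner` for each token in `sent.token`. The variable names are hypothetical.

```python
from collections import Counter

# Hand-written stand-in for the (word, NER tag) pairs of one annotated sentence
tagged_words = [
    ("Van", "PERSON"),
    ("Diemen", "PERSON"),
    ("Tasman", "ORGANIZATION"),
    ("Cape", "LOCATION"),
    ("Pillar", "LOCATION"),
    ("the", "O"),
]

# Count how often each category occurs, skipping the "O" (letter O) non-entity tag
tag_counts = Counter(tag for _, tag in tagged_words if tag != "O")
print(tag_counts)
```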

### MWEs as Named Entities  <a class="anchor" id="section-mwes"></a>

The annotations so far show that *Cape* and *Pillar* are both *LOCATION* NEs, but not that *Cape Pillar* is actually the complete name of the location. However, Stanza does also recognise multi-word expressions (MWEs) that together form a single NE. Each *Sentence* annotated by Stanza has a list of [NE *Mentions*](https://stanfordnlp.github.io/CoreNLP/entitymentions.html), which are also given NER categories as their *Type*. 

In [None]:
print("{}\t{:30s}\t{}".format("Sentence", "Mention", "Type"))
for i, sent in enumerate(document.sentence):
    for m in sent.mentions:
        print("{:5}\t\t{:30s}\t{}".format(i+1, m.entityMentionText, m.entityType))


Obviously, this isn't perfect. While *Van Diemen* is recognised as a *PERSON* NE, *Van Diemen's Land* (i.e., the former name for Tasmania) isn't recognised as a *LOCATION*. This is because Stanza is trained to only recognise certain combinations of words and categories as a new MWE category. More rules can be added, but this workshop won't explore that aspect.

Each token is also annotated with an index number corresponding to any Mention it is part of. Each token can only be part of one Mention. While the Mentions may be annotated per Sentence, the index number counts across all Sentences and Mentions in the annotated text.

In [None]:
TotalNumMentions = 0 # keep a running total of how many NEs have been identified
print("{:15s}\t{:15s}\t{:6s}\t{:10s}\t{:10s}\t{}".format("word","lemma","POS","NER","Mention Index","Mention"))

for i, sent in enumerate(document.sentence):
    NumMentions = len(sent.mentions) # count how many NEs annotated to this sentence
    if NumMentions:
        print("\n[Sentence {}]".format(i+1))
        for t in sent.token:
            # only output tokens that are in NEs
            if t.ner != 'O': # Letter, not zero!
                # get the full text for the NE
                mentionStr=sent.mentions[t.entityMentionIndex -  TotalNumMentions].entityMentionText                
                print("{:15s}\t{:15s}\t{:6s}\t{:10s}\t{:10}\t{}".format(t.word, 
                                                                 t.lemma, 
                                                                 t.pos, 
                                                                 t.ner, 
                                                                 t.entityMentionIndex, 
                                                                 mentionStr))
    TotalNumMentions += NumMentions # add to the running total
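
The mention index can also be used to rebuild multi-word names directly from the tokens. The following is a minimal sketch using stand-in token objects. The words are taken from the example sentence, but the index values are made up, and joining words with a space is only an approximation of `entityMentionText`.

```python
from itertools import groupby
from types import SimpleNamespace

# Stand-in tokens; real ones come from sent.token, and these index values are made up
tokens = [
    SimpleNamespace(word="Cape", ner="LOCATION", entityMentionIndex=7),
    SimpleNamespace(word="Pillar", ner="LOCATION", entityMentionIndex=7),
    SimpleNamespace(word="and", ner="O", entityMentionIndex=0),
    SimpleNamespace(word="Tasman", ner="ORGANIZATION", entityMentionIndex=8),
]

# Keep only the NE tokens, then join neighbouring tokens that share a mention index
ne_tokens = [t for t in tokens if t.ner != "O"]
phrases = [
    " ".join(t.word for t in group)
    for _, group in groupby(ne_tokens, key=lambda t: t.entityMentionIndex)
]
print(phrases)
```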

Of course, you are mainly interested in the Named Entities that relate to locations. The location-based NER categories used by Stanza are:
* *LOCATION*
* *CITY*
* *COUNTRY*
* *STATE_OR_PROVINCE*

*NATIONALITY* might also be considered, depending on whether you care about phrases like *student of English history* or *Frenchman's cap*.

It is easy to filter out all other NEs.

In [None]:
# NATIONALITY is ignored 
LOCATION_CATEGORIES = ['LOCATION','CITY','COUNTRY','STATE_OR_PROVINCE']

# Iterate over all detected entity mentions
print("{:30s}\t{}".format("Mention", "Type"))

STANZA_Locations = [] # list of locations Stanza recognises
for sent in document.sentence:
    for m in sent.mentions:
        if (m.entityType in LOCATION_CATEGORIES): # ignore any nonLocation NEs
            print("{:30s}\t{}".format(m.entityMentionText, m.entityType))
            STANZA_Locations.append(m.entityMentionText)
            
print("\n{}".format(STANZA_Locations))

You can now put all of this together and find the placenames that are identified by Stanza in each chapter of the text. They can all be collected in a single dataframe.

In [None]:
placenames = pd.DataFrame(columns=['Book','Chapter',"NEIndex","Placename"])

In [None]:
# define which chapters and books you want to annotate
CHAPTERS=[1,2,3] 
BOOKS=[1,2]


In [None]:
for b in BOOKS:
    for c in CHAPTERS:
        filename = "FtToHNL_BOOK_"+str(b)+"_CHAPTER_"+str(c)+".txt" 
        # Build the full path to this chapter's file inside the text directory
        textlocation = os.path.normpath(os.path.join(text_directory, filename))
        text_filename = os.path.basename(textlocation)

        # read this chapter and close the file afterwards
        with open(textlocation, encoding="utf-8") as text_file:
            text = text_file.read()
        print("Working on |",filename)
        
        # run the Stanza server & client
        with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
                   memory='4G', 
                   endpoint='http://localhost:9001', 
                   be_quiet=True) as client:
            # Annotate the text (presuming it is not too large)
            document = client.annotate(text)
        
        # find the placenames according to Stanza
        for sent in document.sentence:
            for m in sent.mentions:
                if (m.entityType in LOCATION_CATEGORIES):
                    new_placename = {'Book':b,
                                    'Chapter':c,
                                    'NEIndex':m.entityMentionIndex,
                                    'Placename':m.entityMentionText}
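                    # DataFrame.append was removed in pandas 2.0, so concatenate a one-row dataframe instead
                    placenames = pd.concat([placenames, pd.DataFrame([new_placename])], ignore_index=True)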

Inspecting the dataframe shows that many more placenames were found in Book 2 than in Book 1. This makes sense since Book 1 is set on board a ship during an ocean voyage, whereas Book 2 is set at Macquarie Harbour in Australia.  
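
A comparison between books can be made by counting the rows of the dataframe per book. The sketch below uses a few made-up rows in place of the real `placenames` dataframe, so the counts are illustrative only. With real data, the same `groupby` calls could be applied to `placenames` directly.

```python
import pandas as pd

# Made-up rows standing in for the real `placenames` dataframe
sample_placenames = pd.DataFrame(
    [
        {"Book": 1, "Chapter": 1, "NEIndex": 4, "Placename": "England"},
        {"Book": 2, "Chapter": 1, "NEIndex": 0, "Placename": "Cape Pillar"},
        {"Book": 2, "Chapter": 1, "NEIndex": 3, "Placename": "Australia"},
        {"Book": 2, "Chapter": 2, "NEIndex": 1, "Placename": "Macquarie Harbour"},
    ]
)

# Number of placename mentions found in each book
per_book = sample_placenames.groupby("Book").size()
# Number of placename mentions found in each chapter of each book
per_chapter = sample_placenames.groupby(["Book", "Chapter"]).size()
print(per_book)
print(per_chapter)
```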

The final step is to save this new data to a CSV file. You have already defined the directory for your data files.

In [None]:
filename = "FtToHNL_placenames.csv"
save_location = os.path.normpath(os.path.join(csv_directory, filename))
save_filename = os.path.basename(save_location)
print("Saving placename data to ",save_location)

In [None]:
placenames.to_csv(save_location)
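
By default, `to_csv` also writes the dataframe's index as an unnamed first column. If that column is not wanted, `index=False` can be passed. The sketch below shows a round trip using a temporary directory and a single made-up row, so that it does not touch your real output folder. The file name `sample_placenames.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# A single made-up row standing in for the real `placenames` dataframe
sample = pd.DataFrame(
    {"Book": [2], "Chapter": [1], "NEIndex": [0], "Placename": ["Macquarie Harbour"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_placenames.csv")
    # index=False leaves out the row numbers, so only the four data columns are saved
    sample.to_csv(sample_path, index=False)
    # Read the file back in to check what was written
    reloaded = pd.read_csv(sample_path)

print(list(reloaded.columns))
```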