# Introduction to Named Entity Recognition with spaCy

This notebook helps you access the Named Entity Recognition (NER) tools in the spaCy Python package.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
 * [Look for NEs](#section-nes)
 * [Reviewing Candidate Placenames](#section-reviewplacenames)
* [Finding Locations for Placenames](#section-findinglocs)
 * [Identifying States and Capitals](#section-statescapitals)
 * [Searching a Gazetteer for Locations](#section-searchgazetteer)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the Geolocation project and tools.*

The Geolocation project relates to doctoral research done by [Fiannuala Morgan](https://finnoscarmorgan.github.io/) at the Australian National University. It uses software to identify placenames in archived historical texts, then compares them to data about known locations to identify where the placenames may be located. 

This notebook is designed to allow you to perform similar operations on textual documents.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
    <li>search for spatial data about specific locations or places in gazetteers of such data</li>
    <li>determine which locations are referred to by placenames, based on the context in which they are used in a text</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Most will come with your Python installation, but the following are crucial:
<ul>
    <li> pandas</li>
    <li> json</li>
    <li> nltk</li>
    <li> spacy</li>
    <li> geopandas</li>
    <li> shapely</li>
</ul>
</div>

In [28]:
# [TO DO] UPDATE
# Many of these are probably not needed.

import os
import nltk
import csv
import time
import urllib
import requests
import json
import math

#import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Geopandas is used to work with spatial data
# If you have issues installing it on macOS,
# see https://stackoverflow.com/questions/71137617/error-installing-geopandas-in-python-on-mac-m1
#import geopandas as gpd
#from geopandas import GeoDataFrame

# NLTK is used to work with textual data 
#from nltk.tag import StanfordNERTagger
#from nltk.tokenize import word_tokenize

# spaCy is used for a pipeline of NLP functions
import spacy
from spacy.tokens import Span
from spacy import displacy

# Shapely is used to work with geometric shapes
#from shapely.geometry import Point

# Fuzzywuzzy is used for fuzzy searches
#from fuzzywuzzy import fuzz

# used for the checklist
import ipywidgets as widgets

In [27]:
# imports for the OSM
# [TO DO] Work out what is still required
import requests
from IPython.display import JSON
import json
from pprint import pprint
from ratelimit import limits, RateLimitException, sleep_and_retry

In [4]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 115)

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

You will also need some directories for the input and output of data.

In [5]:
## Declare the data directories
## This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
csv_directory = os.path.normpath("../ner_output/")
reference_directory = os.path.normpath("../Data")

## Create the data directories (nothing happens if they already exist)
for directory in (text_directory, csv_directory, reference_directory):
    os.makedirs(directory, exist_ok=True)


For this workshop, we will be examining the text of *For the Term of His Natural Life*, an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained from [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt). It is an unformatted text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing any accented characters with their unaccented forms and the British pound sterling symbol with the word *pounds*.
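
The simplification described above could be done in several ways. The following is a minimal sketch of one of them, not necessarily the method used to prepare our copy. The function name `simplify_to_ascii` and the choice to follow the word *pounds* with a space are assumptions made for illustration.

```python
import unicodedata

def simplify_to_ascii(raw_text):
    """Return raw_text reduced to plain ASCII characters (illustrative helper)."""
    # Assumption: the pound sign becomes the word "pounds" plus a space
    with_words = raw_text.replace("\u00a3", "pounds ")
    # Decompose accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", with_words)
    # Drop everything that is not ASCII, which removes the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(simplify_to_ascii("A fianc\u00e9e with \u00a3500"))
```

Note that any character with no ASCII equivalent is silently dropped by this approach, so it is worth checking the result on a sample of your own texts.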

The novel is divided into four books, each set in a different region of the world. You can start with part of the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*. 


In [6]:
filename="FtToHNL_BOOK_2_CHAPTER_3.txt"
print("Working on | ", filename)

# set the specific path for the 'filename' 
text_location = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(text_location)

# read the whole file into one string, closing the file afterwards
with open(text_location, encoding="utf-8") as text_file:
    text = text_file.read()

Working on |  FtToHNL_BOOK_2_CHAPTER_3.txt


This is no more than a long string of characters. So far, you have done no processing. 

In [7]:
text[0:500] # look at the first 500 characters

'CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\nIn the house of Major Vickers, Commandant of Macquarie Harbour,\nthere was, on this evening of December 3rd, unusual gaiety.\n\nLieutenant Maurice Frere, late in command at Maria Island, had unexpectedly\ncome down with news from head-quarters.  The Ladybird, Government schooner,\nvisited the settlement on ordinary occasions twice a year, and such visits\nwere looked forward to with no little eagerness by the settlers.\nTo the convicts the arrival of the Ladybird mean'

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Look for NEs <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like the names of places, people or organisations.

Looking at the above text from Book 2 of *For the Term of His Natural Life*, you can see the names of places, characters and a ship. While you could manually extract them from the text, Natural Language Processing (NLP) technology allows this process to be semi-automated through software.

There are various packages that provide Named Entity Recognition (NER), e.g., the [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb), the Stanford NER, and the spaCy library. They often combine machine learning and a rule-based system to identify NEs and classify them into categories.

For this notebook, you will be using the spaCy NER - https://spacy.io/usage/linguistic-features#morphology .  This is available as a Python library.

SpaCy allows you to load a [language model](https://spacy.io/models/en#en_core_web_sm) that has been trained on various examples of the language of interest. 

__[TO DO] Explain a language model in one sentence.__

In [8]:
nlp = spacy.load("en_core_web_sm")

spaCy automatically passes any text you give it through a series of natural language processing steps, known as a processing pipeline. The pipeline starts by dividing the text into individual tokens or terms, like words, values and punctuation.
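
To see tokenisation on its own, the sketch below uses `spacy.blank("en")`, which creates an English pipeline containing only the tokenizer, so no trained model is needed. The sample sentence is adapted from the chapter, and the variable names are illustrative.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Major Vickers, Commandant of Macquarie Harbour.")

# Each token is a word or a punctuation mark
sample_tokens = [token.text for token in sample_doc]
print(sample_tokens)
```

Punctuation marks such as the comma and the full stop are split off as tokens of their own.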

<div class="alert alert-block alert-info">
The options for the spaCy pipeline include:
    <p>&nbsp;</p>
<table>
    <tr><th>NAME</th><th>COMPONENT</th><th>DESCRIPTION</th></tr>
    <tr><td><strong>tokenizer</strong></td><td>Tokenizer</td><td>Segment text into tokens.</td></tr>
    <tr><td><strong>tagger</strong></td><td>Tagger</td><td>Assign part-of-speech tags.</td></tr>
    <tr><td><strong>parser</strong></td><td>DependencyParser</td><td>Assign dependency labels.</td></tr>
    <tr><td><strong>ner</strong></td><td>EntityRecognizer</td><td>Detect and label named entities.</td></tr>
    <tr><td><strong>lemmatizer</strong></td><td>Lemmatizer</td><td>Assign base forms.</td></tr>
    <tr><td><strong>textcat</strong></td><td>TextCategorizer</td><td>Assign document labels.</td></tr>
    <tr><td><strong>custom</strong></td><td>custom components</td><td>Assign custom attributes, methods or properties.</td></tr>
</table>
    
   A full explanation can be found at <a href="https://spacy.io/usage/processing-pipelines">https://spacy.io/usage/processing-pipelines</a>
    </div>

![spaCy Language Processing Pipeline](./spaCy_pipeline.png)

For example, the pipeline of the model loaded above contains the following components.

In [9]:
print("Pipeline:", nlp.pipe_names)

Pipeline: ['tagger', 'parser', 'ner']


Text sent to the spaCy model will be processed by the pipeline.

In [10]:
doc = nlp(text[0:500])
doc

CHAPTER III.

A SOCIAL EVENING.



In the house of Major Vickers, Commandant of Macquarie Harbour,
there was, on this evening of December 3rd, unusual gaiety.

Lieutenant Maurice Frere, late in command at Maria Island, had unexpectedly
come down with news from head-quarters.  The Ladybird, Government schooner,
visited the settlement on ordinary occasions twice a year, and such visits
were looked forward to with no little eagerness by the settlers.
To the convicts the arrival of the Ladybird mean

This may not look very different, but the results of the pipeline are now stored in the object that spaCy returned. Each word or punctuation mark is regarded as a token, with its own annotations.

<div class="alert alert-block alert-info">
From <a href="https://spacy.io/usage/linguistic-features">https://spacy.io/usage/linguistic-features</a>:
        <ul><li> <strong>Text:</strong> The original token text.</li>
<li> <strong>Dep:</strong> The syntactic relation connecting child to head.</li>
<li> <strong>Head text:</strong> The original text of the token head.</li>
<li> <strong>Head POS:</strong> The part-of-speech tag of the token head.</li>
<li> <strong>Children:</strong> The immediate syntactic dependents of the token.</li>
    </ul>
    </div>

In [11]:
for token in doc[9:59]:
    print(token.text," - ", 
          "\n   Dep: ",token.dep_,       
          "\n   Head: ",token.head.text, 
          "\n   Pos: ",token.head.pos_,  
          "\n   Child: ",[child for child in token.children]) 


In  -  
   Dep:  prep 
   Head:  was 
   Pos:  AUX 
   Child:  [house]
the  -  
   Dep:  det 
   Head:  house 
   Pos:  NOUN 
   Child:  []
house  -  
   Dep:  pobj 
   Head:  In 
   Pos:  ADP 
   Child:  [the, of]
of  -  
   Dep:  prep 
   Head:  house 
   Pos:  NOUN 
   Child:  [Vickers]
Major  -  
   Dep:  compound 
   Head:  Vickers 
   Pos:  PROPN 
   Child:  []
Vickers  -  
   Dep:  pobj 
   Head:  of 
   Pos:  ADP 
   Child:  [Major, ,, Commandant]
,  -  
   Dep:  punct 
   Head:  Vickers 
   Pos:  PROPN 
   Child:  []
Commandant  -  
   Dep:  appos 
   Head:  Vickers 
   Pos:  PROPN 
   Child:  [of]
of  -  
   Dep:  prep 
   Head:  Commandant 
   Pos:  PROPN 
   Child:  [Harbour]
Macquarie  -  
   Dep:  compound 
   Head:  Harbour 
   Pos:  PROPN 
   Child:  []
Harbour  -  
   Dep:  pobj 
   Head:  of 
   Pos:  ADP 
   Child:  [Macquarie]
,  -  
   Dep:  punct 
   Head:  was 
   Pos:  AUX 
   Child:  [
]

  -  
   Dep:   
   Head:  , 
   Pos:  PUNCT 
   Child:  []
there  -  
  

However, you might not want all of this processing, as some of it may not benefit your analysis. Unneeded processing also slows the system down and places a greater demand on memory. This is particularly true of the parser. Luckily, it is easy to specify which components to exclude from the spaCy pipeline. 

In [12]:
doc=nlp(text[0:500], disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [13]:
for token in doc[9:59]:
    print(token.text," - ", 
          "\n   Dep: ",token.dep_,        
          "\n   Head: ",token.head.text,  
          "\n   Pos: ",token.head.pos_,   
          "\n   Child: ",[child for child in token.children])  

In  -  
   Dep:   
   Head:  In 
   Pos:   
   Child:  []
the  -  
   Dep:   
   Head:  the 
   Pos:   
   Child:  []
house  -  
   Dep:   
   Head:  house 
   Pos:   
   Child:  []
of  -  
   Dep:   
   Head:  of 
   Pos:   
   Child:  []
Major  -  
   Dep:   
   Head:  Major 
   Pos:   
   Child:  []
Vickers  -  
   Dep:   
   Head:  Vickers 
   Pos:   
   Child:  []
,  -  
   Dep:   
   Head:  , 
   Pos:   
   Child:  []
Commandant  -  
   Dep:   
   Head:  Commandant 
   Pos:   
   Child:  []
of  -  
   Dep:   
   Head:  of 
   Pos:   
   Child:  []
Macquarie  -  
   Dep:   
   Head:  Macquarie 
   Pos:   
   Child:  []
Harbour  -  
   Dep:   
   Head:  Harbour 
   Pos:   
   Child:  []
,  -  
   Dep:   
   Head:  , 
   Pos:   
   Child:  []

  -  
   Dep:   
   Head:  
 
   Pos:  SPACE 
   Child:  []
there  -  
   Dep:   
   Head:  there 
   Pos:   
   Child:  []
was  -  
   Dep:   
   Head:  was 
   Pos:   
   Child:  []
,  -  
   Dep:   
   Head:  , 
   Pos:   
   Child:  []
on 

Of course, what you are really interested in is the NER. Any text sent through a pipeline that includes the NER component will have a list of the entities that were found. 

In [14]:
for entity in doc.ents:
    print(entity.text, "[",entity.label_,"]")

Vickers [ PERSON ]
this evening [ TIME ]
December 3rd [ DATE ]
Maurice Frere [ PERSON ]
Maria Island [ LOC ]
Ladybird [ ORG ]


As you can see, each entity is labelled with a category. The categories are defined by the model as it is trained to recognise them. 

__[TO DO] UPDATE__

<div class="alert alert-block alert-info">
    The NER categories classified by this spaCy model include:
   <ul>
       <li><strong>CARDINAL:</strong> Numerals that do not fall under another type</li>
<li><strong>DATE:</strong> Absolute or relative dates or periods</li>
<li><strong>EVENT:</strong> Named hurricanes, battles, wars, sports events, etc.</li>
<li><strong>FAC:</strong> Buildings, airports, highways, bridges, etc.</li>
<li><strong>GPE:</strong> Countries, cities, states</li>
<li><strong>LANGUAGE:</strong> Any named language</li>
<li><strong>LAW:</strong> Named documents made into laws.</li>
<li><strong>LOC:</strong> Non-GPE locations, mountain ranges, bodies of water</li>
<li><strong>MONEY:</strong> Monetary values, including unit</li>
<li><strong>NORP:</strong> Nationalities or religious or political groups</li>
<li><strong>ORDINAL:</strong> "first", "second", etc.</li>
<li><strong>ORG:</strong> Companies, agencies, institutions, etc.</li>
<li><strong>PERCENT:</strong> Percentage, including "%"</li>
<li><strong>PERSON:</strong> People, including fictional</li>
<li><strong>PRODUCT:</strong> Objects, vehicles, foods, etc. (not services)</li>
<li><strong>QUANTITY:</strong> Measurements, as of weight or distance</li>
<li><strong>TIME:</strong> Times smaller than a day</li>
<li><strong>WORK_OF_ART:</strong> Titles of books, songs, etc.</li>
</ul>
</div>
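
If you meet a category label you do not recognise, spaCy's `spacy.explain` function returns a short description of it. A small sketch, using labels taken from the output above (the variable name `label_descriptions` is illustrative):

```python
import spacy

# Look up the built-in description for a few of the labels seen so far
label_descriptions = {label: spacy.explain(label) for label in ["PERSON", "LOC", "ORG"]}
for label, description in label_descriptions.items():
    print(label, "-", description)
```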

Each NE is given exactly one category, and that category is not always correct. In the output above, for instance, the schooner *Ladybird* is labelled as an ORG, and *Macquarie Harbour* is not recognised at all. Some NEs are multi-word expressions (MWEs), such as *Maurice Frere* and *Maria Island*. An MWE subsumes any smaller NEs it contains, so *Maria Island* is only a LOC, rather than a PERSON followed by a LOC. If you need more precise categories, you will need a different model or to post-process the results.


Some judgement is needed to work out which NEs correspond to placenames. Understanding the context of word usage is important. Luckily, the data for the entities includes the character position for the start and the end of the NE.

In [61]:
for entity in doc.ents:
    print(entity.text, "("+str(entity.start_char)+","+str(entity.end_char)+") [",entity.label_,"]")

Vickers (57,64) [ PERSON ]
this evening (113,125) [ TIME ]
December 3rd (129,141) [ DATE ]
Maurice Frere (171,184) [ PERSON ]
Maria Island (205,217) [ LOC ]
Ladybird (487,495) [ ORG ]
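
The character positions make it easy to look at the words around an entity. The helper below is a minimal sketch of that idea; `context_window` and its `width` parameter are hypothetical names, not part of spaCy.

```python
def context_window(full_text, start_char, end_char, width=20):
    """Return the entity plus up to `width` characters either side (illustrative helper)."""
    left = max(0, start_char - width)
    right = min(len(full_text), end_char + width)
    # Newlines are replaced with spaces so the context prints on one line
    return full_text[left:right].replace("\n", " ")

sample = "Lieutenant Maurice Frere, late in command at Maria Island, had unexpectedly"
start = sample.find("Maria Island")
end = start + len("Maria Island")
print(context_window(sample, start, end))
```

In the notebook, you could call it as `context_window(text, entity.start_char, entity.end_char)` for each entity in `doc.ents`.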


Each token will also have a value that indicates whether it is part of an NE.

In [62]:
for token in doc[9:59]:
    print(token.text, "[", token.ent_type_, "]")

In [  ]
the [  ]
house [  ]
of [  ]
Major [  ]
Vickers [ PERSON ]
, [  ]
Commandant [  ]
of [  ]
Macquarie [  ]
Harbour [  ]
, [  ]

 [  ]
there [  ]
was [  ]
, [  ]
on [  ]
this [ TIME ]
evening [ TIME ]
of [  ]
December [ DATE ]
3rd [ DATE ]
, [  ]
unusual [  ]
gaiety [  ]
. [  ]


 [  ]
Lieutenant [  ]
Maurice [ PERSON ]
Frere [ PERSON ]
, [  ]
late [  ]
in [  ]
command [  ]
at [  ]
Maria [ LOC ]
Island [ LOC ]
, [  ]
had [  ]
unexpectedly [  ]

 [  ]
come [  ]
down [  ]
with [  ]
news [  ]
from [  ]
head [  ]
- [  ]
quarters [  ]
. [  ]


The token-level annotations show that *Maria* and *Island* are each part of a LOC, but not, on their own, that *Maria Island* is the complete name of the location. spaCy handles multi-word expressions (MWEs) at the entity level: each item in `doc.ents` is a span that covers every token in the NE, which is why *Maria Island* appears as a single entity in the earlier output.

__[TO DO] Remove any comments and code about hand-coding corrections for exceptions?__

It is also possible to hand-code entities after the NER has been done. This can help make up for any common irregularities with the NER for your input documents. 

For instance, this model doesn't recognise that FB is an NE.

The solution is to create a new entry for the list of entities.

Even the data for the tokens is updated. 
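
The hand-coding steps described above could look like the following sketch, which follows the pattern shown in the spaCy documentation. It uses a blank English pipeline so that it runs without a trained model; with a trained model, `doc.ents` would already contain whatever entities the model found. The sentence and the variable names are illustrative.

```python
import spacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")  # tokenizer only, so no entities are found
fb_doc = blank_nlp("FB is hiring a new vice president of global policy")
print("Before:", [(ent.text, ent.label_) for ent in fb_doc.ents])

# Create a span covering the first token (FB) and label it as an organisation
fb_entity = Span(fb_doc, 0, 1, label="ORG")
# Add the new span to any entities that were already present
fb_doc.ents = list(fb_doc.ents) + [fb_entity]

print("After:", [(ent.text, ent.label_) for ent in fb_doc.ents])
# The token-level annotation is updated as well
print(fb_doc[0].text, "[", fb_doc[0].ent_type_, "]")
```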

__[TO DO] Decide whether to keep this part about batch processing. This may be too spaCy specific.__

SpaCy also allows the input documents to be processed in batches. This helps better manage the processing demands of the system throughout the pipeline when there are multiple files or many sentences.

The batch method, `nlp.pipe`, returns a generator: you can loop over its output straight away, but to print or reuse the results you need to convert it into a list.
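
A minimal sketch of batch processing follows. It uses a blank English pipeline so that it runs without a trained model, but a trained pipeline is called in the same way. The sentences are adapted from the chapter and the `batch_size` value is an arbitrary choice.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only; a trained model is used the same way
sentences = [
    "Lieutenant Maurice Frere had come down with news.",
    "The Ladybird visited the settlement twice a year.",
]

# nlp.pipe returns a generator, so wrap it in list() to keep and print the results
batch_docs = list(blank_nlp.pipe(sentences, batch_size=50))
for batch_doc in batch_docs:
    print(len(batch_doc), "tokens:", batch_doc.text)
```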

__[TO DO] This is about spaCy indicating the scope of a NE.
Probably best to omit as it is too technical and about the spaCy rather than the problem.__

<div class="alert alert-block alert-info">
    From <a href="https://spacy.io/usage/linguistic-features">https://spacy.io/usage/linguistic-features</a>:

<ul>
    <li>IOB SCHEME
        <ul>
<li>I – Token is inside an entity.</li>
<li>O – Token is outside an entity.</li>
<li>B – Token is the beginning of an entity.</li>
        </ul></li>
<li>BILUO SCHEME
    <ul>
<li>B – Token is the beginning of a multi-token entity.</li>
<li>I – Token is inside a multi-token entity.</li>
<li>L – Token is the last token of a multi-token entity.</li>
<li>U – Token is a single-token unit entity.</li>
<li>O – Token is outside an entity.</li>
    </ul></li>
    </ul>
    </div>
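
To make the two schemes concrete, the sketch below tags a short list of tokens by hand. It is plain Python written for illustration, not spaCy's own implementation; `tag_tokens` is a hypothetical helper, and the entity spans are given as token positions with an exclusive end.

```python
def tag_tokens(tokens, entity_spans, scheme="IOB"):
    """Tag each token given (start, end) token spans, end exclusive (illustrative helper)."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        length = end - start
        for position in range(start, end):
            if scheme == "BILUO" and length == 1:
                tags[position] = "U"      # single-token entity
            elif position == start:
                tags[position] = "B"      # beginning of the entity
            elif scheme == "BILUO" and position == end - 1:
                tags[position] = "L"      # last token of a multi-token entity
            else:
                tags[position] = "I"      # inside the entity
    return tags

tokens = ["Vickers", "was", "at", "Maria", "Island", "."]
spans = [(0, 1), (3, 5)]  # Vickers; Maria Island
print(tag_tokens(tokens, spans, scheme="IOB"))
print(tag_tokens(tokens, spans, scheme="BILUO"))
```

In spaCy itself, each token exposes its IOB code through the `token.ent_iob_` attribute.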

__[TO DO] talk about extracting just the NER types we want.__

__[TO DO] Run through a single chapter (variable: text) before doing the entire collection?
    This will allow all NEs to be shown, then the filter be introduced.
    This will put it more on topic.__

In [17]:
text[0:500]

'CHAPTER III.\n\nA SOCIAL EVENING.\n\n\n\nIn the house of Major Vickers, Commandant of Macquarie Harbour,\nthere was, on this evening of December 3rd, unusual gaiety.\n\nLieutenant Maurice Frere, late in command at Maria Island, had unexpectedly\ncome down with news from head-quarters.  The Ladybird, Government schooner,\nvisited the settlement on ordinary occasions twice a year, and such visits\nwere looked forward to with no little eagerness by the settlers.\nTo the convicts the arrival of the Ladybird mean'

In [18]:
doc = nlp(text)

# document level
entities = [(entity.text, entity.start_char, entity.end_char, entity.label_) for entity in doc.ents]
    
# entity level: print a numbered list of the entities and their categories
for i, entity in enumerate(doc.ents, start=1):
    print("{:5}\t\t{:30s}\t{}".format(i, entity.text, entity.label_))

    1		Vickers                       	PERSON
    2		this evening                  	TIME
    3		December 3rd                  	DATE
    4		Maurice Frere                 	PERSON
    5		Maria Island                  	LOC
    6		Ladybird                      	ORG
    7		Ladybird                      	PERSON
    8		Ladybird                      	ORG
    9		Tom                           	PERSON
   10		Dick                          	PERSON
   11		Harry                         	PERSON
   12		bush                          	PERSON
   13		Jack                          	PERSON
   14		Town Gaol                     	PERSON
   15		Ladybird                      	ORG
   16		one                           	CARDINAL
   17		Commandant                    	ORG
   18		Vickers                       	PERSON
   19		Arthur                        	PERSON
   20		Hobart Town                   	ORG
   21		Arthur                        	PERSON
   22		Tasman                        	PERSON
   23		Peninsula              

__[TO DO] Expand on this explanation with example context.
Talk about the issue with _Van Diemen's_ versus _Van Diemen's Land_ (and _Tasman's Head_)__

These categories are assigned according to the context in which the NE is used. For this reason, _Van Diemen's_ is considered an *ORG*, a *PERSON* and a *FAC*, depending on its linguistic context. Note also that _VAN DIEMEN'S LAND_ in the title of the chapter isn't recognised as an NE, probably due to its unconventional capitalisation.
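
One way to spot names that receive different categories in different contexts is to collect the set of labels seen for each name. The sketch below does this in plain Python for a few (text, label) pairs copied from the output above; in the notebook the pairs would come from `doc.ents`. The variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
found_entities = [
    ("Ladybird", "ORG"),
    ("Ladybird", "PERSON"),
    ("Ladybird", "ORG"),
    ("Vickers", "PERSON"),
    ("Vickers", "PERSON"),
    ("Hobart Town", "ORG"),
]

labels_by_name = defaultdict(set)
for name, label in found_entities:
    labels_by_name[name].add(label)

# Names with more than one label were categorised differently in different contexts
for name, labels in labels_by_name.items():
    if len(labels) > 1:
        print(name, sorted(labels))
```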

__[TO DO] Expand on this explanation.__

Of course, not all of these NEs are placenames, so you will need to make a list of the categories that regularly contain placenames.

In [19]:
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]
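
With the categories chosen, the entity list can be filtered down to candidate placenames. The following is a minimal sketch that assumes tuples in the same (text, start_char, end_char, label) layout as the `entities` list built earlier. The sample values are copied from the output for the first 500 characters, and `candidate_placenames` is an illustrative name.

```python
PLACENAME_CATEGORIES = ["LOC", "GPE", "FAC", "ORG"]

# (text, start_char, end_char, label) tuples copied from the earlier output
sample_entities = [
    ("Vickers", 57, 64, "PERSON"),
    ("December 3rd", 129, 141, "DATE"),
    ("Maria Island", 205, 217, "LOC"),
    ("Ladybird", 487, 495, "ORG"),
]

# Keep only the entities whose label is one of the placename categories
candidate_placenames = [
    entity for entity in sample_entities if entity[3] in PLACENAME_CATEGORIES
]
print(candidate_placenames)
```

The filter keeps *Ladybird*, which is a ship rather than a place, so the candidates still need to be reviewed.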