# Introduction to Named Entity Recognition with spaCy

This notebook helps you access the Named Entity Recognition (NER) tools in the spaCy Python package.

### Contents
* [Premise](#section-premise)
* [Requirements](#section-requirements) 
* [Data Preparation](#section-datapreparation)
* [Named Entity Recognition](#section-ner)
  * [Looking for Named Entities](#section-nes)
  * [Categorising Named Entities](#section-categories) 
  * [Named Entities as Multi-Word Expressions](#section-mwes)
  * [Improving the spaCy processing](#section-improvingspacy)

## Premise <a class="anchor" id="section-premise"></a>
*This section explains the purpose of this notebook.*

This notebook is designed to introduce you to the spaCy Python package and show you how to use it to recognise certain proper noun phrases, or "Named Entities", in electronic text.

<div class="alert alert-block alert-success">
It will teach you how to
<ul>
    <li>use the spaCy library to identify and classify Named Entities (NEs)</li>
    <li>identify multi-word expressions (MWE) that are NEs</li>
</ul>
</div>

## Requirements <a class="anchor" id="section-requirements"></a>

<div class="alert alert-block alert-info">
This notebook uses various Python libraries. Most will come with your Python installation, but the following are crucial:
<ul>
    <li> pandas</li> 
    <li> numpy</li> 
    <li> spacy</li> 
</ul>
</div>

The following will load the required libraries into the notebook. If your computer is unable to do so, you may have to add an install line for the required package or seek technical advice.

For example:
<pre><code>import sys
!{sys.executable} -m pip install spacy
</code></pre>

In [None]:
import os
import pandas as pd

# spaCy is used for a pipeline of NLP functions
import spacy
from spacy.tokens import Span
from spacy import displacy

In [None]:
# Make sure you can see as much of the output as possible within the Jupyter Notebook screen
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 115)

## Data Preparation <a class="anchor" id="section-datapreparation"></a>

For this notebook, we will use the text of *For the Term of His Natural Life* (*FtToHNL*), an 1874 novel by Marcus Clarke that is in the public domain. Our copy was obtained via [Project Gutenberg Australia](https://gutenberg.net.au/ebooks/e00016.txt) and is an unformatted text file. We have simplified it slightly by reducing it to standard ASCII characters only, replacing any accented characters with their unaccented forms and the British Pound Sterling symbol with the word pounds.
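
The simplification described above was done before this notebook was written, and the exact steps are not recorded here. As a rough sketch only, it could be reproduced with the standard library along the following lines. The function name `simplify_to_ascii`, the sample sentence and the choice to write "pounds " with a trailing space are all assumptions made for illustration.

```python
import unicodedata


def simplify_to_ascii(raw_text: str) -> str:
    """Return an ASCII-only approximation of raw_text (hypothetical helper)."""
    # Spell out the pound sign first, otherwise it would simply be dropped below
    spelled_out = raw_text.replace("£", "pounds ")
    # Decompose accented characters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", spelled_out)
    # Discard everything outside ASCII, which removes the combining marks
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Invented sample sentence, not taken from the novel
sample = "The café owner paid £5 for the fiancée's passage."
simplified = simplify_to_ascii(sample)
print(simplified)
```

Note that this approach silently drops any character that has no ASCII decomposition, so it is worth comparing the result against the original before relying on it.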

The novel is divided into four books, each set in a different region of the world. You will start with Chapter 3 of the second book, which is titled *BOOK II.\-\-MACQUARIE HARBOUR.  1833*.

The first step is to read the text from the file.

In [None]:
# This presumes that Notebooks/ is the current working directory  
text_directory = os.path.normpath("../Texts/")
filename = "FtToHNL_BOOK_2_CHAPTER_3.txt"
print("Reading: ", filename)

# Set the specific path for the 'filename' 
text_location = os.path.normpath(os.path.join(text_directory, filename))
text_filename = os.path.basename(text_location)

# Read the text from the file, closing the file as soon as it has been read
with open(text_location, encoding="utf-8") as text_file:
    text = text_file.read()

This is no more than a long string of characters. So far, you have done no processing. 

In [None]:
# Look at the first 500 characters
text[0:500] 

## Named Entity Recognition <a class="anchor" id="section-ner"></a>
*This section provides tools for identifying named entities in textual data.*

### Looking for Named Entities <a class="anchor" id="section-nes"></a>

Named Entities (NEs) are proper noun phrases within text, like the names of places, people or organisations.

Looking at the above text from Book 2 of FtToHNL, you can see the names of places, characters, and a ship. While you could manually extract them from the text, Natural Language Processing (NLP) technology allows this process to be semi-automated through software.

There are various packages that include Named Entity Recognition (NER) tools, e.g., [Stanza CoreNLP](https://colab.research.google.com/github/stanfordnlp/stanza/blob/main/demo/Stanza_CoreNLP_Interface.ipynb), Stanford NER, and the spaCy library. They often combine machine learning and a rule-based system to identify NEs and classify them into categories.

For this notebook, you will be using the [spaCy NER](https://spacy.io/usage/linguistic-features#named-entities).  This is available as part of the spaCy Python library.

spaCy allows you to load a [language model](https://spacy.io/models/en#en_core_web_sm) that has been trained on various examples of the language of interest. This training allows it to use statistical methods to detect patterns in similar texts that can be associated with linguistic behaviours.

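In this context, a language model is a package of trained statistical components that takes raw text and predicts linguistic annotations for it, such as part-of-speech tags, syntactic dependencies and named entities.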

In [None]:
# Load a spaCy English language model
nlp = spacy.load("en_core_web_sm")

spaCy uses the model at several levels of natural language processing, which together are called a processing pipeline. The pipeline begins by dividing the text into individual tokens or terms, like words, values and punctuation.
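
To see the tokenising step on its own, the sketch below uses `spacy.blank("en")`, which creates an English pipeline containing only the rule-based tokenizer and therefore needs no trained model. The names `demo_nlp` and `demo_doc` and the sample sentence are invented for illustration, and are chosen so that they do not overwrite the `nlp` and `doc` variables used elsewhere in this notebook.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components
demo_nlp = spacy.blank("en")
print("Pipeline:", demo_nlp.pipe_names)

# Invented sample sentence, not taken from the novel
demo_doc = demo_nlp("Major Vickers paid 5 pounds.")

# Each token is a word, a value or a punctuation mark
demo_tokens = [token.text for token in demo_doc]
print(demo_tokens)
```

The full stop is split off as its own token, which is why the token count is higher than the number of whitespace-separated words.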

<div class="alert alert-block alert-info">
The options for the spaCy pipeline include:
    <p>&nbsp;</p>
<table>
    <tr><th>NAME</th><th>COMPONENT</th><th>DESCRIPTION</th></tr>
    <tr><td><strong>tokenizer</strong></td><td>Tokenizer</td><td>Segment text into tokens.</td></tr>
    <tr><td><strong>tagger</strong></td><td>Tagger</td><td>Assign part-of-speech tags.</td></tr>
    <tr><td><strong>parser</strong></td><td>DependencyParser</td><td>Assign dependency labels.</td></tr>
    <tr><td><strong>ner</strong></td><td>EntityRecognizer</td><td>Detect and label named entities.</td></tr>
    <tr><td><strong>lemmatizer</strong></td><td>Lemmatizer</td><td>Assign base forms.</td></tr>
    <tr><td><strong>textcat</strong></td><td>TextCategorizer</td><td>Assign document labels.</td></tr>
    <tr><td><strong>custom</strong></td><td>custom components</td><td>Assign custom attributes, methods or properties.</td></tr>
</table>
    
   A full explanation can be found at <a href="https://spacy.io/usage/processing-pipelines">https://spacy.io/usage/processing-pipelines</a>
    </div>

![spaCy Language Processing Pipeline](./spaCy_pipeline.png)

For example, by default the spaCy pipeline contains the following.

In [None]:
# Output the processes in the current pipeline
print("Pipeline:", nlp.pipe_names)

If you send the FtToHNL text to the spaCy model, it will be processed by the pipeline.

In [None]:
# Process the first 500 characters of the text using the model and pipeline
doc = nlp(text[0:500])
doc

This may not look very different from what you read from the file, but the output from the pipeline is now available in the `Doc` object returned by the spaCy model. Each word is regarded as a token, and each token has various linguistic features based on the lexical and grammatical aspects that were identified by the pipeline, especially the parser.

<div class="alert alert-block alert-info">
From <a href="https://spacy.io/usage/linguistic-features">https://spacy.io/usage/linguistic-features</a>:
        <ul><li> <strong>Text:</strong> The original token text.</li>
<li> <strong>Dep:</strong> The syntactic relation connecting child to head.</li>
<li> <strong>Head text:</strong> The original text of the token head.</li>
<li> <strong>Head POS:</strong> The part-of-speech tag of the token head.</li>
<li> <strong>Children:</strong> The immediate syntactic dependents of the token.</li>
    </ul>
    </div>

In [None]:
# For each token in the first two sentences,
for token in doc[9:59]:
    # print the linguistic features of the token identified by the pipeline
    print(token.text, " - ",
          "\n   Dep: ", token.dep_,
          "\n   Head: ", token.head.text,
          "\n   Head POS: ", token.head.pos_,
          "\n   Children: ", list(token.children))

However, you might not want some of this pipeline processing as it may not be beneficial to your analysis. Any unneeded processing will also slow the system down and place a greater demand on the memory. This is particularly true of the parser. Luckily, it is easy to stipulate what you want excluded from the spaCy pipeline. 

In [None]:
# Process the first 500 characters of the text with a shortened pipeline
doc = nlp(text[0:500], disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])

In [None]:
# For each token in the first two sentences, 
for token in doc[9:59]:
    # print the linguistic features of the token identified by the shorter pipeline
    print(token.text, " - ",
          "\n   Dep: ", token.dep_,
          "\n   Head: ", token.head.text,
          "\n   Head POS: ", token.head.pos_,
          "\n   Children: ", list(token.children))

Of course, for this notebook what you are interested in is [spaCy's NER](https://spacy.io/usage/linguistic-features#named-entities). Any text sent through a pipeline that includes the NER comes back with a list of the entities that have been found, available as `doc.ents`.

In [None]:
# For each Named Entity recognised by the NER,
for entity in doc.ents:
    # output the relevant tokens and NE category 
    print(entity.text+" ["+entity.label_+"]")

### Categorising Named Entities <a class="anchor" id="section-categories"></a>

As you can see, each entity is labelled with a category. The categories are defined by the model as it is trained to recognise them. 
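
If a label's abbreviation is not obvious, spaCy's built-in glossary can be queried with `spacy.explain`, which returns a short description for a label it knows. The sketch below looks up a handful of labels; the particular labels chosen are just examples, and no trained model is needed for the lookup.

```python
import spacy

# Look up the glossary description for a few NE category labels
example_labels = ["GPE", "LOC", "FAC", "NORP"]
for label in example_labels:
    print(f"{label}: {spacy.explain(label)}")

# The description is returned as a plain string
gpe_description = spacy.explain("GPE")
```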

<div class="alert alert-block alert-info">
    The NER categories classified by this spaCy model include:
   <ul>
       <li><strong>CARDINAL:</strong> Numerals that do not fall under another type</li>
<li><strong>DATE:</strong> Absolute or relative dates or periods</li>
<li><strong>EVENT:</strong> Named hurricanes, battles, wars, sports events, etc.</li>
<li><strong>FAC:</strong> Buildings, airports, highways, bridges, etc.</li>
<li><strong>GPE:</strong> Countries, cities, states</li>
<li><strong>LANGUAGE:</strong> Any named language</li>
<li><strong>LAW:</strong> Named documents made into laws.</li>
<li><strong>LOC:</strong> Non-GPE locations, mountain ranges, bodies of water</li>
<li><strong>MONEY:</strong> Monetary values, including unit</li>
<li><strong>NORP:</strong> Nationalities or religious or political groups</li>
<li><strong>ORDINAL:</strong> "first", "second", etc.</li>
<li><strong>ORG:</strong> Companies, agencies, institutions, etc.</li>
<li><strong>PERCENT:</strong> Percentage, including "%"</li>
<li><strong>PERSON:</strong> People, including fictional</li>
<li><strong>PRODUCT:</strong> Objects, vehicles, foods, etc. (not services)</li>
<li><strong>QUANTITY:</strong> Measurements, as of weight or distance</li>
<li><strong>TIME:</strong> Times smaller than a day</li>
<li><strong>WORK_OF_ART:</strong> Titles of books, songs, etc.</li>
</ul>
</div>

These categories can be very helpful when you are trying to identify certain types of NEs. However, they are not perfect. Because the NER is based on the language model, words are given categories based on what the model has seen in its training data. This means that the same NE can be given different categories, depending on the linguistic context in which it appears. It should also be noted that spaCy's NER labels each NE with only one category and does not normally provide any measure of how certain it is about that category.

Other NLP systems have similar NE categories, but there is no universal ontology, and some are more fine-grained than others. If you need more fine-grained or alternative categories in spaCy, you will have to train and use a suitable language model or add some post-processing, which is briefly discussed [later in this notebook](#section-improvingspacy).
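
One simple way to spot an NE that has received different categories in different contexts is to group the labels by entity text. The sketch below does this with the standard library. The `example_pairs` list is invented purely for illustration; with a processed text you would build it from `doc.ents` instead, as noted in the comment.

```python
from collections import defaultdict

# Hypothetical (entity text, category) pairs, invented for illustration.
# With real output: example_pairs = [(ent.text, ent.label_) for ent in doc.ents]
example_pairs = [
    ("Vickers", "PERSON"),
    ("Vickers", "ORG"),
    ("Maurice Frere", "PERSON"),
    ("Vickers", "PERSON"),
]

# Collect the set of categories seen for each entity text
labels_by_entity = defaultdict(set)
for entity_text, label in example_pairs:
    labels_by_entity[entity_text].add(label)

# Keep only the entities that were given more than one category
inconsistent = {
    entity_text: sorted(labels)
    for entity_text, labels in labels_by_entity.items()
    if len(labels) > 1
}
print(inconsistent)
```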

### Named Entities as Multi-Word Expressions <a class="anchor" id="section-mwes"></a>

You will notice that some of the NEs recognised by spaCy in the text include more than one word. The ability to recognise Multi-Word Expressions (MWEs) as NEs is important, as is understanding the context of word usage. Luckily, the data for the entities includes the character position for the start and the end of the NE.

In [None]:
# For each Named Entity recognised by the NER,
for entity in doc.ents:
    # output the start and end characters for the relevant tokens
    print(entity.text, "("+str(entity.start_char)+","+str(entity.end_char)+") ["+entity.label_+"]")

Each token will also have a value that indicates whether it is part of an NE.

In [None]:
# For each token in the first two sentences
for token in doc[9:59]:
    # output whether it has an NE category
    print(token.text+" ["+token.ent_type_+"]")

This contextual information is very helpful, as it lets you evaluate the scope of the terms regarded as part of an NE and whether they are appropriate, given the linguistic use of the terms. For instance, in your example sentences, you can see that both of the PERSON NEs are directly preceded by military titles, as in _Major Vickers_ and _Lieutenant Maurice Frere_. It also shows that _Maria_ is not regarded as a PERSON NE because it is part of a LOC NE. This is one of the consequences of spaCy assigning only one NE category per token: if a token is regarded as part of an MWE, the category of that MWE overrides any other category the token might otherwise have received.
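
Alongside `ent_type_`, each token carries an `ent_iob_` tag that marks whether it begins an NE (`B`), continues one (`I`) or lies outside any NE (`O`), which makes the extent of a multi-word NE explicit. The sketch below is an illustration under stated assumptions: the sentence is invented, and the entities are set by hand on a blank pipeline rather than predicted by a model, so that it runs without `en_core_web_sm`.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so the entities below are set by hand
demo_nlp = spacy.blank("en")
# Invented sentence reusing two names mentioned above
demo_doc = demo_nlp("Lieutenant Maurice Frere spoke to Major Vickers.")

# Mark tokens 1-2 ("Maurice Frere") and token 6 ("Vickers") as PERSON,
# leaving the military titles outside the entities
demo_doc.ents = [
    Span(demo_doc, 1, 3, label="PERSON"),
    Span(demo_doc, 6, 7, label="PERSON"),
]

# Collect the token text, its IOB tag and its NE category
iob_rows = [(token.text, token.ent_iob_, token.ent_type_) for token in demo_doc]
for row in iob_rows:
    print(row)
```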

### Improving the spaCy processing  <a class="anchor" id="section-improvingspacy"></a>

You might also have noticed that the spaCy NER didn't recognise _Macquarie Harbour_ as a Named Entity. This might have happened if the training data for the language model you used didn't have any example text that included the words _Macquarie_ or _harbour_, especially as part of a proper noun phrase. The most obvious way to correct this is to train a new model that does have such examples. That is beyond the scope of this notebook, but [spaCy has various guides](https://spacy.io/usage/training) on how this can be done, especially for NER.

If you don't want to train models, you can always write some post-processing Python code to add new NEs or change their categories. This is a good option if you don't have a wide range of changes to make, or if the linguistic contexts in which it would apply are very specific, e.g., only for a small closed set of words or distinct grammatical structures. Again, a full treatment is beyond the scope of this notebook, as it requires an understanding of the [spaCy IOB and BILUO schemes](https://spacy.io/usage/linguistic-features#accessing-ner) if you want to [set your own entities](https://spacy.io/usage/linguistic-features#setting-entities).
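
As a minimal sketch of such post-processing, the example below adds _Macquarie Harbour_ as an NE by hand. It is only one possible approach: the sentence is invented, the `LOC` category is an assumption about how you might want to label a harbour, and a blank pipeline stands in for the trained model so that the example is self-contained.

```python
import spacy

demo_nlp = spacy.blank("en")
# Invented sentence for illustration
demo_text = "Major Vickers sailed for Macquarie Harbour in 1833."
demo_doc = demo_nlp(demo_text)

# Find the character offsets of the phrase to be added as an NE
phrase = "Macquarie Harbour"
start_char = demo_text.find(phrase)

# char_span returns None when the offsets do not align with token boundaries
new_entity = demo_doc.char_span(start_char, start_char + len(phrase), label="LOC")
if new_entity is not None:
    # Keep any existing entities and add the new one (spans must not overlap)
    demo_doc.ents = list(demo_doc.ents) + [new_entity]

added_entities = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(added_entities)
```

With a trained model you would also need to check that the new span does not overlap an entity the NER has already found, because spaCy raises an error when entity spans overlap.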

Finally, as previously mentioned, some of the natural language processing in the spaCy pipeline can be resource-hungry, taking up CPU time and/or memory. For this reason, if you want to process a lot of data, such as many documents or sentences, you might be better off processing them in batches. This splits up the processing by [piping the data to the pipeline](https://spacy.io/usage/processing-pipelines#processing) as a stream rather than processing it all together at each stage. This is relatively easy to do, but it does change some of the data structures of the output.
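
The sketch below shows the general shape of batch processing with `nlp.pipe`. The three sentences and the `batch_size` of 2 are arbitrary choices for illustration, and a blank pipeline is used so that the example runs without a trained model. In the notebook you would call `nlp.pipe` on the loaded model instead.

```python
import spacy

demo_nlp = spacy.blank("en")

# Invented sentences standing in for a larger collection of texts
demo_texts = [
    "Major Vickers commanded the settlement.",
    "Lieutenant Maurice Frere arrived by ship.",
    "The harbour was quiet.",
]

# nlp.pipe yields one Doc per input text, working through the texts in batches
token_counts = []
for demo_doc in demo_nlp.pipe(demo_texts, batch_size=2):
    token_counts.append(len(demo_doc))

print(token_counts)
```

The main structural change is that `nlp.pipe` returns a generator of `Doc` objects rather than a single `Doc`, so any later code has to loop over the results.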

  <p>&nbsp;</p>

We hope this notebook has given you a basic understanding of how you can use the spaCy Named Entity Recognition tool.