# **Extracting Spatial Entities from text (2)**
---

## Task Description:

![](https://raw.githubusercontent.com/IgnatiusEzeani/spatial_narratives_workshop/main/img/from_penrith_tagged.png)

Assuming we know nothing about the geography of the place(s) described by the corpus, what can we learn about it? In particular:
* **What places are there?** These can be:
 * `Toponyms` (*Keswick*, *Pooley Bridge*, *the River Lowther*)
 * `Geographical features` (*the town*, *a hill*, *the road*)
 * `Locative adverbs` (*above*, *north-of*, *eastwards*, *here*, *there*)
 

## Named Entity Recognition and Semantic Tagging
Previously, in **Demo 1**, we applied a rule-based method to extract spatial elements from text.

There were a few limitations with the rule-based method.
* It requires a complete list of entities.
* Rules need to be provided for all possible scenarios:
  - e.g. spelling errors, variations in capitalization, inflections etc.
  - Overlapping instances ('Eamont' vs 'Eamont Bridge')
* It is difficult to extract references to time and date, as well as sentiments and emotions.
* It does not generalize well to other corpora.
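
The overlap problem can be made concrete with a small sketch. The sentence and the three-name gazetteer below are hypothetical, and `find_places()` is an illustrative helper rather than the function used in Demo 1. It shows one common workaround: try longer names first so that the shorter name does not claim part of the longer one.

```python
import re

# Hypothetical sentence and mini-gazetteer, for illustration only
text = "We crossed Eamont Bridge and followed the Eamont to Penrith."
place_names = ["Eamont", "Eamont Bridge", "Penrith"]

def find_places(text, names):
    """Return {start_offset: name}, preferring the longest name at each position."""
    # Longer names come first in the alternation, so 'Eamont Bridge'
    # is tried before 'Eamont' at the same position.
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = re.compile("|".join(rf"\b{re.escape(name)}\b" for name in ordered))
    return {match.start(): match.group() for match in pattern.finditer(text)}

matches = find_places(text, place_names)
print(list(matches.values()))
```

Even with this ordering, every name still has to be in the list, which is the limitation the rest of this demo tries to address.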

In this demo, we will explore the option of adapting named entity recognition and semantic tagging systems. 

## **Step 1: Downloading the workshop materials**
Let's download (clone) the resources for the workshop from the [Spatial Narrative Demo](https://github.com/SpaceTimeNarratives/demo)  GitHub repository.

In [None]:
!git clone https://github.com/SpaceTimeNarratives/demo.git

<div>
<img src="https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/img/file_structure.png" width="300"/>
</div>

The `demo` directory contains an example file `example.txt`. Our aim is to read the file, display the text and identify all the place names mentioned in it.

### Changing into the working directory
Everything we need for this exercise can be found in the working directory (or folder) named *demo*. We will use the `os` (operating system) library which contains all the useful functions we may need to manage our folders and files programmatically. 

Here we use the `chdir()` (change directory) function to get into our working directory and list its contents using the `listdir()` function.

Run the code below to change to the working directory `demo/` and list its content.

In [None]:
# Type or paste the command below:
import os
os.chdir('demo/')
os.listdir()
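
Running the cell above a second time raises a `FileNotFoundError`, because once we are inside `demo/` there is no `demo/` folder relative to the current directory. A minimal sketch of a guard is shown below; `enter_working_dir()` is a hypothetical helper and is not part of the workshop files.

```python
import os

def enter_working_dir(name="demo"):
    """Change into `name` unless we are already inside it; return the current directory."""
    already_inside = os.path.basename(os.getcwd()) == name
    if not already_inside and os.path.isdir(name):
        os.chdir(name)
    return os.getcwd()

print(enter_working_dir())
print(sorted(os.listdir()))
```

The helper only compares folder names, so it would also stay put in any unrelated folder that happens to be called `demo`.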

## **Step 2: Installing and importing `spaCy`**
The [spaCy NLP](https://spacy.io/) library provides a named entity recognizer that we can use.

### Installing `spacy v3.3.1`
The current version of `PyMUSAS` is compatible with `spacy v3.3.1`. However, Google Colab ships with `spacy v3.4.4` and its dependencies by default.

So we will take the following steps to install the compatible version:
- uninstall the current version with `pip uninstall -y spacy`
- install `spacy v3.3.1` and other dependencies from the `requirements.txt` file in the working folder.
- run the `functions.py` file, which will import `spaCy` as well as all the other required libraries.

In [None]:
pip -q uninstall -y spacy

In [None]:
pip -q install -r requirements.txt
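
After installing, it can be useful to confirm which versions are actually present. The sketch below uses the standard library's `importlib.metadata`; `installed_version()` is a hypothetical helper, and the package names are assumed to match the names used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Assumed distribution names for the libraries used in this demo
for package in ("spacy", "pymusas", "lemminflect"):
    print(package, installed_version(package))
```

If the reported `spacy` version is not the expected one, restarting the notebook runtime and re-running the cells may help, since an already imported library is not replaced by a fresh install.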

### Running the `functions.py` file
We have wrapped up all the functions we created in the previous demo into a Python file, `functions.py`.

One of the functions depends on the `lemminflect` Python library, which generates lemmas and inflections, so it must be installed first.

In [None]:
%run functions.py

### Importing `spaCy`
We need to import the `spaCy` NLP pipeline
and load the small version of the English model `en_core_web_sm` for tokenization, tagging, parsing and named entity recognition.

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

## **Step 3: Extracting named entities**


### Process `example.txt`
Read the content of the `example.txt` file in the working folder into the `example_text` variable and pass the variable through the NLP pipeline to produce the `spacy_processed` document.

In [None]:
example_text = open('example.txt').read()
example_text

In [None]:
spacy_processed = nlp(example_text)

Now let's see all the entities and tags the tagger has found in `spacy_processed`...

In [None]:
for entity in spacy_processed.ents:
    print(entity.text, entity.label_)

This looks quite interesting. We have more tags in the spacy NER tagger and, unsurprisingly, there is no `PLNAME` or `GEONOUN`. We can see that even without a list, it identified similar entities to the regular expression method, and even more (e.g. `King Arthur's Round Table` and `Lowther Castle`), although with different tags.
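
To get a quick overview of which tags occur and how often, the entity labels can be counted. The sketch below is self-contained, so it uses a hypothetical list of `(text, label)` pairs; with the real document the list would be built as `[(ent.text, ent.label_) for ent in spacy_processed.ents]`, and the labels shown here are illustrative rather than actual model output.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for the spaCy entities
entity_pairs = [
    ("Penrith", "GPE"),
    ("Ullswater", "PERSON"),
    ("King Arthur's Round Table", "FAC"),
    ("Lowther Castle", "FAC"),
    ("five miles", "QUANTITY"),
]

# Count how many entities carry each label
label_counts = Counter(label for _, label in entity_pairs)
for label, count in label_counts.most_common():
    print(f"{label:10} {count}")
```

For an unfamiliar label, `spacy.explain(label)` returns a short description.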

### Visualizing the entities
spaCy has a built-in visualisation feature, which we can call as below:

```python
from spacy import displacy
# In a notebook, jupyter=True renders the markup directly in the cell output
displacy.render(spacy_processed, style="ent", jupyter=True)
```
However, to have more control over the visualization, we can use the `visualize()` function we built in the last demo. To do so, we need a function `extract_spacy_entities(spacy_processed)` that extracts the spaCy entities into a dictionary, from which the list of tagged tokens for visualization is built.

We can convert all tags referring to a place (`GPE`, `ORG`, `FAC` and `LOC`) to `PLNAME`. We have also redefined the `BG_COLOR` variable in the `functions.py` file to accommodate all possible spaCy NER tags (see pages 21 & 22 of the [OntoNotes 5.0](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf) document for descriptions of the NER tags) and others that we may need later.

```python
BG_COLOR = {
    'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'GPE':'#feca74', 'CARDINAL':'#e4e7d2',
    'FAC':'#9cc9cc','QUANTITY':'#e4e7d2','PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2',
    'ORG':'#7aecec', 'NORP':'#d9fe74', 'LOC':'#9ac9f5', 'DATE':'#c7f5a9',
    'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5','TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7',
    'LAW':'#e6e6c1','LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2',
    'EMOTION':'#f2ecd0', 'TIME-SEM':'#d0e0f2', 'MOVEMENT':'#f2d0d0','no_tag':'#FFFFFF'
}
```

In [None]:
def extract_spacy_entities(processed_text):
  entities = {}
  for ent in processed_text.ents:
    tag='PLNAME' if ent.label_ in ['GPE', 'ORG', 'FAC', 'LOC'] else ent.label_
    entities[ent.start_char] = ent.text, tag
  return OrderedDict(sorted(entities.items()))

spacy_entities = extract_spacy_entities(spacy_processed)
visualize(example_text, spacy_entities)
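
The data structure produced by this step is a dictionary keyed by character offset. The sketch below reproduces the same idea without loading a model: `FakeEntity` is a hypothetical stand-in for a spaCy entity span with only the three attributes used, and the offsets are made up.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-in for a spaCy entity span (only the attributes used below)
FakeEntity = namedtuple("FakeEntity", ["text", "label_", "start_char"])

PLACE_LABELS = {"GPE", "ORG", "FAC", "LOC"}

def to_offset_dict(entities):
    """Map each start offset to a (text, tag) pair, folding place labels into PLNAME."""
    result = {}
    for ent in entities:
        tag = "PLNAME" if ent.label_ in PLACE_LABELS else ent.label_
        result[ent.start_char] = (ent.text, tag)
    # Sort by offset so the entities appear in reading order
    return OrderedDict(sorted(result.items()))

sample = [
    FakeEntity("Lowther Castle", "FAC", 25),
    FakeEntity("Penrith", "GPE", 5),
    FakeEntity("August", "DATE", 48),
]
print(to_offset_dict(sample))
```

Keying on the start offset is what later allows the outputs of different extractors to be merged position by position.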

### Combining the rule-based and `spaCy` methods
As shown above, some of the placenames are tagged as `PERSON`. However, the regular expression rules method used previously was able to identify them as placenames using the Lake District gazetteer.

So, using the `merge_entities()` function, we can combine the outputs of both methods so that the regular expression output overrides the spaCy NER output wherever the two conflict.

In [None]:
# Read LD placenames into a list
ld_place_names = [name.strip() for name in open('LD_placenames.txt').readlines()]

regex_entities = extract_entities(example_text, ld_place_names)

In [None]:
visualize(example_text, merge_entities(regex_entities, spacy_entities))

### Including Geo nouns and locative adverbs
Great so far! We can also include the geo nouns and the locative adverbs in our extracted entity types using their respective lists.

Remember that `merge_entities()` accepts two dictionaries of extracted entities, and the order matters: the first overrides the second when there is a conflict.
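
The effect of the argument order can be checked on two tiny dictionaries. The one-line `merge_entities()` below is the same as the one in `functions.py`, repeated so that the sketch runs on its own; the offsets and entities are hypothetical.

```python
from collections import OrderedDict

def merge_entities(first_ents, second_ents):
    # Same one-liner as in functions.py: first_ents wins on shared offsets
    return OrderedDict(sorted({**second_ents, **first_ents}.items()))

# Hypothetical extractions that disagree about the entity at offset 12
regex_ents = {12: ("Ullswater", "PLNAME")}
spacy_ents = {12: ("Ullswater", "PERSON"), 40: ("August", "DATE")}

print(merge_entities(regex_ents, spacy_ents)[12])  # the regex result is kept
print(merge_entities(spacy_ents, regex_ents)[12])  # the spaCy result is kept
```

Note that a conflict here means two entries with the same start offset; entries that overlap but start at different offsets are both kept by the merge.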

In [None]:
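# Extract geo nouns (including their inflected forms)...
geonouns = get_inflections([noun.strip() for noun in open('geo_feature_nouns.txt').readlines()])
extracted_geonouns = extract_entities(example_text, geonouns, tag='GEONOUN')

# Extract locative adverbs (the first word on each non-empty line of the list)
loc_advs = [adv.split()[0] for adv in open('locativeAdverbs.txt').readlines() if adv.strip()]
extracted_locadvs = extract_entities(example_text, loc_advs, tag='LOCADV')

# BG_COLOR as listed has no 'LOCADV' entry; register one so that visualize()
# can render the tag (the colour is an arbitrary choice)
BG_COLOR.setdefault('LOCADV', '#f5d5a9')

# Merge the entities in the order below: earlier arguments override later ones
merged_entities = merge_entities(regex_entities,
                    merge_entities(spacy_entities,
                       merge_entities(extracted_geonouns, extracted_locadvs)))
# Visualize the merged entities.
visualize(example_text, merged_entities)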

### So far...
...this is going great 😊. 

##### **Rule based method**
With the rule based method, we can identify, extract and merge any number of entity types (placenames, geo nouns, locative adverbs) as long as we have a list of named elements in that class. But it is inefficient to create an exhaustive list that can generalize across different writings. 

##### **Named entity recognizer**
Fortunately, we can use a named entity recognizer to identify interesting entities even without a list of elements. Although the NER model was not trained for our case study, we can adapt the relevant tags from the model output that correspond to our categories, e.g. converting `[GPE, ORG, FAC, LOC]` to `PLNAME`.

## **Step 4: Extracting semantic entities**

We may also want to extract spatial elements that indicate movement, emotion or a sense of time. The previous methods cannot identify references to movement and emotion.

Although the NER method can detect time and date references (e.g. `August`, `daily`, `the year 1635`), it does not pick up expressions such as `retrospective`, `for some time`, `never ending` or `prolonged` as references to time.

We will use **P**ython **M**ultilingual **U**crel **S**emantic **A**nalysis **S**ystem ([PyMUSAS](https://ucrel.github.io/pymusas/)), which is a rule based token and Multi Word Expression semantic tagger that uses the [Ucrel Semantic Analysis System (USAS)](https://ucrel.lancs.ac.uk/usas/) and runs on the spaCy pipeline.

We will focus on three of the high-level tags in the USAS tagset:
* `E` - Emotion 
* `M` - Movement, location, travel and transport
* `T` - Time
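
USAS tags begin with the letter of their high-level category, followed by numbers that narrow the meaning (for example `M1` or `T1.3`). A minimal sketch of how the first letter can be mapped to the labels used in this demo is given below; `USAS_PREFIX_TO_LABEL` and `usas_label()` are hypothetical names, not part of PyMUSAS.

```python
# Assumed mapping from the first letter of a USAS tag to the labels used in this demo
USAS_PREFIX_TO_LABEL = {"E": "EMOTION", "M": "MOVEMENT", "T": "TIME-SEM"}

def usas_label(usas_tag):
    """Return the demo label for a USAS tag such as 'M1' or 'T1.3', else None."""
    return USAS_PREFIX_TO_LABEL.get(usas_tag[:1])

for tag in ["E4.1+", "M1", "T1.3", "Z5"]:
    print(tag, "->", usas_label(tag))
```

Tags from other categories, such as `Z5`, map to `None` and are simply ignored.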


### Setting up `PyMUSAS`



In [None]:
# Load the English PyMUSAS rule based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
# Adds the English PyMUSAS rule based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

In [None]:
spacy_processed = nlp(example_text)

In [None]:
tag_types = ['EMOTION', 'MOVEMENT', 'TIME-SEM']
semtag_entities = extract_sem_entities(spacy_processed, tag_types)
visualize(example_text, semtag_entities)

In [None]:
merged_entities = merge_entities(merged_entities, semtag_entities)

In [None]:
visualize(example_text, merged_entities)

## **Putting it all together..**

Below are the functions that support all the activities in this demo, as defined in the `functions.py` file.

```python
"""
This code provides functions to extract and visualize entities and semantic tokens from text. 
Here are the descriptions of the functions:
"""

import re
from IPython.display import HTML
from collections import OrderedDict
from lemminflect import getLemma, getInflection

BG_COLOR = {
    'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'GPE':'#feca74', 'CARDINAL':'#e4e7d2',
    'FAC':'#9cc9cc','QUANTITY':'#e4e7d2','PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2',
    'ORG':'#7aecec', 'NORP':'#d9fe74', 'LOC':'#9ac9f5', 'DATE':'#c7f5a9',
    'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5','TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7',
    'LAW':'#e6e6c1','LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2',
    'EMOTION':'#f2ecd0', 'TIME-SEM':'#d0e0f2', 'MOVEMENT':'#f2d0d0','no_tag':'#FFFFFF'
}

"""
Function `extract_entities(text, ent_list, tag='PLNAME')`
This function takes a text, a list of entities (as strings), and an optional tag as input, 
and returns a dictionary of entities with their indexes in the text as keys. 
The optional tag parameter is used to specify the entity type, which defaults to `'PLNAME'` if not provided.
"""
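def extract_entities(text, ent_list, tag='PLNAME'):
  # Try longer names first so that 'Eamont Bridge' is preferred over 'Eamont'
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # re.escape() stops characters such as '.' in a name from acting as regex operators
    for match in re.finditer(rf' {re.escape(ent)}[.,\s;:]', text):
      start = match.start()+1
      # keep the first (longest) match at each position; the `tag` is returned too
      extracted_entities.setdefault(start, (text[start:match.end()-1], tag))
  return {i:extracted_entities[i] for i in sorted(extracted_entities.keys())}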

combine = lambda x, y: (x[0], x[1], x[2]+' '+y[2], x[3])

"""
Function `get_inflections(names_list)`
This function takes a list of geo nouns as input and returns a list of their inflections and lemmas using the `lemminflect` package.
"""
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))

"""
Function `combine_multi_tokens(a_list)`
This function takes a list of adjacent semantic tokens (a semantic token is a tuple of a token
and its tag) as input and returns a list of tuples where adjacent tokens of the same type are 
combined into a single tuple.
Example: `[('at','TIME'), ('this','TIME'), ('point','TIME')] => [('at this point', 'TIME')]` 
"""
def combine_multi_tokens(a_list):
  new_list = [a_list.pop()]
  while a_list:
    last = a_list.pop()
    if new_list[-1][0] - last[0] == 1:
      new_list.append(combine(last, new_list.pop()))
    else:
      new_list.append(last)
  return sorted(new_list)

"""
Function `extract_sem_entities(processed_text, tag_types)`
This function takes processed text and a list of semantic tags as input and returns a
dictionary of semantic entities with their indexes in the text as keys.
"""
def extract_sem_entities(processed_text, tag_types):
  entities, tokens = {}, [token.text for token in processed_text]
  for tag_type in tag_types:
    tag_indices = [(i, token.idx, token.text, tag_type) for i, token in enumerate(processed_text) 
                        if token._.pymusas_tags[0].startswith(tag_type[0])]
    if tag_indices:
      for i, idx, token, tag in combine_multi_tokens(tag_indices):
        entities[idx] = token, tag
  return OrderedDict(sorted(entities.items()))

"""
Function `merge_entities(first_ents, second_ents)`
This function takes two dictionaries of entities and returns a merged dictionary.
"""
def merge_entities(first_ents, second_ents):
  return OrderedDict(sorted({** second_ents, **first_ents}.items()))

"""
Function `get_tagged_list(text, entities)`
This function takes text and a dictionary of entities as input and returns a list of tuples where
each tuple contains a token and its tag.
"""
def get_tagged_list(text, entities):
  begin, tokens_tags = 0, []
  for start, (ent, tag) in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

"""
Function `mark_up(token, tag=None)`
This function takes a token and an optional tag as input and returns the token surrounded
by HTML markup that will be used for visualization. If a tag is provided, the token will be 
highlighted with a background color corresponding to the tag.
"""
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: {BG_COLOR[tag]}; padding: 0.05em 0.05em; margin: 0 0.15em;  border-radius: 0.55em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"

"""
Function `visualize(text, entities)`
This function takes text and a dictionary of entities as input and returns an HTML-formatted
string that visually highlights the entities in the text. 
"""
def visualize(text, entities):
  token_tag_list = get_tagged_list(text, entities)
  start_div = f'<div class="entities" style="line-height: 2.5; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += mark_up(token,tag)
  html += end_div
  return HTML(html)

```
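
To see how overlapping entities are handled at the visualization stage, the sketch below runs `get_tagged_list()` on a short hypothetical sentence. The function body is copied from the listing above so that the example runs on its own; the text and offsets are made up.

```python
def get_tagged_list(text, entities):
    # Copied from the listing above so that this sketch is self-contained
    begin, tokens_tags = 0, []
    for start, (ent, tag) in entities.items():
        if begin <= start:
            tokens_tags.append((text[begin:start], None))
            tokens_tags.append((text[start:start+len(ent)], tag))
            begin = start+len(ent)
    tokens_tags.append((text[begin:], None))  # add the last untagged chunk
    return tokens_tags

text = "From Penrith to Eamont Bridge."
entities = {
    5: ("Penrith", "PLNAME"),
    16: ("Eamont Bridge", "PLNAME"),
    23: ("Bridge", "GEONOUN"),  # starts inside the previous entity, so it is skipped
}
for chunk in get_tagged_list(text, entities):
    print(chunk)
```

An entity that starts before the previous one has ended is dropped, so the entity that begins earlier in the text takes precedence when spans overlap.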

## **Next step...**


**Using other text files**

So far we have been able to extract placenames, geo nouns, locative adverbs as well as semantic depictions of the concepts of emotion, movement and time. 

In the next notebook, we will put these techniques together to build a demo that allows us to use our own files and choose the options we are interested in.