<a href="https://colab.research.google.com/github/IgnatiusEzeani/spatial_narratives_workshop/blob/main/spatial_narrative_demo2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extracting Spatial Entities from text (2)**
---



## Task Description:

![](https://raw.githubusercontent.com/IgnatiusEzeani/spatial_narratives_workshop/main/img/from_penrith_both.png)

Assuming we know nothing about the geography of the place(s) described by the corpus, what can we learn about it? In particular:
* **What places are there?** These can be:
 * `Toponyms` (*Keswick*, *Pooley Bridge*, *the River Lowther*)
 * `Geographical features` (*the town*, *a hill*, *the road*)
 * `Locative adverbs` (*above*, *north-of*, *eastwards*, *here*, *there*)
 

## Named Entity Recognition and Semantic Tagging
Previously, in **Demo 1**, we applied a rule-based method to extract spatial elements from text.

There were a few limitations with the rule-based method.
* It requires a complete list of entities.
* Rules need to be provided for all possible scenarios:
  - e.g. spelling errors, variations in capitalizations, inflections etc.
  - Over-lapping instances ('Eamont' vs 'Eamont Bridge')
* Difficult to extract references to time and date as well as sentiments and emotions.
* Does not generalize well with other corpora
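
The overlap and capitalization problems can be reproduced in a few lines. This is a minimal sketch on a made-up sentence with a hypothetical two-entry gazetteer; it is not the Demo 1 code.

```python
import re

# Hypothetical two-entry gazetteer and sentence, for illustration only.
gazetteer = ["Eamont", "Eamont Bridge"]
sentence = "We crossed Eamont Bridge and followed the eamont to Ulleswater."

matches = []
for name in gazetteer:
    # re.escape() makes sure the name is matched literally
    for m in re.finditer(re.escape(name), sentence):
        matches.append((m.start(), m.group()))

matches.sort()
print(matches)
```

Both `Eamont` and `Eamont Bridge` are reported at the same position, while the lower-case `eamont` and the unlisted `Ulleswater` are missed.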

In this demo, we will explore the option of adapting named entity recognition and semantic tagging systems. 

## **Step 1: Downloading the workshop materials**
Let's download (clone) the resources for the workshop from the [Spatial Narrative Workshop](https://github.com/IgnatiusEzeani/spatial_narratives_workshop)  GitHub repository.

In [None]:
!git clone https://github.com/SpaceTimeNarratives/demo.git

<div>
<img src="https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/img/file_structure.png" width="300"/>
</div>

The `demo` directory contains an example file, `example.txt`. Our aim is to read the file, display the text and identify all the place names mentioned in it.

### Changing into the working directory
Everything we need for this exercise can be found in the working directory (or folder) named *demo*. We will use the `os` (operating system) library which contains all the useful functions we may need to manage our folders and files programmatically. 

Here we use the `chdir()` (change directory) function to get into our working directory and list its contents using the `listdir()` function.

Type (or copy and paste) the code below in the next cell and run.

```python
import os
os.chdir('demo/')
os.listdir()
```

In [None]:
# Type or paste the command below:
import os
os.chdir('demo/')
os.listdir()

### Running the `functions.py` file
We have wrapped up all the functions we created in the previous demo into a Python file, `functions.py`.

One of the functions depends on the `lemminflect` Python library, which helps to generate lemmas and inflections, so we have to install it first.

In [None]:
!pip install lemminflect

In [None]:
%run functions.py

## **Step 2: Extracting named entities**
The [spaCy NLP](https://spacy.io/) library provides a named entity recognizer that we can use. So we start by importing `spacy`

### Importing `spaCy`
We need to import the `spaCy` NLP pipeline
and load the small version of the English model `en_core_web_sm` for tokenization, tagging, parsing and named entity recognition.

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

### Process `example.txt`
Read the content of the `example.txt` file in the working folder into the `example_text` variable and pass the variable through the NLP pipeline to produce the `spacy_processed` document.

In [None]:
example_text = open('example.txt').read()
# example_text

In [None]:
spacy_processed = nlp(example_text)

Now let's see all the entities and tags the tagger has found in `spacy_processed`...

In [None]:
for entity in spacy_processed.ents:
    print(entity.text, entity.label_)

This looks quite interesting. We have more tags in the spacy NER tagger and, unsurprisingly, there is no `PLNAME` or `GEONOUN`. We can see that even without a list, it identified similar entities to the regular expression method, and even more (e.g. `King Arthur's Round Table` and `Lowther Castle`), although with different tags.
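
A quick way to get an overview of which tags were assigned is to count them. The sketch below uses a hand-written list of `(text, label)` pairs as a stand-in for `[(entity.text, entity.label_) for entity in spacy_processed.ents]`; the pairs are made up for illustration and are not real model output.

```python
from collections import Counter

# Stand-in for the (text, label) pairs of spaCy entities; values are hypothetical.
entity_pairs = [
    ("Penrith", "GPE"),
    ("Pooley Bridge", "FAC"),
    ("Ulleswater", "PERSON"),
    ("Lowther Castle", "ORG"),
    ("Penrith", "GPE"),
]

# Count how often each label occurs, most frequent first
label_counts = Counter(label for _, label in entity_pairs)
for label, count in label_counts.most_common():
    print(label, count)
```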

### Visualizing the entities
Spacy has an inbuilt visualisation feature which we can call as below:

```python
from spacy import displacy
HTML(displacy.render(spacy_processed, style="ent"))
```
However, to have more control over the visualization, we can use the `visualize()` function we built in the last demo. For that we need a function, `extract_spacy_entities(spacy_processed)`, that extracts the spaCy entities into a dictionary from which we can build the list of tagged tokens for visualization.

We can convert all tags referring to a place (`GPE`, `ORG`, `FAC` and `LOC`) to `PLNAME`. We have also redefined the `BG_COLOR` variable in the `functions.py` file to accommodate all possible spaCy NER tags (see pages 21 & 22 of the [OntoNotes 5.0](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf) document for descriptions of the NER tags) and others that we may need later.

```python
BG_COLOR = {
    'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'GPE':'#feca74', 'CARDINAL':'#e4e7d2',
    'FAC':'#9cc9cc','QUANTITY':'#e4e7d2','PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2',
    'ORG':'#7aecec', 'NORP':'#d9fe74', 'LOC':'#9ac9f5', 'DATE':'#c7f5a9',
    'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5','TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7',
    'LAW':'#e6e6c1','LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2',
    'EMOTION':'#f2ecd0', 'TIME-sem':'#d0e0f2', 'MOVEMENT':'#f2d0d0','no_tag':'#FFFFFF'
}
```
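
The tag conversion itself can be tried out without loading a model. The sketch below is only an illustration: `Span` is a hypothetical stand-in for a spaCy span that carries just the three attributes used here, `to_place_tag()` is a helper name chosen for this example, and the spans are invented.

```python
from collections import namedtuple

# Minimal stand-in for a spaCy span: only the three attributes used here.
Span = namedtuple("Span", ["text", "label_", "start_char"])

PLACE_LABELS = {"GPE", "ORG", "FAC", "LOC"}  # labels folded into PLNAME

def to_place_tag(label):
    """Map a NER label to PLNAME when it denotes a place, else keep it."""
    return "PLNAME" if label in PLACE_LABELS else label

# Hypothetical spans, not real model output.
spans = [
    Span("Penrith", "GPE", 5),
    Span("August", "DATE", 120),
    Span("Lowther Castle", "ORG", 300),
]

# Dictionary keyed by start character, holding (text, tag) pairs
entities = {s.start_char: (s.text, to_place_tag(s.label_)) for s in spans}
print(entities)
```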

In [None]:
from collections import OrderedDict

def extract_spacy_entities(processed_text):
  # Map each entity's start character to (text, tag)
  entities = {}
  for ent in processed_text.ents:
    # Treat geopolitical entities, organisations, facilities and locations as place names
    tag='PLNAME' if ent.label_ in ['GPE', 'ORG', 'FAC', 'LOC'] else ent.label_
    entities[ent.start_char] = ent.text, tag
  return OrderedDict(sorted(entities.items()))

spacy_entities = extract_spacy_entities(spacy_processed)
visualize(get_tagged_list(example_text, spacy_entities))

### Combining the rule-based and `spaCy` methods
As shown above, some of the placenames are tagged as `PERSON`. However, the regular expression rules method used previously was able to identify them as placenames using the Lake District gazetteer.

So using the `merge_entities()` function, we can combine the outputs from both methods in a way that the output of the regular expression overrides that of spacy NER where there is conflict.

In [None]:
# Read LD placenames into a list
ld_place_names = [name.strip() for name in open('LD_placenames.txt').readlines()]

regex_entities = extract_entities(example_text, ld_place_names)

In [None]:
visualize(get_tagged_list(example_text, merge_entities(regex_entities, spacy_entities)))

### Including Geo nouns and locative adverbs
Great so far! We can also include the geo nouns and the locative adverbs in our extracted entity types using their respective lists.

Remember that `merge_entities()` accepts two extracted entities arguments and the order matters because the first overrides the other when there is a conflict.
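
The effect of the argument order can be checked on two tiny, made-up dictionaries. The sketch repeats the one-line `merge_entities()` definition from the previous demo so that it runs on its own; the entries are hypothetical.

```python
from collections import OrderedDict

def merge_entities(first_ents, second_ents):
    # Entries in `first_ents` win when both dictionaries share a start position
    return OrderedDict(sorted({**second_ents, **first_ents}.items()))

# Hypothetical extractions keyed by start character.
regex_out = {5: ("Ulleswater", "PLNAME")}
spacy_out = {5: ("Ulleswater", "PERSON"), 40: ("August", "DATE")}

print(merge_entities(regex_out, spacy_out))  # regex output takes precedence
print(merge_entities(spacy_out, regex_out))  # spaCy output takes precedence
```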

In [None]:
# Extract geo nouns...
geonouns = get_inflections([noun.strip() for noun in open('geo_feature_nouns.txt').readlines()])
extracted_geonouns = extract_entities(example_text, geonouns, tag='GEONOUN')

# Extract locative adverbs
loc_advs = [adv.split()[0] for adv in open('locativeAdverbs.txt').readlines()]
extracted_locadvs = extract_entities(example_text, loc_advs, tag='LOCADV')

# Merge the entities in the order below
merged_entities = merge_entities(regex_entities,
                    merge_entities(spacy_entities,
                       merge_entities(extracted_geonouns, extracted_locadvs)))
# Visualize the merged entities.
visualize(get_tagged_list(example_text, merged_entities))

### So far...
...this is going great 😊. 

##### **Rule based method**
With the rule based method, we can identify, extract and merge any number of entity types (placenames, geo nouns, locative adverbs) as long as we have a list of named elements in that class. But it is inefficient to create an exhaustive list that can generalize across different writings. 

##### **Named entity recognizer**
Fortunately, we can use a named entity recognizer to identify interesting entities even without a list of elements. Although the NER model was not trained for our case study, we can adapt the relevant tags from the model output that correspond to our categories, e.g. converting `[GPE, ORG, FAC, LOC]` to `PLNAME`.

## **Step 3: Extracting semantic entities**

We may also want to extract elements that indicate movement, emotion or a sense of time. The previous methods cannot identify references to movement and emotion.

Although the NER method can detect time and date references (e.g. `August`, `daily`, `the year 1635`), it does not pick up expressions such as `retrospective`, `for some time`, `never ending` or `prolonged` as references to time.

We will use the **P**ython **M**ultilingual **U**crel **S**emantic **A**nalysis **S**ystem ([PyMUSAS](https://ucrel.github.io/pymusas/)), a rule-based token and multi-word expression semantic tagger that uses the [Ucrel Semantic Analysis System (USAS)](https://ucrel.lancs.ac.uk/usas/) tagset and runs on the spaCy pipeline.

Three of the high-level USAS tags are of interest here:
* `E` - Emotion 
* `M` - Movement, location, travel and transport
* `T` - Time
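
Since the category of a USAS tag is given by its first letter, selecting these three fields comes down to a prefix check. The following is a minimal sketch under that assumption: `USAS_FIELDS` and `usas_category()` are names chosen for this illustration, and the token/tag pairs are invented rather than real tagger output.

```python
# First letter of a USAS tag -> the label used for highlighting in this notebook.
USAS_FIELDS = {"E": "EMOTION", "M": "MOVEMENT", "T": "TIME-sem"}

def usas_category(usas_tag):
    """Return the notebook label for a USAS tag, or None if it is not E, M or T."""
    return USAS_FIELDS.get(usas_tag[:1])

# Hypothetical token/tag pairs, for illustration only.
tagged = [("walked", "M1"), ("delighted", "E4.1+"), ("prolonged", "T1.3+"), ("the", "Z5")]
for token, tag in tagged:
    print(token, tag, usas_category(tag))
```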


### Setting up `PyMUSAS`



In [None]:
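# Build {start_char: (length, token, tag)} for the tokens at the given indices.
# Assumes the text is the tokens joined by single spaces.
def extract_entities_with_semtagger(tokens, index_list, tag):
  entityPosLen={}
  for i in index_list:
    # characters before token i, plus one separating space unless it is the first token
    start_char = len(" ".join(tokens[:i])) + (1 if i else 0)
    entityPosLen[start_char] = (len(tokens[i]), tokens[i], tag)
  return entityPosLen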

In [None]:
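# Get the index list of a sem tag.
# NOTE: relies on a global `output_doc`, assumed to be the spaCy document
# produced by the PyMUSAS pipeline (it is not defined in this notebook).
def get_sem_tagged(tag_type):
  index_list = []
  for i in range(len(output_doc)):
    # e.g. 'EMOTION' matches USAS tags starting with 'E'
    if output_doc[i]._.pymusas_tags[0].startswith(tag_type[0]):
       index_list.append(i)
  return index_list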

In [None]:
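import collections

# Draft: extract semantic entities from a PyMUSAS-tagged spaCy document.
# NOTE: `combine_multi_tokens()` is not defined in this notebook; it is assumed
# to be provided elsewhere (e.g. in `functions.py`).
def extract_sem_entities(nlp_doc, tokens, sem_tag_types):
  entities = {}
  for tag_type in sem_tag_types:
    tag_indices = [(i, token.text, tag_type) for i, token in enumerate(nlp_doc)
                   if token._.pymusas_tags[0].startswith(tag_type[0])]
    if tag_indices:
      for i, token, tag in combine_multi_tokens(tag_indices):
        start_char = len(" ".join(tokens[:i])) + (1 if i else 0)
        entities[start_char] = (len(token), token, tag)
  return collections.OrderedDict(sorted(entities.items()))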

In [None]:
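import collections
import IPython.display

# NOTE: `text`, `text_tokens`, `generate_html()` and `get_token_tags()` are not
# defined in this notebook; they are assumed to be available from earlier setup.
tag_types = ['EMOTION', 'MOVEMENT', 'TIME-sem']  # spelled as in `BG_COLOR`
semtagger_entities={}
for tag_type in tag_types:
  tag_entities = extract_entities_with_semtagger(text_tokens, get_sem_tagged(tag_type),tag_type) 
  semtagger_entities = {**semtagger_entities, **tag_entities}
semtagger_entities = collections.OrderedDict(sorted(semtagger_entities.items()))

IPython.display.HTML(
    generate_html(get_token_tags(text, semtagger_entities)))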

#### Extracting placenames
Here we think about a way to extract a known place name (e.g. `Penrith`) from the text. We will use the Python library `re` (regular expression) to build search patterns to look for in the text.

Let's define a function called `extract_placenames(text, placenames)` that takes some *text* and a list of *placenames* and returns every occurrence of each place name in the text.

The code below defines the `extract_placenames(text, placenames)` function.


In [None]:
import re
def extract_placenames(text, placenames):
  # Longest names first, so 'Eamont Bridge' is recorded before 'Eamont'
  placenames = sorted(set(placenames), key=len, reverse=True)
  extracted_placenames = {}
  for name in placenames:
    # A name only counts when followed by punctuation or whitespace
    for match in re.finditer(re.escape(name) + r'[.,\s;:]', text):
      # Keep the first (longest) name found at each start position
      extracted_placenames.setdefault(match.start(), name)
  return dict(sorted(extracted_placenames.items()))

In [None]:
place_names = ['Carleton Hall', 'Dunmallet', 'Eamont', 'Eamont Bridge', 
               'Hallen Fell',  'Penrith', 'Pooley Bridge', 'Shap', 'ULLESWATER',
               'Ulleswater']

extracted_place_names = extract_placenames(example_text, place_names)
extracted_place_names

The above output (`{5: 'Penrith', 31: 'Pooley Bridge', ...}`) is a dictionary with the `start: placename` entries where `start` is the starting character index (or position) of the `placename` in the order they are found in text. 

For example, `Penrith` was the first placename found, starting at character index `5`. `Pooley Bridge` appears three times in the text, with starting indexes `31`, `450` and `856`. Note that the first character is at position `0` (not `1`).
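
To see where such start indexes come from, here is a small sketch on a made-up sentence (not the workshop text). It looks for one name followed by punctuation or whitespace and collects `match.start()` for each hit.

```python
import re

text = "From Penrith to Pooley Bridge, then back to Penrith."
# The name, followed by one punctuation or whitespace character
pattern = re.escape("Penrith") + r"[.,\s;:]"

starts = [m.start() for m in re.finditer(pattern, text)]
print(starts)  # [5, 44]
# Slice the name back out of the text using each start index
print([text[s:s + len("Penrith")] for s in starts])
```

Counting from `0`, the `P` of the first `Penrith` is the sixth character, hence index `5`.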

You may observe that it found all occurrences of the placenames on our list. But there are other placenames in the text, such as `King Arthur's Round Table`, `Lowther Castle`, `Martindale`, `Mayborough` and `Patterdale`. We will come back to this later.



## **Step 3: Visualizing the outputs**
---
It is often a good idea to present a graphic representation of our outputs for better visualization and understanding of how our process works.

Using the `HTML` function inside the `IPython` library's `display` package, we define functions that display the HTML format of the visualisation of the extracted place names in the text.

#### **Visualizing the plain text**

Visualizing the untagged example text is easy. We simply pass the `example_text` variable to the `HTML` function.

In [None]:
from IPython.display import HTML
HTML(example_text)

#### **Visualizing the extracted place names**
This is a little more challenging but will follow the same principle. Having extracted the place names, we can define functions that will help us 'mark-up' or highlight the extracted place names from the plain text based on their starting index positions and spans so we can visualize it in HTML format.

Let's call the first function `get_tagged_list()`. It parses the text with the dictionary of extracted place names and identifies the spans to be tagged as place names. Its output is a list of tuples containing text spans and tags (either `PLNAME` or `None`).
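
As a rough illustration of the output shape, the sketch below splits a made-up sentence into tagged and untagged chunks. `split_by_spans()` is a hypothetical helper written for this example; it follows the same idea but is not the workshop function.

```python
def split_by_spans(text, spans, tag="PLNAME"):
    """Split `text` into (chunk, tag) pairs given {start: matched_string} spans."""
    pieces, begin = [], 0
    for start, matched in sorted(spans.items()):
        if start < begin:  # overlaps a span already emitted; skip it
            continue
        pieces.append((text[begin:start], None))                # untagged chunk
        pieces.append((text[start:start + len(matched)], tag))  # tagged chunk
        begin = start + len(matched)
    pieces.append((text[begin:], None))  # trailing untagged chunk
    return pieces

print(split_by_spans("From Penrith to Shap.", {5: "Penrith", 16: "Shap"}))
```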

In [None]:
# Split the text into tagged (place name) and untagged spans
def get_tagged_list(text, ext_pl_names):
  begin, tokens_tags = 0, []
  for start, plname in ext_pl_names.items():
    length, ent, tag = len(plname), plname, 'PLNAME'
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+length], tag))
      begin = start+length
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# get_tagged_list(example_text, extracted_place_names)

The second function, `mark_up()`, takes a `token` (actually a span of characters) and a tag (i.e. `PLNAME` for a place name) and highlights the text with a background colour in HTML format.

In [None]:
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: #feca74 ; padding: 0.1em 0.1em; margin: 0 0.15em; border-radius: 0.23em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"
# HTML(mark_up('Penrith', tag='PLNAME'))

Finally, we piece everything together with the function `visualize()`, which marks up the output of `get_tagged_list()` with the `mark_up()` function and returns the result as HTML.

In [None]:
# generate html formatted text 
def visualize(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.0; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += mark_up(token,tag)
  html += end_div
  return HTML(html)

visualize(get_tagged_list(example_text, extracted_place_names))

## **Step 4: Extracting with a gazetteer**
Our examples so far can only extract and visualise a few place names. To have a chance of extracting all the place names in the text, we will need a more comprehensive list.

So for this task, we will apply the techniques and processes defined above with a list of the Lake District place names from the gazetteer created by [Source]() to identify and extract mentions of the place names in the same text.

In [None]:
# Read LD placenames into a list
ld_place_names = [name.strip() for name in open('LD_placenames.txt').readlines()]

# Extract place names from the same example text 
extracted_place_names = extract_placenames(example_text, ld_place_names)

# Get list of tagged entities (or placenames) and their tags
tagged_list = get_tagged_list(example_text, extracted_place_names)

visualize(tagged_list)

Okay, so far our method works quite well if we have what we are looking for in the list exactly as it appears in the text. Otherwise, it may wobble a bit. 

A typical example in the text above is <mark>Lowther</mark> and <mark>Castle</mark> being marked separately instead of as <mark>Lowther Castle</mark>.
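
One possible way to handle such cases, once the longer name is in the list, is to let longer names claim their characters first. The sketch below assumes a made-up sentence and a hypothetical helper `longest_first_matches()`; it uses word boundaries (`\b`) instead of the trailing-punctuation check used elsewhere in this notebook.

```python
import re

def longest_first_matches(text, names):
    """Return {start: name}, letting longer names claim their characters first."""
    taken, found = set(), {}
    for name in sorted(set(names), key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):  # no character already claimed
                found[m.start()] = name
                taken.update(span)
    return dict(sorted(found.items()))

text = "We passed Lowther Castle on the way."
print(longest_first_matches(text, ["Lowther", "Castle"]))
print(longest_first_matches(text, ["Lowther", "Castle", "Lowther Castle"]))
```

With only the two short names, both are found separately; with `Lowther Castle` added, a single longer match replaces them.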

## **Step 5: Extracting geographical feature nouns with list**
To extract geographical features from a list of feature nouns (e.g. `castle`, `ridge`, `forest`, `village`, `river` etc), we will apply the same method as placenames.

We will modify the `extract_placenames()` function to be more generic for all entity classes i.e. `extract_entities()`.

Also, so that we can apply a new tag, `GEONOUN`, let's modify the `get_tagged_list()` function to accept a `tag` parameter that defaults to `PLNAME` but supports other tags.

In [None]:
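def extract_entities(text, ent_list):
  # Longest entries first, so longer names take precedence at a shared start
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # Require a leading space and trailing punctuation/whitespace around the entry
    for match in re.finditer(' ' + re.escape(ent) + r'[.,\s;:]', text):
      # match.start() points at the leading space, so the entity starts one later
      extracted_entities.setdefault(match.start() + 1, ent)
  return dict(sorted(extracted_entities.items()))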

extract_entities(example_text, place_names)

Include `tag` in the `get_tagged_list()` parameters to enable the use of other tags. `tag='PLNAME'` defaults to placenames, but explicitly passing another tag (e.g. `GEONOUN`) will override it.

In [None]:
# Split the text into (span, tag) pairs: extracted entities get `tag`, the rest get None
def get_tagged_list(text, entities, tag='PLNAME'):
  begin, tokens_tags = 0, []
  for start, ent in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# get_tagged_list(example_text, extracted_place_names)

We need inflections and lemmas of all the words in the geo nouns list. For example, if we have `road` in the list, then we will need `roads` as well, and vice versa.

So we will install the `lemminflect` library and define the `get_inflections()` function.
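
To see why the expansion matters, here is a deliberately simplistic sketch that adds plural forms to a noun list. `naive_plural()` and `expand_with_plurals()` are hypothetical helpers that apply a few spelling rules only; they are a stand-in for illustration and not how `lemminflect` works.

```python
def naive_plural(noun):
    """Very rough English pluralisation; a stand-in for a real inflection library."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"  # consonant + y -> ies
    return noun + "s"

def expand_with_plurals(nouns):
    """Return the sorted, de-duplicated nouns together with their naive plurals."""
    expanded = set()
    for noun in nouns:
        noun = noun.strip()
        expanded.update({noun, naive_plural(noun)})
    return sorted(expanded)

print(expand_with_plurals(["road", "valley", "church"]))
```

Irregular nouns would need a proper inflection library, which is why the notebook relies on `lemminflect` rather than rules like these.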

In [None]:
!pip install lemminflect

In [None]:
# Expand list with inflections and lemmas
from lemminflect import getLemma, getInflection
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))

We need to define a background colour dictionary, `BG_COLOR`, for visualization so that the system can decide what colour to apply when highlighting entities with different tags in the text.

Accordingly, the entity background in `mark_up()` will be modified to select a colour from the dictionary using the tag as the key (i.e. `BG_COLOR[tag]`).
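
Note that a plain `BG_COLOR[tag]` lookup raises a `KeyError` for any tag that is missing from the dictionary. A possible safeguard, sketched below, is to fall back to a default colour; `DEFAULT_COLOR` and `background_for()` are hypothetical names for this illustration and not part of the workshop code.

```python
BG_COLOR = {"PLNAME": "#feca74", "GEONOUN": "#9cc9cc"}
DEFAULT_COLOR = "#dddddd"  # hypothetical fallback, not part of the workshop code

def background_for(tag):
    """Look up the highlight colour for `tag`, falling back for unknown tags."""
    return BG_COLOR.get(tag, DEFAULT_COLOR)

print(background_for("GEONOUN"))  # colour defined in the dictionary
print(background_for("LOCADV"))   # not defined yet, so the fallback is used
```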

In [None]:
BG_COLOR = {'PLNAME':'#feca74','GEONOUN': '#9cc9cc'}
# Marking up the token for visualization
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: {BG_COLOR[tag]}; padding: 0.1em 0.1em; margin: 0 0.15em; border-radius: 0.23em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"

Now we read the geo nouns saved in `geo_feature_nouns.txt` into a list and produce the tagged list with the extracted geo nouns using the `GEONOUN` tag. 

In [None]:
# Read the geo feature nouns into a list and expand them with inflections
geonouns = get_inflections([noun.strip() for noun in open('geo_feature_nouns.txt').readlines()])
tagged_geonouns = get_tagged_list(example_text, extract_entities(example_text, geonouns), 'GEONOUN')
visualize(tagged_geonouns)

## **Step 6: Extracting and visualising multiple entity types**

Now that we can extract multiple entity types from text, we have to modify our functions to be more general. That way, we can pass any list of items of any category that we are interested in.

The `extract_entities()` function will also be rewritten to return not just the entities and their starting position but also their tags.

In [None]:
# Extracts the entities in `ent_list` from the text, with their start positions and tag
def extract_entities(text, ent_list, tag='PLNAME'):
  # Longest entries first, so longer names take precedence at a shared start
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # Require a leading space and trailing punctuation/whitespace around the entry
    for match in re.finditer(' ' + re.escape(ent) + r'[.,\s;:]', text):
      # modified to return the `tag` too...
      extracted_entities.setdefault(match.start() + 1, (ent, tag))
  return dict(sorted(extracted_entities.items()))

# extract_entities(example_text, place_names)

Also, in building the tagged list of tokens for the visualizer, we need to include the tags of the entities (`PLNAME` for placenames and GEONOUN for geo feature nouns) and `None` for other tokens.

In [None]:
# Generates a list of all tokens, tagged and untagged, for visualisation
def get_tagged_list(text, entities):
  begin, tokens_tags = 0, []
  for start, (ent, tag) in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# get_tagged_list(example_text, extracted_place_names)

We need another function `merge_entities()` to combine the entities that we have extracted and their tags into a single dictionary.

In [None]:
# merging entities
from collections import OrderedDict
def merge_entities(first_ents, second_ents):
  return OrderedDict(sorted({** second_ents, **first_ents}.items()))

Now let's try to extract, merge and visualize multiple entities (i.e. placenames and geo nouns).

The code below will tag all extracted names as place names by default

```python
extracted_placenames = extract_entities(example_text, place_names)
```

However, in the code below, we have to explicitly pass `GEONOUN` to the tag parameter

```python
extracted_geonouns = extract_entities(example_text, geonouns, tag='GEONOUN')
```

In [None]:
extracted_placenames = extract_entities(example_text, place_names)
extracted_geonouns = extract_entities(example_text, geonouns, tag='GEONOUN')

merged_entities = merge_entities(extracted_placenames, extracted_geonouns)

In [None]:
# Get list of tagged entities (or placenames) and their tags
tagged_list = get_tagged_list(example_text, merged_entities)
visualize(tagged_list)

## **Exercise:**

The `locativeAdverbs.txt` file in the working folder contains a list of locative adverbs (i.e. *above*, *homewards*, *northbound*, *southwards* etc.).

**Task 1:** Use the code below to read the list into a Python variable `loc_advs`.

```python
loc_advs = [adv.split()[0] for adv in open('locativeAdverbs.txt').readlines()]
```

In [None]:
# Type code below...


**Task 2:** Extract the locative adverbs in the text using the following code.

```python
extracted_locadvs = extract_entities(example_text, loc_advs, tag='LOCADV')
```

In [None]:
# Type code below...


**Task 3:** Modify the background colour dictionary `BG_COLOR` to add the colour for the `LOCADV` tag.

```python
BG_COLOR = {'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'LOCADV':'#f5b5cf'}
```

In [None]:
# Type code below...


**Task 4:** Visualize the locative adverbs in text with the `visualize()` function.

```python
visualize(get_tagged_list(example_text, extracted_locadvs))
```

In [None]:
# Type code below...


**Task 5:** Use the `extract_entities()`, `merge_entities()` and `visualize()` functions to extract, merge and visualize **placenames**, **geo nouns** and **locative adverbs**.

In [None]:
# Type code below...


## **Putting it all together..**

Below is the summary of the code that powers the rule-based extraction method in this notebook.

```python
!git clone https://github.com/IgnatiusEzeani/spatial_narratives_workshop.git
```

```python
!pip install lemminflect
```

```python
import re
from IPython.display import HTML
from collections import OrderedDict
from lemminflect import getLemma, getInflection
```

```python
BG_COLOR = {'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'LOCADV':'#f5b5cf'}
```

```python
# Extracts the entities in `ent_list` from the text, with their start positions and tag
def extract_entities(text, ent_list, tag='PLNAME'):
  # Longest entries first, so longer names take precedence at a shared start
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # Require a leading space and trailing punctuation/whitespace around the entry
    for match in re.finditer(' ' + re.escape(ent) + r'[.,\s;:]', text):
      # modified to return the `tag` too...
      extracted_entities.setdefault(match.start() + 1, (ent, tag))
  return dict(sorted(extracted_entities.items()))

# Merging entities
def merge_entities(first_ents, second_ents):
  return OrderedDict(sorted({** second_ents, **first_ents}.items()))

# Generates a list of all tokens, tagged and untagged, for visualisation
def get_tagged_list(text, entities):
  begin, tokens_tags = 0, []
  for start, (ent, tag) in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# Marking up the token for visualization
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: {BG_COLOR[tag]}; padding: 0.05em 0.05em; margin: 0 0.15em;  border-radius: 0.55em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"

# generate html formatted text 
def visualize(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.5; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += mark_up(token,tag)
  html += end_div
  return HTML(html)

# Get inflections and lemmas of geo nouns
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))
```

## **Next step...**


**Named Entity Recognition and Semantic Tagging**

With the rule-based approach, we could extract placenames, geo nouns, locative adverbs and any other category of items in a list.

However, it is limited in a number of ways.
* It requires an exhaustive list of place names which is difficult to build for different types of writings.
* Hand-crafted rules for all possible scenarios will need to be developed
  - e.g. spelling errors, capitalizations, inflections etc.
  - Over-lapping instances ('Eamont' vs 'Eamont Bridge')
* It will be more difficult to extract references to time and date as well as sentiments and emotions.
* The approach will not generalize well with other corpora

In the next section, we will adapt some existing tools, a named entity recognizer and a semantic tagger, to try to mitigate some of the challenges above.