<a href="https://colab.research.google.com/github/IgnatiusEzeani/spatial_narratives_workshop/blob/main/spatial_narrative_demo1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extracting Spatial Entities from text (1)**

## Task Description:

![](https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/img/from_penrith_both.png)

Assuming we know nothing about the geography of the place(s) described by the corpus, what can we learn about it? In particular:
* **What places are there?** These can be:
 * `Toponyms` (*Keswick*, *Pooley Bridge*, *the River Lowther*)
 * `Geographical features` (*the town*, *a hill*, *the road*)
 * `Locative adverbs` (*above*, *north-of*, *eastwards*, *here*, *there*)
 

## The Rule-Based Method
Our aim in these exercises is to extract and mark up these spatial elements in text as shown. We will walk through building the foundations of a baseline extraction tool for *placenames*, *geographic feature nouns*, *locative adverbs* and any other entity category for which we have a list.

This notebook focuses on developing a **rule-based** method that uses regular expressions to extract the entities found in a list.

## **Step 1: Downloading the workshop materials**
Let's download (clone) the resources for the workshop from the [Spatial Narratives Demo](https://github.com/SpaceTimeNarratives/demo)  GitHub repository.

In [None]:
!git clone https://github.com/SpaceTimeNarratives/demo.git

<div>
<img src="https://raw.githubusercontent.com/SpaceTimeNarratives/demo/main/img/file_structure.png" width="300"/>
</div>

The `demo` directory contains an example file `example_text.txt`. Our aim is to read the file, display the text and identify all the place names mentioned in it.

### Changing into the working directory
Everything we need for this exercise can be found in the working directory (or folder) named *demo*. We will use the `os` (operating system) library which contains all the useful functions we may need to manage our folders and files programmatically. 

Here we use the `chdir()` (change directory) function to move into our working directory and the `listdir()` function to list its contents.

Type (or copy and paste) the code below in the next cell and run.

```python
import os
os.chdir('demo/')
os.listdir()
```

In [None]:
# Type or paste the command below:


### Viewing the example text
Our working folder contains the example text file *example_text.txt*, which you can double-click to read its contents in another pane.

However, we can programmatically read it into a *variable* named `example_text` by using the `open` function. A variable is a named container that stores a value so that we can reuse it (and change it) while the program runs.

*Type (or copy and paste) the command below into the next code cell:* 

```python
example_text =  open('example_text.txt').read()
example_text
```
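
A slightly more defensive way to read a file is to use a `with` block, which closes the file as soon as the text has been read, and to state the encoding explicitly. The sketch below is only an illustration: it writes a small temporary file (`sample.txt`, a made-up name) so that it can run anywhere, and `utf-8` is an assumption about how the workshop file is encoded.

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
# 'sample.txt' is a hypothetical name, not one of the workshop files.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('From Penrith to Pooley Bridge.')

# The `with` block closes the file automatically after reading.
with open(sample_path, encoding='utf-8') as f:
    sample_text = f.read()

print(sample_text)
```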

In [None]:
# Type or paste the command below:


## **Step 2: Rule-Based Placename Extraction**
In this section, we will apply a rule-based approach that uses regular expressions (regex) and a combination of other techniques to extract and visualize place names from text. 

#### Extracting placenames
Here we think about a way to extract a known place name (e.g. `Penrith`) from the text. We will use the Python library `re` (regular expression) to build search patterns to look for in the text.
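
As a warm-up, the sketch below shows what `re.finditer` gives back for a single name. The sentence in `sample` is invented for illustration and is not the workshop text.

```python
import re

# A made-up sentence, used only to illustrate the search.
sample = 'From Penrith we went to Pooley Bridge, then back to Penrith.'

# finditer yields one match object per non-overlapping occurrence.
for match in re.finditer('Penrith', sample):
    print(match.start(), match.end(), match.group())

# Collect only the starting positions of each occurrence.
starts = [m.start() for m in re.finditer('Penrith', sample)]
```

Each match object knows where the occurrence starts and ends, which is the information we need to mark up the text later.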

So let's say we define a function (or method) called `extract_placenames(text, placenames)` such that we can give it some *text* and a list of *placenames*, and it returns all occurrences of each placename in the text.

The code below defines the `extract_placenames(text, placenames)` function.


In [None]:
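import re
def extract_placenames(text, placenames):
  # longest names first, so 'Eamont Bridge' is recorded before 'Eamont'
  placenames = sorted(set(placenames), key=len, reverse=True)
  extracted_placenames = {}
  for name in placenames:
    # re.escape() treats the name literally; it must be followed by a delimiter
    for match in re.finditer(rf'{re.escape(name)}[.,\s;:]', text):
      # keep the first (longest) name recorded at each starting position
      extracted_placenames.setdefault(match.start(), name)
  return {i:extracted_placenames[i] for i in sorted(extracted_placenames.keys())}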

In [None]:
place_names = ['Carleton Hall', 'Dunmallet', 'Eamont', 'Eamont Bridge', 
               'Hallen Fell',  'Penrith', 'Pooley Bridge', 'Shap', 'ULLESWATER',
               'Ulleswater']

extracted_place_names = extract_placenames(example_text, place_names)
extracted_place_names

The above output (`{5: 'Penrith', 31: 'Pooley Bridge', ...}`) is a dictionary of `start: placename` entries, where `start` is the character index (or position) at which the `placename` begins. The entries are listed in the order in which they are found in the text.

For example, `Penrith` was the first placename found and it starts at character index `5`. `Pooley Bridge` appears three times in the text, with starting indexes `31`, `450` and `856`. Note that the first character is at position `0` (not `1`).
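
To make the indexing concrete, the sketch below slices a short invented sentence using a hand-written `{start: name}` dictionary. Both `sample` and `found` are illustrative values, not output from the workshop text.

```python
# An invented sentence and a hand-written dictionary in the same format.
sample = 'From Penrith to Pooley Bridge.'
found = {5: 'Penrith', 16: 'Pooley Bridge'}

for start, name in found.items():
    # The span runs from the start index up to start + len(name).
    span = sample[start:start + len(name)]
    print(start, repr(span), span == name)
```

Slicing from `start` to `start + len(name)` recovers exactly the name, which is how the highlighting functions later locate each span.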

You may observe that it got all occurrences of the placenames on our list. But there are other placenames in the text, such as `King Arthur's Round Table`, `Lowther Castle`, `Martindale`, `Mayborough` and `Patterdale`. We will come back to this later.



## **Step 3: Visualizing the outputs**
It is often a good idea to present a graphic representation of our outputs for better visualization and understanding of how our process works.

Using the `HTML` class from the `display` module of the `IPython` library, we define functions that render the text as HTML with the extracted place names highlighted.

#### **Visualizing the plain text**

Visualizing the untagged example text is easy. We simply pass the `example_text` variable to `HTML()`.

In [None]:
from IPython.display import HTML
HTML(example_text)

#### **Visualizing the extracted place names**
This is a little more challenging but will follow the same principle. Having extracted the place names, we can define functions that will help us 'mark-up' or highlight the extracted place names from the plain text based on their starting index positions and spans so we can visualize it in HTML format.

Let's call the first function `get_tagged_list()`. It will parse the text with the dictionary of extracted place names and identify the spans that will be tagged as place names in the text. Its output is a list of tuples containing text spans and tags (either `PLNAME` or `None`).

In [None]:
# split the text into a list of (span, tag) tuples
def get_tagged_list(text, ext_pl_names):
  begin, tokens_tags = 0, []
  for start, plname in ext_pl_names.items():
    length, ent, tag = len(plname), plname, 'PLNAME'
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+length], tag))
      begin = start+length
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# get_tagged_list(example_text, extracted_place_names)
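
To see the shape of the output, here is the same logic applied to a one-sentence toy text. The function is repeated under the hypothetical name `tag_spans` only so that the block runs on its own; the sentence and the dictionary are invented.

```python
def tag_spans(text, found, tag='PLNAME'):
    # Same idea as get_tagged_list(): alternate untagged and tagged chunks.
    begin, chunks = 0, []
    for start, name in sorted(found.items()):
        if begin <= start:
            chunks.append((text[begin:start], None))
            chunks.append((text[start:start + len(name)], tag))
            begin = start + len(name)
    chunks.append((text[begin:], None))  # trailing untagged text
    return chunks

sample = 'From Penrith to Pooley Bridge.'
chunks = tag_spans(sample, {5: 'Penrith', 16: 'Pooley Bridge'})
for chunk in chunks:
    print(chunk)
```

Joining the first element of every tuple reproduces the original sentence, so no text is lost by the tagging step.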

The second function, `mark_up()`, takes a `token` (actually a span of characters) and a tag (e.g. `PLNAME` for place name) and highlights the text with a given background colour in HTML format.

In [None]:
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: #feca74 ; padding: 0.1em 0.1em; margin: 0 0.15em; border-radius: 0.23em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"
# HTML(mark_up('Penrith', tag='PLNAME'))
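
One thing `mark_up()` does not do is escape characters that are special in HTML. If a text contained `<`, `>` or `&`, the browser could misread it as markup. The sketch below shows one possible safeguard using `html.escape` from the standard library. `mark_up_safe` is a hypothetical name and the simplified styling is an assumption; it is not the notebook's own function.

```python
from html import escape

def mark_up_safe(token, tag=None):
    # Escape first so that characters such as < and & are shown literally.
    safe = escape(token)
    if tag:
        # Simplified styling, assumed here for illustration only.
        return ('<span style="background: #feca74; padding: 0.1em;">'
                f'{safe} <b>{escape(tag)}</b></span>')
    return safe

print(mark_up_safe('Shap & Penrith'))
print(mark_up_safe('Penrith', tag='PLNAME'))
```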

Finally, we piece everything together with the function `visualize()`, which marks up the output of `get_tagged_list()` with the `mark_up()` function.

In [None]:
# generate html formatted text 
def visualize(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.0; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += mark_up(token,tag)
  html += end_div
  return HTML(html)

visualize(get_tagged_list(example_text, extracted_place_names))

## **Step 4: Extracting with a gazetteer**
Our examples so far are only able to extract and visualise a few place names. To have a chance of extracting all the place names in the text, we will need a more comprehensive list.

So for this task, we will apply the techniques and processes defined above with a list of the Lake District place names from the gazetteer created by [Source]() to identify and extract mentions of the place names in the same text.

In [None]:
# Read LD placenames into a list
ld_place_names = [name.strip() for name in open('LD_placenames.txt').readlines()]

# Extract place names from the same example text 
extracted_place_names = extract_placenames(example_text, ld_place_names)

# Get list of tagged entities (or placenames) and their tags
tagged_list = get_tagged_list(example_text, extracted_place_names)

visualize(tagged_list)

Okay, so far our method works quite well if we have what we are looking for in the list exactly as it appears in the text. Otherwise, it may wobble a bit. 

A typical example in the text above is <mark>Lowther</mark> and <mark>Castle</mark> being marked separately instead of as <mark>Lowther Castle</mark>.
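
One possible workaround, sketched here as an assumption rather than as part of the workshop code, is to join two extracted names when they are separated by exactly one space. `merge_adjacent` is a hypothetical helper that works on the `{start: name}` dictionary format used above, and the sentence is invented.

```python
def merge_adjacent(text, found):
    # found: {start_index: name}; join names separated by a single space.
    merged, prev_start = {}, None
    for start, name in sorted(found.items()):
        if prev_start is not None:
            prev_end = prev_start + len(merged[prev_start])
            if start == prev_end + 1 and text[prev_end] == ' ':
                # Extend the previous entry to cover both names.
                merged[prev_start] = text[prev_start:start + len(name)]
                continue
        merged[start] = name
        prev_start = start
    return merged

sample = 'We passed Lowther Castle near Penrith.'
found = {10: 'Lowther', 18: 'Castle', 30: 'Penrith'}
print(merge_adjacent(sample, found))
```

This is a blunt rule: it would also join two unrelated names that happen to sit next to each other, so it should be treated as a starting point only.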

## **Step 5: Extracting geographical feature nouns with a list**
To extract geographical features from a list of feature nouns (e.g. `castle`, `ridge`, `forest`, `village`, `river` etc), we will apply the same method as placenames.

We will modify the `extract_placenames()` function to be more generic for all entity classes i.e. `extract_entities()`.

Also, to enable us to apply a new tag `GEONOUN`, let's modify the `get_tagged_list()` function to accept a `tag` parameter that defaults to `PLNAME` but supports other tags.

In [None]:
def extract_entities(text, ent_list):
  # longest entities first, so longer names take priority at a given position
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # the entity must follow a space and be followed by a delimiter
    for match in re.finditer(rf' {re.escape(ent)}[.,\s;:]', text):
      extracted_entities.setdefault(match.start()+1, ent)
  return {i:extracted_entities[i] for i in sorted(extracted_entities.keys())}

extract_entities(example_text, place_names)

Include `tag` in the `get_tagged_list()` parameters to enable the use of other tags. `tag='PLNAME'` makes placenames the default, but explicitly passing another tag (e.g. `GEONOUN`) will override it.

In [None]:
# split the text into a list of (span, tag) tuples
def get_tagged_list(text, entities, tag='PLNAME'):
  begin, tokens_tags = 0, []
  for start, ent in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# get_tagged_list(example_text, extracted_place_names)

We need inflections and lemmas of all the words in the geo nouns list. For example, if we have `road` in the list, then we will need `roads` as well, and vice versa.

So we will install the `lemminflect` library and define the `get_inflections()` function.

In [None]:
!pip install lemminflect

In [None]:
# Expand list with inflections and lemmas
from lemminflect import getLemma, getInflection
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))
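
To see what kind of expansion this produces without installing anything, the sketch below uses a deliberately naive pluralisation rule. It is a stand-in for illustration only and is not how `lemminflect` works; `naive_plural` and `expand_nouns` are hypothetical names.

```python
def naive_plural(noun):
    # Very rough English rules; irregular nouns are not handled.
    if noun.endswith(('s', 'x', 'z', 'ch', 'sh')):
        return noun + 'es'
    if noun.endswith('y') and noun[-2:-1] not in 'aeiou':
        return noun[:-1] + 'ies'
    return noun + 's'

def expand_nouns(nouns):
    # Return the original nouns plus their naive plurals, without duplicates.
    expanded = set()
    for noun in nouns:
        noun = noun.strip()
        expanded.update([noun, naive_plural(noun)])
    return sorted(expanded)

print(expand_nouns(['road', 'valley', 'marsh', 'city']))
```

A dedicated library is still preferable in practice, because simple suffix rules fail on irregular forms.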

We need to define a background colour dictionary `BG_COLOR` for visualization so that the system can decide what colour to apply when highlighting entities with different tags in the text. 

Accordingly, the entity background in `mark_up()` will be modified to select a colour from the dictionary using the tag as the key (i.e. `BG_COLOR[tag]`).

In [None]:
BG_COLOR = {'PLNAME':'#feca74','GEONOUN': '#9cc9cc'}
# Marking up the token for visualization
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: {BG_COLOR[tag]} ; padding: 0.1em 0.1em; margin: 0 0.15em; border-radius: 0.23em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"
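
Note that `BG_COLOR[tag]` raises a `KeyError` for any tag that is missing from the dictionary. A hedged alternative is to fall back to a default colour with `dict.get`. The grey `#dddddd` and the name `colour_for` below are assumptions chosen for illustration.

```python
BG_COLOR = {'PLNAME': '#feca74', 'GEONOUN': '#9cc9cc'}
DEFAULT_COLOR = '#dddddd'  # assumed fallback colour, not from the notebook

def colour_for(tag):
    # dict.get returns the fallback instead of raising KeyError.
    return BG_COLOR.get(tag, DEFAULT_COLOR)

print(colour_for('PLNAME'))
print(colour_for('LOCADV'))
```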

Now we read the geo nouns saved in `geo_feature_nouns.txt` into a list and produce the tagged list with the extracted geo nouns using the `GEONOUN` tag. 

In [None]:
# Read the geo feature nouns into a list and add their inflections
geonouns = get_inflections([noun.strip() for noun in open('geo_feature_nouns.txt').readlines()])
tagged_geonouns = get_tagged_list(example_text, extract_entities(example_text, geonouns), 'GEONOUN')
visualize(tagged_geonouns)

## **Step 6: Extracting and visualising multiple entity types**

Now that we can extract multiple entities from text, we have to modify our functions to be more generalizable. That way, we can pass any list of items of any category that we are interested in.

The `extract_entities()` function will also be rewritten to return not just the entities and their starting position but also their tags.

In [None]:
# Extracts entities with their starting positions and tags
def extract_entities(text, ent_list, tag='PLNAME'):
  # longest entities first, so longer names take priority at a given position
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # the entity must follow a space and be followed by a delimiter
    for match in re.finditer(rf' {re.escape(ent)}[.,\s;:]', text):
      # modified to return the `tag` too...
      extracted_entities.setdefault(match.start()+1, (ent, tag))
  return {i:extracted_entities[i] for i in sorted(extracted_entities.keys())}

# extract_entities(example_text, place_names)

Also, in building the tagged list of tokens for the visualizer, we need to include the tags of the entities (`PLNAME` for placenames and `GEONOUN` for geo feature nouns) and `None` for other tokens.

In [None]:
# Generates a list of all tokens, tagged and untagged, for visualisation
def get_tagged_list(text, entities):
  begin, tokens_tags = 0, []
  for start, (ent, tag) in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# get_tagged_list(example_text, extracted_place_names)

We need another function `merge_entities()` to combine the entities that we have extracted and their tags into a single dictionary.

In [None]:
# merging entities
from collections import OrderedDict
def merge_entities(first_ents, second_ents):
  return OrderedDict(sorted({** second_ents, **first_ents}.items()))
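
The order of the two dictionaries inside `{**second_ents, **first_ents}` matters: when both contain the same starting position, the entry from `first_ents` wins. The sketch below repeats the function so that it runs on its own and uses small hand-written dictionaries with a contrived collision at position `16`.

```python
from collections import OrderedDict

def merge_entities(first_ents, second_ents):
    # Later unpacking overrides earlier, so first_ents takes precedence.
    return OrderedDict(sorted({**second_ents, **first_ents}.items()))

# Hand-written examples; position 16 appears in both on purpose.
names = {5: ('Penrith', 'PLNAME'), 16: ('Pooley Bridge', 'PLNAME')}
nouns = {16: ('Pooley', 'GEONOUN'), 37: ('river', 'GEONOUN')}

merged = merge_entities(names, nouns)
for start, entity in merged.items():
    print(start, entity)
```

Passing the placenames as the first argument therefore gives them priority over geo nouns that start at the same character.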

Now let's try to extract, merge and visualize multiple entities (i.e. placenames and geo nouns).

The code below will tag all extracted names as place names by default:

```python
extracted_placenames = extract_entities(example_text, place_names)
```

However, in the code below, we have to explicitly pass `GEONOUN` to the tag parameter

```python
extracted_geonouns = extract_entities(example_text, geonouns, tag='GEONOUN')
```

In [None]:
extracted_placenames = extract_entities(example_text, place_names)
extracted_geonouns = extract_entities(example_text, geonouns, tag='GEONOUN')

merged_entities = merge_entities(extracted_placenames, extracted_geonouns)

In [None]:
# Get list of tagged entities (or placenames) and their tags
tagged_list = get_tagged_list(example_text, merged_entities)
visualize(tagged_list)

## **Exercise:**

The `locativeAdverbs.txt` file in the working folder contains a list of locative adverbs (e.g. *above*, *homewards*, *northbound*, *southwards*, etc.).

**Task 1:** Use the code below to read the list into a Python variable `loc_advs`.

```python
loc_advs = [adv.split()[0] for adv in open('locativeAdverbs.txt').readlines()]
```

In [None]:
# Type code below...


**Task 2:** Extract the locative adverbs in the text using the following code.

```python
extracted_locadvs = extract_entities(example_text, loc_advs, tag='LOCADV')
```

In [None]:
# Type code below...


**Task 3:** Modify the background colour dictionary `BG_COLOR` to add the colour for the `LOCADV` tag.

```python
BG_COLOR = {'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'LOCADV':'#f5b5cf'}
```

In [None]:
# Type code below...


**Task 4:** Visualize the locative adverbs in text with the `visualize()` function.

```python
visualize(get_tagged_list(example_text, extracted_locadvs))
```

In [None]:
# Type code below...


**Task 5:** Use the `extract_entities()`, `merge_entities()` and `visualize()` functions to extract, merge and visualize **placenames**, **geo nouns** and **locative adverbs**.

In [None]:
# Type code below...


## **Putting it all together..**

Below is the summary of the code that powers the rule-based extraction method in this notebook.

```python
!git clone https://github.com/SpaceTimeNarratives/demo.git
```

```python
!pip install lemminflect
```

```python
import re
from IPython.display import HTML
from collections import OrderedDict
from lemminflect import getLemma, getInflection
```

```python
BG_COLOR = {'PLNAME':'#feca74','GEONOUN': '#9cc9cc', 'LOCADV':'#f5b5cf'}
```

```python
# Extracts entities with their starting positions and tags
def extract_entities(text, ent_list, tag='PLNAME'):
  # longest entities first, so longer names take priority at a given position
  ent_list = sorted(set(ent_list), key=len, reverse=True)
  extracted_entities = {}
  for ent in ent_list:
    # the entity must follow a space and be followed by a delimiter
    for match in re.finditer(rf' {re.escape(ent)}[.,\s;:]', text):
      # modified to return the `tag` too...
      extracted_entities.setdefault(match.start()+1, (ent, tag))
  return {i:extracted_entities[i] for i in sorted(extracted_entities.keys())}

# Merging entities
def merge_entities(first_ents, second_ents):
  return OrderedDict(sorted({** second_ents, **first_ents}.items()))

# Generates a list of all tokens, tagged and untagged, for visualisation
def get_tagged_list(text, entities):
  begin, tokens_tags = 0, []
  for start, (ent, tag) in entities.items():
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+len(ent)], tag))
      begin = start+len(ent)
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

# Marking up the token for visualization
def mark_up(token, tag=None):
  if tag:
    begin_bkgr = f'<bgr class="entity" style="background: {BG_COLOR[tag]}; padding: 0.05em 0.05em; margin: 0 0.15em;  border-radius: 0.55em;">'
    end_bkgr = '\n</bgr>'
    begin_span = '<span style="font-size: 0.8em; font-weight: bold; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"{begin_bkgr}{token}{begin_span}{tag}{end_span}{end_bkgr}"
  return f"{token}"

# generate html formatted text 
def visualize(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.5; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += mark_up(token,tag)
  html += end_div
  return HTML(html)

# Get inflections and lemmas of geo nouns
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))
```

## **Next step...**


#### **Named Entity Recognition and Semantic Tagging**

With the rule-based approach, we could extract the placenames, geo nouns, locative adverbs and any other category of items in a list. 

However, it is limited in a number of ways.
* It requires an exhaustive list of place names which is difficult to build for different types of writings.
* Hand-crafted rules for all possible scenarios will need to be developed
  - e.g. spelling errors, capitalizations, inflections etc.
  - Overlapping instances ('Eamont' vs 'Eamont Bridge')
* It will be more difficult to extract references to time and date as well as sentiments and emotions.
* The approach will not generalize well with other corpora
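
The capitalisation problem mentioned in the list above can be sketched with a case-insensitive search. This is an illustration under assumptions: the sentence is invented, and the pattern uses the `\b` word boundary rather than the delimiter class used in `extract_entities()`.

```python
import re

# Invented sentence with the same name in two capitalisations.
sample = 'ULLESWATER lies ahead. We reached Ulleswater at noon.'

# One pattern, matched regardless of case, with word boundaries on both sides.
pattern = re.compile(r'\bulleswater\b', flags=re.IGNORECASE)
found = {m.start(): m.group() for m in pattern.finditer(sample)}
print(found)
```

A single list entry then covers both spellings, although this does nothing for spelling errors or inflections, which need separate rules.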

In the next section, we will adapt some existing tools, a named entity recognizer and a semantic tagger, to try to mitigate some of the challenges above. 