# **Extracting Spatial Entities from text (3)**
---

## Task Description:

![](https://raw.githubusercontent.com/IgnatiusEzeani/spatial_narratives_workshop/main/img/from_penrith_tagged.png)

Assuming we know nothing about the geography of the place(s) described by the corpus, what can we learn about it? In particular:
* **What places are there?** These can be:
 * `Toponyms` (*Keswick*, *Pooley Bridge*, *the River Lowther*)
 * `Geographical features` (*the town*, *a hill*, *the road*)
 * `Locative adverbs` (*above*, *north-of*, *eastwards*, *here*, *there*)

## Using spaCy's `EntityRuler` to build a rule-based extraction model
In **Demo 2**, we added the PyMUSAS Semantic Tagger to the rule-based pipeline for extracting spatial elements from text.

While this broadened the range of spatial classes we could extract, it also meant managing too many components within the pipeline.

The [EntityRuler](https://spacy.io/api/entityruler) is a spaCy component that lets us add named entities based on pattern dictionaries, which makes it easy to combine rule-based and statistical named entity recognition for even more powerful pipelines.

In this demo, we will demonstrate how to use the EntityRuler to build a much simpler and more efficient extraction pipeline.

## **Step 1: Downloading the workshop materials**
Let's download (clone) the resources for the workshop from the [Spatial Narrative Demo](https://github.com/SpaceTimeNarratives/demo) GitHub repository.

In [None]:
!git clone https://github.com/SpaceTimeNarratives/demo.git

As in the previous demos, the `demo` directory contains the example file `example.txt` and everything we need for this exercise.

Run the code below to change to the working directory `demo/` and list its content.

In [None]:
# Type or paste the command below:
import os
os.chdir('demo/')
print(os.listdir('.'))  # list the contents of the working directory

Install the required libraries listed in `requirements.txt`.

In [None]:
pip -q install -r requirements.txt

Run `%run functions.py` to define the required functions.

In [None]:
%run functions.py

## **Step 2: Read the text and load the entity lists**
First, let's load the example text from `example.txt` into the variable `example_text`...

In [None]:
example_text = open('example.txt').read()

Load the place names from the `LD_placenames.txt` file. Also, ensure that each name is available in both *title* case (i.e. the first character of each word is capitalised) and *upper* case. Those are the forms the names are most likely to take in the text.

In [None]:
place_names = [name.strip().title().replace("'S", "'s") for name in open('LD_placenames.txt').readlines()] #read and convert to title case 
place_names += [name.upper() for name in place_names] #retain the upper case versions
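The `.replace("'S", "'s")` step is there because Python's `str.title()` also capitalises the letter after an apostrophe. The sketch below shows the effect on a single name. The helper `normalise_place_name` and the sample name are illustrative assumptions, not part of `functions.py` and not necessarily an entry in `LD_placenames.txt`.

```python
def normalise_place_name(raw_name):
    """Return the title-case and upper-case forms of one raw place name."""
    # str.title() turns "john's" into "John'S", so the possessive is repaired
    title_cased = raw_name.strip().title().replace("'S", "'s")
    return title_cased, title_cased.upper()

# Hypothetical input line, including the trailing newline from readlines()
title_form, upper_form = normalise_place_name("st john's in the vale\n")
print(title_form)  # St John's In The Vale
print(upper_form)  # ST JOHN'S IN THE VALE
```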

## **Step 3: Building the rules with `EntityRuler`**
Let's start by importing `spacy`, which we will use to build the model.

In [None]:
import spacy

Create a blank `spacy` English model

In [None]:
nlp = spacy.blank("en")

Create the EntityRuler

In [None]:
ruler = nlp.add_pipe("entity_ruler")

Define the patterns for the `EntityRuler` by labelling all the names with the tag `PLNAME`

In [None]:
patterns = [{"label": "PLNAME", "pattern": plname} for plname in set(place_names)]
ruler.add_patterns(patterns)
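String patterns like these are matched against the exact token text by default, which is why both the title-case and upper-case variants were added to `place_names`. The following is a minimal sketch on a toy sentence, using separate `nlp_toy` and `ruler_toy` names (assumptions for illustration) so that the notebook's own `nlp` and `patterns` are left untouched.

```python
import spacy

# A separate blank pipeline so the notebook's `nlp` object is not modified
nlp_toy = spacy.blank("en")
ruler_toy = nlp_toy.add_pipe("entity_ruler")
ruler_toy.add_patterns([
    {"label": "PLNAME", "pattern": "Pooley Bridge"},
    {"label": "PLNAME", "pattern": "Keswick"},
    {"label": "PLNAME", "pattern": "KESWICK"},
])

doc_toy = nlp_toy("We walked from Pooley Bridge to KESWICK, not keswick.")
found = [(ent.text, ent.label_) for ent in doc_toy.ents]
print(found)  # [('Pooley Bridge', 'PLNAME'), ('KESWICK', 'PLNAME')]
```

The lower-case `keswick` is not matched because no pattern has that exact spelling.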

## **Step 4: Extracting placenames from example text**
Now we are ready to extract and visualize place names from text using the spaCy pipeline. Let's start by processing the text with the `nlp` pipeline.

In [None]:
doc = nlp(example_text)

We can look at the place names that were extracted from our example text. The code below displays `<place name> <start char index> <end char index> <label>` on each line for all place names found.

In [None]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
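For longer texts, a frequency count can be easier to read than a line per match. The sketch below assumes a hypothetical helper, `tally_entities`, that accepts any iterable of objects with `text` and `label_` attributes. It is demonstrated with a stand-in tuple type so that it runs on its own; in the notebook it could be called as `tally_entities(doc.ents)`.

```python
from collections import Counter, namedtuple

def tally_entities(ents):
    """Count how often each (text, label) pair occurs among the entities."""
    return Counter((ent.text, ent.label_) for ent in ents)

# Stand-in for spaCy Span objects, with made-up matches for illustration
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
sample_ents = [
    FakeEnt("Keswick", "PLNAME"),
    FakeEnt("Penrith", "PLNAME"),
    FakeEnt("Keswick", "PLNAME"),
]

counts = tally_entities(sample_ents)
print(counts.most_common(1))  # [(('Keswick', 'PLNAME'), 2)]
```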

Let's use `displacy` to render a visualization of the extracted place names. We can use our pre-defined color dictionary, `BG_COLOR`, to highlight the place names in the text.

In [None]:
from spacy import displacy
options = {'colors':BG_COLOR}
displacy.render(doc, style="ent", jupyter=True, options=options)

## **Step 5: Extracting geographic feature nouns**
As in the previous demos, we are also interested in extracting other features besides the place names. In this section, we will add the geo feature nouns, with an appropriate label, to our `patterns` list.

Let's start by reading the geo nouns from file and getting their lemmas and inflections. The code below builds a list from the 261 geo nouns and their inflections (`mountains`, `islands`, `pikes`, `towers`, `bays`, etc.).

In [None]:
geonouns = get_inflections([noun.strip() for noun in open('geo_feature_nouns.txt').readlines()])

We then need to update the `patterns` list (which currently holds the place names) with the geo nouns labelled as `GEONOUN`. This requires re-initialising the model with a new `EntityRuler` component.

In [None]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns += [{"label": "GEONOUN", "pattern": noun} for noun in geonouns]
ruler.add_patterns(patterns)

Processing the `example_text` and visualising entities...

In [None]:
doc = nlp(example_text)
displacy.render(doc, style="ent", jupyter=True, options=options)

---

## **Exercise**
As in the previous demos, we can also include the locative adverbs (e.g. *above*, *homewards*, *northbound*, *southwards*) listed in the `locativeAdverbs.txt` file in the visualisation.

**Task 1:** Use the code below to read the list into a Python variable `loc_advs`.

```python
loc_advs = [adv.split()[0] for adv in open('locativeAdverbs.txt').readlines()]
```
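Note that `adv.split()[0]` raises an `IndexError` on a blank line. If that happens, a more defensive variant could skip such lines, as in the sketch below. The helper name `read_first_words` and the sample contents are assumptions for illustration; the real layout of `locativeAdverbs.txt` may differ.

```python
import io

def read_first_words(lines):
    """Return the first whitespace-separated token of each non-blank line."""
    return [line.split()[0] for line in lines if line.strip()]

# In-memory stand-in for the file, including a blank line and an extra column
sample_file = io.StringIO("above\nhomewards\n\nnorthbound adv\n")
sample_advs = read_first_words(sample_file)
print(sample_advs)  # ['above', 'homewards', 'northbound']
```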

In [None]:
# Type code below...


**Task 2:** Create a blank spaCy model and add a new `EntityRuler` to the pipeline using the code below:
```python
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
```



In [None]:
# Type code below...


**Task 3:** Update the `patterns` list to include the locative adverbs using the following code.

```python
patterns += [{"label": "LOCADV", "pattern": adv} for adv in loc_advs]
ruler.add_patterns(patterns)
```

In [None]:
# Type code below...


**Task 4:** Process the example text with the new `nlp` model and visualise as done in *Step 5*.

In [None]:
# Type code below...
doc = nlp(example_text)
displacy.render(doc, style="ent", jupyter=True, options=options)

## **Step 6: Adding the `EntityRuler` to an existing pipeline**
So far, we have used a blank NLP model for our pipeline. However, we can leverage a pre-trained model by adding the `EntityRuler` to the model's pipeline. This will allow us to extract other entities not captured in our defined `patterns`.

So instead of `nlp = spacy.blank("en")`, let's try loading spaCy's small model `en_core_web_sm` as below:

In [None]:
nlp = spacy.load("en_core_web_sm")

Then let's build the `patterns` list with the labels we are interested in i.e. `PLNAME`, `GEONOUN` and `LOCADV`.

In [None]:
patterns = [{"label": "PLNAME", "pattern": plname}
            for plname in place_names
            ] + [{"label": "GEONOUN", "pattern": noun}
            for noun in geonouns
            ] + [{"label": "LOCADV", "pattern": adverb}
            for adverb in loc_advs]

Finally, let's add the entity ruler with the patterns to the model's pipeline and visualise the result.

In [None]:
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

In [None]:
doc = nlp(example_text)
displacy.render(doc, style="ent", jupyter=True, options=options)

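Well, this doesn't look good 😞.

If you look at the components in the model's processing pipeline, you can see why: the `entity_ruler` was added *after* the existing `ner` component. By default, the `EntityRuler` does not overwrite entities that have already been set, so the NER predictions took precedence over our rules wherever the two overlapped.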
```
['tok2vec','tagger','parser','attribute_ruler','lemmatizer', 'ner', 'entity_ruler']
```

In [None]:
nlp.pipe_names

So, let's remove the `entity_ruler` and add it *before* the `ner`...

In [None]:
nlp.remove_pipe("entity_ruler")

In [None]:
ruler = nlp.add_pipe("entity_ruler", before='ner')
ruler.add_patterns(patterns)

Then visualize...

In [None]:
doc = nlp(example_text)
displacy.render(doc, style="ent", jupyter=True, options=options)

This is now better 🙂. We can now extract matching patterns for place names, geo nouns and locative adverbs, as well as the other entities recognised by the model's `ner` component.
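The effect of component order can be reproduced without downloading a model. In the sketch below, a first `entity_ruler` stands in for the statistical `ner` component (an assumption made purely for illustration), and a second ruler carries our own label. It also shows the `overwrite_ents` setting, which lets a ruler that runs later replace overlapping entities instead of being moved earlier in the pipeline.

```python
import spacy

def label_keswick(overwrite):
    """Label 'Keswick' with two rulers; the second one optionally overwrites."""
    nlp_toy = spacy.blank("en")
    # Stand-in for a statistical NER component: runs first and assigns GPE
    first = nlp_toy.add_pipe("entity_ruler", name="ner_stand_in")
    first.add_patterns([{"label": "GPE", "pattern": "Keswick"}])
    # Our own ruler runs second, so it sees the entity that is already set
    second = nlp_toy.add_pipe(
        "entity_ruler", name="our_ruler", config={"overwrite_ents": overwrite}
    )
    second.add_patterns([{"label": "PLNAME", "pattern": "Keswick"}])
    doc_toy = nlp_toy("We reached Keswick at noon.")
    return [(ent.text, ent.label_) for ent in doc_toy.ents]

labels_default = label_keswick(overwrite=False)
labels_overwrite = label_keswick(overwrite=True)
print(labels_default)    # [('Keswick', 'GPE')]
print(labels_overwrite)  # [('Keswick', 'PLNAME')]
```

Which approach suits a given corpus is a judgment call: placing the ruler before `ner` lets the statistical model take the rule-based entities into account, while `overwrite_ents` simply replaces its overlapping predictions afterwards.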