# Spatial Narratives Project

## **Task Description: Place Name Extraction from Text**

### Background

![](https://raw.githubusercontent.com/IgnatiusEzeani/spatial_narratives_workshop/main/img/from_penrith_plain.png)

Assuming we know nothing about the geography of the place(s) described by the corpus, what can we learn about it? In particular:
* **What places are there?** These can be:
 * `Toponyms` (Keswick, Pooley Bridge, the River Lowther, etc)
 * `Geographical features` (the town, a hill, the road)
 * `‘Between’ places` (‘between A and B there are nice views of the lake’, ‘on the way to A we did something’)
* **What are those places like?** (i.e. how are they described?)
* **What events happened at those places?**
* **How are the places mentioned related to each other?**
* **What can we infer about places by bringing this information together?** For example:
 * If one text says ‘At Pooley Bridge we hired a boat to row on the lake’ and another says ‘Pooley Bridge is at the head of Ullswater’, can we infer that at Pooley Bridge you can hire boats to row on Ullswater?
 * If one text says that ‘On the road from Pooley Bridge to Penrith there is a bridge after three miles’ and another says that ‘The road from Pooley Bridge to Penrith crosses the River Lowther’, can we draw these together to infer that the bridge after three miles is over the Lowther?

### Methods
Our aim in this exercise is to extract and mark up these spatial elements in text, as shown below.

<div>
<img src="https://raw.githubusercontent.com/IgnatiusEzeani/spatial_narratives_workshop/main/img/from_penrith_tagged.png" width="700"/>
</div>

<!-- ![Extracted spatial entities](https://raw.githubusercontent.com/IgnatiusEzeani/spatial_narratives_workshop/main/img/from_penrith_tagged.png) -->

In this exercise, we will go through various methods for extracting place names and geographic feature nouns, including:
 - **Rule Based Method** (using regular expression)
 - **Named Entity Recognition** (using spaCy)
 - **Semantic Tagging** (using PyMUSAS)



 Let's begin...

# Rule-Based method
In this section, we will apply a rule-based approach that combines regular expressions (regex) with other techniques to extract and visualize place names from text.

## **Step 1: Downloading the workshop materials**
Let's download (clone) the resources for the workshop from the [Spatial Narrative Workshop](https://github.com/IgnatiusEzeani/spatial_narratives_workshop)  GitHub repository.

In [1]:
!git clone https://github.com/IgnatiusEzeani/spatial_narratives_workshop.git

Cloning into 'spatial_narratives_workshop'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 71 (delta 15), reused 62 (delta 9), pack-reused 0
Unpacking objects: 100% (71/71), 2.13 MiB | 9.25 MiB/s, done.


The `spatial_narratives_workshop` directory contains an example file, `example_text.txt`. Our aim is to read the file, display its text, and identify all the place names mentioned in it.

Let's change into the `spatial_narratives_workshop` directory.

In [2]:
cd spatial_narratives_workshop/

/content/spatial_narratives_workshop


Open the `example_text.txt` file and read its content into the variable `example_text`

In [4]:
example_text =  open('example_text.txt').read()
print(example_text) #show the content of the file

From Penrith two roads lead to Pooley Bridge, about six miles distant, which spans the Eamont just at its issue from Ulleswater. Either road may be taken, be we recommend that which follows the Shap road to Eamont Bridge. Carleton Hall is near to it on the left. Cross the bridge, and take the first road to the right. At this point, on the left, are the druidical remains called King Arthur's Round Table, and Mayborough. Immediately after crossing Pooley Bridge, the road runs along the western shore of ULLESWATER.

To Patterdale, a distance of ten miles; but, before proceeding along it, the tourist would do well to take a walk of a few miles along the eastern shore, in the direction of Martindale, from several points on which he will obtain a good view of the lake. Should this deviation be made, it will be necessary to return by the same road to Pooley Bridge, where there are two small inns, at which boats, for an excursion on the water, or for fishing, may be procured if desired. A fine

## **Step 2: Extracting a placename**
Here we think about a way to extract a known place name (e.g. `Penrith`) from the text.

We start by defining a function, `extract_placename`, that can help us identify and extract a given place name from a piece of text...

In [None]:
import re

def extract_placename(text, plname):
  # match the place name followed by one delimiter (full stop, comma or whitespace)
  p = re.compile(rf'{re.escape(plname)}[.,\s]')
  for match in p.finditer(text):
    print(match.span())

In [None]:
placename = 'Penrith'
extract_placename(example_text, placename)

(5, 13)


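The output `(5, 13)` above indicates that there is one occurrence of 'Penrith' in the text and that the match starts at character position `5` and ends just before position `13`. The first character of the text is at position `0` (not `1`), and the match includes the delimiter that follows the name, so 'Penrith' itself occupies positions `5` to `11` and position `12` holds the space after it.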
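To see what the span refers to, the short sketch below runs the same kind of pattern on a made-up sentence. The `sample` string and the variable names are illustrative assumptions, not part of the workshop files. It shows that slicing the text with the span returns the name plus its trailing delimiter, and that dropping the last character recovers the name alone.

```python
import re

# Illustrative sentence; not taken from the workshop files
sample = "From Penrith two roads lead to Pooley Bridge."

# Same pattern shape as extract_placename: the name followed by one delimiter
pattern = re.compile(r"Penrith[.,\s]")
match = pattern.search(sample)

start, end = match.span()
matched = sample[start:end]        # the name plus the trailing space
name_only = sample[start:end - 1]  # drop the delimiter to recover the name

print((start, end), repr(matched), repr(name_only))
```

Slicing with `end - 1` is the same trick the later versions of the function use to store the name without its delimiter.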

### **Task 1:**

*Call the above function (`extract_placename`) on the text in `example_text` to extract mentions of `Pooley Bridge`, `Eamont`, `Eamont Bridge` and `Lowther Castle`, and discuss your observations.*

## **Step 3: Extracting with a list of placenames**
As can be observed above, we often need to extract multiple names from the text in one run. For example, we may want to identify and extract all the place names in the list `['Penrith', 'Pooley Bridge', 'Eamont', 'Eamont Bridge']`.

Let's rename our function to `extract_placenames()` and modify it to identify multiple place names from a list. We will display each place name along with the span of each of its occurrences in the text. Both the names and the text are lower-cased before matching, so the search ignores case.

In [None]:
def extract_placenames(text, plnames):
  for name in plnames:
    # match the lower-cased name followed by one delimiter
    p = re.compile(rf'{re.escape(name.lower())}[.,\s]')
    iterator = p.finditer(text.lower())
    for match in iterator:
      print(match.span(), name)

place_names = ['Penrith', 'Pooley Bridge', 'Eamont', 'Eamont Bridge']
extract_placenames(example_text, place_names)

(5, 13) Penrith
(31, 45) Pooley Bridge
(450, 464) Pooley Bridge
(856, 870) Pooley Bridge
(87, 94) Eamont
(207, 214) Eamont
(207, 221) Eamont Bridge


We now have multiple place names extracted from the text based on our list. However, we still have a small problem: the occurrence at index `207` is reported both as '`Eamont`' and as '`Eamont Bridge`'. How do we extract '`Eamont`' and '`Eamont Bridge`' as two separate places?

We will try to tackle this by:

1.   Sorting the list of names in descending order of name length. That way `Eamont Bridge` will come before `Eamont`. 
2.   Ensuring that no two placenames are extracted with the same start index. So when we extract `Eamont Bridge` with the start index of `207` as above, we will not extract `Eamont` again with the same start index. So we need to keep track of the start index.

Okay, let's modify the function and code...


In [None]:
# Re-defining the function
def extract_placenames(text, plnames):
  extracted_place_names = {} # maps start index -> extracted name
  for name in plnames:
    p = re.compile(rf'{re.escape(name)}[.,;\s]')
    for match in p.finditer(text):
      start, end = match.span()

      # the match (name plus trailing delimiter) must be at least three characters long,
      # and only the first name found at a given start index is kept
      if end-start>=3 and start not in extracted_place_names:
        extracted_place_names[start] = text[start:end-1] # drop the trailing delimiter
  return extracted_place_names

place_names = ['Penrith', 'Pooley Bridge', 'Eamont', 'Eamont Bridge']

# sort the place names, longest first (ties keep their original order):
# ['Pooley Bridge', 'Eamont Bridge', 'Penrith', 'Eamont']
place_names = sorted(place_names, key=len, reverse=True)

extracted_place_names = extract_placenames(example_text, place_names)
extracted_place_names

{31: 'Pooley Bridge',
 450: 'Pooley Bridge',
 856: 'Pooley Bridge',
 207: 'Eamont Bridge',
 5: 'Penrith',
 87: 'Eamont'}

For visualization, it helps to sort the dictionary in ascending order of start index, so that the extracted names appear in the order in which they occur in the text.

In [None]:
extracted_place_names = extract_placenames(example_text, place_names)
extracted_place_names = {i:extracted_place_names[i] for i in sorted(extracted_place_names)}
extracted_place_names

{5: 'Penrith',
 31: 'Pooley Bridge',
 87: 'Eamont',
 207: 'Eamont Bridge',
 450: 'Pooley Bridge',
 856: 'Pooley Bridge'}
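As an alternative to tracking start indexes by hand, the overlap between `Eamont` and `Eamont Bridge` can also be handled inside a single regular expression. The sketch below is not the workshop's method: it joins the names into one alternation, longest first, so the regex engine tries `Eamont Bridge` before `Eamont` at each position. The function name `extract_with_alternation`, the `sample` sentence and the use of `\b` word boundaries (instead of a trailing delimiter) are assumptions made for illustration.

```python
import re

def extract_with_alternation(text, names):
    """Return {start_index: name}, preferring the longest name at each position."""
    # Longest names first, so 'Eamont Bridge' is tried before 'Eamont'
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b")
    # finditer scans left to right, so the dictionary is already in text order
    return {m.start(): m.group() for m in pattern.finditer(text)}

# Illustrative sentence; not taken from the workshop files
sample = ("From Penrith two roads lead to Pooley Bridge, which spans the Eamont. "
          "Follow the Shap road to Eamont Bridge.")
found = extract_with_alternation(
    sample, ["Penrith", "Pooley Bridge", "Eamont", "Eamont Bridge"])
print(found)
```

Because matches never overlap and are found in reading order, no separate sorting step is needed with this variant.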

## **Step 4: Visualizing the outputs**
It is often a good idea to present our outputs graphically, so that it is easier to see and understand how the process works.

So let's define functions that display the text and the extracted place names in HTML format.

#### **Visualizing the plain text**

In [None]:
import IPython.display  # makes IPython.display.HTML available explicitly

def show_text(txtstr):
  start_mark = f'<mark class="entity" style="background: #FFFFFF; line-height: 2; border-radius: 0.35em;">'
  end_mark = '\n</mark>'
  return IPython.display.HTML(f"{start_mark}{txtstr}{end_mark}")

show_text(example_text)

#### **Visualizing the extracted place names**
Having extracted the place names, we can also define functions that can 'mark-up' or highlight the extracted place names from the plain text so we can visualize it in HTML format.

Let's call the first function `get_tagged_list()`. It walks through the text using the dictionary of extracted place names and identifies the spans that will be tagged as place names. Its output is a list of tuples, each containing a text span and its tag (either `PL-NAME` or `None`).

In [None]:
# split the text into (span, tag) pairs using the extracted place names
def get_tagged_list(text, ext_pl_names):
  begin, tokens_tags = 0, []
  for start, plname in ext_pl_names.items():
    length, ent, tag = len(plname), plname, 'PL-NAME'
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+length], tag))
      begin = start+length
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

get_tagged_list(example_text, extracted_place_names)

[('From ', None),
 ('Penrith', 'PL-NAME'),
 (' two roads lead to ', None),
 ('Pooley Bridge', 'PL-NAME'),
 (', about six miles distant, which spans the ', None),
 ('Eamont', 'PL-NAME'),
 (' just at its issue from Ulleswater. Either road may be taken, be we recommend that which follows the Shap road to ',
  None),
 ('Eamont Bridge', 'PL-NAME'),
 (". Carleton Hall is near to it on the left. Cross the bridge, and take the first road to the right. At this point, on the left, are the druidical remains called King Arthur's Round Table, and Mayborough. Immediately after crossing ",
  None),
 ('Pooley Bridge', 'PL-NAME'),
 (', the road runs along the western shore of ULLESWATER.\n\nTo Patterdale, a distance of ten miles; but, before proceeding along it, the tourist would do well to take a walk of a few miles along the eastern shore, in the direction of Martindale, from several points on which he will obtain a good view of the lake. Should this deviation be made, it will be necessary to return 

The second function, `mark_up()`, takes a `token` (actually a span of characters) and a tag (e.g. `PL-NAME` for a place name) and highlights the text with a background colour in HTML format. Untagged spans are returned unchanged.

In [None]:
# format a typical entity for display 
def mark_up(token, tag):
  if tag:
    start_mark = f'<mark class="entity" style="background: #feca74 ; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">'
    end_mark = '\n</mark>'
    start_span = '<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"\n{start_mark}{token}{start_span}{tag}{end_span}{end_mark}"
  return f"{token}"

IPython.display.HTML(mark_up('Penrith', 'PL-NAME'))

Finally, we piece everything together with the function `generate_html()`, which applies `mark_up()` to each item in the output of `get_tagged_list()` and wraps the result in a single HTML block.

In [None]:
# generate html formatted text 
def generate_html(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.5; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += mark_up(token,tag)
  html += end_div
  return html

In [None]:
tag_list = get_tagged_list(example_text, extracted_place_names) 
IPython.display.HTML(generate_html(tag_list))
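One thing to watch for when building HTML from raw text is that characters such as `<`, `>` and `&` are interpreted by the browser as markup. The example text here does not contain any, but text read from an XML file would. The sketch below shows one possible way to guard against this with the standard library's `html.escape`. The function name `mark_up_safe` and its simplified styling are assumptions for illustration, not part of the workshop code.

```python
import html

def mark_up_safe(token, tag=None, color="#feca74"):
    """Escape a span for HTML display and highlight it when it has a tag."""
    safe = html.escape(token)  # turns <, > and & into HTML entities
    if tag:
        return (f'<mark style="background: {color};">{safe} '
                f'<b>{html.escape(tag)}</b></mark>')
    return safe

plain = mark_up_safe("<cdplace>Penrith</cdplace>")
tagged = mark_up_safe("Penrith", "PL-NAME")
print(plain)
print(tagged)
```

Escaping is applied to each span separately, so the `<mark>` elements added for highlighting are still interpreted as HTML.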

## **Step 5: Extracting with a gazetteer list**
Our examples so far can only extract and visualise a few place names. To have a chance of extracting all the place names in the text, we need a more comprehensive list. 

So for this task, we will apply the techniques defined above with a list of the Lake District place names from the gazetteer created by [Source]() to identify and extract mentions of the place names in the same text.

In [None]:
place_names = [name.strip() for name in open('placenames.txt').readlines()]
# place_names
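Gazetteer files often contain blank lines or repeated entries, and each of these adds a pattern that contributes nothing to the extraction. The sketch below shows one way to clean the list while loading it. The helper name `load_names` and the demo lines are assumptions for illustration; the actual content of `placenames.txt` may already be clean.

```python
def load_names(lines):
    """Return unique, non-empty names in first-seen order."""
    seen = {}  # dict keys keep insertion order and drop duplicates
    for line in lines:
        name = line.strip()
        if name:
            seen.setdefault(name, None)
    return list(seen)

# Illustrative lines standing in for the contents of a gazetteer file
demo_lines = ["Penrith\n", "\n", "Pooley Bridge\n", "Penrith\n", "  Eamont \n"]
names = load_names(demo_lines)
print(names)
```

With a real file, the same helper could be called as `load_names(fh)` inside a `with open('placenames.txt') as fh:` block, which also makes sure the file is closed afterwards.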


Let's modify the `extract_placenames()` function to automatically sort the input list in descending order of name length and to return the resulting dictionary sorted in ascending order of start index. We will also use the gold standard annotations (the `<cdplace>` tags in the XML files) to evaluate the extraction.

In [None]:
import xml.etree.ElementTree as ET
from collections import Counter
import os

In [None]:
def extract_cdplace_tags(xml_file):
  tags=[]
  tree = ET.parse(xml_file)
  root = tree.getroot()
  for child in root:
    for grand_child in child.iter('cdplace'):
      if grand_child.text:
        # tags.append(f"<cdplace>{grand_child.text}</cdplace>")
        tags.append(f"{grand_child.text}")
      else:
        for text in grand_child.iter('i'):
          # if text.text: tags.append(f"<cdplace><i>{text.text}</i></cdplace>")
          if text.text: tags.append(f"{text.text}")
  return tags
gold_standard_placenames = extract_cdplace_tags('gold_standard/Anon_cqp_66.xml')
len(gold_standard_placenames)

569
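The cell above collects the text of every `<cdplace>` element, falling back to a nested `<i>` element when the tag has no direct text. The sketch below shows the same idea on a tiny in-memory document. The XML snippet is made up for illustration (the real gold standard files may be structured differently), and `itertext()` is used here as a compact variant rather than the workshop's exact logic.

```python
import xml.etree.ElementTree as ET

# Made-up snippet imitating the <cdplace> markup; not a real corpus file
sample_xml = (
    "<text><p>From <cdplace>Penrith</cdplace> two roads lead to "
    "<cdplace><i>Pooley Bridge</i></cdplace>.</p></text>"
)

root = ET.fromstring(sample_xml)
names = []
for element in root.iter("cdplace"):
    # itertext() yields the element's own text and that of nested tags such as <i>
    name = "".join(element.itertext()).strip()
    if name:
        names.append(name)

print(names)
```

Text that follows the closing `</cdplace>` tag is stored as the element's tail and is not returned by `itertext()`, so only the annotated name is collected.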

In [None]:
allnames=[]
for fname in sorted(os.listdir('gold_standard')):
  gold_standard_tags = extract_cdplace_tags(os.path.join('gold_standard',fname))
  allnames.extend(gold_standard_tags)
  print(fname, len(gold_standard_tags), len(Counter(gold_standard_tags)))
f"allnames({len(allnames)}), unique({len(set(allnames))})"

Anon_cqp_66.xml 569 322
Brown_cqp_10.xml 16 5
Clarke_cqp_63.xml 56 26
Cockin_cqp_19.xml 100 76
Coleridge_cqp_33.xml 261 170
Defoe_cqp_4.xml 119 75
Garnett_cqp_62.xml 715 391
Gray_cqp_13.xml 181 125
Keats_cqp_44.xml 109 56
Lt.Hammond._cqp_2.xml 62 51
Otley__cqp_49.xml 1931 739
Pennant_cqp_12.xml 49 40
Pennant_cqp_15.xml 302 197
Phillips_cqp_38.xml 42 25
Rix_cqp_78.xml 67 38
Ruskin_cqp_55.xml 232 98
Rutland_cqp_42.xml 48 32
Shaw_cqp_24.xml 185 148
Smith_cqp_5.xml 37 29
Smith_cqp_6.xml 37 30
Smith_cqp_7.xml 112 67
Sullivan_cqp_20.xml 80 51
Wakefield_cqp_37.xml 120 83
Wesley_cqp_9.xml 35 35
West_cqp_17.xml 1305 654
Wordsworth_cqp_47.xml 117 65
Wordsworth_cqp_58.xml 397 251
Young_cqp_11.xml 125 80


'allnames(7409), unique(2882)'

In [None]:
space_punct = lambda tstr, punc: tstr.replace(punc, f" {punc}") \
                if punc in ":,.!]})" else tstr.replace(punc, f"{punc} ")

def space_puncts(tstr, punc_list=":,.!(){}[]"):
  for punc in punc_list:
    # print(tstr, punc)
    tstr = space_punct(tstr, punc)
  return tstr
# text = 'the road from Pooley Bridge: to (Penrith), from Pooley Bridge. to Penrith!'
# space_puncts(text)
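The `space_puncts()` helper separates punctuation from the neighbouring words, so that a later split on whitespace yields punctuation marks as separate tokens. The sketch below restates the same logic in a compact form and runs it on a made-up string; the variable names and the sample are illustrative assumptions.

```python
def space_puncts(text, puncts=":,.!(){}[]"):
    """Put a space before closing punctuation and after opening brackets."""
    closing = ":,.!]})"
    for p in puncts:
        text = text.replace(p, f" {p}" if p in closing else f"{p} ")
    return text

# Illustrative string; not taken from the corpus
sample = "to (Penrith), from Pooley Bridge."
spaced = space_puncts(sample)
tokens = spaced.split()
print(spaced)
print(tokens)
```

Note that every full stop is separated, including those inside abbreviations such as `St.`, which may or may not be desirable for place names.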

In [None]:
# Get all the gold standard placenames
def get_gold_data(folder):
  placenames, contexts = [], []
  for i, fname in enumerate(sorted(os.listdir(folder))):
    xml_text = open(os.path.join(folder,fname),'r').read()
    search_string = f'<cdplace[ visited=\'*"*\w\'*"*]*[<i>]*[\w*\s*\'-\.*’:]*[,\.;!?:]*[</i>]*</cdplace>'
    placenames.extend([re.sub('<.*?>', '', match.group()) 
                            for match in re.finditer(search_string, xml_text)])
    contexts.extend([re.sub('<.*?>', '', xml_text[match.start()-25:match.end()+25])
                            for match in re.finditer(search_string, xml_text)])
  return placenames, contexts
placenames, contexts = get_gold_data('gold_standard')
len(set(placenames)), len(contexts)

(2717, 7323)

In [None]:
# import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Re-defining the functions 
def extract_placenames(text, place, plnames):
  #sort the list in reverse order...
  # plnames = sorted(plnames, key=lambda x: len(x), reverse=True)
  extracted_place_names = {} # dictionary to keep track of extracted name instances
  if place in plnames:
    try:
      p = re.compile(f'{place}[\.,\s\n;:]')
      iterator = p.finditer(text)
      for match in iterator:
        start, end = match.span()

        # also the place name is expected to be at least three characters in length 
        if end-start>=3 and start not in extracted_place_names:
          extracted_place_names[start] = text[start:end][:-1]
    except re.error:
      # the place name contains characters that do not form a valid pattern
      pass
  else:
    extracted_place_names[0] = 'None' 
  return {i:extracted_place_names[i] for i in sorted(extracted_place_names)}

def compute_scores(folder):
  placenames, contexts = get_gold_data(folder)
  name_list = sorted(set(placenames + place_names), key=lambda x: len(x), reverse=True)
  y_true, y_pred = [True]*len(placenames), []
  for place, context in list(zip(placenames, contexts)):
    try:
      y_pred.append(place==list(extract_placenames(context, place, name_list).values())[0])
    except:
      y_pred.append(False)
  return f"""Acc: {accuracy_score(y_true, y_pred)*100:.2f}%, Pre: {precision_score(y_true, y_pred)*100:.2f}%, Rec: {recall_score(y_true, y_pred)*100:.2f}% F1: {f1_score(y_true, y_pred)*100:.2f}%"""
print(compute_scores('gold_standard'))

Acc: 93.95%, Pre: 100.00%, Rec: 93.95% F1: 96.88%
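When reading these scores, note that `y_true` consists only of `True` values, because every context is built around a gold standard place name. Under that setup there can be no false positives, so precision is 100% by construction and accuracy equals recall; the informative figure is the recall. The sketch below reproduces this behaviour with plain Python on made-up labels (the helper `prf` and the toy labels are assumptions, and scikit-learn is not needed to follow the arithmetic).

```python
def prf(y_true, y_pred):
    """Compute precision, recall and F1 for boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: five gold names, four of them recovered by the extractor
y_true = [True] * 5
y_pred = [True, True, True, True, False]

precision, recall, f1 = prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

A fuller evaluation would also run the extractor over text that contains no place names, so that false positives can be counted.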


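Then let's visualize...

In [None]:
xml_text = open(os.path.join('gold_standard/Anon_cqp_66.xml'),'r').read()
place_names = [name.strip() for name in open('placenames.txt').readlines()]

# extract_placenames() now checks one name at a time, so loop over the gazetteer,
# longest names first, and keep the first name found at each start index
extracted_place_names = {}
for name in sorted(place_names, key=len, reverse=True):
  for start, found in extract_placenames(xml_text, name, place_names).items():
    extracted_place_names.setdefault(start, found)
extracted_place_names = {i:extracted_place_names[i] for i in sorted(extracted_place_names)}

tag_list = get_tagged_list(xml_text, extracted_place_names)
IPython.display.HTML(generate_html(tag_list))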

As you may have observed from the output above, some of the place names were missed by this method, either because they were not found in the gazetteer list (e.g. `Eamont`, `Earl of Lonsdale`), because of inconsistent capitalization (e.g. `Patterdale` vs `PATTERDALE`), or because of spelling errors.
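Of these three causes, inconsistent capitalization is the one that can be handled within the rule-based approach itself. The sketch below shows one possible way, using the `re.IGNORECASE` flag. The function name `find_case_insensitive`, the sample sentence and the use of `\b` word boundaries are assumptions for illustration; missing gazetteer entries and spelling errors are not addressed by it.

```python
import re

def find_case_insensitive(text, name):
    """Return (start_index, matched_text) for every occurrence, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(name)}\b", flags=re.IGNORECASE)
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

# Illustrative sentence; not taken from the workshop files
sample = "To Patterdale, ten miles. PATTERDALE lies at the head of the lake."
hits = find_case_insensitive(sample, "Patterdale")
print(hits)
```

The matched text is returned as it appears in the source, so the original capitalization is preserved for display.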

We will attempt to address these issues in the next section using the named entity recognizer.

## **Step 6: Extracting geographical feature nouns with list**
To extract geographical features from a list of feature nouns (e.g. `castle`, `ridge`, `forest`, `village`, `river` etc), we will apply the same method.

To enable us to apply a new tag, `GEO-NOUN`, let's modify the `get_tagged_list()` function so that it defaults to the `PL-NAME` tag while supporting other tags.


In [None]:
# split the text into (span, tag) pairs using the extracted entities
def get_tagged_list(text, ext_pl_names, tag='PL-NAME'): #incl the tag parameter 
  begin, tokens_tags = 0, []
  for start, plname in ext_pl_names.items():
    length = len(plname)
    if begin <= start:
      tokens_tags.append((text[begin:start], None))
      tokens_tags.append((text[start:start+length], tag))
      begin = start+length
  tokens_tags.append((text[begin:], None)) #add the last untagged chunk
  return tokens_tags

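To reuse the `extract_placenames()` function, let's rename it to `extract_entities()` and make it more generic. Note that the pattern now starts with a space, so each match begins one character before the entity itself.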

In [None]:
# Rename 'placename' to 'entities' 
def extract_entities(text, ent_list):
  ent_list = sorted(ent_list, key=lambda x: len(x), reverse=True)
  extracted_entities = {} # dictionary to keep track of extracted name instances
  for name in ent_list:
    p = re.compile(f' {name}[\.,\s\n]')
    iterator = p.finditer(text)
    for match in iterator:
      start, end = match.span()

      # the match must be at least three characters long, and only the
      # first (longest) entity found at a given start index is kept
      if end-start>=3 and start not in extracted_entities:
        extracted_entities[start] = text[start:end][:-1]
  return {i:extracted_entities[i] for i in sorted(extracted_entities)}

In [None]:
geonouns = [geonoun.strip() for geonoun in open('geo_feature_nouns.txt').readlines()]
# geonouns

In [None]:
BG_COLOR = {
    'GPE':'#feca74', 'CARDINAL':'#e4e7d2', 'FAC':'#9cc9cc','QUANTITY':'#e4e7d2',
    'PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2', 'ORG':'#7aecec', 'PL-NAME':'#feca74',
    'no_tag':'#FFFFFF','GEO-NOUN': '#9cc9cc', 'NORP':'#d9fe74', 'LOC':'#9ac9f5',
    'DATE':'#c7f5a9', 'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5','TIME':'#a9f5bc',
    'WORK_OF_ART':'#e6c1d7', 'LAW':'#e6e6c1','LANGUAGE':'#c9bdc7', 
    'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2','EMOTION':'#f2ecd0',
    'TIME-sem':'#d0e0f2', 'MOVEMENT':'#f2d0d0'
}

In [None]:
tagged_geonouns = get_tagged_list(example_text, extract_entities(example_text, geonouns), 'GEO-NOUN')
IPython.display.HTML(generate_html(tagged_geonouns))

In [None]:
!pip install lemminflect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Expand list with inflections and lemmas
from lemminflect import getLemma, getInflection
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))

get_inflections(geonouns)

# tag_list = get_tagged_list(example_text, extracted_place_names) 

In [None]:
tagged_geonouns = get_tagged_list(example_text, extract_entities(example_text, get_inflections(geonouns)), 'GEO-NOUN') # now with inflected forms
IPython.display.HTML(generate_html(tagged_geonouns))

# Using a Named Entity Recognizer
With the rule-based approach, we could extract the place names in our list. However, it is limited in a number of ways.
* It requires an exhaustive list of place names which is difficult to build for different types of writings.
* Hand-crafted rules for all possible scenarios will need to be developed
  - e.g. spelling errors, capitalizations, inflections etc.
  - Over-lapping instances ('Eamont' vs 'Eamont Bridge')
* It will be more difficult to extract references to time and date
* The approach will not generalize well with other corpora 



## **Step 7: Using a Named Entity Recognizer**

In [None]:
EXAMPLE_TEXT = open('gold_standard/Anon_cqp_66.xml').read()

# EXAMPLE_TEXT = open(os.path.join('data','example_texts','Anon1857_b.txt')).read()

place_names = [name.strip() for name in open('placenames.txt').readlines()]
# geof_names  = open('data/geo_feature_nouns.txt').readlines()

# Expand list with inflections and lemmas
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))

# Get the index list of a sem tag
def get_sem_tagged(tag_type):
  index_list = []
  for i in range(len(output_doc)):
    if output_doc[i]._.pymusas_tags[0].startswith(tag_type[0]):
       index_list.append(i)
  return index_list

# extract all `seen` entities from a list of place names 
def extract_entities_with_regex(txtstr, ent_list, tag='PL-NAME'):
  entityPosLen={}
  for ent in ent_list:
    p = re.compile(f'{ent}[\.,\s\n]')#, flags=re.IGNORECASE)
    iterator = p.finditer(txtstr)
    for match in iterator:
      start, end = match.span()
      if end-start>=3 and start not in entityPosLen:
        entityPosLen[start] = (end-start, txtstr[start:end], tag)
  return entityPosLen

# extract all known entities with spacy
def extract_entities_with_spacy(spacy_doc):
  entityPosLen={}
  for ent in spacy_doc.ents:
    entityPosLen[ent.start_char] = (len(ent.text), ent.text, ent.label_)
  return entityPosLen

# extract all entities with semtagger
def extract_entities_with_semtagger(tokens, index_list, tag):
  entityPosLen={}
  for i in index_list:
    start_char = 1+len(" ".join(tokens[:i]))
    entityPosLen[start_char] = (len(tokens[i]), tokens[i], tag)
  return entityPosLen

# split the text into (span, tag) pairs using the extracted entities
def get_token_tags(txtstr, entities):
  begin, tokens_tags = 0, []
  for start, vals in entities.items():
    length, ent, tag = vals
    if begin <= start:
      tokens_tags.append((txtstr[begin:start], None))
      tokens_tags.append((txtstr[start:start+length], tag))
      begin = start+length
  tokens_tags.append((txtstr[begin:], None)) #add the last untagged chunk
  return tokens_tags

BG_COLOR = {'GPE':'#feca74', 'CARDINAL':'#e4e7d2', 'FAC':'#9cc9cc',
            'QUANTITY':'#e4e7d2', 'PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2', 
            'ORG':'#7aecec', 'PL-NAME':'#feca74', 'no_tag':'#FFFFFF',
            'GEO-FEATURE': '#9cc9cc', 'NORP':'#d9fe74', 'LOC':'#9ac9f5',
            'DATE':'#c7f5a9', 'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5',
            'TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7', 'LAW':'#e6e6c1',
            'LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2',
            'EMOTION':'#f2ecd0', 'TIME-sem':'#d0e0f2', 'MOVEMENT':'#f2d0d0'
}

# format a typical entity for display 
def format_entity(token, tag):
  if tag:
    start_mark = f'<mark class="entity" style="background: {BG_COLOR[tag]}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">'
    end_mark = '\n</mark>'
    start_span = '<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"\n{start_mark}{token}{start_span}{tag}{end_span}{end_mark}"
  return f"{token}"

# format a typical entity span for display 
def format_span(ent_span, tag):
  new_ent_span = ent_span.copy()
  if tag:
    start_mark = f'<mark class="entity" style="background: {BG_COLOR[tag]}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">'
    end_mark = '\n</mark>'
    start_span = '<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'  
    if len(new_ent_span)>1:
      new_ent_span[0] = f"\n{start_mark}{ent_span[0]}"
      new_ent_span[-1] = f"{ent_span[-1]}{start_span}{tag}{end_span}{end_mark}"
      return new_ent_span
    else:
      new_ent_span[0] = f"\n{start_mark}{ent_span[0]}{start_span}{tag}{end_span}{end_mark}"
      return new_ent_span
  return new_ent_span

# generate html formatted text 
def generate_html(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.5; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += format_entity(token,tag)
  html += end_div
  return html

# show the unformatted text
def show_text(txtstr):
  start_mark = f'<mark class="entity" style="background: #FFFFFF; line-height: 2; border-radius: 0.35em;">'
  end_mark = '\n</mark>'
  return IPython.display.HTML(f"{start_mark}{txtstr}{end_mark}")
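In `extract_entities_with_semtagger()` above, the character offset of token `i` is derived from the length of the preceding tokens joined by single spaces. That only lines up with the text when the tokens really are separated by exactly one space, and for the first token (`i = 0`) the expression `1+len(" ".join(tokens[:i]))` gives `1` rather than `0`. The sketch below shows one possible way to compute the offsets under the same single-space assumption. The helper name `token_start_offsets` and the token list are illustrative, not part of the workshop code.

```python
def token_start_offsets(tokens):
    """Start offset of each token in ' '.join(tokens)."""
    offsets, position = [], 0
    for token in tokens:
        offsets.append(position)
        position += len(token) + 1  # +1 for the single space that follows
    return offsets

# Illustrative tokens; not produced by a real tagger
tokens = ["From", "Penrith", "two", "roads"]
text = " ".join(tokens)
offsets = token_start_offsets(tokens)

# Each offset points at the first character of its token
recovered = [text[start:start + len(token)] for start, token in zip(offsets, tokens)]
print(offsets, recovered)
```

When the tokens come from a spaCy document, each token also exposes its own character offset as `token.idx`, which avoids reconstructing positions altogether.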

### **Building the NLP Pipeline**



We start by building a baseline NER tagger. Three approaches are considered:
1. Try an existing Named Entity Recognition (NER) tool - **`spaCy`**
2. Build a rule-based recogniser for all known regions
3. Annotate our corpus with the required tags and train a statistical model for name and feature recognition

#### Clone the Lake District Corpus directory

In [None]:
!git clone https://github.com/UCREL/LakeDistrictCorpus.git

In [None]:
cd LakeDistrictCorpus/

#### Install spaCy and PyMUSAS Models

In [None]:
!pip uninstall -y spacy

In [None]:
!pip install spacy==3.3.1

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.1/en_dual_none_contextual-0.3.1-py3-none-any.whl

In [None]:
!pip3 install lemminflect

#### Importing all necessary modules

In [None]:
# importing all necessary modules
import os
import sys
import re
import spacy
import pandas as pd
import IPython
import matplotlib.pyplot as plt
import en_core_web_sm
import collections
from collections import Counter
from lemminflect import getLemma, getInflection
from wordcloud import WordCloud, STOPWORDS

In [None]:
len(allnames)

7409

In [None]:
gold_standard_place_names = list(set(allnames))

In [None]:
print('|'.join(gold_standard_place_names))

Stock Gill|KIDSEY-PIKE|Brow top|VICAR'S ISLAND|Lake of Bassenthwaite|EAMON|Grizedal|river EDEN|Strickland|Watenlath|St. Herbert's Isle|GUINEA|Mediterranean Sea|MONKS-HALL|Ravenstondale|CHURCH-STREET|Bowne∫|KESWICK|the Pike|Pooley|Stony Tarn|PLYMOUTH|Helvellyn Man|Hawes|Benwewi∫h|Egremont|Black Bull|Man|Cockshut|the Lissa|Lancaster Castle|Kirby Launsdale|Newcafſtle|Grisdale|Carlile|Portingskall|Stanwick|ULLESWATER|Maryport|Cat-Bells|Vale of St John|HEST-BANK|HOLME-CRAG|BERKSHIR ISLAND|Buttermere|river Caldew|Torver|Grasmire|Winandermere|Green-caſtle-loch|Red Tarn|Loughrigg Tarn|priory of CARTMEL|Grasmere-hill|Dunald Mill Hole|Cresel|Thrilmere|Goldscope|Millum|vale of Grasmere|Stickle Knot|GLEASTON CASTLE|OLENACUM|MELL-FELL|GOLDRILL-BECK|Lugubalia|Riuer Lun|Bonus|Derwentwater|Keppel Cove Tarn|LEVEN-SANDS|ROUGH-HOLM|Applethwaite|Whinfield Chase|Bulmans cleugh|Leatherby|Skelgill|mount MAUDITE|Blencarter|Lowther Castle|GRASMERE WATER|Vicarage|Leatheswater|Penigent|Blencathara|Ben-Lomond|CUM

In [None]:
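# strip a trailing possessive such as 's (the length check avoids an IndexError on one-character names)
gold_standard_place_names = [name[:-2] if len(name) > 1 and name[-2]=="'" else name 
                             for name in gold_standard_place_names]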

In [None]:

# # ------------------
# wordcloud = WordCloud(width = 800, height = 800,
#         background_color ='white',
#         min_font_size = 10).generate(' '.join(pl_names_found))

# # plot the WordCloud image					
# plt.figure(figsize = (7, 7), facecolor = None)
# plt.imshow(wordcloud)
# plt.axis("off")
# plt.tight_layout(pad = 0)



#### Define functions and load data

In [None]:
!wget https://raw.githubusercontent.com/SpaceTimeNarratives/demo_app/main/code/data/placenames.txt

In [None]:
EXAMPLE_TEXT = open('/content/LakeDistrictCorpus/gold_standard/Anon_cqp_66.xml').read()

# EXAMPLE_TEXT = open(os.path.join('data','example_texts','Anon1857_b.txt')).read()

place_names = [name.strip() for name in open('placenames.txt').readlines()]
# geof_names  = open('data/geo_feature_nouns.txt').readlines()

# Expand list with inflections and lemmas
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))

# Get the index list of a sem tag
def get_sem_tagged(tag_type):
  index_list = []
  for i in range(len(output_doc)):
    if output_doc[i]._.pymusas_tags[0].startswith(tag_type[0]):
       index_list.append(i)
  return index_list

# extract all `seen` entities from a list of place names 
def extract_entities_with_regex(txtstr, ent_list, tag='PL-NAME'):
  entityPosLen={}
  for ent in ent_list:
    p = re.compile(rf'{re.escape(ent)}[.,\s]')  # raw string; escape the name so "." in it matches literally
    iterator = p.finditer(txtstr)
    for match in iterator:
      start, end = match.span()
      if end-start>=3 and start not in entityPosLen:
        entityPosLen[start] = (end-start, txtstr[start:end], tag)
  return entityPosLen

# extract all known entities with spacy
def extract_entities_with_spacy(spacy_doc):
  entityPosLen={}
  for ent in spacy_doc.ents:
    entityPosLen[ent.start_char] = (len(ent.text), ent.text, ent.label_)
  return entityPosLen

# extract all entities with semtagger
def extract_entities_with_semtagger(tokens, index_list, tag):
  entityPosLen={}
  for i in index_list:
    start_char = len(" ".join(tokens[:i])) + (1 if i else 0)  # no leading space before the first token
    entityPosLen[start_char] = (len(tokens[i]), tokens[i], tag)
  return entityPosLen

# extract all known entities in a lists
def get_token_tags(txtstr, entities):
  begin, tokens_tags = 0, []
  for start, vals in entities.items():
    length, ent, tag = vals
    if begin <= start:
      tokens_tags.append((txtstr[begin:start], None))
      tokens_tags.append((txtstr[start:start+length], tag))
      begin = start+length
  tokens_tags.append((txtstr[begin:], None)) #add the last untagged chunk
  return tokens_tags

BG_COLOR = {'GPE':'#feca74', 'CARDINAL':'#e4e7d2', 'FAC':'#9cc9cc',
            'QUANTITY':'#e4e7d2', 'PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2', 
            'ORG':'#7aecec', 'PL-NAME':'#feca74', 'no_tag':'#FFFFFF',
            'GEO-FEATURE': '#9cc9cc', 'NORP':'#d9fe74', 'LOC':'#9ac9f5',
            'DATE':'#c7f5a9', 'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5',
            'TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7', 'LAW':'#e6e6c1',
            'LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2',
            'EMOTION':'#f2ecd0', 'TIME-sem':'#d0e0f2', 'MOVEMENT':'#f2d0d0'
}

# format a typical entity for display 
def format_entity(token, tag):
  if tag:
    start_mark = f'<mark class="entity" style="background: {BG_COLOR[tag]}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">'
    end_mark = '\n</mark>'
    start_span = '<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"\n{start_mark}{token}{start_span}{tag}{end_span}{end_mark}"
  return f"{token}"

# format a typical entity span for display 
def format_span(ent_span, tag):
  new_ent_span = ent_span.copy()
  if tag:
    start_mark = f'<mark class="entity" style="background: {BG_COLOR[tag]}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">'
    end_mark = '\n</mark>'
    start_span = '<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'  
    if len(new_ent_span)>1:
      new_ent_span[0] = f"\n{start_mark}{ent_span[0]}"
      new_ent_span[-1] = f"{ent_span[-1]}{start_span}{tag}{end_span}{end_mark}"
      return new_ent_span
    else:
      new_ent_span[0] = f"\n{start_mark}{ent_span[0]}{start_span}{tag}{end_span}{end_mark}"
      return new_ent_span
  return new_ent_span

# generate html formatted text 
def generate_html(token_tag_list):
  start_div = f'<div class="entities" style="line-height: 2.5; direction: ltr">'
  end_div = '\n</div>'
  html = start_div
  for token, tag in token_tag_list:
    html += format_entity(token,tag)
  html += end_div
  return html

# show unformatted text
def show_text(txtstr):
  start_mark = f'<mark class="entity" style="background: #FFFFFF; line-height: 2; border-radius: 0.35em;">'
  end_mark = '\n</mark>'
  return IPython.display.HTML(f"{start_mark}{txtstr}{end_mark}")

#### Add sem tagger to `spaCy` pipeline and process text

In [None]:
# Load the small English spaCy pipeline with all of its default components.
nlp = spacy.load('en_core_web_sm')
# Load the English PyMUSAS rule based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
# Adds the English PyMUSAS rule based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

In [None]:
# EXAMPLE_TEXT = open('data/example_texts/example_text.txt').read()
output_doc = nlp(EXAMPLE_TEXT)

# print(f'Text\tLemma\tPOS\tUSAS Tags')
# for i, token in enumerate(output_doc):
#     print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

#### Define functions and the `Extractor` class

In [None]:
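import os
import re
import collections
import spacy
from IPython.display import HTML
from collections import Counter
from lemminflect import getLemma, getInflection

# Load spaCy and attach the PyMUSAS tagger: Extractor.extract_sem_entities reads token._.pymusas_tags
nlp = spacy.load('en_core_web_sm')
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)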

BG_COLOR = {'GPE':'#feca74', 'CARDINAL':'#e4e7d2', 'FAC':'#9cc9cc', 'QUANTITY':'#e4e7d2', 'PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2', 'ORG':'#7aecec',
            'PL-NAME':'#feca74', 'no_tag':'#FFFFFF', 'GEO-NOUN': '#9cc9cc', 'NORP':'#d9fe74', 'LOC':'#9ac9f5', 'DATE':'#c7f5a9', 'PRODUCT':'#edf5a9', 
            'EVENT': '#e1a9f5', 'TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7', 'LAW':'#e6e6c1', 'LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2', 
            'EMOTION':'#f2ecd0', 'TIME-sem':'#d0e0f2', 'MOVEMENT':'#f2d0d0', 'SP-PREP': '#f7d7e9', 'LOC-ADV': '#c4e5f5'
}

# format a typical entity for display 
def format_entity(token, tag):
  if tag:
    start_mark = f'<mark class="entity" style="background: {BG_COLOR[tag]}; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">'
    end_mark = '\n</mark>'
    start_span = '<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
    end_span = '\n</span>'
    return f"\n{start_mark}{token}{start_span}{tag}{end_span}{end_mark}"
  return f"{token}"

# extract all known entities in a lists
def get_token_tags(txtstr, entities):
  begin, tokens_tags = 0, []
  for start, vals in entities.items():
    length, ent, tag = vals
    if begin <= start:
      tokens_tags.append((txtstr[begin:start], None))
      tokens_tags.append((txtstr[start:start+length], tag))
      begin = start+length
  tokens_tags.append((txtstr[begin:], None)) #add the last untagged chunk
  return tokens_tags

# Expand list with inflections and lemmas
def get_inflections(names_list):
    gf_names_inflected = []
    for w in names_list:
      w = w.strip()
      gf_names_inflected.append(w)
      gf_names_inflected.extend(list(getInflection(w.strip(), tag='NNS', inflect_oov=False)))
      gf_names_inflected.extend(list(getLemma(w.strip(), 'NOUN', lemmatize_oov=False)))
    return list(set(gf_names_inflected))

combine = lambda x, y: (x[0], x[1]+' '+y[1], x[2])

def combine_multi_tokens(a_list):
  new_list = [a_list.pop()]
  while a_list:
    last = a_list.pop()
    if new_list[-1][0] - last[0] == 1:
      new_list.append(combine(last, new_list.pop()))
    else:
      new_list.append(last)
  return sorted(new_list)

# merge two entities
def merge_entities(first_ents, second_ents):
  return collections.OrderedDict(
      sorted({** second_ents, **first_ents}.items()))
# ------------------------------------------------------------------------------
# EXAMPLE_TEXT = open('data/example_texts/Anon1857_b.txt').read()

EXAMPLE_TEXT = open('/content/LakeDistrictCorpus/gold_standard/Anon_cqp_66.xml').read()
place_names_tags = [(name.strip(), 'PL-NAME') for name in open('placenames.txt').readlines()]
# geof_nouns_tags = [(noun.strip(), 'GEO-NOUN') for noun in get_inflections(open('data/example_texts/geo_feature_nouns.txt').readlines())]
# spatial_preps_tags = [(prep.strip(), 'SP-PREP') for prep in open('data/example_texts/spatial_prepositions.txt').readlines()]
# locative_adverbs = [(advb[:25].strip(), 'LOC-ADV') for advb in open('data/example_texts/locativeAdverbs.txt').readlines()]
  
entity_tag_list = place_names_tags # + geof_nouns_tags + spatial_preps_tags + locative_adverbs

In [None]:
class Extractor:
  def __init__(self, text, entity_tag_list):
    self.text = text
    self.tokens, self.tokenized_text, self.nlp_doc = self.process_text()
    self.entity_tag_list = entity_tag_list
    self.entities = self.extract_placenames()
    self.ner_entities = self.extract_ner_entities()
    self.sem_tag_types = ['EMOTION', 'MOVEMENT', 'TIME-sem']
    self.sem_entities = self.extract_sem_entities()
    
  def process_text(self):
      doc = nlp(self.text)
      tokens = [token.text for token in doc]
      tokenized_text = " ".join(tokens)
      nlp_doc = nlp(tokenized_text)
      return tokens, tokenized_text, nlp_doc

  def extract_placenames(self):
    entities = {}
    for ent, tag in self.entity_tag_list:
      p = re.compile(rf'{re.escape(ent)}[.,\s]')  # raw string; escape the name so "." in it matches literally
      iterator = p.finditer(self.tokenized_text)
      for match in iterator:
        start, end = match.span()
        if end-start>=2 and start not in entities:
          entities[start] = (end-start, self.tokenized_text[start:end-1], tag)
    return collections.OrderedDict(sorted(entities.items()))

  def extract_ner_entities(self):
    entities = {}
    for ent in self.nlp_doc.ents:
      tag='PL-NAME' if ent.label_ in ['GPE', 'ORG', 'FAC'] else ent.label_
      entities[ent.start_char] = (len(ent.text), ent.text, tag)
    return collections.OrderedDict(sorted(entities.items()))

  def extract_sem_entities(self):
    entities = {}
    for tag_type in self.sem_tag_types:
      tag_indices = [(i, token.text, tag_type) for i, token in enumerate(self.nlp_doc) if token._.pymusas_tags[0].startswith(tag_type[0])]
      if tag_indices:
        for i, token, tag in combine_multi_tokens(tag_indices):
          start_char = len(" ".join(self.tokens[:i])) + (1 if i else 0)  # no leading space before the first token
          entities[start_char] = (len(token), token, tag)
    return collections.OrderedDict(sorted(entities.items()))

  # generate html formatted text 
  def visualize(self, ents=None, include_ner=True, include_sem=True):
    html, end_div = f'<div class="entities" style="line-height: 2.3; direction: ltr">', '\n</div>'
    if ents:
      entities = ents
    else:
      # Place-name matches take priority, then NER entities, then semantic tags;
      # with both flags off, only the place-name matches are shown
      extra = {}
      if include_sem:
        extra = self.sem_entities
      if include_ner:
        extra = merge_entities(self.ner_entities, extra)
      entities = merge_entities(self.entities, extra)
    for token, tag in get_token_tags(self.tokenized_text, entities):
      html += format_entity(token,tag)
    html += end_div
    return HTML(html)

#### Instantiating the `extractor` and Visualising entities 

In [None]:
extractor = Extractor(EXAMPLE_TEXT,entity_tag_list)
my_ents = {i:(l, e, t) for i, (l, e, t) in extractor.entities.items() if t in ['PL-NAME']}
extractor.visualize(my_ents)

### **Regex Tagging**
#### Using Regex-based tagger for `place_names` and `geographical feature nouns`

In [None]:
place_names = sorted(place_names, key=lambda x: len(x), reverse=True)

#extract place name entities/mentions
pl_names_ents = extract_entities_with_regex(EXAMPLE_TEXT, place_names, tag='PL-NAME')

# #extract geo feature entities/mentions
# gf_names_ents = extract_entities_with_regex(EXAMPLE_TEXT, get_inflections(geof_names), tag='GEO-FEATURE')

# # Merge all extracted feature names and mentions
# regex_entities = {**pl_names_ents, **gf_names_ents}
# regex_entities = collections.OrderedDict(sorted(regex_entities.items()))

IPython.display.HTML(
    generate_html(get_token_tags(EXAMPLE_TEXT, pl_names_ents)))
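
The regex tagger above sorts the gazetteer longest-first and records each match by its start offset. A minimal standalone sketch of the same idea follows. It makes some illustrative choices that are assumptions rather than the notebook's method: names are escaped with `re.escape`, word-boundary lookarounds stand in for the trailing-delimiter character class, and a longer match claims its characters before any shorter one. `match_place_names` is a hypothetical helper name.

```python
import re

def match_place_names(text, names):
    """Return {start: (length, matched_text, 'PL-NAME')} for gazetteer names found in text."""
    found = {}
    taken = set()  # character positions already covered by an earlier (longer) match
    # Longest names first, so 'Pooley Bridge' wins over 'Pooley'
    for name in sorted(set(names), key=len, reverse=True):
        # re.escape makes characters such as '.' literal; the lookarounds require
        # a non-word character (or the edge of the text) on both sides of the name
        pattern = re.compile(rf'(?<!\w){re.escape(name)}(?!\w)')
        for m in pattern.finditer(text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                found[m.start()] = (len(name), m.group(), 'PL-NAME')
                taken.update(span)
    return dict(sorted(found.items()))

sample = "From Penrith two roads lead to Pooley Bridge; Pooley is also used."
matches = match_place_names(sample, ["Pooley", "Pooley Bridge", "Penrith"])
for start, (length, name, tag) in matches.items():
    print(start, length, name, tag)
```

Because the stored length here excludes any trailing delimiter, `text[start:start+length]` is exactly the place name, which keeps punctuation out of the highlighted span.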

### **`spaCy` NER model**
##### Tag text with standard NER tags: `LOC`, `PERSON`, `ORG`, `DATE`, `TIME`, etc.

In [None]:
spacy_entities = extract_entities_with_spacy(output_doc)

# `output_doc` was built from EXAMPLE_TEXT, so entity offsets index into that string
IPython.display.HTML(
    generate_html(get_token_tags(EXAMPLE_TEXT, spacy_entities)))

### **Regex + `spaCy` tagger**
Combine the regex and spaCy taggers, keeping the regex match wherever a spaCy entity starts inside it.

In [None]:
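# `regex_entities` is only built in the commented-out merge above, so start from the place-name matches
regex_spacy_entities = pl_names_ents.copy()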

# Do not overwrite regex with spacy entities
banned_start_points=[]
for start, (length, e, t) in regex_spacy_entities.items():
  banned_start_points.extend(list(range(start, start+length-1)))

# Add only entities not captured by regex
for start, (l, e, t) in spacy_entities.items():
  if start not in banned_start_points:
    if t in ['GPE','ORG', 'LOC', 'FAC']:#, 'PERSON']:
      regex_spacy_entities[start] = (l, e, 'PL-NAME')
    else:
      regex_spacy_entities[start] = (l, e, t)
regex_spacy_entities = collections.OrderedDict(sorted(regex_spacy_entities.items()))

IPython.display.HTML(
    generate_html(get_token_tags(EXAMPLE_TEXT, regex_spacy_entities)))
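
The combination step above blocks a spaCy entity only when its start offset falls inside a regex match. A possible variant, sketched here under the assumption that any character overlap should be rejected, compares whole spans instead. `merge_by_priority` is a hypothetical helper, and the offsets in the sample dictionaries are made up for illustration.

```python
def merge_by_priority(primary, secondary):
    """Merge two {start: (length, text, tag)} dicts.

    Every primary span is kept; a secondary span is added only if none of
    its characters is already covered by a primary span.
    """
    merged = dict(primary)
    covered = set()
    for start, (length, _, _) in primary.items():
        covered.update(range(start, start + length))
    for start, (length, text, tag) in secondary.items():
        if covered.isdisjoint(range(start, start + length)):
            merged[start] = (length, text, tag)
    return dict(sorted(merged.items()))

# Illustrative offsets only
regex_ents = {31: (13, 'Pooley Bridge', 'PL-NAME')}
ner_ents = {5: (7, 'Penrith', 'GPE'), 38: (6, 'Bridge', 'FAC')}
combined = merge_by_priority(regex_ents, ner_ents)
print(combined)
```

Here the `Bridge` entity is dropped because it lies inside the longer `Pooley Bridge` span, while `Penrith` is kept.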

# Update: Adding the Semantic Tagger


1. The demo app is now available from the project's GitHub space and should be accessible to everyone.
  - Demo app link: https://spacetimenarratives.streamlit.app/



### **Semantic Tagging**
##### Tag text with `MOVEMENT`, `TIME` and `EMOTION` semantic tags

In [None]:
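tag_types = ['EMOTION', 'MOVEMENT', 'TIME-sem']
# Offsets from extract_entities_with_semtagger index into the space-joined tokens,
# so build that string (`text`) from the processed document for display
text_tokens = [token.text for token in output_doc]
text = " ".join(text_tokens)
semtagger_entities={}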
for tag_type in tag_types:
  tag_entities = extract_entities_with_semtagger(text_tokens, get_sem_tagged(tag_type),tag_type) 
  semtagger_entities = {**semtagger_entities, **tag_entities}
semtagger_entities = collections.OrderedDict(sorted(semtagger_entities.items()))

IPython.display.HTML(
    generate_html(get_token_tags(text, semtagger_entities)))
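
The semantic-tagger helper locates each tagged token by re-joining every token that precedes it, which repeats work for each index. A small sketch of an alternative, assuming the display text is exactly `" ".join(tokens)`, computes all start offsets in one pass. `token_offsets` is a hypothetical helper name.

```python
from itertools import accumulate

def token_offsets(tokens):
    """Start offset of each token within ' '.join(tokens)."""
    if not tokens:
        return []
    # Each token is followed by one space, so a token starts at the sum of
    # (length + 1) over all tokens before it; the first token starts at 0
    widths = [len(t) + 1 for t in tokens]
    return [0] + list(accumulate(widths))[:-1]

tokens = ["We", "walked", "to", "Keswick", "."]
joined = " ".join(tokens)
offsets = token_offsets(tokens)
for tok, off in zip(tokens, offsets):
    print(off, repr(joined[off:off + len(tok)]))
```

An entity for token `i` can then be stored as `{offsets[i]: (len(tokens[i]), tokens[i], tag)}`, matching the dictionary layout used throughout the notebook.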

### **Regex + `spaCy` + semantic tagger**

In [None]:
regex_spacy_sem_entities = regex_spacy_entities.copy()

# Do not overwrite regex entities
banned_start_points=[]
for start, (length, e, t) in regex_spacy_sem_entities.items():
  banned_start_points.extend(list(range(start, start+length-1)))

# Add only entities not captured by regex
for start, (l, e, t) in semtagger_entities.items():
  if start not in banned_start_points:
      regex_spacy_sem_entities[start] = (l, e, t)

regex_spacy_sem_entities = collections.OrderedDict(sorted(regex_spacy_sem_entities.items()))

IPython.display.HTML(generate_html(get_token_tags(text, regex_spacy_sem_entities)))

# Annotation Tools


Data Annotation: 

1. Lighttag: https://www.lighttag.io/
2. AI-Annotator: https://ioannotator.com/
3. Prodigy: https://prodi.gy/
4. Tagtog: https://www.tagtog.com/
5. Brat Annotation Tool: https://brat.nlplab.org/index.html

---

# Google Map API

In [None]:
pip install -U googlemaps

In [None]:
pip install geopandas

In [None]:
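import os
# Read the key from the environment rather than hard-coding a credential in the notebook
# (the variable name GOOGLE_MAPS_API_KEY is an assumption; use whichever name you export)
API_KEY = os.environ['GOOGLE_MAPS_API_KEY']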

In [None]:
import googlemaps
# from datetime import datetime

In [None]:
gmaps = googlemaps.Client(key=API_KEY)

# Geocoding an address
geocode_result1 = gmaps.geocode('Penrith, Lake District')
geocode_result2 = gmaps.geocode('Pooley Bridge, Lake District')
print(geocode_result1[0])
print(geocode_result2[0])

# Look up an address with reverse geocoding
# reverse_geocode_result = gmaps.reverse_geocode((40.714224, -73.961452))
reverse_geocode_result = gmaps.reverse_geocode((54.6786628, -2.7247091))

pl1 = geocode_result1[0]['address_components'][0]['long_name'] #, reverse_geocode_result
pl2 = geocode_result2[0]['address_components'][0]['long_name'] #, reverse_geocode_result

pl1, pl2
# Request directions via public transit
# now = datetime.now()
# directions_result = gmaps.directions("Sydney Town Hall", 
#                                      "Parramatta, NSW",
#                                      mode="transit",
#                                      departure_time=now)

In [None]:
pd.DataFrame(geocode_result1[0]['geometry'])

In [None]:
pd.DataFrame(geocode_result2[0]['geometry'])

In [None]:
# Requires cities name
dist = gmaps.distance_matrix(pl1, pl2)['rows'][0]['elements'][0]
  
# Printing the result
print(dist)
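
The distance matrix call above needs an API key and network access. For a quick offline sanity check, the straight-line (great-circle) distance between two coordinate pairs can be estimated with the haversine formula. This is only a sketch: it will not match a route-based distance from the API, `haversine_km` is a hypothetical helper, and the second point below is a made-up coordinate used for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

point_a = (54.6786628, -2.7247091)  # coordinates used in the reverse-geocoding call above
point_b = (54.60, -2.82)            # hypothetical nearby point, for illustration only
print(f"{haversine_km(*point_a, *point_b):.1f} km")
```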

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame

df = pd.read_csv("Long_Lats.csv", delimiter=',', skiprows=0, low_memory=False)

geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
gdf = GeoDataFrame(df, geometry=geometry)   

# Simple world basemap bundled with geopandas.
# Note: `geopandas.datasets` was removed in geopandas 1.0, so this call needs an older release.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(10, 6)), marker='o', color='red', markersize=15);

In [None]:
df = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})

In [None]:
gdf = gpd.GeoDataFrame(
    df, geometry = gpd.points_from_xy(df.Longitude, df.Latitude))

In [None]:
print(gdf.head())

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# We restrict to South America.
ax = world[world.continent == 'South America'].plot(
    color='white', edgecolor='black')

# We can now plot our ``GeoDataFrame``.
gdf.plot(ax=ax, color='red')

plt.show()

### TODO: 14.11.2022
1. Include file upload feature [Done]
2. Include WordCloud (Nouns, Adjectives, Adverbs) [Done]
3. Redesign to include wrap up in a class [Done]

#### Reading the xml file

# Display Annotations

In [None]:
import os
import json
import pandas as pd

In [None]:
with open('data/simple_test1_annotations.json') as json_data:
    data = json.load(json_data)

In [None]:
for example in data['examples']:
  print(example['content'])

In [None]:
data.keys()
# pd.DataFrame(data['examples'][0]['annotations'])

In [None]:
pd.DataFrame(data['schema']['tags'])

In [None]:
example_0 = data['examples'][0]
print(example_0['content'])

In [None]:
all_annotations = {}
for i in range(4):
  example = data['examples'][i]
  # print(example['content'])
  for annotation in example['annotations']:
    # print(annotation['value'], annotation['tag'], annotation['start'], annotation['end'], annotation['tagged_token_id'])
    all_annotations[annotation['tagged_token_id']] = annotation['value'] 
    # print(annotation['tagged_token_id'], annotation['value'], annotation['tag'], annotation['start'], annotation['end'])
all_annotations

In [None]:
# pd.DataFrame(data['relations'])
for relation in data['relations']:
  if relation['tagged_token_id'] is not None:
    # print(relation['tagged_token_id'])
    print(f"{relation['id']}\t{all_annotations[relation['tagged_token_id']]:20}\t")

In [None]:
all_relations={}
for relation in data['relations']:
  all_relations[relation['id']] = {'relation_type':relation['pseudo_node_type'],
                                   'parent_id':relation['parent_id'],
                                   'children':relation['children'],
                                   'tagged_token_id':relation['tagged_token_id'],
                                   'materialized_path':relation['materialized_path']}
  print(relation['id'], all_relations[relation['id']])

for relation_id, value in all_relations.items():
  if value['relation_type'] is not None:
    parent =  all_annotations[all_relations[value['parent_id']]['tagged_token_id']]
    children_ids =  value['children']
    print(f"From\n- {parent}\nTo")
    if children_ids:
      for child_id in children_ids:  # avoid shadowing the built-in `id`
        print('-',all_annotations[all_relations[child_id]['tagged_token_id']])
    print('='*10)

In [None]:
html_text = """
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" id="ee0deb35d002499ea0584c7aad5b8096-0" class="displacy" width="750" height="312.0" direction="ltr" style="max-width: none; height: 312.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr">
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="222.0">
    <tspan class="displacy-word" fill="currentColor" x="50">Penrith</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="50">PL-NAME</tspan>
</text>

<text class="displacy-token" fill="currentColor" text-anchor="middle" y="222.0">
    <tspan class="displacy-word" fill="currentColor" x="225">Pooley Bridge</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="225">PL-NAME</tspan>
</text>


<text class="displacy-token" fill="currentColor" text-anchor="middle" y="222.0">
    <tspan class="displacy-word" fill="currentColor" x="400">Eamont</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="400">PL-NAME</tspan>
</text>

<text class="displacy-token" fill="currentColor" text-anchor="middle" y="222.0">
    <tspan class="displacy-word" fill="currentColor" x="575">Ulleswater</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="575">PL-NAME</tspan>
</text>

<g class="displacy-arrow">
    <path class="displacy-arc" id="arrow-ee0deb35d002499ea0584c7aad5b8096-0-0" stroke-width="2px" d="M70,180.0 C70,89.5 220.0,89.5 220.0,177.0" fill="none" stroke="currentColor"/>
    <text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
        <textPath xlink:href="#arrow-ee0deb35d002499ea0584c7aad5b8096-0-0" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">from</textPath>
    </text>
    <path class="displacy-arrowhead" d="M70,179.0 L62,167.0 78,167.0" fill="currentColor"/>
</g>

<g class="displacy-arrow">
    <path class="displacy-arc" id="arrow-ee0deb35d002499ea0584c7aad5b8096-0-1" stroke-width="2px" d="M420,177.0 C420,89.5 570.0,89.5 570.0,177.0" fill="none" stroke="currentColor"/>
    <text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
        <textPath xlink:href="#arrow-ee0deb35d002499ea0584c7aad5b8096-0-1" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">to</textPath>
    </text>
    <path class="displacy-arrowhead" d="M420,179.0 L412,167.0 428,167.0" fill="currentColor"/>
</g>

<g class="displacy-arrow">
    <path class="displacy-arc" id="arrow-ee0deb35d002499ea0584c7aad5b8096-0-2" stroke-width="2px" d="M245,177.0 C245,2.0 575.0,2.0 575.0,177.0" fill="none" stroke="currentColor"/>
    <text dy="1.25em" style="font-size: 0.8em; letter-spacing: 1px">
        <textPath xlink:href="#arrow-ee0deb35d002499ea0584c7aad5b8096-0-2" class="displacy-label" startOffset="50%" side="left" fill="currentColor" text-anchor="middle">to</textPath>
    </text>
    <path class="displacy-arrowhead" d="M575.0,179.0 L583.0,167.0 567.0,167.0" fill="currentColor"/>
</g>
</svg>
"""
IPython.display.HTML(html_text)

In [None]:
pd.DataFrame(data['relations'])

# Convert XML to TXT

In [None]:
import xml.etree.ElementTree as ET
def xml2txt(fpath):
  tree = ET.parse(fpath)
  root = tree.getroot()
  text=''
  for chap in root.findall('chap'):
    for c in chap:
      # if c.text: text+=f'\n{c.text.strip()}'
      if c.tag == 'poem':
        for l in c.findall('line'):
          if l.text: text+=f'\n{l.text.strip()}'
      if c.tag in 'pi':
        print(c.text)
  # with open(fpath[:-3]+'txt', 'w', encoding='utf8') as txtfile:
  #   txtfile.write(text)
  #   return f"{fpath[:-3]+'txt'} successfully created!"
  return text  # the collected poem lines were previously built but never returned
xml2txt('gold_standard/Anon_cqp_66.xml')

In [None]:
tree = ET.parse('gold_standard/Anon_cqp_66.xml')
root = tree.getroot()
text=""
for t in root.itertext():
  text = f"{text} {t}"
text.replace('\n', '')
text.replace('\t', '')

' \n \n \n The English Lakes. \n \n INTRODUCTION. \n\n By the route which we have traced among the English Lakes in the following pages, we believe that the traveller may visit all the chief points of interest in the shortest space of time, while those who have their time more at command may extend their excursions to secondary points of interest by following the various diverging routes headed in  italics . We have selected  Penrith  as the starting point, because  Ulleswater , one of the finest of the lakes, is seen to greatest advantage by being approached from this direction, while it is as convenient a quarter as any of the others from which to set out on a tour through the district. \n\n With the carefully prepared map attached to this guide, the tourist will experience no difficulty in tracing the main route and the diverging excursions here laid down; and a very little consideration, with the aid of occasional inquiry as to minutiæ when on the spot, will enable him to vary his 

In [None]:
import json
from xml.etree import ElementTree as ET

tree = ET.parse('data/Anon1857_b.xml')
# One record per <p> element, holding all of its inner text.
# json.dump handles quoting and avoids the trailing comma that made the hand-built output invalid JSON.
paragraphs = [{"para_id": str(i), "text": " ".join(t.strip() for t in p.itertext())}
              for i, p in enumerate(tree.findall(".//p"))]

with open('data/the_english_lakes_anon1857b.json', 'w', encoding='utf8') as jsonfile:
  json.dump(paragraphs, jsonfile, indent=2, ensure_ascii=False)

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('data/Anon1857_b.xml')
text = ET.tostring(tree.getroot(), encoding='unicode', method='text')  # 'unicode' returns a str, not bytes

print(text.replace('\n', ''))

In [None]:
type(text)

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
# nlp = spacy.load("en_core_web_sm")
# doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# for token in doc:
#     print(token.text, token.dep_, token.head.text, token.head.pos_,
#             [child for child in token.children])
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

In [None]:
from spacy import displacy
# nlp = spacy.load("en_core_web_sm")
doc = nlp("Penrith is a beautiful town with two roads leading to Pooley Bridge, about six miles distant, which spans the Eamont just at its issue from Ulleswater.")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

# Sense-of-place (WordCloud)

In [None]:
files = [f"data/example_texts/{f}" for f in os.listdir('data/example_texts') if f.endswith('.txt')]

def get_cloud(plname, tag, files, window=20):
  plname_sop = []
  for f in files:
    doc = nlp(open(f, 'r', encoding='utf8').read())
    for i in range(len(doc)):
      if doc[i].text ==plname:
        for j in range(max(0, i - window//2), min(len(doc), i + window//2)):  # clamp the window to the document
          if doc[j].pos_ == tag:
            plname_sop.append(doc[j].text)
  return plname_sop

In [None]:
doc1 = nlp("This is a sentence.")
doc2 = nlp("This is another sentence.")
html = displacy.render([doc1, doc2], style="dep", page=True)
html

In [None]:
# plname_sop = get_cloud('Keswick', 'VERB', files)
print(Counter(plname_sop).most_common(20))
wordcloud = WordCloud(width = 800, height = 800,
        background_color ='white',
        min_font_size = 10).generate(' '.join(plname_sop))

# plot the WordCloud image					
plt.figure(figsize = (7, 7), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

In [None]:
def get_cloud(plname, tag, window=20):
  plname_sop = []
  for i in range(len(doc)):
    if doc[i].text ==plname:
      for j in range(max(0, i - window//2), min(len(doc), i + window//2)):  # clamp the window to the document
        if doc[j].pos_ == tag:
          plname_sop.append(doc[j].text)
  return plname_sop

In [None]:
# plname_sop = get_cloud('Penrith', 'ADJ')
plname_sop = get_cloud('Keswick', 'ADJ')

# plname_sop = get_cloud('Penrith', 'ADV')
# plname_sop = get_cloud('Keswick', 'ADV')

# plname_sop = get_cloud('Penrith', 'NOUN')
# plname_sop = get_cloud('Keswick', 'NOUN')

wordcloud = WordCloud(width = 800, height = 800,
        background_color ='white',
        min_font_size = 10).generate(' '.join(plname_sop))

# plot the WordCloud image					
plt.figure(figsize = (7, 7), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()
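
The sense-of-place cells above gather words with a given part-of-speech tag from a window around each mention of a place name. The same windowing logic can be sketched without spaCy on a list of `(word, pos)` pairs, which makes it easy to check on a toy example. `window_words` and the sample sentence are hypothetical, and the window is clamped to the list bounds.

```python
from collections import Counter

def window_words(tagged_tokens, target, pos, window=20):
    """Count words tagged `pos` within `window` tokens centred on each `target` mention."""
    half = window // 2
    hits = []
    for i, (word, _) in enumerate(tagged_tokens):
        if word == target:
            # Clamp the window so it never runs off either end of the token list
            lo, hi = max(0, i - half), min(len(tagged_tokens), i + half)
            hits.extend(w for w, p in tagged_tokens[lo:hi] if p == pos)
    return Counter(hits)

tagged = [("Keswick", "PROPN"), ("is", "AUX"), ("a", "DET"), ("pleasant", "ADJ"),
          ("town", "NOUN"), ("near", "ADP"), ("a", "DET"), ("beautiful", "ADJ"),
          ("lake", "NOUN")]
counts = window_words(tagged, "Keswick", "ADJ", window=8)
print(counts.most_common(5))
```

The resulting counter can be passed to `WordCloud.generate_from_frequencies` in place of a joined string when word frequencies should drive the cloud directly.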

#### Other code

In [None]:
spacy_entities = extract_entities_with_spacy(doc)

IPython.display.HTML(
    generate_html(get_token_tags(text, spacy_entities)))

In [None]:
# Load the full pipeline; the commented-out `exclude` argument would drop the parser and NER components.
nlp = spacy.load('en_core_web_sm') #, exclude=['parser', 'ner'])

# Load the English PyMUSAS rule based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')

# Adds the English PyMUSAS rule based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

In [None]:
extractor = Extractor(EXAMPLE_TEXT, entity_tag_list)
# extractor.visualize(extractor.sem_entities)

In [None]:
import pandas as pd
# Load the LD80 corpus
LDv5_corpus_pl_names_workbook = pd.ExcelFile('data/LDv5_corpus_pl_names.xlsx')
LD_corpus_pl_names_df = pd.read_excel(LDv5_corpus_pl_names_workbook, sheet_name='LDv5_corpus_pl_names')

# Extract names from 'pl_name' column
pl_names = set([name.title() for name in LD_corpus_pl_names_df['pl_name']])

In [None]:
import os
import re
import spacy
import en_core_web_sm
# import streamlit as st
import pandas as pd
import collections
from collections import Counter
from lemminflect import getLemma, getInflection

EXAMPLES_DIR = 'code/data/example_texts'
example_files = sorted([f for f in os.listdir(EXAMPLES_DIR)]) # if f.startswith('Reviews')])

BG_COLOR = {'GPE':'#feca74', 'CARDINAL':'#e4e7d2', 'FAC':'#9cc9cc',
            'QUANTITY':'#e4e7d2', 'PERSON':'#aa9cfc', 'ORDINAL':'#e4e7d2', 
            'ORG':'#7aecec', 'PL-NAME':'#feca74', 'no_tag':'#FFFFFF',
            'GEO-FEATURE': '#9cc9cc', 'NORP':'#d9fe74', 'LOC':'#9ac9f5',
            'DATE':'#c7f5a9', 'PRODUCT':'#edf5a9', 'EVENT': '#e1a9f5',
            'TIME':'#a9f5bc', 'WORK_OF_ART':'#e6c1d7', 'LAW':'#e6e6c1',
            'LANGUAGE':'#c9bdc7', 'PERCENT':'#c9ebf5', 'MONEY':'#b3d6f2',
            'EMOTION':'#f2ecd0', 'TIME-sem':'#d0e0f2', 'MOVEMENT':'#f2d0d0'
}

# `PERSON` People, including fictional.	*Fred Flintstone*
# `NORP`	Nationalities or religious or political groups.	*The Republican Party*
# `FAC`	Buildings, airports, highways, bridges, etc.	*Logan International Airport, The Golden Gate*
# `ORG`	Companies, agencies, institutions, etc.	*Microsoft, FBI, MIT*
# `GPE`	Countries, cities, states.	*France, UAR, Chicago, Idaho*
# `LOC`	Non-GPE locations, mountain ranges, bodies of water.	*Europe, Nile River, Midwest*
# `DATE`	Absolute or relative dates or periods.	*20 July 1969*
# `CARDINAL`
# `QUANTITY` Measurements, as of weight or distance.	*Several kilometers, 55kg*
# `ORDINAL`	"first", "second", etc.	*9th, Ninth*
# `PRODUCT`	Objects, vehicles, foods, etc. (Not services.)	*Formula 1*
# `EVENT`	Named hurricanes, battles, wars, sports events, etc.	*Olympic Games*
# `TIME`	Times smaller than a day.	*Four hours*
# `LAW`	Named documents made into laws.	*Roe v. Wade*
# `LANGUAGE`	Any named language.	*English*
# `PERCENT`	Percentage, including "%".	*Eighty percent*
# `MONEY`	Monetary values, including unit. *Twenty Cents*

place_names = open('code/data/placenames.txt').readlines()
geof_names  = open('code/data/geo_feature_nouns.txt').readlines()
# locative_adverbs = open() # complete later

nlp = spacy.load('en_core_web_sm')

# Load the English PyMUSAS rule based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')

# Adds the English PyMUSAS rule based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

# Pass the example text through the pipeline for proper tokenisation

# show unformatted text
def show_plain_text(txtstr):
  'Original text:'
  start_mark = f'<mark class="entity" style="background: #FFFFFF; line-height: 2; border-radius: 0.35em;">'
  end_mark = '\n</mark>'
  return f"{start_mark}{txtstr}{end_mark}"

    # extract all entities with semtagger
    def extract_entities_with_semtagger(tokens, index_list, tag):
      entityPosLen={}
      for i in index_list:
        start_char = 1+len(" ".join(tokens[:i]))
        entityPosLen[start_char] = (len(tokens[i]), tokens[i], tag)
      return entityPosLen

    
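    # Get the indices of the tokens whose first USAS tag matches a tag type
    def get_sem_tagged(tag_type):
      # Only the first letter is compared: 'EMOTION' -> E, 'MOVEMENT' -> M,
      # 'TIME-sem' -> T, the matching top-level USAS categories
      index_list = []
      for i, token in enumerate(output_doc):
        if token._.pymusas_tags[0].startswith(tag_type[0]):
          index_list.append(i)
      return index_list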
      
    #Regex----------------------------------------------------------------
    # Sort place names longest first, ignoring surrounding whitespace
    sorted_pl_names = sorted((name.strip() for name in place_names), key=len, reverse=True)
    #extract place name entities/mentions
    pl_names_ents = extract_entities_with_regex(processed_text, sorted_pl_names)

    #extract geo feature entities/mentions
    gf_names_ents = extract_entities_with_regex(processed_text, get_inflections(geof_names), tag='GEO-FEATURE')

    # Merge all extracted place names and geographic feature mentions
    regex_entities = {**pl_names_ents, **gf_names_ents}
    regex_entities = collections.OrderedDict(sorted(regex_entities.items()))

    #Spacy-----------------------------------------------------------------
    doc = nlp(processed_text)
    spacy_entities = extract_entities_with_spacy(doc)

    #Regex+Spacy-----------------------------------------------------------
    regex_spacy_entities = regex_entities.copy()
    # Character positions already covered by a regex entity (full span, last character included)
    banned_start_points = set()
    for start, (length, e, t) in regex_spacy_entities.items():
      banned_start_points.update(range(start, start+length))

    for start, (l, e, t) in spacy_entities.items():
      if start not in banned_start_points:
        if t in ['GPE','ORG', 'LOC', 'FAC', 'PERSON']:
          regex_spacy_entities[start] = (l, e, 'PL-NAME')
        else:
          regex_spacy_entities[start] = (l, e, t)
    regex_spacy_entities = collections.OrderedDict(sorted(regex_spacy_entities.items()))

    #Sem Tagger-----------------------------------------------------------
    sem_tag_types = ['EMOTION', 'MOVEMENT', 'TIME-sem']
    semtagger_entities={}
    for tag_type in sem_tag_types:
      tag_entities = extract_entities_with_semtagger(text_tokens, get_sem_tagged(tag_type),tag_type) 
      semtagger_entities = {**semtagger_entities, **tag_entities}
    semtagger_entities = collections.OrderedDict(sorted(semtagger_entities.items()))

    #Regex Spacy and Sem Tagger-------------------------------------------
    regex_spacy_sem_entities = regex_spacy_entities.copy()
    # Do not overwrite regex entities
    # Character positions already covered by a regex or spaCy entity (full span, last character included)
    banned_start_points = set()
    for start, (length, e, t) in regex_spacy_sem_entities.items():
      banned_start_points.update(range(start, start+length))

    # Add only entities not captured by regex
    for start, (l, e, t) in semtagger_entities.items():
      if start not in banned_start_points:
          regex_spacy_sem_entities[start] = (l, e, t)
    regex_spacy_sem_entities = collections.OrderedDict(sorted(regex_spacy_sem_entities.items()))

    t_dict = {
    '⛱ Regex Extractor': ("**⛱ Regex Extraction**", regex_entities),
    '🏓 Spacy Extractor': ("**🏓 Spacy Extraction**", spacy_entities),
    '🛸 Sem_Tag Extractor': ("**🛸 Semantic Tagging [`EMOTION`, `MOVEMENT`, `TIME-sem`]**", semtagger_entities),
    '📌 Regex_Spacy Extractor': ("**📌 Regex_Spacy Extraction**", regex_spacy_entities),
    '🏆 Regex_Spacy_Semtag Extractor': ("**🏆 Regex_Spacy_Semtag Extraction**", regex_spacy_sem_entities)
    }
    
    return t_dict, processed_text
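
# A minimal sketch of the merge rule applied twice above (regex + spaCy, then
# + semantic tagger): entities from a secondary extractor are kept only if
# they do not start inside a span already claimed by the primary extractor.
# `merge_without_overlap_sketch` is a hypothetical helper, not part of the
# original code, and it assumes every entity dictionary maps
# start character -> (length, text, tag) and that the full span is protected.
def merge_without_overlap_sketch(primary, secondary):
  covered = set()
  for start, (length, _text, _tag) in primary.items():
    covered.update(range(start, start + length))
  merged = dict(primary)
  for start, entity in secondary.items():
    if start not in covered:
      merged[start] = entity
  # Return the entities ordered by start character
  return dict(sorted(merged.items()))

# Example on the text "At Keswick we went": the regex hit on "Keswick" wins
# over a second hit at the same position, and the non-overlapping "went" is added.
_sketch_primary = {3: (7, "Keswick", "PL-NAME")}
_sketch_secondary = {14: (4, "went", "MOVEMENT"), 3: (7, "Keswick", "GPE")}
_sketch_merged = merge_without_overlap_sketch(_sketch_primary, _sketch_secondary)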
    
