<a href="https://colab.research.google.com/github/IgnatiusEzeani/spatial_narratives_workshop/blob/main/spatial_narrative_demo4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a dataset from a corpus

### Corpus files

These are sample .xml files from the [Corpus of Lake District Writing](https://www.lancaster.ac.uk/fass/projects/spatialhum.wordpress/?page_id=64#:~:text=The%20Corpus%20of%20Lake%20District,Poly%2DOlbion%20(1622).). They were annotated with features for a previous project, and those annotations could be relevant to us in future. However, we want to build a method that also works on plain text.

Execute the code below to change into the `corpus` folder and view the list of files in the `files` folder:

In [None]:
%cd /content/drive/MyDrive/UCREL/demo/work_in_progress/corpus/
!ls files/

You can also see what the files look like by printing the first few lines of one of them, `Anon_cqp_66.xml`:

In [None]:
!head files/Anon_cqp_66.xml

Let's write a function `clean_text()` to remove the tags and return 'cleaned' text.

In [None]:
from bs4 import BeautifulSoup
import re

In [None]:
# removes the XML tags and normalises whitespace. Assumes .xml file
def clean_text(input_text):
  # Define a regular expression pattern to match XML tags and remove them
  pattern = re.compile(r'<.*?>')
  _text = pattern.sub('', input_text)

  # replace line breaks and tabs with spaces, and the '∫' character
  # (apparently standing in for the long s in these texts) with 's'
  _text = _text.replace('\n', ' ').replace('\t', ' ').replace('∫', 's')
  # collapse runs of whitespace into a single space
  _text = re.sub(r'\s+', ' ', _text)
  return _text.strip()

Now, let's open and read the file `Anon_cqp_66.xml` and pass it to the `clean_text()` function

In [None]:
filename = 'Anon_cqp_66.xml'
clean_text(open(f'files/{filename}').read())
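
As a quick illustration of what this kind of regex-based cleaning does, the sketch below applies the same steps to a made-up XML fragment. The helper name `strip_tags` and the tag names in the sample are hypothetical and are not taken from the corpus files.

```python
import re

# Hypothetical helper mirroring the steps in clean_text():
# remove anything that looks like a tag, then collapse whitespace.
TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags(xml_text):
    text = TAG_PATTERN.sub('', xml_text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Made-up fragment; the real corpus files use their own tag set.
sample = '<p>We left <place>Keswick</place>\n\tearly in the morning.</p>'
print(strip_tags(sample))
```

Because `.` does not match a line break by default, a tag that spans several lines would be left in place by this pattern. That is a limitation of the simple regex approach.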

### Processing the corpus file

Here we want to present the content of a file as a data table. Each row represents a sentence, and the columns hold the relevant details, e.g. `sent_id`, `text`, `place names` (with their positions in the text), `geo feature nouns`, `locative adverbs`, `spatial prepositions`, etc.

We can also include a column with a `sentiment score` for each sentence.


Install required libraries...

In [None]:
pip -q install -r resources/requirements.txt

Let's pass all the corpus files through the `clean_text()` function and store the output for each file in a dictionary, `cleaned_texts`, keyed by filename.

In [None]:
import os
cleaned_texts = {f: clean_text(open(f"files/{f}").read())
                for f in os.listdir('files/') if f.endswith('.xml')}

Run the `resources/functions.py` script to load the helper functions used below.

In [None]:
%run resources/functions.py

Load all the lists for the entity categories

In [None]:
# Get the list of placenames and geonouns
place_names = [name.strip().title().replace("'S", "'s") for name in open('resources/LD_placenames.txt').readlines()] #read and convert to title case
place_names += [name.upper() for name in place_names] #retain the upper case versions
geonouns = get_inflections([noun.strip() for noun in open('resources/geo_feature_nouns.txt').readlines()])

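# Get the locative adverbs (first word of each non-empty line) and the
# spatial prepositions (entries of 2 characters or fewer are skipped)
loc_advs = [l.split()[0] for l in open('resources/locative_adverbs.txt').readlines() if l.strip()]
sp_prep  = [l.strip() for l in open('resources/spatial_prepositions.txt').readlines()
            if len(l.strip())>2]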
# Get distances
distances = [l.strip() for l in open('resources/distances.txt').readlines()]

# Get dates
dates     = [l.strip() for l in open('resources/dates.txt').readlines()]

# Get times
times     = [l.strip() for l in open('resources/times.txt').readlines()]

# Get events
events    = [l.strip() for l in open('resources/events.txt').readlines()]

# Get the positive and negative words from the sentiment lexicon
# (the first 35 lines of each file are skipped)
pos_words = [w.strip() for w in open('resources/positive-words.txt','r', encoding='latin-1').readlines()[35:]]
neg_words = [w.strip() for w in open('resources/negative-words.txt','r', encoding='latin-1').readlines()[35:]]
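
The word lists above are all read in much the same way, so the loading step could be wrapped in a small helper. The sketch below is one possible version. `read_word_list` is a hypothetical name (nothing in this notebook indicates that `resources/functions.py` provides it), and the demonstration writes a throwaway file rather than assuming anything about the real resource files.

```python
import os
import tempfile

def read_word_list(path, encoding='utf-8', skip_lines=0):
    """Hypothetical helper: return the non-empty, stripped lines of a text file."""
    with open(path, encoding=encoding) as handle:   # the file is closed automatically
        lines = handle.readlines()[skip_lines:]
    return [line.strip() for line in lines if line.strip()]

# Demonstration on a temporary file with made-up content
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('; header line\nlake\n\n  fell \nvalley\n')

words = read_word_list(tmp.name, skip_lines=1)
os.remove(tmp.name)
print(words)
```

Using a `with` block closes each file as soon as it has been read, and dropping empty lines avoids adding empty strings as patterns later on.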

Next, we load spaCy's small English model and extend its `ner` step by adding our patterns to the pipeline with an `EntityRuler`.

### Building the model for extracting spatial entities

Let's set up spaCy's `EntityRuler`. This relies on the libraries from `requirements.txt` being installed and on the word lists from the `resources` folder being loaded, as done in the cells above.

In [None]:
# Load the small spaCy English model
import spacy
nlp = spacy.load("en_core_web_sm")

# Add the `entity_ruler` to the pipeline before the NER module
ruler = nlp.add_pipe("entity_ruler", before='ner')


# Define the patterns for the EntityRuler: every entry of a list gets that category's label (e.g. PLNAME for place names)
patterns =  [{"label": "PLNAME",  "pattern": plname} for plname in set(place_names)]
patterns += [{"label": "GEONOUN", "pattern": noun} for noun in geonouns]
patterns += [{"label": "+EMOTION", "pattern": word} for word in pos_words]
patterns += [{"label": "-EMOTION", "pattern": word} for word in neg_words]
patterns += [{"label": "EVENT",   "pattern": word} for word in events]
patterns += [{"label": "DATE", "pattern": word} for word in dates]
patterns += [{"label": "TIME", "pattern": word} for word in times]
patterns += [{"label": "DISTANCE", "pattern": word} for word in distances]
patterns += [{"label": "LOCADV", "pattern": word} for word in loc_advs]
patterns += [{"label": "SP-PREP", "pattern": word} for word in sp_prep]

ruler.add_patterns(patterns)
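
To check how string patterns behave before running the ruler over the whole corpus, it can help to try it on a single sentence. The sketch below uses a blank English pipeline (`spacy.blank("en")`) so that it does not depend on the downloaded model. The two patterns and the sentence are made up for illustration.

```python
import spacy

# A blank pipeline has a tokenizer but no statistical NER component,
# so every entity found below comes from the EntityRuler alone.
demo_nlp = spacy.blank("en")
demo_ruler = demo_nlp.add_pipe("entity_ruler")

# Illustrative patterns only; the notebook builds its own from the resource files.
demo_ruler.add_patterns([
    {"label": "PLNAME",  "pattern": "Keswick"},
    {"label": "GEONOUN", "pattern": "lake"},
])

doc = demo_nlp("We walked from Keswick to the lake.")
found = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
print(found)
```

String patterns are matched exactly, so `"lake"` would not match `"Lake"` or `"lakes"`. This is presumably why the notebook also adds upper-case place names and inflected forms of the geo nouns.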

Let's define a tag name for each entity category

In [None]:
header_tag = [('plnames', 'PLNAME'), ('geonouns', 'GEONOUN'), ('pos_words', '+EMOTION'),
              ('neg_words', '-EMOTION'), ('events', 'EVENT'), ('dates', 'DATE'),
              ('times', 'TIME'), ('distances', 'DISTANCE'), ('loc_advs', 'LOCADV'),
              ('spa_preps', 'SP-PREP')]

# keep all the tags here...
tags = [tag for _, tag in header_tag]

# collect the entities in doc `d` that carry tag `t`, together with their character offsets
tagger = lambda d, t: [(ent, ent.start_char, ent.end_char) for ent in d.ents if ent.label_ == t]

Import `nltk` and `pandas`, then define the `generate_sent_dataset()` function. `tqdm` will help us monitor the progress of the process.

We also need the `pre_process_text()` function, which removes stopwords and punctuation and lemmatizes the remaining words. Its token count is used when computing the sentiment scores.

In [None]:
import nltk
from tqdm.notebook import tqdm
import pandas as pd
import string

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

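nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases use this resource for the tokenizers
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
lemma = WordNetLemmatizer()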

In [None]:
def pre_process_text(text):
  # drop stopwords, lemmatize the remaining words, then drop punctuation tokens
  lemmas = [lemma.lemmatize(word) for word in word_tokenize(text)
            if word.lower() not in stop_words]
  return [token for token in lemmas if token not in string.punctuation]

In [None]:
def generate_sent_dataset(filename):
  # split the cleaned text into sentences; the position of a sentence is its id
  sentences = [sent.strip() for sent in sent_tokenize(cleaned_texts[filename])]
  data_df = pd.DataFrame({'sent_id': list(range(len(sentences))),
                          'sentence': sentences})

  # for each category, create an empty list for storing all extracted entities
  header_list = {header: [] for header, _ in header_tag}

  # extract and store the entities of every category found in each sentence
  pbar = tqdm(enumerate(data_df['sentence']), total=len(data_df))
  for i, sent in pbar:
    doc = nlp(sent)
    for header, tag in header_tag:
      header_list[header].append(tagger(doc, tag))
    pbar.set_description(f"-{filename[:-4]} sent {i:03d}")

  for header, _ in header_tag:
    data_df[header] = header_list[header]

  # sentiment score: (positive matches - negative matches) / number of pre-processed tokens
  n_tokens = data_df['sentence'].apply(lambda x: len(pre_process_text(x)))
  data_df['sentiment_score'] = (data_df['pos_words'].apply(len)
                                - data_df['neg_words'].apply(len)) / n_tokens

  return data_df
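
The sentiment score is a simple lexicon ratio: the number of positive matches minus the number of negative matches, divided by the number of tokens left after pre-processing. The sketch below restates that arithmetic on plain numbers. `lexicon_score` is a hypothetical name, and returning `0.0` for a sentence with no remaining tokens is an assumption made here (the data frame version yields `NaN` or `inf` in that case).

```python
def lexicon_score(n_pos, n_neg, n_tokens):
    """Hypothetical helper: (positive - negative) / tokens, or 0.0 when there are no tokens."""
    if n_tokens == 0:
        return 0.0
    return (n_pos - n_neg) / n_tokens

# A sentence with 2 positive words, 1 negative word and 8 remaining tokens
print(lexicon_score(2, 1, 8))    # 0.125
# A sentence with only negative matches
print(lexicon_score(0, 3, 6))    # -0.5
```

A positive value means more positive than negative matches. Because the difference is divided by the token count, the same matches weigh less in a longer sentence.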

### Generating and exploring the datasets from files

Generate the datasets for all the corpus files. This may take a while to complete...

In [None]:
data_tables = {f:generate_sent_dataset(f) for f in sorted(cleaned_texts.keys())}

Let's look at the first 5 rows of the table for one of the files, `'Anon_cqp_66.xml'`...


In [None]:
data_tables['Anon_cqp_66.xml'].head()

#### Plot the sentiments on the sentences

In [None]:
import numpy as np
import matplotlib.colors  # makes the `matplotlib.colors` submodule used below available explicitly

In [None]:
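# colour scale running from red (lowest score) through light green to green (highest)
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red","lightgreen", "green"])
# min-max normalisation: rescales the values to the range [0, 1]
normalize = lambda x: (x-np.min(x))/(np.max(x)-np.min(x))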
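
Min-max normalisation rescales the scores to the range 0 to 1 so that they map onto the colour scale. The sketch below does the same calculation on a plain list. `min_max` is a hypothetical name, and mapping a constant input to `0.5` is an assumption made here to avoid dividing by zero, which the one-line version does not guard against.

```python
def min_max(values):
    """Hypothetical helper: rescale values linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # all values equal: no spread to rescale
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([-0.5, 0.0, 0.25, 0.5]))   # [0.0, 0.5, 0.75, 1.0]
```

Note that the result is relative to each file: the most negative sentence in a file always maps to 0 and the most positive to 1, so normalised values are not directly comparable across files.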

In [None]:
def plot_sentiments(filename):
  data = data_tables[filename]
  data['norm_sentiment'] = normalize(data['sentiment_score'])
  data.plot.scatter('sent_id', 'norm_sentiment', c='norm_sentiment', colormap=cmap, figsize=(10, 5))

In [None]:
plot_sentiments('Anon_cqp_66.xml')

In [None]:
#@title #### Select filenames to change the plot...{ run: "auto" }

choose_filename = 'Anon_cqp_66.xml' #@param ['Anon_cqp_66.xml', 'Bree_cqp_56.xml', 'Carter_cqp_52.xml', 'Collingwood_cqp_75.xml', 'Denholm_cqp_35.xml', 'Gell_cqp_29.xml', 'Hawthorne_cqp_70.xml','Ostell_cqp_34.xml', 'Southey_cqp_40.xml', 'Walker_cqp_25.xml'] {allow-input: true}

plot_sentiments(choose_filename)

### Further ideas for the data tables

With the data tables (or data frames), you can try writing code to query, extract and visualise your data in different ways, e.g.:

*   Top placenames in a file
*   Top geonouns in a file
*   Top co-occurring geonouns to a place name
*   Search for sentences with specific placenames or geonouns or any combination of both
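
As a starting point for the first three ideas, the sketch below counts entities with `collections.Counter`. It works on any iterable of per-sentence entity lists, so a data frame column such as `data_tables[filename]['plnames']` could be passed in directly. The function names and the toy rows are hypothetical, and each entity is assumed to be a tuple whose first element is the entity itself, as produced by `tagger`.

```python
from collections import Counter

def top_entities(column, n=5):
    """Count entity strings across all sentences and return the n most common."""
    counts = Counter(str(ent[0]) for ents in column for ent in ents)
    return counts.most_common(n)

def cooccurring(place, place_column, other_column, n=5):
    """Count entities in other_column for the sentences that mention `place`."""
    counts = Counter()
    for places, others in zip(place_column, other_column):
        if any(str(p[0]) == place for p in places):
            counts.update(str(o[0]) for o in others)
    return counts.most_common(n)

# Toy rows standing in for the 'plnames' and 'geonouns' columns (made-up data)
plnames  = [[('Keswick', 0, 7)], [('Keswick', 5, 12), ('Ambleside', 20, 29)], []]
geonouns = [[('lake', 15, 19)], [('lake', 30, 34), ('valley', 40, 46)], [('fell', 3, 7)]]

print(top_entities(plnames))
print(cooccurring('Keswick', plnames, geonouns))
```

Converting each entity with `str()` means that different spellings or cases of the same place are counted separately. Lower-casing the strings first would merge them, if that is what you want.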

The datasets can also form the basis for training more complex machine learning models.

