<a href="https://colab.research.google.com/github/SpaceTimeNarratives/spatio-textual-colab-demos/blob/main/demo_1_entity_annotation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introducing the `spatio-textual` Python package

**spatio-textual** is a Python package for spatial entity recognition and verb relation extraction from text. It was created by the [Spatial Narratives Project](https://spacetimenarratives.github.io/) to support spatio-textual annotation, analysis and visualization in digital humanities projects, with initial applications to:

- the *Corpus of Lake District Writing* (CLDW)
- Holocaust survivors' testimonies (e.g., USC Shoah Foundation archives)

This package leverages spaCy and gazetteer-based classification to identify and label spatial entities such as cities, countries, camps, and geographic nouns, and also extracts action-verb contexts involving these entities.


---
## Setting up
Download the `en_core_web_trf` spaCy model and install the `spatio-textual` package.

**_Note:_** *Please wait a while, this may take a minute or two...* 🕐


In [1]:
!python -m spacy download en_core_web_trf
!pip install -q git+https://github.com/SpaceTimeNarratives/spatio-textual.git

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<1.0.0,>=0.2.2 (from en-core-web-trf==3.8.0)
  Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_tokenizers-0.0.9-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl (237 kB)

---
## Importing the `spatio-textual` package
Once the spaCy model has been downloaded and the `spatio-textual` package installed, the package can be imported and used in a Python environment to process text.

*This may also take about a minute...*

In [2]:
import spatio_textual

The `annotate` module of `spatio-textual` provides the functions `annotate_text`, `annotate_texts`, `chunk_and_annotate_text`, `chunk_and_annotate_file` and `annotate_files`, which identify and label spatial entities in inputs of different formats.

We can import these functions directly:

In [3]:
from spatio_textual.annotate import (
    annotate_text,            # annotates a single text
    annotate_texts,           # annotates a collection of texts
    chunk_and_annotate_text,  # chunks and annotates a text
    chunk_and_annotate_file,  # chunks and annotates a file
    annotate_files,           # annotates a collection of files
)

We can also save and load annotations with the `save_annotations(...)` and `load_annotations(...)` functions from the package's `utils` module:

In [4]:
from spatio_textual.utils import save_annotations, load_annotations

---
## Annotating spatial entities

In addition to typical named entity recognition labels such as `PERSON`, `ORG`, `LOC` and `DATE`, we use a set of entity labels relevant to our work, shown below:

| Tag          | Description                                                  |
| ------------ | ------------------------------------------------------------ |
| `PERSON`     | A named person                                               |
| `CONTINENT`  | A continent name (e.g. “Europe”, “Asia”)                     |
| `COUNTRY`    | A country name (e.g. “Germany”, “Czechoslovakia”)            |
| `US-STATE`   | A U.S. state name (e.g. “California”, “New York”)            |
| `CITY`       | A city name (e.g. “Berlin”, “London”), when classified as such |
| `CAMP`       | A Holocaust camp name (e.g. “Auschwitz”), from a custom list |
| `PLACE`      | Other place-type entities not matched above                  |
| `GEONOUN`    | Generic geographic nouns (e.g. “valley”, “moor”)             |
| `NON-VERBAL` | Non-verbal markers (e.g. “[PAUSES]”, “[LAUGHS]”), from the non-verbal list |
| `FAMILY`     | Kinship terms (e.g. “mother”, “uncle”)                       |
| `DATE`       | Temporal expressions (e.g. “March 9, 1996”)                  |
| `TIME`       | Time-of-day expressions (e.g. “3 PM”)                        |
| `EVENT`      | Named events (e.g. “D-Day”)                                  |
| `QUANTITY`   | Numeric/measure expressions (e.g. “100 miles”)               |

We will demonstrate how to use the `annotate` module functions to label spatial entities in text in the next cells.

## Annotating text: `annotate_text(...)`

In [13]:
#@title ###### Let's download and display the text in the `short-text` file...
!wget -c -q "https://raw.githubusercontent.com/SpaceTimeNarratives/spatio-textual/refs/heads/main/example-texts/short-text"
with open("short-text", "r", encoding="utf-8") as f:  # close the file once read
    text = f.read()
display(text)

'During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp. We spent several difficult months there before being transferred to Auschwitz-Birkenau.'

In [14]:
#@title ###### ...we can then identify the entities with the `annotate_text(...)` function.
result = annotate_text(text)
result

{'entities': [{'start_char': 7, 'token': 'the summer of 1942', 'tag': 'DATE'},
  {'start_char': 74, 'token': 'Krakow', 'tag': 'PLACE'},
  {'start_char': 88, 'token': 'Plaszow', 'tag': 'PLACE'},
  {'start_char': 96, 'token': 'labor camp', 'tag': 'GEONOUN'},
  {'start_char': 117, 'token': 'several difficult months', 'tag': 'DATE'},
  {'start_char': 176, 'token': 'Auschwitz-Birkenau', 'tag': 'CAMP'}],
 'verb_data': []}

We can see tokens identified as places (`PLACE`), geonouns (`GEONOUN`), camps (`CAMP`) etc.
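
The `start_char` values appear to be character offsets into the input string. The sketch below uses only the text and entity list printed above (it does not call the package) to check that reading and to group the tokens by tag. The helper name `group_by_tag` is ours for illustration and is not part of `spatio-textual`.

```python
from collections import defaultdict

# Text and entities copied from the output above.
text = (
    "During the summer of 1942, my family and I were deported from our home "
    "in Krakow to the Plaszow labor camp. We spent several difficult months "
    "there before being transferred to Auschwitz-Birkenau."
)
entities = [
    {"start_char": 7, "token": "the summer of 1942", "tag": "DATE"},
    {"start_char": 74, "token": "Krakow", "tag": "PLACE"},
    {"start_char": 88, "token": "Plaszow", "tag": "PLACE"},
    {"start_char": 96, "token": "labor camp", "tag": "GEONOUN"},
    {"start_char": 117, "token": "several difficult months", "tag": "DATE"},
    {"start_char": 176, "token": "Auschwitz-Birkenau", "tag": "CAMP"},
]


def group_by_tag(entities):
    """Collect entity tokens under their tag (hypothetical helper)."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["tag"]].append(ent["token"])
    return dict(grouped)


# Each offset should point at its token in the original string.
offsets_ok = all(
    text[e["start_char"]:e["start_char"] + len(e["token"])] == e["token"]
    for e in entities
)
print("offsets match:", offsets_ok)
print(group_by_tag(entities))
```

Grouping by tag is a convenient first step before counting or mapping places.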

Observe, however, that the `verb_data` value is empty. `verb_data` is meant to capture the actions performed by actors in the text for further analysis, but it is not populated in the call above.

To populate it, we can set the optional parameter `include_verbs` to `True` in the function call.

In [7]:
#@title ###### So let's modify the function calls to extract `verb_data`
result = annotate_text(text, include_verbs=True)

print("Entities:")
display(result['entities'])

print("\nVerb Data:")
display(result['verb_data'])

Entities:


[{'start_char': 7, 'token': 'the summer of 1942', 'tag': 'DATE'},
 {'start_char': 74, 'token': 'Krakow', 'tag': 'PLACE'},
 {'start_char': 88, 'token': 'Plaszow', 'tag': 'PLACE'},
 {'start_char': 96, 'token': 'labor camp', 'tag': 'GEONOUN'},
 {'start_char': 117, 'token': 'several difficult months', 'tag': 'DATE'},
 {'start_char': 176, 'token': 'Auschwitz-Birkenau', 'tag': 'CAMP'}]


Verb Data:


[{'sent-id': 0,
  'verb': 'deported',
  'subject': 'family',
  'object': 'the summer of 1942',
  'sentence': 'During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp.'},
 {'sent-id': 1,
  'verb': 'spent',
  'subject': 'We',
  'object': 'several difficult months',
  'sentence': '\nWe spent several difficult months there before being transferred to Auschwitz-Birkenau."'},
 {'sent-id': 1,
  'verb': 'transferred',
  'subject': '',
  'object': 'Auschwitz-Birkenau',
  'sentence': '\nWe spent several difficult months there before being transferred to Auschwitz-Birkenau."'}]
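
Not every extracted object is a place: in the output above, the object of `deported` is a date expression. One possible follow-up, sketched here on data copied from that output, is to keep only the actions whose object is a tagged spatial entity. The set `SPATIAL_TAGS` and the helper `spatial_actions` are our own assumptions for illustration, not part of the package.

```python
# Entities and verb records copied (sentences omitted) from the output above.
entities = [
    {"start_char": 7, "token": "the summer of 1942", "tag": "DATE"},
    {"start_char": 74, "token": "Krakow", "tag": "PLACE"},
    {"start_char": 88, "token": "Plaszow", "tag": "PLACE"},
    {"start_char": 96, "token": "labor camp", "tag": "GEONOUN"},
    {"start_char": 117, "token": "several difficult months", "tag": "DATE"},
    {"start_char": 176, "token": "Auschwitz-Birkenau", "tag": "CAMP"},
]
verb_data = [
    {"sent-id": 0, "verb": "deported", "subject": "family",
     "object": "the summer of 1942"},
    {"sent-id": 1, "verb": "spent", "subject": "We",
     "object": "several difficult months"},
    {"sent-id": 1, "verb": "transferred", "subject": "",
     "object": "Auschwitz-Birkenau"},
]

# Assumption: which tags count as "spatial" is our choice for this example.
SPATIAL_TAGS = {"CONTINENT", "COUNTRY", "US-STATE", "CITY", "CAMP", "PLACE", "GEONOUN"}


def spatial_actions(entities, verb_data):
    """Return (subject, verb, object, tag) for verbs acting on spatial entities."""
    tag_of = {e["token"]: e["tag"] for e in entities if e["tag"] in SPATIAL_TAGS}
    return [
        (v["subject"] or None, v["verb"], v["object"], tag_of[v["object"]])
        for v in verb_data
        if v["object"] in tag_of
    ]


actions = spatial_actions(entities, verb_data)
print(actions)
```

An empty subject is mapped to `None` here so that missing subjects are easy to spot downstream.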

## Annotating a list of texts

If we have a collection of texts (instead of just one piece of text), we can use `annotate_texts(...)` instead.

In [15]:
list_of_texts = [
    "My family and I were deported from our home in Krakow to the Plaszow labor camp.",
    "We spent several difficult months there before being transferred to Maribor.",
    "Finally, we arrived at a place with barbed wire fences and watchtowers.",
    "It was Auschwitz."
]

results = annotate_texts(list_of_texts, include_verbs=True)
for result in results:
    print("\nEntities:")
    display(result['entities'])

    print("\nVerb Data:")
    display(result['verb_data'])


Entities:


[{'start_char': 47, 'token': 'Krakow', 'tag': 'PLACE'},
 {'start_char': 61, 'token': 'Plaszow', 'tag': 'PLACE'},
 {'start_char': 69, 'token': 'labor camp', 'tag': 'GEONOUN'}]


Verb Data:


[{'sent-id': 0,
  'verb': 'deported',
  'subject': 'family',
  'object': 'home',
  'sentence': 'My family and I were deported from our home in Krakow to the Plaszow labor camp.'}]


Entities:


[{'start_char': 9, 'token': 'several difficult months', 'tag': 'DATE'},
 {'start_char': 68, 'token': 'Maribor', 'tag': 'CITY'}]


Verb Data:


[{'sent-id': 0,
  'verb': 'spent',
  'subject': 'We',
  'object': 'several difficult months',
  'sentence': 'We spent several difficult months there before being transferred to Maribor.'},
 {'sent-id': 0,
  'verb': 'transferred',
  'subject': '',
  'object': 'Maribor',
  'sentence': 'We spent several difficult months there before being transferred to Maribor.'}]


Entities:


[{'start_char': 59, 'token': 'watchtowers', 'tag': 'GEONOUN'}]


Verb Data:


[{'sent-id': 0,
  'verb': 'arrived',
  'subject': 'we',
  'object': 'place',
  'sentence': 'Finally, we arrived at a place with barbed wire fences and watchtowers.'}]


Entities:


[{'start_char': 7, 'token': 'Auschwitz', 'tag': 'CAMP'}]


Verb Data:


[]
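
Since `annotate_texts(...)` returns one result per input text, the results can be aggregated with ordinary Python. A minimal sketch, using entity lists copied from the four outputs above (stored under the illustrative name `sample_results` so the notebook's own `results` is left untouched), counts how often each tag occurs:

```python
from collections import Counter

# Entity lists copied from the four outputs above, one dict per input text.
sample_results = [
    {"entities": [
        {"start_char": 47, "token": "Krakow", "tag": "PLACE"},
        {"start_char": 61, "token": "Plaszow", "tag": "PLACE"},
        {"start_char": 69, "token": "labor camp", "tag": "GEONOUN"},
    ]},
    {"entities": [
        {"start_char": 9, "token": "several difficult months", "tag": "DATE"},
        {"start_char": 68, "token": "Maribor", "tag": "CITY"},
    ]},
    {"entities": [{"start_char": 59, "token": "watchtowers", "tag": "GEONOUN"}]},
    {"entities": [{"start_char": 7, "token": "Auschwitz", "tag": "CAMP"}]},
]

# Flatten all entities and count them per tag.
tag_counts = Counter(
    ent["tag"] for res in sample_results for ent in res["entities"]
)
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```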

## Annotating text segments

Sometimes we may need to split a text into segments (or chunks) before annotating it. This is similar to annotating a list of texts, except that the segmentation is done for us.

This can be achieved by calling `chunk_and_annotate_text(...)` on a text string or `chunk_and_annotate_file(...)` on a text file. In both cases, the key parameter is `n_segments`, which specifies the number of segments.
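
To give an intuition for what splitting a text into `n_segments` pieces can look like, here is a naive sketch that groups sentences into contiguous, roughly equal segments. It is an illustration only: the regex sentence splitter and the function `split_into_segments` are our assumptions, and the package's own segmentation logic may well differ.

```python
import re


def split_into_segments(text, n_segments):
    """Split text into at most n_segments contiguous groups of sentences.

    Illustration only: a naive regex sentence splitter, not the
    segmentation logic used inside spatio-textual.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    n = max(1, min(n_segments, len(sentences)))
    base, extra = divmod(len(sentences), n)
    segments, start = [], 0
    for i in range(n):
        # The first `extra` segments take one additional sentence.
        size = base + (1 if i < extra else 0)
        segments.append(" ".join(sentences[start:start + size]))
        start += size
    return segments


sample = (
    "Finally, we arrived at a place with barbed wire fences and watchtowers. "
    "It was Auschwitz. The separation was immediate. "
    "Men to one side, women and children to the other."
)
for seg_id, segment in enumerate(split_into_segments(sample, 2), start=1):
    print(seg_id, segment)
```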

#### Using `chunk_and_annotate_text(...)`

In [16]:
#@title ##### To demonstrate this let's download, read and display the text file `long-text`...
!wget -c -q "https://raw.githubusercontent.com/SpaceTimeNarratives/spatio-textual/refs/heads/main/example-texts/long-text"
with open("long-text", "r", encoding="utf-8") as f:  # close the file once read
    text = f.read()
display(text)

"It was the spring of 1943 when they came for us. We were living in a small village near Krakow. The soldiers arrived early in the morning, shouting. They rounded everyone up and forced us onto crowded trains. The journey was terrible, days and nights without food or water. We didn't know where we were going, but the fear was palpable.\n\nFinally, we arrived at a place with barbed wire fences and watchtowers. It was Auschwitz. The separation was immediate. Men to one side, women and children to the other. I never saw my father again. My mother and I were sent to the women's camp. The conditions were horrific. Overcrowding, starvation, disease. We were forced to do hard labor, building roads and clearing rubble.\n\nOne day, my mother became very ill. I tried to care for her, but there was nothing I could do. She died a few weeks later. I was alone. The days blurred into weeks, weeks into months. The constant threat of death was always present. Selections were frequent, and those deemed 

In [17]:
#@title ##### ... then let's segment and annotate it with `chunk_and_annotate_text(...)`
result = chunk_and_annotate_text(text, n_segments=5, include_text=True)
result

[{'entities': [{'start_char': 11, 'token': 'spring', 'tag': 'GEONOUN'},
   {'start_char': 75, 'token': 'village', 'tag': 'GEONOUN'},
   {'start_char': 88, 'token': 'Krakow', 'tag': 'PLACE'},
   {'start_char': 117, 'token': 'early in the morning', 'tag': 'TIME'},
   {'start_char': 235, 'token': 'days', 'tag': 'DATE'},
   {'start_char': 244, 'token': 'nights', 'tag': 'DATE'}],
  'verb_data': [],
  'segId': 1,
  'text': "It was the spring of 1943 when they came for us. We were living in a small village near Krakow. The soldiers arrived early in the morning, shouting. They rounded everyone up and forced us onto crowded trains. The journey was terrible, days and nights without food or water. We didn't know where we were going, but the fear was palpable."},
 {'entities': [{'start_char': 59, 'token': 'watchtowers', 'tag': 'GEONOUN'},
   {'start_char': 79, 'token': 'Auschwitz', 'tag': 'CAMP'},
   {'start_char': 147, 'token': 'children', 'tag': 'FAMILY'},
   {'start_char': 185, 'token': 'father
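
In the output above, each segment's `start_char` values are relative to that segment's own `text`, not to the full document. The sketch below relies on this to mark entities inline, which makes it easier to review the labels by eye (for example, `spring` is tagged `GEONOUN` here although it appears to denote the season). The `[token|TAG]` format and the helper `mark_entities` are hypothetical, chosen only for this example.

```python
def mark_entities(text, entities):
    """Return text with each entity wrapped as [token|TAG] (hypothetical format)."""
    parts, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start_char"]):
        start = ent["start_char"]
        end = start + len(ent["token"])
        if start < cursor:
            continue  # skip spans that overlap one already marked
        parts.append(text[cursor:start])
        parts.append(f"[{text[start:end]}|{ent['tag']}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


# First two sentences of segment 1 and the matching entities from the output above.
segment_text = (
    "It was the spring of 1943 when they came for us. "
    "We were living in a small village near Krakow."
)
segment_entities = [
    {"start_char": 11, "token": "spring", "tag": "GEONOUN"},
    {"start_char": 75, "token": "village", "tag": "GEONOUN"},
    {"start_char": 88, "token": "Krakow", "tag": "PLACE"},
]
marked = mark_entities(segment_text, segment_entities)
print(marked)
```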

#### Using `chunk_and_annotate_file(...)`

With `chunk_and_annotate_file(...)`, you pass the name of a text file rather than a text string (as with `chunk_and_annotate_text(...)`).

>**NOTE:** You can also upload your own file using the file icon in the left sidebar. Then replace the example file name `long-text` with the name of your uploaded file.

In [11]:
#@title ##### ... Let's segment and annotate a file with `chunk_and_annotate_file(...)`
result = chunk_and_annotate_file('long-text', n_segments=5)
result

[{'entities': [{'start_char': 11, 'token': 'spring', 'tag': 'GEONOUN'},
   {'start_char': 75, 'token': 'village', 'tag': 'GEONOUN'},
   {'start_char': 88, 'token': 'Krakow', 'tag': 'PLACE'},
   {'start_char': 117, 'token': 'early in the morning', 'tag': 'TIME'},
   {'start_char': 235, 'token': 'days', 'tag': 'DATE'},
   {'start_char': 244, 'token': 'nights', 'tag': 'DATE'}],
  'verb_data': [],
  'file': 'long-text',
  'fileId': 'long-text',
  'segId': 1,
  'segCount': 5},
 {'entities': [{'start_char': 59, 'token': 'watchtowers', 'tag': 'GEONOUN'},
   {'start_char': 79, 'token': 'Auschwitz', 'tag': 'CAMP'},
   {'start_char': 147, 'token': 'children', 'tag': 'FAMILY'},
   {'start_char': 185, 'token': 'father', 'tag': 'FAMILY'},
   {'start_char': 202, 'token': 'mother', 'tag': 'FAMILY'},
   {'start_char': 240, 'token': 'camp', 'tag': 'GEONOUN'}],
  'verb_data': [],
  'file': 'long-text',
  'fileId': 'long-text',
  'segId': 2,
  'segCount': 5},
 {'entities': [{'start_char': 98, 'token': 'b

## Annotating files

We can also pass a single file, or a collection of files in a folder, as input to `annotate_files(...)`.

In [12]:
#@title ###### Let's start by segmenting and annotating the `long-text` file...
result = annotate_files('long-text',
                        chunk=True,
                        n_segments=5,
                        include_text=True,
                        include_verbs=True)
result

[{'entities': [{'start_char': 11, 'token': 'spring', 'tag': 'GEONOUN'},
   {'start_char': 75, 'token': 'village', 'tag': 'GEONOUN'},
   {'start_char': 88, 'token': 'Krakow', 'tag': 'PLACE'},
   {'start_char': 117, 'token': 'early in the morning', 'tag': 'TIME'},
   {'start_char': 235, 'token': 'days', 'tag': 'DATE'},
   {'start_char': 244, 'token': 'nights', 'tag': 'DATE'}],
  'verb_data': [{'sent-id': 0,
    'verb': 'came',
    'subject': 'they',
    'object': 'us',
    'sentence': 'It was the spring of 1943 when they came for us.'},
   {'sent-id': 1,
    'verb': 'living',
    'subject': 'We',
    'object': 'village',
    'sentence': 'We were living in a small village near Krakow.'},
   {'sent-id': 2,
    'verb': 'arrived',
    'subject': 'soldiers',
    'object': '',
    'sentence': 'The soldiers arrived early in the morning, shouting.'},
   {'sent-id': 2,
    'verb': 'shouting',
    'subject': '',
    'object': '',
    'sentence': 'The soldiers arrived early in the morning, shouting

In [20]:
#@title ###### We can also annotate multiple files...
result = annotate_files(['short-text','long-text'],
                        chunk=True,
                        n_segments=5,
                        include_text=True,
                        include_verbs=True)
result

[{'entities': [{'start_char': 7, 'token': 'the summer of 1942', 'tag': 'DATE'},
   {'start_char': 74, 'token': 'Krakow', 'tag': 'PLACE'},
   {'start_char': 88, 'token': 'Plaszow', 'tag': 'PLACE'},
   {'start_char': 96, 'token': 'labor camp', 'tag': 'GEONOUN'}],
  'verb_data': [{'sent-id': 0,
    'verb': 'deported',
    'subject': 'family',
    'object': 'the summer of 1942',
    'sentence': 'During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp.'}],
  'file': 'short-text',
  'fileId': 'short-text',
  'segId': 1,
  'segCount': 2,
  'text': 'During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp.'},
 {'entities': [{'start_char': 9,
    'token': 'several difficult months',
    'tag': 'DATE'},
   {'start_char': 68, 'token': 'Auschwitz-Birkenau', 'tag': 'CAMP'}],
  'verb_data': [{'sent-id': 0,
    'verb': 'spent',
    'subject': 'We',
    'object': 'several difficult months',
    'senten
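
Each record returned for a file carries `fileId`, `segId` and `segCount`, so segments can be regrouped per source file. A minimal sketch on abridged records shaped like the output above (only the keys needed here are kept); the helper `places_per_file` and its default tag selection are our assumptions:

```python
from collections import defaultdict

# Abridged records shaped like the output above.
records = [
    {"fileId": "short-text", "segId": 2, "segCount": 2,
     "entities": [{"token": "several difficult months", "tag": "DATE"},
                  {"token": "Auschwitz-Birkenau", "tag": "CAMP"}]},
    {"fileId": "short-text", "segId": 1, "segCount": 2,
     "entities": [{"token": "the summer of 1942", "tag": "DATE"},
                  {"token": "Krakow", "tag": "PLACE"},
                  {"token": "Plaszow", "tag": "PLACE"},
                  {"token": "labor camp", "tag": "GEONOUN"}]},
]


def places_per_file(records, tags=("PLACE", "CITY", "CAMP")):
    """Map each fileId to its (segId, token) pairs for the selected tags."""
    per_file = defaultdict(list)
    # Sort so that places come out in reading order within each file.
    for rec in sorted(records, key=lambda r: (r["fileId"], r["segId"])):
        for ent in rec["entities"]:
            if ent["tag"] in tags:
                per_file[rec["fileId"]].append((rec["segId"], ent["token"]))
    return dict(per_file)


print(places_per_file(records))
```

Keeping the `segId` alongside each token preserves the order in which places are mentioned, which may be useful for later journey-style analysis.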

## Saving and loading files

Annotations can be saved for later analysis and visualization using the `save_annotations()` function. Supported formats include `json`, `jsonl`, `csv` and `tsv`.

In [21]:
# from spatio_textual.utils import save_annotations
save_annotations(result, 'result.jsonl')
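
For readers unfamiliar with the `jsonl` (JSON Lines) format, it stores one JSON object per line. The sketch below writes and reads such a file with the standard library only. It illustrates the format and is not how `save_annotations` is implemented; `write_jsonl`, `read_jsonl` and the sample records are hypothetical.

```python
import json
import os
import tempfile

# Two abridged annotation records, shaped like the results above.
records = [
    {"fileId": "short-text", "segId": 1,
     "entities": [{"start_char": 74, "token": "Krakow", "tag": "PLACE"}]},
    {"fileId": "short-text", "segId": 2,
     "entities": [{"start_char": 68, "token": "Auschwitz-Birkenau", "tag": "CAMP"}]},
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(records, path)
    restored = read_jsonl(path)

print(restored == records)
```

Because each line is an independent record, a `jsonl` file can be appended to or streamed without loading the whole corpus at once.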

Saved annotations can be reloaded into memory as a pandas DataFrame using the `load_annotations()` function.

In [22]:
load_annotations('result.jsonl')

|   | entities | verb_data | file | fileId | segId | segCount | text | error |
| - | -------- | --------- | ---- | ------ | ----- | -------- | ---- | ----- |
| 0 | `[{'start_char': 7, 'token': 'the summer of 194...` | `[{'sent-id': 0, 'verb': 'deported', 'subject':...` | short-text | short-text | 1 | 2 | During the summer of 1942, my family and I wer... | |
| 1 | `[{'start_char': 9, 'token': 'several difficult...` | `[{'sent-id': 0, 'verb': 'spent', 'subject': 'W...` | short-text | short-text | 2 | 2 | We spent several difficult months there before... | |
| 2 | `[{'start_char': 11, 'token': 'spring', 'tag': ...` | `[{'sent-id': 0, 'verb': 'came', 'subject': 'th...` | long-text | long-text | 1 | 5 | It was the spring of 1943 when they came for u... | |
| 3 | `[{'start_char': 59, 'token': 'watchtowers', 't...` | `[{'sent-id': 0, 'verb': 'arrived', 'subject': ...` | long-text | long-text | 2 | 5 | Finally, we arrived at a place with barbed wir... | |
| 4 | `[{'start_char': 98, 'token': 'building', 'tag'...` | `[{'sent-id': 2, 'verb': 'forced', 'subject': '...` | long-text | long-text | 3 | 5 | The conditions were horrific. Overcrowding, st... | |
| 5 | `[{'start_char': 9, 'token': 'a few weeks later...` | `[{'sent-id': 0, 'verb': 'died', 'subject': 'Sh...` | long-text | long-text | 4 | 5 | She died a few weeks later. I was alone. The d... | |
| 6 | `[{'start_char': 124, 'token': 'years', 'tag': ...` | `[{'sent-id': 0, 'verb': 'know', 'subject': 'I'...` | long-text | long-text | 5 | 5 | I don't know how I survived. Maybe it was luck... | |


# ToDo
Additional features to include
- ~Annotating a list of texts~ Done ✅
- ~Saving annotations to files (what format: csv/tsv, json, jsonl)~ Done ✅
- ~Loading annotations from corpus file~ Done ✅
- Emotion classification (LLM vs BERT-based)
- Sentiment
- Geocoding
- Create the `Lake District` and `Holocaust` modules?
  - `Holocaust`:
      - `journey` extraction
      - `event` extraction
  - `Lake District`:
      - `nearness`
      - `wild` and `picturesque`
- Analysis
- Visualization