<a href="https://colab.research.google.com/github/IgnatiusEzeani/spatio-textual/blob/main/spatio_textual_package_a_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introducing the `spatio-textual` Python package

**spatio-textual** is a Python package for spatial entity recognition and verb relation extraction from text. It was created by the [Spatial Narratives Project](https://spacetimenarratives.github.io/) and is designed to support spatio-textual annotation, analysis and visualization in digital humanities projects, with initial applications to:

- the *Corpus of Lake District Writing* (CLDW)
- Holocaust survivors' testimonies (e.g., USC Shoah Foundation archives)

This package leverages spaCy and gazetteer-based classification to identify and label spatial entities such as cities, countries, camps, and geographic nouns, and also extracts action-verb contexts involving these entities.


---
## Setting up
Download the `en_core_web_trf` spaCy model and install the `spatio-textual` package.

**_Note:_** *Please wait; this may take a minute or two...* 🕐


In [None]:
!python -m spacy download en_core_web_trf
!pip install -q git+https://github.com/SpaceTimeNarratives/spatio-textual.git

---
## Importing the `spatio-textual` package
Once the spaCy model has been downloaded and the `spatio-textual` package installed, the package can be imported and used in a Python environment to process text.

*Again, this may take about a minute, sorry...*

In [None]:
import spatio_textual

The `spatio-textual` package provides an `annotate` module whose functions `annotate_text`, `annotate_texts`, `chunk_and_annotate_text`, `chunk_and_annotate_file` and `annotate_files` identify and label spatial entities in text inputs of different formats.

We can import these functions directly:

In [None]:
from spatio_textual.annotate import (
    annotate_text,            # annotates a single text
    annotate_texts,           # annotates a collection of texts
    chunk_and_annotate_text,  # chunks and annotates a text
    chunk_and_annotate_file,  # chunks and annotates a file
    annotate_files,           # annotates a collection of files
)

We can also save and load annotations using the `utils` functions `save_annotations(...)` and `load_annotations(...)` respectively:

In [None]:
from spatio_textual.utils import save_annotations, load_annotations

---
## Annotating spatial entities

Beyond the typical labels for the named entity recognition task [`PERSON`, `ORG`, `LOC`, `DATE`], we have defined a set of entity labels that are relevant for our work as shown below:

| Tag          | Description                                                  |
| ------------ | ------------------------------------------------------------ |
| `PERSON`     | A named person                                               |
| `CONTINENT`  | A continent name (e.g. “Europe”, “Asia”)                     |
| `COUNTRY`    | A country name (e.g. “Germany”, “Czechoslovakia”)            |
| `US-STATE`   | A U.S. state name (e.g. “California”, “New York”)            |
| `CITY`       | A city name (e.g. “Berlin”, “London”, when classified)       |
| `CAMP`       | A Holocaust camp name (e.g. “Auschwitz”, from your custom list) |
| `PLACE`      | Other place-type entities not matched above                  |
| `GEONOUN`    | Generic geographic nouns (e.g. “valley”, “moor”)             |
| `NON-VERBAL` | Non-verbal cues (e.g. “[PAUSES]”, “[LAUGHS]”) in the non-verbal list |
| `FAMILY`     | Kinship terms (e.g. “mother”, “uncle”)                       |
| `DATE`       | Temporal expressions (e.g. “March 9, 1996”)                  |
| `TIME`       | Time-of-day expressions (e.g. “3 PM”)                        |
| `EVENT`      | Named events (e.g. “D-Day”)                                  |
| `QUANTITY`   | Numeric/measure expressions (e.g. “100 miles”)               |

We will demonstrate how to use the `annotate` module functions to label spatial entities in text in the next cells.
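
To make the tag scheme concrete, the sketch below shows the general idea behind gazetteer-based labelling: look a place name up in a set of lists and fall back to `PLACE` when nothing matches. It is only an illustration. The `GAZETTEERS` dictionary and the `classify_place` helper are hypothetical names, the lists contain just the examples from the table above, and the package's own matching logic may well differ.

```python
# Hypothetical mini-gazetteers built from the examples in the table above.
# The package's real lists are larger and its matching logic may differ.
GAZETTEERS = {
    "CONTINENT": {"europe", "asia"},
    "COUNTRY": {"germany", "czechoslovakia"},
    "US-STATE": {"california", "new york"},
    "CITY": {"berlin", "london"},
    "CAMP": {"auschwitz"},
}


def classify_place(name: str) -> str:
    """Return the first gazetteer tag whose list contains `name`, else PLACE."""
    key = name.strip().lower()  # normalise case and surrounding whitespace
    for tag, names in GAZETTEERS.items():
        if key in names:
            return tag
    return "PLACE"  # fallback for place names not found in any list


for name in ["Berlin", "Auschwitz", "Windermere"]:
    print(name, "->", classify_place(name))
```

A name that appears in none of the lists (here "Windermere") receives the generic `PLACE` tag, which mirrors the "other place-type entities not matched above" row of the table.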

## Annotating text: `annotate_text(...)`

In [None]:
#@title ###### Assume we have our text stored in `text` variable as below...
text = """During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp.
We spent several difficult months there before being transferred to Auschwitz-Birkenau.
"""

In [None]:
#@title ###### ...we can then identify the entities with the `annotate_text(...)` function.
result = annotate_text(text)
result

We can see tokens identified as places (`PLACE`), geonouns (`GEONOUN`), camps (`CAMP`) etc.

Observe, however, that the `verb_data` value is empty. The `verb_data` is meant to capture the 'actions' performed by actors in the text for further analysis.

To populate it, we can set the optional parameter `include_verbs` to `True` in the function call.

In [None]:
#@title ###### So let's modify the function call to extract `verb_data`
result = annotate_text(text, include_verbs=True)

print("Entities:")
display(result['entities'])

print("\nVerb Data:")
display(result['verb_data'])
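
Once a result is available, a quick tally of the tags is often a useful first look. The sketch below assumes that each item in `result['entities']` is a dictionary with `token` and `tag` keys; that field layout is an assumption made for illustration, so check it against the output displayed above. The `sample_result` data, `count_tags` and `tokens_with_tag` are hypothetical and are not part of the package.

```python
from collections import Counter

# Hypothetical result: the "token"/"tag" field names and the tags assigned
# here are assumptions, not the documented output of annotate_text(...).
sample_result = {
    "entities": [
        {"token": "Krakow", "tag": "CITY"},
        {"token": "Plaszow", "tag": "CAMP"},
        {"token": "camp", "tag": "GEONOUN"},
        {"token": "Auschwitz-Birkenau", "tag": "CAMP"},
    ],
    "verb_data": [],
}


def count_tags(result: dict) -> Counter:
    """Count how often each entity tag occurs in one annotation result."""
    return Counter(entity["tag"] for entity in result["entities"])


def tokens_with_tag(result: dict, tag: str) -> list:
    """Return the tokens labelled with `tag`, in order of appearance."""
    return [e["token"] for e in result["entities"] if e["tag"] == tag]


tag_counts = count_tags(sample_result)
print(tag_counts.most_common())
print(tokens_with_tag(sample_result, "CAMP"))
```

If the real field names differ, only the two dictionary lookups (`"tag"` and `"token"`) need to change.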

## Annotating a list of texts

If we have a collection of texts (instead of just one piece of text), we can use `annotate_texts(...)` instead.

In [None]:
list_of_texts = [
    "My family and I were deported from our home in Krakow to the Plaszow labor camp.",
    "We spent several difficult months there before being transferred to Maribor.",
    "Finally, we arrived at a place with barbed wire fences and watchtowers.",
    "It was Auschwitz."
]

results = annotate_texts(list_of_texts, include_verbs=True)
for result in results:
    print("\nEntities:")
    display(result['entities'])

    print("\nVerb Data:")
    display(result['verb_data'])

## Annotating text segments

Sometimes we need to split a text into segments (or chunks) before annotating them. This is similar to annotating a list of texts, except that the segmentation is done for us.

This can be achieved by calling `chunk_and_annotate_text(...)` on a text string or `chunk_and_annotate_file(...)` on a text file. In both cases, the key parameter is `n_segments`, which specifies the number of segments.
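
To give a feel for what splitting a text into `n_segments` parts means, here is a minimal plain-Python sketch that divides a text into word-based chunks of roughly equal size. It is not the package's implementation, which may segment on sentences or other boundaries; `split_into_segments` is a hypothetical helper written only for this illustration.

```python
def split_into_segments(text: str, n_segments: int) -> list[str]:
    """Split `text` into at most `n_segments` word-based chunks of similar size."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    words = text.split()
    if not words:
        return []
    n = min(n_segments, len(words))  # never create empty segments
    base, extra = divmod(len(words), n)
    segments, start = [], 0
    for i in range(n):
        # The first `extra` segments take one additional word each.
        end = start + base + (1 if i < extra else 0)
        segments.append(" ".join(words[start:end]))
        start = end
    return segments


example = "We spent several difficult months there before being transferred"
for number, segment in enumerate(split_into_segments(example, 3), start=1):
    print(number, segment)
```

Each segment could then be annotated on its own, which is the same idea as passing a list of texts to `annotate_texts(...)`.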

#### Using `chunk_and_annotate_text(...)`

In [None]:
#@title ##### To demonstrate this let's download, read and display the text file `long-text`...
!wget -c -q "https://raw.githubusercontent.com/SpaceTimeNarratives/spatio-textual/refs/heads/main/example-texts/long-text"
# Use a context manager so the file is closed after reading
with open("long-text", "r", encoding="utf-8") as f:
    text = f.read()
display(text)

In [None]:
#@title ##### ... then let's segment and annotate it with `chunk_and_annotate_text(...)`
result = chunk_and_annotate_text(text, n_segments=5, include_text=True)
result

#### Using `chunk_and_annotate_file(...)`

>**NOTE:** You can also upload your file using the file icon in the left sidebar. Then, replace the example text `long-text` with the name of your uploaded file.

In [None]:
#@title ##### ... Let's segment and annotate a file with `chunk_and_annotate_file(...)`
result = chunk_and_annotate_file('long-text', n_segments=5)
result

## Annotating files

We can also pass a file, or a collection of files in a folder, as input to `annotate_files(...)`:

In [None]:
result = annotate_files('long-text',
                        chunk=True,
                        n_segments=5,
                        include_text=True,
                        include_verbs=True)
result

## Saving and loading files

Annotations can be saved for future analysis and visualisation using the `save_annotations()` function. The supported formats include: `json`, `jsonl`, `csv` and `tsv`.

In [None]:
# from spatio_textual.utils import save_annotations
save_annotations(result, 'result.jsonl')

Saved annotations can be reloaded into memory as a pandas DataFrame using the `load_annotations()` function.

In [None]:
load_annotations('result.jsonl')
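
For readers unfamiliar with the `jsonl` (JSON Lines) format, the sketch below shows the idea using only the standard library: each record is written as one JSON object per line, so a file can be read back line by line. The record fields and the `write_jsonl` and `read_jsonl` helpers are hypothetical; the exact layout written by `save_annotations()` may differ.

```python
import json
import os
import tempfile

# Hypothetical records; the fields written by save_annotations(...) may differ.
records = [
    {"segment": 1, "entities": [{"token": "Krakow", "tag": "CITY"}]},
    {"segment": 2, "entities": [{"token": "Auschwitz", "tag": "CAMP"}]},
]


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary folder so the example leaves no files behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
write_jsonl(records, path)
restored = read_jsonl(path)
print(restored == records)
```

Because every line is an independent JSON object, large annotation files in this format can be processed one record at a time without loading everything into memory.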

# ToDo
Additional features to include
- ~Annotating a list of texts~ Done ✅
- ~Saving annotations to files (what format: csv/tsv, json, jsonl)~ Done ✅
- ~loading annotations from corpus file~ Done ✅
- Emotion classification (LLM vs BERT-based)
- Sentiment.
- Geocoding
- Create the `Lake District` and `Holocaust` modules?
  - `Holocaust`:
      - `journey` extraction
      - `event` extraction
  - `Lake District`:
      - `nearness`
      - `wild` and `picturesque`
- Analysis
- Visualization