<a href="https://colab.research.google.com/github/IgnatiusEzeani/spatio-textual/blob/main/spatio_textual_package_a_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introducing the `spatio-textual` Python package

**spatio-textual** is a Python package for spatial entity recognition and verb relation extraction from text. It was created by the [Spatial Narratives Project](https://spacetimenarratives.github.io/) to support spatio-textual annotation, analysis and visualization in digital humanities projects, with initial applications to:

- the *Corpus of Lake District Writing* (CLDW)
- Holocaust survivors' testimonies (e.g., USC Shoah Foundation archives)

This package leverages spaCy and gazetteer-based classification to identify and label spatial entities such as cities, countries, camps, and geographic nouns, and also extracts action-verb contexts involving these entities.


---
## Setting up
Download the `en_core_web_trf` spaCy model and install the `spatio-textual` package.

**_Note:_** *Please wait, this may take a minute or two...* 🕐


In [None]:
!python -m spacy download en_core_web_trf
!pip install -q git+https://github.com/SpaceTimeNarratives/spatio-textual.git

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 3.8 MB/s eta 0:00:00
Collecting spacy-curated-transformers<1.0.0,>=0.2.2 (from en-core-web-trf==3.8.0)
  Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_tokenizers-0.0.9-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl (237 kB)

---
## Importing the `spatio-textual` package
Now that the spaCy model is downloaded and the `spatio-textual` package is installed, the package can be imported and used in a Python environment to process text.

*Again, this may take about a minute, sorry...*

In [28]:
import spatio_textual
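
If the import fails, it can help to confirm that both installs are visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of `spatio-textual`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Module names as used in this notebook: the spaCy model installs as
# `en_core_web_trf`, and the package is imported as `spatio_textual`.
for name in ("spacy", "en_core_web_trf", "spatio_textual"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Any line reporting `MISSING` suggests re-running the setup cell before importing.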

The `spatio-textual` package has an `annotate` module with four functions, `annotate_text`, `annotate_texts`, `chunk_and_annotate_text` and `chunk_and_annotate_file`, which identify and label spatial entities in text inputs of different formats.

We can import these functions directly:

In [29]:
from spatio_textual.annotate import (
    annotate_text,            # annotates a single text
    annotate_texts,           # annotates a collection of texts
    chunk_and_annotate_text,  # chunks and annotates a text
    chunk_and_annotate_file,  # chunks and annotates a file
)

---
## Annotating spatial entities

Beyond the typical named entity recognition labels (`PERSON`, `ORG`, `LOC`, `DATE`), we have defined a set of entity labels that are relevant to our work, as shown below:

| Tag          | Description                                                  |
| ------------ | ------------------------------------------------------------ |
| `PERSON`     | A named person                                               |
| `CONTINENT`  | A continent name (e.g. “Europe”, “Asia”)                     |
| `COUNTRY`    | A country name (e.g. “Germany”, “Czechoslovakia”)            |
| `US-STATE`   | A U.S. state name (e.g. “California”, “New York”)            |
| `CITY`       | A city name (e.g. “Berlin”, “London”), when classified       |
| `CAMP`       | A Holocaust camp name (e.g. “Auschwitz”), from a custom list |
| `PLACE`      | Other place-type entities not matched above                  |
| `GEONOUN`    | Generic geographic nouns (e.g. “valley”, “moor”)             |
| `NON-VERBAL` | Terms in the non-verbal list (e.g. “[PAUSES]”, “[LAUGHS]”)   |
| `FAMILY`     | Kinship terms (e.g. “mother”, “uncle”)                       |
| `DATE`       | Temporal expressions (e.g. “March 9, 1996”)                  |
| `TIME`       | Time-of-day expressions (e.g. “3 PM”)                        |
| `EVENT`      | Named events (e.g. “D-Day”)                                  |
| `QUANTITY`   | Numeric/measure expressions (e.g. “100 miles”)               |
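
The tag set mixes spatial labels with non-spatial ones, so it is often useful to filter entity records by tag. The sketch below rests on two assumptions: the grouping in `SPATIAL_TAGS` is this example's reading of the table, not something the package defines, and the sample records are hand-written placeholders in the `start_char`/`token`/`tag` shape that `annotate_text` returns later in this notebook.

```python
# Assumed grouping: tags from the table that name places or geographic features.
SPATIAL_TAGS = {"CONTINENT", "COUNTRY", "US-STATE", "CITY", "CAMP", "PLACE", "GEONOUN"}


def spatial_only(entities):
    """Keep only the entity records whose tag is in SPATIAL_TAGS."""
    return [ent for ent in entities if ent["tag"] in SPATIAL_TAGS]


# Hypothetical records for illustration; the offsets are arbitrary.
sample = [
    {"start_char": 0, "token": "Berlin", "tag": "CITY"},
    {"start_char": 10, "token": "mother", "tag": "FAMILY"},
    {"start_char": 20, "token": "valley", "tag": "GEONOUN"},
    {"start_char": 30, "token": "3 PM", "tag": "TIME"},
]

print([ent["token"] for ent in spatial_only(sample)])
```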

With the `annotate_text` function, we can label these entities in a given text, as shown below.

### Annotating text

In [30]:
text = """
"During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp.
We spent several difficult months there before being transferred to Auschwitz-Birkenau."
"""

result = annotate_text(text)

In the code above, the output of `annotate_text` is stored in the variable `result`, a dictionary with the keys `'entities'` and `'verb_data'`. We can look at each of them in turn.

In [31]:
#@title ##### Let's look at `'entities'`...
print("Entities:")
display(result['entities'])

Entities:


[{'start_char': 9, 'token': 'the summer of 1942', 'tag': 'DATE'},
 {'start_char': 76, 'token': 'Krakow', 'tag': 'PLACE'},
 {'start_char': 90, 'token': 'Plaszow', 'tag': 'PLACE'},
 {'start_char': 98, 'token': 'labor camp', 'tag': 'GEONOUN'},
 {'start_char': 119, 'token': 'several difficult months', 'tag': 'DATE'},
 {'start_char': 178, 'token': 'Auschwitz-Birkenau', 'tag': 'CAMP'}]

As you can see, `'entities'` is a list of all entities identified in the text. Each one is a dictionary containing the starting character position (`start_char`), the entity span (`token`), and its label (`tag`).
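
Because each record stores a character offset and the matched span, the span can be recovered from the input by slicing, and records can be grouped by label. The sketch below uses the text and the entity list printed above, copied by hand so that the block runs without the package; `span_matches` and `tokens_by_tag` are illustrative helper names, not package functions.

```python
from collections import defaultdict

text = """
"During the summer of 1942, my family and I were deported from our home in Krakow to the Plaszow labor camp.
We spent several difficult months there before being transferred to Auschwitz-Birkenau."
"""

# Entity records copied from the output above.
entities = [
    {'start_char': 9, 'token': 'the summer of 1942', 'tag': 'DATE'},
    {'start_char': 76, 'token': 'Krakow', 'tag': 'PLACE'},
    {'start_char': 90, 'token': 'Plaszow', 'tag': 'PLACE'},
    {'start_char': 98, 'token': 'labor camp', 'tag': 'GEONOUN'},
    {'start_char': 119, 'token': 'several difficult months', 'tag': 'DATE'},
    {'start_char': 178, 'token': 'Auschwitz-Birkenau', 'tag': 'CAMP'},
]


def span_matches(text, entity):
    """Check that slicing `text` at `start_char` reproduces the entity span."""
    start = entity['start_char']
    return text[start:start + len(entity['token'])] == entity['token']


def tokens_by_tag(entities):
    """Group entity spans by their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent['tag']].append(ent['token'])
    return dict(grouped)


print(all(span_matches(text, ent) for ent in entities))
print(tokens_by_tag(entities))
```

The offsets count from the very start of the string, including the leading newline and the opening quotation mark.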

In [32]:
#@title ##### Now let's see the `'verb_data'`...

print("\nVerb Data:")
display(result['verb_data'])


Verb Data:


[]

## Annotating text from file

You can read the content of a text file for annotation.

The code below downloads the example text file, `example-text`, from the source repo [here](https://github.com/SpaceTimeNarratives/spatio-textual/blob/main/example-text) and annotates it.

In [None]:
#@title ##### Download the example text:
!wget -c -q "https://raw.githubusercontent.com/SpaceTimeNarratives/spatio-textual/refs/heads/main/example-text"

In [None]:
#@title ##### Read and annotate the text:
# Use a context manager so the file is closed after reading
with open("example-text", "r", encoding="utf-8") as f:
    example_text = f.read()
file_result = annotate_text(example_text)

In [None]:
#@title ##### Display annotation results:
print("Entities from file:")
display(file_result['entities'])

print("\nVerb Data from file:")
display(file_result['verb_data'])

Entities from file:


[{'start_char': 11, 'token': 'spring', 'tag': 'GEONOUN'},
 {'start_char': 75, 'token': 'village', 'tag': 'GEONOUN'},
 {'start_char': 88, 'token': 'Krakow', 'tag': 'PLACE'},
 {'start_char': 117, 'token': 'early in the morning', 'tag': 'TIME'},
 {'start_char': 235, 'token': 'days', 'tag': 'DATE'},
 {'start_char': 244, 'token': 'nights', 'tag': 'DATE'},
 {'start_char': 397, 'token': 'watchtowers', 'tag': 'GEONOUN'},
 {'start_char': 417, 'token': 'Auschwitz', 'tag': 'CAMP'},
 {'start_char': 485, 'token': 'children', 'tag': 'FAMILY'},
 {'start_char': 523, 'token': 'father', 'tag': 'FAMILY'},
 {'start_char': 540, 'token': 'mother', 'tag': 'FAMILY'},
 {'start_char': 578, 'token': 'camp', 'tag': 'GEONOUN'},
 {'start_char': 682, 'token': 'building', 'tag': 'GEONOUN'},
 {'start_char': 731, 'token': 'mother', 'tag': 'FAMILY'},
 {'start_char': 823, 'token': 'a few weeks later', 'tag': 'DATE'},
 {'start_char': 855, 'token': 'The days', 'tag': 'DATE'},
 {'start_char': 877, 'token': 'weeks', 'tag': '


Verb Data from file:


[{'sent-id': 0,
  'verb': 'came',
  'subject': 'they',
  'object': 'us',
  'sentence': 'It was the spring of 1943 when they came for us.'},
 {'sent-id': 1,
  'verb': 'living',
  'subject': 'We',
  'object': 'village',
  'sentence': 'We were living in a small village near Krakow.'},
 {'sent-id': 2,
  'verb': 'arrived',
  'subject': 'soldiers',
  'object': '',
  'sentence': 'The soldiers arrived early in the morning, shouting.'},
 {'sent-id': 2,
  'verb': 'shouting',
  'subject': '',
  'object': '',
  'sentence': 'The soldiers arrived early in the morning, shouting.'},
 {'sent-id': 3,
  'verb': 'rounded',
  'subject': 'They',
  'object': 'everyone',
  'sentence': 'They rounded everyone up and forced us onto crowded trains.'},
 {'sent-id': 3,
  'verb': 'forced',
  'subject': '',
  'object': 'us',
  'sentence': 'They rounded everyone up and forced us onto crowded trains.'},
 {'sent-id': 5,
  'verb': 'know',
  'subject': 'We',
  'object': '',
  'sentence': "We didn't know where we were goin
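
Verb records can be regrouped or filtered with plain Python. The sketch below uses the first six records from the output above, copied by hand; `verbs_by_sentence` and `complete_triples` are illustrative helpers, not package functions. It treats an empty string as a missing subject or object, which is how the output above appears to mark them.

```python
from collections import defaultdict

# Verb records copied from the output above.
verb_data = [
    {'sent-id': 0, 'verb': 'came', 'subject': 'they', 'object': 'us',
     'sentence': 'It was the spring of 1943 when they came for us.'},
    {'sent-id': 1, 'verb': 'living', 'subject': 'We', 'object': 'village',
     'sentence': 'We were living in a small village near Krakow.'},
    {'sent-id': 2, 'verb': 'arrived', 'subject': 'soldiers', 'object': '',
     'sentence': 'The soldiers arrived early in the morning, shouting.'},
    {'sent-id': 2, 'verb': 'shouting', 'subject': '', 'object': '',
     'sentence': 'The soldiers arrived early in the morning, shouting.'},
    {'sent-id': 3, 'verb': 'rounded', 'subject': 'They', 'object': 'everyone',
     'sentence': 'They rounded everyone up and forced us onto crowded trains.'},
    {'sent-id': 3, 'verb': 'forced', 'subject': '', 'object': 'us',
     'sentence': 'They rounded everyone up and forced us onto crowded trains.'},
]


def verbs_by_sentence(records):
    """Map each sentence id to the verbs extracted from that sentence."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec['sent-id']].append(rec['verb'])
    return dict(grouped)


def complete_triples(records):
    """Return (subject, verb, object) for records where both roles are filled."""
    return [
        (rec['subject'], rec['verb'], rec['object'])
        for rec in records
        if rec['subject'] and rec['object']
    ]


print(verbs_by_sentence(verb_data))
print(complete_triples(verb_data))
```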


Another way to do this is to upload your own file using the file icon in the left sidebar. Then replace `"example-text"` with the name of your uploaded file and run the code cell.

```python
with open("your-uploaded-file", "r", encoding="utf-8") as f:
    example_text = f.read()
file_result = annotate_text(example_text)
```



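## Chunking and annotating a text

`chunk_and_annotate_text` chunks a text and annotates each chunk. In the call below, `n_segments=5` is passed for the chunking, and the output is a list of annotation dictionaries, each with its own `'entities'` and `'verb_data'`.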

In [None]:
results = chunk_and_annotate_text(example_text, n_segments=5, file_id="sample")
results

[{'entities': [{'start_char': 11, 'token': 'spring', 'tag': 'GEONOUN'},
   {'start_char': 75, 'token': 'village', 'tag': 'GEONOUN'},
   {'start_char': 88, 'token': 'Krakow', 'tag': 'PLACE'},
   {'start_char': 117, 'token': 'early in the morning', 'tag': 'TIME'},
   {'start_char': 235, 'token': 'days', 'tag': 'DATE'},
   {'start_char': 244, 'token': 'nights', 'tag': 'DATE'}],
  'verb_data': [{'sent-id': 0,
    'verb': 'came',
    'subject': 'they',
    'object': 'us',
    'sentence': 'It was the spring of 1943 when they came for us.'},
   {'sent-id': 1,
    'verb': 'living',
    'subject': 'We',
    'object': 'village',
    'sentence': 'We were living in a small village near Krakow.'},
   {'sent-id': 2,
    'verb': 'arrived',
    'subject': 'soldiers',
    'object': '',
    'sentence': 'The soldiers arrived early in the morning, shouting.'},
   {'sent-id': 2,
    'verb': 'shouting',
    'subject': '',
    'object': '',
    'sentence': 'The soldiers arrived early in the morning, shouting
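
Since the result is a list of annotation dictionaries, the entities can be flattened into a single list while recording which list element each one came from. This is a sketch with a small hand-made stand-in for `results`: the records in `results_sample` are placeholders for illustration, and the `chunk` key is added by this example. Whether `start_char` is relative to the chunk or to the whole text is not checked here.

```python
def flatten_entities(results):
    """Collect entities from every chunk, tagging each with its chunk index."""
    flat = []
    for idx, chunk in enumerate(results):
        for ent in chunk['entities']:
            flat.append({'chunk': idx, **ent})
    return flat


# Hypothetical stand-in for the list returned by chunk_and_annotate_text.
results_sample = [
    {'entities': [{'start_char': 88, 'token': 'Krakow', 'tag': 'PLACE'}],
     'verb_data': []},
    {'entities': [{'start_char': 0, 'token': 'Auschwitz', 'tag': 'CAMP'},
                  {'start_char': 20, 'token': 'father', 'tag': 'FAMILY'}],
     'verb_data': []},
]

for ent in flatten_entities(results_sample):
    print(ent['chunk'], ent['tag'], ent['token'])
```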

# ToDo
Additional features to include:
- Annotating a list of texts
- Saving annotations to files (what format: csv, json, txt)
- Emotion classification (LLM vs BERT-based)
- Sentiment analysis
- Geocoding
- Create the `Lake District` and `Holocaust` modules?
  - `Holocaust`:
      - `journey` extraction
      - `event` extraction
  - `Lake District`:
      - `nearness`
      - `wild` and `picturesque`
- Analysis
- Visualization
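
For the file-saving item above, one possible shape is sketched here using only the standard library. This is an illustration rather than the package's API: `save_entities_json` and `save_entities_csv` are hypothetical names, the column order simply follows the keys shown in the entity output earlier, and the files are written to a temporary directory.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_entities_json(entities, path):
    """Write entity records to a JSON file, keeping non-ASCII characters as-is."""
    Path(path).write_text(
        json.dumps(entities, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def save_entities_csv(entities, path):
    """Write entity records to a CSV file with one row per entity."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["start_char", "token", "tag"])
        writer.writeheader()
        writer.writerows(entities)


# Two records copied from the entity output earlier in this notebook.
entities = [
    {'start_char': 76, 'token': 'Krakow', 'tag': 'PLACE'},
    {'start_char': 178, 'token': 'Auschwitz-Birkenau', 'tag': 'CAMP'},
]

out_dir = Path(tempfile.mkdtemp())
save_entities_json(entities, out_dir / "entities.json")
save_entities_csv(entities, out_dir / "entities.csv")
print(sorted(p.name for p in out_dir.iterdir()))
```

JSON keeps the original types (integer offsets), while CSV turns every value into text, which may matter when the files are read back for analysis.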