# Reflection on Unlocking Texts

The [position paper](https://osf.io/u6vb4) by Neil Jefferies et al. (Bodleian Library Oxford)
is searching for an Interoperable Text Format.

So am I, and Text-Fabric is my practical answer for a number of use cases
(by no means all possible use cases).

[Text-Fabric](https://github.com/annotation/text-fabric)
has existed for 6 years under its current name, plus another 3 years as its precursor
[LAF-Fabric](https://laf-fabric.readthedocs.io/en/latest/texts/welcome.html).
I have applied it to a number of
[corpora](https://annotation.github.io/text-fabric/tf/about/corpora.html)
and it has proved to be useful in research workflows.

In this notebook I want to demonstrate some of the ideas and concepts that were hinted at during
a workshop on 2023-01-26 in Oxford.

# Addressing and displaying text fragments

From the position paper:

> Fragment Addressing
>
> This bottom-up approach starts by considering the range of text file formats under consideration for this proof-of-concept phase.
>
> Complex formats, which can include embedded binary objects,
> requiring specialised software to display or otherwise interact with.

But why should we interact with texts in the source formats in which they come to us?
Most likely they have already been converted many times over from the time they were digitally born 
till the moment they arrive at the screen of your laptop.

We can add another conversion so that we facilitate the operations on text that we are interested in.

And yes, the position paper states, a bit later:

> However, at this stage, it appears that it would be advantageous
> to also have a higher level scheme that operates in a more “human-friendly” way, with word (or token) granularity and some sense of semantic structure at a level similar to Markdown or a light-TEI schema.

This is a nice starting point, except that instead of "Markdown or a light-TEI schema" I would opt
for an abstract data model: the graph with nodes and edges.

With a graph, we can operate on the very abstract structure of text and dress that up with annotations as needed.

What does a text graph look like? A graph is a set of nodes plus a set of edges between nodes.
But text has a bit more structure than that: it has the notion of sequence and embedding.

In Text-Fabric the nodes are everything you can address: first of all the textual positions (*slots*)
but also subsets of slots, which may represent pages, lines, sentences, lexemes, or whatever,
up to the modeller.

The abstract model does not contain the text itself. The text is a set of annotations (features) to
the slots.
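To make this concrete, here is a minimal sketch of such a text graph in plain Python. It is not how Text-Fabric stores its data; all names (`node_type`, `node_slots`, `feature_text`, `slots_of`, `text_of`) and the sample words are invented for illustration. It only shows the idea that slots are positions, other nodes are sets of slots, and the text is a feature on the slots.

```python
# Toy text graph. Illustrative only: these names are not part of Text-Fabric.

# Nodes are numbers. Nodes 1..5 are slots (textual positions);
# nodes 6 and 7 are higher-level nodes of type "sentence".
node_type = {
    1: "word", 2: "word", 3: "word", 4: "word", 5: "word",
    6: "sentence", 7: "sentence",
}

# Each non-slot node is linked to the slots it covers.
node_slots = {6: (1, 2, 3), 7: (4, 5)}

# The text is not in the graph itself: it is a feature of the slots.
feature_text = {1: "In", 2: "the", 3: "beginning", 4: "God", 5: "created"}


def slots_of(node):
    """Return the slots covered by a node; a slot covers just itself."""
    return node_slots.get(node, (node,))


def text_of(node):
    """Assemble the text of a node from the text feature of its slots."""
    return " ".join(feature_text[s] for s in slots_of(node))


print(text_of(6))  # text of a sentence node
print(text_of(4))  # text of a single slot
```

Because every node resolves to slots, any node can be addressed and displayed in the same way, whatever its type.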

## The BHSA

Enough talk, let's see this in practice.
The Biblia Hebraica Stuttgartensia (BHS) is available as a Text-Fabric dataset,
the [BHS Amstelodamensis](https://github.com/ETCBC/bhsa).

What I'm about to show you in a moment, you can do yourself by installing Python, and then
JupyterLab and Text-Fabric (`pip install text-fabric`).

Fire up a notebook, and mimic what you see here. Everything is open source,
and all data that is needed will be downloaded to your computer when needed.

We start by importing Text-Fabric, `tf` for short.

In [1]:
from tf.app import use

Then we can `use` our corpus. We retrieve the corpus by its location on GitHub,
under the `ETCBC` organisation, in the repo `bhsa`.

In [2]:
A = use("ETCBC/bhsa", hoist=globals())

**Locating corpus resources ...**

| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| book | 39 | 10938.21 | 100 |
| chapter | 929 | 459.19 | 100 |
| lex | 9230 | 46.22 | 100 |
| verse | 23213 | 18.38 | 100 |
| half_verse | 45179 | 9.44 | 100 |
| sentence | 63717 | 6.7 | 100 |
| sentence_atom | 64514 | 6.61 | 100 |
| clause | 88131 | 4.84 | 100 |
| clause_atom | 90704 | 4.7 | 100 |


*by all means, open those triangles above ...*

## The Text-Fabric API

The variable `A` gives access to everything in our corpus.

The corpus consists of nodes, and the nodes are natural numbers.

Let's see some nodes.

Here are the [docs](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
on the data model. Note that *text* and *fabric* have the same semantic root.

In [52]:
n1 = 100000
n2 = 500000

## Display

We are going to display them:

In [53]:
A.plain(n1)

Apparently, 100000 is a node that represents a word, a word that occurs in book Deuteronomy, chapter 11, verse 19.

In [54]:
A.plain(n2)

This is more than a word, but what exactly is it?

### Richer display

We can show it with more details:

In [6]:
A.pretty(n2)

Ah, it is a *clause*, and it is composed of several *phrases*, and they all have certain properties.

### Tweaking the display: text formats

In case you are not familiar with the Hebrew script, the dataset contains a feature `phono` that contains the phonological representation of the text.

The corpus has a text format that uses this feature to display the text, instead of the feature `g_word_utf8` that is used by the default format.

Let's go phonological:

In [7]:
A.pretty(n2, fmt="text-phono-full")

Here it pays off that it is not the text that is annotated, but that the text itself is a feature of
the textual positions, the slots.
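The following sketch illustrates why this pays off. It is a toy in plain Python, not the Text-Fabric implementation: the feature names `orig` and `phono`, the format names, the sample values, and the function `render` are all assumptions made for this example. A text format is then nothing more than a choice of which feature to read from each slot.

```python
# Toy illustration of text formats as feature selection.
# Feature names, format names and values are invented for this sketch.

# Two alternative representations of the same three slots.
features = {
    "orig": {1: "The", 2: "knight", 3: "writes"},
    "phono": {1: "thuh", 2: "nite", 3: "rites"},
}

# A format names the feature that supplies the text.
formats = {"text-orig": "orig", "text-phono": "phono"}


def render(slots, fmt):
    """Render a sequence of slots using the feature selected by the format."""
    feature = features[formats[fmt]]
    return " ".join(feature[s] for s in slots)


print(render((1, 2, 3), "text-orig"))
print(render((1, 2, 3), "text-phono"))
```

The slots stay the same in both calls; only the feature that dresses them up changes.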

Let's show the nodes:

In [9]:
A.pretty(n2, fmt="text-orig-plain", withNodes=True)

By the way, I have now switched to yet another text format, one that leaves out the diacritics (vowels and accents).

If you wonder which formats there are to choose from, we just ask:

In [10]:
T.formats

{'lex-default': 'word',
 'lex-orig-full': 'word',
 'lex-orig-plain': 'word',
 'lex-trans-full': 'word',
 'lex-trans-plain': 'word',
 'text-orig-full': 'word',
 'text-orig-full-ketiv': 'word',
 'text-orig-plain': 'word',
 'text-phono-full': 'word',
 'text-trans-full': 'word',
 'text-trans-full-ketiv': 'word',
 'text-trans-plain': 'word'}

These are formats defined by the corpus modeller, not formats that come with Text-Fabric.

Text-Fabric itself is very agnostic about corpora.

As a check, let's drill down to phrases and words, and now we switch back to phono:

In [11]:
A.pretty(860096, fmt="text-phono-full", withNodes=True)

## Navigating nodes

It does get cumbersome to type in those numbers manually.
In normal workflows, you never see them.

The Text-Fabric API has methods to start at some node, and then find related nodes, and show features of those nodes:

In [12]:
L.d(n2, otype="phrase")

(860096, 860097, 860098)

The `L` operator goes from nodes to enclosing or embedded nodes or to preceding or following nodes.

We went `d`own, from `n2` to all nodes of type `phrase` whose slots are contained in the slots of
`n2`.
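The idea behind going down and up can be sketched with slot sets. This is a simplified toy, assuming invented node numbers and hypothetical functions `down` and `up`; Text-Fabric's own `L` functions are precomputed and return nodes in a canonical order, which this sketch replaces by plain numeric order.

```python
# Toy version of "down" and "up" navigation based on slot containment.
# Node numbers and function names are invented for this sketch.

node_type = {10: "clause", 11: "phrase", 12: "phrase", 13: "phrase", 14: "phrase"}

# The slots covered by each node.
node_slots = {
    10: {1, 2, 3, 4, 5},  # a clause
    11: {1, 2},           # phrases inside that clause
    12: {3},
    13: {4, 5},
    14: {6, 7},           # a phrase outside that clause
}


def down(node, otype):
    """Nodes of the given type whose slots are contained in the slots of node."""
    outer = node_slots[node]
    return tuple(
        sorted(
            n
            for n, s in node_slots.items()
            if n != node and node_type[n] == otype and s <= outer
        )
    )


def up(node, otype):
    """Nodes of the given type whose slots contain the slots of node."""
    inner = node_slots[node]
    return tuple(
        sorted(
            n
            for n, s in node_slots.items()
            if n != node and node_type[n] == otype and inner <= s
        )
    )


print(down(10, "phrase"))
print(up(12, "clause"))
```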

In [14]:
for p in L.d(n2, otype="phrase"):
    A.pretty(p, fmt="text-phono-full", withNodes=True)

## Locality-API

You may have noticed that the graph has the information to provide the context for each word.

The API has functions to get from a context specification to a node:

In [18]:
A.nodeFromSectionStr("Job 10:5")

1432286

Fine, but what is it?

In [17]:
A.pretty(A.nodeFromSectionStr("Job 10:5"), fmt="text-phono-full", withNodes=True)
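Conceptually, such a lookup parses the section string and consults an index from sections to nodes. The sketch below is an assumption about the general idea, not the code behind `A.nodeFromSectionStr`; the names `SECTION_RE`, `verse_index`, `parse_section` and `node_from_section` are hypothetical, and the index holds only the single node number shown above.

```python
import re

# Toy section lookup. All names here are hypothetical, not Text-Fabric API.

# Matches strings such as "Job 10:5": book name, chapter number, verse number.
SECTION_RE = re.compile(r"^(?P<book>.+?)\s+(?P<chapter>\d+):(?P<verse>\d+)$")

# Index from (book, chapter, verse) to a verse node.
# It contains only the one node number that appears in the output above.
verse_index = {("Job", 10, 5): 1432286}


def parse_section(section):
    """Split a section string into (book, chapter, verse)."""
    match = SECTION_RE.match(section.strip())
    if match is None:
        raise ValueError(f"not a verse reference: {section!r}")
    return (match["book"], int(match["chapter"]), int(match["verse"]))


def node_from_section(section):
    """Return the verse node for a section string, or None if unknown."""
    return verse_index.get(parse_section(section))


print(node_from_section("Job 10:5"))
```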

## Regulate the amount of information

If we find the display a bit much, we can reduce it.
In this case, every *sentence* has exactly one *clause* which coincides with it.

But first we go the other way, because there is more to the data than met the eye so far:

In [21]:
j = A.nodeFromSectionStr("Job 10:5")

A.pretty(j, fmt="text-phono-full", withNodes=True, hideTypes=False)

Now we reduce:

In [22]:
hiddenTypes = {
    "half_verse",
    "sentence",
    "sentence_atom",
    "clause_atom",
    "phrase_atom",
    "subphrase",
}
A.pretty(j, fmt="text-phono-full", withNodes=True, hideTypes=True, hiddenTypes=hiddenTypes)

We can also decide not to drill down further than the *phrase* level.

In [55]:
A.pretty(
    j,
    fmt="text-phono-full",
    withNodes=True,
    hideTypes=True,
    hiddenTypes=hiddenTypes,
    baseTypes={"phrase"},
)
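The effect of hidden types and base types can be mimicked on a small tree. This is a rough sketch under my own assumptions, not the display algorithm of Text-Fabric: the tree, the function `outline` and its parameters are invented. Hidden types are skipped while their children are still shown; base types are shown but not unravelled further.

```python
# Toy outline of a nested structure with hidden types and base types.
# The tree and the function are invented for this sketch.

# Each node is (type, children).
tree = (
    "verse",
    [
        (
            "sentence",
            [
                (
                    "clause",
                    [
                        ("phrase", [("word", []), ("word", [])]),
                        ("phrase", [("word", [])]),
                    ],
                )
            ],
        )
    ],
)


def outline(node, hidden_types=frozenset(), base_types=frozenset(), depth=0):
    """Return indented lines for the node types that remain visible."""
    otype, children = node
    lines = []
    if otype in hidden_types:
        # Skip this level, but keep its children at the same depth.
        child_depth = depth
    else:
        lines.append("  " * depth + otype)
        child_depth = depth + 1
    if otype not in base_types:
        # Base types are displayed as a whole: do not descend into them.
        for child in children:
            lines.extend(outline(child, hidden_types, base_types, child_depth))
    return lines


print("\n".join(outline(tree, hidden_types={"sentence"}, base_types={"phrase"})))
```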

## Querying and highlighting

Displaying text fragments is important when you need to show query results.

Here is how the display mechanism works with the query mechanism.

Let's look for the prepositions in Job 10:5:

In [56]:
query = """
book book@en=Job
  chapter chapter=10
    verse verse=5
      word sp=prep
"""

In [57]:
results = A.search(query)

  0.23s 2 results
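The gist of such a query is: find nodes of given types that satisfy feature conditions and are embedded in each other. The toy below shows that idea for two levels only. It is not Text-Fabric's search engine; the function `search`, the node numbers and the `sp` values are invented for illustration and are not taken from the BHSA.

```python
# Toy two-level search: an outer node with an embedded inner node,
# each constrained by feature values. All data here is invented.

node_type = {1: "word", 2: "word", 3: "word", 4: "word", 100: "verse"}
node_slots = {100: {1, 2, 3, 4}}
features = {
    "sp": {1: "prep", 2: "subs", 3: "verb", 4: "prep"},
    "verse": {100: 5},
}


def slots_of(node):
    """Slots covered by a node; a slot covers just itself."""
    return node_slots.get(node, {node})


def matches(node, otype, conditions):
    """Check the node type and all feature=value conditions."""
    return node_type[node] == otype and all(
        features[name].get(node) == value for name, value in conditions.items()
    )


def search(outer_type, outer_cond, inner_type, inner_cond):
    """Return (outer, inner) pairs where inner is embedded in outer."""
    results = []
    for outer in sorted(node_type):
        if not matches(outer, outer_type, outer_cond):
            continue
        for inner in sorted(node_type):
            if matches(inner, inner_type, inner_cond) and slots_of(
                inner
            ) <= slots_of(outer):
                results.append((outer, inner))
    return results


print(search("verse", {"verse": 5}, "word", {"sp": "prep"}))
```

Each result is a tuple of nodes, which is exactly the kind of thing a display function can take and highlight.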


In [58]:
A.show(results, condensed=True)

The results are highlighted, and the word features that were mentioned in the query (`sp`) are shown for each word.

When it comes to displaying annotations on a text, the situation is much the same:
annotations have targets, and we want to highlight the targets in the text.
Annotations have bodies, and we want to display the bodies as extra information on top of the text.
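A minimal sketch of that idea in plain Python, assuming an invented annotation shape (a dict with a `target` set of slots and a `body` string) and a hypothetical function `display`. It marks the targeted words with brackets and adds the bodies next to them.

```python
# Toy display of annotations: targets are highlighted, bodies are shown.
# The data shapes and the function are assumptions made for this sketch.

words = {1: "He", 2: "went", 3: "to", 4: "the", 5: "city"}

annotations = [
    {"target": {3}, "body": "prep"},
    {"target": {4, 5}, "body": "destination"},
]


def display(slots, annotations):
    """Render slots as text, bracketing annotated words with their bodies."""
    tokens = []
    for slot in slots:
        bodies = [a["body"] for a in annotations if slot in a["target"]]
        token = words[slot]
        if bodies:
            token = f"[{token}|{','.join(bodies)}]"
        tokens.append(token)
    return " ".join(tokens)


print(display(range(1, 6), annotations))
```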

Having a system of nodes and edges helps to display text fragments in agile ways.

See also [display design](https://annotation.github.io/text-fabric/tf/about/displaydesign.html).

## Further

This is just scratching the surface.

Once you have your nodes, edges and features in place, the ground is fertile to build
additional workflows, and to do data science. Or build apps.

The next important step is to produce, share, and invoke new annotation data.

As an example, the following notebook shows how we can load the correspondence of Descartes
with additional data that relates similar sentences in that corpus:

[CLARIAH/descartes-tf](https://nbviewer.org/github/CLARIAH/descartes-tf/blob/main/tutorial/similar.ipynb).