# Text-Fabric and Alpino

## Introduction

Let's compare two tools that give users computational power over annotated corpora:

[Text-Fabric](https://github.com/annotation/text-fabric)
and
[Alpino](https://urd2.let.rug.nl/~kleiweg/alpinograph-docs/), especially its graph-based
query language.

### How to digest an annotated corpus

If you have an annotated corpus there are three ways you could *consume* the data:

* browsing and reading,
* computing (walk the corpus by means of a program that collects results),
* querying.

The advantage of computing and querying over browsing and reading is that you can find needles in haystacks,
and filter and process the corpus in ways that are infeasible by the human eye and hand.

Yet, when computing and querying are done, it is vitally important that users can read and browse around
the results, in order to see what is happening in the corpus and get ideas for new computations and queries.

The advantage of querying is that it can be done without programming, although the query language
must be mastered. But that is still much easier and less time consuming than the art of programming.

The disadvantage of querying is that sooner or later, when the research questions become increasingly complex,
the query language tends to become a straitjacket. Users have to become over-ingenious in order to find
the queries that work for them. They approach a point where it pays off to compute.

In an ideal world, it should be easy to bridge the gap between querying and hand-coding smoothly.

In the real world, this gap tends to be an enormous barrier.

Results from the query engine are serialized from an internal representation to an external representation.
In the worst case, the user has only access to the results by means of a web interface.
In better cases the user can download results as data, in JSON or TSV, or TXT.

Even then, important parts of the context tend to get lost. Where exactly are the results located in the corpus?
Can I get from a result sentence to the sentence that immediately follows it?

This gap becomes surmountable if the system has an addressing scheme for every possible text-fragment
in the corpus, and is able to deliver those addresses within the results.
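
To make the idea of an addressing scheme concrete, here is a minimal sketch in plain Python. It is not how Text-Fabric or Alpino store their data: the toy corpus, the `Sentence` class and the helpers `search` and `next_sentence` are all invented for illustration. The point is only that when results are addresses rather than serialized text, the sentence following a hit is one step away.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    address: int  # hypothetical integer address of the sentence
    text: str


# A toy corpus: addresses are consecutive integers in reading order.
CORPUS = [
    Sentence(1, "In the beginning there was a corpus."),
    Sentence(2, "It was annotated with care."),
    Sentence(3, "Then somebody wrote a query."),
]
BY_ADDRESS = {s.address: s for s in CORPUS}


def search(word):
    """Return the addresses (not the text) of sentences containing `word`."""
    return [s.address for s in CORPUS if word in s.text.split()]


def next_sentence(address):
    """Return the sentence that immediately follows, or None at the end."""
    return BY_ADDRESS.get(address + 1)


hits = search("annotated")
following = [next_sentence(a) for a in hits]
print(hits, [s.text for s in following if s is not None])
```

Because `hits` holds addresses, the context of each result stays reachable after the search is done.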

There are also design characteristics of the query language that help to lessen the gap.

*First of all*, a query should expose the terms of the corpus clearly and unaltered, and should be minimalistic
in all other respects.

*Secondly*, a query should mimic the pattern it is designed to retrieve; this helps when you want to search by example.

*Thirdly*, a query should be able to express spatial relationships between text-fragments, such as "contained-in",
"overlapping", "completely before", "adjacent", etc.

*Fourthly*, a query should be able to combine spatial relationships with all other features that the corpus has on offer.
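
The spatial relationships named above can be sketched in a few lines of Python, under the assumption that a text-fragment is modelled as a set of integer text positions. The function names are made up for this illustration and are not the operators of any particular query language.

```python
def contained_in(a, b):
    """Every position of fragment a is also a position of fragment b."""
    return a <= b


def overlapping(a, b):
    """The fragments share at least one position."""
    return bool(a & b)


def completely_before(a, b):
    """The last position of a comes before the first position of b."""
    return max(a) < min(b)


def adjacent_before(a, b):
    """Fragment b starts right after fragment a ends."""
    return max(a) + 1 == min(b)


# Toy fragments: positions are word offsets in the text.
phrase1 = frozenset({1, 2, 3})
phrase2 = frozenset({4, 5})
clause = frozenset(range(1, 6))
gapped = frozenset({1, 2, 6})  # a fragment with a gap in it

print(contained_in(phrase1, clause), adjacent_before(phrase1, phrase2))
print(completely_before(gapped, phrase2), overlapping(gapped, phrase1))
```

Modelling fragments as sets rather than as (start, end) pairs keeps the definitions valid for fragments with gaps.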

## Text-Fabric

In Text-Fabric we aim to produce such a query language. Here are a few characteristics:

* Queries are `topographical`: a query is a relationship pattern, and the results are all instantiations of that
  pattern found in the corpus.
  
  
* The query language is data-agnostic, but it can use the names of all features defined in the corpus.

* The results of queries are tuples of nodes and nodes are integers.

### Example

Suppose the corpus

* has nodes for `sentences`, `clauses`, `phrases`, `words`.
* defines a feature `typ` for clauses and phrases
* defines features `sp` (part-of-speech) and `g_cons` (consonantal transcription) for words.

Then we can make a query that looks for NP phrases that contain a verb, where such phrases occur
in clauses of type `Ptcp`. Oh, and the verb should begin with an `M`.

**N.B.: This is a real-world example. You can reproduce this on your own computer.**

```
sentence
  clause typ=Ptcp
    phrase typ=NP
      word sp=verb g_cons~^M
```

And the query results form a tuple of individual results, where each individual result is a tuple

(*s*, *c*, *p*, *w*)

of a sentence node, a clause node, a phrase node and a word node.

## Running a query

In Text-Fabric, a query can be run from within a Python program.

Load the software.

In [2]:
# assumed: pip install 'text-fabric[all]'

from tf.app import use

Load the data and give a handle to the API that gives access to it:

In [3]:
A = use("ETCBC/bhsa") # the corpus data is retrieved from github.com/ETCBC/bhsa and then cached locally

**Locating corpus resources ...**

| Name | # of nodes | # slots/node | % coverage |
| --- | --- | --- | --- |
| book | 39 | 10938.21 | 100.0 |
| chapter | 929 | 459.19 | 100.0 |
| lex | 9230 | 46.22 | 100.0 |
| verse | 23213 | 18.38 | 100.0 |
| half_verse | 45179 | 9.44 | 100.0 |
| sentence | 63717 | 6.7 | 100.0 |
| sentence_atom | 64514 | 6.61 | 100.0 |
| clause | 88131 | 4.84 | 100.0 |
| clause_atom | 90704 | 4.7 | 100.0 |


Write a query:

In [4]:
query = """
sentence
  clause typ=Ptcp
    phrase typ=NP
      word sp=verb g_cons~^M
"""

Run the query:

In [5]:
results = A.search(query)

  0.38s 20 results


Show the results as nodes:

In [6]:
results

[(1174454, 430355, 660050, 14127),
 (1177926, 434807, 673387, 35121),
 (1187343, 448669, 715553, 112534),
 (1187717, 449226, 717232, 115558),
 (1187736, 449248, 717296, 115657),
 (1202000, 468385, 774745, 213096),
 (1202370, 468879, 776134, 215412),
 (1210183, 479652, 805689, 262762),
 (1217009, 488825, 831501, 304239),
 (1217856, 489988, 834658, 309591),
 (1217882, 490016, 834736, 309691),
 (1217899, 490036, 834790, 309767),
 (1226243, 501363, 863729, 350269),
 (1226332, 501471, 864016, 350676),
 (1226471, 501648, 864488, 351373),
 (1226625, 501858, 865033, 352145),
 (1227200, 502606, 866980, 355006),
 (1227224, 502635, 867058, 355103),
 (1229566, 506177, 876769, 370968),
 (1231842, 509729, 887037, 390501)]
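
Since every result is a plain tuple of integers, ordinary Python unpacking is enough to pick out the parts you need. The sketch below copies the first three tuples from the output above, so it runs without the corpus; the variable names are chosen for this example only.

```python
# The first three result tuples, copied from the output above.
sample = [
    (1174454, 430355, 660050, 14127),
    (1177926, 434807, 673387, 35121),
    (1187343, 448669, 715553, 112534),
]

# Each tuple follows the order of the query lines:
# sentence, clause, phrase, word.
sentence_nodes = [s for (s, c, p, w) in sample]
word_nodes = [w for (s, c, p, w) in sample]

print(sentence_nodes)
print(word_nodes)
```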

Dress up the results and show the first two of them in a table:

In [7]:
A.table(results, end=2)

| n | p | sentence | clause | phrase | word |
| --- | --- | --- | --- | --- | --- |
| 1 | Genesis 27:29 | וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ | וּֽ מְבָרֲכֶ֖יךָ בָּרֽוּךְ׃ | מְבָרֲכֶ֖יךָ | מְבָרֲכֶ֖יךָ |
| 2 | Exodus 12:19 | כִּ֣י׀ כָּל־ אֹכֵ֣ל מַחְמֶ֗צֶת וְ נִכְרְתָ֞ה הַנֶּ֤פֶשׁ הַהִוא֙ מֵעֲדַ֣ת יִשְׂרָאֵ֔ל בַּגֵּ֖ר וּבְאֶזְרַ֥ח הָאָֽרֶץ׃ | אֹכֵ֣ל מַחְמֶ֗צֶת | מַחְמֶ֗צֶת | מַחְמֶ֗צֶת |


Change to the ASCII transliteration of the consonants of each word, and show all results in a table, but hide the
last three columns.

In [8]:
A.table(results, fmt="text-trans-plain", skipCols={2, 3, 4})

| n | p | sentence | clause | phrase | word |
| --- | --- | --- | --- | --- | --- |
| 1 | Genesis 27:29 | W MBRKJK BRWK00 | | | |
| 2 | Exodus 12:19 | KJ05 KL& >KL MXMYT W NKRTH HNPC HHW> M<DT JFR>L BGR WB>ZRX H>RY00 | | | |
| 3 | Deuteronomy 33:20 | BRWK MRXJB GD | | | |
| 4 | Joshua 6:9 | W HM>SP HLK >XRJ H>RWN | | | |
| 5 | Joshua 6:13 | W HM>SP HLK >XRJ >RWN JHWH | | | |
| 6 | Isaiah 3:12 | <MJ M>CRJK MT<JM W DRK >RXTJK BL<W00_S | | | |
| 7 | Isaiah 9:15 | W M>CRJW MBL<JM00 | | | |
| 8 | Jeremiah 51:1 | HNNJ M<JR <L&BBL W>L&JCBJ LB QMJ RWX MCXJT00 | | | |
| 9 | Haggai 1:6 | W HMFTKR MFTKR >L&YRWR NQWB00_P | | | |
| 10 | Malachi 1:7 | MGJCJM <L&MZBXJ LXM MG>L | | | |


Make the `text-trans-plain` format the default and show results by sentence until further notice.

In [9]:
A.displaySetup(fmt="text-trans-plain", condenseType="sentence")

Expand result 11, because something is going on there: a phrase gets interrupted!

In [10]:
A.show(results, start=11, end=11)

**N.B.: None of the words `sentence`, `clause`, `phrase`, `word`, `typ`, `sp`, `g_cons`,
`Ptcp`, `NP` is built into Text-Fabric; they are taken from the corpus organisation.**

So the text of the query is almost entirely made up of terms that are familiar if you know the corpus.

In order to get to know the corpus, the user needs to consult a
*feature documentation document*. For the Hebrew Bible that looks like 
[this](https://etcbc.github.io/bhsa/features/0_home/).

There is much more to search in Text-Fabric.

Here are the
[search docs](https://annotation.github.io/text-fabric/tf/about/searchusage.html#what-is-text-fabric-search)
and here is a search
[tutorial for the Hebrew Bible](https://nbviewer.org/github/ETCBC/bhsa/blob/master/tutorial/search.ipynb).

## Computing with the results

The fact that the results are just tuples of integers makes it easy to post-process results with
your own code.

Suppose you want to limit the results to those sentences that do not have *hapaxes*,
then you can write your own Python code to do that.

Suppose the corpus has a word feature `freq_lex` that for each word occurrence has the number
of occurrences of the lexeme of the word in the corpus.

Then we can filter like this:

In [11]:
F = A.api.Feature # the API to retrieve feature values
L = A.api.Locality # the API to navigate to nodes in the neighbourhood

unwantedResults = []
wantedResults = []

for result in results:
    s = result[0]
    words = L.d(s, otype="word")
    hasHapax = any(F.freq_lex.v(w) == 1 for w in words)
    if hasHapax:
        unwantedResults.append(result)
    else:
        wantedResults.append(result)
        
print(f"{len(wantedResults)=} {len(unwantedResults)=}")

len(wantedResults)=19 len(unwantedResults)=1


Let's show the unwanted result and show the `freq_lex` feature for all words:

In [12]:
A.show(unwantedResults, extraFeatures="freq_lex")

This ends the Text-Fabric demo.

# Comparison with Alpino

The Alpino system has a graph-based
[query language](https://urd2.let.rug.nl/~kleiweg/alpinograph-docs/zoeken/).

Let's make an explorative comparison between this Alpino way of searching and Text-Fabric, because there seems
to be quite a bit of convergence between the two.

## Data model

Alpino works with `nodes` and `words`.

Text-Fabric works with `nodes`. Some nodes are the atomic ones, called `slots`, they are the textual positions.
What you find at a slot depends on what the corpus modeller has chosen, but quite often slots correspond to words.

Alpino nodes have categories (`cat`), e.g. `NP`, `PP`, `SMAIN`.

Text-Fabric nodes have a type (`otype` = object type). Above we saw node types `sentence`, `clause`, `phrase`, `word`,
but this is the choice of the corpus modeller. Text-Fabric expects in every corpus a feature file `otype` that
maps all nodes to types.

For example, the BHSA above has this `otype.tf`:

```
1-426590	word
426591-426629	book
426630-427558	chapter
427559-515689	clause
515690-606393	clause_atom
606394-651572	half_verse
651573-904775	phrase
904776-1172307	phrase_atom
1172308-1236024	sentence
1236025-1300538	sentence_atom
1300539-1414388	subphrase
1414389-1437601	verse
1437602-1446831	lex
```

This is shorthand for a mapping of the integers 1..1446831 to strings `word`, `book`, ... , `lex`.

By the way, all data files of a TF corpus are in this format, and each file specifies a mapping from
numbers (or pairs of numbers) to values, which can be numbers or strings.
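
To illustrate how compact this range notation is, here is a minimal sketch that parses the excerpt shown above and looks up the type of a node. It is not the Text-Fabric loader: the helper names `parse_ranges` and `node_type` are invented here, and real `.tf` files contain more than this excerpt shows (for instance header lines), which the sketch ignores.

```python
from bisect import bisect_right

# The otype excerpt from above; "\t" stands for the tab between range and value.
OTYPE_EXCERPT = """\
1-426590\tword
426591-426629\tbook
426630-427558\tchapter
427559-515689\tclause
515690-606393\tclause_atom
606394-651572\thalf_verse
651573-904775\tphrase
904776-1172307\tphrase_atom
1172308-1236024\tsentence
1236025-1300538\tsentence_atom
1300539-1414388\tsubphrase
1414389-1437601\tverse
1437602-1446831\tlex
"""


def parse_ranges(text):
    """Turn lines like '1-426590<TAB>word' into sorted (start, end, value) triples."""
    ranges = []
    for line in text.splitlines():
        span, value = line.split("\t")
        start, _, end = span.partition("-")
        ranges.append((int(start), int(end or start), value))
    return sorted(ranges)


def node_type(ranges, node):
    """Find the range that contains `node` by binary search; None if there is none."""
    starts = [start for (start, end, value) in ranges]
    i = bisect_right(starts, node) - 1
    if i >= 0:
        start, end, value = ranges[i]
        if start <= node <= end:
            return value
    return None


RANGES = parse_ranges(OTYPE_EXCERPT)
first_result = (1174454, 430355, 660050, 14127)  # from the query results above
print([node_type(RANGES, n) for n in first_result])
```

Applied to the first result tuple of the earlier query, the lookup recovers the node types in the order in which the query mentions them.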

## Edges

In a graph there are also edges. How do you search for nodes that are connected by certain edges?

Both in Alpino and Text-Fabric edges may have properties that can be used in queries.

An example Alpino query:

```
match (n:node{cat:'pp'})-[:rel{rel:'hdf'}]->(:nw)
return n
```

Look for a PP node that is connected to another node or word by means of an edge whose `rel` property is
the string `hdf`.

In Text-Fabric we can also make queries like this.

We have edges between similar verses and these edges are labeled with the similarity of both verses in percents.

First we look for verses that are 90% similar, and then for verses that are more than 90% similar.

In [13]:
query1 = """
verse
-crossref=90> verse
"""

query2 = """
verse
-crossref>90> verse
"""

In [14]:
results1 = A.search(query1)
results2 = A.search(query2)

  0.03s 240 results
  0.04s 9574 results


In [15]:
A.table(results1, end=2, condenseType="verse", full=True)

| n | p | verse | verse.1 |
| --- | --- | --- | --- |
| 1 | Genesis 25:31 | W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 | W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 |
| 2 | Genesis 25:33 | W J>MR J<QB HCB<H LJ KJWM W JCB< LW W JMKR >T&BKRTW LJ<QB00 | W J>MR J<QB MKRH KJWM >T&BKRTK LJ00 |


Alpino does a good job in understanding linguistics. It has quite a bit of meaningful
linguistic relationships, and its corpora supply the data for those.

Text-Fabric is different. It is much more agnostic. It only assumes that there is an ordered set of slots
plus nodes that represent certain subsets of slots; the nodes are divided into types.

Both the types and the subsets are given with the corpus as TF data. Above we saw the `otype.tf` file,
but there is also an `oslots.tf` file, that maps each non-slot node to the set of slots it is linked to.
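
The following toy example shows, under simplifying assumptions, how these two mappings alone are enough to derive embedding. The seven-node corpus and the helper names `slots_of` and `embedded` are invented for this sketch; Text-Fabric itself relies on precomputed indexes rather than on a scan like this one.

```python
# A toy corpus: four word slots, two phrases and one clause.
OTYPE = {1: "word", 2: "word", 3: "word", 4: "word",
         5: "phrase", 6: "phrase", 7: "clause"}

# Only non-slot nodes need an entry: the set of slots they are linked to.
OSLOTS = {5: {1, 2}, 6: {3, 4}, 7: {1, 2, 3, 4}}


def slots_of(node):
    """A slot is linked to itself; other nodes are looked up in OSLOTS."""
    return OSLOTS.get(node, {node})


def embedded(node, wanted_type):
    """Nodes of `wanted_type` whose slots all lie within the slots of `node`."""
    outer = slots_of(node)
    return sorted(
        n
        for n, t in OTYPE.items()
        if t == wanted_type and n != node and slots_of(n) <= outer
    )


print(embedded(7, "phrase"))  # phrases inside the clause
print(embedded(5, "word"))    # words inside the first phrase
```

No explicit parent or child links are stored: containment follows from comparing slot sets.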

## Query results

What do queries return?

In Alpino they return nodes and/or values that certain features have for those nodes.

In Text-Fabric they return tuples of nodes. In a TF query, most lines specify a node with properties that
the query has to instantiate in the corpus. If a query specifies 10 such nodes, then the query results
are 10-tuples of nodes in the corresponding order.

The nodes returned are naked, not dressed-up with features.

In real life, people issue TF queries either in the Text-Fabric browser, where they can customise how
query results are displayed, or they can catch the result nodes in their programs (which typically run in a
Jupyter notebook, but by no means necessarily so).

In the Text-Fabric browser users can influence the features that must be displayed in various ways, one of them 
being: if a feature is mentioned in a query, then it is displayed.
Users can also categorically request features, or inhibit certain standard features that are displayed by default.
(These defaults are not Text-Fabric things, but set by the corpus modeller).

In Jupyter notebooks, users can programmatically achieve the same effects by using the functions `table()` and `show()`
and supplying various keyword arguments.

All in all, given the difference in purpose, technology and scope between Text-Fabric and Alpino,
the underlying concepts map fairly well from the one to the other.

## More complicated queries

It is not always the case that a query is a neatly nested template. In fact, such queries
only use the "embedding" relationship, but there are many more relationships.

In Alpino, you can give names to the nodes in a query, and the same is true for Text-Fabric.

For example:

In [16]:
query = """

clause
  phrase
    := w1:word sp=verb
  <: phrase
    =: w2:word sp=verb

w1 .lex. w2
""" 

This means: find a verb in each of two adjacent phrases of a clause. The first verb should be the last word in its phrase,
the second verb should be the first word in its phrase.

The last line states a relationship between the two words: their lexeme values should be identical.

In [17]:
results = A.search(query)

  0.66s 475 results


In [18]:
A.table(results, end=10, skipCols={2, 3, 4, 5})

| n | p | clause | phrase | word | phrase.1 | word.1 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Genesis 2:16 | MKL <Y&HGN >KL T>KL00 | | | | |
| 2 | Genesis 2:17 | KJ BJWM MWT TMWT00 | | | | |
| 3 | Genesis 3:4 | L>& MWT TMTWN00 | | | | |
| 4 | Genesis 3:16 | HRBH >RBH <YBWNK WHRNK | | | | |
| 5 | Genesis 8:7 | W JY> JYW> WCWB | | | | |
| 6 | Genesis 12:3 | W >BRKH MBRKJK | | | | |
| 7 | Genesis 15:13 | JD< TD< | | | | |
| 8 | Genesis 16:10 | HRBH >RBH >T&ZR<K | | | | |
| 9 | Genesis 17:13 | HMWL05 JMWL JLJD BJTK WMQNT KSPK | | | | |
| 10 | Genesis 18:10 | CWB >CWB >LJK K<T XJH | | | | |


## Quantifiers

In Alpino you can use quantifiers. These are parts of a query where you look for the
existence or non-existence of certain patterns.
This seems to be a bit of a problematic device, because there are certain conditions on quantifiers.

In Text-Fabric it is not different: here there are also quantifiers, and here there are also restrictions
on the quantifier expressions.

In Text-Fabric, quantified parts of the query do not contribute to the result tuple.

Here is an example.

Let's see how many VP-phrases there are:

In [19]:
resultsVP = A.search("""phrase typ=VP""")

  0.16s 69024 results


Sometimes VPs contain a noun:

In [20]:
query = """
phrase typ=VP
  word sp=subs
"""
resultsWithNoun = A.search(query, shallow=True)

  0.31s 234 results


Note the `shallow=True`: this means that we deliver the results differently: not as a tuple of tuples,
but as a set of nodes that correspond to the first node in the query: the `phrase`.

If we want the VPs without nouns:

In [21]:
resultsWithoutNoun = A.search("""
phrase typ=VP
/without/
  word sp=subs
/-/
""")

  0.35s 68790 results


A check to see if the results add up:

In [22]:
len(resultsVP) == len(resultsWithNoun) + len(resultsWithoutNoun)

True

OK, the numbers of results of the different queries are as expected, but we can also
compare the results themselves:

In [23]:
set(r[0] for r in resultsVP) == resultsWithNoun | set(r[0] for r in resultsWithoutNoun)

True

If we want the VPs with only verbs:

In [24]:
resultsVerb1 = A.search("""
phrase typ=VP
/without/
  word sp#verb
/-/
""")

  0.45s 62771 results


Or, in a slightly different way, showing a different quantifier:

In [25]:
resultsVerb2 = A.search("""
phrase typ=VP
/where/
  w:word
/have/
  w sp=verb
/-/
""")

  0.70s 62771 results
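
That the last two queries yield the same number of results is no coincidence: "no word is a non-verb" and "every word is a verb" are logically the same condition. The sketch below replays that reasoning on a toy data set in plain Python; the phrases and function names are made up, and the code only mirrors the logic of the quantifiers, not the way Text-Fabric evaluates them.

```python
# Toy phrases: each phrase is represented by the parts of speech of its words.
PHRASES = {
    "p1": ["verb"],
    "p2": ["verb", "subs"],
    "p3": ["verb", "verb"],
}


def without_non_verb(parts_of_speech):
    """Logic of /without/ word sp#verb: there is no word that is not a verb."""
    return not any(sp != "verb" for sp in parts_of_speech)


def where_all_verbs(parts_of_speech):
    """Logic of /where/ w:word /have/ w sp=verb: every word is a verb."""
    return all(sp == "verb" for sp in parts_of_speech)


verb1 = {p for p, words in PHRASES.items() if without_non_verb(words)}
verb2 = {p for p, words in PHRASES.items() if where_all_verbs(words)}

print(sorted(verb1), verb1 == verb2)
```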


## Counting and sorting

Unlike Alpino, in Text-Fabric there are no SQL-like constructs to count, group and aggregate results,
except the `shallow=True` parameter, which reduces all results that have the same first node
to a single result consisting of that node only.

Text-Fabric relies on post-processing by the users, either in the program in which they
issued the query, or in other programs into which they import results exported from the
Text-Fabric browser.

There is also an API function [A.export()](https://annotation.github.io/text-fabric/tf/advanced/display.html#tf.advanced.display.export) for use in your programs
to export results as tab-separated tables of dressed-up nodes.

Concerning *sorting*: the search function can be passed a sort key to order results.
If `sort=True` is passed, the results are ordered by the text-induced ordering of the result tuples.
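
As a small illustration of such post-processing, the sketch below counts results per book with the standard library only. The passage labels are copied from the ten rows of the transliterated results table earlier in this document, and the way the book name is split off from the label is an assumption made for this example.

```python
from collections import Counter

# Passage labels of the ten rows in the transliterated results table above.
passages = [
    "Genesis 27:29", "Exodus 12:19", "Deuteronomy 33:20", "Joshua 6:9",
    "Joshua 6:13", "Isaiah 3:12", "Isaiah 9:15", "Jeremiah 51:1",
    "Haggai 1:6", "Malachi 1:7",
]

# Group by book: everything before the last space is taken as the book name.
per_book = Counter(p.rsplit(" ", 1)[0] for p in passages)

# Sort by descending count, then alphabetically by book name.
ranked = sorted(per_book.items(), key=lambda item: (-item[1], item[0]))

for book, count in ranked:
    print(f"{book:12} {count}")
```

In a real session you would derive the book from the result nodes instead of from printed labels, but the counting and sorting steps stay the same.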

## Set-theoretic operations on results

Alpino provides set-theoretic operations on results; in Text-Fabric the user
has to do that by means of post-processing.
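
In Python this post-processing is usually short, because results obtained with `shallow=True` are already sets of nodes. Here is a minimal sketch with made-up node numbers; the variable names stand for the outcomes of two hypothetical queries.

```python
# Hypothetical shallow results of two queries: sets of phrase nodes.
with_noun = {101, 102}
with_adjective = {102, 103}

both = with_noun & with_adjective       # intersection
either = with_noun | with_adjective     # union
noun_only = with_noun - with_adjective  # difference

print(sorted(both), sorted(either), sorted(noun_only))
```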

Note that Text-Fabric constrains search templates: they have to be connected components, in the sense
that between every pair of nodes in the query template there must be a path of relationships.

If a search template consisted of multiple connected components, the result would be the Cartesian product of the results of the individual components.

Such result sets are potentially monstrous, and it is unlikely that the user can and will deal with them, so they are prohibited.
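
A quick sketch shows how fast such a product grows. The small lists are toy data; the two large numbers are the node counts for `sentence` and `clause` reported when the BHSA was loaded above.

```python
from itertools import product

# Toy case: two unrelated "queries" with 3 and 4 results.
sentences = [1, 2, 3]
clauses = [10, 11, 12, 13]
combined = list(product(sentences, clauses))
print(len(combined))  # every sentence paired with every clause

# The same arithmetic with the BHSA node counts listed earlier:
# an unconnected template "sentence" + "clause" would pair them all.
n_sentence, n_clause = 63717, 88131
print(n_sentence * n_clause)
```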

# Limitations

Alpino search is based on Cypher and SQL. Limitations in Cypher queries can sometimes be
compensated by excursions into SQL.

Also in Text-Fabric the expressive power of queries is limited. 
Moreover, there are also queries that can be expressed but require too much time to execute.

That is partly due to lack of sophistication in the Text-Fabric engine and partly due to 
inherent complexity of spatial relationships between nodes.

In Text-Fabric we do not have an escape to another query language.
Instead, the escape is to hand-coding. 

There is a 
[tutorial notebook](https://nbviewer.org/github/ETCBC/bhsa/blob/master/tutorial/searchGaps.ipynb)
in which we explore a difficult query task.
Although we can solve it by a query, we also do it by hand-coding.
We make sure both give the same result and then we save the result as a *named set* to disk.

We can then invoke Text-Fabric later on with a parameter to include this named set.
At that moment the name of the set can be used in queries in places where a node type
is expected.

In the above notebook the section **Custom sets for (non-)gapped phrases** shows
how that works.

# Precomputation

In Alpino there are pre-computed pieces of data, e.g. the feature `vor_feld`.
This is advertised as a device that simplifies queries and makes them much more efficient.

In Text-Fabric pre-computation is also used, at several levels.

## Internal precomputation

Some of the precomputation belongs to the internals of Text-Fabric, such as
spatial indexes that facilitate the computation of embedding relations between nodes, among
other things.
See for example [levUp](https://annotation.github.io/text-fabric/tf/core/prepare.html#tf.core.prepare.levUp).
Such precomputed data is also made available in raw form to the end user 
through the [Computed API](https://annotation.github.io/text-fabric/tf/cheatsheet.html#c-computed-data-components).

## Corpus-level precomputation

At the second level there are computed features that the corpus modeller has included in the
corpus. For example, in the BHSA there are features for the frequency and rank of words and lexemes.
See [freq/rank](https://etcbc.github.io/bhsa/features/freq_lex/).

There are also features that have been added by others to the corpus as a separate module.
The [similarity](https://nbviewer.org/github/ETCBC/parallels/blob/master/programs/parallels.ipynb)
feature that we encountered before is an example of that.

## User-generated precomputation

The named sets mentioned above are an example of how end users themselves can compute data that are
helpful in subsequent queries.

This is the ethos of Text-Fabric: that end users, corpus modellers, and researchers have maximum
scope and flexibility to compute with the corpus.

# Alpino and Text-Fabric

Alpino and Text-Fabric are very different in the size of corpora they deal with and the specific
features they assume to be present in the corpora.

Alpino corpora are linguistic corpora, Text-Fabric corpora do not have to be linguistic.
There are even corpora in Text-Fabric whose texts are not in a language, such as the
proto-cuneiform tablets of [Uruk](https://github.com/Nino-cunei/uruk).

Can there be synergies between the Alpino world and the Text-Fabric world?

## Alpino helps Text-Fabric

There are several corpora in Text-Fabric that could benefit from linguistic tools, especially
tokenizers, pos-taggers and morphological taggers.
Whether Alpino can help depends on the languages that Alpino supports, because up to now Text-Fabric
has dealt with historical corpora that use mixtures of historical languages with a variety
of spelling idiosyncrasies.

## Text-Fabric helps Alpino

When end users want to combine close reading with data analysis, Text-Fabric is a handy
tool. Although Text-Fabric cannot deal with huge corpora, it can deal with corpora the size
of 5 million words with dozens of features or 0.5 million words and over a hundred features.

Text-Fabric also has machinery to deal with volumes of a corpus.

So one could use Alpino to make a top-level search on a huge corpus, and then export
results volume by volume, where Text-Fabric can be used to deal with individual volumes.

# Conclusion

Regardless of whether there is real synergy between Alpino and Text-Fabric, it is encouraging
to see that once text is regarded as a graph, there is a certain logic in how a query language
should work. Both Alpino and Text-Fabric have hit upon the same elements, driven by how
other tools have tackled these things.

Whereas Alpino rests on Cypher (at least for this part of its query capability), Text-Fabric
has been inspired by [Emdros](https://emdros.org/whatis.html) by Ulrik Sandborg-Petersen.

# Author

Dirk Roorda

CC-BY