<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/logo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

# Tutorial

This notebook gets you started with using
[Text-Fabric](https://annotation.github.io/text-fabric/) for coding in the NinMed corpus (cuneiform).

Familiarity with the underlying
[data model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
is recommended.

## Installing Text-Fabric

See the [installation instructions](https://annotation.github.io/text-fabric/tf/about/install.html).

## Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else, outside your copy of the repository.
If you pull changes from the repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.

## NinMed data

Text-Fabric will fetch the data set for you from the newest GitHub release binaries.

The data will be stored in the `text-fabric-data` directory in your home directory.

# Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the corpus as a gigantic spreadsheet, where row 1 corresponds to the
first sign, row 2 to the second sign, and so on, for all signs.

The readings of the signs, for example, constitute one column in that spreadsheet.
The NinMed corpus contains about 45 columns, not only for the signs, but also for other
textual objects, such as clusters, lines, columns, faces, documents.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
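
As a mental model only, you can picture each feature as a mapping from node numbers to values. The sketch below uses plain Python dictionaries with invented values; the names `reading` and `otype` are toy stand-ins here, and this is not how Text-Fabric stores its data internally.

```python
# Toy model: every feature is a separate column, i.e. a mapping node -> value.
# All node numbers and values are invented for illustration.
reading = {1: "a", 2: "na", 3: "ku"}                    # only signs have a reading
otype = {1: "sign", 2: "sign", 3: "sign", 4: "word"}    # every node has a type


def value(feature, node):
    """Look up the value of a feature for a node; None if there is no value."""
    return feature.get(node)


# Reading one "row" of the spreadsheet means consulting each column in turn.
row_3 = {"reading": value(reading, 3), "otype": value(otype, 3)}
row_4 = {"reading": value(reading, 4), "otype": value(otype, 4)}

print(row_3)
print(row_4)
```

A column that does not apply to a node simply has no value for it, which is why separate columns are more economical than one big table.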

# Before you compute

You can open the corpus in the Text-Fabric browser and
explore the contents in a browser window which is served
from your computer:

In [None]:
!!text-fabric Nino-cunei/ninmed:clone --checkout=clone

# Prepare to compute

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import collections

# Incantation

The simplest way to get going is by this *incantation*:

In [3]:
from tf.app import use

For the very latest version, use `hot`.

For the latest release, use `latest`.

If you have cloned the repos (TF app and data), use `clone`.

If you do not want/need to upgrade, leave out the checkout specifiers.

In [6]:
# A = use("Nino-cunei/ninmed:clone", checkout="clone", hoist=globals())
A = use("Nino-cunei/ninmed", hoist=globals())

This is Text-Fabric 9.2.5
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

48 features found and 0 ignored


You can see which features have been loaded, and if you click on a feature name, you find its documentation.
If you hover over a name, you see where the feature is located on your system.

## API

The result of the incantation is that we have a bunch of special variables at our disposal
that give us access to the text and data of the corpus.

At this point it is helpful to throw a quick glance at the text-fabric API documentation
(see the links under **API Members** above).

The most essential thing for now is that we can use `F` to access the data in the features
we've loaded.
But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.

The **API members** above show you exactly which new names have been inserted in your namespace.
If you click on these names, you go to the API documentation for them.

## Search
Text-Fabric contains a flexible search engine that works not only for the data
of this corpus, but also for other corpora and for data that you add to corpora.

**Search is the quickest way to get up to speed with your data, without too much programming.**

Jump to the dedicated [search](search.ipynb) tutorial first, to whet your appetite.

The real power of search lies in the fact that it is integrated in a programming environment.
You can use programming to:

* compose dynamic queries
* process query results

Therefore, the rest of this tutorial is still important when you want to tap that power.
If you continue here, you learn all the basics of data-navigation with Text-Fabric.

# Counting

In order to get acquainted with the data, we start with the simple task of counting.

## Count all nodes
We use the
[`N.walk()` generator](https://annotation.github.io/text-fabric/tf/core/nodes.html#tf.core.nodes.Nodes.walk)
to walk through the nodes.

We compared the TF data to a gigantic spreadsheet, where the rows correspond to the signs.
In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with signs.

We also mentioned that there are other textual objects.
They are the clusters, words, lines, faces and documents.
They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows *nodes*, and the `N.walk()` generator
carries us through those nodes in the textual order.

Just one extra thing: the `info` statements generate timed messages.
If you use them instead of `print` you'll get a sense of the amount of time that
the various processing steps typically need.

In [7]:
A.indent(reset=True)
A.info("Counting nodes ...")

i = 0
for n in N.walk():
    i += 1

A.info("{} nodes".format(i))

  0.00s Counting nodes ...
  0.01s 93211 nodes


Here you see it: nearly 100,000 nodes.
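
To get a feeling for what "textual order" could mean for nodes that overlap, here is a toy sketch. It assumes a simple rule: nodes are sorted by their first slot, and among nodes that start at the same slot, the one that extends furthest comes first. That rule, the `spans` table and the function `text_order` are assumptions made for this illustration; the precise ordering used by Text-Fabric is defined in its documentation.

```python
# Toy nodes: node number -> (first slot, last slot); all numbers are invented.
spans = {
    1: (1, 1),  # sign
    2: (2, 2),  # sign
    3: (3, 3),  # sign
    4: (1, 2),  # a two-sign unit
    5: (2, 3),  # another two-sign unit
    6: (1, 3),  # a unit spanning everything
}


def text_order(spans):
    """Sort nodes by first slot; among nodes starting together, longest first."""
    return sorted(spans, key=lambda n: (spans[n][0], -spans[n][1]))


order = text_order(spans)
print(order)
```

With this rule, a containing object is visited before the objects inside it.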

## What are those nodes?
Every node has a type, such as sign, line or face.
But what exactly are they?

Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.

Here we go!

In [8]:
F.otype.slotType

'sign'

In [9]:
F.otype.maxSlot

52829

In [10]:
F.otype.maxNode

93211

In [11]:
F.otype.all

('document', 'face', 'line', 'cluster', 'word', 'sign')

In [12]:
C.levels.data

(('document', 1553.7941176470588, 59226, 59259),
 ('face', 978.3148148148148, 59260, 59313),
 ('line', 17.14114211550941, 59314, 62395),
 ('cluster', 2.6222639149468416, 52830, 59225),
 ('word', 1.712097611630322, 62396, 93211),
 ('sign', 1, 1, 52829))

This is interesting: above you see all the node types, with the average size of their nodes (in slots),
the node where they start, and the node where they end.
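
If each node type occupies one unbroken range of node numbers, as the output above suggests, then the number of nodes per type follows from simple arithmetic on the start and end nodes. The sketch below copies the tuples from the output; the helper `count_per_type` is a name made up for this illustration.

```python
# Copied from the C.levels.data output above:
# (node type, average number of slots, first node, last node)
levels = (
    ("document", 1553.7941176470588, 59226, 59259),
    ("face", 978.3148148148148, 59260, 59313),
    ("line", 17.14114211550941, 59314, 62395),
    ("cluster", 2.6222639149468416, 52830, 59225),
    ("word", 1.712097611630322, 62396, 93211),
    ("sign", 1, 1, 52829),
)


def count_per_type(levels):
    """Number of nodes per type, assuming each type is one contiguous range."""
    return {otype: last - first + 1 for (otype, _avg, first, last) in levels}


counts = count_per_type(levels)
for otype, amount in counts.items():
    print(f"{amount:>7} {otype}s")

total = sum(counts.values())
print(f"{total:>7} nodes in total")
```

The total equals `F.otype.maxNode`, and the number of signs equals `F.otype.maxSlot`.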

## Count individual object types
This is an intuitive way to count the number of nodes in each type.
Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed
and indented progress messages.

In [13]:
A.indent(reset=True)
A.info("counting objects ...")

for otype in F.otype.all:
    i = 0

    A.indent(level=1, reset=True)

    for n in F.otype.s(otype):
        i += 1

    A.info("{:>7} {}s".format(i, otype))

A.indent(level=0)
A.info("Done")

  0.00s counting objects ...
   |     0.00s      34 documents
   |     0.00s      54 faces
   |     0.00s    3082 lines
   |     0.00s    6396 clusters
   |     0.00s   30816 words
   |     0.01s   52829 signs
  0.01s Done


# Viewing textual objects

You can use the A API (the extra power) to display cuneiform text.

See the [display](display.ipynb) tutorial.

# Feature statistics

`F`
gives access to all features.
Every feature has a method
`freqList()`
to generate a frequency list of its values, higher frequencies first.
Here are the 20 most frequent lemmas:

In [14]:
F.lemma.freqList()[0:20]

(('', 8538),
 ('ina I', 1420),
 ('ana I', 647),
 ('šumma I', 520),
 ('sâku I', 500),
 ('awīlu I', 445),
 ('šikaru I', 326),
 ('libbu I', 319),
 ('u I', 287),
 ('šamnu I', 287),
 ('mû I', 279),
 ('ša I', 262),
 ('šiptu I', 242),
 ('īnu I, -šu I', 239),
 ('lā I', 222),
 ('šatû II', 201),
 ('qû II', 195),
 ('balāṭu II', 194),
 ('kasû II', 180),
 ('šanû I', 178))
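
If you want to see what such a frequency list amounts to, here is a minimal sketch in plain Python on invented data. The function `freq_list` is a stand-in written for this illustration, not the implementation of `freqList()`; in particular, sorting ties by value is an assumption of the sketch.

```python
from collections import Counter


def freq_list(values):
    """Return (value, count) pairs, highest count first; ties sorted by value."""
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Invented lemma values, one per word; the empty string stands for "no lemma".
toy_lemmas = ["ina I", "ana I", "ina I", "", "ina I", "ana I", "", "u I"]

result = freq_list(toy_lemmas)
print(result)
```

Note that the empty value is counted like any other value, just as `''` shows up in the list above.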

Signs have types and clusters have types. We can count them separately:

In [15]:
F.type.freqList("cluster")

(('missing', 3615),
 ('det', 2672),
 ('uncertain', 67),
 ('gloss', 21),
 ('supplied', 16),
 ('excised', 4),
 ('erasure', 1))

In [16]:
F.type.freqList("sign")

(('reading', 23570),
 ('grapheme', 22092),
 ('unknown', 4768),
 ('ellipsis', 1266),
 ('numeral', 962),
 ('wdiv', 102),
 ('empty', 69))

Finally, the flags:

In [17]:
F.flags.freqList()

(('#', 4305),
 ('?', 501),
 ('?#', 145),
 ('!', 39),
 ('#?', 20),
 ('!#', 3),
 ('!?', 3),
 ('!?#', 2),
 ('*', 1),
 ('??', 1))
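
The flags occur in combinations, so the list above does not directly tell how many signs carry a given flag. Here is a small sketch that adds up the combinations per individual flag. It assumes that every character in a flags value is a flag in its own right, which is not stated in this tutorial; the helper `signs_per_flag` is a name made up for the occasion.

```python
from collections import Counter

# Copied from the F.flags.freqList() output above.
flag_freqs = (
    ("#", 4305),
    ("?", 501),
    ("?#", 145),
    ("!", 39),
    ("#?", 20),
    ("!#", 3),
    ("!?", 3),
    ("!?#", 2),
    ("*", 1),
    ("??", 1),
)


def signs_per_flag(freqs):
    """Count how many signs carry each individual flag character."""
    totals = Counter()
    for combination, amount in freqs:
        # set(): a flag that is repeated on one sign is counted once for that sign
        for flag in set(combination):
            totals[flag] += amount
    return totals


per_flag = signs_per_flag(flag_freqs)
for flag, amount in per_flag.most_common():
    print(f"{flag} {amount:>5}")
```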

## Word distribution

Let's do a bit more fancy word stuff.

### Hapaxes

Hapaxes can be found by picking the lemmas with frequency 1.

We count the hapaxes and print 20 of them.

In [18]:
lemmaFreqs = F.lemma.freqList()
allLemmas = {x[0] for x in lemmaFreqs}
nLemmas = len(allLemmas)

hapaxes = [w for (w, amount) in lemmaFreqs if amount == 1]
nHapaxes = len(hapaxes)
percent = int(round(100 * nHapaxes / nLemmas))

print(f"{len(hapaxes)} hapaxes out of {nLemmas} lemmas = {percent}%:\n")

for w in hapaxes[0:20]:
    print(f"\t{w}")

641 hapaxes out of 1805 lemmas = 36%:

	-ma I
	-šu I, īnu I
	Adad I
	Adapa I
	Ayyāru I
	Ašnan I
	Du'ūzu I
	Elulu I
	Elūnu I
	Ereqqu I
	Eridu I
	Isin I
	Lili I
	Magan I
	Nanaya I
	Ningirim I
	Ninkarak I
	Ninlil I
	Nippur I
	Nisaba I
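
Note that the frequency list of lemmas contains the empty string as a value, so the count of 1805 lemmas includes it. If you prefer to leave empty values out, a variation along the following lines could be used. It is a sketch on invented data, and `hapax_share` is a hypothetical helper, not part of Text-Fabric.

```python
def hapax_share(freqs, skip_empty=True):
    """Return (hapaxes, number of distinct values, rounded percentage).

    freqs is a sequence of (value, count) pairs, like the result of freqList().
    """
    items = [(v, n) for (v, n) in freqs if not (skip_empty and v == "")]
    hapaxes = [v for (v, n) in items if n == 1]
    n_values = len(items)
    percent = round(100 * len(hapaxes) / n_values) if n_values else 0
    return hapaxes, n_values, percent


# Invented frequency list in the shape that freqList() delivers.
toy_freqs = (("", 5), ("ina I", 3), ("ana I", 2), ("Adad I", 1), ("Isin I", 1))

hapaxes_toy, n_values_toy, percent_toy = hapax_share(toy_freqs)
print(f"{len(hapaxes_toy)} hapaxes out of {n_values_toy} lemmas = {percent_toy}%")
```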


# Locality API
We travel upwards and downwards, forwards and backwards through the nodes.
The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.
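
To make these notions concrete, here is a toy model in plain Python. It is a sketch of the definitions given above, not of how Text-Fabric computes them. All node numbers are invented, the functions `up`, `down`, `nxt` and `prev` are stand-ins for `L.u`, `L.d`, `L.n` and `L.p`, and the toy returns its results sorted by node number, which need not be the order that Text-Fabric uses.

```python
# Toy corpus: 6 slots (signs 1..6) and three other nodes with invented numbers.
MAX_SLOT = 6
oslots = {7: {1, 2, 3}, 8: {4, 5, 6}, 9: {1, 2, 3, 4, 5, 6}}
otype_of = {7: "word", 8: "word", 9: "line"}
ALL_NODES = tuple(range(1, MAX_SLOT + 1)) + tuple(sorted(oslots))


def slots_of(node):
    """A slot is linked to itself; other nodes are linked via oslots."""
    return {node} if node <= MAX_SLOT else oslots[node]


def type_of(node):
    return "sign" if node <= MAX_SLOT else otype_of[node]


def _pick(condition, otype=None):
    """All nodes satisfying condition, optionally restricted to one type."""
    return tuple(
        m
        for m in ALL_NODES
        if condition(m) and (otype is None or type_of(m) == otype)
    )


def up(node, otype=None):
    """Nodes whose slots contain all slots of node."""
    target = slots_of(node)
    return _pick(lambda m: m != node and slots_of(m) >= target, otype)


def down(node, otype=None):
    """Nodes whose slots are all contained in the slots of node."""
    target = slots_of(node)
    return _pick(lambda m: m != node and slots_of(m) <= target, otype)


def nxt(node, otype=None):
    """Nodes whose first slot comes right after the last slot of node."""
    after = max(slots_of(node)) + 1
    return _pick(lambda m: min(slots_of(m)) == after, otype)


def prev(node, otype=None):
    """Nodes whose last slot comes right before the first slot of node."""
    before = min(slots_of(node)) - 1
    return _pick(lambda m: max(slots_of(m)) == before, otype)


print(up(2))                   # (7, 9)
print(down(9, otype="word"))   # (7, 8)
print(nxt(7))                  # (4, 8)
print(prev(8))                 # (3, 7)
```

Even in the toy, every function returns a tuple, also when there is one result or none at all.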

## Going up
We go from the first sign to the document that contains it.
Note the `[0]` at the end. You expect one document, yet `L` returns a tuple.
To get the only element of that tuple, you need to do that `[0]`.

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [19]:
firstDoc = L.u(1, otype="document")[0]
print(firstDoc)

59226
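
Indexing with `[0]` fails with an `IndexError` when the tuple is empty, for example when a node has no container of the requested type. If that can happen in your data, a tiny helper may save you some trouble. The name `only_or_none` is made up for this sketch; it is not part of Text-Fabric.

```python
def only_or_none(nodes):
    """Return the first node of a tuple of nodes, or None if the tuple is empty."""
    return nodes[0] if nodes else None


print(only_or_none((59226,)))  # 59226
print(only_or_none(()))        # None
```

In the notebook you could wrap a call to an `L` function in it and then test the result for `None` before using it.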


And let's see all the containing objects of sign 3:

In [20]:
s = 3
for otype in F.otype.all:
    if otype == F.otype.slotType:
        continue
    up = L.u(s, otype=otype)
    upNode = "x" if len(up) == 0 else up[0]
    print("sign {} is contained in {} {}".format(s, otype, upNode))

sign 3 is contained in document 59226
sign 3 is contained in face 59260
sign 3 is contained in line 59314
sign 3 is contained in cluster 52830
sign 3 is contained in word 62396


## Going next
Let's go to the next nodes of the first document.

In [21]:
afterFirstDoc = L.n(firstDoc)
for n in afterFirstDoc:
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )
secondDoc = L.n(firstDoc, otype="document")[0]

    267: sign          first slot=267   , last slot=267   
  59329: line          first slot=267   , last slot=267   
  59261: face          first slot=267   , last slot=446   
  59227: document      first slot=267   , last slot=715   


## Going previous

And let's see what is right before the second document.

In [22]:
for n in L.p(secondDoc):
    print(
        "{:>7}: {:<13} first slot={:<6}, last slot={:<6}".format(
            n,
            F.otype.v(n),
            E.oslots.s(n)[0],
            E.oslots.s(n)[-1],
        )
    )

  59226: document      first slot=1     , last slot=266   
  59260: face          first slot=1     , last slot=266   
  59328: line          first slot=258   , last slot=266   
  52852: cluster       first slot=264   , last slot=266   
  62600: word          first slot=266   , last slot=266   
    266: sign          first slot=266   , last slot=266   


## Going down

We go to the faces of the first document, and just count them.

In [23]:
faces = L.d(firstDoc, otype="face")
print(len(faces))

1


## The first line
We pick two nodes and explore what is above and below them:
the first line and the first word.

In [24]:
for n in [
    F.otype.s("word")[0],
    F.otype.s("line")[0],
]:
    A.indent(level=0)
    A.info("Node {}".format(n), tm=False)
    A.indent(level=1)
    A.info("UP", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    A.indent(level=1)
    A.info("DOWN", tm=False)
    A.indent(level=2)
    A.info("\n".join(["{:<15} {}".format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
A.indent(level=0)
A.info("Done", tm=False)

Node 62396
   |   UP
   |      |   52830           cluster
   |      |   59314           line
   |      |   59260           face
   |      |   59226           document
   |   DOWN
   |      |   1               sign
   |      |   2               sign
   |      |   3               sign
Node 59314
   |   UP
   |      |   52830           cluster
   |      |   59260           face
   |      |   59226           document
   |   DOWN
   |      |   52830           cluster
   |      |   62396           word
   |      |   1               sign
   |      |   2               sign
   |      |   3               sign
   |      |   62397           word
   |      |   4               sign
Done


# Text API

So far, we have mainly seen nodes and their numbers, and the names of node types.
You would almost forget that we are dealing with text.
So let's try to see some text.

In the same way as `F` gives access to feature data,
`T` gives access to the text.
That is also feature data, but you can tell Text-Fabric which features are specifically
carrying the text, and in return Text-Fabric offers you
a Text API: `T`.

## Formats
Cuneiform text can be represented in a number of ways:

* original ATF, with bracketings and flags
* essential symbols: readings and graphemes, repeats and fractions (of numerals), no flags, no clusterings
* unicode symbols

If you wonder where the information about text formats is stored:
not in the program text-fabric, but in the data set.
It has a feature `otext`, which specifies the formats and which features
must be used to produce them. `otext` is the third special feature in a TF data set,
next to `otype` and `oslots`.
It is an optional feature.
If it is absent, there will be no `T` API.
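
To picture how a format can be a matter of data rather than program code, consider the following toy sketch. A format is modelled as a template that names the features to be filled in for each slot. The template syntax, the format names and the feature names `full`, `plain` and `after` are all hypothetical; the actual contents of `otext` in this corpus are not shown in this tutorial.

```python
# Hypothetical format definitions: format name -> template over feature names.
formats = {
    "text-toy-full": "{full}{after}",
    "text-toy-plain": "{plain}{after}",
}

# Hypothetical feature values for three consecutive slots.
slots = [
    {"full": "KA#", "plain": "KA", "after": "."},
    {"full": "INIM", "plain": "INIM", "after": "."},
    {"full": "MA", "plain": "MA", "after": " "},
]


def render(slots, fmt):
    """Fill in the template of the given format for each slot and concatenate."""
    template = formats[fmt]
    return "".join(template.format(**slot) for slot in slots)


full_text = render(slots, "text-toy-full")
plain_text = render(slots, "text-toy-plain")
print(repr(full_text))
print(repr(plain_text))
```

Adding a format then means adding a line of configuration, without touching the code that renders the text.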

Here is a list of all available formats in this data set.

In [25]:
sorted(T.formats)

['layout-orig-full', 'layout-orig-plain', 'text-orig-full', 'text-orig-plain']

## Using the formats

The `T.text()` function is central to getting text representations of nodes. Its most basic usage is

```python
T.text(nodes, fmt=fmt)
```
where `nodes` is a list or iterable of nodes, usually slot nodes (signs in this corpus), and `fmt` is the name of a format.
If you leave out `fmt`, the default `text-orig-full` is chosen.

The result is the text in that format for all nodes specified:

In [26]:
print(T.text([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], fmt="text-orig-plain"))

DU₃.DU₃.BI ...
5 KA.INIM.MA x x x 


There is also another usage of this function:

```python
T.text(node, fmt=fmt)
```

where `node` is a single node.
In this case, the default format is *ntype*`-orig-full` where *ntype* is the type of `node`.

If the format is defined in the corpus, it will be used. Otherwise, the slot nodes (signs) contained in `node` will be looked up
and represented with the default format `text-orig-full`.

In this way we can sensibly represent a lot of different nodes, such as documents, faces, lines, clusters, words and signs.

We compose a set of example nodes and run `T.text` on them.
The example nodes are the second document, its first face and cluster, its second line, and the first word and sign of that line.

In [27]:
doc = F.otype.s("document")[1]
face = L.d(doc, otype="face")[0]
line = L.d(doc, otype="line")[1]
cluster = L.d(doc, otype="cluster")[0]
word = L.d(line, otype="word")[0]
sign = L.d(line, otype="sign")[0]
exampleNodes = [sign, word, cluster, line, face, doc]
exampleNodes

[268, 62601, 52853, 59330, 59261, 59227]

In [28]:
for n in exampleNodes:
    print(" ".join(T.sectionFromNode(n)) + f", {F.otype.v(n)} node {n}:")
    print(T.text(n))
    print("")

P479250 obverse 1', sign node 268:
[x 

P479250 obverse 1', word node 62601:
[x 

P479250 obverse 1', cluster node 52853:
[x x x x x x x x x x x x a-na]-

P479250 obverse 1', line node 59330:
[x x x x x x x x x x x x a-na]-ku# u₂#-[ša₂-an-ni]


P479250 obverse, face node 59261:
[x x x x x x x x x x x x a-na]-ku# u₂#-[ša₂-an-ni]
[x x x x x x x] TI-e#
[... LI.DUR] ŠID-nu
%sux [x x x x {gi}pisan]-gen₇ keš₂-da
%sux [x x x x x x x x x x x a-ge₆]-a nu-tuku
%sux [x x x x x x x x x x x x nu]-ku₄-ku₄
%sux [... nig₂ ge₂₆]-e : gen-na dumu-gu₁₀
%sux [x x x x x x x x x x x] šu u-me-ti
%sux [x x x x x x x x x x x x x x x x ka]-bi-ta u-me-ni-gar
%sux [x x x x x x x x x x x x x he₂]-en-si-il-e
%sux [x x x x x x x he₂-em-ma-ra]-e₃
[x x x x x x x x x x x x x x x x x x x ina KAŠ] NAG-ma ina-eš
[... ŠA₃?] GIG


P479250, document node 59227:
[x x x x x x x x x x x x a-na]-ku# u₂#-[ša₂-an-ni]
[x x x x x x x] TI-e#
[... LI.DUR] ŠID-nu
%sux [x x x x {gi}pisan]-gen₇ keš₂-da
%sux [x x x x x x x x x x x a-ge₆]-a

## Using the formats
Now let's use those formats to print out the first 11 signs of this corpus.

Note that only the formats starting with `text-` are usable for this.

For the `layout-` formats, see [display](display.ipynb).

In [29]:
for fmt in sorted(T.formats):
    if fmt.startswith("text-"):
        print("\t{}:\n{}\n".format(fmt, T.text(range(1, 12), fmt=fmt)))

	text-orig-full:
[DU₃.DU₃.BI ...]
5 KA#.INIM.MA [x x x 

	text-orig-plain:
DU₃.DU₃.BI ...
5 KA.INIM.MA x x x 



If we do not specify a format, the **default** format is used (`text-orig-full`).

In [30]:
print(T.text(range(1, 12)))

[DU₃.DU₃.BI ...]
5 KA#.INIM.MA [x x x 


In [31]:
firstLine = F.otype.s("line")[0]
print(T.text(firstLine))

[DU₃.DU₃.BI ...]



In [32]:
print(T.text(firstLine, fmt="text-orig-plain"))

DU₃.DU₃.BI ...



The important things to remember are:

* you can supply a list of slot nodes and get them represented in all formats
* you can get non-slot nodes `n` in default format by `T.text(n)`
* you can get non-slot nodes `n` in other formats by `T.text(n, fmt=fmt, descend=True)`

## Whole text in all formats

In [33]:
A.indent(reset=True)
A.info("writing plain text of all letters in all text formats")

text = collections.defaultdict(list)

for ln in F.otype.s("line"):
    for fmt in sorted(T.formats):
        if fmt.startswith("text-"):
            text[fmt].append(F.lnno.v(ln) + ". " + T.text(ln, fmt=fmt, descend=True))

A.info("done {} formats".format(len(text)))

for fmt in sorted(text):
    print("{}\n{}\n".format(fmt, "".join(text[fmt][0:5])))

  0.00s writing plain text of all letters in all text formats
  0.19s done 2 formats
text-orig-full
1'. [DU₃.DU₃.BI ...]
2'. 5 KA#.INIM.MA [x x x x]
3'. EN₂ su.ub hur.ri.im# su#.[ub ...]
4'. ša₂ sa.ku.tu₂ hi.si a.pi.il.lat aš [kur.ba.an.ni ...]
5'. KA.INIM.MA GIG.GIR ZI-hi DU₃.DU₃.BI SIG₂# SA₅# [...]


text-orig-plain
1'. DU₃.DU₃.BI ...
2'. 5 KA.INIM.MA x x x x
3'. EN₂ su.ub hur.ri.im su.ub ...
4'. ša₂ sa.ku.tu₂ hi.si a.pi.il.lat aš kur.ba.an.ni ...
5'. KA.INIM.MA GIG.GIR ZI-hi DU₃.DU₃.BI SIG₂ SA₅ ...




### The full plain text
We write all formats to file, in your `Downloads` folder.

In [34]:
for fmt in T.formats:
    if fmt.startswith("text-"):
        with open(os.path.expanduser(f"~/Downloads/NinMed-{A.version}-{fmt}.txt"), "w", encoding="utf8") as f:
            f.write("".join(text[fmt]))

## Sections

A section in the NinMed corpus is a document, a face or a line.
Knowledge of sections is not baked into Text-Fabric.
The config feature `otext.tf` may specify three section levels, and tell
what the corresponding node types and features are.

From that knowledge it can construct mappings from nodes to sections, e.g. from line
nodes to tuples of the form:

    (p-number, face specifier, line number)

You can get the section of a node as a tuple of relevant document, face, and line nodes.
Or you can get it as a passage label, a string.

You can ask for the passage corresponding to the first slot of a node, or the one corresponding to the last slot.

If you are dealing with document and face nodes, you can ask to fill out the line and face parts as well.

Here are examples of getting the section that corresponds to a node and vice versa.

**NB:** `sectionFromNode` always delivers a line specification, either from the
first slot belonging to that node, or, if `lastSlot`, from the last slot
belonging to that node.

In [35]:
someNodes = (
    F.otype.s("sign")[10000],
    F.otype.s("word")[1000],
    F.otype.s("cluster")[500],
    F.otype.s("line")[1500],
    F.otype.s("face")[10],
    F.otype.s("document")[5],
)

In [36]:
for n in someNodes:
    nType = F.otype.v(n)
    d = f"{n:>7} {nType}"
    first = A.sectionStrFromNode(n)
    last = A.sectionStrFromNode(n, lastSlot=True, fillup=True)
    tup = (
        T.sectionTuple(n),
        T.sectionTuple(n, lastSlot=True, fillup=True),
    )
    print(f"{d:<16} - {first:<22} {last:<22} {tup}")

  10001 sign     - P365742 obverse:1:18   P365742 obverse:1:18   ((59238, 59276, 59922), (59238, 59276, 59922))
  63396 word     - P399223 obverse:2:1    P399223 obverse:2:1    ((59229, 59265, 59399), (59229, 59265, 59399))
  53330 cluster  - P394104 obverse:2:67   P394104 obverse:2:67   ((59235, 59271, 59562), (59235, 59271, 59562))
  60814 line     - P394454 obverse:6'     P394454 obverse:6'     ((59243, 59286, 60814), (59243, 59286, 60814))
  59270 face     - P403381 obverse        P403381 obverse:1':6'  ((59234, 59270), (59234, 59270, 59442))
  59231 document - P400909                P400909 obverse:9'     ((59231,), (59231, 59267, 59422))


# Clean caches

Text-Fabric pre-computes data for you, so that it can be loaded faster.
If the original data is updated, Text-Fabric detects it, and will recompute that data.

But there are cases where the algorithms of Text-Fabric have changed without any changes in the data. Then you might
want to clear the cache of precomputed results.

There are two ways to do that:

* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
  This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
* Call `TF.clearCache()`, which does exactly the same.

It is not handy to execute the following cell all the time, that's why I have commented it out.
So if you really want to clear the cache, remove the comment sign below.

In [36]:
# TF.clearCache()
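
If you go for the manual route, a few lines of Python can locate the cache files for you. This is a sketch under assumptions: `remove_tfx_files` is a hypothetical helper, you have to pass it the path of the `.tf` directory yourself, and it searches subdirectories as well, in case the `.tfx` files do not sit at the top level. By default it only lists what it finds. The demonstration runs on a temporary directory with dummy files, not on real Text-Fabric data.

```python
import tempfile
from pathlib import Path


def remove_tfx_files(tf_dir, dry_run=True):
    """Find all *.tfx files below tf_dir; delete them unless dry_run is True."""
    found = sorted(p for p in Path(tf_dir).rglob("*.tfx") if p.is_file())
    if not dry_run:
        for path in found:
            path.unlink()
    return found


# Demonstration on a throw-away directory with dummy content.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    cache = base / ".tf" / "x"
    cache.mkdir(parents=True)
    (cache / "otype.tfx").write_text("dummy cache")
    (base / "otype.tf").write_text("dummy data")

    listed = [p.name for p in remove_tfx_files(base / ".tf")]
    remove_tfx_files(base / ".tf", dry_run=False)
    remaining = sorted(p.name for p in base.rglob("*") if p.is_file())

print(listed)
print(remaining)
```

Doing a dry run first lets you check the list before anything is deleted.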

# Next steps

By now you have an impression of how to compute around in the corpus.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[similarLines](similarLines.ipynb)** spot the similarities between lines

---

See the [cookbook](cookbook) for recipes for small, concrete tasks.

CC-BY Dirk Roorda