<img align="right" src="images/quad.png" width="300"/>
<img align="right" src="images/tf-small.png"/>


# Tutorial

This notebook gets you started with using
[Text-Fabric](https://github.com/Dans-labs/text-fabric) for coding with transcriptions of cuneiform tablets.

Chances are that a bit of reading about the underlying
[data model](https://github.com/Dans-labs/text-fabric/wiki/Data-model)
helps you to follow the exercises below, and vice versa.

Most programs start with loading a few modules.
In the next cell, the first line loads standard modules that come with Python itself,
and the second line loads Text-Fabric.

Before you can run this, you need to install Text-Fabric.
The basic instruction for that is, in a terminal:

```
pip install text-fabric
```

if you have installed Python with the help of Anaconda, or

```
sudo -H pip3 install text-fabric
```
if you have installed Python from [python.org](https://www.python.org).

Make sure that you do all this with Python **3**, not 2.

In [1]:
import sys, os, collections
from tf.fabric import Fabric

# Call Text-Fabric

Everything starts by setting up Text-Fabric.
It needs to know where to look for data.

The cuneiform tablet transcriptions are in the same repository as this tutorial.
I assume you have cloned [nino-cunei](https://github.com/Dans-labs/nino-cunei)
in your directory `~/github/Dans-labs`, so that your directory structure looks like this:

    your home directory\
    |                     - github\
    |                       |      - Dans-labs\
    |                       |        |         - nino-cunei
    
## Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your `nino-cunei` directory.
If you pull changes from the `nino-cunei` repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.

In [3]:
REPO = '~/github/Dans-labs/nino-cunei'
SOURCE = 'uruk'
VERSION = '0.1'
CORPUS = f'{REPO}/tf/{SOURCE}/{VERSION}'
TF = Fabric(locations=[CORPUS], modules=[''], silent=False)

This is Text-Fabric 3.1.3
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

23 features found and 0 ignored


# Load Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the corpus of tablet transcriptions as a gigantic spreadsheet, where row 1 corresponds to the
first sign, row 2 to the second sign, and so on, for all 100,000+ signs.

The grapheme name of each sign is a column `grapheme` in that spreadsheet.

Whether a sign is damaged is recorded in a column `damage`.

The corpus contains over 20 columns, not only for the signs, but also for 150,000+ other
textual objects, such as *(sub)quads*, *clusters*, *lines*, *cases*, *columns*, *faces* and *tablets*.

We also have features that contain the original lines of transcription.
These features are filled for tablets, faces, columns, lines, and comments.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
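
To make the spreadsheet picture concrete, here is a toy sketch in plain Python. It is not Text-Fabric code: the dictionaries, the node numbers, the values and the helper `row` are all invented for illustration. Each feature is a separate column, and a column only stores cells for the nodes that have a value.

```python
# Toy model only: plain dictionaries standing in for feature columns.
# Node numbers and values below are invented for illustration.
grapheme = {1: 'EN', 2: 'N01', 3: 'GAL'}  # one cell per sign that has a value
damage = {2: 1}                           # sparse column: only sign 2 is marked


def row(node, **features):
    """Collect the cells of one spreadsheet row from the separate columns."""
    return {name: column.get(node) for name, column in features.items()}


print(row(2, grapheme=grapheme, damage=damage))
# {'grapheme': 'N01', 'damage': 1}
print(row(3, grapheme=grapheme, damage=damage))
# {'grapheme': 'GAL', 'damage': None}
```

Keeping the columns apart means you only have to load the ones you need, which is what the next cell does.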

We just load the features we need for this tutorial.
Later on, where we use them, it will become clear what they mean.

In [5]:
api = TF.load('''
    grapheme prime variant
    damage uncertain remarkable written
    modifier
    name number catalogId
    srcLn srcLnNum
    op
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.00s B op                   from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s Feature overview: 20 for nodes; 2 for edges; 1 configs; 7 computed
  0.02s All features loaded/computed - for details use loadLog()


The result of this all is that we have a bunch of special variables at our disposal
that give us access to the text and data of the tablets.

At this point it is helpful to take a quick look at the Text-Fabric
[API documentation](https://github.com/Dans-labs/text-fabric/wiki/Api),
especially the right sidebar.

The most essential thing for now is that we can use `F` to access the data in the features
we've loaded.
But there is more, such as `N`, which helps us to walk over the text, as we will see in a minute.

# Counting

In order to get acquainted with the data, we start with the simple task of counting.

## Count all nodes
We use the 
[`N()` generator](https://github.com/Dans-labs/text-fabric/wiki/Api#walking-through-nodes)
to walk through the nodes.

We compared the tablet data to a gigantic spreadsheet, where the rows correspond to the signs.
In Text-Fabric, we call these rows `slots`, because they are the textual positions that can be filled with signs.

We also mentioned that there are other textual objects.
They are the tablets, columns, lines, etc.
They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows *nodes*, and the `N()` generator
carries us through those nodes in the textual order.

Just one extra thing: the `info` statements generate timed messages.
If you use them instead of `print` you'll get a sense of the amount of time that 
the various processing steps typically need.

In [6]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.09s 396870 nodes


Here you see it: nearly 400,000 nodes!
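
If you wonder what such timed messages amount to, here is a small stand-in in plain Python. It is only a sketch under my own assumptions, not the implementation of Text-Fabric's `info` and `indent`; the class `Timer` and its methods are hypothetical names.

```python
import time


class Timer:
    """Tiny stand-in for timed progress messages (not Text-Fabric's code)."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Restart the clock."""
        self.start = time.perf_counter()

    def info(self, message):
        """Print `message` prefixed with the seconds elapsed since the last reset."""
        elapsed = time.perf_counter() - self.start
        line = '{:>6.2f}s {}'.format(elapsed, message)
        print(line)
        return line


timer = Timer()
timer.info('Counting numbers ...')
total = sum(1 for _ in range(100000))
last_line = timer.info('{} numbers'.format(total))
```

The point is merely that every message carries the time elapsed since the last reset, so you can see which step is slow.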

## What are all those nodes?
Every node has a type, such as sign, line, or face.
We know that we have many of them,
but what exactly are they?

Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
`otype` tells you the type of each node, and you can also ask it for the number of slots in the text.

Here we go!

In [7]:
F.otype.slotType

'sign'

In [8]:
F.otype.maxSlot

125822

In [9]:
F.otype.maxNode

396870

In [10]:
F.otype.all

('tablet',
 'face',
 'column',
 'case',
 'line',
 'subquad',
 'quad',
 'cluster',
 'sign')

In [11]:
C.levels.data

(('tablet', 19.671982489055658, 125823, 132218),
 ('face', 13.32577843677187, 140149, 149590),
 ('column', 9.445477406080927, 149591, 162713),
 ('case', 3.5357118353344767, 233589, 277313),
 ('line', 3.1235549497703063, 193971, 233588),
 ('subquad', 1.042749054224464, 132219, 140148),
 ('quad', 1.0339252406801775, 277314, 396870),
 ('cluster', 1.0325687046101673, 162714, 193970),
 ('sign', 1, 1, 125822))

This is interesting: above you see all the node types, each with the average size of its nodes (in slots),
the node where the type starts, and the node where it ends.
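
Because every type occupies a consecutive range of node numbers, you can already do some arithmetic on these numbers without touching the corpus. The sketch below is plain Python on a copy of the tuple shown above; the names `levels`, `counts` and `type_of` are my own and not part of the Text-Fabric API.

```python
# Copied from the C.levels.data output above:
# (node type, average number of slots, first node, last node).
levels = (
    ('tablet', 19.671982489055658, 125823, 132218),
    ('face', 13.32577843677187, 140149, 149590),
    ('column', 9.445477406080927, 149591, 162713),
    ('case', 3.5357118353344767, 233589, 277313),
    ('line', 3.1235549497703063, 193971, 233588),
    ('subquad', 1.042749054224464, 132219, 140148),
    ('quad', 1.0339252406801775, 277314, 396870),
    ('cluster', 1.0325687046101673, 162714, 193970),
    ('sign', 1, 1, 125822),
)

# A consecutive range holds (last - first + 1) nodes.
counts = {otype: last - first + 1 for (otype, _avg, first, last) in levels}


def type_of(node):
    """Return the type whose node range contains `node`, or None."""
    for (otype, _avg, first, last) in levels:
        if first <= node <= last:
            return otype
    return None


print(counts['tablet'])               # 6396
print(sum(counts.values()))           # 396870
print(type_of(100), type_of(125828))  # sign tablet
```

The total equals `F.otype.maxNode`, as it should: every node belongs to exactly one type.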

## Count individual object types
This is an intuitive way to count the number of nodes in each type.
Note in passing how we use `indent` in conjunction with `info` to produce neat, timed,
and indented progress messages.

In [12]:
indent(reset=True)
info('counting objects ...')

for otype in F.otype.all:
    i = 0

    indent(level=1, reset=True)

    for n in F.otype.s(otype): i+=1

    info('{:>7} {}s'.format(i, otype))

indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s    6396 tablets
   |     0.00s    9442 faces
   |     0.00s   13123 columns
   |     0.01s   43725 cases
   |     0.00s   39618 lines
   |     0.00s    7930 subquads
   |     0.01s  119557 quads
   |     0.00s   31257 clusters
   |     0.02s  125822 signs
  0.06s Done


# Feature statistics

`F`
gives access to all features.
Every feature has a method
`freqList()`
to generate a frequency list of its values, higher frequencies first.
Here are the graphemes (the top 20):

In [13]:
F.grapheme.freqList()[0:20]

(('…', 28475),
 ('N01', 21099),
 ('X', 6728),
 ('N14', 5638),
 ('', 2374),
 ('EN', 1876),
 ('N57', 1747),
 ('N34', 1702),
 ('SZE', 1270),
 ('GAL', 1130),
 ('DUG', 1077),
 ('AN', 997),
 ('U4', 980),
 ('SAL', 845),
 ('GI', 834),
 ('NUN', 833),
 ('E2', 822),
 ('PAP', 817),
 ('BA', 766),
 ('SANGA', 693))
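
A frequency list of this shape is easy to imitate in plain Python, which may help if you want to count something that is not a feature. This is a sketch with invented sample values; `freq_list` is a hypothetical helper, and I make no claim about how Text-Fabric itself orders values with equal frequencies.

```python
from collections import Counter

# Invented sample values; a real feature has one value per node.
values = ['N01', 'EN', 'N01', 'X', 'N01', 'EN']


def freq_list(values):
    """Return (value, count) pairs, highest count first."""
    return tuple(Counter(values).most_common())


print(freq_list(values))
# (('N01', 3), ('EN', 2), ('X', 1))
```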

# Layer API
We travel upwards and downwards, forwards and backwards through the nodes.
The Layer-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.
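
To see how containment and adjacency follow from slots alone, here is a toy model in plain Python. It is a simplification under my own assumptions and not how `L` is implemented: the corpus, the node numbers and the function names are invented, results are simply sorted by node number, and the case of two nodes linked to exactly the same slots is ignored.

```python
# Toy model only: each node maps to the set of slots it is linked to.
# Nodes 1-4 are the slots themselves; the other node numbers are invented.
oslots = {
    1: {1}, 2: {2}, 3: {3}, 4: {4},  # slots (signs)
    10: {1, 2},                      # first line
    11: {3, 4},                      # second line
    20: {1, 2, 3, 4},                # tablet
}


def up(node):
    """Nodes whose slots properly contain the slots of `node`."""
    return tuple(sorted(m for m, s in oslots.items() if oslots[node] < s))


def down(node):
    """Nodes whose slots are properly contained in the slots of `node`."""
    return tuple(sorted(m for m, s in oslots.items() if s < oslots[node]))


def next_nodes(node):
    """Nodes whose first slot comes right after the last slot of `node`."""
    after = max(oslots[node]) + 1
    return tuple(sorted(m for m, s in oslots.items() if min(s) == after))


def prev_nodes(node):
    """Nodes whose last slot comes right before the first slot of `node`."""
    before = min(oslots[node]) - 1
    return tuple(sorted(m for m, s in oslots.items() if max(s) == before))


print(up(1))           # (10, 20)
print(down(10))        # (1, 2)
print(next_nodes(10))  # (3, 11)
print(prev_nodes(11))  # (2, 10)
```

Note that every function returns a tuple, also when the result is empty or has a single member, just like the functions of `L`.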

## Going up
We go from the first sign to the tablet that contains it.
Note the `[0]` at the end. You expect one tablet, yet `L` returns a tuple.
To get the only element of that tuple, you need to do that `[0]`.

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [14]:
firstTablet = L.u(1, otype='tablet')[0]
print(firstTablet)

125823
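
Here is what goes wrong when you forget the `[0]`, shown in plain Python. The tuple below only imitates the shape of the result of `L.u()`; the node number is the one printed above, and the dictionary `names` with its value is invented for illustration.

```python
# Imitates the shape of a result of L.u(): a tuple with a single node in it.
result = (125823,)

# Invented lookup table keyed by node number.
names = {125823: 'first tablet'}

print(result == 125823)     # False: a tuple is not a number
print(result in names)      # False: the tuple is not a key, so lookups miss
print(result[0] in names)   # True

# A defensive pattern for the case that nothing is found.
node = result[0] if result else None
print(names.get(node))      # first tablet
```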


And let's see all the containing objects of sign 100:

In [15]:
w = 100
for otype in F.otype.all:
    if otype == F.otype.slotType: continue
    up = L.u(w, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('sign {} is contained in {} {}'.format(w, otype, upNode))

sign 100 is contained in tablet 125828
sign 100 is contained in face 140157
sign 100 is contained in column 149605
sign 100 is contained in case 233624
sign 100 is contained in line 194003
sign 100 is contained in subquad x
sign 100 is contained in quad 277399
sign 100 is contained in cluster 162723


## Going next
Let's go to the next nodes of the first tablet.

In [16]:
afterFirstTablet = L.n(firstTablet)
for n in afterFirstTablet:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondTablet = L.n(firstTablet, otype='tablet')[0]

      8: sign          first slot=8     , last slot=8     
 277321: quad          first slot=8     , last slot=8     
 193973: line          first slot=8     , last slot=9     
 233591: case          first slot=8     , last slot=9     
 149593: column        first slot=8     , last slot=11    
 140150: face          first slot=8     , last slot=35    
 125824: tablet        first slot=8     , last slot=35    


## Going previous

And let's see what is right before the second tablet.

In [17]:
for n in L.p(secondTablet):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

 125823: tablet        first slot=1     , last slot=7     
 140149: face          first slot=1     , last slot=7     
 149592: column        first slot=4     , last slot=7     
 233590: case          first slot=4     , last slot=7     
 193972: line          first slot=4     , last slot=7     
 277320: quad          first slot=7     , last slot=7     
 162715: cluster       first slot=7     , last slot=7     
      7: sign          first slot=7     , last slot=7     


## Going down

We go to the columns of the second tablet, and just count them.

In [18]:
columns = L.d(secondTablet, otype='column')
print(len(columns))

5


## The first line
We pick the first line and the first sign, and explore what is above and below them.

In [19]:
for n in [1, L.u(1, otype='line')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1
   |   UP
   |      |   162714          cluster
   |      |   277314          quad
   |      |   193971          line
   |      |   233589          case
   |      |   149591          column
   |      |   140149          face
   |      |   125823          tablet
   |   DOWN
   |      |   
Node 193971
   |   UP
   |      |   233589          case
   |      |   149591          column
   |      |   140149          face
   |      |   125823          tablet
   |   DOWN
   |      |   277314          quad
   |      |   162714          cluster
   |      |   1               sign
   |      |   277315          quad
   |      |   2               sign
   |      |   277316          quad
   |      |   3               sign
Done


# Edge features: left and right

We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet,
the edges point from one row to another.

One edge we have encountered: the special feature `oslots`.
Each non-slot node is linked by `oslots` to all of its slot nodes.

An edge is really a feature as well.
Whereas a node feature is a column of information,
one cell per node, 
an edge feature is also a column of information, one cell per pair of nodes.

In the tablets, quads may be subdivided into subquads and signs, related by operators.
If there is an operator `op` between `qLeft` and `qRight`, there is an 
edge between `qLeft` and `qRight` with value `op`.
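
The idea of one cell per pair of nodes can be sketched with a plain dictionary. This is a toy model, not the Text-Fabric edge API: the node numbers, the operator values and the helper names `edges_from` and `edges_to` are all invented.

```python
# Toy model only: an edge feature as a dict keyed by (from, to) node pairs.
# Node numbers and operator values here are invented.
op = {
    (101, 102): 'x',
    (102, 103): '.',
}


def edges_from(node, edge_feature):
    """All (target, value) pairs for edges that leave `node`."""
    return tuple((t, v) for (f, t), v in sorted(edge_feature.items()) if f == node)


def edges_to(node, edge_feature):
    """All (source, value) pairs for edges that arrive at `node`."""
    return tuple((f, v) for (f, t), v in sorted(edge_feature.items()) if t == node)


print(edges_from(102, op))  # ((103, '.'),)
print(edges_to(102, op))    # ((101, 'x'),)
```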

# Next steps

By now you have an impression of how to compute with the cuneiform tablets.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

## Search
Text-Fabric contains a flexible search engine that works not only for the BHSA data,
but also for data that you add to it.
There is a tutorial dedicated to [search](search.ipynb).
And if you already know MQL queries, you can build from that in
[searchFromMQL](searchFromMQL.ipynb).

## Explore additional data
The ETCBC has a few other repositories with data that work in conjunction with the BHSA data.
One of them you have already seen: 
[phono](https://github.com/ETCBC/phono),
for phonetic transcriptions.

There is also
[parallels](https://github.com/ETCBC/parallels)
for detecting parallel passages,
and
[valence](https://github.com/ETCBC/valence)
for studying patterns around verbs that determine their meanings.

## Add your own data
If you study the additional data, you can observe how that data is created and also
how it is turned into a text-fabric data module.
The last step is incredibly easy. You can write out every Python dictionary where the keys are numbers
and the values are strings or numbers as a Text-Fabric feature.
When you are creating data, you have already constructed those dictionaries, so writing
them out is just one method call.
See for example how the
[flowchart](https://github.com/ETCBC/valence/blob/master/programs/flowchart.ipynb#Add-sense-feature-to-valence-module)
notebook in valence writes out verb sense data.
![flow](images/valence.png)

You can then easily share your new features on GitHub, so that your colleagues everywhere 
can try it out for themselves.
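
Before you write out a dictionary as a feature, it can be useful to check that it has the shape described above. The helper below is a hypothetical sketch in plain Python, not part of Text-Fabric; the sample data is invented.

```python
from numbers import Number


def is_feature_dict(data):
    """Check the shape described above: integer keys, string or number values."""
    return isinstance(data, dict) and all(
        isinstance(node, int)
        and not isinstance(node, bool)
        and isinstance(value, (str, Number))
        for node, value in data.items()
    )


print(is_feature_dict({1: 'EN', 2: 3}))   # True
print(is_feature_dict({'one': 'EN'}))     # False: key is not a number
print(is_feature_dict({1: ['EN']}))       # False: value is a list
```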

## Export to Emdros MQL

[EMDROS](http://emdros.org), written by Ulrik Petersen,
is a text database system with the powerful *topographic* query language MQL.
The ideas are based on a model devised by Christ-Jan Doedens in
[Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).

Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.

[SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.

So it is kind of logical and convenient to be able to work with a Text-Fabric resource through MQL.

If you have obtained an MQL dataset somehow, you can turn it into a Text-Fabric data set with `importMQL()`,
which we will not show here.

And if you want to export a Text-Fabric data set to MQL, that is also possible.

After the `Fabric(modules=...)` call, you can call `exportMQL()` in order to save all features of the
indicated modules into a big MQL dump, which can be imported by an EMDROS database.

# Clean caches

Text-Fabric pre-computes data for you, so that it can be loaded faster.
If the original data is updated, Text-Fabric detects it, and will recompute that data.

But there are cases where the algorithms of Text-Fabric have changed without any changes in the data.
In those cases you might want to clear the cache of precomputed results.

There are two ways to do that:

* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
  This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
* Call `TF.clearCache()`, which does exactly the same.
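
The first, manual route can be scripted in plain Python. This is a sketch under my own assumptions: `find_cache_files` and `remove_cache_files` are hypothetical helpers, not Text-Fabric functions, and the demo runs in a throwaway directory so that no real cache is touched.

```python
import tempfile
from pathlib import Path


def find_cache_files(tf_dir):
    """Return the `.tfx` files directly inside `tf_dir`, sorted by name."""
    return sorted(Path(tf_dir).expanduser().glob('*.tfx'))


def remove_cache_files(tf_dir, dry_run=True):
    """List the `.tfx` files in `tf_dir`; delete them unless `dry_run` is set."""
    files = find_cache_files(tf_dir)
    if not dry_run:
        for f in files:
            f.unlink()
    return files


# Demo in a temporary directory with an invented cache file.
with tempfile.TemporaryDirectory() as tmp:
    tf_dir = Path(tmp) / '.tf'
    tf_dir.mkdir()
    (tf_dir / 'otype.tfx').write_text('cache')
    (tf_dir / 'notes.txt').write_text('keep me')
    found_names = [f.name for f in remove_cache_files(tf_dir, dry_run=False)]
    left_after = sorted(f.name for f in tf_dir.iterdir())

print(found_names)  # ['otype.tfx']
print(left_after)   # ['notes.txt']
```

With the default `dry_run=True` nothing is deleted, so you can first inspect what would be removed.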

It is not practical to execute the following cell every time, which is why I have commented it out.
So if you really want to clear the cache, remove the comment sign below.

In [39]:
# TF.clearCache()