<img align="right" src="images/quad.png" width="300"/>
<img align="right" src="images/tf-small.png"/>


# Tutorial

This notebook gets you started with using
[Text-Fabric](https://github.com/Dans-labs/text-fabric) for coding in cuneiform tablet transcriptions.

## What is Text-Fabric?

### Cuneiform tablets in ATF

Cuneiform tablets have been transcribed in ATF files, in which the marks on a tablet are represented
by ASCII characters. The marks on a tablet have structure (they can be composed, they build cases and lines) and properties (they can be uncertain).

When you search for tablet data in an ATF file, you can do so conveniently by using regular expressions.

However, the ATF transcriptions have become packed with information. Not every transcriber uses ATF in the same way, and there are a few coding errors in the sources.

That means that the most obvious search expressions will leave out cases. Either you live with that, or you refine your search expressions.

Another issue is that when you look for something, your search expressions must reflect the shape of not
only your target, but also everything else. There is virtually no separation of concerns.
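
To make the point about missed cases concrete, here is a minimal sketch in plain Python. The sample lines are copied from transcriptions shown later in this tutorial. The two patterns are illustrative assumptions, not expressions taken from any existing tool. The naive pattern expects the sign to stand alone between spaces, so it misses a sign that carries a flag such as `?`.

```python
import re

# Sample transcription lines, copied from outputs later in this tutorial.
sampleLines = [
    '1. 3(N14) X SANGA~a? [...]',
    '1.a. 1(N01) , ISZ~a#? ',
    '1.b1. , (PAP~a GIR3~c)a',
]

# Naive: the sign must be surrounded by whitespace (or line boundaries).
naive = re.compile(r'(?<!\S)SANGA~a(?!\S)')

# Refined: also accept trailing flag characters such as # and ?
# (which flags to allow is an assumption made for this sketch).
refined = re.compile(r'(?<![A-Z0-9])SANGA~a[#?!]*(?![a-z0-9~])')

naiveHits = [line for line in sampleLines if naive.search(line)]
refinedHits = [line for line in sampleLines if refined.search(line)]

print(f'naive: {len(naiveHits)} hit(s), refined: {len(refinedHits)} hit(s)')
```

The refined pattern finds the line, but only because it now encodes knowledge about flags that has nothing to do with the sign being searched for.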

### Text-Fabric

Text-Fabric is a model for textual data with annotations that is optimized for efficient data analysis. Not only that, it also facilitates the creation of new, derived data, which can be added to the original data.
Data combination is a feature of Text-Fabric.

Text-Fabric is being used for the [Hebrew Bible]() and a large body of linguistic annotations on top of it. The researchers of the [ETCBC]() thought that a plain database is not a satisfactory text model, and that XML is too limited to express multiple hierarchies in a text smoothly.

That's why they adopted a model by [Doedens]() that reflects more of the essential properties of text (sequence, embedding). This model is the basis of MQL, a working text-database system.
Text-Fabric is based on the same model, and once the data is in Text-Fabric, it can be exported to MQL.

With data in Text-Fabric, it becomes possible to build rich online interfaces on the data of ancient texts.
For the Hebrew Bible, we have built [SHEBANQ]().

Working with TF is a bit like buying from IKEA. You get your product in bits and pieces, and you assemble it yourself. TF decomposes any dataset into its components, nicely stacked per component, with every component uniquely labeled. You go to the store, make your selection, enter the warehouse, collect your parts, and, at home, assemble your product.

In order to enjoy an IKEA product, you do not need to be a craftsman, but you do need to be able to handle a screwdriver.

In the TF world, it is the same. You do not have to be a professional programmer, but you do need to be able to program little things. A first course in Python is enough.

Another parallel: in IKEA you take a package with components home, and there you assemble it.
In TF it is likewise: you download the TF data, and then you write a little program. Inside that program you can call up the Text-Fabric tool, which acts as the IKEA user manual. But your program takes control, not Text-Fabric.

The best environment to enjoy Text-Fabric is in Python programs that you develop in a 
[Jupyter Notebook]().
This tutorial is such a notebook. If you are reading it online, you see text bits and code bits,
but you cannot execute the code bits.

If you download this tutorial, and you have installed Python, Jupyter, and Text-Fabric,
you can *execute* the code bits.

## Overview

* we tell you how to get Text-Fabric on your system
* we tell you how to get a set of cuneiform tablet transcriptions on your system
* we show how to explore the data:
  * finding the relevant nodes
  * moving from one place to another
  * collecting the relevant information
  * performing analysis
  * visualizing your results

## More information
Chances are that a bit of reading about the underlying
[data model](https://github.com/Dans-labs/text-fabric/wiki/Data-model)
helps you to follow the exercises below, and vice versa.

We have checked the conversion from the transcriptions to Text-Fabric extensively.
Cruelly, you might say. You can follow the checks
in a separate notebook [checks](checks.ipynb).

## Installing Text-Fabric

### Prerequisites

You need to have Python on your system. Most systems have it out of the box,
but alas, that is Python 2, and we need at least Python 3.6.

Install it from [python.org]() or from [Anaconda]().
If you got it from python.org, you also have to install [Jupyter]().

### TF itself

```
pip install text-fabric
```

if you have installed Python with the help of Anaconda, or

```
sudo -H pip3 install text-fabric
```
if you have installed Python from [python.org](https://www.python.org).

###### Execute: If all this is done, the following cells can be executed.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, os, collections
from tf.fabric import Fabric
from utils import Compare

### Cuneiform data

We have prepared a corpus of more than 6000 tablets from the Uruk-III/IV period in Text-Fabric.
We have downloaded the transcriptions from CDLI, and converted them to Text-Fabric.

You can get the results from GitHub as follows.

We suggest you make an appropriate directory in your home directory:

```
github/Dans-labs
```

then go to that directory in a terminal, and then say

```
git clone https://github.com/Dans-labs/Nino-cunei
```

After that your directory structure should look like this:

    your home directory\
    |                     - github\
    |                       |      - Dans-labs\
    |                       |        |         - Nino-cunei
    
#### Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your `Nino-cunei` directory.
If you pull changes from the `Nino-cunei` repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.

###### Execute: if this has been done, you can execute the next cells

In [3]:
REPO = '~/github/Dans-labs/Nino-cunei'
SOURCE = 'uruk'
VERSION = '0.1'
CORPUS = f'{REPO}/tf/{SOURCE}/{VERSION}'
SOURCE_DIR = os.path.expanduser(f'{REPO}/sources/cdli')
TEMP_DIR = os.path.expanduser(f'{REPO}/_temp')
TF = Fabric(locations=[CORPUS], modules=[''], silent=False )

This is Text-Fabric 3.2.0
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

33 features found and 0 ignored


## Load Data

The transcriptions of the tablets in their TF form are organized in a model of nodes, edges and features.

The things such as tablets, faces, columns, cases, and, at the most basic level, signs, are numbered.
The signs correspond to the numbers 1 ... ca. 147,000, in the same order as they occur in the corpus.
All other things are built from signs. They have numbers from ca. 147,000 to 318,000.

In TF, we call these numbers *nodes*. Like a barcode, this number gives access to a whole bunch of
information about the corresponding object.

For example, lines have a property (in TF we call it a *feature*) called `fullNumber`. 
It contains the hierarchical number found at the start of the line in the transcription.

If the node for a line is `n`, we can find its hierarchical number by saying

```
F.fullNumber.v(n)
```

In words, it reads as:

* `F`: I want to look up a `F`eature
* `fullNumber`: the name of the feature
* `.v`: I want the value of that feature
* `(n)`: for the given node `n`

Seen in this way, the data is like a gigantic spreadsheet of about 318,000 rows (the nodes),
with about 30 columns (the features).

There is a bit more to it, since the nodes can be grouped together in ways we will see in a moment.

The grapheme name of each sign is a column `grapheme` in that spreadsheet.

The information on whether a sign is damaged constitutes a column `damage`.

The corpus contains over 30 columns, not only for the signs, but also for about 170,000 more
textual objects, such as *(sub)quads*, *clusters*, *lines*, *cases*, *columns*, *faces* and *tablets*.

We also have features that contain the original lines of transcription.
These features are filled for tablets, faces, columns, lines, and comments.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.
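
The idea of one separate column per feature can be mimicked with plain Python dictionaries. The following is a toy sketch only: the class `ToyFeature` and the values in it are invented for illustration and say nothing about how Text-Fabric stores its data internally. It merely mirrors the `.v(n)` lookup style introduced above.

```python
class ToyFeature:
    """One column of the spreadsheet: a mapping from node number to value."""

    def __init__(self, data):
        self.data = data

    def v(self, n):
        # In this toy, a node without a value for the feature yields None.
        return self.data.get(n)


# Invented values for three sign nodes, for illustration only.
toyGrapheme = ToyFeature({1: 'X', 2: 'X', 3: 'N14'})
toyDamage = ToyFeature({2: 1})

for n in (1, 2, 3):
    print(n, toyGrapheme.v(n), toyDamage.v(n))
```

Because each column is a separate object, a program only needs to touch the columns it actually uses.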

We just load the features we need for this tutorial.
Later on, where we use them, it will become clear what they mean.

In [4]:
api = TF.load('''
    grapheme prime repeat
    variant variantOuter
    modifier modifierInner modifierFirst
    damage uncertain remarkable written
    kind
    period name type identifier catalogId
    number fullNumber origNumber badNumbering
    crossref text
    srcLn srcLnNum
    op sub comments''')
api.makeAvailableIn(globals())
COMP = Compare(api, SOURCE_DIR, TEMP_DIR)

  0.00s loading features ...
   |     0.00s B catalogId            from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B fullNumber           from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B number               from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.05s B grapheme             from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.04s B srcLn                from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B srcLnNum             from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B prime                from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.01s B repeat               from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.01s B variant              from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B variantOuter         from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B modi

The result of this all is that we have a bunch of special variables at our disposal
that give us access to the text and data of the tablets.

At this point it is helpful to take a quick look at the Text-Fabric
[API documentation](https://github.com/Dans-labs/text-fabric/wiki/Api),
especially the right sidebar.

The most essential thing for now is that we can use `F` to access the data in the features
we've loaded.
But there is more, such as `N`, which helps us to walk over the text, as we see in a minute.

# Counting

In order to get acquainted with the data, we start with the simple task of counting.

## Count all nodes
We use the 
[`N()` generator](https://github.com/Dans-labs/text-fabric/wiki/Api#walking-through-nodes)
to walk through the nodes.

We compared the tablet data to a gigantic spreadsheet, where the first rows correspond to the signs.
In Text-Fabric, we call these rows `slots`, because they are the textual positions that can be filled with signs.

We also mentioned that there are other textual objects.
They are the tablets, columns, lines, etc.
They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows *nodes*, and the `N()` generator
carries us through those nodes in the textual order.

Just one extra thing: the `info` statements generate timed messages.
If you use them instead of `print` you'll get a sense of the amount of time that 
the various processing steps typically need.

In [5]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.08s 318107 nodes


Here you see it: more than 300,000 nodes!

## What are all those nodes?
Every node has a type, such as sign, line, or face.
We know that we have many of them,
but what exactly are they?

Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.

Here we go!

In [6]:
F.otype.slotType

'sign'

In [7]:
F.otype.maxSlot

146955

In [8]:
F.otype.maxNode

318107

In [9]:
F.otype.all

('tablet',
 'face',
 'column',
 'line',
 'case',
 'cluster',
 'quad',
 'comment',
 'sign')

We can obtain a bit more knowledge about the types of nodes.

In [10]:
for (nodeType, avLen, startNode, endNode) in C.levels.data:
    print(f'{nodeType:<8} average length {avLen:>7.4f} from {startNode:>6} to {endNode:>6}')

tablet   average length 22.9761 from 150862 to 157257
face     average length 14.3466 from 157258 to 166699
column   average length  9.4888 from 166700 to 180732
line     average length  3.5643 from 229498 to 266422
case     average length  3.1821 from 266423 to 318107
cluster  average length  1.0313 from 196539 to 229497
quad     average length  2.0507 from 146956 to 150861
comment  average length  1.0000 from 180733 to 196538
sign     average length  1.0000 from      1 to 146955
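
The node ranges in this output already determine how many nodes of each type there are, and to which type any node number belongs. Here is a small sketch in plain Python that works only on the numbers copied from the output above. The names `toyLevels` and `toyOtype` are made up for this illustration, and the sketch assumes that every node type occupies exactly one contiguous range, as it does in this output.

```python
import bisect

# (node type, first node, last node), copied from the output above.
toyLevels = [
    ('tablet', 150862, 157257),
    ('face', 157258, 166699),
    ('column', 166700, 180732),
    ('line', 229498, 266422),
    ('case', 266423, 318107),
    ('cluster', 196539, 229497),
    ('quad', 146956, 150861),
    ('comment', 180733, 196538),
    ('sign', 1, 146955),
]

# Number of nodes per type: last node minus first node plus one.
counts = {nodeType: end - start + 1 for (nodeType, start, end) in toyLevels}
total = sum(counts.values())

# Ranges sorted by first node, so that we can locate a node by bisection.
ranges = sorted((start, end, nodeType) for (nodeType, start, end) in toyLevels)
starts = [start for (start, end, nodeType) in ranges]


def toyOtype(n):
    """Find the type of node n by locating the range that contains it."""
    i = bisect.bisect_right(starts, n) - 1
    if i < 0:
        return None
    (start, end, nodeType) = ranges[i]
    return nodeType if start <= n <= end else None


print(counts['tablet'], total, toyOtype(3), toyOtype(196539))
```

The counts add up to exactly the value of `F.otype.maxNode`, which confirms that the ranges cover all nodes without gaps or overlaps.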


Here is the first *cluster*.

In [11]:
cl = F.otype.s('cluster')[0]
cl

196539

This is what is embedded in it.

In [12]:
for n in L.d(cl):
    print(f'node {n:>6} of type {F.otype.v(n)}')

node      3 of type sign


Here is the third *sign*.

In [13]:
F.otype.v(3)

'sign'

And this is its context.

In [14]:
for n in L.u(3):
    print(f'node {n:>6} of type {F.otype.v(n)}')

node 196539 of type cluster
node 266423 of type case
node 229498 of type line
node 166700 of type column
node 157258 of type face
node 150862 of type tablet


## Count individual object types
This is an intuitive way to count the number of nodes in each type.
Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed 
and indented progress messages.

In [15]:
indent(reset=True)
info('counting objects ...')

for otype in F.otype.all:
    i = 0

    indent(level=1, reset=True)

    for n in F.otype.s(otype): i+=1

    info('{:>7} {}s'.format(i, otype))

indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s    6396 tablets
   |     0.00s    9442 faces
   |     0.01s   14033 columns
   |     0.00s   36925 lines
   |     0.01s   51685 cases
   |     0.01s   32959 clusters
   |     0.00s    3906 quads
   |     0.00s   15806 comments
   |     0.02s  146955 signs
  0.09s Done


# Locality

We travel upwards and downwards, forwards and backwards through the nodes.
The Locality-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.
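
The slot-based definitions of containment and adjacency can be tried out on a handful of nodes without Text-Fabric. The sketch below is a toy model, not the way `L` is implemented. The first and last slots are copied from the outputs in the "Going next" and "Going previous" sections of this tutorial, and the sketch assumes that each node covers a contiguous run of slots. All function names are invented for the illustration.

```python
# node -> (first slot, last slot)
toySlots = {
    150862: (1, 9),     # tablet
    157258: (3, 9),     # face
    166701: (6, 9),     # column
    196540: (9, 9),     # cluster
    9: (9, 9),          # sign
    10: (10, 10),       # sign
    180735: (10, 10),   # comment
    150863: (10, 39),   # tablet
}


def slotSet(n):
    """All slots of node n, assuming a contiguous run."""
    (first, last) = toySlots[n]
    return set(range(first, last + 1))


def toyContains(a, b):
    """True if the slots of a include all slots of b."""
    return slotSet(a) >= slotSet(b)


def toyNext(n):
    """Nodes whose first slot comes right after the last slot of n."""
    lastOfN = toySlots[n][1]
    return {m for (m, (first, last)) in toySlots.items() if first == lastOfN + 1}


def toyPrev(n):
    """Nodes whose last slot comes right before the first slot of n."""
    firstOfN = toySlots[n][0]
    return {m for (m, (first, last)) in toySlots.items() if last == firstOfN - 1}


print(toyContains(150862, 157258))
print(sorted(toyNext(150862)))
print(sorted(toyPrev(150863)))
```

Unlike the real `L` functions, this toy returns unordered sets and does not say how to rank two nodes that occupy exactly the same slots.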

## Going up
We go from the first sign to the tablet that contains it.
Note the `[0]` at the end. You expect one tablet, yet `L` returns a tuple. 
To get the only element of that tuple, you need to do that `[0]`.

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.
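
One way to guard against this, sketched here with plain tuples so that it runs without Text-Fabric, is a tiny helper that unpacks the first element and copes with an empty result. The helper `firstOrNone` is a hypothetical convenience for this tutorial, not part of the Text-Fabric API.

```python
def firstOrNone(nodes):
    """Return the first node of a tuple of nodes, or None if it is empty."""
    return nodes[0] if nodes else None


# Stand-ins for what an L function might return.
print(firstOrNone((150862,)))
print(firstOrNone(()))
```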

In [16]:
firstTablet = L.u(1, otype='tablet')[0]
print(firstTablet)

150862


And let's see all the containing objects of sign 100:

In [17]:
w = 100
for otype in F.otype.all:
    if otype == F.otype.slotType: continue
    up = L.u(w, otype=otype)
    upNode = None if len(up) == 0 else up[0]
    if upNode is None:
        print('sign {} is not contained in a {}'.format(w, otype))
    else:
        print('sign {} is contained in {} {}'.format(w, otype, upNode))

sign 100 is contained in tablet 150866
sign 100 is contained in face 157264
sign 100 is contained in column 166712
sign 100 is contained in line 229525
sign 100 is contained in case 266453
sign 100 is not contained in a cluster
sign 100 is not contained in a quad
sign 100 is not contained in a comment


## Going next
Let's go to the next nodes of the first tablet.

In [18]:
afterFirstTablet = L.n(firstTablet)
for n in afterFirstTablet:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondTablet = L.n(firstTablet, otype='tablet')[0]

     10: sign          first slot=10    , last slot=10    
 180735: comment       first slot=10    , last slot=10    
 150863: tablet        first slot=10    , last slot=39    


## Going previous

And let's see what is right before the second tablet.

In [19]:
for n in L.p(secondTablet):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

 150862: tablet        first slot=1     , last slot=9     
 157258: face          first slot=3     , last slot=9     
 166701: column        first slot=6     , last slot=9     
 229499: line          first slot=6     , last slot=9     
 266424: case          first slot=6     , last slot=9     
 196540: cluster       first slot=9     , last slot=9     
      9: sign          first slot=9     , last slot=9     


## Going down

We go to the columns of the second tablet, and just count them.

In [20]:
columns = L.d(secondTablet, otype='column')
print(len(columns))

5


## The first line
We pick the first line and the first sign, and explore what is above and below them.

In [21]:
firstLine = L.d(firstTablet, otype='line')[0]

for n in [1, firstLine]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1
   |   UP
   |      |   180733          comment
   |      |   150862          tablet
   |   DOWN
   |      |   
Node 229498
   |   UP
   |      |   266423          case
   |      |   166700          column
   |      |   157258          face
   |      |   150862          tablet
   |   DOWN
   |      |   166700          column
   |      |   266423          case
   |      |   196539          cluster
   |      |   3               sign
   |      |   4               sign
   |      |   5               sign
Done


# Text

The `T` functions provide ways of printing out text, and they know about section levels.

We use section levels `tablet`, `column`, `line`.
`face` is a level of nodes, but not a section level.

We will define our own function to get the literal transcription text back for
tablets, faces, etc.

In [22]:
oLevels = '''
    tablet
    face
    column
    case
'''.strip().split()

lowerLevel = dict((oLevels[n], oLevels[n+1]) for n in range(len(oLevels) - 1))

def transObject(n):
    kind = F.otype.v(n)
    trans = []
    trans.append(f'{F.srcLnNum.v(n):>7}: {F.srcLn.v(n)}')
    for c in E.comments.f(n):
        trans.append(f'{F.srcLnNum.v(c):>7}: {F.srcLn.v(c)}')
    print('\n'.join(trans))
    subKind = lowerLevel.get(kind, None)
    if subKind:
        for m in L.d(n, otype=subKind):
            if F.srcLn.v(m) is not None:
                transObject(m)

In [23]:
transObject(firstTablet)

      1: &P006427 = HJN 0044
      2: #version: 0.1
      3: #atf: lang qpc
      4: @obverse
      5: @column 1
      6: 1. [...] , X X
      7: @column 2
      8: 1. 3(N14) X SANGA~a? [...]


If we know the *P-number*, we can get the tablet with that P-number by means of
`T.nodeFromSection()`.

You pass this function a tuple, representing *tablet*, *column*, *line*, and it gives you back
the node of the corresponding object.

*column* and *line* are optional.

In [24]:
tabletId = 'P471695'
tabletNode = T.nodeFromSection((tabletId,))
tabletNode

150867

Now we can get the transcription of this tablet.

In [25]:
transObject(tabletNode)

     87: &P471695 = Anonymous 0712 
     88: #atf: lang qpc 
     90: @obverse 
     91: @column 1
     92: 1.a. 3(N01) , APIN~a 3(N57) UR4~a 
     93: 1.b1. , (EN~a DU ZATU759)a 
     94: 1.b2. , (BAN~b KASZ~c)a 
     95: 1.b3. , (KI@n SAG)a 
     96: 2.a. 1(N14) 2(N01) , [...] 
     97: 2.b1. , (3(N57) PAP~a)a 
     98: 2.b2. , (SZU KI X)a 
     99: $ n lines broken  
    100: 2.b3'. , (EN~a AN EZINU~d)a 
    101: 2.b4'. , (IDIGNA [...])a 
    102: $ rest broken 
    103: $ (for a total of 12 sub-cases with PNN) 
    104: @column 2
    105: 1.a. 1(N01) , ISZ~a#? 
    106: 1.b1. , (PAP~a GIR3~c)a
    107: $ blank space 
    108: $ rest broken 
    109: @reverse 
    110: $ beginning broken 
    111: 1'. [1(N14)] 6(N01)# , [...] 
    111: 1'. [1(N14)] 6(N01)# , [...] 


# Graphemes

`F`
gives access to all features.
Every feature has a method
`freqList()`
to generate a frequency list of its values, higher frequencies first.
Here are the graphemes (the top 20):

In [26]:
for (value, frequency) in F.grapheme.freqList()[0:20]:
    print(f'{frequency:>5} x {value}')

29618 x …
21676 x N01
17128 x 
 6956 x X
 5924 x N14
 1970 x EN
 1846 x N57
 1835 x N34
 1349 x SZE
 1241 x GAL
 1125 x DUG
 1069 x AN
 1046 x U4
  892 x NUN
  881 x SAL
  879 x PAP
  877 x E2
  875 x GI
  788 x BA
  747 x SANGA
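
To see what such a frequency list amounts to, here is a sketch that builds one from a plain list of values with the standard library. The function `toyFreqList` and the sample values are made up for illustration. Breaking ties alphabetically is a choice made in this sketch and is not claimed to be what `freqList()` does.

```python
import collections


def toyFreqList(values):
    """Pairs (value, frequency), highest frequency first, ties by value."""
    counts = collections.Counter(values)
    return sorted(counts.items(), key=lambda x: (-x[1], x[0]))


# Invented sample of grapheme values.
sample = ['N01', 'X', 'N01', 'EN', 'X', 'N01']
toyList = toyFreqList(sample)

for (value, frequency) in toyList:
    print(f'{frequency:>5} x {value}')
```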


We can do a bit more: we'll write a file with all graphemes to your TEMP_DIR.
In fact, we'll write two: one ordered by grapheme, and one ordered by frequency.

In order to not clutter this notebook, we use a function `writeFreqs()`, defined in 
[utils](utils.py) in the same directory.

In [27]:
COMP.writeFreqs('grapheme-plain', F.grapheme.freqList(), 'bare grapheme')

There are 667 bare graphemes


Now have a look at your TEMP_DIR and you will see two generated files:

* `graphemes-plain-alpha.txt` (sorted by grapheme)
* `graphemes-plain-freq.txt` (sorted by frequency)

But we can do better: we also want the prime, variants, and modifiers taken into account.

Let us first see what they can be.

## Prime

The prime is a feature with two values: 1 or 0. 1 means: there is a prime.
Below you see how often that occurs.
Note that we count all primes here: on signs, case numbers and column numbers.

For more info and a check on the occurrences of primes, see [checks](checks.ipynb).

In [28]:
for (value, frequency) in F.prime.freqList():
    print(f'{frequency:>5} x {value}')

 5184 x 1


## Variant

The variant or allograph is what occurs after the grapheme and after the `~` symbol, which should be digits and/or
lowercase letters except the `x`.

Here is the frequency list of variant values.

In [29]:
for (value, frequency) in F.variant.freqList():
    print(f'{frequency:>5} x {value}')

23804 x a
 4172 x b
 1532 x c
 1356 x a1
  703 x b1
  194 x a2
  187 x d
  127 x b2
   85 x f
   73 x a3
   40 x e
   29 x c2
   22 x c1
   22 x c3
   17 x v
   14 x c5
   13 x b3
   12 x a0
   12 x d1
   11 x c4
    6 x a4
    6 x g
    5 x d2
    4 x d4
    4 x h
    2 x 3a
    2 x d3
    1 x h2


## Modifier

The modifier is what occurs after the grapheme and after the `@` symbol, which should be digits and/or
lowercase letters except the `x`.

Here are the frequency lists of *modifier* and *modifierInner* values.

In [30]:
for (value, frequency) in F.modifier.freqList():
    print(f'{frequency:>5} x {value}')

  648 x g
  251 x t
   39 x n
    6 x r
    4 x s
    1 x c
    1 x v


In [32]:
for (value, frequency) in F.modifierInner.freqList():
    print(f'{frequency:>5} x {value}')

   25 x f
   15 x t
    1 x r
    1 x v
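
The conventions for variants (after `~`) and modifiers (after `@`) can be illustrated by splitting a transcribed sign into its parts. This is a minimal sketch that assumes the simple shape `GRAPHEME~variant@modifier`, with both suffixes optional and in that order. It is not the parser that was used for the conversion, and it does not handle numerals, primes, flags or quads. The name `splitSign` is invented here.

```python
import re

# Variant and modifier: digits and/or lowercase letters except x.
signPattern = re.compile(
    r'^(?P<grapheme>[A-Z][A-Z0-9]*)'
    r'(?:~(?P<variant>[a-wyz0-9]+))?'
    r'(?:@(?P<modifier>[a-wyz0-9]+))?$'
)


def splitSign(text):
    """Return (grapheme, variant, modifier), or None if the shape is not recognized."""
    match = signPattern.match(text)
    if match is None:
        return None
    return (match.group('grapheme'), match.group('variant'), match.group('modifier'))


# Signs taken from transcription lines shown earlier in this tutorial.
for sign in ('EN~a', 'KI@n', 'U4', '3(N14)'):
    print(sign, splitSign(sign))
```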


## Full grapheme overview

We make a frequency list of all full graphemes, i.e. the grapheme including variant, modifier, and prime.
We show them as they appear in transcriptions.

First we show on what node types primes, variants and modifiers occur.
We only deal with cases where they occur on signs, ignoring the cases where they occur on (sub)quads.

In [33]:
for feature in ('prime', 'variant', 'modifier'):
    nodeTypes = collections.Counter()
    for n in N():
        if Fs(feature).v(n):
            nodeTypes[F.otype.v(n)] += 1
    for (value, frequency) in nodeTypes.items():
        print(f'{feature:<10}: {frequency:>5} x {value}')

prime     :  4652 x case
prime     :   523 x column
prime     :     9 x sign
variant   : 32455 x sign
modifier  :   950 x sign


Now the full graphemes.

In [34]:
fullGraphemes = collections.Counter()

for n in F.otype.s('sign'):
    fullGrapheme = COMP.strFromSign(n)
    fullGraphemes[fullGrapheme] += 1
    
for (value, frequency) in sorted(fullGraphemes.items(), key=lambda x: (-x[1], x[0]))[0:20]:
    print(f'{frequency:>5} x {value}')
    
COMP.writeFreqs('grapheme-full', fullGraphemes.items(), 'full grapheme')

29618 x ...
17128 x 
12996 x 1(N01)
 6956 x X
 3081 x 2(N01)
 2606 x 1(N14)
 1849 x EN~a
 1603 x 3(N01)
 1357 x 2(N14)
 1308 x SZE~a
 1304 x 5(N01)
 1224 x GAL~a
 1119 x 4(N01)
 1069 x AN
 1045 x U4
 1001 x 1(N34)
  881 x SAL
  874 x GI
  854 x PAP~a
  801 x 1(N57)
There are 1529 full graphemes


# Edge features: left and right

We have not talked about edges much. If the nodes correspond to the rows in the big spreadsheet,
the edges point from one row to another.

One edge we have encountered: the special feature `oslots`.
Each non-slot node is linked by `oslots` to all of its slot nodes.

An edge is really a feature as well.
Whereas a node feature is a column of information,
one cell per node, 
an edge feature is also a column of information, one cell per pair of nodes.

In the tablets quads may be subdivided into subquads and signs, related by operators.
If there is an operator *op* between `qLeft` and `qRight`, there is an 
edge between `qLeft` and `qRight` with feature `op` having value *op*.

And if a quad is the result of an operator working on operands, which are sub-*quads* or *signs*,
there will be edges between the big quad and its operands with feature `sub`, having no value.

Likewise, there will be edges between *lines* and *cases* and their subcases, also
having feature `sub` with no value.
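
As a toy illustration of "one cell per pair of nodes", an edge feature can be modelled as a dictionary keyed by pairs. Everything in this sketch is invented: the class `ToyEdge`, the node numbers, and the operator value `'x'`. It only shows the shape of the data, with one edge feature that carries values (like `op`) and one that does not (like `sub`).

```python
class ToyEdge:
    """A toy edge feature: one cell per (from node, to node) pair."""

    def __init__(self, data):
        self.data = data

    def f(self, n):
        """Pairs (target, value) for the edges that start at node n."""
        return tuple(
            (dst, self.data[(src, dst)])
            for (src, dst) in sorted(self.data)
            if src == n
        )

    def t(self, n):
        """Pairs (source, value) for the edges that arrive at node n."""
        return tuple(
            (src, self.data[(src, dst)])
            for (src, dst) in sorted(self.data)
            if dst == n
        )


# A made-up quad 100 with operands 101 and 102, related by operator 'x'.
toyOp = ToyEdge({(101, 102): 'x'})
toySub = ToyEdge({(100, 101): None, (100, 102): None})

print(toyOp.f(101))
print(toySub.f(100))
```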

# Next steps

By now you have an impression of how to compute around in the cuneiform tablets.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

## Search
Text-Fabric contains a flexible search engine that works not only for the data of this corpus,
but also for data that you add to it.
There is a tutorial dedicated to [search](search.ipynb).
And if you already know MQL queries, you can build from that in
[searchFromMQL](searchFromMQL.ipynb).

## Explore additional data
For the Hebrew Bible, the ETCBC has a few repositories with data that work in conjunction with the BHSA data.
One of them is
[phono](https://github.com/ETCBC/phono),
for phonetic transcriptions.

There is also
[parallels](https://github.com/ETCBC/parallels)
for detecting parallel passages,
and
[valence](https://github.com/ETCBC/valence)
for studying patterns around verbs that determine their meanings.

## Add your own data
If you study the additional data, you can observe how that data is created and also
how it is turned into a text-fabric data module.
The last step is incredibly easy. You can write out every Python dictionary where the keys are numbers
and the values are strings or numbers as a Text-Fabric feature.
When you are creating data, you have already constructed those dictionaries, so writing
them out is just one method call.
See for example how the
[flowchart](https://github.com/ETCBC/valence/blob/master/programs/flowchart.ipynb#Add-sense-feature-to-valence-module)
notebook in valence writes out verb sense data.
![flow](images/valence.png)

You can then easily share your new features on GitHub, so that your colleagues everywhere 
can try it out for themselves.

## Export to Emdros MQL

[EMDROS](http://emdros.org), written by Ulrik Petersen,
is a text database system with the powerful *topographic* query language MQL.
The ideas are based on a model devised by Christ-Jan Doedens in
[Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).

Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.

[SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.

So it is kind of logical and convenient to be able to work with a Text-Fabric resource through MQL.

If you have obtained an MQL dataset somehow, you can turn it into a text-fabric data set by `importMQL()`,
which we will not show here.

And if you want to export a Text-Fabric data set to MQL, that is also possible.

After the `Fabric(modules=...)` call, you can call `exportMQL()` in order to save all features of the
indicated modules into a big MQL dump, which can be imported by an EMDROS database.

# Clean caches

Text-Fabric pre-computes data for you, so that it can be loaded faster.
If the original data is updated, Text-Fabric detects it, and will recompute that data.

But when the algorithms of Text-Fabric have changed without any changes in the data, you might
want to clear the cache of precomputed results.

There are two ways to do that:

* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
  This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
* Call `TF.clearCache()`, which does exactly the same.
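
The first, manual route can be sketched with the standard library. This is a hedged illustration: the function `removeTfxFiles` is hypothetical, and it assumes the layout described above, with the `.tfx` files sitting directly inside the `.tf` directory of the dataset. It defaults to a dry run that only lists what would be removed.

```python
from pathlib import Path


def removeTfxFiles(datasetDir, dryRun=True):
    """List (and, unless dryRun, delete) the .tfx files in datasetDir/.tf."""
    cacheDir = Path(datasetDir).expanduser() / '.tf'
    if not cacheDir.is_dir():
        return []
    names = []
    for path in sorted(cacheDir.glob('*.tfx')):
        if not dryRun:
            path.unlink()
        names.append(path.name)
    return names


# Dry run on the corpus location used in this tutorial: nothing is deleted.
print(removeTfxFiles('~/github/Dans-labs/Nino-cunei/tf/uruk/0.1'))
```

Pass `dryRun=False` only after checking that the listed files are indeed the ones you want to get rid of.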

It is not handy to execute the following cell all the time, which is why I have commented it out.
So if you really want to clear the cache, remove the comment sign below.

In [34]:
# TF.clearCache()