<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/peshitta_small.png"/>
<img align="right" src="images/tf-small.png"/>


# Tutorial

This notebook gets you started with using
[Text-Fabric](https://github.com/Dans-labs/text-fabric) for coding in Syriac texts.

Chances are that a bit of reading about the underlying
[data model](https://github.com/Dans-labs/text-fabric/wiki/Data-model)
helps you to follow the exercises below, and vice versa.

Most programs start with loading a few modules.
In the next cell, the first line loads standard modules that come with Python itself,
and the second line loads Text-Fabric.

Before you can run this, you need to install Text-Fabric.
The basic instruction for that is, on a terminal:

```
pip install text-fabric
```

if you have installed Python with the help of Anaconda, or

```
sudo -H pip3 install text-fabric
```
if you have installed Python from [python.org](https://www.python.org).

Make sure that you do all this with Python **3**, not 2.

In [1]:
import sys, os, collections
from tf.fabric import Fabric

# Call Text-Fabric

Everything starts by setting up Text-Fabric.
It needs to know where to look for data.

The Syriac texts are in the same repository as this tutorial.
I assume you have cloned [linksyr](https://github.com/etcbc/linksyr)
in your directory `~/github/etcbc`, so that your directory structure looks like this:

    your home directory\
    |                     - github\
    |                       |      - etcbc\
    |                       |        |         - linksyr
    
## Tip
If you start computing with this tutorial, first copy its parent directory to somewhere else,
outside your `linksyr` directory.
If you pull changes from the `linksyr` repository later, your work will not be overwritten.
Where you put your tutorial directory is up to you.
It will work from any directory.
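Before handing the location to Text-Fabric, it can help to check that the corpus directory really exists.
Here is a minimal sketch that uses only the standard library.
The helper name `corpus_dir` is made up for this illustration, and the path layout is the one assumed in this tutorial.

```python
import os

def corpus_dir(repo='~/github/etcbc/linksyr', source='syrnt'):
    """Return the expanded path of the TF data directory (hypothetical helper)."""
    return os.path.expanduser(f'{repo}/data/tf/{source}')

path = corpus_dir()
if os.path.isdir(path):
    print(f'corpus found at {path}')
else:
    # Nothing is loaded here; we only report what is (not) on disk.
    print(f'no corpus at {path}: did you clone linksyr into ~/github/etcbc?')
```

If the second message appears, adjust `repo` to wherever your clone lives before running the next cell.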

In [61]:
REPO = '~/github/etcbc/linksyr'
SOURCE = 'syrnt'
CORPUS = f'{REPO}/data/tf/{SOURCE}'
TF = Fabric(locations=[CORPUS], modules=[''], silent=False )

This is Text-Fabric 3.1.5
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

37 features found and 0 ignored


# Load Features
The data of the corpus is organized in features.
They are *columns* of data.
Think of the text as a gigantic spreadsheet, where row 1 corresponds to the
first word, row 2 to the second word, and so on, for all 100,000+ words.

The letters of each word are a column `form` in that spreadsheet.

The corpus contains ca. 30 columns, not only for the words, but also for 
textual objects, such as *books*, *chapters*, and *verses*.

Instead of putting that information in one big table, the data is organized in separate columns.
We call those columns **features**.

We just load the features we need for this tutorial.
Later on, where we use them, it will become clear what they mean.
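To make the spreadsheet picture concrete, here is a toy model in plain Python.
It is not how Text-Fabric stores its data; it only shows the idea that each feature is a separate column, here a dictionary from node number to value.
The lexemes are the first four of the corpus in transliteration; the category values are made up for the illustration.

```python
# Toy model: one dictionary per feature, keyed by node number (the "row").
features = {
    'lexeme_ascii': {1: 'CTBA', 2: ';L;DOTA', 3: ';WOE', 4: 'MW;KA'},
    # Made-up values, only here to have a second column.
    'grammatical_category': {1: 'noun', 2: 'noun', 3: 'noun', 4: 'noun'},
}

def row(node):
    """Collect the values of all columns for one node, like reading a spreadsheet row."""
    return {name: column.get(node) for (name, column) in features.items()}

for node in (1, 2):
    print(node, row(node))
```

Because the columns are independent, you can load only the ones you need, which is what the next cell does with the real data.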

In [62]:
api = TF.load('''
    grammatical_category
    lexeme lexeme_ascii
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.05s B lexeme               from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.04s B lexeme_ascii         from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.03s B grammatical_category from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s Feature overview: 35 for nodes; 1 for edges; 1 configs; 7 computed
  0.33s All features loaded/computed - for details use loadLog()


The result of all this is that we have a bunch of special variables at our disposal
that give us access to the text and data of the Syriac New Testament.

At this point it is helpful to throw a quick glance at the Text-Fabric
[API documentation](https://github.com/Dans-labs/text-fabric/wiki/Api),
especially the right side bar.

The most essential thing for now is that we can use `F` to access the data in the features
we've loaded.
But there is more, such as `N`, which helps us to walk over the text, as we see in a minute.

# Counting

In order to get acquainted with the data, we start with the simple task of counting.

## Count all nodes
We use the 
[`N()` generator](https://github.com/Dans-labs/text-fabric/wiki/Api#walking-through-nodes)
to walk through the nodes.

We compared the corpus to a gigantic spreadsheet, where the rows correspond to the words.
In Text-Fabric, we call the rows `slots`, because they are the textual positions that can be filled with words.

We also mentioned that there are more textual objects. 
They are the verses, chapters and books.
They also correspond to rows in the big spreadsheet.

In Text-Fabric we call all these rows *nodes*, and the `N()` generator
carries us through those nodes in the textual order.

Just one extra thing: the `info` statements generate timed messages.
If you use them instead of `print` you'll get a sense of the amount of time that 
the various processing steps typically need.

In [63]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.03s 117884 nodes


## What are those nodes?
Every node has a type, like word, verse, or chapter.
We know that we have more than 100,000 words and a few other nodes.
But what exactly are they?

Text-Fabric has two special features, `otype` and `oslots`, that must occur in every Text-Fabric data set.
`otype` tells you for each node its type, and you can ask for the number of `slot`s in the text.

Here we go!

In [64]:
F.otype.slotType

'word'

In [65]:
F.otype.maxSlot

109640

In [66]:
F.otype.maxNode

117884

In [67]:
F.otype.all

('book', 'chapter', 'verse', 'word')

In [68]:
C.levels.data

(('book', 4060.740740740741, 109641, 109667),
 ('chapter', 421.6923076923077, 109668, 109927),
 ('verse', 13.779062460726404, 109928, 117884),
 ('word', 1, 1, 109640))

This is interesting: above you see all the textual objects, with the average size of their objects,
the node where they start, and the node where they end.
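The numbers in that output can be cross-checked with a bit of arithmetic.
The sketch below copies the values printed above and derives the number of objects per type from the node ranges.
It then recomputes the average size under the assumption that every word lies in exactly one object of each type, which seems to hold for this corpus.

```python
# Values copied from the C.levels.data output above:
# (type, average size in slots, first node, last node)
levels = (
    ('book', 4060.740740740741, 109641, 109667),
    ('chapter', 421.6923076923077, 109668, 109927),
    ('verse', 13.779062460726404, 109928, 117884),
    ('word', 1, 1, 109640),
)
max_slot = 109640

counts = {}
for (otype, avg, first, last) in levels:
    count = last - first + 1          # nodes of one type are numbered consecutively
    counts[otype] = count
    # Assumption: each word is in exactly one object of this type.
    print(f'{otype:<8} {count:>7} objects, '
          f'{max_slot / count:.2f} words on average (reported: {avg:.2f})')

print(sum(counts.values()), 'nodes in total')
```

The total agrees with `F.otype.maxNode`, and the per-type counts agree with the counts in the next section.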

## Count individual object types
This is an intuitive way to count the number of nodes in each type.
Note in passing, how we use the `indent` in conjunction with `info` to produce neat timed 
and indented progress messages.

In [69]:
indent(reset=True)
info('counting objects ...')

for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)

    for n in F.otype.s(otype): i+=1

    info('{:>7} {}s'.format(i, otype))

indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s      27 books
   |     0.00s     260 chapters
   |     0.00s    7957 verses
   |     0.02s  109640 words
  0.03s Done


# Feature statistics

`F`
gives access to all features.
Every feature has a method
`freqList()`
to generate a frequency list of its values, higher frequencies first.
Here are the parts of speech:

In [70]:
F.grammatical_category.freqList()

(('noun', 33717),
 ('verb', 30441),
 ('particle', 28486),
 ('pronoun', 10247),
 ('adjective', 4503),
 ('numeral', 1620),
 ('adverb', 434),
 ('idiom', 192))
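A quick sanity check on that list, using only the numbers printed above: the counts add up to the number of words, which suggests that every word has a value for this feature.
The sketch also turns the counts into percentages.

```python
# Frequency list copied from the output above.
freq = (
    ('noun', 33717),
    ('verb', 30441),
    ('particle', 28486),
    ('pronoun', 10247),
    ('adjective', 4503),
    ('numeral', 1620),
    ('adverb', 434),
    ('idiom', 192),
)

total = sum(count for (_, count) in freq)
print(f'{total} words in total')

for (value, count) in freq:
    print(f'{value:<10} {count:>6} {100 * count / total:>5.1f}%')
```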

# Lexeme matters

## Top 10 frequent verbs

If we count the frequency of words, we usually mean the frequency of their
corresponding lexemes.

There are several methods for working with lexemes.

### Method 1: counting words

In [71]:
verbs = collections.Counter()
indent(reset=True)
info('Collecting data')

for w in F.otype.s('word'):
    if F.grammatical_category.v(w) != 'verb': continue
    verbs[F.lexeme_ascii.v(w)] +=1

info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt) for (verb, cnt) in sorted(
        verbs.items() , key=lambda x: (-x[1], x[0]))[0:10],
    )
)       

  0.00s Collecting data
  0.10s Done
HOA: 4006
AMR: 2553
ATA: 965
KZA: 734
EBD: 706
;DE: 704
MW;KA: 585
XM: 550
;HB: 534
WME: 494



## Lexeme distribution

Let's do a bit more fancy lexeme stuff.

### Hapaxes

A hapax can be found by inspecting lexemes and seeing to how many word nodes they are linked.
If that number is one, we have a hapax.

We print 10 hapaxes.

In [72]:
indent(reset=True)

hapax = []
lexIndex = collections.defaultdict(list)

for n in F.otype.s('word'):
    lexIndex[F.lexeme_ascii.v(n)].append(n)
    
hapax = dict((lex, occs) for (lex, occs) in lexIndex.items() if len(occs) == 1)
    
info('{} hapaxes found'.format(len(hapax)))

for h in sorted(hapax)[0:10]:
    print(f'\t{h}')

  0.10s 835 hapaxes found
	//A
	//LA
	/;DN;A
	/AA
	/ATA
	/D;A
	/IK
	/IKTA
	/KOA
	/LL
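The same technique on a sample small enough to check by hand.
The lexemes below are those of the second verse of the corpus in transliteration; the node numbers are toy numbers, simply counting from 1.
So "hapax" here means: occurring once in this sample, not once in the whole corpus.

```python
import collections

# Lexemes of one verse, in text order; node numbers 1, 2, 3, ... are toy values.
sample = 'ABRHM ;LD A;SKX A;SKX ;LD ;EXOB ;EXOB ;LD ;HODA AKA'.split()

lex_index = collections.defaultdict(list)
for (node, lex) in enumerate(sample, start=1):
    lex_index[lex].append(node)

# A hapax is a lexeme with exactly one occurrence.
hapax = {lex: occs for (lex, occs) in lex_index.items() if len(occs) == 1}

for lex in sorted(hapax):
    print(f'{lex:<8} occurs only at node {hapax[lex][0]}')
```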


If we want more info on the hapaxes, we get that by means of their *nodes*.
The `lexIndex` dictionary stores the occurrences of a lexeme as a list of nodes.

Let's get the part of speech and the Syriac form of those 10 hapaxes.

In [73]:
for h in sorted(hapax)[0:10]:
    node = hapax[h][0]
    print(f'\t{F.grammatical_category.v(node):<12} {F.lexeme.v(node)}')

	noun         ܨܨܐ
	noun         ܨܨܠܐ
	adjective    ܨܝܕܢܝܐ
	adjective    ܨܐܐ
	noun         ܨܐܬܐ
	verb         ܨܕܝܐ
	verb         ܨܦܚ
	noun         ܨܦܚܬܐ
	noun         ܨܚܘܐ
	verb         ܨܠܠ


### Small occurrence base

The occurrence base of a lexeme consists of the verses, chapters and books in which it occurs.
Let's look for lexemes that occur in a single chapter.

Note that every hapax trivially occurs in a single chapter, so the hapaxes we found above are included in the results below.

In [74]:
indent(reset=True)
info('Finding single chapter lexemes')

lexChapterIndex = {}

for (lex, occs) in lexIndex.items():
    lexChapterIndex[lex] = set(L.u(n, otype='chapter')[0] for n in occs)
    
singleCh = [(lex, occs) for (lex, occs) in lexChapterIndex.items() if len(lexChapterIndex[lex]) == 1]

info('{} single chapter lexemes found'.format(len(singleCh)))

for (lex, occs) in singleCh[0:10]:
    print('{:<20} {:<6} ({}x)'.format(
        '{} {}:{}'.format(*T.sectionFromNode(sorted(occs)[0])),
        lex,
        len(occs),
    ))

  0.00s Finding single chapter lexemes
  0.51s 947 single chapter lexemes found
Matthew 1:1          ;L;DOTA (1x)
Matthew 1:1          ZRK    (1x)
Matthew 1:1          TMR    (1x)
Matthew 1:1          REOT   (1x)
Matthew 1:1          RKBEM  (1x)
Matthew 1:1          ;HOWIY (1x)
Matthew 1:1          EOZ;A  (1x)
Matthew 1:1          ;OTM   (1x)
Matthew 1:1          AKZ    (1x)
Matthew 1:1          KZX;A  (1x)
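As far as I can read the cell above, the `occs` in its final loop is the set of chapters of a lexeme, so `(1x)` counts chapters rather than occurrences.
Here is a sketch of a variant that leaves out the hapaxes and reports real occurrence counts.
It runs on made-up toy data: the lexemes, node numbers and the `chapter_of` mapping are all hypothetical; with the real corpus you would look up the chapter with `L.u(n, otype='chapter')[0]` instead.

```python
# Toy data: lexeme -> word nodes, and word node -> chapter (all made up).
lex_index = {
    'AAA': [1],          # a hapax
    'BBB': [2, 3],       # twice, both in chapter 1
    'CCC': [4, 7],       # twice, in two different chapters
    'DDD': [5, 6, 8],    # three times, all in chapter 2
}
chapter_of = {1: 'ch1', 2: 'ch1', 3: 'ch1', 4: 'ch1',
              5: 'ch2', 6: 'ch2', 7: 'ch2', 8: 'ch2'}

single_chapter = {}
for (lex, occs) in lex_index.items():
    if len(occs) == 1:
        continue                       # skip hapaxes
    chapters = {chapter_of[n] for n in occs}
    if len(chapters) == 1:
        single_chapter[lex] = occs     # keep the word nodes, not the chapters

for (lex, occs) in sorted(single_chapter.items()):
    print(f'{lex:<6} {chapter_of[occs[0]]} ({len(occs)}x)')
```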


### Confined to books

As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and
the number of lexemes that occur exclusively in that book.

In [75]:
indent(reset=True)
info('Making book-lexeme index')

allBook = collections.defaultdict(set)
allLex = set()

for b in F.otype.s('book'):
    for w in L.d(b, 'word'):
        l = F.lexeme.v(w)
        allBook[b].add(l)
        allLex.add(l)

info('Found {} lexemes'.format(len(allLex)))

  0.00s Making book-lexeme index
  0.14s Found 3038 lexemes


In [76]:
indent(reset=True)
info('Finding single book lexemes')

lexBookIndex = {}

for (lex, occs) in lexIndex.items():
    lexBookIndex[lex] = set(L.u(n, otype='book')[0] for n in occs)

singleBookLex = collections.defaultdict(set)
for (lex, books) in lexBookIndex.items():
    if len(books) == 1:
        singleBookLex[list(books)[0]].add(lex)

singleBook = {book: len(lexs) for (book, lexs) in singleBookLex.items()}

info('found {} single book lexemes'.format(sum(singleBook.values())))

  0.00s Finding single book lexemes
  0.60s found 1079 single book lexemes


In [77]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'book', '#all', '#own', '%own',
    '-'*35,
))
booklist = []

for b in F.otype.s('book'):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))

for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

book                 #all #own %own
-----------------------------------
Acts                 1315  252 19.2%
Revelation            809  104 12.9%
Luke                 1368  153 11.2%
2_Peter               348   34  9.8%
Romans                782   60  7.7%
2_Timothy             379   26  6.9%
Hebrews               747   51  6.8%
James                 434   28  6.5%
Matthew              1244   80  6.4%
John                  810   47  5.8%
Jude                  209   12  5.7%
2_Corinthians         628   36  5.7%
1_Corinthians         740   41  5.5%
1_Timothy             444   20  4.5%
Philippians           371   15  4.0%
Colossians            375   15  4.0%
Ephesians             460   18  3.9%
1_Peter               452   17  3.8%
Mark                  966   32  3.3%
2_John                 91    3  3.3%
Galatians             418   13  3.1%
Titus                 249    6  2.4%
3_John                 97    2  2.1%
1_Thessalonians       322    6  1.9%
Philemon              127    2  1.6%
1_J

# Layer API
We travel upwards and downwards, forwards and backwards through the nodes.
The Layer-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The results are ordered according to the order of things in the text.

The functions always return a tuple, even if there is just one node in the result.
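The definitions above can be tried out on a toy corpus in plain Python.
This is a sketch of the idea, not Text-Fabric's implementation: the corpus of 6 words, the node numbers and the function names `up`, `down`, `nxt` and `prev` are all made up.
It also simplifies two things: containment means a strictly larger slot set, and results come in node-number order rather than in Text-Fabric's textual order.

```python
# Toy corpus: words are slots 1..6, verses 7 and 8, chapter 9.
MAX_SLOT = 6
OTYPE = {7: 'verse', 8: 'verse', 9: 'chapter'}
OSLOTS = {7: {1, 2, 3}, 8: {4, 5, 6}, 9: {1, 2, 3, 4, 5, 6}}
ALL_NODES = range(1, 10)

def node_type(n):
    return 'word' if n <= MAX_SLOT else OTYPE[n]

def slots(n):
    """A word occupies its own slot; other nodes are linked to slots by OSLOTS."""
    return {n} if n <= MAX_SLOT else OSLOTS[n]

def select(test, otype):
    return tuple(m for m in ALL_NODES
                 if test(m) and (otype is None or node_type(m) == otype))

def up(n, otype=None):      # nodes whose slots contain those of n
    return select(lambda m: slots(m) > slots(n), otype)

def down(n, otype=None):    # nodes whose slots are contained in those of n
    return select(lambda m: slots(m) < slots(n), otype)

def nxt(n, otype=None):     # nodes starting right after the last slot of n
    return select(lambda m: min(slots(m)) == max(slots(n)) + 1, otype)

def prev(n, otype=None):    # nodes ending right before the first slot of n
    return select(lambda m: max(slots(m)) == min(slots(n)) - 1, otype)

print('up(1)  ', up(1))
print('down(9)', down(9, otype='verse'))
print('nxt(7) ', nxt(7))
print('prev(8)', prev(8))
```

Note that every function returns a tuple, possibly empty, so you need `[0]` to get at a single result, and a length check if there might be none.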

## Going up
We go from the first word to the book that contains it.
Note the `[0]` at the end. You expect one book, yet `L` returns a tuple. 
To get the only element of that tuple, you need to do that `[0]`.

If you are like me, you keep forgetting it, and that will lead to weird error messages later on.

In [78]:
firstBook = L.u(1, otype='book')[0]
print(firstBook)

109641


And let's see all the containing objects of word 3:

In [79]:
w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType: continue
    up = L.u(w, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('word {} is contained in {} {}'.format(w, otype, upNode))

word 3 is contained in book 109641
word 3 is contained in chapter 109668
word 3 is contained in verse 109928


## Going next
Let's go to the next nodes of the first book.

In [80]:
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondBook = L.n(firstBook, otype='book')[0]

  13980: word          first slot=13980 , last slot=13980 
 110999: verse         first slot=13980 , last slot=13985 
 109696: chapter       first slot=13980 , last slot=14490 
 109642: book          first slot=13980 , last slot=22772 


## Going previous

And let's see what is right before the second book.

In [81]:
for n in L.p(secondBook):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

 109641: book          first slot=1     , last slot=13979 
 109695: chapter       first slot=13714 , last slot=13979 
 110998: verse         first slot=13964 , last slot=13979 
  13979: word          first slot=13979 , last slot=13979 


## Going down

We go to the chapters of the second book, and just count them.

In [82]:
chapters = L.d(secondBook, otype='chapter')
print(len(chapters))

16


## The first verse
We pick the first verse and the first word, and explore what is above and below them.

In [83]:
for n in [1, L.u(1, otype='verse')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1
   |   UP
   |      |   109928          verse
   |      |   109668          chapter
   |      |   109641          book
   |   DOWN
   |      |   
Node 109928
   |   UP
   |      |   109668          chapter
   |      |   109641          book
   |   DOWN
   |      |   1               word
   |      |   2               word
   |      |   3               word
   |      |   4               word
   |      |   5               word
   |      |   6               word
   |      |   7               word
   |      |   8               word
Done


# Text API

So far, we have mainly seen nodes and their numbers, and the names of node types.
You would almost forget that we are dealing with text.
So let's try to see some text.

In the same way as `F` gives access to feature data,
`T` gives access to the text.
That is also feature data, but you can tell Text-Fabric which features are specifically
carrying the text, and in return Text-Fabric offers you
a Text API: `T`.

## Formats
Syriac text can be represented in a number of ways:

* in transliteration, or in Syriac characters,
* showing the actual text or only the lexemes,

If you wonder where the information about text formats is stored: 
not in the program text-fabric, but in the data set.
It has a feature `otext`, which specifies the formats and which features
must be used to produce them. `otext` is the third special feature in a TF data set,
next to `otype` and `oslots`. 
It is an optional feature. 
If it is absent, there will be no `T` API.
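To get a feeling for what such a format specification does, here is a toy version in plain Python.
The templates and the feature name `word_ascii` are hypothetical; I have not copied them from the `otext` feature of this data set.
The word and lexeme values are the first four of the corpus in transliteration.

```python
# Hypothetical templates: a format names the features that make up the text of a word.
formats = {
    'lex-trans-full': '{lexeme_ascii} ',
    'text-trans-full': '{word_ascii} ',   # 'word_ascii' is an assumed feature name
}

# Feature values per word node (toy excerpt).
words = {
    1: {'lexeme_ascii': 'CTBA', 'word_ascii': 'CTBA'},
    2: {'lexeme_ascii': ';L;DOTA', 'word_ascii': 'D;L;DOTH'},
    3: {'lexeme_ascii': ';WOE', 'word_ascii': 'D;WOE'},
    4: {'lexeme_ascii': 'MW;KA', 'word_ascii': 'MW;KA'},
}

def text(nodes, fmt):
    """Fill in the template of the chosen format for each node and glue the results."""
    template = formats[fmt]
    return ''.join(template.format(**words[n]) for n in nodes)

for fmt in sorted(formats):
    print(f'{fmt}:\n\t{text(range(1, 5), fmt)}')
```

The point of the design is that the program stays the same for every corpus; only the templates in the data change.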

Here is a list of all available formats in this data set.

In [84]:
sorted(T.formats)

['lex-orig-full', 'lex-trans-full', 'text-orig-full', 'text-trans-full']

## Using the formats
Now let's use those formats to print out the first eleven words of the Syriac New Testament.

In [85]:
for fmt in sorted(T.formats):
    print('{}:\n\t{}'.format(fmt, T.text(range(1,12), fmt=fmt)))

lex-orig-full:
	ܟܬܒܐ ܝܠܝܕܘܬܐ ܝܫܘܥ ܡܫܝܚܐ ܒܪܐ ܕܘܝܕ ܒܪܐ ܐܒܪܗܡ ܐܒܪܗܡ ܝܠܕ ܐܝܣܚܩ 
lex-trans-full:
	CTBA ;L;DOTA ;WOE MW;KA BRA DO;D BRA ABRHM ABRHM ;LD A;SKX 
text-orig-full:
	ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ 
text-trans-full:
	CTBA D;L;DOTH D;WOE MW;KA BRH DDO;D BRH DABRHM ABRHM AOLD LA;SKX 


If we do not specify a format, the **default** format is used (`text-orig-full`).

In [86]:
print(T.text(range(1,12)))

ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ 


## Whole text in all formats in less than a second
Part of the pleasure of working with computers is that they can crunch massive amounts of data.
The text of the Syriac New Testament is a piece of cake.

It takes less than a second to have that cake and eat it. 
In all four formats.

In [87]:
indent(reset=True)
info('writing plain text of whole Syriac New Testament in all formats')

text = collections.defaultdict(list)

for v in F.otype.s('verse'):
    words = L.d(v, 'word')
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(words, fmt=fmt))

info('done {} formats'.format(len(text)))

for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))

  0.00s writing plain text of whole Syriac New Testament in all formats
  0.78s done 4 formats
lex-orig-full
ܟܬܒܐ ܝܠܝܕܘܬܐ ܝܫܘܥ ܡܫܝܚܐ ܒܪܐ ܕܘܝܕ ܒܪܐ ܐܒܪܗܡ 
ܐܒܪܗܡ ܝܠܕ ܐܝܣܚܩ ܐܝܣܚܩ ܝܠܕ ܝܥܩܘܒ ܝܥܩܘܒ ܝܠܕ ܝܗܘܕܐ ܐܚܐ 
ܝܗܘܕܐ ܝܠܕ ܦܪܨ ܙܪܚ ܡܢ ܬܡܪ ܦܪܨ ܝܠܕ ܚܨܪܘܢ ܚܨܪܘܢ ܝܠܕ ܐܪܡ 
ܐܪܡ ܝܠܕ ܥܡܝܢܕܒ ܥܡܝܢܕܒ ܝܠܕ ܢܚܫܘܢ ܢܚܫܘܢ ܝܠܕ ܣܠܡܘܢ 
ܣܠܡܘܢ ܝܠܕ ܒܥܙ ܡܢ ܪܚܒ ܒܥܙ ܝܠܕ ܥܘܒܝܕ ܡܢ ܪܥܘܬ ܥܘܒܝܕ ܝܠܕ ܐܝܫܝ 

lex-trans-full
CTBA ;L;DOTA ;WOE MW;KA BRA DO;D BRA ABRHM 
ABRHM ;LD A;SKX A;SKX ;LD ;EXOB ;EXOB ;LD ;HODA AKA 
;HODA ;LD IR/ ZRK MN TMR IR/ ;LD K/RON K/RON ;LD ARM 
ARM ;LD EM;NDB EM;NDB ;LD NKWON NKWON ;LD SLMON 
SLMON ;LD BEZ MN RKB BEZ ;LD EOB;D MN REOT EOB;D ;LD A;W; 

text-orig-full
ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ 
ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ 
ܝܗܘܕܐ ܐܘܠܕ ܠܦܪܨ ܘܠܙܪܚ ܡܢ ܬܡܪ ܦܪܨ ܐܘܠܕ ܠܚܨܪܘܢ ܚܨܪܘܢ ܐܘܠܕ ܠܐܪܡ 
ܐܪܡ ܐܘܠܕ ܠܥܡܝܢܕܒ ܥܡܝܢܕܒ ܐܘܠܕ ܠܢܚܫܘܢ ܢܚܫܘܢ ܐܘܠܕ ܠܣܠܡܘܢ 
ܣܠܡܘܢ ܐܘܠܕ ܠܒܥܙ ܡܢ ܪܚܒ ܒܥܙ ܐܘܠܕ ܠܥܘܒܝܕ ܡܢ ܪܥܘܬ ܥܘܒܝܕ ܐܘܠܕ ܠܐܝܫܝ 

text-trans-full
CTBA D;L;DOTH D;WOE MW;KA BRH D

### The full plain text
We write a few formats to file, in your `Downloads` folder.

In [88]:
orig = 'text-orig-full'
trans = 'text-trans-full'
for fmt in (orig, trans):
    with open(os.path.expanduser(f'~/Downloads/{fmt}.txt'), 'w', encoding='utf8') as f:
        f.write('\n'.join(text[fmt]))

In [89]:
!head -n 20 ~/Downloads/{orig}.txt

ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ 
ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ 
ܝܗܘܕܐ ܐܘܠܕ ܠܦܪܨ ܘܠܙܪܚ ܡܢ ܬܡܪ ܦܪܨ ܐܘܠܕ ܠܚܨܪܘܢ ܚܨܪܘܢ ܐܘܠܕ ܠܐܪܡ 
ܐܪܡ ܐܘܠܕ ܠܥܡܝܢܕܒ ܥܡܝܢܕܒ ܐܘܠܕ ܠܢܚܫܘܢ ܢܚܫܘܢ ܐܘܠܕ ܠܣܠܡܘܢ 
ܣܠܡܘܢ ܐܘܠܕ ܠܒܥܙ ܡܢ ܪܚܒ ܒܥܙ ܐܘܠܕ ܠܥܘܒܝܕ ܡܢ ܪܥܘܬ ܥܘܒܝܕ ܐܘܠܕ ܠܐܝܫܝ 
ܐܝܫܝ ܐܘܠܕ ܠܕܘܝܕ ܡܠܟܐ ܕܘܝܕ ܐܘܠܕ ܠܫܠܝܡܘܢ ܡܢ ܐܢܬܬܗ ܕܐܘܪܝܐ 
ܫܠܝܡܘܢ ܐܘܠܕ ܠܪܚܒܥܡ ܪܚܒܥܡ ܐܘܠܕ ܠܐܒܝܐ ܐܒܝܐ ܐܘܠܕ ܠܐܣܐ 
ܐܣܐ ܐܘܠܕ ܠܝܗܘܫܦܛ ܝܗܘܫܦܛ ܐܘܠܕ ܠܝܘܪܡ ܝܘܪܡ ܐܘܠܕ ܠܥܘܙܝܐ 
ܥܘܙܝܐ ܐܘܠܕ ܠܝܘܬܡ ܝܘܬܡ ܐܘܠܕ ܠܐܚܙ ܐܚܙ ܐܘܠܕ ܠܚܙܩܝܐ 
ܚܙܩܝܐ ܐܘܠܕ ܠܡܢܫܐ ܡܢܫܐ ܐܘܠܕ ܠܐܡܘܢ ܐܡܘܢ ܐܘܠܕ ܠܝܘܫܝܐ 
ܝܘܫܝܐ ܐܘܠܕ ܠܝܘܟܢܝܐ ܘܠܐܚܘܗܝ ܒܓܠܘܬܐ ܕܒܒܠ 
ܡܢ ܒܬܪ ܓܠܘܬܐ ܕܝܢ ܕܒܒܠ ܝܘܟܢܝܐ ܐܘܠܕ ܠܫܠܬܐܝܠ ܫܠܬܐܝܠ ܐܘܠܕ ܠܙܘܪܒܒܠ 
ܙܘܪܒܒܠ ܐܘܠܕ ܠܐܒܝܘܕ ܐܒܝܘܕ ܐܘܠܕ ܠܐܠܝܩܝܡ ܐܠܝܩܝܡ ܐܘܠܕ ܠܥܙܘܪ 
ܥܙܘܪ ܐܘܠܕ ܠܙܕܘܩ ܙܕܘܩ ܐܘܠܕ ܠܐܟܝܢ ܐܟܝܢ ܐܘܠܕ ܠܐܠܝܘܕ 
ܐܠܝܘܕ ܐܘܠܕ ܠܐܠܝܥܙܪ ܐܠܝܥܙܪ ܐܘܠܕ ܠܡܬܢ ܡܬܢ ܐܘܠܕ ܠܝܥܩܘܒ 
ܝܥܩܘܒ ܐܘܠܕ ܠܝܘܣܦ ܓܒܪܗ ܕܡܪܝܡ ܕܡܢܗ ܐܬܝܠܕ ܝܫܘܥ ܕܡܬܩܪܐ ܡܫܝܚܐ 
ܟܠܗܝܢ ܗܟܝܠ ܫܪܒܬܐ ܡܢ ܐܒܪܗܡ ܥܕܡܐ ܠܕܘܝܕ ܫܪܒܬܐ ܐܪܒܥܣܪܐ ܘܡܢ ܕܘܝܕ ܥܕܡܐ ܠܓܠܘܬܐ ܕܒܒܠ ܫܪܒܬܐ ܐܪܒܥܣܪܐ ܘܡܢ ܓܠܘܬܐ ܕܒܒܠ ܥܕܡ

## Book names

For Bible book names, we can use several languages.
Well, in this case we have just English.

### Languages
Here are the languages that we can use for book names.
These languages come from the features `book@ll`, where `ll` is a two letter
ISO language code. Have a look in your data directory, you can't miss them.

In [90]:
T.languages

{'': {'language': 'default', 'languageEnglish': 'default'},
 'en': {'language': 'English', 'languageEnglish': 'English'}}

## Sections

A section is a book, a chapter or a verse.
Knowledge of sections is not baked into Text-Fabric. 
The config feature `otext.tf` may specify three section levels, and tell
what the corresponding node types and features are.

From that knowledge it can construct mappings from nodes to sections, e.g. from verse
nodes to tuples of the form:

    (bookName, chapterNumber, verseNumber)
   
Here are examples of getting the section that corresponds to a node and vice versa.

**NB:** `sectionFromNode` always delivers a verse specification, either from the
first slot belonging to that node, or, if `lastSlot`, from the last slot
belonging to that node.

In [94]:
for x in (
    ('section of first word',     T.sectionFromNode(1)                            ),
    ('node of Matthew 1:1',       T.nodeFromSection(('Matthew', 1, 1))            ),
    ('node of book Matthew',      T.nodeFromSection(('Matthew',))                 ),
    ('node of chapter Matthew 1', T.nodeFromSection(('Matthew', 1))               ),
    ('section of book node',      T.sectionFromNode(109641)                      ),
    ('idem, now last word',       T.sectionFromNode(109641, lastSlot=True)       ),
    ('section of chapter node',   T.sectionFromNode(109668)                      ),
    ('idem, now last word',       T.sectionFromNode(109668, lastSlot=True)       ),
): print('{:<30} {}'.format(*x))

section of first word          ('Matthew', 1, 1)
node of Matthew 1:1            109928
node of book Matthew           109641
node of chapter Matthew 1      109668
section of book node           ('Matthew', 1, 1)
idem, now last word            ('Matthew', 28, 20)
section of chapter node        ('Matthew', 1, 1)
idem, now last word            ('Matthew', 1, 25)
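Section tuples are handy for programs, but for people a reference like `Matthew 28:20` reads better.
Here is a small sketch of a formatter; `format_section` is a made-up helper, not part of the `T` API.
It accepts tuples of one, two or three members, and replaces the underscore that some book names in this data set carry, such as `2_Peter`.

```python
def format_section(section):
    """Turn (book,), (book, chapter) or (book, chapter, verse) into a readable reference."""
    (book, *rest) = section
    book = book.replace('_', ' ')
    if not rest:
        return book
    return f"{book} {':'.join(str(part) for part in rest)}"

for section in (('Matthew',), ('Matthew', 1), ('Matthew', 28, 20), ('2_Peter', 1, 1)):
    print(format_section(section))
```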


# Next steps

By now you have an impression of how to compute around in the text.
While this is still the beginning, I hope you already sense the power of unlimited programmatic access
to all the bits and bytes in the data set.

Here are a few directions for unleashing that power.

## Search
Text-Fabric contains a flexible search engine, that does not only work for this data,
but also for data that you add to it.
There is a tutorial dedicated to [search](search.ipynb).
And if you already know MQL queries, you can build from that in
[searchFromMQL](searchFromMQL.ipynb).


## Add your own data
If you study the additional data, you can observe how that data is created and also
how it is turned into a text-fabric data module.
The last step is incredibly easy. You can write out every Python dictionary where the keys are numbers
and the values strings or numbers as a Text-Fabric feature.
When you are creating data, you have already constructed those dictionaries, so writing
them out is just one method call.
See for example how the
[flowchart](https://github.com/ETCBC/valence/blob/master/programs/flowchart.ipynb#Add-sense-feature-to-valence-module)
notebook in valence writes out verb sense data.
![flow](images/valence.png)

You can then easily share your new features on GitHub, so that your colleagues everywhere 
can try it out for themselves.
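Here is a sketch of what such a dictionary looks like, on toy data.
The feature `lex_length` and the checking helper are made up for this illustration.
The actual saving is only hinted at in a comment: I believe it goes through `TF.save()`, but treat the arguments shown there as an assumption and check the API documentation before relying on them.

```python
# Toy input: lexemes of the first four words, keyed by node number.
lexeme_ascii = {1: 'CTBA', 2: ';L;DOTA', 3: ';WOE', 4: 'MW;KA'}

# A new feature is just a dictionary: node number -> string or number.
lex_length = {node: len(lex) for (node, lex) in lexeme_ascii.items()}

def looks_like_node_feature(data):
    """Check the shape described above (hypothetical helper)."""
    return all(
        isinstance(node, int) and isinstance(value, (str, int))
        for (node, value) in data.items()
    )

print(lex_length)
print(looks_like_node_feature(lex_length))

# Assumed call, not executed here:
# TF.save(
#     nodeFeatures={'lex_length': lex_length},
#     metaData={'lex_length': {'valueType': 'int'}},
# )
```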

## Export to Emdros MQL

[EMDROS](http://emdros.org), written by Ulrik Petersen,
is a text database system with the powerful *topographic* query language MQL.
The ideas are based on a model devised by Christ-Jan Doedens in
[Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).

Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.

[SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.

So it is kind of logical and convenient to be able to work with a Text-Fabric resource through MQL.

If you have obtained an MQL dataset somehow, you can turn it into a text-fabric data set by `importMQL()`,
which we will not show here.

And if you want to export a Text-Fabric data set to MQL, that is also possible.

After the `Fabric(modules=...)` call, you can call `exportMQL()` in order to save all features of the
indicated modules into a big MQL dump, which can be imported by an EMDROS database.

In [95]:
TF.exportMQL('mysygnt','~/Downloads')

  0.00s Checking features of dataset mysygnt
   |     0.00s M aspect               from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt


   |     0.00s feature "book@en" => "book_en"


   |     0.00s M demonstrative_category from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M feminine_he_dot      from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M gender               from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M noun_type            from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M number               from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M numeral_type         from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M participle_type      from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M person               from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M prefix               from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M prefix_ascii         from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |     0.00s M pronoun_type         from /Users/dirk/github/etcbc/linksyr/data/tf/syrnt
   |    

Now you have a file `~/Downloads/mysygnt.mql` of 72 MB.
You can import it into an Emdros database by saying:

    cd ~/Downloads
    rm mysygnt
    mql -b 3 < mysygnt.mql
    
The result is an SQLite3 database `mysygnt` in the same directory (17 MB).
You can run a query against it by creating a text file `test.mql` with these contents:

    select all objects where
    [verse
        [word FOCUS lexeme_ascii = 'WME']
    ]
    
And then say

    mql -b 3 -d mysygnt test.mql
    
You will see raw query results: all occurrences of words whose `lexeme_ascii` is `WME`.
     
It is not very pretty, and probably you should use a more visual Emdros tool to run those queries.
You see a lot of node numbers, but the good thing is, you can look those node numbers up in Text-Fabric.
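If you want to run the same query for other lexemes, you can generate the query text from Python.
This is a minimal sketch that follows the query shown above literally; `mql_for_lexeme` is a made-up helper.
It does no escaping, so it assumes that the lexeme contains no single quotes.

```python
def mql_for_lexeme(lex, feature='lexeme_ascii'):
    """Build the text of the verse/word query above for one lexeme (no escaping)."""
    return (
        'select all objects where\n'
        '[verse\n'
        f"    [word FOCUS {feature} = '{lex}']\n"
        ']\n'
    )

query = mql_for_lexeme('WME')
print(query)
```

Write the result to `test.mql` and run the `mql` command as shown above.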

# Clean caches

Text-Fabric pre-computes data for you, so that it can be loaded faster.
If the original data is updated, Text-Fabric detects it, and will recompute that data.

But there are cases where the algorithms of Text-Fabric have changed without any changes in the data, and then you might
want to clear the cache of precomputed results.

There are two ways to do that:

* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
  This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
* Call `TF.clearCache()`, which does exactly the same.
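The first way can also be done from Python, which avoids hunting for a hidden directory.
Here is a sketch with the standard library only; `remove_tfx` is a made-up helper, and it assumes the layout described above: a `.tf` directory inside the dataset directory with `.tfx` files in or below it.
The demonstration runs on a throw-away directory, so no real cache is touched.

```python
import tempfile
from pathlib import Path

def remove_tfx(dataset_dir):
    """Delete all .tfx files in or below the .tf directory; return the removed names."""
    cache = Path(dataset_dir).expanduser() / '.tf'
    removed = []
    if cache.is_dir():
        for path in sorted(cache.rglob('*.tfx')):
            path.unlink()
            removed.append(path.name)
    return removed

# Demonstration on a temporary directory with one cache file and one other file.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / '.tf'
    cache.mkdir()
    (cache / 'otype.tfx').write_bytes(b'')
    (cache / 'notes.txt').write_text('keep me')
    print('removed:', remove_tfx(tmp))
    print('left:   ', sorted(p.name for p in cache.iterdir()))
```

For the real data you would call it with the corpus directory, the one passed to `Fabric()` in `locations`.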

It is not handy to execute the following cell all the time, that's why I have commented it out.
So if you really want to clear the cache, remove the comment sign below.

In [39]:
# TF.clearCache()