<img align="right" src="tf-small.png"/>

# Greek Test

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, collections
from tf.fabric import Fabric

# Call Text-Fabric

Everything starts by setting up Text-Fabric.
It needs to know where to look for data.

In [3]:
SBLGNT = 'greek/sblgnt'
TF = Fabric( modules=SBLGNT )

This is Text-Fabric 2.1.3
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
35 features found and 0 ignored


# Load Features
Specify the features to load, and receive the API to work with that data.

In [4]:
api = TF.load('''
    book
    booknum
    Case
    Cat
    chapter
    child
    ClType
    Degree
    End
    function
    Gender
    HasDet
    Head
    Mood
    morphId
    nodeId
    Number
    oslots
    otext
    otype
    Person
    psp
    Rule
    Start
    Tense
    Type
    Unicode
    UnicodeLemma
    verse
    Voice
    freq_occ
    freq_lex
    rank_occ
    rank_lex
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.28s T otype                from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     2.88s T oslots               from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s M otext                from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s T book                 from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.00s T chapter              from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.03s T verse                from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.60s T Unicode              from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |     0.53s T UnicodeLemma         from /Users/dirk/github/text-fabric-data/greek/sblgnt
   |      |     0.37s C __levels__           from otype, oslots
   |      |     4.02s C __order__            from otype, oslots, __levels__
   |      |     0.25s C __rank__             from otype, __order__
   |      |     6.14s C __le

# Counting

In order to get acquainted with the data, we start with simple tasks: counting.

## Count all nodes
We use the 
[`N()` generator](https://github.com/ETCBC/text-fabric/wiki/Api#walking-through-nodes)
to walk us through the nodes.

In [5]:
indent(reset=True)
info('Counting nodes ...')
i = 0
for n in N(): i += 1
info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.12s 428430 nodes


## Sort some nodes

Get some nodes, 
[slot](https://github.com/ETCBC/text-fabric/wiki/Data-model#summary)
and non-slot, and sort them in the 
[canonical order](https://github.com/ETCBC/text-fabric/wiki/Api#sorting-nodes).

The [`otype` feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#otype-node-feature)
is a
[GRID feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#more-about-the-grid),
a special feature that provides defining characteristics for the
data set as a whole. 
It tells us where the slots end and the other nodes start.

In [6]:
sortNodes(list(range(F.otype.maxSlot+1, F.otype.maxSlot+10))+list(range(1,11)))

[137795,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 137796,
 137797,
 137798,
 137799,
 137800,
 137801,
 137802,
 137803]
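
The ordering seen above, where an embedding node precedes the slots it contains, can be illustrated on toy data. This is only a minimal sketch: it assumes every node covers a gap-free span of slots, given as a hypothetical `(firstSlot, lastSlot)` pair. The names `spans` and `canonical_key` are illustrative and not part of Text-Fabric, whose real ordering also deals with gapped nodes and ties.

```python
# Toy data: node -> (first slot, last slot). Nodes 1..3 are slots;
# 100 and 101 are hypothetical non-slot nodes.
spans = {
    1: (1, 1),
    2: (2, 2),
    3: (3, 3),
    100: (1, 3),   # embeds slots 1..3
    101: (2, 3),   # embeds slots 2..3
}

def canonical_key(node):
    """Earlier start first; for equal starts, the longer (embedding) node first."""
    first, last = spans[node]
    return (first, -last)

ordered = sorted(spans, key=canonical_key)
print(ordered)
```

In this toy ordering node 100 comes before slot 1, just as node 137795 comes before word 1 in the output above.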

## Numbers in the otype feature
Get more information that is readily available in the 
[GRID feature](https://github.com/ETCBC/text-fabric/wiki/Data-model#more-about-the-grid)
[`otype`](https://github.com/ETCBC/text-fabric/wiki/Data-model#otype-node-feature),
namely what types of objects there are in the dataset.

In [7]:
info('{:<9} = {}\n{:<9}={}\n{:<9}={}'.format(
    'slotType', F.otype.slotType,
    'maxSlot', F.otype.maxSlot,
    'maxNode', F.otype.maxNode,
), tm=False)
info('All otypes:\n\t', nl=False, tm=False)
info('\n\t'.join(F.otype.all), tm=False)

slotType  = word
maxSlot  =137794
maxNode  =428430
All otypes:
	book
	chapter
	verse
	sentence
	clause
	clause_atom
	phrase
	conjunction
	wordx
	word


## Count individual object types

In [8]:
indent(reset=True)
info('counting objects ...')
for otype in F.otype.all:
    i = 0
    indent(level=1, reset=True)
    for n in F.otype.s(otype): i+=1
    info('{:>7} {}s'.format(i, otype))
indent(level=0)
info('Done')

  0.00s counting objects ...
   |     0.00s      27 books
   |     0.00s     260 chapters
   |     0.00s    7939 verses
   |     0.00s    8014 sentences
   |     0.01s   54800 clauses
   |     0.02s   75967 clause_atoms
   |     0.04s  142578 phrases
   |     0.00s     172 conjunctions
   |     0.00s     879 wordxs
   |     0.03s  137794 words
  0.12s Done


# Feature statistics

The content data resides in the features.
The
[`F` function](https://github.com/ETCBC/text-fabric/wiki/Api#node-features)
gives access to that data.
Every feature has a method
[`freqList()`](https://github.com/ETCBC/text-fabric/wiki/Api#node-features)
to generate a frequency list of its values, ordered by highest frequency first.

In [9]:
for feat in 'Cat Gender Number Person Case Degree Voice Tense Mood psp Type ClType function HasDet'.split():
    print(f"{feat} `{'` `'.join(x[0] for x in Fs(feat).freqList())}`")

Cat `np` `CL` `vp` `noun` `verb` `V` `det` `ADV` `S` `conj` `pron` `pp` `prep` `O` `adjp` `adj` `advp` `adv` `P` `IO` `VC` `ptcl` `nump` `num` `intj` `O2`
Gender `Masculine` `Feminine` `Neuter`
Number `Singular` `Plural`
Person `Third` `Second` `First`
Case `Nominative` `Accusative` `Genitive` `Dative` `Vocative`
Degree `Comparative` `Superlative`
Voice `Active` `Middle` `Passive`
Tense `Aorist` `Present` `Imperfect` `Future` `Perfect` `Pluperfect`
Mood `Indicative` `Participle` `Infinitive` `Subjunctive` `Imperative` `Optative`
psp `noun` `verb` `det` `conj` `pron` `prep` `adj` `adv` `ptcl` `num` `intj`
Type `Common` `Personal` `Proper` `Demonstrative` `Relative` `Interrogative` `Indefinite`
ClType `VerbElided` `Verbless` `Minor`
function `np` `vp` `V` `ADV` `pp` `S` `O` `adjp` `advp` `P` `IO` `VC` `nump` `intj` `O2` `adj` `conj` `prep` `adv` `ptcl` `pron`
HasDet `True`
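
For readers who want to see what such a frequency list amounts to, here is a small stand-alone sketch in plain Python. It does not use Text-Fabric: `freq_list` and the sample values are hypothetical, and breaking ties alphabetically is an assumption made for this example only.

```python
import collections

def freq_list(values):
    """Return (value, count) pairs, most frequent first; ties broken by value."""
    counts = collections.Counter(values)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical feature values for six nodes.
genders = ['Masculine', 'Feminine', 'Masculine', 'Neuter', 'Masculine', 'Feminine']
result = freq_list(genders)
print(result)
```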


In [10]:
for book in F.otype.s('book'):
    print('`{}` | {:>2}'.format(F.book.v(book), len(L.d(book, otype='chapter'))))

`matthew` | 28
`mark` | 16
`luke` | 24
`john` | 21
`acts` | 28
`romans` | 16
`1corinthians` | 16
`2corinthians` | 13
`galatians` |  6
`ephesians` |  6
`philippians` |  4
`colossians` |  4
`1thessalonians` |  5
`2thessalonians` |  3
`1timothy` |  6
`2timothy` |  4
`titus` |  3
`philemon` |  1
`hebrews` | 13
`james` |  5
`1peter` |  5
`2peter` |  3
`1john` |  5
`2john` |  1
`3john` |  1
`jude` |  1
`revelation` | 22


# Lexeme matters

## Top 10 frequent verbs

If we count the frequency of words, we usually mean the frequency of their
corresponding lexemes.

There are several methods for working with lexemes.

### Method 1: counting words

In [13]:
verbs = collections.Counter()
indent(reset=True)
info('Collecting data')
for w in F.otype.s('word'):
    if F.sp.v(w) != 'verb': continue
    verbs[F.lex.v(w)] +=1
info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt) for (verb, cnt) in sorted(
        verbs.items() , key=lambda x: (-x[1], x[0]))[0:10],
    )
)       

  0.00s Collecting data
  0.34s Done
>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082



### Method 2: counting lexemes

An alternative way to do this is to use the feature `freq_lex`, defined for `lex` nodes.
Now we walk the lexemes instead of the occurrences.
Note that the feature `sp` (part-of-speech) is defined for nodes of type `word` as well as `lex`.
Both also have the `lex` feature.
Note that we might encounter Hebrew lexemes as well as Aramaic lexemes, so we still have to
accumulate the `freq_lex`es of the lexeme nodes with the same lexeme value.

In [14]:
verbs = collections.Counter()
indent(reset=True)
info('Collecting data')
for w in F.otype.s('lex'):
    if F.sp.v(w) != 'verb': continue
    verbs[F.lex.v(w)] += F.freq_lex.v(w)
info('Done')
print(''.join(
    '{}: {}\n'.format(verb, cnt) for (verb, cnt) in sorted(
        verbs.items() , key=lambda x: (-x[1], x[0]))[0:10],
    )
)       

  0.00s Collecting data
  0.03s Done
>MR[: 5378
HJH[: 3561
<FH[: 2629
BW>[: 2570
NTN[: 2017
HLK[: 1554
R>H[: 1298
CM<[: 1168
DBR[: 1138
JCB[: 1082



## Lexeme distribution

Let's do a bit more fancy lexeme stuff.

### Hapaxes

A hapax can be found by inspecting lexemes and seeing how many word nodes they are linked to.
If that number is one, we have a hapax.

We print 10 hapaxes with their gloss.

In [15]:
hapax = []
zero = set()

indent(reset=True)
for l in F.otype.s('lex'):
    occs = L.d(l, otype='word')
    n = len(occs)
    if n == 0: # that's weird: should not happen
        zero.add(l)
    elif n == 1: # hapax found!
        hapax.append(l)
info('{} hapaxes found'.format(len(hapax)))
if zero:
    error('{} zeroes found'.format(len(zero)), tm=False)
else:
    info('No zeroes found', tm=False)
for h in hapax[0:10]:
    print('\t{:<8} {}'.format(F.lex.v(h), F.gloss.v(h)))

  0.18s 3074 hapaxes found
No zeroes found
	PJCWN/   Pishon
	CWP[     bruise
	HRWN/    pregnancy
	Z<H/     sweat
	LHV/     flame
	NWD/     Nod
	XNWK=/   Enoch
	MXWJ>L/  Mehujael
	MXJJ>L/  Mehujael
	JBL=/    Jabal
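
The same idea can be tried without any corpus at hand. The sketch below works on a hypothetical list of lexeme values (transliterated toy data, not taken from the dataset) and keeps the lexemes that occur exactly once.

```python
import collections

# Hypothetical lexeme of each word occurrence, in text order.
lexemes = ['logos', 'theos', 'logos', 'arche', 'theos', 'phos']

occurrences = collections.Counter(lexemes)
# A hapax is a lexeme that occurs exactly once; first-seen order is kept.
hapaxes = [lex for lex in occurrences if occurrences[lex] == 1]
print(hapaxes)
```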


### Small occurrence base

The occurrence base of a lexeme consists of the verses, chapters and books in which it occurs.
Let's look for lexemes that occur in a single chapter.

If a lexeme occurs in a single chapter, its slots are a subset of the slots of that chapter.
So, if you go *up* from the lexeme, you encounter the chapter.

Normally, lexemes occur in many chapters, and then none of them totally includes all occurrences of it,
so if you go up from such lexemes, you do not find chapters.

Let's check it out.

Oh yes, we have already found the hapaxes, we will skip them here.

In [16]:
singleCh = []
multiple = []
indent(reset=True)
info('Finding single chapter lexemes')
for l in F.otype.s('lex'):
    chapters = L.u(l, 'chapter')
    if len(chapters) == 1:
        if l not in hapax:
            singleCh.append(l)
    elif len(chapters) > 0: # should not happen
        multiple.append(l)
info('{} single chapter lexemes found'.format(len(singleCh)))
if multiple:
    error('{} chapter embedders of multiple lexemes found'.format(len(multiple)), tm=False)
else:
    info('No chapter embedders of multiple lexemes found', tm=False)
for s in singleCh[0:10]:
    print('{:<20} {:<6}'.format(
        '{} {}:{}'.format(*T.sectionFromNode(s)),
        F.lex.v(s),
    ))

  0.00s Finding single chapter lexemes
  0.18s 450 single chapter lexemes found
No chapter embedders of multiple lexemes found
Genesis 4:1          QJN=/ 
Genesis 4:2          HBL=/ 
Genesis 4:18         <JRD/ 
Genesis 4:18         MTWC>L/
Genesis 4:19         YLH/  
Genesis 4:22         TWBL_QJN/
Genesis 10:11        KLX=/ 
Genesis 14:1         >MRPL/
Genesis 14:1         >RJWK/
Genesis 14:1         >LSR/ 
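
The subset reasoning behind this search can be sketched with plain Python sets. All names and slot numbers below are hypothetical toy data; the point is only that a chapter "embeds" a lexeme when the lexeme's slots are a subset of the chapter's slots.

```python
# Hypothetical slot sets: which word slots belong to each chapter and lexeme.
chapter_slots = {
    'ch1': set(range(1, 11)),    # slots 1..10
    'ch2': set(range(11, 21)),   # slots 11..20
}
lexeme_slots = {
    'lexA': {2, 7},      # both occurrences inside ch1
    'lexB': {3, 15},     # spread over ch1 and ch2
    'lexC': {12},        # a hapax inside ch2
}

def chapters_embedding(lexeme):
    """Chapters whose slot set contains every occurrence of the lexeme."""
    occs = lexeme_slots[lexeme]
    return [ch for (ch, slots) in chapter_slots.items() if occs <= slots]

# Keep lexemes embedded in exactly one chapter, skipping hapaxes.
single_chapter = {
    lex: chapters_embedding(lex)[0]
    for lex in lexeme_slots
    if len(chapters_embedding(lex)) == 1 and len(lexeme_slots[lex]) > 1
}
print(single_chapter)
```

`lexB` yields no embedding chapter at all, which is the "normal" case described above for lexemes that occur in many chapters.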


### Confined to books

As a final exercise with lexemes, let's make a list of all books, and show their total number of lexemes and
the number of lexemes that occur exclusively in that book.

In [17]:
indent(reset=True)

allBook = collections.defaultdict(set)
allLex = set()

info('Making book-lexeme index')
for b in F.otype.s('book'):
    for w in L.d(b, 'word'):
        l = L.u(w, 'lex')[0]
        allBook[b].add(l)
        allLex.add(l)
info('Found {} lexemes'.format(len(allLex)))

  0.00s Making book-lexeme index
  4.79s Found 9236 lexemes


In [18]:
indent(reset=True)

singleBook = collections.defaultdict(lambda:0)
info('Finding single book lexemes')
for l in F.otype.s('lex'):
    book = L.u(l, 'book')
    if len(book) == 1:
        singleBook[book[0]] += 1
info('found {} single book lexemes'.format(sum(singleBook.values())))

  0.00s Finding single book lexemes
  0.06s found 4228 single book lexemes


In [19]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'book', '#all', '#own', '%own',
    '-'*35,
))
booklist = []

for b in F.otype.s('book'):
    book = T.bookName(b)
    a = len(allBook[b])
    o = singleBook.get(b, 0)
    p = 100 * o / a
    booklist.append((book, a, o, p))

for x in sorted(booklist, key=lambda e: (-e[3], -e[1], e[0])):
    print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

book                 #all #own %own
-----------------------------------
Daniel               1121  428 38.2%
1_Chronicles         2016  489 24.3%
Ezra                  991  199 20.1%
Joshua               1175  206 17.5%
Esther                472   67 14.2%
Isaiah               2554  350 13.7%
Numbers              1457  197 13.5%
Ezekiel              1718  212 12.3%
Song_of_songs         503   60 11.9%
Job                  1718  202 11.8%
Genesis              1817  208 11.4%
Nehemiah             1076  110 10.2%
Psalms               2251  216  9.6%
Leviticus             960   89  9.3%
Judges               1210   99  8.2%
Ecclesiastes          575   46  8.0%
Proverbs             1356  103  7.6%
Jeremiah             1950  147  7.5%
1_Samuel             1257   86  6.8%
2_Samuel             1304   89  6.8%
2_Kings              1266   85  6.7%
Exodus               1425   92  6.5%
1_Kings              1291   81  6.3%
Deuteronomy          1449   80  5.5%
Lamentations          592   31  5.2%
2_C
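
The bookkeeping behind the `#all`, `#own` and `%own` columns can be reproduced on a tiny scale with set arithmetic. The sketch below uses a hypothetical inventory of lexemes per book; it is an illustration of the computation, not the method used by the cells above.

```python
# Hypothetical lexeme inventory per book.
book_lexemes = {
    'bookA': {'l1', 'l2', 'l3', 'l4'},
    'bookB': {'l3', 'l4', 'l5'},
    'bookC': {'l4', 'l6'},
}

rows = []
for (book, lexemes) in book_lexemes.items():
    # Lexemes of all other books taken together.
    elsewhere = set().union(*(lx for (b, lx) in book_lexemes.items() if b != book))
    own = lexemes - elsewhere
    rows.append((book, len(lexemes), len(own), 100 * len(own) / len(lexemes)))

# Highest share of exclusive lexemes first, then larger inventories, then name.
rows.sort(key=lambda r: (-r[3], -r[1], r[0]))
for row in rows:
    print('{:<10} {:>4} {:>4} {:>5.1f}%'.format(*row))
```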

## Part of speech counting
We count the words of each part of speech, and we list the top 10 most frequent lexemes.

**NB**: This is not so much about lexemes as about
generating pretty progress messages.

In [20]:
partOfSpeech = collections.Counter()
freqLex = collections.Counter()

indent(level=0, reset=True)
info('Starting tasks')
indent(level=1, reset=True)
info('Counting the words by part-of-speech ...')
for w in F.otype.s('word'):
    partOfSpeech[F.sp.v(w)] += 1
info('Done: {} categories'.format(len(partOfSpeech)))
indent(level=2)
info('\n'.join('{:<7}: {:>6}x'.format(*x) for x in sorted(
    partOfSpeech.items(),
    key=lambda x: (-x[1], x[0])
)), tm=False)
indent(level=1, reset=True)
info('Listing the top 10 frequent words ...')
for w in F.otype.s('word'):
    freqLex[F.lex.v(w)] += 1
info('Done: {} lexemes'.format(len(freqLex)))
indent(level=2)
info('\n'.join('{:<7}: {:>6}x'.format(*x) for x in sorted(
    freqLex.items(),
    key=lambda x: (-x[1], x[0])
)[0:10]), tm=False)
indent(level=0)
info('All tasks completed')

  0.00s Starting tasks
   |     0.00s Counting the words by part-of-speech ...
   |     0.46s Done: 14 categories
   |      |   subs   : 121481x
   |      |   verb   :  73710x
   |      |   prep   :  73273x
   |      |   conj   :  62722x
   |      |   nmpr   :  33082x
   |      |   art    :  30384x
   |      |   adjv   :   9464x
   |      |   nega   :   6053x
   |      |   prps   :   5011x
   |      |   advb   :   4550x
   |      |   prde   :   2660x
   |      |   intj   :   1885x
   |      |   inrg   :   1285x
   |      |   prin   :   1021x
   |     0.00s Listing the top 10 frequent words ...
   |     0.45s Done: 8776 lexemes
   |      |   W      :  51003x
   |      |   H      :  30390x
   |      |   L      :  20447x
   |      |   B      :  15768x
   |      |   >T     :  11002x
   |      |   MN     :   7681x
   |      |   JHWH/  :   6828x
   |      |   <L     :   5870x
   |      |   >L     :   5521x
   |      |   >CR    :   5500x
  0.94s All tasks completed


# Layer API
We travel upwards and downwards, forwards and backwards through the nodes.
The Layer-API (`L`) provides functions: `u()` for going up, and `d()` for going down,
`n()` for going to next nodes and `p()` for going to previous nodes.

These directions are indirect notions: nodes are just numbers, but by means of the
`oslots` feature they are linked to slots. One node *contains* another node if the one is linked to a set of slots that contains the set of slots that the other is linked to.
And one node is next or previous to another if its slots follow or precede the slots of the other one.

`L.u(node)` **Up** is going to nodes that embed `node`.

`L.d(node)` **Down** is the opposite direction, to those that are contained in `node`.

`L.n(node)` **Next** are the next *adjacent* nodes, i.e. nodes whose first slot comes immediately after the last slot of `node`.

`L.p(node)` **Previous** are the previous *adjacent* nodes, i.e. nodes whose last slot comes immediately before the first slot of `node`.

All these functions yield nodes of all possible otypes.
By passing an optional parameter, you can restrict the results to nodes of that type.

The result is ordered in the canonical node ordering.
The functions always return a tuple, even if there is just one node in the result.
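
The four directions can be mimicked on toy data with plain sets of slots. The sketch below is not the `L` API itself: `slots_of` and the four helper functions are hypothetical, results come back in dictionary order rather than canonical order, and no filtering by node type is attempted.

```python
# Hypothetical nodes with the slots they are linked to (cf. the `oslots` feature).
slots_of = {
    1: {1}, 2: {2}, 3: {3}, 4: {4},   # slot nodes (words)
    10: {1, 2},                        # a phrase over words 1-2
    11: {3, 4},                        # a phrase over words 3-4
    20: {1, 2, 3, 4},                  # a sentence over everything
}

def up(node):
    """Nodes whose slots contain all slots of `node` (excluding itself)."""
    return tuple(m for m in slots_of if m != node and slots_of[node] <= slots_of[m])

def down(node):
    """Nodes whose slots are all contained in the slots of `node`."""
    return tuple(m for m in slots_of if m != node and slots_of[m] <= slots_of[node])

def next_nodes(node):
    """Nodes whose first slot comes right after the last slot of `node`."""
    return tuple(m for m in slots_of if min(slots_of[m]) == max(slots_of[node]) + 1)

def prev_nodes(node):
    """Nodes whose last slot comes right before the first slot of `node`."""
    return tuple(m for m in slots_of if max(slots_of[m]) == min(slots_of[node]) - 1)

print(up(1), down(10), next_nodes(10), prev_nodes(11))
```

Note that every helper returns a tuple, also when there is a single result or none.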

## Going up
We go from the first word to the book that contains it.

In [21]:
firstBook = L.u(1, otype='book')[0]
print(firstBook)

1367534


And let's see all the containing objects of word 3:

In [28]:
w = 3
for otype in F.otype.all:
    if otype == F.otype.slotType: continue
    up = L.u(w, otype=otype)
    upNode = 'x' if len(up) == 0 else up[0]
    print('word {} is contained in {} {}'.format(w, otype, upNode))

word 3 is contained in book 1367534
word 3 is contained in chapter 1367573
word 3 is contained in lex 1436897
word 3 is contained in verse 1413682
word 3 is contained in half_verse 1368502
word 3 is contained in sentence 1125833
word 3 is contained in sentence_atom 1189403
word 3 is contained in clause 426582
word 3 is contained in clause_atom 514582
word 3 is contained in phrase 605145
word 3 is contained in phrase_atom 858319
word 3 is contained in subphrase x


## Going next
Let's go to the next nodes of the first book.

In [26]:
afterFirstBook = L.n(firstBook)
for n in afterFirstBook:
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))
secondBook = L.n(firstBook, otype='book')[0]

  28763: word          first slot=28763 , last slot=28763 
 877074: phrase_atom   first slot=28763 , last slot=28763 
 623125: phrase        first slot=28763 , last slot=28763 
 520716: clause_atom   first slot=28763 , last slot=28767 
 432570: clause        first slot=28763 , last slot=28767 
1371502: half_verse    first slot=28763 , last slot=28770 
1194035: sentence_atom first slot=28763 , last slot=28772 
1130436: sentence      first slot=28763 , last slot=28791 
1415215: verse         first slot=28763 , last slot=28776 
1367623: chapter       first slot=28763 , last slot=29111 
1367535: book          first slot=28763 , last slot=52510 


## Going previous

And let's see what is right before the second book.

In [27]:
for n in L.p(secondBook):
    print('{:>7}: {:<13} first slot={:<6}, last slot={:<6}'.format(
        n, F.otype.v(n),
        E.oslots.s(n)[0],
        E.oslots.s(n)[-1],
    ))

1367534: book          first slot=1     , last slot=28762 
1367622: chapter       first slot=28258 , last slot=28762 
1415214: verse         first slot=28745 , last slot=28762 
1371501: half_verse    first slot=28753 , last slot=28762 
1130435: sentence      first slot=28756 , last slot=28762 
1194034: sentence_atom first slot=28756 , last slot=28762 
 432569: clause        first slot=28756 , last slot=28762 
 520715: clause_atom   first slot=28756 , last slot=28762 
 623124: phrase        first slot=28761 , last slot=28762 
 877073: phrase_atom   first slot=28761 , last slot=28762 
  28762: word          first slot=28762 , last slot=28762 


## Going down

We go to the chapters of the second book, and just count them.

In [30]:
chapters = L.d(secondBook, otype='chapter')
print(len(chapters))

40


## The first verse
We pick the first verse and the first word, and explore what is above and below them.

In [31]:
for n in [1, L.u(1, otype='verse')[0]]:
    indent(level=0)
    info('Node {}'.format(n), tm=False)
    indent(level=1)
    info('UP', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.u(n)]), tm=False)
    indent(level=1)
    info('DOWN', tm=False)
    indent(level=2)
    info('\n'.join(['{:<15} {}'.format(u, F.otype.v(u)) for u in L.d(n)]), tm=False)
indent(level=0)
info('Done', tm=False)

Node 1
   |   UP
   |      |   1436895         lex
   |      |   858318          phrase_atom
   |      |   605144          phrase
   |      |   1368502         half_verse
   |      |   514582          clause_atom
   |      |   426582          clause
   |      |   1189403         sentence_atom
   |      |   1125833         sentence
   |      |   1413682         verse
   |      |   1367573         chapter
   |      |   1367534         book
   |   DOWN
   |      |   
Node 1413682
   |   UP
   |      |   1367573         chapter
   |      |   1367534         book
   |   DOWN
   |      |   1125833         sentence
   |      |   1189403         sentence_atom
   |      |   426582          clause
   |      |   514582          clause_atom
   |      |   1368502         half_verse
   |      |   605144          phrase
   |      |   858318          phrase_atom
   |      |   1               word
   |      |   2               word
   |      |   605145          phrase
   |      |   858319          phra

# Text API

We examine the functions of the Text API: `T`.

## Formats
First the formats that we have available to represent the actual text.
These formats have been defined in the `otext` feature.
This is an optional GRID config feature: it has only metadata.

In [11]:
sorted(T.formats)

['lex-orig-full', 'text-orig-full']

Note the `text-phono-full` format here.
It does not come from the main data source `etcbc4c`, but from the module `phono`.
Look in your data directory, find `text-fabric-data/hebrew/phono/otext@phono.tf`,
and you'll see this format defined there.

## Using the formats
Now let's use those formats to print out the first 11 words of the Greek New Testament.

In [12]:
for fmt in sorted(T.formats):
    print('{}:\n\t{}'.format(fmt, T.text(range(1,12), fmt=fmt)))

lex-orig-full:
	βίβλος γένεσις Ἰησοῦς Χριστός υἱός Δαυίδ υἱός Ἀβραάμ Ἀβραάμ γεννάω ὁ 
text-orig-full:
	Βίβλος γενέσεως Ἰησοῦ χριστοῦ υἱοῦ Δαυὶδ υἱοῦ Ἀβραάμ. Ἀβραὰμ ἐγέννησεν τὸν 


If we do not specify a format, the **default** format is used (`text-orig-full`).

In [13]:
print(T.text(range(1,12)))

Βίβλος γενέσεως Ἰησοῦ χριστοῦ υἱοῦ Δαυὶδ υἱοῦ Ἀβραάμ. Ἀβραὰμ ἐγέννησεν τὸν 
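
One way to picture a text format is as a template that is filled in, word by word, with feature values. The sketch below works under that assumption only: the template strings and the `features` dictionary are hypothetical and are not copied from the `otext` feature; just the feature names `Unicode` and `UnicodeLemma` come from the load list above.

```python
# Hypothetical per-word feature values, keyed by slot number.
features = {
    1: {'Unicode': 'Βίβλος', 'UnicodeLemma': 'βίβλος'},
    2: {'Unicode': 'γενέσεως', 'UnicodeLemma': 'γένεσις'},
}

# Hypothetical templates; real format definitions live in the `otext` feature.
formats = {
    'text-orig-full': '{Unicode} ',
    'lex-orig-full': '{UnicodeLemma} ',
}

def render(slots, fmt='text-orig-full'):
    """Concatenate the filled-in template for each slot."""
    return ''.join(formats[fmt].format(**features[s]) for s in slots)

print(render([1, 2]))
print(render([1, 2], fmt='lex-orig-full'))
```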


## Whole text in all formats in just 10 seconds
We are going to produce the complete text of the Greek New Testament in all available formats.

In [14]:
text = collections.defaultdict(list)
indent(reset=True)
info('writing plain text of whole Bible in all formats')
for v in F.otype.s('verse'):
    words = L.d(v, 'word')
    for fmt in sorted(T.formats):
        text[fmt].append(T.text(words, fmt=fmt))
info('done {} formats'.format(len(text)))
for fmt in sorted(text):
    print('{}\n{}\n'.format(fmt, '\n'.join(text[fmt][0:5])))

  0.00s writing plain text of whole Bible in all formats
  0.59s done 2 formats
lex-orig-full
βίβλος γένεσις Ἰησοῦς Χριστός υἱός Δαυίδ υἱός Ἀβραάμ 
Ἀβραάμ γεννάω ὁ Ἰσαάκ δέ Ἰσαάκ γεννάω ὁ Ἰακώβ δέ Ἰακώβ γεννάω ὁ Ἰούδας καί ὁ ἀδελφός αὐτός 
δέ Ἰούδας γεννάω ὁ Φαρές καί ὁ Ζάρα ἐκ ὁ Θαμάρ δέ Φαρές γεννάω ὁ Ἑσρώμ δέ Ἑσρώμ γεννάω ὁ Ἀράμ 
δέ Ἀράμ γεννάω ὁ Ἀμιναδάβ δέ Ἀμιναδάβ γεννάω ὁ Ναασσών δέ Ναασσών γεννάω ὁ Σαλμών 
δέ Σαλμών γεννάω ὁ Βόες ἐκ ὁ Ῥαχάβ δέ Βόες γεννάω ὁ Ἰωβήδ ἐκ ὁ Ῥούθ δέ Ἰωβήδ γεννάω ὁ Ἰεσσαί 

text-orig-full
Βίβλος γενέσεως Ἰησοῦ χριστοῦ υἱοῦ Δαυὶδ υἱοῦ Ἀβραάμ. 
Ἀβραὰμ ἐγέννησεν τὸν Ἰσαάκ, δὲ Ἰσαὰκ ἐγέννησεν τὸν Ἰακώβ, δὲ Ἰακὼβ ἐγέννησεν τὸν Ἰούδαν καὶ τοὺς ἀδελφοὺς αὐτοῦ, 
δὲ Ἰούδας ἐγέννησεν τὸν Φαρὲς καὶ τὸν Ζάρα ἐκ τῆς Θαμάρ, δὲ Φαρὲς ἐγέννησεν τὸν Ἑσρώμ, δὲ Ἑσρὼμ ἐγέννησεν τὸν Ἀράμ, 
δὲ Ἀρὰμ ἐγέννησεν τὸν Ἀμιναδάβ, δὲ Ἀμιναδὰβ ἐγέννησεν τὸν Ναασσών, δὲ Ναασσὼν ἐγέννησεν τὸν Σαλμών, 
δὲ Σαλμὼν ἐγέννησεν τὸν Βόες ἐκ τῆς Ῥαχάβ, δὲ Βόες ἐγέννησεν τὸν Ἰωβὴδ ἐκ τῆς Ῥούθ, δ

## Book names

For Bible book names, we can use several languages.

### Languages
Here are the languages that we can use for book names.
These languages come from the features `book@ll`, where `ll` is a two letter
ISO language code. Have a look in your data directory, you can't miss them.

In [15]:
T.languages

{'en': {'language': 'English', 'languageEnglish': 'english'}}

### Book names in Swahili
Get the book names in Swahili.

In [37]:
nodeToSwahili = ''
for b in F.otype.s('book'):
    nodeToSwahili += '{} = {}\n'.format(b, T.bookName(b, lang='sw'))
print(nodeToSwahili)

1367534 = Mwanzo
1367535 = Kutoka
1367536 = Mambo_ya_Walawi
1367537 = Hesabu
1367538 = Kumbukumbu_la_Torati
1367539 = Yoshua
1367540 = Waamuzi
1367541 = 1_Samweli
1367542 = 2_Samweli
1367543 = 1_Wafalme
1367544 = 2_Wafalme
1367545 = Isaya
1367546 = Yeremia
1367547 = Ezekieli
1367548 = Hosea
1367549 = Yoeli
1367550 = Amosi
1367551 = Obadia
1367552 = Yona
1367553 = Mika
1367554 = Nahumu
1367555 = Habakuki
1367556 = Sefania
1367557 = Hagai
1367558 = Zekaria
1367559 = Malaki
1367560 = Zaburi
1367561 = Ayubu
1367562 = Mithali
1367563 = Ruthi
1367564 = Wimbo_Ulio_Bora
1367565 = Mhubiri
1367566 = Maombolezo
1367567 = Esta
1367568 = Danieli
1367569 = Ezra
1367570 = Nehemia
1367571 = 1_Mambo_ya_Nyakati
1367572 = 2_Mambo_ya_Nyakati



## Book nodes from Swahili
OK, there they are. We copy them into a string, and do the opposite: get the nodes back.
We check whether we get exactly the same nodes as the ones we started with.

In [38]:
swahiliNames = '''
Mwanzo
Kutoka
Mambo_ya_Walawi
Hesabu
Kumbukumbu_la_Torati
Yoshua
Waamuzi
1_Samweli
2_Samweli
1_Wafalme
2_Wafalme
Isaya
Yeremia
Ezekieli
Hosea
Yoeli
Amosi
Obadia
Yona
Mika
Nahumu
Habakuki
Sefania
Hagai
Zekaria
Malaki
Zaburi
Ayubu
Mithali
Ruthi
Wimbo_Ulio_Bora
Mhubiri
Maombolezo
Esta
Danieli
Ezra
Nehemia
1_Mambo_ya_Nyakati
2_Mambo_ya_Nyakati
'''.strip().split()

swahiliToNode = ''
for nm in swahiliNames:
    swahiliToNode += '{} = {}\n'.format(T.bookNode(nm, lang='sw'), nm)
    
if swahiliToNode != nodeToSwahili:
    print('Something is not right with the book names')
else:
    print('Going from nodes to booknames and back yields the original nodes')

Going from nodes to booknames and back yields the original nodes


## Sections

A section in this corpus is a book, a chapter or a verse.
Knowledge of sections is not baked into Text-Fabric. 
The config feature `otext.tf` may specify three section levels, and tell
what the corresponding node types and features are.

From that knowledge it can construct mappings from nodes to sections, e.g. from verse
nodes to tuples of the form:

    (bookName, chapterNumber, verseNumber)
   
Here are examples of getting the section that corresponds to a node and vice versa.

**NB:** `sectionFromNode` always delivers a verse specification, either from the
first slot belonging to that node, or, if `lastSlot`, from the last slot
belonging to that node.

In [17]:
T.nodeFromSection(('Matthew', 1, 1))

419613

In [19]:
T.sectionFromNode(419613, lang='en')

('Matthew', 1, 1)
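
The two-way mapping between nodes and section tuples can be imitated with a pair of dictionaries. This is a simplified sketch: the node numbers and the helpers `section_from_node` and `node_from_section` are hypothetical stand-ins, and only verse nodes are covered.

```python
# Hypothetical verse nodes and their section tuples.
section_of = {
    1001: ('Matthew', 1, 1),
    1002: ('Matthew', 1, 2),
    1003: ('Mark', 1, 1),
}
# Invert the mapping to go from section to node.
node_of = {section: node for (node, section) in section_of.items()}

def section_from_node(node):
    """Look up the (book, chapter, verse) tuple of a verse node."""
    return section_of[node]

def node_from_section(section):
    """Look up the verse node of a (book, chapter, verse) tuple."""
    return node_of[section]

# A round trip should give back the node we started with.
round_trip_ok = all(node_from_section(section_from_node(n)) == n for n in section_of)
print(round_trip_ok)
```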

In [16]:
for x in (
    ('section of first word',   T.sectionFromNode(1)                            ),
    ('node of Gen 1:1',         T.nodeFromSection(('Genesis', 1, 1))            ),
    ('idem',                    T.nodeFromSection(('Mwanzo', 1, 1), lang='sw')  ),
    ('node of book Genesis',    T.nodeFromSection(('Genesis',))                 ),
    ('node of Genesis 1',       T.nodeFromSection(('Genesis', 1))               ),
    ('section of book node',    T.sectionFromNode(1367534)                      ),
    ('idem, now last word',     T.sectionFromNode(1367534, lastSlot=True)       ),
    ('section of chapter node', T.sectionFromNode(1367573)                      ),
    ('idem, now last word',     T.sectionFromNode(1367573, lastSlot=True)       ),
): print('{:<30} {}'.format(*x))

KeyError: ''

## Sentences spanning multiple verses
If you go up from a sentence node, you expect to find a verse node.
But some sentences span multiple verses, and in that case, you will not find the enclosing
verse node, because it is not there.

Here is a piece of code to detect and list all cases where sentences span multiple verses.

The idea is to pick the first and the last word of a sentence, use `T.sectionFromNode` to
discover the verse in which that word occurs, and if they are different: bingo!

We show the first 10 of 915 cases.

In [40]:
indent(reset=True)
info('Get sentences that span multiple verses')
spanSentences = []
for s in F.otype.s('sentence'):
    f = T.sectionFromNode(s, lastSlot=False)
    l = T.sectionFromNode(s, lastSlot=True)
    if f != l:
        spanSentences.append('{} {}:{}-{}'.format(f[0], f[1], f[2], l[2]))
info('Found {} cases'.format(len(spanSentences)))
info('\n{}'.format('\n'.join(spanSentences[0:10])))

  0.00s Get sentences that span multiple verses
  5.19s Found 915 cases
  5.19s 
Genesis 1:17-18
Genesis 1:29-30
Genesis 2:4-7
Genesis 7:2-3
Genesis 7:8-9
Genesis 7:13-14
Genesis 9:9-10
Genesis 10:11-12
Genesis 10:13-14
Genesis 10:15-18
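
The first-word/last-word comparison used above can be reduced to a few lines on toy data. In this sketch `verse_of_slot` and `sentence_span` are hypothetical; a sentence is reported when its first and last slot fall in different verses.

```python
# Hypothetical data: the verse of each slot, and the slot span of each sentence.
verse_of_slot = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
sentence_span = {
    's1': (1, 3),   # stays inside verse 1
    's2': (4, 6),   # starts in verse 2, ends in verse 3
}

spanning = []
for (sentence, (first, last)) in sentence_span.items():
    v_first = verse_of_slot[first]
    v_last = verse_of_slot[last]
    if v_first != v_last:
        spanning.append('{} {}-{}'.format(sentence, v_first, v_last))
print(spanning)
```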


# Ketiv Qere
Let us explore where Ketiv/Qere pairs are and how they render.

In [41]:
qeres = [w for w in F.otype.s('word') if F.qere.v(w) is not None]
print('{} qeres'.format(len(qeres)))
for w in qeres[0:10]:
    print('{}: ketiv = "{}"+"{}" qere = "{}"+"{}"'.format(
        w, F.g_word.v(w), F.trailer.v(w), F.qere.v(w), F.qere_trailer.v(w),
    ))

1892 qeres
3897: ketiv = "*HWY>"+" " qere = "HAJ:Y;74>"+" "
4420: ketiv = "*>HLH"+" " qere = ">@H:@LO75W"+"00"
5645: ketiv = "*>HLH"+" " qere = ">@H:@LO92W"+" "
5912: ketiv = "*>HLH"+" " qere = ">@95H:@LOW03"+" "
6246: ketiv = "*YBJJM"+" " qere = "Y:BOWJI80m"+" "
6354: ketiv = "*YBJJM"+" " qere = "Y:BOWJI80m"+" "
11761: ketiv = "*W-"+"" qere = "WA"+""
11762: ketiv = "*JJFM"+" " qere = "J.W.FA70m"+" "
12783: ketiv = "*GJJM"+" " qere = "GOWJIm03"+" "
13684: ketiv = "*YJDH"+" " qere = "Y@75JID"+"00"


## Show a ketiv-qere pair
Let us print all text representations of the verse in which word node 4419 occurs.

In [42]:
refWord = 4419
vn = L.u(refWord, otype='verse')[0]
ws = L.d(vn, otype='word')
print('{} {}:{}'.format(*T.sectionFromNode(refWord)))
for fmt in sorted(T.formats):
    if fmt.startswith('text-'):
        print('{:<25} {}'.format(fmt, T.text(ws, fmt=fmt)))

Genesis 9:21
text-orig-full            וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אָהֳלֹֽו׃
text-orig-full-ketiv      וַיֵּ֥שְׁתְּ מִן־הַיַּ֖יִן וַיִּשְׁכָּ֑ר וַיִּתְגַּ֖ל בְּתֹ֥וךְ אהלה 
text-orig-plain           וישׁת מן־היין וישׁכר ויתגל בתוך אהלה 
text-phono-full           wayyˌēšt min-hayyˌayin wayyiškˈār wayyiṯgˌal bᵊṯˌôḵ *ʔohᵒlˈô .
text-trans-full           WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: >@H:@LO75W00
text-trans-full-ketiv     WA-J.;71C:T.: MIN&HA-J.A73JIN WA-J.IC:K.@92R WA-J.IT:G.A73L B.:-TO71WK: *>HLH 
text-trans-plain          WJCT MN&HJJN WJCKR WJTGL BTWK >HLH 


# Edge features: mother

Let us do a few basic enquiries on an edge feature:
[mother](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/mother).

We count how many mothers nodes can have (it turns out to be 0 or 1).
We walk through all nodes and per node we retrieve the mother nodes, and
we store the lengths (if non-zero) in a dictionary (`motherLen`).

We see that nodes have at most one mother.

We also count the inverse relationship: daughters.

In [43]:
motherLen = {}
daughterLen = {}
info('Counting edges')
for c in N():
    lms = E.mother.f(c) or []
    lds = E.mother.t(c) or []
    nms = len(lms)
    nds = len(lds)
    if nms: motherLen[c] = nms
    if nds: daughterLen[c] = nds
info('{} nodes have mothers'.format(len(motherLen)))
info('{} nodes have daughters'.format(len(daughterLen)))

motherCount = collections.Counter()
daughterCount = collections.Counter()

for (n, lm) in motherLen.items(): motherCount[lm] += 1
for (n, ld) in daughterLen.items(): daughterCount[ld] += 1

print('mothers', motherCount)
print('daughters', daughterCount)

    14s Counting edges
    17s 181182 nodes have mothers
    17s 143041 nodes have daughters
mothers Counter({1: 181182})
daughters Counter({1: 116872, 2: 17439, 3: 6277, 4: 1842, 5: 463, 6: 123, 7: 20, 8: 5})
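
The counting logic can also be tried without loading the corpus. The sketch below is a stand-alone illustration, not part of Text-Fabric: `toy_edges` is a made-up edge set and `degree_histograms` is a hypothetical helper name. It builds the same two histograms from a plain list of `(daughter, mother)` pairs.

```python
import collections

# Hypothetical toy edges (daughter, mother); not real corpus data
toy_edges = [(2, 1), (3, 1), (4, 1), (6, 5)]


def degree_histograms(edges):
    """Return (mother_count, daughter_count) histograms for an edge list."""
    mothers = collections.defaultdict(set)    # node -> its mothers
    daughters = collections.defaultdict(set)  # node -> its daughters
    for (frm, to) in edges:
        mothers[frm].add(to)
        daughters[to].add(frm)
    # How many nodes have 1, 2, 3, ... mothers (resp. daughters)
    mother_count = collections.Counter(len(ms) for ms in mothers.values())
    daughter_count = collections.Counter(len(ds) for ds in daughters.values())
    return mother_count, daughter_count


mother_count, daughter_count = degree_histograms(toy_edges)
print('mothers', mother_count)
print('daughters', daughter_count)
```

In this toy set every daughter has exactly one mother, while the number of daughters per mother varies, which mirrors the shape of the real result above.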


# Export to Emdros MQL

[EMDROS](http://emdros.org), written by Ulrik Petersen,
is a text database system with the powerful *topographic* query language MQL.
The ideas are based on a model devised by Christ-Jan Doedens in
[Text Databases: One Database Model and Several Retrieval Languages](https://books.google.nl/books?id=9ggOBRz1dO4C).

Text-Fabric's model of slots, nodes and edges is a fairly straightforward translation of the models of Christ-Jan Doedens and Ulrik Petersen.

[SHEBANQ](https://shebanq.ancient-data.org) uses EMDROS to let users execute and save MQL queries against the Hebrew Text Database of the ETCBC.

It is therefore both natural and convenient to be able to work with a Text-Fabric resource through MQL.

After the `Fabric(modules=...)` call, you can call `exportMQL()` in order to save all features of the
indicated modules into a big MQL dump, which can be imported by an EMDROS database.

In [44]:
TF.exportMQL('etcbc4c','~/Downloads')

  0.00s Checking features of dataset etcbc4c


   |     0.00s feature "book@am" => "book_am"
   |     0.00s feature "book@ar" => "book_ar"
   |     0.00s feature "book@bn" => "book_bn"
   |     0.00s feature "book@da" => "book_da"
   |     0.00s feature "book@de" => "book_de"
   |     0.00s feature "book@el" => "book_el"
   |     0.00s feature "book@en" => "book_en"
   |     0.00s feature "book@es" => "book_es"
   |     0.00s feature "book@fa" => "book_fa"
   |     0.00s feature "book@fr" => "book_fr"
   |     0.00s feature "book@he" => "book_he"
   |     0.00s feature "book@hi" => "book_hi"
   |     0.00s feature "book@id" => "book_id"
   |     0.00s feature "book@ja" => "book_ja"
   |     0.00s feature "book@ko" => "book_ko"
   |     0.00s feature "book@la" => "book_la"
   |     0.00s feature "book@nl" => "book_nl"
   |     0.00s feature "book@pa" => "book_pa"
   |     0.00s feature "book@pt" => "book_pt"
   |     0.00s feature "book@ru" => "book_ru"
   |     0.00s feature "book@sw" => "book_sw"
   |     0.00s feature "book@syc" 

   |     0.00s M code                 from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M det                  from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M dist                 from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M dist_unit            from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M distributional_parent from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M domain               from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M freq_occ             from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M function             from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M functional_parent    from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M g_nme                from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s M g_nme_utf8           from /Users/dirk/gith

Now you have a file `~/Downloads/etcbc4c.mql` of 530 MB.
You can import it into an Emdros database, after removing any earlier database of the same name, by saying:

    cd ~/Downloads
    rm -f etcbc4c
    mql -b 3 < etcbc4c.mql
    
The result is an sqlite3 database `etcbc4c` in the same directory (168 MB).
You can run a query against it by creating a text file `test.mql` with these contents:

    select all objects where
    [lex gloss ~ 'make'
        [word FOCUS]
    ]
    
And then say

    mql -b 3 -d etcbc4c test.mql
    
You will see raw query results: all word occurrences that belong to lexemes with `make` in their gloss.
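
If you prefer to drive this step from Python, a minimal sketch could look as follows. It assumes the `mql` executable is on your `PATH`, and the helper name `run_mql_query` is hypothetical. Only the query text and the `-b 3 -d etcbc4c` flags are taken from the commands above. When `mql` is not installed, the sketch only reports the command it would run.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# The query from the text above
QUERY = """select all objects where
[lex gloss ~ 'make'
    [word FOCUS]
]
"""


def run_mql_query(query, database, workdir):
    """Write the query to test.mql in workdir and run it with mql if available."""
    query_file = Path(workdir) / 'test.mql'
    query_file.write_text(query, encoding='utf8')
    command = ['mql', '-b', '3', '-d', database, str(query_file)]
    if shutil.which('mql') is None:
        # Emdros is not installed here: return the command without running it
        return command, None
    result = subprocess.run(command, cwd=workdir, capture_output=True, text=True)
    return command, result.stdout


# Demo in a temporary directory; in real use, pass the directory
# that contains the etcbc4c database (e.g. ~/Downloads)
with tempfile.TemporaryDirectory() as workdir:
    command, output = run_mql_query(QUERY, 'etcbc4c', workdir)
    print(' '.join(command))
```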

The end looks like this:

    < [ lex 1443661 { 313199, 339890, 343902, 343933, 343998, 344891 } false  //  <  < [ word 313199 { 313199 } true  //  <  > 
     ]
     > 
     < [ word 339890 { 339890 } true  //  <  > 
     ]
     > 
     < [ word 343902 { 343902 } true  //  <  > 
     ]
     > 
     < [ word 343933 { 343933 } true  //  <  > 
     ]
     > 
     < [ word 343998 { 343998 } true  //  <  > 
     ]
     > 
     < [ word 344891 { 344891 } true  //  <  > 
     
It is not very pretty, and probably you should use a more visual Emdros tool to run those queries.
But, while we're at it, observe how the word object ids coincide with their monad numbers.
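
That observation can be checked mechanically. The sketch below is an illustration only: it parses an excerpt of the raw output shown above with a regular expression, and the pattern is an assumption based on that excerpt rather than a full parser for Emdros output (for instance, it does not handle monad ranges).

```python
import re

# Excerpt of the raw mql output shown above
raw = '''< [ lex 1443661 { 313199, 339890, 343902, 343933, 343998, 344891 } false  //  <  < [ word 313199 { 313199 } true  //  <  >
 ]
 >
 < [ word 339890 { 339890 } true  //  <  >
 ]
 >
 < [ word 343902 { 343902 } true  //  <  >
'''

# Assumed record shape: "[ <type> <id> { <monad>, <monad>, ... }"
OBJECT = re.compile(r'\[\s*(\w+)\s+(\d+)\s*\{([^}]*)\}')


def parse_objects(text):
    """Yield (object type, object id, list of monads) for each record found."""
    for otype, oid, monads in OBJECT.findall(text):
        yield otype, int(oid), [int(m) for m in monads.split(',') if m.strip()]


objects = list(parse_objects(raw))
words = [(oid, monads) for (otype, oid, monads) in objects if otype == 'word']

# Every word object occupies exactly one monad, equal to its own id
print(all(monads == [oid] for (oid, monads) in words))
```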

And let's look up this last lexeme here in TF, together with its first two occurrences.

In [45]:
lexNode = 1443661
print('{} {} "{}"'.format(
    F.lex.v(lexNode),
    F.voc_utf8.v(lexNode),
    F.gloss.v(lexNode),
))
for wordNode in L.d(lexNode, otype='word')[0:2]:
    verseNode = L.u(wordNode, otype='verse')[0]
    print('{} {}\n\t{}'.format(
        '{} {}:{}'.format(*T.sectionFromNode(wordNode)),
        F.g_word.v(wordNode),
        T.text(L.d(verseNode, otype='word')),
    ))

XWH=[ חוה "make known"
Psalms 19:3 J:XAW.EH
	יֹ֣ום לְ֭יֹום יַבִּ֣יעַֽ אֹ֑מֶר וְלַ֥יְלָה לְּ֝לַ֗יְלָה יְחַוֶּה־דָּֽעַת׃ 
Job 15:17 >:AXAW:K@71
	אֲחַוְךָ֥ שְֽׁמַֽע־לִ֑י וְזֶֽה־חָ֝זִ֗יתִי וַאֲסַפֵּֽרָה׃ 


# Clean caches

Text-Fabric precomputes data for you, so that it can be loaded faster.
If the original data is updated, Text-Fabric detects it, and will recompute that data.

But sometimes the algorithms of Text-Fabric change while the data stays the same.
In those cases you might want to clear the cache of precomputed results.

There are two ways to do that:

* Locate the `.tf` directory of your dataset, and remove all `.tfx` files in it.
  This might be a bit awkward to do, because the `.tf` directory is hidden on Unix-like systems.
* Call `TF.clearCache()`, which does exactly the same.

Clearing the cache on every run would be wasteful, so the call in the following cell is commented out.
If you really want to clear the cache, remove the comment sign below.

In [40]:
# TF.clearCache()
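
The manual route from the list above can be sketched in plain Python. This is an illustration under the assumption that the cache lives in a `.tf` subdirectory of the dataset directory and consists of `.tfx` files, as described in the text. The helper name `clear_tfx` is hypothetical and not part of Text-Fabric. The demo runs against a temporary directory, so no real cache is touched.

```python
import tempfile
from pathlib import Path


def clear_tfx(dataset_dir, dry_run=True):
    """List the .tfx files in dataset_dir/.tf and delete them unless dry_run."""
    cache_dir = Path(dataset_dir) / '.tf'
    victims = sorted(cache_dir.glob('*.tfx'))
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims


# Demo on a throw-away directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / '.tf'
    cache.mkdir()
    (cache / 'otype.tfx').write_bytes(b'')
    (cache / 'notes.txt').write_text('keep me')
    removed = [p.name for p in clear_tfx(tmp, dry_run=False)]
    left = sorted(p.name for p in cache.iterdir())

print(removed, left)
```

The `dry_run` default makes the function report what it would delete before anything is removed, which is a cautious choice for a hidden directory.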