<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>

# Trees - for BHSA data (Hebrew)

## Example

This notebook makes use of the syntax trees composed by 
[trees.ipynb](trees.ipynb).

The feature `tree` holds for each sentence a
[Penn Treebank notation](https://en.wikipedia.org/wiki/Treebank) structure,
like this (Genesis 1:1):

```
(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
```

The numbers are the leaf nodes, counted from the start of the sentence. To replace a number by a concrete word,
add the slot number of the first word of the sentence to it, and then substitute the value of a textual feature for the resulting slot.

Like this:

```
(S(C(PP(pp בְּ)(n רֵאשִׁ֖ית))(VP(vb בָּרָ֣א))(NP(n אֱלֹהִ֑ים))(PP(U(pp אֵ֥ת)(dt הַ)(n שָּׁמַ֖יִם))(cj וְ)(U(pp אֵ֥ת)(dt הָ)(n אָֽרֶץ)))))
```

or this:

```
(S(C(PP(pp bᵊ)(n rēšˌîṯ))(VP(vb bārˈā))(NP(n ʔᵉlōhˈîm))(PP(U(pp ʔˌēṯ)(dt ha)(n ššāmˌayim))(cj wᵊ)(U(pp ʔˌēṯ)(dt hā)(n ʔˈāreṣ)))))
```

We shall show how.

You can now investigate the "bare" syntax of sentences: group them and filter them.
It turns out that there are roughly half as many distinct structures as there are sentences.
And some sentence structures occur thousands of times...

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
import collections
import re

from tf.fabric import Fabric

# Load data
We load some features of the
[BHSA](https://github.com/etcbc/bhsa) data.
See the [feature documentation](https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html) for more info.

In [2]:
VERSION = '2017'
BHSA = 'BHSA/tf/{}'.format(VERSION)
TREES = 'lingo/trees/tf/{}'.format(VERSION)
PHONO = 'phono/tf/{}'.format(VERSION)

In [3]:
TF = Fabric(locations='~/github/etcbc', modules=[BHSA, TREES, PHONO])
api = TF.load('''
    g_word_utf8
    g_cons_utf8
    phono
    tree
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

119 features found and 0 ignored
  0.00s loading features ...
   |     0.19s B g_cons_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.21s B phono                from /Users/dirk/github/etcbc/phono/tf/2017
   |     0.30s T tree                 from /Users/dirk/github/etcbc/lingo/trees/tf/2017
   |     0.00s Feature overview: 112 for nodes; 5 for edges; 2 configs; 7 computed
  5.09s All features loaded/computed - for details use loadLog()


# Printing a tree - step by step

Let us print the tree of Genesis 1:1 with fully pointed terminals, consonantal ones, and phonetic ones.

In [5]:
passage = ('Genesis', 1, 1)
verseNode = T.nodeFromSection(passage)
sentenceNode = L.d(verseNode, otype='sentence')[0]
firstSlot = L.d(sentenceNode, otype='word')[0]
rawTree = F.tree.v(sentenceNode)
print('{} {}:{} - first word = {}\ntree = {}'.format(*passage, firstSlot, rawTree))

Genesis 1:1 - first word = 1
tree = (S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))


Now we replace the leaf numbers by word representations.

First we define a helper function that replaces every number in a string by a piece of text
computed from the value of that number.

In [6]:
numPattern = re.compile('[0-9]+')

def fillWords(tree, start, wordRep):
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return numPattern.sub(numReplace, tree)
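To try the helper without loading the corpus, here is a minimal sketch that runs it against a hand-made lookup table. The dictionary `toyWords`, its slot numbers, and the transliterations in it are made up for illustration; in this notebook `wordRep` is a Text-Fabric feature lookup such as `F.phono.v`.

```python
import re

numPattern = re.compile('[0-9]+')

def fillWords(tree, start, wordRep):
    # Replace each leaf number by the representation of slot (leaf number + start)
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return numPattern.sub(numReplace, tree)

# Hypothetical data: a two-word sentence whose first word sits at slot 12
toyWords = {12: 'wa', 13: 'yyomer'}
toyTree = '(S(C(CP(cj 0))(VP(vb 1))))'

# Leaf 0 maps to slot 12, leaf 1 to slot 13
filled = fillWords(toyTree, 12, toyWords.get)
print(filled)
```

The substitution works because the node labels in the tree strings shown here contain no digits, so every run of digits is a leaf number.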

Now we can get the phonetic representation as a matter of filling in the dots:

In [7]:
fillWords(rawTree, firstSlot, F.phono.v)

'(S(C(PP(pp bᵊ)(n rēšˌîṯ))(VP(vb bārˈā))(NP(n ʔᵉlōhˈîm))(PP(U(pp ʔˌēṯ)(dt ha)(n ššāmˌayim))(cj wᵊ)(U(pp ʔˌēṯ)(dt hā)(n ʔˈāreṣ)))))'

Likewise the consonantal representation:

In [9]:
fillWords(rawTree, firstSlot, F.g_cons_utf8.v)

'(S(C(PP(pp ב)(n ראשׁית))(VP(vb ברא))(NP(n אלהים))(PP(U(pp את)(dt ה)(n שׁמים))(cj ו)(U(pp את)(dt ה)(n ארץ)))))'

And the fully pointed one:

In [10]:
fillWords(rawTree, firstSlot, F.g_word_utf8.v)

'(S(C(PP(pp בְּ)(n רֵאשִׁ֖ית))(VP(vb בָּרָ֣א))(NP(n אֱלֹהִ֑ים))(PP(U(pp אֵ֥ת)(dt הַ)(n שָּׁמַ֖יִם))(cj וְ)(U(pp אֵ֥ת)(dt הָ)(n אָֽרֶץ)))))'
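The one-line bracket strings are compact but hard to read. As an illustration only, the sketch below parses such a string into nested tuples and prints one node per line with indentation. The names `parseTree` and `showTree` are hypothetical helpers, not part of Text-Fabric, and the parser assumes well-formed input in which labels and leaves contain no brackets or spaces.

```python
import re

# Tokens: an opening bracket, a closing bracket, or a run of other non-space characters
tokenPattern = re.compile(r'\(|\)|[^()\s]+')

def parseTree(treeStr):
    """Turn a bracketed tree string into nested (label, children) tuples.

    A terminal such as (pp 0) becomes ('pp', ['0']).
    """
    tokens = tokenPattern.findall(treeStr)
    pos = 0

    def parseNode():
        nonlocal pos
        pos += 1                      # skip '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(parseNode())
            else:
                children.append(tokens[pos])   # leaf value
                pos += 1
        pos += 1                      # skip ')'
        return (label, children)

    return parseNode()

def showTree(node, depth=0):
    """Return the tree as a list of indented lines."""
    (label, children) = node
    indent = '  ' * depth
    if children and all(isinstance(c, str) for c in children):
        return [indent + label + ' ' + ' '.join(children)]
    lines = [indent + label]
    for child in children:
        if isinstance(child, str):
            lines.append('  ' * (depth + 1) + child)
        else:
            lines.extend(showTree(child, depth + 1))
    return lines

parsed = parseTree('(S(C(CP(cj 0))(VP(vb 1))))')
shown = showTree(parsed)
print('\n'.join(shown))
```

Since the parser only looks at brackets and whitespace, it can be applied to a raw tree or to the result of `fillWords`.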

# Exploring the space of tree structures

Let us see how many distinct tree structures we've got.

In [14]:
trees = collections.Counter()
for sNode in F.otype.s('sentence'):
    trees[F.tree.v(sNode)] += 1
print('{} distinct structures'.format(len(trees)))

28096 distinct structures


Let's see the most frequent structures.

In [15]:
for (tree, nOccs) in sorted(trees.items(), key=lambda x: (-x[1], x[0])):
    if nOccs < 50: break
    print('{:>4} x {}'.format(nOccs, tree))

3772 x (S(C(CP(cj 0))(VP(vb 1))))
1238 x (S(C(VP(vb 0))))
1173 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))))
 857 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n 3))))
 749 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
 577 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))))
 568 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(dt 3)(n 4))))
 554 x (S(C(VP(vb 0))(NP(n 1))))
 441 x (S(C(CP(cj 0))(NegP(ng 1))(VP(vb 2))))
 406 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n-pr 3))))
 326 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))(PP(pp 3)(n-pr 4))))
 314 x (S(C(NegP(ng 0))(VP(vb 1))))
 310 x (S(C(CP(cj 0))(VP(vb 1))(NP(dt 2)(n 3))))
 274 x (S(C(VP(vb 0))(PP(pp 1)(n 2))))
 259 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(U(n 3))(U(n 4)))))
 257 x (S(C(NP(U(n 0))(U(n-pr 1)))))
 238 x (S(C(CP(cj 0))(AdvP(av 1))))
 227 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))(NP(n 3))))
 226 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(U(n 3))(U(n-pr 4)))))
 216 x (S(C(NP(n 0))(VP(vb 1))))
 207 x (S(C(CP(cj 0))(NP(n 1))(VP(vb 2))))
 198 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(n 4))))
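Filtering works the same way as counting. The sketch below keeps only the structures that lack a `VP` node and adds up their counts. To stay self-contained it uses a small sample of four structures with the counts copied from the listing above; in the notebook you would run it on the full `trees` counter. The function `hasVerbPhrase` is a hypothetical helper, and its substring test is a shortcut that relies on the bracket notation rather than on real parsing.

```python
import collections

# A handful of (structure, count) pairs copied from the listing above
sampleTrees = collections.Counter({
    '(S(C(CP(cj 0))(VP(vb 1))))': 3772,
    '(S(C(VP(vb 0))))': 1238,
    '(S(C(NP(U(n 0))(U(n-pr 1)))))': 257,
    '(S(C(CP(cj 0))(AdvP(av 1))))': 238,
})

def hasVerbPhrase(tree):
    # Assumption: a VP node always shows up as the literal text '(VP('
    return '(VP(' in tree

# Keep the structures without a VP node, together with their counts
verbless = {tree: n for (tree, n) in sampleTrees.items() if not hasVerbPhrase(tree)}
verblessTotal = sum(verbless.values())

for (tree, n) in sorted(verbless.items(), key=lambda x: (-x[1], x[0])):
    print('{:>4} x {}'.format(n, tree))
print('{} sentences without a VP in this sample'.format(verblessTotal))
```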

And this is just the beginning...