<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>

# Trees - for BHSA data (Hebrew)

## Example

This notebook makes use of the syntax trees composed by 
[trees.ipynb](trees.ipynb).

The feature `tree` holds for each sentence a
[Penn Treebank notation](https://en.wikipedia.org/wiki/Treebank) structure,
like this (Genesis 1:1):

```
(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
```

The numbers are the leaf nodes, numbered from 0 within the sentence. You can replace them by concrete words by adding the slot number of the 
first word of the sentence to each of them, and then looking up a textual feature for the resulting slot.

Like this

```
(S(C(PP(pp בְּ)(n רֵאשִׁ֖ית))(VP(vb בָּרָ֣א))(NP(n אֱלֹהִ֑ים))(PP(U(pp אֵ֥ת)(dt הַ)(n שָּׁמַ֖יִם))(cj וְ)(U(pp אֵ֥ת)(dt הָ)(n אָֽרֶץ)))))
```

or this

```
(S(C(PP(pp bᵊ)(n rēšˌîṯ))(VP(vb bārˈā))(NP(n ʔᵉlōhˈîm))(PP(U(pp ʔˌēṯ)(dt ha)(n ššāmˌayim))(cj wᵊ)(U(pp ʔˌēṯ)(dt hā)(n ʔˈāreṣ)))))
```

or even this

```
 1  S
 2    C
 3      PP
 4        pp "בְּ" "[in]"
 4        n "רֵאשִׁ֖ית" "[beginning]"
 3      VP
 4        vb "בָּרָ֣א" "[create]"
 3      NP
 4        n "אֱלֹהִ֑ים" "[god(s)]"
 3      PP
 4        U
 5          pp "אֵ֥ת" "[<object marker>]"
 5          dt "הַ" "[the]"
 5          n "שָּׁמַ֖יִם" "[heavens]"
 4        cj "וְ" "[and]"
 4        U
 5          pp "אֵ֥ת" "[<object marker>]"
 5          dt "הָ" "[the]"
 5          n "אָֽרֶץ" "[earth]"
```

We shall show how.

And you can now investigate the "bare" syntax of sentences: group them, filter them.
It turns out that there are a bit less than half as many distinct tree structures as there are sentences
(28096 against 63711), while a few structures occur more than a thousand times.

We finish off by listing the sentences whose trees have a depth of at least 12.

In [3]:
%load_ext autoreload
%autoreload 2

import sys
import os
import collections
import re

from tf.fabric import Fabric

from utils import structure, layout

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Load data
We load some features of the
[BHSA](https://github.com/etcbc/bhsa) data.
See the [feature documentation](https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html) for more info.

In [4]:
VERSION = '2017'
BHSA = 'BHSA/tf/{}'.format(VERSION)
TREES = 'lingo/trees/tf/{}'.format(VERSION)
PHONO = 'phono/tf/{}'.format(VERSION)

In [15]:
TF = Fabric(locations='~/github/etcbc', modules=[BHSA, TREES, PHONO])
api = TF.load('''
    g_word_utf8 g_cons_utf8 gloss
    phono
    tree
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

119 features found and 0 ignored
  0.00s loading features ...
   |     0.19s B g_cons_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.21s B phono                from /Users/dirk/github/etcbc/phono/tf/2017
   |     0.01s B gloss                from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.06s B tree                 from /Users/dirk/github/etcbc/lingo/trees/tf/2017
   |     0.00s Feature overview: 112 for nodes; 5 for edges; 2 configs; 7 computed
  5.23s All features loaded/computed - for details use loadLog()


# Printing a tree - step by step

Let us print the tree of Genesis 1:1, with fully pointed terminals, consonantal ones, and phonetic ones.

In [43]:
passage = ('Genesis', 1, 1)
verseNode = T.nodeFromSection(passage)
sentenceNode = L.d(verseNode, otype='sentence')[0]
firstSlot = L.d(sentenceNode, otype='word')[0]
rawTree = F.tree.v(sentenceNode)
print('{} {}:{} - first word = {}\ntree = {}'.format(*passage, firstSlot, rawTree))

Genesis 1:1 - first word = 1
tree = (S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))


Now we replace the leaf numbers by word representations.

First we define a helper function that replaces every number in a string by a piece of text
computed from the value of that number.

In [44]:
numPattern = re.compile('[0-9]+')

def fillWords(tree, start, wordRep):
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return numPattern.sub(numReplace, tree)

Now we can get the phonetic representation as a matter of filling in the dots:

In [45]:
fillWords(rawTree, firstSlot, F.phono.v)

'(S(C(PP(pp bᵊ)(n rēšˌîṯ))(VP(vb bārˈā))(NP(n ʔᵉlōhˈîm))(PP(U(pp ʔˌēṯ)(dt ha)(n ššāmˌayim))(cj wᵊ)(U(pp ʔˌēṯ)(dt hā)(n ʔˈāreṣ)))))'

Likewise the consonantal representation:

In [46]:
fillWords(rawTree, firstSlot, F.g_cons_utf8.v)

'(S(C(PP(pp ב)(n ראשׁית))(VP(vb ברא))(NP(n אלהים))(PP(U(pp את)(dt ה)(n שׁמים))(cj ו)(U(pp את)(dt ה)(n ארץ)))))'

And the fully pointed one:

In [47]:
fillWords(rawTree, firstSlot, F.g_word_utf8.v)

'(S(C(PP(pp בְּ)(n רֵאשִׁ֖ית))(VP(vb בָּרָ֣א))(NP(n אֱלֹהִ֑ים))(PP(U(pp אֵ֥ת)(dt הַ)(n שָּׁמַ֖יִם))(cj וְ)(U(pp אֵ֥ת)(dt הָ)(n אָֽרֶץ)))))'
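
The `fillWords` helper above matches every run of digits with `[0-9]+`. That works as long as no node label contains a digit. As a minimal sketch, assuming one wants to guard against such labels, the variant below only matches a number that sits between a space and a closing bracket. The names `leafPattern`, `fillLeaves`, `toyWords` and `toyTree` are hypothetical and not part of the notebook, and the toy dictionary stands in for a Text-Fabric feature such as `F.phono.v`.

```python
import re

# Match a number only when it is a leaf: preceded by a space, followed by ')'
leafPattern = re.compile(r'(?<= )[0-9]+(?=\))')

def fillLeaves(tree, start, wordRep):
    """Replace each leaf number n in tree by wordRep(n + start)."""
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return leafPattern.sub(numReplace, tree)

# Toy data standing in for a feature lookup; slot numbers start at 1
toyWords = {1: 'in', 2: 'beginning', 3: 'create'}
toyTree = '(S(C(PP(pp 0)(n 1))(VP(vb 2))))'

filled = fillLeaves(toyTree, 1, toyWords.get)
print(filled)
```

With the labels used in this notebook both patterns give the same result; the stricter one only matters if a label such as `X2` ever shows up.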

# Multiline display

In many cases, multiline display is better.
In order to do that, we need to parse the brackets.

Next to this notebook is a module `utils.py` with the function `structure(rawTree)` which 
delivers a nested list that corresponds to the tree structure.

Here you see it in action.

In [48]:
structure(rawTree)

['S',
 ['C',
  ['PP', [('pp', 0)], [('n', 1)]],
  ['VP', [('vb', 2)]],
  ['NP', [('n', 3)]],
  ['PP',
   ['U', [('pp', 4)], [('dt', 5)], [('n', 6)]],
   [('cj', 7)],
   ['U', [('pp', 8)], [('dt', 9)], [('n', 10)]]]]]
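
The source of `utils.py` is not shown here, so the following is only a sketch of how such a bracket parser could work, not the actual implementation of `structure`. It assumes that every node is either `(label child child ...)` or `(label number)`, and it produces the same kind of nested list as shown above. The name `parseTree` is hypothetical.

```python
def parseTree(text):
    """Parse a bracketed tree string into nested lists.

    Inner nodes become [label, child, ...]; leaves become [(label, number)].
    """
    pos = 0

    def parseNode():
        nonlocal pos
        if text[pos] != '(':
            raise ValueError('expected "(" at position {}'.format(pos))
        pos += 1
        # The label runs up to the next space or bracket
        start = pos
        while text[pos] not in ' ()':
            pos += 1
        label = text[start:pos]
        if text[pos] == ' ':
            # Leaf: a label followed by a number
            pos += 1
            start = pos
            while text[pos] != ')':
                pos += 1
            node = [(label, int(text[start:pos]))]
        else:
            # Inner node: a label followed by its child nodes
            node = [label]
            while text[pos] == '(':
                node.append(parseNode())
        if text[pos] != ')':
            raise ValueError('expected ")" at position {}'.format(pos))
        pos += 1
        return node

    result = parseNode()
    if pos != len(text):
        raise ValueError('unexpected text after position {}'.format(pos))
    return result


genesisTree = (
    '(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))'
    '(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))'
)
parsed = parseTree(genesisTree)
print(parsed)
```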

We can display it in a friendlier way with the `layout(structuredTree, firstSlot, terminalRep)` function.
The `terminalRep` function should take a number and return a string. The number is the slot number of
a leaf node. Here we just represent the number as a numeral.

In [49]:
print(layout(structure(rawTree), firstSlot, str))

  S
    C
      PP
        pp 1
        n 2
      VP
        vb 3
      NP
        n 4
      PP
        U
          pp 5
          dt 6
          n 7
        cj 8
        U
          pp 9
          dt 10
          n 11
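
To make the role of `firstSlot` and `terminalRep` concrete, here is a rough sketch of a function that produces this kind of indented output from a nested list. It is an assumption about how such a layout could be written, not the code of `layout` in `utils.py`, and details such as trailing newlines may differ. The name `layoutSketch` and the small tree `toyStruct` are hypothetical.

```python
def layoutSketch(tree, firstSlot, terminalRep, withLevel=False, level=1):
    """Render a nested-list tree as indented lines, one node per line."""
    indent = '  ' * level
    prefix = '{:>2}'.format(level) if withLevel else ''
    head = tree[0]
    if isinstance(head, tuple):
        # Leaf: turn the relative number into a slot number, then render it
        (label, number) = head
        return '{}{}{} {}'.format(
            prefix, indent, label, terminalRep(number + firstSlot),
        )
    # Inner node: its own label first, then all children one level deeper
    lines = ['{}{}{}'.format(prefix, indent, head)]
    for child in tree[1:]:
        lines.append(
            layoutSketch(child, firstSlot, terminalRep, withLevel, level + 1)
        )
    return '\n'.join(lines)


toyStruct = ['S', ['C', ['VP', [('vb', 0)]], ['NP', [('n', 1)]]]]
plain = layoutSketch(toyStruct, 1, str)
levelled = layoutSketch(toyStruct, 1, str, withLevel=True)
print(plain)
print(levelled)
```

Because the slot arithmetic happens inside the layout function, `terminalRep` only ever sees real slot numbers, which is why a feature lookup can be passed in directly.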


We can now easily substitute in full word representations, as we did above.

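Let's get fancy: we want to represent each word with its fully pointed Hebrew text between quotes and
a gloss between square brackets.
(Despite its name, the `phonoGloss` function below uses `g_word_utf8`; use `F.phono.v(n)` instead
if you want the phonetic transcription.)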

And while we're at it, we also want to see the level of the nodes in question.

The only thing we have to write is a function that gives this back if we give it a slot number.

In [54]:
def phonoGloss(n):
    return '"{}" "[{}]"'.format(
        F.g_word_utf8.v(n),
        F.gloss.v(L.u(n, otype='lex')[0]),
    )

In [55]:
print(layout(structure(rawTree), firstSlot, phonoGloss, withLevel=True))

 1  S
 2    C
 3      PP
 4        pp "בְּ" "[in]"
 4        n "רֵאשִׁ֖ית" "[beginning]"
 3      VP
 4        vb "בָּרָ֣א" "[create]"
 3      NP
 4        n "אֱלֹהִ֑ים" "[god(s)]"
 3      PP
 4        U
 5          pp "אֵ֥ת" "[<object marker>]"
 5          dt "הַ" "[the]"
 5          n "שָּׁמַ֖יִם" "[heavens]"
 4        cj "וְ" "[and]"
 4        U
 5          pp "אֵ֥ת" "[<object marker>]"
 5          dt "הָ" "[the]"
 5          n "אָֽרֶץ" "[earth]"


# Exploring the space of tree structures

Let us see how many distinct tree structures we've got.

In [56]:
trees = collections.Counter()
for sNode in F.otype.s('sentence'):
    trees[F.tree.v(sNode)] += 1
print('{} distinct structures'.format(len(trees)))

28096 distinct structures


Let's see the most frequent structures.

In [57]:
for (tree, nOccs) in sorted(trees.items(), key=lambda x: (-x[1], x[0])):
    if nOccs < 50: break
    print('{:>4} x {}'.format(nOccs, tree))

3772 x (S(C(CP(cj 0))(VP(vb 1))))
1238 x (S(C(VP(vb 0))))
1173 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))))
 857 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n 3))))
 749 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
 577 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))))
 568 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(dt 3)(n 4))))
 554 x (S(C(VP(vb 0))(NP(n 1))))
 441 x (S(C(CP(cj 0))(NegP(ng 1))(VP(vb 2))))
 406 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n-pr 3))))
 326 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))(PP(pp 3)(n-pr 4))))
 314 x (S(C(NegP(ng 0))(VP(vb 1))))
 310 x (S(C(CP(cj 0))(VP(vb 1))(NP(dt 2)(n 3))))
 274 x (S(C(VP(vb 0))(PP(pp 1)(n 2))))
 259 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(U(n 3))(U(n 4)))))
 257 x (S(C(NP(U(n 0))(U(n-pr 1)))))
 238 x (S(C(CP(cj 0))(AdvP(av 1))))
 227 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))(NP(n 3))))
 226 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(U(n 3))(U(n-pr 4)))))
 216 x (S(C(NP(n 0))(VP(vb 1))))
 207 x (S(C(CP(cj 0))(NP(n 1))(VP(vb 2))))
 198 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(n 4))))

# Distribution by depth

We show the frequency distribution of the depths of trees.

We have to define a depth function based on the structure of a tree.

In [66]:
def depth(structs):
    if type(structs) is list:
        return max(depth(struct) for struct in structs) + 1
    else:
        return 0
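
As a quick check of how this function counts, the sketch below applies it to the nested list of Genesis 1:1 shown earlier. Labels and `(label, number)` tuples count as 0, and every enclosing list adds 1, including the list that wraps a leaf. The depth therefore equals the deepest level number printed by `layout(..., withLevel=True)`. The variable names `genesisStruct` and `smallest` are hypothetical.

```python
def depth(structs):
    # Same definition as in the cell above
    if type(structs) is list:
        return max(depth(struct) for struct in structs) + 1
    else:
        return 0

genesisStruct = [
    'S',
    ['C',
     ['PP', [('pp', 0)], [('n', 1)]],
     ['VP', [('vb', 2)]],
     ['NP', [('n', 3)]],
     ['PP',
      ['U', [('pp', 4)], [('dt', 5)], [('n', 6)]],
      [('cj', 7)],
      ['U', [('pp', 8)], [('dt', 9)], [('n', 10)]]]],
]

# A one-word sentence: S > C > VP > vb
smallest = ['S', ['C', ['VP', [('vb', 0)]]]]

print(depth(genesisStruct))  # 5
print(depth(smallest))       # 4
```

Note that `max` raises a `ValueError` on an empty list, so the function relies on every list having at least one element.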

In [67]:
depths = collections.Counter()
for s in F.otype.s('sentence'):
    d = depth(structure(F.tree.v(s)))
    depths[d] += 1

In [70]:
for (d, amount) in sorted(depths.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} sentence{} of depth {:>2}'.format(amount, ' ' if amount == 1 else 's', d))

30616 sentences of depth  4
19107 sentences of depth  5
 8655 sentences of depth  6
 3499 sentences of depth  7
 1148 sentences of depth  8
  472 sentences of depth  9
  138 sentences of depth 10
   56 sentences of depth 11
   17 sentences of depth 12
    2 sentences of depth 13
    1 sentence  of depth 14
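
The table above can also be read cumulatively. The sketch below copies the counts from that output into a plain dictionary, so it runs without Text-Fabric, and computes for each depth how many sentences are at least that deep. The names `depthCounts` and `atLeast` are hypothetical.

```python
# Counts copied from the output above: depth -> number of sentences
depthCounts = {
    4: 30616, 5: 19107, 6: 8655, 7: 3499, 8: 1148, 9: 472,
    10: 138, 11: 56, 12: 17, 13: 2, 14: 1,
}

total = sum(depthCounts.values())

# Accumulate from the deepest trees downwards
atLeast = {}
running = 0
for d in sorted(depthCounts, reverse=True):
    running += depthCounts[d]
    atLeast[d] = running

print('{} sentences in total'.format(total))
for d in sorted(atLeast):
    print('{:>5} sentences ({:6.2f}%) of depth >= {:>2}'.format(
        atLeast[d], 100 * atLeast[d] / total, d,
    ))
```

The total comes to 63711 sentences, of which 214 have depth 10 or more and 20 have depth 12 or more.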


## The "deepest" trees
We gather all trees with a depth of at least 12.

In [71]:
limit = 12
deepTrees = []
for s in F.otype.s('sentence'):
    d = depth(structure(F.tree.v(s)))
    if d >= limit:
        deepTrees.append((s, d))
print('There are {} sentences with a tree depth of at least {}'.format(
    len(deepTrees),
    limit,
))

There are 20 sentences with a tree depth of at least 12


In [72]:
for (i, (s, d)) in enumerate(sorted(deepTrees, key=lambda x: (-x[1], x[0]))):
    firstSlot = L.d(s, otype='word')[0]
    print('\n{} {}:{} depth {:>3}'.format(
        *T.sectionFromNode(s), 
        d,
    ))
    if i > 1: continue
    print('{}\n'.format(
        layout(structure(F.tree.v(s)), firstSlot, phonoGloss, withLevel=True),
    ))


Jeremiah 44:2 depth  14
 1  S
 2    C
 3      PPrP
 4        pr-ps "אַתֶּ֣ם" "[you]"
 3      VP
 4        vb "רְאִיתֶ֗ם" "[see]"
 3      PP
 4        pp "אֵ֤ת" "[<object marker>]"
 4        U
 5          n "כָּל" "[whole]"
 4        U
 5          dt "הָֽ" "[the]"
 5          n "רָעָה֙" "[evil]"
 4        Cattr
 5          CP
 6            cj "אֲשֶׁ֤ר" "[<relative>]"
 5          VP
 6            vb "הֵבֵ֨אתִי֙" "[come]"
 5          PP
 6            U
 7              pp "עַל" "[upon]"
 7              n-pr "יְר֣וּשָׁלִַ֔ם" "[Jerusalem]"
 6            cj "וְ" "[and]"
 6            U
 7              pp "עַ֖ל" "[upon]"
 7              U
 8                n "כָּל" "[whole]"
 7              U
 8                U
 9                  n "עָרֵ֣י" "[town]"
 8                U
 9                  n-pr "יְהוּדָ֑ה" "[Judah]"
 5          PP
 6            pp "מִ" "[from]"
 6            U
 7              n "פְּנֵ֣י" "[face]"
 6            U
 7              n "רָעָתָ֗ם" "[evil]"
 7              Cattr
 8 

And this is just the beginning...