<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>

# Trees - for BHSA data (Hebrew)

This notebook makes use of the syntax trees composed by 
[trees.ipynb](trees.ipynb).

The feature `tree` holds for each sentence a
[Penn Treebank notation](https://en.wikipedia.org/wiki/Treebank) structure,
like this (Genesis 1:1):

```
(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
```

## Overview
### Multiple serializations
The numbers are the leaf nodes. You can replace them with concrete words by adding the slot number of the 
first word of the sentence to each number, and then substituting the value of a textual feature for the resulting slot.

Like this

```
(S(C(PP(pp בְּ)(n רֵאשִׁ֖ית))(VP(vb בָּרָ֣א))(NP(n אֱלֹהִ֑ים))(PP(U(pp אֵ֥ת)(dt הַ)(n שָּׁמַ֖יִם))(cj וְ)(U(pp אֵ֥ת)(dt הָ)(n אָֽרֶץ)))))
```

or this

```
(S(C(PP(pp bᵊ)(n rēšˌîṯ))(VP(vb bārˈā))(NP(n ʔᵉlōhˈîm))(PP(U(pp ʔˌēṯ)(dt ha)(n ššāmˌayim))(cj wᵊ)(U(pp ʔˌēṯ)(dt hā)(n ʔˈāreṣ)))))
```

or even this

```
 1  S
 2    C
 3      PP
 4        pp "בְּ" "[in]"
 4        n "רֵאשִׁ֖ית" "[beginning]"
 3      VP
 4        vb "בָּרָ֣א" "[create]"
 3      NP
 4        n "אֱלֹהִ֑ים" "[god(s)]"
 3      PP
 4        U
 5          pp "אֵ֥ת" "[<object marker>]"
 5          dt "הַ" "[the]"
 5          n "שָּׁמַ֖יִם" "[heavens]"
 4        cj "וְ" "[and]"
 4        U
 5          pp "אֵ֥ת" "[<object marker>]"
 5          dt "הָ" "[the]"
 5          n "אָֽרֶץ" "[earth]"
```

We shall show how.

### Structure analysis
You can also investigate the "bare" syntax of sentences, and group and filter sentences by it.
It turns out that there are roughly half as many distinct syntaxes as there are sentences, while
some sentence syntaxes occur thousands of times.

We proceed by showing the sentences whose trees have a depth of at least 12.

### Intrusions and inversions
The word order in the trees is not always the surface word order of the text, due to the way the trees
have been constructed from the BHSA data.
For a description of the issue, see [trees.ipynb](trees.ipynb).

Below we give examples.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
import collections
import re
import operator
from functools import reduce

from tf.fabric import Fabric

from utils import structure, layout

# Load data
We load some features of the
[BHSA](https://github.com/etcbc/bhsa) data.
See the [feature documentation](https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html) for more info.
We also load phonetic representations provided by the
[phono](https://github.com/etcbc/phono) module.

In [2]:
VERSION = '2017'
BHSA = 'BHSA/tf/{}'.format(VERSION)
TREES = 'lingo/trees/tf/{}'.format(VERSION)
OSM = 'bridging/tf/{}'.format(VERSION)
PHONO = 'phono/tf/{}'.format(VERSION)

In [3]:
TF = Fabric(locations='~/github/etcbc', modules=[BHSA, TREES, PHONO, OSM])
api = TF.load('''
    g_word_utf8 g_cons_utf8 gloss
    phono
    osm
    tree
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

121 features found and 0 ignored
  0.00s loading features ...
   |     0.22s B g_cons_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.22s B phono                from /Users/dirk/github/etcbc/phono/tf/2017
   |     0.00s B gloss                from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.05s B tree                 from /Users/dirk/github/etcbc/lingo/trees/tf/2017
   |     0.00s Feature overview: 114 for nodes; 5 for edges; 2 configs; 7 computed
  5.44s All features loaded/computed - for details use loadLog()


# Multiple serializations - step by step
## The raw tree

Let us print the tree of Genesis 1:1.

In [4]:
passage = ('Genesis', 1, 1)
verseNode = T.nodeFromSection(passage)
sentenceNode = L.d(verseNode, otype='sentence')[0]
firstSlot = L.d(sentenceNode, otype='word')[0]
rawTree = F.tree.v(sentenceNode)
print('{} {}:{} - first word = {}\ntree = {}'.format(*passage, firstSlot, rawTree))

Genesis 1:1 - first word = 1
tree = (S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))


## Filling in the numbers
We replace the word numbers in the leaves by actual representations of those words.

First we define a helper function `fillWords`
that finds all numerals in a string so that we can replace them with desired representations.

Note that we have to take the start position of the sentence into account,
since the numbering of the terminal nodes is relative to the sentence.
By adding the start position we obtain absolute positions of the words in the corpus,
and these are the proxies by which Text-Fabric obtains their features.

In [5]:
numPattern = re.compile('[0-9]+')

def fillWords(tree, start, wordRep):
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return numPattern.sub(numReplace, tree)
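
To see the offset at work without loading any corpus data, here is a minimal sketch in which a plain dictionary stands in for a Text-Fabric feature. The names `toyWords` and `toyRep` and the word values are made up for illustration; `fillWords` is repeated only so that the block runs on its own.

```python
import re

numPattern = re.compile('[0-9]+')

def fillWords(tree, start, wordRep):
    # replace every relative number by the representation of the absolute slot
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return numPattern.sub(numReplace, tree)

# hypothetical data: a three-word sentence whose first word sits at slot 37
toyWords = {37: 'and', 38: 'was', 39: 'light'}

def toyRep(slot):
    # stand-in for a feature function such as F.phono.v
    return toyWords[slot]

filled = fillWords('(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))', 37, toyRep)
print(filled)
# (S(C(CP(cj and))(VP(vb was))(NP(n light))))
```

The substitution touches every run of digits in the string, so this approach assumes that no node label contains a digit, which holds for the trees shown in this notebook.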

Now we can get the phonetic representation simply by filling in the numbers.

Note that `F.phono.v` is a *function*, taking a slot number and returning the value of the feature `phono`
for the word occupying that slot number.

In [6]:
fillWords(rawTree, firstSlot, F.phono.v)

'(S(C(PP(pp bᵊ)(n rēšˌîṯ))(VP(vb bārˈā))(NP(n ʔᵉlōhˈîm))(PP(U(pp ʔˌēṯ)(dt ha)(n ššāmˌayim))(cj wᵊ)(U(pp ʔˌēṯ)(dt hā)(n ʔˈāreṣ)))))'

Likewise the consonantal representation:

In [7]:
fillWords(rawTree, firstSlot, F.g_cons_utf8.v)

'(S(C(PP(pp ב)(n ראשׁית))(VP(vb ברא))(NP(n אלהים))(PP(U(pp את)(dt ה)(n שׁמים))(cj ו)(U(pp את)(dt ה)(n ארץ)))))'

And the fully pointed one:

In [8]:
fillWords(rawTree, firstSlot, F.g_word_utf8.v)

'(S(C(PP(pp בְּ)(n רֵאשִׁ֖ית))(VP(vb בָּרָ֣א))(NP(n אֱלֹהִ֑ים))(PP(U(pp אֵ֥ת)(dt הַ)(n שָּׁמַ֖יִם))(cj וְ)(U(pp אֵ֥ת)(dt הָ)(n אָֽרֶץ)))))'

## Multiline display

In many cases, multiline display is better.
In order to do that, we need to parse the brackets.

Next to this notebook is a module `utils.py` with the function `structure(rawTree)` which 
delivers a nested list that corresponds to the tree structure.

Here you see it in action.

In [9]:
structure(rawTree)

['S',
 ['C',
  ['PP', [('pp', 0)], [('n', 1)]],
  ['VP', [('vb', 2)]],
  ['NP', [('n', 3)]],
  ['PP',
   ['U', [('pp', 4)], [('dt', 5)], [('n', 6)]],
   [('cj', 7)],
   ['U', [('pp', 8)], [('dt', 9)], [('n', 10)]]]]]
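
The code of `structure()` is not shown in this notebook. The following is a minimal sketch of how a bracket parser could produce the same kind of nested list; the name `parseTree` is hypothetical and the real function in `utils.py` may well work differently.

```python
import re

# a token is an opening bracket, a closing bracket,
# or a run of characters that are neither brackets nor white space
tokenPattern = re.compile(r'\(|\)|[^()\s]+')

def parseTree(text):
    tokens = tokenPattern.findall(text)
    pos = 0

    def parseGroup():
        nonlocal pos
        if tokens[pos] != '(':
            raise ValueError('expected "(" at token {}'.format(pos))
        pos += 1
        label = tokens[pos]
        pos += 1
        if tokens[pos] in ('(', ')'):
            # non-terminal: the label followed by its parsed children
            node = [label]
            while tokens[pos] == '(':
                node.append(parseGroup())
        else:
            # terminal: a one-element list holding (tag, relative word number)
            node = [(label, int(tokens[pos]))]
            pos += 1
        if tokens[pos] != ')':
            raise ValueError('expected ")" at token {}'.format(pos))
        pos += 1
        return node

    return parseGroup()

parsed = parseTree('(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))')
print(parsed)
# ['S', ['C', ['CP', [('cj', 0)]], ['VP', [('vb', 1)]], ['NP', [('n', 2)]]]]
```

Every pair of brackets becomes one list, which is why a terminal such as `(cj 0)` shows up as a list with a single tuple in it.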

We can display it a bit more friendly with the `layout(structuredTree, firstSlot, terminalRep)` function,
also in `utils.py`.

The `terminalRep` function should take a slot number and return a string.
The number is the slot number of a leaf node.
Here we just represent the number as a numeral.

In [10]:
print(layout(structure(rawTree), firstSlot, str))

  S
    C
      PP
        pp 1
        n 2
      VP
        vb 3
      NP
        n 4
      PP
        U
          pp 5
          dt 6
          n 7
        cj 8
        U
          pp 9
          dt 10
          n 11


Note that the `layout()` function will call `terminalRep()`
with the numbers found at the leaves increased by `firstSlot`.

Hence you do not see the numbers 0-10, but the numbers 1-11.

To illustrate the point, let us compute a layout for the third sentence of Genesis 1:3.

In [11]:
sentence2 = L.d(
    T.nodeFromSection(('Genesis', 1, 3)),
    otype='sentence'
)[2]

words = L.d(
    sentence2,
    otype='word'
)

print(F.tree.v(sentence2))
print(T.text(words))

firstSlot2 = words[0]

print(
    layout(
        structure(
            F.tree.v(
                sentence2,
            ),
        ), firstSlot2, str,
    )
)

(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
וַֽיְהִי־אֹֽור׃ 
  S
    C
      CP
        cj 37
      VP
        vb 38
      NP
        n 39


We can now easily substitute in full word representations, as we did above.

Let's get fancy: we want to represent each word with its phonetic transcription between quotes and
a gloss between square brackets.

The only thing we have to write is a function `phonoGloss` that gives this in return for a slot number.

And while we're at it, we also want to see the level of the nodes in question.
We can achieve this by passing an extra parameter: `withLevel`.

In [12]:
def phonoGloss(n):
    return '"{}" "[{}]"'.format(
        F.phono.v(n),
        F.gloss.v(L.u(n, otype='lex')[0]),
    )

In [13]:
print(layout(structure(rawTree), firstSlot, phonoGloss, withLevel=True))

 1  S
 2    C
 3      PP
 4        pp "bᵊ" "[in]"
 4        n "rēšˌîṯ" "[beginning]"
 3      VP
 4        vb "bārˈā" "[create]"
 3      NP
 4        n "ʔᵉlōhˈîm" "[god(s)]"
 3      PP
 4        U
 5          pp "ʔˌēṯ" "[<object marker>]"
 5          dt "ha" "[the]"
 5          n "ššāmˌayim" "[heavens]"
 4        cj "wᵊ" "[and]"
 4        U
 5          pp "ʔˌēṯ" "[<object marker>]"
 5          dt "hā" "[the]"
 5          n "ʔˈāreṣ" "[earth]"


Why not add the OpenScriptures morphology string for each word?
We add it at the start of each terminal, between braces.

In [14]:
def osmPhonoGloss(n):
    return '{{{}}}"{}" "[{}]"'.format(
        F.osm.v(n),
        F.phono.v(n),
        F.gloss.v(L.u(n, otype='lex')[0]),
    )
print(layout(structure(rawTree), firstSlot, osmPhonoGloss, withLevel=True))

 1  S
 2    C
 3      PP
 4        pp {HR}"bᵊ" "[in]"
 4        n {HNcfsa}"rēšˌîṯ" "[beginning]"
 3      VP
 4        vb {HVqp3ms}"bārˈā" "[create]"
 3      NP
 4        n {HNcmpa}"ʔᵉlōhˈîm" "[god(s)]"
 3      PP
 4        U
 5          pp {HTo}"ʔˌēṯ" "[<object marker>]"
 5          dt {HTd}"ha" "[the]"
 5          n {HNcmpa}"ššāmˌayim" "[heavens]"
 4        cj {HC}"wᵊ" "[and]"
 4        U
 5          pp {HTo}"ʔˌēṯ" "[<object marker>]"
 5          dt {HTd}"hā" "[the]"
 5          n {HNcbsa}"ʔˈāreṣ" "[earth]"


# Structure analysis

Let us see how many distinct tree structures we've got.

In [15]:
trees = collections.Counter()
for sNode in F.otype.s('sentence'):
    trees[F.tree.v(sNode)] += 1
print('{} distinct structures'.format(len(trees)))

28096 distinct structures


Let's see the most frequent structures.

In [16]:
for (tree, nOccs) in sorted(trees.items(), key=lambda x: (-x[1], x[0])):
    if nOccs < 50: break
    print('{:>4} x {}'.format(nOccs, tree))

3772 x (S(C(CP(cj 0))(VP(vb 1))))
1238 x (S(C(VP(vb 0))))
1173 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))))
 857 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n 3))))
 749 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
 577 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))))
 568 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(dt 3)(n 4))))
 554 x (S(C(VP(vb 0))(NP(n 1))))
 441 x (S(C(CP(cj 0))(NegP(ng 1))(VP(vb 2))))
 406 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n-pr 3))))
 326 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))(PP(pp 3)(n-pr 4))))
 314 x (S(C(NegP(ng 0))(VP(vb 1))))
 310 x (S(C(CP(cj 0))(VP(vb 1))(NP(dt 2)(n 3))))
 274 x (S(C(VP(vb 0))(PP(pp 1)(n 2))))
 259 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(U(n 3))(U(n 4)))))
 257 x (S(C(NP(U(n 0))(U(n-pr 1)))))
 238 x (S(C(CP(cj 0))(AdvP(av 1))))
 227 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))(NP(n 3))))
 226 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(U(n 3))(U(n-pr 4)))))
 216 x (S(C(NP(n 0))(VP(vb 1))))
 207 x (S(C(CP(cj 0))(NP(n 1))(VP(vb 2))))
 198 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(n 4))))
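
As an aside, `collections.Counter` has a `most_common()` method that returns the top entries directly. The sketch below uses a small made-up list of tree strings (`toyTrees`) instead of the corpus.

```python
import collections

# hypothetical sample of tree strings, not taken from the corpus
toyTrees = [
    '(S(C(CP(cj 0))(VP(vb 1))))',
    '(S(C(VP(vb 0))))',
    '(S(C(CP(cj 0))(VP(vb 1))))',
    '(S(C(CP(cj 0))(VP(vb 1))))',
    '(S(C(VP(vb 0))))',
    '(S(C(NP(n 0))(VP(vb 1))))',
]

counts = collections.Counter(toyTrees)
top = counts.most_common(2)
for (tree, nOccs) in top:
    print('{:>4} x {}'.format(nOccs, tree))
```

One difference with the explicit sort above: `most_common()` keeps entries with equal counts in the order in which they were first encountered, whereas the sort key `(-x[1], x[0])` orders ties by the tree string.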

## Distribution by depth

We show the frequency distribution of the depths of trees.

We have to define a depth function based on the structure of a tree.

In [17]:
def depth(structs):
    if type(structs) is list:
        return max(depth(struct) for struct in structs) + 1
    else:
        return 0
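
To make clear what this function counts, here is a small worked example on two hand-written structures in the format that `structure()` delivers. The variable names are made up; `depth` is repeated so that the block is self-contained.

```python
def depth(structs):
    if type(structs) is list:
        # a list adds one level on top of its deepest member
        return max(depth(struct) for struct in structs) + 1
    else:
        # labels (strings) and terminals (tuples) add nothing
        return 0

# structure of (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
shallow = ['S', ['C', ['CP', [('cj', 0)]], ['VP', [('vb', 1)]], ['NP', [('n', 2)]]]]

# structure of the Genesis 1:1 tree shown earlier
genesis = [
    'S',
    ['C',
     ['PP', [('pp', 0)], [('n', 1)]],
     ['VP', [('vb', 2)]],
     ['NP', [('n', 3)]],
     ['PP',
      ['U', [('pp', 4)], [('dt', 5)], [('n', 6)]],
      [('cj', 7)],
      ['U', [('pp', 8)], [('dt', 9)], [('n', 10)]]]],
]

print(depth(shallow))  # 4
print(depth(genesis))  # 5
```

The depth is the number of nested brackets on the longest path, including the bracket around the terminal itself: `S`, `C`, `CP` and `(cj 0)` give 4 for the first structure.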

In [18]:
depths = collections.Counter()
for s in F.otype.s('sentence'):
    d = depth(structure(F.tree.v(s)))
    depths[d] += 1

In [19]:
for (d, amount) in sorted(depths.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} sentence{} of depth {:>2}'.format(amount, ' ' if amount == 1 else 's', d))

30616 sentences of depth  4
19107 sentences of depth  5
 8655 sentences of depth  6
 3499 sentences of depth  7
 1148 sentences of depth  8
  472 sentences of depth  9
  138 sentences of depth 10
   56 sentences of depth 11
   17 sentences of depth 12
    2 sentences of depth 13
    1 sentence  of depth 14


## The "deepest" trees
We gather all trees with a depth of at least 12.

In [20]:
limit = 12
deepTrees = []
for s in F.otype.s('sentence'):
    d = depth(structure(F.tree.v(s)))
    if d >= limit:
        deepTrees.append((s, d))
print('There are {} sentences with a tree depth of at least {}'.format(
    len(deepTrees),
    limit,
))

There are 20 sentences with a tree depth of at least 12


In [21]:
for (i, (s, d)) in enumerate(sorted(deepTrees, key=lambda x: (-x[1], x[0]))):
    firstSlot = L.d(s, otype='word')[0]
    print('\n{} {}:{} depth {:>3}'.format(
        *T.sectionFromNode(s), 
        d,
    ))
    if i > 1: continue
    print('{}\n'.format(
        layout(structure(F.tree.v(s)), firstSlot, phonoGloss, withLevel=True),
    ))


Jeremiah 44:2 depth  14
 1  S
 2    C
 3      PPrP
 4        pr-ps "ʔattˈem" "[you]"
 3      VP
 4        vb "rᵊʔîṯˈem" "[see]"
 3      PP
 4        pp "ʔˈēṯ" "[<object marker>]"
 4        U
 5          n "kol-" "[whole]"
 4        U
 5          dt "hˈā" "[the]"
 5          n "rāʕˌā" "[evil]"
 4        Cattr
 5          CP
 6            cj "ʔᵃšˈer" "[<relative>]"
 5          VP
 6            vb "hēvˈēṯî" "[come]"
 5          PP
 6            U
 7              pp "ʕal-" "[upon]"
 7              n-pr "yᵊrˈûšālˈaim" "[Jerusalem]"
 6            cj "wᵊ" "[and]"
 6            U
 7              pp "ʕˌal" "[upon]"
 7              U
 8                n "kol-" "[whole]"
 7              U
 8                U
 9                  n "ʕārˈê" "[town]"
 8                U
 9                  n-pr "yᵊhûḏˈā" "[Judah]"
 5          PP
 6            pp "mi" "[from]"
 6            U
 7              n "ppᵊnˈê" "[face]"
 6            U
 7              n "rāʕāṯˈām" "[evil]"
 7              Cattr
 8            

# Intrusions and inversions

Let us filter out the trees where nodes intrude into each other in such a way that the order of 
terminal nodes is no longer equal to the order of the words in the surface text.

We need a function `terminalOrder` that yields the tuple of numbers found in the terminal nodes if we visit them
from left to right.

The easiest way is to compute it from the structure of a tree.

In [22]:
def terminalOrder(structs):
    if type(structs) is list:
        return reduce(operator.add, (terminalOrder(struct) for struct in structs), ())
    elif type(structs) is tuple:
        return (structs[1],)
    else:
        return ()

The following example shows that `terminalOrder` is capable of doing the job.

In [23]:
structure(rawTree)

['S',
 ['C',
  ['PP', [('pp', 0)], [('n', 1)]],
  ['VP', [('vb', 2)]],
  ['NP', [('n', 3)]],
  ['PP',
   ['U', [('pp', 4)], [('dt', 5)], [('n', 6)]],
   [('cj', 7)],
   ['U', [('pp', 8)], [('dt', 9)], [('n', 10)]]]]]

In [24]:
terminalOrder(structure(rawTree))

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

If a sentence of `n` words does not yield exactly the tuple `(0 .. n-1)`,
then the order is disturbed.

In [25]:
abnormalTrees = []

for s in F.otype.s('sentence'):
    treeOrder = terminalOrder(structure(F.tree.v(s)))
    norm = tuple(range(len(treeOrder)))
    if treeOrder != norm:
        abnormalTrees.append(s)

print('{} trees have an abnormal terminal order'.format(len(abnormalTrees)))

1999 trees have an abnormal terminal order
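
The same test can be tried on a hand-made structure in which an attributive clause has been placed before the words that precede it in the text. The structure `toyStructure` below is hypothetical and not taken from the corpus; `terminalOrder` is repeated so that the block runs on its own.

```python
import operator
from functools import reduce

def terminalOrder(structs):
    if type(structs) is list:
        # concatenate the tuples delivered by the members, left to right
        return reduce(operator.add, (terminalOrder(struct) for struct in structs), ())
    elif type(structs) is tuple:
        # a terminal (tag, number) contributes its number
        return (structs[1],)
    else:
        # a label contributes nothing
        return ()

# hypothetical tree: words 4 and 5 are attached before words 2 and 3
toyStructure = [
    'S',
    ['C',
     ['VP', [('vb', 0)]],
     ['NP', [('n', 1)], ['Cattr', ['CP', [('cj', 4)]], ['VP', [('vb', 5)]]]],
     ['PP', [('pp', 2)], [('n', 3)]]],
]

order = terminalOrder(toyStructure)
isNormal = order == tuple(range(len(order)))
print(order, isNormal)
# (0, 1, 4, 5, 2, 3) False
```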


Let us inspect the first 5 of them.

We would like to see the phonetic representation and glosses, as `phonoGloss()` provides (see above),
but we also want to show the sequence number of the word in the sentence.
For that we define `phonoGlossSeq()`.

In [26]:
def phonoGlossSeq(firstSlot):
    return lambda n: '{} {}'.format(n - firstSlot, phonoGloss(n))

In [27]:
for s in abnormalTrees[0:5]:
    firstSlot = L.d(s, otype='word')[0]
    print('{} {}:{}\n{}\n'.format(
        *T.sectionFromNode(s),
        layout(structure(F.tree.v(s)), firstSlot, phonoGlossSeq(firstSlot))
    ))

Genesis 1:11
  S
    C
      VP
        vb 0 "tˈaḏšˈē" "[grow green]"
      NP
        dt 1 "hā" "[the]"
        n 2 "ʔˈāreṣ" "[earth]"
      NP
        n 3 "dˈeše" "[young grass]"
        n 4 "ˈʕēśev" "[herb]"
        Cattr
          VP
            vb 5 "mazrˈîₐʕ" "[sow]"
          NP
            n 6 "zˈeraʕ" "[seed]"
        U
          n 7 "ʕˈēṣ" "[tree]"
        U
          n 8 "pᵊrˈî" "[fruit]"
        Cattr
          VP
            vb 9 "ʕˈōśeh" "[make]"
          NP
            n 10 "pᵊrˌî" "[fruit]"
            Cattr
              CP
                cj 13 "ʔᵃšˌer" "[<relative>]"
              NP
                n 14 "zarʕô-" "[seed]"
              PP
                pp 15 "vˌô" "[in]"
          PP
            pp 11 "lᵊ" "[to]"
            n 12 "mînˈô" "[kind]"
          PP
            pp 16 "ʕal-" "[upon]"
            dt 17 "hā" "[the]"
            n 18 "ʔˈāreṣ" "[earth]"

Genesis 1:29
  S
    C
      InjP
        ij 0 "hinnˌē" "[behold]"
      VP
        vb 1 "nāṯˌattî" "[give

Now you are peeking into the kitchen of the ETCBC encoders.
Fascinating, isn't it?

In [28]:
T.formats

{'lex-orig-full',
 'lex-orig-plain',
 'lex-trans-full',
 'lex-trans-plain',
 'text-orig-full',
 'text-orig-full-ketiv',
 'text-orig-plain',
 'text-phono-full',
 'text-trans-full',
 'text-trans-full-ketiv',
 'text-trans-plain'}

In [36]:
T.text(
    L.d(
        T.nodeFromSection(('Lamentations', 1, 1)),
        otype='word',
    )[0:3],
    fmt='text-trans-plain',
)

'>JKH05 JCBH BDD '