# Add phrase and clause nodes

The data for clauses and phrases comes from CSV files prepared by Martijn Naaijer.

We interpret the CSV and compile it into input data for the
[modify](https://annotation.github.io/text-fabric/compose/modify.html)
function.

Adding phrases and clauses is just a single call to that modify function, which turns DSS version 0.8 into 0.9.

In [1]:
%load_ext autoreload
%autoreload 2

In [49]:
import pprint as pp

from tf.app import use
from addBoundariesFromNaaijer import readBoundariesPlain

PP = pp.PrettyPrinter(indent=2)


def pprint(x):
    PP.pprint(x)

In [93]:
A = use("dss:hot", checkout="clone", hoist=globals(), version="0.8")

rate limit is 5000 requests per hour, with 5000 left for this hour
	connecting to online GitHub repo annotation/app-dss ... connected
	code/app.py...downloaded
	code/config.yaml...downloaded
	code/static...directory
		code/static/display.css...downloaded
		code/static/logo.png...downloaded
	OK


# Check

We check whether the ISA clause/word boundary file mentions the correct word nodes.

First a visual inspection of the first 10 words.

In [51]:
F.otype.v(firstWord)  # noqa F821

'word'

In [52]:
T.formats

{'lex-default': 'word',
 'lex-orig-full': 'word',
 'lex-source-full': 'word',
 'lex-trans-full': 'word',
 'morph-source-full': 'word',
 'text-orig-extra': 'word',
 'text-orig-full': 'sign',
 'text-source-extra': 'word',
 'text-source-full': 'sign',
 'text-trans-extra': 'word',
 'text-trans-full': 'sign',
 'layout-orig-full': 'sign',
 'layout-source-full': 'sign',
 'layout-trans-full': 'sign'}

In [53]:
FMT = "text-trans-full"

In [54]:
firstWord = 1894861
for w in range(firstWord, firstWord + 10):
    rep = T.text(w, fmt=FMT)
    print(f"{w} = {rep}")

1894861 = XZWn 
1894862 = J#<JHW 
1894863 = Bn 
1894864 = >MWy 
1894865 = >#R 
1894866 = XZH 
1894867 = <L 
1894868 = JHWDH 
1894869 = W
1894870 = JRW#Lm 


The corresponding rows in the data file read:

```
id,scroll,book,verse,word,clause_nr,phrase_nr
1894861,1Qisaa,Isaiah,1,XZWN,1,1
1894862,1Qisaa,Isaiah,1,JC<JHW,1,1
1894863,1Qisaa,Isaiah,1,BN,1,1
1894864,1Qisaa,Isaiah,1,>MWY,1,1
1894865,1Qisaa,Isaiah,1,>CR,2,2
1894866,1Qisaa,Isaiah,1,XZH,2,3
1894867,1Qisaa,Isaiah,1,<L,2,4
1894868,1Qisaa,Isaiah,1,JHWDH,2,4
1894869,1Qisaa,Isaiah,1,W,2,4
1894870,1Qisaa,Isaiah,1,JRWCLM,2,4
```

We are going to check whether:

1. the word nodes (first column) increase by one for each subsequent line
2. the word in the file equals the word according to TF, modulo a small transformation:
    * C and F in the file both stand for #
    * we ignore case differences (relevant in the last letter)
    * we ignore apostrophes in the TF text
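
To make the two checks concrete, here is a minimal standalone sketch that applies them to the ten sample rows only, without loading TF. The CSV rows and the TF renderings are copied from the listings above; the helper names (`normalize_file_word`, `normalize_tf_text`, `check_rows`) are made up for this illustration and are not part of the actual script.

```python
import csv
import io

# The first ten data rows, copied from the listing above.
SAMPLE_CSV = """\
id,scroll,book,verse,word,clause_nr,phrase_nr
1894861,1Qisaa,Isaiah,1,XZWN,1,1
1894862,1Qisaa,Isaiah,1,JC<JHW,1,1
1894863,1Qisaa,Isaiah,1,BN,1,1
1894864,1Qisaa,Isaiah,1,>MWY,1,1
1894865,1Qisaa,Isaiah,1,>CR,2,2
1894866,1Qisaa,Isaiah,1,XZH,2,3
1894867,1Qisaa,Isaiah,1,<L,2,4
1894868,1Qisaa,Isaiah,1,JHWDH,2,4
1894869,1Qisaa,Isaiah,1,W,2,4
1894870,1Qisaa,Isaiah,1,JRWCLM,2,4
"""

# What TF printed for the same ten word nodes (format text-trans-full).
TF_TEXTS = [
    "XZWn ", "J#<JHW ", "Bn ", ">MWy ", ">#R ",
    "XZH ", "<L ", "JHWDH ", "W", "JRW#Lm ",
]


def normalize_file_word(word):
    """Map the file's transcription onto TF's: C and F become #."""
    return word.replace("C", "#").replace("F", "#")


def normalize_tf_text(text):
    """Drop trailing space, ignore case, ignore apostrophes."""
    return text.rstrip().upper().replace("'", "")


def check_rows(rows, tf_texts):
    """Return a list of human-readable problems; empty means all is well."""
    problems = []
    prev = None
    for row, tf_text in zip(rows, tf_texts):
        w = int(row["id"])
        # Check 1: node numbers increase by one per line.
        if prev is not None and w != prev + 1:
            problems.append(f"node {w} does not follow {prev}")
        # Check 2: the words agree after normalization.
        file_word = normalize_file_word(row["word"])
        tf_word = normalize_tf_text(tf_text)
        if file_word != tf_word:
            problems.append(f"node {w}: `{file_word}` != `{tf_word}`")
        prev = w
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = check_rows(rows, TF_TEXTS)
print(problems or "all is well")
```

This sketch is stricter than the full check: it does not allow for any gaps in the node numbers.
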

First a visual check.

In [55]:
data = readBoundariesPlain()
pprint(list(data.keys())[0:10])

[ 1894861,
  1894862,
  1894863,
  1894864,
  1894865,
  1894866,
  1894867,
  1894868,
  1894869,
  1894870]


In [56]:
pprint(list(data.values())[0:10])

[ ('XZWN', 'Isaiah', '1', '1'),
  ('JC<JHW', 'Isaiah', '1', '1'),
  ('BN', 'Isaiah', '1', '1'),
  ('>MWY', 'Isaiah', '1', '1'),
  ('>CR', 'Isaiah', '2', '2'),
  ('XZH', 'Isaiah', '2', '3'),
  ('<L', 'Isaiah', '2', '4'),
  ('JHWDH', 'Isaiah', '2', '4'),
  ('W', 'Isaiah', '2', '4'),
  ('JRWCLM', 'Isaiah', '2', '4')]

The file may skip word nodes that render as `00 ` in TF, such as node 1894879; the check below tolerates a gap of exactly one such node.


In [57]:
T.text(1894879, fmt="text-trans-full")

'00 '

In [59]:
END = "00 "

prevW = None

good = True

for (w, (word, book, clNr, phrNr)) in data.items():
    wordTrans = word.replace("C", "#").replace("F", "#")
    wordTf = T.text(w, fmt=FMT).rstrip().upper().replace("'", "")
    if wordTrans != wordTf:
        print(f"irregularity at {w}: `{wordTrans}` != `{wordTf}`")
        good = False
        break
    if prevW is not None and w != prevW + 1:
        if not (prevW + 2 == w and T.text(prevW + 1, fmt=FMT) == END):
            print(f"irregularity at {w} following {prevW}")
            good = False
            break
    prevW = w

if good:
    print("all is well")

all is well


# Run

We can now run the script to produce a new DSS dataset with extra node types: `clause` and `phrase`, both with feature `nr`.

**NB**:

The script specifies the source version and the destination version for the new TF dataset.

We can run it on the command line, or right here in the notebook.
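
The script itself is not shown here. As a rough sketch of the kind of compilation it has to do, the code below groups consecutive rows with the same clause or phrase number into new nodes and records their `nr` value. The function name `build_type_spec` and the starting node numbers are hypothetical. The keys `nodeFrom`, `nodeTo`, `nodeSlots` and `nodeFeatures` follow my recollection of the `addTypes` argument of `modify` and should be checked against the Text-Fabric documentation. The sketch also lets word nodes stand in for slots; a dataset whose slot type is not the word would need an extra mapping from words to slots.

```python
from itertools import groupby

# (word node, clause_nr, phrase_nr) for the ten sample rows shown earlier.
ROWS = [
    (1894861, "1", "1"),
    (1894862, "1", "1"),
    (1894863, "1", "1"),
    (1894864, "1", "1"),
    (1894865, "2", "2"),
    (1894866, "2", "3"),
    (1894867, "2", "4"),
    (1894868, "2", "4"),
    (1894869, "2", "4"),
    (1894870, "2", "4"),
]


def build_type_spec(rows, nr_index, first_node):
    """Group consecutive rows sharing the number at nr_index into nodes.

    Hypothetical helper: returns a dict shaped like one entry of the
    `addTypes` argument as I recall it (an assumption, not verified here).
    """
    node_slots = {}   # new node -> set of member words
    nr_feature = {}   # new node -> value of feature `nr`
    node = first_node
    for nr, group in groupby(rows, key=lambda row: row[nr_index]):
        node_slots[node] = {row[0] for row in group}
        nr_feature[node] = nr
        node += 1
    return dict(
        nodeFrom=first_node,
        nodeTo=node - 1,
        nodeSlots=node_slots,
        nodeFeatures=dict(nr=nr_feature),
    )


# Illustrative node numbers only; real ones must lie above the existing maximum node.
clause_spec = build_type_spec(ROWS, 1, first_node=1)
phrase_spec = build_type_spec(ROWS, 2, first_node=clause_spec["nodeTo"] + 1)
add_types = dict(clause=clause_spec, phrase=phrase_spec)
```

Grouping by consecutive runs rather than by number alone keeps the sketch safe even if the numbering were to restart somewhere in the file.
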


In [79]:
!python3 addBoundariesFromNaaijer.py

<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
<IPython.core.display.HTML object>
  0.00s preparing and checking ...
This is Text-Fabric 8.4.5
Api reference : https://annotation.github.io/text-fabric/cheatsheet.html

66 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  5.00s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  0.04s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
  1.81s All additional features loaded - for details use loadLog()
   |     6.87s done
  7.17s add types ...
   |     0.00s done (2 types)
   |      |   clause, phrase
  7.17s applying updates ...
  9.21s write TF data ...
This is Text-Fabric 8.4.5
Ap

# Test

Let's see whether the first word is now contained in a phrase and in a clause:

In [81]:
A = use("dss:clone", checkout="clone", hoist=globals(), version="0.9")

   |     0.72s T otype                from ~/github/etcbc/dss/tf/0.9
   |     7.72s T oslots               from ~/github/etcbc/dss/tf/0.9
   |     1.04s T fulle                from ~/github/etcbc/dss/tf/0.9
   |     3.53s T glypho               from ~/github/etcbc/dss/tf/0.9
   |     0.09s T punc                 from ~/github/etcbc/dss/tf/0.9
   |     3.46s T glyphe               from ~/github/etcbc/dss/tf/0.9
   |     0.92s T glexe                from ~/github/etcbc/dss/tf/0.9
   |     0.08s T punce                from ~/github/etcbc/dss/tf/0.9
   |     0.12s T scroll               from ~/github/etcbc/dss/tf/0.9
   |     3.68s T glyph                from ~/github/etcbc/dss/tf/0.9
   |     0.34s T lang                 from ~/github/etcbc/dss/tf/0.9
   |     0.93s T glexo                from ~/github/etcbc/dss/tf/0.9
   |     0.96s T glex                 from ~/github/etcbc/dss/tf/0.9
   |     0.99s T lexe                 from ~/github/etcbc/dss/tf/0.9
   |     0.96s T morpho           

In [82]:
L.u(firstWord)

(2107989, 2107864, 1590062, 1540420, 1606613, 1543783)

In [83]:
for n in L.u(firstWord):
    print(n, F.otype.v(n))

2107989 phrase
2107864 clause
1590062 line
1540420 fragment
1606613 scroll
1543783 lex


This is a fragment of the data file:
```
1894900,1Qisaa,Isaiah,3,QWNJHW,11,24
1894901,1Qisaa,Isaiah,3,W,12,25
1894902,1Qisaa,Isaiah,3,XMWR,12,26
1894903,1Qisaa,Isaiah,3,>BWS,12,27
1894904,1Qisaa,Isaiah,3,B<LJW,12,27
1894905,1Qisaa,Isaiah,3,JFR>L,13,28
```

We gather the words belonging to clause 12:

In [91]:
c = F.otype.s("clause")[11]
print(F.nr.v(c))
print(L.d(c, otype="word"))

12
(1894901, 1894902, 1894903, 1894904)


Correct!

We gather the words belonging to phrase 27:

In [92]:
p = F.otype.s("phrase")[26]
print(F.nr.v(p))
print(L.d(p, otype="word"))

27
(1894903, 1894904)


Correct!
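
The two spot checks can also be derived mechanically from the CSV fragment, which is handy when testing more than a couple of nodes. The sketch below parses the fragment shown above and compares the expected members with the tuples that TF returned; the helper `members` is a made-up name, and the column names are taken from the header of the data file.

```python
import csv
import io

# The fragment of the data file shown above (it has no header line).
FRAGMENT = """\
1894900,1Qisaa,Isaiah,3,QWNJHW,11,24
1894901,1Qisaa,Isaiah,3,W,12,25
1894902,1Qisaa,Isaiah,3,XMWR,12,26
1894903,1Qisaa,Isaiah,3,>BWS,12,27
1894904,1Qisaa,Isaiah,3,B<LJW,12,27
1894905,1Qisaa,Isaiah,3,JFR>L,13,28
"""

FIELDS = ["id", "scroll", "book", "verse", "word", "clause_nr", "phrase_nr"]


def members(rows, column, nr):
    """Word nodes whose value in `column` equals `nr`, in file order."""
    return tuple(int(row["id"]) for row in rows if row[column] == str(nr))


rows = list(csv.DictReader(io.StringIO(FRAGMENT), fieldnames=FIELDS))

# Compare with the tuples printed by L.d(c, otype="word") and L.d(p, otype="word").
clause_12 = members(rows, "clause_nr", 12)
phrase_27 = members(rows, "phrase_nr", 27)
assert clause_12 == (1894901, 1894902, 1894903, 1894904)
assert phrase_27 == (1894903, 1894904)
```
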