<img src="images/etcbc.png" align="left"/>
<img src="images/dans.png" align="right"/>

<img src="images/tf.png" width="50%"/>

# Handling Biblical data with IKEA logistics <a class="tocSkip">

Dirk Roorda - 2018-03-20 [ETEN Workshop](http://global-learning.org/course/view.php?id=16)

# Perspective

* Research data
* Researchers are ...
  * not engineers
  * not consumers
  * tinkerers (in their own sheds ... or *labs* as they like to say)
* Programming *theologians*, *cuneiform decipherers*, *humanists*

# Workbench (Cunei)

In [60]:
import sys, os
from IPython.display import display, HTML, Markdown

In [61]:
LOC = ('~/github', 'Nino-cunei/uruk', 'Copenhagen2018')
sys.path.append(os.path.expanduser(f'{LOC[0]}/{LOC[1]}/programs'))
from cunei import Cunei
CN = Cunei(*LOC)
CN.api.makeAvailableIn(globals())

Found 2095 ideograph linearts
Found 2724 tablet linearts
Found 5495 tablet photos


**Documentation:** <a target="_blank" href="https://github.com/Nino-cunei/uruk/blob/master/docs/about.md" title="{provenance of this corpus}">Uruk IV-III (v1.0)</a> <a target="_blank" href="https://github.com/Nino-cunei/uruk/blob/master/docs/transcription.md" title="{source} feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/Nino-cunei/uruk/blob/master/docs/cunei.md" title="cunei api documentation">Cunei API</a> <a target="_blank" href="https://github.com/Dans-labs/text-fabric/wiki/api" title="text-fabric-api">Text-Fabric API</a>


This notebook online:
<a target="_blank" href="http://nbviewer.jupyter.org/github/etcbc/lingo/blob/master/presentations/Copenhagen2018.ipynb">NBViewer</a>
<a target="_blank" href="https://github.com/etcbc/lingo/blob/master/presentations/Copenhagen2018.ipynb">GitHub</a>


In [62]:
pNumX = 'P005381'

In [63]:
CN.photo(pNumX, width="400")

In [64]:
CN.lineart(pNumX, width="300")

In [65]:
tabletX = T.nodeFromSection((pNumX,))
sourceLines = CN.getSource(tabletX)
print('\n'.join(sourceLines))

&P005381 = MSVO 3, 70
#atf: lang qpc 
@obverse 
@column 1 
1.a. 2(N14) , SZE~a SAL TUR3~a NUN~a 
1.b. 3(N19) , |GISZ.TE| 
2. 1(N14) , NAR NUN~a SIG7 
3. 2(N04)# , PIRIG~b1 SIG7 URI3~a NUN~a 
@column 2 
1. 3(N04) , |GISZ.TE| GAR |SZU2.((HI+1(N57))+(HI+1(N57)))| GI4~a 
2. , GU7 AZ SI4~f 
@reverse 
@column 1 
1. 3(N14) , SZE~a 
2. 3(N19) 5(N04) , 
3. , GU7 
@column 2 
1. , AZ SI4~f 


In [66]:
case = CN.nodeFromCase((pNumX, 'obverse:1', '1a'))

In [67]:
CN.lineart(CN.getOuterQuads(case), width=50)

## Tablet calculator

In [68]:
pNums = '''
    P005381
    P005447
    P005448
'''.strip().split()

pNumPat = '|'.join(pNums)

In [69]:
shinPP = dict(
    N41=0.2,
    N04=1,
    N19=6,
    N46=60,
    N36=180,
    N49=1800,
)

shinPPPat = '|'.join(shinPP)

We query for shinPP numerals on the faces of selected tablets.
The result of the query is a list of tuples `(t, f, s)` consisting of
a tablet node, a face node and a sign node, which is a shinPP numeral.

In [70]:
query = f'''
tablet catalogId={pNumPat}
    face
        sign type=numeral grapheme={shinPPPat}
'''

In [71]:
results = list(S.search(query))
len(results)

20

We have found 20 numerals.
We group the results by tablet and by face.

In [72]:
numerals = {}
for (tablet, face, sign) in results:
    numerals.setdefault(tablet, {}).setdefault(face, []).append(sign)

We show the tablets, the shinPP numerals per face, and we add up the numerals per face.

In [73]:
def dm(x): display(Markdown(x))

for (tablet, faces) in numerals.items():
    dm('---\n')
    display(CN.lineart(tablet, withCaption="top", width="200"))
    for (face, signs) in faces.items():
        dm(f'### {F.type.v(face)}')
        distinctSigns = {}
        for s in signs:
            distinctSigns.setdefault(CN.atfFromSign(s), []).append(s)
        display(CN.lineart(distinctSigns))
        total = 0
        for (signAtf, sameSigns) in distinctSigns.items():
            # all signs with the same signAtf share grapheme and repeat;
            # the inner name differs from the outer `signs` to avoid shadowing it
            value = 0
            for s in sameSigns:
                value += F.repeat.v(s) * shinPP[F.grapheme.v(s)]
            total += value
            amount = len(sameSigns)
            shinPPval = shinPP[F.grapheme.v(sameSigns[0])]
            repeat = F.repeat.v(sameSigns[0])
            print(f'{amount} x {signAtf} = {amount} x {repeat} x {shinPPval} = {value}')
        dm(f'**total** = **{total}**')

---


### obverse

1 x 9(N19) = 1 x 9 x 6 = 54
1 x 4(N04) = 1 x 4 x 1 = 4
2 x 2(N19) = 2 x 2 x 6 = 24
1 x 2(N04) = 1 x 2 x 1 = 2


**total** = **84**

### reverse

2 x 1(N46) = 2 x 1 x 60 = 120
2 x 2(N19) = 2 x 2 x 6 = 24
1 x 4(N19) = 1 x 4 x 6 = 24


**total** = **168**

---


### obverse

1 x 2(N04) = 1 x 2 x 1 = 2
1 x 1(N46) = 1 x 1 x 60 = 60
1 x 9(N19) = 1 x 9 x 6 = 54


**total** = **116**

### reverse

1 x 1(N36) = 1 x 1 x 180 = 180
1 x 1(N19) = 1 x 1 x 6 = 6


**total** = **186**

---


### obverse

1 x 3(N04) = 1 x 3 x 1 = 3
1 x 2(N04) = 1 x 2 x 1 = 2
1 x 3(N19) = 1 x 3 x 6 = 18


**total** = **23**

### reverse

1 x 3(N19) = 1 x 3 x 6 = 18
1 x 5(N04) = 1 x 5 x 1 = 5


**total** = **23**
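
The totals can also be checked by hand from the transliteration. The sketch below is not part of the Cunei API: it is a plain-Python cross-check that assumes numerals appear in the ATF source as `repeat(grapheme)`, as in `3(N19)`, and that signs outside the `shinPP` table are ignored, as in the query. The name `face_total` and the regular expression are illustrative choices.

```python
import re

# Relative values copied from the shinPP table in this notebook.
SHIN_PP = {"N41": 0.2, "N04": 1, "N19": 6, "N46": 60, "N36": 180, "N49": 1800}

# Assumption: a numeral is written as repeat(grapheme), e.g. 3(N19).
NUMERAL = re.compile(r"(\d+)\((N\d+)\)")


def face_total(atf_lines):
    """Add up repeat x value for every shinPP numeral in the given ATF lines."""
    total = 0
    for line in atf_lines:
        for repeat, grapheme in NUMERAL.findall(line):
            if grapheme in SHIN_PP:  # other numerals are skipped, as in the query
                total += int(repeat) * SHIN_PP[grapheme]
    return total


# Source lines of P005381 as printed earlier in this notebook.
obverse = [
    "1.a. 2(N14) , SZE~a SAL TUR3~a NUN~a",
    "1.b. 3(N19) , |GISZ.TE|",
    "2. 1(N14) , NAR NUN~a SIG7",
    "3. 2(N04)# , PIRIG~b1 SIG7 URI3~a NUN~a",
    "1. 3(N04) , |GISZ.TE| GAR |SZU2.((HI+1(N57))+(HI+1(N57)))| GI4~a",
    "2. , GU7 AZ SI4~f",
]
reverse = [
    "1. 3(N14) , SZE~a",
    "2. 3(N19) 5(N04) ,",
    "3. , GU7",
    "1. , AZ SI4~f",
]

print(face_total(obverse), face_total(reverse))
```

Both faces add up to 23, which agrees with the last pair of totals above.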

# Workbench (Syriac NT)

In [74]:
from tf.fabric import Fabric
REPO = '~/github/etcbc/linksyr'
SOURCE = 'syrnt'
CORPUS = f'{REPO}/data/tf/{SOURCE}'

In [75]:
TF = Fabric(locations=[CORPUS], modules=[''], silent=False)
api = TF.load('', silent=True)
allFeatures = TF.explore(silent=True, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
TF.load(loadableFeatures, add=True, silent=True)
api.makeAvailableIn(globals())

This is Text-Fabric 3.2.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

37 features found and 0 ignored


In [76]:
SYRIACA = os.path.expanduser(f'{REPO}/data/syriaca')
SC_PEOPLE = f'{SYRIACA}/index_of_persons.csv'
SC_PLACES = f'{SYRIACA}/index_of_places.csv'

SC_URL = 'http://syriaca.org'
SC_PLACE = 'place'
SC_PERSON = 'person'

SC_CONFIG = (
    (SC_PERSON, SC_URL, SC_PEOPLE),
    (SC_PLACE, SC_URL, SC_PLACES),    
)

SC_TYPES = tuple(x[0] for x in SC_CONFIG)

SC_FIELDS = ('trans', 'syriac', 'id')

NA_SYRIAC = {
    '[Syriac Not Available]', 
    '[Syriac Not', 
    '[Syriac',
}

In [77]:
HTML('''
<style>
.syc {
    font-family: Estrangelo Edessa;
    font-size: 14pt;
}
</style>
''')

In [78]:
tables = {}
irregular = {}

(transF, syriacF, idF) = SC_FIELDS

for (dataType, baseUrl, dataFile) in SC_CONFIG:
    tables[dataType] = {field: {} for field in SC_FIELDS}
    irregular[dataType] = set()
    dest = tables[dataType]
    irreg = irregular[dataType]
    table = dest[idF]
    indexTrans = dest[transF]
    indexSyriac = dest[syriacF]
    with open(dataFile) as fh:
        for (i, line) in enumerate(fh):
            (transV, syriacV, idV) = line.rstrip('\n').split('\t')
            prefix = f'{baseUrl}/{dataType}/'
            if idV.startswith(prefix):
                idV = idV.replace(prefix, '', 1)
            else:
                irreg.add(idV)
            table[idV] = (transV, syriacV)
            indexTrans.setdefault(transV, set()).add(idV)
            if syriacV not in NA_SYRIAC:
                if '[' in syriacV:
                    print(f'WARNING {dataType} line {i+1}: syriac value "{syriacV}"')
                indexSyriac.setdefault(syriacV, set()).add(idV)

In [79]:
for (dataType, data) in tables.items():
    table = data[idF]
    irreg = irregular[dataType]
    print(f'''
{dataType:>12}s: {len(table):>5} (irregular: {len(irreg):>4})
{"by syriac":>12} : {len(data[syriacF]):>5}
{"by trans":>12} : {len(data[transF]):>5}
''')


      persons:  2371 (irregular:    0)
   by syriac :  1503
    by trans :  1964


       places:  2488 (irregular:    0)
   by syriac :   527
    by trans :  2165



In [80]:
hits = {dataType: {} for dataType in SC_TYPES}

for lx in F.otype.s('lexeme'):
    lex = F.lexeme.v(lx)
    for dataType in SC_TYPES:
        idV = tables[dataType][syriacF].get(lex, None)
        if idV is not None:
            hits[dataType][lx] = idV

In [81]:
for (dataType, theseHits) in hits.items():
    print(f'{dataType:>12}s: {len(theseHits):>5} hits')

      persons:    98 hits
       places:    37 hits


We show the hits by picking the first occurrence of each lexeme and showing it in context.

In [82]:
for (dataType, theseHits) in hits.items():
    markdown = f'''### {dataType}s
lexeme | linked | n-occs | passage | verse text
--- | --- | --- | --- | ---
'''
    for (lx, linked) in sorted(
        theseHits.items(),
        key=lambda x: F.lexeme.v(x[0]),
    )[0:10]:
        lex = F.lexeme.v(lx)
        ids = ' '.join(sorted(linked))
        occs = L.d(lx, otype='word')
        passage = '{} {}:{}'.format(*T.sectionFromNode(occs[0]))
        verse = L.u(occs[0], otype='verse')[0]
        text = T.text(L.d(verse, otype='word'))
        markdown += (
            f'<span class="syc">{lex}</span> | {ids} | {len(occs)} | {passage} |'
            f' <span class="syc">{text}</span>\n'
        )
    display(Markdown(markdown))

### persons
lexeme | linked | n-occs | passage | verse text
--- | --- | --- | --- | ---
<span class="syc">ܐܒܐ</span> | 1094 2582 308 | 9 | Matthew 2:22 | <span class="syc">ܟܕ ܕܝܢ ܫܡܥ ܕܐܪܟܠܐܘܣ ܗܘܐ ܡܠܟܐ ܒܝܗܘܕ ܚܠܦ ܗܪܘܕܣ ܐܒܘܗܝ ܕܚܠ ܕܢܐܙܠ ܠܬܡܢ ܘܐܬܚܙܝ ܠܗ ܒܚܠܡܐ ܕܢܐܙܠ ܠܐܬܪܐ ܕܓܠܝܠܐ </syc>
<span class="syc">ܐܒܪܗܡ</span> | 1108 1109 1110 1546 1547 1548 1549 1551 1552 1553 1554 2202 964 | 2 | Matthew 1:1 | <span class="syc">ܟܬܒܐ ܕܝܠܝܕܘܬܗ ܕܝܫܘܥ ܡܫܝܚܐ ܒܪܗ ܕܕܘܝܕ ܒܪܗ ܕܐܒܪܗܡ </syc>
<span class="syc">ܐܕܝ</span> | 1117 1118 2203 | 2 | Luke 3:28 | <span class="syc">ܒܪ ܡܠܟܝ ܒܪ ܐܕܝ ܒܪ ܩܘܣܡ ܒܪ ܐܠܡܘܕܕ ܒܪ ܥܝܪ </syc>
<span class="syc">ܐܕܡ</span> | 1560 | 208 | Luke 3:38 | <span class="syc">ܒܪ ܐܢܘܫ ܒܪ ܫܝܬ ܒܪ ܐܕܡ ܕܡܢ ܐܠܗܐ </syc>
<span class="syc">ܐܗܪܘܢ</span> | 1012 1092 1533 1534 | 3 | Luke 1:5 | <span class="syc">ܗܘܐ ܒܝܘܡܬܗ ܕܗܪܘܕܣ ܡܠܟܐ ܕܝܗܘܕܐ ܟܗܢܐ ܚܕ ܕܫܡܗ ܗܘܐ ܙܟܪܝܐ ܡܢ ܬܫܡܫܬܐ ܕܒܝܬ ܐܒܝܐ ܘܐܢܬܬܗ ܡܢ ܒܢܬܗ ܕܐܗܪܘܢ ܫܡܗ ܗܘܐ ܐܠܝܫܒܥ </syc>
<span class="syc">ܐܘܒܘܠܘܣ</span> | 3028 | 1 | 2_Timothy 4:21 | <span class="syc">ܢܬܒܛܠ ܠܟ ܕܩܕܡ ܣܬܘܐ ܬܐܬܐ ܫܐܠ ܒܫܠܡܟ ܐܘܒܘܠܘܣ ܘܦܘܕܣ ܘܠܝܢܘܣ ܘܩܠܘܕܝܐ ܘܐܚܐ ܟܠܗܘܢ </syc>
<span class="syc">ܐܚܐ</span> | 1122 1123 1740 | 3 | Matthew 1:2 | <span class="syc">ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ </syc>
<span class="syc">ܐܝܣܚܩ</span> | 1788 1789 1790 1791 1792 2578 | 3 | Matthew 1:2 | <span class="syc">ܐܒܪܗܡ ܐܘܠܕ ܠܐܝܣܚܩ ܐܝܣܚܩ ܐܘܠܕ ܠܝܥܩܘܒ ܝܥܩܘܒ ܐܘܠܕ ܠܝܗܘܕܐ ܘܠܐܚܘܗܝ </syc>
<span class="syc">ܐܠܝܐ</span> | 1698 1699 1700 1703 1704 1705 2541 3145 945 | 1 | Matthew 2:18 | <span class="syc">ܩܠܐ ܐܫܬܡܥ ܒܪܡܬܐ ܒܟܝܐ ܘܐܠܝܐ ܣܓܝܐܐ ܪܚܝܠ ܒܟܝܐ ܥܠ ܒܢܝܗ ܘܠܐ ܨܒܝܐ ܠܡܬܒܝܐܘ ܡܛܠ ܕܠܐ ܐܝܬܝܗܘܢ </syc>
<span class="syc">ܐܠܟܣܢܕܪܘܣ</span> | 1574 887 | 1 | Mark 15:21 | <span class="syc">ܘܫܚܪܘ ܚܕ ܕܥܒܪ ܗܘܐ ܫܡܥܘܢ ܩܘܪܝܢܝܐ ܕܐܬܐ ܗܘܐ ܡܢ ܩܪܝܬܐ ܐܒܘܗܝ ܕܐܠܟܣܢܕܪܘܣ ܘܕܪܘܦܘܣ ܕܢܫܩܘܠ ܙܩܝܦܗ </syc>


### places
lexeme | linked | n-occs | passage | verse text
--- | --- | --- | --- | ---
<span class="syc">ܐܘܪܫܠܡ</span> | 104 | 2 | Matthew 2:1 | <span class="syc">ܟܕ ܕܝܢ ܐܬܝܠܕ ܝܫܘܥ ܒܒܝܬ-ܠܚܡ ܕܝܗܘܕܐ ܒܝܘܡܝ ܗܪܘܕܣ ܡܠܟܐ ܐܬܘ ܡܓܘܫܐ ܡܢ ܡܕܢܚܐ ܠܐܘܪܫܠܡ </syc>
<span class="syc">ܐܠܟܣܢܕܪܝܐ</span> | 572 | 2 | Acts 6:9 | <span class="syc">ܘܩܡܘ ܗܘܘ ܐܢܫܐ ܡܢ ܟܢܘܫܬܐ ܕܡܬܩܪܝܐ ܕܠܝܒܪܛܝܢܘ ܘܩܘܪܝܢܝܐ ܘܐܠܟܣܢܕܪܝܐ ܘܕܡܢ ܩܝܠܝܩܝܐ ܘܡܢ ܐܣܝܐ ܘܕܪܫܝܢ ܗܘܘ ܥܡ ܐܣܛܦܢܘܣ </syc>
<span class="syc">ܐܢܛܝܘܟܝܐ</span> | 10 995 | 44 | Acts 6:5 | <span class="syc">ܘܫܦܪܬ ܗܕܐ ܡܠܬܐ ܩܕܡ ܟܠܗ ܥܡܐ ܘܓܒܘ ܠܐܣܛܦܢܘܣ ܓܒܪܐ ܕܡܠܐ ܗܘܐ ܗܝܡܢܘܬܐ ܘܪܘܚܐ ܕܩܘܕܫܐ ܘܠܦܝܠܝܦܘܣ ܘܠܦܪܟܪܘܣ ܘܠܢܝܩܢܘܪ ܘܠܛܝܡܘܢ ܘܠܦܪܡܢܐ ܘܠܢܝܩܠܐܘܣ ܓܝܘܪܐ ܐܢܛܝܘܟܝܐ </syc>
<span class="syc">ܐܣܦܣ</span> | 288 | 5 | Romans 3:13 | <span class="syc">ܩܒܪܐ ܦܬܝܚܐ ܓܓܪܬܗܘܢ ܘܠܫܢܝܗܘܢ ܢܟܘܠܬܢܝܢ ܘܚܡܬܐ ܕܐܣܦܣ ܬܚܝܬ ܣܦܘܬܗܘܢ </syc>
<span class="syc">ܐܦܣܘܣ</span> | 623 | 69 | Acts 18:19 | <span class="syc">ܘܡܛܝܘ ܠܐܦܣܘܣ ܘܥܠ ܦܘܠܘܣ ܠܟܢܘܫܬܐ ܘܡܡܠܠ ܗܘܐ ܥܡ ܝܗܘܕܝܐ </syc>
<span class="syc">ܐܪܟ</span> | 515 | 1 | Matthew 23:5 | <span class="syc">ܘܟܠܗܘܢ ܥܒܕܝܗܘܢ ܥܒܕܝܢ ܕܢܬܚܙܘܢ ܠܒܢܝ ܐܢܫܐ ܡܦܬܝܢ ܓܝܪ ܬܦܠܝܗܘܢ ܘܡܘܪܟܝܢ ܬܟܠܬܐ ܕܡܪܛܘܛܝܗܘܢ </syc>
<span class="syc">ܓܐܝܘܣ</span> | 1494 | 1 | Acts 19:29 | <span class="syc">ܘܐܫܬܓܫܬ ܟܠܗ ܡܕܝܢܬܐ ܘܪܗܛܘ ܐܟܚܕܐ ܘܐܙܠܘ ܠܬܐܛܪܘܢ ܘܚܛܦܘ ܐܘܒܠܘ ܥܡܗܘܢ ܠܓܐܝܘܣ ܘܠܐܪܣܛܪܟܘܣ ܓܒܪܐ ܡܩܕܘܢܝܐ ܒܢܝ ܠܘܝܬܗ ܕܦܘܠܘܣ </syc>
<span class="syc">ܕܪܐ</span> | 67 | 32 | Matthew 21:44 | <span class="syc">ܘܡܢ ܕܢܦܠ ܥܠ ܟܐܦܐ ܗܕܐ ܢܬܪܥܥ ܘܟܠ ܡܢ ܕܗܝ ܬܦܠ ܥܠܘܗܝ ܬܕܪܝܘܗܝ </syc>
<span class="syc">ܕܪܡܣܘܩ</span> | 66 | 24 | Acts 9:2 | <span class="syc">ܘܫܐܠ ܠܗ ܐܓܪܬܐ ܡܢ ܪܒ ܟܗܢܐ ܕܢܬܠ ܠܗ ܠܕܪܡܣܘܩ ܠܟܢܘܫܬܐ ܕܐܢ ܗܘ ܕܢܫܟܚ ܕܪܕܝܢ ܒܗܕܐ ܐܘܪܚܐ ܓܒܪܐ ܐܘ ܢܫܐ ܢܐܣܘܪ ܢܝܬܐ ܐܢܘܢ ܠܐܘܪܫܠܡ </syc>
<span class="syc">ܚܘܪܐ</span> | 1456 | 4 | Matthew 5:36 | <span class="syc">ܐܦܠܐ ܒܪܫܟ ܬܐܡܐ ܕܠܐ ܡܫܟܚ ܐܢܬ ܠܡܥܒܕ ܒܗ ܡܢܬܐ ܚܕܐ ܕܣܥܪܐ ܐܘܟܡܬܐ ܐܘ ܚܘܪܬܐ </syc>


In [83]:
syriacaResolve = os.path.expanduser(f'{REPO}/data/user/syriacaSyrNT.csv')

In [84]:
fieldNames = ('lexeme', 'trans', 'url', 'applicable')

# the context manager closes the file even if the loop raises
with open(syriacaResolve, 'w') as fh:
    # write the header once, not once per data type
    fh.write('\t'.join(fieldNames) + '\n')
    for (dataType, theseHits) in hits.items():
        tsv = ''
        markdown = f'''### {dataType}s
{" | ".join(fieldNames)}
--- | --- | --- | ---
'''
        table = tables[dataType]
        data = table[idF]
        for (lx, linked) in sorted(
            theseHits.items(),
            key=lambda x: F.lexeme.v(x[0]),
        )[0:3]:
            lex = F.lexeme.v(lx)
            for lid in linked:
                trans = data[lid][0]
                url = f'{SC_URL}/{dataType}/{lid}'
                markdown += (
                    f'<span class="syc">{lex}</span> | {trans} | {url} | no\n'
                )
                tsv += f'{lex}\t{trans}\t{url}\tno\n'

        fh.write(tsv)
        display(Markdown(markdown))

### persons
lexeme | trans | url | applicable
--- | --- | --- | ---
<span class="syc">ܐܒܐ</span> | Aba of Nineveh | http://syriaca.org/person/1094 | no
<span class="syc">ܐܒܐ</span> | Abba | http://syriaca.org/person/2582 | no
<span class="syc">ܐܒܐ</span> | Aba | http://syriaca.org/person/308 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham | http://syriaca.org/person/1548 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham | http://syriaca.org/person/1547 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham, bishop of Arbela | http://syriaca.org/person/1108 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham | http://syriaca.org/person/1110 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham II of Adiabene | http://syriaca.org/person/1552 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham the Priest | http://syriaca.org/person/1554 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham of Adiabene | http://syriaca.org/person/1551 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham | http://syriaca.org/person/964 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham the Egyptian | http://syriaca.org/person/1553 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham | http://syriaca.org/person/1546 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham of the High Mountain | http://syriaca.org/person/1109 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham of Harran | http://syriaca.org/person/1549 | no
<span class="syc">ܐܒܪܗܡ</span> | Abraham | http://syriaca.org/person/2202 | no
<span class="syc">ܐܕܝ</span> | Addai | http://syriaca.org/person/1118 | no
<span class="syc">ܐܕܝ</span> | Addai | http://syriaca.org/person/2203 | no
<span class="syc">ܐܕܝ</span> | Addai | http://syriaca.org/person/1117 | no


### places
lexeme | trans | url | applicable
--- | --- | --- | ---
<span class="syc">ܐܘܪܫܠܡ</span> | Jerusalem (settlement) | http://syriaca.org/place/104 | no
<span class="syc">ܐܠܟܣܢܕܪܝܐ</span> | Alexandria (settlement) | http://syriaca.org/place/572 | no
<span class="syc">ܐܢܛܝܘܟܝܐ</span> | Antioch (settlement) | http://syriaca.org/place/10 | no
<span class="syc">ܐܢܛܝܘܟܝܐ</span> | Antioch (region) | http://syriaca.org/place/995 | no


# Workbench (BHSA)

In [8]:
import sys, os
import collections
import re
import operator
from functools import reduce
from utils import structure, layout
from IPython.display import display, HTML, Markdown
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [2]:
VERSION = '2017'
BASE = '~/github'
ETCBC = f'{BASE}/etcbc'
TREES = 'lingo/trees/tf/{}'.format(VERSION)   # derived wefts
OSM = 'bridging/tf/{}'.format(VERSION)        # wefts from the OSM crafts shop
PHONO = 'phono/tf/{}'.format(VERSION)         # derived wefts
PARALLELS = 'parallels/tf/{}'.format(VERSION) # derived wefts

In [3]:
LOC = ('~/github', 'etcbc/bhsa', 'Copenhagen2018')
sys.path.append(os.path.expanduser(f'{LOC[0]}/{LOC[1]}/programs'))
from bhsa import Bhsa
B = Bhsa(*LOC, version='2017', locations=[ETCBC], modules=[PHONO, PARALLELS, TREES, OSM])
B.api.makeAvailableIn(globals())
B.load('''
    g_word_utf8 g_cons_utf8
    voc_lex_utf8 gloss
    phono crossref tree
    osm
''')

**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/2017/0_home.html" title="{CORPUS} feature documentation">Feature docs</a> <a target="_blank" href="https://etcbc.github.io/bhsa/api.html" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://github.com/Dans-labs/text-fabric/wiki/api" title="text-fabric-api">Text-Fabric API</a>


This notebook online:
<a target="_blank" href="http://nbviewer.jupyter.org/github/etcbc/lingo/blob/master/presentations/Copenhagen2018.ipynb">NBViewer</a>
<a target="_blank" href="https://github.com/etcbc/lingo/blob/master/presentations/Copenhagen2018.ipynb">GitHub</a>


In [4]:
verse = T.nodeFromSection(('Genesis', 1, 7))

In [5]:
B.pretty(verse)

In [6]:
clause = 427572
B.pretty(clause)

In [9]:
HTML(B.shbLink(clause))

## Queries

![ku](images/stephen-ku.png)

In [10]:
ellipQuery = '''
sentence
  c1:clause
    phrase function=Pred
      word pdp=verb
  c2:clause
    phrase function=Pred
  c3:clause typ=Ellp
    phrase function=Objc
      word pdp=subs|nmpr|prps|prde|prin
  c1 << c2
  c2 << c3
'''

In [11]:
results = B.search(ellipQuery)
len(results)

1410

In [17]:
def f(n):
    B.show(results, n, n+1, withNodes=True)

In [18]:
interact(f, n=widgets.IntSlider(min=0,max=len(results)-1,step=1,value=0))

<function __main__.f>

# *You* can do this!

because:

* the text model works with proper *logic*:
  * graph = nodes + edges + feature annotations
  * very similar to the model of Emdros (MQL)
* the data packaging is for efficient *logistics*
* but do take a beginner's course in **Python**

# Open Scriptures: peeling off the packaging

## The OSM source

Have a look at the
[Open Scriptures Morphology](https://github.com/openscriptures/morphhb/tree/master/wlc)
source files, for example [Lam.xml](https://github.com/openscriptures/morphhb/blob/master/wlc/Lam.xml):

```
 <div type="book" osisID="Lam">
      <chapter osisID="Lam.1">
        <verse osisID="Lam.1.1">
          <w lemma="349 b" n="1.1.1.0" morph="HTi">אֵיכָ֣ה</w>
          <seg type="x-paseq">׀</seg>
          <w lemma="3427" morph="HVqp3fs">יָשְׁבָ֣ה</w>
          <w lemma="910" n="1.1.1" morph="HNcmsa">בָדָ֗ד</w>

```

### Book names
```
Lam
Lam.1
Lam.1.1
```

Lam = Lmt = Lament = Lamentations = Threni = Klagesangene = ሰቆቃው_ኤርምያስ

### Sections

```
 book
      chapter
        verse
          w
          w
          w
```

### Identifiers

```
                  osisID= Lam 
               osisID= Lam.1
               osisID= Lam.1.1
             lemma= 349 b  n= 1.1.1.0 

             lemma= 3427 
             lemma= 910  n= 1.1.1 
```

### XML markup

```
 <div type="   " >
      <"">
        <"">
          <"" "" morph=""></>
          <seg type="x-paseq"></seg>
          <"" morph=""></>
          <"" "" morph=""></>
```

### Full text

```
אֵיכָ֣ה
׀
יָשְׁבָ֣ה
בָדָ֗ד
```

### Morph

but all you want is the treasure: *morph*

```
HTi
HVqp3fs
HNcmsa
```

from 

```
 <div type="book" osisID="Lam">
      <chapter osisID="Lam.1">
        <verse osisID="Lam.1.1">
          <w lemma="349 b" n="1.1.1.0" morph="HTi">אֵיכָ֣ה</w>
          <seg type="x-paseq">׀</seg>
          <w lemma="3427" morph="HVqp3fs">יָשְׁבָ֣ה</w>
          <w lemma="910" n="1.1.1" morph="HNcmsa">בָדָ֗ד</w>

```

## In a nutshell

1. you get more than you want
1. what you want is intricately wrapped up
1. the treasure is difficult to align with other resources
1. **we suffer from leaking concerns**
1. **we are being micro-managed at all levels**

<img src="images/loveOS.png" width="60%"/>

but we do need better logistics in treasure sharing

# Modular resources
## The IKEA way
## Lessons of a cabinet ...

![kitchen](images/kitchen.png)

![kitchen](images/kitchen-annot.png)

![metod](images/metod.png)

## ... applied to data
Modular resources 

1. help to separate concerns
1. help to finely tune data sets by recombination
1. are usable in novel settings

Research questions often involve new demands on data and creation of new treasures.

> **problem-oriented annotation**: take a corpus
  and add to it **your own** form of annotation,
  oriented towards **your own** research goal
  
> standards for corpus annotation: **1: annotations should be separable**
  
> Hence, annotation should be **modular**: separate **text** (sic!) and annotation

> Thanks to Johan de Joode, Leuven

# Text Fabric

![TF](images/tf-small.png)

A TF resource is a bunch of TF files.

A TF file contains the data for a single *feature*.

## Warp and weft

Every TF resource must have two special features, `otype` and `oslots`: the **warp** features.

All other features are **weft** features: they are woven into the warp.
<img src="images/warp.png" width="50%"/>

See [Wikipedia on weaving](https://en.wikipedia.org/wiki/Weaving).

## Weft features

These contain the concrete, tangible information: 

* the text
* the linguistic annotations
* additional data that is linked to the text

### [book@en](https://etcbc.github.io/bhsa/features/hebrew/2017/book@ll.html)

```
@node
@author=Eep Talstra Centre for Bible and Computer
@dataset=BHSA
@datasetName=Biblia Hebraica Stuttgartensia Amstelodamensis
@email=shebanq@ancient-data.org
@encoders=Dirk Roorda (TF)
@language=English
@languageCode=en
@languageEnglish=english
@provenance=book names from wikipedia and other sources
@valueType=str
@version=2017
@website=https://shebanq.ancient-data.org
@writtenBy=Text-Fabric
@dateWritten=2018-01-17T17:24:58Z

426585	Genesis
Exodus
Leviticus
Numbers
Deuteronomy
Joshua
Judges
1_Samuel
2_Samuel
1_Kings
2_Kings
Isaiah
Jeremiah
Ezekiel
Hosea
Joel
Amos
Obadiah
Jonah
Micah
Nahum
Habakkuk
Zephaniah
Haggai
Zechariah
Malachi
Psalms
Job
Proverbs
Ruth
Song_of_songs
Ecclesiastes
Lamentations
Esther
Daniel
Ezra
Nehemiah
1_Chronicles
2_Chronicles
```
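
To make the file layout concrete, here is a minimal reader for a node feature file like the one above. It is a sketch, not Text-Fabric's own loader. It assumes that metadata lines start with `@`, that a blank line separates metadata from data, that a data line is either `node<TAB>value` or just `value`, and that a bare value belongs to the node after the previous one (starting at node 1). It also assumes there are no empty data lines. The name `read_node_feature` is hypothetical.

```python
def read_node_feature(text):
    """Split a TF-style node feature file into (metadata, {node: value})."""
    header, _, body = text.partition("\n\n")

    metadata = {}
    for line in header.splitlines():
        # "@key=value"; a bare "@node" becomes key "node" with an empty value
        key, _, value = line[1:].partition("=")
        metadata[key] = value

    values = {}
    node = 0
    for line in body.splitlines():
        if "\t" in line:
            # explicit node number
            node_str, value = line.split("\t", 1)
            node = int(node_str)
        else:
            # implicit node number: one more than the previous line
            node += 1
            value = line
        values[node] = value
    return metadata, values


sample = (
    "@node\n"
    "@languageCode=en\n"
    "@valueType=str\n"
    "\n"
    "426585\tGenesis\n"
    "Exodus\n"
    "Leviticus\n"
)

metadata, books = read_node_feature(sample)
print(metadata["languageCode"], books)
```

Only the first book carries a node number; the other books get theirs by counting on from it.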

### [book@da](https://etcbc.github.io/bhsa/features/hebrew/2017/book@ll.html)

```
426585	1.Mosebog
2.Mosebog
3.Mosebog
4.Mosebog
5.Mosebog
Josva
Dommer
1.Samuel
2.Samuel
1.Kongebog
2.Kongebog
Esajas
Jeremias
Ezekiel
Hoseas
Joel
Amos
Obadias
Jonas
Mika
Nahum
Habakkuk
Sefanias
Haggaj
Zakarias
Malakias
Salmerne
Job
Ordsprogene
Ruth
Højsangen
Prædikeren
Klagesangene
Ester
Daniel
Ezra
Nehemias
1.Krønikebog
2.Krønikebog
```

### [book@am](https://etcbc.github.io/bhsa/features/hebrew/2017/book@ll.html)

```
426585	ኦሪት_ዘፍጥረት
ኦሪት_ዘጸአት
ኦሪት_ዘሌዋውያን
ኦሪት_ዘኍልቍ
ኦሪት_ዘዳግም
መጽሐፈ_ኢያሱ_ወልደ_ነዌ
መጽሐፈ_መሣፍንት
መጽሐፈ_ሳሙኤል_ቀዳማዊ
መጽሐፈ_ሳሙኤል_ካል
መጽሐፈ_ነገሥት_ቀዳማዊ።
መጽሐፈ_ነገሥት_ካልዕ።
ትንቢተ_ኢሳይያስ
ትንቢተ_ኤርምያስ
ትንቢተ_ሕዝቅኤል
ትንቢተ_ሆሴዕ
ትንቢተ_ኢዮኤል
ትንቢተ_አሞጽ
ትንቢተ_አብድዩ
ትንቢተ_ዮናስ
ትንቢተ_ሚክያስ
ትንቢተ_ናሆም
ትንቢተ_ዕንባቆም
ትንቢተ_ሶፎንያስ
ትንቢተ_ሐጌ
ትንቢተ_ዘካርያስ
ትንቢተ_ሚልክያ
መዝሙረ_ዳዊት
መጽሐፈ_ኢዮብ።
መጽሐፈ_ምሳሌ
መጽሐፈ_ሩት
መኃልየ_መኃልይ_ዘሰሎሞን
መጽሐፈ_መክብብ
ሰቆቃው_ኤርምያስ
መጽሐፈ_አስቴር።
ትንቢተ_ዳንኤል
መጽሐፈ_ዕዝራ።
መጽሐፈ_ነህምያ።
መጽሐፈ_ዜና_መዋዕል_ቀዳማዊ።
መጽሐፈ_ዜና_መዋዕል_ካልዕ።
```

### [g_word_utf8](https://etcbc.github.io/bhsa/features/hebrew/2017/g_word_utf8.html)

```
@node
@author=Eep Talstra Centre for Bible and Computer
@dataset=BHSA
@datasetName=Biblia Hebraica Stuttgartensia Amstelodamensis
@email=shebanq@ancient-data.org
@encoders=Constantijn Sikkel (QDF), Ulrik Petersen (MQL) and Dirk Roorda (TF)
@valueType=str
@version=2017
@website=https://shebanq.ancient-data.org
@writtenBy=Text-Fabric
@dateWritten=2018-01-17T17:20:54Z

בְּ
רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ
שָּׁמַ֖יִם
וְ
אֵ֥ת
הָ
אָֽרֶץ
```

### [g_cons_utf8](https://etcbc.github.io/bhsa/features/hebrew/2017/g_cons_utf8.html)

```
ב
ראשׁית
ברא
אלהים
את
ה
שׁמים
ו
את
ה
ארץ

```

### phono
from the [phono module](https://github.com/ETCBC/phono)

```
@node
@author=BHSA Data: Constantijn Sikkel; Phono Notebook: Dirk Roorda
@coreData=BHSA
@coreVersion=2017
@source=Phono Notebook applied to BHSA Data
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-01-17T17:29:02Z

bᵊ
rēšˌîṯ
bārˈā
ʔᵉlōhˈîm
ʔˌēṯ
ha
ššāmˌayim
wᵊ
ʔˌēṯ
hā
ʔˈāreṣ
```

### [sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp.html) (part-of-speech)

In the BHSA core again.

```
@node
@author=Eep Talstra Centre for Bible and Computer
@dataset=BHSA
@datasetName=Biblia Hebraica Stuttgartensia Amstelodamensis
@email=shebanq@ancient-data.org
@encoders=Constantijn Sikkel (QDF), and Dirk Roorda (TF)
@valueType=str
@version=2017
@website=https://shebanq.ancient-data.org
@writtenBy=Text-Fabric
@dateWritten=2018-01-17T17:25:41Z

prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
```

### crossref
from the [parallels module](https://github.com/ETCBC/parallels)

```
@edge
@edgeValues
@author=BHSA Data: Constantijn Sikkel; Parallels Notebook: Dirk Roorda, Martijn Naaijer
@coreData=BHSA
@coreVersion=2017
@source=Parallels Module
@valueType=int
@writtenBy=Text-Fabric
@dateWritten=2018-01-17T17:31:34Z

1414202	1414208	84
1414202	1414212	89
1414204	1414206	77
1414206	1414204	77
1414208	1414202,1414212	84
1414212	1414208	84
1414212	1414202	89
1414299	1414308,1414471,1414473	75
1414299	1414469	76
1414299	1414325,1414479	77
1414299	1414311	78
1414299	1414302,1414467	79
1414299	1414475,1414477	80
1414299	1414314	86
```
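
The edge lines above can be read as `from<TAB>to<TAB>value`, where the `to` field may list several nodes separated by commas. The sketch below expands such lines into a nested dictionary. It is an illustration, not Text-Fabric's loader: it handles only the three-field form shown here, and the name `read_edge_lines` is hypothetical.

```python
def read_edge_lines(lines):
    """Expand 'from<TAB>to[,to...]<TAB>value' lines into {from: {to: value}}."""
    edges = {}
    for line in lines:
        source, targets, value = line.split("\t")
        for target in targets.split(","):
            edges.setdefault(int(source), {})[int(target)] = int(value)
    return edges


# Three lines taken from the crossref excerpt above.
sample = [
    "1414202\t1414208\t84",
    "1414202\t1414212\t89",
    "1414208\t1414202,1414212\t84",
]

crossref = read_edge_lines(sample)
print(crossref)
```

The third line yields two edges from node 1414208, both with value 84.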

The numbers become clearer when we start weaving.

But for that we need the warp features first.

## Warp features

* `otype`: each node has a type
* `oslots`: each non-slot node is linked to a set of slot nodes
* `otext`: specification of sections and text formats
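
A toy model may help to see what `otype` and `oslots` do. The sketch below uses plain dictionaries with invented nodes and values, not the Text-Fabric API, and it leaves `otext` out. The names `words_of` and `verse_text` are hypothetical.

```python
# Warp: five word slots (nodes 1-5) and two verse nodes (6 and 7).
otype = {1: "word", 2: "word", 3: "word", 4: "word", 5: "word",
         6: "verse", 7: "verse"}
oslots = {6: {1, 2, 3}, 7: {4, 5}}  # non-slot node -> the slots it contains

# Wefts: independent features that refer to the same node numbers.
text = {1: "in", 2: "the", 3: "beginning", 4: "and", 5: "then"}
pos = {1: "prep", 2: "art", 3: "subs", 4: "conj", 5: "advb"}


def words_of(node):
    """Return the slots of a non-slot node in textual order."""
    return sorted(oslots[node])


def verse_text(node, feature):
    """Weave one weft feature through the slots of a node."""
    return " ".join(feature[w] for w in words_of(node))


for verse in (n for n, t in otype.items() if t == "verse"):
    print(verse, verse_text(verse, text), "|", verse_text(verse, pos))
```

The two wefts never refer to each other. They meet only through the node numbers of the warp.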

Think of the IKEA warehouse: everything is nicely piled together

<img src="images/warehouse.jpg" width="60%" align="right"/>

* a pallet with words
* a corridor with linguistic objects
* shelves with clauses and phrases
* crates with books, chapters, and verses
* every part is labeled with a node number

<img src="images/barcode.png" align="left" width="100"/>

## Weaving ...

We use tools to weave the yarns into a fabric.

Let's enter the workshop and use a loom.

If we have a **warp**, we can weave the **wefts** into it.

<img src="images/loom.png" width="50%" align="right"/>

AD 1425 [Hausbücher der Nürnberger Zwölfbrüderstiftungen](http://www.nuernberger-hausbuecher.de/75-Amb-2-317-4-v/data)

A TF resource 

* has one fixed set of warp features: `otype oslots otext`
* has arbitrarily many weft features: `sp g_word_utf8`, ...
* can be augmented with wefts from TF modules.

A TF module
* has only weft features

When you use modules, they should have been built around the same warp as the main resource.

### ... a weave: parallel verses

<img src="images/weave.jpg" width="100%"/>

In [19]:
langs = ('am', 'sw', 'da', 'nl', 'en')
myLang = langs[1] 
# start weaving !
n = 0
limit = 10
for verse in F.otype.s('verse'):
    if n > limit: break
    crossVerses = E.crossref.f(verse)
    if crossVerses:
        for (cross, confidence) in crossVerses:
            if n > limit: break
            n += 1
            passage = '{} {}:{}'.format(*T.sectionFromNode(verse, lang=myLang))
            other   = '{} {}:{}'.format(*T.sectionFromNode(cross, lang=myLang))
            print(f'{verse} = {passage} =~({confidence}%) {other} = {cross}')

1414202 = Mwanzo 1:13 =~(84%) Mwanzo 1:19 = 1414208
1414202 = Mwanzo 1:13 =~(89%) Mwanzo 1:23 = 1414212
1414204 = Mwanzo 1:15 =~(77%) Mwanzo 1:17 = 1414206
1414206 = Mwanzo 1:17 =~(77%) Mwanzo 1:15 = 1414204
1414208 = Mwanzo 1:19 =~(84%) Mwanzo 1:13 = 1414202
1414208 = Mwanzo 1:19 =~(84%) Mwanzo 1:23 = 1414212
1414212 = Mwanzo 1:23 =~(89%) Mwanzo 1:13 = 1414202
1414212 = Mwanzo 1:23 =~(84%) Mwanzo 1:19 = 1414208
1414299 = Mwanzo 5:4 =~(79%) Mwanzo 5:7 = 1414302
1414299 = Mwanzo 5:4 =~(75%) Mwanzo 5:13 = 1414308
1414299 = Mwanzo 5:4 =~(78%) Mwanzo 5:16 = 1414311


# More fabrics: trees

In 2013/2014 we
[extracted](https://github.com/ETCBC/lingo/blob/master/trees/trees.ipynb)
tree structures from the BHSA data.

Every sentence has a tree associated with it, like this:

```
(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
```
The numbers are the positions of the words within the sentence, counting from 0.

## Trees as feature

The trees are available in a feature `tree`, defined for sentences.

```
@node
@converter=Dirk Roorda
@convertor=trees.ipynb
@coreData=BHSA
@coreVersion=2017
@description=penn treebank represententation for sentences
@url=https://github.com/etcbc/lingo/trees/trees.ipynb
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-01-21T18:53:06Z

1172209	(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(U(pp 4)(dt 5)(n 6))(cj 7)(U(pp 8)(dt 9)(n 10)))))
(S(C(CP(cj 0))(NP(dt 1)(n 2))(VP(vb 3))(NP(U(n 4))(cj 5)(U(n 6)))))
(S(C(CP(cj 0))(NP(n 1))(PP(pp 2)(U(n 3))(U(n 4)))))
(S(C(CP(cj 0))(NP(U(n 1))(U(n 2)))(VP(vb 3))(PP(pp 4)(U(n 5))(U(dt 6)(n 7)))))
```

Trees are nice. But this output does **not** look nice.

## Display

We want

* multiline view
* see the words
* phonetically
* with gloss
* and with **Open Scriptures Morphology** tag!

In [53]:
passage = ('Job', 3, 16)
passageStr = '{} {}:{}'.format(*passage)
verse = T.nodeFromSection(passage)
sentence = L.d(verse, otype='sentence')[0]
firstSlot = L.d(sentence, otype='word')[0]
stringTree = F.tree.v(sentence)
print(f'{passageStr} - first word = {firstSlot}\ntree = {stringTree}')

Job 3:16 - first word = 336986
tree = (S(C(Ccoor(CP(cj 0))(PP(pp 1)(U(n 2))(U(vb 3)))(NegP(ng 4))(VP(vb 5)))(Ccoor(PP(pp 6)(n 7)(Cattr(NegP(ng 8))(VP(vb 9))(NP(n 10)))))))


## Parsing

Parse it into a structure:

In [54]:
tree = structure(stringTree)
tree

['S',
 ['C',
  ['Ccoor',
   ['CP', [('cj', 0)]],
   ['PP', [('pp', 1)], ['U', [('n', 2)]], ['U', [('vb', 3)]]],
   ['NegP', [('ng', 4)]],
   ['VP', [('vb', 5)]]],
  ['Ccoor',
   ['PP',
    [('pp', 6)],
    [('n', 7)],
    ['Cattr',
     ['NegP', [('ng', 8)]],
     ['VP', [('vb', 9)]],
     ['NP', [('n', 10)]]]]]]]
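
For readers who want to see how such a string can be parsed, here is a small recursive-descent sketch. It is not the `structure()` function from `utils`, and its output shape is simplified: a leaf is a `(tag, number)` tuple and an inner node is a list `[label, child, ...]`. The name `parse_tree` is hypothetical.

```python
def parse_tree(s):
    """Parse a bracketed tree string into nested lists with (tag, number) leaves."""
    pos = 0

    def node():
        nonlocal pos
        assert s[pos] == "("
        pos += 1
        start = pos
        while s[pos] not in "( )":  # read the label up to '(', ' ' or ')'
            pos += 1
        label = s[start:pos]
        if s[pos] == " ":
            # leaf: "(tag number)"
            pos += 1
            start = pos
            while s[pos] != ")":
                pos += 1
            leaf = (label, int(s[start:pos]))
            pos += 1  # skip ')'
            return leaf
        # inner node: "(label child child ...)"
        children = []
        while s[pos] == "(":
            children.append(node())
        assert s[pos] == ")"
        pos += 1
        return [label, *children]

    return node()


parsed = parse_tree("(S(C(CP(cj 0))(VP(vb 1))))")
print(parsed)
```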

We can display it in a friendlier way:

In [56]:
print(layout(tree, firstSlot, str))

  S
    C
      Ccoor
        CP
          cj 336986
        PP
          pp 336987
          U
            n 336988
          U
            vb 336989
        NegP
          ng 336990
        VP
          vb 336991
      Ccoor
        PP
          pp 336992
          n 336993
          Cattr
            NegP
              ng 336994
            VP
              vb 336995
            NP
              n 336996


Note that `layout()` has replaced the relative word numbers in the sentence with absolute slot numbers in the dataset.
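
The step from relative to absolute numbers is just an addition: absolute slot = first slot of the sentence + position in the sentence. The sketch below shows that on a small tree. It is not the `layout()` function from `utils`: the name `layout_lines`, the tree shape (lists with `(tag, number)` leaves) and the `render` parameter are assumptions made for illustration.

```python
def layout_lines(tree, first_slot, render=str, indent=0):
    """Return indented lines, turning relative word numbers into absolute slots."""
    pad = "  " * indent
    if isinstance(tree, tuple):
        tag, relative = tree
        # render decides what to show for a slot: its number, its gloss, ...
        return [f"{pad}{tag} {render(first_slot + relative)}"]
    label, *children = tree
    lines = [f"{pad}{label}"]
    for child in children:
        lines.extend(layout_lines(child, first_slot, render, indent + 1))
    return lines


small = ["S", ["C", ["CP", ("cj", 0)], ["VP", ("vb", 1)]]]
lines = layout_lines(small, 336986)
print("\n".join(lines))
```

Passing another function as `render` is where features can be woven in: it receives an absolute slot number and may return any string for it.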

## Weaving the wefts ...
All wefts are there; we only have to weave them around the warp.

In [37]:
def osmPhonoGloss(n):
    lexNode = L.u(n, otype='lex')[0]
    return '{{{}}} "{}" [{}] = {}'.format(
        F.osm.v(n),
        F.g_word_utf8.v(n),
        F.phono.v(n),
        F.gloss.v(lexNode), # gloss is a feature on lexemes, not words
        # F.voc_lex_utf8.v(lexNode),
    )

## ... into a weave

In [38]:
print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))


In [39]:
def showTree(s):
    t = F.tree.v(s)
    tree = structure(t)
    firstSlot = L.d(s, otype='word')[0]
    label = '{} {}:{}'.format(*T.sectionFromNode(firstSlot))
    print(label)
    print(layout(tree, firstSlot, osmPhonoGloss, withLevel=True))
    
sentenceInfo = [c for c in C.levels.data if c[0] == 'sentence'][0]
minSentence = sentenceInfo[2]
maxSentence = sentenceInfo[3]

In [40]:
interact(
    showTree,
    s=widgets.IntSlider(
        min=minSentence,
        max=maxSentence,
        step=1,
        value=minSentence,
    )
)

<function __main__.showTree>

# No leaking of concerns

* The TREES module knows nothing of OS morphology
* OS morphology is not aware of TREES
* *thank goodness*
* But they are woven cosily together in one display

## Carried away by tree structures

The raw strings are handy for structure analysis, in a way the woven trees cannot be.

Let us see how many distinct tree structures we've got.

**Liberate yourself from micro-management**

In [57]:
treeDistribution = F.tree.freqList()

distinct = len(treeDistribution)
total = sum(x[1] for x in treeDistribution)

print(f'{distinct} distinct trees of {total} in total')

28096 distinct trees of 63711 in total


In [58]:
for (tree, amount) in treeDistribution[0:10]:
    print(f'{amount:>4} x {tree}')

3772 x (S(C(CP(cj 0))(VP(vb 1))))
1238 x (S(C(VP(vb 0))))
1173 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2))))
 857 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n 3))))
 749 x (S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))
 577 x (S(C(CP(cj 0))(VP(vb 1))(PrNP(n-pr 2))))
 568 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(dt 3)(n 4))))
 554 x (S(C(VP(vb 0))(NP(n 1))))
 441 x (S(C(CP(cj 0))(NegP(ng 1))(VP(vb 2))))
 406 x (S(C(CP(cj 0))(VP(vb 1))(PP(pp 2)(n-pr 3))))
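
For readers without the dataset at hand, the counting can be mimicked in plain Python. This is a sketch on six toy tree strings, not the Text-Fabric implementation; `freq_list` is a hypothetical helper, and the tie-breaking order (by value) is an assumption.

```python
import collections


def freq_list(values):
    """Pairs (value, count), most frequent first; ties are ordered by value."""
    counts = collections.Counter(values)
    return sorted(counts.items(), key=lambda x: (-x[1], x[0]))


# toy data: six sentences, three distinct structures
toy_trees = [
    '(S(C(CP(cj 0))(VP(vb 1))))',
    '(S(C(VP(vb 0))))',
    '(S(C(CP(cj 0))(VP(vb 1))))',
    '(S(C(VP(vb 0))(NP(n 1))))',
    '(S(C(CP(cj 0))(VP(vb 1))))',
    '(S(C(VP(vb 0))))',
]

distribution = freq_list(toy_trees)
distinct = len(distribution)
total = sum(amount for (tree, amount) in distribution)

print(f'{distinct} distinct trees of {total} in total')
for (tree, amount) in distribution:
    print(f'{amount:>4} x {tree}')
```

Because the trees carry relative word numbers only, two sentences with the same structure yield the same string, and that is what makes this kind of counting possible.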


I'm intrigued by the most frequent tree structure.

Which verbs occur in such a sentence? Let's find out.

In [59]:
import collections

lexemes = collections.Counter()
short = treeDistribution[0][0]  # the most frequent tree: (S(C(CP(cj 0))(VP(vb 1))))
for s in F.otype.s('sentence'):
    if F.tree.v(s) == short:
        verb = L.d(s, otype='word')[1]  # word 0 is the conjunction, word 1 the verb
        lexeme = L.u(verb, otype='lex')[0]
        lexemes[lexeme] += 1
print(f'{len(lexemes)} lexemes found')

501 lexemes found


In [60]:
for (lex, amount) in sorted(
    lexemes.items(),
    key=lambda x: (-x[1], x[0]),
)[0:10]:
    print(f'{amount:>4} x {lex} "{F.voc_lex_utf8.v(lex)}" = {F.gloss.v(lex)}')

1045 x 1437422 "אמר" = say
 203 x 1437412 "היה" = be
 107 x 1437561 "הלך" = walk
 106 x 1437570 "מות" = die
  87 x 1437424 "ראה" = see
  80 x 1437574 "בוא" = come
  71 x 1437645 "שׁוב" = return
  70 x 1437569 "אכל" = eat
  52 x 1437685 "קום" = arise
  45 x 1437654 "חיה" = be alive


## Application

You do not need to "own" the text in order to work with trees.

If you do not own the text (BHS), but have the trees:

* Publish them with word numbers

* Align them to a text you own (WLC)

* Make a list of alignment exceptions
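
A small sketch of the idea: a tree published with word numbers only can be filled in with the words of whatever text you do have. The helper `fill_words` and the English stand-in words are hypothetical, and the regular expression assumes that leaves look like `(vb 1)`, as in the tree strings shown earlier.

```python
import re

# a leaf is a lowercase tag (possibly with a hyphen), a space and a number
LEAF = re.compile(r'([a-z-]+) (\d+)')


def fill_words(tree, words):
    """Replace each relative word number in a tree string by the word itself."""
    return LEAF.sub(lambda m: f'{m.group(1)} {words[int(m.group(2))]}', tree)


tree = '(S(C(CP(cj 0))(VP(vb 1))))'
own_words = ['and', 'said']  # stand-ins for words taken from your own text

filled = fill_words(tree, own_words)
print(filled)
```

The tree string itself contains no text at all, so it can be shared freely; the words are supplied locally, after alignment.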

<img src="images/openscriptures.png" align="right"/>

# Example B - Open Scriptures Morphology

* align the WLC with the BHS
* compare the OSM with the BHSA

## Aligning

See [BHSAbridgeOSM.ipynb](https://github.com/ETCBC/bridging/blob/master/programs/BHSAbridgeOSM.ipynb)

* performs a consonant-by-consonant alignment between the WLC and the BHS
* stumbles on a few cases that require a hint:

```python
exceptions = {
    215253: 1,
    266189: 1,
    287360: 2,
    376865: 1,
    383405: 2,
    384049: 1,
    384050: 1,
    405102: -2,
}
```

```
Succeeded in aligning BHS with OSM
420103 BHS words matched against 469448 OSM morphemes with 8 known exceptions
```
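
How such hints can steer an alignment is sketched below. This is not the code of the bridging notebook: the function `align`, the transliterated toy data, and the reading of a positive hint as "take this many morphemes without comparing" are all assumptions made for illustration. Negative hints are not handled.

```python
def align(words, morphemes, exceptions=None):
    """Greedy alignment: map each word index to the morpheme indices it covers."""
    exceptions = exceptions or {}
    mapping = {}
    j = 0  # position in the morpheme list
    for (i, word) in enumerate(words):
        if i in exceptions:
            # hint: accept this many morphemes without comparing consonants
            n = exceptions[i]
            mapping[i] = list(range(j, j + n))
            j += n
            continue
        start = j
        built = ''
        # keep taking morphemes until they are as long as the word
        while j < len(morphemes) and len(built) < len(word):
            built += morphemes[j]
            j += 1
        if built != word:
            raise ValueError(f'word {i} "{word}" does not match "{built}"')
        mapping[i] = list(range(start, j))
    return mapping


# toy data in transliteration; the last word differs by one consonant
words = ['bn', 'hmlk', 'ymwny']
morphemes = ['bn', 'h', 'mlk', 'ymyny']

mapping = align(words, morphemes, exceptions={2: 1})
print(mapping)
```

Without the hint for word 2 the function raises an error at the mismatch, which is the moment at which a new entry for the exception list is discovered.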

### Spotting the anomalies

With a bit of weaving, these exceptions are:

```
Isaiah 9:6
                    BHS 215253         = מרבה
                    OSM w1             = םרבה
Ezekiel 4:6
                    BHS 266189         = ימוני
                    OSM w7             = ימיני
```

```
Ezekiel 43:11
                    BHS 287360         = צורתו
                    OSM w17, w17       = צורת/י
Daniel 10:19
                    BHS 376865         = כְ
                    OSM w10            = בְ
Ezra 10:44
                    BHS 383405         = נשׂאו
                    OSM w3, w3         = נשא/י

```

```
Nehemiah 2:13
                    BHS 384049         = הם
                    OSM w17            = ה
Nehemiah 2:13
                    BHS 384050         = פרוצים
                    OSM w17            = מפרוצים
```

```
1_Chronicles 27:12
                    BHS 405102, 405103 = בן/ימיני
                    OSM w6             = בנימיני
```

### Word breaking

There are cases where the OSM and the BHSA differ in the breaking-up of words.

```
OSM morphemes without BHSA word:          0
OSM morphemes with multiple BHSA words: 130
OSM morphemes with 2        BHSA words: 123
OSM morphemes with 3        BHSA words:   7
```

### Unfinished

The OSM is not yet finished.

We made a list of the word nodes for which no morpheme has been tagged.

53841 word nodes, roughly 10%, are unfinished.

```
Non-marked-up stretches having length x: y times
   1: 14990
   2:  8336
   3:  2802
   4:  1090
   5:   493
   6:   285
   7:   162
   8:    90
   9:    70
  10:    37
  11:    33
  12:    19
  13:    11
  14:    17
  15:     9
  16:     7
  17:     2
  18:     2
  19:     6
  20:     1
  21:     1
  22:     3
  23:     2
  25:     2
  26:     2
  27:     1
  28:     1
  29:     1
  32:     1
  33:     1
  35:     1
  36:     1
  38:     2
  41:     1
  47:     1
  60:     1
  61:     1
  72:     1
  74:     1
  75:     1
```
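
A table like this can be produced by measuring runs of consecutive untagged words. The sketch below uses a made-up list of booleans (one per word, `True` meaning tagged) and a hypothetical helper `stretch_lengths`; it illustrates the technique and is not the notebook's own code.

```python
import collections
import itertools


def stretch_lengths(tagged):
    """Count how often each length of consecutive untagged words occurs."""
    lengths = collections.Counter()
    for (is_tagged, group) in itertools.groupby(tagged):
        if not is_tagged:
            lengths[len(list(group))] += 1
    return lengths


# toy data: stretches of length 1, 3 and 1
tagged = [True, False, True, True, False, False, False, True, False]

lengths = stretch_lengths(tagged)
for (length, times) in sorted(lengths.items()):
    print(f'{length:>4}: {times:>5}')
```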

What remains is: filling in the dots!

We will carry out the comparison for unproblematic words:

* good alignment       (  8 BHSA words excluded)
* same word breaks     (276 BHSA words excluded)
* morph tags available

### Result: OSM module

Two new TF features:
* `osm.tf`    (main words)
* `osm_sf.tf` (suffixes)

Together: the **OSM module**

```
@node
@conversion=notebook openscriptures in BHSA repo
@conversion_author=Dirk Roorda
@coreData=BHSA
@coreVersion=2017
@description=primary morphology string according to OpenScriptures
@source=Open Scriptures
@source_url=https://github.com/openscriptures/morphhb
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2018-01-12T13:21:01Z

HR
HNcfsa
HVqp3ms
HNcmpa
HTo
HTd
HNcmpa
HC
HTo
HTd
HNcbsa
```
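
The file shown above is simple enough to read with a few lines of Python. The sketch below is not the Text-Fabric loader. It assumes the simplest form of a feature file: metadata lines starting with `@`, one blank line, and then one value per line for nodes 1, 2, 3 and so on. Explicit node numbers, which the format also allows, are not handled, and `parse_simple_tf` is a hypothetical name.

```python
def parse_simple_tf(text):
    """Split a minimal feature file into metadata and node values."""
    meta = {}
    values = {}
    lines = text.split('\n')
    i = 0
    # metadata block: lines starting with @, up to the first blank line
    while i < len(lines) and lines[i].startswith('@'):
        (key, sep, value) = lines[i][1:].partition('=')
        meta[key] = value
        i += 1
    i += 1  # skip the blank separator line
    node = 0
    for line in lines[i:]:
        node += 1  # implicit numbering: each line is the next node
        if line != '':
            values[node] = line
    return (meta, values)


sample = '@node\n@valueType=str\n\nHR\nHNcfsa\nHVqp3ms'
(meta, values) = parse_simple_tf(sample)
print(meta)
print(values)
```

Because a feature is just a column of values tied to node numbers, a module such as OSM can be delivered as a couple of files that sit next to the core data without touching it.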

# Comparing

We compare *categories*.

In [OSM](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html):
* parts of speech
* and their subtypes

In BHSA the features:
* [sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp.html) = part of speech
* [ls](https://etcbc.github.io/bhsa/features/hebrew/2017/ls.html) = lexical set
* [nametype](https://etcbc.github.io/bhsa/features/hebrew/2017/nametype.html)

<img src="images/openscriptures.png" align="right"/>

## OSM categories

```python
pspOSM = {
    '': dict(
        A='adjective',
        C='conjunction',
        D='adverb',
        N='noun',
        P='pronoun',
        R='preposition',
        S='suffix',
        T='particle',
        V='verb',
    ),
    'A': dict(
        a='adjective',
        c='cardinal number',
        g='gentilic',
        o='ordinal number',
    ),
    'N': dict(
        c='common',
        g='gentilic',
        p='proper name',
    ),
    'P': dict(
        d='demonstrative',
        f='indefinite',
        i='interrogative',
        p='personal',
        r='relative',
    ),
    'R': dict(
        d='definite article',
    ),
    'S': dict(
        d='directional he',
        h='paragogic he',
        n='paragogic nun',
        p='pronominal',
    ),
    'T': dict(
        a='affirmation',
        d='definite article',
        e='exhortation',
        i='interrogative',
        j='interjection',
        m='demonstrative',
        n='negative',
        o='direct object marker',
        r='relative',
    ),
}
```
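
With such a table, a tag from the feature file can be decoded by position. The sketch below carries a trimmed copy of the table under the hypothetical name `PSP_OSM`. It assumes that the first character of a tag is the language code and that the tag describes a single morpheme; compound tags are not handled.

```python
# trimmed copy of the table above, enough for the examples
PSP_OSM = {
    '': dict(N='noun', R='preposition', T='particle', V='verb'),
    'N': dict(c='common', g='gentilic', p='proper name'),
    'T': dict(d='definite article', o='direct object marker'),
}


def decode(tag):
    """Return (part of speech, subtype) for a single-morpheme tag."""
    psp = tag[1:2]  # position 0 is assumed to be the language code
    sub = tag[2:3]
    return (PSP_OSM[''].get(psp), PSP_OSM.get(psp, {}).get(sub))


for tag in ('HR', 'HNcfsa', 'HVqp3ms', 'HTd'):
    print(tag, decode(tag))
```

Verbs have no subtype table here, so for `HVqp3ms` the second member of the result is `None`.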

<img src="images/etcbc-round.png" align="right"/>

## BHSA categories

```python
spBHS = dict(
    art='article',
    verb='verb',
    subs='noun',
    nmpr='proper noun',
    advb='adverb',
    prep='preposition',
    conj='conjunction',
    prps='personal pronoun',
    prde='demonstrative pronoun',
    prin='interrogative pronoun',
    intj='interjection',
    nega='negative particle',
    inrg='interrogative particle',
    adjv='adjective',
)
lsBHS = dict(
    nmdi='distributive noun',
    nmcp='copulative noun',
    padv='potential adverb',
    afad='anaphoric adverb',
    ppre='potential preposition',
    cjad='conjunctive adverb',
    ordn='ordinal',
    vbcp='copulative verb',
    mult='noun of multitude',
    focp='focus particle',
    ques='interrogative particle',
    gntl='gentilic',
    quot='quotation verb',
    card='cardinal',
    none=MISSING,
)
nametypeBHS = dict(
    pers='person',
    mens='measurement unit',
    gens='people',
    topo='place',
    ppde='demonstrative personal pronoun',
)
nametypeBHS.update({
    'pers,gens,topo': 'person',
    'pers,gens': 'person',
    'gens,topo': 'gentilic',
    'pers,god': 'person',
})
```

## Better dumb than smart

We simply counted how often each pair of (OSM category, BHSA category) co-occurs on a word.
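
The dumb approach fits in a few lines. The sketch below works on made-up pairs, and the names `count_pairs` and `report` are hypothetical; the real pairs come from the aligned words.

```python
import collections


def count_pairs(pairs):
    """Group the BHSA categories by OSM category and count them."""
    by_osm = collections.defaultdict(collections.Counter)
    for (osm, bhsa) in pairs:
        by_osm[osm][bhsa] += 1
    return by_osm


def report(by_osm):
    """One header line per OSM category, then its BHSA counterparts."""
    lines = []
    for osm in sorted(by_osm):
        total = sum(by_osm[osm].values())
        lines.append(osm)
        ranked = sorted(by_osm[osm].items(), key=lambda x: (-x[1], x[0]))
        for (bhsa, n) in ranked:
            perc = int(round(100 * n / total))
            lines.append(f'\t{bhsa:<30} ({perc:>3}% = {n:>6}x)')
    return lines


# toy data: 20 words that OSM calls a verb
pairs = (
    [('verb', 'verb::')] * 17
    + [('verb', 'verb:quotation verb:')] * 2
    + [('verb', 'noun::')] * 1
)

by_osm = count_pairs(pairs)
lines = report(by_osm)
print('\n'.join(lines))
```

No knowledge of either tagging system goes into the counting, so every surprise in the output is a fact about the data and not about the method.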

A selection of the outcomes follows.

Each block is headed by an OSM category and lists the BHSA categories found on the same words.

### Verbs

```
verb                                    
	verb::                         ( 84% =  50691x)
	verb:quotation verb:           ( 10% =   6137x)
	verb:copulative verb:          (  5% =   3246x)
	noun::                         (  0% =      6x)
	adjective::                    (  0% =      3x)
	preposition::                  (  0% =      1x)
	proper noun::                  (  0% =      1x)
```

**Excellent!** Just 11 discrepancies in about 60,000 cases.

### Prepositions

```
preposition                             
	preposition::                  ( 96% =  50697x)
	noun:potential preposition:    (  3% =   1643x)
	adverb:conjunctive adverb:     (  0% =    194x)
	interrogative particle::       (  0% =    169x)
	noun:cardinal:                 (  0% =     13x)
	conjunction::                  (  0% =      5x)
	noun::                         (  0% =      2x)
	proper noun::                  (  0% =      2x)
	article::                      (  0% =      1x)
	verb::                         (  0% =      1x)
```

In [61]:
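# discrepancies: all rows below 'noun:potential preposition:' in the table above
disc = 194 + 169 + 13 + 5 + 2 + 2 + 1 + 1
# the 1643 'potential preposition' cases are left out of both counts
tot = 50697 + disc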
discPerc = round(100 * disc / tot, 2)
print(f'Discrepancies: {discPerc}% = {disc}x out of {tot}')

Discrepancies: 0.76% = 387x out of 51084


## Attention needed!

* all *rare* cases have been collected into a big list
* context info has been woven into the list
* there are 645 such cases
* see [allCategoriesCases.tsv](https://github.com/ETCBC/bridging/blob/master/programs/allCategoriesCases.tsv) on GitHub

![1](images/cases1.png)

![2](images/cases2.png)

# Follow up?

* inspect the rare cases:
  * these might be glitches, in BHSA or in OSM or in both
  * these might be disputable cases: add them to the docs

* inspect the majority cases: which categories map to which?
  * maybe some categories can be harmonized
  * if that is not desirable: we can generate an exhaustive mapping

* in the end: we can make a BHSA-OSM category mapping that is
  * comprehensive
  * machine-readable
  * documented
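
What a machine-readable mapping could look like is sketched below. The counts are copied from the verb and preposition tables above (top rows only), while the 1% threshold, the name `build_mapping` and the JSON layout are assumptions made for illustration, not decisions taken in the project.

```python
import json

# counts taken from the tables above (top rows only)
counts = {
    'verb': {
        'verb::': 50691,
        'verb:quotation verb:': 6137,
        'verb:copulative verb:': 3246,
        'noun::': 6,
    },
    'preposition': {
        'preposition::': 50697,
        'noun:potential preposition:': 1643,
        'adverb:conjunctive adverb:': 194,
    },
}

THRESHOLD = 0.01  # hypothetical: below this share a pairing counts as rare


def build_mapping(counts, threshold):
    """Keep, per OSM category, the BHSA categories that are not rare."""
    mapping = {}
    for (osm, bhsa_counts) in counts.items():
        total = sum(bhsa_counts.values())
        mapping[osm] = sorted(
            bhsa for (bhsa, n) in bhsa_counts.items() if n / total >= threshold
        )
    return mapping


mapping = build_mapping(counts, THRESHOLD)
print(json.dumps(mapping, indent=2))
```

The pairings that fall below the threshold are exactly the ones to inspect by hand.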

# Conclusions

## BHSA versus OSM
<img src="images/etcbc-round-small.png" align="left" width="48"/> is awesome

<img src="images/openscriptures.png" align="left" width="72"/> is terrific

<img src="images/etcbc-plus-os.png" align="left"/> 
$\gt($ awesome $+$ terrific $)$

# Data, Logic, and Logistics: Text-Fabric
<img src="images/tf-small.png" align="right"/>

[Text-Fabric](https://github.com/Dans-labs/text-fabric) is a
* [model](https://github.com/Dans-labs/text-fabric/wiki/Data-Model)
* [file format](https://github.com/Dans-labs/text-fabric/wiki/File-formats)
* [tool](https://github.com/Dans-labs/text-fabric/wiki/Api)

to support the logistics of the interchange of textual treasures

So that you ... 

* ... researcher and tinkerer
* ... programming theologian

* can grab parts from GitHub
* bring them to your shed
* and join them together on your workbench

<img src="images/ikea.png" width="80%"/>

**Designed especially for you** - Thank you

dirk.roorda@dans.knaw.nl