<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-small.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/DANS-logo_small.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="right" src="images/VU-ETCBC-small.png"/></a>
<a href="https://www.academic-bible.com/en/online-bibles/biblia-hebraica-stuttgartensia-bhs/read-the-bible-text/" target="_blank"><img align="right" src="files/images/DBG-small.png"/></a>

# Accented Units

Some words in the Hebrew text are contiguous with the preceding or following word: no white space separates them, only nothing at all or a maqaf (hyphen). We call such a sequence of contiguous words an *accented unit*.

How do you find, given a word occurrence, the complete accented unit it belongs to?

In [1]:
import sys, os

import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.8.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



# Loading the feature data

In [2]:
version = '4b'
API = fabric.load('etcbc{}'.format(version), 'lexicon,para', 'paragraphs', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads
        g_word_utf8 trailer_utf8
    ''',
    '''
    '''),
    "prepare": prepare,
    "primary": False,
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s DETAIL: COMPILING m: etcbc4b: UP TO DATE
  0.01s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.01s DETAIL: COMPILING a: lexicon: UP TO DATE
  0.01s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
  0.01s DETAIL: COMPILING a: para: UP TO DATE
  0.01s USING annox: para DATA COMPILED AT: 2016-07-08T14-38-37
  0.03s DETAIL: load main: G.node_anchor_min
  0.15s DETAIL: load main: G.node_anchor_max
  0.23s DETAIL: load main: G.node_sort
  0.29s DETAIL: load main: G.node_sort_inv
  0.70s DETAIL: load main: G.edges_from
  0.76s DETAIL: load main: G.edges_to
  0.83s DETAIL: load main: F.etcbc4_db_monads [node] 
  2.11s DETAIL: load main: F.etcbc4_db_otype [node] 
  2.75s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node] 
  3.04s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node] 
  3.16s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/paragraphs/__log__paragraphs.txt
  3.16s INFO: LOADING PREPARED data: please wait ... 


# Method

We use `trailer_utf8` as the criterion for whether a word and the following word are contiguous. If it is the empty string or a maqaf, we conclude that the word in question and the next one are part of the same accented unit.
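
As a minimal sketch of this criterion outside LAF-Fabric, the test can be written as a small predicate on trailer strings. The function name `continues_unit` and the sample trailers are illustrative assumptions, not part of the LAF-Fabric API.

```python
# Illustrative sketch: the name continues_unit and the sample data are assumptions.
MAQAF = '\u05be'      # HEBREW PUNCTUATION MAQAF
GLUE = {'', MAQAF}    # trailers that keep the next word in the same accented unit

def continues_unit(trailer):
    """Return True if a word with this trailer is contiguous with the next word."""
    return trailer in GLUE

samples = ['', MAQAF, ' ', '\u05c3\n']   # empty, maqaf, space, sof pasuq + newline
flags = [continues_unit(t) for t in samples]
print(flags)
```

Any trailer that contains white space or other punctuation falls outside the set and therefore ends the unit.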

In LAF-Fabric it is not straightforward to proceed from a word node to the node of the next or previous word.

A solution is then to walk through all the words in text order, and make a mapping from words to accented units.

If we store that mapping, we can easily find the accented unit (*au*) of any word node that we encounter.
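
The walk can be tried without LAF-Fabric on a small hand-made word list. The sketch below assumes that words are given as (id, trailer) pairs in text order. The function name `index_units` and the toy data are hypothetical, and the logic follows the description above rather than any LAF-Fabric call.

```python
# Illustrative sketch on toy data; index_units and the word list are assumptions.
GLUE = {'', '\u05be'}   # empty string or maqaf: the next word joins the current unit

def index_units(words):
    """Map each word id to the list of word ids in its accented unit.

    words: iterable of (word_id, trailer) pairs in text order.
    """
    word2unit = {}
    current = []
    for word_id, trailer in words:
        current.append(word_id)
        word2unit[word_id] = current   # words of one unit share one list object
        if trailer not in GLUE:        # white space or punctuation closes the unit
            current = []
    return word2unit

# Toy input: words 1+2 are glued by an empty trailer, 4+5 by a maqaf.
toy = [(1, ''), (2, ' '), (3, ' '), (4, '\u05be'), (5, ' ')]
word2unit = index_units(toy)
print(word2unit[1], word2unit[3], word2unit[5])
```

Looking up any word id then returns its complete unit in a single dictionary access. Because the words of one unit share the same list object, two words can be tested for membership of the same unit with an identity comparison.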

# Making an index

In [3]:
inf('Compiling index of accented units ...')
word2au = {}      # maps each word node to the list of word nodes in its accented unit
aus = set()       # only needed to count the total number of accented units
glue = {'', '־'}  # the interword material that continues the current au
current_au = []   # word nodes of the unit under construction
for w in F.otype.s('word'):
    current_au.append(w)
    word2au[w] = current_au  # all words of one unit share the same list object
    if F.trailer_utf8.v(w) not in glue: # move to a new au
        aus.add(tuple(current_au))
        current_au = []
if current_au: aus.add(tuple(current_au))  # flush a unit left open at the end of the text
inf('Assembled {} words into {} accented units'.format(
    len(word2au),
    len(aus),
))

  1.99s Compiling index of accented units ...
  4.45s Assembled 426568 words into 262497 accented units


# Result

We display the text of Genesis 1:1-11 and mark each accented unit with a tuple of the monad numbers of its words.

In [4]:
text = ''
for verse in range(1,12):
    text += '\nGenesis 1:{}\n'.format(verse)
    vn = T.node_of('Genesis', 1, verse)
    prev_au = None
    for w in L.d('word', vn):
        this_au = word2au[w]
        # words of one unit share one list object, so an identity test detects a unit boundary
        if prev_au is not None and this_au is not prev_au:
            text += ' ({}) '.format(','.join(F.monads.v(x) for x in prev_au))
        prev_au = this_au
        text += F.g_word_utf8.v(w)+F.trailer_utf8.v(w)
    if prev_au:  # label the last unit of the verse
        text += ' ({}) '.format(','.join(F.monads.v(x) for x in prev_au))



In [5]:
print(text)


Genesis 1:1
בְּרֵאשִׁ֖ית  (1,2) בָּרָ֣א  (3) אֱלֹהִ֑ים  (4) אֵ֥ת  (5) הַשָּׁמַ֖יִם  (6,7) וְאֵ֥ת  (8,9) הָאָֽרֶץ׃
 (10,11) 
Genesis 1:2
וְהָאָ֗רֶץ  (12,13,14) הָיְתָ֥ה  (15) תֹ֨הוּ֙  (16) וָבֹ֔הוּ  (17,18) וְחֹ֖שֶׁךְ  (19,20) עַל־פְּנֵ֣י  (21,22) תְהֹ֑ום  (23) וְר֣וּחַ  (24,25) אֱלֹהִ֔ים  (26) מְרַחֶ֖פֶת  (27) עַל־פְּנֵ֥י  (28,29) הַמָּֽיִם׃
 (30,31) 
Genesis 1:3
וַיֹּ֥אמֶר  (32,33) אֱלֹהִ֖ים  (34) יְהִ֣י  (35) אֹ֑ור  (36) וַֽיְהִי־אֹֽור׃
 (37,38,39) 
Genesis 1:4
וַיַּ֧רְא  (40,41) אֱלֹהִ֛ים  (42) אֶת־הָאֹ֖ור  (43,44,45) כִּי־טֹ֑וב  (46,47) וַיַּבְדֵּ֣ל  (48,49) אֱלֹהִ֔ים  (50) בֵּ֥ין  (51) הָאֹ֖ור  (52,53) וּבֵ֥ין  (54,55) הַחֹֽשֶׁךְ׃
 (56,57) 
Genesis 1:5
וַיִּקְרָ֨א  (58,59) אֱלֹהִ֤ים ׀  (60) לָאֹור֙  (61,62,63) יֹ֔ום  (64) וְלַחֹ֖שֶׁךְ  (65,66,67,68) קָ֣רָא  (69) לָ֑יְלָה  (70) וַֽיְהִי־עֶ֥רֶב  (71,72,73) וַֽיְהִי־בֹ֖קֶר  (74,75,76) יֹ֥ום  (77) אֶחָֽד׃ פ 
 (78) 
Genesis 1:6
וַיֹּ֣אמֶר  (79,80) אֱלֹהִ֔ים  (81) יְהִ֥י  (82) רָקִ֖יעַ  (83) בְּתֹ֣וךְ  (84,85) הַמָּ֑יִם  (86,87) וִיהִ֣