<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Canonical Text Services

This notebook prepares an export of SHEBANQ to a Canonical Text Services (CTS) system.

# Format

We output the fully pointed Masoretic Text, sectioned by book, chapter, and verse.
Within that text we mark up sentences and clauses.

## Problems
Two complications arise when we put the material in a single hierarchy:

* section boundaries sometimes fall inside a sentence
* sentences and clauses are not always contiguous; they may contain gaps

## Solutions

### Verses and sentences
Some sentences span multiple verses, so verses and sentences cannot both be content-containing elements in one hierarchy. One of the two must be marked up with empty elements that indicate the position where a new unit starts. The other can then be a content-containing element.
We choose to map verses to empty elements.
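
The following is a minimal sketch of this choice, using only Python's standard library. The element names `chapter`, `s`, and `verse`, the attribute `n`, and the sample data are assumptions made for illustration. They are not taken from a CTS schema or from the eventual export.

```python
import xml.etree.ElementTree as ET

def build_chapter(sentences):
    """Build a chapter where sentences hold the text and verses are empty milestones.

    `sentences` is a list of sentences; each sentence is a list of
    (verse_number, text) pairs in reading order.
    """
    chapter = ET.Element('chapter')
    current_verse = None
    for parts in sentences:
        s = ET.SubElement(chapter, 's')
        last = None  # last milestone placed inside this sentence, if any
        for (verse, text) in parts:
            if verse != current_verse:
                # An empty element marks the position where a new verse starts.
                last = ET.SubElement(s, 'verse', n=str(verse))
                last.tail = text
                current_verse = verse
            elif last is None:
                # Text before any milestone belongs directly to the sentence.
                s.text = (s.text or '') + text
            else:
                last.tail = (last.tail or '') + text
    return chapter

# Hypothetical data: the second sentence crosses the boundary between verses 1 and 2.
chapter = build_chapter([
    [(1, 'first sentence ')],
    [(1, 'second sentence, '), (2, 'continued in the next verse')],
])
xml_text = ET.tostring(chapter, encoding='unicode')
print(xml_text)
```

The second sentence stays a single element even though a verse boundary falls inside it, which is the reason for making verses the empty elements.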

### Gaps in sentences and clauses
We map the contiguous parts of sentences and clauses (atoms) to content-containing elements. We do not mark up sentences and clauses as a whole. Instead, each atom element gets an attribute that holds the identifier of the sentence or clause it is part of.
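
As a rough sketch of the idea, the code below splits the word positions of each sentence into contiguous runs and wraps every run in its own element. The element name `sentence_atom`, the attribute name `sentence`, and the sample data are hypothetical. The sketch handles sentences only; clauses would work the same way.

```python
import xml.etree.ElementTree as ET

def atoms(positions):
    """Split a collection of word positions into maximal contiguous runs."""
    runs = []
    for p in sorted(positions):
        if runs and p == runs[-1][-1] + 1:
            runs[-1].append(p)
        else:
            runs.append([p])
    return runs

def mark_up(words, sentences):
    """Wrap each contiguous sentence part in an element that names its sentence.

    `words` maps a word position to its text; `sentences` maps a sentence
    identifier to the set of word positions that the sentence covers.
    """
    parts = []
    for (sid, positions) in sentences.items():
        for run in atoms(positions):
            parts.append((run[0], sid, run))
    root = ET.Element('text')
    # Emit the atoms in text order, regardless of the sentence they belong to.
    for (_, sid, run) in sorted(parts):
        atom = ET.SubElement(root, 'sentence_atom', sentence=sid)
        atom.text = ' '.join(words[p] for p in run)
    return root

# Hypothetical data: sentence s1 has a gap that is filled by sentence s2.
words = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
sentences = {'s1': {1, 2, 5}, 's2': {3, 4}}
root = mark_up(words, sentences)
print(ET.tostring(root, encoding='unicode'))
```

A consumer can reassemble a whole sentence by collecting all atoms that carry the same identifier.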

# Linguistic information
As an experiment, we add some linguistic information.

Every clause has a *text type*, indicating whether the clause is **N**arrative, **D**iscursive, or **Q**uotation (direct speech). The value is a sequence of N, D, Q indicators, because a clause can be narrative within a quotation and so on. If the text type is not clear, the value **?** is used.
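
A small sketch of how such a value could be spelled out. The function name and the label strings are assumptions. The text above does not say whether the first or the last indicator is the outermost level, so the sketch keeps the indicators in the order given and does not interpret the nesting.

```python
# Labels for the indicators described above; '?' stands for an unclear text type.
TEXT_TYPES = {
    'N': 'narrative',
    'D': 'discursive',
    'Q': 'quotation',
    '?': 'unclear',
}

def describe_text_type(value):
    """Spell out a text type value such as 'NQ', one indicator at a time."""
    return [TEXT_TYPES[indicator] for indicator in value]

print(describe_text_type('NQN'))
print(describe_text_type('?'))
```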

We also add the *relation*, indicating the syntactic function of the clause.

At the word level, we add a lexeme identifier and a phonemic representation as attributes,
so every word is wrapped in an element.
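
A minimal sketch of such a word element. The element name `w`, the attribute names `lex` and `phono`, and the values passed in are placeholders chosen for illustration, not names or data from the export.

```python
import xml.etree.ElementTree as ET

def word_element(text, lex, phono):
    """Wrap one word in an element that carries its lexeme id and phonemic form."""
    w = ET.Element('w', lex=lex, phono=phono)
    w.text = text
    return w

# Placeholder values, for illustration only.
w = word_element('WORD', 'LEX-1', 'PHONO')
print(ET.tostring(w, encoding='unicode'))
```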

In [1]:
import sys
import collections
from IPython.display import HTML, display_pretty, display_html

from laf.fabric import LafFabric
from etcbc.preprocess import prepare

fabric = LafFabric()

source = 'etcbc'
version = '4b'

  0.00s This is LAF-Fabric 4.5.10
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
fabric.load(source+version, 'lexicon', 'plain', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        number rela
        lex
        book chapter verse
    ''',''),
    "prepare": prepare,
    "primary": False,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s USING annox DATA COMPILED AT: 2016-01-27T19-01-17
  4.75s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/plain/__log__plain.txt
  4.75s INFO: LOADING PREPARED data: please wait ... 
  4.75s prep prep: G.node_sort
  4.89s prep prep: G.node_sort_inv
  5.50s prep prep: L.node_up
  9.46s prep prep: L.node_down
    15s prep prep: V.verses
    15s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    17s INFO: LOADED PREPARED data
    17s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK plain AT 2016-02-18T14-03-02


## Run the examples

In [3]:
HTML(T.style(params=dict(
    hebrew_color='660000',
    hebrew_size='xx-large',
    hebrew_line_height='2.0'
), show_params=True))

  1.95s etcbc_color = aa0066
  1.95s etcbc_line_height = 1.5
  1.95s etcbc_size = small
  1.95s fmt_color = ccbb00
  1.95s fmt_size = small
  1.95s fmt_width = 5em
  1.95s hebrew_color = 660000
  1.95s hebrew_line_height = 2.0
  1.95s hebrew_size = xx-large
  1.95s phono_color = 00b040
  1.95s phono_line_height = 1.5
  1.95s phono_size = medium
  1.95s verse_color = 0000ff
  1.95s verse_size = small
  1.95s verse_width = 5em


In [4]:
html = []
for fmt in T.formats(): html.append(
    T.verse(
        'Esther', 3, 4,
        fmt=fmt,
        verse_label=True, format_label=True,
        html=True,
    ))
examples = '\n'.join(html)
HTML(examples)

0,1,2
Esther 3:4,וַיְהִ֗י ֯ב֯אמרם אֵלָיו֙ יֹ֣ום וָיֹ֔ום וְלֹ֥א שָׁמַ֖ע אֲלֵיהֶ֑ם וַיַּגִּ֣ידוּ לְהָמָ֗ן לִרְאֹות֙ הֲיַֽעַמְדוּ֙ דִּבְרֵ֣י מָרְדֳּכַ֔י כִּֽי־הִגִּ֥יד לָהֶ֖ם אֲשֶׁר־ה֥וּא יְהוּדִֽי׃,hebrew primary

0,1,2
Esther 3:4,וַיְהִ֗י כְּאָמְרָ֤ם אֵלָיו֙ יֹ֣ום וָיֹ֔ום וְלֹ֥א שָׁמַ֖ע אֲלֵיהֶ֑ם וַיַּגִּ֣ידוּ לְהָמָ֗ן לִרְאֹות֙ הֲיַֽעַמְדוּ֙ דִּבְרֵ֣י מָרְדֳּכַ֔י כִּֽי־הִגִּ֥יד לָהֶ֖ם אֲשֶׁר־ה֥וּא יְהוּדִֽי׃,hebrew accent

0,1,2
Esther 3:4,וַיְהִי כְּאָמְרָם אֵלָיו יֹום וָיֹום וְלֹא שָׁמַע אֲלֵיהֶם וַיַּגִּידוּ לְהָמָן לִרְאֹות הֲיַעַמְדוּ דִּבְרֵי מָרְדֳּכַי כִּי־הִגִּיד לָהֶם אֲשֶׁר־הוּא יְהוּדִי׃,hebrew vowel

0,1,2
Esther 3:4,ויהי כאמרם אליו יום ויום ולא שמע אליהם ויגידו להמן לראות היעמדו דברי מרדכי כי־הגיד להם אשר־הוא יהודי׃,hebrew cons

0,1,2
Esther 3:4,WAJ:HI81J K.:>@M:R@70m >;L@JW03 JO74Wm W@JO80Wm W:LO71> C@MA73< >:AL;JHE92m WAJ.AG.I74JDW. L:H@M@81n LIR:>OWT03 H:AJA35<AM:DW.03 D.IB:R;74J M@R:D.:@KA80J K.I35J&HIG.I71JD L@HE73m >:ACER&H71W.> J:HW.DI35J00,trans accent

0,1,2
Esther 3:4,WAJ:HIJ K.:>@M:R@m >;L@JW JOWm W@JOWm W:LO> C@MA< >:AL;JHEm WAJ.AG.IJDW. L:H@M@n LIR:>OWT H:AJA<AM:DW. D.IB:R;J M@R:D.:@KAJ K.IJ&HIG.IJD L@HEm >:ACER&HW.> J:HW.DIJ00,trans vowel

0,1,2
Esther 3:4,WJHJ K>MRm >LJW JWm WJWm WL> #M< >LJHm WJGJDW LHMn LR>WT HJ<MDW DBRJ MRDKJ KJ&HGJD LHm >#R&HW> JHWDJ00,trans cons

0,1,2
Esther 3:4,wayᵊhˈî kᵊʔomrˈām ʔēlāʸw yˈôm wāyˈôm wᵊlˌō šāmˌaʕ ʔᵃlêhˈem wayyaggˈîḏû lᵊhāmˈān lirᵊʔôṯ hᵃyˈaʕamᵊḏû divrˈê mordᵒḵˈay kˈî-higgˌîḏ lāhˌem ʔᵃšer-hˌû yᵊhûḏˈî .,phono full

0,1,2
Esther 3:4,wayhî kʔåmråm ʔēlåʸw yôm wåyôm wlō šåmaʕ ʔlêhem wayyaggîḏû lhåmån lirʔôṯ hyaʕamḏû divrê mårdḵay kî-higgîḏ låhem ʔšer-hû yhûḏî .,phono simple


# Whole Bible

Generate the complete text in all representations.

In [5]:
msg('Writing the complete text in several representations')
for fmt in T.formats():
    for vl in (True, False):
        file_name = '{}{}_{}{}.txt'.format(
            source, version,
            fmt, '_v' if vl else ''
        )
        fl = outfile(file_name)
        fl.write(T.whole(fmt=fmt, verse_labels=vl))
        fl.close()
msg('Done')

  8.11s Writing the complete text in several representations
  8.11s etcbc4b_hp_v.txt
  8.11s Producing whole text of etcbc4b in format hp with verse labels
    10s etcbc4b_hp.txt
    10s Producing whole text of etcbc4b in format hp
    12s etcbc4b_ha_v.txt
    12s Producing whole text of etcbc4b in format ha with verse labels
    21s etcbc4b_ha.txt
    21s Producing whole text of etcbc4b in format ha
    29s etcbc4b_hv_v.txt
    29s Producing whole text of etcbc4b in format hv with verse labels
    37s etcbc4b_hv.txt
    37s Producing whole text of etcbc4b in format hv
    44s etcbc4b_hc_v.txt
    44s Producing whole text of etcbc4b in format hc with verse labels
    52s etcbc4b_hc.txt
    52s Producing whole text of etcbc4b in format hc
 1m 00s etcbc4b_ea_v.txt
 1m 00s Producing whole text of etcbc4b in format ea with verse labels
 1m 04s etcbc4b_ea.txt
 1m 04s Producing whole text of etcbc4b in format ea
 1m 08s etcbc4b_ev_v.txt
 1m 08s Producing whole text of etcbc4b in format ev w

In [7]:
show_limit = 2
for fmt in T.formats():
    file_name = '{}{}_{}_v.txt'.format(source, version, fmt)
    i = 0
    fl = infile(file_name)
    print('\n{}'.format(file_name))
    for line in fl:
        if i == show_limit: break
        i += 1
        sys.stdout.write(line)
    fl.close()


etcbc4b_hp_v.txt
Genesis 1:1  בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
Genesis 1:2  וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

etcbc4b_ha_v.txt
Genesis 1:1  בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
Genesis 1:2  וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

etcbc4b_hv_v.txt
Genesis 1:1  בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ׃
Genesis 1:2  וְהָאָרֶץ הָיְתָה תֹהוּ וָבֹהוּ וְחֹשֶׁךְ עַל־פְּנֵי תְהֹום וְרוּחַ אֱלֹהִים מְרַחֶפֶת עַל־פְּנֵי הַמָּיִם׃

etcbc4b_hc_v.txt
Genesis 1:1  בראשית ברא אלהים את השמים ואת הארץ׃
Genesis 1:2  והארץ היתה תהו ובהו וחשך על־פני תהום ורוח אלהים מרחפת על־פני המים׃

etcbc4b_ea_v.txt
Genesis 1:1  B.:R;>CI73JT B.@R@74> >:ELOHI92Jm >;71T HAC.@MA73JIm W:>;71T H@>@35REy00
Genesis 1:2  W:H@>@81REy H@J:T@71H TO33HW.03 W@BO80HW. W:XO73CEk: <AL&P.:N;7

## Trailer

Here is a list of the distinct trailers (the material that follows a word) with their numbers of occurrences, in both the Hebrew and the phonemic text.

In [19]:
msg('Exploring trailers')
trailer = collections.Counter()
trailer_ph = collections.Counter()
trailer_map = collections.defaultdict(set)

for node in NN(test=F.otype.v, value='word'):
    trl = F.trailer_utf8.v(node)
    trl_ph = F.phono_sep.v(node)
    trailer[trl] += 1
    trailer_ph[trl_ph] += 1
    trailer_map[trl].add(trl_ph)

msg('Done. Found {} trailers in Hebrew text and {} trailers in phonemic text'.format(
    len(trailer), len(trailer_ph),
))
print('Trailers in Hebrew text:')
for (trl, n) in sorted(trailer.items(), key=lambda x: (-x[1], x[0])):
    trl = 'ø' if trl == '' else trl.replace('\n', '\\n').replace(' ','_')
    print('{:>7} x [{}]'.format(n, trl))
          
print('Trailers in phonemic text:')
for (trl_ph, n) in sorted(trailer_ph.items(), key=lambda x: (-x[1], x[0])):
    trl_ph = 'ø' if trl_ph == '' else trl_ph.replace('\n', '\\n').replace(' ','_')
    print('{:>7} x [{}]'.format(n, trl_ph))

print('Mapping between trailers in Hebrew and phonemic text:')
for trl in sorted(trailer_map):
    print('{:>7} => {}'.format(
        'ø' if trl == '' else trl.replace('\n', '\\n').replace(' ','_'), 
        ', '.join(
                'ø' if trl_ph == '' else trl_ph.replace('\n', '\\n').replace(' ','_') for trl_ph in sorted(
                    trailer_map[trl]
                ))
    ))

44m 49s Exploring trailers
44m 51s Done. Found 12 trailers in Hebrew text and 3 trailers in phonemic text


Trailers in Hebrew text:
 237039 x [_]
 121796 x [ø]
  42275 x [־]
  20037 x [׃\n]
   2266 x [_׀_]
   1892 x [׃_ס_\n]
   1165 x [׃_פ_\n]
     76 x [_ס_]
     13 x [_פ_]
      7 x [׃_׆̇_\n]
      1 x [׃_׆̇_ס__\n]
      1 x [׃_׆̇_פ__\n]
Trailers in phonemic text:
 239251 x [_]
 164105 x [ø]
  23212 x [_.]
Mapping between trailers in Hebrew and phonemic text:
      ø => ø, _
      _ => ø, _, _.
    _׀_ => _
    _ס_ => _
    _פ_ => _
      ־ => ø
    ׃\n => _.
׃_׆̇_\n => _.
׃_׆̇_ס__\n => _.
׃_׆̇_פ__\n => _.
 ׃_ס_\n => _.
 ׃_פ_\n => _.


## Empty words

Some words have an empty representation.

Let us have a closer look.
How frequent are they, to which lexemes do they correspond, and what is their part of speech?

In [21]:
msg('Looking for empty words')
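# Map each (lexeme, part of speech) pair to the verses in which an empty word occurs.
ewords = collections.defaultdict(list)
verse = None

for i in NN(test=F.otype.v, values=['verse', 'word']):
    if F.otype.v(i) == 'verse':
        # Remember the current verse; the words that follow belong to it.
        verse = i
        continue
    text = F.g_word_utf8.v(i)
    if text == '':
        lex = F.lex_utf8.v(i)
        pos = F.sp.v(i)
        ewords[(lex, pos)].append(verse)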

msg('Done')
for (item, occs) in sorted(ewords.items(), key=lambda x: (-len(x[1]), x[0][1], x[0][0])):
    # Show at most five verse labels and flag the rest with an ellipsis.
    print("{:>6} x {:<15} = {:>10} in {}{}".format(
        len(occs), 
        item[1], 
        item[0], 
        "; ".join(F.label.v(j) for j in occs[:5]),
        ' ...' if len(occs) > 5 else '',
    ))


45m 30s Looking for empty words
45m 32s Done


  6423 x art             =          ה in  GEN 01,05;  GEN 01,05;  GEN 01,07;  GEN 01,07;  GEN 01,08 ...


In [22]:
close()

45m 37s Results directory:
/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/plain

.DS_Store                              6148 Wed Feb 17 18:23:33 2016
__log__plain.txt                       1847 Wed Feb 17 20:35:11 2016
etcbc4b_ea.txt                      3119751 Wed Feb 17 20:26:00 2016
etcbc4b_ea_v.txt                    3459469 Wed Feb 17 20:25:56 2016
etcbc4b_ec.txt                      1560727 Wed Feb 17 20:26:23 2016
etcbc4b_ec_v.txt                    1900445 Wed Feb 17 20:26:17 2016
etcbc4b_ev.txt                      2534187 Wed Feb 17 20:26:10 2016
etcbc4b_ev_v.txt                    2873905 Wed Feb 17 20:26:05 2016
etcbc4b_ha.txt                      5332789 Wed Feb 17 20:25:20 2016
etcbc4b_ha_v.txt                    5672507 Wed Feb 17 20:25:10 2016
etcbc4b_hc.txt                      2803552 Wed Feb 17 20:25:51 2016
etcbc4b_hc_v.txt                    3143270 Wed Feb 17 20:25:43 2016
etcbc4b_hp.txt                      5323553 Wed Feb 17 20:25:00 2016
etcbc4b_hp_v.txt     