<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="left" src="images/DANS-xsmall.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="right" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://sblgnt.com/" target="_blank"><img align="right" src="images/sblgnt.jpg" width="50"/></a>

# Plain Text

# Text from features

We present the API functions ``T`` for generating the plain text of the Hebrew Bible.

You can already retrieve a text that is identical to the primary data of the LAF resource using the
features ``g_word_utf8`` and ``trailer_utf8`` for Hebrew and ``unicode`` and ``unicodetrailer`` for Greek.
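
As a rough illustration of that feature-based approach, the sketch below joins each word with the material that trails it. The word and trailer strings are hand-typed toy data (consonants only) standing in for the values of `g_word_utf8` and `trailer_utf8`; the names `toy_words` and `plain_text` are ours and not part of LAF-Fabric.

```python
# Toy stand-ins for the (word, trailer) feature values of consecutive word nodes.
# In the notebook these would come from g_word_utf8 and trailer_utf8;
# here they are typed by hand for illustration only.
toy_words = [
    ('ב', ''),
    ('ראשית', ' '),
    ('ברא', ' '),
    ('אלהים', ' '),
    ('את', ' '),
    ('ה', ''),
    ('שמים', ' '),
    ('ו', ''),
    ('את', ' '),
    ('ה', ''),
    ('ארץ', '׃\n'),
]


def plain_text(word_features):
    """Concatenate every word with the material that trails it."""
    return ''.join(word + trailer for (word, trailer) in word_features)


text = plain_text(toy_words)
```

Words with an empty trailer attach directly to the next word, which is how prefixed prepositions, articles and conjunctions end up glued to their host.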

But the ``T`` functions make it easier.

* You can generate texts for words and verses, individually and collectively
* You can specify different formats: 
  * hebrew unicode or etcbc transliteration or phonemic;
    * and, for *hebrew* and *etcbc*: 
      * consonantal
      * with vowels
      * with vowels and accents
      * lexeme (consonantal)
      * lexeme (vocalized)
    * and, for *phonemic*:
      * with schwas, qamets gadol/qatan distinction and accents
      * simplified
* You can switch to vocalized qeres instead of ketivs (except for the lexeme based formats);
* You can render the text in HTML and control the formatting, or get the plain text;
* You can look up the passage label of any given node and represent it in a given language.

And there is a format for *Greek*:

* greek unicode;

These functions make it easy to produce texts that can be pasted into word processors.

By the way, if you generate *verse labels*, you can choose the language of the book names.
Currently we have Latin (the ETCBC choice), English (the default), and quite a few others.

See the [ETCBC API documentation](http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html#texts).

**Note**

This notebook loads two sources: the Hebrew Bible and the Greek New Testament.
We need two APIs in the same notebook. This notebook shows you how to do that.

# Greek New Testament

2016-09-09. This is the story of how we got a Greek New Testament data source into Emdros and LAF.

1. We use the Greek New Testament as encoded by the Society of Biblical Literature. It can be obtained from the GitHub repo [biblical humanities: Greek New Testament](https://github.com/biblicalhumanities/greek-new-testament), which provides several text representations, of which we have chosen the [sblgnt](https://github.com/biblicalhumanities/greek-new-testament/tree/master/syntax-trees/sblgnt) one.
1. Cody Kingham (now a master's student at the ETCBC) has converted that source into an MQL data file, which can be imported into an Emdros database.
1. Dirk Roorda has used the ETCBC LAF-Fabric suite to convert from Emdros to LAF and then to compile the LAF for efficient processing.



## Licence
We (and you) are allowed to process this data for non-commercial purposes.

The licence for the Hebrew data is CC-BY-NC, stated on the
[SHEBANQ site](https://shebanq.ancient-data.org/sources).

The licence for the Greek data is CC-SA-NC, plus extra restrictions, stated on the
[SBL-GNT site](http://sblgnt.com/license/).

We draw your attention explicitly to this condition:

>You must always attribute quotations from the SBLGNT. If you quote fewer than 100 verses of the SBLGNT in a single print or electronic work, you can attribute it by simply adding "SBLGNT" after the quotation. Use of 100 or more verses in a single work must be accompanied by the following statement:

>Scripture quotations marked [SBLGNT](http://sblgnt.com) are from the [SBL Greek New Testament](http://sblgnt.com). Copyright © 2010 [Society of Biblical Literature](http://www.sbl-site.org) and [Logos Bible Software](http://www.logos.com).

If you copy this text including the hyperlinks, and paste it somewhere in your application, you are set.

We acknowledge, with gratitude, the work that has gone into preparing these sources.
A lot of that work stems from voluntary efforts! Thumbs up for Jonathan Robie and 
the [Biblical Humanities gang](http://biblicalhumanities.org/about/).

In [13]:
# run later
for biblang in biblangs:
    for (code, (lange, lango)) in sorted(T[biblang].langs.items()):
        print('{:<3} = {:<20} = {}'.format(code, lange, lango))

am  = amharic              = ኣማርኛ
ar  = arabic               = العَرَبِية
bn  = bengali              = বাংলা
da  = danish               = Dansk
de  = german               = Deutsch
el  = greek                = Ελληνικά
en  = english              = English
es  = spanish              = Español
fa  = farsi                = فارسی
fr  = french               = Français
he  = hebrew               = עברית
hi  = hindi                = हिन्दी
id  = indonesian           = Bahasa Indonesia
ja  = japanese             = 日本語
ko  = korean               = 한국어
la  = latin                = Latina
nl  = dutch                = Nederlands
pa  = punjabi              = ਪੰਜਾਬੀ
pt  = portuguese           = Português
ru  = russian              = Русский
sw  = swahili              = Kiswahili
syc = syriac               = ܠܫܢܐ ܣܘܪܝܝܐ
tr  = turkish              = Türkçe
ur  = urdu                 = اُردُو
yo  = yoruba               = èdè Yorùbá
zh  = chinese              = 中文


# Examples

## Hebrew
    
This is the code to produce the Hebrew accents representation (`ha`) of a single verse:

    T.text('Esther', 3, 4, fmt='ha', verse_label=True, html=True)
    
This is the code to produce the plain text of the whole Bible in Hebrew unicode with accents and
with verse labels:

    T.text(fmt='ha', verse_label=True)

## Greek

This is the code to produce the Greek primary representation (`gp`) of a single verse:

    T.text('John', 3, 16, fmt='gp', verse_label=True, html=True)
    
This is the code to produce the plain text of the whole New Testament in Greek unicode
with verse labels:

    T.text(fmt='gp', verse_label=True)
    
## More info

See the [ETCBC API documentation](http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html#texts) for
many more examples.

**Run the next cells later.**

In [20]:
HTML(examples['Hebrew'])

Esther 3:4,וַיְהִ֗י ֯ב֯אמרם אֵלָיו֙ יֹ֣ום וָיֹ֔ום וְלֹ֥א שָׁמַ֖ע אֲלֵיהֶ֑ם וַיַּגִּ֣ידוּ לְהָמָ֗ן לִרְאֹות֙ הֲיַֽעַמְדוּ֙ דִּבְרֵ֣י מָרְדֳּכַ֔י כִּֽי־הִגִּ֥יד לָהֶ֖ם אֲשֶׁר־ה֥וּא יְהוּדִֽי׃

Esther 3:4,וַ הִי ב אמר אֵלָי יֹום וָ יֹום וְ לֹא שָׁמַע אֲלֵי וַ גִּיד לְ הָמָן לִ רְא הֲ עַמְד דִּבְר מָרְדֳּכַי כִּי גִּיד ל אֲשֶׁר הוּא יְהוּדִי

Esther 3:4,ו היה כ אמר אל יום ו יום ו לא שׁמע אל ו נגד ל המן ל ראה ה עמד דבר מרדכי כי נגד ל אשׁר הוא יהודי

Esther 3:4,וַיְהִ֗י כְּאָמְרָ֤ם אֵלָיו֙ יֹ֣ום וָיֹ֔ום וְלֹ֥א שָׁמַ֖ע אֲלֵיהֶ֑ם וַיַּגִּ֣ידוּ לְהָמָ֗ן לִרְאֹות֙ הֲיַֽעַמְדוּ֙ דִּבְרֵ֣י מָרְדֳּכַ֔י כִּֽי־הִגִּ֥יד לָהֶ֖ם אֲשֶׁר־ה֥וּא יְהוּדִֽי׃

Esther 3:4,וַיְהִי כְּאָמְרָם אֵלָיו יֹום וָיֹום וְלֹא שָׁמַע אֲלֵיהֶם וַיַּגִּידוּ לְהָמָן לִרְאֹות הֲיַעַמְדוּ דִּבְרֵי מָרְדֳּכַי כִּי־הִגִּיד לָהֶם אֲשֶׁר־הוּא יְהוּדִי׃

Esther 3:4,ויהי כאמרמ אליו יומ ויומ ולא שמע אליהמ ויגידו להמנ לראות היעמדו דברי מרדכי כי־הגיד להמ אשר־הוא יהודי׃

Esther 3:4,WA-J:HI81J*B-*>MRM>;L@JW03JO74WMW@-JO80WMW:-LO71>C@MA73<>:AL;JHE92MWA-J.AG.I74JDW.L:-H@M@81NLI-R:>OWT03H:A-JA75<AM:DW.03D.IB:R;74JM@R:D.:@KA80JK.I75J&HIG.I71JDL@HE73M>:ACER&H71W.>J:HW.DI75J00

Esther 3:4,WA- HIJ B- >MR >;L@J JOWM W@- JOWM W:- LO> C@MA< >:AL;J WA- G.IJD L:- H@M@N LI- R:> H:A- <AM:D D.IB:R M@R:D.:@KAJ K.IJ G.IJD L >:ACER HW.> J:HW.DIJ

Esther 3:4,W HJH[ K >MR[ >L JWM/ W JWM/ W L> CM<[ >L W NGD[ L HMN=/ L R>H[ H= <MD[ DBR/ MRDKJ/ KJ NGD[ L >CR HW> JHWDJ/

Esther 3:4,WAJ:HI81J K.:>@M:R@70m >;L@JW03 JO74Wm W@JO80Wm W:LO71> C@MA73< >:AL;JHE92m WAJ.AG.I74JDW. L:H@M@81n LIR:>OWT03 H:AJA45<AM:DW.03 D.IB:R;74J M@R:D.:@KA80J K.I45J&HIG.I71JD L@HE73m >:ACER&H71W.> J:HW.DI45J00

Esther 3:4,WAJ:HIJ K.:>@M:R@m >;L@JW JOWm W@JOWm W:LO> C@MA< >:AL;JHEm WAJ.AG.IJDW. L:H@M@n LIR:>OWT H:AJA<AM:DW. D.IB:R;J M@R:D.:@KAJ K.IJ&HIG.IJD L@HEm >:ACER&HW.> J:HW.DIJ00

Esther 3:4,WJHJ K>MRM >LJW JWM WJWM WL> #M< >LJHM WJGJDW LHMN LR>WT HJ<MDW DBRJ MRDKJ KJ&HGJD LHM >#R&HW> JHWDJ00

Esther 3:4,wayᵊhˈî kᵊʔomrˈām ʔēlāʸw yˈôm wāyˈôm wᵊlˌō šāmˌaʕ ʔᵃlêhˈem wayyaggˈîḏû lᵊhāmˈān lirᵊʔôṯ hᵃyˈaʕamᵊḏû divrˈê mordᵒḵˈay kˈî-higgˌîḏ lāhˌem ʔᵃšer-hˌû yᵊhûḏˈî .

Esther 3:4,wayhî kʔåmråm ʔēlåʸw yôm wåyôm wlō šåmaʕ ʔlêhem wayyaggîḏû lhåmån lirʔôṯ hyaʕamḏû divrê mårdḵay kî-higgîḏ låhem ʔšer-hû yhûḏî .


In [21]:
HTML(examples['Greek'])

John 3:16,"γὰρ Οὕτως ἠγάπησεν ὁ θεὸς τὸν κόσμον ὥστε τὸν υἱὸν τὸν μονογενῆ ἔδωκεν,ἵνα πᾶς ὁ πιστεύων εἰς αὐτὸν μὴ ἀπόληται ἀλλὰ ἔχῃ ζωὴν αἰώνιον."


**Start the program here**

In [1]:
import sys
import collections
from IPython.display import HTML, display_pretty, display_html

from laf.fabric import LafFabric
from etcbc.preprocess import prep
from etcbc.lib import Transcription

biblangs = ('Hebrew', 'Greek')

# We initialize two fabrics and put them in a dict, keyed with biblang

fabric = dict((biblang, LafFabric()) for biblang in biblangs)
API = {}

# We have two source+version combinations, one for each biblang

source = dict(((biblangs[0], 'etcbc'), (biblangs[1], 'sblgnt')))
version = dict(((biblangs[0], '4b'), (biblangs[1], '')))

# this will hold the examples that we produce, keyed by biblang
examples = {}

  0.00s This is LAF-Fabric 4.7.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html

  0.00s This is LAF-Fabric 4.7.2
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



We are going to load the APIs for Hebrew and Greek separately, and store them in the dict `API`, keyed by *biblang*.

# Load the Hebrew API

In [2]:
biblang = 'Hebrew'
API[biblang] = fabric[biblang].load(source[biblang]+version[biblang], 'lexicon', 'plain', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        lex_utf8
        sp
        label
        monads
    ''',''),
    "prepare": prep('Hebrew'),
    "primary": False,
})
exec(fabric[biblang].llocalnames.format(var='fabric[biblang]', biblang=biblang))

  0.00s LOADING API: please wait ... 
  0.00s USING main: etcbc4b DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s USING annox: lexicon DATA COMPILED AT: 2016-07-08T14-32-54
  3.20s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4b/plain/__log__plain.txt
  3.20s INFO: LOADING PREPARED data: please wait ... 
  3.20s prep prep: G.node_sort
  3.30s prep prep: G.node_sort_inv
  3.83s prep prep: L.node_up
  7.06s prep prep: L.node_down
    13s prep prep: V.verses
    13s prep prep: V.books_la
    13s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    15s INFO: LOADED PREPARED data
    15s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK plain AT 2016-09-09T09-47-07


# Load the Greek API

In [3]:
biblang = 'Greek'
API[biblang] = fabric[biblang].load(source[biblang]+version[biblang], '', 'plain', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        unicode
        unicodetrailer
        book chapter verse
    ''',''),
    "prepare": prep('Greek'),
    "primary": False,
})
exec(fabric[biblang].llocalnames.format(var='fabric[biblang]', biblang=biblang))

  0.00s LOADING API: please wait ... 
  0.00s USING main: sblgnt DATA COMPILED AT: 2016-09-09T06-57-44
  0.52s LOGFILE=/Users/dirk/laf/laf-fabric-output/sblgnt/plain/__log__plain.txt
  0.52s INFO: LOADING PREPARED data: please wait ... 
  0.52s prep prep: G.node_sort
  0.55s prep prep: G.node_sort_inv
  0.69s prep prep: L.node_up
  1.16s prep prep: L.node_down
  2.29s prep prep: V.verses
  2.29s prep prep: V.books_la
  2.30s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  2.55s INFO: LOADED PREPARED data
  2.55s INFO: DATA LOADED FROM SOURCE sblgnt AND ANNOX  FOR TASK plain AT 2016-09-09T09-47-10


Normally, we use API elements like `F` and `T` just by their short names, because we inject them into the local name space after loading the API with this statement:

    exec(fabric.localnames.format(var='fabric'))
    
But now we have two `T`s and two `F`s, one for each API. Instead we use the following statements:

    exec(fabric[biblang].llocalnames.format(var='fabric[biblang]', biblang=biblang))

for `biblang` having the values `'Hebrew'` and `'Greek'`.

These statements also inject `T` and `F` into the local name space, but now as dicts keyed by the value you pass as `biblang`.

So instead of using `T` directly, we find the T API in `T['Hebrew']` and `T['Greek']`, and likewise for `F` and the other API elements we need.
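
The mechanism can be sketched without LAF-Fabric. The template below is a hypothetical stand-in that only mimics the idea behind `llocalnames`; the real template text is not shown in this notebook. The helper `inject`, the dict `names` and the toy fabric objects are made up for the illustration.

```python
from types import SimpleNamespace

# Hypothetical template: it only mimics the idea of `llocalnames`.
# The real template shipped with LAF-Fabric is not shown in this notebook.
TEMPLATE = '''
for name in ('F', 'T'):
    if name not in ns:
        ns[name] = dict()
    ns[name][{biblang!r}] = {var}.api[name]
'''


def inject(ns, fabric, biblang):
    """Fill in the template for one biblang and execute it against ns."""
    code = TEMPLATE.format(var='fabric[{!r}]'.format(biblang), biblang=biblang)
    exec(code, {'ns': ns, 'fabric': fabric})


# Toy "fabrics": each exposes an api dict with placeholder F and T objects.
fabric = {
    biblang: SimpleNamespace(api={'F': 'F of ' + biblang, 'T': 'T of ' + biblang})
    for biblang in ('Hebrew', 'Greek')
}

names = {}
for biblang in fabric:
    inject(names, fabric, biblang)

print(names['T']['Hebrew'])
print(names['T']['Greek'])
```

The sketch writes into an explicit dict instead of the notebook's own name space, which keeps the effect of `exec` easy to inspect.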

## Run the examples

In [4]:
demopassage = {
    'Hebrew': ('Esther', 3, 4),
    'Greek':  ('John', 3, 16),
}
for biblang in biblangs:
    TT = T[biblang]
    Msg = msg[biblang]
    html = []
    for fmt in TT.formats():
        fmtx = TT.formats()[fmt][0]
        Msg('{:<6}: Format {:<3} = {}'.format(biblang, fmt, fmtx))
        html.append('<p>{} = {}</p>'.format(fmt, fmtx))
        html.append(
            TT.text(
                *demopassage[biblang],
                fmt=fmt,
                verse_label=True,
                html=True,
            ))
    examples[biblang] = '\n'.join(html)

  2.57s Hebrew: Format hp  = hebrew primary
  2.57s Hebrew: Format hpl = hebrew primary (lexeme)
  2.57s Hebrew: Format hcl = hebrew cons (lexeme)
  2.58s Hebrew: Format ha  = hebrew accent
  2.58s Hebrew: Format hv  = hebrew vowel
  2.58s Hebrew: Format hc  = hebrew cons
  2.58s Hebrew: Format ep  = trans primary
  2.58s Hebrew: Format epl = trans primary (lexeme)
  2.58s Hebrew: Format ecl = hebrew cons (lexeme)
  2.58s Hebrew: Format ea  = trans accent
  2.59s Hebrew: Format ev  = trans vowel
  2.59s Hebrew: Format ec  = trans cons
  2.59s Hebrew: Format pf  = phono full
  2.59s Hebrew: Format ps  = phono simple
  0.03s Greek : Format gp  = greek primary


**Now you can run the examples near the top of this notebook.**
After that you can tweak the styles further:

## HTML formatting

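You can override the default HTML styles by passing a dict of parameters to `T[biblang].style()`; with `show_params=True` the resulting parameter values are listed.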

In [5]:
def set_style(biblang, kind):
    if biblang == 'Hebrew':
        if kind == 'big':
            return T[biblang].style(params=dict(
                hebrew_color='660000',
                hebrew_size='xx-large',
                hebrew_line_height='2.0',
                phono_color='ff0000',
            ), show_params=True)
        elif kind == 'reduced':
            return T[biblang].style(params=dict(
                hebrew_color='0000aa',
                hebrew_size='x-large',
                hebrew_line_height='2.0'
            ))
        elif kind == 'default':
            return T[biblang].style()
        else:
            return ''
    elif biblang == 'Greek':
        if kind == 'big':
            return T[biblang].style(params=dict(
                greek_color='660000',
                greek_size='xx-large',
                greek_line_height='2.0',
            ), show_params=True)
        elif kind == 'reduced':
            return T[biblang].style(params=dict(
                greek_color='0000aa',
                greek_size='x-large',
                greek_line_height='2.0'
            ))
        elif kind == 'default':
            return T[biblang].style()
        else:
            return ''

In [6]:
HTML(set_style('Hebrew', 'default'))

In [7]:
HTML(set_style('Greek', 'big'))

  0.09s greek_color = 660000
  0.09s greek_line_height = 2.0
  0.09s greek_size = xx-large
  0.09s verse_color = 0000ff
  0.09s verse_size = small
  0.10s verse_width = 5em


# Passage look up

Here we show how to look up the passage of any given node.
There is some subtlety involved, because nodes may represent a whole book, chapter or verse, and some sentences span multiple verses. Have a good look at the examples.

In [8]:
for biblang in biblangs:
    Inf = inf[biblang]
    FF = F[biblang]
    TT = T[biblang]
    LL = L[biblang]
    Inf('{}: Gathering example nodes'.format(biblang))
    example_nodes = {
            list(FF.otype.s('book'))[1],
            list(FF.otype.s('chapter'))[100],
            list(FF.otype.s('verse'))[1000],
            list(FF.otype.s('sentence'))[10000],
            list(FF.otype.s('clause'))[20000],
            list(FF.otype.s('phrase'))[30000],
            list(FF.otype.s('word'))[100000],
        } | set([s for s in FF.otype.s('sentence') if LL.u('verse', s) is None][0:10]) \
        if biblang == 'Hebrew' else {
            list(FF.db_otype.s('book'))[1],
            list(FF.db_otype.s('chapter'))[100],
            list(FF.db_otype.s('verse'))[1000],
            list(FF.db_otype.s('sentence'))[2000],
            list(FF.db_otype.s('clause'))[20000],
            list(FF.db_otype.s('phrase'))[30000],
            list(FF.db_otype.s('word'))[100000],
        } | set([s for s in FF.db_otype.s('sentence') if LL.u('verse', s) is None][0:10])
    Inf('{}: Found {} example nodes'.format(biblang, len(example_nodes)))
    for n in sorted(example_nodes):
        Inf('{:>7} {:<15} occurs in {:<25} first word {}'.format(
            n, FF.otype.v(n),
            TT.passage(n, lang='en'), 
            TT.passage(n, lang='en', first_word=True)
        ), withtime=False)

  2.69s Hebrew: Gathering example nodes
    11s Hebrew: Found 17 example nodes
 100000 word            occurs in Deuteronomy 11:19         first word Deuteronomy 11:19
 446568 clause          occurs in Deuteronomy 25:9          first word Deuteronomy 25:9
 635133 phrase          occurs in Exodus 36:29              first word Exodus 36:29
1125838 sentence        occurs in Genesis 1:17-18           first word Genesis 1:17
1125875 sentence        occurs in Genesis 1:29-30           first word Genesis 1:29
1125890 sentence        occurs in Genesis 2:4-7             first word Genesis 2:4
1126231 sentence        occurs in Genesis 7:2-3             first word Genesis 7:2
1126238 sentence        occurs in Genesis 7:8-9             first word Genesis 7:8
1126244 sentence        occurs in Genesis 7:13-14           first word Genesis 7:13
1126337 sentence        occurs in Genesis 9:9-10            first word Genesis 9:9
1126400 sentence        occurs in Genesis 10:11-12          first word Genes
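
The labels in this output follow a simple pattern: a node inside one verse gets a single verse reference, while a sentence that spans several verses gets a range such as `Genesis 1:17-18`. The sketch below reproduces only that labelling pattern. `passage_label` is a hypothetical helper, not the implementation of `T.passage`, and its handling of spans that cross a chapter boundary is an assumption.

```python
def passage_label(book, first, last=None):
    """Label a node that starts in verse `first` and ends in verse `last`.

    `first` and `last` are (chapter, verse) pairs; `last` defaults to `first`.
    """
    (ch1, v1) = first
    (ch2, v2) = last or first
    if (ch1, v1) == (ch2, v2):
        return '{} {}:{}'.format(book, ch1, v1)
    if ch1 == ch2:
        return '{} {}:{}-{}'.format(book, ch1, v1, v2)
    # Assumption: the label format for a span crossing a chapter boundary.
    return '{} {}:{}-{}:{}'.format(book, ch1, v1, ch2, v2)


print(passage_label('Deuteronomy', (11, 19)))
print(passage_label('Genesis', (1, 17), (1, 18)))
print(passage_label('Genesis', (2, 4), (2, 7)))
```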

# Whole Bible

Generate the complete text in all representations.

In [9]:
for biblang in biblangs:
    Inf = inf[biblang]
    TT = T[biblang]
    Outfile = outfile[biblang]
    
    Inf('{}: Writing the complete text in several representations'.format(biblang))
    for fmt in TT.formats():
        for vl in (True, False):
            file_name = '{}{}_{}{}.txt'.format(
                source[biblang], version[biblang],
                fmt, '_v' if vl else ''
            )
            fl = Outfile(file_name)
            Inf(file_name)
            fl.write(TT.text(fmt=fmt, verse_label=vl))
            fl.close()
    Inf('Done')

    13s Hebrew: Writing the complete text in several representations
    13s etcbc4b_hp_v.txt
    14s etcbc4b_hp.txt
    15s etcbc4b_hpl_v.txt
    15s etcbc4b_hpl.txt
    15s etcbc4b_hcl_v.txt
    16s etcbc4b_hcl.txt
    17s etcbc4b_ha_v.txt
    23s etcbc4b_ha.txt
    29s etcbc4b_hv_v.txt
    36s etcbc4b_hv.txt
    42s etcbc4b_hc_v.txt
    50s etcbc4b_hc.txt
    58s etcbc4b_ep_v.txt
    59s etcbc4b_ep.txt
    59s etcbc4b_epl_v.txt
 1m 00s etcbc4b_epl.txt
 1m 00s etcbc4b_ecl_v.txt
 1m 01s etcbc4b_ecl.txt
 1m 01s etcbc4b_ea_v.txt
 1m 04s etcbc4b_ea.txt
 1m 07s etcbc4b_ev_v.txt
 1m 11s etcbc4b_ev.txt
 1m 14s etcbc4b_ec_v.txt
 1m 20s etcbc4b_ec.txt
 1m 26s etcbc4b_pf_v.txt
 1m 27s etcbc4b_pf.txt
 1m 28s etcbc4b_ps_v.txt
 1m 30s etcbc4b_ps.txt
 1m 32s Done
 1m 29s Greek: Writing the complete text in several representations
 1m 29s sblgnt_gp_v.txt
 1m 30s sblgnt_gp.txt
 1m 30s Done
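
The file names in this log combine source, version, format and an optional `_v` suffix for the variant with verse labels. A minimal sketch of that naming scheme, using a hypothetical generator `output_names` and `itertools.product` in place of the nested loops:

```python
from itertools import product


def output_names(source, version, formats):
    """Yield one file name per (format, with/without verse labels) pair."""
    for (fmt, with_labels) in product(formats, (True, False)):
        yield '{}{}_{}{}.txt'.format(
            source, version, fmt, '_v' if with_labels else ''
        )


print(list(output_names('etcbc', '4b', ['hp', 'hpl'])))
print(list(output_names('sblgnt', '', ['gp'])))
```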


In [10]:
show_limit = 2
for biblang in biblangs:
    TT = T[biblang]
    Infile = infile[biblang]

    for fmt in TT.formats():
        file_name = '{}{}_{}_v.txt'.format(source[biblang], version[biblang], fmt)
        i = 0
        fl = Infile(file_name)
        print('\n{}'.format(file_name))
        for line in fl:
            if i == show_limit: break
            i += 1
            sys.stdout.write(line)
        fl.close()


etcbc4b_hp_v.txt
Genesis 1:1	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
Genesis 1:2	וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

etcbc4b_hpl_v.txt
Genesis 1:1	בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ 
Genesis 1:2	וְ הָ אָרֶץ הָי תֹהוּ וָ בֹהוּ וְ חֹשֶׁךְ עַל פְּן תְהֹום וְ רוּחַ אֱלֹה רַחֶף עַל פְּן הַ מָּי 

etcbc4b_hcl_v.txt
Genesis 1:1	ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ 
Genesis 1:2	ו ה ארץ היה תהו ו בהו ו חשׁך על פנה תהום ו רוח אלהים רחף על פנה ה מים 

etcbc4b_ha_v.txt
Genesis 1:1	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
Genesis 1:2	וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

etcbc4b_hv_v.txt
Genesis 1:1	בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ׃
Genesis 1:2	וְהָאָרֶץ הָיְתָה תֹהוּ וָבֹהוּ וְחֹשֶׁךְ עַל־פְּנֵי תְהֹום וְרוּחַ אֱלֹהִים מְרַחֶפֶת 
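
Showing only the first few lines of a file can also be done with `itertools.islice`, which avoids the manual counter. This is a sketch on an in-memory stand-in for a file; `head` and `fake_file` are hypothetical names, and the file content is toy data.

```python
import io
from itertools import islice


def head(lines, limit=2):
    """Return the first `limit` lines of an iterable of lines."""
    return list(islice(lines, limit))


# An in-memory stand-in for one of the generated text files (toy content).
fake_file = io.StringIO(
    'Genesis 1:1\tline one\nGenesis 1:2\tline two\nGenesis 1:3\tline three\n'
)
for line in head(fake_file):
    print(line, end='')
```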

# Count the number of characters in the plain text

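We want to know how many characters the Hebrew Bible has, measured as the number of unicode characters.
We want to count every s(h)in as one grapheme, so we have to make sure that we compose those unicode characters.
Note that unicode composition does not compose other points and accents with consonants to single letters.
Do check the sample output below: for the words that contain a s(h)in, the normalized strings are one character longer than the raw text, which suggests that normalization counts the s(h)in dot as a separate character.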
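
To see what normalization does to a single s(h)in, it helps to try it in isolation. The sketch below assumes the raw text may contain the precomposed presentation form U+FB2A; we have not checked which code points the ETCBC data actually uses. The constant names are ours.

```python
from unicodedata import normalize

SHIN_PRECOMPOSED = '\ufb2a'     # HEBREW LETTER SHIN WITH SHIN DOT (one code point)
SHIN_SEQUENCE = '\u05e9\u05c1'  # HEBREW LETTER SHIN + HEBREW POINT SHIN DOT

# Report the length of both spellings under each normalization form.
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(
        form,
        len(normalize(form, SHIN_PRECOMPOSED)),
        len(normalize(form, SHIN_SEQUENCE)),
    )
```

U+FB2A is excluded from unicode composition, so the composing forms do not produce it either: every form leaves a s(h)in as two code points, letter plus dot.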

In [11]:
from unicodedata import normalize

In [12]:
biblang = 'Hebrew'
TT = T[biblang]
NNn = NN[biblang]
FF = F[biblang]
Inf = inf[biblang]


In [13]:
example = TT.text('Genesis', 1, 1, fmt='ha', verse_label=False).split(' ')
for ex in example:
    ex_c = normalize('NFKC', ex)
    ex_d = normalize('NFKD', ex)
    print('{:>3}; c={:>3}; d={:>3}; {}'.format(len(ex), len(ex_c), len(ex_d), ex))

 11; c= 12; d= 12; בְּרֵאשִׁ֖ית
  7; c=  7; d=  7; בָּרָ֣א
  9; c=  9; d=  9; אֱלֹהִ֑ים
  4; c=  4; d=  4; אֵ֥ת
 11; c= 12; d= 12; הַשָּׁמַ֖יִם
  6; c=  6; d=  6; וְאֵ֥ת
 10; c= 10; d= 10; הָאָֽרֶץ׃



In [14]:
full_text = TT.text(fmt='ha', verse_label=False)
print(len(normalize('NFKC', full_text)))

2831192


## Trailer

Here is a list of all the different kinds of trailing material, each with its number of occurrences.

In [15]:
biblang = 'Hebrew'


In [16]:
Inf('Exploring trailers')
trailer = collections.Counter()
trailer_ph = collections.Counter()
trailer_map = collections.defaultdict(set)

for node in NNn(test=FF.otype.v, value='word'):
    trl = FF.trailer_utf8.v(node)
    trl_ph = FF.phono_sep.v(node)
    trailer[trl] += 1
    trailer_ph[trl_ph] += 1
    trailer_map[trl].add(trl_ph)

Inf('Done. Found {} trailers in Hebrew text and {} trailers in phonemic text'.format(
    len(trailer), len(trailer_ph),
))
Inf('Trailers in Hebrew text:')
for (trl, n) in sorted(trailer.items(), key=lambda x: (-x[1], x[0])):
    trl = 'ø' if trl == '' else trl.replace('\n', '\\n').replace(' ','_')
    print('{:>7} x [{}]'.format(n, trl))
          
print('Trailers in phonemic text:')
for (trl_ph, n) in sorted(trailer_ph.items(), key=lambda x: (-x[1], x[0])):
    trl_ph = 'ø' if trl_ph == '' else trl_ph.replace('\n', '\\n').replace(' ','_')
    print('{:>7} x [{}]'.format(n, trl_ph))

print('Mapping between trailers in Hebrew and phonemic text:')
for trl in sorted(trailer_map):
    print('{:>7} => {}'.format(
        'ø' if trl == '' else trl.replace('\n', '\\n').replace(' ','_'), 
        ', '.join(
                'ø' if trl_ph == '' else trl_ph.replace('\n', '\\n').replace(' ','_') for trl_ph in sorted(
                    trailer_map[trl]
                ))
    ))

 1m 40s Exploring trailers
 1m 42s Done. Found 12 trailers in Hebrew text and 3 trailers in phonemic text
 1m 42s Trailers in Hebrew text:
 237039 x [_]
 121796 x [ø]
  42275 x [־]
  20037 x [׃\n]
   2266 x [_׀_]
   1892 x [׃_ס_\n]
   1165 x [׃_פ_\n]
     76 x [_ס_]
     13 x [_פ_]
      7 x [׃_׆̇_\n]
      1 x [׃_׆̇_ס__\n]
      1 x [׃_׆̇_פ__\n]
Trailers in phonemic text:
 239251 x [_]
 164105 x [ø]
  23212 x [_.]
Mapping between trailers in Hebrew and phonemic text:
      ø => ø, _
      _ => ø, _, _.
    _׀_ => _
    _ס_ => _
    _פ_ => _
      ־ => ø
    ׃\n => _.
׃_׆̇_\n => _.
׃_׆̇_ס__\n => _.
׃_׆̇_פ__\n => _.
 ׃_ס_\n => _.
 ׃_פ_\n => _.
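
The listing makes invisible characters visible: `ø` for an empty trailer, `_` for a space and `\n` for a newline. That convention can be wrapped in one helper instead of repeating the expression. A sketch on toy trailer values; `show`, `toy_trailers` and `rows` are hypothetical names, and the counts are not taken from the data.

```python
import collections


def show(trailer):
    """Make a trailer visible: ø for empty, _ for space, \\n for newline."""
    if trailer == '':
        return 'ø'
    return trailer.replace('\n', '\\n').replace(' ', '_')


# Toy trailer values, typed by hand.
toy_trailers = [' ', ' ', '', '־', '׃\n', ' ']
counts = collections.Counter(toy_trailers)

# Most frequent first, ties broken by the trailer string itself.
rows = [
    '{:>7} x [{}]'.format(n, show(trl))
    for (trl, n) in sorted(counts.items(), key=lambda x: (-x[1], x[0]))
]
```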


## Empty words

There are words that have an empty representation.

Let us have a closer look.
How frequent are they and to what lexemes do they correspond, and what is their part of speech?

In [17]:
Inf('Looking for empty words')
ewords = collections.defaultdict(list)
verse = None

for i in NNn(test=FF.otype.v, values=['verse', 'word']):
    if FF.otype.v(i) == 'verse':
        verse = i
        continue
    text = FF.g_word_utf8.v(i)
    if text == '':
        lex = FF.lex_utf8.v(i)
        pos = FF.sp.v(i)
        ewords[(lex, pos)].append(verse)

Inf('Done')
for (item, occs) in sorted(ewords.items(), key=lambda x: (-len(x[1]), x[0][1], x[0][0])):
    print("{:>6} x {:<15} = {:>10} in {}{}".format(
        len(occs), 
        item[1], 
        item[0], 
        "; ".join([FF.label.v(j) for j in occs][0:5]),
        ' ...' if len(occs) > 5 else '',
    ))


 1m 42s Looking for empty words
 1m 44s Done
  6423 x art             =          ה in  GEN 01,05;  GEN 01,05;  GEN 01,07;  GEN 01,07;  GEN 01,08 ...
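
The search relies on walking verse and word nodes in text order and remembering the most recent verse. The same pattern on toy data, without LAF-Fabric: `toy_nodes` and `empty_words` are hypothetical, and the word values are made up rather than taken from the Hebrew data.

```python
import collections

# Toy node stream in text order:
# ('verse', label) or ('word', (text, lexeme, part_of_speech)).
toy_nodes = [
    ('verse', 'GEN 01,05'),
    ('word', ('L', 'L', 'prep')),
    ('word', ('', 'H', 'art')),
    ('word', ('>WR', '>WR/', 'subs')),
    ('verse', 'GEN 01,07'),
    ('word', ('', 'H', 'art')),
]


def empty_words(nodes):
    """Group the verses of words with empty text by (lexeme, part of speech)."""
    found = collections.defaultdict(list)
    verse = None
    for (otype, value) in nodes:
        if otype == 'verse':
            verse = value  # remember the verse the following words belong to
            continue
        (text, lex, pos) = value
        if text == '':
            found[(lex, pos)].append(verse)
    return dict(found)


print(empty_words(toy_nodes))
```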


In [18]:
close['Hebrew']()

 1m 44s Results directory:
/Users/dirk/laf/laf-fabric-output/etcbc4b/plain

.DS_Store                              6148 Fri Sep  9 10:08:21 2016
__log__plain.txt                       3661 Fri Sep  9 11:48:51 2016
etcbc4b_ea.txt                      3119734 Fri Sep  9 11:48:14 2016
etcbc4b_ea_v.txt                    3438368 Fri Sep  9 11:48:11 2016
etcbc4b_ec.txt                      1549212 Fri Sep  9 11:48:33 2016
etcbc4b_ec_v.txt                    1867846 Fri Sep  9 11:48:27 2016
etcbc4b_ecl.txt                     1760600 Fri Sep  9 11:48:08 2016
etcbc4b_ecl_v.txt                   2079234 Fri Sep  9 11:48:08 2016
etcbc4b_ep.txt                      2990524 Fri Sep  9 11:48:06 2016
etcbc4b_ep_v.txt                    3309158 Fri Sep  9 11:48:06 2016
etcbc4b_epl.txt                     2327644 Fri Sep  9 11:48:07 2016
etcbc4b_epl_v.txt                   2646278 Fri Sep  9 11:48:07 2016
etcbc4b_ev.txt                      2538742 Fri Sep  9 11:48:21 2016
etcbc4b_ev_v.txt           

In [19]:
close['Greek']()

 1m 41s Results directory:
/Users/dirk/laf/laf-fabric-output/sblgnt/plain

.DS_Store                              6148 Fri Sep  9 11:42:30 2016
__log__plain.txt                       2401 Fri Sep  9 11:48:51 2016
sblgnt_gp.txt                       1646554 Fri Sep  9 11:48:40 2016
sblgnt_gp_v.txt                     1747309 Fri Sep  9 11:48:39 2016
