<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left"src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right"src="images/DANS-xsmall.png"/></a>

# Plain Text

# Text from features

Here comes the plain text of the Hebrew Bible.

You can retrieve a text that is identical to the primary data of the LAF resource.
So what is the use?

* You can use it as a check that the LAF-Fabric machinery works correctly
* You can make text selections that interest you
* You can ask for the unvocalized text

The text is obtained by walking all word nodes and concatenating their ``g_word_utf8`` or ``g_cons_utf8`` feature with their ``trailer_utf8`` feature.

The ``g_word_utf8`` feature contains the Unicode representation of the vocalized text of the word.

The ``g_cons_utf8`` feature contains the Unicode representation of the consonantal text of the word.

The ``trailer_utf8`` feature contains the Unicode representation of the material that follows the word
but belongs neither to that word nor to the following word.
It may be empty, a space, punctuation, or certain other textual marks.
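
The concatenation described above can be sketched without LAF-Fabric. This is a minimal sketch, assuming each word has been reduced to a hypothetical `(text, trailer)` pair. The sample uses transliterated placeholders for the word text and real Unicode trailers (maqaf, and sof pasuq plus newline). `build_text` is an illustrative name, not part of the LAF-Fabric API.

```python
def build_text(words):
    """Concatenate each word's text with its trailer, in document order."""
    return "".join(text + trailer for text, trailer in words)


# Hypothetical sample: three words with an empty trailer,
# a maqaf (U+05BE), and a sof pasuq (U+05C3) followed by a newline.
sample = [
    ("W", ""),
    ("JHJ", "\u05be"),
    (">WR", "\u05c3\n"),
]

text = build_text(sample)
print(repr(text))
```

Because every separator lives in a trailer, no spaces have to be added between the words.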

In [1]:
import sys
import collections

from laf.fabric import LafFabric
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.4
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html

  0.00s INFO: CREATING new OUTPUT_DIR /Users/dirk/laf-fabric-output


In [4]:
fabric.load('etcbc4b', '--', 'plain', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        g_word_utf8 g_cons_utf8 trailer_utf8
        g_word g_cons lex_utf8
        sp
        book chapter verse label
    ''',''),
    "primary": True,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK plain AT 2015-11-13T10-24-04


## Trailer

Before we generate the text, let's list all the different trailers and count their occurrences.

In [3]:
trailer = collections.defaultdict(int)

for node in NN(test=F.otype.v, value='word'):
    trailer[F.trailer_utf8.v(node)] += 1

trailer_file = outfile('trailers.txt')
for (trl, n) in sorted(trailer.items(), key=lambda x: (-x[1], x[0])):
    trailer_file.write("{:>7} x [{}]\n".format(n, trl))
trailer_file.close()

In [4]:
!cat {my_file('trailers.txt')}

 237039 x [ ]
 121796 x []
  42275 x [־]
  20037 x [׃
]
   2266 x [ ׀ ]
   1892 x [׃ ס 
]
   1165 x [׃ פ 
]
     76 x [ ס ]
     13 x [ פ ]
      7 x [׃ ׆̇ 
]
      1 x [׃ ׆̇ ס  
]
      1 x [׃ ׆̇ פ  
]
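
Several trailers contain a newline, which is why some entries in the listing wrap onto a second line. A possible variation, sketched here with plain Python and hypothetical sample data, writes each trailer with `repr` so that invisible characters show up as escapes. `trailer_report` is an illustrative helper, not a LAF-Fabric function.

```python
import collections


def trailer_report(trailers):
    """Return report lines, most frequent trailer first, with escapes visible."""
    counts = collections.Counter(trailers)
    lines = []
    for trl, n in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
        # repr() renders a newline as \n instead of breaking the line
        lines.append("{:>7} x {}".format(n, repr(trl)))
    return lines


# Hypothetical trailers: two spaces, one empty string, one ending in a newline
for line in trailer_report([" ", " ", "", "x\n"]):
    print(line)
```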


## Vocalized text versus consonantal text, Hebrew Unicode versus Transliteration

Now we generate the complete text. Note that in the transliterated files we insert a newline wherever the trailer contains one.

If you want the consonantal text, replace the feature ``g_word_utf8`` by ``g_cons_utf8``.
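
For vocalized Hebrew that comes from another source, a consonantal form can be approximated by dropping combining marks. This is only a sketch under assumptions: it is not a description of how ``g_cons_utf8`` was produced, and `strip_points` is a hypothetical helper. It keeps the shin and sin dots, as the consonantal output in this notebook appears to do.

```python
import unicodedata

# The shin dot (U+05C1) and sin dot (U+05C2) distinguish two consonants,
# so this sketch keeps them.
KEEP = {"\u05c1", "\u05c2"}


def strip_points(text):
    """Drop Hebrew vowel points and accents (Unicode category Mn)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" or ch in KEEP
    )


# First word of Genesis 1:1, vocalized, written with escapes:
# bet, sheva, dagesh, resh, tsere, alef, shin, hiriq, shin dot, accent, yod, tav
vocalized = (
    "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0"
    "\u05e9\u05b4\u05c1\u0596\u05d9\u05ea"
)
consonantal = strip_points(vocalized)
print(consonantal)
```

Punctuation such as maqaf and sof pasuq is not a combining mark, so it survives this filter.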

In many cases the use of Hebrew Unicode characters, however pleasing to the eye, is not preferred.
Often the Hebrew occurs embedded in non-Hebrew text, or under tree structures, where the Hebrew right-to-left writing
direction does not play nicely with the context.
Moreover, rendering software such as text editors, command prompts and browsers solves the puzzle of multiple writing directions
in unpredictable ways.

In those cases you can resort to a *transliteration*, with or without vowels.
Use the features ``g_word`` and ``g_cons``.
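
Whether a string will trigger right-to-left rendering can be checked with the standard `unicodedata` module. The following is a minimal sketch, and `has_rtl` and `pick_representation` are hypothetical helper names. It shows one way to fall back to a transliteration when the output context cannot handle mixed writing directions.

```python
import unicodedata


def has_rtl(text):
    """True if any character has a right-to-left bidirectional class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def pick_representation(utf8_text, translit_text, allow_rtl):
    """Return the Unicode text unless it is right-to-left and that is not allowed."""
    if has_rtl(utf8_text) and not allow_rtl:
        return translit_text
    return utf8_text


hebrew = "\u05d0\u05ea"   # alef, tav
translit = ">T"           # the same word in transliteration

print(has_rtl(hebrew), has_rtl(translit))
print(pick_representation(hebrew, translit, allow_rtl=False))
```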

In [5]:
plain_file = outfile("etcbc4_plain.txt")
plainc_file = outfile("etcbc4_plainc.txt")
plaint_file = outfile("etcbc4_plaint.txt")
plaintc_file = outfile("etcbc4_plaintc.txt")

for i in F.otype.s('word'):
    the_text = F.g_word_utf8.v(i)
    the_textc = F.g_cons_utf8.v(i)
    the_textt = F.g_word.v(i)
    the_texttc = F.g_cons.v(i)
    the_trailer = F.trailer_utf8.v(i)
    the_newline = '\n' if '\n' in the_trailer else ''
    plain_file.write(the_text + the_trailer)
    plainc_file.write(the_textc + the_trailer)
    plaint_file.write(the_textt + " " + the_newline)
    plaintc_file.write(the_texttc + " " + the_newline)

plain_file.close()
plainc_file.close()
plaint_file.close()
plaintc_file.close()

In [6]:
!head -n 10 {my_file('etcbc4_plain.txt')}

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃
וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ 
וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃
וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃
וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃ פ 
וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖

In [7]:
!head -n 10 {my_file('etcbc4_plainc.txt')}

בראשׁית ברא אלהים את השׁמים ואת הארץ׃
והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃
ויאמר אלהים יהי אור ויהי־אור׃
וירא אלהים את־האור כי־טוב ויבדל אלהים בין האור ובין החשׁך׃
ויקרא אלהים ׀ לאור יום ולחשׁך קרא לילה ויהי־ערב ויהי־בקר יום אחד׃ פ 
ויאמר אלהים יהי רקיע בתוך המים ויהי מבדיל בין מים למים׃
ויעשׂ אלהים את־הרקיע ויבדל בין המים אשׁר מתחת לרקיע ובין המים אשׁר מעל לרקיע ויהי־כן׃
ויקרא אלהים לרקיע שׁמים ויהי־ערב ויהי־בקר יום שׁני׃ פ 
ויאמר אלהים יקוו המים מתחת השׁמים אל־מקום אחד ותראה היבשׁה ויהי־כן׃
ויקרא אלהים ׀ ליבשׁה ארץ ולמקוה המים קרא ימים וירא אלהים כי־טוב׃


In [8]:
!head -n 10 {my_file('etcbc4_plaint.txt')}

B.:- R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA- C.@MA73JIM W:- >;71T H@- >@75REY00 
W:- H@- >@81REY H@J:T@71H TO33HW.03 W@- BO80HW. W:- XO73CEK: <AL& P.:N;74J T:HO92WM W:- R74W.XA >:ELOHI80JM M:RAXE73PET <AL& P.:N;71J HA- M.@75JIM00 
WA- J.O71>MER >:ELOHI73JM J:HI74J >O92WR WA75- J:HIJ& >O75WR00 
WA- J.A94R:> >:ELOHI91JM >ET& H@- >O73WR K.IJ& VO92WB WA- J.AB:D.;74L >:ELOHI80JM B.;71JN H@- >O73WR W.- B;71JN HA- XO75CEK:00 
WA- J.IQ:R@63> >:ELOHI70JM05 L@- >OWR03 - JO80WM W:- LA- XO73CEK: - Q@74R@> L@92J:L@H WA75- J:HIJ& <E71REB WA75- J:HIJ& BO73QER JO71WM >EX@75D00_P 
WA- J.O74>MER >:ELOHI80JM J:HI71J R@QI73J<A B.:- TO74WK: HA- M.@92JIM WI- JHI74J MAB:D.I80JL B.;71JN MA73JIM L@- M@75JIM00 
WA- J.A74<AF >:ELOHIJM02 >ET& H@- R@QIJ<A01 WA- J.AB:D.;81L B.;70JN HA- M.A33JIM03 >:ACER03 MI- T.A74XAT L@- R@QI80J<A - W.- B;74JN HA- M.A80JIM >:ACE73R M;- <A74L L@- R@QI92J<A - WA75- J:HIJ& K;75N00 
WA- J.IQ:R@94> >:ELOHI91JM L@75- R@QI73J<A - C@M@92JIM WA75- J:HIJ& <E71REB WA75- J:HIJ& BO73QE

In [9]:
!head -n 10 {my_file('etcbc4_plaintc.txt')}

B R>CJT BR> >LHJM >T H CMJM W >T H >RY 
W H >RY HJTH THW W BHW W XCK <L PNJ THWM W RWX >LHJM MRXPT <L PNJ H MJM 
W J>MR >LHJM JHJ >WR W JHJ >WR 
W JR> >LHJM >T H >WR KJ VWB W JBDL >LHJM BJN H >WR W BJN H XCK 
W JQR> >LHJM L >WR  JWM W L XCK  QR> LJLH W JHJ <RB W JHJ BQR JWM >XD 
W J>MR >LHJM JHJ RQJ< B TWK H MJM W JHJ MBDJL BJN MJM L MJM 
W J<F >LHJM >T H RQJ< W JBDL BJN H MJM >CR M TXT L RQJ<  W BJN H MJM >CR M <L L RQJ<  W JHJ KN 
W JQR> >LHJM L RQJ<  CMJM W JHJ <RB W JHJ BQR JWM CNJ 
W J>MR >LHJM JQWW H MJM M TXT H CMJM >L MQWM >XD W TR>H H JBCH W JHJ KN 
W JQR> >LHJM L JBCH  >RY W L MQWH H MJM QR> JMJM W JR> >LHJM KJ VWB 


## Passage indicators

To mark books, chapters and verses in the text, proceed as follows:

In [10]:
plainx_file = outfile("etcbc4_plainx.txt")
plaintx_file = outfile("etcbc4_plaintx.txt")

the_book = None
the_chapter = None
the_verse = None

for i in NN():
    this_type = F.otype.v(i)
    if this_type == "word":
        the_text = F.g_word_utf8.v(i)
        the_textt = F.g_word.v(i)
        the_trailer = F.trailer_utf8.v(i)
        the_newline = '\n' if '\n' in the_trailer else ''
        plainx_file.write(the_text + the_trailer)
        plaintx_file.write(the_textt + ' ' + the_newline)
    elif this_type == "book":
        the_book = F.book.v(i)
        sys.stderr.write("\r{:>6} {:<30}".format(i, the_book)) 
        plainx_file.write("\n{}".format(the_book))
        plaintx_file.write("\n{}".format(the_book))
    elif this_type == "chapter":
        the_chapter = F.chapter.v(i)
        plainx_file.write("\n{} {}".format(the_book, the_chapter))
        plaintx_file.write("\n{} {}".format(the_book, the_chapter))
    elif this_type == "verse":
        the_verse = F.verse.v(i)
        plainx_file.write("\n{}:{} ".format(the_chapter, the_verse))
        plaintx_file.write("\n{}:{} ".format(the_chapter, the_verse))
sys.stderr.write("\n")

plainx_file.close()
plaintx_file.close()

1367535 Chronica_II                   


In [11]:
!head -n 10 {my_file('etcbc4_plainx.txt')}


Genesis
Genesis 1
1:1 בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

1:2 וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

1:3 וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃

1:4 וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃


In [12]:
!head -n 10 {my_file('etcbc4_plaintx.txt')}


Genesis
Genesis 1
1:1 B.:- R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA- C.@MA73JIM W:- >;71T H@- >@75REY00 

1:2 W:- H@- >@81REY H@J:T@71H TO33HW.03 W@- BO80HW. W:- XO73CEK: <AL& P.:N;74J T:HO92WM W:- R74W.XA >:ELOHI80JM M:RAXE73PET <AL& P.:N;71J HA- M.@75JIM00 

1:3 WA- J.O71>MER >:ELOHI73JM J:HI74J >O92WR WA75- J:HIJ& >O75WR00 

1:4 WA- J.A94R:> >:ELOHI91JM >ET& H@- >O73WR K.IJ& VO92WB WA- J.AB:D.;74L >:ELOHI80JM B.;71JN H@- >O73WR W.- B;71JN HA- XO75CEK:00 
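
The loop above works as a small state machine: it remembers the current book and chapter and emits a marker whenever a new container node arrives. The sketch below reproduces that logic on a hypothetical list of `(otype, value)` pairs, so it can be run without LAF-Fabric. It assumes, as the output above suggests, that a container node arrives before the words it contains.

```python
def mark_passages(nodes):
    """Build text with book, chapter and verse markers from a node stream."""
    out = []
    book = chapter = None
    for otype, value in nodes:
        if otype == "book":
            book = value
            out.append("\n{}".format(book))
        elif otype == "chapter":
            chapter = value
            out.append("\n{} {}".format(book, chapter))
        elif otype == "verse":
            out.append("\n{}:{} ".format(chapter, value))
        elif otype == "word":
            text, trailer = value
            out.append(text + trailer)
    return "".join(out)


# Hypothetical node stream in document order
nodes = [
    ("book", "Genesis"),
    ("chapter", 1),
    ("verse", 1),
    ("word", ("B", "")),
    ("word", ("R>CJT", " ")),
]
print(mark_passages(nodes))
```

The blank lines between verses in the real output arise because the last trailer of a verse ends in a newline and the next verse marker starts with one.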


## Verse list

We can get the text in a quite different way: just read it from the *primary data*.

Let us do that per verse.

In [13]:
verse_file = outfile("etcbc4_verses.txt")

for i in F.otype.s('verse'):
    the_text = ''.join([txt for (j, txt) in P.data(i)])
    the_verse = F.label.v(i)
    verse_file.write("{}\n{}\n".format(the_verse, the_text))

verse_file.close()

In [14]:
!head -n 10 {my_file('etcbc4_verses.txt')}

 GEN 01,01
בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

 GEN 01,02
וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

 GEN 01,03
וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃

 GEN 01,04
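
Since the text built from features should be identical to the primary data, the two routes can be compared directly. This is a minimal sketch with hypothetical inputs: words as `(text, trailer)` pairs and primary data as `(position, text)` chunks, the shape that the loop over `P.data(i)` above suggests. All function names are illustrative.

```python
def text_from_features(words):
    """Join word texts and trailers."""
    return "".join(text + trailer for text, trailer in words)


def text_from_primary(chunks):
    """Join primary data chunks in order of position."""
    return "".join(txt for _pos, txt in sorted(chunks))


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for k, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return k
    return None if len(a) == len(b) else min(len(a), len(b))


# Hypothetical data for one short passage
words = [("BR>CJT", " "), ("BR>", " ")]
chunks = [(0, "BR>CJT BR> ")]

diff = first_difference(text_from_features(words), text_from_primary(chunks))
print("identical" if diff is None else "first difference at {}".format(diff))
```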


## Empty words

There are words that have an empty representation.

Let us have a closer look.
How frequent are they, to which lexemes do they correspond, and what is their part of speech?

In [15]:
ewords = collections.defaultdict(list)
verse = None

for i in NN(test=F.otype.v, values=['verse', 'word']):
    if F.otype.v(i) == 'verse':
        verse = i
        continue
    text = F.g_word_utf8.v(i)
    if text == '':
        lex = F.lex_utf8.v(i)
        pos = F.sp.v(i)
        ewords[(lex, pos)].append(verse)

for (item, occs) in sorted(ewords.items(), key=lambda x: (-len(x[1]), x[0][1], x[0][0])):
    print("{:>6} x {:<15} = {:>10} in {}{}".format(
        len(occs), 
        item[1], 
        item[0], 
        "; ".join([F.label.v(j) for j in occs][0:5]),
        ' ...' if len(occs) > 5 else '',
    ))

  6423 x art             =          ה in  GEN 01,05;  GEN 01,05;  GEN 01,07;  GEN 01,07;  GEN 01,08 ...
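
The grouping and the truncated occurrence list can be tried out on a few hand-made records. This sketch assumes each word is a hypothetical `(verse_label, text, lex, pos)` tuple. `group_empty` and `summarize` are illustrative names, and the ellipsis is tied to the same limit as the slice, so a truncated list is always marked.

```python
import collections


def group_empty(words):
    """Map (lexeme, part of speech) to the verses where the word text is empty."""
    groups = collections.defaultdict(list)
    for verse, text, lex, pos in words:
        if text == "":
            groups[(lex, pos)].append(verse)
    return groups


def summarize(occs, limit=5):
    """Show at most `limit` occurrences, with an ellipsis if any were cut."""
    shown = "; ".join(occs[:limit])
    return shown + (" ..." if len(occs) > limit else "")


# Hypothetical records; lexemes are given in transliteration
words = [
    ("GEN 01,05", "", "H", "art"),
    ("GEN 01,05", "L", "L", "prep"),
    ("GEN 01,07", "", "H", "art"),
]
for (lex, pos), occs in group_empty(words).items():
    print("{:>6} x {:<15} = {:>10} in {}".format(len(occs), pos, lex, summarize(occs)))
```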


In [16]:
close()

 1m 14s Results directory:
/Users/dirk/test/laf-fabric-output/etcbc4b/plain

__log__plain.txt                        209 Fri Nov 13 08:58:30 2015
etcbc4_plain.txt                    5323553 Fri Nov 13 08:57:55 2015
etcbc4_plainc.txt                   2865624 Fri Nov 13 08:57:55 2015
etcbc4_plaint.txt                   3416982 Fri Nov 13 08:57:55 2015
etcbc4_plaintc.txt                  1647447 Fri Nov 13 08:57:55 2015
etcbc4_plaintx.txt                  3575484 Fri Nov 13 08:58:09 2015
etcbc4_plainx.txt                   5482055 Fri Nov 13 08:58:09 2015
etcbc4_verses.txt                   5602109 Fri Nov 13 08:58:19 2015
trailers.txt                            223 Fri Nov 13 08:57:43 2015
