# Hebrew in Pandas

This notebook contains the Pandas instructions to load the big ETCBC4b export and save it in a more compact pandas format.
Pandas is a Python module that implements the concept of *data frame* as we know it from R.

We then perform some simple information extraction on the data.
For comparison, the same information extraction has been done in
[R](https://shebanq.ancient-data.org/shebanq/static/docs/tools/r/Hebrew_in_RR.html).

First, a little machinery to get timed messages.

In [3]:
import sys, collections
import pandas as pd
from laf.timestamp import Timestamp

curtime = Timestamp()

def _reset():
    global curtime
    curtime = Timestamp()

def msg(txt, newline=True, withtime=True):
    curtime.Nmsg(txt, newline=newline, withtime=withtime)

def Msg(txt, newline=True, withtime=True):
    _reset()
    curtime.Nmsg(txt, newline=newline, withtime=withtime)
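
The `Timestamp` class comes from LAF-Fabric. For readers who do not have that package installed, here is a minimal stand-in, assuming only that a timed message consists of the elapsed seconds followed by the text. The class name `SimpleTimestamp` and its `format` method are hypothetical; this is not the LAF-Fabric implementation.

```python
import time


class SimpleTimestamp:
    """Hypothetical stand-in for laf.timestamp.Timestamp (not the LAF-Fabric code)."""

    def __init__(self):
        # Remember when the timer was created or reset.
        self.start = time.monotonic()

    def format(self, txt, withtime=True):
        # Prefix the message with the seconds elapsed since the start.
        elapsed = time.monotonic() - self.start
        prefix = '{:>6.2f}s '.format(elapsed) if withtime else ''
        return prefix + txt

    def Nmsg(self, txt, newline=True, withtime=True):
        # Same call shape as used by msg() and Msg() in this notebook.
        print(self.format(txt, withtime=withtime), end='\n' if newline else '')


timer = SimpleTimestamp()
timer.Nmsg('Reading etcbc data into pandas')
```

Replacing `Timestamp()` by `SimpleTimestamp()` in the cell above should be enough to run the rest of the notebook without LAF-Fabric, under the assumption that only `Nmsg` is used.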

We read exactly the same file that R reads.
Note that reading goes much quicker in Pandas.

All features that refer to other objects do so through the `oid` of those objects.

In [3]:
Msg('Reading etcbc data into pandas')

l_features = '''
    subphrase phrase_atom phrase clause_atom clause sentence_atom sentence
    half_verse verse chapter book
'''.strip().split()
li_features = ['in.'+x for x in l_features]

dtype = dict(
    maxmonad='int',minmonad='int',monads='str',
    oid='int',otype='str',
    code='str',det='str',dist='float64',dist_unit='str',domain='str',function='str',
    g_cons='str',g_cons_utf8='str',g_lex='str',g_lex_utf8='str',
    g_nme='str',g_nme_utf8='str',g_pfm='str',g_pfm_utf8='str',
    g_prs='str',g_prs_utf8='str',g_uvf='str',g_uvf_utf8='str',
    g_vbe='str',g_vbe_utf8='str',g_vbs='str',g_vbs_utf8='str',
    g_word='str',g_word_utf8='str',
    gn='str',is_root='str',kind='str',
    language='str',lex='str',lex_utf8='str',
    ls='str',mother_object_type='str',
    nme='str',nu='str',number='float64',pdp='str',pfm='str',prs='str',ps='str',
    rela='str',sp='str',st='str',tab='float64',
    trailer_utf8='str',txt='str',typ='str',
    uvf='str',vbe='str',vbs='str',
    vs='str',vt='str',
    g_qere_utf8='str',qtrailer_utf8='str',
    entry='str',entry_heb='str',entryid='str',
    freq_lex='float64',freq_occ='float64',
    g_entry='str',g_entry_heb='str',
    gloss='str',id='str',lan='str',nametype='str',
    pos='str',rank_lex='float64',rank_occ='float64',root='str',subpos='str',
    phono='str',phono_sep='str',
    book='str',chapter='float64',label='str',verse='float64',
    instruction='str',number_in_ch='float64',pargr='str',
    distributional_parent='str',functional_parent='str',mother='str',
)
for otype in li_features: dtype[otype] = 'float64'

na_values = dict((x, set() if dtype[x] == 'str' else {''}) for x in dtype)

etcbc = pd.read_table(
    '/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.txt',
    delimiter='\t',
    low_memory=False,
    encoding='utf8',
    keep_default_na=False,
    na_values=na_values,
    dtype=dtype,
#    index_col='oid',
)
msg('Done. Size = {}'.format(etcbc.size))

  0.00s Reading etcbc data into pandas
    22s Done. Size = 136501510
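
The `na_values` dictionary used above treats string columns and numeric columns differently: string columns get no missing-value markers, so empty fields stay empty strings, while in the other columns an empty field becomes `NaN`. Here is a minimal sketch of that behaviour on made-up toy data. The three columns and their values are placeholders, not the ETCBC export.

```python
import io
import pandas as pd

# Toy tab-delimited data: the second row has an empty `lex` and an empty `number`.
data = "oid\tlex\tnumber\n1\tabc\t1\n2\t\t\n"

dtype = dict(oid='int', lex='str', number='float64')
# No NA markers for string columns; the empty string marks NA in the other columns.
na_values = {col: set() if dtype[col] == 'str' else {''} for col in dtype}

toy = pd.read_csv(
    io.StringIO(data),
    sep='\t',
    keep_default_na=False,  # switch off the built-in markers such as 'NA' and 'null'
    na_values=na_values,
    dtype=dtype,
)
print(toy)
print(toy.dtypes)
```

The point of this design is that text features never turn into floating point `NaN`, so string operations such as concatenating a word with its trailer keep working on every row.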


Pandas can save data in a compressed msgpack format. We save the data in this format and check how big the result is and how fast it loads.

In [4]:
Msg('saving compressed')
etcbc.to_msgpack('/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.pd', encoding='utf-8', compress='zlib')
msg('Done')

  0.00s saving compressed
    29s Done


In [6]:
!ls -lh /Users/dirk/SURFdrive/laf-fabric-data/r

total 1366456
-rw-r--r--  1 dirk  staff   227M Feb  1 13:33 etcbc4b.pd
-rw-r--r--  1 dirk  staff    41M Feb  1 13:33 etcbc4b.pd.gz
-rw-r--r--  1 dirk  staff    52M Jan 29 16:48 etcbc4b.rds
-rw-r--r--  1 dirk  staff   347M Jan 29 16:45 etcbc4b.txt


The file is not compressed very much: at 227M it is still about two thirds the size of the original tab-delimited file (347M).

In [5]:
!gzip -k -f /Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.pd

This is better: the gzipped file is only 41M. So for distribution we use `etcbc4b.pd.gz`.
This data has been saved in the GitHub repo
[etcbc/laf-fabric-data](https://github.com/ETCBC/laf-fabric-data).

In [7]:
!cp '/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.pd.gz' '/Users/dirk/SURFdrive/current/demos/github/laf-fabric-data/'

In [4]:
Msg('loading compressed')
etcbc = pd.read_msgpack('/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.pd', encoding='utf-8')
msg('Done. Size={}'.format(etcbc.size))

  0.00s loading compressed
  9.90s Done. Size=136501510


Not too bad: the complete data loaded in about 10 seconds.
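
`to_msgpack` and `read_msgpack` were deprecated in pandas 0.25 and removed in pandas 1.0, so the cells above need an older pandas. As a possible substitute, not what this notebook used, a gzip-compressed pickle gives a similar compact round trip. The sketch below uses a toy frame and a hypothetical file name in a temporary directory.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the big data frame.
toy = pd.DataFrame({'oid': [1, 2, 3], 'otype': ['book', 'chapter', 'verse']})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy.pkl.gz')  # hypothetical file name
    # Save as a gzip-compressed pickle and load it back.
    toy.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

print(restored.equals(toy))
```

Pickles should only be loaded from sources you trust, which matters when a file is distributed through a public repository.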

In [3]:
etcbc.shape

(1436858, 95)

In [4]:
etcbc.head(30)

Unnamed: 0,oid,otype,in.subphrase,in.phrase_atom,in.phrase,in.clause_atom,in.clause,in.sentence_atom,in.sentence,in.half_verse,...,tab,trailer_utf8,txt,typ,uvf,vbe,vbs,verse,vs,vt
0,1,book,,,,,,,,,...,,,,,,,,,,
1,2,chapter,,,,,,,,,...,,,,,,,,,,
2,3,verse,,,,,,,,,...,,,,,,,,1.0,,
3,11,sentence,,,,,,,,,...,,,,,,,,,,
4,10,sentence_atom,,,,,,,11.0,,...,,,,,,,,,,
5,9,clause,,,,,,10.0,11.0,,...,,,?,xQtX,,,,,,
6,8,clause_atom,,,,,9.0,10.0,11.0,,...,0.0,,,xQtX,,,,,,
7,4,half_verse,,,,,,,,,...,,,,,,,,,,
8,7,phrase,,,,8.0,9.0,10.0,11.0,4.0,...,,,,PP,,,,,,
9,6,phrase_atom,,,7.0,8.0,9.0,10.0,11.0,4.0,...,,,,PP,,,,,,


# Books

Let us extract some data.
First a list of the book names.

In [5]:
books = etcbc[etcbc.otype == 'book'].book
print(' '.join(str(x) for x in books))

Genesis Exodus Leviticus Numeri Deuteronomium Josua Judices Samuel_I Samuel_II Reges_I Reges_II Jesaia Jeremia Ezechiel Hosea Joel Amos Obadia Jona Micha Nahum Habakuk Zephania Haggai Sacharia Maleachi Psalmi Iob Proverbia Ruth Canticum Ecclesiastes Threni Esther Daniel Esra Nehemia Chronica_I Chronica_II


# Text

Now the complete text of the whole Hebrew Bible.

In [6]:
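Msg('Writing')
words = etcbc.loc[etcbc.otype == 'word']
# Glue every word to its trailer, the material between it and the next word.
text = words.g_word_utf8 + words.trailer_utf8
# Write explicitly as UTF-8 and start a new line after every sof pasuq (U+05C3).
with open('/Users/dirk/Downloads/test_pd_text.txt', 'w', encoding='utf-8') as tf:
    tf.write((''.join(text)).replace('\u05C3', '\u05C3\n'))
    tf.write('\n')
msg('Done')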

  0.00s Writing
  0.69s Done


In [7]:
!head /Users/dirk/Downloads/test_pd_text.txt

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃
וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃
 פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃
וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃
וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃
 פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖

Compare the texts as output by Pandas and by R.

In [41]:
!diff -qs /Users/dirk/Downloads/test_r_text.txt /Users/dirk/Downloads/test_pd_text.txt

Files /Users/dirk/Downloads/test_r_text.txt and /Users/dirk/Downloads/test_pd_text.txt are identical
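
The text assembly used above can be tried on a toy frame: each word is concatenated with its trailer, everything is joined into one string, and a newline is inserted after every sof pasuq (U+05C3). The Latin strings below are placeholders standing in for Hebrew words; only the two column names are taken from the real data.

```python
import pandas as pd

SOF_PASUQ = '\u05C3'  # Hebrew punctuation mark that closes a verse

# Placeholder words; the trailer holds whatever follows the word.
words = pd.DataFrame({
    'g_word_utf8': ['AB', 'CD', 'EF'],
    'trailer_utf8': [' ', SOF_PASUQ + ' ', SOF_PASUQ],
})

# Element-wise string concatenation of the two columns.
text = words.g_word_utf8 + words.trailer_utf8
# One long string, broken into lines after each verse end.
result = ''.join(text).replace(SOF_PASUQ, SOF_PASUQ + '\n')
print(result)
```

Because the line break is inserted directly after the sof pasuq, any trailer material that follows it ends up at the start of the next line. That explains the lines in the output above that start with a space.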


# Drill down to a passage

Let us get the words of the first verse, which is the object with `oid` 3.

In [27]:
word_ids = etcbc[(etcbc.otype=='word') & (etcbc['in.verse']==3)].oid
print(word_ids.values)

[ 5 12 13 16 20 23 24 26 27 28 29]


Now the *text* of the first verse.

In [18]:
words = etcbc[(etcbc.otype=='word') & (etcbc['in.verse']==3)]
text = words.g_word_utf8 + words.trailer_utf8
print((''.join(text)).replace('\u05C3', '\u05C3\n'))

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃



Let us get the words and text of an arbitrary passage, say Psalmi 131:2.

In [29]:
verse_id = etcbc[(etcbc.otype == 'verse') & \
                 (etcbc.book == 'Psalmi') & \
                 (etcbc.chapter == 131) & \
                 (etcbc.verse == 2)].oid.iloc[0]
print(verse_id)
words = etcbc[(etcbc.otype=='word') & (etcbc['in.verse']==verse_id)]
print(words.oid.values)
text = words.g_word_utf8 + words.trailer_utf8
print((''.join(text)).replace('\u05C3', '\u05C3\n'))

1123436
[1123438 1123445 1123448 1123451 1123456 1123459 1123463 1123466 1123467
 1123469 1123471 1123476 1123477 1123478 1123480]
אִם־לֹ֤א שִׁוִּ֨יתִי ׀ וְדֹומַ֗מְתִּי נַ֫פְשִׁ֥י כְּ֭גָמֻל עֲלֵ֣י אִמֹּ֑ו כַּגָּמֻ֖ל עָלַ֣י נַפְשִֽׁי׃



Every object carries its type, which we can look up by its `oid`.

In [34]:
otype = etcbc[etcbc.oid == 3].otype.iloc[0]
otype

'verse'

Now let us organize this in a few functions: `verse2object` and `chapter2object` return the object for a given passage, `object2text` returns the text of the words in a given object, and `verse2text` and `chapter2text` combine the two steps.

In [38]:
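def object2text(oid):
    # Look up the type of the object, e.g. 'verse', to find the matching 'in.' column.
    otype = etcbc[etcbc.oid == oid].otype.iloc[0]
    inotype = 'in.'+otype
    # Select the words contained in the object and glue each word to its trailer.
    words = etcbc[(etcbc.otype == 'word') & (etcbc[inotype] == oid)]
    text = words.g_word_utf8 + words.trailer_utf8
    return (''.join(text)).replace('\u05C3', '\u05C3\n')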

def verse2object(book, chapter, verse):
    return etcbc[(etcbc.otype == 'verse') & \
                 (etcbc.book == book) & \
                 (etcbc.chapter == chapter) & \
                 (etcbc.verse == verse)].oid.iloc[0]

def verse2text(book, chapter, verse):
    return object2text(verse2object(book, chapter, verse))

def chapter2object(book, chapter):
    return etcbc[(etcbc.otype == 'chapter') & \
                 (etcbc.book == book) & \
                 (etcbc.chapter == chapter)].oid.iloc[0]

def chapter2text(book, chapter):
    return object2text(chapter2object(book, chapter))
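
The lookup functions above take `.iloc[0]` of the selection, which raises an `IndexError` when the passage does not exist. As a hedged variant, the sketch below returns `None` in that case and takes the data frame as a parameter so that it can be tried on toy data. The name `find_verse` and the toy rows are made up for illustration.

```python
import pandas as pd


def find_verse(df, book, chapter, verse):
    """Return the oid of the verse object, or None if the passage is absent."""
    hits = df[(df.otype == 'verse')
              & (df.book == book)
              & (df.chapter == chapter)
              & (df.verse == verse)]
    return None if hits.empty else int(hits.oid.iloc[0])


nan = float('nan')
# Toy rows: two verse objects and two word objects.
toy = pd.DataFrame({
    'oid': [1, 2, 3, 4],
    'otype': ['verse', 'word', 'word', 'verse'],
    'book': ['Genesis', '', '', 'Genesis'],
    'chapter': [1.0, nan, nan, 1.0],
    'verse': [1.0, nan, nan, 2.0],
})

print(find_verse(toy, 'Genesis', 1, 2))  # an existing passage
print(find_verse(toy, 'Genesis', 5, 1))  # a passage that is not in the toy data
```

Wrapping the condition in parentheses also makes the backslash line continuations unnecessary.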

In [39]:
print(verse2text('Psalmi', 131, 2))

אִם־לֹ֤א שִׁוִּ֨יתִי ׀ וְדֹומַ֗מְתִּי נַ֫פְשִׁ֥י כְּ֭גָמֻל עֲלֵ֣י אִמֹּ֑ו כַּגָּמֻ֖ל עָלַ֣י נַפְשִֽׁי׃



In [40]:
print(chapter2text('Psalmi', 131))

שִׁ֥יר הַֽמַּֽעֲלֹ֗ות לְדָ֫וִ֥ד יְהוָ֤ה ׀ לֹא־גָבַ֣הּ לִ֭בִּי וְלֹא־רָמ֣וּ עֵינַ֑י וְלֹֽא־הִלַּ֓כְתִּי ׀ בִּגְדֹלֹ֖ות וּבְנִפְלָאֹ֣ות מִמֶּֽנִּי׃
אִם־לֹ֤א שִׁוִּ֨יתִי ׀ וְדֹומַ֗מְתִּי נַ֫פְשִׁ֥י כְּ֭גָמֻל עֲלֵ֣י אִמֹּ֑ו כַּגָּמֻ֖ל עָלַ֣י נַפְשִֽׁי׃
יַחֵ֣ל יִ֝שְׂרָאֵל אֶל־יְהוָ֑ה מֵֽ֝עַתָּ֗ה וְעַד־עֹולָֽם׃



# Bigrams

We make a column of verse-bound bigrams of lexemes. The two lexemes are separated by an underscore `_`. 

This has been done in the R version of this notebook.
In Pandas I have difficulties, because I get an error like this all the time:

``ValueError: buffer source array is read-only``

even when I apply recipes from the [Pandas cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).

In [13]:
vs_next = etcbc.groupby(level=0)['in.verse'].shift(1)

ValueError: buffer source array is read-only
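
Here is a sketch of how verse-bound bigrams could be computed, tried on toy data only. Note that `groupby(level=0)` groups by the row index, so every row would form its own group; the sketch groups by `in.verse` instead and shifts the lexeme column. One possible cause of the read-only error, which is an assumption not verified here, is that the frame loaded from msgpack is backed by read-only buffers; working on a `.copy()` of the word rows gives writable data. The lexemes below are made-up placeholders.

```python
import pandas as pd

# Toy data: five words spread over two verses.
toy = pd.DataFrame({
    'otype': ['word'] * 5,
    'in.verse': [3.0, 3.0, 3.0, 7.0, 7.0],
    'lex': ['a', 'b', 'c', 'd', 'e'],
})

# Work on a copy so that adding a column cannot touch the original buffers.
words = toy[toy.otype == 'word'].copy()

# Lexeme of the next word within the same verse; missing for the last word of a verse.
next_lex = words.groupby('in.verse')['lex'].shift(-1)

# Join the two lexemes with an underscore; rows without a next word stay missing.
words['bigram'] = words['lex'].str.cat(next_lex, sep='_')
print(words[['in.verse', 'lex', 'bigram']])
```

Grouping by verse before shifting is what keeps the bigrams verse-bound: the last word of a verse is never paired with the first word of the next verse.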