<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


# Import BHSA data into Pandas

This notebook contains the Pandas instructions to load the
[`bigTables`](bigTables.ipynb) export of the BHSA
and save it as a much more compact Parquet file (with extension `.pd`).

We then extract some simple information from the data.
For comparison, the same extraction has been done in R,
in [`bigTablesR`](bigTablesR.ipynb).

First, a little machinery to get timed messages and memory usage reports.

In [1]:
import os
from utils import caption
import pandas as pd  # pip3 install pandas
from tf.core.files import dirMake
from memutil import memUsage
memUsage()

Current:  0.12 GB
Delta:    0.12 GB
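The `caption` and `memUsage` helpers come from local modules (`utils`, `memutil`) that are not shown in this notebook. As a rough stand-in, here is a minimal sketch of a timed-message function. The names `formatElapsed` and `timedMessage` and the exact output layout are assumptions for illustration, not the notebook's own helpers.

```python
import time

_START = time.time()  # reference point for all elapsed times


def formatElapsed(seconds):
    """Render a duration as '1.27s' or '6m 47s' (hypothetical layout)."""
    if seconds < 60:
        return "{:.2f}s".format(seconds)
    minutes, rest = divmod(int(seconds), 60)
    return "{}m {:02d}s".format(minutes, rest)


def timedMessage(msg):
    """Print and return msg, prefixed with the time elapsed since _START."""
    line = "| {:>10} {}".format(formatElapsed(time.time() - _START), msg)
    print(line)
    return line


timedMessage("Reading BHSA data into pandas")
```

Measuring from a single start time keeps the messages comparable across cells, which is how the timings in the outputs below read.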


# Data files
We read exactly the same file as is read by R in [`bigTablesR`](bigTablesR.ipynb).

In [3]:
locations = "~/github/ETCBC"
coreModule = "bhsa"
version = "2021"
tempDir = os.path.expanduser("{}/{}/_temp/{}/r".format(locations, coreModule, version))
dirMake(tempDir)
tableFile = "{}/{}{}.txt".format(tempDir, coreModule, version)
tableFilePd = "{}/{}{}.pd".format(tempDir, coreModule, version)
plainTextPd = "{}/plainTextFromPd.txt".format(tempDir)
plainTextR = "{}/plainTextFromR.txt".format(tempDir)

All features that refer to other objects do so through the node number `n` of those objects.
Note that reading the file is much quicker than in R.

In [11]:
caption(0, "Reading BHSA data into pandas")

levelFeatures = """
    lex subphrase phrase_atom phrase clause_atom clause sentence_atom sentence
    half_verse verse chapter book
""".strip().split()
inLevelFeatures = ["in." + x for x in levelFeatures]

dtype = dict(
    n="int",
    otype="str",
    code="str",
    det="str",
    dist="float64",
    dist_unit="str",
    domain="str",
    function="str",
    g_cons="str",
    g_cons_utf8="str",
    g_lex="str",
    g_lex_utf8="str",
    g_nme="str",
    g_nme_utf8="str",
    g_pfm="str",
    g_pfm_utf8="str",
    g_prs="str",
    g_prs_utf8="str",
    g_uvf="str",
    g_uvf_utf8="str",
    g_vbe="str",
    g_vbe_utf8="str",
    g_vbs="str",
    g_vbs_utf8="str",
    g_word="str",
    g_word_utf8="str",
    gn="str",
    is_root="str",
    kind="str",
    language="str",
    languageISO="str",
    lex="str",
    lex_utf8="str",
    ls="str",
    mother_object_type="str",
    nme="str",
    nu="str",
    number="float64",
    pdp="str",
    pfm="str",
    prs="str",
    ps="str",
    rela="str",
    sp="str",
    st="str",
    tab="float64",
    trailer_utf8="str",
    txt="str",
    typ="str",
    uvf="str",
    vbe="str",
    vbs="str",
    vs="str",
    vt="str",
    g_qere_utf8="str",
    qtrailer_utf8="str",
    entry="str",
    entry_heb="str",
    entryid="str",
    freq_lex="float64",
    freq_occ="float64",
    g_entry="str",
    g_entry_heb="str",
    gloss="str",
    id="str",
    lan="str",
    nametype="str",
    pos="str",
    rank_lex="float64",
    rank_occ="float64",
    root="str",
    subpos="str",
    phono="str",
    phono_sep="str",
    book="str",
    chapter="float64",
    label="str",
    verse="float64",
    instruction="str",
    number_in_ch="float64",
    pargr="str",
    distributional_parent="str",
    functional_parent="str",
    mother="str",
)
for otype in inLevelFeatures:
    dtype[otype] = "float64"

naValues = dict((x, set() if dtype[x] == "str" else {""}) for x in dtype)

bhsa = pd.read_table(
    tableFile,
    delimiter="\t",
    low_memory=False,
    encoding="utf8",
    keep_default_na=False,
    na_values=naValues,
    dtype=dtype,
    #    index_col='n',
)
caption(0, "Done. Size = {}".format(bhsa.size))
memUsage()

|      6m 47s Reading BHSA data into pandas
|      6m 59s Done. Size = 146129931
Current:  2.07 GB
Delta:    1.93 GB
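The `naValues` mapping in the cell above treats an empty field as missing only in the numeric columns, while string columns keep the empty string. Let us see that on a small scale. The two-row table below is made up for illustration, and `sample`, `sampleDtype` and `sampleNa` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical two-row export: a word and the verse that contains it.
sample = "n\totype\tin.verse\tlex\n1\tword\t10\tB\n10\tverse\t\t\n"
sampleDtype = {"n": "int", "otype": "str", "in.verse": "float64", "lex": "str"}

# Empty fields count as missing only in the non-string columns.
sampleNa = {
    col: set() if typ == "str" else {""} for (col, typ) in sampleDtype.items()
}

df = pd.read_table(
    io.StringIO(sample),
    dtype=sampleDtype,
    keep_default_na=False,
    na_values=sampleNa,
)
print(df)
```

The verse row has no `in.verse` value, so that cell becomes `NaN`. This is why the containment columns have to be `float64` rather than `int`.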


Pandas advises using Apache Arrow (`pyarrow`) for serialization.
We save the data as a Parquet file through the `pyarrow` engine, then check how big it is and how fast it loads.

In [12]:
caption(0, "saving compressed")
bhsa.to_parquet(tableFilePd, engine="pyarrow")
caption(0, "Done")

|      7m 05s saving compressed
|      7m 09s Done


The file is nicely compressed: 1/6 of the original tab-delimited file.

G-zipping shaves off about another quarter (62M down to 47M in the listing below), maybe not worth the hassle.

In [5]:
!gzip -k -f {tableFilePd}

In [6]:
!ls -lh {tempDir}

total 1063312
-rw-r--r--  1 dirk  staff    62M Apr  8 16:53 bhsa2017.pd
-rw-r--r--  1 dirk  staff    47M Apr  8 16:53 bhsa2017.pd.gz
-rw-r--r--  1 dirk  staff    46M Apr  8 16:51 bhsa2017.rds
-rw-r--r--  1 dirk  staff   340M Apr  7 21:25 bhsa2017.txt
-rw-r--r--  1 dirk  staff   5.1M Apr  8 16:31 plainTextFromPd.txt
-rw-r--r--  1 dirk  staff   5.1M Apr  8 16:52 plainTextFromR.txt
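The size ratios can also be checked from Python instead of reading them off the listing. A minimal sketch follows. `sizeRatios` is a hypothetical helper, and the throwaway files only stand in for the real exports. In this notebook one would pass `tableFilePd` and `tableFile` instead.

```python
import os
import tempfile


def sizeRatios(paths, reference):
    """Map each path to its size as a fraction of the reference file's size."""
    refSize = os.path.getsize(reference)
    return {p: os.path.getsize(p) / refSize for p in paths}


# Self-contained demo with throwaway files standing in for the real exports.
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "table.txt")
    small = os.path.join(tmp, "table.pd")
    with open(big, "wb") as fh:
        fh.write(b"x" * 600)
    with open(small, "wb") as fh:
        fh.write(b"x" * 100)
    ratios = sizeRatios([small], big)
    print({os.path.basename(p): round(r, 2) for (p, r) in ratios.items()})
```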


Let us see how fast we can load the compressed file.

In [4]:
memUsage()
caption(0, "loading compressed")
bhsa = pd.read_parquet(tableFilePd, engine="pyarrow")
caption(0, "Done. Size={}".format(bhsa.size))
memUsage()

Current:  0.12 GB
Delta:    0.00 GB
|       0.00s loading compressed
|       1.27s Done. Size=146129931
Current:  2.43 GB
Delta:    2.31 GB


Not too bad: the complete data loaded in just over a second.

In [8]:
bhsa.shape

(1446635, 101)

In [9]:
bhsa.head(30)

Unnamed: 0,n,otype,in.lex,in.subphrase,in.phrase_atom,in.phrase,in.clause_atom,in.clause,in.sentence_atom,in.sentence,...,txt,typ,uvf,vbe,vbs,verse,voc_lex,voc_lex_utf8,vs,vt
0,426585,book,,,,,,,,,...,,,,,,,,,,
1,426624,chapter,,,,,,,,,...,,,,,,,,,,
2,1414190,verse,,,,,515654.0,427553.0,1235920.0,1172209.0,...,,,,,,1.0,,,,
3,1172209,sentence,,,,,515654.0,427553.0,1235920.0,,...,,,,,,,,,,
4,1235920,sentence_atom,,,,,515654.0,427553.0,,1172209.0,...,,,,,,,,,,
5,427553,clause,,,,,515654.0,,1235920.0,1172209.0,...,?,xQtX,,,,,,,,
6,515654,clause_atom,,,,,,427553.0,1235920.0,1172209.0,...,,xQtX,,,,,,,,
7,606323,half_verse,,,,,515654.0,427553.0,1235920.0,1172209.0,...,,,,,,,,,,
8,651503,phrase,,,904690.0,,515654.0,427553.0,1235920.0,1172209.0,...,,PP,,,,,,,,
9,904690,phrase_atom,,,,651503.0,515654.0,427553.0,1235920.0,1172209.0,...,,PP,,,,,,,,


In [10]:
columnList = bhsa.columns.values.tolist()
columnList

['n',
 'otype',
 'in.lex',
 'in.subphrase',
 'in.phrase_atom',
 'in.phrase',
 'in.clause_atom',
 'in.clause',
 'in.sentence_atom',
 'in.sentence',
 'in.half_verse',
 'in.verse',
 'in.chapter',
 'in.book',
 'distributional_parent',
 'functional_parent',
 'mother',
 'book',
 'chapter',
 'code',
 'det',
 'dist',
 'dist_unit',
 'domain',
 'freq_lex',
 'freq_occ',
 'function',
 'g_cons',
 'g_cons_utf8',
 'g_lex',
 'g_lex_utf8',
 'g_nme',
 'g_nme_utf8',
 'g_pfm',
 'g_pfm_utf8',
 'g_prs',
 'g_prs_utf8',
 'g_uvf',
 'g_uvf_utf8',
 'g_vbe',
 'g_vbe_utf8',
 'g_vbs',
 'g_vbs_utf8',
 'g_word',
 'g_word_utf8',
 'gloss',
 'gn',
 'instruction',
 'is_root',
 'kind',
 'kq_hybrid',
 'kq_hybrid_utf8',
 'label',
 'language',
 'languageISO',
 'lex',
 'lex0',
 'lex_utf8',
 'lexeme_count',
 'ls',
 'mother_object_type',
 'nametype',
 'nme',
 'nu',
 'number',
 'pargr',
 'pdp',
 'pfm',
 'phono',
 'phono_trailer',
 'prs',
 'prs_gn',
 'prs_nu',
 'prs_ps',
 'ps',
 'qere',
 'qere_trailer',
 'qere_trailer_utf8',
 'qe

# Books

Let us extract some data.
First a list of the book names.

In [11]:
books = bhsa[bhsa.otype == "book"].book
print(" ".join(str(x) for x in books))

Genesis Exodus Leviticus Numeri Deuteronomium Josua Judices Samuel_I Samuel_II Reges_I Reges_II Jesaia Jeremia Ezechiel Hosea Joel Amos Obadia Jona Micha Nahum Habakuk Zephania Haggai Sacharia Maleachi Psalmi Iob Proverbia Ruth Canticum Ecclesiastes Threni Esther Daniel Esra Nehemia Chronica_I Chronica_II
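Since words refer to their book through the node number in `in.book`, a book name can be attached to every word by a lookup on `n`. The sketch below shows the idea on a made-up miniature table. The frame `toy` and its values are hypothetical, and only the column names follow the big table.

```python
import pandas as pd

# Hypothetical miniature of the big table: three words and two books.
toy = pd.DataFrame(
    {
        "n": [1, 2, 3, 100, 101],
        "otype": ["word", "word", "word", "book", "book"],
        "in.book": [100.0, 100.0, 101.0, None, None],
        "book": ["", "", "", "Genesis", "Exodus"],
    }
)

# Series that maps a book node number to its name.
bookName = toy[toy.otype == "book"].set_index("n").book

# Look up the name of the containing book for every word.
toyWords = toy[toy.otype == "word"]
wordBook = toyWords["in.book"].astype("int64").map(bookName)
print(wordBook.value_counts().to_dict())
```

The cast to `int64` is safe here because every word row has a containing book. The column is only `float64` because non-word rows leave it empty.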


# Text

Now the complete text of the whole Hebrew Bible.

In [12]:
caption(0, "Writing")
words = bhsa.loc[bhsa.otype == "word"]
text = words.g_word_utf8 + words.trailer_utf8

with open(plainTextPd, "w", encoding="utf8") as pt:  # explicit encoding: the text is Hebrew
    pt.write(("".join(text)).replace("\u05C3", "\u05C3\n"))
    pt.write("\n")

caption(0, "Done")

|         58s Writing
|         59s Done
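The text assembly in the cell above works in three steps: concatenate each word with its trailer, join all pieces, and start a new line after every sof pasuq (`U+05C3`). Here is the same recipe on a made-up frame with Latin placeholders instead of Hebrew. `toyWords` and its contents are hypothetical.

```python
import pandas as pd

SOF_PASUQ = "\u05C3"  # verse-final punctuation mark

# Hypothetical stand-in for the word rows, with Latin placeholders for Hebrew.
toyWords = pd.DataFrame(
    {
        "g_word_utf8": ["in", "beginning", "and", "earth"],
        "trailer_utf8": ["-", SOF_PASUQ + " ", " ", SOF_PASUQ + " "],
    }
)

# Element-wise concatenation of each word with its trailer.
pieces = toyWords.g_word_utf8 + toyWords.trailer_utf8

# Join everything, then break the line after every sof pasuq.
text = "".join(pieces).replace(SOF_PASUQ, SOF_PASUQ + "\n")
print(text)
```

The space that follows a sof pasuq in the trailer ends up at the start of the next line, which is why every line after the first begins with a space in the outputs below.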


In [13]:
!head {plainTextPd}

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
 וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
 וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃
 וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃
 וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃
 פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃
 וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃
 וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃
 פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃
 וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖יִם 

In [14]:
!head {plainTextR}

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
 וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
 וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃
 וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃
 וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃
 פ וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃
 וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃
 וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃
 פ וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃
 וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖יִם 

Compare the texts as output by Pandas and by R.

In [15]:
!diff -qs {plainTextPd} {plainTextR}

Files /Users/dirk/github/etcbc/bhsa/_temp/2017/r/plainTextFromPd.txt and /Users/dirk/github/etcbc/bhsa/_temp/2017/r/plainTextFromR.txt are identical


# Drill down to a passage

Let us get the words from the first verse.

In [16]:
wordIds = bhsa[(bhsa.otype == "word") & (bhsa["in.verse"] == 1414190)].n
print(wordIds.values)

[ 1  2  3  4  5  6  7  8  9 10 11]


Now the *text* of the first verse.

In [17]:
words = bhsa[(bhsa.otype == "word") & (bhsa["in.verse"] == 1414190)]
text = words.g_word_utf8 + words.trailer_utf8
print(("".join(text)).replace("\u05C3", "\u05C3\n"))

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
 


Let us get the words and text of an arbitrary passage, say `Psalmi 131:2`.

In [18]:
verse_id = bhsa[
    (bhsa.otype == "verse")
    & (bhsa.book == "Psalmi")
    & (bhsa.chapter == 131)
    & (bhsa.verse == 2)
].n.iloc[0]
print(verse_id)
words = bhsa[(bhsa.otype == "word") & (bhsa["in.verse"] == verse_id)]
print(words.n.values)
text = words.g_word_utf8 + words.trailer_utf8
print(("".join(text)).replace("\u05C3", "\u05C3\n"))

1431613
[333421 333422 333423 333424 333425 333426 333427 333428 333429 333430
 333431 333432 333433 333434 333435]
אִם־לֹ֤א שִׁוִּ֨יתִי׀ וְדֹומַ֗מְתִּי נַ֫פְשִׁ֥י כְּ֭גָמֻל עֲלֵ֣י אִמֹּ֑ו כַּגָּמֻ֖ל עָלַ֣י נַפְשִֽׁי׃
 


In [19]:
otype = bhsa[bhsa.n == 1414190].otype.iloc[0]
otype

'verse'

Now let us organize this in a few functions: `object2text` returns the text of the words in a given object, `verse2object` and `chapter2object` return the object for a given passage, and `verse2text` and `chapter2text` combine the two.

In [20]:
def object2text(n):
    otype = bhsa[bhsa.n == n].otype.iloc[0]
    inotype = "in." + otype
    words = bhsa[(bhsa.otype == "word") & (bhsa[inotype] == n)]
    text = words.g_word_utf8 + words.trailer_utf8
    return ("".join(text)).replace("\u05C3", "\u05C3\n")


def verse2object(book, chapter, verse):
    return bhsa[
        (bhsa.otype == "verse")
        & (bhsa.book == book)
        & (bhsa.chapter == chapter)
        & (bhsa.verse == verse)
    ].n.iloc[0]


def verse2text(book, chapter, verse):
    return object2text(verse2object(book, chapter, verse))


def chapter2object(book, chapter):
    return bhsa[
        (bhsa.otype == "chapter") & (bhsa.book == book) & (bhsa.chapter == chapter)
    ].n.iloc[0]


def chapter2text(book, chapter):
    return object2text(chapter2object(book, chapter))
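The lookups above end in `.iloc[0]`, which raises an `IndexError` when no row matches the passage. A possible guard is sketched below on a made-up frame. `verse2objectOrNone` is a hypothetical variant and not part of the notebook. It takes the frame as an argument, so it can be tried without the full data.

```python
import pandas as pd

# Hypothetical miniature of the verse rows in the big table.
toy = pd.DataFrame(
    {
        "n": [500, 501],
        "otype": ["verse", "verse"],
        "book": ["Psalmi", "Psalmi"],
        "chapter": [131.0, 131.0],
        "verse": [1.0, 2.0],
    }
)


def verse2objectOrNone(df, book, chapter, verse):
    """Return the node number of a verse, or None when no row matches."""
    hits = df[
        (df.otype == "verse")
        & (df.book == book)
        & (df.chapter == chapter)
        & (df.verse == verse)
    ].n
    return None if hits.empty else int(hits.iloc[0])


print(verse2objectOrNone(toy, "Psalmi", 131, 2))
print(verse2objectOrNone(toy, "Psalmi", 131, 99))
```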

In [21]:
print(verse2text("Psalmi", 131, 2))

אִם־לֹ֤א שִׁוִּ֨יתִי׀ וְדֹומַ֗מְתִּי נַ֫פְשִׁ֥י כְּ֭גָמֻל עֲלֵ֣י אִמֹּ֑ו כַּגָּמֻ֖ל עָלַ֣י נַפְשִֽׁי׃
 


In [22]:
print(chapter2text("Psalmi", 131))

שִׁ֥יר הַֽמַּֽעֲלֹ֗ות לְדָ֫וִ֥ד יְהוָ֤ה׀ לֹא־גָבַ֣הּ לִ֭בִּי וְלֹא־רָמ֣וּ עֵינַ֑י וְלֹֽא־הִלַּ֓כְתִּי׀ בִּגְדֹלֹ֖ות וּבְנִפְלָאֹ֣ות מִמֶּֽנִּי׃
 אִם־לֹ֤א שִׁוִּ֨יתִי׀ וְדֹומַ֗מְתִּי נַ֫פְשִׁ֥י כְּ֭גָמֻל עֲלֵ֣י אִמֹּ֑ו כַּגָּמֻ֖ל עָלַ֣י נַפְשִֽׁי׃
 יַחֵ֣ל יִ֝שְׂרָאֵל אֶל־יְהוָ֑ה מֵֽ֝עַתָּ֗ה וְעַד־עֹולָֽם׃
 


# Bi-grams

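We make a column of verse-bound bi-grams of lexemes. Each bi-gram consists of a lexeme followed by the lexeme that precedes it in the same verse, separated by an underscore `_`; for the first word of a verse the second part is empty.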

In [23]:
allWords = bhsa[bhsa.otype == "word"]
vsCur = allWords["in.verse"]
vsPrev = vsCur.shift(1)  # verse of the preceding word
lex = allWords.lex_utf8
lexPrev = lex.shift(1)  # lexeme of the preceding word

In [24]:
# A word starts a verse when the preceding word belongs to another verse.
firstInVs = vsPrev != vsCur
lexPrev[firstInVs] = ""

In [25]:
bigram = ["{}_{}".format(*p) for p in zip(lex, lexPrev)]

In [26]:
bigram[0:30]

['ב_',
 'ראשׁית_ב',
 'ברא_ראשׁית',
 'אלהים_ברא',
 'את_אלהים',
 'ה_את',
 'שׁמים_ה',
 'ו_שׁמים',
 'את_ו',
 'ה_את',
 'ארץ_ה',
 'ו_',
 'ה_ו',
 'ארץ_ה',
 'היה_ארץ',
 'תהו_היה',
 'ו_תהו',
 'בהו_ו',
 'ו_בהו',
 'חשׁך_ו',
 'על_חשׁך',
 'פנה_על',
 'תהום_פנה',
 'ו_תהום',
 'רוח_ו',
 'אלהים_רוח',
 'רחף_אלהים',
 'על_רחף',
 'פנה_על',
 'ה_פנה']
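The list above pairs every lexeme with its predecessor. If bi-grams in reading order are wanted (a lexeme followed by its successor), one option is to shift in the other direction within each verse. Here is a minimal sketch on made-up data with Latin placeholders. `toyWords` is hypothetical, and the grouped shift is a swapped-in technique, not what the cells above do.

```python
import pandas as pd

# Hypothetical word rows: two verses, Latin placeholders for lexemes.
toyWords = pd.DataFrame(
    {
        "in.verse": [1.0, 1.0, 1.0, 2.0, 2.0],
        "lex_utf8": ["a", "b", "c", "d", "e"],
    }
)

# Successor lexeme within the same verse; the last word of a verse gets "".
lexNext = toyWords.groupby("in.verse")["lex_utf8"].shift(-1).fillna("")

forward = ["{}_{}".format(cur, nxt) for (cur, nxt) in zip(toyWords.lex_utf8, lexNext)]
print(forward)
```

Grouping by `in.verse` keeps the shift from crossing verse boundaries, so no separate boundary mask is needed.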