# Import BHSA data into R

This notebook contains the R instructions to load the 
[bigTables](bigTables.ipynb) export of the BHSA
and save it in the much more compact `.rds` format.

We then perform some simple information extraction on the data.
For comparison, the same extraction has been done with Pandas
in [bigTablesP](bigTablesP.ipynb).

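Note that we have to tell R not to interpret quotes and comment signs in the data, which is why `quote` and `comment.char` are set to empty strings when reading the file.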

First we load the big text file with all information. This will take 3 minutes or so.

In [1]:
bhsa = read.table(
    '../_temp/2017/r/bhsa2017.txt', 
    sep="\t", 
    header=TRUE, 
    comment.char="",
    quote="",
    as.is=TRUE
)
dim(bhsa)
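
To see what the two empty-string settings do, here is a minimal sketch on a made-up three-row table (the `toy` string and its contents are assumptions for illustration, not BHSA data). By default `read.table` treats `"` and `'` as quote characters and `#` as the start of a comment; with `quote=""` and `comment.char=""` they are read as ordinary characters.

```r
# Toy tab-separated input containing an apostrophe, a hash sign and a double quote.
# This string is invented for illustration only.
toy = "n\tgloss\n1\tdon't\n2\t# sign\n3\t\"quoted\n"

strict = read.table(
    text=toy,          # read from a string instead of a file
    sep="\t",
    header=TRUE,
    comment.char="",   # '#' is data, not a comment marker
    quote="",          # quotes are data, not field delimiters
    as.is=TRUE         # keep strings as character vectors
)
print(strict)
```

Every special character ends up in the `gloss` column unchanged, which is what we want for the BHSA export.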

Now we save it in the compact `.rds` format.

In [2]:
saveRDS(
    object=bhsa, 
    file='../_temp/2017/r/bhsa2017.rds'
)

We load the data again, now from the compact representation. This is much quicker, but it still takes about 40 seconds.

In [3]:
bhsa = readRDS(
    file='../_temp/2017/r/bhsa2017.rds'
)
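
The save and load steps can be tried on a small scale first. The sketch below assumes a tiny made-up data frame (`toy`) and a temporary file; it only illustrates that `saveRDS` followed by `readRDS` gives back the same table.

```r
# A tiny stand-in for the big table (invented values).
toy = data.frame(
    n=c(1, 2, 10),
    otype=c('word', 'word', 'verse'),
    stringsAsFactors=FALSE
)

path = tempfile(fileext='.rds')   # throwaway location, not the notebook's path
saveRDS(object=toy, file=path)
back = readRDS(file=path)

# TRUE if the round trip preserved columns, types and values
roundTripOk = isTRUE(all.equal(toy, back))
print(roundTripOk)
```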

In [4]:
dim(bhsa)

In [5]:
head(bhsa, n=30)

n,otype,in.subphrase,in.phrase_atom,in.phrase,in.clause_atom,in.clause,in.sentence_atom,in.sentence,in.half_verse,⋯,txt,typ,uvf,vbe,vbs,verse,voc_lex,voc_lex_utf8,vs,vt
426585,book,,,,,,,,,⋯,,,,,,,,,,
426624,chapter,,,,,,,,,⋯,,,,,,,,,,
1414190,verse,,,,,,,,,⋯,,,,,,1.0,,,,
1172209,sentence,,,,,,,,,⋯,,,,,,,,,,
1235920,sentence_atom,,,,,,,1172209.0,,⋯,,,,,,,,,,
427553,clause,,,,,,1235920.0,1172209.0,,⋯,?,xQtX,,,,,,,,
515654,clause_atom,,,,,427553.0,1235920.0,1172209.0,,⋯,,xQtX,,,,,,,,
606323,half_verse,,,,515654.0,427553.0,1235920.0,1172209.0,,⋯,,,,,,,,,,
651503,phrase,,,,515654.0,427553.0,1235920.0,1172209.0,606323.0,⋯,,PP,,,,,,,,
904690,phrase_atom,,,651503.0,515654.0,427553.0,1235920.0,1172209.0,606323.0,⋯,,PP,,,,,,,,


# Books

Let us extract some data.
First a list of the book names.

In [6]:
books = bhsa$book[bhsa$otype == 'book']
paste(books, collapse=' ')

# Text

Now the complete text of the Hebrew Bible.

In [7]:
words = which(bhsa$otype == 'word')
text = paste(
    bhsa$g_word_utf8[words], sub('׃', '׃\n', bhsa$trailer_utf8[words]),
    sep='', collapse=''
)
write(text, file='../_temp/2017/r/plainTextFromR.txt')
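
How the pieces are glued together is easier to see on a few made-up tokens. In this sketch the vectors `g_word` and `trailer` are hypothetical stand-ins for the `g_word_utf8` and `trailer_utf8` columns; `'\u05c3'` is the sof pasuq `׃` that ends a verse.

```r
# Invented word forms and trailers; the second word ends a verse.
g_word = c('ab', 'cd', 'ef')
trailer = c(' ', '\u05c3 ', '')

# sub() replaces the first sof pasuq in each trailer by sof pasuq + newline;
# paste() with sep='' joins word and trailer, collapse='' joins all words.
toyText = paste(
    g_word, sub('\u05c3', '\u05c3\n', trailer),
    sep='', collapse=''
)
cat(toyText)
```

Each verse then starts on a new line in the resulting string.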

# Drill down to a passage

Let us get the node numbers of the words in the first verse:

In [10]:
wordIds = bhsa$n[bhsa$otype=='word' & bhsa$in.verse==1414190]
wordIds

Now the *text* of the first verse.

In [11]:
words = which(bhsa$n %in% wordIds)
gsub('׃', '׃\n', 
    paste(bhsa$g_word_utf8[words], bhsa$trailer_utf8[words], collapse='')
)

Let us get the words and text of an arbitrary passage, say Psalmi 131:2:

In [12]:
verseId = bhsa$n[bhsa$otype == 'verse' & bhsa$book == 'Psalmi' & bhsa$chapter == 131 & bhsa$verse == 2]
verseId
wordIds = bhsa$n[bhsa$otype=='word' & bhsa$in.verse == verseId]
wordIds
words = which(bhsa$n %in% wordIds)
gsub('׃', '׃\n', 
    paste(bhsa$g_word_utf8[words], bhsa$trailer_utf8[words], collapse='')
)

Now let us organize this in a few functions: one that returns the text of the words in a given object, two that return the verse or chapter object for a given passage, and two that combine these to return the text of a verse or chapter.

In [13]:
object2text = function(n) {
    # the type of object n tells us which in.<otype> column links words to it
    otype = bhsa$otype[bhsa$n == n]
    inColumn = paste0('in.', otype)
    # rows of the words contained in object n; which() drops NA comparisons
    words = which(bhsa$otype == 'word' & bhsa[[inColumn]] == n)
    return(gsub('׃', '׃\n',  
        paste(bhsa$g_word_utf8[words], bhsa$trailer_utf8[words], collapse='')
    ))
}

verse2object = function(book, chapter, verse) {
    return(bhsa$n[bhsa$otype == 'verse' & bhsa$book == book & bhsa$chapter == chapter & bhsa$verse == verse])
}
verse2text = function(book, chapter, verse) {
    return(object2text(verse2object(book, chapter, verse)))
}
chapter2object = function(book, chapter) {
    return(bhsa$n[bhsa$otype == 'chapter' & bhsa$book == book & bhsa$chapter == chapter])
}
chapter2text = function(book, chapter) {
    return(object2text(chapter2object(book, chapter)))
}
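
The containment lookup behind `object2text` can be tried on a toy table. This is a minimal sketch under stated assumptions: `toy` is an invented five-row table with the same column naming scheme (`n`, `otype`, `in.verse`), and `wordsIn` is a hypothetical helper, not part of the notebook.

```r
# Invented data: three words spread over two verses (nodes 10 and 11).
toy = data.frame(
    n=c(1, 2, 3, 10, 11),
    otype=c('word', 'word', 'word', 'verse', 'verse'),
    in.verse=c(10, 10, 11, NA, NA),
    g_word_utf8=c('a', 'b', 'c', NA, NA),
    stringsAsFactors=FALSE
)

# Hypothetical helper: row positions of the words contained in object n.
wordsIn = function(df, n) {
    otype = df$otype[df$n == n]      # e.g. 'verse'
    inColumn = paste0('in.', otype)  # e.g. 'in.verse'
    which(df$otype == 'word' & df[[inColumn]] == n)
}

firstVerse = toy$g_word_utf8[wordsIn(toy, 10)]
secondVerse = toy$g_word_utf8[wordsIn(toy, 11)]
print(firstVerse)
print(secondVerse)
```

Looking up the column by name with `[[` keeps the code working for any object type that has a matching `in.<otype>` column.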

In [14]:
cat(verse2text('Psalmi', 131, 2))

אִם ־לֹ֤א  שִׁוִּ֨יתִי ׀ וְ דֹומַ֗מְתִּי  נַ֫פְשִׁ֥י  כְּ֭ גָמֻל  עֲלֵ֣י  אִמֹּ֑ו  כַּ  גָּמֻ֖ל  עָלַ֣י  נַפְשִֽׁי ׃
 

In [15]:
cat(chapter2text('Psalmi', 131))

שִׁ֥יר  הַֽ מַּֽעֲלֹ֗ות  לְ דָ֫וִ֥ד  יְהוָ֤ה ׀ לֹא ־גָבַ֣הּ  לִ֭בִּי  וְ לֹא ־רָמ֣וּ  עֵינַ֑י  וְ לֹֽא ־הִלַּ֓כְתִּי ׀ בִּ גְדֹלֹ֖ות  וּ בְ נִפְלָאֹ֣ות  מִמֶּֽנִּי ׃
 אִם ־לֹ֤א  שִׁוִּ֨יתִי ׀ וְ דֹומַ֗מְתִּי  נַ֫פְשִׁ֥י  כְּ֭ גָמֻל  עֲלֵ֣י  אִמֹּ֑ו  כַּ  גָּמֻ֖ל  עָלַ֣י  נַפְשִֽׁי ׃
 יַחֵ֣ל  יִ֝שְׂרָאֵל  אֶל ־יְהוָ֑ה  מֵֽ֝ עַתָּ֗ה  וְ עַד ־עֹולָֽם ׃
 

# Bigrams

We make a column of verse-bound bigrams of lexemes. The two lexemes are separated by an underscore `_`. 

In [19]:
isWord = bhsa$otype == 'word'
vs = bhsa$in.verse[isWord]
vsNext = vs[-1]           # verse of the following word
vsPrev = vs[-length(vs)]  # verse of the current word (all words but the last)

lex = bhsa$g_lex_utf8[isWord]
lexNext = c(lex[-1], '')  # padded so that it lines up with lex
lastInVs = c(vsPrev != vsNext, TRUE)  # the very last word also ends a verse

lexNext[lastInVs] = ''

bigram = paste(
    lex,
    lexNext,
    sep='_'
)

In [20]:
head(bigram, n=30)
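
A toy run makes the alignment visible. This sketch assumes five invented lexemes spread over two verses and assumes, as in the cell above, that the last word of a verse gets an empty second half; the vectors `toyVs` and `toyLex` are made up for illustration.

```r
toyVs = c(1, 1, 1, 2, 2)               # verse of each word
toyLex = c('a', 'b', 'c', 'd', 'e')    # lexeme of each word

k = length(toyLex)
toyLexNext = c(toyLex[-1], '')                 # following lexeme, padded at the end
toyLast = c(toyVs[-k] != toyVs[-1], TRUE)      # TRUE where a word is last in its verse
toyLexNext[toyLast] = ''                       # no bigram across a verse boundary

toyBigram = paste(toyLex, toyLexNext, sep='_')
print(toyBigram)
```

The padding step matters: without it `paste` would silently recycle the shorter vector and pair the last word with the first following lexeme.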

In [27]:
vsNext[1:2]

In [29]:
vsPrev[1:2]