<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# ETCBC in R

This notebook exports the ETCBC database to an R data frame.
The nodes are exported as rows; they correspond to text objects such as word, phrase, clause, sentence, verse, chapter, book, and a few others.

The ETCBC features become the columns, so each row gives the values of the features for the corresponding object.

The edges corresponding to the ETCBC features *mother*, *functional_parent*, and *distributional_parent* are
exported as extra columns. For each row, such a column gives the target of the corresponding outgoing edge.
In the ETCBC data, an object has at most one outgoing edge of each edge type. The target is identified by its object identifier, which is in the ``oid`` column.
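
As a rough illustration of this layout (a toy sketch, not the code this notebook runs), the snippet below builds rows from a handful of made-up nodes. The node numbers, the identifier values, and the helper `make_rows` are all hypothetical; only the column names `oid`, `otype`, and `mother` are taken from the description above.

```python
# Toy data: every number and value below is invented for illustration.
nodes = {
    1: {"oid": "100", "otype": "word"},
    2: {"oid": "200", "otype": "clause"},
    3: {"oid": "300", "otype": "clause"},
}
# At most one outgoing edge per edge type, so a plain dict suffices.
mother_edges = {3: 2}  # node 3 has node 2 as its mother


def make_rows(nodes, edges, edge_name):
    """Return one row (a dict) per node, with the edge target's oid as an extra column."""
    rows = []
    for n in sorted(nodes):
        row = dict(nodes[n])
        target = edges.get(n)
        # Empty string when the node has no outgoing edge of this type.
        row[edge_name] = nodes[target]["oid"] if target is not None else ""
        rows.append(row)
    return rows


rows = make_rows(nodes, mother_edges, "mother")
for row in rows:
    print(row)
```

Because each object has at most one target per edge type, a single column per edge type is enough; no list-valued cells are needed.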

We also write the data that says which objects are contained in which.
In the ETCBC data, an object is contained in at most one object of each object type.
So a word is contained in exactly one phrase, never in more than one.
Another way to say this is that objects of the same type never overlap.
To each row we add the following columns:

* for each object type except `word`, there is a column named after that object type, containing
  the identifier of the object of that type that contains the row object (if any).
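
The containment columns can be pictured with a small sketch. It assumes that each object covers a contiguous range of word positions (monads), which is a simplification, and it uses made-up objects. The function `containing` is hypothetical; it is not how LAF-Fabric computes containment.

```python
# Toy objects as (oid, otype, first_monad, last_monad); all values are invented.
objects = [
    ("w1", "word", 1, 1),
    ("w2", "word", 2, 2),
    ("w3", "word", 3, 3),
    ("p1", "phrase", 1, 2),
    ("p2", "phrase", 3, 3),
    ("c1", "clause", 1, 3),
]


def containing(obj, otype, objects):
    """Return the oid of the object of type `otype` that contains `obj`, or ''."""
    _, own_type, first, last = obj
    if own_type == otype:
        return ""  # objects of the same type never overlap, so there is no container
    for oid, candidate_type, c_first, c_last in objects:
        if candidate_type == otype and c_first <= first and last <= c_last:
            return oid
    return ""


container_types = ["phrase", "clause"]  # one column per object type except word
table = {
    obj[0]: {t: containing(obj, t, objects) for t in container_types}
    for obj in objects
}
for oid, cols in table.items():
    print(oid, cols)
```

In this toy table, word `w1` gets `p1` in its `phrase` column and `c1` in its `clause` column, while clause `c1` has both columns empty.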

Extra data such as lexicon (including frequency and rank features), phonetic transcription, and ketiv-qere are also included.

The result is the file ``etcbc4b.rds`` in the
[laf-fabric-data GitHub repository](https://github.com/ETCBC/laf-fabric-data).

In [1]:
import sys, collections

from laf.fabric import LafFabric
import etcbc
from etcbc.preprocess import prepare

fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.7
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [2]:
API = fabric.load('etcbc4b', 'para', 'hinr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
''',""),
    "primary": False,
})
print(API['F_all'])

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s INFO: USING DATA COMPILED AT: 2016-01-27T18-59-08
  1.18s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/hinr/__log__hinr.txt
  1.18s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX para FOR TASK hinr AT 2016-01-29T13-41-45


[('etcbc4', ['db.maxmonad', 'db.minmonad', 'db.monads', 'db.oid', 'db.otype', 'ft.code', 'ft.det', 'ft.dist', 'ft.dist_unit', 'ft.domain', 'ft.function', 'ft.g_cons', 'ft.g_cons_utf8', 'ft.g_lex', 'ft.g_lex_utf8', 'ft.g_nme', 'ft.g_nme_utf8', 'ft.g_pfm', 'ft.g_pfm_utf8', 'ft.g_prs', 'ft.g_prs_utf8', 'ft.g_uvf', 'ft.g_uvf_utf8', 'ft.g_vbe', 'ft.g_vbe_utf8', 'ft.g_vbs', 'ft.g_vbs_utf8', 'ft.g_word', 'ft.g_word_utf8', 'ft.gn', 'ft.is_root', 'ft.kind', 'ft.language', 'ft.lex', 'ft.lex_utf8', 'ft.ls', 'ft.mother_object_type', 'ft.nme', 'ft.nu', 'ft.number', 'ft.pdp', 'ft.pfm', 'ft.prs', 'ft.ps', 'ft.rela', 'ft.sp', 'ft.st', 'ft.tab', 'ft.trailer_utf8', 'ft.txt', 'ft.typ', 'ft.uvf', 'ft.vbe', 'ft.vbs', 'ft.vs', 'ft.vt', 'px.instruction', 'px.number_in_ch', 'px.pargr', 'sft.book', 'sft.chapter', 'sft.label', 'sft.verse'])]


In [3]:
p_features = [x.split('.', 1)[1] for x in API['F_all'][0][1] if x.split('.', 1)[0] == 'px']
p_feature_str = ' '.join(p_features)
p_feature_str

'instruction number_in_ch pargr'

In [4]:
API = fabric.load_again({
    "xmlids": {"node": False, "edge": False},
    "features": (p_feature_str,""),
    "primary": False,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s INFO: USING DATA COMPILED AT: 2016-01-27T18-59-08
  0.23s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX para FOR TASK hinr AT 2016-01-29T13-41-50


In [5]:
msg("Collecting R data for paragraphs")
chunk_size = 100000
i = 0
s = 0
p_data = {}
for n in NN():
    # One tuple of paragraph feature values per node, in the order of p_features.
    p_data[n] = tuple(F.item[x].v(n) for x in p_features)
    i += 1
    s += 1
    if s == chunk_size:
        s = 0
        msg('{:>7} nodes read'.format(i))
msg('{:>7} nodes read and done'.format(i))

  7.29s Collecting R data for paragraphs
  7.73s  100000 nodes read
  8.06s  200000 nodes read
  8.39s  300000 nodes read
  8.72s  400000 nodes read
  9.04s  500000 nodes read
  9.36s  600000 nodes read
  9.72s  700000 nodes read
    10s  800000 nodes read
    11s  900000 nodes read
    11s 1000000 nodes read
    12s 1100000 nodes read
    12s 1200000 nodes read
    12s 1300000 nodes read
    13s 1400000 nodes read
    13s 1436858 nodes read and done


In [6]:
API = fabric.load('etcbc4b', 'lexicon', 'hinr', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
''',""),
    "primary": False,
})

print(API['F_all'])
print(API['FE_all'])

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s INFO: USING DATA COMPILED AT: 2016-01-27T19-01-17
  0.01s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK hinr AT 2016-01-29T13-42-05


[('etcbc4', ['db.maxmonad', 'db.minmonad', 'db.monads', 'db.oid', 'db.otype', 'ft.code', 'ft.det', 'ft.dist', 'ft.dist_unit', 'ft.domain', 'ft.function', 'ft.g_cons', 'ft.g_cons_utf8', 'ft.g_lex', 'ft.g_lex_utf8', 'ft.g_nme', 'ft.g_nme_utf8', 'ft.g_pfm', 'ft.g_pfm_utf8', 'ft.g_prs', 'ft.g_prs_utf8', 'ft.g_uvf', 'ft.g_uvf_utf8', 'ft.g_vbe', 'ft.g_vbe_utf8', 'ft.g_vbs', 'ft.g_vbs_utf8', 'ft.g_word', 'ft.g_word_utf8', 'ft.gn', 'ft.is_root', 'ft.kind', 'ft.language', 'ft.lex', 'ft.lex_utf8', 'ft.ls', 'ft.mother_object_type', 'ft.nme', 'ft.nu', 'ft.number', 'ft.pdp', 'ft.pfm', 'ft.prs', 'ft.ps', 'ft.rela', 'ft.sp', 'ft.st', 'ft.tab', 'ft.trailer_utf8', 'ft.txt', 'ft.typ', 'ft.uvf', 'ft.vbe', 'ft.vbs', 'ft.vs', 'ft.vt', 'kq.g_qere_utf8', 'kq.qtrailer_utf8', 'lex.entry', 'lex.entry_heb', 'lex.entryid', 'lex.freq_lex', 'lex.freq_occ', 'lex.g_entry', 'lex.g_entry_heb', 'lex.gloss', 'lex.id', 'lex.lan', 'lex.nametype', 'lex.pos', 'lex.rank_lex', 'lex.rank_occ', 'lex.root', 'lex.subpos', 'ph.phon

In [7]:
all_features = [x.split('.', 1)[1] for x in API['F_all'][0][1]]
all_feature_str = ' '.join(all_features)
alle_features = [x.split('.', 1)[1] for x in API['FE_all'][0][1]]
alle_feature_str = ' '.join(alle_features)
alle_feature_str

'distributional_parent functional_parent mother'

In [8]:
API = fabric.load_again({
    "xmlids": {"node": False, "edge": False},
    "features": (all_feature_str,alle_feature_str),
    "primary": False,
    "prepare": prepare,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s INFO: USING DATA COMPILED AT: 2016-01-27T19-01-17
    44s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  0.00s LOADING API with EXTRAs: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s INFO: USING DATA COMPILED AT: 2016-01-27T19-01-17
  0.01s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK hinr AT 2016-01-29T13-42-58
  0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK hinr AT 2016-01-29T13-42-58


In [12]:
msg("Writing R feature data")
hr = open('/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.txt', 'w')

use_features = all_features
usee_features = alle_features
l_features = '''
    subphrase phrase_atom phrase clause_atom clause sentence_atom sentence
    half_verse verse chapter book
'''.strip().split()


# One placeholder per column group: node features, paragraph features,
# edge targets, and containing objects.
row_fmt = '{}\t{}\t{}\t{}\n'

hr.write(row_fmt.format(
    '\t'.join(use_features),
    '\t'.join(p_features),
    '\t'.join(usee_features),
    '\t'.join(l_features),
))
chunk_size = 100000
i = 0
s = 0
for n in NN():
    all_values = [F.item[x].v(n) for x in use_features]
    p_values = p_data[n]
    alle_values = [F.oid.v((list(C.item[x].v(n)) or [-1])[0]) for x in usee_features]
    l_values = [L.u(x, n) or '' for x in l_features]
    hr.write(row_fmt.format(
        ('\t'.join(x or '' for x in all_values)).replace('\n',''),
        ('\t'.join(x or '' for x in p_values)).replace('\n',''),
        ('\t'.join(x or '' for x in alle_values)),
        ('\t'.join(str(x) for x in l_values)),
    ))
    i += 1
    s += 1
    if s == chunk_size:
        s = 0
        msg('{:>7} nodes written'.format(i))
hr.close()
msg('{:>7} nodes written and done'.format(i))

 9m 25s Writing R feature data
 9m 34s  100000 nodes written
 9m 45s  200000 nodes written
 9m 54s  300000 nodes written


KeyboardInterrupt: 

In [22]:
!ls -lh /Users/dirk/SURFdrive/laf-fabric-data/r

total 886752
-rw-r--r--  1 dirk  staff    49M Jan 29 11:00 etcbc4b.rds
-rw-r--r--  1 dirk  staff   265M Jan 29 10:52 etcbc4b.txt
-rw-r--r--  1 dirk  staff   6.6M Jan 29 13:38 etcbc4b_L.rds
-rw-r--r--  1 dirk  staff   113M Jan 29 13:58 etcbc4b_L.txt


The R export is ready now, but it is a bit large.
We can get a much leaner file by using R to load this file and save it in .rds format.
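
Before converting, it can be worth checking that every row of the tab-separated file has as many fields as the header. The sketch below does this with Python's standard `csv` module on a small temporary file with invented contents. The helper `check_tsv` is hypothetical; for the real export you would pass the path of `etcbc4b.txt` instead.

```python
import csv
import os
import tempfile


def check_tsv(path):
    """Return (number of columns, number of data rows, line numbers of bad rows)."""
    bad = []
    n_rows = 0
    with open(path, newline="", encoding="utf-8") as fh:
        # QUOTE_NONE: the export writes raw values without any quoting.
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        for line_no, fields in enumerate(reader, start=2):
            n_rows += 1
            if len(fields) != len(header):
                bad.append(line_no)
    return len(header), n_rows, bad


# Invented sample: the third line lacks its last field.
sample = "oid\totype\tphrase\n1\tword\t10\n2\tword\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

result = check_tsv(tmp.name)
os.remove(tmp.name)
print(result)
```

An empty list of bad rows means the header and the data rows agree on the number of columns.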

If your installation does not let you run R in a notebook with a Python kernel (as I am experiencing right now, due to problems with the Python module `rpy2`), switch to the Hebrew_In_RR notebook in this same directory to carry out the R cells.

Copy the resulting `.rds` file to the GitHub directory:

In [23]:
!cp '/Users/dirk/SURFdrive/laf-fabric-data/r/etcbc4b.rds' '/Users/dirk/SURFdrive/current/demos/github/laf-fabric-data/'