<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>

# BHSA as a Big Table
This notebook exports the [BHSA](etcbc.png) database to an R data frame.
The nodes are exported as rows; they correspond to text objects such as word, phrase, clause, sentence, verse, chapter, book, and a few others.

The BHSA features become the columns, so each row tells what values the features have for the corresponding object.

The edges corresponding to the BHSA features *mother*, *functional_parent*, *distributional_parent* are
exported as extra columns. For each row, such a column indicates the target of a corresponding outgoing edge.

We also write the data that says which objects are contained in which.
To each row we add the following columns:

* for each object type except `word`, there is a column named `in.` followed by that object type;
  it contains the identifier of the object of that type that contains the row object (if any).
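
To make the containment columns concrete, here is a minimal sketch on toy data. The `embedders` dictionary and the `in_columns` helper are hypothetical stand-ins for the containment information that Text-Fabric provides; the node numbers are made up. The sketch only shows how one `in.` column per object type could be filled, with an empty string when a node has no containing object of that type.

```python
# Toy stand-in for containment data: node -> {object type: containing nodes}.
# All node numbers here are made up for illustration.
embedders = {
    1: {'phrase': (100,), 'clause': (200,), 'verse': (300,)},
    2: {'phrase': (100,), 'clause': (200,), 'verse': (300,)},
    100: {'clause': (200,), 'verse': (300,)},
    300: {},
}

level_types = ['phrase', 'clause', 'verse']
NA = ['']  # fallback when a node has no container of a given type


def in_columns(node):
    """Return a dict with one 'in.<type>' entry per object type."""
    found = embedders.get(node, {})
    return {
        'in.' + otype: (found.get(otype) or NA)[0]
        for otype in level_types
    }


for node in (1, 100, 300):
    print(node, in_columns(node))
```

The `(value or NA)[0]` idiom picks the first containing object when there is one and falls back to the empty string otherwise, which is how a missing container ends up as an empty cell.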

Extra data such as lexicon (including frequency and rank features), phonetic transcription, and ketiv-qere are also included.

We compose the big table and save it as a tab-delimited file.
The result can be processed by R and Pandas,
which may convert the table to internal formats
for quicker loading.
It turns out that for data of this size Pandas is a bit quicker than R.

Also, because we remain in a Python environment, working with Pandas
is easier when you want to use configurations and libraries from the Text-Fabric sphere.

See 
[bigTablesR](bigTablesR.ipynb)
and
[bigTablesP](bigTablesP.ipynb)
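
As a rough illustration of the Pandas route, the sketch below reads a tab-delimited table and caches it in Pandas' pickle format for quicker reloading. It runs on a tiny made-up table instead of the real export; the column names and the cache file name are assumptions, and the companion notebooks may well proceed differently.

```python
import csv
import io
import os
import tempfile

import pandas as pd

# A tiny made-up table in the same shape as the export:
# tab-separated, one header row, empty cells for missing values.
tsv = (
    "n\totype\tin.phrase\tsp\n"
    "1\tword\t100\tprep\n"
    "2\tword\t100\tsubs\n"
    "100\tphrase\t\t\n"
)

# Read everything as strings and keep empty cells as empty strings,
# so that no value is silently reinterpreted.
df = pd.read_csv(
    io.StringIO(tsv),
    sep='\t',
    dtype=str,
    keep_default_na=False,
    quoting=csv.QUOTE_NONE,  # fields in the table are not quoted
)

# Cache the table in an internal format for faster loading next time.
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'table.pkl')  # hypothetical file name
    df.to_pickle(cache)
    reloaded = pd.read_pickle(cache)

print(reloaded.shape)
print(reloaded[reloaded.otype == 'word'])
```

Reading all columns as strings is a conservative choice for a sketch; for real analysis you would probably assign numeric or categorical types to selected columns.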

In [2]:
import os, sys, collections
from tf.fabric import Fabric

# Data source

In [3]:
locations = '~/github/etcbc'
coreModule = 'bhsa'
sources = [coreModule, 'phono']
version = '2017'
tempDir = os.path.expanduser('{}/{}/_temp/{}/r'.format(locations, coreModule, version))
tableFile = '{}/{}{}.txt'.format(tempDir, coreModule, version)

In [4]:
modules = ['{}/tf/{}'.format(s, version) for s in sources]
TF = Fabric(locations=locations, modules=modules)

This is Text-Fabric 8.4.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

118 features found and 0 ignored


# Load ALL features

In [5]:
api = TF.load('')
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  3.57s All features loaded/computed - for details use loadLog()
   |     0.00s Feature overview: 111 for nodes; 5 for edges; 2 configs; 8 computed
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  7.16s All features loaded/computed - for details use loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Writing tabular data

In [6]:
## info("Writing R feature data")

if not os.path.exists(tempDir):
    os.makedirs(tempDir)

hr = open(tableFile, 'w')

skipFeatures = '''
    otype
    oslots
'''.strip().split()
for f in (Fall() + Eall()):
    if '@' in f: skipFeatures.append(f)

levelFeatures = '''
    lex subphrase phrase_atom phrase clause_atom clause sentence_atom sentence
    half_verse verse chapter book
'''.strip().split()
inLevelFeatures = ['in.'+x for x in levelFeatures]

allNodeFeatures = sorted(set(Fall()) - set(skipFeatures))
allEdgeFeatures = sorted(set(Eall()) - set(skipFeatures))

hr.write('{}\t{}\t{}\t{}\t{}\n'.format(
    'n',
    'otype',
    '\t'.join(inLevelFeatures),
    '\t'.join(allEdgeFeatures),
    '\t'.join(allNodeFeatures),
))
chunkSize = 100000
i = 0
s = 0
NA = ['']
NAe = [['']]
for n in N.walk():
    levelValues = [(L.u(n, otype=level) or NA)[0] for level in levelFeatures]
    edgeValues = [str((Es(f).f(n) or NA)[0]) for f in allEdgeFeatures]
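    # Write '' only for missing values (None), so that a legitimate value 0 is kept as '0'
    nodeValues = [str(v) if v is not None else '' for v in (Fs(f).v(n) for f in allNodeFeatures)]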
    hr.write('{}\t{}\t{}\t{}\t{}\n'.format(
        n,
        F.otype.v(n),
        ('\t'.join(str(x) for x in levelValues)),
        ('\t'.join(edgeValues)),
        ('\t'.join(nodeValues)).replace('\n',''),
    ))
    i += 1
    s += 1
    if s == chunkSize:
        s = 0
        TF.info('{:>7} nodes written'.format(i))
hr.close()
TF.info('{:>7} nodes written and done'.format(i))

    27s  100000 nodes written
    41s  200000 nodes written
    55s  300000 nodes written
 1m 09s  400000 nodes written
 1m 23s  500000 nodes written
 1m 37s  600000 nodes written
 1m 52s  700000 nodes written
 2m 06s  800000 nodes written
 2m 20s  900000 nodes written
 2m 34s 1000000 nodes written
 2m 48s 1100000 nodes written
 3m 02s 1200000 nodes written
 3m 16s 1300000 nodes written
 3m 30s 1400000 nodes written
 3m 36s 1446635 nodes written and done
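
The loop above strips newlines from the node feature values so that every node stays on one line. A value that contains a tab would still shift the columns. The sketch below shows one possible way to guard against both, using hypothetical `clean` and `write_rows` helpers and toy rows; it is not part of the original export.

```python
import io


def clean(value):
    """Render a value as a single tab-free, newline-free field.

    None becomes the empty string; 0 is kept as '0'.
    """
    if value is None:
        return ''
    return str(value).replace('\t', ' ').replace('\n', ' ')


def write_rows(handle, header, rows):
    """Write a header line and data rows as tab-delimited text."""
    handle.write('\t'.join(header) + '\n')
    for row in rows:
        handle.write('\t'.join(clean(v) for v in row) + '\n')


# Toy rows: (node, otype, some feature value); all values are made up.
rows = [
    (1, 'word', 'a\tb'),
    (2, 'word', None),
    (3, 'word', 0),
    (4, 'verse', 'line one\nline two'),
]

buffer = io.StringIO()
write_rows(buffer, ['n', 'otype', 'value'], rows)
lines = buffer.getvalue().splitlines()

# Every line has the same number of fields as the header.
assert all(len(line.split('\t')) == 3 for line in lines)
print(buffer.getvalue())
```

Replacing tabs and newlines with a space keeps the table rectangular at the cost of slightly altering such values; whether that trade-off is acceptable depends on the features involved.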


In [7]:
!ls -lh {tempDir}

total 720904
-rw-r--r--  1 dirk  staff   340M Apr  7 21:25 bhsa2017.txt


The tabular export is ready now, but it is a bit large.
We can get a much leaner file by using R to load this file and save it in .rds format.

We do that in a separate notebook that runs R instead of Python: bigTablesR, in this same directory.