<img align="right" src="tf-small.png"/>

# SHEBANQ from ETCBC

This notebook assembles the data from the ETCBC that is needed
to feed the website [SHEBANQ](https://shebanq.ancient-data.org).

All data is delivered through GitHub repositories.
Before the pipeline starts, these repos must be pulled.

This notebook calls a series of other notebooks, some of which
reside in other GitHub repos.
Before these notebooks can be run, they must be converted to Python
programs. They will then be called as such, with parameters injected as local variables.
One of these parameters is `SCRIPT=True`, so that
a notebook can adapt its actions to the fact that it is part of the pipeline.
These notebooks can also be run interactively, and then you can add extra actions that are not relevant to the pipeline, such as testing, experimenting, and visualizing.
Take care to wrap such non-essential actions in contexts where
`SCRIPT=False`.
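
The `SCRIPT` convention described above can be sketched as follows. This is only an illustration under assumptions, not the actual pipeline code: the helper `run_as_script`, the `NOTEBOOK_SOURCE` string, and the use of `exec` to inject parameters are all hypothetical.

```python
# Stand-in for a notebook that has been converted to a Python program.
NOTEBOOK_SOURCE = '''
if 'SCRIPT' not in locals():
    SCRIPT = False                  # interactive run: nothing was injected

actions = ['convert data']          # essential, always runs
if not SCRIPT:
    actions.append('show plots')    # non-essential, interactive only
'''


def run_as_script(source, **params):
    """Hypothetical helper: run source with params pre-defined as variables."""
    namespace = dict(params)
    exec(compile(source, '<notebook>', 'exec'), namespace)
    return namespace


# No parameters injected: behaves like an interactive session.
interactive = run_as_script(NOTEBOOK_SOURCE)
# Parameters injected: behaves like a pipeline step.
in_pipeline = run_as_script(NOTEBOOK_SOURCE, SCRIPT=True, VERSION='c')

print(interactive['actions'])
print(in_pipeline['actions'])
```

Guarding on `SCRIPT` keeps one notebook usable both for exploration and as an unattended pipeline step.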

This notebook itself can also be run in script mode.

## Pipeline

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in
the GitHub repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `tfFromMQL` in the `programs` directory.

The result of this action is an updated TF resource in the repo's `tf` directory,
under `core` for the version being processed (`tf/c/core` for version `c`).

### Statistics

The notebook `addStats` in the same *bhsa* repo will add statistical
features to the core dataset: `freq_occ freq_lex rank_occ rank_lex`.
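
As a rough illustration of what frequency and rank features look like, here is a minimal sketch. It is not the `addStats` code: the function `freq_and_rank`, the toy lexeme list, and the tie-handling rule (equal frequencies share a rank, rank 1 is the most frequent) are assumptions made for this example.

```python
from collections import Counter


def freq_and_rank(values):
    """Hypothetical helper: frequency of each value and its rank by frequency."""
    freq = Counter(values)
    # Assumption: values with the same frequency share a rank; 1 = most frequent.
    distinct = sorted(set(freq.values()), reverse=True)
    rank_of_freq = {f: i + 1 for i, f in enumerate(distinct)}
    rank = {v: rank_of_freq[f] for v, f in freq.items()}
    return freq, rank


# Toy data: one lexeme label per word occurrence.
lexemes = ['W', 'H', 'W', 'L', 'H', 'W']
freq, rank = freq_and_rank(lexemes)

for lex in sorted(freq, key=lambda v: rank[v]):
    print(lex, freq[lex], rank[lex])
```

Applying the same computation to lexemes and to word occurrences would give values analogous to the `_lex` and `_occ` pairs of features.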

In [1]:
import os, sys, collections
from pipeline import runPipeline
from tf.fabric import Fabric

# Config

In [2]:
CORE_NAME = 'bhsa'
CORE_MODULE = 'core'

# In an interactive session nothing is injected, so set SCRIPT and the defaults here.
if 'SCRIPT' not in locals():
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = 'c'

In [5]:
pipeline = dict(
    defaults = dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        CORE_MODULE=CORE_MODULE,
    ),
    versions={
        '4': dict(),
        '4b': dict(),
        'c': dict(),
        'd': dict(),
        '2017': dict(),
    },
    repoOrder = '''
        bhsa
        phono
    ''',
    repoConfig = dict(
        bhsa=(
            dict(
                task='tfFromMQL',
            ),
            dict(
                task='lexicon',
                omit={'4', '4b'},
            ),
            dict(
                task='paragraphs',
                omit={'4', '4b', 'c'},
            ),
            dict(
                task='ketiv-qere',
                omit={'4', '4b', 'c'},
            ),
            dict(
                task='addStats',
                omit={'4', '4b'},
            ),
        ),
        phono=(
            dict(
                task='phono',
                omit={'4', '4b', 'c'},
            ),
        ),
    ),
)
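
To make the configuration easier to read, the sketch below lists which tasks would apply to a given version. It assumes that `omit` names the versions for which a task is skipped and that repos are processed in `repoOrder`. Both are inferred from the names, not confirmed by `runPipeline`. The helper `tasks_for_version` and the trimmed-down `example` config are hypothetical.

```python
def tasks_for_version(pipeline, version):
    """Hypothetical helper: (repo, task) pairs that apply to a version."""
    selected = []
    for repo in pipeline['repoOrder'].split():
        for step in pipeline['repoConfig'].get(repo, ()):
            # Assumption: a task is skipped when the version is in its omit set.
            if version in step.get('omit', set()):
                continue
            selected.append((repo, step['task']))
    return selected


# Trimmed-down config in the same shape as the pipeline dict above.
example = dict(
    repoOrder='''
        bhsa
        phono
    ''',
    repoConfig=dict(
        bhsa=(
            dict(task='tfFromMQL'),
            dict(task='lexicon', omit={'4', '4b'}),
            dict(task='paragraphs', omit={'4', '4b', 'c'}),
        ),
        phono=(
            dict(task='phono', omit={'4', '4b', 'c'}),
        ),
    ),
)

print(tasks_for_version(example, 'c'))
print(tasks_for_version(example, '2017'))
```

Under these assumptions, version `c` would run only the tasks that do not list `c` in their `omit` set.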

# Run the pipeline

In [7]:
good = runPipeline(pipeline, version='c', force=True)


####################################################################################
#                                                                                  #
# Make version [c]                                                                 #
#                                                                                  #
####################################################################################



=                                          =
= Make repo [bhsa]                         =
=                                          =



--------------------------------------------
- Run notebook [bhsa/tfFromMQL]            -
--------------------------------------------


START tfFromMQL (CORE_MODULE=core, CORE_NAME=bhsa, VERSION=c)
	Source /Users/dirk/github/etcbc/bhsa/source/c/bhsa.mql.bz2 exists
	Destination /Users/dirk/github/etcbc/bhsa/tf/c/core/.tf/otype.tfx exists
	Destination /Users/dirk/github/etcbc/bhsa/tf/c/core/.tf/otype.tfx up to date
      0.00s bun

     2m 47s 	line  33000000
		objects in phrase_atom
     2m 48s 33367199 lines parsed
426581 objects of type word
90562 objects of type clause_atom
63570 objects of type sentence
64339 objects of type sentence_atom
113792 objects of type subphrase
253174 objects of type phrase
929 objects of type chapter
39 objects of type book
88000 objects of type clause
45180 objects of type half_verse
23213 objects of type verse
267515 objects of type phrase_atom
     2m 48s Making TF data ...
     2m 48s Monad - idd mapping ...
     2m 48s maxSlot=426581
     2m 48s Node mapping and otype ...
     2m 49s oslots ...
     2m 49s metadata ...
     2m 49s features ...
     2m 49s 	features from words
     2m 55s 	   100000 words
     2m 59s 	   200000 words
     3m 05s 	   300000 words
     3m 10s 	   400000 words
     3m 11s 	   426581 words
     3m 11s 	features from books
     3m 11s 	       39 books
     3m 11s 	features from chapters
     3m 11s 	      929 chapters
     3m 11s 	features from cla

  0.00s Grid feature "otype" not found in

  0.00s Grid feature "oslots" not found in



  0.00s Grid feature "otext" not found. Working without Text-API

  0.00s Exporting 90 node and 4 edge and 1 config features to /Users/dirk/github/etcbc/bhsa/_temp/c/core:
   |     0.06s T book                 to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@am              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@ar              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@bn              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@da              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@de              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@el              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@en              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@es              to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T book@fa              to /Users/dirk/github/etcbc/bhsa/_

      0.00s checkDiffs
2 features to add:
	g_voc_lex g_voc_lex_utf8
6 features to delete:
	gloss lex0 nametype root voc_lex voc_lex_utf8
93 features in common
book                      ... no changes
book@am                   ... no changes
book@ar                   ... no changes
book@bn                   ... no changes
book@da                   ... no changes
book@de                   ... no changes
book@el                   ... no changes
book@en                   ... no changes
book@es                   ... no changes
book@fa                   ... no changes
book@fr                   ... no changes
book@he                   ... no changes
book@hi                   ... no changes
book@id                   ... no changes
book@ja                   ... no changes
book@ko                   ... no changes
book@la                   ... no changes
book@nl                   ... no changes
book@pa                   ... no changes
book@pt                   ... no changes
book@ru              

   |     0.00s T book@ja              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@ko              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@la              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@nl              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@pa              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@pt              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@ru              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@sw              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@syc             from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@tr              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@ur              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s T book@yo              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |

   |     0.15s B nu                   from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.12s B st                   from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.17s B g_voc_lex            from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.26s B g_voc_lex_utf8       from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s Feature overview: 90 for nodes; 4 for edges; 1 configs; 7 computed
  7.09s All features loaded/computed - for details use loadLog()
added 9236 lexemes
maxNode is now 1446130
Reading lexicon ...
Lexicon arc has   708 entries
Lexicon hbo has  8528 entries
Done
Reading the BHSA core data ...
Done
Language arc has   708 lexemes in the text
Language hbo has  8528 lexemes in the text
708 arc lexemes
8528 hbo lexemes
Equal lex values in hbo and arc in the BHSA   text contains 460 lexemes
Equal lex values in hbo and arc in the lexicon     contains 460 lexemes
Common values in the lexicon but not in the text: 0x: set()
Common values in the text but not i

  0.00s Grid feature "otype" not found in

  0.01s Grid feature "oslots" not found in



  0.01s Grid feature "otext" not found. Working without Text-API

  0.00s Exporting 12 node and 1 edge and 1 config features to /Users/dirk/github/etcbc/bhsa/_temp/c/core:
   |     0.04s T gloss                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.69s T language             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.70s T lex                  to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.68s T lex0                 to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.73s T lex_utf8             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.68s T ls                   to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.01s T nametype             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.65s T otype                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s T root                 to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.68s T sp                   to /Users/dirk/github/etcbc/bhsa/_