<img align="right" src="tf-small.png"/>

# Text-Fabric from ETCBC

This notebook assembles the data from the ETCBC that is needed
to compile its datasets in Text-Fabric format on Github.
Ultimately, the data for the website [SHEBANQ](https://shebanq.ancient-data.org) will be
derived from these TF sources.

## Pipeline

A run of the pipeline produces a data *version*.
It should be run whenever there are new or updated data sources present that affect the output data.
Since all input data is delivered in a Github repo, we have excellent machinery to 
work with versioning.

The pipeline works by executing a series of programs, contained in Github repositories.
For each repository in the pipeline, a series of notebooks will be executed.
See [script mode](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) for 
details on how we call notebooks.

All this is specified in the configuration below.

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in 
the Github repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `tfFromMQL` in the `programs` directory.

The result of this action will be an updated TF resource in the
`tf/core` directory of that repo.
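
The core source is a bz2-compressed MQL dump, so a converter has to read it as a compressed stream. The sketch below is not the code of `tfFromMQL`. It only illustrates, under that assumption, how such a file can be read line by line with a progress message at fixed intervals. The function name `countLines` and the `chunk` parameter are made up for this example.

```python
import bz2


def countLines(path, chunk=1000000):
    """Stream a bz2-compressed text file and return its number of lines.

    Hypothetical helper: it reads the file without unpacking it on disk
    and prints a progress message after every `chunk` lines.
    """
    n = 0
    with bz2.open(path, mode='rt', encoding='utf8') as fh:
        for line in fh:
            n += 1
            if n % chunk == 0:
                print(f'line {n:>9}')
    return n
```

A call such as `countLines('source/bhsa.mql.bz2')` would work this way, provided the file exists at that relative path.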

### Statistics

The notebook `addStats` in the same *bhsa* repo will add statistical
features to the core dataset: `freq_occ`, `freq_lex`, `rank_occ`, `rank_lex`.
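
To give an idea of what frequency and rank features involve, here is a minimal sketch. It is not the code of `addStats`. It assumes that a frequency is the number of times an item occurs, and that a rank is 1 plus the number of distinct items that occur more often, so that equally frequent items share a rank. Both the tie handling and the name `frequencyAndRank` are assumptions made for this example.

```python
import collections


def frequencyAndRank(items):
    """Return (freq, rank) dictionaries for a list of items.

    Hypothetical helper: `items` holds one entry per occurrence,
    e.g. the lexeme of every word in the text.
    """
    freq = collections.Counter(items)
    # how many distinct items have each frequency value
    freqCount = collections.Counter(freq.values())
    rankOfFreq = {}
    higher = 0  # number of distinct items seen so far with a higher frequency
    for f in sorted(freqCount, reverse=True):
        rankOfFreq[f] = higher + 1
        higher += freqCount[f]
    rank = {item: rankOfFreq[f] for (item, f) in freq.items()}
    return freq, rank
```

For the toy input `['W', 'H', 'W', 'B', 'H', 'W']` this gives frequencies 3, 2, 1 and ranks 1, 2, 3 for `W`, `H`, `B` respectively.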

All data is delivered through Github repositories.
Before the pipeline starts, these repos must be pulled.

This notebook will call a series of other notebooks, some of them
residing in other Github repos.
Before these notebooks can be run, they must be converted to Python
programs. Then they will be called as such, with parameters injected as local variables.
One of these parameters will be `SCRIPT=True`, so
that a notebook can adapt its actions to the fact that it is part of the pipeline.
These notebooks can also be run interactively, and then you can add extra actions that are not relevant to the pipeline conversion, such as testing, experimenting, and visualizing.
Take care to wrap such non-essential actions in contexts where
`SCRIPT=False`.

This notebook itself can also be run in script mode.
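
The mechanism of injecting parameters can be sketched as follows. This is only an illustration, not the way the pipeline is actually implemented. `notebookSource` stands in for a notebook that has already been converted to Python source, and `runAsScript` is a hypothetical helper that executes it in a namespace where `SCRIPT=True` and the other parameters are already defined.

```python
# Hypothetical stand-in for a notebook converted to Python source.
notebookSource = '''
if 'SCRIPT' not in locals():
    # interactive run: nothing was injected, so set defaults here
    SCRIPT = False
    VERSION = 'c'

result = f'converted version {VERSION}'

if not SCRIPT:
    # non-essential work: only when run interactively
    extras = 'ran tests and plots'
'''


def runAsScript(source, **params):
    """Execute source with SCRIPT=True and params defined beforehand."""
    namespace = dict(SCRIPT=True, **params)
    exec(source, namespace)
    return namespace


# pipeline run: parameters are injected, the extras are skipped
scriptRun = runAsScript(notebookSource, VERSION='2017')

# interactive run: nothing is injected, the defaults apply
interactiveRun = {}
exec(notebookSource, interactiveRun)
```

After this, `scriptRun` contains no `extras` entry, while `interactiveRun` does, because the guarded block only runs when `SCRIPT` is false.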



In [1]:
import os, sys, collections
from pipeline import runPipeline
from tf.fabric import Fabric

# Config

In [2]:
CORE_NAME = 'bhsa'
CORE_MODULE = 'core'

if 'SCRIPT' not in locals(): 
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = 'c'

In [3]:
pipeline = dict(
    defaults = dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        CORE_MODULE=CORE_MODULE,
    ),
    versions={
        '4': dict(),
        '4b': dict(),
        'c': dict(),
        'd': dict(),
        '2017': dict(),
    },
    repoOrder = '''
        bhsa
        phono
        valence
        parallels
    ''',
    repoConfig = dict(
        bhsa=(
            dict(
                task='tfFromMQL',
            ),
            dict(
                task='lexicon',
                omit={'4', '4b'},
            ),
            dict(
                task='paragraphs',
                omit={'4', '4b'},
            ),
            dict(
                task='ketivQere',
                omit={'4', '4b'},
            ),
            dict(
                task='addStats',
                omit={'4', '4b'},
            ),
        ),
        phono=(
            dict(
                task='phono',
                omit={'4', '4b'},
            ),
        ),
        valence=(
            dict(
                task='flowchart',
                omit={'4', '4b', 'c'},
            ),
        ),
        parallels=(
            dict(
                task='parallels',
                omit={'4', '4b'},
                params=dict(
                    FORCE_MATRIX=False,
                ),
            ),
        ),
    ),
)
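
One way to read this configuration: the repos are visited in the order given by `repoOrder`, and within each repo the tasks run in the order listed, except those that have the requested version in their `omit` set. The sketch below spells out that reading. It is an assumption about how the configuration is interpreted, not the code of `runPipeline`. The helper `tasksForVersion` and the shortened configuration `sample` are made up for this example.

```python
def tasksForVersion(pipeline, version):
    """List (repo, task, params) in execution order for one version.

    Hypothetical helper: skips tasks whose `omit` set contains the version
    and merges the defaults, the version, and the task-specific params.
    """
    plan = []
    for repo in pipeline['repoOrder'].strip().split():
        for taskInfo in pipeline['repoConfig'].get(repo, ()):
            if version in taskInfo.get('omit', set()):
                continue
            params = dict(pipeline['defaults'])
            params.update(VERSION=version)
            params.update(taskInfo.get('params', {}))
            plan.append((repo, taskInfo['task'], params))
    return plan


# shortened copy of the configuration above, for illustration only
sample = dict(
    defaults=dict(CORE_NAME='bhsa', VERSION='c', CORE_MODULE='core'),
    repoOrder='''
        bhsa
        valence
        parallels
    ''',
    repoConfig=dict(
        bhsa=(
            dict(task='tfFromMQL'),
            dict(task='lexicon', omit={'4', '4b'}),
        ),
        valence=(
            dict(task='flowchart', omit={'4', '4b', 'c'}),
        ),
        parallels=(
            dict(
                task='parallels',
                omit={'4', '4b'},
                params=dict(FORCE_MATRIX=False),
            ),
        ),
    ),
)

planC = tasksForVersion(sample, 'c')
```

With this reading, version `c` skips the `flowchart` task of the `valence` repo, and version `4` would only run `tfFromMQL`.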

# Run the pipeline

In [5]:
good = runPipeline(pipeline, version='c', force=True)


##############################################################################################
#                                                                                            #
#      1m 37s Make version [c]                                                               #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*      1m 37s Make repo [bhsa]                                                               *
*                                                                                            *
**********************************************************************************************


---------------------------------------------

|      1m 52s 	line   2000000
|      1m 55s 		objects in word
|      1m 57s 	line   3000000
|      2m 03s 	line   4000000
|      2m 09s 	line   5000000
|      2m 10s 		objects in word
|      2m 15s 	line   6000000
|      2m 21s 	line   7000000
|      2m 26s 		objects in word
|      2m 27s 	line   8000000
|      2m 33s 	line   9000000
|      2m 38s 	line  10000000
|      2m 41s 		objects in word
|      2m 45s 	line  11000000
|      2m 51s 	line  12000000
|      2m 56s 	line  13000000
|      2m 56s 		objects in word
|      3m 02s 	line  14000000
|      3m 08s 	line  15000000
|      3m 12s 		objects in word
|      3m 14s 	line  16000000
|      3m 20s 	line  17000000
|      3m 27s 	line  18000000
|      3m 28s 		objects in word
|      3m 33s 	line  19000000
|      3m 38s 	line  20000000
|      3m 43s 		objects in word
|      3m 44s 	line  21000000
|      3m 50s 	line  22000000
|      3m 51s 		objects in clause_atom
|      3m 54s 		objects in clause_atom
|      3m 55s 	line  23000000
|     

   |     0.82s T g_nme                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     1.01s T g_nme_utf8           to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     1.02s T g_pfm                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.90s T g_pfm_utf8           to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.86s T g_prs                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     1.01s T g_prs_utf8           to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.77s T g_uvf                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.72s T g_uvf_utf8           to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.73s T g_vbe                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.72s T g_vbe_utf8           to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.74s T g_vbs                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.71s T g_vbs_utf8           to /Users/dirk/github/etcbc/bhsa/_temp

|      6m 20s g_uvf_utf8                ... no changes
|      6m 20s g_vbe                     ... no changes
|      6m 20s g_vbe_utf8                ... no changes
|      6m 21s g_vbs                     ... no changes
|      6m 21s g_vbs_utf8                ... no changes
|      6m 21s g_word                    ... no changes
|      6m 22s g_word_utf8               ... no changes
|      6m 22s gn                        ... no changes
|      6m 23s is_root                   ... no changes
|      6m 23s kind                      ... no changes
|      6m 23s label                     ... no changes
|      6m 23s language                  ... differencesafter the metadata
|      6m 23s 	line      2 OLD -->hbo<--
|      6m 23s 	line      2 NEW -->Hebrew<--
|      6m 23s 	line      3 OLD -->hbo<--
|      6m 23s 	line      3 NEW -->Hebrew<--
|      6m 23s 	line      4 OLD -->hbo<--
|      6m 23s 	line      4 NEW -->Hebrew<--
|      6m 23s 	line      5 OLD -->hbo<--
|      6m 23s 	line      

   |     1.61s T g_word               from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.69s T g_word_utf8          from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.52s T lex                  from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.70s T lex_utf8             from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.16s T trailer              from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.20s T trailer_utf8         from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |      |     1.30s C __levels__           from otype, oslots
   |      |       20s C __order__            from otype, oslots, __levels__
   |      |     0.93s C __rank__             from otype, __order__
   |      |       19s C __levUp__            from otype, oslots, __rank__
   |      |       10s C __levDown__          from otype, __levUp__, __rank__
   |      |     4.76s C __boundary__         from otype, oslots, __rank__
   |     0.00s M otext                from /Users/dirk/github/etcbc/bh

   |     0.00s Feature overview: 90 for nodes; 4 for edges; 1 configs; 7 computed
 1m 13s All features loaded/computed - for details use loadLog()
..............................................................................................
.      9m 14s Basic test                                                                     .
..............................................................................................
..............................................................................................
.      9m 14s First verse in all formats                                                     .
..............................................................................................
text-orig-plain
	בראשׁית ברא אלהים את השׁמים ואת הארץ׃ 
lex-orig-full
	בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ 
lex-trans-full
	B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY 
lex-orig-plain
	ב ראשׁית֜ ברא אלהים֜ את ה שׁמים֜ ו את ה ארץ֜ 
text-trans-plai

..............................................................................................
.      9m 38s Various tweaks in features                                                     .
..............................................................................................
..............................................................................................
.      9m 38s Update the otype, oslots and otext features                                    .
..............................................................................................
|      9m 41s Features that have new or modified data
|      9m 41s 	gloss
|      9m 41s 	language
|      9m 41s 	lex
|      9m 41s 	lex0
|      9m 41s 	lex_utf8
|      9m 41s 	ls
|      9m 41s 	nametype
|      9m 41s 	otype
|      9m 41s 	root
|      9m 41s 	sp
|      9m 41s 	voc_lex
|      9m 41s 	voc_lex_utf8
|      9m 41s 	oslots
|      9m 41s Check voc_lex_utf8: בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֶרֶץ
|      9m

99 features found and 0 ignored
  0.00s loading features ...
   |     1.30s T otype                from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |       28s T oslots               from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.45s T lex0                 from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     1.64s T lex_utf8             from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |      |     1.25s C __levels__           from otype, oslots
   |      |       19s C __order__            from otype, oslots, __levels__
   |      |     0.89s C __rank__             from otype, __order__
   |      |       20s C __levUp__            from otype, oslots, __rank__
   |      |       11s C __levDown__          from otype, __levUp__, __rank__
   |      |     4.29s C __boundary__         from otype, oslots, __rank__
   |     0.00s M otext                from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |      |     0.14s C __sections__         from otype, oslots, otext, __levUp__, __levels_

|     11m 38s 	Read 90562 paragraph annotations
|     11m 38s 	OK: All label/line entries found in index
|     11m 38s Prepare TF paragraph features
..............................................................................................
.     11m 38s write new/changed features to TF ...                                           .
..............................................................................................
   |     0.15s T instruction          to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.15s T pargr                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
..............................................................................................
.     11m 38s Check differences with previous version                                        .
..............................................................................................
|     11m 38s 	2 features to add
|     11m 38s 		instruction
|     11m 38s 		pargr
|     11m 38s 	no features to 

   |     0.01s T qere_utf8            to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.00s M otext                to /Users/dirk/github/etcbc/bhsa/_temp/c/core
..............................................................................................
.     11m 49s Check differences with previous version                                        .
..............................................................................................
|     11m 49s 	2 features to add
|     11m 49s 		qere_trailer
|     11m 49s 		qere_trailer_utf8
|     11m 49s 	no features to delete
|     11m 49s 	3 features in common
|     11m 49s otext                     ... differences
|     11m 49s 	line      5 OLD -->@dateWritten=2017-09-29T13:44:27Z<--
|     11m 49s 	line      5 NEW -->@dateWritten=2017-09-29T13:48:09Z<--
|     11m 49s 	line     12 OLD -->@fmt:text-orig-full={g_word_utf8}{traile ...<--
|     11m 49s 	line     12 NEW -->@fmt:text-orig-full={qere_utf8/g_word_ut ...<--
|     11m 49s 	l

   |     0.74s T freq_lex             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.72s T freq_occ             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.74s T rank_lex             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.73s T rank_occ             to /Users/dirk/github/etcbc/bhsa/_temp/c/core
..............................................................................................
.     12m 08s Check differences with previous version                                        .
..............................................................................................
|     12m 08s 	4 features to add
|     12m 08s 		freq_lex
|     12m 08s 		freq_occ
|     12m 08s 		rank_lex
|     12m 08s 		rank_occ
|     12m 08s 	no features to delete
|     12m 08s 	0 features in common
|     12m 08s Done
..............................................................................................
.     12m 08s Deliver features to /Users/dirk/github/etcbc/

   |     0.16s B pfm                  from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.16s B vbs                  from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.14s B vbe                  from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.16s B language             from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s Feature overview: 102 for nodes; 4 for edges; 1 configs; 7 computed
  9.11s All features loaded/computed - for details use loadLog()
|     12m 31s 	Looking for non-verb qamets
|     12m 34s 	4060 lexemes and 13458 unique occurrences
|     12m 34s 	Filtering lexemes with varied occurrences
|     12m 35s 	161 interesting lexemes with 1705 unique occurrences
|     12m 35s 	Guessing between gadol and qatan
	JM/: Override for syllable 1: ā becomes o
	BJT/: Override for syllable 1: o becomes ā
	JWMM: Override for syllable 2:  becomes ā
	JHWNTN/: Override for syllable 2:  becomes ā
	JRB<M/: No override needed for syllable 1 which is ā
|     12m 35s 	10

|     13m 36s START parallels (CORE_MODULE=core, CORE_NAME=bhsa, FORCE_MATRIX=False, VERSION=c)
|     13m 36s 	Destination /Users/dirk/github/etcbc/parallels/tf/c/parallels/.tf/crossref.tfx exists
|     13m 36s NOTE: repo seems up to date. Will be run because of "force=True"
..............................................................................................
.     13m 36s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 2.3.15
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an 