![pipeline](pictures/pictures.002.png)

# Text-Fabric from ETCBC

This notebook assembles the data from the ETCBC that is needed
to compile its datasets in Text-Fabric format on GitHub.
Ultimately, the data for the website [SHEBANQ](https://shebanq.ancient-data.org) will be
derived from these TF sources.

## Pipeline
This is **pipe 1** of the pipeline from ETCBC data to the website SHEBANQ.

A run of this pipe produces a data *version*.
It should be run whenever there are new or updated data sources present that affect the output data.
Since all input data is delivered in a Github repo, we have excellent machinery to 
work with versioning.

The pipe works by executing a series of programs, contained in Github repositories.
For each repository in the pipe, a series of notebooks will be executed.
See [script mode](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) for 
details on how we call notebooks.

All this is specified in the configuration below.

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in 
the Github repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `tfFromMQL` in the `programs` directory.

The result of this action will be an updated TF resource in its 
`tf/core` directory.
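
To check that a freshly delivered source file is readable before the conversion starts, you can peek at its first lines. The following is a minimal sketch: the helper name `peekMql` is hypothetical, the UTF-8 encoding is an assumption, and the path to `bhsa.mql.bz2` depends on your own checkout and version directory.

```python
import bz2


def peekMql(path, nLines=5):
    """Return the first nLines lines of a bz2-compressed text file.

    Hypothetical helper; it assumes the file is UTF-8 encoded text.
    """
    lines = []
    with bz2.open(path, mode='rt', encoding='utf8') as fh:
        for i, line in enumerate(fh):
            if i >= nLines:
                break
            lines.append(line.rstrip('\n'))
    return lines
```

The file is decompressed as a stream, so only the requested lines are read and the large MQL dump never has to be unpacked on disk.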

### Additional data

Researchers have contributed additional data to the dataset,
but not all of that data is in the core.
It typically lives in the repository where the research was
carried out and where the data is documented.

Before the pipe starts, these repos must be pulled.
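
Pulling the repos can be scripted. The sketch below only assumes that all repos are cloned side by side under one base directory; the base directory shown mirrors the paths in the logs of this notebook and is an assumption, as are the helper names `pullCommands` and `pullAll`.

```python
import subprocess
from pathlib import Path

# Assumption: all repos are cloned side by side under this directory.
BASE_DIR = Path.home() / 'github' / 'etcbc'
REPOS = ['bhsa', 'phono', 'valence', 'parallels']


def pullCommands(repos=REPOS, baseDir=BASE_DIR):
    """Return one `git pull` command (as an argument list) per repo."""
    return [
        ['git', '-C', str(Path(baseDir) / repo), 'pull', '--ff-only']
        for repo in repos
    ]


def pullAll(repos=REPOS, baseDir=BASE_DIR):
    """Run the pull commands in order; raise on the first failure."""
    for cmd in pullCommands(repos, baseDir):
        subprocess.run(cmd, check=True)
```

Calling `pullAll()` before `runPipeline` updates every repo in turn. The `--ff-only` flag makes a pull fail rather than create a merge commit when the local clone has diverged.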

## Continuous version
Version `c` acts as a *continuous* version. It will be overwritten
by new snapshots of the data on a regular basis.

We support the following workflow to carry out these updates:

1. Make a new version called `_temp`. Note:
   * this name keeps the version from reaching GitHub, because `_temp` directories are listed in `.gitignore`;
   * after the first run of this workflow the version `_temp` already exists; that is not a problem.
2. Put a new data snapshot in the `source/_temp` directory of the `bhsa` repo, and also add data to the
   `source/_temp` directories of the other repos in the pipeline, as far as relevant.
3. Run `good = runPipeline(pipeline, versions=['_temp'], force=True)`. Note:
   * we use `force=True` here so that the old data in version `_temp` is completely overwritten.
4. If all went well, run `copyVersion(pipeline, '_temp', 'c')`. This overwrites all data directories
   of version `c` with the newly created data directories of version `_temp`.
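
Conceptually, the last step replaces each `<dataDir>/c` directory with a copy of `<dataDir>/_temp`. The sketch below illustrates that idea for a single repo. It is not the implementation of `copyVersion` in the `pipeline` module; the function name `copyVersionSketch` is hypothetical, and the space-separated list of data directories follows the style of the configuration in this notebook.

```python
import shutil
from pathlib import Path


def copyVersionSketch(repoDir, dataDirs, src='_temp', dst='c'):
    """Replace <dataDir>/<dst> by a copy of <dataDir>/<src> for each data dir.

    dataDirs is a space-separated string such as 'source _temp tf'.
    Returns the list of data dirs that were actually copied.
    """
    copied = []
    for dataDir in dataDirs.split():
        srcPath = Path(repoDir) / dataDir / src
        dstPath = Path(repoDir) / dataDir / dst
        if not srcPath.is_dir():
            continue  # nothing to copy for this data dir
        if dstPath.exists():
            shutil.rmtree(dstPath)  # overwrite: remove the old version first
        shutil.copytree(srcPath, dstPath)
        copied.append(dataDir)
    return copied
```

Removing the destination first ensures that files which no longer exist in `_temp` do not linger in `c`.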

In [1]:
import os,sys,collections
from pipeline import runPipeline, copyVersion
from tf.fabric import Fabric

# Config

In [2]:
CORE_NAME = 'bhsa'

if 'SCRIPT' not in locals(): 
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = 'c'

# Pipeline settings

All the nitty-gritty differences between versions are stated here.

In [3]:
pipeline = dict(
    defaults = dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        LANG_FEATURE='language',
        OCC_FEATURE='g_cons',
        LEX_FEATURE='lex',
        TEXT_FEATURE='g_word_utf8',
        TRAILER_FEATURE='trailer_utf8',
        DO_VOCALIZED_LEXEME=True,
        EXTRA_OVERLAP='',
        LEX_FORMATS='@fmt:lex-trans-plain={lex0} ',
        RENAME=(
            ('g_suffix', 'trailer'),
            ('g_suffix_utf8', 'trailer_utf8'),
        ),
    ),
    versions={
        '_temp': dict(),
        'c': dict(),
        '2017': dict(),
        '2016': dict(),
        '4b': dict(
                DO_VOCALIZED_LEXEME=False,
                EXTRA_OVERLAP='gloss nametype',
                LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
            ),
        '4': dict(
                DO_VOCALIZED_LEXEME=False,
                EXTRA_OVERLAP='gloss nametype',
                LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '3': dict(
                LANG_FEATURE='language',
                OCC_FEATURE='surface_consonants',
                LEX_FEATURE='lexeme',
                TEXT_FEATURE='text',
                TRAILER_FEATURE='suffix',
                LEX_FORMATS='@fmt:lex-trans-plain={lexeme} ',
            ),
    },
    repoOrder = '''
        bhsa
        phono
        valence
        parallels
    ''',
    repoConfig = dict(
        bhsa=(
            dict(
                task='coreData',
            ),
            dict(
                task='bookNames',
                omit={},
            ),
            dict(
                task='lexicon',
                omit={'3'},
            ),
            dict(
                task='paragraphs',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='ketivQere',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='stats',
                omit={'4', '4b'},
            ),
        ),
        phono=(
            dict(
                task='phono',
                omit={'3', '4', '4b'},
            ),
        ),
        valence=(
            dict(
                task='enrich',
                omit={'3'},
            ),
            dict(
                task='flowchart',
                omit={'3'},
            ),
        ),
        parallels=(
            dict(
                task='parallels',
                omit={},
                params=dict(
                    FORCE_MATRIX=False,
                ),
            ),
        ),
    ),
    repoDataDirs = dict(
        bhsa      = 'source _temp tf shebanq',
        phono     = '_temp tf',
        valence   = 'source _temp tf shebanq',
        parallels = 'source _temp tf',
    ),
)
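
To see how a configuration of this shape can be read, here is a minimal sketch that resolves the settings for one version and lists the tasks that are not omitted for it. It is an illustration only, not the logic of `runPipeline`; the helper names `effectiveSettings` and `tasksFor` are hypothetical, and the small `demo` dictionary merely mimics the structure of `pipeline`.

```python
def effectiveSettings(config, version):
    """Return the defaults, overridden by the version-specific settings."""
    settings = dict(config['defaults'])
    settings.update(config['versions'].get(version, {}))
    return settings


def tasksFor(config, version):
    """List (repo, task) pairs, in repo order, that are not omitted."""
    result = []
    for repo in config['repoOrder'].strip().split():
        for taskInfo in config['repoConfig'].get(repo, ()):
            # A task without an `omit` entry runs for every version.
            if version not in taskInfo.get('omit', set()):
                result.append((repo, taskInfo['task']))
    return result


# A tiny stand-in with the same structure as the configuration above.
demo = dict(
    defaults=dict(LEX_FEATURE='lex', DO_VOCALIZED_LEXEME=True),
    versions={'c': dict(), '4b': dict(DO_VOCALIZED_LEXEME=False)},
    repoOrder='''
        bhsa
        phono
    ''',
    repoConfig=dict(
        bhsa=(dict(task='coreData'), dict(task='lexicon', omit={'3'})),
        phono=(dict(task='phono', omit={'3', '4', '4b'}),),
    ),
)

print(effectiveSettings(demo, '4b'))
print(tasksFor(demo, '4b'))
```

Applied to the `pipeline` dictionary of this notebook, `tasksFor(pipeline, '4b')` would skip `paragraphs`, `ketivQere`, `stats` and `phono`, because those tasks list `4b` in their `omit` sets.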

# Run the pipeline

In [5]:
good = runPipeline(pipeline, versions=['2017'], force=True)


##############################################################################################
#                                                                                            #
#     13m 20s Make version [2017]                                                            #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     13m 20s Make repo [bhsa]                                                               *
*                                                                                            *
**********************************************************************************************


---------------------------------------------

  0.30s 			feature functional_parent (str) =def= 0 : edge
  0.30s 			feature txt (str) =def=  : node
  0.30s 			feature typ (str) =def= Unkn : node
  0.31s 			feature kind (str) =def= unknown : node
  0.31s 			feature rela (str) =def= NA : node
  0.31s 			feature mother_object_type (str) =def= clause : node
  0.31s 			feature dist_unit (str) =def= clause_atoms : node
  0.31s 		otype half_verse
  0.32s 			feature label (str) =def=  : node
  0.32s 		otype verse
  0.32s 			feature verse (int) =def= 0 : node
  0.32s 			feature chapter (int) =def= 0 : node
  0.33s 			feature label (str) =def=  : node
  0.33s 			feature book (str) =def= Genesis : node
  0.33s 		otype phrase_atom
  0.33s 			feature number (int) =def= 0 : node
  0.33s 			feature dist (int) =def= 0 : node
  0.34s 			feature distributional_parent (str) =def= 0 : edge
  0.34s 			feature mother (str) =def= 0 : edge
  0.34s 			feature functional_parent (str) =def= 0 : edge
  0.34s 			feature det (str) =def= NA : node
  0.34s 			fea

   |     0.38s T mother_object_type   to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.78s T nme                  to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.80s T nu                   to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     2.31s T number               to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.63s T otype                to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.81s T pdp                  to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.86s T pfm                  to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.78s T prs                  to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.77s T prs_gn               to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.79s T prs_nu               to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.80s T prs_ps               to /Users/dirk/github/etcbc/bhsa/_temp/2017/tf
   |     0.79s T ps                   to /Users/dirk/github/etcbc

|     18m 41s vbe                       ... no changes
|     18m 41s vbs                       ... no changes
|     18m 42s verse                     ... no changes
|     18m 42s vs                        ... no changes
|     18m 42s vt                        ... no changes
|     18m 42s Done
..............................................................................................
.     18m 42s Deliver data set to /Users/dirk/github/etcbc/bhsa/tf/2017                      .
..............................................................................................
..............................................................................................
.     18m 43s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 3.0.7
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs

   |     1.58s T vt                   from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s Feature overview: 70 for nodes; 4 for edges; 1 configs; 7 computed
 1m 24s All features loaded/computed - for details use loadLog()
..............................................................................................
.     21m 24s Basic test                                                                     .
..............................................................................................
..............................................................................................
.     21m 24s First verse in all formats                                                     .
..............................................................................................
lex-trans-plain
	B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ 
text-trans-plain
	BR>CJT BR> >LHJM >T HCMJM W>T H>RY00 
text-orig-full
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
l

101 features found and 0 ignored
  0.00s loading features ...
   |     0.00s T book@am              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@ar              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@bn              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@da              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@de              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@el              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@en              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@es              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@fa              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@fr              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@he              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s T book@hi              from /Use

|     21m 49s added 9233 lexemes
|     21m 49s maxNode is now 1446635
|     21m 49s language arc has   708 lexemes in the text
|     21m 49s language hbo has  8525 lexemes in the text
..............................................................................................
.     21m 49s Collect lexeme info from the lexicon                                           .
..............................................................................................
|     21m 49s Reading lexicon ...
|     21m 49s Lexicon arc has   708 entries
|     21m 49s Lexicon hbo has  8527 entries
|     21m 49s Done
..............................................................................................
.     21m 49s Test - Match between text and lexicon                                          .
..............................................................................................
|     21m 49s 708 arc lexemes
|     21m 49s 8527 hbo lexemes
|     21m 49s Equal lex values in hbo and ar

|     22m 13s ls                        ... differences after the metadata
|     22m 14s 	line 426586 OLD --><empty><--
|     22m 14s 	line 426586 NEW -->1437412	vbcp<--
|     22m 14s 	line 426587 OLD --><empty><--
|     22m 14s 	line 426587 NEW -->1437422	quot<--
|     22m 14s 	line 426588 OLD --><empty><--
|     22m 14s 	line 426588 NEW -->1437428	ppre<--
|     22m 14s 	line 426589 OLD --><empty><--
|     22m 14s 	line 426589 NEW -->1437431	padv<--

|     22m 14s oslots                    ... differences after the metadata
|     22m 15s 	line 1010820 OLD --><empty><--
|     22m 15s 	line 1010820 NEW -->1,84,197,220,241,270,318,330,334,428,435 ...<--
|     22m 15s 	line 1010821 OLD --><empty><--
|     22m 15s 	line 1010821 NEW -->2,4662,27811,41331,48284,53077,66101,796 ...<--
|     22m 15s 	line 1010822 OLD --><empty><--
|     22m 15s 	line 1010822 NEW -->3,381,535,545,550,724,736,2126,2137,2148 ...<--
|     22m 15s 	line 1010823 OLD --><empty><--
|     22m 15s 	line 1010823 NEW -->4

|     23m 48s 	Destination /Users/dirk/github/etcbc/bhsa/tf/2017/.tf/pargr.tfx does not exist
..............................................................................................
.     23m 48s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 3.0.7
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

105 features found and 0 ignored
  0.00s loading features ...
   |     0.04s B label                from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.42s B number               from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s Feature overview: 100 for nodes; 4 for edges; 1 configs; 7 computed
  6.95s All features loaded/computed - for details use loadLog()
|    

   |     0.10s B trailer_utf8         from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.02s B label                from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s Feature overview: 102 for nodes; 4 for edges; 1 configs; 7 computed
  5.42s All features loaded/computed - for details use loadLog()
|     24m 14s Mapping between verse labels and verse nodes
|     24m 14s 23213 verses
..............................................................................................
.     24m 14s Parsing Ketiv-Qere data                                                        .
..............................................................................................
|     24m 14s 	Read 1892 ketiv-qere annotations
|     24m 14s 	Parsed 1892 ketiv-qere annotations
|     24m 14s 	All verses entries found in index
|     24m 14s 	All ketivs found in the text
|     24m 14s 	All ketivs found in the data
|     24m 14s Prepare TF ketiv qere features
|     24m 14s Update the otext feature


|     24m 20s 	Destination /Users/dirk/github/etcbc/bhsa/tf/2017/.tf/freq_occ.tfx does not exist
..............................................................................................
.     24m 20s Loading relevant features                                                      .
..............................................................................................
This is Text-Fabric 3.0.7
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

109 features found and 0 ignored
  0.00s loading features ...
   |     0.16s B g_cons               from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.16s B language             from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.28s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s Feature overview: 104 for nodes; 4 for edges; 1 config

   |     0.31s B g_word_utf8          from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.18s B lex0                 from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.32s B lex_utf8             from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s B qere                 from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s B qere_trailer         from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.13s B trailer              from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.17s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.17s B sp                   from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.16s B vs                   from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.16s B vt                   from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.14s B gn                   from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.16s B nu                   from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.15s B ps         

|     26m 02s 	Destination /Users/dirk/github/etcbc/valence/tf/2017/.tf/valence.tfx does not exist
True True
..............................................................................................
.     26m 02s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 3.0.7
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

113 features found and 0 ignored
  0.00s loading features ...
   |     0.23s B lex_utf8             from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.21s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.01s B gloss                from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.25s B sp                   from /Users/dirk/gi

|     26m 52s 	Done
|     26m 52s 	Phrases of kind C :  19302
|     26m 52s 	Phrases of kind L :  11680
|     26m 52s 	Phrases of kind I :   6018
|     26m 52s 	Total complements :  37000
|     26m 52s 	Total phrases     : 214541
..............................................................................................
.     26m 52s Checking enrichment logic                                                      .
..............................................................................................
|     26m 52s 	All 6 rules OK
..............................................................................................
.     26m 52s Generating enrichments                                                         .
..............................................................................................
|     26m 59s 	Generated enrichment values for 1380 verbs:
|     26m 59s 	Enriched values for 221366 nodes
|     26m 59s 	Overview of rule applications:
|     26m 59s gen

   |     0.86s T predication          from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.85s T grammatical          from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.38s T original             from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.56s T lexical              from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.52s T semantic             from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.37s T f_correction         from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.38s T s_manual             from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.45s T cfunction            from /Users/dirk/github/etcbc/valence/tf/2017
   |     0.00s Feature overview: 117 for nodes; 4 for edges; 1 configs; 7 computed
    10s All features loaded/computed - for details use loadLog()
Time - Time - True
Pred - Pred - True
Subj - Subj - True
Objc - Objc - True
Conj -  - True
Subj -  - True
Pred -  - True
PreC -  - True
Conj - None - False
Subj - None - False
|   

|     27m 43s 	10000 clauses
|     27m 48s 	20000 clauses
|     27m 52s 	30000 clauses
|     27m 56s 	40000 clauses
|     27m 59s 	47385 clauses
..............................................................................................
.     27m 59s Writing sense feature to TF                                                    .
..............................................................................................
   |     0.12s T sense                to /Users/dirk/github/etcbc/valence/_temp/2017/tf
..............................................................................................
.     27m 59s Check differences with previous version                                        .
..............................................................................................
|     27m 59s 	1 features to add
|     27m 59s 		sense
|     27m 59s 	no features to delete
|     27m 59s 	0 features in common
|     27m 59s Done
..................................................

116 features found and 0 ignored
  0.00s loading features ...
   |     0.07s T crossref             from /Users/dirk/github/etcbc/parallels/tf/2017
   |     0.05s T crossrefSET          from /Users/dirk/github/etcbc/parallels/tf/2017
   |     0.07s T crossrefLCS          from /Users/dirk/github/etcbc/parallels/tf/2017
   |     0.00s Feature overview: 108 for nodes; 7 for edges; 1 configs; 7 computed
  4.54s All features loaded/computed - for details use loadLog()
..............................................................................................
.     28m 20s Test: crossrefs of Genesis 10                                                  .
..............................................................................................
|     28m 20s 	Method 
|     28m 20s 		20 start verses
		Genesis 10:2
|     28m 20s 		         ----------> 1_Chronicles 1:5     confidende 100%
		Genesis 10:3
|     28m 20s 		         ----------> 1_Chronicles 1:6     confidende  95%
		Genesis 10:4

# Consolidate the continuous version
If you have run an update for the version called `_temp` and all has gone well,
you can copy the entire version (including its source and temp directories) to `c`.
This will happen for all repos in the pipeline.

In [5]:
#good = True
#good = False

if good:
    copyVersion(pipeline, '_temp', 'c')


##############################################################################################
#                                                                                            #
#     14m 11s Copy version _temp ==> c                                                       #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     14m 11s Repo bhsa                                                                      *
*                                                                                            *
**********************************************************************************************

|     14m 11s 	Copy source/_temp ==> source/c
