![pipeline](pictures/pictures.002.png)

# Text-Fabric from ETCBC

This notebook assembles the data from the ETCBC that is needed
to compile its datasets in Text-Fabric format on GitHub.
Ultimately, the data for the website [SHEBANQ](https://shebanq.ancient-data.org) will be
derived from these TF sources.

## Pipeline
This is **pipe 1** of the pipeline from ETCBC data to the website SHEBANQ.

A run of this pipe produces a data *version*.
It should be run whenever new or updated data sources are present that affect the output data.
Since all input data is delivered in a GitHub repo, we can rely on Git
to keep track of versions.

The pipe works by executing a series of programs, contained in Github repositories.
For each repository in the pipe, a series of notebooks will be executed.
See [script mode](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) for 
details on how we call notebooks.

All this is specified in the configuration below.
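
As a rough illustration of how such a configuration can be read, the sketch below walks the repositories in order and selects the tasks that apply to a given version. The keys `repoOrder`, `repoConfig`, `task` and `omit` are taken from the configuration in this notebook. The helper `tasks_for_version`, the shortened `mini_pipeline`, and the reading of `omit` as "skip this task for these versions" are assumptions made for this example, not the actual logic of `runPipeline`.

```python
# Hypothetical miniature of the configuration, not the real one.
mini_pipeline = dict(
    repoOrder='''
        bhsa
        phono
    ''',
    repoConfig=dict(
        bhsa=(
            dict(task='coreData'),
            dict(task='lexicon', omit={'3'}),
        ),
        phono=(
            dict(task='phono', omit={'3', '4', '4b'}),
        ),
    ),
)


def tasks_for_version(pipeline, version):
    """Return (repo, task) pairs that would run for `version`, in pipeline order."""
    selected = []
    for repo in pipeline['repoOrder'].strip().split():
        for item in pipeline['repoConfig'].get(repo, ()):
            # Assumption: a task without an `omit` entry runs for every version.
            if version in item.get('omit', ()):
                continue
            selected.append((repo, item['task']))
    return selected


print(tasks_for_version(mini_pipeline, 'c'))
print(tasks_for_version(mini_pipeline, '3'))
```

Under these assumptions, version `c` gets every task, while version `3` only gets the `coreData` task of `bhsa`.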

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in 
the Github repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `tfFromMQL` in the `programs` directory.

The result of this action will be an updated TF resource in its 
`tf/core` directory.
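
Before running the conversion it can be useful to check that the compressed MQL dump is readable. The following is a minimal sketch using only the Python standard library. The helper `peek_mql`, the local clone path, and the UTF-8 encoding are assumptions for illustration; the actual conversion is done by `tfFromMQL`.

```python
import bz2
from pathlib import Path


def peek_mql(path, n_lines=5):
    """Return the first `n_lines` lines of a bz2-compressed text file."""
    lines = []
    with bz2.open(path, mode='rt', encoding='utf8') as handle:
        for line in handle:
            lines.append(line.rstrip('\n'))
            if len(lines) >= n_lines:
                break
    return lines


# Assumed location of a local clone; adjust to your own setup.
source = Path('~/github/etcbc/bhsa/source/bhsa.mql.bz2').expanduser()
if source.exists():
    for line in peek_mql(source):
        print(line)
```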

### Additional data

Researchers have contributed to the dataset,
but not all of that data is in the core.
It typically resides in the repository where the research was
carried out and where the data is documented.

Before the pipe starts, these repos must be pulled.

## Continuous version
Version `c` acts as a *continuous* version. It will be overwritten
by new snapshots of the data on a regular basis.

We support the following workflow to carry out these updates:

1. make a new version called `_temp`. Note:
   * this choice of name prevents it from reaching GitHub, because `_temp` directories are in `.gitignore`;
   * after an earlier run of this workflow the version `_temp` already exists; this is not a problem;
2. put a new data snapshot in the `source/_temp` directory of the `bhsa` repo, and also add data to the
   `source/_temp` directories of the other repos in the pipeline, as far as relevant;
3. run `good = runPipeline(pipeline, versions=['_temp'], force=True)`. Note:
   * we use `force=True` here, so that the old data in version `_temp` is overwritten even if it seems up to date;
4. if all went well, run `copyVersion(pipeline, '_temp', 'c')`. This will overwrite all data directories
   in version `c` with the newly created data directories in `_temp`.
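
To make the last step more concrete, here is a minimal sketch of what copying one version over another could look like for a single repo. It assumes that every data directory of a repo holds one subdirectory per version (for example `source/_temp` and `source/c`). The function `copy_version_dirs` is hypothetical and is not the implementation of `copyVersion`.

```python
import shutil
import tempfile
from pathlib import Path


def copy_version_dirs(repo_dir, data_dirs, src_version, dst_version):
    """Replace <data_dir>/<dst_version> by a copy of <data_dir>/<src_version>."""
    copied = []
    for data_dir in data_dirs.strip().split():
        src = Path(repo_dir) / data_dir / src_version
        dst = Path(repo_dir) / data_dir / dst_version
        if not src.is_dir():
            # Nothing to copy for this data directory.
            continue
        if dst.exists():
            # Remove the old version first, so no stale files survive.
            shutil.rmtree(dst)
        shutil.copytree(src, dst)
        copied.append(data_dir)
    return copied


# Demonstration on a throwaway directory, so no real repo is touched.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / 'source' / '_temp').mkdir(parents=True)
    (Path(repo) / 'source' / '_temp' / 'data.txt').write_text('new snapshot')
    (Path(repo) / 'source' / 'c').mkdir(parents=True)
    (Path(repo) / 'source' / 'c' / 'data.txt').write_text('old snapshot')
    print(copy_version_dirs(repo, 'source tf', '_temp', 'c'))
    print((Path(repo) / 'source' / 'c' / 'data.txt').read_text())
```

Removing the destination before copying is a deliberate choice in this sketch: it guarantees that files which no longer exist in `_temp` do not linger in `c`.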

In [1]:
import os
import sys
import collections
from pipeline import runPipeline, copyVersion
from tf.fabric import Fabric

# Config

In [2]:
CORE_NAME = 'bhsa'

if 'SCRIPT' not in locals(): 
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = 'c'

# Pipeline settings

All the nitty-gritty differences between versions are stated here.
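
One plausible reading of the settings in the next cell is that each entry under `versions` overrides the corresponding entries under `defaults`. The sketch below shows such a merge on a shortened excerpt. The function `effective_settings` is hypothetical; the real merging logic lives in the `pipeline` module, which this notebook does not show.

```python
# Hypothetical excerpt of the settings, for illustration only.
defaults = dict(
    LEX_FEATURE='lex',
    DO_VOCALIZED_LEXEME=True,
    EXTRA_OVERLAP='',
)
versions = {
    'c': dict(),
    '4b': dict(DO_VOCALIZED_LEXEME=False, EXTRA_OVERLAP='gloss nametype'),
}


def effective_settings(defaults, versions, version):
    """Combine the defaults with the overrides of one version."""
    settings = dict(defaults)           # copy, so the defaults stay untouched
    settings.update(versions[version])  # version-specific values win
    return settings


print(effective_settings(defaults, versions, 'c'))
print(effective_settings(defaults, versions, '4b'))
```

With this reading, a version with an empty `dict()` simply inherits all defaults, which is why the recent versions need no entries of their own.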

In [3]:
pipeline = dict(
    defaults = dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        LANG_FEATURE='language',
        OCC_FEATURE='g_cons',
        LEX_FEATURE='lex',
        TEXT_FEATURE='g_word_utf8',
        TRAILER_FEATURE='trailer_utf8',
        DO_VOCALIZED_LEXEME=True,
        EXTRA_OVERLAP='',
        LEX_FORMATS='@fmt:lex-trans-plain={lex0} ',
        RENAME=(
            ('g_suffix', 'trailer'),
            ('g_suffix_utf8', 'trailer_utf8'),
        ),
    ),
    versions={
        '_temp': dict(),
        'c': dict(),
        '2017': dict(),
        '2016': dict(),
        '4b': dict(
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP='gloss nametype',
            LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '4': dict(
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP='gloss nametype',
            LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '3': dict(
            LANG_FEATURE='language',
            OCC_FEATURE='surface_consonants',
            LEX_FEATURE='lexeme',
            TEXT_FEATURE='text',
            TRAILER_FEATURE='suffix',
            LEX_FORMATS='@fmt:lex-trans-plain={lexeme} ',
        ),
    },
    repoOrder = '''
        bhsa
        phono
        valence
        parallels
    ''',
    repoConfig = dict(
        bhsa=(
            dict(
                task='coreData',
            ),
            dict(
                task='bookNames',
                omit={},
            ),
            dict(
                task='lexicon',
                omit={'3'},
            ),
            dict(
                task='paragraphs',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='ketivQere',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='stats',
                omit={'4', '4b'},
            ),
        ),
        phono=(
            dict(
                task='phono',
                omit={'3', '4', '4b'},
            ),
        ),
        valence=(
            dict(
                task='enrich',
                omit={'3'},
            ),
            dict(
                task='flowchart',
                omit={'3'},
            ),
        ),
        parallels=(
            dict(
                task='parallels',
                omit={},
                params=dict(
                    FORCE_MATRIX=False,
                ),
            ),
        ),
    ),
    repoDataDirs = dict(
        bhsa      = 'source _temp tf shebanq',
        phono     = '_temp tf',
        valence   = 'source _temp tf shebanq',
        parallels = 'source _temp tf',
    ),
)

# Run the pipeline

In [4]:
good = runPipeline(pipeline, versions=['_temp'], force=True)


##############################################################################################
#                                                                                            #
#       0.00s Make version [_temp]                                                           #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*       0.00s Make repo [bhsa]                                                               *
*                                                                                            *
**********************************************************************************************


---------------------------------------------

  0.25s 			feature kind (str) =def= unknown : node
  0.25s 			feature rela (str) =def= NA : node
  0.25s 			feature mother_object_type (str) =def= clause : node
  0.26s 			feature dist_unit (str) =def= clause_atoms : node
  0.26s 		otype half_verse
  0.26s 			feature label (str) =def=  : node
  0.26s 		otype verse
  0.27s 			feature verse (int) =def= 0 : node
  0.27s 			feature chapter (int) =def= 0 : node
  0.27s 			feature label (str) =def=  : node
  0.27s 			feature book (str) =def= Genesis : node
  0.27s 		otype phrase_atom
  0.27s 			feature number (int) =def= 0 : node
  0.27s 			feature dist (int) =def= 0 : node
  0.28s 			feature distributional_parent (str) =def= 0 : edge
  0.28s 			feature mother (str) =def= 0 : edge
  0.28s 			feature functional_parent (str) =def= 0 : edge
  0.28s 			feature det (str) =def= NA : node
  0.28s 			feature typ (str) =def= VP : node
  0.28s 			feature rela (str) =def= NA : node
  0.29s 			feature dist_unit (str) =def= clause_atoms : node
  0.29s 		

   |     0.73s T nme                  to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.75s T nu                   to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     2.15s T number               to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.61s T otype                to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.74s T pdp                  to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.75s T pfm                  to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.77s T prs                  to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.79s T prs_gn               to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.80s T prs_nu               to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.79s T prs_ps               to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.75s T ps                   to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.68s T qere                 to /Users/dirk/g

 1m 29s All features loaded/computed - for details use loadLog()
..............................................................................................
.      6m 07s Load and compile all other TF features                                         .
..............................................................................................
   |     0.00s Feature overview: 70 for nodes; 4 for edges; 1 configs; 7 computed
  0.00s loading features ...
   |     0.18s T code                 from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     1.76s T det                  from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     1.49s T dist                 from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     2.00s T dist_unit            from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     3.83s T distributional_parent from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.27s T domain               from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.83s T function             from /Us

75 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.00s Feature overview: 70 for nodes; 4 for edges; 1 configs; 7 computed
  4.38s All features loaded/computed - for details use loadLog()
|      7m 29s 26 book name features created
..............................................................................................
.      7m 29s Write book name features as TF                                                 .
..............................................................................................
   |     0.00s T book@am              to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@ar              to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@bn              to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@da              to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@de             

|      7m 34s ko = korean               Genesis is 창세기                  in 한국어                 
|      7m 34s la = latin                Genesis is Genesis              in Latina              
|      7m 34s nl = dutch                Genesis is Genesis              in Nederlands          
|      7m 34s pa = punjabi              Genesis is ਉਤਪਤ                 in ਪੰਜਾਬੀ              
|      7m 34s pt = portuguese           Genesis is Gênesis              in Português           
|      7m 34s ru = russian              Genesis is Бытия                in Русский             
|      7m 34s sw = swahili              Genesis is Mwanzo               in Kiswahili           
|      7m 34s syc = syriac               Genesis is ܒܪܝܬܐ                in ܠܫܢܐ ܣܘܪܝܝܐ         
|      7m 34s tr = turkish              Genesis is Yaratılış            in Türkçe              
|      7m 34s ur = urdu                 Genesis is پیدائش               in اُردُو              
|      7m 34s yo = yoruba              

..............................................................................................
.      7m 59s Various tweaks in features                                                     .
..............................................................................................
..............................................................................................
.      8m 00s Update the otype, oslots and otext features                                    .
..............................................................................................
|      8m 02s Features that have new or modified data
|      8m 02s 	gloss
|      8m 02s 	language
|      8m 02s 	lex
|      8m 02s 	lex0
|      8m 02s 	lex_utf8
|      8m 02s 	ls
|      8m 02s 	nametype
|      8m 02s 	otype
|      8m 02s 	root
|      8m 02s 	sp
|      8m 02s 	voc_lex
|      8m 02s 	voc_lex_utf8
|      8m 02s 	oslots
|      8m 02s Check voc_lex_utf8: בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֶרֶץ
|      8m

  0.00s loading features ...
   |     0.84s T otype                from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |       18s T oslots               from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     1.41s T lex0                 from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     1.56s T lex_utf8             from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |      |     1.23s C __levels__           from otype, oslots
   |      |       18s C __order__            from otype, oslots, __levels__
   |      |     0.88s C __rank__             from otype, __order__
   |      |       19s C __levUp__            from otype, oslots, __rank__
   |      |       11s C __levDown__          from otype, __levUp__, __rank__
   |      |     3.97s C __boundary__         from otype, oslots, __rank__
   |     0.00s M otext                from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |      |     0.11s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
   |     1.41

|      9m 47s 	Read 90669 paragraph annotations
|      9m 47s 	OK: All label/line entries found in index
|      9m 47s Prepare TF paragraph features
..............................................................................................
.      9m 47s write new/changed features to TF ...                                           .
..............................................................................................
   |     0.15s T instruction          to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
   |     0.15s T pargr                to /Users/dirk/github/etcbc/bhsa/_temp/_temp/tf
..............................................................................................
.      9m 47s Check differences with previous version                                        .
..............................................................................................
|      9m 47s 	2 features to add
|      9m 47s 		instruction
|      9m 47s 		pargr
|      9m 47s 	no features

|      9m 58s qere                      ... differences after the metadata
|      9m 58s 	line      2 OLD --><--
|      9m 58s 	line      2 NEW -->3897	HAJ:Y;74><--
|      9m 58s 	line      3 OLD --><--
|      9m 58s 	line      3 NEW -->4420	>@H:@LO75W<--
|      9m 58s 	line      4 OLD --><--
|      9m 58s 	line      4 NEW -->5645	>@H:@LO92W<--
|      9m 58s 	line      5 OLD --><--
|      9m 58s 	line      5 NEW -->5912	>@95H:@LOW03<--

|      9m 58s qere_utf8                 ... differences after the metadata
|      9m 58s 	line      2 OLD --><--
|      9m 58s 	line      2 NEW -->3897	הַיְצֵ֣א<--
|      9m 58s 	line      3 OLD --><--
|      9m 58s 	line      3 NEW -->4420	אָהֳלֹֽו<--
|      9m 58s 	line      4 OLD --><--
|      9m 58s 	line      4 NEW -->5645	אָהֳלֹ֑ו<--
|      9m 58s 	line      5 OLD --><--
|      9m 58s 	line      5 NEW -->5912	אָֽהֳלֹו֙<--

|      9m 58s Done
..............................................................................................
.      9m 58

113 features found and 0 ignored
  0.00s loading features ...
   |     0.16s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.90s T freq_occ             from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.88s T freq_lex             from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.85s T rank_occ             from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.89s T rank_lex             from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.00s Feature overview: 108 for nodes; 4 for edges; 1 configs; 7 computed
  9.26s All features loaded/computed - for details use loadLog()
..............................................................................................
.     10m 28s Basic test                                                                     .
..............................................................................................
|     10m 28s Top 10 freqent lexemes (computed on otype=word)
|     10m 29s W           50272x


|     11m 03s 	 8000 verses 113745 528 18 287
|     11m 05s 	 9000 verses 129599 572 20 327
|     11m 07s 	10000 verses 146214 623 20 438
|     11m 09s 	11000 verses 159806 748 20 487
|     11m 11s 	12000 verses 174189 890 24 524
|     11m 13s 	13000 verses 190552 1017 28 576
|     11m 15s 	14000 verses 205101 1167 32 622
|     11m 17s 	15000 verses 218607 1289 33 728
|     11m 18s 	16000 verses 227941 1335 39 777
|     11m 19s 	17000 verses 235631 1378 48 827
|     11m 20s 	18000 verses 243254 1395 51 866
|     11m 21s 	19000 verses 250705 1428 59 906
|     11m 22s 	20000 verses 260114 1469 60 960
|     11m 24s 	21000 verses 275079 1532 63 979
|     11m 26s 	22000 verses 286438 1589 65 1007
|     11m 28s 	23000 verses 301296 1644 66 1075
|     11m 28s 	23213 verses done 304794 1649 66 1081
|     11m 28s 	  270185 accents
|     11m 28s 	    9015 cleanup
|     11m 28s 	   45235 dagesh_forte
|     11m 28s 	   21511 dagesh_forte_lene
|     11m 28s 	   59612 dagesh_lene
|     11m 28s 	   1

|     11m 43s 	Destination /Users/dirk/github/etcbc/valence/tf/_temp/.tf/valence.tfx exists
|     11m 43s NOTE: repo seems up to date. Will be run because of "force=True"
True True
..............................................................................................
.     11m 43s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 3.0.6
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

113 features found and 0 ignored
  0.00s loading features ...
   |     0.22s B lex_utf8             from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.17s B lex                  from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.01s B gloss                from /Users/dirk/github/et

|     12m 30s 	Done
|     12m 30s 	Phrases of kind C :  19302
|     12m 30s 	Phrases of kind L :  11680
|     12m 30s 	Phrases of kind I :   6018
|     12m 30s 	Total complements :  37000
|     12m 30s 	Total phrases     : 214541
..............................................................................................
.     12m 30s Checking enrichment logic                                                      .
..............................................................................................
|     12m 30s 	All 6 rules OK
..............................................................................................
.     12m 30s Generating enrichments                                                         .
..............................................................................................
|     12m 37s 	Generated enrichment values for 1380 verbs:
|     12m 37s 	Enriched values for 221366 nodes
|     12m 37s 	Overview of rule applications:
|     12m 37s gen

|     12m 47s valence                   ... differences after the metadata
|     12m 48s 	line      2 OLD -->427558	complement<--
|     12m 48s 	line      2 NEW -->427561	complement<--
|     12m 48s 	line      3 OLD -->427584	complement<--
|     12m 48s 	line      3 NEW -->427587	complement<--
|     12m 48s 	line      4 OLD -->427596	complement<--
|     12m 48s 	line      4 NEW -->427599	complement<--
|     12m 48s 	line      5 OLD -->427602	NA<--
|     12m 48s 	line      5 NEW -->427605	NA<--

|     12m 48s Done
..............................................................................................
.     12m 48s Deliver features to /Users/dirk/github/etcbc/valence/tf/_temp                  .
..............................................................................................
|     12m 48s 	valence
|     12m 48s 	predication
|     12m 48s 	grammatical
|     12m 48s 	original
|     12m 48s 	lexical
|     12m 48s 	semantic
|     12m 48s 	f_correction
|     12m 48s 	s_man

|     13m 16s 	Done, 47575 clauses with relevant constituents
..............................................................................................
.     13m 16s Counting constituents                                                          .
..............................................................................................
|     13m 17s 	22384 clauses with  1 dos        constituents
|     13m 17s 	25191 clauses with  0 dos        constituents
|     13m 17s 	22384 clauses with  a dos        constituent
|     13m 17s 	 3554 clauses with  1 pdos       constituents
|     13m 17s 	44021 clauses with  0 pdos       constituents
|     13m 17s 	 3554 clauses with  a pdos       constituent
|     13m 17s 	  989 clauses with  1 ndos       constituents
|     13m 17s 	46586 clauses with  0 ndos       constituents
|     13m 17s 	  989 clauses with  a ndos       constituent
|     13m 17s 	  111 clauses with  1 kdos       constituents
|     13m 17s 	47464 clauses with  0 kdos     

|     13m 48s 	Destination /Users/dirk/github/etcbc/parallels/tf/_temp/.tf/crossref.tfx exists
|     13m 48s NOTE: repo seems up to date. Will be run because of "force=True"
..............................................................................................
.     13m 48s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 3.0.6
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

113 features found and 0 ignored
  0.00s loading features ...
   |     0.03s B otype                from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.01s B book                 from /Users/dirk/github/etcbc/bhsa/tf/_temp
   |     0.01s B chapter              from /Users/dirk/github/etcbc/bhs

# Consolidate the continuous version
If you have run an update of the version called `_temp` and all has gone well,
you can copy the entire version (including its source and temp directories) over to `c`.
This happens for all repos in the pipeline.

In [5]:
#good = True
#good = False

if good:
    copyVersion(pipeline, '_temp', 'c')


##############################################################################################
#                                                                                            #
#     14m 11s Copy version _temp ==> c                                                       #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     14m 11s Repo bhsa                                                                      *
*                                                                                            *
**********************************************************************************************

|     14m 11s 	Copy source/_temp ==> source/c
