![pipeline](pictures/pictures.002.png)

# Text-Fabric from ETCBC

This notebook assembles the data from the ETCBC that is needed
to compile its datasets in Text-Fabric format on Github.
Ultimately, the data for the website [SHEBANQ](https://shebanq.ancient-data.org) will be
derived from these TF sources.

## Pipeline
This is **pipe 1** of the pipeline from ETCBC data to the website SHEBANQ.

A run of this pipe produces a data *version*.
It should be run whenever there are new or updated data sources present that affect the output data.
Since all input data is delivered in a Github repo, we have excellent machinery to 
work with versioning.

The pipe works by executing a series of programs, contained in Github repositories.
For each repository in the pipe, a series of notebooks will be executed.
See [script mode](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) for 
details on how we call notebooks.

All this is specified in the configuration below.

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in 
the Github repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `tfFromMQL` in the `programs` directory.

The result of this action will be an updated TF resource in its 
`tf/core` directory.
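To check a freshly delivered source file before running the conversion, it can help to look at its first few lines without unpacking it. The sketch below is not part of `tfFromMQL`; the helper name `peek_compressed`, the clone location, the `source/c` subdirectory and the UTF-8 encoding are all assumptions made for illustration.

```python
import bz2
from itertools import islice
from pathlib import Path


def peek_compressed(path, n_lines=5):
    """Return the first n_lines lines of a bz2-compressed text file."""
    # Assumption: the file is UTF-8 encoded text.
    with bz2.open(path, mode='rt', encoding='utf8') as fh:
        return [line.rstrip('\n') for line in islice(fh, n_lines)]


# Hypothetical location of the core data in a local clone of the bhsa repo.
mql_path = Path('~/github/etcbc/bhsa/source/c/bhsa.mql.bz2').expanduser()

if mql_path.exists():
    for line in peek_compressed(mql_path):
        print(line)
```

Because the file is read as a stream, only the requested lines are decompressed.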

### Additional data

Researchers have contributed to the dataset,
but not all of that data is in the core.
The additional data typically lives in the repository where the research was
carried out and where the data is documented.

Before the pipe starts, these repos must be pulled.

## Continuous version
Version `c` acts as a *continuous* version. It will be overwritten
by new snapshots of the data on a regular basis.

We support the following workflow to carry out these updates:

1. make a new version called `_temp`. Note:
   * this choice of name prevents it from reaching Github, because `_temp` directories are in `.gitignore`;
   * once this workflow has been run, the version `_temp` already exists; this is not a problem;
2. put a new data snapshot in the `source/_temp` directory of the `bhsa` repo, and also add data to the
   `source/_temp` directories of the other repos in the pipeline, as far as relevant;
3. run `good = runPipeline(pipeline, versions=['_temp'], force=True)`. Note:
   * we use `force=True` here, because then the old data in version `_temp` will be thoroughly overwritten;
4. if all went well, run `copyVersion(pipeline, '_temp', 'c')`. This will overwrite all data directories
   in version `c` with the newly created data directories in `_temp`.

In [1]:
import os, sys, collections
from pipeline import runPipeline, copyVersion
from tf.fabric import Fabric

# Config

In [2]:
CORE_NAME = 'bhsa'

if 'SCRIPT' not in locals(): 
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = 'c'

# Pipeline settings

Here all the nitty-gritty differences between versions are stated.

In [3]:
pipeline = dict(
    defaults = dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        LANG_FEATURE='language',
        OCC_FEATURE='g_cons',
        LEX_FEATURE='lex',
        TEXT_FEATURE='g_word_utf8',
        TRAILER_FEATURE='trailer_utf8',
        DO_VOCALIZED_LEXEME=True,
        EXTRA_OVERLAP='',
        LEX_FORMATS='@fmt:lex-trans-plain={lex0} ',
        RENAME=(
            ('g_suffix', 'trailer'),
            ('g_suffix_utf8', 'trailer_utf8'),
        ),
    ),
    versions={
        '_temp': dict(),
        'c': dict(),
        '2017': dict(),
        '2016': dict(),
        '4b': dict(
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP='gloss nametype',
            LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '4': dict(
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP='gloss nametype',
            LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '3': dict(
            LANG_FEATURE='language',
            OCC_FEATURE='surface_consonants',
            LEX_FEATURE='lexeme',
            TEXT_FEATURE='text',
            TRAILER_FEATURE='suffix',
            LEX_FORMAT='@fmt:lex-trans-plain={lexeme} ',
        ),
    },
    repoOrder = '''
        bhsa
        phono
        valence
        parallels
    ''',
    repoConfig = dict(
        bhsa=(
            dict(
                task='coreData',
            ),
            dict(
                task='bookNames',
                omit={},
            ),
            dict(
                task='lexicon',
                omit={'3'},
            ),
            dict(
                task='paragraphs',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='ketivQere',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='stats',
                omit={'4', '4b'},
            ),
        ),
        phono=(
            dict(
                task='phono',
                omit={'3', '4', '4b'},
            ),
        ),
        valence=(
            dict(
                task='enrich',
                omit={'3'},
            ),
            dict(
                task='flowchart',
                omit={'3'},
            ),
        ),
        parallels=(
            dict(
                task='parallels',
                omit={},
                params=dict(
                    FORCE_MATRIX=False,
                ),
            ),
        ),
    ),
    repoDataDirs = dict(
        bhsa      = 'source _temp tf shebanq',
        phono     = '_temp tf',
        valence   = 'source _temp tf shebanq',
        parallels = 'source _temp tf',
    ),
)
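One plausible reading of this configuration is that the entry under `versions` overrides the `defaults` for that version, and that a task is skipped for every version listed in its `omit`. The sketch below illustrates that reading on a cut-down dictionary. The helpers `resolve_params` and `tasks_for` and the `demo` data are hypothetical, and the actual `runPipeline` may resolve settings differently.

```python
def resolve_params(pipeline, version):
    """Merge the defaults with the overrides of one version (overrides win)."""
    overrides = pipeline['versions'][version]  # KeyError for an unknown version
    return {**pipeline['defaults'], **overrides}


def tasks_for(pipeline, version):
    """List (repo, task) pairs in repoOrder, skipping tasks that omit this version."""
    selected = []
    for repo in pipeline['repoOrder'].split():
        for task_cfg in pipeline['repoConfig'].get(repo, ()):
            if version in task_cfg.get('omit', ()):
                continue
            selected.append((repo, task_cfg['task']))
    return selected


# A cut-down, illustrative configuration with the same shape as the one above.
demo = dict(
    defaults=dict(LEX_FEATURE='lex', DO_VOCALIZED_LEXEME=True),
    versions={
        'c': dict(),
        '4b': dict(DO_VOCALIZED_LEXEME=False),
        '3': dict(LEX_FEATURE='lexeme'),
    },
    repoOrder='''
        bhsa
        phono
    ''',
    repoConfig=dict(
        bhsa=(dict(task='coreData'), dict(task='lexicon', omit={'3'})),
        phono=(dict(task='phono', omit={'3', '4b'}),),
    ),
)

print(resolve_params(demo, '4b'))  # {'LEX_FEATURE': 'lex', 'DO_VOCALIZED_LEXEME': False}
print(tasks_for(demo, '3'))        # [('bhsa', 'coreData')]
```

Under this reading, an empty `dict()` for a version means "use the defaults unchanged", and an empty `omit` means "run for every version".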

# Run the pipeline

In [6]:
good = runPipeline(pipeline, versions=['_temp'], force=True)


##############################################################################################
#                                                                                            #
#     12m 04s Make version [_temp]                                                           #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     12m 04s Make repo [bhsa]                                                               *
*                                                                                            *
**********************************************************************************************


---------------------------------------------

  0.14s 			feature kind (str) =def= unknown : node
  0.14s 			feature rela (str) =def= NA : node
  0.15s 			feature mother_object_type (str) =def= clause : node
  0.15s 			feature dist_unit (str) =def= clause_atoms : node
  0.15s 		otype half_verse
  0.15s 			feature label (str) =def=  : node
  0.15s 		otype verse
  0.15s 			feature verse (int) =def= 0 : node
  0.15s 			feature chapter (int) =def= 0 : node
  0.15s 			feature label (str) =def=  : node
  0.15s 			feature book (str) =def= Genesis : node
  0.15s 		otype phrase_atom
  0.16s 			feature number (int) =def= 0 : node
  0.16s 			feature dist (int) =def= 0 : node
  0.16s 			feature distributional_parent (str) =def= 0 : edge
  0.16s 			feature mother (str) =def= 0 : edge
  0.16s 			feature functional_parent (str) =def= 0 : edge
  0.16s 			feature det (str) =def= NA : node
  0.16s 			feature typ (str) =def= VP : node
  0.16s 			feature rela (str) =def= NA : node
  0.17s 			feature dist_unit (str) =def= clause_atoms : node
  0.17s 		

   |     0.71s T nu                   to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     2.03s T number               to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.57s T otype                to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.69s T pdp                  to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.73s T pfm                  to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.74s T prs                  to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.70s T prs_gn               to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.72s T prs_nu               to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.73s T prs_ps               to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.70s T ps                   to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.65s T qere                 to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.60s T qere_utf8            to /Users/Cody/g

..............................................................................................
.     17m 25s Load and compile all other TF features                                         .
..............................................................................................
   |     0.00s Feature overview: 70 for nodes; 4 for edges; 1 configs; 7 computed
  0.00s loading features ...
   |     0.19s T code                 from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     1.64s T det                  from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     1.46s T dist                 from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     1.94s T dist_unit            from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     3.81s T distributional_parent from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.25s T domain               from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.80s T function             from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     5.77s T functional_p

75 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.00s Feature overview: 70 for nodes; 4 for edges; 1 configs; 7 computed
  5.07s All features loaded/computed - for details use loadLog()
|     18m 43s 26 book name features created
..............................................................................................
.     18m 43s Write book name features as TF                                                 .
..............................................................................................
   |     0.00s T book@am              to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@ar              to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@bn              to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@da              to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.00s T book@de             

|     18m 50s ko = korean               Genesis is 창세기                  in 한국어                 
|     18m 50s la = latin                Genesis is Genesis              in Latina              
|     18m 50s nl = dutch                Genesis is Genesis              in Nederlands          
|     18m 50s pa = punjabi              Genesis is ਉਤਪਤ                 in ਪੰਜਾਬੀ              
|     18m 50s pt = portuguese           Genesis is Gênesis              in Português           
|     18m 50s ru = russian              Genesis is Бытия                in Русский             
|     18m 50s sw = swahili              Genesis is Mwanzo               in Kiswahili           
|     18m 50s syc = syriac               Genesis is ܒܪܝܬܐ                in ܠܫܢܐ ܣܘܪܝܝܐ         
|     18m 50s tr = turkish              Genesis is Yaratılış            in Türkçe              
|     18m 50s ur = urdu                 Genesis is پیدائش               in اُردُو              
|     18m 50s yo = yoruba              

..............................................................................................
.     19m 12s Various tweaks in features                                                     .
..............................................................................................
..............................................................................................
.     19m 13s Update the otype, oslots and otext features                                    .
..............................................................................................
|     19m 15s Features that have new or modified data
|     19m 15s 	gloss
|     19m 15s 	language
|     19m 15s 	lex
|     19m 15s 	lex0
|     19m 15s 	lex_utf8
|     19m 15s 	ls
|     19m 15s 	nametype
|     19m 15s 	otype
|     19m 15s 	root
|     19m 15s 	sp
|     19m 15s 	voc_lex
|     19m 15s 	voc_lex_utf8
|     19m 15s 	oslots
|     19m 15s Check voc_lex_utf8: בְּ רֵאשִׁית ברא אֱלֹהִים אֵת הַ שָׁמַיִם וְ אֶרֶץ
|     19m

  0.00s loading features ...
   |     0.81s T otype                from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |       20s T oslots               from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     1.29s T lex0                 from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     1.54s T lex_utf8             from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |      |     1.17s C __levels__           from otype, oslots
   |      |       17s C __order__            from otype, oslots, __levels__
   |      |     1.06s C __rank__             from otype, __order__
   |      |       18s C __levUp__            from otype, oslots, __rank__
   |      |       12s C __levDown__          from otype, __levUp__, __rank__
   |      |     3.87s C __boundary__         from otype, oslots, __rank__
   |     0.00s M otext                from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |      |     0.15s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
   |     0.04

|     21m 00s 	Read 90669 paragraph annotations
|     21m 00s 	OK: All label/line entries found in index
|     21m 00s Prepare TF paragraph features
..............................................................................................
.     21m 00s write new/changed features to TF ...                                           .
..............................................................................................
   |     0.13s T instruction          to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
   |     0.14s T pargr                to /Users/Cody/github/etcbc/bhsa/_temp/_temp/tf
..............................................................................................
.     21m 01s Check differences with previous version                                        .
..............................................................................................
|     21m 01s 	2 features to add
|     21m 01s 		instruction
|     21m 01s 		pargr
|     21m 01s 	no features

|     21m 15s 	qere_trailer_utf8
|     21m 15s 	otext
|     21m 15s 	qere_trailer
..............................................................................................
.     21m 15s Load and compile the new TF features                                           .
..............................................................................................
This is Text-Fabric 3.0.3
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

109 features found and 0 ignored
  0.00s loading features ...
   |     0.22s B g_word               from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.29s B g_word_utf8          from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.11s B trailer              from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.12s B trailer_utf8         from /Users/Cody/github/etcbc/bhsa/tf/_temp

|     21m 50s W           50272x
|     21m 50s H           30386x
|     21m 50s L           20069x
|     21m 50s B           15542x
|     21m 50s >T          10997x
|     21m 50s MN           7562x
|     21m 50s JHWH/        6828x
|     21m 50s <L           5766x
|     21m 50s >L           5517x
|     21m 50s >CR          5500x
..............................................................................................
.     21m 50s Top 10 freqent lexemes (computed on otype=lex)                                 .
..............................................................................................
|     21m 50s W           50272x
|     21m 50s H           30386x
|     21m 50s L           20069x
|     21m 50s B           15542x
|     21m 50s >T          10997x
|     21m 50s MN           7562x
|     21m 50s JHWH/        6828x
|     21m 50s <L           5766x
|     21m 50s >L           5517x
|     21m 50s >CR          5500x
|     21m 50s 	INFO: Same lexeme frequencies computed b

|     22m 43s 	23213 lines
|     22m 43s 	OK: phono text and word info are CONSISTENT
..............................................................................................
.     22m 43s Writing TF phono features                                                      .
..............................................................................................
   |     0.73s T phono                to /Users/Cody/github/etcbc/phono/_temp/_temp/tf
   |     0.62s T phono_trailer        to /Users/Cody/github/etcbc/phono/_temp/_temp/tf
   |     0.00s M otext@phono          to /Users/Cody/github/etcbc/phono/_temp/_temp/tf
..............................................................................................
.     22m 44s Check differences with previous version                                        .
..............................................................................................
|     22m 44s 	2 features to add
|     22m 44s 		phono
|     22m 44s 		phono_traile

   |     0.00s B nametype             from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.18s B ls                   from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.10s B function             from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.38s B rela                 from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.31s B typ                  from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |       27s B mother               from /Users/Cody/github/etcbc/bhsa/tf/_temp
   |     0.00s Feature overview: 108 for nodes; 4 for edges; 1 configs; 7 computed
    37s All features loaded/computed - for details use loadLog()
..............................................................................................
.     23m 33s Finding occurrences ...                                                        .
..............................................................................................
|     23m 35s 	Done
|     23m 35s 	All:      1380 verbs with  73710 verb oc

|     23m 44s 	Done
|     23m 44s 	Phrases of kind C :  19302
|     23m 44s 	Phrases of kind L :  11680
|     23m 44s 	Phrases of kind I :   6018
|     23m 44s 	Total complements :  37000
|     23m 44s 	Total phrases     : 214541
..............................................................................................
.     23m 44s Checking enrichment logic                                                      .
..............................................................................................
|     23m 44s 	All 6 rules OK
..............................................................................................
.     23m 44s Generating enrichments                                                         .
..............................................................................................
|     23m 50s 	Generated enrichment values for 1380 verbs:
|     23m 50s 	Enriched values for 221366 nodes
|     23m 50s 	Overview of rule applications:
|     23m 50s gen

   |     0.74s T valence              from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.76s T predication          from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.84s T grammatical          from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.37s T original             from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.49s T lexical              from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.48s T semantic             from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.37s T f_correction         from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.35s T s_manual             from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.45s T cfunction            from /Users/Cody/github/etcbc/valence/tf/_temp
   |     0.00s Feature overview: 117 for nodes; 4 for edges; 1 configs; 7 computed
    12s All features loaded/computed - for details use loadLog()
Time - Time - True
Pred - Pred - True
Subj - Subj - True
Objc - Objc - True
Conj -  - T

|     24m 34s 	10000 clauses
|     24m 37s 	20000 clauses
|     24m 40s 	30000 clauses
|     24m 43s 	40000 clauses
|     24m 45s 	47385 clauses
..............................................................................................
.     24m 45s Writing sense feature to TF                                                    .
..............................................................................................
   |     0.09s T sense                to /Users/Cody/github/etcbc/valence/_temp/_temp/tf
..............................................................................................
.     24m 45s Check differences with previous version                                        .
..............................................................................................
|     24m 45s 	1 features to add
|     24m 45s 		sense
|     24m 45s 	no features to delete
|     24m 45s 	0 features in common
|     24m 45s Done
.................................................

    35s SIMILARITY (O verse SET M>50): Computed    16 M comparisons and saved 376 entries in matrix
    40s SIMILARITY (O verse SET M>50): Computed    18 M comparisons and saved 389 entries in matrix
    44s SIMILARITY (O verse SET M>50): Computed    21 M comparisons and saved 544 entries in matrix
    48s SIMILARITY (O verse SET M>50): Computed    24 M comparisons and saved 575 entries in matrix
    53s SIMILARITY (O verse SET M>50): Computed    26 M comparisons and saved 613 entries in matrix
    57s SIMILARITY (O verse SET M>50): Computed    29 M comparisons and saved 636 entries in matrix
 1m 02s SIMILARITY (O verse SET M>50): Computed    32 M comparisons and saved 666 entries in matrix
 1m 06s SIMILARITY (O verse SET M>50): Computed    35 M comparisons and saved 684 entries in matrix
 1m 11s SIMILARITY (O verse SET M>50): Computed    37 M comparisons and saved 1101 entries in matrix
 1m 15s SIMILARITY (O verse SET M>50): Computed    40 M comparisons and saved 1318 entries in matri

 6m 38s SIMILARITY (O verse SET M>50): Computed   234 M comparisons and saved 22979 entries in matrix
 6m 42s SIMILARITY (O verse SET M>50): Computed   237 M comparisons and saved 23043 entries in matrix
 6m 46s SIMILARITY (O verse SET M>50): Computed   239 M comparisons and saved 23373 entries in matrix
 6m 50s SIMILARITY (O verse SET M>50): Computed   242 M comparisons and saved 23590 entries in matrix
 6m 53s SIMILARITY (O verse SET M>50): Computed   245 M comparisons and saved 23770 entries in matrix
 6m 57s SIMILARITY (O verse SET M>50): Computed   247 M comparisons and saved 23807 entries in matrix
 7m 01s SIMILARITY (O verse SET M>50): Computed   250 M comparisons and saved 23901 entries in matrix
 7m 05s SIMILARITY (O verse SET M>50): Computed   253 M comparisons and saved 24012 entries in matrix
 7m 09s SIMILARITY (O verse SET M>50): Computed   255 M comparisons and saved 24091 entries in matrix
 7m 12s SIMILARITY (O verse SET M>50): Computed   258 M comparisons and saved 2413

19m 54s SIMILARITY (O verse LCS M>60): Computed   169 M comparisons and saved 70082 entries in matrix
20m 07s SIMILARITY (O verse LCS M>60): Computed   172 M comparisons and saved 71976 entries in matrix
20m 19s SIMILARITY (O verse LCS M>60): Computed   175 M comparisons and saved 73162 entries in matrix
20m 32s SIMILARITY (O verse LCS M>60): Computed   177 M comparisons and saved 73911 entries in matrix
20m 44s SIMILARITY (O verse LCS M>60): Computed   180 M comparisons and saved 75103 entries in matrix
20m 57s SIMILARITY (O verse LCS M>60): Computed   183 M comparisons and saved 76293 entries in matrix
21m 09s SIMILARITY (O verse LCS M>60): Computed   185 M comparisons and saved 76923 entries in matrix
21m 19s SIMILARITY (O verse LCS M>60): Computed   188 M comparisons and saved 77786 entries in matrix
21m 29s SIMILARITY (O verse LCS M>60): Computed   191 M comparisons and saved 78268 entries in matrix
21m 40s SIMILARITY (O verse LCS M>60): Computed   193 M comparisons and saved 7879

# Consolidate the continuous version
If you have run an update into the version called `_temp`, and all has gone well,
you can copy the entire version (including its source and temp directories) over to `c`.
This will happen for all repos in the pipeline.
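As a rough illustration of what such a copy involves for one repo, the sketch below replaces each `<dataDir>/<dst>` directory with a fresh copy of `<dataDir>/<src>`. The function `copy_version_dirs` is hypothetical and is not the `copyVersion` from the `pipeline` module. The directory layout is inferred from the `repoDataDirs` setting and the paths in the log output, so treat it as an assumption.

```python
import shutil
from pathlib import Path


def copy_version_dirs(repo_root, data_dirs, src, dst):
    """Replace <repo_root>/<d>/<dst> by a copy of <repo_root>/<d>/<src> for each data dir d.

    data_dirs is a whitespace-separated string, like the values in repoDataDirs.
    Data dirs that have no <src> subdirectory are skipped.
    Returns a list of messages describing what was copied.
    """
    copied = []
    for d in data_dirs.split():
        src_dir = Path(repo_root) / d / src
        dst_dir = Path(repo_root) / d / dst
        if not src_dir.is_dir():
            continue
        if dst_dir.exists():
            # Remove the old version first, so that no stale files survive.
            shutil.rmtree(dst_dir)
        shutil.copytree(src_dir, dst_dir)
        copied.append(f'{d}/{src} ==> {d}/{dst}')
    return copied
```

Removing the destination before copying is a deliberate choice here: it ensures that features deleted in `_temp` do not linger in `c`.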

In [7]:
#good = True
#good = False

if good:
    copyVersion(pipeline, '_temp', 'c')


##############################################################################################
#                                                                                            #
#     54m 50s Copy version _temp ==> c                                                       #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     54m 50s Repo bhsa                                                                      *
*                                                                                            *
**********************************************************************************************

|     54m 50s 	Copy source/_temp ==> source/c
