![pipeline](pictures/pictures.002.png)

# Text-Fabric from ETCBC

This notebook assembles the data from the ETCBC that is needed
to compile its datasets in Text-Fabric format on GitHub.
Ultimately, the data for the website [SHEBANQ](https://shebanq.ancient-data.org) will be
derived from these TF sources.

## Pipeline
This is **pipe 1** of the pipeline from ETCBC data to the website SHEBANQ.

A run of this pipe produces a data *version*.
It should be run whenever new or updated data sources are present that affect the output data.
Since all input data is delivered in GitHub repos, the usual Git machinery
takes care of versioning.

The pipe works by executing a series of programs, contained in Github repositories.
For each repository in the pipe, a series of notebooks will be executed.
See [script mode](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) for 
details on how we call notebooks.

All this is specified in the configuration below.

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in 
the Github repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `tfFromMQL` in the `programs` directory.

The result of this action will be an updated TF resource in its 
`tf/core` directory.

### Additional data

Researchers have contributed to the dataset,
but not all of that data is in the core.
It typically lives in the repository where the research was
carried out and where the data is documented.

Before the pipe starts, these repos must be pulled.
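
A minimal sketch of that preparatory step, assuming the repos are cloned side by side under one base directory and that `git` is on the path. The base directory `~/github/etcbc` is an assumption suggested by the paths in the log output of this notebook, and the helpers `pull_command` and `pull_all` are hypothetical; they are not part of the pipeline.

```python
import subprocess
from pathlib import Path

# The four repos named in the pipeline configuration of this notebook.
REPOS = ['bhsa', 'phono', 'valence', 'parallels']

# Assumed location of the local clones; adjust to your own setup.
BASE = Path.home() / 'github' / 'etcbc'


def pull_command(repo, base=BASE):
    """Build the git command that updates one local clone."""
    return ['git', '-C', str(Path(base) / repo), 'pull']


def pull_all(repos=REPOS, base=BASE):
    """Pull every repo; stop at the first failure and report success."""
    for repo in repos:
        result = subprocess.run(pull_command(repo, base))
        if result.returncode != 0:
            print(f'pull failed for {repo}')
            return False
    return True
```

Calling `pull_all()` before the pipe starts returns `False` as soon as one pull fails, so a run on stale input can be avoided.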

## Continuous version
Version `c` acts as a *continuous* version. It will be overwritten
by new snapshots of the data on a regular basis.

We support the following workflow to carry out these updates:

1. Make a new version called `_temp`. Note:
   * this name keeps the version off GitHub, because `_temp` directories are listed in `.gitignore`;
   * after the workflow has been run once, the version `_temp` already exists; this is not a problem.
2. Put a new data snapshot in the `source/_temp` directory of the `bhsa` repo, and add data to the
   `source/_temp` directories of the other repos in the pipeline where relevant.
3. Run `good = runPipeline(pipeline, versions=['_temp'], force=True)`. Note:
   * `force=True` ensures that the old data in version `_temp` is completely overwritten.
4. If all went well, run `copyVersion(pipeline, '_temp', 'c')`. This overwrites all data directories
   in version `c` with the newly created data directories in `_temp`.
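
Step 2 can be guarded by a quick check before step 3 is run. The sketch below assumes that the snapshot for a version lives at `source/<version>/bhsa.mql.bz2` inside the `bhsa` clone, which combines the file name from the Core data section with the `source/_temp` directory of step 2. That exact layout and the helper names `snapshot_path` and `snapshot_present` are assumptions, not part of the pipeline.

```python
from pathlib import Path


def snapshot_path(repo_dir, version='_temp', name='bhsa.mql.bz2'):
    """Return the assumed location of the core data snapshot for a version."""
    return Path(repo_dir) / 'source' / version / name


def snapshot_present(repo_dir, version='_temp'):
    """True if the snapshot file exists and is not empty."""
    path = snapshot_path(repo_dir, version)
    return path.is_file() and path.stat().st_size > 0
```

Checking `snapshot_present(...)` on the `bhsa` clone before calling `runPipeline` may surface a missing snapshot early, rather than as a load failure in the middle of a run.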

In [4]:
import os
import sys
import collections
from pipeline import runPipeline, copyVersion
from tf.fabric import Fabric

# Config

In [5]:
CORE_NAME = 'bhsa'

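# In script mode the caller is expected to define SCRIPT beforehand;
# in interactive use we fall back to the defaults below.
if 'SCRIPT' not in locals():
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = 'c'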

# Pipeline settings

All the fine-grained differences between versions are stated here.

In [6]:
pipeline = dict(
    defaults = dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        LANG_FEATURE='language',
        OCC_FEATURE='g_cons',
        LEX_FEATURE='lex',
        TEXT_FEATURE='g_word_utf8',
        TRAILER_FEATURE='trailer_utf8',
        DO_VOCALIZED_LEXEME=True,
        EXTRA_OVERLAP='',
        LEX_FORMATS='@fmt:lex-trans-plain={lex0} ',
        RENAME=(
            ('g_suffix', 'trailer'),
            ('g_suffix_utf8', 'trailer_utf8'),
        ),
    ),
    versions={
        '_temp': dict(),
        'c': dict(),
        '2017': dict(),
        '2016': dict(),
        '4b': dict(
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP='gloss nametype',
            LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '4': dict(
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP='gloss nametype',
            LEX_FORMATS='@fmt:lex-trans-plain={lex} ',
        ),
        '3': dict(
            LANG_FEATURE='language',
            OCC_FEATURE='surface_consonants',
            LEX_FEATURE='lexeme',
            TEXT_FEATURE='text',
            TRAILER_FEATURE='suffix',
            LEX_FORMATS='@fmt:lex-trans-plain={lexeme} ',
        ),
    },
    repoOrder = '''
        bhsa
        phono
        valence
        parallels
    ''',
    repoConfig = dict(
        bhsa=(
            dict(
                task='coreData',
            ),
            dict(
                task='bookNames',
                omit={},
            ),
            dict(
                task='lexicon',
                omit={'3'},
            ),
            dict(
                task='paragraphs',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='ketivQere',
                omit={'3', '4', '4b'},
            ),
            dict(
                task='stats',
                omit={'4', '4b'},
            ),
        ),
        phono=(
            dict(
                task='phono',
                omit={'3', '4', '4b'},
            ),
        ),
        valence=(
            dict(
                task='enrich',
                omit={'3'},
            ),
            dict(
                task='flowchart',
                omit={'3'},
            ),
        ),
        parallels=(
            dict(
                task='parallels',
                omit={},
                params=dict(
                    FORCE_MATRIX=False,
                ),
            ),
        ),
    ),
    repoDataDirs = dict(
        bhsa      = 'source _temp tf shebanq',
        phono     = '_temp tf',
        valence   = 'source _temp tf shebanq',
        parallels = 'source _temp tf',
    ),
)
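
To make the structure of this configuration concrete, the sketch below shows one plausible way to read it: the settings of a version are the `defaults` overridden by that version's entry, and a task is skipped for every version listed in its `omit` set. It works on a cut-down copy of the configuration. `settings_for` and `tasks_for` are hypothetical helpers, and the actual `runPipeline` may resolve things differently.

```python
# Cut-down copy of the configuration, enough to illustrate the lookup rules.
sample = dict(
    defaults=dict(
        OCC_FEATURE='g_cons',
        LEX_FEATURE='lex',
        DO_VOCALIZED_LEXEME=True,
    ),
    versions={
        'c': dict(),
        '4b': dict(DO_VOCALIZED_LEXEME=False),
        '3': dict(OCC_FEATURE='surface_consonants', LEX_FEATURE='lexeme'),
    },
    repoOrder='''
        bhsa
        phono
    ''',
    repoConfig=dict(
        bhsa=(
            dict(task='coreData'),
            dict(task='lexicon', omit={'3'}),
            dict(task='stats', omit={'4', '4b'}),
        ),
        phono=(
            dict(task='phono', omit={'3', '4', '4b'}),
        ),
    ),
)


def settings_for(config, version):
    """Start from the defaults, then apply the overrides of the version."""
    settings = dict(config['defaults'])
    settings.update(config['versions'][version])
    return settings


def tasks_for(config, version):
    """List (repo, task) pairs in pipeline order, skipping omitted tasks."""
    selected = []
    for repo in config['repoOrder'].split():
        for item in config['repoConfig'][repo]:
            # A task without an `omit` entry runs for every version.
            if version not in item.get('omit', set()):
                selected.append((repo, item['task']))
    return selected


print(settings_for(sample, '4b'))
print(tasks_for(sample, '3'))
```

Under this reading, version `4b` keeps the default features but switches off `DO_VOCALIZED_LEXEME`, and version `3` runs only the `coreData` and `stats` tasks of the cut-down sample.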

# Run the pipeline

In [7]:
good = runPipeline(pipeline, versions=['_temp'], force=True)


##############################################################################################
#                                                                                            #
#       0.00s Make version [_temp]                                                           #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*       0.00s Make repo [bhsa]                                                               *
*                                                                                            *
**********************************************************************************************


---------------------------------------------

  0.00s Grid feature "otype" not found in
/Users/Cody/github/etcbc/bhsa/tf/_temp/
  0.00s Grid feature "oslots" not found in
/Users/Cody/github/etcbc/bhsa/tf/_temp/


  0.00s Grid feature "otext" not found. Working without Text-API

  0.00s loading features ...


  0.00s Feature "otype" not available in
/Users/Cody/github/etcbc/bhsa/tf/_temp/
  0.00s Not all features could be loaded/computed


..............................................................................................
.       0.15s Load and compile all other TF features                                         .
..............................................................................................
  0.00s Feature overview: 0 for nodes; 0 for edges; 0 configs; 0 computed
  0.00s loading features ...


  0.00s Feature "otype" not available in
/Users/Cody/github/etcbc/bhsa/tf/_temp/
  0.00s Not all features could be loaded/computed


AttributeError: 'NoneType' object has no attribute 'makeAvailableIn'

# Consolidate the continuous version
If you have run an update into version `_temp` and all has gone well,
you can copy the entire version (including its source and temp directories) to `c`.
This happens for all repos in the pipeline.
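
What such a copy amounts to for a single repo can be sketched as follows, assuming that each name in that repo's `repoDataDirs` entry is a directory holding one subdirectory per version, as the `source/_temp ==> source/c` line in the log output of the copy step suggests. The helper `copy_version_dirs` is hypothetical and is not the implementation of `copyVersion`.

```python
import shutil
from pathlib import Path


def copy_version_dirs(repo_dir, data_dirs, src='_temp', dst='c'):
    """Replace <dir>/<dst> by a copy of <dir>/<src> for each data directory.

    Data directories without a <src> version are skipped.
    Returns the names of the data directories that were copied.
    """
    copied = []
    for name in data_dirs.split():
        source = Path(repo_dir) / name / src
        target = Path(repo_dir) / name / dst
        if not source.is_dir():
            continue
        if target.exists():
            shutil.rmtree(target)  # overwrite: remove the old version first
        shutil.copytree(source, target)
        copied.append(name)
    return copied
```

Removing the target before copying means that files present only in the old `c` do not survive, which matches the description of the copy as overwriting the data directories.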

In [5]:
#good = True
#good = False

if good:
    copyVersion(pipeline, '_temp', 'c')


##############################################################################################
#                                                                                            #
#       0.00s Copy version _temp ==> c                                                       #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*       0.00s Repo bhsa                                                                      *
*                                                                                            *
**********************************************************************************************

|       0.00s 	Copy source/_temp ==> source/c
