![pipeline](pictures/pictures.002.png)

# Text-Fabric from ETCBC

This notebook assembles the ETCBC data that is needed
to compile its datasets in Text-Fabric format on GitHub.
Ultimately, the data for the website [SHEBANQ](https://shebanq.ancient-data.org) will be
derived from these Text-Fabric sources.

## Pipeline
This is **pipe 1** of the pipeline from ETCBC data to the website SHEBANQ.

A run of this pipe produces a data *version*.
Run it whenever new or updated data sources are present that affect the output data.
Since all input data is delivered in GitHub repos, we can rely on Git
to keep track of versions.

The pipe works by executing a series of programs, contained in GitHub repositories.
For each repository in the pipe, a series of notebooks will be executed.
See [script mode](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) for
details on how we call notebooks.

All this is specified in the configuration below.
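
As a rough illustration of the order in which the pipe visits repositories and tasks, here is a minimal sketch. It uses a reduced configuration with the same shape as the one in this notebook. The helper `plan_tasks` is hypothetical and is not part of the `pipeline` module; it also assumes that `omit` lists the versions for which a task is skipped.

```python
# Reduced, illustrative configuration with the same shape as the notebook's
# `pipeline` dict; only two repos are listed to keep the example short.
config = dict(
    repoOrder="""
        bhsa
        phono
    """,
    repoConfig=dict(
        bhsa=(
            dict(task="coreData"),
            dict(task="lexicon", omit={"3"}),
            dict(task="paragraphs", omit={"3", "4", "4b"}),
        ),
        phono=(
            dict(task="phono", omit={"3", "4", "4b"}),
        ),
    ),
)


def plan_tasks(config, version):
    """Hypothetical helper: list the (repo, task) pairs to run for one version.

    Assumption: a task is skipped when the version occurs in its `omit` set.
    """
    plan = []
    for repo in config["repoOrder"].split():
        for entry in config["repoConfig"].get(repo, ()):
            if version in entry.get("omit", set()):
                continue
            plan.append((repo, entry["task"]))
    return plan


print(plan_tasks(config, "2021"))
print(plan_tasks(config, "4b"))
```

Under these assumptions, version `2021` gets all four tasks, while version `4b` only gets `coreData` and `lexicon` in `bhsa`.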

### Core data

The core data is delivered by the ETCBC as `bhsa.mql.bz2` in
the GitHub repo [bhsa](https://github.com/ETCBC/bhsa) in directory `source`.

This data will be converted by `coreData` in the `programs` directory.

The result of this action will be an updated Text-Fabric resource in its
`tf/core` directory.
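
To check that a delivered source file is readable before running the pipe, one could peek at its first lines. The sketch below only inspects the compressed file and does not perform the conversion done by `coreData`. The helper `peek_mql` is hypothetical, and the local path (a clone under `~/github/etcbc` with a version subdirectory inside `source`) is an assumption.

```python
import bz2
from pathlib import Path


def peek_mql(path, n_lines=5):
    """Hypothetical helper: return the first n_lines of a bz2-compressed text file."""
    lines = []
    with bz2.open(path, mode="rt", encoding="utf8") as fh:
        for line in fh:
            lines.append(line.rstrip("\n"))
            if len(lines) >= n_lines:
                break
    return lines


# Assumed location of the core data; adjust to your own clone and version.
mql_path = Path("~/github/etcbc/bhsa/source/2021/bhsa.mql.bz2").expanduser()

if mql_path.exists():
    for line in peek_mql(mql_path):
        print(line)
else:
    print(f"No file found at {mql_path}")
```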

### Additional data

Researchers have contributed to the dataset,
but not all of that data is in the core.
The additional data typically lives in the repository where the research was
carried out and where the data is documented.

Before the pipe starts, these repos must be pulled.
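
A minimal sketch of pulling all repos in one go, assuming they are cloned side by side under `~/github/etcbc` (that location is an assumption; the repo names are those of the pipeline configuration). The helpers `pull_commands` and `pull_all` are hypothetical. By default the sketch only prints the `git` commands; pass `dry_run=False` to execute them.

```python
import subprocess
from pathlib import Path

# Repo names as listed in the pipeline configuration of this notebook.
REPOS = ["bhsa", "phono", "valence", "parallels", "bridging", "trees"]

# Assumption: all repos are cloned side by side in this directory.
BASE = Path("~/github/etcbc").expanduser()


def pull_commands(repos=REPOS, base=BASE):
    """Hypothetical helper: build one `git pull` command per repo."""
    return [["git", "-C", str(base / repo), "pull", "--ff-only"] for repo in repos]


def pull_all(dry_run=True):
    """Print the pull commands, or run them when dry_run is False."""
    for cmd in pull_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first repo that fails to pull
            subprocess.run(cmd, check=True)


pull_all(dry_run=True)
```

The `--ff-only` flag makes a pull fail instead of creating a merge commit when the local clone has diverged, which is a cautious choice for clones that are only meant to receive data.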

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pipeline import runPipeline, copyVersion

# Config

In [3]:
CORE_NAME = "bhsa"

if "SCRIPT" not in locals():
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = "2021"
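
The guard on `SCRIPT` lets the same cell run interactively and from a calling program that has already defined `SCRIPT` and related names. How the pipeline actually passes these names in is described in the README linked above; the following is only a sketch of the mechanism, using `exec` with two different namespaces.

```python
# The configuration cell from above, as a string, so that it can be executed
# in different namespaces.
CELL = '''
CORE_NAME = "bhsa"

if "SCRIPT" not in locals():
    SCRIPT = False
    DEFAULT_CORE_NAME = CORE_NAME
    DEFAULT_VERSION = "2021"
'''

# Interactive use: nothing is predefined, so the defaults are set.
interactive = {}
exec(CELL, interactive)

# Script mode (illustrative): the caller predefines SCRIPT and a version,
# so the guarded block is skipped and the caller's values survive.
scripted = {"SCRIPT": True, "DEFAULT_VERSION": "2017"}
exec(CELL, scripted)

print(interactive["SCRIPT"], interactive["DEFAULT_VERSION"])
print(scripted["SCRIPT"], scripted["DEFAULT_VERSION"])
```

In the interactive namespace the defaults apply (`False`, `"2021"`); in the scripted namespace the predefined values are kept (`True`, `"2017"`).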

# Pipeline settings

Here all the nitty-gritty differences between versions are stated.

In [4]:
pipeline = dict(
    defaults=dict(
        CORE_NAME=CORE_NAME,
        VERSION=DEFAULT_VERSION,
        LANG_FEATURE="languageISO",
        OCC_FEATURE="g_cons",
        LEX_FEATURE="lex",
        TEXT_FEATURE="g_word_utf8",
        TRAILER_FEATURE="trailer_utf8",
        DO_VOCALIZED_LEXEME=True,
        EXTRA_OVERLAP="",
        LEX_FORMATS="@fmt:lex-trans-plain={lex0} ",
        RENAME=(
            ("g_suffix", "trailer"),
            ("g_suffix_utf8", "trailer_utf8"),
        ),
    ),
    versions={
        "_temp": dict(
            LEX_FORMATS="@fmt:lex-default={voc_lex_utf8} ",
        ),
        "c": dict(
            LEX_FORMATS="@fmt:lex-default={voc_lex_utf8} ",
        ),
        "2021": dict(
            LEX_FORMATS="@fmt:lex-default={voc_lex_utf8} ",
        ),
        "2017": dict(
            LEX_FORMATS="@fmt:lex-default={voc_lex_utf8} ",
        ),
        "2016": dict(
            LEX_FORMATS="@fmt:lex-default={voc_lex_utf8} ",
        ),
        "4b": dict(
            LANG_FEATURE="language",
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP="gloss nametype",
            LEX_FORMATS="@fmt:lex-trans-plain={lex} ",
        ),
        "4": dict(
            LANG_FEATURE="language",
            DO_VOCALIZED_LEXEME=False,
            EXTRA_OVERLAP="gloss nametype",
            LEX_FORMATS="@fmt:lex-trans-plain={lex} ",
        ),
        "3": dict(
            LANG_FEATURE="language",
            OCC_FEATURE="surface_consonants",
            LEX_FEATURE="lexeme",
            TEXT_FEATURE="text",
            TRAILER_FEATURE="suffix",
            LEX_FORMATS="@fmt:lex-trans-plain={lexeme} ",
        ),
    },
    repoOrder="""
        bhsa
        phono
        valence
        parallels
        bridging
        trees
    """,
    repoConfig=dict(
        bhsa=(
            dict(
                task="coreData",
            ),
            dict(
                task="bookNames",
                omit={},
            ),
            dict(
                task="lexicon",
                omit={"3"},
            ),
            dict(
                task="paragraphs",
                omit={"3", "4", "4b"},
            ),
            dict(
                task="ketivQere",
                omit={"3", "4", "4b"},
            ),
            dict(
                task="stats",
                omit={"4", "4b"},
            ),
        ),
        phono=(
            dict(
                task="phono",
                omit={"3", "4", "4b"},
            ),
        ),
        valence=(
            dict(
                task="enrich",
                omit={"3"},
            ),
            dict(
                task="flowchart",
                omit={"3"},
            ),
        ),
        parallels=(
            dict(
                task="parallels",
                omit={},
                params=dict(
                    FORCE_MATRIX=False,
                ),
            ),
        ),
        trees=(
            dict(
                task="trees",
            ),
        ),
        bridging=(
            dict(
                task="BHSAbridgeOSM",
                omit={"3", "4", "4b"},
            ),
        ),
    ),
    repoDataDirs=dict(
        bhsa="source _temp tf shebanq",
        phono="_temp tf",
        valence="source _temp tf shebanq",
        parallels="source _temp tf",
        trees="source _temp tf",
        bridging="source _temp tf",
    ),
)
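
To see which settings apply to a given version, it can help to overlay the version-specific entries on the defaults. The sketch below does this for a reduced copy of the configuration. The helper `settings_for` is hypothetical, and it assumes that version entries simply override defaults key by key; the actual merging is done inside `runPipeline` and may differ in detail.

```python
# Reduced copy of the configuration above, limited to a few keys and versions.
config = dict(
    defaults=dict(
        VERSION="2021",
        LANG_FEATURE="languageISO",
        OCC_FEATURE="g_cons",
        LEX_FEATURE="lex",
        LEX_FORMATS="@fmt:lex-trans-plain={lex0} ",
    ),
    versions={
        "2021": dict(
            LEX_FORMATS="@fmt:lex-default={voc_lex_utf8} ",
        ),
        "3": dict(
            LANG_FEATURE="language",
            OCC_FEATURE="surface_consonants",
            LEX_FEATURE="lexeme",
            LEX_FORMATS="@fmt:lex-trans-plain={lexeme} ",
        ),
    },
)


def settings_for(config, version):
    """Hypothetical helper: defaults overlaid with the entries for one version.

    Assumption: a key in the version entry replaces the default of that key.
    """
    settings = dict(config["defaults"])  # copy, so the defaults stay untouched
    settings.update(config["versions"].get(version, {}))
    settings["VERSION"] = version
    return settings


for version in ("2021", "3"):
    print(version, settings_for(config, version))
```

Under this assumption, version `3` ends up with `LEX_FEATURE="lexeme"`, while version `2021` keeps the default `LEX_FEATURE="lex"` and only changes `LEX_FORMATS`.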



# Run the pipeline

To run everything from scratch:

```python
good = runPipeline(pipeline, versions=['3', '4', '4b', '2016', '2017', '2021', 'c'], force=True)
```

To run the SHEBANQ versions from scratch:

```python
good = runPipeline(pipeline, versions=['4', '4b', '2017', '2021'], force=True)
```

To make a new version called `_temp`:

```python
good = runPipeline(pipeline, versions=['_temp'], force=True) # a new candidate for 'c'
```

To copy the `_temp` version over to the continuous version `c`:

```python
if good:
    copyVersion(pipeline, '_temp', 'c')
```
This will copy all data directories (`source`, `_temp` and `tf`) from version `_temp` to version `c`.
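
For intuition, the copy step for a single repo can be sketched as follows. This is not the implementation of `copyVersion`; the helper `copy_version_dirs` is hypothetical, and it assumes that each data directory has one subdirectory per version (for example `source/_temp` and `source/c`). The demonstration runs in a temporary directory, so it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path


def copy_version_dirs(repo_dir, data_dirs, src, dst):
    """Hypothetical helper: copy <dataDir>/<src> to <dataDir>/<dst> in one repo.

    An existing destination is removed first, so stale files do not survive.
    Data directories without a <src> subdirectory are skipped.
    """
    copied = []
    for data_dir in data_dirs.split():
        src_path = Path(repo_dir) / data_dir / src
        dst_path = Path(repo_dir) / data_dir / dst
        if not src_path.is_dir():
            continue
        if dst_path.exists():
            shutil.rmtree(dst_path)
        shutil.copytree(src_path, dst_path)
        copied.append(f"{data_dir}/{src} ==> {data_dir}/{dst}")
    return copied


# Demonstration on a throw-away directory tree.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "bhsa"
    for data_dir in ("source", "tf"):
        (repo / data_dir / "_temp").mkdir(parents=True)
        (repo / data_dir / "_temp" / "data.txt").write_text("snapshot")
    for line in copy_version_dirs(repo, "source _temp tf", "_temp", "c"):
        print(line)
```

Removing the destination before copying mirrors the idea that the target version is overwritten as a whole rather than merged with older data.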

If you want to create the node mappings between versions,
go to the versionMappings notebook in the BHSA repo and run it.

In [6]:
# good = runPipeline(pipeline, versions=['3', '4', '4b', '2016', '2017', 'c'], force=True)
# good = runPipeline(pipeline, versions=['2016', '2017', 'c'], force=True)
# good = runPipeline(pipeline, versions=["2021"], repos=["bridging", "trees"], force=False)
good = runPipeline(pipeline, versions=["2021"], force=False)


##############################################################################################
#                                                                                            #
#     16m 30s Make version [2021]                                                            #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     16m 30s Make repo [bhsa]                                                               *
*                                                                                            *
**********************************************************************************************


---------------------------------------------

# End of pipeline

You might want to create the version mappings between recent versions.
See the `etcbc/bhsa/programs/versionMappings` notebook.

# What follows is legacy stuff.

# Consolidate the continuous version

Previously, version `c` acted as a *continuous* version. The idea was that it would be overwritten
by new snapshots of the data on a regular basis.

We do not do that anymore, so the following steps are no longer relevant.

Version `c` will be inactivated.

Each version is a fixed version.

We support the following workflow to carry out these updates:

1. make a new version called `_temp`. Note:
   * this choice of name prevents it from reaching GitHub, because `_temp` directories are in `.gitignore`;
   * if the version `_temp` already exists from an earlier run of this workflow, that is not a problem;
2. put a new data snapshot in the `source/_temp` directory of the `bhsa` repo, and also add data to the
   `source/_temp` directories of the other repos in the pipeline, as far as relevant;
3. run `good = runPipeline(pipeline, versions=['_temp'], force=True)`. Note:
   * we use `force=True` here, because then the old data in version `_temp` will be thoroughly overwritten;
4. if all went well, run `copyVersion(pipeline, '_temp', 'c')`. This will overwrite all data directories
   in version `c` with the newly created data directories in `_temp`.

If you have built an update as version `_temp` and all has gone well,
you can copy the entire version, including its `source` and `_temp` directories, to `c`.

In [5]:
# good = True
# good = False

if good:
    copyVersion(pipeline, "_temp", "c")


##############################################################################################
#                                                                                            #
#     14m 41s Copy version _temp ==> c                                                       #
#                                                                                            #
##############################################################################################


**********************************************************************************************
*                                                                                            *
*     14m 41s Repo bhsa                                                                      *
*                                                                                            *
**********************************************************************************************

|     14m 41s 	Copy source/_temp ==> source/c
