<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


![mql](images/emdros.png)

# TF from MQL

This notebook can read an
[MQL](https://emdros.org/mql.html)
dump of a version of the [BHSA](https://github.com/ETCBC/bhsa) Hebrew Text Database
and transform it into a
[Text-Fabric](https://github.com/Dans-labs/text-fabric)
resource.

## Discussion

The principled way of going about such a conversion is to import the MQL source into
an [Emdros](https://emdros.org) database and retrieve objects and features from there.

Because the syntax of an MQL file leaves some freedom, a text-to-text conversion from
MQL to something else is error-prone.

Yet that is exactly what we do here. It spares us installing, configuring, and managing Emdros and MySQL or SQLite3.
Apart from the upfront work of getting that going, the conversion itself would also be much slower.

So here you are: a smallish script that does an awful lot of work, mostly correctly, if used carefully.

# Caveat

This notebook makes use of a new feature of text-fabric, first present in 2.3.12.
Make sure to upgrade first.

```
sudo -H pip3 install --upgrade text-fabric
```

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import yaml
from shutil import rmtree
from tf.fabric import Fabric
from tf.core.helpers import formatMeta
import utils

# Pipeline
See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)
for how to run this script in the pipeline.

In [3]:
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    VERSION = "2021"
    RENAME = (
        ("g_suffix", "trailer"),
        ("g_suffix_utf8", "trailer_utf8"),
    )

In [4]:
def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [5]:
repoBase = os.path.expanduser("~/github/etcbc")
thisRepo = "{}/{}".format(repoBase, CORE_NAME)

In [6]:
thisSource = "{}/source/{}".format(thisRepo, VERSION)
mqlzFile = "{}/{}.mql.bz2".format(thisSource, CORE_NAME)

In [7]:
thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempSource = "{}/source".format(thisTemp)
mqlFile = "{}/{}.mql".format(thisTempSource, CORE_NAME)
thisTempTf = "{}/tf".format(thisTemp)

In [8]:
thisTf = "{}/tf/{}".format(thisRepo, VERSION)
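The directory layout built up in the cells above can be summarised in one place. The following is a minimal sketch using `pathlib`; the helper name `layout` is hypothetical and is not part of this notebook or of `utils`.

```python
from pathlib import Path


def layout(repo_base, core_name, version):
    """Return the files and directories used by the conversion (illustrative only)."""
    repo = Path(repo_base).expanduser() / core_name
    temp = repo / "_temp" / version
    return {
        "mqlzFile": repo / "source" / version / f"{core_name}.mql.bz2",
        "mqlFile": temp / "source" / f"{core_name}.mql",
        "thisTempTf": temp / "tf",
        "thisTf": repo / "tf" / version,
    }


paths = layout("~/github/etcbc", "bhsa", "2021")
for name, path in paths.items():
    print(f"{name:<12} {path}")
```

The compressed source and the delivered TF data live inside the repository proper, while the unzipped MQL and the freshly generated TF files stay under `_temp`.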

# Test

Check whether this conversion is needed in the first place.
Only when run as a script.

In [9]:
if SCRIPT:
    testFile = "{}/.tf/otype.tfx".format(thisTf)
    (good, work) = utils.mustRun(mqlzFile, testFile, force=FORCE)
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)
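The implementation of `utils.mustRun` is not shown in this notebook. As a rough idea of what such a check could look like, here is a sketch that compares modification times; the name `must_run_sketch` and its exact behaviour are assumptions, not the actual `utils` code.

```python
import os


def must_run_sketch(src, dst, force=False):
    """Hypothetical stand-in for utils.mustRun; returns (good, work).

    good: the source exists, so a conversion is possible.
    work: the conversion should actually be carried out.
    """
    if not os.path.exists(src):
        return (False, False)  # nothing to convert from
    if force or not os.path.exists(dst):
        return (True, True)  # forced, or no previous result
    # Redo the work only if the source is newer than the existing result.
    return (True, os.path.getmtime(src) > os.path.getmtime(dst))
```

With a return value of this shape, the cell above stops with an error status when `good` is false and stops quietly when there is no work to do.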

# TF Settings

We add some custom information here.

* the MQL object type that corresponds to the Text-Fabric slot type, typically `word`;
* a piece of metadata that will go into every feature (the time will be added automatically);
* suitable text formats for the `otext` feature of TF.

The `otext` feature is very sensitive to what is available in the source MQL,
so it needs to be configured here.
We save the configurations we need per source and version,
and we define a stripped-down default to start with.

In [10]:
slotType = "word"

In [13]:
genericMetaPath = f"{thisRepo}/yaml/generic.yaml"
coreMetaPath = f"{thisRepo}/yaml/core.yaml"

with open(genericMetaPath) as fh:
    genericMeta = yaml.load(fh, Loader=yaml.FullLoader)
    genericMeta["version"] = VERSION
with open(coreMetaPath) as fh:
    coreMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))
    
featureMetaData = {"": genericMeta, **coreMeta}

{'': {'dataset': 'BHSA',
  'datasetName': 'Biblia Hebraica Stuttgartensia Amstelodamensis',
  'author': 'Eep Talstra Centre for Bible and Computer',
  'encoders': 'Constantijn Sikkel (QDF), Ulrik Petersen (MQL) and Dirk Roorda (TF)',
  'website': 'https://shebanq.ancient-data.org',
  'email': 'shebanq@ancient-data.org',
  'version': '2021'},
 'book': {'description': '✅ book name in Latin (Genesis; Numeri; Reges1; ...'},
 'chapter': {'description': '✅ chapter number (1; 2; 3; ...'},
 'code': {'description': '✅ identifier of a clause atom relationship (0; 74; 367; ...'},
 'det': {'description': '✅ determinedness of phrase(atom) (det; und; NA.'},
 'dist': {'description': '✅ distance to mother of the object (0; 1; 2; -6; -6; ...'},
 'dist_unit': {'description': '✅ unit of measure for the dist feature (words; phrase_atoms; clause_atoms.'},
 'distributional_parent': {'description': '✅ parent releationship between phrase/clause/sentence atoms ('},
 'domain': {'description': '✅ text type of cl
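The shape of `featureMetaData` is easiest to see on a toy example. The sketch below uses made-up values in place of the contents of `generic.yaml` and `core.yaml`; only the merging expression mirrors the cell above.

```python
# Illustrative values only; the real ones come from generic.yaml and core.yaml.
generic_meta = {"dataset": "BHSA", "version": "2021"}
core_meta = {
    "book": {"description": "book name in Latin"},
    "chapter": {"description": "chapter number"},
}

# The empty key holds the metadata that goes into every feature;
# every other key is a feature name with its own metadata.
feature_meta = {"": generic_meta, **core_meta}

for name, meta in feature_meta.items():
    label = name or "(all features)"
    print(f"{label:<15} {meta}")
```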

In [13]:
oText = {
    "": {
        "": """
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
""",
    },
    "_temp": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "2021": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "2017": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "2016": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "4b": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word} 
@fmt:text-trans-full-ketiv={g_word} 
@fmt:text-trans-plain={g_cons} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "4": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word} 
@fmt:text-trans-full-ketiv={g_word} 
@fmt:text-trans-plain={g_cons} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "3": """
@fmt:lex-orig-full={graphical_lexeme_utf8} 
@fmt:lex-orig-plain={lexeme_utf8} 
@fmt:lex-trans-full={graphical_lexeme} 
@fmt:lex-trans-plain={lexeme} 
@fmt:text-orig-full={text}{suffix}
@fmt:text-orig-plain={surface_consonants_utf8}{suffix}
@fmt:text-trans-full={graphical_word} 
@fmt:text-trans-plain={surface_consonants} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
    "c": """
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",  # noqa W291
}

The next cell selects the proper otext material, falling back on a default if nothing
appropriate has been specified in `oText`.

In [14]:
thisOtext = oText.get(VERSION, oText[""])

In [15]:
if thisOtext is oText[""]:
    utils.caption(
        0, "WARNING: no otext feature info provided, using a meager default value"
    )
    otextInfo = {}
else:
    utils.caption(0, "INFO: otext feature information found")
    otextInfo = dict(
        line[1:].split("=", 1) for line in thisOtext.strip("\n").split("\n")
    )
    for x in sorted(otextInfo.items()):
        utils.caption(0, '\t{:<20} = "{}"'.format(*x))

|       0.00s INFO: otext feature information found
|       0.00s 	fmt:lex-orig-full    = "{g_lex_utf8} "
|       0.00s 	fmt:lex-orig-plain   = "{lex_utf8} "
|       0.00s 	fmt:lex-trans-full   = "{g_lex} "
|       0.00s 	fmt:lex-trans-plain  = "{lex} "
|       0.00s 	fmt:text-orig-full   = "{g_word_utf8}{trailer_utf8}"
|       0.00s 	fmt:text-orig-plain  = "{g_cons_utf8}{trailer_utf8}"
|       0.00s 	fmt:text-trans-full  = "{g_word}{trailer}"
|       0.00s 	fmt:text-trans-plain = "{g_cons}{trailer}"
|       0.00s 	sectionFeatures      = "book,chapter,verse"
|       0.01s 	sectionTypes         = "book,chapter,verse"
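The parsing step in the cell above can be tried in isolation. This sketch applies the same expression to a small, hand-made configuration; the variable names are illustrative.

```python
lines = [
    "@fmt:lex-orig-full={g_lex_utf8} ",  # the trailing space is part of the format
    "@fmt:text-orig-full={g_word_utf8}{trailer_utf8}",
    "@sectionTypes=book,chapter,verse",
]
otext_block = "\n" + "\n".join(lines) + "\n"

# Drop the leading "@" of each line and split on the first "=" only.
otext_info = dict(
    line[1:].split("=", 1) for line in otext_block.strip("\n").split("\n")
)

for key, value in sorted(otext_info.items()):
    print('{:<20} = "{}"'.format(key, value))
```

Two details matter here: `strip("\n")` removes only the surrounding newlines, so a trailing space in a format is preserved, and `split("=", 1)` keeps any later `=` inside the value.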


# Overview

The program has several stages:

1. **prepare** the source (bunzip it if needed)
1. **convert** the MQL file into a text-fabric dataset
1. **differences** (informational)
1. **deliver** the TF data at its destination directory
1. **compile** all TF features to binary format

# Prepare

Check the source, bunzip it if needed, empty the result directory.

In [16]:
if not os.path.exists(thisTempSource):
    os.makedirs(thisTempSource)

In [17]:
utils.caption(0, "bunzipping {} ...".format(mqlzFile))
utils.bunzip(mqlzFile, mqlFile)
utils.caption(0, "Done")

|       3.13s bunzipping /Users/werk/github/etcbc/bhsa/source/2021/bhsa.mql.bz2 ...
|       3.13s 	NOTE: Using existing unzipped file which is newer than bzipped one
|       3.13s Done
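The note in the output suggests that `utils.bunzip` skips the decompression when an up-to-date unzipped file already exists. Its code is not shown here; the following is a sketch of how such a helper could be written with the standard library. The name `bunzip_sketch` and its return value are assumptions.

```python
import bz2
import os
import shutil


def bunzip_sketch(src_path, dst_path):
    """Hypothetical stand-in for utils.bunzip.

    Returns True if the file was decompressed,
    False if an existing, newer copy was kept.
    """
    if os.path.exists(dst_path) and os.path.getmtime(dst_path) >= os.path.getmtime(
        src_path
    ):
        return False
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream the data so that a large dump need not fit in memory.
        shutil.copyfileobj(src, dst)
    return True
```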


In [18]:
if os.path.exists(thisTempTf):
    rmtree(thisTempTf)
os.makedirs(thisTempTf)

# MQL to Text-Fabric
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [28]:
TF = Fabric(locations=thisTempTf, silent=SCRIPT)

This is Text-Fabric 9.1.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

0 features found and 0 ignored
  0.00s Warp feature "otype" not found in
~/github/etcbc/bhsa/_temp/2021/tf/
  0.00s Warp feature "oslots" not found in
~/github/etcbc/bhsa/_temp/2021/tf/
  0.00s Warp feature "otext" not found. Working without Text-API



In [29]:
TF.importMQL(mqlFile, slotType=slotType, otext=otextInfo, meta=featureMetaData)

  0.00s Parsing mql source ...
  0.00s 		enum boolean_t
  0.00s 		enum phrase_determination_t
  0.00s 		enum language_t
  0.00s 		enum book_name_t
  0.00s 		enum lexical_set_t
  0.00s 		enum verbal_stem_t
  0.00s 		enum verbal_tense_t
  0.00s 		enum person_t
  0.01s 		enum number_t
  0.01s 		enum gender_t
  0.01s 		enum state_t
  0.01s 		enum part_of_speech_t
  0.01s 		enum phrase_type_t
  0.01s 		enum phrase_atom_relation_t
  0.01s 		enum phrase_relation_t
  0.01s 		enum phrase_atom_unit_distance_to_mother_t
  0.01s 		enum subphrase_relation_t
  0.01s 		enum subphrase_mother_object_type_t
  0.01s 		enum phrase_function_t
  0.01s 		enum clause_atom_type_t
  0.01s 		enum clause_type_t
  0.01s 		enum clause_kind_t
  0.01s 		enum clause_constituent_relation_t
  0.01s 		enum clause_constituent_mother_object_type_t
  0.01s 		enum clause_constituent_unit_distance_to_mother_t
  0.01s 		otype word
  0.01s 			feature number (int) =def= 0 : node
  0.02s 			feature lexeme_count (int) =def= 0 : no

# Rename features
We rename the features listed as (source, destination) pairs in `RENAME`.

In [30]:
if RENAME is None:
    utils.caption(4, "Rename features: nothing to do")
else:
    utils.caption(4, "Renaming {} features in {}".format(len(RENAME), thisTempTf))
    for (srcFeature, dstFeature) in RENAME:
        srcPath = "{}/{}.tf".format(thisTempTf, srcFeature)
        dstPath = "{}/{}.tf".format(thisTempTf, dstFeature)
        if os.path.exists(srcPath):
            os.rename(srcPath, dstPath)
            utils.caption(0, "\trenamed {} to {}".format(srcFeature, dstFeature))
        else:
            utils.caption(0, "\tsource feature {} does not exist.".format(srcFeature))
            utils.caption(
                0, "\tdestination feature {} will not be created.".format(dstFeature)
            )

..............................................................................................
.     18m 27s Renaming 2 features in /Users/werk/github/etcbc/bhsa/_temp/2021/tf             .
..............................................................................................
|     18m 27s 	renamed g_suffix to trailer
|     18m 27s 	renamed g_suffix_utf8 to trailer_utf8


# Diffs

Check differences with previous versions.

The new dataset has been created in a temporary directory,
and has not yet been copied to its destination.

Here is your opportunity to compare the newly created features with the older features.
You expect some differences in some features.

We check the differences between the previous version of the features and what has been generated.
We list features that will be added and deleted and changed.
For each changed feature we show the first line where the new feature differs from the old one.
We ignore changes in the metadata, because the timestamp in the metadata will always change.

In [31]:
utils.checkDiffs(thisTempTf, thisTf)

..............................................................................................
.     18m 32s Check differences with previous version                                        .
..............................................................................................
|     18m 32s 	2 features to add
|     18m 32s 		g_voc_lex
|     18m 32s 		g_voc_lex_utf8
|     18m 32s 	43 features to delete
|     18m 32s 		book@am
|     18m 32s 		book@ar
|     18m 32s 		book@bn
|     18m 32s 		book@da
|     18m 32s 		book@de
|     18m 32s 		book@el
|     18m 32s 		book@en
|     18m 32s 		book@es
|     18m 32s 		book@fa
|     18m 32s 		book@fr
|     18m 32s 		book@he
|     18m 32s 		book@hi
|     18m 32s 		book@id
|     18m 32s 		book@ja
|     18m 32s 		book@ko
|     18m 32s 		book@la
|     18m 32s 		book@nl
|     18m 32s 		book@pa
|     18m 32s 		book@pt
|     18m 32s 		book@ru
|     18m 32s 		book@sw
|     18m 32s 		book@syc
|     18m 32s 		book@tr
|     18m 32s 		book@ur
|     18m 3
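The comparison described above boils down to set differences on feature names plus a line-by-line scan that skips the metadata. The sketch below is not the code of `utils.checkDiffs`; the function names are hypothetical, and it assumes that the metadata of a `.tf` file consists of the leading lines that start with `@`.

```python
from itertools import zip_longest


def split_tf(lines):
    """Split the lines of a .tf file into (metadata lines, data lines)."""
    n = 0
    while n < len(lines) and lines[n].startswith("@"):
        n += 1
    return lines[:n], lines[n:]


def first_data_diff(old_lines, new_lines):
    """Return (position, old, new) for the first differing data line, or None.

    The position is 1-based and counted within the data part only;
    a missing line on either side is reported as None.
    """
    old_data = split_tf(old_lines)[1]
    new_data = split_tf(new_lines)[1]
    for i, (old, new) in enumerate(zip_longest(old_data, new_data), start=1):
        if old != new:
            return (i, old, new)
    return None


# Feature names present before and after the conversion (toy values).
old_features = {"book", "lex", "sp"}
new_features = {"lex", "sp", "g_voc_lex"}
to_add = sorted(new_features - old_features)
to_delete = sorted(old_features - new_features)

old = ["@node", "@dateWritten=2020-01-01T00:00:00Z", "", "a", "b"]
new = ["@node", "@dateWritten=2021-01-01T00:00:00Z", "", "a", "c"]
print(to_add, to_delete, first_data_diff(old, new))
```

Because the metadata lines are split off first, a changed timestamp alone does not count as a difference.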

# Deliver

Copy the new TF dataset from the temporary location where it has been created to its final destination.

In [32]:
utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.     19m 22s Deliver data set to /Users/werk/github/etcbc/bhsa/tf/2021                      .
..............................................................................................


# Compile TF

We load the data, just to see whether everything loads and the pre-computing of extra information works out.
Moreover, if you want to work with these features, the pre-computing has then already been done, and everything is quicker in subsequent runs.

We issue a load statement to trigger the pre-computing of extra data.
Note that all features mentioned in the text formats of the `otext` config feature
will be loaded, as well as the features for sections.

At that point we have access to the full list of features.
We grab them and load them all.

In [33]:
utils.caption(4, "Load and compile standard TF features")
TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("")

..............................................................................................
.     19m 26s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 9.1.6
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

75 features found and 0 ignored
  0.00s loading features ...
   |     0.42s T otype                from ~/github/etcbc/bhsa/tf/2021
   |     7.31s T oslots               from ~/github/etcbc/bhsa/tf/2021
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.90s T g_word_utf8          from ~/github/etcbc/bhsa/tf/2021
   |     0.83s T g_lex_utf8           from ~/github/etcbc/bhsa/tf/2021
   |     0.03s T verse                from ~/github/etcbc/bhsa/tf/2021
   |     0.81s T g_cons_utf8          from ~/github/etcbc/bhsa/tf/2021
   |     0.73s T g_lex        

In [34]:
utils.caption(4, "Load and compile all other TF features")
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

..............................................................................................
.     20m 19s Load and compile all other TF features                                         .
..............................................................................................
   |     0.00s Feature overview: 70 for nodes; 4 for edges; 1 configs; 8 computed
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.11s T code                 from ~/github/etcbc/bhsa/tf/2021
   |     0.83s T det                  from ~/github/etcbc/bhsa/tf/2021
   |     0.89s T dist                 from ~/github/etcbc/bhsa/tf/2021
   |     1.00s T dist_unit            from ~/github/etcbc/bhsa/tf/2021
   |     2.68s T distributional_parent from ~/github/etcbc/bhsa/tf/2021
   |     0.14s T domain               from ~/github/etcbc/bhsa/tf/2021
   |     0.43s T function             from ~/github/etcbc/bhsa/tf/2021
   |     3.

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Examples

In [35]:
utils.caption(4, "Basic test")
utils.caption(4, "First verse in all formats")
for fmt in T.formats:
    utils.caption(0, "{}".format(fmt), continuation=True)
    utils.caption(0, "\t{}".format(T.text(range(1, 12), fmt=fmt)), continuation=True)

..............................................................................................
.     21m 43s Basic test                                                                     .
..............................................................................................
..............................................................................................
.     21m 43s First verse in all formats                                                     .
..............................................................................................
lex-orig-full
	בְּ רֵאשִׁית בָּרָא אֱלֹה אֵת הַ שָּׁמַי וְ אֵת הָ אָרֶץ 
lex-orig-plain
	ב ראשׁית֜ ברא אלהים֜ את ה שׁמים֜ ו את ה ארץ֜ 
lex-trans-full
	B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY 
lex-trans-plain
	B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ 
text-orig-full
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
text-orig-plain
	בראשׁית ברא אלהים את השׁמים ואת הארץ׃ 
text-trans-full
	

In [36]:
if SCRIPT:
    stop(good=True)

In [16]:
f = "subphrase_type"
print("`" + "` `".join(sorted(str(x[0]) for x in Fs(f).freqList())) + "`")

    15s Node feature "subphrase_type" not loaded


AttributeError: 'NoneType' object has no attribute 'freqList'