<img align="right" src="tf-small.png"/>

![mql](emdros.png)

# TF from MQL

This notebook can read an
[MQL](https://emdros.org/mql.html)
dump of a version of the [BHSA](https://github.com/ETCBC/bhsa) Hebrew Text Database
and transform it into a
[Text-Fabric](https://github.com/Dans-labs/text-fabric)
resource.

## Discussion

The principled way of going about such a conversion is to import the MQL source into
an [Emdros](https://emdros.org) database, and to retrieve objects and features from there.

Because the syntax of an MQL file leaves some freedom, a text-to-text conversion from
MQL to something else is error prone.

Yet this is what we do, the error-prone thing. That way we avoid installing, configuring and managing Emdros and MySQL/SQLite3.
Apart from the upfront work to get that going, the conversion itself would also be much slower.

So here you are, a smallish script that does an awful lot of work, mostly correctly, if used carefully.

# Caveat

This notebook makes use of a new feature of text-fabric, first present in 2.3.12.
Make sure to upgrade first.

```sh
sudo -H pip3 install --upgrade text-fabric
```

In [8]:
import os,sys,re,collections
from shutil import rmtree
from tf.fabric import Fabric
import utils

# Pipeline
See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) 
for how to run this script in the pipeline.

In [9]:
if 'SCRIPT' not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = 'bhsa'
    VERSION = 'c'

def stop(good=False):
    if SCRIPT: sys.exit(0 if good else 1)

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [10]:
repoBase = os.path.expanduser('~/github/etcbc')
thisRepo = '{}/{}'.format(repoBase, CORE_NAME)

thisSource = '{}/source/{}'.format(thisRepo, VERSION)
mqlzFile = '{}/{}.mql.bz2'.format(thisSource, CORE_NAME)

thisTemp = '{}/_temp/{}'.format(thisRepo, VERSION)
thisTempSource = '{}/source'.format(thisTemp)
mqlFile = '{}/{}.mql'.format(thisTempSource, CORE_NAME)
thisTempTf = '{}/tf'.format(thisTemp)

thisTf = '{}/tf/{}'.format(thisRepo, VERSION)

# Test

Check whether this conversion is needed in the first place.
This check is only performed when the notebook runs as a script.

In [11]:
if SCRIPT:
    testFile = '{}/.tf/otype.tfx'.format(thisTf)
    (good, work) = utils.mustRun(mqlzFile, testFile, force=FORCE)
    if not good: stop(good=False)
    if not work: stop(good=True)
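
The implementation of `utils.mustRun` is not shown in this notebook. As a rough idea of what such a freshness check could look like, here is a minimal sketch. It assumes that the function returns a `(good, work)` pair, that a missing source counts as failure, and that work is needed when the target is absent or older than the source. The name `must_run_sketch` is hypothetical.

```python
import os

def must_run_sketch(source, target, force=False):
    """Hypothetical stand-in for utils.mustRun: returns (good, work)."""
    if not os.path.exists(source):
        # Nothing to convert: report failure, no work possible
        return (False, False)
    if force or not os.path.exists(target):
        # Forced run, or no previous result: do the work
        return (True, True)
    # Otherwise rerun only when the source is newer than the result
    return (True, os.path.getmtime(source) > os.path.getmtime(target))
```

With `FORCE = True`, as set above, a check like this would always ask for a rerun as long as the source file exists.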

# TF Settings

We add some custom information here:

* the MQL object type that corresponds to the TF slot type, typically `word`;
* a piece of metadata that will go into every feature; the time will be added automatically;
* suitable text formats for the `otext` feature of TF.

The `otext` feature is very sensitive to what is available in the source MQL,
so it needs to be configured here.
We save the configs we need per source and version,
and we define a stripped-down default version to start with.

In [12]:
slotType = 'word'

featureMetaData = dict(
    dataset='BHSA',
    datasetName='Biblia Hebraica Stuttgartensia Amstelodamensis',
    author='Eep Talstra Centre for Bible and Computer',
    encoders='Constantijn Sikkel (QDF), Ulrik Petersen (MQL) and Dirk Roorda (TF)',
    website='https://shebanq.ancient-data.org',
    email='shebanq@ancient-data.org',
)

oText = {
    '': {
        '': '''
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
''',
    },
    '4': '''
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_qere_utf8/g_word_utf8}{qtrailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word} 
@fmt:text-trans-full-ketiv={g_word} 
@fmt:text-trans-plain={g_cons} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    '4b': '''
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_qere_utf8/g_word_utf8}{qtrailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word} 
@fmt:text-trans-full-ketiv={g_word} 
@fmt:text-trans-plain={g_cons} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    'c': '''
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    '2016': '''
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    '2017': '''
@fmt:lex-orig-full={g_lex_utf8} 
@fmt:lex-orig-plain={lex_utf8} 
@fmt:lex-trans-full={g_lex} 
@fmt:lex-trans-plain={lex} 
@fmt:text-orig-full={g_word_utf8}{trailer_utf8}
@fmt:text-orig-plain={g_cons_utf8}{trailer_utf8}
@fmt:text-trans-full={g_word}{trailer}
@fmt:text-trans-plain={g_cons}{trailer}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
}
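
Placeholders such as `{g_qere_utf8/g_word_utf8}` in the older versions list alternatives: the first feature that has a value is used. The sketch below illustrates that idea on plain dictionaries. It is not Text-Fabric's own code; `apply_format` is a hypothetical helper, and it assumes that a feature without a value shows up as `None`.

```python
import re

def apply_format(template, features):
    """Fill a TF-style text format template from a dict of feature values."""
    def fill(match):
        # '{a/b}' lists alternatives: take the first feature that has a value
        for name in match.group(1).split('/'):
            value = features.get(name)
            if value is not None:
                return value
        return ''  # no alternative has a value
    return re.sub(r'\{([^}]+)\}', fill, template)

template = '{g_qere_utf8/g_word_utf8}{qtrailer_utf8/trailer_utf8}'

# A word without a qere falls back on the written form and its trailer
plain_word = {'g_word_utf8': 'K', 'trailer_utf8': ' '}
# A word with a qere uses the qere form and the qere trailer
qere_word = {'g_word_utf8': 'K', 'g_qere_utf8': 'Q',
             'trailer_utf8': ' ', 'qtrailer_utf8': ''}

print(repr(apply_format(template, plain_word)))
print(repr(apply_format(template, qere_word)))
```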

The next cell selects the proper otext material, falling back on a default if nothing
appropriate has been specified in `oText`.

In [13]:
thisOtext = oText.get(VERSION, oText[''])

if thisOtext is oText['']:
    utils.caption(0, 'WARNING: no otext feature info provided, using a meager default value')
    otextInfo = {}
else:
    utils.caption(0, 'INFO: otext feature information found')
    otextInfo = dict(line[1:].split('=', 1) for line in thisOtext.strip('\n').split('\n'))
    for x in sorted(otextInfo.items()):
        utils.caption(0, '\t{:<20} = "{}"'.format(*x))

|      1m 19s INFO: otext feature information found
|      1m 19s 	fmt:lex-orig-full    = "{g_lex_utf8} "
|      1m 19s 	fmt:lex-orig-plain   = "{lex_utf8} "
|      1m 19s 	fmt:lex-trans-full   = "{g_lex} "
|      1m 19s 	fmt:lex-trans-plain  = "{lex} "
|      1m 19s 	fmt:text-orig-full   = "{g_word_utf8}{trailer_utf8}"
|      1m 19s 	fmt:text-orig-plain  = "{g_cons_utf8}{trailer_utf8}"
|      1m 19s 	fmt:text-trans-full  = "{g_word}{trailer}"
|      1m 19s 	fmt:text-trans-plain = "{g_cons}{trailer}"
|      1m 19s 	sectionFeatures      = "book,chapter,verse"
|      1m 19s 	sectionTypes         = "book,chapter,verse"
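
To make the parsing step above concrete, here is the same expression applied to a small sample. The sample string and the name `parse_otext` are only for illustration. The point is that each line loses its leading `@` and is split on the first `=` only, so trailing spaces in a value survive.

```python
def parse_otext(text):
    """Turn '@key=value' lines into a dict, as the cell above does."""
    return dict(
        line[1:].split('=', 1)  # drop '@', split on the first '=' only
        for line in text.strip('\n').split('\n')
    )

# Written with explicit newlines so the trailing space after {lex} is visible
sample = (
    '\n'
    '@fmt:lex-trans-plain={lex} \n'
    '@fmt:text-orig-full={g_word_utf8}{trailer_utf8}\n'
    '@sectionTypes=book,chapter,verse\n'
)

info = parse_otext(sample)
for key, value in sorted(info.items()):
    print('{:<20} = "{}"'.format(key, value))
```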


# Overview

The program has several stages:
   
1. **prepare** the source (bunzip it if needed)
1. **parse MQL** and collect information in data structures
1. **transform to TF**: write the data structures as TF features
1. **differences** with the previous version (informational)
1. **deliver** the TF data at its destination directory
1. **compile** all TF features to binary format

# Prepare

Check the source, bunzip it if needed (with `utils.bunzip`), and empty the result directory.

In [14]:
if not os.path.exists(thisTempSource):
    os.makedirs(thisTempSource)

utils.caption(0, 'bunzipping {} ...'.format(mqlzFile))
utils.bunzip(mqlzFile, mqlFile)
utils.caption(0, 'Done')

if os.path.exists(thisTempTf): rmtree(thisTempTf)
os.makedirs(thisTempTf)

|      1m 19s bunzipping /Users/dirk/github/etcbc/bhsa/source/c/bhsa.mql.bz2 ...
|      1m 19s 	NOTE: Using existing unzipped file which is newer than bzipped one
|      1m 19s Done


Convert a monads specification (a comma separated sequence of numbers and number ranges)
into a set of integers.
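
No such helper appears in this notebook, which hands the whole conversion to `TF.importMQL`. A minimal sketch of the idea is given here. It assumes that the specification has already been stripped of its surrounding braces and that ranges are written as `start-end`. The name `set_from_spec` is hypothetical.

```python
def set_from_spec(spec):
    """Turn a monad specification such as '1-3,7,10-11' into a set of integers."""
    covered = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue  # tolerate empty pieces such as a trailing comma
        if '-' in part:
            # A range: both end points are included
            start, end = part.split('-', 1)
            covered.update(range(int(start), int(end) + 1))
        else:
            covered.add(int(part))
    return covered

print(sorted(set_from_spec('1-3,7,10-11')))
```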

# MQL to Text-Fabric
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [15]:
TF = Fabric(locations=thisTempTf, silent=True)
TF.importMQL(mqlFile, slotType=slotType, otext=otextInfo, meta=featureMetaData)

  0.00s Parsing mql source ...
  0.01s 		enum boolean_t
  0.01s 		enum phrase_determination_t
  0.01s 		enum language_t
  0.02s 		enum book_name_t
  0.02s 		enum lexical_set_t
  0.02s 		enum verbal_stem_t
  0.02s 		enum verbal_tense_t
  0.03s 		enum person_t
  0.03s 		enum number_t
  0.03s 		enum gender_t
  0.03s 		enum state_t
  0.04s 		enum part_of_speech_t
  0.04s 		enum phrase_type_t
  0.04s 		enum phrase_atom_relation_t
  0.04s 		enum phrase_relation_t
  0.05s 		enum phrase_atom_unit_distance_to_mother_t
  0.05s 		enum subphrase_relation_t
  0.05s 		enum subphrase_mother_object_type_t
  0.05s 		enum phrase_function_t
  0.05s 		enum clause_atom_type_t
  0.06s 		enum clause_type_t
  0.06s 		enum clause_kind_t
  0.06s 		enum clause_constituent_relation_t
  0.06s 		enum clause_constituent_mother_object_type_t
  0.07s 		enum clause_constituent_unit_distance_to_mother_t
  0.07s 		otype word
  0.07s 			feature number (int) =def= 0 : node
  0.08s 			feature g_voc_lex (str) =def=  : node
 

 3m 12s 39 objects of type book
 3m 12s 88000 objects of type clause
 3m 12s 45180 objects of type half_verse
 3m 12s 23213 objects of type verse
 3m 12s 267515 objects of type phrase_atom
 3m 12s Making TF data ...
 3m 12s Monad - idd mapping ...
 3m 12s maxSlot=426581
 3m 12s Node mapping and otype ...
 3m 13s oslots ...
 3m 13s metadata ...
 3m 13s features ...
 3m 13s 	features from words
 3m 17s 	   100000 words
 3m 22s 	   200000 words
 3m 25s 	   300000 words
 3m 29s 	   400000 words
 3m 30s 	   426581 words
 3m 30s 	features from books
 3m 30s 	       39 books
 3m 30s 	features from chapters
 3m 30s 	      929 chapters
 3m 30s 	features from clauses
 3m 31s 	    88000 clauses
 3m 31s 	features from clause_atoms
 3m 33s 	    90562 clause_atoms
 3m 33s 	features from half_verses
 3m 33s 	    45180 half_verses
 3m 33s 	features from phrases
 3m 34s 	   100000 phrases
 3m 35s 	   200000 phrases
 3m 35s 	   253174 phrases
 3m 35s 	features from phrase_atoms
 3m 36s 	   100000 phrase

# Diffs

Check differences with previous versions.

The new dataset has been created in a temporary directory,
and has not yet been copied to its destination.

Here is your opportunity to compare the newly created features with the older features.
Expect differences in some features.

We check the differences between the previous version of the features and what has been generated.
We list the features that will be added, deleted, or changed.
For each changed feature we show the first line where the new feature differs from the old one.
We ignore changes in the metadata, because the timestamp in the metadata will always change.

In [16]:
utils.checkDiffs(thisTempTf, thisTf)

..............................................................................................
.      6m 04s Check differences with previous version                                        .
..............................................................................................
|      6m 04s 	no features to add
|      6m 04s 	26 features to delete
|      6m 04s 		book@am
|      6m 04s 		book@ar
|      6m 04s 		book@bn
|      6m 04s 		book@da
|      6m 04s 		book@de
|      6m 04s 		book@el
|      6m 04s 		book@en
|      6m 04s 		book@es
|      6m 04s 		book@fa
|      6m 04s 		book@fr
|      6m 04s 		book@he
|      6m 04s 		book@hi
|      6m 04s 		book@id
|      6m 04s 		book@ja
|      6m 04s 		book@ko
|      6m 04s 		book@la
|      6m 04s 		book@nl
|      6m 04s 		book@pa
|      6m 04s 		book@pt
|      6m 04s 		book@ru
|      6m 04s 		book@sw
|      6m 04s 		book@syc
|      6m 04s 		book@tr
|      6m 04s 		book@ur
|      6m 04s 		book@yo
|      6m 04s 		book@zh
|      6m 04s 	69 f
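
The code of `utils.checkDiffs` is not part of this notebook. The sketch below only illustrates the comparison described above, under these assumptions: features are `.tf` files, their metadata is the leading block of lines starting with `@`, and only the remaining lines are compared. The names `data_lines` and `check_diffs_sketch` are hypothetical.

```python
import os
from itertools import dropwhile

def data_lines(path):
    """Lines of a .tf file after the leading '@' metadata block."""
    with open(path, encoding='utf8') as fh:
        body = dropwhile(lambda line: line.startswith('@'), fh)
        return [line.rstrip('\n') for line in body]

def check_diffs_sketch(new_dir, old_dir):
    """Report features to add, to delete, and changed (with first differing line)."""
    new = {f for f in os.listdir(new_dir) if f.endswith('.tf')}
    old = {f for f in os.listdir(old_dir) if f.endswith('.tf')}
    report = dict(add=sorted(new - old), delete=sorted(old - new), changed={})
    for name in sorted(new & old):
        a = data_lines(os.path.join(new_dir, name))
        b = data_lines(os.path.join(old_dir, name))
        if a != b:
            # 1-based position, counted after the metadata block, of the first
            # difference; if one file is a prefix of the other, the first extra line
            first = next(
                (i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                min(len(a), len(b)),
            )
            report['changed'][name] = first + 1
    return report
```

Skipping the metadata block keeps a changed timestamp from marking every feature as changed.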

# Deliver 

Copy the new TF dataset from the temporary location where it has been created to its final destination.

In [17]:
utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.      6m 32s Deliver data set to /Users/dirk/github/etcbc/bhsa/tf/c                         .
..............................................................................................
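
How `utils.deliverDataset` works internally is not shown here. A delivery step of this kind can be sketched with the standard library alone, assuming that the destination is replaced wholesale by the temporary directory. The name `deliver_sketch` is hypothetical.

```python
import os
from shutil import copytree, rmtree

def deliver_sketch(temp_dir, dest_dir):
    """Replace dest_dir by a fresh copy of temp_dir."""
    if os.path.exists(dest_dir):
        # Remove the old dataset so that dropped features do not linger
        rmtree(dest_dir)
    # copytree creates dest_dir and any missing parent directories
    copytree(temp_dir, dest_dir)
```

Under that assumption, features that exist only in the old dataset disappear on delivery, which would match the "features to delete" listed in the diff step.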


# Compile TF

This step checks whether everything loads and whether the precomputing of extra information works out.
Moreover, if you want to work with these features, the precomputing has then already been done, and everything is quicker in subsequent runs.

We issue a load statement to trigger the precomputing of extra data.
Note that all features specified in the text formats of the `otext` config feature
will be loaded, as well as the features for sections.

At that point we have access to the full list of features.
We grab them and load them all.

In [18]:
utils.caption(4, 'Load and compile standard TF features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('')

utils.caption(4, 'Load and compile all other TF features')
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)
T = api.T

..............................................................................................
.      6m 33s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 3.0.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

69 features found and 0 ignored
  0.00s loading features ...
   |     1.49s T otype                from /Users/dirk/github/etcbc/bhsa/tf/c
   |       10s T oslots               from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.10s T book                 from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.05s T chapter              from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.05s T verse                from /Users/dirk/github/etcbc/bhsa/tf/c
   |     1.51s T g_c

# Examples

In [19]:
utils.caption(4, 'Basic test')
utils.caption(4, 'First verse in all formats')
for fmt in T.formats:
    utils.caption(0, '{}'.format(fmt), continuation=True)
    utils.caption(0, '\t{}'.format(T.text(range(1,12), fmt=fmt)), continuation=True)

..............................................................................................
.      9m 06s Basic test                                                                     .
..............................................................................................
..............................................................................................
.      9m 06s First verse in all formats                                                     .
..............................................................................................
lex-trans-full
	B.:- R;>CIJT B.@R@> >:ELOH >;T HA- C.@MAJ W:- >;T H@- >@REY 
text-orig-plain
	בראשׁית ברא אלהים את השׁמים ואת הארץ׃ 
lex-trans-plain
	B R>CJT/ BR>[ >LHJM/ >T H CMJM/ W >T H >RY/ 
lex-orig-plain
	ב ראשׁית֜ ברא אלהים֜ את ה שׁמים֜ ו את ה ארץ֜ 
text-trans-full
	B.:-R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA-C.@MA73JIM W:->;71T H@->@75REY00 
text-orig-full
	בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽר