<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


#  MQL conversion to TF

This notebook converts extra-biblical texts, produced by the ETCBC in MQL format, into
[Text-Fabric](https://github.com/Dans-labs/text-fabric)
format.

In [1]:
bookNames = dict(
    en='''
        1QM
        1QS
        Kuntillet_Ajrud
        Arad
        Balaam
        Ketef_Hinnom
        Lachish
        Mesha_Stela
        Mesad_Hashavyahu
        Pirqe
        Shirata
        Siloam
    '''.strip().split(),
)

## Discussion

In [2]:
import os,sys,re,collections
import utils
from shutil import rmtree
from tf.fabric import Fabric

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [3]:
NAME = 'extrabiblical'
VERSION = '0.1'

In [4]:
repoBase = os.path.expanduser('~/github/etcbc')
thisRepo = '{}/{}'.format(repoBase, 'extrabiblical')

thisSource = '{}/source/{}'.format(thisRepo, VERSION)
mqlzFile = '{}/{}.mql.bz2'.format(thisSource, NAME)

thisTemp = '{}/_temp/{}'.format(thisRepo, VERSION)
thisTempSource = '{}/source'.format(thisTemp)
mqlFile = '{}/{}.mql'.format(thisTempSource, NAME)
thisTempTf = '{}/tf'.format(thisTemp)

thisTf = '{}/tf/{}'.format(thisRepo, VERSION)

# TF Settings

We add some custom information here.

* the MQL object type that corresponds to the TF slot type, typically `word`;
* a piece of metadata that will go into every feature (the time will be added automatically);
* suitable text formats for the `otext` feature of TF.

The `otext` feature depends heavily on which features are available in the source MQL,
so it needs to be configured here.
We save the configurations we need per source and version in the `oText` dictionary,
and we define a stripped-down default to start with.

In [5]:
slotType = 'word'

featureMetaData = dict(
    dataset='ExtraBiblical',
    version=VERSION,
    datasetName='Non Masoretic Texts related to the Hebrew Bible',
    author='Eep Talstra Centre for Bible and Computer',
    encoders='Constantijn Sikkel, Martijn Naaijer, Ulrik Petersen (MQL) and Dirk Roorda (TF)',
    website='http://etcbc.nl',
    email='m.naaijer@vu.nl',
)

oText = {
    '': {
        '': '''
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    },
    '0.1': '''
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
}

The next cell selects the proper `otext` material, falling back on a default if nothing
appropriate has been specified in `oText` for this version.

In [6]:
thisOtext = oText.get(VERSION, oText[''])

if thisOtext is oText['']:
    utils.caption(0, 'WARNING: no otext feature info provided, using a meager default value')
    otextInfo = {}
else:
    utils.caption(0, 'INFO: otext feature information found')
    otextInfo = dict(line[1:].split('=', 1) for line in thisOtext.strip('\n').split('\n'))
    for x in sorted(otextInfo.items()):
        utils.caption(0, '\t{:<20} = "{}"'.format(*x))

|       0.00s INFO: otext feature information found
|       0.00s 	fmt:lex-orig-full    = "{g_lex_utf8}{g_suffix_utf8}"
|       0.00s 	fmt:lex-trans-full   = "{lex_utf8}{g_suffix}"
|       0.00s 	fmt:text-orig-full   = "{g_word_utf8}{g_suffix_utf8}"
|       0.00s 	fmt:text-trans-full  = "{g_word}{g_suffix}"
|       0.00s 	fmt:text-trans-plain = "{g_cons}{g_suffix}"
|       0.00s 	sectionFeatures      = "book,chapter,verse"
|       0.00s 	sectionTypes         = "book,chapter,verse"
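
To make the role of the `@fmt:` templates more concrete, here is a minimal sketch of how such a template could be filled with feature values for a sequence of words. It is only a model of the idea, not Text-Fabric's own rendering code; the word records and the helper `render` are hypothetical and the values are made up for illustration.

```python
# Hypothetical feature values for two words; the feature names follow the
# templates configured above, the values are invented for illustration only.
words = [
    {'g_word': 'B.:-', 'g_cons': 'B', 'g_suffix': ''},
    {'g_word': 'R;>CIJT', 'g_cons': 'R>CJT', 'g_suffix': ' '},
]

otext_lines = '''
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
'''

# Same parsing idea as in the cell above:
# drop the leading "@" and split each line on the first "=".
templates = dict(
    line[1:].split('=', 1) for line in otext_lines.strip('\n').split('\n')
)


def render(fmt_name, word_records):
    """Fill the named template for each word and concatenate the results."""
    template = templates['fmt:' + fmt_name]
    return ''.join(template.format(**w) for w in word_records)


plain = render('text-trans-plain', words)
full = render('text-trans-full', words)
print(repr(plain))
print(repr(full))
```

The sketch shows why the configuration is sensitive to the source: a template that names a feature missing from the data cannot be filled.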


# Prepare

In [7]:
if not os.path.exists(thisTempSource):
    os.makedirs(thisTempSource)

utils.caption(0, 'bunzipping {} ...'.format(mqlzFile))
utils.bunzip(mqlzFile, mqlFile)
utils.caption(0, 'Done')

if os.path.exists(thisTempTf): rmtree(thisTempTf)
os.makedirs(thisTempTf)

|       2.08s bunzipping /Users/dirk/github/etcbc/extrabiblical/source/0.1/extrabiblical.mql.bz2 ...
|       2.08s 	NOTE: Using existing unzipped file which is newer than bzipped one
|       2.08s Done
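
The helper `utils.bunzip` is not shown in this notebook. Judging from the note in the output above, it skips decompression when an unzipped file already exists that is newer than the compressed one. A minimal sketch of that behaviour, using only the standard library, could look as follows; the function name `bunzip_if_stale` and its return value are assumptions, not the actual `utils` implementation.

```python
import bz2
import os
import shutil


def bunzip_if_stale(src_path, dst_path):
    """Decompress src_path (a .bz2 file) to dst_path unless dst_path is up to date.

    Hypothetical stand-in for utils.bunzip.
    Returns True if the file was decompressed, False if the existing copy was kept.
    """
    if (
        os.path.exists(dst_path)
        and os.path.getmtime(dst_path) >= os.path.getmtime(src_path)
    ):
        # The existing unzipped file is at least as new as the bzipped one
        return False
    os.makedirs(os.path.dirname(dst_path) or '.', exist_ok=True)
    # Stream the data so that large MQL dumps need not fit in memory
    with bz2.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return True
```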


# MQL to Text-Fabric

Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [8]:
TF = Fabric(locations=thisTempTf, silent=True)
TF.importMQL(mqlFile, slotType=slotType, otext=otextInfo, meta=featureMetaData)

  0.00s Parsing mql source ...
  0.00s 		enum boolean_t
  0.01s 		enum phrase_determination_t
  0.02s 		enum language_t
  0.02s 		enum book_name_t
  0.02s 		enum lexical_set_t
  0.02s 		enum verbal_stem_t
  0.02s 		enum verbal_tense_t
  0.02s 		enum person_t
  0.02s 		enum number_t
  0.03s 		enum gender_t
  0.03s 		enum state_t
  0.03s 		enum part_of_speech_t
  0.03s 		enum phrase_type_t
  0.03s 		enum phrase_atom_relation_t
  0.03s 		enum phrase_relation_t
  0.04s 		enum phrase_atom_unit_distance_to_mother_t
  0.04s 		enum subphrase_relation_t
  0.04s 		enum subphrase_mother_object_type_t
  0.05s 		enum phrase_function_t
  0.06s 		enum clause_atom_type_t
  0.06s 		enum clause_type_t
  0.07s 		enum clause_kind_t
  0.07s 		enum clause_constituent_relation_t
  0.07s 		enum clause_constituent_mother_object_type_t
  0.07s 		enum clause_constituent_unit_distance_to_mother_t
  0.07s 		otype book
  0.07s 			feature book (str) =def= None : node
  0.07s 		otype chapter
  0.07s 			feature book (

  4.77s 		objects in clause
  4.78s 		objects in clause_atom
  4.79s 		objects in half_verse
  4.79s 		objects in phrase
  4.80s 		objects in phrase_atom
  4.80s 		objects in sentence
  4.81s 		objects in sentence_atom
  4.81s 		objects in subphrase
  4.82s 		objects in verse
  4.82s 		objects in word
  4.86s 		objects in book
  4.87s 		objects in chapter
  4.87s 		objects in clause
  4.89s 		objects in clause_atom
  4.89s 		objects in half_verse
  4.90s 		objects in phrase
  4.92s 		objects in phrase_atom
  4.94s 		objects in sentence
  4.94s 		objects in sentence_atom
  4.94s 		objects in subphrase
  4.96s 		objects in verse
  4.96s 		objects in word
  5.08s 		objects in book
  5.08s 		objects in chapter
  5.08s 		objects in clause
  5.10s 		objects in clause_atom
  5.11s 		objects in half_verse
  5.11s 		objects in phrase
  5.14s 		objects in phrase_atom
  5.15s 		objects in sentence
  5.16s 		objects in sentence_atom
  5.16s 		objects in subphrase
  5.17s 		objects in verse
  5.18s

   |     0.07s T suffix_person        to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.01s T tab                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.01s T txt                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.09s T typ                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.05s T uvf                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.05s T vbe                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.05s T vbs                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.01s T verse                to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.06s T vs                   to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.06s T vt                   to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.17s T distributional_parent to /Users/dirk/github/etcbc/ext

# Tweak the book names

We load the fresh dataset, modify the names of the books a little, and store them in the feature `book@en`.

In [9]:
utils.caption(4, 'Book names')

metaData = {}

metaData['book@en'] = {
    'valueType': 'str',
    'language': 'English',
    'languageCode': 'en',
    'languageEnglish': 'english',
}

newFeatures = sorted(metaData)
newFeaturesStr = ' '.join(newFeatures)

..............................................................................................
.         30s Book names                                                                     .
..............................................................................................


In [10]:
utils.caption(0, 'Loading relevant features')

TF = Fabric(locations=thisTempTf, modules=[''])
api = TF.load('book')
api.makeAvailableIn(globals())

nodeFeatures = {}

bookNodes = list(F.otype.s('book'))

nodeFeatures['book@en'] = dict(zip(bookNodes, bookNames['en']))

|         35s Loading relevant features
This is Text-Fabric 3.0.6
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

71 features found and 0 ignored
  0.00s loading features ...
   |     0.05s T otype                from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.31s T oslots               from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.00s T book                 from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.01s T chapter              from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.01s T verse                from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.06s T g_cons               from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.06s T g_lex_utf8           from /Users/dirk/github/etcbc/extrabiblical/_te

In [11]:
utils.caption(0, 'Write book name features as TF')
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

|         42s Write book name features as TF
   |     0.00s T book@en              to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
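
The book names are attached to the book nodes purely by position: `zip` pairs the first name with the first node, and so on, and it silently stops at the shorter of the two sequences. A small guard can make a mismatch in length visible. The sketch below is illustrative only; `pair_names` is a hypothetical helper and the node numbers and codes are made up.

```python
def pair_names(nodes, names):
    """Pair book nodes with display names by position, refusing silent truncation."""
    if len(nodes) != len(names):
        raise ValueError(
            'expected {} names, got {}'.format(len(nodes), len(names))
        )
    return dict(zip(nodes, names))


# Hypothetical node numbers and internal book codes, for illustration only
book_nodes = [101, 102, 103]
book_codes = {101: 'B_1QM', 102: 'B_1QS', 103: 'Arad'}
names = ['1QM', '1QS', 'Arad']

mapping = pair_names(book_nodes, names)
for node in book_nodes:
    # Print code and name side by side so that a shifted list is easy to spot
    print('{:<6} -> {}'.format(book_codes[node], mapping[node]))
```

A length check does not catch names that are listed in a different order than the nodes, so printing the internal code next to the assigned name remains a useful manual check.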


# Diffs

Check differences with previous versions.

In [12]:
utils.checkDiffs(thisTempTf, thisTf)

..............................................................................................
.         44s Check differences with previous version                                        .
..............................................................................................
|         44s 	no features to add
|         44s 	no features to delete
|         44s 	72 features in common
|         44s book                      ... no changes
|         44s book@en                   ... no changes
|         44s chapter                   ... no changes
|         44s code                      ... no changes
|         44s det                       ... no changes
|         44s dist                      ... no changes
|         44s dist_unit                 ... no changes
|         44s distributional_parent     ... no changes
|         44s domain                    ... no changes
|         44s function                  ... no changes
|         44s functional_parent         ... no changes
| 
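
The helper `utils.checkDiffs` is not shown here. As a rough sketch of the kind of comparison its report suggests, the code below lists the feature files to add, to delete, and in common, and flags common files whose data differs. All names are hypothetical. It assumes that each `.tf` file starts with a metadata header that ends at the first blank line, and it ignores that header, since the header carries a timestamp that changes on every run.

```python
import os


def data_part(path):
    """Return the content of a feature file after its metadata header."""
    with open(path, encoding='utf8') as fh:
        text = fh.read()
    # Assumption: the metadata header ends at the first blank line
    _header, _sep, data = text.partition('\n\n')
    return data


def list_features(directory):
    """Return the set of .tf file names in directory (empty if it does not exist)."""
    if not os.path.isdir(directory):
        return set()
    return {f for f in os.listdir(directory) if f.endswith('.tf')}


def compare_feature_dirs(new_dir, old_dir):
    """Compare freshly generated features with a previously delivered version."""
    new_files = list_features(new_dir)
    old_files = list_features(old_dir)
    common = sorted(new_files & old_files)
    changed = [
        f for f in common
        if data_part(os.path.join(new_dir, f)) != data_part(os.path.join(old_dir, f))
    ]
    return {
        'add': sorted(new_files - old_files),
        'delete': sorted(old_files - new_files),
        'common': common,
        'changed': changed,
    }
```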

# Deliver 

Copy the new TF features from the temporary location where they have been created to their final destination.

In [13]:
utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.         52s Deliver data set to /Users/dirk/github/etcbc/extrabiblical/tf/0.1              .
..............................................................................................


# Compile TF

In [14]:
utils.caption(4, 'Load and compile standard TF features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('')

utils.caption(4, 'Load and compile all other TF features')
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

..............................................................................................
.         58s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 3.0.6
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

72 features found and 0 ignored
  0.00s loading features ...
   |     0.00s T book@en              from /Users/dirk/github/etcbc/extrabiblical/tf/0.1
   |     0.00s Feature overview: 65 for nodes; 4 for edges; 3 configs; 7 computed
  0.21s All features loaded/computed - for details use loadLog()
..............................................................................................
.         59s Load and compile all other TF features                           

# Examples

In [17]:
utils.caption(4, 'The books and their chapters')
for b in F.otype.s('book'):
    utils.caption(0, '\t* {} (coded as {})'.format(T.sectionFromNode(b)[0], F.book.v(b)))
    utils.caption(0, '\t\t{}'.format(', '.join(str(F.chapter.v(c)) for c in L.d(b, 'chapter'))))
    
utils.caption(4, 'Basic tests')
vn = T.nodeFromSection(('1QM', 1, 1))
cn = T.nodeFromSection(('1QM', 1))

utils.caption(0, 'First verse node = {}'.format(vn))
utils.caption(0, 'Text of first verse = {}'.format(T.text(L.d(vn, 'word'))))
utils.caption(0, 'First chapter node = {}'.format(cn))
utils.caption(0, 'Text of first chapter = {}'.format(T.text(L.d(cn, 'word'))))
utils.caption(4, 'Now as sentences and clauses and phrases')
for sn in L.d(cn, 'sentence'):
    utils.caption(0, '=======')
    # use a separate name for clause nodes so the chapter node cn is not overwritten
    for cln in L.d(sn, 'clause'):
        utils.caption(0, 'clause {}'.format(F.typ.v(cln)))
        for pn in L.d(cln, 'phrase'):
            utils.caption(0, '\tphrase {}'.format(F.function.v(pn)))
            utils.caption(0, '\t\t{}'.format(T.text(L.d(pn, 'word'))))
utils.caption(0, 'Top 20 lexemes with frequency =\n\t{}'.format(
    '\n\t'.join('{:<6} {:>5}x'.format(*x) for x in F.lex.freqList()[0:20]),
))


..............................................................................................
.      3m 16s The books and their chapters                                                   .
..............................................................................................
|      3m 16s 	* 1QM (coded as B_1QM)
|      3m 16s 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
|      3m 16s 	* 1QS (coded as B_1QS)
|      3m 16s 		1, 2, 3, 4, 5, 6, 7
|      3m 16s 	* Kuntillet_Ajrud (coded as Ajrud)
|      3m 16s 		18, 19, 20
|      3m 16s 	* Arad (coded as Arad)
|      3m 16s 		1, 2, 40
|      3m 16s 	* Balaam (coded as Balaam)
|      3m 16s 		1, 2
|      3m 16s 	* Ketef_Hinnom (coded as Ketef_Hinnom)
|      3m 16s 		1, 2
|      3m 16s 	* Lachish (coded as Lachish)
|      3m 16s 		3, 4, 5, 6
|      3m 16s 	* Mesha_Stela (coded as Mesa)
|      3m 16s 		1
|      3m 16s 	* Mesad_Hashavyahu (coded as Mesad_Hashavyahu)
|      3m 16s 		1
|      3m 16s 	* Pirqe (coded as Siloam)
|      3m 16s 		