<img align="right" src="images/tf-small.png"/>

#  MQL conversion to TF

This notebook converts extra-biblical texts, produced by the ETCBC in MQL format, into
[Text-Fabric](https://github.com/Dans-labs/text-fabric)
format.

In [1]:
bookNames = dict(
    en='''
        1QM
        1QS
        Kuntillet_Ajrud
        Arad
        Balaam
        Ketef_Hinnom
        Lachish
        Mesha_Stela
        Mesad_Hashavyahu
        Pirqe
        Shirata
        Siloam
    '''.strip().split(),
)

## Discussion

In [2]:
# standard library
import os
import sys
import re
import collections
from shutil import rmtree

# local helpers and Text-Fabric
import utils
from tf.fabric import Fabric

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [3]:
NAME = 'extrabiblical'
VERSION = '0.1'

In [4]:
repoBase = os.path.expanduser('~/github/etcbc')
thisRepo = '{}/{}'.format(repoBase, 'extrabiblical')

thisSource = '{}/source/{}'.format(thisRepo, VERSION)
mqlzFile = '{}/{}.mql.bz2'.format(thisSource, NAME)

thisTemp = '{}/_temp/{}'.format(thisRepo, VERSION)
thisTempSource = '{}/source'.format(thisTemp)
mqlFile = '{}/{}.mql'.format(thisTempSource, NAME)
thisTempTf = '{}/tf'.format(thisTemp)

thisTf = '{}/tf/{}'.format(thisRepo, VERSION)

# TF Settings

We add some custom information here.

* the MQL object type that corresponds to the TF slot type, typically `word`;
* a piece of metadata that will go into every feature (the time will be added automatically);
* suitable text formats for the `otext` feature of TF.

The `otext` feature is very sensitive to what is available in the source MQL,
so it needs to be configured here.
We save the configurations we need per source and version,
and we define a stripped-down default version to start with.

In [5]:
slotType = 'word'

featureMetaData = dict(
    dataset='ExtraBiblical',
    datasetName='Non Masoretic Texts related to the Hebrew Bible',
    author='Eep Talstra Centre for Bible and Computer',
    encoders='Constantijn Sikkel, Martijn Naaijer, Ulrik Petersen (MQL) and Dirk Roorda (TF)',
    website='http://etcbc.nl',
    email='m.naaijer@vu.nl',
)

oText = {
    '': {
        '': '''
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    },
    '0.1': '''
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
}

The next cell selects the proper otext material, falling back on a default if nothing
appropriate has been specified in `oText` for the current version.

In [6]:
thisOtext = oText.get(VERSION, oText[''])

if thisOtext is oText['']:
    utils.caption(0, 'WARNING: no otext feature info provided, using a meager default value')
    otextInfo = {}
else:
    utils.caption(0, 'INFO: otext feature information found')
    otextInfo = dict(line[1:].split('=', 1) for line in thisOtext.strip('\n').split('\n'))
    for x in sorted(otextInfo.items()):
        utils.caption(0, '\t{:<20} = "{}"'.format(*x))

|       0.00s INFO: otext feature information found
|       0.00s 	fmt:lex-orig-full    = "{g_lex_utf8}{g_suffix_utf8}"
|       0.00s 	fmt:lex-trans-full   = "{lex_utf8}{g_suffix}"
|       0.00s 	fmt:text-orig-full   = "{g_word_utf8}{g_suffix_utf8}"
|       0.00s 	fmt:text-trans-full  = "{g_word}{g_suffix}"
|       0.00s 	fmt:text-trans-plain = "{g_cons}{g_suffix}"
|       0.00s 	sectionFeatures      = "book,chapter,verse"
|       0.00s 	sectionTypes         = "book,chapter,verse"
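
The parsing step in the cell above can be tried in isolation. The following is a minimal sketch, assuming an otext block in the same `@key=value` layout; the helper name `parse_otext` and the shortened sample block are illustrative and not part of the notebook or of Text-Fabric.

```python
# Hypothetical, shortened otext block in the same "@key=value" layout as above.
otext_block = '''
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
'''


def parse_otext(block):
    """Turn '@key=value' lines into a dict, skipping blank lines."""
    info = {}
    for line in block.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the leading '@' and split on the first '=' only,
        # so that a value containing '=' stays intact.
        key, value = line[1:].split('=', 1)
        info[key] = value
    return info


otext_info = parse_otext(otext_block)
for key, value in sorted(otext_info.items()):
    print('{:<20} = "{}"'.format(key, value))
```

Splitting on the first `=` only mirrors the `split('=', 1)` call in the cell above.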


# Prepare

In [7]:
if not os.path.exists(thisTempSource):
    os.makedirs(thisTempSource)

utils.caption(0, 'bunzipping {} ...'.format(mqlzFile))
utils.bunzip(mqlzFile, mqlFile)
utils.caption(0, 'Done')

if os.path.exists(thisTempTf): rmtree(thisTempTf)
os.makedirs(thisTempTf)

|       0.01s bunzipping /Users/dirk/github/etcbc/extrabiblical/source/0.1/extrabiblical.mql.bz2 ...
|       0.01s 	NOTE: Using existing unzipped file which is newer than bzipped one
|       0.01s Done
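
The NOTE in the output suggests that the unzip step is skipped when an up-to-date unzipped file already exists. The implementation of `utils.bunzip` is not shown in this notebook, so the following is only a sketch of how such a check could work, using the standard `bz2` module; the function name `bunzip_if_stale` and the demo file names are assumptions.

```python
import bz2
import os
import shutil
import tempfile


def bunzip_if_stale(src_bz2, dst):
    """Decompress src_bz2 to dst unless dst exists and is at least as new.

    Returns True if decompression happened, False if it was skipped.
    (Hypothetical helper, not the actual utils.bunzip.)
    """
    if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src_bz2):
        return False
    with bz2.open(src_bz2, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout)
    return True


# Demo in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'demo.mql.bz2')
    dst = os.path.join(tmp, 'demo.mql')
    with bz2.open(src, 'wb') as f:
        f.write(b'// demo MQL content\n')
    first = bunzip_if_stale(src, dst)   # no target yet, so it decompresses
    second = bunzip_if_stale(src, dst)  # target is up to date, so it is skipped

print(first, second)
```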


# MQL to Text-Fabric
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [8]:
TF = Fabric(locations=thisTempTf, silent=True)
TF.importMQL(mqlFile, slotType=slotType, otext=otextInfo, meta=featureMetaData)

  0.00s Parsing mql source ...
  0.00s 		enum boolean_t
  0.01s 		enum phrase_determination_t
  0.01s 		enum language_t
  0.01s 		enum book_name_t
  0.01s 		enum lexical_set_t
  0.01s 		enum verbal_stem_t
  0.01s 		enum verbal_tense_t
  0.02s 		enum person_t
  0.02s 		enum number_t
  0.02s 		enum gender_t
  0.02s 		enum state_t
  0.02s 		enum part_of_speech_t
  0.02s 		enum phrase_type_t
  0.02s 		enum phrase_atom_relation_t
  0.03s 		enum phrase_relation_t
  0.03s 		enum phrase_atom_unit_distance_to_mother_t
  0.03s 		enum subphrase_relation_t
  0.03s 		enum subphrase_mother_object_type_t
  0.03s 		enum phrase_function_t
  0.03s 		enum clause_atom_type_t
  0.04s 		enum clause_type_t
  0.04s 		enum clause_kind_t
  0.04s 		enum clause_constituent_relation_t
  0.04s 		enum clause_constituent_mother_object_type_t
  0.04s 		enum clause_constituent_unit_distance_to_mother_t
  0.04s 		otype book
  0.04s 			feature book (str) =def= None : node
  0.05s 		otype chapter
  0.05s 			feature book (

  3.72s 		objects in clause
  3.72s 		objects in clause_atom
  3.72s 		objects in half_verse
  3.72s 		objects in phrase
  3.73s 		objects in phrase_atom
  3.73s 		objects in sentence
  3.74s 		objects in sentence_atom
  3.74s 		objects in subphrase
  3.75s 		objects in verse
  3.75s 		objects in word
  3.77s 		objects in book
  3.77s 		objects in chapter
  3.77s 		objects in clause
  3.78s 		objects in clause_atom
  3.78s 		objects in half_verse
  3.79s 		objects in phrase
  3.80s 		objects in phrase_atom
  3.81s 		objects in sentence
  3.82s 		objects in sentence_atom
  3.82s 		objects in subphrase
  3.83s 		objects in verse
  3.83s 		objects in word
  3.91s 		objects in book
  3.91s 		objects in chapter
  3.91s 		objects in clause
  3.92s 		objects in clause_atom
  3.93s 		objects in half_verse
  3.93s 		objects in phrase
  3.94s 		objects in phrase_atom
  3.95s 		objects in sentence
  3.95s 		objects in sentence_atom
  3.96s 		objects in subphrase
  3.96s 		objects in verse
  3.97s

   |     0.01s T tab                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.01s T txt                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.07s T typ                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.03s T uvf                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.03s T vbe                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.03s T vbs                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.00s T verse                to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.04s T vs                   to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.03s T vt                   to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.08s T distributional_parent to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.12s T functional_parent    to /Users/dirk/github/etcbc/ext

# Tweak the book names

We load the fresh dataset, modify the names of the books a little, and store them in feature `book@en`.

In [9]:
utils.caption(4, 'Book names')

metaData = {}

metaData['book@en'] = {
    'valueType': 'str',
    'language': 'English',
    'languageCode': 'en',
    'languageEnglish': 'english',
}

newFeatures = sorted(metaData)
newFeaturesStr = ' '.join(newFeatures)

..............................................................................................
.       6.93s Book names                                                                     .
..............................................................................................


In [10]:
utils.caption(0, 'Loading relevant features')

TF = Fabric(locations=thisTempTf, modules=[''])
api = TF.load('book')
api.makeAvailableIn(globals())

nodeFeatures = {}

# collect the book nodes in the order in which Text-Fabric yields them
bookNodes = list(F.otype.s('book'))

nodeFeatures['book@en'] = dict(zip(bookNodes, bookNames['en']))

|       6.95s Loading relevant features
This is Text-Fabric 3.0.4
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

71 features found and 0 ignored
  0.00s loading features ...
   |     0.03s T otype                from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.28s T oslots               from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.00s T book                 from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.00s T chapter              from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.00s T verse                from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.05s T g_cons               from /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf
   |     0.05s T g_lex_utf8           from /Users/dirk/github/etcbc/extrabiblical/_te
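
The cell above pairs book nodes with names purely by position, using `dict(zip(...))`. The sketch below, with made-up node numbers, illustrates that `zip` silently stops at the shorter of the two sequences, and shows one possible guard; `checked_mapping` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical book nodes; in the notebook they come from F.otype.s('book').
book_nodes = [101, 102, 103]
book_names = ['1QM', '1QS', 'Kuntillet_Ajrud', 'Arad']  # one name too many

# zip stops at the shorter sequence, so 'Arad' is silently dropped.
mapping = dict(zip(book_nodes, book_names))
print(mapping)


def checked_mapping(nodes, names):
    """Like dict(zip(nodes, names)), but refuse lists of different length."""
    if len(nodes) != len(names):
        raise ValueError(
            '{} book nodes but {} names'.format(len(nodes), len(names))
        )
    return dict(zip(nodes, names))


try:
    checked_mapping(book_nodes, book_names)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print('mismatch:', err)
```

A length check of this kind makes it visible when the list of names and the books in the dataset have drifted apart.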

In [11]:
utils.caption(0, 'Write book name features as TF')
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

|       9.19s Write book name features as TF
   |     0.00s T book@en              to /Users/dirk/github/etcbc/extrabiblical/_temp/0.1/tf


# Diffs

Check differences with the previously delivered version.

In [12]:
utils.checkDiffs(thisTempTf, thisTf)

..............................................................................................
.       9.21s Check differences with previous version                                        .
..............................................................................................
|       9.21s 	no features to add
|       9.21s 	no features to delete
|       9.21s 	72 features in common
|       9.21s book                      ... no changes
|       9.21s book@en                   ... no changes
|       9.21s chapter                   ... no changes
|       9.21s code                      ... no changes
|       9.22s det                       ... no changes
|       9.24s dist                      ... no changes
|       9.24s dist_unit                 ... no changes
|       9.24s distributional_parent     ... no changes
|       9.26s domain                    ... no changes
|       9.26s function                  ... no changes
|       9.27s functional_parent         ... no changes
| 
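
The implementation of `utils.checkDiffs` is not shown here. As a rough sketch of the kind of comparison reported above, the code below classifies `.tf` files in two directories as added, deleted, changed or unchanged. It assumes that metadata lines start with `@` and ignores them, so that an automatically added timestamp does not count as a change; the function names and demo files are hypothetical.

```python
import os
import tempfile


def data_lines(path):
    """Return the lines of a feature file, ignoring '@' metadata lines."""
    with open(path, encoding='utf8') as fh:
        return [line for line in fh if not line.startswith('@')]


def tf_files(directory):
    """Return the set of .tf file names in directory (empty if it is missing)."""
    if not os.path.isdir(directory):
        return set()
    return {f for f in os.listdir(directory) if f.endswith('.tf')}


def compare_feature_dirs(new_dir, old_dir):
    """Classify .tf files as added, deleted, changed or unchanged."""
    new, old = tf_files(new_dir), tf_files(old_dir)
    report = {
        'added': sorted(new - old),
        'deleted': sorted(old - new),
        'changed': [],
        'unchanged': [],
    }
    for name in sorted(new & old):
        same = (data_lines(os.path.join(new_dir, name))
                == data_lines(os.path.join(old_dir, name)))
        report['unchanged' if same else 'changed'].append(name)
    return report


def write(path, text):
    with open(path, 'w', encoding='utf8') as fh:
        fh.write(text)


# Demo with made-up feature files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    new_dir = os.path.join(tmp, 'new')
    old_dir = os.path.join(tmp, 'old')
    os.makedirs(new_dir)
    os.makedirs(old_dir)
    write(os.path.join(old_dir, 'book.tf'), '@node\n@dateWritten=2017\n\nB_1QM\n')
    write(os.path.join(new_dir, 'book.tf'), '@node\n@dateWritten=2018\n\nB_1QM\n')
    write(os.path.join(new_dir, 'extra.tf'), '@node\n\nx\n')
    report = compare_feature_dirs(new_dir, old_dir)

print(report)
```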

# Deliver 

Copy the new TF features from the temporary location where they have been created to their final destination.

In [13]:
utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.       9.90s Deliver data set to /Users/dirk/github/etcbc/extrabiblical/tf/0.1              .
..............................................................................................


# Compile TF

In [14]:
utils.caption(4, 'Load and compile standard TF features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('')

utils.caption(4, 'Load and compile all other TF features')
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

..............................................................................................
.         10s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 3.0.4
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

72 features found and 0 ignored
  0.00s loading features ...
   |     0.00s T book@en              from /Users/dirk/github/etcbc/extrabiblical/tf/0.1
   |     0.00s Feature overview: 65 for nodes; 4 for edges; 3 configs; 7 computed
  0.23s All features loaded/computed - for details use loadLog()
..............................................................................................
.         10s Load and compile all other TF features                           

# Examples

In [15]:
utils.caption(4, 'The books and their chapters')
for b in F.otype.s('book'):
    utils.caption(0, '\t* {} (coded as {})'.format(T.sectionFromNode(b)[0], F.book.v(b)))
    utils.caption(0, '\t\t{}'.format(', '.join(str(F.chapter.v(c)) for c in L.d(b, 'chapter'))))
    
utils.caption(4, 'Basic tests')
vn = T.nodeFromSection(('1QM', 1, 1))
cn = T.nodeFromSection(('1QM', 1))

utils.caption(0, 'First verse node = {}'.format(vn))
utils.caption(0, 'Text of first verse = {}'.format(T.text(L.d(vn, 'word'))))
utils.caption(0, 'First chapter node = {}'.format(cn))
utils.caption(0, 'Text of first chapter = {}'.format(T.text(L.d(cn, 'word'))))
utils.caption(4, 'Now as sentences and clauses')
for sn in L.d(cn, 'sentence'):
    utils.caption(0, '=======')
    # use a separate name for clause nodes so the chapter node `cn` is not overwritten
    for cln in L.d(sn, 'clause'):
        utils.caption(0, 'clause {}'.format(F.typ.v(cln)))
        for pn in L.d(cln, 'phrase'):
            utils.caption(0, '\tphrase {}'.format(F.function.v(pn)))
            utils.caption(0, '\t\t{}'.format(T.text(L.d(pn, 'word'))))
utils.caption(0, 'Top 20 lexemes with frequency =\n{}'.format('\n\t'.join(repr(x) for x in F.lex.freqList()[0:20])))


..............................................................................................
.         13s The books and their chapters                                                   .
..............................................................................................
|         13s 	* 1QM (coded as B_1QM)
|         13s 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
|         13s 	* 1QS (coded as B_1QS)
|         13s 		1, 2, 3, 4, 5, 6, 7
|         13s 	* Kuntillet_Ajrud (coded as Ajrud)
|         13s 		18, 19, 20
|         13s 	* Arad (coded as Arad)
|         13s 		1, 2, 40
|         13s 	* Balaam (coded as Balaam)
|         13s 		1, 2
|         13s 	* Ketef_Hinnom (coded as Ketef_Hinnom)
|         13s 		1, 2
|         13s 	* Lachish (coded as Lachish)
|         13s 		3, 4, 5, 6
|         13s 	* Mesha_Stela (coded as Mesa)
|         13s 		1
|         13s 	* Mesad_Hashavyahu (coded as Mesad_Hashavyahu)
|         13s 		1
|         13s 	* Pirqe (coded as Siloam)
|         13s 		

|         13s 		ל 
|         13s clause NmCl
|         13s 	phrase Modi
|         13s 		ון 
|         13s clause NmCl
|         13s 	phrase Adju
|         13s 		גדול 
|         13s clause NmCl
|         13s 	phrase Modi
|         13s 		מ 
|         13s clause ZYq0
|         13s 	phrase Pred
|         13s 		יתנו 
|         13s 	phrase Objc
|         13s 		יד 
|         13s clause NmCl
|         13s 	phrase Modi
|         13s 		בכל 
|         13s Top 20 lexemes with frequency =
('W', 1377)
	('H', 887)
	('L', 825)
	('B', 740)
	('W=', 379)
	('KL', 348)
	('=', 250)
	('MN', 182)
	('M', 177)
	('K', 175)
	('>L==', 172)
	('>JC', 166)
	('<L', 144)
	('L>', 127)
	('>T', 114)
	('>CR', 106)
	('J', 98)
	('HM', 94)
	('MLXMH', 89)
	('KJ', 85)
