<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


#  MQL conversion to TF

This notebook converts extra-biblical texts, produced by the ETCBC in MQL format, into
[Text-Fabric](https://github.com/Dans-labs/text-fabric)
format.

In [None]:
import os
import utils
from shutil import rmtree
from tf.fabric import Fabric

In [22]:
bookNames = {
    "0.1": dict(
        en="""
          1QM
          1QS
          Kuntillet_Ajrud
          Arad
          Balaam
          Ketef_Hinnom
          Lachish
          Mesha_Stela
          Mesad_Hashavyahu
          Pirqe
          Shirata
          Siloam
      """.strip().split(),
    ),
    "0.2": dict(
        en="""
          1QH
          1QM
          1QS
          Kuntillet_Ajrud
          Arad
          Balaam
          Ketef_Hinnom
          Lachish
          Mesha_Stela
          Mesad_Hashavyahu
          Pirqe
          Shirata
          Siloam
      """.strip().split(),
    ),
}

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [24]:
ORG = "etcbc"
REPO = "extrabiblical"
VERSION = "0.2"

In [25]:
repoBase = os.path.expanduser(f"~/github/{ORG}")
thisRepo = f"{repoBase}/{REPO}"

thisSource = f"{thisRepo}/source/{VERSION}"
mqlzFile = f"{thisSource}/{REPO}.mql.bz2"

thisTemp = f"{thisRepo}/_temp/{VERSION}"
thisTempSource = f"{thisTemp}/source"
mqlFile = f"{thisTempSource}/{REPO}.mql"
thisTempTf = f"{thisTemp}/tf"

thisTf = f"{thisRepo}/tf/{VERSION}"

# TF Settings

We add some custom information here.

* the MQL object type that corresponds to the TF slot type, typically `word`;
* a piece of metadata that will go into every feature (the time will be added automatically);
* suitable text formats for the `otext` feature of TF.

The `otext` feature is very sensitive to what is available in the source MQL,
so it needs to be configured here.
We save the configs we need per version,
and we define a stripped-down default to fall back on.

In [26]:
slotType = "word"

featureMetaData = dict(
    dataset="ExtraBiblical",
    version=VERSION,
    datasetName="Non Masoretic Texts related to the Hebrew Bible",
    author="Eep Talstra Centre for Bible and Computer",
    encoders="Constantijn Sikkel (MQL), Martijn Naaijer, and Dirk Roorda (TF)",
    website="http://etcbc.nl",
    email="m.naaijer@vu.nl",
)

oText = {
    "": {
        "": """
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",
    },
    "0.1": """
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",
    "0.2": """
@fmt:lex-orig-full={g_lex_utf8}{g_suffix_utf8}
@fmt:lex-trans-full={lex_utf8}{g_suffix}
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@fmt:text-trans-full={g_word}{g_suffix}
@fmt:text-trans-plain={g_cons}{g_suffix}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
""",
}

The next cell selects the proper `otext` material for the current version, falling back on the default if nothing
appropriate has been specified in `oText`.

In [27]:
thisOtext = oText.get(VERSION, oText[""])

if thisOtext is oText[""]:
    utils.caption(
        0, "WARNING: no otext feature info provided, using a meager default value"
    )
    otextInfo = {}
else:
    utils.caption(0, "INFO: otext feature information found")
    otextInfo = dict(
        line[1:].split("=", 1) for line in thisOtext.strip("\n").split("\n")
    )
    for x in sorted(otextInfo.items()):
        utils.caption(0, '\t{:<20} = "{}"'.format(*x))

|      7m 18s INFO: otext feature information found
|      7m 18s 	fmt:lex-orig-full    = "{g_lex_utf8}{g_suffix_utf8}"
|      7m 18s 	fmt:lex-trans-full   = "{lex_utf8}{g_suffix}"
|      7m 18s 	fmt:text-orig-full   = "{g_word_utf8}{g_suffix_utf8}"
|      7m 18s 	fmt:text-trans-full  = "{g_word}{g_suffix}"
|      7m 18s 	fmt:text-trans-plain = "{g_cons}{g_suffix}"
|      7m 18s 	sectionFeatures      = "book,chapter,verse"
|      7m 18s 	sectionTypes         = "book,chapter,verse"
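The parsing step above can be tried in isolation. The following is a minimal sketch, not part of the notebook: `sample_otext` is a shortened, hypothetical configuration in the same `@key=value` shape, and `parse_otext` is an illustrative helper name. It splits on the first `=` only, so a format value that itself contains `=` stays intact.

```python
# Hypothetical, shortened otext configuration (same "@key=value" shape as above).
sample_otext = """
@fmt:text-orig-full={g_word_utf8}{g_suffix_utf8}
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
"""


def parse_otext(text):
    """Turn '@key=value' lines into a dict, splitting on the first '=' only."""
    info = {}
    for line in text.strip("\n").split("\n"):
        if not line.startswith("@"):
            continue  # skip blank or malformed lines
        key, value = line[1:].split("=", 1)
        info[key] = value
    return info


otext_info = parse_otext(sample_otext)
for key, value in sorted(otext_info.items()):
    print(f'{key:<20} = "{value}"')
```

The resulting dictionary has the shape that is passed as `otext=otextInfo` to `TF.importMQL` further down.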


# Prepare

In [28]:
# Create the temp source directory if it does not exist yet
os.makedirs(thisTempSource, exist_ok=True)

utils.caption(0, "bunzipping {} ...".format(mqlzFile))
utils.bunzip(mqlzFile, mqlFile)
utils.caption(0, "Done")

if os.path.exists(thisTempTf):
    rmtree(thisTempTf)
os.makedirs(thisTempTf)

|      7m 19s bunzipping /Users/dirk/github/etcbc/extrabiblical/source/0.2/extrabiblical.mql.bz2 ...
|      7m 19s 	NOTE: Using existing unzipped file which is newer than bzipped one
|      7m 19s Done
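The implementation of `utils.bunzip` is not shown in this notebook. Judging only from the note in the output above, it skips the work when an unzipped file already exists that is newer than the bzipped one. A rough sketch of that behaviour with the standard library could look as follows; the function name `bunzip_if_newer` and its return value are assumptions made for illustration.

```python
import bz2
import os
import shutil


def bunzip_if_newer(src_bz2, dst):
    """Decompress src_bz2 to dst unless dst exists and is at least as new.

    Returns True when decompression actually happened, False when it was skipped.
    (Hypothetical helper; not the actual utils.bunzip.)
    """
    if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src_bz2):
        return False
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    # Stream the data so that large MQL dumps do not have to fit in memory.
    with bz2.open(src_bz2, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return True
```

Calling it twice in a row on the same pair of files decompresses the first time and skips the second time.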


# MQL to Text-Fabric
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [29]:
TF = Fabric(locations=thisTempTf, silent=True)
TF.importMQL(mqlFile, slotType=slotType, otext=otextInfo, meta=featureMetaData)

  0.00s Parsing mql source ...
  0.00s 		enum boolean_t
  0.00s 		enum phrase_determination_t
  0.00s 		enum language_t
  0.00s 		enum book_name_t
  0.00s 		enum lexical_set_t
  0.00s 		enum verbal_stem_t
  0.00s 		enum verbal_tense_t
  0.01s 		enum person_t
  0.01s 		enum number_t
  0.01s 		enum gender_t
  0.01s 		enum state_t
  0.01s 		enum part_of_speech_t
  0.01s 		enum phrase_type_t
  0.01s 		enum phrase_atom_relation_t
  0.01s 		enum phrase_relation_t
  0.01s 		enum phrase_atom_unit_distance_to_mother_t
  0.01s 		enum subphrase_relation_t
  0.01s 		enum subphrase_mother_object_type_t
  0.01s 		enum phrase_function_t
  0.01s 		enum clause_atom_type_t
  0.02s 		enum clause_type_t
  0.02s 		enum clause_kind_t
  0.02s 		enum clause_constituent_relation_t
  0.02s 		enum clause_constituent_mother_object_type_t
  0.02s 		enum clause_constituent_unit_distance_to_mother_t
  0.02s 		otype book
  0.02s 			feature book (str) =def= None : node
  0.02s 		otype chapter
  0.02s 			feature book (

  0.03s OK: oslots is valid


   |     0.00s T book                 to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.00s T chapter              to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.02s T code                 to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.08s T det                  to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.02s T dist                 to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.01s T dist_unit            to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.01s T domain               to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.04s T function             to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.06s T g_cons               to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.06s T g_cons_utf8          to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |     0.06s T g_lex                to /Users/dirk/github/etcbc/extr

# Tweak the book names

We load the fresh dataset, modify the names of the books a little, and store them in the feature `book@en`.

In [30]:
utils.caption(4, "Book names")

metaData = {}

metaData["book@en"] = {
    "valueType": "str",
    "language": "English",
    "languageCode": "en",
    "languageEnglish": "english",
}

newFeatures = sorted(metaData)
newFeaturesStr = " ".join(newFeatures)

..............................................................................................
.      7m 42s Book names                                                                     .
..............................................................................................


In [31]:
utils.caption(0, "Loading relevant features")

TF = Fabric(locations=thisTempTf, modules=[""])
api = TF.load("book")
api.makeAvailableIn(globals())

nodeFeatures = {}

# Book nodes in canonical order; they are paired with the names by position
bookNodes = list(F.otype.s("book"))

nodeFeatures["book@en"] = dict(zip(bookNodes, bookNames[VERSION]["en"]))

|      7m 46s Loading relevant features
This is Text-Fabric 7.3.15
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

71 features found and 0 ignored
  0.00s loading features ...
   |     0.01s T book                 from /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf
   |      |     0.09s C __levels__           from otype, oslots, otext
   |      |     1.17s C __order__            from otype, oslots, __levels__
   |      |     0.05s C __rank__             from otype, __order__
   |      |     1.04s C __levUp__            from otype, oslots, __rank__
   |      |     0.60s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.25s C __boundary__         from otype, oslots, __rank__
   |      |     0.01s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
  4.46s All features loaded/computed - for details use loadLog()
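The pairing of book nodes with names relies on position: the n-th book node gets the n-th name in the list for this version. Note that `zip` silently stops at the shorter sequence, so a missing name would go unnoticed. A small sketch of a stricter variant, with node numbers and the helper name `pair_books` chosen purely for illustration:

```python
# Illustrative node numbers; in the notebook they come from F.otype.s("book").
book_nodes = [39863, 39864, 39865]
book_names = ["1QH", "1QM", "1QS"]


def pair_books(nodes, names):
    """Map nodes to names by position, refusing lists of unequal length."""
    if len(nodes) != len(names):
        raise ValueError(f"{len(nodes)} book nodes but {len(names)} names")
    return dict(zip(nodes, names))


book_en = pair_books(book_nodes, book_names)
print(book_en)
```

This is why version `0.2` needs its own list: adding `1QH` at the front shifts every later name by one position.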


In [32]:
utils.caption(0, "Write book name features as TF")
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

|      7m 54s Write book name features as TF
   |     0.00s T book@en              to /Users/dirk/github/etcbc/extrabiblical/_temp/0.2/tf


True

# Diffs

Check differences with previous versions.

In [33]:
utils.checkDiffs(thisTempTf, thisTf)

..............................................................................................
.      7m 57s Check differences with previous version                                        .
..............................................................................................
|      7m 57s 	no features to add
|      7m 57s 	no features to delete
|      7m 57s 	72 features in common
|      7m 57s book                      ... no changes
|      7m 57s book@en                   ... differences after the metadata
|      7m 57s 	line      2 OLD -->39863	1QM<--
|      7m 57s 	line      2 NEW -->39863	1QH<--
|      7m 57s 	line      3 OLD -->1QS<--
|      7m 57s 	line      3 NEW -->1QM<--
|      7m 57s 	line      4 OLD -->Kuntillet_Ajrud<--
|      7m 57s 	line      4 NEW -->1QS<--
|      7m 57s 	line      5 OLD -->Arad<--
|      7m 57s 	line      5 NEW -->Kuntillet_Ajrud<--

|      7m 57s chapter                   ... no changes
|      7m 57s code                      ... no changes
|
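`utils.checkDiffs` is a helper whose code is not part of this notebook. The first lines of its report (features to add, to delete, and in common) can be approximated by comparing the sets of `.tf` file names in the two directories. The sketch below shows only that part and assumes one `.tf` file per feature; both function names are hypothetical.

```python
import os


def feature_names(directory):
    """Names of the .tf files in a directory, without the extension."""
    return {
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.endswith(".tf")
    }


def compare_feature_sets(new_dir, old_dir):
    """Return (to_add, to_delete, in_common) as sorted lists of feature names."""
    new, old = feature_names(new_dir), feature_names(old_dir)
    return sorted(new - old), sorted(old - new), sorted(new & old)
```

The line-by-line comparison of feature contents, as shown for `book@en` above, would be a separate step on the files that both directories have in common.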

# Deliver

Copy the new TF features from the temporary location where they have been created to their final destination.

In [34]:
utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.      8m 02s Deliver data set to /Users/dirk/github/etcbc/extrabiblical/tf/0.2              .
..............................................................................................


# Compile TF

In [35]:
utils.caption(4, "Load and compile standard TF features")
TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("")

utils.caption(4, "Load and compile all other TF features")
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

..............................................................................................
.      8m 39s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 7.3.15
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

72 features found and 0 ignored
  0.00s loading features ...
  0.35s All features loaded/computed - for details use loadLog()
..............................................................................................
.      8m 40s Load and compile all other TF features                                         .
..............................................................................................
   |     0.00s Feature overview: 65 for nodes; 4 for edges; 3 configs; 7 computed
  0.00s loading features ...
   |     0.01s T code                 from /Users/dirk/github/etcbc/extrabiblical/tf/0.2
   |   

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('ensureLoaded', 'TF', 'ignored', 'loadLog')),
 ('Locality', 'locality', ('L Locality',)),
 ('Misc', 'messaging', ('cache', 'error', 'indent', 'info', 'reset')),
 ('Nodes',
  'navigating-nodes',
  ('N Nodes', 'sortKey', 'sortKeyTuple', 'otypeRank', 'sortNodes')),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Examples

In [36]:
utils.caption(4, "The books and their chapters")
for b in F.otype.s("book"):
    utils.caption(
        0, "\t* {} (coded as {})".format(T.sectionFromNode(b)[0], F.book.v(b))
    )
    utils.caption(
        0, "\t\t{}".format(", ".join(str(F.chapter.v(c)) for c in L.d(b, "chapter")))
    )

utils.caption(4, "Basic tests")
vn = T.nodeFromSection(("1QM", 1, 1))
cn = T.nodeFromSection(("1QM", 1))

utils.caption(0, "First verse node = {}".format(vn))
utils.caption(0, "Text of first verse = {}".format(T.text(L.d(vn, "word"))))
utils.caption(0, "First chapter node = {}".format(cn))
utils.caption(0, "Text of first chapter = {}".format(T.text(L.d(cn, "word"))))
utils.caption(4, "Now as sentences and clauses and phrases")
for sn in L.d(cn, "sentence"):
    utils.caption(0, "=======")
    for cln in L.d(sn, "clause"):  # separate name, so the chapter node cn is not overwritten
        utils.caption(0, "clause {}".format(F.typ.v(cln)))
        for pn in L.d(cln, "phrase"):
            utils.caption(0, "\tphrase {}".format(F.function.v(pn)))
            utils.caption(0, "\t\t{}".format(T.text(L.d(pn, "word"))))
utils.caption(
    0,
    "Top 20 lexemes with frequency =\n\t{}".format(
        "\n\t".join("{:<6} {:>5}x".format(*x) for x in F.lex.freqList()[0:20]),
    ),
)


..............................................................................................
.      8m 48s The books and their chapters                                                   .
..............................................................................................
|      8m 48s 	* 1QH (coded as B_1QHa)
|      8m 48s 		3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29
|      8m 48s 	* 1QM (coded as B_1QM)
|      8m 48s 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
|      8m 48s 	* 1QS (coded as B_1QS)
|      8m 48s 		1, 2, 3, 4, 5, 6, 7
|      8m 48s 	* Kuntillet_Ajrud (coded as Ajrud)
|      8m 48s 		18, 19, 20
|      8m 48s 	* Arad (coded as Arad)
|      8m 48s 		1, 2, 40
|      8m 48s 	* Balaam (coded as Balaam)
|      8m 48s 		1, 2
|      8m 48s 	* Ketef_Hinnom (coded as Ketef_Hinnom)
|      8m 48s 		1, 2
|      8m 48s 	* Lachish (coded as Lachish)
|      8m 48s 		3, 4, 5, 6
|      8m 48s 	* Mesha_Stela (coded as Mesa)
