<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


![mql](images/emdros.png)

# CALAP from MQL into TF

This notebook reads an
[MQL](https://emdros.org/mql.html)
dump of a version of the CALAP (Peshitta) dataset
and transforms it into a
[Text-Fabric](https://github.com/Dans-labs/text-fabric)
resource.

The dump is obtained from the DANS archive, the dataset
*Computer-Assisted Linguistic Analysis of the Peshitta*
by Peursen, Prof. Dr. W.T. van (Eep Talstra Centre for Bible and Computer, VU University Amsterdam), 2005,
with [DOI 10.17026/dans-zv9-w9d2](https://doi.org/10.17026/dans-zv9-w9d2).

**N.B.:**
There is an error in the dump: the last monad (slot) is 53920, but the last book is declared to extend
to the non-existent slot 53921. After correction, the declaration reads:

```
CREATE OBJECT
FROM MONADS =  { 35794-53920 } 
WITH ID_D = 144485
[book
  book := Sirach;
]
GO
```

The correction consists of replacing `53921` by `53920` in this one place only (line 503).
We made this correction manually, bzipped the result, and stored it in the *source* subdirectory
of this repo.
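
The manual fix could also be scripted. The following is a minimal sketch, not the procedure that was actually used: the function name `fix_last_monad` is hypothetical, the dump is assumed to be UTF-8 text, and the faulty value is assumed to occur exactly once as the end of a monad range.

```python
import bz2
import re

# Matches a monad range that ends in the faulty slot, such as "{ 35794-53921 }".
# Anchoring on the braces avoids touching other numbers that contain "53921".
FAULTY_RANGE = re.compile(r"(\{\s*\d+-)53921(\s*\})")


def fix_last_monad(src_path, dst_path):
    """Correct the end of the last monad range and write a bzipped copy.

    Hypothetical helper: nothing is written unless exactly one
    faulty range is found in the uncompressed dump at `src_path`.
    """
    with open(src_path, encoding="utf8") as fh:
        text = fh.read()
    fixed, count = FAULTY_RANGE.subn(r"\g<1>53920\g<2>", text)
    if count != 1:
        raise ValueError(f"expected 1 faulty range, found {count}")
    with bz2.open(dst_path, "wt", encoding="utf8") as fh:
        fh.write(fixed)
```

Refusing to write when the count is not exactly one keeps the script from silently changing a dump that has already been corrected.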

The source only has the text in transliteration, in the feature `surface_consonants`.
We provide a key to the transliteration and a feature `cons` that contains the back-transliterated Unicode strings.

<img align="left" src="images/peshitta.png"/>

In [1]:
import os,sys,re,collections
from shutil import rmtree
from tf.fabric import Fabric
import utils

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [2]:
PROJECT = 'calap'
VERSION = '2014'

repoBase = os.path.expanduser('~/github/etcbc')
thisRepo = '{}/{}'.format(repoBase, PROJECT)

thisSource = '{}/source/{}'.format(thisRepo, VERSION)
mqlzFile = '{}/{}.mql.bz2'.format(thisSource, PROJECT)

thisTemp = '{}/_temp/{}'.format(thisRepo, VERSION)
thisTempSource = '{}/source'.format(thisTemp)
mqlFile = '{}/{}.mql'.format(thisTempSource, PROJECT)
thisTempTf = '{}/tf'.format(thisTemp)

thisTf = '{}/tf/{}'.format(thisRepo, VERSION)
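
To make the resulting layout concrete, the sketch below recomputes the main paths for a made-up base directory `/data/etcbc`. The helper name `layout` and the base directory are only for illustration; the notebook itself uses the string formatting shown above.

```python
import os


def layout(repo_base, project, version):
    """Return the main files and directories used by the conversion."""
    repo = os.path.join(repo_base, project)
    temp = os.path.join(repo, "_temp", version)
    return {
        # bzipped MQL dump as stored in the repo
        "mqlzFile": os.path.join(repo, "source", version, f"{project}.mql.bz2"),
        # unzipped working copy of the dump
        "mqlFile": os.path.join(temp, "source", f"{project}.mql"),
        # freshly generated TF files, before delivery
        "thisTempTf": os.path.join(temp, "tf"),
        # final destination of the TF files
        "thisTf": os.path.join(repo, "tf", version),
    }


for name, path in layout("/data/etcbc", "calap", "2014").items():
    print(f"{name:<12} {path}")
```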

The conversion only runs when needed; set `FORCE` to `True` to force an update of everything.

In [3]:
FORCE = False

# Test

Check whether this conversion is needed in the first place.
Only when run as a script.

In [4]:
testFile = '{}/.tf/otype.tfx'.format(thisTf)
(good, work) = utils.mustRun(mqlzFile, testFile, force=FORCE)
print(f'The text fabric files {"must be updated" if good else "are up to date"}')

|       0.00s 	Source /Users/dirk/github/etcbc/calap/source/2014/calap.mql.bz2 exists
|       0.00s 	Destination /Users/dirk/github/etcbc/calap/tf/2014/.tf/otype.tfx exists
|       0.00s 	Destination /Users/dirk/github/etcbc/calap/tf/2014/.tf/otype.tfx up to date
The text fabric files must be updated
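
`utils.mustRun` is a helper local to this repository and its source is not shown here. Judging from the log, it looks at the source file and the destination file. A minimal sketch of such a freshness check, assuming that modification times are what gets compared, could look like this; the name `needs_update` is hypothetical.

```python
import os


def needs_update(source, destination, force=False):
    """Tell whether `destination` has to be rebuilt from `source`.

    Assumption: the destination is stale when it is missing or older
    than the source; `force=True` always triggers a rebuild.
    """
    if force or not os.path.exists(destination):
        return True
    return os.path.getmtime(source) > os.path.getmtime(destination)
```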


# TF Settings

We add some custom information here:

* the MQL object type that corresponds to the Text-Fabric slot type, typically `word`;
* a piece of metadata that will go into every feature (the time will be added automatically);
* suitable text formats for the `otext` feature of TF.

The `otext` feature is very sensitive to what is available in the source MQL,
so it needs to be configured here.
We save the configurations we need per source and version,
and we define a stripped-down default to start with.

In [5]:
slotType = 'word'

featureMetaData = dict(
    project='CALAP and TURGAMA',
    organizations='Peshitta Institute Leiden; Werkgroep Informatica at the Vrije Universiteit Amsterdam (WIVU)',
    projectPeriod='1999-2005, 2005-2010',
    projectUrl='http://www.nwo.nl/onderzoek-en-resultaten/onderzoeksprojecten/78/1900123778.html',
    encoders='Janet Dyk, Percy van Keulen, Wido van Peursen, Constantijn Sikkel, HendrikJan Bosman, Konrad Jenner, Eep Talstra, Dirk Bakker, Jeffrey A. Volkmer, Ariel Gutman',
    language='Syriac',
    iso639='syc',
    dataset='CALAP',
    version=VERSION,
    datasetName='Computer-Assisted Linguistic Analysis of the Peshitta',
    author='Eep Talstra Centre for Bible and Computer',
    converter='Dirk Roorda (TF)',
)

oText = {
    '': {
        '': '''
@fmt:text-trans-full={surface_consonants} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
    },
    '2014': '''
@fmt:text-trans-full={surface_consonants} 
@sectionFeatures=book,chapter,verse
@sectionTypes=book,chapter,verse
''',
}

The next cell selects the proper `otext` material, falling back on a default if nothing
appropriate has been specified in `oText`.

In [6]:
thisOtext = oText.get(VERSION, oText[''])

if thisOtext is oText['']:
    utils.caption(0, 'WARNING: no otext feature info provided, using a meager default value')
    otextInfo = {}
else:
    utils.caption(0, 'INFO: otext feature information found')
    otextInfo = dict(line[1:].split('=', 1) for line in thisOtext.strip('\n').split('\n'))
    for x in sorted(otextInfo.items()):
        utils.caption(0, '\t{:<20} = "{}"'.format(*x))

|       0.04s INFO: otext feature information found
|       0.04s 	fmt:text-trans-full  = "{surface_consonants} "
|       0.04s 	sectionFeatures      = "book,chapter,verse"
|       0.04s 	sectionTypes         = "book,chapter,verse"
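
The dictionary construction in the cell above turns each `@key=value` line into one entry. The standalone sketch below replays that step on the 2014 configuration, to show the effect of `line[1:]` (dropping the `@`) and of `split('=', 1)` (splitting on the first `=` only). The variable names `config` and `otext_info` are only for this illustration.

```python
# The 2014 configuration, written out line by line so that the
# trailing space in the text format stays visible.
config = (
    '\n'
    '@fmt:text-trans-full={surface_consonants} \n'
    '@sectionFeatures=book,chapter,verse\n'
    '@sectionTypes=book,chapter,verse\n'
)

otext_info = dict(
    line[1:].split('=', 1)                      # drop '@', split key from value
    for line in config.strip('\n').split('\n')  # one entry per non-empty line
)

for key, value in sorted(otext_info.items()):
    print(f'{key:<20} = "{value}"')
```

The trailing space in the format value is significant: it is what separates consecutive words when a text is rendered.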


# Overview

The program has several stages:
   
1. **prepare** the source (bunzip it if needed)
1. **convert** the MQL file into a Text-Fabric dataset
1. **differences** (informational)
1. **deliver** the TF data at its destination directory
1. **compile** all TF features to binary format

# Prepare

Check the source, bunzip it if needed, and empty the result directory.

In [7]:
if not os.path.exists(thisTempSource):
    os.makedirs(thisTempSource)

utils.caption(0, 'bunzipping {} ...'.format(mqlzFile))
utils.bunzip(mqlzFile, mqlFile)
utils.caption(0, 'Done')

if os.path.exists(thisTempTf): rmtree(thisTempTf)
os.makedirs(thisTempTf)

|       0.06s bunzipping /Users/dirk/github/etcbc/calap/source/2014/calap.mql.bz2 ...
|       0.06s 	NOTE: Using existing unzipped file which is newer than bzipped one
|       0.06s Done
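
The note in the log suggests that `utils.bunzip` skips decompression when an unzipped copy newer than the archive already exists. The helper itself is not shown in this notebook; the following sketch of that behaviour uses only the standard library, and the name `bunzip_if_stale` is hypothetical.

```python
import bz2
import os
import shutil


def bunzip_if_stale(src, dst):
    """Decompress `src` to `dst` unless `dst` exists and is at least as new.

    Returns True when the file was decompressed, False when the
    existing unzipped copy was reused.
    """
    if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src):
        return False
    # Stream the data so that a large dump does not have to fit in memory.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return True
```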


# MQL to Text-Fabric
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [8]:
TF = Fabric(locations=thisTempTf, silent=True)
TF.importMQL(mqlFile, slotType=slotType, otext=otextInfo, meta=featureMetaData)

  0.00s Parsing mql source ...
  0.00s 		enum boolean
  0.00s 		enum lexical_set_e
  0.00s 		enum phrase_type_e
  0.00s 		enum phrase_atom_type_e
  0.00s 		enum phrase_function_e
  0.01s 		enum book_name_e
  0.01s 		enum part_of_speech_e
  0.01s 		enum gender_e
  0.01s 		enum number_e
  0.01s 		enum person_e
  0.01s 		enum stem_e
  0.01s 		enum state_e
  0.01s 		enum voice_e
  0.01s 		enum determination_e
  0.01s 		enum phrase_atom_relation_e
  0.01s 		enum subphrase_type_e
  0.01s 		enum subphrase_kind_e
  0.01s 		enum clause_type_e
  0.01s 		enum clause_atom_type_e
  0.01s 		enum clause_constituent_relation_e
  0.01s 		enum tense_e
  0.02s 		otype book
  0.02s 			feature book (str) =def= Genesis : node
  0.02s 		otype verse
  0.02s 			feature verse_label (str) =def= "" : node
  0.02s 			feature verse (int) =def= 0 : node
  0.02s 			feature chapter (int) =def= 0 : node
  0.02s 			feature book (str) =def= Genesis : node
  0.02s 		otype chapter
  0.02s 			feature chapter (int) =def= 0 :

# Diffs

Check differences with previous versions.

The new dataset has been created in a temporary directory,
and has not yet been copied to its destination.

This is the moment to compare the newly created features with the older ones.
Some features are expected to differ.

We check the differences between the previous version of the features and what has been generated.
We list the features that will be added, deleted, or changed.
For each changed feature we show the first line where the new feature differs from the old one.
We ignore changes in the metadata, because the timestamp in the metadata will always change.

In [9]:
utils.checkDiffs(thisTempTf, thisTf)

..............................................................................................
.         14s Check differences with previous version                                        .
..............................................................................................
|         14s 	no features to add
|         14s 	2 features to delete
|         14s 		book@en
|         14s 		cons
|         14s 	37 features in common
|         14s analyzed_form             ... no changes
|         14s book                      ... no changes
|         14s chapter                   ... no changes
|         14s determination             ... no changes
|         14s emf                       ... no changes
|         14s frv                       ... no changes
|         14s gender                    ... no changes
|         14s is_apposition             ... no changes
|         14s lexeme                    ... no changes
|         14s morph_state               ... no changes
|         14s
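
`utils.checkDiffs` is local to this repository and is not shown here. The comparison it is described as doing, skipping the metadata and reporting the first differing line, might be sketched as follows. The sketch assumes that the metadata of a `.tf` file is the header block that ends at the first blank line; the function names and the two sample strings are made up for illustration.

```python
from itertools import zip_longest


def data_lines(text):
    """Return the data lines of a feature file, without its metadata header."""
    _header, _sep, body = text.partition("\n\n")
    return body.splitlines()


def first_difference(old_text, new_text):
    """Return (line number, old line, new line) of the first change, or None.

    Line numbers count data lines only and start at 1; a line that is
    missing on one side is reported as None.
    """
    pairs = zip_longest(data_lines(old_text), data_lines(new_text))
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            return (number, old, new)
    return None


old = "@node\n@dateWritten=2017-01-01\n\nnoun\nverb\n"
new = "@node\n@dateWritten=2018-01-01\n\nnoun\nadverb\n"
print(first_difference(old, new))
```

Because only the data part is compared, two files that differ in their timestamp alone count as unchanged.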

# Deliver 

Copy the new TF dataset from the temporary location where it has been created to its final destination.

In [10]:
utils.deliverDataset(thisTempTf, thisTf)

..............................................................................................
.         15s Deliver data set to /Users/dirk/github/etcbc/calap/tf/2014                     .
..............................................................................................


# Compile TF

This step checks that everything loads and that the pre-computing of extra information works out.
Moreover, once the pre-computing has been done, everything is quicker in subsequent runs that work with these features.

We issue a load statement to trigger the pre-computing of extra data.
Note that all features mentioned in the text formats of the `otext` config feature
will be loaded, as well as the features for sections.

At that point we have access to the full list of features.
We grab them and load them all.

In [11]:
utils.caption(4, 'Load and compile standard TF features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('')

utils.caption(4, 'Load and compile all other TF features')
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

..............................................................................................
.         15s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 3.1.3
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

37 features found and 0 ignored
  0.00s loading features ...
   |     0.15s T otype                from /Users/dirk/github/etcbc/calap/tf/2014
   |     0.83s T oslots               from /Users/dirk/github/etcbc/calap/tf/2014
   |     0.01s T book                 from /Users/dirk/github/etcbc/calap/tf/2014
   |     0.01s T chapter              from /Users/dirk/github/etcbc/calap/tf/2014
   |     0.01s T verse                from /Users/dirk/github/etcbc/calap/tf/2014


# Add unicode text and English book names

In [12]:
syriacMapping = {
    '>': "\u0710", # alaph
    'B': "\u0712", # beth
    'G': "\u0713", # gamal
    'D': "\u0715", # dalat
    'H': "\u0717", # he
    'W': "\u0718", # waw
    'Z': "\u0719", # zain
    'X': "\u071A", # heth
    'V': "\u071B", # teth
    'J': "\u071D", # yudh
    'K': "\u071F", # kaph
    'L': "\u0720", # lamadh
    'M': "\u0721", # mim
    'N': "\u0722", # nun
    'S': "\u0723", # semkath
    '<': "\u0725", # e
    'P': "\u0726", # pe
    'Y': "\u0728", # sadhe
    'Q': "\u0729", # qaph
    'R': "\u072A", # rish
    'C': "\u072B", # shin
    'T': "\u072C", # taw
    's': "\u0724", # semkath final
    'p': "\u0727", # pe reversed
}

def syriac(translit):
    # characters without an entry in the mapping are passed through unchanged
    return ''.join(syriacMapping.get(c, c) for c in translit)
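
As a quick illustration of the mapping, the sketch below applies a four-letter excerpt of the table to `MLK>`, a word from the first verse of this dataset. Characters without an entry, such as spaces, pass through unchanged. Because no two keys map to the same Unicode character, the table can also be inverted for a round trip. The names `excerpt`, `inverse`, `to_syriac` and `to_translit` are only for this illustration.

```python
# Four entries copied from the full mapping above.
excerpt = {
    '>': "\u0710",  # alaph
    'K': "\u071F",  # kaph
    'L': "\u0720",  # lamadh
    'M': "\u0721",  # mim
}
# The mapping is one-to-one, so swapping keys and values loses nothing.
inverse = {uni: trans for (trans, uni) in excerpt.items()}


def to_syriac(translit):
    return ''.join(excerpt.get(c, c) for c in translit)


def to_translit(text):
    return ''.join(inverse.get(c, c) for c in text)


word = to_syriac('MLK>')
print(word)
print(to_translit(word))
```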

In [13]:
metaData={
    '': featureMetaData,
    'cons': {
        'valueType': 'str',
        'source': 'feature surface_consonants',
        'method': 'back transliteration',
    },
    'book@en': {
        'valueType': 'str',
        'source': 'feature book',
        'method': 'copy',
    },
}

metaData['otext'] = dict()
metaData['otext'].update(T.config)
metaData['otext'].update({'fmt:text-orig-full': '{cons} '})

nodeFeatures = {}
edgeFeatures = {}

In [14]:
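cons = {}
bookEn = {}
# back-transliterate the consonantal text of every node that has one
for n in N():
    sf = F.surface_consonants.v(n)
    if sf is not None:
        cons[n] = syriac(sf)
# copy the book names from the feature `book`
for n in F.otype.s('book'):
    bookEn[n] = F.book.v(n)
nodeFeatures['cons'] = cons
nodeFeatures['book@en'] = bookEn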

In [15]:
TF = Fabric(locations=thisTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)

   |     0.00s T book@en              to /Users/dirk/github/etcbc/calap/tf/2014
   |     0.16s T cons                 to /Users/dirk/github/etcbc/calap/tf/2014
   |     0.00s M otext                to /Users/dirk/github/etcbc/calap/tf/2014


In [16]:
utils.caption(4, 'Load and compile standard TF features')
TF = Fabric(locations=thisTf, modules=[''])
api = TF.load('')

utils.caption(4, 'Load and compile all other TF features')
allFeatures = TF.explore(silent=False, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
api = TF.load(loadableFeatures)
api.makeAvailableIn(globals())

..............................................................................................
.         27s Load and compile standard TF features                                          .
..............................................................................................
This is Text-Fabric 3.1.3
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

39 features found and 0 ignored
  0.00s loading features ...
   |     0.24s T cons                 from /Users/dirk/github/etcbc/calap/tf/2014
   |     0.00s M otext                from /Users/dirk/github/etcbc/calap/tf/2014
   |      |     0.02s C __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse
   |     0.00s T book@en              from /Users/dirk/github/etcbc/calap/tf/2014
   |     0.00s Feature overview: 36 for nodes; 1

# Examples

In [17]:
utils.caption(4, 'Basic test')
utils.caption(4, 'First verse in all formats')
for fmt in T.formats:
    utils.caption(0, '{}'.format(fmt), continuation=True)
    utils.caption(0, '\t{}'.format(T.text(range(1,12), fmt=fmt)), continuation=True)

..............................................................................................
.         29s Basic test                                                                     .
..............................................................................................
..............................................................................................
.         29s First verse in all formats                                                     .
..............................................................................................
text-trans-full
	W MLK> DWJD S>B W <L B CNJ> W MKSJN HWW 
text-orig-full
	ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ 


In [18]:
F.psp.freqList()

(('noun', 18087),
 ('preposition', 13991),
 ('verb', 9744),
 ('conjunction', 7270),
 ('pronoun', 1553),
 ('adjective', 1463),
 ('negative', 1123),
 ('adverb', 511),
 ('interjection', 136),
 ('interrogative', 42))
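
`freqList()` pairs each value of a feature with the number of nodes that carry it, most frequent first. For readers without the dataset at hand, roughly the same tabulation can be sketched with `collections.Counter`. The sample values below are invented, the helper name `freq_list` is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def freq_list(values):
    """Return (value, count) pairs, highest count first, ties alphabetical."""
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


sample = ['noun', 'verb', 'noun', 'preposition', 'noun', 'verb']
print(freq_list(sample))
```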

In [19]:
F.phrase_type.freqList()

(('VP', 9259),
 ('CP', 8441),
 ('PP', 7418),
 ('NP', 5452),
 ('NegP', 1113),
 ('PrNP', 1012),
 ('PPrP', 951),
 ('AdvP', 439),
 ('AdjP', 396),
 ('IPrP', 176),
 ('InjP', 136),
 ('DPrP', 85),
 ('InrP', 17))

# Generate full text locally

We generate the full text with book, chapter, and verse divisions, one file per text format: transliterated and in Unicode.

In [20]:
for fmt in T.formats:
    with open(f'{thisTemp}/{fmt}.txt', 'w', encoding='utf8') as fh:
        for b in F.otype.s('book'):
            book = T.sectionFromNode(b)[0]
            fh.write(f'\n\nBOOK {book}\n')
            for c in L.d(b, otype='chapter'):
                chapter = T.sectionFromNode(c)[1]
                fh.write(f'\n{book} {chapter}\n\n')
                for v in L.d(c, otype='verse'):
                    verse = T.sectionFromNode(v)[2]
                    text = T.text(L.d(v, otype='word'), fmt=fmt)
                    fh.write(f'{verse} {text}\n')

In [21]:
!head -n 10 {thisTemp}/text-orig-full.txt



BOOK I_Kings

I_Kings 1

1 ܘ ܡܠܟܐ ܕܘܝܕ ܣܐܒ ܘ ܥܠ ܒ ܫܢܝܐ ܘ ܡܟܣܝܢ ܗܘܘ ܠܗ ܒ ܠܒܘܫܐ ܘ ܠܐ ܫܚܢ 
2 ܘ ܐܡܪܘ ܠܗ ܥܒܕܘܗܝ ܗܐ ܥܒܕܝܟ ܩܕܡܝܟ ܢܒܥܘܢ ܠ ܡܪܢ ܡܠܟܐ ܥܠܝܡܬܐ ܒܬܘܠܬܐ ܘ ܬܩܘܡ ܩܕܡ ܡܠܟܐ ܘ ܬܗܘܐ ܠܗ ܡܫܡܫܢܝܬܐ ܘ ܬܫܟܒ ܒ ܥܘܒܟ ܘ ܢܫܚܢ ܠ ܡܪܢ ܡܠܟܐ 
3 ܘ ܒܥܘ ܥܠܝܡܬܐ ܕ ܫܦܝܪܐ ܒ ܟܠܗ ܬܚܘܡܐ ܕ ܐܝܣܪܝܠ ܘ ܐܫܟܚܘ ܠ ܐܒܝܫܓ ܫܝܠܘܡܝܬܐ ܘ ܐܝܬܝܘܗ ܠ ܡܠܟܐ 
4 ܘ ܥܠܝܡܬܐ ܫܦܝܪܐ ܗܘܬ ܒ ܚܙܘܗ ܛܒ ܘ ܗܘܬ ܠ ܡܠܟܐ ܡܫܡܫܢܝܬܐ ܘ ܡܫܡܫܐ ܠܗ ܘ ܡܠܟܐ ܠܐ ܝܕܥܗ 


In [22]:
!head -n 10 {thisTemp}/text-trans-full.txt



BOOK I_Kings

I_Kings 1

1 W MLK> DWJD S>B W <L B CNJ> W MKSJN HWW LH B LBWC> W L> CXN 
2 W >MRW LH <BDWHJ H> <BDJK QDMJK NB<WN L MRN MLK> <LJMT> BTWLT> W TQWM QDM MLK> W THW> LH MCMCNJT> W TCKB B <WBK W NCXN L MRN MLK> 
3 W B<W <LJMT> D CPJR> B KLH TXWM> D >JSRJL W >CKXW L >BJCG CJLWMJT> W >JTJWH L MLK> 
4 W <LJMT> CPJR> HWT B XZWH VB W HWT L MLK> MCMCNJT> W MCMC> LH W MLK> L> JD<H 
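
The generated files have a simple layout: a `BOOK` line, a chapter heading consisting of book name and chapter number, and one line per verse that starts with the verse number. A minimal sketch of reading that layout back, for example to count the verses per chapter as a sanity check, is given below. It assumes that book names contain no spaces, as in `I_Kings`; the function name `verses_per_chapter` is hypothetical and the sample lines are shortened.

```python
def verses_per_chapter(lines):
    """Count the verse lines under each (book, chapter) heading."""
    counts = {}
    book = None
    current = None
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            continue  # blank separator line
        first, _, rest = line.partition(' ')
        if first == 'BOOK':
            book = rest.strip()
        elif first == book and rest.strip().isdigit():
            # chapter heading: book name followed by chapter number
            current = (book, int(rest))
            counts[current] = 0
        elif first.isdigit() and current is not None:
            # verse line: verse number followed by the text
            counts[current] += 1
    return counts


sample = [
    '\n', '\n', 'BOOK I_Kings\n', '\n', 'I_Kings 1\n', '\n',
    '1 W MLK> DWJD S>B \n',
    '2 W >MRW LH <BDWHJ \n',
]
print(verses_per_chapter(sample))
```

In practice the function can be fed an open file handle, since iterating over a text file yields its lines.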
