<img align="right" src="tf-small.png"/>

# Paragraphs

This notebook reads ETCBC `.px` files, which contain information
about *paragraphs*.
From them we distil two extra features at the `clause_atom` level:
* `pargr`
* `instruction`

**NB** This conversion will not work for versions `4` and `4b`.

## Discussion
The meaning of these features still needs to be described in more detail and recorded in the feature documentation.
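
Until that documentation exists, the test output at the end of this notebook suggests that `pargr` holds a dotted, hierarchical paragraph number such as `1`, `1.1` or `1.1.5.1`. The sketch below works under that assumption only. The helpers `pargr_depth` and `pargr_parent` are hypothetical and are not part of Text-Fabric or the pipeline.

```python
def pargr_depth(pargr):
    """Nesting depth of a dotted paragraph number: '1' -> 1, '1.1.5' -> 3."""
    return len(pargr.split('.'))


def pargr_parent(pargr):
    """Number of the enclosing paragraph, or None for a top-level paragraph."""
    (head, sep, _last) = pargr.rpartition('.')
    return head if sep else None


for value in ('1', '1.1', '1.1.5.1'):
    print(value, pargr_depth(value), pargr_parent(value))
```

If the assumption holds, the depth gives the level of embedding of a clause atom and the parent gives the paragraph it is embedded in.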

In [1]:
import os, sys, re, collections
from tf.fabric import Fabric
from tf.transcription import Transcription
import utils

# Pipeline
See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation) 
for how to run this script in the pipeline.

In [2]:
if 'SCRIPT' not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = 'bhsa'
    VERSION = 'c'
    CORE_MODULE = 'core'

def stop(good=False):
    if SCRIPT: sys.exit(0 if good else 1)

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [3]:
module = CORE_MODULE
repoBase = os.path.expanduser('~/github/etcbc')
thisRepo = '{}/{}'.format(repoBase, CORE_NAME)

thisSource = '{}/source/{}'.format(thisRepo, VERSION)

thisTemp = '{}/_temp/{}'.format(thisRepo, VERSION)
thisSave = '{}/{}'.format(thisTemp, module)

thisTf = '{}/tf/{}'.format(thisRepo, VERSION)
thisDeliver = '{}/{}'.format(thisTf, module)

In [4]:
testFeature = 'pargr'

# Test

Check whether this conversion is needed in the first place.
Only when run as a script.

In [5]:
if SCRIPT:
    (good, work) = utils.mustRun(None, '{}/.tf/{}.tfx'.format(thisDeliver, testFeature), force=FORCE)
    if not good: stop(good=False)
    if not work: stop(good=True)
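
The cell above relies on `utils.mustRun` returning a pair `(good, work)`. The function below is a hypothetical stand-in that shows one plausible reading of that pair: `good` says whether the required input is present, `work` says whether the output has to be regenerated. It is a sketch under those assumptions, not the pipeline's actual implementation.

```python
import os


def must_run_sketch(file_in, file_out, force=False):
    """Return (good, work); an assumed reading of the semantics, not the real code."""
    if file_in is not None and not os.path.exists(file_in):
        # a required input is missing: nothing sensible can be done
        return (False, False)
    if force or not os.path.exists(file_out):
        # forced run, or the output has never been produced
        return (True, True)
    if file_in is None:
        # no input to compare against: an existing output counts as up to date
        return (True, False)
    # regenerate only when the input is newer than the output
    stale = os.path.getmtime(file_in) > os.path.getmtime(file_out)
    return (True, stale)
```

With `FORCE = True`, as set at the top of this notebook, such a check would always ask for the work to be done.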

# TF Settings

* a piece of metadata that will go into these features; the time will be added automatically
* new text formats for the `otext` feature of TF, based on lexical features.
  We select the version specific otext material, 
  falling back on a default if nothing appropriate has been specified in oText.
 
We do not do this for the older versions 4 and 4b.

In [6]:
provenanceMetadata = dict(
    dataset='BHSA',
    datasetName='Biblia Hebraica Stuttgartensia Amstelodamensis',
    author='Eep Talstra Centre for Bible and Computer',
    encoders='Constantijn Sikkel (QDF), and Dirk Roorda (TF)',
    website='https://shebanq.ancient-data.org',
    email='shebanq@ancient-data.org',
)

In [8]:
utils.caption(4, 'Load the existing TF dataset')
TF = Fabric(locations=thisTf, modules=module)
api = TF.load('label number')
api.makeAvailableIn(globals())

..............................................................................................
.      3m 53s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 2.3.13
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
105 features found and 0 ignored
  0.00s loading features ...
   |     0.02s B label                from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.30s B number               from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s Feature overview:

# Clause atom identifiers in .px
We must map the way the clause_atoms are identified in the `.px` files
to nodes in TF.

In [11]:
utils.caption(0, '\tLabeling clause_atoms')

labelNumberFromNode = {}
nodeFromLabelNumber = {}
for n in N():
    otype = F.otype.v(n)
    if otype == 'book':
        curSubtract = 0
        curChapterSeq = 0
    elif otype == 'chapter':
        curSubtract += curChapterSeq
        curChapterSeq = 0
    elif otype == 'verse':
        curLabel = F.label.v(n)
    elif otype == 'clause_atom':
        curChapterSeq += 1
        nm = int(F.number.v(n)) - curSubtract
        nodeFromLabelNumber[(curLabel, nm)] = n
        labelNumberFromNode[n] = (curLabel, nm)

nLabs = len(nodeFromLabelNumber)
nNodes = len(labelNumberFromNode)

if nLabs == nNodes:
    utils.caption(0, '\tOK: clause atoms successfully labeled')
    utils.caption(0, '\t{} clause atoms'.format(nNodes))
else:
    utils.caption(0, '\tWARNING: clause atoms not uniquely labeled')
    utils.caption(0, '\t{} labels =/= {} nodes'.format(nLabs, nNodes))

|     18m 16s 	Labeling clause_atoms
|     18m 18s 	OK: clause atoms succesfully labeled
|     18m 18s 	90562 clause atoms
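
Reading the loop above, the `number` feature appears to count clause atoms per book, while the key we build uses a number that restarts in every chapter. The toy version below replays that bookkeeping on an invented walk of `(node, otype, value)` triples. The node numbers and verse labels are made up for illustration and `label_clause_atoms` is a hypothetical name.

```python
def label_clause_atoms(walk):
    """Map (verse label, chapter-relative number) to node and back."""
    node_from_key = {}
    key_from_node = {}
    subtract = 0     # clause atoms in earlier chapters of the current book
    chapter_seq = 0  # clause atoms seen so far in the current chapter
    label = None
    for (node, otype, value) in walk:
        if otype == 'book':
            subtract = 0
            chapter_seq = 0
        elif otype == 'chapter':
            subtract += chapter_seq
            chapter_seq = 0
        elif otype == 'verse':
            label = value
        elif otype == 'clause_atom':
            chapter_seq += 1
            key = (label, int(value) - subtract)
            node_from_key[key] = node
            key_from_node[node] = key
    return (node_from_key, key_from_node)


# invented walk: one book, two chapters, five clause atoms numbered book-wide
toy_walk = [
    (100, 'book', None),
    (200, 'chapter', None),
    (300, 'verse', 'ch1-v1'),
    (401, 'clause_atom', '1'),
    (402, 'clause_atom', '2'),
    (301, 'verse', 'ch1-v2'),
    (403, 'clause_atom', '3'),
    (201, 'chapter', None),
    (302, 'verse', 'ch2-v1'),
    (404, 'clause_atom', '4'),
    (405, 'clause_atom', '5'),
]
(node_from_key, key_from_node) = label_clause_atoms(toy_walk)
print(key_from_node[404])
```

Clause atom `404` has book-wide number 4, but three clause atoms precede it in the first chapter, so its key carries the chapter-relative number 1.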


# Read the PX files

In [15]:
utils.caption(4, 'Parsing paragraph data in PX')

pxFile = '{}/paragraphs.txt'.format(thisTemp)
pxzFile = '{}/paragraphs.txt.bz2'.format(thisSource)
utils.caption(0, 'bunzipping {} ...'.format(pxzFile))
utils.bunzip(pxzFile, pxFile)
pxHandle = open(pxFile)

data = []
notFound = set()

ln = 0
can = 0
featurescan = re.compile(r'0 0 (..) [0-9]+ LineNr\s*([0-9]+).*?Pargr:\s*([0-9.]+)')
curLabel = None

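for line in pxHandle:
    ln += 1
    stripped = line.strip()
    if not stripped:
        # skip blank lines: indexing an empty string would raise IndexError
        continue
    if stripped[0] != '*':
        # a line without a leading * carries the verse label in its first 10 characters
        curLabel = line[0:10]
        continue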
    can += 1
    features = featurescan.findall(line)
    if len(features) == 0:
        utils.caption(0, '\tWarning: line {}: no instruction, LineNr, Pargr found'.format(ln))
    elif len(features) > 1:
        utils.caption(0, '\tWarning: line {}: multiple instruction, LineNr, Pargr found'.format(ln))
    else:
        feature = features[0]
        theIns = feature[0]
        theN = feature[1]
        thePara = feature[2]
        labNum = (curLabel, int(theN))
        if labNum not in nodeFromLabelNumber:
            notFound.add(labNum)
            continue
        data.append((nodeFromLabelNumber[labNum], theIns, theN, thePara))
pxHandle.close()
utils.caption(0, '\tRead {} paragraph annotations'.format(len(data)))

if notFound:
    utils.caption(0, '\tWARNING: Could not find {} label/line entries in index: {}'.format(
        len(notFound), sorted(notFound),
    ))
else:
    utils.caption(0, '\tOK: All label/line entries found in index')

..............................................................................................
.     31m 29s Parsing paragraph data in PX                                                   .
..............................................................................................
|     31m 29s bunzipping /Users/dirk/github/etcbc/bhsa/source/c/paragraphs.txt.bz2 ...
|     31m 29s 	NOTE: Using existing unzipped file which is newer than bzipped one
|     31m 29s 	Read 90562 paragraph annotations
|     31m 29s 	OK: All label/line entries found in index
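
To see what the regular expression in the cell above extracts, we can apply it to a single line. The sample line below is invented and shaped only to satisfy the pattern; real `.px` lines may carry more fields, and the `Txt:` field in particular is an assumption.

```python
import re

featurescan = re.compile(r'0 0 (..) [0-9]+ LineNr\s*([0-9]+).*?Pargr:\s*([0-9.]+)')

# invented line, not copied from a real .px file
sample = '* 0 0 .q 12 LineNr   5 Txt: ? Pargr: 1.1'

matches = featurescan.findall(sample)
(instruction, line_nr, pargr) = matches[0]
print(instruction, int(line_nr), pargr)
```

The three groups are the two-character instruction, the line number used to look up the clause atom, and the paragraph number. Both numbers come back as strings.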


In [32]:
if not SCRIPT:
    print('\n'.join(repr(d) for d in data[0:10]))

(576266, '.N', '1', '1')
(576267, '..', '2', '1')
(576268, '..', '3', '1')
(576269, '..', '4', '1')
(576270, '.q', '5', '1.1')
(576271, '..', '6', '1.1')
(576272, '..', '7', '1.1')
(576273, '..', '8', '1.1')
(576274, '..', '9', '1.1')
(576275, '.q', '10', '1.1.1')


# Prepare TF features

We now collect the paragraph information into features for nodes of type `clause_atom`.

In [33]:
utils.caption(0, 'Prepare TF paragraph features')

metaData = {}
nodeFeatures = {}

newFeatures = '''
    pargr
    instruction
'''.strip().split()

nodeFeatures = dict( 
    instruction=dict(((x[0], x[1]) for x in data)),
    pargr=dict(((x[0], x[3]) for x in data)),
)

for f in nodeFeatures:
    metaData[f] = {}
    metaData[f].update(provenanceMetadata)
    metaData[f]['valueType'] = 'str'

|     55m 16s Prepare TF paragraph features
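
The `nodeFeatures` dictionary built above maps each feature name to a dictionary from node to value. The sketch below rebuilds that shape from a few of the rows printed earlier in this notebook, using lower-case names of its own so as not to clash with the notebook's variables.

```python
# a few rows of `data` as printed earlier: (node, instruction, LineNr, pargr)
rows = [
    (576266, '.N', '1', '1'),
    (576269, '..', '4', '1'),
    (576270, '.q', '5', '1.1'),
    (576275, '.q', '10', '1.1.1'),
]

node_features = {
    'instruction': {node: ins for (node, ins, _line_nr, _pargr) in rows},
    'pargr': {node: pargr for (node, _ins, _line_nr, pargr) in rows},
}

print(node_features['pargr'][576270])
```

Note that the line number (the third member of each row) is used only for looking up the node and does not end up in any feature.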


In [34]:
changedFeatures = set(nodeFeatures)

# Stage: TF generation
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [35]:
utils.caption(4, 'write new/changed features to TF ...')
TF = Fabric(locations=thisSave, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

..............................................................................................
.     55m 26s write new/changed features to TF ...                                           .
..............................................................................................
   |     0.15s T instruction          to /Users/dirk/github/etcbc/bhsa/_temp/c/core
   |     0.14s T pargr                to /Users/dirk/github/etcbc/bhsa/_temp/c/core


# Stage: Diffs

Check differences with previous versions.

The new dataset has been created in a temporary directory,
and has not yet been copied to its destination.

Here is your opportunity to compare the newly created features with the older features.
You expect some differences in some features.

We compare the previous version of the features with what has just been generated.
We list the features that will be added, deleted, or changed.
For each changed feature we show the first few lines where the new feature differs from the old one.
We ignore changes in the metadata, because the timestamp in the metadata will always change.

In [36]:
utils.checkDiffs(thisSave, thisDeliver, only=changedFeatures)

..............................................................................................
.     55m 30s Check differences with previous version                                        .
..............................................................................................
|     55m 30s 	no features to add
|     55m 30s 	no features to delete
|     55m 30s 	2 features in common
|     55m 30s instruction               ... no changes
|     55m 30s pargr                     ... differencesafter the metadata
|     55m 30s 	line      3 OLD -->2<--
|     55m 30s 	line      3 NEW -->1<--
|     55m 30s 	line      4 OLD -->3<--
|     55m 30s 	line      4 NEW -->1<--
|     55m 30s 	line      5 OLD -->4<--
|     55m 30s 	line      5 NEW -->1<--
|     55m 30s 	line      6 OLD -->5<--
|     55m 30s 	line      6 NEW -->1.1<--

|     55m 30s Done
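
A comparison that ignores metadata can be sketched as follows. The sketch assumes that the metadata of a `.tf` file is the leading run of lines starting with `@`, followed by one blank line. The functions `split_tf` and `first_diffs` are hypothetical and do not reproduce `utils.checkDiffs`; the file contents are invented toy data.

```python
def split_tf(lines):
    """Split the lines of a feature file into (metadata lines, data lines)."""
    i = 0
    while i < len(lines) and lines[i].startswith('@'):
        i += 1
    meta = lines[:i]
    if i < len(lines) and lines[i] == '':
        i += 1  # skip the blank separator line
    return (meta, lines[i:])


def first_diffs(old_lines, new_lines, limit=4):
    """First `limit` differing data lines as (line number, old, new)."""
    (_old_meta, old_data) = split_tf(old_lines)
    (_new_meta, new_data) = split_tf(new_lines)
    diffs = []
    # zip stops at the shorter file; a length difference is not reported here
    for (n, (old, new)) in enumerate(zip(old_data, new_data), start=1):
        if old != new:
            diffs.append((n, old, new))
            if len(diffs) == limit:
                break
    return diffs


# invented toy contents: only the timestamp and one data line differ
old = ['@node', '@valueType=str', '@dateWritten=old', '', '10\ta', 'b', 'c']
new = ['@node', '@valueType=str', '@dateWritten=new', '', '10\ta', 'x', 'c']
print(first_diffs(old, new))
```

The changed timestamp does not show up in the result, because line numbers are counted from the first data line after the metadata.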


# Stage: Deliver 

Copy the new TF dataset from the temporary location where it has been created to its final destination.

In [37]:
utils.deliverFeatures(thisSave, thisDeliver, changedFeatures)

..............................................................................................
.     55m 37s Deliver features to /Users/dirk/github/etcbc/bhsa/tf/c/core                    .
..............................................................................................
|     55m 37s 	instruction
|     55m 37s 	pargr
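
Delivering amounts to copying the feature files from the temporary directory to the destination. A minimal sketch, assuming one `<feature>.tf` file per feature; `deliver_sketch` is a hypothetical stand-in for `utils.deliverFeatures`, whose real behaviour may differ.

```python
import os
import shutil


def deliver_sketch(src_dir, dst_dir, features):
    """Copy <feature>.tf from src_dir to dst_dir; return the features copied."""
    os.makedirs(dst_dir, exist_ok=True)
    delivered = []
    for feature in sorted(features):
        file_name = '{}.tf'.format(feature)
        src = os.path.join(src_dir, file_name)
        if not os.path.exists(src):
            # nothing was generated for this feature, so nothing to deliver
            continue
        shutil.copy(src, os.path.join(dst_dir, file_name))
        delivered.append(feature)
    return delivered
```

In this notebook the call would correspond to `deliver_sketch(thisSave, thisDeliver, changedFeatures)`.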


# Stage: Compile TF

We load the new features, use the new format, and check some values.

In [38]:
utils.caption(4, 'Load and compile the new TF features')

TF = Fabric(locations=thisTf, modules=module)
api = TF.load(' '.join(changedFeatures))
api.makeAvailableIn(globals())

..............................................................................................
.     55m 39s Load and compile the new TF features                                           .
..............................................................................................
This is Text-Fabric 2.3.13
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
107 features found and 0 ignored
  0.00s loading features ...
   |     0.32s T instruction          from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.30s T pargr                from /Users/dirk/github/etcbc/bhsa/tf/c/core
   |     0.00s Feature overview:

In [40]:
utils.caption(4, 'Test: paragraphs of the first verses')

def showParagraphs(verseNode):
    clause_atoms = L.d(verseNode, otype='clause_atom')
    for ca in clause_atoms:
        utils.caption(0, '\t\t{:<3} {:>12} {}'.format(
            F.instruction.v(ca),
            F.pargr.v(ca),
            T.text(L.d(ca, otype='word'))
        ), continuation=True)

for (i, verseNode) in enumerate(F.otype.s('verse')[0:10]):
    verseLabel = T.sectionFromNode(verseNode)
    verseHeading = '{} {}:{}'.format(*verseLabel) if i == 0 else verseLabel[2]
    utils.caption(0, '\t{}'.format(verseHeading), continuation=True)
    showParagraphs(verseNode)

..............................................................................................
.     56m 09s Test: paragraphs of the first verses                                           .
..............................................................................................
	Genesis 1:1
		.N             1 בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ 
	2
		..             1 וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ 
		..             1 וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְהֹ֑ום 
		..             1 וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃ 
	3
		.#           1.1 וַיֹּ֥אמֶר אֱלֹהִ֖ים 
		.q         1.1.1 יְהִ֣י אֹ֑ור 
		.#         1.1.2 וַֽיְהִי־אֹֽור׃ 
	4
		.#         1.1.3 וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור 
		..         1.1.3 כִּי־טֹ֑וב 
		.#         1.1.4 וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃ 
	5
		.#         1.1.5 וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום 
		..         1.1.5 וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה 
		.#       1.1.5.1 וַֽיְהִי־עֶ֥רֶב 
		.#      