<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


# Ketiv Qere

This notebook reads the ketiv-qere information in files issued by the ETCBC and transforms it
into new features at the word level.

**NB** This conversion will not work for versions `4` and `4b`.

## Discussion
There are already `qere` and `qere_utf8` features in the MQL of the core data.
However, there are several problems with those:

* features that contain the after-word material, `qere_trailer` and `qere_trailer_utf8`
  are missing;
* if there is no qere, both features are filled with the empty string.
  As a result, we cannot distinguish a truly empty `qere` from the absence of a `qere`.

That is why we reconstruct ketiv and qere from special files that are used by the ETCBC.
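
The difference between an empty qere and an absent qere can be illustrated with a plain dictionary standing in for a feature.
This is only a minimal sketch: the node numbers, the values and the helper `describe` are made up for illustration and are not taken from the data.

```python
# Hypothetical feature data: node -> qere value (made-up nodes and values).
# Node 2 has a qere that happens to be empty; node 3 has no qere at all.
qere = {1: "W:", 2: ""}


def describe(node):
    value = qere.get(node)  # None when the node has no entry
    if value is None:
        return "no qere"
    if value == "":
        return "empty qere"
    return "qere " + value


for node in (1, 2, 3):
    print(node, describe(node))
```

If absent values were stored as empty strings, nodes 2 and 3 would be indistinguishable.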

In [1]:
import os
import sys
import collections
from tf.fabric import Fabric
from tf.writing.transcription import Transcription
import utils


# Pipeline
See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)
for how to run this script in the pipeline.

In [2]:
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    VERSION = "2021"


def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [3]:
repoBase = os.path.expanduser("~/github/etcbc")
thisRepo = "{}/{}".format(repoBase, CORE_NAME)

thisSource = "{}/source/{}".format(thisRepo, VERSION)

thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)

thisTf = "{}/tf/{}".format(thisRepo, VERSION)

In [4]:
testFeature = "qere_trailer"

# Test

Check whether this conversion is needed in the first place.
This check is only performed when the notebook is run as a script.

In [5]:
if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, testFeature), force=FORCE
    )
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)

# TF Settings

* a piece of metadata that will go into these features; the time will be added automatically
* new text formats for the `otext` feature of TF, based on the new qere features.
  We select the version-specific otext material;
  if nothing has been specified in `oText` for the version, no text formats are added.

We do not do this for the older versions `4` and `4b`.
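
The formats below use fields such as `{qere/g_word}`.
As far as we understand the TF text format syntax, such a field takes the first listed feature that has a value for the word, so the qere is shown where there is one and the ketiv otherwise.
The sketch below imitates that behaviour in plain Python; it does not call TF, and the names `features`, `FIELD` and `render` as well as the feature values are made up for illustration.

```python
import re

# Made-up feature data for two word nodes; node 2 has a qere, node 1 has not.
features = {
    "g_word": {1: "B.:", 2: "<NWJ"},
    "trailer": {1: "", 2: " "},
    "qere": {2: "<:ANIJ.;J"},
    "qere_trailer": {2: "&"},
}

FIELD = re.compile(r"\{([^}]+)\}")


def render(template, node):
    """Fill each {a/b} field with the first listed feature that has a value."""

    def fill(match):
        for name in match.group(1).split("/"):
            value = features.get(name, {}).get(node)
            if value is not None:
                return value
        return ""

    return FIELD.sub(fill, template)


full = "{qere/g_word}{qere_trailer/trailer}"
ketiv = "{g_word}{trailer}"
for node in (1, 2):
    print(node, repr(render(full, node)), repr(render(ketiv, node)))
```

This also shows why the trailer needs its own qere feature: without `qere_trailer`, the qere of node 2 would be followed by the trailer of the ketiv.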

In [7]:
provenanceMetadata = dict(
    dataset="BHSA",
    version=VERSION,
    datasetName="Biblia Hebraica Stuttgartensia Amstelodamensis",
    author="Eep Talstra Centre for Bible and Computer",
    encoders="Constantijn Sikkel (QDF), and Dirk Roorda (TF)",
    website="https://shebanq.ancient-data.org",
    email="shebanq@ancient-data.org",
)

oText = {
    "_temp": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "2021": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "2017": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "2016": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "c": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
}

thisOtext = oText.get(VERSION, "")

if thisOtext == "":
    utils.caption(0, "No additional text formats provided")
    otextInfo = {}
else:
    utils.caption(0, "New text formats")
    otextInfo = dict(
        line[1:].split("=", 1) for line in thisOtext.strip("\n").split("\n")
    )
    for x in sorted(otextInfo.items()):
        utils.caption(0, '{:<30} = "{}"'.format(*x))

|         16s New text formats
|         16s fmt:text-orig-full             = "{qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}"
|         16s fmt:text-orig-full-ketiv       = "{g_word_utf8}{trailer_utf8}"
|         16s fmt:text-trans-full            = "{qere/g_word}{qere_trailer/trailer}"
|         16s fmt:text-trans-full-ketiv      = "{g_word}{trailer}"


In [8]:
utils.caption(4, "Load the existing TF dataset")
TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("label g_word g_cons trailer_utf8")
api.makeAvailableIn(globals())

..............................................................................................
.         18s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

82 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  3.49s All features loaded/computed - for details use loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Verse labels
The ketiv-qere files identify verses by their verse labels, which correspond to the `label` feature of the verse nodes.
We make a mapping from verse labels to verse nodes.

In [9]:
utils.caption(0, "Mapping between verse labels and verse nodes")
nodeFromLabel = {}
for vs in F.otype.s("verse"):
    lab = F.label.v(vs)
    nodeFromLabel[lab] = vs
utils.caption(0, "{} verses".format(len(nodeFromLabel)))

|         26s Mapping between verse labels and verse nodes
|         26s 23213 verses


# Read the Ketiv-Qere file

In [10]:
utils.caption(4, "Parsing Ketiv-Qere data")

verseInfo = collections.defaultdict(list)
notFound = set()
missing = collections.defaultdict(list)
missed = collections.defaultdict(list)

error_limit = 10

kqFile = "{}/ketivqere.txt".format(thisSource)

ln = 0
# the with-statement closes the file, also when a line fails to parse
with open(kqFile) as kqHandle:
    for line in kqHandle:
        ln += 1
        # the first 10 characters hold the verse label, the rest holds ketiv and qere
        vlab = line[0:10]
        fields = line.rstrip("\n")[10:].split()
        (ketiv, qere) = fields[0:2]
        # separate the after-word material from the qere
        (qtrim, qtrailer) = Transcription.suffix_and_finales(qere)
        vnode = nodeFromLabel.get(vlab, None)
        if vnode is None:
            notFound.add(vlab)
            continue
        verseInfo[vnode].append((ketiv, qtrim, qtrailer))
utils.caption(0, "\tRead {} ketiv-qere annotations".format(ln))

..............................................................................................
.         30s Parsing Ketiv-Qere data                                                        .
..............................................................................................
|         30s 	Read 1892 ketiv-qere annotations
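
The parsing above takes the verse label from the first 10 characters of a line instead of splitting on whitespace, presumably because a label can itself contain a space.
The sketch below isolates that step. The helper `parse_kq_line` and the sample line are hypothetical: the layout is inferred from the slicing in the code, and the line is not copied from `ketivqere.txt`.

```python
def parse_kq_line(line):
    """Split one line into verse label, ketiv and qere (layout assumed)."""
    vlab = line[0:10]  # fixed-width verse label, may contain spaces
    fields = line.rstrip("\n")[10:].split()
    ketiv, qere = fields[0:2]
    return (vlab, ketiv, qere)


# Hypothetical input line, not copied from ketivqere.txt
sample = " JES 44,24 MJ M;\n"
print(parse_kq_line(sample))
```

The label keeps its leading space, so it can be looked up in `nodeFromLabel` only if the `label` feature stores it in the same padded form.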


In [11]:
data = []

for vnode in verseInfo:
    wlookup = collections.defaultdict(list)
    wvisited = collections.defaultdict(lambda: -1)
    wnodes = L.d(vnode, otype="word")
    for w in wnodes:
        gw = F.g_word.v(w)
        if "*" in gw:
            gw = F.g_cons.v(w)
            if gw == "":
                gw = "."
            if F.trailer_utf8.v(w) == "":
                gw += "-"
            wlookup[gw].append(w)
    for (ketiv, qere, qtrailer) in verseInfo[vnode]:
        wvisited[ketiv] += 1
        windex = wvisited[ketiv]
        ws = wlookup.get(ketiv, None)
        if ws is None or windex > len(ws) - 1:
            missing[vnode].append((windex, ketiv, qere))
            continue
        w = ws[windex]
        qereU = Transcription.to_hebrew(qere)
        qtrailerU = Transcription.to_hebrew(qtrailer)
        data.append(
            (
                w,
                ketiv,
                qere,
                qtrailer.replace("\n", ""),
                qereU,
                qtrailerU.replace("\n", ""),
            )
        )
    for ketiv in wlookup:
        if ketiv not in wvisited or len(wlookup[ketiv]) - 1 > wvisited[ketiv]:
            missed[vnode].append(
                (len(wlookup[ketiv]) - (wvisited.get(ketiv, -1) + 1), ketiv)
            )
utils.caption(0, "\tParsed {} ketiv-qere annotations".format(len(data)))

|         38s 	Parsed 1884 ketiv-qere annotations
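
The cell above pairs the n-th annotation of a given ketiv in a verse with the n-th candidate word that has that ketiv, and records the annotations for which no word is left.
A minimal sketch of that counting technique follows, on made-up data: the node numbers and the annotation list are hypothetical.

```python
import collections

# Made-up candidates in one verse: ketiv form -> word nodes in text order.
wlookup = {"L-": [101, 105], "KJ": [103]}

# Made-up annotations for that verse, in file order.
annotations = ["L-", "KJ", "L-", "MN"]

visited = collections.defaultdict(lambda: -1)  # last index used per ketiv
matched = []
missing = []
for ketiv in annotations:
    visited[ketiv] += 1
    index = visited[ketiv]
    nodes = wlookup.get(ketiv)
    if nodes is None or index >= len(nodes):
        # no candidate word left for this occurrence of the ketiv
        missing.append((index, ketiv))
        continue
    matched.append((nodes[index], ketiv))

print(matched)
print(missing)
```

Starting the counter at -1 makes the first occurrence land on index 0, which is the same convention as `wvisited` in the cell above.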


In [12]:
if not SCRIPT:
    print("\n".join(repr(d) for d in data[0:10]))

(297572, '<NWJ', '<:ANIJ.;J', '&', 'עֲנִיֵּי', '־')
(297647, 'W-', 'W:', '', 'וְ', '')
(297648, 'NCQH', 'NIC:Q:<@73H', ' ', 'נִשְׁקְעָ֖ה', ' ')
(297938, 'M<LWTW', 'MA<:ALOWT@80JW', ' ', 'מַעֲלֹותָ֔יו', ' ')
(370148, 'M>WM', 'MW.m04', ' ', 'מוּם֩', ' ')
(370612, 'L-', 'L:', '', 'לְ', '')
(370613, '<BDJK', '<AB:D@73k:', ' ', 'עַבְדָ֖ךְ', ' ')
(370621, 'L-', 'L:', '', 'לְ', '')
(370622, 'KFDJ>', 'KAF:D.@>;80J', ' ', 'כַשְׂדָּאֵ֔י', ' ')
(370704, 'HZMNTWN', 'HIZ:D.:MIN:T.W.n03', ' ', 'הִזְדְּמִנְתּוּן֙', ' ')


In [13]:
if notFound:
    utils.caption(
        0,
        "\tWARNING: Could not find {} verses: {}".format(
            len(notFound), sorted(notFound)
        ),
    )
else:
    utils.caption(0, "\tAll verses entries found in index")
if missing:
    utils.caption(
        0,
        "\tWARNING: Could not locate ketivs in the text: {} verses".format(
            len(missing)
        ),
    )
    e = 0
    for vnode in sorted(missing):
        if e > error_limit:
            break
        vlab = F.label.v(vnode)
        for (windex, ketiv, qere) in missing[vnode]:
            e += 1
            if e > error_limit:
                break
            utils.caption(
                0,
                "\t\tNOT IN TEXT: {:<10} {:<20} #{} {}".format(
                    vlab, ketiv, windex, qere
                ),
            )
else:
    utils.caption(0, "\tAll ketivs found in the text")
if missed:
    utils.caption(
        0, "\tCould not lookup qeres in the data: {} verses".format(len(missed))
    )
    e = 0
    for vnode in sorted(missed):
        if e > error_limit:
            break
        vlab = F.label.v(vnode)
        for (windex, ketiv) in missed[vnode]:
            e += 1
            if e > error_limit:
                break
            utils.caption(
                0, "\t\tNOT IN DATA: {:<10} {:<20} #{}".format(vlab, ketiv, windex)
            )
else:
    utils.caption(0, "\tAll ketivs found in the data")

|         52s 	All verses entries found in index
|         52s 		NOT IN TEXT: RICHT16,25 KJ                   #0 K.:
|         52s 		NOT IN TEXT: IISA 18,20 <L                   #0 <AL
|         52s 		NOT IN TEXT:  JES 44,24 MJ                   #0 M;
|         52s 		NOT IN TEXT:  IOB 38,12 H                    #0 HA
|         52s 		NOT IN TEXT:  THR 01,06 MN                   #0 MI
|         52s 		NOT IN TEXT:  THR 04,03 KJ                   #0 K.A
|         52s 		NOT IN TEXT:  NEH 02,13 HM-                  #0 H;74m
|         52s 		NOT IN TEXT:  ICHR27,12 BN-                  #0 B.;74n
|         52s 	Could not lookup qeres in the data: 8 verses
|         52s 		NOT IN DATA: RICHT16,25 KJ-                  #1
|         52s 		NOT IN DATA: IISA 18,20 <L-                  #1
|         52s 		NOT IN DATA:  JES 44,24 MJ-                  #1
|         52s 		NOT IN DATA:  IOB 38,12 H-                   #1
|         52s 		NOT IN DATA:  THR 01,06 MN-                  #1
|         52s 		NOT IN DA

# Prepare TF features

In [14]:
utils.caption(0, "Prepare TF ketiv qere features")

nodeFeatures = {}

newFeatures = """
    qere
    qere_trailer
    qere_utf8
    qere_trailer_utf8
""".strip().split()

nodeFeatures = dict(
    qere=dict(((x[0], x[2]) for x in data)),
    qere_trailer=dict(((x[0], x[3]) for x in data)),
    qere_utf8=dict(((x[0], x[4]) for x in data)),
    qere_trailer_utf8=dict(((x[0], x[5]) for x in data)),
)

|      3m 01s Prepare TF ketiv qere features


We update the `otext` feature with the new and changed formats.

In [15]:
utils.caption(0, "Update the otext feature")

metaData = {
    "": provenanceMetadata,
}

metaData["otext"] = dict()
metaData["otext"].update(T.config)
metaData["otext"].update(otextInfo)

for f in nodeFeatures:
    metaData[f] = {}
    metaData[f]["valueType"] = "str"

|      3m 05s Update the otext feature


In [16]:
changedDataFeatures = set(nodeFeatures)
changedFeatures = changedDataFeatures | {"otext"}

# Write new features
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [17]:
utils.caption(4, "write new/changed features to TF ...")
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

..............................................................................................
.      3m 09s write new/changed features to TF ...                                           .
..............................................................................................


True

# Diffs

Check differences with previous versions.

The new dataset has been created in a temporary directory,
and has not yet been copied to its destination.

This is the opportunity to compare the newly created features with the older ones.
Some differences in some features are to be expected.

We check the differences between the previous version of the features and what has been generated.
We list the features that will be added, deleted, or changed.
For each changed feature we show the first line where the new feature differs from the old one.
We ignore changes in the metadata, because the timestamp in the metadata will always change.
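
The comparison just described can be sketched as follows.
This is not the code of `utils.checkDiffs`, whose implementation is not shown here; the helpers `data_lines` and `first_difference` are hypothetical, and the sketch assumes that metadata lines are the ones starting with `@`.

```python
import itertools


def data_lines(lines):
    """Drop metadata lines, here assumed to be the ones starting with '@'."""
    return [line for line in lines if not line.startswith("@")]


def first_difference(old_lines, new_lines):
    """Return (position, old, new) for the first differing data line, or None."""
    pairs = itertools.zip_longest(data_lines(old_lines), data_lines(new_lines))
    for position, (old, new) in enumerate(pairs, start=1):
        if old != new:
            return (position, old, new)
    return None


# Made-up file contents: only the timestamp and the last data line differ.
old = ["@dateWritten=2021-06-28T08:55:24Z", "", "W:", "L:"]
new = ["@dateWritten=2021-06-28T08:59:31Z", "", "W:", "L:-"]
print(first_difference(old, new))
```

Using `zip_longest` means that a file that is merely longer than the other is also reported, with `None` on the shorter side.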

In [18]:
utils.checkDiffs(thisTempTf, thisTf, only=changedFeatures)

..............................................................................................
.      3m 13s Check differences with previous version                                        .
..............................................................................................
|      3m 13s 	2 features to add
|      3m 13s 		qere_trailer
|      3m 13s 		qere_trailer_utf8
|      3m 13s 	no features to delete
|      3m 13s 	3 features in common
|      3m 13s otext                     ... differences
|      3m 13s 	line      5 OLD -->@dateWritten=2021-06-28T08:55:24Z<--
|      3m 13s 	line      5 NEW -->@dateWritten=2021-06-28T08:59:31Z<--
|      3m 13s 	line     12 OLD -->@fmt:text-orig-full={g_word_utf8}{traile ...<--
|      3m 13s 	line     12 NEW -->@fmt:text-orig-full={qere_utf8/g_word_ut ...<--
|      3m 13s 	line     13 OLD -->@fmt:text-orig-plain={g_cons_utf8}{trail ...<--
|      3m 13s 	line     13 NEW -->@fmt:text-orig-full-ketiv={g_word_utf8}{ ...<--
|      3m 13s 	line 

# Deliver

Copy the new TF dataset from the temporary location where it has been created to its final destination.

In [19]:
utils.deliverFeatures(thisTempTf, thisTf, changedFeatures)

..............................................................................................
.      3m 17s Deliver features to /Users/dirk/github/etcbc/bhsa/tf/2021                      .
..............................................................................................
|      3m 17s 	qere_utf8
|      3m 17s 	qere
|      3m 17s 	otext
|      3m 17s 	qere_trailer_utf8
|      3m 17s 	qere_trailer


# Compile TF

We load the new features, use the new formats, and check some values.

In [20]:
utils.caption(4, "Load and compile the new TF features")

TF = Fabric(locations=thisTf, modules=[""])
api = TF.load(
    "g_word_utf8 g_word trailer_utf8 trailer {}".format(" ".join(changedDataFeatures))
)
api.makeAvailableIn(globals())

..............................................................................................
.      3m 20s Load and compile the new TF features                                           .
..............................................................................................
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

84 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.01s T qere_trailer         from ~/github/etcbc/bhsa/tf/2021
   |     0.01s T qere                 from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T qere_trailer_utf8    from ~/github/etcbc/bhsa/tf/2021
   |     0.01s T qere_utf8            from ~/github/etcbc/bhsa/tf/2021
   |      |     0.66s C __levels__           from otype, oslots, otext
   |      |       14s C __order__            from otype, oslots, __levels__
   |      |     0.69s C 

[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Examples

In [21]:
utils.caption(4, "Basic tests")


def showKq(w):
    # ketiv: the word as written, with its trailer
    hw = F.g_word_utf8.v(w)
    tw = F.g_word.v(w)
    ht = F.trailer_utf8.v(w)
    tt = F.trailer.v(w)

    # qere: the features yield None for words without a qere, shown here as empty
    qhw = F.qere_utf8.v(w) or ""
    qtw = F.qere.v(w) or ""
    qht = F.qere_trailer_utf8.v(w) or ""
    qtt = F.qere_trailer.v(w) or ""

    utils.caption(0, "{:<20} {}".format("hebrew", hw + ht))
    utils.caption(0, "{:<20} {}".format("hebrew qere", qhw + qht))
    utils.caption(0, "{:<20} {}".format("transcription", tw + tt))
    utils.caption(0, "{:<20} {}".format("transcription qere", qtw + qtt))


utils.caption(
    0,
    "{:<30}: {}".format(
        "absence of qere",
        " ".join(
            "NA" if F.qere.v(w) is None else F.qere.v(w) for w in (range(24700, 24710))
        ),
    ),
)
utils.caption(
    0,
    "{:<30}: {}".format(
        "presence of qere trailer",
        " ".join(
            "NA" if F.qere_trailer.v(w) is None else F.qere_trailer.v(w)
            for w in (range(30190, 30195))
        ),
    ),
)

showNode = L.u(122073, otype="verse")[0]
showVerse = T.sectionFromNode(showNode)

utils.caption(4, "{} {}:{} in all formats".format(*showVerse))
for fmt in T.formats:
    utils.caption(
        0, "{:<30} {}".format(fmt, T.text(L.d(showNode, otype="word"), fmt=fmt))
    )

..............................................................................................
.      4m 08s Basic tests                                                                    .
..............................................................................................
|      4m 08s absence of qere               : NA NA WA J.I45C:T.AX:AW.75W. NA NA NA NA NA NA
|      4m 08s presence of qere trailer      : NA NA NA &  
..............................................................................................
.      4m 08s Josua 15:52 in all formats                                                     .
..............................................................................................
|      4m 08s lex-orig-full                  אֲרַב וְ רוּמָה וְ אֶשְׁעָן 
|      4m 08s lex-orig-plain                 ארב ו רומה ו אשׁען 
|      4m 08s lex-trans-full                 >:ARAB W:- RW.M@H W:- >EC:<@N 
|      4m 08s lex-trans-plain                >RB W RWMH W >C<N 

In [14]:
if SCRIPT:
    stop(good=True)