<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>


# Ketiv Qere

This notebook reads the ketiv-qere information in files issued by the ETCBC
and transforms it into new features at the word level.

**NB** This conversion will not work for versions `4` and `4b`.

## Discussion
There are already `qere` and `qere_utf8` features in the MQL of the core data.
However, there are several problems with those:

* features that contain the after-word material, `qere_trailer` and `qere_trailer_utf8`
  are missing;
* if there is no qere, both features are filled with the empty string,
  so we cannot distinguish a truly empty `qere` from the absence of a `qere`.

That is why we reconstruct ketiv and qere from special files that are used by the ETCBC.
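
To see why the second point matters, here is a minimal sketch that uses plain dictionaries as a stand-in for feature data. The node numbers, the values and the helper `describe` are made up for illustration; they are not taken from the dataset.

```python
# Hypothetical feature data: node -> value.
qere_old = {1: "", 2: "", 3: "QR>"}  # old style: empty string also means "no qere"
qere_new = {2: "", 3: "QR>"}  # new style: node 1 has no entry at all


def describe(feature, node):
    """Report whether a node has no qere, an empty qere, or a real qere."""
    value = feature.get(node)  # None when the node carries no value
    if value is None:
        return "no qere"
    if value == "":
        return "empty qere"
    return "qere " + value


# With the old data, node 1 (no qere) and node 2 (empty qere) look the same.
print([describe(qere_old, n) for n in (1, 2, 3)])
# With the new data they can be told apart.
print([describe(qere_new, n) for n in (1, 2, 3)])
```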

In [1]:
import os
import sys
import collections
import yaml
from tf.fabric import Fabric
from tf.writing.transcription import Transcription
from tf.core.helpers import formatMeta
import utils


# Pipeline
See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)
for how to run this script in the pipeline.

In [2]:
if "SCRIPT" not in locals():
    SCRIPT = False
    FORCE = True
    CORE_NAME = "bhsa"
    VERSION = "2021"


def stop(good=False):
    if SCRIPT:
        sys.exit(0 if good else 1)

# Setting up the context: source file and target directories

The conversion is executed in an environment of directories, so that sources, temp files and
results are in convenient places and do not have to be shifted around.

In [3]:
repoBase = os.path.expanduser("~/github/etcbc")
thisRepo = "{}/{}".format(repoBase, CORE_NAME)

thisSource = "{}/source/{}".format(thisRepo, VERSION)

thisTemp = "{}/_temp/{}".format(thisRepo, VERSION)
thisTempTf = "{}/tf".format(thisTemp)

thisTf = "{}/tf/{}".format(thisRepo, VERSION)

In [4]:
testFeature = "qere_trailer"

# Test

Check whether this conversion is needed in the first place.
This check is only performed when the notebook runs as a script.
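
How `utils.mustRun` decides is not shown in this notebook. As a rough sketch of one plausible freshness check, the hypothetical function `needs_rebuild` below (an assumption, not the pipeline's helper) treats a target as stale when it is missing, when a rebuild is forced, or when the source is newer.

```python
import os


def needs_rebuild(source, target, force=False):
    """Hypothetical freshness check; not the pipeline's own helper.

    Returns True when `target` should be (re)generated:
    when forced, when it does not exist, or when `source` is newer.
    """
    if force or not os.path.exists(target):
        return True
    if source is None:
        # nothing to compare with: an existing target counts as up to date
        return False
    return os.path.getmtime(source) > os.path.getmtime(target)


# A target that does not exist always needs to be built.
print(needs_rebuild(None, "/no/such/dir/qere_trailer.tfx"))
```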

In [5]:
if SCRIPT:
    (good, work) = utils.mustRun(
        None, "{}/.tf/{}.tfx".format(thisTf, testFeature), force=FORCE
    )
    if not good:
        stop(good=False)
    if not work:
        stop(good=True)

# TF Settings

* a piece of metadata that will go into these features; the time will be added automatically
* new text formats for the `otext` feature of TF, based on the new qere features.
  We select the version-specific `otext` material;
  if nothing has been specified for the version, no text formats are added.

We do not do this for the older versions `4` and `4b`.

In [19]:
genericMetaPath = f"{thisRepo}/yaml/generic.yaml"
coreMetaPath = f"{thisRepo}/yaml/core.yaml"
ketivqereMetaPath = f"{thisRepo}/yaml/ketivqere.yaml"

with open(genericMetaPath) as fh:
    genericMeta = yaml.load(fh, Loader=yaml.FullLoader)
    genericMeta["version"] = VERSION
with open(coreMetaPath) as fh:
    coreMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))
with open(ketivqereMetaPath) as fh:
    ketivqereMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))

metaData = {"": genericMeta}

In [20]:
oText = {
    "_temp": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "2021": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "2017": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "2016": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
    "c": """
@fmt:text-orig-full={qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}
@fmt:text-orig-full-ketiv={g_word_utf8}{trailer_utf8}
@fmt:text-trans-full={qere/g_word}{qere_trailer/trailer}
@fmt:text-trans-full-ketiv={g_word}{trailer}""",
}

thisOtext = oText.get(VERSION, "")

if thisOtext == "":
    utils.caption(0, "No additional text formats provided")
    otextInfo = {}
else:
    utils.caption(0, "New text formats")
    otextInfo = dict(
        line[1:].split("=", 1) for line in thisOtext.strip("\n").split("\n")
    )
    for x in sorted(otextInfo.items()):
        utils.caption(0, '{:<30} = "{}"'.format(*x))

|      2m 42s New text formats
|      2m 42s fmt:text-orig-full             = "{qere_utf8/g_word_utf8}{qere_trailer_utf8/trailer_utf8}"
|      2m 42s fmt:text-orig-full-ketiv       = "{g_word_utf8}{trailer_utf8}"
|      2m 42s fmt:text-trans-full            = "{qere/g_word}{qere_trailer/trailer}"
|      2m 42s fmt:text-trans-full-ketiv      = "{g_word}{trailer}"
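
A placeholder such as `{qere/g_word}` lists alternatives: the first feature that has a value for the word is used. The sketch below is a simplified emulation of that fallback with plain dictionaries, not Text-Fabric's own implementation; the feature values and the helper `render` are invented for illustration.

```python
import re

# Hypothetical feature values per word node; node 2 has no qere.
features = {
    "g_word": {1: "KTB", 2: "DBR"},
    "trailer": {1: " ", 2: " "},
    "qere": {1: "QRJ"},
    "qere_trailer": {1: "&"},
}

PLACEHOLDER = re.compile(r"\{([^}]+)\}")


def render(template, node):
    """Fill each {a/b} placeholder with the first feature that has a value."""

    def fill(match):
        for name in match.group(1).split("/"):
            value = features.get(name, {}).get(node)
            if value is not None:
                return value
        return ""  # none of the alternatives has a value

    return PLACEHOLDER.sub(fill, template)


fmt = "{qere/g_word}{qere_trailer/trailer}"
print(repr(render(fmt, 1)))  # node with a qere: the qere material is used
print(repr(render(fmt, 2)))  # node without a qere: falls back on the ketiv
```

With formats like these, the `-ketiv` variants always show the written form, while the plain variants prefer the qere where one exists.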


In [21]:
utils.caption(4, "Load the existing TF dataset")
TF = Fabric(locations=thisTf, modules=[""])
api = TF.load("label g_word g_cons trailer_utf8")
api.makeAvailableIn(globals())

..............................................................................................
.      2m 43s Load the existing TF dataset                                                   .
..............................................................................................
This is Text-Fabric 9.1.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

108 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
    14s All features loaded/computed - for details use TF.isLoaded()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

# Verse labels
The ketiv-qere file refers to verses by their labels, not by node number.
We make a mapping from verse labels to verse nodes.

In [22]:
utils.caption(0, "Mapping between verse labels and verse nodes")
nodeFromLabel = {}
for vs in F.otype.s("verse"):
    lab = F.label.v(vs)
    nodeFromLabel[lab] = vs
utils.caption(0, "{} verses".format(len(nodeFromLabel)))

|      2m 57s Mapping between verse labels and verse nodes
|      2m 57s 23213 verses


# Read the Ketiv-Qere file

In [23]:
utils.caption(4, "Parsing Ketiv-Qere data")

verseInfo = collections.defaultdict(list)
notFound = set()
missing = collections.defaultdict(list)
missed = collections.defaultdict(list)

error_limit = 10

kqFile = "{}/ketivqere.txt".format(thisSource)

ln = 0
with open(kqFile) as kqHandle:
    for line in kqHandle:
        ln += 1
        # the first 10 characters hold the verse label, the rest holds the fields
        vlab = line[0:10]
        fields = line.rstrip("\n")[10:].split()
        (ketiv, qere) = fields[0:2]
        # split the qere into the word proper and the material that follows it
        (qtrim, qtrailer) = Transcription.suffix_and_finales(qere)
        vnode = nodeFromLabel.get(vlab, None)
        if vnode is None:
            notFound.add(vlab)
            continue
        verseInfo[vnode].append((ketiv, qtrim, qtrailer))
utils.caption(0, "\tRead {} ketiv-qere annotations".format(ln))

..............................................................................................
.      2m 57s Parsing Ketiv-Qere data                                                        .
..............................................................................................
|      2m 57s 	Read 1892 ketiv-qere annotations
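
The line layout that the parser relies on can be illustrated in isolation. This is a sketch under the assumption that a line starts with a fixed-width label of 10 characters, followed by whitespace-separated fields; the sample line and the helper `parse_kq_line` are made up and do not come from the ETCBC file.

```python
def parse_kq_line(line):
    """Split one line into (verse label, ketiv, qere).

    Assumes a fixed-width label in the first 10 characters, followed by
    whitespace-separated fields of which the first two are ketiv and qere.
    """
    label = line[0:10]
    fields = line.rstrip("\n")[10:].split()
    if len(fields) < 2:
        raise ValueError("expected ketiv and qere after the label: {!r}".format(line))
    ketiv, qere = fields[0:2]
    return label, ketiv, qere


# Made-up sample line, for illustration only.
sample = "BOOK 01,01 KTB QR>\n"
print(parse_kq_line(sample))
```

Unlike the cell above, this sketch raises an explicit error on a line with fewer than two fields, which makes malformed input easier to locate.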


In [24]:
data = []

for vnode in verseInfo:
    wlookup = collections.defaultdict(lambda: [])
    wvisited = collections.defaultdict(lambda: -1)
    wnodes = L.d(vnode, otype="word")
    for w in wnodes:
        gw = F.g_word.v(w)
        if "*" in gw:
            gw = F.g_cons.v(w)
            if gw == "":
                gw = "."
            if F.trailer_utf8.v(w) == "":
                gw += "-"
            wlookup[gw].append(w)
    for (ketiv, qere, qtrailer) in verseInfo[vnode]:
        wvisited[ketiv] += 1
        windex = wvisited[ketiv]
        ws = wlookup.get(ketiv, None)
        if ws is None or windex > len(ws) - 1:
            missing[vnode].append((windex, ketiv, qere))
            continue
        w = ws[windex]
        qereU = Transcription.to_hebrew(qere)
        qtrailerU = Transcription.to_hebrew(qtrailer)
        data.append(
            (
                w,
                ketiv,
                qere,
                qtrailer.replace("\n", ""),
                qereU,
                qtrailerU.replace("\n", ""),
            )
        )
    for ketiv in wlookup:
        if ketiv not in wvisited or len(wlookup[ketiv]) - 1 > wvisited[ketiv]:
            missed[vnode].append(
                (len(wlookup[ketiv]) - (wvisited.get(ketiv, -1) + 1), ketiv)
            )
utils.caption(0, "\tParsed {} ketiv-qere annotations".format(len(data)))

|      2m 57s 	Parsed 1892 ketiv-qere annotations
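
The matching step pairs the n-th annotation of a given ketiv in a verse with the n-th word in that verse that has the same lookup form. A minimal sketch of that idea with toy data follows; the function `match_ketivs`, the node numbers and the forms are invented for illustration.

```python
from collections import defaultdict


def match_ketivs(verse_words, annotations):
    """Pair each (ketiv, qere) annotation with a word node in one verse.

    verse_words: list of (node, key) in text order, key being the lookup form
    annotations: list of (ketiv, qere) in file order
    The n-th annotation with a given ketiv goes to the n-th word with that key.
    """
    lookup = defaultdict(list)
    for node, key in verse_words:
        lookup[key].append(node)

    seen = defaultdict(int)  # how many annotations have used each ketiv so far
    matched, unmatched = [], []
    for ketiv, qere in annotations:
        index = seen[ketiv]
        seen[ketiv] += 1
        nodes = lookup.get(ketiv, [])
        if index < len(nodes):
            matched.append((nodes[index], ketiv, qere))
        else:
            unmatched.append((index, ketiv, qere))
    return matched, unmatched


# Toy data: the form "L-" occurs twice in the verse, "XX" does not occur at all.
words = [(101, "L-"), (102, "<BD"), (103, "L-")]
notes = [("L-", "L:"), ("<BD", "<AB:D"), ("L-", "L:"), ("XX", "YY")]
matched, unmatched = match_ketivs(words, notes)
print(matched)
print(unmatched)
```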


In [25]:
if not SCRIPT:
    print("\n".join(repr(d) for d in data[0:10]))

(297575, '<NWJ', '<:ANIJ.;J', '&', 'עֲנִיֵּי', '־')
(297650, 'W-', 'W:', '', 'וְ', '')
(297651, 'NCQH', 'NIC:Q:<@73H', ' ', 'נִשְׁקְעָ֖ה', ' ')
(297941, 'M<LWTW', 'MA<:ALOWT@80JW', ' ', 'מַעֲלֹותָ֔יו', ' ')
(370152, 'M>WM', 'MW.m04', ' ', 'מוּם֩', ' ')
(370616, 'L-', 'L:', '', 'לְ', '')
(370617, '<BDJK', '<AB:D@73k:', ' ', 'עַבְדָ֖ךְ', ' ')
(370625, 'L-', 'L:', '', 'לְ', '')
(370626, 'KFDJ>', 'KAF:D.@>;80J', ' ', 'כַשְׂדָּאֵ֔י', ' ')
(370708, 'HZMNTWN', 'HIZ:D.:MIN:T.W.n03', ' ', 'הִזְדְּמִנְתּוּן֙', ' ')


In [26]:
if notFound:
    utils.caption(
        0,
        "\tWARNING: Could not find {} verses: {}".format(
            len(notFound), sorted(notFound)
        ),
    )
else:
    utils.caption(0, "\tAll verses entries found in index")
if missing:
    utils.caption(
        0,
        "\tWARNING: Could not locate ketivs in the text: {} verses".format(
            len(missing)
        ),
    )
    e = 0
    for vnode in sorted(missing):
        if e > error_limit:
            break
        vlab = F.label.v(vnode)
        for (windex, ketiv, qere) in missing[vnode]:
            e += 1
            if e > error_limit:
                break
            utils.caption(
                0,
                "\t\tNOT IN TEXT: {:<10} {:<20} #{} {}".format(
                    vlab, ketiv, windex, qere
                ),
            )
else:
    utils.caption(0, "\tAll ketivs found in the text")
if missed:
    utils.caption(
        0, "\tCould not lookup qeres in the data: {} verses".format(len(missed))
    )
    e = 0
    for vnode in sorted(missed):
        if e > error_limit:
            break
        vlab = F.label.v(vnode)
        for (windex, ketiv) in missed[vnode]:
            e += 1
            if e > error_limit:
                break
            utils.caption(
                0, "\t\tNOT IN DATA: {:<10} {:<20} #{}".format(vlab, ketiv, windex)
            )
else:
    utils.caption(0, "\tAll ketivs found in the data")

|      2m 57s 	All verses entries found in index
|      2m 57s 	All ketivs found in the text
|      2m 57s 	All ketivs found in the data


# Prepare TF features

In [27]:
utils.caption(0, "Prepare TF ketiv qere features")

nodeFeatures = {}

newFeatures = """
    qere
    qere_trailer
    qere_utf8
    qere_trailer_utf8
""".strip().split()

nodeFeatures = dict(
    qere=dict(((x[0], x[2]) for x in data)),
    qere_trailer=dict(((x[0], x[3]) for x in data)),
    qere_utf8=dict(((x[0], x[4]) for x in data)),
    qere_trailer_utf8=dict(((x[0], x[5]) for x in data)),
)

|      2m 57s Prepare TF ketiv qere features


We update the `otext` feature with the new and changed formats.

In [28]:
utils.caption(0, "Update the otext feature")

for f in nodeFeatures:
    if f in ketivqereMeta:
        metaData[f] = ketivqereMeta[f]
    elif f in coreMeta:
        metaData[f] = coreMeta[f]
    else:
        metaData[f] = {}
    metaData[f]["valueType"] = "str"
    metaData[f]["provenance"] = "from additional ketiv/qere file provided by the ETCBC"

metaData["otext"] = dict()
metaData["otext"].update(T.config)
metaData["otext"].update(otextInfo)

|      2m 57s Update the otext feature


In [30]:
changedDataFeatures = set(nodeFeatures)
changedFeatures = changedDataFeatures | {"otext"}

# Write new features
Transform the collected information into feature-like data structures, and write it all
out to `.tf` files.

In [31]:
utils.caption(4, "write new/changed features to TF ...")
TF = Fabric(locations=thisTempTf, silent=True)
TF.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

..............................................................................................
.      3m 04s write new/changed features to TF ...                                           .
..............................................................................................


True

# Diffs

Check differences with previous versions.

The new dataset has been created in a temporary directory,
and has not yet been copied to its destination.

Here is your opportunity to compare the newly created features with the older features.
Some differences in some features are to be expected.

We check the differences between the previous version of the features and what has been generated.
We list the features that will be added, deleted, and changed.
For each changed feature we show the first line where the new feature differs from the old one.
We ignore changes in the metadata, because the timestamp in the metadata will always change.
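
The comparison of a single feature can be sketched as follows. This is not the code of `utils.checkDiffs`; the function `first_data_diff` is hypothetical, and it assumes that metadata lines start with `@`, as they do in `.tf` files. The line number it reports counts non-metadata lines only.

```python
from itertools import zip_longest


def first_data_diff(old_lines, new_lines):
    """Return (line number, old, new) for the first differing data line, or None.

    Lines starting with "@" are treated as metadata and skipped on both sides.
    """
    old_data = [line for line in old_lines if not line.startswith("@")]
    new_data = [line for line in new_lines if not line.startswith("@")]
    for i, (old, new) in enumerate(zip_longest(old_data, new_data), start=1):
        if old != new:
            return (i, old, new)
    return None


# Toy feature files: only the timestamp and one data line differ.
old = ["@node", "@dateWritten=A", "", "1\tx", "2\ty"]
new = ["@node", "@dateWritten=B", "", "1\tx", "2\tz"]
print(first_data_diff(old, new))
```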

In [32]:
utils.checkDiffs(thisTempTf, thisTf, only=changedFeatures)

..............................................................................................
.      3m 26s Check differences with previous version                                        .
..............................................................................................
|      3m 26s 	2 features to add
|      3m 26s 		qere_trailer
|      3m 26s 		qere_trailer_utf8
|      3m 26s 	no features to delete
|      3m 26s 	3 features in common
|      3m 26s otext                     ... differences
|      3m 26s 	line      5 OLD -->@dateWritten=2021-12-09T07:37:31Z<--
|      3m 26s 	line      5 NEW -->@dateWritten=2021-12-09T08:29:47Z<--
|      3m 26s 	line     12 OLD -->@fmt:text-orig-full={g_word_utf8}{traile ...<--
|      3m 26s 	line     12 NEW -->@fmt:text-orig-full={qere_utf8/g_word_ut ...<--
|      3m 26s 	line     13 OLD -->@fmt:text-orig-plain={g_cons_utf8}{trail ...<--
|      3m 26s 	line     13 NEW -->@fmt:text-orig-full-ketiv={g_word_utf8}{ ...<--
|      3m 26s 	line 

# Deliver

Copy the new TF dataset from the temporary location where it has been created to its final destination.
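
How `utils.deliverFeatures` works internally is not shown here. As a minimal sketch of the copying step, assuming each feature lives in a file `<name>.tf`, a hypothetical `deliver` function could look like this; it runs against a throwaway temporary directory.

```python
import os
import shutil
import tempfile


def deliver(src_dir, dst_dir, feature_names):
    """Copy <name>.tf for each feature from src_dir to dst_dir.

    Returns the sorted names of the features that were actually copied.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(feature_names):
        src = os.path.join(src_dir, name + ".tf")
        if os.path.exists(src):
            shutil.copy(src, os.path.join(dst_dir, name + ".tf"))
            copied.append(name)
    return copied


# Demonstration: only "qere" exists in the source, so only that one is copied.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "temp_tf")
    dst = os.path.join(tmp, "tf")
    os.makedirs(src)
    with open(os.path.join(src, "qere.tf"), "w") as fh:
        fh.write("@node\n\n1\tQR>\n")
    delivered = deliver(src, dst, {"qere", "otext"})
    print(delivered)
```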

In [33]:
utils.deliverFeatures(thisTempTf, thisTf, changedFeatures)

..............................................................................................
.      3m 27s Deliver features to /Users/werk/github/etcbc/bhsa/tf/2021                      .
..............................................................................................
|      3m 27s 	qere
|      3m 27s 	qere_trailer
|      3m 27s 	qere_trailer_utf8
|      3m 27s 	otext
|      3m 27s 	qere_utf8


# Compile TF

We load the new features, use the new formats, and check some values.

In [None]:
utils.caption(4, "Load and compile the new TF features")

TF = Fabric(locations=thisTf, modules=[""])
api = TF.load(
    "g_word_utf8 g_word trailer_utf8 trailer {}".format(" ".join(changedDataFeatures))
)
api.makeAvailableIn(globals())

..............................................................................................
.      3m 31s Load and compile the new TF features                                           .
..............................................................................................
This is Text-Fabric 9.1.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

110 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.01s T qere_trailer_utf8    from ~/github/etcbc/bhsa/tf/2021
   |     0.00s T qere_trailer         from ~/github/etcbc/bhsa/tf/2021
   |     0.01s T qere                 from ~/github/etcbc/bhsa/tf/2021
   |     0.01s T qere_utf8            from ~/github/etcbc/bhsa/tf/2021
   |      |     0.64s C __levels__           from otype, oslots, otext
   |      |       14s C __order__            from otype, oslots, __levels__
   |      |     0.66s C 

# Examples

In [None]:
utils.caption(4, "Basic tests")


def showKq(w):
    hw = F.g_word_utf8.v(w)
    tw = F.g_word.v(w)
    ht = F.trailer_utf8.v(w)
    tt = F.trailer.v(w)

    qhw = F.qere_utf8.v(w)
    qtw = F.qere.v(w)
    qht = F.qere_trailer_utf8.v(w)
    qtt = F.qere_trailer.v(w)

    # the qere features are None for words without a qere: show those as empty
    utils.caption(0, "{:<20} {}".format("hebrew", hw + ht))
    utils.caption(0, "{:<20} {}".format("hebrew qere", (qhw or "") + (qht or "")))
    utils.caption(0, "{:<20} {}".format("transcription", tw + tt))
    utils.caption(
        0, "{:<20} {}".format("transcription qere", (qtw or "") + (qtt or ""))
    )


utils.caption(
    0,
    "{:<30}: {}".format(
        "absence of qere",
        " ".join(
            "NA" if F.qere.v(w) is None else F.qere.v(w) for w in (range(24700, 24710))
        ),
    ),
)
utils.caption(
    0,
    "{:<30}: {}".format(
        "presence of qere trailer",
        " ".join(
            "NA" if F.qere_trailer.v(w) is None else F.qere_trailer.v(w)
            for w in (range(30190, 30195))
        ),
    ),
)

showNode = L.u(122073, otype="verse")[0]
showVerse = T.sectionFromNode(showNode)

utils.caption(4, "{} {}:{} in all formats".format(*showVerse))
for fmt in T.formats:
    utils.caption(
        0, "{:<30} {}".format(fmt, T.text(L.d(showNode, otype="word"), fmt=fmt))
    )

In [14]:
if SCRIPT:
    stop(good=True)