# BHSA and OSM: comparison on verb attributes

We will investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

In this notebook we investigate the markup of verb attributes.
According to the [OSM specs](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html),
the OSM provides the following verb attributes:

* verb stem
* conjugation type
* person
* gender
* number

We use the `osm` and `osm_sf` features compiled by the
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).
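
To make the positional layout concrete, here is a minimal sketch of how a verb code could be split into those attributes.
It assumes the layout that the helper functions in this notebook rely on: language at position 0, part of speech at 1, stem at 2, conjugation at 3, followed by person, gender and number, where participles and infinitives skip the person slot.
The function name `decodeVerbMorph` and the sample codes are illustrative; they are not part of the OSM or of Text-Fabric.

```python
# Conjugation codes without a person slot (participles and infinitives);
# this mirrors the assumption made by the helper functions in this notebook.
NO_PERSON = {"r", "s", "a", "c"}


def decodeVerbMorph(code):
    """Split an OSM verb code into labelled positions (hypothetical helper)."""

    def at(n):
        # "?" marks a position that is absent from the code
        return code[n] if code and len(code) > n else "?"

    conj = at(3)
    hasPerson = conj not in NO_PERSON
    shift = 1 if hasPerson else 0
    return dict(
        lang=at(0),
        pos=at(1),
        stem=at(2),
        conj=conj,
        person=at(4) if hasPerson else "_",
        gender=at(4 + shift),
        number=at(5 + shift),
    )


print(decodeVerbMorph("HVqp3ms"))  # a finite verb form
print(decodeVerbMorph("HVhrfsa"))  # a participle: no person slot
```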

# Results

See below, where most of the cases are shown.
We also collect all cases in [verbs.tsv](verbs.tsv), a tab-delimited file.

In [1]:
from tf.app import use
from helpers import show

# Load data
We load the BHSA data in the standard way, and we add the OSM data as a module that contains the features `osm` and `osm_sf`.
Note that we only need to point TF to the right GitHub org/repo/directory in order to load the OSM features.

In [2]:
A = use("bhsa", mod="etcbc/bridging/tf", hoist=globals())

This is Text-Fabric 9.0.4
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

123 features found and 0 ignored


# Verb occurrences

Let us first identify what the verb occurrences are, according to the OSM and to the BHSA.
We'll show the differences.
The OSM is not yet completed, so we focus on the cases where the OSM has morphology.

We call the set of words that have a non-empty OSM morphology string the OSM-base.

In [15]:
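# All word nodes that the BHSA marks as verb
verbsBHS = set(F.sp.s("verb"))

# The OSM-base: words whose OSM morphology string has at least
# a language code and a part-of-speech code
hasOSM = set()
for w in F.otype.s("word"):
    osm = F.osm.v(w)
    if osm and len(osm) > 1:
        hasOSM.add(w)

# BHSA verbs restricted to the OSM-base
verbsBHSfocus = verbsBHS & hasOSM

# OSM verbs: the second character of the morphology string is the part of speech
verbsOSM = {w for w in hasOSM if F.osm.v(w)[1] == "V"}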

print(
    """
Number of verb occurrences in the Hebrew Bible:
\tin BHSA (total):                     {:>5}
\tin BHSA (intersected with OSM-base): {:>5}
\tin OSM:                              {:>5}
""".format(
        len(verbsBHS),
        len(verbsBHSfocus),
        len(verbsOSM),
    )
)


Number of verb occurrences in the Hebrew Bible:
	in BHSA (total):                     75451
	in BHSA (intersected with OSM-base): 73668
	in OSM:                              73642



As you can see, there are very few discrepancies.
Before we show them, we define functions that display a verb with its BHSA morphology and its OSM morphology.

If a piece of morphology is not present, we substitute a `?`.
In the BHSA we also render an unknown value as `?` and a not-applicable value as `_`, although
there is a difference between missing markup and markup saying: insufficient information!

# Names

We map the names for stems and conjugations found in the
[OSM morphology description](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html)
to convenient names when comparing them with the morphology values in the BHSA features
[vs](https://etcbc.github.io/bhsa/features/hebrew/2017/vs.html)
and
[vt](https://etcbc.github.io/bhsa/features/hebrew/2017/vt.html),
and we map some BHSA names as well.

In [16]:
stemMapOSM = dict(
    H=dict(
        q="qal",
        N="niphal",
        p="piel",
        P="pual",
        h="hiphil",
        H="hophal",
        t="hithpael",
        o="polel",
        O="polal",
        r="hithpolel",
        m="poel",
        M="poal",
        k="palel",
        K="pulal",
        Q="qalpassive",
        l="pilpel",
        L="polpal",
        f="hithpalpel",
        D="nithpael",
        j="pealal",
        i="pilel",
        u="hothpaal",
        c="tiphil",
        v="hishtaphel",
        w="nithpalel",
        y="nithpoel",
        z="hithpoel",
    ),
    A=dict(
        q="peal",
        Q="peil",
        u="hithpeel",
        p="pael",
        P="ithpaal",
        M="hithpaal",
        a="aphel",
        h="haphel",
        s="saphel",
        e="shaphel",
        H="hophal",
        i="ithpeel",
        t="hishtaphel",
        v="ishtaphel",
        w="hithaphel",
        o="polel",
        z="ithpoel",
        r="hithpolel",
        f="hithpalpel",
        b="hephal",
        c="tiphel",
        m="poel",
        l="palpel",
        L="ithpalpel",
        O="ithpolel",
        G="ittaphal",
    ),
)

In [17]:
stemMapBHS = dict(
    hif="hiphil",
    hit="hithpael",
    htpo="hithpoel",
    hof="hophal",
    nif="niphal",
    piel="piel",
    poal="poal",
    poel="poel",
    pual="pual",
    qal="qal",
    afel="aphel",
    etpa="etpaal",
    etpe="etpeel",
    haf="haphel",
    hotp="hothpaal",
    hsht="hishtaphel",
    htpa="hithpaal",
    htpe="hithpeel",
    nit="nithpael",
    pael="pael",
    peal="peal",
    peil="peil",
    shaf="shaphel",
    tif="tiphal",
    pasq="qalpassive",
)

In [18]:
conjMapOSM = dict(
    p="perfect",
    q="weqatal",
    i="imperfect",
    w="wayyiqtol",
    h="cohortative",
    j="jussive",
    v="imperative",
    r="part act",
    s="part pass",
    a="inf abs",
    c="inf cons",
)
conjMapBHS = dict(
    impf="imperfect",
    impv="imperative",
    infa="inf abs",
    infc="inf cons",
    perf="perfect",
    ptca="part act",
    ptcp="part pass",
    wayq="wayyiqtol",
)
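
A quick way to see which conjugation names can never match is to compare the value sets of the two maps.
The sketch below repeats the two dictionaries so that it runs on its own; the names `onlyOSM` and `onlyBHS` are introduced here for illustration only.

```python
conjMapOSM = dict(
    p="perfect", q="weqatal", i="imperfect", w="wayyiqtol",
    h="cohortative", j="jussive", v="imperative",
    r="part act", s="part pass", a="inf abs", c="inf cons",
)
conjMapBHS = dict(
    impf="imperfect", impv="imperative", infa="inf abs", infc="inf cons",
    perf="perfect", ptca="part act", ptcp="part pass", wayq="wayyiqtol",
)

# Common names that only one of the two sources can produce
onlyOSM = sorted(set(conjMapOSM.values()) - set(conjMapBHS.values()))
onlyBHS = sorted(set(conjMapBHS.values()) - set(conjMapOSM.values()))

print("OSM only:", onlyOSM)
print("BHS only:", onlyBHS)
```

Names that occur on the OSM side only will always surface as differences in the comparison, because the BHSA map has no value that can equal them.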

In [19]:
naValues = {"NA", "N/A"}
missingValues = {None, "", "unknown"}
noPersonConj = {"r", "s", "a", "c"}


def getValue(x):
    return "_" if x in naValues else "?" if x in missingValues else x


def getValueHead(x):
    return "_" if x in naValues else "?" if x in missingValues else x[0]


def getValueTail(x):
    return "_" if x in naValues else "?" if x in missingValues else x[1:]


def extractFeature(x, n):
    return "?" if not x or len(x) <= n else x[n]


def getLangOSM(w):
    return extractFeature(F.osm.v(w), 0)


def getStemOSM(w):
    return extractFeature(F.osm.v(w), 2)


def getStemOSMX(w):
    return stemMapOSM.get(getLangOSM(w), {}).get(getStemOSM(w), "?")


def getConjOSM(w):
    return extractFeature(F.osm.v(w), 3)


def getConjOSMX(w):
    return conjMapOSM.get(getConjOSM(w), "?")


def getPersonOSM(w):
    return "_" if getConjOSM(w) in noPersonConj else extractFeature(F.osm.v(w), 4)


def getGenderOSM(w):
    return extractFeature(F.osm.v(w), 4 if getConjOSM(w) in noPersonConj else 5)


def getNumberOSM(w):
    return extractFeature(F.osm.v(w), 5 if getConjOSM(w) in noPersonConj else 6)


def getStemBHS(w):
    return getValue(F.vs.v(w))


def getStemBHSX(w):
    return stemMapBHS.get(getStemBHS(w), "?")


def getConjBHS(w):
    return getValue(F.vt.v(w))


def getConjBHSX(w):
    return conjMapBHS.get(getConjBHS(w), "?")


def getPersonBHS(w):
    return getValueTail(F.ps.v(w))


def getGenderBHS(w):
    return getValue(F.gn.v(w))


def getNumberBHS(w):
    return getValueHead(F.nu.v(w))


def getVerbBHS(w):
    return "{}-{}-{}{}{}".format(
        getStemBHSX(w),
        getConjBHSX(w),
        getPersonBHS(w),
        getGenderBHS(w),
        getNumberBHS(w),
    )


def getVerbOSM(w):
    return "{}-{}-{}{}{}".format(
        getStemOSMX(w),
        getConjOSMX(w),
        getPersonOSM(w),
        getGenderOSM(w),
        getNumberOSM(w),
    )


def getBHS(w):
    return F.sp.v(w)


def getOSM(w):
    return F.osm.v(w)

# Mappings

We collect the numbers of cooccurrences of OSM values and BHSA values for each verb feature,
and see how they compare.
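
The idea can be sketched on toy data: for every word we record its pair of values, group the words by the OSM value, and within each group by the BHSA value.
The word numbers and value pairs below are made up for illustration; the nested `setdefault` pattern is the one used in `showFeatures`.

```python
# Hypothetical (word, OSM value, BHSA value) triples, for illustration only
samples = [
    (1, "aphel", "haphel"),
    (2, "aphel", "haphel"),
    (3, "aphel", "aphel"),
    (4, "haphel", "haphel"),
]

# OSM value -> BHSA value -> set of words having that combination
BHSFromOSM = {}
for (w, osm, bhs) in samples:
    BHSFromOSM.setdefault(osm, {}).setdefault(bhs, set()).add(w)

for (osm, byBHS) in sorted(BHSFromOSM.items()):
    print(osm)
    # most frequent counterpart first, ties broken alphabetically
    for (bhs, ws) in sorted(byBHS.items(), key=lambda x: (-len(x[1]), x[0])):
        print("\t{:<10} ({}x)".format(bhs, len(ws)))
```

The most frequent counterpart of a value can be read as its regular correspondence; the less frequent ones are the candidates for a closer look.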

In [45]:
closerLook = set()

In [46]:
def showFeatures(base):
    cases = set()
    mappings = {}

    def makeMap(key, getBHS, getOSM):
        BHSFromOSM = {}
        OSMFromBHS = {}

        for w in base:
            osm = getOSM(w)
            bhs = getBHS(w)
            BHSFromOSM.setdefault(osm, {}).setdefault(bhs, set()).add(w)
            OSMFromBHS.setdefault(bhs, {}).setdefault(osm, set()).add(w)
        mappings.setdefault(key, {})[True] = BHSFromOSM
        mappings.setdefault(key, {})[False] = OSMFromBHS

    def showMap(key, direction):
        dirLabel = "OSM ===> BHS" if direction else "BHS ===> OSM"
        print(
            """
---------------------------------------------------------------------------------
--- {} {}
---------------------------------------------------------------------------------
""".format(
                key, dirLabel
            )
        )
        cases = set()
        for (item, itemData) in sorted(mappings[key][direction].items()):
            print("{:<10}".format(item))
            first = True
            for (itemOther, ws) in sorted(
                itemData.items(), key=lambda x: (-len(x[1]), x[0])
            ):
                print("\t{:<15} ({:>5}x)".format(itemOther, len(ws)))
                if not first and len(ws) < 100:
                    shown = 0
                    for w in sorted(ws):
                        if shown < 5:
                            show(T, F, [w], getVerbBHS, getVerbOSM, indent="\t\t\t\t")
                            shown += 1
                        elif shown == 5:
                            print(f"\t\t\t\tand {len(ws) - 5} more")
                            shown += 1
                        cases.add(w)
                first = False
        print("\n{} ({}): {} cases".format(key, dirLabel, len(cases)))
        return cases

    def showFeature(key):
        cases = set()
        print(
            """
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE {}
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
""".format(
                key
            )
        )
        for direction in (True, False):
            theseCases = showMap(key, direction)
            cases |= theseCases
        print("\n{}: {} cases".format(key, len(cases)))
        return cases

    for (key, getBHS, getOSM) in (
        ("stem", getStemBHSX, getStemOSMX),
        ("conjugation", getConjBHSX, getConjOSMX),
        ("person", getPersonBHS, getPersonOSM),
        ("gender", getGenderBHS, getGenderOSM),
        ("number", getNumberBHS, getNumberOSM),
    ):
        makeMap(key, getBHS, getOSM)
        cases |= showFeature(key)
    print("\n{}: {} cases".format("All features", len(cases)))

    return cases

## In BHSA but not in OSM

In [47]:
extraBHS = verbsBHSfocus - verbsOSM

print("Marked as verb in BHSA but not in OSM: {:>3}".format(len(extraBHS)))
for w in sorted(extraBHS):
    show(T, F, [w], getVerbBHS, getOSM, indent="\t")

Marked as verb in BHSA but not in OSM: 187
	Genesis 1:4 w47"טֹ֑וב"
		BHS: qal-perfect-3ms
		OSM: HAamsa
	Genesis 1:10 w178"טֹֽוב"
		BHS: qal-perfect-3ms
		OSM: HAamsa
	Genesis 1:12 w227"טֹֽוב"
		BHS: qal-perfect-3ms
		OSM: HAamsa
	Genesis 1:18 w351"טֹֽוב"
		BHS: qal-perfect-3ms
		OSM: HAamsa
	Genesis 1:21 w413"טֹֽוב"
		BHS: qal-perfect-3ms
		OSM: HAamsa
	Genesis 1:25 w494"טֹֽוב"
		BHS: qal-perfect-3ms
		OSM: HAamsa
	Genesis 18:1 w7815"חֹ֥ם"
		BHS: qal-inf cons-???
		OSM: HNcmsc
	Genesis 30:11 w15747"ב"
		BHS: qal-perfect-3ms
		OSM: HR
	Genesis 40:2 w21943"מַּשְׁקִ֔ים"
		BHS: hiphil-part act-?mp
		OSM: HNcmpa
	Genesis 40:5 w21997"מַּשְׁקֶ֣ה"
		BHS: hiphil-part act-?ms
		OSM: HNcmsa
	Genesis 40:9 w22067"מַּשְׁקִ֛ים"
		BHS: hiphil-part act-?mp
		OSM: HNcmpa
	Genesis 40:20 w22307"מַּשְׁקִ֗ים"
		BHS: hiphil-part act-?mp
		OSM: HNcmpa
	Genesis 40:21 w22322"מַּשְׁקִ֖ים"
		BHS: hiphil-part act-?mp
		OSM: HNcmpa
	Genesis 40:23 w22348"מַּשְׁקִ֛ים"
		BHS: hiphil-part act-?mp
		OSM: HNcmpa
	Genesi

In [48]:
cases = extraBHS
closerLook |= cases
print("{} cases merged into {} closer look items".format(len(cases), len(closerLook)))

187 cases merged into 187 closer look items


## In OSM but not in BHSA

In [49]:
extraOSM = verbsOSM - verbsBHSfocus

print("Marked as verb in OSM but not in BHSA: {:>3}".format(len(extraOSM)))
for w in sorted(extraOSM):
    show(T, F, [w], getBHS, getVerbOSM, indent="\t")

Marked as verb in OSM but not in BHSA: 161
	Genesis 24:59 w12263"מֵנִקְתָּ֑הּ"
		BHS: subs
		OSM: hiphil-part act-_fs
	Genesis 26:13 w13242"גָדֵ֔ל"
		BHS: adjv
		OSM: qal-part act-_ms
	Genesis 30:42 w16354"עֲטֻפִים֙"
		BHS: adjv
		OSM: qal-part pass-_mp
	Genesis 39:20 w21838"אסורי"
		BHS: subs
		OSM: qal-part pass-_mp
	Genesis 40:1 w21928"אֹפֶ֑ה"
		BHS: subs
		OSM: qal-part act-_ms
	Genesis 40:2 w21948"אֹופִֽים"
		BHS: subs
		OSM: qal-part act-_mp
	Genesis 40:5 w22000"אֹפֶ֗ה"
		BHS: subs
		OSM: qal-part act-_ms
	Genesis 40:16 w22206"אֹפִ֖ים"
		BHS: subs
		OSM: qal-part act-_mp
	Genesis 40:17 w22236"אֹפֶ֑ה"
		BHS: subs
		OSM: qal-part act-_ms
	Genesis 40:20 w22313"אֹפִ֖ים"
		BHS: subs
		OSM: qal-part act-_mp
	Genesis 40:22 w22336"אֹפִ֖ים"
		BHS: subs
		OSM: qal-part act-_mp
	Genesis 41:10 w22546"אֹפִֽים"
		BHS: subs
		OSM: qal-part act-_mp
	Genesis 41:23 w22768"צְנֻמֹ֥ות"
		BHS: adjv
		OSM: qal-part pass-_fp
	Genesis 49:22 w28044"פֹּרָת֙"
		BHS: subs
		OSM: qal-part act-_fs
	Genesis 49:

In [50]:
cases = extraOSM
closerLook |= cases
print("{} cases merged into {} closer look items".format(len(cases), len(closerLook)))

161 cases merged into 348 closer look items


## Common verb base
The rest of the comparison is carried out for the *common verb base*, i.e. those words
that have been marked as verb in the BHSA and in the OSM.

In [51]:
verbBase = verbsOSM & verbsBHSfocus
print("Common verb base: {} occurrences".format(len(verbBase)))

Common verb base: 73481 occurrences
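
As a sanity check, the counts printed so far should be mutually consistent: removing the BHSA-only verbs from the BHSA side, and the OSM-only verbs from the OSM side, must both yield the common verb base.
The numbers below are copied from the outputs above; the variable names are ours.

```python
# Counts as reported in the outputs of this notebook
bhsaFocus = 73668  # BHSA verbs within the OSM-base
osmVerbs = 73642   # OSM verbs
extraBHS = 187     # verb in BHSA, not in OSM
extraOSM = 161     # verb in OSM, not in BHSA
common = 73481     # common verb base

# Both sides must shrink to the same intersection
assert bhsaFocus - extraBHS == common
assert osmVerbs - extraOSM == common
print("counts are consistent")
```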


# Feature comparison
We now compare all five verb features over the common verb base.

In [52]:
cases = showFeatures(verbBase)


o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE stem
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- stem OSM ===> BHS
---------------------------------------------------------------------------------

aphel     
	haphel          (   20x)
	aphel           (    3x)
				Daniel 3:1 w371624"אֲקִימֵהּ֙"
					BHS: aphel-perfect-3ms
					OSM: aphel-perfect-3ms
				Daniel 4:11 w372576"אַתַּ֥רוּ"
					BHS: aphel-imperative-2mp
					OSM: aphel-imperative-2mp
				Ezra 5:15 w380360"אֲחֵ֣ת"
					BHS: aphel-imperative-2ms
					OSM: aphel-imperative-2ms
	pael            (    1x)
				Ezra 4:12 w379764"יַחִֽיטוּ"
					BHS: pael-perfect-3mp
					OSM: aphel-imperfect-3mp
haphel    
	haphel          (  141x)
	aphel           (    1x)
				Daniel 5:12 w373436"אַֽחֲוָיַ֨ת"
					BHS: aphel-inf cons-???
					OSM: haphel-inf cons-_??
	

In [53]:
closerLook |= cases
print("{} cases for a closer look".format(len(closerLook)))

1180 cases for a closer look


# Result

We list all cases in [verbs.tsv](verbs.tsv).

In [54]:
fields = """
    passage
    node
    occurrence
    OSMmorph
    stemOSM
    stemBHS
    conjOSM
    conjBHS
    personOSM
    personBHS
    genderOSM
    genderBHS
    numberOSM
    numberBHS
""".strip().split()
lineFormat = ("{}\t" * (len(fields) - 1)) + "{}\n"

with open("verbs.tsv", "w", encoding="utf8") as fh:
    fh.write(lineFormat.format(*fields))
    for w in sorted(closerLook):
        fh.write(
            lineFormat.format(
                "{} {}:{}".format(*T.sectionFromNode(w)),
                w,
                F.g_word_utf8.v(w),
                F.osm.v(w),
                getStemOSMX(w),
                getStemBHSX(w),
                getConjOSMX(w),
                getConjBHSX(w),
                getPersonOSM(w),
                getPersonBHS(w),
                getGenderOSM(w),
                getGenderBHS(w),
                getNumberOSM(w),
                getNumberBHS(w),
            )
        )
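
Once the file is written, it can be inspected with the standard `csv` module.
The sketch below is one possible way to count, per feature, the rows in which the two sources disagree.
To keep it runnable on its own, it reads from an in-memory sample instead of `verbs.tsv`; the sample row is hand-derived from the Genesis 1:4 case shown earlier, and the function name `countDisagreements` is hypothetical.

```python
import csv
import io
from collections import Counter

FEATURES = ("stem", "conj", "person", "gender", "number")


def countDisagreements(fh):
    """Count rows per feature where the OSM and BHS columns differ."""
    counts = Counter()
    reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for feat in FEATURES:
            if row[feat + "OSM"] != row[feat + "BHS"]:
                counts[feat] += 1
    return counts


# Column names as written by the cell above
fields = (
    "passage node occurrence OSMmorph stemOSM stemBHS conjOSM conjBHS "
    "personOSM personBHS genderOSM genderBHS numberOSM numberBHS"
).split()

# Illustrative row (assumption: derived by hand from the Genesis 1:4 output)
sampleRow = [
    "Genesis 1:4", "47", "טֹ֑וב", "HAamsa",
    "?", "qal", "?", "perfect", "s", "3", "a", "m", "?", "s",
]

sample = io.StringIO("\t".join(fields) + "\n" + "\t".join(sampleRow) + "\n")
print(countDisagreements(sample))
```

For the real file, the same function can be called on `open("verbs.tsv", newline="", encoding="utf8")`.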