# BHSA and OSM: comparison on part-of-speech

We will investigate how the morphology marked up in the OSM corresponds to and differs from the BHSA linguistic features.

In this notebook we investigate the markup of *part-of-speech*.

We use the `osm` and `osm_sf` features compiled by the
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).

In [2]:
import operator
import collections
from functools import reduce

from tf.app import use
from helpers import show

# Load data
We load the BHSA data in the standard way, and we add the OSM data as a module that provides the features `osm` and `osm_sf`.
Note that we only need to point TF to the right GitHub org/repo/directory, in order to load the OSM features.

In [3]:
A = use("bhsa", mod="etcbc/bridging/tf", hoist=globals())

This is Text-Fabric 9.0.4
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

123 features found and 0 ignored


# Part of speech

The BHSA has two features for part-of-speech:
[sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp)
and
[pdp](https://etcbc.github.io/bhsa/features/hebrew/2017/pdp).

The first one, `sp`, is lexical part of speech, a context-insensitive assignment of part-of-speech labels to
occurrences of lexemes.

The second one, `pdp`, is *phrase dependent part of speech*. This assignment is sensitive to
cases where adjectives are used as nouns, nouns as prepositions, etc.

A preliminary check has revealed that the OSM part-of-speech resembles `sp` more than `pdp`, so
we stick to `sp`.

The OSM has part-of-speech as the second letter of the morph string.
See [here](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html).

The BHSA makes a few more distinctions in its [sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp) feature,
so we map the OSM values to sets of BHSA values.

But the OSM has a subclassification of particles (`T`) that we should use.
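
As a minimal sketch of how the part-of-speech code can be read off a morph string: the function below takes the second letter, and for particles (`T`) also the subtype letter that follows. The function name `osm_part_of_speech` is our own; the sample strings are morph strings that occur in this notebook's output.

```python
def osm_part_of_speech(morph):
    """Return the OSM part-of-speech code of a morph string, or "×" if there is none."""
    if not morph or morph == "*" or len(morph) < 2:
        return "×"
    # The first letter is the language, the second letter the part of speech.
    pos = morph[1]
    if pos == "T":
        # Particles carry a one-letter subtype, e.g. "Td" for the definite article.
        pos = morph[1:3]
    return pos


for morph in ("HAamsa", "HR", "HTd", "HNp", None):
    print(morph, "->", osm_part_of_speech(morph))
```

A bare `T` without a subtype letter comes out as `T`, which is why a mapping from OSM codes needs an entry for plain `T` as well.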

One of the OSM values is `S` (suffix).
The BHSA has no counterpart for this, but we expect that all morph strings in the `osm_sf` feature will have
`S` as their part-of-speech.

We'll test that as well.

Here is the default mapping between OSM part-of-speech and BHSA part-of-speech.

We'll see later that this results in many discrepancies.

We'll analyze the discrepancies, and try to overcome them by making lexeme-dependent exceptions to these rules.

It turns out that we need about a hundred lexeme-based exception rules, and we'll have roughly 1500
left-over cases that merit closer inspection.
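
To illustrate what a lexeme-dependent exception could look like, here is a small sketch. The names `default_mapping`, `lexeme_exceptions` and `is_consistent` are hypothetical and not part of this notebook; the two sample exceptions are taken from the list of lexeme-bound exceptions reported in this notebook. The sketch assumes that an exception replaces the default rule for that lexeme, which is only one possible design; an exception could also be added to the default set.

```python
# Toy excerpt of a default mapping: OSM code -> acceptable BHSA values.
default_mapping = {
    "A": {"adjv"},
    "R": {"prep"},
}

# Hypothetical exception table: (lexeme, OSM code) -> acceptable BHSA values.
lexeme_exceptions = {
    (">XD/", "A"): {"subs"},
    ("TXT/", "R"): {"subs"},
}


def is_consistent(lexeme, osm, bhs):
    """Check a combination; a lexeme-bound rule overrides the default rule."""
    allowed = lexeme_exceptions.get((lexeme, osm), default_mapping.get(osm, set()))
    return bhs in allowed


print(is_consistent(">XD/", "A", "subs"))  # True: covered by the exception
print(is_consistent("XYZ/", "A", "subs"))  # False: "XYZ/" is a made-up lexeme without exception
```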

In [4]:
particleTypes = dict(
    a="affirmation",
    d="definite article",
    e="exhortation",
    i="interrogative",
    j="interjection",
    m="demonstrative",
    n="negative",
    o="direct object marker",
    r="relative",
)

In [5]:
pspBhsFromOsm = {
    "A": {"adjv"},  # adjective
    "C": {"conj"},  # conjunction
    "D": {"advb"},  # adverb
    "N": {"subs", "nmpr"},  # noun
    "P": {"prps", "prde"},  # pronoun
    "R": {"prep"},  # preposition
    "S": {"_suffix_"},  # suffix
    "Ta": {"advb"},
    "Td": {"art"},
    "Te": {"intj"},
    "Ti": {"prin", "inrg"},
    "Tj": {"intj"},
    "Tm": {"intj", "advb"},
    "Tn": {"nega"},
    "To": {"prep"},  # object marker
    "Tr": {"conj"},  # relative
    "T": {"intj"},
    "V": {"verb"},  # verb
    "×": set(),  # no morphology
}

Just for ease of processing, we make a mapping from slots to OSM part-of-speech.

We assign `×` to slot `w` if there is no valid OSM part-of-speech label available for `w`.

In [7]:
osmPsp = {}
noPsp = 0
nonEmpty = 0

for w in F.otype.s("word"):
    osm = F.osm.v(w)
    word = F.g_word_utf8.v(w)
    if not word:
        continue
    nonEmpty += 1
    
    if not osm or osm == "*" or len(osm) < 2:
        psp = "×"
        noPsp += 1
    else:
        psp = osm[1]
        if psp == "T":
            psp = osm[1:3]
    osmPsp[w] = psp

allPsp = len(osmPsp)
withPsp = allPsp - noPsp
print(
    """{} BHSA words:
    having  OSM part of speech: {:>3}% = {:>6}
    without OSM part of speech: {:>3}% = {:>6}
""".format(
        nonEmpty,
        round(100 * withPsp / allPsp),
        withPsp,
        round(100 * noPsp / allPsp),
        noPsp,
    )
)

420102 BHSA words:
    having  OSM part of speech: 100% = 419849
    without OSM part of speech:   0% =    253



We organize the osm-bhs combinations that show up in the text in several ways.

`psp` is keyed by: osm, bhs, lexeme node.

`pspLex` is keyed by: lexeme node, osm, bhs.

At the innermost level, both mappings contain the set of slots where the combination occurs.
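
On toy data, the two nestings look as follows. This is only an illustration: the slot numbers, lexeme nodes and the names `observations`, `by_combination` and `by_lexeme` are invented for the example.

```python
# Toy input: (slot, lexeme node, OSM code, BHSA value); all values are invented.
observations = [
    (1, 100, "R", "prep"),
    (2, 100, "R", "prep"),
    (3, 101, "R", "subs"),
    (4, 101, "A", "subs"),
]

by_combination = {}  # osm -> bhs -> lexeme node -> set of slots
by_lexeme = {}  # lexeme node -> osm -> bhs -> set of slots

for slot, lexeme, osm, bhs in observations:
    # setdefault creates each missing level on the fly
    by_combination.setdefault(osm, {}).setdefault(bhs, {}).setdefault(lexeme, set()).add(slot)
    by_lexeme.setdefault(lexeme, {}).setdefault(osm, {}).setdefault(bhs, set()).add(slot)

print(by_combination["R"]["prep"])  # {100: {1, 2}}
print(by_lexeme[101])  # {'R': {'subs': {3}}, 'A': {'subs': {4}}}
```

The first nesting answers "which lexemes show this combination?", the second answers "which combinations does this lexeme show?".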

In [9]:
psp = {}
pspLex = {}
for lx in F.otype.s("lex"):
    ws = [w for w in L.d(lx, otype="word") if F.g_word_utf8.v(w)]
    for w in ws:
        osm = osmPsp[w]
        bhs = F.sp.v(w)
        psp.setdefault(osm, {}).setdefault(bhs, {}).setdefault(lx, set()).add(w)
        pspLex.setdefault(lx, {}).setdefault(osm, {}).setdefault(bhs, set()).add(w)

For each osm-bhs combination, we want to see how many lexemes and how many occurrences have that combination.

In [10]:
pspCount = {}
for (osm, osmData) in psp.items():
    for (bhs, bhsData) in osmData.items():
        nlex = len(bhsData)
        noccs = sum(len(x) for x in bhsData.values())
        pspCount.setdefault(osm, {})[bhs] = (nlex, noccs)

Now we are going to present an overview of osm-bhs combinations.

We mark a combination with `OK` if the combination is according to the default OSM-BHS mapping.

We use the mark `*` if there is no OSM part-of-speech available.

Otherwise we mark it with a `?`.

In [11]:
mismatches = []
for osm in pspCount:
    print(osm)
    totalOccs = sum(x[1] for x in pspCount[osm].values())
    for (bhs, (nlex, noccs)) in sorted(
        pspCount[osm].items(), key=lambda x: (-x[1][1], -x[1][0], x[0])
    ):
        perc = round(100 * noccs / totalOccs)
        status = bhs in pspBhsFromOsm[osm]
        statusLabel = "OK" if status else "?"
        if not status:
            if osm == "×":
                statusLabel = "*"
            else:
                mismatches.append((osm, bhs, nlex, noccs))
        print(
            "\t=> {:<4} ({:<2}) in {:>4} lexemes and {:>3}% = {:>6} occurrences".format(
                bhs,
                statusLabel,
                nlex,
                perc,
                noccs,
            )
        )
total = 0
for (osm, bhs, nlex, noccs) in mismatches:
    total += noccs
print("\n{:<24} {:>6} occurrences".format("Total number of mismatches", total))

R
	=> prep (OK) in   23 lexemes and  96% =  61788 occurrences
	=> subs (? ) in   29 lexemes and   3% =   2105 occurrences
	=> advb (? ) in    1 lexemes and   0% =    200 occurrences
	=> inrg (? ) in    1 lexemes and   0% =    178 occurrences
	=> art  (? ) in    1 lexemes and   0% =      9 occurrences
	=> conj (? ) in    1 lexemes and   0% =      5 occurrences
	=> nmpr (? ) in    3 lexemes and   0% =      4 occurrences
	=> verb (? ) in    3 lexemes and   0% =      4 occurrences
	=> nega (? ) in    2 lexemes and   0% =      3 occurrences
	=> adjv (? ) in    2 lexemes and   0% =      2 occurrences
	=> prin (? ) in    1 lexemes and   0% =      2 occurrences
×
	=> art  (* ) in    1 lexemes and  20% =     51 occurrences
	=> subs (* ) in   29 lexemes and  16% =     41 occurrences
	=> verb (* ) in   24 lexemes and  16% =     41 occurrences
	=> nmpr (* ) in   27 lexemes and  15% =     37 occurrences
	=> prep (* ) in    4 lexemes and  11% =     28 occurrences
	=> conj (* ) in    2 lexemes and   

It is not as bad as it seems.
The number of *lexemes* involved in a mismatch is limited:

In [12]:
mismatchLexemes = set()
for (osm, bhs, nlex, noccs) in mismatches:
    lexemes = psp[osm][bhs].keys()
    mismatchLexemes |= lexemes
print("Lexemes to be researched: {}".format(len(mismatchLexemes)))

Lexemes to be researched: 663


We are going to investigate the lexemes that are involved in a mismatch.

It turns out that:

* for most of the lexemes there is a dominant combination of OSM and BHSA assigned part-of-speech;
* non-dominant combinations mostly have a very limited number of occurrences.

This is what we are going to do:

* for each lexeme we go along with the dominant combination.
  If that is different from the default marking, we add a lexeme-bound exception to the rule
  that maps OSM part-of-speech to BHSA part-of-speech.
* if even the dominant combination has fewer than 10 occurrences, we do not add a lexeme-bound rule,
  but we add the case to the list of exceptional cases.
* we spell out the exceptional cases, so that readers can manually check the parts of speech as assigned by
  OSM and BHSA.

In order to determine what is dominant: if a combination accounts for more than 50% of the occurrences of a lexeme
then that combination is dominant.
So, for each lexeme there is at most one dominant combination.

There may be no dominant combination at all, for example if not all occurrences of a lexeme have been marked up in the OSM.
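
The decision procedure can be sketched as a small function. The function name `classify` and its parameters are our own; the labels `OK`, `NN`, `OK?` and `?` and the thresholds (a strict majority of the lexeme's occurrences, and at least 10 occurrences) follow the code of this notebook. The two sample calls use the counts reported for the lexeme `W` in this notebook.

```python
def classify(n_occurrences, lexeme_frequency, matches_default, min_occurrences=10):
    """Label one osm-bhs combination of a lexeme.

    OK  : dominant and in line with the default mapping
    NN  : dominant but not in line, so a new lexeme-bound rule is needed
    OK? : not dominant, yet in line with the default mapping
    ?   : not dominant and not in line, so a case for closer inspection
    """
    dominant = (
        2 * n_occurrences > lexeme_frequency and n_occurrences >= min_occurrences
    )
    if dominant:
        return "OK" if matches_default else "NN"
    return "OK?" if matches_default else "?"


print(classify(50234, 50272, True))  # OK
print(classify(5, 50272, False))  # ?
```

Note that a combination covering most occurrences of a rare lexeme still ends up as `?` when it has fewer than 10 occurrences.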

The next cell computes the new rules and the exceptions.
It shows all new rules and all kinds of exceptions,
but at most 10 instances of each kind of exception.

All exceptions are written to a tab-separated file
[pspCases.tsv](pspCases.tsv).
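
As an aside, tab-separated output can also be produced with Python's standard `csv` module, which takes care of quoting should a field ever contain a tab or newline. The sketch below is not how this notebook writes its file: it writes to an in-memory buffer that stands in for the real file, uses a shortened column list, and contains a single sample row based on a case reported in this notebook.

```python
import csv
import io

# Shortened column list, for illustration only.
fields = ["passage", "slot", "occurrence", "bhsa-psp", "osm-psp"]

# One sample row (Isaiah 1:5, a conjunction tagged as preposition in the OSM).
rows = [("Isaiah 1:5", 212149, "וְ", "conj", "HR")]

buffer = io.StringIO()  # stands in for open("pspCases.tsv", "w", newline="")
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(fields)
writer.writerows(rows)

print(buffer.getvalue())
```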

In [14]:
closerLook = set()

rules = []
text = []


def getOSMpsp(w):
    return "{} - {}".format(str(F.osm.v(w)), str(F.osm_sf.v(w)))


fields = """
    passage
    slot
    occurrence
    lex-node
    lex
    lex-pointed
    gloss
    bhsa-psp
    osm-psp
    #cases-like-this
""".strip().split()
lineFormat = ("{}\t" * (len(fields) - 1)) + "{}\n"

casesLikeThis = {}

for lx in sorted(mismatchLexemes, key=lambda x: -F.freq_lex.v(x)):
    freqLex = F.freq_lex.v(lx)
    text.append(
        '\n{:<15}        {:>6}x [{}] "{}"'.format(
            F.lex.v(lx),
            freqLex,
            F.gloss.v(lx),
            F.voc_lex_utf8.v(lx),
        )
    )
    nRealCases = freqLex
    if "×" in pspLex[lx]:
        for (bhs, ws) in pspLex[lx]["×"].items():
            nRealCases -= len(ws)

    osmCount = collections.Counter()
    for (osm, osmData) in pspLex[lx].items():
        for ws in osmData.values():
            osmCount[osm] += len(ws)

    for osm in sorted(pspLex[lx], key=lambda x: -osmCount[x]):
        if osm == "×":
            continue
        osmData = pspLex[lx][osm]
        for (bhs, ws) in sorted(osmData.items(), key=lambda x: (-len(x[1]), x[0])):
            showCases = False
            nws = len(ws)
            status = bhs in pspBhsFromOsm[osm]
            statusLabel = "OK" if status else "?"

            if 2 * nws > freqLex and nws >= 10:
                if status:
                    pass
                else:
                    statusLabel = "NN"
                    rules.append((lx, osm, bhs, ws))
            else:
                if status:
                    statusLabel = "OK?"
                else:
                    showCases = True

            text.append(
                "\t{:<2} ~ {:<4} ({:<3}) {:>6}x".format(
                    bhs,
                    osm,
                    statusLabel,
                    nws,
                )
            )

            if showCases:
                for w in sorted(ws)[0:10]:
                    text.append(
                        show(
                            T,
                            F,
                            [w],
                            F.sp.v,
                            getOSMpsp,
                            indent="\t\t\t\t\t",
                            asString=True,
                        )
                    )
                if nws > 10:
                    text.append("\t\t\t\t\tand {} more occurrences".format(nws - 10))
                closerLook |= set(ws)
                for w in ws:
                    casesLikeThis[w] = nws

with open("pspCases.tsv", "w") as fh:
    fh.write(lineFormat.format(*fields))
    for w in sorted(closerLook):
        # look up the lexeme of this word; otherwise lx is left over from the loop above
        lx = L.u(w, otype="lex")[0]
        fh.write(
            lineFormat.format(
                "{} {}:{}".format(*T.sectionFromNode(w)),
                w,
                F.g_word_utf8.v(w),
                lx,
                F.lex.v(lx),
                F.voc_lex_utf8.v(lx),
                F.gloss.v(lx),
                F.sp.v(w),
                F.osm.v(w),
                casesLikeThis[w],
            )
        )
print("Written {} cases to file".format(len(closerLook)))

if rules:
    print("Lexeme-bound exceptions  : {:>4}".format(len(rules)))
else:
    print("No lexeme-bound exceptions")

if closerLook or text:
    print("Cases that need attention: {:>4}".format(len(closerLook)))
else:
    print("All cases clear")

print("\nLEXEME-BOUND EXCEPTIONS\n")
casesSolved = set()
for (lx, osm, bhs, ws) in rules:
    casesSolved |= set(ws)
    print(
        '\t{:<15} {:<4} ~ {:<2} ({:>5}x) [{:<20}] "{}"'.format(
            F.lex.v(lx),
            bhs,
            osm,
            len(ws),
            F.gloss.v(lx),
            F.voc_lex_utf8.v(lx),
        )
    )
print("This solves {} cases".format(len(casesSolved)))
print("Remaining cases: {}".format(total - len(casesSolved)))

Written 1551 cases to file
Lexeme-bound exceptions  :   95
Cases that need attention: 1551

LEXEME-BOUND EXCEPTIONS

	>XD/            subs ~ A  (  970x) [one                 ] "אֶחָד"
	>JN/            subs ~ Tn (  784x) [<NEG>               ] "אַיִן"
	CNJM/           subs ~ A  (  768x) [two                 ] "שְׁנַיִם"
	>XR/            subs ~ R  (  679x) [after               ] "אַחַר"
	CLC/            subs ~ A  (  602x) [three               ] "שָׁלֹשׁ"
	M>H/            subs ~ A  (  578x) [hundred             ] "מֵאָה"
	XMC/            subs ~ A  (  506x) [five                ] "חָמֵשׁ"
	TXT/            subs ~ R  (  502x) [under part          ] "תַּחַת"
	>LP=/           subs ~ A  (  492x) [thousand            ] "אֶלֶף"
	CB</            subs ~ A  (  487x) [seven               ] "שֶׁבַע"
	<WD/            subs ~ D  (  490x) [duration            ] "עֹוד"
	>RB</           subs ~ A  (  454x) [four                ] "אַרְבַּע"
	BJN/            subs ~ R  (  407x) [interval            ] "בַּיִן"
	

We show the top of the file with the cases for attention.

In [19]:
nLines = 50
print(f"\nCASES FOR ATTENTION (showing first {nLines} entries)\n")
for t in text[0:nLines]:
    print(t)
print(f"\n ... AND {len(text) - nLines} entries more")


CASES FOR ATTENTION (showing first 50 entries)


W                       50272x [and] "וְ"
	conj ~ C    (OK )  50234x
	conj ~ R    (?  )      5x
					Isaiah 1:5 w212149"וְ"
						BHS: conj
						OSM: HR - None
					Jeremiah 49:37 w261838"וְ"
						BHS: conj
						OSM: HR - None
					Jonah 2:4 w298983"וְ"
						BHS: conj
						OSM: HR - None
					Jonah 3:5 w299160"וְ"
						BHS: conj
						OSM: HR - None
					Jonah 4:11 w299540"וּ"
						BHS: conj
						OSM: HR - None

H                       30386x [the] "הַ"
	art ~ Td   (OK )  23858x
	art ~ R    (?  )      9x
					Judges 13:8 w135894"הַ"
						BHS: art
						OSM: HRd - None
					Judges 14:14 w136543"הָֽ"
						BHS: art
						OSM: HRd - None
					2_Samuel 15:18 w169279"הַ"
						BHS: art
						OSM: HRd - None
					2_Samuel 19:4 w172015"הַ"
						BHS: art
						OSM: HR - None
					2_Samuel 19:32 w172761"ב"
						BHS: art
						OSM: HR - None
					Haggai 2:6 w304609"הֶ"
						BHS: art
						OSM: HR - None
					Haggai 2:19 w304903"הַ"
			

# SP versus PDP

Here is the computation that shows that the BHSA feature
[sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp)
matches the OSM part-of-speech better than
[pdp](https://etcbc.github.io/bhsa/features/hebrew/2017/pdp).
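
The idea of the comparison, in miniature: for each word, test both BHSA labels against the set of values that the OSM code maps to, and count the misses per feature. The records below are invented toy data and the names `mapping`, `words` and `mismatch_counts` are hypothetical; the real computation runs over all words of the corpus.

```python
from collections import Counter

# Toy excerpt of a mapping: OSM code -> acceptable BHSA values.
mapping = {"A": {"adjv"}, "R": {"prep"}, "Td": {"art"}}

# Invented records: the OSM code plus both BHSA labels of the same word.
words = [
    {"osm": "A", "sp": "adjv", "pdp": "subs"},  # adjective used as a noun
    {"osm": "R", "sp": "subs", "pdp": "subs"},
    {"osm": "R", "sp": "prep", "pdp": "prep"},
    {"osm": "Td", "sp": "art", "pdp": "conj"},
]

mismatch_counts = Counter()
for word in words:
    allowed = mapping[word["osm"]]
    for feature in ("sp", "pdp"):
        if word[feature] not in allowed:
            mismatch_counts[feature] += 1

print(dict(mismatch_counts))  # {'pdp': 3, 'sp': 1}
```

The feature with the lower count is the one that agrees better with the OSM under the given mapping.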

In [21]:
discrepancies = {}

for w in F.otype.s("word"):
    if not F.g_word_utf8.v(w):
        continue
    osm = osmPsp[w]
    if osm == "×":
        continue
    lex = F.lex.v(w)
    trans = pspBhsFromOsm[osm]
    if F.sp.v(w) not in trans:
        discrepancies.setdefault("sp", set()).add(w)
    if F.pdp.v(w) not in trans:
        discrepancies.setdefault("pdp", set()).add(w)

lexDiscrepancies = {}  # discrepancies per lexeme
for (ft, ws) in sorted(discrepancies.items()):
    for w in sorted(ws):
        lexNode = L.u(w, otype="lex")[0]
        lexInfo = lexDiscrepancies.setdefault(ft, {})
        if lexNode in lexInfo:
            continue
        lexInfo[lexNode] = w

if discrepancies:
    print("Discrepancies")
    for (ft, lexInfo) in sorted(lexDiscrepancies.items()):
        print("\n--- {:<4}: {:>4} lexemes ---\n".format(ft, len(lexInfo)))

    for (ft, ws) in sorted(discrepancies.items()):
        n = len(ws)
        print("\n--- {:<4}: {:>6}x ---\n".format(ft, n))
        for w in sorted(ws)[0:10]:
            show(T, F, [w], Fs(ft).v, getOSMpsp)
        if n > 10:
            print("\tand {} more".format(n - 10))

Discrepancies

--- pdp : 1285 lexemes ---


--- sp  :  663 lexemes ---


--- pdp :  18993x ---

Genesis 1:4 w47"טֹ֑וב"
	BHS: verb
	OSM: HAamsa - None
Genesis 1:5 w78"אֶחָֽד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:7 w108"תַּ֣חַת"
	BHS: subs
	OSM: HR - None
Genesis 1:9 w147"תַּ֤חַת"
	BHS: subs
	OSM: HR - None
Genesis 1:9 w152"אֶחָ֔ד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:10 w178"טֹֽוב"
	BHS: verb
	OSM: HAamsa - None
Genesis 1:12 w227"טֹֽוב"
	BHS: verb
	OSM: HAamsa - None
Genesis 1:16 w286"שְׁנֵ֥י"
	BHS: subs
	OSM: HAcmdc - None
Genesis 1:18 w351"טֹֽוב"
	BHS: verb
	OSM: HAamsa - None
Genesis 1:21 w392"הַֽ"
	BHS: conj
	OSM: HTd - None
	and 18983 more

--- sp  :  14594x ---

Genesis 1:4 w47"טֹ֑וב"
	BHS: verb
	OSM: HAamsa - None
Genesis 1:4 w51"בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:4 w55"בֵ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:5 w78"אֶחָֽד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:6 w91"בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:7 w103"בֵּ֤ין"
	BHS: subs
	OSM: HR - None
Gen

In [22]:
strangePsp = {}
strangeSuffix = {}

for w in F.otype.s("word"):
    if not F.g_word_utf8.v(w):
        continue
    osm = osmPsp[w]
    if osm == "×":
        continue

    if osm == "S" or osm not in pspBhsFromOsm:
        strangePsp.setdefault(osm, set()).add(w)

    osm_sf = F.osm_sf.v(w)
    if osm_sf:
        osmSuffix = None if len(osm_sf) < 2 else osm_sf[1]
        if osmSuffix != "S":
            strangeSuffix.setdefault(osmSuffix, set()).add(w)

if strangePsp:
    print("Strange psp")
    for (ln, ws) in sorted(strangePsp.items()):
        print("\t{:<5}: {:>5}x".format(ln, len(ws)))
        for w in sorted(ws)[0:5]:
            show(T, F, [w], F.sp.v, getOSMpsp, indent="\t\t")
        n = len(ws)
        if n > 5:
            print("and {} more".format(n - 5))
else:
    print("No other psps encountered than {}".format(", ".join(pspBhsFromOsm)))
if strangeSuffix:
    print("Strange suffix psp")
    for (ln, ws) in sorted(strangeSuffix.items()):
        print("\t{:<5}: {:>5}x".format(ln, len(ws)))
        for w in sorted(ws)[0:5]:
            show(T, F, [w], F.sp.v, getOSMpsp, indent="\t\t")
        n = len(ws)
        if n > 5:
            print("and {} more".format(n - 5))
else:
    print("No other suffix psps encountered than S")

Strange psp
	S    :     5x
		Jeremiah 18:3 w243711"הו"
			BHS: prps
			OSM: HSp3ms - None
		Jeremiah 36:32 w255103"הֵֽמָּה"
			BHS: prps
			OSM: HSp3mp - None
		Ezekiel 8:6 w267780"הם"
			BHS: prps
			OSM: HSp3mp - None
		Zechariah 5:9 w306339"הֵ֥נָּה"
			BHS: prps
			OSM: HSp3fp - None
		Song_of_songs 6:5 w358746"הֵ֖ם"
			BHS: prps
			OSM: HSp3mp - None
Strange suffix psp
	A    :    19x
		Genesis 23:2 w10754"קִרְיַ֥ת אַרְבַּ֛ע"
			BHS: nmpr
			OSM: HNp - HAcfsa
		Joshua 14:15 w121375"קִרְיַ֣ת אַרְבַּ֔ע"
			BHS: nmpr
			OSM: HNp - HAcfsa
		Joshua 15:13 w121692"קִרְיַ֥ת אַרְבַּ֛ע"
			BHS: nmpr
			OSM: HNp - HAcfsa
		Joshua 15:54 w122085"קִרְיַ֥ת אַרְבַּ֛ע"
			BHS: nmpr
			OSM: HNp - HAcfsa
		Joshua 20:7 w124314"קִרְיַ֥ת אַרְבַּ֛ע"
			BHS: nmpr
			OSM: HNp - HAcfsa
and 14 more
	C    :     3x
		2_Samuel 16:10 w170003"וכי"
			BHS: conj
			OSM: HC - HC
		Jeremiah 39:12 w256632"כִּ֗י אם"
			BHS: conj
			OSM: HC - HC
		Ruth 3:12 w356983"כִּ֥י אם"
			BHS: conj
			OSM: HC - HC
	D    :   200x
		