# BHSA and OSM: comparison on part-of-speech

We will investigate how the morphology marked up in the OSM corresponds to and differs from the BHSA linguistic features.

In this notebook we investigate the markup of *part-of-speech*.

We use the `osm` and `osm_sf` features compiled by the 
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).

In [1]:
import os
import operator
import collections
from functools import reduce

from tf.fabric import Fabric
from utils import show

# Load data
We load the BHSA data in the standard way, and we add the OSM data as a module consisting of the features `osm` and `osm_sf`.
Note that we only need to point TF to the right directories, and then we can load all features
that are present in those directories.

In [2]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    sp pdp
    lex voc_lex_utf8 freq_lex
    gloss
    osm osm_sf
    g_word_utf8
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored
  0.00s loading features ...
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B sp                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B pdp                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B lex                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.01s B voc_lex_utf8         from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.10s B freq_lex             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.01s B gloss                from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.02s B osm_sf               from

# Part of speech

The BHSA has two features for part-of-speech:
[sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp)
and
[pdp](https://etcbc.github.io/bhsa/features/hebrew/2017/pdp).

The first one, `sp`, is lexical part of speech, a context-insensitive assignment of part-of-speech labels to 
occurrences of lexemes.

The second one, `pdp`, is *phrase dependent part of speech*. This assignment is sensitive to
cases where adjectives are used as nouns, nouns as prepositions, etc.

A preliminary check has revealed that the OSM part-of-speech resembles `sp` more than `pdp`, so
we stick to `sp`.

The OSM has part-of-speech as the second letter of the morph string.
See [here](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html).
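
As a minimal sketch of this convention, the helper below reads the second character of a morph string; the first character marks the language. The name `osm_pos` is hypothetical and not part of this notebook. The sample strings are copied from the output shown later in this notebook.

```python
def osm_pos(morph):
    """Return the part-of-speech letter of an OSM morph string, or None.

    Hypothetical helper: position 0 is the language marker,
    position 1 is the part-of-speech letter.
    """
    if not morph or len(morph) < 2:
        return None
    return morph[1]


# Sample morph strings as they appear in the output of this notebook.
for morph in ('HTo', 'HAcmsa', 'HVqsmsc', 'HC', 'HR'):
    print(morph, '->', osm_pos(morph))
```

Returning `None` for missing or too-short strings is a design choice of this sketch; the notebook itself uses the label `×` for that situation.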

The BHSA makes a few more distinctions in its [sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp) feature,
so we map the OSM values to sets of BHSA values.

One of the OSM values is `S` (suffix).
The BHSA has no counterpart for this, but we expect that all morph strings in the `osm_sf` feature will show
the `S`.

We'll test that as well.
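
The expectation about suffixes can be sketched as a small predicate. The name `suffix_is_marked` is hypothetical; the sample strings are taken from the `OSM:` lines in the output of this notebook, where the part after the dash is the `osm_sf` value.

```python
def suffix_is_marked(morph_sf):
    """True if a suffix morph string carries 'S' as its second letter."""
    return bool(morph_sf) and len(morph_sf) >= 2 and morph_sf[1] == 'S'


# 'HSp1cs' follows the expectation; 'HAcbsa', 'HD' and 'HC' do not.
for morph_sf in ('HSp1cs', 'HAcbsa', 'HD', 'HC'):
    label = 'as expected' if suffix_is_marked(morph_sf) else 'strange'
    print(morph_sf, label)
```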

Here is the default mapping between OSM part-of-speech and BHSA part-of-speech.

We'll see later that this results in many discrepancies.

We'll analyze the discrepancies, and try to overcome them by making lexeme-dependent exceptions to these rules.

It turns out that we need **75** lexeme-based exception rules and we'll have **1206** left-over cases that
merit closer inspection.

In [5]:
pspBhsFromOsm = {
    'A': {'adjv'}, # adjective
    'C': {'conj'}, # conjunction
    'D': {'advb'}, # adverb
    'N': {'subs', 'nmpr'}, # noun
    'P': {'prps', 'prde', 'prin', 'inrg'}, # pronoun
    'R': {'prep'}, # preposition 
    'S': {'_suffix_'}, # suffix
    'T': {'art', 'intj', 'nega'}, # particle 
    'V': {'verb'}, # verb
    '×': set(), # no morphology
}
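
To illustrate how such a mapping is consulted, here is a minimal, self-contained sketch. It repeats the default mapping under the hypothetical name `default_mapping` so that it runs on its own, and the helper `agrees` is likewise an assumption of this sketch, not a function used elsewhere in the notebook.

```python
# Copy of the default mapping, repeated so this sketch is self-contained.
default_mapping = {
    'A': {'adjv'},
    'C': {'conj'},
    'D': {'advb'},
    'N': {'subs', 'nmpr'},
    'P': {'prps', 'prde', 'prin', 'inrg'},
    'R': {'prep'},
    'S': {'_suffix_'},
    'T': {'art', 'intj', 'nega'},
    'V': {'verb'},
    '×': set(),
}


def agrees(osm, bhs, mapping=default_mapping):
    """True if the BHSA label is among the values allowed for the OSM letter."""
    # Unknown OSM letters are treated as allowing nothing (a choice of this sketch).
    return bhs in mapping.get(osm, set())


print(agrees('N', 'nmpr'))  # a proper noun counts as a noun
print(agrees('T', 'prep'))  # particle versus preposition: a discrepancy
```

Note that the notebook cells index the mapping directly, which raises a `KeyError` for an unknown OSM letter; the `.get` fallback here is only for convenience.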

Just for ease of processing, we make a mapping from slots to OSM part-of-speech.

We assign `×` to slot `w` if there is no valid OSM part-of-speech label available for `w`. 

In [6]:
osmPsp = {}
noPsp = 0
for w in F.otype.s('word'):
    osm = F.osm.v(w)
    if osm is None or osm == '' or osm == '*' or len(osm) < 2:
        psp = '×'
        noPsp += 1
    else:
        psp = osm[1]
    osmPsp[w] = psp 

allPsp = len(osmPsp)
withPsp = allPsp - noPsp
print('''{} BHSA words:
    having  OSM part of speech: {:>3}% = {:>6}
    without OSM part of speech: {:>3}% = {:>6}
'''.format(
    F.otype.maxSlot,
    round(100 * withPsp / allPsp),
    withPsp,
    round(100 * noPsp / allPsp),
    noPsp,
))

426584 BHSA words:
    having  OSM part of speech:  87% = 372636
    without OSM part of speech:  13% =  53948



We organize the osm-bhs combinations that show up in the text in several ways.

`psp` is keyed by: osm, bhs, lexeme node.

`pspLex` is keyed by: lexeme node, osm, bhs.

At the innermost level, both mappings contain the set of slots where the combination occurs.

In [7]:
psp = {}
pspLex = {}
for lx in F.otype.s('lex'):
    ws = L.d(lx, otype='word')
    for w in ws:
        osm = osmPsp[w]
        bhs = F.sp.v(w)
        psp.setdefault(osm, {}).setdefault(bhs, {}).setdefault(lx, set()).add(w)
        pspLex.setdefault(lx, {}).setdefault(osm, {}).setdefault(bhs, set()).add(w)
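
The shape of these two nested dictionaries is easier to see on a toy example. The rows below are hypothetical (made-up slots and lexeme names), and `psp_toy` and `psp_lex_toy` are names used only in this sketch.

```python
# Hypothetical toy rows: (slot, lexeme, osm letter, bhs label).
rows = [
    (1, 'lexA', 'T', 'prep'),
    (2, 'lexA', 'T', 'prep'),
    (3, 'lexB', 'R', 'subs'),
    (4, 'lexB', 'N', 'subs'),
]

psp_toy = {}      # osm -> bhs -> lexeme -> set of slots
psp_lex_toy = {}  # lexeme -> osm -> bhs -> set of slots

for (w, lx, osm, bhs) in rows:
    # setdefault creates the intermediate level on first use and returns it.
    psp_toy.setdefault(osm, {}).setdefault(bhs, {}).setdefault(lx, set()).add(w)
    psp_lex_toy.setdefault(lx, {}).setdefault(osm, {}).setdefault(bhs, set()).add(w)

print(psp_toy)
print(psp_lex_toy)
```

The same slots end up in both structures; only the order of the keys differs, which makes it cheap to look up either by combination or by lexeme.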

For each osm-bhs combination, we want to see how many lexemes and how many occurrences have that combination.

In [8]:
pspCount = {}
for (osm, osmData) in psp.items():
    for (bhs, bhsData) in osmData.items():
        nlex = len(bhsData)
        noccs = reduce(operator.add, (len(x) for x in bhsData.values()), 0)
        pspCount.setdefault(osm, {})[bhs] = (nlex, noccs)
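
As a small aside on the counting step: the `reduce(operator.add, ...)` expression adds up the sizes of the slot sets, which could equally be written with the built-in `sum`. The sketch below uses a hypothetical `bhs_data` dictionary to show that both give the same number.

```python
import operator
from functools import reduce

# Hypothetical data for one osm-bhs combination: lexeme -> set of slots.
bhs_data = {'lexA': {1, 2}, 'lexB': {3}}

nlex = len(bhs_data)  # number of lexemes with this combination
noccs_reduce = reduce(operator.add, (len(x) for x in bhs_data.values()), 0)
noccs_sum = sum(len(x) for x in bhs_data.values())

print(nlex, noccs_reduce, noccs_sum)
```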

Now we are going to present an overview of osm-bhs combinations.

We mark a combination with `OK` if the combination is according to the default OSM-BHS mapping.

We use the mark `*` if there is no OSM part-of-speech available.

Otherwise we mark it with a `?`.

In [9]:
mismatches = []
for osm in pspCount:
    print(osm)
    totalOccs = sum(x[1] for x in pspCount[osm].values())
    for (bhs, (nlex, noccs)) in sorted(pspCount[osm].items(), key=lambda x: (-x[1][1], -x[1][0], x[0])):
        perc = round(100 * noccs / totalOccs)
        status = bhs in pspBhsFromOsm[osm]
        statusLabel = 'OK' if status else '?'
        if not status:
            if osm == '×':
                statusLabel = '*'
            else:
                mismatches.append((osm, bhs, nlex, noccs))
        print('\t=> {:<4} ({:<2}) in {:>4} lexemes and {:>3}% = {:>6} occurrences'.format(
            bhs, statusLabel, nlex, perc, noccs,
        ))
total = 0
for (osm, bhs, nlex, noccs) in mismatches:
    total += noccs
print('\n{:<24} {:>6} occurrences'.format('Total number of mismatches', total))

R
	=> prep (OK) in   22 lexemes and  96% =  55591 occurrences
	=> subs (? ) in   18 lexemes and   3% =   1658 occurrences
	=> advb (? ) in    1 lexemes and   0% =    194 occurrences
	=> inrg (? ) in    1 lexemes and   0% =    169 occurrences
	=> conj (? ) in    2 lexemes and   0% =      5 occurrences
	=> art  (? ) in    1 lexemes and   0% =      2 occurrences
	=> nmpr (? ) in    1 lexemes and   0% =      2 occurrences
	=> verb (? ) in    1 lexemes and   0% =      1 occurrences
×
	=> verb (* ) in 1189 lexemes and  25% =  13584 occurrences
	=> subs (* ) in 2427 lexemes and  23% =  12637 occurrences
	=> art  (* ) in    1 lexemes and  17% =   9384 occurrences
	=> conj (* ) in    9 lexemes and  14% =   7821 occurrences
	=> prep (* ) in   23 lexemes and  12% =   6495 occurrences
	=> adjv (* ) in  468 lexemes and   4% =   2090 occurrences
	=> nmpr (* ) in  452 lexemes and   2% =   1279 occurrences
	=> inrg (* ) in   12 lexemes and   0% =    246 occurrences
	=> intj (* ) in   12 lexemes and   

It is not as bad as it seems.
The number of *lexemes* involved in a mismatch is limited:

In [10]:
mismatchLexemes = set()
for (osm, bhs, nlex, noccs) in mismatches:
    lexemes = psp[osm][bhs].keys()
    mismatchLexemes |= lexemes
print('Lexemes to be researched: {}'.format(len(mismatchLexemes)))

Lexemes to be researched: 229


We are going to investigate the lexemes that are involved in a mismatch.

It turns out that:

* for most of the lexemes there is a dominant combination of OSM and BHSA assigned part-of-speech;
* non-dominant combinations have a very limited number of occurrences.

This is what we are going to do:

* for each lexeme we go along with the dominant combination.
  If that is different from the default marking, we add a lexeme-bound exception to the rule 
  that maps OSM part-of-speech to BHSA part-of-speech.
* if even the dominant combination has fewer than 10 occurrences, we do not add a lexeme-bound rule,
  but we add the case to the list of exceptional cases.
* we spell out the exceptional cases, so that readers can manually check the parts of speech as assigned by
  OSM and BHSA.
  
A combination is dominant for a lexeme if it accounts for more than 50% of the occurrences of that lexeme
(the code below tests `2 * nws > freqLex`).
So, for each lexeme there is at most one dominant case.

There may not be a dominant case if not all occurrences of a lexeme have been marked up in the OSM.
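
The decision rule can be sketched in isolation. The next cell tests `2 * nws > freqLex and nws >= 10`; the function below mirrors that test. The name `is_rule_candidate`, the parameter `threshold` and the sample numbers are assumptions of this sketch.

```python
def is_rule_candidate(nws, freq_lex, threshold=10):
    """True if a combination covers more than half of the lexeme's
    occurrences and has at least `threshold` occurrences."""
    return 2 * nws > freq_lex and nws >= threshold


# Hypothetical counts: (occurrences of the combination, lexeme frequency).
for (nws, freq_lex) in [(600, 1000), (500, 1000), (6, 8), (1, 1)]:
    print(nws, freq_lex, is_rule_candidate(nws, freq_lex))
```

A combination with exactly half of the occurrences does not pass, and neither does a rare lexeme whose only combination occurs fewer than 10 times; such cases end up in the list for manual inspection.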

The next cell computes the new rules and the exceptions.
It shows all new rules and all kinds of exceptions,
but at most 10 instances of each kind of exception.

A .tsv file with all exceptions can be downloaded via this 
[link](https://github.com/ETCBC/bridging/blob/master/programs/pspCases.tsv).
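
For readers who want to process that file, here is a minimal sketch of reading it with the standard `csv` module. It assumes the file has exactly the nine tab-separated columns written by the next cell and no header row. The sample line is built from the Exodus 16:15 case shown in the output, except for the lexeme node, for which the placeholder `0` is used because the real value is not shown here.

```python
import csv
import io

# Column names as listed in the cell that writes the file.
FIELDS = [
    'slot', 'occurrence', 'lex-node', 'lex', 'lex-pointed',
    'gloss', 'bhsa-psp', 'osm-psp', '#cases-like-this',
]


def read_cases(handle):
    """Read tab-separated cases from an open text handle into dictionaries."""
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    return [dict(zip(FIELDS, row)) for row in reader]


# One sample line; the lex-node value 0 is a placeholder, not real data.
sample = io.StringIO('37694\tמָ֣ן\t0\tMN=\tמָן\twhat\tprin\tHTi\t1\n')
cases = read_cases(sample)
print(cases[0]['lex'], cases[0]['bhsa-psp'], cases[0]['osm-psp'])
```

For the real file, the handle would come from something like `open('pspCases.tsv', encoding='utf8', newline='')`.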

In [11]:
researchCases = []

inspectCases = 0
rules = []

text = []

fh = open('pspCases.tsv', 'w')
fields = '''
    slot
    occurrence
    lex-node
    lex
    lex-pointed
    gloss
    bhsa-psp
    osm-psp
    #cases-like-this
'''.strip().split()
lineFormat = ('{}\t' * (len(fields) - 1)) + '{}\n'

def getOSMpsp(w):
    return '{} - {}'.format(str(F.osm.v(w)), str(F.osm_sf.v(w)))

for lx in sorted(mismatchLexemes, key=lambda x: -F.freq_lex.v(x)):
    freqLex = F.freq_lex.v(lx)
    text.append('\n{:<15}        {:>6}x [{}] "{}"'.format(
        F.lex.v(lx), 
        freqLex,
        F.gloss.v(lx),
        F.voc_lex_utf8.v(lx), 
    ))
    nRealCases = freqLex
    if '×' in pspLex[lx]:
        for (bhs, ws) in pspLex[lx]['×'].items():
            nRealCases -= len(ws)

    osmCount = collections.Counter()
    for (osm, osmData) in pspLex[lx].items():
        for ws in osmData.values():
            osmCount[osm] += len(ws)
            
    for osm in sorted(pspLex[lx], key=lambda x: -osmCount[x]):
        osmData = pspLex[lx][osm]
        for (bhs, ws) in sorted(osmData.items(), key=lambda x: (-len(x[1]), x[0])):
            showCases = False
            nws = len(ws)
            status = bhs in pspBhsFromOsm[osm]
            statusLabel = 'OK' if status else '?'

            if 2 * nws > freqLex and nws >= 10:
                if not status and osm != '×':
                    statusLabel = 'NN'
                    rules.append((lx, osm, bhs, nws))
            else:
                if status:
                    statusLabel = 'OK?'
                showCases = osm != '×'

            text.append('\t{} ~ {:<4} ({:<3}) {:>6}x'.format(
                bhs, osm, statusLabel, nws,
            ))
            
            if showCases:
                text.append('\n{}'.format('-' * 60))
                inspectCases += nws
                for w in sorted(ws)[0:10]:
                    text.append(show(T, F, [w], F.sp.v, getOSMpsp, indent='\t\t', asString=True))
                if nws > 10:
                    text.append('\tand {} more occurrences'.format(nws - 10))
                text.append('{}\n'.format('-' * 60))
                for w in ws:
                    fh.write(lineFormat.format(
                        w,
                        F.g_word_utf8.v(w),
                        lx, 
                        F.lex.v(lx),
                        F.voc_lex_utf8.v(lx),
                        F.gloss.v(lx),
                        F.sp.v(w),
                        F.osm.v(w),
                        nws,
                    ))

fh.close()

if rules:
    print('Lexeme-bound exceptions  : {:>4}'.format(len(rules)))
else:
    print('No lexeme-bound exceptions')

if inspectCases or text:
    print('Cases that need attention: {:>4}'.format(inspectCases))
else:
    print('All cases clear')

print('\nLEXEME-BOUND EXCEPTIONS\n')
casesSolved = 0
for (lx, osm, bhs, nws) in rules:
    casesSolved += nws
    print('\t{:<15} {:<4} ~ {} ({:>5}x) [{}] "{}"'.format(
        F.lex.v(lx),
        bhs,
        osm,
        nws,
        F.gloss.v(lx),
        F.voc_lex_utf8.v(lx),
))
print('This solves {} cases'.format(casesSolved))
print('Remaining cases: {}'.format(total - casesSolved))

print('\nCASES FOR ATTENTION\n')
for t in text: print(t)


Lexeme-bound exceptions  :   75
Cases that need attention: 1206

LEXEME-BOUND EXCEPTIONS

	>T              prep ~ T (10863x) [<object marker>] "אֵת"
	>CR             conj ~ T ( 5497x) [<relative>] "אֲשֶׁר"
	>XD/            subs ~ A (  910x) [one] "אֶחָד"
	>JN/            subs ~ T (  639x) [<NEG>] "אַיִן"
	GM              advb ~ T (  690x) [even] "גַּם"
	CNJM/           subs ~ A (  768x) [two] "שְׁנַיִם"
	H=              inrg ~ T (  536x) [<interrogative>] "הֲ"
	>XR/            subs ~ R (  586x) [after] "אַחַר"
	CLC/            subs ~ A (  600x) [three] "שָׁלֹשׁ"
	M>H/            subs ~ A (  565x) [hundred] "מֵאָה"
	MH              prin ~ T (  568x) [what] "מָה"
	KN              advb ~ T (  422x) [thus] "כֵּן"
	XMC/            subs ~ A (  506x) [five] "חָמֵשׁ"
	TXT/            subs ~ R (  370x) [under part] "תַּחַת"
	>LP=/           subs ~ A (  492x) [thousand] "אֶלֶף"
	CB</            subs ~ A (  482x) [seven] "שֶׁבַע"
	<WD/            subs ~ D (  423x) [duration] "עֹוד"
	>RB</        

PWYJ/                       1x [<uncertain>] "פּוּצַי"
	subs ~ V    (?  )      1x

------------------------------------------------------------
		Zephaniah 3:10 w303914 "פּוּצַ֔י"
			BHS: subs
			OSM: HVqsmsc - HSp1cs
------------------------------------------------------------


MN=                         1x [what] "מָן"
	prin ~ T    (?  )      1x

------------------------------------------------------------
		Exodus 16:15 w37694 "מָ֣ן"
			BHS: prin
			OSM: HTi - None
------------------------------------------------------------


KJ/                         1x [branding] "כִּי"
	subs ~ C    (?  )      1x

------------------------------------------------------------
		Isaiah 3:24 w213261 "כִּי"
			BHS: subs
			OSM: HC - None
------------------------------------------------------------


CB<H/                       1x [Shibah] "שִׁבְעָה"
	nmpr ~ A    (?  )      1x

------------------------------------------------------------
		Genesis 26:33 w13591 "שִׁבְעָ֑ה"
			BHS: nmpr
			OSM: HAcms

# SP versus PDP

Here is the computation that shows that the BHSA feature
[sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp)
matches the OSM part-of-speech better than
[pdp](https://etcbc.github.io/bhsa/features/hebrew/2017/pdp).
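
The idea behind the comparison can be sketched on a few toy words: for each word, check whether its `sp` value and its `pdp` value fall inside the set allowed for its OSM letter, and collect the words that fail per feature. The word list below is hypothetical, and `toy_mapping` is a cut-down copy of the default mapping.

```python
# Cut-down copy of the default OSM-to-BHSA mapping.
toy_mapping = {
    'R': {'prep'},
    'T': {'art', 'intj', 'nega'},
    'A': {'adjv'},
}

# Hypothetical toy words: (slot, osm letter, sp value, pdp value).
words = [
    (1, 'R', 'prep', 'prep'),
    (2, 'R', 'subs', 'prep'),
    (3, 'A', 'adjv', 'subs'),
    (4, 'T', 'prep', 'prep'),
]

toy_discrepancies = {'sp': set(), 'pdp': set()}
for (w, osm, sp, pdp) in words:
    allowed = toy_mapping[osm]
    if sp not in allowed:
        toy_discrepancies['sp'].add(w)
    if pdp not in allowed:
        toy_discrepancies['pdp'].add(w)

for (ft, ws) in sorted(toy_discrepancies.items()):
    print(ft, sorted(ws))
```

The feature with the smaller discrepancy set is the better match; on the real data that comparison is what the next cell reports.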

In [12]:
discrepancies = {}

for w in F.otype.s('word'):
    osm = osmPsp[w]
    if osm == '×': continue
    lex = F.lex.v(w)
    trans = pspBhsFromOsm[osm]
    if F.sp.v(w) not in trans:
        discrepancies.setdefault('sp', set()).add(w)
    if F.pdp.v(w) not in trans:
        discrepancies.setdefault('pdp', set()).add(w)
        
lexDiscrepancies = {} # discrepancies per lexeme
for (ft, ws) in sorted(discrepancies.items()):
    for w in sorted(ws):
        lexNode = L.u(w, otype='lex')[0]
        lexInfo = lexDiscrepancies.setdefault(ft, {})
        if lexNode in lexInfo:
            continue
        lexInfo[lexNode] = w

if discrepancies:
    print('Discrepancies')
    for (ft, lexInfo) in sorted(lexDiscrepancies.items()):
        print('\n--- {:<4}: {:>4} lexemes ---\n'.format(ft, len(lexInfo)))
        
    for (ft, ws) in sorted(discrepancies.items()):
        n = len(ws)
        print('\n--- {:<4}: {:>6}x ---\n'.format(ft, n))
        for w in sorted(ws)[0:10]:
            show(T, F, [w], Fs(ft).v, getOSMpsp)
        if n > 10:
            print('\tand {} more'.format(n - 10))

Discrepancies

--- pdp :  938 lexemes ---


--- sp  :  229 lexemes ---


--- pdp :  35443x ---

Genesis 1:1 w5 "אֵ֥ת"
	BHS: prep
	OSM: HTo - None
Genesis 1:1 w9 "אֵ֥ת"
	BHS: prep
	OSM: HTo - None
Genesis 1:4 w43 "אֶת"
	BHS: prep
	OSM: HTo - None
Genesis 1:5 w78 "אֶחָֽד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:7 w98 "אֶת"
	BHS: prep
	OSM: HTo - None
Genesis 1:7 w106 "אֲשֶׁר֙"
	BHS: conj
	OSM: HTr - None
Genesis 1:7 w116 "אֲשֶׁ֖ר"
	BHS: conj
	OSM: HTr - None
Genesis 1:9 w152 "אֶחָ֔ד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:11 w195 "אֲשֶׁ֥ר"
	BHS: conj
	OSM: HTr - None
Genesis 1:12 w218 "אֲשֶׁ֥ר"
	BHS: conj
	OSM: HTr - None
	and 35433 more

--- sp  :  31760x ---

Genesis 1:1 w5 "אֵ֥ת"
	BHS: prep
	OSM: HTo - None
Genesis 1:1 w9 "אֵ֥ת"
	BHS: prep
	OSM: HTo - None
Genesis 1:4 w43 "אֶת"
	BHS: prep
	OSM: HTo - None
Genesis 1:4 w51 "בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:4 w55 "בֵ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:5 w78 "אֶחָֽד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:6 w91 "בֵּ

In [13]:
strangePsp = {}
strangeSuffix = {}

for w in F.otype.s('word'):
    osm = osmPsp[w]
    if osm == '×': continue

    if osm == 'S' or osm not in pspBhsFromOsm:
        strangePsp.setdefault(osm, set()).add(w)

    osm_sf = F.osm_sf.v(w)
    if osm_sf:
        osmSuffix = None if len(osm_sf) < 2 else osm_sf[1]
        if osmSuffix != 'S':           
            strangeSuffix.setdefault(osmSuffix, set()).add(w)
            
if strangePsp:
    print('Strange psp')
    for (ln, ws) in sorted(strangePsp.items()):
        print('\t{:<5}: {:>5}x'.format(ln, len(ws)))
        for w in sorted(ws)[0:5]:
            show(T, F, [w], F.sp.v, getOSMpsp, indent='\t\t')
        n = len(ws)
        if n > 5:
            print('and {} more'.format(n - 5))
else:
    print('No other psps encountered than {}'.format(', '.join(pspBhsFromOsm)))
if strangeSuffix:
    print('Strange suffix psp')
    for (ln, ws) in sorted(strangeSuffix.items()):
        print('\t{:<5}: {:>5}x'.format(ln, len(ws)))
        for w in sorted(ws)[0:5]:
            show(T, F, [w], F.sp.v, getOSMpsp, indent='\t\t')
        n = len(ws)
        if n > 5:
            print('and {} more'.format(n - 5))
else:
    print('No other suffix psps encountered than S')

Strange psp
	S    :     3x
		Jeremiah 18:3 w243708 "הו"
			BHS: prps
			OSM: HSp3ms - None
		Song_of_songs 6:5 w358742 "הֵ֖ם"
			BHS: prps
			OSM: HSp3mp - None
		Daniel 2:35 w371257 "הִמֹּון֙"
			BHS: prps
			OSM: ASp3mp - None
Strange suffix psp
	A    :     6x
		Numbers 2:9 w70538 "אֶ֜לֶף"
			BHS: subs
			OSM: HAcbsc - HAcbsa
		Numbers 2:16 w70633 "אֶ֜לֶף"
			BHS: subs
			OSM: HAcbsc - HAcbsa
		Numbers 2:24 w70744 "אֶ֛לֶף"
			BHS: subs
			OSM: HAcbsc - HAcbsa
		Numbers 2:31 w70834 "אֶ֗לֶף"
			BHS: subs
			OSM: HAcbsc - HAcbsa
		Judges 1:10 w127653 "קִרְיַ֣ת אַרְבַּ֑ע"
			BHS: nmpr
			OSM: HNp - HAcfsa
and 1 more
	C    :     2x
		Ruth 3:12 w356979 "כִּ֥י אם"
			BHS: conj
			OSM: HC - HC
		Ecclesiastes 8:12 w362111 "וּ"
			BHS: conj
			OSM: HAcbsc - HC
	D    :   194x
		Genesis 4:15 w1911 "לָכֵן֙"
			BHS: advb
			OSM: HR - HD
		Genesis 30:15 w15823 "לָכֵן֙"
			BHS: advb
			OSM: HR - HD
		Exodus 6:6 w31334 "לָכֵ֞ן"
			BHS: advb
			OSM: HR - HD
		Numbers 16:11 w80256 "לָכֵ֗ן"
			BHS: advb