# BHSA and OSM: comparison

We investigate where the morphology marked up in the OSM corresponds to the BHSA linguistic features and where it differs.

We use the `osm` and `osm_sf` features compiled by the 
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).

In [1]:
import os
import collections

from tf.fabric import Fabric

In [2]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    sp lex
    language
    osm osm_sf
    g_word_utf8
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored
  0.00s loading features ...
   |     0.22s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B sp                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B lex                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.12s B language             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.03s B osm_sf               from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.00s Feature overview: 110 for nodes; 5 for edges; 1 configs; 7 computed
  5.17s All features loaded/computed - for details use loadLog()


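To inspect individual cases, we define a helper that prints a word (or a sequence of words) with its passage, node number and text, followed by the BHSA and OSM information for it.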

In [3]:
def show(ws, getBHSinfo, getOSMinfo):
    print('{} w{} "{}"\n\tBHS: {}\n\tOSM: {}'.format(
        '{} {}:{}'.format(*T.sectionFromNode(ws[0])),
        '/'.join(str(w) for w in ws),
        '/'.join(F.g_word_utf8.v(w) for w in ws),
        '/'.join(getBHSinfo(w) for w in ws),
        '/'.join(getOSMinfo(w) for w in ws),
    ))

# Language

Do BHSA and OSM agree on language?
Let's count the words in the BHSA where they disagree.

The BHSA names the languages by means of ISO codes; the OSM uses one-letter abbreviations.

The OSM has the language code as the first letter of the morphology string.

In [4]:
langBhsFromOsm = dict(A='arc', H='hbo')
langOsmFromBhs = dict((y,x) for (x,y) in langBhsFromOsm.items())
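
As a small illustration of how the leading letters of a morph string are read, here is a minimal sketch. The helper `splitMorph` is hypothetical (it is not part of this notebook or of Text-Fabric), and it assumes the value is a single morph code such as `HAcbsc`, without any separators.

```python
# Sketch only: splitMorph is a hypothetical helper, not part of the notebook.
langBhsFromOsm = dict(A='arc', H='hbo')

def splitMorph(osm):
    """Split a morph string into (BHSA language code, OSM psp letter, rest).

    Returns None when there is no usable morphology (None, '' or '*').
    An unknown language letter yields None in the first slot.
    """
    if osm is None or osm == '' or osm == '*':
        return None
    language = langBhsFromOsm.get(osm[0])
    psp = osm[1] if len(osm) > 1 else None
    return (language, psp, osm[2:])

print(splitMorph('HAcbsc'))
print(splitMorph('ASp3mp'))
```

The first slot can be compared directly with the value of the BHSA `language` feature.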

We exclude the words for which the OSM has no morphology, or where the alignment between BHSA and OSM is problematic.

In [5]:
xLanguage = set()
strangeLanguage = collections.Counter()

for w in F.otype.s('word'):
    osm = F.osm.v(w)
    # skip words without usable OSM morphology (see the text above)
    if osm is None or osm == '' or osm == '*': continue
    osmLanguage = osm[0]
    trans = langBhsFromOsm.get(osmLanguage)
    if trans is None:
        # language letter that we cannot translate to a BHSA code
        strangeLanguage[osmLanguage] += 1
    elif trans != F.language.v(w):
        xLanguage.add(w)

if strangeLanguage:
    print('Strange languages')
    for (ln, amount) in sorted(strangeLanguage.items()):
        print('Strange language {}: {:>5}x'.format(ln, amount))
else:
    print('No other languages encountered than {}'.format(', '.join(langBhsFromOsm)))
print('Language discrepancies: {}'.format(len(xLanguage)))

No other languages encountered than A, H
Language discrepancies: 9


In [6]:
for w in sorted(xLanguage):
    show([w], F.language.v, lambda x: F.osm.v(x)[0]) 

Daniel 2:5 w370626 "הֵ֣ן"
	BHS: arc
	OSM: H
Daniel 2:9 w370692 "הֵן"
	BHS: arc
	OSM: H
Daniel 2:13 w370806 "דָּנִיֵּ֥אל"
	BHS: arc
	OSM: H
Daniel 2:24 w371000 "דָּֽנִיֵּאל֙"
	BHS: arc
	OSM: H
Daniel 2:28 w371120 "הֽוּא"
	BHS: arc
	OSM: H
Daniel 2:29 w371130 "אַחֲרֵ֣י"
	BHS: arc
	OSM: H
Daniel 3:15 w371915 "הֵ֧ן"
	BHS: arc
	OSM: H
Daniel 4:5 w372449 "דָּנִיֵּ֜אל"
	BHS: arc
	OSM: H
Daniel 7:1 w374551 "דָּנִיֵּאל֙"
	BHS: arc
	OSM: H


# Part of speech

Let's move to part of speech.

The OSM has part of speech as the second letter of the morph string.
See [here](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html).

The BHSA makes a few more distinctions in its [sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp) feature,
so we map the OSM values to sets of BHSA values.

One of the OSM values is `S` (suffix).
The BHSA has no counterpart for this, but we expect that all morph strings in the `osm_sf` feature
have `S` as their part of speech.

We'll test that as well.
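
The suffix test can be sketched on plain strings as follows. The function name `suffixPsp` is hypothetical and the sample values are only illustrative morph strings; the real check runs over the `osm_sf` feature.

```python
import collections

def suffixPsp(osm_sf):
    """Return the part-of-speech letter (second character) of a suffix morph string."""
    return None if len(osm_sf) < 2 else osm_sf[1]

# Illustrative values only, not a sample drawn from the data.
samples = ['HSp3ms', 'HD', 'HAcbsa', 'HD']

# Count every suffix morph string whose part of speech is not 'S'.
unexpected = collections.Counter(
    suffixPsp(s) for s in samples if suffixPsp(s) != 'S'
)
for (psp, amount) in sorted(unexpected.items()):
    print('{}: {}x'.format(psp, amount))
```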

## Particulars

The object marker is a 'prep' in the BHSA, but a particle in the OSM.
We use the lexeme value in the BHSA to recognize the object marker and adapt the comparison accordingly.

We also test whether object markers are coded as particles in the OSM.
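
The object-marker rule can be sketched as a small function on plain values. Everything here is an assumption for illustration: `comparePsp` is a hypothetical name, the mapping mirrors the one used in this notebook (without the suffix entry), and `'>T'` is taken to be the BHSA lexeme of the object marker.

```python
# Sketch only; comparePsp is a hypothetical helper.
pspBhsFromOsm = dict(
    A={'adjv'}, C={'conj'}, D={'advb'}, N={'subs', 'nmpr'},
    P={'prps', 'prde', 'prin', 'inrg'}, R={'prep'},
    T={'art', 'intj', 'nega'}, V={'verb'},
)
OBJECT_MARKER = '>T'  # assumed BHSA lexeme of the object marker

def comparePsp(bhsaSp, osmPsp, lex):
    """Return (agree, omNotParticle) for one word.

    agree: the BHSA part of speech is among the values expected for the
           OSM letter; the object marker is always expected to be 'prep'.
    omNotParticle: the word is the object marker but the OSM letter is not 'T'.
    """
    isOM = lex == OBJECT_MARKER
    expected = {'prep'} if isOM else pspBhsFromOsm.get(osmPsp, set())
    return (bhsaSp in expected, isOM and osmPsp != 'T')

print(comparePsp('prep', 'T', '>T'))
print(comparePsp('prep', 'R', '>T'))
print(comparePsp('subs', 'R', None))
```

Keeping the two outcomes separate means an object marker coded as a preposition in the OSM is reported once, as a coding issue, and not a second time as a part-of-speech discrepancy.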

## Outcomes

A first comparison reveals lots of discrepancies.
Clearly, BHSA and OSM have applied different rules to arrive at the parts of speech of the words.

Maybe we can uncover a bit more of these rules and bring the number of exceptions down.

Probably there is (very) low-hanging fruit here. I just started looking ...

In [7]:
pspBhsFromOsm = dict(
    A={'adjv'}, # adjective
    C={'conj'}, # conjunction
    D={'advb'}, # adverb
    N={'subs', 'nmpr'}, # noun
    P={'prps', 'prde', 'prin', 'inrg'}, # pronoun
    R={'prep'}, # preposition 
    S={'_suffix_'}, # suffix
    T={'art', 'intj', 'nega'}, # particle 
    V={'verb'}, # verb
)

In [8]:
xPsp = set()
strangePsp = {}
strangeSuffix = {}
etNotParticle = set()

for w in F.otype.s('word'):
    osm = F.osm.v(w)
    if osm is None or osm == '' or osm == '*': continue
    osmPsp = None if len(osm) < 2 else osm[1]

    good = True
    
    if osmPsp == None or osmPsp == 'S' or osmPsp not in pspBhsFromOsm:
        strangePsp.setdefault(osmPsp, set()).add(w)
        good = False

    osm_sf = F.osm_sf.v(w)
    if osm_sf:
        osmSuffix = None if len(osm_sf) < 2 else osm_sf[1]
        if osmSuffix != 'S':           
            strangeSuffix.setdefault(osmSuffix, set()).add(w)
            good = False
    
    if not good: continue
            
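    lex = F.lex.v(w)
    isOM = lex == '>T'  # the object marker
    # expected BHSA values: always a set, so that the membership test below is exact
    trans = {'prep'} if isOM else pspBhsFromOsm[osmPsp]
    if isOM and osmPsp != 'T':
        etNotParticle.add(w)
    
    if F.sp.v(w) not in trans:
        xPsp.add(w)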

if strangePsp:
    print('Strange psp')
    for (ln, ws) in sorted(strangePsp.items()):
        print('Strange psp {}: {:>5}x'.format(ln, len(ws)))
else:
    print('No other psps encountered than {}'.format(', '.join(pspBhsFromOsm)))
if strangeSuffix:
    print('Strange suffix psp')
    for (ln, ws) in sorted(strangeSuffix.items()):
        print('Strange suffix psp {}: {:>5}x'.format(ln, len(ws)))
else:
    print('No other suffix psps encountered than S')
print('Psp discrepancies: {}'.format(len(xPsp)))

Strange psp
Strange psp S:     3x
Strange suffix psp
Strange suffix psp A:     6x
Strange suffix psp C:     2x
Strange suffix psp D:   194x
Strange suffix psp N:    79x
Strange suffix psp P:     6x
Strange suffix psp T:   777x
Psp discrepancies: 20501
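
One way to look for the rules behind these discrepancies is to tally them per combination of BHSA value and OSM letter, so that the most frequent combinations surface first. The following is a minimal sketch on plain pairs; `tallyPairs` is a hypothetical helper and the sample pairs are illustrative only.

```python
import collections

def tallyPairs(pairs):
    """Count (BHSA sp, OSM psp letter) combinations, most frequent first."""
    counter = collections.Counter(
        (sp, osm[1] if len(osm) > 1 else None) for (sp, osm) in pairs
    )
    return counter.most_common()

# Illustrative pairs of (BHSA sp value, OSM morph string).
sample = [('subs', 'HR'), ('conj', 'HTr'), ('subs', 'HR'), ('subs', 'HAcmsa')]
for ((sp, psp), amount) in tallyPairs(sample):
    print('{:<5} {} {:>5}x'.format(sp, psp, amount))
```

In the notebook this could be fed with `((F.sp.v(w), F.osm.v(w)) for w in xPsp)`.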


In [9]:
def getOSMpsp(w):
    return '{} - {}'.format(str(F.osm.v(w)), str(F.osm_sf.v(w)))

## PSP of non-suffix with value `S`

In [10]:
for (psp, ws) in sorted(strangePsp.items()):
    print('\n--- Strange psp {}---:\n'.format(psp))
    for w in sorted(ws)[0:5]:
        show([w], F.sp.v, getOSMpsp)
    n = len(ws)
    if n > 5:
        print('and {} more'.format(n - 5))


--- Strange psp S---:

Jeremiah 18:3 w243708 "הו"
	BHS: prps
	OSM: HSp3ms - None
Song_of_songs 6:5 w358742 "הֵ֖ם"
	BHS: prps
	OSM: HSp3mp - None
Daniel 2:35 w371257 "הִמֹּון֙"
	BHS: prps
	OSM: ASp3mp - None


## Suffixes with PSP not being `S`

In [11]:
for (psp, ws) in sorted(strangeSuffix.items()):
    print('\n--- Strange suffix {}---:\n'.format(psp))
    for w in sorted(ws)[0:5]:
        show([w], F.sp.v, getOSMpsp)
    n = len(ws)
    if n > 5:
        print('and {} more'.format(n - 5))


--- Strange suffix A---:

Numbers 2:9 w70538 "אֶ֜לֶף"
	BHS: subs
	OSM: HAcbsc - HAcbsa
Numbers 2:16 w70633 "אֶ֜לֶף"
	BHS: subs
	OSM: HAcbsc - HAcbsa
Numbers 2:24 w70744 "אֶ֛לֶף"
	BHS: subs
	OSM: HAcbsc - HAcbsa
Numbers 2:31 w70834 "אֶ֗לֶף"
	BHS: subs
	OSM: HAcbsc - HAcbsa
Judges 1:10 w127653 "קִרְיַ֣ת אַרְבַּ֑ע"
	BHS: nmpr
	OSM: HNp - HAcfsa
and 1 more

--- Strange suffix C---:

Ruth 3:12 w356979 "כִּ֥י אם"
	BHS: conj
	OSM: HC - HC
Ecclesiastes 8:12 w362111 "וּ"
	BHS: conj
	OSM: HAcbsc - HC

--- Strange suffix D---:

Genesis 4:15 w1911 "לָכֵן֙"
	BHS: advb
	OSM: HR - HD
Genesis 30:15 w15823 "לָכֵן֙"
	BHS: advb
	OSM: HR - HD
Exodus 6:6 w31334 "לָכֵ֞ן"
	BHS: advb
	OSM: HR - HD
Numbers 16:11 w80256 "לָכֵ֗ן"
	BHS: advb
	OSM: HR - HD
Numbers 20:12 w82792 "לָכֵ֗ן"
	BHS: advb
	OSM: HR - HD
and 189 more

--- Strange suffix N---:

Genesis 11:10 w5143 "שָׁנָ֔ה"
	BHS: subs
	OSM: HAcbsc - HNcfsa
Genesis 21:5 w9687 "שָׁנָ֑ה"
	BHS: subs
	OSM: HAcbsc - HNcfsa
Genesis 25:7 w12529 "שָׁנָ֛ה"
	BHS: subs


## ET not marked as particle

In [12]:
LIMIT = 100  # maximum number of cases to show
for w in sorted(etNotParticle)[0:LIMIT]:
    show([w], F.sp.v, getOSMpsp)
n = len(etNotParticle)
if n > LIMIT:
    print('and {} more'.format(n - LIMIT))

Hosea 12:4 w293548 "אֶת"
	BHS: prep
	OSM: HR - None
Joel 2:20 w294741 "אֶת"
	BHS: prep
	OSM: HR - None
Zephaniah 1:3 w303124 "אֶת"
	BHS: prep
	OSM: HR - None


## Different PSP in BHSA and OSM

In [13]:
LIMIT = 100  # maximum number of cases to show
for w in sorted(xPsp)[0:LIMIT]:
    show([w], F.sp.v, getOSMpsp)
n = len(xPsp)
if n > LIMIT:
    print('and {} more'.format(n - LIMIT))

Genesis 1:4 w51 "בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:4 w55 "בֵ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:5 w78 "אֶחָֽד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:6 w91 "בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:7 w103 "בֵּ֤ין"
	BHS: subs
	OSM: HR - None
Genesis 1:7 w106 "אֲשֶׁר֙"
	BHS: conj
	OSM: HTr - None
Genesis 1:7 w113 "בֵ֣ין"
	BHS: subs
	OSM: HR - None
Genesis 1:7 w116 "אֲשֶׁ֖ר"
	BHS: conj
	OSM: HTr - None
Genesis 1:9 w152 "אֶחָ֔ד"
	BHS: subs
	OSM: HAcmsa - None
Genesis 1:11 w195 "אֲשֶׁ֥ר"
	BHS: conj
	OSM: HTr - None
Genesis 1:12 w218 "אֲשֶׁ֥ר"
	BHS: conj
	OSM: HTr - None
Genesis 1:14 w247 "בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:14 w251 "בֵ֣ין"
	BHS: subs
	OSM: HR - None
Genesis 1:16 w286 "שְׁנֵ֥י"
	BHS: subs
	OSM: HAcmdc - None
Genesis 1:18 w340 "בֵּ֥ין"
	BHS: subs
	OSM: HR - None
Genesis 1:18 w344 "בֵ֣ין"
	BHS: subs
	OSM: HR - None
Genesis 1:21 w396 "אֲשֶׁר֩"
	BHS: conj
	OSM: HTr - None
Genesis 1:29 w601 "אֲשֶׁר֙"
	BHS: conj
	OSM: HTr - None
Genesis 1:29 w612 "א