# BHSA and OSM: comparison on verb attributes

We investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

In this notebook we investigate the markup of verb attributes.
According to the [OSM specs](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html),
the OSM provides the following verb attributes:

* verb stem
* conjugation type
* person
* gender
* number

We use the `osm` and `osm_sf` features compiled by the 
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).
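
The verb attributes listed above sit at fixed positions in an OSM morphology code. The following is a minimal sketch of that positional layout, assuming the same positions that the helper functions further down in this notebook use (language, part of speech, stem, conjugation, then person, gender, number) and assuming that participles and infinitives have no person slot. `decodeVerbCode` is a hypothetical helper for illustration only; it is not part of the OSM or Text-Fabric tooling.

```python
# Hypothetical helper: split an OSM verb code into its positional parts.
# Assumed layout: language, part of speech, stem, conjugation,
# then person, gender, number.
NO_PERSON = {'r', 's', 'a', 'c'}  # participles and infinitives: no person slot

def decodeVerbCode(code):
    if not code or len(code) < 4 or code[1] != 'V':
        return None  # not a (complete) verb code
    conj = code[3]
    rest = code[4:]
    if conj in NO_PERSON:
        rest = '_' + rest  # mark the person slot as not applicable
    rest = rest.ljust(3, '?')  # pad missing slots with '?'
    return dict(
        language=code[0],
        stem=code[2],
        conjugation=conj,
        person=rest[0],
        gender=rest[1],
        number=rest[2],
    )

print(decodeVerbCode('HVqp3ms'))  # a finite verb form
print(decodeVerbCode('HVqrms'))   # a participle: person is '_'
```

A code such as `HAampa`, which has no `V` in second position, yields `None` here; that is the kind of word the comparison below reports as "not a verb in OSM".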

# Results

The sections below show most of the cases where the two sources differ.
We also collect all cases in [verbs.tsv](verbs.tsv), a tab-delimited file.

In [1]:
import os
import collections

from tf.fabric import Fabric
from utils import show

# Load data
We load the BHSA data in the standard way, and we add the OSM data as a module that contains the features `osm` and `osm_sf`.
Note that we only need to point TF to the right directories, and then we can load all features
that are present in those directories.

In [2]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    sp
    lex voc_lex_utf8 gloss
    languageISO
    osm osm_sf
    g_word_utf8
    vs vt ps gn nu
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

117 features found and 0 ignored
  0.00s loading features ...
   |     0.22s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B sp                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B lex                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.01s B voc_lex_utf8         from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.01s B gloss                from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B languageISO          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.02s B osm_sf               from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.13s B vs                   

# Verb occurrences

Let us first identify what the verb occurrences are, according to the OSM and to the BHSA.
We'll show the differences.
The OSM is not yet complete, so we focus on the words for which the OSM has morphology.

We call the set of words that have a non-empty OSM morphology string the *OSM-base*.

In [3]:
verbsBHS = set(F.sp.s('verb'))
hasOSM = set()
for w in F.otype.s('word'):
    osm = F.osm.v(w)
    if osm and len(osm) > 1:
        hasOSM.add(w)

verbsBHSfocus = verbsBHS & hasOSM

verbsOSM = {w for w in hasOSM if F.osm.v(w)[1] == 'V'}

print('''
Number of verb occurrences in the Hebrew Bible:
\tin BHSA (total):                     {:>5}
\tin BHSA (intersected with OSM-base): {:>5}
\tin OSM:                              {:>5}
'''.format(
    len(verbsBHS),
    len(verbsBHSfocus),
    len(verbsOSM),
))


Number of verb occurrences in the Hebrew Bible:
	in BHSA (total):                     75450
	in BHSA (intersected with OSM-base): 60126
	in OSM:                              60085



As you can see, there are very few discrepancies.
Before we show them, we define functions that show a verb with its BHSA morphology and its OSM morphology.

If a piece of morphology is not present, we substitute a `?`.
We also replace an unknown value in the BHSA with `?` (and a not-applicable value with `_`), although
there is a difference between missing markup and markup that says: insufficient information!

# Names

We map the names for stems and conjugations found in the
[OSM morphology description](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html)
to convenient names when comparing them with the morphology values in the BHSA features
[vs](https://etcbc.github.io/bhsa/features/hebrew/2017/vs.html)
and
[vt](https://etcbc.github.io/bhsa/features/hebrew/2017/vt.html),
and we map some BHSA names as well.

In [4]:
stemMapOSM = dict(
    H=dict(
        q='qal',
        N='niphal',
        p='piel',
        P='pual',
        h='hiphil',
        H='hophal',
        t='hithpael',
        o='polel',
        O='polal',
        r='hithpolel',
        m='poel',
        M='poal',
        k='palel',
        K='pulal',
        Q='qalpassive',
        l='pilpel',
        L='polpal',
        f='hithpalpel',
        D='nithpael',
        j='pealal',
        i='pilel',
        u='hothpaal',
        c='tiphil',
        v='hishtaphel',
        w='nithpalel',
        y='nithpoel',
        z='hithpoel',
    ),
    A=dict(
        q='peal',
        Q='peil',
        u='hithpeel',
        p='pael',
        P='ithpaal',
        M='hithpaal',
        a='aphel',
        h='haphel',
        s='saphel',
        e='shaphel',
        H='hophal',
        i='ithpeel',
        t='hishtaphel',
        v='ishtaphel',
        w='hithaphel',
        o='polel',
        z='ithpoel',
        r='hithpolel',
        f='hithpalpel',
        b='hephal',
        c='tiphel',
        m='poel',
        l='palpel',
        L='ithpalpel',
        O='ithpolel',
        G='ittaphal',
    ),
)

In [5]:
stemMapBHS = dict(
    hif='hiphil',
    hit='hithpael',
    htpo='hithpoel',
    hof='hophal',
    nif='niphal',
    piel='piel',
    poal='poal',
    poel='poel',
    pual='pual',
    qal='qal',
    afel='aphel',
    etpa='etpaal',
    etpe='etpeel',
    haf='haphel',
    hotp='hothpaal',
    hsht='hishtaphel',
    htpa='hithpaal',
    htpe='hithpeel',
    nit='nithpael',
    pael='pael',
    peal='peal',
    peil='peil',
    shaf='shaphel',
    tif='tiphal',
    pasq='qalpassive',
)

In [6]:
conjMapOSM = dict(
    p='perfect',
    q='weqatal',
    i='imperfect',
    w='wayyiqtol',
    h='cohortative',
    j='jussive',
    v='imperative',
    r='part act',
    s='part pass',
    a='inf abs',
    c='inf cons',
)
conjMapBHS = dict(
    impf='imperfect',
    impv='imperative',
    infa='inf abs',
    infc='inf cons',
    perf='perfect',
    ptca='part act',
    ptcp='part pass',
    wayq='wayyiqtol',    
)
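
The name maps above exist so that an OSM code letter and a BHSA value can be compared under one shared name. The sketch below shows why the OSM stem map is keyed by language first: the same letter stands for different stems in Hebrew and Aramaic. The small dictionaries are excerpts of the maps above, and `sameStem` is a hypothetical helper introduced only for this illustration.

```python
# Excerpts of the maps above (abridged for illustration)
stemOSM = {
    'H': {'q': 'qal', 'h': 'hiphil'},
    'A': {'q': 'peal', 'h': 'haphel'},
}
stemBHS = {'qal': 'qal', 'hif': 'hiphil', 'peal': 'peal', 'haf': 'haphel'}

def sameStem(osmLang, osmStem, bhsStem):
    # Translate both sides to the shared name; '?' means "no known name"
    osmName = stemOSM.get(osmLang, {}).get(osmStem, '?')
    bhsName = stemBHS.get(bhsStem, '?')
    return osmName != '?' and osmName == bhsName

print(sameStem('H', 'h', 'hif'))  # Hebrew h is hiphil
print(sameStem('A', 'h', 'hif'))  # Aramaic h is haphel, not hiphil
print(sameStem('A', 'h', 'haf'))
```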

In [7]:
naValues = {'NA', 'N/A'}
missingValues = {None, '', 'unknown'}
noPersonConj = {'r', 's', 'a', 'c'}

def getValue(x): return '_' if x in naValues else '?' if x in missingValues else x
def getValueHead(x): return '_' if x in naValues else '?' if x in missingValues else x[0]
def getValueTail(x): return '_' if x in naValues else '?' if x in missingValues else x[1:]

def extractFeature(x, n): return '?' if not x or len(x) <= n else x[n]

def getLangOSM(w): return extractFeature(F.osm.v(w), 0)

def getStemOSM(w): return extractFeature(F.osm.v(w), 2)
def getStemOSMX(w): return stemMapOSM.get(getLangOSM(w), {}).get(getStemOSM(w), '?')
def getConjOSM(w): return extractFeature(F.osm.v(w), 3)
def getConjOSMX(w): return conjMapOSM.get(getConjOSM(w), '?')
def getPersonOSM(w): return '_' if getConjOSM(w) in noPersonConj else extractFeature(F.osm.v(w), 4)
def getGenderOSM(w): return extractFeature(F.osm.v(w), 4 if getConjOSM(w) in noPersonConj else 5)
def getNumberOSM(w): return extractFeature(F.osm.v(w), 5 if getConjOSM(w) in noPersonConj else 6)

def getStemBHS(w): return getValue(F.vs.v(w))
def getStemBHSX(w): return stemMapBHS.get(getStemBHS(w), '?')
def getConjBHS(w): return getValue(F.vt.v(w))
def getConjBHSX(w): return conjMapBHS.get(getConjBHS(w), '?')
def getPersonBHS(w): return getValueTail(F.ps.v(w))
def getGenderBHS(w): return getValue(F.gn.v(w))
def getNumberBHS(w): return getValueHead(F.nu.v(w))

def getVerbBHS(w):
    return '{}-{}-{}{}{}'.format(
        getStemBHSX(w),
        getConjBHSX(w),
        getPersonBHS(w),
        getGenderBHS(w),
        getNumberBHS(w),
    )

def getVerbOSM(w):
    return '{}-{}-{}{}{}'.format(
        getStemOSMX(w),
        getConjOSMX(w),
        getPersonOSM(w),
        getGenderOSM(w),
        getNumberOSM(w),
    )

def getBHS(w): return F.sp.v(w)
def getOSM(w): return F.osm.v(w)
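
To illustrate how the BHSA value helpers treat different kinds of input, here is a small self-contained sketch. It restates the logic of `getValue`, `getValueHead` and `getValueTail` as a single hypothetical function `normalize`, and the sample values are assumptions in the style of the BHSA `ps` (person) and `nu` (number) features.

```python
naValues = {'NA', 'N/A'}
missingValues = {None, '', 'unknown'}

def normalize(x, pick=lambda v: v):
    # '_' for not-applicable, '?' for missing or unknown,
    # otherwise the part of the value selected by `pick`
    if x in naValues:
        return '_'
    if x in missingValues:
        return '?'
    return pick(x)

def head(v): return v[0]    # first character, as in getValueHead
def tail(v): return v[1:]   # everything after it, as in getValueTail

# Assumed sample values, not read from the actual data
for value in ('p3', 'sg', 'unknown', 'NA', None):
    print('{!r:<10} whole={} head={} tail={}'.format(
        value, normalize(value), normalize(value, head), normalize(value, tail),
    ))
```

With values of this shape, the tail of a person value and the head of a number value give the single characters that can be set side by side with the OSM person and number slots.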

# Mappings

We count the co-occurrences of OSM values and BHSA values for each verb feature,
and see how they compare.
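
As a rough sketch of the idea, the nested-dictionary bookkeeping can be tried on a handful of invented words first. The node numbers and values below are made up for illustration, and `cooccurrences` is a hypothetical stand-alone function, not the one used in this notebook.

```python
def cooccurrences(triples):
    # triples: iterable of (node, osmValue, bhsValue)
    # result: osmValue -> bhsValue -> set of nodes having that combination
    bhsFromOsm = {}
    for (node, osm, bhs) in triples:
        bhsFromOsm.setdefault(osm, {}).setdefault(bhs, set()).add(node)
    return bhsFromOsm

# Invented toy data
toy = [
    (1, 'qal', 'qal'),
    (2, 'qal', 'qal'),
    (3, 'hiphil', 'qal'),
    (4, 'hiphil', 'hiphil'),
    (5, 'hiphil', 'hiphil'),
]
table = cooccurrences(toy)
for (osm, bhsData) in sorted(table.items()):
    print(osm)
    # most frequent counterpart first; the rarer ones are the interesting cases
    for (bhs, nodes) in sorted(bhsData.items(), key=lambda x: (-len(x[1]), x[0])):
        print('\t{:<10} ({:>2}x)'.format(bhs, len(nodes)))
```

Sorting the counterparts by descending frequency puts the dominant correspondence on top, so that everything listed below it can be treated as a candidate discrepancy.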

In [8]:
closerLook = set()

In [9]:
def showFeatures():
    cases = set()
    mappings = {}

    def makeMap(key, getBHS, getOSM):
        BHSFromOSM = {}
        OSMFromBHS = {}

        for w in verbBase:
            osm = getOSM(w)
            bhs = getBHS(w)
            BHSFromOSM.setdefault(osm, {}).setdefault(bhs, set()).add(w)
            OSMFromBHS.setdefault(bhs, {}).setdefault(osm, set()).add(w)
        mappings.setdefault(key, {})[True] = BHSFromOSM
        mappings.setdefault(key, {})[False] = OSMFromBHS

    def showMap(key, direction):
        dirLabel = 'OSM ===> BHS' if direction else 'BHS ===> OSM'
        print('''
---------------------------------------------------------------------------------
--- {} {}
---------------------------------------------------------------------------------
'''.format(key, dirLabel))
        cases = set()
        for (item, itemData) in sorted(mappings[key][direction].items()):
            print('{:<10}'.format(item))
            first = True
            for (itemOther, ws) in sorted(itemData.items(), key=lambda x: (-len(x[1]), x[0])):
                print('\t{:<15} ({:>5}x)'.format(itemOther, len(ws)))
                if not first and len(ws) < 100:
                    for w in sorted(ws):
                        show(T, F, [w], getVerbBHS, getVerbOSM, indent='\t\t\t\t')
                        cases.add(w)
                first = False
        print('\n{} ({}): {} cases'.format(key, dirLabel, len(cases)))
        return cases
    
    def showFeature(key):
        cases = set()
        print('''
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE {}
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
'''.format(key))
        for direction in (True, False):
            theseCases = showMap(key, direction)
            cases |= theseCases
        print('\n{}: {} cases'.format(key, len(cases)))
        return cases
    
    for (key, getBHS, getOSM) in (
        ('stem', getStemBHSX, getStemOSMX),
        ('conjugation', getConjBHSX, getConjOSMX),
        ('person', getPersonBHS, getPersonOSM),
        ('gender', getGenderBHS, getGenderOSM),
        ('number', getNumberBHS, getNumberOSM),
    ):
        makeMap(key, getBHS, getOSM)
        cases |= showFeature(key)
    print('\n{}: {} cases'.format('All features', len(cases)))

    return cases

## In BHSA but not in OSM

In [10]:
extraBHS = verbsBHSfocus - verbsOSM

print('Marked as verb in BHSA but not in OSM: {:>3}'.format(len(extraBHS)))
for w in sorted(extraBHS):
    show(T, F, [w], getVerbBHS, getOSM, indent='\t')

Marked as verb in BHSA but not in OSM:  52
	Numbers 7:13 w74146 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:14 w74158 "מְלֵאָ֥ה"
		BHS: qal-part act-?fs
		OSM: HAafsa
	Numbers 7:19 w74228 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:20 w74240 "מְלֵאָ֥ה"
		BHS: qal-part act-?fs
		OSM: HAafsa
	Numbers 7:25 w74309 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:26 w74321 "מְלֵאָ֥ה"
		BHS: qal-part act-?fs
		OSM: HAafsa
	Numbers 7:31 w74390 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:32 w74402 "מְלֵאָ֥ה"
		BHS: qal-part act-?fs
		OSM: HAafsa
	Numbers 7:37 w74471 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:38 w74483 "מְלֵאָ֥ה"
		BHS: qal-part act-?fs
		OSM: HAafsa
	Numbers 7:43 w74552 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:44 w74564 "מְלֵאָ֥ה"
		BHS: qal-part act-?fs
		OSM: HAafsa
	Numbers 7:49 w74633 "מְלֵאִ֗ים"
		BHS: qal-part act-?mp
		OSM: HAampa
	Numbers 7:50 w74645 "מְלֵאָ֥ה"
		BHS: q

In [11]:
cases = extraBHS
closerLook |= cases
print('{} cases merged into {} closer look items'.format(len(cases), len(closerLook)))

52 cases merged into 52 closer look items


## In OSM but not in BHSA

In [12]:
extraOSM = verbsOSM - verbsBHSfocus

print('Marked as verb in OSM but not in BHSA: {:>3}'.format(len(extraOSM)))
for w in sorted(extraOSM):
    show(T, F, [w], getBHS, getVerbOSM, indent='\t')

Marked as verb in OSM but not in BHSA:  11
	1_Samuel 16:16 w150632 "טֹ֥וב"
		BHS: adjv
		OSM: qal-weqatal-3ms
	1_Samuel 16:23 w150776 "טֹ֣וב"
		BHS: adjv
		OSM: qal-perfect-3ms
	Hosea 2:3 w291153 "רֻחָֽמָה"
		BHS: nmpr
		OSM: pual-perfect-3fs
	Hosea 7:4 w292417 "אֹפֶ֑ה"
		BHS: subs
		OSM: qal-part act-_ms
	Hosea 7:6 w292448 "אֹֽפֵהֶ֔ם"
		BHS: subs
		OSM: qal-part act-_ms
	Hosea 9:10 w292956 "אָהֳבָֽם"
		BHS: subs
		OSM: qal-inf cons-_??
	Jonah 1:15 w298910 "זַּעְפֹּֽו"
		BHS: subs
		OSM: qal-inf cons-_??
	Zephaniah 3:10 w303914 "פּוּצַ֔י"
		BHS: subs
		OSM: qal-part pass-_ms
	Ruth 2:1 w356231 "מידע"
		BHS: subs
		OSM: pual-part act-_ms
	Song_of_songs 2:13 w357957 "לכי"
		BHS: prep
		OSM: qal-imperative-2fs
	Song_of_songs 3:8 w358155 "אֲחֻ֣זֵי"
		BHS: adjv
		OSM: qal-part pass-_mp


In [13]:
cases = extraOSM
closerLook |= cases
print('{} cases merged into {} closer look items'.format(len(cases), len(closerLook)))

11 cases merged into 63 closer look items


## Common verb base
The rest of the comparison is carried out for the *common verb base*, i.e. those words
that have been marked as verb in the BHSA and in the OSM.

In [14]:
verbBase = verbsOSM & verbsBHSfocus
print('Common verb base: {} occurrences'.format(len(verbBase)))

Common verb base: 60074 occurrences
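
The counts reported so far can be cross-checked with simple arithmetic: removing the words that only one source marks as a verb from that source's total should leave the common verb base in both cases. The numbers below are copied from the outputs above; the variable names are introduced only for this check.

```python
# Counts copied from the outputs above
bhsFocusCount = 60126   # BHSA verbs within the OSM-base
osmCount = 60085        # OSM verbs
extraBHSCount = 52      # verb in BHSA, not in OSM
extraOSMCount = 11      # verb in OSM, not in BHSA
commonCount = 60074     # common verb base

assert bhsFocusCount - extraBHSCount == commonCount
assert osmCount - extraOSMCount == commonCount
print('part-of-speech disagreements:', extraBHSCount + extraOSMCount)
```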


In [15]:
closerLook |= cases
print('{} cases merged into {} closer look items'.format(len(cases), len(closerLook)))

11 cases merged into 63 closer look items


# Feature comparison
We are going to compare all features.

In [16]:
cases = showFeatures()


o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE stem
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- stem OSM ===> BHS
---------------------------------------------------------------------------------

aphel     
	haphel          (   10x)
	aphel           (    2x)
				Daniel 3:1 w371620 "אֲקִימֵהּ֙"
					BHS: aphel-perfect-3ms
					OSM: aphel-perfect-3ms
				Daniel 4:11 w372572 "אַתַּ֥רוּ"
					BHS: aphel-imperative-2mp
					OSM: aphel-imperative-2mp
haphel    
	haphel          (   97x)
	hithpeel        (    1x)
				Daniel 2:9 w370704 "הזמנתון"
					BHS: hithpeel-perfect-2mp
					OSM: haphel-perfect-2mp
	peal            (    1x)
				Daniel 2:31 w371168 "חָזֵ֤ה"
					BHS: peal-part act-?ms
					OSM: haphel-part act-_ms
hiphil    
	hiphil          ( 7105x)
	qal             (    6x)
				Amos 2:13 w295902 "תָּעִיק

				Song_of_songs 7:3 w358899 "סוּגָ֖ה"
					BHS: qal-part pass-?fs
					OSM: qalpassive-part act-_fs
shaphel   
	shaphel         (    9x)
tiphal    
	tiphil          (    1x)

stem (BHS ===> OSM): 98 cases

stem: 179 cases

o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE conjugation
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- conjugation OSM ===> BHS
---------------------------------------------------------------------------------

cohortative
	imperfect       (  413x)
imperative
	imperative      ( 3794x)
	inf abs         (    4x)
				Nahum 2:2 w301671 "צַפֵּה"
					BHS: piel-inf abs-???
					OSM: piel-imperative-2ms
				Nahum 2:2 w301673 "חַזֵּ֣ק"
					BHS: piel-inf abs-???
					OSM: piel-imperative-2ms
				Nahum 2:2 w301675 "אַמֵּ֥ץ"
					BHS: piel-inf abs-???
					OSM: piel-imperative-2ms
				Proverbs 2


o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE person
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- person OSM ===> BHS
---------------------------------------------------------------------------------

1         
	1               ( 4977x)
	2               (    3x)
				Ruth 3:3 w356811 "ירדתי"
					BHS: qal-perfect-2fs
					OSM: qal-weqatal-1cs
				Ruth 3:4 w356844 "שׁכבתי"
					BHS: qal-perfect-2fs
					OSM: qal-weqatal-1cs
				Ruth 4:5 w357240 "קניתי"
					BHS: qal-perfect-2ms
					OSM: qal-perfect-1cs
	?               (    3x)
				Zephaniah 1:2 w303100 "אָסֵ֜ף"
					BHS: qal-part act-?ms
					OSM: hiphil-imperfect-1cs
				Zephaniah 1:3 w303109 "אָסֵ֨ף"
					BHS: qal-part act-?ms
					OSM: hiphil-imperfect-1cs
				Zephaniah 1:3 w303113 "אָסֵ֤ף"
					BHS: qal-part act-?ms
					OSM: hiphil-imperfect-1cs
2  


o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE number
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- number OSM ===> BHS
---------------------------------------------------------------------------------

?         
	?               ( 6272x)
	p               (    1x)
				Hosea 6:9 w292336 "חַכֵּ֨י"
					BHS: piel-part act-?mp
					OSM: piel-inf cons-_??
	s               (    1x)
				Amos 4:5 w296306 "קַטֵּ֤ר"
					BHS: piel-imperative-2ms
					OSM: piel-inf abs-_??
p         
	p               (15129x)
	s               (    2x)
				Nahum 3:7 w301967 "רֹאַ֨יִךְ֙"
					BHS: qal-part act-?ms
					OSM: qal-part act-_mp
				Proverbs 24:17 w353525 "אויביך"
					BHS: qal-part act-?ms
					OSM: qal-part act-_mp
s         
	s               (38659x)
	?               (    6x)
				Hosea 11:3 w293343 "קָחָ֖ם"
					BHS: qal

In [17]:
closerLook |= cases
print('{} cases for a closer look'.format(len(closerLook)))

288 cases for a closer look


# Result

We list all cases in [verbs.tsv](verbs.tsv).

In [18]:
fields = '''
    passage
    node
    occurrence
    OSMmorph
    stemOSM
    stemBHS
    conjOSM
    conjBHS
    personOSM
    personBHS
    genderOSM
    genderBHS
    numberOSM
    numberBHS
'''.strip().split()
lineFormat = ('{}\t' * (len(fields) - 1)) + '{}\n'

with open('verbs.tsv', 'w', encoding='utf8') as fh:
    fh.write(lineFormat.format(*fields))
    for w in sorted(closerLook):
        fh.write(lineFormat.format(
            '{} {}:{}'.format(*T.sectionFromNode(w)),
            w,
            F.g_word_utf8.v(w),
            F.osm.v(w),
            getStemOSMX(w),
            getStemBHSX(w),
            getConjOSMX(w),
            getConjBHSX(w),
            getPersonOSM(w),
            getPersonBHS(w),
            getGenderOSM(w),
            getGenderBHS(w),
            getNumberOSM(w),
            getNumberBHS(w),
        ))
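
The resulting file can be read back with Python's standard `csv` module. The sketch below is one possible way to do that, not part of the original workflow: `readCases` and `stemDisagreements` are hypothetical helpers, and the in-memory sample is an abridged stand-in for `verbs.tsv` with only four of its columns and two rows taken from the stem comparison above.

```python
import csv
import io

def readCases(fh):
    # Each row becomes a dict keyed by the column names in the header line
    return list(csv.DictReader(fh, delimiter='\t', quoting=csv.QUOTE_NONE))

def stemDisagreements(rows):
    # Keep the rows where the two sources name different stems
    return [row for row in rows if row['stemOSM'] != row['stemBHS']]

# Abridged stand-in for verbs.tsv (four columns only)
sample = io.StringIO(
    'passage\tnode\tstemOSM\tstemBHS\n'
    'Daniel 2:9\t370704\thaphel\thithpeel\n'
    'Daniel 3:1\t371620\taphel\taphel\n'
)
rows = readCases(sample)
for row in stemDisagreements(rows):
    print(row['passage'], row['node'], row['stemOSM'], row['stemBHS'])

# With the real file one would pass an open file handle instead, e.g.
# with open('verbs.tsv', encoding='utf8') as fh: rows = readCases(fh)
```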