# BHSA and OSM: comparison on word categories

We investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

In this notebook we investigate the word categories.
The [OSM docs](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html)
specify a main category for part-of-speech, and additional subtypes for noun, pronoun, adjective, preposition and suffix.

The BHSA specifies its categories in the features
[sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp.html),
[ls](https://etcbc.github.io/bhsa/features/hebrew/2017/ls.html), and
[nametype](https://etcbc.github.io/bhsa/features/hebrew/2017/nametype.html).

The purpose of this notebook is to see how they correlate.
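As a first orientation, here is a minimal sketch of how an OSM morph code such as `HNcmsc` (one of the codes in the comparison output further down) splits into its parts. The helper `split_morph` and the table `LANGUAGES` are hypothetical names used only for illustration. The sketch assumes, following the OSM docs, that the first letter marks the language and the second the part of speech.

```python
# Hypothetical helper, not part of the notebook itself.
# Assumption: first letter = language, second letter = part of speech.
LANGUAGES = {'H': 'Hebrew', 'A': 'Aramaic'}


def split_morph(code):
    """Return (language, part-of-speech letter, remainder) for one morph code."""
    if not code or len(code) < 2:
        return (None, None, '')
    return (LANGUAGES.get(code[0], code[0]), code[1], code[2:])


print(split_morph('HNcmsc'))
print(split_morph('HVqw3ms'))
```

For the parts of speech that have subtypes, the first letter of the remainder is the subtype code.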

# Results

The comparison output below shows most of the cases.
We also collect all cases in [category.tsv](category.tsv), a tab-delimited file.

In [2]:
import os
import collections

from tf.fabric import Fabric
from utils import show

# Load data
We load the BHSA data in the standard way and add the OSM data as a module that provides the features `osm` and `osm_sf`.
We only need to point TF to the right directories; it can then load every feature
that is present in those directories.

In [8]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    sp ls nametype
    osm osm_sf
    g_word_utf8
    prs uvf
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

117 features found and 0 ignored
  0.00s loading features ...
   |     0.22s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.15s B sp                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B ls                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.00s B nametype             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.02s B osm_sf               from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.14s B prs                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.15s B uvf                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.00s Feature overview: 111 f

Let's quickly survey the values of the relevant BHSA features.

We only work on words where the OSM has assigned morphology.

In [84]:
wordBase = [w for w in F.otype.s('word') if F.osm.v(w) and F.osm.v(w) != '*']
print(len(wordBase))

372636


In [5]:
F.sp.freqList()

(('subs', 125558),
 ('verb', 75450),
 ('prep', 73298),
 ('conj', 62737),
 ('nmpr', 35696),
 ('art', 30387),
 ('adjv', 10075),
 ('nega', 6059),
 ('prps', 5035),
 ('advb', 4603),
 ('prde', 2678),
 ('intj', 1912),
 ('inrg', 1303),
 ('prin', 1026))

In [6]:
F.ls.freqList()

(('none', 386055),
 ('nmdi', 9427),
 ('quot', 6525),
 ('card', 6317),
 ('padv', 5238),
 ('vbcp', 3640),
 ('ppre', 3342),
 ('gntl', 1961),
 ('focp', 1183),
 ('nmcp', 994),
 ('ques', 749),
 ('ordn', 740),
 ('afad', 547),
 ('cjad', 208),
 ('mult', 35))

In [7]:
F.nametype.freqList()

(('pers', 1671),
 ('topo', 841),
 ('gens', 51),
 ('pers,gens,topo', 19),
 ('pers,gens', 13),
 ('mens', 10),
 ('ppde', 7),
 ('gens,topo', 2),
 ('pers,god', 1))

In [9]:
F.prs.freqList()

(('absent', 235942),
 ('n/a', 145484),
 ('W', 11905),
 ('K', 7134),
 ('J', 6566),
 ('M', 3938),
 ('H', 3352),
 ('HM', 3047),
 ('KM', 2657),
 ('NW', 1635),
 ('HW', 1611),
 ('NJ', 1321),
 ('K=', 1308),
 ('HN', 192),
 ('H=', 161),
 ('MW', 117),
 ('HJ', 77),
 ('HWN', 51),
 ('N', 47),
 ('KN', 19),
 ('KWN', 10),
 ('N>', 10))

In [10]:
F.uvf.freqList()

(('absent', 423038),
 ('H', 1068),
 ('J', 946),
 ('>', 865),
 ('N', 650),
 ('W', 17))

To make the results easier to read, we translate the codes into the friendly names found in the docs of
OSM and BHSA.

In [74]:
pspOSM = {
    '': dict(
        A='adjective',
        C='conjunction',
        D='adverb',
        N='noun',
        P='pronoun',
        R='preposition',
        S='suffix',
        T='particle',
        V='verb',
    ),
    'A': dict(
        a='adjective',
        c='cardinal number',
        g='gentilic',
        o='ordinal number',
    ),
    'N': dict(
        c='common',
        g='gentilic',
        p='proper name',
    ),
    'P': dict(
        d='demonstrative',
        f='indefinite',
        i='interrogative',
        p='personal',
        r='relative',
    ),
    'R': dict(
        d='definite article',
    ),
    'S': dict(
        d='directional he',
        h='paragogic he',
        n='paragogic nun',
        p='pronominal',
    ),
    'T': dict(
        a='affirmation',
        d='definite article',
        e='exhortation',
        i='interrogative',
        j='interjection',
        m='demonstrative',
        n='negative',
        o='direct object marker',
        r='relative',
    ),
}

In [94]:
spBHS = dict(
    art='article',
    verb='verb',
    subs='noun',
    nmpr='proper noun',
    advb='adverb',
    prep='preposition',
    conj='conjunction',
    prps='personal pronoun',
    prde='demonstrative pronoun',
    prin='interrogative pronoun',
    intj='interjection',
    nega='negative particle',
    inrg='interrogative particle',
    adjv='adjective',
)
lsBHS = dict(
    nmdi='distributive noun',
    nmcp='copulative noun',
    padv='potential adverb',
    afad='anaphoric adverb',
    ppre='potential preposition',
    cjad='conjunctive adverb',
    ordn='ordinal',
    vbcp='copulative verb',
    mult='noun of multitude',
    focp='focus particle',
    ques='interrogative particle',
    gntl='gentilic',
    quot='quotation verb',
    card='cardinal',
    none='-',  # same value as NA, which is only defined in a later cell
)
nametypeBHS = dict(
    pers='person',
    mens='measurement unit',
    gens='people',
    topo='place',
    ppde='demonstrative personal pronoun',
)
nametypeBHS.update({
    'pers,gens,topo': 'person',
    'pers,gens': 'person',
    'gens,topo': 'gentilic',
    'pers,god': 'person',
})

In [112]:
naValues = {'NA', 'N/A', 'n/a', 'none', 'absent'}
NA = '-'
unknownValues = {None, '', 'unknown'}
UNKNOWN = '?'

PRS = 'p'

noSubTypes = {'C', 'D', 'V'}



def getValueBHS(x, feat=None):
    # Normalise a BHSA value: NA, UNKNOWN, or its friendly name if a table is given
    if x in naValues:
        return NA
    if x in unknownValues:
        return UNKNOWN
    return feat[x] if feat else x

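def getValueOSM(x):
    if not x or len(x) < 2: return UNKNOWN
    tp = x[1]  # part-of-speech letter; x[0] is the leading letter (H in the codes below)
    tpName = pspOSM[''][tp]
    # NB: tp is a single character, so len(tp) < 2 is always true: this returns
    # the main category only and the subtype lookup is never reached.
    return tpName if tp in noSubTypes or len(tp) < 2 else pspOSM[tp][x[2]]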

def getTypeBHS(w):
    return ':'.join((
        getValueBHS(F.sp.v(w), spBHS), 
        getValueBHS(F.ls.v(w), lsBHS), 
        getValueBHS(F.nametype.v(w), nametypeBHS),
    ))

def getTypeOSM(w): return getValueOSM(F.osm.v(w))

def getSuffixTypeBHS(w):
    prs = getValueBHS(F.prs.v(w))
    if prs not in {NA, UNKNOWN}:
        prs = PRS
    return ':'.join((prs, getValueBHS(F.uvf.v(w))))

def getSuffixTypeOSM(w): return getValueOSM(F.osm_sf.v(w))

def getWordBHS(w): return 'T={} S={}'.format(getTypeBHS(w), getSuffixTypeBHS(w))
def getWordOSM(w): return 'T={} [{}] S={} [{}]'.format(
    getTypeOSM(w),
    F.osm.v(w),
    getSuffixTypeOSM(w),
    F.osm_sf.v(w),
)
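To make the normalisation rules in these helpers concrete, here is a minimal stand-alone sketch that needs no TF data. It mirrors the constants above; the names `friendly`, `NA_VALUES`, `UNKNOWN_VALUES` and `SP_NAMES` are introduced only for this illustration.

```python
# Stand-alone sketch: mirrors the notebook's constants but needs no TF data.
NA = '-'
UNKNOWN = '?'
NA_VALUES = {'NA', 'N/A', 'n/a', 'none', 'absent'}
UNKNOWN_VALUES = {None, '', 'unknown'}

# Tiny excerpt of a translation table, in the spirit of spBHS.
SP_NAMES = {'subs': 'noun', 'verb': 'verb'}


def friendly(value, names=None):
    """Map a raw feature value to NA, UNKNOWN, or a friendly name."""
    if value in NA_VALUES:
        return NA
    if value in UNKNOWN_VALUES:
        return UNKNOWN
    return names[value] if names else value


# sp='subs', ls='none', nametype missing (None)
label = ':'.join((friendly('subs', SP_NAMES), friendly('none'), friendly(None)))
print(label)  # noun:-:?
```

This is why the BHSA labels in the output have three colon-separated parts: part of speech, lexical set, and name type.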

# Mappings

We collect the numbers of co-occurrences of OSM types and BHSA types.
We do this separately for main words and for suffixes.
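The bookkeeping behind such a co-occurrence table can be sketched with nested dictionaries, one per direction. The toy data and the names `toy`, `bhs_from_osm`, `osm_from_bhs` and `counts` are made up for illustration and are not taken from the real corpus.

```python
# Toy data: word number -> (OSM type, BHSA type). Illustrative only.
toy = {
    1: ('noun', 'noun:-:?'),
    2: ('noun', 'noun:-:?'),
    3: ('noun', 'proper noun:-:?'),
    4: ('verb', 'verb:-:?'),
}

bhs_from_osm = {}  # OSM type  -> BHSA type -> set of words
osm_from_bhs = {}  # BHSA type -> OSM type  -> set of words

for (w, (osm, bhs)) in toy.items():
    bhs_from_osm.setdefault(osm, {}).setdefault(bhs, set()).add(w)
    osm_from_bhs.setdefault(bhs, {}).setdefault(osm, set()).add(w)

# Reduce the word sets to plain counts for display.
counts = {
    osm: {bhs: len(ws) for (bhs, ws) in row.items()}
    for (osm, row) in bhs_from_osm.items()
}
print(counts)
```

Keeping the word sets rather than bare counts makes it possible to list concrete occurrences for any pairing afterwards.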

In [113]:
closerLook = set()

In [114]:
def showFeatures(base, withCases=False):
    cases = set()
    mappings = {}

    def makeMap(key, getBHS, getOSM):
        BHSFromOSM = {}
        OSMFromBHS = {}

        for w in base:
            osm = getOSM(w)
            bhs = getBHS(w)
            BHSFromOSM.setdefault(osm, {}).setdefault(bhs, set()).add(w)
            OSMFromBHS.setdefault(bhs, {}).setdefault(osm, set()).add(w)
        mappings.setdefault(key, {})[True] = BHSFromOSM
        mappings.setdefault(key, {})[False] = OSMFromBHS

    def showMap(key, direction):
        dirLabel = 'OSM ===> BHS' if direction else 'BHS ===> OSM'
        print('''
---------------------------------------------------------------------------------
--- {} {}
---------------------------------------------------------------------------------
'''.format(key, dirLabel))
        cases = set()
        for (item, itemData) in sorted(mappings[key][direction].items()):
            print('{:<40}'.format(item))
            first = True
            for (itemOther, ws) in sorted(itemData.items(), key=lambda x: (-len(x[1]), x[0])):
                nws = len(ws)
                print('\t{:<40} ({:>6}x)'.format(itemOther, nws))
                if not first:
                    for w in sorted(ws)[0:10]:
                        if withCases: show(T, F, [w], getWordBHS, getWordOSM, indent='\t\t\t\t')
                        cases.add(w)
                    if nws > 10:
                        if withCases: print('\t\t\t\tand {} more'.format(nws - 10))
                first = False
        print('\n{} ({}): {} cases'.format(key, dirLabel, len(cases)))
        return cases
    
    def showFeature(key):
        cases = set()
        print('''
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE {}
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
'''.format(key))
        for direction in (True, False):
            theseCases = showMap(key, direction)
            cases |= theseCases
        print('\n{}: {} cases'.format(key, len(cases)))
        return cases
    
    for (key, getBHS, getOSM) in (
        ('main', getTypeBHS, getTypeOSM),
        ('suffix', getSuffixTypeBHS, getSuffixTypeOSM),
    ):
        makeMap(key, getBHS, getOSM)
        cases |= showFeature(key)
    print('\n{}: {} cases'.format('All features', len(cases)))

    return cases
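Note how `showMap` selects its cases: for each item, the most frequent counterpart is taken as the regular mapping and skipped, and from every other counterpart at most ten words are kept. A minimal sketch of that selection rule follows; the function `pick_cases` and the numbers in `row` are hypothetical and only illustrate the logic.

```python
def pick_cases(row, limit=10):
    """row maps a counterpart label to the set of word nodes that carry it."""
    # Most frequent counterpart first; ties broken alphabetically.
    ranked = sorted(row.items(), key=lambda x: (-len(x[1]), x[0]))
    cases = set()
    for (label, words) in ranked[1:]:        # skip the dominant counterpart
        cases.update(sorted(words)[:limit])  # keep at most `limit` words each
    return cases


row = {
    'verb:-:?': set(range(100, 150)),       # 50 words: the regular mapping
    'adjective:-:?': set(range(200, 215)),  # 15 words: only the first 10 are kept
    'noun:-:?': {300, 301},                 # 2 words: both are kept
}
picked = pick_cases(row)
print(len(picked))  # 12
```

As a consequence, the set of cases is a sample of the deviating pairings, not an exhaustive list of every deviating word.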

# Feature comparison
We are going to compare all features.

In [115]:
cases = showFeatures(wordBase)


o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE main
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- main OSM ===> BHS
---------------------------------------------------------------------------------

adjective                               
	noun:cardinal:?                          (  6134x)
	adjective:-:?                            (  6095x)
	adjective:ordinal:?                      (   349x)
	noun:-:?                                 (   286x)
	verb:-:?                                 (    38x)
	proper noun:-:?                          (    30x)
	noun:noun of multitude:?                 (    13x)
	adjective:gentilic:?                     (     5x)
	noun:potential adverb:?                  (     5x)
	adverb:-:?                               (     3x)
	conjunction:-:?                          (     1x)
	prepo

	particle                                 (     5x)
	suffix                                   (     3x)
preposition:-:?                         
	preposition                              ( 55591x)
	particle                                 ( 11101x)
	conjunction                              (    77x)
	pronoun                                  (     5x)
	adjective                                (     1x)
	adverb                                   (     1x)
	noun                                     (     1x)
	verb                                     (     1x)
proper noun:-:?                         
	noun                                     ( 31764x)
	adjective                                (    30x)
	particle                                 (     5x)
	preposition                              (     2x)
	verb                                     (     1x)
verb:-:?                                
	verb                                     ( 50691x)
	adjective                                (  

In [116]:
cases = showFeatures(wordBase, withCases=True)


o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE main
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o


---------------------------------------------------------------------------------
--- main OSM ===> BHS
---------------------------------------------------------------------------------

adjective                               
	noun:cardinal:?                          (  6134x)
	adjective:-:?                            (  6095x)
				Genesis 1:16 w290 "גְּדֹלִ֑ים"
					BHS: T=adjective:-:? S=-:-
					OSM: T=adjective [HAampa] S=? [None]
				Genesis 1:16 w295 "גָּדֹל֙"
					BHS: T=adjective:-:? S=-:-
					OSM: T=adjective [HAamsa] S=? [None]
				Genesis 1:16 w305 "קָּטֹן֙"
					BHS: T=adjective:-:? S=-:-
					OSM: T=adjective [HAamsa] S=? [None]
				Genesis 1:20 w368 "חַיָּ֑ה"
					BHS: T=adjective:-:? S=-:-
					OSM: T=adjective [HAafsa] S=? [None]
				Genesis 1:21 w387 "גְּדֹלִ֑ים"
					BHS: T=

				Genesis 1:28 w557 "יֹּ֨אמֶר"
					BHS: T=verb:quotation verb:? S=-:-
					OSM: T=verb [HVqw3ms] S=? [None]
				and 6127 more
	verb:copulative verb:?                   (  3246x)
				Genesis 1:2 w15 "הָיְתָ֥ה"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqp3fs] S=? [None]
				Genesis 1:3 w35 "יְהִ֣י"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqj3ms] S=? [None]
				Genesis 1:3 w38 "יְהִי"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqw3ms] S=? [None]
				Genesis 1:5 w72 "יְהִי"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqw3ms] S=? [None]
				Genesis 1:5 w75 "יְהִי"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqw3ms] S=? [None]
				Genesis 1:6 w82 "יְהִ֥י"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqj3ms] S=? [None]
				Genesis 1:6 w89 "יהִ֣י"
					BHS: T=verb:copulative verb:? S=-:-
					OSM: T=verb [HVqi3ms] S=? [None]
				Genesis 1:7 w123 "יְהִי"
					BHS: T=verb:copulative verb:? S

				Numbers 7:26 w74321 "מְלֵאָ֥ה"
					BHS: T=verb:-:? S=-:-
					OSM: T=adjective [HAafsa] S=? [None]
				Numbers 7:31 w74390 "מְלֵאִ֗ים"
					BHS: T=verb:-:? S=-:-
					OSM: T=adjective [HAampa] S=? [None]
				Numbers 7:32 w74402 "מְלֵאָ֥ה"
					BHS: T=verb:-:? S=-:-
					OSM: T=adjective [HAafsa] S=? [None]
				Numbers 7:37 w74471 "מְלֵאִ֗ים"
					BHS: T=verb:-:? S=-:-
					OSM: T=adjective [HAampa] S=? [None]
				Numbers 7:38 w74483 "מְלֵאָ֥ה"
					BHS: T=verb:-:? S=-:-
					OSM: T=adjective [HAafsa] S=? [None]
				and 28 more
	noun                                     (     8x)
				Judges 4:11 w129619 "חֹתֵ֣ן"
					BHS: T=verb:-:? S=-:-
					OSM: T=noun [HNcmsc] S=? [None]
				Joel 4:3 w295061 "זֹּונָ֔ה"
					BHS: T=verb:-:? S=-:-
					OSM: T=noun [HNcfsa] S=? [None]
				Joel 4:14 w295245 "חָר֑וּץ"
					BHS: T=verb:-:? S=-:-
					OSM: T=noun [HNp] S=? [None]
				Joel 4:14 w295253 "חָרֽוּץ"
					BHS: T=verb:-:? S=-:-
					OSM: T=noun [HNp] S=? [None]
				Obadiah 1:3 w298206 "שִׁבְת

In [117]:
closerLook |= cases
print('{} cases for a closer look'.format(len(closerLook)))

636 cases for a closer look


# Result

We list all cases in [category.tsv](category.tsv).

In [118]:
fields = '''
    passage
    node
    occurrence
    OSMmorph
    OSMmorphSuffix
    OSMtype
    OSMsuffixType
    BHStype
    BHSsuffixType
'''.strip().split()
lineFormat = ('{}\t' * (len(fields) - 1)) + '{}\n'

# Write UTF-8 explicitly: the occurrence column contains pointed Hebrew text
with open('category.tsv', 'w', encoding='utf8') as fh:
    fh.write(lineFormat.format(*fields))
    for w in sorted(closerLook):
        fh.write(lineFormat.format(
            '{} {}:{}'.format(*T.sectionFromNode(w)),
            w,
            F.g_word_utf8.v(w),
            F.osm.v(w),
            F.osm_sf.v(w),
            getTypeOSM(w),
            getSuffixTypeOSM(w),
            getTypeBHS(w),
            getSuffixTypeBHS(w),
        ))
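The resulting file can be read back with Python's standard `csv` module. The sketch below parses an in-memory stand-in for `category.tsv` so that it runs without the file. The single data row is assembled from the Judges 4:11 case shown in the output above; the variable names `sample` and `rows` are illustrative.

```python
import csv
import io

# In-memory stand-in for category.tsv: header plus one row taken from
# the Judges 4:11 case in the comparison output.
sample = (
    'passage\tnode\toccurrence\tOSMmorph\tOSMmorphSuffix\t'
    'OSMtype\tOSMsuffixType\tBHStype\tBHSsuffixType\n'
    'Judges 4:11\t129619\tחֹתֵ֣ן\tHNcmsc\tNone\tnoun\t?\tverb:-:?\t-:-\n'
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter='\t'))
for row in rows:
    print(row['passage'], row['OSMtype'], row['BHStype'])
```

To read the real file, replace `io.StringIO(sample)` with `open('category.tsv', encoding='utf8', newline='')`.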