# BHSA and OSM: comparison on word categories

We investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

In this notebook we investigate the word categories.
The [OSM docs](http://openscriptures.github.io/morphhb/parsing/HebrewMorphologyCodes.html)
specify a main category for part-of-speech, and additional subtypes for noun, pronoun, adjective, preposition and suffix.

The BHSA specifies its categories in the features
[sp](https://etcbc.github.io/bhsa/features/hebrew/2017/sp.html),
[ls](https://etcbc.github.io/bhsa/features/hebrew/2017/ls.html), and
[nametype](https://etcbc.github.io/bhsa/features/hebrew/2017/nametype.html).

The purpose of this notebook is to see how they correlate.

# Mappings

We collect the numbers of cooccurrences of OSM types and BHSA types.
We do this separately for main words and for suffixes.

We give examples of the rare cases.
A combination is rare if it accounts for less than 10% of all occurrences of the type it belongs to.

That is, if OSM type $t$ co-occurs with BHSA types $s_1, \ldots, s_n$, with frequencies
$f_1, \ldots, f_n$, then we give cases of those $(t, s_i)$ such that

$$f_i < 0.10\times \sum_{j=1}^{n}f_j$$

The same rule is applied in the opposite direction, from each BHSA type to the OSM types it co-occurs with.
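
As a minimal sketch of the rarity rule, the snippet below applies the strict "less than 10%" test to a frequency table for a single type. The function name `rare_pairs` and the toy frequencies are illustrative assumptions, not part of the notebook or of the data.

```python
def rare_pairs(freqs, threshold=0.10):
    """Return the counterpart types whose frequency is below threshold * total."""
    total = sum(freqs.values())
    return sorted(s for (s, f) in freqs.items() if f < threshold * total)


# Hypothetical frequencies for one OSM type t (made up for illustration).
toy = {'noun': 900, 'adjective': 95, 'adverb': 5}
print(rare_pairs(toy))
```

With these toy numbers the total is 1000, so every counterpart with fewer than 100 occurrences counts as rare.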

# Results
* [categories.tsv](categories.tsv) overview of cooccurrences of OSM and BHSA categories
* [categoriesCases.tsv](categoriesCases.tsv) same, but examples for the rarer combinations
* [allCategoriesCases.tsv](allCategoriesCases.tsv) all rarer cases, in biblical order

In [2]:
import os
import collections
import operator
from functools import reduce

from tf.fabric import Fabric
from utils import show

# Load data
We load the BHSA data in the standard way, and we add the OSM data as a module that contributes the features `osm` and `osm_sf`.
We only need to point TF to the right directories; after that we can load any feature
that is present in those directories.

In [3]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    sp ls nametype
    osm osm_sf
    g_word_utf8
    prs uvf
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

117 features found and 0 ignored
  0.00s loading features ...
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.19s B sp                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.16s B ls                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.00s B nametype             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.17s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.03s B osm_sf               from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.16s B prs                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B uvf                  from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.00s Feature overview: 111 f

Let's take a quick look at the values of the relevant BHSA features.

We only work on words where the OSM has assigned morphology.

In [4]:
wordBase = [w for w in F.otype.s('word') if F.osm.v(w) and F.osm.v(w) != '*']
print(len(wordBase))

372636


In [5]:
F.sp.freqList()

(('subs', 125558),
 ('verb', 75450),
 ('prep', 73298),
 ('conj', 62737),
 ('nmpr', 35696),
 ('art', 30387),
 ('adjv', 10075),
 ('nega', 6059),
 ('prps', 5035),
 ('advb', 4603),
 ('prde', 2678),
 ('intj', 1912),
 ('inrg', 1303),
 ('prin', 1026))

In [6]:
F.ls.freqList()

(('none', 386055),
 ('nmdi', 9427),
 ('quot', 6525),
 ('card', 6317),
 ('padv', 5238),
 ('vbcp', 3640),
 ('ppre', 3342),
 ('gntl', 1961),
 ('focp', 1183),
 ('nmcp', 994),
 ('ques', 749),
 ('ordn', 740),
 ('afad', 547),
 ('cjad', 208),
 ('mult', 35))

In [7]:
F.nametype.freqList()

(('pers', 1671),
 ('topo', 841),
 ('gens', 51),
 ('pers,gens,topo', 19),
 ('pers,gens', 13),
 ('mens', 10),
 ('ppde', 7),
 ('gens,topo', 2),
 ('pers,god', 1))

In [8]:
F.prs.freqList()

(('absent', 235942),
 ('n/a', 145484),
 ('W', 11905),
 ('K', 7134),
 ('J', 6566),
 ('M', 3938),
 ('H', 3352),
 ('HM', 3047),
 ('KM', 2657),
 ('NW', 1635),
 ('HW', 1611),
 ('NJ', 1321),
 ('K=', 1308),
 ('HN', 192),
 ('H=', 161),
 ('MW', 117),
 ('HJ', 77),
 ('HWN', 51),
 ('N', 47),
 ('KN', 19),
 ('KWN', 10),
 ('N>', 10))

In [9]:
F.uvf.freqList()

(('absent', 423038),
 ('H', 1068),
 ('J', 946),
 ('>', 865),
 ('N', 650),
 ('W', 17))

To make the results easier to read, we translate the codes into friendly names, taken from the docs of
OSM and BHSA.

In [10]:
naValues = {'NA', 'N/A', 'n/a', 'none', 'absent'}
NA = ''

missingValues = {None, ''}
MISSING = ''

unknownValues = {'unknown'}
UNKNOWN = '?'

PRS = 'p'

noSubTypes = {'C', 'D', 'V'}

In [11]:
pspOSM = {
    '': dict(
        A='adjective',
        C='conjunction',
        D='adverb',
        N='noun',
        P='pronoun',
        R='preposition',
        S='suffix',
        T='particle',
        V='verb',
    ),
    'A': dict(
        a='adjective',
        c='cardinal number',
        g='gentilic',
        o='ordinal number',
    ),
    'N': dict(
        c='common',
        g='gentilic',
        p='proper name',
    ),
    'P': dict(
        d='demonstrative',
        f='indefinite',
        i='interrogative',
        p='personal',
        r='relative',
    ),
    'R': dict(
        d='definite article',
    ),
    'S': dict(
        d='directional he',
        h='paragogic he',
        n='paragogic nun',
        p='pronominal',
    ),
    'T': dict(
        a='affirmation',
        d='definite article',
        e='exhortation',
        i='interrogative',
        j='interjection',
        m='demonstrative',
        n='negative',
        o='direct object marker',
        r='relative',
    ),
}

In [12]:
spBHS = dict(
    art='article',
    verb='verb',
    subs='noun',
    nmpr='proper noun',
    advb='adverb',
    prep='preposition',
    conj='conjunction',
    prps='personal pronoun',
    prde='demonstrative pronoun',
    prin='interrogative pronoun',
    intj='interjection',
    nega='negative particle',
    inrg='interrogative particle',
    adjv='adjective',
)
lsBHS = dict(
    nmdi='distributive noun',
    nmcp='copulative noun',
    padv='potential adverb',
    afad='anaphoric adverb',
    ppre='potential preposition',
    cjad='conjunctive adverb',
    ordn='ordinal',
    vbcp='copulative verb',
    mult='noun of multitude',
    focp='focus particle',
    ques='interrogative particle',
    gntl='gentilic',
    quot='quotation verb',
    card='cardinal',
    none=MISSING,
)
nametypeBHS = dict(
    pers='person',
    mens='measurement unit',
    gens='people',
    topo='place',
    ppde='demonstrative personal pronoun',
)
nametypeBHS.update({
    'pers,gens,topo': 'person',
    'pers,gens': 'person',
    'gens,topo': 'gentilic',
    'pers,god': 'person',
})

In [14]:
def getValueBHS(x, feat=None): return (
        NA if x in naValues
        else MISSING if x in missingValues
        else UNKNOWN if x in unknownValues
        else feat[x] if feat
        else x
    )

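def getValueOSM(x):
    # position 0 of an OSM code is the language letter, position 1 the part-of-speech,
    # position 2 the subtype (only for types that have one)
    if not x or len(x) < 2: return UNKNOWN
    tp = x[1]
    tpName = pspOSM[''][tp]
    # types in noSubTypes get no subtype name, nor do codes that stop after the type
    subTpName = None if tp in noSubTypes or len(x) < 3 else pspOSM[tp][x[2]]
    return ':'.join((part for part in (tpName, subTpName) if part is not None))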

def getTypeBHS(w):
    return ':'.join((
        getValueBHS(F.sp.v(w), spBHS), 
        getValueBHS(F.ls.v(w), lsBHS), 
        getValueBHS(F.nametype.v(w), nametypeBHS),
    ))

def getTypeOSM(w): return getValueOSM(F.osm.v(w))

def getSuffixTypeBHS(w):
    prs = getValueBHS(F.prs.v(w))
    # collapse every actual pronominal suffix value into the single marker PRS
    if prs not in {NA, UNKNOWN}:
        prs = PRS
    return ':'.join((prs, getValueBHS(F.uvf.v(w))))

def getSuffixTypeOSM(w): return getValueOSM(F.osm_sf.v(w))

def getWordBHS(w): return 'T={} S={}'.format(getTypeBHS(w), getSuffixTypeBHS(w))
def getWordOSM(w): return 'T={} [{}] S={} [{}]'.format(
    getTypeOSM(w),
    F.osm.v(w),
    getSuffixTypeOSM(w),
    F.osm_sf.v(w),
)
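
To illustrate how `getValueOSM` reads a morphology code, here is a self-contained sketch that works without Text-Fabric. The table `PSP` is an abridged copy of `pspOSM`, the function name `decode` is hypothetical, and the sample codes are chosen for illustration only. Unlike the notebook function, this sketch falls back to `?` for letters missing from the table instead of raising a `KeyError`; that is an assumption made to keep the example robust.

```python
# Abridged copy of the pspOSM table (illustration only).
PSP = {
    '': {'C': 'conjunction', 'N': 'noun', 'T': 'particle', 'V': 'verb'},
    'N': {'c': 'common', 'g': 'gentilic', 'p': 'proper name'},
    'T': {'d': 'definite article', 'n': 'negative'},
}
NO_SUBTYPES = {'C', 'D', 'V'}
UNKNOWN = '?'


def decode(tag):
    """Translate an OSM code into 'type' or 'type:subtype'."""
    if not tag or len(tag) < 2:
        return UNKNOWN
    tp = tag[1]  # position 0 is the language letter, position 1 the type
    name = PSP[''].get(tp, UNKNOWN)
    if tp in NO_SUBTYPES or len(tag) < 3:
        return name
    sub = PSP.get(tp, {}).get(tag[2], UNKNOWN)
    return ':'.join((name, sub))


# Sample codes, chosen for illustration.
for tag in ('HNcmsa', 'HVqp3ms', 'HTd', 'HC', ''):
    print(repr(tag), '->', decode(tag))
```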

In [15]:
def showFeatures(base):
    cases = set()
    categories = []
    categoriesCases = []
    mappings = {}

    def makeMap(key, getBHS, getOSM):
        BHSFromOSM = {}
        OSMFromBHS = {}

        for w in base:
            osm = getOSM(w)
            bhs = getBHS(w)
            BHSFromOSM.setdefault(osm, {}).setdefault(bhs, set()).add(w)
            OSMFromBHS.setdefault(bhs, {}).setdefault(osm, set()).add(w)
        mappings.setdefault(key, {})[True] = BHSFromOSM
        mappings.setdefault(key, {})[False] = OSMFromBHS

    def showMap(key, direction):
        dirLabel = 'OSM ===> BHS' if direction else 'BHS ===> OSM'
        categories.append('''
---------------------------------------------------------------------------------
--- {} {}
---------------------------------------------------------------------------------
'''.format(key, dirLabel))
        categoriesCases.append(categories[-1])
        cases = set()
        for (item, itemData) in sorted(mappings[key][direction].items()):
            categories.append('{:<40}'.format(item))
            categoriesCases.append(categories[-1])

            totalCases = sum(len(d) for d in itemData.values())
            for (itemOther, ws) in sorted(itemData.items(), key=lambda x: (-len(x[1]), x[0])):
                nws = len(ws)
                perc = int(round(100 * nws / totalCases))
                categories.append('\t{:<40} ({:>3}% = {:>6}x)'.format(itemOther, perc, nws))
                categoriesCases.append(categories[-1])
                if nws < 0.1 * totalCases:
                    for w in sorted(ws)[0:10]:
                        categoriesCases.append(show(T, F, [w], getWordBHS, getWordOSM, indent='\t\t\t\t', asString=True))
                        cases.add(w)
                    if nws > 10:
                        categoriesCases.append('\t\t\t\tand {} more'.format(nws - 10))
        categories.append('\n{} ({}): {} cases'.format(key, dirLabel, len(cases)))
        categoriesCases.append(categories[-1])

        return cases
    
    def showFeature(key):
        cases = set()
        categories.append('''
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
o-o COMPARING FEATURE {}
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
'''.format(key))
        categoriesCases.append(categories[-1])

        for direction in (True, False):
            theseCases = showMap(key, direction)
            cases |= theseCases
        categories.append('\n{}: {} cases'.format(key, len(cases)))
        categoriesCases.append(categories[-1])

        return cases
    
    for (key, getBHS, getOSM) in (
        ('main', getTypeBHS, getTypeOSM),
        ('suffix', getSuffixTypeBHS, getSuffixTypeOSM),
    ):
        makeMap(key, getBHS, getOSM)
        cases |= showFeature(key)
                                          
    categories.append('\n{}: {} cases'.format('All features', len(cases)))
    categoriesCases.append(categories[-1])

    with open('categories.tsv', 'w') as fh:
        fh.write('\n'.join(categories))
    with open('categoriesCases.tsv', 'w') as fh:
        fh.write('\n'.join(categoriesCases))

    
    fields = '''
        passage
        node
        occurrence
        OSMmorph
        OSMtype
        BHStype
        OSMmorphSuffix
        OSMsuffixType
        BHSsuffixType
    '''.strip().split()
    lineFormat = ('{}\t' * (len(fields) - 1)) + '{}\n'

    with open('allCategoriesCases.tsv', 'w') as fh:
        fh.write(lineFormat.format(*fields))
        for w in sorted(cases):
            fh.write(lineFormat.format(
                '{} {}:{}'.format(*T.sectionFromNode(w)),
                w,
                F.g_word_utf8.v(w),
                F.osm.v(w),
                getTypeOSM(w),
                getTypeBHS(w),
                F.osm_sf.v(w),
                getSuffixTypeOSM(w),
                getSuffixTypeBHS(w),
            ))
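
The two-way bookkeeping done by `makeMap` and the percentage listing done by `showMap` can be sketched on toy data as follows. The helper name `make_map` and the (node, OSM type, BHSA type) triples are made up for this illustration; the real function obtains the types from the feature data.

```python
def make_map(triples):
    """Group nodes by (a, b) in both directions, like makeMap does."""
    a_to_b, b_to_a = {}, {}
    for (node, a, b) in triples:
        a_to_b.setdefault(a, {}).setdefault(b, set()).add(node)
        b_to_a.setdefault(b, {}).setdefault(a, set()).add(node)
    return a_to_b, b_to_a


# Hypothetical (node, OSM type, BHSA type) triples.
toy = [
    (1, 'noun:common', 'noun'),
    (2, 'noun:common', 'noun'),
    (3, 'noun:common', 'adjective'),
    (4, 'adjective:adjective', 'adjective'),
]
osm_to_bhs, bhs_to_osm = make_map(toy)

# List each OSM type with its BHSA counterparts, most frequent first.
for (osm, targets) in sorted(osm_to_bhs.items()):
    total = sum(len(ws) for ws in targets.values())
    for (bhs, ws) in sorted(targets.items(), key=lambda x: (-len(x[1]), x[0])):
        perc = int(round(100 * len(ws) / total))
        print('{:<22} {:<10} {:>3}% = {}x'.format(osm, bhs, perc, len(ws)))
```

Keeping sets of nodes, rather than bare counts, is what lets the notebook print concrete example words for the rare combinations.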


# Feature comparison
We now run the comparison for both the main word categories and the suffix categories.

In [16]:
showFeatures(wordBase)

# Results
* [categories.tsv](categories.tsv) overview of cooccurrences of OSM and BHSA categories
* [categoriesCases.tsv](categoriesCases.tsv) same, but examples for the rarer combinations
* [allCategoriesCases.tsv](allCategoriesCases.tsv) all rarer cases, in biblical order
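
For further processing, `allCategoriesCases.tsv` can be read back with the standard `csv` module. The sketch below uses the column names written by `showFeatures`; the single data row is a made-up placeholder, and an in-memory string stands in for the real file so that the example runs on its own.

```python
import csv
import io

FIELDS = [
    'passage', 'node', 'occurrence',
    'OSMmorph', 'OSMtype', 'BHStype',
    'OSMmorphSuffix', 'OSMsuffixType', 'BHSsuffixType',
]

# Placeholder content: a header line plus one invented row.
sample = '\t'.join(FIELDS) + '\n' + '\t'.join([
    'Book 1:1', '0', 'WORD', 'HR', 'preposition', 'preposition::', 'None', '?', ':',
]) + '\n'

# With the real file, use open('allCategoriesCases.tsv', newline='') instead.
with io.StringIO(sample) as fh:
    # The notebook writes plain tab-separated lines, so quoting is switched off.
    rows = list(csv.DictReader(fh, delimiter='\t', quoting=csv.QUOTE_NONE))

for row in rows:
    print(row['passage'], row['OSMtype'], row['BHStype'])
```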