# BHSA and OSM: comparison

We investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

We use the `osm` and `osm_sf` features compiled by the 
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).

In [1]:
import os
import collections

from tf.fabric import Fabric

In [9]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    sp
    language
    osm
    g_word_utf8
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored
  0.00s loading features ...
   |     0.23s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B sp                   from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.14s B language             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.00s Feature overview: 110 for nodes; 5 for edges; 1 configs; 7 computed
  5.19s All features loaded/computed - for details use loadLog()


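We need a helper function to display individual cases: it prints the passage, the word node numbers, and the Hebrew text of the words.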

In [10]:
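def show(ws):
    """Print passage, node numbers and Hebrew text for a list of word nodes."""
    print('{} BHS {:<30} = {}'.format(
        # passage label, taken from the first word of the list
        '{} {}:{}'.format(*T.sectionFromNode(ws[0])),
        ', '.join(str(w) for w in ws),
        '/'.join(F.g_word_utf8.v(w) for w in ws),
    ))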

# Language

Do BHSA and OSM agree on language?
Let's count the words in the BHSA where they disagree.

The BHSA names the languages by ISO codes; the OSM uses one-letter abbreviations.

In [11]:
langBhsFromOsm = dict(A='arc', H='hbo')
langOsmFromBhs = dict((y,x) for (x,y) in langBhsFromOsm.items())
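
As a quick illustration of how these two dictionaries are meant to be used, here is a minimal sketch. It assumes that the language letter is the first character of the OSM morphology string, which is what the comparison cell in this notebook relies on. The morphology strings in `examples` are hypothetical and only serve to show the lookup; they are not taken from the data.

```python
# Same mapping as in the cell above: OSM one-letter code -> BHSA ISO code.
langBhsFromOsm = dict(A='arc', H='hbo')
# Inverse mapping: BHSA ISO code -> OSM one-letter code.
langOsmFromBhs = {bhs: osm for (osm, bhs) in langBhsFromOsm.items()}

# Hypothetical OSM morphology strings, for illustration only.
examples = ['HVqp3ms', 'ANcmsa']

# The first character of each string is looked up in the mapping.
translated = [langBhsFromOsm[morph[0]] for morph in examples]
print(translated)

# Translating back yields the original language letters.
roundTrip = [langOsmFromBhs[lang] for lang in translated]
print(roundTrip)
```

Direct indexing with `langBhsFromOsm[...]` raises a `KeyError` for an unknown letter, so real data is better handled with `.get()`.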

We exclude the words for which the OSM has no morphology, or where the alignment between BHSA and OSM is problematic.

In [12]:
xLanguage = set()
strangeLanguage = collections.Counter()

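for w in F.otype.s('word'):
    osm = F.osm.v(w)
    # skip words without usable OSM morphology (missing, empty, or '*')
    if osm is None or osm == '' or osm == '*':
        continue
    # the first character of the OSM morphology is the language letter
    osmLanguage = osm[0]
    trans = langBhsFromOsm.get(osmLanguage)
    if trans is None:
        # a language letter that is not in our mapping
        strangeLanguage[osmLanguage] += 1
    elif trans != F.language.v(w):
        xLanguage.add(w)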

print('Strange languages')
for (ln, amount) in sorted(strangeLanguage.items()):
    print('Strange language {}: {:>5}x'.format(ln, amount))
print('Language discrepancies: {}'.format(len(xLanguage)))

Strange languages
Language discrepancies: 9
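
To make the three possible outcomes of this comparison explicit (skipped, unknown language letter, discrepancy), here is a small self-contained sketch that runs the same logic on toy data instead of the Text-Fabric features. The word numbers, morphology strings and language values in `toyWords` are invented for illustration, and the names `discrepancies` and `unknownLetters` are not part of the notebook.

```python
import collections

langBhsFromOsm = dict(A='arc', H='hbo')

# Hypothetical data: word number -> (OSM morphology, BHSA language).
toyWords = {
    1: ('HVqp3ms', 'hbo'),  # both say Hebrew: agreement
    2: ('ANcmsa', 'arc'),   # both say Aramaic: agreement
    3: ('HNp', 'arc'),      # OSM Hebrew, BHSA Aramaic: discrepancy
    4: ('*', 'hbo'),        # no usable morphology: skipped
    5: (None, 'hbo'),       # no morphology at all: skipped
    6: ('XNp', 'hbo'),      # language letter not in the mapping
}

discrepancies = set()
unknownLetters = collections.Counter()

for (w, (osm, bhsLanguage)) in sorted(toyWords.items()):
    if osm is None or osm == '' or osm == '*':
        continue
    letter = osm[0]
    trans = langBhsFromOsm.get(letter)
    if trans is None:
        unknownLetters[letter] += 1
    elif trans != bhsLanguage:
        discrepancies.add(w)

print(sorted(discrepancies))
print(dict(unknownLetters))
```

Keeping unknown letters in a separate counter, rather than counting them as discrepancies, separates gaps in the mapping from genuine disagreements between the two sources.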


In [13]:
for w in sorted(xLanguage):
    show([w])
    

Daniel 2:5 BHS 370626                         = הֵ֣ן
Daniel 2:9 BHS 370692                         = הֵן
Daniel 2:13 BHS 370806                         = דָּנִיֵּ֥אל
Daniel 2:24 BHS 371000                         = דָּֽנִיֵּאל֙
Daniel 2:28 BHS 371120                         = הֽוּא
Daniel 2:29 BHS 371130                         = אַחֲרֵ֣י
Daniel 3:15 BHS 371915                         = הֵ֧ן
Daniel 4:5 BHS 372449                         = דָּנִיֵּ֜אל
Daniel 7:1 BHS 374551                         = דָּנִיֵּאל֙
