# BHSA and OSM: comparison on language

We will investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

In this notebook we investigate the markup of *language* (Hebrew or Aramaic).

We use the `osm` and `osm_sf` features compiled by the 
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).

In [1]:
import os
import collections

from tf.fabric import Fabric
from utils import show

# Load data
We load the BHSA data in the standard way and add the OSM data as a second module, which supplies the features `osm` and `osm_sf`.
Note that we only need to point TF to the right directories; it can then load any feature
that is present in those directories.

In [2]:
BHSA = 'BHSA/tf/2017'
OSM = 'bridging/tf/2017'

TF = Fabric(locations='~/github/etcbc', modules=[BHSA, OSM])
api = TF.load('''
    language
    osm osm_sf
    g_word_utf8
''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored
  0.00s loading features ...
   |     0.21s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.12s B language             from /Users/dirk/github/etcbc/BHSA/tf/2017
   |     0.13s B osm                  from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.02s B osm_sf               from /Users/dirk/github/etcbc/bridging/tf/2017
   |     0.00s Feature overview: 110 for nodes; 5 for edges; 1 configs; 7 computed
  4.85s All features loaded/computed - for details use loadLog()
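
The load log above shows each feature coming from a directory made up of the location followed by the module. A minimal sketch of that resolution, assuming plain path joining (an assumption drawn from the log, not from the Text-Fabric source; `module_dirs` is a hypothetical helper):

```python
import os

def module_dirs(location, modules):
    """Return the directory in which each module is expected to live."""
    base = os.path.expanduser(location)  # resolve the leading ~
    return [os.path.join(base, module) for module in modules]

dirs = module_dirs('~/github/etcbc', ['BHSA/tf/2017', 'bridging/tf/2017'])
for d in dirs:
    print(d)
```

If a feature fails to load, checking that these directories exist is a reasonable first step.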


# Language

Do BHSA and OSM agree on language?
Let's count the words in the BHSA where they disagree.

The BHSA names the languages by means of ISO codes (`hbo`, `arc`), whereas the OSM uses one-letter abbreviations (`H`, `A`).

In the OSM, the language code is the first letter of the morphology string.

In [3]:
# OSM one-letter language code -> BHSA ISO code
langBhsFromOsm = dict(A='arc', H='hbo')
# inverse mapping: BHSA ISO code -> OSM letter
langOsmFromBhs = {bhs: osm for (osm, bhs) in langBhsFromOsm.items()}
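
To make the mapping concrete, here is a small self-contained sketch that reads the language letter off a morphology string and translates it to the BHSA code. The sample strings are illustrative values in the style of OSM morphology codes, not taken from the data, and `bhsa_language` is a hypothetical helper.

```python
LANG_BHS_FROM_OSM = {'A': 'arc', 'H': 'hbo'}

def bhsa_language(morph):
    """Translate the leading language letter of an OSM morphology string.

    Returns None when the string is empty or the letter is unknown.
    """
    if not morph:
        return None
    return LANG_BHS_FROM_OSM.get(morph[0])

# illustrative morphology strings (made up for this sketch)
samples = ['HNcmsa', 'AVqp3ms', '']
for morph in samples:
    print(repr(morph), '->', bhsa_language(morph))
```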

We exclude the words for which the OSM has no morphology, or where the alignment between BHSA and OSM is problematic. In the code below, those are the words whose `osm` value is missing, empty, or `*`.

In [4]:
xLanguage = set()
strangeLanguage = collections.Counter()

for w in F.otype.s('word'):
    osm = F.osm.v(w)
    # skip words without OSM morphology or with a problematic alignment
    if osm is None or osm == '' or osm == '*':
        continue
    osmLanguage = osm[0]  # first letter of the morphology string
    trans = langBhsFromOsm.get(osmLanguage)
    if trans is None:
        strangeLanguage[osmLanguage] += 1
    elif trans != F.language.v(w):
        xLanguage.add(w)

if strangeLanguage:
    print('Strange languages')
    for (ln, amount) in sorted(strangeLanguage.items()):
        print('Strange language {}: {:>5}x'.format(ln, amount))
else:
    print('No other languages encountered than {}'.format(', '.join(langBhsFromOsm)))
print('Language discrepancies: {}'.format(len(xLanguage)))

No other languages encountered than A, H
Language discrepancies: 9
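
The comparison can also be tried out without Text-Fabric. The sketch below applies the same exclusion and lookup rules to plain dictionaries; the word ids and morphology strings are made up for illustration, and `compare_languages` is a hypothetical helper, not part of the notebook.

```python
import collections

LANG_BHS_FROM_OSM = {'A': 'arc', 'H': 'hbo'}

def compare_languages(osm_by_word, language_by_word):
    """Return (discrepant word ids, counter of unknown OSM language letters)."""
    discrepancies = set()
    strange = collections.Counter()
    for word, morph in osm_by_word.items():
        # same exclusion rule as in the notebook cell
        if morph is None or morph in ('', '*'):
            continue
        expected = LANG_BHS_FROM_OSM.get(morph[0])
        if expected is None:
            strange[morph[0]] += 1
        elif expected != language_by_word[word]:
            discrepancies.add(word)
    return discrepancies, strange

# toy data: word ids and morphology strings are made up
osm_by_word = {1: 'HNcmsa', 2: 'HTa', 3: '*', 4: None, 5: 'XYZ', 6: 'ANcmsa'}
language_by_word = {1: 'hbo', 2: 'arc', 3: 'hbo', 4: 'hbo', 5: 'hbo', 6: 'arc'}

discrepancies, strange = compare_languages(osm_by_word, language_by_word)
print(sorted(discrepancies))
print(dict(strange))
```

In this toy data only word 2 counts as a discrepancy: words 3 and 4 are excluded, and word 5 is reported as an unknown language letter rather than as a mismatch.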


In [6]:
for w in sorted(xLanguage):
    show(T, F, [w], F.language.v, lambda x: F.osm.v(x)[0]) 

Daniel 2:5 w370626 "הֵ֣ן"
	BHS: arc
	OSM: H
Daniel 2:9 w370692 "הֵן"
	BHS: arc
	OSM: H
Daniel 2:13 w370806 "דָּנִיֵּ֥אל"
	BHS: arc
	OSM: H
Daniel 2:24 w371000 "דָּֽנִיֵּאל֙"
	BHS: arc
	OSM: H
Daniel 2:28 w371120 "הֽוּא"
	BHS: arc
	OSM: H
Daniel 2:29 w371130 "אַחֲרֵ֣י"
	BHS: arc
	OSM: H
Daniel 3:15 w371915 "הֵ֧ן"
	BHS: arc
	OSM: H
Daniel 4:5 w372449 "דָּנִיֵּ֜אל"
	BHS: arc
	OSM: H
Daniel 7:1 w374551 "דָּנִיֵּאל֙"
	BHS: arc
	OSM: H
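
Several of these word forms differ only in accents and vowel points. As a rough way to see how many distinct words are involved, the sketch below strips all combining marks (Unicode category `Mn`) from the forms listed above and tallies what remains. The list is copied by hand from the output, and `consonants` is a hypothetical helper, so treat the result as a quick check rather than a linguistic analysis.

```python
import collections
import unicodedata

def consonants(word):
    """Drop combining marks (vowel points and accents), keeping base letters."""
    return ''.join(ch for ch in word if unicodedata.category(ch) != 'Mn')

# word forms copied from the discrepancy listing above
forms = [
    "הֵ֣ן", "הֵן", "דָּנִיֵּ֥אל", "דָּֽנִיֵּאל֙", "הֽוּא",
    "אַחֲרֵ֣י", "הֵ֧ן", "דָּנִיֵּ֜אל", "דָּנִיֵּאל֙",
]

counts = collections.Counter(consonants(form) for form in forms)
for (form, amount) in counts.most_common():
    print('{} {}x'.format(form, amount))
```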
