# BHSA and OSM: comparison on language

We investigate how the morphology marked up in the OSM corresponds to, and differs from, the BHSA linguistic features.

In this notebook we investigate the markup of *language* (Hebrew or Aramaic).

We use the `osm` and `osm_sf` features compiled by the
[BHSAbridgeOSM notebook](BHSAbridgeOSM.ipynb).

In [1]:
import collections

from tf.fabric import Fabric
from utils import show

# Load data
We load the BHSA data in the standard way, and we add the OSM data as a module that contains the features `osm` and `osm_sf`.
Note that we only need to point TF to the right directories; after that we can load any feature
that is present in those directories.

In [2]:
BHSA = "BHSA/tf/2021"
OSM = "bridging/tf/2021"

TF = Fabric(locations="~/github/etcbc", modules=[BHSA, OSM])
api = TF.load(
    """
    language
    osm osm_sf
    g_word_utf8
"""
)
api.makeAvailableIn(globals())

This is Text-Fabric 9.0.3
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

117 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  3.50s All features loaded/computed - for details use TF.isLoaded()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('TF',)),
 ('Locality', 'locality', ('L Locality',)),
 ('Nodes', 'navigating-nodes', ('N Nodes',)),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]
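To check which directories the `locations` and `modules` arguments above point to, a minimal sketch follows. It assumes that Text-Fabric combines each location with each module into one directory and reads the `.tf` files found there. The helper names `feature_dirs` and `count_tf_files` are hypothetical and not part of Text-Fabric.

```python
import os


def feature_dirs(locations, modules):
    """Combine every location with every module into a candidate directory."""
    if isinstance(locations, str):
        locations = [locations]
    return [
        os.path.join(os.path.expanduser(loc), mod)
        for loc in locations
        for mod in modules
    ]


def count_tf_files(directory):
    """Return the number of .tf files in a directory, or None if it is missing."""
    if not os.path.isdir(directory):
        return None
    return sum(1 for name in os.listdir(directory) if name.endswith(".tf"))


dirs = feature_dirs("~/github/etcbc", ["BHSA/tf/2021", "bridging/tf/2021"])
for d in dirs:
    # None means the directory does not exist on this machine
    print(d, count_tf_files(d))
```

A `None` in the second column suggests that the corresponding repository has not been cloned to the expected place.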

# Language

Do BHSA and OSM agree on language?
Let's count the words in the BHSA where they disagree.

The BHSA `language` feature names the languages in full (`Hebrew`, `Aramaic`); the OSM uses one-letter abbreviations (`H`, `A`).

The OSM has the language code as the first letter of the morphology string.

In [5]:
langBhsFromOsm = dict(A="Aramaic", H="Hebrew")
langOsmFromBhs = dict((y, x) for (x, y) in langBhsFromOsm.items())
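As an illustration of how the first letter of a morphology string maps to a BHSA language name, here is a small self-contained sketch. The function name `osm_language` and the sample strings are hypothetical; the samples only imitate the general shape of OSM morphology codes and are not taken from the data.

```python
# Same mapping as above, repeated here so the sketch is self-contained
LANGUAGE_FROM_OSM = {"A": "Aramaic", "H": "Hebrew"}


def osm_language(morph):
    """Return the BHSA-style language name for an OSM morphology string.

    Returns None when there is no morphology or the first letter is unknown.
    """
    if not morph or morph == "*":
        return None
    return LANGUAGE_FROM_OSM.get(morph[0])


# Illustrative strings only, not actual values from the osm feature
for sample in ["HVqp3ms", "ANcmsa", "*", ""]:
    print(repr(sample), osm_language(sample))
```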

We exclude the words for which the OSM has no morphology, or where the alignment between BHSA and OSM is problematic.

In [6]:
xLanguage = set()
strangeLanguage = collections.Counter()

for w in F.otype.s("word"):
    osm = F.osm.v(w)
    # skip words without usable OSM morphology
    if osm is None or osm == "" or osm == "*":
        continue
    osmLanguage = osm[0]
    trans = langBhsFromOsm.get(osmLanguage)
    if trans is None:
        # first letter is not a known language code
        strangeLanguage[osmLanguage] += 1
    elif trans != F.language.v(w):
        xLanguage.add(w)

if strangeLanguage:
    print("Strange languages")
    for (ln, amount) in sorted(strangeLanguage.items()):
        print("Strange language {}: {:>5}x".format(ln, amount))
else:
    print("No other languages encountered than {}".format(", ".join(langBhsFromOsm)))
print("Language discrepancies: {}".format(len(xLanguage)))

No other languages encountered than A, H
Language discrepancies: 2


In [7]:
for w in sorted(xLanguage):
    show(T, F, [w], F.language.v, lambda x: F.osm.v(x)[0])

Psalms 116:12 w330987"כָּֽל"
	BHS: Aramaic
	OSM: H
Psalms 116:12 w330988"תַּגְמוּלֹ֥והִי"
	BHS: Aramaic
	OSM: H
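
Both discrepancies listed above go in the same direction: the BHSA says Aramaic and the OSM says Hebrew. For a larger set of discrepancies, a tally per (BHSA, OSM) pair gives a quicker overview than a word-by-word listing. The sketch below shows the idea on made-up data: the dictionaries `bhsa_language` and `osm_morph` are hypothetical stand-ins for `F.language.v` and `F.osm.v`, and the function name `tally_discrepancies` is an assumption as well.

```python
import collections

LANGUAGE_FROM_OSM = {"A": "Aramaic", "H": "Hebrew"}

# Hypothetical stand-ins for F.language.v and F.osm.v, keyed by word node
bhsa_language = {1: "Hebrew", 2: "Aramaic", 3: "Aramaic", 4: "Hebrew"}
osm_morph = {1: "HNcmsa", 2: "HNcmsc", 3: "ANcmsa", 4: "*"}


def tally_discrepancies(bhsa_language, osm_morph):
    """Count language disagreements per (BHSA language, OSM language) pair."""
    pairs = collections.Counter()
    for word, morph in osm_morph.items():
        # skip words without usable OSM morphology
        if not morph or morph == "*":
            continue
        osm_lang = LANGUAGE_FROM_OSM.get(morph[0])
        if osm_lang is None:
            # unknown language code, not counted as a discrepancy
            continue
        bhsa_lang = bhsa_language.get(word)
        if osm_lang != bhsa_lang:
            pairs[(bhsa_lang, osm_lang)] += 1
    return pairs


pairs = tally_discrepancies(bhsa_language, osm_morph)
for (bhsa_lang, osm_lang), amount in sorted(pairs.items()):
    print("BHSA {} vs OSM {}: {}x".format(bhsa_lang, osm_lang, amount))
```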
