<img align="right" src="images/dans-small.png"/>
<img align="right" src="images/tf-small.png"/>
<img align="right" src="images/etcbc.png"/>

We make a link between the morphology in the
[Openscriptures](http://openscriptures.org)
and the linguistics in the [BHSA](https://github.com/ETCBC/bhsa).

We proceed as follows:

* extract the morphology from the files in
  [openscriptures/morphhb/wlc](https://github.com/openscriptures/morphhb/tree/master/wlc)
* link the words in the openscripture files to slots in the BHSA
* compile the openscripture morphology data into a TF feature file.

In [1]:
import os
from glob import glob
from lxml import etree
from itertools import zip_longest
from unicodedata import normalize, category

from tf.fabric import Fabric

# Loading BHSA

In [2]:
REPO = os.path.expanduser('~/github/etcbc/bhsa')
baseDir = '{}/tf'.format(REPO)
tempDir = '{}/_temp'.format(REPO)
VERSION = '2017'

TF = Fabric(locations='{}/tf/{}'.format(REPO, VERSION), modules=[''])
api = TF.load('book g_cons_utf8 g_prs_utf8')
api.makeAvailableIn(globals())


This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.21s B g_cons_utf8          from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.78s T g_prs_utf8           from /Users/dirk/github/etcbc/bhsa/tf/2017
   |     0.00s Feature overview: 108 for nodes; 5 for edges; 1 configs; 7 computed
  5.85s All features loaded/computed - for details use loadLog()


In [3]:
bhsBooks = sorted(F.book.v(n) for n in F.otype.s('book'))
print('\n'.join(bhsBooks))

Amos
Canticum
Chronica_I
Chronica_II
Daniel
Deuteronomium
Ecclesiastes
Esra
Esther
Exodus
Ezechiel
Genesis
Habakuk
Haggai
Hosea
Iob
Jeremia
Jesaia
Joel
Jona
Josua
Judices
Leviticus
Maleachi
Micha
Nahum
Nehemia
Numeri
Obadia
Proverbia
Psalmi
Reges_I
Reges_II
Ruth
Sacharia
Samuel_I
Samuel_II
Threni
Zephania


# Reading open scriptures

In [4]:
OS_BASE = os.path.expanduser('~/github/openscriptures/morphhb/wlc')
os.chdir(OS_BASE)
osBookSet = set(fn[0:-4] for fn in glob('*.xml') if fn != 'VerseMap.xml')

In [5]:
print('\n'.join(sorted(osBookSet)))

1Chr
1Kgs
1Sam
2Chr
2Kgs
2Sam
Amos
Dan
Deut
Eccl
Esth
Exod
Ezek
Ezra
Gen
Hab
Hag
Hos
Isa
Jer
Job
Joel
Jonah
Josh
Judg
Lam
Lev
Mal
Mic
Nah
Neh
Num
Obad
Prov
Ps
Ruth
Song
Zech
Zeph


In [6]:
osBooks = '''
Amos
Song
1Chr
2Chr
Dan
Deut
Eccl
Esth
Exod
Ezek
Ezra
Gen
Hab
Hag
Hos
Job
Isa
Jer
Joel
Jonah
Josh
Judg
Lev
Mal
Mic
Nah
Neh
Num
Obad
Prov
Ps
1Kgs
2Kgs
Ruth
Zech
1Sam
2Sam
Lam
Zeph
'''.strip().split()

In [7]:
osBookFromBhs = {}
bhsBookFromOs = {}
for (i, bhsBook) in enumerate(bhsBooks):
    osBook = osBooks[i]
    osBookFromBhs[bhsBook] = osBook
    bhsBookFromOs[osBook] = bhsBook

In [23]:
NS = '{http://www.bibletechnologies.net/2003/OSIS/namespace}'
NFD = 'NFD'
LO = 'Lo'

finals = {
    'ך':\
    'כ',
    'ם':\
    'מ',
    'ן':\
    'נ',
    'ף':\
    'פ',
    'ץ':\
    'צ',
}

finalsI = {v: k for (k,v) in finals.items()}

# k	05DA	ך	letter final kaf
# K	05DB	כ	letter kaf
# m	05DD	ם	letter final mem
# M	05DE	מ	letter mem
# n	05DF	ן	letter final nun
# N	05E0	נ	letter nun
# p	05E3	ף	letter final pe
# P	05E4	פ	letter pe
# y	05E5	ץ	letter final tsadi
# Y	05E6	צ	letter tsadi


def toCons(fw): return ''.join(c for c in normalize(NFD, fw) if category(c) == LO)
def final(c): return finalsI.get(c, c)
def finalCons(s): return s[0:-1]+final(s[-1])

def readOsBook(osBook, osWords, stats):
    infile = '{}.xml'.format(osBook)
    parser = etree.XMLParser(remove_blank_text=True, ns_clean=True)
    root = etree.parse(infile, parser).getroot()
    osisTextNode = root[0]
    divNode = osisTextNode[1]
    chapterNodes = list(divNode)
    print('reading {:<5} ({:<15}) {:>3} chapters'.format(osBook, bhsBookFromOs[osBook], len(chapterNodes)))
    for chapterNode in chapterNodes:
        if chapterNode.tag != NS+'chapter': continue
        for verseNode in list(chapterNode):
            if verseNode.tag != NS+'verse': continue
            for wordNode in list(verseNode):
                if wordNode.tag != NS+'w': continue
                lemma = wordNode.get('lemma', None)
                morph = wordNode.get('morph', None)
                text = wordNode.text
                lemmas = lemma.split('/') if lemma != None else []
                morphs = morph.split('/') if morph != None else []
                texts = text.split('/') if text != None else []
                for (lm, mph, tx) in zip_longest(lemmas, morphs, texts):
                    txc = None if tx == None else toCons(tx)
                    osWords.append((tx, txc, mph, lm))
                    if mph == None:
                        stats['noMorph'] += 1
                    if tx == None:
                        stats['xMorph'] += 1

In [9]:
osWords = []
stats = dict(noMorph=0, xMorph=0)

for bn in F.otype.s('book'):
    bhsBook = T.sectionFromNode(bn, lang='la')[0]
    osBook = osBookFromBhs[bhsBook]
    readOsBook(osBook, osWords, stats)

print('''
BHS words:       {:>6}
Collected words: {:>6}
No morphology:   {:>6}
Mismatches:      {:>6}
{} % of the words are morphologically annotated.
'''.format(
        F.otype.maxSlot,
        len(osWords),
        stats['noMorph'], 
        stats['xMorph'], 
        round(100 * (len(osWords) - stats['noMorph'] - stats['xMorph'])/len(osWords)),
))

reading Gen   (Genesis        )  50 chapters
reading Ezek  (Exodus         )  48 chapters
reading Lev   (Leviticus      )  27 chapters
reading Num   (Numeri         )  36 chapters
reading Deut  (Deuteronomium  )  34 chapters
reading Josh  (Josua          )  24 chapters
reading Judg  (Judices        )  21 chapters
reading 1Sam  (Samuel_I       )  31 chapters
reading 2Sam  (Samuel_II      )  24 chapters
reading 1Kgs  (Reges_I        )  22 chapters
reading 2Kgs  (Reges_II       )  25 chapters
reading Jer   (Jesaia         )  52 chapters
reading Isa   (Jeremia        )  66 chapters
reading Ezra  (Ezechiel       )  10 chapters
reading Hos   (Hosea          )  14 chapters
reading Joel  (Joel           )   4 chapters
reading Amos  (Amos           )   9 chapters
reading Obad  (Obadia         )   1 chapters
reading Jonah (Jona           )   4 chapters
reading Mic   (Micha          )   7 chapters
reading Nah   (Nahum          )   3 chapters
reading Hab   (Habakuk        )   3 chapters
reading Ze

In [10]:
list(enumerate(osWords[0:100]))

[(0, ('בְּ', 'ב', 'HR', 'b')),
 (1, ('רֵאשִׁ֖ית', 'ראשית', 'Ncfsa', '7225')),
 (2, ('בָּרָ֣א', 'ברא', 'HVqp3ms', '1254 a')),
 (3, ('אֱלֹהִ֑ים', 'אלהים', 'HNcmpa', '430')),
 (4, ('אֵ֥ת', 'את', 'HTo', '853')),
 (5, ('הַ', 'ה', 'HTd', 'd')),
 (6, ('שָּׁמַ֖יִם', 'שמים', 'Ncmpa', '8064')),
 (7, ('וְ', 'ו', 'HC', 'c')),
 (8, ('אֵ֥ת', 'את', 'To', '853')),
 (9, ('הָ', 'ה', 'HTd', 'd')),
 (10, ('אָֽרֶץ', 'ארץ', 'Ncbsa', '776')),
 (11, ('וְ', 'ו', 'HC', 'c')),
 (12, ('הָ', 'ה', 'Td', 'd')),
 (13, ('אָ֗רֶץ', 'ארץ', 'Ncbsa', '776')),
 (14, ('הָיְתָ֥ה', 'היתה', 'HVqp3fs', '1961')),
 (15, ('תֹ֨הוּ֙', 'תהו', 'HNcmsa', '8414')),
 (16, ('וָ', 'ו', 'HC', 'c')),
 (17, ('בֹ֔הוּ', 'בהו', 'Ncmsa', '922')),
 (18, ('וְ', 'ו', 'HC', 'c')),
 (19, ('חֹ֖שֶׁךְ', 'חשך', 'Ncmsa', '2822')),
 (20, ('עַל', 'על', 'HR', '5921 a')),
 (21, ('פְּנֵ֣י', 'פני', 'HNcbpc', '6440')),
 (22, ('תְה֑וֹם', 'תהום', 'HNcbsa', '8415')),
 (23, ('וְ', 'ו', 'HC', 'c')),
 (24, ('ר֣וּחַ', 'רוח', 'Ncbsc', '7307')),
 (25, ('אֱלֹהִ֔ים', 'אלהים'

Why are there 40,000 word more in OS than in BHSA?
Let's explore.

In [11]:
for (i, w) in enumerate(F.otype.s('word')):
    bhs = toCons(F.g_cons_utf8.v(w))
    os = toCons(osWords[i][0])
    if bhs != os:
        print('Mismatch at {}: bhs=[{}] os=[{}]'.format(i, bhs, os))
        break

Mismatch at 61: bhs=[] os=[אור]


In [12]:
for i in range(61,65):
    print(i, F.g_cons_utf8.v(i))

61 ל
62 
63 אור
64 יום


Aha, the BHSA has encoded an empty article here, because the pointing of surrounding letters signals an article.
So let's ignore the inserted empty articles of the BHSA.

In [13]:
j = -1
for w in F.otype.s('word'):
    bhs = toCons(F.g_cons_utf8.v(w))
    if bhs == '': continue
    j += 1
    os = osWords[j][1]
    if bhs != os:
        print('''Mismatch at BHS-{} OS-{}:\nbhs=[{}]\nos=[{}]'''.format(w, j, bhs, os))
        break

Mismatch at BHS-194 OS-187:
bhs=[מינו]
os=[מינ]


In [14]:
for w in range(192,196):
    print(w, toCons(F.g_cons_utf8.v(w)), osWords[w-7][1])

192 פרי פרי
193 ל ל
194 מינו מינ
195 אשר ו


Aha, the BHS does not split a word here, while the OS does, or rather: the OS specifies a morpheme boundary here.
Maybe we can remedy this by looking at the pronominal suffix in the BHS.

In [26]:
j = -1
bhsPrs = {}
for w in F.otype.s('word'):
    bhs = toCons(F.g_cons_utf8.v(w))
    if bhs == '': continue
    j += 1
    os = osWords[j][1]
    prs = F.g_prs_utf8.v(w)
    if prs:
        cprs = toCons(prs.strip())
        fcprs = finalCons(cprs)
        if bhs.endswith(cprs) or bhs.endswith(fcprs):
            bhs = bhs[0:len(bhs)-len(cprs)]
    if bhs != os:
        print('''Mismatch at BHS-{} OS-{}:\nbhs=[{}]\nos=[{}]'''.format(w, j, bhs, os))
        break
    if prs:
        j += 1
        os = osWords[j][1]
        bhs = fcprs
        if bhs != os:
            print('''Mismatch in prs of BHS-{} OS-{}:\nbhs=[{}]\nos=[{}]'''.format(w, j, bhs, os))
            break

Mismatch at BHS-988 OS-1012:
bhs=[ממנ]
os=[ממ]


In [30]:
for w in range(988,990):
    print(w, toCons(F.g_cons_utf8.v(w)), osWords[w+24][1])

988 ממנו ממ
989 כי נו


In [25]:
toCons(F.g_prs_utf8.v(988))

'ו'

In [27]:
toCons(F.g_cons_utf8.v(988))

'ממנו'

In [28]:
for i in range(1010,1015): print(osWords[i][1])

לא
תאכל
ממ
נו
כי


In [32]:
T.sectionFromNode(988)

('Genesis', 2, 17)