<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Phonological representation

This notebook produces the plain text of the Hebrew Bible in a phonological/phonetic representation, augmented with lexical and morphological features, in a compact format suited to trigram analysis of the Hebrew text.

# Result

[phono.csv](phono.csv), a tab-delimited file (ca. 15 MB) with the following fields:

    book chapter verse wordentry

``wordentry`` is a composite field: an underscore-separated list of the following fields:

    phonetic                   phonetic representation of word occurrence
    lexeme                     lexeme identifier
    verb.stem                  all values on verbs are prefixed with v.
    verb.tense
    verb.person
    verb.number
    verb.gender
    noun.number                all values on nouns are prefixed with n.
    noun.gender
    noun.state
    adjv.number                all values on adjectives are prefixed with a.
    adjv.gender
    h-ending                   heh locale
    nounsuff.person            pronominal suffix on noun, all values prefixed with ns.
    nounsuff.number
    nounsuff.gender
    verbsuff.person            pronominal suffix on verb, all values prefixed with vs.
    verbsuff.number
    verbsuff.gender

Words that are written together are taken together as one unit (a superword), but a maqef (-) counts as a word separator.
Each superword contains at most one inflectional word.
The attributes given for a superword are those of its inflectional nucleus.
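
A minimal sketch of how one row could be read back. It assumes that the phonetic field itself contains no underscore and that morphological values can be recognized by their prefix (`v.`, `n.`, `a.`, `ns.`, `vs.`). The function name `parse_row` and the sample row are made up for illustration; the row is constructed by hand and not copied from phono.csv.

```python
from collections import defaultdict

def parse_row(row):
    """Split one row into passage fields and a word entry grouped by prefix."""
    book, chapter, verse, wordentry = row.rstrip('\n').split('\t')
    parts = wordentry.split('_')
    phonetic, lexeme = parts[0], parts[1]
    grouped = defaultdict(list)
    for part in parts[2:]:
        if part == '':
            continue                          # feature not present on this unit
        if part == 'h':
            grouped['h-ending'].append(part)  # the only value without a prefix
            continue
        prefix, _, value = part.partition('.')
        grouped[prefix].append(value)
    return dict(
        book=book, chapter=chapter, verse=verse,
        phonetic=phonetic, lexeme=lexeme, features=dict(grouped),
    )

# Hypothetical row, for illustration only
sample = 'Genesis\t1\t1\tbårå_BR>[_v.qal_v.perf_v.3_v.s_v.m______\n'
parsed = parse_row(sample)
print(parsed['phonetic'], parsed['lexeme'], parsed['features'])
```

Grouping by prefix rather than by position is a deliberate choice here: fields that do not apply to a unit are written as empty strings, so the position of a value alone is a less direct guide than its prefix.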

## Phonetic transcription

The phonetic description is documented in the [phono notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/phono/phono.html) on SHEBANQ.

For the purposes here, the following simplifications are applied:

* we remove all schwas
* we do not distinguish qamets gadol from qamets qatan
* we remove the accent marks
* we use the qere and ignore the ketiv where they differ
* the `[ ]` around the tetragrammaton are removed (the occurrences can still be recognized by lexeme=`JHWH/`)
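
The character-level simplifications in this list can be sketched on a plain string, without loading the corpus. This is an illustration only: the function name `simplify` is hypothetical, the qere/ketiv choice is not covered, and the code assumes the transcription uses precomposed (NFC) characters, so that a plain `o` is distinct from `ō` and `ô`.

```python
import unicodedata

def simplify(ph):
    """Apply the character-level simplifications to one transcribed word."""
    ph = unicodedata.normalize('NFC', ph)      # assumption: precomposed letters
    for mark in ('ᵊ', 'ˈ', 'ˌ', '[', ']'):     # schwa, accent marks, brackets
        ph = ph.replace(mark, '')
    # one symbol for both qamets gadol (ā) and qamets qatan (o)
    return ph.replace('ā', 'å').replace('o', 'å')

print(simplify('[yᵊhwˈāh]'))
print(simplify('kol'))
```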

In [2]:
import sys, os, collections, re

from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.12
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html



In [3]:
version = '4b'
fabric.load('etcbc{}'.format(version), 'lexicon', 'pproduction', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads
        phono phono_sep
        lex
        sp vs vt gn nu ps st
        uvf prs pfm vbs vbe
        language
        book chapter verse label
    ''',''),
    "prepare": prepare,
})
exec(fabric.localnames.format(var='fabric'))
trans = Transcription()

  0.00s LOADING API: please wait ... 
  0.00s USING main  DATA COMPILED AT: 2015-11-02T15-08-56
  0.00s USING annox DATA COMPILED AT: 2016-01-27T19-01-17
  8.82s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/pproduction/__log__pproduction.txt
  8.82s INFO: LOADING PREPARED data: please wait ... 
  8.82s prep prep: G.node_sort
  8.95s prep prep: G.node_sort_inv
  9.61s prep prep: L.node_up
    14s prep prep: L.node_down
    21s prep prep: V.verses
    21s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
    23s INFO: LOADED PREPARED data
    23s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX lexicon FOR TASK pproduction AT 2016-02-23T09-22-39


# Compile "phonetic" data

In [4]:
prs_map = {
    'W':  '3sm',
    'K':  '2sm',
    'J':  '1sc',
    'M':  '3pm',
    'H':  '3sf',
    'HM': '3pm',
    'KM': '2pm',
    'NW': '1pc',
    'HW': '3sm',
    'NJ': '1sc',
    'K=': '2sf',
    'HN': '3pf',
    'MW': '3pm',
    'N':  '3pf',
    'KN': '2pf',
}

def expand(mrf, pref): return '_'.join(pref+'.'+x for x in mrf)

fmt_str = '{}\t'+('_'.join(['{}'] * 11)) + '\n'

png = dict(
    NA='',
    unknown='',
    p1='1',
    p2='2',
    p3='3',
    sg='s',
    du='d',
    pl='p',
    m='m',
    f='f',
    a='a',
    c='c',
    e='e',
)
undefs = {'NA', 'unknown'}
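
To make the role of `expand` concrete: it takes a three-character code from `prs_map` (person, number, gender) and gives each character the prefix of its field. The sketch below repeats the definition and a two-entry excerpt of the map so that it runs on its own; the name `prs_excerpt` is only used here.

```python
def expand(mrf, pref):
    # prefix every character of the code and join with underscores
    return '_'.join(pref + '.' + x for x in mrf)

prs_excerpt = {'W': '3sm', 'NW': '1pc'}   # excerpt of prs_map

print(expand(prs_excerpt['W'], 'ns'))      # ns.3_ns.s_ns.m
print(expand(prs_excerpt['NW'], 'vs'))     # vs.1_vs.p_vs.c
print(repr(expand('', 'ns')))              # '' when there is no suffix
```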

# Phonetics

We use the phonetic representation, stored in the features `phono` and `phono_sep`.

Interword material is in `phono_sep`, which can have only three values: the empty string, a space, or a space followed by a period (` .`).

In [60]:
x = '1'
i = 1
(pre, this, post) = (x+'23')[i-1:i+2]
print('{} = {} = {}'.format(pre, this, post))

1 = 2 = 3


In [65]:
msg('Yods')
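small_yods = collections.defaultdict(set)
for w in F.otype.s('word'):
    ph = F.phono.v(w)
    i = ph.find('ʸ')
    if i != -1:
        # the characters just before and after the first small yod;
        # the 'xx' padding keeps the slice three characters long at the end of a word
        (pre, yod, post) = (ph+'xx')[i-1:i+2]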
        if pre == 'e': small_yods['e'].add(ph)
        elif post == 'w': small_yods['w'].add(ph)
        else: small_yods['rest'].add(ph)
msg('Done {} small yods')
for (x,y) in small_yods.items():
    print('{}: {}x'.format(x, len(y)))

 1h 35m 24s Yods
 1h 35m 25s Done {} small yods


rest: 32x
e: 905x
w: 935x


In [66]:
for x in sorted(small_yods['rest']):
    print(x)

heḥᵉʸiṯˈānû
heḥᵉʸˈêṯî
heḥᵉʸˈā
heḥᵉʸˌā
hˈᵉʸôṯ-
hˈᵉʸē-
hᵉʸiṯˌem
hᵉʸîṯˈem
hᵉʸîṯˌem
hᵉʸôṯ
hᵉʸôṯ-
hᵉʸôṯˈô
hᵉʸôṯˈām
hᵉʸôṯˈēḵ
hᵉʸôṯˌî
hᵉʸôṯˌāh
hᵉʸôṯˌām
hᵉʸôṯˌēnû
hᵉʸôṯᵊḵˈā
hᵉʸē-
hᵉʸˈîṯem
hᵉʸˈôṯ
hᵉʸˈôṯ-
hᵉʸˈôṯᵊḵem
hᵉʸˈē
hᵉʸˌôṯ
hᵉʸˌû
lˈeḥᵉʸô
lˈeḥᵉʸˈāh
ʔᵉʸˈāl
ˈʔᵉʸālûṯˈî
ḥᵉʸˈî


In [35]:
msg('Furtive')
furtives = set()
strange_furtives = set()
for w in F.otype.s('word'):
    ph = F.phono.v(w)
    i = ph.find('ₐ')
    if i != -1:
        furtives.add(ph[i-1])
        if ph[i-1] == 'ˌ': strange_furtives.add(ph)
msg('Done {} furtives, {} strange'.format(len(furtives), len(strange_furtives)))
print(furtives)
print(strange_furtives)

50m 48s Furtive
50m 49s Done 10 furtives, 13 strange


{'ē', 'i', 'ō', 'ô', 'ê', 'a', 'ˌ', 'u', 'î', 'û'}
{'mmiqṣôˌₐʕ', 'yārôˌₐḥ', 'hišmîˌₐʕ', 'maddûˌₐʕ', 'haddîˌₐḥ', 'rûˌₐḥ', 'hiznîˌₐḥ', 'ššᵊmôˌₐʕ', 'yānôˌₐḥ', 'hinnîˌₐḥ', 'zānôˌₐḥ', 'rāqîˌₐʕ', 'ʔᵉlôˌₐh'}


In [33]:
print(', '.join(x.upper() for x in furtives))

Ē, I, Ō, Ô, Ê, A, ˌ, U, Î, Û


In [None]:
ˌₐ

In [5]:
msg('Tetra')
tetras = set()
for w in F.otype.s('word'):
    ph = F.phono.v(w)
    if F.lex.v(w) == 'JHWH/': tetras.add(ph)
msg('Done. {} tetras'.format(len(tetras)))

  2.64s Tetra
  4.60s Done. 31 tetras


In [6]:
for t in sorted(tetras):
    print(t)

[yhwāh]
[yhwˈāh]
[yhwˌih]
[yhwˌāh]
[yhôˈāh]
[yhôˌāh]
[yᵃhwˌāh]
[yᵉhwˈih]
[yᵉhwˌih]
[yᵉhôˈih]
[yᵊhwih]
[yᵊhwāh]
[yᵊhwˈih]
[yᵊhwˈāh]
[yᵊhwˈˌāh]-
[yᵊhwˌih]
[yᵊhwˌāh]
[yᵊhôˈih]
[yᵊhôˈāh]
[yᵊhôˌih]
[yᵊhôˌāh]
[yᵊhˈwāh]
[yᵊhˈwˈih]
[yᵊhˈwˈāh]
[yᵊhˈwˌāh]
[yᵊˈhwˈāh]
[ˈyhwāh]
[ˈyhwˈih]
[ˈyhwˈāh]
[ˈyhôāh]
[ˈyhôˈāh]


In [25]:
tetra_map = {
    '[yhwāh]': 'ʔᵃḏōnåy',
    '[yhwˈāh]': 'ʔᵃḏōnåy',
    '[yhwˌih]': 'ʔᵉlōhîm',
    '[yhwˌāh]': 'ʔᵃḏōnåy',
    '[yhôˈāh]': 'ʔᵃḏōnåy',
    '[yhôˌāh]': 'ʔᵃḏōnåy',
    '[yᵃhwˌāh]': 'ʔᵃḏōnåy',
    '[yᵉhwˈih]': 'ʔᵉlōhîm',
    '[yᵉhwˌih]': 'ʔᵉlōhîm',
    '[yᵉhôˈih]': 'ʔᵉlōhîm',
    '[yᵊhwih]': 'ʔᵉlōhîm',
    '[yᵊhwāh]': 'ʔᵃḏōnåy',
    '[yᵊhwˈih]': 'ʔᵉlōhîm',
    '[yᵊhwˈāh]': 'ʔᵃḏōnåy',
    '[yᵊhwˈˌāh]': 'ʔᵃḏōnåy', # xxxx
    '[yᵊhwˌih]': 'ʔᵉlōhîm',
    '[yᵊhwˌāh]': 'ʔᵃḏōnåy',
    '[yᵊhôˈih]': 'ʔᵉlōhîm',
    '[yᵊhôˈāh]': 'ʔᵃḏōnåy',
    '[yᵊhôˌih]': 'ʔᵉlōhîm',
    '[yᵊhôˌāh]': 'ʔᵃḏōnåy',
    '[yᵊhˈwāh]': 'ʔᵃḏōnåy',
    '[yᵊhˈwˈih]': 'ʔᵉlōhîm',
    '[yᵊhˈwˈāh]': 'ʔᵃḏōnåy',
    '[yᵊhˈwˌāh]': 'ʔᵃḏōnåy',
    '[yᵊˈhwˈāh]': 'ʔᵃḏōnåy',
    '[ˈyhwāh]': 'ʔᵃḏōnåy',
    '[ˈyhwˈih]': 'ʔᵉlōhîm',
    '[ˈyhwˈāh]': 'ʔᵃḏōnåy',
    '[ˈyhôāh]': 'ʔᵃḏōnåy',
    '[ˈyhôˈāh]': 'ʔᵃḏōnåy',
}

In [67]:
tetra_pat = re.compile(r'(\[[^]]*\])')
furtive_pat = re.compile('(.)ₐ')

def tetra_repl(match):
    return tetra_map[match.group(1)]

def furtive_repl(match):
    return match.group(1).upper()

def phono_old(w): return F.phono.v(w).\
    replace('ᵊ', '').\
    replace('ā', 'å').\
    replace('o', 'å').\
    replace('ˈ', '').\
    replace('ˌ', '').\
    replace('*', '').\
    replace('[', '').\
    replace(']', '')
    
def phono(w):
    ph = tetra_pat.sub(tetra_repl, F.phono.v(w))
    ph =  ph.\
    replace('ᵊ', '').\
    replace('ᵉ', '').\
    replace('ᵃ', '').\
    replace('ᵒ', '').\
    replace('ā', 'å').\
    replace('o', 'å').\
    replace('ˈ', '').\
    replace('ˌ', '').\
    replace('*', '').\
    replace(' ', '').\
    replace('-', '').\
    replace('eʸ', 'e').\
    replace('ʸw', 'w').\
    replace('ʸ', 'y')
    ph = furtive_pat.sub(furtive_repl, ph)
    return ph

In [38]:
vb = 'xxeₐzzuₐyy'
print(furtive_pat.sub(furtive_repl, vb))

xxEzzUyy
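
The tetragrammaton substitution can be checked in the same spirit. A minimal sketch, using a two-entry excerpt of `tetra_map` (named `tetra_excerpt` here) and a raw-string pattern; the third test form, with a trailing maqef, is constructed for illustration.

```python
import re

tetra_excerpt = {
    '[yᵊhwˈāh]': 'ʔᵃḏōnåy',
    '[yᵊhwˈih]': 'ʔᵉlōhîm',
}
tetra_pat = re.compile(r'(\[[^]]*\])')

def tetra_repl(match):
    # look up the bracketed form, brackets included
    return tetra_excerpt[match.group(1)]

for form in ('[yᵊhwˈāh]', '[yᵊhwˈih]', '[yᵊhwˈāh]-'):
    print(form, '->', tetra_pat.sub(tetra_repl, form))
```

Only the bracketed part is replaced, so material outside the brackets, such as a maqef, stays in place. A bracketed form that is missing from the map raises a `KeyError`, which makes gaps in the map visible.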


In [44]:
# Just as a check: collect all seps
seps = collections.Counter()
for w in F.otype.s('word'):
    seps[F.phono_sep.v(w)] += 1
seps

Counter({'': 164105, ' ': 239251, ' .': 23212})

In [18]:
# Just as a check: is there an underscore in the values of the uvf, pfm, vbs or vbe features?
vals = collections.defaultdict(lambda: collections.defaultdict(lambda: []))
features = ['uvf', 'pfm', 'vbs', 'vbe']
for w in F.otype.s('word'):
    for f in features:
        val = F.item[f].v(w)
        if '_' in val:
            vals[f][val].append(w)
print(vals if vals else 'No _ in feature values')
# the answer should be no

No _ in feature values


In [19]:
def get_passage(w):
    vn = w if F.otype.v(w) == 'verse' else L.u('verse', w)
    return '{}\t{}\t{}'.format(
        F.book.v(L.u('book', w)),
        F.chapter.v(L.u('chapter', w)),
        F.verse.v(vn),
    )

## Splitting exception

The basic units are words as they are written together.
The maqef separates words.
As a result, every unit has at most one word in the class noun-verb-adjective.

However, there is one exception, in Isaiah 9:5: ʔᵃvîʕˌaḏ (אֲבִיעַ֖ד) is written together, but analysed as two words.
So we keep a set of monad numbers after which we insert a separating space.
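
The grouping rule, including the exception, can be sketched on toy data without the corpus. The names `group_units` and `split_after` are hypothetical, and the word tuples below are invented for illustration; each tuple holds an identifier, a phonetic form and the interword separator.

```python
def group_units(words, split_after=frozenset()):
    """Group (ident, phono, sep) tuples into units of words written together."""
    units, current, prev_sep = [], [], ''
    for ident, ph, sep in words:
        if ident in split_after:
            this_sep = ' '                    # forced split after this word
        else:
            # a trailing maqef counts as a separator
            this_sep = (' ' if ph.endswith('-') else '') + sep
        if prev_sep == '':
            current.append(ident)             # still the same unit
        else:
            if current:
                units.append(current)
            current = [ident]                 # a new unit starts here
        prev_sep = this_sep
    if current:
        units.append(current)
    return units

toy = [(1, 'bᵊ', ''), (2, 'rēšˈîṯ', ' '), (3, 'kol-', ''), (4, 'hā', ''), (5, 'ʔˈāreṣ', ' .')]
print(group_units(toy))                    # [[1, 2], [3], [4, 5]]
print(group_units(toy, split_after={4}))   # [[1, 2], [3], [4], [5]]
```

Whether a word joins the current unit depends on the separator after the previous word, which is why the separator is carried over from one iteration to the next.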

In [20]:
splitx = {'215237'}

In [68]:
msg('Generating phonetic data file ...')

phono_file = open("phono.csv", 'w', encoding='utf-8')  # the transcription is non-ASCII

headline = fmt_str.format(
    'book\tchapter\tverse', 
    'phonetic',
    'lexeme',
    'verb.stem',
    'verb.tense',
    'verb.person_verb.number_verb.gender',
    'noun.number_noun.gender',
    'noun.state',
    'adjv.number_adjv.gender',
    'h-ending',
    'nounsuff.person_nounsuff.number_nounsuff.gender',
    'verbsuff.person_verbsuff.number_verbsuff.gender',
)
phono_file.write(headline)

chunksize = 1000
nv = 0
nc = 0
spliterrors = 0

for v in F.otype.s('verse'):
    nv += 1
    nc += 1
    if nc == chunksize:
        nc = 0
        msg('{:<5} verses with {} split errors'.format(nv, spliterrors))
    passage_label = get_passage(v)
    lines = []
    words = L.d('word', v)
    cur_line = []
    cur_sep = ''
    skip = False
    for w in words:
        if F.language.v(w) == 'Aramaic':
            skip = True
            break
        the_monad = F.monads.v(w)
        if the_monad in splitx: the_sep = ' '
        else: the_sep = (' ' if F.phono.v(w).endswith('-') else '') + F.phono_sep.v(w)
        if cur_sep == '':
            cur_line.append(w)
        else:
            if cur_line:
                lines.append(cur_line)
            cur_line = [w]
        cur_sep = the_sep
    if skip: continue
    if cur_line: lines.append(cur_line)
    
    for line in lines:
        line_text = ''
        for w in line: line_text += phono(w)
        lex_ident = '~'.join(F.lex.v(w).replace('_', '#') for w in line)
        verb_stem = '%'.join('v.{}'.format(F.vs.v(w)) for w in line if F.sp.v(w) == 'verb')
        verb_tense = '%'.join('v.{}'.format(F.vt.v(w)) for w in line if F.sp.v(w) == 'verb')
        png_verb = '%'.join('v.{}_v.{}_v.{}'.format(
            png[F.ps.v(w)], png[F.nu.v(w)], png[F.gn.v(w)],
        ) for w in line if F.sp.v(w) == 'verb')
        png_noun = '%'.join('n.{}_n.{}'.format(
            png[F.nu.v(w)], png[F.gn.v(w)],
        ) for w in line if F.sp.v(w) == 'subs')
        png_adjv = '%'.join('a.{}_a.{}'.format(
            png[F.nu.v(w)], png[F.gn.v(w)],
        ) for w in line if F.sp.v(w) == 'adjv')
        nom_st = '%'.join('n.{}'.format(png[F.st.v(w)]) for w in line if F.sp.v(w) == 'subs')
        uvf_h = '%'.join('h' for w in line if F.uvf.v(w) == 'H')
        prs_noun = '%'.join(expand(prs_map.get(F.prs.v(w), ''), 'ns') for w in line if F.sp.v(w) == 'subs')
        prs_verb = '%'.join(expand(prs_map.get(F.prs.v(w), ''), 'vs') for w in line if F.sp.v(w) == 'verb')

        line = fmt_str.format(
            passage_label, 
            line_text,
            lex_ident,
            verb_stem,
            verb_tense,
            png_verb,
            png_noun,
            nom_st,
            png_adjv,
            uvf_h,
            prs_noun,
            prs_verb,
        )
        if '%' in line:
            spliterrors += 1
        phono_file.write(line)
phono_file.close()
msg('{:<5} verses with {} split errors. Done'.format(nv, spliterrors))

 1h 38m 26s Generating phonetic data file ...
 1h 38m 27s 1000  verses with 0 split errors
 1h 38m 27s 2000  verses with 0 split errors
 1h 38m 28s 3000  verses with 0 split errors
 1h 38m 29s 4000  verses with 0 split errors
 1h 38m 29s 5000  verses with 0 split errors
 1h 38m 30s 6000  verses with 0 split errors
 1h 38m 31s 7000  verses with 0 split errors
 1h 38m 32s 8000  verses with 0 split errors
 1h 38m 33s 9000  verses with 0 split errors
 1h 38m 33s 10000 verses with 0 split errors
 1h 38m 34s 11000 verses with 0 split errors
 1h 38m 34s 12000 verses with 0 split errors
 1h 38m 35s 13000 verses with 0 split errors
 1h 38m 36s 14000 verses with 0 split errors
 1h 38m 36s 15000 verses with 0 split errors
 1h 38m 37s 16000 verses with 0 split errors
 1h 38m 37s 17000 verses with 0 split errors
 1h 38m 37s 18000 verses with 0 split errors
 1h 38m 38s 19000 verses with 0 split errors
 1h 38m 38s 20000 verses with 0 split errors
 1h 38m 39s 21000 verses with 0 split errors
 1h 38m 3