<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Phonological transliteration

## A re-implementation of the transliteration rules by Nicolai Winther-Nielsen, Chris Wilson, and Claus Tøndering.

Users of the [Bible Online Learner](http://bibleol.3bmoodle.dk/text/show_text/ETCBC4/) may encounter a phonological transliteration of Biblical Hebrew. This turns out to be a well-documented transliteration scheme that is not
merely a character-by-character transliteration of the Hebrew consonants and vowels.
It is also a phonological representation in which ambiguities are resolved, e.g. whether a qamets is a long a or a short o, and whether a schwa is quiescens or mobile.

As Nicolai Winther-Nielsen has pointed out, this is the kind of transliteration you need when you want to subject Biblical Hebrew to the toolkit of modern linguists.

So, we want to make use of this transliteration.

However, we could not find a readily available means to do so online, which is why we reimplement in Python the rules stated in Nicolai's article (see below).

* [Transliteration of Biblical Hebrew for the Role-Lexical Module](http://www.see-j.net/index.php/hiphil/article/view/62)
* [Lex: A software project for linguists](http://www.see-j.net/index.php/hiphil/article/view/60/56)
* [Bible Online Learner, Software on Github](https://github.com/EzerIT/BibleOL)

In [1]:
import sys, collections, re
import unicodedata

# Rules

Here is a representation of the rules, manually copied from Nicolai's article.
A rule is represented by the following bits of information:

* character to be transliterated
* preceding string
* following string
* replacement string

A rule transliterates a character only if both the preceding and the following string match.

The order of the rules is important.

In [None]:
rules = '''

'''
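To make the four-part rule format concrete, here is a minimal sketch of how such rules could be stored and applied. The `Rule` tuple, the `transliterate` function, and the three toy rules on Latin letters are all hypothetical illustrations, not the rules from Nicolai's article. The sketch also assumes that contexts are plain strings matched against the original text, which the article may handle differently.

```python
from collections import namedtuple

# Hypothetical container for the four bits of information of a rule.
Rule = namedtuple('Rule', 'char before after repl')

# Toy rules on Latin letters, invented for illustration only.
# An empty context string matches anywhere.
TOY_RULES = [
    Rule('c', '', 'h', 'k'),  # c directly before h becomes k
    Rule('h', 'c', '', ''),   # h directly after c is dropped
    Rule('c', '', '', 's'),   # any other c becomes s
]

def transliterate(text, rules=TOY_RULES):
    """Apply, per character, the first rule whose contexts match."""
    out = []
    for i, ch in enumerate(text):
        for rule in rules:
            if (rule.char == ch
                    and text[:i].endswith(rule.before)
                    and text[i + 1:].startswith(rule.after)):
                out.append(rule.repl)
                break
        else:
            # No rule matched: keep the character unchanged.
            out.append(ch)
    return ''.join(out)

in_order = transliterate('chic')
reversed_order = transliterate('chic', TOY_RULES[::-1])
```

With the rules in the given order, the specific rule for `c` before `h` wins; with the order reversed, the general rule for `c` fires first and the result differs. That is why the order of the rules matters.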

In [33]:
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.5.3
API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
Feature doc: http://shebanq-doc.readthedocs.org/en/latest/texts/welcome.html



In [2]:
version = '4b'
fabric.load('etcbc{}'.format(version), '--', 'pproduction', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        g_word_utf8 g_cons_utf8 trailer_utf8
        g_word g_cons lex_utf8 lex
        sp vs vt gn nu ps st
        uvf prs
        language
        book chapter verse label
    ''',''),
    "primary": True,
    "prepare": prepare,
})
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49
  6.15s LOGFILE=/Users/dirk/SURFdrive/laf-fabric-output/etcbc4b/pproduction/__log__pproduction.txt
    16s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html
  0.00s LOADING API with EXTRAs: please wait ... 
  0.00s INFO: USING DATA COMPILED AT: 2015-06-29T05-30-49
  0.68s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK pproduction AT 2015-08-27T12-28-30
  0.00s INFO: DATA LOADED FROM SOURCE etcbc4b AND ANNOX -- FOR TASK pproduction AT 2015-08-27T12-28-30


## Trailer

Before we generate the text, let's list all the different trailers (the material that follows a word, such as a space or punctuation) and their number of occurrences.

In [3]:
trailer = collections.defaultdict(int)

for node in F.otype.s('word'):
    trailer[F.trailer_utf8.v(node)] += 1
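The cell above only collects the counts. One way to display them readably is sketched below: trailers often consist of whitespace and Hebrew punctuation, so the sketch prints Unicode character names instead of the raw characters. The helpers `describe` and `list_counts` are hypothetical, and the counts in `toy_trailer` are invented so that the block runs without LAF-Fabric; in the notebook you would pass the `trailer` dictionary instead.

```python
import unicodedata

def describe(value):
    """Spell out a trailer as Unicode character names."""
    names = (unicodedata.name(c, 'U+{:04X}'.format(ord(c))) for c in value)
    return ' + '.join(names) or '(empty)'

def list_counts(counts):
    """Return one line per distinct trailer, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ['{:>7} x {}'.format(n, describe(value)) for value, n in ordered]

# Invented counts, for illustration only.
toy_trailer = {' ': 5, '\u05be': 2, '\u05c3 ': 1}

lines = list_counts(toy_trailer)
for line in lines:
    print(line)
```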

In [61]:
trans = Transcription()

# Accent

To interpret the qamets correctly as long a or short o, we need to identify the closed, unaccented syllables.

Here are the rules:

A closed syllable is recognizable by a vowel that is

* followed by two distinct consonants (of which the first has a silent schwa) or
* followed by a consonant with a dagesh forte or
* followed by a consonant without a vowel and then the end of the word or a -

Such a syllable is unaccented if

* it does not have a maqaf accent on its first consonant and
* it is followed by another syllable or a -
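The rules above can be sketched on a simplified, hypothetical model of a word: a list of consonants, each with an optional vowel, a dagesh flag, and an accent flag. This is not the notebook's actual implementation. It assumes that every dagesh in the model is a dagesh forte, that a schwa (written `:`) after a vowel is silent, that "another syllable" means a later consonant carrying a full vowel, and that a following `-` is passed in as a flag. All names and the ASCII letter codes are made up for illustration.

```python
from collections import namedtuple

# Toy model of one consonant with its pointing (hypothetical).
# vowel: '' (none), ':' (schwa), or any other symbol for a full vowel.
Letter = namedtuple('Letter', 'cons vowel dagesh accent')

def letter(cons, vowel='', dagesh=False, accent=False):
    return Letter(cons, vowel, dagesh, accent)

def is_closed(word, i):
    """Is the vowel on word[i] in a closed syllable, per the three rules?"""
    nxt = word[i + 1] if i + 1 < len(word) else None
    nxt2 = word[i + 2] if i + 2 < len(word) else None
    if nxt is None:
        return False
    # Two distinct consonants follow, the first with a (silent) schwa.
    if nxt2 is not None and nxt.vowel == ':' and nxt.cons != nxt2.cons:
        return True
    # A consonant with dagesh follows (assumed to be dagesh forte).
    if nxt.dagesh:
        return True
    # A vowelless consonant follows, then the end of the word (or a -).
    return nxt.vowel == '' and nxt2 is None

def is_unaccented(word, i, before_maqaf=False):
    """No accent on the first consonant, and a syllable or a - follows."""
    if word[i].accent:
        return False
    later_vowel = any(l.vowel not in ('', ':') for l in word[i + 1:])
    return later_vowel or before_maqaf

def qamets_reading(word, i, before_maqaf=False):
    """Read the qamets on word[i] as short o or long a."""
    if is_closed(word, i) and is_unaccented(word, i, before_maqaf):
        return 'short o'
    return 'long a'

# Toy encodings: 'A' stands for qamets, 'X' for chet.
xokma = [letter('X', 'A'), letter('K', ':'),
         letter('M', 'A', accent=True), letter('H')]
dabar = [letter('D', 'A'), letter('B', 'A', accent=True), letter('R')]
kol = [letter('K', 'A'), letter('L')]

readings = [
    qamets_reading(xokma, 0),
    qamets_reading(dabar, 0),
    qamets_reading(kol, 0, before_maqaf=True),
]
```

In this toy model the first qamets of `xokma` is closed by two distinct consonants and unaccented, so it is read as short o. Both qamets signs in `dabar` come out as long a, and `kol` followed by a `-` gets a short o.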
