<a href="http://laf-fabric.readthedocs.org/en/latest/" target="_blank"><img align="left" src="images/laf-fabric-xsmall.png"/></a>
<a href="http://www.godgeleerdheid.vu.nl/etcbc" target="_blank"><img align="left" src="images/VU-ETCBC-xsmall.png"/></a>
<a href="http://www.persistent-identifier.nl/?identifier=urn%3Anbn%3Anl%3Aui%3A13-048i-71" target="_blank"><img align="left" src="images/etcbc4easy-small.png"/></a>
<a href="http://tla.mpi.nl" target="_blank"><img align="right" src="images/TLA-xsmall.png"/></a>
<a href="http://www.dans.knaw.nl" target="_blank"><img align="right" src="images/DANS-xsmall.png"/></a>

# Phonetic Transliteration of Hebrew Masoretic Text

# Frequently asked questions

Q: *What is the use of a phonetic transliteration of the Hebrew Bible? What can anyone wish beyond the careful, meticulous Masoretic system of consonants, vowels and accents?*

A: Several things:

* the Hebrew Bible may be a subject of study in various fields,
  where the people involved have not mastered the Hebrew script;
  a phonetic transcription removes a hurdle for them.
* in computational linguistics there are many tools that deal with written language in Latin alphabets;
  even a task as simple as getting the consonant-vowel pattern of a word is unnecessarily complicated
  when using the Hebrew script.
* in phonetics and language learning theory, it is important to represent the sounds without being burdened
  by the idiosyncrasies of the writing system and the spelling.
  
Q: *But surely, there already exist transliterations of Hebrew? Why not use them?*

A: Here are a few pragmatic reasons:

* we want to be able to *compute* a transliteration based upon our own data;
* we want to gain insight into the extent to which the transliteration can be purely rule-based, and the extent
  to which it depends on lexical information that you just need to know;
* we want to make available a well-documented transliteration that can be studied, borrowed and improved by others.

Q: *But how **good** is your transliteration?*

A: We do not know yet. A few remarks, though:

* we have applied most of the *rules* that we could find in Hebrew grammars;
* we have suspended some of the rules for some verb paradigms where they are known to lead to incorrect results;
* where the rules did not suffice, we have searched the corpus for other occurrences of the same word, to get clues;
* where we knew that clues pointed in the wrong direction, we have applied a list of exceptions (currently a list of only the word בָּתִּֽים (\*bottˈîm => bāttˈîm));
* we have a fair test set with critical cases that all pass;
* we have a few tables of all cases where the algorithm has made corpus-based decisions and lexical decisions;
* we are open to your corrections: log in to [SHEBANQ](https://shebanq.ancient-data.org), go to a passage with an offending phonetic transliteration, and make a manual note. **Tip:** Give that note the keyword ``phono``, then we
  will collect them.

Q: *To me, this is not entirely satisfying.*

A: Fair enough. Consider jumping to [Bible Online Learner](http://bibleol.3bmoodle.dk/text/show_text),
where they have built in a pretty good transliteration, based on a different method of rule application. It is documented in an article by Nicolai Winther-Nielsen:
[Transliteration of Biblical Hebrew for the Role-Lexical Module](http://www.see-j.net/index.php/hiphil/article/view/62) 
and additional information can be found in Claus Tøndering's
[Bible Online Learner, Software on Github](https://github.com/EzerIT/BibleOL).
See also [Lex: A software project for linguists](http://www.see-j.net/index.php/hiphil/article/view/60/56).

We are planning to conduct an automatic comparison of both transliteration schemes over the whole corpus.

Q: *Who is the **we**?*

A: That is the author of this notebook, [Dirk Roorda](mailto:dirk.roorda@dans.knaw.nl), working together with Martijn Naaijer and getting input from Nicolai Winther-Nielsen and Wido van Peursen.

# Overview of the results

1. The main result is a Python function ``phono(``*etcbc_original*``, ...): ``*phonetic transliteration*.
1. A [test set](tests4b.html)
   consisting of a number of fairly critical cases and [how they work out](tests_debug4b.txt);
1. The [set of cases](special_verb_cases4b.html)
   where the verb paradigm overrules the rules for qamets gadol/qatan;
1. The [set of cases](special_nonverb_cases4b.html)
   where the outcome after applying the rules for qamets gadol/qatan has been corrected by the vote of other occurrences.
1. A [plain text](combi4b.txt) with the complete text in ETCBC transliteration and phonetic transcription,
   verse by verse.

# Overview of the method

## High-level description

1. **ETCBC transliteration**
   Our starting point is the ETCBC full transliteration of the Hebrew Masoretic text.
   This transliteration is in 1-1 correspondence with the Masoretic text, including all vowels and accents.
1. **Grammar rules** 
   We have implemented the rules we find in grammars of Hebrew about long and short qamets, mobile and silent schwa,
   dagesh, and mater lectionis. 
   The implementation takes the form of a row of *regular expressions*,
   where we transliterate targeted pieces of the original.
   These regular expressions are carefully formulated, and must be applied in the given order.
   *Beware:* Seemingly innocent modifications in these expressions, or in the order of application,
   may ruin the transcription completely.
1. **Qamets puzzles: verbs**
   In many verb forms the grammar rules would dictate that a certain qamets is qatan while in fact it is gadol.
   In most cases this is caused by the fact that no accent has been marked on the syllable that carries the
   qamets in question. There is a limited set of verb paradigms where this occurs.
   We detect those and suppress qamets qatan interpretation for them.
1. **Qamets puzzles: non-verbs**
   There are quite a few non-verb occurrences where the accent pattern of a word invites a qamets to become
   qatan, that is, by the grammar rules. 
   Yet, other occurrences of the same lexeme have other accent patterns, and
   lead to a gadol interpretation of the same qamets. 
   In this case we count the unique cases in favour of gadol versus qatan, and let the majority decide for all 
   occurrences. In cases where we know that the majority votes wrong, we have intervened.
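
The voting step for the non-verb qamets puzzles can be sketched roughly as follows. This is only an illustration under stated assumptions: the name `vote_qamets`, the `(word_form, reading)` input shape, the tie-break in favour of gadol, and the modelling of the exception list as a plain set of lexemes are all hypothetical and not taken from the actual implementation.

```python
from collections import Counter

def vote_qamets(lexeme, observations, majority_wrong=frozenset()):
    """Decide 'gadol' or 'qatan' for all occurrences of a lexeme.

    observations: iterable of (word_form, reading) pairs, where reading is
    the outcome of the grammar rules for one occurrence.
    majority_wrong: hypothetical set of lexemes where the majority is known to err.
    """
    # count unique cases only: the same form with the same reading counts once
    tally = Counter(reading for (_form, reading) in set(observations))
    # assumption: a tie is resolved in favour of gadol
    winner = 'gadol' if tally['gadol'] >= tally['qatan'] else 'qatan'
    if lexeme in majority_wrong:
        # manual intervention: overrule the majority
        winner = 'qatan' if winner == 'gadol' else 'gadol'
    return winner

# made-up observations for a made-up lexeme
obs = [('form1', 'gadol'), ('form1', 'gadol'), ('form2', 'qatan'), ('form3', 'gadol')]
print(vote_qamets('LEX', obs))
```

The decision is taken once per lexeme and then applied to every occurrence, which is what makes the working hypothesis below necessary.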
   
### Qamets work hypothesis
Note that in the *non-verb qamets puzzles* we have tacitly made the assumption that qamets qatan and gadol are not phonological variants of each other.
In other words, it never occurs that a qamets gadol is shortened into a qamets qatan.
From the grammar rules it follows that short versions of the qamets can only be

* patah
* schwa
* composite schwa with patah

and never

* qamets qatan
* composite schwa with qamets

Whether this hypothesis is right is beyond my competence. 
We just use it as a working hypothesis.

## Lexical information

This method is not a pure method, in the sense of working only with the information given in the source string.
We *cheat*, i.e. we use morphological information from the ETCBC database to 
steer us in the right direction. To this end, the input of the ``phono()`` function can be given in several ways:

* as ETCBC transliteration, possibly with a verse reference, possibly augmented with lexical information
* as LAF node, possibly augmented with lexical information

If an ETCBC string is given as input together with a verse reference,
the LAF node will be retrieved if possible.
The lexical info will be retrieved from the node if it has not been given directly.

If a LAF node is given as input, we can compute the verse reference and the ETCBC transliteration, and, if
needed, additional lexical information.

## Combined words

You can use ``phono()`` to transliterate multiple words at the same time, but you can also do individual words,
even if in Hebrew they are written together.
However, it is better to feed combined words to ``phono()`` in one go, because the prefix word may influence the transliteration of the postfix word. Think of the article followed by a word starting with a BGDKPT letter.
The dagesh in the BGDKPT letter is interpreted as a lene if the word stands on its own, but as a forte if it is combined.

However, it is not advised to feed longer strings to ``phono()``, because when ``phono()`` retrieves lexical information, it uses the information of the last node that matches a word in the input string.

## Accents

We determine "primary" and "secondary" stress in our transliteration, but this must not be taken in a phonetic sense.
Every syllable that carries an accent pointing will get a primary stress mark.
However, a few specific accent pointings are not deemed to produce an accent, and another group of accents
is deemed to produce only a secondary accent.
The last syllable of a word also gets a secondary accent by default.
We have not yet tried to be more precise in this, so *segolates* do not get the treatment they deserve.

The main rationale for accents is that they prevent a qamets from being read as qatan.

## Individual symbols

We have made a careful selection of Unicode symbols to represent Hebrew sounds.
Sometimes we follow the phonetic usage of the symbols, sometimes we follow widespread custom.
The actual mapping can be plugged in quite easily, 
and the intermediate stages in the transformation do not use these final symbols,
so the algorithm can be easily adapted to other choices.

### Consonants

Provided it is not part of a long vowel, we write yod as ``y``,
whilst ``j`` would be more in line with the phonetic alphabet.

Likewise, we write ו as ``w``, if it is not part of a long vowel.

With regard to the ``BGDKPT`` letters, it would have been attractive to use the letters ``b g d k p t`` without 
diacritic for the plosive variants, and with a suitable diacritic for the fricative variants.
Alas, the Unicode table does not offer a suitable diacritic that is available for all 6 of these letters.

So, we use ``b g d k p t`` for the plosives, but for the fricatives we use ``v ḡ ḏ ḵ f ṯ``.

With regard to the *emphatic* consonants ט and ח and צ, we represent them with an underdot: ``ṭ ḥ ṣ``.
ק is just ``q``.

א and ע translate to ``ʔ`` and ``ʕ``.

שׁ and שׂ translate to ``š`` and ``ś``.
ס is just ``s``.

When א and ה are mater lectionis, they are left out. A ה with mappiq becomes just ``h``,
like every ה which is not a mater lectionis.

We do not mark the deviant final forms of the consonants ך and ם and ן and ף and ץ, assuming that
this is just a scriptural peculiarity, with no effect on the actual sounds.

The remaining consonants go as follows:

<table>
<tr><td>ל</td><td>``l``</td></tr>
<tr><td>מ</td><td>``m``</td></tr>
<tr><td>נ</td><td>``n``</td></tr>
<tr><td>ר</td><td>``r``</td></tr>
<tr><td>ז</td><td>``z``</td></tr>
</table>

### Vowels

The short vowels (patah, segol, hireq) are just ``a e i`` and qibbuts is just ``u``.

However, the *furtive* patah is a ``ₐ`` in front of its consonant.

The long vowels without yod or waw (qamets gadol, tsere, holam) have an over bar ``ā ē ō``.

The complex vowels (tsere or hireq plus yod, holam plus waw, waw with dagesh) have a circumflex ``ê î ô û``.

A segol followed by yod becomes ``eʸ``.

The composite schwas (patah, segol, qamets) are written as superscripts ``ᵃ ᵉ ᵒ``.

The simple schwa is left out if silent, and otherwise it becomes ``ᵊ``.

### Accent

The primary and secondary stress are marked as ``ˈ ˌ`` and are placed *in front of the vowel they occur with*.

### Punctuation

The sof-pasuq ׃ becomes ``.``. If it is followed by ס or ף or  ̇׆ (nun-hafuka), these extra symbols are omitted.

The maqef ־ (between words) becomes ``-``.

If words are juxtaposed without space in the Hebrew, they are also juxtaposed without space in the phonetic
transliteration.

### Tetragrammaton

The tetragrammaton is transliterated with the vowels it is encountered with, but the whole is put between 
square brackets ``[ ]``.

### Ketiv-qere

The ketiv-qere symbol is represented as ``*``, and any word containing such a symbol is placed between
braces ``{ }``.

If you are massively displeased with these choices, we will consider revising them.


## Cleaning up

We leave the accents and the schwas in the end product of the ``phono()`` function,
despite the fact that the accents, as they appear, do not have consistent phonetic significance.
And it can be argued that every schwa is silent.
If you do not care for schwas and accents, it is easy to remove them.
Also, if you find the results of separating the qamets into qatan and gadol unsatisfying or irrelevant, you can
just replace them both by a single symbol, such as ``å``.
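
Such a clean-up could look like the sketch below. It assumes the final symbols described above (``ˈ ˌ`` for stress, ``ᵊ`` for the mobile schwa, ``ā`` and ``o`` for qamets gadol and qatan). The helper name `simplify` and its parameters are hypothetical and not part of ``phono()``; the sample strings are made up for illustration.

```python
from unicodedata import normalize

ACCENTS = 'ˈˌ'                                  # primary and secondary stress marks
SCHWA = 'ᵊ'                                     # mobile schwa
QAMETS = {'\u0101': '\u00e5', 'o': '\u00e5'}    # qamets gadol (ā) and qatan (o) both become å

def simplify(phon, drop_accents=True, drop_schwa=True, merge_qamets=False):
    """Post-process a phonetic string (hypothetical helper)."""
    table = {}
    if drop_accents:
        table.update({ord(c): None for c in ACCENTS})
    if drop_schwa:
        table[ord(SCHWA)] = None
    if merge_qamets:
        table.update({ord(k): v for (k, v) in QAMETS.items()})
    # NFC makes sure that letters with diacritics are single code points
    return normalize('NFC', phon).translate(table)

print(simplify('bᵊrēšˈîṯ'))
print(simplify('dāvˈār', merge_qamets=True))
```

Because the composite schwas and the holam variants use their own symbols (``ᵒ``, ``ō``, ``ô``), replacing a bare ``o`` only touches the qamets qatan.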

## Testing

Quite a bit of code is dedicated to counting special cases, to testing, and to producing neat tables with interesting forms.
It is also possible to call the ``phono()`` function in debug mode, which will write to a text file all stages in the
transliteration from the ETCBC original into the phonetic result.

# Load the modules

In [None]:
import sys, os, collections, re
from unicodedata import normalize

from laf.fabric import LafFabric
from etcbc.preprocess import prepare
from etcbc.lib import Transcription
fabric = LafFabric()

# Load the LAF data

In [None]:
version = '4'
fabric.load('etcbc{}'.format(version), '--', 'phono', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype
        g_word_utf8 g_cons_utf8 trailer_utf8
        g_word g_cons lex_utf8 lex
        sp vs vt gn nu ps st
        uvf prs pfm vbs vbe
        language
        book chapter verse label
    ''',''),
    "primary": True,
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))
trans = Transcription()

# The source string

Here is what we use as our starting point: the ETCBC transliteration, with one or two tweaks.

The ETCBC transcription also encodes what comes after each word until the next word.
Sometimes we want that extra bit, and sometimes not, and sometimes part of it.

In [None]:
# possible material at the end of a verse
end_verse = re.compile('''
    00                            # two zeroes
    (?:_[SPN])?                   # possibly followed by an underscore plus any of S, P, N
    $                             # at the end of the string
''', re.X)

def get_orig(w, sep=0):                # sep=0:  all interword separation material but no end-of-lines
                                       # sep=1:  all separation including end-of-lines 
                                       # sep=-1: no separation material at all
    orig = F.g_word.v(w)
    if orig.endswith('&'):             # maqef
        sp = '' if sep == -1 else '&'
        return orig[0:-1]+sp           
    if orig.endswith('-'):             # attached to next word
        sp = '' if sep == -1 else '-'
        return orig[0:-1]+sp
    if end_verse.search(orig):         # end of verse symbol
        sp = '' if sep == -1 else ' ' if sep == 0 else ' $\n'
        return end_verse.sub('', orig)+sp          
    sp = '' if sep == -1 else ' '      # normal case, with space behind 
    return orig+sp           
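
The branching in ``get_orig`` can be tried out without a loaded LAF resource by applying the same logic to plain strings. The sketch below does that; `trim_trailer` is a hypothetical stand-in that takes the transliterated word directly instead of a node, and the sample strings are made up rather than taken from the corpus.

```python
import re

# same end-of-verse pattern as above, written compactly
end_verse = re.compile(r'00(?:_[SPN])?$')

def trim_trailer(orig, sep=0):
    """Same branching as get_orig, but on a plain string instead of a node."""
    if orig.endswith('&'):                  # maqef
        return orig[:-1] + ('' if sep == -1 else '&')
    if orig.endswith('-'):                  # attached to next word
        return orig[:-1] + ('' if sep == -1 else '-')
    if end_verse.search(orig):              # end of verse symbol
        sp = '' if sep == -1 else ' ' if sep == 0 else ' $\n'
        return end_verse.sub('', orig) + sp
    return orig + ('' if sep == -1 else ' ')  # normal case

print(repr(trim_trailer('KOL&', sep=-1)))
print(repr(trim_trailer('DBR00_P', sep=1)))
```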

# The phonological symbols

Here is the list of symbols that constitutes the mapping from ETCBC transcription codes to a phonetic transcription.
It is a series of triplets (*etcbc symbol*, *name*, *phonetic symbol*).

If changes are needed to the appearance of the phonetic transcriptions (not to its *logic*), here is the place to tweak.

Note that the order is important.
In the final stage of the transformation process, these substitutions will be applied in the order they appear here.

This is especially important for, but not only for, the BGDKPT letters.

In [None]:
specials = (
    ('>', 'alef', 'ʔ'),
    ('<', 'ayin', 'ʕ'),
    ('v', 'tet', 'ṭ'),
    ('y', 'tsade', 'ṣ'),
    ('x', 'chet', 'ḥ'),
    ('c', 'shin', 'š'),
    ('f', 'sin', 'ś'),
    ('#', 's(h)in', 'ŝ'),

    ('ij', 'long hireq', 'î'),
    ('I', 'short hireq', 'i'),
    (';j', 'long tsere', 'ê'),
    ('ow', 'long holam', 'ô'),
    ('w.', 'long `qibbuts`', 'û'),
    ('ej', 'e glide', 'eʸ'),
    ('j', 'yod', 'y'),

    (':a', 'hataf patach', 'ᵃ'),
    (':@', 'hataf qamats', 'ᵒ'),
    (':e', 'hataf segol', 'ᵉ'),
    ('%', 'schwa mobile', 'ᵊ'),
    (':', 'schwa quiescens', ''),
    ('@', 'qamats gadol', 'ā'),
    ('+', 'qamats', 'å'),
    ('e', 'segol', 'e'),
    (';', 'tsere', 'ē',),
    ('o', 'holam', 'ō'),
    ('^', 'qamats qatan', 'o'),

    ('b.', 'b plosive', 'B'),
    ('g.', 'g plosive', 'G'),
    ('d.', 'd plosive', 'D'),
    ('k.', 'k plosive', 'K'),
    ('p.', 'p plosive', 'P'),
    ('t.', 't plosive', 'T'),

    ('b', 'b fricative', 'v'),
    ('g', 'g fricative', 'ḡ'),
    ('d', 'd fricative', 'ḏ'),
    ('k', 'k fricative', 'ḵ'),
    ('p', 'p fricative', 'f'),
    ('t', 't fricative', 'ṯ'),

    ('B', 'b plosive', 'b'),
    ('G', 'g plosive', 'g'),
    ('D', 'd plosive', 'd'),
    ('K', 'k plosive', 'k'),
    ('P', 'p plosive', 'p'),
    ('T', 't plosive', 't'),
    
    ('w', 'waw', 'w'),
    ('l', 'lamed', 'l'),
    ('m', 'mem', 'm'),
    ('n', 'nun', 'n'),
    ('r', 'resh', 'r'),
    ('z', 'zajin', 'z'),
    
    ('!', 'primary accent', 'ˈ'),
    ('/', 'secundary accent', 'ˌ'),
    
    ('&', 'maqef', '-'),
)
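
Why the order matters for the BGDKPT letters can be seen in a toy version of the table. The sketch below is not the actual final stage of ``phono()``; `toy_specials` and `apply_in_order` are hypothetical names, and plain string replacement is assumed as the substitution mechanism.

```python
toy_specials = (
    ('b.', 'B'),   # plosive: park it under a temporary uppercase symbol
    ('b', 'v'),    # every b that is left is a fricative
    ('B', 'b'),    # release the parked plosive
)

def apply_in_order(text, table):
    """Apply the substitutions one after another, in table order."""
    for (sym, glyph) in table:
        text = text.replace(sym, glyph)
    return text

print(apply_in_order('b.ab', toy_specials))
print(apply_in_order('b.ab', tuple(reversed(toy_specials))))
```

In the given order the plosive and the fricative end up as ``b`` and ``v``; in reverse order the plosive is mistaken for a fricative and the dagesh dot is left behind.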

## Assembling the symbols in dictionaries

We compile the table of symbols in handy dictionaries for ease of processing later.

We need to quickly detect the dagesh lenes later on, so we store them in a dictionary.

Our treatment of accents is still primitive. 

We ignore some accents (``irrelevant_accents`` below) and we consider some accents as indicators of a mere
*secondary* accent (``secundary_accents`` below).

The ``sound_dict`` is the resulting (ordered) mapping of all source characters to "phonetic" characters.

In [None]:
dagesh_lenes = {'b.', 'g.', 'd.', 'k.', 'p.', 't.'}
dagesh_lene_dict = dict()

irrelevant_accents = (
    ('01', 'segol'),  # occurs always with another accent
    ('03', 'pashta'), # by definition on last syllable: not relevant for accent
)
secundary_accents = (
    ('71', 'merkha'), # ??
    ('63', 'qadma'),  # ??
    ('73', 'tipeha'), # ??
)
sound_dict = collections.OrderedDict() 

for (sym, let, glyph) in specials:
#    print('{:<3} {:<10} {:>3}'.format(sym, let, glyph))
    if sym in dagesh_lenes:
        dagesh_lene_dict[sym[0]] = glyph
    else:
        sound_dict[sym] = glyph

# Patterns

The ``phono()`` function that we will define (far) below, performs an ordered sequence of transformations.
Most of these are defined as [regular expressions](http://www.regular-expressions.info),
and some parts of those expressions occur over and over again, e.g. subpatterns for *vowel* and *consonant*.

Here we define the shortcuts that we are going to use in the regular expressions.

## Details of the matching process

Normally, when a pattern matches a string, the string is consumed: the parts of the pattern that match
consume corresponding stretches of the string.
However, in many cases a pattern specifies particular contexts in which a match should be found.
In those cases we do not want the context parts of the pattern to be responsible for string
consumption, because those parts could contain another relevant match.

Regular expressions offer a solution for that: look-ahead and look-behind assertions, and we use them frequently.

``(?<=`` *before_pattern* ``)`` *pattern* ``(?=`` *behind_pattern* ``)``

A match of this pattern in a string is a portion of the string that matches *pattern*, provided that
portion is preceded by *before_pattern* and followed by *behind_pattern*.

If there is a match, and new matches must be searched for, the search will start right after *pattern*.

Instead of the above *positive* look-ahead and look-behind assertions, there are also *negative* variants:

``(?<!`` *before_pattern* ``)`` *pattern* ``(?!`` *behind_pattern* ``)``

In those cases the match is good if the *before_pattern* does not match the preceding material, and analogously
for the *behind_pattern*.

In Python there is a restriction on look-behind patterns:
they must be patterns that only have matches of a predictable, fixed length.
That will make some of our patterns slightly more complicated.
For example, vowels can be simple or complex, and hence have variable length.
If we want to specify a consonant, provided it is preceded by a vowel, we have to be careful.
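
A small experiment with Python's ``re`` module illustrates both points: context that sits in assertions is not consumed, and look-behind patterns must have a fixed width. The strings and patterns are toy examples, unrelated to the ETCBC transliteration.

```python
import re

text = 'ababa'

# consuming pattern: the second 'aba' overlaps the first one and is missed
consuming = re.findall(r'aba', text)

# context in assertions: the a's are inspected but not consumed,
# so both b's are found
contextual = re.findall(r'(?<=a)b(?=a)', text)

print(consuming)
print(contextual)

# a look-behind of variable width is rejected when the pattern is compiled
try:
    re.compile(r'(?<=a+)b')
    fixed_width_required = False
except re.error:
    fixed_width_required = True

print(fixed_width_required)
```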

In regular expressions there are *greedy*, *non-greedy* and *possessive* quantifiers. 
Greedy ones try to match as many times as possible at first;
non-greedy ones try to match as few times as possible at first.
Possessive quantifiers are like greedy ones, but greedy ones will give back occurrences if that helps 
to achieve a match. Possessive ones do not do that.

<table>
<tr><th>kind</th><th>greedy</th><th>non-greedy</th><th>possessive</th></tr>
<tr><th>0 or more</th><td>``*``</td><td>``*?``</td><td>``*+``</td></tr>
<tr><th>1 or more</th><td>``+``</td><td>``+?``</td><td>``++``</td></tr>
<tr><th>at least *n*, at most *m*</th>
    <td>``{``*n*``,`` *m*``}``</td>
    <td>``{``*n*``,`` *m*``}?``</td>
    <td>``{``*n*``,`` *m*``}+``</td>
 </tr>
</table>

For example, the pattern ``[ab]*b`` matches substrings of ``a``s and ``b``s that end in a ``b``.
In order to match the string ``aaaaab``, the ``[ab]*`` part starts with greedily consuming the whole string,
but after discovering that the ``b`` part in the pattern should also match something, the ``[ab]*`` part
reluctantly gives back one occurrence. That will do the trick.

However, ``[ab]*+b`` will not match ``aaaaab``, because the possessive quantifier gives nothing back.

Possessive quantifiers are desirable in combination with negative look-ahead assertions.

For example, take ``[ab]*+(?!c)``. This will match substrings of ``a``s and ``b``s that are not followed by ``c``.
So, when matching from the start of the string, it matches ``ababab`` but not ``abababc``.
However, the non-possessive variant, ``[ab]*(?!c)``, matches both. So how does it match ``abababc``?
First, the ``[ab]*`` part matches all ``a``s and ``b``s. Then the look-ahead assertion that ``c`` does not follow
is violated. So ``[ab]*`` backtracks one occurrence, a ``b``. At that point the look-ahead assertion finds a ``b``,
which is not ``c``, and the match succeeds.

Python's ``re`` module lacked *possessive* quantifiers before Python 3.11, so again, this makes some expressions below more complicated than they would otherwise be.
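
The backtracking behaviour can be checked directly. The sketch below also shows a common way to emulate a possessive quantifier without special syntax: a look-ahead captures the longest run and a back-reference consumes it. This emulation is a general regular expression idiom and is not claimed to be the technique used in the patterns of this notebook.

```python
import re

# greedy: gives back one 'b' so that the look-ahead can succeed
greedy = re.compile(r'[ab]*(?!c)')

# emulated possessive: the look-ahead captures the longest run of a's and b's,
# the back-reference consumes exactly that run, and nothing is given back
atomic = re.compile(r'(?=([ab]*))\1(?!c)')

m1 = greedy.match('abababc')
m2 = atomic.match('abababc')
m3 = atomic.match('ababab')

print(m1.group())   # the match stops one 'b' short of the 'c'
print(m2)           # no match at the start of the string
print(m3.group())
```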

In [None]:
# We want to test for vowels in look-behind conditions.
# Python insists that look-behind conditions match patterns with fixed length.
# Vowels have variable length, so we need to take a bit more context.
# This extra context is dependent on whether the vowel occurs in front of a consonant or after it
# vowel1 is for before, vowel2 is for after, both are usable in look-behind conditions
# vowel matches purely vowels of variable length, and is not usable in look-behind conditions

vowel1 = r'(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[%@\^;aeiIou]))'
vowel2 = r'(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou].))'
vowel = r'(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou]))'

# lvowel are long vowels only (including compositions)
# svowel are short vowels only, including composite schwas
lvowel1 = r'(?:(?:w\.)|(?:[i;]j)|(?:ow)|(?:.[@;o]))'
svowel = r'(?:(?::[ea@])|(?:[%@\^;aeiIou]))'

gadol = sound_dict['@']
qatan = sound_dict['^']
a_like = {':a', 'a'}
o_like = {':@', 'o', 'ow', 'u', 'w.'}
e_like = {':e', ';', ';j', 'e', 'i', 'ij'}

# complex i/w vowel: the composite vowels with waw and yod, after translation
complex_i_vowel = ''.join(sound_dict[s] for s in {'ij', ';j'})
complex_w_vowel = ''.join(sound_dict[s] for s in {'ow'})

# consonants
ncons = '[^>bgdhwzxvjklmns<pyqrfct _&$-]' # not a consonant
cons = '[>bgdhwzxvjklmns<pyqrfct]'        # any consonant
consx = '[bgdwzxvjklmns<pyqrfct]'         # any consonant except alef and he
bgdkpt = '[bgdkpt]'                       # begadkefat consonant
nbgdkpt = '[wzxvjlmns<yqrfc]'             # non-begadkefat consonant
prep = '[bkl]'                            # proclitic preposition

# accents

acc = '[ˈˌ]'                              # primary and secondary accent

# Regular expressions

Here are the patterns, but also the replacement functions we are going to carry out when the patterns match.
How exactly the patterns and replacement functions hang together, is a matter for the phono function itself.

## Fringe

Ketiv-qere, tetragrammaton, accents

In [None]:
# We put the tetragrammaton and the qere/ketiv words inside brackets.
# Moreover, we take them as whole words, disconnected from their environment by extra white space.

# tetragrammaton
tetra = re.compile(r'([0-9]*)(j{n}*h{n}*w{n}*h{n}*)(?=[ _&$-]|\Z)'.format(n=ncons))

def tetra_repl(match):
    return '{} [ {} ] '.format(match.group(1), match.group(2))

# ketiv qere
masora = re.compile(r'\*([^ _&$-]*)')

def masora_repl(match):
    return ' { '+match.group(1).replace('*', '')+' } '

# remove the extra white space in the end
tetra_masora_cleanup = re.compile(r' ([\[\]{}]) ')

def tetra_masora_cleanup_repl(match):
    return match.group(1)

# explicit accents

# let's assume that any cantillation mark or accent indicates that the vowel is stressed
# except for some types of mark (qadma, pashta)
sep_accent = re.compile('([0-9]{2})')
remove_accent = re.compile('|'.join('~{}'.format(x[0]) for x in irrelevant_accents))
secundary_accent = re.compile('|'.join('~{}'.format(x[0]) for x in secundary_accents))
primary_accent = re.compile('[~0-9]+')
condense_accents = re.compile('({v})([!/]+)'.format(v=vowel))

def sep_accent_repl(match):
    return '~'+match.group(1)

def condense_accents_repl(match):
    accent = '!' if '!' in match.group(2) else '/'
    return accent+match.group(1)

# implicit accents
default_accent1 = re.compile(r'({v}{c}?\.?(?:\Z|[ ]))'.format(v=svowel, c=cons))
default_accent2 = re.compile(r'({v}(?:\Z|[ ]))'.format(v=lvowel1))
strip_accents = re.compile('[0-9*,]')

def default_accent_repl(match):
    return '/'+match.group(1)
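
To see how the explicit accent patterns cooperate, here is a self-contained sketch that chains them on a few strings. It repeats the accent codes and the vowel pattern from above; the function name `mark_accents` and the input strings are made up for illustration, and the input is assumed to be lowercased already.

```python
import re

irrelevant = ('01', '03')        # accent codes that are ignored
secondary = ('71', '63', '73')   # accent codes that give a secondary accent
vowel = r'(?:(?::[ea@])|(?:w\.)|(?:[i;]j)|(?:ow)|(?:[%@\^;aeiIou]))'

sep_accent = re.compile(r'([0-9]{2})')
remove_accent = re.compile('|'.join('~' + code for code in irrelevant))
secondary_accent = re.compile('|'.join('~' + code for code in secondary))
primary_accent = re.compile(r'[~0-9]+')
condense_accents = re.compile(r'({v})([!/]+)'.format(v=vowel))

def mark_accents(text):
    text = sep_accent.sub(lambda m: '~' + m.group(1), text)   # tag each two-digit code
    text = remove_accent.sub('', text)                        # drop the irrelevant ones
    text = secondary_accent.sub('/', text)                    # secondary accents
    text = primary_accent.sub('!', text)                      # whatever is left is primary
    # move the accent mark in front of its vowel; primary wins over secondary
    return condense_accents.sub(
        lambda m: ('!' if '!' in m.group(2) else '/') + m.group(1), text
    )

for sample in ('d.@b@92r', 'd.@b@71r', 'd.@b@03r'):
    print(sample, '=>', mark_accents(sample))
```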

## Qamets gadol and qatan

In [None]:
# qamets qatan  
# NB: all patterns stipulate that the qamets (@) in question is unaccented

# near end of word:
qamets_qatan1 = re.compile(r'(?<={c})(\.?)@(?={c}(?:\.?[/!]?[ &-]|\Z))'.format(c=consx))

# before dagesh forte:
qamets_qatan2 = re.compile(r'(?<={c})(\.?)@(?={c}\.)'.format(c=cons))

# if the following consonant is BGDKPT and does not have dagesh, the @ is in an open syllable:
qamets_qatan3 = re.compile(r'(?<={c})(\.?)@(?={c}:(?:{nb}|(?:{b}\.)))'.format(c=cons, b=bgdkpt, nb=nbgdkpt))

# assimilation of qamets with following composite schwa of type (chatef qamets),
#     but if the qamets is under a preposition BKL, not if it is under the article H:
qamets_qatan4a = re.compile(r'(?<={p})(\.?[!/]?)@(?=-{c}:@)'.format(p=prep, c=cons))

#     or word-internal
qamets_qatan4b = re.compile(r'(?<={c})(\.?[!/]?)@(?={c}:@)'.format(c=cons))

# before another qamets qatan, provided the syllable is unaccented
qamets_qatan5 = re.compile(r'(?<={c})(\.?)@(?={c}\.?[/!]?\^)'.format(c=cons))

def qamets_qatan_repl(match):
    return match.group(1)+'^'

# there are exceptions to the heuristic of interpreting qamets by voting between occurrences
qamets_qatan_x = '''
BJT/ => 1A
'''

# there are unaccented conjugated verb forms that must not be subjected to qamets-qatan transformation
qamets_qatan_verb_x = {
    'verb qal perf 3sf',
    'verb qal perf 3p-',
    'verb nif impf 1s-',
    'verb nif impf 1p-',
    'verb nif impf 2sf',
    'verb nif impf 2pm',
    'verb nif impf 3pm',
    'verb nif impv 2sf',
    'verb nif impv 2pm',
}
qqv_experimental = {
    'verb qal impf 3pm',
}

qamets_qatan_verb_x |= qqv_experimental

def qamets_qatan_verb_x_repl(match):
    return match.group(1)+'@'
# for the use of applying individual corrections:

## Schwa, dagesh and furtive patah

The rules for the schwa that I have found are contradictory.

These are rules I have seen, for example:

1. if two consecutive consonants both have a schwa, the second one is mobile;
1. a schwa under a consonant with dagesh forte is mobile;
1. a schwa under the last consonant of a word is quiescens;
1. a schwa on a consonant that follows a long vowel is mobile.

But there are examples where rules 1 and 3 apply at the same time.

And the qal 3 sg f forms end with a tav with schwa, often preceded by a consonant that also has a schwa.
In this case the tav has a dagesh, which by the rules for dagesh cannot be a lene. So it must be a forte.
So this violates rule 2.

We will cut this matter short, and make any final schwa quiescens.

As to rule 4, there are cases where the schwa in question is also followed by a final consonant with schwa.
In those cases it seems that the schwa in question is silent.

In [None]:
# furtive patah
furtive_patah = re.compile(r'(?<={v1})([x<]|(?:h.))a(?=\Z|[ &-])'.format(v1=vowel1))

def furtive_patah_repl(match):
    return 'ₐ'+match.group(1)

# mobile schwa
mobile_schwa1 = re.compile(r'''
    (                         # here is what goes before the schwa in question
        (?:(?:\A|[ &-]).\.?)|  # an initial consonant or
        (?:.\.)|               # a consonant with dagesh (which must be forte then) or 
        (?::.\.?)|             # another schwa and then a consonant
        (?:                    # a long vowel such as the following
            (?:
                @>?|               # qamets possibly with alef as mater lectionis (the remaining qametses are gadol)
                ;j?|               # tsere, possibly followed by yod
                ij|                # hireq with yod
                o[>w]?|            # holam possibly followed by alef or waw
                w\.                # waw with dagesh
            )
            {c}                # and then a consonant
        )
    )
    :
    (?![@ae])                # the schwa may not be composite
'''.format(c=cons), re.X)

mobile_schwa2 = re.compile(r':(?={b}(?:[^.]|[ &-]|\Z))'.format(b=bgdkpt)) # before BGDKPT letter without dagesh

# second last consonant with schwa when last consonant also has schwa
mobile_schwa3 = re.compile(r'[%:](?={c}\.?{a}?[%:](?:[ &]|\Z))'.format(a=acc, c=cons))

# all schwas at the end of the word are quiescens, only if the words are not glued together
mobile_schwa4 = re.compile(r'[%:](?=[ &]|\Z)')

def mobile_schwa1_repl(match):
    return match.group(1)+'%'

# dagesh
dages_forte_lene = re.compile(r'(?<={v1})(-?)({b})\.(?=[/!]?{v2})'.format(v1=vowel1, v2=vowel, b=bgdkpt))
dages_forte = re.compile(r'(?<={v1})(-?[h>]*-?)([^h])\.(?=[/!]?{v2})'.format(v1=vowel1, v2=vowel))
dages_lene = re.compile(r'({b})\.'.format(b=bgdkpt))

def dages_forte_lene_repl(match):
    return match.group(1)+(dagesh_lene_dict[match.group(2)] * 2)

def dages_lene_repl(match):
    return dagesh_lene_dict[match.group(1)]

def dages_forte_repl(match):
    return match.group(1) + match.group(2) * 2

## Mater lectionis and final fixes

In [None]:
# silent aleph
silent_aleph = re.compile(r'(?<=[^ &-])>(?!(?:[/!]|{v}))'.format(v=vowel))

# final mater lectionis
# I assume that heh and alef are only matres lectionis after a LONG vowel
last_ml = re.compile(r'(?<={v1})[>h]+(?=[ &-]|\Z)'.format(v1=lvowel1))
last_ml_jw = re.compile(r'jw(?=[ &-]|\Z)')

# mappiq heh
mappiq_heh = re.compile(r'h\.')

fixit_i = re.compile(r'([{v}])\.'.format(v=complex_i_vowel))
fixit_w = re.compile(r'([{v}])\.'.format(v=complex_w_vowel))
fixit = re.compile(r'(.)\.')

split_sep = re.compile('^(.*?)([ .&$\n-]*)$') # to split the result in the phono part and the interword part

def fixit_repl(match):
    return match.group(1) * 2

def fixit_i_repl(match):
    return match.group(1)+'j'

def fixit_w_repl(match):
    return match.group(1)+'w'

# END OF REGULAR EXPRESSIONS AND REPLACEMENT FUNCTIONS

## Accents

In [None]:
def doaccents(w=None, orig=None, debug=False):
    if w is not None: orig = get_orig(w)
    dout = []

# prepare
    if debug: dout.append(('orig', orig))
    result = orig.lower().replace('_', ' ')
    if debug: dout.append(('trim', result))

# tetragrammaton
    result = tetra.sub(tetra_repl, result)
    if debug: dout.append(('tetra', result))

# ketiv qere
    result = masora.sub(masora_repl, result)
    if debug: dout.append(('masora', result))

# explicit accents
    result = sep_accent.sub(sep_accent_repl, result)
    result = remove_accent.sub('', result)
    result = secundary_accent.sub('/', result)
    result = primary_accent.sub('!', result)
    result = condense_accents.sub(condense_accents_repl, result)
    if debug: dout.append(('accents', result))

# implicit accents
    if '!' not in result and '/' not in result:    
        result = default_accent1.sub(default_accent_repl, result)
        if not '/' in result:
            result = default_accent2.sub(default_accent_repl, result)
    if debug: dout.append(('default accent', result))

# deliver
    return (result, dout) if debug else result

## Qamets gadol-qatan: unsophisticated

Here is the function that carries out rule-based qamets qatan detection, without going into
verb paradigms and exceptions. It is a first approximation.

In [None]:
def doplainqamets(word, accentless=False, debug=False):
    dout = []
    result = word
    if accentless:
        result = result.replace('!', '').replace('/', '')
    result = qamets_qatan1.sub(qamets_qatan_repl, result)
    if debug: dout.append(('qamets_qatan1', result))
    result = qamets_qatan2.sub(qamets_qatan_repl, result)
    if debug: dout.append(('qamets_qatan2', result))
    result = qamets_qatan3.sub(qamets_qatan_repl, result)
    if debug: dout.append(('qamets_qatan3', result))
    result = qamets_qatan4a.sub(qamets_qatan_repl, result)
    if debug: dout.append(('qamets_qatan4a', result))
    result = qamets_qatan4b.sub(qamets_qatan_repl, result)
    if debug: dout.append(('qamets_qatan4b', result))
    result = qamets_qatan5.sub(qamets_qatan_repl, result)
    if debug: dout.append(('qamets_qatan5', result))

    return (result, dout) if debug else result

## Corrections

For some words we need specific corrections.
The rules for qamets qatan are not specific enough.

### Correction mechanism

We define a function ``apply_corr(wordq, corr)`` that can apply a correction instruction to ``wordq``, which is a word in pre-transliterated form, i.e. a word that has undergone the transliteration steps up to and including qamets interpretation, with the special verb cases applied.

The ``corr`` is a comma-separated list of basic instructions, which have the form
*number* *letter*. Each instruction interprets the *number*-th qamets as gadol or qatan, depending on whether *letter* = ``ā`` or ``o``.
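
As an illustration, here is a minimal, self-contained sketch of how such an instruction string can be interpreted. It is not the notebook's own ``apply_corr``. The name ``apply_corr_sketch`` and the sample strings are hypothetical, and the sketch assumes, as the code below does, that ``@`` marks a qamets gadol, ``^`` a qamets qatan, and that an ``@`` directly after ``:`` belongs to a composite schwa and is not counted.

```python
def apply_corr_sketch(wordq, corr):
    """Apply instructions such as '1o,2ā' to the qamets signs in wordq."""
    if not corr:
        return wordq
    # positions of all qamets signs; '@' right after ':' is a composite schwa, so skip it
    positions = [
        i for (i, ch) in enumerate(wordq)
        if ch == '^' or (ch == '@' and (i == 0 or wordq[i - 1] != ':'))
    ]
    chars = list(wordq)
    for instruction in corr.split(','):
        # the number may have several digits, the letter is the last character
        number, letter = int(instruction[:-1]), instruction[-1]
        if number < 1 or number > len(positions):
            continue  # instruction points past the last qamets: ignore it
        chars[positions[number - 1]] = '^' if letter == 'o' else '@'
    return ''.join(chars)

# made-up strings, not real words
second_to_qatan = apply_corr_sketch('x@y@z', '2o')
first_to_gadol = apply_corr_sketch('x^y@', '1ā')
```

In this sketch an out-of-range instruction is silently skipped, whereas the notebook's function reports it.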

### Precomputed list of corrections

Later on we compile a dictionary ``qamets_corrections`` of pre-computed corrections.
This dictionary is keyed by the pre-transliterated form, and valued by the corresponding correction string. Here we initialize this dictionary.

The ``phono()`` function, which carries out the complete transliteration, looks in ``qamets_corrections`` by default, but this can be overridden. These corrections are not carried out for the special verb cases.

In [None]:
qamets_corrections = {} # maps pre-transliterated forms to their correction strings

# apply correction instructions to a word

def apply_corr(wordq, corr):
    if corr == '': return wordq
    corrs = corr.split(',')
    indices=[]
    for (i, ch) in enumerate(wordq):
        if ch == '^' or (ch == '@' and (i == 0 or wordq[i-1] != ':')):
            indices.append(i)
    resultlist = list(wordq)
    for c in corrs:
        # an instruction is a (possibly multi-digit) number followed by a single letter
        (pos, kind) = (c[:-1], c[-1])
        pos = int(pos) - 1
        repl = '^' if kind == 'o' else '@'
        if pos >= len(indices):
            msg('{}: pos={} out of range {}'.format(wordq, pos, indices))
            continue
        rpos = indices[pos]
        resultlist[rpos] = repl
    return ''.join(resultlist)

### Feature value normalization

We need concise, normalized values for the lexical features.

In [None]:
undefs = {'NA', 'unknown', 'n/a', 'absent'}

png = dict(
    NA='-',
    unknown='-',
    p1='1',
    p2='2',
    p3='3',
    sg='s',
    du='d',
    pl='p',
    m='m',
    f='f',
    a='a',
    c='c',
    e='e',
)
png['n/a'] = '-'

### Lexical info

We need a label for lexical information such as part of speech, person, number, gender.
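
To make the shape of such a label concrete, here is a small sketch that does not depend on the corpus API. The function ``lex_label`` and its keyword arguments are hypothetical, ``PNG`` is only a subset of the mapping defined above, and the feature values used (``qal``, ``perf``, ``p3`` and so on) are illustrative assumptions.

```python
# hypothetical subset of the png mapping defined above
PNG = {
    'p1': '1', 'p2': '2', 'p3': '3',
    'sg': 's', 'du': 'd', 'pl': 'p',
    'm': 'm', 'f': 'f',
    'NA': '-', 'unknown': '-',
}
DECLENSED = {'subs', 'nmpr', 'adjv', 'prps', 'prde', 'prin'}

def lex_label(sp, vs=None, vt=None, ps='NA', nu='NA', gn='NA'):
    """Build a space-separated label from plain feature values."""
    parts = [sp]
    if sp == 'verb':
        # verbs get stem, tense and a person-number-gender code
        parts.extend([vs, vt, PNG[ps] + PNG[nu] + PNG[gn]])
    elif sp in DECLENSED:
        # declined words get a number-gender code only
        parts.append(PNG[nu] + PNG[gn])
    return ' '.join(parts)

verb_label = lex_label('verb', vs='qal', vt='perf', ps='p3', nu='sg', gn='m')
noun_label = lex_label('subs', nu='pl', gn='f')
other_label = lex_label('prep')
```

Under these assumptions a verb yields a label like ``verb qal perf 3sm``, a noun one like ``subs pf``, and any other part of speech just its own name.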

In [None]:
declensed = {'subs', 'nmpr', 'adjv', 'prps', 'prde', 'prin'}

def get_lex_info(w):
    sp = F.sp.v(w)
    lex_infos = [sp]
    if sp == 'verb':
        lex_infos.extend([F.vs.v(w), F.vt.v(w), '{}{}{}'.format(png[F.ps.v(w)], png[F.nu.v(w)], png[F.gn.v(w)])])
    elif sp in declensed:
        lex_infos.append('{}{}'.format(png[F.nu.v(w)], png[F.gn.v(w)]))
    return ' '.join(lex_infos)

# The phono function

This is the definition of the function that generates the phonological transliteration.

Here the individual rules are woven together and the exceptions are invoked.

In [None]:
def phono(w=None, orig=None, debug=False, lex_info=None, suppress=True, correct=1, corrections=None, inparts=False):
    if w != None:
        orig = get_orig(w, sep=-1)
        lex_info = get_lex_info(w)
    dout = []

# if suppress, phono will suppress qatan interpretation in certain verb paradigmatic forms
# if correct is 1, phono will apply individual corrections
# if correct is 0, phono will not apply individual corrections
# if correct is -1, phono will stop just before applying the qamets qatan corrections and return
# the intermediate result

# accents
    if debug: (result, dout) = doaccents(orig=orig, debug=True)
    else: result = doaccents(orig=orig)
        
# qamets qatan
    suppr = False
    if suppress:
        if '@' not in result:
            if debug: dout.append(('qamets qatan', 'no qamets present'))
        elif '!' in result:
            if debug: dout.append(('qamets qatan', 'primary accent present'))
        elif lex_info == None:
            if debug: dout.append((
                'qamets qatan', 
                'qamets without primary accent, but no exception invoked',
            ))
        elif lex_info not in qamets_qatan_verb_x:
            if debug: dout.append(('qamets qatan', 'no exception for {}'.format(lex_info)))
        else:
            suppr = True
    else:
        if debug: dout.append(('qamets qatan', 'suppression for verb forms is switched off'))
    
    if suppr:
        if debug: dout.append(('qamets qatan', 'suppressed for {}'.format(lex_info)))
    else:
        if debug:
            (result, this_dout) = doplainqamets(result, debug=True)
            dout.extend(this_dout)
        else: result = doplainqamets(result)

    if correct == -1: return (result, dout) if debug else result

    if correct == 1 and lex_info not in qamets_qatan_verb_x:
        if corrections == None:
            corrections = qamets_corrections
        parts = result.split('-')
        hotpart = parts[-1]
        if hotpart.endswith(' '):
            sep = ' '
            hotpart = hotpart.rstrip(' ')
        else:
            sep = ''
        if hotpart in corrections:
            hotpartn = apply_corr(hotpart, corrections[hotpart])
            if debug: dout.append((
                'qamets qatan',
                'correction: {} => {}'.format(hotpart, hotpartn)
            ))
            parts[-1] = hotpartn + sep
            result = '-'.join(parts)

# furtive patah
    result = furtive_patah.sub(furtive_patah_repl, result)    
    if debug: dout.append(('furtive_patah', result))

# mobile schwa
    result = mobile_schwa1.sub(mobile_schwa1_repl, result)
    if debug: dout.append(('mobile_schwa1', result))
    result = mobile_schwa2.sub('%', result)
    if debug: dout.append(('mobile_schwa2', result))
    result = mobile_schwa3.sub('', result)
    if debug: dout.append(('mobile_schwa3', result))
    result = mobile_schwa4.sub('', result)
    if debug: dout.append(('mobile_schwa4', result))

# dagesh
    result = dages_forte_lene.sub(dages_forte_lene_repl, result)    
    if debug: dout.append(('dagesh_forte_lene', result))
    result = result.replace('ij.', 'Ijj')
    result = dages_forte.sub(dages_forte_repl, result)
    if debug: dout.append(('dagesh_forte', result))
    result = dages_lene.sub(dages_lene_repl, result)
    if debug: dout.append(('dagesh_lene', result))

# silent aleph (but not in tetra/ketiv/qere)
    if '{' not in result and '[' not in result:
        result = silent_aleph.sub('', result)    
    if debug: dout.append(('silent_aleph', result))

# final mater lectionis (but not in tetra/ketiv/qere)
    if '{' not in result and '[' not in result:
        result = last_ml_jw.sub('ʸw', result)
        result = last_ml.sub('', result)    
    if debug: dout.append(('last_ml', result))

# mappiq heh
    result = mappiq_heh.sub('h', result)
    if debug: dout.append(('mappiq_heh', result))

# final schwa
    if result.endswith('k:'): result = result[0:-1]
        
# symbols
    resultparts = result.split('-')
    results = []
    for resultp in resultparts:
        result = resultp
        for (sym, repl) in sound_dict.items():
            result = result.replace(sym, repl)
        if debug: dout.append(('symbols', result))

    # fix left over dagesh and mappiq
        result = fixit_i.sub(fixit_i_repl, result)
        result = fixit_w.sub(fixit_w_repl, result)
        result = fixit.sub(fixit_repl, result)
        if debug: dout.append(('fixit', result))

    # zero width word boundary and extra white space and punctuation
        result = tetra_masora_cleanup.sub(tetra_masora_cleanup_repl, result)
        result = result.replace('$\n','.').replace('$', '.')
        if debug: dout.append(('cleanup', result))
        results.append(result)

# deliver
    result = ''.join(results) if not inparts else results
    return (result, dout) if debug else result

# Skeleton analysis

We have to do more work for the qamets. Sometimes a word form on its own is not enough to determine whether a qamets is gadol or qatan. In those cases, we analyse all occurrences of the same lexeme, and for each syllable position we measure whether an A-like vowel or an O-like vowel tends to occur in that syllable.

In order to do that, we need to compute a *vowel skeleton* for each word.
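
The per-syllable tally described above can be sketched as follows. This is a simplified stand-in for the notebook's ``compile_occs()``, without the manual overrides. The name ``vote_per_position`` is hypothetical, and ``GADOL`` and ``QATAN`` are assumed placeholder symbols, since the notebook defines its own ``gadol`` and ``qatan`` elsewhere.

```python
from collections import Counter, defaultdict

GADOL, QATAN = 'ā', 'o'  # assumed symbols, stand-ins for the notebook's gadol and qatan

def vote_per_position(skeletons):
    """For each syllable position, decide whether A-like or O-like vowels dominate."""
    counts = defaultdict(Counter)
    for skel in skeletons:
        for (i, c) in enumerate(skel):
            counts[i][c] += 1
    verdict = {}
    for i in sorted(counts):
        a_ish = counts[i][GADOL] + counts[i]['A']
        o_ish = counts[i][QATAN] + counts[i]['O']
        # a tie gives no verdict for this position
        if a_ish != o_ish:
            verdict[i] = GADOL if a_ish > o_ish else QATAN
    return verdict

# three made-up skeletons of the same lexeme
verdict = vote_per_position(['āE', 'AE', 'oE'])
```

In this example the first position has two A-like votes against one O-like vote, so it is judged gadol. The second position only has E-like vowels and stays undecided.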

## Stripping paradigmatic material

A word may have extra syllables, due to inflections, such as plurals, feminine forms, or suffixes. Let us call this the *paradigmatic material* of a word. 

Now, we strip from the initial vowel skeleton a number of trailing vowels that corresponds
to the number of consonants found in the paradigmatic material.
This is rather crude, but it will do.
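
The stripping step itself amounts to dropping positions from the end of the skeleton. A minimal sketch, with a hypothetical function name and made-up skeletons; clamping at zero is a choice of this sketch and not something the code below does:

```python
def strip_paradigmatic(fullskel, n_consonants):
    """Drop as many trailing skeleton positions as there are paradigmatic consonants."""
    keep = max(0, len(fullskel) - n_consonants)  # never go below an empty skeleton
    return fullskel[:keep]

stem_skel = strip_paradigmatic('AEO', 1)
unchanged = strip_paradigmatic('AE', 0)
emptied = strip_paradigmatic('A', 3)
```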

In [None]:
# we need the number of letters in a defined value of a morpho feature
def len_suffix(v):
    if v == None: return 0
    if v in undefs: return 0
    return len(v.replace('=', '').replace('W', '').replace('J', ''))

# we need a function that returns 1 for plural/dual subs/adj and for fem adj
def len_ending(sp, n, g):
    if sp == 'subs': return 1 if n in {'pl', 'du'} else 0
    if sp == 'adjv': return 1 if n in {'pl', 'du'} or g == 'f' else 0
    return 0

# return the number of consonants in the suffixes
def len_morpho(w):
    return max((
        len_suffix(F.prs.v(w)) + len_suffix(F.uvf.v(w)), 
        len_ending(F.sp.v(w), F.nu.v(w), F.gn.v(w)),
    ))

## Skeleton patterns

Next, we reduce the vowel skeleton to a skeleton pattern. We are not interested in all vowels, only in whether the vowel is a qamets (gadol or qatan), A-like, O-like, or other (which we dub E-like).
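
A simplified sketch of this reduction is given here. It is not the notebook's regular expression, and the vowel classes are assumptions: the notebook defines ``a_like``, ``o_like``, ``e_like``, ``gadol`` and ``qatan`` elsewhere, and their actual values may differ from the hypothetical ``A_LIKE``, ``O_LIKE``, ``E_LIKE``, ``GADOL`` and ``QATAN`` used below.

```python
import re

GADOL, QATAN = 'ā', 'o'                          # assumed output symbols
A_LIKE = {'a', ':a'}                             # assumed: patah, hataf patah
O_LIKE = {'o', 'ow', ':@'}                       # assumed: holam, hataf qamets
E_LIKE = {';', 'e', 'i', 'u', ';j', 'ij', ':e'}  # assumed: all remaining vowels

# composite schwa | composite vowel | single vowel point | anything else
TOKEN = re.compile(r':[@ae]|[;i]j|ow|[@a;eiou^]|.')

def skeleton_pattern(wordq):
    """Reduce a pre-transliterated word to its vowel pattern."""
    pattern = []
    for token in TOKEN.findall(wordq):
        if token == '@':
            pattern.append(GADOL)
        elif token == '^':
            pattern.append(QATAN)
        elif token in A_LIKE:
            pattern.append('A')
        elif token in O_LIKE:
            pattern.append('O')
        elif token in E_LIKE:
            pattern.append('E')
        # consonants and other marks contribute nothing
    return ''.join(pattern)

# illustrative strings in lower-cased transcription
two_qamets = skeleton_pattern('d@b@r')
mixed = skeleton_pattern('mal:k;j')
```

The order of the alternatives matters: composite signs must be tried before single vowel points, otherwise ``:@`` would be read as a plain qamets.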

In [None]:
# the qamets gadol/qatan skeleton
qamets_qatan_skel = re.compile('([^@^])')

# the vowel skeleton where the qamets gadol/qatan are preserved as @ and ^
# another o-like vowel becomes O (holam, qamets chatuf) (no waws nor yods)
# another a-like vowel becomes A (patah, patah chatuf) (no alefs)
silent_alef_start = re.compile('([ &-]|\A)>([!/]?(?:[^!/.:;@^aeiou]|\Z))')

def silent_alef_start_repl(match):
    return match.group(1)+'E'+match.group(2)

qamets_qatan_fullskel = re.compile('''
    (
        E                                         # replacement of silent initial alef without vowels
    |   (?::[@ae])                                # a composite schwa
    |   (?:[;i]j) | (?:ow) | (?:w.)               # a composite vowel   
    |   [@a;eiou^]                                # a vowel point
    |   .                                         # anything else
    )
'''.format(c=cons), re.X)

def qamets_qatan_fullskel_repl(match):
    found = match.group(1)
    if found == 'E': return 'E'
    if found == '@': return gadol
    if found == '^': return qatan
    if found in a_like: return 'A'
    if found in o_like: return 'O'
    if found in e_like: return 'E'
    return ''

def get_full_skel(w, orig=None, debug=False):
    if orig == None:
        orig = get_orig(w, sep=-1)
    wordq = phono(orig=orig, correct=-1)
    wordqr = silent_alef_start.sub(silent_alef_start_repl, wordq)
    fullskel = qamets_qatan_fullskel.sub(qamets_qatan_fullskel_repl, wordqr)
    ending_length = len_morpho(w)
    relevant_part = len(fullskel) - ending_length
    if debug: print('{}: {} => {} => {} : {} minus {} = {}'.format(
        w, orig, wordq, wordqr, fullskel, ending_length, fullskel[0:relevant_part],
    ))

    return fullskel[0:relevant_part]

# Qamets gadol qatan: sophisticated

A lot of work is needed to get the qamets gadol-qatan right.
This involves looking at accents, verb paradigms and special cases among the non-verbs.

## Qamets gadol qatan: non-verbs

Sometimes a qamets is gadol or qatan for lexical reasons, i.e. it cannot be derived by rules from the word occurrence itself; other occurrences of the same lexeme have to be consulted.

### All candidates

In [None]:
# find lexemes which have an occurrence with a qamets (except verbs)
msg("Looking for non-verb qamets")
qq_words = set()
qq_lex = collections.defaultdict(lambda: [])

for w in F.otype.s('word'):
    ln = F.language.v(w)
    if ln != 'Hebrew': continue
    sp = F.sp.v(w)
    if sp == 'verb': continue
    orig = get_orig(w, sep=-1)
    if '*' in orig: continue       # ketiv qere
    if '@' not in orig: continue   # no qamets in word
    word = doaccents(orig=orig)
    lex = F.lex.v(w)
    if word in qq_words: continue
    qq_words.add(word)
    qq_lex[lex].append(w)
msg('{} lexemes and {} unique occurrences'.format(len(qq_lex), len(qq_words)))

### Filtering interesting candidates

In [None]:
msg('Filtering lexemes with varied occurrences')
qq_varied = collections.defaultdict(lambda: [])
nocc = 0
for lex in qq_lex:
    ws = qq_lex[lex]
    if len(ws) == 1: continue
    occs = []
    skel_set = set()
    has_qatan = False
    has_gadol = False
    for w in ws:
        orig = get_orig(w, sep=-1)
        wordq = phono(orig=orig, correct=-1)
        skel = qamets_qatan_skel.sub('', wordq.replace(':@','')).replace('@',gadol).replace('^',qatan)
        if gadol in skel: has_gadol = True
        if qatan in skel: has_qatan = True
        skel_set.add(skel)
        occs.append((skel, orig, w))
    if len(skel_set) > 1 and has_qatan and has_gadol:
        for (skel, orig, w) in occs:
            fullskel = get_full_skel(w, orig=orig)
            qq_varied[lex].append((skel, fullskel, w))
            nocc += 1
msg('{} interesting lexemes with {} unique occurrences'.format(len(qq_varied), nocc))

### Guess the qamets

In [None]:
qamets_qatan_xc = dict(
    (x[0], x[1]) for x in (y.split(' => ') for y in qamets_qatan_x.strip().split('\n'))
)
qamets_qatan_xcompiled = collections.defaultdict(lambda: {})
for (lex, corrstr) in qamets_qatan_xc.items():
    corrs = corrstr.split(',')
    for corr in corrs:
        (pos, ins) = corr
        pos = int(pos) - 1
        qamets_qatan_xcompiled[lex][pos] = ins

def compile_occs(lex, occs):
    vowel_counts = collections.defaultdict(lambda: collections.Counter())
    for (skel, fullskel, w) in occs:
        for (i, c) in enumerate(fullskel):
            vowel_counts[i][c] += 1
    occs_compiled = {}
    for i in sorted(vowel_counts):
        vowel_count = vowel_counts[i]
        a_ish = vowel_count.get(gadol, 0) + vowel_count.get('A', 0)
        o_ish = vowel_count.get(qatan, 0) + vowel_count.get('O', 0)
        if a_ish != o_ish: occs_compiled[i] = gadol if a_ish > o_ish else qatan
    if lex in qamets_qatan_xcompiled:
        override = qamets_qatan_xcompiled[lex]
        for i in override:
            ins = override[i]
            old_ins = occs_compiled.get(i, '')
            new_ins = gadol if ins == 'A' else qatan
            if old_ins == new_ins:
                print('{}: No override needed for syllable {} which is {}'.format(
                    lex, i+1, old_ins,
                ))
            else:
                print('{}: Override for syllable {}: {} becomes {}'.format(
                    lex, i+1, old_ins, new_ins,
                ))
                occs_compiled[i] = new_ins
    return occs_compiled

def guess_qq(occ, occs_compiled, debug=False):
    (skel, fullskel, w) = occ
    guess = ''
    for (i, c) in enumerate(fullskel):
        guess += occs_compiled.get(i, c) if c == gadol or c == qatan else c
    if debug: print('{}'.format(w))
    return guess

def get_corr(fullskel, guess, debug=False):
    n = 0
    corr = []
    for (i, fc) in enumerate(fullskel):
        if fc != qatan and fc != gadol: continue
        n += 1
        gc = guess[i]
        if fc == gc: continue
        corr.append('{}{}'.format(n, gc))
    if debug: print('{} guess {} corr {}'.format(fullskel, guess, corr))
    return ','.join(corr)

### Carrying out the guess work

In [None]:
msg('Guessing between gadol and qatan')
qamets_corrections = {}
qq_varied_remaining = set()
ndiff_occs = 0
ndiff_lexs = 0
nconflicts = 0
for lex in qq_varied:
    debug = False
    occs = qq_varied[lex]
    occs_compiled = compile_occs(lex, occs)
    this_ndiff_occs = 0
    for occ in occs:
        (skel, fullskel, w) = occ
        guess = guess_qq(occ, occs_compiled, debug=debug)
        corr = get_corr(fullskel, guess, debug=debug)
        if corr:
            this_ndiff_occs += 1
            orig = get_orig(w, sep=-1)
            wordq = phono(orig=orig, correct=-1)
            if wordq in qamets_corrections:
                old_corr = qamets_corrections[wordq]
                if old_corr != corr:
                    print('Conflicting corrections for {} {} {} ({} => {}): first {} and then {}'.format(
                        lex, wordq, skel, fullskel, guess, old_corr, corr,
                    ))
                    nconflicts += 1
            qamets_corrections[wordq] = corr

    if this_ndiff_occs:
        ndiff_lexs += 1
        ndiff_occs += this_ndiff_occs
        qq_varied_remaining.add(lex)
msg('{} lexemes with modified occurrences ({})'.format(ndiff_lexs, ndiff_occs))
msg('{} patterns with conflicts'.format(nconflicts))

# Testing

The function below reads a text file with tests.

A test is a tab-separated line with the following fields:

    passage etcbc-original lex_info expected_result comment
    
The testing routine executes all tests, checks the results, and produces on-screen output, debug output in a file, and pretty output in an HTML file.
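
As a sketch of how one such line can be read, assuming the five-field layout that ``runtests()`` unpacks (``source, orig, lex_info, expected, comment``): the function name ``parse_test_line`` and the sample line are hypothetical, and the expected value is a placeholder, not a real transliteration.

```python
def parse_test_line(line, nfields=5):
    """Split one tab-separated test line into its fields and check the field count."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != nfields:
        raise ValueError('expected {} fields, got {}'.format(nfields, len(fields)))
    return tuple(fields)

# lex_info is left empty here, so it would have to be looked up from the word node
sample = 'Genesis 1:7\tMI-T.A74XAT\t\tEXPECTED\tproclitic min\n'
(source, orig, lex_info, expected, comment) = parse_test_line(sample)
```

Unlike this sketch, ``runtests()`` only prints an error for a line with the wrong number of fields and carries on.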

## Auxiliary functions

We want to be able to look up the node given a passage and an etcbc transcription string.

We also want to be able to easily generate a test from an occurrence that we encounter, whether we have the node, or the transcription string with passage.
For that we need a passage index.

In [None]:
msg("Compiling passage index")
passage_index = {}

for bn in F.otype.s('book'):
    book_name = F.book.v(bn)
    for cn in L.d('chapter', bn):
        chapter_num = F.chapter.v(cn)
        for vn in L.d('verse', cn):
            verse_num = F.verse.v(vn)
            passage_index['{} {}:{}'.format(book_name, chapter_num, verse_num)] = vn
msg('{} passages (verses)'.format(len(passage_index)))

### Composing tests

Given an occurrence in etcbc translit in a passage, or a node number, we want to easily compile a test out of it.
Say we are looking for ``orig``.

The match need not be perfect. 
We want to find the node w, which carries a translit that occurs at the end of ``orig``.
If there are multiple, we want the longest.
If there are multiple longest ones, we want the first that occurs in the passage.
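
The selection rule can be sketched independently of the corpus. Here ``pick_word`` is a hypothetical helper, and the candidates are made-up ``(node, translit)`` pairs listed in passage order.

```python
def pick_word(orig, candidates):
    """Return the node whose translit is the longest suffix of orig; ties go to the earliest."""
    hits = [
        (node, translit, i)
        for (i, (node, translit)) in enumerate(candidates)
        if translit != '' and orig.endswith(translit)
    ]
    if not hits:
        return None
    # longest translit first, then earliest position in the passage
    return min(hits, key=lambda hit: (-len(hit[1]), hit[2]))[0]

words = [(101, 'XAT'), (102, 'T.A74XAT'), (103, 'T.A74XAT'), (104, 'MI')]
chosen = pick_word('MI-T.A74XAT', words)
```

Node 102 wins: it shares the longest matching suffix with node 103 but comes first.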

In [None]:
def get_hebrew(orig):
    origm = Transcription.suffix_and_finales(orig)
    return Transcription.to_hebrew(origm[0]+origm[1]).replace('-','')

def get_passage(w):
    vn = w if F.otype.v(w) == 'verse' else L.u('verse', w)
    return '{} {}:{}'.format(
        F.book.v(L.u('book', w)),
        F.chapter.v(L.u('chapter', w)),
        F.verse.v(vn),
    )
    
def find_w(passage, orig):
    vn = passage_index.get(passage, None)
    if vn == None:
        return None
    wr = None
    ws = []
    for (i, w) in enumerate(L.d('word', vn)):
        w_orig = get_orig(w, sep=-1)
        if w_orig != '' and orig.endswith(w_orig):
            ws.append((w, w_orig, i))
    if len(ws) == 0:
        print('find_w: no {} found in {}'.format(orig, passage))
    else:
        if len(ws) > 1:
            ws_sorted = sorted(ws, key=lambda x: (-len(x[1]), x[2]))
            wr = ws_sorted[0][0]
        else:
            wr = ws[0][0]
    return wr
    
def maketest(w=None, orig=None, lex_info=None, passage=None, expected=None, comment=None):
    if comment == None: comment = 'isolated case'
    had_w = True
    if w == None:
        had_w = False
        if passage != None and orig != None: w = find_w(passage, orig)
    if w == None: 
        if expected == None: expected = phono(orig=orig, lex_info=lex_info)
        test = (passage, orig, lex_info, expected, comment)
    else:
        if lex_info == None:
            lex_info = get_lex_info(w)
        if passage == None:
            passage = get_passage(w)            
        if had_w:
            if expected == None: expected = phono(w=w)
            test =  (w, '', '', expected, comment)
        else:
            if expected == None: expected = phono(orig=orig, lex_info=lex_info)
            test = (passage, orig, lex_info, expected, comment)

    return test

### Formatting test results

Here are some HTML/CSS definitions for pretty printing test results.

In [None]:
def h_esc(txt):
    return txt.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

def test_html_head(title, stats, mystats): 
    return '''<html>
<head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=UTF-8" />
    <title>'''+title+'''</title>
    <style type="text/css">
        .h {
            font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif; 
            font-size: x-large;
            text-align: right;
        }
        .t {
            font-family: Menlo, Courier New, Courier, monospace;
            font-size: small;
            color: #0000cc;        
        }
        .tl {
            font-family: Menlo, Courier New, Courier, monospace;
            font-size: medium;
            font-weight: bold;
            color: #000000;        
        }
        .p {
            font-family: Verdana, Arial, sans-serif;
            font-size: medium;
        }
        .l {
            font-family: Verdana, Arial, sans-serif;
            font-size: small;
            color: #440088;
        }
        .v {
            font-family: Verdana, Arial, sans-serif; 
            font-size: x-small;
            color: #666666;
        }
        .c {
            font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;
            font-size: small;
            background-color: #ffffdd;
            width: 20%;
        }
        .cor {
            font-family: Menlo, Courier New, Courier, monospace;
            font-weight: bold;
            font-size: medium;
        }
        .exact {
            background-color: #88ffff;
        }
        .good {
            background-color: #88ff88;
        }
        .error {
            background-color: #ff8888;
        }
        .norm {
            background-color: #8888ff;
        }
        .ca {
            background-color: #88ffff;
        }
        .cr {
            background-color: #ffff33;
        }
    </style>
</head><body>
'''+(('<p>'+stats+'</p>') if stats else '')+(('<p>'+mystats+'</p>') if mystats else '')+'''
<table>
'''

test_html_tail = '''</table>
</body>
</html>
'''

### Run tests

This is the function that runs a sequence of tests.
If the ``testsource`` argument is a string, the tests are read from the tab-separated file with that name.
Otherwise it should be a list of tests, a test being a list or tuple consisting of:

    source, orig, lex_info, expected, comment
    
where ``source`` is either a string ``passage`` or a number ``w``.
If it is a ``w``, it is the node corresponding to the word, and it is used to get the ``passage, orig, lex_info`` which are allowed to be empty.
If it is a ``passage``, the node will be looked up on the basis of it plus ``orig``.
If the node is found, it will be used to get the ``lex_info``, if not, the given ``lex_info`` will be used.
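
When a result is compared with the expected value, the code below distinguishes three outcomes: an exact match (``=``), a match that differs only in the stress marks ``ˌ`` and ``ˈ`` (``~``), and a failure (``#``). A stand-alone sketch of that comparison, with a hypothetical function name and made-up strings:

```python
def classify(actual, expected):
    """Return '=' for an exact match, '~' for a match up to stress marks, '#' otherwise."""
    def without_stress(text):
        return text.replace('ˌ', '').replace('ˈ', '')

    if actual == expected:
        return '='
    if without_stress(actual) == without_stress(expected):
        return '~'
    return '#'

outcomes = [classify('abˈc', 'abˈc'), classify('abˈc', 'abc'), classify('abc', 'abd')]
```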

In [None]:
def vfname(inpath):
    (indir, infile) = os.path.split(inpath)
    (inbase, inext) = os.path.splitext(infile)
    return os.path.join(indir, inbase+version+inext)
    
def runtests(ntestfields, title, testsource, outfilename, htmlfilename, order=True):
    if type(testsource) is list:
        tests = testsource
    else:
        tests = []
        with open(testsource) as test_in_file:
            for tline in test_in_file: tests.append(tline.rstrip('\n').split('\t'))

    ntl = 0
    for fields in tests:
        ntl += 1
        if len(fields) != ntestfields:
            print('ERROR: not the right number of fields ({} instead of {}) at line {}'.format(
                len(fields), ntestfields, ntl,
            ))
    lines = []
    htmllines = []
    longlines = []
    nexact = 0
    ngood = 0
    ntests = len(tests)
    test_sequence = sorted(tests, key=lambda x: (x[1], x[2], x[3], x[0])) if order else tests

    for (i, (source, orig, lex_info, expected, comment)) in enumerate(test_sequence):
        w = None
        passage = None
        if type(source) is int:
            w = source
            passage = get_passage(w)
            wgiven = True
        else:
            passage = source
            w = find_w(passage, orig)
            wgiven = False
        if w != None:
            if wgiven or orig.strip() == '':
                orig = get_orig(w, sep=-1)
            if wgiven or lex_info.strip() == '':
                lex_info = get_lex_info(w)

        (wordph, dout) = phono(orig=orig, debug=True, lex_info=lex_info)
        if wordph == expected:
            isgood = '='
            nexact += 1
        elif wordph.replace('ˌ', '').replace('ˈ', '') == expected.replace('ˌ', '').replace('ˈ', ''):
            isgood = '~'
            ngood += 1
        else:
            isgood = '#'
        if passage == None: passage = ''
        if expected == None: expected = ''
        if orig == None: orig = ''
        if lex_info == None: lex_info = ''
        lines.append('{:>3} {:<19} {:<17} {:<22} {:<20} {} {:<20}'.format(
            i+1, passage, 
            lex_info, orig, wordph,
            isgood,
            '' if isgood == '=' else expected, 
        ))
        longlines.append('{:>3} {:<19} {:<17} {:<25} => {:<25} < {} {:<25} # {}\n{}\n\n'.format(
            i+1, passage, 
            lex_info, orig, wordph,
            isgood,
            '' if isgood == '=' else expected, 
            comment,
            '\n'.join('{:<7} {:<20} {}'.format('', x[0], x[1]) for x in dout),
        ))
        htmllines.append(('''
    <tr>
        <td class="{st}">{i}</td>
        <td class="v">{v}</td>
        <td class="t">{t}</td>        
        <td class="h">{h}</td>
        <td class="l">{l}</td>
        <td class="p {st}">{p}</td>
        <td class="p{est}">{e}</td>
        <td class="c">{c}</td>
    </tr>
    ''').format(
        st='exact' if isgood == '=' else 'good' if isgood == '~' else 'error',
        i=i+1,
        v=passage,
        t=h_esc(orig),
        l=lex_info,
        h=get_hebrew(orig),
        p=wordph,
        e='' if isgood == '=' else expected,
        est='' if isgood == '=' else ' ca' if isgood == '~' else ' norm',
        c=h_esc(comment),
    ))

    line_text = '\n'.join(lines)
    longline_text = '\n'.join(longlines)
    print(line_text)
    test_out_file = open(vfname(outfilename), 'w')
    test_out_file.write('{}\n\n{}\n'.format(line_text, longline_text))
    stats = '{} tests; {} failed; {} passed of which {} exactly.'.format(
        ntests, ntests-ngood-nexact, ngood + nexact, nexact,
    )
    test_out_file.close()
    test_html_file = open(vfname(htmlfilename), 'w')
    test_html_headline = '''
    <tr>
        <th class="v">v</th>
        <th class="v">verse</th>
        <th class="t">etcbc</th>
        <th class="h">hebrew</th>
        <th class="l">lexical</th>
        <th class="p">phono</th>
        <th class="p norm">expected</th>
        <th class="c">comment</th>
    </tr>
    '''
    test_html_file.write('{}{}{}{}'.format(
        test_html_head(title, stats, ''), test_html_headline, ''.join(htmllines), test_html_tail))
    test_html_file.close()
    msg(stats)

### Produce showcases

This is a variant on ``runtests()``.

It produces overviews of the cases where the corpus-dependent rules have been applied.

In [None]:
def showcases(ntestfields, title, stats, testsource, htmlfilename, order=True):
    msg("Generating HTML in {}".format(htmlfilename))

    if type(testsource) is list:
        tests = testsource
    else:
        tests = []
        test_in_file = open(vfname(testsource))
        for tline in test_in_file: tests.append(tline.rstrip('\n').split('\t'))
        test_in_file.close()

    ntl = 0
    for fields in tests:
        ntl += 1
        if len(fields) != ntestfields:
            print('ERROR: not the right number of fields ({} instead of {}) at line {}'.format(
                len(fields), ntestfields, ntl,
            ))
    htmllines = []
    ncorr = 0
    ntests = len(tests)
    test_sequence = sorted(tests, key=lambda x: (x[3], x[0], x[1], x[5])) if order else tests

    for (i, (corr, wordph, wordph_c, lex, orig, w, comment)) in enumerate(test_sequence):
        passage = get_passage(w)
        lex_info = get_lex_info(w)
        heb = get_hebrew(orig)
        if corr:
            ncorr += 1
        htmllines.append(('''
    <tr>
        <td class="v">{i}</td>
        <td class="cor{st}">{cr}</td>
        <td class="tl">{tl}</td>        
        <td class="v">{v}</td>
        <td class="l">{l}</td>
        <td class="h">{h}</td>
        <td class="p {st}">{p}</td>
        <td class="p {st1}">{pc}</td>
        <td class="t">{t}</td>        
        <td class="c">{c}</td>
    </tr>
''').format(
        i=i+1,
        st=' cr' if corr else '',
        st1=' good' if corr else '',
        cr=corr,
        tl=h_esc(lex),
        v=passage,
        l=lex_info,
        h=heb,
        p=wordph if wordph != wordph_c else '',
        pc=wordph_c,
        t=h_esc(orig),
        c=h_esc(comment),
    ))

    mystats = '{} occurrences and {} corrections'.format(
        ntests, ncorr,
    )
    test_html_headline = '''
    <tr>
        <th class="v">n</th>
        <th class="cor cr">correction</th>
        <th class="tl">lexeme</th>
        <th class="v">verse</th>
        <th class="l">lexical</th>
        <th class="h">hebrew</th>
        <th class="p cr">phono<br/>uncorrected</th>
        <th class="p good">phono<br/>corrected</th>
        <th class="t">etcbc</th>
        <th class="c">comment</th>
    </tr>
    '''
    test_html_file = open(vfname(htmlfilename), 'w')
    test_html_file.write('{}{}{}{}'.format(
        test_html_head(title, stats, mystats), test_html_headline, ''.join(htmllines), test_html_tail))
    test_html_file.close()
    if stats: msg(stats)
    if mystats: msg(mystats)

## Test the main examples

In [None]:
runtests(5, 'Mixed Tests', 'tests.txt', 'tests_debug.txt', 'tests.html')

## Testing: Special cases

In [None]:
special_tests = [
    dict(w=7494, expected=None, comment="schwa in front of BGDKPT without dagesh"),
    dict(w=5, expected=None, comment="article in isolation"),
    dict(w=6, expected=None, comment="word after article in isolation"),
    dict(w=106, expected=None, comment="proclitic min"),
    dict(w=107, expected=None, comment="word starting with BGDKPT after proclitic min"),
    dict(passage='Genesis 1:7', orig='MI-T.A74XAT', expected=None, comment="proclitic min combined with word starting with BGDKPT"),
    dict(w=1684, expected=None, comment='Tetra with end of verse'),
    dict(passage='Genesis 4:1', orig='>ET&J:HW@75H $', expected=None, comment='Tetra with end of verse'),   
]

print('{} => {}'.format('MI-T.A74XAT', phono(orig='MI-T.A74XAT', inparts=True)))

runtests(
    5,
    'special cases',
    [maketest(**t) for t in special_tests], 
    'special_cases_out.txt', 'special_cases.html',
)

## Testing: Qamets gadol qatan: non-verbs

We have generated a number of corrections of the qamets interpretation in non-verbs, and we have applied exceptions to those corrections. Here is the list of representative occurrences where corrections and/or exceptions have been applied.

In [None]:
msg('Showing lexemes with varied occurrences')
qqi_filename = 'qamets_qatan_individuals'
qqi = outfile('{}.txt'.format(qqi_filename))
nvcases = []

noccs = 0
ncorrs = 0
for lex in sorted(qq_varied):
    if lex not in qq_varied_remaining: continue
    occs = qq_varied[lex]
    for (skel, fullskel, w) in sorted(occs, key=lambda x: (x[1], x[2])):
        orig = get_orig(w, sep=-1)
        wordq = phono(orig=orig, correct=-1)
        corr = qamets_corrections.get(wordq, '')
        if corr: ncorrs += 1
        noccs += 1
        wordph = phono(orig=orig, correct=0)
        wordph_c = phono(orig=orig, correct=1)
        comment = 'on the basis of other occurrences' if corr else 'by the rules'
        qqi.write('{:<1}\t{:<5}\t{:<16}\t{:<16}\t{:<10}\t{:<20}\t{}\n'.format(
            '*' if corr else '',
            corr,
            wordph,
            wordph_c,
            lex,
            orig,
            w,
        ))
        nvcases.append((corr, wordph, wordph_c, lex, orig, w, comment))
    qqi.write('\n')
qqi.close()
msg('{} lexemes with {} occurrences and {} corrections written'.format(
    len(qq_varied_remaining), noccs, ncorrs,
))

### Pretty printing the non-verb cases to HTML

In [None]:
showcases(
    7,
    'special nonverb cases',
    '{} lexemes'.format(len(qq_varied_remaining)),
    nvcases, 
    'special_nonverb_cases.html',
    order=False,
)

## Testing: Qamets gadol qatan: verbs

Usually, the accents ensure that potential qatans are read as gadols.
But sometimes the accents are missing.
In those cases we look at the verb paradigms to fill in the missing information:
we use a list of paradigm labels where such cases might occur, and for those labels we suppress the qamets-as-qatan interpretation.

Below we look up the cases where this occurs, and show them.
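
The decision described above can be pictured with a small toy function. This is only a sketch under stated assumptions: the paradigm labels in `SUPPRESS_LABELS` are made up for illustration (the notebook's real list is `qamets_qatan_verb_x`), and `read_qamets` is a hypothetical helper, not part of the actual `phono` implementation.

```python
# Hypothetical paradigm labels, invented for this illustration only.
# The notebook keeps its real list in qamets_qatan_verb_x.
SUPPRESS_LABELS = {'qal-perf-3sf', 'qal-perf-3p'}


def read_qamets(is_qatan_candidate, has_accent, paradigm_label, suppress=True):
    """Return 'qatan' or 'gadol' for a single qamets under toy rules."""
    if not is_qatan_candidate:
        return 'gadol'   # nothing to decide
    if has_accent:
        return 'gadol'   # the accent makes a potential qatan a gadol
    if suppress and paradigm_label in SUPPRESS_LABELS:
        return 'gadol'   # accent missing: the verb paradigm fills the gap
    return 'qatan'


# With suppression switched off, the same word falls back to qatan,
# which mirrors the comparison made with suppress=False further down.
print(read_qamets(True, False, 'qal-perf-3sf'))
print(read_qamets(True, False, 'qal-perf-3sf', suppress=False))
```

Comparing the result with and without suppression is the same idea as comparing `wordph` and `wordph_ns` in the cells below.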

### Look up the cases

In [None]:
qq_verb_words = set()
qq_verb_specials = []

msg('Finding qamets qatan special verb cases')
for w in F.otype.s('word'):
    ln = F.language.v(w)
    if ln != 'Hebrew': continue
    sp = F.sp.v(w)
    if sp != 'verb': continue
    orig = get_orig(w, sep=-1)
    if '*' in orig: continue       # ketiv qere
    if '@' not in orig: continue   # no qamets in word
    word = doaccents(orig=orig)
    wordq = doplainqamets(word, accentless=True)
    if '^' not in wordq: continue  # no risk of unwanted qamets qatan
#    if '!' in word: continue       # primary accent has been marked

    lex_info = get_lex_info(w)
    if lex_info in qamets_qatan_verb_x:
        if (word, lex_info) in qq_verb_words: continue
        qq_verb_words.add((word, lex_info))
        qq_verb_specials.append((w, orig, word, lex_info))
msg('{} cases'.format(len(qq_verb_specials)))

### Show the cases

In [None]:
msg('Showing verb cases')

qq_verb_specials_compiled = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))

for (w, orig, word, lex_info) in qq_verb_specials:
    qq_verb_specials_compiled[word][lex_info] = (w, orig) # (word-lex_info pairs are unique)

ncorr = 0
ngood = 0
vcases = []
verb_lexemes = set()
for word in sorted(qq_verb_specials_compiled):
    for lex_info in sorted(qq_verb_specials_compiled[word]):
        (w, orig) = qq_verb_specials_compiled[word][lex_info]
        wordph = phono(orig=orig, lex_info=lex_info)
        wordph_ns = phono(orig=orig, lex_info=lex_info, suppress=False)
        corr = ''
        lex = F.lex.v(w)
        verb_lexemes.add(lex)
        if wordph == wordph_ns: 
            ngood += 1
            corr = ''
            comment = "qamets: no need to suppress qatan"
        else:
            ncorr += 1
            corr = 'gadol'
            comment = 'qamets: gadol maintained because of verb paradigm'
        vcases.append((corr, wordph, wordph_ns, lex, orig, w, comment))

showcases(
    7,
    'verb cases',
    '{} lexemes'.format(len(verb_lexemes)),
    vcases,
    'special_verb_cases.html',
    order=True,
)

# Print the whole text

In [None]:
msg('Generating complete texts (transcribed and phonetic) ... ')

phono_fname = 'phono{}.txt'.format(version)
word_fname = 'wordph{}.txt'.format(version)
phono_file = outfile(phono_fname)
word_file = outfile(word_fname)

orig_file = outfile('orig{}.txt'.format(version))
combi_file = open('combi{}.txt'.format(version), 'w')

nv = 0
nchunk = 1000
nvc = 0
for v in F.otype.s('verse'):
    nv += 1
    nvc += 1
    if nvc == nchunk:
        msg('{:>5} verses'.format(nv))
        nvc = 0
    passage_label = get_passage(v)
    phono_file.write('{}  '.format(passage_label))
    orig_file.write('{}  '.format(passage_label))
    combi_file.write('{}\n'.format(passage_label))

    words = L.d('word', v)
    orig_text = ''
    phono_text = ''
    
    cur_orig = ''
    cur_nodes = []
    
    for w in words:
        orig = get_orig(w, sep=1)
        eol = orig.endswith('\n')
        sep = orig[-1]
        lex_info = get_lex_info(w)
        cur_orig += orig
        cur_nodes.append(w)
        if sep ==  '-': continue
        orig_text += cur_orig
        cur_phonos = phono(orig=cur_orig, lex_info=lex_info, inparts=True)
        phono_text += ''.join(cur_phonos)
        lnodes = len(cur_nodes)
        lphonos = len(cur_phonos)
        if lnodes != lphonos:
            word_file.write('!MISMATCH nodes={} phonos={}'.format(lnodes, lphonos))
        for i in range(max((lnodes, lphonos))):
            this_node = cur_nodes[i] if i < lnodes else 'X'
            this_phono = cur_phonos[i] if i <lphonos else 'X'
            parts = split_sep.findall(this_phono)
            if len(parts):
                (this_phono_x, this_sep) = parts[0]
            else:
                (this_phono_x, this_sep) = (this_phono, '')
            word_file.write('{}\t{}\t{}\n'.format(this_node, this_phono_x, this_sep))
        cur_nodes = []
        cur_phonos = []
        if eol:
            orig_file.write(orig_text)
            phono_file.write(phono_text + '\n')
            combi_file.write('{}{}\n\n'.format(orig_text, phono_text))
            orig_text = ''
            phono_text = ''
        cur_orig = ''
        cur_phono = ''
    if cur_orig:
        orig_text += cur_orig
        cur_phonos = phono(orig=cur_orig, lex_info=lex_info, inparts=True)
        phono_text += ''.join(cur_phonos)
        lnodes = len(cur_nodes)
        lphonos = len(cur_phonos)
        if lnodes != lphonos:
            word_file.write('!MISMATCH nodes={} phonos={}'.format(lnodes, lphonos))
        for i in range(max((lnodes, lphonos))):
            this_node = cur_nodes[i] if i < lnodes else 'X'
            this_phono = cur_phonos[i] if i <lphonos else 'X'
            parts = split_sep.findall(this_phono)
            if len(parts):
                (this_phono_x, this_sep) = parts[0]
            else:
                (this_phono_x, this_sep) = (this_phono, '')
            word_file.write('{}\t{}\t{}\n'.format(this_node, this_phono_x, this_sep))
    if orig_text != '' or not eol:
        if not eol:
            orig_text += '\n'
            phono_text += '\n'
            word_file.write('{}\t{}\t{}\n'.format('', '', '+'))
        orig_file.write(orig_text)
        phono_file.write(phono_text)
        combi_file.write('{}{}\n'.format(orig_text, phono_text))

phono_file.close()
orig_file.close()
combi_file.close()
word_file.close()
msg('{:>5} verses done'.format(nv))

## Consistency check

We take the phono and wordph files that were just generated.
From the phono file we strip the passage indicators, and from the wordph file we strip the node numbers.

They should be consistent.

In [None]:
phono_file = infile(phono_fname)
word_file = infile(word_fname)

phono_test = outfile('phono_x{}.txt'.format(version))
word_test = outfile('wordph_x{}.txt'.format(version))

strip_passage = re.compile(r'^\S+ [0-9]+:[0-9]+\s*')

msg("Reading phono")
i = 0
for line in phono_file:
    i += 1
    phono_test.write(strip_passage.sub('', line))
msg('{} lines'.format(i))

msg("Reading wordph")
i = 0
for line in word_file:
    (mat, sep) = line[0:-1].split('\t')[1:3]
    if sep == '+':
        i += 1
        sep = '\n'
    word_test.write(mat+sep)
    if sep.endswith('.'):
        word_test.write('\n')
        i += 1
msg('{} lines'.format(i))

phono_file.close()
word_file.close()
phono_test.close()
word_test.close()
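
The cell above only writes the two stripped files; it does not compare them. One possible way to do the comparison is sketched below, assuming that "consistent" means the two files agree line for line. The helper `first_differences` is hypothetical and takes any two iterables of lines, so it works with open file objects as well as with lists.

```python
from itertools import zip_longest


def first_differences(lines_a, lines_b, limit=5):
    """Return up to `limit` tuples (line number, line from a, line from b)
    for the positions where the two line sequences differ.
    A missing line (one input shorter than the other) shows up as None."""
    diffs = []
    for (n, (a, b)) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            diffs.append((n, a, b))
            if len(diffs) >= limit:
                break
    return diffs


# Small self-contained demonstration with in-memory lines.
demo_a = ['bᵊrēšît\n', 'second line\n']
demo_b = ['bᵊrēšît\n', 'second lime\n']
for (n, a, b) in first_differences(demo_a, demo_b):
    print('line {}: {!r} != {!r}'.format(n, a, b))
```

In the notebook, the two inputs could be the files just written, reopened with `infile('phono_x{}.txt'.format(version))` and `infile('wordph_x{}.txt'.format(version))`; an empty result would then indicate that the files agree.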

## Tetra

**Bonus**
Here is a regular expression that finds all occurrences of the tetragrammaton, and only those.

In [None]:
ncons = '[^>bgdhwzxvjklmns<pyqrfct _&$-]' # not a consonant
tetra_pat = re.compile(r'([0-9]*j{n}*h{n}*w{n}*h{n}*)(?=[ _&$-]|\Z)'.format(n=ncons))
tetras = set()
matched = set()
tetra_not_matched = set()
not_tetra_matched = set()
for node in F.otype.s('word'):
    rep = get_orig(node, sep=-1).lower()
    is_tetra = F.lex.v(node) == 'JHWH/'
    is_match = tetra_pat.match(rep)
    if is_tetra: tetras.add(node)
    if is_match: matched.add(node)
    if is_tetra and not is_match: tetra_not_matched.add(node)
    if not is_tetra and is_match: not_tetra_matched.add(node)
print('{:<20}: {}'.format('Tetras', len(tetras)))
print('{:<20}: {}'.format('Matched', len(matched)))
print('{:<20}: {}'.format('Tetras not matched', len(tetra_not_matched)))
print('{:<20}: {}'.format('Not tetras matched', len(not_tetra_matched)))

In [None]:
print(', '.join(sorted(get_orig(w, sep=-1).lower() for w in tetra_not_matched)))
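
To see how the pattern behaves on individual strings without loading the corpus, here is a small standalone sketch. It copies the pattern from the cell above; the helper name `tetra_prefix` is hypothetical, and the second and third test strings are chosen for illustration only.

```python
import re

# Same character class and pattern as above, written as raw strings.
NCONS = r'[^>bgdhwzxvjklmns<pyqrfct _&$-]'  # anything that is not a consonant or separator
TETRA_PAT = re.compile(r'([0-9]*j{n}*h{n}*w{n}*h{n}*)(?=[ _&$-]|\Z)'.format(n=NCONS))


def tetra_prefix(rep):
    """Return the matched tetragrammaton string at the start of `rep`
    (after lowercasing), or None if the pattern does not match there."""
    m = TETRA_PAT.match(rep.lower())
    return m.group(1) if m else None


# j, h, w, h with only vowels and accent digits in between: a match.
print(tetra_prefix('J:HW@75H $'))
# An extra consonant (d) between w and h: no match.
print(tetra_prefix('j:hwd@h'))
# A word that does not start with j: no match.
print(tetra_prefix('>et'))
```

Because `match` anchors at the start of the string, a form with material in front of the `j` is not matched by this sketch, just as in the loop above.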

# Unicode diacritics
Here is an overview of various Latin characters with diacritics, as far as they are in the Unicode standard.