# Probabilistic Hebrew morphology

The script below trains a sequence-to-sequence (seq2seq) model with the Python library Keras, which is well suited to fast experimentation with neural networks. The basis of the model is a Long Short-Term Memory (LSTM) recurrent neural network. The classic example of such a model is Google's [Neural Machine Translator](https://github.com/tensorflow/nmt).
A seq2seq model is more complicated than a sequence classification model, because the output is a complete sequence. The input and output sequences do not have to be of the same length.

In this model the input sequence is the consonantal representation of a word (g_cons). The output sequence consists of a concatenation of the word's verbal stem (vbs), preformative (pfm), lexeme (lex), verbal ending (vbe), nominal ending (nme) and pronominal suffix (prs), separated by plus signs. If a feature has no value ('n/a', 'absent' or ''), it gets the value 'n'. For instance, the input 'JHJ' (3rd person sg yiqtol of 'HJH') has the output value 'n+J+HJH+n+n+n'.
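
The target format can be illustrated with a minimal sketch. The helper name `encode_analysis` and the raw feature values passed to it are assumptions made for illustration; the notebook itself builds the string inside `to_string`.

```python
# Values that count as "no value" for a feature, as described above.
ABSENT = {'', 'n/a', 'absent'}

def encode_analysis(vbs, pfm, lex, vbe, nme, prs):
    """Join six morphological features into one '+'-separated target string.

    Hypothetical helper: features without a value are replaced by 'n'.
    """
    features = [vbs, pfm, lex, vbe, nme, prs]
    return '+'.join('n' if value in ABSENT else value for value in features)

# Illustrative raw values for the form 'JHJ'; only pfm and lex carry a value.
example = encode_analysis('n/a', 'J', 'HJH', '', 'n/a', 'absent')
print(example)
```

Because every feature is always present in the string, the position of a value between the plus signs tells which feature it belongs to.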

Without extra information there is a clear upper bound on what this approach can reach. For instance, 'LK' can be interpreted as lex 'L' with a pronominal suffix, but it can also be the imperative of 'HLK'. Nevertheless, the present model still leaves substantial room for improvement.
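
One way to make this ceiling concrete is sketched below, under the assumption that the model sees each word in isolation: a context-free predictor can at best return the most frequent analysis of every input form. The function name `oracle_upper_bound` and the toy pairs are hypothetical and not part of the original script.

```python
from collections import Counter, defaultdict

def oracle_upper_bound(pairs):
    """Best accuracy any context-free mapping from form to analysis can reach.

    `pairs` is a list of (consonantal form, analysis string) tuples.
    """
    counts = defaultdict(Counter)
    for form, analysis in pairs:
        counts[form][analysis] += 1
    # For every form, only its most frequent analysis can be predicted correctly.
    best = sum(c.most_common(1)[0][1] for c in counts.values())
    return best / len(pairs)

# Toy data (made up): 'LK' occurs with two competing analyses.
toy_pairs = [
    ('LK', 'n+n+L+n+n+K'),
    ('LK', 'n+n+HLK+n+n+n'),
    ('LK', 'n+n+L+n+n+K'),
    ('JHJ', 'n+J+HJH+n+n+n'),
]
bound = oracle_upper_bound(toy_pairs)
print(bound)
```

Applied to the real training pairs, such a count would indicate how much of the remaining error is due to genuine ambiguity rather than to the model.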

The code is based on chapter 9 of Jason Brownlee, Long Short-Term Memory Networks with Python, 2017.

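The training data consist of all the words in the Masoretic Text (MT), except for Jonah and Ruth. The test data consist of the words in Jonah and Ruth. In both sets proper nouns are excluded, and the training set is restricted to Hebrew words (language code 'hbo').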

Training an LSTM model is a computationally intensive job, so it may take a while.

In [4]:
import sys, collections, os, re

from random import seed
from random import randint
from numpy import array
from math import ceil
from math import log10
from math import sqrt
from numpy import argmax
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

In [6]:
from tf.fabric import Fabric

DATABASE = '~/github'
BHSA = 'bhsa/tf/c'

TF = Fabric(locations=[DATABASE], modules=[BHSA], silent=False )

This is Text-Fabric 3.0.3
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

107 features found and 0 ignored


In [7]:
api = TF.load('''
    lex g_cons sp pfm vbs nme uvf prs vbe language
''')

  0.00s loading features ...
   |     0.09s B g_cons               from C:/Users/Martijn/github/bhsa/tf/c
   |     0.09s B lex                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.10s B sp                   from C:/Users/Martijn/github/bhsa/tf/c
   |     0.10s B pfm                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.10s B vbs                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.09s B nme                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.10s B uvf                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.10s B prs                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.09s B vbe                  from C:/Users/Martijn/github/bhsa/tf/c
   |     0.09s B language             from C:/Users/Martijn/github/bhsa/tf/c
   |     0.00s Feature overview: 102 for nodes; 4 for edges; 1 configs; 7 computed
  4.28s All features loaded/computed - for details use loadLog()


In [12]:
api.loadLog()
api.makeAvailableIn(globals())

In [13]:
def prepare_data_train(n_examples):
    n_verbs = 0
    max_con = 0
    max_an = 0
    wo_list = []
    info_dict = {}
    len_dict = {}
    alphabet = set()
    for word in F.otype.s('word'):
        if n_verbs < n_examples and T.bookName(word) not in {'Jonah', 'Ruth'} and F.language.v(word) == 'hbo':
            if F.sp.v(word) != 'nmpr': # proper nouns are excluded
                n_verbs += 1
                vbs = F.vbs.v(word)
                if vbs in {'', 'n/a', 'absent'}:
                    vbs = 'n'
                pfm = F.pfm.v(word)
                if pfm in {'', 'n/a', 'absent'}:
                    pfm = 'n'
                vbe = F.vbe.v(word)
                if vbe in {'', 'n/a', 'absent'}:
                    vbe = 'n'
                nme = F.nme.v(word)
                if nme in {'', 'n/a', 'absent'}:
                    nme = 'n'
                prs = F.prs.v(word)
                if prs in {'', 'n/a', 'absent'}:
                    prs = 'n'
                root = F.lex.v(word).strip('/').strip('[').strip('=')
                # length of the analysis without the five '+' separators
                an_length = len(vbs) + len(pfm) + len(root) + len(vbe) + len(nme) + len(prs)
                cons = F.g_cons.v(word)
                for elem in [vbs, pfm, cons, vbe, nme, root, prs]:
                    for char in elem:
                        alphabet.add(char)

                con_length = len(cons)
                wo_list.append(word)
                info_dict[word] = [cons, vbs, pfm, root, vbe, nme, prs]
                len_dict[word] = [con_length, an_length]
                if an_length > max_an:
                    max_an = an_length
                if con_length > max_con:
                    max_con = con_length
                    
    alphabet.add(' ')
    alphabet.add('+')
    print('max_an = ' ,max_an)
    return wo_list, info_dict, len_dict, max_con, max_an, list(alphabet)                

In [14]:
def prepare_data_test(n_examples):
    n_verbs = 0
    max_con = 0
    max_an = 0
    wo_list = []
    info_dict = {}
    len_dict = {}
    for word in F.otype.s('word'):
        if n_verbs < n_examples and T.bookName(word) in {'Jonah', 'Ruth'}:       
            if F.sp.v(word) != 'nmpr':
                n_verbs += 1
                vbs = F.vbs.v(word)
                if vbs in {'', 'n/a', 'absent'}:
                    vbs = 'n'
                pfm = F.pfm.v(word)
                if pfm in {'', 'n/a', 'absent'}:
                    pfm = 'n'
                vbe = F.vbe.v(word)
                if vbe in {'', 'n/a', 'absent'}:
                    vbe = 'n'
                nme = F.nme.v(word)
                if nme in {'', 'n/a', 'absent'}:
                    nme = 'n'
                prs = F.prs.v(word)
                if prs in {'', 'n/a', 'absent'}:
                    prs = 'n'
                    
                root = F.lex.v(word).strip('/').strip('[').strip('=')
                an_length = len(vbs) + len(pfm) + len(root) + len(vbe) + len(nme) + len(prs)
                cons = F.g_cons.v(word)

                con_length = len(cons)
                wo_list.append(word)
                info_dict[word] = [cons, vbs, pfm, root, vbe, nme, prs]
                len_dict[word] = [con_length, an_length]

    return wo_list, info_dict, len_dict             

In [15]:
# convert data to strings
def to_string(wo_list, info_dict, len_dict, max_con, max_an):
    
    ystr = list()
    for wo in wo_list:
        # join vbs, pfm, lex, vbe, nme and prs with '+' and left-pad with spaces
        strp = '+'.join(info_dict[wo][1:])
        strp2 = strp.rjust(max_an)
        ystr.append(strp2)
    
    Xstr = list()
    for wo in wo_list:
        strp = info_dict[wo][0]
        conson = ''.join([' ' for _ in range(max_con - len(strp))]) + strp
        Xstr.append(conson)
    return Xstr, ystr
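
The padding step in `to_string` can be shown in isolation. This is a minimal sketch; the helper `left_pad` is an assumption introduced here, and the widths 9 and 20 are taken from the array shapes printed further down.

```python
def left_pad(text, width):
    """Left-pad `text` with spaces to a fixed width.

    Strings longer than `width` are returned unchanged.
    """
    return text.rjust(width)

padded_input = left_pad('JHJ', 9)              # input width (max_con)
padded_output = left_pad('n+J+HJH+n+n+n', 20)  # output width (max_an + 5)
print(repr(padded_input))
print(repr(padded_output))
```

Padding on the left keeps the end of the word aligned across examples, so the last characters always sit at the same time steps.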

In [16]:
# integer encode strings
def integer_encode(X, y, alphabet):
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    Xenc = list()
    for pattern in X:
        integer_encoded = [char_to_int[char] for char in pattern]
        Xenc.append(integer_encoded)
    yenc = list()
    for pattern in y:
        integer_encoded = [char_to_int[char] for char in pattern]
        yenc.append(integer_encoded)
    return Xenc, yenc

In [17]:
# one hot encode
def one_hot_encode(X, y, max_int):
    Xenc = list()
    for seq in X:
        pattern = list()
        for index in seq:
            vector = [0 for _ in range(max_int)]
            vector[index] = 1
            pattern.append(vector)
        Xenc.append(pattern)
    yenc = list()
    for seq in y:
        pattern = list()
        for index in seq:
            vector = [0 for _ in range(max_int)]
            vector[index] = 1
            pattern.append(vector)
        yenc.append(pattern)
    return Xenc, yenc

In [18]:
# generate an encoded dataset
def generate_data(wo_list, info_dict, len_dict, max_con, max_an, alphabet):
    
    X, y = to_string(wo_list, info_dict, len_dict, max_con, max_an)
    
    X, y = integer_encode(X, y, alphabet)
    
    X, y = one_hot_encode(X, y, len(alphabet))
    
    X, y = array(X), array(y)
    
    return X, y

In [19]:
# invert encoding
def invert(seq, alphabet):
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))
    strings = list()
    for pattern in seq:
        string = int_to_char[argmax(pattern)]
        strings.append(string)
    return ''.join(strings)
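
The encoding and its inverse can be checked with a small round trip. The sketch below is a compact NumPy variant, not the notebook's own functions; the toy alphabet and the names `one_hot` and `decode` are assumptions.

```python
import numpy as np

# Toy alphabet (assumption); sorting fixes the order of the characters.
alphabet = sorted(set('JH+n '))

def one_hot(text, alphabet):
    """Return a (len(text), len(alphabet)) matrix with one 1 per row."""
    char_to_int = {c: i for i, c in enumerate(alphabet)}
    encoded = np.zeros((len(text), len(alphabet)), dtype=int)
    for position, char in enumerate(text):
        encoded[position, char_to_int[char]] = 1
    return encoded

def decode(matrix, alphabet):
    """Invert the encoding by taking the argmax of every row."""
    return ''.join(alphabet[i] for i in matrix.argmax(axis=1))

target = '  n+J+HJH+n+n+n'
encoded = one_hot(target, alphabet)
restored = decode(encoded, alphabet)
print(encoded.shape, repr(restored))
```

The notebook builds its alphabet with `list(alphabet)` from a set, whose order can differ between Python sessions. Sorting the characters, as done here, is one way to keep the mapping stable if a trained model is reused later.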

In [None]:
wo_list, info_dict, len_dict, max_con, max_an, alphabet = prepare_data_train(400000)

n_chars = len(alphabet)

n_in_seq_length = max_con

n_out_seq_length = max_an + 5

# define LSTM
model = Sequential()
model.add(LSTM(315, input_shape=(n_in_seq_length, n_chars)))
model.add(RepeatVector(n_out_seq_length))
model.add(LSTM(265, return_sequences=True))
model.add(TimeDistributed(Dense(n_chars, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

X, y = generate_data(wo_list, info_dict, len_dict, max_con, max_an + 5, alphabet)
print(X.shape)
print(y.shape)
model.fit(X, y, epochs=1, batch_size=32)


max_an =  15
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 315)               433440    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 20, 315)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 20, 265)           615860    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 20, 28)            7448      
Total params: 1,056,748
Trainable params: 1,056,748
Non-trainable params: 0
_________________________________________________________________
None
(385093, 9, 28) (385093, 20, 28)
(385093, 9, 28)
(385093, 20, 28)
Epoch 1/1
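
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the helper names are assumptions.

```python
def lstm_params(input_dim, units):
    """Weights of an LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights of a dense layer: one weight per input plus a bias, per unit."""
    return units * (input_dim + 1)

n_chars = 28                               # size of the alphabet
encoder = lstm_params(n_chars, 315)        # lstm_1
decoder = lstm_params(315, 265)            # lstm_2, fed by the repeated encoder state
output = dense_params(265, n_chars)        # time-distributed softmax layer
total = encoder + decoder + output
print(encoder, decoder, output, total)
```

The `RepeatVector` layer has no weights of its own, and the time-distributed dense layer shares one set of weights across all output positions, which is why its count does not depend on the sequence length.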

In [17]:
# evaluate LSTM
wo_list, info_dict, len_dict = prepare_data_test(2000)
X, y = generate_data(wo_list, info_dict, len_dict, max_con, max_an + 5, alphabet)
loss, acc = model.evaluate(X, y, verbose=0)
print('Loss: %f, Accuracy: %f' % (loss, acc*100))

(2000, 9, 28) (2000, 20, 28)
Loss: 0.157896, Accuracy: 94.247500
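
As far as we can tell, the accuracy reported by Keras for this kind of output is computed per character position, padding included, so it is higher than the share of words that are analysed completely correctly. A word-level score could be computed along the following lines. This is a sketch with made-up toy arrays, and `exact_match_accuracy` is a hypothetical helper.

```python
import numpy as np

def exact_match_accuracy(y_true, y_pred):
    """Share of sequences in which every position is predicted correctly.

    Both arrays have shape (n_words, n_positions, n_chars).
    """
    true_ids = np.argmax(y_true, axis=-1)
    pred_ids = np.argmax(y_pred, axis=-1)
    return float(np.mean(np.all(true_ids == pred_ids, axis=1)))

# Toy example: two words of three positions over a three-character alphabet.
y_true = np.eye(3)[np.array([[0, 1, 2], [0, 2, 2]])]
y_pred = np.eye(3)[np.array([[0, 1, 2], [0, 1, 2]])]

word_acc = exact_match_accuracy(y_true, y_pred)
char_acc = float(np.mean(np.argmax(y_true, -1) == np.argmax(y_pred, -1)))
print(word_acc, char_acc)
```

With the notebook's variables the call would be `exact_match_accuracy(y, model.predict(X))`, provided `y` is padded to the same length as the model output.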


In [54]:
# the data and the predictions do not change, so compute them once
# note: max_an + 4 pads the expected strings one character shorter than the
# max_an + 5 time steps the model outputs; this only shifts the printout
X, y = generate_data(wo_list, info_dict, len_dict, max_con, max_an + 4, alphabet)
yhat = model.predict(X, verbose=0)
for i in range(100):
    # decode input, expected and predicted
    in_seq = invert(X[i], alphabet)
    out_seq = invert(y[i], alphabet)
    predicted = invert(yhat[i], alphabet)
    print('%s = %s (expect %s)' % (in_seq, predicted, out_seq))

(2000, 9, 28) (2000, 19, 28)
        W =          n+n+W+n+n+n (expect         n+n+W+n+n+n)
(2000, 9, 28) (2000, 19, 28)
      JHJ =        n+J+HJH+n+n+n (expect       n+J+HJH+n+n+n)
(2000, 9, 28) (2000, 19, 28)
      DBR =        n+n+BBR+n+n+n (expect       n+n+DBR+n+n+n)
(2000, 9, 28) (2000, 19, 28)
       >L =         n+n+>L+n+n+n (expect        n+n+>L+n+n+n)
(2000, 9, 28) (2000, 19, 28)
       BN =         n+n+BN+n+n+n (expect        n+n+BN+n+n+n)
(2000, 9, 28) (2000, 19, 28)
        L =          n+n+L+n+n+n (expect         n+n+L+n+n+n)
(2000, 9, 28) (2000, 19, 28)
      >MR =        n+n+>MR+n+n+n (expect       n+n+>MR+n+n+n)
(2000, 9, 28) (2000, 19, 28)
      QWM =        n+n+QWM+n+n+n (expect       n+n+QWM+n+n+n)
(2000, 9, 28) (2000, 19, 28)
       LK =          n+n+L+n+n+K (expect       n+n+HLK+n+n+n)
(2000, 9, 28) (2000, 19, 28)
       >L =         n+n+>L+n+n+n (expect        n+n+>L+n+n+n)
(2000, 9, 28) (2000, 19, 28)
        H =          n+n+H+n+n+n (expect         n+n+H+n+n+n)