# Character-based POS tagger for Biblical Hebrew

This notebook contains a character-based POS tagger for Biblical Hebrew. The input of the model is the text of a Biblical Hebrew clause and the output is the sequence of parts of speech of the words in that clause. The model is not told where the word boundaries are: the space is simply another input character.
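To make the input and output concrete, here is a small illustration. The clause and its tags are copied from the Jonah output at the end of this notebook; the variable names are only chosen for this sketch.

```python
# One clause in consonantal transcription, words joined by spaces
clause = "W JQM JWNH"
# The tags the model predicts for this clause in the output below
tags = ["conj", "verb", "nmpr"]

# The encoder reads one character per timestep; the space is an ordinary character
encoder_steps = list(clause)
# The decoder sequence gets a start sign (tab) and a stop sign (newline)
decoder_steps = ["\t"] + tags + ["\n"]

print(encoder_steps)
print(len(encoder_steps), "characters ->", len(tags), "parts of speech")
```

The number of input characters and the number of output tags differ, which is why a sequence-to-sequence model is used instead of a model that assigns one label per input position.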

First, some libraries are imported: NumPy, Keras and, of course, Text-Fabric.

In [1]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

Using TensorFlow backend.


In [2]:
from tf.app import use
A = use('bhsa', hoist=globals())
A.displaySetup(extraFeatures='g_cons')

TF app is up-to-date.
Using annotation/app-bhsa commit 43c1c5e88b371f575cdbbf57e38167deb8725f7f (=latest)
  in C:\Users\geitb/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c r1.5 in C:\Users\geitb/text-fabric-data
Using etcbc/phono/tf - c r1.2 in C:\Users\geitb/text-fabric-data
Using etcbc/parallels/tf - c r1.2 in C:\Users\geitb/text-fabric-data


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis">BHSA</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Writing/Hebrew" title="('Hebrew characters and transcriptions',)">Character table</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/annotation/app-bhsa" title="bhsa API documentation">bhsa API</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Api/Fabric/" title="text-fabric-api">Text-Fabric API 7.3.15</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Use/Search/" title="Search Templates Introduction and Reference">Search Reference</a>

A training set and a test set are defined. The model is trained on all the books of the Masoretic Text (MT) except Jonah, and is then used to predict the parts of speech for Jonah.

In [74]:
train_books = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy', 'Joshua', 'Judges', '1_Samuel', 
               '2_Samuel','1_Kings', '2_Kings', 'Isaiah', 'Jeremiah', 'Ezekiel', 'Hosea', 'Joel', 'Amos', 
               'Obadiah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi', 
               'Psalms', 'Job', 'Proverbs', 'Ruth', 'Song_of_songs', 'Ecclesiastes', 'Lamentations',
               'Esther', 'Daniel', 'Ezra', 'Nehemiah', '1_Chronicles', '2_Chronicles']

test_books = ['Jonah']

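The data are prepared: the functions below extract the clauses and their parts of speech, build the vocabularies and one-hot encode the sequences.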

In [75]:
def prepare_train_data(books):
    """
    books is a list containing the books of the training set.
    Clauses of more than 10 words are skipped.
    The function returns:
    input_clauses is a list containing strings with the consonantal text of BH clauses
    output_pos is a list containing lists with all the pos of BH clauses, preceded by a start sign (tab)
    and followed by a stop sign (newline)
    input_chars is a sorted list containing the characters occurring in the input_clauses (the input vocabulary)
    output_vocab is a sorted list containing all the pos occurring in the training clauses, plus the 
    start and stop sign (the output vocabulary)
    max_len_input is the maximum length of all the input clauses in number of characters
    max_len_output is the maximum length of all the output sequences in number of words (+2, because a 
    start and stop sign are added)
    """

    input_clauses = []
    output_pos = []
    input_chars = set()
    output_vocab = set()

    for cl in F.otype.s("clause"): 
        
        bo, _, _ = T.sectionFromNode(cl)
        if bo not in books:
            continue
        
        # max length of a clause is 10 words
        if len(L.d(cl, "word")) > 10:
            continue
        
        # input and output is extracted from the bhsa
        words = " ".join([F.g_cons.v(w) for w in L.d(cl, "word")])
        pos_prepare = [F.sp.v(w) for w in L.d(cl, "word")]
        
        poss = ['\t']
        for elem in pos_prepare:
            poss.append(elem)
        poss.append('\n')
    
        input_clauses.append(words)
        output_pos.append(poss)
    
        for ch in words:
            input_chars.add(ch)
            
        for pos in poss:
            output_vocab.add(pos)
    
    input_chars = sorted(list(input_chars))
    output_vocab = sorted(list(output_vocab))
    
    max_len_input = max([len(clause) for clause in input_clauses])
    max_len_output = max([len(poss) for poss in output_pos])
    
    return input_clauses, output_pos, input_chars, output_vocab, max_len_input, max_len_output

In [76]:
def prepare_test_data(books):
    """
    books is a list containing the test books
    The function returns:
    input_clauses_test, a list containing the text of the clauses in the test books
    (clauses of more than 10 words are skipped, as in the training data)
    """

    input_clauses_test = []
    for cl in F.otype.s("clause"): 
        
        bo, _, _ = T.sectionFromNode(cl)
        if bo not in books:
            continue
        
        if len(L.d(cl, "word")) > 10:
            continue

        words = " ".join([F.g_cons.v(w) for w in L.d(cl, "word")])
        input_clauses_test.append(words)
    
    return input_clauses_test

In [77]:
def create_dicts(input_chars, output_vocab):
    """
    The network can only handle numeric data. This function provides four dicts. 
    Two of them map between integers and the input characters (one dict for every direction), the other two 
    map between integers and parts of speech.
    """
    
    input_idx2char = {}
    input_char2idx = {}

    for k, v in enumerate(input_chars):
        input_idx2char[k] = v
        input_char2idx[v] = k
        
    output_idx2char = {}
    output_char2idx = {}
    
    for k, v in enumerate(output_vocab):
        output_idx2char[k] = v
        output_char2idx[v] = k
        
    return input_idx2char, input_char2idx, output_idx2char, output_char2idx
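The mappings can be tried out on a toy example. This is a minimal sketch with two clauses instead of the BHSA, and it uses dict comprehensions instead of the loops above; the names `char2idx` and `idx2char` are only used here.

```python
# Toy data: two clauses and their pos sequences with start and stop sign
clauses = ["W JQM JWNH", "L >MR"]
tags = [["\t", "conj", "verb", "nmpr", "\n"], ["\t", "prep", "verb", "\n"]]

# Vocabularies are the sorted sets of symbols, as in prepare_train_data
input_chars = sorted({ch for clause in clauses for ch in clause})
output_vocab = sorted({tag for seq in tags for tag in seq})

# One dict for every direction
char2idx = {ch: i for i, ch in enumerate(input_chars)}
idx2char = {i: ch for ch, i in char2idx.items()}

# Encoding a clause to integers and decoding it again gives back the clause
encoded = [char2idx[ch] for ch in clauses[1]]
decoded = "".join(idx2char[i] for i in encoded)

print(input_chars)
print(encoded, "->", repr(decoded))
```

Because the vocabularies are sorted, the space gets the lowest index in the input vocabulary, and the start and stop signs get the two lowest indices in the output vocabulary.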

In [78]:
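def one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos):
    """
    Categorical data are generally one-hot encoded in neural networks, which is done here.
    Only the first nb_samples clauses are encoded.
    The function returns three arrays: the encoded input clauses, the encoded pos sequences 
    (the decoder input) and the target data, which is the decoder input shifted one timestep ahead.
    """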

    tokenized_input_data = np.zeros(shape = (nb_samples,max_len_input,len(input_chars)), dtype='float32')
    tokenized_output = np.zeros(shape = (nb_samples,max_len_output,len(output_vocab)), dtype='float32')
    target_data = np.zeros((nb_samples, max_len_output, len(output_vocab)),dtype='float32')

    for i in range(nb_samples):
        for k, ch in enumerate(input_clauses[i]):
            tokenized_input_data[i, k, input_char2idx[ch]] = 1
        
        for k, ch in enumerate(output_pos[i]):
            tokenized_output[i, k, output_char2idx[ch]] = 1

            # target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i, k-1, output_char2idx[ch]] = 1
                
    return tokenized_input_data, tokenized_output, target_data
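The relation between the decoder input and the target data can be shown without one-hot encoding. The sketch below uses a toy pos sequence and plain lists; it only illustrates the shift by one timestep (often called teacher forcing), not the arrays themselves.

```python
# Toy pos sequence with start sign (tab) and stop sign (newline)
poss = ["\t", "conj", "verb", "nmpr", "\n"]

# At timestep k the decoder gets poss[k] as input and has to predict poss[k + 1]
pairs = list(zip(poss, poss[1:]))

for decoder_in, target in pairs:
    print(repr(decoder_in), "->", repr(target))
```

In the arrays returned by `one_hot_encode`, the timestep at which the stop sign is the decoder input has no target: that row of `target_data` stays all zeros, just like the padding after it.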

In [79]:
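def define_LSTM_model(input_chars, output_vocab):
    """
    Defines an encoder-decoder (sequence-to-sequence) model.
    The encoder consists of two stacked LSTM layers that read the one-hot encoded characters.
    Its final hidden and cell states initialize the decoder LSTM, which predicts the pos 
    sequence through a softmax layer.
    The inputs, states and decoder layers are returned as well, because they are reused 
    in the inference models.
    """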

    # Encoder model

    encoder_input = Input(shape=(None,len(input_chars)))
    encoder_LSTM = LSTM(512,activation = 'relu',return_state = True, return_sequences=True)(encoder_input)
    encoder_LSTM = LSTM(512,return_state = True)(encoder_LSTM)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM
    encoder_states = [encoder_h, encoder_c]
    
    # Decoder model

    decoder_input = Input(shape=(None,len(output_vocab)))
    decoder_LSTM = LSTM(512, return_sequences=True, return_state = True)
    decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(output_vocab), activation='softmax')
    decoder_out = decoder_dense (decoder_out)
    
    model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

    model.summary()

    return encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model

In [80]:
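def compile_and_train(model, tokenized_input, tokenized_output, batch_size, epochs, validation_split):
    """
    Compiles the model and trains it. During training the decoder gets the correct pos 
    sequence as input and learns to predict the same sequence one timestep ahead.
    Note that target_data is not an argument of this function: it is read from the global scope.
    """
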
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(x=[tokenized_input,tokenized_output], 
              y=target_data,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=validation_split)
    
    return model

In [81]:
nb_samples = 70000

input_clauses, output_pos, input_chars, output_vocab, max_len_input, max_len_output = prepare_train_data(train_books)
input_idx2char, input_char2idx, output_idx2char, output_char2idx = create_dicts(input_chars, output_vocab)
tokenized_input, tokenized_output, target_data = one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos)

In [82]:
test_clauses = prepare_test_data(test_books)
tokenized_test_data, _, _ = one_hot_encode(len(test_clauses), max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, test_clauses, output_pos)
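The test clauses are encoded with the dicts and maximum lengths of the training data. A character that does not occur in the training books would therefore raise a `KeyError`, and a clause longer than `max_len_input` would raise an `IndexError`. A minimal sketch of a check that could be run before encoding; `find_encoding_problems` is a hypothetical helper that is not part of this notebook, and the data are toy data.

```python
def find_encoding_problems(clauses, known_chars, max_len):
    """Return (index, unknown characters, length) for every clause that cannot be encoded."""
    known = set(known_chars)
    problems = []
    for i, clause in enumerate(clauses):
        unknown = sorted(set(clause) - known)
        if unknown or len(clause) > max_len:
            problems.append((i, unknown, len(clause)))
    return problems


# Toy vocabulary and clauses, not taken from the BHSA
toy_chars = [" ", ">", "J", "L", "M", "R", "W"]
toy_clauses = ["W J>MR", "L >MR", "W JQM"]

print(find_encoding_problems(toy_clauses, toy_chars, max_len=5))
```

With the real data the call would be `find_encoding_problems(test_clauses, input_chars, max_len_input)`; an empty list means that all test clauses can be encoded.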

In [83]:
encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model = define_LSTM_model(input_chars, output_vocab)
model = compile_and_train(model, tokenized_input, tokenized_output, 512, 70, 0.1)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_19 (InputLayer)           (None, None, 25)     0                                            
__________________________________________________________________________________________________
lstm_19 (LSTM)                  [(None, None, 512),  1101824     input_19[0][0]                   
__________________________________________________________________________________________________
input_20 (InputLayer)           (None, None, 16)     0                                            
__________________________________________________________________________________________________
lstm_20 (LSTM)                  [(None, 512), (None, 2099200     lstm_19[0][0]                    
                                                                 lstm_19[0][1]                    
          

Epoch 58/70
Epoch 59/70
Epoch 60/70
Epoch 61/70
Epoch 62/70
Epoch 63/70
Epoch 64/70
Epoch 65/70
Epoch 66/70
Epoch 67/70
Epoch 68/70
Epoch 69/70
Epoch 70/70


In [84]:
# Inference models for testing

# Encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(512,))
decoder_state_input_c = Input(shape=(512,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                 initial_state=decoder_input_states)

decoder_states = [decoder_h , decoder_c]

decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                          outputs=[decoder_out] + decoder_states )

In [85]:
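def decode_seq(inp_seq):
    """
    Greedily decodes the pos sequence of one one-hot encoded clause.
    Returns the list of predicted pos, normally ending with the stop sign.
    """
    
    # Initial states value is coming from the encoder 
    states_val = encoder_model_inf.predict(inp_seq)
    
    # The decoder starts with the start sign
    target_seq = np.zeros((1, 1, len(output_vocab)))
    target_seq[0, 0, output_char2idx['\t']] = 1
    
    pred_pos = []
    stop_condition = False
    
    while not stop_condition:
        
        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)
        
        # Take the most probable pos at this timestep
        max_val_index = np.argmax(decoder_out[0,-1,:])
        sampled_out_char = output_idx2char[max_val_index]
        pred_pos.append(sampled_out_char)
        
        # Stop at the stop sign, or when the length of the longest training sequence 
        # is reached, so that the loop always terminates
        if sampled_out_char == '\n' or len(pred_pos) >= max_len_output:
            stop_condition = True
        
        # The prediction is the decoder input for the next timestep
        target_seq = np.zeros((1, 1, len(output_vocab)))
        target_seq[0, 0, max_val_index] = 1
        
        states_val = [decoder_h, decoder_c]
        
    return pred_pos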
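The decoding strategy in `decode_seq` is greedy: at every timestep the most probable pos is chosen and fed back as the next decoder input. The sketch below shows the same loop without Keras. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is only a stand-in for the trained decoder that replays a fixed sequence.

```python
def greedy_decode(step_fn, start, stop, max_steps):
    """Feed the most probable token back into step_fn until the stop sign appears."""
    tokens, prev, state = [], start, None
    for _ in range(max_steps):
        probs, state = step_fn(prev, state)
        prev = max(probs, key=probs.get)   # argmax over the vocabulary
        if prev == stop:
            break
        tokens.append(prev)
    return tokens


def toy_step(prev_token, state):
    """Stand-in for the decoder: returns (probabilities, new state)."""
    script = ["conj", "verb", "nmpr", "\n"]
    position = 0 if state is None else state
    probs = {tag: 0.0 for tag in script}
    probs[script[min(position, len(script) - 1)]] = 1.0
    return probs, position + 1


print(greedy_decode(toy_step, "\t", "\n", max_steps=12))
```

The `max_steps` argument bounds the number of timesteps, so the loop ends even if the stop sign is never predicted.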



In [86]:
for seq_index in range(220):
    inp_seq = tokenized_test_data[seq_index:seq_index+1]
    
    pred_pos = decode_seq(inp_seq)
    print('-')
    print('Input sentence:', test_clauses[seq_index])
    print('Decoded sentence:', pred_pos[:-1])

-
Input sentence: W JHJ DBR JHWH >L JWNH BN >MTJ
Decoded sentence: ['conj', 'verb', 'subs', 'nmpr', 'prep', 'subs', 'subs', 'subs']
-
Input sentence: L >MR
Decoded sentence: ['prep', 'verb']
-
Input sentence: QWM
Decoded sentence: ['verb']
-
Input sentence: LK >L NJNWH H <JR H GDWLH
Decoded sentence: ['verb', 'prep', 'subs', 'art', 'subs', 'art', 'adjv']
-
Input sentence: W QR> <LJH
Decoded sentence: ['conj', 'verb', 'prep']
-
Input sentence: KJ <LTH R<TM L PNJ
Decoded sentence: ['conj', 'verb', 'subs', 'prep', 'subs']
-
Input sentence: W JQM JWNH
Decoded sentence: ['conj', 'verb', 'nmpr']
-
Input sentence: L BRX TRCJCH M L PNJ JHWH
Decoded sentence: ['prep', 'verb', 'nmpr', 'prep', 'prep', 'subs', 'nmpr']
-
Input sentence: W JRD JPW
Decoded sentence: ['conj', 'verb', 'subs']
-
Input sentence: W JMY> >NJH
Decoded sentence: ['conj', 'verb', 'nmpr']
-
Input sentence: B>H TRCJC
Decoded sentence: ['verb', 'nmpr']
-
Input sentence: W JTN FKRH
Decoded sentence: ['conj', 'verb', 'subs']
-
[output truncated]

-
Input sentence: >K >WSJP
Decoded sentence: ['advb', 'verb']
-
Input sentence: L HBJV >L HJKL QDCK
Decoded sentence: ['prep', 'verb', 'prep', 'subs', 'subs']
-
Input sentence: >PPWNJ MJM <D NPC
Decoded sentence: ['verb', 'subs', 'prep', 'subs']
-
Input sentence: THWM JSBBNJ
Decoded sentence: ['subs', 'verb']
-
Input sentence: SWP XBWC L R>CJ
Decoded sentence: ['verb', 'subs', 'prep', 'subs']
-
Input sentence: L QYBJ HRJM JRDTJ
Decoded sentence: ['prep', 'subs', 'verb', 'verb']
-
Input sentence: H >RY
Decoded sentence: ['art', 'subs']
-
Input sentence: BRXJH B<DJ L <WLM
Decoded sentence: ['verb', 'subs', 'prep', 'subs']
-
Input sentence: W T<L M CXT XJJ
Decoded sentence: ['conj', 'verb', 'prep', 'subs', 'nmpr']
-
Input sentence: JHWH >LHJ
Decoded sentence: ['nmpr', 'subs']
-
Input sentence: B HT<VP <LJ NPCJ
Decoded sentence: ['prep', 'verb', 'prep', 'subs']
-
Input sentence: >T JHWH ZKRTJ
Decoded sentence: ['prep', 'nmpr', 'verb']
-
Input sentence: W TBW> >LJK TPLTJ >L HJKL QDCK
[output truncated]

Input sentence: B <LWT H CXR L  MXRT
Decoded sentence: ['prep', 'verb', 'art', 'subs', 'prep', 'art', 'subs']
-
Input sentence: W TK >T H QJQJWN
Decoded sentence: ['conj', 'verb', 'prep', 'art', 'adjv']
-
Input sentence: W JJBC
Decoded sentence: ['conj', 'verb']
-
Input sentence: W JHJ
Decoded sentence: ['conj', 'verb']
-
Input sentence: K ZRX H CMC
Decoded sentence: ['prep', 'verb', 'art', 'subs']
-
Input sentence: W JMN >LHJM RWX QDJM XRJCJT
Decoded sentence: ['conj', 'verb', 'subs', 'subs', 'subs', 'adjv']
-
Input sentence: W TK H CMC <L R>C JWNH
Decoded sentence: ['conj', 'verb', 'art', 'subs', 'prep', 'subs', 'nmpr']
-
Input sentence: W JT<LP
Decoded sentence: ['conj', 'verb']
-
Input sentence: W JC>L >T NPCW
Decoded sentence: ['conj', 'verb', 'prep', 'subs']
-
Input sentence: L MWT
Decoded sentence: ['prep', 'verb']
-
Input sentence: W J>MR
Decoded sentence: ['conj', 'verb']
-
Input sentence: VWB MWTJ M XJJ
Decoded sentence: ['adjv', 'subs', 'prep', 'adjv']
-
Input sentence: W J>