# Character-based POS tagger for Biblical Hebrew

This notebook implements a character-based part-of-speech (POS) tagger for Biblical Hebrew. The input of the model is the text of a Biblical Hebrew clause, and the output is the sequence of parts of speech of its words. The model is not told where the word boundaries are: the space is simply another input character.
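
To make the character-level view concrete, the following minimal sketch splits one transliterated clause from Jonah into the symbols an encoder of this kind would read. The variable names are illustrative and are not used elsewhere in the notebook.

```python
# A clause from Jonah in ETCBC consonantal transliteration.
clause = "W JHJ DBR JHWH >L JWNH BN >MTJ"

# One symbol per time step; spaces are ordinary symbols, not word separators.
symbols = list(clause)
n_spaces = symbols.count(" ")

print(len(symbols), "symbols,", n_spaces, "of them spaces")
print(symbols[:6])
```

The clause has eight words, but the model only sees a flat sequence of characters and has to infer from them how many parts of speech to emit.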

First, the required libraries are imported: NumPy, Keras and, of course, Text-Fabric.

In [4]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

Using TensorFlow backend.


In [5]:
from tf.app import use
A = use('bhsa', hoist=globals())
A.displaySetup(extraFeatures='g_cons')

TF app is up-to-date.
Using annotation/app-bhsa commit 43c1c5e88b371f575cdbbf57e38167deb8725f7f (=latest)
  in C:\Users\geitb/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c r1.5 in C:\Users\geitb/text-fabric-data
Using etcbc/phono/tf - c r1.2 in C:\Users\geitb/text-fabric-data
Using etcbc/parallels/tf - c r1.2 in C:\Users\geitb/text-fabric-data


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis">BHSA</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Writing/Hebrew" title="('Hebrew characters and transcriptions',)">Character table</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/annotation/app-bhsa" title="bhsa API documentation">bhsa API</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Api/Fabric/" title="text-fabric-api">Text-Fabric API 7.3.15</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Use/Search/" title="Search Templates Introduction and Reference">Search Reference</a>

A training set and a test set are defined. The model is trained on all books of the Masoretic Text (MT) except Jonah, which is held out so that the model can predict its parts of speech.

In [67]:
train_books = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy', 'Joshua', 'Judges', '1_Samuel', 
               '2_Samuel','1_Kings', '2_Kings', 'Isaiah', 'Jeremiah', 'Ezekiel', 'Hosea', 'Joel', 'Amos', 
               'Obadiah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi', 
               'Psalms', 'Job', 'Proverbs', 'Ruth', 'Song_of_songs', 'Ecclesiastes', 'Lamentations',
               'Esther', 'Daniel', 'Ezra', 'Nehemiah', '1_Chronicles', '2_Chronicles']

test_books = ['Jonah']

The following functions extract the clauses and their parts of speech from the BHSA and prepare them for the model.

In [98]:
def prepare_train_data(books):
    """
    books is a list containing the books of the training set.
    The function returns:
    input_clauses is a list containing strings with the text of BH clauses
    output_pos is a list containing lists with all the pos of BH clauses
    input_chars is a sorted list containing the characters occurring in the input_clauses (the input vocabulary)
    output_vocab is a sorted list containing all the pos occurring in the selected clauses, plus the start and stop sign
    max_len_input is the maximum length of all the input clauses in number of characters
    max_len_output is the maximum length of all the output sequences in number of parts of speech (+2, because a 
    start and stop sign are added)
    """

    input_clauses = []
    output_pos = []
    input_chars = set()
    output_vocab = set()

    for cl in F.otype.s("clause"): 
        
        bo, _, _ = T.sectionFromNode(cl)
        if bo not in books:
            continue
        
        # max length of a clause is 10 words
        if len(L.d(cl, "word")) > 10:
            continue
        
        # input and output is extracted from the bhsa
        words = " ".join([F.g_cons.v(w) for w in L.d(cl, "word")])
        pos_prepare = [F.sp.v(w) for w in L.d(cl, "word")]
        
        # '\t' marks the start and '\n' the end of the output sequence
        poss = ['\t'] + pos_prepare + ['\n']
    
        input_clauses.append(words)
        output_pos.append(poss)
    
        # collect the input and output vocabularies
        input_chars.update(words)
        output_vocab.update(poss)
                
    output_vocab = sorted(list(output_vocab))
    input_chars = sorted(list(input_chars))
    
    max_len_input = max([len(line) for line in input_clauses])
    max_len_output = max([len(line) for line in output_pos])
    
    return input_clauses, output_pos, input_chars, output_vocab, max_len_input, max_len_output
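
The way `prepare_train_data` wraps each tag sequence in a start and a stop sign can be illustrated on a single clause. This is a minimal sketch; the tags are illustrative values for the three-word clause `W JQM JWNH`, and the names `START` and `STOP` are not part of the notebook.

```python
# Illustrative tags for the three-word clause "W JQM JWNH".
tags = ["conj", "verb", "nmpr"]

START, STOP = "\t", "\n"  # the same markers as in prepare_train_data
target = [START] + tags + [STOP]

# The output vocabulary is the sorted set of all symbols, markers included.
output_vocab_toy = sorted(set(target))

print(target)
print(len(target) == len(tags) + 2)
```

The two extra symbols explain the "+2" in `max_len_output`: a clause of at most 10 words yields an output sequence of at most 12 symbols.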

In [99]:
def prepare_test_data(books):
    """
    books is a list containing the test books
    The function returns:
    input_clauses, a list containing the text of clauses in the test books
    """

    input_clauses_test = []
    for cl in F.otype.s("clause"): 
        
        bo, _, _ = T.sectionFromNode(cl)
        if bo not in books:
            continue
        
        if len(L.d(cl, "word")) > 10:
            continue

        words = " ".join([F.g_cons.v(w) for w in L.d(cl, "word")])
        input_clauses_test.append(words)
    
    return input_clauses_test

In [100]:
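def create_dicts(input_chars, output_vocab):
    """
    input_chars is the sorted list of input characters
    output_vocab is the sorted list of output symbols (pos plus start and stop sign)
    The function returns, for the input and for the output vocabulary, a dict that maps
    indices to symbols and a dict that maps symbols to indices.
    """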
    
    input_idx2char = {}
    input_char2idx = {}

    for k, v in enumerate(input_chars):
        input_idx2char[k] = v
        input_char2idx[v] = k
        
    output_idx2char = {}
    output_char2idx = {}
    
    for k, v in enumerate(output_vocab):
        output_idx2char[k] = v
        output_char2idx[v] = k
        
    return input_idx2char, input_char2idx, output_idx2char, output_char2idx
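
As a small illustration of what these dictionaries do, the sketch below builds both mappings for a toy vocabulary and checks that encoding followed by decoding gives back the original tags. The toy vocabulary and the names `idx2sym` and `sym2idx` are assumptions made for this example only.

```python
vocab = sorted({"\t", "\n", "conj", "verb", "nmpr"})  # toy output vocabulary

# index -> symbol and symbol -> index, as built by create_dicts
idx2sym = dict(enumerate(vocab))
sym2idx = {sym: idx for idx, sym in idx2sym.items()}

encoded = [sym2idx[s] for s in ["conj", "verb", "nmpr"]]
decoded = [idx2sym[i] for i in encoded]

print(encoded)
print(decoded)
```

Because the vocabulary is sorted before the indices are assigned, the mapping is the same on every run, which matters when a trained model is reused later.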

In [101]:
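def one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos):
    """
    One-hot encodes the first nb_samples input clauses and output sequences.
    The function returns:
    tokenized_input_data, an array of shape (nb_samples, max_len_input, len(input_chars)), the encoder input
    tokenized_output, an array of shape (nb_samples, max_len_output, len(output_vocab)), the decoder input
    target_data, the same as tokenized_output, but one timestep ahead and without the start sign
    """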

    tokenized_input_data = np.zeros(shape = (nb_samples,max_len_input,len(input_chars)), dtype='float32')
    tokenized_output = np.zeros(shape = (nb_samples,max_len_output,len(output_vocab)), dtype='float32')
    target_data = np.zeros((nb_samples, max_len_output, len(output_vocab)),dtype='float32')

    for i in range(nb_samples):
        for k, ch in enumerate(input_clauses[i]):
            tokenized_input_data[i, k, input_char2idx[ch]] = 1
        
        for k, ch in enumerate(output_pos[i]):
            tokenized_output[i, k, output_char2idx[ch]] = 1

            # target_data is ahead by one timestep and does not include the start sign.
            if k > 0:
                target_data[i, k-1, output_char2idx[ch]] = 1
                
    return tokenized_input_data, tokenized_output, target_data
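
The relation between the decoder input and the target is easiest to see on one short sequence. The following sketch repeats the inner loop of `one_hot_encode` for a toy vocabulary of four symbols; the names `decoder_in` and `target` are chosen for this example.

```python
import numpy as np

vocab = ["\t", "\n", "conj", "verb"]          # toy output vocabulary
sym2idx = {s: i for i, s in enumerate(vocab)}
seq = ["\t", "conj", "verb", "\n"]            # start sign, two tags, stop sign

decoder_in = np.zeros((len(seq), len(vocab)), dtype="float32")
target = np.zeros_like(decoder_in)

for k, sym in enumerate(seq):
    decoder_in[k, sym2idx[sym]] = 1
    if k > 0:
        # the target at step k-1 is the symbol the decoder reads at step k
        target[k - 1, sym2idx[sym]] = 1

print(decoder_in.argmax(axis=1))    # symbols fed to the decoder: [0 2 3 1]
print(target[:-1].argmax(axis=1))   # symbols it should predict:  [2 3 1]
```

At every step the decoder reads the correct previous symbol and is trained to predict the next one. The last row of `target` stays all zeros, because nothing follows the stop sign.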

In [102]:
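def define_LSTM_model(input_chars, output_vocab):
    """
    Defines an encoder-decoder (sequence-to-sequence) model.
    The encoder consists of two stacked LSTM layers that read the one-hot encoded characters of a clause.
    The decoder is an LSTM layer, initialized with the final states of the encoder, followed by a
    softmax layer over the output vocabulary.
    The function returns the tensors and layers needed to build the inference models, and the training model.
    """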

    # Encoder model

    encoder_input = Input(shape=(None,len(input_chars)))
    encoder_LSTM = LSTM(512,activation = 'relu',return_state = True, return_sequences=True)(encoder_input)
    encoder_LSTM = LSTM(512,return_state = True)(encoder_LSTM)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM
    encoder_states = [encoder_h, encoder_c]
    
    # Decoder model

    decoder_input = Input(shape=(None,len(output_vocab)))
    decoder_LSTM = LSTM(512, return_sequences=True, return_state = True)
    decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(output_vocab), activation='softmax')
    decoder_out = decoder_dense (decoder_out)
    
    model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

    model.summary()

    return encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model

In [103]:
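def compile_and_train(model, tokenized_input, tokenized_output, batch_size, epochs, validation_split):
    """
    Compiles the model and trains it with teacher forcing: the decoder receives the correct
    output sequence as input and learns to predict target_data, the same sequence one timestep ahead.
    Note: target_data is not an argument, it is read from the global scope of the notebook.
    """

    model.compile(optimizer='adam', loss='categorical_crossentropy')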
    model.fit(x=[tokenized_input,tokenized_output], 
              y=target_data,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=validation_split)
    
    return model

In [104]:
nb_samples = 30000

input_clauses, output_pos, input_chars, output_vocab, max_len_input, max_len_output = prepare_train_data(train_books)
input_idx2char, input_char2idx, output_idx2char, output_char2idx = create_dicts(input_chars, output_vocab)
tokenized_input, tokenized_output, target_data = one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos)

In [105]:
test_clauses = prepare_test_data(test_books)
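# Only the encoded input of the test set is used. The output_pos of the training set is passed
# as a placeholder and the resulting output arrays are discarded. This assumes that the test clauses
# contain no characters outside input_chars and are not longer than max_len_input.
tokenized_test_data, _, _ = one_hot_encode(len(test_clauses), max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, test_clauses, output_pos)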

In [106]:
encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model = define_LSTM_model(input_chars, output_vocab)
model = compile_and_train(model, tokenized_input, tokenized_output, batch_size=256, epochs=40, validation_split=0.1)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_35 (InputLayer)           (None, None, 25)     0                                            
__________________________________________________________________________________________________
lstm_34 (LSTM)                  [(None, None, 512),  1101824     input_35[0][0]                   
__________________________________________________________________________________________________
input_36 (InputLayer)           (None, None, 16)     0                                            
__________________________________________________________________________________________________
lstm_35 (LSTM)                  [(None, 512), (None, 2099200     lstm_34[0][0]                    
                                                                 lstm_34[0][1]                    
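
The parameter counts in this summary can be checked by hand. A standard LSTM layer with bias has four gates, each with an input kernel, a recurrent kernel and a bias vector. The helper below is a small sketch of that arithmetic; `lstm_param_count` is a name introduced here, not a Keras function.

```python
def lstm_param_count(input_dim, units):
    """Number of trainable weights of a standard LSTM layer with bias:
    4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)

print(lstm_param_count(25, 512))    # first encoder LSTM, 25 input characters: 1101824
print(lstm_param_count(512, 512))   # second encoder LSTM, 512-dim input: 2099200
```

Both values agree with the `Param #` column for `lstm_34` and `lstm_35`, which confirms that the input vocabulary has 25 characters and that the second layer is fed the 512-dimensional output of the first.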
          

In [107]:
# Inference models for testing

# Encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(512,))
decoder_state_input_c = Input(shape=(512,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                 initial_state=decoder_input_states)

decoder_states = [decoder_h , decoder_c]

decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                          outputs=[decoder_out] + decoder_states )

In [108]:
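def decode_seq(inp_seq):
    """
    Greedy decoding of one one-hot encoded input clause.
    Returns the predicted parts of speech as one concatenated string, normally ending with the stop sign.
    """
    
    # Initial states value is coming from the encoder 
    states_val = encoder_model_inf.predict(inp_seq)
    
    # The first decoder input is the start sign
    target_seq = np.zeros((1, 1, len(output_vocab)))
    target_seq[0, 0, output_char2idx['\t']] = 1
    
    translated_sent = ''
    n_decoded = 0
    stop_condition = False
    
    while not stop_condition:
        
        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)
        
        # Choose the most probable output symbol
        max_val_index = np.argmax(decoder_out[0,-1,:])
        sampled_out_char = output_idx2char[max_val_index]
        translated_sent += sampled_out_char
        n_decoded += 1
        
        # Stop at the stop sign, or when the output is as long as the longest training sequence
        if sampled_out_char == '\n' or n_decoded >= max_len_output:
            stop_condition = True
        
        # The predicted symbol and the new states are the input of the next step
        target_seq = np.zeros((1, 1, len(output_vocab)))
        target_seq[0, 0, max_val_index] = 1
        
        states_val = [decoder_h, decoder_c]
        
    return translated_sent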
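
Because `decode_seq` concatenates the predicted tags without a separator, a decoded string such as `conjverbsubs` has to be split before it can be compared with the word-level tags. The sketch below does this by greedy prefix matching. The function `split_tags` and the list `TAGS` are assumptions made for this example; in the notebook the tag set would be `output_vocab` without the start and stop sign.

```python
def split_tags(decoded, tag_set):
    """Split a concatenated tag string into tags by greedy prefix matching.
    Assumes that no tag is a prefix of another tag."""
    decoded = decoded.strip()  # drop the trailing '\n' stop sign
    tags, pos = [], 0
    while pos < len(decoded):
        for tag in tag_set:
            if decoded.startswith(tag, pos):
                tags.append(tag)
                pos += len(tag)
                break
        else:
            raise ValueError(f"no known tag at position {pos}: {decoded[pos:]!r}")
    return tags

# Tags that occur in the decoded output of this notebook (illustrative subset).
TAGS = ["art", "conj", "nmpr", "prep", "prps", "subs", "verb"]

print(split_tags("conjverbsubsnmprprepnmprsubssubs\n", TAGS))
```

For the first clause of Jonah this gives eight tags, one for each of the eight words of the input, so prediction and input can be aligned word by word.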



In [110]:
for seq_index in range(200):
    inp_seq = tokenized_test_data[seq_index:seq_index+1]
    
    translated_sent = decode_seq(inp_seq)
    print('-')
    print('Input sentence:', test_clauses[seq_index])
    print('Decoded sentence:', translated_sent)

-
Input sentence: W JHJ DBR JHWH >L JWNH BN >MTJ
Decoded sentence: conjverbsubsnmprprepnmprsubssubs

-
Input sentence: L >MR
Decoded sentence: prepverb

-
Input sentence: QWM
Decoded sentence: verb

-
Input sentence: LK >L NJNWH H <JR H GDWLH
Decoded sentence: verbprepsubsartsubsartsubs

-
Input sentence: W QR> <LJH
Decoded sentence: conjverbprep

-
Input sentence: KJ <LTH R<TM L PNJ
Decoded sentence: conjverbsubsprepsubs

-
Input sentence: W JQM JWNH
Decoded sentence: conjverbnmpr

-
Input sentence: L BRX TRCJCH M L PNJ JHWH
Decoded sentence: prepverbsubsprepprepsubsnmpr

-
Input sentence: W JRD JPW
Decoded sentence: conjverbsubs

-
Input sentence: W JMY> >NJH
Decoded sentence: conjverbprep

-
Input sentence: B>H TRCJC
Decoded sentence: verbsubs

-
Input sentence: W JTN FKRH
Decoded sentence: conjverbsubs

-
Input sentence: W JRD BH
Decoded sentence: conjverbprep

-
Input sentence: L BW> <MHM TRCJCH M L PNJ JHWH
Decoded sentence: prepverbprepsubsprepprepsubsnmpr

-
Input sentence: W J

-
Input sentence: >CR NDRTJ
Decoded sentence: conjverb

-
Input sentence: >CLMH
Decoded sentence: subs

-
Input sentence: JCW<TH L JHWH
Decoded sentence: verbprepnmpr

-
Input sentence: W J>MR JHWH L  DG
Decoded sentence: conjverbnmprprepartsubs

-
Input sentence: W JQ> >T JWNH >L H JBCH
Decoded sentence: conjverbprepnmprprepartnmpr

-
Input sentence: W JHJ DBR JHWH >L JWNH CNJT
Decoded sentence: conjverbsubsnmprprepsubssubs

-
Input sentence: L >MR
Decoded sentence: prepverb

-
Input sentence: QWM
Decoded sentence: verb

-
Input sentence: LK >L NJNWH H <JR H GDWLH
Decoded sentence: verbprepsubsartsubsartsubs

-
Input sentence: W QR> >LJH >T H QRJ>H
Decoded sentence: conjverbprepprepartsubs

-
Input sentence: >CR >NKJ DBR >LJK
Decoded sentence: conjprpsverbprep

-
Input sentence: W JQM JWNH
Decoded sentence: conjverbnmpr

-
Input sentence: W JLK >L NJNWH K DBR JHWH
Decoded sentence: conjverbprepnmprprepsubsnmpr

-
Input sentence: W NJNWH HJTH <JR GDWLH L >LHJM
Decoded sentence: conjnmp