# Find phrase boundaries in BH

The model in this notebook is designed to find phrase boundaries in chunks of BH text, based on the division into phrases in the ETCBC database.
The model assumes that the text has already been analyzed at the word level.

Each input chunk consists of 20 words. Each word is represented by its part of speech. For instance, a typical chunk has the following structure (the example shows the first 20 words of the book of Genesis):

['prep', 'subs', 'verb', 'subs', 'prep', 'art', 'subs', 'conj', 'prep', 'art', 'subs', 'conj', 'art', 'subs', 'verb', 'subs', 'conj', 'subs', 'conj', 'subs']

The corresponding output is:

['\t', 'x', 'x', 'p', 'x', 'p', 'x', 'p', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'p', 'x', 'p', 'x', 'x', 'p', 'x', 'p', 'x', 'x', 'x', 'p', 'x', 'p', 'x', 'p', '\n']

Here every 'x' represents a word in the input sequence, and 'p' marks the end of a phrase. '\t' and '\n' are start and stop symbols. 
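
As a minimal sketch of this encoding, the output sequence can be built from the number of words in a chunk and the positions of the phrase-final words. The function name and the `phrase_final` argument are illustrative assumptions and do not occur in the notebook; the positions are read off from the Genesis example above.

```python
def encode_phrase_boundaries(n_words, phrase_final):
    """Return the output sequence for a chunk of n_words words.

    phrase_final holds the 0-based positions of the words that end a phrase.
    """
    sequence = ['\t']                  # start symbol
    for position in range(n_words):
        sequence.append('x')           # one 'x' per word
        if position in phrase_final:
            sequence.append('p')       # phrase boundary after this word
    sequence.append('\n')              # stop symbol
    return sequence


# Phrase-final positions read off from the Genesis example above
genesis_phrase_final = {1, 2, 3, 10, 11, 13, 14, 17, 18, 19}
genesis_output = encode_phrase_boundaries(20, genesis_phrase_final)
print(genesis_output)
```

The length of the output varies with the number of phrases in the chunk, which is why a sequence-to-sequence model with explicit start and stop symbols is a natural fit.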

In the following input chunk we have moved forward one word:

Input:

['subs', 'verb', 'subs', 'prep', 'art', 'subs', 'conj', 'prep', 'art', 'subs', 'conj', 'art', 'subs', 'verb', 'subs', 'conj', 'subs', 'conj', 'subs', 'prep']

Output:

['\t', 'x', 'p', 'x', 'p', 'x', 'p', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'p', 'x', 'p', 'x', 'x', 'p', 'x', 'p', 'x', 'x', 'x', 'p', 'x', 'p', 'x', 'p', 'x', '\n']
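
The two examples above differ by a shift of one word. A sketch of such a sliding window over a list of part-of-speech tags could look as follows; `sliding_chunks` and its parameters are hypothetical names chosen for illustration.

```python
def sliding_chunks(tags, size=20, step=1):
    """Return all windows of `size` consecutive tags, moving `step` words at a time."""
    return [tags[start:start + size]
            for start in range(0, len(tags) - size + 1, step)]


# Parts of speech of the first 21 words of Genesis, taken from the two examples above
genesis_pos = ['prep', 'subs', 'verb', 'subs', 'prep', 'art', 'subs', 'conj', 'prep', 'art',
               'subs', 'conj', 'art', 'subs', 'verb', 'subs', 'conj', 'subs', 'conj', 'subs',
               'prep']

chunks = sliding_chunks(genesis_pos)
print(len(chunks))    # 2
print(chunks[1][0])   # subs
```

Because consecutive windows overlap in all but one word, a text of n words yields n - 19 chunks of 20 words.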


In [1]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

from sklearn.utils import shuffle

Using TensorFlow backend.


In [2]:
from tf.app import use
A = use('bhsa', hoist=globals())
A.displaySetup(extraFeatures='g_cons')

TF app is up-to-date.
Using annotation/app-bhsa commit 7f353d587f4befb6efe1742831e28f301d2b3cea (=latest)
  in C:\Users\geitb/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c r1.5 in C:\Users\geitb/text-fabric-data
Using etcbc/phono/tf - c r1.2 in C:\Users\geitb/text-fabric-data
Using etcbc/parallels/tf - c r1.2 in C:\Users\geitb/text-fabric-data


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="provenance of BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis">BHSA</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Writing/Hebrew" title="('Hebrew characters and transcriptions',)">Character table</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="BHSA feature documentation">Feature docs</a> <a target="_blank" href="https://github.com/annotation/app-bhsa" title="bhsa API documentation">bhsa API</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Api/Fabric/" title="text-fabric-api">Text-Fabric API 7.3.15</a> <a target="_blank" href="https://annotation.github.io/text-fabric/Use/Search/" title="Search Templates Introduction and Reference">Search Reference</a>

A training set and a test set are defined. The model is trained on all the books of the MT except Jonah, and is then used to predict the boundaries in Jonah.
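
The split by book can be sketched without Text-Fabric as follows. The helper `split_books` is a hypothetical name; the book names follow the Latin names used in the BHSA, as printed in the cell below.

```python
def split_books(book_names, held_out=("Jona",)):
    """Split book names into training books and held-out test books."""
    train = [book for book in book_names if book not in held_out]
    test = [book for book in book_names if book in held_out]
    return train, test


# A small subset of the BHSA book names, for illustration only
some_books = ["Joel", "Amos", "Obadia", "Jona", "Micha"]
train_books, test_books = split_books(some_books)
print(train_books)   # ['Joel', 'Amos', 'Obadia', 'Micha']
print(test_books)    # ['Jona']
```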

In [3]:
for bo in F.otype.s("book"):
    print(F.book.v(bo))

Genesis
Exodus
Leviticus
Numeri
Deuteronomium
Josua
Judices
Samuel_I
Samuel_II
Reges_I
Reges_II
Jesaia
Jeremia
Ezechiel
Hosea
Joel
Amos
Obadia
Jona
Micha
Nahum
Habakuk
Zephania
Haggai
Sacharia
Maleachi
Psalmi
Iob
Proverbia
Ruth
Canticum
Ecclesiastes
Threni
Esther
Daniel
Esra
Nehemia
Chronica_I
Chronica_II


The data are prepared.

In [301]:
def prepare_train_data():
    """
    Builds the training data from all books except Jonah.
    The function returns:
    input_clauses is a list containing strings with the text of chunks of BH text (five words, extended
    by up to two words when the fifth word has no trailer)
    output_phrases is a list containing, for each chunk, a list of output symbols: 'x' for every
    consonant, 'w' after the last consonant of every word
    input_vocab contains the characters occurring in input_clauses (the input vocabulary)
    output_vocab is a sorted list containing all the output symbols
    max_len_input is the maximum length of all the input chunks in number of characters
    max_len_output is the maximum length of all the output sequences in number of symbols (including the
    start and stop symbols)
    """

    input_clauses = []
    output_phrases = []
    input_vocab = set()
    output_vocab = {'x', '\t', '\n', 'w'}

    for bo in F.otype.s("book"): 

        if F.book.v(bo) == "Jona":
            continue
        
        words_in_book = L.d(bo, "word")
        
        for wo in range(words_in_book[0], words_in_book[-1] - 5):
            # Skip positions where the word is the final part of a written word that began earlier
            if F.trailer_utf8.v(wo) != '' and F.trailer_utf8.v(wo-1) == '' and wo != words_in_book[0]:
                continue
            
             
            input_chunk = ("".join(["".join([F.g_cons.v(wo2), F.trailer_utf8.v(wo2)]) for wo2 in range(wo, wo+5)])).strip()
            
            all_words = [w for w in range(wo, wo+5)]
            if F.trailer_utf8.v(wo+4) == '':
                input_chunk += F.g_cons.v(wo+5)
                all_words.append(wo+5)
            if F.trailer_utf8.v(wo+4) == '' and F.trailer_utf8.v(wo+5) == '':
                input_chunk += F.g_cons.v(wo+6)
                all_words.append(wo+6)
                
            
            for char in input_chunk:
                input_vocab.add(char)
            
            output_prep = []
            for word in sorted(all_words):
                for char in F.g_cons.v(word):
                    output_prep.append('x')
                output_prep.append("w")
                        

            output_chunk = ['\t']
            for elem in output_prep:
                output_chunk.append(elem)
            output_chunk.append('\n')
    
            input_clauses.append(input_chunk)
            output_phrases.append(output_chunk)
    
    input_chars = sorted(input_vocab)
    output_vocab = sorted(output_vocab)
    
    max_len_input = max(len(clause) for clause in input_clauses)
    max_len_output = max(len(output_phr) for output_phr in output_phrases)
    
    # shuffle the data
    #input_clauses, output_phrases = shuffle(input_clauses, output_phrases)
    
    # Return the sorted list, so that the character indices are reproducible across runs
    return input_clauses, output_phrases, input_chars, output_vocab, max_len_input, max_len_output
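
The labelling done in `prepare_train_data` can be illustrated on a single chunk without access to the database. The sketch below is a simplification under stated assumptions: `label_chunk` is a hypothetical helper, the words are given as (consonants, trailer) pairs instead of being read from `F.g_cons` and `F.trailer_utf8`, and the extension of the five-word window is left out. The sample is the opening of Jonah.

```python
MAQAF = "\u05be"  # the Hebrew hyphen, one of the trailer values in the BHSA


def label_chunk(words):
    """Return the input text and the output symbols for a list of (consonants, trailer) pairs."""
    text = "".join(cons + trailer for cons, trailer in words).strip()
    labels = ['\t']                      # start symbol
    for cons, _ in words:
        labels.extend('x' * len(cons))   # one 'x' per consonant
        labels.append('w')               # word boundary after the last consonant
    labels.append('\n')                  # stop symbol
    return text, labels


jonah_words = [("W", ""), ("JHJ", " "), ("DBR", MAQAF), ("JHWH", " "), (">L", MAQAF)]
text, labels = label_chunk(jonah_words)
print(text)
print(labels)
```

Note that the trailers (spaces, maqaf, sof pasuq) are part of the input text but have no symbol of their own in the output, so the output is shorter than the input plus its word boundaries might suggest.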

In [302]:
tr_set = set()
for w in F.otype.s('word'):
    tr_set.add(F.trailer_utf8.v(w))
    
print(tr_set)

{'', '׃ ', '׃ ׆ ', ' פ ', '׃ ׆ ס ', '׀ ', ' ', '׃ פ ', '׃ ׆ פ ', '־', ' ס ', '׃ ס '}


In [303]:
sorted([0,2,4,3,1])

[0, 1, 2, 3, 4]

In [304]:
def prepare_test_data():
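    """
    Builds the test data from the book of Jonah.
    The function returns:
    input_clauses_test, a list containing the text of the chunks in the test book
    outputs_test, a list containing the expected output symbols for every chunk (without start and stop symbols)
    """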
    input_clauses_test = []
    outputs_test = []
    for bo in F.otype.s("book"): 

        if F.book.v(bo) != "Jona":
            continue
        
        words_in_book = L.d(bo, "word")
        
        for wo in range(words_in_book[0], words_in_book[-1] - 5):
            
            if F.trailer_utf8.v(wo) != '' and F.trailer_utf8.v(wo-1) == '' and wo != words_in_book[0]:
                continue
            
            input_chunk = ("".join(["".join([F.g_cons.v(wo2), F.trailer_utf8.v(wo2)]) for wo2 in range(wo, wo+5)])).strip()
            
            all_words = [w for w in range(wo, wo+5)]
            if F.trailer_utf8.v(wo+4) == '':
                input_chunk += F.g_cons.v(wo+5)
                all_words.append(wo+5)
            if F.trailer_utf8.v(wo+4) == '' and F.trailer_utf8.v(wo+5) == '':
                input_chunk += F.g_cons.v(wo+6)
                all_words.append(wo+6)
                
            input_clauses_test.append(input_chunk)
            
            output_chunk = []
            for w in all_words:
                for ch in F.g_cons.v(w):
                    output_chunk.append('x')
                output_chunk.append('w')
            outputs_test.append(output_chunk)
    
    return input_clauses_test, outputs_test

In [306]:
def create_dicts(input_vocab, output_vocab):
    """
    The network can only handle numeric data. This function provides four dicts.
    Two of them map between integers and the input characters (one dict for each direction), the other two
    map between integers and the output symbols (the boundary labels plus the start and stop symbols).
    """

    
    input_idx2char = {}
    input_char2idx = {}

    # Use the input_vocab argument (sorted for a stable order) instead of the global input_chars
    for k, v in enumerate(sorted(input_vocab)):
        input_idx2char[k] = v
        input_char2idx[v] = k
        
    output_idx2char = {}
    output_char2idx = {}
    
    for k, v in enumerate(output_vocab):
        output_idx2char[k] = v
        output_char2idx[v] = k
     
    
    return input_idx2char, input_char2idx, output_idx2char, output_char2idx

In [307]:
def one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos):
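    """
    Categorical data are generally one-hot encoded in neural networks, which is done here.
    The function returns the encoded input, the encoded decoder input, and the decoder target,
    which is the decoder input shifted one timestep ahead.
    """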

    tokenized_input_data = np.zeros(shape = (nb_samples,max_len_input,len(input_chars)), dtype='float32')
    tokenized_output = np.zeros(shape = (nb_samples,max_len_output,len(output_vocab)), dtype='float32')
    target_data = np.zeros((nb_samples, max_len_output, len(output_vocab)),dtype='float32')

    for i in range(nb_samples):
        for k, ch in enumerate(input_clauses[i]):
            tokenized_input_data[i, k, input_char2idx[ch]] = 1
        
        for k, ch in enumerate(output_pos[i]):
            tokenized_output[i, k, output_char2idx[ch]] = 1

            # target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i, k-1, output_char2idx[ch]] = 1
                
    return tokenized_input_data, tokenized_output, target_data
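
The relation between the decoder input and the decoder target is easiest to see on one short sequence. The following is a sketch in plain Python lists rather than NumPy arrays; `one_hot` is a hypothetical helper and the sample sequence and `max_len` are made up for illustration.

```python
def one_hot(sequence, symbol2idx, length):
    """Return a length x vocabulary matrix with a single 1.0 per symbol; padding rows stay zero."""
    rows = [[0.0] * len(symbol2idx) for _ in range(length)]
    for position, symbol in enumerate(sequence):
        rows[position][symbol2idx[symbol]] = 1.0
    return rows


# The output vocabulary and both mapping directions, as in create_dicts
vocab = sorted({'\t', '\n', 'w', 'x'})
symbol2idx = {symbol: idx for idx, symbol in enumerate(vocab)}
idx2symbol = {idx: symbol for symbol, idx in symbol2idx.items()}

output_seq = ['\t', 'x', 'x', 'w', 'x', 'w', '\n']
max_len = 8  # assumed maximum output length; shorter sequences are zero-padded

decoder_input = one_hot(output_seq, symbol2idx, max_len)
# The target is the same sequence without the start symbol, i.e. one timestep ahead
decoder_target = one_hot(output_seq[1:], symbol2idx, max_len)
```

At every timestep the decoder thus sees the correct previous symbol and is trained to predict the next one (teacher forcing).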

In [308]:
def define_LSTM_model(input_chars, output_vocab):
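    """
    Defines an encoder-decoder (sequence-to-sequence) model. The encoder consists of two stacked LSTM
    layers. Its final hidden and cell states initialize the decoder LSTM, which is followed by a
    softmax layer over the output symbols. The function returns the layers needed to build the
    inference models, together with the training model.
    """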

    # Encoder model

    encoder_input = Input(shape=(None,len(input_chars)))
    encoder_LSTM = LSTM(512,activation = 'relu',return_state = True, return_sequences=True)(encoder_input)
    encoder_LSTM = LSTM(512,return_state = True)(encoder_LSTM)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM
    encoder_states = [encoder_h, encoder_c]
    
    # Decoder model

    decoder_input = Input(shape=(None,len(output_vocab)))
    decoder_LSTM = LSTM(512, return_sequences=True, return_state = True)
    decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(output_vocab), activation='softmax')
    decoder_out = decoder_dense (decoder_out)
    
    model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

    model.summary()

    return encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model

In [309]:
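def compile_and_train(model, tokenized_input, tokenized_output, batch_size, epochs, validation_split):
    """
    Compiles and trains the model. Note that the targets are read from the global variable
    target_data, which must be defined before this function is called.
    """
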
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(x=[tokenized_input,tokenized_output], 
              y=target_data,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=validation_split)
    
    return model

In [316]:
nb_samples = 300000

In [317]:
input_clauses, output_pos, input_chars, output_vocab, max_len_input, max_len_output = prepare_train_data()
input_idx2char, input_char2idx, output_idx2char, output_char2idx = create_dicts(input_chars, output_vocab)
tokenized_input, tokenized_output, target_data = one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos)

In [311]:
len(input_clauses)

315697

In [318]:
test_clauses, output_test = prepare_test_data()
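# The encoded outputs are discarded here; output_pos is passed only to satisfy the function signature
tokenized_test_data, _, _ = one_hot_encode(len(test_clauses), max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, test_clauses, output_pos)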

WJHJ DBR־JHWH >L־
DBR־JHWH >L־JWNH BN־
JHWH >L־JWNH BN־>MTJ
>L־JWNH BN־>MTJ L
JWNH BN־>MTJ L>MR׃
BN־>MTJ L>MR׃ QWM
>MTJ L>MR׃ QWM LK
L>MR׃ QWM LK >L־
QWM LK >L־NJNWH H
LK >L־NJNWH H<JR
>L־NJNWH H<JR H
NJNWH H<JR HGDWLH
H<JR HGDWLH W
HGDWLH WQR> <LJH
WQR> <LJH KJ־<LTH
<LJH KJ־<LTH R<TM L
KJ־<LTH R<TM LPNJ׃
<LTH R<TM LPNJ׃ W
R<TM LPNJ׃ WJQM
LPNJ׃ WJQM JWNH
WJQM JWNH LBRX
JWNH LBRX TRCJCH M
LBRX TRCJCH ML
TRCJCH MLPNJ JHWH
MLPNJ JHWH W
LPNJ JHWH WJRD
JHWH WJRD JPW W
WJRD JPW WJMY>
JPW WJMY> >NJH׀ B>H
WJMY> >NJH׀ B>H TRCJC
>NJH׀ B>H TRCJC WJTN
B>H TRCJC WJTN FKRH
TRCJC WJTN FKRH W
WJTN FKRH WJRD
FKRH WJRD BH L
WJRD BH LBW>
BH LBW> <MHM TRCJCH
LBW> <MHM TRCJCH M
<MHM TRCJCH MLPNJ
TRCJCH MLPNJ JHWH׃
MLPNJ JHWH׃ W
LPNJ JHWH׃ WJHWH
JHWH׃ WJHWH HVJL RWX־
WJHWH HVJL RWX־GDWLH
HVJL RWX־GDWLH >L־H
RWX־GDWLH >L־HJM
GDWLH >L־HJM W
>L־HJM WJHJ
HJM WJHJ S<R־
WJHJ S<R־GDWL B
S<R־GDWL BJM
GDWL BJM W
BJM WH
JM WH>NJH
WH>NJH XCBH L
H>NJH XCBH LHCBR׃
XCBH LHCBR׃ WJJR>W
LHCBR׃ WJJR>W H
WJJR>W HMLXJM W
HMLXJ

In [319]:
encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model = define_LSTM_model(input_chars, output_vocab)
model = compile_and_train(model, tokenized_input, tokenized_output, 512, 20, 0.1)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_31 (InputLayer)           (None, None, 31)     0                                            
__________________________________________________________________________________________________
lstm_31 (LSTM)                  [(None, None, 512),  1114112     input_31[0][0]                   
__________________________________________________________________________________________________
input_32 (InputLayer)           (None, None, 4)      0                                            
__________________________________________________________________________________________________
lstm_32 (LSTM)                  [(None, 512), (None, 2099200     lstm_31[0][0]                    
                                                                 lstm_31[0][1]                    
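
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias. The helper below is a hypothetical name used only for this check; the input dimension 31 is the size of the input vocabulary shown in the summary.

```python
def lstm_param_count(input_dim, units):
    """Number of trainable parameters of a standard LSTM layer with bias."""
    # per gate: input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return 4 * (units * (input_dim + units) + units)


print(lstm_param_count(31, 512))    # 1114112, first encoder LSTM
print(lstm_param_count(512, 512))   # 2099200, second encoder LSTM
```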
          

In [320]:
# Inference models for testing

# Encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(512,))
decoder_state_input_c = Input(shape=(512,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                 initial_state=decoder_input_states)

decoder_states = [decoder_h , decoder_c]

decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                          outputs=[decoder_out] + decoder_states )

In [321]:
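def decode_seq(inp_seq):
    """
    Greedy decoding of one encoded input chunk: starting from the start symbol, the most probable
    output symbol is chosen at every step and fed back into the decoder.
    """
    
    # Initial states value is coming from the encoder 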
    states_val = encoder_model_inf.predict(inp_seq)
    
    target_seq = np.zeros((1, 1, len(output_vocab)))
    target_seq[0, 0, output_char2idx['\t']] = 1
    
    pred_pos = []
    stop_condition = False
    
    while not stop_condition:
        
        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)
        max_val_index = np.argmax(decoder_out[0,-1,:])
        sampled_out_char = output_idx2char[max_val_index]
        pred_pos.append(sampled_out_char)
        
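        # Stop at the stop symbol, or when the output exceeds the longest training sequence
        if sampled_out_char == '\n' or len(pred_pos) >= max_len_output:
            stop_condition = True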
        
        target_seq = np.zeros((1, 1, len(output_vocab)))
        target_seq[0, 0, max_val_index] = 1
        
        states_val = [decoder_h, decoder_c]
        
    return pred_pos
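
The control flow of the decoding loop can be studied in isolation from Keras. In the sketch below the decoder is replaced by a scripted stand-in, so `greedy_decode`, `toy_step` and the `max_len` guard are all assumptions made for illustration. Unlike `decode_seq`, this version does not include the stop symbol in its result.

```python
def greedy_decode(step, start, stop, max_len):
    """Call step(symbol, state) -> (next_symbol, state) until `stop` is produced or max_len is reached."""
    symbol, state = start, None
    decoded = []
    while len(decoded) < max_len:
        symbol, state = step(symbol, state)   # feed the previous symbol and state back in
        if symbol == stop:
            break
        decoded.append(symbol)
    return decoded


def toy_step(symbol, state):
    """Hypothetical stand-in for the decoder: emits a fixed script, one symbol per call."""
    script = ['x', 'x', 'w', 'x', 'w', '\n']
    position = 0 if state is None else state
    return script[position], position + 1


print(greedy_decode(toy_step, '\t', '\n', max_len=10))   # ['x', 'x', 'w', 'x', 'w']
```

The `max_len` guard ensures that decoding terminates even when the model never predicts the stop symbol.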

In [325]:
correct = 0

for seq_index in range(len(output_test)):
    inp_seq = tokenized_test_data[seq_index:seq_index+1]
    
    pred_pos = decode_seq(inp_seq)

    # pred_pos ends with the stop symbol, which is not part of the expected output
    if output_test[seq_index] == pred_pos[:-1]:
        correct += 1
    else:
        print('-')
        print('Input clause:', test_clauses[seq_index])
        print('Predicted word boundaries:', pred_pos[:-1])
        print('Expected word boundaries:', output_test[seq_index])
        
print(correct)


-
Input clause: JWNH LBRX TRCJCH MLPNJ
Predicted phrase boundaries: ['x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'w']
Input clause: ['x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'w']
-
Input clause: GDWL BJM WH>NJH
Predicted phrase boundaries: ['x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'w']
Input clause: ['x', 'x', 'x', 'x', 'w', 'x', 'w', 'w', 'x', 'x', 'w', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'w']
-
Input clause: BJM WH>NJH
Predicted phrase boundaries: ['w', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'w']
Input clause: ['x', 'w', 'w', 'x', 'x', 'w', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'w']
-
Input clause: WH>NJH XCBH LHCBR
Predicted phrase boundaries: ['x', 'w', 'x', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'w']
Input clause:

-
Input clause: JWNH LBW> B<JR
Predicted phrase boundaries: ['x', 'x', 'x', 'x', 'w', 'x', 'w', 'w', 'x', 'x', 'x', 'w', 'x', 'w', 'w', 'x', 'x', 'x', 'w']
Input clause: ['x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'w', 'w', 'x', 'x', 'x', 'w']
-
Input clause: LBW> B<JR
Predicted phrase boundaries: ['x', 'w', 'w', 'x', 'x', 'x', 'w', 'x', 'w', 'w', 'x', 'x', 'x', 'w']
Input clause: ['x', 'w', 'x', 'x', 'x', 'w', 'x', 'w', 'w', 'x', 'x', 'x', 'w']
-
Input clause: YWM WJLBCW FQJM MGDWLM
Predicted phrase boundaries: ['x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'x', 'w']
Input clause: ['x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'w']
-
Input clause: MLK NJNWH WJQM MKS>W
Predicted phrase boundaries: ['x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'w', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'x', 'w']
Input clause: ['x', 'x', 'x

-
Input clause: MCTJM־<FRH RBW >DM
Predicted phrase boundaries: ['x', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'w']
Input clause: ['x', 'w', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'w', 'x', 'x', 'x', 'w']
640


In [324]:
correct/len(output_test)

0.9103840682788051
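
The score above is an exact-match accuracy: a chunk counts as correct only if every predicted symbol agrees with the expected one. With 640 correct chunks, the reported value corresponds to 703 test chunks. A sketch of the metric, with `exact_match_accuracy` as a hypothetical helper name and made-up sample data:

```python
def exact_match_accuracy(expected, predicted):
    """Fraction of sequences that are predicted completely correctly."""
    if not expected:
        raise ValueError("expected must contain at least one sequence")
    matches = sum(1 for exp, pred in zip(expected, predicted) if exp == pred)
    return matches / len(expected)


expected = [['x', 'w'], ['x', 'x', 'w'], ['x', 'w', 'w']]
predicted = [['x', 'w'], ['x', 'w', 'x'], ['x', 'w', 'w']]
print(exact_match_accuracy(expected, predicted))
```

Because a single wrong symbol makes the whole chunk count as an error, this is a stricter measure than accuracy per symbol.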