# New Text-Fabric module: The Dead Sea Scrolls

By Martijn Naaijer and Jarod Jacobs

Earlier this year, the CACCHT project (Creating Annotated Corpora of Classical Hebrew Texts) was started. CACCHT is a joint project of the ETCBC and the Theological Seminary at Andrews University and the researchers involved include: Jarod Jacobs, Martijn Naaijer, Robert Rezetko, Oliver Glanz and Wido van Peursen. CACCHT focuses on statistically analyzing Ancient Hebrew texts. At the core of our work is the BHSA and the extrabiblical module, but for a comprehensive analysis we intend to broaden our scope by including the Dead Sea Scrolls and Rabbinic texts.

We have completed the first stage of the project, the results of which can be found on the [ETCBC github page](https://github.com/ETCBC/dss): a brand new Text-Fabric module containing the Dead Sea Scrolls with morphological encoding.

The DSS transcriptions and the morphological data connected with them were generously provided by Martin Abegg. The transcriptions come from various sources, but primarily reflect what is found in the Discoveries in the Judaean Desert series. Abegg started morphologically tagging the Qumran texts in the mid-90s with the assistance of several people. Over the following decades, Abegg completed full morphological tagging of nearly every Hebrew and Aramaic scroll found in the Judaean Desert between 1947 and today. In support of open access ideals, Abegg provided CACCHT with his work from the past decades, which has been converted to Text-Fabric by Dirk Roorda.

## Part of Speech tagging of Hebrew texts

With Abegg's data in Text-Fabric, the next step is to convert Abegg's morphological encoding to the encoding system used by the ETCBC. Over the next few months we will work on converting word, phrase and clause features. For part of the data this is fairly straightforward: we assume that features like verb tense, stem formation, gender, number and state are similar to the ETCBC encoding and need only small adaptations. Our initial foray will be converting the part of speech tagging. Abegg's dataset contains part of speech values, but its conventions deviate from those used in the ETCBC database.

POS tagging of the Dead Sea Scrolls has various challenges. In the first place, there are many ambiguous cases. For instance, the word אל can be a preposition, but it can also be a noun meaning “god”. Of course, a decision can be made by manually encoding all the DSS in the ETCBC encoding, or we can rely on POS tags and/or other indirect information in Abegg's dataset. The disadvantage of using other information from the dataset is that the conversion would become rather complicated, and in many cases the encoding would remain difficult. For instance, Abegg does not distinguish between part of speech (feature sp) and phrase dependent part of speech (feature pdp).

In this project we want to tag the DSS with the ETCBC encoding system automatically, without manually encoding the logic behind each tag and decision.

How does this work? First, it is important to state that we do not have clause boundaries in the Abegg dataset. This makes the task of POS tagging more difficult, because the structure of a clause may give an indication of the POS of a word. As an example, if a clause ends with the word אל, it is more likely to be a noun than a preposition, because a preposition is followed within the clause by other words or a pronominal suffix.

Even with that limitation, Abegg's dataset does have information about the structure of words and their environment. Significantly, we know where word boundaries are, for instance, ויהי has been split into ו and יהי already. Also, POS tagging is helped by word morphology and, most importantly, we know the order of words in a book or text.

Automated systems for the analysis of language can be roughly divided into two kinds: rule-driven and pattern-driven. Rule-driven systems contain a lot of human input, such as "if then" blocks of code. For instance, in the case of a POS tagger, such a block can be: "if a word is 'H', the POS is 'article'", or "if a word is 'MCH' or 'YHWH', the POS is 'proper noun'". In general, systems of this kind work well, but there are some problems. One is that there are many ambiguous cases, and a rule-driven system can become very complicated if it has to distinguish all the possible cases. Also, there may be patterns in the dataset that the researcher has missed, in which case the rule-driven system remains incomplete.
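
To make this limitation concrete, here is a minimal sketch of a rule-driven tagger. The lookup table `RULES` and the function `rule_based_pos` are hypothetical illustrations and not part of the CACCHT code; the keys follow the informal transliteration of the examples above, with `>L` standing for אל.

```python
# Hypothetical lookup rules: every word form is mapped to exactly one POS.
RULES = {
    "H": "art",      # the article
    "MCH": "nmpr",   # proper noun
    "YHWH": "nmpr",  # proper noun
    ">L": "prep",    # forced choice: the same form can also be the noun "god"
}


def rule_based_pos(word, default="subs"):
    """Return the POS from the lookup table, or a default guess for unknown words."""
    return RULES.get(word, default)


tags = [rule_based_pos(w) for w in ["H", "MCH", ">L", "MLK"]]
print(tags)  # ['art', 'nmpr', 'prep', 'subs']
```

The table can only ever return one answer for `>L`, regardless of context. Handling such cases would require additional rules that inspect the surrounding words, which is where rule-driven systems tend to grow complicated.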

In the CACCHT project we opt for the pattern-driven approach based on machine learning. Instead of relying on a system based on rules, we let an algorithm search for patterns in the data. In recent years, pattern-driven systems have started to outperform rule-driven systems. Modern pattern-driven systems generally rely on machine learning algorithms to identify the structure of the data. The model is fed a large set of examples called the training set. For a POS tagging model, the training set contains words, all tagged with their part of speech. The model identifies patterns in the training data, from which it builds a structure that can be used to tag new texts that do not have part of speech tags. This approach is called supervised learning.

For the CACCHT project, we train our model on the BHSA, where we know the POS of all the words. The model learns the relationship between words and the corresponding POS values, and then we use this model to predict the POS of the words in the DSS.

We have already seen that there are ambiguous cases, so how do we solve these? If it is possible to use the context of a word, we would be helped enormously, because we expect that the preposition אל has a different environment in a clause than the noun אל.

To solve this problem, we use a so-called sequence-to-sequence model (seq2seq). Instead of modeling the relationship between a word and a POS, we model the relationship between two sequences. One sequence consists of a number of words, the other of the corresponding POS. These sequences need to be kept relatively short, so using one clause per data sample would be a natural choice. However, as we have already mentioned, the Abegg data contain word boundaries but no clause boundaries. Therefore we have chosen to feed the algorithm sequences of eight words. Here is an example of how this works:

In ETCBC transcription, the first sequence in Genesis 1:1 looks as follows:

'B R>CJT BR> >LHJM >T H CMJM W'

This is the first input sequence. The corresponding output is a list and looks as follows.

['\t', 'prep', 'subs', 'verb', 'subs', 'prep', 'art', 'subs', 'conj', '\n']

The signs '\t' and '\n' are start and stop signs, occurring in every output sequence.

For the second and third sample in the train set we move one word forward every time, so the inputs look as follows:

'R>CJT BR> >LHJM >T H CMJM W >T'

'BR> >LHJM >T H CMJM W >T H'

The corresponding outputs are:

['\t', 'subs', 'verb', 'subs', 'prep', 'art', 'subs', 'conj', 'prep', '\n']

['\t', 'verb', 'subs', 'prep', 'art', 'subs', 'conj', 'prep', 'art', '\n']

We move forward this way to the end of the book of Genesis, then we process Exodus and move forward until the end of Chronicles is reached. One book is withheld from the model (in our case Nehemiah), to act as the test set. Keeping part of the data separate as a test set is a standard procedure in machine learning practice.
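
The sliding-window procedure can be sketched without Text-Fabric. The function `make_windows` below is a hypothetical helper, not the function used later in this post; it only illustrates how overlapping samples of eight words arise from a list of words and their POS tags.

```python
def make_windows(words, tags, size=8):
    """Build overlapping (input string, output list) samples of `size` words."""
    samples = []
    for start in range(len(words) - size + 1):
        window_words = words[start:start + size]
        window_tags = tags[start:start + size]
        # the output gets a start ('\t') and a stop ('\n') symbol
        samples.append((" ".join(window_words), ["\t"] + window_tags + ["\n"]))
    return samples


# The first ten words of Genesis 1:1, copied from the examples above
words = ["B", "R>CJT", "BR>", ">LHJM", ">T", "H", "CMJM", "W", ">T", "H"]
tags = ["prep", "subs", "verb", "subs", "prep", "art", "subs", "conj", "prep", "art"]

samples = make_windows(words, tags)
for text, pos in samples:
    print(text, pos)
```

Ten words yield three samples, which correspond to the three input and output sequences shown above.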

You can see the seq2seq model as a translation model, and this is exactly what it is used for in other applications. In these applications the input can consist of English sentences, which are translated by the model to another language, for instance Dutch.

What kind of algorithms can be used for such a task, in which a sequence of POS is predicted for a sequence of characters? A type of model which is often used for sequence analysis is the so-called Long Short-Term Memory model (or LSTM model), which is a kind of Neural Network. It is used for a variety of Natural Language Processing tasks (such as chatbots, text classification and text summarization), but also for making predictions of numeric sequences, such as forecasting time-series. It is beyond the scope of this blog post to go into the details of Neural Networks and the LSTM model, but there are a lot of helpful sources online about it, such as [this blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) or the [Keras documentation of seq2seq LSTM models](https://keras.io/examples/lstm_seq2seq/).

One challenge with LSTM models is that the algorithm only ingests numbers and all sequences have to have the same length. Because of this, some further preprocessing is needed. We check the length of the longest sequence and give all the sequences that length by adding zeros to them (this is called padding). All sequences consist of eight words, so how can they have varying lengths? We have chosen to use a character-based model, so the model sees the input sequence as a sequence of characters. It does not know where the word boundaries are, because the model takes the space as a character just like the other characters. We also convert each character to a number so that the model can work with them.
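
As a small illustration of these two steps, the sketch below maps characters to numbers and pads the shorter sequence with zeros. It uses plain integer indices for brevity, with index 0 reserved for padding (an assumption of this sketch); the script further down uses one-hot vectors instead, where padded positions are rows of zeros.

```python
import numpy as np

sequences = ["B R>CJT BR>", ">T H CMJM"]

# Every distinct character, including the space, gets its own number.
# Index 0 is kept free so that it can serve as the padding value.
chars = sorted(set("".join(sequences)))
char2idx = {ch: i + 1 for i, ch in enumerate(chars)}

# All sequences get the length of the longest one
max_len = max(len(seq) for seq in sequences)
encoded = np.zeros((len(sequences), max_len), dtype="int32")

for row, seq in enumerate(sequences):
    for col, ch in enumerate(seq):
        encoded[row, col] = char2idx[ch]

print(encoded.shape)
print(encoded)
```

The second sequence is two characters shorter than the first, so its row ends in two zeros.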

Then all those input and output sequences are fed to the algorithm, which trains a model that finds the relationship between the input and output sequences (at least, that is what we hope, of course).

With this model and the input sequences of the DSS we can predict their POS, but how do we know how well the model performs? Predictions based on machine learning models rarely predict everything correctly. To find out how good it is, we start with making predictions on the test set: the book of Nehemiah. We make predictions on the words (first converted to sequences similar to the training data), and then compare these predictions with the true values of the POS in the ETCBC database. When that is done we know how often it predicts unseen words correctly.

## Let's do some real work

The following script works through this whole procedure of training the LSTM model and makes predictions on the test set. The following steps are made:

- import of relevant libraries
- in prepare_train_data() the input and output sequences for the train set are created.
- in prepare_test_data() the input and output sequences for the test set are created.
- in create_dicts() and one_hot_encode() the sequences are preprocessed further.
- in define_LSTM_model() the encoder-decoder architecture is created.
- compile_and_train() does the training of the model. Here some important hyperparameters of the model are chosen.
- After that, predictions are made using the model and the test set, which is the book of Nehemiah. After the evaluation it becomes clear how well the model works on unseen data. These predictions demonstrate what we want to use the model for: automatically analyzing Hebrew texts grammatically!


First some libraries are imported. Of course, we use [Text-Fabric with the BHSA data](https://etcbc.github.io/bhsa), which you can access with Python 3, and as the framework for the Neural Network we use [Keras](https://keras.io).


In [None]:
!pip install text-fabric

In [None]:
import collections

import pandas as pd
import numpy as np

from sklearn.utils import shuffle
from statistics import mode

from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

In [None]:
from tf.app import use
A = use('bhsa', hoist=globals())

In the function prepare_train_data() the train set is created, and some other useful information is collected. The argument of the function, test_book, is the book which will be excluded from the train set, because it is upon this book that the model will be tested.

In [None]:
def prepare_train_data(test_book):

    input_seqs = []
    output_pos = []
    input_chars = set()
    output_vocab = set()

    # iterate over all the books
    for bo in F.otype.s("book"): 
        
        # exclude the test_book
        if F.book.v(bo) == test_book:
            continue
               
        # all the words from a book are collected
        words = L.d(bo, 'word')
        
        # Now we iterate over all the words, except the last words, because all the sequences have to be 8 words long
        for w in words[0:-7]:
            
            # In the following two lines the train data are prepared
            
            #exclude Aramaic words
            languages_list = [F.language.v(w) for w in range(w, w+8) if (F.g_cons.v(w) != '')]
            if "Aramaic" in languages_list:
                continue
            
            # here the input data are created
            # words_train is a string with the consonantal representation of 8 words, separated by spaces
            # elided-he is excluded, this is the empty string
            g_cons_train = (" ".join([F.g_cons.v(w) for w in range(w, w+8) if (F.g_cons.v(w) != '')])).strip()
            
            # and here outputs are created
            # it is a list containing parts of speech
            parts_of_speech = [F.sp.v(w) for w in range(w, w+8) if (F.g_cons.v(w) != '')]
            
            # the two preceding lines of code and their counterparts in the function prepare_test_data() are the only places 
            # where we extract data from the etcbc database with text-fabric before the data are trained
            
            # to the outputs a start ('\t') and stop ('\n') symbol are added
            parts_of_speech = ['\t'] + parts_of_speech + ['\n']
             
            # the input sequence g_cons_train is added to input_seqs, which is a list containing all the inputs
            input_seqs.append(g_cons_train)
            
            # the list parts_of_speech is added to the list output_pos
            output_pos.append(parts_of_speech)
            
            # for further processing we need the "vocabularies" of the input and output
            # we use a character-based model, so the input vocabulary consists of all the distinct characters in the input strings
            # also included is the space
            for ch in g_cons_train:
                input_chars.add(ch)
            
            # also collected is the output vocabulary, which consists of all the parts of speech in the etcbc database
            for pos in parts_of_speech:
                output_vocab.add(pos)
                
    
    input_chars = sorted(list(input_chars))
    output_vocab = sorted(list(output_vocab))
    
    # in the LSTM network all the sequences have to have the same length. We find out what the length of the longest sequence is,
    # all the other sequences will get that length
    max_len_input = max([len(clause) for clause in input_seqs])
    max_len_output = max([len(poss) for poss in output_pos])
    
    # shuffle the data. The model will get the data in small batches, and shuffling makes the batches similar in composition
    # of course the inputs and outputs have to be shuffled identically
    input_seqs, output_pos = shuffle(input_seqs, output_pos)
    
    return input_seqs, output_pos, input_chars, output_vocab, max_len_input, max_len_output

In the function prepare_test_data() the test data are prepared, consisting of the data of the single book not included in the train data.

In [None]:
def prepare_test_data(test_book):

    input_seqs_test = []
    output_seqs_test = []
    
    for bo in F.otype.s('book'):
        
        # exclude other books than test_book
        if F.book.v(bo) != test_book:
            continue
            
        words = L.d(bo, 'word')

        for w in words[0:-7]:
          
            # exclude Aramaic words
            languages_list = [F.language.v(w) for w in range(w, w+8) if (F.g_cons.v(w) != '')]
            if "Aramaic" in languages_list:
                continue
            
            if F.g_cons.v(w) == '':
                continue
            
            # prepare the test data
            input_seq_test = (" ".join([F.g_cons.v(w) for w in range(w, w+8) if (F.g_cons.v(w) != '')])).strip()
            output_seq_test = [F.sp.v(w) for w in range(w, w+8) if (F.g_cons.v(w) != '')]
            
            input_seqs_test.append(input_seq_test)
            output_seqs_test.append(output_seq_test)
            
    return input_seqs_test, output_seqs_test, [w for w in words if (F.g_cons.v(w) != '')]

The network can only handle numeric data, but after the data have been processed as numbers, they need to be converted back to characters. The function create_dicts() provides dictionaries with mapping between the input and output vocabularies and integers.

In [None]:
def create_dicts(input_voc, output_voc):
    
    # these dicts map the input sequences
    input_idx2char = {}
    input_char2idx = {}

    for k, v in enumerate(input_voc):
        input_idx2char[k] = v
        input_char2idx[v] = k
     
    # and these dicts map the output sequences of parts of speech
    output_idx2char = {}
    output_char2idx = {}
    
    for k, v in enumerate(output_voc):
        output_idx2char[k] = v
        output_char2idx[v] = k
        
    return input_idx2char, input_char2idx, output_idx2char, output_char2idx

Now the final data preparation function is made. Categorical data are generally fed to the LSTM network in one-hot encoded form. All input sequences are padded to the same length, and so are all output sequences. Also created is an array called target_data: the decoder target, which holds the output sequence shifted one timestep ahead.

In [None]:
def one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos):
    
    # three-dimensional numpy arrays are created 
    tokenized_input = np.zeros(shape = (nb_samples, max_len_input, len(input_chars)), dtype='float32')
    tokenized_output = np.zeros(shape = (nb_samples, max_len_output, len(output_vocab)), dtype='float32')
    target_data = np.zeros((nb_samples, max_len_output, len(output_vocab)), dtype='float32')

    for i in range(nb_samples):
        for k, ch in enumerate(input_clauses[i]):
            tokenized_input[i, k, input_char2idx[ch]] = 1
        
        for k, ch in enumerate(output_pos[i]):
            tokenized_output[i, k, output_char2idx[ch]] = 1

            # target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i, k-1, output_char2idx[ch]] = 1
                
    return tokenized_input, tokenized_output, target_data
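
The one-timestep shift between the decoder input and its target is easier to see on a toy example. The vocabulary and the sequence below are made up for illustration and are much smaller than the real ones.

```python
import numpy as np

vocab = ["\t", "\n", "prep", "subs"]
idx = {tag: i for i, tag in enumerate(vocab)}
output_seq = ["\t", "prep", "subs", "\n"]

decoder_in = np.zeros((len(output_seq), len(vocab)), dtype="float32")
decoder_target = np.zeros_like(decoder_in)

for k, tag in enumerate(output_seq):
    decoder_in[k, idx[tag]] = 1
    if k > 0:
        # the target at step k-1 is the tag that has to be predicted next
        decoder_target[k - 1, idx[tag]] = 1

print(decoder_in.argmax(axis=1))          # positions of the ones in the input
print(decoder_target[:3].argmax(axis=1))  # the same tags, one step earlier
print(decoder_target[3])                  # the last row stays empty
```

At every timestep the decoder sees the current tag and is trained to predict the following one, which is why the start symbol never occurs in the target.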

In the function define_LSTM_model() the architecture of the model is created. Neural networks are very flexible structures and a variety of architectures have been developed for various tasks. Here we use the encoder-decoder architecture with two LSTM layers in the encoder. In the architecture there is a variety of hyperparameters that you have to choose. Better hyperparameters lead to better predictions, so it is important to spend time on optimizing this. Hyperparameters in this architecture are the number of LSTM layers, the number of cells in each LSTM layer and the activation function.

In [None]:
def define_LSTM_model(input_chars, output_vocab):

    # encoder model
    encoder_input = Input(shape=(None,len(input_chars)))
    encoder_LSTM = LSTM(250,activation='relu',return_state=True, return_sequences=True)(encoder_input)
    encoder_LSTM = LSTM(250,return_state=True)(encoder_LSTM)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM
    encoder_states = [encoder_h, encoder_c]
    
    # decoder model
    decoder_input = Input(shape=(None,len(output_vocab)))
    decoder_LSTM = LSTM(250, return_sequences=True, return_state = True)
    decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(output_vocab), activation='softmax')
    decoder_out = decoder_dense (decoder_out)
    
    model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

    model.summary()

    return encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model

Now the model is compiled and trained using the function compile_and_train(). The data are fed to the model in small batches. The train data are split into a train set and a validation set. The latter consists of 5% of the original train set. The model is trained on the train set, and makes a prediction on these data. The difference between the predictions and the true values of the output is calculated with categorical crossentropy and is called the loss. During training this loss becomes smaller, which means that the predictions become more accurate. However, we want the model not only to become good on the train data; it should also be general enough to make accurate predictions on unseen data. Therefore, after every epoch a prediction is made on the small validation set and the validation loss is calculated. Ideally, the validation loss is more or less equal to the train loss. After a number of epochs, you will notice that the train loss keeps decreasing, while the validation loss stays the same or even increases. At this point the model starts to overfit, which means that the algorithm is modeling idiosyncrasies in the train data instead of general patterns. In that case it is time to stop training and make predictions on the test set.

Again, you have to choose a number of hyperparameters. These are the optimizer, the loss function, the batch size, the number of epochs and the learning rate. If you want, you can even tune more hyperparameters.

With EarlyStopping() the training process can be stopped earlier than the given number of epochs. This is useful if the model starts overfitting and the validation loss does not decrease anymore.
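
The idea behind stopping with a patience of three epochs can be sketched in plain Python. This is a simplified illustration of the principle, not the Keras implementation; the function `stopping_epoch` is hypothetical and the validation losses are invented.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # the validation loss improved: remember it and reset the counter
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Invented validation losses: improvement stops after the third epoch
losses = [0.90, 0.60, 0.45, 0.46, 0.47, 0.45, 0.50]
print(stopping_epoch(losses))  # 6
```

With these numbers the loss does not improve in epochs 4, 5 and 6, so training stops after epoch 6 and the seventh epoch is never run.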

Note that training an LSTM model is a computationally intensive process. It is recommended to run the script on a GPU.

In [None]:
def compile_and_train(model, one_hot_in, one_hot_out, targets, batch_size, epochs, val_split):

    callback = EarlyStopping(monitor='val_loss', patience=3, verbose=0, mode='auto')
    adam = Adam(lr=0.0008, beta_1=0.99, beta_2=0.999, epsilon=0.00000001)
    model.compile(optimizer=adam, loss='categorical_crossentropy')
    model.fit(x=[one_hot_in,one_hot_out], 
              y=targets,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=val_split,
              callbacks=[callback])
    
    return model

The train data are prepared. The test data consist of sequences of words from the book of Nehemiah, so in the preparation of the train data, Nehemiah is excluded.

In [None]:
test_book = "Nehemia" # the book name is in Latin, because the tf-feature "book" is used in the functions prepare_train_data() and prepare_test_data().

input_clauses, output_pos, input_chars, output_vocab, max_len_input, max_len_output = prepare_train_data(test_book)
input_idx2char, input_char2idx, output_idx2char, output_char2idx = create_dicts(input_chars, output_vocab)

nb_samples = len(input_clauses)
one_hot_input, one_hot_output, target_data = one_hot_encode(nb_samples, max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, input_clauses, output_pos)

What do the input data look like?

In [None]:
input_clauses[0:10]

In [None]:
output_pos[0:10]

The test data are prepared

In [None]:
test_clauses, output_test, test_word_nodes = prepare_test_data(test_book)
one_hot_test_data, _, _ = one_hot_encode(len(test_clauses), max_len_input, max_len_output, input_chars, output_vocab, input_char2idx, output_char2idx, test_clauses, output_pos)

Here the functions define_LSTM_model() and compile_and_train() are called. A neural network learns in an iterative process. One iteration is called an epoch. In each iteration a prediction is made, and the train and validation loss are calculated, as you can see in the output.

The architecture of the model is also printed with the number of parameters. You also see the number of train samples (397552 samples).

In [None]:
encoder_input, encoder_states, decoder_input, decoder_LSTM, decoder_dense, model = define_LSTM_model(input_chars, output_vocab)
model = compile_and_train(model, one_hot_input, one_hot_output, target_data, 1024, 150, 0.05)

In [None]:
# Encoder inference model
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(250,))
decoder_state_input_c = Input(shape=(250,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                 initial_state=decoder_input_states)

decoder_states = [decoder_h , decoder_c]

decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                          outputs=[decoder_out] + decoder_states )

In the function decode_seq() the predictions on the test set are made. The input, inp_seq, consists of one one-hot encoded sequence of words in the book of Nehemiah.

In [None]:
def decode_seq(inp_seq):
    
    states_val = encoder_model_inf.predict(inp_seq)
    
    target_seq = np.zeros((1, 1, len(output_vocab)))
    target_seq[0, 0, output_char2idx['\t']] = 1
    
    pred_pos = []
    stop_condition = False
    
    while not stop_condition:
        
        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)
        
        max_val_index = np.argmax(decoder_out[0,-1,:])
        sampled_out_char = output_idx2char[max_val_index]
        pred_pos.append(sampled_out_char)
        
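        # stop at the stop symbol, or when the output gets longer than any output in the train set
        if sampled_out_char == '\n' or len(pred_pos) >= max_len_output:
            stop_condition = True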
        
        target_seq = np.zeros((1, 1, len(output_vocab)))
        target_seq[0, 0, max_val_index] = 1
        
        states_val = [decoder_h, decoder_c]
        
    return pred_pos

Now the function decode_seq() is called, the predictions are made and the results are collected.

For most words eight predictions are made, because each word (except the words at the beginning and end of a book) occurs in eight sequences. In the dict decision_dict all eight predictions are collected.

In [None]:
decision_dict = collections.defaultdict(list)

for seq_index in range(len(one_hot_test_data)):
    inp_seq = one_hot_test_data[seq_index:seq_index+1]
    
    pred_pos = decode_seq(inp_seq)
    
    if len(pred_pos[:-1]) == len(output_test[seq_index]):
        for pred_ind in range(len(pred_pos[:-1])):
            decision_dict[seq_index + pred_ind].append(pred_pos[:-1][pred_ind])   

We simply use majority voting to decide what the final prediction is. So, if the model predicts 5 times "verb" and 3 times "subs" for a certain word, we decide that the word is a verb. In the case of a tie, e.g. 4 times "verb" and 4 times "subs", the value with the lowest index is chosen, which can be seen as a random choice between the tied alternatives.
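
A minimal sketch of this voting step, including the tie-breaking behaviour, looks as follows. The function name `majority_vote` is hypothetical; the evaluation code below uses the same combination of `Counter` and `max` inline.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the value that occurs first."""
    counts = Counter(predictions)
    # max() returns the first element that reaches the highest count
    return max(predictions, key=counts.get)


print(majority_vote(["verb"] * 5 + ["subs"] * 3))       # verb
print(majority_vote(["subs", "verb", "verb", "subs"]))  # subs
```

In the second call both values occur twice, and "subs" wins only because it comes first in the list.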

### Misclassifications on the test set

We start the evaluation with the bad news: misclassifications. We want the model to predict the POS correctly as often as possible, but in practice it is difficult to reach 100% accuracy. The following cell outputs the words in the book of Nehemiah that were misclassified by the model.

The output shows:

- the text-fabric node number
- (book, chapter, verse)
- the consonants of a word
- correct POS
- predicted POS

In [None]:
correct_test = 0
wrong_test = 0
cross_dict = collections.defaultdict(lambda: collections.defaultdict(int))

for key in range(len(test_word_nodes)):
    node = test_word_nodes[key]
    true_pos = F.sp.v(node)

    # majority vote over all the predictions collected for this word
    data = collections.Counter(decision_dict[key])
    predicted_pos = max(decision_dict[key], key=data.get)

    cross_dict[true_pos][predicted_pos] += 1

    if true_pos == predicted_pos:
        correct_test += 1

    else:
        wrong_test += 1
        print(node, T.sectionFromNode(node), F.g_cons.v(node), true_pos, predicted_pos)


### Quantitative evaluation

The following table shows the true values according to the ETCBC database in the rows and the predictions in the columns. On the diagonal you see the numbers of correct predictions.

In [None]:
evaluation = []

all_pos = list(cross_dict.keys())

for key in all_pos:
    eval_pos = [cross_dict[key][key2] if key2 in cross_dict[key] else 0 for key2 in all_pos]
    evaluation.append(eval_pos)
    
# put everything in dataframe
df_eval = pd.DataFrame(evaluation) 
df_eval.columns = all_pos
df_eval.index = all_pos
df_eval

# Below:
# horizontal: predictions
# vertical: true values according to etcbc database

All these results are interesting, but how good is the model? We calculate this by dividing the number of correct classifications by the total number of predictions:

In [None]:
print("Correct classifications:", correct_test)
print("Misclassifications:", wrong_test)

correct_percent = 100 * correct_test  / (correct_test + wrong_test)
print("Accuracy:", round(correct_percent, 1), "%")

So, the model predicts the POS of biblical data correctly in nearly 96% of the words, which we think is decent. The most difficult POS to predict is the proper noun (nmpr): in 93 cases in which the true value is a proper noun, a substantive (subs) was predicted. The results may vary slightly between different runs of the script.
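
To see which POS is hardest to predict, the share of correct predictions can be computed per true POS. The sketch below assumes a nested dict with the same layout as cross_dict (true POS first, predicted POS second); the helper `recall_per_pos` is hypothetical and the counts are invented toy numbers, not the results on Nehemiah.

```python
import collections

# Invented toy counts, laid out as toy[true_pos][predicted_pos]
toy = collections.defaultdict(lambda: collections.defaultdict(int))
toy["subs"]["subs"] = 90
toy["subs"]["nmpr"] = 10
toy["nmpr"]["nmpr"] = 30
toy["nmpr"]["subs"] = 20


def recall_per_pos(cross):
    """For every true POS, return the fraction of words that was predicted correctly."""
    result = {}
    for true_pos, preds in cross.items():
        total = sum(preds.values())
        result[true_pos] = preds.get(true_pos, 0) / total if total else 0.0
    return result


print(recall_per_pos(toy))
```

In this toy table the substantives are recognized far more reliably than the proper nouns, which is the kind of imbalance that a single overall accuracy figure hides.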

The model can be saved and loaded again to be used for making predictions on, for instance, the Dead Sea Scrolls. The language of the DSS may differ a bit from Biblical Hebrew, which may lead to a slight decrease in accuracy. On the other hand, the extra-biblical text-fabric module already contains some DSS scrolls, which is helpful, because they can be added to the training set. This addition of other training data is only one way to improve the model. There might be various other ways. If you have suggestions, or if you are a student and you would like to do a project on Ancient Hebrew and machine learning, let us know!

While it is difficult to say beforehand how the model can be improved and how well exactly the algorithm works on unseen DSS, we will update you soon with our findings in another blog post.

### Correct classifications

Finally, for the record, these are the correct predictions. In the output you see:

- The text-fabric node number
- (book, chapter, verse)
- the consonants of the word
- correct pos
- predicted pos (which is identical to the correct pos, of course)

In [None]:
for key in range(len(test_word_nodes)):
    node = test_word_nodes[key]
    data = collections.Counter(decision_dict[key])
    predicted_pos = max(decision_dict[key], key=data.get)

    # cross_dict is not updated here: counting again would double every value in it

    if F.sp.v(node) == predicted_pos:
        print(node, T.sectionFromNode(node), F.g_cons.v(node), F.sp.v(node), predicted_pos)
