In this notebook, brand-new biblical texts are generated. A sequence of 100 characters is selected at random from the Hebrew Bible, 
and the model trained here appends 400 characters to it in the style of the Bible.

In [1]:
import keras
from keras import layers
import numpy as np
import random
import sys
from keras.callbacks import ModelCheckpoint

from tf.fabric import Fabric
DATABASE = '~/github'
BHSA = 'bhsa/tf/c'
TF = Fabric(locations=[DATABASE], modules=[BHSA], silent=False )

api = TF.load('''
      language g_cons
''')

api.loadLog()
api.makeAvailableIn(globals())

Using TensorFlow backend.


This is Text-Fabric 5.5.14
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored
  0.00s loading features ...
   |     0.09s B g_cons               from C:/Users/geitb/github/bhsa/tf/c
   |     0.08s B language             from C:/Users/geitb/github/bhsa/tf/c
  3.66s All features loaded/computed - for details use loadLog()
   |     0.05s B otype                from C:/Users/geitb/github/bhsa/tf/c
   |     0.45s B oslots               from C:/Users/geitb/github/bhsa/tf/c
   |     0.00s B book                 from C:/Users/geitb/github/bhsa/tf/c
   |     0.00s B chapter              from C:/Users/geitb/github/bhsa/tf/c
   |     0.00s B verse                from C:/Users/geitb/github/bhsa/tf/c
   |     0.09s B g_cons               from C:/Users/geitb/github/bhsa/tf/c
   |     0.13s B g_cons_u

First, the text is built from the Hebrew words of the Bible, joined by spaces.

In [2]:
words = [F.g_cons.v(word) for word in F.otype.s("word") if F.language.v(word) == "Hebrew"]
    
text = " ".join(words) 

Next, the data are preprocessed. The text is cut into overlapping pieces of maxlen characters, and the pieces are converted to one-hot encoding.

The input of the model consists of these pieces; the output is the character that follows each input sequence.

In [3]:
# Length of extracted character sequences
maxlen = 100

# We create a new sequence every "step" characters
step = 5

sentences = []
next_chars = []

# Extracting sentences and the next characters.
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('Number of sentences:', len(sentences))

# Extracting unique characters from the corpus
chars = sorted(list(set(text)))
print('Number of unique characters:', len(chars))

# Dictionary for mapping unique characters to their index
char_indices = {char: i for i, char in enumerate(chars)}

# Converting characters into one-hot encoding.

# The built-in bool is used as dtype; the np.bool alias was removed in recent NumPy versions
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sentences: 319703
Number of unique characters: 25
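
The windowing and one-hot encoding used in this preprocessing step can be illustrated on a toy string. This is a minimal sketch, not part of the original notebook: the names `toy_text`, `toy_maxlen` and `toy_step` are hypothetical stand-ins for `text`, `maxlen` and `step`, with values chosen small enough to check by hand.

```python
import numpy as np

# Hypothetical toy values; the notebook itself uses maxlen = 100 and step = 5
toy_text = "ABCABCAB"
toy_maxlen = 3
toy_step = 2

# Each window is an input; the character right after it is the target
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
windows = [toy_text[i: i + toy_maxlen] for i in starts]
targets = [toy_text[i + toy_maxlen] for i in starts]

toy_chars = sorted(set(toy_text))
toy_indices = {char: i for i, char in enumerate(toy_chars)}

# One-hot encode: x has shape (windows, maxlen, chars), y has shape (windows, chars)
x_toy = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        x_toy[i, t, toy_indices[char]] = True
    y_toy[i, toy_indices[targets[i]]] = True

print(windows)   # ['ABC', 'CAB', 'BCA']
print(targets)   # ['A', 'C', 'B']
print(x_toy.shape, y_toy.shape)  # (3, 3, 3) (3, 3)
```

Because the step is smaller than the window length, consecutive windows overlap, just as they do in the notebook.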


Next, the model is defined: a single LSTM layer followed by a softmax output layer over the characters.

In [4]:
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))    

optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
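
To get a feeling for the size of this model, the number of trainable weights can be worked out by hand. The arithmetic below is a sketch that assumes Keras's default LSTM layout (four gates, each with input weights, recurrent weights and a bias) and the 25 unique characters reported above; calling `model.summary()` is the way to confirm the actual numbers.

```python
units = 128      # size of the LSTM layer
n_chars = 25     # number of unique characters (input and output dimension)

# Each of the 4 LSTM gates has an input kernel, a recurrent kernel and a bias
lstm_params = 4 * (units * (units + n_chars) + units)

# Dense softmax layer: weight matrix plus bias
dense_params = units * n_chars + n_chars

total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)  # 78848 3225 82073
```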

Then the model is trained, and after each epoch a new text is generated.
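
The generation procedure in this notebook keeps a fixed-length window of text, predicts one character, appends it, and drops the first character of the window. The sketch below shows that rolling-window loop in isolation. It is an illustration under assumptions: `generate` and `cyclic_predict` are hypothetical names, and `cyclic_predict` is a stand-in for `model.predict` that always predicts the next character of a three-letter alphabet, so the result can be checked by hand.

```python
import numpy as np

def generate(seed, predict_fn, chars, n_new, rng=None):
    """Extend seed by n_new characters using a rolling window of len(seed)."""
    rng = np.random.default_rng() if rng is None else rng
    char_indices = {char: i for i, char in enumerate(chars)}
    window = seed
    new_chars = []
    for _ in range(n_new):
        # One-hot encode the current window: shape (1, len(window), len(chars))
        x = np.zeros((1, len(window), len(chars)))
        for t, char in enumerate(window):
            x[0, t, char_indices[char]] = 1.0
        # Draw the next character from the predicted distribution
        probs = predict_fn(x)[0]
        next_char = chars[int(rng.choice(len(chars), p=probs))]
        new_chars.append(next_char)
        # Slide the window: drop the first character, append the new one
        window = window[1:] + next_char
    return "".join(new_chars)

def cyclic_predict(x):
    """Hypothetical stand-in for model.predict: predicts the next letter cyclically."""
    last = int(np.argmax(x[0, -1]))
    probs = np.zeros((1, x.shape[2]))
    probs[0, (last + 1) % x.shape[2]] = 1.0
    return probs

generated = generate("AB", cyclic_predict, ["A", "B", "C"], 5)
print(generated)  # CABCA
```

With a trained network in place of `cyclic_predict`, the predicted distribution is no longer one-hot, so repeated runs give different continuations.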

In [5]:
# sample draws the index of the next character at random from the predicted probability distribution,
# instead of always taking the most probable character; this avoids repetition in the generated text
def sample(preds):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# define the checkpoint
filepath="weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

for epoch in range(1, 5):
    print('epoch', epoch)
    # Fit the model for 1 epoch
    model.fit(x, y,
          batch_size=128,
          epochs=1,
          callbacks=callbacks_list)
    
 
    # Select a text seed randomly
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Seeded text: "' + generated_text + '"')

    sys.stdout.write(generated_text)

    # We generate 400 characters        
    for i in range(400):
            
        # convert the current window of generated text to one-hot encoding
        sampled = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(generated_text):
            sampled[0, t, char_indices[char]] = 1.
                
        # predict next character based on sampled text
        preds = model.predict(sampled, verbose=0)[0]
        next_index = sample(preds)
        next_char = chars[next_index]
            
        # add new character to generated text, remove the first character and make new prediction
        generated_text += next_char
        generated_text = generated_text[1:]

        sys.stdout.write(next_char)
        sys.stdout.flush()


epoch 1
Epoch 1/1

Epoch 00001: loss improved from inf to 1.94198, saving model to weights-01-1.9420.hdf5
--- Seeded text: " W DBC W CMN W YRJ NTNW M<RBK DMFQ SXRTK B RB M<FJK M RB KL HWN B JJN XLBWN W YMR YXR WDN W JWN M >W"
 W DBC W CMN W YRJ NTNW M<RBK DMFQ SXRTK B RB M<FJK M RB KL HWN B JJN XLBWN W YMR YXR WDN W JWN M >WL KL RGP CLX LJ H >CR YR< Z>XTJ MLK BJT CMJM MMNWN >XW W TDBRTM W JMT KPR CBR <MWN <NJH W JD< <L KL >JC LVW PLCTW JDWN B SL W <L CB< L HJWT >T W RYH W J>MR DBRW HNNJ NH JFMXW >TN B NPC >LP XKM KL TJSW >T KL H QWMJM CNK >T H M<DJ LHL KJ NXRT HTNPLR W J>MR <LJHWN >TB<V B JDW W L> TWKWS MLL H <MJM  LBJM >GRN Q>WRJ >XRJ DWMR>D BJN W MPQH W CMH YRJM W <L P>T >HLK W >BD W JGDRW KL H NMLLJM W >MR >T H GBepoch 2
Epoch 1/1

Epoch 00001: loss improved from 1.94198 to 1.75041, saving model to weights-01-1.7504.hdf5
--- Seeded text: "Q XWNN W NWTN KJ MBRKJW JJRCW >RY W MQLLJW JKRTW M JHWH MY<DJ GBR KWNNW W DRKW JXPY KJ JPL L> JWVL K"
Q XWNN W NWTN KJ MBRKJW JJRCW >RY W MQLLJW J



LJJLH W PWSPN J CHJHM KH M  HSBJM HJ PWTWH <LJ >LH B  PW >MJMJM M  JM MJ HWLJM H >KW W TDJHM >JC JD W BKM MXJ H CJLWW >RYJW MN H JJN GCQ W JMH W H JWJN >CJH H MYRH L> JBNW JJMWR W NHJ MXW B> TPT >CR <W HWLJD BJTNJ JMJM HM <M <D P<RJH >RB<JM W H >RY HJX H JTR M DLL <L MZBXJ LJ KJ L> HKH M PJM LBBK M >CH J<NJW W <TH B TXJW W JXNH W >M HHTMLNW NQMH NJLH <MJM L YPHWR L WHM YPHJ MJ JT JMJNK  NJW
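
The `sample` function in this notebook draws from the predicted distribution as it is. A common variant rescales the distribution with a temperature before drawing: values below 1 make the output more conservative, values above 1 make it more surprising. The sketch below is an assumption-laden illustration, not the notebook's method: `sample_with_temperature` and its `temperature` parameter are hypothetical additions, and the small floor inside the logarithm is there only to avoid taking the log of zero.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Draw an index from preds after rescaling it with a temperature."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Floor the probabilities so that log(0) cannot occur
    logits = np.log(np.maximum(preds, 1e-12)) / temperature
    # Subtract the maximum for numerical stability before exponentiating
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_preds = [0.1, 0.7, 0.2]

# A very low temperature concentrates almost all probability on the most likely index
low_temperature_draws = [sample_with_temperature(toy_preds, 0.01, rng) for _ in range(50)]

# Temperature 1.0 leaves the distribution unchanged, so any index can be drawn
plain_draws = [sample_with_temperature(toy_preds, 1.0, rng) for _ in range(50)]
```

In the generation loop, such a function could be called in place of `sample(preds)` to compare texts produced at different temperatures.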