# Colab Block
I usually run my notebooks in Colaboratory, which provides a free GPU with 12GB of memory. Files are most easily accessed through Google Drive, but the Drive has to be connected and authenticated twice by this block. Some packages also need to be reinstalled on every new runtime. That is what this first block does; remove it if you are not using a Colab notebook.

In [0]:
from google.colab import auth
auth.authenticate_user()
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
!mkdir -p gdrive
!google-drive-ocamlfuse gdrive
!pip install -q keras
!pip install numba
!pip install tqdm
!pip install opencv-python
!apt update && apt install -y libsm6 libxext6
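
If the notebook may also run outside Colab, the setup block can be guarded instead of deleted. This is a minimal sketch, assuming a hypothetical `IN_COLAB` flag that is not part of the original notebook; it only checks whether the `google.colab` package can be imported.

```python
# Hypothetical guard (an assumption, not in the original notebook):
# google.colab is only importable on Colab runtimes.
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    print("Running on Colab: execute the Drive and pip setup above.")
else:
    print("Not on Colab: skip the setup block.")
```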

# Imports

In [0]:

import numpy as np
import pandas as pd
import keras as K
import random

from keras.layers import Input, Dropout, Dense, concatenate, Embedding
from keras.layers import Bidirectional, GRU,Flatten, Activation, SpatialDropout1D
from keras.optimizers import Adam
from keras.models import Model
from keras.utils import np_utils

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import load_model
from keras.layers import LSTM, CuDNNGRU, CuDNNLSTM
from keras.layers import MaxPooling1D, Conv1D
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback

import warnings
warnings.filterwarnings('ignore')
import os
os.environ['OMP_NUM_THREADS'] = '4'

import re
import math
# set seed
np.random.seed(123)

Using TensorFlow backend.


# Read in Text
Note: if you are loading my sample weights, run every block exactly as written up to the training step. Then skip training and go straight to the generation.

In [0]:
# Read the whole corpus; the context manager closes the file afterwards.
with open('../Shakespere_input.txt', 'r') as f:
    data = f.read()
data = data.lower()

In [0]:
%%time
charindex = list(set(data))
charindex.sort() 
print(charindex)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
CPU times: user 21 ms, sys: 0 ns, total: 21 ms
Wall time: 20 ms


The charindex is important, so keeping a backup is a good idea. If the list is not sorted, or if more text files are added later, the index can change on a new load. Saved models become worthless if the charindex cannot be replicated exactly.

In [0]:
# np.save("../charindex.npy", charindex)
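
The backup-and-restore round trip can be sketched as follows. The notebook itself uses `np.save`; this sketch swaps in the standard library's `json` so it runs without extra packages. The toy text and the backup path are assumptions made for illustration.

```python
import json
import os
import tempfile

# Toy text standing in for the Shakespeare file (an assumption).
data = "to be, or not to be"

# Sorting makes the index deterministic for a given set of characters.
charindex = sorted(set(data))

# Hypothetical backup location; the notebook saves a .npy file instead.
path = os.path.join(tempfile.gettempdir(), "charindex_backup.json")
with open(path, "w") as f:
    json.dump(charindex, f)

with open(path) as f:
    restored = json.load(f)

# The restored index must match exactly, otherwise saved weights
# would map to the wrong characters.
assert restored == charindex
print(restored)
```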

# Create Sequences
In a nutshell, the model looks at the last 75 characters of the script and attempts to predict the 76th. This block chops the .txt file into such sequences.

In [0]:
%%time
CHARS_SIZE = len(charindex)
SEQUENCE_LENGTH = 75
X_train = []
Y_train = []
for i in range(0, len(data)-SEQUENCE_LENGTH, 1 ): 
    X = data[i:i + SEQUENCE_LENGTH]
    Y = data[i + SEQUENCE_LENGTH]
    X_train.append([charindex.index(x) for x in X])
    Y_train.append(charindex.index(Y))

X_train = np.reshape(X_train, (len(X_train), SEQUENCE_LENGTH))

Y_train = np_utils.to_categorical(Y_train)

CPU times: user 52.9 s, sys: 1 s, total: 53.9 s
Wall time: 54.1 s
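
To make the windowing concrete, here is a small sketch of the same idea on a toy string with a window of 5 instead of 75. The `char_to_idx` dictionary is a hypothetical helper that is not in the original block; a dict lookup avoids scanning the list on every `charindex.index` call, which may shorten the run time on the full text.

```python
SEQUENCE_LENGTH = 5  # the notebook uses 75; shortened for readability
data = "hello world"  # toy text (an assumption)
charindex = sorted(set(data))

# Hypothetical helper: maps each character to its position in charindex.
char_to_idx = {ch: i for i, ch in enumerate(charindex)}

X_train, Y_train = [], []
for i in range(len(data) - SEQUENCE_LENGTH):
    window = data[i:i + SEQUENCE_LENGTH]   # input characters
    target = data[i + SEQUENCE_LENGTH]     # the character to predict
    X_train.append([char_to_idx[ch] for ch in window])
    Y_train.append(char_to_idx[target])

# Decode the first pair back to text to check the alignment.
first_window = "".join(charindex[i] for i in X_train[0])
first_target = charindex[Y_train[0]]
print(repr(first_window), "->", repr(first_target))  # 'hello' -> ' '
print(len(X_train), "training pairs")                # 6 training pairs
```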


# Create the Model
The model uses 3 LSTMs stacked on top of each other with 1D CNNs between them. Note that CuDNNLSTM is a special Nvidia-backed layer that optimizes the LSTM to run around twice as fast, but it only works with certain GPUs. Colab's GPU is compatible and all set for it; replace the layers with regular LSTMs if they won't work for you. (Still, only try this code with a good GPU. It would take too long on a CPU or even an underpowered GPU.)


Lighter models may also work if computational power is an issue, though not quite as well. A single LSTM/GRU with 1D CNNs can get by with the help of the loopbreaker in the generation section.

In [0]:
def get_model():
    # Built with the functional API, so no Sequential() container is needed.
    inp = Input(shape=(SEQUENCE_LENGTH, ))
    x = Embedding(CHARS_SIZE, 75, trainable=False)(inp)
    x = CuDNNLSTM(512, return_sequences=True,)(x)
    x = Dropout(0.1)(x)
    x = Conv1D(64, 5, activation="elu", padding='same')(x)
    x = CuDNNLSTM(512, return_sequences=True,)(x)
    x = Dropout(0.1)(x)
    x = Conv1D(128, 3, activation="elu", padding='same')(x)
    x = CuDNNLSTM(512,)(x)
    x = Dense(256, activation="elu")(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation="elu")(x)
    x = Dropout(0.1)(x)
    outp = Dense(CHARS_SIZE, activation='softmax')(x)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.001),
                  metrics=['accuracy'],
                 )

    return model

model = get_model()

model.summary()

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 75)                0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 75, 75)            2925      
_________________________________________________________________
cu_dnnlstm_5 (CuDNNLSTM)     (None, 75, 512)           1206272   
_________________________________________________________________
dropout_3 (Dropout)          (None, 75, 512)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 75, 64)            163904    
_________________________________________________________________
cu_dnnlstm_6 (CuDNNLSTM)     (None, 75, 512)           1183744   
___________________________________________________________
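
The parameter counts in the visible part of the summary can be checked by hand. This sketch uses hypothetical helper functions (not part of the notebook) and assumes the 39-character charindex printed earlier; the CuDNNLSTM formula counts two bias vectors per gate, which is how that layer stores its biases.

```python
CHARS_SIZE = 39   # characters in the printed charindex
EMBED_DIM = 75    # embedding width used in get_model()
UNITS = 512       # units per CuDNNLSTM layer

def embedding_params(vocab, dim):
    # One vector of length `dim` per character.
    return vocab * dim

def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and two biases.
    return 4 * (input_dim * units + units * units + 2 * units)

def conv1d_params(kernel_size, in_channels, filters):
    # One kernel per filter plus one bias per filter.
    return kernel_size * in_channels * filters + filters

print(embedding_params(CHARS_SIZE, EMBED_DIM))  # 2925
print(cudnn_lstm_params(EMBED_DIM, UNITS))      # 1206272
print(conv1d_params(5, UNITS, 64))              # 163904
print(cudnn_lstm_params(64, UNITS))             # 1183744
```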

# Checkpoints and Custom Callback
We will use three callbacks: ModelCheckpoint, EarlyStopping, and a custom TextSample callback. TextSample prints a sample line at the end of every epoch to show how the model is progressing.

In [0]:
filepath="../model_checkpoint.hdf5"

checkpoint = ModelCheckpoint(filepath,
                             monitor='loss',
                             verbose=1,
                             save_best_only=True,
                             mode='min')

early = EarlyStopping(monitor="loss",
                      mode="min",
                      patience=1)
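
With `patience=1`, training halts the first time the monitored loss fails to improve. The rule can be sketched in plain Python as below; `epochs_before_stop` is a hypothetical function written for illustration, not the Keras implementation, and the loss values are made up.

```python
def epochs_before_stop(losses, patience=1):
    """Return how many epochs run before stopping on a metric to minimize."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs have passed.
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)

losses = [1.50, 1.20, 1.25, 1.10]  # illustrative values only
print(epochs_before_stop(losses, patience=1))  # 3
print(epochs_before_stop(losses, patience=2))  # 4
```

In this made-up run, a patience of 1 stops at the small bump in epoch 3 and never reaches the better loss in epoch 4, so a larger patience may be worth trying if training ends early.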

In [0]:
class TextSample(Callback):

    def __init__(self):
        # Call the parent initializer of TextSample (i.e. Callback.__init__).
        super(TextSample, self).__init__()

    def on_epoch_end(self, epoch, logs=None):
        pattern = X_train[700]
#         pattern = pattern * float(CHARS_SIZE)
        outp = []
        seed = [charindex[x] for x in pattern]
        sample = 'TextSample:' +''.join(seed)+'|'
        for t in range(100):
          x = np.reshape(pattern, (1, len(pattern)))
#           x = x / float(CHARS_SIZE)
          pred = self.model.predict(x)
          result = np.argmax(pred)
          outp.append(result)
          pattern = np.append(pattern,result)
          pattern = pattern[1:len(pattern)]
        outp = [charindex[x] for x in outp]
        outp = ''.join(outp)
        sample += outp
        print(sample)

textsample = TextSample()

# Load Model
Load models or weights here. GitHub's file size limit prevents me from providing the full Shakespeare model, but the pretrained weights are in the repo.

In [0]:
# model = load_model(filepath)
# model.load_weights("../full_train_SP_weights.hdf5")



# Train Model
Even with the Colab GPU, this can take a while; I trained the sample model for ~12 hours. Usually, once the loss reaches roughly 1.0, the generator is good enough to use. Most models can keep training almost indefinitely. If the loss gets too low, the model may overfit, which in this case means just copying Shakespeare in the most inefficient way possible. It should take an unrealistically long time to get to that point anyway.

In [0]:
model_callbacks = [checkpoint, early, textsample]
model.fit(X_train, Y_train,
          batch_size=128,
          epochs=1000,
          verbose=1,
          callbacks = model_callbacks)
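
To put a loss of about 1.0 in context, it helps to compare it with the loss of uniform guessing. This is a small worked calculation, assuming the 39-character charindex printed earlier and the natural-log cross-entropy that Keras reports.

```python
import math

CHARS_SIZE = 39  # size of the charindex printed earlier

# Cross-entropy of a model that guesses uniformly over all characters.
uniform_loss = math.log(CHARS_SIZE)

# exp(loss) is the perplexity: the effective number of equally likely
# choices the model is hesitating between for each character.
perplexity_at_1 = math.exp(1.0)

print(round(uniform_loss, 3))     # 3.664
print(round(perplexity_at_1, 3))  # 2.718
```

So a loss near 1.0 means the model has narrowed each prediction from 39 candidates down to roughly 3 effective ones.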

In [0]:
# model.load_weights(filepath)
model.save_weights("../full_train_weights.hdf5")
model.save("../full_train_model.hdf5")

# Making Some New Text
This block generates TEXT_LENGTH characters of new text in the style of the input text. It takes a random seed pattern from the training set, predicts the next character, appends it to the end of the pattern, drops the first character of the pattern, then predicts on the new pattern, and so forth.
## The Loopbreaker

This is a *very*, *very* useful technique I came up with while putting this together. Every LOOPBREAKER predictions, the program changes one of the characters in the pattern at random (except the last few, to prevent spelling errors). This causes the model to perceive just *slightly* different text, which changes its overall predictions. Without this, even a well-trained model starts to repeat itself and gets caught in loops. The loopbreaker can even counteract overfitting very well or allow undertrained models to perform much better. Changing this value up or down is interesting and will significantly change the output. Setting it high produces much more repeated speech; slightly lower produces many lines that start the same and then veer off in different directions; really low produces lots of varied speech, but line structure and format become unstable. (It is also interesting to note that even with most of the pattern's characters set to gibberish, the model can still make full words and rough lines as long as the last 10 remain untouched.)

In [0]:
%%time
TEXT_LENGTH = 10000
LOOPBREAKER = 4


x = np.random.randint(0, len(X_train))  # upper bound is exclusive
pattern = X_train[x]
outp = []
for t in range(TEXT_LENGTH):
  if t % 100 == 0:
    print("%"+str((t/TEXT_LENGTH)*100)+" done")
  
  x = np.reshape(pattern, (1, len(pattern)))
  pred = model.predict(x, verbose=0)
  result = np.argmax(pred)
  outp.append(result)
  pattern = np.append(pattern,result)
  pattern = pattern[1:len(pattern)]
  ####loopbreaker####
  if t % LOOPBREAKER == 0:
    # randint excludes its upper bound, so len(charindex) lets every character be drawn
    pattern[np.random.randint(0, len(pattern)-10)] = np.random.randint(0, len(charindex))

%0.0 done
%10.0 done
%20.0 done
%30.0 done
%40.0 done
%50.0 done
%60.0 done
%70.0 done
%80.0 done
%90.0 done
CPU times: user 1min 45s, sys: 1min 11s, total: 2min 57s
Wall time: 2min 42s
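
The sliding window and the loopbreaker can be isolated in a small pure-Python sketch. Everything here is an assumption made for illustration: `generate` and `toy_predict` are hypothetical names, and the toy predictor stands in for `model.predict` by returning the last index plus one. Because the toy predictor only looks at the last character, and the loopbreaker never touches the last 10 positions, the output stays unperturbed, which shows that the protected tail really is left alone.

```python
import random

def generate(pattern, predict_next, vocab_size, text_length,
             loopbreaker, protect=10, rng=random):
    """Greedy sliding-window generation with a periodic random perturbation."""
    pattern = list(pattern)
    out = []
    for t in range(text_length):
        result = predict_next(pattern)
        out.append(result)
        # Slide the window: drop the oldest character, append the prediction.
        pattern = pattern[1:] + [result]
        # Loopbreaker: overwrite one position, never one of the last `protect`.
        if t % loopbreaker == 0:
            pos = rng.randrange(len(pattern) - protect)
            pattern[pos] = rng.randrange(vocab_size)
    return out

def toy_predict(pattern):
    # Stand-in for the trained model (an assumption): last index plus one.
    return (pattern[-1] + 1) % 39

rng = random.Random(123)
out = generate(list(range(20)), toy_predict, vocab_size=39,
               text_length=8, loopbreaker=4, rng=rng)
print(out)  # [20, 21, 22, 23, 24, 25, 26, 27]
```

A real model reads the whole window, so the perturbed positions would shift its predictions; that is the effect the loopbreaker relies on.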


In [0]:
outp = [charindex[x] for x in outp]
outp = ''.join(outp)

print(outp)


at last in the streets, that he hath stain'd me.

king richard ii:
why, then thou hast more hear his son, and then have
many that he hath said, and the strength of the wall,
that have been still stabb'd and the streams of
the strength o' the lord's part, the strongest stars
are straited and straight
and that the
shearers die about the sea for a strange thing
that i have spoke of him.

menenius:
i have said and the most marriage of the world
that i must not be saffly as a word,
as i have said, and the depose of the sun
that i have spoke of honour to the world.

lucio:
i would the world but soon be so: and then i said
the more i have attended to a wife.

king richard ii:
why, then thou hast of my shame to my soul,
and the depose they would have stay'd to hear
her father with his hopes, and the duke
hath been as much beloved: the subtle traitor
the son of your string; and the sun is spotless
that i have spoke of hostily to my soul,
to lose his head and his
measured to the senate, and the

# Save Text Output

In [0]:
# Write the generated text; the context manager closes the file afterwards.
with open("../output_text.txt", "w") as f:
    f.write(outp)