# Text Generation With Neural Networks

*Isaac Mauro*

For this project, I wanted to explore training a neural network on English text and using it to generate new text. I was inspired to explore this topic after stumbling upon /r/SubredditSimulator, a subreddit run by Reddit user /u/deimorz. The subreddit only allows a collection of bots he has created to post content and comment, and the posts and comments are generated with a Markov model trained on comments and posts from a bunch of different subreddits. It makes for some really interesting, and often hilarious, results.

I have decided to do something similar, except I will be using a recurrent neural network, and I will be training it with various texts, mainly Shakespeare.

## Methods

#### The Data

I am going to be using selections of Shakespeare as training data for my model. I have decided to use a character-level model instead of a word- or sentence-level model, as it will be the best at generalizing the data (hopefully). The data is going to be treated as sequence data, and the model will try to predict the next character in a sequence based on some number of previous characters. A text file with N characters, and a chosen sequence length D, will be split up into N-D sequences, each with D entries. This will form the X matrix. The Y matrix (target matrix) will also have N-D entries, but each entry only contains the character that follows the corresponding sequence in X. The sequence length can be adjusted to tune how far back the model can remember, with a higher number increasing computation time but also allowing the model to make better predictions based on context further in the past. Increasing the sequence length will also probably require more hidden units per layer to effectively keep track of all the features in the data.
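To make the windowing concrete, here is a small sketch on a toy string. The helper name `make_windows` is made up for this illustration and is not the function I use later in the notebook; it only shows how N characters and a sequence length D give N-D input/target pairs.

```python
def make_windows(text, seq_len):
    """Split text into (input sequence, next character) pairs."""
    inputs, targets = [], []
    for i in range(len(text) - seq_len):
        inputs.append(text[i:i + seq_len])   # D characters of context
        targets.append(text[i + seq_len])    # the character that follows them
    return inputs, targets

# Toy example: N = 11 characters, D = 4
inputs, targets = make_windows("hello world", seq_len=4)
print(len(inputs))  # N - D pairs
for seq, nxt in zip(inputs[:3], targets[:3]):
    print(repr(seq), "->", repr(nxt))
```

Each input overlaps the previous one by D-1 characters, which is why the X matrix ends up roughly D times larger than the original text.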

#### The Network

I have decided that to get the most out of this exercise, with the time I have, it would be best for me to use a high-level machine learning library as opposed to creating my own from scratch in numpy or TensorFlow. I settled on Keras as my library of choice. Keras is a high-level machine learning library for Python that uses TensorFlow for the backend computations, and works out of the box with minimal code. Its API is simple and intuitive. The main driver for a Keras ML implementation is called a model, and we will be using the Sequential model. Once a Sequential model is instantiated, you can add layers to the model, which are responsible for the end-to-end computation within the network. Keras allows you to add any number of layers to a model with model.add(), and each layer will pass output to the next automatically, with the correct output dimensions. You only have to specify input dimensions for the first layer added, and Keras takes care of the rest. After the layers are all defined, you just have to compile the model with model.compile(), which allows you to choose a loss function, an optimization algorithm, and any metrics that you want to collect during training. This makes for a powerful but easy-to-use interface for building custom networks.

The LSTM (Long Short-Term Memory) layer in Keras is what I will be using for our network. I'm not very confident in the full math behind LSTM networks, but I will try my best to explain the basic idea.

LSTM networks are a solution to a problem that traditional recurrent neural networks have: not being able to "remember" features in the data that may be relevant, but are too far back in time from the current context. RNNs work very well if the predictions the network is making are based on more recent features, but they begin to deteriorate when they need to take into account features from further in the past.

To solve this problem, LSTM networks use "memory units", which are more complex versions of the neurons used in regular RNNs. Each memory unit consists of the current state, an input gate, an output gate, and a forget gate. The input gate decides when the neuron will accept data for updates. The forget gate decides when, and how much of, the neuron's current state is "forgotten" so that it can be replaced with new information. Finally, the output gate decides when the neuron will "fire" and output something to the next layer. This combination of gates allows the neurons to hold onto information and only update it when the context calls for it, and it also allows a neuron to completely forget its current state and take on new functionality as the context changes.
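As a rough sketch of how the gates fit together, here is one time step of a memory unit written in numpy. This follows the standard textbook LSTM equations; I have not checked it against the Keras implementation, and the names `lstm_step`, `W`, `U`, and `b` are my own choices for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM layer.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    The four row blocks hold the input-gate, forget-gate, candidate,
    and output-gate parameters, in that order.
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new data to accept
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much old state to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the state
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much state to expose
    c = f * c_prev + i * g               # updated cell state
    h = o * np.tanh(c)                   # output passed on to the next layer
    return h, c

# Run three time steps on a toy sequence with one feature per step
rng = np.random.default_rng(0)
units, input_dim = 3, 1
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)
for value in [0.1, 0.5, 0.9]:
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
print(h.shape, c.shape)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state is carried forward almost unchanged, which is what lets the unit hold onto information over many time steps.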

I also want to speak briefly about the other two Keras layers I use in my networks. The first is a dropout layer. Dropout is a technique used to help with overfitting issues that many neural networks have. Dropout randomly turns off a percentage of neurons during each training iteration. This creates a number of "thinned" networks (subsets of the whole network) that are all trained with a small amount of data. At test time, none of the neurons are turned off, and all of the weights get scaled down to account for the neurons that were dropped during training. This forces the neurons to be less co-dependent with other neurons in the network during training, making the network better at generalizing during test time. Nitish Srivastava et al. wrote a paper on the effectiveness of dropout, and it is linked in the references section.
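Here is a minimal numpy sketch of the idea, following the description above (drop units during training, scale at test time). The function names are hypothetical, and scaling the outputs here stands in for scaling the outgoing weights. Many libraries instead scale the kept activations up during training so that nothing needs to change at test time; this sketch does not try to reproduce what Keras does internally.

```python
import numpy as np

def dropout_train(activations, rate, rng):
    """Training time: zero each activation with probability `rate`."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask

def dropout_test(activations, rate):
    """Test time: keep every unit, but scale by the keep probability."""
    return activations * (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10000)
train_out = dropout_train(a, rate=0.2, rng=rng)
test_out = dropout_test(a, rate=0.2)
# The two means should be close: roughly 80% of units survive training,
# and the test-time output is scaled to 80% to match.
print(train_out.mean(), test_out.mean())
```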

The other layer I use in my model is the Dense layer. A dense layer is a standard fully connected layer that sums the inputs times the weights and then passes the result to the activation function. I am using this layer as my output layer, with the softmax activation function. The softmax activation function outputs a probability matrix that is the same shape as the Y matrix. Each entry in the output represents the probability of a certain character being the next character in the sequence. Using this as the output layer, as opposed to a single output, allows the network to output probabilities of characters instead of trying to predict the next character exactly. This opens up the potential to adjust the "temperature" of the predictions. Changing the temperature can make the network make riskier decisions instead of selecting the character with the highest probability every time. For example, the generation function (discussed in the next section) could randomly select a character from the 5, 10, etc. highest probabilities in the output. I chose to simply pick the character with the highest probability in my generation function, but I designed the network in this way so I could play with different temperatures after I get the network trained on a large data set.
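For reference, this is roughly what temperature sampling could look like. It is only a sketch: `sample_with_temperature` is a hypothetical helper that my generation function does not use, and it assumes `probs` is a single softmax output vector.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Rescale a probability vector by `temperature`, then sample an index.

    temperature < 1 sharpens the distribution (safer choices),
    temperature > 1 flattens it (riskier choices).
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12) / temperature
    logits -= logits.max()              # for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()              # renormalize to sum to 1
    index = int(rng.choice(len(scaled), p=scaled))
    return index, scaled

probs = [0.6, 0.3, 0.1]
_, cold = sample_with_temperature(probs, temperature=0.5)
_, hot = sample_with_temperature(probs, temperature=2.0)
print(cold)  # more weight on the most likely character
print(hot)   # weight spread more evenly
```

As the temperature approaches 0, this behaves more and more like always picking the highest probability, which is what my generation function does.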

#### Preparing Data and Generating Text

I started by creating a function to help prepare text files so it will be easy to train my net on different inputs. Notice at the end of this function I split the data into X and Y matrices, which represent the input sequences (X) and the "true" outputs (Y). You can change the length of the sequences returned by adjusting seq_len. A longer sequence length will increase the ability of the network to remember further in the past, but it will increase computation time, and a larger network will be required to get meaningful results.

After a model has been trained on the data, I have to generate new text with it. To do this, I created another function to make it easier. The model expects the same input dimensions for predictions as it had while training, so for each character we want to generate, it needs a sequence of characters. Because of this, the model needs a seed sequence to get it started. I chose to pull a random sample from the dataset as my seed, but the function allows you to input any seed you would like, as long as it matches the sequence length the model expects. The function iterates once for each character that needs to be generated. During each iteration, the model is fed the input sequence, and a new character is generated and appended to both the result string and the input. After the predicted character is appended to the input, the input is too long to be used in the next iteration, so the function chops off the first character and keeps going.


In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
def prepare_data(filename, verbose=False, seq_len=100):
    '''
    prepare_data helps prepare a text file for use within a neural network.
    This function prints some information about the data, and creates dictionaries
    to help convert the characters in the text file to numerical values so the 
    NN can use them effectively.
    
    Args: 
    filename: location of ASCII file to read
    verbose: print extra information about the data
    seq_len: length of the sequences to prepare
    
    Return:
    X: matrix with all the input sequences, shape (len(data) - seq_len, seq_len)
    Y: vector with the integer-encoded character that follows each input sequence
    char_to_int, int_to_char: dictionaries for converting chars to ints and vice versa
    vocab_len: number of unique characters in the dataset
    '''
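    with open(filename, 'r') as f:
        data = f.read()
    # Note: the order of a set of characters can change between Python sessions,
    # so these dictionaries need to be kept alongside any weights trained with them
    chars = list(set(data))
    char_to_int = {char:ix for ix, char in enumerate(chars)}
    int_to_char = {ix:char for ix, char in enumerate(chars)}
    if verbose: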
        print("length of data:", len(data), "number unique chars:", len(chars))
        print("unique chars:")
        print(chars)
    # Prepare the sequences for input into our RNN
    # Note that Y will not contain the first <seq_len> characters
    dataX = [] # Input sequences ((len(data) - seq_len) X seq_len matrix)
    dataY = [] # Output characters (vector of length len(data) - seq_len)
    for i in range(0, len(data) - seq_len, 1):
        seq_in = data[i:i + seq_len]
        seq_out = data[i + seq_len]
        # make sure to convert the characters to ints
        dataX.append([char_to_int[char] for char in seq_in]) 
        dataY.append(char_to_int[seq_out])
    X = np.array(dataX)
    Y = np.array(dataY)
    print("X shape:",X.shape)
    print("Y shape:",Y.shape)
    return X, Y, char_to_int, int_to_char, len(chars)

def generate(model, seed, vocab_len, int_to_char, length=100):
    '''
    This function generates characters based on the model and seed that it is given.
    
    args
    model: compiled keras model
    seed: integer-encoded sequence to get the generation started; its length
          must match the sequence length the model was trained with
    vocab_len: length of text vocab
    int_to_char: dictionary for converting integers back to characters
    length: number of characters to generate, default 100
    
    returns a string of generated characters
    '''
    result = ""
    for i in range(length):
        # reshape the seed into the input dimensions our model expects
        X = np.reshape(seed, (1, seed.shape[0], 1))
        # normalize
        X = X/float(vocab_len)
        # Make a prediction
        predict = model.predict(X)
        # Pick the value with the highest probability and convert it to a character, and add it to the result
        index = np.argmax(predict)
        char = int_to_char[index]
        result = result + char
        # Append the prediction to the seed, and then remove the first element of the seed.
        # This shifts the "window" over so our model can make a new prediction, also using the
        # new character we just generated. This continues for as long as we have more characters to generate
        seed = np.append(seed, index)
        seed = np.delete(seed, 0)
    return result

Using TensorFlow backend.


In [53]:
dataX,dataY,char_to_int,int_to_char,vocab_len = prepare_data('randjfull.txt',verbose=True)

length of data: 174126 number unique chars: 85
unique chars:
['E', '\ufeff', 'D', 'B', 'T', 'O', 'H', 'C', '?', 'm', '2', ':', 'F', 'Y', '(', ']', 'q', 'b', '4', 'v', 'o', 'N', '>', 'W', 'd', 't', '$', 'z', 'j', '3', '6', 'G', ',', 'P', 'Q', 'A', 'g', '9', ')', 'p', '*', ';', 'k', '8', 'f', 'h', 'i', '1', '7', '%', 'I', '<', 'u', "'", '.', 'J', 'S', 'U', 'R', '5', ' ', '!', 's', 'M', 'K', 'V', 'l', 'r', '/', '0', '"', 'L', '@', 'x', 'w', '[', '-', 'n', 'Z', 'y', '\n', 'a', 'c', 'X', 'e']
X shape: (174026, 100)
Y shape: (174026,)


## Training the models
The documentation calls for the input matrix to be of the shape [number of samples, time steps, number of features]. This means we need to reshape our X matrix, which is currently of the shape [number of samples, time steps], so that each time step holds a single feature (one character).

We also need to normalize the inputs, as is standard for most ML tasks.

Finally, we need to convert the outputs (Y matrix) to a one-hot encoding. A one-hot encoding is a sparse array with a length equal to the number of unique characters, where each index refers to one of those characters. So each entry in Y will be a sparse array of all zeros, except for the index that refers to the character, which will be a 1. This allows me to use the softmax output layer and have the loss function be able to compare Y with the output.
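To show what the one-hot encoding looks like, here is a tiny numpy version on made-up data. The helper `one_hot` and the toy arrays are only for illustration; in the notebook itself I let Keras do the conversion.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Turn a vector of class indices into a matrix of one-hot rows."""
    encoded = np.zeros((len(indices), num_classes))
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded

# Three target characters drawn from a toy vocabulary of 4 characters
dataY_small = np.array([2, 0, 3])
Y_small = one_hot(dataY_small, num_classes=4)
print(Y_small)
```

Each row has a single 1 in the column of the target character, so taking the argmax of a row recovers the original integer.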

In [37]:
# Reshape X to follow keras.layers.LSTM specification
X = np.reshape(dataX, (dataX.shape[0], dataX.shape[1], 1))
# Normalize X so all the data falls within the range [0, 1). Nice and easy because all our inputs are >= 0
X = X/float(vocab_len)
# Do the one-hot encoding of Y. keras provides a very simple way to do this
Y = np_utils.to_categorical(dataY)

Now we can start building up our network. I want to begin with a smaller network to see how well it works, and then move onto larger networks.

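The first layer is our LSTM hidden layer, with 100 "memory units". The second layer is the Dense output layer described earlier, with one softmax output per character.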

Finally, we can compile the neural network, which sets everything up on the TensorFlow backend automatically. It requires that you specify a loss function and an optimizer function. In class, we have pretty much only dealt with mean-squared error or classification error as loss functions. For this network, I will be using a categorical cross-entropy loss function, which is supposed to work better for classification problems with lots of classes (we have 85 "classes" in this particular dataset). For our optimization function we will use *Adam*, which is similar to the gradient descent we used in class, but it is more efficient, especially when the dataset is large or the model has a lot of parameters. Using this instead of regular gradient descent should make training much faster. Keras makes it really easy to customize every bit of our network, and we could have even defined our own loss and optimization functions.
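As a quick worked example of the loss, here is categorical cross-entropy computed by hand for a single one-hot target. This is a plain Python sketch of the textbook formula, not the Keras implementation, and the probability vectors are made up.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

y_true = [0, 1, 0]                # the correct class is index 1
confident = [0.05, 0.90, 0.05]    # most of the probability on the right class
unsure = [0.40, 0.30, 0.30]       # probability spread across the classes

print(categorical_crossentropy(y_true, confident))  # small loss
print(categorical_crossentropy(y_true, unsure))     # larger loss

# Loss of a model that guesses uniformly over 85 characters
uniform_85 = -math.log(1 / 85)
print(uniform_85)
```

The uniform-guess value gives a baseline to compare the training losses against: any loss below it means the model has learned something about which characters are likely.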

In [38]:
# Initializes the model. Can also do this whole block as a one-liner by
# using a list of the layer definitions as an argument for the Sequential constructor
model1 = Sequential()
# First layer. The first layer in a model requires the input shape be specified
model1.add(LSTM(100, input_shape = (X.shape[1],X.shape[2])))
# Second layer
model1.add(Dense(Y.shape[1], activation='softmax'))
# Compile the network to the TensorFlow backend
model1.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

Because of the computation time required for training these networks, Keras provides utilities to create checkpoints during training. Each checkpoint saves the network configuration and all of the weights, and a new one is written every time the loss improves after an iteration (epoch). This allows us to leave the network to train and come back to see which iteration has given the best loss, and then use the model's state during that iteration to make predictions.

In [96]:
# Each checkpoint will create a file with a name that indicates the epoch and loss
file="model1-improvement-{epoch:02d}-{loss:.4f}.hdf5"
# The checkpoints will be made each time the loss improves between epochs
checkpoint = ModelCheckpoint(file, monitor='loss', verbose=1, save_best_only=True, mode='min')
# This is what we pass into the training function
callbacks_list = [checkpoint]
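
The file name template above is an ordinary Python format string, so the epoch and loss can be read back out of the saved file names. The sketch below shows how the template expands and one way to pick the checkpoint with the lowest loss; `best_checkpoint` is a hypothetical helper I wrote for this illustration, not something Keras provides.

```python
import re

# How the template expands for a given epoch and loss
template = "model1-improvement-{epoch:02d}-{loss:.4f}.hdf5"
print(template.format(epoch=19, loss=2.46931))

def best_checkpoint(filenames):
    """Return the checkpoint file name with the lowest loss in its name."""
    pattern = re.compile(r"-(\d+)-(\d+\.\d+)\.hdf5$")
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None

files = [
    "model2-improvement-01-2.6089.hdf5",
    "model2-improvement-05-2.1107.hdf5",
    "model2-improvement-23-1.6399.hdf5",
]
print(best_checkpoint(files))
```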

Now that checkpoints are set up, we can fit the model. model.fit() allows us to specify a list of *callback functions* which will be run after every iteration. In our case, the callback list only contains the checkpoint function, which creates a model checkpoint every time the criteria described above are met. I have set the number of epochs to a low number for the initial fit to keep training time to a minimum. The batch size indicates how many samples to train with at a time. With a batch size of 100, 100*100 characters will be loaded at a time, which should ensure that I have enough memory during each epoch.
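To put numbers on the batch size, the arithmetic below works out how many batches make up one epoch for the Romeo and Juliet data prepared earlier (174026 sequences of length 100). `batches_per_epoch` is just a name I picked for this sketch.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

num_samples, seq_len, batch_size = 174026, 100, 100
print(batches_per_epoch(num_samples, batch_size))  # batches in each epoch
print(batch_size * seq_len)                        # characters held in one batch
```

A smaller batch size means more weight updates per epoch, which is one reason lowering it tends to make each epoch take longer.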

In [None]:
model1.fit(X, Y, epochs = 20, batch_size = 100, callbacks = callbacks_list)

In [49]:
# Load the model with the best loss from the checkpoints
model1.load_weights('model1-improvement-19-2.4693.hdf5')
model1.compile(loss='categorical_crossentropy', optimizer='adam')
# pick a random sequence from the data
rand = np.random.randint(0, dataX.shape[0]-1)
seed = dataX[rand]
result = generate(model1, seed, vocab_len, int_to_char, length=200)
print(result)

''ff'X:SZ'f5''ff'Z:SZ'f5d'uffS:uZ'C5Z:So:dF@ZB'Z:d'ZqSo:dF@ZZ'Zw:''ff':qZZ'Zw:'if51:uZ'C5Z:So:dF@ZB'Z:d'ZqSo:dF@ZB'Zw:'iff1:uZ'C5Z:So:dF@ZB'Z:d'ZqSo:dF@ZZ''f''yd:''ffS:y''f''fB:f'Zf''ydd:'Zf''f5''ff':


It appears that our network is both overfitting to the data, and definitely is not big enough, nor trained on enough iterations, to give meaningful results. So it's time to try a new network, this time including multiple LSTM layers and a dropout layer.

Note: I accidentally reset the output of the cell above when editing this journal before submission. The model was trained on a text file that I accidentally deleted. The new Romeo and Juliet I downloaded had different characters, and the dictionaries were created differently when preparing the data with the new text file, so the output has all the wrong characters, making it completely nonsensical. You can still see the repeating patterns though, which were much more apparent when the output was English text.
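
One way to avoid this kind of mismatch in the future would be to save the character dictionary next to the weights and reload it before generating. Below is a minimal sketch using a JSON file; `save_vocab`, `load_vocab`, and the file name are hypothetical and are not used elsewhere in this notebook.

```python
import json
import os
import tempfile

def save_vocab(path, char_to_int):
    """Write the char -> int dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(char_to_int, f)

def load_vocab(path):
    """Read the dictionary back and rebuild the int -> char direction."""
    with open(path, "r", encoding="utf-8") as f:
        char_to_int = json.load(f)
    int_to_char = {ix: char for char, ix in char_to_int.items()}
    return char_to_int, int_to_char

# Round trip with a toy vocabulary
vocab = {"a": 0, "b": 1, "\n": 2}
vocab_path = os.path.join(tempfile.mkdtemp(), "model1-vocab.json")
save_vocab(vocab_path, vocab)
restored, restored_inverse = load_vocab(vocab_path)
print(restored == vocab, restored_inverse[2] == "\n")
```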

In [None]:
model2 = Sequential()
# First layer, this time with more units
model2.add(LSTM(250, input_shape = (X.shape[1],X.shape[2]), return_sequences=True))
# Dropout for the first layer
model2.add(Dropout(0.2))
# Second layer
model2.add(LSTM(250))
# Dropout for second layer
model2.add(Dropout(0.2))
# Output layer
model2.add(Dense(Y.shape[1], activation='softmax'))
# Compile the network to the TensorFlow backend
model2.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

#Setup checkpoints
file="model2-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

#Train, this time with more epochs 
model2.fit(X, Y, epochs = 50, batch_size = 60, callbacks = callbacks_list)

I ended up stopping this training early after it ran for 14 hours; the change in loss had started to slow down and I was impatient. Let's see if it is working a bit better. I'll also try some of the saved weights for the earlier epochs to see what our model has learned over time.

In [57]:
# Load a model from first epoch
model2.load_weights('model2-improvement-01-2.6089.hdf5')
model2.compile(loss='categorical_crossentropy', optimizer='adam')
rand = np.random.randint(0, dataX.shape[0]-1)
seed = dataX[rand]
result = generate(model2, seed, vocab_len, int_to_char, length=200)
print(result)

the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to beat the ware to b


In [58]:
# Load a model from 5th epoch
model2.load_weights('model2-improvement-05-2.1107.hdf5')
model2.compile(loss='categorical_crossentropy', optimizer='adam')
rand = np.random.randint(0, dataX.shape[0]-1)
seed = dataX[rand]
result = generate(model2, seed, vocab_len, int_to_char, length=200)
print(result)

 ao the eoen
    The here in the farth ahe seart ahe seart and the siarn.
    The  our ho teart and the  ousen the eoen
    The here in the farth ahe seart ahe seart and the siarn.
    The  our ho tea


In [60]:
# Load the model from the best epoch
model2.load_weights('model2-improvement-23-1.6399.hdf5')
model2.compile(loss='categorical_crossentropy', optimizer='adam')
# pick a random sequence from the data
rand = np.random.randint(0, dataX.shape[0]-1)
seed = dataX[rand]
result = generate(model2, seed, vocab_len, int_to_char, length=300)
print(result)

 be the fest ahou srt ahould be the fack.

  Rom. What say you teres and the world will be brow the sin.

  Jul. I would the word and the world with ahe sends.

  Rom. What say you tears and the  hildren of the sound.
    The forgring to be oour mere to my love.
    The forering to be oour to be to 


As you can see from the samples above, the model is definitely learning, but I think it could do with some more tuning. A larger network will hopefully help, and I also want to decrease the sequence size a bit so I can train it faster. I am going to up the units in each layer to 512. I'm also going to increase the dropout a bit because multiple runs with different random seeds tend to produce similar patterns, indicating a potential overfit to the data. I have also changed to a slightly smaller dataset, still Shakespeare, but only 25 selected sonnets, making for a dataset about 1/10 of the size.
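
To get a feel for how much bigger this makes the network, the sketch below counts trainable weights using the usual formula for an LSTM layer with biases (four blocks of weights, each seeing the input and the previous output). The helper names are mine, and I am assuming the formula applies to the layers as configured here; the counts are an estimate, not something I read off the trained models.

```python
def lstm_param_count(input_dim, units):
    """Weights in a standard LSTM layer: 4 blocks of (input + recurrent + bias)."""
    return 4 * ((input_dim + units) * units + units)

def dense_param_count(input_dim, units):
    """Weights in a fully connected layer, including biases."""
    return (input_dim + 1) * units

# Two LSTM layers of 250 units, 85 output characters
two_by_250 = (lstm_param_count(1, 250)
              + lstm_param_count(250, 250)
              + dense_param_count(250, 85))

# Two LSTM layers of 512 units, 67 output characters
two_by_512 = (lstm_param_count(1, 512)
              + lstm_param_count(512, 512)
              + dense_param_count(512, 67))

print(two_by_250, two_by_512)
```

Roughly doubling the units per layer multiplies the number of weights by about four, because the recurrent weights grow with the square of the layer size.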

In [55]:
# Prepare the data again, this time with a shorter sequence length
dataX2,dataY2,char_to_int,int_to_char,vocab_len = prepare_data('sonnets.txt',verbose=True, seq_len=50)
# Reshape X to follow keras.layers.LSTM specification
X = np.reshape(dataX2, (dataX2.shape[0], dataX2.shape[1], 1))
# Normalize X 
X = X/float(vocab_len)
# Do the one-hot encoding of Y.
Y = np_utils.to_categorical(dataY2)
# Setup the 3 layer model
model3 = Sequential()
model3.add(LSTM(512, input_shape=(X.shape[1],X.shape[2]), return_sequences=True))
model3.add(Dropout(0.3))
model3.add(LSTM(512))
model3.add(Dropout(0.3))
model3.add(Dense(vocab_len, activation='softmax'))
model3.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

#Setup checkpoints
file="model3-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

length of data: 16889 number unique chars: 67
unique chars:
['D', 'O', 'H', 'B', 'T', '?', 'C', 'm', 'F', ':', '2', 'q', 'Y', '4', '(', 'b', 'v', 'o', 'N', 'W', 'd', 't', 'z', '3', '6', 'j', 'G', 'P', 'A', 'g', ')', '9', 'p', 'c', ';', 'k', '8', 'f', 'i', 'h', '1', '7', 'I', 'u', "'", '.', 'S', 'U', 'R', '5', ' ', 's', 'M', 'V', 'l', 'r', '0', 'L', 'x', 'w', '-', 'n', 'y', '\n', 'a', ',', 'e']
X shape: (16839, 50)
Y shape: (16839,)


In [7]:
# Training time. I'm going to do fewer epochs b/c time constraint,
# as well as slightly larger batch size for the same reason
model3.fit(X, Y, epochs = 50, batch_size = 128, callbacks = callbacks_list)

Epoch 1/50
Epoch 2/50
...
Epoch 50/50


<keras.callbacks.History at 0x7efddf808b00>

In [19]:
# Load the model from the best epoch
model3.load_weights('model3-improvement-49-0.2156.hdf5')
model3.compile(loss='categorical_crossentropy', optimizer='adam')
# pick a random sequence from the data
rand = np.random.randint(0, dataX2.shape[0]-1)
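# copy the row so that shuffling the seed does not modify dataX2 in place
seed = dataX2[rand].copy()
# shuffle the seed so it is no longer an exact sequence from the training data
np.random.shuffle(seed)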
result = generate(model3, seed, vocab_len, int_to_char, length=500)
print(result)

se.
  And sur the brave day sunk in hideous night,
  When I behold the violet past prime,
  And sable curls all silvered o'er with white:  
  When lofty trees I see barren of leaves,
  Which erst from heat did canopy the herd
  And summer's green all girded up in sheaves
  Borne on the bier with white and bristly beard:
  Then of thy beauty do I question make
  That thou among the wastes of time must go,
  Since sweets and beauties do themselves forsake,
  And die as fast as they see others grow


Wow! It worked a lot better with a larger network and more epochs. The loss was still improving as we iterated, so even more epochs could improve it further. The overfitting problem seems to still be an issue, as some of the generations appear to be straight from the data. There are still a few spelling errors here and there, but the network has learned most words, how to punctuate, and even remembers to close parentheses. I think this network would have performed even better with a larger dataset. Now I want to try to add another layer and increase dropout again to see if it improves. I also want to increase the learning rate, which will make the optimization algorithm take larger steps when it updates the weights. This is a common suggestion when adding dropout to a network's training.

I believe that the best results from this network will come with a much larger text corpus, but due to time constraints, and the fact that this is the computer I use for other work, I am going to stick with this short dataset for now. I have been trying to get an AWS instance so I can train on the cloud, but they are not allowing me to use any instances with GPUs. Hopefully I can train this on a large dataset soon, and maybe my results will be near what Andrej Karpathy has shown with his network.

Now I am going to play around with an even larger network. I tuned the model several times over a couple of days and found the configuration below to work quite well. A few of my runs failed to give output, which I suspect is due to a bug in Keras. I attempted to up the learning rate in the optimization function, but it was failing to produce any output. Running it again without adjusting the learning rate seems to have fixed the problem, which leads me to believe there is a bug when adjusting the hyperparameters for the Adam optimizer. While I would love to play around with increasing the learning rate, I do not think I will be able to figure out how to fix this bug with the time I have for this project.
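
Since I could not test different learning rates in Keras, here is a small pure-Python sketch of a single-parameter Adam update to show what the learning rate controls. It follows the update rule and default hyperparameters from the Adam paper, uses a made-up objective (theta squared), and does not attempt to reproduce the problem I ran into; `adam_step` and `minimize` are names I chose for this illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at step t (starting from t=1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def minimize(lr, steps=200):
    """Run Adam on f(theta) = theta**2 starting from theta = 5."""
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * theta
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta

# A larger learning rate moves further toward the minimum at 0 in the same number of steps
print(minimize(0.001), minimize(0.01))
```

Because the update is normalized by the gradient's running magnitude, each early step has a size close to the learning rate itself, so the learning rate directly sets how fast the weights can move.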

In [2]:
# Prepare the data again, this time with a shorter sequence length
dataX2,dataY2,char_to_int,int_to_char,vocab_len = prepare_data('sonnets.txt',verbose=True, seq_len=50)
# Reshape X to follow keras.layers.LSTM specification
X = np.reshape(dataX2, (dataX2.shape[0], dataX2.shape[1], 1))
# Normalize X 
X = X/float(vocab_len)
# Do the one-hot encoding of Y.
Y = np_utils.to_categorical(dataY2)
# Setup the 3 layer model
model4 = Sequential()
model4.add(LSTM(512, input_shape=(X.shape[1],X.shape[2]), return_sequences=True))
model4.add(Dropout(0.4))
model4.add(LSTM(512, return_sequences=True))
model4.add(Dropout(0.4))
model4.add(LSTM(512))
model4.add(Dropout(0.4))
model4.add(Dense(vocab_len, activation='softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam')

#Setup checkpoints
file="model4-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

length of data: 16889 number unique chars: 67
unique chars:
['r', 'd', 'p', 'v', '9', 'V', '4', 'A', ',', '2', 's', 'S', 'a', '0', 'G', 'u', 'U', ' ', 'b', 'j', 'C', 'q', 'Y', 'o', '\n', '3', 'g', '?', 'O', 'R', 'x', 'k', 'I', ')', 'f', 'T', 'B', 'h', ':', '-', 'M', 't', 'z', 'l', '5', 'y', 'N', 'w', "'", 'c', 'n', 'P', ';', 'F', 'D', '7', '8', '6', 'e', 'L', '(', 'W', 'i', 'H', '.', '1', 'm']
X shape: (16839, 50)
Y shape: (16839,)


In [49]:
model4.fit(X, Y, epochs = 50, batch_size = 128, callbacks = callbacks_list)

Epoch 1/50
Epoch 2/50
...
Epoch 41/50

KeyboardInterrupt: 

In [54]:
# Load the model from the best epoch
model4.load_weights('model4-improvement-38-0.3985.hdf5')
model4.compile(loss='categorical_crossentropy', optimizer='adam')
# pick a random sequence from the data
rand = np.random.randint(0, dataX2.shape[0]-1)
seed = dataX2[rand]
#np.random.shuffle(seed)
result = generate(model4, seed, vocab_len, int_to_char, length=500)
print(result)

ng wanth theu sear yous?t
  And ivery fair wrom fair sometime declines,
  By children's eyes, and child, and all wirm thiee,
  Serking that beauteous roof to ruinate
  Which this (Time's pencil) or my pupil 
  Wakes foth tuine eyes fnd allen fiter  Che lovely gaze where every eye doth dwell
  Whiu lavpy sears the time that fair shoul arte.
  How much more praise deserved thy beauty's use,
  If thou couldst answer 'This fair child of mine
  Shall sum my count, and make my old wirm
  But thou cont


As you can see above, the model is no longer overfitting the data, but it has many more spelling errors than before. I assume this is due to the small dataset I am using, in combination with the high dropout. I was hoping an increased learning rate would have helped with that, but due to the bug I am unable to test it at this time. 

I think this network configuration is getting close enough to optimal that I would like to train it on a large dataset for a couple of days, lowering the batch size a bit. But due to time constraints, this is all I will have trained before the deadline.

## Conclusion
Overall I am pretty satisfied with the results I was getting from such a small dataset. At first I was getting some serious overfitting problems, but increasing the number of layers and adding a higher dropout helped with that problem significantly, so much so that it may be affecting accuracy. An increase in learning rate may have helped with that issue, but I think the biggest thing that would have helped is a much larger dataset and more epochs. Unfortunately, I don't currently have the resources to train models like that in any reasonable time. My final network, even on the shortened dataset, still took over 16 hours for ~40 epochs, and that was for only one of the configurations I tried with the 3-layer network.

I look forward to training this on a larger dataset once I can get some AWS instances or upgrade my computer. I would also like to explore different types of sequence-based data with this same model, like music and other audio recordings. Training this model on a collection of albums from a certain band would be a cool exercise that I am definitely going to be trying this summer.

# References
Jason Brownlee's blog post, a huge help with getting this up and running: 

http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

Original inspiration. Some of the bots' posts can be NSFW, so visit at your own discretion:

http://reddit.com/r/subredditsimulator

This blog post also helped push me towards this project (Andrej Karpathy):

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Keras docs and source code:

https://keras.io/
https://github.com/fchollet/keras

Paper on dropout and why it works for the overfitting problem:

https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf