<a href="https://colab.research.google.com/github/Shesh6/IL181--Deep-Learning-Tutorial/blob/master/Deep_Learning_Tutorial_Assignment_3_Deep_LSTM_Lovecraft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Deep Learning Tutorial Assignment 3 - Sequential Prediction

### Deep One: The H.P. Lovecraft Text Generator
_Yoav Rabinovich, December 2019_

A deep recurrent long short-term memory (LSTM) network that renders the unspeakable readable.

---------------------------------

#### Introduction

H.P. Lovecraft is a classic author of horror fiction. Growing up racist and lonely on a cold and dreary New England farm probably played a part in shaping his grotesque, mysterious prose, all about monstrosities in the dark depths and alien horrors of cosmic scale.

Deep One has been raised under similar conditions, on a lonely GPU in a cold and dreary Google server farm. While it makes slightly less grammatical sense than Lovecraft, I believe it manages to capture a bit of his unique style.

#### Model:

Deep One is a character-based deep RNN that uses stacked LSTM layers to learn the distribution of the next character in Lovecraft's writings given the preceding string. The one-hot encoded input sequence is fed into the first LSTM layer, which outputs a sequence of hidden-state vectors, one per position. Each subsequent LSTM layer takes the previous layer's output sequence as its input. After the last recurrent layer, a dense layer with a softmax activation is applied at every position to produce a probability distribution over the next character. The loss is the categorical cross-entropy between these predictions and the true next characters, and the parameters of the network are optimized with RMSprop through back-propagation.
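
As a rough illustration of the loss, the sketch below computes a softmax and the categorical cross-entropy by hand for a toy sequence. The function names, the 4-character vocabulary and the scores are all made up for the example; Keras performs the equivalent computation internally.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_crossentropy(logits_per_step, target_ix_per_step):
    """Mean negative log-probability assigned to the true next character."""
    losses = []
    for logits, target_ix in zip(logits_per_step, target_ix_per_step):
        probs = softmax(logits)
        losses.append(-math.log(probs[target_ix]))
    return sum(losses) / len(losses)

# Toy example: 3 time steps, 4-character vocabulary (illustrative numbers)
toy_logits = [
    [2.0, 0.0, 0.0, 0.0],   # fairly confident in character 0
    [0.0, 0.0, 0.0, 0.0],   # no preference at all
    [0.0, 0.0, 3.0, 0.0],   # confident in character 2
]
toy_targets = [0, 1, 2]     # index of the true next character at each step

loss = sequence_crossentropy(toy_logits, toy_targets)
print(round(loss, 4))
```

The uniform middle step contributes the most to the loss, since the model assigns only a quarter of the probability mass to the correct character there.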


#### Data

I used a dataset of the entirety of Lovecraft's published works, and generated input-output pairs for training by copying each input string and shifting it one character forward. This way each input character is paired with the character that follows it as its target output.
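
The pairing can be sketched on a toy string before any one-hot encoding is involved. `make_pairs` and `toy_text` are illustrative names, and reserving one spare character at the end (the `- 1`) is an assumption added here so that the last target is never cut short.

```python
def make_pairs(text, seq_length):
    """Split text into (input, target) chunks, the target shifted by one character."""
    # Leave one spare character at the end so the last target is full length
    n_chunks = (len(text) - 1) // seq_length
    pairs = []
    for i in range(n_chunks):
        start = i * seq_length
        inputs = text[start:start + seq_length]
        targets = text[start + 1:start + seq_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_text = "the strange stones"   # stand-in for the corpus
for inputs, targets in make_pairs(toy_text, seq_length=5):
    print(repr(inputs), "->", repr(targets))
```

Each target chunk has the same length as its input, and its last character is the first character of the next input chunk.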

#### Training:

Training the network on a GPU, I used the largest batch size that would fit in memory, given my chosen sequence length, hidden layer size and number of LSTM layers. This reduces the variance of the stochastic gradient updates, which should allow the network to converge faster. I trained the model for 400 epochs (around 4 hours) on sequences of 100 characters to produce the results explored below (with five 256-unit LSTM layers).
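
The effect of batch size on gradient noise can be illustrated without a network at all. In the sketch below, random numbers stand in for per-example gradients (an assumption for illustration only, not data from the model), and the variance of the mini-batch average is measured for a few batch sizes.

```python
import random
import statistics

random.seed(0)

# Stand-in for per-example gradients: noisy values around a true gradient of 1.0
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(20000)]

def minibatch_estimate_variance(grads, batch_size):
    """Variance of the mini-batch mean across consecutive, non-overlapping batches."""
    means = [
        sum(grads[i:i + batch_size]) / batch_size
        for i in range(0, len(grads) - batch_size + 1, batch_size)
    ]
    return statistics.pvariance(means)

for batch_size in (1, 16, 256):
    print(batch_size, minibatch_estimate_variance(per_example_grads, batch_size))
```

For independent samples the variance of the average falls roughly in proportion to the batch size, so each update from a larger batch is a less noisy estimate of the true gradient.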

#### Analysis:

I implemented two ways to sample the model: either with a random initial character, or with a seed string. I use the former to print a random sentence between training epochs to track progress, and the latter to analyze the resulting model.

During training, a clear pattern emerges when we examine the order in which the model picks up on the correlations in the sequences. Every model I trained started out predicting nothing but spaces, since they're a common character that shows little preference regarding which character precedes it. The model then spends a long time producing repeated patterns, corresponding to common words and expressions in the text: first comes the "the the the..." pattern, then "of the of the of the...", and in the case of this Lovecraft dataset, the network picks up on the word "strange" and keeps repeating it. It seems particularly attracted to words that start with "st" or "co", probably due to their frequency in the text, producing outputs such as "strange stranger of the could strange stranger...".

The first non-repeating pattern that's picked up is exhibited when the sampled text begins with a number or a line break: the model learns the common pattern representing a page transition in the raw dataset, which is two line breaks followed by the page number and another pair of line breaks. It took approximately 400 epochs for my model to rid itself of repeating patterns and start producing real sentences. However, these still mostly exhibit awareness of two or three words at a time, producing run-on sentences without a clear subject-verb-object structure.

However, examining some seeded examples we find that some complex contextual awareness was successfully captured by the model. For example, when seeded with the word "Written", the model completes the sentence:

"Written in March of 1926 in Weird Tales 

Where bulletin of Boston's speech was the strangeness of the strange stones.."

This shows awareness of the context of the metadata presented between stories, even though no story was written in that particular month. The model seems to lose track a bit and confuse the date of writing with the date of publication, since the year is followed by the title of a popular magazine. It also knows to start the next sentence with a capital letter without a preceding period, which doesn't appear in the metadata sections of the dataset.

The model also completes "Prof" to "Professor Well", a capitalized name that isn't found in the dataset, and completes seeds that begin with quotation marks in a way that resembles dialogue, and begins new paragraphs with quotation marks as well, which is perhaps the longest-range correlation I've managed to identify in the model:

" "Yes to see what I had seen - the season when he saw them 
again. 

"Inn of the party were stumbled to the papers and saw the proper sent 
of the strange paths and the strange stone brown streets and.. "

To encourage the learning of longer-range correlations I tried training the model on longer sequences, as 100 characters represent only around 20 words. However, those took too long to train. Since the loss was still decreasing when I stopped training, and since the model exhibited a clear pattern of encoding longer-range correlations as training progressed, I believe that longer training would eventually produce believable text. Another way to encourage these patterns is the use of attention, which is a good next step when I come back to this project in the future.

### Code
 _Adapted from https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/_

_Note: Deprecation messages in the code can't be suppressed since they're produced inside the Keras functions_

#### Imports

In [2]:
import numpy as np
%tensorflow_version 1.x
import tensorflow as tf
import keras as k

Using TensorFlow backend.


#### Data Preprocessing

In [0]:
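# Read corpus and isolate characters
with open("The-Collected-Works-of-HP-Lovecraft_djvu.txt", 'r') as f:
    data = f.read()
# Sort so the character-to-index mapping is the same on every run,
# which matters when reloading saved weights in a new session
chars = sorted(set(data))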
VOCAB_SIZE = len(chars)
CORPUS_SIZE = len(data)

ix_to_char = {ix:char for ix, char in enumerate(chars)}
char_to_ix = {char:ix for ix, char in enumerate(chars)}

# Size of sequences to train on
SEQ_LENGTH = 100
# Reserve one trailing character so the last shifted target is full length
SLICE = (CORPUS_SIZE - 1) // SEQ_LENGTH

In [0]:
# Create input and output sequence pairs
X = np.zeros((SLICE, SEQ_LENGTH, VOCAB_SIZE))
y = np.zeros((SLICE, SEQ_LENGTH, VOCAB_SIZE))

for i in range(0, SLICE):
    # One-hot Encoding
    X_sequence = data[i*SEQ_LENGTH:(i+1)*SEQ_LENGTH]
    X_sequence_ix = [char_to_ix[value] for value in X_sequence]
    input_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
    for j in range(SEQ_LENGTH):
        input_sequence[j][X_sequence_ix[j]] = 1.
    X[i] = input_sequence

    # Shift characters one to the left for output
    y_sequence = data[i*SEQ_LENGTH+1:(i+1)*SEQ_LENGTH+1]
    y_sequence_ix = [char_to_ix[value] for value in y_sequence]
    target_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
    for j in range(SEQ_LENGTH):
        target_sequence[j][y_sequence_ix[j]] = 1.
    y[i] = target_sequence
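
The inner loops above set one entry per character. As an equivalent sketch, the same one-hot rows can be built in a single step by indexing into an identity matrix. `one_hot` and the `toy_` names are illustrative and not part of the notebook.

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Return an array of shape (len(indices), vocab_size) with a single 1 per row."""
    # Row i of the identity matrix is the one-hot vector for index i
    return np.eye(vocab_size)[np.asarray(indices, dtype=int)]

toy_chars = sorted(set("strange"))                      # toy vocabulary
toy_char_to_ix = {c: i for i, c in enumerate(toy_chars)}
toy_ix = [toy_char_to_ix[c] for c in "strange"]

encoded = one_hot(toy_ix, len(toy_chars))
print(encoded.shape)
```

Taking `argmax` along the last axis recovers the original indices, which is how the generation functions decode the model's predictions.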

#### Text Generation (Decoder) Functions

In [0]:
# Generate text from random character
def generate_text(model, length):
    ix = [np.random.randint(VOCAB_SIZE)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, VOCAB_SIZE))   
    for i in range(length):
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    return ('').join(y_char)

In [0]:
# Generate text from seed string
# `length` is the total number of characters returned, seed included
def generate_text_from_seed(model, length, seed):

    # Encode seed
    seed_length = len(seed)
    X = np.zeros((1, length+seed_length, VOCAB_SIZE))
    y_char = []
    for i in range(seed_length):
        ix = [char_to_ix[seed[i]]]
        X[0,i,:][ix[-1]] = 1
        new_char = ix_to_char[ix[-1]]
        print(new_char, end="")
        y_char.append(new_char)

    # Generate text
    for i in range(seed_length,length):
        ix = np.argmax(model.predict(X[:, :i, :])[0], 1)
        new_char = ix_to_char[ix[-1]]
        print(new_char, end="")
        X[0, i, :][ix[-1]] = 1
        y_char.append(new_char)
    return ('').join(y_char)

#### Model

In [7]:
# Model Parameters
HIDDEN_DIM = 256
LAYER_NUM = 5

# Build Model
model = k.Sequential()
model.add(k.layers.LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), \
                        return_sequences=True))
for i in range(LAYER_NUM - 1):
    model.add(k.layers.LSTM(HIDDEN_DIM, return_sequences=True))
model.add(k.layers.TimeDistributed(k.layers.Dense(VOCAB_SIZE)))
model.add(k.layers.Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
model.summary()






Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, None, 256)         355328    
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 256)         525312    
_________________________________________________________________
lstm_3 (LSTM)                (None, None, 256)         525312    
_________________________________________________________________
lstm_4 (LSTM)                (None, None, 256)         525312    
_________________________________________________________________
lstm_5 (LSTM)                (None, None, 256)         525312    
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 90)          23130     
_________________________________________________________________
activation_1 (Activation)    (None, None, 90)    
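
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias. The helper names are illustrative and not part of Keras, and the vocabulary size of 90 is read off the output shape above.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

vocab_size = 90    # read off the output shape in the summary
hidden_dim = 256
layer_num = 5

counts = [lstm_params(vocab_size, hidden_dim)]
counts += [lstm_params(hidden_dim, hidden_dim) for _ in range(layer_num - 1)]
counts.append(dense_params(hidden_dim, vocab_size))

for count in counts:
    print(count)
print("total:", sum(counts))
```

Only the first LSTM layer depends on the vocabulary size. The deeper layers take the 256-dimensional output of the layer below, which is why their counts are identical.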

#### Training

In [0]:
# Training Parameters
BATCH_SIZE = 256
GENERATE_LENGTH = 100

# Load model (Optional)
# model = k.models.load_model("model_checkpoint_256_epoch_400_batchsize_1024_seqlength_100.hdf5")

# Custom training cycle, to allow for printing samples between epochs.
# The loop has no stopping condition, so interrupt the cell manually to stop training.
nb_epoch = 401
while True:
    print("\n---Epoch {}---".format(nb_epoch))
    model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=1)
    nb_epoch += 1
    generate_text(model, GENERATE_LENGTH)
    # Save snapshot of weights every 10 epochs
    if nb_epoch % 10 == 0:
        model.save_weights('checkpoint_{}_epoch_{}_batchsize_{}_seqlength_{}.hdf5'\
                           .format(HIDDEN_DIM, nb_epoch,BATCH_SIZE,SEQ_LENGTH))

In [0]:
# Save model (Optional)
#model.save('model_checkpoint_{}_epoch_{}_batchsize_{}_seqlength_{}.hdf5'.format(HIDDEN_DIM, nb_epoch,BATCH_SIZE,SEQ_LENGTH))

In [87]:
generate_text(model, 200)
print("\n - end of sequence - ")
print("----------------")
generate_text_from_seed(model, 200, "Deep in the sunken city of")
print("\n - end of sequence - ")

n the stars and the 
strange passage which had been too supernece in the course of the 
party which made me shudder, and the strange stone broad stairces which had 
seemed to be the stones of the stra
 - end of sequence - 
Deep in the sunken city of the 
contourable and incredible castle and brownings and column to the street 
of the strange papers and the strange stones of the strange paths and 
distant seamows and con
 - end of sequence - 


In [90]:
generate_text_from_seed(model, 200, "Written")
print("\n - end of sequence - ")

Written in March of 1926 in Weird Tales 

Where bulletin of Boston's speech was the strangeness of the strange stones and 
changes of the present balustradia and deaded watchers and the province to th
 - end of sequence - 


In [105]:
generate_text_from_seed(model, 100, "Prof")
print("\n - end of sequence - ")

Professor Well, and had been able to 
seek his get at the sea in the sea and sense of the present fl
 - end of sequence - 


In [104]:
generate_text_from_seed(model, 200, "\"Yes")
print("\n - end of sequence - ")

"Yes to see what I had seen - the season when he saw them 
again. 

"Inn of the party were stumbled to the papers and saw the proper sent 
of the strange paths and the strange stone brown streets and 
 - end of sequence - 
