In [31]:
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

# keras module for building LSTM 
from keras.layers import Embedding, LSTM, Dense, Dropout, Flatten, Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
import keras
import tensorflow as tf
from tensorflow.keras.utils import Sequence

if IN_COLAB:
  !pip install Keras-Preprocessing
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import text_to_word_sequence

import pandas as pd
import numpy as np
import string, os 

# Text Generation with LSTM

As we have seen, LSTM models are excellent at dealing with sequential data. As luck would have it, text is also sequential data! We can train a model to predict the next word in a sentence, then use those smarts to generate new text. Basically ChatGPT, but far better. 

### Text for Training

We need some text from which to train our model to speak. I captured a small extract of text from Reddit posts, which vaguely resembles actual language. We'll first need to clean up our data a bit before we can assemble it for modelling. The initial cleaning steps are just like what we used in NLP; we just need to get rid of all the junk. 

We can use pretty much anything that you can imagine as source, and assuming we can gather enough data and train our model, the generated speech will be styled after the source. I liken it to going on vacation in Indonesia and talking to Indonesians who spoke English like Australian surfer bros - their training data was a little weird, so the output was a little weird too. If you're looking to build your best ChatGPT competitor you will want a lot of data, specifically a lot of data that is representative of the full gamut of how you want your model to write. If you want slang in the new text, you can't really train on Shakespeare and Wikipedia.

In [32]:
# Get Data
train_text_file = keras.utils.get_file('train_text.txt', 'https://jrssbcrsefilesnait.blob.core.windows.net/3950data1/reddit_wsb.csv')
train_text = pd.read_csv(train_text_file)
train_text.sample(10)

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
36207,I am seated on the AMC rocket awaiting take of...,137,lsqlri,https://i.redd.it/oltiuvk6arj61.jpg,42,1614345000.0,,2021-02-26 15:14:42
25558,Unpinned Daily Discussion Thread for February ...,0,lcdbiy,https://www.reddit.com/r/wallstreetbets/commen...,1501,1612465000.0,Your daily trading discussion thread. Please k...,2021-02-04 21:00:22
4865,"Go downvote Trading 212 on Play Store, Can't g...",2,l6zifm,https://www.reddit.com/r/wallstreetbets/commen...,0,1611876000.0,Fuck them up,2021-01-29 01:22:16
11012,THE SQUEEZE HASN'T STARTED YET THIS IS JUST A ...,1,l71hrs,https://i.redd.it/hnzd7uy9o3e61.jpg,8,1611880000.0,,2021-01-29 02:32:34
39563,Dropbox plans to acquire DocSend for $165 million,4,m1bwis,https://www.marketwatch.com/story/dropbox-plan...,1,1615341000.0,,2021-03-10 03:47:28
34289,Am I doing this right? YOLO PLTR BITCHES!,82,lnei6t,https://www.reddit.com/gallery/lnei6t,33,1613766000.0,,2021-02-19 22:16:19
5520,I like $GME,0,l6zjl9,https://www.reddit.com/r/wallstreetbets/commen...,0,1611876000.0,But I also like Dogecoin( $DOGE ).\n\nThat is ...,2021-01-29 01:23:23
12121,Something is really suspicious here with robin...,1,l71r6d,https://www.reddit.com/r/wallstreetbets/commen...,8,1611881000.0,So I have 2.3 stocks in GME via robinhood. As ...,2021-01-29 02:41:05
43100,RKT YOLO UPDATE. STILL HOLDING!,106,mbnahe,https://i.redd.it/8cpxv8cg2uo61.jpg,33,1616558000.0,,2021-03-24 05:57:18
46980,Microsoft Directors Decided Bill Gates Needed ...,1449,ne0or5,https://www.wsj.com/articles/microsoft-directo...,499,1621236000.0,,2021-05-17 10:15:28


In [33]:
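TOKENS = 500         # vocabulary size: only the most frequent words are kept
OUTPUT_LENGTH = 45   # tokens per training sequence (44 of context + 1 target)
OUT_DIM = 8          # size of each embedding vector
BATCH_SIZE = 128     # default batch size (training below uses TEST_BATCH)
SAMP = 7             # index of the sample post printed throughout
SHUFFLE = 500        # shuffle buffer size for the tf.data pipeline
UNITS = 50           # units in each LSTM layer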

### Clean and Build Sequences

We can clean and process the text data with any strategy. Here we'll use a simple strategy of removing punctuation and converting all text to lowercase.

Next we'll use a Keras utility to convert our text into a sequence of words, which we'll then tokenize into integers. The sequence that is created is pretty simple, just a list of all the words in the sentence in the correct order. Basically an OCD version of a normal sentence. 

In [34]:
raw_text = train_text['body']
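## Drop posts with no body text, then lowercase
raw_text = raw_text.dropna()
# NOTE: str.replace on a plain Python string is a literal match, not a regex, so the
# replace below does not actually strip punctuation; this line effectively only lowercases.
# text_to_word_sequence (next) removes most punctuation through its default filters,
# which keep apostrophes (hence "night's" in the output).
raw_text = raw_text.apply(lambda x: x.replace('[{}]'.format(string.punctuation), '').lower())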
vocab = set()
sentences = []
for sentence in raw_text:
  current_sentence = text_to_word_sequence(sentence)
  sentences.append(current_sentence)
  vocab.update(current_sentence)
#vocab
#sentences
max_length = max([len(sentence) for sentence in sentences])
sentences[SAMP]

['this',
 'is',
 'our',
 'time',
 'if',
 'anyone',
 'will',
 'listen',
 'to',
 'you',
 'please',
 'explain',
 'to',
 'them',
 'why',
 'gme',
 'is',
 'one',
 'of',
 'the',
 'very',
 'best',
 'investments',
 'they',
 'can',
 'make',
 'right',
 'now',
 'i',
 'have',
 'already',
 'gotten',
 '5',
 'people',
 'close',
 'to',
 'me',
 'to',
 'research',
 'what',
 'is',
 'happening',
 'and',
 'they',
 'decided',
 'to',
 'dump',
 'everything',
 'into',
 'gme',
 'not',
 'amc',
 'not',
 'nok',
 'not',
 'bb',
 'but',
 'gme',
 'after',
 'last',
 "night's",
 'ah',
 'manipulation',
 'i',
 'am',
 'not',
 'selling',
 'until',
 'at',
 'least',
 '5k',
 'to',
 '10k',
 'a',
 'share',
 'we',
 'need',
 'to',
 'encourage',
 'people',
 'to',
 'have',
 'the',
 'courage',
 'to',
 'hold',
 'we',
 'finally',
 'have',
 'the',
 'power',
 'to',
 'change',
 'our',
 'futures',
 'they',
 'want',
 'us',
 'begging',
 'again',
 'and',
 'again',
 'for',
 'stimulus',
 'checks',
 'while',
 'they',
 'laugh',
 'at',
 'us',
 'as'

#### Tokenize

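We can take the word lists that we split above and encode each one as a sequence of integers. We first fit the tokenizer on the text so that it assigns an integer to each word, then use it to convert every sentence. Because we set num_words=TOKENS, only the most frequent words are kept; anything rarer is dropped from the sequences.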
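
To make the encoding step less of a black box, here is a rough pure-Python imitation of what a frequency-ranked tokenizer does. This is a sketch, not the Keras implementation: `build_word_index`, `encode`, and the toy sentences are made up for illustration. It assumes the same conventions the Keras `Tokenizer` uses, namely that the most frequent word gets index 1 (0 is left free for padding) and that words ranked at or above `num_words` are dropped.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in tokenized_texts for word in text)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokenized_texts, word_index, num_words=None):
    """Map words to integers, dropping unknown words and ranks >= num_words."""
    encoded = []
    for text in tokenized_texts:
        ids = [word_index[w] for w in text if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        encoded.append(ids)
    return encoded

# Hypothetical toy data standing in for the Reddit posts
toy_sentences = [["gme", "to", "the", "moon"],
                 ["hold", "gme", "to", "the", "end"],
                 ["gme", "gme"]]
toy_index = build_word_index(toy_sentences)
toy_encoded = encode(toy_sentences, toy_index, num_words=4)
print(toy_index)
print(toy_encoded)
```

Notice that the rare words simply vanish from the encoded sentences, which is why the integer sequences can be shorter than the word lists they came from.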

In [35]:
tokenizer = Tokenizer(num_words=TOKENS)
tokenizer.fit_on_texts(sentences)

tokenized_sentences = tokenizer.texts_to_sequences(sentences)
tokenized_sentences[SAMP]

[9,
 6,
 131,
 78,
 28,
 380,
 25,
 2,
 14,
 305,
 2,
 76,
 128,
 43,
 6,
 73,
 5,
 1,
 165,
 138,
 17,
 40,
 99,
 145,
 57,
 8,
 20,
 229,
 112,
 69,
 317,
 2,
 89,
 2,
 452,
 44,
 6,
 3,
 17,
 2,
 421,
 96,
 43,
 23,
 175,
 23,
 23,
 374,
 24,
 43,
 136,
 126,
 8,
 135,
 23,
 228,
 308,
 22,
 418,
 2,
 4,
 113,
 26,
 173,
 2,
 69,
 2,
 20,
 1,
 2,
 134,
 26,
 20,
 1,
 429,
 2,
 391,
 131,
 17,
 143,
 82,
 232,
 3,
 232,
 12,
 188,
 17,
 22,
 82,
 21,
 17,
 31,
 26,
 15,
 4,
 177,
 2,
 76,
 149,
 57,
 92,
 1,
 222,
 248,
 11,
 436,
 9,
 6,
 131,
 22,
 139,
 9,
 6,
 131,
 38,
 26,
 25,
 19,
 79,
 42]

### Construct Training Sequences

We are going to be a little slack in the construction of the datasets for training because we are limited in the amount of resources we can handle. All of the sequences that are fed to the model need to be the same length, so we'll set a cap and truncate them here for resource concerns. Our dataset will be constructed as:
<ul>
<li> A sequence of (up to) 44 words as the X data (OUTPUT_LENGTH - 1). 
<li> The next word as the Y data.
</ul>

So each sequence is effectively one set of features, and its target is the next word. If we were doing this in reality, we'd want to prep more records from our sample:
<ul>
<li> Suppose a sample sentence is "The quick brown fox jumps over the lazy dog". Our ideal data would have something like:
    <ul> 
    <li>X = "the quick brown", Y = "fox"
    <li> X = "quick brown fox", Y = "jumps"
    <li> X = "brown fox jumps", Y = "over"
    </ul>
</ul>

This is superior both because we are generating much more data to train the model and because we are training the model to predict words in all different positions in the sentence. We'd be predicting almost every word in the training dataset. The words that frequently end a sentence are not necessarily the same as the words that start a sentence or sit in the middle, so making predictions up and down the text will likely lead to a more useful model.
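
The sliding-window idea from the list above can be sketched in a few lines. This is only an illustration under simple assumptions: `sliding_windows` is a hypothetical helper, the window is fixed at 3 words, and it works on words rather than token ids to keep the output readable.

```python
def sliding_windows(tokens, window):
    """Return (context, next_token) pairs for every position with a full context."""
    pairs = []
    for end in range(window, len(tokens)):
        pairs.append((tokens[end - window:end], tokens[end]))
    return pairs

toy = "the quick brown fox jumps over the lazy dog".split()
pairs = sliding_windows(toy, window=3)
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A 9-word sentence yields 6 training records this way instead of 1, which is exactly why the full version gets expensive on a large corpus.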

<b>We'd end up with a better model if we generated more sequences from our data and/or added more data. The resource demands make that tough, so we have cut some corners that are easy to remedy in a real-world scenario.</b>

In [36]:
# Find longest sequence
max_real_length = max([len(sentence) for sentence in tokenized_sentences])
print("Longest real sequence:", max_real_length)
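# Keep only the first OUTPUT_LENGTH tokens of each post. Slicing is a no-op for
# shorter posts, so no length check is needed and the variable is always defined.
trunc_token_sequences = [t_[:OUTPUT_LENGTH] for t_ in tokenized_sentences]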
trunc_token_sequences[SAMP]

Longest real sequence: 4572


[9,
 6,
 131,
 78,
 28,
 380,
 25,
 2,
 14,
 305,
 2,
 76,
 128,
 43,
 6,
 73,
 5,
 1,
 165,
 138,
 17,
 40,
 99,
 145,
 57,
 8,
 20,
 229,
 112,
 69,
 317,
 2,
 89,
 2,
 452,
 44,
 6,
 3,
 17,
 2,
 421,
 96,
 43,
 23,
 175]

#### Pad Sequences

We need to pad our sequences so that they are all the same length, as our neural networks require that. The pad_sequences utility does just that: it fills in 0s at either the beginning or the end of each sequence, depending on the "padding" option, until everything is the same length. We want to pad before the real data because we are always planning on predicting the next token, so we want the real words to sit at the end of the sequence and work from the last value. 
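
As a rough illustration of the two padding options, here are two hypothetical helpers, `pad_pre` and `pad_post`. They are a simplified stand-in for `pad_sequences`, not its actual code. `pad_pre` also trims from the front when a sequence is too long, which mirrors the Keras default of `truncating='pre'`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`; if too long, keep the last `maxlen` tokens."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

def pad_post(sequence, maxlen, value=0):
    """Right-pad with `value`; if too long, keep the first `maxlen` tokens."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

short = [9, 6, 131]
print(pad_pre(short, 5))
print(pad_post(short, 5))
```

With pre-padding, the last position always holds a real token (as long as the sequence is not empty), so taking the final column as the target never hands the model a padding 0 to predict.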

In [37]:
padded_trunc_token_sequences = pad_sequences(trunc_token_sequences, maxlen=OUTPUT_LENGTH, padding='pre')
padded_trunc_token_sequences[SAMP]

array([  9,   6, 131,  78,  28, 380,  25,   2,  14, 305,   2,  76, 128,
        43,   6,  73,   5,   1, 165, 138,  17,  40,  99, 145,  57,   8,
        20, 229, 112,  69, 317,   2,  89,   2, 452,  44,   6,   3,  17,
         2, 421,  96,  43,  23, 175], dtype=int32)

#### Write Tokenized Data to Disk

We can write the tokenized data to disk so that we can use it later, which saves us from redoing a slow step. Since we chopped our data size down a bit, this isn't strictly needed here. When I tried to generate all of the sequences noted above, I exceeded the Colab memory limits and it took a while to run. I wouldn't want to repeat that if I can avoid it, and hard drive space is cheap, so writing the interim results to disk is a good workaround. In real scenarios with massive amounts of data we would need to load it incrementally from disk anyway, so this is a free win.

In [38]:
#token_path = 'data/sample_tokenized_sentences.csv'
token_path = 'data/padded_sample_tokenized_sentences.csv'
# Create the folder everywhere, not only in Colab, so to_csv does not fail locally
os.makedirs('data', exist_ok=True)

if os.path.exists(token_path):
    # Reuse the cached tokens from an earlier run
    df_prepped = pd.read_csv(token_path, header=None)
else:
    df_prepped = pd.DataFrame(padded_trunc_token_sequences)
    df_prepped.to_csv(token_path, header=False, index=False)
# Either way, end up with the same array
prepped_sentences = df_prepped.to_numpy()
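
The "load it incrementally from disk" idea can be sketched with a small generator. This is an assumption-laden illustration rather than the approach used in this notebook: `iter_batches` is a hypothetical helper, the demo file is a tiny stand-in for the real padded CSV, and it uses only the standard library so it runs anywhere.

```python
import csv
import os
import tempfile

def iter_batches(csv_path, batch_size):
    """Yield (X, y) batches from a CSV of padded token rows, one batch in memory at a time."""
    batch_x, batch_y = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            tokens = [int(value) for value in row]
            batch_x.append(tokens[:-1])   # everything but the last token is the context
            batch_y.append(tokens[-1])    # the last token is the target
            if len(batch_x) == batch_size:
                yield batch_x, batch_y
                batch_x, batch_y = [], []
    if batch_x:                           # final partial batch
        yield batch_x, batch_y

# Tiny demo file standing in for the real padded CSV
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tokens.csv")
with open(demo_path, "w", newline="") as handle:
    csv.writer(handle).writerows([[0, 0, 9, 6], [0, 2, 14, 305], [1, 5, 17, 40]])

batches = list(iter_batches(demo_path, batch_size=2))
print(batches)
```

A generator like this only ever holds one batch in memory, which is the property we would need if the full set of sequences did not fit in RAM.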

#### Dataset and Array

This data can reasonably be used either as a regular array or as a tensorflow dataset. We'll use both here, just to show that it can be done either way. If we were using the full set of sequences, the data would be quite a bit larger with all the fully padded out sequences, and we'd probably need to write the data to disk and load it with a dataset or generator. 

In [39]:
print(padded_trunc_token_sequences.shape)
y_t = padded_trunc_token_sequences[:, -1].reshape(-1, 1)
y_t = ku.to_categorical(y_t, num_classes=TOKENS)
print("Y shape:", y_t.shape)
X_t = padded_trunc_token_sequences[:, :-1]
print("X shape:", X_t.shape)

(24738, 45)
Y shape: (24738, 500)
X shape: (24738, 44)


### Model

Now we model. The data that we made mirrors the construction of a sentence.
<ul>
<li> X features - the sentence up to this point. 
<li> Y target - the word(s) that should come next. 
</ul>

So, the model is effectively working to generate text just like a time series model works to predict the next value in a sequence of stock prices or hourly temperatures. We train the model on, hopefully, a large number of sentences, where it sees many examples of "here are some words" (X values) and "here is the next word" (Y value). If we give it lots and lots of that training data, it should become better and better at determining what should come next, given the existing sentence. 

To do this well, we'd need a lot more data than we have, and much more time to train. We'd want to give the model enough data so that it can see lots and lots of examples of the same word in different contexts, and of similar contexts with different words. The patterns of language are really complex, so we need data that provides enough variation to demonstrate the patterns. 

The model is wrapped in a little function, so we can make a model to output a different number of words with more convenience.

#### Embedding Layer

We also use an embedding layer here, which accepts our integer inputs and converts them to a vector of a specified size. This is a way of representing the words in a way that is more useful for the model. We saw embeddings with word2vec during the NLP portions. Just as it did then, embedding will represent each of our tokens as a vector of a specified size, or as a value in N dimensional space. 

![Embedding](images/embedding.png "Embedding")

When we check the summary of the model we can see that the embedding layer has its own trainable parameters (one vector of OUT_DIM values for each of our TOKENS words). This is because it is learning the vector representation of each of the words in our vocabulary. When we used word2vec we used a pretrained model - the N dimensional space was already defined, and we placed our words into it. Here, we are letting the model learn the space and the vectors that represent the words, so the vector representation of each word is learned directly from the data. If we are dealing with a scenario where we have a lot of data and a specialized vocabulary (such as in industry) this can be very useful. The model will learn which words are similar and which are not, based on the context of the text provided in training.
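
Mechanically, an embedding layer is just a lookup table, which the following sketch tries to show. The table here is filled with random numbers, whereas in the real layer those numbers are the trained weights. The names `TOY_TOKENS`, `TOY_DIM`, `embedding_table`, and `embed` are made up for this example.

```python
import random

TOY_TOKENS = 500   # vocabulary size, matching TOKENS
TOY_DIM = 8        # vector size, matching OUT_DIM

rng = random.Random(0)
# One row of TOY_DIM weights per token id; in a real model these values are learned
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(TOY_DIM)]
                   for _ in range(TOY_TOKENS)]

def embed(token_ids, table):
    """Replace each integer token with its row of the table."""
    return [table[token_id] for token_id in token_ids]

embedded = embed([9, 6, 131], embedding_table)
param_count = TOY_TOKENS * TOY_DIM
print(len(embedded), len(embedded[0]), param_count)
```

The parameter count is simply rows times columns, 500 x 8 = 4000, which lines up with the number reported for the embedding layer in the model summary.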

#### Let's Go Bi

For this model we'll use a bidirectional LSTM layer. This is a layer that processes the input sequence in both directions, so it sees the sequence from the beginning and from the end. This is useful because each word in the context can be interpreted using both the words before it and the words after it, which can help in determining the next word. Both directions only ever read the context words that we feed in as X; the word being predicted is never part of the input.

![Bidirectional LSTM](images/bidirectional.webp "Bidirectional LSTM")

Bidirectional layers are most prevalent in NLP, as adding the ability to look at the sequence of words in both directions can really help models build a better understanding of the context of the words. There is a speed penalty due to each layer doing roughly double the number of calculations, but for language tasks such as this, it is pretty likely to be worth it, particularly if we were to expand the dataset. Bidirectional layers can be added simply, as below; on the Keras documentation there are a few more details, mainly that the forward and backward layers can be configured separately then combined. 
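
The "roughly double" cost shows up directly in the parameter counts. As a quick check, this sketch computes them by hand, assuming the standard LSTM layout (four gate blocks, each with input weights, recurrent weights, and a bias) and a bidirectional wrapper that keeps separate forward and backward weights and concatenates their outputs. The function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """The forward and backward LSTMs each have their own weights."""
    return 2 * lstm_params(input_dim, units)

# First layer reads the 8-dimensional embeddings
first = bidirectional_lstm_params(input_dim=8, units=50)
# Later layers read the 100-dimensional output (50 forward + 50 backward) of the previous layer
later = bidirectional_lstm_params(input_dim=100, units=50)
print(first, later)
```

These work out to 23600 and 60400, the same values that appear for the bidirectional layers in the model summary.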

In [40]:
model = Sequential()
model.add(Embedding(input_dim=TOKENS, output_dim=OUT_DIM, input_length=OUTPUT_LENGTH-1))
#model.add(LSTM(UNITS, return_sequences=True))
#model.add(LSTM(UNITS, return_sequences=True))
#model.add(LSTM(UNITS))
model.add(Bidirectional(LSTM(UNITS, return_sequences=True)))
model.add(Bidirectional(LSTM(UNITS, return_sequences=True)))
model.add(Bidirectional(LSTM(UNITS)))
model.add(Flatten())
model.add(Dense(TOKENS, activation='softmax'))

model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['acc'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 44, 8)             4000      
                                                                 
 bidirectional_4 (Bidirectio  (None, 44, 100)          23600     
 nal)                                                            
                                                                 
 bidirectional_5 (Bidirectio  (None, 44, 100)          60400     
 nal)                                                            
                                                                 
 bidirectional_6 (Bidirectio  (None, 100)              60400     
 nal)                                                            
                                                                 
 flatten_1 (Flatten)         (None, 100)               0         
                                                      

In [41]:
#Some larger values for Colab
TEST_EPOCHS = 10
TEST_BATCH = 512

if IN_COLAB:
    TEST_EPOCHS = 300
    TEST_BATCH = 1024

In [42]:
# Try with plain arrays
early_stop = EarlyStopping(monitor='loss', patience=50)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
save_weights = tf.keras.callbacks.ModelCheckpoint('weights/lstm_gen_weights.h5', save_best_only=True, monitor='loss', mode='min', save_weights_only=True)

if IN_COLAB:
    !mkdir logs
    !mkdir weights
    %load_ext tensorboard
    %tensorboard --logdir logs

history = model.fit(X_t, y_t, batch_size=TEST_BATCH, epochs=TEST_EPOCHS, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [43]:
train_ds = tf.data.Dataset.from_tensor_slices((X_t, y_t)).shuffle(SHUFFLE).batch(TEST_BATCH).prefetch(tf.data.experimental.AUTOTUNE)

In [44]:
# Or with datasets...
history = model.fit(train_ds, epochs=TEST_EPOCHS, verbose=1, callbacks=[early_stop, tensorboard_callback, save_weights])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Generate Some Text

We can generate text once the model is trained. We'll start with a seed sentence, and then we'll use the model to predict the next word. We'll then append that word to the sentence and use the model to predict the next word, and so on. There is a little helper function to do this for us with limited repetition.

Another new thing is that we create an inverse dictionary to map the encoded words back to the original words.

### Temperature

One weirdly named factor that is important in text generation is the temperature. The temperature is a factor that we can use to control the randomness of the output. To generate text, we are essentially using a probability distribution to determine what the next word should be - the softmax output of the model gives us a probability for every word in the vocabulary, and the highest one is the most likely next word. The issue is that certain words are way more likely than others - "the", "it", "a", "and", etc. are all very common words so we can expect the model to predict them as "most likely" a lot, probably too often. 

![Temperature](images/temperature.gif "Temperature")

The most direct way to combat this is to add some degree of randomness to which word we select - we'll still pick the most likely word more often than any other, but we'll also pick other words that have some degree of likelihood at random. The higher the temperature, the more randomness is introduced. A good value requires tuning with human feedback, and it'll vary depending on the base quality of the model - large models that are trained on huge volumes of text and are deep enough to pick up on the "what type of word should be here" patterns will be able to generate better text with lower temperatures. Our model here is small and kind of sucks, so the temperature needs to be higher to get anything remotely usable. The implementation here is stolen shamelessly from the internet. The details don't really matter all that much; we just need to vary our predictions away from always simply picking the most likely word.
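
To see what the temperature actually does to a distribution, here is a small standalone sketch. It applies the same log, divide, exponentiate, renormalize recipe to a made-up three-word distribution. `apply_temperature` and `toy_probs` are illustrative names, and the function assumes every probability is greater than zero.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p ** (1 / T), then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]

toy_probs = [0.6, 0.3, 0.1]   # hypothetical softmax output over three words
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])
```

A temperature of 1 leaves the distribution unchanged, values below 1 pile almost all of the probability onto the top word, and values above 1 flatten things out so that less likely words get picked more often.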

In [45]:
inverse_dict = {v: k for k, v in tokenizer.word_index.items()}

In [46]:
def sampleWord(preds, temperature=1.0):
    # Rescale the softmax output by the temperature, then draw one index at random
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds.flatten(), 1)
    return np.argmax(probas)

def generate_text(seed_text, next_words, model, length_max, tok, inverse, temperature=1.0):
    out = seed_text
    for _ in range(next_words):
        token_list = tok.texts_to_sequences([out])[0]
        token_list = pad_sequences([token_list], maxlen=length_max-1, padding='pre')
        #print(token_list)
        predicted = model.predict(token_list, verbose=0)
        #print(tok.sequences_to_texts(predicted))
        word = inverse[sampleWord(predicted, temperature=temperature)]
        out += " "+word
        #print(out)
    return out

Fake text time!

In [47]:
print (generate_text("Last night I went on a date with", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")
print (generate_text("We are going to the ", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")
print (generate_text("Where are all of the", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")
print (generate_text("I am scared", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")
print (generate_text("School is out for", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")

Last night I went on a date with not the days the you and strong of can fuck i’m do will there to me on know value it way almost not its a guys comments x200b b 3 

We are going to the  them anything they i as new options money the 1 change and width not and already is some auto retards weeks have life for my then chart and apes and 

Where are all of the apes it to its in both up a on but now over why low potential auto money really pretty if done at total know his with and shit a market 

I am scared have of stock t it and take as a holding was 0 great a people t well lot help a moon be s gme no stocks keep been time to 

School is out for bad to that in format end r what stock big time by covid i 3 that the a here after in time of can't day shorts be end and s 



A few long ones...

In [48]:
print (generate_text("Last night I went on a date with", next_words=100, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")
print (generate_text("My toe is purple and I am", next_words=100, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")
print (generate_text("The Toronto Raptors won and Rob Ford's body rose from the dead", next_words=100, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict), "\n")

Last night I went on a date with go their like the the months until than 2021 100 term on i in the the or the today may put platform in with however s and was why are i 15 https a still hold no at hold it so term advice if a them please 2 my today investor hedge great huge all though t t the webp s 4 moon may right back still could profit and back strong the the lose put trading 8 there png good that's 13 every 5 i'm go high in is here public t after selling on 13 am the use 



KeyError: 0
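
The `KeyError: 0` above means the sampler returned index 0. That index is the padding value, and it has no entry in `inverse_dict` because the tokenizer's word indices start at 1. One possible guard, sketched here with only the standard library, is to sample solely among ids that can be translated back into a word. `sample_known_token` and the toy values are hypothetical and are not the helper used in this notebook.

```python
import random

def sample_known_token(probs, inverse, rng=random):
    """Sample a token id, but only among ids that `inverse` can map back to a word."""
    candidates = [i for i in range(len(probs)) if i in inverse and probs[i] > 0]
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

toy_inverse = {1: "the", 2: "moon", 3: "gme"}   # no entry for 0, like inverse_dict
toy_probs = [0.4, 0.3, 0.2, 0.1]                # index 0 (padding) is the most likely

draws = [sample_known_token(toy_probs, toy_inverse, random.Random(seed))
         for seed in range(200)]
print(sorted(set(draws)))
```

Dropping index 0 from the candidates is equivalent to zeroing its probability and renormalizing, so the relative odds of the real words stay the same.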

And if we make the temperature low, we likely get worse results: the sampler almost always picks the single most likely word, so the output gets repetitive.

In [None]:
print (generate_text("Last night I went on a date with", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict, temperature=.1), "\n")
print (generate_text("We are going to the ", next_words=30, model=model, length_max=OUTPUT_LENGTH, tok=tokenizer, inverse=inverse_dict, temperature=.1), "\n")

Last night I went on a date with gme the moon stock stock gme the moon stock stock stock the moon stock stock gme gme gme of the of the of the of the of the of the 

We are going to the  moon moon gme stock stock stock stock the stock stock the moon gme gme the moon stock stock stock of the of the of the of the of the of 



## Conclusion and Notes

The LSTM model works well for text generation. If we had more data, more time, and spent a bit more effort cleaning up the edge cases in our code here (such as the KeyError above), we could likely get something pretty legible. 

The biggest difference between this and the impressive larger models is that they are higher capacity, trained with more data, and sometimes tuned with human feedback. These things combine to make the model more likely to pick up on the patterns of language and generate text that is more likely to be grammatically correct and follow the expected flow of language. Our model is really struggling to just find a reasonable word to use next; we haven't allowed it to learn enough to really construct logical sentences.