# LSTM Neural Network for Text Generation

Here we take a subset of the full transcript dataset and train an LSTM-style recurrent network to generate new text based on the corpus.

In [18]:
import re

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline
sns.set(color_codes=True)

### Load the data from the CSV file that was created in get_transcripts.ipynb

Take only the English transcripts, since we will be training our model to produce text in English.

In [1]:
# Load data from csv
df = pd.read_csv('stand-up-data-cleaned.csv')
df = df.loc[df.language == 'en']

df.head()

Unnamed: 0,title,date_posted,link,name,year,transcript,language,runtime,rating,sentences,words,sent_polarity,sent_subjectivity,word_polarity,word_subjectivity,words_per_sentence,word_count,sent_count,f_words,s_words
0,Lee Mack: Live,"May 7th, 2020",https://scrapsfromtheloft.com/2020/05/07/lee-m...,Lee Mack,2007.0,TAKE ME OUT BY FRANZ FERDINAND PLAYING PRESENT...,en,68.0,7.7,['TAKE ME OUT BY FRANZ FERDINAND PLAYING PRESE...,"['TAKE', 'ME', 'OUT', 'BY', 'FRANZ', 'FERDINAN...",[0.8 0. 0. ... 0. 0. 0.7],[0.9 0. 0. ... 0. 0. 0.6],[0. 0. 0. ... 0.7 0. 0. ],[0. 0. 0. ... 0.6 0. 0. ],8.848975,14238,1609,95,8
1,T.J. Miller: No Real Reason,"May 6th, 2020",https://scrapsfromtheloft.com/2020/05/06/t-j-m...,T.J. Miller,2011.0,"– I wish I didnt have to do this to perform, b...",en,67.0,7.1,"['– I wish I didnt have to do this to perform,...","['–', 'I', 'wish', 'I', 'didnt', 'have', 'to',...",[ 1.00000000e-01 -7.00000000e-01 -7.00000000e-...,[1. 0.66666667 0.66666667 0.86666667 0...,[0. 0. 0. ... 0. 0.35 0. ],[0. 0. 0. ... 0. 0.65 0. ],11.701735,10789,922,11,2
2,Jerry Seinfeld: 23 Hours To Kill,"May 6th, 2020",https://scrapsfromtheloft.com/2020/05/06/jerry...,Jerry Seinfeld,2020.0,"Jerry Seinfelds new hourlong comedy special, J...",en,60.0,6.7,"['Jerry Seinfelds new hourlong comedy special,...","['Jerry', 'Seinfelds', 'new', 'hourlong', 'com...",[ 0.2978355 0. 0.24285714 0. ...,[0.47532468 0. 0.36785714 0. 0...,[0. 0. 0.13636364 ... 0. ...,[0. 0. 0.45454545 ... 0. ...,10.084926,9500,942,0,0
3,Bill Burr On The Late Show With David Letterma...,"May 5th, 2020",https://scrapsfromtheloft.com/2020/05/05/bill-...,Bill Burr,2010.0,Bill Burr performing on The Late Show with Dav...,en,,,['Bill Burr performing on The Late Show with D...,"['Bill', 'Burr', 'performing', 'on', 'The', 'L...",[-0.3 0.375 0. 0. ...,[0.6 0.63333333 0. 0. 0...,[0. 0. 0. ... 0.2 0.2 0. ],[0. 0. 0. ... 0.3 0.2 0. ],9.486726,1072,113,0,0
4,Sincerely Louis Ck,"May 2nd, 2020",https://scrapsfromtheloft.com/2020/05/02/since...,Louis C.K.,2020.0,Great comedy is finally back. Louis C.K. is no...,en,60.0,8.5,"['Great comedy is finally back.', 'Louis C.K.'...","['Great', 'comedy', 'is', 'finally', 'back', '...",[ 4.00000000e-01 0.00000000e+00 3.00000000e-...,[0.375 0. 0.9 0.5 0...,[0.8 0. 0. ... 0. 0. 0. ],[0.75 0. 0. ... 0. 0. 0. ],11.088071,9946,897,93,19


### Create a variable 'text' that will serve as the corpus for our model
This simply concatenates every transcript into a single string. Instead of using the full corpus, we take only the first tenth of it to save memory and processing time, since this is just for fun. A more powerful generator could be built given the resources to process the full dataset.

In [8]:
text = ''
for i in df.transcript:
    text += (i + ' ')

len(text)

13166511

In [9]:
# Keep only the first 1/10 of the text to save processing time
portion = len(text) // 10
text = text[:portion]
len(text)

1316651
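
A rough sense of why the corpus is cut down: the training cell further on one-hot encodes every 40-character window, and that array grows linearly with the text length. The sketch below estimates its size from the two lengths printed above. The vocabulary size of 60 is an assumption read off the `(None, 60)` output shape in the model summary later in the notebook (the full corpus may well contain more distinct characters), and the helper names are hypothetical.

```python
def n_sequences(text_len, maxlen=40, step=3):
    # Same count as len(range(0, len(text) - maxlen, step)) in the training cell
    return len(range(0, text_len - maxlen, step))


def one_hot_bytes(text_len, vocab_size, maxlen=40, step=3):
    # NumPy booleans take one byte each:
    # x has shape (n, maxlen, vocab_size), y has shape (n, vocab_size)
    n = n_sequences(text_len, maxlen, step)
    return n * maxlen * vocab_size + n * vocab_size


VOCAB_SIZE = 60  # assumption: taken from the Dense layer's (None, 60) output shape

for label, length in [("full corpus", 13166511), ("one tenth", 1316651)]:
    gigabytes = one_hot_bytes(length, VOCAB_SIZE) / 1e9
    print(f"{label}: {n_sequences(length):,} sequences, {gigabytes:.2f} GB")
```

Under these assumptions the one-tenth subset needs about 1.1 GB for `x` and `y`, against roughly 10.8 GB for the full text, before counting the Python list of window strings.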

### Build the model: LSTM 256 -> LSTM 256 -> LSTM 256 -> Dense
Much of this block is taken right from the Keras docs. I have changed the network architecture, optimizer settings, and callback functions.

The learning rate starts at 0.001, and training runs until the loss plateaus, which happens after about 30 epochs.

In [None]:
from keras.callbacks import LambdaCallback, ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.optimizers import Adam
import numpy as np
import random
import sys

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# One-hot encode each 40-character chunk into x; y is the one-hot encoding of the character that follows it
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# Build the model: 3 layers deep LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(len(chars), activation='softmax'))

optimizer = Adam(learning_rate=0.001)  # older Keras releases spell this argument lr
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(150):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print('\n')

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
# Checkpoint save on new best
filepath = "model_best.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')

model.summary()

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback, checkpoint])
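
To make the windowing and one-hot steps in the cell above easier to inspect, here is the same procedure on a toy string. This is only an illustrative sketch: the `toy_` names and the tiny `maxlen` of 4 are made up for the example so that nothing here overwrites the notebook's own variables.

```python
import numpy as np

toy_text = "hello world"
toy_maxlen, toy_step = 4, 3

# Vocabulary and lookup table, built the same way as chars / char_indices
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Overlapping windows and the character that follows each one
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
toy_windows = [toy_text[i:i + toy_maxlen] for i in starts]
toy_targets = [toy_text[i + toy_maxlen] for i in starts]

# One-hot encode: x is (windows, timesteps, vocab), y is (windows, vocab)
toy_x = np.zeros((len(toy_windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(toy_windows):
    for t, ch in enumerate(window):
        toy_x[i, t, toy_char_indices[ch]] = True
    toy_y[i, toy_char_indices[toy_targets[i]]] = True

for window, target in zip(toy_windows, toy_targets):
    print(repr(window), "->", repr(target))
print(toy_x.shape, toy_y.shape)
```

Each row of `toy_x` has exactly one `True` per timestep, and each row of `toy_y` has exactly one `True`, which is what `categorical_crossentropy` expects as a target.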

### Some mid-training gems generated by the model...
- Epoch 10: "I make you have a song for the porn before you think you're a good point, like, I'll give you an example. I was like, no."
- Epoch 18: "I make a bank that we're gonna be like, I don't know what the fuck I want. I don't know what the fuck I want to do it. And I don't know what the fuck I want."


This is good! They sound like English, sort of. Not bad for character-level predictions.

### Learning is resumed with a smaller learning rate
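Now the Adam optimizer learning rate is set to 0.0001 for another 13 epochs. Recompiling replaces the optimizer but keeps the trained weights, so training continues from where it left off.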


In [None]:
optimizer = Adam(learning_rate=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback, checkpoint])

- Epoch 49: "Its not what I meant, the thing the car the man didn't want to do it. I don't know what to do. It wasn't any good anyway."

The model isn't quite making sense yet but it's interesting to see how it is attempting more complex sentences.

### Learning is resumed with an even smaller learning rate
Finally, the learning rate is set to 0.00001 for just one more epoch.

In [None]:
optimizer = Adam(learning_rate=0.00001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback, checkpoint])
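
The three training stages lower the learning rate by a factor of ten each time, by hand, once the loss stops improving. Keras ships a `ReduceLROnPlateau` callback for this kind of schedule; the function below is not that callback, only a simplified, stateless sketch of the same idea in plain Python. The function name, `patience`, and `min_delta` values are hypothetical choices for illustration and were not used in this notebook.

```python
def next_learning_rate(losses, lr, factor=0.1, patience=3,
                       min_delta=1e-3, min_lr=1e-5):
    """Return the learning rate to use for the next epoch.

    losses: per-epoch training losses seen so far, oldest first.
    The rate is cut by `factor` when the best loss of the last `patience`
    epochs fails to beat the earlier best by at least `min_delta`.
    """
    if len(losses) <= patience:
        return lr  # not enough history to judge a plateau
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    if best_before - recent_best < min_delta:
        return max(lr * factor, min_lr)  # plateau: step down, but not below min_lr
    return lr


# Hypothetical loss histories, not taken from the actual training run
still_improving = [2.0, 1.8, 1.6, 1.4]
plateaued = [2.0, 1.5, 1.5, 1.5, 1.5]
print(next_learning_rate(still_improving, 0.001))
print(next_learning_rate(plateaued, 0.001))
```

Unlike the manual approach, a schedule like this reacts to the loss curve itself, so the number of epochs spent at each rate does not have to be decided in advance.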

### Re-load the model from 'model_best.h5' if needed, then produce results
A great way to save your work is the Keras ModelCheckpoint callback, which can save the entire model so that you can come back later and continue training. Here the model_best.h5 file is used to reload the model so that it can generate text for us.

(I will only be using temperatures 0.5 and 1.0, the diversity values in the code, since those seemed to produce the best results during training.)
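
To see what the temperature does, the sketch below applies the same rescaling that `sample()` performs (log, divide by temperature, exponentiate, renormalize) to a made-up three-way distribution. The function name and `toy_probs` are hypothetical; subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np


def apply_temperature(probs, temperature):
    # Same transformation as in sample(): p_i ** (1 / temperature), renormalized
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp = np.exp(logits - logits.max())  # max subtracted only for stability
    return exp / exp.sum()


toy_probs = [0.6, 0.3, 0.1]  # made-up next-character probabilities
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_probs, temperature), 3))
```

Temperatures below 1.0 sharpen the distribution toward the most likely character, which gives safer and more repetitive text. A temperature of 1.0 leaves the model's probabilities unchanged, and values above 1.0 flatten them, which gives more varied output with more misspellings.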

In [36]:
# Load the model from h5 file
from keras.models import load_model

filename = "model_best.h5"
model = load_model(filename)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 40, 256)           324608    
_________________________________________________________________
dropout_1 (Dropout)          (None, 40, 256)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 40, 256)           525312    
_________________________________________________________________
dropout_2 (Dropout)          (None, 40, 256)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 60)               
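
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, and a Dense layer has one kernel and one bias. The sketch below assumes an input width of 60 characters, inferred from the `(None, 60)` output shape; the helper names are hypothetical, and the Dense count is computed here rather than read from the truncated summary.

```python
def lstm_param_count(input_dim, units):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * ((input_dim + units) * units + units)


def dense_param_count(input_dim, units):
    # kernel + bias
    return input_dim * units + units


vocab_size = 60  # assumption: from the (None, 60) output shape of dense_1

print(lstm_param_count(vocab_size, 256))   # lstm_1  -> 324608
print(lstm_param_count(256, 256))          # lstm_2  -> 525312
print(lstm_param_count(256, 256))          # lstm_3  -> 525312
print(dense_param_count(256, vocab_size))  # dense_1 -> 15420
```

The first three values match the summary, which is consistent with the model having been trained on a 60-character vocabulary.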

In [38]:
# Grab some random seed text and predict the next letters
start_index = random.randint(0, len(text) - maxlen - 1)
for diversity in [0.5, 1.0]:
    print('----- diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print('\n')

----- diversity: 0.5
----- Generating with seed: "You know what I mean? My mom used to be "
You know what I mean? My mom used to be like, When when you live at the room, like, Were gonna need to take a burned on here for a compliment for no reason for where I first come from the last thing I was like, What a guy fuck? Thats not the same hands on your crothing sling. You know what I mean? I had a baby. I was like, Oh, do you not be some shit? And then I had done that I love the jokes and I go to the received of the back of a sa

----- diversity: 1.0
----- Generating with seed: "You know what I mean? My mom used to be "
You know what I mean? My mom used to be on a market shit about a ruin presidence of expecting by the first time. Just a Bohuwain, Hey, haya. She made you a drag. Its like, Thank you. a picture for expensive, saying, Noing news know, suck, would you go crazy? Shine of the trats that has the classrow lives of Aborting for your viswed into the under it. And Sorry, we did spi
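
The generation loop appears twice in this notebook, once in `on_epoch_end` and once in the cell above. One possible way to factor it out is sketched below. `generate_text` and its parameters are hypothetical names, `predict_fn` stands in for any function that maps a one-hot batch to next-character probabilities, and `rng.choice` is swapped in for the `np.random.multinomial` plus `argmax` draw used in `sample()`, which samples from the same distribution.

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, chars, temperature=1.0, rng=None):
    """Extend `seed` by `n_chars` characters sampled one at a time."""
    rng = rng or np.random.default_rng()
    char_to_idx = {c: i for i, c in enumerate(chars)}
    window = seed
    generated = []
    for _ in range(n_chars):
        # One-hot encode the current window as a batch of size 1
        x_pred = np.zeros((1, len(seed), len(chars)))
        for t, ch in enumerate(window):
            x_pred[0, t, char_to_idx[ch]] = 1.0

        # Rescale the predicted probabilities by the temperature
        # (assumes no probability is exactly zero, as with a softmax output)
        probs = np.asarray(predict_fn(x_pred)[0], dtype="float64")
        logits = np.log(probs) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()

        next_char = chars[rng.choice(len(chars), p=probs)]
        generated.append(next_char)
        window = window[1:] + next_char  # slide the window forward
    return seed + "".join(generated)


# Toy check with a stand-in "model" that predicts a uniform distribution
toy_chars = ["a", "b", "c"]
toy_predict = lambda batch: np.full((1, len(toy_chars)), 1.0 / len(toy_chars))
toy_output = generate_text(toy_predict, "abc", 5, toy_chars,
                           rng=np.random.default_rng(0))
print(toy_output)
```

With the reloaded model, a call might look like `generate_text(lambda batch: model.predict(batch, verbose=0), sentence, 400, chars, temperature=0.5)`, where `sentence` is a `maxlen`-character seed made only of characters in `chars`.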

## Conclusion
- In the end, the generated text looks like English with only a few spelling errors, though it doesn't really make much sense. There are grammar and syntax errors everywhere, but this is partially to be expected given that the source text consists of transcripts of spoken stand-up comedy routines. Unfortunately, due to computational and time constraints, the full corpus was not used. It is possible that the results of this network, trained on the full corpus (ten times as much data), would be significantly different.

- There is clearly a lot of work to be done before actually funny text could be created with this LSTM model... Unless you're into absurdist humor.