### Chappelle Monologue Generator

Based on Jason Brownlee's Machine Learning Mastery: "How to Develop a Word-Level Neural Language Model and Use it to Generate Text"

https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

In [1]:
with open("transcripts.txt", "r", encoding="utf-8") as f:
    transcript_text = f.read()

In [2]:
print(transcript_text[:50])

Good people of Atlanta, we must never forget… that


In [3]:
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r', encoding="utf-8")
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # the context manager closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)

# load document
in_filename = 'transcripts.txt'
doc = load_doc(in_filename)
print(doc[:50])

# clean document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'trans_sequences.txt'
save_doc(sequences, out_filename)

Good people of Atlanta, we must never forget… that
['good', 'people', 'of', 'atlanta', 'we', 'must', 'never', 'that', 'anthony', 'yeah', 'himself', 'anthony', 'bourdain', 'had', 'the', 'greatest', 'job', 'that', 'show', 'business']
Total Tokens: 52922
Unique Tokens: 4313
Total Sequences: 52871
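
The windowing step in the cell above can be illustrated on a toy token list. This is a minimal sketch, assuming a window of 3 input words plus 1 target word instead of the notebook's 50 + 1; `make_sequences` and `toy_tokens` are hypothetical names introduced only for this illustration.

```python
def make_sequences(tokens, length):
    """Return every run of `length` consecutive tokens, mirroring the loop above."""
    sequences = []
    # same bounds as the notebook loop: range(length, len(tokens))
    for i in range(length, len(tokens)):
        sequences.append(tokens[i - length:i])
    return sequences

# toy data (assumption): the first seven cleaned tokens from the output above
toy_tokens = ["good", "people", "of", "atlanta", "we", "must", "never"]
toy_sequences = make_sequences(toy_tokens, length=3 + 1)

# the first `length - 1` words are the model input, the last word is the target
pairs = [(seq[:-1], seq[-1]) for seq in toy_sequences]
for context, target in pairs:
    print(context, "->", target)
```

With 7 tokens and a window of 4 this yields 7 - 4 = 3 sequences, the same arithmetic that gives 52922 - 51 = 52871 in the output above.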


In [4]:
from numpy import array
from pickle import dump
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.layers import Dense, Flatten, Dropout, Embedding, LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'trans_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)

# save the model to file
model.save('model.h5')
# save the tokenizer
with open('tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            215700    
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 4314)              435714    
Total params: 802,314
Trainable params: 802,314
Non-trainable params: 0
_________________________________________________________________
None
Train on 52871 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
...

Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
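
The parameter counts in the model summary above can be reproduced by hand. The sketch below assumes the standard formulas for Embedding, LSTM (four gates, each with input weights, recurrent weights and a bias) and Dense layers; the helper function names are hypothetical and exist only for this check.

```python
def embedding_params(vocab_size, dim):
    # one `dim`-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

vocab_size = 4313 + 1  # unique tokens plus index 0, which the Tokenizer never assigns to a word

counts = {
    "embedding": embedding_params(vocab_size, 50),
    "lstm": lstm_params(50, 100),
    "lstm_1": lstm_params(100, 100),
    "dense": dense_params(100, 100),
    "dense_1": dense_params(100, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:10s} {n}")
print("total     ", total)
```

The per-layer values and the total agree with the `Param #` column printed by `model.summary()`.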


In [5]:
# create a seed text - taken from an early section of 2020 SNL Opening Monologue
seed_text = "this morning after the results came in got a text from a friend of mine in london and she said the world feels like a safer place now that america has a new president and I said that’s great but America doesn’t do you guys remember what life was like before"

In [10]:
import pandas as pd
from random import randint
from pickle import load
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word and take the index of the most likely one
        yhat = model.predict(encoded, verbose=0).argmax(axis=-1)[0]
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'trans_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model.h5')

# load the tokenizer
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = load(f)

# seed text: reuses seed_text defined in the earlier cell

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 200)
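
`generate_seq` scans `tokenizer.word_index` from the start for every generated word. One possible alternative, sketched here under assumptions, is to invert the mapping once and look words up directly. The `word_index` dictionary below is a hypothetical stand-in for `tokenizer.word_index`, and `lookup` is a name introduced only for this example.

```python
# hypothetical stand-in for tokenizer.word_index (word -> integer id, ids start at 1)
word_index = {"the": 1, "i": 2, "was": 3, "like": 4}

# invert once, instead of scanning the dict for every generated word
index_word = {index: word for word, index in word_index.items()}

def lookup(index):
    # ids with no word (such as the padding id 0) map to an empty string,
    # matching the out_word = '' default in generate_seq
    return index_word.get(index, "")

predicted_ids = [2, 3, 4, 0]
words = [lookup(i) for i in predicted_ids]
print(words)
```

The output is the same as the linear scan; the difference is that each lookup no longer depends on the vocabulary size.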

In [11]:
# convert to list
generated_list = generated.split()

In [12]:
# replaces potentially offensive words
bad = pd.read_csv('badlist.csv')
bad_dict = dict(zip(bad.Bad, bad.Replace))

In [13]:
# censorship function: loop through the list and replace flagged words in place
def censor(alist):
    for i, word in enumerate(alist):
        if word in bad_dict:
            alist[i] = bad_dict[word]

In [14]:
censor(generated_list)
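
`censor` modifies its argument in place and reads the global `bad_dict`. A variant that returns a new list and takes the mapping as a parameter might look like the sketch below. The function name `censored` and the sample mapping are assumptions for illustration; the real replacements come from `badlist.csv`.

```python
def censored(words, replacements):
    """Return a new list with every word found in `replacements` swapped out."""
    return [replacements.get(word, word) for word in words]

# hypothetical mapping; the notebook loads the real one from badlist.csv
sample_replacements = {"darn": "dxxx"}
sample_words = ["well", "darn", "that", "darn", "thing"]

clean_words = censored(sample_words, sample_replacements)
print(" ".join(clean_words))
```

Because the input list is left untouched, the uncensored output stays available for comparison.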

In [15]:
# convert list to text for word cloud
generated_text = " ".join(generated_list)

In [18]:
print(seed_text + '\n')
print(generated_text)

this morning after the results came in got a text from a friend of mine in london and she said the world feels like a safer place now that america has a new president and I said that’s great but America doesn’t do you guys remember what life was like before

that feels like i know that he was like this moans he made a shotgun the poor black person screaming at meetings i got attacked online by me and then i had to admit kind of fxxx hilarious i got in the circus the fxxx is sending and it was a newport be like bro mr about that i am as strong so hard about that shxxx literally from all those filipino them is buckshot this is the least threatening moxxx i know what i mean i am i know that is not a joke that would pass around then he left a bar in chicago andand uh weeks all right short of rape in the where i said it i trust you moxxx the first one was the victim of me now i might then i was like i mean i said no i was supposed to be a hero heroes die friends thought about mmhm cold much

In [17]:
# Human Editing - adding punctuation and grammar

This morning after the results came in, I got a text from a friend of mine in London and she said the world feels like a safer place now that America has a new president and I said that’s great, but America doesn’t. Do you guys remember what life was like before ... 

that? Feels like I know that, he was like this moans. He made a shotgun ... the poor black person screaming at meetings. I got attacked online by me, and then I had to admit, kind of fxxx hilarious. I got in the circus, the fxxx is sending and it was a Newport. Be like "Bro, Mr, about that I am as strong, so hard, about that." Shxxx literally, from all those Filipino. Them is buckshot. This is the least threatening moxxx I know. What I mean, I am, I know that is not a joke that would pass around. Then he left a bar in Chicago, and and uh weeks all right. Short of rape, in the where I said it, I trust you moxxx. The first one was the victim of me, now I might, then I was like "I mean I said no, I was supposed to be a hero". Heroes die friends, thought about mmhm cold, much? I was in a comedy club in New York, but then they just missed your radio down. Nobody wants to be the Clippers and then he walks right automatic. Chappelle right, you can kiss all this engineering homework goodbye.