### Chappelle Monologue Generator

Based on the Machine Learning Mastery tutorial by Jason Brownlee, "How to Develop a Word-Level Neural Language Model and Use it to Generate Text":

https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

In [1]:
with open("transcripts.txt", "r", encoding="utf-8") as f:
    transcript_text = f.read()

In [2]:
print(transcript_text[:50])

Good people of Atlanta, we must never forget… that
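
The preview above ends the word `forget` with a Unicode ellipsis (`…`, U+2026). That character is not part of Python's `string.punctuation`, so a cleaning step that strips only ASCII punctuation and then keeps only `isalpha()` tokens would drop the whole word rather than keep `forget`. The sketch below illustrates the effect and one possible workaround. The helper name `strip_unicode_punct` is hypothetical and not part of the original tutorial; it assumes that removing every character in a Unicode punctuation category is acceptable for this corpus.

```python
import string
import unicodedata

def strip_unicode_punct(token):
    """Drop ASCII punctuation plus any character in a Unicode 'P*' (punctuation) category.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return ''.join(
        ch for ch in token
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith('P')
    )

sample = "forget…"

# ASCII-only stripping leaves the ellipsis in place, so isalpha() rejects the token
ascii_table = str.maketrans('', '', string.punctuation)
ascii_cleaned = sample.translate(ascii_table)
print(repr(ascii_cleaned), ascii_cleaned.isalpha())

# Unicode-aware stripping removes the ellipsis, so the word survives the filter
unicode_cleaned = strip_unicode_punct(sample)
print(repr(unicode_cleaned), unicode_cleaned.isalpha())
```

Whether to keep such words is a modeling choice: dropping them shrinks the vocabulary slightly, while keeping them preserves more of the original phrasing.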


In [3]:
import string

# load doc into memory
def load_doc(filename):
    # open the file read-only; the context manager closes it even if reading fails
    with open(filename, 'r', encoding="utf-8") as file:
        # read all text
        return file.read()

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # write as UTF-8 to match load_doc, so non-ASCII letters survive on any platform
    with open(filename, 'w', encoding="utf-8") as file:
        file.write(data)

# load document
in_filename = 'transcripts.txt'
doc = load_doc(in_filename)
print(doc[:50])

# clean document
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

# organize into sequences of tokens: 50 input words plus 1 output word
length = 50 + 1
sequences = []
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'trans_sequences.txt'
save_doc(sequences, out_filename)

Good people of Atlanta, we must never forget… that
['good', 'people', 'of', 'atlanta', 'we', 'must', 'never', 'that', 'anthony', 'yeah', 'himself', 'anthony', 'bourdain', 'had', 'the', 'greatest', 'job', 'that', 'show', 'business']
Total Tokens: 52922
Unique Tokens: 4313
Total Sequences: 52871
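
The sequence count follows directly from the loop bounds: 52922 tokens minus a window length of 51 gives 52871 sequences. The toy example below replays the same sliding-window loop on seven tokens with a window of three, which makes the pattern easier to inspect. The function name `make_sequences` and the toy data are assumptions for illustration only.

```python
def make_sequences(tokens, length):
    """Return space-joined windows of `length` tokens, using the same loop bounds as the cell above."""
    sequences = []
    for i in range(length, len(tokens)):
        # window covers tokens[i-length] .. tokens[i-1]
        sequences.append(' '.join(tokens[i - length:i]))
    return sequences

toy_tokens = ['good', 'people', 'of', 'atlanta', 'we', 'must', 'never']
toy_sequences = make_sequences(toy_tokens, length=3)

for line in toy_sequences:
    print(line)
print('Total Sequences: %d' % len(toy_sequences))
```

With these bounds the final token (`never` here) never appears in any window, because the last slice stops one position short of the end. If that last window is wanted, `range(length, len(tokens) + 1)` would include it, at the cost of changing the totals reported above by one.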
