I tried out a recurrent neural network (RNN) in [this notebook](https://www.kaggle.com/pradhyo/keras-lstm-script-generator?scriptVersionId=9466911) to generate TV scripts for The Office. While the output was nowhere near realistic, some of the generated lines were still interesting to read.

Here's some sample output. The complete outputs are [here](https://github.com/Pradhyo/machine-learning-practice-notebooks/blob/master/text-generation/generated_script.txt) and [here](https://github.com/Pradhyo/machine-learning-practice-notebooks/blob/master/text-generation/phyllis_script.txt) (lines only from Phyllis).
> jim: i know, i’m not saying it. pam: how do you know a joke? phyllis: i don’t think we should get in here. dwight: thank you. pam: i want to be working on the phone? jim: oh, hey, darryl. what do you mean? michael: oh, thanks. jim: i want to go? andy: yeah. kevin: i'm sorry

The first thing to do was to fetch the lines from The Office, so I wrote [some web scraping code for this](https://github.com/Pradhyo/the-office-us-tv-show). I first learned about recurrent neural networks in my [Deep Learning Nanodegree](https://www.udacity.com/course/deep-learning-nanodegree--nd101) and did a [similar project](https://github.com/Pradhyo/udacity-deep-learning-nanodegree/tree/master/tv-script-generation) there using TensorFlow, but I wanted to try this out with Keras.

I found [this notebook](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb) that contains a Keras implementation, but when I first ran it unchanged on the Office dataset, it exceeded the time limit on [Kaggle Kernels with a free GPU](https://www.kaggle.com/dansbecker/running-kaggle-kernels-with-a-gpu) despite their very generous 9-hour limit.

So I started modifying the code and decided to use words instead of characters, thinking it would produce better results since the output from the original code had some invalid words. This way the model would only have to learn to form sentences, not words as well. However, this caused the kernel to run out of memory because of the huge increase in the number of building blocks: there were around 70 unique characters before but around 20,000 unique words.

To reduce the number of unique words, I first dropped everyone's lines except Michael's, but since he has a lot of lines, the data was still too much. I ended up using just Phyllis's lines and got some good results. Since I wanted to capture the styles of all actors, I decided on another approach to reduce the data: taking the 2,000 most common words and keeping only the lines made up entirely of these words.

All of these steps can be seen in this [initial notebook](https://github.com/Pradhyo/machine-learning-practice-notebooks/blob/master/text-generation/keras-lstm-script-generator-scratchpad.ipynb).

In [1]:
# https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb
import keras
import numpy as np

Using TensorFlow backend.


In [5]:
import ssl

# To avoid ssl: certificate_verify_failed error
ssl._create_default_https_context = ssl._create_unverified_context

# Get the script
def get_text():
    
    office_script_file_url = "https://raw.githubusercontent.com/Pradhyo/the-office-us-tv-show/master/the-office-all-episodes.txt"
    path = keras.utils.get_file('script.txt', origin=office_script_file_url)
        
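    # The script contains non-ASCII characters, so read it as UTF-8 explicitly
    with open(path, encoding='utf-8') as f:
        text = f.read().lower()
    return text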

text = get_text()
print('Corpus length:', len(text))

Corpus length: 4108664


In [6]:
from collections import Counter
from pprint import pprint
char_counts = Counter()
for c in text:
    char_counts[c] += 1
    
pprint(char_counts.most_common())
pprint(len(char_counts))

[(' ', 683865),
 ('e', 325287),
 ('t', 257302),
 ('a', 255726),
 ('o', 249519),
 ('i', 231860),
 ('n', 192916),
 ('h', 182121),
 ('s', 168200),
 ('r', 145558),
 ('l', 140074),
 ('d', 111329),
 ('y', 104883),
 ('m', 102473),
 ('.', 100099),
 ('u', 97290),
 ('g', 81523),
 ('c', 79151),
 ('w', 78946),
 ('\n', 69462),
 (':', 60428),
 ('p', 53410),
 (',', 47960),
 ('k', 47018),
 ('b', 43158),
 ('f', 42647),
 ("'", 33029),
 ('v', 27859),
 ('?', 18735),
 ('j', 18556),
 ('[', 12031),
 (']', 12021),
 ('!', 9991),
 ('-', 6245),
 ('x', 3625),
 ('’', 3209),
 ('"', 2435),
 ('z', 2005),
 ('q', 1331),
 ('0', 1038),
 ('1', 615),
 ('2', 424),
 ('…', 417),
 ('5', 360),
 ('”', 255),
 ('3', 251),
 ('4', 248),
 ('“', 246),
 ('9', 156),
 ('—', 149),
 (';', 145),
 ('$', 138),
 ('8', 137),
 ('7', 115),
 ('6', 109),
 ('/', 88),
 ('&', 85),
 ('‘', 69),
 ('#', 59),
 ('*', 59),
 ('%', 58),
 ('–', 55),
 (')', 45),
 ('(', 32),
 ('_', 6),
 ('é', 6),
 ('+', 4),
 ('ü', 4),
 ('@', 3),
 ('�', 3),
 ('ñ', 3),
 ('{', 2),
 

In [7]:
# Get some sample strings for each character to explore the data
def sample_strings(char, string_length=20, num_samples=5):
    sample = 0
    samples = []
    for i, c in enumerate(text):
        if i < string_length:
            continue
        if char == c:
            samples.append(text[int(i-string_length/2):int(i+string_length/2)])
            sample += 1
            if sample == num_samples:
                break
    return samples

for c in char_counts:
    print(f"{c}: {sample_strings(c)}")

m: ['l right jim. your qu', 'ibrary?\njim: oh, i t', 'it. so...\nmichael: s', " you've come to the ", 'me to the master for']
i: ['ll right jim. your q', 'r quarterlies look v', 'how are things at th', 's at the library?\nji', 'library?\njim: oh, i ']
c: ["ld you. i couldn't c", " couldn't close it. ", '. so...\nmichael: so ', "so you've come to th", 'for guidance? is thi']
h: ['ery good. how are th', ' how are things at t', 'hings at the library', 'ry?\njim: oh, i told ', ' so...\nmichael: so y']
a: ['m. your quarterlies ', 'good. how are things', 're things at the lib', 't the library?\njim: ', 'so...\nmichael: so yo']
e: ['your quarterlies loo', ' quarterlies look ve', 'ies look very good. ', 'od. how are things a', 'ings at the library?']
l: ['ur quarterlies look ', 'arterlies look very ', 'gs at the library?\nj', ': oh, i told you. i ', "you. i couldn't clos"]
:: ['brary?\njim: oh, i to', "..\nmichael: so you'v", 'opper?\njim: actually', 'h.\nmichael: all righ', '.\n\nmichael: [on

ñ: ['ame is. señor loaden', 'ughing] señor loaden', ' called señor loaden']
–: ['reminders – no burpi', 'you asked – connecti', 'chapter 2 – announci', 'chapter 4 – one of t', 'chapter 9 – the tabl']
ü: ['rine and güiro]\ndarr', '[removes güiro and b', '. [plays güiro and s', ' playing güiro] fish']
é: ['elve clichés every t', ' her fiancé ravi was', 's ex-fiancé’s weddin', 's ex-fiancé.\npam: [e', 'y ex-fiancé.\npam: [s']
—: ['0 children—\npam: kay', 'rk and, um—\npete: pe', 'k: is this—is this l', 't a glance—\ndwight: ', 'ait, sales—what sale']


In [8]:
# See longer strings for non alphanumeric characters
for c in char_counts:
    if not c.isalnum():
        print(f"{c}: {sample_strings(c, 40)}")

:: [' at the library?\njim: oh, i told you. i ', "se it. so...\nmichael: so you've come to ", 'ng, grasshopper?\njim: actually, you call', 'e, but yeah.\nmichael: all right. well, l', " it's done.\n\nmichael: [on the phone] yes"]
 : ['im. your quarterlies look very good. how', 'our quarterlies look very good. how are ', 'uarterlies look very good. how are thing', 'lies look very good. how are things at t', ' look very good. how are things at the l']
.: ['rlies look very good. how are things at ', "\njim: oh, i told you. i couldn't close i", " i couldn't close it. so...\nmichael: so ", "ouldn't close it. so...\nmichael: so you'", "uldn't close it. so...\nmichael: so you'v"]
?: ['hings at the library?\njim: oh, i told yo', " master for guidance? is this what you'r", ' saying, grasshopper?\njim: actually, you', " forever. right, pam?\npam: well. i don't", '. [growls]\npam: what?\nmichael: any messa']

: ['ings at the library?\njim: oh, i told you', "dn't close it. so...\nmichael: so you'v

Looking at the above text, some characters like `\n` appear between words, while others like `'` appear inside words.
I am going to leave the ones within words as they are but treat the others as separate words, so the model doesn't consider *jim* in `\njim` different from plain `jim`. I am also going to treat all digits as the same.

In [9]:
# consider these as words
consider_words = ''.join(c for c in char_counts if not c.isalnum())
print(consider_words)

: .?
,'[]-!"$%;)&#/(*+{@�_}=’…“”‘–—


Looking at the symbols more closely, few of them appear within words, so I am just going to treat all of them as separate words.

In [10]:
numbers = '0123456789'
def replace_numbers(text):
    for n in numbers:
        text = text.replace(n, "0")
    return text

text = replace_numbers(text)
consider_words += '0' # consider 0 also a word
print(consider_words)

: .?
,'[]-!"$%;)&#/(*+{@�_}=’…“”‘–—0


In [12]:
def split_into_words(text, consider_words):
    # Split text into words - characters above are also considered words
    text = text.replace(' ', ' | ') # pick a char not in the above list
    text = text.replace('\n', ' | ') # pick a char not in the above list

    for char in consider_words:
        text = text.replace(char, f" {char} ") # to split on spaces to get char

    words_with_pipe = text.split()
    words = [word if word != '|' else ' ' for word in words_with_pipe]
    return words

words = split_into_words(text, consider_words)
print(words[:50])

['michael', ':', ' ', 'all', ' ', 'right', ' ', 'jim', '.', ' ', 'your', ' ', 'quarterlies', ' ', 'look', ' ', 'very', ' ', 'good', '.', ' ', 'how', ' ', 'are', ' ', 'things', ' ', 'at', ' ', 'the', ' ', 'library', '?', ' ', 'jim', ':', ' ', 'oh', ',', ' ', 'i', ' ', 'told', ' ', 'you', '.', ' ', 'i', ' ', 'couldn']
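
For comparison, roughly the same tokenization can be sketched with a single regular expression. This is an alternative to `split_into_words`, not the code used in the notebook; the `tokenize` name and the pattern are illustrative assumptions, and edge cases such as tabs or a literal `|` in the text are handled slightly differently.

```python
import re

# Runs of letters form words; every digit, whitespace character,
# and symbol becomes its own token.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d|\s|[\W_]")

def tokenize(text):
    # Lowercase and collapse all digits to "0", as the notebook does
    text = re.sub(r"\d", "0", text.lower())
    tokens = TOKEN_PATTERN.findall(text)
    # Represent any whitespace (including newlines) as a single space token
    return [" " if token.isspace() else token for token in tokens]

print(tokenize("Michael: All right Jim.\nJim: It's 2005"))
```

Keeping the space as an explicit token is what lets the generated words be joined back together with `''.join(...)` later on.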


In [16]:
# Length of extracted word sequences
maxlen = 20

# We sample a new sequence every `step` words
step = 3

def setup_inputs(words, maxlen, step):
    try:
        # This holds our extracted sequences
        sentences = []

        # This holds the targets (the follow-up words)
        next_words = []

        for i in range(0, len(words) - maxlen, step):
            sentences.append(words[i: i + maxlen])
            next_words.append(words[i + maxlen])
        print('Number of sequences:', len(sentences))

        # List of unique words in the corpus
        unique_words = sorted(set(words))
        print('Unique words:', len(unique_words))
        # Dictionary mapping unique words to their index in `unique_words`
        word_indices = {word: i for i, word in enumerate(unique_words)}

        # Next, one-hot encode the words into binary arrays.
        # Use the builtin `bool`: the `np.bool` alias is not available in every NumPy release.
        print('Vectorization...')
        x = np.zeros((len(sentences), maxlen, len(unique_words)), dtype=bool)
        y = np.zeros((len(sentences), len(unique_words)), dtype=bool)
        for i, sentence in enumerate(sentences):
            for t, word in enumerate(sentence):
                x[i, t, word_indices[word]] = 1
            y[i, word_indices[next_words[i]]] = 1
        return x, y, unique_words, word_indices
    except MemoryError as e:
        print(e)
        pass

# Commenting out to avoid MemoryError
# Tried catching it but didn't seem to work
# x, y, unique_words, word_indices = setup_inputs(words, maxlen, step)
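
To make the roles of `maxlen` and `step` concrete, here is a minimal sketch of the same sliding-window loop on a tiny hand-made token list. The `make_windows` helper and the sample tokens are hypothetical and only for illustration.

```python
def make_windows(tokens, maxlen, step):
    # Pair each window of `maxlen` tokens with the token that follows it,
    # starting a new window every `step` tokens.
    pairs = []
    for i in range(0, len(tokens) - maxlen, step):
        pairs.append((tokens[i:i + maxlen], tokens[i + maxlen]))
    return pairs

# Hypothetical sample tokens in the same format as `words`
tokens = ['pam', ':', ' ', 'what', '?', ' ', 'jim', ':', ' ', 'nothing', '.']

for window, target in make_windows(tokens, maxlen=4, step=3):
    print(window, '->', repr(target))
```

A smaller `step` yields more, heavily overlapping training sequences, while a larger one yields fewer sequences and a smaller `x` array.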


### Reducing data

Since the above was throwing a MemoryError, I tried reducing the data to the lines of a single actor. Using Michael's lines caused the same issue, so I tried Phyllis's lines instead.
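
A back-of-the-envelope estimate shows why the one-hot arrays are the bottleneck. The sketch below assumes one byte per boolean cell (which is how NumPy stores `bool`) and plugs in the sequence and vocabulary counts that the later cells in this post print; the `one_hot_bytes` helper is hypothetical.

```python
def one_hot_bytes(num_sequences, maxlen, vocab_size, bytes_per_cell=1):
    # x has shape (num_sequences, maxlen, vocab_size)
    x_cells = num_sequences * maxlen * vocab_size
    # y has shape (num_sequences, vocab_size)
    y_cells = num_sequences * vocab_size
    return (x_cells + y_cells) * bytes_per_cell

maxlen = 20
datasets = {
    "phyllis only": (8328, 1813),       # sequences, unique words
    "top 2000 words": (189933, 1992),
}

for name, (num_sequences, vocab_size) in datasets.items():
    size_gb = one_hot_bytes(num_sequences, maxlen, vocab_size) / 1e9
    print(f"{name}: {size_gb:.2f} GB")
```

The Phyllis data needs roughly 0.3 GB while the top-2000-words data needs close to 8 GB, and the size grows with both the number of sequences and the vocabulary size, which is why the full corpus with its much larger vocabulary does not fit.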

In [17]:
text = get_text()

selected_actor = "phyllis"

def get_selected_lines(text, selected_actor):
    lines = text.split("\n")
    return "\n".join(line for line in lines if line.startswith(f"{selected_actor}:"))

text = get_selected_lines(text, selected_actor)
print(text[:200])

phyllis: so what does downsizing actually mean?
phyllis: what?
phyllis: well, uh, for decorations, maybe we could... it's stupid, forget it.
phyllis: i was just going to say, maybe we could have strea


In [18]:
text = replace_numbers(text)
words = split_into_words(text, consider_words)
x, y, unique_words, word_indices = setup_inputs(words, maxlen, step)

Number of sequences: 8328
Unique words: 1813
Vectorization...


In [None]:
from keras import layers

def build_model(maxlen, num_unique_words):
    model = keras.models.Sequential()
    model.add(layers.LSTM(128, input_shape=(maxlen, num_unique_words)))
    model.add(layers.Dense(num_unique_words, activation='softmax'))
    optimizer = keras.optimizers.RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)    
    return model

model = build_model(maxlen, len(unique_words))
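
To get a feel for the size of this model, the number of trainable weights can be worked out by hand. The sketch below uses the standard parameter-count formulas for an LSTM layer with biases and for a dense layer; the helper names are hypothetical, and `model.summary()` remains the authoritative check.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # One weight per input/output pair plus one bias per output
    return input_dim * output_dim + output_dim

def total_params(vocab_size, units=128):
    return lstm_params(vocab_size, units) + dense_params(units, vocab_size)

# Vocabulary size printed for the Phyllis-only data
print(total_params(1813))
```

For the Phyllis vocabulary of 1,813 words this comes to about 1.2 million weights, most of them in the LSTM layer because its input is a one-hot vector as wide as the vocabulary.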

In [None]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
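
The effect of `temperature` in `sample` can be seen without a trained model. This is a minimal sketch on a made-up three-word distribution, using only the standard library; `apply_temperature` and `entropy` are hypothetical helpers that mirror the log, divide, and renormalize steps above.

```python
import math

def apply_temperature(probs, temperature):
    # Same reweighting as `sample`: take logs, divide by temperature, renormalize
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    # Higher entropy means a flatter, more random distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical model output over three words
probs = [0.6, 0.3, 0.1]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    reshaped = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in reshaped], round(entropy(reshaped), 3))
```

Temperatures below 1 sharpen the distribution toward the most likely word, a temperature of 1 leaves it unchanged, and temperatures above 1 flatten it so that unlikely words are picked more often.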

In [None]:
import random
import sys

def train_model(text, words, unique_words, word_indices, max_epoch, script_file, model_file=""):
    with open(script_file, "wt") as f:
        f.write("") # Just to create/overwrite the file

    for epoch in range(1, max_epoch):
        with open(script_file, "at") as f:
            f.write(f'\n\nepoch {epoch}\n\n')
        # Fit the model for 1 epoch on the available training data
        model.fit(x, y,
                  batch_size=128,
                  epochs=1)

        # Select a text seed at random
        start_index = random.randint(0, len(words) - maxlen - 1)
        generated_text = words[start_index: start_index + maxlen]

        with open(script_file, "at") as f:
            f.write('--- Generating with seed: "' + ''.join(generated_text) + '"\n')

        with open(script_file, "at") as f:        
            for temperature in [0.2, 0.5, 1.0, 1.2]:
                f.write('\n--- temperature: ' + str(temperature) + "\n")
                f.write(''.join(generated_text))

                for i in range(200):
                    sampled = np.zeros((1, maxlen, len(unique_words)))
                    for t, word in enumerate(generated_text):
                        sampled[0, t, word_indices[word]] = 1.

                    preds = model.predict(sampled, verbose=0)[0]
                    next_index = sample(preds, temperature)
                    next_word = unique_words[next_index]

                    generated_text.append(next_word)
                    generated_text = generated_text[1:]

                    f.write(next_word)
        
        if model_file:
            model.save(model_file)
    
train_model(text, words, unique_words, word_indices, 100, "phyllis_script.txt")

#### Using only the most common words

Since I wanted to learn lines from all actors, I reduced the data by taking the 2,000 most common words and keeping only the lines made up solely of these words.

In [22]:
text = get_text()
text = replace_numbers(text)
words = split_into_words(text, consider_words)
words_counter = Counter(words)
print(len(words_counter))

# Display just 200 on the blog post
# 2000th most common word occurred 25 times
print(words_counter.most_common(200))

20795
[(' ', 753327), ('.', 100099), (':', 60428), (',', 47960), ("'", 33029), ('i', 29843), ('you', 24675), ('?', 18735), ('the', 17982), ('to', 16773), ('a', 15401), ('michael', 15184), ('s', 14738), ('it', 13938), ('[', 12031), (']', 12021), ('and', 11393), ('that', 11344), ('!', 9991), ('dwight', 9905), ('jim', 8971), ('is', 8502), ('t', 8013), ('of', 7616), ('pam', 7208), ('in', 6985), ('what', 6608), ('-', 6245), ('no', 6047), ('we', 6032), ('this', 5906), ('on', 5283), ('andy', 5087), ('my', 5041), ('me', 5035), ('m', 4934), ('have', 4886), ('just', 4786), ('know', 4453), ('do', 4432), ('so', 4427), ('for', 4387), ('oh', 4340), ('not', 4332), ('don', 4071), ('are', 3965), ('re', 3696), ('be', 3612), ('was', 3608), ('he', 3554), ('your', 3490), ('can', 3484), ('0', 3453), ('with', 3433), ('like', 3381), ('all', 3309), ('yeah', 3237), ('’', 3209), ('okay', 2981), ('up', 2911), ('but', 2847), ('here', 2749), ('out', 2722), ('right', 2710), ('at', 2659), ('get', 2623), ('about', 254

In [24]:
top_words = [word for word, count in words_counter.most_common(2000)]

print("Total number of top words: ", len(top_words))

def get_lines_with_words(top_words):
    selected_lines = []
    text = get_text()
    lines = text.split("\n")
    for line in lines:
        line = replace_numbers(line)
        words_in_line = split_into_words(line, consider_words)
        # Keep the line only if every word in it is one of the top words
        if all(word_in_line in top_words for word_in_line in words_in_line):
            selected_lines.append(line)
    return selected_lines
                
                
selected_lines = get_lines_with_words(top_words)
print("Total number of selected lines: ", len(selected_lines))
print(selected_lines[:100])

Total number of top words:  2000
Total number of selected lines:  40366
["jim: oh, i told you. i couldn't close it. so...", 'jim: actually, you called me in here, but yeah.', "michael: all right. well, let me show you how it's done.", '', '', "pam: well. i don't know.", 'pam: what?', 'michael: any messages?', 'pam: uh, yeah. just a fax.', "pam: you haven't told me.", '', '', '', '', 'jim: nothing.', 'michael: ok. all right. see you later.', 'jim: all right. take care.', 'michael: back to work.', '', 'jan: [on her cell phone] just before lunch. that would be great.', '', '', "jan: what? i'm sorry?", "michael: really? i didn't... [looks at pam] did we get a fax this morning?", 'pam: uh, yeah, the one...', 'jan: do you want to look at mine?', 'michael: yeah, yeah. lovely. thank you.', 'michael: ok...', 'michael: no, no, no, no, this is good. this is good. this is fine. excellent.', 'michael: ok. no problem.', '', 'jan: go ahead.', "michael: oh, that's not appropriate.", "michael: uh, i do

In [25]:
selected_text = "\n".join(selected_lines)

selected_text = replace_numbers(selected_text)
selected_words = split_into_words(selected_text, consider_words)

In [26]:
x, y, unique_words, word_indices = setup_inputs(selected_words, maxlen, step)

Number of sequences: 189933
Unique words: 1992
Vectorization...


In [None]:
model = build_model(maxlen, len(unique_words))
train_model(selected_text, selected_words, unique_words, word_indices, 100, "generated_script.txt", "top_lines.h5")        

## Reflections

- The [output for lines from Phyllis](https://github.com/Pradhyo/machine-learning-practice-notebooks/blob/master/text-generation/phyllis_script.txt) is just Phyllis talking to herself all the time.

- This [output from the most common words](https://github.com/Pradhyo/machine-learning-practice-notebooks/blob/master/text-generation/generated_script.txt) seems more realistic, but I think it suffers because many sentences were removed from the data, which interrupted the flow of the dialogs.

- As the `temperature` increased, so did the randomness in the dialogs.

- Getting many more sentences made up of only the most common words, without interrupting the flow of the dialogs, should produce better results.

## Resources
1. [Text generation with LSTM](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb) (notebook)