# TV Script Generation
In this project, you'll generate your own Simpsons TV scripts using recurrent neural networks (RNNs). You'll use part of the Simpsons dataset of scripts from 27 seasons. The neural network you build will generate a new TV script for a scene at Moe's Tavern.

## Get the Data
The data is already provided for you. You'll be using a subset of the original dataset that consists only of the scenes in Moe's Tavern. It doesn't include other versions of the tavern, such as "Moe's Cavern", "Flaming Moe's", or "Uncle Moe's Family Feed-Bag".



In [35]:
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import helper

data_dir = './data/simpsons/moes_tavern_lines.txt'
text = helper.load_data(data_dir)
# Ignore notice, since we don't use it for analysing the data
text = text[81:]

## Explore the Data
Play around with view_sentence_range to view different parts of the data.

In [36]:
view_sentence_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
scenes = text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))


Dataset Stats
Roughly the number of unique words: 11492
Number of scenes: 262
Average number of sentences in each scene: 15.251908396946565
Number of lines: 4258
Average number of words in each line: 11.50164396430249

The sentences 0 to 10:

Moe_Szyslak: (INTO PHONE) Moe's Tavern. Where the elite meet to drink.
Bart_Simpson: Eh, yeah, hello, is Mike there? Last name, Rotch.
Moe_Szyslak: (INTO PHONE) Hold on, I'll check. (TO BARFLIES) Mike Rotch. Mike Rotch. Hey, has anybody seen Mike Rotch, lately?
Moe_Szyslak: (INTO PHONE) Listen you little puke. One of these days I'm gonna catch you, and I'm gonna carve my name on your back with an ice pick.
Moe_Szyslak: What's the matter Homer? You're not your normal effervescent self.
Homer_Simpson: I got my problems, Moe. Give me another one.
Moe_Szyslak: Homer, hey, you should not drink to forget your problems.
Barney_Gumble: Yeah, you should only drink to enhance your social skills.



## Implement Preprocessing Functions
Preprocessing is the first step with any dataset. Implement the following preprocessing functions:

* Lookup Table
* Tokenize Punctuation

### Lookup Table
To create a word embedding, you first need to transform the words to ids. In this function, create two dictionaries:

* A dictionary that maps each word to an id, which we'll call vocab_to_int
* A dictionary that maps each id back to its word, which we'll call int_to_vocab

Return these dictionaries in the following tuple (vocab_to_int, int_to_vocab)
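
The two dictionaries described above could be built along these lines. This is a minimal sketch, not the project's reference solution: the function name `create_lookup_tables`, the assumption that the input is a list of word tokens, and the most-frequent-first ordering of ids are all illustrative choices.

```python
from collections import Counter


def create_lookup_tables(words):
    """Build word<->id dictionaries from a list of word tokens.

    Hypothetical helper: the name and the most-frequent-first id
    ordering are assumptions made for illustration.
    """
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


# Toy input; in the notebook this would be the tokenized script text.
sample_words = "moe's tavern where the elite meet to drink the".split()
vocab_to_int, int_to_vocab = create_lookup_tables(sample_words)
print(vocab_to_int)
```

Any ordering works as long as the two dictionaries are exact inverses of each other; sorting by frequency simply gives the most common words the smallest ids.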

In [20]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

In [21]:
import helper

data_dir = './data/simpsons/moes_tavern_lines.txt'
text = helper.load_data(data_dir)

In [22]:
print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
scenes = text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()

Dataset Stats
Roughly the number of unique words: 11501
Number of scenes: 263
Average number of sentences in each scene: 15.190114068441064
Number of lines: 4258
Average number of words in each line: 11.504462188821043



In [23]:
# convert the loaded text to lowercase
raw_text = text.lower()

In [24]:
# create mapping of unique words to integers
# (the names say "char", but the tokens are whitespace-separated words)
chars = sorted(list(set(raw_text.split())))
char_to_int = dict((c, i) for i, c in enumerate(chars))

In [25]:
len(chars)

10353

In [26]:
len(raw_text.split())

48986

In [27]:
# summarize the loaded data: n_chars is the number of word tokens, n_vocab the number of unique words
n_chars = len(raw_text.split())
n_vocab = len(chars)

In [28]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
words = raw_text.split()  # split once instead of twice per loop iteration
for i in range(0, n_chars - seq_length, 1):
    seq_in = words[i:i + seq_length]   # window of seq_length words
    seq_out = words[i + seq_length]    # the word that follows the window
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)

In [29]:
n_patterns

48886
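
The pattern count follows from the window arithmetic: 48986 word tokens minus a window of 100 leaves 48886 input/target pairs. The sketch below shows the same sliding-window idea on a toy sequence; the helper name `make_windows` and the toy ids are hypothetical and exist only for illustration.

```python
def make_windows(token_ids, seq_length):
    """Return (inputs, targets): each input holds seq_length ids,
    and its target is the id that immediately follows the window."""
    inputs, targets = [], []
    for i in range(len(token_ids) - seq_length):
        inputs.append(token_ids[i:i + seq_length])
        targets.append(token_ids[i + seq_length])
    return inputs, targets


# Toy sequence of six token ids with a window of three.
toy_ids = [0, 1, 2, 3, 4, 5]
toy_X, toy_y = make_windows(toy_ids, seq_length=3)
print(toy_X)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(toy_y)  # [3, 4, 5]
```

Six tokens with a window of three give 6 - 3 = 3 pairs, mirroring the count above.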

In [30]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
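
To make the shapes concrete, here is a small NumPy-only sketch of the same three steps (reshape, normalize, one-hot encode) on toy data. It uses `np.eye` as a stand-in for the one-hot step rather than the Keras utility, and the toy values are assumptions chosen for illustration.

```python
import numpy as np

# Toy windows and targets (hypothetical values), with a vocabulary of 6 ids.
toy_X = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
toy_y = [3, 4, 5]
toy_vocab = 6

# Reshape to [samples, time steps, features] and scale ids into [0, 1).
X_toy = np.reshape(toy_X, (len(toy_X), 3, 1)) / float(toy_vocab)

# One-hot encode the targets: row k has a single 1 at column toy_y[k].
n_classes = max(toy_y) + 1
y_toy = np.eye(n_classes)[toy_y]

print(X_toy.shape)  # (3, 3, 1)
print(y_toy.shape)  # (3, 6)
```

Each row of `y_toy` contains exactly one 1, so the softmax output layer needs one unit per class, which is why the model below sizes its final layer from `y.shape[1]`.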

In [31]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))

In [32]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [33]:
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [34]:
# fit the model
model.fit(X, y, epochs=50, batch_size=64, callbacks=callbacks_list)

Epoch 1/50
Epoch 00001: loss improved from inf to 8.01233, saving model to weights-improvement-01-8.0123-bigger.hdf5
Epoch 2/50
Epoch 00002: loss improved from 8.01233 to 7.87624, saving model to weights-improvement-02-7.8762-bigger.hdf5
Epoch 3/50
Epoch 00003: loss did not improve
Epoch 4/50
Epoch 00004: loss did not improve
Epoch 5/50
Epoch 00005: loss did not improve
Epoch 6/50
Epoch 00006: loss did not improve
Epoch 7/50
Epoch 00007: loss did not improve
Epoch 8/50
Epoch 00008: loss did not improve
Epoch 9/50
Epoch 00009: loss did not improve
Epoch 10/50
Epoch 00010: loss did not improve
Epoch 11/50
Epoch 00011: loss did not improve
Epoch 12/50
Epoch 00012: loss did not improve
Epoch 13/50
Epoch 00013: loss did not improve
Epoch 14/50
Epoch 00014: loss did not improve
Epoch 15/50
Epoch 00015: loss did not improve
Epoch 16/50
Epoch 00016: loss did not improve
Epoch 17/50
Epoch 00017: loss did not improve
Epoch 18/50
Epoch 00018: loss did not improve
Epoch 19/50
Epoch 00019: loss did

KeyboardInterrupt: 

In [108]:
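start = np.random.randint(0, len(dataX)-1)
# copy the seed so the generation loop does not mutate dataX
pattern = list(dataX[start])
int_to_char = dict((i, c) for i, c in enumerate(chars))
print("Seed:")
# the tokens are words, so join them with spaces
print("\"" + ' '.join([int_to_char[value] for value in pattern]) + "\"")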

(29361, 5580)

In [None]:
import sys
# generate 1000 words, one at a time, by greedy (argmax) decoding
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))
    result = int_to_char[index]
    # the tokens are words, so separate them with a space
    sys.stdout.write(result + ' ')
    # slide the window: append the prediction and drop the oldest token
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\nDone.")