# How to generate your own "The Simpsons" TV script using Deep Learning
**Have you ever dreamed of creating your own episode of The Simpsons? I did.**

**This is the notebook for the story published at (https://towardsdatascience.com/how-to-generate-your-own-the-simpsons-tv-script-using-deep-learning-980337173796)** 

That is what I thought when I saw the Simpsons dataset on Kaggle. It is the perfect dataset for a small "just for fun" project on Natural Language Generation (NLG).

### What is Natural Language Generation (NLG)?

**Natural-language generation (NLG) is the aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form.
(https://en.wikipedia.org/wiki/Natural-language_generation)**

In this case, we will see how to train a model that can create new "Simpsons-style" conversations. As training input we will use the file simpsons_script_lines.csv from the Simpsons dataset.

In [0]:
import numpy as np
import pandas as pd
import os
import pickle
import re

from collections import Counter

from IPython.display import display

# Note: tf.contrib only exists in TensorFlow 1.x; it was removed in TensorFlow 2.0
import tensorflow as tf
from tensorflow.contrib import seq2seq

### Downloading and preparing the data
First, download the data file from the Kaggle page for "The Simpsons by the Data". Get the file [simpsons_script_lines.csv](https://medium.com/r/?url=https%3A%2F%2Fwww.kaggle.com%2Fwcukierski%2Fthe-simpsons-by-the-data%23simpsons_script_lines.csv), save it to a folder named "data", and unzip it. It should be about 34 MB after unzipping.

In [0]:
data_dir = './data'
input_file = os.path.join(data_dir, 'simpsons_script_lines.csv')

clean_text = ''

with open(input_file, "r", encoding="utf8") as f:
    for line in f:
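        # Capture the text column that follows the three leading numeric columns
        text = re.search('[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
        if text:
            text = text.group(1).replace('"', '')
            # Join the speaker name with underscores so it becomes a single word
            text_parts = text.split(':')
            text_parts[0] = text_parts[0].replace(' ', '_')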
            text = ':'.join(text_parts)
            clean_text += text + '\n'

print('\n'.join(clean_text.split('\n')[:10]))

Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa_Simpson: That life is worth living.
Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna_Krabappel-Flanders: Bart?
Bart_Simpson: Victory party under the slide!
(Apartment_Building: Ext. apartment building - day)
Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!
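
To see what the regular expression and the clean-up steps do to a single row, here is a minimal sketch. The `sample_line` is a made-up, shortened row (the IDs and the trailing columns are assumptions for illustration), and `extract_line` is a hypothetical helper that wraps the same steps as the loop above.

```python
import re

# Hypothetical, shortened CSV row; the numeric values are invented for illustration
sample_line = ('9550,32,210,"Lisa Simpson: (NEAR TEARS) Where\'s Mr. Bergstrom?",'
               '856000,true,9,3,Lisa Simpson\n')

def extract_line(line):
    """Return 'Speaker_Name: text' for a script row, or None if the row does not match."""
    match = re.search('[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
    if not match:
        return None
    text = match.group(1).replace('"', '')
    parts = text.split(':')
    # Only the part before the first colon (the speaker) gets underscores
    parts[0] = parts[0].replace(' ', '_')
    return ':'.join(parts)

print(extract_line(sample_line))
# A header-like row has no three numeric columns in a row, so it is skipped
print(extract_line('id,episode_id,number,raw_text,timestamp_in_ms'))
```

Rows that do not match the pattern return `None`, which mirrors the `if text:` check in the loop.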


### Data preprocessing

Before we can use this text to train our model, we need some additional preprocessing.

We'll split the script into an array of words using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the neural network to tell that "bye" and "bye!" are the same word.

To solve this, we create a dictionary that maps each symbol to a token, and we surround each token with the delimiter (a space). This separates the symbols from the words, making it easier for the neural network to predict the next word.

In the next step we will use this dictionary to replace the symbols, then build the vocabulary and the lookup tables for the words in the text.

In [0]:
tokenized_punctuation = {
    '.' : '||Period||',
    ',' : '||Comma||',
    '"' : '||Quotation_Mark||',
    ';' : '||Semicolon||',
    '!' : '||Exclamation_Mark||',
    '?' : '||Question_Mark||',
    '(' : '||Left_Parentheses||',
    ')' : '||Right_Parentheses||',
    '--' : '||Dash||',
    '\n' : '||Return||'
}

In [0]:
for key, token in tokenized_punctuation.items():
    clean_text = clean_text.replace(key, ' {} '.format(token))

clean_text = clean_text.lower()
clean_text = clean_text.split()

word_counts = Counter(clean_text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()} 

int_text = [vocab_to_int[word] for word in clean_text]
# Save the preprocessed data; the with-block makes sure the file is closed
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, tokenized_punctuation), f)
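
As a small worked example of the preprocessing above, the sketch below runs the same steps on a three-line toy string. The names `toy_tokens`, `sample`, and `words` are assumptions for illustration; only three of the notebook's punctuation tokens are used.

```python
from collections import Counter

# A subset of the punctuation dictionary, enough for the toy text
toy_tokens = {
    '.': '||Period||',
    '!': '||Exclamation_Mark||',
    '\n': '||Return||',
}

sample = "Bye! Bye.\nBye"
for symbol, token in toy_tokens.items():
    sample = sample.replace(symbol, ' {} '.format(token))

# Lowercasing also lowercases the tokens, e.g. '||Period||' becomes '||period||'
words = sample.lower().split()

# The most frequent word gets the smallest id
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
int_text = [vocab_to_int[word] for word in words]

print(words)
print(int_text)
```

All three occurrences of "bye" now map to the same id, regardless of the punctuation that followed them.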

### Build the Neural Network
Now that we have prepared the data, it is time to create the neural network.

In [0]:
def get_inputs():
    input_placeholder = tf.placeholder(tf.int32, [None, None], name = 'input')
    targets_placeholder = tf.placeholder(tf.int32, [None, None])
    learning_rate_placeholder = tf.placeholder(tf.float32)
    
    return input_placeholder, targets_placeholder, learning_rate_placeholder

In [0]:
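def get_init_cell(batch_size, rnn_size):
    # A single GRU layer, wrapped in MultiRNNCell so more layers can be added to the list
    gru = tf.contrib.rnn.GRUCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([gru])
    # Name the zero state so it can be fetched by name after reloading the graph
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state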

In [0]:
def get_embed(input_data, vocab_size, embed_dim):
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)    
    return embed

In [0]:
def build_rnn(cell, inputs):
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(state, name="final_state")
    return outputs, final_state

In [0]:
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embeddings = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embeddings)
    # No activation here: the layer returns raw logits, one score per vocabulary word
    logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=vocab_size, activation_fn=None)
    return logits, final_state
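
To make the data flow through these functions easier to follow, here is a rough NumPy sketch of the same pipeline: embedding lookup, one GRU layer, and a linear projection to vocabulary-sized logits. It is not the TensorFlow implementation; the toy sizes, the random weights, and all names (`gru_step`, `forward`, `W`, `U`, `W_out`) are assumptions, and GRU gating conventions differ slightly between libraries.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, rnn_size = 10, 4, 6   # toy sizes, far smaller than the notebook's
batch_size, seq_length = 2, 3

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Embedding table, initialised uniformly in [-1, 1] as in get_embed
embedding = rng.uniform(-1, 1, (vocab_size, embed_dim))

# GRU parameters: index 0 = update gate, 1 = reset gate, 2 = candidate state
W = rng.normal(0, 0.1, (3, embed_dim, rnn_size))
U = rng.normal(0, 0.1, (3, rnn_size, rnn_size))
b = np.zeros((3, rnn_size))

# Output projection (the fully connected layer without activation)
W_out = rng.normal(0, 0.1, (rnn_size, vocab_size))
b_out = np.zeros(vocab_size)

def gru_step(x, h):
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])           # how much old state to keep
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])           # how much old state feeds the candidate
    cand = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return z * h + (1.0 - z) * cand

def forward(word_ids):
    embed = embedding[word_ids]                       # (batch, seq, embed_dim)
    h = np.zeros((word_ids.shape[0], rnn_size))       # zero initial state
    outputs = []
    for t in range(word_ids.shape[1]):
        h = gru_step(embed[:, t, :], h)
        outputs.append(h)
    outputs = np.stack(outputs, axis=1)               # (batch, seq, rnn_size)
    logits = outputs @ W_out + b_out                  # (batch, seq, vocab_size)
    return logits, h

word_ids = rng.integers(0, vocab_size, (batch_size, seq_length))
logits, final_state = forward(word_ids)
print(logits.shape, final_state.shape)
```

The point to take away is the shapes: every position in every sequence ends up with one score per vocabulary word, and the final state has one vector per row of the batch.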

### Training the Neural Network

In [0]:
def get_batches(int_text, batch_size, seq_length):
    # Number of full batches; leftover words at the end are dropped
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches*(batch_size * seq_length)])
    
    # Shape: (batch index, inputs/targets, row in batch, position in sequence)
    batches = np.zeros(shape=(n_batches, 2, batch_size, seq_length), dtype=np.int32)

    # Targets are the inputs shifted by one word (the last target wraps around to the first word)
    input_sequences = words.reshape(-1, seq_length)
    target_sequences = np.roll(words, -1)
    target_sequences = target_sequences.reshape(-1, seq_length)
    
    # Row r of batch b+1 continues the text of row r of batch b
    for idx in range(0, input_sequences.shape[0]):
        input_idx = idx % n_batches
        target_idx = idx // n_batches
        batches[input_idx,0,target_idx,:] = input_sequences[idx,:]
        batches[input_idx,1,target_idx,:] = target_sequences[idx,:]        
    return batches
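
The layout produced by `get_batches` is easiest to see on a tiny input. The sketch below repeats the function (with an integer array, which is an assumption made here for readable output) and applies it to the numbers 0 to 23 with a toy `batch_size` of 2 and `seq_length` of 3.

```python
import numpy as np

def get_batches(int_text, batch_size, seq_length):
    # Same logic as the notebook function, repeated here so the example is self-contained
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches * batch_size * seq_length])
    batches = np.zeros((n_batches, 2, batch_size, seq_length), dtype=np.int32)

    input_sequences = words.reshape(-1, seq_length)
    target_sequences = np.roll(words, -1).reshape(-1, seq_length)

    for idx in range(input_sequences.shape[0]):
        batches[idx % n_batches, 0, idx // n_batches, :] = input_sequences[idx, :]
        batches[idx % n_batches, 1, idx // n_batches, :] = target_sequences[idx, :]
    return batches

batches = get_batches(list(range(24)), batch_size=2, seq_length=3)

print(batches.shape)   # (number of batches, inputs/targets, batch_size, seq_length)
print(batches[0, 0])   # inputs of the first batch
print(batches[0, 1])   # targets of the first batch: the inputs shifted by one
print(batches[1, 0])   # inputs of the second batch continue each row of the first
```

Each row of a batch picks up where the same row of the previous batch stopped, which is why the training loop can carry the RNN state from one batch to the next. The very last target wraps around to the first word of the text.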

In [0]:
# Number of Epochs
num_epochs = 50
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 200

save_dir = './save'

In [0]:
train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
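
For intuition about the loss and the clipping step, here is a NumPy sketch of what they compute. With all weights set to one, the sequence loss is the softmax cross-entropy averaged over every position in the batch, and `tf.clip_by_value` limits each gradient entry to the range [-1, 1]. The function name `mean_sequence_loss` and the toy values are assumptions for illustration.

```python
import numpy as np

def mean_sequence_loss(logits, targets):
    """Average softmax cross-entropy over all batch rows and time steps."""
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct next word at every position
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    return -picked.mean()

# Toy example: batch of 2, sequence length 3, vocabulary of 5 words
logits = np.zeros((2, 3, 5))                  # all-equal scores = uniform prediction
targets = np.array([[0, 1, 2], [3, 4, 0]])
uniform_loss = mean_sequence_loss(logits, targets)
print(uniform_loss, np.log(5))                # a uniform guess costs ln(vocabulary size)

# Element-wise clipping, as applied to every gradient before the update
clipped = np.clip(np.array([-3.0, 0.5, 2.0]), -1.0, 1.0)
print(clipped)
```

A model that guesses uniformly has a loss equal to the natural logarithm of the vocabulary size, so a high loss at the start of training is expected.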

In [0]:
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch   0 Batch    0/4686   train_loss = 11.039
Epoch   0 Batch  200/4686   train_loss = 6.892
Epoch   0 Batch  400/4686   train_loss = 6.000
Epoch   0 Batch  600/4686   train_loss = 5.782
Epoch   0 Batch  800/4686   train_loss = 5.047
Epoch   0 Batch 1000/4686   train_loss = 4.870
Epoch   0 Batch 1200/4686   train_loss = 4.674
Epoch   0 Batch 1400/4686   train_loss = 4.906
Epoch   0 Batch 1600/4686   train_loss = 4.652
Epoch   0 Batch 1800/4686   train_loss = 4.736
Epoch   0 Batch 2000/4686   train_loss = 4.699
Epoch   0 Batch 2200/4686   train_loss = 4.515
Epoch   0 Batch 2400/4686   train_loss = 4.317
Epoch   0 Batch 2600/4686   train_loss = 4.358
Epoch   0 Batch 2800/4686   train_loss = 4.982
Epoch   0 Batch 3000/4686   train_loss = 4.443
Epoch   0 Batch 3200/4686   train_loss = 4.373
Epoch   0 Batch 3400/4686   train_loss = 4.505
Epoch   0 Batch 3600/4686   train_loss = 4.286
Epoch   0 Batch 3800/4686   train_loss = 4.423
Epoch   0 Batch 4000/4686   train_loss = 4.398
...

Epoch   7 Batch 2198/4686   train_loss = 2.437
Epoch   7 Batch 2398/4686   train_loss = 2.610
Epoch   7 Batch 2598/4686   train_loss = 2.471
Epoch   7 Batch 2798/4686   train_loss = 2.204
Epoch   7 Batch 2998/4686   train_loss = 2.677
Epoch   7 Batch 3198/4686   train_loss = 2.667
Epoch   7 Batch 3398/4686   train_loss = 2.884
Epoch   7 Batch 3598/4686   train_loss = 2.601
Epoch   7 Batch 3798/4686   train_loss = 2.366
Epoch   7 Batch 3998/4686   train_loss = 2.800
Epoch   7 Batch 4198/4686   train_loss = 2.366
Epoch   7 Batch 4398/4686   train_loss = 2.524
Epoch   7 Batch 4598/4686   train_loss = 2.571
Epoch   8 Batch  112/4686   train_loss = 2.663
Epoch   8 Batch  312/4686   train_loss = 2.549
Epoch   8 Batch  512/4686   train_loss = 2.690
Epoch   8 Batch  712/4686   train_loss = 2.342
Epoch   8 Batch  912/4686   train_loss = 2.290
Epoch   8 Batch 1112/4686   train_loss = 2.395
Epoch   8 Batch 1312/4686   train_loss = 2.592
Epoch   8 Batch 1512/4686   train_loss = 2.565
...

Epoch  14 Batch 4396/4686   train_loss = 2.199
Epoch  14 Batch 4596/4686   train_loss = 2.108
Epoch  15 Batch  110/4686   train_loss = 2.044
Epoch  15 Batch  310/4686   train_loss = 2.094
Epoch  15 Batch  510/4686   train_loss = 2.060
Epoch  15 Batch  710/4686   train_loss = 2.217
Epoch  15 Batch  910/4686   train_loss = 2.135
Epoch  15 Batch 1110/4686   train_loss = 2.166
Epoch  15 Batch 1310/4686   train_loss = 2.141
Epoch  15 Batch 1510/4686   train_loss = 2.303
Epoch  15 Batch 1710/4686   train_loss = 2.112
Epoch  15 Batch 1910/4686   train_loss = 2.196
Epoch  15 Batch 2110/4686   train_loss = 2.263
Epoch  15 Batch 2310/4686   train_loss = 1.863
Epoch  15 Batch 2510/4686   train_loss = 2.081
Epoch  15 Batch 2710/4686   train_loss = 1.972
Epoch  15 Batch 2910/4686   train_loss = 1.937
Epoch  15 Batch 3110/4686   train_loss = 2.243
Epoch  15 Batch 3310/4686   train_loss = 2.013
Epoch  15 Batch 3510/4686   train_loss = 1.986
Epoch  15 Batch 3710/4686   train_loss = 1.891
...

Epoch  22 Batch 1908/4686   train_loss = 2.077
Epoch  22 Batch 2108/4686   train_loss = 2.041
Epoch  22 Batch 2308/4686   train_loss = 2.205
Epoch  22 Batch 2508/4686   train_loss = 1.812
Epoch  22 Batch 2708/4686   train_loss = 2.365
Epoch  22 Batch 2908/4686   train_loss = 1.887
Epoch  22 Batch 3108/4686   train_loss = 2.128
Epoch  22 Batch 3308/4686   train_loss = 2.156
Epoch  22 Batch 3508/4686   train_loss = 2.156
Epoch  22 Batch 3708/4686   train_loss = 2.089
Epoch  22 Batch 3908/4686   train_loss = 2.139
Epoch  22 Batch 4108/4686   train_loss = 2.179
Epoch  22 Batch 4308/4686   train_loss = 2.265
Epoch  22 Batch 4508/4686   train_loss = 1.880
Epoch  23 Batch   22/4686   train_loss = 2.191
Epoch  23 Batch  222/4686   train_loss = 2.148
Epoch  23 Batch  422/4686   train_loss = 2.007
Epoch  23 Batch  622/4686   train_loss = 2.034
Epoch  23 Batch  822/4686   train_loss = 2.188
Epoch  23 Batch 1022/4686   train_loss = 2.073
Epoch  23 Batch 1222/4686   train_loss = 2.036
...

Epoch  29 Batch 4106/4686   train_loss = 2.083
Epoch  29 Batch 4306/4686   train_loss = 1.937
Epoch  29 Batch 4506/4686   train_loss = 2.148
Epoch  30 Batch   20/4686   train_loss = 1.816
Epoch  30 Batch  220/4686   train_loss = 1.967
Epoch  30 Batch  420/4686   train_loss = 1.988
Epoch  30 Batch  620/4686   train_loss = 2.027
Epoch  30 Batch  820/4686   train_loss = 2.017
Epoch  30 Batch 1020/4686   train_loss = 1.906
Epoch  30 Batch 1220/4686   train_loss = 1.750
Epoch  30 Batch 1420/4686   train_loss = 1.890
Epoch  30 Batch 1620/4686   train_loss = 1.962
Epoch  30 Batch 1820/4686   train_loss = 1.966
Epoch  30 Batch 2020/4686   train_loss = 1.941
Epoch  30 Batch 2220/4686   train_loss = 1.876
Epoch  30 Batch 2420/4686   train_loss = 1.946
Epoch  30 Batch 2620/4686   train_loss = 1.862
Epoch  30 Batch 2820/4686   train_loss = 2.046
Epoch  30 Batch 3020/4686   train_loss = 1.737
Epoch  30 Batch 3220/4686   train_loss = 1.847
Epoch  30 Batch 3420/4686   train_loss = 2.126
...

Epoch  37 Batch 1618/4686   train_loss = 1.850
Epoch  37 Batch 1818/4686   train_loss = 1.947
Epoch  37 Batch 2018/4686   train_loss = 1.811
Epoch  37 Batch 2218/4686   train_loss = 1.859
Epoch  37 Batch 2418/4686   train_loss = 1.826
Epoch  37 Batch 2618/4686   train_loss = 2.134
Epoch  37 Batch 2818/4686   train_loss = 2.044
Epoch  37 Batch 3018/4686   train_loss = 1.722
Epoch  37 Batch 3218/4686   train_loss = 1.736
Epoch  37 Batch 3418/4686   train_loss = 1.937
Epoch  37 Batch 3618/4686   train_loss = 1.751
Epoch  37 Batch 3818/4686   train_loss = 1.811
Epoch  37 Batch 4018/4686   train_loss = 1.961
Epoch  37 Batch 4218/4686   train_loss = 1.934
Epoch  37 Batch 4418/4686   train_loss = 1.988
Epoch  37 Batch 4618/4686   train_loss = 1.920
Epoch  38 Batch  132/4686   train_loss = 1.928
Epoch  38 Batch  332/4686   train_loss = 1.814
Epoch  38 Batch  532/4686   train_loss = 1.988
Epoch  38 Batch  732/4686   train_loss = 1.671
Epoch  38 Batch  932/4686   train_loss = 1.768
...

Epoch  44 Batch 3816/4686   train_loss = 1.947
Epoch  44 Batch 4016/4686   train_loss = 1.765
Epoch  44 Batch 4216/4686   train_loss = 1.855
Epoch  44 Batch 4416/4686   train_loss = 1.974
Epoch  44 Batch 4616/4686   train_loss = 1.850
Epoch  45 Batch  130/4686   train_loss = 1.779
Epoch  45 Batch  330/4686   train_loss = 2.046
Epoch  45 Batch  530/4686   train_loss = 2.007
Epoch  45 Batch  730/4686   train_loss = 1.737
Epoch  45 Batch  930/4686   train_loss = 1.728
Epoch  45 Batch 1130/4686   train_loss = 1.706
Epoch  45 Batch 1330/4686   train_loss = 1.951
Epoch  45 Batch 1530/4686   train_loss = 1.705
Epoch  45 Batch 1730/4686   train_loss = 1.839
Epoch  45 Batch 1930/4686   train_loss = 2.195
Epoch  45 Batch 2130/4686   train_loss = 2.027
Epoch  45 Batch 2330/4686   train_loss = 2.050
Epoch  45 Batch 2530/4686   train_loss = 1.837
Epoch  45 Batch 2730/4686   train_loss = 1.752
Epoch  45 Batch 2930/4686   train_loss = 1.898
Epoch  45 Batch 3130/4686   train_loss = 1.604
...
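
The batch numbers in the log shift from epoch to epoch (0, 200, 400 in epoch 0, but 2198, 2398 in epoch 7). That follows from the logging condition, which counts training steps across all epochs rather than within one epoch. The short sketch below reproduces the arithmetic using the values from the log and the hyperparameters; `logged_batches` is a hypothetical helper.

```python
n_batches = 4686             # batches per epoch, as shown in the log
show_every_n_batches = 200
batch_size, seq_length = 32, 16

def logged_batches(epoch_i):
    """Batch indices within an epoch at which the training loop prints the loss."""
    return [batch_i for batch_i in range(n_batches)
            if (epoch_i * n_batches + batch_i) % show_every_n_batches == 0]

print(logged_batches(0)[:3])   # epoch 0 starts at global step 0
print(logged_batches(7)[:3])   # epoch 7 starts at global step 7 * 4686 = 32802
print(logged_batches(8)[:3])

# Words consumed per epoch: every batch holds batch_size * seq_length words
words_per_epoch = n_batches * batch_size * seq_length
print(words_per_epoch)
```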

### Generate TV Script
When training is finished, we reach the last step of this project: generating a new TV script for "The Simpsons"!

In [0]:
def get_tensors(loaded_graph):
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor

In [0]:
def pick_word(probabilities, int_to_vocab):
    word_id = np.argmax(probabilities)
    word_string = int_to_vocab[word_id]
    return word_string
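
`pick_word` always returns the most probable word (greedy decoding), so the same starting word produces the same script every time. A common alternative, which this notebook does not use, is to sample the next word from the predicted distribution. The sketch below shows one way to do that; `pick_word_sampled`, its `temperature` parameter, and the `rng` argument are assumptions for illustration.

```python
import numpy as np

def pick_word_sampled(probabilities, int_to_vocab, temperature=1.0, rng=None):
    """Sample a word id from the predicted distribution instead of taking the argmax."""
    if rng is None:
        rng = np.random.default_rng()
    # Rescale in log space: temperature < 1 sharpens the distribution, > 1 flattens it
    log_p = np.log(np.asarray(probabilities, dtype=np.float64) + 1e-12) / temperature
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    word_id = rng.choice(len(p), p=p)
    return int_to_vocab[int(word_id)]

toy_vocab = {0: 'homer_simpson:', 1: 'marge_simpson:', 2: 'bart_simpson:'}
toy_probs = [0.1, 0.8, 0.1]

# With a very low temperature the result is practically the same as argmax
print(pick_word_sampled(toy_probs, toy_vocab, temperature=0.01, rng=np.random.default_rng(0)))
# With temperature 1.0 the less likely words are chosen some of the time
print(pick_word_sampled(toy_probs, toy_vocab, temperature=1.0, rng=np.random.default_rng(0)))
```

Sampling usually gives more varied text at the cost of more mistakes, so it is worth comparing both variants on your own model.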

In [0]:
gen_length = 500

prime_word = 'homer_simpson'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        
        pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)

        gen_sentences.append(pred_word)
    
    # Replace the punctuation tokens with the original symbols
    tv_script = ' '.join(gen_sentences)
    for key, token in tokenized_punctuation.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    # Drop the space left behind after line breaks and opening parentheses
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
        
    print(tv_script)

INFO:tensorflow:Restoring parameters from ./save
homer_simpson:(moans)
marge_simpson:(annoyed murmur)
homer_simpson:(annoyed grunt)
(moe's_tavern: ext. moe's - night)
homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?
bart_simpson:(horrified gasp)
(simpson_home: ext. simpson house - day - establishing)
homer_simpson:(worried) i've got a wet!
homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!
(department int. sports arena - night)
seymour_skinner:(chuckles)
chief_wiggum:(laughing) oh, i get it.
seymour_skinner:(snapping) i guess this building is quiet.
homer_simpson:(stunned) what? how'd you like that?
professor_jonathan_frink: uh, well, looks like the little bit of you.
bart_simpson:(to larry) i guess this is clearly justin, right?
homer_simpson:(dismissive snort) oh, i am.
marge_simpson:(pained) hi.
homer_simpson:(pained sound) i thought you might have some good choice.
homer_simpson:(pained) oh, sorry.
(simpson_home: in
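
The last part of the generation cell turns the punctuation tokens back into symbols. Here is a minimal sketch of that step on a hand-written list of generated words; `toy_tokens` and `generated` are made up for illustration and only cover four of the tokens.

```python
toy_tokens = {
    '.': '||Period||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '\n': '||Return||',
}

# Hypothetical model output: lowercase words and lowercase tokens
generated = ['homer_simpson:', '||left_parentheses||', 'moans', '||right_parentheses||',
             '||return||', 'marge_simpson:', 'hi', '||period||']

tv_script = ' '.join(generated)
for key, token in toy_tokens.items():
    # The space before each token is removed together with the token
    tv_script = tv_script.replace(' ' + token.lower(), key)
# The space after a line break or an opening parenthesis is removed separately
tv_script = tv_script.replace('\n ', '\n')
tv_script = tv_script.replace('( ', '(')

print(tv_script)
```

This also explains why the generated script has no space between the speaker name and an opening parenthesis.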

### Conclusion
We have trained a model to generate new text!

As you can see, the text does not really make sense, but that's OK. This project was meant to show you how to prepare the data for training the model and to give a basic idea of how NLG works.

If you want, you can tune the hyperparameters, add more layers, or change their size, and see how the output of the model changes.