# TV Script Generation
In this project, we'll generate our own Seinfeld TV scripts using RNNs.
<p>We'll be using part of the Seinfeld dataset of scripts from 9 seasons. The neural network we build will generate a new, "fake" TV script based on patterns it recognizes in this training data.</p>

In [0]:
from zipfile import ZipFile

with ZipFile('Seinfeld_Scripts.zip', 'r') as zip_ref:
  zip_ref.extractall('/content')

## Get the Data
* As a first step, we'll load in this data and look at some samples.

In [0]:
import os
def load_data(path):
  """
  Load Dataset from File
  """
  input_file = os.path.join(path)
  with open(input_file, "r") as f:
    data = f.read()

  return data

In [0]:
data_dir = '/content/Seinfeld_Scripts.txt'
text = load_data(data_dir)

## Explore the Data

In [0]:
view_line_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y

### Implement Pre-processing Functions

The first thing to do to any dataset is pre-processing.
### Lookup Table
To create a word embedding, we first need to transform the words to ids. In this function, we create two dictionaries:
- vocab_to_int - Dictionary to go from the words to an id
- int_to_vocab - Dictionary to go from the id to word

In [0]:
from collections import Counter
def create_lookup_tables(text):
  '''
  Create lookup tables for vocabulary
  :param text: The text of tv scripts split into words
  :return: A tuple of dicts (vocab_to_int, int_to_vocab)
  '''
  counts = Counter(text)
  sorted_vocab = sorted(counts, key = counts.get, reverse = True)
  
  # create int_to_vocab and vocab_to_int dictionaries
  # (the most frequent words receive the smallest ids)
  int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
  vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

  return (vocab_to_int, int_to_vocab)
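
As a quick illustration of the two dictionaries, here is a minimal, self-contained sketch on a toy word list. The helper name `build_lookup` and the toy sentence are made up for this example; the logic follows the same frequency-sorted idea as `create_lookup_tables`.

```python
from collections import Counter

def build_lookup(words):
    # Count each word, then sort so the most frequent word gets id 0.
    counts = Counter(words)
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# Hypothetical toy input, already split into words
toy_words = "jerry : hello george : hello jerry hello".split()
vocab_to_int_toy, int_to_vocab_toy = build_lookup(toy_words)

# Encode the words as ids, then decode them back
encoded = [vocab_to_int_toy[w] for w in toy_words]
decoded = [int_to_vocab_toy[i] for i in encoded]

print(vocab_to_int_toy)
print(encoded)
print(decoded == toy_words)
```

Encoding followed by decoding should give back the original word list, which is a handy sanity check for any lookup table.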

### Tokenize function
We'll split the script into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

The token_lookup function returns a dict that is used to tokenize symbols like "!" into "||exclamationmark||". The dictionary covers the following symbols, where the symbol is the key and the token is the value:
- Period ( . )
- Comma ( , )
- Quotation Mark ( " )
- Semicolon ( ; )
- Exclamation mark ( ! )
- Question mark ( ? )
- Left Parentheses ( ( )
- Right Parentheses ( ) )
- Dash ( - )
- Return ( \n )

This dictionary is used to tokenize the symbols and add a delimiter (space) around each one. This separates each symbol into its own word, making it easier for the neural network to predict the next word.

In [0]:
def token_lookup():
  '''
  Generate a dict to turn punctuation into a token
  :return: Tokenized dictionary where punctuation is the key and the value is the token
  '''
  
  tokenized_dictionary = dict()
  tokenized_dictionary["."] = "||period||"
  tokenized_dictionary[","] = "||comma||"
  tokenized_dictionary["\""] = "||quotationmark||"
  tokenized_dictionary[";"] = "||semicolon||"
  tokenized_dictionary["!"] = "||exclamationmark||"
  tokenized_dictionary["?"] = "||questionmark||"
  tokenized_dictionary["("] = "||lparentheses||"
  tokenized_dictionary[")"] = "||rparentheses||"
  tokenized_dictionary["-"] = "||dash||"
  tokenized_dictionary["\n"] = "||return||"
  
  
  return tokenized_dictionary
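
To see why this helps, the sketch below applies a small subset of such a dictionary to two strings. The names `toy_tokens` and `tokenize` are illustrative only; the replace-then-split steps mirror what the pre-processing code does with the full dictionary.

```python
# A small, hypothetical subset of the punctuation dictionary
toy_tokens = {
    "!": "||exclamationmark||",
    ",": "||comma||",
    "?": "||questionmark||",
}

def tokenize(text, token_dict):
    # Surround every symbol's token with spaces so that split()
    # treats it as a separate word.
    for symbol, token in token_dict.items():
        text = text.replace(symbol, " {} ".format(token))
    return text.lower().split()

words_a = tokenize("Bye!", toy_tokens)
words_b = tokenize("bye, Jerry?", toy_tokens)

print(words_a)
print(words_b)
```

Both strings now share the plain word `bye`, so it maps to a single id instead of one id per punctuation variant.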

## Preprocess all the data and save it
Running the code cell below will pre-process all the data and save it to file

In [0]:
import pickle
SPECIAL_WORDS = {'PADDING': '<PAD>'}
def preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables):
  """
  Preprocess Text Data
  """
  text = load_data(data_dir)
  # Ignore notice, since we don't use it for analysing the data
  text = text[81:]

  token_dict = token_lookup()
  for key, token in token_dict.items():
    text = text.replace(key, ' {} '.format(token))

  text = text.lower()
  text = text.split()

  vocab_to_int, int_to_vocab = create_lookup_tables(text + list(SPECIAL_WORDS.values()))
  int_text = [vocab_to_int[word] for word in text]
  # use a context manager so the file is closed after writing
  with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

In [0]:
preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

## Check Point
This is the first checkpoint. The preprocessed data has been saved to disk.


In [0]:
def load_preprocess():
  """
  Load the preprocessed training data (int_text, vocab_to_int, int_to_vocab, token_dict) from disk
  """
  with open('preprocess.p', mode='rb') as f:
    return pickle.load(f)

In [0]:
int_text, vocab_to_int, int_to_vocab, token_dict = load_preprocess()

In [0]:
print (int_text[:10])
print('size of vocab_to_int: %d' %len(vocab_to_int))
print('size of int_to_vocab: %d'%len(int_to_vocab))

[24, 22, 47, 1, 1, 1, 17, 47, 22, 82]
size of vocab_to_int: 21388
size of int_to_vocab: 21388


## Build the Neural Network
In this section, we'll build the components necessary to build an RNN by implementing the RNN Module and forward and backpropagation functions.

## Check Access to GPU

In [0]:

import torch

# check for GPU
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
  print('CUDA is Available! Training on GPU')
else:
  print('No GPU found. Please use a GPU to train your neural network.')

CUDA is Available! Training on GPU


In [0]:
from torch.utils.data import TensorDataset, DataLoader

def batch_data(words, sequence_length, batch_size):
  '''
  Batch the Neural Network Data Using Dataloader
  :param words: the word ids of the tv scripts
  :param sequence_length: the number of words in each input sequence
  :param batch_size: the size of each batch, number of sequences in each batch
  :return: DataLoader with batched data
  '''
  
  num_batches = len(words) // batch_size
  # get only the batches that will make full batches
  words = words[:num_batches * batch_size]
  
  features, targets = [], []
  
  for idx in range(0, (len(words) - sequence_length)):
    features.append(words[idx:idx+sequence_length])
    targets.append(words[idx + sequence_length])
    
  tensor_data = TensorDataset(torch.from_numpy(np.asarray(features)), torch.from_numpy(np.asarray(targets)))
  data_loader = DataLoader(tensor_data, shuffle = True, batch_size = batch_size)
  
  return data_loader
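
The feature/target construction inside `batch_data` is a sliding window: each run of `sequence_length` word ids is a feature, and the word id that follows it is the target. The plain-Python sketch below (no PyTorch, no shuffling; `sliding_windows` is an illustrative name) shows the pairs in order.

```python
def sliding_windows(words, sequence_length):
    # Each window of `sequence_length` ids is paired with the next id.
    features, targets = [], []
    for idx in range(len(words) - sequence_length):
        features.append(words[idx:idx + sequence_length])
        targets.append(words[idx + sequence_length])
    return features, targets

toy_ids = list(range(8))  # stand-in for a list of word ids
features, targets = sliding_windows(toy_ids, sequence_length=3)

for x, y in zip(features, targets):
    print(x, '->', y)
```

With 8 ids and a window of 3 there are 8 - 3 = 5 pairs, since the last window must still have a following word to predict.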
    

## Testing Data Loader

In [0]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[14, 15, 16, 17, 18],
        [32, 33, 34, 35, 36],
        [22, 23, 24, 25, 26],
        [ 8,  9, 10, 11, 12],
        [20, 21, 22, 23, 24],
        [18, 19, 20, 21, 22],
        [19, 20, 21, 22, 23],
        [39, 40, 41, 42, 43],
        [21, 22, 23, 24, 25],
        [40, 41, 42, 43, 44]])

torch.Size([10])
tensor([19, 37, 27, 13, 25, 23, 24, 44, 26, 45])


## Define the Network

In [0]:
import torch.nn as nn

class RNN(nn.Module):
  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout = 0.5):
    '''
    Initializes the PyTorch RNN Module
    :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
    :param output_size: The number of output dimensions of the neural network
    :param embedding_dim: The size of the embeddings
    :param hidden_dim: The size of the hidden layer outputs
    :param n_layers: The number of stacked LSTM layers
    :param dropout: dropout to add in between LSTM layers
    '''
    super(RNN, self).__init__()
    
    self.output_size = output_size
    self.hidden_dim = hidden_dim
    self.n_layers = n_layers
    
    # embedding and LSTM layers
    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                        dropout=dropout, batch_first=True)
    
    #dropout layer
    self.dropout = nn.Dropout(0.3)
        
    # linear layer
    self.fc = nn.Linear(hidden_dim, output_size)
    
    
  def forward(self, nn_input, hidden):
    '''
    forward propagation of the neural network
    :param nn_input: the input of the neural network
    :param hidden: The hidden state
    :return: Two tensors, output of the neural network and latest hidden state
    '''
    batch_size = nn_input.size(0)
      
    # embeddings and lstm_out
    embeds = self.embedding(nn_input)
    lstm_output, hidden = self.lstm(embeds, hidden)
      
    # stack up lstm outputs
    lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)
      
    # dropout and fully-connected layer
    output = self.dropout(lstm_output)
    output = self.fc(output)
      
      
    # reshape to be batch_size first
    output = output.view(batch_size, -1, self.output_size)
    output = output[:, -1] # keep only the word scores for the last time step
    # return one batch of output word scores and the hidden state
    return output, hidden
  
  def init_hidden(self, batch_size):
    ''' Initializes hidden state '''
    # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
    # initialized to zero, for hidden state and cell state of LSTM
    weight = next(self.parameters()).data
      
    if (train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
    return hidden
    

## Define forward and backward propagation

In [0]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
  '''
  forward and backpropagation of the neural network
  :param rnn: pytorch module that holds the rnn network
  :param optimizer: optimizer function
  :param criterion: loss function
  :param inp: A batch of input to the neural network
  :param target: The target output for the batch of input
  :param hidden: The hidden state from the previous batch
  :return: The loss and the latest hidden state
  '''
  
  # move model to GPU, if available
  if train_on_gpu:
    rnn.cuda()
  
  # create new variables for the hidden state, otherwise 
  # we would backpropagate through the entire training history
  h = tuple([each.data for each in hidden])
  
  # zero the accumulated gradients
  rnn.zero_grad()
  
  if train_on_gpu:
    inp, target = inp.cuda(), target.cuda()
  
  # get the output from the model
  output, h = rnn(inp, h)
  
  # calculate the loss
  loss = criterion(output, target)
  # backward pass
  loss.backward()
  # 'clip_grad_norm_' helps prevent the exploding gradient problem in RNNs/LSTMs
  # (the version without the trailing underscore is deprecated)
  nn.utils.clip_grad_norm_(rnn.parameters(), 5)
  
  # perform single optimization step (parameter update)
  optimizer.step()
  
  return loss.item(), h
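
The clipping step rescales all gradients together whenever their combined L2 norm exceeds a threshold (5 in the function above). The following is a simplified, pure-Python sketch of that idea on a flat list of numbers; `clip_by_global_norm` is an illustrative name, and the real PyTorch utility works in place on parameter tensors rather than on lists.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # L2 norm over all gradient values taken together
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds max_norm; the small
    # epsilon guards against division by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0, 12.0]  # toy gradients with L2 norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for g in clipped))

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=5)

print(norm_before, norm_after)
print(small)
```

Because every value is multiplied by the same factor, the direction of the update is preserved and only its size is limited. Gradients that are already below the threshold pass through unchanged.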

## Neural Network Training

### Train loop
The training loop is implemented in the train_rnn function. This function trains the network over all the batches for the given number of epochs. Progress is printed once every show_every_n_batches batches.

In [0]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

## Hyperparameters
Set and train the neural network with the following parameters:

- Set sequence_length to the length of a sequence.
- Set batch_size to the batch size.
- Set num_epochs to the number of epochs to train for.
- Set learning_rate to the learning rate for an Adam optimizer.
- Set vocab_size to the number of unique tokens in our vocabulary.
- Set output_size to the desired size of the output.
- Set embedding_dim to the embedding dimension; smaller than the vocab_size.
- Set hidden_dim to the hidden dimension of your RNN.
- Set n_layers to the number of layers/cells in your RNN.
- Set show_every_n_batches to the number of batches at which the neural network should print progress.

If the network isn't getting the desired results, tweak these parameters and/or the layers in the RNN class.


In [0]:
# Data params
# Sequence Length
sequence_length = 20   # of words in a sequence
# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [0]:
# optimization hyperparameters
# number of epochs to train the model
num_epochs = 10
# Learning Rate
learning_rate = 0.001

# Model hyperparameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 200
# Hidden Dimension
hidden_dim = 512
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500

In [0]:
# helper function to save the model
def save_model(filename, decoder):
  save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
  torch.save(decoder, save_filename)

In [0]:
# helper function to load the model
def load_model(filename):
  save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
  return torch.load(save_filename)

In [0]:
# create a model
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
# move model to GPU if CUDA is available
if train_on_gpu:
  rnn.cuda()

print(rnn)

RNN(
  (embedding): Embedding(21388, 200)
  (lstm): LSTM(200, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=512, out_features=21388, bias=True)
)
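
The printed layer sizes are enough to estimate how many trainable parameters this model has. The sketch below does the arithmetic by hand, assuming PyTorch's standard LSTM layout (four gates per layer, each with input weights, recurrent weights, and two bias vectors); `lstm_layer_params` is an illustrative helper.

```python
vocab_size, embedding_dim, hidden_dim, n_layers = 21388, 200, 512, 2

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates x (input weights + recurrent weights + 2 bias vectors)
    return 4 * hidden_dim * (input_dim + hidden_dim + 2)

# The first layer reads embeddings; later layers read the previous layer's output
lstm_params = lstm_layer_params(embedding_dim, hidden_dim)
lstm_params += sum(lstm_layer_params(hidden_dim, hidden_dim)
                   for _ in range(n_layers - 1))

# Fully-connected layer: weights plus one bias per output word
fc_params = hidden_dim * vocab_size + vocab_size

total_params = embedding_params + lstm_params + fc_params
print('embedding:', embedding_params)
print('lstm:     ', lstm_params)
print('fc:       ', fc_params)
print('total:    ', total_params)
```

Under these assumptions, the output layer accounts for more than half of the parameters, because it maps every hidden state to a score for each of the 21388 vocabulary entries.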


In [0]:


# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 10 epoch(s)...




Epoch:    1/10    Loss: 5.3398141145706175

Epoch:    1/10    Loss: 4.804179976463318

Epoch:    1/10    Loss: 4.628442614555359

Epoch:    1/10    Loss: 4.494254862785339

Epoch:    1/10    Loss: 4.4316678123474125

Epoch:    1/10    Loss: 4.37546141242981

Epoch:    1/10    Loss: 4.326348867893219

Epoch:    1/10    Loss: 4.298630898952484

Epoch:    1/10    Loss: 4.243415849208832

Epoch:    1/10    Loss: 4.244653454780579

Epoch:    1/10    Loss: 4.2123463587760925

Epoch:    1/10    Loss: 4.182699248790741

Epoch:    1/10    Loss: 4.164765825748444

Epoch:    2/10    Loss: 4.073565861410346

Epoch:    2/10    Loss: 3.997549789428711

Epoch:    2/10    Loss: 4.018961612224579

Epoch:    2/10    Loss: 3.9901534953117372

Epoch:    2/10    Loss: 3.9896482038497925

Epoch:    2/10    Loss: 4.01260741186142

Epoch:    2/10    Loss: 3.978522463321686

Epoch:    2/10    Loss: 4.001900125980377

Epoch:    2/10    Loss: 3.983846251964569

Epoch:    2/10    Loss: 3.9907336673736573
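
One way to read these numbers: `nn.CrossEntropyLoss` reports the average negative log-likelihood in nats, so a model that guessed uniformly over the vocabulary would score `ln(vocab_size)`, and `exp(loss)` can be read as a perplexity. The sketch below works through that arithmetic for the last complete loss value printed above; it is an interpretation aid, not part of the training code.

```python
import math

vocab_size = 21388                   # size of vocab_to_int reported earlier
reported_loss = 3.9907336673736573   # last complete value in the log above

# Loss of a model that assigns equal probability to every word
uniform_loss = math.log(vocab_size)

# exp(loss) is roughly the number of equally likely choices the
# model is effectively hesitating between at each step
perplexity = math.exp(reported_loss)

print('uniform-guess loss: {:.2f}'.format(uniform_loss))
print('perplexity at reported loss: {:.1f}'.format(perplexity))
```

These figures are averages over training batches, so they describe fit to the training data rather than performance on held-out text.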



## Checkpoint
We can resume our progress by running the cell below, which loads our word:id dictionaries and our saved model by name.

In [0]:
_, vocab_to_int, int_to_vocab, token_dict = load_preprocess()
trained_rnn = load_model('./save/trained_rnn')

## Generate TV Script
With the network trained and saved, we'll use it to generate a new, "fake" Seinfeld TV script in this section.

### Generate Text
To generate the text, the network needs to start with a single word and repeat its predictions until it reaches a set length. We'll use the generate function to do this. It takes a word id to start with, prime_id, and generates a set length of text, predict_len. Note that it uses top-k sampling to introduce some randomness when choosing the next word from an output set of word scores.
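
Top-k sampling can be sketched without PyTorch: turn the scores into probabilities with a softmax, keep only the k most likely indices, and sample among them in proportion to their probabilities. The helper names and toy scores below are illustrative; the generate function does the equivalent with `topk` and `np.random.choice`.

```python
import math
import random

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(scores, k, rng=random):
    probs = softmax(scores)
    # Indices of the k most probable words
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices normalizes the weights, so this samples in
    # proportion to each candidate's probability
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

toy_scores = [2.0, 0.5, 1.5, -1.0, 0.0, 3.0]  # hypothetical word scores
rng = random.Random(0)                         # seeded for repeatability
samples = [sample_top_k(toy_scores, k=3, rng=rng) for _ in range(1000)]

print(sorted(set(samples)))
```

Only the three highest-scoring indices can ever be drawn, which keeps unlikely words out of the script while still allowing variation between runs.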

In [0]:
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
  """
  Generate text using the neural network
  :param rnn: The PyTorch Module that holds the trained neural network
  :param prime_id: The word id to start the first prediction
  :param int_to_vocab: Dict of word id keys to word values
  :param token_dict: Dict of punctuation keys to punctuation token values
  :param pad_value: The value used to pad a sequence
  :param predict_len: The length of text to generate
  :return: The generated text
  """
  rnn.eval()
    
  # create a sequence (batch_size=1) with the prime_id
  current_seq = np.full((1, sequence_length), pad_value)
  current_seq[-1][-1] = prime_id
  predicted = [int_to_vocab[prime_id]]
    
  for _ in range(predict_len):
    if train_on_gpu:
      current_seq = torch.LongTensor(current_seq).cuda()
    else:
      current_seq = torch.LongTensor(current_seq)
      
    # initialize the hidden state
    hidden = rnn.init_hidden(current_seq.size(0))
        
    # get the output of the rnn
    output, _ = rnn(current_seq, hidden)
        
    # get the next word probabilities
    p = F.softmax(output, dim=1).data
    if(train_on_gpu):
      p = p.cpu() # move to cpu
      
    # use top_k sampling to get the index of the next word
    top_k = 5
    p, top_i = p.topk(top_k)
    top_i = top_i.numpy().squeeze()
        
    # select the likely next word index with some element of randomness
    p = p.numpy().squeeze()
    word_i = np.random.choice(top_i, p=p/p.sum())
        
    # retrieve that word from the dictionary
    word = int_to_vocab[word_i]
    predicted.append(word)     
    
    if train_on_gpu:
      current_seq = current_seq.cpu()
    
    # the generated word becomes the next "current sequence" and the cycle can continue
    current_seq = np.roll(current_seq, -1, 1)
    current_seq[-1][-1] = word_i
    
  gen_sentences = ' '.join(predicted)
    
  # Replace punctuation tokens
  for key, token in token_dict.items():
    gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
  gen_sentences = gen_sentences.replace('\n ', '\n')
  gen_sentences = gen_sentences.replace('( ', '(')
    
  # return all the sentences
  return gen_sentences

## Generate a new script
It's time to generate the text. Set gen_length to the length of TV script we want to generate and set prime_word to one of the following to start the prediction:

- "jerry"
- "elaine"
- "george"
- "kramer"

We can set the prime word to any word in our dictionary, but it's best to start with a name when generating a TV script. (We can also start with any other name found in the original text file.)

In [0]:
gen_length = 400 # modify the length to your preference
prime_word = 'kramer' # name for starting the script


pad_word = SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

kramer: the other side of his own and he said he was a good driver to make a good thing, and he's a very unusual freak...(george looks at the tag)

jerry:(pointing) well... i think i just want to know.

kramer:(to jerry) you know, i was wondering what the hell are we doing here?

kramer: oh, yeah! i got the car.. you know what i'm gonna do for the first person, and i think i am. but i think it's gonna work, you know, you can do it?

jerry: i can't.

george: i don't know.

george: i know.

kramer: well, you can't go.

elaine: well, i guess i can see if i had a good time.

george: yeah, yeah. i got a good one.(george laughs)

george:(to jerry) i don't know what to do.

george:(to elaine) hey.

jerry: what is that?

elaine: what, are you kidding?

kramer:(to jerry) hey.

jerry: hey, hey, hey.

elaine: hey, you know how i got that?

george: what?

jerry: i don't know, i don't know if he had a little fight with him.

elaine: oh, yeah, yeah. yeah.

george:(to kramer) what are you doing?

kra

**Save your favorite scripts**
Once we have a script that we like (or find interesting), save it to a text file!

In [0]:
# save script to a text file
# the context manager closes the file even if the write fails
with open("generated_script_2.txt", "w") as f:
  f.write(generated_script)