# Script Generator Model Training

In this notebook, I train a deep learning model to generate TV scripts. The training data is the scripts from all 10 seasons of F.R.I.E.N.D.S, one of my favorite shows. 

In [1]:
#Import dependencies
import os
import util
import time
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import problem_unittests as tests
from torch.utils.data import TensorDataset, DataLoader

#Check for GPU
#Set the visible device before the first CUDA call so that it takes effect
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
train_on_gpu = torch.cuda.is_available()

## Load and Explore Data
In this step, I load the data from a .txt file. 
I then explore it a little to get a sense of what I'm working with. 

You can skip to the checkpoint without exploring the text files if you wish to train the model directly.

In [2]:
data_dir = '../Data/friends.txt'
text = util.load_data(data_dir)

In [3]:
view_line_range = (303, 320) #prints text between the given lines

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len(set(text.split()))))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(lines[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 54785
Number of lines: 104959
Average number of words in each line: 8.59682352157
()
The lines 303 to 320:
Paul: No, it's, it's more of a fifth date kinda
revelation.
Monica: Oh, so there is gonna be a fifth date?
Paul: Isn't there?
Monica: Yeah... yeah, I think there is. -What were
you gonna say?
Paul: Well, ever-ev-... ever since she left me, um,
I haven't been able to, uh, perform. (Monica takes a sip of her drink.) ...Sexually.
Monica: (spitting out her drink in shock) Oh God, oh
God, I am sorry... I am so sorry...
Paul: It's okay...
Monica: I know being spit on is probably not what
you need right now. Um... how long?
Paul: Two years.
Monica: Wow! I'm-I'm-I'm glad you smashed her watch!
Paul: So you still think you, um... might want that
fifth date?


## Pre-process Data
In this step, I pre-process the input data so that the model can train on it. 
1. I create a lookup table that maps words to indexes.
2. I create a token dictionary to separate punctuation from regular words.

I save these dictionaries so that the data doesn't have to be pre-processed each time the notebook is run. 
You can skip directly to the Checkpoint without running the next three blocks.

In [None]:
def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_idx, idx_to_vocab)
    """
    vocab = set(text)
    
    # Use dict comprehensions to build the two dictionaries.
    vocab_to_idx = {word:idx for idx, word in enumerate(vocab)}
    idx_to_vocab = {idx:word for idx, word in enumerate(vocab)}
    return (vocab_to_idx, idx_to_vocab)
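
To make the mapping concrete, here is a minimal sketch on a toy word list. It sorts the vocabulary so that the indices are reproducible, which is an assumption of this example and not what the cell above does (iterating over a plain `set` of strings can give a different order between runs in Python 3). The helper name `build_lookup_tables` is hypothetical.

```python
def build_lookup_tables(words):
    """Toy variant of create_lookup_tables; sorting is an assumption made for reproducibility."""
    vocab = sorted(set(words))
    vocab_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_vocab = {idx: word for word, idx in vocab_to_idx.items()}
    return vocab_to_idx, idx_to_vocab

words = "how you doin how you doin".split()
vocab_to_idx, idx_to_vocab = build_lookup_tables(words)

# Encode the words as integers, then decode them back
int_words = [vocab_to_idx[w] for w in words]
decoded = [idx_to_vocab[i] for i in int_words]

print(vocab_to_idx)  # {'doin': 0, 'how': 1, 'you': 2}
print(int_words)     # [1, 2, 0, 1, 2, 0]
```

The round trip through both dictionaries returns the original words, which is the property the model relies on when its integer predictions are turned back into text.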

In [None]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    return {'.':'||Period||',
            ',':'||Comma||',
            '"':'||Quotation_Mark||',
            ';':'||Semicolon||', 
            '!':'||Exclamation_Mark||',
            '?':'||Question_Mark||', 
            '(':'||Left_Parentheses||', 
            ')':'||Right_Parentheses||',
            '-':'||Dash||',
            '\n':'||Return||'}
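
The dictionary above only defines the mapping; the replacement itself presumably happens inside `util.preprocess_and_save_data`, which is not shown here. Below is a minimal sketch of how such a mapping could be applied, assuming each symbol is padded with spaces so that it splits off as its own word. The helper name `tokenize_punctuation` and the reduced dictionary are hypothetical.

```python
def tokenize_punctuation(text, token_dict):
    """Replace each punctuation symbol with its token, padded by spaces, then split into words."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.split()

# Reduced, hypothetical subset of the token dictionary
sample_tokens = {'.': '||Period||', ',': '||Comma||', '?': '||Question_Mark||'}

words = tokenize_punctuation("Paul: Isn't there?", sample_tokens)
print(words)  # ['Paul:', "Isn't", 'there', '||Question_Mark||']
```

Without this step, `there?` and `there` would count as two different vocabulary entries.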

In [None]:
# Preprocess Data and Save it
util.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

## Checkpoint: 
### 1. Load the pre-processed data

In [4]:
int_text, vocab_to_idx, idx_to_vocab, token_dict = util.load_preprocess()


In [5]:
print(len(int_text))

1086209


In [6]:
def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    n_batches = len(words)//batch_size
    words = words[:n_batches*batch_size]
    y_len = len(words) - sequence_length
    x, y = [], []
    for idx in range(0, y_len):
        idx_end = sequence_length + idx
        x_batch = words[idx:idx_end]
        x.append(x_batch)
        batch_y =  words[idx_end]  
        y.append(batch_y)    
    
    print(len(x), len(y))
    #Create Tensor datasets
    data = TensorDataset(torch.from_numpy(np.asarray(x)), torch.from_numpy(np.asarray(y)))
    data_loader = DataLoader(data, shuffle=False, batch_size=batch_size)
  
    return data_loader    
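
To see what `batch_data` feeds the model, the sketch below applies the same sliding window to a toy list of ten word ids. Each input is a window of `sequence_length` ids, and its target is the single id that follows the window. Plain lists stand in for tensors, and `make_windows` is a hypothetical helper name used only for this illustration.

```python
def make_windows(words, sequence_length):
    """Return (features, targets) built with a sliding window of step 1."""
    features, targets = [], []
    for start in range(len(words) - sequence_length):
        end = start + sequence_length
        features.append(words[start:end])  # the input window
        targets.append(words[end])         # the word right after the window
    return features, targets

toy_ids = list(range(10))
features, targets = make_windows(toy_ids, sequence_length=4)

for x, y in zip(features, targets):
    print(x, '->', y)
# First line: [0, 1, 2, 3] -> 4
# Last line:  [5, 6, 7, 8] -> 9
```

Consecutive windows overlap in all but one position, so the number of training samples is close to the number of words in the text.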

### 2. Define Model

In [4]:
class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5,lr=0.001):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of the word embeddings
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM layers
        :param dropout: dropout to add in between LSTM layers
        :param lr: Unused here; the learning rate is passed to the optimizer instead
        """
        super(RNN, self).__init__()
               
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
    
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        self.fc = nn.Linear(hidden_dim, output_size)

        
    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """

        batch_size = nn_input.size(0)

        embeds = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        out = self.fc(lstm_out)
        out = out.view(batch_size, -1, self.output_size)
        out = out[:, -1]

        return out, hidden

    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(0),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(0))
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
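
The reshaping in `forward` can be hard to follow, so here is a shape-only sketch that uses NumPy arrays in place of tensors. The sizes are small made-up values, and a random matrix stands in for the fully connected layer (bias omitted); only the shapes and the final `[:, -1]` selection mirror the cell above.

```python
import numpy as np

# Illustrative sizes, not the ones used for training
batch_size, seq_len, hidden_dim, output_size = 3, 5, 4, 7

rng = np.random.default_rng(0)
lstm_out = rng.standard_normal((batch_size, seq_len, hidden_dim))  # stand-in for the LSTM output
fc_weight = rng.standard_normal((hidden_dim, output_size))         # stand-in for self.fc

flat = lstm_out.reshape(-1, hidden_dim)                # (15, 4): one row per time step
scores = flat @ fc_weight                              # (15, 7): vocabulary scores per time step
scores = scores.reshape(batch_size, -1, output_size)   # (3, 5, 7): back to batch-first layout
last_step = scores[:, -1]                              # (3, 7): keep only the final time step

print(flat.shape, scores.shape, last_step.shape)
```

Because only the last time step is kept, the model emits one set of next-word scores per input sequence, which matches the single target word per sample.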

In [5]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    #Move the model and the data to the GPU if available
    if(train_on_gpu):
        rnn.cuda(0)
        inp, target = inp.cuda(0), target.cuda(0)

    #Detach the hidden state so gradients don't flow back through earlier batches
    h = tuple([each.data for each in hidden])

    rnn.zero_grad()

    output, h = rnn(inp, h)

    loss = criterion(output, target)
    
    loss.backward()
    #Clip each gradient value to [-1, 1] to limit exploding gradients
    nn.utils.clip_grad_value_(rnn.parameters(), 1)

    optimizer.step()
    
    return loss.item(), h
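
`nn.utils.clip_grad_value_` clamps every gradient element into the range `[-clip_value, clip_value]` before the optimizer step, which limits the effect of exploding gradients in recurrent networks. Here is a pure-Python sketch of that element-wise rule with made-up gradient values. The helper `clip_by_value` is hypothetical and works on a plain list, not on tensors.

```python
def clip_by_value(grads, clip_value):
    """Clamp each gradient element into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

raw_grads = [0.3, -2.5, 1.0, 7.2, -0.01]  # made-up values for illustration
clipped = clip_by_value(raw_grads, clip_value=1.0)
print(clipped)  # [0.3, -1.0, 1.0, 1.0, -0.01]
```

This differs from norm-based clipping (`clip_grad_norm_`), which rescales the whole gradient and keeps its direction.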

### 3. Train Model

You can re-train the model by running the next four blocks. 
Note: Training takes a lot of time.

In [7]:
#Data parameters
sequence_length = 10 
batch_size = 128 

#Load batched data
train_loader = batch_data(int_text, sequence_length, batch_size)

(1086198, 1086198)


In [7]:
#Training parameters
num_epochs = 128
learning_rate = 0.001

#Model parameters
vocab_size = len(vocab_to_idx)
output_size = vocab_size
embedding_dim = 256
hidden_dim = 512
n_layers = 2

#Show stats for every n number of batches
show_every_n_batches = 2000
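
These settings determine how often the loss is reported during training. Here is a quick back-of-the-envelope check, taking the sample count from the `batch_data` output printed above as given:

```python
n_samples = 1086198          # sample count printed by batch_data above
batch_size = 128
show_every_n_batches = 2000

full_batches = n_samples // batch_size                   # batches that are completely full
logs_per_epoch = full_batches // show_every_n_batches    # loss reports per epoch
leftover = n_samples - full_batches * batch_size         # samples in the final partial batch

print(full_batches, logs_per_epoch, leftover)  # 8485 4 118
```

So each epoch covers 8,485 full batches and reports an averaged loss four times; the final partial batch of 118 samples is skipped by the training loop.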

In [8]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    losses = []
    rnn.train()
    #Make sure you iterate over completely full batches, only
    n_batches = len(train_loader.dataset)//batch_size
    
    #Initialize hidden state
    hidden = rnn.init_hidden(batch_size)

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        hidden = (hidden[0].detach(), hidden[1].detach())
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            if(batch_i > n_batches):
                break
            
            #Forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            #Record loss
            batch_losses.append(loss)

            #Printing loss stats
            if batch_i % show_every_n_batches == 0:
                avg = np.average(batch_losses)
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, avg))
                losses.append([batch_i, avg])
                batch_losses = []

    #Return the loss history as well so that it can be saved after training
    return rnn, losses

In [None]:
now = time.time()
#Create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)

if(train_on_gpu):
    rnn.cuda(0)

#Defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

#Training the model
trained_rnn, losses = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

#Saving the trained model
util.save_model('trained_rnn_5', trained_rnn)
print('Model Trained and Saved')
np.save('loss-model-5.npy', np.array(losses))
print("Time taken: ", (time.time() - now)/60, " mins")

Training for 128 epoch(s)...
Epoch:    1/128   Loss: 5.22909194422

Epoch:    1/128   Loss: 4.74667523277

Epoch:    1/128   Loss: 4.69687299383

Epoch:    1/128   Loss: 4.70357798135

Epoch:    2/128   Loss: 4.44755326977

Epoch:    2/128   Loss: 4.25997623396

Epoch:    2/128   Loss: 4.30073728991

Epoch:    2/128   Loss: 4.29542011225

Epoch:    3/128   Loss: 4.1677004982

Epoch:    3/128   Loss: 4.04923562765

Epoch:    3/128   Loss: 4.10956633604

Epoch:    3/128   Loss: 4.09766844451

Epoch:    4/128   Loss: 3.98008153894

Epoch:    4/128   Loss: 3.90513658226

Epoch:    4/128   Loss: 3.96116559172

Epoch:    4/128   Loss: 3.94743908989

Epoch:    5/128   Loss: 3.84947528925

Epoch:    5/128   Loss: 3.79161346495

Epoch:    5/128   Loss: 3.8454236778

Epoch:    5/128   Loss: 3.83603486192

Epoch:    6/128   Loss: 3.7486120957

Epoch:    6/128   Loss: 3.70206538689

Epoch:    6/128   Loss: 3.7610859431

Epoch:    6/128   Loss: 3.74780666864

Epoch:    7/128   Loss: 3.66739134913



Epoch:   53/128   Loss: 2.95581162918

Epoch:   54/128   Loss: 2.90120667442

Epoch:   54/128   Loss: 2.94101360506

Epoch:   54/128   Loss: 3.0039335767

Epoch:   54/128   Loss: 2.94973499125

Epoch:   55/128   Loss: 2.89741489571

Epoch:   55/128   Loss: 2.94192066336

Epoch:   55/128   Loss: 2.9995897705

Epoch:   55/128   Loss: 2.95000543618

Epoch:   56/128   Loss: 2.8970115014

Epoch:   56/128   Loss: 2.93874767721

Epoch:   56/128   Loss: 2.99495403409

Epoch:   56/128   Loss: 2.94015707648

Epoch:   57/128   Loss: 2.89037267999

Epoch:   57/128   Loss: 2.93645435935

Epoch:   57/128   Loss: 2.99185612679

Epoch:   57/128   Loss: 2.93587187797

Epoch:   58/128   Loss: 2.88881511026

Epoch:   58/128   Loss: 2.93039199924

Epoch:   58/128   Loss: 2.98743226004

Epoch:   58/128   Loss: 2.93263111281

Epoch:   59/128   Loss: 2.88556299507

Epoch:   59/128   Loss: 2.9311093033

Epoch:   59/128   Loss: 2.98363184774

Epoch:   59/128   Loss: 2.93331667107

Epoch:   60/128   Loss: 2.879