
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# TV Script Generation Using an RNN

AIM:

My main aim is to work with the Kaggle ["seinfeld-chronic"](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv) dataset and train an RNN on it to generate new script text.

A dataset for textual analysis on arguably the best written comedy television show ever. 

## Import Libraries

Import the required libraries. I will use PyTorch for deep learning and train on a GPU. The GPU currently provided by Colab is a Tesla T4.


In [2]:
import os
import pickle

import numpy as np
import matplotlib.pyplot
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import Counter
from torch.utils.data import TensorDataset, DataLoader

# check whether a GPU is available for later use
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
  print('No GPU available; training will take longer. Use fewer iterations.')
else:
  print('GPU available : |Name| : {}'.format(torch.cuda.get_device_name(0)))

GPU available : |Name| : Tesla T4


## Data Exploration 

The data file is the Seinfeld TV script, stored as plain text.

In [12]:
# Get and Read the Data
input_file = '/content/Seinfeld_Scripts.txt'

with open(input_file, 'r', encoding='utf8') as file:
  input_data = file.read()

text = input_data
# the whole dataset is too large to print, so show summary statistics and a sample
print('Text Exploration')
print('-----------------')
print('No. of unique words {}'.format(len(set(text.split()))))


line_range = (0,10)
lines = text.split('\n')
print('Number of Lines : {}'.format(len(lines)))

word_count_line = [len(line.split()) for line in lines]
print('average words per line {}'.format(np.average(word_count_line)))

print('====================================================================')
print('The lines {} to {}:'.format(*line_range))
print('\n'.join(text.split('\n')[line_range[0]:line_range[1]]))



Text Exploration
-----------------
No. of unique words 46367
Number of Lines : 109233
average words per line 5.544240293684143
The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna g

## Data Preprocess

Preprocess the data by building lookup tables for the vocabulary and by converting special characters (punctuation) to tokens.

The lookup tables map each word to an integer index (**word : index**) and each index back to its word (**index : word**).

In [13]:
# create lookup tables (vocab dictionaries word : int and int : word)
def create_lookup_tables(text):
  word_counts = Counter(text)
  # sort words from most to least frequent
  sorted_vocab = sorted(word_counts,key=word_counts.get, reverse = True)
  # create the two dictionaries
  int_to_vocab = {index:word for index,word in enumerate(sorted_vocab)}
  vocab_to_int = {word:index for index,word in int_to_vocab.items()}
  return (vocab_to_int,int_to_vocab)

# the text still contains special characters, which are mapped to word-like tokens
def token_lookup():
  tokens = dict()
  tokens['.'] = '<PERIOD>'
  tokens[','] = '<COMMA>'
  tokens['"'] = '<QUOTATION_MARK>'
  tokens[';'] = '<SEMICOLON>'
  tokens['!'] = '<EXCLAMATION_MARK>'
  tokens['?'] = '<QUESTION_MARK>'
  tokens['('] = '<LEFT_PAREN>'
  tokens[')'] = '<RIGHT_PAREN>'
  tokens['-'] = '<DASH>'
  tokens['\n'] = '<NEW_LINE>'
  return tokens

special_words = {'PADDING': '<PAD>'}
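
To see what these helpers produce, here is a small self-contained sketch on a toy sentence. It pads each token with spaces before substitution so that punctuation becomes its own word; that padding is an assumption of this sketch, not what the notebook's preprocessing cell does (that cell substitutes tokens without padding, so punctuation stays attached to the neighbouring word). The names `build_lookup`, `toy_tokens`, and `toy_text` are hypothetical.

```python
from collections import Counter

def build_lookup(words):
    """Map words to integer ids, most frequent word first (hypothetical helper)."""
    counts = Counter(words)
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# A reduced token table, for illustration only.
toy_tokens = {'.': '<PERIOD>', ',': '<COMMA>', '?': '<QUESTION_MARK>'}
toy_text = "what is the deal, jerry? the deal is the deal."

# Assumption: pad tokens with spaces so each punctuation mark becomes a separate word.
for symbol, token in toy_tokens.items():
    toy_text = toy_text.replace(symbol, ' {} '.format(token))

toy_words = toy_text.lower().split()
toy_vocab_to_int, toy_int_to_vocab = build_lookup(toy_words)
toy_ids = [toy_vocab_to_int[word] for word in toy_words]

print(toy_words)
print(toy_ids)
```

Mapping the ids back through `toy_int_to_vocab` recovers the token sequence, which is the round trip the generation step relies on.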

In [14]:
# preprocess data and save it for later use 
text = input_data
text = text[81:]          # drop the first 81 characters of the file
token_dict = token_lookup()
# tokens are substituted without surrounding spaces, so punctuation stays attached to the neighbouring word
for key,token in token_dict.items():
  text = text.replace(key, token)

text = text.lower()       # this also lowercases the tokens, e.g. '<PERIOD>' becomes '<period>'
text = text.split()

vocab_to_int, int_to_vocab = create_lookup_tables(text + list(special_words.values()))
int_text = [vocab_to_int[word] for word in text]
with open('preprocess.p','wb') as f:
  pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

In [15]:
# load the saved, preprocessed dataset
with open('preprocess.p','rb') as f:
  int_text, vocab_to_int, int_to_vocab, token_dict = pickle.load(f)

Generate small batches of words for the deep learning model, since training in batches is convenient.

Then I will build a DataLoader by wrapping the data in a TensorDataset.

In [33]:
# generate dataloader with batches.

def batch_data(words, sequence_length, batch_size):
  n_batches = len(words)//batch_size
  words = words[:n_batches*batch_size]       # keep only the words that fill complete batches
  y_len = len(words) - sequence_length       # number of (sequence, target) pairs, e.g. 50 words - 5 seq_length = 45 pairs
  x,y = [], []
  for idx in range(0,y_len):
    idx_end = sequence_length + idx
    x_batch = words[idx:idx_end]
    x.append(x_batch)
    batch_y = words[idx_end]
    y.append(batch_y)

  data = TensorDataset(torch.from_numpy(np.asarray(x)),torch.from_numpy(np.asarray(y)))
  data_loader = DataLoader(data, shuffle= False, batch_size = batch_size)
  return data_loader 
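
The windowing inside `batch_data` can be illustrated without PyTorch. The following is a minimal sketch, assuming a list of integers stands in for `int_text`; `sliding_windows` and the `toy_*` names are hypothetical and only mirror the loop above.

```python
def sliding_windows(words, sequence_length):
    """Return (inputs, targets): each input is `sequence_length` consecutive
    words and its target is the word that follows (hypothetical helper)."""
    inputs, targets = [], []
    for start in range(len(words) - sequence_length):
        end = start + sequence_length
        inputs.append(words[start:end])
        targets.append(words[end])
    return inputs, targets

toy_words = list(range(10))          # stand-in for int_text
toy_x, toy_y = sliding_windows(toy_words, sequence_length=4)

# Show the first few (input sequence -> target word) pairs.
for x, y in zip(toy_x[:3], toy_y[:3]):
    print(x, '->', y)
print('pairs:', len(toy_x))
```

Each window overlaps the previous one by all but one word, so a corpus of N words yields N minus `sequence_length` training pairs.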


## Model Architecture.

The model architecture consists of an embedding layer, stacked LSTM (hidden) layers, and a fully connected output layer.

Here, for the RNN I am going to use LSTM cells.

In [28]:
# Define RNN Architecture

class RNN(nn.Module):
  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout= 0.5, lr=0.001):
    super(RNN,self).__init__()
    self.output_size = output_size
    self.n_layers = n_layers
    self.hidden_dim = hidden_dim

    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
    self.fc = nn.Linear(hidden_dim, output_size)

  def forward(self,nn_input, hidden):
    batch_size = nn_input.size(0)
    embeds = self.embedding(nn_input)
    lstm_out, hidden = self.lstm(embeds, hidden)
    lstm_out = lstm_out.contiguous().view(-1,self.hidden_dim)
    out = self.fc(lstm_out)
    out = out.view(batch_size, -1, self.output_size)
    out = out[:, -1]
    return out, hidden

  def init_hidden(self, batch_size):
    weight = next(self.parameters()).data
    if (train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(), weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers,batch_size, self.hidden_dim).zero_(), weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
    return hidden
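
To get a feel for the size of this architecture, the sketch below counts trainable parameters by hand for the hyperparameters used later in this notebook (vocabulary 46713, embedding 200, hidden 250, 2 layers). It assumes PyTorch's usual LSTM parameterization (four gates, each with input weights, recurrent weights, and two bias vectors); `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_dim, hidden_dim, n_layers):
    """Count LSTM parameters, assuming 4 gates and two bias vectors per layer."""
    total = 0
    for layer in range(n_layers):
        layer_input = input_dim if layer == 0 else hidden_dim
        weights = 4 * hidden_dim * (layer_input + hidden_dim)
        biases = 2 * 4 * hidden_dim
        total += weights + biases
    return total

vocab_size, embedding_dim, hidden_dim, n_layers = 46713, 200, 250, 2

embedding_params = vocab_size * embedding_dim            # one vector per word
lstm_params = lstm_param_count(embedding_dim, hidden_dim, n_layers)
fc_params = hidden_dim * vocab_size + vocab_size         # weights + bias
total_params = embedding_params + lstm_params + fc_params

print('embedding:', embedding_params)
print('lstm     :', lstm_params)
print('fc       :', fc_params)
print('total    :', total_params)
```

Under these assumptions the embedding and output layers dominate, because both scale with the vocabulary size.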


In [29]:
# Forward and backward propagation for one training step

def forward_back_prop(rnn, optimizer, criterion, inputs, target, hidden):
  if(train_on_gpu):
    rnn.cuda()
    inputs, target = inputs.cuda(),target.cuda()

  # detach the hidden state from its history so gradients do not flow back through earlier batches
  h = tuple([each.data for each in hidden])
  rnn.zero_grad()

  output,h = rnn(inputs,h)
  loss = criterion(output,target)
  loss.backward()
  nn.utils.clip_grad_norm_(rnn.parameters(),5)                    # prevent exploding gradients by clipping the gradient norm
  optimizer.step()
  return loss.item(), h
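
The clipping step in `forward_back_prop` rescales all gradients together when their combined norm exceeds the threshold. The sketch below shows the idea in plain Python on nested lists; it is an approximation of what norm clipping does, not PyTorch's implementation, and `clip_by_global_norm` and `toy_grads` are hypothetical names.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

toy_grads = [[3.0, 4.0], [12.0]]      # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print('norm before:', norm_before)
print('norm after :', norm_after)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.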

In [30]:
# Generate Training Loop
def TV_script_RNN(rnn, batch_size, optimizer, criterion, n_epochs, show_every = 100):
  batch_losses = []
  rnn.train()
  print('Training for the {} epochs ... '.format(n_epochs))

  for epoch in range(1, n_epochs+1):
    hidden = rnn.init_hidden(batch_size)
    for batch_i, (inputs,labels) in enumerate(train_loader, 1):
      n_batches = len(train_loader.dataset)//batch_size
      if(batch_i > n_batches):
        break
      loss, hidden = forward_back_prop(rnn,optimizer,criterion,inputs,labels, hidden)
      batch_losses.append(loss)
      # print a status update; the reported loss is the running average over all batches so far
      if batch_i % show_every == 0:
        print('Epoch : {:>4}/{:<4} and Loss: {}\n'.format(epoch, n_epochs, np.average(batch_losses)))
  return rnn


In [31]:
# Define Hyper Parameters
sequence_length = 10
batch_size = 128

train_loader = batch_data(int_text, sequence_length, batch_size)

# Training Parameters
num_epochs = 10
learning_rate = 0.001

# Model Parameters
vocab_size = len(vocab_to_int)
print(vocab_size)
output_size = vocab_size
embedding_dim = 200
hidden_dim = 250
n_layers = 2         
show_every = 2000

46713


In [32]:
# Let's Train the model
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout= 0.5)
if train_on_gpu:
  rnn.cuda()

optimizer = torch.optim.Adam(rnn.parameters(),lr=learning_rate)
criterion =nn.CrossEntropyLoss()

# training the model
TV_script_RNN_Model_1 = TV_script_RNN(rnn, batch_size, optimizer, criterion, num_epochs, show_every)

# save the trained model (not the training function)
saved_model = 'RNN_Model_1.pt'
torch.save(TV_script_RNN_Model_1,saved_model)
print('Model Trained and Saved')

# The model was already trained in an earlier session, so it can also be loaded from my Drive

Training for the 10 epochs ... 
Epoch :    1/10   and Loss: 7.066025070190429

Epoch :    1/10   and Loss: 6.893861082911491

Epoch :    2/10   and Loss: 6.538230449838057

Epoch :    2/10   and Loss: 6.362951337731308

Epoch :    3/10   and Loss: 6.134565952415033

Epoch :    3/10   and Loss: 6.008999293644042

Epoch :    4/10   and Loss: 5.841501305661429

Epoch :    4/10   and Loss: 5.746286462610275

Epoch :    5/10   and Loss: 5.6169854751280575

Epoch :    5/10   and Loss: 5.539699204554733

Epoch :    6/10   and Loss: 5.434922536902028

Epoch :    6/10   and Loss: 5.370780675148231

Epoch :    7/10   and Loss: 5.28334653884506

Epoch :    7/10   and Loss: 5.229114390406217

Epoch :    8/10   and Loss: 5.15482465223314

Epoch :    8/10   and Loss: 5.107828242674056

Epoch :    9/10   and Loss: 5.0437797542377165

Epoch :    9/10   and Loss: 5.002669688033556

Epoch :   10/10   and Loss: 4.9462711167116185

Epoch :   10/10   and Loss: 4.909681207590511

Model Trained and Saved
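
One way to read these numbers: `nn.CrossEntropyLoss` reports the loss in nats, so its exponential is a perplexity, and the loss of a uniform guess over the vocabulary is the natural log of the vocabulary size. The sketch below does that arithmetic for the last logged value. Note the assumption that the logged value is usable as a loss estimate: it is a running average over all batches since training began, so it likely overstates the loss at the end of training.

```python
import math

final_logged_loss = 4.909681207590511   # last value printed in the training log
vocab_size = 46713                       # printed by the hyperparameter cell

perplexity = math.exp(final_logged_loss)
uniform_loss = math.log(vocab_size)      # loss of guessing every word with equal probability

print('perplexity of logged loss : {:.1f}'.format(perplexity))
print('loss of a uniform guess   : {:.2f}'.format(uniform_loss))
```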


In [34]:
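# use the full path: os.path.basename would drop the Drive folder and save to the working directory
saved_model = '/content/drive/My Drive/RNN_Model_1.pt'
torch.save(TV_script_RNN_Model_1,saved_model)             # save the trained model, not the training function
print('Model Trained and Saved')

Model Trained and Saved


In [35]:
# load model
# note: newer PyTorch releases default torch.load to weights_only=True; loading a whole pickled model may need weights_only=False
load_model = '/content/drive/My Drive/RNN_Model_1.pt'
TV_script_RNN_Model = torch.load(load_model)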

In [None]:
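# NOTE: this cell was not executed. It expects a checkpoint dict with a 'state_dict' key,
# which the cells above do not create (they save the whole model object).
TV_script_RNN_Model_1.load_state_dict(torch.load('/content/drive/My Drive/RNN_Model_1')['state_dict'])
print(TV_script_RNN_Model_1)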

## Generate TV Script

With the network trained and saved, I will use it to generate a new, "fake" Seinfeld TV script in this section.

In [92]:
# function for generating text from a trained model
def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, prediction_length = 100):
  rnn.eval()

  current_seq = np.full((1,sequence_length),pad_value)
  current_seq[-1][-1] = prime_id
  predicted = [int_to_vocab[prime_id]]

  for _ in range(prediction_length):
    if train_on_gpu:
      current_seq = torch.LongTensor(current_seq).cuda()
    else:
      current_seq = torch.LongTensor(current_seq)
    
    # initiate hidden state
    hidden = rnn.init_hidden(current_seq.size(0))

    # get the output of rnn 
    output, _ = rnn(current_seq, hidden)

    # get next word probability 
    p = F.softmax(output, dim=1).data
    if train_on_gpu:
      p = p.cpu()
  
    # use top-k sampling for the next word index
    top_k = 5
    p, top_i = p.topk(top_k)
    top_i =  top_i.numpy().squeeze()

    # sample the next word index from the top-k candidates, weighted by their renormalized probabilities
    p = p.numpy().squeeze()
    word_i = np.random.choice(top_i, p=p/p.sum())

    # retrieve that word from the dictionary
    word = int_to_vocab[word_i]
    predicted.append(word)

    # shift the sequence left by one and put the new word last, ready for the next step
    current_seq = np.roll(current_seq.cpu().numpy(), -1,1)
    current_seq[-1][-1] = word_i

  gen_sentences = ' '.join(predicted)

  # replace the punctuation tokens with the original characters
  for key,token in token_dict.items():
    gen_sentences = gen_sentences.replace(token.lower(), key)
  gen_sentences = gen_sentences.replace('\n ', '\n')
  gen_sentences = gen_sentences.replace('( ','(')
  
  return gen_sentences
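
The top-k sampling used in `generate` can be shown in isolation. This is a minimal sketch in plain Python, assuming a small hand-written probability list in place of the softmax output; `sample_top_k` and `toy_probs` are hypothetical names, and `random.choices` stands in for `np.random.choice`.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Pick an index among the k most probable ones, weighted by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices normalizes the weights, so they need not sum to 1
    return rng.choices(top, weights=weights, k=1)[0]

toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]   # stand-in for a softmax output
rng = random.Random(0)                        # seeded for repeatability
samples = [sample_top_k(toy_probs, k=3, rng=rng) for _ in range(1000)]

print(sorted(set(samples)))
```

Restricting the draw to the k most likely words keeps rare, low-probability words out of the output while still allowing variety between runs.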


In [93]:
# Generate New Script
gen_length = 400         # prediction_length
prime_word = 'jerry'
pad_word = special_words['PADDING']

generated_script = generate(TV_script_RNN_Model_1, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

jerry: show's on, moops? you know what? what was it? 

george: well, i think it's better than a woman on the contrary, 

jerry: you know you're going to be hungry accountable. and the other way of year. 

elaine: what happened? 

jerry: i can't believe this was going on. 

george: i think i was a little bit nicer. 

george: oh, i can't. it's the moops. 

jerry: so you don't think so. 

kramer: well, that's the most stand. 

elaine: you don't know how you were getting any more? and i can get uromycitisis poisoning for you. 

george: what are we gonna do? you got the whole one that blocked me in a lady. 

jerry: i thought you can do that. 

george: what did he say? 

elaine: it's a big adjustment. 

elaine: i don't know. 

jerry: you know what the hell was it? 

george: it's the most stand. 

george: so i can tell you, how did you get the hell to paris? 

kramer: oh, it's not an accident. 

elaine: what is that? 

elaine: oh, it's a good time ago. 

jerry: what is that a balk?! or the in