## Loading Data
The dataset can be downloaded from [Kaggle](https://www.kaggle.com/albenft/game-of-thrones-script-all-seasons).

In [1]:
import pandas as pd
import numpy as np

script = pd.read_csv('./data/Game_of_Thrones_Script.csv')
script['Name'] = script['Name'] + ': '
script['dialogue'] = script['Name'] + script['Sentence']
script.drop(['Release Date', 'Season', 'Episode', 'Episode Title', 'Name', 'Sentence'], axis=1, inplace=True)
np.savetxt('./got_script.txt', script.values, fmt='%s')
print(script.head())

                                            dialogue
0  waymar royce: What do you expect? They're sava...
1  will: I've never seen wildlings do a thing lik...
2               waymar royce: How close did you get?
3                      will: Close as any man would.
4            gared: We should head back to the wall.


In [2]:
import os

# Path of the text file written by the previous cell
input_file = './got_script.txt'
with open(input_file, "r") as f:
    text = f.read()

## Explore the Data
This cell gives a sense of the data I'll be working with. For example, speaker names are lowercase, and each line of dialogue is separated by a newline character `\n`.

In [3]:
import numpy as np

view_line_range = (0, 10)
print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))
print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 21829
Number of lines: 23912
Average number of words in each line: 13.66782368685179

The lines 0 to 10:
waymar royce: What do you expect? They're savages. One lot steals a goat from another lot and before you know it, they're ripping each other to pieces.
will: I've never seen wildlings do a thing like this. I've never seen a thing like this, not ever in my life.
waymar royce: How close did you get?
will: Close as any man would.
gared: We should head back to the wall.
royce: Do the dead frighten you?
gared: Our orders were to track the wildlings. We tracked them. They won't trouble us no more.
royce: You don't think he'll ask us how they died? Get back on your horse.
will: Whatever did it to them could do it to us. They even killed the children.
royce: It's a good thing we're not children. You want to run away south, run away. Of course, they will behead you as a deserter … If I don't catch you first. Get back on your horse. I won't sa

## Lookup Table
Since I'm using word embeddings, I'm transforming the words to ids by creating two dictionaries:

`vocab_to_int` maps each word to an id.

`int_to_vocab` maps each id back to its word.

In [4]:
from collections import Counter

def create_lookup_tables(text):
    word_counts = Counter(text)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {i: word for i, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}
    return (vocab_to_int, int_to_vocab)
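
To make the two mappings concrete, here is a minimal sketch that runs the same logic on a tiny word list. The list `toy_words` is a made-up example, not taken from the dataset; the function body mirrors the cell above.

```python
from collections import Counter

def create_lookup_tables(words):
    """Same logic as the cell above: the most frequent word gets id 0."""
    word_counts = Counter(words)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {i: word for i, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# Hypothetical toy input, for illustration only.
toy_words = "the wall the north the wall".split()
vocab_to_int, int_to_vocab = create_lookup_tables(toy_words)

# Encode the words as ids, then decode them again.
encoded = [vocab_to_int[w] for w in toy_words]
decoded = [int_to_vocab[i] for i in encoded]
print(vocab_to_int)
print(encoded)
```

Decoding the encoded ids returns the original word list, which is a quick sanity check that the two dictionaries are inverses of each other.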

## Tokenize Punctuation
I'll be splitting the script into a word array using spaces as delimiters. However, punctuation marks such as periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

The dictionary I create maps each symbol to a token, and each token is inserted with a space on either side. This turns each symbol into its own word, making it easier for the neural network to predict the next word.

In [5]:
def token_lookup():
    tokens = dict()
    tokens['.'] = '<period>'
    tokens[','] = '<comma>'
    tokens['"'] = '<quotation>'
    tokens[';'] = '<semicolon>'
    tokens['!'] = '<exclamation>'
    tokens['?'] = '<question>'
    tokens['('] = '<left_parenthesis>'
    tokens[')'] = '<right_parenthesis>'
    tokens['-'] = '<dash>'
    tokens['\n'] = '<new_line>'
    return tokens
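
As a small illustration of why this helps, the sketch below applies a three-entry subset of the dictionary to a made-up line. The sample sentence and the reduced dictionary are assumptions for the example only.

```python
# Hypothetical subset of the punctuation dictionary, for illustration only.
tokens = {'.': '<period>', '!': '<exclamation>', '?': '<question>'}

line = "bye! why? bye."
for symbol, token in tokens.items():
    # Surround each token with spaces so it becomes a separate word.
    line = line.replace(symbol, ' {} '.format(token))

words = line.lower().split()
print(words)
```

Both occurrences of "bye" now map to the same word, while the punctuation that followed them survives as separate tokens.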

## Preprocessing the data
This cell applies the two functions defined above to the script and saves the result to `preprocess.p`.

In [6]:
import pickle

SPECIAL_WORDS = {'PADDING': '<PAD>'}
token_dict = token_lookup()
for key, token in token_dict.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()
vocab_to_int, int_to_vocab = create_lookup_tables(text + list(SPECIAL_WORDS.values()))
int_text = [vocab_to_int[word] for word in text]
# Use a context manager so the file is closed once the data is written
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

The `load_preprocess()` function will load the saved preprocessed data.

In [7]:
def load_preprocess():
    # Returns (int_text, vocab_to_int, int_to_vocab, token_dict)
    with open('preprocess.p', mode='rb') as f:
        return pickle.load(f)

In [8]:
import torch

train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

## Batching
The `batch_data()` function splits the word ids into input sequences with their next-word targets, and serves them in batches of size `batch_size` using the `TensorDataset` and `DataLoader` classes.

In [9]:
from torch.utils.data import TensorDataset, DataLoader

def batch_data(words, sequence_length, batch_size):
    n_batches = len(words) // batch_size
    words = words[:n_batches * batch_size]
    y_len = len(words) - sequence_length
    x, y = [], []
    for idx in range(0, y_len):
        idx_end = sequence_length + idx
        x_batch = words[idx:idx_end]
        x.append(x_batch)
        batch_y =  words[idx_end]
        y.append(batch_y)    
    data = TensorDataset(torch.from_numpy(np.asarray(x)), torch.from_numpy(np.asarray(y)))
    data_loader = DataLoader(data, shuffle=True, batch_size=batch_size)
    return data_loader
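
The core of `batch_data()` is a sliding window: each run of `sequence_length` ids is paired with the id that follows it. The sketch below shows only that step in plain Python. The helper name `sliding_windows` and the toy ids are assumptions for illustration, and the trimming to a multiple of `batch_size` and the shuffling done by `DataLoader` are left out.

```python
def sliding_windows(words, sequence_length):
    """Pair every window of `sequence_length` ids with the id that follows it."""
    features, targets = [], []
    for start in range(len(words) - sequence_length):
        end = start + sequence_length
        features.append(list(words[start:end]))
        targets.append(words[end])
    return features, targets

# Hypothetical toy ids standing in for `int_text`.
toy_ids = list(range(8))
features, targets = sliding_windows(toy_ids, sequence_length=3)
for x, y in zip(features, targets):
    print(x, '->', y)
```

With 8 ids and a window of 3, this yields 5 input/target pairs, since the last window must still have a following id to predict.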


## Testing the batching
I'm generating some test data and creating a dataloader with the function defined above. Then I take one sample batch of inputs, `sample_x`, and targets, `sample_y`, from the dataloader.

`sample_x` should have size (10, 5) and `sample_y` should have one dimension of size 10.

Also, each entry of `sample_y` should be the value that follows the corresponding row of `sample_x` in `test_text`.

In [10]:
test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)  # built-in next(); the iterator's .next() method is not available in recent PyTorch

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[25, 26, 27, 28, 29],
        [ 2,  3,  4,  5,  6],
        [13, 14, 15, 16, 17],
        [39, 40, 41, 42, 43],
        [32, 33, 34, 35, 36],
        [ 3,  4,  5,  6,  7],
        [23, 24, 25, 26, 27],
        [19, 20, 21, 22, 23],
        [17, 18, 19, 20, 21],
        [26, 27, 28, 29, 30]])

torch.Size([10])
tensor([30,  7, 18, 44, 37,  8, 28, 24, 22, 31])


## Building the Network
The `__init__()` function creates the layers of the neural network and saves them to the class. The `forward()` function will use these layers to run forward propagation and generate an output and a hidden state.

The model outputs the word scores from the last time step only, after the complete sequence has been processed. That is, for each input sequence of words, the output is a single vector of scores over the vocabulary for the next word.

In [11]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        """
        
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.fc = nn.Linear(hidden_dim, output_size)
    
    
    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        """
        
        batch_size = nn_input.size(0)
        embed = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embed, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(lstm_out)
        out = out.view(batch_size, -1, self.output_size)
        out = out[:, -1]
        return out, hidden
    

    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM
        '''
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden

## Forward and Backprop pass
The `forward_back_prop()` function performs one forward pass and one backpropagation step on the `rnn` defined above, and returns the loss together with the updated hidden state.

In [12]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    """
    
    if(train_on_gpu):
        rnn.cuda()
    h = tuple([each.data for each in hidden])
    rnn.zero_grad()
    
    if(train_on_gpu):
        inp, target = inp.cuda(), target.cuda()    
    output, h = rnn(inp, h)
    loss = criterion(output, target)
    loss.backward()
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()
    return loss.item(), h
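
The call to `nn.utils.clip_grad_norm_(rnn.parameters(), 5)` guards against exploding gradients by rescaling them when their overall norm exceeds 5. The plain-Python sketch below illustrates the idea on a flat list of numbers. It is an approximation of what the PyTorch function does, not its implementation, and the helper name and gradient values are made up.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever scale down; gradients below the threshold are left unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Hypothetical gradient values, chosen so the norm before clipping is 10.
grads = [6.0, 8.0]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length shrinks, which keeps a single bad batch from producing an oversized update.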

## Define Training Process
The training loop is implemented in the `train_rnn()` function, which trains the network over all batches for the given number of epochs. The average loss is printed every `show_every_n_batches` batches.

In [13]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    rnn.train()
    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        hidden = rnn.init_hidden(batch_size)
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)
            batch_losses.append(loss)
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []
    return rnn

## Hyperparameter Settings

In [14]:
sequence_length = 30
batch_size = 64
train_loader = batch_data(int_text, sequence_length, batch_size)

In [15]:
num_epochs = 10
learning_rate = 0.001 
vocab_size = len(vocab_to_int)
output_size = vocab_size
embedding_dim = 250
hidden_dim = 512
n_layers = 2
show_every_n_batches = 1500

## Training

In [16]:
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)
save_filename = os.path.splitext(os.path.basename('./save/trained_rnn'))[0] + '.pt'
torch.save(trained_rnn, save_filename)
print('Model Trained and Saved')

Training for 10 epoch(s)...
Epoch:    1/10    Loss: 5.376764600594838

Epoch:    1/10    Loss: 4.82786775747935

Epoch:    1/10    Loss: 4.626995973904927

Epoch:    1/10    Loss: 4.531740409851074

Epoch:    2/10    Loss: 4.312917083774056

Epoch:    2/10    Loss: 4.237606526056926

Epoch:    2/10    Loss: 4.217242928345998

Epoch:    2/10    Loss: 4.173193832556406

Epoch:    3/10    Loss: 3.9802359445706164

Epoch:    3/10    Loss: 3.9356480825742084

Epoch:    3/10    Loss: 3.944112125873566

Epoch:    3/10    Loss: 3.971533558209737

Epoch:    4/10    Loss: 3.7491800918302376

Epoch:    4/10    Loss: 3.73022225300471

Epoch:    4/10    Loss: 3.7566925455729168

Epoch:    4/10    Loss: 3.7741627855300903

Epoch:    5/10    Loss: 3.552854811229224

Epoch:    5/10    Loss: 3.552754388809204

Epoch:    5/10    Loss: 3.598949054082235

Epoch:    5/10    Loss: 3.628698653539022

Epoch:    6/10    Loss: 3.408755863692926

Epoch:    6/10    Loss: 3.410572488943736

Epoch:    6/10    Loss:


In [17]:
def load_model(filename):
    """
    Load a saved model from the '.pt' file matching the base name of <filename>
    """
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    return torch.load(save_filename)

## Generate TV Script
To generate text, the network starts with a single word and repeatedly feeds its predictions back in until the output reaches a set length. The `generate()` function does exactly this. It takes a word id to start with, `prime_id`, and generates `predict_len` further words. Top-k sampling introduces some randomness: the next word is drawn from the k highest-scoring words rather than always being the single most likely one.
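
Top-k sampling can be sketched without PyTorch as follows. This is a simplified illustration under assumed inputs: the helper `sample_top_k` and the six scores are hypothetical, and `random.choices` stands in for `np.random.choice`.

```python
import math
import random

def sample_top_k(scores, k, rng=random):
    """Sample an index from the k highest scores, weighted by softmax probability."""
    # Softmax over all scores (shifted by the max for numerical stability).
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable indices.
    top_indices = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    top_probs = [probs[i] for i in top_indices]
    # random.choices normalises the weights, much like dividing by their sum.
    return rng.choices(top_indices, weights=top_probs, k=1)[0]

# Hypothetical word scores for a six-word vocabulary.
scores = [2.0, 0.1, 1.5, -1.0, 0.3, 1.9]
rng = random.Random(0)
samples = [sample_top_k(scores, k=3, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

Only the three highest-scoring indices can ever be drawn, so unlikely words are excluded while the output still varies from run to run.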

In [21]:
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    rnn.eval()
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        hidden = rnn.init_hidden(current_seq.size(0))
        output, _ = rnn(current_seq, hidden)
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu()
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        word = int_to_vocab[word_i]
        predicted.append(word)
        current_seq = current_seq.cpu()
        current_seq = np.roll(current_seq, -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    # Turn the punctuation tokens back into their original symbols
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    return gen_sentences
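
Inside the loop, `generate()` keeps a fixed-length window of ids: it starts as padding with the prime word in the last position, and each new prediction pushes the oldest id out. A minimal list-based sketch of that bookkeeping, with made-up ids and a hypothetical helper `roll_left` standing in for the `np.roll` call:

```python
def roll_left(seq, new_id):
    """Drop the oldest id and append the newest, keeping the length fixed."""
    return seq[1:] + [new_id]

# Hypothetical values: pad id 0, a window of 5 ids, prime id 7.
pad_id, window, prime_id = 0, 5, 7
current_seq = [pad_id] * (window - 1) + [prime_id]

# Stand-in for three consecutive model predictions.
for predicted_id in [3, 9, 4]:
    current_seq = roll_left(current_seq, predicted_id)
    print(current_seq)
```

The padding gradually leaves the window as real predictions fill it, so the model always receives an input of `sequence_length` ids.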

## Generating the Fake Script 

In [22]:
gen_length = 400
prime_word = 'will'
pad_word = '<PAD>'
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

will: the underlying region skull.
davos: i didn't ask you, lord varys. i don't know how much it feels.
shireen: i don't want to see the words.
shireen: i'm not doing this.
shireen: i'm not a smuggler.
tyrion lannister: i don't know.
tyrion lannister: you did. you don't want to see me.
tyrion lannister: you must have taken your own hand.
tyrion lannister: no, but you're making me. i didn't even want it to save my brother.
tyrion lannister: and i intend to be a queen.
daenerys targaryen: you will be happy about this.
tyrion lannister: yes, i'm not a hero.
tyrion lannister: you know how i know, you are.
varys: i'm sorry.
shae: you know who makes word about your father, your grace.
tyrion lannister: i don't know where it is.
varys: i do not recognize a ship.
tyrion lannister: what? why do you know who i am?
tyrion lannister: because i don't know. but i don't need to be alone. i was a whore.
varys: i was invited to you.
daenerys targaryen: and what does the people want for you?
daario: no,

In [23]:
# Save the fake script; the context manager closes the file automatically.
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)