<a href="https://colab.research.google.com/github/MariannaJan/SherlockScriptGenerator/blob/master/Sherlock_script_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BBC Sherlock script generator

Natural language processing (NLP) offers us many ways to work with text: from extracting proper names (Named Entity Recognition, NER), through classifying text according to the emotions it conveys (sentiment analysis), to automatic summarisation and machine translation.

NLP also lets us generate new, unique texts on the basis of a chosen dataset. Generally, the bigger the dataset, the better the results. This creative aspect of NLP is used in chatbots, usually in combination with other language processing methods.

There are many ways to approach text generation with machine learning. What is shown here is a simple Recurrent Neural Network based on Long Short-Term Memory (LSTM) cells. It does not use any pretrained language model, but instead is based on the assumption that you can predict a probable next word if you know the words that precede it.


![Sherlock and Watson](https://media.giphy.com/media/3osxYAEY9vBn8BkS0U/giphy.gif)

## Downloading the data

First we need to get the txt file with the series transcript.

The transcripts are gathered from [here](https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=sherlock)



In [0]:
!wget -O sherlock.txt https://www.dropbox.com/s/od4a3cowfu3sezm/Sherlock_script.txt?dl=0

--2019-06-08 10:07:34--  https://www.dropbox.com/s/od4a3cowfu3sezm/Sherlock_script.txt?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.1, 2620:100:6021:1::a27d:4101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/od4a3cowfu3sezm/Sherlock_script.txt [following]
--2019-06-08 10:07:35--  https://www.dropbox.com/s/raw/od4a3cowfu3sezm/Sherlock_script.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc849916a3cf4330377ee4766b9f.dl.dropboxusercontent.com/cd/0/inline/Aiax3UXjf-qTh4HfhAqIRhyN-O5P_90JsCcJe46hGs1QDxL9zX_wqxvEWi3MhFRUhddmmaZkV3qHaA0PbT7urMpHp5ACGnEe64C18FNqWxzSfA/file# [following]
--2019-06-08 10:07:35--  https://uc849916a3cf4330377ee4766b9f.dl.dropboxusercontent.com/cd/0/inline/Aiax3UXjf-qTh4HfhAqIRhyN-O5P_90JsCcJe46hGs1QDxL9zX_wqxvEWi3MhFRUhddmmaZkV3qHaA0PbT7urMpHp5ACGnEe64C18FNqWx

Then we read the text file and save it as a string.

In [0]:
with open('sherlock.txt', 'r') as f:
  transcript = f.read()

## Setup for saving onto Google Drive

We need to mount the drive and specify the read / write location on it. We also import the datetime module, which we'll use later to name the saved files.

In [0]:
#mounting google drive

from google.colab import drive

drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
import datetime

project_dir = "/content/gdrive/My Drive/Script_gen/"

## Exploring the data

Next we explore the data to find out what preprocessing it needs for the model to work.

In [0]:
lines = transcript.split('\n')

In [0]:
line_lengths = [len(line.split()) for line in lines]

In [0]:
def getUniqueWords(text):
  words = text.lower().split()
  return list(set(words))

Basic stats for the dataset:

In [0]:
import numpy as np

print('number of lines: ', len(lines))
print('average number of words in a sentence: ', round(np.average(line_lengths), 1))
print('number of words: ', len(transcript.lower().split()))
print('number of unique words: ', len(getUniqueWords(transcript)))

number of lines:  15583
average number of words in a sentence:  8.7
number of words:  135946
number of unique words:  15641


For reference, the English language has 171,476 words in current use (based on the Oxford English Dictionary).

In [0]:
def getUniqueChars(text):
  uniqueChars = list(set([char for char in text.lower()]))
  uniqueChars.sort()
  return uniqueChars

In [0]:
unique_chars = getUniqueChars(transcript)

print('There are {} unique characters in the dataset, counting lower and upper case letters as one letter'.format(len(unique_chars)))
print(unique_chars)

There are 67 unique characters in the dataset, counting lower and upper case letters as one letter
['\n', ' ', '!', '"', '#', '%', '&', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '\x80', '\x8d', '\x99', '£', '©', '°', '½', 'â', 'ã', '\ufeff']
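
Some of the characters listed above, such as 'â', '\x80' and '\x99', look like encoding debris rather than real script content. One plausible explanation (an assumption about the transcript file, not something verified here) is that typographic punctuation such as a curly apostrophe was stored as UTF-8 and at some point decoded with a single-byte encoding. The sketch below reproduces that pattern on a single character:

```python
# Assumption: the source text contained a curly apostrophe (U+2019) whose
# UTF-8 bytes were, at some point, decoded as Latin-1.
curly_apostrophe = '\u2019'

utf8_bytes = curly_apostrophe.encode('utf-8')   # three bytes: E2 80 99
garbled = utf8_bytes.decode('latin-1')          # one character per byte

print([hex(b) for b in utf8_bytes])
print([ch for ch in garbled])

# Reversing the wrong decoding step recovers the original character
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == curly_apostrophe)
```

Since the cleanup step in the next section keeps only a whitelist of characters, this debris is dropped either way.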


## Preprocessing the data

As we have seen, there are a number of characters in the dataset that we can safely omit without losing the meaning of the text.

In [0]:
import string

def cleanupText(text):
  wanted_chars = string.digits + string.ascii_letters + "'" + '".,:;!?-() \n'
  print(wanted_chars)
  return ''.join(x for x in text if x in (wanted_chars))

In [0]:
transcript = cleanupText(transcript)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'".,:;!?-() 



In [0]:
print('After cleanup there are {} unique characters in the dataset'.format(len(getUniqueChars(transcript))))
print(getUniqueChars(transcript))

After cleanup there are 49 unique characters in the dataset
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


One more step we need to take is to tokenize the symbols (punctuation). Without it, the dataset split into single words would contain several variants of the same word with different punctuation, e.g. 'go', 'go?', 'go.', 'go!' and so on. To avoid that, we replace those symbols with special tokens delimited by spaces, so that they are treated as separate words.

First we need to create a dictionary for the symbol tokens.

In [0]:
punctuation_dict = {
        '.':'||period||',
        ',':'||comma||',
        '"':'||quote||',
        ';':'||semicolon||',
        '!':'||exclamation||',
        '?':'||question||',
        '(':'||leftPar||',
        ')':'||rightPar||',
        '-':'||dash||',
        '\n':'||return||',
        "'":'||single_quote||'
    }

Then we replace all the punctuation in the dataset with the appropriate tokens.

In [0]:
def tokenizePunctuation(text):
  for key, value in punctuation_dict.items():
    text = text.replace(key, ' {} '.format(value))
  return text

In [0]:
transcript = tokenizePunctuation(transcript)

In [0]:
print(transcript[:100])

 ||dash|| ELLA: How ||single_quote|| s your blog going ||question||   ||dash|| Hmm ||comma||  fine |
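
To see why this step matters, here is a minimal sketch on a toy sentence. The reduced dictionary `toy_punctuation` and the helper `tokenize_toy` are hypothetical names used only for this illustration; they mirror, but do not replace, `punctuation_dict` and `tokenizePunctuation` above.

```python
# Hypothetical, reduced version of the punctuation dictionary (illustration only)
toy_punctuation = {'.': '||period||', '!': '||exclamation||', '?': '||question||'}

def tokenize_toy(text):
    # surround every token with spaces so that split() treats it as a separate word
    for symbol, token in toy_punctuation.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text

sample = 'go? go. go! stop? stop. stop! wait? wait. wait!'

vocab_before = set(sample.lower().split())
vocab_after = set(tokenize_toy(sample).lower().split())

# before: every word/punctuation combination counts as a different "word"
print(len(vocab_before), sorted(vocab_before))
# after: three real words plus three punctuation tokens
print(len(vocab_after), sorted(vocab_after))
```

The more words share the same punctuation marks, the bigger the saving becomes.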


Now we can tokenize the dataset, which in our case means splitting it into single words. This will allow us to create a lookup table.

In [0]:
unique_words = getUniqueWords(transcript)

In [0]:
print('The number of unique words is: ', len(unique_words))

The number of unique words is:  8506


As we can see, cleaning up the text and tokenizing the punctuation lowered the number of unique words almost by half (from 15641 to 8506).

Now we need to create the lookup table to convert the words in the dataset into numbers. We will do this by creating a dictionary with words as keys and integers as values. We first sort the words by how often they appear in the dataset, so that the more often a word appears, the lower the integer it gets.

We will also need a lookup table allowing us to convert numbers generated by our model back into words, so we'll create another dictionary for that, which will be the first dictionary with keys and values swapped.

In [0]:
from collections import Counter

def createLookupTables(text, padding_symbol):
  text = text.lower().split()
  text.append(padding_symbol)
  count = Counter(text)
  vocabulary = sorted(count, key=count.get, reverse=True)
  word_to_int = {word:idx for idx, word in enumerate(vocabulary)}
  int_to_word = {idx:word for idx, word in enumerate(vocabulary)}
  return (word_to_int,int_to_word)
  

When generating a new script we start from a single word, so the rest of the input sequence has to be filled with a padding symbol. We'll add that symbol to our lookup tables.

In [0]:
PADDING_SYMBOL = '<PAD>'

In [0]:
word_to_int,int_to_word = createLookupTables(transcript, PADDING_SYMBOL)

In [0]:
for i in range(5):
  print(int_to_word[i])

||return||
||period||
||comma||
||single_quote||
you
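
The frequency-based numbering can be illustrated on a toy sentence. This is only a sketch of the same idea as `createLookupTables`; the names `toy_text`, `toy_word_to_int` and `toy_int_to_word` are made up for the example.

```python
from collections import Counter

toy_text = 'the cat and the dog and the bird'

# count the words and sort them from most to least frequent
toy_counts = Counter(toy_text.split())
toy_vocabulary = sorted(toy_counts, key=toy_counts.get, reverse=True)

toy_word_to_int = {word: idx for idx, word in enumerate(toy_vocabulary)}
toy_int_to_word = {idx: word for word, idx in toy_word_to_int.items()}

# the most frequent word gets the lowest integer
print(toy_word_to_int)

# encoding and decoding are inverse operations
encoded = [toy_word_to_int[word] for word in toy_text.split()]
decoded = ' '.join(toy_int_to_word[idx] for idx in encoded)
print(encoded)
print(decoded == toy_text)
```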


Neural networks work on numbers, not on words. We'll use our word_to_int dictionary to convert the dataset into a list of integers.

In [0]:
def changeIntoInts(text):
  return [word_to_int[word] for word in text.lower().split()]

In [0]:
transcript_int = changeIntoInts(transcript)

This is how the beginning of the transcript looks after all the changes we have made:

In [0]:
print(transcript_int[:20])

[8, 4733, 55, 3, 10, 29, 388, 73, 5, 8, 170, 2, 215, 1, 0, 93, 1, 0, 98, 93]


### Saving and loading the results of preprocessing

We can now save our lookup tables, punctuation dictionary and prepared transcript (in the form of a list of integers) for future use.

In [0]:
import pickle

In [0]:
def save_preprocessed(word_to_int, int_to_word, punctuation_dict, transcript_int, path):  
  with open(path, 'wb') as f:
    pickle.dump((word_to_int, int_to_word, punctuation_dict, transcript_int), f)
    print('Saved preprocessed data in', path)

In [0]:
def load_preprocessed(path):
  # use a context manager so that the file is closed after loading
  with open(path, mode='rb') as f:
    return pickle.load(f)

In [0]:
#saving current preprocessed data
path = project_dir + 'preprocessed' + str(datetime.datetime.now()) +'.p'
save_preprocessed(word_to_int, int_to_word, punctuation_dict, transcript_int, path)

Saved preprocessed data in /content/gdrive/My Drive/Script_gen/preprocessed2019-05-12 13:25:34.233846.p


## Building the model

### Model architecture

To build the script generator we'll build an RNN ([Recurrent Neural Network](https://towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce)) based on LSTM (Long Short-Term Memory) cells.

RNNs are a specific kind of neural network that differs from vanilla neural networks by taking as input a series of data (like a time series) instead of a fixed-size vector. This makes it possible to consider how input items relate to each other in time: the result of analysing an input is also influenced by the preceding inputs.
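
The influence of preceding inputs can be shown with a deliberately tiny recurrent cell. This is a sketch of a plain (non-LSTM) scalar RNN step with arbitrary, hand-picked weights; `rnn_step` and `run_sequence` are hypothetical helpers, not part of the model built below.

```python
import math

def rnn_step(x, h_prev, w_in=0.5, w_rec=0.8):
    # one step of a scalar recurrent cell: the new state mixes the current
    # input with the previous state (weights are arbitrary example values)
    return math.tanh(w_in * x + w_rec * h_prev)

def run_sequence(inputs):
    h = 0.0  # initial hidden state
    for x in inputs:
        h = rnn_step(x, h)
    return h

# the same two inputs in a different order give a different final state
state_a = run_sequence([1.0, 0.0])
state_b = run_sequence([0.0, 1.0])
print(state_a, state_b)
```

An LSTM cell follows the same principle, but adds gates that control what is kept in and what is dropped from the state.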

### Dataloader

To create a dataset suited to this job, we'll turn it into batches, with the features being a specified number of words preceding the word that is considered our target (or label).

To provide a consistent format for the dataset we'll use [TensorDataset](https://pytorch.org/docs/master/data.html#torch.utils.data.TensorDataset) and to turn it into appropriate batches we'll use [DataLoader](https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader).

The TensorDataset needs PyTorch Tensors as arguments. To turn a simple Python list into a Tensor, we'll use [torch.from_numpy](https://pytorch.org/docs/0.4.0/torch.html#torch.from_numpy), which is why we need to convert the list into a NumPy array first.

To allow for flexible adjustment of the dataloader for the model, the function creating it takes two additional parameters:


*   sequence length - this specifies how many words before the target word are considered to be features
*   batch size - this specifies how many feature / target sets are in a batch



In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader

def createDataloader(text_int, sequence_length, batch_size):
  
  #first we need to convert the whole dataset into features and targets lists
  features, target = [], []
  
  #we loop through the whole dataset, moving one element at a time
  #we start at the first element and end at the element that is sequence_length from the end of the dataset
  
  for sequence_no in range(len(text_int)-sequence_length):
    features.append(text_int[sequence_no:sequence_no+sequence_length])
    target.append(text_int[sequence_no+sequence_length])
    
  #then we convert the dataset into the Tensor Dataset, which accepts NumPy arrays as parameter
  #Tensor Dataset needs features and target nparrays as parameters
  
  dataset = TensorDataset(torch.from_numpy(np.array(features)), torch.from_numpy(np.array(target)))

  #once we have the appropriate dataset, we can create the dataloader
  
  dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
  
  return dataloader

The dataloader is an iterable that returns features and targets in the form of tensors that together form a batch. A sample batch drawn from the first 50 tokens of the dataset looks like this (the dataloader shuffles the data, so the rows come in random order):

In [0]:
sample_loader = createDataloader(transcript_int[:50], sequence_length=5, batch_size=10)
sample_iterable = iter(sample_loader)
sample_feature, sample_target = next(sample_iterable)

print('Sample feature batch:')
print(sample_feature)
print('Sample target:')
print(sample_target)

Sample feature batch:
tensor([[  93,    1,    0,   98,   93],
        [   9,    3,   10,   73,   12],
        [   1,    0,   65,    2,    9],
        [  73,    5,    8,  170,    2],
        [  27,   11,  288,    1,    0],
        [ 425,   12, 3332,   12, 2198],
        [   0,   98,   93,    1,    0],
        [ 288,    1,    0,   65,    2],
        [   8,   27,   11,  288,    1],
        [  10,   73,   12,  140,    4]])
Sample target:
tensor([  1, 140,   3, 215,  65, 184,   8,   9,   0,  11])
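
The sliding window used to build features and targets is easier to follow on a short list without shuffling. The sketch below repeats the loop from `createDataloader` in plain Python; `make_windows` and `toy_ids` are hypothetical names for this example.

```python
def make_windows(token_ids, sequence_length):
    # each feature is a run of sequence_length ids,
    # the target is the id that directly follows that run
    features, targets = [], []
    for start in range(len(token_ids) - sequence_length):
        features.append(token_ids[start:start + sequence_length])
        targets.append(token_ids[start + sequence_length])
    return features, targets

toy_ids = [10, 11, 12, 13, 14, 15]
toy_features, toy_targets = make_windows(toy_ids, sequence_length=3)

for feature, target in zip(toy_features, toy_targets):
    print(feature, '->', target)
```

A list of n ids yields n - sequence_length pairs, because the last window still needs one id after it to serve as the target.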


### Defining the neural network

First we need to define the model: we'll use a Recurrent Neural Network. To implement this, we'll use PyTorch's [nn.Module](https://pytorch.org/docs/master/nn.html#torch.nn.Module), which allows us to construct neural network models.

The init function specifies the structure of the neural network. It accepts several parameters to allow for tuning the model:

*   vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
*   output_size: The number of output dimensions of the neural network (here equal to the vocabulary size, one score per word)
*   embedding_dim: The size of the embeddings
*   hidden_dim: The size of the hidden layer outputs
*   n_layers: The number of stacked LSTM layers
*   dropout: The dropout to add in between LSTM layers



In [0]:
import torch.nn as nn

class ScriptGenModel(nn.Module):
  
  #here we define the structure of the model
  
  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
    super(ScriptGenModel, self).__init__()
    
    #saving the parameters for future use
    self.output_size = output_size
    self.n_layers = n_layers
    self.hidden_dim = hidden_dim
        
    # defining model layers
    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
    self.fc = nn.Linear(hidden_dim, output_size)
    
  #here we decide, how the forward pass of the model should be performed
  
  def forward(self, nn_input, hidden):
    
    batch_size = nn_input.size(0)

    embeds = self.embedding(nn_input)
    lstm_out, hidden = self.lstm(embeds, hidden)
    lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
    out = self.fc(lstm_out)

    out = out.view(batch_size, -1, self.output_size)
    out = out[:, -1]
    return out, hidden
  
  #here we define the way to initialise the hidden state
  
  def init_hidden(self, batch_size):
    weight = next(self.parameters()).data

    if (train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
    return hidden
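
To make the forward pass easier to follow, the sketch below traces tensor shapes through the same three layers, using small made-up sizes. It applies the linear layer directly to the 3D LSTM output instead of reshaping first, which is a simplification of the code above that should give the same shapes at the end.

```python
import torch
import torch.nn as nn

# toy sizes, chosen only for illustration
toy_vocab, toy_embed, toy_hidden, toy_layers = 20, 6, 8, 2
toy_batch, toy_seq_len = 4, 5

embedding = nn.Embedding(toy_vocab, toy_embed)
lstm = nn.LSTM(toy_embed, toy_hidden, toy_layers, batch_first=True)
fc = nn.Linear(toy_hidden, toy_vocab)

# a batch of word ids and a zero-initialised (hidden state, cell state) pair
word_ids = torch.randint(0, toy_vocab, (toy_batch, toy_seq_len))
hidden = (torch.zeros(toy_layers, toy_batch, toy_hidden),
          torch.zeros(toy_layers, toy_batch, toy_hidden))

embeds = embedding(word_ids)             # (batch, seq_len, embedding_dim)
lstm_out, hidden = lstm(embeds, hidden)  # (batch, seq_len, hidden_dim)
scores = fc(lstm_out)                    # (batch, seq_len, vocab)
last_scores = scores[:, -1]              # (batch, vocab): scores for the next word only

print(embeds.shape, lstm_out.shape, scores.shape, last_scores.shape)
```

Only the scores at the last position are kept, because the model is trained to predict the single word that follows the whole input sequence.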

## Training the model

### Training functions

Now that we have the model defined, we need to prepare the training functions. 

In [0]:
def forward_back_prop(scriptGenModel, optimizer, criterion, inp, target, hidden):
  # move data to GPU, if available
  if(train_on_gpu):
    inp, target = inp.cuda(), target.cuda()
    
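  # detach the hidden state from its history, so that gradients
  # are not propagated back through previous batches
  hidden = tuple([each.data for each in hidden])

  scriptGenModel.zero_grad()
    
  # forward pass
  output, hidden = scriptGenModel(inp, hidden)

  # backward pass, with gradient clipping to limit exploding gradients
  loss = criterion(output, target)
  loss.backward()
  nn.utils.clip_grad_norm_(scriptGenModel.parameters(), 3)
  optimizer.step()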
  return loss.item(), hidden
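
The effect of `nn.utils.clip_grad_norm_` can be checked in isolation. The sketch below uses a small stand-in layer and an artificially inflated loss (both assumptions made for the demonstration) and compares the total gradient norm before and after clipping.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)  # stand-in for the real model

# an artificially large loss, so that the gradients are large as well
loss = (layer(torch.ones(1, 4)) * 1000).sum()
loss.backward()

max_norm = 3
# clip_grad_norm_ rescales the gradients in place and
# returns the total norm measured before clipping
norm_before = float(nn.utils.clip_grad_norm_(layer.parameters(), max_norm))
norm_after = float(torch.norm(torch.stack([p.grad.norm() for p in layer.parameters()])))

print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length is capped, which keeps a single bad batch from pushing the weights too far.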


In [0]:
def train_scriptGenModel(scriptGenModel, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    scriptGenModel.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = scriptGenModel.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(scriptGenModel, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return scriptGenModel

### Hyperparameters


For the training of the model to work properly, we need to decide on the hyperparameters we want to use for the model.

In [0]:
# Parameters for data preprocessing

sequence_length = 8
batch_size = 64

# Training parameters

num_epochs = 10
learning_rate = 0.001

# Model parameters

vocab_size = len(word_to_int)
output_size = vocab_size
embedding_dim = 300
hidden_dim = 512
n_layers = 2

# Show stats for every n number of batches

show_every_n_batches = 500

### Preparing for saving / loading model

As we train the model we will save it on Google Drive, so that we can use the checkpoint that satisfies us best, if we decide to tune the hyperparameters. We will also define the method for loading a saved checkpoint for future use, like generating new scripts.

In [0]:
def save_model(filename, scriptGenModel):
  model_save_name = filename + ".pt"
  path = project_dir + model_save_name
  torch.save(scriptGenModel, path)

In [0]:
def load_model(filename):
    path = project_dir + filename + ".pt"
    return torch.load(path)
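
The functions above pickle the whole model object. An alternative that the PyTorch documentation generally recommends is to save only the model's `state_dict` and to rebuild the architecture before loading. The sketch below shows the pattern on a stand-in layer and a temporary file; for the real model one would presumably construct `ScriptGenModel` with the same hyperparameters first.

```python
import os
import tempfile

import torch
import torch.nn as nn

# stand-in for the real model, used only to keep the sketch self-contained
toy_model = nn.Linear(3, 2)

checkpoint_path = os.path.join(tempfile.mkdtemp(), 'toy_checkpoint.pt')

# save only the learned parameters
torch.save(toy_model.state_dict(), checkpoint_path)

# to load, rebuild the same architecture and copy the parameters into it
restored_model = nn.Linear(3, 2)
restored_model.load_state_dict(torch.load(checkpoint_path))

print(torch.equal(toy_model.weight, restored_model.weight))
```

A checkpoint saved this way does not depend on the class definition being importable under the same name when it is loaded.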

### Running the training loop

We can finally start the training of our model. First we'll check if training on GPU is available, as training this kind of model on CPU would be really time consuming.

In [0]:
#checking if GPU is available

train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
  print('There is no GPU available. Please consider switching to GPU for training the model.')
else:
  print("GPU available. You're good to go!") 

GPU available. You're good to go!


Next we need to create the dataloader for the dataset.

In [0]:
train_loader = createDataloader(transcript_int, sequence_length, batch_size)

We instantiate the model. Before training the model we have to decide on the optimiser and loss function we'll use.

In [0]:
# creating the model
scriptGenModel = ScriptGenModel(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)

#deciding on optimizer and loss functions
optimizer = torch.optim.Adam(scriptGenModel.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
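
To get a feeling for the scale of the cross-entropy loss, it helps to compute what a model that guesses uniformly at random would score. The sketch below does this arithmetic for the vocabulary of this notebook (8506 unique words plus the padding symbol) and converts an example loss value of 3.5, chosen only for illustration, into perplexity.

```python
import math

vocab_size_example = 8507  # 8506 unique words plus the padding symbol

# cross-entropy of a model that assigns equal probability to every word
uniform_loss = -math.log(1.0 / vocab_size_example)

def perplexity(loss):
    # exp(loss) can be read as the effective number of equally likely choices
    return math.exp(loss)

print(round(uniform_loss, 2))
print(round(perplexity(3.5), 1))
```

A loss well below the uniform baseline means that the model has narrowed the next word down from thousands of candidates to a few dozen effective choices.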

We can see what our model looks like:

In [0]:
print(scriptGenModel)

ScriptGenModel(
  (embedding): Embedding(8507, 300)
  (lstm): LSTM(300, 512, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=512, out_features=8507, bias=True)
)


We can finally start the training of the model. When the training is finished, the model will be saved to Google Drive. We'll add the current time to the name of the saved model to make sure that we don't overwrite a previous model by accident.

In [0]:
#moving the model to GPU if it's available
if train_on_gpu:
    scriptGenModel.cuda()

#running the training loop
trained_scriptGenModel = train_scriptGenModel(scriptGenModel, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

#Setup for saving the trained model to Google Drive
current_time = datetime.datetime.now() 
model_name = 'trained_scriptGenModel' + str(current_time)
save_model(model_name, trained_scriptGenModel)

Training for 10 epoch(s)...
Epoch:    1/10    Loss: 5.223743494510651

Epoch:    1/10    Loss: 4.690614940643311

Epoch:    1/10    Loss: 4.528136008262634

Epoch:    1/10    Loss: 4.424260503292084

Epoch:    1/10    Loss: 4.378831959247589

Epoch:    1/10    Loss: 4.304594363689422

Epoch:    2/10    Loss: 4.091347557698895

Epoch:    2/10    Loss: 4.055635792732239

Epoch:    2/10    Loss: 4.004753903865814

Epoch:    2/10    Loss: 3.975473068714142

Epoch:    2/10    Loss: 3.968696400642395

Epoch:    2/10    Loss: 3.947124272823334

Epoch:    3/10    Loss: 3.8058034220855395

Epoch:    3/10    Loss: 3.71501554107666

Epoch:    3/10    Loss: 3.7199957451820373

Epoch:    3/10    Loss: 3.7051987519264222

Epoch:    3/10    Loss: 3.7419188141822817

Epoch:    3/10    Loss: 3.6934643173217774

Epoch:    4/10    Loss: 3.5206771789196725

Epoch:    4/10    Loss: 3.4885264630317687

Epoch:    4/10    Loss: 3.494842125892639

Epoch:    4/10    Loss: 3.5142241282463074

Epoch:    4/10    L

## Generating the script

Now that we finally have our trained model, we can try and generate a new script.

In [0]:
import torch.nn.functional as F

def generate(scriptGenModel, prime_id, int_to_vocab, punctuation_dict, pad_value, predict_len=100):

    scriptGenModel.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = scriptGenModel.init_hidden(current_seq.size(0))
        
        # get the output of the model
        
        ##### Added next 3 lines to fix numpy bug (inference runs on CPU, so np.roll below can handle the sequence) ####
        scriptGenModel.cpu()
        current_seq = current_seq.cpu()
        hidden = hidden[0].cpu(), hidden[1].cpu()
        #print(" curs:", current_seq.device, "hidden[0]", hidden[0].device, "hidden[1]", hidden[1].device)
        output, _ = scriptGenModel(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        current_seq = np.roll(current_seq, -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in punctuation_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
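
The top-k sampling step inside `generate` can be looked at separately. The sketch below uses NumPy only (the function above uses `topk` on a PyTorch tensor instead) and a made-up probability vector: it keeps the k most probable words, renormalises their probabilities and samples among them.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# hypothetical next-word probabilities over a six-word vocabulary
probs = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
top_k = 3

# indices of the k most probable words
top_indices = np.argsort(probs)[-top_k:]
top_probs = probs[top_indices]

# renormalise so that the kept probabilities sum to 1, then sample among them
top_probs = top_probs / top_probs.sum()
samples = rng.choice(top_indices, size=1000, p=top_probs)

# only the k most probable word indices can ever be drawn
print(sorted(set(samples.tolist())))
```

Restricting the choice to a few likely words keeps the output reasonably coherent, while the random draw prevents the generator from repeating the same most probable phrase over and over.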

We set up basic values for the generator function and see the result.

If the trained model and the saved parameters are no longer in memory from the training run, we load them first.

In [0]:
#we check if the parameters for text generation are loaded, and if not we load them in

try:
  if (word_to_int and int_to_word and punctuation_dict and PADDING_SYMBOL):
    print('Data for text generation present')
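except NameError:
  # note: the file name has to match a file that was actually saved earlier
  path = project_dir+'preprocessed.p'
  word_to_int, int_to_word, punctuation_dict, transcript_int = load_preprocessed(path)
  # the padding symbol is not stored in the pickle, so we define it again
  PADDING_SYMBOL = '<PAD>'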

In [0]:
#we check if the model is present, and if not we load it in
import torch
import numpy as np

try:
  if (trained_scriptGenModel):
    print('The model for text generation is present')
except NameError:
  # note: the file name has to match a model that was actually saved earlier
  trained_scriptGenModel = load_model('trained_scriptGenModel')

In [0]:
gen_length = 400 
prime_word = 'I'

generated_script = generate(trained_scriptGenModel, word_to_int[prime_word.lower()], int_to_word, punctuation_dict, word_to_int[PADDING_SYMBOL], gen_length)
print(generated_script)

i' d say is your best friend.
- oh, i see.
- you' re a doctor, john.
what? what? the hiker turns up with me!' i' m sorry.
i' m afraid you acquainted me, i' m just wondering.
oh, i think you' re going to need jones and abby the merchant' s drinking habits out.
- i don' t know.
- i am.
- no.
- no, no, i don' t think so.
i' m just saying.
you' ll need some shopping, because i' ll have to go to the good details.
- i' ve seen it.
no, it' s all.
i know.
you' re going to die.
i know you' ve got a gun.
oh- you think i' ll take the case?- i need to know.
- what do you mean?- i didn' t know.
- you were a doctor.
- i' ll skip you in tomorrow.
' you realise i' ve got bluebell.
i don' t want the money.
- what?- nothing, buddy hacker.
- you spoke to her husband being executed.
it' s a game of chess with a bullet through his head? it must be a bit mundane but i was able to help the target ones.
i' m afraid you can.
i just thought that' ll be fine.
i' ll be late for this place.
it' s a memory pet, but

We can now save the generated script to a file if we like it.

In [0]:
def saveGenScript(generated_script):
  path = project_dir + 'generated_script' + str(datetime.datetime.now()) + '.txt'
  with open(path, "w") as text_file:
    text_file.write(generated_script)

In [0]:
#saving generated script to Google Drive

saveGenScript(generated_script)