# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





__Downloads__

In [2]:
%pip install torchdata
%pip install Cython
%pip install typing-extensions --upgrade
%pip install torch --upgrade
%pip install torchtext==0.9.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata
  Downloading torchdata-0.5.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
[K     |████████████████████████████████| 4.6 MB 4.8 MB/s eta 0:00:01     |██▍                             | 348 kB 4.8 MB/s eta 0:00:01
Collecting torch==1.13.1
  Downloading torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
[K     |████████████████████████████████| 887.5 MB 5.5 kB/s  eta 0:00:01     |██████████████████████████████▋ | 849.5 MB 51.4 MB/s eta 0:00:01��███████████▊| 878.5 MB 51.4 MB/s eta 0:00:01
[?25hCollecting portalocker>=2.0.0
  Downloading portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux"
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
[K     |████████████████████████████████| 21.0 MB 38.0 MB/s eta 0:00:01
[?25hCollecting nvidia-cuda-runtime-cu11==11.7.99;

_Restart the kernel_

In [25]:
# initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import time
import math
import random
import re

import torchtext
import torch
import torch.optim as optim
import torch.nn as nn
import gensim.downloader

In [26]:
# dataset download
train, test = torchtext.datasets.SQuAD1("root")

The Squad 1 dataset contains 100k+ questions and answers based on over 500 articles. More information can be found [here](https://rajpurkar.github.io/SQuAD-explorer/).

In [27]:
# check number of rows
print(f"Number of training data rows: {train.num_lines}")
print(f"Number of test data rows: {test.num_lines}")

Number of training data rows: 87599
Number of test data rows: 10570


In [28]:
# print out example question
for context, question, answer, answer_start in train:
    print(f"Question: {question}")
    print(f"Answer: {answer[0]}")
    break

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer: Saint Bernadette Soubirous


In [29]:
# extract data into dataframe
def convert_to_df(iterator):

    contexts, questions, answers, answer_starts = [], [], [], []

    for line in iterator:
        context, question, answer, answer_start = line


        contexts.append(context)
        questions.append(question)
        answers.append(answer[0])
        answer_starts.append(answer_start[0])

    data_dict = {
        "context": contexts,
        "question": questions,
        "answer": answers,
        "answer_start": answer_starts
    }

    df = pd.DataFrame(data_dict)
    
    return df

df_train = convert_to_df(train)
df_test = convert_to_df(test)

_Data Checks_

In [30]:
print(df_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87598 entries, 0 to 87597
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   context       87598 non-null  object
 1   question      87598 non-null  object
 2   answer        87598 non-null  object
 3   answer_start  87598 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 2.7+ MB
None


In [31]:
# get a feel for what the data looks like
df_train.sample(10)

Unnamed: 0,context,question,answer,answer_start
73696,The small landowner-cultivators formed the maj...,What type of housing did the Han government pr...,temporary,472
67311,"Before the French and Indian War, the Appalach...",What did the backcountry settlers want to secure?,their settlement of Kentucky,1059
23553,"In military affairs, the use of infantry with ...",Against whom was gunpowder used in 1304?,the Scots,629
3917,"The Battle of Long Island, the largest battle ...",When did the English army start to retreat and...,1783,675
27164,"Computers control functions at many utilities,...",How did the Stuxnet worm infect industrial equ...,via removable media,530
43748,"In 1866, the feud between Austria and Prussia ...",What was Germany going to be called if Prussia...,Little Germany,503
65463,The population of Paris in its administrative ...,Who created The Paris Urban Area?,INSEE,534
13397,The Government was known officially as the Cou...,What was the RSFSR government called starting ...,Council of Ministers,87
6445,Gautama was now determined to complete his spi...,How old was the Buddha at the time of his death?,80,713
86247,Bronx gang life was depicted in the 1974 novel...,When was 'A Bronx Tale' released?,1993,442


__Build a Vocab__

In [34]:
SOS_token = 0
EOS_token = 1

class Vocab:
    """ This vocabulary class cleans and indexes words.
    """
    def __init__(self, name):
        self.name = name
        self.index2word = {0: "<SOS>", 1: "<EOS>"}
        self.word2index = {"<SOS>": 0, "<EOS>": 1}
        self.word2count = {}
        self.count = 2 # count SOS and EOS
    
    # Clean words before adding them to vocab object
    def cleanText(self, text):
        return prepare_text(text)
    
    def addSentence(self, sentence):
        for word in sentence.split(" "):
            self.addWord(word)
    
    # Index words in our vocabulary
    def addWord(self, word):
        if word not in self.word2index: # if word not in index
            self.word2index[word] = self.count # add word and word no to words dictionary
            self.index2word[self.count] = word # add word to index
            self.word2count[word] = 1 # initialise word count
            self.count +=1
            return True
        else:
            self.word2count[word] += 1 # increment word count
            return False

In [35]:
def normalizeString(s):
    s = s.lower().strip() # convert to lower, remove excess spaces
    s = re.sub(r"[^a-zA-Z.!?0-9]+", r" ", s) # remove all non-letter characters
    return s

pairs_train = [[normalizeString(q), normalizeString(a)] 
               for i, (q, a) in df_train[["question", "answer"]].iterrows()]
pairs_test = [[normalizeString(q), normalizeString(a)] 
               for i, (q, a) in df_test[["question", "answer"]].iterrows()]

In [41]:
# print random sentence and answer
random_pair = random.choice(pairs_train)
print(random_pair)

['what union are yale s clerical and technical employees a part of?', 'local 34 of unite here']


In [37]:
# create and populate vocab object
vocab = Vocab("squad_1")
for question, answer in pairs_train:
    vocab.addSentence(question)
    vocab.addSentence(answer)

In [38]:
# check words added
print(f"Added {vocab.count} words")

Added 66781 words


In [39]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def indexesFromSentence(sentence, vocab=vocab):
    # returns a list of indices representing the input sentence
    return [vocab.word2index[word] for word in sentence.split(" ")]

def tensorFromSentence(sentence, vocab=vocab):
    # appends a EOS token and returns a tensor list of indices representing the input sentence
    indexes = indexesFromSentence(sentence)
    indexes.append(EOS_token)
    # view(-1, 1) specifies that we want the shape of 1 column and whatever number of rows to achieve that shape
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    # combines the function above and returns question and answer tensors
    question_tensor = tensorFromSentence(pair[0])
    answer_tensor = tensorFromSentence(pair[1])
    return (question_tensor, answer_tensor)

In [42]:
# test functions
print(random_pair)
tensorsFromPair(random_pair)

['what union are yale s clerical and technical employees a part of?', 'local 34 of unite here']


(tensor([[    2],
         [  693],
         [   60],
         [  535],
         [   48],
         [36923],
         [   30],
         [14776],
         [ 6390],
         [   12],
         [  383],
         [  848],
         [    1]], device='cuda:0'),
 tensor([[ 4354],
         [ 7789],
         [    6],
         [14456],
         [ 2629],
         [    1]], device='cuda:0'))

__Download the gensim embedding__

In [43]:
glove_vectors = gensim.downloader.load("glove-wiki-gigaword-300")



_Explore embedding model_

In [44]:
embedding_dim = glove_vectors.vectors.shape[1]
print(embedding_dim)

300


In [45]:
glove_vectors.get_index("man")

300

In [46]:
glove_vectors.most_similar("man")

[('woman', 0.6998663544654846),
 ('person', 0.6443442106246948),
 ('boy', 0.6208277940750122),
 ('he', 0.5926738381385803),
 ('men', 0.5819568634033203),
 ('himself', 0.5810033082962036),
 ('one', 0.5779521465301514),
 ('another', 0.5721587538719177),
 ('who', 0.5703631639480591),
 ('him', 0.5670831203460693)]

In [47]:
glove_vectors["man"]

array([-0.29784  , -0.13255  , -0.14505  , -0.22752  , -0.027429 ,
        0.11005  , -0.039245 , -0.0089607, -0.18866  , -1.1213   ,
        0.34793  , -0.30056  , -0.50103  , -0.031383 , -0.032185 ,
        0.018318 , -0.090429 , -0.14427  , -0.14306  , -0.057477 ,
       -0.020931 ,  0.56276  , -0.018557 ,  0.15168  , -0.25586  ,
       -0.081564 ,  0.2803   , -0.10585  , -0.16777  ,  0.21814  ,
       -0.11845  ,  0.56475  , -0.12645  , -0.062461 , -0.68043  ,
        0.10507  ,  0.24793  , -0.20249  , -0.30726  ,  0.42815  ,
        0.38378  , -0.19371  , -0.075951 , -0.058287 , -0.067195 ,
        0.2192   ,  0.56116  , -0.28156  , -0.13705  ,  0.45754  ,
       -0.14671  , -0.18562  , -0.074146 ,  0.60737  ,  0.07952  ,
        0.41023  ,  0.18377  , -0.08532  ,  0.43795  , -0.34727  ,
        0.2077   ,  0.50454  ,  0.40244  ,  0.1095   , -0.48078  ,
       -0.22372  , -0.54619  , -0.20782  ,  0.13751  , -0.16206  ,
       -0.24835  ,  0.17124  ,  0.037355 ,  0.14547  , -0.0562

_Create weights matrix to map each word embedding to the word in the vocab_

In [48]:
matrix_len = len(vocab.word2index) # length of vocab
weights_matrix = np.zeros((matrix_len, embedding_dim)) # initialise empty weights matrix
words_found = 0

# add words to weights matrix
for i, word in enumerate(vocab.word2index):
    try:
        weights_matrix[i] = glove_vectors[word] # map to glove vector word embedding
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim, )) # insert random weights

In [49]:
# check
weights_matrix[vocab.word2index["man"]] - glove_vectors["man"]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

__Define Model Structure__

In [None]:
class Encoder(nn.Module):
    
    def __init__(self, input_size, embedding_size, hidden_size, num_layers=1):
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(input_size, embedding_size)
        # copy loaded weights matrix into embedding weights - a vector representation of the model word inputs
        #self.embedding.weight.data.copy_(torch.from_numpy(weights_matrix))
        # initialise lstm to take input dimension of embedding size and output hidden dimension
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers=num_layers)
        
    
    def forward(self, src):
        
        '''
        Inputs: src, the src vector
        Outputs: output, the encoder outputs
                hidden, the hidden state
                cell, the cell state
        '''
        # src shape: (seq_length, batch_size)
        embedding = self.embedding(src).view(1, 1, -1)
        # embedding shape: (seq_length, batch_size, embedding_size)
        output, (hidden, cell) = self.lstm(embedding) # output, hidden and cell state
        
        return hidden, cell
    
    def initHidden(self):# initialise zero tensor with shape (1, 1, hidden_size)
        return torch.zeros(1, 1, self.hidden_size, device=device)
    

class Decoder(nn.Module):
      
    def __init__(self, input_size, embedding_size, hidden_size, output_size, num_layers=1):
        super(Decoder, self).__init__()
        
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_size = output_size # length of vocab
        self.num_layers = num_layers        
        
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(output_size, embedding_size)
        #self.embedding.weight.data.copy_(torch.from_numpy(weights_matrix))
        # self.lstm, accepts the embeddings and outputs a hidden state
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers=num_layers)
        # self.output, predicts on the hidden state via a linear output layer
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, trg, hidden, cell):
        
        '''
        Inputs: trg, the target vector
        Outputs: predictions, the predictions
                hidden, the hidden state
                cell, the cell state
        '''
        # shape of src is batch_size, but we want (1, batch_size)
        # we want batch_size batches of a single word at a time
        trg = trg.unsqueeze(0) # add one dimension
        embedding = self.embedding(trg).view(1, 1, -1)
        # embedding shape: (1, batch_size, embedding_size)
        output, (hidden, cell) = self.lstm(embedding, (hidden, cell))
        # shape of outputs: (1, batch_size, hidden_size)
        predictions = self.softmax(self.fc(output[0]))
        # shape of predictions: (1, batch_size, length_of_vocab)        
        
        return predictions, hidden, cell
    
    def initHidden(self): # initialise zero tensor with shape (1, 1, hidden_size)
        return torch.zeros(1, 1, self.hidden_size, device=device)
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder, decoder, device):
        
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device=device
    
    # teacher forcing ratio - switch between using target and prediction words as inputs for next word
    def forward(self, src, trg, max_trg_len=100, teacher_forcing_ratio = 0.5):
        batch_size = src.shape[1]
        
        input_length = src.size(0) # how many question words
        target_length = trg.size(0) if self.training else max_trg_len# how many answer words
        
        target_vocab_size = len(vocab.word2index)
        
        outputs = torch.zeros(target_length, batch_size, target_vocab_size).to(self.device)
        
        for i in range(input_length):
            encoder_hidden, cell = self.encoder(src[i]) # process and encode the entire question
            
        decoder_hidden = encoder_hidden # initialise decoder hidden state with encoder hidden state
        decoder_input = torch.LongTensor([[SOS_token]]).to(device) # add SOS token as first predicted word
        
        for t in range(1, target_length):
            output, decoder_hidden, cell = self.decoder(decoder_input, decoder_hidden, cell)
            
            outputs[t] = output # add decoder predictions array to outputs
            # (batch_size, vocab_size)
            best_guess = output.argmax(1) # get index of best word guess
            teacher_force = False # initialise
            if self.training:
                # use target word if teacher forcing, else use word with the highest predicted value
                teacher_force = random.random() < teacher_forcing_ratio # update where relevant
                decoder_input = trg[t] if teacher_force else best_guess 
            else:
                decoder_input = best_guess
                
            if (teacher_force == False and decoder_input.item() == EOS_token):
                break
            
        return outputs

In [21]:
#torch.zeros(target_length, batch_size, target_vocab_size)
b = torch.zeros(10, 1, 10)
b.shape

torch.Size([10, 1, 10])

In [24]:
b[1:].view(-1, b.shape[-1]).shape

torch.Size([9, 10])

In [66]:
torch.LongTensor([[SOS_token]]).to(device)

tensor([[0]], device='cuda:0')

__Train the model__

In [54]:
# Helper functions

def showPlot(points):
    # This function plots the input points and sets defined tick intervals on the y axis
    plt.figure()
    fig, ax = plt.subplots()
    
    loc = ticker.MultipleLocator(base=0.2) # define tick intervals
    ax.yaxis.set_major_locator(loc) # set y axis tick intervals
    plt.plot(points)
    
def asMinutes(s):
    # format seconds as minutes
    m = math.floor(s / 60)
    s -= m * 60
    return "%dm %ds" % (m, s)

def timeSince(since, percent):
    # calculate the time between now and since
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return "%s (- %s)" % (asMinutes(s), asMinutes(rs))

In [55]:
def calcModel(model, input_tensor, target_tensor, model_optimizer, criterion):
    model_optimizer.zero_grad() # don't accumulate gradient
    
    loss = 0
    epoch_loss = 0
    
    output = model(input_tensor, target_tensor)
    
    num_iter = output.size(0) # number of predicted words
    
    # calculate loss from predicted sentence with expected result
    for ot in range(num_iter):
        loss += criterion(output[ot], target_tensor[ot])
        
    loss.backward() # calculate gradients
    model_optimizer.step() # update weights
    epoch_loss = loss.item() / num_iter # avg loss
    
    return epoch_loss

def trainModel(model, pairs, num_iterations=10000, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    model.train()
    
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss() # crossentropyloss = softmax + NLLLoss
    plot_losses = []
    print_loss_total = 0 # reset every print_every
    plot_loss_total = 0 # reset every plot_every
    
    training_pairs = [tensorsFromPair(random.choice(pairs)) for _ in range(num_iterations)]
    
    for iter_ in range(1, num_iterations+1):
        training_pair = training_pairs[iter_ - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]
        
        loss = calcModel(model, input_tensor, target_tensor, optimizer, criterion)
        
        print_loss_total += loss
        plot_loss_total += loss        
        
        if iter_ % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print("{0:s} ({1:d} {2:.0f}%), {3:.4f}". format(timeSince(start, iter_ / num_iterations),
                                                 iter_, iter_ / num_iterations * 100, print_loss_avg))
            
        if iter_ % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
            
    showPlot(plot_losses)
            
    torch.save(model.state_dict(), "mytraining.pt")
    return model

In [53]:
# Model hyperparameters
vocab_length = len(vocab.word2index)
input_size_enc = vocab_length
input_size_dec = vocab_length
output_size = vocab_length
enc_embed_size = embedding_dim
dec_embed_size = embedding_dim
hidden_size = 1024
num_layers = 1
teacher_forcing_ratio = 0.5

# Training hyperparameters
learning_rate = 0.01
num_iterations = 1000

# initialise models
encoder = Encoder(input_size_enc, enc_embed_size, hidden_size, num_layers=num_layers)
decoder = Decoder(input_size_dec, dec_embed_size, hidden_size, output_size, num_layers=num_layers)
model = Seq2Seq(encoder, decoder, device).to(device)

In [56]:
# train model
model_1 = trainModel(model, pairs_train, learning_rate=learning_rate, num_iterations=num_iterations, print_every=500)

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

In [48]:
test = tensorsFromPair(random.choice(pairs_train))
test

(tensor([[  379],
         [ 1382],
         [   21],
         [  785],
         [    7],
         [ 3882],
         [30209],
         [12375],
         [   92],
         [14637],
         [    4],
         [35020],
         [    1]], device='cuda:0'),
 tensor([[48967],
         [49067],
         [    1]], device='cuda:0'))

In [96]:
a = model_1(test[0], test[1])

In [98]:
a[2]

tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0',
       grad_fn=<SelectBackward>)

In [74]:
def evaluate(model, source, max_trg_len=100):
    model.eval()
    with torch.no_grad():
        input_tensor = tensorFromSentence(source)
        input_length = input_tensor.size()[0]
        
        decoded_words = []
        
        output = model(input_tensor, None)
        
        for ot in range(max_trg_len):
            topv, topi = output[ot].topk(1)
            
            if topi[0].item() == EOS_token:
                decoded_words.append("<EOS>")
                break
            else:
                decoded_words.append(vocab.index2word[topi[0].item()])
    
    return decoded_words

In [68]:
def evaluateRandomly(model, pairs=pairs_test, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print(f"source: {pair[0]}")
        print(f"target: {pair[1]}")
        
        output_words = evaluate(model, pair)
        output_sentence = " ".join(output_words)
        print("predicted {}".format(output_sentence))
        
def evaluateInput(model, input_sentence):
    
    output_words = evaluate(model, input_sentence)
    output_sentence = " ".join(output_words)
    print("predicted {}".format(output_sentence))

In [89]:
sent = "when did the queen die?"
evaluateInput(model_1, sent)

predicted SOS <EOS>


In [72]:
tensorFromSentence(sent)

tensor([[  38],
        [  39],
        [   7],
        [1786],
        [4036],
        [   1]], device='cuda:0')