# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [None]:
pip install torchdata torchsummary --upgrade typing-extensions torchtext torchvision

In [18]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [20]:
# load brown embeddings

# from nltk.corpus import brown

# nltk.download('brown')
# nltk.download('punkt')

# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

# w2v = gensim.models.Word2Vec.load('brown.embedding')

In [21]:
# load and explore SQuAD1 data

from torchtext.datasets import SQuAD1

train, test = SQuAD1()   

def LoadSQuAD(data):
    df = {"question": [], "answer": []}
    index = 0
    for context, question, answers, indices in data:
        if answers[0]:
            df["question"].append(question)
            df["answer"].append(answers[0])
        index += 1
    df_complete = pd.DataFrame.from_dict(df)
    SRC = df_complete["question"]
    TRG = df_complete["answer"]
    return df_complete, SRC, TRG
    
SRC_and_TRG_train_complete, SRC_train_complete, TRG_train_complete = LoadSQuAD(train)
len_val_data = SRC_train_complete.shape[0]//10

SRC_train = SRC_train_complete.iloc[len_val_data:]
SRC_val = SRC_train_complete.iloc[:len_val_data]
TRG_train = TRG_train_complete.iloc[len_val_data:]
TRG_val = TRG_train_complete.iloc[:len_val_data]

_, SRC_test, TRG_test = LoadSQuAD(test)

print('There are {} questions and {} answers in the training dataset.'.format(SRC_train.shape[0], TRG_train.shape[0]))
print('There are {} questions and {} answers in the validation dataset.'.format(SRC_val.shape[0], TRG_val.shape[0]))
print('There are {} questions and {} answers in the test dataset.'.format(SRC_test.shape[0], TRG_test.shape[0]))
SRC_train.head()

There are 78840 questions and 78840 answers in the training dataset.
There are 8759 questions and 8759 answers in the validation dataset.
There are 10570 questions and 10570 answers in the test dataset.


8759    What important neopragmatist was Harthorne's s...
8760    How was Whitehead's theory of gravitation rece...
8761    What physicists in the field of quantum theory...
8762    What affect  did the discovery of gravitationa...
8763                        What are gravitational waves?
Name: question, dtype: object

In [22]:
# define a vocabulary class

class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {}
        self.count = 0
        self.words = {}
    
    # tokenize each sentence
    def prepareText(self, text):
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
        tokens = tokenizer.tokenize(text)
        return tokens
    
    # create a list of all words contained in the text
    def indexWord(self, word):
        if word not in self.words:
            self.words[word] = self.count
            self.index[str(self.count)] = word
            self.count += 1
            return True
        else:
            return False
    
    # take in a sentence and returns a list of integers 
    def indexSentences(self, sentence):
        tokens = self.prepareText(sentence)
        return [self.words[token] for token in tokens]
    
    # fill a vocabulary object with contents
    def fillVocab(self, series, print_every=1000):
        self.indexWord('<pad>')
        
        count = 0
        for sentence in series:
            text = self.prepareText(sentence)
            for t in text:
                if(self.indexWord(t)):
                    if count % print_every == 0:
                        print('Adding word {} to our vocabulary.'.format(count))
                    count += 1
        print('Added {} words to vocabulary.'.format(len(self.words)))

In [23]:
# instantiate a vocabulary object and fill it 
vocab = Vocab(name='SQuAD1_vocab')
SRC_and_TRG_complete = pd.concat([SRC_train, TRG_train, SRC_val, TRG_val, SRC_test, TRG_test])
vocab.fillVocab(SRC_and_TRG_complete, 10000)

Adding word 0 to our vocabulary.
Adding word 10000 to our vocabulary.
Adding word 20000 to our vocabulary.
Adding word 30000 to our vocabulary.
Adding word 40000 to our vocabulary.
Adding word 50000 to our vocabulary.
Adding word 60000 to our vocabulary.
Added 64259 words to vocabulary.


In [24]:
# print out first 30 items of the vocabulary
dict(list(vocab.words.items())[:30]).items()

dict_items([('<pad>', 0), ('What', 1), ('important', 2), ('neopragmatist', 3), ('was', 4), ('Harthorne', 5), ('s', 6), ('student', 7), ('How', 8), ('Whitehead', 9), ('theory', 10), ('of', 11), ('gravitation', 12), ('received', 13), ('physicists', 14), ('in', 15), ('the', 16), ('field', 17), ('quantum', 18), ('have', 19), ('been', 20), ('influenced', 21), ('by', 22), ('affect', 23), ('did', 24), ('discovery', 25), ('gravitational', 26), ('waves', 27), ('on', 28), ('are', 29)])

In [25]:
# index and pad sentences to length of the longest sentence in the data set
from torch.nn.utils.rnn import pad_sequence
from torch import LongTensor

SRC_train_indices = [vocab.indexSentences(s) for s in SRC_train]
TRG_train_indices = [vocab.indexSentences(s) for s in TRG_train]
SRC_val_indices = [vocab.indexSentences(s) for s in SRC_val]
TRG_val_indices = [vocab.indexSentences(s) for s in TRG_val]
SRC_test_indices = [vocab.indexSentences(s) for s in SRC_test]
TRG_test_indices = [vocab.indexSentences(s) for s in TRG_test]

In [26]:
# pad sequences to max_length
def padSequences(sequences, max_len):
    padded_sequences = []
    for s in sequences:
        
        # calculate the number of padding tokens needed
        num_padding = max_len - len(s)
        
        # create a new sequence with padding tokens added to the end
        padded_sequence = s + [vocab.words['<pad>']] * num_padding
        
        # convert the sequence to a LongTensor and add it to the list
        padded_sequences.append(LongTensor(padded_sequence))
    return padded_sequences

# determine the maximum length of sentences
max_len = max(max(len(s) for s in SRC_train_indices), 
              max(len(s) for s in TRG_train_indices), 
              max(len(s) for s in SRC_val_indices), 
              max(len(s) for s in TRG_val_indices),
              max(len(s) for s in SRC_test_indices),
              max(len(s) for s in TRG_test_indices))

SRC_train_pad = torch.stack(padSequences(SRC_train_indices, max_len))
TRG_train_pad = torch.stack(padSequences(TRG_train_indices, max_len))
SRC_val_pad = torch.stack(padSequences(SRC_val_indices, max_len))
TRG_val_pad = torch.stack(padSequences(TRG_val_indices, max_len))
SRC_test_pad = torch.stack(padSequences(SRC_test_indices, max_len))
TRG_test_pad = torch.stack(padSequences(TRG_test_indices, max_len))    

In [27]:
print(max_len)

43


In [28]:
# create data loaders
from torch.utils.data import TensorDataset, DataLoader
batch_size = 64  

train_data = TensorDataset(SRC_train_pad, TRG_train_pad)
val_data = TensorDataset(SRC_val_pad, TRG_val_pad)
test_data = TensorDataset(SRC_test_pad, TRG_test_pad)

train_loader = DataLoader(train_data, batch_size=batch_size)
val_loader = DataLoader(val_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)

In [29]:
# for i in train_loader:
#     print(type(i))
#     print(len(i))
#     print(i)
#     for j in i:
#         print(type(j))
#         print(j.dim())
#         print(len(j))
#         display(j)
#         for k in j:
#             print(type(k))
#             print(k.dim())
#             print(len(k))
#             print(k)
#             break
#         break
#     break

lstm cell needs arguments input, (hidden state, cell state)'
where for batched data input is (sequence lengt, batch size, input size)

In [30]:
import torch.nn as nn

# Encoder, Decoder and Seq2Seq modules
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, drop_prob=0.5):
        
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.embedding_size = embedding_size
        
        # nn.Embedding provides a vector representation of the input
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        
        # nn.LSTM expects the arguments [input, (hidden state, cell state)]
        # for batched data input is expected to be (sequence lengt, batch size, input size)
        # batch_first=True changes the order to (batch size, sequence length, input size)
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, batch_first=True)
        
        self.dropout = nn.Dropout(p=drop_prob)
    
    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        
        print('ENCODER')
        print('encoder input before embedding:', i.size())
        
        embedded = self.embedding(i)
        
        print('embedding dimension:', embedded.dim())
        print('embedding size:', embedded.size())
        
        embedded = self.dropout(embedded)
        o, (h, c) = self.lstm(embedded)
        
        print('lstm output size:', o.size())
        print('lstm hidden size:', h.size())
        print('lstm cell state size:', c.size())
        print(' ')
        
        return h, c
    

class Decoder(nn.Module):
      
    def __init__(self, output_size, embedding_size, hidden_size):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size        
        
        self.embedding = nn.Embedding(output_size, embedding_size)
        
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)
         
        self.output = nn.Linear(hidden_size, output_size)
        
        
    def forward(self, i, h, c):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        
        print('DECODER')
        print('initial input size before unsqueezing:', i.size())
        
        i = i.unsqueeze(0)
        
        print('decoder input before embedding (i.unsqueeze(0)):', i.size())
        
        embedded = self.embedding(i)

        print('embedding dimension:', embedded.dim())
        print('embedding size:', embedded.size())
        
        o, (h, c) = self.lstm(embedded, (h, c))
        
        print('lstm output size:', o.size())

        o = self.output(o.squeeze(0))
        
        print('decoder output size:', o.size())
        print(' ')
        
        return o, h, c       
                

class Seq2Seq(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, output_size, device=device):
        
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder(input_size, hidden_size, embedding_size, drop_prob=drop_prob)
        self.decoder = Decoder(output_size, embedding_size, hidden_size)
        
        assert self.encoder.hidden_size == self.decoder.hidden_size, \
            'hidden dimensions of encoder and decoder must be equal.'
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):      
        
        # create empty output tensor with shape (length of trg, batch size, trg vocab size)
        # that will later be filled with the predictions of the decoder
        outputs = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.output_size).to(device)

        # use last hidden state of encoder as initial state for decoder
        decoder_hidden, decoder_cell = self.encoder(src)
        
        decoder_input = trg[0, :]
    
        print('SEQ2SEQ OUTPUT')
        print('size of initialized outputs variable:', outputs.size())
        print('size of one element in outputs:', outputs[0].size())
        print(' ')
        
        # loop through elements in batch
        for t in range(1, trg.shape[0]):
            decoder_output, decoder_hidden, decoder_cell = self.decoder(decoder_input, decoder_hidden, decoder_cell)
            outputs[t] = decoder_output
            teacher_force = torch.rand(1) < teacher_forcing_ratio
            # use token with highest score as output
            top1 = decoder_output.argmax(1)
            decoder_input = trg[t] if teacher_force else top1
                 
        return outputs

In [31]:
# training loop
def train(model, train_loader, criterion, optimizer, device=device):
    model.train()
    total_loss = 0.0
    
    for src, trg in train_loader:
        src = src.to(device)
        trg = trg.to(device)
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        # reshape output and target to calculate loss
        # (slice off the first column and flatten output to 2 dim)
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        loss = criterion(output, trg)
        loss.backward()
        
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(train_loader)


def evaluate(model, val_loader, criterion, device=device):
    model.eval()
    total_loss = 0.0
    
    with torch.no_grad():
        for src, trg in val_loader:
            src = src.to(device)
            trg = trg.to(device)

            output = model(src, trg, teacher_forcing_ratio=0.0)

            # reshape output and target to calculate loss
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            total_loss += loss.item()
    
    return total_loss / len(val_loader)

In [32]:
# hyperparameters
input_size = len(vocab.words)
output_size = len(vocab.words)
### im tutorial auf https://www.kaggle.com/code/columbine/seq2seq-pytorch
### INPUT_DIM = len(SRC.vocab)
### OUTPUT_DIM = len(TRG.vocab)
### ggf prüfen

embedding_size = 256
hidden_size = 512
num_epochs = 10
learning_rate = 0.001
drop_prob = 0.5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [33]:
# initialize the model, optimizer and loss function
model = Seq2Seq(input_size, hidden_size, hidden_size, output_size)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=vocab.words['<pad>'])

In [34]:
# initialize the minimum validation loss
min_val_loss = float('inf')

# training
for epoch in range(num_epochs):
    train_loss = train(model, train_loader, criterion, optimizer, device=device)
    val_loss = evaluate(model, val_loader, criterion, device=device)
    
    print(f'Epoch: {epoch+1}')
    print(f'Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}"')
    print(f'Val Loss: {val_loss:.3f} | Val PPL: {math.exp(val_loss):7.3f}"')
    
    # save the model if the validation loss is at a minimum value
    if valid_loss < min_val_loss:
        min_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')

ENCODER
encoder input before embedding: torch.Size([64, 43])
embedding dimension: 3
embedding size: torch.Size([64, 43, 512])
lstm output size: torch.Size([64, 43, 512])
lstm hidden size: torch.Size([1, 64, 512])
lstm cell state size: torch.Size([1, 64, 512])
 
SEQ2SEQ OUTPUT
size of initialized outputs variable: torch.Size([64, 43, 64259])
size of one element in outputs: torch.Size([43, 64259])
 
DECODER
initial input size before unsqueezing: torch.Size([43])
decoder input before embedding (i.unsqueeze(0)): torch.Size([1, 43])
embedding dimension: 3
embedding size: torch.Size([1, 43, 512])


RuntimeError: Expected hidden[0] size (1, 1, 512), got [1, 64, 512]

PPL stands for "perplexity". According to https://www.educative.io/answers/what-is-perplexity-in-nlp,

*Perplexity is a standard that evaluates how well a probability model can predict a sample. When applied to language models like GPT, it represents the exponentiated average negative log-likelihood of a sequence. In essence, a lower perplexity score suggests that the model has a higher certainty in its predictions.*

See https://towardsdatascience.com/perplexity-in-language-models-87a196019a94 for further information.

In [None]:
best_model = Seq2Seq(encoder, decoder, device).to(device)
best_model.load_state_dict(torch.load('best_model.pt'))

test_loss = evaluate(model, test_loader, criterion)
 
print(f"Test Loss : {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}")
    
test()

Helpful tutorial:
https://www.kaggle.com/code/columbine/seq2seq-pytorch