# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
pip install torchdata --upgrade typing-extensions torchtext torchvision

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata
  Downloading torchdata-0.5.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
[K     |████████████████████████████████| 4.6 MB 4.4 MB/s eta 0:00:01
[?25hCollecting typing-extensions
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting torchtext
  Downloading torchtext-0.14.1-cp37-cp37m-manylinux1_x86_64.whl (2.0 MB)
[K     |████████████████████████████████| 2.0 MB 62.8 MB/s eta 0:00:01
[?25hCollecting torchvision
  Downloading torchvision-0.14.1-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
[K     |████████████████████████████████| 24.2 MB 64.4 MB/s eta 0:00:01
Collecting torch==1.13.1
  Downloading torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
[K     |████████████████████████████████| 887.5 MB 6.3 kB/s  eta 0:00:01     |█████▏                          | 142.0 MB 45.2 MB/s eta 0:00:17     |██████████████████▊             | 518.6 MB 80.

In [1]:
import numpy as np
import pandas as pd
import nltk, time, math
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, DataLoader

In [2]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('device:', device)

device: cuda:0


### Loading and Exploration of Data

In [3]:
# load SQuAD1 data

from torchtext.datasets import SQuAD1
train, test = SQuAD1()   

def LoadSQuAD(data):
    df = {"question": [], "answer": []}
    index = 0
    for context, question, answers, indices in data:
        if answers[0]:
            df["question"].append(question)
            df["answer"].append(answers[0])
        index += 1
    df_complete = pd.DataFrame.from_dict(df)
    SRC = df_complete["question"]
    TRG = df_complete["answer"]
    return SRC, TRG
    
SRC_train, TRG_train = LoadSQuAD(train)
SRC_test, TRG_test = LoadSQuAD(test)

# explore dataset
print('There are {} questions and {} answers in the training dataset.'.format(SRC_train.shape[0], TRG_train.shape[0]))
print('There are {} questions and {} answers in the test dataset.'.format(SRC_test.shape[0], TRG_test.shape[0]))
SRC_train.head()

There are 87599 questions and 87599 answers in the training dataset.
There are 10570 questions and 10570 answers in the test dataset.


0    To whom did the Virgin Mary allegedly appear i...
1    What is in front of the Notre Dame Main Building?
2    The Basilica of the Sacred heart at Notre Dame...
3                    What is the Grotto at Notre Dame?
4    What sits on top of the Main Building at Notre...
Name: question, dtype: object

In [4]:
# reduce size of training set to make training faster
SRC_train = SRC_train.iloc[:10000]
TRG_train = TRG_train.iloc[:10000]
print('There are now {} questions and {} answers in the training dataset.'.format(SRC_train.shape[0], TRG_train.shape[0]))

There are now 10000 questions and 10000 answers in the training dataset.


### Building a Vocabulary

In [5]:
# define a vocabulary class
class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {}
        self.count = 0
        self.words = {}
    
    # tokenize each sentence
    def prepareText(self, text):
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
        tokens = tokenizer.tokenize(text)
        return tokens
    
    # create a list of all words contained in the text
    def indexWord(self, word):
        if word not in self.words:
            self.words[word] = self.count
            self.index[str(self.count)] = word
            self.count += 1
            return True
        else:
            return False
    
    # takes in a sentence and returns a list of integers 
    def indexSentences(self, sentence):
        tokens = self.prepareText(sentence)
        return [self.words[token] for token in tokens]
    
    # takes in a sequence of integers and returns a list of words
    def IndexToWord(self, sentence):
        return [self.index[str(word)] for word in sentence]
    
    # fill a vocabulary object with contents
    def fillVocab(self, series, print_every=1000):
        
        # add 'pad' as word to vocabulary to account for padding  
        self.indexWord('<pad>')
        
        count = 0
        for sentence in series:
            text = self.prepareText(sentence)
            for t in text:
                if(self.indexWord(t)):
                    if count % print_every == 0:
                        print('Adding word {} to our vocabulary.'.format(count))
                    count += 1
        print('Added {} words to vocabulary.'.format(len(self.words)))

In [6]:
# instantiate and fill a vocabulary object
vocab = Vocab(name='SQuAD1_vocab')
SRC_and_TRG_complete = pd.concat([SRC_train, TRG_train, SRC_test, TRG_test])
vocab.fillVocab(SRC_and_TRG_complete, 10000)

Adding word 0 to our vocabulary.
Adding word 10000 to our vocabulary.
Adding word 20000 to our vocabulary.
Added 24112 words to vocabulary.


In [7]:
# print out first 30 items of the vocabulary
dict(list(vocab.words.items())[:30]).items()

dict_items([('<pad>', 0), ('To', 1), ('whom', 2), ('did', 3), ('the', 4), ('Virgin', 5), ('Mary', 6), ('allegedly', 7), ('appear', 8), ('in', 9), ('1858', 10), ('Lourdes', 11), ('France', 12), ('What', 13), ('is', 14), ('front', 15), ('of', 16), ('Notre', 17), ('Dame', 18), ('Main', 19), ('Building', 20), ('The', 21), ('Basilica', 22), ('Sacred', 23), ('heart', 24), ('at', 25), ('beside', 26), ('to', 27), ('which', 28), ('structure', 29)])

In [8]:
# turn words into indices
SRC_train_indices = [vocab.indexSentences(s) for s in SRC_train]
TRG_train_indices = [vocab.indexSentences(s) for s in TRG_train]
SRC_test_indices = [vocab.indexSentences(s) for s in SRC_test]
TRG_test_indices = [vocab.indexSentences(s) for s in TRG_test]

In [9]:
# takes in a sequence of integers and pads it to max_length
def padSequences(sequences, max_len):
    padded_sequences = []
    for s in sequences:
        
        # calculate the number of padding tokens needed
        num_padding = max_len - len(s)
        
        # create a new sequence with padding tokens added to the end
        padded_sequence = s + [vocab.words['<pad>']] * num_padding
        
        # convert the sequence to a LongTensor and add it to the list
        padded_sequences.append(torch.LongTensor(padded_sequence))
    return padded_sequences

# determine the maximum length of sentences in the dataset
max_len = max(max(len(s) for s in SRC_train_indices), 
              max(len(s) for s in TRG_train_indices), 
              max(len(s) for s in SRC_test_indices),
              max(len(s) for s in TRG_test_indices))

# pad sequences to max_length
SRC_train_pad = torch.stack(padSequences(SRC_train_indices, max_len))
TRG_train_pad = torch.stack(padSequences(TRG_train_indices, max_len))
SRC_test_pad = torch.stack(padSequences(SRC_test_indices, max_len))
TRG_test_pad = torch.stack(padSequences(TRG_test_indices, max_len))    

In [10]:
# create data loaders
batch_size = 32  

train_data = TensorDataset(SRC_train_pad, TRG_train_pad)
test_data = TensorDataset(SRC_test_pad, TRG_test_pad)

train_loader = DataLoader(train_data, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, batch_size=batch_size, drop_last=True)

In [11]:
# delete stored variables that are not needed anymore
# to free up memory space
del(SRC_and_TRG_complete, 
    SRC_test, SRC_test_indices, SRC_test_pad, 
    SRC_train, SRC_train_indices, SRC_train_pad, 
    TRG_test, TRG_test_indices, TRG_test_pad, 
    TRG_train, TRG_train_indices, TRG_train_pad,
    train, test, train_data, test_data)

### Model Architecture

In [12]:
# Encoder, Decoder and Seq2Seq modules
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, drop_prob):
        
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.embedding_size = embedding_size
        
        # nn.Embedding provides a vector representation of the input
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        
        # nn.LSTM takes the arguments [input, (hidden state, cell state)] 
        # where for batched data input is expected to be (sequence lengt, batch size, input size).
        # batch_first=True changes the order to (batch size, sequence length, input size), but
        # with swapped batch and sequence dimensions, it's crucial to ensure that all batch-related 
        # indexes are handled on index 0 while sequence-related indexes are processed on index 1
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, batch_first=True)

        self.dropout = nn.Dropout(p=drop_prob)
    
    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: h, the hidden state
                c, the cell state
        '''
        embedded = self.embedding(i)
        embedded = self.dropout(embedded)
        o, (h, c) = self.lstm(embedded)
        
        return h, c
    

class Decoder(nn.Module):
      
    def __init__(self, output_size, embedding_size, hidden_size):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size        
        
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, output_size)
        
        
    def forward(self, i, h, c):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''       
        i = i.unsqueeze(1)
        embedded = self.embedding(i)
        o, (h, c) = self.lstm(embedded, (h, c))
        o = self.output(o.squeeze(0))
        
        return o, h, c       
                

class Seq2Seq(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, output_size, drop_prob, device=device, ):
        
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder(input_size, hidden_size, embedding_size, drop_prob)
        self.decoder = Decoder(output_size, embedding_size, hidden_size)
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):      
        
        # create empty output tensor with shape (batch size, length of trg, trg vocab size)
        # that will later be filled with the predictions of the decoder
        outputs = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.output_size).to(device)

        # use last hidden state of encoder as initial state for decoder
        decoder_hidden, decoder_cell = self.encoder(src)
        
        decoder_input = trg[:, 0]

        # loop through elements in batch
        for t in range(1, trg.shape[1]):
            decoder_output, decoder_hidden, decoder_cell = self.decoder(decoder_input, decoder_hidden, decoder_cell)
            outputs[:, t, :] = decoder_output.view(*outputs[:, t, :].shape)
            teacher_force = torch.rand(1) < teacher_forcing_ratio
            # use token with highest score as output
            top1 = decoder_output.argmax(2)
            decoder_input = trg[:, t] if teacher_force else top1.squeeze(1)
            
        return outputs

### Training

In [13]:
# training loop
def train(model, train_loader, criterion, optimizer, device=device):
    model.train()
    total_loss = 0.0
    
    for src, trg in tqdm(train_loader):
        src = src.to(device)
        trg = trg.to(device)
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        # reshape output and target to calculate loss
        # (slice off the first column and flatten output to 2 dim)
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        loss = criterion(output, trg)
        loss.backward()
        
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(train_loader)

In [14]:
# hyperparameters
input_size = len(vocab.words)
output_size = len(vocab.words)
embedding_size = 256
hidden_size = 512
num_epochs = 30
learning_rate = 0.001
drop_prob = 0.2

In [15]:
# initialize the model, optimizer and loss function
model = Seq2Seq(input_size, hidden_size, hidden_size, output_size, drop_prob)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=vocab.words['<pad>'])

In [16]:
# a function to tell us how long an epoch takes
# taken from https://www.kaggle.com/code/columbine/seq2seq-pytorch
def epoch_time(start_time, end_time):
    
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time  / 60)
    elapsed_secs = int(elapsed_time -  (elapsed_mins * 60))
    return  elapsed_mins, elapsed_secs

In [17]:
# initialize the minimum training loss
min_train_loss = float('inf')

# training
for epoch in range(num_epochs):
    
    start_time = time.time()
    
    train_loss = train(model, train_loader, criterion, optimizer, device=device)
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1} | Time {epoch_mins}m {epoch_secs}s"')
    print(f'Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}"')
    
    # save the model if the validation loss is at a minimum value
    if train_loss < min_train_loss:
        min_train_loss = train_loss
        torch.save(model.state_dict(), 'chatbot_model.pt')

100%|██████████| 312/312 [03:34<00:00,  1.46it/s]


Epoch: 1 | Time 3m 34s"
Train Loss: 9.410 | Train PPL: 12209.620"


100%|██████████| 312/312 [03:34<00:00,  1.46it/s]


Epoch: 2 | Time 3m 34s"
Train Loss: 8.251 | Train PPL: 3831.780"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 3 | Time 3m 36s"
Train Loss: 7.819 | Train PPL: 2486.392"


100%|██████████| 312/312 [03:37<00:00,  1.44it/s]


Epoch: 4 | Time 3m 37s"
Train Loss: 7.403 | Train PPL: 1641.561"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 5 | Time 3m 36s"
Train Loss: 6.909 | Train PPL: 1000.966"


100%|██████████| 312/312 [03:34<00:00,  1.46it/s]


Epoch: 6 | Time 3m 34s"
Train Loss: 6.361 | Train PPL: 578.958"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 7 | Time 3m 36s"
Train Loss: 5.930 | Train PPL: 376.243"


100%|██████████| 312/312 [03:37<00:00,  1.44it/s]


Epoch: 8 | Time 3m 37s"
Train Loss: 5.628 | Train PPL: 278.065"


100%|██████████| 312/312 [03:35<00:00,  1.44it/s]


Epoch: 9 | Time 3m 35s"
Train Loss: 5.350 | Train PPL: 210.608"


100%|██████████| 312/312 [03:35<00:00,  1.45it/s]


Epoch: 10 | Time 3m 35s"
Train Loss: 5.165 | Train PPL: 175.026"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 11 | Time 3m 36s"
Train Loss: 5.070 | Train PPL: 159.214"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 12 | Time 3m 36s"
Train Loss: 4.958 | Train PPL: 142.364"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 13 | Time 3m 36s"
Train Loss: 4.864 | Train PPL: 129.518"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 14 | Time 3m 36s"
Train Loss: 4.863 | Train PPL: 129.363"


100%|██████████| 312/312 [03:31<00:00,  1.47it/s]


Epoch: 15 | Time 3m 31s"
Train Loss: 4.816 | Train PPL: 123.470"


100%|██████████| 312/312 [03:33<00:00,  1.46it/s]


Epoch: 16 | Time 3m 33s"
Train Loss: 4.791 | Train PPL: 120.450"


100%|██████████| 312/312 [03:36<00:00,  1.44it/s]


Epoch: 17 | Time 3m 36s"
Train Loss: 4.782 | Train PPL: 119.327"


100%|██████████| 312/312 [03:35<00:00,  1.45it/s]


Epoch: 18 | Time 3m 35s"
Train Loss: 4.760 | Train PPL: 116.740"


100%|██████████| 312/312 [03:35<00:00,  1.45it/s]


Epoch: 19 | Time 3m 35s"
Train Loss: 4.719 | Train PPL: 112.006"


100%|██████████| 312/312 [03:35<00:00,  1.45it/s]


Epoch: 20 | Time 3m 35s"
Train Loss: 4.691 | Train PPL: 108.976"


100%|██████████| 312/312 [03:39<00:00,  1.42it/s]


Epoch: 21 | Time 3m 39s"
Train Loss: 4.679 | Train PPL: 107.696"


100%|██████████| 312/312 [03:38<00:00,  1.43it/s]


Epoch: 22 | Time 3m 38s"
Train Loss: 4.656 | Train PPL: 105.230"


100%|██████████| 312/312 [03:38<00:00,  1.43it/s]


Epoch: 23 | Time 3m 38s"
Train Loss: 4.618 | Train PPL: 101.310"


100%|██████████| 312/312 [03:32<00:00,  1.47it/s]
  0%|          | 0/312 [00:00<?, ?it/s]

Epoch: 24 | Time 3m 32s"
Train Loss: 4.626 | Train PPL: 102.078"


100%|██████████| 312/312 [03:37<00:00,  1.44it/s]


Epoch: 25 | Time 3m 37s"
Train Loss: 4.590 | Train PPL:  98.473"


100%|██████████| 312/312 [04:15<00:00,  1.22it/s]


Epoch: 26 | Time 4m 15s"
Train Loss: 4.571 | Train PPL:  96.675"


100%|██████████| 312/312 [04:38<00:00,  1.12it/s]


Epoch: 27 | Time 4m 38s"
Train Loss: 4.560 | Train PPL:  95.605"


100%|██████████| 312/312 [04:35<00:00,  1.13it/s]
  0%|          | 0/312 [00:00<?, ?it/s]

Epoch: 28 | Time 4m 35s"
Train Loss: 4.577 | Train PPL:  97.230"


100%|██████████| 312/312 [04:35<00:00,  1.13it/s]


Epoch: 29 | Time 4m 35s"
Train Loss: 4.552 | Train PPL:  94.858"


100%|██████████| 312/312 [04:36<00:00,  1.13it/s]

Epoch: 30 | Time 4m 36s"
Train Loss: 4.561 | Train PPL:  95.650"





### Improvments

In [17]:
# load model with best training loss for optimization
model.load_state_dict(torch.load('chatbot_model.pt'))

# adjust probability for droput layers
drop_prob = 0.4
num_epochs = 10

In [19]:
# training
for epoch in range(num_epochs):
    
    start_time = time.time()
    
    train_loss = train(model, train_loader, criterion, optimizer, device=device)
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+31} | Time {epoch_mins}m {epoch_secs}s"')
    print(f'Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}"')
    
        # save the model if the validation loss is at a minimum value
    if train_loss < min_train_loss:
        min_train_loss = train_loss
        torch.save(model.state_dict(), 'chatbot_model.pt')

100%|██████████| 312/312 [04:33<00:00,  1.14it/s]


Epoch: 31 | Time 4m 33s"
Train Loss: 4.551 | Train PPL:  94.725"


100%|██████████| 312/312 [04:32<00:00,  1.15it/s]


Epoch: 32 | Time 4m 32s"
Train Loss: 4.538 | Train PPL:  93.508"


100%|██████████| 312/312 [04:32<00:00,  1.15it/s]
  0%|          | 0/312 [00:00<?, ?it/s]

Epoch: 33 | Time 4m 32s"
Train Loss: 4.547 | Train PPL:  94.317"


100%|██████████| 312/312 [04:32<00:00,  1.14it/s]


Epoch: 34 | Time 4m 32s"
Train Loss: 4.515 | Train PPL:  91.356"


100%|██████████| 312/312 [04:34<00:00,  1.14it/s]


Epoch: 35 | Time 4m 34s"
Train Loss: 4.450 | Train PPL:  85.646"


100%|██████████| 312/312 [04:37<00:00,  1.12it/s]


Epoch: 36 | Time 4m 37s"
Train Loss: 4.400 | Train PPL:  81.453"


100%|██████████| 312/312 [04:35<00:00,  1.13it/s]


Epoch: 37 | Time 4m 35s"
Train Loss: 4.366 | Train PPL:  78.689"


100%|██████████| 312/312 [04:35<00:00,  1.13it/s]


Epoch: 38 | Time 4m 35s"
Train Loss: 4.321 | Train PPL:  75.287"


100%|██████████| 312/312 [04:34<00:00,  1.14it/s]


Epoch: 39 | Time 4m 34s"
Train Loss: 4.284 | Train PPL:  72.509"


100%|██████████| 312/312 [03:42<00:00,  1.40it/s]


Epoch: 40 | Time 3m 42s"
Train Loss: 4.260 | Train PPL:  70.795"


### Evaluation

In [18]:
def evaluate(model, data_loader, criterion, device=device):
    model.eval()
    total_loss = 0.0
    
    with torch.no_grad():
        for src, trg in tqdm(data_loader):
            src = src.to(device)
            trg = trg.to(device)

            output = model(src, trg, teacher_forcing_ratio=0.0)

            # reshape output and target to calculate loss
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            total_loss += loss.item()
    
    return total_loss / len(data_loader)

In [19]:
test_model = Seq2Seq(input_size, hidden_size, hidden_size, output_size, drop_prob).to(device)
test_model.load_state_dict(torch.load('chatbot_model.pt'))

test_loss = evaluate(model, test_loader, criterion)
 
print(f"Test Loss : {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}")

100%|██████████| 330/330 [02:58<00:00,  1.85it/s]

Test Loss : 12.697 | Test PPL: 326806.326





The high test loss is a sign of significant overfitting. Training with the whole dataset and a larger vocabulary would likely have led to a better result, but due to computational restraints and limited GPU ressources, I decided to clip the dataset to 10,000 samples. 

### Chat Bot

In [20]:
chatbot_model = Seq2Seq(input_size, hidden_size, hidden_size, output_size, drop_prob).to(device)
chatbot_model.load_state_dict(torch.load('chatbot_model.pt'))
chatbot_model.eval()

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(24112, 512)
    (lstm): LSTM(512, 512, batch_first=True)
    (dropout): Dropout(p=0.4, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(24112, 512)
    (lstm): LSTM(512, 512, batch_first=True)
    (output): Linear(in_features=512, out_features=24112, bias=True)
  )
)

In [21]:
def chatbot(model=chatbot_model, vocab=vocab):
    print("Hi, I'm, your ChatBot. Type 'quit' to exit the chat.\n")
    input_sentence = ''
    while(1):
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'quit' or input_sentence == 'q': break

            # normalize sentence
            tokenized_sentence = vocab.indexSentences(input_sentence)

            # pad sentence
            num_padding = max_len - len(tokenized_sentence)
            padded_sentence = tokenized_sentence + [vocab.words['<pad>']] * num_padding
                       
            # create input tensor
            input_tensor = torch.LongTensor(padded_sentence).to(device)
            
            # fill up batch with empty tensors to match model dimensions
            tensor_list = [input_tensor] + [torch.zeros_like(input_tensor) for i in range(batch_size-1)]
            input_batch = torch.stack(tensor_list)
                     
            # create empty target tensor
            empty_trg = torch.zeros_like(input_batch).to(device)
            
            # create model ouput
            output = model(src=input_batch, trg=empty_trg, teacher_forcing_ratio=0.0)
            output = output[0,:,:]
            _, answer_indices = torch.max(output, dim=1)
            
            # print('answer_indices:', answer_indices)
            
            # turn indices back into words
            answer = vocab.IndexToWord(index.item() for index in answer_indices if index != 0)
            print('ChatBot:', ' '.join([word for word in answer]))
            print(' ')
            
        except KeyError:
             print("Error: Encountered unknown word.")

In [22]:
chatbot()

Hi, I'm, your ChatBot. Type 'quit' to exit the chat.

> Who was the first president of the United States of America?
ChatBot: 000 people in Afghanistan be Whitehead to both geographically and culturally of the northeastern Indian subcontinent more important than a Tajik prison in August 24 about whether Turkic or Iranian peoples were the original inhabitants of Central Asia it can be overcome
 
> What is the capital of France?
ChatBot: 3 850 000 people in developing countries and artificial photosynthesis of the science forbids me to enter the world and the world and to the world around it can provide humanitarian assistance and security of the world as a Tajik prison in
 
> When did the American Civil War end?
ChatBot: 000 people and to 147 F minor Op 21 years BCE and to think of people and objects as remaining fundamentally the same things harmful by an aim to all entities with the world around it by a self accept his father
 
> quit


### Notes and Credits

PPL is short for "perplexity". According to https://www.educative.io/answers/what-is-perplexity-in-nlp,

*Perplexity is a standard that evaluates how well a probability model can predict a sample. When applied to language models like GPT, it represents the exponentiated average negative log-likelihood of a sequence. In essence, a lower perplexity score suggests that the model has a higher certainty in its predictions.*

See https://towardsdatascience.com/perplexity-in-language-models-87a196019a94 for further information.

Helpful tutorial:
https://www.kaggle.com/code/columbine/seq2seq-pytorch

Chatbot inspired by https://pytorch.org/tutorials/beginner/chatbot_tutorial.html and 
https://github.com/aockel/seq2seq-squad2.