# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. We have loaded Brown embeddings from Gensim in the starter code below, so you can compare the performance of a model that uses the pretrained embeddings against a model trained without them.
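As a rough illustration of how pretrained vectors can be wired into a PyTorch embedding layer, here is a minimal sketch. The helper name `build_embedding_layer`, the `index_to_word` list, and the toy `toy_vectors` dictionary are assumptions made for this example; in practice the dictionary would be replaced by the vectors loaded from Gensim, and the starter code may organize this differently.

```python
import torch
import torch.nn as nn


def build_embedding_layer(index_to_word, pretrained, dim, freeze=False):
    """Build an nn.Embedding whose rows are copied from `pretrained` where available.

    index_to_word: list mapping vocabulary index -> word (hypothetical layout).
    pretrained: dict-like mapping word -> sequence of `dim` floats.
    """
    # Start from small random vectors so that words without a pretrained
    # vector (for example special tokens) still get a usable row.
    weights = torch.randn(len(index_to_word), dim) * 0.1
    for idx, word in enumerate(index_to_word):
        if word in pretrained:
            weights[idx] = torch.tensor(pretrained[word], dtype=torch.float)
    # freeze=False lets training fine-tune the pretrained vectors.
    return nn.Embedding.from_pretrained(weights, freeze=freeze)


# Toy stand-in for real pretrained vectors (values are made up).
toy_vectors = {"notre": [0.1, 0.2, 0.3], "dame": [0.4, 0.5, 0.6]}
layer = build_embedding_layer(["<sos>", "notre", "dame"], toy_vectors, dim=3)
print(layer.weight.shape)
```

Setting `freeze=True` instead keeps the pretrained vectors fixed, which is one of the variations you could include in the comparison.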



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model feeds the input sequence into the Encoder and passes the Encoder's final hidden state to the Decoder. The Decoder then uses that state to output a series of token predictions, one token at a time.
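The two components described above can be sketched in PyTorch as follows. This is a minimal illustration, not the `Seq2Seq` class from `util.Models`: the class names `SketchEncoder` and `SketchDecoder`, the `greedy_decode` helper, and the `sos_index` start token are assumptions for this example. The layer layout follows the description above (embedding plus LSTM in the Encoder; embedding, LSTM, and linear output in the Decoder).

```python
import torch
import torch.nn as nn


class SketchEncoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, src):
        # src: (src_len, batch) tensor of token indices
        embedded = self.embedding(src)            # (src_len, batch, hidden)
        _, (hidden, cell) = self.lstm(embedded)   # keep only the final states
        return hidden, cell


class SketchDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, token, hidden, cell):
        # token: (batch,) tensor holding the previous output token
        embedded = self.embedding(token).unsqueeze(0)          # (1, batch, hidden)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        log_probs = self.softmax(self.fc(output.squeeze(0)))   # (batch, vocab)
        return log_probs, hidden, cell


def greedy_decode(encoder, decoder, src, sos_index, max_len):
    """Encode `src`, then repeatedly pick the most likely next token."""
    steps = []
    with torch.no_grad():
        hidden, cell = encoder(src)
        # Every sequence in the batch starts from the (assumed) start token.
        token = torch.full((src.size(1),), sos_index, dtype=torch.long)
        for _ in range(max_len):
            log_probs, hidden, cell = decoder(token, hidden, cell)
            token = log_probs.argmax(dim=1)
            steps.append(token)
    return torch.stack(steps)  # (max_len, batch)


encoder = SketchEncoder(vocab_size=20, hidden_size=8)
decoder = SketchDecoder(vocab_size=15, hidden_size=8)
src = torch.randint(0, 20, (5, 2))  # two random "questions" of length 5
predictions = greedy_decode(encoder, decoder, src, sos_index=0, max_len=4)
print(predictions.shape)
```

A real chatbot would also stop decoding once an end-of-sequence token is produced; that check is left out here to keep the sketch short.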

## Dependencies

- PyTorch
- torchtext
- NumPy
- pandas
- NLTK
- gzip
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html
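If you start from SQuAD, one possible way to reduce each example to a question/answer pair is sketched below. It assumes each example is a tuple of `(context, question, answers, answer_starts)`, which is our understanding of what `torchtext.datasets.SQuAD1` yields; verify this against the dataset documentation before relying on it. The function name `squad_to_pairs` and the toy examples are hypothetical.

```python
def squad_to_pairs(examples, limit=None):
    """Turn (context, question, answers, answer_starts) tuples into (question, answer) pairs."""
    pairs = []
    for _context, question, answers, _answer_starts in examples:
        if not answers:
            # Skip examples that carry no answer text.
            continue
        # Keep only the first listed answer for each question.
        pairs.append((question, answers[0]))
        if limit is not None and len(pairs) >= limit:
            break
    return pairs


# Toy examples in the assumed layout; the contexts and offsets are placeholders.
toy_examples = [
    ("(context paragraph omitted)", "What is the Grotto at Notre Dame?",
     ["a Marian place of prayer and reflection"], [0]),
    ("(context paragraph omitted)", "A question with no answer?", [], []),
]
pairs = squad_to_pairs(toy_examples)
print(pairs)
```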





In [14]:
!pip install torch==1.12.0 torchdata==0.4.0 torchtext==0.13.0

Defaulting to user installation because normal site-packages is not writeable


In [1]:
from util.Data import loadDF, prepare_text, getPairs, toTensor, getMaxLen
from util.Models import Seq2Seq
from util.Vocab import Vocab
from util.Train import train
from util.Evaluate import evaluate
import random

In [2]:
learning_rate = 1e-3
hidden_size = 256 # encoder and decoder hidden size
batch_size = 256
epochs = 150

In [3]:
import nltk
import torch
import torchtext

In [4]:
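data_df = loadDF('data')
# Keep only the first 5000 rows; copy so later column assignments do not act on a slice view.
data_df = data_df.iloc[:5000, :].copy()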

In [5]:
for i in range(5):
    print("> ", data_df.iloc[i, 0], "\n< ", data_df.iloc[i, 1], "\n")

>  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 
<  Saint Bernadette Soubirous 

>  What is in front of the Notre Dame Main Building? 
<  a copper statue of Christ 

>  The Basilica of the Sacred heart at Notre Dame is beside to which structure? 
<  the Main Building 

>  What is the Grotto at Notre Dame? 
<  a Marian place of prayer and reflection 

>  What sits on top of the Main Building at Notre Dame? 
<  a golden statue of the Virgin Mary 



In [6]:
data_df['Question'] = data_df['Question'].apply(prepare_text)
data_df['Answer'] = data_df['Answer'].apply(prepare_text)
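The `prepare_text` helper lives in `util.Data` and is not shown here. As a rough idea of what such a normalization step typically does, the sketch below lowercases the text, replaces punctuation with spaces, and splits on whitespace. The name `prepare_text_sketch` and the exact rules are assumptions; the real helper may differ.

```python
import re


def prepare_text_sketch(sentence):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    sentence = sentence.lower().strip()
    # Anything that is not a lowercase letter, digit, or whitespace becomes a space.
    sentence = re.sub(r"[^a-z0-9\s]", " ", sentence)
    return sentence.split()


print(prepare_text_sketch("What is the Grotto at Notre Dame?"))
# ['what', 'is', 'the', 'grotto', 'at', 'notre', 'dame']
```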

In [7]:
pairs = getPairs(data_df)

In [8]:
max_src, max_trg = getMaxLen(pairs)
max_trg, max_src  # note the display order: target length first, then source length

(43, 29)

In [9]:
Q_vocab = Vocab()
A_vocab = Vocab()

for pair in pairs:
    Q_vocab.add_words(pair[0])
    A_vocab.add_words(pair[1])
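To make the role of the vocabulary concrete, here is a minimal sketch of a class with the same `add_words` method and `words_count` attribute used above. Everything else, including the class name `VocabSketch`, the `<sos>`/`<eos>` special tokens, and the `encode` method, is an assumption for illustration and may not match `util.Vocab`.

```python
class VocabSketch:
    """Maps words to integer indices and back."""

    def __init__(self):
        # Reserve the first two indices for start and end markers (assumed tokens).
        self.word2index = {"<sos>": 0, "<eos>": 1}
        self.index2word = {0: "<sos>", 1: "<eos>"}
        self.words_count = 2

    def add_words(self, words):
        # Assign the next free index to every word not seen before.
        for word in words:
            if word not in self.word2index:
                self.word2index[word] = self.words_count
                self.index2word[self.words_count] = word
                self.words_count += 1

    def encode(self, words):
        # Convert tokens to indices and append the end-of-sequence marker.
        return [self.word2index[w] for w in words] + [self.word2index["<eos>"]]


vocab = VocabSketch()
vocab.add_words(["what", "is", "the", "grotto"])
vocab.add_words(["what", "is", "notre", "dame"])
print(vocab.words_count)             # 8
print(vocab.encode(["what", "is"]))  # [2, 3, 1]
```

Wrapping the list returned by `encode` in `torch.tensor(...)` would give the kind of index tensor a model consumes, which is presumably close to what a helper such as `toTensor` produces.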

In [10]:
source_data = [toTensor(Q_vocab, pair[0]) for pair in pairs]
target_data = [toTensor(A_vocab, pair[1]) for pair in pairs]

In [11]:
seq2seq = Seq2Seq(Q_vocab.words_count, hidden_size, A_vocab.words_count)

train(source_data = source_data,
      target_data = target_data,
      model = seq2seq,
      epochs = epochs,
      learning_rate = learning_rate,
      batch_size = batch_size,
      setting_patience = 10
     )


Model Improved
best loss changed : 100 ---> 5.642314862336766
1/150 Epoch  -  Training Loss = 6.4542  -  Validation Loss = 5.6423
Model Improved
best loss changed : 5.642314862336766 ---> 5.43468880971718
2/150 Epoch  -  Training Loss = 5.4901  -  Validation Loss = 5.4347
Model Improved
best loss changed : 5.43468880971718 ---> 5.304959308488919
3/150 Epoch  -  Training Loss = 5.3046  -  Validation Loss = 5.3050
Model Improved
best loss changed : 5.304959308488919 ---> 5.239271837211345
4/150 Epoch  -  Training Loss = 5.2206  -  Validation Loss = 5.2393
Model Improved
best loss changed : 5.239271837211345 ---> 5.17715295189235
5/150 Epoch  -  Training Loss = 5.1561  -  Validation Loss = 5.1772
Model Improved
best loss changed : 5.17715295189235 ---> 5.1107371987414965
6/150 Epoch  -  Training Loss = 5.0921  -  Validation Loss = 5.1107
Model Improved
best loss changed : 5.1107371987414965 ---> 4.9842802104718436
7/150 Epoch  -  Training Loss = 5.0019  -  Validation Loss = 4.9843
...

Model Improved
best loss changed : 0.11860476647355396 ---> 0.10722762205731616
58/150 Epoch  -  Training Loss = 0.0889  -  Validation Loss = 0.1072
Model Improved
best loss changed : 0.10722762205731616 ---> 0.10191054622812475
59/150 Epoch  -  Training Loss = 0.0839  -  Validation Loss = 0.1019
Model Improved
best loss changed : 0.10191054622812475 ---> 0.09408089323986203
60/150 Epoch  -  Training Loss = 0.0798  -  Validation Loss = 0.0941
Model Improved
best loss changed : 0.09408089323986203 ---> 0.08772642673981008
61/150 Epoch  -  Training Loss = 0.0753  -  Validation Loss = 0.0877
Model Improved
best loss changed : 0.08772642673981008 ---> 0.08723713334250215
62/150 Epoch  -  Training Loss = 0.0716  -  Validation Loss = 0.0872
Model Improved
best loss changed : 0.08723713334250215 ---> 0.07792061133235836
63/150 Epoch  -  Training Loss = 0.0687  -  Validation Loss = 0.0779
Model Improved
best loss changed : 0.07792061133235836 ---> 0.0721203631962891
...

Model Improved
best loss changed : 0.012699424852216156 ---> 0.012421036150912114
113/150 Epoch  -  Training Loss = 0.0117  -  Validation Loss = 0.0124
Model Improved
best loss changed : 0.012421036150912114 ---> 0.01197068461272086
114/150 Epoch  -  Training Loss = 0.0116  -  Validation Loss = 0.0120
Model Improved
best loss changed : 0.01197068461272086 ---> 0.011621422797935141
115/150 Epoch  -  Training Loss = 0.0113  -  Validation Loss = 0.0116
Model Improved
best loss changed : 0.011621422797935141 ---> 0.01137540420040689
116/150 Epoch  -  Training Loss = 0.0109  -  Validation Loss = 0.0114
Model Improved
best loss changed : 0.01137540420040689 ---> 0.011087986929503416
117/150 Epoch  -  Training Loss = 0.0106  -  Validation Loss = 0.0111
Model Improved
best loss changed : 0.011087986929503416 ---> 0.010939851468617116
118/150 Epoch  -  Training Loss = 0.0104  -  Validation Loss = 0.0109
Model Improved
best loss changed : 0.010939851468617116 ---> 0.010692847618867228
...
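The log above suggests that the training loop tracks the best validation loss (starting from an initial value of 100) and reports "Model Improved" whenever it drops. A common way to combine this with a patience setting is early stopping, sketched below. This is an assumption about how `util.Train` might use `setting_patience`, not a copy of it; the function name and the toy loss values are hypothetical.

```python
def run_with_early_stopping(validation_losses, patience):
    """Return (best_epoch, best_loss, last_epoch_run) for a sequence of validation losses."""
    best_loss = 100.0  # same starting value as the first "best loss changed" log line
    best_epoch = None
    epochs_without_improvement = 0
    last_epoch = 0
    for last_epoch, val_loss in enumerate(validation_losses, start=1):
        if val_loss < best_loss:
            # Improvement: remember it and reset the counter.
            # A real training loop would also save a checkpoint here.
            best_loss = val_loss
            best_epoch = last_epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop once `patience` epochs pass without improvement
    return best_epoch, best_loss, last_epoch


# Toy validation losses: improvement stops after the third epoch.
print(run_with_early_stopping([5.64, 5.43, 5.30, 5.35, 5.32, 5.31], patience=3))
# (3, 5.3, 6)
```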

In [12]:
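model_path = 'checkpoint.pt'

# Rebuild the model with the same sizes used for training, then load the saved weights.
seq2seq = Seq2Seq(Q_vocab.words_count, hidden_size, A_vocab.words_count)
seq2seq.load_state_dict(torch.load(model_path, map_location='cuda'))  # requires a CUDA-capable GPU
seq2seq.cuda()
seq2seq.eval()  # switch to evaluation mode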

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4504, 256)
    (lstm): LSTM(256, 256)
  )
  (decoder): Decoder(
    (embedding): Embedding(4079, 256)
    (lstm): LSTM(256, 256)
    (fc): Linear(in_features=256, out_features=4079, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
)

In [13]:
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
# Read questions from the command line until the user types "exit".
while True:
    src = input("> ")
    if src.strip() == "exit":
        break
    evaluate(src, Q_vocab, A_vocab, seq2seq, max_trg)

Type 'exit' to finish the chat.
 ------------------------------ 

> What is the biggest building?
< fort hill 

> then, What is the second biggest building?
< queen of 

> Thank you
< skyfal 

> hello
Error: Word Encountered Not In The Vocabulary.
> haha
Error: Word Encountered Not In The Vocabulary.
> exit
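The "Word Encountered Not In The Vocabulary" errors appear because inputs such as "hello" were never seen during training. One common remedy is to reserve an unknown-word token and map unseen words to it instead of failing. The sketch below shows the lookup only; the function name, the `<unk>` token, and the toy vocabulary are assumptions, and the model would also need to be trained with that token in its vocabulary for this to help.

```python
def encode_with_unknowns(words, word2index, unk_token="<unk>"):
    """Map each word to its index, falling back to the unknown-word index."""
    unk_index = word2index[unk_token]
    return [word2index.get(word, unk_index) for word in words]


# Toy vocabulary for illustration only.
word2index = {"<unk>": 0, "what": 1, "is": 2, "the": 3}
print(encode_with_unknowns(["what", "is", "hello"], word2index))  # [1, 2, 0]
```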
