# LSTM Bot

## Project Overview

In this project, you will build a chatbot that converses with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the model's performance. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. We have loaded Brown Embeddings from Gensim in the starter code below, so you can compare a model that uses pretrained embeddings against one that does not.
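One way to wire pretrained vectors into a model is to build an embedding matrix whose rows follow the vocabulary's index order, and then use it to initialise the embedding layer. The sketch below illustrates the idea in plain Python. The function `build_embedding_matrix`, the toy `pretrained` dictionary, and the random fallback for missing words are assumptions made for this example, not part of the starter code.

```python
import random

def build_embedding_matrix(index_to_word, pretrained, dim, seed=0):
    """Return (matrix, hits) with one row of length `dim` per vocabulary index.

    Words found in `pretrained` keep their pretrained vector. All other words
    get small random values so they can still be learned during training.
    """
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in index_to_word:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            matrix.append(list(vector))
            hits += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits

# Toy stand-in for real pretrained vectors (hypothetical values).
pretrained = {"notre": [0.1, 0.2, 0.3], "dame": [0.4, 0.5, 0.6]}
index_to_word = ["<SOS>", "<EOS>", "notre", "dame", "grotto"]

matrix, hits = build_embedding_matrix(index_to_word, pretrained, dim=3)
print(f"{hits}/{len(index_to_word)} words covered by pretrained vectors")
```

In PyTorch, a matrix like this can be converted to a float tensor and passed to `torch.nn.Embedding.from_pretrained`, with `freeze=False` if the vectors should be fine-tuned. The vector dimension has to match the input size the LSTM expects.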



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model feeds the input sequence to the Encoder, which passes its final hidden state to the Decoder. The Decoder uses that state to generate a series of token predictions.
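The flow described above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's `src.Models` code: the toy sizes, the use of index 0 as a start token, and the greedy feeding of each prediction back into the Decoder are all assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, src):
        # src: (src_len, batch) tensor of token indices
        _, state = self.lstm(self.embedding(src))
        return state  # (h_n, c_n), each of shape (1, batch, hidden_size)

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, token, state):
        # token: (batch,) tensor holding one token index per sequence
        output, state = self.lstm(self.embedding(token.unsqueeze(0)), state)
        return self.softmax(self.fc(output.squeeze(0))), state

# Toy sizes, chosen only for the demonstration.
encoder, decoder = Encoder(20, 8), Decoder(15, 8)
src = torch.randint(0, 20, (5, 2))        # 5 source tokens, batch of 2
state = encoder(src)                      # the Encoder state seeds the Decoder
token = torch.zeros(2, dtype=torch.long)  # index 0 stands in for a start token
steps = []
for _ in range(4):                        # emit 4 target tokens
    log_probs, state = decoder(token, state)
    token = log_probs.argmax(dim=1)       # greedy choice, fed back as next input
    steps.append(log_probs)
outputs = torch.stack(steps)              # (4, batch, target vocabulary size)
print(outputs.shape)
```

During training, a Decoder like this is often fed the ground-truth previous token instead of its own prediction (teacher forcing). Whether the project's `train` function does so is not shown here.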

## Dependencies

- PyTorch
- NumPy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. You can view the available options here:

- https://pytorch.org/text/stable/datasets.html





In [14]:
!pip3 install torch torchdata torchtext

Collecting torch
  Using cached torch-2.0.0-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)
Collecting torchdata
  Using cached torchdata-0.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Collecting torchtext
  Using cached torchtext-0.15.1-cp38-cp38-manylinux1_x86_64.whl (2.0 MB)
Collecting nvidia-cublas-cu11==11.10.3.66
  Using cached nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
Collecting nvidia-nvtx-cu11==11.7.91
  Using cached nvidia_nvtx_cu11-11.7.91-py3-none-manylinux1_x86_64.whl (98 kB)
Collecting triton==2.0.0
  Using cached triton-2.0.0-1-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.2 MB)
Collecting nvidia-cusolver-cu11==11.4.0.1
  Using cached nvidia_cusolver_cu11-11.4.0.1-2-py3-none-manylinux1_x86_64.whl (102.6 MB)
Collecting nvidia-cusparse-cu11==11.7.4.91
  Using cached nvidia_cusparse_cu11-11.7.4.91-py3-none-manylinux1_x86_64.whl (173.2 MB)
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Using cached nvidia_cuda_

In [7]:
!pip install nltk
!pip install pandas
!pip install scikit-learn

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting joblib
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting click
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting regex>=2021.8.3
  Using cached regex-2023.3.23-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (771 kB)
Installing collected packages: regex, joblib, click, nltk
Successfully installed click-8.1.3 joblib-1.2.0 nltk-3.8.1 regex-2023.3.23


In [1]:
from src.Data import loadDF, prepare_text, getPairs, toTensor, getMaxLen
from src.Models import Seq2Seq
from src.Vocab import Vocab
from src.Train import train
from src.Evaluate import evaluate
import random

In [2]:
learning_rate = 0.01
hidden_size = 128 # encoder and decoder hidden size
batch_size = 128
epochs = 65

In [3]:
data_df = loadDF('data')
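# Keep only the first 5,000 Q&A pairs to avoid CUDA out-of-memory errors on the full dataset.
# .copy() makes the later column assignments modify an independent DataFrame, not a view of the slice.
data_df = data_df.iloc[:5000, :].copy()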



In [4]:
for i in range(5):  # first 5 Q&A pairs
    print("> ", data_df.iloc[i, 0], "\n< ", data_df.iloc[i, 1], "\n")

>  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 
<  Saint Bernadette Soubirous 

>  What is in front of the Notre Dame Main Building? 
<  a copper statue of Christ 

>  The Basilica of the Sacred heart at Notre Dame is beside to which structure? 
<  the Main Building 

>  What is the Grotto at Notre Dame? 
<  a Marian place of prayer and reflection 

>  What sits on top of the Main Building at Notre Dame? 
<  a golden statue of the Virgin Mary 



In [5]:
data_df['Question'] = data_df['Question'].apply(prepare_text)
data_df['Answer'] = data_df['Answer'].apply(prepare_text)
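The body of `prepare_text` lives in `src.Data` and is not shown here. As a rough idea of what a text-normalisation step of this kind can look like, the sketch below lowercases the text, replaces punctuation with spaces, and collapses whitespace. The name `normalize_text` and the exact rules are assumptions. The real function may differ, for example by tokenising with NLTK.

```python
import re

def normalize_text(text):
    """Lowercase, keep only ASCII letters, digits and spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()   # squeeze runs of whitespace

print(normalize_text("What is the Grotto at Notre Dame?"))
```

Note that this simple rule also drops accented letters, which may not be desirable for every dataset.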

In [6]:
pairs = getPairs(data_df)

In [7]:
max_src, max_trg = getMaxLen(pairs)
max_trg, max_src  # note the display order: (max_trg, max_src)

(43, 29)

In [8]:
Q_vocab = Vocab()
A_vocab = Vocab()

# Build vocabularies for the questions (source) and the answers (target)
for pair in pairs:
    Q_vocab.add_words(pair[0])
    A_vocab.add_words(pair[1])
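The `Vocab` class comes from `src.Vocab` and its implementation is not shown. A vocabulary of this kind typically maps each distinct word to an integer index and keeps a running count, which then sets the size of the embedding layer. The sketch below is a hypothetical stand-in: the class name `SimpleVocab`, the special tokens, the `encode` helper, and the assumption that sentences are whitespace-separated strings are all illustrative.

```python
class SimpleVocab:
    """Hypothetical stand-in for src.Vocab.Vocab; the real class may differ."""

    def __init__(self):
        # Reserve the first indices for special tokens (an assumption).
        self.word2index = {"<SOS>": 0, "<EOS>": 1}
        self.index2word = {0: "<SOS>", 1: "<EOS>"}
        self.words_count = 2

    def add_words(self, sentence):
        """Register every unseen word of a whitespace-separated sentence."""
        for word in sentence.split():
            if word not in self.word2index:
                self.word2index[word] = self.words_count
                self.index2word[self.words_count] = word
                self.words_count += 1

    def encode(self, sentence):
        """Map a sentence to indices and append the end-of-sequence index."""
        return [self.word2index[w] for w in sentence.split()] + [self.word2index["<EOS>"]]

vocab = SimpleVocab()
vocab.add_words("what is the grotto at notre dame")
vocab.add_words("what is in front of the main building")
print(vocab.words_count)
print(vocab.encode("what is the grotto"))
```

A helper such as `toTensor` would presumably wrap an index list like the one returned by `encode` in a tensor before it is passed to the model.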

In [9]:
source_data = [toTensor(Q_vocab, pair[0]) for pair in pairs]
target_data = [toTensor(A_vocab, pair[1]) for pair in pairs]

In [11]:
seq2seq = Seq2Seq(Q_vocab.words_count, hidden_size, A_vocab.words_count)

train(source_data = source_data,
      target_data = target_data,
      model = seq2seq,
      print_every = 5,
      epochs = epochs,
      learning_rate = learning_rate,
      batch_size = batch_size)

5/65 Epoch  -  Training Loss = 5.6954  -  Validation Loss = 5.5855
10/65 Epoch  -  Training Loss = 5.2241  -  Validation Loss = 5.3815
15/65 Epoch  -  Training Loss = 4.9225  -  Validation Loss = 5.0250
20/65 Epoch  -  Training Loss = 4.4364  -  Validation Loss = 4.5959
25/65 Epoch  -  Training Loss = 3.8199  -  Validation Loss = 4.0299
30/65 Epoch  -  Training Loss = 3.0603  -  Validation Loss = 3.3682
35/65 Epoch  -  Training Loss = 2.2336  -  Validation Loss = 2.6545
40/65 Epoch  -  Training Loss = 1.4674  -  Validation Loss = 1.8930
45/65 Epoch  -  Training Loss = 0.9090  -  Validation Loss = 1.3050
50/65 Epoch  -  Training Loss = 0.6037  -  Validation Loss = 0.8735
55/65 Epoch  -  Training Loss = 0.3658  -  Validation Loss = 0.5187
60/65 Epoch  -  Training Loss = 0.2227  -  Validation Loss = 0.2932
65/65 Epoch  -  Training Loss = 0.1606  -  Validation Loss = 0.1959


In [12]:
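import torch

model_path = 'seq2seq.pt'

# Save the whole model object (architecture and weights) to disk.
torch.save(seq2seq, model_path)

# Load onto the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seq2seq = torch.load(model_path, map_location=device)
seq2seq.eval()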

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4504, 128)
    (lstm): LSTM(128, 128)
  )
  (decoder): Decoder(
    (embedding): Embedding(4079, 128)
    (lstm): LSTM(128, 128)
    (fc): Linear(in_features=128, out_features=4079, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
)
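Pickling the whole model with `torch.save(model, path)` ties the saved file to the exact class definitions and module paths in use at save time. An alternative that the PyTorch documentation recommends is to save only the `state_dict` and rebuild the model before loading. The sketch below uses a small stand-in `nn.LSTM` and an in-memory buffer so that it is self-contained. For this project the same pattern would presumably apply to the `Seq2Seq` instance and a file path.

```python
import io
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.LSTM(8, 8)                    # stand-in for the trained model
buffer = io.BytesIO()                    # stand-in for a file such as 'seq2seq.pt'
torch.save(model.state_dict(), buffer)   # save the parameters only

buffer.seek(0)
restored = nn.LSTM(8, 8)                 # rebuild the same architecture first
restored.load_state_dict(torch.load(buffer, map_location=device))
restored.to(device).eval()               # switch to inference mode
```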

In [13]:
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while True:
    src = input("> ")
    if src.strip() == "exit":
        break
    evaluate(src, Q_vocab, A_vocab, seq2seq, max_trg)

Type 'exit' to finish the chat.
 ------------------------------ 



>  What is the Grotto at Notre Dame? 


< a marian place of art and prayer 



>  exit
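How `evaluate` produces a reply is not shown here. A common approach, and the one assumed in this sketch, is greedy decoding: at each step pick the most likely next token, and stop at an end-of-sequence token or after a maximum number of steps. The `step_fn` callable and the toy score table are hypothetical stand-ins for the trained Decoder, which would also carry its LSTM state from step to step.

```python
def greedy_decode(step_fn, start_token, eos_token, max_len):
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    tokens, current = [], start_token
    for _ in range(max_len):
        scores = step_fn(current)              # dict: candidate token -> score
        current = max(scores, key=scores.get)  # greedy choice
        if current == eos_token:
            break
        tokens.append(current)
    return tokens

# Toy "model": a fixed table of next-token scores (hypothetical values).
table = {
    "<SOS>": {"a": 0.9, "the": 0.1},
    "a": {"marian": 0.7, "place": 0.3},
    "marian": {"place": 0.8, "<EOS>": 0.2},
    "place": {"<EOS>": 0.6, "of": 0.4},
}

# max_len=43 mirrors the max_trg value printed earlier in the notebook.
reply = greedy_decode(table.__getitem__, "<SOS>", "<EOS>", max_len=43)
print("<", " ".join(reply))
```

Greedy decoding is simple and fast, but it can miss replies that a wider search such as beam search would find.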
