# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. The starter code below includes commented-out lines that train, save, and load Word2Vec embeddings on the Brown corpus with Gensim. You can compare the performance of your model with pretrained embeddings against a model without them.
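
As a rough illustration of how pretrained vectors could be placed into an embedding layer, here is a minimal sketch. The `pretrained` dictionary and the `word2index` mapping are hypothetical stand-ins (in the project the vectors would come from the Word2Vec model, and the mapping from the `Vocab` class), and the 3-dimensional vectors are toy values. Note that the embedding dimension of the pretrained vectors would need to match the input size of the LSTM that consumes them.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical pretrained vectors; in the project these would come from
# the Word2Vec model trained on the Brown corpus (e.g. w2v.wv[word]).
pretrained = {
    "when": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "did": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
embedding_dim = 3

# Hypothetical word-to-index mapping in the style of the Vocab class
word2index = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "when": 3, "did": 4, "beyonc": 5}

# Start from small random vectors so that words missing from the
# pretrained set still receive an embedding
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(len(word2index), embedding_dim)).astype(np.float32)
for word, idx in word2index.items():
    if word in pretrained:
        weights[idx] = pretrained[word]

# freeze=False lets the vectors be fine-tuned during training
embedding = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)

print(embedding(torch.tensor([3])))
```

Setting `freeze=True` instead would keep the pretrained vectors fixed, which is one of the comparisons you may want to try.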



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model feeds an input sequence into the Encoder and passes the Encoder's final hidden state to the Decoder. The Decoder then uses that state to output a series of token predictions, one token at a time.
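
The data flow described above can be sketched with plain PyTorch layers. This is only an illustration under assumed toy sizes, not the project's model: the vocabulary sizes, the hidden size, the fixed number of decoding steps, and the use of index 1 as the start token are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (not the project's settings)
src_vocab_size, trg_vocab_size, hidden_size = 20, 15, 8

# Encoder: embedding layer + LSTM
enc_embedding = nn.Embedding(src_vocab_size, hidden_size)
enc_lstm = nn.LSTM(hidden_size, hidden_size)

# Decoder: embedding layer + LSTM + linear output layer
dec_embedding = nn.Embedding(trg_vocab_size, hidden_size)
dec_lstm = nn.LSTM(hidden_size, hidden_size)
dec_out = nn.Linear(hidden_size, trg_vocab_size)

# A source sequence of 5 token indices, shaped (seq_len, batch=1)
src = torch.randint(0, src_vocab_size, (5, 1))

# Encode: keep only the final hidden and cell states
_, (h, c) = enc_lstm(enc_embedding(src))

# Decode greedily for a fixed number of steps, starting from index 1 (assumed <SOS>)
token = torch.tensor([[1]])
predicted = []
for _ in range(4):
    dec_output, (h, c) = dec_lstm(dec_embedding(token), (h, c))
    scores = dec_out(dec_output[0])        # shape (batch=1, trg_vocab_size)
    token = scores.argmax(1).unsqueeze(0)  # next input, shape (1, 1)
    predicted.append(token.item())

print(h.shape, predicted)
```

The weights here are untrained, so the predicted indices are meaningless; the point is only that the Decoder starts from the Encoder's `(h, c)` and feeds each prediction back in as the next input.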

## Dependencies

- PyTorch
- torchtext
- torchdata
- NumPy
- pandas
- scikit-learn
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [365]:
!pip install torch

In [366]:
!pip install torchtext
!pip install torchdata==0.3.0

In [367]:
import torchdata
import torchtext

In [368]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from sklearn.model_selection import train_test_split
from nltk.corpus import brown
from nltk.stem.snowball import SnowballStemmer
from torchtext.datasets import SQuAD2
import torch.nn as nn
import random
import string

#nltk.download('brown')
#nltk.download('punkt')

# Output, save, and load brown embeddings

#model = gensim.models.Word2Vec(brown.sents())
#model.save('brown.embedding')

#w2v = gensim.models.Word2Vec.load('brown.embedding')

stemming = SnowballStemmer('english')
def loadDF():

    # Download and extract the dataset, collecting question/answer pairs
    df = {"question": [], "answer": []}
    train_iter, dev_iter = SQuAD2()
    for context, question, answers, indices in train_iter:
        # Skip entries whose first answer is empty
        if answers[0]:
            df["question"].append(question)
            df["answer"].append(answers[0])
    return pd.DataFrame.from_dict(df)

def prepare_data(vocab, sentence):
    indices = [vocab.word2index[word] for word in sentence.split(' ')]
    indices.append(vocab.word2index['<EOS>'])
    
    return torch.Tensor(indices).long().to(device).view(-1,1)

def prepare_text(sentence):
    
    sentence = ''.join([s.lower() for s in sentence if s not in string.punctuation])
    sentence = ' '.join(stemming.stem(w) for w in sentence.split())
    return sentence

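def train_test_split2(SRC, TRG, test_size=0.2):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset
            test_size, the fraction of pairs held out for testing

    Output: Training and test datasets for SRC & TRG

    '''
    
    # Split both lists with the same shuffle so question/answer pairs stay aligned
    SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset = train_test_split(
        SRC, TRG, test_size=test_size)
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset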


In [369]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [370]:
df = loadDF()


In [371]:
df.head()

| | question | answer |
|---|---|---|
| 0 | When did Beyonce start becoming popular? | in the late 1990s |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing |
| 2 | When did Beyonce leave Destiny's Child and bec... | 2003 |
| 3 | In what city and state did Beyonce grow up? | Houston, Texas |
| 4 | In which decade did Beyonce become famous? | late 1990s |


In [372]:
from nltk.stem.porter import *
from nltk.stem import *
from nltk.tokenize import RegexpTokenizer

# Vocab class is based on a suggestion by a mentor in the Knowledge forum.  
# It is slightly different from the one provided in the lecture notes 

class Vocab:
    def __init__(self, lang):
        self.lang = lang
        self.word2index = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2}
        self.index2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>"}
        self.word2count = {}
        self.n_words = 3  # Count <PAD>, <SOS>, <EOS>
    def build_vocab(self, sentences):
        for sentence in sentences:
            for word in self.tokenize(sentence):
                self.add_word(word)
    #Tokenize the sentence here as opposed to the prepare_text function
    def tokenize(self, sentence):
        return sentence.strip().split()
    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.word2count[word] = 1
            self.n_words += 1
        else:
            self.word2count[word] += 1



In [373]:
numQnA = 5000
df = df.head(numQnA)
df.shape[0]

5000

In [374]:
df['question'] = df['question'].apply(prepare_text)
df['answer'] = df['answer'].apply(prepare_text)
df.head()

| | question | answer |
|---|---|---|
| 0 | when did beyonc start becom popular | in the late 1990s |
| 1 | what area did beyonc compet in when she was gr... | sing and danc |
| 2 | when did beyonc leav destini child and becom a... | 2003 |
| 3 | in what citi and state did beyonc grow up | houston texa |
| 4 | in which decad did beyonc becom famous | late 1990s |


In [375]:
questions = df['question'].head(numQnA).values
answers = df['answer'].head(numQnA).values

vocabQ = Vocab(lang='Questions')
vocabA = Vocab(lang='Answers')

vocabQ.build_vocab(questions)
vocabA.build_vocab(answers)


In [376]:
print(questions)

['when did beyonc start becom popular'
 'what area did beyonc compet in when she was grow up'
 'when did beyonc leav destini child and becom a solo singer' ...
 'how is the mean of dukkha explain'
 'what is a contribut factor to dukkha' 'the second truth is']


In [377]:
print(answers)

['in the late 1990s' 'sing and danc' '2003' ... 'crave' 'ignor'
 'the origin of dukkha can be known']


In [378]:
print(vocabQ.n_words)
print(vocabA.n_words)

4339
4026


In [379]:
train, test = train_test_split(df, test_size=0.2)

In [380]:
train.head()

| | question | answer |
|---|---|---|
| 1687 | the presenc of the altan khan in the west redu... | the ming |
| 2030 | what did appl origin tell consum to purchas wh... | refurbish replac ipod |
| 4581 | what journalist drew comparison between my bea... | simon vozicklevinson |
| 1959 | which appl technolog did patright complain bre... | fairplay |
| 687 | what is a critic of other stream servic | low payout of royalti |


In [381]:
source = [prepare_data(vocabQ, q) for q in train['question'].values]
target = [prepare_data(vocabA, a) for a in train['answer'].values]

In [382]:
source[0].shape, target[0].shape

(torch.Size([13, 1]), torch.Size([3, 1]))

In [383]:
print(source[0])
print(target[0])

tensor([[  31],
        [2060],
        [  59],
        [  31],
        [1995],
        [1792],
        [  12],
        [  31],
        [1120],
        [2061],
        [ 666],
        [ 688],
        [   2]], device='cuda:0')
tensor([[   4],
        [1312],
        [   2]], device='cuda:0')
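
`prepare_data` looks up every word directly in `vocab.word2index`, so a word that was never added to the vocabulary raises a `KeyError`. One possible way to handle this, for example when a user types an unseen word at the command line, is to fall back to an unknown-word index. The sketch below assumes an extra `<UNK>` entry that the `Vocab` class above does not define, and the helper name `sentence_to_indices` is hypothetical.

```python
# Hypothetical word-to-index mapping in the style of the Vocab class,
# extended with an assumed <UNK> entry that the original class does not define
word2index = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "when": 4, "did": 5}

def sentence_to_indices(word2index, sentence):
    """Map each word to its index, falling back to <UNK>, and append <EOS>."""
    unk = word2index["<UNK>"]
    indices = [word2index.get(word, unk) for word in sentence.split()]
    indices.append(word2index["<EOS>"])
    return indices

print(sentence_to_indices(word2index, "when did zzz"))  # [4, 5, 3, 2]
```

The resulting list could then be converted to a tensor in the same way `prepare_data` does.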


In [384]:
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        
        super(Encoder, self).__init__()
        
        # self.embedding provides a vector representation of the inputs to our model
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        
        self.hidden_size = hidden_size
        self.input_size = input_size
        #self.embedding_size = embedding_size
        #self.n_layers = n_layers

        #self.hidden = torch.zeros(1, 1, hidden_size)

        self.embedding = nn.Embedding(self.input_size, self.hidden_size)
        
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, 1) 
    
    
    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: h, the hidden state
                c, the cell state
        '''
        # i = i.unsqueeze(0)
        embedded = self.embedding(i)
        o, (h,c) = self.lstm(embedded)
        
        #return o, h, c
        return h, c
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size):
        
        super(Decoder, self).__init__()
        
        # self.embedding provides a vector representation of the target to our model
        
        # self.lstm, accepts the embeddings and outputs a hidden state

        # self.out predicts on the hidden state via a linear output layer
        self.hidden_size = hidden_size
        self.output_size = output_size
        #self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)

        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)
        
        # The LSTM produces an output by passing the hidden state to the Linear layer

        self.out = nn.Linear(self.hidden_size, self.output_size)
        self.softmax = nn.LogSoftmax(dim=1)
        
    def forward(self, i, h, c):
        
        '''
        Inputs: i, the target vector
                h, the hidden state
                c, the cell state
        Outputs: o, the prediction
                h, the updated hidden state
                c, the updated cell state
        '''
        
        embedded = self.embedding(i)
        o, (h,c) = self.lstm(embedded, (h,c))
        o = self.softmax(self.out(o[0])) 
        return o, h, c
        
# Code used for the Seq2Seq class was heavily based on code linked by the mentors in the Knowledge forums
# https://github.com/iJoud/Seq2Seq-Chatbot/blob/main/src/Models.py
# https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

class Seq2Seq(nn.Module):
    
    def __init__(self, input_size, hidden_size, output_size):
        
        super(Seq2Seq, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.encoder = Encoder(self.input_size, self.hidden_size)
        self.decoder = Decoder(self.hidden_size, self.output_size)
    
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):      
        
        trg_len = trg.shape[0]
        o = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.output_size).to(device)
        #output, (h,c) = self.encoder(src)
        h,c = self.encoder(src)
        i = trg[0,:].unsqueeze(0)
        
        for t in range(1, trg_len):
            output, h, c = self.decoder(i, h, c)
            o[t] = output

            # Get the highest-scoring token from the current prediction
            top1 = output.argmax(1)

            if self.training:
                # Decide if we are going to use teacher forcing or not
                teacher_force = random.random() < teacher_forcing_ratio

                # If teacher forcing, use actual next token as next input
                # if not, use predicted token
                i = trg[t].unsqueeze(0) if teacher_force else top1.unsqueeze(0)
            else:
                # At inference time, always feed back the predicted token
                i = top1.unsqueeze(0).detach()
        return o
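
The teacher forcing decision in `Seq2Seq.forward` can be looked at in isolation. The sketch below uses a hypothetical helper, `next_decoder_input`, with placeholder string tokens to show how `teacher_forcing_ratio` controls how often the ground-truth token is fed back instead of the model's own prediction.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise the model's own prediction."""
    teacher_force = random.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

random.seed(0)
trials = 10_000
forced = sum(
    next_decoder_input("target", "predicted", 0.5) == "target"
    for _ in range(trials)
)
print(f"fraction of teacher-forced steps: {forced / trials:.2f}")
```

A ratio of 1.0 always feeds the ground truth and a ratio of 0.0 always feeds the prediction; values in between mix the two.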

    



In [385]:
hidden_size = 512
batch_size = 128
learning_rate = 0.01
n_epochs = 10

In [386]:
model = Seq2Seq(vocabQ.n_words, hidden_size, vocabA.n_words)
print(model)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4339, 512)
    (lstm): LSTM(512, 512)
  )
  (decoder): Decoder(
    (embedding): Embedding(4026, 512)
    (lstm): LSTM(512, 512)
    (out): Linear(in_features=512, out_features=4026, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
)


In [387]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()
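
`nn.NLLLoss` expects log-probabilities, which is why the Decoder ends with a `LogSoftmax` layer. The following small check, on made-up scores and targets, suggests that this pairing gives the same value as applying `nn.CrossEntropyLoss` to the raw scores. The tensor sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy scores for 4 decoding steps over a 6-word vocabulary (illustrative sizes)
logits = torch.randn(4, 6)
targets = torch.tensor([2, 0, 5, 1])

# LogSoftmax followed by NLLLoss, as in the decoder and criterion above
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss applies both steps internally to the raw scores
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```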

In [388]:
for epoch in range(1, n_epochs + 1):
    
    model.to(device)
    model.train()
    train_loss = 0
    for i in range(len(source)):
        
        src = source[i].to(device)
        trg = target[i].to(device)
        output = model(src,trg)
    