# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. The starter code below uses Gensim to build Word2Vec embeddings from the Brown corpus and to load them from disk. You can compare the performance of a model that uses these embeddings against one that does not.



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Encoder reads the input sequence and passes its final hidden state (and, for an LSTM, its cell state) to the Decoder, which uses that state to output a series of token predictions.
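
The data flow described above can be traced with a small shape walk-through. This is a minimal sketch in PyTorch, not the project's model: the vocabulary sizes, `hidden_size`, the greedy decoding loop, and the use of id 1 as the start token are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_vocab, trg_vocab, hidden_size = 20, 15, 8   # illustrative sizes
batch, src_len, max_steps = 2, 5, 4

# Encoder: embedding layer followed by an LSTM.
enc_embed = nn.Embedding(src_vocab, hidden_size)
enc_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)

# Decoder: embedding layer, LSTM, and a linear layer that scores every target token.
dec_embed = nn.Embedding(trg_vocab, hidden_size)
dec_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
dec_out = nn.Linear(hidden_size, trg_vocab)

src = torch.randint(0, src_vocab, (batch, src_len))   # (batch, src_len) token ids

with torch.no_grad():
    # Encode the whole source sequence in one call; keep only the final state.
    _, (hidden, cell) = enc_lstm(enc_embed(src))

    # Decode greedily, one token at a time, from an assumed start token (id 1).
    token = torch.ones(batch, 1, dtype=torch.long)
    predictions = []
    for _ in range(max_steps):
        out, (hidden, cell) = dec_lstm(dec_embed(token), (hidden, cell))
        logits = dec_out(out)             # (batch, 1, trg_vocab)
        token = logits.argmax(dim=-1)     # (batch, 1), fed back as the next input
        predictions.append(token)

predicted_ids = torch.cat(predictions, dim=1)   # (batch, max_steps)
print(hidden.shape, predicted_ids.shape)
```

The only information the Decoder receives about the input is the `(hidden, cell)` pair, which is why the Encoder's per-step outputs are discarded here.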

## Dependencies

- PyTorch
- Torchtext
- NumPy
- Pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. The available datasets are listed here:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import torch

learning_rate = 0.01
batch_size = 32
hidden_size = 128
epoch = 100

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
device

'cuda:0'

In [3]:
from torch.utils.data import Dataset
import numpy as np

PAD_token = 0
SOS_token = 1 
EOS_token = 2 

class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {PAD_token: "PAD_token", SOS_token: "SOS_token", EOS_token: "EOS_token"}
        self.num_words = 3
        self.words = {"PAD_token": PAD_token, "SOS_token": SOS_token, "EOS_token": EOS_token}

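    def add_sentense(self, sentence):
        # Register every space-separated word of the sentence.
        for word in sentence.split(' '):
            self.index_to_word(word)

    def index_to_word(self, word):
        # Despite its name, this adds an unseen word to both lookup tables.
        # Dict membership avoids rebuilding a list of all words on every call.
        if word not in self.words:
            self.index[self.num_words] = word
            self.words[word] = self.num_words
            self.num_words += 1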
            
    def get_words(self):
        return self.words

class QuestionAnswerDataset(Dataset):
    def __init__(self, questions, answers, ques_vocab, ans_vocab):
        assert len(questions) == len(answers), \
            "The length of questions and answers is different"
        super(QuestionAnswerDataset, self).__init__()
        
        # Questions hold max_question_len + 1 positions; answers hold max_answer_len + 1
        # (one extra position for SOS_token). Unused positions start as PAD_token.
        self.questions = torch.full((len(questions), self.get_maxlen(questions)+1, 1), PAD_token, dtype=torch.long)
        self.answers = torch.full((len(answers), self.get_maxlen(answers)+1, 1), PAD_token, dtype=torch.long)
        
        for i in range(self.questions.size(0)):
            q_tensor = self.words_to_tensor(questions[i], ques_vocab)
            self.questions[i, :q_tensor.size(0), :] = q_tensor
            # Every position after the last question token is filled with EOS_token.
            self.questions[i, q_tensor.size(0):, :] = EOS_token
                
        for i in range(self.answers.size(0)):
            a_tensor = self.words_to_tensor(answers[i], ans_vocab)
            # Answers start with SOS_token; positions after the last token stay PAD_token.
            self.answers[i, 0, :] = SOS_token
            self.answers[i, 1:a_tensor.size(0)+1, :] = a_tensor
        
    def get_maxword(self, vocab):
        max_len = 0
        for _, v in vocab.words.items():
            if v > max_len:
                max_len = v
        return max_len
    
    def get_maxlen(self, sentences):
        max_len = 0
        for sentence in sentences:
            if len(sentence) > max_len:
                max_len = len(sentence)
        return max_len
    
    def words_to_tensor(self, words, vocab):
        word_list = [vocab.words[word] for word in words]
        return torch.tensor(word_list).long().view(-1, 1)
            
    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, index):
        return (self.questions[index].type(torch.LongTensor), self.answers[index].type(torch.LongTensor))

In [4]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
from nltk.corpus import brown
from torchtext import datasets

# nltk.download('brown')
# nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

def loadDF(path, split=False):
    '''
    Load the SQuAD 2.0 question/answer pairs into Pandas DataFrames for processing.

    Returns (train_df, test_df) when split is True, otherwise a single
    DataFrame with the train rows followed by the dev rows.
    '''
    def get_dict(dataiter):
        data_dict = {
            "Question": [],
            "Answer": [],
        }

        # Each item is a 4-tuple whose second and third fields are the question
        # and its list of answers. Keep the first answer; skip empty pairs.
        for _, question, answer, _ in dataiter:
            if len(question) != 0 and len(answer[0]) != 0:
                data_dict["Question"].append(question)
                data_dict["Answer"].append(answer[0])

        return data_dict

    train_iter, test_iter = datasets.SQuAD2(path, split=("train", "dev"))

    train_df = pd.DataFrame(get_dict(train_iter))
    test_df = pd.DataFrame(get_dict(test_iter))

    if split:
        return train_df, test_df
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
    return pd.concat([train_df, test_df])


def prepare_text(sentence):
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords

    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    stemmer = PorterStemmer()  # create once instead of once per token

    tokens = tokenizer.tokenize(sentence)
    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests

    new_tokens = []
    for token in tokens:
        # Drop purely numeric tokens, stem the rest, then drop stop words.
        if not token.isdigit():
            token = stemmer.stem(token)
            if token not in stop_words:
                new_tokens.append(token)

    return new_tokens


def train_test_split(SRC, TRG, ques_vocab, ans_vocab):
    
    '''
    Input: SRC, a (train, test) pair of tokenized questions from the dataset
           TRG, a (train, test) pair of tokenized responses from the dataset
           ques_vocab, ans_vocab, the vocabularies that map tokens to indices

    Output: Training and test datasets built from SRC & TRG

    '''
    train_set = QuestionAnswerDataset(SRC[0], TRG[0], ques_vocab, ans_vocab)
    test_set = QuestionAnswerDataset(SRC[1], TRG[1], ques_vocab, ans_vocab)

    return train_set, test_set


In [5]:
train_frame, test_frame = loadDF("./data/squad", split=True)
# Copy the slices so the column assignments below do not write to a view.
train_frame = train_frame.iloc[:5000, :].copy()
test_frame = test_frame.iloc[:5000, :].copy()

train_frame["Question"] = train_frame["Question"].apply(prepare_text)
train_frame["Answer"] = train_frame["Answer"].apply(prepare_text)
test_frame["Question"] = test_frame["Question"].apply(prepare_text)
test_frame["Answer"] = test_frame["Answer"].apply(prepare_text)

In [6]:
# Series.append was removed in pandas 2.0; pd.concat is the equivalent.
total_ques = pd.concat([train_frame["Question"], test_frame["Question"]])
total_ans = pd.concat([train_frame["Answer"], test_frame["Answer"]])

total_ques = total_ques.apply(lambda x: " ".join(x) ).to_list()
total_ans = total_ans.apply(lambda x: " ".join(x) ).to_list()

datapairs = [list(i) for i in zip(total_ques, total_ans)]

In [7]:
total_ques

['beyonc start becom popular',
 'area beyonc compet wa grow',
 'beyonc leav destini child becom solo singer',
 'citi state beyonc grow',
 'decad beyonc becom famou',
 'r b group wa lead singer',
 'album made worldwid known artist',
 'manag destini child group',
 'beyoncé rise fame',
 'role beyoncé destini child',
 'wa first album beyoncé releas solo artist',
 'beyoncé releas danger love',
 'mani grammi award beyoncé win first solo album',
 'wa beyoncé role destini child',
 'wa name beyoncé first solo album',
 'second solo album entertain ventur beyonc explor',
 'artist beyonc marri',
 'set record grammi mani beyonc win',
 'movi beyonc receiv first golden globe nomin',
 'beyonc take hiatu career take control manag',
 'album wa darker tone previou work',
 'movi portray etta jame beyonc creat sasha fierc',
 'destini child end group act',
 'wa name beyoncé second solo album',
 'wa beyoncé first act job',
 'beyoncé marri',
 'name beyoncé alter ego',
 'music recur element',
 'time magazin na
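
The cleaned questions above come from `prepare_text`: lowercase, split on `\w+`, drop purely numeric tokens, stem, and remove stop words. The sketch below reproduces only the steps that need no NLTK data. It skips stemming, and both `clean` and the tiny `STOP_WORDS` set are hypothetical stand-ins, so its output will differ from the stemmed tokens shown here. The input sentence is made up for illustration.

```python
import re

STOP_WORDS = {"when", "did", "in", "the", "what", "a"}  # illustrative subset only

def clean(sentence):
    """Lowercase, keep runs of word characters, drop digit-only tokens and stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]

print(clean("When did Beyonce start becoming popular in 1990?"))
# ['beyonce', 'start', 'becoming', 'popular']
```

Because the pattern keeps only word characters, punctuation such as the trailing question mark never reaches the vocabulary.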

In [8]:
question_vocab = Vocab("Vocab for Question")
answer_vocab = Vocab("Vocab for Answer")
# vocab = Vocab("vocabulary")

for sentense in total_ques:
    question_vocab.add_sentense(sentense)
for sentense in total_ans:
    answer_vocab.add_sentense(sentense)

In [9]:
answer_vocab.words

{'PAD_token': 0,
 'SOS_token': 1,
 'EOS_token': 2,
 'late': 3,
 '1990': 4,
 'sing': 5,
 'danc': 6,
 '': 7,
 'houston': 8,
 'texa': 9,
 'destini': 10,
 'child': 11,
 'danger': 12,
 'love': 13,
 'mathew': 14,
 'knowl': 15,
 'lead': 16,
 'singer': 17,
 'five': 18,
 'act': 19,
 'jay': 20,
 'z': 21,
 'six': 22,
 'dreamgirl': 23,
 'beyoncé': 24,
 'cadillac': 25,
 'record': 26,
 'june': 27,
 'b': 28,
 'day': 29,
 'sasha': 30,
 'fierc': 31,
 'relationship': 32,
 'monogami': 33,
 'influenti': 34,
 'forb': 35,
 '2000': 36,
 'modern': 37,
 'feminist': 38,
 'million': 39,
 'mother': 40,
 'maiden': 41,
 'name': 42,
 'african': 43,
 'american': 44,
 'methodist': 45,
 'xerox': 46,
 'hairdress': 47,
 'salon': 48,
 'owner': 49,
 'solang': 50,
 'joseph': 51,
 'broussard': 52,
 'fredericksburg': 53,
 'darlett': 54,
 'johnson': 55,
 'instructor': 56,
 'st': 57,
 'john': 58,
 'unit': 59,
 'church': 60,
 'music': 61,
 'magnet': 62,
 'school': 63,
 'imagin': 64,
 'seven': 65,
 'arn': 66,
 'frager': 67,
 'fat
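
The mapping above assigns ids in order of first appearance, after the three reserved tokens. A minimal sketch of that bookkeeping with a plain dict follows; `build_vocab` is a hypothetical helper, not part of the notebook, and the sentences are made up.

```python
PAD_token, SOS_token, EOS_token = 0, 1, 2

def build_vocab(sentences):
    """Map each new word to the next free id, after the three reserved tokens."""
    word_to_id = {"PAD_token": PAD_token, "SOS_token": SOS_token, "EOS_token": EOS_token}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word_to_id:          # dict membership, no list scan
                word_to_id[word] = len(word_to_id)
    return word_to_id

vocab = build_vocab(["late 1990", "sing danc", "late sing"])
print(vocab)
print(len(vocab), max(vocab.values()))  # 7 6
```

Since ids start at 0, the number of entries is always one more than the largest id. That distinction matters wherever a size, rather than an id, is expected.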

In [10]:
max_v = 0
for k, v in question_vocab.words.items():
    if v > max_v:
        max_v = v
max_v

6649

In [11]:
l = list(answer_vocab.index)
l[-1]

6470

In [12]:
trainset, testset = train_test_split(
    SRC=(train_frame["Question"], test_frame["Question"]),
    TRG=(train_frame["Answer"], test_frame["Answer"]),
    ques_vocab=question_vocab,
    ans_vocab=answer_vocab
)

In [13]:
from torch.utils.data import DataLoader
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=0)

In [14]:
data = next(iter(train_loader))
data[1].shape

torch.Size([32, 23, 1])
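
The second dimension of this batch is the longest answer length plus one slot for the start token. The padding layout that `QuestionAnswerDataset` builds can be traced without PyTorch. Below is a minimal sketch using plain lists; the helper names `pad_question` and `pad_answer` and the token ids are hypothetical.

```python
PAD_token, SOS_token, EOS_token = 0, 1, 2

def pad_question(ids, max_len):
    """Token ids, then EOS_token in every remaining slot (max_len + 1 slots in total)."""
    return ids + [EOS_token] * (max_len + 1 - len(ids))

def pad_answer(ids, max_len):
    """SOS_token, token ids, then PAD_token in every remaining slot (max_len + 1 slots)."""
    return [SOS_token] + ids + [PAD_token] * (max_len - len(ids))

questions = [[5, 6, 7], [8]]     # made-up token ids
answers = [[3, 4], [9, 10, 11]]

q_max = max(len(q) for q in questions)
a_max = max(len(a) for a in answers)

padded_questions = [pad_question(q, q_max) for q in questions]
padded_answers = [pad_answer(a, a_max) for a in answers]

print(padded_questions)  # [[5, 6, 7, 2], [8, 2, 2, 2]]
print(padded_answers)    # [[1, 3, 4, 0], [1, 9, 10, 11]]
```

With this layout the longest answer has no room for a trailing `EOS_token`, so the decoder targets never contain an end-of-sequence marker.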

In [15]:
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(Encoder, self).__init__()

        self.hidden_size = hidden_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.embedding_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, batch_first=True) 
        self.dropout = nn.Dropout()
        
    def forward(self, x, hidden=None, cell=None):
        x = self.embedding(x)
        if hidden is None and cell is None:
            out, (hidden, cell) = self.lstm(x.squeeze(2))
        else:
            out, (hidden, cell) = self.lstm(x.squeeze(2), (hidden, cell))
        return self.dropout(out), hidden, cell

class Decoder(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(Decoder, self).__init__()

        self.hidden_size = hidden_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.embedding_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, batch_first=True)
        self.linear = nn.Linear(self.hidden_size, self.embedding_size)
        
    def forward(self, x, hidden, cell):
        x = self.embedding(x)
        out, (hidden, cell) = self.lstm(x, (hidden, cell))
        # out = F.log_softmax(self.linear(out))
        out = self.linear(out)
        return out, hidden, cell
        
class Seq2Seq(nn.Module):
    def __init__(self, encoder_embedding_size, hidden_size, decoder_embedding_size):
        super(Seq2Seq, self).__init__()
        
        self.encoder = Encoder(encoder_embedding_size, hidden_size)
        self.decoder = Decoder(decoder_embedding_size, hidden_size)
        
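    def forward(self, src, trg):
        # Encode the whole source sequence in a single call; the LSTM already
        # steps through every position, so no outer loop over src is needed.
        _, hidden, cell = self.encoder(src)
        
        result = []
        # Start decoding from SOS_token for every sequence in the batch.
        decoder_input = torch.full((trg.size(0), 1), SOS_token, dtype=torch.long, device=src.device)
        for i in range(trg.size(1)):
            decoder_output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            result.append(decoder_output)
            # Greedy decoding: feed the most likely token back in as the next input.
            decoder_input = decoder_output.argmax(-1)
            
        return torch.stack(result, dim=1)    # (batch_size, trg_len, 1, vocab_size)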

In [16]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

def train(model, epochs, learning_rate):
    
    model.to(device)
    
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_token)

    train_loss = 0
    for e in range(epochs):
        model.train()
        for i, data in enumerate(train_loader):
            src_batch, trg_batch = data[0].to(device), data[1].to(device)
            
            outputs = model(src_batch, trg_batch)    # (batch_size, trg_len, 1, answer_vocab_count)

            loss = 0
            for output, trg in zip(outputs.permute(1, 0, 2, 3), trg_batch.permute(1, 0, 2)): 
                loss += criterion(output.squeeze(1), trg.squeeze(1))
            
            train_loss = loss / trg_batch.size(0)
            train_loss = train_loss.item()
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            if i % 100 == 0:
                print(f"{e}/{epochs} Epoch, {i} iter - Training loss: {train_loss}")

        model.eval()
        val_loss = 0
        # No gradients are needed for validation.
        with torch.no_grad():
            for i, data in enumerate(test_loader):
                src_batch, trg_batch = data[0].to(device), data[1].to(device)
                
                outputs = model(src_batch, trg_batch)    # (batch_size, trg_len, 1, answer_vocab_count)

                loss = 0
                for output, trg in zip(outputs.permute(1, 0, 2, 3), trg_batch.permute(1, 0, 2)): 
                    loss += criterion(output.squeeze(1), trg.squeeze(1))
                
                val_loss += loss.item() / trg_batch.size(0)
        # val_loss is accumulated once per batch, so average over the number of batches.
        val_loss = val_loss / len(test_loader)
        print(f"Validation loss: {val_loss}")

In [23]:
del model

In [17]:
encoder_embedding_size = list(question_vocab.index)[-1]
decoder_embedding_size = list(answer_vocab.index)[-1]
model = Seq2Seq(encoder_embedding_size, hidden_size, decoder_embedding_size)
print(model)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6649, 128)
    (lstm): LSTM(128, 128, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6470, 128)
    (lstm): LSTM(128, 128, batch_first=True)
    (linear): Linear(in_features=128, out_features=6470, bias=True)
  )
)


In [18]:
train(
      model=model,
      epochs=epoch,
      learning_rate=learning_rate,
)

0/100 Epoch, 0 iter - Training loss: 1.0912964344024658
0/100 Epoch, 100 iter - Training loss: 0.9449151158332825


/opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [128,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [128,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [128,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [128,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [128,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
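
The `srcIndex < srcSelectDimSize` assertion is what an out-of-range embedding lookup looks like on CUDA. A likely cause here is that `encoder_embedding_size` and `decoder_embedding_size` are set to the largest vocabulary id (6649 and 6470) rather than to the number of entries, so the highest id has no row in the embedding table. The sketch below reproduces the effect on the CPU with a made-up five-word table. Sizing the table by the number of words is a suggested fix under that assumption, not a verified one.

```python
import torch
import torch.nn as nn

num_words = 5                     # ids 0..4, so the largest id is 4
ids = torch.tensor([0, 3, 4])

# Sizing the table by the largest id leaves no row for id 4.
too_small = nn.Embedding(num_words - 1, 8)
try:
    too_small(ids)
    lookup_failed = False
except (IndexError, RuntimeError) as err:
    lookup_failed = True
    print("lookup failed:", err)

# Sizing the table by the number of entries covers every id.
embedding = nn.Embedding(num_words, 8)
vectors = embedding(ids)
print(vectors.shape)              # torch.Size([3, 8])
```

In the notebook this would mean passing `question_vocab.num_words` and `answer_vocab.num_words` to `Seq2Seq`. Rerunning on the CPU, or with the `CUDA_LAUNCH_BLOCKING=1` environment variable set, usually yields a clearer error message than the cuDNN failure above.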