# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import torch

learning_rate = 0.01
batch_size = 128
hidden_size = 128
epoch = 100

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
device

'cuda:0'

In [3]:
from torch.utils.data import Dataset
import numpy as np

PAD_token = 0
SOS_token = 1 
EOS_token = 2 

class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {PAD_token: "PAD_token", SOS_token: "SOS_token", EOS_token: "EOS_token"}
        self.num_words = 3
        self.words = {"PAD_token": PAD_token, "SOS_token": SOS_token, "EOS_token": EOS_token}

    def add_sentense(self, sentence):
        for word in sentence.split(' '):
            self.index_to_word(word)
                
    def index_to_word(self, word):
        word_list = list(self.words)
        if word not in word_list:
            self.index[self.num_words] = word
            self.words[word] = self.num_words
            self.num_words += 1
            
    def get_word_count(self):
        return len(self.words)

class QuestionAnswerDataset(Dataset):
    def __init__(self, questions, answers, ques_vocab, ans_vocab):
        assert len(questions) == len(answers), \
            "The length of questions and answers is different"
        super(QuestionAnswerDataset, self).__init__()
        
        ques_maxlen = self.get_maxlen(questions)
        ans_maxlen = self.get_maxlen(answers)
        max_sentence_len = ques_maxlen if ques_maxlen > ans_maxlen else ans_maxlen
        # maxlen = ques_maxlen if ques_maxlen > ans_maxlen else ans_maxlen
        self.questions = torch.full((len(questions), max_sentence_len+1, 1), PAD_token)
        self.answers = torch.full((len(answers), max_sentence_len+1, 1), PAD_token)
        
        for i in range(self.questions.size(0)):
            q_tensor = self.words_to_tensor(questions[i], ques_vocab)
            q_len = q_tensor.size(0)
            self.questions[i, :q_len, :] = q_tensor
            self.questions[i, q_len:q_len+1, :] = torch.LongTensor([EOS_token]).view(1, 1)
        #     for j in range(len(q_tensor_list)):
        #         self.data[i, j] = q_tensor_list[j]
                
        for i in range(self.answers.size(0)):
            a_tensor = self.words_to_tensor(answers[i], ans_vocab)
            a_len = a_tensor.size(0)
            self.answers[i, :a_len :] = a_tensor
            self.answers[i, a_len:a_len+1, :] = torch.LongTensor([EOS_token]).view(1, 1)
        
    def get_maxword(self, vocab):
        max_len = 0
        for _, v in vocab.words.items():
            if v > max_len:
                max_len = v
        return max_len
    
    def get_maxlen(self, sentences):
        max_len = 0
        for sentence in sentences:
            if len(sentence) > max_len:
                max_len = len(sentence)
        return max_len
    
    def words_to_tensor(self, words, vocab):
        word_list = [vocab.words[word] for word in words]
        return torch.tensor(word_list).long().view(-1, 1)
            
    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, index):
        return (self.questions[index].type(torch.LongTensor), self.answers[index].type(torch.LongTensor))

In [4]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
from nltk.corpus import brown
from torchtext import datasets

# nltk.download('brown')
# nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

def loadDF(path, split=False):
  '''

  You will use this function to load the dataset into a Pandas Dataframe for processing.

  '''
  def get_dict(dataiter):
        data_dict = {
          "Question": [],
          "Answer": [],
        }
        
        for _, question, answer, _ in dataiter:
              if len(question) != 0 and len(answer[0]) != 0:
                data_dict["Question"].append(question)
                data_dict["Answer"].append(answer[0])
              
        return data_dict

  train_iter, test_iter = datasets.SQuAD2(path, split=("train", "dev"))

  # train_data_dict, test_data_dict = get_dict(train_iter), get_dict(test_iter)
  train_df = pd.DataFrame(get_dict(train_iter))
  test_df = pd.DataFrame(get_dict(test_iter))
  
  if split:
    return train_df, test_df
  return train_df.append(test_df)


def prepare_text(sentence):
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords

    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    
    tokens = tokenizer.tokenize(sentence)
    # stop_words = stopwords.words("english")
    
    new_tokens = []
    for token in tokens:
        if not token.isdigit():
          token = PorterStemmer().stem(token)
          new_tokens.append(token)
          # if token not in stop_words:
          #     new_tokens.append(token)
      
    return new_tokens


def train_test_split(SRC, TRG, ques_vocab, ans_vocab):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    train_set = QuestionAnswerDataset(SRC[0], TRG[0], ques_vocab, ans_vocab)
    test_set = QuestionAnswerDataset(SRC[1], TRG[1], ques_vocab, ans_vocab)
    
    # SRC_train_dataset = SRC["train"]
    # SRC_test_dataset = SRC["test"]
    # TRG_train_dataset = TRG["train"]
    # TRG_test_dataset = TRG["test"]
    
    # for question, answer in zip(SRC, TRG):
        
    
    return train_set, test_set


In [5]:
train_frame, test_frame = loadDF("./data/squad", split=True)
train_frame = train_frame.iloc[:5000, :]
test_frame = test_frame.iloc[:5000, :]

train_frame["Question"] = train_frame["Question"].apply(prepare_text)
train_frame["Answer"] = train_frame["Answer"].apply(prepare_text)
test_frame["Question"] = test_frame["Question"].apply(prepare_text)
test_frame["Answer"] = test_frame["Answer"].apply(prepare_text)

In [6]:
total_ques = train_frame["Question"].append(test_frame["Question"])
total_ans = train_frame["Answer"].append(test_frame["Answer"])

total_ques = total_ques.apply(lambda x: " ".join(x) ).to_list()
total_ans = total_ans.apply(lambda x: " ".join(x) ).to_list()

datapairs = [list(i) for i in zip(total_ques, total_ans)]

In [7]:
total_ques

['when did beyonc start becom popular',
 'what area did beyonc compet in when she wa grow up',
 'when did beyonc leav destini s child and becom a solo singer',
 'in what citi and state did beyonc grow up',
 'in which decad did beyonc becom famou',
 'in what r b group wa she the lead singer',
 'what album made her a worldwid known artist',
 'who manag the destini s child group',
 'when did beyoncé rise to fame',
 'what role did beyoncé have in destini s child',
 'what wa the first album beyoncé releas as a solo artist',
 'when did beyoncé releas danger in love',
 'how mani grammi award did beyoncé win for her first solo album',
 'what wa beyoncé s role in destini s child',
 'what wa the name of beyoncé s first solo album',
 'after her second solo album what other entertain ventur did beyonc explor',
 'which artist did beyonc marri',
 'to set the record for grammi how mani did beyonc win',
 'for what movi did beyonc receiv her first golden globe nomin',
 'when did beyonc take a hiatu in 

In [8]:
question_vocab = Vocab("Vocab for Question")
answer_vocab = Vocab("Vocab for Answer")
# vocab = Vocab("vocabulary")

for sentense in total_ques:
    question_vocab.add_sentense(sentense)
for sentense in total_ans:
    answer_vocab.add_sentense(sentense)

In [9]:
answer_vocab.words

{'PAD_token': 0,
 'SOS_token': 1,
 'EOS_token': 2,
 'in': 3,
 'the': 4,
 'late': 5,
 '1990': 6,
 'sing': 7,
 'and': 8,
 'danc': 9,
 '': 10,
 'houston': 11,
 'texa': 12,
 'destini': 13,
 's': 14,
 'child': 15,
 'danger': 16,
 'love': 17,
 'mathew': 18,
 'knowl': 19,
 'lead': 20,
 'singer': 21,
 'five': 22,
 'act': 23,
 'jay': 24,
 'z': 25,
 'six': 26,
 'dreamgirl': 27,
 'beyoncé': 28,
 'cadillac': 29,
 'record': 30,
 'june': 31,
 'b': 32,
 'day': 33,
 'sasha': 34,
 'fierc': 35,
 'relationship': 36,
 'monogami': 37,
 'influenti': 38,
 'forb': 39,
 '2000': 40,
 'modern': 41,
 'feminist': 42,
 'million': 43,
 'her': 44,
 'mother': 45,
 'maiden': 46,
 'name': 47,
 'african': 48,
 'american': 49,
 'methodist': 50,
 'xerox': 51,
 'hairdress': 52,
 'salon': 53,
 'owner': 54,
 'solang': 55,
 'joseph': 56,
 'broussard': 57,
 'fredericksburg': 58,
 'darlett': 59,
 'johnson': 60,
 'instructor': 61,
 'st': 62,
 'john': 63,
 'unit': 64,
 'church': 65,
 'music': 66,
 'magnet': 67,
 'school': 68,
 'im

In [10]:
max_v = 0
for k, v in question_vocab.words.items():
    if v > max_v:
        max_v = v
max_v

6765

In [11]:
l = list(answer_vocab.index)
l[-1]

6573

In [12]:
print(question_vocab.get_word_count())
print(answer_vocab.get_word_count())

6766
6574


In [13]:
trainset, testset = train_test_split(
    SRC=(train_frame["Question"], test_frame["Question"]),
    TRG=(train_frame["Answer"], test_frame["Answer"]),
    ques_vocab=question_vocab,
    ans_vocab=answer_vocab
)

In [14]:
from torch.utils.data import DataLoader
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=False, drop_last=True, num_workers=0)
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, drop_last=True, num_workers=0)

In [15]:
data = next(iter(train_loader))
data[1].shape

torch.Size([128, 44, 1])

In [16]:
import torch.nn as nn
import torch.nn.functional as F
import random

class Encoder(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(Encoder, self).__init__()

        self.hidden_size = hidden_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.embedding_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, batch_first=True) 
        self.dropout = nn.Dropout()
        
    def forward(self, x, hidden=None, cell=None):
        x = self.embedding(x)
        if hidden is None and cell is None:
            out, (hidden, cell) = self.lstm(x)
        else:
            out, (hidden, cell) = self.lstm(x, (hidden, cell))
        return self.dropout(out), hidden, cell

class Decoder(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(Decoder, self).__init__()

        self.hidden_size = hidden_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.embedding_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, batch_first=True)
        self.linear = nn.Linear(self.hidden_size, self.embedding_size)
        
    def forward(self, x, hidden, cell):
        x = self.embedding(x)
        out, (hidden, cell) = self.lstm(x, (hidden, cell))
        out = F.log_softmax(self.linear(out))
        # out = self.linear(out)
        return out, hidden, cell
        
class Seq2Seq(nn.Module):
    def __init__(self, encoder_embedding_size, hidden_size, decoder_embedding_size):
        super(Seq2Seq, self).__init__()
        
        self.encoder = Encoder(encoder_embedding_size, hidden_size)
        self.decoder = Decoder(decoder_embedding_size, hidden_size)
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        hidden, cell = None, None
        # src = src.permute(1, 0, 2)
        trg = trg.permute(1, 0, 2)
        
        # for encoder_input in src:
        _, hidden, cell = self.encoder(src.squeeze(-1), hidden, cell)
        
        result = []
        decoder_input = torch.full((batch_size, 1), SOS_token).long().to(device)
        for i in range(trg.size(0)):
            decoder_output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            result.append(decoder_output)
            
            teacher_force = random.random() < teacher_forcing_ratio
            
            decoder_input = trg[i] if teacher_force else decoder_output.argmax(-1)
            
        return torch.stack(result, dim=0).permute(1, 0, 2, 3)

In [17]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

def train(model, epochs, learning_rate):
    
    model.to(device)
    
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # criterion = nn.CrossEntropyLoss(ignore_index=PAD_token)
    criterion = nn.NLLLoss(ignore_index=PAD_token)

    for e in range(epochs):
        model.train()
        train_loss = 0
        for data in train_loader:
            src_batch, trg_batch = data[0].to(device), data[1].to(device)
            
            outputs = model(src_batch, trg_batch)    # (batch_size, trg_len, 1, answer_vocab_count)
            # output_dim = outputs.size(-1)
            # outputs = outputs.reshape(-1, output_dim)
            # trg_batch = trg_batch.reshape(-1)
            
            loss = 0
            for output, trg in zip(outputs.permute(1, 0, 2, 3), trg_batch.permute(1, 0, 2)): 
                loss += criterion(output.squeeze(1), trg.squeeze(1))
            
            # loss = criterion(outputs, trg_batch)
            
            # train_loss += loss.item()
            train_loss += loss.item()
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        train_loss = train_loss / len(trainset)
        print(f"{e+1}/{epochs} Epoch, Training loss: {train_loss}")

        model.eval()
        val_loss = 0
        for data in test_loader:
            src_batch, trg_batch = data[0].to(device), data[1].to(device)
            
            with torch.no_grad():
                outputs = model(src_batch, trg_batch)    # (batch_size, trg_len, 1, answer_vocab_count)
                # output_dim = outputs.size(-1)
                # outputs = outputs.reshape(-1, output_dim)
                # trg_batch = trg_batch.reshape(-1)
                
                loss = 0
                for output, trg in zip(outputs.permute(1, 0, 2, 3), trg_batch.permute(1, 0, 2)): 
                    loss += criterion(output.squeeze(1), trg.squeeze(1))
            
                # loss = criterion(outputs, trg_batch)
                
                val_loss += loss.item()
        
        val_loss = val_loss / len(testset)
        print(f"{e+1}/{epochs} Epoch, Validation loss: {val_loss}")

In [18]:
del model

In [19]:
encoder_embedding_size = question_vocab.get_word_count()
decoder_embedding_size = answer_vocab.get_word_count()
model = Seq2Seq(encoder_embedding_size, hidden_size, decoder_embedding_size)
print(model)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6766, 128)
    (lstm): LSTM(128, 128, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6574, 128)
    (lstm): LSTM(128, 128, batch_first=True)
    (linear): Linear(in_features=128, out_features=6574, bias=True)
  )
)


In [20]:
train(
      model=model,
      epochs=epoch,
      learning_rate=learning_rate,
)



1/100 Epoch, Training loss: 0.5056349739074707
1/100 Epoch, Validation loss: 0.5894493770599365
2/100 Epoch, Training loss: 0.505430152130127
2/100 Epoch, Validation loss: 0.5896267753601074
3/100 Epoch, Training loss: 0.5056607566833496
3/100 Epoch, Validation loss: 0.5900566505432129
4/100 Epoch, Training loss: 0.5044498603820801
4/100 Epoch, Validation loss: 0.590094592666626
5/100 Epoch, Training loss: 0.505496157836914
5/100 Epoch, Validation loss: 0.5886661632537842
6/100 Epoch, Training loss: 0.5048639389038085
6/100 Epoch, Validation loss: 0.5908963611602783
7/100 Epoch, Training loss: 0.5060110450744629
7/100 Epoch, Validation loss: 0.5899629245758057
8/100 Epoch, Training loss: 0.5047361083984375
8/100 Epoch, Validation loss: 0.5890245510101318
9/100 Epoch, Training loss: 0.505214331817627
9/100 Epoch, Validation loss: 0.5910688598632813
10/100 Epoch, Training loss: 0.5058362030029296
10/100 Epoch, Validation loss: 0.5911217006683349
11/100 Epoch, Training loss: 0.50450417480