# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. The starter code below trains Word2Vec embeddings on the NLTK Brown corpus with Gensim. You can compare the performance of your model with pretrained embeddings against a model without them.
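
One common way to use pretrained vectors is to copy them into the rows of the embedding matrix and keep random initial values for words the pretrained model does not cover. The sketch below illustrates that idea in plain Python. The `build_embedding_matrix` helper, the toy `pretrained` dictionary, and the vector size are hypothetical and are not part of the starter code.

```python
import random

def build_embedding_matrix(word_to_index, pretrained, dim, seed=0):
    """Return one row per vocabulary index.

    Rows for words found in `pretrained` are copied from it; all other
    rows keep small random values (a hypothetical initialisation scheme).
    Assumes the indices in word_to_index are 0..len-1 without gaps.
    """
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(len(word_to_index))]
    hits = 0
    for word, index in word_to_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[index] = list(vector)
            hits += 1
    return matrix, hits

# Toy data standing in for a real vocabulary and real word vectors.
word_to_index = {"PAD_token": 0, "SOS_token": 1, "EOS_token": 2, "city": 3, "beyonc": 4}
pretrained = {"city": [0.5, -0.2, 0.1]}

matrix, hits = build_embedding_matrix(word_to_index, pretrained, dim=3)
print(len(matrix), hits)  # 5 rows, 1 word covered by the pretrained vectors
```

In PyTorch, a matrix like this can be converted to a float tensor and passed to `nn.Embedding.from_pretrained`. Because the tokens in this notebook are stemmed, lookups against vectors trained on unstemmed text may miss more often, which is worth checking when you compare the two models.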



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model works in three steps: the Encoder reads the input sequence, its final hidden and cell states are passed to the Decoder, and the Decoder uses them to predict the output tokens one at a time.
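
The handoff between the two components can be sketched in a few lines of PyTorch. This is a toy illustration with made-up sizes and untrained layers, not the model built later in this notebook; index 1 is assumed to play the role of the start-of-sequence token.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 10, 8   # toy sizes, chosen only for illustration

embed_src = nn.Embedding(vocab_size, hidden_size)
embed_trg = nn.Embedding(vocab_size, hidden_size)
encoder_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
decoder_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
to_vocab = nn.Linear(hidden_size, vocab_size)

src = torch.randint(0, vocab_size, (2, 5))  # (batch, src_len)

with torch.no_grad():
    # Encoder: only the final hidden and cell states are kept.
    _, (hidden, cell) = encoder_lstm(embed_src(src))

    # Decoder: start from the SOS index and feed each prediction back in.
    token = torch.full((2, 1), 1, dtype=torch.long)
    predictions = []
    for _ in range(4):
        out, (hidden, cell) = decoder_lstm(embed_trg(token), (hidden, cell))
        token = to_vocab(out).argmax(-1)   # (batch, 1) greedy choice
        predictions.append(token)

result = torch.cat(predictions, dim=1)
print(result.shape)  # torch.Size([2, 4])
```

The decoder never sees the source tokens directly. Everything it knows about the question arrives through `hidden` and `cell`.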

## Dependencies

- PyTorch
- Torchtext
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import torch

learning_rate = 0.0001
batch_size = 256
hidden_size = 256
epoch = 200

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
device

'cuda:0'

In [3]:
from torch.utils.data import Dataset
import numpy as np

PAD_token = 0
SOS_token = 1 
EOS_token = 2 

def max_len(sentences):
    # Length of the longest sentence; 0 for an empty collection
    return max((len(sentence) for sentence in sentences), default=0)
    
def words_to_tensor(words, vocab, is_str=False):
    # Accept either a list of tokens or a space-separated string
    tokens = words.split(' ') if is_str else words
    word_list = [vocab.words[word] for word in tokens]
    return torch.LongTensor(word_list).view(-1, 1)


class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {PAD_token: "PAD_token", SOS_token: "SOS_token", EOS_token: "EOS_token"}
        self.num_words = 3
        self.words = {"PAD_token": PAD_token, "SOS_token": SOS_token, "EOS_token": EOS_token}

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            self.index_to_word(word)
                
    def index_to_word(self, word):
        # Register a new word. Membership is checked on the dict directly
        # instead of rebuilding a list of all words on every call.
        if word not in self.words:
            self.index[self.num_words] = word
            self.words[word] = self.num_words
            self.num_words += 1
            
    def get_word_count(self):
        return len(self.words)

class QuestionAnswerDataset(Dataset):
    def __init__(self, questions, answers, ques_vocab, ans_vocab):
        assert len(questions) == len(answers), \
            "The length of questions and answers is different"
        super(QuestionAnswerDataset, self).__init__()
        
        ques_maxlen = max_len(questions)
        ans_maxlen = max_len(answers)
        self.max_sentence_len = max(ques_maxlen, ans_maxlen)
        # One extra position per sentence leaves room for the EOS token
        self.questions = torch.full((len(questions), self.max_sentence_len+1, 1), PAD_token, dtype=torch.long)
        self.answers = torch.full((len(answers), self.max_sentence_len+1, 1), PAD_token, dtype=torch.long)
        
        for i in range(self.questions.size(0)):
            q_tensor = words_to_tensor(questions[i], ques_vocab)
            q_len = q_tensor.size(0)
            self.questions[i, :q_len, :] = q_tensor
            # Mark the end of the sentence right after the last word
            self.questions[i, q_len, 0] = EOS_token

        for i in range(self.answers.size(0)):
            a_tensor = words_to_tensor(answers[i], ans_vocab)
            a_len = a_tensor.size(0)
            self.answers[i, :a_len, :] = a_tensor
            self.answers[i, a_len, 0] = EOS_token
            
    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, index):
        return (self.questions[index].type(torch.LongTensor), self.answers[index].type(torch.LongTensor))
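
The dataset class stores every sentence as a fixed-length column of indices: the word indices first, then one `EOS_token`, then `PAD_token` up to the longest sentence plus one. A minimal plain-Python sketch of that layout follows. The `encode_and_pad` helper and the toy vocabulary are hypothetical and exist only for illustration.

```python
PAD_token, SOS_token, EOS_token = 0, 1, 2

def encode_and_pad(tokens, word_to_index, max_sentence_len):
    """Map tokens to indices, append EOS, then pad to max_sentence_len + 1."""
    indices = [word_to_index[token] for token in tokens]
    indices.append(EOS_token)
    padding = [PAD_token] * (max_sentence_len + 1 - len(indices))
    return indices + padding

# Toy vocabulary; indices 0-2 are reserved for the special tokens.
word_to_index = {"who": 3, "wa": 4, "the": 5, "nors": 6, "leader": 7, "rollo": 8}

question = encode_and_pad(["who", "wa", "the", "nors", "leader"], word_to_index, 5)
answer = encode_and_pad(["rollo"], word_to_index, 5)
print(question)  # [3, 4, 5, 6, 7, 2]
print(answer)    # [8, 2, 0, 0, 0, 0]
```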

In [4]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
from nltk.corpus import brown
from torchtext import datasets

# nltk.download('brown')
# nltk.download('punkt')

# Train Word2Vec embeddings on the Brown corpus, then save and reload them

model = gensim.models.Word2Vec(brown.sents())
# Save first so that the load below finds the file on a fresh run
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')


def loadDF(data_iter):
    '''
    Load the dataset into a pandas DataFrame for processing.

    Only the question and the first answer of each item are kept;
    items with an empty question or empty first answer are skipped.
    '''
    data = {"Question": [], "Answer": []}

    for _, question, answer, _ in data_iter:
        if len(question) != 0 and len(answer[0]) != 0:
            data["Question"].append(question)
            data["Answer"].append(answer[0])

    df = pd.DataFrame(data)
    return df


def prepare_text(sentence):
    '''
    Clean a sentence with a tokenizer: lowercase it, split it into word
    tokens, drop purely numeric tokens, and stem what remains.
    https://www.nltk.org/api/nltk.tokenize.html
    '''
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import PorterStemmer

    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    stemmer = PorterStemmer()  # created once instead of once per token

    tokens = tokenizer.tokenize(sentence)

    new_tokens = []
    for token in tokens:
        if not token.isdigit():
            new_tokens.append(stemmer.stem(token))

    return new_tokens


def train_test_split(SRC, TRG, ques_vocab, ans_vocab):
    '''
    Input: SRC, a (train, test) pair of question lists from the dataset
           TRG, a (train, test) pair of response lists from the dataset
           ques_vocab, ans_vocab, the vocabularies used to index the tokens

    Output: Training and test datasets built from SRC & TRG
    '''
    train_set = QuestionAnswerDataset(SRC[0], TRG[0], ques_vocab, ans_vocab)
    test_set = QuestionAnswerDataset(SRC[1], TRG[1], ques_vocab, ans_vocab)

    return train_set, test_set
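
`prepare_text` lowercases the sentence, keeps runs of word characters, drops purely numeric tokens, and then stems. The standard-library sketch below reproduces the first three steps with `re` so their effect is visible without NLTK. It deliberately skips stemming, and `simple_tokens` is a hypothetical name.

```python
import re

def simple_tokens(sentence):
    """Lowercase, keep runs of word characters, and drop all-digit tokens."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [token for token in tokens if not token.isdigit()]

print(simple_tokens("When did Destiny's Child form in 1990?"))
# ['when', 'did', 'destiny', 's', 'child', 'form', 'in']
```

Note that an answer made up only of digits, such as a year, loses every token. This is a likely reason why some `Answer` cells are empty lists after preprocessing.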


In [5]:
train_iter, test_iter = datasets.SQuAD2("./data/SQuAD2", split=("train", "dev"))
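# .copy() makes the slice an independent frame, so the column assignments below do not act on a view
train_frame = loadDF(train_iter).iloc[:20000].copy()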
test_frame = loadDF(test_iter)

train_frame["Question"] = train_frame["Question"].apply(prepare_text)
train_frame["Answer"] = train_frame["Answer"].apply(prepare_text)
test_frame["Question"] = test_frame["Question"].apply(prepare_text)
test_frame["Answer"] = test_frame["Answer"].apply(prepare_text)

In [6]:
train_frame

index,Question,Answer
0,"[when, did, beyonc, start, becom, popular]","[in, the, late, 1990]"
1,"[what, area, did, beyonc, compet, in, when, sh...","[sing, and, danc]"
2,"[when, did, beyonc, leav, destini, s, child, a...",[]
3,"[in, what, citi, and, state, did, beyonc, grow...","[houston, texa]"
4,"[in, which, decad, did, beyonc, becom, famou]","[late, 1990]"
...,...,...
19995,"[the, 1080i30, or, 1080i60, notion, identifi, ...",[]
19996,"[the, 720p60, notion, identifi, progress, scan...",[]
19997,"[what, three, scan, rate, do, hz, system, supp...","[50i, 25p, and, 50p]"
19998,"[which, system, suport, 94i, 60i, 976p, 24p, 9...",[hz]


In [7]:
test_frame

index,Question,Answer
0,"[in, what, countri, is, normandi, locat]",[franc]
1,"[when, were, the, norman, in, normandi]","[10th, and, 11th, centuri]"
2,"[from, which, countri, did, the, nors, origin]","[denmark, iceland, and, norway]"
3,"[who, wa, the, nors, leader]",[rollo]
4,"[what, centuri, did, the, norman, first, gain,...","[10th, centuri]"
...,...,...
5923,"[what, is, the, metric, term, less, use, than,...","[kilogram, forc]"
5924,"[what, is, the, kilogram, forc, sometim, reffe...",[kilopond]
5925,"[what, is, a, veri, seldom, use, unit, of, mas...",[slug]
5926,"[what, seldom, use, term, of, a, unit, of, for...",[kip]


In [8]:
# Series.append was removed in pandas 2.0; pd.concat works across versions
total_ques = pd.concat([train_frame["Question"], test_frame["Question"]])
total_ans = pd.concat([train_frame["Answer"], test_frame["Answer"]])

total_ques = total_ques.apply(lambda x: ' '.join(x)).to_list()
total_ans = total_ans.apply(lambda x: ' '.join(x)).to_list()

In [9]:
question_vocab = Vocab("Vocab for Question")
answer_vocab = Vocab("Vocab for Answer")

for sentence in total_ques:
    question_vocab.add_sentence(sentence)
for sentence in total_ans:
    answer_vocab.add_sentence(sentence)

In [10]:
print(question_vocab.get_word_count())
print(answer_vocab.get_word_count())

12269
12164
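
Each vocabulary assigns indices in first-seen order, starting after the three reserved special tokens. A dictionary-only sketch of that bookkeeping is shown below. The `build_vocab` helper is hypothetical and mirrors, rather than replaces, the `Vocab` class.

```python
def build_vocab(sentences, specials=("PAD_token", "SOS_token", "EOS_token")):
    """Assign each new word the next free index, after the special tokens."""
    word_to_index = {token: i for i, token in enumerate(specials)}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word_to_index:      # dict lookup, O(1) on average
                word_to_index[word] = len(word_to_index)
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word

word_to_index, index_to_word = build_vocab(["who wa the nors leader", "who wa rollo"])
print(len(word_to_index))      # 9
print(word_to_index["rollo"])  # 8
```

The word count printed by `get_word_count` therefore includes the three special tokens.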


In [11]:
trainset, testset = train_test_split(
    SRC=(train_frame["Question"], test_frame["Question"]),
    TRG=(train_frame["Answer"], test_frame["Answer"]),
    ques_vocab=question_vocab,
    ans_vocab=answer_vocab
)

In [12]:
from torch.utils.data import DataLoader
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=False, drop_last=True, num_workers=0)
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, drop_last=True, num_workers=0)

In [13]:
data = next(iter(train_loader))
data[1].shape

torch.Size([256, 44, 1])

In [14]:
import torch.nn as nn
import torch.nn.functional as F
import random

class Encoder(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(Encoder, self).__init__()

        self.hidden_size = hidden_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.embedding_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, batch_first=True) 
        self.dropout = nn.Dropout()
        
    def forward(self, x, hidden=None, cell=None):
        x = self.embedding(x)
        if hidden is None and cell is None:
            out, (hidden, cell) = self.lstm(x)
        else:
            out, (hidden, cell) = self.lstm(x, (hidden, cell))
        return self.dropout(out), hidden, cell

class Decoder(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(Decoder, self).__init__()

        self.hidden_size = hidden_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.embedding_size, self.hidden_size)
        # A single-layer LSTM does not apply the dropout argument (PyTorch only warns), so it is omitted
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, batch_first=True)
        self.linear = nn.Linear(self.hidden_size, self.embedding_size)
        self.softmax = nn.LogSoftmax(dim=-1)
        
    def forward(self, x, hidden, cell):
        x = self.embedding(x)
        out, (hidden, cell) = self.lstm(x, (hidden, cell))
        out = self.linear(out)    # project to the answer vocabulary size
        out = self.softmax(out)   # log-probabilities, as expected by nn.NLLLoss
        return out, hidden, cell
        
class Seq2Seq(nn.Module):
    def __init__(self, encoder_embedding_size, hidden_size, decoder_embedding_size):
        super(Seq2Seq, self).__init__()
        
        self.encoder = Encoder(encoder_embedding_size, hidden_size)
        self.decoder = Decoder(decoder_embedding_size, hidden_size)
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: (batch, src_len, 1); trg: (batch, trg_len, 1), or None at inference time
        if trg is not None:
            trg = trg.permute(1, 0, 2)    # (trg_len, batch, 1)

        _, hidden, cell = self.encoder(src.squeeze(-1))

        result = []
        # Every sequence in the batch starts decoding from the SOS token
        decoder_input = torch.full((hidden.size(1), 1), SOS_token, dtype=torch.long, device=src.device)
        # Without a target, decode as many steps as the source is long
        # (src is batch-first, so the length is dimension 1, not 0)
        decoder_iter = trg.size(0) if trg is not None else src.size(1)
        for i in range(decoder_iter):
            decoder_output, hidden, cell = self.decoder(decoder_input, hidden, cell)
            result.append(decoder_output)

            # Teacher forcing: sometimes feed the ground truth instead of the prediction
            teacher_force = random.random() < teacher_forcing_ratio

            if trg is not None and teacher_force:
                decoder_input = trg[i]
            else:
                decoder_input = decoder_output.argmax(-1)

        # (trg_len, batch, 1, vocab) -> (batch, trg_len, 1, vocab)
        return torch.stack(result, dim=0).permute(1, 0, 2, 3)
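
At each decoding step, `Seq2Seq.forward` draws a random number and feeds the decoder either the ground-truth token (teacher forcing) or its own previous prediction. The decision logic in isolation looks like the sketch below, which uses a hypothetical `next_decoder_input` helper and made-up token indices.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the ground-truth token with probability teacher_forcing_ratio."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher else predicted_token

rng = random.Random(0)
targets = [5, 9, 4, 2]        # made-up ground-truth indices
predictions = [5, 7, 7, 2]    # made-up model outputs

always = [next_decoder_input(t, p, 1.0, rng) for t, p in zip(targets, predictions)]
never = [next_decoder_input(t, p, 0.0, rng) for t, p in zip(targets, predictions)]
print(always)  # [5, 9, 4, 2]
print(never)   # [5, 7, 7, 2]
```

A ratio of 1.0 always follows the target and a ratio of 0.0 always follows the model, which is the only option at chat time when no target exists.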

In [15]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

def train(model, epochs, learning_rate):
    # Uses the global train_loader/test_loader and trainset/testset defined above

    model.to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    # The decoder already applies LogSoftmax, so NLLLoss is used; padded positions are ignored
    criterion = nn.NLLLoss(ignore_index=PAD_token)

    for e in range(epochs):
        model.train()
        train_loss = 0
        for data in train_loader:
            src_batch, trg_batch = data[0].to(device), data[1].to(device)

            outputs = model(src_batch, trg_batch)    # (batch_size, trg_len, 1, answer_vocab_count)

            # Sum the per-step losses over the target sequence
            loss = 0
            for output, trg in zip(outputs.permute(1, 0, 2, 3), trg_batch.permute(1, 0, 2)):
                loss += criterion(output.squeeze(1), trg.squeeze(1))

            train_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Reported value: summed batch losses divided by the number of training examples
        train_loss = train_loss / len(trainset)
        print(f"{e+1}/{epochs} Epoch, Training loss: {train_loss}")

        model.eval()
        val_loss = 0
        for data in test_loader:
            src_batch, trg_batch = data[0].to(device), data[1].to(device)

            with torch.no_grad():
                # Note: the default teacher_forcing_ratio=0.5 still applies during validation
                outputs = model(src_batch, trg_batch)    # (batch_size, trg_len, 1, answer_vocab_count)

                loss = 0
                for output, trg in zip(outputs.permute(1, 0, 2, 3), trg_batch.permute(1, 0, 2)):
                    loss += criterion(output.squeeze(1), trg.squeeze(1))

                val_loss += loss.item()

        val_loss = val_loss / len(testset)
        print(f"{e+1}/{epochs} Epoch, Validation loss: {val_loss}")
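
`nn.NLLLoss(ignore_index=PAD_token)` averages the negative log-probability of the target token over the positions whose target is not padding. A plain-Python sketch of that computation for a single time step is shown below, using made-up log-probabilities. `masked_nll` is a hypothetical name, not a PyTorch function, and returning 0.0 when every target is ignored is a choice made for this sketch only.

```python
import math

PAD_token = 0

def masked_nll(log_probs, targets, ignore_index=PAD_token):
    """Mean of -log p(target) over rows whose target is not ignore_index."""
    losses = [-row[target] for row, target in zip(log_probs, targets)
              if target != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0

# Three examples in a batch, vocabulary of size 3 (index 0 is padding).
log_probs = [
    [math.log(0.1), math.log(0.7), math.log(0.2)],
    [math.log(0.2), math.log(0.3), math.log(0.5)],
    [math.log(0.8), math.log(0.1), math.log(0.1)],
]
targets = [1, 2, 0]  # the last target is padding and is skipped

loss = masked_nll(log_probs, targets)
print(round(loss, 4))  # 0.5249
```

Because padding is skipped, long runs of `PAD_token` at the end of short answers do not pull the loss toward zero.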

In [19]:
del model

In [16]:
encoder_embedding_size = question_vocab.get_word_count()
decoder_embedding_size = answer_vocab.get_word_count()
model = Seq2Seq(encoder_embedding_size, hidden_size, decoder_embedding_size)


In [17]:
train(
      model=model,
      epochs=epoch,
      learning_rate=learning_rate,
)

1/200 Epoch, Training loss: 0.5305834712982178
1/200 Epoch, Validation loss: 0.6588143195539673
2/200 Epoch, Training loss: 0.4847267686843872
2/200 Epoch, Validation loss: 0.6050957256322287
3/200 Epoch, Training loss: 0.4633235534667969
3/200 Epoch, Validation loss: 0.5996774102029531
4/200 Epoch, Training loss: 0.4601088565826416
4/200 Epoch, Validation loss: 0.5935666538604036
5/200 Epoch, Training loss: 0.4532050287246704
5/200 Epoch, Validation loss: 0.5834636675362324
6/200 Epoch, Training loss: 0.45029867458343503
6/200 Epoch, Validation loss: 0.5832721373008491
7/200 Epoch, Training loss: 0.4452033266067505
7/200 Epoch, Validation loss: 0.5889618522242496
8/200 Epoch, Training loss: 0.4402744167327881
8/200 Epoch, Validation loss: 0.5772013812251741
9/200 Epoch, Training loss: 0.4395935293197632
9/200 Epoch, Validation loss: 0.5778769018196384
10/200 Epoch, Training loss: 0.43747643089294436
10/200 Epoch, Validation loss: 0.5686007253875939
11/200 Epoch, Training loss: 0.44072

In [18]:
torch.save(model.state_dict(), "./checkpoint.pth")
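
To use the saved weights later, for example in a command-line chat loop, the usual pattern is to rebuild the model with the same sizes and load the state dictionary into it. The sketch below demonstrates the round trip on a small stand-in layer and a temporary file. The layer and the path are placeholders and are not part of the project.

```python
import os
import tempfile

import torch
import torch.nn as nn

layer = nn.Linear(4, 2)  # stand-in for the trained Seq2Seq model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save(layer.state_dict(), path)

    restored = nn.Linear(4, 2)  # must have the same architecture as the saved model
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # evaluation mode disables dropout layers
same = torch.equal(layer.weight, restored.weight)
print(same)  # True
```

For the chatbot itself, the same steps would apply to a `Seq2Seq` instance created with the vocabulary sizes used during training, loading from `./checkpoint.pth`.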