# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [2]:
import torch

learning_rate = 0.01
batch_size = 128
hidden_size = 128
epoch = 100

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [3]:
device

'cuda:0'

In [4]:
from torch.utils.data import Dataset
import numpy as np

PAD_token = 0
SOS_token = 1 
EOS_token = 2 

class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {PAD_token: "PAD_token", SOS_token: "SOS_token", EOS_token: "EOS_token"}
        self.num_words = 3
        self.words = {"PAD_token": PAD_token, "SOS_token": SOS_token, "EOS_token": EOS_token}

    def add_sentense(self, sentences):
        for sentence in sentences:
            for word in sentence.split(' '):
                self.index_to_word(word)
                
    def index_to_word(self, word):
        word_list = list(self.words)
        if word not in word_list:
            self.index[self.num_words] = word
            self.words[word] = self.num_words
            self.num_words += 1
            
    def get_words(self):
        return self.words

class QuestionAnswerDataset(Dataset):
    def __init__(self, questions, answers, vocab):
        assert len(questions) == len(answers), \
            "The length of questions and answers is different"
        super(QuestionAnswerDataset, self).__init__()
        
        ques_maxlen = self.get_maxlen(questions)
        ans_maxlen = self.get_maxlen(answers)
        maxlen = ques_maxlen if ques_maxlen > ans_maxlen else ans_maxlen
        
        self.data = torch.zeros((len(questions), 2, maxlen))
        self.vocab = vocab
        
        for i in range(self.data.size(0)):
            q_tensor, a_tensor = self.words_to_tensor(questions[i]), self.words_to_tensor(answers[i])
            self.data[i, 0, :q_tensor.size(0)] = q_tensor
            self.data[i, 1, :a_tensor.size(0)] = a_tensor
        #     for j in range(len(q_tensor_list)):
        #         self.data[i, j] = q_tensor_list[j]
                
        # for i in range(self.answers.size(0)):
        #     a_tensor_list = self.words_to_tensor(answers[i])
        #     for j in range(len(a_tensor_list)):
        #         self.answers[i, j] = a_tensor_list[j]
        
    def get_maxlen(self, sentences):
        max_len = 0
        for sentence in sentences:
            length = len(sentence)
            if length > max_len:
                max_len = length
        return max_len
    
    def words_to_tensor(self, words):
        word_list = [self.vocab.words[word] for word in words]
        return torch.tensor(word_list)
            
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        return self.data[index].type(torch.LongTensor)

In [5]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
from nltk.corpus import brown
from torchtext import datasets

# nltk.download('brown')
# nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

def loadDF(path, split=False):
  '''

  You will use this function to load the dataset into a Pandas Dataframe for processing.

  '''
  def get_dict(dataiter):
        data_dict = {
          "Question": [],
          "Answer": [],
        }
        
        for _, question, answer, _ in dataiter:
              if len(question) != 0 and len(answer[0]) != 0:
                data_dict["Question"].append(question)
                data_dict["Answer"].append(answer[0])
              
        return data_dict

  train_iter, test_iter = datasets.SQuAD2(path, split=("train", "dev"))

  # train_data_dict, test_data_dict = get_dict(train_iter), get_dict(test_iter)
  train_df = pd.DataFrame(get_dict(train_iter))
  test_df = pd.DataFrame(get_dict(test_iter))
  
  if split:
    return train_df, test_df
  return train_df.append(test_df)


def prepare_text(sentence):
    
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords

    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    
    tokens = tokenizer.tokenize(sentence)
    stop_words = stopwords.words("english")
    
    new_tokens = []
    for token in tokens:
        if token not in stop_words:
            new_tokens.append(token)
    
    return new_tokens


def train_test_split(SRC, TRG, vocab):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    train_set = QuestionAnswerDataset(SRC[0], TRG[0], vocab)
    test_set = QuestionAnswerDataset(SRC[1], TRG[1], vocab)
    
    # SRC_train_dataset = SRC["train"]
    # SRC_test_dataset = SRC["test"]
    # TRG_train_dataset = TRG["train"]
    # TRG_test_dataset = TRG["test"]
    
    # for question, answer in zip(SRC, TRG):
        
    
    return train_set, test_set


In [6]:
train_frame, test_frame = loadDF("./data/squad", split=True)
train_frame = train_frame.iloc[:50000, :]
test_frame = test_frame.iloc[:5000, :]

train_frame["Question"] = train_frame["Question"].apply(prepare_text)
train_frame["Answer"] = train_frame["Answer"].apply(prepare_text)
test_frame["Question"] = test_frame["Question"].apply(prepare_text)
test_frame["Answer"] = test_frame["Answer"].apply(prepare_text)

In [7]:
total_ques = train_frame["Question"].append(test_frame["Question"])
total_ans = train_frame["Answer"].append(test_frame["Answer"])

total_ques = total_ques.apply(lambda x: " ".join(x) ).to_list()
total_ans = total_ans.apply(lambda x: " ".join(x) ).to_list()

datapairs = [list(i) for i in zip(total_ques, total_ans)]

In [8]:
datapairs

[['beyonce start becoming popular', 'late 1990s'],
 ['areas beyonce compete growing', 'singing dancing'],
 ['beyonce leave destiny child become solo singer', '2003'],
 ['city state beyonce grow', 'houston texas'],
 ['decade beyonce become famous', 'late 1990s'],
 ['r b group lead singer', 'destiny child'],
 ['album made worldwide known artist', 'dangerously love'],
 ['managed destiny child group', 'mathew knowles'],
 ['beyoncé rise fame', 'late 1990s'],
 ['role beyoncé destiny child', 'lead singer'],
 ['first album beyoncé released solo artist', 'dangerously love'],
 ['beyoncé release dangerously love', '2003'],
 ['many grammy awards beyoncé win first solo album', 'five'],
 ['beyoncé role destiny child', 'lead singer'],
 ['name beyoncé first solo album', 'dangerously love'],
 ['second solo album entertainment venture beyonce explore', 'acting'],
 ['artist beyonce marry', 'jay z'],
 ['set record grammys many beyonce win', 'six'],
 ['movie beyonce receive first golden globe nomination', 

In [9]:
# question_vocab = Vocab("Vocab for Question")
# answer_vocab = Vocab("Vocab for Answer")
vocab = Vocab("vocabulary")

for sentense in datapairs:
    vocab.add_sentense(sentense)

In [10]:
vocab.words

{'PAD_token': 0,
 'SOS_token': 1,
 'EOS_token': 2,
 'beyonce': 3,
 'start': 4,
 'becoming': 5,
 'popular': 6,
 'late': 7,
 '1990s': 8,
 'areas': 9,
 'compete': 10,
 'growing': 11,
 'singing': 12,
 'dancing': 13,
 'leave': 14,
 'destiny': 15,
 'child': 16,
 'become': 17,
 'solo': 18,
 'singer': 19,
 '2003': 20,
 'city': 21,
 'state': 22,
 'grow': 23,
 'houston': 24,
 'texas': 25,
 'decade': 26,
 'famous': 27,
 'r': 28,
 'b': 29,
 'group': 30,
 'lead': 31,
 'album': 32,
 'made': 33,
 'worldwide': 34,
 'known': 35,
 'artist': 36,
 'dangerously': 37,
 'love': 38,
 'managed': 39,
 'mathew': 40,
 'knowles': 41,
 'beyoncé': 42,
 'rise': 43,
 'fame': 44,
 'role': 45,
 'first': 46,
 'released': 47,
 'release': 48,
 'many': 49,
 'grammy': 50,
 'awards': 51,
 'win': 52,
 'five': 53,
 'name': 54,
 'second': 55,
 'entertainment': 56,
 'venture': 57,
 'explore': 58,
 'acting': 59,
 'marry': 60,
 'jay': 61,
 'z': 62,
 'set': 63,
 'record': 64,
 'grammys': 65,
 'six': 66,
 'movie': 67,
 'receive': 68,

In [11]:
max_v = 0
for k, v in vocab.words.items():
    if v > max_v:
        max_v = v
max_v

39584

In [12]:
l = list(vocab.index)
l[-1]

39584

In [13]:
trainset, testset = train_test_split(
    SRC=(train_frame["Question"], test_frame["Question"]),
    TRG=(train_frame["Answer"], test_frame["Answer"]),
    vocab=vocab
)

In [14]:
from torch.utils.data import DataLoader
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=False, num_workers=12)
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=12)

In [15]:
data = next(iter(train_loader))
data[:, 0, :]

tensor([[  3,   4,   5,  ...,   0,   0,   0],
        [  9,   3,  10,  ...,   0,   0,   0],
        [  3,  14,  15,  ...,   0,   0,   0],
        ...,
        [400,   3,  46,  ...,   0,   0,   0],
        [  3,  46,  32,  ...,   0,   0,   0],
        [ 49,  37,  38,  ...,   0,   0,   0]])

In [58]:
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_size, num_layer=2):
        super(Encoder, self).__init__()

        self.hidden_size = hidden_size
        self.input_size = input_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layer) 
        self.dropout = nn.Dropout()
        
    def forward(self, x):
        x = self.embedding(x)
        out, (hidden, cell) = self.lstm(x)
        return self.dropout(out), hidden, cell

class Decoder(nn.Module):
    def __init__(self, input_size, output_size, embedding_size, num_layer=2):
        super(Decoder, self).__init__()

        self.input_size = input_size
        self.output_size = output_size
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.lstm = nn.LSTM(self.embedding_size, self.embedding_size, num_layer)
        self.linear = nn.Linear(self.embedding_size, self.output_size)
        
    def forward(self, x, hidden, cell):
        x = self.embedding(x)
        out, (hidden, cell) = self.lstm(x, (hidden, cell))
        out = F.log_softmax(self.linear(out))
        return out, hidden, cell
        
class Seq2Seq(nn.Module):
    def __init__(self, encoder_input_size, hidden_size, embedding_size, decoder_output_size, num_layer=2):
        super(Seq2Seq, self).__init__()
        
        self.encoder = Encoder(encoder_input_size, hidden_size, embedding_size, num_layer)
        self.decoder = Decoder(encoder_input_size, decoder_output_size, embedding_size, num_layer)
        
    def forward(self, src, trg):
        encoder_out, hidden, cell = self.encoder(src)
        
        decoder_input = torch.zeros(trg.size(0), 1).type(torch.LongTensor)
        decoder_output, hidden, cell = self.decoder(decoder_input, hidden[:, 0].unsqueeze(1), cell[:, 0].unsqueeze(1))
        result = []
        for i in range(trg.size(1)):
            decoder_input = trg[:, i]
            # hidden, cell = hidden[:, i].unsqueeze(1), cell[:, i].unsqueeze(1)
            decoder_output, hidden, cell = self.decoder(decoder_input.unsqueeze(1), hidden, cell)
            result.append(decoder_output.squeeze(1))
        return torch.stack(result, dim=1)

In [59]:
B, _, data_len = data.shape
encoder_input_size = list(vocab.index)[-1]
model = Seq2Seq(encoder_input_size, hidden_size, hidden_size, encoder_input_size)

In [60]:
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(39584, 128)
    (lstm): LSTM(128, 128, num_layers=2)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(39584, 128)
    (lstm): LSTM(128, 128, num_layers=2)
    (linear): Linear(in_features=128, out_features=39584, bias=True)
  )
)

In [61]:
out = model(data[:, 0, :], data[:, 1, :])



In [62]:
out.shape

torch.Size([128, 31, 39584])

In [None]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

from sklearn.model_selection import KFold

def train(model, epochs, batch_size, print_every, learning_rate):
    
    model.to(device)
    total_training_loss = 0
    total_valid_loss = 0
    loss = 0
    
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    # criterion = nn.NLLLoss()
    criterion = nn.CrossEntropyLoss(ignore_index=EOS_token)

    kf = KFold(n_splits=epochs, shuffle=True)
    model.train()
    for e in range(epochs):
        for data in train_loader:
            data = data.to(device)

            output = model(src, trg)

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            loss += current_loss
            total_training_loss += (current_loss.item() / trg.size(0)) 
            
            if i % batch_size == 0 or i == (len(train_index)-1):
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                loss = 0


        # validation set 
        model.eval()
        for i in range(0, len(test_index)):
            src = SRC[i].to(device)
            trg = TRG[i].to(device)

            output = model(src, trg)

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            total_valid_loss += (current_loss.item() / trg.size(0)) 


        if e % print_every == 0:
            training_loss_average = total_training_loss / (len(train_index)*print_every)
            validation_loss_average = total_valid_loss / (len(test_index)*print_every)
            print("{}/{} Epoch  -  Training Loss = {:.4f}  -  Validation Loss = {:.4f}".format(e, epochs, training_loss_average, validation_loss_average))
            total_training_loss = 0
            total_valid_loss = 0

In [12]:
learning_rate = 0.01
hidden_size = 128 # encoder and decoder hidden size
batch_size = 128
epochs = 1000

In [13]:
seq2seq = Seq2Seq(question_vocab.get_wordcount(), hidden_size, hidden_size, answer_vocab.get_wordcount())

train(
      model=seq2seq,
      print_every=10,
      epochs=epochs,
      learning_rate=learning_rate,
      batch_size=batch_size
)

10/1000 Epoch  -  Training Loss = 4.4055  -  Validation Loss = 5.0852
20/1000 Epoch  -  Training Loss = 4.2805  -  Validation Loss = 4.9401
30/1000 Epoch  -  Training Loss = 4.2887  -  Validation Loss = 4.8887
40/1000 Epoch  -  Training Loss = 4.2449  -  Validation Loss = 4.8427
50/1000 Epoch  -  Training Loss = 4.2359  -  Validation Loss = 4.8480
60/1000 Epoch  -  Training Loss = 4.2368  -  Validation Loss = 4.7974
70/1000 Epoch  -  Training Loss = 4.2257  -  Validation Loss = 4.8032
80/1000 Epoch  -  Training Loss = 4.2208  -  Validation Loss = 4.8002
90/1000 Epoch  -  Training Loss = 4.2209  -  Validation Loss = 4.7937
100/1000 Epoch  -  Training Loss = 4.2208  -  Validation Loss = 4.7922
110/1000 Epoch  -  Training Loss = 4.2204  -  Validation Loss = 4.7909
120/1000 Epoch  -  Training Loss = 4.2216  -  Validation Loss = 4.7864
130/1000 Epoch  -  Training Loss = 4.2191  -  Validation Loss = 4.7983
140/1000 Epoch  -  Training Loss = 4.2167  -  Validation Loss = 4.7912
150/1000 Epoch 