# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
!pip install torchdata==0.3.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata==0.3.0
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
[K     |████████████████████████████████| 47 kB 2.5 MB/s eta 0:00:011
Installing collected packages: torchdata
Successfully installed torchdata-0.3.0


In [104]:
import random
import itertools
import os
import math

import torch
import torch.nn as nn
import torch.optim as optim

import gensim
import nltk
import tqdm

import numpy as np
import pandas as pd
import gzip
import time

from nltk.corpus import brown
# import dataset SQuAD2 as suggsted
from torchtext.datasets import SQuAD2

from torch.utils.data import Dataset, DataLoader

nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings
# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')
# w2v = gensim.models.Word2Vec.load('brown.embedding')


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
#You will use this function to load the dataset into a Pandas Dataframe for processing.
def loadDF(ds):
    df = {"question": [], "answer": []}
    for context, question, answers, indices in ds:
        if answers[0]:
            df["question"].append(question)
            df["answer"].append(answers[0])
    return pd.DataFrame.from_dict(df)
    

def prepare_text(sentence):
    '''
    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html
    '''
    return nltk.tokenize.word_tokenize(sentence)


def train_test_split(SRC, TRG):
    '''
    Input: SRC, our list of questions from the dataset
           TRG, our list of responses from the dataset
    Output: Training and test datasets for SRC & TRG
    '''
    
    SRC_train_dataset = train_df["question"].tolist()
    TRG_train_dataset = train_df["answer"].tolist()
    
    SRC_test_dataset = test_df["question"].tolist()
    TRG_test_dataset = test_df["answer"].tolist()
        
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


In [3]:
train_dataset, test_dataset = SQuAD2()

In [4]:
train_df = loadDF(train_dataset)
test_df = loadDF(test_dataset)

In [5]:
train_df.head()

Unnamed: 0,question,answer
0,When did Beyonce start becoming popular?,in the late 1990s
1,What areas did Beyonce compete in when she was...,singing and dancing
2,When did Beyonce leave Destiny's Child and bec...,2003
3,In what city and state did Beyonce grow up?,"Houston, Texas"
4,In which decade did Beyonce become famous?,late 1990s


In [6]:
import util as util
print("Start preparing training data ...")
voc, pairs = util.readVocs(train_df, "train")
print("Read {!s} sentence pairs".format(len(pairs)))

Start preparing training data ...
Reading lines...
Read 86821 sentence pairs


In [7]:
pairs = util.filterPairs(pairs)
print("Trimmed to {!s} sentence pairs".format(len(pairs)))

Trimmed to 29334 sentence pairs


In [8]:
print("Counting words...")
for pair in pairs:
    voc.addSentence(pair[0])
    voc.addSentence(pair[1])
print("Counted words:", voc.num_words)

Counting words...
Counted words: 28298


In [9]:
for pair in pairs[:5]:
    print(pair)

['when did beyonce start becoming popular ?', 'in the late s']
['in which decade did beyonce become famous ?', 'late s']
['what album made her a worldwide known artist ?', 'dangerously in love']
['who managed the destiny s child group ?', 'mathew knowles']
['when did beyonce rise to fame ?', 'late s']


In [10]:
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trimRareWords(voc, pairs, MIN_COUNT)

keep_words 9536 / 28295 = 0.3370
Trimmed from 29334 pairs to 13802, 0.4705 of total


In [11]:
print(dir(voc))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'addSentence', 'addWord', 'index2word', 'name', 'num_words', 'trim', 'trimmed', 'word2count', 'word2index']


In [12]:
voc.num_words

9539

## Prepare Data for Model

In [13]:
# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token
MAX_LENGTH = 10 # Maximun Length of the statement

In [14]:
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]


def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == PAD_token:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

# Returns padded input sequence tensor and lengths
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

# Returns all items for a given batch of pairs
def batch2TrainData(voc, pair_batch):
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

input_variable: tensor([[  28,   18,   10,   18,   46],
        [1857,   46,   18, 4386, 4679],
        [  11, 1221, 1768, 4387,   22],
        [3859,   62,   41,   77, 3210],
        [1294,  164, 7376, 2479, 3020],
        [1750,   11, 3880,  562, 8140],
        [ 924, 2036,    9,    9,    9],
        [1448, 6671,    2,    2,    2],
        [   9,    9,    0,    0,    0],
        [   2,    2,    0,    0,    0]])
lengths: tensor([10, 10,  8,  8,  8])
target_variable: tensor([[ 836, 1027, 7368, 4388,  423],
        [ 959,  886,    2,    2,   46],
        [  79,   83,    0,    0, 7867],
        [  11,    2,    0,    0,    2],
        [5095,    0,    0,    0,    0],
        [1970,    0,    0,    0,    0],
        [3859,    0,    0,    0,    0],
        [1024,    0,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
mask: tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True, False, False,  True],
        [ True,  True, Fal

## Encoder, Decoder and Seq2Seq

In [15]:
print(len(input_variable))
print(len(target_variable))
print(voc.num_words)

10
9
9539


In [16]:
# adjustable parameters
INPUT_DIM = voc.num_words
OUTPUT_DIM = voc.num_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# First initialize our model.
# input_size = voc.num_words
# output_size = voc.num_words
# embedding_size = 256
# hidden_size = 512

In [17]:
class Encoder(nn.Module):
    def __init__(self, input_dim: int, emb_dim: int, hid_dim: int, n_layers: int, dropout: float):
        super().__init__()
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.input_dim = input_dim
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)

    def forward(self, src_batch: torch.LongTensor):
        embedded = self.embedding(src_batch) # [sent len, batch size, emb dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs -> [sent len, batch size, hidden dim * n directions]
        return hidden, cell

In [18]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT).to(device)
hidden, cell = encoder(input_variable)
hidden.shape, cell.shape

(torch.Size([2, 5, 512]), torch.Size([2, 5, 512]))

In [19]:
class Decoder(nn.Module):
    def __init__(self, output_dim: int, emb_dim: int, hid_dim: int, n_layers: int, dropout: float):
        super().__init__()
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.out = nn.Linear(hid_dim, output_dim)

    def forward(self, trg: torch.LongTensor, hidden: torch.FloatTensor, cell: torch.FloatTensor):
        # [1, batch size, emb dim], the 1 serves as sent len
        embedded = self.embedding(trg.unsqueeze(0))
        outputs, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.out(outputs.squeeze(0))
        return prediction, hidden, cell
    

In [20]:
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT).to(device)

# notice that we are not passing the entire the .trg
prediction, hidden, cell = decoder(input_variable[0], hidden, cell)
prediction.shape, hidden.shape, cell.shape

(torch.Size([5, 9539]), torch.Size([2, 5, 512]), torch.Size([2, 5, 512]))

In [21]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, device: torch.device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert encoder.hid_dim == decoder.hid_dim, \
            'Hidden dimensions of encoder and decoder must be equal!'
        assert encoder.n_layers == decoder.n_layers, \
            'Encoder and decoder must have equal number of layers!'

    def forward(self, src_batch: torch.LongTensor, trg_batch: torch.LongTensor,
                teacher_forcing_ratio: float=0.5):

        max_len, batch_size = trg_batch.shape
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder's output
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        # last hidden & cell state of the encoder is used as the decoder's initial hidden state
        hidden, cell = self.encoder(src_batch)

        trg = trg_batch[0]
        for i in range(1, max_len):
            prediction, hidden, cell = self.decoder(trg, hidden, cell)
            outputs[i] = prediction

            if random.random() < teacher_forcing_ratio:
                trg = trg_batch[i]
            else:
                trg = prediction.argmax(1)

        return outputs

In [22]:
# note that this implementation assumes that the size of the hidden layer,
# and the number of layer are the same between the encoder and decoder
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)
seq2seq

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(9539, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
  )
  (decoder): Decoder(
    (embedding): Embedding(9539, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (out): Linear(in_features=512, out_features=9539, bias=True)
  )
)

In [23]:
outputs = seq2seq(input_variable, target_variable)
outputs.shape
#todo input_var twice?

torch.Size([9, 5, 9539])

In [24]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(seq2seq):,} trainable parameters')

The model has 17,133,891 trainable parameters


In [25]:
# Training seq2seq

In [26]:
optimizer = optim.Adam(seq2seq.parameters())
criterion = nn.CrossEntropyLoss()

In [99]:
def train(seq2seq, training_batch, optimizer, criterion):
    seq2seq.train()
     # Extract fields from batch
    input_variable, lengths, target_variable, mask, max_target_len = training_batch
    print("Training...")
    epoch_loss = 0
    for batch in training_batch:
        optimizer.zero_grad()
        outputs = seq2seq(training_batch[0], training_batch[2])

        # 1. as mentioned in the seq2seq section, we will
        # cut off the first element when performing the evaluation
        # 2. the loss function only works on 2d inputs
        # with 1d targets we need to flatten each of them
        outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
        trg_flatten = training_batch[2][1:].view(-1)
        loss = criterion(outputs_flatten, trg_flatten)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(training_batch)

In [100]:
def evaluate(seq2seq, training_batch, criterion):
    seq2seq.eval()
    print("Evaluating...")
    input_variable, lengths, target_variable, mask, max_target_len = training_batch

    epoch_loss = 0
    with torch.no_grad():
        for batch in training_batch:
            # turn off teacher forcing
            outputs = seq2seq(training_batch[0], training_batch[2], teacher_forcing_ratio=0) 

            # trg = [trg sent len, batch size]
            # output = [trg sent len, batch size, output dim]
            outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
            trg_flatten = training_batch[2][1:].view(-1)
            loss = criterion(outputs_flatten, trg_flatten)
            epoch_loss += loss.item()

    return epoch_loss / len(training_batch)

In [105]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [108]:
N_EPOCHS = 20
best_valid_loss = float('inf')

# Load batches for each iteration
training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(128)])]
print("Length of training batches"+str(len(training_batches)))

for epoch in range(N_EPOCHS): 
    training_batch = training_batches[epoch - 1]
    
    start_time = time.time()
    train_loss = train(seq2seq, training_batch, optimizer, criterion)
    valid_loss = evaluate(seq2seq, training_batch, criterion)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(seq2seq.state_dict(), 'chatbotmodel.pt')

    # it's easier to see a change in perplexity between epoch as it's an exponential
    # of the loss, hence the scale of the measure is much bigger
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Length of training batches1
Training...
Evaluating...
Epoch: 01 | Time: 0m 27s
	Train Loss: 2.125 | Train PPL:   8.370
	 Val. Loss: 2.017 |  Val. PPL:   7.517
Training...
Evaluating...
Epoch: 02 | Time: 0m 27s
	Train Loss: 1.782 | Train PPL:   5.942
	 Val. Loss: 1.736 |  Val. PPL:   5.675


IndexError: list index out of range

## Evaluating the model

In [109]:
seq2seq.load_state_dict(torch.load('chatbotmodel.pt'))

test_loss = evaluate(seq2seq, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

NameError: name 'model' is not defined

In [None]:
example_idx = 0
example = train_data.examples[example_idx]
print('source sentence: ', ' '.join(example.src))
print('target sentence: ', ' '.join(example.trg))

In [114]:
src_tensor = source.process([example.src]).to(device)
trg_tensor = target.process([example.trg]).to(device)
print(trg_tensor.shape)

seq2seq.eval()
with torch.no_grad():
    outputs = seq2seq(src_tensor, trg_tensor, teacher_forcing_ratio=0)

outputs.shape

In [None]:
output_idx = outputs[1:].squeeze(1).argmax(1)
' '.join([target.vocab.itos[idx] for idx in output_idx])