# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





## Import Libaries

In [13]:
# download liabries needed
# need to restart the kernal after running this cell
!pip install torch==1.12.0 torchdata==0.4.0 torchtext==0.13.0

Defaulting to user installation because normal site-packages is not writeable


In [330]:
# import libraries
import gensim
import nltk
from nltk.stem import *
from nltk.tokenize import RegexpTokenizer
import numpy as np
import pandas as pd
import gzip
from nltk.corpus import brown

import torchtext

import string
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim

import random
import math

In [2]:
# set the random seeds
SEED=42


## Step 1: Build Vocabulary & create the Word Embeddings

In [3]:
# download the data
nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [202]:
# function to load the data
def loadDF(path):
    
    '''
    You will use this function to load the dataset into a Pandas Dataframe for processing.
    Number of lines per split:
        train: 87599
        dev: 10570
    '''
    # load data
    train_data, valid_data = torchtext.datasets.SQuAD1(root=path, split=('train', 'dev'))
    
    # returns: DataPipe that yields data points from SQuaAD1 dataset which consist of context, question, 
    # list of answers and corresponding index in context
    # convert dataPipe to dictionary 
    # make simple pairs of questions and answers
#     train_dict = {'src':[], 'trg':[]}
#     valid_dict = {'src':[], 'trg':[]}
    
    
    train_dict = [{'src': question, 'trg': answers[0]} for _, question, answers, _ in train_data]
    valid_dict = [{'src': question, 'trg': answers[0]} for _, question, answers, _ in val_data]
    
#     for _, question, answer, _ in train_data:
#         train_dict['src'].append(question)
#         train_dict['trg'].append(answer[0])
    
#     for _, question, answer, _ in valid_data:
#         valid_dict['src'].append(question)
#         valid_dict['trg'].append(answers[0])
        
    # convert Dictionaries to Pandas DataFrame
    train_df = pd.DataFrame(train_dict)    
    valid_df = pd.DataFrame(valid_dict)
    
    # combine two parts
    # df = train_df.append(validation_df)
    
    return train_df, valid_df
#     return train_dict, valid_dict

def prepare_text(sentence):
    
    '''
    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html
    '''
    # clean text
    sentence = ''.join([s.lower() for s in sentence if s not in string.punctuation])
    
    stemmer = snowball.SnowballStemmer('english')
    sentence = ' '.join(stemmer.stem(w) for w in sentence.split())
    
    # tokenize text
    tokens = RegexpTokenizer(r'\w+').tokenize(sentence)
    
    return tokens


def train_test_split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    SRC_train_dataset = SRC.sample(frac=0.8, random_state=SEED)
    SRC_test_dataset = SRC.drop(SRC_train_dataset.index)

    TRG_train_dataset = TRG.sample(frac=0.8, random_state=SEED)
    TRG_test_dataset = TRG.drop(TRG_train_dataset.index)
    
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


In [281]:
# test loadDF function
train_raw, valid_raw = loadDF('data')


In [282]:
# to make implementation test quicker, grab a subset of whole dataset
# data_df = data_df_raw.iloc[:5000, :]
train_data = train_raw.iloc[:5000, :]
valid_data = valid_raw.iloc[:5000, :]

In [283]:
# top rows of train data
train_data.head()

Unnamed: 0,src,trg
0,To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous
1,What is in front of the Notre Dame Main Building?,a copper statue of Christ
2,The Basilica of the Sacred heart at Notre Dame...,the Main Building
3,What is the Grotto at Notre Dame?,a Marian place of prayer and reflection
4,What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary


In [284]:
# count of train_df_raw, valid_df_raw
print(f'train data set: {train_raw.shape}')
print(f'valid data set: {valid_raw.shape}')

train data set: (87599, 2)
valid data set: (10570, 2)


In [285]:
# prepare the test
train_data.loc[:, 'src'] = train_data['src'].apply(prepare_text)
train_data.loc[:, 'trg'] = train_data['trg'].apply(prepare_text)
train_data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[selected_item_labels] = value


Unnamed: 0,src,trg
0,"[to, whom, did, the, virgin, mari, alleg, appe...","[saint, bernadett, soubir]"
1,"[what, is, in, front, of, the, notr, dame, mai...","[a, copper, statu, of, christ]"
2,"[the, basilica, of, the, sacr, heart, at, notr...","[the, main, build]"
3,"[what, is, the, grotto, at, notr, dame]","[a, marian, place, of, prayer, and, reflect]"
4,"[what, sit, on, top, of, the, main, build, at,...","[a, golden, statu, of, the, virgin, mari]"


In [286]:
valid_data.loc[:, 'src'] = valid_data['src'].apply(prepare_text)
valid_data.loc[:, 'trg'] = valid_data['trg'].apply(prepare_text)
valid_data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[selected_item_labels] = value


Unnamed: 0,src,trg
0,"[which, nfl, team, repres, the, afc, at, super...","[denver, bronco]"
1,"[which, nfl, team, repres, the, nfc, at, super...","[carolina, panther]"
2,"[where, did, super, bowl, 50, take, place]","[santa, clara, california]"
3,"[which, nfl, team, won, super, bowl, 50]","[denver, bronco]"
4,"[what, color, was, use, to, emphas, the, 50th,...",[gold]


In [287]:
# split the train and test data
SRC = train_data[['src']]
TRG = train_data[['trg']]
SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset = train_test_split(SRC, TRG)

SRC_valid_dataset = valid_data[['src']]
TRG_valid_dataset = valid_data[['trg']]

In [288]:
print(f'SRC_train_dataset size: {SRC_train_dataset.shape}')
print(f'SRC_test_dataset size: {SRC_test_dataset.shape}')
print(f'SRC_valid_dataset size: {SRC_valid_dataset.shape}')

print(f'TRG_train_dataset size: {TRG_train_dataset.shape}')
print(f'TRG_test_dataset size: {TRG_test_dataset.shape}')
print(f'TRG_valid_dataset size: {TRG_valid_dataset.shape}')


SRC_train_dataset size: (4000, 1)
SRC_test_dataset size: (1000, 1)
SRC_valid_dataset size: (5000, 1)
TRG_train_dataset size: (4000, 1)
TRG_test_dataset size: (1000, 1)
TRG_valid_dataset size: (5000, 1)


In [289]:
# example of SRC_train_dataset
SRC_train_dataset.head()

Unnamed: 0,src
1501,"[what, year, did, chopin, die]"
2586,"[what, appl, code, name, for, the, newer, 8pin..."
2653,"[on, what, devic, can, video, game, be, use]"
1055,"[where, doe, the, saskatchewan, river, empti, ..."
705,"[who, coach, beyoncé, for, her, spanish, record]"


In [290]:
# example of TRG_train_dataset
TRG_train_dataset.head()

Unnamed: 0,trg
1501,[1849]
2586,[lightn]
2653,[ipod]
1055,"[hudson, bay]"
705,"[rudi, perez]"


In [294]:
# define the Vovabulary Object
class Vocab:
    def __init__(self, name):
        SOS = 0
        EOS = 1

        self.name = name
        self.index = {SOS: "", EOS: ""}
        self.count = 0
        self.words = {"": SOS, "": EOS}
        
    def indexWord(self, word):
        if word not in self.words:
            self.words[word] = self.count
            self.index[str(self.count)] = word
            self.count += 1
            return True
        else:
            return False
        

In [295]:
# build vocabularies for questions "source" and answers "target"
vocab_src = Vocab(name='src')
vocab_trg = Vocab(name='trg')

for idx, r in train_data.iterrows():
    for w in r['src']:
        vocab_src.indexWord(w)
    for w in r['trg']:
        vocab_trg.indexWord(w)
        
for idx, r in valid_data.iterrows():
    for w in r['src']:
        vocab_src.indexWord(w)
    for w in r['trg']:
        vocab_trg.indexWord(w)


In [296]:
print(f"tokens in vocab_src: {vocab_src.count}")
print(f"tokens in vocab_trg: {vocab_trg.count}")

tokens in vocab_src: 7041
tokens in vocab_trg: 7032


In [297]:
# ref: https://github.com/iJoud/Seq2Seq-Chatbot/blob/main/src/Data.py
def toTensor(vocab, sentence):
    # convert list of words "sentence" to a torch tensor of indices
    indices = [vocab.words[word] for word in sentence]
    indices.append(vocab.words[''])
    return torch.Tensor(indices).long().to(device).view(-1, 1)

In [298]:
SRC_train_dataset = [toTensor(vocab_src, r['src']) for idx, r in SRC_train_dataset.iterrows()]
TRG_train_dataset = [toTensor(vocab_trg, r['trg']) for idx, r in TRG_train_dataset.iterrows()]
SRC_test_dataset = [toTensor(vocab_src, r['src']) for idx, r in SRC_test_dataset.iterrows()]
TRG_test_dataset = [toTensor(vocab_trg, r['trg']) for idx, r in TRG_test_dataset.iterrows()]
SRC_valid_dataset = [toTensor(vocab_src, r['src']) for idx, r in SRC_valid_dataset.iterrows()]
TRG_valid_dataset = [toTensor(vocab_trg, r['trg']) for idx, r in TRG_valid_dataset.iterrows()]


## Step 2: Create the Encoder
The Encoder's job is to create a representation of the input sequence. Then, it captures the representation in the hidden state of the LSTM. And finally, it passes the hidden state to the second half of Seq2Seq.

The layers of The Encoder are:
- The Embedding Layer
- The LSTM
- Dropout Layer (optional)

The parameters of The Encoder are:

- The input size
- The hidden size
- The embedding size

ref:
https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb


In [299]:
# define a torch.device. This is used to tell torchText to put the tensors on the GPU or not
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [300]:
# Create the Encoder
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, n_layers, dropout):
        
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        self.n_layers = n_layers
        
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        #Inputs = [input_size, batch size]
        embedded = self.dropout(self.embedding(i))
        
        #embedded = [input_size, batch size, embedding_size]
        o, (h, c) = self.lstm(embedded)
        
        #outputs = [input_size, batch size, hidden_size * n directions]
        #hidden = [n_layers * n directions, batch size, hidden_size]
        #cell = [n_layers * n directions, batch size, hidden_size]
        
        #outputs are always from the top hidden layer
        
        return o, h, c
    

## Step 3: Create the Decoder
The Decoder's job is to output a prediction based on the hidden state of The Encoder. It combines this with information from the N-1 prediction to create an output.

The layers of The Decoder are:

- The Embedding Layer
- The LSTM
- The Linear Output Layer

The parameters of The Encoder are:

- The output size
- The hidden size
- The embedding size

ref: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

In [301]:
# Create the Decoder
class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size, embedding_size, n_layers, dropout):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
        self.n_layers = n_layers
        
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(self.output_size, self.embedding_size)
        
        # self.lstm, accepts the embeddings and outputs a hidden stat
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, n_layers, dropout = dropout)

        # self.ouput, predicts on the hidden state via a linear output layer     
        self.out = nn.Linear(self.hidden_size, self.output_size)

        self.dropout = nn.Dropout(dropout)
        
    def forward(self, i, h, c):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        #input = [batch size]
        #hidden = [n_layers * n directions, batch size, hidden_size]
        #cell = [n_layers * n directions, batch size, hidden_size]
        
        #n directions in the decoder will both always be 1, therefore:
        #hidden = [n_layers, batch size, hidden_size]
        #context = [n_layerss, batch size, hidden_size]
        
        i = i.unsqueeze(0)
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(i))
        #embedded = [1, batch size, embedding_size]
        
        output, (h, c) = self.lstm(embedded, (h, c))
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        o = self.out(output.squeeze(0))
        #prediction = [batch size, output dim]
        
        return o, h

## Step 4: Combine into a Seq2Seq Architecture
This will handle:

- receiving the input/source sentence
- using the encoder to produce the context vectors
- using the decoder to produce the predicted output/target sentence

In [336]:
# Combine them into a Seq2Seq Architecture
class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, encoder_hidden_size, encoder_embedding_size, encoder_n_layers, encoder_dropout,\
                 decoder_hidden_size, decoder_output_size, decoder_embedding_size, decoder_n_layers, decoder_dropout):

        super(Seq2Seq, self).__init__()
        
        self.encoder = Encoder(encoder_input_size, encoder_hidden_size, encoder_embedding_size, encoder_n_layers, encoder_dropout)
        self.decoder = Decoder(decoder_hidden_size, decoder_output_size, decoder_embedding_size, decoder_n_layers, decoder_dropout)
    
    def forward(self, src, trg, batch_size, teacher_forcing_ratio = 0.5):      
#     def forward(self, src, trg, teacher_forcing_ratio = 0.5):      

        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
#         batch_size = trg.shape[1]
        trg_len = trg.shape[0]
#         trg_len = trg.size(0)
        trg_vocab_size = self.decoder.output_size
        
        #tensor to store decoder outputs
        o = torch.zeros(trg_len, batch_size, trg_vocab_size).to(device)

        #last hidden state of the encoder is used as the initial hidden state of the decoder
        encoder_output, h, c = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        decoder_input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states            
            decoder_output, h = self.decoder(decoder_input, h, c)
            
            #place predictions in a tensor holding predictions for each token
            o[t] = decoder_output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = decoder_output.argmax(1)
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            decoder_input = trg[t] if teacher_force else top1
            
        return o
    

## Step 5: Train & evaluate model

ref: https://learn.udacity.com/nanodegrees/nd101/parts/cd1822/lessons/23e85aa3-ecde-4dc6-ae90-dc7144383206/concepts/e1424b2b-4627-4953-b342-78b4c444478a

In [337]:
# nitialize the model
# the input and output dimensions are defined by the size of the vocabulary
# the embedding dimesions and dropout for the encoder and decoder can be different, 
# but the size of the hidden/cell states must be the same.
# then define the encoder, decoder and then Seq2Seq model, which we place on the device.

encoder_input_size = vocab_src.count
decoder_output_size = vocab_trg.count
encoder_embedding_size = 256
decoder_embedding_size = 256
hidden_size = 512
n_layers = 2
encoder_dropout = 0.5
decoder_dropout = 0.5

model = Seq2Seq(encoder_input_size, hidden_size, encoder_embedding_size, n_layers, encoder_dropout,\
                 hidden_size, decoder_output_size, decoder_embedding_size, n_layers, decoder_dropout)

In [338]:
# define optimizer, which we use to update our parameters in the training loop. 
optimizer = optim.Adam(model.parameters())
# optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# define loss function
criterion = nn.CrossEntropyLoss()
# criterion = nn.NLLLoss()


In [339]:
# train function
def train(model, source_data, target_data, batch_size, optimizer, criterion, clip):

    model.to(device)
    
    model.train()
    
    epoch_loss = 0
    
    for i in range(0, len(source_data)):
        
        src = source_data[i]
        trg = target_data[i]
        
        optimizer.zero_grad()
        
        output = model(src, trg, batch_size, 0.5)
        
        #trg = [trg len, batch_size]
        #output = [trg len, batch_size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [340]:
# evaluate function
def evaluate(model, source_data, target_data, batch_size, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
        
        for i in range(0, len(source_data)):
            
            src = source_data[i]
            trg = target_data[i]

            output = model(src, trg, batch_size, 0) #turn off teacher forcing

            #trg = [trg len, batch_size]
            #output = [trg len, batch_size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch_size]
            #output = [(trg len - 1) * batch_size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [341]:
# train the model
# At each epoch, will check if model has achieved the best validation loss so far. 
# If it has, will update best validation loss and save the parameters of our model. 
# when test model, will use the saved parameters used to achieve the best validation loss.

n_epochs = 1
clip = 1
batch_size = 128
model_path = 'seq2seq.pt'

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    
    train_loss = train(model, SRC_train_dataset, TRG_train_dataset, batch_size, optimizer, criterion, clip)
    valid_loss = evaluate(model, source_data, target_data, batch_size, criterion)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), model_path)
    
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Val. Loss: {valid_loss:.3f}')
    

ValueError: Expected input batch_size (128) to match target batch_size (1).

In [None]:
# load the model
model.load_state_dict(torch.load(model_path))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f}')

model.eval()

## Step 6: Interact with the Chatbot

In [None]:
# simple interactive interfaces
print("Type 'stop' to exit chat")
ANSWER_LENGTH = 12
while True:
    src = input(">")
    
    # If STOP in input, stop script
    if "stop" == src.strip():
        break
    # get the answer
    src = toTensor(vocab_src, " ".join(prepare_text(src)))
    
    answer_words = [] 
    output = model(src, trg, batch_size, 0)
    
    for tensor in output['decoder_output']:

        _, top_token = tensor.data.topk(1)
        if top_token.item() == 1:
            break
        else:
            word = vocab_trg.index[top_token.item()]
            answer_words.append(word)

    # write out an answer for user
    print("<", " ".join(answer), "\n")