# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using the pretrained word embeddings in your model is optional. The starter code below uses Gensim to train Word2Vec embeddings on the NLTK Brown corpus and load them. You can compare the performance of your model with pretrained embeddings against a model without them.
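
One way to make use of pretrained vectors is to copy them into the weight matrix of the embedding layer. The sketch below assumes a plain word-to-vector mapping (with Gensim this could be `w2v.wv`, which supports the same `in` and indexing operations) and a `word2index` dictionary like the one built by the vocabulary class. The function name `build_embedding_matrix` and the toy vectors are made up for illustration.

```python
import numpy as np
import torch
import torch.nn as nn


def build_embedding_matrix(word2index, vectors, dim, seed=0):
    """Return a (vocab_size, dim) float32 matrix whose rows follow word2index."""
    rng = np.random.default_rng(seed)
    # Words without a pretrained vector keep a small random initialisation.
    matrix = rng.normal(scale=0.1, size=(len(word2index), dim)).astype(np.float32)
    for word, idx in word2index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy stand-ins for a real vocabulary and real pretrained vectors (assumptions).
word2index = {"<SOS>": 0, "<EOS>": 1, "late": 2, "1990s": 3}
vectors = {"late": np.ones(4, dtype=np.float32)}

matrix = build_embedding_matrix(word2index, vectors, dim=4)
# freeze=False lets training fine-tune the pretrained rows.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)
```

Training the same model once with this layer and once with a randomly initialised `nn.Embedding` gives the comparison described above.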



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which passes its final hidden state to the Decoder. The Decoder then uses that state to output a series of token predictions.
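
The two components can be sketched in PyTorch roughly as follows. This is a minimal illustration rather than the project's reference solution: the layer sizes, the `greedy_decode` helper, and the choice to pass the LSTM cell state along with the hidden state are all assumptions.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) tensor of token indices
        _, (hidden, cell) = self.lstm(self.embedding(src))
        return hidden, cell


class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch, 1), so this runs a single decoding step
        output, (hidden, cell) = self.lstm(self.embedding(token), (hidden, cell))
        return self.out(output.squeeze(1)), hidden, cell


def greedy_decode(encoder, decoder, src, sos_token=0, max_len=5):
    """Encode src, then repeatedly feed the most likely token back into the decoder."""
    hidden, cell = encoder(src)
    token = torch.full((src.size(0), 1), sos_token, dtype=torch.long)
    steps = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=1, keepdim=True)
        steps.append(token)
    return torch.cat(steps, dim=1)


encoder = Encoder(vocab_size=20, emb_dim=8, hidden_dim=16)
decoder = Decoder(vocab_size=20, emb_dim=8, hidden_dim=16)
src = torch.randint(0, 20, (2, 6))  # two random "sentences" of six tokens
with torch.no_grad():
    predictions = greedy_decode(encoder, decoder, src)
```

With untrained weights the predicted tokens are meaningless; the point is only to show how state flows from the Encoder into the Decoder loop.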

## Dependencies

- PyTorch
- Torchtext
- NumPy
- pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import torch
from torch.utils.data import Dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class QuestionAnswerDataset(Dataset):
    def __init__(self, vocab_tensor: torch.Tensor):
        super(QuestionAnswerDataset, self).__init__()
        self.vocab_tensor = vocab_tensor
        # self.vocab = vocab
        
    def __len__(self):
        return len(self.vocab_tensor)
    
    def __getitem__(self, index):
        return self.vocab_tensor[index]
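
As a rough usage sketch, the dataset can wrap a padded tensor of token indices and be batched with a `DataLoader`. The index sequences below are invented, and padding with 0 is an assumption made for brevity: it collides with `SOS_token` in the vocabulary cell, so a real pipeline would likely reserve a separate padding index.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class QuestionAnswerDataset(Dataset):
    # Same class as above, repeated so that this snippet runs on its own.
    def __init__(self, vocab_tensor):
        super().__init__()
        self.vocab_tensor = vocab_tensor

    def __len__(self):
        return len(self.vocab_tensor)

    def __getitem__(self, index):
        return self.vocab_tensor[index]


# Hypothetical index sequences of different lengths (1 = end-of-sentence).
sequences = [torch.tensor([2, 3, 1]), torch.tensor([4, 1]), torch.tensor([5, 6, 7, 1])]
# Pad every sequence to the length of the longest one.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

loader = DataLoader(QuestionAnswerDataset(padded), batch_size=2, shuffle=False)
batches = [batch for batch in loader]
```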

In [2]:
# Refer to the code at https://pytorch.org/tutorials/beginner/chatbot_tutorial.html

SOS_token = 0  # Start-of-sentence token
EOS_token = 1  # End-of-sentence token

class Vocab:
    def __init__(self, name):
        self.name = name
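        # The two special tokens need distinct keys, otherwise one overwrites the other
        self.word2index = {"<SOS>": SOS_token, "<EOS>": EOS_token}
        # self.word2count = {}
        self.num_words = len(self.word2index)
        self.index2word = {SOS_token: "<SOS>", EOS_token: "<EOS>"}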
        # self.num_words = 3  # Count SOS, EOS, PAD  
        
    def addSentence(self, sentence):
        for word in sentence.split(" "):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            # self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        # else:
        #     self.word2count[word] += 1
            
    def get_word2index(self):
        return self.word2index
    
    def get_wordcount(self):
        return self.num_words
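
Once a vocabulary exists, each tokenized sentence has to be turned into a list of indices before it can be fed to the model. The helper below is a hypothetical sketch that works on a plain `word2index` dictionary such as the one returned by `get_word2index()`. The name `tokens_to_indices` and the handling of unknown words are assumptions.

```python
SOS_token = 0  # Start-of-sentence token
EOS_token = 1  # End-of-sentence token


def tokens_to_indices(tokens, word2index, unk_index=None):
    """Map tokens to indices and append EOS.

    Unknown words are skipped unless unk_index is given.
    """
    indices = []
    for token in tokens:
        if token in word2index:
            indices.append(word2index[token])
        elif unk_index is not None:
            indices.append(unk_index)
    indices.append(EOS_token)
    return indices


# Toy vocabulary standing in for Vocab.get_word2index()
word2index = {"<SOS>": SOS_token, "<EOS>": EOS_token, "late": 2, "1990s": 3}
encoded = tokens_to_indices(["late", "1990s", "unseen"], word2index)
# encoded == [2, 3, 1]: the unknown word is dropped and EOS is appended
```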


In [1]:
# Refer to the code at https://github.com/iJoud/Seq2Seq-Chatbot

import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
from nltk.corpus import brown
from torchtext import datasets

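# Uncomment on first run to fetch the NLTK data used in this notebook
# nltk.download('brown')
# nltk.download('punkt')
# nltk.download('stopwords')

# Train Word2Vec embeddings on the Brown corpus, then save and reload them

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')  # must run at least once so the load below finds the file

w2v = gensim.models.Word2Vec.load('brown.embedding')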
  

def loadDF(path, split=False):
    '''

    You will use this function to load the dataset into a Pandas DataFrame for processing.

    '''
    def get_dict(dataiter):
        data_dict = {
            "Question": [],
            "Answer": [],
        }

        # Each item is unpacked into four fields; only the question and the first answer are kept
        for _, question, answer, _ in dataiter:
            data_dict["Question"].append(question)
            data_dict["Answer"].append(answer[0])

        return data_dict

    train_iter, test_iter = datasets.SQuAD2(path, split=("train", "dev"))

    # train_data_dict, test_data_dict = get_dict(train_iter), get_dict(test_iter)
    train_df = pd.DataFrame(get_dict(train_iter))
    test_df = pd.DataFrame(get_dict(test_iter))
    if split:
        return train_df, test_df
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    return pd.concat([train_df, test_df])


def prepare_text(sentence):
    
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords

    # import string
    # stemmer = nltk.stem.snowball.SnowballStemmer('english')
    
    # sentence = ''.join([s.lower() for s in sentence if s not in string.punctuation])
    # sentence = ' '.join(stemmer.stem(w) for w in sentence.split())
    # tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(sentence)

    # return tokens
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters, drop punctuation

    tokens = tokenizer.tokenize(sentence)
    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests

    # Drop common English stop words
    return [token for token in tokens if token not in stop_words]


def train_test_split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''

    
    return 
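
One possible way to fill in `train_test_split` is sketched below. It is not the project's reference solution: the name `split_pairs`, the 80/20 ratio, the fixed seed, and the order of the returned values are all assumptions. The key point is that questions and answers are shuffled with the same indices so that each pair stays aligned.

```python
import random


def split_pairs(src, trg, test_fraction=0.2, seed=42):
    """Shuffle paired examples and return (src_train, src_test, trg_train, trg_test)."""
    if len(src) != len(trg):
        raise ValueError("SRC and TRG must have the same length")
    indices = list(range(len(src)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [src[i] for i in train_idx],
        [src[i] for i in test_idx],
        [trg[i] for i in train_idx],
        [trg[i] for i in test_idx],
    )


# Toy tokenized questions and answers standing in for the real columns
questions = [["q%d" % i] for i in range(10)]
answers = [["a%d" % i] for i in range(10)]
src_train, src_test, trg_train, trg_test = split_pairs(questions, answers)
```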


In [2]:
dataframe = loadDF("./data/squad")
dataframe

|  | Question | Answer |
|---|---|---|
| 0 | When did Beyonce start becoming popular? | in the late 1990s |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing |
| 2 | When did Beyonce leave Destiny's Child and bec... | 2003 |
| 3 | In what city and state did Beyonce grow up? | Houston, Texas |
| 4 | In which decade did Beyonce become famous? | late 1990s |
| ... | ... | ... |
| 11868 | What is the seldom used force unit equal to on... | sthène |
| 11869 | What does not have a metric counterpart? |  |
| 11870 | What is the force exerted by standard gravity ... |  |
| 11871 | What force leads to a commonly used unit of mass? |  |


In [7]:
type(dataframe["Question"])

pandas.core.series.Series

In [3]:
# Take an explicit copy so the column assignments below modify a new DataFrame
# rather than a slice of the original (this avoids pandas' SettingWithCopyWarning)
dataframe = dataframe.iloc[:10000, :].copy()
# test_df = test_df.iloc[:10000, :]
dataframe["Question"] = dataframe["Question"].apply(prepare_text)
dataframe["Answer"] = dataframe["Answer"].apply(prepare_text)


In [4]:
dataframe["Question"]

0                     [beyonce, start, becoming, popular]
1                      [areas, beyonce, compete, growing]
2       [beyonce, leave, destiny, child, become, solo,...
3                            [city, state, beyonce, grow]
4                       [decade, beyonce, become, famous]
                              ...                        
9995                  [many, kelvins, daylight, measured]
9996     [color, temperature, around, 2800, 3000, kelvin]
9997    [said, lights, high, color, temperature, energ...
9998         [lamp, energy, yellow, red, spectrum, known]
9999    [light, classified, intended, purpose, mainly,...
Name: Question, Length: 10000, dtype: object

In [5]:
dataframe["Answer"]

0                     [late, 1990s]
1                [singing, dancing]
2                            [2003]
3                  [houston, texas]
4                     [late, 1990s]
                   ...             
9995                         [6400]
9996           [incandescent, bulb]
9997                  [blue, white]
9998    [lower, color, temperature]
9999     [light, produced, fixture]
Name: Answer, Length: 10000, dtype: object