# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. The starter code below includes Gensim code for building, saving, and loading Brown embeddings, so you can compare the performance of a model with pretrained embeddings against a model without them.
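One way to use pretrained vectors is to copy them into a matrix whose rows follow your vocabulary order. The sketch below is a minimal illustration of that idea, not the project's required approach: `build_embedding_matrix`, the toy vectors, and the vocabulary are all hypothetical, and a plain dict stands in for the pretrained vectors.

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Return a (len(vocab), dim) float32 matrix whose rows follow the order of vocab.

    vectors is any mapping that supports `word in vectors` and `vectors[word]`;
    a plain dict is used below as a stand-in for pretrained vectors.
    """
    rng = np.random.default_rng(seed)
    # Words without a pretrained vector keep a small random initialisation
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    found = 0
    for index, word in enumerate(vocab):
        if word in vectors:
            matrix[index] = vectors[word]
            found += 1
    return matrix, found

# Toy stand-in for pretrained vectors (hypothetical values)
toy_vectors = {
    "the": np.ones(4, dtype=np.float32),
    "dog": np.full(4, 2.0, dtype=np.float32),
}
vocab = ["<pad>", "the", "dog", "zebra"]

matrix, found = build_embedding_matrix(vocab, toy_vectors, dim=4)
print(matrix.shape, found)  # (4, 4) 2
```

Gensim's `KeyedVectors` objects (for example `w2v.wv` on a trained `Word2Vec` model) support the same `in` and `[]` lookups, so they should be usable in place of the dict. The resulting matrix could then be converted with `torch.from_numpy` and passed to `torch.nn.Embedding.from_pretrained`.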



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model works by passing an input sequence to the Encoder, which summarizes it in its final hidden state. That hidden state is handed to the Decoder, which uses it to output a series of token predictions.
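The data flow described above can be sketched with bare PyTorch layers. This is only an illustration under assumed settings: the sizes, the random input, and the use of index 0 as a start token are made up, and the layers are untrained, so the predicted tokens are meaningless.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embedding_dim, hidden_size = 20, 8, 16  # illustrative sizes only

# Encoder: embedding layer followed by an LSTM
enc_embedding = nn.Embedding(vocab_size, embedding_dim)
enc_lstm = nn.LSTM(embedding_dim, hidden_size)

# Decoder: embedding layer, LSTM, and a linear layer that scores every vocabulary token
dec_embedding = nn.Embedding(vocab_size, embedding_dim)
dec_lstm = nn.LSTM(embedding_dim, hidden_size)
dec_out = nn.Linear(hidden_size, vocab_size)

src = torch.randint(0, vocab_size, (5, 1))  # (src_len, batch) token indices

# Encode the whole source sequence; keep only the final hidden and cell states
_, (h, c) = enc_lstm(enc_embedding(src))

# Decode greedily, one token at a time, starting from a made-up start token (index 0)
token = torch.zeros(1, dtype=torch.long)  # (batch,)
predictions = []
with torch.no_grad():
    for _ in range(4):
        o, (h, c) = dec_lstm(dec_embedding(token.unsqueeze(0)), (h, c))
        logits = dec_out(o.squeeze(0))  # (batch, vocab_size)
        token = logits.argmax(dim=1)    # most likely next token
        predictions.append(token.item())

print(predictions)
```

The Encoder's outputs are discarded here; only its final hidden and cell states reach the Decoder, which then carries its own state forward from step to step.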

## Dependencies

- PyTorch
- Torchtext and Torchdata (for loading the dataset)
- NumPy
- Pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. You can view your options at the following link:

- https://pytorch.org/text/stable/datasets.html





In [1]:
!pip install torchdata

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata
  Downloading torchdata-0.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
Collecting torch==1.13.0
  Downloading torch-1.13.0-cp37-cp37m-manylinux1_x86_64.whl (890.2 MB)
Collecting portalocker>=2.0.0
  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)

In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
#import torch
from nltk.corpus import brown
# Import the SQuAD2 dataset as suggested
#from torchtext.datasets import SQuAD2

nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings
# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')
# w2v = gensim.models.Word2Vec.load('brown.embedding')


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# You will use this function to load the dataset into a pandas DataFrame for processing.
def loadDF(ds):
    df = {"question": [], "answer": []}
    for context, question, answers, indices in ds:
        if answers[0]:
            df["question"].append(question)
            df["answer"].append(answers[0])
    return pd.DataFrame.from_dict(df)
    

def prepare_text(sentence):
    '''
    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html
    '''
    return nltk.tokenize.word_tokenize(sentence)


def train_test_split(SRC, TRG):
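    '''
    Input: SRC, our list of questions from the dataset
           TRG, our list of responses from the dataset
    Output: Training and test datasets for SRC & TRG

    Note: this implementation does not use SRC or TRG. It reads the
    module-level DataFrames train_df and test_df (built with loadDF),
    so both must exist before it is called.
    '''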
    
    SRC_train_dataset = train_df["question"].tolist()
    TRG_train_dataset = train_df["answer"].tolist()
    
    SRC_test_dataset = test_df["question"].tolist()
    TRG_test_dataset = test_df["answer"].tolist()
        
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset
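If you want to split the `SRC` and `TRG` lists directly, as the docstring of `train_test_split` describes, one possible approach is sketched below. The helper name `split_pairs`, the `test_fraction` and `seed` parameters, and the toy data are assumptions for illustration and are not part of the starter code.

```python
import random

def split_pairs(SRC, TRG, test_fraction=0.2, seed=0):
    """Shuffle aligned question/answer lists together and split them into train and test parts."""
    if len(SRC) != len(TRG):
        raise ValueError("SRC and TRG must have the same length")
    indices = list(range(len(SRC)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    SRC_train = [SRC[i] for i in train_idx]
    SRC_test = [SRC[i] for i in test_idx]
    TRG_train = [TRG[i] for i in train_idx]
    TRG_test = [TRG[i] for i in test_idx]
    return SRC_train, SRC_test, TRG_train, TRG_test

# Toy data standing in for the real questions and answers
questions = [f"q{i}" for i in range(10)]
answers = [f"a{i}" for i in range(10)]
SRC_train, SRC_test, TRG_train, TRG_test = split_pairs(questions, answers)
print(len(SRC_train), len(SRC_test))  # 8 2
```

Shuffling a single list of indices keeps each question paired with its own answer on both sides of the split.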


In [3]:
import torch
import torch.nn as nn
from torchtext.datasets import SQuAD2

ImportError: cannot import name 'TypeAlias' from 'typing_extensions' (/opt/conda/lib/python3.7/site-packages/typing_extensions.py)

In [5]:
train_dataset, test_dataset = SQuAD2()

ModuleNotFoundError: Package `torchdata` not found. Please install following instructions at `https://github.com/pytorch/data`

In [8]:
# Training dataset to dataframe
train_df = loadDF(train_dataset)

NameError: name 'train_dataset' is not defined

In [None]:
train_df.head()

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_size=100):
        # embedding_size defaults to 100 here as an assumption; it matches
        # the default vector size of gensim's Word2Vec
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_dim = embedding_size

        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(num_embeddings=self.input_size,
                                      embedding_dim=self.embedding_dim)

        # self.lstm accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(input_size=self.embedding_dim,
                            hidden_size=self.hidden_size,
                            num_layers=1)

    def forward(self, i):

        '''
        Inputs: i, the src vector of token indices, shape (src_len, batch_size)
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        embedded = self.embedding(i)
        # No initial state is passed, so nn.LSTM starts from zeros
        o, (h, c) = self.lstm(embedded)

        return o, h, c
    

In [None]:
class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size, embedding_size=100):

        super(Decoder, self).__init__()
        self.hidden_size=hidden_size
        self.output_size=output_size
        # Assumed default of 100, the default vector size of gensim's Word2Vec
        self.embedding_size=embedding_size
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(num_embeddings=self.output_size,
                                      embedding_dim=self.embedding_size)

        # self.lstm accepts the embeddings and outputs a hidden state.
        # num_layers matches the Encoder (1) so its final state can seed this LSTM.
        self.lstm = nn.LSTM(self.embedding_size, hidden_size, num_layers=1)

        # self.out predicts on the hidden state via a linear output layer
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, i, h, c):

        '''
        Inputs: i, the target tokens for one time step, shape (batch_size,)
                h, the previous hidden state
                c, the previous cell state
        Outputs: o, the raw LSTM output
                h, the new hidden state
                p, the prediction (scores over the output vocabulary)
                c, the new cell state
        '''
        unsqueezed = i.unsqueeze(0)
        embedded = self.embedding(unsqueezed)

        o, (h, c) = self.lstm(embedded, (h, c))

        p = self.out(o.squeeze(0))

        return o, h, p, c

In [None]:
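class Seq2Seq(nn.Module):

    def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size):

        super(Seq2Seq, self).__init__()
        # The decoder is seeded with the encoder's final state, so the hidden sizes must match
        assert encoder_hidden_size == decoder_hidden_size
        self.encoder = Encoder(encoder_input_size, encoder_hidden_size)
        self.decoder = Decoder(decoder_hidden_size, decoder_output_size)

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        # src: (src_len, batch), trg: (trg_len, batch); o: (trg_len, batch, vocab) scores, o[0] stays zero
        o = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.output_size, device=trg.device)
        _, h, c = self.encoder(src)
        token = trg[0]  # first target token of each sequence (e.g. a start-of-sequence token)
        for t in range(1, trg.shape[0]):
            _, h, p, c = self.decoder(token, h, c)
            o[t] = p
            # Teacher forcing: feed the ground-truth token with probability teacher_forcing_ratio
            use_truth = torch.rand(1).item() < teacher_forcing_ratio
            token = trg[t] if use_truth else p.argmax(dim=1)
        return o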