# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. The starter code below builds word embeddings from the Brown corpus with Gensim. You can compare the performance of your model with pretrained embeddings against a model without them.
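One way to set up that comparison is to build two embedding layers of the same shape, one initialised from pretrained vectors and one initialised randomly. The following is only a minimal sketch: the `pretrained` matrix is a random stand-in for real word vectors, and all sizes are illustrative assumptions rather than values from this project.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for pretrained vectors: 5 words, 4 dimensions each.
# In practice this matrix would hold the vectors of a trained embedding model.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(5, 4)).astype(np.float32)

# Variant A: embedding layer initialised from the pretrained matrix.
# freeze=False allows the vectors to be fine-tuned during training.
emb_pretrained = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained), freeze=False
)

# Variant B: randomly initialised embedding layer of the same shape.
emb_random = nn.Embedding(num_embeddings=5, embedding_dim=4)

# Both layers map token indices to vectors of the same size,
# so either one can be plugged into the same model.
token_ids = torch.tensor([0, 3])
vectors = emb_pretrained(token_ids)
print(vectors.shape)
print(emb_random(token_ids).shape)
```

Training the same model once with each variant and comparing the loss curves would give a like-for-like comparison.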



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder and passing the Encoder's final hidden state to the Decoder, which uses it to output a series of token predictions.
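The flow of the hidden state from Encoder to Decoder can be sketched with bare PyTorch layers. This is a toy illustration and not the model defined later in this notebook: the sizes, the token indices, and the use of index 1 as a start token are all assumptions, and the weights are untrained, so the predicted tokens are meaningless.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_size, hid_size = 12, 8, 16  # illustrative sizes

# Encoder: embedding layer + LSTM
enc_emb = nn.Embedding(vocab_size, emb_size)
enc_lstm = nn.LSTM(emb_size, hid_size)

# Decoder: embedding layer + LSTM + linear output layer
dec_emb = nn.Embedding(vocab_size, emb_size)
dec_lstm = nn.LSTM(emb_size, hid_size)
dec_out = nn.Linear(hid_size, vocab_size)

# Source sequence of three token indices, shape (seq_len=3, batch=1)
src = torch.tensor([[4], [5], [6]])

# The encoder reads the whole source; only its final state is kept.
# state is a tuple (h, c), each of shape (1, 1, hid_size).
_, state = enc_lstm(enc_emb(src))

# The decoder starts from a start token (index 1 is an assumption)
# and the encoder's state, then feeds back its own predictions.
token = torch.tensor([[1]])
predictions = []
for _ in range(4):
    out, state = dec_lstm(dec_emb(token), state)
    logits = dec_out(out)            # shape (1, 1, vocab_size)
    token = logits.argmax(dim=-1)    # shape (1, 1), next decoder input
    predictions.append(token.item())

print(predictions)
```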

## Dependencies

- PyTorch
- NumPy
- Pandas
- NLTK
- gzip (part of the Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
test = False
init = False

if test:
    !pip install -U  pytest
if init:
    !pip install typing-extensions --upgrade
    !pip install -U torch torchvision torchtext torchdata pytest

In [2]:
# public libraries
import gzip
import random

import nltk
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim

# my libraries
import utils
import data

In [3]:
if False:
    from nltk.corpus import brown
    from nltk.tokenize import RegexpTokenizer
    import gensim


    # from nltk.stem import *

    nltk.download('brown')
    nltk.download('punkt')

    # Output, save, and load brown embeddings

    model = gensim.models.Word2Vec(brown.sents())
    model.save('brown.embedding')

    w2v = gensim.models.Word2Vec.load('brown.embedding')



    def loadDF(path):
      '''

      You will use this function to load the dataset into a Pandas Dataframe for processing.

      '''
      return df


    def prepare_text(sentence):

        '''

        Our text needs to be cleaned with a tokenizer. This function will perform that task.
        https://www.nltk.org/api/nltk.tokenize.html

        '''

        return tokens



    def train_test_split(SRC, TRG):

        '''
        Input: SRC, our list of questions from the dataset
                TRG, our list of responses from the dataset

        Output: Training and test datasets for SRC & TRG

        '''
        share = 0.2
        # slice indices must be integers
        src_split = int(len(SRC) * share)
        trg_split = int(len(TRG) * share)
        SRC_test_dataset = SRC[:src_split]
        SRC_train_dataset = SRC[src_split:]
        TRG_test_dataset = TRG[:trg_split]
        TRG_train_dataset = TRG[trg_split:]


        return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


### Load and inspect the data

In [4]:
if init:
    data.squad1_to_csv()
    
df_train = pd.read_csv('train_dataset_squad1.csv')  
df_test = pd.read_csv('dev_dataset_squad1.csv') 
train_share = round(len(df_train)*100/(len(df_train)+len(df_test)),1)
print(f"{len(df_train)} training samples and {len(df_test)} test samples have been loaded.")
print(f"The training data makes up {train_share}% of all the data.")
df_train.head()

87599 training samples and 10570 test samples have been loaded.
The training data makes up 89.2% of all the data.


| Unnamed: 0.1 | Unnamed: 0 | context | question | answer | answer_start |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | ['Saint Bernadette Soubirous'] | [515] |
| 1 | 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | ['a copper statue of Christ'] | [188] |
| 2 | 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | ['the Main Building'] | [279] |
| 3 | 3 | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | ['a Marian place of prayer and reflection'] | [381] |
| 4 | 4 | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | ['a golden statue of the Virgin Mary'] | [92] |


### Create vocabulary and dictionary

In [5]:
from nltk.tokenize import RegexpTokenizer  # needed by Vocab.clean_text

class Vocab:
    def __init__(self, name):
        self.name = name
        self.index = {}
        self.count = 0
        self.words = {}
        
    def clean_text(self, text):
        tokenizer = RegexpTokenizer(r'\w+')
        text = tokenizer.tokenize(text)
        return text
                                    
    def indexWord(self, word):
        if word not in self.words:
            self.words[word] = self.count
            self.index[str(self.count)] = word
            self.count += 1
            return True
        else:
            return False

vocab = utils.Vocab(name='SQuAD1')

# index the special tokens first so they receive the lowest indices
PAD = "<PAD>"
SOS = "<SOS>"
EOS = "<EOS>"
OUT = "<OUT>"
for token in [PAD, SOS, EOS, OUT]:
    vocab.indexWord(token)

for i, r in df_train.iterrows():
    question_words = vocab.clean_text(r['question'])
    # strip the leading "['" and trailing "']" from the stringified answer list
    answer_words = vocab.clean_text(r['answer'][2:-2])
    for word in answer_words + question_words:
        vocab.indexWord(word)

VOCAB_SIZE = len(vocab.words)
list(vocab.words.items())[:8]

[('<PAD>', 0),
 ('<SOS>', 1),
 ('<EOS>', 2),
 ('<OUT>', 3),
 ('Saint', 4),
 ('Bernadette', 5),
 ('Soubirous', 6),
 ('To', 7)]
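Once a word-to-index mapping like the one above exists, a sentence can be turned into a list of indices, with unknown words mapped to `<OUT>`. The sketch below is self-contained and does not use `utils.Vocab`: the helper names `tokenize`, `build_vocab`, and `encode` are hypothetical, and wrapping the result in `<SOS>` and `<EOS>` is an assumption about how the sequences might be framed.

```python
import re

SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<OUT>"]

def tokenize(text):
    # Mirrors RegexpTokenizer(r'\w+'): keep runs of word characters only
    return re.findall(r"\w+", text)

def build_vocab(sentences):
    # Special tokens first, then words in order of first appearance
    word_to_index = {}
    words = [w for s in sentences for w in tokenize(s)]
    for word in SPECIAL_TOKENS + words:
        word_to_index.setdefault(word, len(word_to_index))
    return word_to_index

def encode(text, word_to_index):
    # Unknown words fall back to the <OUT> index
    out = word_to_index["<OUT>"]
    ids = [word_to_index.get(w, out) for w in tokenize(text)]
    return [word_to_index["<SOS>"]] + ids + [word_to_index["<EOS>"]]

word_to_index = build_vocab([
    "Saint Bernadette Soubirous",
    "To whom did the Virgin Mary appear",
])
encoded = encode("To whom did Saint Peter appear", word_to_index)
print(encoded)  # [1, 7, 8, 9, 4, 3, 13, 2]
```

Here "Peter" is not in the toy vocabulary, so it is encoded as 3, the index of `<OUT>`.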

### Calculating the necessary sequence length

In [6]:
sequence_length = utils.get_sequence_length(df_train, vocab)
print(f"The sequence has to have a length of {sequence_length}.")
df_train.head()

The sequence has to have a length of 44.


| Unnamed: 0.1 | Unnamed: 0 | context | question | answer | answer_start |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | ['Saint Bernadette Soubirous'] | [515] |
| 1 | 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | ['a copper statue of Christ'] | [188] |
| 2 | 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | ['the Main Building'] | [279] |
| 3 | 3 | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | ['a Marian place of prayer and reflection'] | [381] |
| 4 | 4 | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | ['a golden statue of the Virgin Mary'] | [92] |
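A common reason to compute a single sequence length is to pad every encoded sequence to it with `<PAD>`. How `utils.get_sequence_length` arrives at its value is not shown here, so the following is only a sketch under the assumption that the length is that of the longest encoded sequence; the helper names and the toy sequences are hypothetical.

```python
PAD_ID = 0  # index of <PAD> in the vocabulary

def longest_length(sequences):
    # Assumed rule: the required length is that of the longest sequence
    return max(len(seq) for seq in sequences)

def pad_to_length(seq, length, pad_id=PAD_ID):
    # Append <PAD> indices until the sequence reaches the target length
    if len(seq) > length:
        raise ValueError("sequence is longer than the target length")
    return seq + [pad_id] * (length - len(seq))

# Toy encoded sequences (each starts with <SOS>=1 and ends with <EOS>=2)
sequences = [[1, 7, 8, 9, 2], [1, 4, 5, 6, 2], [1, 7, 2]]
length = longest_length(sequences)
padded = [pad_to_length(seq, length) for seq in sequences]

print(length)      # 5
print(padded[2])   # [1, 7, 2, 0, 0]
```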


### Tests

In [7]:
if test:
    !python -m pytest -vv tests.py

### Defining the model

In [8]:
import torch.nn as nn
class Encoder(nn.Module):
    
    def __init__(self, input_size, embedding_size, hidden_size):
        
        super(Encoder, self).__init__()
        
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(input_size, embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(embedding_size, hidden_size)
        self.dropout = nn.Dropout(p=0.0)
        
    
    def forward(self, i, h):
        
        '''
        Inputs: i, the src vector
                h, the initial state (a tuple of hidden state and cell state)
        Outputs: o, the encoder outputs
                h, the updated state (a tuple of hidden state and cell state)
        '''
        embedding = self.embedding(i)

        o, h = self.lstm(embedding, h)
        o = self.dropout(o)
        
        return o, h
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, embedding_size, output_size):
        
        super(Decoder, self).__init__()
        
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(output_size, embedding_size)
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        self.lstm = nn.LSTM(embedding_size, hidden_size)

        # self.linear, predicts on the hidden state via a linear output layer
        self.linear = nn.Linear(hidden_size, output_size)
        
        # nn.NLLLoss expects log-probabilities, so apply LogSoftmax over the vocabulary dimension
        self.softmax = nn.LogSoftmax(dim=-1)
        
    def forward(self, i, h):
        
        '''
        Inputs: i, the target vector
                h, the state (a tuple of hidden state and cell state)
        Outputs: o, the prediction
                h, the updated state (a tuple of hidden state and cell state)
        '''

        embedding = self.embedding(i)

        o, h = self.lstm(embedding, h)

        o = self.linear(o)

        o = self.softmax(o)

        
        return o, h
        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, embedding_size, hidden_size, decoder_output_size):
        
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder(encoder_input_size, embedding_size, hidden_size)
        self.decoder = Decoder(hidden_size, embedding_size, decoder_output_size)
                
        
    
    
    def forward(self, src, trg, he, teacher_forcing_ratio = 0.5): 
        o, he = self.encoder(src, he)
        o, hd = self.decoder(trg, he)
        
        
        return o, he

    



### Getting the data

In [9]:
questions = utils.get_questions(df_train, vocab)
answers = utils.get_answers(df_train, vocab)

### Training

In [10]:
# import torch
# import torch.nn as nn
# encoder_input_size = 4
# embedding_size = 2

# hidden_size = 3
# decoder_output_size = 4
# print(encoder_input_size, embedding_size, hidden_size, decoder_output_size)
# qt=torch.LongTensor([[1]])
# trg=torch.LongTensor([[0]])
# hidden=torch.zeros(1,1,hidden_size)
# hidden = (hidden, hidden)
# # 1.
# e = nn.Embedding(encoder_input_size, embedding_size)
# embedding = e(qt)
# #print(embedding.shape, hidden.shape)
# #print(embedding)
# # 2.
# lstm = nn.LSTM(embedding_size, hidden_size,1)

# lstm(embedding, hidden)
# lstm

# model = Seq2Seq(encoder_input_size, embedding_size, hidden_size, decoder_output_size)
# model(qt,trg,hidden)

In [11]:
is_cuda = torch.cuda.is_available()
#is_cuda = False
print(is_cuda)

import torch.optim
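# hyperparams
epochs = 1
batch_size = 10
hidden_size = 10
lr = 0.01
teacher_forcing_ratio = 0.5


# model
encoder_input_size = len(vocab.words)
# NOTE: an embedding size equal to the vocabulary size makes each embedding
# matrix 60311 x 60311, which is what exhausts GPU memory in the error below;
# a far smaller value (for example 100) is more typical.
embedding_size = len(vocab.words)
decoder_output_size = len(vocab.words)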
print("hyperparams", encoder_input_size, embedding_size, hidden_size, decoder_output_size)


model = Seq2Seq(encoder_input_size, embedding_size, hidden_size, decoder_output_size)
if is_cuda:
    model.cuda()
optim = torch.optim.SGD(model.parameters(), lr=lr)
loss_fn = nn.NLLLoss()


# training
batches = len(df_train)//batch_size
epoch = 0
for epoch in range(epochs):
    hidden = (torch.zeros(1,hidden_size), torch.zeros(1,hidden_size))
    for batch in range(batches):
                   
        for q, a in zip(questions[:10], answers[:10]):
            # trg: created target/answer
            # at: "true" target/answer
            # qt: input source/question
            trg = torch.LongTensor([vocab.words["<SOS>"]])
            print(trg, trg.shape)

            for qt, at in zip(q, a):
                qt = qt.view(-1)
                at = at.view(-1)
                
                optim.zero_grad() 
                if is_cuda:
                    at=at.cuda()
                    qt=qt.cuda()
                    trg=trg.cuda()
                    a,b=hidden
                    hidden = a.cuda(),b.cuda()
                output, hidden = model(qt, trg, hidden)


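                # the loss is always measured against the true target token
                loss = loss_fn(output, at)

                loss.backward()
                optim.step()

                print("loss", loss.item())

                # teacher forcing: feed the true token as the next decoder input,
                # otherwise feed the model's own prediction
                if random.uniform(0, 1) < teacher_forcing_ratio:
                    trg = at
                else:
                    trg = output.argmax(dim=-1)

                # detach the state so gradients do not flow across steps
                hidden = tuple(h.detach() for h in hidden)
                print("###")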
            
        

True
hyperparams 60311 60311 10 60311


RuntimeError: CUDA out of memory. Tried to allocate 13.55 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 11.12 GiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
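The 13.55 GiB in this error message matches the size of a single float32 embedding matrix with 60311 rows and 60311 columns, which is what the model creates when the embedding size equals the vocabulary size. The arithmetic can be checked as follows; the alternative embedding size of 100 is only an illustrative assumption, not a tuned value.

```python
VOCAB_SIZE = 60311       # vocabulary size printed by the training cell
BYTES_PER_FLOAT32 = 4

def embedding_gib(num_embeddings, embedding_dim):
    # Memory of one float32 embedding matrix, in GiB (2**30 bytes)
    return num_embeddings * embedding_dim * BYTES_PER_FLOAT32 / 2**30

full = embedding_gib(VOCAB_SIZE, VOCAB_SIZE)  # embedding size = vocabulary size
small = embedding_gib(VOCAB_SIZE, 100)        # hypothetical smaller embedding size

print(f"{full:.2f} GiB")   # 13.55 GiB
print(f"{small:.3f} GiB")  # 0.022 GiB
```

Because the Encoder and the Decoder each have their own embedding layer, the full-size configuration would need two such matrices, which exceeds the 11.17 GiB reported for the GPU.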

In [None]:
print(qt)
print(trg)
print(hidden)

In [None]:
torch.cuda.memory_summary(device='cuda', abbreviated=False)

In [None]:
t = torch.Tensor(2)
#t = t.view(1,-1)
t.shape
t = t.view(1, -1)
t


In [None]:
at = at.view(-1)
output = output.squeeze(0)
print(at.shape)
print(output.shape)
loss_fn(output,at)

In [None]:
# is_cuda = torch.cuda.is_available()
# print(is_cuda)

# import torch.optim
# # hyperparams
# epochs = 1
# batch_size = len(df_train)
# batch_size = 10
# hidden_size = 100
# lr = 0.01
# teacher_forcing_ratio = 0.5


# # model
# encoder_input_size = len(vocab.words)
# embedding_size = len(vocab.words)
# hidden_size = hidden_size
# decoder_output_size = len(vocab.words)
# print(encoder_input_size, embedding_size, hidden_size, decoder_output_size)


# model = Seq2Seq(encoder_input_size, embedding_size, hidden_size, decoder_output_size)
# if is_cuda:
#     model.cuda()
# optim = torch.optim.SGD(model.parameters(), lr=lr)
# loss_fn = nn.NLLLoss()


# # training
# batches = len(df_train)//batch_size
# epoch = 0
# for epoch in range(epochs):
#     hidden = (torch.zeros(1,1,hidden_size), torch.zeros(1,1,hidden_size))
#     for batch in range(batches):
                   
#         for q, a in zip(questions[:10], answers[:10]):
#             trg = torch.LongTensor([[vocab.words["<SOS>"]]])
#             print(trg, trg.shape)

#             for qt, at in zip(q, a):
                
#                 optim.zero_grad() 
#                 at = at.view(1,-1)
#                 qt = qt.view(1,-1)
#                 if is_cuda:
#                     at=at.cuda()
#                     qt=qt.cuda()
#                     trg=trg.cuda()
#                     a,b=hidden
#                     hidden = a.cuda(),b.cuda()
#                 print("trg", trg.shape)
#                 print("at", at.shape)
#                 output, hidden = model(qt, trg, hidden)


#                 a,b = hidden
#                 print(output.shape, at.shape)
#                 print(output, at)
                
#                 rand = random.uniform(0,1)
#                 if rand > teacher_forcing_ratio:
#                     loss = loss_fn(output, at)
#                 else:
#                     loss = loss_fn(output, trg)
#                 loss.backward()
#                 trg = output
#                 print("###")
            
            
            
#         break
#     break
        

In [None]:
for que in questions[:2]:
    for word in que:
        print(word.shape, word.view(-1).shape,word.view(1,-1).shape,word.view(1,-1), word.view(1,1,-1).shape)

In [None]:
import src.abc
src.abc.test()

In [None]:
import torch.nn.functional as F
# F.one_hot requires an integer (LongTensor) index tensor
F.one_hot(torch.LongTensor([19]), 50)