# Sequence to Sequence Chatbot

In this notebook, I first train a chatbot based on a Sequence to Sequence text generation architecture on a subset of torchtext's SQuAD 1.0 dataset, which consists of question-answer pairs. I then query the trained model with questions it saw during training and questions it did not, to demonstrate how the model behaves.

# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.
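One way to use pretrained embeddings is to build a matrix whose rows are aligned with the vocabulary indices. The sketch below is a minimal illustration using only the standard library. The function name `build_embedding_matrix`, the toy `pretrained` dictionary, and the random fallback for missing words are assumptions made for illustration; they are not the `pretrained_w2v` helper from the starter code.

```python
import random


def build_embedding_matrix(word2index, pretrained, dim, seed=0):
    """Return (matrix, hits): one row per vocabulary index.

    Hypothetical helper: words found in `pretrained` copy their vector,
    all other words get a small random vector.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word2index))]
    hits = 0
    for word, idx in word2index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
            hits += 1
        else:
            matrix[idx] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return matrix, hits


# Toy data standing in for a real vocabulary and real word vectors.
toy_word2index = {"<SOS>": 0, "the": 1, "zebra": 2}
toy_pretrained = {"the": [0.1, 0.2]}
toy_matrix, toy_hits = build_embedding_matrix(toy_word2index, toy_pretrained, dim=2)
print(f"{toy_hits} of {len(toy_word2index)} words found in the pretrained vectors")
```

The hit count gives a quick check of how much of the vocabulary the pretrained vectors actually cover, which helps when comparing the two model variants.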



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.
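The data flow described above can be sketched without any deep learning library. In the toy example below, the "state" is a plain tuple and both step functions are hand-written stand-ins rather than LSTMs, so nothing is learned; it only illustrates how the Encoder's final state is handed to the Decoder, which then emits tokens one at a time until it produces `<EOS>`. All function names here are hypothetical.

```python
SOS, EOS = "<SOS>", "<EOS>"


def encoder_step(token, state):
    """Toy stand-in for one encoder LSTM step: fold the token into the state."""
    return state + (token,)


def decoder_step(prev_token, state):
    """Toy stand-in for decoder LSTM + linear layer.

    Emits the most recently remembered token and returns the updated state.
    """
    if not state:
        return EOS, state
    return state[-1], state[:-1]


def encode(tokens):
    """Run the encoder over the whole input and return its final state."""
    state = ()
    for token in tokens:
        state = encoder_step(token, state)
    return state


def decode(state, max_len=20):
    """Generate tokens from the encoder state until EOS or max_len steps."""
    output = []
    token = SOS  # the decoder always starts from the start-of-sequence token
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == EOS:
            break
        output.append(token)
    return output


reply = decode(encode(["how", "are", "you"]))
print(reply)  # ['you', 'are', 'how']
```

In the real model, the tuple is replaced by the LSTM's hidden and cell state, and `decoder_step` picks the next token from the linear layer's output scores.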

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:

from src.data_chatbot import questions_answers, load_df, toTensor, show_lengths, vectorize_questions, vectorize_answers
from src.data_chatbot import pretrained_w2v, prepare_text
from src.models_chatbot import Seq2Seq
from src.vocab_chatbot import Vocab


from src.train_chatbot import pretrain, train
from src.apply_chatbot import apply_chatbot



Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.int):

ImportError: cannot import name 'tokenize_questions' from 'src.data_chatbot' (C:\Users\Adam\Desktop\Udacity\DL-Generative-Chatbot\src\data_chatbot.py)

### Tests

In [None]:
!python -m pytest -vv src/tests_chatbot.py

### Inspect raw data

In [None]:
df_train = load_df()
df_train.head()

### Tokenize data

In [None]:
questions, answers = questions_answers(5000)
show_lengths(questions, answers)
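As a rough idea of what tokenization and a length summary involve, here is a minimal standard-library sketch. `simple_tokenize` and `length_summary` are hypothetical names; the actual `questions_answers` and `show_lengths` helpers in `src.data_chatbot` may tokenize and report differently.

```python
import re
from statistics import mean


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def length_summary(sequences):
    """Return the minimum, maximum, and mean token count of the sequences."""
    lengths = [len(sequence) for sequence in sequences]
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}


sample_questions = [
    simple_tokenize("Who wrote Hamlet?"),
    simple_tokenize("When was it written?"),
]
print(sample_questions[0])  # ['who', 'wrote', 'hamlet', '?']
print(length_summary(sample_questions))
```

Sequence lengths matter here because the longest answer later bounds how many tokens the decoder is allowed to generate.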

### Create vocabularies

In [None]:
vQ = Vocab("Questions")
for sequence in [["<SOS>", "<EOS>"]] + questions:
    for token in sequence:
        vQ.indexWord(token)
vA = Vocab("Answers")
for sequence in [["<SOS>", "<EOS>"]] + answers:
    for token in sequence:
        vA.indexWord(token)
print(f"The source vocabulary contains {len(vQ.word2index)} words and the target vocabulary contains {len(vA.word2index)} words.")
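For readers without access to `src.vocab_chatbot`, the following is a minimal sketch of a vocabulary class with the same `indexWord` method and `word2index` attribute used above. The class name `SketchVocab` and the extra `index2word` and `word2count` attributes are assumptions; the real `Vocab` class may be implemented differently.

```python
class SketchVocab:
    """Hypothetical stand-in for Vocab: maps words to integer indices."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {}
        self.word2count = {}

    def indexWord(self, word):
        """Register a word (if new), count the occurrence, return its index."""
        if word not in self.word2index:
            index = len(self.word2index)
            self.word2index[word] = index
            self.index2word[index] = word
            self.word2count[word] = 0
        self.word2count[word] += 1
        return self.word2index[word]


sketch = SketchVocab("Questions")
# Indexing the special tokens first gives them the stable indices 0 and 1.
for sequence in [["<SOS>", "<EOS>"], ["who", "wrote", "hamlet", "?"], ["who", "?"]]:
    for token in sequence:
        sketch.indexWord(token)
print(sketch.word2index)
```

Prepending `["<SOS>", "<EOS>"]` to the data, as the cell above does, ensures the special tokens are always part of the vocabulary, whatever the corpus contains.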

### Create vectors

In [None]:
vectorized_questions = vectorize_questions(questions, vQ)
vectorized_answers = vectorize_answers(answers, vA)
print('Vectorization completed.')
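Vectorizing means replacing each token by its vocabulary index. The sketch below shows the idea with plain lists and a toy dictionary. The name `vectorize_tokens`, the wrapping in `<SOS>`/`<EOS>`, and the choice to skip unknown words are assumptions for illustration; `vectorize_questions` and `vectorize_answers` may behave differently and presumably return tensors rather than lists.

```python
def vectorize_tokens(tokens, word2index, sos="<SOS>", eos="<EOS>"):
    """Map tokens to indices, framed by the SOS and EOS indices.

    Hypothetical helper: tokens missing from the vocabulary are skipped.
    """
    indices = [word2index[token] for token in tokens if token in word2index]
    return [word2index[sos]] + indices + [word2index[eos]]


toy_vocab = {"<SOS>": 0, "<EOS>": 1, "who": 2, "wrote": 3, "hamlet": 4, "?": 5}
vector = vectorize_tokens(["who", "wrote", "macbeth", "?"], toy_vocab)
print(vector)  # [0, 2, 3, 5, 1]
```

Skipping unknown words is only one option; a dedicated unknown-word token is a common alternative, and it matters at chat time because users can type words that never appeared in the training data.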


### Create model

In [None]:
input_size = len(vQ.word2index)
hidden_size = 124
output_size = len(vA.word2index) 

# Note: these three settings are defined here but not passed to Seq2Seq below.
dropout_E = 0.0
dropout_D = 0.0
teacher_forcing_ratio = 0.0


model = Seq2Seq(input_size, hidden_size, output_size)

### Train model

In [None]:
epochs = 30
batch_size = 124
print_each = 1
lr = 0.01
weight_decay = 0
version = str(hidden_size)
train(epochs, batch_size, print_each, lr, model, version, vectorized_questions, vectorized_answers, vQ, vA)

### Apply model

In [None]:
import torch
version = 124  # same value as str(hidden_size), which was used as the version during training
checkpoint = f"model_{version}.pt"
print(f"Loading from checkpoint: '{checkpoint}'")
model.load_state_dict(torch.load(checkpoint, map_location=torch.device('cpu')))


# Length of the longest tokenized answer; passed to apply_chatbot below.
max_count = max(len(answer) for answer in answers)

model.eval()
string2stop = 'quit'
print(f"Type {string2stop} to finish the chat.\n")

    
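while True:
    question = input("> ")
    if question.strip() == string2stop:
        break

    # Assumption: vectorize_questions (imported above) returns one tensor per question.
    question_tensor = vectorize_questions([prepare_text(question)], vQ)[0].view(-1, 1)
    apply_chatbot(model, question_tensor, vQ, vA, max_count)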