# LSTM Bot

## Project Overview

In this project, you will build a chatbot that converses with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. The starter code below loads Brown embeddings from Gensim, so you can compare the performance of a model with pretrained embeddings against a model without them.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model accepts an input sequence into the Encoder. The Encoder passes its hidden state to the Decoder, which uses that state to output a series of token predictions.
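
The two components described above can be sketched in PyTorch as follows. This is a minimal illustration, not the `Seq2Seq` class from `src.models_chatbot`: the names `SketchEncoder`, `SketchDecoder`, and `greedy_decode`, as well as all sizes, are assumptions made for the example. An LSTM carries both a hidden state and a cell state, so the sketch hands both from the Encoder to the Decoder.

```python
import torch
import torch.nn as nn


class SketchEncoder(nn.Module):
    """Embedding layer followed by an LSTM (hypothetical names and sizes)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, tokens):
        # tokens: (seq_len, batch) tensor of word indices
        embedded = self.embedding(tokens)
        _, (hidden, cell) = self.lstm(embedded)
        return hidden, cell


class SketchDecoder(nn.Module):
    """Embedding layer, LSTM, and linear output layer."""

    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, token, hidden, cell):
        # token: (1, batch) tensor holding the previous output token
        embedded = self.embedding(token)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        logits = self.out(output.squeeze(0))  # (batch, output_size)
        return logits, hidden, cell


def greedy_decode(encoder, decoder, question, sos_index, max_len):
    """Encode the question, then emit one token per step via argmax."""
    hidden, cell = encoder(question)
    token = torch.full((1, question.size(1)), sos_index, dtype=torch.long)
    predictions = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=1).unsqueeze(0)  # feed the prediction back in
        predictions.append(token)
    return torch.cat(predictions, dim=0)  # (max_len, batch)


encoder = SketchEncoder(input_size=20, hidden_size=8)
decoder = SketchDecoder(hidden_size=8, output_size=15)
question = torch.randint(0, 20, (5, 1))  # one question of five token indices
with torch.no_grad():
    answer = greedy_decode(encoder, decoder, question, sos_index=0, max_len=4)
print(answer.shape)
```

The untrained sketch produces arbitrary tokens; it only shows how the states flow from the Encoder into the Decoder loop.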

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. The available options are listed here:

- https://pytorch.org/text/stable/datasets.html





In [None]:
from src.data_chatbot import questions_answers, load_df, toTensor, show_lengths, tokenize_questions, tokenize_answers
from src.data_chatbot import pretrained_w2v, prepare_text
from src.models_chatbot import Seq2Seq
from src.vocab_chatbot import Vocab


from src.train_chatbot import pretrain, train
from src.apply_chatbot import apply_chatbot



### Tests

In [None]:
!python -m pytest -vv src/tests_chatbot.py

### Raw data

In [None]:
df_train = load_df()
df_train.head()

In [None]:
len(df_train)

### Tokenized sentences

In [None]:
# NOTE: `source_name` is not defined anywhere above; assign it before running this cell.
questions_train_raw, questions_valid_raw, answers_train_raw, answers_valid_raw = questions_answers(source_name=source_name)
show_lengths(questions_train_raw, questions_valid_raw, answers_train_raw, answers_valid_raw)

### Filter data

In [None]:
# Choose one of the next two cells: filter pairs by answer length, or take a fixed slice of the data.

In [None]:
# temp = [pair for pair in zip(questions_train_raw, answers_train_raw) if len(pair[1])>3]
# questions_train_filt, answers_train_filt = map(list, zip(*temp))
# temp = [pair for pair in zip(questions_valid_raw, answers_valid_raw) if len(pair[1])>3]
# questions_valid_filt, answers_valid_filt = map(list, zip(*temp))
# print(f"{len(questions_train_filt)} training questions and {len(questions_valid_filt)} valid questions remain.")

In [None]:
questions_train_filt = questions_train_raw[:5000]
questions_valid_filt = questions_valid_raw[4501:5000]
answers_train_filt = answers_train_raw[:5000]
answers_valid_filt = answers_valid_raw[4501:5000]

### Create vocabularies

In [None]:
vQ = Vocab("Questions")
for sequence in [["<SOS>", "<EOS>"]] + questions_train_filt + questions_valid_filt:
    for token in sequence:
        vQ.indexWord(token)
vA = Vocab("Answers")
for sequence in [["<SOS>", "<EOS>"]] + answers_train_filt + answers_valid_filt:
    for token in sequence:
        vA.indexWord(token)
print(f"The source vocabulary contains {len(vQ.word2index)} words and the target vocabulary contains {len(vA.word2index)} words.")
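
To make the vocabulary step concrete, here is a minimal sketch of what a class like `Vocab` might do. Only the method name `indexWord` and the attribute `word2index` appear in the notebook; the class name `SketchVocab` and the attributes `index2word` and `word2count` are assumptions for illustration, and the real class in `src.vocab_chatbot` may differ.

```python
class SketchVocab:
    """Hypothetical stand-in for `Vocab`: maps tokens to integer indices."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {}
        self.word2count = {}

    def indexWord(self, word):
        # Assign the next free index the first time a word is seen.
        if word not in self.word2index:
            index = len(self.word2index)
            self.word2index[word] = index
            self.index2word[index] = word
            self.word2count[word] = 0
        self.word2count[word] += 1


vocab = SketchVocab("Demo")
sequences = [["<SOS>", "<EOS>"], ["who", "are", "you"], ["who", "is", "it"]]
for sequence in sequences:
    for token in sequence:
        vocab.indexWord(token)

print(len(vocab.word2index))    # 7 distinct tokens
print(vocab.word2index["who"])  # 2, the first index after the two markers
```

Seeding the loop with `["<SOS>", "<EOS>"]` first, as the notebook does, gives the two markers the lowest indices in this sketch.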

### Create vectors

In [None]:
questions_train = tokenize_questions(questions_train_filt, vQ)
answers_train = tokenize_answers(answers_train_filt, vA)
questions_valid = tokenize_questions(questions_valid_filt, vQ)
answers_valid = tokenize_answers(answers_valid_filt, vA)
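
The conversion from tokens to vectors can be sketched as below. This is an assumption about what `tokenize_questions` and `tokenize_answers` do, not their actual implementation: the helper name `tokens_to_indices` is hypothetical, and wrapping each sequence in `<SOS>` and `<EOS>` is a guess based on both vocabularies being seeded with those markers.

```python
def tokens_to_indices(tokens, word2index, sos="<SOS>", eos="<EOS>"):
    """Wrap a token list in start/end markers and look up each index."""
    wrapped = [sos] + list(tokens) + [eos]
    return [word2index[token] for token in wrapped]


word2index = {"<SOS>": 0, "<EOS>": 1, "who": 2, "are": 3, "you": 4}
indices = tokens_to_indices(["who", "are", "you"], word2index)
print(indices)  # [0, 2, 3, 4, 1]
```

In this sketch a token missing from `word2index` raises a `KeyError`. The result is a plain list; passing it to `torch.tensor` would give a tensor that can be reshaped with `.view(-1, 1)`, as the chat loop in this notebook does.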

### Create model

In [None]:
input_size = len(vQ.word2index)
hidden_size = 124
output_size = len(vA.word2index) 

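# NOTE: these three settings are defined but not passed to Seq2Seq or train() in this notebook.
dropout_E = 0.0
dropout_D = 0.0
teacher_forcing_ratio = 0.0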


model = Seq2Seq(input_size, hidden_size, output_size)

### Utilize pretrained embeddings

In [None]:
# w2v = pretrained_w2v(init=False)
# model = pretrain(model, vQ, vA, w2v)

#### Note: `most_similar` stops working in Gensim after a vector is added
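
One common way to use pretrained vectors is to copy them into the rows of an embedding layer. The sketch below shows that idea under stated assumptions: `load_pretrained` is a hypothetical helper, not the `pretrain` function from `src.train_chatbot`, and a plain dict stands in for the Gensim vectors.

```python
import torch
import torch.nn as nn


def load_pretrained(embedding, word2index, vectors):
    """Copy vectors for known words into the embedding weight matrix.

    Returns the number of rows that were overwritten.
    """
    copied = 0
    with torch.no_grad():
        for word, index in word2index.items():
            if word in vectors:
                vector = torch.as_tensor(vectors[word], dtype=embedding.weight.dtype)
                embedding.weight[index] = vector
                copied += 1
    return copied


word2index = {"<SOS>": 0, "<EOS>": 1, "who": 2, "you": 3}
# Hypothetical 3-dimensional vectors; real pretrained vectors are much larger.
vectors = {"who": [0.1, 0.2, 0.3], "zebra": [0.9, 0.8, 0.7]}

embedding = nn.Embedding(len(word2index), 3)
copied = load_pretrained(embedding, word2index, vectors)
print(copied)  # 1: only "who" appears in both the vocabulary and the vectors
```

The embedding dimension must equal the dimension of the pretrained vectors. Words without a pretrained vector keep their random initialization.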

### Train model

In [None]:
epochs = 50
batch_size = 124
print_each = 5
lr = 0.01
weight_decay = 0  # defined but not passed to train() below
version = str(hidden_size)
train(epochs, batch_size, print_each, lr, model, version, questions_train, answers_train, vQ, vA)

In [None]:
import torch

version = 124  # must match the `version` used during training (the hidden size)
print(f"Loading from checkpoint: 'model_{version}.pt'")
model.load_state_dict(torch.load(f"model_{version}.pt", map_location=torch.device('cpu')))


# Length of the longest training answer, passed to apply_chatbot below.
max_count = max((len(answer) for answer in answers_train), default=0)

model.eval()
string2stop = 'quit'
print(f"Type {string2stop} to finish the chat.\n")

    
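while True:
    question = input("> ")
    if question.strip() == string2stop:
        break

    # Clean and tokenize the input, then reshape it into a column vector.
    question_tensor = tokenize_questions([prepare_text(question)], vQ)[0].view(-1, 1)
    apply_chatbot(model, question_tensor, vQ, vA, max_count)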