In this guide, we explore an application of recurrent sequence-to-sequence models by training a basic chatbot on film scripts from the Cornell Movie-Dialogs Corpus.

Conversational models are an active topic in AI research. Chatbots are commonly used in settings such as customer service platforms and online help desks. These bots typically rely on retrieval-based models that return predefined responses to specific types of questions. While such models can be adequate for narrow domains like a company's IT helpdesk, they are not robust enough for broader applications. However, recent progress in deep learning, popularized by ChatGPT, has led to powerful multi-domain generative conversational models. In this guide, we will create a small model of this kind using the tools we have learned so far.

To begin, download the dialog dataset from the [Cornell Movie-Dialogs Corpus page](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) and put it in a `data/` directory under the current directory. The corpus is also available as a ConvoKit archive:

https://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip

## Data Loading and Preprocessing

This step involves reorganizing our data file and loading the data into formats that are manageable.

The [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) is a comprehensive dataset of dialogues from movie characters:

- It contains 220,579 conversational exchanges between 10,292 pairs of movie characters.
- It features 9,035 characters from 617 movies.
- It has a total of 304,713 utterances.

This dataset is vast and varied, with a wide range of language formality, time periods, sentiment, etc. We anticipate that this diversity will make our model capable of handling a variety of inputs and queries.

Initially, we will examine some lines from our data file to understand the original format.

In [127]:
# import os
# from convokit import download, Corpus

# import ssl

# ssl._create_default_https_context = ssl._create_unverified_context

# corpus_name = "tennis-corpus"  # "movie-corpus"
# if not os.path.exists(corpus_name):
#     download(corpus_name, data_dir="./")

# corpus = Corpus(corpus_name)
# corpus.print_summary_stats()

An utterance is an uninterrupted chain of spoken or written language.

In [128]:
# for _ in range(5):
#     print(corpus.random_utterance().text)

# ex_text = corpus.random_utterance().text

In [129]:
import os


if not os.path.exists("chatterbot-corpus"):
    os.system("git clone https://github.com/gunthercox/chatterbot-corpus.git")

data_dir = os.path.join("chatterbot-corpus", "chatterbot_corpus", "data", "english")
files_list = os.listdir(data_dir)

In [130]:
files_list

['politics.yml',
 'ai.yml',
 'emotion.yml',
 'computers.yml',
 'botprofile.yml',
 'history.yml',
 'psychology.yml',
 'food.yml',
 'literature.yml',
 'money.yml',
 'trivia.yml',
 'gossip.yml',
 'humor.yml',
 'conversations.yml',
 'greetings.yml',
 'sports.yml',
 'movies.yml',
 'science.yml',
 'health.yml']

In [136]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")

import re
import contractions


def tokenize(text):
    standardized_text = contractions.fix(text)

    standardized_text = (
        standardized_text.replace("’", "'")
        .replace("‘", "'")
        .replace("´", "'")
        .replace("“", '"')
        .replace("”", '"')
        .replace("´´", '"')
    )

    tokens = tokenizer(standardized_text)

    filtered_tokens = [
        token
        for token in tokens
        if re.match(
            r"^[a-zA-Z0-9.,!?]+(-[a-zA-Z0-9.,!?]+)*(_[a-zA-Z0-9.,!?]+)*$", token
        )
    ]
    return filtered_tokens
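
To see what the final regular-expression filter keeps and drops, here is a minimal sketch that applies the same pattern to a handful of tokens. It uses a plain lowercase-and-split tokenizer as a stand-in for `basic_english` (an assumption, so that the snippet runs without torchtext), and `keep_token` is a hypothetical helper name.

```python
import re

# Same pattern as in `tokenize`: alphanumerics and basic punctuation,
# optionally joined by hyphens and then by underscores.
TOKEN_PATTERN = re.compile(
    r"^[a-zA-Z0-9.,!?]+(-[a-zA-Z0-9.,!?]+)*(_[a-zA-Z0-9.,!?]+)*$"
)


def keep_token(token):
    """Return True if the token would survive the filter."""
    return TOKEN_PATTERN.match(token) is not None


# Stand-in tokenizer (assumption): lowercase and split on whitespace.
sample = "Well-known robots cost $5 ... or so :) they say"
tokens = sample.lower().split()
kept = [t for t in tokens if keep_token(t)]
dropped = [t for t in tokens if not keep_token(t)]

print(kept)
print(dropped)
```

Note that a run of dots passes the filter because `.` is part of the allowed character class, while tokens containing symbols such as `$` or `:` are dropped.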

In [137]:
# from torchtext.vocab import build_vocab_from_iterator


# def corpus_iterator(corpus):  # (1)
#     for utt_id in corpus.utterances:
#         utt = corpus.get_utterance(utt_id)
#         yield tokenize(utt.text)


# # Add EOS, SOS, and PAD to the specials list
# special_tokens = ["<unk>", "<pad>", "<sos>", "<eos>"]

# vocab = build_vocab_from_iterator(
#     corpus_iterator(corpus),  # (2)
#     specials=special_tokens,  # (3),
#     min_freq=5,  # (4)
# )
# vocab.set_default_index(vocab["<unk>"])  # (5)

In [154]:
import yaml
from torchtext.vocab import build_vocab_from_iterator


def corpus_iterator(files_list):  # (1)
    for filepath in files_list:
        # Open each YAML file in a context manager so it is closed after parsing
        with open(os.path.join(data_dir, filepath), "rb") as subject:
            docs = yaml.safe_load(subject)
        conversations = docs["conversations"]
        for con in conversations:
            # Use an even number of utterances so each query has a response
            range_end = len(con) - (len(con) % 2)
            for i in range(0, range_end):
                yield tokenize(con[i])


# Add EOS, SOS, and PAD to the specials list
special_tokens = ["<pad>", "<sos>", "<eos>", "<unk>"]

vocab = build_vocab_from_iterator(
    corpus_iterator(files_list),
    specials=special_tokens,
    min_freq=2,
)
vocab.set_default_index(vocab["<unk>"])

In [142]:
len(vocab)

791

In [156]:
vocab.lookup_token(0)

'<pad>'

In [None]:
# import numpy as np

# MAX_LENGTH = 30

# queries = []
# responses = []


# def all_words_in_vocab(sentence, vocab):
#     return all(word in vocab for word in sentence)


# def process_sentence(sequence, max_length=MAX_LENGTH):
#     processed = (
#         ["<sos>"] + sequence + ["<eos>"] + ["<pad>"] * (max_length - len(sequence) + 1)
#     )
#     return processed


# for conv_id in corpus.conversations:
#     conv = corpus.get_conversation(conv_id)
#     utt_ids = conv.get_utterance_ids()

#     range_end = len(utt_ids) - (len(utt_ids) % 2)
#     # Adjust range to exclude the last utterance in conversations with an odd number of utterances

#     for i in range(0, range_end, 2):

#         query = tokenize(corpus.get_utterance(utt_ids[i]).text)
#         response = tokenize(corpus.get_utterance(utt_ids[i + 1]).text)

#         if (
#             all_words_in_vocab(query + response, vocab)
#             and len(query) <= MAX_LENGTH
#             and len(response) <= MAX_LENGTH
#         ):
#             query = process_sentence(query)
#             response = process_sentence(response)

#             queries.append(vocab(query))
#             responses.append(vocab(response))

# queries = np.asarray(queries)
# responses = np.asarray(responses)

# print(f"Number of queries/responses: {len(queries)}")

In [303]:
import numpy as np

MAX_LENGTH = 30

queries, responses, masks_q, masks_r = [], [], [], []


def all_words_in_vocab(sentence, vocab):
    return all(word in vocab for word in sentence)


# def process_sentence(sequence, max_length=MAX_LENGTH):
#     processed = (
#         ["<sos>"] + sequence + ["<eos>"] + ["<pad>"] * (max_length - len(sequence) + 1)
#     )
#     mask =
#     return processed, mask
def process_sentence(sequence, max_length=MAX_LENGTH):
    # Padding after <eos>; the full sequence always has max_length + 3 tokens
    padding_length = max_length - len(sequence) + 1

    # Processed sequence with <sos>, <eos>, and <pad>
    processed = ["<sos>"] + sequence + ["<eos>"] + ["<pad>"] * padding_length

    # Create a mask: 1s for actual tokens and 0s for padding
    # The mask length is len(sequence) + 2 for <sos> and <eos> tokens. The rest are 0s for padding.
    mask = [1] * (len(sequence) + 2) + [0] * padding_length

    return processed, mask


for filepath in files_list:
    with open(os.path.join(data_dir, filepath), "rb") as subject:
        docs = yaml.safe_load(subject)
    conversations = docs["conversations"]
    for con in conversations:
        range_end = len(con) - (len(con) % 2)

        for i in range(0, range_end, 2):
            query = tokenize(con[i])
            response = tokenize(con[i + 1])

            if (
                # all_words_in_vocab(query + response, vocab)
                len(query) <= MAX_LENGTH
                and len(response) <= MAX_LENGTH
            ):
                query, _ = process_sentence(query)
                response, mask_r = process_sentence(response)

                queries.append(vocab(query))
                responses.append(vocab(response))
                # masks_q.append(mask_q)
                masks_r.append(mask_r)

queries = np.asarray(queries)
responses = np.asarray(responses)
# masks_q = np.asarray(masks_q)
masks_r = np.asarray(masks_r)

print(f"Number of queries/responses: {len(queries)}")

Number of queries/responses: 689


In [295]:
len(queries[0])

33
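
The length 33 follows from the padding arithmetic in `process_sentence`: one `<sos>`, the tokens, one `<eos>`, and `max_length - len(sequence) + 1` pads always add up to `max_length + 3`. The sketch below repeats the function on a toy sentence to check this; the example tokens are made up.

```python
MAX_LENGTH = 30


def process_sentence(sequence, max_length=MAX_LENGTH):
    """Add <sos>/<eos>, pad to a fixed length, and build a 0/1 mask."""
    padding_length = max_length - len(sequence) + 1
    processed = ["<sos>"] + sequence + ["<eos>"] + ["<pad>"] * padding_length
    mask = [1] * (len(sequence) + 2) + [0] * padding_length
    return processed, mask


tokens = ["how", "are", "you", "?"]  # toy example, not taken from the corpus
processed, mask = process_sentence(tokens)

print(len(processed), len(mask))  # both equal MAX_LENGTH + 3
print(processed[:7])
print(mask[:7])
```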

In [296]:
import torch

# The dimensionality of GloVe embeddings
embedding_dim = 300

# Special tokens, in the same order as in the vocabulary
special_tokens = ["<pad>", "<sos>", "<eos>", "<unk>"]
num_special_tokens = len(special_tokens)

# Initialize a tensor to hold the embeddings for special tokens
# Here, PAD is initialized to zeros, and the other special tokens to small random values
special_embeddings = torch.zeros(num_special_tokens, embedding_dim)
special_embeddings[1:] = (
    torch.rand(num_special_tokens - 1, embedding_dim) * 0.01
)

from torchtext.vocab import GloVe

# Load GloVe embeddings
glove = GloVe(name="42B", dim=embedding_dim, cache="./.vector_cache")

# Get GloVe embeddings for the regular vocabulary tokens only.
# vocab.get_itos() lists the special tokens first, so skip them here;
# otherwise every row would be shifted relative to the vocabulary indices.
regular_tokens = vocab.get_itos()[num_special_tokens:]
glove_embeddings = glove.get_vecs_by_tokens(regular_tokens, lower_case_backup=True)

# Concatenate so that row i holds the embedding of vocabulary index i
extended_embeddings = torch.cat([special_embeddings, glove_embeddings], dim=0)

INFO:torchtext.vocab.vectors:Loading vectors from ./.vector_cache/glove.42B.300d.txt.pt


In [None]:
# import deeplay as dl
# import torch.nn as nn

# features_dim = 500
# vocab_size = len(vocab)

# embedding = dl.Layer(nn.Embedding, vocab_size, embedding_dim)
# embedding.weight = extended_embeddings
# embedding.weight.requires_grad = False

# encoder = dl.RecurrentModel(
#     in_features=embedding_dim,
#     hidden_features=[features_dim, features_dim],
#     out_features=embedding_dim,
#     dropout=0.1,
#     rnn_type="GRU",
# )
# encoder.layer[2].configure(nn.Identity)

# decoder = dl.RecurrentModel(
#     in_features=embedding_dim,
#     hidden_features=[features_dim, features_dim],
#     out_features=embedding_dim,
#     dropout=0.1,
#     rnn_type="GRU",
# )
# decoder.blocks[2].activation.configure(nn.Softmax)
# seq2seq = dl.Sequential(
#     embedding,
#     encoder,
#     decoder,
# )

# reg = dl.Regressor(
#     model=seq2seq,
#     loss=nn.CrossEntropyLoss(),
#     optimizer=dl.Adam(),
# )
# seq2seq_reg = reg.create()

In [536]:
import torch
import torch.nn as nn
import torch.optim as optim
import deeplay as dl
from deeplay import DeeplayModule, Classifier

hidden_features = 50


class MyClassifier(Classifier):

    def training_step(self, batch, batch_idx):
        x1, x2, m = batch
        y = torch.cat((x2[:, 1:], x2[:, -1:]), dim=1)
        y_hat = self(x1, x2)
        # print(y.shape)
        # print(y_hat.shape)
        # y = y.view(-1)
        # y_hat = y_hat.view(-1, y_hat.size(-1))
        loss = self.loss(y_hat, y, m)
        # loss = self.loss(y_hat.view(-1, y_hat.size(-1)), y.view(-1))
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )

        self.log_metrics(
            "train", y_hat, y, on_step=True, on_epoch=True, prog_bar=True, logger=True
        )

        return loss

    def forward(self, x1, x2):
        return self.model(x1, x2)


class Encoder(DeeplayModule):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_units, batch_first=True, dropout=0.1)

    def forward(self, x):
        x = self.embedding(x)
        # x = torch.nn.utils.rnn.pack_padded_sequence(x, lengths)
        outputs, (hidden, cell) = self.lstm(x)
        return hidden, cell


class Decoder(DeeplayModule):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_units, batch_first=True, dropout=0.1)
        self.dense = nn.Linear(lstm_units, vocab_size)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, hidden, cell):
        x = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(x, (hidden, cell))
        outputs = self.dense(outputs)
        outputs = self.softmax(outputs)
        return outputs, hidden, cell


class Seq2Seq(DeeplayModule):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder(vocab_size, embedding_dim, lstm_units)
        self.decoder = Decoder(vocab_size, embedding_dim, lstm_units)
        self.vocab_size = vocab_size
        self.lstm_units = lstm_units

    def forward(self, encoder_input_data, decoder_input_data):
        encoder_hidden, encoder_cell = self.encoder(encoder_input_data)

        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell

        # Allocate the output buffer on the same device as the inputs
        outputs = torch.zeros(
            (decoder_input_data.size(0), decoder_input_data.size(1), self.vocab_size),
            device=decoder_input_data.device,
        )
        for t in range(decoder_input_data.size(1)):  # Iterate through the sequence
            output, decoder_hidden, decoder_cell = self.decoder(
                decoder_input_data[:, t].unsqueeze(-1), decoder_hidden, decoder_cell
            )
            outputs[:, t, :] = output.squeeze(1)
        return outputs


seq2seq = Seq2Seq(len(vocab), embedding_dim, hidden_features)


def NLLLoss(inp, target, mask):
    # nTotal = mask.sum()
    crossEntropy = -torch.log(
        torch.gather(inp.view(-1, inp.shape[-1]), 1, target.view(-1, 1))
    )
    # loss = crossEntropy.mean()
    loss = crossEntropy.masked_select(mask.view(-1, 1)).mean()
    # loss = loss.to(device)
    return loss  # , nTotal.item()


seq2seq_classifier = MyClassifier(
    model=seq2seq,
    loss=NLLLoss,  # nn.CrossEntropyLoss(),
    optimizer=dl.RMSprop(),
).create()

seq2seq_classifier.model.encoder.embedding.weight.data = extended_embeddings
seq2seq_classifier.model.encoder.embedding.weight.requires_grad = False
seq2seq_classifier.model.decoder.embedding.weight.data = extended_embeddings
seq2seq_classifier.model.decoder.embedding.weight.requires_grad = False



In [537]:
import deeptrack as dt
import torch

sources = dt.sources.Source(inputs=queries, targets=responses, masks=masks_r)

inputs_pl = dt.Value(sources.inputs) >> dt.pytorch.ToTensor(dtype=torch.int)
targets_pl = dt.Value(sources.targets) >> dt.pytorch.ToTensor(dtype=torch.int)
masks_pl = dt.Value(sources.masks) >> dt.pytorch.ToTensor(dtype=torch.bool)

In [538]:
from torch.utils.data import DataLoader

train_dataset = dt.pytorch.Dataset(inputs_pl & targets_pl & masks_pl, inputs=sources)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)

In [539]:
trainer = dl.Trainer(max_epochs=100, accelerator="mps")
trainer.fit(seq2seq_classifier, train_loader)

/Users/841602/Documents/GitHub/Environments/deeplay_env/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.



/Users/841602/Documents/GitHub/Environments/deeplay_env/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
/Users/841602/Documents/GitHub/Environments/deeplay_env/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (22) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


In [563]:
def make_inference(model, source_sequence):
    # Source sequence is assumed to be prepared already (tokenized, indexed, padded)
    device = next(model.parameters()).device
    sos_index, eos_index = vocab["<sos>"], vocab["<eos>"]

    # Encoder inference
    with torch.no_grad():
        hidden, cell = model.encoder(torch.tensor(source_sequence).to(device))

    # Start decoding from the <sos> token
    target_index = torch.tensor([sos_index]).to(device)

    predictions = []
    for _ in range(MAX_LENGTH):
        with torch.no_grad():
            output, hidden, cell = model.decoder(target_index, hidden, cell)
            top1 = output.argmax(1)  # Get the index with the highest probability
            predictions.append(top1.item())
            target_index = top1
            if top1.item() == eos_index:  # Stop once the EOS token is generated
                break

    return predictions

In [564]:
qu = queries[100]
# for q in qu:
#     print(vocab.lookup_token(q))

predictions = make_inference(seq2seq_classifier.model, qu)
print(predictions)

for p in predictions:
    print(vocab.lookup_token(p))

[6, 12, 7, 99, 330, 2]
i
are
a
bad
spouse
<eos>


i
am
not
<unk>
.
<eos>


### Generate Formatted Data File

For ease of use, we will generate a well-structured data file where each line comprises a tab-separated pair of a *query sentence* and a *response sentence*.
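
As a minimal sketch of this file format, the snippet below writes a few query/response pairs as tab-separated lines and reads them back. The pairs and the in-memory buffer are illustrative assumptions; a real run would write to a file on disk.

```python
import csv
import io

# Toy query/response pairs (made up for illustration).
pairs = [
    ("Hello", "Hi there!"),
    ("How are you?", "I am doing well."),
]

# Write one tab-separated pair per line. An in-memory buffer stands in
# for a file opened with open(path, "w", encoding="utf-8", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(pairs)

# Read the pairs back.
buffer.seek(0)
loaded = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]
print(loaded)
```

Using the `csv` module rather than a manual `split("\t")` means that a sentence containing a tab or a quote character is still written and read back intact.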

In [None]:
# pairs = []

# for conv_id in corpus.conversations:
#     conv = corpus.get_conversation(conv_id)
#     utt_ids = conv.get_utterance_ids()

#     num_utt = len(utt_ids)
#     range_end = num_utt - 2 if num_utt % 2 != 0 else num_utt - 1

#     for i in range(range_end):
#         query = corpus.get_utterance(utt_ids[i]).text.strip()
#         response = corpus.get_utterance(utt_ids[i + 1]).text.strip()
#         if query and response:
#             pairs.append(f"{query}|-->|{response}")

# print(f"Number of pairs: {len(pairs)}")

# # filename = os.path.join(corpus_name, "formatted_movie_lines.txt")

# # with open(filename, "w", encoding="utf-8") as file:
# #     for pair in pairs:
# #         file.write(pair + "\n")

- Convert all letters to lowercase.
- Trim all non-letter characters except for basic punctuation.
- Filter out sentences longer than `MAX_LENGTH`.

In [None]:
# import re


# def clean_text(text):
#     text = text.strip().lower()
#     text = re.sub(r"[^a-z0-9.!?]+", r" ", text)
#     text = re.sub(r"([.!?])", r" \1 ", text)
#     text = re.sub(r"\s+", " ", text).strip()
#     return text


# MAX_LENGTH = 30
# proc_pairs = []

# for pair in pairs:
#     query, response = pair.strip().split("|-->|")
#     clean_query = clean_text(query)
#     clean_response = clean_text(response)
#     if len(clean_query) <= MAX_LENGTH and len(clean_response) <= MAX_LENGTH:
#         proc_pairs.append(f"{clean_query}|-->|{clean_response}")

# print(f"Number of pairs: {len(proc_pairs)}")

- Create a vocabulary.
- Remove words below a certain count threshold.

### Data Loading and Trimming

The next step involves creating a vocabulary and loading query/response sentence pairs into memory.

Keep in mind that we are working with sequences of **words**, which do not inherently map to a discrete numerical space. Therefore, we need to create such a mapping by associating each unique word we encounter in our dataset with an index value.

To achieve this, we define a `Vocabulary` class, which maintains a mapping from words to indexes, a reverse mapping from indexes to words, a count of each word, and a total word count. The class offers methods for adding a word to the vocabulary (`add_word`), adding all words in a sentence (`add_sentence`), and trimming infrequently seen words (`trim`). We will discuss trimming in more detail later.
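
The class itself is not shown in this section, so the following is only a minimal sketch of what it could look like. The method names come from the description above; the attribute names and the special-token indices (0, 1, 2) are assumptions.

```python
class Vocabulary:
    """Minimal word <-> index mapping with frequency-based trimming."""

    PAD, SOS, EOS = 0, 1, 2  # assumed indices for the special tokens

    def __init__(self):
        self._reset()

    def _reset(self):
        self.word_to_index = {}
        self.word_to_count = {}
        self.index_to_word = {self.PAD: "<pad>", self.SOS: "<sos>", self.EOS: "<eos>"}
        self.num_words = 3  # the three special tokens

    def add_word(self, word):
        if word not in self.word_to_index:
            self.word_to_index[word] = self.num_words
            self.word_to_count[word] = 1
            self.index_to_word[self.num_words] = word
            self.num_words += 1
        else:
            self.word_to_count[word] += 1

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            self.add_word(word)

    def trim(self, min_count):
        """Keep only words seen at least min_count times and reindex them."""
        kept = {w: c for w, c in self.word_to_count.items() if c >= min_count}
        self._reset()
        for word, count in kept.items():
            self.add_word(word)
            self.word_to_count[word] = count  # restore the original count


voc = Vocabulary()
for sentence in ["hello there", "hello world", "goodbye world"]:
    voc.add_sentence(sentence)

voc.trim(min_count=2)
print(voc.word_to_index)
print(voc.num_words)
```

After trimming, only `hello` and `world` remain, and they are reindexed directly after the special tokens.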

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer

# sentences = [
#     sentence.strip() for pair in proc_pairs for sentence in pair.strip().split("|-->|")
# ]

# MIN_COUNT = 5

# # Initialize CountVectorizer
# vectorizer = CountVectorizer(
#     min_df=MIN_COUNT, tokenizer=lambda txt: txt.strip().split(" ")
# )
# X = vectorizer.fit_transform(sentences)

# # Get the feature names to build a vocab dictionary
# vocab = vectorizer.get_feature_names_out()

# print(f"Vocabulary size: {len(vocab)}")
# # print(f"Encoded sentence example: {encoded_sentences[0]}")

In [None]:
# keep_pairs = []
# for pair in proc_pairs:
#     query, response = pair.strip().split("|-->|")

#     keep_query = True
#     keep_response = True
#     # Check input sentence
#     for word in query.strip().split(" "):
#         if word not in vocab:
#             keep_query = False
#             break
#     # Check output sentence
#     for word in response.strip().split(" "):
#         if word not in vocab:
#             keep_response = False
#             break

#     if keep_query and keep_response:
#         keep_pairs.append(pair)
# print(f"Number of pairs: {len(keep_pairs)}")

PAD, SOS, and EOS tokens

In [None]:
# # Mapping words to indices (+3 offset for special tokens)
# word_to_idx = {word: i + 3 for i, word in enumerate(vocab)}

# # Adding special tokens to the dictionary
# word_to_idx["<PAD>"] = 0  # PAD
# word_to_idx["<SOS>"] = 1  # SOS
# word_to_idx["<EOS>"] = 2  # EOS


# # Example of encoding a sentence with SOS, EOS, and converting to indices
# def encode_sentence(sentence, word_to_idx, max_len=MAX_LENGTH):
#     # Tokenize the sentence
#     tokens = sentence.split()

#     # Add SOS and EOS tokens
#     tokens = ["<SOS>"] + tokens + ["<EOS>"]

#     # Convert tokens to indices
#     indices = [word_to_idx.get(token, word_to_idx["<EOS>"]) for token in tokens]

#     # Pad the sequence to max_len
#     padded_sequence = indices + [word_to_idx["<PAD>"]] * (max_len - len(indices))
#     return padded_sequence[:max_len]


# sentences = [
#     sentence.strip() for pair in keep_pairs for sentence in pair.strip().split("|-->|")
# ]
# # Example: encoding the first sentence
# # max_len = max(len(s.split()) for s in sentences) + 2  # +2 for SOS and EOS tokens
# encoded_sentences = [encode_sentence(s, word_to_idx) for s in sentences]

In [None]:
# print(f"Encoded sentence example: {encoded_sentences[10000]}")

In [None]:
# idx_to_word = {idx: word for word, idx in word_to_idx.items()}


# def decode_sentence(encoded_sentence, idx_to_word):
#     # Convert indices back to tokens, ignoring special tokens for padding, start, and end
#     tokens = [
#         idx_to_word.get(idx)
#         for idx in encoded_sentence
#         if idx in idx_to_word and idx > 2
#     ]

#     # Join the tokens back into a single string
#     sentence = " ".join(tokens)
#     return sentence


# # Example: decoding the first encoded sentence
# decoded_sentences = [
#     decode_sentence(encoded, idx_to_word) for encoded in encoded_sentences
# ]

In [None]:
# print(f"Decoded sentence example: {decoded_sentences[0:20]}")

We can now compile our vocabulary and query/response sentence pairs. However, before we can utilize this data, we need to carry out some preprocessing steps.

Initially, we need to transform the Unicode strings into ASCII using `unicodeToAscii`. Subsequently, we should convert all characters to lowercase and remove all non-letter characters, excluding basic punctuation (`normalize_string`). Lastly, to facilitate training convergence, we will exclude sentences exceeding the `MAX_LENGTH` threshold.
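
The helpers named above are not defined in this section. A minimal sketch of how they could be written with the standard library follows; the function bodies, the snake_case name `unicode_to_ascii`, and the helper `short_enough` are assumptions, not the original implementation.

```python
import re
import unicodedata

MAX_LENGTH = 30  # maximum number of tokens per sentence


def unicode_to_ascii(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def normalize_string(text):
    """Lowercase, keep letters and basic punctuation, collapse whitespace."""
    text = unicode_to_ascii(text.lower().strip())
    text = re.sub(r"([.!?])", r" \1", text)  # separate punctuation from words
    text = re.sub(r"[^a-z.!?]+", " ", text)  # replace everything else by a space
    return re.sub(r"\s+", " ", text).strip()


def short_enough(sentence, max_length=MAX_LENGTH):
    """Return True if the sentence has at most max_length tokens."""
    return len(sentence.split(" ")) <= max_length


print(normalize_string("Café?  C'est   très bien!"))
```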

Another tactic that helps training converge faster is trimming rarely used words from our vocabulary. Shrinking the feature space also simplifies the function that the model must learn to approximate. We will do this as a two-step process:

1) Trim words used fewer than `MIN_COUNT` times using the `voc.trim` function.

2) Filter out pairs with trimmed words.
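
The two steps can be sketched with plain Python on a toy set of pairs. The data and the threshold value here are illustrative assumptions, and a `Counter` stands in for the vocabulary object.

```python
from collections import Counter

MIN_COUNT = 2  # illustrative threshold

pairs = [
    ("hello there", "hello"),
    ("how are you", "fine thanks"),
    ("hello you", "hello there you"),
]

# Step 1: count words over all sentences and keep the frequent ones.
counts = Counter(word for pair in pairs for s in pair for word in s.split(" "))
kept_words = {word for word, count in counts.items() if count >= MIN_COUNT}

# Step 2: keep only pairs whose query and response use kept words exclusively.
kept_pairs = [
    (query, response)
    for query, response in pairs
    if all(word in kept_words for word in (query + " " + response).split(" "))
]

print(sorted(kept_words))
print(kept_pairs)
```

The second pair is dropped because every one of its words except `you` falls below the threshold.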




### Data Preparation for Models

Despite our efforts to curate and process the data into a convenient vocabulary object and a list of sentence pairs, our models ultimately require numerical torch tensors as inputs.

To accommodate sentences of different lengths in the same batch, we create a batched input tensor of shape (max_length, batch_size), where sentences shorter than max_length are zero-padded after an EOS_token.

If we simply converted our English sentences to tensors by mapping words to their indexes and zero-padding, the tensor would have shape (batch_size, max_length), and indexing the first dimension would return a full sequence across all time steps. However, we need to index the batch along time, across all sequences in the batch. Therefore, we transpose the input batch to shape (max_length, batch_size), so that indexing the first dimension returns one time step across all sentences in the batch.

The output function also returns a binary mask tensor and a maximum target sentence length. The binary mask tensor has the same shape as the output target tensor, but every element that is a PAD_token is 0 and all others are 1.

`batch_to_train_data` simply takes a bunch of pairs and returns the input and target tensors using the aforementioned functions.
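
Before the full implementation, here is a small worked example of the padding, transposition, and masking described above, using plain lists instead of tensors. The token indices are made up, and the values of `PAD_TOKEN` and `EOS_TOKEN` are assumptions.

```python
import itertools

PAD_TOKEN = 0  # assumed index of the padding token
EOS_TOKEN = 2  # assumed index of the end-of-sequence token

# Three already-indexed sentences of different lengths, each ending in EOS.
batch = [
    [5, 6, 7, EOS_TOKEN],
    [8, EOS_TOKEN],
    [9, 10, EOS_TOKEN],
]

# zip_longest pads the shorter sequences and transposes in one step:
# row t of the result holds time step t for every sentence in the batch.
padded = [list(step) for step in itertools.zip_longest(*batch, fillvalue=PAD_TOKEN)]

# Binary mask with the same shape: 0 where the entry is padding, 1 elsewhere.
mask = [[0 if token == PAD_TOKEN else 1 for token in step] for step in padded]

for step, m in zip(padded, mask):
    print(step, m)
```

The result has four rows (the longest sentence) and three columns (the batch size), so `padded[t]` is the input for time step `t`.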

In [None]:
import itertools
import torch
import random

# Special-token indices, matching the order of the vocabulary specials above
PAD_TOKEN = 0  # "<pad>"
SOS_TOKEN = 1  # "<sos>"
EOS_TOKEN = 2  # "<eos>"


def indexes_from_sentence(vocabulary, sentence):
    """
    Convert sentence to a list of indexes, appending the EOS token at the end.
    """
    return [vocabulary.word_to_index[word] for word in sentence.split(" ")] + [
        EOS_TOKEN
    ]


def binary_matrix(l, value=PAD_TOKEN):
    """
    Create a binary matrix representing the padding of sentences.
    """
    return [[0 if token == value else 1 for token in seq] for seq in l]


def batch_to_train_data(vocabulary, pair_batch):
    """
    Prepare the batch for training: sort by input length, create tensors for input/target variables.
    """
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = zip(*pair_batch)
    input_indexes = [
        indexes_from_sentence(vocabulary, sentence) for sentence in input_batch
    ]
    input_lengths = torch.tensor([len(indexes) for indexes in input_indexes])
    input_padded = torch.LongTensor(
        list(itertools.zip_longest(*input_indexes, fillvalue=PAD_TOKEN))
    )

    output_indexes = [
        indexes_from_sentence(vocabulary, sentence) for sentence in output_batch
    ]
    output_padded = torch.LongTensor(
        list(itertools.zip_longest(*output_indexes, fillvalue=PAD_TOKEN))
    )
    output_mask = torch.BoolTensor(binary_matrix(output_padded))
    max_target_len = max(len(indexes) for indexes in output_indexes)

    return input_padded, input_lengths, output_padded, output_mask, max_target_len


# Example for validation
small_batch_size = 5
batches = batch_to_train_data(
    voc, [random.choice(pairs) for _ in range(small_batch_size)]
)
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)

## Training Procedure Definition

### Loss with Masking

Given that we're working with batches of padded sequences, we can't compute loss using all tensor elements. We establish `mask_nll_loss` to compute our loss based on the decoder's output tensor, the target tensor, and a binary mask tensor that indicates the padding of the target tensor. This loss function computes the average negative log likelihood of the elements that align with a *1* in the mask tensor.
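
To make the arithmetic concrete, here is a minimal pure-Python sketch of a masked negative log likelihood on toy numbers. The probabilities, targets, and the helper name `masked_nll` are illustrative assumptions; the tensor version works the same way, position by position.

```python
import math

# Predicted probability distributions over a 3-word vocabulary
# for four target positions (toy numbers).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.25, 0.25],
]
targets = [0, 1, 2, 0]  # index of the correct word at each position
mask = [1, 1, 1, 0]  # the last position is padding


def masked_nll(probs, targets, mask):
    """Average -log p(target) over the positions where mask == 1."""
    losses = [-math.log(p[t]) for p, t, m in zip(probs, targets, mask) if m == 1]
    return sum(losses) / len(losses), len(losses)


loss, n_total = masked_nll(probs, targets, mask)
print(round(loss, 4), n_total)
```

The padded position contributes nothing to the average, so the loss depends only on the three real tokens.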



In [None]:
def mask_nll_loss(inp, target, mask, device):
    """
    Calculate the negative log likelihood loss with a mask over the lengths of target sequences.
    """
    n_total = mask.sum()
    cross_entropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = cross_entropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, n_total.item()

### Single Training Iteration Procedure

The `train` function encapsulates the process for a single training iteration (a single batch of inputs).

We employ two strategies to aid convergence:

-  **Teacher forcing**: With a probability given by `teacher_forcing_ratio`, we use the current target word as the decoder’s next input instead of the decoder’s current guess. This makes training more efficient but can cause instability during inference, because the decoder may not get enough practice building on its own outputs. Hence, the `teacher_forcing_ratio` must be set carefully.

-  **Gradient clipping**: This technique counters the "exploding gradient" problem by capping gradients to a maximum value, preventing them from growing exponentially and causing overflow or overshooting steep cost function cliffs.
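
As a minimal sketch of gradient clipping, the snippet below rescales a toy gradient so that its L2 norm does not exceed a threshold. This is the norm-based variant, which is also what `clip_grad_norm_` in the `train` function applies; the helper `clip_by_norm` is a hypothetical name, and it omits the small epsilon that PyTorch adds to the denominator.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # toy gradient with L2 norm 5
clipped, norm_before = clip_by_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points the same way but takes a bounded step.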

**Procedure:**

   1) Pass the entire input batch through the encoder.
   2) Initialize decoder inputs as SOS_token, and hidden state as the encoder's final hidden state.
   3) Pass the input batch sequence through the decoder one time step at a time.
   4) If teacher forcing: set next decoder input as the current target; else: set next decoder input as current decoder output.
   5) Calculate and accumulate loss.
   6) Perform backpropagation.
   7) Clip gradients.
   8) Update encoder and decoder model parameters.
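
Steps 2 to 4 can be sketched with a toy decoder to show how the next input is chosen. Here `toy_decoder` is a stand-in that simply returns the previous token plus one (an assumption made for illustration, not a real model), and the special-token indices are assumed.

```python
import random

SOS_TOKEN, EOS_TOKEN = 1, 2  # assumed special-token indices


def toy_decoder(prev_token):
    """Stand-in for a decoder step: 'predicts' the previous token plus one."""
    return prev_token + 1


def decode(target, teacher_forcing_ratio, rng):
    """Run the decoder over the target length; return its inputs and guesses."""
    # The choice is made once per call, as in the `train` function.
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    decoder_input = SOS_TOKEN
    inputs, guesses = [], []
    for t in range(len(target)):
        inputs.append(decoder_input)
        guess = toy_decoder(decoder_input)
        guesses.append(guess)
        # Teacher forcing feeds the ground truth; otherwise feed the guess back.
        decoder_input = target[t] if use_teacher_forcing else guess
    return inputs, guesses


target = [7, 8, 9, EOS_TOKEN]
rng = random.Random(0)
forced = decode(target, teacher_forcing_ratio=1.0, rng=rng)  # always teacher forcing
free = decode(target, teacher_forcing_ratio=0.0, rng=rng)  # never teacher forcing
print(forced)
print(free)
```

With teacher forcing, the decoder inputs follow the target sequence regardless of what the decoder guessed; without it, each guess becomes the next input, so early mistakes propagate.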

Note: PyTorch’s RNN modules (`RNN`, `LSTM`, `GRU`) can be used like any other non-recurrent layers by passing them the entire input sequence. We use the `GRU` layer like this in the `encoder`. However, you can also run these modules one time-step at a time, as we do for the `decoder` model.

In [None]:
def train(
    input_variable,
    lengths,
    target_variable,
    mask,
    max_target_len,
    encoder,
    decoder,
    embedding,
    encoder_optimizer,
    decoder_optimizer,
    batch_size,
    clip,
    max_length=MAX_LENGTH,
):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for RNN packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_TOKEN for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[: decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(
                decoder_output, target_variable[t], mask[t], device
            )
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(
                decoder_output, target_variable[t], mask[t], device
            )
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals

### Training Iterations

Now we can combine the training step with the data. The `train_iters` function runs `n_iteration` training iterations using the provided models, optimizers, and data. Most of the heavy lifting is done by the `train` function.

Note that when we save the model, we write a single checkpoint file with `torch.save` (named with a `.tar` extension by convention). It contains the encoder and decoder `state_dict`s (parameters), the optimizers' `state_dict`s, the embedding's `state_dict`, the vocabulary, the latest loss, and the iteration number. Saving the model this way gives us the most flexibility: after loading a checkpoint, we can either use the model parameters to run inference or continue training from where we left off.

In [None]:
def train_iters(
    model_name,
    voc,
    pairs,
    encoder,
    decoder,
    encoder_optimizer,
    decoder_optimizer,
    embedding,
    encoder_n_layers,
    decoder_n_layers,
    save_dir,
    n_iteration,
    batch_size,
    print_every,
    save_every,
    clip,
    corpus_name,
):
    """
    Run training for a set number of iterations.
    """
    # Load batches for each iteration
    training_batches = [
        batch_to_train_data(voc, [random.choice(pairs) for _ in range(batch_size)])
        for _ in range(n_iteration)
    ]

    print("Initializing ...")
    start_iteration = 1
    print_loss_total = 0  # Reset every print_every

    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(
            input_variable,
            lengths,
            target_variable,
            mask,
            max_target_len,
            encoder,
            decoder,
            embedding,
            encoder_optimizer,
            decoder_optimizer,
            batch_size,
            clip,
        )
        print_loss_total += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print(
                f"Iteration: {iteration}; Percent complete: {iteration / n_iteration * 100:.1f}%; Average loss: {print_loss_avg:.4f}"
            )
            print_loss_total = 0

        # Save checkpoint
        if iteration % save_every == 0:
            directory = os.path.join(
                save_dir,
                model_name,
                corpus_name,
                f"{encoder_n_layers}-{decoder_n_layers}_{hidden_features}",
            )
            os.makedirs(directory, exist_ok=True)
            torch.save(
                {
                    "iteration": iteration,
                    "en": encoder.state_dict(),
                    "de": decoder.state_dict(),
                    "en_opt": encoder_optimizer.state_dict(),
                    "de_opt": decoder_optimizer.state_dict(),
                    "loss": loss,
                    "voc_dict": voc.__dict__,
                    "embedding": embedding.state_dict(),
                },
                os.path.join(directory, f"{iteration}_checkpoint.tar"),
            )
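
As a hedged illustration of how such a checkpoint can be read back, the sketch below saves and reloads a dictionary with the same keys that `train_iters` writes (`"iteration"`, `"en"`, `"de"`, `"en_opt"`, `"de_opt"`, `"loss"`). The tiny `nn.GRU` modules, the temporary file, and the loss value are stand-ins chosen for the example; they are not the tutorial's models or results.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in modules (assumption): the tutorial's encoder and decoder are larger.
encoder = nn.GRU(input_size=4, hidden_size=4)
decoder = nn.GRU(input_size=4, hidden_size=4)
encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=5e-4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "500_checkpoint.tar")

    # Save with the same keys that train_iters uses.
    torch.save(
        {
            "iteration": 500,
            "en": encoder.state_dict(),
            "de": decoder.state_dict(),
            "en_opt": encoder_optimizer.state_dict(),
            "de_opt": decoder_optimizer.state_dict(),
            "loss": 3.21,  # made-up value for the example
        },
        path,
    )

    # map_location lets a checkpoint saved on a GPU be read on a CPU-only machine.
    checkpoint = torch.load(path, map_location=torch.device("cpu"))

# Restore the parameters and the optimizer state.
encoder.load_state_dict(checkpoint["en"])
decoder.load_state_dict(checkpoint["de"])
encoder_optimizer.load_state_dict(checkpoint["en_opt"])
decoder_optimizer.load_state_dict(checkpoint["de_opt"])

# To continue training, resume from the iteration after the saved one.
start_iteration = checkpoint["iteration"] + 1
print(f"Resuming at iteration {start_iteration}; last loss {checkpoint['loss']:.2f}")
```

For inference only the `"en"` and `"de"` entries are needed. To resume training, `start_iteration` in `train_iters` would have to be taken from the checkpoint instead of being fixed at 1.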

## Evaluation

After training the model, we want to interact with the bot. We need to define how the model decodes the encoded input.

### Greedy Decoding

Greedy decoding is the decoding method used during training when we are **not** using teacher forcing. At each time step, we simply choose the word from `decoder_output` with the highest softmax value. This choice is optimal for a single time step, but it does not guarantee the most probable output sequence overall.
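
To make the difference between step-wise and sequence-level optimality concrete, here is a small sketch with a hypothetical two-step model. The vocabulary and all probabilities are invented for illustration. It compares the greedy choice with an exhaustive search over every two-token sequence.

```python
from itertools import product

# Hypothetical two-step model over a tiny vocabulary (all numbers are made up).
first_step = {"a": 0.6, "b": 0.4}
second_step = {
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.90, "y": 0.10},
}


def sequence_probability(first, second):
    """Probability of the two-token sequence (first, second)."""
    return first_step[first] * second_step[first][second]


# Greedy decoding: take the most probable token at each step.
greedy_first = max(first_step, key=first_step.get)
greedy_second = max(second_step[greedy_first], key=second_step[greedy_first].get)
greedy_sequence = (greedy_first, greedy_second)

# Exhaustive search: score every possible sequence and keep the best one.
best_sequence = max(
    product(first_step, ["x", "y"]), key=lambda seq: sequence_probability(*seq)
)

print("greedy:", greedy_sequence, round(sequence_probability(*greedy_sequence), 2))
print("best:  ", best_sequence, round(sequence_probability(*best_sequence), 2))
```

Greedy decoding commits to `"a"` because it is the most likely first token, and so it misses the sequence starting with `"b"`, which is more probable overall. Exhaustive search is only feasible for toy cases like this one, which is why greedy decoding remains a practical default.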

We define a `GreedySearchDecoder` class to perform greedy decoding. The input sentence is evaluated as follows:

**Computation Steps:**

   1) Pass input through the encoder model.
   2) Prepare the encoder's final hidden layer to be the first hidden input to the decoder.
   3) Initialize the decoder's first input as `SOS_TOKEN`.
   4) Initialize tensors to append decoded words to.
   5) Iteratively decode one word token at a time:
       a) Pass through the decoder.
       b) Obtain the most likely word token and its softmax score.
       c) Record the token and score.
       d) Prepare the current token to be the next decoder input.
   6) Return collections of word tokens and scores.

In [None]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        """
        Greedy decoding module initialization.
        """
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        """
        Forward propagation of the input to produce a sequence of tokens.
        """
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)

        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[: self.decoder.n_layers]

        # Initialize decoder input with SOS_TOKEN
        decoder_input = torch.tensor([[SOS_TOKEN]], device=device, dtype=torch.long)

        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros(0, dtype=torch.long, device=device)
        all_scores = torch.zeros(0, device=device)

        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)

            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)

            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)

            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)

        # Return collections of word tokens and scores
        return all_tokens, all_scores

### Text Evaluation

With our decoding method defined, we can write functions that evaluate a string input sentence. The `evaluate` function handles the low-level processing of the input sentence. It first formats the sentence as an input batch of word indexes with *batch_size==1*, by converting the words to their corresponding indexes and transposing the dimensions to match the layout our models expect. It also creates a `lengths` tensor containing the length of the input sentence. The decoded response tensor is then obtained from our `GreedySearchDecoder` object (`searcher`). Finally, the response’s indexes are converted to words, and the list of decoded words is returned.

`evaluate_input` serves as the user interface for our chatbot. It displays an input prompt where we can type a query sentence. After we press *Enter*, the text is normalized in the same way as the training data and passed to `evaluate` to obtain a decoded output sentence. This loop repeats, so we can keep chatting with the bot until we enter either “q” or “quit”.

If a sentence is entered that contains a word not in the vocabulary, an error message is printed and the user is prompted to enter another sentence.
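
The input formatting described above can be sketched without any model. The snippet below uses a toy vocabulary and a hypothetical helper `toy_indexes_from_sentence`. The word indexes and the value of `EOS_TOKEN` are assumptions for the example, and plain lists stand in for tensors.

```python
# Toy vocabulary (assumption): the real mapping lives in voc.word_to_index.
EOS_TOKEN = 2  # assumed index of the end-of-sentence token
word_to_index = {"hello": 3, "how": 4, "are": 5, "you": 6}


def toy_indexes_from_sentence(sentence):
    """Map each word to its index and append the EOS index."""
    return [word_to_index[word] for word in sentence.split()] + [EOS_TOKEN]


# A batch holding a single sentence: shape (batch_size=1, seq_len).
indexes_batch = [toy_indexes_from_sentence("how are you")]
lengths = [len(indexes) for indexes in indexes_batch]

# Transpose to (seq_len, batch_size=1), i.e. one time step per row.
input_batch = [list(column) for column in zip(*indexes_batch)]

print(indexes_batch)  # [[4, 5, 6, 2]]
print(input_batch)    # [[4], [5], [6], [2]]
print(lengths)        # [4]

# A word outside the vocabulary raises KeyError, which evaluate_input catches.
try:
    toy_indexes_from_sentence("how are zebras")
    unknown_word = None
except KeyError as error:
    unknown_word = error.args[0]

print("Unknown word:", unknown_word)
```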

In [None]:
def evaluate(searcher, voc, sentence, max_length=MAX_LENGTH):
    """
    Evaluate a sentence using the encoder, decoder, and searcher provided.
    """
    # Prepare the input sentence as a batch of word indexes
    indexes_batch = [indexes_from_sentence(voc, sentence)]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1).to(device)
    lengths = lengths.to("cpu")  # Lengths need to be on CPU for pack_padded_sequence

    # Decode the sentence with the searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index_to_word[token.item()] for token in tokens]

    return decoded_words


def evaluate_input(encoder, decoder, searcher, voc):
    """
    Interactively evaluate input from the user.
    """
    while True:
        try:
            input_sentence = input("> ")
            if input_sentence in ("q", "quit"):
                break
            print(input_sentence)
            input_sentence = normalize_string(input_sentence)
            output_words = evaluate(searcher, voc, input_sentence)
            output_words = [word for word in output_words if word not in ("EOS", "PAD")]
            print("Bot:", " ".join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")
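
Because `GreedySearchDecoder` always produces `max_length` tokens, any words generated after the first `EOS` remain in the output; `evaluate_input` only removes the `EOS` and `PAD` markers themselves. One possible refinement is to cut the response at the first `EOS`. The sketch below does this with a hypothetical helper `trim_response`, which is not part of the original code. The decoded word list is invented for the example.

```python
def trim_response(words, eos="EOS", pad="PAD"):
    """Keep the words before the first EOS marker and drop PAD markers."""
    trimmed = []
    for word in words:
        if word == eos:
            break
        if word != pad:
            trimmed.append(word)
    return trimmed


# Made-up decoder output that keeps generating after the first EOS.
decoded = ["i", "am", "fine", ".", "EOS", "fine", ".", "EOS", "PAD", "PAD"]

filtered_only = [word for word in decoded if word not in ("EOS", "PAD")]
print(" ".join(filtered_only))           # i am fine . fine .
print(" ".join(trim_response(decoded)))  # i am fine .
```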


## Models Overview

### Seq2Seq Model

Our chatbot uses a sequence-to-sequence (seq2seq) model, which takes a variable-length sequence as input and returns a variable-length sequence as output. This is achieved by using two separate recurrent neural nets (RNNs): an **encoder** and a **decoder**. The encoder encodes the input sequence into a fixed-length context vector, which theoretically contains semantic information about the input sentence. The decoder takes an input word and the context vector, and returns a guess for the next word in the sequence and a hidden state for the next iteration.

### Encoder

The encoder RNN iterates through the input sentence one token at a time, producing an "output" vector and a "hidden state" vector at each time step. The hidden state vector is passed on to the next time step, while the output vector is recorded. The encoder is a multi-layered, bidirectional Gated Recurrent Unit (GRU): one pass reads the sequence in order and the other reads it in reverse, so each output encodes context from both directions. An `embedding` layer maps the word indices into a feature space of a chosen size (here `hidden_features`).
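
The following is a minimal sketch of such an encoder built directly from `torch.nn` layers, mainly to show the tensor shapes involved. It is not the `dl.EncoderRNN` implementation used in this notebook. The toy sizes, the example batch, and the choice to sum the two directions are assumptions made for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

vocab_size, hidden_size, n_layers = 20, 8, 2  # toy sizes (assumption)
embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size, num_layers=n_layers, bidirectional=True)

# Padded batch of shape (max_len=5, batch_size=3); 0 is the padding index here.
input_seq = torch.tensor(
    [
        [4, 7, 9],
        [5, 8, 2],
        [6, 2, 0],
        [3, 0, 0],
        [2, 0, 0],
    ]
)
lengths = torch.tensor([5, 3, 2])  # true lengths, longest first, kept on the CPU

embedded = embedding(input_seq)                   # (5, 3, 8)
packed = pack_padded_sequence(embedded, lengths)  # skips the padded positions
packed_outputs, hidden = gru(packed)
outputs, _ = pad_packed_sequence(packed_outputs)  # (5, 3, 16)

# One common choice (assumption): sum the forward and backward halves.
outputs = outputs[:, :, :hidden_size] + outputs[:, :, hidden_size:]

print(outputs.shape)            # torch.Size([5, 3, 8])
print(hidden.shape)             # torch.Size([4, 3, 8]): n_layers * 2 directions
print(hidden[:n_layers].shape)  # torch.Size([2, 3, 8])
```

The last line mirrors the slice `encoder_hidden[: decoder.n_layers]` used in the training and decoding code, which reduces the bidirectional hidden state to the number of layers the decoder expects.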

### Decoder

The decoder RNN generates the response sentence token by token. It uses the encoder’s context vector and its own internal hidden state to generate the next word in the sequence. To avoid information loss, especially with long input sequences, an "attention mechanism" is typically added, which lets the decoder focus on specific parts of the input sequence instead of relying on a single fixed context vector at every step. In this tutorial, however, we use only standard RNNs without attention, which limits the performance of our model.
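
A single decoding step without attention can be sketched as follows. This is an illustration built from plain `torch.nn` layers, not the `dl.DecoderRNN` used in this notebook. The toy sizes, the value of `SOS_TOKEN`, and the zero initial hidden state (standing in for the encoder's final state) are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

vocab_size, hidden_size, n_layers, batch_size = 20, 8, 2, 3  # toy sizes (assumption)
SOS_TOKEN = 1  # assumed index of the start-of-sentence token

embedding = nn.Embedding(vocab_size, hidden_size)
gru = nn.GRU(hidden_size, hidden_size, num_layers=n_layers)
out = nn.Linear(hidden_size, vocab_size)

# One SOS token per sentence in the batch: shape (1, batch_size).
decoder_input = torch.full((1, batch_size), SOS_TOKEN, dtype=torch.long)
# Stand-in for the encoder's final hidden state.
decoder_hidden = torch.zeros(n_layers, batch_size, hidden_size)

with torch.no_grad():
    embedded = embedding(decoder_input)                # (1, batch, hidden)
    rnn_output, decoder_hidden = gru(embedded, decoder_hidden)
    logits = out(rnn_output.squeeze(0))                # (batch, vocab)
    probabilities = torch.softmax(logits, dim=1)

    # Greedy choice: the most probable word becomes the next decoder input.
    scores, next_tokens = probabilities.max(dim=1)
    decoder_input = next_tokens.unsqueeze(0)           # (1, batch)

print(probabilities.shape)  # torch.Size([3, 20])
print(decoder_input.shape)  # torch.Size([1, 3])
```

Repeating the block inside the `with` statement once per time step yields the token-by-token generation described above.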

In [None]:
# Set configuration parameters for the model
model_name = "cb_model"
hidden_features = 500
in_features = hidden_features
encoder_n_layers = 2
decoder_n_layers = 2
rnn_type = "GRU"
dropout = 0.1
batch_size = 64

# Set training and optimization parameters
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 1
save_every = 500

embedding = nn.Embedding(voc.num_words, hidden_features)
# Initialize the EncoderRNN
encoder = dl.EncoderRNN(
    in_features=in_features,
    hidden_features=hidden_features,
    num_layers=encoder_n_layers,
    embedding=embedding,
    rnn_type=rnn_type,
    dropout=dropout,
).to(device)


encoder.build()
print(encoder)

decoder = dl.DecoderRNN(
    in_features=in_features,
    hidden_features=hidden_features,
    out_features=voc.num_words,
    num_layers=decoder_n_layers,
    embedding=embedding,
    rnn_type=rnn_type,
    dropout=dropout,
).to(device)


decoder.build()
print(decoder)

## Define and Train Model

Now we're ready to run our model!

Whether we are training or testing the chatbot, the encoder and decoder must be initialized first, which we did in the previous block. In the next block, we set the models to training mode, create the optimizers, and start training. You're encouraged to experiment with different model configurations to improve performance.


In [None]:
# Set models to training mode (enables dropout)
encoder.train()
decoder.train()

# Initialize optimizers for the encoder and decoder
encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(
    decoder.parameters(), lr=learning_rate * decoder_learning_ratio
)

# Move optimizer states to GPU if necessary
if torch.cuda.is_available():
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

# Begin the training process
train_iters(
    model_name,
    voc,
    pairs,
    encoder,
    decoder,
    encoder_optimizer,
    decoder_optimizer,
    embedding,
    encoder_n_layers,
    decoder_n_layers,
    save_dir,
    n_iteration,
    batch_size,
    print_every,
    save_every,
    clip,
    corpus_name,
)

In [None]:
# Set dropout layers to ``eval`` mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (enter "q" or "quit" to stop)
evaluate_input(encoder, decoder, searcher, voc)

## Incorporating Conversation History

So far, the chatbot answers each input in isolation. We now extend training and evaluation so that the model also sees the preceding turns of the conversation. The changes are summarized below.

### Changes for Training with Conversation History

- **`create_history_pairs`**: builds training pairs that include conversation history. For each pair, the input is a window of consecutive input utterances from `pairs` (reaching back up to `MAX_HISTORY` exchanges), joined with the `EOS` token. The target is the pair's own response.
- **`batch_to_train_data`**: batch preparation is essentially unchanged. The only difference is that the input sequences now contain conversation history and are therefore longer.
- **`train`**: the core training logic is unchanged. The encoder forward pass and the decoding steps need no modification, because the history is simply part of the input sequence.
- **Data preparation**: `history_pairs` is generated with `create_history_pairs` and used in place of `pairs` throughout training.

### Changes for Evaluation with Conversation History

- **`evaluate`**: now receives the conversation history instead of a single sentence. The history is joined into one space-separated string, normalized, and fed to the model to generate a response.
- **`evaluate_input`**: manages the interactive loop. It maintains a `conversation_history` list, appends each user input and each model response to it, trims it to the most recent `MAX_HISTORY * MAX_LENGTH` entries, and passes the accumulated history to `evaluate` for every new input.

### Training and Evaluation Process

The model is trained on input sequences that include conversation history, which allows it to learn from the context of the conversation. During evaluation, the model uses the accumulated conversation history to generate more context-aware responses.

### General Setup

The model, optimizers, and training configuration are defined in the same way as before, here with larger hyperparameter values. The training loop (`train_iters`) follows the standard approach but uses the modified input pairs with conversation history.

### Important Notes

- The benefit of including conversation history depends on how much context the model can capture and on how well it handles longer input sequences.
- Experimenting with `MAX_HISTORY` and `MAX_LENGTH` may be necessary to achieve good performance.
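
As a small, self-contained illustration of how a history window turns several utterances into one input string, consider the sketch below. The helper `build_history_input` is hypothetical. It assumes that the utterances are in conversational order and that the current utterance is part of the input. The example sentences are invented.

```python
def build_history_input(utterances, index, max_history):
    """Join the utterance at `index` with up to `max_history` earlier ones."""
    start = max(0, index - max_history)
    return " EOS ".join(utterances[start : index + 1])


utterances = ["hi", "hello there", "how are you ?", "fine thanks"]

first = build_history_input(utterances, 0, max_history=2)
last = build_history_input(utterances, 3, max_history=2)

print(first)              # hi
print(last)               # hello there EOS how are you ? EOS fine thanks
print(len(last.split()))  # 10 words, versus 2 for "fine thanks" alone
```

The last line shows why history makes input sequences longer, which is the trade-off noted above when choosing `MAX_HISTORY`.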

In [None]:
MAX_LENGTH = 10  # Maximum length of a single sentence, adjust as necessary
MAX_HISTORY = 5  # Number of previous exchanges to include in the history


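def create_history_pairs(pairs, max_history=MAX_HISTORY):
    """
    Build [history, response] pairs. Each input is the current query preceded
    by the queries of up to `max_history` earlier pairs, joined with "EOS".
    Assumes `pairs` is in conversation order.
    """
    history_pairs = []
    for i in range(len(pairs)):
        # Use i + 1 so the current query is included, not only earlier ones
        dialogue_history = " EOS ".join(
            [pairs[j][0] for j in range(max(0, i - max_history), i + 1)]
        )
        history_pairs.append([dialogue_history, pairs[i][1]])
    return history_pairs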


def zero_padding(l, fillvalue=0):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))


def binary_matrix(l, value=0):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m


def output_var(l, voc):
    indexes_batch = [indexes_from_sentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    pad_list = zero_padding(indexes_batch)
    mask = binary_matrix(pad_list)
    mask = torch.BoolTensor(mask)
    pad_var = torch.LongTensor(pad_list)
    return pad_var, mask, max_target_len


def indexes_from_sentence(voc, sentence):
    # split() without arguments ignores repeated whitespace and returns []
    # for an empty string
    return [voc.word_to_index[word] for word in sentence.split()] + [EOS_TOKEN]


def batch_to_train_data(voc, pair_batch):
    pair_batch.sort(key=lambda p: len(p[0].split()), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = input_var(input_batch, voc)
    output, mask, max_target_len = output_var(output_batch, voc)
    return inp, lengths, output, mask, max_target_len


# Convert input sentences to a padded index tensor and record their true lengths (needed for packing)
def input_var(l, voc):
    indexes_batch = [indexes_from_sentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    pad_list = zero_padding(indexes_batch)
    pad_var = torch.LongTensor(pad_list)
    return pad_var, lengths


# Prepare the data with history
history_pairs = create_history_pairs(pairs)


# Training step: the logic is unchanged, but the input sequences now include the dialogue history
def train(
    input_variable,
    lengths,
    target_variable,
    mask,
    max_target_len,
    encoder,
    decoder,
    embedding,
    encoder_optimizer,
    decoder_optimizer,
    batch_size,
    clip,
    max_length=MAX_LENGTH,
):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for RNN packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_TOKEN for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[: decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(
                decoder_output, target_variable[t], mask[t], device
            )
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(
                decoder_output, target_variable[t], mask[t], device
            )
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals


def train_iters(
    model_name,
    voc,
    history_pairs,
    encoder,
    decoder,
    encoder_optimizer,
    decoder_optimizer,
    embedding,
    encoder_n_layers,
    decoder_n_layers,
    save_dir,
    n_iteration,
    batch_size,
    print_every,
    save_every,
    clip,
    corpus_name,
):
    """
    Run training for a set number of iterations.
    """
    # Load batches for each iteration
    training_batches = [
        batch_to_train_data(
            voc, [random.choice(history_pairs) for _ in range(batch_size)]
        )
        for _ in range(n_iteration)
    ]

    print("Initializing ...")
    start_iteration = 1
    print_loss_total = 0  # Reset every print_every

    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(
            input_variable,
            lengths,
            target_variable,
            mask,
            max_target_len,
            encoder,
            decoder,
            embedding,
            encoder_optimizer,
            decoder_optimizer,
            batch_size,
            clip,
        )
        print_loss_total += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print(
                f"Iteration: {iteration}; Percent complete: {iteration / n_iteration * 100:.1f}%; Average loss: {print_loss_avg:.4f}"
            )
            print_loss_total = 0

        # Save checkpoint
        if iteration % save_every == 0:
            directory = os.path.join(
                save_dir,
                model_name,
                corpus_name,
                f"{encoder_n_layers}-{decoder_n_layers}_{hidden_features}",
            )
            os.makedirs(directory, exist_ok=True)
            torch.save(
                {
                    "iteration": iteration,
                    "en": encoder.state_dict(),
                    "de": decoder.state_dict(),
                    "en_opt": encoder_optimizer.state_dict(),
                    "de_opt": decoder_optimizer.state_dict(),
                    "loss": loss,
                    "voc_dict": voc.__dict__,
                    "embedding": embedding.state_dict(),
                },
                os.path.join(directory, f"{iteration}_checkpoint.tar"),
            )


def evaluate(searcher, voc, conversation_history, max_length=MAX_LENGTH):
    """
    Evaluate a conversation history using the encoder, decoder, and searcher provided.
    """
    # Join the conversation history into a single input and normalize
    input_sentence = " ".join(conversation_history)
    input_sentence = normalize_string(input_sentence)

    # Prepare the input sentence as a batch of word indexes
    indexes_batch = [indexes_from_sentence(voc, input_sentence)]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1).to(device)
    lengths = lengths.to("cpu")  # Lengths need to be on CPU for pack_padded_sequence

    # Decode the sentence with the searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index_to_word[token.item()] for token in tokens]

    return decoded_words


def evaluate_input(searcher, voc):
    """
    Interactively evaluate input from the user, considering the entire conversation history.
    """
    conversation_history = []
    while True:
        try:
            input_sentence = input("> ")
            if input_sentence in ("q", "quit"):
                break

            # Normalize and add the user's input to the conversation history
            input_sentence = normalize_string(input_sentence)
            conversation_history.append(input_sentence)

            # Evaluate the conversation history
            output_words = evaluate(searcher, voc, conversation_history)
            output_words = [word for word in output_words if word not in ("EOS", "PAD")]
            print("Bot:", " ".join(output_words))

            # Add the bot's response to the conversation history
            conversation_history.extend(output_words)
            conversation_history = conversation_history[-MAX_HISTORY * MAX_LENGTH :]

        except KeyError:
            print("Error: Encountered unknown word.")


# Set configuration parameters for the model
model_name = "cb_model"
hidden_features = 2000
in_features = hidden_features
encoder_n_layers = 4
decoder_n_layers = 4
dropout = 0.1
batch_size = 64

# Set training and optimization parameters
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 16000
print_every = 1
save_every = 8000

# Initialize word embeddings and encoder & decoder models
embedding = nn.Embedding(voc.num_words, hidden_features)
# Initialize the EncoderRNN
encoder = dl.EncoderRNN(
    in_features=in_features,
    hidden_features=hidden_features,
    num_layers=encoder_n_layers,
    embedding=embedding,
    rnn_type=rnn_type,
    dropout=dropout,
).to(device)

decoder = dl.DecoderRNN(
    in_features=in_features,
    hidden_features=hidden_features,
    out_features=voc.num_words,
    num_layers=decoder_n_layers,
    embedding=embedding,
    rnn_type=rnn_type,
    dropout=dropout,
).to(device)

encoder.build()
decoder.build()
# Set models to training mode
encoder.train()
decoder.train()

# Initialize optimizers for the encoder and decoder
encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(
    decoder.parameters(), lr=learning_rate * decoder_learning_ratio
)

# Move optimizer states to GPU if necessary
if torch.cuda.is_available():
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

In [None]:
# Begin the training process
train_iters(
    model_name,
    voc,
    history_pairs,
    encoder,
    decoder,
    encoder_optimizer,
    decoder_optimizer,
    embedding,
    encoder_n_layers,
    decoder_n_layers,
    save_dir,
    n_iteration,
    batch_size,
    print_every,
    save_every,
    clip,
    corpus_name,
)

In [None]:
def evaluate(searcher, voc, conversation_history, max_length=MAX_LENGTH):
    """
    Evaluate a conversation history using the encoder, decoder, and searcher provided.
    """
    # Join the conversation history into a single input and normalize
    input_sentence = " ".join(conversation_history)
    input_sentence = normalize_string(input_sentence)
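    # Restore the separator token, which normalization lowercases. Compare whole
    # words so that words merely containing "eos" (e.g. "videos") are untouched.
    input_sentence = " ".join(
        "EOS" if word == "eos" else word for word in input_sentence.split()
    )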

    # Prepare the input sentence as a batch of word indexes
    indexes_batch = [indexes_from_sentence(voc, input_sentence)]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1).to(device)
    lengths = lengths.to("cpu")  # Lengths need to be on CPU for pack_padded_sequence

    # Decode the sentence with the searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index_to_word[token.item()] for token in tokens]

    return decoded_words


def evaluate_input(searcher, voc):
    """
    Interactively evaluate input from the user, considering the entire conversation history.
    """
    conversation_history = []
    while True:
        try:
            input_sentence = input("> ")
            if input_sentence in ("q", "quit"):
                break

            # Normalize and add the user's input to the conversation history
            print(input_sentence)
            input_sentence = normalize_string(input_sentence)
            conversation_history.append(input_sentence)
            conversation_history.append("EOS")

            # Evaluate the conversation history
            output_words = evaluate(searcher, voc, conversation_history)
            print(
                "Bot:",
                " ".join([word for word in output_words if word not in ("EOS", "PAD")]),
            )

            # Add the bot's response to the conversation history
            conversation_history.extend(output_words)
            conversation_history = conversation_history[-MAX_HISTORY * MAX_LENGTH :]

        except KeyError:
            print("Error: Encountered unknown word.")


# Set dropout layers to ``eval`` mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (enter "q" or "quit" to stop)
evaluate_input(searcher, voc)
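
`evaluate_input` bounds the history by slicing the list to its last `MAX_HISTORY * MAX_LENGTH` entries. An alternative is to store whole utterances in a `collections.deque` with a fixed `maxlen`, so that the oldest utterance is dropped automatically. The sketch below shows this alternative, which is not the notebook's method. The value of `MAX_HISTORY` and the utterances are toy values.

```python
from collections import deque

MAX_HISTORY = 3  # toy value for the example

history = deque(maxlen=MAX_HISTORY)
for utterance in ["hi", "hello", "how are you ?", "fine", "good to hear"]:
    history.append(utterance)  # the oldest utterance is dropped automatically

# Join the remaining utterances with the separator used during training.
model_input = " EOS ".join(history)
print(model_input)  # how are you ? EOS fine EOS good to hear
```

Bounding the history by utterances rather than by list entries keeps every stored turn intact, at the cost of a less predictable input length.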