In this guide, we explore an application of recurrent sequence-to-sequence models: training a basic chatbot on film scripts from the Cornell Movie-Dialogs Corpus.

Conversational models are an active topic in AI research. Chatbots are commonly used in settings such as customer service platforms and online help desks. These bots typically rely on retrieval-based models that return predefined responses to specific types of questions. While such models can be adequate for narrow domains like a company's IT helpdesk, they lack the robustness required for broader applications. The recent surge in deep learning, popularized by ChatGPT, has led to the development of powerful multi-domain generative conversational models. In this guide, we will create one such model using the tools we have learned so far.

To begin, download the dialog dataset from

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

and put it in a ``data/`` directory under the current directory. The ConvoKit version of the corpus is also available as a single archive:

https://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip

## Data Loading and Preprocessing

This step involves reorganizing our data file and loading the data into formats that are manageable.

The [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) is a comprehensive dataset of dialogues from movie characters:

- It contains 220,579 conversational exchanges between 10,292 pairs of movie characters.
- It features 9,035 characters from 617 movies.
- It has a total of 304,713 utterances.

This dataset is vast and varied, with a wide range of language formality, time periods, sentiment, etc. We anticipate that this diversity will make our model capable of handling a variety of inputs and queries.

Initially, we will examine some lines from our data file to understand the original format.

An utterance is an uninterrupted chain of spoken or written language.

In [None]:
filepath = "./dialogs.txt"

In [None]:
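import re

import contractions
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")


def tokenize(text):
    # Expand contractions, e.g. "don't" -> "do not"
    standardized_text = contractions.fix(text)

    # Replace typographic quotes and accents with plain ASCII quotes.
    # The doubled accent must be handled before the single one, or it never matches.
    standardized_text = (
        standardized_text.replace("’", "'")
        .replace("‘", "'")
        .replace("´´", '"')
        .replace("´", "'")
        .replace("“", '"')
        .replace("”", '"')
    )

    tokens = tokenizer(standardized_text)

    # Keep only tokens made of letters, digits, and basic punctuation,
    # optionally joined by hyphens or underscores
    filtered_tokens = [
        token
        for token in tokens
        if re.match(
            r"^[a-zA-Z0-9.,!?]+(-[a-zA-Z0-9.,!?]+)*(_[a-zA-Z0-9.,!?]+)*$", token
        )
    ]
    return filtered_tokens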

In [None]:
import yaml
from torchtext.vocab import build_vocab_from_iterator


def corpus_iterator(filepath):
    with open(filepath, "r", encoding="utf-8") as file:
        lines = file.readlines()

        prev_reply = None
        for line in lines:

            query, reply = line.strip().split("\t")

            # Consecutive lines often chain: the current query repeats the previous reply.
            # In that case, yield only the reply so the shared sentence is not counted twice.
            if query == prev_reply:
                out = reply
            else:
                out = query + " " + reply
            prev_reply = reply

            yield tokenize(out)


# Special tokens: padding, start of sequence, end of sequence, and unknown word
special_tokens = ["<pad>", "<sos>", "<eos>", "<unk>"]

vocab = build_vocab_from_iterator(
    corpus_iterator(filepath),
    specials=special_tokens,
    min_freq=2,
)
vocab.set_default_index(vocab["<unk>"])

In [None]:
len(vocab)

In [None]:
vocab.lookup_token(100)

In [None]:
import numpy as np

MAX_LENGTH = 30

queries, responses, masks_r = [], [], []


def all_words_in_vocab(sentence, vocab):
    return all(word in vocab for word in sentence)


def process_sentence(sequence, max_length=MAX_LENGTH):
    # Pad every sequence to the same total length of max_length + 3 tokens:
    # <sos> + sequence + <eos> + at least one <pad>
    padding_length = max_length - len(sequence) + 1

    # Processed sequence with <sos>, <eos>, and <pad>
    processed = ["<sos>"] + sequence + ["<eos>"] + ["<pad>"] * padding_length

    # Create a mask: 1s for actual tokens and 0s for padding
    # The mask length is len(sequence) + 2 for <sos> and <eos> tokens. The rest are 0s for padding.
    mask = [1] * (len(sequence) + 2) + [0] * padding_length

    return processed, mask


with open(filepath, "r", encoding="utf-8") as file:
    lines = file.readlines()

    prev_reply = None
    for line in lines:

        q, r = line.strip().split("\t")

        query = tokenize(q)
        response = tokenize(r)

        if (
            all_words_in_vocab(query + response, vocab)
            and len(query) <= MAX_LENGTH
            and len(response) <= MAX_LENGTH
        ):
            query, _ = process_sentence(query)
            response, mask_r = process_sentence(response)

            queries.append(vocab(query))
            responses.append(vocab(response))
            masks_r.append(mask_r)

queries = np.asarray(queries)
responses = np.asarray(responses)
masks_r = np.asarray(masks_r)

print(f"Number of queries/responses: {len(queries)}")

In [None]:
len(queries[0])

In [None]:
import torch
from torchtext.vocab import GloVe

# The dimensionality of GloVe embeddings
embedding_dim = 300

# Load GloVe embeddings
glove = GloVe(name="42B", dim=embedding_dim, cache="./.vector_cache")

# Get one GloVe vector per vocabulary token, in index order.
# vocab.get_itos() already includes the special tokens at the beginning,
# so the result has exactly len(vocab) rows. Tokens missing from GloVe get zeros.
glove_embeddings = glove.get_vecs_by_tokens(vocab.get_itos(), lower_case_backup=True)

# Overwrite the rows of the special tokens in place, so that row i still
# corresponds to vocabulary index i (concatenating extra rows in front would
# shift every embedding by three positions).
# PAD is set to zeros, and SOS and EOS to small random values.
# <unk> keeps the vector returned by GloVe above.
extended_embeddings = glove_embeddings.clone()
extended_embeddings[vocab["<pad>"]] = 0.0
for token in ("<sos>", "<eos>"):
    extended_embeddings[vocab[token]] = torch.rand(embedding_dim) * 0.01

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import deeplay as dl
from deeplay import DeeplayModule, Classifier

hidden_features = 150


class MyClassifier(Classifier):

    def training_step(self, batch, batch_idx):
        x1, x2, m = batch
        y = torch.cat((x2[:, 1:], x2[:, -1:]), dim=1)
        y_hat = self(x1, x2)
        loss = self.loss(y_hat, y, m)
        # loss = self.loss(y_hat.view(-1, y_hat.size(-1)), y.view(-1))
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )

        self.log_metrics(
            "train", y_hat, y, on_step=True, on_epoch=True, prog_bar=True, logger=True
        )

        return loss

    def forward(self, x1, x2):
        return self.model(x1, x2)


class Encoder(DeeplayModule):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Dropout in nn.LSTM only acts between stacked layers, so it is omitted for this single-layer LSTM
        self.lstm = nn.LSTM(embedding_dim, lstm_units, batch_first=True)

    def forward(self, x):
        x = self.embedding(x)
        # x = torch.nn.utils.rnn.pack_padded_sequence(x, lengths)
        outputs, (hidden, cell) = self.lstm(x)
        return hidden, cell


class Decoder(DeeplayModule):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Dropout in nn.LSTM only acts between stacked layers, so it is omitted for this single-layer LSTM
        self.lstm = nn.LSTM(embedding_dim, lstm_units, batch_first=True)
        self.dense = nn.Linear(lstm_units, vocab_size)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, hidden, cell):
        x = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(x, (hidden, cell))
        outputs = self.dense(outputs)
        outputs = self.softmax(outputs)
        return outputs, hidden, cell


class Seq2Seq(DeeplayModule):
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Seq2Seq, self).__init__()
        self.encoder = Encoder(vocab_size, embedding_dim, lstm_units)
        self.decoder = Decoder(vocab_size, embedding_dim, lstm_units)
        self.vocab_size = vocab_size
        self.lstm_units = lstm_units

    def forward(self, encoder_input_data, decoder_input_data):
        encoder_hidden, encoder_cell = self.encoder(encoder_input_data)

        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell

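        # Allocate the output on the same device as the input instead of hardcoding one
        outputs = torch.zeros(
            (decoder_input_data.size(0), decoder_input_data.size(1), self.vocab_size),
            device=decoder_input_data.device,
        )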
        for t in range(decoder_input_data.size(1)):  # Iterate through the sequence
            output, decoder_hidden, decoder_cell = self.decoder(
                decoder_input_data[:, t].unsqueeze(-1), decoder_hidden, decoder_cell
            )
            outputs[:, t, :] = output.squeeze(1)
        return outputs


seq2seq = Seq2Seq(len(vocab), embedding_dim, hidden_features)


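def NLLLoss(inp, target, mask):
    # inp: (batch, time, vocab) probabilities; target: (batch, time) token indices;
    # mask: (batch, time) booleans that are False at padding positions.
    # torch.gather requires int64 indices, hence the .long() conversion.
    crossEntropy = -torch.log(
        torch.gather(inp.view(-1, inp.shape[-1]), 1, target.view(-1, 1).long())
    )
    # Average the negative log likelihood over the unmasked positions only
    loss = crossEntropy.masked_select(mask.view(-1, 1)).mean()
    return loss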


seq2seq_classifier = MyClassifier(
    model=seq2seq,
    loss=NLLLoss,  # nn.CrossEntropyLoss(),
    optimizer=dl.RMSprop(),
).create()

seq2seq_classifier.model.encoder.embedding.weight.data = extended_embeddings
seq2seq_classifier.model.encoder.embedding.weight.requires_grad = False
seq2seq_classifier.model.decoder.embedding.weight.data = extended_embeddings
seq2seq_classifier.model.decoder.embedding.weight.requires_grad = False

In [None]:
import deeptrack as dt
import torch

sources = dt.sources.Source(inputs=queries, targets=responses, masks=masks_r)

inputs_pl = dt.Value(sources.inputs) >> dt.pytorch.ToTensor(dtype=torch.int)
targets_pl = dt.Value(sources.targets) >> dt.pytorch.ToTensor(dtype=torch.int)
masks_pl = dt.Value(sources.masks) >> dt.pytorch.ToTensor(dtype=torch.bool)

In [None]:
from torch.utils.data import DataLoader

train_dataset = dt.pytorch.Dataset(inputs_pl & targets_pl & masks_pl, inputs=sources)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)

In [None]:
trainer = dl.Trainer(max_epochs=100, accelerator="auto")  # use MPS, CUDA, or CPU, whichever is available

In [None]:
trainer.fit(seq2seq_classifier, train_loader)

In [None]:
def make_inference(model, source_text, max_length=MAX_LENGTH):
    # Tokenize the source text
    query_tokens = tokenize(source_text)

    # Add <sos>, <eos>, and padding, as was done for the training data
    query, _ = process_sentence(query_tokens, max_length)

    # Convert tokens to indices (out-of-vocabulary tokens map to <unk>), build a
    # tensor on the model's device, and add a batch dimension: (1, sequence_length)
    device = next(model.parameters()).device
    source_sequence = torch.tensor(
        vocab(query), dtype=torch.int, device=device
    ).unsqueeze(0)

    # Encoder inference
    with torch.no_grad():
        hidden, cell = model.encoder(source_sequence)

    # Initial input to the decoder: the <sos> token index, with shape (1, 1)
    target_index = torch.tensor([vocab(["<sos>"])], device=device)
    eos_index = vocab["<eos>"]

    predictions = []
    for _ in range(max_length):
        with torch.no_grad():
            output, hidden, cell = model.decoder(target_index, hidden, cell)
        # output has shape (1, 1, vocab_size); pick the most probable token
        top1 = output.argmax(-1)
        if top1.item() == eos_index:  # Stop if <eos> token is generated
            break
        predictions.append(top1.item())
        target_index = top1

    # Convert indices back to tokens
    predicted_tokens = [vocab.lookup_token(idx) for idx in predictions]

    return " ".join(predicted_tokens)

In [1]:
source_text = "you are special!"
response = make_inference(seq2seq_classifier.model, source_text)
print(response)


### Generate Formatted Data File

For ease of use, we will generate a well-structured data file where each line comprises a tab-separated pair of a *query sentence* and a *response sentence*.
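As a rough illustration of the target format, the sketch below pairs each utterance with the one that follows it and writes one tab-separated pair per line. It assumes the conversations are already available as lists of utterance strings; the `conversations` data and the `make_pairs` and `write_pairs` helpers are hypothetical and not part of the corpus tooling.

```python
import io


def make_pairs(conversation):
    """Pair each utterance with the one that follows it."""
    pairs = []
    for query, response in zip(conversation, conversation[1:]):
        # Collapse tabs and newlines, which would break the one-pair-per-line format
        query = " ".join(query.split())
        response = " ".join(response.split())
        if query and response:
            pairs.append((query, response))
    return pairs


def write_pairs(conversations, file):
    """Write one tab-separated query/response pair per line."""
    for conversation in conversations:
        for query, response in make_pairs(conversation):
            file.write(f"{query}\t{response}\n")


# Toy data standing in for the real corpus
conversations = [
    ["Hi.", "Hello, how are you?", "Fine,\tthanks."],
    ["Where are we going?", "Home."],
]
buffer = io.StringIO()  # an open text file would work the same way
write_pairs(conversations, buffer)
formatted_lines = buffer.getvalue().splitlines()
```

Each line of `formatted_lines` can then be split on the tab character to recover the query and the response.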

In [None]:
# pairs = []

# for conv_id in corpus.conversations:
#     conv = corpus.get_conversation(conv_id)
#     utt_ids = conv.get_utterance_ids()

#     num_utt = len(utt_ids)
#     range_end = num_utt - 2 if num_utt % 2 != 0 else num_utt - 1

#     for i in range(range_end):
#         query = corpus.get_utterance(utt_ids[i]).text.strip()
#         response = corpus.get_utterance(utt_ids[i + 1]).text.strip()
#         if query and response:
#             pairs.append(f"{query}|-->|{response}")

# print(f"Number of pairs: {len(pairs)}")

# # filename = os.path.join(corpus_name, "formatted_movie_lines.txt")

# # with open(filename, "w", encoding="utf-8") as file:
# #     for pair in pairs:
# #         file.write(pair + "\n")

The cleaning step below does three things:

- Convert all letters to lowercase.
- Trim all non-letter characters except for basic punctuation.
- Filter out sentences longer than `MAX_LENGTH`.

In [None]:
# import re


# def clean_text(text):
#     text = text.strip().lower()
#     text = re.sub(r"[^a-z0-9.!?]+", r" ", text)
#     text = re.sub(r"([.!?])", r" \1 ", text)
#     text = re.sub(r"\s+", " ", text).strip()
#     return text


# MAX_LENGTH = 30
# proc_pairs = []

# for pair in pairs:
#     query, response = pair.strip().split("|-->|")
#     clean_query = clean_text(query)
#     clean_response = clean_text(response)
#     # Compare the number of words, not the number of characters, with MAX_LENGTH
#     if len(clean_query.split()) <= MAX_LENGTH and len(clean_response.split()) <= MAX_LENGTH:
#         proc_pairs.append(f"{clean_query}|-->|{clean_response}")

# print(f"Number of pairs: {len(proc_pairs)}")

Create a vocabulary, removing words below a certain count threshold.

### Data Loading and Trimming

The next step involves creating a vocabulary and loading query/response sentence pairs into memory.

Keep in mind that we are working with sequences of **words**, which do not inherently map to a discrete numerical space. Therefore, we need to create such a mapping by associating each unique word we encounter in our dataset with an index value.

To achieve this, we define a `Vocabulary` class, which maintains a mapping from words to indexes, a reverse mapping from indexes to words, a count of each word, and a total word count. The class offers methods for adding a word to the vocabulary (`add_word`), adding all words in a sentence (`add_sentence`), and trimming infrequently seen words (`trim`). We will discuss trimming in more detail later.
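A minimal sketch of what such a class could look like is given here. The attribute names (`word2index`, `word2count`, `index2word`, `num_words`) and the index values reserved for the special tokens are assumptions made for illustration.

```python
class Vocabulary:
    """Word <-> index mapping with per-word counts (illustrative sketch)."""

    PAD, SOS, EOS = 0, 1, 2  # indexes reserved for the special tokens (assumed)

    def __init__(self):
        self.trimmed = False
        self._reset()

    def _reset(self):
        self.word2index = {}
        self.word2count = {}
        self.index2word = {self.PAD: "<pad>", self.SOS: "<sos>", self.EOS: "<eos>"}
        self.num_words = 3  # the special tokens are always present

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    def add_sentence(self, sentence):
        for word in sentence.split():
            self.add_word(word)

    def trim(self, min_count):
        """Drop words seen fewer than min_count times and rebuild the indexes."""
        if self.trimmed:
            return
        self.trimmed = True
        keep = [w for w, c in self.word2count.items() if c >= min_count]
        self._reset()  # note: this also resets the counts of the kept words
        for word in keep:
            self.add_word(word)


voc = Vocabulary()
voc.add_sentence("how are you ?")
voc.add_sentence("how is she ?")
voc.trim(min_count=2)  # only "how" and "?" occur twice
```

After trimming, the remaining words are re-indexed contiguously right after the special tokens.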

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer

# sentences = [
#     sentence.strip() for pair in proc_pairs for sentence in pair.strip().split("|-->|")
# ]

# MIN_COUNT = 5

# # Initialize CountVectorizer
# vectorizer = CountVectorizer(
#     min_df=MIN_COUNT, tokenizer=lambda txt: txt.strip().split(" ")
# )
# X = vectorizer.fit_transform(sentences)

# # Get the feature names to build a vocab dictionary
# vocab = vectorizer.get_feature_names_out()

# print(f"Vocabulary size: {len(vocab)}")
# # print(f"Encoded sentence example: {encoded_sentences[0]}")

In [None]:
# keep_pairs = []
# for pair in proc_pairs:
#     query, response = pair.strip().split("|-->|")

#     keep_query = True
#     keep_response = True
#     # Check input sentence
#     for word in query.strip().split(" "):
#         if word not in vocab:
#             keep_query = False
#             break
#     # Check output sentence
#     for word in response.strip().split(" "):
#         if word not in vocab:
#             keep_response = False
#             break

#     if keep_query and keep_response:
#         keep_pairs.append(pair)
# print(f"Number of pairs: {len(keep_pairs)}")

Add the PAD, SOS, and EOS tokens.

In [None]:
# # Mapping words to indices (+3 offset for special tokens)
# word_to_idx = {word: i + 3 for i, word in enumerate(vocab)}

# # Adding special tokens to the dictionary
# word_to_idx["<PAD>"] = 0  # PAD
# word_to_idx["<SOS>"] = 1  # SOS
# word_to_idx["<EOS>"] = 2  # EOS


# # Example of encoding a sentence with SOS, EOS, and converting to indices
# def encode_sentence(sentence, word_to_idx, max_len=MAX_LENGTH):
#     # Tokenize the sentence
#     tokens = sentence.split()

#     # Add SOS and EOS tokens
#     tokens = ["<SOS>"] + tokens + ["<EOS>"]

#     # Convert tokens to indices
#     indices = [word_to_idx.get(token, word_to_idx["<EOS>"]) for token in tokens]

#     # Pad the sequence to max_len
#     padded_sequence = indices + [word_to_idx["<PAD>"]] * (max_len - len(indices))
#     return padded_sequence[:max_len]


# sentences = [
#     sentence.strip() for pair in keep_pairs for sentence in pair.strip().split("|-->|")
# ]
# # Example: encoding the first sentence
# # max_len = max(len(s.split()) for s in sentences) + 2  # +2 for SOS and EOS tokens
# encoded_sentences = [encode_sentence(s, word_to_idx) for s in sentences]

In [None]:
# print(f"Encoded sentence example: {encoded_sentences[10000]}")

In [None]:
# idx_to_word = {idx: word for word, idx in word_to_idx.items()}


# def decode_sentence(encoded_sentence, idx_to_word):
#     # Convert indices back to tokens, ignoring special tokens for padding, start, and end
#     tokens = [
#         idx_to_word.get(idx)
#         for idx in encoded_sentence
#         if idx in idx_to_word and idx > 2
#     ]

#     # Join the tokens back into a single string
#     sentence = " ".join(tokens)
#     return sentence


# # Example: decoding the first encoded sentence
# decoded_sentences = [
#     decode_sentence(encoded, idx_to_word) for encoded in encoded_sentences
# ]

In [None]:
# print(f"Decoded sentence example: {decoded_sentences[0:20]}")

We can now compile our vocabulary and query/response sentence pairs. However, before we can utilize this data, we need to carry out some preprocessing steps.

First, we transform the Unicode strings into ASCII using `unicodeToAscii`. Next, we convert all characters to lowercase and remove all non-letter characters except basic punctuation (`normalize_string`). Finally, to facilitate training convergence, we exclude sentences longer than the `MAX_LENGTH` threshold.
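These three preprocessing steps could be sketched as follows. The helper names `unicode_to_ascii` and `short_enough` are stand-ins chosen for this example, the regular expressions are one possible choice, and `MAX_LENGTH` is assumed to count words.

```python
import re
import unicodedata

MAX_LENGTH = 30  # maximum number of words per sentence (assumed)


def unicode_to_ascii(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def normalize_string(text):
    """Lowercase, separate basic punctuation, and drop all other non-letters."""
    text = unicode_to_ascii(text.lower().strip())
    text = re.sub(r"([.!?])", r" \1", text)  # put a space before . ! ?
    text = re.sub(r"[^a-z.!?]+", " ", text)  # keep only letters and . ! ?
    return re.sub(r"\s+", " ", text).strip()


def short_enough(sentence, max_length=MAX_LENGTH):
    """True if the sentence has at most max_length space-separated words."""
    return len(sentence.split()) <= max_length


example = normalize_string("Déjà vu...   Aren't you COLD?!")
```

Here `example` is a lowercase ASCII string in which every punctuation mark is its own space-separated token.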

Another tactic that helps training converge faster is trimming rarely used words out of our vocabulary. Shrinking the feature space also simplifies the function that the model must learn to approximate. We do this as a two-step process:

1) Trim words used fewer than ``MIN_COUNT`` times using the ``voc.trim`` function.

2) Filter out pairs with trimmed words.
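The two steps can be illustrated on a toy list of pairs. This sketch uses a plain `collections.Counter` rather than a vocabulary object, and the `pairs` data and the `MIN_COUNT` value are made up for the example.

```python
from collections import Counter

MIN_COUNT = 2  # hypothetical threshold for this toy example

pairs = [
    ("how are you ?", "fine ."),
    ("how is she ?", "fine ."),
    ("who are you ?", "nobody ."),
]

# Step 1: count words over all sentences and keep the frequent ones
counts = Counter(
    word for pair in pairs for sentence in pair for word in sentence.split()
)
kept_words = {word for word, count in counts.items() if count >= MIN_COUNT}

# Step 2: keep only the pairs whose query and response contain no trimmed word
kept_pairs = [
    (query, response)
    for query, response in pairs
    if all(word in kept_words for word in (query + " " + response).split())
]
```

Only the first pair survives, because the other two contain words that occur once.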




### Data Preparation for Models

Despite our extensive efforts to curate and process our data into a convenient vocabulary object and list of sentence pairs, our models will ultimately require numerical torch tensors as inputs.  

To accommodate sentences of different sizes in the same batch, we create our batched input tensor with shape (max_length, batch_size), where sentences shorter than max_length are zero-padded after an EOS_token.

If we simply convert our English sentences to tensors by converting words to their indexes and zero-pad, our tensor would have shape (batch_size, max_length) and indexing the first dimension would return a full sequence across all time-steps. However, we need to be able to index our batch along time, and across all sequences in the batch. Therefore, we transpose our input batch shape to (max_length, batch_size), so that indexing across the first dimension returns a time step across all sentences in the batch.
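A small pure-Python sketch of the padding and transposition, assuming index 0 for padding and index 2 for the end-of-sentence token; the `zero_padding` helper is illustrative. `itertools.zip_longest` pads and transposes in one pass.

```python
from itertools import zip_longest

PAD_token, EOS_token = 0, 2  # index values assumed for this sketch


def zero_padding(index_batch, fill_value=PAD_token):
    """Pad to the longest sequence and transpose to (max_length, batch_size)."""
    return [list(step) for step in zip_longest(*index_batch, fillvalue=fill_value)]


# Three already-indexed sentences, each ending in EOS_token
batch = [[5, 6, 7, EOS_token], [8, EOS_token], [9, 10, EOS_token]]
padded = zero_padding(batch)
```

With this layout, `padded[t]` holds time step `t` for every sentence in the batch.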

The output function also returns a binary mask tensor and a maximum target sentence length. The binary mask tensor has the same shape as the output target tensor, but every element that is a PAD_token is 0 and all others are 1.
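A minimal sketch of such a mask, assuming the padding index is 0 and that the padded target batch is a nested list of shape (max_length, batch_size); the `binary_mask` helper is hypothetical.

```python
PAD_token = 0  # index value assumed for this sketch


def binary_mask(padded_batch, pad_value=PAD_token):
    """Same shape as the input: 0 at padding positions and 1 elsewhere."""
    return [[0 if token == pad_value else 1 for token in step] for step in padded_batch]


# A padded target batch of shape (max_length, batch_size) = (4, 3)
padded_targets = [[5, 8, 9], [6, 2, 10], [7, 0, 2], [2, 0, 0]]
mask = binary_mask(padded_targets)
max_target_len = len(padded_targets)
```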

`batch_to_train_data` simply takes a bunch of pairs and returns the input and target tensors using the aforementioned functions.

## Training Procedure Definition

### Loss with Masking

Given that we're working with batches of padded sequences, we can't compute loss using all tensor elements. We establish `mask_nll_loss` to compute our loss based on the decoder's output tensor, the target tensor, and a binary mask tensor that indicates the padding of the target tensor. This loss function computes the average negative log likelihood of the elements that align with a *1* in the mask tensor.
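One way such a loss could be written is sketched here for a single decoder time step. The argument names and shapes are assumptions, and the decoder output is assumed to already contain probabilities (after a softmax).

```python
import torch


def mask_nll_loss(probs, target, mask):
    """Average negative log likelihood over the positions where mask is True.

    probs:  (batch_size, vocab_size) probabilities from the decoder
    target: (batch_size,) integer class indexes
    mask:   (batch_size,) booleans, False where the target is padding
    """
    # Pick the probability assigned to the correct class of each element
    picked = torch.gather(probs, 1, target.view(-1, 1).long()).squeeze(1)
    nll = -torch.log(picked)
    # Padding positions are dropped before averaging
    return nll.masked_select(mask).mean()


probs = torch.tensor([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])
target = torch.tensor([0, 1, 2])
mask = torch.tensor([True, True, False])  # the third element is padding
loss = mask_nll_loss(probs, target, mask)
```

In this example, only the first two elements contribute, so the loss is the mean of -log(0.5) and -log(0.8).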



### Single Training Iteration Procedure

The `train` function encapsulates the process for a single training iteration (a single batch of inputs).

We employ two strategies to aid convergence:

-  **Teacher forcing**: With a probability set by `teacher_forcing_ratio`, we use the current target word as the decoder’s next input instead of the decoder’s current guess. This makes training more efficient but can cause instability during inference. Hence, the `teacher_forcing_ratio` must be set carefully.

-  **Gradient clipping**: This technique counters the "exploding gradient" problem by capping gradients to a maximum value, preventing them from growing exponentially and causing overflow or overshooting steep cost function cliffs.

**Procedure:**

   1) Pass the entire input batch through the encoder.
   2) Initialize decoder inputs as SOS_token, and hidden state as the encoder's final hidden state.
   3) Pass the input batch sequence through the decoder one time step at a time.
   4) If teacher forcing: set next decoder input as the current target; else: set next decoder input as current decoder output.
   5) Calculate and accumulate loss.
   6) Perform backpropagation.
   7) Clip gradients.
   8) Update encoder and decoder model parameters.
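The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the training code of this guide: `TinyEncoder`, `TinyDecoder`, the `SOS_token` value, and the default `clip` value are made up for the example, and clipping is done with `torch.nn.utils.clip_grad_norm_`, which rescales the gradients so that their overall norm does not exceed `clip`.

```python
import random

import torch
import torch.nn as nn

SOS_token = 1  # index value assumed for this sketch


class TinyEncoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, x):  # x: (max_length, batch_size)
        _, state = self.lstm(self.embedding(x))
        return state  # (hidden, cell)


class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state):  # x: (1, batch_size)
        output, state = self.lstm(self.embedding(x), state)
        return torch.log_softmax(self.out(output.squeeze(0)), dim=-1), state


def train(inputs, targets, mask, encoder, decoder, optimizer,
          teacher_forcing_ratio=0.5, clip=1.0):
    optimizer.zero_grad()
    state = encoder(inputs)  # 1) encode the whole input batch
    # 2) the decoder starts from SOS_token and the encoder's final state
    decoder_input = torch.full((1, inputs.size(1)), SOS_token, dtype=torch.long)
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    loss, n_tokens = 0.0, 0
    for t in range(targets.size(0)):  # 3) one time step at a time
        log_probs, state = decoder(decoder_input, state)
        if use_teacher_forcing:  # 4) next input: target or own guess
            decoder_input = targets[t].unsqueeze(0)
        else:
            decoder_input = log_probs.argmax(dim=-1).unsqueeze(0).detach()
        # 5) accumulate the masked negative log likelihood
        step_nll = -log_probs.gather(1, targets[t].unsqueeze(1)).squeeze(1)
        loss = loss + (step_nll * mask[t]).sum()
        n_tokens += int(mask[t].sum())

    loss = loss / n_tokens
    loss.backward()  # 6) backpropagation
    params = list(encoder.parameters()) + list(decoder.parameters())
    torch.nn.utils.clip_grad_norm_(params, clip)  # 7) clip gradients
    optimizer.step()  # 8) update parameters
    return loss.item()


torch.manual_seed(0)
random.seed(0)
encoder, decoder = TinyEncoder(10, 8), TinyDecoder(10, 8)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.01
)
inputs = torch.randint(3, 10, (5, 2))  # (max_length, batch_size)
targets = torch.randint(3, 10, (4, 2))
mask = torch.ones(4, 2)  # no padding in this toy batch
loss_value = train(inputs, targets, mask, encoder, decoder, optimizer)
```

In this sketch, the teacher forcing decision is drawn once per iteration; drawing it once per time step is another possible design.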

Note: PyTorch’s RNN modules (`RNN`, `LSTM`, `GRU`) can be used like any non-recurrent layer by passing them the entire input sequence. We use the `LSTM` layer like this in the `encoder`. However, you can also run these modules one time step at a time, as we do for the `decoder` model.
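The two ways of calling a recurrent module can be compared on a small example. The layer sizes and the random input below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
sequence = torch.randn(2, 5, 4)  # (batch_size, time_steps, features)

# Whole sequence in a single call
full_outputs, (full_hidden, full_cell) = lstm(sequence)

# One time step at a time, carrying the state forward
state = None  # None means a zero initial state
step_outputs = []
for t in range(sequence.size(1)):
    output, state = lstm(sequence[:, t : t + 1, :], state)
    step_outputs.append(output)
step_outputs = torch.cat(step_outputs, dim=1)
```

Both approaches produce the same outputs and final state up to floating-point error; the step-by-step form is needed when each input depends on the previous output, as in the decoder.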