In this guide, we explore an application of recurrent sequence-to-sequence models: training a simple chatbot on film scripts from the Cornell Movie-Dialogs Corpus.

Conversational models are an active area of AI research. Chatbots are commonly used in scenarios such as customer service platforms and online help desks. These bots typically rely on retrieval-based models that return pre-set responses to specific types of questions. While such models may be adequate for narrow domains like a company's IT helpdesk, they lack the robustness required for broader applications. The recent surge in deep learning, most visibly represented by ChatGPT, has led to powerful multi-domain generative conversational models. In this guide, we will build a small generative model of this kind using the tools we have learned so far.

To begin, download the dialog dataset from

https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

and put it in a `data/` directory under the current directory.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import csv
import re
import os
import unicodedata
from io import open
import json
import deeplay as dl

# Check for CUDA availability for PyTorch
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

# Mount Google Drive if using Google Colab
#from google.colab import drive
#drive.mount('/content/gdrive')
#os.chdir("/content/gdrive/My Drive")



## Data Loading and Preprocessing

This step involves reorganizing our data file and loading the data into formats that are manageable.

The [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) is a comprehensive dataset of dialogues from movie characters:

- It contains 220,579 conversational exchanges between 10,292 pairs of movie characters.
- It features 9,035 characters from 617 movies.
- It has a total of 304,713 utterances.

This dataset is vast and varied, with a wide range of language formality, time periods, sentiment, etc. We anticipate that this diversity will make our model capable of handling a variety of inputs and queries.

Initially, we will examine some lines from our data file to understand the original format.

In [2]:
# Set the corpus name
corpus_name = "movie-corpus"

# Function to print first 'n' lines from a file
def print_lines(file, n=10):
    with open(file, 'r', encoding='utf-8') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line.strip())

print_lines(os.path.join(corpus_name, "utterances.jsonl"))

{"id": "L1045", "conversation_id": "L1044", "text": "They do not!", "speaker": "u0", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "not", "tag": "RB", "dep": "neg", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": "L1044", "timestamp": null, "vectors": []}
{"id": "L1044", "conversation_id": "L1044", "text": "They do to!", "speaker": "u2", "meta": {"movie_id": "m0", "parsed": [{"rt": 1, "toks": [{"tok": "They", "tag": "PRP", "dep": "nsubj", "up": 1, "dn": []}, {"tok": "do", "tag": "VBP", "dep": "ROOT", "dn": [0, 2, 3]}, {"tok": "to", "tag": "TO", "dep": "dobj", "up": 1, "dn": []}, {"tok": "!", "tag": ".", "dep": "punct", "up": 1, "dn": []}]}]}, "reply-to": null, "timestamp": null, "vectors": []}
{"id": "L985", "conversation_id": "L984", "text": "I hope so.", "speaker": "u0", "meta": {"movie_id": 

### Generate Formatted Data File

For ease of use, we will generate a well-structured data file where each line comprises a tab-separated pair of a *query sentence* and a *response sentence*.

The functions below parse the raw `utterances.jsonl` data file.

- `load_data` turns each line of the file into a dictionary with the fields `lineID`, `characterID`, and `text`, and then groups these dictionaries into conversations with the fields `conversationID`, `movieID`, and `lines`.
- `extract_pairs` pulls out pairs of consecutive sentences from the conversations.
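
The grouping and pairing described above can be illustrated on a toy example. This is a minimal sketch, not the guide's own loader: the three JSON records are made up for illustration (they only mimic the `id`, `conversation_id`, `text`, and `speaker` fields of the real file), and `toy_pairs` is a hypothetical name.

```python
import json

# Three made-up utterances in the same JSONL shape as utterances.jsonl,
# all belonging to one conversation (illustrative data, not from the corpus).
raw_lines = [
    '{"id": "L1", "conversation_id": "L1", "text": "Hi.", "speaker": "u0"}',
    '{"id": "L2", "conversation_id": "L1", "text": "Hello.", "speaker": "u1"}',
    '{"id": "L3", "conversation_id": "L1", "text": "How are you?", "speaker": "u0"}',
]

# Group utterance texts by conversation, preserving the order of the input.
conversations = {}
for raw in raw_lines:
    record = json.loads(raw)
    conversations.setdefault(record["conversation_id"], []).append(record["text"])

# Pair each utterance with the one that follows it in the same conversation.
toy_pairs = [
    (query, response)
    for texts in conversations.values()
    for query, response in zip(texts, texts[1:])
]

# Print each pair as a tab-separated line, like the formatted data file.
for query, response in toy_pairs:
    print(f"{query}\t{response}")
```

A conversation with three utterances yields two overlapping pairs, because every utterance except the last serves as a query and every utterance except the first serves as a response.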

In [3]:
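# Load lines and conversations.
# Note: lines are appended in file order. In the sample printed above, the reply
# (L1045) appears before the utterance it answers (L1044), so the order within a
# conversation follows the file and may be reverse chronological.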
def load_data(file_name):
    lines, conversations = {}, {}
    with open(file_name, 'r', encoding='iso-8859-1') as file:
        for line in file:
            data = json.loads(line)
            lines[data["id"]] = {"lineID": data["id"], "characterID": data["speaker"], "text": data["text"]}
            conv_id = data["conversation_id"]
            if conv_id not in conversations:
                conversations[conv_id] = {
                    "conversationID": conv_id,
                    "movieID": data["meta"]["movie_id"],
                    "lines": [lines[data["id"]]]
                }
            else:
                conversations[conv_id]["lines"].append(lines[data["id"]])
    return lines, conversations

# Extract sentence pairs from conversations
def extract_pairs(conversations):
    return [[input_line["text"].strip(), target_line["text"].strip()]
            for conversation in conversations.values()
            for input_line, target_line in zip(conversation["lines"], conversation["lines"][1:])
            if input_line["text"].strip() and target_line["text"].strip()]
    
# Processing and writing to file
def process_and_write(corpus_name):
    # Define path to new file
    datafile = os.path.join(corpus_name, "formatted_movie_lines.txt")
    delimiter = '\t'  # Tab character, no need to unescape

    # Load lines and conversations
    print("\nProcessing corpus into lines and conversations...")
    lines, conversations = load_data(os.path.join(corpus_name, "utterances.jsonl"))

    # Write new csv file
    print("\nWriting newly formatted file...")
    with open(datafile, 'w', encoding='utf-8') as outputfile:
        writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
        for pair in extract_pairs(conversations):
            writer.writerow(pair)

    # Print the first few lines of the new file as a sample
    print("\nSample lines from file:")
    with open(datafile, 'r', encoding='utf-8') as file:
        for _ in range(5):  # Print first 5 lines as a sample
            print(file.readline().strip())
    return datafile

datafile=process_and_write(corpus_name)



Processing corpus into lines and conversations...

Writing newly formatted file...

Sample lines from file:
They do not!	They do to!
I hope so.	She okay?
Let's go.	Wow
Okay -- you're gonna need to learn how to lie.	No
No	"I'm kidding.  You know how sometimes you just become this ""persona""?  And you don't know how to quit?"


### Data Loading and Trimming

The next step involves creating a vocabulary and loading query/response sentence pairs into memory.

Keep in mind that we are working with sequences of **words**, which do not inherently map to a discrete numerical space. Therefore, we need to create such a mapping by associating each unique word we encounter in our dataset with an index value.

To achieve this, we define a `Vocabulary` class, which maintains a mapping from words to indexes, a reverse mapping from indexes to words, a count of each word, and a total word count. The class offers methods for adding a word to the vocabulary (`add_word`), adding all words in a sentence (`add_sentence`), and trimming infrequently seen words (`trim`). We will discuss trimming in more detail later.

In [4]:
# Default word tokens
PAD_TOKEN = 0  # Used for padding short sentences
SOS_TOKEN = 1  # Start-of-sentence token
EOS_TOKEN = 2  # End-of-sentence token

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word_to_index = {"PAD":PAD_TOKEN,"SOS":SOS_TOKEN, "EOS":EOS_TOKEN}
        self.word_to_count = {}
        self.index_to_word = {PAD_TOKEN: "PAD", SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def add_sentence(self, sentence):
        for word in sentence.split():
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word_to_index:
            self.word_to_index[word] = self.num_words
            self.word_to_count[word] = 1
            self.index_to_word[self.num_words] = word
            self.num_words += 1
        else:
            self.word_to_count[word] += 1

    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = [word for word, count in self.word_to_count.items() if count >= min_count]

        # Save the counts for the kept words
        keep_word_counts = {word: self.word_to_count[word] for word in keep_words}

        # Reinitialize dictionaries
        self.word_to_index, self.word_to_count, self.index_to_word = {"PAD":PAD_TOKEN,"SOS":SOS_TOKEN, "EOS":EOS_TOKEN}, {}, {PAD_TOKEN: "PAD", SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}
        self.num_words = 3  # Reset to count default tokens

        for word in keep_words:
            self.word_to_index[word] = self.num_words
            self.word_to_count[word] = keep_word_counts[word]
            self.index_to_word[self.num_words] = word
            self.num_words += 1

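        # Provide a report on the trimming process.
        # Note: the dictionaries were rebuilt above, so len(self.index_to_word) - 3
        # already equals len(keep_words) and the reported ratio is always 1.0000.
        # To report the true fraction, record the vocabulary size before the reset.
        print(f'keep_words {len(keep_words)} / {len(self.index_to_word) - 3} = '
              f'{len(keep_words) / (len(self.index_to_word) - 3):.4f}')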

# Helper function to add multiple sentences to a Vocabulary instance
def add_sentences_to_vocabulary(vocab, sentences):
    for sentence in sentences:
        vocab.add_sentence(sentence)




We can now compile our vocabulary and query/response sentence pairs. However, before we can utilize this data, we need to carry out some preprocessing steps.

First, `normalize_string` decomposes accented Unicode characters (NFD normalization), converts all characters to lowercase, and replaces every character that is not a letter or basic punctuation (`.`, `!`, `?`) with a space. Then, to help training converge, `load_prepare_data` drops every pair in which a sentence has `MAX_LENGTH` or more words.

In [5]:
MAX_LENGTH = 10  # Keep only sentences with fewer than this many words

# Normalize a raw string into lowercase ASCII words and basic punctuation
def normalize_string(s):
    """
    Normalize the string: Convert to ASCII, make lowercase, strip leading/trailing whitespace,
    separate punctuation with spaces, and remove non-letter characters. 
    """  
    s = unicodedata.normalize('NFD', s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", re.sub(r"[^a-zA-Z.!?]+", r" ", s))
    s = re.sub(r"\s+", r" ", s).strip()
    
    return s

def load_prepare_data(corpus_name, datafile):
    """
    Load and prepare data: Open the datafile, split into lines, normalize,
    filter by length, create a vocabulary, and add each sentence to it.
    """
    with open(datafile, encoding='utf-8') as file:
        lines = [line.split('\t') for line in file.read().strip().split('\n')]
    pairs = [[normalize_string(s) for s in pair] for pair in lines]
    pairs = [pair for pair in pairs if all(len(s.split()) < MAX_LENGTH for s in pair)]
    voc = Vocabulary(corpus_name)
    for pair in pairs:
        for s in pair:
            voc.add_sentence(s)
    return voc, pairs

# Load/Assemble voc and pairs
save_dir = os.path.join("data", "save")
voc, pairs = load_prepare_data(corpus_name, datafile)
# Print some pairs to validate
print("\npairs:")
for pair in pairs[:10]:
    print(pair)


pairs:
['they do not !', 'they do to !']
['i hope so .', 'she okay ?']
['let s go .', 'wow']
['like my fear of wearing pastels ?', 'the real you .']
['the real you .', 'what good stuff ?']
['what crap ?', 'do you listen to this crap ?']
['you always been this selfish ?', 'but']
['but', 'then that s all you had to say .']
['then that s all you had to say .', 'well no . . .']
['tons', 'have fun tonight ?']
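
To see what the normalization does to a single string, here is a standalone sketch. `normalize_demo` is a hypothetical helper written for this illustration; it is meant to apply the same NFD decomposition and regular expressions, in the same order, as `normalize_string`.

```python
import re
import unicodedata

def normalize_demo(text):
    """Lowercase, decompose accents, keep letters and . ! ?, and space out punctuation."""
    text = unicodedata.normalize("NFD", text.lower().strip())
    text = re.sub(r"[^a-zA-Z.!?]+", " ", text)   # everything else becomes a space
    text = re.sub(r"([.!?])", r" \1", text)      # detach sentence punctuation
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

print(normalize_demo("Let's go."))       # let s go .
print(normalize_demo("  She okay?  "))   # she okay ?
```

The apostrophe in "Let's" is replaced by a space, which is why contractions appear as two tokens (`let s`) in the pairs printed above.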


Another tactic that helps training converge faster is trimming rarely used words from the vocabulary. A smaller vocabulary shrinks the feature space and simplifies the function that the model must learn to approximate. We do this in two steps:

1) Trim words that appear fewer than `MIN_COUNT` times using the `voc.trim` function.

2) Filter out pairs that contain trimmed words.
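
The two steps can be sketched on a toy dataset with the standard library alone. The pairs and the threshold `MIN_COUNT_DEMO` are made up for illustration, and `collections.Counter` stands in for the word counts kept by the vocabulary object.

```python
from collections import Counter

MIN_COUNT_DEMO = 2  # hypothetical threshold for this toy example

toy_pairs = [
    ["hello there", "hi"],
    ["hello again", "hi there"],
    ["goodbye", "hi"],
]

# Step 1: count every word and keep those seen at least MIN_COUNT_DEMO times.
counts = Counter(
    word for pair in toy_pairs for sentence in pair for word in sentence.split()
)
kept_words = {word for word, count in counts.items() if count >= MIN_COUNT_DEMO}

# Step 2: keep only the pairs whose query and response use kept words exclusively.
kept_pairs = [
    pair for pair in toy_pairs
    if all(word in kept_words for sentence in pair for word in sentence.split())
]

print(sorted(kept_words))  # ['hello', 'hi', 'there']
print(kept_pairs)          # [['hello there', 'hi']]
```

A single rare word is enough to discard a whole pair, so the fraction of pairs lost can be much larger than the fraction of words trimmed.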




In [6]:
MIN_COUNT = 3    # Minimum word count threshold for trimming

def trim_rare_words(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sentence
        for word in input_sentence.split(' '):
            if word not in voc.word_to_index:
                keep_input = False
                break
        # Check output sentence
        for word in output_sentence.split(' '):
            if word not in voc.word_to_index:
                keep_output = False
                break

        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)

    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)))
    return keep_pairs


# Trim voc and pairs
pairs = trim_rare_words(voc, pairs, MIN_COUNT)

keep_words 7832 / 7832 = 1.0000
Trimmed from 64259 pairs to 53074, 0.8259 of total


### Data Preparation for Models

Despite our extensive efforts to curate and process our data into a convenient vocabulary object and list of sentence pairs, our models will ultimately require numerical torch tensors as inputs.  

To accommodate sentences of different lengths in the same batch, we build the batched input tensor with shape (max_length, batch_size), where sentences shorter than max_length are zero-padded after the `EOS_TOKEN`.

If we simply converted our English sentences to tensors by mapping words to their indexes and zero-padding, the tensor would have shape (batch_size, max_length), and indexing the first dimension would return a full sequence across all time steps. However, we need to index our batch along time, across all sequences in the batch. Therefore, we transpose the input batch to shape (max_length, batch_size), so that indexing the first dimension returns one time step across all sentences in the batch.

For the targets, `batch_to_train_data` also returns a binary mask tensor and the maximum target sentence length. The binary mask tensor has the same shape as the padded target tensor; every element that is a `PAD_TOKEN` is 0 (`False`) and all others are 1 (`True`).

`batch_to_train_data` takes a list of pairs and returns the padded input tensor, the input lengths, the padded target tensor, the mask, and the maximum target length.
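
The padding, transposition, and masking described above can be reproduced with plain Python lists. This is a minimal sketch with made-up word indexes; `PAD` and `EOS` are local stand-ins for `PAD_TOKEN` and `EOS_TOKEN`, and nested lists stand in for tensors.

```python
from itertools import zip_longest

PAD, EOS = 0, 2  # same index values as PAD_TOKEN and EOS_TOKEN in this guide

# Three already-indexed sentences of different lengths (made-up word indexes).
sequences = [[5, 6, 7], [8, 9], [4]]
sequences = [seq + [EOS] for seq in sequences]

# zip_longest pads the short sequences and transposes in one pass:
# each row of `padded` is one time step across the whole batch.
padded = [list(step) for step in zip_longest(*sequences, fillvalue=PAD)]

# The mask is True wherever the padded batch holds a real token.
mask = [[token != PAD for token in step] for step in padded]

for step in padded:
    print(step)
```

Here `padded` has four rows (the longest sequence including its end token) and three columns (the batch size), which is the (max_length, batch_size) layout.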

In [7]:
import itertools
import random

# PAD_TOKEN and EOS_TOKEN are the integer indexes defined alongside the Vocabulary class

def indexes_from_sentence(vocabulary, sentence):
    """
    Convert sentence to a list of indexes, appending the EOS token at the end.
    """
    return [vocabulary.word_to_index[word] for word in sentence.split(' ')] + [EOS_TOKEN]

def binary_matrix(l, value=PAD_TOKEN):
    """
    Create a binary matrix representing the padding of sentences.
    """
    return [[0 if token == value else 1 for token in seq] for seq in l]

def batch_to_train_data(vocabulary, pair_batch):
    """
    Prepare the batch for training: sort by input length, create tensors for input/target variables.
    """
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = zip(*pair_batch)
    input_indexes = [indexes_from_sentence(vocabulary, sentence) for sentence in input_batch]
    input_lengths = torch.tensor([len(indexes) for indexes in input_indexes])
    input_padded = torch.LongTensor(list(itertools.zip_longest(*input_indexes, fillvalue=PAD_TOKEN)))

    output_indexes = [indexes_from_sentence(vocabulary, sentence) for sentence in output_batch]
    output_padded = torch.LongTensor(list(itertools.zip_longest(*output_indexes, fillvalue=PAD_TOKEN)))
    output_mask = torch.BoolTensor(binary_matrix(output_padded))
    max_target_len = max(len(indexes) for indexes in output_indexes)

    return input_padded, input_lengths, output_padded, output_mask, max_target_len

# Example for validation
small_batch_size = 5
batches = batch_to_train_data(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)


input_variable: tensor([[  24,    8,   24,    8, 1255],
        [ 495,  203, 4291,   53, 7267],
        [3199,    5,   11,   11,   11],
        [ 468, 7014,    2,    2,    2],
        [ 105,   11,    0,    0,    0],
        [  24,    2,    0,    0,    0],
        [ 466,    0,    0,    0,    0],
        [  11,    0,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
lengths: tensor([9, 6, 4, 4, 4])
target_variable: tensor([[  67,    8,   27,    8,   76],
        [  16,  203, 3313,  263,   90],
        [  64, 7010,  300,   81,   81],
        [ 595,   26,   14,   11,   14],
        [1380,   77,    2,    2,    2],
        [  11,  592,    0,    0,    0],
        [   2,  331,    0,    0,    0],
        [   0, 7014,    0,    0,    0],
        [   0,   11,    0,    0,    0],
        [   0,    2,    0,    0,    0]])
mask: tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [ True,  True,  True,  

## Training Procedure Definition

### Loss with Masking

Given that we're working with batches of padded sequences, we can't compute loss using all tensor elements. We establish `mask_nll_loss` to compute our loss based on the decoder's output tensor, the target tensor, and a binary mask tensor that indicates the padding of the target tensor. This loss function computes the average negative log likelihood of the elements that align with a *1* in the mask tensor.



In [9]:
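def mask_nll_loss(inp, target, mask, device):
    """
    Calculate the average negative log likelihood over the non-padded target elements.

    `inp` holds the decoder's output probabilities for one time step, with shape
    (batch_size, vocabulary_size); `target` and `mask` have shape (batch_size,).
    """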
    n_total = mask.sum()
    cross_entropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
    loss = cross_entropy.masked_select(mask).mean()
    loss = loss.to(device)
    return loss, n_total.item()
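
A small worked example of the masked loss may help. This sketch uses plain Python instead of tensors, and all probabilities, targets, and mask values are made up for illustration.

```python
import math

# One decoding step for a batch of three sequences and a four-word vocabulary.
# Each row holds the decoder's output probabilities (made-up numbers).
probs = [
    [0.1, 0.2, 0.6, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
targets = [2, 0, 0]          # correct word index for each sequence
mask = [True, True, False]   # the third sequence is already padding at this step

# Negative log likelihood of the target word, kept only where the mask is True.
nll = [-math.log(row[t]) for row, t in zip(probs, targets)]
kept = [value for value, keep in zip(nll, mask) if keep]
masked_loss = sum(kept) / len(kept)

print(round(masked_loss, 4))
```

Only the first two sequences contribute, so the result is the mean of -log(0.6) and -log(0.7); the padded third position has no influence on the loss.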

### Single Training Iteration Procedure

The `train` function encapsulates the process for a single training iteration (a single batch of inputs).

We employ two strategies to aid convergence:

-  **Teacher forcing**: With probability `teacher_forcing_ratio`, we feed the current target word to the decoder as its next input instead of the decoder’s current guess. This makes training more efficient, but it can make the model unstable at inference time, because the decoder has had little practice recovering from its own mistakes. Hence, the `teacher_forcing_ratio` must be set carefully.

-  **Gradient clipping**: This technique counters the "exploding gradient" problem by rescaling gradients whenever their norm exceeds a threshold (`clip`), which prevents them from growing exponentially and causing overflow or overshooting steep cliffs in the cost function.
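
To make gradient clipping concrete, here is a minimal sketch of clipping by global norm in plain Python. `clip_by_global_norm` is a hypothetical helper written for this illustration; the small epsilon in the denominator is an assumption added to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=5.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 3))
```

The direction of the gradient is preserved and only its length shrinks; gradients whose norm is already below the threshold pass through unchanged.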

**Procedure:**

   1) Pass the entire input batch through the encoder.
   2) Initialize the decoder input as `SOS_TOKEN` and the decoder hidden state as the encoder's final hidden state.
   3) Pass the input batch sequence through the decoder one time step at a time.
   4) If teacher forcing: set next decoder input as the current target; else: set next decoder input as current decoder output.
   5) Calculate and accumulate loss.
   6) Perform backpropagation.
   7) Clip gradients.
   8) Update encoder and decoder model parameters.
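
The difference that step 4 makes to the decoder's inputs can be sketched without a neural network. In this illustration `decoder_inputs` is a hypothetical helper, and the decoder's guesses are held fixed for simplicity; in a real decoder each guess would depend on what was fed in at the previous step.

```python
import random

SOS = 1  # same index as SOS_TOKEN in this guide

def decoder_inputs(targets, guesses, teacher_forcing_ratio, rng):
    """Return the tokens fed to the decoder at each step of one sequence."""
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    fed = [SOS]  # step 0 always starts from the start-of-sentence token
    for target, guess in zip(targets[:-1], guesses[:-1]):
        fed.append(target if use_teacher_forcing else guess)
    return fed

targets = [7, 8, 9, 2]   # ground-truth response (made-up indexes, ends in EOS)
guesses = [7, 5, 9, 2]   # what a hypothetical decoder predicted at each step

print(decoder_inputs(targets, guesses, 1.0, random.Random(0)))  # [1, 7, 8, 9]
print(decoder_inputs(targets, guesses, 0.0, random.Random(0)))  # [1, 7, 5, 9]
```

With a ratio of 1.0 the wrong guess at the second step (5 instead of 8) is never fed back, whereas with a ratio of 0.0 the decoder has to continue from its own mistake.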

Note: PyTorch’s RNN modules (`RNN`, `LSTM`, `GRU`) can be used like any other non-recurrent layers by passing them the entire input sequence. We use the `GRU` layer like this in the `encoder`. However, you can also run these modules one time-step at a time, as we do for the `decoder` model.

In [10]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for RNN packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_TOKEN for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    # (teacher_forcing_ratio is read from the module-level configuration)
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(decoder_output, target_variable[t], mask[t],device)
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(decoder_output, target_variable[t], mask[t],device)
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals

### Training Iterations

Now we can integrate the complete training procedure with the data. The `train_iters` function runs `n_iteration` iterations of training using the provided models, optimizers, data, etc. Most of the complex work is handled by the `train` function.

It's important to note that when we save our model, we store a checkpoint file (written with `torch.save` and given a `.tar` extension) that includes the encoder and decoder `state_dicts` (parameters), the optimizers' `state_dicts`, the loss, the iteration, the vocabulary, and the embedding parameters. Saving the model in this way provides maximum flexibility with the checkpoint. After loading a checkpoint, we can either use the model parameters to run inference or continue training from where we left off.

In [11]:
def train_iters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding,
                encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every,
                clip, corpus_name):
    """
    Run training for a set number of iterations.
    """
    # Load batches for each iteration
    training_batches = [batch_to_train_data(voc, [random.choice(pairs) for _ in range(batch_size)])
                        for _ in range(n_iteration)]

    print('Initializing ...')
    start_iteration = 1
    print_loss_total = 0  # Reset every print_every

    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss_total += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print(f"Iteration: {iteration}; Percent complete: {iteration / n_iteration * 100:.1f}%; Average loss: {print_loss_avg:.4f}")
            print_loss_total = 0

        # Save checkpoint
        if iteration % save_every == 0:
            # hidden_features is read from the module-level configuration
            directory = os.path.join(save_dir, model_name, corpus_name, f'{encoder_n_layers}-{decoder_n_layers}_{hidden_features}')
            os.makedirs(directory, exist_ok=True)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, f'{iteration}_checkpoint.tar'))

## Evaluation

After training the model, we want to interact with the bot. We need to define how the model decodes the encoded input.

### Greedy Decoding

Greedy decoding is used during training when we're not using teacher forcing. At each time step, we choose the word from `decoder_output` with the highest softmax value. This method is optimal at a single time-step level.

We define a `GreedySearchDecoder` class to perform greedy decoding. The input sentence is evaluated as follows:

**Computation Steps:**

   1) Pass input through the encoder model.
   2) Prepare the encoder's final hidden layer to be the first hidden input to the decoder.
   3) Initialize the decoder's first input as `SOS_TOKEN`.
   4) Initialize tensors to append decoded words to.
   5) Iteratively decode one word token at a time:
       a) Pass through the decoder.
       b) Obtain the most likely word token and its softmax score.
       c) Record the token and score.
       d) Prepare the current token to be the next decoder input.
   6) Return collections of word tokens and scores.

In [12]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        """
        Greedy decoding module initialization.
        """
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        """
        Forward propagation of the input to produce a sequence of tokens.
        """
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)

        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]

        # Initialize decoder input with SOS_token
        decoder_input = torch.tensor([[SOS_TOKEN]], device=device, dtype=torch.long)

        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros(0, dtype=torch.long, device=device)
        all_scores = torch.zeros(0, device=device)

        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)

            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)

            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)

            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)

        # Return collections of word tokens and scores
        return all_tokens, all_scores
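
The greedy procedure can also be sketched without a neural network. In this illustration a hand-written lookup table, `next_token_probs`, plays the role of the trained decoder; the table and its probabilities are made up, and `greedy_decode` is a hypothetical helper.

```python
SOS, EOS = 1, 2  # same index values as SOS_TOKEN and EOS_TOKEN in this guide

# Hypothetical stand-in for a trained decoder: for each previous token it
# returns a probability distribution over the next token (made-up numbers).
next_token_probs = {
    SOS: {5: 0.6, 6: 0.4},
    5: {6: 0.7, EOS: 0.3},
    6: {EOS: 0.9, 5: 0.1},
    EOS: {EOS: 1.0},
}

def greedy_decode(max_length):
    """Pick the most probable next token at every step, starting from SOS."""
    tokens, scores = [], []
    current = SOS
    for _ in range(max_length):
        distribution = next_token_probs[current]
        current = max(distribution, key=distribution.get)
        tokens.append(current)
        scores.append(distribution[current])
    return tokens, scores

tokens, scores = greedy_decode(max_length=4)
print(tokens)  # [5, 6, 2, 2]
print(scores)  # [0.6, 0.7, 0.9, 1.0]
```

Like `GreedySearchDecoder`, this sketch always runs for `max_length` steps and does not stop at the end-of-sentence token, which is why trailing `EOS` tokens have to be filtered out before printing a response.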

### Text Evaluation

With our decoding method defined, we can create functions to evaluate a string input sentence. The `evaluate` function formats the input sentence as a batch of word indexes with *batch_size==1*: it converts the words to their corresponding indexes and transposes the dimensions to match what our models expect. It also creates a `lengths` tensor containing the length of the input sentence. The decoded response tensor is obtained using our `GreedySearchDecoder` object (`searcher`). Finally, the response’s indexes are converted to words and the list of decoded words is returned.

`evaluate_input` serves as the user interface for our chatbot. It prompts an input text field where we can enter our query sentence. After entering our input sentence and pressing *Enter*, our text is normalized like our training data, and is fed to the `evaluate` function to obtain a decoded output sentence. This process is looped for continuous interaction with our bot until we enter either “q” or “quit”.

If a sentence is entered that contains a word not in the vocabulary, an error message is printed and the user is prompted to enter another sentence.

In [19]:
def evaluate(searcher, voc, sentence, max_length=MAX_LENGTH):
    """
    Evaluate a sentence using the encoder, decoder, and searcher provided.
    """
    # Prepare the input sentence as a batch of word indexes
    indexes_batch = [indexes_from_sentence(voc, sentence)]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1).to(device)
    lengths = lengths.to("cpu")  # Lengths need to be on CPU for pack_padded_sequence

    # Decode the sentence with the searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index_to_word[token.item()] for token in tokens]

    return decoded_words

def evaluate_input(encoder, decoder, searcher, voc):
    """
    Interactively evaluate input from the user.
    """
    while True:
        try:
            input_sentence = input('> ')
            if input_sentence in ('q', 'quit'):
                break
            print(input_sentence)
            input_sentence = normalize_string(input_sentence)
            output_words = evaluate(searcher, voc, input_sentence)
            output_words = [word for word in output_words if word not in ('EOS', 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")
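
The unknown-word case arises because the word-to-index lookup is a plain dictionary access. The following sketch shows the mechanism on a miniature, made-up vocabulary; `toy_word_to_index` and `to_indexes` are hypothetical names used only here.

```python
EOS = 2  # same index as EOS_TOKEN in this guide

# Hypothetical miniature vocabulary; the real one is built from the corpus.
toy_word_to_index = {"PAD": 0, "SOS": 1, "EOS": 2, "hello": 3, "there": 4}

def to_indexes(sentence):
    """Map a normalized sentence to word indexes followed by EOS."""
    return [toy_word_to_index[word] for word in sentence.split(" ")] + [EOS]

for sentence in ["hello there", "hello stranger"]:
    try:
        print(sentence, "->", to_indexes(sentence))
    except KeyError as error:
        print(sentence, "-> unknown word:", error.args[0])
```

The first sentence maps to `[3, 4, 2]`, while the second raises a `KeyError` for `stranger`, which is the exception that `evaluate_input` catches.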


## Models Overview

### Seq2Seq Model

Our chatbot uses a sequence-to-sequence (seq2seq) model, which takes a variable-length sequence as input and returns a variable-length sequence as output. This is achieved by using two separate recurrent neural nets (RNNs): an **encoder** and a **decoder**. The encoder encodes the input sequence into a fixed-length context vector, which theoretically contains semantic information about the input sentence. The decoder takes an input word and the context vector, and returns a guess for the next word in the sequence and a hidden state for the next iteration.

### Encoder

The encoder RNN iterates through the input sentence one token at a time, producing an "output" vector and a "hidden state" vector at each time step. The hidden state vector is passed to the next time step, while the output vector is recorded. The encoder is built around a multi-layer Gated Recurrent Unit (GRU); a bidirectional GRU, which reads the sentence in both directions, is a common variant. An `embedding` layer maps our word indices into a feature space of a chosen size.
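
To make this description concrete, here is a minimal sketch of such an encoder in plain PyTorch, using the bidirectional variant. It is not the `dl.EncoderRNN` implementation: the class name `TinyEncoder`, the toy sizes, and the choice to sum the two directions are assumptions made for illustration. The input is assumed to be time-major, with shape `(max_len, batch)`, and sorted by decreasing length.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class TinyEncoder(nn.Module):
    """Hypothetical illustration: embedding followed by a bidirectional multi-layer GRU."""

    def __init__(self, vocab_size, hidden_features, num_layers=2):
        super().__init__()
        self.hidden_features = hidden_features
        self.embedding = nn.Embedding(vocab_size, hidden_features)
        self.gru = nn.GRU(hidden_features, hidden_features,
                          num_layers=num_layers, bidirectional=True)

    def forward(self, input_seq, lengths):
        # input_seq: (max_len, batch) word indices
        # lengths: CPU tensor of true sentence lengths, sorted in decreasing order
        embedded = self.embedding(input_seq)
        packed = pack_padded_sequence(embedded, lengths)
        outputs, hidden = self.gru(packed)
        outputs, _ = pad_packed_sequence(outputs)
        # Sum the forward and backward directions so outputs keep size hidden_features
        outputs = (outputs[:, :, :self.hidden_features]
                   + outputs[:, :, self.hidden_features:])
        return outputs, hidden


torch.manual_seed(0)
encoder_sketch = TinyEncoder(vocab_size=20, hidden_features=8)
batch = torch.tensor([[4, 5], [6, 7], [8, 0]])  # two sentences, the second padded with 0
lengths = torch.tensor([3, 2])
outputs, hidden = encoder_sketch(batch, lengths)
print(outputs.shape)  # (max_len, batch, hidden_features)
print(hidden.shape)   # (num_layers * 2 directions, batch, hidden_features)
```

Packing the padded batch lets the GRU skip the padding positions, which is why the lengths have to be supplied alongside the word indices.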

### Decoder

The decoder RNN generates the response sentence token by token. It uses the encoder's context vector and its own hidden state to generate the next word in the sequence. To avoid information loss, especially with long input sequences, an "attention mechanism" is typically used, which lets the decoder focus on specific parts of the input sequence rather than relying on the same fixed context at every step. However, in this tutorial we only use standard RNNs without attention, which limits the performance of our model.
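
The token-by-token generation can be sketched as follows. This is an illustrative plain-PyTorch decoder without attention, not the `dl.DecoderRNN` or `GreedySearchDecoder` used in this notebook. The names `TinyDecoder` and `greedy_decode`, the toy sizes, and the special-token indices are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOS_TOKEN, EOS_TOKEN = 1, 2  # assumed special-token indices for this sketch


class TinyDecoder(nn.Module):
    """Hypothetical illustration: one GRU step maps the previous word to next-word probabilities."""

    def __init__(self, vocab_size, hidden_features, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_features)
        self.gru = nn.GRU(hidden_features, hidden_features, num_layers=num_layers)
        self.out = nn.Linear(hidden_features, vocab_size)

    def forward(self, input_step, hidden):
        # input_step: (1, batch) indices of the previously generated word
        embedded = self.embedding(input_step)
        output, hidden = self.gru(embedded, hidden)
        probs = F.softmax(self.out(output.squeeze(0)), dim=1)
        return probs, hidden


def greedy_decode(decoder, hidden, max_length):
    """Pick the most probable word at every step, stopping early at EOS."""
    decoder_input = torch.tensor([[SOS_TOKEN]])
    tokens = []
    with torch.no_grad():
        for _ in range(max_length):
            probs, hidden = decoder(decoder_input, hidden)
            next_token = probs.argmax(dim=1)
            tokens.append(next_token.item())
            if tokens[-1] == EOS_TOKEN:
                break
            # The chosen word becomes the input of the next step
            decoder_input = next_token.unsqueeze(0)
    return tokens


torch.manual_seed(0)
decoder_sketch = TinyDecoder(vocab_size=20, hidden_features=8)
context = torch.zeros(2, 1, 8)  # stand-in for the encoder's final hidden state
decoded = greedy_decode(decoder_sketch, context, max_length=10)
print(decoded)
```

Because the decoder only sees the encoder through its initial hidden state, everything it knows about the input must fit into that fixed-size tensor, which is the limitation attention is meant to relieve.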

In [14]:
# Set configuration parameters for the model
model_name = 'cb_model'
hidden_features = 500
in_features = hidden_features
encoder_n_layers = 2
decoder_n_layers = 2
rnn_type = 'GRU'
dropout = 0.1
batch_size = 64

# Set training and optimization parameters
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 1
save_every = 500

embedding = nn.Embedding(voc.num_words, hidden_features)
# Initialize the EncoderRNN
encoder = dl.EncoderRNN(
    in_features=in_features, 
    hidden_features=hidden_features,
    num_layers=encoder_n_layers, 
    embedding=embedding,
    rnn_type=rnn_type, 
    dropout=dropout
).to(device)


encoder.build()
print(encoder)

decoder = dl.DecoderRNN(
    in_features=in_features, 
    hidden_features=hidden_features,
    out_features=voc.num_words,
    num_layers=decoder_n_layers, 
    embedding=embedding,
    rnn_type=rnn_type, 
    dropout=dropout
).to(device)


decoder.build()
print(decoder)


EncoderRNN(
  (blocks): LayerList(
    (0): GRU(500, 500, num_layers=2, dropout=0.1)
  )
  (rnn): GRU(500, 500, num_layers=2, dropout=0.1)
  (embedding): Embedding(7835, 500)
)
DecoderRNN(
  (blocks): LayerList(
    (0): GRU(500, 500, num_layers=2, dropout=0.1)
  )
  (rnn): GRU(500, 500, num_layers=2, dropout=0.1)
  (embedding): Embedding(7835, 500)
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (out): Linear(in_features=500, out_features=7835, bias=True)
)


## Define and train model

Now, we're ready to run our model!

Whether we're training or testing the chatbot model, we need initialized encoder and decoder models, which the previous block configured and constructed. In the next block, we set them to training mode, create the optimizers, and start training. You're encouraged to experiment with different model configurations to enhance performance.


In [15]:

# Set models to training mode
encoder.train()
decoder.train()

# Initialize optimizers for the encoder and decoder
encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

# Move optimizer states to GPU if necessary
if torch.cuda.is_available():
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

# Begin the training process
train_iters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name)

Initializing ...
Training...
Iteration: 1; Percent complete: 0.0%; Average loss: 8.9828
Iteration: 2; Percent complete: 0.1%; Average loss: 8.8544
Iteration: 3; Percent complete: 0.1%; Average loss: 8.7097
Iteration: 4; Percent complete: 0.1%; Average loss: 8.5482
Iteration: 5; Percent complete: 0.1%; Average loss: 8.3108
Iteration: 6; Percent complete: 0.1%; Average loss: 7.9890
Iteration: 7; Percent complete: 0.2%; Average loss: 7.6131
Iteration: 8; Percent complete: 0.2%; Average loss: 7.0363
Iteration: 9; Percent complete: 0.2%; Average loss: 6.5861
Iteration: 10; Percent complete: 0.2%; Average loss: 6.5028
Iteration: 11; Percent complete: 0.3%; Average loss: 6.2936
Iteration: 12; Percent complete: 0.3%; Average loss: 6.0369
Iteration: 13; Percent complete: 0.3%; Average loss: 5.8092
Iteration: 14; Percent complete: 0.4%; Average loss: 5.4931
Iteration: 15; Percent complete: 0.4%; Average loss: 5.1215
Iteration: 16; Percent complete: 0.4%; Average loss: 5.3872
Iteration: 17; Perce

In [20]:
# Set dropout layers to ``eval`` mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (type 'q' or 'quit' to stop)
evaluate_input(encoder, decoder, searcher, voc)

hello
Bot: hello ? ! ! ! !
how are you?
Bot: you re right . in there ?
I sure am
Bot: you re a cop . s holding .
how did you know?
Bot: i didn t know . s dead .
alright
Bot: i m sorry . in the car ?
what?
Bot: you know what i mean . ?
she's in the car?
Bot: i m going to take a while . ?
goodbye
Bot: i m sorry . in the car ?


The following cells extend the chatbot so that both training and evaluation take the conversation history into account. The changes are summarized below.

### Changes for Training with Conversation History

- **`create_history_pairs` function:** Creates training pairs whose input includes the history of the conversation. Each input is built from up to `MAX_HISTORY` preceding utterances, joined by the `EOS` token. These pairs are then used for training.
- **Batch preparation (`batch_to_train_data`):** The function for preparing a batch of training data remains largely the same. The only difference is that it now handles input sequences that include conversation history.
- **Training function (`train`):** The core training logic remains unchanged. The function receives input sequences that now contain conversation history, and the forward pass through the encoder and the decoding steps need no specific changes to accommodate it.
- **Data preparation for training:** `history_pairs` is generated with the `create_history_pairs` function and then used throughout the training process.
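
As an illustration of how such history-augmented pairs can look, here is a small self-contained sketch on toy data. The helper `history_pairs_sketch` and the toy dialogue are hypothetical. The sketch keeps the current utterance at the end of each input, which is an assumption of this example.

```python
def history_pairs_sketch(pairs, max_history=2):
    """Hypothetical helper: prefix each input with up to max_history earlier inputs."""
    result = []
    for i, (_, response) in enumerate(pairs):
        start = max(0, i - max_history)
        # Earlier inputs plus the current one, separated by the EOS marker
        context = ' EOS '.join(query for query, _ in pairs[start:i + 1])
        result.append([context, response])
    return result


toy_pairs = [
    ["hello", "hi there"],
    ["how are you", "fine"],
    ["where is she", "in the car"],
    ["why", "i don t know"],
]
for context, response in history_pairs_sketch(toy_pairs):
    print(f"{context!r} -> {response!r}")
```

With `max_history=2`, the last input becomes `'how are you EOS where is she EOS why'`: the oldest utterance has been dropped, and the response is still paired with the utterance it answers.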

### Changes for Evaluation with Conversation History

- **`evaluate` function:** The conversation history is now considered when evaluating a new input. The history is concatenated with the current input sentence, separated by spaces. The concatenated string is then processed and fed into the model to generate a response.
- **`evaluate_input` function:** Manages interactive evaluation with the user. It maintains a `conversation_history` list, appending each user input and the model's response to it, and passes the accumulated conversation history to the `evaluate` function for each new input.
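
One simple way to keep such a history bounded is a fixed-size queue of turns. The sketch below uses `collections.deque` and is only an illustration. The notebook itself keeps a plain list and slices it, and `MAX_TURNS` and the sample turns are hypothetical.

```python
from collections import deque

MAX_TURNS = 3  # hypothetical cap on the number of stored turns

history = deque(maxlen=MAX_TURNS)
for turn in ["hello", "hi there", "how are you", "fine", "goodbye"]:
    history.append(turn)  # once full, the oldest turn is dropped automatically

# Join the remaining turns into a single model input
model_input = ' EOS '.join(history)
print(model_input)  # how are you EOS fine EOS goodbye
```

Bounding the history by turns rather than by tokens keeps every stored utterance intact, at the cost of a less predictable input length.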

### Training and Evaluation Process

The model is trained with input sequences that include conversation history, allowing it to learn the context of the conversation.
During evaluation, the model uses the accumulated conversation history to generate more context-aware responses.

### General Setup

Model, optimizer, and training configurations are set up as before, now with a larger model (more layers and more hidden features).
The training process (`train_iters`) follows the standard approach but uses the modified input pairs with conversation history.

### Important Notes

The effectiveness of including conversation history in the model depends on the depth of context the model can understand and how well it can handle longer input sequences.
Fine-tuning and experimenting with the MAX_HISTORY parameter and the MAX_LENGTH of sequences may be necessary to achieve optimal performance.
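
To get a feeling for how quickly the encoder input grows with the amount of history, the following sketch computes a simple upper bound on the number of input tokens. The helper `max_input_tokens` is hypothetical, and it assumes every utterance has at most `max_words_per_utterance` words, with one `EOS` separator between utterances and one final `EOS` token.

```python
def max_input_tokens(n_utterances, max_words_per_utterance):
    """Upper bound on encoder input length when utterances are joined with EOS separators."""
    if n_utterances == 0:
        return 1  # only the final EOS token
    words = n_utterances * max_words_per_utterance
    separators = n_utterances - 1
    return words + separators + 1  # +1 for the final EOS token


for n in (1, 3, 5):
    print(n, max_input_tokens(n, max_words_per_utterance=10))
# 1 11
# 3 33
# 5 55
```

Under these assumptions the input can be several times longer than a single sentence, which a model without attention has to compress into the same fixed-size hidden state.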

In [27]:
import itertools
import random

MAX_LENGTH = 10  # Maximum length of a single sentence, adjust as necessary
MAX_HISTORY = 5  # Number of previous exchanges to include in the history

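def create_history_pairs(pairs, max_history=MAX_HISTORY):
    """Prefix each input with up to `max_history` earlier inputs, separated by EOS."""
    history_pairs = []
    for i in range(len(pairs)):
        # Include the current input (index i) so the response is paired with the
        # utterance it answers, not only with the preceding context
        start = max(0, i - max_history)
        dialogue_history = ' EOS '.join(pairs[j][0] for j in range(start, i + 1))
        history_pairs.append([dialogue_history, pairs[i][1]])
    return history_pairs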

def zero_padding(l, fillvalue=0):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binary_matrix(l, value=0):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

def output_var(l, voc):
    indexes_batch = [indexes_from_sentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    pad_list = zero_padding(indexes_batch)
    mask = binary_matrix(pad_list)
    mask = torch.BoolTensor(mask)
    pad_var = torch.LongTensor(pad_list)
    return pad_var, mask, max_target_len

def indexes_from_sentence(voc, sentence):
    # split() without arguments also collapses repeated whitespace
    return [voc.word_to_index[word] for word in sentence.split()] + [EOS_TOKEN]

def batch_to_train_data(voc, pair_batch):
    pair_batch.sort(key=lambda p: len(p[0].split()), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = input_var(input_batch, voc)
    output, mask, max_target_len = output_var(output_batch, voc)
    return inp, lengths, output, mask, max_target_len

# Input_var function needs to calculate the correct lengths for packed sequences
def input_var(l, voc):
    indexes_batch = [indexes_from_sentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    pad_list = zero_padding(indexes_batch)
    pad_var = torch.LongTensor(pad_list)
    return pad_var, lengths

# Prepare the data with history
history_pairs = create_history_pairs(pairs)

# Modify the training function to handle the dialogue history in the input sequences
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)
    # Lengths for RNN packing should always be on the CPU
    lengths = lengths.to("cpu")

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_TOKEN for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(decoder_output, target_variable[t], mask[t], device)
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = mask_nll_loss(decoder_output, target_variable[t], mask[t], device)
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropagation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals

def train_iters(model_name, voc, history_pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding,
                encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, save_every,
                clip, corpus_name):
    """
    Run training for a set number of iterations.
    """
    # Load batches for each iteration
    training_batches = [batch_to_train_data(voc, [random.choice(history_pairs) for _ in range(batch_size)])
                        for _ in range(n_iteration)]

    print('Initializing ...')
    start_iteration = 1
    print_loss_total = 0  # Reset every print_every

    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss_total += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print(f"Iteration: {iteration}; Percent complete: {iteration / n_iteration * 100:.1f}%; Average loss: {print_loss_avg:.4f}")
            print_loss_total = 0

        # Save checkpoint
        if iteration % save_every == 0:
            directory = os.path.join(save_dir, model_name, corpus_name, f'{encoder_n_layers}-{decoder_n_layers}_{hidden_features}')
            os.makedirs(directory, exist_ok=True)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, f'{iteration}_checkpoint.tar'))

def evaluate(searcher, voc, conversation_history, max_length=MAX_LENGTH):
    """
    Evaluate a conversation history using the encoder, decoder, and searcher provided.
    """
    # Join the conversation history into a single input and normalize
    input_sentence = ' '.join(conversation_history)
    input_sentence = normalize_string(input_sentence)

    # Prepare the input sentence as a batch of word indexes
    indexes_batch = [indexes_from_sentence(voc, input_sentence)]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1).to(device)
    lengths = lengths.to("cpu")  # Lengths need to be on CPU for pack_padded_sequence

    # Decode the sentence with the searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index_to_word[token.item()] for token in tokens]

    return decoded_words


def evaluate_input(searcher, voc):
    """
    Interactively evaluate input from the user, considering the entire conversation history.
    """
    conversation_history = []
    while True:
        try:
            input_sentence = input('> ')
            if input_sentence in ('q', 'quit'):
                break

            # Normalize and add the user's input to the conversation history
            input_sentence = normalize_string(input_sentence)
            conversation_history.append(input_sentence)

            # Evaluate the conversation history
            output_words = evaluate(searcher, voc, conversation_history)
            output_words = [word for word in output_words if word not in ('EOS', 'PAD')]
            print('Bot:', ' '.join(output_words))

            # Add the bot's response to the conversation history
            conversation_history.extend(output_words)
            conversation_history = conversation_history[-MAX_HISTORY * MAX_LENGTH:]

        except KeyError:
            print("Error: Encountered unknown word.")

# Set configuration parameters for the model
model_name = 'cb_model'
hidden_features = 2000
in_features = hidden_features
encoder_n_layers = 4
decoder_n_layers = 4
dropout = 0.1
batch_size = 64

# Set training and optimization parameters
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 16000
print_every = 1
save_every = 8000

# Initialize word embeddings and encoder & decoder models
embedding = nn.Embedding(voc.num_words, hidden_features)
# Initialize the EncoderRNN
encoder = dl.EncoderRNN(
    in_features=in_features, 
    hidden_features=hidden_features,
    num_layers=encoder_n_layers, 
    embedding=embedding,
    rnn_type=rnn_type, 
    dropout=dropout
).to(device)

decoder = dl.DecoderRNN(
    in_features=in_features, 
    hidden_features=hidden_features,
    out_features=voc.num_words,
    num_layers=decoder_n_layers, 
    embedding=embedding,
    rnn_type=rnn_type, 
    dropout=dropout
).to(device)

encoder.build()
decoder.build()
# Set models to training mode
encoder.train()
decoder.train()

# Initialize optimizers for the encoder and decoder
encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

# Move optimizer states to GPU if necessary
if torch.cuda.is_available():
    for state in encoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()

    for state in decoder_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cuda()


In [28]:

# Begin the training process
train_iters(model_name, voc, history_pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name)

Initializing ...
Training...
Iteration: 1; Percent complete: 0.0%; Average loss: 8.9714
Iteration: 2; Percent complete: 0.0%; Average loss: 8.5552
Iteration: 3; Percent complete: 0.0%; Average loss: 8.3447


KeyboardInterrupt: 

In [29]:

def evaluate(searcher, voc, conversation_history, max_length=MAX_LENGTH):
    """
    Evaluate a conversation history using the encoder, decoder, and searcher provided.
    """
    # Join the conversation history into a single input and normalize
    input_sentence = ' '.join(conversation_history)
    input_sentence = normalize_string(input_sentence)
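    # normalize_string lowercases the text, so restore the EOS separator token;
    # the word boundaries avoid touching words that merely contain "eos"
    input_sentence = re.sub(r'\beos\b', 'EOS', input_sentence)
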
    # Prepare the input sentence as a batch of word indexes
    indexes_batch = [indexes_from_sentence(voc, input_sentence)]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1).to(device)
    lengths = lengths.to("cpu")  # Lengths need to be on CPU for pack_padded_sequence

    # Decode the sentence with the searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    decoded_words = [voc.index_to_word[token.item()] for token in tokens]

    return decoded_words

def evaluate_input(searcher, voc):
    """
    Interactively evaluate input from the user, considering the entire conversation history.
    """
    conversation_history = []
    while True:
        try:
            input_sentence = input('> ')
            if input_sentence in ('q', 'quit'):
                break

            # Normalize and add the user's input to the conversation history
            print(input_sentence)
            input_sentence = normalize_string(input_sentence)
            conversation_history.append(input_sentence)
            conversation_history.append("EOS")


            # Evaluate the conversation history
            output_words = evaluate(searcher, voc, conversation_history)
            print('Bot:', ' '.join([word for word in output_words if word not in ('EOS', 'PAD')]))

            # Add the bot's response to the conversation history
            conversation_history.extend(output_words)
            conversation_history = conversation_history[-MAX_HISTORY * MAX_LENGTH:]

        except KeyError:
            print("Error: Encountered unknown word.")

# Set dropout layers to ``eval`` mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (type 'q' or 'quit' to stop)
evaluate_input(searcher, voc)

hello
Bot: . . . . . . . . . .
alright
Bot: . . . . . . . . . .
what?
Bot: . . . . . . . . . .
