<a href="https://colab.research.google.com/github/Jiaweihu08/Chatbot/blob/master/TPU_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a chit-chat chatbot using a seq2seq model with attention

##### The chatbot built in this notebook is an **encoder-decoder based seq2seq model**. Both the encoder and the decoder are **RNN** models with two layers of **GRU** cells. The RNN layers in the encoder are **bidirectional**, and the decoder uses **Luong's attention** to improve performance. The encoder and the decoder share the same **embedding** layer for their inputs.

The model is trained on a combination of the following conversational datasets:
- **DailyDialogues**
- **ConvAI**
- **EmpatheticDialogues**
- **Persona Chat**
- **Cornell Movie's Dataset**

The datasets are combined in a separate notebook. The combined dataset has **220426** utterances in total and is split into training and evaluation sets, with **10,000** instances held out for evaluation.

The model is trained with **TPUs** available on Google Colab. **Beam-Search** is used for inference.

In [None]:
import tensorflow as tf

import json
import os
import time
import re
from random import choice

Set up to use TPU for model training.

In [None]:
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

INFO:tensorflow:Initializing the TPU system: grpc://10.67.39.154:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.


All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')]


In [None]:
strategy = tf.distribute.TPUStrategy(resolver)

INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


Defining parameters:
- The maximum sentence length is 14, and the vocabulary size is chosen so that words appearing fewer than 3 times in the entire dataset are dropped. Both values are fixed when the dataset is created.

- The global batch size is 64: 8 examples per replica across the 8 TPU cores used for distributed training.
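
The batch-size arithmetic can be checked with a few lines of plain Python. This is only a sketch: `NUM_REPLICAS` is a stand-in for `strategy.num_replicas_in_sync`, and the helper name `steps_per_epoch` is introduced here for illustration. The instance counts follow from the 220426 utterances with 10,000 held out for evaluation.

```python
# Values taken from the notebook: 8 examples per TPU core, 8 cores.
BATCH_SIZE_PER_REPLICA = 8
NUM_REPLICAS = 8  # stand-in for strategy.num_replicas_in_sync


def steps_per_epoch(num_examples, global_batch_size):
    """Number of full batches when the incomplete final batch is dropped."""
    return num_examples // global_batch_size


global_batch_size = BATCH_SIZE_PER_REPLICA * NUM_REPLICAS
train_steps = steps_per_epoch(220426 - 10000, global_batch_size)
eval_steps = steps_per_epoch(10000, global_batch_size)

print(global_batch_size, train_steps, eval_steps)  # 64 3287 156
```

Because the final partial batch is dropped, a few dozen training examples are skipped in each pass; shuffling changes which ones they are from epoch to epoch.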

### Loading the dataset and the tokenizer

In [None]:
MAX_LEN = 14
BUFFER_SIZE = 150000
VOCAB_SIZE = 13199 # Eliminating words that appear less than 3 times

BATCH_SIZE_PER_REPLICA = 8
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
# strategy.num_replicas_in_sync ==> 8

root_path = '/content/drive/MyDrive/Colab Notebooks/Chatbots/version-2'
path_to_datasets = os.path.join(root_path, 'datasets')

The following code block defines the functions that load the dataset and the tokenizer. Both were built and saved in a separate notebook.

The loaded data is wrapped in TensorFlow's `tf.data.Dataset` object, which is a prerequisite for TPU usage.
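
Before reading the loaders, it may help to see the two conventions they rely on in isolation: each stored line holds a message and a response separated by ` _+++_ `, and sequences are truncated and zero-padded at the end ('post'). The sketch below uses plain Python only; `split_pair` and `pad_post` are hypothetical helpers, and the example line and token ids are made up for illustration.

```python
BREAKER = ' _+++_ '  # separator used by load_dataset in this notebook
PAD_ID = 0           # pad_sequences pads with 0 by default


def split_pair(line, breaker=BREAKER):
    """Split one stored line into (message, response)."""
    message, response = line.split(breaker)
    return message, response


def pad_post(ids, max_len, pad_id=PAD_ID):
    """Truncate at the end, then right-pad with pad_id up to max_len."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


# Hypothetical example line and token ids, for illustration only.
line = '<start> how are you ? <end> _+++_ <start> fine , thanks . <end>'
message, response = split_pair(line)
short = pad_post([1, 7, 8, 9, 2], max_len=8)
long = pad_post(range(1, 20), max_len=8)

print(message)
print(short)  # [1, 7, 8, 9, 2, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6, 7, 8]
```

Note that truncating at the end can cut off a closing `<end>` token on long sentences, which is one reason the maximum length is fixed when the dataset is created.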

In [None]:
def load_dataset(path_to_dataset):
    with open(path_to_dataset) as f:
        lines = f.read().strip().split('\n')

    messages = []
    responses = []
    breaker = ' _+++_ '
    for line in lines:
        m, r = line.split(breaker)
        messages.append(m)
        responses.append(r)
    
    print(f'- number of instances: {len(messages)}')

    return messages, responses


def load_tokenizer():
    tokenizer_dir = os.path.join(root_path, 'tokenizer.json')
    with open(tokenizer_dir) as f:
        data = json.load(f)
        tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(data)

    return tokenizer


def get_tensors(message, response, tokenizer, max_len):
    m_tensor = tokenizer.texts_to_sequences(message)
    m_tensor = tf.keras.preprocessing.sequence.pad_sequences(
        m_tensor,
        maxlen=max_len,
        padding='post',
        truncating='post')
    
    r_tensor = tokenizer.texts_to_sequences(response)
    r_tensor = tf.keras.preprocessing.sequence.pad_sequences(
        r_tensor,
        maxlen=max_len,
        padding='post',
        truncating='post')

    return m_tensor, r_tensor


def get_dataset(path_to_dataset, tokenizer, max_len=MAX_LEN,
                buffer_size=BUFFER_SIZE, batch_size=GLOBAL_BATCH_SIZE):
    
    messages, responses = load_dataset(path_to_dataset)
    steps_per_epoch = len(messages) // batch_size

    message_tensor, response_tensor = get_tensors(messages, responses, tokenizer, max_len)
    
    print(f'- tensor shape: {message_tensor.shape}')
    print(f'- steps per epoch: {steps_per_epoch}')

    dataset = tf.data.Dataset.from_tensor_slices((message_tensor, response_tensor))
    dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)

    return dataset, steps_per_epoch


In [None]:
train_utters_path = os.path.join(path_to_datasets, 'train_utters.txt')
eval_utters_path = os.path.join(path_to_datasets, 'eval_utters.txt')

tokenizer = load_tokenizer()

print("Loading training data...")
train_set, steps_per_epoch = get_dataset(train_utters_path, tokenizer)

print("\nLoading evaluation data...")
eval_set, steps_per_epoch_eval = get_dataset(eval_utters_path, tokenizer, batch_size=64)


train_dist_dataset = strategy.experimental_distribute_dataset(train_set)
eval_dist_dataset = strategy.experimental_distribute_dataset(eval_set)

Loading training data...
- number of instances: 210426
- tensor shape: (210426, 14)
- steps per epoch: 3287

Loading evaluation data...
- number of instances: 10000
- tensor shape: (10000, 14)
- steps per epoch: 156


### Model training

Brief description of the model:
- The model is an **encoder-decoder** based RNN model.
- The encoder is in charge of **encoding** the input sequence. It passes the encoded sequence, together with its **last hidden states**, to the decoder.
- The encoder has an **embedding** layer that maps the input token ids to their distributed representations (i.e. vectors). This embedding layer is **shared** with the decoder.
- The encoder has two **bidirectional** GRU layers, so it can read the inputs **from both directions** to produce better encodings.
- The encoded sequences from both directions are **concatenated**, and so are the forward and backward hidden states.
- The concatenated hidden states are used to initialize the hidden states of the decoder.
- Since the encoder concatenates the outputs from both directions, the number of **units** in its GRU layers is **half** that of the decoder.
- The encoded sequence from the encoder is passed to the decoder at each time step, and an **attention mechanism** is used for information retrieval. The output of the attention is known as the **context vector**. In the case of **Luong's attention**, this context and the decoder's RNN output for the current step are **concatenated** and passed through dense layers to produce predictions.
- Aside from the encoder outputs, the decoder also takes as input its own prediction from the previous time step (during inference). During training, the correct output from the previous step is used instead. This is known as **teacher forcing**.
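
The attention step described above can be sketched in NumPy to make the shapes concrete. This is a minimal illustration of the multiplicative ("general") score, not the trained layer: `W` is a random stand-in for the learned projection, the bias term of a dense layer is omitted, and `units` is kept small on purpose.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def luong_general_attention(query, values, W):
    """query: (batch, 1, units); values: (batch, max_len, units); W: (units, units)."""
    projected = values @ W                          # (batch, max_len, units)
    scores = query @ projected.transpose(0, 2, 1)   # (batch, 1, max_len)
    weights = softmax(scores, axis=-1)              # (batch, 1, max_len)
    context = weights @ values                      # (batch, 1, units)
    return context, weights


rng = np.random.default_rng(0)
batch, max_len, units = 2, 14, 8  # small units for illustration only
query = rng.normal(size=(batch, 1, units))         # decoder output for one step
values = rng.normal(size=(batch, max_len, units))  # encoder output sequence
W = rng.normal(size=(units, units))                # random stand-in for learned weights

context, weights = luong_general_attention(query, values, W)
print(context.shape, weights.shape)  # (2, 1, 8) (2, 1, 14)
```

Each row of `weights` sums to 1, so the context vector is a weighted average of the encoder outputs for that example.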

In [None]:
from tensorflow.keras.layers import Embedding, GRU, Bidirectional, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy


class Encoder(tf.keras.Model):
    def __init__(self, enc_units, embedding_dim, vocab_size):
        super(Encoder, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        
        self.gru_1 = Bidirectional(
            GRU(enc_units,
                return_sequences=True,
                return_state=True,
                dropout=0.1,
                recurrent_initializer='glorot_uniform'))
        
        self.gru_2 = Bidirectional(
            GRU(enc_units,
                return_sequences=True,
                return_state=True,
                recurrent_initializer='glorot_uniform'))

    def call(self, enc_input):
        # enc_input shape: (batch_size, max_len)
        # x_emb shape: (batch_size, max_len, embedding_dim)
        x_emb = self.embedding(enc_input)

        # sequence shape: (batch_size, max_len, 2 * units)
        # hiddens shape: (batch_size, 2 * units)
        sequence_1, hidden_forward_1, hidden_backward_1 = self.gru_1(x_emb)
        hiddens_1 = tf.concat([hidden_forward_1, hidden_backward_1], axis=-1)

        output_sequence, hidden_forward_2, hidden_backward_2 = self.gru_2(sequence_1)
        hiddens_2 = tf.concat([hidden_forward_2, hidden_backward_2], axis=-1)

        hiddens = [hiddens_1, hiddens_2]

        return output_sequence, hiddens


class LuongAttention(tf.keras.Model):
    def __init__(self, units):
        super(LuongAttention, self).__init__()
        self.W = Dense(units)

    def call(self, attention_inputs):
        # query shape: (batch_size, 1, units)
        # values shape: (batch_size, max_len, units)
        query, values = attention_inputs

        # scores shape: (batch_size, 1, max_len)
        scores = tf.matmul(query, self.W(values), transpose_b=True)

        # attention_weights shape: (batch_size, 1, max_len)
        attention_weights = tf.nn.softmax(scores, axis=-1)

        # context_vector shape: (batch_size, 1, units)
        context_vector = tf.matmul(attention_weights, values)
        
        return context_vector, attention_weights


class Decoder(tf.keras.Model):
    def __init__(self, dec_units, embedding_layer, vocab_size):
        super(Decoder, self).__init__()
        
        self.embedding = embedding_layer
        
        self.gru_1 = GRU(dec_units,
                         return_sequences=True,
                         return_state=True,
                         dropout=0.1,
                         recurrent_initializer='glorot_uniform')

        self.gru_2 = GRU(dec_units,
                         return_sequences=True,
                         return_state=True,
                         recurrent_initializer='glorot_uniform')
        
        self.attention = LuongAttention(dec_units)

        self.fc = Dense(dec_units, activation='tanh')
        self.out = Dense(vocab_size)
        
        
    def call(self, dec_inputs):
        # x shape: (batch_size, 1)
        # hiddens shape: (batch_size, units)
        # enc_outputs shape: (batch_size, max_len, units)
        x, hiddens, enc_output = dec_inputs
        input_hiddens_1, input_hiddens_2 = hiddens

        # x_emb shape: (batch_size, 1, embedding_dim)
        x_emb = self.embedding(x)

        # output sequence shape: (batch_size, 1, units)
        # hiddens shape: (batch_size, units)
        sequence_1, hiddens_1 = self.gru_1(x_emb, initial_state=input_hiddens_1)
        output_sequence, hiddens_2 = self.gru_2(sequence_1, initial_state=input_hiddens_2)

        hiddens = [hiddens_1, hiddens_2]

        # context shape: (batch_size, 1, units)
        # attention_weights shape: (batch_size, 1, max_len)
        attention_inputs = (output_sequence, enc_output)
        context, attention_weights = self.attention(attention_inputs)
        
        # output_sequence shape: (batch_size, 2 * units)
        output_sequence = tf.concat([context, output_sequence], -1)
        output_sequence = tf.squeeze(output_sequence, axis=1)

        # output_sequence shape: (batch_size, units)
        output_sequence = self.fc(output_sequence)

        # logits shape: (batch_size, VOCAB_SIZE)
        logits = self.out(output_sequence)
        
        return logits, hiddens, attention_weights

Defining model, optimizer, loss function, and checkpoint for saving.

This has to be done with the context manager **strategy.scope()** for TPU usage.
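
The loss function in the next cell masks out padding (label id 0) and divides the summed loss by the global batch size rather than by the number of unmasked tokens. The NumPy sketch below reproduces that arithmetic for a single time step; `masked_step_loss` is a hypothetical name and the labels and logits are made up.

```python
import numpy as np


def masked_step_loss(labels, logits, global_batch_size):
    """Cross-entropy for one time step, ignoring positions whose label is 0 (padding)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]
    mask = (labels != 0).astype(per_example.dtype)
    # Sum over examples and divide by the global batch size.
    return (per_example * mask).sum() / global_batch_size


labels = np.array([3, 0, 1, 0])  # two real tokens, two padded positions
logits = np.zeros((4, 5))        # uniform prediction over 5 tokens
loss = masked_step_loss(labels, logits, global_batch_size=4)
print(round(float(loss), 4))  # 0.8047, i.e. 2 * ln(5) / 4
```

Dividing by the global batch size means that summing the per-replica losses across the TPU cores yields the average over the whole batch.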

In [None]:
units = 1024
embedding_dim = 512

checkpoint_dir = os.path.join(root_path, 'training_checkpoints_TPU')
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
local_device_option = tf.train.CheckpointOptions(experimental_io_device="/job:localhost")

with strategy.scope():
    encoder = Encoder(units//2, embedding_dim, VOCAB_SIZE)
    decoder = Decoder(units, encoder.embedding, VOCAB_SIZE)

    optimizer = Adam(learning_rate=0.0001)

    checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                    encoder=encoder,
                                    decoder=decoder)

    loss_object = SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)

    def loss_function(real, pred):
        mask = tf.math.logical_not(tf.math.equal(real, 0))
        loss_ = loss_object(real, pred)

        mask = tf.cast(mask, dtype=loss_.dtype)
        loss_ *= mask

        return tf.nn.compute_average_loss(loss_, global_batch_size=GLOBAL_BATCH_SIZE)

Defining training and evaluation steps:

- As mentioned before, the inputs are first encoded by the encoder, and the encoded sequence is used by the decoder at each time step. The last hidden state from each of the encoder's GRU layers initializes the corresponding decoder hidden state for the first time step.

- The loss is computed at each time step, depending on how much the predictions and the 'labels' differ.

- For gradient calculation, the context manager **tf.GradientTape()** is used to track the operations performed in the forward pass (from input to output).

- **model.fit()** cannot be used here because the decoder is **auto-regressive**: it is called once per time step, and each call depends on the previous step's token and hidden state. The training loop therefore has to be written by hand.

- The decorator @tf.function compiles the function into a TensorFlow graph, which speeds up its execution.
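
To make the teacher-forcing alignment explicit, the sketch below lists which token the decoder is fed and which token it is scored against at each step. It assumes that every response starts with the `<start>` id, which is how the training step seeds the decoder; the helper name and the token ids are hypothetical.

```python
def teacher_forcing_pairs(response_ids):
    """Return (decoder_input, target) pairs for one response.

    At step t the decoder is fed the ground-truth token at t - 1
    and is scored against the ground-truth token at t.
    """
    return [(response_ids[t - 1], response_ids[t])
            for t in range(1, len(response_ids))]


# Hypothetical ids: 1 = <start>, 2 = <end>, 0 = padding.
response = [1, 15, 42, 7, 2, 0, 0]
pairs = teacher_forcing_pairs(response)
for dec_in, target in pairs:
    print(dec_in, '->', target)
```

The last two pairs have padding as their target, so the masked loss gives them zero weight.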


In [None]:
def train_step(m, r):
    loss = 0.0

    with tf.GradientTape() as tape:

        enc_out, hiddens = encoder(m)

        dec_in = (tf.expand_dims([tokenizer.word_index['<start>']] * r.shape[0], 1),
                  hiddens, enc_out)
        
        for t in range(1, r.shape[1]):

            predictions, hiddens, _ = decoder(dec_in)

            loss += loss_function(r[:, t], predictions)

            dec_in = (tf.expand_dims(r[:, t], 1), hiddens, enc_out)
    
    batch_loss = (loss / int(r.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)
    
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss


def eval_step(m, r):
    loss = 0.0

    enc_out, hiddens = encoder(m)

    dec_in = (tf.expand_dims([tokenizer.word_index['<start>']] * r.shape[0], 1),
              hiddens, enc_out)
    
    for t in range(1, r.shape[1]):
        predictions, hiddens, _ = decoder(dec_in)

        loss += loss_function(r[:, t], predictions)

        dec_in = (tf.expand_dims(r[:, t], 1), hiddens, enc_out)

    batch_loss = (loss / int(r.shape[1]))

    return batch_loss


@tf.function
def distributed_train_step(m, r):
    per_replica_losses = strategy.run(train_step, args=(m, r))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                            axis=None)
    

@tf.function
def distributed_eval_step(m, r):
    per_replica_losses = strategy.run(eval_step, args=(m, r))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                           axis=None)

In [None]:
init_epoch = 0
EPOCHS = 20

train_losses = []
eval_losses = []

print('Starting training at {}'.format(time.strftime("%Y_%m_%d-%H_%M_%S")))
print('Entering training loop...')
print('Total Epochs {}'.format(EPOCHS))

for epoch in range(init_epoch, EPOCHS):
    start = time.time()
    
    # Training Iterations
    total_train_loss = 0.0
    num_train_batches = 0
    for (m, r) in train_dist_dataset:
        total_train_loss += distributed_train_step(m, r)
        num_train_batches += 1

        if num_train_batches % 500 == 0:
            print(">>> Iteration: {}/{}, Train Loss: {:.4f}".format(
                num_train_batches,
                steps_per_epoch,
                total_train_loss.numpy() / num_train_batches))

    train_loss = total_train_loss.numpy() / num_train_batches
    train_losses.append(train_loss)

    # Evaluation Iterations
    total_eval_loss= 0.0
    num_eval_batches = 0
    for (m, r) in eval_dist_dataset:
        total_eval_loss += distributed_eval_step(m, r)
        num_eval_batches += 1

    eval_loss = total_eval_loss.numpy() / num_eval_batches
    eval_losses.append(eval_loss)

    # Save every 2 epochs; the parentheses matter because % binds tighter than +.
    if (epoch + 1) % 2 == 0:
        checkpoint.write(checkpoint_prefix, options=local_device_option)


    template = "Epoch {}/{}, Train Loss: {:.4f}, Eval Loss: {:.4f}"
    print(template.format(epoch, EPOCHS, train_loss, eval_loss))
    print("Time taken for 1 epoch: {}s".format(int(time.time() - start)))


Starting training at 2020_11_28-19_12_52
Entering training loop...
Total Epochs 20
>>> Iteration: 500/3287, Train Loss: 3.2070
>>> Iteration: 1000/3287, Train Loss: 3.0026
>>> Iteration: 1500/3287, Train Loss: 2.8647
>>> Iteration: 2000/3287, Train Loss: 2.7646
>>> Iteration: 2500/3287, Train Loss: 2.6920
>>> Iteration: 3000/3287, Train Loss: 2.6332
Epoch 0/20, Train Loss: 2.6056, Eval Loss: 2.2780
Time taken for 1 epoch: 269s
>>> Iteration: 500/3287, Train Loss: 2.2795
>>> Iteration: 1000/3287, Train Loss: 2.2559
>>> Iteration: 1500/3287, Train Loss: 2.2439
>>> Iteration: 2000/3287, Train Loss: 2.2287
>>> Iteration: 2500/3287, Train Loss: 2.2178
>>> Iteration: 3000/3287, Train Loss: 2.2092
Epoch 1/20, Train Loss: 2.2050, Eval Loss: 2.1331
Time taken for 1 epoch: 229s
>>> Iteration: 500/3287, Train Loss: 2.1139
>>> Iteration: 1000/3287, Train Loss: 2.1071
>>> Iteration: 1500/3287, Train Loss: 2.1057
>>> Iteration: 2000/3287, Train Loss: 2.0996
>>> Iteration: 2500/3287, Train Loss: 2.09

In [None]:
init_epoch = 20
EPOCHS_1 = 31

train_losses_1 = []
eval_losses_1 = []

print('Continue training at {}'.format(time.strftime("%Y_%m_%d-%H_%M_%S")))
print('Total Epochs {}'.format(EPOCHS_1 - init_epoch))

for epoch in range(init_epoch, EPOCHS_1):
    start = time.time()
    
    # Training Iterations
    total_train_loss = 0.0
    num_train_batches = 0
    for (m, r) in train_dist_dataset:
        total_train_loss += distributed_train_step(m, r)
        num_train_batches += 1

        if num_train_batches % 500 == 0:
            print(">>> Iteration: {}/{}, Train Loss: {:.4f}".format(
                num_train_batches,
                steps_per_epoch,
                total_train_loss.numpy() / num_train_batches))

    train_loss = total_train_loss.numpy() / num_train_batches
    train_losses_1.append(train_loss)

    # Evaluation Iterations
    total_eval_loss= 0.0
    num_eval_batches = 0
    for (m, r) in eval_dist_dataset:
        total_eval_loss += distributed_eval_step(m, r)
        num_eval_batches += 1

    eval_loss = total_eval_loss.numpy() / num_eval_batches
    eval_losses_1.append(eval_loss)

    # Save every 2 epochs; the parentheses matter because % binds tighter than +.
    if (epoch + 1) % 2 == 0:
        checkpoint.write(checkpoint_prefix, options=local_device_option)


    template = "Epoch {}/{}, Train Loss: {:.4f}, Eval Loss: {:.4f}"
    print(template.format(epoch, EPOCHS, train_loss, eval_loss))
    print("Time taken for 1 epoch: {}s".format(int(time.time() - start)))


Continue training at 2020_11_28-20_57_08
Total Epochs 11
>>> Iteration: 500/3287, Train Loss: 1.0423
>>> Iteration: 1000/3287, Train Loss: 1.0474
>>> Iteration: 1500/3287, Train Loss: 1.0507
>>> Iteration: 2000/3287, Train Loss: 1.0544
>>> Iteration: 2500/3287, Train Loss: 1.0596
>>> Iteration: 3000/3287, Train Loss: 1.0623
Epoch 20/20, Train Loss: 1.0641, Eval Loss: 1.7884
Time taken for 1 epoch: 227s
>>> Iteration: 500/3287, Train Loss: 0.9809
>>> Iteration: 1000/3287, Train Loss: 0.9911
>>> Iteration: 1500/3287, Train Loss: 0.9956
>>> Iteration: 2000/3287, Train Loss: 0.9999
>>> Iteration: 2500/3287, Train Loss: 1.0034
>>> Iteration: 3000/3287, Train Loss: 1.0068
Epoch 21/20, Train Loss: 1.0076, Eval Loss: 1.7929
Time taken for 1 epoch: 227s
>>> Iteration: 500/3287, Train Loss: 0.9266
>>> Iteration: 1000/3287, Train Loss: 0.9332
>>> Iteration: 1500/3287, Train Loss: 0.9391
>>> Iteration: 2000/3287, Train Loss: 0.9441
>>> Iteration: 2500/3287, Train Loss: 0.9474
>>> Iteration: 3000/3

KeyboardInterrupt: ignored

The training process is stopped here.

Although the training loss is still decreasing, the evaluation loss has started to increase.
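
The same stopping decision could be automated with a small helper. The sketch below is an assumption about how one might do it, not what this notebook ran (training here was interrupted by hand); `best_epoch` and `should_stop` are hypothetical names, and the list holds only the evaluation losses visible in the logs above, so indices refer to positions in that list.

```python
def best_epoch(eval_losses):
    """Index of the entry with the lowest evaluation loss."""
    return min(range(len(eval_losses)), key=eval_losses.__getitem__)


def should_stop(eval_losses, patience=1):
    """True when the last `patience` entries all failed to improve on the earlier best."""
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return all(loss >= best_before for loss in eval_losses[-patience:])


# Evaluation losses printed in the logs above (epochs 0, 1, 20 and 21).
logged = [2.2780, 2.1331, 1.7884, 1.7929]
print(best_epoch(logged), should_stop(logged))  # 2 True
```

A larger `patience` tolerates a few noisy epochs before stopping, at the cost of extra training time.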

We then save the models using Keras' **model.save()**. For this to work, each model's **call** method must take a single input argument, which is why the inputs are packed into tuples in the model definitions above.

The **save_options** defined below are necessary when training with TPUs.

In [None]:
save_options = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')

encoder_path = os.path.join(root_path, 'encoder_25ep')
encoder.save(encoder_path, options=save_options)

decoder_path = os.path.join(root_path, 'decoder_25ep')
decoder.save(decoder_path, options=save_options)

INFO:tensorflow:Assets written to: /content/drive/MyDrive/Colab Notebooks/Chatbots/version-2/encoder_25ep/assets
INFO:tensorflow:Assets written to: /content/drive/MyDrive/Colab Notebooks/Chatbots/version-2/decoder_25ep/assets


### Inference

A class is defined here for inference. Both **greedy-search** and **beam-search** are implemented.
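
As a reference point for the beam-search idea, here is a minimal, framework-free sketch that scores each candidate by the sum of its log-probabilities. It is not the class defined below: `step_fn` and the toy transition table are hypothetical stand-ins for the decoder, and the probabilities are made up.

```python
import math


def beam_search(step_fn, start_id, end_id, k=3, max_len=10):
    """Generic beam search over token ids.

    step_fn(prefix) must return a dict {token_id: probability} for the next token.
    Each beam is scored by the sum of log-probabilities of its tokens.
    """
    beams = [(0.0, [start_id])]
    while True:
        expanded = []
        for score, ids in beams:
            if ids[-1] == end_id or len(ids) >= max_len:
                expanded.append((score, ids))  # finished beams are carried over
                continue
            for token, prob in step_fn(ids).items():
                expanded.append((score + math.log(prob), ids + [token]))
        # Keep only the k highest-scoring candidates.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:k]
        if all(ids[-1] == end_id or len(ids) >= max_len for _, ids in beams):
            return beams


# Toy next-token table standing in for the decoder (hypothetical numbers).
START, END = 1, 2
TABLE = {
    1: {3: 0.6, 4: 0.4},
    3: {2: 0.3, 4: 0.7},
    4: {2: 0.9, 3: 0.1},
}


def toy_step(ids):
    return TABLE[ids[-1]]


best_score, best_ids = beam_search(toy_step, START, END, k=2, max_len=6)[0]
print(best_ids, round(math.exp(best_score), 3))  # [1, 3, 4, 2] 0.378
```

Greedy search is the special case `k=1`: only the single most likely token is kept at every step.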

In [None]:
class Chatbot:
	def __init__(self, encoder, decoder, tokenizer, max_len=MAX_LEN):
		self.encoder = encoder
		self.decoder = decoder
		self.tokenizer = tokenizer
		self.max_len = max_len

	def preprocess_text(self, text):
		text = re.sub(r"([?.!,¿])", r" \1 ", text)
		
		text = re.sub(r'[" "]+', " ", text)

		text = re.sub(r"[^a-zA-Z0-9?.!,]+", " ", text)

		text = text.strip()

		text = '<start> ' + text + ' <end>'
		
		return text

	def prepare_input(self, message):
		message = self.preprocess_text(message)

		sequence = self.tokenizer.texts_to_sequences([message])

		# Pad at the end, matching the 'post' padding used for the training data.
		padded_sequence = tf.keras.preprocessing.sequence.pad_sequences(
			sequence,
			maxlen=self.max_len,
			padding='post',
			truncating='post')

		tensor = tf.convert_to_tensor(padded_sequence)

		return tensor

	def greedy_search(self, message):
		tensor = self.prepare_input(message)

		response = ""

		enc_output, hiddens = self.encoder(tensor)

		dec_in = (tf.expand_dims([self.tokenizer.word_index['<start>']], 0),
			hiddens, enc_output)

		for t in range(self.max_len):
			pred, hiddens, _ = self.decoder(dec_in)

			pred_id = tf.argmax(pred[0]).numpy()

			pred_word = self.tokenizer.index_word[pred_id]

			if pred_word == '<end>':
				return message, response.strip()

			response += pred_word + ' '

			dec_in = (tf.expand_dims([pred_id], 0), hiddens, enc_output)

		return message, response.strip()

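	def find_top_k(self, acc_val, ids, hiddens, enc_sequence, k):
		# Expand one candidate: feed its last token and keep the k most likely continuations.
		dec_in = (tf.expand_dims([ids[-1]], 0), hiddens, enc_sequence)
		pred, hiddens, _ = self.decoder(dec_in)

		# Normalize over the whole vocabulary so scores are comparable across candidates.
		top_k = tf.math.top_k(tf.nn.softmax(pred), k=k)
		top_vals = top_k.values.numpy()[0]
		top_indices = top_k.indices.numpy()[0]

		candidates = []

		for val, id_ in zip(top_vals, top_indices):
			# Accumulate the sequence probability instead of keeping only the last step's.
			candidates.append([acc_val * val, ids + [int(id_)], hiddens])

		return candidates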

	def beam_search(self, message, k=5):
		start_token = self.tokenizer.word_index['<start>']
		end_token = self.tokenizer.word_index['<end>']

		tensor = self.prepare_input(message)

		enc_sequence, hiddens = self.encoder(tensor)

		candidates = self.find_top_k(1, [start_token], hiddens, enc_sequence, k)

		while True:
			next_candidates = []
			for candidate in candidates:
				if len(candidate[1]) == self.max_len or candidate[1][-1] == end_token:
					next_candidates.append(candidate)
					continue
				next_candidates.extend(self.find_top_k(*candidate, enc_sequence, k))

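			# Sort by score only; without a key, ties would fall through to comparing ids and tensors.
			candidates = sorted(next_candidates, key=lambda c: c[0], reverse=True)[:k]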

			are_ended = []
			for candidate in candidates:
				is_ended = len(candidate[1]) == self.max_len or candidate[1][-1] == end_token
				are_ended.append(is_ended)

			if all(are_ended):
				sequences = [cand[1][1:-1] for cand in candidates]
				# response = choice(self.tokenizer.sequences_to_texts(sequences))
				response = self.tokenizer.sequences_to_texts(sequences)[0]
				return message, response, sequences[0]


In [None]:
chatbot = Chatbot(encoder, decoder, tokenizer)

In [None]:
while True:
    message = input('You: ')
    print(f'Bot: {chatbot.beam_search(message)[1]}')

Bot: hey , hows it going ?
Bot: what s the matter with that guy ?
Bot: don t you ?
Bot: how do you feel about that ?
Bot: don t you want to talk about me ?
Bot: he s a very sick guy , what is up .
Bot: who the ?
Bot: tell me more about yourself , please .
Bot: , my family moved here so i am in the city .
Bot: 6 and 4 . do you have any children ?
Bot: that s nice . how many kids do you have ?


KeyboardInterrupt: ignored