# Deploying Seq2Seq Model with the Hybrid Frontend

This tutorial will walk through the process of transitioning a sequence-to-sequence model to graph mode using PyTorch's Hybrid Frontend. The model that we will convert is the chatbot model from the [Chatbot tutorial](). While the [introductory Hybrid Frontend tutorials]() are useful for gaining an understanding of the work-flow, purpose, and basic syntax of the feature, this document covers a more challenging model and a more practical use-case. You can either treat this tutorial as a "Part 2" to the [Chatbot tutorial]() and deploy your own pretrained model, or you can start with this document and use a pretrained model that we host. In the latter case, you can reference the original [Chatbot tutorial]() for details regarding data preprocessing, model theory and definition, and model training.


## What is the Hybrid Frontend?
During the research and development phase of a deep learning-based project, it is advantageous to interact with an **eager**, imperative interface like PyTorch's. This gives users the ability to write familiar, idiomatic Python that executes as-is, allowing for the use of Python data structures, control flow operations, print statements, and debugging utilities. Although the eager interface is a valuable tool for R&D applications, when it comes time to deploy the model in a production environment, having a **graph**-based model representation is very beneficial. A deferred representation allows for numerous optimization techniques such as out-of-order execution, framework-agnostic model exportation with ONNX, and the ability to target highly optimized hardware architectures. The Hybrid Frontend provides a flexible and non-intrusive tool for translating models from eager mode to graph mode.

The Hybrid Frontend is facilitated through a Just-In-Time (JIT) compiler (`torch.jit`). The JIT compiler has two core modalities for converting an eager model to a graph representation: **trace** and **script**. The `torch.jit.trace` function takes a module or function and an example input. It then runs the example input through the function or module while tracing the computational steps that are encountered, and outputs a graph-based function that performs the same actions. The **trace** mode is great for straightforward modules and functions that do not involve data-dependent control flow. However, if a function with data-dependent if statements and loops is traced, only the operations called along the control path taken by the example input will be recorded. In other words, the control flow itself is not captured.

To convert modules and functions containing data-dependent control flow, a **script** mode is provided. The script mode explicitly converts the module or function code to graph mode, including all possible control flow routes. To use script mode, simply add a `torch.jit.script` decorator to your Python function or a `torch.jit.script_method` decorator to your module's `forward` function. The one caveat with using script mode is that it currently only supports a restricted subset of Python. As of now, features such as generators, defs, and Python data structures are not supported. To remedy this, you can invoke traced modules from script modules (and vice-versa), and you can call pure Python functions and modules from script modules. However, the operations done in the pure Python functions will not be compiled, and will run as-is.
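
The difference between the two modes is easiest to see on a function with a data-dependent branch. The following is a minimal sketch, assuming a recent PyTorch release in which `torch.jit.trace(fn, example_input)` and `torch.jit.script(fn)` can be called as plain functions (this call form differs from the decorator form used elsewhere in this document). The helper `double_or_negate` is hypothetical and exists only for illustration.

```python
import warnings

import torch


def double_or_negate(x):
    # Data-dependent control flow: the branch taken depends on the values in x
    if x.sum() > 0:
        return x * 2
    else:
        return -x


positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing a tensor-valued condition emits a TracerWarning; silence it here
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(double_or_negate, positive)

scripted = torch.jit.script(double_or_negate)

# The trace only recorded the "x * 2" branch taken by the example input
traced_out = traced(negative)
# The scripted function kept both branches and takes the "-x" branch here
scripted_out = scripted(negative)

print("traced:  ", traced_out)
print("scripted:", scripted_out)
```

With a negative input, the traced function still doubles its argument, while the scripted function negates it as the Python source does.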

## Prepare Environment

First, we will import the required modules and set some constants. If you are planning on using your own model, be sure that the `MAX_LENGTH` constant is set correctly. As a reminder, this constant defines the maximum allowed sentence length during training and the maximum output length that the model is capable of producing.

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import torch
from torch.jit import script, trace
import torch.nn as nn
import torch.nn.functional as F
import re
import os
import unicodedata


USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

# Constants
MAX_LENGTH = 10  # Maximum sentence length

# Default word tokens
PAD_token = 0  # Used for padding short sentences
SOS_token = 1  # Start-of-sentence token
EOS_token = 2  # End-of-sentence token

## Model Overview

As mentioned, the model that we are using is a [sequence-to-sequence](https://arxiv.org/abs/1409.3215) (seq2seq) model. This type of model is used in cases when our input is a variable-length sequence, and our output is also a variable-length sequence that is not necessarily a one-to-one mapping of the input. A seq2seq model is composed of two recurrent neural networks (RNNs) that work cooperatively: an **encoder** and a **decoder**.

![model](images/seq2seq_model.png)

Image source: https://medium.com/botsupply/generative-model-chatbots-e422ab08461e

### Encoder

The encoder RNN iterates through the input sentence one word at a time, at each time step outputting an "output" vector and a hidden state that is used in the next time step. Essentially, the encoder takes the input sequence, and attempts to encode its "meaning" to a fixed-sized context or "thought" tensor. In our case, the context tensor is simply the output of the RNN, or the output features of the last hidden layer in the network.


### Decoder

The decoder RNN generates the response sentence in a word-by-word fashion. It uses the encoder's context tensor and the previous word to generate the next word in the sequence. It continues generating words until it outputs an *EOS_token*, representing the end of the sentence. In practice, relying solely on the encoder's context tensor to encode the meaning of the entire sequence is not very effective, especially when encoding long input sequences. To remedy this, we use an [attention mechanism](https://arxiv.org/abs/1409.0473) in our decoder. This attention mechanism helps the decoder to "pay attention" to certain parts of the input when generating the output. For our model, we implement [Luong et al.](https://arxiv.org/abs/1508.04025)'s "Global attention" module, and use it as a layer in our decoder model.

## Define Encoder

We implement our encoder's RNN with the `torch.nn.GRU` module. We feed it an entire batch of sentences (as word embeddings), and it internally iterates through the sequences one word at a time, calculating the hidden states. We initialize this module to be bidirectional, meaning that we have two independent GRUs: one that iterates through the sequences in chronological order, and another that iterates in reverse order. We ultimately return the sum of these two GRUs' outputs. Since our model was trained using batching, our `EncoderRNN` model's `forward` function expects a padded input batch. To batch variable-length sentences, we allow a maximum of *MAX_LENGTH* words in a sentence, and all sentences in the batch that have fewer than *MAX_LENGTH* words are padded at the end with our dedicated *PAD_token* tokens. To use padded batches with one of PyTorch's RNN modules, we must wrap the GRU forward pass call with the `torch.nn.utils.rnn.pack_padded_sequence` and `torch.nn.utils.rnn.pad_packed_sequence` data transformations. Note that the `forward` function also takes an `input_lengths` list, which contains the length of each sentence in the batch. This input is used by the `torch.nn.utils.rnn.pack_padded_sequence` function when packing.

#### Hybrid Frontend

Since the encoder's `forward` function does not contain any data-dependent control flow, we will use the **trace** method for converting it to graph mode. When tracing a module, we can leave the module definition as-is, as we perform the trace upon initialization. We will initialize all models towards the end of this document before we run evaluations.

In [2]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
        
        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features = hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        return outputs, hidden
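
To make the pack/unpack round trip concrete, here is a small standalone sketch that mirrors the encoder's forward pass on a toy batch. The vocabulary size, embedding size, and token indexes are made-up values for illustration, and the layers are randomly initialized rather than loaded from the chatbot model.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

PAD_token = 0
hidden_size = 4  # illustrative size, not the tutorial's 500

# Two sentences of lengths 3 and 2, laid out as (max_len, batch_size).
# Lengths must be sorted in decreasing order for pack_padded_sequence.
batch = torch.tensor([[5, 7],
                      [6, 8],
                      [9, PAD_token]])
lengths = torch.tensor([3, 2])

embedding = torch.nn.Embedding(10, hidden_size, padding_idx=PAD_token)
gru = torch.nn.GRU(hidden_size, hidden_size, bidirectional=True)

# Pack so the GRU skips the padded time steps
packed = pack_padded_sequence(embedding(batch), lengths)
packed_out, hidden = gru(packed)
# Unpack back to a padded (max_len, batch_size, 2 * hidden_size) tensor
outputs, out_lengths = pad_packed_sequence(packed_out)
# Sum the forward and backward halves, as EncoderRNN.forward does
summed = outputs[:, :, :hidden_size] + outputs[:, :, hidden_size:]

print(outputs.shape, summed.shape, hidden.shape, out_lengths)
```

The padded position of the shorter sentence comes back as zeros after unpacking, so it contributes nothing to the summed output.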

## Define Decoder's Attention Module

#### Hybrid Frontend

In [3]:
# Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()

        self.method = method
        self.hidden_size = hidden_size

        if self.method == 'general':
            self.attn = nn.Linear(self.hidden_size, hidden_size)

        elif self.method == 'concat':
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.FloatTensor(1, hidden_size))

    def forward(self, hidden, encoder_outputs):
        max_len = encoder_outputs.size(0)
        batch_size = encoder_outputs.size(1)

        # Create variable to store attention energies
        attn_energies = torch.zeros(batch_size, max_len) # B x S
        attn_energies = attn_energies.to(device)

        # For each batch of encoder outputs
        for b in range(batch_size):
            # Calculate energy for each encoder output
            for i in range(max_len):
                attn_energies[b, i] = self.score(hidden[:, b], encoder_outputs[i, b].unsqueeze(0))

        # Normalize energies to weights in range 0 to 1, resize to 1 x B x S
        return F.softmax(attn_energies, dim=1).unsqueeze(1)

    # Score functions
    def score(self, hidden, encoder_output):
        if self.method == 'dot':
            energy = hidden.squeeze(0).dot(encoder_output.squeeze(0))
            return energy

        elif self.method == 'general':
            energy = self.attn(encoder_output)
            energy = hidden.squeeze(0).dot(energy.squeeze(0))
            return energy

        elif self.method == 'concat':
            energy = self.attn(torch.cat((hidden, encoder_output), 1))
            energy = self.v.squeeze(0).dot(energy.squeeze(0))
            return energy

## Define Decoder

Similarly to the `EncoderRNN`, we use the `torch.nn.GRU` module for our decoder's RNN. This time, however, we use a unidirectional GRU. It is important to note that unlike the encoder, we will feed the decoder RNN one word at a time. We start by getting the embedding of the current word and applying a dropout. Next, we forward the embedding and the last hidden state to the GRU and obtain a current GRU output and hidden state. We then use our attention module as a layer to obtain the attention weights, which we multiply by the encoder's output to obtain our attended encoder output. The attended encoder output represents a weighted sum indicating what parts of the encoder's output to pay attention to. From here, we use a linear layer and softmax normalization to select the next word in the output sequence.

#### Hybrid Frontend



In [4]:
class LuongAttnDecoderRNN(torch.jit.ScriptModule):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)

    @torch.jit.script_method
    def forward(self, input_seq, last_hidden, encoder_outputs):
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_seq)
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1) 
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        return output, hidden
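
The attention step described above can be checked on random tensors. The sketch below is a vectorized variant of the "dot" score, not the loop-based `Attn` module defined earlier, and all sizes are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F

max_len, batch_size, hidden_size = 4, 2, 3  # illustrative sizes

rnn_output = torch.randn(1, batch_size, hidden_size)             # current GRU output
encoder_outputs = torch.randn(max_len, batch_size, hidden_size)  # all encoder steps

# "dot" score: inner product of the GRU output with every encoder output
energies = torch.sum(rnn_output * encoder_outputs, dim=2)        # (max_len, batch)
# Normalize over the sequence dimension and reshape to (batch, 1, max_len)
attn_weights = F.softmax(energies.t(), dim=1).unsqueeze(1)
# Weighted sum of encoder outputs: (batch, 1, max_len) x (batch, max_len, hidden)
context = attn_weights.bmm(encoder_outputs.transpose(0, 1))      # (batch, 1, hidden)

print(attn_weights.shape, context.shape)
```

Each row of `attn_weights` sums to 1, so `context` is a convex combination of the encoder outputs for that batch element.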

## Data Handling

Although our models conceptually deal with sequences of words, in reality, they deal with numbers like all machine learning models do. In this case, every word in the model's vocabulary, which was established before training, is mapped to an integer index. We use a `Voc` object to contain the mappings from word to index, as well as the total number of words in the vocabulary. We will load the object later before we run the model.
 
Also, in order for us to be able to run evaluations, we must provide a tool for processing our string inputs. The `normalizeString` function converts all characters in a string to lowercase, separates sentence punctuation (`.`, `!`, `?`) from the preceding word, and replaces all other non-letter characters with spaces. The `indexesFromSentence` function takes a sentence of words and returns the corresponding sequence of word indexes, followed by the *EOS_token*.

In [5]:
class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3  # Count SOS, EOS, PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    # Remove words below a certain count threshold
    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True

        keep_words = []

        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)

        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))

        # Reinitialize dictionaries
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3 # Count default tokens

        for word in keep_words:
            self.addWord(word)
            

# Lowercase, split off sentence punctuation, and replace other non-letter characters with spaces
def normalizeString(s):
    s = s.lower()
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]
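
As a quick illustration of these helpers, the following standalone sketch repeats the normalization logic and applies it to one of the evaluation sentences used later. The `toy_word2index` mapping is a hypothetical stand-in for a trained `Voc` object; its indexes are made up.

```python
import re

EOS_token = 2


def normalize_string(s):
    # Same steps as normalizeString above
    s = s.lower()
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else becomes a space
    return s


# Hypothetical vocabulary; a real Voc is built from the training corpus
toy_word2index = {"what": 3, "s": 4, "up": 5, "?": 6}

normalized = normalize_string("What's up?")
indexes = [toy_word2index[word] for word in normalized.split(' ')] + [EOS_token]

print(repr(normalized))  # 'what s up ?'
print(indexes)           # [3, 4, 5, 6, 2]
```

Note that a word missing from the vocabulary raises a `KeyError` in this lookup, so inputs are limited to words seen during training.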

## Define Evaluation

### Decoding

To decode a given decoder output, we must iteratively run forward passes through our `decoder` model, which outputs softmax scores corresponding to the probability of each word being the correct next word in the decoded sequence. We initialize the `decoder_input` to a tensor containing an *SOS_token*. After each pass through the `decoder`, we append the word with the highest softmax probability to the `decoded_words` list. We also use this word as the `decoder_input` for the next iteration. The decoding process terminates either if the `decoded_words` list has reached a length of *MAX_LENGTH* or if the predicted word is the *EOS_token*.

In [6]:
def decode(decoder, decoder_hidden, encoder_outputs, voc, max_length=MAX_LENGTH):
    # Initialize input, words, and attentions
    decoder_input = torch.LongTensor([[SOS_token]])
    decoder_input = decoder_input.to(device)
    decoded_words = []
    decoder_attentions = torch.zeros(max_length, max_length)
    
    # Allow output sequences with a max length of max_length
    for _ in range(max_length):
        # Run forward pass through decoder model
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden, encoder_outputs
        )
        
        # Take word with highest softmax probability
        _, topi = decoder_output.topk(1)
        ni = topi[0][0]
        # If the predicted word is an EOS token, stop decoding
        if ni == EOS_token:
            break
        # Else, append the string word to decoded_words list
        else:
            decoded_words.append(voc.index2word[ni.item()])

        # Set next decoder input as the chosen decoded word
        decoder_input = torch.LongTensor([[ni]])
        decoder_input = decoder_input.to(device)

    return decoded_words
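
The control flow of this greedy decoding loop can be seen without a trained model. Below is a pure-Python sketch in which `toy_step` is a hypothetical stand-in for the decoder: it maps the previous token to a hand-written probability list instead of running a network.

```python
SOS_token, EOS_token = 1, 2
index2word = {3: "hello", 4: "there", 5: "."}  # made-up vocabulary


def toy_step(prev_token):
    # Hypothetical decoder: fixed probabilities over 6 token indexes
    table = {
        SOS_token: [0.0, 0.0, 0.1, 0.7, 0.1, 0.1],
        3: [0.0, 0.0, 0.1, 0.1, 0.6, 0.2],
        4: [0.0, 0.0, 0.2, 0.1, 0.1, 0.6],
        5: [0.0, 0.0, 0.9, 0.0, 0.05, 0.05],
    }
    return table[prev_token]


def greedy_decode(step, max_length=10):
    token = SOS_token
    words = []
    for _ in range(max_length):
        probs = step(token)
        # Pick the index with the highest probability (greedy choice)
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == EOS_token:
            break
        words.append(index2word[token])
    return words


decoded = greedy_decode(toy_step)
truncated = greedy_decode(toy_step, max_length=2)
print(decoded)    # ['hello', 'there', '.']
print(truncated)  # ['hello', 'there']
```

Decoding stops either when the EOS token is chosen or when `max_length` words have been produced, which are the same two termination conditions described above.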

### Evaluating an Input

Next, we define an `evaluate` function to drive an input through our seq2seq model. This function takes a normalized string `sentence` and formats it as a batch of size 1. This batch is then passed through the encoder along with a `lengths` tensor. We then prepare the encoder's final hidden state to be the first hidden input to the decoder by slicing out the first `decoder.n_layers` layers. Finally, we call our `decode` function with the `decoder_hidden` and `encoder_outputs` tensors that we created, and return the output.

We also define an `evaluateExample` function, which takes a string sentence, normalizes it, calls the `evaluate` function, and prints the results in a conversational format.

In [7]:
# Evaluate a sentence
def evaluate(encoder, decoder, voc, sentence, max_length=MAX_LENGTH):
    # Format input sentence as a batch
    indexes_batch = [indexesFromSentence(voc, sentence)]
    lengths = [len(indexes) for indexes in indexes_batch]
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    input_batch = input_batch.to(device)

    # Forward input through encoder model
    encoder_outputs, encoder_hidden = encoder(input_batch, torch.tensor(lengths))

    # Prepare encoder's final hidden layer to be first hidden input to the decoder
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Decode sentence
    return decode(decoder, decoder_hidden, encoder_outputs, voc, max_length)


# Normalize input sentence and call evaluate()
def evaluateExample(sentence, encoder, decoder, voc):
    print("> " + sentence)
    # Normalize sentence
    input_sentence = normalizeString(sentence)
    # Evaluate sentence
    output_words = evaluate(encoder, decoder, voc, input_sentence)
    output_sentence = ' '.join(output_words)
    print('bot:', output_sentence)
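
The tensor shapes involved in `evaluate` are worth checking once. This sketch uses a randomly initialized embedding and bidirectional GRU with made-up sizes and token indexes, not the pretrained model, and skips packing because the batch holds a single sentence.

```python
import torch

n_layers, hidden_size = 2, 8   # illustrative sizes
indexes_batch = [[4, 5, 6, 2]]  # one sentence of made-up word indexes ending in EOS

lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
# Transpose from (batch, seq_len) to the (seq_len, batch) layout the GRU expects
input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)

embedding = torch.nn.Embedding(10, hidden_size)
gru = torch.nn.GRU(hidden_size, hidden_size, n_layers, bidirectional=True)

_, encoder_hidden = gru(embedding(input_batch))
# A bidirectional GRU returns n_layers * 2 hidden states; keep the first n_layers
decoder_hidden = encoder_hidden[:n_layers]

print(input_batch.shape, encoder_hidden.shape, decoder_hidden.shape)
```

The slice reduces the hidden state from `(n_layers * 2, 1, hidden_size)` to `(n_layers, 1, hidden_size)`, which is the shape a unidirectional decoder GRU with `n_layers` layers accepts.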

## Load Pretrained Parameters

Ok, it's time to load our model!

### Use hosted model

### Use my own model

In [8]:
save_dir = os.path.join("data", "save")
corpus_name = "cornell movie-dialogs corpus"

# Configure models
model_name = 'model7'
attn_model = 'dot'
#attn_model = 'general'
#attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
checkpoint_iter = 4000
loadFilename = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size), '{}_checkpoint.tar'.format(checkpoint_iter))


#checkpoint = torch.load(loadFilename)
checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
encoder_sd = checkpoint['en']
decoder_sd = checkpoint['de']
embedding_sd = checkpoint['embedding']
voc = checkpoint['voc']

# Initialize Model
checkpoint = None
print('Building encoder and decoder ...')
embedding = nn.Embedding(voc.num_words, hidden_size)
embedding.load_state_dict(embedding_sd)

# Initialize encoder
dummy_seq = torch.zeros((1,1), dtype=torch.int64)
dummy_lengths = torch.tensor([1])
encoder = trace(dummy_seq, dummy_lengths)(EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout))

# Initialize decoder
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)

# Populate model state_dicts
encoder.load_state_dict(encoder_sd)
decoder.load_state_dict(decoder_sd)


# use cuda
#encoder = encoder.to(device)
#decoder = decoder.to(device)
print('Models built and ready to go!')


Building encoder and decoder ...
Models built and ready to go!
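
The load pattern used above (read a checkpoint dictionary with `map_location`, then call `load_state_dict`) can be tried in isolation. This is a minimal sketch that saves to an in-memory buffer instead of a file; the `'embedding'` key matches the checkpoint layout above, while the layer sizes are arbitrary.

```python
import io

import torch
import torch.nn as nn

# Save a checkpoint dictionary holding one state_dict
embedding = nn.Embedding(5, 4)
buffer = io.BytesIO()
torch.save({'embedding': embedding.state_dict()}, buffer)

# Load it back, mapping all tensors onto the CPU
buffer.seek(0)
checkpoint = torch.load(buffer, map_location=torch.device('cpu'))

# Populate a freshly constructed layer with the saved parameters
restored = nn.Embedding(5, 4)
restored.load_state_dict(checkpoint['embedding'])

print(torch.equal(restored.weight, embedding.weight))
```

Passing `map_location` is what allows a checkpoint saved on a GPU machine to be loaded on a CPU-only machine.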


## Print Graphs

In [9]:
print('encoder graph', encoder.__getattr__('forward').graph)
print('decoder graph', decoder.__getattr__('forward').graph)

encoder graph graph(%0 : Long(1, 1)
      %1 : Long(1)
      %2 : Float(7826, 500)
      %3 : Float(1500, 500)
      %4 : Float(1500, 500)
      %5 : Float(1500)
      %6 : Float(1500)
      %7 : Float(1500, 500)
      %8 : Float(1500, 500)
      %9 : Float(1500)
      %10 : Float(1500)
      %11 : Float(1500, 1000)
      %12 : Float(1500, 500)
      %13 : Float(1500)
      %14 : Float(1500)
      %15 : Float(1500, 1000)
      %16 : Float(1500, 500)
      %17 : Float(1500)
      %18 : Float(1500)) {
  %19 : int = prim::Constant[value=-1](), scope: EncoderRNN/Embedding[embedding]
  %20 : int = prim::Constant[value=0](), scope: EncoderRNN/Embedding[embedding]
  %21 : int = prim::Constant[value=0](), scope: EncoderRNN/Embedding[embedding]
  %22 : Float(1, 1, 500) = aten::embedding(%2, %0, %19, %20, %21), scope: EncoderRNN/Embedding[embedding]
  %70 : Float(1, 500), %71 : Long(1) = ^pack_padded_sequence_trace_wrapper()(%22, %1), scope: EncoderRNN
  %83 : Float(4, 1, 500) = prim::Constant[v
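
For a graph small enough to read in full, the same inspection can be done on a tiny traced module. This sketch assumes a recent PyTorch release, where `torch.jit.trace(module, example_input)` is called as a plain function and the resulting module exposes a `.graph` attribute directly; the `nn.Linear` layer is only a placeholder.

```python
import torch
import torch.nn as nn

# Trace a small placeholder module with an example input
tiny = nn.Linear(3, 2)
traced_tiny = torch.jit.trace(tiny, torch.zeros(1, 3))

# The graph lists the inputs followed by the recorded operations
graph_text = str(traced_tiny.graph)
print(graph_text)
```

The exact operator names in the printed graph depend on the PyTorch version.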

## Run Evaluation

In [10]:
# Set dropout layers to eval mode
#encoder.eval()
#decoder.eval()

# Evaluate examples
sentences = ["hello", "what's up?", "who are you?", "where are we?", "where are you from?"]
for s in sentences:
    evaluateExample(s, encoder, decoder, voc)

> hello
bot: hello .
> what's up?
bot: i m sorry .
> who are you?
bot: no one s in the trunk .
> where are we?
bot: we re in the garage .
> where are you from?
bot: the zoo .
