# Implementing a new model with Jack 

In this tutorial, we focus on the minimal steps required to implement a new model from scratch using Jack.
Please note that this tutorial has a lot of detail. It is aimed at developers who want to understand the internals of Jack. 

In order to implement a Jack Reader, we define three modules:
- **Input Module**: Maps `QASetting`s to numpy arrays associated with `TensorPort`s
- **Model Module**: Defines the differentiable model architecture graph (_TensorFlow_ or _PyTorch_)
- **Output Module**: Converts the network output into human-readable system output

Jack is modular, in the sense that any particular input/model/output module can be exchanged with another one.
To illustrate this, we will implement an entire reader, and will then go on to implement another reader, but reusing the _Input_ and _Output_ module of the first.

The first reader will have a _Model Module_ based on *TensorFlow*, the second will have a _Model Module_ based on *PyTorch*.

### Model Overview
As an example, we will implement a simple Bi-LSTM baseline for extractive question answering, which involves extracting the answer to a question from a given text. On a high level, the architecture looks as follows:
- Words of question and support are embedded using random embeddings (not trained)
- Both support and question are encoded using a bi-directional LSTM
- The question is summarized by a weighted token representation
- A feedforward NN scores each of the support tokens to be the _start_ of the answer
- A feedforward NN scores each of the support tokens to be the _end_ of the answer


In [1]:
# First change dir to jack parent
import os
os.chdir('..')

In [2]:
import re
from jack.core import *
from jack.core.tensorflow import TFReader, TFModelModule
from jack.io.embeddings import Embeddings
from jack.util.hooks import LossHook
from jack.util.vocab import *
from jack.readers.extractive_qa.shared import XQAPorts, XQAOutputModule
from jack.readers.extractive_qa.util import prepare_data
from jack.readers.extractive_qa.util import tokenize
from jack import tfutil
from jack.tfutil import sequence_encoder
from jack.tfutil.misc import mask_for_lengths
from jack.util.map import numpify
from jack.util.preprocessing import stack_and_pad
from typing import List, Sequence
import numpy as np
import tensorflow as tf
_tokenize_pattern = re.compile(r'\w+|[^\w\s]')

## Ports

All communication between _Input_, _Model_ and _Output_ modules happens via `TensorPort`s (see `jack/core/tensorport.py`). Tensorports can be understood as placeholders for tensors, and define the ways in which information is communicated between the differentiable model architecture (_Model_ module), and the _Input_ and _Output_ modules.

This is useful when implementing new models: often a model for the same task already exists, and you can re-use its _Input_ or _Output_ modules. To do so, make sure that your new module is compatible with the ports specified in the existing modules.

In case you can reuse existing _Input_ or _Output_ modules, it is then enough to simply
implement a new _Model_ Module (see below) that adheres to the same Tensorport interface.
See `jack/readers/implementations.py` to see how different readers re-use the same modules.

If you need a new port, however, it is also straightforward to define one.
For this tutorial, we will define most ports here.
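
To make the idea of port compatibility concrete, here is a minimal, dependency-free sketch. `ToyPort` and `missing_ports` are hypothetical names used only for illustration; they are not Jack's `TensorPort` API. The sketch only shows the principle that a module can be plugged in when every port it consumes is produced upstream.

```python
from collections import namedtuple

# Hypothetical stand-in for a port: a name, an element type and a shape
# in which None marks a dimension of unknown size. Not Jack's TensorPort.
ToyPort = namedtuple("ToyPort", ["name", "dtype", "shape"])

embedded_question = ToyPort("embedded_question", "float32", (None, None, None))
question_length = ToyPort("question_length", "int32", (None,))
start_scores = ToyPort("start_scores", "float32", (None, None))


def missing_ports(produced, required):
    """Return the required ports that no upstream module produces."""
    produced = set(produced)
    return [port for port in required if port not in produced]


# Ports produced by a toy input module and consumed by toy downstream modules.
input_module_outputs = [embedded_question, question_length]
model_module_inputs = [embedded_question, question_length]
output_module_inputs = [start_scores]

# Empty list: the toy model module can consume the toy input module's output.
print(missing_ports(input_module_outputs, model_module_inputs))
# Non-empty list: `start_scores` must come from somewhere else (the model).
print(missing_ports(input_module_outputs, output_module_inputs))
```

A new _Model_ module that declares the same input and output ports as an existing one can be swapped in without touching the other two modules.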

In [3]:
class MyPorts:

    embedded_question = TensorPort(np.float32, [None, None, None],
                                   "embedded_question",
                                   "Represents the embedded question",
                                   "[B, max_num_question_tokens, N]")
    # or reuse Ports.Misc.embedded_question

    question_length = TensorPort(np.int32, [None],
                                 "question_length",
                                 "Represents length of questions in batch",
                                 "[B]")
    # or reuse Ports.Input.question_length

    embedded_support = TensorPort(np.float32, [None, None, None],
                                  "embedded_support",
                                  "Represents the embedded support",
                                  "[B, max_num_tokens, N]")
    # or reuse Ports.Misc.embedded_support

    support_length = TensorPort(np.int32, [None],
                                "support_length",
                                "Represents length of support in batch",
                                "[B]")
    # or reuse Ports.Input.support_length

    start_scores = TensorPort(np.float32, [None, None],
                              "start_scores",
                              "Represents start scores for each support sequence",
                              "[B, max_num_tokens]")
    # or reuse Ports.Prediction.start_scores

    end_scores = TensorPort(np.float32, [None, None],
                            "end_scores",
                            "Represents end scores for each support sequence",
                            "[B, max_num_tokens]")
    # or reuse Ports.Prediction.end_scores

    span_prediction = TensorPort(np.int32, [None, 2],
                                 "span_prediction",
                                 "Represents predicted answer as a (start, end) span",
                                 "[B, 2]")
    # or reuse Ports.Prediction.span_prediction

    answer_span = TensorPort(np.int32, [None, 2],
                             "answer_span_target",
                             "Represents target answer as a (start, end) span",
                             "[B, 2]")
    # or reuse Ports.Target.answer_span

    token_offsets = TensorPort(np.int32, [None, None],
                               "token_offsets",
                               "Character index of tokens in support.",
                               "[B, support_length]")
    # or reuse XQAPorts.token_offsets
    
    loss = Ports.loss  # this port must be used

## Implementing an Input Module

The _Input_ module is responsible for converting `QASetting` instances (the inputs to the reader) into numpy
arrays, which are mapped to `TensorPort`s and passed on to the _Model_ module.
Effectively, we are building the TensorFlow _feed dictionary_ used during training and inference.
There are _Input_ modules for
several readers that can easily be reused when your model requires the same
pre-processing and input as another model. 
**Note**: Similarly, this is also true for the _Output_ Module. 

To implement a new _Input_ module, you could implement the `InputModule` interface, but in many cases it'll be
easier to inherit from `OnlineInputModule`, which already comes with useful functionality. In our implementation we will do the latter. We will need to:
- Define the output `TensorPort`s of our input module. These will be used to communicate with the _Model_ module
- Implement the actual preprocessing (e.g. tokenization, mapping to embedding vectors, ...). The result of this step is one *annotation* per instance; this annotation is a `dict` with values for every Tensorport to pass on to the _Model_ module (see `_preprocess_instance()` below).
- Implement batching.
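
The preprocessing step can be pictured with a small standalone sketch. In the implementation below, `prepare_data()` does this work; the code here is not its implementation, only an illustration that uses the same regular expression as `_tokenize_pattern`. The function names and the convention of mapping unknown tokens to the ID `len(vocab)` are assumptions made for this example.

```python
import re

# Same pattern as `_tokenize_pattern` in the imports cell: runs of word
# characters, or single characters that are neither word nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_with_offsets(text):
    """Return the tokens of `text` and the character index where each starts."""
    tokens, offsets = [], []
    for match in TOKEN_PATTERN.finditer(text):
        tokens.append(match.group())
        offsets.append(match.start())
    return tokens, offsets


def to_ids(tokens, vocab):
    """Map tokens to integer IDs; unknown tokens get the ID len(vocab)."""
    return [vocab.get(token, len(vocab)) for token in tokens]


support = "While b seems plausible, answer a is correct."
tokens, offsets = tokenize_with_offsets(support)
vocab = {"While": 0, "b": 1, "a": 2}  # toy vocabulary for illustration

print(tokens)
print(offsets)
print(to_ids(tokens, vocab))
```

The character offsets are what later allows the _Output_ module to turn a predicted token span back into a substring of the support.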

In [4]:
class MyInputModule(OnlineInputModule):
    
    def setup(self):
        self.vocab = self.shared_resources.vocab
        self.emb_matrix = self.vocab.emb.lookup

    # We will now define the input and output TensorPorts of our model.

    @property
    def output_ports(self):
        return [MyPorts.embedded_question,           # Question embeddings
                MyPorts.question_length,             # Lengths of the questions
                MyPorts.embedded_support,            # Support embeddings
                MyPorts.support_length,              # Lengths of the supports
                MyPorts.token_offsets  # Character offsets of tokens in support, used in the output module
               ]

    @property
    def training_ports(self):
        return [MyPorts.answer_span]                 # Answer span, one for each question

    # Now, we implement our preprocessing. This involves tokenization,
    # mapping to token IDs, mapping to token embeddings,
    # and computing the answer spans.

    def _get_emb(self, idx):
        """Maps a token ID to its respective embedding vector"""
        if idx < self.emb_matrix.shape[0]:
            return self.vocab.emb.lookup[idx]
        else:
            # <OOV>
            return np.zeros([self.vocab.emb_length])

    def preprocess(self, questions, answers=None, is_eval=False):
        """Maps a list of instances to a list of annotations.

        Since in our case, all instances can be preprocessed independently, we'll
        delegate the preprocessing to a `_preprocess_instance()` method.
        """

        if answers is None:
            answers = [None] * len(questions)

        return [self._preprocess_instance(q, a)
                for q, a in zip(questions, answers)]

    def _preprocess_instance(self, question, answers=None):
        """Maps an instance to an annotation.

        An annotation contains the embeddings and length of question and support,
        token offsets, and optionally answer spans.
        """

        has_answers = answers is not None

        # `prepare_data()` handles most of the computation in our case, but
        # you could implement your own preprocessing here.
        q_tokenized, q_ids, _, q_length, s_tokenized, s_ids, _, s_length, \
        word_in_question, offsets, answer_spans = \
            prepare_data(question, answers, self.vocab,
                         with_answers=has_answers,
                         max_support_length=100)
        # there is only 1 support
        s_tokenized, s_ids, s_length, offsets = s_tokenized[0], s_ids[0], s_length[0], offsets[0]

        # For both question and support, we'll fill an embedding tensor
        emb_support = np.zeros([s_length, self.emb_matrix.shape[1]])
        emb_question = np.zeros([q_length, self.emb_matrix.shape[1]])
        for k in range(len(s_ids)):
            emb_support[k] = self._get_emb(s_ids[k])
        for k in range(len(q_ids)):
            emb_question[k] = self._get_emb(q_ids[k])

        # Now, we build the annotation for the question instance. We'll use a
        # dict that maps from `TensorPort` to numpy array, but this could be
        # any data type, like a custom class, or a named tuple.

        annotation = {
            MyPorts.question_length: q_length,
            MyPorts.embedded_question: emb_question,
            MyPorts.support_length: s_length,
            MyPorts.embedded_support: emb_support,
            MyPorts.token_offsets: offsets
        }

        if has_answers:
            # For the purpose of this tutorial, we'll only use the first answer, such
            # that we will have exactly as many answers as questions.
            annotation[MyPorts.answer_span] = answer_spans[0][0]

        return numpify(annotation, keys=annotation.keys())

    def create_batch(self, annotations, is_eval, with_answers):
        """Now, we need to implement the mapping of a list of annotations to a feed dict.
        
        Because our annotations already are dicts mapping TensorPorts to numpy
        arrays, we only need to do padding here.
        """

        return {key: stack_and_pad([a[key] for a in annotations])
                for key in annotations[0].keys()}
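
As a rough picture of what batching with padding does, the following dependency-free sketch pads variable-length lists with zeros to a common length. It is a simplification and not the implementation of `stack_and_pad`: it only handles one-dimensional lists, and `pad_and_stack` and `create_toy_batch` are hypothetical helper names.

```python
def pad_and_stack(sequences, pad_value=0):
    """Pad variable-length lists to a common length and return a list of rows."""
    max_length = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_length - len(seq)) for seq in sequences]


def create_toy_batch(annotations):
    """Turn a list of per-instance dicts into one dict of padded batches."""
    return {key: pad_and_stack([a[key] for a in annotations])
            for key in annotations[0]}


annotations = [
    {"token_offsets": [0, 6, 8], "support_ids": [4, 1, 7]},
    {"token_offsets": [0, 5], "support_ids": [2, 9]},
]
batch = create_toy_batch(annotations)
print(batch["token_offsets"])  # [[0, 6, 8], [0, 5, 0]]
print(batch["support_ids"])    # [[4, 1, 7], [2, 9, 0]]
```

Because padded positions carry no information, the model needs the sequence lengths to mask them out later.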

## Implementing a Model Module

The _Model_ module defines the differentiable computation graph.
It takes _Input_ module outputs as inputs, and produces outputs (such as the loss, or logits)
that match the inputs to the _Output_ module.

We first look at a _TensorFlow_ implementation of the _Model_ module; further below you can find an implementation using _PyTorch_.

In [5]:
class MyModelModule(TFModelModule):

    @property
    def input_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.embedded_question,
                MyPorts.question_length,
                MyPorts.embedded_support,
                MyPorts.support_length]

    @property
    def output_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.start_scores,
                MyPorts.end_scores,
                MyPorts.span_prediction]

    @property
    def training_input_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.start_scores,
                MyPorts.end_scores,
                MyPorts.answer_span]

    @property
    def training_output_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.loss]

    def create_output(self, shared_resources, input_tensors):
        """
        Implements the "core" model: The TensorFlow subgraph which computes the
        answer span from the embedded question and support.
        Args:
            shared_resources: provides the configuration, including `repr_dim`
            input_tensors: maps the input ports to tensors:
                embedded_question: [B, L_q, N]
                question_length: [B]
                embedded_support: [B, L_s, N]
                support_length: [B]

        Returns:
            start_scores [B, L_s], end_scores [B, L_s], span_prediction [B, 2]
        """
        tensors = TensorPortTensors(input_tensors)
        with tf.variable_scope("fast_qa", initializer=tf.contrib.layers.xavier_initializer()):
            dim = shared_resources.config['repr_dim']
            # set shapes for inputs
            tensors.embedded_question.set_shape([None, None, dim])
            tensors.embedded_support.set_shape([None, None, dim])

            # encode question and support
            rnn = tf.contrib.rnn.LSTMBlockFusedCell
            encoded_question = sequence_encoder.bi_lstm(dim, tensors.embedded_question,
                                                        tensors.question_length, name='bilstm',
                                                        with_projection=True)

            encoded_support = sequence_encoder.bi_lstm(dim, tensors.embedded_support,
                                                       tensors.support_length, name='bilstm',
                                                       reuse=True, with_projection=True)

            start_scores, end_scores, predicted_start_pointer, predicted_end_pointer = \
                self._output_layer(dim, encoded_question, tensors.question_length,
                                   encoded_support, tensors.support_length)

            span = tf.concat([predicted_start_pointer, predicted_end_pointer], 1)

            return TensorPort.to_mapping(self.output_ports, (start_scores, end_scores, span))

    def _output_layer(self,
                      dim,
                      encoded_question,
                      question_length,
                      encoded_support,
                      support_length):
        """Simple span prediction layer of our network"""
        batch_size = tf.shape(question_length)[0]

        # Computing weighted question state
        attention_scores = tf.contrib.layers.fully_connected(encoded_question, 1,
                                                             scope="question_attention")
        q_mask = mask_for_lengths(question_length, batch_size)
        attention_scores = attention_scores + tf.expand_dims(q_mask, 2)
        question_attention_weights = tf.nn.softmax(attention_scores, 1,
                                                   name="question_attention_weights")
        question_state = tf.reduce_sum(question_attention_weights * encoded_question, [1])

        # Prediction
        support_mask = mask_for_lengths(support_length, batch_size)
        interaction = tf.expand_dims(question_state, 1) * encoded_support
        
        def predict():
            scores = tf.layers.dense(tf.concat([interaction, encoded_support], axis=2), 1)
            scores = tf.squeeze(scores, [2])
            scores = scores + support_mask
            _, predicted = tf.nn.top_k(scores, 1)
            return scores, predicted

        start_scores, predicted_start_pointer = predict()
        end_scores, predicted_end_pointer = predict()

        return start_scores, end_scores, predicted_start_pointer, predicted_end_pointer

    def create_training_output(self,
                               shared_resources,
                               input_tensors) -> Sequence[TensorPort]:
        """Compute loss from start & end scores and the gold-standard `answer_span`."""
        tensors = TensorPortTensors(input_tensors)
        start, end = [tf.squeeze(t, 1) for t in tf.split(tensors.answer_span_target, 2, 1)]

        start_score_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=tensors.start_scores,
                                                                          labels=start)
        end_score_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=tensors.end_scores,
                                                                        labels=end)
        loss = start_score_loss + end_score_loss
        return TensorPort.to_mapping(self.training_output_ports, [tf.reduce_mean(loss)])
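
The question summary in `_output_layer` relies on a length mask: a large negative number is added to the scores of padded positions so that they receive (almost) no weight after the softmax. The sketch below illustrates this in plain Python for a single question. The helper names and the mask value of `-1000.0` are assumptions for this example, not taken from `mask_for_lengths`.

```python
import math


def length_mask(length, max_length, value=-1000.0):
    """0.0 for real tokens, a large negative number for padded positions."""
    return [0.0 if i < length else value for i in range(max_length)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_summary(scores, vectors, length):
    """Attention-weighted sum of `vectors`, ignoring positions >= length."""
    mask = length_mask(length, len(scores))
    weights = softmax([s + m for s, m in zip(scores, mask)])
    dim = len(vectors[0])
    summary = [sum(w * vec[d] for w, vec in zip(weights, vectors))
               for d in range(dim)]
    return weights, summary


# A question with 2 real tokens, padded to length 3. The padded position
# has a high raw score, but the mask removes its influence.
scores = [1.0, 1.0, 5.0]
vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
weights, summary = weighted_summary(scores, vectors, length=2)
print(weights)
print(summary)
```

The same masking idea is applied to the support before the start and end pointers are chosen, so that a padded position cannot be predicted as part of the answer.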

## Implementing an Output Module

The _Output_ module converts model predictions from the differentiable computation graph into `Answer` instances.
Since our model is a standard extractive QA model, we could reuse the existing `XQAOutputModule`, rather than implementing our own.

In [6]:
class MyOutputModule(OutputModule):
    @property
    def input_ports(self) -> List[TensorPort]:
        return [MyPorts.span_prediction,
                MyPorts.token_offsets,
                MyPorts.start_scores,
                MyPorts.end_scores]
    
    def __call__(self,
                 questions,
                 input_tensors) -> Sequence[Answer]:
        """Produces best answer for each question."""
        answers = []
        tensors = TensorPortTensors(input_tensors)
        for i, question in enumerate(questions):
            offsets = tensors.token_offsets[i]
            start, end = tensors.span_prediction[i]
            score = tensors.start_scores[i, start] + tensors.end_scores[i, end]
            # map token to char span
            char_start = offsets[start]
            char_end = offsets[end + 1] if end < len(offsets) - 1 else len(question.support[0])
            answer = question.support[0][char_start: char_end]
            answer = answer.rstrip()
            char_end = char_start + len(answer)
            
            answers.append(Answer(answer, span=(char_start, char_end), score=score))

        return answers
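
The mapping from a predicted token span to a character span can be tried out in isolation. The following sketch mirrors the logic of `__call__` above on the toy support sentence used later in this tutorial; `token_span_to_text` is a hypothetical helper name, and the offsets are written out by hand for this example.

```python
def token_span_to_text(support, offsets, start, end):
    """Map an inclusive token span to (answer_text, (char_start, char_end))."""
    char_start = offsets[start]
    # The answer ends where the next token begins, or at the end of the text.
    char_end = offsets[end + 1] if end < len(offsets) - 1 else len(support)
    # Drop the whitespace that separates the answer from the next token.
    answer = support[char_start:char_end].rstrip()
    return answer, (char_start, char_start + len(answer))


support = "While b seems plausible, answer a is correct."
offsets = [0, 6, 8, 14, 23, 25, 32, 34, 37, 44]  # start index of each token

print(token_span_to_text(support, offsets, 6, 6))  # ('a', (32, 33))
print(token_span_to_text(support, offsets, 8, 9))  # ('correct.', (37, 45))
```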

## Putting Together All Modules

We are now ready to put together the above defined _Input_, _Model_, and _Output_ modules into one _Reader_.

For illustration purposes, we will use a toy data example with just one example question:

In [7]:
data_set = [
    (QASetting(
        question="Which is it?",
        support=["While b seems plausible, answer a is correct."],
        id="1"),
     [Answer(text="a", span=(32, 33))])
]

Before assembling the parts of our newly defined reader, we will need to define some shared resources, which all of the modules can depend on. This includes a vocabulary `Vocab`, and a configuration hyperparameter dictionary `config`.

We build the vocabulary directly from the above data set using the function `build_vocab()`, which also associates each word with random embedding vectors.

In [8]:
embedding_dim = 10

def build_vocab(questions):
    """Build a vocabulary of random vectors."""

    embedding_lookup = dict()
    for question in questions:
        for t in tokenize(question.question):
            if t not in embedding_lookup:
                embedding_lookup[t] = len(embedding_lookup)
    embeddings = Embeddings(embedding_lookup, 
                            np.random.random([len(embedding_lookup),
                                              embedding_dim]))

    vocab = Vocab(emb=embeddings, init_from_embeddings=True)
    return vocab

questions = [q for q, _ in data_set]
shared_resources = SharedResources(build_vocab(questions),
                                   config={'repr_dim': 10,
                                           'repr_dim_input': embedding_dim})
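
The idea behind a vocabulary of random vectors, combined with the zero-vector fallback for unknown tokens used in `_get_emb()`, can be sketched without Jack as follows. This is an illustration under simplifying assumptions: tokens are split on whitespace rather than with Jack's `tokenize`, and `build_toy_vocab` and `lookup` are hypothetical names.

```python
import random


def build_toy_vocab(texts, dim, seed=0):
    """Assign each distinct whitespace-separated token an ID and a random vector."""
    rng = random.Random(seed)
    token_to_id, vectors = {}, []
    for text in texts:
        for token in text.split():
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
                vectors.append([rng.random() for _ in range(dim)])
    return token_to_id, vectors


def lookup(token, token_to_id, vectors):
    """Return the token's vector, or a zero vector for out-of-vocabulary tokens."""
    idx = token_to_id.get(token)
    if idx is None:
        return [0.0] * len(vectors[0])
    return vectors[idx]


token_to_id, vectors = build_toy_vocab(["Which is it?"], dim=4)
print(token_to_id)                                # {'Which': 0, 'is': 1, 'it?': 2}
print(lookup("plausible", token_to_id, vectors))  # [0.0, 0.0, 0.0, 0.0]
```

Note that a vocabulary built only from question text leaves every word that appears only in the support out of vocabulary, so such words are embedded as zero vectors.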

We then instantiate our above defined modules with these `shared_resources` as input parameter.

In [9]:
tf.reset_default_graph()

input_module = MyInputModule(shared_resources)
model_module = MyModelModule(shared_resources)
output_module = MyOutputModule()

reader = TFReader(shared_resources, input_module, model_module, output_module)

At this point, the reader is complete! It is composed of the three modules and the shared resources, and is ready to be trained or to generate predictions.

In [15]:
batch_size = 1

hooks = [LossHook(reader, iter_interval=1)]
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
reader.train(optimizer, data_set, batch_size, max_epochs=10, hooks=hooks)

print()
print(questions[0].question, questions[0].support[0])
answers = reader(questions)
print("{}, {}, {}".format(answers[0].score, answers[0].span, answers[0].text))

INFO:jack.core.reader:Setting up data and model...
INFO:jack.core.input_module:OnlineInputModule pre-processes data on-the-fly in first epoch and caches results for subsequent epochs! That means, first epoch might be slower.
INFO:jack.core.reader:Start training...


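AttributeError: 'AdamOptimizer' object has no attribute 'zero_grad'

**Note:** This `AttributeError` appears when the cell is run while `reader` refers to a `PyTorchReader`, for example after executing the PyTorch cells further below first: `zero_grad` is a method of PyTorch optimizers, not of `tf.train.AdamOptimizer`. Re-run the cell that creates the `TFReader` before training with a TensorFlow optimizer.
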
**Note:** If you want to train your newly implemented model using the main training script `jack/train_reader.py`, you first have to register a name for your new model in `jack/readers/implementations.py`.

### Hooks
In the above example, we are making use of a _hook_. Hooks are used to monitor progress throughout training. For example, the `LossHook` monitors the loss throughout training, but other hooks can measure validation accuracy, time elapsed, etc. 
Jack comes with several hooks predefined (see `jack.util.hooks`), but you can always extend them or add your own.
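
The hook pattern itself is simple: the training loop calls every registered hook at fixed points, and each hook decides what to record or print. The sketch below shows the pattern with a toy training loop. `ToyLossHook`, `toy_train` and the method name `at_iteration_end` are assumptions for this illustration and should not be read as the interface of the hooks in `jack.util.hooks`.

```python
class ToyLossHook:
    """Hypothetical hook: records the loss and reports it every few iterations."""

    def __init__(self, iter_interval=1):
        self.iter_interval = iter_interval
        self.history = []

    def at_iteration_end(self, epoch, iteration, loss):
        self.history.append(loss)
        if iteration % self.iter_interval == 0:
            print("Epoch {}\tIter {}\ttrain loss {}".format(epoch, iteration, loss))


def toy_train(losses_per_epoch, hooks):
    """Stand-in for a training loop: calls every hook after each iteration."""
    iteration = 0
    for epoch, losses in enumerate(losses_per_epoch, start=1):
        for loss in losses:
            iteration += 1
            for hook in hooks:
                hook.at_iteration_end(epoch, iteration, loss)


hook = ToyLossHook(iter_interval=2)
toy_train([[4.0, 3.0], [2.0, 1.0]], [hook])
print(hook.history)  # [4.0, 3.0, 2.0, 1.0]
```

Keeping monitoring code in hooks leaves the training loop itself unchanged when you add a new metric.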


## Implementing a QA model in PyTorch

Above, we have implemented a complete reader from scratch, using _TensorFlow_ to define the differentiable computation graph in the _Model_ module. Let's now implement another reader that reuses as much as possible, but switches the framework from _TensorFlow_ to _PyTorch_.

All we need to do to accomplish this is to write another _Model_ module.

**Note:** The following code requires that you have installed [PyTorch](http://pytorch.org/).

### Differentiable Model Architecture (PyTorch)

Let's first define _PyTorch_ modules that define the differentiable model architecture. This is independent of Jack, but we do offer some convenience functions, similar to TF. 

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from jack.torch_util import embedding, misc, xqa
from jack.torch_util.highway import Highway
from jack.torch_util.rnn import BiLSTM

class MyPredictionTorchModule(nn.Module):
    def __init__(self, shared_resources):
        super(MyPredictionTorchModule, self).__init__()
        self._shared_resources = shared_resources
        repr_dim_input = shared_resources.config["repr_dim_input"]
        repr_dim = shared_resources.config["repr_dim"]
        
        # nn child modules
        self._bilstm = BiLSTM(repr_dim_input, repr_dim)
        self._linear_question_attention = nn.Linear(2 * repr_dim, 1, bias=False)
        self._linear_start_scores = nn.Linear(2 * repr_dim, 1, bias=False)
        self._linear_end_scores = nn.Linear(2 * repr_dim, 1, bias=False)


    def forward(self, emb_question, question_length, emb_support, support_length):
        # encode
        encoded_question = self._bilstm(emb_question)[0]
        encoded_support = self._bilstm(emb_support)[0]

        # answer
        # computing attention over question
        attention_scores = self._linear_question_attention(encoded_question)
        q_mask = misc.mask_for_lengths(question_length)
        attention_scores = attention_scores.squeeze(2) + q_mask
        question_attention_weights = F.softmax(attention_scores, dim=1)
        question_state = torch.matmul(question_attention_weights.unsqueeze(1),
                                      encoded_question).squeeze(1)
        
        # broadcast the question state over all support tokens: [B, 1, D] * [B, L_s, D]
        interaction = question_state.unsqueeze(1) * encoded_support
        # Prediction
        start_scores = self._linear_start_scores(interaction).squeeze(2)
        end_scores = self._linear_end_scores(interaction).squeeze(2)
        # Mask
        support_mask = misc.mask_for_lengths(support_length)
        start_scores += support_mask
        end_scores += support_mask

        _, predicted_start_pointer = start_scores.max(1)
        _, predicted_end_pointer = end_scores.max(1)
        
        # end pointer cannot come before start
        predicted_end_pointer = torch.max(predicted_end_pointer, predicted_start_pointer)

        span = torch.stack([predicted_start_pointer, predicted_end_pointer], 1)
        return start_scores, end_scores, span
    
class MyLossTorchModule(nn.Module):
    def forward(self, start_scores, end_scores, answer_span):
        start, end = answer_span[:, 0], answer_span[:, 1]
        
        # start prediction loss: log-probability of each example's gold start token
        loss = -F.log_softmax(start_scores, dim=1).gather(1, start.long().unsqueeze(1))
        # end prediction loss: log-probability of each example's gold end token
        loss = loss - F.log_softmax(end_scores, dim=1).gather(1, end.long().unsqueeze(1))
        
        # mean loss over the current batch
        return loss.mean()
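
The quantity this loss module is meant to compute is, for each example, the negative log-probability of the gold start token plus that of the gold end token, averaged over the batch. The following plain-Python sketch spells this out for nested lists instead of tensors; `log_softmax` and `span_loss` are hypothetical helpers written for this illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    highest = max(scores)
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]


def span_loss(start_scores, end_scores, answer_spans):
    """Mean over the batch of -log p(start) - log p(end)."""
    losses = []
    for starts, ends, (start, end) in zip(start_scores, end_scores, answer_spans):
        losses.append(-log_softmax(starts)[start] - log_softmax(ends)[end])
    return sum(losses) / len(losses)


# Uniform scores over 4 tokens: each pointer costs log(4).
uniform_loss = span_loss([[0.0] * 4], [[0.0] * 4], [(1, 2)])
# Scores that strongly favour the gold positions give a much smaller loss.
confident_loss = span_loss([[0.0, 9.0, 0.0, 0.0]], [[0.0, 0.0, 9.0, 0.0]], [(1, 2)])
print(uniform_loss, confident_loss)
```

Each example selects only its own gold positions, which is why a per-row lookup is needed once the batch contains more than one example.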

### Implementing the Jack _Model_ Module with PyTorch 

After defining our `torch.nn.Module` classes, we can use them in a Jack `ModelModule`. Note that the signatures of the `forward` methods above must match the `TensorPort` signature of the `ModelModule`.

In [12]:
from jack.core.torch import PyTorchModelModule, PyTorchReader


class MyTorchModelModule(PyTorchModelModule):

    @property
    def input_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.embedded_question,
                MyPorts.question_length,
                MyPorts.embedded_support,
                MyPorts.support_length]

    @property
    def output_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.start_scores,
                MyPorts.end_scores,
                MyPorts.span_prediction]

    @property
    def training_input_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.start_scores,
                MyPorts.end_scores,
                MyPorts.answer_span]

    @property
    def training_output_ports(self) -> Sequence[TensorPort]:
        return [MyPorts.loss]
    
    
    def create_loss_module(self, shared_resources: SharedResources):
        return MyLossTorchModule()

    def create_prediction_module(self, shared_resources: SharedResources):
        return MyPredictionTorchModule(shared_resources)

After defining our new `PyTorchModelModule`, we can create our reader as before, this time by instantiating a `PyTorchReader` rather than a `TFReader`.

In [13]:
input_module = MyInputModule(shared_resources)
model_module = MyTorchModelModule(shared_resources)  # was MyModelModule
output_module = MyOutputModule()

reader = PyTorchReader(shared_resources,
                       input_module,
                       model_module,
                       output_module)  # was TFReader

Interacting with the instantiated readers is transparent. For the user it doesn't matter whether it is a `TFReader` or a `PyTorchReader`.

In [14]:
batch_size = 1

# the reader must already be set up at this point, so that the torch parameters exist
reader.setup_from_data(data_set, is_training=True)
optimizer = torch.optim.Adam(reader.model_module.prediction_module.parameters(), lr=0.1)
hooks = [LossHook(reader, iter_interval=1)]

reader.train(optimizer,
             data_set,
             batch_size,
             max_epochs=10,
             hooks=hooks)

print()
print(questions[0].question, questions[0].support[0])
answers = reader(questions)
print("{}, {}, {}".format(answers[0].score, answers[0].span, answers[0].text))

INFO:jack.core.reader:Setting up data and model...
INFO:jack.core.input_module:OnlineInputModule pre-processes data on-the-fly in first epoch and caches results for subsequent epochs! That means, first epoch might be slower.
INFO:jack.core.reader:Start training...
INFO:jack.util.hooks:Epoch 1	Iter 1	train loss 4.596769332885742
INFO:jack.util.hooks:Epoch 2	Iter 2	train loss 4.3474507331848145
INFO:jack.util.hooks:Epoch 3	Iter 3	train loss 3.7517993450164795
INFO:jack.util.hooks:Epoch 4	Iter 4	train loss 3.123795986175537
INFO:jack.util.hooks:Epoch 5	Iter 5	train loss 3.4127023220062256
INFO:jack.util.hooks:Epoch 6	Iter 6	train loss 1.4740217924118042
INFO:jack.util.hooks:Epoch 7	Iter 7	train loss 0.8284779191017151
INFO:jack.util.hooks:Epoch 8	Iter 8	train loss 0.4712883234024048
INFO:jack.util.hooks:Epoch 9	Iter 9	train loss 0.2944047451019287
INFO:jack.util.hooks:Epoch 10	Iter 10	train loss 0.8605778813362122

Which is it? While b seems plausible, answer a is correct.
12.947671890258