# Task: A sequence-to-sequence model trained with MLE and Policy Gradient in Texar

We build two versions of a simple sequence-to-sequence model with attention:

- The first version is trained with an MLE objective only.
- The second version is pretrained with MLE and then trained further with Policy Gradient.

The goal of the task is simply to get to know Texar a bit, nothing else. The dataset is a toy dataset from Google whose task is simply to reverse the input, and without proper hyperparameter tuning (which we won't do) Policy Gradient quickly collapses, even after the MLE pretraining.

Consequently, the task is only to build the models and get the training running; there is no target performance that you have to achieve. In addition, almost all of the code that is needed can be found in the Texar documentation.

# Texar prerequisites

In [None]:
# Texar needs TF 1.x!
%tensorflow_version 1.x
! pip install texar

In [2]:
import texar.tf as tx
import tensorflow as tf
import numpy as np

# Downloading the data
We download and extract the toy dataset with Texar's download utility function:

In [3]:
tx.data.maybe_download(
    urls='https://drive.google.com/file/d/'
         '1fENE2rakm8vJ8d3voWBgW4hGlS6-KORW/view?usp=sharing',
    path='./',
    filenames='toy_copy.zip',
    extract=True)

Successfully downloaded toy_copy.zip.


['./toy_copy.zip']

After extraction, the data is already in the format that Texar's data readers expect.

# The basic model: RNN Seq2seq with attention trained with MLE

## Model parameters

In Texar, hyperparameters are typically represented by multi-level dictionaries (or dictionary-like texar.HParams instances). Before building the model, we define a minimal set of hyperparameter dictionaries for the embedding, encoder, decoder and attention.
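
To make the idea of multi-level hyperparameter dictionaries concrete, here is a minimal plain-Python sketch of how a partial, user-supplied dictionary could be merged into a full set of defaults. The helper `merge_hparams` and the default values are hypothetical and made up for this illustration; this is not Texar's actual `HParams` implementation.

```python
import copy


def merge_hparams(defaults, overrides):
    """Return a copy of `defaults` with the values in `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: descend one level.
            merged[key] = merge_hparams(merged[key], value)
        else:
            # Leaf value (or new key): the user-supplied value wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults, invented for this illustration only.
default_decoder_hparams = {
    "rnn_cell": {"type": "LSTMCell", "kwargs": {"num_units": 256}},
    "max_decoding_length_train": None,
}
# The user only specifies what should differ from the defaults.
user_hparams = {"rnn_cell": {"kwargs": {"num_units": 64}}}

merged = merge_hparams(default_decoder_hparams, user_hparams)
print(merged["rnn_cell"])
```

Only the overridden leaf changes; every other default is kept, which is why the dictionaries in the next cell can stay so short.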

In [2]:
# We want to use the same dimensionality ("number of units") for the embedding, and the encoder and decoder RNNs
num_units = ... # Please, specify a reasonable number

# For inference, the model will use beam search
beam_width = ... # Specify a reasonable number (remember that the search time is not 
                 # linear with respect to this parameter!)

embedder_hparams = {"dim": num_units}
encoder_hparams = {
    'rnn_cell_fw': {
        'kwargs': {
            'num_units': num_units
        }
    }
}
decoder_hparams = {
    'rnn_cell': {
        'kwargs': {
            'num_units': num_units
        },
    },
    'attention': {
        'kwargs': {
            'num_units': num_units,
        },
        'attention_layer_size': num_units
    }
}

## Building the model

First we build a simple seq2seq model with attention.

In [None]:
def build_mle_model(batch, train_data):
    """Build a basic seq2seq model with attention for MLE training.
    """
    
    # Please define a word embedding layer for the Encoder using Texar's API.
    # For hyperparameters, use the embedder hparams defined in the previous cell.
    source_embedder = ...(
        vocab_size=train_data.source_vocab.size, hparams=...)
    
    # For encoder, use a Bidirectional RNN encoder from the Texar API.
    # hparams were defined above.
    encoder = ...(
        hparams=...)

    enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))
    
    # Please define a word embedding layer for the Decoder using Texar's API.
    # For hyperparameters, use the embedder hparams defined in the previous cell.
    target_embedder = ...(
        vocab_size=train_data.target_vocab.size, hparams=...)

    # The decoder should be a Texar Attention RNN decoder with the hyperparameters 
    # defined above
    decoder = ...(
        memory=tf.concat(enc_outputs, axis=2),
        memory_sequence_length=batch['source_length'],
        vocab_size=train_data.target_vocab.size,
        hparams=...)

    # For MLE training, we use greedy decoding with teacher forcing,
    # which is why the decoder input comes from the target text
    mle_training_outputs, _, _ = decoder(
        decoding_strategy=..., # Please specify greedy training decoding here 
                               # see the possible values in the Texar "Decoders" documentation section
        inputs=target_embedder(batch['target_text_ids'][:, :-1]),
        sequence_length=batch['target_length'] - 1)

    # The loss for MLE training is the familiar sparse softmax cross entropy
    # Please use the Texar version of it here!
    mle_loss = ...(
        labels=batch['target_text_ids'][:, 1:],
        logits=mle_training_outputs.logits,
        sequence_length=batch['target_length'] - 1)

    # Texar produces a train op from the loss for us:
    mle_train_op = tx.core.get_train_op(mle_loss)

    # For inference (text generation), we need the start (bos) tokens from the data set
    # and we produce here a whole vector of them, for the entire batch.
    start_tokens = tf.ones_like(batch['target_length']) * train_data.target_vocab.bos_token_id

    # Inference (text generation) by beam search -- nothing to do here, just observe!
    beam_search_outputs, _, _ = \
        tx.modules.beam_search_decode(
            decoder_or_cell=decoder,
            embedding=target_embedder,
            start_tokens=start_tokens,
            end_token=train_data.target_vocab.eos_token_id,
            beam_width=beam_width,
            max_decoding_length=60)

    # Having built the model, we need to return two things that are needed for training
    # and evaluating it: the MLE training op and the beam search output.
    # Please add these in the next line (in this order)!
    return ..., ...
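
The slicing used for teacher forcing above (`[:, :-1]` for the decoder inputs, `[:, 1:]` for the labels, and `target_length - 1` for the sequence length) can be illustrated on a single toy sequence. The token ids below are hypothetical, and the sketch assumes that the target length counts the BOS and EOS tokens but not the padding.

```python
BOS, EOS, PAD = 1, 2, 0  # hypothetical special token ids


def teacher_forcing_pair(target_ids, target_length):
    """Split one padded target sequence into decoder inputs and labels.

    The decoder reads every token except the last one and has to predict
    every token except the first one, so both are one step shorter.
    """
    decoder_inputs = target_ids[:-1]
    labels = target_ids[1:]
    return decoder_inputs, labels, target_length - 1


target_ids = [BOS, 7, 8, 9, EOS, PAD]
target_length = 5  # BOS + three words + EOS

inputs, labels, length = teacher_forcing_pair(target_ids, target_length)

# Only the first `length` positions contribute to the loss.
print(inputs[:length])  # [1, 7, 8, 9]
print(labels[:length])  # [7, 8, 9, 2]
```

At every position the label is the token that follows the corresponding input, and the shortened length keeps the padding out of the loss.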

## Data sets and training parameters

In [None]:
source_vocab_file = './data/toy_copy/train/vocab.sources.txt'
target_vocab_file = './data/toy_copy/train/vocab.targets.txt'

mle_training_num_epochs  = ... # Please specify the number of training epochs
steps_per_train_epochs = 312 # Don't touch this, this is the correct value for the toy dataset
batch_size = ... # Please specify a batch size

display = 50

# Texar hparams for the toy dataset

train_hparams = {
    'num_epochs': 500, # We set this to an inexhaustible number because of a Texar bug!
    'batch_size': batch_size,
    'allow_smaller_final_batch': False,
    'source_dataset': {
        "files": './data/toy_copy/train/sources.txt',
        'vocab_file': source_vocab_file
    },
    'target_dataset': {
        'files': './data/toy_copy/train/targets.txt',
        'vocab_file': target_vocab_file
    }
}
val_hparams = {
    'batch_size': batch_size,
    'allow_smaller_final_batch': False,
    'source_dataset': {
        "files": './data/toy_copy/dev/sources.txt',
        'vocab_file': source_vocab_file
    },
    'target_dataset': {
        "files": './data/toy_copy/dev/targets.txt',
        'vocab_file': target_vocab_file
    }
}
test_hparams = {
    'batch_size': batch_size,
    'allow_smaller_final_batch': False,
    'source_dataset': {
        "files": './data/toy_copy/test/sources.txt',
        'vocab_file': source_vocab_file
    },
    'target_dataset': {
        "files": './data/toy_copy/test/targets.txt',
        'vocab_file': target_vocab_file
    }
}



In [None]:
# All of our data sets consist of paired texts -- please specify the correct
# Texar data class in the next three lines:

train_data = ...(hparams=train_hparams)
val_data = ...(hparams=val_hparams)
test_data = ...(hparams=test_hparams)

# Texar's data iterators are thin wrappers around the TensorFlow Dataset API
# Please put Texar's data iterator here which can switch between train, test and validation data
iterator = ...(train=train_data, val=val_data, test=test_data)
batch = iterator.get_next()
train_op, infer_outputs = build_mle_model(batch, train_data) # build the model, get train and inference outputs

## MLE training

In [None]:
# Now we manually write the training loops...
# not as cosy as Keras, for sure..
# Nothing to do in this cell, just observe

def mle_train_epoch(sess, iterator, train_op):
    """Train the Seq2Seq model for an epoch.
    sess is a TF session to use, 
    iterator is a TrainTestDataIterator with the data,
    train_op is training op in the model's graph.
    """
    iterator.switch_to_train_data(sess)
    for step in range(steps_per_train_epochs):
        try:
            loss = sess.run(train_op) # Run graph until the train op
            if step % display == 0:
                print("step={}, loss={:.4f}".format(step, loss))
        except tf.errors.OutOfRangeError:
            break


def eval_epoch(sess, mode, iterator, batch):
    """ Evaluate an epoch. Mode is 'test' or 'val'.
    """
    if mode == 'val':
        iterator.switch_to_val_data(sess)
    else:
        iterator.switch_to_test_data(sess)

    refs, hypos = [], []
    while True:
        try:
            # fetches are what we want to get back from the session
            # in this case the target texts and the predicted texts
            fetches = [
                batch['target_text'][:, 1:],
                infer_outputs.predicted_ids[:, :, 0]
            ]
            feed_dict = {
                tx.global_mode(): tf.estimator.ModeKeys.PREDICT,
            }
            target_texts, output_ids = \
                sess.run(fetches, feed_dict=feed_dict)

            target_texts = tx.utils.strip_special_tokens(target_texts)
            output_texts = tx.utils.map_ids_to_strs(
                ids=output_ids, vocab=val_data.target_vocab)

            for hypo, ref in zip(output_texts, target_texts):
                hypos.append(hypo)
                refs.append([ref])
        except tf.errors.OutOfRangeError:
            break
    # For evaluation we want to use a BLEU variant:
    # please put here Texar's "moses" corpus BLEU variant.
    return ...(list_of_references=refs,
                                        hypotheses=hypos)


def mle_train_and_eval(sess, iterator, batch, train_op):
    """Train the model with MLE and eval.
    """
    best_val_bleu = -1.
    for i in range(mle_training_num_epochs):
        mle_train_epoch(sess, iterator, train_op)

        val_bleu = eval_epoch(sess, 'val', iterator, batch)
        best_val_bleu = max(best_val_bleu, val_bleu)
        print('val epoch={}, BLEU={:.4f}; best-ever={:.4f}'.format(
            i, val_bleu, best_val_bleu))

        test_bleu = eval_epoch(sess, 'test', iterator, batch)
        print('test epoch={}, BLEU={:.4f}'.format(i, test_bleu))

        print('=' * 50)


# The only thing left is to run the training and evaluation:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    sess.run(tf.tables_initializer())
    mle_train_and_eval(sess, iterator, batch, train_op)
   

# Attention Seq2Seq with Policy gradient

## Building the model

In [None]:
def build_rl_model(batch, train_data):
    """Build a seq2seq model trained with Policy Gradient.
    """
    
    # Our RL-trained model will be almost totally the same as the previous one,
    # except that we add sampled outputs for Policy Gradient.
    # So, please repeat here the missing elements of the previous model. (Copy & paste...)
    
    source_embedder = ...
    
    encoder = ...
    
    enc_outputs, _ = encoder(source_embedder(batch['source_text_ids']))

    target_embedder = ...

    decoder = ...

    # MLE pretraining

    mle_training_outputs, _, _ = ...
    
    mle_loss = ...

    mle_train_op = tx.core.get_train_op(mle_loss)

    start_tokens = tf.ones_like(batch['target_length']) * train_data.target_vocab.bos_token_id

    beam_search_outputs, _, _ = \
        tx.modules.beam_search_decode(
            decoder_or_cell=decoder,
            embedding=target_embedder,
            start_tokens=start_tokens,
            end_token=train_data.target_vocab.eos_token_id,
            beam_width=beam_width,
            max_decoding_length=60)    
    
    # Here comes the novelty...
    # We need random sampling for Policy Gradient
    sampled_outputs, _, sequence_length = decoder(
        decoding_strategy= ..., # Please add here the correct 'decoding strategy' for random sampling
        start_tokens=start_tokens,
        end_token=train_data.target_vocab.eos_token_id,
        embedding=target_embedder,
        max_decoding_length=30)

    # We need to return a bit more things from the graph for Policy Gradient
    return sampled_outputs, mle_train_op, sequence_length, beam_search_outputs
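
The difference between greedy decoding and the random sampling that Policy Gradient needs can be shown on a single next-token distribution. The sketch below is plain Python with made-up probabilities and hypothetical function names; it does not use the Texar API.

```python
import random


def greedy_pick(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_pick(probs, rng):
    """Draw a token index at random, proportionally to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.6, 0.3]  # made-up next-token distribution
rng = random.Random(0)   # fixed seed so that runs are repeatable

greedy_token = greedy_pick(probs)
sampled_tokens = [sample_pick(probs, rng) for _ in range(10)]

print(greedy_token)
print(sampled_tokens)
```

Greedy decoding always returns the same token for a given distribution, whereas sampling can also return less likely tokens. This exploration is what gives the agent different sequences to reward.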

## Data sets and iterator

In [None]:
tf.reset_default_graph()

# Please repeat here the previous definitions for our data sets and iterator!

train_data = ...
val_data = ...
test_data = ...

iterator = ...

batch = iterator.get_next()

# We build the model:
sampled_outputs, mle_train_op, sequence_length, infer_outputs = build_rl_model(batch, train_data)


## Agent definition

In [None]:
# Now a crucial point: we need to create a Texar Sequence Policy Gradient Agent
# Please specify the correct Texar class!
agent = ...(
    samples=sampled_outputs.sample_id,
    logits=sampled_outputs.logits,
    sequence_length=sequence_length,
    hparams={'discount_factor': 0.95, 'entropy_weight': 0.5})
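
The `discount_factor` hyperparameter controls how strongly a reward is propagated back to earlier decoding steps. Below is a minimal plain-Python sketch of one common scheme, assuming a single sequence-level reward that is assigned to the last step. The function name is hypothetical, and whether Texar's agent uses exactly this scheme (or additionally normalizes the rewards) should be checked in the Texar documentation.

```python
def discounted_step_rewards(reward, length, discount):
    """Spread one sequence-level reward over `length` decoding steps.

    The last step receives the full reward; each earlier step receives the
    reward of the following step multiplied by `discount`.
    """
    return [reward * discount ** (length - 1 - t) for t in range(length)]


step_rewards = discounted_step_rewards(reward=1.0, length=3, discount=0.5)
print(step_rewards)  # [0.25, 0.5, 1.0]
```

With a discount factor close to 1, such as the 0.95 used above, early decoding steps still receive most of the final reward.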

## Policy gradient training

In [None]:
# PG training and evaluation function
def pg_train_and_eval_epoch(sess, agent, iterator, batch):
    best_val_bleu = -1.
    for step in range(steps_per_train_epochs):
        iterator.switch_to_train_data(sess)

        
        extra_fetches = {
            'truth': batch['target_text_ids'],
        }

        # The agent needs to get the samples with the current policy.
        # Please add the correct agent method in the next line!!
        # to be clear: you will need something like 
        # fetches = agent.<METHOD_NAME>(extra_fetches=extra_fetches) 
        # here.
        fetches = agent....(extra_fetches=extra_fetches)

        sample_text = tx.utils.map_ids_to_strs(
            fetches['samples'], train_data.target_vocab,
            strip_eos=False, join=False)
        truth_text = tx.utils.map_ids_to_strs(
            fetches['truth'], train_data.target_vocab,
            strip_eos=False, join=False)    

        # Compute the rewards
        reward = []
        for ref, hyp in zip(truth_text, sample_text):
            r = tx.evals.sentence_bleu([ref], hyp, smooth=True)
            reward.append(r)

        # Now we need to do the actual weight updates with the policy gradient,
        # in the Texar API this is called "observing".
        # Please add, again, the correct method name in the next line!
        loss = agent....(reward=reward)

        # Displays & evaluates
        if step == 1 or step % display == 0:
            print("step={}, loss={:.4f}, reward={:.4f}".format(
                step, loss, np.mean(reward)))

        if step % display == 0:
            val_bleu = eval_epoch(sess, 'val', iterator, batch)
            best_val_bleu = max(best_val_bleu, val_bleu)
            print('val step={}, BLEU={:.4f}; best-ever={:.4f}'.format(
                step, val_bleu, best_val_bleu))

            test_bleu = eval_epoch(sess, 'test', iterator, batch)
            print('test step={}, BLEU={:.4f}'.format(step, test_bleu))
            print('=' * 50)
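
For orientation, the quantity that a policy gradient update minimizes can be sketched in a few lines of plain Python. This is a simplified REINFORCE-style loss for a single sampled sequence, without an entropy term or any reward normalization. All function names are hypothetical, and this is not the code that Texar runs internally.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reinforce_loss(step_logits, sampled_ids, step_rewards):
    """Sum over steps of -reward * log-probability of the sampled token."""
    loss = 0.0
    for logits, token, reward in zip(step_logits, sampled_ids, step_rewards):
        loss -= reward * log_softmax(logits)[token]
    return loss


# One decoding step, two equally likely tokens, reward 1.0:
loss = reinforce_loss([[0.0, 0.0]], [0], [1.0])
print(round(loss, 4))  # 0.6931
```

Minimizing this loss raises the probability of sampled tokens that earned a high reward. With untuned hyperparameters, the same mechanism can quickly push the model away from its MLE solution.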



## Running it

In [None]:
mle_training_num_epochs = ... # Specify the number of MLE training epochs!
pg_train_num_epochs = ... # Specify the number of PG training epochs!

# Now we can run the training and see how the (untuned) Policy Gradient training quickly ruins 
# the MLE results...
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    sess.run(tf.tables_initializer())

    print("== Starting MLE pretraining ==")

    mle_train_and_eval(sess, iterator, batch, mle_train_op)

    print("== Starting PG training ==")

    agent.sess = sess

    for epoch in range(pg_train_num_epochs):
        print('=' * 50)
        print('== EPOCH NO', epoch, '==')
        print('=' * 50)
        pg_train_and_eval_epoch(sess, agent, iterator, batch)
        