## Translation using tensor2tensor on Cloud ML Engine

This notebook illustrates using the <a href="https://github.com/tensorflow/tensor2tensor">tensor2tensor</a> library to do from-scratch, distributed training of an English-German translator. The trained model is then deployed to Cloud ML Engine and used to translate new pieces of text.
### Install tensor2tensor, and specify Google Cloud Platform project and bucket

In [None]:
%bash
pip install tensor2tensor

In [1]:
import os
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION

### Load bucket with data

We'll download the training and dev datasets and copy them to Cloud Storage.

In [None]:
%bash
wget http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
wget http://data.statmt.org/wmt17/translation-task/dev.tgz

In [None]:
%bash
# -m is a top-level gsutil option (parallel transfer), so it goes before the cp command
gsutil -m cp training-parallel-nc-v12.tgz dev.tgz gs://${BUCKET}/translate_ende/

### Set up a Problem
In tensor2tensor, a Problem is the class where you specify parameters such as the size of your vocabulary and where the training data comes from.

In [4]:
import tensorflow as tf
from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import wsj_parsing
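# TranslateProblem is the base class; _compile_data, token_generator and EOS are
# the helpers that generator() below calls (assumed to live in the wmt module
# of the installed tensor2tensor release).
from tensor2tensor.data_generators.wmt import (
    TranslateProblem, _compile_data, token_generator, EOS)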
from tensor2tensor.utils import registry

_ENDE_TRAIN_DATASETS = [
    [
        "./training-parallel-nc-v12.tgz",
        ("training/news-commentary-v12.de-en.en",
         "training/news-commentary-v12.de-en.de")
    ],
]
_ENDE_TEST_DATASETS = [
    [
        "./dev.tgz",
        ("dev/newstest2013.en", "dev/newstest2013.de")
    ],
]

@registry.register_problem
class MyTranslateProblem(TranslateProblem):
  @property
  def targeted_vocab_size(self):
    return 2**13  # 8192

  def generator(self, data_dir, tmp_dir, train):
    symbolizer_vocab = generator_utils.get_or_generate_vocab(
        data_dir, tmp_dir, self.vocab_file, self.targeted_vocab_size)
    datasets = _ENDE_TRAIN_DATASETS if train else _ENDE_TEST_DATASETS
    tag = "train" if train else "dev"
    data_path = _compile_data(tmp_dir, datasets, "wmt_ende_tok_%s" % tag)
    return token_generator(data_path + ".lang1", data_path + ".lang2",
                           symbolizer_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_TOK

  @property
  def target_space_id(self):
    return problem.SpaceID.DE_TOK
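
Because the dataset lists above name specific files inside each archive, it can help to confirm that those files are really there before starting a long data-generation run. The following is a minimal sketch using only the Python standard library; the helper name `missing_members` is hypothetical and is not part of tensor2tensor.

```python
import tarfile


def missing_members(archive_path, expected_members):
    """Return the expected member paths that are absent from a .tgz archive.

    Hypothetical helper for illustration; not a tensor2tensor function.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        # Some archives store names with a leading "./", so normalize them.
        present = {
            name[2:] if name.startswith("./") else name
            for name in archive.getnames()
        }
    return [name for name in expected_members if name not in present]
```

For example, `missing_members("./dev.tgz", ["dev/newstest2013.en", "dev/newstest2013.de"])` should return an empty list if the archive has the layout that `_ENDE_TEST_DATASETS` assumes.
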

In [5]:
%bash
PROBLEM=MyTranslateProblem
DATA_DIR=./t2t_data
TMP_DIR=/tmp/t2t_datagen
rm -rf $DATA_DIR $TMP_DIR
mkdir -p $DATA_DIR $TMP_DIR
# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

Traceback (most recent call last):
  File "/usr/local/bin/t2t-datagen", line 213, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/local/bin/t2t-datagen", line 160, in main
    raise ValueError(error_msg)
ValueError: You must specify one of the supported problems to generate data for:
  * algorithmic_addition_binary40
  * algorithmic_addition_decimal40
  * algorithmic_algebra_inverse
  * algorithmic_cipher_shift200
  * algorithmic_cipher_shift5
  * algorithmic_cipher_vigenere200
  * algorithmic_cipher_vigenere5
  * algorithmic_identity_binary40
  * algorithmic_identity_decimal40
  * algorithmic_multiplication_binary40
  * algorithmic_multiplication_decimal40
  * algorithmic_reverse_binary40
  * algorithmic_reverse_decimal40
  * algorithmic_reverse_nlplike32k
  * algorithmic_reverse_nlplike8k
  * algorithmic_shift_decimal40
  * audio_timit_ch
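
The error above is consistent with how tensor2tensor looks up problems. To the best of our understanding, `@registry.register_problem` registers a class under the snake_case form of its class name, so `MyTranslateProblem` would be listed as `my_translate_problem`. In addition, `t2t-datagen` runs as a separate process, so a class defined only inside this notebook is not visible to it unless the code lives in a module that is passed with `--t2t_usr_dir`. The sketch below only illustrates the naming convention; `camel_to_snake` is a hypothetical helper written for this notebook, not the library's own implementation.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase class name to snake_case (illustrative only)."""
    # Insert an underscore wherever a lowercase letter or digit is
    # immediately followed by an uppercase letter, then lowercase everything.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


print(camel_to_snake("MyTranslateProblem"))  # my_translate_problem
```

Under these assumptions, the value to pass as `--problem` would be `my_translate_problem` rather than the class name itself.
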

In [6]:
%bash
t2t-trainer --registry_help

INFO:tensorflow:
Registry contents:
------------------

  Models:
    aligned:
      * aligned
    attention:
      * attention_lm
      * attention_lm_moe
    blue:
      * blue_net
    byte:
      * byte_net
    cycle:
      * cycle_gan
    diagonal:
      * diagonal_neural_gpu
    gene:
      * gene_expression_conv
    lstm:
      * lstm_seq2seq
      * lstm_seq2seq_attention
    multi:
      * multi_model
    neural:
      * neural_gpu
    shake:
      * shake_shake
    slice:
      * slice_net
    transformer:
      * transformer
      * transformer_ae
      * transformer_alt
      * transformer_decoder
      * transformer_encoder
      * transformer_moe
      * transformer_revnet
    xception:
      * xception

  HParams:
    aligned:
      * aligned_8k
      * aligned_base
      * aligned_local
      * aligned_local_1k
      * aligned_local_expert
      * aligned_memory_efficient
      * aligned_moe
      * aligned_no_att
      * aligned_no_timing
      * aligned_pos_emb
      *