# FAIRSeq in Amazon SageMaker: Translation task - German to English - Distributed / multi machine training

The Facebook AI Research (FAIR) lab has made its state-of-the-art sequence-to-sequence models available through the [FAIRSeq toolkit](https://github.com/pytorch/fairseq).

In this notebook, we will show you how to train a German to English translation model using a fully convolutional architecture on multiple GPUs and machines.

## Permissions

Running this notebook requires permissions in addition to the regular SageMakerFullAccess permissions, because it creates new repositories in Amazon ECR. The easiest way to add these permissions is to attach the managed policy AmazonEC2ContainerRegistryFullAccess to the role that you used to start your notebook instance. There's no need to restart your notebook instance afterwards; the new permissions are available immediately.
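
As a rough sketch, the same managed policy could also be attached programmatically with boto3 instead of through the console. The helper names `role_name_from_arn` and `attach_ecr_policy` are illustrative and not part of this notebook, and the sketch assumes it is run with credentials that are allowed to call `iam:AttachRolePolicy` (the notebook role itself often is not).

```python
# Managed policy named in the paragraph above.
ECR_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"


def role_name_from_arn(role_arn):
    """Return the role name, i.e. the last path component of an IAM role ARN."""
    # arn:aws:iam::123456789012:role/service-role/MyRole -> MyRole
    return role_arn.split(":role/", 1)[1].split("/")[-1]


def attach_ecr_policy(role_arn):
    """Attach the ECR managed policy to the given role (hypothetical helper)."""
    import boto3  # imported lazily so role_name_from_arn works without AWS access

    iam = boto3.client("iam")
    iam.attach_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyArn=ECR_POLICY_ARN,
    )
```

From a suitably privileged session, this would be called as `attach_ecr_policy(sagemaker.get_execution_role())`.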

## Prepare dataset

To train the model, we will be using the IWSLT'14 dataset as described [here](https://github.com/pytorch/fairseq/tree/master/examples/translation#prepare-iwslt14sh). This dataset was used in the IWSLT'14 German to English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al](http://workshop2014.iwslt.org/downloads/proceeding.pdf).

First, we'll download the dataset and start the pre-processing. Among other steps, this pre-processing cleans the tokens and applies BPE (byte pair encoding), as you can see [here](https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt14.sh).
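
For intuition about the BPE step, here is a toy sketch of how byte pair encoding learns merges: it repeatedly finds the most frequent adjacent symbol pair and fuses it into one symbol. This is not the implementation used by the pre-processing script; the function names and the tiny word-frequency table are made up for illustration.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Fuse every whole-symbol occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the learned merges (in order) and the re-segmented vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Words are pre-split into characters; "</w>" marks the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(segmented)
```

With this toy table, the first three merges build up the frequent suffix `est</w>`, which is how BPE ends up representing rare words as sequences of common subword units.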

In [2]:
%%sh
cd data
chmod +x prepare-iwslt14.sh

# Download dataset and start pre-processing
./prepare-iwslt14.sh

Cloning Moses github repository (for tokenization scripts)...
Cloning Subword NMT repository (for BPE pre-processing)...
Downloading data from https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz...
Data successfully downloaded.
de-en/
de-en/IWSLT14.TED.dev2010.de-en.de.xml
de-en/IWSLT14.TED.dev2010.de-en.en.xml
de-en/IWSLT14.TED.tst2010.de-en.de.xml
de-en/IWSLT14.TED.tst2010.de-en.en.xml
de-en/IWSLT14.TED.tst2011.de-en.de.xml
de-en/IWSLT14.TED.tst2011.de-en.en.xml
de-en/IWSLT14.TED.tst2012.de-en.de.xml
de-en/IWSLT14.TED.tst2012.de-en.en.xml
de-en/IWSLT14.TEDX.dev2012.de-en.de.xml
de-en/IWSLT14.TEDX.dev2012.de-en.en.xml
de-en/README
de-en/train.en
de-en/train.tags.de-en.de
de-en/train.tags.de-en.en
pre-processing train data...


pre-processing valid/test data...
orig/de-en/IWSLT14.TED.dev2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.dev2010.de-en.de

orig/de-en/IWSLT14.TED.tst2010.de-en.de.xml iwslt14.tokenized.de-en/tmp/IWSLT14.TED.tst2010.de-en.de

orig/de-en/IWSLT14

Cloning into 'mosesdecoder'...
Cloning into 'subword-nmt'...
--2019-07-23 07:13:11--  https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz
Resolving wit3.fbk.eu (wit3.fbk.eu)... 217.77.80.8
Connecting to wit3.fbk.eu (wit3.fbk.eu)|217.77.80.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19982877 (19M) [application/x-gzip]
Saving to: ‘de-en.tgz’


The next step is the second round of pre-processing, which binarizes the dataset for the source and target languages. See the [preprocess.py script](https://github.com/pytorch/fairseq/blob/master/preprocess.py) for full details.

In [3]:
%%sh

# First we download fairseq in order to have access to the scripts
git clone https://github.com/pytorch/fairseq.git fairseq-git
cd fairseq-git

# Binarize the dataset:
TEXT=../data/iwslt14.tokenized.de-en
python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir ../data/iwslt14.tokenized.de-en

Namespace(alignfile=None, cpu=False, criterion='cross_entropy', dataset_impl='cached', destdir='../data/iwslt14.tokenized.de-en', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='de', srcdict=None, target_lang='en', task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='../data/iwslt14.tokenized.de-en/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, trainpref='../data/iwslt14.tokenized.de-en/train', user_dir=None, validpref='../data/iwslt14.tokenized.de-en/valid', workers=1)
| [de] Dictionary: 8847 types
| [de] ../data/iwslt14.tokenized.de-en/train.de: 160239 sents, 4035591 tokens, 0.0% replaced by <unk>
| [de] Dictionary: 8847 types
| [de] ../data

fatal: destination path 'fairseq-git' already exists and is not an empty directory.


The dataset is now fully prepared for training one of the FAIRSeq translation models. The next step is to upload the data to Amazon S3 in order to make it available for training.
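
Before uploading, it can help to preview which objects will end up in S3. The sketch below mirrors how a directory upload is expected to map local files to keys (key prefix plus the path relative to the uploaded directory); `preview_s3_keys` is a hypothetical helper, not part of the SageMaker SDK, and it only inspects the local file system.

```python
import os


def preview_s3_keys(local_dir, key_prefix):
    """List the S3 keys a recursive upload of local_dir would be expected to create."""
    keys = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for filename in filenames:
            # Path of the file relative to the directory being uploaded
            relative = os.path.relpath(os.path.join(dirpath, filename), local_dir)
            keys.append(key_prefix + "/" + relative.replace(os.sep, "/"))
    return sorted(keys)
```

For example, `preview_s3_keys('data/iwslt14.tokenized.de-en', 'sagemaker/DEMO-pytorch-fairseq/datasets/iwslt14')` lists every file under the data directory, including any intermediate files left by pre-processing.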

### Upload data to Amazon S3

In [4]:
import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_session.region_name
account = sagemaker_session.boto_session.client('sts').get_caller_identity().get('Account')

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-fairseq/datasets/iwslt14'

role = sagemaker.get_execution_role()

In [5]:
inputs = sagemaker_session.upload_data(path='data/iwslt14.tokenized.de-en', bucket=bucket, key_prefix=prefix)

Next, we need to register a Docker image that contains the FAIRSeq code. Amazon SageMaker pulls this image at training time to train the model and at inference time to serve predictions.

## Build FAIRSeq Translation task container

In [6]:
%%sh
chmod +x create_container.sh 

./create_container.sh pytorch-fairseq

Getting from region eu-central-1 and account 893044784148
Login Succeeded
Login Succeeded
Sending build context to Docker daemon  560.2MB
Step 1/21 : FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
9.0-cudnn7-devel-ubuntu16.04: Pulling from nvidia/cuda
35b42117c431: Pulling fs layer
ad9c569a8d98: Pulling fs layer
293b44f45162: Pulling fs layer
0c175077525d: Pulling fs layer
695112388c71: Pulling fs layer
a911faa54767: Pulling fs layer
ae34ac42e04c: Pulling fs layer
9894e655955e: Pulling fs layer
3494688e8c7f: Pulling fs layer
96f6f0a0ab09: Pulling fs layer
0c175077525d: Waiting
695112388c71: Waiting
a911faa54767: Waiting
ae34ac42e04c: Waiting
9894e655955e: Waiting
3494688e8c7f: Waiting
96f6f0a0ab09: Waiting
ad9c569a8d98: Verifying Checksum
ad9c569a8d98: Download complete
293b44f45162: Verifying Checksum
293b44f45162: Download complete
35b42117c431: Verifying Checksum
35b42117c431: Download complete
0c175077525d: Verifying Checksum
0c175077525d: Download complete
695112388c71: Verifying

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



The FAIRSeq image has now been pushed to Amazon ECR, the registry from which Amazon SageMaker pulls the image to launch both training and prediction.
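
As an optional sanity check, one could confirm that the image is really in ECR before launching a training job. This is a minimal sketch: `ecr_image_uri` and `image_exists` are hypothetical helpers, and the sketch assumes the repository is named `pytorch-fairseq` and the image is tagged `latest`, as in the rest of this notebook.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build the ECR image URI in the same format used for the Estimator."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


def image_exists(region, repository, tag="latest"):
    """Return True if the tagged image can be found in ECR (requires AWS access)."""
    import boto3  # imported lazily so ecr_image_uri works without AWS access
    from botocore.exceptions import ClientError

    ecr = boto3.client("ecr", region_name=region)
    try:
        ecr.describe_images(repositoryName=repository, imageIds=[{"imageTag": tag}])
        return True
    except ClientError as error:
        code = error.response["Error"]["Code"]
        if code in ("ImageNotFoundException", "RepositoryNotFoundException"):
            return False
        raise  # anything else (e.g. missing permissions) should surface
```

Calling `image_exists(region, 'pytorch-fairseq')` after the build should return `True` if the push succeeded.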

## Training on Amazon SageMaker



Next we will set the hyper-parameters of the model we want to train. Here we are using the recommended ones from the [FAIRSeq example](https://github.com/pytorch/fairseq/tree/master/examples/translation#prepare-iwslt14sh). The full list of hyper-parameters available for use can be found [here](https://fairseq.readthedocs.io/en/latest/command_line_tools.html). Please note you can use dataset, training, and generation parameters. For the distributed backend, **gloo** is the only supported option and is set as default. 

In [7]:
hyperparameters = {
    "lr": 0.25,    
    "clip-norm": 0.1,
    "dropout": 0.2,
    "max-tokens": 4000,
    "criterion": "label_smoothed_cross_entropy",
    "label-smoothing": 0.1,
    "lr-scheduler": "fixed",
    "force-anneal": 200,
    "arch": "fconv_iwslt_de_en",
    "max-epoch": 2
}
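
Inside the training container, these hyper-parameters arrive as strings and are turned into FAIRSeq command-line flags, as the training log further down suggests (`['--force-anneal', '200', '--criterion', ...]`). The container's entry point is not shown in this notebook, so the following is only a sketch of that conversion; `to_cli_args` is a hypothetical name.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyper-parameter dict into a list of '--name', 'value' arguments."""
    args = []
    for name, value in hyperparameters.items():
        args.extend(["--" + name, str(value)])
    return args


print(to_cli_args({"lr": 0.25, "arch": "fconv_iwslt_de_en", "max-epoch": 2}))
```

This is why the dictionary keys use the same hyphenated names as the FAIRSeq command-line options.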

We are ready to define the Estimator, which encapsulates all the parameters needed to launch the training on Amazon SageMaker.

For training, the FAIRSeq toolkit recommends GPU instances, such as the `ml.p3` instance family [available in Amazon SageMaker](https://aws.amazon.com/sagemaker/pricing/instance-types/). In this example, we are training on two `ml.p3.8xlarge` instances.

In [8]:
from sagemaker.estimator import Estimator

algorithm_name = "pytorch-fairseq"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)

estimator = Estimator(image,
                     role,
                     train_instance_count=2,
                     train_instance_type='ml.p3.8xlarge',
                     train_volume_size=100, 
                     output_path='s3://{}/output'.format(bucket),
                     hyperparameters=hyperparameters)
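
Each `ml.p3.8xlarge` instance has 4 GPUs, so two instances give 8 workers in total, which matches the ranks 0 to 7 reported in the training log below. How the container assigns ranks is not shown here; the sketch below illustrates one common layout (consecutive ranks per host), with `rank_layout` as a hypothetical helper.

```python
def rank_layout(hosts, gpus_per_host):
    """Assign consecutive global ranks to the GPUs of each host, in sorted host order."""
    layout = {}
    for host_index, host in enumerate(sorted(hosts)):
        first_rank = host_index * gpus_per_host
        layout[host] = list(range(first_rank, first_rank + gpus_per_host))
    return layout


layout = rank_layout(["algo-1", "algo-2"], gpus_per_host=4)
world_size = sum(len(ranks) for ranks in layout.values())
print(world_size, layout)
```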

The call to fit will launch the training job and regularly report on the different performance metrics related to the training. 

In [9]:
estimator.fit(inputs=inputs)

2019-07-23 07:27:14 Starting - Starting the training job...
2019-07-23 07:27:19 Starting - Launching requested ML instances......
2019-07-23 07:28:18 Starting - Preparing the instances for training......
2019-07-23 07:29:33 Downloading - Downloading input data
2019-07-23 07:29:33 Training - Downloading the training image...........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Starting the training.
{'force-anneal': '200', 'criterion': 'label_smoothed_cross_entropy', 'lr': '0.25', 'dropout': '0.2', 'label-smoothing': '0.1', 'clip-norm': '0.1', 'lr-scheduler': 'fixed', 'max-tokens': '4000', 'arch': 'fconv_iwslt_de_en', 'max-epoch': '2'}
['--force-anneal', '200', '--criterion', 'label_smoothed_cross_entropy', '--lr', '0.25', '--dropout', '0.2', '--label-s

2019-07-23 07:31:26 Training - Training image download completed. Training in progress.
| distributed init (rank 3): tcp://algo-1:1112
| distributed init (rank 0): tcp://algo-1:1112
| distributed init (rank 2): tcp://algo-1:1112
| distributed init (rank 1): tcp://algo-1:1112
| distributed init (rank 7): tcp://algo-1:1112
| distributed init (rank 4): tcp://algo-1:1112
| distributed init (rank 5): tcp://algo-1:1112
| distributed init (rank 6): tcp://algo-1:1112
| initialized host algo-1 as rank 0
Namespace(arch='fconv_iwslt_de_en', beam=5, bucket_cap_mb=150, buffer_size=0, clip_norm=0.1, cpu=False, criterion='label_smoothed_cross_entropy', data=['/opt/ml/input/data/training'], ddp_backend='c10d', decoder_attention='True', decoder_embed_dim=256, decoder_embed_path=None, decoder_layers='[(256, 3)] * 3', decoder_out_embed_dim=256, device_id=0, distributed_backend='gloo', distributed_init_method='tcp://algo

Once the model has finished training, we can go ahead and test its translation capabilities by deploying it on an endpoint.

## Hosting the model

We first need to define a base JSONPredictor class that will help us with sending predictions to the model once it's hosted on the Amazon SageMaker endpoint. 

In [10]:
from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer

class JSONPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(JSONPredictor, self).__init__(endpoint_name, sagemaker_session, json_serializer, json_deserializer)

We can now use the estimator object to deploy the model artifacts (the trained model) to an endpoint. We deploy on a CPU instance, since a GPU is no longer needed just to run inference with the model. Let's use an `ml.m5.12xlarge`.

In [11]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.12xlarge', predictor_cls=JSONPredictor)

# Note: if no endpoint_name is given, the endpoint reuses the training job name, see
# https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator.deploy
# When retrying endpoint creation, either pass an explicit endpoint name:
# predictor = estimator.deploy(initial_instance_count=1, endpoint_name="pytorch-fairseq-20190715T14", instance_type='ml.m5.12xlarge', predictor_cls=JSONPredictor)
# or let the estimator update the existing endpoint:
# predictor = estimator.deploy(initial_instance_count=1, update_endpoint=True, instance_type='ml.m5.12xlarge', predictor_cls=JSONPredictor)

--------------------------------------------------------------------------------------------------!


Now it's your turn to play. Input a sentence in German and get the translation in English by simply calling predict.

In [13]:
import html

text_input = 'Guten Morgen'

result = predictor.predict(text_input)
# Some characters come back HTML-escaped, so unescape them before printing
print(html.unescape(result))

it 's going to do .
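
To see what the unescaping step does, here is a small standalone example. The escaped string is made up for illustration; entities such as `&apos;` most likely come from the tokenization step of the pre-processing, which escapes special characters.

```python
import html

# A made-up response containing an HTML entity for the apostrophe
escaped = "it &apos;s going to do ."
print(html.unescape(escaped))  # prints: it 's going to do .
```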


Once you're done getting predictions, remember to delete your endpoint, since you no longer need it.

## Delete endpoint

In [20]:
sagemaker_session.delete_endpoint(predictor.endpoint)


Voila! For more information, you can check out the [FAIRSeq toolkit homepage](https://github.com/pytorch/fairseq). 