## Credits

This notebook is based on the example notebook provided by AWS SageMaker (https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own).
We focus on improving the training part, adding GPU support for the computations and the Keras library.

## Build Docker container

This container provides all the machine learning and GPU dependencies and exposes a simple API so that SageMaker knows how to use it.
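
To make the "simple API" concrete: for training, SageMaker starts the container with the argument `train`, mounts its inputs under `/opt/ml`, expects model artifacts in `/opt/ml/model`, and treats a non-zero exit code as a failed job. The sketch below shows one way a `train` entry point could follow that layout. It is not the script shipped in `container/keras-nn/train`; the function name `run_training` and the callable `train_fn` are hypothetical, and the `prefix` parameter exists only so the layout can be exercised outside a container.

```python
import json
import os
import traceback


def _ensure_dir(path):
    # Works on both Python 2.7 (used in the image) and Python 3.
    if not os.path.isdir(path):
        os.makedirs(path)


def run_training(train_fn, prefix="/opt/ml"):
    """Run `train_fn` using the SageMaker training-container directory layout.

    `train_fn(hyperparameters, data_dir, model_dir)` is a hypothetical callable
    standing in for the real Keras training code.
    Returns the exit code the `train` program should use (0 = success).
    """
    config_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    data_dir = os.path.join(prefix, "input", "data", "training")  # default channel
    model_dir = os.path.join(prefix, "model")
    output_dir = os.path.join(prefix, "output")

    try:
        hyperparameters = {}
        if os.path.exists(config_path):
            with open(config_path) as f:
                # SageMaker passes every hyperparameter value as a string.
                hyperparameters = json.load(f)
        _ensure_dir(model_dir)
        train_fn(hyperparameters, data_dir, model_dir)
        return 0
    except Exception:
        # The text written to output/failure is reported as the failure reason.
        _ensure_dir(output_dir)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write(traceback.format_exc())
        return 1
```

In a real `train` script the last line would typically be `sys.exit(run_training(my_train_fn))`, so that a failure is visible to SageMaker through the exit code.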

We can see that TensorFlow has the following requirements (https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/tensorflow/):

* 64-bit Linux
* Python 2.7
* CUDA 7.5 (CUDA 8.0 required for Pascal GPUs)
* cuDNN v5.1 (cuDNN v6 if on TF v1.3)

To fulfill them we can use a prebuilt image provided by the nvidia-docker project (see https://github.com/NVIDIA/nvidia-docker). We do not need to worry about the nvidia-docker daemon, since it is provided by SageMaker.
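
As a quick sanity check, the versions encoded in a base image tag can be compared with the requirements listed above. The sketch below is only an illustration: the helper names `parse_cuda_image` and `meets_minimums` are hypothetical, the tag pattern is assumed from the one tag used in this notebook, and the listed versions are treated as lower bounds (CUDA 8.0, cuDNN 6). In practice a given `tensorflow-gpu` release is usually built against one specific CUDA/cuDNN pair, so its release notes remain the authoritative source.

```python
import re

# Assumed tag scheme, e.g. "9.0-cudnn7-runtime-ubuntu16.04".
TAG_PATTERN = re.compile(
    r"^(?P<cuda>\d+\.\d+)-cudnn(?P<cudnn>\d+)-(?P<flavor>\w+)-(?P<os>[a-z]+[\d.]+)$"
)


def parse_cuda_image(image):
    """Split a reference like 'nvidia/cuda:<tag>' into its version parts."""
    repository, _, tag = image.partition(":")
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError("unrecognised CUDA image tag: {!r}".format(tag))
    cuda = tuple(int(part) for part in match.group("cuda").split("."))
    return {
        "repository": repository,
        "cuda": cuda,
        "cudnn": int(match.group("cudnn")),
        "flavor": match.group("flavor"),
        "os": match.group("os"),
    }


def meets_minimums(image, min_cuda=(8, 0), min_cudnn=6):
    """Return True if the image tag advertises at least the given versions."""
    info = parse_cuda_image(image)
    return info["cuda"] >= min_cuda and info["cudnn"] >= min_cudnn


print(meets_minimums("nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04"))
```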

In [3]:
!cat container/Dockerfile

# Build an image that can do training in SageMaker
# This image contains CUDA 9.0 (CUDA libs are backward-compatible), cuDNN version 7 and 64-bit Ubuntu
FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04

# FROM ubuntu:16.04 - if we do not want cuda support
MAINTAINER Amazon AI <sage-learner@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python \
         nginx \
         ca-certificates \
         python-dev \
         python-tk \
         gcc \
         g++ \
         libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \
    pip install numpy scikit-learn pandas flask gevent gunicorn matplotlib tensorflow-gpu keras Pillow six

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHO

In [4]:
%%sh

# AWS script to run the build and deployment 

# The name of our algorithm
algorithm_name=keras-nn

cd container

chmod +x keras-nn/train

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

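# Get the login command from ECR and execute it directly.
# `aws ecr get-login` exists only in AWS CLI v1; with AWS CLI v2 use
# `aws ecr get-login-password | docker login --username AWS --password-stdin <registry>` instead.
$(aws ecr get-login --region ${region} --no-include-email)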

# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Sending build context to Docker daemon  76.29kB
Step 1/9 : FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04
 ---> 03608776d6cb
Step 2/9 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> c46e372d6308
Step 3/9 : RUN apt-get -y update && apt-get install -y --no-install-recommends          wget          python          nginx          ca-certificates          python-dev          python-tk          gcc          g++          libopenblas-dev     && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 332aac28bcee
Step 4/9 : RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py &&     pip install numpy scikit-learn pandas flask gevent gunicorn matplotlib tensorflow-gpu keras Pillow six
 ---> Using cache
 ---> d89a824d5f90
Step 5/9 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> 20be4641b434
Step 6/9 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Using cache
 ---> 65db18abe65b
Step 7/9 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> 6f9
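
The "create the repository if it doesn't exist" step of the shell cell above can also be done from Python. The following is a minimal sketch, assuming a boto3 ECR client is passed in; the helper names `image_uri` and `ensure_repository` are hypothetical and not part of the original scripts.

```python
def image_uri(account, region, name, tag="latest"):
    """Build the full ECR image name, as `fullname` does in the shell cell."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, name, tag)


def ensure_repository(ecr_client, name):
    """Create the ECR repository `name` if it is missing.

    `ecr_client` is expected to behave like `boto3.client("ecr")`.
    Returns True if the repository was created, False if it already existed.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# ensure_repository(boto3.client("ecr"), "keras-nn")
```

Catching only `RepositoryNotFoundException` keeps other problems, such as missing permissions, visible instead of silently attempting to create the repository.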



In [5]:
# S3 prefix
prefix = 'DEMO-keras-nn'

import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [6]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

In [7]:
# Upload the training data. In this case we leave this directory empty, since the CIFAR data is built into the Keras library.
# Otherwise, any data we put on S3 would be accessible from our container.
WORK_DIRECTORY = 'data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__. This is constructed as in the shell commands above.
* The __role__. As defined above.
* The __instance count__ which is the number of machines to use for training.
* The __instance type__ which is the type of machine to use for training.
* The __output path__ determines where the model artifact will be written.
* The __session__ is the SageMaker session object that we defined above.

Then we call `fit()` on the estimator to train against the data that we uploaded above.

In [None]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/keras-nn:latest'.format(account, region)

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.p2.xlarge',
                       output_path="s3://{}/output_keras_gpu".format(sess.default_bucket()),
                       sagemaker_session=sess)

tree.fit(data_location)

INFO:sagemaker:Creating training-job with name: keras-nn-2018-07-26-09-03-41-932


............................
  from ._conv import register_converters as _register_converters
  from . import h5a, h5d, h5ds, h5f, h5fd, h5g, h5r, h5s, h5t, h5p, h5z
  from .. import h5g, h5i, h5o, h5r, h5t, h5l, h5p
Using TensorFlow backend.
  from . import _csparsetools
  from ._shortest_path import shortest_path, floyd_warshall, dijkstra,\
  from ._tools import csgraph_to_dense, csgraph_from_dense,\
  from ._traversal import breadth_first_order, depth_first_order, \
  from ._min_spanning_tree import minimum_spanning_tree
  from ._reordering import reverse_cuthill_mckee, maximum_bipartite_matching, \
  from ._solve_toeplitz import levinson
  from ._decomp_update import *
  from ._ufuncs import *
  from ._ellip_harm_2 import _ellipsoid, _ellipsoid_norm
  from . import _bspl
  from .ckdtree import *
  from .qhull import *
  from . import _voronoi
  from . import _hausdorff
  from . import _ni_label
  f

Epoch 1 of 240
  'Discrepancy between trainable weights and collected trainable'


  0/500 [..............................] - ETA: 0s
  1/500 [..............................] - ETA: 1:31:23
  2/500 [..............................] - ETA: 48:07
  3/500 [..............................] - ETA: 33:41
  4/500 [..............................] - ETA: 26:27
  5/500 [..............................] - ETA: 22:07
  6/500 [..............................] - ETA: 19:13
  7/500 [..............................] - ETA: 17:09
  8/500 [..............................] - ETA: 15:36
  9/500 [..............................] - ETA: 14:23
 10/500 [..............................] - ETA: 13:25
 11/500 [..............................] - ETA: 12:37
 12/500 [..............................] - ETA: 11:57
 13/500 [..............................] - ETA: 11:23
 14/500 [..............................] - ETA: 10:54
 15/500 [..............................] - ETA: 10:29
 16/500 [..............................]

Testing for epoch 7:
component              | loss  | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 1.45  | 1.18            | 0.27
generator (test)       | 8.39  | 0.07            | 8.32
discriminator (train)  | 1.39  | 0.57            | 0.82
discriminator (test)   | 13.67 | 3.95            | 9.72
Epoch 8 of 240

  0/500 [..............................] - ETA: 0s
  1/500 [..............................] - ETA: 5:01
  2/500 [..............................] - ETA: 5:01
  3/500 [..............................] - ETA: 5:00
  4/500 [..............................] - ETA: 5:00
  5/500 [..............................] - ETA: 4:59
  6/500 [..............................] - ETA: 4:58
  7/500 [..............................] - ETA: 4:58
  8/500 [..............................] - ETA: 4:57
  9/500 [.





Testing for epoch 8:
component              | loss  | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 1.40  | 1.14            | 0.25
generator (test)       | 9.22  | 0.03            | 9.19
discriminator (train)  | 1.37  | 0.58            | 0.79
discriminator (test)   | 13.68 | 3.58            | 10.09
Epoch 9 of 240

  0/500 [..............................] - ETA: 0s
  1/500 [..............................] - ETA: 5:00
  2/500 [..............................] - ETA: 5:01
  3/500 [..............................] - ETA: 5:01
  4/500 [..............................] - ETA: 5:01
  5/500 [..............................] - ETA: 5:00
  6/500 [..............................] - ETA: 4:59
  7/500 [..............................] - ETA: 4:59
  8/500 [..............................] - ETA: 4:58

 47/500 [=>............................] - ETA: 4:34
 48/500 [=>............................] - ETA: 4:33
 49/500 [=>............................] - ETA: 4:33
 50/500 [==>...........................] - ETA: 4:32
 51/500 [==>...........................] - ETA: 4:32
 52/500 [==>...........................] - ETA: 4:31
 53/500 [==>...........................] - ETA: 4:30
 54/500 [==>...........................] - ETA: 4:30
 55/500 [==>...........................] - ETA: 4:29
 56/500 [==>...........................] - ETA: 4:29
 57/500 [==>...........................] - ETA: 4:28
 58/500 [==>...........................] - ETA: 4:27
 59/500 [==>...........................] - ETA: 4:27
 60/500 [==>...........................] - ETA: 4:26
 61/500 [==>...........................] - ETA: 4:26
 62/500 [==>...........................] - ETA: 4:25
 63/500 [==>...........................] - ETA: 4:24
 64/





Testing for epoch 9:
component              | loss  | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 1.36  | 1.11            | 0.25
generator (test)       | 9.50  | 0.01            | 9.49
discriminator (train)  | 1.34  | 0.58            | 0.77
discriminator (test)   | 14.82 | 4.68            | 10.15
Epoch 10 of 240

  0/500 [..............................] - ETA: 0s
  1/500 [..............................] - ETA: 5:02
  2/500 [..............................] - ETA: 5:02
  3/500 [..............................] - ETA: 5:01
  4/500 [..............................] - ETA: 5:01
  5/500 [..............................] - ETA: 5:00
  6/500 [..............................] - ETA: 5:00
  7/500 [..............................] - ETA: 4:59
  8/500 [..............................] - ETA: 4:58
  9/500 [
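
The per-epoch loss tables in the log above are easier to compare once they are pulled out of the progress-bar noise. The sketch below is one possible way to do that, assuming the log text has been saved to a string; the function name `parse_epoch_tables` is hypothetical, and the row pattern is inferred only from the tables shown in this notebook. It tolerates terminal colour codes such as `[31m` and `[0m` if they are present.

```python
import re

# Terminal colour codes, with or without the leading ESC character.
ANSI = re.compile(r"(?:\x1b)?\[\d+m")
EPOCH = re.compile(r"Testing for epoch (\d+):")
ROW = re.compile(
    r"^(generator|discriminator) \((train|test)\)"
    r"\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*$"
)


def parse_epoch_tables(log_text):
    """Return {epoch: {(component, split): (loss, generation_loss, auxiliary_loss)}}."""
    results = {}
    epoch = None
    for raw in log_text.splitlines():
        line = ANSI.sub("", raw).strip()
        header = EPOCH.search(line)
        if header:
            epoch = int(header.group(1))
            results[epoch] = {}
            continue
        row = ROW.match(line)
        if row and epoch is not None:
            key = (row.group(1), row.group(2))
            results[epoch][key] = tuple(float(v) for v in row.groups()[2:])
    return results
```

With the parsed values, the train and test losses of each component can be compared across epochs, for example `parse_epoch_tables(log)[9][("generator", "test")]`.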



In [29]:
!tar chvfz notebook2.tar.gz *

container/
container/ReadMe.md
container/keras-nn/
container/keras-nn/train_cpu
container/keras-nn/.ipynb_checkpoints/
container/keras-nn/train_gpu
container/keras-nn/train_old
container/keras-nn/Minibatch.py
container/keras-nn/train
container/Dockerfile
container/.ipynb_checkpoints/
container/Dockerfile_cpu
container/local_test/
container/local_test/test_dir/
container/local_test/test_dir/model/
container/local_test/test_dir/model/decision-tree-model.pkl
container/local_test/test_dir/output/
container/local_test/test_dir/output/success
container/local_test/test_dir/input/
container/local_test/test_dir/input/config/
container/local_test/test_dir/input/config/hyperparameters.json
container/local_test/test_dir/input/config/resourceConfig.json
container/local_test/test_dir/input/data/
container/local_test/test_dir/input/data/training/
container/local_test/test_dir/input/data/training/iris.csv
container/local_test/.ipynb_checkpoints/
data/
data/iris.csv
keras_on