# Module 1. Amazon SageMaker Training Script

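<p>This example provides code for training with TensorFlow 1.x and Python 2.7 so that the resulting model can run on the AI chip developed by LG.</p>
<p>The code is based on the <strong><a href="https://github.com/tensorflow/models/tree/master/research/slim" target="_blank" class ='btn-default'>TensorFlow-Slim image classification model library</a></strong> and has been adapted into a script that can run as a training job on SageMaker. Packaging the training code as a SageMaker script has two benefits. First, by changing only a few parameters in the notebook, you can train many variants of the same model architecture with different hyperparameters at the same time, on as many instances of whatever type you choose. Second, you can deploy the best-performing model to an Endpoint for hosting with a command issued directly from the notebook.</p>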

<p>In this lab, we look at how training is structured on Amazon SageMaker and briefly try out the training workflow hands-on.</p>

## 1. SageMaker notebook overview
<p>A SageMaker notebook is a fully managed, container-based service. You cannot see the containers directly, but internally it works as shown below.</p>
<p><img src="./imgs/fig00.png" width="700" height="70"></p>

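- **S3 (Simple Storage Service)** : Object storage used to hold the training data files and the training outputs, such as models, checkpoints, event files for TensorBoard, and logs.
- **SageMaker Notebook** : Provides a Python development environment for writing and debugging training scripts and for launching the actual training.
- **Amazon Elastic Container Registry (ECR)** : A fully managed Docker container registry that makes it easy to store, manage, and deploy Docker container images. SageMaker provides prebuilt containers, so you do not need to register your own image in ECR. If you need a custom training or deployment environment, however, you can build a custom container image, register it in ECR, and use that environment.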

<p>Training and hosting (inference) can each run in a different container environment, and these environments scale out easily, so many training jobs and hosted endpoints can run at the same time.</p>

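## 2. Environment setup

<p>Import the basic packages needed for SageMaker training.</p>
<p>boto3 offers a convenient abstraction that hides the underlying HTTP API calls and provides Python classes for working with AWS resources such as Amazon EC2 instances and S3 buckets.</p>
<p>The SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker.</p>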

In [1]:
import sys

In [2]:
# !{sys.executable} -m pip install --upgrade pip
# !{sys.executable} -m pip install tensorflow_gpu==1.14

In [3]:
import os
import time
import sagemaker
import boto3
import tensorflow as tf
from PIL import Image

from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role
from sagemaker.session import Session

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from IPython.display import display

%matplotlib inline

<p>Set up the SageMaker session and the role that will be used in the rest of the notebook.</p>

In [4]:
sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

sess = boto3.Session()
sm = sess.client('sagemaker')

## 3. Getting the S3 data location
<p>The S3 bucket that stores the data is created under the data_bucket name below. By default, SageMaker stores trained models and logs in a separate bucket whose name is generated automatically.</p>

In [5]:
# Create an S3 bucket to hold the data; your account may already have created a bucket with the same name.
account_id = sess.client('sts').get_caller_identity()["Account"]
data_bucket = 'sagemaker-experiments-{}-{}'.format(sess.region_name, account_id)
bucket = 'sagemaker-{}-{}'.format(sess.region_name, account_id)

try:
    if sess.region_name == "us-east-1":
        sess.client('s3').create_bucket(Bucket=data_bucket)
    else:
        sess.client('s3').create_bucket(Bucket=data_bucket, 
                                        CreateBucketConfiguration={'LocationConstraint': sess.region_name})
except Exception as e:
    print(e)

An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
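
<p>The region check above is needed because S3 does not accept <code>us-east-1</code> as an explicit <code>LocationConstraint</code>. As a minimal sketch, the branching can be isolated in a small helper. <code>create_bucket_kwargs</code> is a hypothetical name introduced here for illustration and is not part of boto3, and the account ID in the example is a placeholder.</p>

```python
def create_bucket_kwargs(bucket_name, region_name):
    """Build keyword arguments for the boto3 S3 client's create_bucket call.

    Hypothetical helper: us-east-1 is the default S3 region and must not be
    passed as a LocationConstraint, so it is omitted in that case.
    """
    kwargs = {'Bucket': bucket_name}
    if region_name != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region_name}
    return kwargs


# Example with placeholder values (not a real account ID).
example_kwargs = create_bucket_kwargs(
    'sagemaker-experiments-us-east-2-123456789012', 'us-east-2')
print(example_kwargs)
```

<p>Under these assumptions, the call in the cell above could be written as <code>sess.client('s3').create_bucket(**create_bucket_kwargs(data_bucket, sess.region_name))</code>.</p>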


## 4. Converting images to TFRecord
<p>Upload the image files to the SageMaker notebook environment for training. The folders must be laid out as follows.</p>
<pre>
<div style='line-height:80%'>
    image_path/class1/Aimage_1<br>
                      Aimage_2<br>
                       ...<br>
                      Aimage_N<br>
    image_path/class2/Bimage_1<br>
                      Bimage_2<br>
                       ...<br>
                      Bimage_M<br>
</div>
</pre>
<p>The generated TFRecord files are saved to the dataset_dir defined below. The numbers of training and test examples are stored in the train_num_data and test_num_data variables for use in training later.</p>
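
<p>Before converting, it can help to confirm that the images really follow the <code>image_path/class/file</code> layout. The sketch below only illustrates that check: <code>list_images_by_class</code> is a hypothetical helper, it does not reproduce what <code>image_to_tfrecord.run</code> does internally, and it builds a throwaway directory so that it runs anywhere.</p>

```python
import os
import tempfile


def list_images_by_class(image_path):
    """Return {class_name: [sorted file paths]} for an image_path/class/file layout."""
    images = {}
    for class_name in sorted(os.listdir(image_path)):
        class_dir = os.path.join(image_path, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        images[class_name] = sorted(
            os.path.join(class_dir, name) for name in os.listdir(class_dir))
    return images


# Build a throwaway directory tree that mimics the expected layout.
demo_root = tempfile.mkdtemp()
for class_name, file_names in [('class1', ['Aimage_1', 'Aimage_2']),
                               ('class2', ['Bimage_1'])]:
    os.makedirs(os.path.join(demo_root, class_name))
    for file_name in file_names:
        open(os.path.join(demo_root, class_name, file_name), 'w').close()

demo_counts = {name: len(paths)
               for name, paths in list_images_by_class(demo_root).items()}
print(demo_counts)
```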

In [6]:
sys.path.append('/home/ec2-user/SageMaker/PUBLIC-IOT-ML/src_dir/')
sys.path.append('/home/ec2-user/SageMaker/git_dir/PUBLIC-IOT-ML/src_dir/')

In [7]:
from datasets import image_to_tfrecord

In [8]:
dataset_dir = '/home/ec2-user/SageMaker/PUBLIC-IOT-ML/img_datasets'
image_path = '/home/ec2-user/SageMaker/PUBLIC-IOT-ML/data'

dataset_dir = '/home/ec2-user/SageMaker/git_dir/PUBLIC-IOT-ML/img_datasets'
image_path = '/home/ec2-user/SageMaker/git_dir/PUBLIC-IOT-ML/data'

In [9]:
!rm -rf $dataset_dir

In [10]:
train_num_data, test_num_data = image_to_tfrecord.run(image_path, dataset_dir)








Finished converting the image dataset!


## 5. Uploading TFRecord files to S3

<p>Upload the TFRecord files to S3 for SageMaker training. They are stored under the prefix folder in the data_bucket specified earlier.</p>

In [11]:
prefix = 'captured_data/tfrecord'
!aws s3 cp $dataset_dir s3://{data_bucket}/{prefix}/ --recursive

upload: img_datasets/labels.txt to s3://sagemaker-experiments-us-east-2-322537213286/captured_data/tfrecord/labels.txt
upload: img_datasets/captureddata_val.tfrecord to s3://sagemaker-experiments-us-east-2-322537213286/captured_data/tfrecord/captureddata_val.tfrecord
upload: img_datasets/captureddata_train.tfrecord to s3://sagemaker-experiments-us-east-2-322537213286/captured_data/tfrecord/captureddata_train.tfrecord
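
<p>The <code>aws s3 cp --recursive</code> command keeps each file's relative path under the prefix. If you would rather upload from Python, the mapping from local files to S3 keys could be sketched as follows. <code>plan_uploads</code> is a hypothetical helper, and the demo uses empty placeholder files rather than real TFRecords.</p>

```python
import os
import tempfile


def plan_uploads(local_dir, prefix):
    """Map every file under local_dir to the S3 key it would get under prefix."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_file = os.path.join(root, name)
            relative = os.path.relpath(local_file, local_dir).replace(os.sep, '/')
            plan.append((local_file, '{}/{}'.format(prefix, relative)))
    return sorted(plan)


# Placeholder files named like the ones uploaded above.
demo_dir = tempfile.mkdtemp()
for name in ['labels.txt', 'captureddata_train.tfrecord', 'captureddata_val.tfrecord']:
    open(os.path.join(demo_dir, name), 'w').close()

upload_keys = [key for _local, key in plan_uploads(demo_dir, 'captured_data/tfrecord')]
print(upload_keys)
```

<p>Each <code>(local_file, key)</code> pair could then be passed to <code>sess.client('s3').upload_file(local_file, data_bucket, key)</code>.</p>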


## 6. Writing the training script

<p>Even outside SageMaker, you would write a Python file that defines the model architecture, the optimizer, the loss function, and so on. SageMaker uses that same Python file. Any other source files it depends on also work without modification; to run on SageMaker, you only need to add the environment variables that SageMaker training provides to the existing Python file. For example, the environment variable <code>SM_MODEL_DIR</code> refers to <code>/opt/ml/model</code> in the container environment. For the full list of environment variables, see <strong><a href="https://github.com/aws/sagemaker-containers" target="_blank" class ='btn-default'>SageMaker Containers-IMPORTANT ENVIRONMENT VARIABLES</a></strong>. </p><p>When SageMaker training finishes, the container environment is deleted automatically. The trained model artifacts and other output files therefore have to be saved to S3. After training ends, SageMaker automatically compresses the final model files saved in <code>SM_MODEL_DIR</code> into model.tar.gz and copies the archive from the container to an S3 bucket.</p><p>If you do not specify a bucket, the default bucket is used. In addition, the Python source code used for training is uploaded to S3 when training starts, and any checkpoint or log files you save to <code>SM_MODEL_DIR</code> are automatically moved from the container to S3 once training ends. With this flow in mind, you can save whatever information you need during training and download it from S3 afterwards. </p>
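
<p>As a minimal sketch of how a script might pick up these environment variables, the example below uses them as <code>argparse</code> defaults so that the same file also runs outside SageMaker. The argument names <code>--model_output_dir</code> and <code>--dataset_dir</code> and the fallback paths are assumptions made for illustration; the workshop's <code>image_classifier.py</code> may wire them differently.</p>

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # SM_MODEL_DIR is /opt/ml/model inside the training container.
    parser.add_argument('--model_output_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # SM_CHANNEL_TRAINING points at the local copy of the 'training' channel.
    parser.add_argument('--dataset_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING',
                                               '/opt/ml/input/data/training'))
    parser.add_argument('--batch_size', type=int, default=128)
    # parse_known_args tolerates extra arguments that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(['--batch_size', '64'])
print(args.dataset_dir, args.model_output_dir, args.batch_size)
```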

<p>To save time, the Python training script below has been written in advance.</p>

In [12]:
!pygmentize './src_dir/image_classifier.py'

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Generic training script that trains a model using a given dataset."""

from __future__ import absolut

        learning_rate = tf.train.polynomial_decay(
            args.learning_rate,
            global_step,
            decay_steps,
            args.end_learning_rate,
            power=1.0,
            cycle=False,
            name='polynomial_decay_learning_rate')
    else:
        raise ValueError('learning_rate_decay_type [%s] was not recognized' %
                         args.learning_rate_decay_type)

    if args.warmup_epochs:
        warmup_lr = (
            args.learning_rate * tf.cast(global_step, tf.float32) /
            (steps_per_epoch * args.warmup_epochs))
        learning_rate = tf.minimum(warmup_lr, learning_rate)
    return learning_rate


def _configure_optimizer(args, learnin
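
<p>The excerpt above combines polynomial decay with a linear warmup. To make the arithmetic easier to follow, here is a plain-Python sketch of the same two formulas, assuming <code>cycle=False</code> as in the excerpt. The function names and the numbers in the loop are illustrative only and do not come from the workshop run.</p>

```python
def polynomial_decay(initial_lr, step, decay_steps, end_lr, power=1.0):
    """Plain-Python version of polynomial decay with cycle=False."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    fraction = 1.0 - float(step) / decay_steps
    return (initial_lr - end_lr) * (fraction ** power) + end_lr


def with_warmup(initial_lr, step, steps_per_epoch, warmup_epochs, decayed_lr):
    """Cap the decayed rate by a linear ramp, like tf.minimum(warmup_lr, learning_rate)."""
    if not warmup_epochs:
        return decayed_lr
    warmup_lr = initial_lr * float(step) / (steps_per_epoch * warmup_epochs)
    return min(warmup_lr, decayed_lr)


# Illustrative numbers only; they are not taken from the workshop run.
for step in (0, 50, 100, 1000):
    decayed = polynomial_decay(0.1, step, decay_steps=1000, end_lr=0.0)
    print(step, with_warmup(0.1, step, steps_per_epoch=100,
                            warmup_epochs=1, decayed_lr=decayed))
```

<p>During warmup the linear ramp is the smaller of the two values, so it determines the rate; once the ramp exceeds the decayed value, the decay schedule takes over.</p>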

## 7. Creating a training job with the `TensorFlow` estimator


<p>The <strong><code>sagemaker.tensorflow.TensorFlow</code></strong> estimator takes the location of the entry script and the directory containing its supporting code, uploads the scripts to S3, and runs a SageMaker training job. A training job is the unit of a training run: each time you run training, one training job is created. Some of the key parameters are described below.</p>

- **entry_point** : The absolute or relative path to the Python source file that is run first for training. If source_dir is specified, entry_point must be a file inside source_dir.
- **source_dir** : The directory containing the other source files needed for training. It can be an absolute path, a relative path, or an S3 URI; if it is an S3 URI, it must point to a tar.gz file.
- **role** : The AWS Identity and Access Management (IAM) role that Amazon SageMaker assumes to perform tasks on your behalf (for example, reading training results, called model artifacts, from an S3 bucket and writing training results to Amazon S3).
- **train_instance_count** : The number of instances used for training.
- **train_instance_type** : The type of instance used for training.
- **train_volume_size** : The size, in GB, of the Amazon Elastic Block Store (Amazon EBS) storage volume attached to the training instance. In File mode, which is the default, this must be large enough to hold the training data.
- **train_use_spot_instances** : Whether to use SageMaker Managed Spot instances for training. If enabled, train_max_wait must also be set.
- **train_max_run** : The maximum training time in seconds. After this time, Amazon SageMaker terminates the job regardless of its current state. (Default: 24 * 60 * 60)
- **train_max_wait** : The time, in seconds, to wait for SageMaker Managed Spot instances. After this time, Amazon SageMaker stops waiting for spot instances to become available and the job fails.
- **framework_version** : The TensorFlow version to use for training.
- **py_version** : Set to py3 if the container environment uses Python 3, or py2 for Python 2. Python 2 has reached end of support, but it can still be used here to accommodate existing Python 2 code.
- **hyperparameters** : The hyperparameters to use for training. All values defined here are passed to the training container.
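
<p>One detail worth checking before launching a spot training job is that the wait limit is not shorter than the run limit, because SageMaker requires the maximum wait time to be at least the maximum run time. The sketch below wraps that check in a hypothetical helper, <code>spot_training_kwargs</code>; the dictionary keys follow the SageMaker Python SDK v1 parameter names listed above.</p>

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds):
    """Return spot-related estimator arguments after a basic sanity check.

    Hypothetical helper; the keys follow the SDK v1 parameter names
    used in this notebook.
    """
    if max_wait_seconds < max_run_seconds:
        raise ValueError('train_max_wait must be at least train_max_run')
    return {
        'train_use_spot_instances': True,
        'train_max_run': max_run_seconds,
        'train_max_wait': max_wait_seconds,
    }


spot_kwargs = spot_training_kwargs(12 * 60 * 60, 12 * 60 * 60)
print(spot_kwargs)
```

<p>The resulting dictionary could be unpacked into the estimator call with <code>**spot_kwargs</code> instead of listing the three arguments by hand.</p>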

<p>Distributed and multi-GPU training are also possible. SageMaker provides an environment optimized for <strong><a href="https://github.com/horovod/horovod" target="_blank" class ='btn-default'>Horovod</a></strong>; see <strong><a href="https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training" target="_blank" class ='btn-default'>here</a></strong> for details. Distributed and multi-GPU training are not used in this lab. (As usual, the Python source code itself must be written to support distributed or multi-GPU training.)</p>
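
<p>For reference, in the SageMaker Python SDK v1 used by this notebook, Horovod training is requested through the estimator's <code>distributions</code> argument. The sketch below only builds that dictionary. <code>horovod_distribution</code> is a hypothetical helper, and the value 4 is an example that you would replace with the number of GPUs on your instance type. Treat it as an outline to verify against the linked documentation rather than a tested configuration.</p>

```python
def horovod_distribution(processes_per_host):
    """Build a `distributions` value that enables MPI (Horovod) training.

    Hypothetical helper; the dictionary layout follows the SDK v1
    documentation and should be checked against the SDK version in use.
    """
    if processes_per_host < 1:
        raise ValueError('processes_per_host must be a positive integer')
    return {'mpi': {'enabled': True, 'processes_per_host': processes_per_host}}


# Example: one process per GPU on an instance assumed to have 4 GPUs.
distribution_config = horovod_distribution(4)
print(distribution_config)
```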


<p>Check the location of the TFRecord files stored in S3 again.</p>

In [29]:
## Dataset location
inputs = 's3://{}/{}'.format(data_bucket, prefix)
inputs

's3://sagemaker-experiments-us-east-2-322537213286/captured_data/tfrecord'

In [30]:
hyperparameters = {
        'dataset_name' : 'captured_dataset',
        'model_name' : 'mobilenet_v1_025',
        'preprocessing_name' : 'mobilenet_v1',
        'learning_rate_decay_type' : 'exponential',    ## "fixed", "exponential" or "polynomial"
        'learning_rate_decay_factor' : 0.98,          ## in case of exponential
        'learning_rate' : 0.001,
        'image_size' : 224,
        'save_summaries_secs' : 300,
        'num_epochs_per_decay' : 2.5,
        'moving_average_decay' : 0.9999,
        'batch_size' : 128,
        'max_number_of_steps' : 30000,
        'eval_batch_size' : 1000,
        'train_num_data' : train_num_data,
        'test_num_data': test_num_data,
        'finetune_checkpoint_path' : 'fine_tune_checkpoint/model.ckpt-25000',
#         'finetune_checkpoint_path' : 'fine_tune_checkpoint/mobilenet_v1_0.25_128.ckpt',
#         'checkpoint_exclude_scopes' : 'MobilenetV1/Logits,MobilenetV1/AuxLogits',
    }

In [31]:
training_job_name = "{}-img-classifier-training-job".format(int(time.time()))
estimator = TensorFlow(entry_point='image_classifier.py',
                       source_dir='src_dir',
                       role=role,
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge',
                       train_use_spot_instances=True,  # use spot instances
                       train_volume_size=400,
                       train_max_run=12*60*60,
                       train_max_wait=12*60*60,
#                        train_instance_type='local_gpu',
                       framework_version='1.14.0',
                       py_version='py2',
                       hyperparameters=hyperparameters
                      )

## 8. Starting training with the fit function

<p>Training starts when <strong><code>estimator.fit(training_data_uri)</code></strong> is called. The S3 location of the actual data is passed as the input. <code>fit</code> creates a default channel named <code>training</code>; the data at this location is copied from S3 to the <code>SM_CHANNEL_TRAINING</code> location inside the container, where the training code can use it. <code>fit</code> also accepts several other input types; see the <strong><a href="https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit" target="_blank" class ='btn-default'>API documentation</a></strong> for details.</p>
<p>When training starts, the TensorFlow container runs <code>image_classifier.py</code>, and the <code>Estimator</code> passes <code>hyperparameters</code> and <code>model_dir</code> to the script as arguments. If <code>model_dir</code> is not specified, it defaults to <strong>s3://[DEFAULT_BUCKET]/[TRAINING_JOB_NAME]</strong>, and the script is invoked as follows.</p>
    
```bash
python image_classifier.py --model_dir s3://[DEFAULT_BUCKET]/[TRAINING_JOB_NAME]
```
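
<p>To illustrate the idea, the sketch below flattens a hyperparameter dictionary into <code>--key value</code> pairs and parses them back with <code>argparse</code>, which is roughly what the training script sees. <code>to_cli_args</code> is a hypothetical helper; the exact ordering and serialization that SageMaker applies may differ.</p>

```python
import argparse


def to_cli_args(hyperparameters, model_dir):
    """Flatten a hyperparameter dict into the --key value form a script receives."""
    cli_args = ['--model_dir', model_dir]
    for key in sorted(hyperparameters):
        cli_args.extend(['--{}'.format(key), str(hyperparameters[key])])
    return cli_args


demo_hyperparameters = {
    'model_name': 'mobilenet_v1_025',
    'batch_size': 128,
    'learning_rate': 0.001,
}
demo_cli_args = to_cli_args(demo_hyperparameters,
                            's3://[DEFAULT_BUCKET]/[TRAINING_JOB_NAME]')

# The training script declares matching arguments and converts the strings back.
parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str)
parser.add_argument('--model_name', type=str)
parser.add_argument('--batch_size', type=int)
parser.add_argument('--learning_rate', type=float)
parsed, _unknown = parser.parse_known_args(demo_cli_args)
print(parsed.model_name, parsed.batch_size, parsed.learning_rate)
```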
<p>When training completes, the training job uploads the saved model to S3 for TensorFlow Serving.</p>
<p>If <strong>wait=True</strong> is set in <code>fit</code>, the call runs <strong>synchronously</strong>; with <strong>wait=False</strong> it runs <strong>asynchronously</strong>, so multiple training jobs can be launched at the same time.</p>

In [32]:
estimator.fit(
    inputs = {'training': inputs},
    job_name=training_job_name,
    logs='All',
    wait=False
)
print("training_job_name : {}".format(training_job_name))

training_job_name : 1593562844-img-classifier-training-job
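
<p>With <code>wait=False</code>, the notebook returns immediately, so you may want to check on the job yourself. A minimal polling sketch is shown below. It relies on the <code>describe_training_job</code> call of the boto3 SageMaker client and its <code>TrainingJobStatus</code> field; <code>wait_for_training_job</code> and the fake client are assumptions added so that the example runs without AWS access.</p>

```python
import time


def wait_for_training_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_training_job until the job is no longer running or stopping."""
    while True:
        description = sm_client.describe_training_job(TrainingJobName=job_name)
        status = description['TrainingJobStatus']
        if status not in ('InProgress', 'Stopping'):
            return status
        sleep(poll_seconds)


class FakeSageMakerClient(object):
    """Stand-in for the boto3 SageMaker client so the sketch runs offline."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobName': TrainingJobName,
                'TrainingJobStatus': self._statuses.pop(0)}


fake_client = FakeSageMakerClient(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_training_job(fake_client, 'demo-job',
                                     poll_seconds=0, sleep=lambda _seconds: None)
print(final_status)
```

<p>In the notebook, <code>wait_for_training_job(sm, training_job_name)</code> would use the real client created in the environment setup.</p>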


<p>You can follow the progress of a training job started <strong>asynchronously</strong> in real time as shown below.</p>

In [None]:
sm_sess = sagemaker.Session()
sm_sess.logs_for_job(estimator.latest_training_job.name, wait=True, log_type='All')

2020-07-01 00:20:48 Starting - Starting the training job...
2020-07-01 00:20:49 Starting - Launching requested ML instances.........
2020-07-01 00:22:20 Starting - Preparing the instances for training......
2020-07-01 00:23:28 Downloading - Downloading input data...
2020-07-01 00:24:02 Training - Downloading the training image...
2020-07-01 00:24:30 Training - Training image download completed. Training in progress.
parser.parse_known_args() : (Namespace(adadelta_rho=0.95, adagrad_initial_accumulator_value=0.1, adam_beta1=0.9, adam_beta2=0.999, batch_size=128, checkpoint_exclude_scopes=None, clone_on_cpu=False, current_host='algo-1', data_config={u'training': {u'TrainingInputMode': u'File', u'RecordWrapperType': u'None', u'S3DistributionType': u'FullyReplicated'}}, dataset_dir='/opt/ml/input/data/training', dataset_name='captured_dataset', end_learning_rate=0.01, eval_batch_size=1000, finetune_checkpoint_path='fine_tune_checkpoint/model.ckpt-25000', ftrl_initial_accumulator_value=0

W0701 00:24:45.649071 139735246075648 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:742: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0701 00:24:49.803543 139735246075648 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py:1282: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I0701 00:24:49.805938 139735246075648 saver.py:1286] Restoring parameters from fine_tune_checkpoint/model.ckpt-25000
I0701 00:24:50.085700 139735246075648 session_manager.py:500] Running local_init_op.
I0701 00:24:50.2031

I0701 00:27:27.976017 139735246075648 learning.py:507] global step 610: loss = 1.5476 (0.237 sec/step)
I0701 00:27:30.330969 139735246075648 learning.py:507] global step 620: loss = 1.5546 (0.240 sec/step)
I0701 00:27:32.677287 139735246075648 learning.py:507] global step 630: loss = 1.4961 (0.235 sec/step)
I0701 00:27:35.005853 139735246075648 learning.py:507] global step 640: loss = 1.4910 (0.225 sec/step)
I0701 00:27:37.369421 139735246075648 learning.py:507] global step 650: loss = 1.4802 (0.238 sec/step)
I0701 00:27:39.739717 139735246075648 learning.py:507] global step 660: loss = 1.4876 (0.238 sec/step)
I0701 00:27:42.072590 139735246075648 learning.py:507] global step 670: loss = 1.5303 (0.233 sec/step)
I0701 00:27:44.440766 139735246075648 learning.py:507] global step 680: loss = 1.5468 (0.237 sec/step)
I0701 00:27:46.789722 139735246075648 learning.py:507] global step 690: loss = 1.5251 (0.233 sec/st

I0701 00:30:31.366954 139735246075648 learning.py:507] global step 1380: loss = 1.5474 (0.230 sec/step)
I0701 00:30:33.725805 139735246075648 learning.py:507] global step 1390: loss = 1.4293 (0.228 sec/step)
I0701 00:30:36.080322 139735246075648 learning.py:507] global step 1400: loss = 1.5612 (0.238 sec/step)
I0701 00:30:38.395329 139735246075648 learning.py:507] global step 1410: loss = 1.4569 (0.224 sec/step)
I0701 00:30:40.757076 139735246075648 learning.py:507] global step 1420: loss = 1.5664 (0.232 sec/step)
I0701 00:30:43.148196 139735246075648 learning.py:507] global step 1430: loss = 1.5627 (0.236 sec/step)
I0701 00:30:45.512945 139735246075648 learning.py:507] global step 1440: loss = 1.4811 (0.234 sec/step)
I0701 00:30:47.877816 139735246075648 learning.py:507] global step 1450: loss = 1.5141 (0.235 sec/step)
I0701 00:30:50.229621 139735246075648 learning.py:507] global step 1460: loss = 1.4330 (0.2

I0701 00:33:33.350574 139735246075648 learning.py:507] global step 2150: loss = 1.5229 (0.235 sec/step)
I0701 00:33:35.812448 139735246075648 learning.py:507] global step 2160: loss = 1.4428 (0.237 sec/step)
I0701 00:33:38.162539 139735246075648 learning.py:507] global step 2170: loss = 1.4360 (0.237 sec/step)
I0701 00:33:40.527194 139735246075648 learning.py:507] global step 2180: loss = 1.3491 (0.239 sec/step)
I0701 00:33:42.896497 139735246075648 learning.py:507] global step 2190: loss = 1.5268 (0.231 sec/step)
I0701 00:33:45.261260 139735246075648 learning.py:507] global step 2200: loss = 1.6098 (0.228 sec/step)
I0701 00:33:47.631556 139735246075648 learning.py:507] global step 2210: loss = 1.4733 (0.226 sec/step)
I0701 00:33:49.989438 139735246075648 learning.py:507] global step 2220: loss = 1.7007 (0.233 sec/step)
I0701 00:33:52.346627 139735246075648 learning.py:507] global step 2230: loss = 1.4506 (0.2

I0701 00:36:22.354020 139735246075648 learning.py:507] global step 2860: loss = 1.4669 (0.235 sec/step)
I0701 00:36:24.723222 139735246075648 learning.py:507] global step 2870: loss = 1.4979 (0.242 sec/step)
I0701 00:36:27.081810 139735246075648 learning.py:507] global step 2880: loss = 1.5242 (0.238 sec/step)
I0701 00:36:29.425035 139735246075648 learning.py:507] global step 2890: loss = 1.4862 (0.238 sec/step)
I0701 00:36:31.798669 139735246075648 learning.py:507] global step 2900: loss = 1.4361 (0.234 sec/step)
I0701 00:36:34.196156 139735246075648 learning.py:507] global step 2910: loss = 1.5536 (0.235 sec/step)
I0701 00:36:36.538780 139735246075648 learning.py:507] global step 2920: loss = 1.4416 (0.230 sec/step)
I0701 00:36:38.918688 139735246075648 learning.py:507] global step 2930: loss = 1.4006 (0.231 sec/step)
I0701 00:36:41.298794 139735246075648 learning.py:507] global step 2940: loss = 1.4849 (0.2

I0701 01:06:08.440344 140367827080960 learning.py:754] Starting Session.
I0701 01:06:08.552064 140361938757376 supervisor.py:1117] Saving checkpoint to path /opt/ml/model/model.ckpt
I0701 01:06:08.557351 140367827080960 learning.py:768] Starting Queues.
I0701 01:06:09.560029 140361930364672 supervisor.py:1099] global_step/sec: 0
I0701 01:06:17.518559 140361921971968 supervisor.py:1050] Recording summary at step 1.
I0701 01:06:19.704246 140367827080960 learning.py:507] global step 10: loss = 1.4531 (0.232 sec/step)
I0701 01:06:22.113562 140367827080960 learning.py:507] global step 20: loss = 1.4690 (0.237 sec/step)
I0701 01:06:24.505810 140367827080960 learning.py:507] global step 30: loss = 1.6307 (0.235 sec/step)
I0701 01:06:26.911880 140367827080960 learning.py:507] global step 40: loss = 1.6451 (0.246 sec/step)
I0701 01:06:29.301457 140367827080960 learning.py:507] global step 50: loss = 1.5252 (0.

I0701 01:09:06.242714 140367827080960 learning.py:507] global step 710: loss = 1.4501 (0.235 sec/step)
I0701 01:09:08.618967 140367827080960 learning.py:507] global step 720: loss = 1.4182 (0.246 sec/step)
I0701 01:09:10.998249 140367827080960 learning.py:507] global step 730: loss = 1.6087 (0.233 sec/step)
I0701 01:09:13.370057 140367827080960 learning.py:507] global step 740: loss = 1.4693 (0.244 sec/step)
I0701 01:09:15.751003 140367827080960 learning.py:507] global step 750: loss = 1.3897 (0.233 sec/step)
I0701 01:09:18.159667 140367827080960 learning.py:507] global step 760: loss = 1.4677 (0.240 sec/step)
I0701 01:09:20.527493 140367827080960 learning.py:507] global step 770: loss = 1.4808 (0.228 sec/step)
I0701 01:09:22.931441 140367827080960 learning.py:507] global step 780: loss = 1.5765 (0.243 sec/step)
I0701 01:09:25.301990 140367827080960 learning.py:507] global step 790: loss = 1.5478 (0.241 sec/st

I0701 01:12:03.386049 140367827080960 learning.py:507] global step 1450: loss = 1.5000 (0.238 sec/step)
I0701 01:12:05.725855 140367827080960 learning.py:507] global step 1460: loss = 1.5593 (0.246 sec/step)
I0701 01:12:08.093163 140367827080960 learning.py:507] global step 1470: loss = 1.6935 (0.235 sec/step)
I0701 01:12:10.477116 140367827080960 learning.py:507] global step 1480: loss = 1.4750 (0.239 sec/step)
I0701 01:12:12.837841 140367827080960 learning.py:507] global step 1490: loss = 1.5834 (0.240 sec/step)
I0701 01:12:15.227312 140367827080960 learning.py:507] global step 1500: loss = 1.5428 (0.238 sec/step)
I0701 01:12:17.596560 140367827080960 learning.py:507] global step 1510: loss = 1.4488 (0.232 sec/step)
I0701 01:12:19.987706 140367827080960 learning.py:507] global step 1520: loss = 1.4706 (0.234 sec/step)
I0701 01:12:22.360666 140367827080960 learning.py:507] global step 1530: loss = 1.5906 (0.2

I0701 01:15:06.339648 140367827080960 learning.py:507] global step 2220: loss = 1.4723 (0.242 sec/step)
I0701 01:15:08.697695 140367827080960 learning.py:507] global step 2230: loss = 1.5058 (0.236 sec/step)
I0701 01:15:11.084142 140367827080960 learning.py:507] global step 2240: loss = 1.6195 (0.240 sec/step)
I0701 01:15:13.466917 140367827080960 learning.py:507] global step 2250: loss = 1.5645 (0.240 sec/step)
I0701 01:15:15.833642 140367827080960 learning.py:507] global step 2260: loss = 1.4441 (0.236 sec/step)
I0701 01:15:18.173367 140367827080960 learning.py:507] global step 2270: loss = 1.6079 (0.235 sec/step)
I0701 01:15:20.550326 140367827080960 learning.py:507] global step 2280: loss = 1.5767 (0.233 sec/step)
I0701 01:15:22.915978 140367827080960 learning.py:507] global step 2290: loss = 1.6251 (0.241 sec/step)
I0701 01:15:25.285830 140367827080960 learning.py:507] global step 2300: loss = 1.5318 (0.2

I0701 01:18:03.383128 140367827080960 learning.py:507] global step 2960: loss = 1.4559 (0.237 sec/step)
I0701 01:18:05.752002 140367827080960 learning.py:507] global step 2970: loss = 1.5315 (0.236 sec/step)
I0701 01:18:08.127285 140367827080960 learning.py:507] global step 2980: loss = 1.3887 (0.242 sec/step)
I0701 01:18:10.488058 140367827080960 learning.py:507] global step 2990: loss = 1.5295 (0.220 sec/step)
I0701 01:18:12.872689 140367827080960 learning.py:507] global step 3000: loss = 1.4215 (0.240 sec/step)
I0701 01:18:15.249557 140367827080960 learning.py:507] global step 3010: loss = 1.4889 (0.237 sec/step)
I0701 01:18:17.609390 140367827080960 learning.py:507] global step 3020: loss = 1.4214 (0.226 sec/step)
I0701 01:18:19.974962 140367827080960 learning.py:507] global step 3030: loss = 1.6105 (0.237 sec/step)
I0701 01:18:22.338599 140367827080960 learning.py:507] global step 3040: loss = 1.4441 (0.2

I0701 01:21:06.437680 140367827080960 learning.py:507] global step 3730: loss = 1.4747 (0.236 sec/step)
I0701 01:21:08.579910 140361930364672 supervisor.py:1099] global_step/sec: 4.18963
I0701 01:21:09.797982 140367827080960 learning.py:507] global step 3740: loss = 1.4233 (0.404 sec/step)
I0701 01:21:09.855422 140361921971968 supervisor.py:1050] Recording summary at step 3740.
I0701 01:21:12.135967 140367827080960 learning.py:507] global step 3750: loss = 1.4923 (0.238 sec/step)
I0701 01:21:14.524939 140367827080960 learning.py:507] global step 3760: loss = 1.4659 (0.244 sec/step)
I0701 01:21:16.868747 140367827080960 learning.py:507] global step 3770: loss = 1.4586 (0.224 sec/step)
I0701 01:21:19.260210 140367827080960 learning.py:507] global step 3780: loss = 1.4747 (0.236 sec/step)
I0701 01:21:21.638789 140367827080960 learning.py:507] global step 3790: loss = 1.4393 (0.237 sec/step)
I0701 01:21:2

I0701 01:23:58.459320 140367827080960 learning.py:507] global step 4450: loss = 1.3727 (0.233 sec/step)
I0701 01:24:00.838200 140367827080960 learning.py:507] global step 4460: loss = 1.4726 (0.234 sec/step)
I0701 01:24:03.212636 140367827080960 learning.py:507] global step 4470: loss = 1.4755 (0.236 sec/step)
I0701 01:24:05.570899 140367827080960 learning.py:507] global step 4480: loss = 1.4353 (0.231 sec/step)
I0701 01:24:07.944591 140367827080960 learning.py:507] global step 4490: loss = 1.4507 (0.225 sec/step)
I0701 01:24:10.335042 140367827080960 learning.py:507] global step 4500: loss = 1.3758 (0.236 sec/step)
I0701 01:24:12.703613 140367827080960 learning.py:507] global step 4510: loss = 1.5363 (0.236 sec/step)
I0701 01:24:15.086867 140367827080960 learning.py:507] global step 4520: loss = 1.4932 (0.238 sec/step)
I0701 01:24:17.484003 140367827080960 learning.py:507] global step 4530: loss = 1.4730 (0.2

I0701 01:26:50.788121 140367827080960 learning.py:507] global step 5170: loss = 1.3727 (0.235 sec/step)
I0701 01:26:53.141916 140367827080960 learning.py:507] global step 5180: loss = 1.4776 (0.231 sec/step)
I0701 01:26:55.508317 140367827080960 learning.py:507] global step 5190: loss = 1.5357 (0.228 sec/step)
I0701 01:26:57.909327 140367827080960 learning.py:507] global step 5200: loss = 1.3621 (0.237 sec/step)
I0701 01:27:00.278306 140367827080960 learning.py:507] global step 5210: loss = 1.5503 (0.233 sec/step)
I0701 01:27:02.643315 140367827080960 learning.py:507] global step 5220: loss = 1.5466 (0.237 sec/step)
I0701 01:27:05.022949 140367827080960 learning.py:507] global step 5230: loss = 1.4786 (0.235 sec/step)
I0701 01:27:07.394824 140367827080960 learning.py:507] global step 5240: loss = 1.4242 (0.229 sec/step)
I0701 01:27:09.785290 140367827080960 learning.py:507] global step 5250: loss = 1.5172 (0.2

I0701 01:29:46.767529 140367827080960 learning.py:507] global step 5910: loss = 1.4806 (0.225 sec/step)
I0701 01:29:49.152883 140367827080960 learning.py:507] global step 5920: loss = 1.4197 (0.230 sec/step)
I0701 01:29:51.522228 140367827080960 learning.py:507] global step 5930: loss = 1.4433 (0.239 sec/step)
I0701 01:29:53.900577 140367827080960 learning.py:507] global step 5940: loss = 1.5719 (0.234 sec/step)
I0701 01:29:56.289901 140367827080960 learning.py:507] global step 5950: loss = 1.4514 (0.230 sec/step)
I0701 01:29:58.677150 140367827080960 learning.py:507] global step 5960: loss = 1.3780 (0.232 sec/step)
I0701 01:30:01.071516 140367827080960 learning.py:507] global step 5970: loss = 1.2913 (0.236 sec/step)
I0701 01:30:03.463198 140367827080960 learning.py:507] global step 5980: loss = 1.4089 (0.236 sec/step)
I0701 01:30:05.847067 140367827080960 learning.py:507] global step 5990: loss = 1.5886 (0.2

I0701 01:32:36.821429 140367827080960 learning.py:507] global step 6620: loss = 1.2669 (0.234 sec/step)
I0701 01:32:39.192507 140367827080960 learning.py:507] global step 6630: loss = 1.5819 (0.231 sec/step)
I0701 01:32:41.550673 140367827080960 learning.py:507] global step 6640: loss = 1.4809 (0.239 sec/step)
I0701 01:32:43.961066 140367827080960 learning.py:507] global step 6650: loss = 1.2712 (0.236 sec/step)
I0701 01:32:46.345699 140367827080960 learning.py:507] global step 6660: loss = 1.4964 (0.239 sec/step)
I0701 01:32:48.739522 140367827080960 learning.py:507] global step 6670: loss = 1.4125 (0.230 sec/step)
I0701 01:32:51.115489 140367827080960 learning.py:507] global step 6680: loss = 1.4029 (0.240 sec/step)
I0701 01:32:53.464020 140367827080960 learning.py:507] global step 6690: loss = 1.4451 (0.239 sec/step)
I0701 01:32:55.821803 140367827080960 learning.py:507] global step 6700: loss = 1.5693 (0.2

I0701 01:35:30.918241 140367827080960 learning.py:507] global step 7350: loss = 1.5228 (0.240 sec/step)
I0701 01:35:33.309840 140367827080960 learning.py:507] global step 7360: loss = 1.4133 (0.241 sec/step)
I0701 01:35:35.684829 140367827080960 learning.py:507] global step 7370: loss = 1.4668 (0.237 sec/step)
I0701 01:35:38.064368 140367827080960 learning.py:507] global step 7380: loss = 1.4208 (0.234 sec/step)
I0701 01:35:40.432279 140367827080960 learning.py:507] global step 7390: loss = 1.5168 (0.240 sec/step)
I0701 01:35:42.815766 140367827080960 learning.py:507] global step 7400: loss = 1.4433 (0.238 sec/step)
I0701 01:35:45.221494 140367827080960 learning.py:507] global step 7410: loss = 1.4232 (0.240 sec/step)
I0701 01:35:47.598648 140367827080960 learning.py:507] global step 7420: loss = 1.3765 (0.231 sec/step)
I0701 01:35:50.001419 140367827080960 learning.py:507] global step 7430: loss = 1.4831 (0.2

I0701 01:38:21.188458 140367827080960 learning.py:507] global step 8060: loss = 1.4552 (0.239 sec/step)
I0701 01:38:23.596170 140367827080960 learning.py:507] global step 8070: loss = 1.3572 (0.240 sec/step)
I0701 01:38:25.980482 140367827080960 learning.py:507] global step 8080: loss = 1.4349 (0.227 sec/step)
I0701 01:38:28.392743 140367827080960 learning.py:507] global step 8090: loss = 1.4341 (0.234 sec/step)
I0701 01:38:30.783756 140367827080960 learning.py:507] global step 8100: loss = 1.4775 (0.238 sec/step)
I0701 01:38:33.170727 140367827080960 learning.py:507] global step 8110: loss = 1.4118 (0.228 sec/step)
I0701 01:38:35.533927 140367827080960 learning.py:507] global step 8120: loss = 1.4745 (0.234 sec/step)
I0701 01:38:37.903333 140367827080960 learning.py:507] global step 8130: loss = 1.5523 (0.244 sec/step)
I0701 01:38:40.326488 140367827080960 learning.py:507] global step 8140: loss = 1.4317 (0.2

I0701 01:41:13.420392 140367827080960 learning.py:507] global step 8780: loss = 1.4348 (0.232 sec/step)
I0701 01:41:15.809756 140367827080960 learning.py:507] global step 8790: loss = 1.4649 (0.235 sec/step)
I0701 01:41:18.197788 140367827080960 learning.py:507] global step 8800: loss = 1.4361 (0.242 sec/step)
I0701 01:41:20.558670 140367827080960 learning.py:507] global step 8810: loss = 1.4707 (0.228 sec/step)
I0701 01:41:22.926728 140367827080960 learning.py:507] global step 8820: loss = 1.5226 (0.238 sec/step)
I0701 01:41:25.283198 140367827080960 learning.py:507] global step 8830: loss = 1.4723 (0.235 sec/step)
I0701 01:41:27.653179 140367827080960 learning.py:507] global step 8840: loss = 1.4578 (0.235 sec/step)
I0701 01:41:29.998842 140367827080960 learning.py:507] global step 8850: loss = 1.4112 (0.240 sec/step)
I0701 01:41:32.403798 140367827080960 learning.py:507] global step 8860: loss = 1.4948 (0.2

I0701 01:44:07.300396 140367827080960 learning.py:507] global step 9510: loss = 1.3289 (0.233 sec/step)
I0701 01:44:09.669750 140367827080960 learning.py:507] global step 9520: loss = 1.4162 (0.244 sec/step)
I0701 01:44:12.050621 140367827080960 learning.py:507] global step 9530: loss = 1.3964 (0.246 sec/step)
I0701 01:44:14.423583 140367827080960 learning.py:507] global step 9540: loss = 1.3820 (0.241 sec/step)
I0701 01:44:16.796386 140367827080960 learning.py:507] global step 9550: loss = 1.3928 (0.237 sec/step)
[32mI0701 01:44:19.203363 140367827080960 learning.py:507] global step 9560: loss = 1.3207 (0.236 sec/step)[0m
[32mI0701 01:44:21.575881 140367827080960 learning.py:507] global step 9570: loss = 1.3874 (0.241 sec/step)[0m
[32mI0701 01:44:23.949872 140367827080960 learning.py:507] global step 9580: loss = 1.4068 (0.240 sec/step)[0m
[32mI0701 01:44:26.352528 140367827080960 learning.py:507] global step 9590: loss = 1.3714 (0.2

[32mI0701 01:46:57.278892 140367827080960 learning.py:507] global step 10220: loss = 1.4955 (0.246 sec/step)[0m
[32mI0701 01:46:59.672465 140367827080960 learning.py:507] global step 10230: loss = 1.2319 (0.233 sec/step)[0m
[32mI0701 01:47:02.024688 140367827080960 learning.py:507] global step 10240: loss = 1.4003 (0.234 sec/step)[0m
[32mI0701 01:47:04.387957 140367827080960 learning.py:507] global step 10250: loss = 1.3510 (0.225 sec/step)[0m
[32mI0701 01:47:06.768074 140367827080960 learning.py:507] global step 10260: loss = 1.3790 (0.233 sec/step)[0m
[32mI0701 01:47:09.133878 140367827080960 learning.py:507] global step 10270: loss = 1.5718 (0.237 sec/step)[0m
[32mI0701 01:47:11.464607 140367827080960 learning.py:507] global step 10280: loss = 1.3938 (0.231 sec/step)[0m
[32mI0701 01:47:13.831808 140367827080960 learning.py:507] global step 10290: loss = 1.3619 (0.233 sec/step)[0m
[32mI0701 01:47:16.201750 140367827080960 learning.py:507] global step 10300: loss = 1.

[32mI0701 01:49:57.493206 140367827080960 learning.py:507] global step 10980: loss = 1.4666 (0.241 sec/step)[0m
[32mI0701 01:49:59.850282 140367827080960 learning.py:507] global step 10990: loss = 1.3612 (0.230 sec/step)[0m
[32mI0701 01:50:02.217854 140367827080960 learning.py:507] global step 11000: loss = 1.4584 (0.239 sec/step)[0m
[32mI0701 01:50:04.569880 140367827080960 learning.py:507] global step 11010: loss = 1.4029 (0.228 sec/step)[0m
[32mI0701 01:50:06.930921 140367827080960 learning.py:507] global step 11020: loss = 1.4275 (0.237 sec/step)[0m
[32mI0701 01:50:09.290746 140367827080960 learning.py:507] global step 11030: loss = 1.4364 (0.237 sec/step)[0m
[32mI0701 01:50:11.678077 140367827080960 learning.py:507] global step 11040: loss = 1.2569 (0.238 sec/step)[0m
[32mI0701 01:50:14.053822 140367827080960 learning.py:507] global step 11050: loss = 1.5354 (0.234 sec/step)[0m
[32mI0701 01:50:16.431025 140367827080960 learning.py:507] global step 11060: loss = 1.

[32mI0701 01:52:47.429649 140367827080960 learning.py:507] global step 11690: loss = 1.4491 (0.240 sec/step)[0m
[32mI0701 01:52:49.815994 140367827080960 learning.py:507] global step 11700: loss = 1.3915 (0.234 sec/step)[0m
[32mI0701 01:52:52.213608 140367827080960 learning.py:507] global step 11710: loss = 1.3612 (0.239 sec/step)[0m
[32mI0701 01:52:54.610546 140367827080960 learning.py:507] global step 11720: loss = 1.3677 (0.235 sec/step)[0m
[32mI0701 01:52:57.001697 140367827080960 learning.py:507] global step 11730: loss = 1.3964 (0.241 sec/step)[0m
[32mI0701 01:52:59.362385 140367827080960 learning.py:507] global step 11740: loss = 1.4181 (0.236 sec/step)[0m
[32mI0701 01:53:01.769756 140367827080960 learning.py:507] global step 11750: loss = 1.3584 (0.234 sec/step)[0m
[32mI0701 01:53:04.152853 140367827080960 learning.py:507] global step 11760: loss = 1.4304 (0.245 sec/step)[0m
[32mI0701 01:53:06.510026 140367827080960 learning.py:507] global step 11770: loss = 1.

[32mI0701 01:55:38.725064 140367827080960 learning.py:507] global step 12410: loss = 1.4033 (0.241 sec/step)[0m
[32mI0701 01:55:41.115477 140367827080960 learning.py:507] global step 12420: loss = 1.4181 (0.239 sec/step)[0m
[32mI0701 01:55:43.530858 140367827080960 learning.py:507] global step 12430: loss = 1.3615 (0.244 sec/step)[0m
[32mI0701 01:55:45.874139 140367827080960 learning.py:507] global step 12440: loss = 1.3392 (0.236 sec/step)[0m
[32mI0701 01:55:48.264301 140367827080960 learning.py:507] global step 12450: loss = 1.4342 (0.238 sec/step)[0m
[32mI0701 01:55:50.635454 140367827080960 learning.py:507] global step 12460: loss = 1.4771 (0.238 sec/step)[0m
[32mI0701 01:55:53.003410 140367827080960 learning.py:507] global step 12470: loss = 1.3988 (0.236 sec/step)[0m
[32mI0701 01:55:55.355042 140367827080960 learning.py:507] global step 12480: loss = 1.3156 (0.244 sec/step)[0m
[32mI0701 01:55:57.726031 140367827080960 learning.py:507] global step 12490: loss = 1.

[32mI0701 01:58:18.941777 140367827080960 learning.py:507] global step 13080: loss = 1.5478 (0.240 sec/step)[0m
[32mI0701 01:58:21.343441 140367827080960 learning.py:507] global step 13090: loss = 1.3784 (0.245 sec/step)[0m
[32mI0701 01:58:23.697429 140367827080960 learning.py:507] global step 13100: loss = 1.3500 (0.226 sec/step)[0m
[32mI0701 01:58:26.064692 140367827080960 learning.py:507] global step 13110: loss = 1.5065 (0.239 sec/step)[0m
[32mI0701 01:58:28.415188 140367827080960 learning.py:507] global step 13120: loss = 1.4657 (0.228 sec/step)[0m
[32mI0701 01:58:30.783432 140367827080960 learning.py:507] global step 13130: loss = 1.2990 (0.235 sec/step)[0m
[32mI0701 01:58:33.135495 140367827080960 learning.py:507] global step 13140: loss = 1.3152 (0.235 sec/step)[0m
[32mI0701 01:58:35.495562 140367827080960 learning.py:507] global step 13150: loss = 1.3612 (0.237 sec/step)[0m
[32mI0701 01:58:37.854080 140367827080960 learning.py:507] global step 13160: loss = 1.

[32mI0701 02:01:25.061774 140367827080960 learning.py:507] global step 13860: loss = 1.3685 (0.234 sec/step)[0m
[32mI0701 02:01:27.416923 140367827080960 learning.py:507] global step 13870: loss = 1.4380 (0.237 sec/step)[0m
[32mI0701 02:01:29.814148 140367827080960 learning.py:507] global step 13880: loss = 1.4255 (0.239 sec/step)[0m
[32mI0701 02:01:32.189137 140367827080960 learning.py:507] global step 13890: loss = 1.4866 (0.240 sec/step)[0m
[32mI0701 02:01:34.561567 140367827080960 learning.py:507] global step 13900: loss = 1.3683 (0.238 sec/step)[0m
[32mI0701 02:01:36.925004 140367827080960 learning.py:507] global step 13910: loss = 1.3832 (0.238 sec/step)[0m
[32mI0701 02:01:39.315049 140367827080960 learning.py:507] global step 13920: loss = 1.3572 (0.236 sec/step)[0m
[32mI0701 02:01:41.704598 140367827080960 learning.py:507] global step 13930: loss = 1.3273 (0.239 sec/step)[0m
[32mI0701 02:01:44.050980 140367827080960 learning.py:507] global step 13940: loss = 1.

[32mI0701 02:04:16.050041 140367827080960 learning.py:507] global step 14580: loss = 1.3273 (0.238 sec/step)[0m
[32mI0701 02:04:18.404109 140367827080960 learning.py:507] global step 14590: loss = 1.3871 (0.232 sec/step)[0m
[32mI0701 02:04:20.798896 140367827080960 learning.py:507] global step 14600: loss = 1.4473 (0.239 sec/step)[0m
[32mI0701 02:04:23.186650 140367827080960 learning.py:507] global step 14610: loss = 1.3827 (0.245 sec/step)[0m
[32mI0701 02:04:25.553075 140367827080960 learning.py:507] global step 14620: loss = 1.4290 (0.235 sec/step)[0m
[32mI0701 02:04:27.925122 140367827080960 learning.py:507] global step 14630: loss = 1.4502 (0.227 sec/step)[0m
[32mI0701 02:04:30.306108 140367827080960 learning.py:507] global step 14640: loss = 1.3899 (0.239 sec/step)[0m
[32mI0701 02:04:32.690135 140367827080960 learning.py:507] global step 14650: loss = 1.3471 (0.244 sec/step)[0m
[32mI0701 02:04:35.059233 140367827080960 learning.py:507] global step 14660: loss = 1.

[32mI0701 02:07:10.311676 140367827080960 learning.py:507] global step 15310: loss = 1.4037 (0.236 sec/step)[0m
[32mI0701 02:07:12.651503 140367827080960 learning.py:507] global step 15320: loss = 1.4680 (0.231 sec/step)[0m
[32mI0701 02:07:15.021303 140367827080960 learning.py:507] global step 15330: loss = 1.3516 (0.227 sec/step)[0m
[32mI0701 02:07:17.398276 140367827080960 learning.py:507] global step 15340: loss = 1.3300 (0.239 sec/step)[0m
[32mI0701 02:07:19.767375 140367827080960 learning.py:507] global step 15350: loss = 1.3756 (0.236 sec/step)[0m
[32mI0701 02:07:22.165488 140367827080960 learning.py:507] global step 15360: loss = 1.3305 (0.240 sec/step)[0m
[32mI0701 02:07:24.559042 140367827080960 learning.py:507] global step 15370: loss = 1.4520 (0.235 sec/step)[0m
[32mI0701 02:07:26.925276 140367827080960 learning.py:507] global step 15380: loss = 1.3682 (0.225 sec/step)[0m
[32mI0701 02:07:29.296529 140367827080960 learning.py:507] global step 15390: loss = 1.

[32mI0701 02:10:01.142545 140367827080960 learning.py:507] global step 16030: loss = 1.3655 (0.235 sec/step)[0m
[32mI0701 02:10:03.491894 140367827080960 learning.py:507] global step 16040: loss = 1.3753 (0.234 sec/step)[0m
[32mI0701 02:10:05.834808 140367827080960 learning.py:507] global step 16050: loss = 1.4096 (0.235 sec/step)[0m
[32mI0701 02:10:08.192497 140367827080960 learning.py:507] global step 16060: loss = 1.4480 (0.234 sec/step)[0m
[32mI0701 02:10:10.552006 140367827080960 learning.py:507] global step 16070: loss = 1.3529 (0.238 sec/step)[0m
[32mI0701 02:10:12.969381 140367827080960 learning.py:507] global step 16080: loss = 1.4108 (0.235 sec/step)[0m
[32mI0701 02:10:15.355283 140367827080960 learning.py:507] global step 16090: loss = 1.4776 (0.235 sec/step)[0m
[32mI0701 02:10:17.725588 140367827080960 learning.py:507] global step 16100: loss = 1.3533 (0.232 sec/step)[0m
[32mI0701 02:10:20.115427 140367827080960 learning.py:507] global step 16110: loss = 1.

[32mI0701 02:12:57.439519 140367827080960 learning.py:507] global step 16770: loss = 1.3084 (0.237 sec/step)[0m
[32mI0701 02:12:59.799427 140367827080960 learning.py:507] global step 16780: loss = 1.3860 (0.231 sec/step)[0m
[32mI0701 02:13:02.168826 140367827080960 learning.py:507] global step 16790: loss = 1.3185 (0.244 sec/step)[0m
[32mI0701 02:13:04.518376 140367827080960 learning.py:507] global step 16800: loss = 1.3074 (0.239 sec/step)[0m
[32mI0701 02:13:06.877789 140367827080960 learning.py:507] global step 16810: loss = 1.3709 (0.234 sec/step)[0m
[32mI0701 02:13:09.232033 140367827080960 learning.py:507] global step 16820: loss = 1.3981 (0.240 sec/step)[0m
[32mI0701 02:13:11.610884 140367827080960 learning.py:507] global step 16830: loss = 1.3926 (0.239 sec/step)[0m
[32mI0701 02:13:13.971019 140367827080960 learning.py:507] global step 16840: loss = 1.2730 (0.236 sec/step)[0m
[32mI0701 02:13:16.357522 140367827080960 learning.py:507] global step 16850: loss = 1.

[32mI0701 02:15:48.444488 140367827080960 learning.py:507] global step 17490: loss = 1.3064 (0.235 sec/step)[0m
[32mI0701 02:15:50.827689 140367827080960 learning.py:507] global step 17500: loss = 1.4191 (0.230 sec/step)[0m
[32mI0701 02:15:53.202471 140367827080960 learning.py:507] global step 17510: loss = 1.3215 (0.233 sec/step)[0m
[32mI0701 02:15:55.583663 140367827080960 learning.py:507] global step 17520: loss = 1.5055 (0.235 sec/step)[0m
[32mI0701 02:15:57.964616 140367827080960 learning.py:507] global step 17530: loss = 1.3058 (0.235 sec/step)[0m
[32mI0701 02:16:00.342035 140367827080960 learning.py:507] global step 17540: loss = 1.3471 (0.241 sec/step)[0m
[32mI0701 02:16:02.704543 140367827080960 learning.py:507] global step 17550: loss = 1.4605 (0.230 sec/step)[0m
[32mI0701 02:16:05.057917 140367827080960 learning.py:507] global step 17560: loss = 1.3936 (0.239 sec/step)[0m
[32mI0701 02:16:07.440421 140367827080960 learning.py:507] global step 17570: loss = 1.

[32mI0701 02:18:40.605462 140367827080960 learning.py:507] global step 18210: loss = 1.4014 (0.241 sec/step)[0m
[32mI0701 02:18:42.973313 140367827080960 learning.py:507] global step 18220: loss = 1.5102 (0.238 sec/step)[0m
[32mI0701 02:18:45.358773 140367827080960 learning.py:507] global step 18230: loss = 1.3917 (0.235 sec/step)[0m
[32mI0701 02:18:47.720493 140367827080960 learning.py:507] global step 18240: loss = 1.5399 (0.232 sec/step)[0m
[32mI0701 02:18:50.089432 140367827080960 learning.py:507] global step 18250: loss = 1.3464 (0.238 sec/step)[0m
[32mI0701 02:18:52.478522 140367827080960 learning.py:507] global step 18260: loss = 1.4350 (0.230 sec/step)[0m
[32mI0701 02:18:54.842022 140367827080960 learning.py:507] global step 18270: loss = 1.4627 (0.240 sec/step)[0m
[32mI0701 02:18:57.209100 140367827080960 learning.py:507] global step 18280: loss = 1.4399 (0.240 sec/step)[0m
[32mI0701 02:18:59.546264 140367827080960 learning.py:507] global step 18290: loss = 1.

[32mI0701 02:21:34.740288 140367827080960 learning.py:507] global step 18940: loss = 1.3953 (0.237 sec/step)[0m
[32mI0701 02:21:37.100079 140367827080960 learning.py:507] global step 18950: loss = 1.3616 (0.232 sec/step)[0m
[32mI0701 02:21:39.471885 140367827080960 learning.py:507] global step 18960: loss = 1.3162 (0.239 sec/step)[0m
[32mI0701 02:21:41.857556 140367827080960 learning.py:507] global step 18970: loss = 1.3730 (0.241 sec/step)[0m
[32mI0701 02:21:44.228373 140367827080960 learning.py:507] global step 18980: loss = 1.4940 (0.243 sec/step)[0m
[32mI0701 02:21:46.584085 140367827080960 learning.py:507] global step 18990: loss = 1.4324 (0.233 sec/step)[0m
[32mI0701 02:21:48.991602 140367827080960 learning.py:507] global step 19000: loss = 1.3253 (0.235 sec/step)[0m
[32mI0701 02:21:51.375726 140367827080960 learning.py:507] global step 19010: loss = 1.3836 (0.238 sec/step)[0m
[32mI0701 02:21:53.751705 140367827080960 learning.py:507] global step 19020: loss = 1.

[32mI0701 02:24:27.645294 140367827080960 learning.py:507] global step 19670: loss = 1.3623 (0.236 sec/step)[0m
[32mI0701 02:24:30.011737 140367827080960 learning.py:507] global step 19680: loss = 1.3835 (0.235 sec/step)[0m
[32mI0701 02:24:32.386437 140367827080960 learning.py:507] global step 19690: loss = 1.4117 (0.239 sec/step)[0m
[32mI0701 02:24:34.791805 140367827080960 learning.py:507] global step 19700: loss = 1.3482 (0.233 sec/step)[0m
[32mI0701 02:24:37.130825 140367827080960 learning.py:507] global step 19710: loss = 1.4700 (0.235 sec/step)[0m
[32mI0701 02:24:39.507546 140367827080960 learning.py:507] global step 19720: loss = 1.2743 (0.226 sec/step)[0m
[32mI0701 02:24:41.871457 140367827080960 learning.py:507] global step 19730: loss = 1.5025 (0.236 sec/step)[0m
[32mI0701 02:24:44.235464 140367827080960 learning.py:507] global step 19740: loss = 1.2886 (0.230 sec/step)[0m
[32mI0701 02:24:46.605443 140367827080960 learning.py:507] global step 19750: loss = 1.

[32mI0701 02:27:26.873224 140367827080960 learning.py:507] global step 20420: loss = 1.3700 (0.239 sec/step)[0m
[32mI0701 02:27:29.234690 140367827080960 learning.py:507] global step 20430: loss = 1.3448 (0.242 sec/step)[0m
[32mI0701 02:27:31.580889 140367827080960 learning.py:507] global step 20440: loss = 1.2905 (0.238 sec/step)[0m
[32mI0701 02:27:33.946213 140367827080960 learning.py:507] global step 20450: loss = 1.4120 (0.236 sec/step)[0m
[32mI0701 02:27:36.332820 140367827080960 learning.py:507] global step 20460: loss = 1.3789 (0.240 sec/step)[0m
[32mI0701 02:27:38.697267 140367827080960 learning.py:507] global step 20470: loss = 1.4220 (0.231 sec/step)[0m
[32mI0701 02:27:41.070329 140367827080960 learning.py:507] global step 20480: loss = 1.3286 (0.244 sec/step)[0m
[32mI0701 02:27:43.452883 140367827080960 learning.py:507] global step 20490: loss = 1.3944 (0.239 sec/step)[0m
[32mI0701 02:27:45.790287 140367827080960 learning.py:507] global step 20500: loss = 1.

[32mI0701 02:30:17.773850 140367827080960 learning.py:507] global step 21140: loss = 1.3283 (0.232 sec/step)[0m
[32mI0701 02:30:20.163125 140367827080960 learning.py:507] global step 21150: loss = 1.3202 (0.241 sec/step)[0m
[32mI0701 02:30:22.525923 140367827080960 learning.py:507] global step 21160: loss = 1.4647 (0.235 sec/step)[0m
[32mI0701 02:30:24.899550 140367827080960 learning.py:507] global step 21170: loss = 1.4367 (0.242 sec/step)[0m
[32mI0701 02:30:27.293216 140367827080960 learning.py:507] global step 21180: loss = 1.2506 (0.234 sec/step)[0m
[32mI0701 02:30:29.654544 140367827080960 learning.py:507] global step 21190: loss = 1.3215 (0.235 sec/step)[0m
[32mI0701 02:30:32.031584 140367827080960 learning.py:507] global step 21200: loss = 1.4365 (0.238 sec/step)[0m
[32mI0701 02:30:34.392072 140367827080960 learning.py:507] global step 21210: loss = 1.3341 (0.231 sec/step)[0m
[32mI0701 02:30:36.769926 140367827080960 learning.py:507] global step 21220: loss = 1.

[32mI0701 02:33:11.859224 140367827080960 learning.py:507] global step 21870: loss = 1.1890 (0.231 sec/step)[0m
[32mI0701 02:33:14.225339 140367827080960 learning.py:507] global step 21880: loss = 1.3124 (0.229 sec/step)[0m
[32mI0701 02:33:16.597301 140367827080960 learning.py:507] global step 21890: loss = 1.4429 (0.224 sec/step)[0m
[32mI0701 02:33:18.994446 140367827080960 learning.py:507] global step 21900: loss = 1.3799 (0.236 sec/step)[0m
[32mI0701 02:33:21.354675 140367827080960 learning.py:507] global step 21910: loss = 1.3026 (0.236 sec/step)[0m
[32mI0701 02:33:23.716437 140367827080960 learning.py:507] global step 21920: loss = 1.3373 (0.236 sec/step)[0m
[32mI0701 02:33:26.091933 140367827080960 learning.py:507] global step 21930: loss = 1.4043 (0.237 sec/step)[0m
[32mI0701 02:33:28.459032 140367827080960 learning.py:507] global step 21940: loss = 1.3192 (0.236 sec/step)[0m
[32mI0701 02:33:30.864434 140367827080960 learning.py:507] global step 21950: loss = 1.

[32mI0701 02:36:05.000130 140367827080960 learning.py:507] global step 22600: loss = 1.3120 (0.229 sec/step)[0m
[32mI0701 02:36:07.372318 140367827080960 learning.py:507] global step 22610: loss = 1.3874 (0.239 sec/step)[0m
[32mI0701 02:36:08.551965 140361938757376 supervisor.py:1117] Saving checkpoint to path /opt/ml/model/model.ckpt[0m
[32mI0701 02:36:10.228773 140361921971968 supervisor.py:1050] Recording summary at step 22618.[0m
[32mI0701 02:36:10.632455 140367827080960 learning.py:507] global step 22620: loss = 1.5265 (0.232 sec/step)[0m
[32mI0701 02:36:13.026427 140367827080960 learning.py:507] global step 22630: loss = 1.4055 (0.238 sec/step)[0m
[32mI0701 02:36:15.371654 140367827080960 learning.py:507] global step 22640: loss = 1.2377 (0.241 sec/step)[0m
[32mI0701 02:36:17.743081 140367827080960 learning.py:507] global step 22650: loss = 1.4349 (0.232 sec/step)[0m
[32mI0701 02:36:20.116293 140367827080960 learning.py:507] global step 22660: loss = 1.5049 (0.23

<p>Once training has fully completed, download the model artifacts from S3 to the SageMaker Notebook environment.</p>

In [None]:
# Strip only the trailing 'model' component so the rest of the S3 URI stays intact.
artifacts_dir = estimator.model_dir.rsplit('model', 1)[0]
print(artifacts_dir)
!aws s3 ls --human-readable {artifacts_dir}

In [None]:
model_dir = artifacts_dir + 'output/'
print(model_dir)
!aws s3 ls --human-readable {model_dir}
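
<p>The two cells above derive the <code>output/</code> prefix from <code>estimator.model_dir</code>. The following is a minimal sketch of the same derivation as a plain function, assuming the URI ends in a <code>/model</code> component (as the cells above imply). The function name <code>sibling_output_prefix</code> and the bucket and job names are hypothetical and used only for illustration.</p>

```python
def sibling_output_prefix(model_dir_uri):
    """Return the 'output/' prefix that sits next to a job's 'model' prefix."""
    # Drop any trailing slash, then split off only the last path component.
    parent, _, last = model_dir_uri.rstrip('/').rpartition('/')
    if last != 'model':
        # Assumption: the URI ends in '/model'; fail loudly if it does not.
        raise ValueError('expected a URI ending in /model, got: ' + model_dir_uri)
    return parent + '/output/'


# Hypothetical bucket and job name, for illustration only.
print(sibling_output_prefix('s3://example-bucket/example-job/model'))
```

<p>Splitting on the last path component, rather than replacing every occurrence of <code>model</code>, keeps the result correct even when the bucket or job name itself contains the word "model".</p>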

In [None]:
!rm -rf ./model_result/

In [None]:
import os

path = './model_result'
if not os.path.exists(path):
    os.makedirs(path)

!aws s3 cp {model_dir}model.tar.gz {path}/model.tar.gz
!tar -xzf {path}/model.tar.gz -C {path}
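
<p>If shell commands are not available, the extraction step can also be done in Python. This is a minimal sketch using only the standard library. The helper name <code>extract_model_archive</code> is hypothetical, and the sketch assumes the archive has already been copied locally and comes from a source you trust.</p>

```python
import os
import tarfile


def extract_model_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        # Only extract archives you trust: members are written as-is.
        tar.extractall(dest_dir)
    return names
```

<p>With the paths used in this notebook, a call would look like <code>extract_model_archive('./model_result/model.tar.gz', './model_result')</code>. The returned list makes it easy to confirm that the expected tflite file is among the extracted members.</p>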

<p>The training job was set up so that its final output includes a tflite file. After extracting the archive, upload the tflite file to S3 so that it can be reused.</p>

In [None]:
final_result = 's3://{}/{}'.format(bucket, 'workshop_final_result')

!aws s3 cp ./img_datasets/labels.txt {final_result}/labels.txt
!aws s3 cp {path}/mobilenetv1_model.tflite {final_result}/mobilenetv1_model.tflite
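
<p>A quick sanity check on the extracted file can catch an empty or corrupted artifact before it is passed on. The sketch below checks only that the file carries the <code>TFL3</code> FlatBuffer identifier that TFLite model files contain at byte offset 4. It does not validate the model itself. The helper name <code>looks_like_tflite</code> is hypothetical.</p>

```python
def looks_like_tflite(path):
    """Return True if the file carries the TFLite FlatBuffer identifier."""
    with open(path, 'rb') as f:
        header = f.read(8)
    # Bytes 0-3 hold the FlatBuffer root offset; bytes 4-7 hold the identifier.
    return len(header) == 8 and header[4:8] == b'TFL3'
```

<p>In this notebook the check would be called as <code>looks_like_tflite('./model_result/mobilenetv1_model.tflite')</code>. Loading the file with a TFLite interpreter is a stronger test, but it requires a TensorFlow installation.</p>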


<p>All training on Amazon SageMaker is now complete. Next, run the converter so that the tflite model can be used on the AI chip. This step is performed in Cloud9.</p>