# Amazon SageMaker Training Script

<p>This example contains code for training with TensorFlow 1.x on Python 2.7 so that the resulting model can run on an AI chip developed by LG.</p>
<p>The code is based on the <strong><a href="https://github.com/tensorflow/models/tree/master/research/slim" target="_blank" class ='btn-default'>TensorFlow-Slim image classification model library</a></strong>, adapted into an entry-point script that can be trained on SageMaker. Structuring the training as an Amazon SageMaker script has two benefits: by changing a few parameters in the notebook, you can train many variants of the same model architecture with different hyperparameters in parallel on whatever infrastructure you choose, and you can then deploy the best-performing model to a hosted Endpoint directly from the notebook.</p>

<p>In this lab, we walk through how training on Amazon SageMaker is structured and briefly try out how to run a training job.</p>

# 1. SageMaker notebook overview
<p>A SageMaker notebook is a fully managed, container-based service. You cannot see the containers directly, but internally it works as shown below.</p>
<p><img src="./fig/sm_notebook.png" width="700" height="70"></p>

- **S3 (Simple Storage Service)**: Object storage used to hold the training data files and the training outputs: models, checkpoints, event files for TensorBoard, logs, and so on.
- **SageMaker Notebook**: Provides a Python development environment for writing and debugging training scripts and for launching the actual training.
- **Amazon Elastic Container Registry (ECR)**: A fully managed Docker container registry that makes it easy to store, manage, and deploy Docker container images. SageMaker provides prebuilt containers, so you do not need to register an image in ECR yourself. If you need a custom training or deployment environment, however, you can build a custom container image, push it to ECR, and use that instead.

<p>Training and inference hosting can each run in a different container environment, and because these environments are easy to scale out, many training jobs and hosted endpoints can run at the same time.
</p>

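# 2. Environment setup

<p>Import the basic packages needed for training on SageMaker.</p>
<p>boto3 provides a convenient abstraction that hides the underlying HTTP API calls, with Python classes for working with AWS resources such as Amazon EC2 instances and S3 buckets.</p>
<p>The SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker.</p>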

In [1]:
import sys

In [2]:
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install tensorflow_gpu==1.14

DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages (20.1)


In [2]:
import os
import time
import sagemaker
import boto3
import tensorflow as tf

from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role
from sagemaker.session import Session

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image
from IPython.display import display

%matplotlib inline

ContextualVersionConflict: (botocore 1.16.16 (/Users/choijoon/.local/lib/python3.7/site-packages), Requirement.parse('botocore<1.14.0,>=1.13.33'), {'boto3'})

<p>Set up the SageMaker session and IAM role that will be used throughout the rest of the notebook.</p>

In [4]:
sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

sess = boto3.Session()
sm = sess.client('sagemaker')

## 3. Getting the S3 data location
<p>The S3 bucket that stores the dataset is created below under the name held in <code>data_bucket</code>. By default, the models and logs produced by SageMaker training are saved to an automatically created bucket.</p>

In [3]:
# Create an S3 bucket to hold the data; note that your account may already have a bucket with the same name.
account_id = sess.client('sts').get_caller_identity()["Account"]
data_bucket = 'sagemaker-experiments-{}-{}'.format(sess.region_name, account_id)
bucket = 'sagemaker-{}-{}'.format(sess.region_name, account_id)

try:
    if sess.region_name == "us-east-1":
        sess.client('s3').create_bucket(Bucket=data_bucket)
    else:
        sess.client('s3').create_bucket(Bucket=data_bucket, 
                                        CreateBucketConfiguration={'LocationConstraint': sess.region_name})
except Exception as e:
    print(e)

NameError: name 'sess' is not defined
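
The region handling in the cell above can be isolated into small pure functions, which makes it easy to check without calling AWS. This is only a sketch: `bucket_names` and `create_bucket_kwargs` are hypothetical helper names, not part of boto3, and the account ID in the usage is a placeholder. Note also that `sess` is created in the session-setup cell, so that cell has to run before this one; the `NameError` shown above is what results otherwise.

```python
def bucket_names(region, account_id):
    """Return (data_bucket, default_bucket) following the naming used above."""
    data_bucket = 'sagemaker-experiments-{}-{}'.format(region, account_id)
    default_bucket = 'sagemaker-{}-{}'.format(region, account_id)
    return data_bucket, default_bucket


def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for the S3 client's create_bucket call.

    us-east-1 is the default S3 location and is not passed as a
    LocationConstraint; every other region needs it spelled out.
    """
    kwargs = {'Bucket': bucket}
    if region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs


# Placeholder account ID, for illustration only.
data_bucket, bucket = bucket_names('us-east-2', '123456789012')
print(create_bucket_kwargs(data_bucket, 'us-east-2'))
```

With a real client this could be called as `sess.client('s3').create_bucket(**create_bucket_kwargs(data_bucket, sess.region_name))`.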

## Data Generator

In [6]:
sys.path.append('/home/ec2-user/SageMaker/src_dir/')

In [7]:
from datasets import download_and_convert_visualwakewords

In [8]:
dataset_dir = 'raw_datasets'
small_object_area_threshold = 0.005
foreground_class_of_interest = 'dog'
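
The two parameters above control how COCO annotations are turned into binary Visual Wake Words labels. As a rough sketch of the idea (not the actual code in `download_and_convert_visualwakewords`, whose internals are not shown here): an image is labeled 1 when it contains at least one object of the class of interest whose area exceeds the given fraction of the image area. The function name, the annotation layout, and the use of the bounding box as a stand-in for the object area are assumptions made for illustration.

```python
def visualwakewords_label(objects, image_width, image_height,
                          foreground_class='dog',
                          small_object_area_threshold=0.005):
    """Return 1 if a large-enough foreground object is present, else 0.

    `objects` is a hypothetical list of dicts with a 'category' name and a
    COCO-style 'bbox' given as [x, y, width, height] in pixels.
    """
    image_area = float(image_width * image_height)
    for obj in objects:
        _, _, box_w, box_h = obj['bbox']
        area_fraction = (box_w * box_h) / image_area
        if (obj['category'] == foreground_class
                and area_fraction > small_object_area_threshold):
            return 1
    return 0


# A 100x100 dog in a 640x480 image covers about 3.3% of it: label 1.
print(visualwakewords_label(
    [{'category': 'dog', 'bbox': [10, 10, 100, 100]}], 640, 480))
# A 20x20 dog covers about 0.13%, below the 0.5% threshold: label 0.
print(visualwakewords_label(
    [{'category': 'dog', 'bbox': [10, 10, 20, 20]}], 640, 480))
```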

In [9]:
# !rm -rf {dataset_dir}/coco_dataset

In [10]:
!wget -cP {dataset_dir}/coco_dataset http://images.cocodataset.org/zips/train2014.zip
!wget -cP {dataset_dir}/coco_dataset http://images.cocodataset.org/zips/val2014.zip
!wget -cP {dataset_dir}/coco_dataset http://images.cocodataset.org/annotations/annotations_trainval2014.zip
!unzip -nd {dataset_dir}/coco_dataset/ {dataset_dir}/coco_dataset/train2014.zip
!unzip -nd {dataset_dir}/coco_dataset/ {dataset_dir}/coco_dataset/val2014.zip
!unzip -nd {dataset_dir}/coco_dataset/ {dataset_dir}/coco_dataset/annotations_trainval2014.zip

--2020-05-17 04:13:31--  http://images.cocodataset.org/zips/train2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.177.3
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.177.3|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2020-05-17 04:13:31--  http://images.cocodataset.org/zips/val2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.177.3
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.177.3|:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2020-05-17 04:13:31--  http://images.cocodataset.org/annotations/annotations_trainval2014.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.177.3
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.177.3|:80... 

In [11]:
tfrecord_path = './raw_datasets/tfrecord'
json_path = './raw_datasets/json'

In [12]:
!rm -rf {tfrecord_path}
!rm -rf {json_path}

In [13]:
if not os.path.exists(tfrecord_path):
    os.makedirs(tfrecord_path)

if not os.path.exists(json_path):
    os.makedirs(json_path)

In [14]:
download_and_convert_visualwakewords.run(dataset_dir, small_object_area_threshold, foreground_class_of_interest)



INFO:tensorflow:Creating a labels file...
INFO:tensorflow:Creating train VisualWakeWords annotations...

INFO:tensorflow:Building annotations index...
INFO:tensorflow:702 images are missing annotations.
INFO:tensorflow:On image 0 of 82783
INFO:tensorflow:On image 100 of 82783
...
INFO:tensorflow:On image 82700 of 82783
...
INFO:tensorflow:On image 40500 of 40504
INFO:tensorflow:Creating train TFRecords for VisualWakeWords dataset...






INFO:tensorflow:On image 100 of 82783
...
INFO:tensorflow:On image 82700 of 82783
INFO:tensorflow:Creating validation TFRecords for VisualWakeWords dataset...

INFO:tensorflow:On image 100 of 40504
...
INFO:tensorflow:On image 23300 of 40504


### Upload dataset to S3

Next, we upload the TFRecord datasets to S3 so that they can be used in training and batch transform jobs.

In [20]:
prefix = 'coco_dataset/tfrecord_dog'
!aws s3 cp ./raw_datasets/tfrecord s3://{data_bucket}/{prefix}/ --recursive

upload: raw_datasets/tfrecord/labels.txt to s3://sagemaker-experiments-us-east-2-322537213286/coco_dataset/tfrecord_dog/labels.txt
upload: raw_datasets/tfrecord/train.record-00002-of-00100 to s3://sagemaker-experiments-us-east-2-322537213286/coco_dataset/tfrecord_dog/train.record-00002-of-00100
...
upload: raw_datasets/tfrecord/val.record-00004-of-00010 to s3://sagemaker-experiments-us-east-2-322537213286/coco_dataset/tfrecord_dog/val.record-00004-of-00010
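
The uploaded file names follow a sharded naming pattern, `<split>.record-<index>-of-<total>`, with 100 training shards and 10 validation shards according to the listing above. The sketch below reproduces that pattern and the S3 URI each shard ends up under. `shard_name` and `s3_uri` are hypothetical helpers, and the bucket name in the usage is a placeholder.

```python
def shard_name(split, index, num_shards):
    """Format a TFRecord shard file name such as train.record-00002-of-00100."""
    return '{}.record-{:05d}-of-{:05d}'.format(split, index, num_shards)


def s3_uri(bucket, prefix, filename):
    """Join bucket, key prefix and file name into an s3:// URI."""
    return 's3://{}/{}/{}'.format(bucket, prefix.strip('/'), filename)


prefix = 'coco_dataset/tfrecord_dog'
for i in range(3):
    # 'my-data-bucket' is a placeholder bucket name.
    print(s3_uri('my-data-bucket', prefix, shard_name('train', i, 100)))
```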

# Construct a script for distributed training

We have modified the TF-Slim training script (`src_dir/image_classifier.py`) to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and for checkpointing and/or model persistence. We have also added an argument-parsing function to process the training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.
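
A minimal sketch of the argument handling described above might look like the following. It is not the code from `image_classifier.py`: the `parse_args` name, the `--sm_model_dir` and `--data_dir` arguments, and the default batch size are assumptions. `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are environment variables set by the SageMaker training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker's environment."""
    parser = argparse.ArgumentParser()
    # S3 path passed by SageMaker; used for checkpoints and data sharing.
    parser.add_argument('--model_dir', type=str, default=None)
    # Local folder whose contents SageMaker uploads to S3 after training.
    parser.add_argument('--sm_model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # Local folder where the 'training' channel is made available.
    parser.add_argument('--data_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING',
                                               '/opt/ml/input/data/training'))
    # Arbitrary illustrative default.
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--max_number_of_steps', type=int, default=None)
    # Ignore arguments this sketch does not know about.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--model_dir', 's3://my-bucket/my-job/model',
                   '--batch_size', '128'])
print(args.model_dir, args.batch_size, args.sm_model_dir)
```

Using `parse_known_args` keeps the sketch tolerant of hyperparameters it does not declare, which is convenient while a script is still being adapted.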

Here is the entire script:

In [21]:
!pygmentize './src_dir/image_classifier.py'

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Generic training script that trains a model using a given dataset."""

from __future__ import absolut

        optimizer = tf.train.MomentumOptimizer(
            learning_rate,
            momentum=args.momentum,
            name='Momentum')
    elif args.optimizer == 'rmsprop':
        optimizer = tf.train.RMSPropOptimizer(
            learning_rate,
            decay=args.rmsprop_decay,
            momentum=args.rmsprop_momentum,
            epsilon=args.opt_epsilon)
    elif args.optimizer == 'sgd':
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    else:
        raise ValueError('Optimizer [%s] was not recognized' % args.optimizer)
    return optimizer


def _get_init_fn(args):
    """Returns

# Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a couple of important parameters here:

* `py_version` selects the Python version of the training container. Setting it to `'py3'` indicates script mode, since legacy mode supports only Python 2. Because this example targets Python 2.7, the estimator below sets `py_version` to `'py2'`; script mode can be used with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

* `distributions` configures the distributed training setup. It is required only for distributed training, either across a cluster of instances or across multiple GPUs; the estimator below trains on a single instance and does not set it. When parameter servers are used as the distributed training scheme, SageMaker, whose training jobs run on homogeneous clusters, runs a parameter server on every instance in the cluster to make this more performant, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training).
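
For reference, the two `distributions` settings mentioned above look roughly as follows in version 1.x of the SageMaker Python SDK (the version that uses `train_instance_count`-style arguments). They are shown only as plain dictionaries; the `TensorFlow(...)` call further down does not pass a `distributions` argument, and `processes_per_host=1` is an arbitrary illustrative value.

```python
# Parameter-server training: one parameter server per instance.
ps_distributions = {'parameter_server': {'enabled': True}}

# Horovod training over MPI; processes_per_host is illustrative
# (typically the number of GPUs on each instance).
horovod_distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 1,
    }
}

# Either dict would be passed as TensorFlow(..., distributions=...).
print(ps_distributions)
print(horovod_distributions)
```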



### SageMaker Experiments
- Provides the ability to manage and track experiments
<center><img src="./fig/experiments_fig.png" width="900" height="700"></center>


- trial components: pre-processing jobs, training jobs, and batch transform jobs

#### Track an Experiment
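- A tracker is used to record Experiment information
- Use it either by loading an existing trial component (`Tracker.load`) or by creating a new one (`Tracker.create`)
- The example below logs the URI of the S3 bucket where the dataset was uploaded, together with information about the dataset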

===================================================================================================================
#### Create an Experiment
- An experiment is the top-level entity: a collection of trials that are observed, compared, and evaluated as a group

In [22]:
# experiment_name = "experiments-v2" ## change to the experiment name you want

# experiment_existed = True
# try:
#     experiment = sm.describe_experiment(ExperimentName=experiment_name)
# except:
#     experiment_existed = False

# if not experiment_existed:
#     experiment = Experiment.create(
#         experiment_name=experiment_name, 
#         description="Classifier of XXX images", 
#         sagemaker_boto_client=sm)
# else:
#     experiment = sm.describe_experiment(ExperimentName=experiment_name)
# print(experiment)

#### Create Trials
- Each trial represents one training run with a different set of hyperparameters.
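
As an illustration of "one trial per hyperparameter setting", the sketch below expands a small grid into one configuration per trial and gives each a timestamp-prefixed name in the style of the commented-out cell further down. The grid values, the `make_trials` helper, and the numeric name suffix are hypothetical; no SageMaker Experiments API is called.

```python
import itertools
import time


def make_trials(experiment_name, grid, timestamp=None):
    """Expand a hyperparameter grid into (trial_name, hyperparameters) pairs."""
    timestamp = int(time.time()) if timestamp is None else timestamp
    keys = sorted(grid)
    trials = []
    for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        trial_name = '{}-{}-{}'.format(timestamp, experiment_name, i)
        trials.append((trial_name, dict(zip(keys, values))))
    return trials


# Hypothetical grid: two batch sizes x two label-smoothing values = 4 trials.
grid = {'batch_size': [64, 128], 'label_smoothing': [0.0, 0.1]}
for name, hp in make_trials('experiments-v2', grid, timestamp=0):
    print(name, hp)
```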

In [23]:
## Dataset location
inputs= 's3://{}/{}'.format(data_bucket, prefix)
inputs

's3://sagemaker-experiments-us-east-2-322537213286/coco_dataset/tfrecord_dog'

In [24]:
# trial_name = f"{int(time.time())}-{experiment_name}"
    
# train_trial = Trial.create(
#     trial_name=trial_name, 
#     experiment_name=experiment_name,
#     sagemaker_boto_client=sm,
# )

# with Tracker.create(display_name="Dataset", sagemaker_boto_client=sm) as tracker:
#     tracker.log_parameters({
#         "dataset": "coco_dataset",
#         "resize" : 128
#     })
#     # we can log the s3 uri to the dataset we just uploaded
#     tracker.log_input(name="coco_dataset", media_type="s3/uri", value=inputs)
    
# # associate the preprocessing trial component with the current trial
# train_trial.add_trial_component(tracker.trial_component)

In [25]:
hyperparameters = {
        'dataset_name' : 'visualwakewords',
        'model_name' : 'mobilenet_v1_025',
        'preprocessing_name' : 'mobilenet_v1',
        'image_size' : 128,
        'use_grayscale' : False,
        'save_summaries_secs' : 300,
        'label_smoothing' : 0.1,
        'learning_rate_decay_factor' : 0.98,
        'num_epochs_per_decay' : 2.5,
        'moving_average_decay' : 0.9999,
        'batch_size' : 128,
        'max_number_of_steps' : 200,
        'eval_batch_size' : 1000,     
    }
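
To make `learning_rate_decay_factor`, `num_epochs_per_decay` and `batch_size` concrete: in the TF-Slim training script these typically drive a staircase exponential decay, where the rate is multiplied by the decay factor once every `num_epochs_per_decay` epochs. The sketch below assumes that schedule. The initial learning rate of 0.01 and the use of 82,783 samples per epoch (the image count in the conversion log) are assumptions, since neither is set in `hyperparameters`.

```python
def decay_steps(num_samples_per_epoch, batch_size, num_epochs_per_decay):
    """Number of training steps between two learning-rate drops."""
    return int(num_samples_per_epoch * num_epochs_per_decay / batch_size)


def staircase_learning_rate(step, initial_lr, decay_factor, steps_per_decay):
    """Exponential decay applied in discrete steps (staircase)."""
    return initial_lr * decay_factor ** (step // steps_per_decay)


steps = decay_steps(num_samples_per_epoch=82783, batch_size=128,
                    num_epochs_per_decay=2.5)
print(steps)  # 1616
for step in (0, 200, steps, 10 * steps):
    print(step, staircase_learning_rate(step, 0.01, 0.98, steps))
```

With `max_number_of_steps` set to 200, training under these assumptions would stop well before the first decay at step 1616.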

In [26]:
estimator = TensorFlow(entry_point='image_classifier.py',
                       source_dir='src_dir',
                       role=role,
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge',
                       train_use_spot_instances=True,  # use spot instances
                       train_volume_size=400,
                       train_max_run=12*60*60,
                       train_max_wait=12*60*60,
#                        train_instance_type='local_gpu',
                       framework_version='1.14.0',
                       py_version='py2',
                       hyperparameters=hyperparameters
                      )

No handlers could be found for logger "sagemaker"
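
With managed spot training, `train_max_wait` has to be at least as large as `train_max_run`, because the wait time covers both the time spent waiting for spot capacity and the training time itself. A small sanity check along these lines can catch a misconfiguration before a job is submitted. `check_spot_limits` is a hypothetical helper, not an SDK function.

```python
def check_spot_limits(max_run_seconds, max_wait_seconds):
    """Validate the time limits used for managed spot training.

    Returns the slack (in seconds) left for waiting on spot capacity.
    """
    if max_run_seconds <= 0:
        raise ValueError('max_run must be positive')
    if max_wait_seconds < max_run_seconds:
        raise ValueError('max_wait ({}) must be >= max_run ({})'.format(
            max_wait_seconds, max_run_seconds))
    return max_wait_seconds - max_run_seconds


# The estimator above uses 12 hours for both limits.
print(check_spot_limits(12 * 60 * 60, 12 * 60 * 60))  # 0
```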


## Calling ``fit``

To start a training job, we call `estimator.fit(training_data_uri)`.

An S3 location is used here as the input, passed as a dictionary that maps the channel name `'training'` to the S3 URI (passing the URI on its own makes `fit` create a default channel with the same name). In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple of other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.

When training starts, the TensorFlow container executes `image_classifier.py`, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we did not set `model_dir` in this example, it defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution looks roughly as follows (remaining hyperparameters elided):
```bash
python image_classifier.py --dataset_name visualwakewords --model_name mobilenet_v1_025 ... --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>
```
When training is complete, the training job will upload the saved model for TensorFlow serving.
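
To make the argument passing concrete, here is a minimal sketch of how a script-mode entry point could read these values with `argparse`. It is not the actual `image_classifier.py`; the function name `parse_args`, the `--train_data` option, the `./data` fallback, and the defaults are assumptions for illustration, and only three of the hyperparameters are declared.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Passed by the SageMaker TensorFlow container in script mode.
    parser.add_argument('--model_dir', type=str, default=None)
    # A few of the hyperparameters from the notebook; defaults are illustrative.
    parser.add_argument('--image_size', type=int, default=128)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--max_number_of_steps', type=int, default=200)
    # Local directory of the 'training' channel, with a fallback for local runs.
    parser.add_argument('--train_data', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', './data'))
    # Ignore any arguments that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(['--model_dir', 's3://bucket/job/model', '--batch_size', '64'])
print('{} {} {}'.format(args.model_dir, args.batch_size, args.image_size))
```

Using `parse_known_args` keeps the sketch tolerant of hyperparameters it does not list, which is convenient while a script is still being adapted.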

In [27]:
training_job_name = "{}-img-classifier-training-job".format(int(time.time()))
estimator.fit(
    inputs={'training': inputs},
    job_name=training_job_name,
    logs='All',
#     experiment_config={
#             "TrialName": train_trial.trial_name,
#             "TrialComponentDisplayName": "Training",
#         },
    wait=False
)
print("training_job_name : {}".format(training_job_name))

training_job_name : 1589689459-img-classifier-training-job
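
Because `fit` is called with `wait=False`, the call returns as soon as the job is submitted. One way to block until the job finishes is to poll its status. The sketch below assumes the boto3 SageMaker client's `describe_training_job` call, whose response contains a `TrainingJobStatus` field; the helper name `wait_for_training_job` and its parameters are hypothetical.

```python
import time


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=1000):
    """Poll until the job reaches a terminal state and return that status."""
    for _ in range(max_polls):
        status = describe(TrainingJobName=job_name)['TrainingJobStatus']
        if status in ('Completed', 'Failed', 'Stopped'):
            return status
        time.sleep(poll_seconds)
    raise RuntimeError('gave up waiting for {}'.format(job_name))


# Usage against the real API (requires AWS credentials):
#   import boto3
#   sm_client = boto3.client('sagemaker')
#   wait_for_training_job(sm_client.describe_training_job, training_job_name)
```

Passing the `describe` callable in as an argument keeps the helper easy to exercise without AWS access.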


In [28]:
sm_sess = sagemaker.Session()
sm_sess.logs_for_job(estimator.latest_training_job.name, wait=True, log_type='All')

2020-05-17 04:24:20 Starting - Starting the training job...
2020-05-17 04:24:22 Starting - Launching requested ML instances......
2020-05-17 04:25:24 Starting - Preparing the instances for training......
2020-05-17 04:26:48 Downloading - Downloading input data............
2020-05-17 04:28:46 Training - Downloading the training image...
2020-05-17 04:29:06 Training - Training image download completed. Training in progress..
********************* args.model_dir
********************* args.train_dir
W0517 04:29:15.225903 140405522102016 deprecation_wrapper.py:119] From image_classifier.py:598: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
W0517 04:29:15.226243 140405522102016 deprecation_wrapper.py:119] From image_classifier.py:598: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
W0517 04:29:15.230127 140405522102016 deprecation.py:323] From image_classi

I0517 04:29:29.201777 140405522102016 session_manager.py:500] Running local_init_op.
I0517 04:29:29.340445 140405522102016 session_manager.py:502] Done running local_init_op.
I0517 04:29:33.608508 140405522102016 learning.py:754] Starting Session.
I0517 04:29:33.757477 140399762978560 supervisor.py:1117] Saving checkpoint to path /opt/ml/model/model.ckpt
I0517 04:29:33.762818 140405522102016 learning.py:768] Starting Queues.
I0517 04:29:35.723864 140399771371264 supervisor.py:1099] global_step/sec: 0
I0517 04:29:44.439759 140399779763968 supervisor.py:1050] Recording summary at step 1.
I0517 04:29:45.563167 140405522102016 learning.py:507] global step 10: loss = 0.3595 (0.094 sec/step)
I0517 04:29:47.556523 140405522102016 learning.py:507] global step 20: loss = 0.3731 (0.234 sec/step)
I0517 04:29:50.042387 140405522102016 learning.py:507] global step 30: loss = 0.3191 (0.246 sec/step)
I0517 

I0517 04:30:46.160223 140405522102016 evaluation.py:167] Evaluation [4/41]
I0517 04:30:54.795001 140405522102016 evaluation.py:167] Evaluation [8/41]
I0517 04:31:03.474709 140405522102016 evaluation.py:167] Evaluation [12/41]
I0517 04:31:12.095094 140405522102016 evaluation.py:167] Evaluation [16/41]
I0517 04:31:20.765419 140405522102016 evaluation.py:167] Evaluation [20/41]
I0517 04:31:29.488480 140405522102016 evaluation.py:167] Evaluation [24/41]
I0517 04:31:38.141202 140405522102016 evaluation.py:167] Evaluation [28/41]
I0517 04:31:46.959955 140405522102016 evaluation.py:167] Evaluation [32/41]
I0517 04:31:55.581482 140405522102016 evaluation.py:167] Evaluation [36/41]
I0517 04:32:04.171299 140405522102016 evaluation.py:167] Evaluation [40/41]
I0517 04:32:06.268907 140405522102016 evaluation.py:167] Evaluation [41/41]
eval/Accuracy[0.966219485]
eval/Recall_5[1]

In [29]:
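# Drop only the trailing 'model' component; str.replace would also alter a bucket or job name containing 'model'.
artifacts_dir = estimator.model_dir.rstrip('/').rsplit('/', 1)[0] + '/'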
print(artifacts_dir)
!aws s3 ls --human-readable {artifacts_dir}

s3://sagemaker-us-east-2-322537213286/1589689459-img-classifier-training-job/
                           PRE debug-output/
                           PRE output/
                           PRE source/


In [30]:
model_dir=artifacts_dir+'output/'
print(model_dir)
!aws s3 ls --human-readable {model_dir}

s3://sagemaker-us-east-2-322537213286/1589689459-img-classifier-training-job/output/
2020-05-17 04:32:18    7.8 MiB model.tar.gz


In [31]:
!rm -rf ./model_result/

In [32]:
import json
import os

path = './model_result'
if not os.path.exists(path):
    os.makedirs(path)

!aws s3 cp {model_dir}model.tar.gz {path}/model.tar.gz
!tar -xzf {path}/model.tar.gz -C {path}

download: s3://sagemaker-us-east-2-322537213286/1589689459-img-classifier-training-job/output/model.tar.gz to model_result/model.tar.gz
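
The extraction step can also be done in Python rather than with the `tar` shell command, which makes it easy to see what the archive contains. This is a minimal sketch using the standard-library `tarfile` module; the function name `extract_model_archive` is hypothetical, and it assumes the archive is trusted because it was produced by our own training job.

```python
import os
import tarfile


def extract_model_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        # Only extract archives from a trusted source such as your own job.
        tar.extractall(dest_dir)
    return names


# Usage with the paths from this notebook:
#   print(extract_model_archive('./model_result/model.tar.gz', './model_result'))
```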


In [33]:
!aws s3 cp {path}/mobilenetv1_model.tflite {estimator.model_dir}/mobilenetv1_model.tflite

upload: model_result/mobilenetv1_model.tflite to s3://sagemaker-us-east-2-322537213286/1589689459-img-classifier-training-job/model/mobilenetv1_model.tflite


### Compare the model training runs for an experiment

Now we will use the analytics capabilities of the SageMaker Python SDK to query and compare the training runs and identify the best model produced by our experiment. You can retrieve trial components by using a search expression.

In [34]:
# search_expression = {
#     "Filters":[
#         {
#             "Name": "DisplayName",
#             "Operator": "Equals",
#             "Value": "Training",
#         }
#     ],
# }

In [35]:
# trial_component_analytics = ExperimentAnalytics(
#     sagemaker_session=Session(sess, sm), 
#     experiment_name=experiment_name,
#     search_expression=search_expression
# )

In [36]:
# trial_component_analytics.dataframe()

Now we can plot the history with two graphs, one for accuracy and another for loss. Each graph shows the results for both the training and validation datasets. Although training is a stochastic process that can vary significantly between training jobs, overall you are likely to see that the training curves are converging smoothly and steadily to higher accuracy and lower loss, while the validation curves are more jagged. This is due to the validation dataset being relatively small and thus not as representative as the training dataset.