## SageMaker Training Job 

### Complete Part 1 through Part 4 of the tutorial before working through this notebook.

---
#### Step 1: Import packages, get IAM role, get the region and set the S3 bucket.

In [9]:
import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
from sagemaker import get_execution_role

role = get_execution_role()

region = boto3.Session().region_name

bucket = 'machinelearning-sagemaker-train'  # Put your S3 bucket name here

---
#### Step 2: Create the algorithm image and push to Amazon ECR.

In [10]:
%%sh

# The name of our algorithm
algorithm_name=ml-sagemaker-train

chmod +x src/*

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

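# Log Docker in to ECR. `aws ecr get-login` was removed in AWS CLI v2;
# `get-login-password` works in AWS CLI v2 and in recent v1 releases.
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"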

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

# On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
# to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

# Build the CPU image. Comment out the line below to use a GPU instead.
docker build -t ${algorithm_name} -f Dockerfile.cpu .

# Uncomment the line below to build the GPU image.
#docker build -t ${algorithm_name} -f Dockerfile.gpu .

docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Sending build context to Docker daemon  5.712MB
Step 1/10 : FROM python:3.6-buster
 ---> 6a16f0d68245
Step 2/10 : LABEL project="keras-sagemaker-train"
 ---> Using cache
 ---> ec513aa7d4df
Step 3/10 : ARG APP_HOME=/opt/program
 ---> Using cache
 ---> 2a94c9918317
Step 4/10 : ENV PATH="${APP_HOME}:${PATH}"
 ---> Using cache
 ---> 9d9308191e66
Step 5/10 : RUN pip3 install --upgrade pip
 ---> Using cache
 ---> 2df8438249ce
Step 6/10 : RUN pip3 install --upgrade setuptools
 ---> Using cache
 ---> 832057067150
Step 7/10 : ADD requirements-cpu.txt /
 ---> Using cache
 ---> 5e8eed9d27cc
Step 8/10 : RUN pip3 install -r requirements-cpu.txt
 ---> Using cache
 ---> dc374004dfa1
Step 9/10 : COPY src ${APP_HOME}
 ---> Using cache
 ---> 79cbf08a84a9
Step 10/10 : WORKDIR ${APP_HOME}
 ---> Using cache
 ---> 8cd35b869bbd
Successfully built 8cd35b869bbd
Successfully tagged ml-sagemaker-train:latest
The push refers to repository [882096543472.dkr.ecr.us-east-1.amazonaws.com/ml-sagemake

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Redirecting to /bin/systemctl restart docker.service
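
The "create the repository if it doesn't exist" check from the shell cell can also be done from Python. This is a minimal sketch, assuming a boto3 ECR client is passed in; the helper name `ensure_ecr_repository` is hypothetical and not part of the tutorial's code.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Create the ECR repository if it is missing.

    `ecr_client` is assumed to be a boto3 ECR client, e.g. boto3.client("ecr").
    Returns True if the repository was created, False if it already existed.
    """
    try:
        # Raises RepositoryNotFoundException when the repository is absent.
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)
        return True
```

In this notebook it could be called as `ensure_ecr_repository(boto3.client("ecr"), "ml-sagemaker-train")`.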


---
#### Step 3: Define variables with data location and output location in S3 bucket.

In [11]:
data_location = 's3://{}/data'.format(bucket)
print("data location - " + data_location)

output_location = 's3://{}/output'.format(bucket)
print("output location - " + output_location)

data location - s3://machinelearning-sagemaker-train/data
output location - s3://machinelearning-sagemaker-train/output
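
For orientation: SageMaker copies each S3 input channel into the training container under `/opt/ml/input/data/<channel>`, and a plain string passed to `fit()` (as in Step 8) is treated as a single channel named `training`. The sketch below only builds that path. The helper name is hypothetical, and what the code in `src/` actually reads is not shown in this notebook.

```python
import posixpath

# Base directory SageMaker uses inside a training container.
SAGEMAKER_PREFIX = "/opt/ml"


def channel_dir(channel="training"):
    """Return the container directory where an input channel's files are placed."""
    return posixpath.join(SAGEMAKER_PREFIX, "input", "data", channel)


# Directory that would hold the files copied from data_location.
print(channel_dir())
```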


---
#### Step 4: Create a SageMaker session.

In [12]:
import sagemaker as sage
sess = sage.Session()

---
#### Step 5: Define variables for account, region and algorithm image.

In [13]:
account = sess.boto_session.client('sts').get_caller_identity()['Account'] # aws account 
region = sess.boto_session.region_name # aws server region
image = '{}.dkr.ecr.{}.amazonaws.com/ml-sagemaker-train'.format(account, region) # algorithm image path in ECR
print(image)

882096543472.dkr.ecr.us-east-1.amazonaws.com/ml-sagemaker-train
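
The image URI printed above carries no tag, so it should resolve to `:latest`, which matches the tag pushed in Step 2. A small sketch, with a hypothetical helper name, that makes the tag explicit:

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build a fully qualified ECR image URI (standard AWS partition only)."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


print(ecr_image_uri("882096543472", "us-east-1", "ml-sagemaker-train"))
```

Pinning a specific tag instead of relying on the default makes it easier to tell which build a training job used.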


---
#### Step 6: Define hyperparameters to be passed to your algorithm. 
This project reads two hyperparameters for training: batch_size and epochs. Using hyperparameters is optional.

In [1]:
hyperparameters = {"batch_size":128, "epochs":30}
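
Inside the container, SageMaker writes these values to `/opt/ml/input/config/hyperparameters.json`, and every value arrives as a string, so the training script has to cast them back. The following sketches how a script might do that. The function name, the fallback defaults, and the `path` argument are assumptions, since the tutorial's `src/` code is not shown here.

```python
import json

# Location where SageMaker places the hyperparameters inside the container.
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read the hyperparameters file and cast the string values to integers."""
    try:
        with open(path) as f:
            raw = json.load(f)
    except FileNotFoundError:
        # File absent, for example when the script runs outside SageMaker.
        raw = {}
    # Fallback defaults (assumed) mirror the values used in this notebook.
    return {
        "batch_size": int(raw.get("batch_size", 128)),
        "epochs": int(raw.get("epochs", 30)),
    }
```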

---
#### Step 7: Create the training job using SageMaker Estimator.

In [15]:
# train_instance_count and train_instance_type were renamed to instance_count
# and instance_type in sagemaker>=2.
# See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
classifier = sage.estimator.Estimator(image_uri=image,
                                      role=role,
                                      instance_count=1,
                                      instance_type='ml.c5.2xlarge',
                                      hyperparameters=hyperparameters,
                                      output_path=output_location,
                                      sagemaker_session=sess)


---
#### Step 8: Run the training job by passing the data location.

In [16]:
classifier.fit(data_location)

2022-11-17 18:11:56 Starting - Starting the training job...
2022-11-17 18:12:21 Starting - Preparing the instances for trainingProfilerReport-1668708716: InProgress
......
2022-11-17 18:13:21 Downloading - Downloading input data...
2022-11-17 18:13:46 Training - Training image download completed. Training in progress..
2022-11-17 18:13:50.094190: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2022-11-17 18:13:50.142353: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2999995000 Hz
2022-11-17 18:13:50.142780: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e300dcd420 executing computations on platform Host. Devices:
2022-11-17 18:13:50.142803: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_qu

Epoch 22/30
 128/8000 [..............................] - ETA: 0s - loss: 0.1500 - accuracy: 0.9531
1280/8000 [===>..........................] - ETA: 0s - loss: 0.1849 - accuracy: 0.9469
Epoch 23/30
 128/8000 [..............................] - ETA: 0s - loss: 0.2929 - accuracy: 0.8906
1408/8000 [====>.........................] - ETA: 0s - loss: 0.2074 - accuracy: 0.9389
Epoch 24/30
 128/8000 [..............................] - ETA: 0s - loss: 0.2589 - accuracy: 0.9219
1152/8000 [===>..........................] - ETA: 0s - loss: 0.1877 - accuracy: 0.9418
Epoch 25/30
 128/8000 [..............................] - ETA: 0s - loss: 0.2965 - accuracy: 0.9141
1152/8000 [===>..........................] - ETA: 0s - loss: 0.1953 - accuracy: 0.9427
Epoch 26/30
 128/8000 [..............................] - ETA: 0s - loss: 0.2489 - accuracy: 0.9297
1152/8000 [===>..........................] - ETA: 0s - loss: 0.1885 - a


2022-11-17 18:14:22 Uploading - Uploading generated training model
2022-11-17 18:14:22 Completed - Training job completed
Training seconds: 68
Billable seconds: 68
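
Once the job completes, the model archive is uploaded beneath `output_location`. With an Estimator given an `output_path`, the usual layout is `<output_path>/<job-name>/output/model.tar.gz`, and the SDK exposes the same URI as `classifier.model_data`. The sketch below builds that path by hand. The helper name is hypothetical, and the job name is a made-up placeholder, not the name of the job run above.

```python
def model_artifact_uri(output_location, job_name):
    """Return the S3 URI where the trained model archive is expected to land."""
    return "{}/{}/output/model.tar.gz".format(output_location.rstrip("/"), job_name)


# "example-training-job" is a hypothetical job name used only for illustration.
print(model_artifact_uri("s3://machinelearning-sagemaker-train/output",
                         "example-training-job"))
```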
