## SageMaker container lesion example
This example launches a dummy container on spot instances to test how AWS SageMaker behaves with Docker containers, and how the container behaves when AWS terminates it for lack of spot capacity.

Training is simulated by a dummy Python script inside the container that performs the same kind of actions a real training script (with TensorFlow or another framework) would perform. More specifically:

- Fake checkpoints are written as txt files to the folder `/opt/ml/checkpoints/`.

- Fake TensorBoard records are written every 20 seconds to the folder `/opt/ml/output/tensorboard/`, to check that they appear in real time in the specified S3 bucket folder.

- The `tree` command is run on `/opt/ml/` to inspect the folder structure created by SageMaker, and the result is stored as a txt file in the `/opt/ml/output/data/` folder.
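
The dummy script itself is not shown in this notebook. The following is only a minimal sketch of what such a script could look like, assuming hypothetical file names (`checkpoint_XXXX.txt`, `record_XXXX.txt`) and hypothetical helper names; it writes to temporary folders so that it also runs outside the container.

```python
import os
import random
import string
import tempfile
import time


def write_fake_checkpoint(checkpoint_dir, step):
    """Write a small txt file standing in for a model checkpoint."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint_{step:04d}.txt")
    with open(path, "w") as f:
        f.write(f"fake checkpoint for step {step}\n")
    return path


def write_fake_tensorboard_record(tensorboard_dir, step, size=256):
    """Write a txt file of random characters standing in for a TensorBoard record."""
    os.makedirs(tensorboard_dir, exist_ok=True)
    path = os.path.join(tensorboard_dir, f"record_{step:04d}.txt")
    payload = "".join(random.choices(string.ascii_letters, k=size))
    with open(path, "w") as f:
        f.write(payload)
    return path


def run_dummy_training(checkpoint_dir, tensorboard_dir, steps=3, interval_seconds=20):
    """Write one fake record and one fake checkpoint per step, sleeping in between."""
    written = []
    for step in range(steps):
        written.append(write_fake_tensorboard_record(tensorboard_dir, step))
        written.append(write_fake_checkpoint(checkpoint_dir, step))
        time.sleep(interval_seconds)
    return written


if __name__ == "__main__":
    # Inside the container these folders would be /opt/ml/checkpoints and
    # /opt/ml/output/tensorboard; temporary folders are used here instead.
    root = tempfile.mkdtemp()
    files = run_dummy_training(
        os.path.join(root, "checkpoints"),
        os.path.join(root, "tensorboard"),
        steps=2,
        interval_seconds=0,
    )
    print("\n".join(files))
```

In the container, the two folders would be taken from the `checkpoints` and `tensorboard` environment variables defined later in this notebook.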

In [1]:
import os
import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import TensorBoardOutputConfig

This section retrieves the default bucket created by the SageMaker service.

In [2]:
# request the name of the default SageMaker bucket
sagemaker_session = sagemaker.session.Session()
sagemaker_default_bucket = sagemaker_session.default_bucket()

print("bucket generated by sagemaker: " + sagemaker_default_bucket)

bucket generated by sagemaker: sagemaker-eu-west-1-011827850615


In [9]:
############################  JOB NAME  ####################################

# job-name definition:
# a multi-job training session is identified by a base-job-name, while the
# job-name identifies a single training job.
# The job-name must be different for each training job and should be used to
# separate the results of different training jobs into specific folders.
job_name = 'test-21'

## Defining the s3 bucket for the training job
This section defines all the variables that specify the S3 paths for the input and output data.

A few words on the configuration of `TensorBoardOutputConfig`: its path can hold TensorBoard data if you use TensorFlow, or any other files you need to get out of the container during training. In this example we write txt files filled with random characters to `container_local_output_path`, and within a few seconds these files become available in the `s3_output_path` of the `TensorBoardOutputConfig`.


In [4]:
############################   INPUTS  ####################################

# ECR repository containing the Docker image configured to be run by SageMaker
# ecr_container_uri = "<your aws id>.dkr.ecr.<your aws region>.amazonaws.com/<your repo name:your repo tag>"
ecr_container_uri = "011827850615.dkr.ecr.eu-west-1.amazonaws.com/maskrcnn_repo_test:lesion"
#ecr_container_uri = "011827850615.dkr.ecr.eu-west-1.amazonaws.com/maskrcnn_repo_test:lesion_2"

# s3 path containing the dataset needed for training the model
dataset_bucket = "s3://datsetsbucket/isic2018/"

# s3 path containing the model with pretrained weights; in the next example this folder
# will hold the Mask R-CNN model trained on COCO.
model_bucket = 's3://cermodelbucket'

############################  OUTPUTS  ####################################

# s3 path that stores the instance profiler results and any other data saved during training in the folder /opt/ml/output/data/
output_path = f's3://{sagemaker_default_bucket}/output'

# s3 path that stores the checkpoints of the training process
checkpoints_path = f'{output_path}/{job_name}/checkpoints'

# internal paths for checkpoints and tensorboard logs, passed to the container as env variables
user_defined_env_vars = {"checkpoints": "/opt/ml/checkpoints",
                        "tensorboard": "/opt/ml/output/tensorboard"}

# Definition of the s3 target folder for the tensorboard outputs and of the container folder where the tensorboard records must be placed.
# The records could be written elsewhere, but '/opt/ml/output/tensorboard' is the folder SageMaker uses for them, so we decided to
# write the records directly there.
# Note: under 's3://testtflogs/logs' the records are divided into folders named after the job-name; in this example the tensorboard output
# should be found under 's3://testtflogs/logs/test-21'
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://testtflogs/logs',
    container_local_output_path=user_defined_env_vars['tensorboard']
)


This section retrieves the ARN of the execution role associated with this notebook, which is passed to the estimator to launch the training job. Make sure this role has permission to use other buckets; otherwise only buckets whose names start with the `sagemaker` keyword can be used. In this example the permission is needed.

In [5]:

# if you are running the code from Jupyter
# get the execution role of this SageMaker notebook instance
# role = get_execution_role()
# print(role)

# if you are running the code locally
role = 'arn:aws:iam::011827850615:role/service-role/AmazonSageMaker-ExecutionRole-20210522T125188'
# role = 'arn:aws:iam::<your aws id>:role/service-role/<your role name>'

This section defines the hyperparameters. These values are passed to the estimator and are available to the training script in the container as command-line arguments, or as environment variables named `SM_HP_{hyperparameter_name}`, e.g. `SM_HP_NAME` or `SM_HP_GPU_COUNT` in this case.

In [6]:
# hyperparameters definition
hyperparameters = {
    "NAME": "lesion", 
    "GPU_COUNT": 1, 
    "IMAGES_PER_GPU": 2,
    "CLASS_NAMES": "{\"1\": \"lesion\"}",
    "TRAINING_SPLIT": 0.8,
    "HEAD_TRAIN_EPOCHS": 20,
    "ALL_TRAIN_EPOCHS": 40
}
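
On the container side, the training script could read these values back from the `SM_HP_*` environment variables. The helper below is a hedged sketch, not the code that runs in the image: `read_hyperparameters` is a hypothetical name, and it assumes each value arrives as a string that is either valid JSON or a plain string (consistent with the toolkit log message "Failed to parse hyperparameter NAME value lesion to Json. Returning the value itself").

```python
import json
import os


def read_hyperparameters(environ=os.environ, prefix="SM_HP_"):
    """Collect SM_HP_* variables into a dict, JSON-decoding values when possible."""
    params = {}
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        try:
            params[name] = json.loads(raw)
        except json.JSONDecodeError:
            # Plain strings such as "lesion" are not valid JSON; keep them as-is.
            params[name] = raw
    return params


if __name__ == "__main__":
    # A fake environment standing in for the one SageMaker would provide.
    fake_env = {
        "SM_HP_NAME": "lesion",
        "SM_HP_GPU_COUNT": "1",
        "SM_HP_CLASS_NAMES": '{"1": "lesion"}',
        "PATH": "/usr/bin",
    }
    print(read_hyperparameters(fake_env))
```

Passing the environment as an argument keeps the helper easy to test locally without a running training job.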

## Setup the training job
This is the key call of the script: it configures all the training job parameters, receives the paths defined earlier and the hyperparameters, and defines the type of machine used for the job and the mode in which it runs.

More specifically, we chose to run in spot mode (`use_spot_instances = True`). In this mode the training cost drops by 50% to 80%, depending on the chosen instance type and on machine availability. Spot mode lets AWS sell unused cloud compute capacity at a lower price, and AWS can stop your application if someone needs the machine in on-demand mode (without any discount).

When you run in spot mode, two more variables should be set: `max_run` and `max_wait`. `max_run` specifies how long, in seconds, the container may run. `max_wait` specifies how long, in seconds, the job may take overall, counting both the run time and the time spent waiting for spot instances to become available again after AWS stops the container, so it must be at least `max_run`. If either limit is exceeded, the training job is terminated.
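
Because a spot job can be stopped and restarted, the training script should be able to resume from the last checkpoint found in the local checkpoint folder, which SageMaker syncs with `checkpoint_s3_uri`. The sketch below shows one possible way to do this; the `checkpoint_<step>.txt` naming scheme and the function names are assumptions made for illustration only.

```python
import os
import re
import tempfile

# Hypothetical naming scheme for the fake checkpoints: checkpoint_<step>.txt
CHECKPOINT_PATTERN = re.compile(r"checkpoint_(\d+)\.txt$")


def latest_checkpoint(checkpoint_dir):
    """Return (step, path) of the newest checkpoint, or (None, None) if none exists."""
    if not os.path.isdir(checkpoint_dir):
        return None, None
    best_step, best_path = None, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match is None:
            continue  # ignore files that are not checkpoints
        step = int(match.group(1))
        if best_step is None or step > best_step:
            best_step, best_path = step, os.path.join(checkpoint_dir, name)
    return best_step, best_path


def first_step_to_run(checkpoint_dir):
    """Start from 0 on a fresh job, or from the step after the last checkpoint."""
    step, _ = latest_checkpoint(checkpoint_dir)
    return 0 if step is None else step + 1


if __name__ == "__main__":
    # In the container the folder would be /opt/ml/checkpoints.
    demo_dir = tempfile.mkdtemp()
    open(os.path.join(demo_dir, "checkpoint_0003.txt"), "w").close()
    print("resuming from step", first_step_to_run(demo_dir))
```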

For more information about the parameters of this function, see the [reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

In [7]:
training_test = sagemaker.estimator.Estimator(
    # container image
    image_uri    = ecr_container_uri,
    # role of the SageMaker notebook instance
    role         = role,
    # number of instances to launch
    instance_count = 1,
    # if you want to use local mode
    #instance_type = "local",
    # type of instance on which the training job is placed
    #instance_type = 'ml.g4dn.2xlarge',
    instance_type = 'ml.p3.2xlarge',
    # space in GB of the storage attached to the ec2 instance
    volume_size  = 50,
    # max number of seconds the job can run before it is terminated
    max_run      = 10*3600,
    # s3 path where the profiler results and the content of /opt/ml/output/data/ are placed as a tar.gz file
    output_path  = output_path,
    # prefix for the training job name; generated automatically if not specified
    #base_job_name="training-test",
    # training job hyperparameters, passed as command-line arguments
    hyperparameters = hyperparameters,
    # list of tags for the job
    tags = [{"Key": "CER", "Value": "1"},],
    # s3 path of the data needed to start the training job, such as the pretrained model; it is copied into the folder /opt/ml/input/data/model (can be overridden by model_channel_name)
    model_uri    = model_bucket,
    # name of the channel where the data found at model_uri is saved (/opt/ml/input/data/<model_channel_name>)
    model_channel_name = 'model',
    # list of dicts with regexes for extracting metrics from stdout [{"Name": "<metric name>", "Regex": "<regex for log extraction>"}, ...]
    #metric_definitions = ,
    # flag for enabling the spot training
    use_spot_instances = True, 
    # max total time in seconds (run time plus waiting for spot instances); must be at least max_run
    max_wait = 24*3600,
    # s3 checkpoints target path
    checkpoint_s3_uri = checkpoints_path,
    # default: '/opt/ml/checkpoints'
    checkpoint_local_path = user_defined_env_vars['checkpoints'],
    # SageMaker Debugger rules
    #rules = ,
    # TensorBoard output configuration
    tensorboard_output_config = tensorboard_output_config,
    # dict for setting additional environment variables in the container
    environment = user_defined_env_vars,
    # max number of attempts to restart the job if it ends unexpectedly
    #max_retry_attempts =
)

## Launching the training job
In this section of the notebook the training job starts, using the `.fit()` method of the estimator object.

This method takes the `inputs` parameter, which can be a dict mapping a key to a local path or S3 location that holds the files to download into the container. You can specify several paths: for each one, a folder named after its key is created in the container under `/opt/ml/input/data/` and filled with the data found at the given path. Here we pass a single entry, so the container gets `/opt/ml/input/data/dataset/` filled with the data stored in `dataset_bucket` (`s3://datsetsbucket/isic2018/`). If you pass a dict with more entries, the input folder is populated with one subfolder per key, each with its own data.
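
To make the mapping between the `inputs` keys and the container folders concrete, here is a small illustrative sketch. It assumes the standard SageMaker layout `/opt/ml/input/data/<channel>`; `channel_paths` is a hypothetical helper and the second channel is invented for the example.

```python
import posixpath

# Assumed root folder for input channels inside the container.
INPUT_DATA_ROOT = "/opt/ml/input/data"


def channel_paths(inputs, root=INPUT_DATA_ROOT):
    """Map every channel name passed to fit() to its expected folder in the container."""
    return {name: posixpath.join(root, name) for name in inputs}


if __name__ == "__main__":
    inputs = {
        "dataset": "s3://datsetsbucket/isic2018/",
        # hypothetical second channel, not used in this notebook
        "validation": "s3://datsetsbucket/isic2018-validation/",
    }
    for name, path in channel_paths(inputs).items():
        print(f"{name}: {inputs[name]} -> {path}")
```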

The `job_name` is a very important parameter: it lets you tell apart jobs launched simultaneously or at different times, and it identifies the training job in the training job panel. For this reason two jobs cannot share a name: if you launch two training jobs with the same name, the result is an error and the container doesn't start.
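
One way to avoid name clashes is to append a timestamp to a base name. The helper below is only a sketch: `unique_job_name` is a hypothetical function, and the 63-character limit and the allowed-character pattern are assumptions about the training job naming rules that should be checked against the SageMaker API reference.

```python
import re
from datetime import datetime, timezone

# Assumed constraints on training job names (verify against the API reference).
MAX_JOB_NAME_LENGTH = 63
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def unique_job_name(base, now=None):
    """Build '<base>-<UTC timestamp>' so that repeated launches never share a name."""
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("%Y-%m-%d-%H-%M-%S")
    # Trim the base so that base + "-" + suffix fits the length limit.
    base = base[: MAX_JOB_NAME_LENGTH - len(suffix) - 1].rstrip("-")
    name = f"{base}-{suffix}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


if __name__ == "__main__":
    print(unique_job_name("test"))
```

The result could then be passed as `job_name` to `.fit()` instead of a hand-written name such as `'test-21'`.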

Another note concerns the `wait` and `logs` parameters. They let you watch the logs of the machine startup and, at the end of the training process, the output logs of the training job machine (the training job stdout cannot be seen in real time). With this configuration `fit` does not return until the job has finished. If you don't enable `wait`, the job starts and `fit` returns immediately, so you can call other functions or launch other jobs; to see the logs, `wait` must therefore be true.
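
If you launch the job with `wait = False`, you can still follow its progress by polling its status. The sketch below assumes a boto3 SageMaker client and uses its `describe_training_job` call; `wait_for_training_job` is a hypothetical helper, and the client is passed in as an argument so the function can be tried with a stub.

```python
import time


def wait_for_training_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status."""
    terminal = {"Completed", "Failed", "Stopped"}
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        secondary = description.get("SecondaryStatus", "n/a")
        print(f"{job_name}: {status} ({secondary})")
        if status in terminal:
            return status
        sleep(poll_seconds)


# Possible usage, assuming boto3 is installed and AWS credentials are configured:
#
#   import boto3
#   client = boto3.client("sagemaker")
#   final_status = wait_for_training_job(client, "test-21")
```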

If something is unclear, see the [reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)


In [10]:
training_test.fit(
    inputs      = {
        'dataset': dataset_bucket
    },
    job_name    = job_name,
    wait        = True,
    logs        = 'All'
)

2021-05-29 20:14:55 Starting - Starting the training job...ProfilerReport-1622319294: InProgress
...
2021-05-29 20:15:49 Starting - Launching requested ML instances......
2021-05-29 20:16:49 Starting - Preparing the instances for training......
2021-05-29 20:18:00 Downloading - Downloading input data............
2021-05-29 20:19:50 Training - Downloading the training image...........
2021-05-29 20:21:51,242 sagemaker-training-toolkit INFO     Failed to parse hyperparameter NAME value lesion to Json.
Returning the value itself
2021-05-29 20:21:57,500 sagemaker-training-toolkit INFO     Failed to parse hyperparameter NAME value lesion to Json.
Returning the value itself
2021-05-29 20:21:57,522 sagemaker-training-toolkit INFO     Failed to parse hyperparameter NAME value lesion to Json.
Returning the value itself
2021-05-29 20:21:57,533 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
   