## SageMaker dummy example
<p>This example launches a dummy container on spot instances to test how AWS SageMaker behaves with Docker containers, and how the container behaves when AWS terminates it because spot capacity is no longer available.</p>
<p>The training job is simulated by a dummy Python script inside the container that performs actions similar to those of a real training script written with TensorFlow or another framework. More specifically:

- Fake checkpoints are written in txt format to the folder /opt/ml/checkpoints/.

- Fake TensorBoard records are written every 20 seconds to the folder /opt/ml/output/tensorboard/, to check that they appear in real time in the specified S3 bucket folder.

- The tree command is run on the path /opt/ml/ to inspect the folder structure created by SageMaker, and the result is stored as a txt file in the /opt/ml/output/data/ folder.
</p> 
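
The three actions listed above could be sketched roughly as follows. This is not the script shipped in the container: the file names (`checkpoint-00000.txt`, `record-00000.txt`, `tree.txt`), the `base_dir` parameter and the `os.walk` fallback used when `tree` is not installed are all assumptions made for illustration, and the "TensorBoard records" are plain text placeholders rather than real event files.

```python
import os
import subprocess
import time


def write_fake_checkpoint(checkpoint_dir, step):
    """Write a placeholder checkpoint as a small text file (hypothetical naming)."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint-{step:05d}.txt")
    with open(path, "w") as f:
        f.write(f"fake checkpoint for step {step}\n")
    return path


def write_fake_tensorboard_record(tensorboard_dir, step):
    """Write a placeholder record; a real script would use a summary writer."""
    os.makedirs(tensorboard_dir, exist_ok=True)
    path = os.path.join(tensorboard_dir, f"record-{step:05d}.txt")
    with open(path, "w") as f:
        f.write(f"step={step} time={time.time()}\n")
    return path


def dump_tree(root, output_dir):
    """Store the folder structure under `root` as a text file in `output_dir`."""
    os.makedirs(output_dir, exist_ok=True)
    try:
        result = subprocess.run(
            ["tree", root], capture_output=True, text=True, check=True
        )
        listing = result.stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # Assumption: fall back to a plain file listing if `tree` is unavailable.
        listing = "\n".join(
            os.path.join(dirpath, name)
            for dirpath, _dirs, files in os.walk(root)
            for name in files
        )
    path = os.path.join(output_dir, "tree.txt")
    with open(path, "w") as f:
        f.write(listing)
    return path


def run_dummy_training(base_dir="/opt/ml", steps=3, interval_seconds=20):
    """Simulate a training loop that only writes files SageMaker should sync."""
    checkpoint_dir = os.path.join(base_dir, "checkpoints")
    tensorboard_dir = os.path.join(base_dir, "output", "tensorboard")
    data_dir = os.path.join(base_dir, "output", "data")
    for step in range(steps):
        write_fake_checkpoint(checkpoint_dir, step)
        write_fake_tensorboard_record(tensorboard_dir, step)
        time.sleep(interval_seconds)
    return dump_tree(base_dir, data_dir)
```

Outside SageMaker the sketch can be tried against a temporary folder, for example `run_dummy_training(base_dir="/tmp/fake_ml", steps=2, interval_seconds=0)`.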

In [33]:
import os
import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import TensorBoardOutputConfig

#### This section retrieves the default bucket that SageMaker creates for the service

In [36]:
# default sagemaker bucket name request 
sagemaker_session = sagemaker.session.Session()
sagemaker_default_bucket = sagemaker_session.default_bucket()

print("bucket generated by sagemaker:" + sagemaker_default_bucket)

bucket generated by sagemaker:sagemaker-eu-west-1-011827850615
output folder: s3://sagemaker-eu-west-1-011827850615/output


In [None]:
############################  JOB NAME  ####################################

# job-name definition:
# a base-job-name identifies a whole session made of several training jobs,
# while the job-name identifies a single training job.
# The job-name must be different for each training job and should be used to
# separate the results of different training jobs into specific folders.
job_name = 'test-11'
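
Since every training job needs a different job-name, one option is to derive it from a base name plus a timestamp instead of editing the string by hand. The helper below is only a sketch: `make_job_name` is a hypothetical function, and the validation assumes the usual SageMaker naming rule (alphanumeric characters and hyphens, at most 63 characters), which is worth checking against the current API documentation.

```python
import re
import time

# Assumed naming rule for training jobs: letters, digits and hyphens only,
# starting and ending with an alphanumeric character, at most 63 characters.
_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def make_job_name(base_job_name, timestamp=None):
    """Return '<base_job_name>-<UTC timestamp>' and validate the result.

    `timestamp` is seconds since the epoch; None means "now".
    """
    suffix = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(timestamp))
    name = f"{base_job_name}-{suffix}"
    if len(name) > 63 or not _JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name
```

For example, `job_name = make_job_name('test')` would give a new name on every run, so the per-job folders derived from it never collide.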

#### This section defines all the variables needed to specify the S3 paths for input and output data

In [34]:
############################   INPUTS  ####################################

# ECR repository containing the Docker image configured to be executed by SageMaker
ecr_container_uri = "011827850615.dkr.ecr.eu-west-1.amazonaws.com/test_repo:latest"

# s3 path containing the dataset needed for training the model
dataset_bucket = "s3://datsetsbucket/isic2018/test_dataset/"

# s3 path containing the model with pretrained weights; in the next example this folder will
# store the Mask R-CNN model trained on COCO.
model_bucket = 's3://cermodelbucket'


############################  OUTPUTS  ####################################

# s3 path that stores the instance profiler results and any other data saved during training in the folder /opt/ml/output/data/
output_path = f's3://{sagemaker_default_bucket}/output'

# s3 path that stores the checkpoints of the training process
checkpoints_path = f'{output_path}/{job_name}/checkpoints'

# Definition of the target s3 bucket folder for the tensorboard outputs and of the container folder where the tensorboard records must be placed.
# The tensorboard output could be written elsewhere, but sagemaker copies the records into '/opt/ml/output/tensorboard', so we
# write the records there directly.
# Note: under 's3://testtflogs/logs' the records are divided into folders named after the job-name; in this example the tensorboard output
# should end up in 's3://testtflogs/logs/test-11'
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://testtflogs/logs',
    container_local_output_path='/opt/ml/output/tensorboard'
)
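
The way the output and checkpoint paths are derived from the bucket and the job-name can be isolated in a small helper, which makes the per-job folder layout easier to check before launching anything. `build_s3_paths` and `split_s3_uri` are hypothetical names introduced here; the layout simply mirrors the f-strings used in the cell above.

```python
from urllib.parse import urlparse


def build_s3_paths(bucket, job_name):
    """Mirror the notebook layout: one checkpoints folder per job-name."""
    output_path = f"s3://{bucket}/output"
    return {
        "output_path": output_path,
        "checkpoints_path": f"{output_path}/{job_name}/checkpoints",
    }


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3 uri: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

With the values used in this notebook, `build_s3_paths(sagemaker_default_bucket, job_name)["checkpoints_path"]` yields the same string as `checkpoints_path`.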


#### This section retrieves the ARN of the execution role associated with this notebook, which is passed to the estimator to launch the training job. Make sure this role has permission to use other buckets; otherwise only buckets whose names start with the sagemaker keyword can be used. In this example the permission is needed.

In [35]:
# getting the execution role of this instance of sagemaker notebook
role = get_execution_role()
print(role)

arn:aws:iam::011827850615:role/service-role/AmazonSageMaker-ExecutionRole-20210522T125188


#### This section defines the hyperparameters. These values are passed to the estimator definition and are available to the training script in the container as command-line arguments, or as environment variables with the notation SM\_HP\_{hyperparameter\_name} in upper case, e.g. SM\_HP\_HP1 or SM\_HP\_BATCH in this case

In [43]:
# hyperparameters definition
hyperparameters = {
    'hp1': 'value_x',
    'hp2': 314,
    'hp3': 3.1415,
    'batch': 7
}
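
On the container side, the training script could pick these values up in either of the two ways described above. The sketch below assumes the container is built on sagemaker-training-toolkit, which is what exposes the `SM_HP_*` variables and passes `--name value` arguments. The function names are hypothetical. Values arrive as strings, so the sketch tries to decode each one as JSON and keeps the raw string when that fails.

```python
import argparse
import json
import os


def parse_value(raw):
    """Decode a hyperparameter as JSON, falling back to the raw string."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        return raw


def read_hyperparameters_from_env(environ=None):
    """Collect every SM_HP_<NAME> variable into a {name: value} dict."""
    environ = os.environ if environ is None else environ
    prefix = "SM_HP_"
    return {
        key[len(prefix):].lower(): parse_value(value)
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def read_hyperparameters_from_args(argv=None):
    """Parse the same hyperparameters from command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hp1", type=str)
    parser.add_argument("--hp2", type=int)
    parser.add_argument("--hp3", type=float)
    parser.add_argument("--batch", type=int)
    # parse_known_args ignores any extra argument the launcher may add.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)
```

The JSON fallback covers string values such as `value_x`, which are not valid JSON and are therefore returned unchanged.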

#### This is the key call of the script: it configures all the training job parameters, passing the paths defined earlier and the hyperparameters, and sets the type of machine used for the job and the mode in which it should run.

#### More specifically, we chose to run in spot mode (use_spot_instances = True). In this mode the training cost drops by 50% to 80%, depending on the instance type chosen and on machine availability. Spot mode lets AWS sell unused cloud compute capacity at a lower price, and AWS can stop your application if someone needs the machine in on-demand mode (without any discount).
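
The discount actually obtained can be estimated from the two durations SageMaker reports when a job ends, namely training seconds and billable seconds. The helpers below are a sketch with hypothetical names. The check on `max_wait` reflects the constraint, as far as we know, that for spot jobs the wait limit must be at least as large as the run limit.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved by spot mode: share of training time not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


def check_spot_limits(max_run, max_wait):
    """Raise if the spot wait limit is shorter than the run limit."""
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)"
        )
    return True
```

For instance, a job with 1000 training seconds billed as 250 seconds corresponds to a 75% saving, and the limits used in this notebook (`max_run = 10*3600`, `max_wait = 24*3600`) pass the check.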


In [51]:
training_test = sagemaker.estimator.Estimator(
    image_uri    = ecr_container_uri, # container image
    role         = role, # role of sagemaker notebook instance
    instance_count = 1, # number of instances to launch
    #instance_type = 'local',  # use local mode
    instance_type = 'ml.m5.large', # type of machine to launch
    volume_size  = 50, # size in GB of the volume attached to the launched instance
    max_run      = 10*3600, # maximum training time in seconds before the instance is forcibly terminated
    output_path  = output_path, # destination s3 bucket for the job output files
    #base_job_name="training-test", # prefix for the training job name, generated automatically if not specified
    hyperparameters = hyperparameters, # training job hyperparameters
    model_uri    = model_bucket, # s3 or notebook path containing the models, which are copied into the corresponding container channel (can be overridden by model_channel_name)
    #model_channel_name = 'model', # name of the channel where the models contained in model_uri are saved
    #metric_definitions = , # list of dicts, each {"Name": "<metric name>", "Regex": "<regex to extract it from the logs>"}
    use_spot_instances = True, # flag that enables spot instances
    max_wait = 24*3600, # at most one day for the spot job to complete, waiting time included
    checkpoint_s3_uri = checkpoints_path, # destination s3 path for the checkpoints
    #checkpoint_local_path = '', # default: '/opt/ml/checkpoints'
    #rules = , # SageMaker Debugger rules
    tensorboard_output_config = tensorboard_output_config, # tensorboard output configuration defined above
    #environment = {}, # dict of environment variables to set in the container
    #max_retry_attempts = , # maximum number of attempts to restart the container if the job does not complete
)
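
With `checkpoint_s3_uri` set, the contents of the local checkpoint folder are synchronized with S3, so a script restarted after a spot interruption can look there for the last saved state. The function below is a sketch of that lookup: `find_latest_checkpoint` is a hypothetical name, and the `checkpoint-<step>.txt` naming is an assumption about how the dummy script names its files.

```python
import os
import re

# Assumed file naming for the fake checkpoints: checkpoint-<step>.txt
_CHECKPOINT_PATTERN = re.compile(r"checkpoint-(\d+)\.txt")


def find_latest_checkpoint(checkpoint_dir="/opt/ml/checkpoints"):
    """Return (step, path) of the newest checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    latest = None
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_PATTERN.fullmatch(name)
        if match is None:
            continue  # skip files that are not checkpoints
        step = int(match.group(1))
        if latest is None or step > latest[0]:
            latest = (step, os.path.join(checkpoint_dir, name))
    return latest


def first_step_to_run(checkpoint_dir="/opt/ml/checkpoints"):
    """Resume after the last saved step, or start from 0 on a fresh run."""
    latest = find_latest_checkpoint(checkpoint_dir)
    return 0 if latest is None else latest[0] + 1
```

A training loop would then start from `first_step_to_run()` instead of always starting from step 0, which is what makes a spot restart cheap.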

In [55]:
training_test.fit(
    inputs      ={
        'dataset': dataset_bucket
    },
    job_name    = job_name,
    wait        = True,
    logs        = 'All'
)

2021-05-22 22:30:02 Starting - Starting the training job...
2021-05-22 22:30:26 Starting - Launching requested ML instances
ProfilerReport-1621722602: InProgress
......
2021-05-22 22:31:26 Starting - Preparing the instances for training...
2021-05-22 22:31:52 Downloading - Downloading input data...
2021-05-22 22:32:26 Training - Downloading the training image...
2021-05-22 22:32:48 Training - Training image download completed. Training in progress.
2021-05-22 22:32:47,914 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-05-22 22:32:47,915 sagemaker-training-toolkit INFO     Failed to parse hyperparameter hp1 value value_x to Json.
Returning the value itself
2021-05-22 22:32:50,947 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-05-22 22:32:50,948 sagemaker-training-toolkit INFO     Failed to parse hyperparameter hp1 value value_x to Json.
Returning the value it