# Fine-Tune a BERT Model and Create a Text Classifier

We fine-tune a BERT model on the Customer Reviews Dataset and add a new classification layer that predicts the `star_rating` for a given `review_body`.

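BERT is built on the Transformer architecture, which is based on an attention mechanism. The architecture lends its name to "Transformers", the popular Python library for BERT-style models maintained by HuggingFace. Here we use a BERT variant called [DistilBert](https://arxiv.org/pdf/1910.01108.pdf). It needs less memory and compute, yet can still maintain high accuracy on this dataset.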

## Feature Engineering

In the earlier ad-hoc notebook, we already performed feature engineering: a pre-trained BERT model generated BERT embeddings from the `review_body` text, and the dataset was split into train, validation, and test files. To optimize TensorFlow training, the files were saved in TFRecord format.

![BERT Training](img/bert_training.png)

![BERT Pre-Processing](img/prepare_dataset_bert.png)

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [2]:
!pip install -q smdebug==0.9.3
!pip install -q sagemaker-experiments==0.1.13

In [3]:
%store -r

![BERT Pre-Processing](img/training_workflow.png)

# Track the `Experiment`

With an Experiment, you can track every step of the experiment: `prepare`, `train`, `optimize`, and `deploy`.

# Concepts

- **Experiment**: A collection of related Trials. Add the Trials you want to compare to the same Experiment.
- **Trial**: A description of a multi-step machine learning workflow. Each step of the workflow is described by a Trial Component. No relationship, such as ordering, is implied between the Trial Components of a Trial.
- **Trial Component**: A single step in a machine learning workflow, for example data cleaning, feature extraction, model training, or model evaluation.
- **Tracker**: A logger for the information of a single Trial Component.
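
The relationship between these concepts can be sketched with plain Python dataclasses. This is only an illustration of the containment hierarchy, not the `smexperiments` API; the class names ending in `Sketch` and the example values are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrialComponentSketch:
    """One step of the workflow, e.g. 'prepare' or 'train'."""
    display_name: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class TrialSketch:
    """A multi-step workflow made of trial components."""
    trial_name: str
    # A list is used only for convenience; no ordering is implied.
    components: List[TrialComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    """A collection of related trials that are compared together."""
    experiment_name: str
    trials: List[TrialSketch] = field(default_factory=list)


# Hypothetical example values
prepare_sketch = TrialComponentSketch('prepare', {'max_seq_length': 128})
train_sketch = TrialComponentSketch('train')
trial_sketch = TrialSketch('trial-example', [prepare_sketch, train_sketch])
experiment_sketch = ExperimentSketch('example-experiment', [trial_sketch])

component_names = [
    component.display_name
    for t in experiment_sketch.trials
    for component in t.components
]
print(component_names)
```

In the cells that follow, the real objects are created with `Experiment.create`, `Trial.create`, and `Tracker.create`.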

![SageMaker Experiments](img/sagemaker-experiments.png)


# 1 ) Create the `Experiment` (*)

In [4]:
import time
from smexperiments.experiment import Experiment

timestamp = '{}'.format(int(time.time()))

experiment = Experiment.create(
                experiment_name='Amazon-Customer-Reviews-BERT-Experiment-{}'.format(timestamp),
                description='Amazon Customer Reviews BERT Experiment', 
                sagemaker_boto_client=sm)

experiment_name = experiment.experiment_name
print('Experiment name: {}'.format(experiment_name))

Experiment name: Amazon-Customer-Reviews-BERT-Experiment-1600347168


# 2 ) Create the `Trial` (*)

In [5]:
import time
from smexperiments.trial import Trial

timestamp = '{}'.format(int(time.time()))

trial = Trial.create(trial_name='trial-{}'.format(timestamp),
                     experiment_name=experiment_name,
                     sagemaker_boto_client=sm)

trial_name = trial.trial_name
print('Trial name: {}'.format(trial_name))

Trial name: trial-1600347168


# 3 ) Create the `prepare` Trial Component and Tracker (*)

The Trial Component is actually created through the Tracker.

In [6]:
from smexperiments.tracker import Tracker

tracker_prepare = Tracker.create(display_name='prepare', 
                                 sagemaker_boto_client=sm)

prepare_trial_component_name = tracker_prepare.trial_component.trial_component_name
print('Prepare trial component name {}'.format(prepare_trial_component_name))

Prepare trial component name TrialComponent-2020-09-17-125248-nmvx


#### Attach the `prepare` Trial Component and Tracker to the Trial as a component.

In [7]:
trial.add_trial_component(tracker_prepare.trial_component)

# 4) Log Parameters for the `prepare` Step (*)

In [8]:
print(s3_raw_input_data)

s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/


In [9]:
tracker_prepare.log_input(name='raw_data_s3_uri', 
                          media_type='s3/uri', 
                          value=s3_raw_input_data)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fdc3cda92e8>,trial_component_name='TrialComponent-2020-09-17-125248-nmvx',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-09-17-125248-nmvx',response_metadata={'RequestId': '0f7abaf7-6159-4aaf-bdc7-23c239221521', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '0f7abaf7-6159-4aaf-bdc7-23c239221521', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Thu, 17 Sep 2020 12:52:49 GMT'}, 'RetryAttempts': 0},parameters={},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={})

In [10]:
print("train_split_percentage :{}".format(train_split_percentage))
print("validation_split_percentage :{}".format(validation_split_percentage))
print("test_split_percentage :{}".format(test_split_percentage))
print("max_seq_length :{}".format(max_seq_length))

train_split_percentage :0.9
validation_split_percentage :0.05
test_split_percentage :0.05
max_seq_length :128


In [11]:
tracker_prepare.log_parameters({
    'max_seq_length': max_seq_length,
    'train_split_percentage': train_split_percentage,
    'validation_split_percentage': validation_split_percentage,
    'test_split_percentage': test_split_percentage,
})

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fdc3cda92e8>,trial_component_name='TrialComponent-2020-09-17-125248-nmvx',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-09-17-125248-nmvx',response_metadata={'RequestId': '0ab375da-e5ac-4678-94b2-14349aef990f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '0ab375da-e5ac-4678-94b2-14349aef990f', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Thu, 17 Sep 2020 12:52:49 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 128, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={})

In [12]:
print("processed_train_data_s3_uri :{}".format(processed_train_data_s3_uri))
print("processed_validation_data_s3_uri :{}".format(processed_validation_data_s3_uri))
print("processed_test_data_s3_uri :{}".format(processed_test_data_s3_uri))

processed_train_data_s3_uri :s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-train
processed_validation_data_s3_uri :s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-validation
processed_test_data_s3_uri :s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-test


In [13]:
tracker_prepare.log_output(name='train_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_train_data_s3_uri)

tracker_prepare.log_output(name='validation_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_validation_data_s3_uri)

tracker_prepare.log_output(name='test_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_test_data_s3_uri)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7fdc3cda92e8>,trial_component_name='TrialComponent-2020-09-17-125248-nmvx',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-09-17-125248-nmvx',response_metadata={'RequestId': '906a6833-8a62-4793-9767-f757e79fa93e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '906a6833-8a62-4793-9767-f757e79fa93e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Thu, 17 Sep 2020 12:52:49 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 128, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={'train_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/sa

# 5 ) Specify the Datasets in S3

The data was already split into train, validation, and test datasets in the previous notebook.

In [14]:
print(processed_train_data_s3_uri)

!aws s3 ls $processed_train_data_s3_uri/

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-train
2020-09-15 05:53:17      50246 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-09-15 05:53:17     452469 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-09-15 05:53:07      71732 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [15]:
print(processed_validation_data_s3_uri)

!aws s3 ls $processed_validation_data_s3_uri/

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-validation
2020-09-15 05:53:17       3330 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-09-15 05:53:17      25571 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-09-15 05:53:07       4533 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [16]:
print(processed_test_data_s3_uri)

!aws s3 ls $processed_test_data_s3_uri/

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-test
2020-09-15 05:53:17       3272 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-09-15 05:53:17      25402 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-09-15 05:53:07       4378 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


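# 6 ) Specify the S3 `Distribution Strategy` (*)

To have Amazon SageMaker replicate the entire dataset on each ML compute instance launched for model training, specify `FullyReplicated`.

To have Amazon SageMaker replicate only a subset of the data on each ML compute instance launched for model training, specify `ShardedByS3Key`. If n ML compute instances are launched for the training job, each instance gets approximately 1/n of the S3 objects, so model training on each machine uses only a subset of the training data.

If you choose more ML compute instances for training than there are available S3 objects, some nodes get no data, and you still pay for the nodes that received no training data. This applies to both File and Pipe modes.
For distributed training across multiple ML compute (EC2) instances, you can choose `ShardedByS3Key`. If the algorithm needs to copy the training data to the ML storage volume (when TrainingInputMode is set to File), each instance copies 1/n of the objects.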
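
The difference between the two strategies can be sketched in a few lines of plain Python. The round-robin assignment in `shard_by_s3_key` is an assumption made for illustration; the exact way SageMaker assigns objects to instances is not shown in this notebook. The file names are the three training files listed in the previous step.

```python
from typing import Dict, List


def shard_by_s3_key(keys: List[str], instance_count: int) -> Dict[int, List[str]]:
    """Give each instance roughly 1/n of the objects (round-robin is an assumption)."""
    shards: Dict[int, List[str]] = {i: [] for i in range(instance_count)}
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


def fully_replicated(keys: List[str], instance_count: int) -> Dict[int, List[str]]:
    """Give every instance a full copy of the object list."""
    return {i: sorted(keys) for i in range(instance_count)}


train_files = [
    'part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord',
    'part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord',
    'part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord',
]

# Four instances but only three objects: one instance is left without data.
sharded = shard_by_s3_key(train_files, instance_count=4)
idle_instances = [i for i, files in sharded.items() if not files]
print(idle_instances)

replicated = fully_replicated(train_files, instance_count=2)
print({i: len(files) for i, files in replicated.items()})
```

With three objects and four instances, one shard stays empty, which is the situation the paragraph above warns about.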

In [17]:
s3_input_train_data = sagemaker.s3_input(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.s3_input(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.s3_input(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)



{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-15-05-48-17-452/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}


# 7 )  Set the Hyperparameters

### 7-1 ) Review the Training Code

In [18]:
!pygmentize src_dir/tf_bert_reviews.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '

    initial_epoch_number_str = latest_checkpoint_file.rsplit('_', 1)[-1].split('.h5')[0]
    initial_epoch_number = int(initial_epoch_number_str)

    loaded_model = TFDistilBertForSequenceClassification.from_pretrained(
                                               latest_checkpoint_file,
                                               config=config)

    print('loaded_model {}'.format(loaded_model))
    print('initial_epoch_number {}'.format(initial_epoch_number))
    
    return loaded_model, initial_epoch_number


if __name__ == '__main__':
    parser = argpar

### 7-2) Set the Hyper-Parameters for the Classification Layer

In [19]:
print(max_seq_length)

128


In [21]:
epochs=10
learning_rate=0.0001
epsilon=0.00000001
train_batch_size=128
validation_batch_size=128
test_batch_size=128
train_steps_per_epoch=50
validation_steps=50
test_steps=50
train_instance_count=1
# train_instance_type='ml.c5.9xlarge'
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024
use_xla=True
use_amp=True
freeze_bert_layer=True
enable_sagemaker_debugger=True
enable_checkpointing=False
enable_tensorboard=False
input_mode='Pipe'
run_validation=True
run_test=True
run_sample_predictions=True
finetune_checkpoint_path='finetune_checkpoint_path/'
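
As a quick sanity check on these values, the following sketch works out how much data a training run touches. It assumes each step consumes exactly one batch, which is the usual meaning of steps-per-epoch in Keras; the derived variable names are introduced here for illustration only.

```python
# Values copied from the hyperparameter cell above
epochs = 10
train_batch_size = 128
train_steps_per_epoch = 50
validation_batch_size = 128
validation_steps = 50

# Assumption: one step processes one batch
examples_per_epoch = train_batch_size * train_steps_per_epoch
total_training_steps = epochs * train_steps_per_epoch
total_training_examples = epochs * examples_per_epoch
validation_examples_per_pass = validation_batch_size * validation_steps

print('examples per epoch:          ', examples_per_epoch)
print('total training steps:        ', total_training_steps)
print('total training examples:     ', total_training_examples)
print('validation examples per pass:', validation_examples_per_pass)
```

Under that assumption, one epoch covers 6,400 examples and the full run covers 64,000 examples in 500 steps.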

# 8 ) Set Up Metrics to Track Model Performance (*)

In [22]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
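
To see what these regular expressions capture, the sketch below applies them to a single log line with Python's `re` module. The sample line is a hypothetical Keras-style progress line, not real output from this training job, and `extract_metrics` is a helper written for this illustration.

```python
import re

metrics_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# Hypothetical log line in the style Keras prints at the end of an epoch
sample_log_line = (
    '50/50 [==============================] - 12s 240ms/step '
    '- loss: 1.2345 - accuracy: 0.4567 - val_loss: 1.1000 - val_accuracy: 0.5000'
)


def extract_metrics(line, definitions):
    """Return the first captured value for every metric regex that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found


extracted = extract_metrics(sample_log_line, metrics_definitions)
print(extracted)

# The unanchored 'loss: ' pattern also matches inside 'val_loss: '
all_loss_matches = re.findall(metrics_definitions[0]['Regex'], sample_log_line)
print(all_loss_matches)
```

Note that `loss: ` is a substring of `val_loss: `, so the training-loss pattern matches twice on a line like this one. If that overlap matters, a stricter pattern is worth considering.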

# 9 ) Set Up SageMaker Debugger (*)

Define the Debugger rules.

In [23]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'use_losses_collection': 'true',
                'num_steps': '10',
                'diff_percent': '50'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        ),
        Rule.sagemaker(
            rule_configs.overtraining(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'patience_train': '10',
                'patience_validation': '10',
                'delta': '0.5'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        )
    ]

hook_config = DebuggerHookConfig(
    hook_parameters={
        'save_interval': '10', # number of steps
        'export_tensorboard': 'true',
        'tensorboard_dir': 'hook_tensorboard/',
    })
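
The idea behind the `loss_not_decreasing` parameters `num_steps` and `diff_percent` can be sketched as follows. This is a simplified illustration written for this notebook, not the built-in rule's implementation: it flags the first step at which the loss has dropped by less than `diff_percent` percent compared with `num_steps` earlier. The loss curves are synthetic.

```python
def loss_not_decreasing(losses, num_steps=10, diff_percent=50.0):
    """Return the first step where the loss fell by less than diff_percent
    percent relative to num_steps earlier, or None if that never happens."""
    for step in range(num_steps, len(losses)):
        previous = losses[step - num_steps]
        current = losses[step]
        if previous <= 0:
            continue  # avoid dividing by zero on degenerate values
        decrease_percent = (previous - current) / previous * 100.0
        if decrease_percent < diff_percent:
            return step
    return None


# Synthetic curves: one decays by 10% per step, the other is flat
decaying_losses = [100.0 * (0.9 ** i) for i in range(30)]
flat_losses = [1.0] * 30

print(loss_not_decreasing(decaying_losses))  # decays about 65% every 10 steps
print(loss_not_decreasing(flat_losses))      # never decreases
```

The decaying curve passes the check, while the flat curve is flagged as soon as `num_steps` values are available.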

# 10 ) Configure the Training Job

### 10-1) Specify the Checkpoint S3 Location (*)

This training job is intended to run on Spot Instances. If a node is replaced, training resumes from the last checkpoint.

In [24]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-322537213286/checkpoints/d6c1ebc7-27f2-4292-bb7c-ef91bffeabbe/
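
Resuming from the last checkpoint requires finding the newest checkpoint file and recovering its epoch number. The sketch below reuses the string handling from the training script listed in 7-1 (`rsplit('_', 1)` followed by `split('.h5')`). The file names and the helper `find_latest_checkpoint` are hypothetical and only illustrate the idea.

```python
from typing import List, Optional


def parse_epoch_number(checkpoint_file: str) -> int:
    """Take the text after the last '_' and before '.h5' as the epoch number."""
    return int(checkpoint_file.rsplit('_', 1)[-1].split('.h5')[0])


def find_latest_checkpoint(files: List[str]) -> Optional[str]:
    """Pick the .h5 file with the highest epoch number, or None if there is none."""
    candidates = [f for f in files if f.endswith('.h5')]
    if not candidates:
        return None
    return max(candidates, key=parse_epoch_number)


# Hypothetical checkpoint file names
example_checkpoint_files = [
    'tf_model_00001.h5',
    'tf_model_00003.h5',
    'tf_model_00002.h5',
]

latest_checkpoint = find_latest_checkpoint(example_checkpoint_files)
resume_epoch = parse_epoch_number(latest_checkpoint)
print(latest_checkpoint, resume_epoch)
```

When no checkpoint exists, `find_latest_checkpoint` returns `None`, in which case training would start from the first epoch.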


### 10-2) Configure the BERT + TensorFlow Script to Run on SageMaker


In [25]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='tf_bert_reviews.py', 
                       source_dir='src_dir', # put requirements.txt in this directory and it gets picked up
                       role=role,
                       train_instance_count=train_instance_count, # Make sure you have at least this number of input files, or the ShardedByS3Key distribution strategy will fail the job because no data is available
                       train_instance_type=train_instance_type,
                       train_volume_size=train_volume_size,
#                        train_use_spot_instances=True,
#                        train_max_wait=7200, # Seconds to wait for spot instances to become available
                       checkpoint_s3_uri=checkpoint_s3_uri,
                       py_version='py3',
                       framework_version='2.1.0',
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_sagemaker_debugger': enable_sagemaker_debugger,
                                        'enable_checkpointing': enable_checkpointing,
                                        'enable_tensorboard': enable_tensorboard,                                        
                                        'run_validation': run_validation,
                                        'run_test': run_test,
                                        'run_sample_predictions': run_sample_predictions,
#                                         'finetune_checkpoint_path' : finetune_checkpoint_path
                                       },
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions,
                       rules=rules,
                       debugger_hook_config=hook_config,                       
                       train_max_run=7200, # max 2 hours * 60 minutes per hour * 60 seconds per minute
                      )

### 10-3)  Create the `Experiment Config` (*)

In [26]:
experiment_config = {
    'ExperimentName': experiment_name,
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'train'
}

# 11) Train the Model

In [27]:
estimator.fit(inputs={'train': s3_input_train_data, 
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
              },              
              experiment_config=experiment_config,                   
              wait=False)

INFO:sagemaker:Creating training-job with name: tensorflow-training-2020-09-17-12-52-57-069


In [28]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

Training Job Name:  tensorflow-training-2020-09-17-12-52-57-069


In [29]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [30]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [31]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))


In [32]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Checkpoint Data</a> After The Training Job Has Completed</b>'.format(bucket, checkpoint_s3_prefix, region)))


In [None]:
estimator.latest_training_job.wait(logs="All")

2020-09-17 12:53:30 Starting - Starting the training job...
2020-09-17 12:53:32 Starting - Launching requested ML instances
********* Debugger Rule Status *********
*
*  LossNotDecreasing: InProgress        
*       Overtraining: InProgress        
*
****************************************
....

<h2><span style="color:red">Please wait until the Training Job above has completed.</span></h2>

# 12 ) Explore the Experiment Tracking Lineage (*)

In [None]:
from sagemaker.analytics import ExperimentAnalytics

lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    metric_names=['validation:accuracy'],
    sort_by="CreationTime",
    sort_order="Ascending",
)

lineage_df = lineage_table.dataframe()
lineage_df.shape

In [None]:
lineage_df

In [None]:
sm.describe_trial_component(TrialComponentName=lineage_df.TrialComponentName[0])

# 13 ) Analyze the Debugger Rules (*)

In [None]:
estimator.latest_training_job.rule_job_summary()

In [None]:
training_job_debugger_artifacts_path = estimator.latest_job_debugger_artifacts_path()
print(training_job_debugger_artifacts_path)

# Pass Variables to the Next Notebook(s)

In [None]:
print(training_job_name, experiment_name, trial_name, prepare_trial_component_name, training_job_debugger_artifacts_path)

In [None]:
%store training_job_name
%store experiment_name
%store trial_name
%store prepare_trial_component_name
%store training_job_debugger_artifacts_path