# Fine-Tuning a BERT Model and Creating a Text Classifier

We fine-tune a BERT model on the Customer Reviews Dataset and add a new classification layer that predicts the `star_rating` for a given `review_body`.

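BERT is built on the Transformer architecture, which relies on an attention mechanism. "Transformers" is also the name of the popular BERT Python library maintained by HuggingFace. Here we use a BERT variant called [DistilBert](https://arxiv.org/pdf/1910.01108.pdf), which needs less memory and compute while still maintaining high accuracy on this dataset.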

## Feature Engineering

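In the previous ad hoc notebook, we already performed the feature engineering that uses a pre-trained BERT model to generate BERT embeddings from the `review_body` text, and we split the dataset into train, validation, and test files. To optimize TensorFlow training, the files were saved in TFRecord format.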

![BERT Training](img/bert_training.png)

![BERT Pre-Processing](img/prepare_dataset_bert.png)

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [2]:
!pip install -q smdebug==0.9.3
!pip install -q sagemaker-experiments==0.1.13

In [3]:
%store -r

![BERT Pre-Processing](img/training_workflow.png)

# Track the `Experiment`

With an Experiment, we can track every step of the experiment: `prepare`, `train`, `optimize`, and `deploy`.

# Concepts

- **Experiment**: A collection of related Trials. Add the Trials that you want to compare to the same Experiment.  
- **Trial**: A description of a multi-step machine learning workflow. Each step in the workflow is described by a Trial Component. There is no relationship, such as ordering, between Trial Components.  
- **Trial Component**: A single step within a machine learning workflow, for example data cleaning, feature extraction, model training, or model evaluation.  
- **Tracker**: A logger of information about a single Trial Component.
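
The relationships in the list above can be pictured with plain data classes. This is only an illustrative sketch: the `Sketch*` class names and fields are hypothetical and are not part of the `smexperiments` API used in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchTrialComponent:
    """One step of the workflow, e.g. 'prepare' or 'train' (hypothetical class)."""
    display_name: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class SketchTrial:
    """A multi-step workflow; its components carry no ordering relationship."""
    trial_name: str
    components: List[SketchTrialComponent] = field(default_factory=list)


@dataclass
class SketchExperiment:
    """A collection of related trials that are compared together."""
    experiment_name: str
    trials: List[SketchTrial] = field(default_factory=list)


# Build one experiment holding one trial with two components.
prepare_step = SketchTrialComponent('prepare', {'max_seq_length': 128})
train_step = SketchTrialComponent('train')
sketch_trial = SketchTrial('trial-1', [prepare_step, train_step])
sketch_experiment = SketchExperiment('reviews-bert', [sketch_trial])

component_names = [
    component.display_name
    for t in sketch_experiment.trials
    for component in t.components
]
print(component_names)
```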

![SageMaker Experiments](img/sagemaker-experiments.png)


# 1 ) Create the `Experiment` (*)

In [4]:
import time
from smexperiments.experiment import Experiment

timestamp = '{}'.format(int(time.time()))

experiment = Experiment.create(
                experiment_name='Amazon-Customer-Reviews-BERT-Experiment-{}'.format(timestamp),
                description='Amazon Customer Reviews BERT Experiment', 
                sagemaker_boto_client=sm)

experiment_name = experiment.experiment_name
print('Experiment name: {}'.format(experiment_name))

Experiment name: Amazon-Customer-Reviews-BERT-Experiment-1607431676


# 2 ) Create the `Trial` (*)

In [5]:
import time
from smexperiments.trial import Trial

timestamp = '{}'.format(int(time.time()))

trial = Trial.create(trial_name='trial-{}'.format(timestamp),
                     experiment_name=experiment_name,
                     sagemaker_boto_client=sm)

trial_name = trial.trial_name
print('Trial name: {}'.format(trial_name))

Trial name: trial-1607431676


# 3 ) Create the `prepare` Trial Component and Tracker (*)

The Trial Component itself is created through the Tracker.

In [6]:
from smexperiments.tracker import Tracker

tracker_prepare = Tracker.create(display_name='prepare', 
                                 sagemaker_boto_client=sm)

prepare_trial_component_name = tracker_prepare.trial_component.trial_component_name
print('Prepare trial component name {}'.format(prepare_trial_component_name))

Prepare trial component name TrialComponent-2020-12-08-124756-carq


#### Attach the `prepare` Trial Component and Tracker to the Trial as a Component.

In [7]:
trial.add_trial_component(tracker_prepare.trial_component)

# 4) Log Parameters in the `prepare` Step (*)

In [8]:
print(s3_raw_input_data)

s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/


In [9]:
tracker_prepare.log_input(name='raw_data_s3_uri', 
                          media_type='s3/uri', 
                          value=s3_raw_input_data)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f87a4f8e5f8>,trial_component_name='TrialComponent-2020-12-08-124756-carq',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-12-08-124756-carq',response_metadata={'RequestId': 'fecacf50-d570-46f4-8477-d4aad3f7a532', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'fecacf50-d570-46f4-8477-d4aad3f7a532', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Tue, 08 Dec 2020 12:47:56 GMT'}, 'RetryAttempts': 0},parameters={},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={})

In [10]:
print("train_split_percentage :{}".format(train_split_percentage))
print("validation_split_percentage :{}".format(validation_split_percentage))
print("test_split_percentage :{}".format(test_split_percentage))
print("max_seq_length :{}".format(max_seq_length))

train_split_percentage :0.9
validation_split_percentage :0.05
test_split_percentage :0.05
max_seq_length :128


In [11]:
tracker_prepare.log_parameters({
    'max_seq_length': max_seq_length,
    'train_split_percentage': train_split_percentage,
    'validation_split_percentage': validation_split_percentage,
    'test_split_percentage': test_split_percentage,
})

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f87a4f8e5f8>,trial_component_name='TrialComponent-2020-12-08-124756-carq',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-12-08-124756-carq',response_metadata={'RequestId': 'af63045e-9dd0-46c7-bbfe-5f81be666bf1', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'af63045e-9dd0-46c7-bbfe-5f81be666bf1', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Tue, 08 Dec 2020 12:47:56 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 128, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={})

In [12]:
print("processed_train_data_s3_uri :{}".format(processed_train_data_s3_uri))
print("processed_validation_data_s3_uri :{}".format(processed_validation_data_s3_uri))
print("processed_test_data_s3_uri :{}".format(processed_test_data_s3_uri))

processed_train_data_s3_uri :s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-train
processed_validation_data_s3_uri :s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-validation
processed_test_data_s3_uri :s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-test


In [13]:
tracker_prepare.log_output(name='train_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_train_data_s3_uri)

tracker_prepare.log_output(name='validation_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_validation_data_s3_uri)

tracker_prepare.log_output(name='test_data_s3_uri', 
                           media_type='s3/uri', 
                           value=processed_test_data_s3_uri)

# must save after logging
tracker_prepare.trial_component.save()

TrialComponent(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f87a4f8e5f8>,trial_component_name='TrialComponent-2020-12-08-124756-carq',display_name='prepare',trial_component_arn='arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-12-08-124756-carq',response_metadata={'RequestId': 'f9d79ea2-cae1-4cc7-aa4b-ed7442298f07', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'f9d79ea2-cae1-4cc7-aa4b-ed7442298f07', 'content-type': 'application/x-amz-json-1.1', 'content-length': '129', 'date': 'Tue, 08 Dec 2020 12:47:56 GMT'}, 'RetryAttempts': 0},parameters={'max_seq_length': 128, 'train_split_percentage': 0.9, 'validation_split_percentage': 0.05, 'test_split_percentage': 0.05},input_artifacts={'raw_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/',media_type='s3/uri')},output_artifacts={'train_data_s3_uri': TrialComponentArtifact(value='s3://sagemaker-us-east-1-322537213286/sa

# 5 ) Specify the Dataset in S3

The data was already split into train, validation, and test datasets in the previous notebook.

In [14]:
print(processed_train_data_s3_uri)

!aws s3 ls $processed_train_data_s3_uri/

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-train
2020-12-08 12:21:30      51050 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-12-08 12:21:30     451186 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-12-08 12:21:05      71910 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [15]:
print(processed_validation_data_s3_uri)

!aws s3 ls $processed_validation_data_s3_uri/

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-validation
2020-12-08 12:21:30       3371 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-12-08 12:21:30      25288 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-12-08 12:21:05       4263 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [20]:
print(processed_test_data_s3_uri)

!aws s3 ls $processed_test_data_s3_uri/

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-test
2020-12-08 12:21:31       3490 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-12-08 12:21:31      25008 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-12-08 12:21:06       4357 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


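# 6 ) Specify the S3 `Distribution Strategy` (*)

To have Amazon SageMaker replicate the entire dataset on each ML compute instance launched for model training, specify `FullyReplicated`.

To have Amazon SageMaker replicate only a subset of the data on each ML compute instance launched for model training, specify `ShardedByS3Key`. If n ML compute instances are launched for a training job, each instance gets approximately 1/n of the S3 objects, so model training on each machine uses only a subset of the training data.

If you choose more ML compute instances for training than there are available S3 objects, some nodes receive no data, and you still pay for the nodes that receive no training data. This applies to both File and Pipe modes. 
In distributed training that uses multiple ML compute EC2 instances, you can choose `ShardedByS3Key`. If the algorithm needs to copy the training data to the ML storage volume (when TrainingInputMode is set to File), it copies 1/n of the objects.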
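
The difference between the two strategies can be sketched in plain Python. The round-robin assignment below is an assumption made for illustration, not SageMaker's actual sharding algorithm, and the function names are hypothetical. The file names are the three training objects listed in the previous step, and the `algo-N` host names follow the naming seen in the training logs.

```python
from typing import Dict, List


def shard_by_s3_key(object_keys: List[str], instance_count: int) -> Dict[str, List[str]]:
    """Give each S3 object to exactly one instance (round-robin, illustration only)."""
    shards: Dict[str, List[str]] = {
        'algo-{}'.format(i + 1): [] for i in range(instance_count)
    }
    hosts = list(shards)
    for index, key in enumerate(sorted(object_keys)):
        shards[hosts[index % instance_count]].append(key)
    return shards


def fully_replicated(object_keys: List[str], instance_count: int) -> Dict[str, List[str]]:
    """Give every instance a full copy of the object list."""
    return {'algo-{}'.format(i + 1): sorted(object_keys) for i in range(instance_count)}


keys = [
    'part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord',
    'part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord',
    'part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord',
]

two_instances = shard_by_s3_key(keys, 2)
four_instances = shard_by_s3_key(keys, 4)
replicated = fully_replicated(keys, 2)

# With more instances than objects, at least one host is left without data.
idle_hosts = [host for host, assigned in four_instances.items() if not assigned]
print(idle_hosts)
```

With three objects and four instances, one host ends up with an empty shard, which is the situation the text above warns about.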

In [21]:
s3_input_train_data = sagemaker.inputs.TrainingInput(s3_data=processed_train_data_s3_uri, 
                                         distribution='ShardedByS3Key') 
s3_input_validation_data = sagemaker.inputs.TrainingInput(s3_data=processed_validation_data_s3_uri, 
                                              distribution='ShardedByS3Key')
s3_input_test_data = sagemaker.inputs.TrainingInput(s3_data=processed_test_data_s3_uri, 
                                        distribution='ShardedByS3Key')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-train', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-test', 'S3DataDistributionType': 'ShardedByS3Key'}}}


# 7 ) Set Up Hyperparameters

### 7-1 ) Review the Training Code

In [22]:
!pygmentize src_dir/tf_bert_reviews.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '

    print('run_test {}'.format(run_test))
    run_sample_predictions = args.run_sample_predictions
    print('run_sample_predictions {}'.format(run_sample_predictions))
    enable_tensorboard = args.enable_tensorboard
    print('enable_tensorboard {}'.format(enable_tensorboard))
    enable_checkpointing = args.enable_checkpointing
    print('enable_checkpointing {}'.format(enable_checkpointing))

    finetune_checkpoint_path = args.finetune_checkpoint_path
    print('finetune_checkpoint_path {}'.format(finetune_checkpoint_path))

    checkpoint_base_path = args.

### 7-2) Set Hyper-Parameters for the Classification Layer

In [23]:
print(max_seq_length)

128


In [24]:
epochs=10
learning_rate=0.0001
epsilon=0.00000001
train_batch_size=128
validation_batch_size=128
test_batch_size=128
train_steps_per_epoch=50
validation_steps=50
test_steps=50
train_instance_count=1
# train_instance_type='ml.c5.9xlarge'
train_instance_type='ml.p3.2xlarge'
train_volume_size=1024
use_xla=True
use_amp=True
freeze_bert_layer=True
enable_sagemaker_debugger=True
enable_checkpointing=False
enable_tensorboard=False
input_mode='Pipe'
run_validation=True
run_test=True
run_sample_predictions=True
finetune_checkpoint_path='finetune_checkpoint_path/'
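
To get a feel for how much data these settings cover, the short calculation below works out steps and examples per run. It is a back-of-the-envelope sketch that assumes every training step consumes exactly one full batch; whether examples repeat across epochs depends on the input pipeline in `tf_bert_reviews.py`, which is not shown in full here.

```python
# Values copied from the hyper-parameter cell above.
epochs = 10
train_batch_size = 128
train_steps_per_epoch = 50
max_seq_length = 128

# Assumption: one step == one full batch.
examples_per_epoch = train_batch_size * train_steps_per_epoch
total_train_steps = epochs * train_steps_per_epoch
total_examples_seen = epochs * examples_per_epoch
tokens_per_batch = train_batch_size * max_seq_length

print('examples per epoch  :', examples_per_epoch)
print('total train steps   :', total_train_steps)
print('total examples seen :', total_examples_seen)
print('tokens per batch    :', tokens_per_batch)
```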

# 8 ) Set Up Metrics to Track Model Performance (*)

In [25]:
metrics_definitions = [
     {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
     {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
     {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
     {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]
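
The sketch below applies these patterns to a single log line with Python's `re` module to show what each capture group extracts. The log line is a made-up example in the style of Keras progress output, not a line from this job, and the sketch says nothing about how the SageMaker service itself handles multiple matches.

```python
import re

metrics_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'train:accuracy', 'Regex': 'accuracy: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_accuracy: ([0-9\\.]+)'},
]

# Hypothetical log line for illustration only.
log_line = ('50/50 - 12s - loss: 0.9512 - accuracy: 0.6250 '
            '- val_loss: 0.8843 - val_accuracy: 0.6562')

# First match per pattern, converted to float.
first_match = {
    m['Name']: float(re.search(m['Regex'], log_line).group(1))
    for m in metrics_definitions
}

# Every match per pattern: 'loss: ' also occurs inside 'val_loss: '.
all_matches = {
    m['Name']: re.findall(m['Regex'], log_line)
    for m in metrics_definitions
}

print(first_match)
print(all_matches['train:loss'])
```

Note that the `train:loss` and `train:accuracy` patterns also match the text inside `val_loss:` and `val_accuracy:`, so the order of values on the log line matters when reading these metrics.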

# 9 ) Set Up SageMaker Debugger (*)

Define the Debugger rules.

In [26]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'use_losses_collection': 'true',
                'num_steps': '10',
                'diff_percent': '50'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        ),
        Rule.sagemaker(
            rule_configs.overtraining(),
            rule_parameters={
                'collection_names': 'losses,metrics',
                'patience_train': '10',
                'patience_validation': '10',
                'delta': '0.5'
            },
            collections_to_save=[
                CollectionConfig(name='losses',
                                 parameters={
                                     'save_interval': '10',
                                 }),
                CollectionConfig(name='metrics',
                                 parameters={
                                     'save_interval': '10',
                                 })
            ]
        )
    ]

hook_config = DebuggerHookConfig(
    hook_parameters={
        'save_interval': '10', # number of steps
        'export_tensorboard': 'true',
        'tensorboard_dir': 'hook_tensorboard/',
    })
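
As a rough intuition for the `num_steps` and `diff_percent` parameters of the `loss_not_decreasing` rule, the sketch below flags a step when the loss has not dropped by at least `diff_percent` percent since the last check made at least `num_steps` earlier. This is a simplified reading of the parameters under the assumption of positive loss values; it is not the implementation of the built-in rule, and the function name and loss values are hypothetical.

```python
from typing import Dict, List


def loss_not_decreasing_steps(losses_by_step: Dict[int, float],
                              num_steps: int,
                              diff_percent: float) -> List[int]:
    """Return steps where the loss fell by less than diff_percent since the last check."""
    flagged: List[int] = []
    last_step = None
    last_loss = None
    for step in sorted(losses_by_step):
        loss = losses_by_step[step]
        if last_step is None:
            # First saved value becomes the baseline.
            last_step, last_loss = step, loss
            continue
        if step - last_step < num_steps:
            # Too soon since the previous check; skip this value.
            continue
        drop_percent = (last_loss - loss) / last_loss * 100.0
        if drop_percent < diff_percent:
            flagged.append(step)
        last_step, last_loss = step, loss
    return flagged


# Hypothetical losses saved every 10 steps (matching save_interval above).
saved_losses = {0: 2.0, 10: 0.9, 20: 0.8, 30: 0.3}
flagged = loss_not_decreasing_steps(saved_losses, num_steps=10, diff_percent=50.0)
print(flagged)
```

In this made-up series only the step from 0.9 to 0.8 falls short of a 50 percent drop, so only that step is flagged.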

# 10 ) Set Up the Training Job

### 10-1) Specify the Checkpoint S3 Location (*)

This training job is intended to run on Spot Instances. If a node is replaced, training restarts from the last checkpoint.

In [27]:
import uuid

checkpoint_s3_prefix = 'checkpoints/{}'.format(str(uuid.uuid4()))
checkpoint_s3_uri = 's3://{}/{}/'.format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-322537213286/checkpoints/b998b3da-45eb-481e-aa44-f3426fc0ace4/


### 10-2) Set Up the BERT + TensorFlow Script to Run on SageMaker


In [30]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='tf_bert_reviews.py', 
                       source_dir='src_dir', # put requirements.txt in this directory and it gets picked up
                       role=role,
                       instance_count=train_instance_count, # Make sure you have at least this number of input files or the ShardedByS3Key distribution strategy will fail the job due to no data available
                       instance_type=train_instance_type,
                       volume_size=train_volume_size,
#                        use_spot_instances=True,
#                        max_wait=7200, # Seconds to wait for spot instances to become available
                       checkpoint_s3_uri=checkpoint_s3_uri,
                       py_version='py3',
                       framework_version='2.1.0',
                       hyperparameters={'epochs': epochs,
                                        'learning_rate': learning_rate,
                                        'epsilon': epsilon,
                                        'train_batch_size': train_batch_size,
                                        'validation_batch_size': validation_batch_size,
                                        'test_batch_size': test_batch_size,                                             
                                        'train_steps_per_epoch': train_steps_per_epoch,
                                        'validation_steps': validation_steps,
                                        'test_steps': test_steps,
                                        'use_xla': use_xla,
                                        'use_amp': use_amp,                                             
                                        'max_seq_length': max_seq_length,
                                        'freeze_bert_layer': freeze_bert_layer,
                                        'enable_sagemaker_debugger': enable_sagemaker_debugger,
                                        'enable_checkpointing': enable_checkpointing,
                                        'enable_tensorboard': enable_tensorboard,                                        
                                        'run_validation': run_validation,
                                        'run_test': run_test,
                                        'run_sample_predictions': run_sample_predictions,
#                                         'finetune_checkpoint_path' : finetune_checkpoint_path
                                       },
                       input_mode=input_mode,
                       metric_definitions=metrics_definitions,
                       rules=rules,
                       debugger_hook_config=hook_config,                       
                       max_run=7200, # max 2 hours * 60 minutes per hour * 60 seconds per minute
                      )

### 10-3) Create the `Experiment Config` (*)

In [31]:
experiment_config = {
    'ExperimentName': experiment_name,
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'train'
}

# 11) Train the Model

In [32]:
estimator.fit(inputs={'train': s3_input_train_data, 
                      'validation': s3_input_validation_data,
                      'test': s3_input_test_data
              },              
              experiment_config=experiment_config,                   
              wait=False)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: tensorflow-training-2020-12-08-12-54-45-217


In [33]:
training_job_name = estimator.latest_training_job.name
print('Training Job Name:  {}'.format(training_job_name))

Training Job Name:  tensorflow-training-2020-12-08-12-54-45-217


In [34]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [35]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, training_job_name)))


In [36]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(bucket, training_job_name, region)))


In [37]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Checkpoint Data</a> After The Training Job Has Completed</b>'.format(bucket, checkpoint_s3_prefix, region)))


In [38]:
estimator.latest_training_job.wait(logs="All")

2020-12-08 12:57:44 Starting - Preparing the instances for training...
2020-12-08 12:58:08 Downloading - Downloading input data
2020-12-08 12:58:08 Training - Downloading the training image
********* Debugger Rule Status *********
*
*  LossNotDecreasing: InProgress        
*       Overtraining: InProgress        
*
****************************************
.........
2020-12-08 12:59:46 Training - Training image download completed. Training in progress.
2020-12-08 12:59:39,056 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-12-08 12:59:39,710 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "validation": "/opt/ml/input/data/validation",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.t

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Collecting smdebug==0.8.0
  Downloading smdebug-0.8.0-py2.py3-none-any.whl (166 kB)
Installing collected packages: smdebug
  Attempting uninstall: smdebug
    Found existing installation: smdebug 0.7.2
    Uninstalling smdebug-0.7.2:
      Successfully uninstalled smdebug-0.7.2
Successfully installed smdebug-0.8.0
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Collecting scikit-learn==0.23.1
  Downloading scikit_learn-0.23.1-cp36-cp36m-manylinux1_x86_64.whl (6.8 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22
    Uninstalling scikit-learn-0.22:
      Successfully uninstalled scikit-learn-

train_data_filenames []
***** Using pipe_mode with channel train
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 27.6MB/s]
Downloading: 100%|██████████| 442/442 [00:00<00:00, 543kB/s]
Downloading:   6%|▋         | 22.8M/363M [00:00<00:08, 39.7MB/s]

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

[2020-12-08 13:00:46.331 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated
[2020-12-08 13:00:46.369 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated
[2020-12-08 13:00:46.406 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated
[2020-12-08 13:00:46.444 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated
[2020-12-08 13:00:46.484 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated
[2020-12-08 13:00:46.523 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files creat

[34m[2020-12-08 13:01:56.692 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:01:56.827 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:01:56.952 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files creat

[34m[2020-12-08 13:02:06.251 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:06.382 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:06.744 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:06.905 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:07.088 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:07.268 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files creat

[34m[2020-12-08 13:02:21.238 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:21.390 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:21.609 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:21.781 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:21.904 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files created yet, to be updated[0m
[34m[2020-12-08 13:02:22.021 ip-10-0-102-78.ec2.internal:30 INFO state_store.py:95] Checkpoints not updated. There are no checkpoint files creat

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /opt/ml/model/tensorflow/saved_model/0/assets
INFO:transformers.configuration_utils:loading configuration file /opt/ml/model/transformers/fine-tuned/config.json
INFO:transformers.configuration_utils:Model config DistilBertConfig {
  "_num_labels": 5,
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bad_words_ids": null,
  "bos_token_id": null,
  "decoder_start_token_id": null,
  "dim": 768,
  "do_sample": false,
  "dropout": 0.1,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_dim": 3072,
  "id2label": {
    "0": 1,
    "1": 2,
    "2": 3,
    "3":

<h2><span style="color:red">Please wait until the Training Job above has completed.</span></h2>

# 12 ) Explore the Experiment Tracking Lineage (*)

In [39]:
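from sagemaker.analytics import ExperimentAnalytics

# Collect the trial components recorded under this experiment, oldest first.
lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    metric_names=['validation:accuracy'],
    sort_by="CreationTime",
    sort_order="Ascending",
)

# One row per trial component; the columns hold its parameters, metrics, and artifacts.
lineage_df = lineage_table.dataframe()
lineage_df.shape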

(2, 63)

In [40]:
lineage_df

| | TrialComponentName | DisplayName | max_seq_length | test_split_percentage | train_split_percentage | validation_split_percentage | raw_data_s3_uri - MediaType | raw_data_s3_uri - Value | test_data_s3_uri - MediaType | test_data_s3_uri - Value | ... | train - MediaType | train - Value | validation - MediaType | validation - Value | SageMaker.Checkpoints - MediaType | SageMaker.Checkpoints - Value | SageMaker.DebugHookOutput - MediaType | SageMaker.DebugHookOutput - Value | SageMaker.ModelArtifact - MediaType | SageMaker.ModelArtifact - Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TrialComponent-2020-12-08-124756-carq | prepare | 128.0 | 0.05 | 0.9 | 0.05 | s3/uri | s3://sagemaker-us-east-1-322537213286/amazon-r... | s3/uri | s3://sagemaker-us-east-1-322537213286/sagemake... | ... | | | | | | | | | | |
| 1 | tensorflow-training-2020-12-08-12-54-45-217-aw... | train | 128.0 | | | | | | | | ... | | s3://sagemaker-us-east-1-322537213286/sagemake... | | s3://sagemaker-us-east-1-322537213286/sagemake... | | s3://sagemaker-us-east-1-322537213286/checkpoi... | | s3://sagemaker-us-east-1-322537213286/ | | s3://sagemaker-us-east-1-322537213286/tensorfl... |
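
The lineage table is wide (63 columns), and each trial component fills in only a subset of them: the `prepare` row carries the split percentages and dataset URIs, while the `train` row carries the checkpoint and model artifact locations. The following is a minimal sketch of listing just the populated columns per component. The helper names `is_missing` and `populated_columns` and the sample records are hypothetical; they are not part of the SageMaker SDK.

```python
import math


def is_missing(value):
    """Treat None and float NaN (how pandas marks empty cells) as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def populated_columns(records, name_key="TrialComponentName"):
    """Map each trial component name to the columns that hold a value."""
    return {
        rec[name_key]: [
            key for key, value in rec.items()
            if key != name_key and not is_missing(value)
        ]
        for rec in records
    }


# Hypothetical stand-in for a few columns of the lineage table.
sample_records = [
    {
        "TrialComponentName": "prepare-component",
        "DisplayName": "prepare",
        "max_seq_length": 128.0,
        "train_split_percentage": 0.9,
        "SageMaker.ModelArtifact - Value": None,
    },
    {
        "TrialComponentName": "train-component",
        "DisplayName": "train",
        "max_seq_length": 128.0,
        "train_split_percentage": float("nan"),
        "SageMaker.ModelArtifact - Value": "s3://example-bucket/model.tar.gz",
    },
]

summary = populated_columns(sample_records)
for component, columns in summary.items():
    print(component, "->", columns)
```

In the notebook, the same helper could be applied to the real table with `populated_columns(lineage_df.to_dict("records"))`.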


In [41]:
sm.describe_trial_component(TrialComponentName=lineage_df.TrialComponentName[0])

{'TrialComponentName': 'TrialComponent-2020-12-08-124756-carq',
 'TrialComponentArn': 'arn:aws:sagemaker:us-east-1:322537213286:experiment-trial-component/trialcomponent-2020-12-08-124756-carq',
 'DisplayName': 'prepare',
 'CreationTime': datetime.datetime(2020, 12, 8, 12, 47, 56, 211000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedTime': datetime.datetime(2020, 12, 8, 12, 47, 57, 468000, tzinfo=tzlocal()),
 'LastModifiedBy': {},
 'Parameters': {'max_seq_length': {'NumberValue': 128.0},
  'test_split_percentage': {'NumberValue': 0.05},
  'train_split_percentage': {'NumberValue': 0.9},
  'validation_split_percentage': {'NumberValue': 0.05}},
 'InputArtifacts': {'raw_data_s3_uri': {'MediaType': 's3/uri',
   'Value': 's3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/'}},
 'OutputArtifacts': {'test_data_s3_uri': {'MediaType': 's3/uri',
   'Value': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-12-08-12-16-33-543/output/bert-test'},
  'train_data_s3_ur

# 13 ) Analyze the Debugger Rules (*)

In [42]:
estimator.latest_training_job.rule_job_summary()

[{'RuleConfigurationName': 'LossNotDecreasing',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/tensorflow-training-2020-1-lossnotdecreasing-b99afe0e',
  'RuleEvaluationStatus': 'NoIssuesFound',
  'LastModifiedTime': datetime.datetime(2020, 12, 8, 13, 4, 31, 578000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'Overtraining',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/tensorflow-training-2020-1-overtraining-fd755f53',
  'RuleEvaluationStatus': 'IssuesFound',
  'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule Overtraining at step 199 resulted in the condition being met\n',
  'LastModifiedTime': datetime.datetime(2020, 12, 8, 13, 4, 31, 578000, tzinfo=tzlocal())}]
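
In the summary above, `LossNotDecreasing` reports `NoIssuesFound`, while `Overtraining` reports `IssuesFound` at step 199. With more rules attached, it can help to pull out only the flagged ones. The following is a minimal sketch, assuming the list-of-dicts shape shown in the output; `rules_with_issues` is a hypothetical helper and the sample list is a shortened copy of that output.

```python
def rules_with_issues(rule_summaries):
    """Return (rule name, status details) for every rule that flagged an issue."""
    return [
        (rule["RuleConfigurationName"], rule.get("StatusDetails", "").strip())
        for rule in rule_summaries
        if rule.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Shortened copy of the rule summary printed above.
sample_summary = [
    {
        "RuleConfigurationName": "LossNotDecreasing",
        "RuleEvaluationStatus": "NoIssuesFound",
    },
    {
        "RuleConfigurationName": "Overtraining",
        "RuleEvaluationStatus": "IssuesFound",
        "StatusDetails": "RuleEvaluationConditionMet: Evaluation of the rule "
                         "Overtraining at step 199 resulted in the condition being met\n",
    },
]

flagged = rules_with_issues(sample_summary)
for name, details in flagged:
    print(f"{name}: {details}")
```

In the notebook, the helper could be called as `rules_with_issues(estimator.latest_training_job.rule_job_summary())`.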

In [43]:
training_job_debugger_artifacts_path = estimator.latest_job_debugger_artifacts_path()
print(training_job_debugger_artifacts_path)

s3://sagemaker-us-east-1-322537213286/tensorflow-training-2020-12-08-12-54-45-217/debug-output


# Pass Variables to the Next Notebook(s)

In [44]:
print(training_job_name, experiment_name, trial_name, prepare_trial_component_name, training_job_debugger_artifacts_path)

tensorflow-training-2020-12-08-12-54-45-217 Amazon-Customer-Reviews-BERT-Experiment-1607431676 trial-1607431676 TrialComponent-2020-12-08-124756-carq s3://sagemaker-us-east-1-322537213286/tensorflow-training-2020-12-08-12-54-45-217/debug-output


In [45]:
%store training_job_name
%store experiment_name
%store trial_name
%store prepare_trial_component_name
%store training_job_debugger_artifacts_path

Stored 'training_job_name' (str)
Stored 'experiment_name' (str)
Stored 'trial_name' (str)
Stored 'prepare_trial_component_name' (str)
Stored 'training_job_debugger_artifacts_path' (str)
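
The next notebook restores these values with `%store -r`. If this notebook was not run to completion, some of them will be missing and the failure only shows up later as a `NameError`. The following is a minimal sketch of an early check; `missing_variables` is a hypothetical helper, and the `restored` dict stands in for `globals()` in the next notebook.

```python
REQUIRED_VARIABLES = [
    "training_job_name",
    "experiment_name",
    "trial_name",
    "prepare_trial_component_name",
    "training_job_debugger_artifacts_path",
]


def missing_variables(namespace, required=REQUIRED_VARIABLES):
    """Return the required variable names that are absent from the namespace."""
    return [name for name in required if name not in namespace]


# Stand-in for globals() after `%store -r`; only two values were restored here.
restored = {
    "training_job_name": "tensorflow-training-2020-12-08-12-54-45-217",
    "experiment_name": "Amazon-Customer-Reviews-BERT-Experiment-1607431676",
}

missing = missing_variables(restored)
if missing:
    print("Re-run the previous notebook; missing:", ", ".join(missing))
```

In a notebook cell, the check would be `missing_variables(globals())` placed right after `%store -r`.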
