# NOTE:  THIS NOTEBOOK WILL TAKE ABOUT 30 MINUTES TO COMPLETE.

# PLEASE BE PATIENT.

# Fine-Tune a BERT Model and Create a Text Classifier

In the previous section, we performed feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we saved the files in TFRecord format.

Now, let’s fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

![BERT Training](img/bert_training.png)

As mentioned earlier, BERT is built on an attention-based architecture called the Transformer. This is, not coincidentally, the name of the popular BERT Python library, “Transformers,” maintained by a company called Hugging Face. We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), which requires less memory and compute but maintains very good accuracy on our dataset.

In [406]:
import boto3
import sagemaker
import pandas as pd

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name="sagemaker", region_name=region)

# _PRE-REQUISITE: You need to have successfully run the notebooks in the `PREPARE` section before you continue with this notebook._
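
The cells that follow repeat the same `try`/`except NameError` check for each prerequisite variable. As a minimal sketch, the same check could be done in one place. The helper names `missing_variables` and `require_variables` are hypothetical and not part of the workshop code; in the notebook you would pass `globals()` as the namespace.

```python
def missing_variables(namespace, required_names):
    """Return the names in required_names that are not defined in namespace."""
    return [name for name in required_names if name not in namespace]


def require_variables(namespace, required_names):
    """Raise one descriptive error if any prerequisite variable is missing."""
    missing = missing_variables(namespace, required_names)
    if missing:
        raise NameError(
            "Please run the notebooks in the PREPARE section first. "
            "Missing variables: {}".format(", ".join(missing))
        )


# Example with a stand-in namespace; in the notebook you would pass globals().
example_namespace = {"max_seq_length": 64}
print(missing_variables(example_namespace, ["max_seq_length", "trial_name"]))
```

Raising an error rather than printing a banner stops "Run All" at the first missing prerequisite instead of failing later with a less helpful message.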

In [407]:
processed_train_data_s3_uri = "s3://{}/06_prepare/training".format(bucket)
%store -r processed_train_data_s3_uri
!aws s3 cp --recursive "s3://usd-mads-508/06_prepare/output/bert-train/" $processed_train_data_s3_uri/

no stored variable or alias processed_train_data_s3_uri
copy: s3://usd-mads-508/06_prepare/output/bert-train/part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/training/part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
copy: s3://usd-mads-508/06_prepare/output/bert-train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/training/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
copy: s3://usd-mads-508/06_prepare/output/bert-train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/training/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [408]:
try:
    processed_train_data_s3_uri
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [409]:
print(processed_train_data_s3_uri)

s3://sagemaker-us-east-1-421477113665/06_prepare/training


In [410]:
processed_validation_data_s3_uri = "s3://{}/06_prepare/validation".format(bucket)
%store -r processed_validation_data_s3_uri
!aws s3 cp --recursive "s3://usd-mads-508/06_prepare/output/bert-validation/" $processed_validation_data_s3_uri/

no stored variable or alias processed_validation_data_s3_uri
copy: s3://usd-mads-508/06_prepare/output/bert-validation/part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/validation/part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
copy: s3://usd-mads-508/06_prepare/output/bert-validation/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/validation/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
copy: s3://usd-mads-508/06_prepare/output/bert-validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [411]:
try:
    processed_validation_data_s3_uri
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [412]:
print(processed_validation_data_s3_uri)

s3://sagemaker-us-east-1-421477113665/06_prepare/validation


In [413]:
processed_test_data_s3_uri = "s3://{}/06_prepare/test".format(bucket)
%store -r processed_test_data_s3_uri
!aws s3 cp --recursive "s3://usd-mads-508/06_prepare/output/bert-test/" $processed_test_data_s3_uri/

no stored variable or alias processed_test_data_s3_uri
copy: s3://usd-mads-508/06_prepare/output/bert-test/part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/test/part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
copy: s3://usd-mads-508/06_prepare/output/bert-test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
copy: s3://usd-mads-508/06_prepare/output/bert-test/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord to s3://sagemaker-us-east-1-421477113665/06_prepare/test/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [414]:
try:
    processed_test_data_s3_uri
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [415]:
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-421477113665/06_prepare/test


In [416]:
max_seq_length=64
%store -r max_seq_length

no stored variable or alias max_seq_length


In [417]:
try:
    max_seq_length
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [418]:
print(max_seq_length)

64


In [419]:
experiment_name = "Amazon-Customer-Reviews-BERT-Experiment-1739670178"
%store -r experiment_name

no stored variable or alias experiment_name


In [420]:
try:
    experiment_name
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [421]:
print(experiment_name)

Amazon-Customer-Reviews-BERT-Experiment-1739670178


In [422]:
trial_name = "trial-1739670178"
%store -r trial_name

no stored variable or alias trial_name


In [423]:
try:
    trial_name
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the PREPARE section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [424]:
print(trial_name)

trial-1739670178


# Specify the Dataset in S3
We are using the train, validation, and test splits created in the previous section.

In [425]:
print(processed_train_data_s3_uri)

!aws s3 ls $processed_train_data_s3_uri/

s3://sagemaker-us-east-1-421477113665/06_prepare/training
2025-02-16 04:30:12   10471356 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2025-02-16 04:30:12    2314429 part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
2025-02-16 04:30:12   11704782 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [426]:
print(processed_validation_data_s3_uri)

!aws s3 ls $processed_validation_data_s3_uri/

s3://sagemaker-us-east-1-421477113665/06_prepare/validation
2025-02-16 04:30:15     582736 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2025-02-16 04:30:15     128492 part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
2025-02-16 04:30:15     650931 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [427]:
print(processed_test_data_s3_uri)

!aws s3 ls $processed_test_data_s3_uri/

s3://sagemaker-us-east-1-421477113665/06_prepare/test
2025-02-16 04:30:17     582965 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2025-02-16 04:30:17     128915 part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord
2025-02-16 04:30:17     650976 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


# Specify S3 `Distribution Strategy`

In [428]:
from sagemaker.inputs import TrainingInput

s3_input_train_data = TrainingInput(s3_data=processed_train_data_s3_uri, distribution="ShardedByS3Key")
s3_input_validation_data = TrainingInput(s3_data=processed_validation_data_s3_uri, distribution="ShardedByS3Key")
s3_input_test_data = TrainingInput(s3_data=processed_test_data_s3_uri, distribution="ShardedByS3Key")

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-421477113665/06_prepare/training', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-421477113665/06_prepare/validation', 'S3DataDistributionType': 'ShardedByS3Key'}}}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-421477113665/06_prepare/test', 'S3DataDistributionType': 'ShardedByS3Key'}}}
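
With `ShardedByS3Key`, each training instance receives a disjoint subset of the objects under the S3 prefix, whereas the default `FullyReplicated` copies every object to every instance. The sketch below illustrates the idea using the three training file names listed earlier. The round-robin assignment and the `shard_keys` helper are illustrative assumptions, not a description of how SageMaker actually assigns keys.

```python
def shard_keys(keys, instance_count):
    """Split S3 keys into one disjoint subset per instance (round-robin, for illustration)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


keys = [
    "part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord",
    "part-algo-1-amazon_reviews_us_Gift_Card_v1_00.tfrecord",
    "part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord",
]

for count in (1, 2):
    for rank, shard in enumerate(shard_keys(keys, count)):
        print("instances={} instance={} files={}".format(count, rank, len(shard)))
```

With `train_instance_count = 1`, as in this notebook, the single instance still receives all of the files, so sharding only changes behavior once more instances are added.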


# Set Up Hyper-Parameters for Classification Layer

In [429]:
print(max_seq_length)

64


In [430]:
epochs = 2
learning_rate = 0.001
epsilon = 0.000001
train_batch_size = 128
validation_batch_size = 128
test_batch_size = 128
train_steps_per_epoch = 10
validation_steps = 10
test_steps = 10
train_instance_count = 1
train_instance_type = "ml.m5.xlarge"
train_volume_size = 1024
use_xla = True
use_amp = True
freeze_bert_layer = False
enable_sagemaker_debugger = True
enable_checkpointing = False
enable_tensorboard = True
input_mode = "Pipe"
run_validation = True
run_test = True
run_sample_predictions = True
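
As a quick worked example, the batch size and step settings above bound how many training examples the job sees. This assumes the training script consumes exactly `train_steps_per_epoch` batches per epoch, which is an assumption since the script is not shown in full here.

```python
epochs = 2
train_batch_size = 128
train_steps_per_epoch = 10

# Each step consumes one batch, so an epoch covers batch_size * steps examples.
examples_per_epoch = train_batch_size * train_steps_per_epoch
total_training_examples = examples_per_epoch * epochs

print("Examples per epoch: {}".format(examples_per_epoch))  # 1280
print("Examples over all epochs: {}".format(total_training_examples))  # 2560
```

These small values keep the workshop run short; a longer run would raise the step counts or the number of epochs.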

# Set Up Metrics to Track Model Performance

These sample log lines...
```
45/50 [=====>..] - ETA: 3s - loss: 0.425 - accuracy: 0.881
50/50 [=======>] - ETA: 0s - val_loss: 0.407 - val_accuracy: 0.885
```
...will produce the following 4 metrics in CloudWatch:

`loss` = 0.425

`accuracy` = 0.881

`val_loss` = 0.407

`val_accuracy` = 0.885

<img src="img/cloudwatch_train_metrics.png" align="left">

In [431]:
metrics_definitions = [
    {"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"},
    {"Name": "train:accuracy", "Regex": "accuracy: ([0-9\\.]+)"},
    {"Name": "validation:loss", "Regex": "val_loss: ([0-9\\.]+)"},
    {"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
]
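
To see what these patterns capture, the sketch below applies them to the two sample log lines with Python's `re` module. It only demonstrates the regular expressions themselves; how SageMaker applies them to the log stream is not modeled here. Note that the unanchored `loss:` and `accuracy:` patterns also match inside `val_loss:` and `val_accuracy:`. The word-boundary variants in `strict_patterns` are a suggestion of ours, and whether SageMaker's regex engine accepts `\b` is an assumption to verify.

```python
import re

sample_log = (
    "45/50 [=====>..] - ETA: 3s - loss: 0.425 - accuracy: 0.881\n"
    "50/50 [=======>] - ETA: 0s - val_loss: 0.407 - val_accuracy: 0.885\n"
)

# Same regular expressions as in the metric definitions above.
patterns = {
    "train:loss": r"loss: ([0-9\.]+)",
    "train:accuracy": r"accuracy: ([0-9\.]+)",
    "validation:loss": r"val_loss: ([0-9\.]+)",
    "validation:accuracy": r"val_accuracy: ([0-9\.]+)",
}

# Hypothetical stricter variants: "_" is a word character, so \b does not
# match between "val_" and "loss".
strict_patterns = dict(patterns)
strict_patterns["train:loss"] = r"\bloss: ([0-9\.]+)"
strict_patterns["train:accuracy"] = r"\baccuracy: ([0-9\.]+)"


def extract_metrics(log_text, metric_patterns):
    """Return every captured value per metric name as a list of floats."""
    return {
        name: [float(value) for value in re.findall(regex, log_text)]
        for name, regex in metric_patterns.items()
    }


extracted = extract_metrics(sample_log, patterns)
strict = extract_metrics(sample_log, strict_patterns)

print(extracted["train:loss"])  # [0.425, 0.407] (the second value comes from val_loss)
print(strict["train:loss"])     # [0.425]
```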

# Set Up SageMaker Debugger
Define Debugger rules as described here: https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html

In [432]:
from sagemaker.debugger import Rule
from sagemaker.debugger import rule_configs
from sagemaker.debugger import ProfilerRule
from sagemaker.debugger import CollectionConfig
from sagemaker.debugger import DebuggerHookConfig

actions = rule_configs.ActionList(
    #    rule_configs.StopTraining(),
    #    rule_configs.Email("")
)

rules = [
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),    
#     ProfilerRule.sagemaker(rule_configs.BatchSize()),
#     ProfilerRule.sagemaker(rule_configs.CPUBottleneck()),
#     ProfilerRule.sagemaker(rule_configs.GPUMemoryIncrease()),
#     ProfilerRule.sagemaker(rule_configs.IOBottleneck()),
#     ProfilerRule.sagemaker(rule_configs.LoadBalancing()),
#     ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
#     ProfilerRule.sagemaker(rule_configs.OverallSystemUsage()),
#     ProfilerRule.sagemaker(rule_configs.StepOutlier()),
#     Rule.sagemaker(
#         base_config=rule_configs.loss_not_decreasing(),
#         rule_parameters={
#             "collection_names": "losses,metrics",
#             "use_losses_collection": "true",
#             "num_steps": "10",
#             "diff_percent": "50",
#         },
#         collections_to_save=[
#             CollectionConfig(
#                 name="losses",
#                 parameters={
#                     "save_interval": "10",
#                 },
#             ),
#             CollectionConfig(
#                 name="metrics",
#                 parameters={
#                     "save_interval": "10",
#                 },
#             ),
#         ],
#         actions=actions,
#     ),
#     Rule.sagemaker(
#         base_config=rule_configs.overtraining(),
#         rule_parameters={
#             "collection_names": "losses,metrics",
#             "patience_train": "10",
#             "patience_validation": "10",
#             "delta": "0.5",
#         },
#         collections_to_save=[
#             CollectionConfig(
#                 name="losses",
#                 parameters={
#                     "save_interval": "10",
#                 },
#             ),
#             CollectionConfig(
#                 name="metrics",
#                 parameters={
#                     "save_interval": "10",
#                 },
#             ),
#         ],
#         actions=actions,
#     )    
]

hook_config = DebuggerHookConfig(
    hook_parameters={
        "save_interval": "10",  # number of steps
        "export_tensorboard": "true",
        "tensorboard_dir": "hook_tensorboard/",
    }
)

## Specify a Debugger profiler configuration

The following configuration captures system metrics every 500 milliseconds. The system metrics include per-CPU and per-GPU utilization, per-CPU and per-GPU memory utilization, as well as I/O and network.

Debugger will capture detailed profiling information from step 5 to step 15 (`start_step=5`, `num_steps=10`). This information includes Horovod metrics, data loading, preprocessing, and operators running on CPU and GPU.

In [433]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(local_path="/opt/ml/output/profiler/", start_step=5, num_steps=10),
)

# Specify Checkpoint S3 Location
This location is used for Spot Instance training. If an instance is replaced, the new instance resumes training from the latest checkpoint.

In [434]:
import uuid

checkpoint_s3_prefix = "checkpoints/{}".format(str(uuid.uuid4()))
checkpoint_s3_uri = "s3://{}/{}/".format(bucket, checkpoint_s3_prefix)

print(checkpoint_s3_uri)

s3://sagemaker-us-east-1-421477113665/checkpoints/0774f85f-fcf1-4e1d-89a0-17fcfea67024/
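
SageMaker syncs `checkpoint_s3_uri` with a local directory inside the training container (`/opt/ml/checkpoints` by default), so a training script can resume by looking for the newest file there. The sketch below shows one way to do that. The `checkpoint-<epoch>.h5` naming scheme and the `find_latest_checkpoint` helper are hypothetical; the workshop's training script may organize its checkpoints differently.

```python
import os
import re
import tempfile

# Hypothetical naming scheme: checkpoint-<epoch>.h5
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.h5$")


def find_latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    candidates = []
    for filename in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(filename)
        if match:
            candidates.append((int(match.group(1)), os.path.join(checkpoint_dir, filename)))
    return max(candidates) if candidates else None


# Demonstrate with a temporary directory standing in for /opt/ml/checkpoints.
with tempfile.TemporaryDirectory() as demo_dir:
    for epoch in (1, 2, 10):
        open(os.path.join(demo_dir, "checkpoint-{}.h5".format(epoch)), "w").close()
    latest = find_latest_checkpoint(demo_dir)
    print("Resume from epoch:", latest[0])
```

Parsing the epoch as an integer matters here: a plain string sort would rank `checkpoint-2.h5` after `checkpoint-10.h5`.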


# Set Up Our BERT + TensorFlow Script to Run on SageMaker
Prepare our TensorFlow model to run on the managed SageMaker service.

In [435]:
!pygmentize src/tf_bert_reviews.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import csv

In [436]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="tf_bert_reviews.py",
    source_dir="src",
    role=role,
    instance_count=train_instance_count,
    instance_type=train_instance_type,
    volume_size=train_volume_size,
    checkpoint_s3_uri=checkpoint_s3_uri,
    py_version="py37",
    framework_version="2.3.1",
    hyperparameters={
        "epochs": epochs,
        "learning_rate": learning_rate,
        "epsilon": epsilon,
        "train_batch_size": train_batch_size,
        "validation_batch_size": validation_batch_size,
        "test_batch_size": test_batch_size,
        "train_steps_per_epoch": train_steps_per_epoch,
        "validation_steps": validation_steps,
        "test_steps": test_steps,
        "use_xla": use_xla,
        "use_amp": use_amp,
        "max_seq_length": max_seq_length,
        "freeze_bert_layer": freeze_bert_layer,
        "enable_sagemaker_debugger": enable_sagemaker_debugger,
        "enable_checkpointing": enable_checkpointing,
        "enable_tensorboard": enable_tensorboard,
        "run_validation": run_validation,
        "run_test": run_test,
        "run_sample_predictions": run_sample_predictions,
    },
    input_mode=input_mode,
    metric_definitions=metrics_definitions,
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config,
)
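
In script mode, SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments, so booleans such as `use_xla` arrive as the strings `"True"` or `"False"`. The sketch below shows one way a script could parse them with `argparse`. It is not the actual parsing code from `src/tf_bert_reviews.py`; the `str_to_bool` and `parse_hyperparameters` helpers and the default values are hypothetical.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to a bool (type=bool would treat 'False' as True)."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("Expected a boolean, got {!r}".format(value))


def parse_hyperparameters(argv):
    """Parse a subset of the hyperparameters; unknown arguments are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=0.00001)
    parser.add_argument("--max_seq_length", type=int, default=64)
    parser.add_argument("--use_xla", type=str_to_bool, default=False)
    parser.add_argument("--freeze_bert_layer", type=str_to_bool, default=False)
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(
    ["--epochs", "2", "--learning_rate", "0.001", "--use_xla", "True", "--freeze_bert_layer", "False"]
)
print(args.epochs, args.learning_rate, args.use_xla, args.freeze_bert_layer)
```

The explicit converter avoids a common pitfall: `bool("False")` is `True` in Python, because any non-empty string is truthy.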

# Create the `Experiment Config`

In [437]:
experiment_config = {"ExperimentName": experiment_name, "TrialName": trial_name, "TrialComponentDisplayName": "train"}

# Train the Model on SageMaker

In [438]:
estimator.fit(
    inputs={"train": s3_input_train_data, "validation": s3_input_validation_data, "test": s3_input_test_data},
    #experiment_config=experiment_config,
    wait=False,
)
# If you get an error about 'No S3 objects found', re-run 06_prepare/01_Prepare_Dataset_BERT_Scikit_AdHoc_FeatureStore

In [439]:
training_job_name = estimator.latest_training_job.name
print("Training Job Name:  {}".format(training_job_name))

Training Job Name:  tensorflow-training-2025-02-16-04-30-21-406


In [440]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(
            region, training_job_name
        )
    )
)


In [441]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(
            region, training_job_name
        )
    )
)


In [442]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(
            bucket, training_job_name, region
        )
    )
)


In [443]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Checkpoint Data</a> After The Training Job Has Completed</b>'.format(
            bucket, checkpoint_s3_prefix, region
        )
    )
)


In [444]:
%%time

estimator.latest_training_job.wait(logs=False)


2025-02-16 04:30:27 Starting - Starting the training job..
2025-02-16 04:30:42 Starting - Preparing the instances for training....
2025-02-16 04:31:04 Downloading - Downloading input data...
2025-02-16 04:31:24 Downloading - Downloading the training image.....
2025-02-16 04:31:55 Training - Training image download completed. Training in progress.........................................................................................................................................................................................................................................
2025-02-16 04:51:29 Uploading - Uploading generated training model...
2025-02-16 04:51:52 Completed - Training job completed
CPU times: user 949 ms, sys: 57.6 ms, total: 1.01 s
Wall time: 21min 31s


# Wait Until the ^^ Training Job ^^ Completes Above!

# Display Training Job Metrics

In [445]:
estimator.training_job_analytics.dataframe()

| | timestamp | metric_name | value |
|---|---|---|---|
| 0 | 0.0 | train:loss | 1.614525 |
| 1 | 60.0 | train:loss | 1.606 |
| 2 | 120.0 | train:loss | 1.600867 |
| 3 | 240.0 | train:loss | 1.599533 |
| 4 | 300.0 | train:loss | 1.603967 |
| 5 | 360.0 | train:loss | 1.610767 |
| 6 | 420.0 | train:loss | 1.60945 |
| 7 | 540.0 | train:loss | 1.609 |
| 8 | 600.0 | train:loss | 1.606825 |
| 9 | 660.0 | train:loss | 1.609333 |
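
For context when reading these values, it can help to compare them with the cross-entropy of a classifier that guesses uniformly across the five star ratings, which is ln(5). The sketch below assumes the training script uses a 5-class softmax with categorical cross-entropy, which is an assumption since the script is not shown in full here.

```python
import math

num_classes = 5  # star_rating takes values 1 through 5

# Cross-entropy of a uniform prediction: -ln(1 / num_classes) = ln(num_classes)
uniform_baseline_loss = -math.log(1.0 / num_classes)
print("Uniform-guess cross-entropy: {:.4f}".format(uniform_baseline_loss))  # 1.6094

# train:loss values reported in the table above
reported_losses = [1.614525, 1.606, 1.600867, 1.599533, 1.603967,
                   1.610767, 1.60945, 1.609, 1.606825, 1.609333]
max_gap = max(abs(loss - uniform_baseline_loss) for loss in reported_losses)
print("Largest gap from baseline: {:.4f}".format(max_gap))
```

Under that assumption, a training loss that stays close to this baseline would suggest the model's predictions are still near uniform after this short run.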


# [INFO] _Feel free to continue to the next workshop section while this notebook is running._

In [446]:
%store training_job_name

Stored 'training_job_name' (str)


In [447]:
!aws s3 cp s3://$bucket/$training_job_name/output/model.tar.gz ./model.tar.gz

download: s3://sagemaker-us-east-1-421477113665/tensorflow-training-2025-02-16-04-30-21-406/output/model.tar.gz to ./model.tar.gz


In [448]:
!mkdir -p ./model/
!tar -xvzf ./model.tar.gz -C ./model/

tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
code/
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
code/inference.py
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/train/
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/train/plugins/
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/train/plugins/profile/
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/train/plugins/profile/2025_02_16_04_35_16/
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/train/plugins/profile/2025_02_16_04_35_16/ip-10-2-235-215.ec2.internal.memory_profile.json.gz
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tensorboard/train/plugins/profile/2025_02_16_04_35_16/ip-10-2-235-215.ec2

In [449]:
!saved_model_cli show --all --dir ./model/tensorflow/saved_model/0/

2025-02-16 04:52:09.000533: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-16 04:52:09.004008: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-16 04:52:09.014834: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-16 04:52:09.032807: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-16 04:52:09.038164: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-16 04:52:09.052747: I tensorflow/core/platform/cpu_feature_gu

In [450]:
# !saved_model_cli run --dir ./model/tensorflow/saved_model/0/ --tag_set serve --signature_def serving_default \
#     --input_exprs 'input_ids=np.zeros((1,64));input_mask=np.zeros((1,64))'

# View Confusion Matrix
![Confusion Matrix](./model/metrics/confusion_matrix.png)

# Analyze Debugger Rules

In [451]:
estimator.latest_training_job.rule_job_summary()

[{'RuleConfigurationName': 'ProfilerReport',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:421477113665:processing-job/tensorflow-training-2025-0-ProfilerReport-ca653156',
  'RuleEvaluationStatus': 'InProgress',
  'LastModifiedTime': datetime.datetime(2025, 2, 16, 4, 52, 37, 438000, tzinfo=tzlocal())}]
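
The summary above is a list of dictionaries, one per rule, so it can be filtered with plain Python. The sketch below separates rules that are still evaluating from those that reported issues. The helper names are hypothetical, and the sample data only mirrors the fields shown in the output above; in the notebook you would pass the result of `estimator.latest_training_job.rule_job_summary()`.

```python
# Statuses after which a rule evaluation will not change any further.
TERMINAL_STATUSES = {"NoIssuesFound", "IssuesFound", "Error", "Stopped"}


def pending_rules(rule_summaries):
    """Return the names of rules whose evaluation has not finished yet."""
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary["RuleEvaluationStatus"] not in TERMINAL_STATUSES
    ]


def rules_with_issues(rule_summaries):
    """Return the names of rules that finished and reported issues."""
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary["RuleEvaluationStatus"] == "IssuesFound"
    ]


# Sample data mirroring the fields in the output above.
example_summary = [
    {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "InProgress"},
]

print("Still running:", pending_rules(example_summary))
print("Issues found:", rules_with_issues(example_summary))
```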

In [452]:
%store

Stored variables and their in-db values:
auto_ml_job_name                                      -> 'automl-dm-15-21-49-33'
autopilot_endpoint_arn                                -> 'arn:aws:sagemaker:us-east-1:421477113665:endpoint
autopilot_endpoint_name                               -> 'automl-dm-ep-15-23-48-30'
autopilot_model_arn                                   -> 'arn:aws:sagemaker:us-east-1:421477113665:processi
autopilot_model_name                                  -> 'automl-dm-model-15-22-32-37'
autopilot_train_s3_uri                                -> 's3://sagemaker-us-east-1-421477113665/data/amazon
balanced_bias_data_jsonlines_s3_uri                   -> 's3://sagemaker-us-east-1-421477113665/bias-detect
balanced_bias_data_s3_uri                             -> 's3://sagemaker-us-east-1-421477113665/bias-detect
bias_data_s3_uri                                      -> 's3://sagemaker-us-east-1-421477113665/bias-detect
ingest_create_athena_db_passed                        -> Tr

# Show the Experiment Tracking Lineage

In [453]:
from sagemaker.analytics import ExperimentAnalytics

import pandas as pd

pd.set_option("display.max_colwidth", 500)

experiment_analytics = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    metric_names=["validation:accuracy"],
    sort_by="CreationTime",
    sort_order="Descending",
)

experiment_analytics_df = experiment_analytics.dataframe()
experiment_analytics_df

In [454]:
from sagemaker.lineage.visualizer import LineageTableVisualizer

lineage_table_viz = LineageTableVisualizer(sess)
lineage_table_viz_df = lineage_table_viz.show(training_job_name=training_job_name)
lineage_table_viz_df

| | Name/Source | Direction | Type | Association Type | Lineage Type |
|---|---|---|---|---|---|
| 0 | s3://...r-us-east-1-421477113665/06_prepare/test | Input | DataSet | ContributedTo | artifact |
| 1 | s3://...ast-1-421477113665/06_prepare/validation | Input | DataSet | ContributedTo | artifact |
| 2 | s3://...-east-1-421477113665/06_prepare/training | Input | DataSet | ContributedTo | artifact |
| 3 | 76310...s.com/tensorflow-training:2.3.1-cpu-py37 | Input | Image | ContributedTo | artifact |
| 4 | s3://...5-02-16-04-30-21-406/output/model.tar.gz | Output | Model | Produced | artifact |
| 5 | s3://...ts/0774f85f-fcf1-4e1d-89a0-17fcfea67024/ | Output | Checkpoint | Produced | artifact |


# Release Resources

In [455]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [456]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}
