# Feature Transformation with an Amazon SageMaker Processing Job and Scikit-Learn

In this notebook, we convert raw text into BERT embeddings.  This will allow us to perform natural language processing tasks such as text classification.

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, data processing libraries such as Scikit-Learn are used to pre-process data sets to prepare them for training. Scikit-Learn itself is not a distributed framework, so in this notebook we'll use Amazon SageMaker Processing to run our Scikit-Learn workload in a managed SageMaker environment and spread it across multiple instances.

# NOTE:  THIS NOTEBOOK WILL TAKE 5-10 MINUTES TO COMPLETE.

# PLEASE BE PATIENT.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


## Contents

1. Setup Environment
1. Setup Input Data
1. Setup Output Data
1. Build a Scikit-Learn container for running the processing job
1. Run the Processing Job using Amazon SageMaker
1. Inspect the Processed Output Data

# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [3]:
import sagemaker
import boto3

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)
s3 = boto3.Session().client(service_name='s3', region_name=region)

# Setup Input Data

In [4]:
%store -r s3_public_path_tsv

In [5]:
try:
    s3_public_path_tsv
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the INGEST section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [6]:
print(s3_public_path_tsv)

s3://amazon-reviews-pds/tsv


In [7]:
%store -r s3_private_path_tsv

In [8]:
try:
    s3_private_path_tsv
except NameError:
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the INGEST section before you continue.')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

In [9]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv


# Let's Copy 1 More Large Data File to Use For Training

In [10]:
#!aws s3 cp --recursive $s3_public_path_tsv/ $s3_private_path_tsv/ --exclude "*" --include "amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz"

In [11]:
raw_input_data_s3_uri = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(raw_input_data_s3_uri)

s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/


In [12]:
!aws s3 ls $raw_input_data_s3_uri

2020-12-18 17:44:16   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-12-18 17:44:18   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz


# Run the Processing Job using Amazon SageMaker

Next, use the Amazon SageMaker Python SDK to submit a processing job that runs our custom Python script.

# Review the Processing Script

In [13]:
!pygmentize preprocess-scikit-text-to-bert-feature-store.py

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import functools
import multiprocessing

from datetime import datetime
from time import gmtime, strftime, sleep

import sys
import re
import collections
import argparse
import json
import os
import csv
import glob
from p

Run this script as a processing job.  Specify one `ProcessingInput` whose `source` is the Amazon S3 location of the raw data and whose `destination` is the local path where the script reads that data, under `/opt/ml/processing/input` inside the Docker container.  All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method one `ProcessingOutput` per output directory, where the `source` is the local path the script writes output data to.  For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.  Give each `ProcessingOutput` an `output_name` to make it easier to retrieve the output artifacts after the job has run.
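As a minimal sketch of the default destination format described above, the helper below assembles the S3 URI from its parts. The function name `default_output_uri` is hypothetical and not part of the SageMaker Python SDK, which builds this URI internally. The values in the usage lines are the ones that appear in this notebook's own job output.

```python
def default_output_uri(region, account_id, processing_job_name, output_name):
    """Assemble the default S3 destination for a processing output.

    Hypothetical helper for illustration only; the SageMaker Python SDK
    derives this URI internally.
    """
    bucket = 'sagemaker-{}-{}'.format(region, account_id)
    return 's3://{}/{}/output/{}'.format(bucket, processing_job_name, output_name)


uri = default_output_uri(region='us-east-1',
                         account_id='231218423789',
                         processing_job_name='sagemaker-scikit-learn-2021-01-02-05-21-11-395',
                         output_name='bert-train')
print(uri)
```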

The `arguments` parameter of the `run()` method passes command-line arguments to our `preprocess-scikit-text-to-bert-feature-store.py` script.
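To illustrate how a script might receive these command-line arguments, here is a sketch using `argparse`. The flag names mirror the ones this notebook passes to `run()`, but the parser, its defaults, and the `str_to_bool` helper are assumptions for illustration; the script's actual parsing code is not shown in this notebook.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings (as produced by str(bool)) to bool."""
    return str(value).strip().lower() in ('true', '1', 'yes')


def parse_args(argv):
    # Assumed parser: flag names follow the `arguments` list given to `run()`,
    # everything else here is illustrative.
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-split-percentage', type=float, default=0.90)
    parser.add_argument('--validation-split-percentage', type=float, default=0.05)
    parser.add_argument('--test-split-percentage', type=float, default=0.05)
    parser.add_argument('--max-seq-length', type=int, default=64)
    parser.add_argument('--balance-dataset', type=str_to_bool, default=True)
    parser.add_argument('--feature-store-offline-prefix', type=str, default=None)
    parser.add_argument('--reviews-feature-group-name', type=str, default=None)
    return parser.parse_args(argv)


# Every value arrives as a string, which is why the notebook wraps them in str().
args = parse_args(['--train-split-percentage', '0.9',
                   '--max-seq-length', '64',
                   '--balance-dataset', 'False'])
print(args.train_split_percentage, args.max_seq_length, args.balance_dataset)
```

Note that a boolean such as `balance_dataset` reaches the script as the text `'True'` or `'False'`, so it needs an explicit conversion rather than `type=bool`, which would treat any non-empty string as true.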

Note that we are sharding the data using `ShardedByS3Key` so that each worker node in the cluster receives a different subset of the input files and the transformations are spread across all nodes.
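The effect of sharding by S3 key can be sketched in plain Python. The round-robin assignment below is an assumption made for illustration; how SageMaker actually assigns keys to instances is an internal detail and may differ. The file names are the two inputs listed earlier in this notebook.

```python
def shard_keys(keys, instance_count):
    """Assign each S3 key to exactly one instance, round-robin over sorted keys.

    Illustration only: SageMaker's real assignment strategy may differ.
    """
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


keys = ['amazon_reviews_us_Digital_Software_v1_00.tsv.gz',
        'amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz']
shards = shard_keys(keys, instance_count=2)
for instance, shard in enumerate(shards, start=1):
    print('instance', instance, shard)
```

With two files and two instances, each instance processes one file. With `FullyReplicated`, by contrast, every instance would receive both files.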

# Track the `Experiment`
We will track every step of this experiment throughout the `prepare`, `train`, `optimize`, and `deploy` stages.

# Concepts

**Experiment**: A collection of related Trials.  Add Trials to an Experiment that you wish to compare together.

**Trial**: A description of a multi-step machine learning workflow. Each step in the workflow is described by a Trial Component. There is no relationship between Trial Components such as ordering.

**Trial Component**: A description of a single step in a machine learning workflow, for example data cleaning, feature extraction, model training, or model evaluation.

**Tracker**: A logger of information about a single TrialComponent.
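The hierarchy described above can be sketched with plain dataclasses. The class names ending in `Sketch` are hypothetical and are not the `smexperiments` classes used in the following cells; they only show how an Experiment contains Trials and a Trial contains Trial Components.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrialComponentSketch:
    """One step of a workflow, such as 'prepare' or 'train'."""
    display_name: str


@dataclass
class TrialSketch:
    """A multi-step workflow made of Trial Components."""
    trial_name: str
    components: List[TrialComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    """A collection of related Trials to compare."""
    experiment_name: str
    trials: List[TrialSketch] = field(default_factory=list)


example_trial = TrialSketch(
    trial_name='example-trial',
    components=[TrialComponentSketch(step)
                for step in ('prepare', 'train', 'optimize', 'deploy')])
example_experiment = ExperimentSketch('example-experiment', trials=[example_trial])

for trial_sketch in example_experiment.trials:
    print(trial_sketch.trial_name, [c.display_name for c in trial_sketch.components])
```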

<img src="img/sagemaker-experiments.png" width="90%" align="left">


# Create the `Experiment`

In [14]:
import time
from smexperiments.experiment import Experiment

timestamp = int(time.time())

experiment = Experiment.create(
                experiment_name='Amazon-Customer-Reviews-BERT-Experiment-{}'.format(timestamp),
                description='Amazon Customer Reviews BERT Experiment', 
                sagemaker_boto_client=sm)

experiment_name = experiment.experiment_name
print('Experiment name: {}'.format(experiment_name))

Experiment name: Amazon-Customer-Reviews-BERT-Experiment-1609564468


# Create the `Trial`

In [16]:
import time
from smexperiments.trial import Trial

timestamp = int(time.time())

trial = Trial.create(trial_name='trial-{}'.format(timestamp),
                     experiment_name=experiment_name,
                     sagemaker_boto_client=sm)

trial_name = trial.trial_name
print('Trial name: {}'.format(trial_name))

Trial name: trial-1609564486


# Create the `Experiment Config`

In [17]:
experiment_config = {
    'ExperimentName': experiment_name,
    'TrialName': trial.trial_name,
    'TrialComponentDisplayName': 'prepare'
}

In [18]:
print(experiment_name)

Amazon-Customer-Reviews-BERT-Experiment-1609564468


In [19]:
%store experiment_name

Stored 'experiment_name' (str)


In [20]:
print(trial_name)

trial-1609564486


In [21]:
%store trial_name

Stored 'trial_name' (str)


# Create Feature Store and Feature Group

In [26]:
featurestore_runtime = boto3.Session().client(service_name='sagemaker-featurestore-runtime', region_name=region)

In [28]:
timestamp = int(time.time())

feature_store_offline_prefix = 'reviews-feature-store-' + str(timestamp)
print(feature_store_offline_prefix)

reviews-feature-store-1609564580


In [29]:
%store feature_store_offline_prefix

Stored 'feature_store_offline_prefix' (str)


In [30]:
from time import gmtime, strftime, sleep

timestamp = int(time.time())

# reviews_feature_group_name = 'reviews-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
reviews_feature_group_name = 'reviews-feature-group-' + str(timestamp)

print(reviews_feature_group_name)

reviews-feature-group-1609564585


In [31]:
%store reviews_feature_group_name

Stored 'reviews_feature_group_name' (str)


In [32]:
from sagemaker.feature_store.feature_definition import (
    FeatureDefinition,
    FeatureTypeEnum,
)

feature_definitions= [
    FeatureDefinition(feature_name='input_ids', feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name='input_mask', feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name='segment_ids', feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name='label_id', feature_type=FeatureTypeEnum.INTEGRAL),
    FeatureDefinition(feature_name='review_id', feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name='date', feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name='label', feature_type=FeatureTypeEnum.INTEGRAL),
#    FeatureDefinition(feature_name='review_body', feature_type=FeatureTypeEnum.STRING)
]

In [33]:
from sagemaker.feature_store.feature_group import FeatureGroup

reviews_feature_group = FeatureGroup(
    name=reviews_feature_group_name, 
    feature_definitions=feature_definitions,
    sagemaker_session=sess)

print(reviews_feature_group)

FeatureGroup(name='reviews-feature-group-1609564585', sagemaker_session=<sagemaker.session.Session object at 0x7f80bdfc9f98>, feature_definitions=[FeatureDefinition(feature_name='input_ids', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='input_mask', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='segment_ids', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='label_id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>), FeatureDefinition(feature_name='review_id', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='date', feature_type=<FeatureTypeEnum.STRING: 'String'>), FeatureDefinition(feature_name='label', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>)])


# Set the Processing Job Hyper-Parameters 

In [34]:
processing_instance_type='ml.c5.2xlarge'
processing_instance_count=2
train_split_percentage=0.90
validation_split_percentage=0.05
test_split_percentage=0.05
balance_dataset=True
max_seq_length=64
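The three split percentages should add up to `1.0`. The sketch below checks that and shows one way such a split could be cut, using only the standard library. The processing script imports scikit-learn's `train_test_split`, so its actual splitting logic may differ; `three_way_split` and the `example_*` names are hypothetical.

```python
import math
import random

example_train_pct, example_validation_pct, example_test_pct = 0.90, 0.05, 0.05

# Guard against percentages that do not cover the whole dataset.
assert math.isclose(example_train_pct + example_validation_pct + example_test_pct, 1.0)


def three_way_split(rows, train_pct, validation_pct, seed=0):
    """Shuffle rows, then cut them into train/validation/test partitions.

    The test partition receives whatever remains after train and validation.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_pct)
    n_validation = round(len(rows) * validation_pct)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_validation]
    test = rows[n_train + n_validation:]
    return train, validation, test


example_train, example_validation, example_test = three_way_split(
    range(1000), example_train_pct, example_validation_pct)
print(len(example_train), len(example_validation), len(example_test))
```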

# Choosing a `max_seq_length` for BERT
Since a smaller `max_seq_length` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `70%` of our reviews.

Remember our distribution of review lengths from a previous section?

```
mean         67.930174
std         130.954079
min           1.000000
10%           4.000000
20%          14.000000
30%          21.000000
40%          25.000000
50%          31.000000
60%          42.000000
70%          59.000000
80%          87.000000
90%         149.000000
100%       5347.000000
max        5347.000000
```

![](img/review_word_count_distribution.png)

Review length `59` represents the `70th` percentile for this dataset.  However, it's best to stick with powers-of-2 when using BERT.  So let's choose `64` as this is the smallest power-of-2 greater than `59`.  Reviews with length > `64` will be truncated to `64`.
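The choice of `64` and the effect of truncation can be sketched as follows. The helper names are hypothetical, and padding short sequences with `0` is an assumption for illustration; the processing script's own padding and truncation code is not shown here.

```python
def next_power_of_two(n):
    """Return the smallest power of 2 that is >= n (for n >= 1)."""
    power = 1
    while power < n:
        power *= 2
    return power


def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it with pad_id (assumed 0)."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


percentile_70_length = 59  # 70th percentile from the distribution above
example_max_seq_length = next_power_of_two(percentile_70_length)
print(example_max_seq_length)

short_example = pad_or_truncate([101, 2023, 102], example_max_seq_length)
long_example = pad_or_truncate(list(range(200)), example_max_seq_length)
print(len(short_example), len(long_example))
```

Both sequences end up with exactly `example_max_seq_length` entries: the short one is padded and the long one loses everything past position 64.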

In [35]:
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(framework_version='0.23-1',
                             role=role,
                             instance_type=processing_instance_type,
                             instance_count=processing_instance_count,
                             env={'AWS_DEFAULT_REGION': region},
                             max_runtime_in_seconds=7200)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [36]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor.run(code='preprocess-scikit-text-to-bert-feature-store.py',
              inputs=[
                    ProcessingInput(source=raw_input_data_s3_uri,
                                    destination='/opt/ml/processing/input/data/',
                                    s3_data_distribution_type='ShardedByS3Key')
              ],
              outputs=[
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-train',
                                     source='/opt/ml/processing/output/bert/train'),
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-validation',
                                     source='/opt/ml/processing/output/bert/validation'),
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-test',
                                     source='/opt/ml/processing/output/bert/test'),
              ],
              arguments=['--train-split-percentage', str(train_split_percentage),
                         '--validation-split-percentage', str(validation_split_percentage),
                         '--test-split-percentage', str(test_split_percentage),
                         '--max-seq-length', str(max_seq_length),
                         '--balance-dataset', str(balance_dataset),
                         '--feature-store-offline-prefix', str(feature_store_offline_prefix),
                         '--reviews-feature-group-name', str(reviews_feature_group_name)
              ],
              experiment_config=experiment_config,
              logs=True,
              wait=False)

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2021-01-02-05-21-11-395



Job Name:  sagemaker-scikit-learn-2021-01-02-05-21-11-395
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/input/code/preprocess-scikit-text-to-bert-feature-store.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'bert-train', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3U

In [37]:
scikit_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']
print(scikit_processing_job_name)

sagemaker-scikit-learn-2021-01-02-05-21-11-395


In [38]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(region, scikit_processing_job_name)))


In [39]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, scikit_processing_job_name)))


In [40]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, scikit_processing_job_name, region)))


# Monitor the Processing Job

In [41]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=scikit_processing_job_name,
                                                                            sagemaker_session=sess)

processing_job_description = running_processor.describe()

print(processing_job_description)

{'ProcessingInputs': [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/input/code/preprocess-scikit-text-to-bert-feature-store.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}, 'AppManaged': 

In [42]:
running_processor.wait(logs=False)

........................................................................................................!

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._
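If you prefer to poll the job status yourself instead of calling `wait()`, a loop along these lines is one option. This is a sketch: `wait_for_job` is a hypothetical helper, and the `describe` callable is injected so the example runs without AWS access. In this notebook it could be `lambda name: sm.describe_processing_job(ProcessingJobName=name)`, whose response includes a `ProcessingJobStatus` field.

```python
import time


def wait_for_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll `describe` until the job leaves its in-progress states.

    `describe` is any callable that returns a dict containing a
    'ProcessingJobStatus' key for the given job name.
    """
    while True:
        status = describe(job_name)['ProcessingJobStatus']
        if status not in ('InProgress', 'Stopping'):
            return status
        sleep(poll_seconds)


# Fake `describe` that reports two in-progress polls, then completion.
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_job(lambda name: {'ProcessingJobStatus': next(fake_statuses)},
                            'example-job', poll_seconds=0)
print(final_status)
```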

# Inspect the Processed Output Data

Take a look at a few rows of the transformed dataset to make sure the processing was successful.

In [43]:
processing_job_description = running_processor.describe()

output_config = processing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'bert-train':
        processed_train_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-validation':
        processed_validation_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-test':
        processed_test_data_s3_uri = output['S3Output']['S3Uri']
        
print(processed_train_data_s3_uri)
print(processed_validation_data_s3_uri)
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-train
s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-validation
s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-test
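The lookup above can also be expressed as a dictionary keyed by output name, which avoids one `if` per output. This is a sketch: `output_uris_by_name` is a hypothetical helper, and `example_description` is an abbreviated stand-in for the `describe()` response with placeholder bucket and job names.

```python
def output_uris_by_name(job_description):
    """Map each OutputName to its S3Uri from a processing job description."""
    outputs = job_description['ProcessingOutputConfig']['Outputs']
    return {output['OutputName']: output['S3Output']['S3Uri'] for output in outputs}


# Abbreviated stand-in for `running_processor.describe()`.
example_description = {
    'ProcessingOutputConfig': {'Outputs': [
        {'OutputName': 'bert-train',
         'S3Output': {'S3Uri': 's3://example-bucket/example-job/output/bert-train'}},
        {'OutputName': 'bert-validation',
         'S3Output': {'S3Uri': 's3://example-bucket/example-job/output/bert-validation'}},
        {'OutputName': 'bert-test',
         'S3Output': {'S3Uri': 's3://example-bucket/example-job/output/bert-test'}},
    ]}
}

example_uris = output_uris_by_name(example_description)
print(example_uris['bert-train'])
```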


In [44]:
!aws s3 ls $processed_train_data_s3_uri/

2021-01-02 05:29:21   10478088 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2021-01-02 05:29:51   11712201 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [45]:
!aws s3 ls $processed_validation_data_s3_uri/

2021-01-02 05:29:22     581775 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2021-01-02 05:29:52     649157 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [46]:
!aws s3 ls $processed_test_data_s3_uri/

2021-01-02 05:29:22     583108 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2021-01-02 05:29:52     653038 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [47]:
processing_job_description

{'ProcessingInputs': [{'InputName': 'input-1',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/',
    'LocalPath': '/opt/ml/processing/input/data/',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'ShardedByS3Key',
    'S3CompressionType': 'None'}},
  {'InputName': 'code',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/input/code/preprocess-scikit-text-to-bert-feature-store.py',
    'LocalPath': '/opt/ml/processing/input/code',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}}],
 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train',
    'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-train',
     'LocalPath': '/opt/ml/processing

# Pass Variables to the Next Notebook(s)

In [48]:
%store raw_input_data_s3_uri

Stored 'raw_input_data_s3_uri' (str)


In [49]:
%store max_seq_length

Stored 'max_seq_length' (int)


In [50]:
%store train_split_percentage

Stored 'train_split_percentage' (float)


In [51]:
%store validation_split_percentage

Stored 'validation_split_percentage' (float)


In [52]:
%store test_split_percentage

Stored 'test_split_percentage' (float)


In [53]:
%store balance_dataset

Stored 'balance_dataset' (bool)


In [54]:
%store processed_train_data_s3_uri

Stored 'processed_train_data_s3_uri' (str)


In [55]:
%store processed_validation_data_s3_uri

Stored 'processed_validation_data_s3_uri' (str)


In [56]:
%store processed_test_data_s3_uri

Stored 'processed_test_data_s3_uri' (str)


In [57]:
%store

Stored variables and their in-db values:
balance_dataset                                                 -> True
experiment_name                                                 -> 'Amazon-Customer-Reviews-BERT-Experiment-160956446
feature_store_offline_prefix                                    -> 'reviews-feature-store-1609564580'
ingest_create_athena_db_passed                                  -> True
ingest_create_athena_table_parquet_passed                       -> True
ingest_create_athena_table_tsv_passed                           -> True
max_seq_length                                                  -> 64
prepare_trial_component_name                                    -> 'TrialComponent-2020-12-30-035536-zxfy'
processed_test_data_s3_uri                                      -> 's3://sagemaker-us-east-1-231218423789/sagemaker-s
processed_train_data_s3_uri                                     -> 's3://sagemaker-us-east-1-231218423789/sagemaker-s
processed_validation_data_s3_uri      

# Query The Feature Store

In [58]:
reviews_feature_store_query = reviews_feature_group.athena_query()

In [59]:
reviews_feature_store_table = reviews_feature_store_query.table_name

In [60]:
query_string = """
SELECT input_ids, input_mask, segment_ids, label_id, split_type  FROM "{}" WHERE split_type='train' LIMIT 5
""".format(reviews_feature_store_table)

print('Running ' + query_string)

Running 
SELECT input_ids, input_mask, segment_ids, label_id, split_type  FROM "reviews-feature-group-1609564585-1609565128" WHERE split_type='train' LIMIT 5



In [61]:
reviews_feature_store_query.run(query_string=query_string, output_location='s3://'+bucket+'/'+feature_store_offline_prefix+'/query_results/')

reviews_feature_store_query.wait()

INFO:sagemaker:Query f5dea4b6-bf2a-4779-a7bc-f26e2d2307bc is being executed.
INFO:sagemaker:Query f5dea4b6-bf2a-4779-a7bc-f26e2d2307bc successfully executed.


In [62]:
import pandas as pd
dataset = pd.DataFrame()

dataset = reviews_feature_store_query.as_dataframe()

dataset

Unnamed: 0,input_ids,input_mask,segment_ids,label_id,split_type
0,"[101, 2023, 2515, 2054, 2009, 2003, 4011, 2000...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,train
1,"[101, 6195, 2027, 3030, 2437, 24890, 1998, 709...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,train
2,"[101, 4761, 3343, 2003, 2008, 4268, 2323, 2022...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,train
3,"[101, 1996, 2208, 2993, 2003, 2025, 1037, 2919...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,train
4,"[101, 2005, 1037, 3477, 2000, 2377, 2944, 2515...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",2,train
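Because `input_ids`, `input_mask`, and `segment_ids` are defined as `STRING` features, they come back from the query as text rather than as lists of integers. Assuming the values are stored as bracketed, comma-separated lists, as the preview above suggests, they can be parsed back with `json.loads`. The row below is a shortened, made-up example, not actual query output.

```python
import json


def parse_feature(value):
    """Parse a string such as '[101, 2023]' into a list of ints."""
    return [int(item) for item in json.loads(value)]


# Illustrative row only; real rows hold max_seq_length entries per feature.
example_row = {'input_ids': '[101, 2023, 2515, 102, 0, 0]',
               'input_mask': '[1, 1, 1, 1, 0, 0]',
               'segment_ids': '[0, 0, 0, 0, 0, 0]'}

parsed_row = {name: parse_feature(value) for name, value in example_row.items()}
print(parsed_row['input_ids'])
```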


# Show the Experiment Tracking Lineage

In [63]:
from sagemaker.analytics import ExperimentAnalytics

import pandas as pd
pd.set_option("display.max_colwidth", 500)
#pd.set_option("max_rows", 100)

experiment_analytics = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    sort_by="CreationTime",
    sort_order="Descending"
)

experiment_analytics_df = experiment_analytics.dataframe()
experiment_analytics_df

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,AWS_DEFAULT_REGION,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,SageMaker.ImageUri - MediaType,SageMaker.ImageUri - Value,code - MediaType,...,input-1 - MediaType,input-1 - Value,bert-test - MediaType,bert-test - Value,bert-train - MediaType,bert-train - Value,bert-validation - MediaType,bert-validation - Value,Trials,Experiments
0,sagemaker-scikit-learn-2021-01-02-05-21-11-395-aws-processing-job,prepare,arn:aws:sagemaker:us-east-1:231218423789:processing-job/sagemaker-scikit-learn-2021-01-02-05-21-11-395,us-east-1,2.0,ml.c5.2xlarge,30.0,,683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3,,...,,s3://sagemaker-us-east-1-231218423789/amazon-reviews-pds/tsv/,,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-test,,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-train,,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-validation,[trial-1609564486],[Amazon-Customer-Reviews-BERT-Experiment-1609564468]


In [64]:
trial_component_name=experiment_analytics_df.TrialComponentName[0]
print(trial_component_name)

sagemaker-scikit-learn-2021-01-02-05-21-11-395-aws-processing-job


In [65]:
trial_component_description=sm.describe_trial_component(TrialComponentName=trial_component_name)
trial_component_description

{'TrialComponentName': 'sagemaker-scikit-learn-2021-01-02-05-21-11-395-aws-processing-job',
 'TrialComponentArn': 'arn:aws:sagemaker:us-east-1:231218423789:experiment-trial-component/sagemaker-scikit-learn-2021-01-02-05-21-11-395-aws-processing-job',
 'DisplayName': 'prepare',
 'Source': {'SourceArn': 'arn:aws:sagemaker:us-east-1:231218423789:processing-job/sagemaker-scikit-learn-2021-01-02-05-21-11-395',
  'SourceType': 'SageMakerProcessingJob'},
 'Status': {'PrimaryStatus': 'Completed',
  'Message': 'Status: Completed, exit message: null, failure reason: null'},
 'StartTime': datetime.datetime(2021, 1, 2, 5, 24, 35, tzinfo=tzlocal()),
 'EndTime': datetime.datetime(2021, 1, 2, 5, 29, 56, tzinfo=tzlocal()),
 'CreationTime': datetime.datetime(2021, 1, 2, 5, 21, 12, 372000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedTime': datetime.datetime(2021, 1, 2, 5, 29, 57, 361000, tzinfo=tzlocal()),
 'LastModifiedBy': {},
 'Parameters': {'AWS_DEFAULT_REGION': {'StringValue': 'us-east-1'},


# Show SageMaker ML Lineage Tracking 

Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. 

Amazon SageMaker Lineage enables events that happen within SageMaker to be traced via a graph structure. The data simplifies generating reports, making comparisons, or discovering relationships between events. For example, you can easily trace both how a model was generated and where the model was deployed.

SageMaker creates the lineage graph automatically, and you can also create or modify your own graphs directly.

## Key Concepts

* **Lineage Graph** - A connected graph tracing your machine learning workflow end to end.

* **Artifacts** - Represents a URI addressable object or data. Artifacts are typically inputs or outputs to Actions.

* **Actions** - Represents an action taken such as a computation, transformation, or job.

* **Contexts** - Provides a method to logically group other entities.

* **Associations** - A directed edge in the lineage graph that links two entities.

* **Lineage Traversal** - Starting from an arbitrary point, trace the lineage graph to discover and analyze relationships between steps in your workflow.

* **Experiments** - Experiment entities (Experiments, Trials, and Trial Components) are also part of the lineage graph and can be associated with Artifacts, Actions, or Contexts.
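The concepts above can be sketched as a small directed graph in plain Python. This is not the SageMaker lineage API: the node names are hand-abbreviated versions of this notebook's processing job inputs and outputs, and `trace_downstream` is a hypothetical helper that shows what a lineage traversal does.

```python
from collections import defaultdict

job = 'processing-job'

# Associations as (source, association type, destination) directed edges.
associations = [
    ('preprocess-scikit-text-to-bert-feature-store.py', 'ContributedTo', job),
    ('amazon-reviews-pds/tsv/', 'ContributedTo', job),
    ('sagemaker-scikit-learn:0.23-1-cpu-py3', 'ContributedTo', job),
    (job, 'Produced', 'output/bert-train'),
    (job, 'Produced', 'output/bert-validation'),
    (job, 'Produced', 'output/bert-test'),
]

downstream = defaultdict(list)
for source, _association_type, destination in associations:
    downstream[source].append(destination)


def trace_downstream(start):
    """Return every entity reachable from `start` by following edges forward."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        for neighbor in downstream.get(node, []):
            if neighbor not in seen:
                seen.append(neighbor)
                stack.append(neighbor)
    return seen


print(trace_downstream('amazon-reviews-pds/tsv/'))
```

Starting from the raw input data, the traversal reaches the processing job and then the three outputs it produced.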

## Show Lineage Artifacts For Our Processing Job

In [66]:
from sagemaker.lineage.visualizer import LineageTableVisualizer

lineage_table_viz = LineageTableVisualizer(sess)
lineage_table_viz_df = lineage_table_viz.show(processing_job_name=scikit_processing_job_name)
lineage_table_viz_df

Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,preprocess-scikit-text-to-bert-feature-store.py,Input,DataSet,ContributedTo,artifact
1,s3://...t-1-231218423789/amazon-reviews-pds/tsv/,Input,DataSet,ContributedTo,artifact
2,68331...om/sagemaker-scikit-learn:0.23-1-cpu-py3,Input,Image,ContributedTo,artifact
3,s3://...2021-01-02-05-21-11-395/output/bert-test,Output,DataSet,Produced,artifact
4,s3://...1-02-05-21-11-395/output/bert-validation,Output,DataSet,Produced,artifact
5,s3://...021-01-02-05-21-11-395/output/bert-train,Output,DataSet,Produced,artifact


## List All Lineage Artifacts

In [67]:
from sagemaker.analytics import ArtifactAnalytics

artifact_analytics = ArtifactAnalytics()

artifacts_df = artifact_analytics.dataframe()
artifacts_df

Unnamed: 0,ArtifactName,ArtifactArn,ArtifactType,ArtifactSourceUri,CreationTime,LastModifiedTime
0,,arn:aws:sagemaker:us-east-1:231218423789:artifact/ecd15c0438612224d61f3589e0917461,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-test,2021-01-02 05:21:12.696000+00:00,2021-01-02 05:21:12.696000+00:00
1,,arn:aws:sagemaker:us-east-1:231218423789:artifact/727df95432a9b3c5748f9d1ade3119c9,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-validation,2021-01-02 05:21:12.638000+00:00,2021-01-02 05:21:12.638000+00:00
2,,arn:aws:sagemaker:us-east-1:231218423789:artifact/fb3a97f7277d53d86793db5090d25ba5,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/output/bert-train,2021-01-02 05:21:12.594000+00:00,2021-01-02 05:21:12.594000+00:00
3,,arn:aws:sagemaker:us-east-1:231218423789:artifact/984631fd545f00e867994db12bfd6ffa,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-05-21-11-395/input/code/preprocess-scikit-text-to-bert-feature-store.py,2021-01-02 05:21:12.480000+00:00,2021-01-02 05:21:12.480000+00:00
4,,arn:aws:sagemaker:us-east-1:231218423789:artifact/64e2869a3ffd419f8254134927265bc3,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2021-01-02-03-55-51-199/output/bert-test,2021-01-02 03:55:52.559000+00:00,2021-01-02 03:55:52.559000+00:00
...,...,...,...,...,...,...
170,,arn:aws:sagemaker:us-east-1:231218423789:artifact/432a98996416d8da9e2c5896d48bbac5,Image,763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.1.0-cpu-py3,2020-12-29 22:26:39.342000+00:00,2020-12-29 22:26:39.342000+00:00
171,,arn:aws:sagemaker:us-east-1:231218423789:artifact/e0df32aa65a0811e000f812f7fba2301,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-29-22-19-21-922/output/bert-test,2020-12-29 22:19:45.672000+00:00,2020-12-29 22:19:45.672000+00:00
172,,arn:aws:sagemaker:us-east-1:231218423789:artifact/19aa3a35fa1e72f5c30c45df85ba5d88,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-29-22-19-21-922/output/bert-validation,2020-12-29 22:19:45.629000+00:00,2020-12-29 22:19:45.629000+00:00
173,,arn:aws:sagemaker:us-east-1:231218423789:artifact/bcef4b9317ff561f6b9c4b5d8dc7e62f,DataSet,s3://sagemaker-us-east-1-231218423789/sagemaker-scikit-learn-2020-12-29-22-19-21-922/output/bert-train,2020-12-29 22:19:45.562000+00:00,2020-12-29 22:19:45.562000+00:00


In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();