# Amazon SageMaker Processing Job 


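A machine learning (ML) process consists of several steps. First, data is collected through various ETL jobs. Next, the data is pre-processed and turned into features using traditional techniques or prior knowledge. Finally, an ML model is trained with an algorithm.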

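Distributed data processing frameworks such as Apache Spark are used to pre-process datasets for training. In this notebook, we run a processing workload on Amazon SageMaker Processing using the Apache Spark installation that ships in the processing container.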

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


# Setup Environment


* You need an S3 bucket and prefix to use for model training.
* The IAM role must have access to the dataset for both training and processing.

In [1]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

# Setup Input Data

In [2]:
# Inputs
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

s3://sagemaker-us-east-2-322537213286/amazon-reviews-pds/tsv/


In [3]:
!aws s3 ls $s3_input_data

2020-07-15 07:29:01   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-07-15 07:29:04   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz


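# Spark Docker Image for Running the Processing Job

This hands-on lab (HOL) includes a Spark container image in the `./container` folder. The container handles the bootstrapping of all Spark configuration and provides a wrapper around the `spark-submit` CLI. At a high level, the container provides: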

* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application
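
To make the last item concrete, a wrapper around `spark-submit` can be as small as a function that assembles the command line. The sketch below is illustrative only and is not the code in `./container`; the function name `build_spark_submit_command` and its defaults (`yarn` master, `client` deploy mode) are assumptions.

```python
import shlex


def build_spark_submit_command(app_path, app_args=(), master="yarn", deploy_mode="client"):
    """Assemble a spark-submit argument list.

    Hypothetical helper for illustration; the real wrapper in ./container may differ.
    """
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app_path]
    # Application arguments are passed through unchanged, after the script path.
    command.extend(str(arg) for arg in app_args)
    return command


if __name__ == "__main__":
    cmd = build_spark_submit_command(
        "/opt/ml/processing/input/code/preprocess-spark-text-to-bert.py",
        ["s3_input_data", "s3://my-bucket/input/"],  # placeholder bucket name
    )
    # shlex.quote keeps the printed command safe to paste into a shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string lets the caller hand it to `subprocess.run` without shell quoting concerns.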

After the container has been built and pushed, we use the Amazon SageMaker Python SDK to run a managed, distributed Spark application that processes the dataset.

In [4]:
docker_repo = 'amazon-reviews-spark-processor'
docker_tag = 'latest'

In [5]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Sending build context to Docker daemon  4.385MB
Step 1/37 : FROM openjdk:8-jre-slim
 ---> 73c63778326a
Step 2/37 : RUN apt-get update
 ---> Using cache
 ---> 4833f5be0644
Step 3/37 : RUN apt-get install -y curl unzip python3 python3-setuptools python3-pip python-dev python3-dev python-psutil
 ---> Using cache
 ---> 9cc9ddd52407
Step 4/37 : RUN pip3 install py4j psutil==5.6.5 numpy==1.17.4
 ---> Using cache
 ---> 5dbb507c3abc
Step 5/37 : RUN apt-get clean
 ---> Using cache
 ---> cf58fdfdb6e2
Step 6/37 : RUN rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> a6079407e665
Step 7/37 : ENV PYTHONHASHSEED 0
 ---> Using cache
 ---> 30c293c11416
Step 8/37 : ENV PYTHONIOENCODING UTF-8
 ---> Using cache
 ---> 7d951bf2bb49
Step 9/37 : ENV PIP_DISABLE_PIP_VERSION_CHECK 1
 ---> Using cache
 ---> 5140ba702b02
Step 10/37 : ENV HADOOP_VERSION 3.2.1
 ---> Using cache
 ---> 24f264d1751c
Step 11/37 : ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
 ---> Using cache
 ---> 61fbcdf98294
Step 12/37 : ENV HADOOP

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image to it.

In [6]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

322537213286.dkr.ecr.us-east-2.amazonaws.com/amazon-reviews-spark-processor:latest


### Create the ECR Repository and Push the Docker Image

In [7]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
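
The `aws ecr get-login` command used above exists only in AWS CLI v1. AWS CLI v2 replaces it with `aws ecr get-login-password`, whose output is piped to `docker login`. The sketch below only builds that command string so it can be inspected before it is run; the helper names `ecr_registry` and `ecr_login_command` are assumptions made for illustration.

```python
def ecr_registry(account_id, region):
    """Return the ECR registry host name for an account and region."""
    return "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)


def ecr_login_command(account_id, region):
    """Build the AWS CLI v2 style login pipeline as a single shell string.

    Hypothetical helper: it does not run anything, it only formats the command.
    """
    return (
        "aws ecr get-login-password --region {region} "
        "| docker login --username AWS --password-stdin {registry}"
    ).format(region=region, registry=ecr_registry(account_id, region))


if __name__ == "__main__":
    # Account ID and region taken from the notebook output above.
    print(ecr_login_command("322537213286", "us-east-2"))
```

In a notebook cell, the resulting string could be executed with `!{ecr_login_command(account_id, region)}`, assuming AWS CLI v2 and Docker are installed on the instance.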


### You can ignore a `RepositoryNotFoundException` error. The repository is created immediately afterward.

In [8]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo

{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-2:322537213286:repository/amazon-reviews-spark-processor",
            "registryId": "322537213286",
            "repositoryName": "amazon-reviews-spark-processor",
            "repositoryUri": "322537213286.dkr.ecr.us-east-2.amazonaws.com/amazon-reviews-spark-processor",
            "createdAt": 1594805555.0,
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
                "scanOnPush": false
            }
        }
    ]
}
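
The same describe-or-create logic can be written in Python instead of the shell `||` chain. This is a minimal sketch, assuming a boto3 ECR client is passed in; the function name `ensure_ecr_repository` is hypothetical, while `describe_repositories`, `create_repository`, and `RepositoryNotFoundException` are the boto3 ECR client names.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it does not exist.

    ecr_client is expected to behave like boto3.client("ecr").
    """
    try:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository is missing, so create it and return the new URI.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]


# Usage (requires AWS credentials, not executed here):
#   import boto3
#   uri = ensure_ecr_repository(boto3.client("ecr"), "amazon-reviews-spark-processor")
```

Catching only `RepositoryNotFoundException` means other failures, such as missing permissions, still surface as errors instead of triggering a create call.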


In [9]:
!docker tag $docker_repo:$docker_tag $image_uri

In [10]:
!docker push $image_uri

The push refers to repository [322537213286.dkr.ecr.us-east-2.amazonaws.com/amazon-reviews-spark-processor]

30d297f9: Preparing
1126e091: Preparing
9d96e860: Preparing
83aa0189: Preparing
9b7c9032: Preparing
e42631b9: Preparing
2c042f76: Preparing
cdbe5356: Preparing
2a404d82: Preparing
e5094404: Preparing
e262b7cb: Preparing
43baa117: Preparing
0a90a596: Preparing
6e43b9d3: Preparing
1cad6fd4: Preparing
760baedf: Preparing
3663cf66: Preparing
29cec5e1: Preparing

0d297f9: Pushing  948.3MB/2.011GB

0d297f9: Pushed   2.021GB/2.011GB

# Run the Job with Amazon SageMaker Processing Jobs

Use the Amazon SageMaker Python SDK to run the Processing job. The job configuration specifies the Spark container and the Spark ML script used for processing.

In [12]:
!pygmentize src_dir/preprocess-spark-text-to-bert.py

from __future__ import print_function
from __future__ import unicode_literals

import time
import sys
import os
import shutil
import csv
import collections
import subprocess
import sys
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pip', '--upgrade'])
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'wrapt', '--upgrade', '--ignore-installed'])
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0', '--ignore-installed'])
import tensorflow

    parser.add_argument('--output-data', type=str,
        default='/opt/ml/processing/output',
    )
    return parser.parse_args()


def transform(spark, s3_input_data, s3_output_train_data, s3_output_validation_data, s3_output_test_data):
    print('Processing {} => {}'.format(s3_input_data, s3_output_train_data, s3_output_validation_data, s3_output_test_data))

    schema = StructType([
        StructField('marketplace', StringType(), True),
        StructField('customer_id', StringType(), True),
        StructField('review_id

In [13]:
from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(base_job_name='spark-amazon-reviews-processor',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            instance_count=2, # instance_count needs to be > 1 or you will see the following error:  "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
                            instance_type='ml.r5.xlarge',
                            env={'mode': 'python'})

# Setup Output Data

In [14]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)

In [15]:
train_data_bert_output = 's3://{}/{}/output/bert-train'.format(bucket, output_prefix)
validation_data_bert_output = 's3://{}/{}/output/bert-validation'.format(bucket, output_prefix)
test_data_bert_output = 's3://{}/{}/output/bert-test'.format(bucket, output_prefix)

print(train_data_bert_output)
print(validation_data_bert_output)
print(test_data_bert_output)

s3://sagemaker-us-east-2-322537213286/amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-train
s3://sagemaker-us-east-2-322537213286/amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-validation
s3://sagemaker-us-east-2-322537213286/amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-test


In [16]:
from sagemaker.processing import ProcessingOutput

processor.run(code='./src_dir/preprocess-spark-text-to-bert.py',
              arguments=['s3_input_data', s3_input_data,
                         's3_output_train_data', train_data_bert_output,
                         's3_output_validation_data', validation_data_bert_output,
                         's3_output_test_data', test_data_bert_output,                         
              ],
              # We need this dummy output to allow us to call 
              #    ProcessingJob.from_processing_name() later 
              #    to describe the job and poll for Completed status
              outputs=[
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='dummy-output',
                                        source='/opt/ml/processing/output')
              ],          
              logs=True,
              wait=False
)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  spark-amazon-reviews-processor-2020-07-15-10-42-36-365
Inputs:  [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-2-322537213286/spark-amazon-reviews-processor-2020-07-15-10-42-36-365/input/code/preprocess-spark-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'dummy-output', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-2-322537213286/spark-amazon-reviews-processor-2020-07-15-10-42-36-365/output/dummy-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
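
The `arguments` list above alternates names and values without a `--` prefix. How the container's submit wrapper and the script consume them is not shown in this notebook. One plausible approach, sketched here under that assumption, is to fold the flat list into a dictionary; the helper name `pairs_to_dict` is hypothetical.

```python
def pairs_to_dict(flat_args):
    """Convert ['name1', 'value1', 'name2', 'value2', ...] into a dict.

    Hypothetical helper; the actual script's argument handling may differ.
    """
    if len(flat_args) % 2 != 0:
        raise ValueError("expected an even number of items (name, value pairs)")
    names = flat_args[0::2]   # items at even positions are names
    values = flat_args[1::2]  # items at odd positions are values
    return dict(zip(names, values))


if __name__ == "__main__":
    # Placeholder S3 URIs, not the notebook's real paths.
    args = ["s3_input_data", "s3://my-bucket/in/", "s3_output_train_data", "s3://my-bucket/train/"]
    print(pairs_to_dict(args))
```

Rejecting an odd-length list early makes a missing value fail at submission time rather than midway through the Spark job.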


In [17]:
from IPython.display import display, HTML

spark_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, spark_processing_job_name)))


In [18]:
from IPython.display import display, HTML

# This is different from the job name because we are not using ProcessingOutputs in this Spark ML case.
spark_processing_job_s3_output_prefix = output_prefix

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, spark_processing_job_s3_output_prefix, region)))


# List Processing Jobs through boto3 Python SDK

In [19]:
import boto3

client = boto3.client('sagemaker')
client.list_processing_jobs()

{'ProcessingJobSummaries': [{'ProcessingJobName': 'spark-amazon-reviews-processor-2020-07-15-10-42-36-365',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:processing-job/spark-amazon-reviews-processor-2020-07-15-10-42-36-365',
   'CreationTime': datetime.datetime(2020, 7, 15, 10, 42, 36, 741000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 7, 15, 10, 42, 36, 901000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'InProgress'},
  {'ProcessingJobName': 'spark-amazon-reviews-processor-2020-07-15-10-31-40-380',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:processing-job/spark-amazon-reviews-processor-2020-07-15-10-31-40-380',
   'CreationTime': datetime.datetime(2020, 7, 15, 10, 31, 40, 841000, tzinfo=tzlocal()),
   'ProcessingEndTime': datetime.datetime(2020, 7, 15, 10, 33, 1, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 7, 15, 10, 33, 1, 170000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'Failed',
   'Failure
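
A response like the one above can be long, so a per-status tally is often easier to read. The following is a small sketch that works on the dictionary returned by `list_processing_jobs`; the helper name `count_jobs_by_status` is an assumption for illustration.

```python
from collections import Counter


def count_jobs_by_status(response):
    """Count processing jobs per ProcessingJobStatus in a list_processing_jobs response."""
    summaries = response.get("ProcessingJobSummaries", [])
    return Counter(summary["ProcessingJobStatus"] for summary in summaries)


if __name__ == "__main__":
    # Trimmed-down sample shaped like the response shown above.
    sample = {
        "ProcessingJobSummaries": [
            {"ProcessingJobName": "job-a", "ProcessingJobStatus": "InProgress"},
            {"ProcessingJobName": "job-b", "ProcessingJobStatus": "Failed"},
        ]
    }
    print(count_jobs_by_status(sample))
```

To narrow the listing on the service side instead, `list_processing_jobs` also accepts a `StatusEquals` argument, for example `client.list_processing_jobs(StatusEquals='Failed')`.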

# Please Wait Until the Processing Job Completes
Re-run the next cell until the job status shows `Completed`.

In [24]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=spark_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)



InProgress


{'ProcessingInputs': [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-2-322537213286/spark-amazon-reviews-processor-2020-07-15-10-42-36-365/input/code/preprocess-spark-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'dummy-output', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-2-322537213286/spark-amazon-reviews-processor-2020-07-15-10-42-36-365/output/dummy-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]}, 'ProcessingJobName': 'spark-amazon-reviews-processor-2020-07-15-10-42-36-365', 'ProcessingResources': {'ClusterConfig': {'InstanceCount': 2, 'InstanceType': 'ml.r5.xlarge', 'VolumeSizeInGB': 30}}, 'StoppingCondition': {'MaxRuntimeInSeconds': 86400}, 'AppSpecification': {'ImageUri': '322537213286.dkr.ecr.us-east-2.amazonaws.c
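
Instead of re-running the cell by hand, the status check can be wrapped in a polling loop. This is a minimal sketch, assuming the describe call returns a dictionary with a `ProcessingJobStatus` key as shown above; the function name `wait_for_terminal_status` and its default intervals are hypothetical.

```python
import time

# Statuses after which a processing job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(describe_fn, poll_seconds=30, max_polls=120, sleep_fn=time.sleep):
    """Call describe_fn repeatedly until the job reaches a terminal status.

    describe_fn: zero-argument callable returning a dict with 'ProcessingJobStatus'.
    sleep_fn is injectable so the loop can be exercised without real delays.
    """
    status = None
    for _ in range(max_polls):
        status = describe_fn()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep_fn(poll_seconds)
    raise TimeoutError("job still {} after {} polls".format(status, max_polls))


# Usage in this notebook (not executed here):
#   final_status = wait_for_terminal_status(running_processor.describe)
```

Bounding the loop with `max_polls` keeps a stuck job from blocking the notebook indefinitely.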

In [25]:
running_processor.wait()

2020-07-15 10:46:13,623 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = algo-1/10.0.140.237
STARTUP_MSG:   args = [-format, -force]
STARTUP_MSG:   version = 3.2.1
STARTUP_MSG:   classpath = /usr/hadoop-3.2.1/etc/hadoop:/usr/hadoop-3.2.1/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/commons-compress-1.18.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/stax2-api-3.1.4.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/error_prone_annotations-2.2.0.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/jersey-core-1.19.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/netty-3.10.5.Final.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/slf4j-api-1.7.25.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/kerby-x

2020-07-15 11:09:53,228 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 8.0 (TID 10) in 567308 ms on algo-2 (executor 1) (1/2)
2020-07-15 11:13:27,380 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 8.0 (TID 9) in 781460 ms on algo-2 (executor 1) (2/2)
2020-07-15 11:13:27,380 INFO cluster.YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
2020-07-15 11:13:27,381 INFO scheduler.DAGScheduler: ResultStage 8 (save at NativeMethodAccessorImpl.java:0) finished in 781.492 s
2020-07-15 11:13:27,381 INFO scheduler.DAGScheduler: Job 8 finished: save at NativeMethodAccessorImpl.java:0, took 781.493853 s
2020-07-15 11:13:27,779 INFO datasources.FileFormatWriter: Write Job 5c58cc07-f934-4f21-b289-7a72a539d868 committed.
2020-07-15 11:13:27,780 INFO datasources.FileFormatWriter: Finished processing stats for write job 5c58cc07-f934-4f21-b289-7a72a539d868.
Wrote to output file:  s3a://s

2020-07-15 11:22:56,806 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 (TID 12) in 568476 ms on algo-2 (executor 1) (1/2)
2020-07-15 11:26:35,908 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 (TID 11) in 787578 ms on algo-2 (executor 1) (2/2)
2020-07-15 11:26:35,909 INFO cluster.YarnScheduler: Removed TaskSet 9.0, whose tasks have all completed, from pool
2020-07-15 11:26:35,910 INFO scheduler.DAGScheduler: ResultStage 9 (save at NativeMethodAccessorImpl.java:0) finished in 787.610 s
2020-07-15 11:26:35,911 INFO scheduler.DAGScheduler: Job 9 finished: save at NativeMethodAccessorImpl.java:0, took 787.612771 s
2020-07-15 11:26:36,293 INFO datasources.FileFormatWriter: Write Job 63df17f5-655d-4af6-b183-4e5c2d574af5 committed.
2020-07-15 11:26:36,293 INFO datasources.FileFormatWriter: Finished processing stats for write job 63df17f5-655d-4af6-b183-4e5c2d574af5.
Wrote to output file:  s3a://

Finished Yarn configuration files setup.



<h2><span style="color:red">Please wait until the Processing Job above has completed.</span></h2>


# Inspect the Processed Output Dataset

In [26]:
!aws s3 ls --recursive $train_data_bert_output/

2020-07-15 11:00:26          0 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-train/_SUCCESS
2020-07-15 11:00:24   71660174 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-train/part-00000-1dc923d5-149f-4f6f-ba47-a1391b41ec1b-c000.tfrecord
2020-07-15 10:56:48   50910757 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-train/part-00001-1dc923d5-149f-4f6f-ba47-a1391b41ec1b-c000.tfrecord


In [27]:
!aws s3 ls --recursive $validation_data_bert_output/

2020-07-15 11:13:28          0 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-validation/_SUCCESS
2020-07-15 11:13:27    4041153 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-validation/part-00000-6b771df3-7f99-4a47-adbc-f3f0cbba9dad-c000.tfrecord
2020-07-15 11:09:53    2893051 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-validation/part-00001-6b771df3-7f99-4a47-adbc-f3f0cbba9dad-c000.tfrecord


In [28]:
!aws s3 ls --recursive $test_data_bert_output/

2020-07-15 11:26:37          0 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-test/_SUCCESS
2020-07-15 11:26:36    3998157 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-test/part-00000-37bbaf43-1f31-4bb9-8fcd-289f8df85a92-c000.tfrecord
2020-07-15 11:22:57    2824481 amazon-reviews-spark-processor-2020-07-15-10-42-35/output/bert-test/part-00001-37bbaf43-1f31-4bb9-8fcd-289f8df85a92-c000.tfrecord
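
As a quick sanity check on the three splits, the object sizes in these listings can be summed. The sketch below parses text in the `aws s3 ls` format shown above (date, time, size in bytes, key); the helper name `total_bytes` is hypothetical, and capturing the listing text is left to the reader.

```python
def total_bytes(ls_output):
    """Sum the size column of `aws s3 ls --recursive` output.

    Each object line is expected to look like: '<date> <time> <size> <key>'.
    Lines that do not match that shape are skipped.
    """
    total = 0
    for line in ls_output.splitlines():
        parts = line.split(None, 3)  # split into at most four fields
        if len(parts) == 4 and parts[2].isdigit():
            total += int(parts[2])
    return total


if __name__ == "__main__":
    # Shortened keys; sizes copied from the bert-validation listing above.
    listing = (
        "2020-07-15 11:13:28          0 output/bert-validation/_SUCCESS\n"
        "2020-07-15 11:13:27    4041153 output/bert-validation/part-00000.tfrecord\n"
        "2020-07-15 11:09:53    2893051 output/bert-validation/part-00001.tfrecord\n"
    )
    print(total_bytes(listing))
```

Comparing the totals for the train, validation, and test prefixes gives a rough view of how the data was split, although byte counts are only a proxy for record counts.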
