# Amazon SageMaker Processing Job 


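A machine learning (ML) process consists of several steps. First, data is collected through various ETL jobs. Next, the data is pre-processed and featurized using traditional techniques or prior knowledge. Finally, an ML model is trained with an algorithm.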

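Distributed data processing frameworks such as Apache Spark are used to pre-process datasets for training. In this notebook, we run a processing workload on Amazon SageMaker Processing using the capabilities of the Apache Spark installation that ships in the processing container.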

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


# Setup Environment


* An S3 bucket and prefix to be used for model training are required.
* The IAM role must be able to access the dataset for both training and processing.

In [1]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

# Setup Input Data

In [2]:
# Inputs
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/


In [3]:
!aws s3 ls $s3_input_data

2020-09-15 04:56:04   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-09-15 04:56:06   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2020-09-15 04:56:09  193389086 amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz
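To get a quick sense of how much input the job will read, the listing above can be parsed and summed. The following is a minimal sketch that works on the text printed by `aws s3 ls`; the helper name `parse_s3_ls` and the `S3Listing` type are illustrative and not part of the notebook.

```python
from typing import List, NamedTuple


class S3Listing(NamedTuple):
    date: str
    time: str
    size_bytes: int
    key: str


def parse_s3_ls(output: str) -> List[S3Listing]:
    """Parse the text printed by `aws s3 ls` into structured rows."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 3)  # date, time, size, object name
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip blank lines and "PRE <prefix>/" entries
        rows.append(S3Listing(parts[0], parts[1], int(parts[2]), parts[3]))
    return rows


# The three lines printed by the cell above.
listing = """\
2020-09-15 04:56:04   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-09-15 04:56:06   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2020-09-15 04:56:09  193389086 amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz
"""

rows = parse_s3_ls(listing)
total_bytes = sum(row.size_bytes for row in rows)
print(len(rows), "files,", round(total_bytes / 1e6, 1), "MB compressed")
```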


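# Spark Docker Image for Running the Processing Job

This hands-on lab (HOL) includes a Spark container image in the `./container` folder. The container handles the bootstrapping of all Spark configuration and provides a wrapper around the `spark-submit` CLI. At a high level, the container provides: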

* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application

Once the container has been built and pushed, the Amazon SageMaker Python SDK is used to run a managed, distributed Spark application that processes the dataset.
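The container's actual `spark-submit` wrapper is not shown in this notebook. As a rough illustration of what such a wrapper does, the sketch below assembles a `spark-submit` command line for a PySpark script. The function name `build_spark_submit_command`, the default `yarn` master with `client` deploy mode, and the example bucket are assumptions made for illustration; only the `--master` and `--deploy-mode` flags are standard `spark-submit` options.

```python
import shlex
from typing import List, Sequence


def build_spark_submit_command(app_path: str,
                               app_args: Sequence[str] = (),
                               master: str = "yarn",
                               deploy_mode: str = "client") -> List[str]:
    """Assemble a spark-submit invocation as an argument list (hypothetical helper)."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app_path]
    cmd.extend(app_args)  # application arguments follow the script path
    return cmd


cmd = build_spark_submit_command(
    "/opt/ml/processing/input/code/preprocess-spark-text-to-bert.py",
    ["s3_input_data", "s3://my-bucket/amazon-reviews-pds/tsv/"],  # placeholder bucket
)
print(" ".join(shlex.quote(part) for part in cmd))
```

A real wrapper would hand this list to something like `subprocess.run(cmd, check=True)` after the Spark master and worker nodes have been started.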

In [4]:
docker_repo = 'amazon-reviews-spark-processor'
docker_tag = 'latest'

In [5]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Sending build context to Docker daemon  4.385MB
Step 1/37 : FROM openjdk:8-jre-slim
8-jre-slim: Pulling from library/openjdk
f8d1c412: Pulling fs layer
ccc0fc24: Pulling fs layer
7ee20b42: Pulling fs layer
Digest: sha256:b933e809a1597f27617cf50bdd07f4daa351742c36dd34777506cd73111caca8
Status: Downloaded newer image for openjdk:8-jre-slim
 ---> f75cca7b8ea8
Step 2/37 : RUN apt-get update
 ---> Running in 136754ebbad1
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [122 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [226 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7906 k

Get:16 http://deb.debian.org/debian buster/main amd64 python2.7 amd64 2.7.16-2+deb10u1 [305 kB]
Get:17 http://deb.debian.org/debian buster/main amd64 libpython2-stdlib amd64 2.7.16-1 [20.8 kB]
Get:18 http://deb.debian.org/debian buster/main amd64 libpython-stdlib amd64 2.7.16-1 [20.8 kB]
Get:19 http://deb.debian.org/debian buster/main amd64 python2 amd64 2.7.16-1 [41.6 kB]
Get:20 http://deb.debian.org/debian buster/main amd64 python amd64 2.7.16-1 [22.8 kB]
Get:21 http://deb.debian.org/debian buster/main amd64 liblocale-gettext-perl amd64 1.07-3+b4 [18.9 kB]
Get:22 http://deb.debian.org/debian buster/main amd64 libpython3.7-minimal amd64 3.7.3-2+deb10u2 [589 kB]
Get:23 http://deb.debian.org/debian buster/main amd64 python3.7-minimal amd64 3.7.3-2+deb10u2 [1731 kB]
Get:24 http://deb.debian.org/debian buster/main amd64 python3-minimal amd64 3.7.3-1 [36.6 kB]
Get:25 http://deb.debian.org/debian buster/main amd64 libmpdec2 amd64 2.4.2-2 [87.2 kB]
Get:26 http://deb.debian.org/debian buster/

Get:111 http://deb.debian.org/debian buster/main amd64 libalgorithm-diff-perl all 1.19.03-2 [47.9 kB]
Get:112 http://deb.debian.org/debian buster/main amd64 libalgorithm-diff-xs-perl amd64 0.04-5+b1 [11.8 kB]
Get:113 http://deb.debian.org/debian buster/main amd64 libalgorithm-merge-perl all 0.08-3 [12.7 kB]
Get:114 http://deb.debian.org/debian buster/main amd64 libexpat1-dev amd64 2.2.6-2+deb10u1 [153 kB]
Get:115 http://deb.debian.org/debian buster/main amd64 libfile-fcntllock-perl amd64 0.22-3+b5 [35.4 kB]
Get:116 http://deb.debian.org/debian buster/main amd64 libglib2.0-data all 2.58.3-2+deb10u2 [1110 kB]
Get:117 http://deb.debian.org/debian buster/main amd64 libicu63 amd64 63.1-6+deb10u1 [8300 kB]
Get:118 http://deb.debian.org/debian buster/main amd64 libpython2.7 amd64 2.7.16-2+deb10u1 [1036 kB]
Get:119 http://deb.debian.org/debian buster/main amd64 libpython2.7-dev amd64 2.7.16-2+deb10u1 [31.6 MB]
Get:120 http://deb.debian.org/debian buster/main amd64 libpython2-dev amd64 2.7.16-1

Selecting previously unselected package libpython3.7-minimal:amd64.
Preparing to unpack .../libpython3.7-minimal_3.7.3-2+deb10u2_amd64.deb ...
Unpacking libpython3.7-minimal:amd64 (3.7.3-2+deb10u2) ...
Selecting previously unselected package python3.7-minimal.
Preparing to unpack .../python3.7-minimal_3.7.3-2+deb10u2_amd64.deb ...
Unpacking python3.7-minimal (3.7.3-2+deb10u2) ...
Setting up libpython3.7-minimal:amd64 (3.7.3-2+deb10u2) ...
Setting up libexpat1:amd64 (2.2.6-2+deb10u1) ...
Setting up python3.7-minimal (3.7.3-2+deb10u2) ...
Selecting previously unselected package python3-minimal.
(Reading database ... 9963 files and directories currently installed.)
Preparing to unpack .../python3-minimal_3.7.3-1_amd64.deb ...
Unpacking python3-minimal (3.7.3-1) ...
Selecting previously unselected package libmpdec2:amd64.
Preparing to unpack .../libmpdec2_2.4.2-2_amd64.deb ...
Unpacking libmpdec2:amd64 (2.4.2-2) ...
Selecting previously unselected package libpython3.7-stdlib:amd64.
Prepari

Selecting previously unselected package build-essential.
Preparing to unpack .../044-build-essential_12.6_amd64.deb ...
Unpacking build-essential (12.6) ...
Selecting previously unselected package libkeyutils1:amd64.
Preparing to unpack .../045-libkeyutils1_1.6-6_amd64.deb ...
Unpacking libkeyutils1:amd64 (1.6-6) ...
Selecting previously unselected package libkrb5support0:amd64.
Preparing to unpack .../046-libkrb5support0_1.17-3_amd64.deb ...
Unpacking libkrb5support0:amd64 (1.17-3) ...
Selecting previously unselected package libk5crypto3:amd64.
Preparing to unpack .../047-libk5crypto3_1.17-3_amd64.deb ...
Unpacking libk5crypto3:amd64 (1.17-3) ...
Selecting previously unselected package libkrb5-3:amd64.
Preparing to unpack .../048-libkrb5-3_1.17-3_amd64.deb ...
Unpacking libkrb5-3:amd64 (1.17-3) ...
Selecting previously unselected package libgssapi-krb5-2:amd64.
Preparing to unpack .../049-libgssapi-krb5-2_1.17-3_amd64.deb ...
Unpacking libgssapi-krb5-2:amd64 (1.17-3) ...
Selecting pre

Selecting previously unselected package libpython3.7:amd64.
Preparing to unpack .../093-libpython3.7_3.7.3-2+deb10u2_amd64.deb ...
Unpacking libpython3.7:amd64 (3.7.3-2+deb10u2) ...
Selecting previously unselected package libpython3.7-dev:amd64.
Preparing to unpack .../094-libpython3.7-dev_3.7.3-2+deb10u2_amd64.deb ...
Unpacking libpython3.7-dev:amd64 (3.7.3-2+deb10u2) ...
Selecting previously unselected package libpython3-dev:amd64.
Preparing to unpack .../095-libpython3-dev_3.7.3-1_amd64.deb ...
Unpacking libpython3-dev:amd64 (3.7.3-1) ...
Selecting previously unselected package libsasl2-modules:amd64.
Preparing to unpack .../096-libsasl2-modules_2.1.27+dfsg-1+deb10u1_amd64.deb ...
Unpacking libsasl2-modules:amd64 (2.1.27+dfsg-1+deb10u1) ...
Selecting previously unselected package libxml2:amd64.
Preparing to unpack .../097-libxml2_2.9.4+dfsg1-7+b3_amd64.deb ...
Unpacking libxml2:amd64 (2.9.4+dfsg1-7+b3) ...
Selecting previously unselected package manpages-dev.
Preparing to unpack ...

Setting up librtmp1:amd64 (2.4+20151223.gitfa8646d.1-2) ...
Setting up libdbus-1-3:amd64 (1.12.20-0+deb10u1) ...
Setting up dbus (1.12.20-0+deb10u1) ...
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of start.
Setting up xz-utils (5.2.4-1) ...
update-alternatives: using /usr/bin/xz to provide /usr/bin/lzma (lzma) in auto mode
Setting up libquadmath0:amd64 (8.3.0-6) ...
Setting up libmpc3:amd64 (1.1.0-1) ...
Setting up libatomic1:amd64 (8.3.0-6) ...
Setting up patch (2.7.6-3+deb10u1) ...
Setting up libk5crypto3:amd64 (1.17-3) ...
Setting up libsasl2-2:amd64 (2.1.27+dfsg-1+deb10u1) ...
Setting up libmpx2:amd64 (8.3.0-6) ...
Setting up libubsan1:amd64 (8.3.0-6) ...
Setting up libisl19:amd64 (0.20-2) ...
Setting up libgirepository-1.0-1:amd64 (1.58.3-2) ...
Setting up libssh2-1:amd64 (1.8.0-2.1) ...
Setting up netbase (5.6) ...
Setting up python-pip-whl (18.1-5) ...
Setting up libkrb5-3:amd64 (1.17-3) ...
Setting up libmpdec2:amd64 (2.4.2-2) ...

Removing intermediate container e48eef1dc6ed
 ---> 4e5241cdb33b
Step 6/37 : RUN rm -rf /var/lib/apt/lists/*
 ---> Running in fd11516f476d
Removing intermediate container fd11516f476d
 ---> b8af7199d006
Step 7/37 : ENV PYTHONHASHSEED 0
 ---> Running in 67e69a40b043
Removing intermediate container 67e69a40b043
 ---> 2798320363dd
Step 8/37 : ENV PYTHONIOENCODING UTF-8
 ---> Running in 83309272600c
Removing intermediate container 83309272600c
 ---> 8c11598de3db
Step 9/37 : ENV PIP_DISABLE_PIP_VERSION_CHECK 1
 ---> Running in 883dc4cc32a9
Removing intermediate container 883dc4cc32a9
 ---> b703cf4ed145
Step 10/37 : ENV HADOOP_VERSION 3.2.1
 ---> Running in 235dc647039b
Removing intermediate container 235dc647039b
 ---> 4a524cf58328
Step 11/37 : ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
 ---> Running in c333dc9a7020
Removing intermediate container c333dc9a7020
 ---> 1d06dddc2444
Step 12/37 : ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
 ---> Running in b00dab8657d1
Removing intermediate cont

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image to it.

In [6]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

322537213286.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor:latest


### Create the ECR repository and push the Docker image

In [7]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
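Note that `aws ecr get-login` exists only in AWS CLI version 1; version 2 replaces it with `aws ecr get-login-password`. As an alternative that does not depend on the CLI version, the registry credentials can be fetched with boto3. The sketch below is a minimal example: the helper names `decode_ecr_token` and `get_ecr_credentials` are illustrative, while `get_authorization_token` is the boto3 ECR client call that returns a base64-encoded `user:password` token together with the registry endpoint.

```python
import base64
from typing import Tuple


def decode_ecr_token(authorization_token: str) -> Tuple[str, str]:
    """Split a base64-encoded 'user:password' ECR token into its two parts."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, password = decoded.split(":", 1)
    return username, password


def get_ecr_credentials(region_name: str) -> Tuple[str, str, str]:
    """Return (username, password, registry endpoint) for the caller's default registry."""
    import boto3  # imported lazily so the decoding helper works without boto3

    ecr = boto3.client("ecr", region_name=region_name)
    auth = ecr.get_authorization_token()["authorizationData"][0]
    username, password = decode_ecr_token(auth["authorizationToken"])
    return username, password, auth["proxyEndpoint"]
```

The returned values can then be passed to `docker login --username <username> --password-stdin <endpoint>`, with the password supplied on standard input.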


### You can ignore a `RepositoryNotFoundException` error here; the repository is created immediately afterwards.

In [8]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo

{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-1:322537213286:repository/amazon-reviews-spark-processor",
            "registryId": "322537213286",
            "repositoryName": "amazon-reviews-spark-processor",
            "repositoryUri": "322537213286.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor",
            "createdAt": 1596262176.0,
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
                "scanOnPush": false
            },
            "encryptionConfiguration": {
                "encryptionType": "AES256"
            }
        }
    ]
}
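The describe-or-create pattern used in the shell command above can also be written with boto3, which avoids printing the error when the repository is missing. This is a sketch under the assumption that an ECR client is passed in; the helper name `ensure_ecr_repository` is illustrative.

```python
def ensure_ecr_repository(ecr_client, repository_name: str) -> dict:
    """Return the repository description, creating the repository if it does not exist."""
    try:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # Same fallback as the `||` branch of the shell command.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]
```

In the notebook this could be called as `ensure_ecr_repository(boto3.client('ecr'), docker_repo)['repositoryUri']`.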


In [9]:
!docker tag $docker_repo:$docker_tag $image_uri

In [10]:
!docker push $image_uri

The push refers to repository [322537213286.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor]

aca904c6: Preparing
74080c56: Preparing
3c74b171: Preparing
67a7277d: Preparing
a83bff53: Preparing
fadc25c7: Preparing
7e01acfe: Preparing
b0ef4a87: Preparing
5a61114d: Preparing
ee9c1c2d: Preparing
213985ea: Preparing
12a14adb: Preparing
fde717d1: Preparing
7654db06: Preparing
721e75f9: Preparing
a1389900: Preparing
7b6ca8b9: Preparing
fd2b2495: Preparing

ca904c6: Pushing  1.334GB/2.013GB

ca904c6: Pushed   2.023GB/2.013GB

# Run the Job with Amazon SageMaker Processing Jobs

Use the Amazon SageMaker Python SDK to run the Processing job. The job configuration specifies the Spark container and the Spark ML script that performs the processing.

In [11]:
!pygmentize src_dir/preprocess-spark-text-to-bert.py

from __future__ import print_function
from __future__ import unicode_literals

import time
import sys
import os
import shutil
import csv
import collections
import subprocess
import sys
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pip', '--upgrade'])
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'wrapt', '--upgrade', '--ignore-installed'])
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0', '--ignore-installed'])
import tensorflow

In [12]:
from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(base_job_name='spark-amazon-reviews-processor',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            instance_count=2, # instance_count needs to be > 1 or you will see the following error:  "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
                            instance_type='ml.r5.xlarge',
                            env={'mode': 'python'})

# Setup Output Data

In [13]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)

In [14]:
train_data_bert_output = 's3://{}/{}/output/bert-train'.format(bucket, output_prefix)
validation_data_bert_output = 's3://{}/{}/output/bert-validation'.format(bucket, output_prefix)
test_data_bert_output = 's3://{}/{}/output/bert-test'.format(bucket, output_prefix)

print(train_data_bert_output)
print(validation_data_bert_output)
print(test_data_bert_output)

s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train
s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-validation
s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-test


In [15]:
from sagemaker.processing import ProcessingOutput

processor.run(code='./src_dir/preprocess-spark-text-to-bert.py',
              arguments=['s3_input_data', s3_input_data,
                         's3_output_train_data', train_data_bert_output,
                         's3_output_validation_data', validation_data_bert_output,
                         's3_output_test_data', test_data_bert_output,                         
              ],
              # We need this dummy output to allow us to call 
              #    ProcessingJob.from_processing_name() later 
              #    to describe the job and poll for Completed status
              outputs=[
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='dummy-output',
                                        source='/opt/ml/processing/output')
              ],          
              logs=True,
              wait=False
)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  spark-amazon-reviews-processor-2020-09-15-05-53-32-188
Inputs:  [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/spark-amazon-reviews-processor-2020-09-15-05-53-32-188/input/code/preprocess-spark-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'dummy-output', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/spark-amazon-reviews-processor-2020-09-15-05-53-32-188/output/dummy-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
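The `arguments` list above passes each setting as a name followed by its value, without `--` prefixes. How `preprocess-spark-text-to-bert.py` reads them is not shown in this notebook, so the following is only a sketch of one way a script could turn such alternating name/value pairs into a dictionary. The helper name `parse_name_value_args` and the example bucket are hypothetical.

```python
from typing import Dict, Sequence


def parse_name_value_args(argv: Sequence[str]) -> Dict[str, str]:
    """Turn ['name1', 'value1', 'name2', 'value2', ...] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError("expected an even number of arguments (name/value pairs)")
    return dict(zip(argv[0::2], argv[1::2]))


# Placeholder values shaped like the arguments passed to processor.run().
example_argv = [
    "s3_input_data", "s3://my-bucket/amazon-reviews-pds/tsv/",
    "s3_output_train_data", "s3://my-bucket/output/bert-train",
]
args = parse_name_value_args(example_argv)
print(args["s3_input_data"])
```

Inside a script, the same function would typically be called with `sys.argv[1:]`.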


In [16]:
from IPython.display import display, HTML

spark_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, spark_processing_job_name)))


In [17]:
from IPython.display import display, HTML

# This differs from the job name because the Spark ML script writes directly to S3 instead of using ProcessingOutputs.
spark_processing_job_s3_output_prefix = output_prefix

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, spark_processing_job_s3_output_prefix, region)))


# List Processing Jobs through boto3 Python SDK

In [18]:
import boto3

client = boto3.client('sagemaker')
client.list_processing_jobs()

{'ProcessingJobSummaries': [{'ProcessingJobName': 'spark-amazon-reviews-processor-2020-09-15-05-53-32-188',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/spark-amazon-reviews-processor-2020-09-15-05-53-32-188',
   'CreationTime': datetime.datetime(2020, 9, 15, 5, 53, 32, 776000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 9, 15, 5, 53, 32, 776000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'InProgress'},
  {'ProcessingJobName': 'sagemaker-scikit-learn-2020-09-15-05-48-17-452',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/sagemaker-scikit-learn-2020-09-15-05-48-17-452',
   'CreationTime': datetime.datetime(2020, 9, 15, 5, 48, 17, 936000, tzinfo=tzlocal()),
   'ProcessingEndTime': datetime.datetime(2020, 9, 15, 5, 53, 23, 152000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 9, 15, 5, 53, 23, 156000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'Completed'},
  {'ProcessingJobN
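The raw response is verbose, so it can help to reduce it to job names and statuses. The sketch below works on a response dictionary of the shape shown above; the helper name `summarize_processing_jobs` is illustrative. Filtering can also be done on the server side, since `list_processing_jobs` accepts parameters such as `NameContains`, `StatusEquals`, `SortBy`, `SortOrder`, and `MaxResults`.

```python
from typing import Dict, List, Optional


def summarize_processing_jobs(response: dict,
                              status: Optional[str] = None) -> List[Dict[str, str]]:
    """Reduce a list_processing_jobs response to name/status pairs, optionally filtered."""
    rows = []
    for job in response.get("ProcessingJobSummaries", []):
        if status is not None and job["ProcessingJobStatus"] != status:
            continue
        rows.append({"name": job["ProcessingJobName"],
                     "status": job["ProcessingJobStatus"]})
    return rows


# A trimmed copy of the response printed above.
sample_response = {
    "ProcessingJobSummaries": [
        {"ProcessingJobName": "spark-amazon-reviews-processor-2020-09-15-05-53-32-188",
         "ProcessingJobStatus": "InProgress"},
        {"ProcessingJobName": "sagemaker-scikit-learn-2020-09-15-05-48-17-452",
         "ProcessingJobStatus": "Completed"},
    ]
}

for row in summarize_processing_jobs(sample_response):
    print(row["status"], row["name"])

# Server-side alternative (requires AWS credentials):
# client.list_processing_jobs(NameContains='spark-amazon-reviews',
#                             StatusEquals='InProgress', MaxResults=10)
```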

# Please Wait Until the Processing Job Completes
Re-run this next cell until the job status shows `Completed`.

In [19]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=spark_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)



InProgress


{'ProcessingInputs': [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/spark-amazon-reviews-processor-2020-09-15-05-53-32-188/input/code/preprocess-spark-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'dummy-output', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/spark-amazon-reviews-processor-2020-09-15-05-53-32-188/output/dummy-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]}, 'ProcessingJobName': 'spark-amazon-reviews-processor-2020-09-15-05-53-32-188', 'ProcessingResources': {'ClusterConfig': {'InstanceCount': 2, 'InstanceType': 'ml.r5.xlarge', 'VolumeSizeInGB': 30}}, 'StoppingCondition': {'MaxRuntimeInSeconds': 86400}, 'AppSpecification': {'ImageUri': '322537213286.dkr.ecr.us-east-1.amazonaws.c
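Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. This is a minimal sketch: the helper name `wait_for_processing_job` and its parameters are illustrative, and it treats `Completed`, `Failed`, and `Stopped` as the terminal values of `ProcessingJobStatus`.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe: Callable[[], dict],
                            poll_seconds: float = 30.0,
                            max_polls: int = 1000,
                            sleep: Callable[[float], None] = time.sleep) -> str:
    """Call describe() until the job reaches a terminal status and return that status."""
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # still InProgress or Stopping, so check again later
    raise TimeoutError("processing job did not reach a terminal status in time")
```

With the objects defined in this notebook, the call would be `wait_for_processing_job(running_processor.describe)`. Unlike `running_processor.wait()`, this loop does not stream the job logs.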

In [20]:
running_processor.wait()

2020-09-15 05:57:15,541 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = algo-1/10.0.220.139
STARTUP_MSG:   args = [-format, -force]
STARTUP_MSG:   version = 3.2.1
STARTUP_MSG:   classpath = /usr/hadoop-3.2.1/etc/hadoop:/usr/hadoop-3.2.1/share/hadoop/common/lib/kerb-core-1.0.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/commons-configuration2-2.1.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/kerb-server-1.0.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/asm-5.0.4.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/jersey-json-1.19.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/commons-codec-1.11.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/kerby-asn1-1.0.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/

2020-09-15 05:57:29.782048: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-09-15 05:57:29.782160: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-09-15 05:57:29.782174: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2.1.0
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 50.0MB/s]
2020-09-15 05:57:31,079 INFO spark.SparkContext: Running Spark version 2.4.6
2020-09-15 0

2020-09-15 05:57:38,939 INFO yarn.Client: Application report for application_1600149447309_0001 (state: ACCEPTED)
2020-09-15 05:57:39,942 INFO yarn.Client: Application report for application_1600149447309_0001 (state: ACCEPTED)
2020-09-15 05:57:40,944 INFO yarn.Client: Application report for application_1600149447309_0001 (state: ACCEPTED)
2020-09-15 05:57:41,947 INFO yarn.Client: Application report for application_1600149447309_0001 (state: ACCEPTED)
2020-09-15 05:57:42,597 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> algo-1, PROXY_URI_BASES -> http://algo-1:8088/proxy/application_1600149447309_0001), /proxy/application_1600149447309_0001
2020-09-15 05:57:42,782 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
2020-09-15 05:57:42,950 INFO yarn.Client: Ap

2020-09-15 05:58:11,309 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on algo-2:45803 (size: 42.9 KB, free: 11.9 GB)
2020-09-15 05:58:13,045 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 2243 ms on algo-2 (executor 1) (1/1)
2020-09-15 05:58:13,045 INFO cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
2020-09-15 05:58:13,047 INFO python.PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 37075
2020-09-15 05:58:13,048 INFO scheduler.DAGScheduler: ResultStage 4 (showString at NativeMethodAccessorImpl.java:0) finished in 2.260 s
2020-09-15 05:58:13,048 INFO scheduler.DAGScheduler: Job 4 finished: showString at NativeMethodAccessorImpl.java:0, took 2.264110 s
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

2020-09-15 06:21:18,554 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 6.0 (TID 7) in 1381962 ms on algo-2 (executor 1) (1/2)
2020-09-15 07:13:43,137 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 6) in 4526545 ms on algo-2 (executor 1) (2/2)
2020-09-15 07:13:43,137 INFO cluster.YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
2020-09-15 07:13:43,139 INFO scheduler.DAGScheduler: ResultStage 6 (save at NativeMethodAccessorImpl.java:0) finished in 4526.579 s
2020-09-15 07:13:43,139 INFO scheduler.DAGScheduler: Job 6 finished: save at NativeMethodAccessorImpl.java:0, took 4526.582249 s
2020-09-15 07:13:43,669 INFO datasources.FileFormatWriter: Write Job ed404a2b-526e-43c0-8ae6-83a966490d31 committed.
2020-09-15 07:13:43,673 INFO datasources.FileFormatWriter: Finished processing stats for write job ed404a2b-526e-43c0-8ae6-83a966490d31.
Wrote to output file:  s3a:

2020-09-15 07:36:35,269 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 7.0 (TID 9) in 1370882 ms on algo-2 (executor 1) (1/2)
2020-09-15 08:28:55,792 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 7.0 (TID 8) in 4511408 ms on algo-2 (executor 1) (2/2)
2020-09-15 08:28:55,794 INFO cluster.YarnScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool
2020-09-15 08:28:55,794 INFO scheduler.DAGScheduler: ResultStage 7 (save at NativeMethodAccessorImpl.java:0) finished in 4511.455 s
2020-09-15 08:28:55,794 INFO scheduler.DAGScheduler: Job 7 finished: save at NativeMethodAccessorImpl.java:0, took 4511.459635 s
2020-09-15 08:28:56,278 INFO datasources.FileFormatWriter: Write Job 0a5edcb6-fd9e-437f-8429-ad8451dc1b50 committed.
2020-09-15 08:28:56,279 INFO datasources.FileFormatWriter: Finished processing stats for write job 0a5edcb6-fd9e-437f-8429-ad8451dc1b50.
Wrote to output file:  s3a:

2020-09-15 09:44:02,283 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 8.0 (TID 10) in 4505473 ms on algo-2 (executor 1) (2/2)
2020-09-15 09:44:02,283 INFO cluster.YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
2020-09-15 09:44:02,284 INFO scheduler.DAGScheduler: ResultStage 8 (save at NativeMethodAccessorImpl.java:0) finished in 4505.505 s
2020-09-15 09:44:02,284 INFO scheduler.DAGScheduler: Job 8 finished: save at NativeMethodAccessorImpl.java:0, took 4505.508438 s
2020-09-15 09:44:02,803 INFO datasources.FileFormatWriter: Write Job d9cb7bf5-dddd-4739-93c6-c91b3fc75f3e committed.
2020-09-15 09:44:02,803 INFO datasources.FileFormatWriter: Finished processing stats for write job d9cb7bf5-dddd-4739-93c6-c91b3fc75f3e.
Wrote to output file:  s3a://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-test
2020-09-15 09:44:03,005 INFO data

Finished Yarn configuration files setup.



<h2><span style="color:red">Please wait until the Processing Job above has completed before continuing.</span></h2>


# Inspect the Processed Output Dataset

In [21]:
!aws s3 ls --recursive $train_data_bert_output/

2020-09-15 07:13:44          0 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train/_SUCCESS
2020-09-15 07:13:38  450267574 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train/part-00000-f89d9eb8-fdf8-4a54-b5e8-6567d2a5958c-c000.tfrecord
2020-09-15 06:21:17  122739791 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train/part-00001-f89d9eb8-fdf8-4a54-b5e8-6567d2a5958c-c000.tfrecord


In [22]:
!aws s3 ls --recursive $validation_data_bert_output/

2020-09-15 08:28:57          0 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-validation/_SUCCESS
2020-09-15 08:28:56   25065057 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-validation/part-00000-bf37a7ed-3d6e-401e-ad4d-6eecc10de278-c000.tfrecord
2020-09-15 07:36:35    6798940 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-validation/part-00001-bf37a7ed-3d6e-401e-ad4d-6eecc10de278-c000.tfrecord


In [23]:
!aws s3 ls --recursive $test_data_bert_output/

2020-09-15 09:44:03          0 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-test/_SUCCESS
2020-09-15 09:44:02   24958447 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-test/part-00000-00115674-2af8-411b-81d1-96c0b778e4ae-c000.tfrecord
2020-09-15 08:51:42    6789042 amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-test/part-00001-00115674-2af8-411b-81d1-96c0b778e4ae-c000.tfrecord
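Each output prefix contains a zero-byte `_SUCCESS` marker, which Spark writes once the write job has been committed, alongside the `part-*.tfrecord` files. Before downloading or training on an output, it can be worth checking for that marker. The sketch below does this on a list of object keys; the helper name `spark_part_files` is illustrative.

```python
from typing import Iterable, List


def spark_part_files(keys: Iterable[str], suffix: str = ".tfrecord") -> List[str]:
    """Return the part files under a Spark output prefix, requiring a _SUCCESS marker."""
    keys = list(keys)
    basenames = [key.rsplit("/", 1)[-1] for key in keys]
    if "_SUCCESS" not in basenames:
        raise RuntimeError("no _SUCCESS marker found; the Spark write may not have finished")
    return sorted(
        key for key, name in zip(keys, basenames)
        if name.startswith("part-") and name.endswith(suffix)
    )


# Keys taken from the bert-test listing above.
prefix = "amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-test/"
test_keys = [
    prefix + "_SUCCESS",
    prefix + "part-00000-00115674-2af8-411b-81d1-96c0b778e4ae-c000.tfrecord",
    prefix + "part-00001-00115674-2af8-411b-81d1-96c0b778e4ae-c000.tfrecord",
]
parts = spark_part_files(test_keys)
print(len(parts), "part files ready")
```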


In [24]:
train_data = './data-tfrecord/bert-train'
validation_data = './data-tfrecord/bert-validation'
test_data = './data-tfrecord/bert-test'

!aws s3 cp $train_data_bert_output $train_data --recursive
!aws s3 cp $validation_data_bert_output $validation_data --recursive
!aws s3 cp $test_data_bert_output $test_data --recursive

download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train/_SUCCESS to data-tfrecord/bert-train/_SUCCESS
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train/part-00001-f89d9eb8-fdf8-4a54-b5e8-6567d2a5958c-c000.tfrecord to data-tfrecord/bert-train/part-00001-f89d9eb8-fdf8-4a54-b5e8-6567d2a5958c-c000.tfrecord
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-train/part-00000-f89d9eb8-fdf8-4a54-b5e8-6567d2a5958c-c000.tfrecord to data-tfrecord/bert-train/part-00000-f89d9eb8-fdf8-4a54-b5e8-6567d2a5958c-c000.tfrecord
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-validation/_SUCCESS to data-tfrecord/bert-validation/_SUCCESS
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-spark-processor-2020-09-15-05-53-32/output/bert-validation/part

In [25]:
%store train_data_bert_output train_data

Stored 'train_data_bert_output' (str)
Stored 'train_data' (str)


In [26]:
%store validation_data_bert_output validation_data

Stored 'validation_data_bert_output' (str)
Stored 'validation_data' (str)


In [27]:
%store test_data_bert_output test_data

Stored 'test_data_bert_output' (str)
Stored 'test_data' (str)
