# Amazon SageMaker Processing Job 


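A machine learning (ML) workflow consists of several steps. First, data is collected through various ETL jobs. Next, the data is pre-processed and turned into features using traditional techniques or prior knowledge. Finally, an ML model is trained with an algorithm.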

Distributed data processing frameworks such as Apache Spark are used to pre-process datasets for training. This notebook runs a processing workload on Amazon SageMaker Processing and leverages the preinstalled Apache Spark.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


# Setup Environment


* You need an S3 bucket and prefix to use for model training.
* The IAM role used for training and processing must have access to the dataset.

In [1]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

# Setup Input Data

In [2]:
# Inputs
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/


In [3]:
!aws s3 ls $s3_input_data

2020-12-08 04:14:36   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-12-08 04:14:38   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2020-12-08 04:14:41  193389086 amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz


# Spark Docker Image for the Processing Job

This hands-on lab includes a Spark container image in the `./container` folder. The container handles the bootstrapping of all Spark configuration and provides a wrapper around the `spark-submit` CLI. At a high level, it provides:

* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application

Once the container has been built and pushed, the Amazon SageMaker Python SDK is used to run a managed, distributed Spark application that processes the dataset.
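
To make the idea of a `spark-submit` wrapper more concrete, here is a minimal sketch of how such a wrapper might assemble its command line. It is not the script shipped in `./container`; the helper name `build_spark_submit_command`, the application path, and the Spark property are all assumptions chosen for illustration.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit_command(
    app_path: str,
    app_args: List[str],
    master: str = "yarn",
    deploy_mode: str = "client",
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper, not the lab's script)."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Each Spark property becomes its own `--conf key=value` pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application file follows spark-submit's own options;
    # everything after it is passed through to the application.
    cmd.append(app_path)
    cmd += app_args
    return cmd


# Hypothetical path and arguments, for illustration only.
cmd = build_spark_submit_command(
    "/tmp/example/preprocess.py",
    ["s3_input_data", "s3://example-bucket/input/"],
    conf={"spark.executor.memory": "4g"},
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when it is later handed to `subprocess.run`.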

In [4]:
docker_repo = 'amazon-reviews-spark-processor'
docker_tag = 'latest'

In [None]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Sending build context to Docker daemon  4.385MB
Step 1/37 : FROM openjdk:8-jre-slim
8-jre-slim: Pulling from library/openjdk

50cd189d: Pulling fs layer
c1a94464: Pulling fs layer
926b0eec: Pulling fs layer
Digest: sha256:f7b69267a0028409a6a411b473a2bd66cc5bfe25850222ee166e520b09ee4a8c
Status: Downloaded newer image for openjdk:8-jre-slim
 ---> 8c3c0e49c694
Step 2/37 : RUN apt-get update
 ---> Running in a01aecd4d033
Get:1 http://deb.debian.org/debian buster InRelease [121 kB]
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [254 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7907 kB]
Get:6 http://

Get:24 http://deb.debian.org/debian buster/main amd64 python3-minimal amd64 3.7.3-1 [36.6 kB]
Get:25 http://deb.debian.org/debian buster/main amd64 libmpdec2 amd64 2.4.2-2 [87.2 kB]
Get:26 http://deb.debian.org/debian buster/main amd64 libpython3.7-stdlib amd64 3.7.3-2+deb10u2 [1732 kB]
Get:27 http://deb.debian.org/debian buster/main amd64 python3.7 amd64 3.7.3-2+deb10u2 [330 kB]
Get:28 http://deb.debian.org/debian buster/main amd64 libpython3-stdlib amd64 3.7.3-1 [20.0 kB]
Get:29 http://deb.debian.org/debian buster/main amd64 python3 amd64 3.7.3-1 [61.5 kB]
Get:30 http://deb.debian.org/debian buster/main amd64 netbase all 5.6 [19.4 kB]
Get:31 http://deb.debian.org/debian buster/main amd64 bzip2 amd64 1.0.6-9.2~deb10u1 [48.4 kB]
Get:32 http://deb.debian.org/debian buster/main amd64 libapparmor1 amd64 2.13.2-10 [94.7 kB]
Get:33 http://deb.debian.org/debian buster/main amd64 libdbus-1-3 amd64 1.12.20-0+deb10u1 [215 kB]
Get:34 http://deb.debian.org/debian buster/main amd64 dbus amd64 1.12

Get:118 http://deb.debian.org/debian buster/main amd64 libpython2.7 amd64 2.7.16-2+deb10u1 [1036 kB]
Get:119 http://deb.debian.org/debian buster/main amd64 libpython2.7-dev amd64 2.7.16-2+deb10u1 [31.6 MB]
Get:120 http://deb.debian.org/debian buster/main amd64 libpython2-dev amd64 2.7.16-1 [20.9 kB]
Get:121 http://deb.debian.org/debian buster/main amd64 libpython-dev amd64 2.7.16-1 [20.9 kB]
Get:122 http://deb.debian.org/debian buster/main amd64 libpython3.7 amd64 3.7.3-2+deb10u2 [1498 kB]
Get:123 http://deb.debian.org/debian buster/main amd64 libpython3.7-dev amd64 3.7.3-2+deb10u2 [48.4 MB]
Get:124 http://deb.debian.org/debian buster/main amd64 libpython3-dev amd64 3.7.3-1 [20.1 kB]
Get:125 http://deb.debian.org/debian buster/main amd64 libsasl2-modules amd64 2.1.27+dfsg-1+deb10u1 [104 kB]
Get:126 http://deb.debian.org/debian buster/main amd64 libxml2 amd64 2.9.4+dfsg1-7+deb10u1 [689 kB]
Get:127 http://deb.debian.org/debian buster/main amd64 manpages-dev all 4.16-2 [2232 kB]
Get:128 h

Selecting previously unselected package python3-minimal.
(Reading database ... 9963 files and directories currently installed.)
Preparing to unpack .../python3-minimal_3.7.3-1_amd64.deb ...
Unpacking python3-minimal (3.7.3-1) ...
Selecting previously unselected package libmpdec2:amd64.
Preparing to unpack .../libmpdec2_2.4.2-2_amd64.deb ...
Unpacking libmpdec2:amd64 (2.4.2-2) ...
Selecting previously unselected package libpython3.7-stdlib:amd64.
Preparing to unpack .../libpython3.7-stdlib_3.7.3-2+deb10u2_amd64.deb ...
Unpacking libpython3.7-stdlib:amd64 (3.7.3-2+deb10u2) ...
Selecting previously unselected package python3.7.
Preparing to unpack .../python3.7_3.7.3-2+deb10u2_amd64.deb ...
Unpacking python3.7 (3.7.3-2+deb10u2) ...
Selecting previously unselected package libpython3-stdlib:amd64.
Preparing to unpack .../libpython3-stdlib_3.7.3-1_amd64.deb ...
Unpacking libpython3-stdlib:amd64 (3.7.3-1) ...
Setting up python3-minimal (3.7.3-1) ...
Selecting previously unselected package pyt

Selecting previously unselected package libk5crypto3:amd64.
Preparing to unpack .../047-libk5crypto3_1.17-3+deb10u1_amd64.deb ...
Unpacking libk5crypto3:amd64 (1.17-3+deb10u1) ...
Selecting previously unselected package libkrb5-3:amd64.
Preparing to unpack .../048-libkrb5-3_1.17-3+deb10u1_amd64.deb ...
Unpacking libkrb5-3:amd64 (1.17-3+deb10u1) ...
Selecting previously unselected package libgssapi-krb5-2:amd64.
Preparing to unpack .../049-libgssapi-krb5-2_1.17-3+deb10u1_amd64.deb ...
Unpacking libgssapi-krb5-2:amd64 (1.17-3+deb10u1) ...
Selecting previously unselected package libsasl2-modules-db:amd64.
Preparing to unpack .../050-libsasl2-modules-db_2.1.27+dfsg-1+deb10u1_amd64.deb ...
Unpacking libsasl2-modules-db:amd64 (2.1.27+dfsg-1+deb10u1) ...
Selecting previously unselected package libsasl2-2:amd64.
Preparing to unpack .../051-libsasl2-2_2.1.27+dfsg-1+deb10u1_amd64.deb ...
Unpacking libsasl2-2:amd64 (2.1.27+dfsg-1+deb10u1) ...
Selecting previously unselected package libldap-common

Selecting previously unselected package libpython3-dev:amd64.
Preparing to unpack .../095-libpython3-dev_3.7.3-1_amd64.deb ...
Unpacking libpython3-dev:amd64 (3.7.3-1) ...
Selecting previously unselected package libsasl2-modules:amd64.
Preparing to unpack .../096-libsasl2-modules_2.1.27+dfsg-1+deb10u1_amd64.deb ...
Unpacking libsasl2-modules:amd64 (2.1.27+dfsg-1+deb10u1) ...
Selecting previously unselected package libxml2:amd64.
Preparing to unpack .../097-libxml2_2.9.4+dfsg1-7+deb10u1_amd64.deb ...
Unpacking libxml2:amd64 (2.9.4+dfsg1-7+deb10u1) ...
Selecting previously unselected package manpages-dev.
Preparing to unpack .../098-manpages-dev_4.16-2_all.deb ...
Unpacking manpages-dev (4.16-2) ...
Selecting previously unselected package publicsuffix.
Preparing to unpack .../099-publicsuffix_20190415.1030-1_all.deb ...
Unpacking publicsuffix (20190415.1030-1) ...
Selecting previously unselected package python2.7-dev.
Preparing to unpack .../100-python2.7-dev_2.7.16-2+deb10u1_amd64.deb .

Setting up libdbus-1-3:amd64 (1.12.20-0+deb10u1) ...
Setting up dbus (1.12.20-0+deb10u1) ...
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of start.
Setting up xz-utils (5.2.4-1) ...
update-alternatives: using /usr/bin/xz to provide /usr/bin/lzma (lzma) in auto mode
Setting up libquadmath0:amd64 (8.3.0-6) ...
Setting up libmpc3:amd64 (1.1.0-1) ...
Setting up libatomic1:amd64 (8.3.0-6) ...
Setting up patch (2.7.6-3+deb10u1) ...
Setting up libk5crypto3:amd64 (1.17-3+deb10u1) ...
Setting up libsasl2-2:amd64 (2.1.27+dfsg-1+deb10u1) ...
Setting up libmpx2:amd64 (8.3.0-6) ...
Setting up libubsan1:amd64 (8.3.0-6) ...
Setting up libisl19:amd64 (0.20-2) ...
Setting up libgirepository-1.0-1:amd64 (1.58.3-2) ...
Setting up libssh2-1:amd64 (1.8.0-2.1) ...
Setting up netbase (5.6) ...
Setting up python-pip-whl (18.1-5) ...
Setting up libkrb5-3:amd64 (1.17-3+deb10u1) ...
Setting up libmpdec2:amd64 (2.4.2-2) ...
Setting up libbinutils:amd64 (2.31.1-16) ..

 ---> Running in 7d10bca0f01c
Removing intermediate container 7d10bca0f01c
 ---> 57462dab1261
Step 7/37 : ENV PYTHONHASHSEED 0
 ---> Running in 54d6c1ab0a45
Removing intermediate container 54d6c1ab0a45
 ---> de525fa99d2e
Step 8/37 : ENV PYTHONIOENCODING UTF-8
 ---> Running in ba333b5bee18
Removing intermediate container ba333b5bee18
 ---> 8c0bf04bc4b6
Step 9/37 : ENV PIP_DISABLE_PIP_VERSION_CHECK 1
 ---> Running in d5058e613dc7
Removing intermediate container d5058e613dc7
 ---> 3e66672125fd
Step 10/37 : ENV HADOOP_VERSION 3.2.1
 ---> Running in 300ddb409133
Removing intermediate container 300ddb409133
 ---> 4b1ff6b5ab23
Step 11/37 : ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
 ---> Running in 39d03a68822d
Removing intermediate container 39d03a68822d
 ---> 6740c3f3450e
Step 12/37 : ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
 ---> Running in bf53d988dc97
Removing intermediate container bf53d988dc97
 ---> e78012115315
Step 13/37 : ENV PATH $PATH:$HADOOP_HOME/bin
 ---> Running in 3c071861

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image.

In [None]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

### Create the ECR repository and push the Docker image

In [None]:
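# `aws ecr get-login` was removed in AWS CLI v2; `get-login-password` works in v2 and in recent v1 releases.
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com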

### You can ignore a `RepositoryNotFoundException` error. The repository is created immediately afterwards.

In [None]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo
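
The same describe-or-create logic can also be expressed with boto3 instead of the shell `||` operator. The following is a sketch under that assumption; the helper name `ensure_ecr_repository` is hypothetical and not part of this lab.

```python
def ensure_ecr_repository(ecr_client, repository_name: str) -> str:
    """Return the repository URI, creating the repository if it is missing.

    `ecr_client` is expected to be a boto3 ECR client (hypothetical helper).
    """
    try:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository does not exist yet, so create it now.
        created = ecr_client.create_repository(repositoryName=repository_name)
        return created["repository"]["repositoryUri"]


# Usage inside the notebook (requires AWS credentials):
# import boto3
# repository_uri = ensure_ecr_repository(boto3.client("ecr"), docker_repo)
```

Catching only `RepositoryNotFoundException` means other failures, such as missing permissions, still surface as errors instead of silently triggering a create call.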

In [None]:
!docker tag $docker_repo:$docker_tag $image_uri

In [None]:
!docker push $image_uri

# Run the Job with Amazon SageMaker Processing Jobs

Run the Processing job with the Amazon SageMaker Python SDK. The job uses the Spark container together with the Spark ML script for processing, which is supplied in the job configuration.

In [None]:
!pygmentize src_dir/preprocess-spark-text-to-bert.py

In [None]:
from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(base_job_name='spark-amazon-reviews-processor',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            instance_count=2, # instance_count needs to be > 1 or you will see the following error:  "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
                            instance_type='ml.r5.xlarge',
                            env={'mode': 'python'})

# Setup Output Data

In [None]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)

In [None]:
train_data_bert_output = 's3://{}/{}/output/bert-train'.format(bucket, output_prefix)
validation_data_bert_output = 's3://{}/{}/output/bert-validation'.format(bucket, output_prefix)
test_data_bert_output = 's3://{}/{}/output/bert-test'.format(bucket, output_prefix)

print(train_data_bert_output)
print(validation_data_bert_output)
print(test_data_bert_output)

In [None]:
from sagemaker.processing import ProcessingOutput

processor.run(code='./src_dir/preprocess-spark-text-to-bert.py',
              arguments=['s3_input_data', s3_input_data,
                         's3_output_train_data', train_data_bert_output,
                         's3_output_validation_data', validation_data_bert_output,
                         's3_output_test_data', test_data_bert_output,                         
              ],
              # We need this dummy output to allow us to call 
              #    ProcessingJob.from_processing_name() later 
              #    to describe the job and poll for Completed status
              outputs=[
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='dummy-output',
                                        source='/opt/ml/processing/output')
              ],          
              logs=True,
              wait=False
)
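
The `arguments` list above alternates names and values. The contents of `preprocess-spark-text-to-bert.py` are not shown here, so the following is only a sketch of one way a script could turn such a list into a dictionary; the helper name `parse_name_value_args` and the example bucket are assumptions.

```python
from typing import Dict, List


def parse_name_value_args(argv: List[str]) -> Dict[str, str]:
    """Turn ['name1', 'value1', 'name2', 'value2', ...] into a dict (hypothetical helper)."""
    if len(argv) % 2 != 0:
        raise ValueError("expected alternating name/value arguments")
    # Even positions hold the names, odd positions hold the values.
    return dict(zip(argv[0::2], argv[1::2]))


# Example values only; the real job passes the S3 URIs defined in this notebook.
args = parse_name_value_args([
    "s3_input_data", "s3://example-bucket/input/",
    "s3_output_train_data", "s3://example-bucket/output/bert-train",
])
print(args["s3_input_data"])
```

Inside a script, the list would typically come from `sys.argv[1:]`.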

In [None]:
from IPython.display import display, HTML

spark_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, spark_processing_job_name)))


In [None]:
from IPython.display import display, HTML

# This is different than the job name because we are not using ProcessingOutput's in this Spark ML case.
spark_processing_job_s3_output_prefix = output_prefix

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, spark_processing_job_s3_output_prefix, region)))


# List Processing Jobs through boto3 Python SDK

In [None]:
import boto3

client = boto3.client('sagemaker')
client.list_processing_jobs()
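
Without arguments, `list_processing_jobs()` returns every processing job visible in the account and region. As a sketch, the call can be narrowed to the jobs started from this notebook by filtering on the job name; the helper name `recent_processing_jobs` is hypothetical.

```python
from typing import List, Tuple


def recent_processing_jobs(sm_client, name_contains: str, max_results: int = 10) -> List[Tuple[str, str]]:
    """Return (job name, status) pairs for the newest matching jobs (hypothetical helper).

    `sm_client` is expected to be a boto3 SageMaker client.
    """
    response = sm_client.list_processing_jobs(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=max_results,
    )
    return [
        (job["ProcessingJobName"], job["ProcessingJobStatus"])
        for job in response["ProcessingJobSummaries"]
    ]


# Usage inside the notebook (requires AWS credentials):
# for name, status in recent_processing_jobs(client, "spark-amazon-reviews-processor"):
#     print(name, status)
```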

# Please Wait Until the Processing Job Completes
Re-run this next cell until the job status shows `Completed`.

In [None]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=spark_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)

In [None]:
running_processor.wait()
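
As an alternative to re-running the status cell by hand or blocking on `wait()`, a polling loop with a timeout could look like the sketch below. The helper name `wait_for_processing_job` and the default intervals are assumptions; the `describe` argument is any callable that returns the job description, such as `running_processor.describe`.

```python
import time
from typing import Callable, Dict

# Statuses after which the job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(
    describe: Callable[[], Dict],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status and return it (hypothetical helper)."""
    waited = 0.0
    while True:
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage inside the notebook (requires a running job):
# final_status = wait_for_processing_job(running_processor.describe)
# print(final_status)
```

Returning the final status instead of raising on `Failed` leaves it to the caller to decide how to react to a failed job.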

<h2><span style="color:red">Please wait until the Processing Job above has completed.</span></h2>


# Verify the Processed Output Dataset

In [None]:
!aws s3 ls --recursive $train_data_bert_output/

In [None]:
!aws s3 ls --recursive $validation_data_bert_output/

In [None]:
!aws s3 ls --recursive $test_data_bert_output/

In [None]:
train_data = './data-tfrecord/bert-train'
validation_data = './data-tfrecord/bert-validation'
test_data = './data-tfrecord/bert-test'

!aws s3 cp $train_data_bert_output $train_data --recursive
!aws s3 cp $validation_data_bert_output $validation_data --recursive
!aws s3 cp $test_data_bert_output $test_data --recursive

In [None]:
%store train_data_bert_output train_data

In [None]:
%store validation_data_bert_output validation_data

In [None]:
%store test_data_bert_output test_data