# Feature Transformation with a SageMaker Processing Job

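A machine learning (ML) workflow consists of several steps. First, data is collected through various ETL jobs. Next, the data is pre-processed and turned into features using traditional techniques or prior knowledge. Finally, an ML model is trained with an algorithm.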

Data processing libraries such as Scikit-Learn are commonly used to pre-process a dataset before training. This notebook runs the processing workload on Amazon SageMaker Processing, using the Scikit-Learn installation that ships with it.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


## Contents

1. Setup Environment
1. Setup Input Data
1. Setup Output Data
1. Review the Processing Script
1. Run the Processing Job using Amazon SageMaker
1. Inspect the Processed Output Data

# Setup Environment

* An S3 bucket and prefix are needed for model training.
* The IAM role used for training and processing must have access to the dataset.

In [1]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Setup Input Data

In [2]:
s3_raw_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_raw_input_data)

s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/


In [3]:
!aws s3 ls $s3_raw_input_data
!aws s3 cp $s3_raw_input_data ./data --recursive

2020-09-23 12:58:31   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-09-23 12:58:33   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2020-09-23 12:58:38  193389086 amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz to data/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
download: s3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz to data/amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz


# Run the Processing Job using Amazon SageMaker

Use the Amazon SageMaker Python SDK to run the Processing job. The job configuration pairs the Scikit-Learn container with the processing script.

# Review the Processing Script

In [4]:
!pygmentize src_dir/preprocess-scikit-text-to-bert.py

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import functools
import multiprocessing

import pandas as pd
from datetime import datetime
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
import tensorf

    df_validation = df_validation.reset_index(drop=True)
    df_test = df_test.reset_index(drop=True)

    print('Shape of train dataframe {}'.format(df_train.shape))
    print('Shape of validation dataframe {}'.format(df_validation.shape))
    print('Shape of test dataframe {}'.format(df_test.shape))

    train_inputs = df_train.apply(lambda x: Input(text = x[DATA_COLUMN], 
                                                         label = x[LABEL_COLUMN]), axis = 1)

    validation_inputs = df_validation.apply(lambda x: Input(text = x[DATA_COLUMN], 
                                                            label = x[LABEL_COLUMN]), axis = 1)


Run this script as a processing job. Pass the Amazon S3 location of the raw data as the `source` argument of `ProcessingInput`.
The `destination` is the path inside the Docker container, under `/opt/ml/processing/input`, from which the script reads the data. All local paths inside the processing container must begin with `/opt/ml/processing/`.

The `run()` method also needs one or more `ProcessingOutput` objects, where `source` is the path to which the script writes its output data.
For outputs, `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates, in the form
`s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.
Each `ProcessingOutput` is also given an `output_name` so that the results are easier to look up after the job has run.
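The default output location described above can be sketched as a small string-building function. This is only an illustration of the naming pattern: `default_output_s3_uri` is a hypothetical helper, not an SDK call, and the account ID and job name below are placeholders.

```python
def default_output_s3_uri(region, account_id, processing_job_name, output_name):
    """Compose the default output location (hypothetical helper, not part of the SDK)."""
    bucket = "sagemaker-{}-{}".format(region, account_id)
    return "s3://{}/{}/output/{}".format(bucket, processing_job_name, output_name)


# Placeholder values, chosen only to show the shape of the resulting URI.
example_uri = default_output_s3_uri("us-east-1", "123456789012",
                                    "my-processing-job", "bert-train")
print(example_uri)
```

In the notebook itself the actual URIs are read back from the job description rather than rebuilt by hand, which avoids depending on this naming pattern.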

The `arguments` parameter of `run()` becomes the command-line arguments of the `preprocess-scikit-text-to-bert.py` script.
To spread the transformations across all worker nodes in the cluster, the input data is sharded with `ShardedByS3Key`.
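As a rough sketch of how a script might receive these command-line arguments, the example below parses the same flag names with `argparse`. The parser in the real script is not shown in this notebook, so the types, defaults, and the `str_to_bool` helper are assumptions. The helper is there because `str(False)` arrives as the text `'False'`, which a plain `type=bool` would treat as true.

```python
import argparse


def str_to_bool(value):
    """Interpret 'True'/'False' style strings (assumed helper, not from the original script)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def parse_args(argv=None):
    """Parse the flags that the notebook passes through `arguments` (defaults are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-split-percentage", type=float, default=0.90)
    parser.add_argument("--validation-split-percentage", type=float, default=0.05)
    parser.add_argument("--test-split-percentage", type=float, default=0.05)
    parser.add_argument("--max-seq-length", type=int, default=128)
    parser.add_argument("--balance-dataset", type=str_to_bool, default=False)
    return parser.parse_args(argv)


# The same list shape that processor.run(arguments=[...]) receives: flag, then string value.
args = parse_args(["--train-split-percentage", "0.9",
                   "--validation-split-percentage", "0.05",
                   "--test-split-percentage", "0.05",
                   "--max-seq-length", "128",
                   "--balance-dataset", "False"])
print(args)
```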

In [5]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             instance_type='ml.c5.2xlarge',
                             instance_count=2)

# Set the Train, Validation, Test Split Percentages

In [6]:
train_split_percentage = 0.90
validation_split_percentage = 0.05
test_split_percentage = 0.05
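To make the effect of these three percentages concrete, here is a minimal standard-library sketch that shuffles row indices and cuts them into three slices. The processing script imports `train_test_split` from scikit-learn, so this is a stand-in for illustration, not the script's own logic; `split_indices` and its `seed` parameter are hypothetical.

```python
import math
import random


def split_indices(n_rows, train_pct, validation_pct, test_pct, seed=0):
    """Shuffle row indices and cut them into train/validation/test slices."""
    if not math.isclose(train_pct + validation_pct + test_pct, 1.0):
        raise ValueError("split percentages must sum to 1.0")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_pct)
    n_validation = int(n_rows * validation_pct)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]  # whatever remains goes to test
    return train, validation, test


train_idx, validation_idx, test_idx = split_indices(1000, 0.90, 0.05, 0.05)
print(len(train_idx), len(validation_idx), len(test_idx))
```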

# Set the Maximum Sequence Length for the BERT Tokenizer

In [7]:
max_seq_length = 128
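The sequence length acts as a fixed width for every example: longer inputs are cut and shorter ones are padded. The sketch below shows that idea on a list of made-up token IDs. It is a simplification, since a real BERT pipeline also reserves room for special tokens when truncating, and `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(token_ids, max_seq_length, pad_id=0):
    """Return (input_ids, input_mask), both exactly max_seq_length long."""
    input_ids = list(token_ids[:max_seq_length])   # cut anything beyond the limit
    input_mask = [1] * len(input_ids)              # 1 marks a real token
    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding                # pad up to the fixed length
    input_mask += [0] * padding                    # 0 marks padding
    return input_ids, input_mask


short_ids, short_mask = pad_or_truncate([7, 8, 9], 8)
long_ids, long_mask = pad_or_truncate(list(range(1, 301)), 128)
print(short_ids, short_mask)
print(len(long_ids), len(long_mask))
```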

In [8]:
balance_dataset = False

In [9]:
processor.run(code='src_dir/preprocess-scikit-text-to-bert.py',
              inputs=[ProcessingInput(source=s3_raw_input_data,
                                      destination='/opt/ml/processing/input/data/',
                                      s3_data_distribution_type='ShardedByS3Key')
              ],
              outputs=[
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='bert-train',
                                        source='/opt/ml/processing/output/bert/train'),
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='bert-validation',
                                        source='/opt/ml/processing/output/bert/validation'),
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='bert-test',
                                        source='/opt/ml/processing/output/bert/test'),
              ],
              arguments=['--train-split-percentage', str(train_split_percentage),
                         '--validation-split-percentage', str(validation_split_percentage),
                         '--test-split-percentage', str(test_split_percentage),
                         '--max-seq-length', str(max_seq_length),
                         '--balance-dataset', str(balance_dataset)
              ],
              logs=True,
              wait=False)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  sagemaker-scikit-learn-2020-09-23-13-54-56-287
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/input/code/preprocess-scikit-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3U

In [10]:
scikit_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']
print(scikit_processing_job_name)

sagemaker-scikit-learn-2020-09-23-13-54-56-287


In [11]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, scikit_processing_job_name)))


In [12]:
from IPython.display import display, HTML

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, scikit_processing_job_name, region)))


# List Processing Jobs through boto3 Python SDK

In [13]:
sm.list_processing_jobs()

{'ProcessingJobSummaries': [{'ProcessingJobName': 'sagemaker-scikit-learn-2020-09-23-13-54-56-287',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/sagemaker-scikit-learn-2020-09-23-13-54-56-287',
   'CreationTime': datetime.datetime(2020, 9, 23, 13, 54, 56, 694000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 9, 23, 13, 54, 56, 694000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'InProgress'},
  {'ProcessingJobName': 'pr-1-c3217fde95994572ae7f34887d034cdd4c1f1d33ebda4766ac5704e558',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/pr-1-c3217fde95994572ae7f34887d034cdd4c1f1d33ebda4766ac5704e558',
   'CreationTime': datetime.datetime(2020, 9, 23, 13, 31, 51, 975000, tzinfo=tzlocal()),
   'ProcessingEndTime': datetime.datetime(2020, 9, 23, 13, 36, 18, 744000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 9, 23, 13, 36, 18, 746000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'Complete
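The response above is a plain dictionary, so it can be filtered with ordinary Python. The helper below is a hypothetical sketch that picks out job names by status from a response of that shape; the sample data is made up. boto3's `list_processing_jobs` also accepts a `StatusEquals` argument for filtering on the service side, which may be preferable when many jobs exist.

```python
def job_names_with_status(response, status):
    """Return the names of jobs in a list_processing_jobs-style response with the given status."""
    return [job["ProcessingJobName"]
            for job in response.get("ProcessingJobSummaries", [])
            if job.get("ProcessingJobStatus") == status]


# Made-up response with the same keys as the output above.
sample_response = {"ProcessingJobSummaries": [
    {"ProcessingJobName": "job-a", "ProcessingJobStatus": "InProgress"},
    {"ProcessingJobName": "job-b", "ProcessingJobStatus": "Completed"},
]}
in_progress = job_names_with_status(sample_response, "InProgress")
print(in_progress)
```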

# Please Wait Until the Processing Job Completes
Re-run this next cell until the job status shows `Completed`.

In [14]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=scikit_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)



InProgress


{'ProcessingInputs': [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/input/code/preprocess-scikit-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3Uri'

<h2><span style="color:red">Please wait until the Processing Job above has completed.</span></h2>

In [15]:
running_processor.wait(logs=False)

...............................................................!
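As an alternative to re-running the status cell by hand, the wait can be sketched as a polling loop. This is a minimal sketch, not what `wait()` does internally: `wait_for_job`, its polling interval, and its retry limit are assumptions. It accepts any zero-argument callable that returns a description dictionary.

```python
import time

# Statuses after which a processing job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call describe() until ProcessingJobStatus is terminal, then return that status."""
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal status in time")
```

In this notebook it could be called as `wait_for_job(running_processor.describe)`. The function returns on `Failed` and `Stopped` as well, so the caller should check the returned status before reading the outputs.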

# Inspect the Processed Output Data

Take a look at a few rows of the transformed dataset to make sure the processing was successful.

In [16]:
output_config = processing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'bert-train':
        processed_train_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-validation':
        processed_validation_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-test':
        processed_test_data_s3_uri = output['S3Output']['S3Uri']
        
print(processed_train_data_s3_uri)
print(processed_validation_data_s3_uri)
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-train
s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-validation
s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-test


In [17]:
!aws s3 ls $processed_train_data_s3_uri/

2020-09-23 14:00:04      50315 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-09-23 14:00:04     448588 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-09-23 13:59:24      71836 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [18]:
!aws s3 ls $processed_validation_data_s3_uri/

2020-09-23 14:00:04       3172 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-09-23 14:00:04      25310 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-09-23 13:59:25       4383 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord


In [19]:
!aws s3 ls $processed_test_data_s3_uri/

2020-09-23 14:00:05       3485 part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-09-23 14:00:05      25631 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-09-23 13:59:25       4444 part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
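The output file names follow a `part-<instance>-<input file>.tfrecord` pattern, which shows how the sharded inputs were divided between the two instances. The sketch below groups such names by instance. The regular expression is inferred from the listings above and is an assumption about the naming scheme, not something the script documents.

```python
import re
from collections import defaultdict

# Pattern inferred from the listed file names (assumption).
PART_PATTERN = re.compile(r"^part-(algo-\d+)-(.+)\.tfrecord$")


def group_by_instance(file_names):
    """Map each instance name (e.g. 'algo-1') to the input files it processed."""
    grouped = defaultdict(list)
    for name in file_names:
        match = PART_PATTERN.match(name)
        if match:
            grouped[match.group(1)].append(match.group(2))
    return dict(grouped)


listed = [
    "part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord",
    "part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord",
    "part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord",
]
shards = group_by_instance(listed)
print(shards)
```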


In [20]:
train_data = './data-tfrecord/bert-train'
validation_data = './data-tfrecord/bert-validation'
test_data = './data-tfrecord/bert-test'

!aws s3 cp $processed_train_data_s3_uri $train_data --recursive
!aws s3 cp $processed_validation_data_s3_uri $validation_data --recursive
!aws s3 cp $processed_test_data_s3_uri $test_data --recursive

download: s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord to data-tfrecord/bert-train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
download: s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord to data-tfrecord/bert-train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
download: s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-train/part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord to data-tfrecord/bert-train/part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
download: s3://sagemaker-us-east-1-322537213286/sagemaker-scikit-learn-2020-09-23-13-54-56-287/output/bert-validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord to data-tf
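One lightweight way to check the downloaded files is to count the records in each one. The sketch below walks the TFRecord framing (an 8-byte little-endian length, a 4-byte checksum, the payload, and another 4-byte checksum) using only the standard library. It skips checksum verification and does not decode the payloads, which would require TensorFlow; the function names are hypothetical.

```python
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw bytes of each record in a TFRecord file (checksums are skipped, not verified)."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)              # uint64 little-endian payload length
            if not header:
                break                       # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            f.read(4)                       # checksum of the length (ignored)
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated record payload")
            f.read(4)                       # checksum of the payload (ignored)
            yield payload


def count_records(path):
    """Count the records in one TFRecord file."""
    return sum(1 for _ in iter_tfrecord_payloads(path))
```

For example, `count_records` could be called on any of the `.tfrecord` files under `./data-tfrecord/bert-train` once the download has finished.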

# Pass Variables to the Next Notebook(s)

In [21]:
%store s3_raw_input_data 

Stored 's3_raw_input_data' (str)


In [22]:
%store max_seq_length

Stored 'max_seq_length' (int)


In [23]:
%store train_split_percentage

Stored 'train_split_percentage' (float)


In [24]:
%store validation_split_percentage

Stored 'validation_split_percentage' (float)


In [25]:
%store test_split_percentage

Stored 'test_split_percentage' (float)


In [26]:
%store processed_train_data_s3_uri train_data

Stored 'processed_train_data_s3_uri' (str)
Stored 'train_data' (str)


In [27]:
%store processed_validation_data_s3_uri validation_data

Stored 'processed_validation_data_s3_uri' (str)
Stored 'validation_data' (str)


In [28]:
%store processed_test_data_s3_uri test_data

Stored 'processed_test_data_s3_uri' (str)
Stored 'test_data' (str)


In [29]:
%store

Stored variables and their in-db values:
data_bucket                                  -> 'sagemaker-us-east-1-322537213286'
database_name                                -> 'awsdb_0920'
header_train_s3_uri                          -> 's3://sagemaker-us-east-1-322537213286/data/amazon
job_bucket                                   -> 'sagemaker-experiments-us-east-1-322537213286'
max_seq_length                               -> 128
noheader_train_s3_uri                        -> 's3://sagemaker-us-east-1-322537213286/data/amazon
processed_test_data_s3_uri                   -> 's3://sagemaker-us-east-1-322537213286/sagemaker-s
processed_train_data_s3_uri                  -> 's3://sagemaker-us-east-1-322537213286/sagemaker-s
processed_validation_data_s3_uri             -> 's3://sagemaker-us-east-1-322537213286/sagemaker-s
region_name                                  -> 'us-east-1'
role                                         -> 'arn:aws:iam::322537213286:role/service-role/AIMLW
s3_destination