<br />

<div style="text-align: center;">
<font size="7">Preprocessing in SageMaker</font>
</div>
<br />
<div style="text-align: right;">
<font size="4">2020/11/11</font>
<br />
<font size="4">Ryutaro Hashimoto</font>
</div>

___

# Summary

- There are two ways to preprocess a dataset (standardization, train/test split, etc.): do it in a local environment and upload the processed dataset to S3 for training, or do it with cloud resources on AWS.
- This notebook takes the latter approach and uses a SageMaker Processing job to split the data into train and test sets.
- We use the prebuilt scikit-learn container image, but other images are available as well.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparing-sample-data" data-toc-modified-id="Preparing-sample-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparing sample data</a></span><ul class="toc-item"><li><span><a href="#Download-sample-data" data-toc-modified-id="Download-sample-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Download sample data</a></span></li><li><span><a href="#Load-the-downloaded-file-in-Pandas." data-toc-modified-id="Load-the-downloaded-file-in-Pandas.-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load the downloaded file in Pandas.</a></span></li><li><span><a href="#Upload-to-Amazon-S3" data-toc-modified-id="Upload-to-Amazon-S3-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Upload to Amazon S3</a></span></li><li><span><a href="#Delete-local-data" data-toc-modified-id="Delete-local-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Delete local data</a></span></li></ul></li><li><span><a href="#Starting-a-Processing-Job-Instance" data-toc-modified-id="Starting-a-Processing-Job-Instance-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Starting a Processing Job Instance</a></span></li><li><span><a href="#Execute-job" data-toc-modified-id="Execute-job-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Execute job</a></span></li><li><span><a href="#Get-the-save-location." data-toc-modified-id="Get-the-save-location.-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Get the save location.</a></span></li></ul></div>

## Preparing sample data

### Download sample data

In [1]:
%%sh
wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
# On the first run in SageMaker Studio, unzip may be missing; install it first:
# apt-get -y install unzip
unzip -o bank-additional.zip

Archive:  bank-additional.zip
   creating: bank-additional/
  inflating: bank-additional/bank-additional-names.txt  
  inflating: bank-additional/bank-additional.csv  
  inflating: bank-additional/bank-additional-full.csv  


--2021-02-05 11:35:25--  https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
Resolving sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)... 52.218.252.81
Connecting to sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)|52.218.252.81|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘bank-additional.zip’ not modified on server. Omitting download.



### Load the downloaded file in Pandas.

In [2]:
import pandas as pd
data = pd.read_csv('bank-additional/bank-additional-full.csv')
data.head()

|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


### Upload to Amazon S3

In [3]:
import boto3
s3_client = boto3.client('s3')

local_path = 'bank-additional/bank-additional-full.csv'
bucket_path = 'processing/bank-additional-full.csv'
data_bucket_name='sagemaker-tutorial-hashimoto'

s3_client.upload_file(local_path, data_bucket_name, bucket_path)
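
The processing job later reads this object by its full S3 URI. As a small sketch, the URI can be derived from the bucket name and key used above rather than typed out a second time. Note that `s3_uri` is a hypothetical helper introduced here for illustration; it is not part of boto3.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Return the s3:// URI for an object, given its bucket name and key."""
    # Strip a leading slash so the URI never contains '//' after the bucket.
    return f"s3://{bucket}/{key.lstrip('/')}"


# Values taken from the upload cell above.
input_uri = s3_uri('sagemaker-tutorial-hashimoto', 'processing/bank-additional-full.csv')
print(input_uri)
```

Deriving the URI from `data_bucket_name` and `bucket_path` keeps the upload location and the job input in sync if either value changes.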

### Delete local data

In [4]:
%%sh
rm -r bank-additional
rm  bank-additional.zip

## Starting a Processing Job Instance

In [8]:
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()

role_ARN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'    # replace with your IAM role ARN

# Processor backed by the prebuilt scikit-learn container image
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role_ARN,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1,
                                     base_job_name='example',
                                     tags=[{"Key": "name", "Value": "sato"},
                                           {"Key": "project", "Value": "project1"}]
                                     )

## Execute job

In [13]:
# define parameters
base = 's3://<file path>'  # S3 prefix under which the processed data is stored
# S3 URIs always use '/', so build them directly instead of with os.path.join
destination = {
    'train': f'{base}/train',
    'validation': f'{base}/validation',
    'test': f'{base}/test'
}
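
S3 URIs use forward slashes regardless of the local operating system, so one option is to build the per-split destinations with `posixpath` instead of OS-dependent path joining. The following is a minimal sketch; `build_destinations` is a hypothetical helper and the bucket prefix is a placeholder.

```python
import posixpath


def build_destinations(base: str, splits=('train', 'validation', 'test')) -> dict:
    """Map each split name to '<base>/<split>' using forward slashes only."""
    return {split: posixpath.join(base, split) for split in splits}


# 's3://example-bucket/processed' is a placeholder prefix, not a real location.
example = build_destinations('s3://example-bucket/processed')
print(example['train'])
```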

In [14]:
# run
from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor.run(code='s3://<file path>',
                      inputs=[ProcessingInput(source='s3://sagemaker-tutorial-hashimoto/processing/bank-additional-full.csv',
                                              destination='/opt/ml/processing/input',
                                              )],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train',
                                                destination=destination['train'],
                                               ),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test',
                                                destination=destination['test'],
                                               )],
                      arguments=['--train-test-split-ratio', '0.2']
                     )


Job Name:  example-2021-02-05-02-45-14-279
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-tutorial-hashimoto/processing/bank-additional-full.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-tutorial-hashimoto/processing/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-tutorial-hashimoto/processing/output/processed_data/train', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://sagemaker-tutorial-hashimoto/processing/output/processed_data/test', 'LocalPath': '/opt/ml/processing/test', 'S3Uploa
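
The script passed as `code` (`preprocessing.py` in the job log) is not shown in this notebook. As a rough sketch of how such a script might consume `--train-test-split-ratio`, the example below parses the flag and splits rows accordingly. Everything here is an assumption for illustration: the function names are hypothetical, the ratio is interpreted as the test fraction, and only the standard library is used so the sketch runs anywhere. Inside the scikit-learn container, `sklearn.model_selection.train_test_split` would be a natural alternative, with the input read from `/opt/ml/processing/input` and the results written to `/opt/ml/processing/train` and `/opt/ml/processing/test`.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the flag that run() forwards to the script via `arguments=`."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.2)
    return parser.parse_args(argv)


def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows reproducibly and return (train, test) lists.

    Assumption: test_ratio is the fraction of rows assigned to the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Simulate the arguments used in the run() call above on ten dummy rows.
args = parse_args(['--train-test-split-ratio', '0.2'])
train, test = split_rows(range(10), args.train_test_split_ratio)
print(len(train), len(test))
```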

## Get the save location.

The output files are saved to the following S3 locations:

In [15]:
preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    print(output['S3Output']['S3Uri'])

s3://sagemaker-tutorial-hashimoto/processing/output/processed_data/train
s3://sagemaker-tutorial-hashimoto/processing/output/processed_data/test
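
If the locations are needed later, for example as training inputs, it can be handy to look them up by output name instead of printing them. The sketch below assumes the response layout shown in the job log above; `output_uris` is a hypothetical helper, and the dictionary is a trimmed stand-in for the real `describe()` response so the example runs without AWS access.

```python
def output_uris(job_description: dict) -> dict:
    """Map each output name to its S3 URI from a describe() response."""
    outputs = job_description['ProcessingOutputConfig']['Outputs']
    return {out['OutputName']: out['S3Output']['S3Uri'] for out in outputs}


# Stand-in for sklearn_processor.jobs[-1].describe(), reduced to the fields used here.
sample_description = {
    'ProcessingOutputConfig': {
        'Outputs': [
            {'OutputName': 'train_data',
             'S3Output': {'S3Uri': 's3://sagemaker-tutorial-hashimoto/processing/output/processed_data/train'}},
            {'OutputName': 'test_data',
             'S3Output': {'S3Uri': 's3://sagemaker-tutorial-hashimoto/processing/output/processed_data/test'}},
        ]
    }
}
uris = output_uris(sample_description)
print(uris['train_data'])
```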

