# Train a Model with SageMaker Autopilot

We use SageMaker Autopilot to predict customer review star ratings.  
Autopilot takes a white-box approach to AutoML.

<img src="./img/autopilot.png" width="80%" align="left">

# Introduction

Amazon SageMaker Autopilot is a service that performs automated machine learning (AutoML) on a dataset. Autopilot is available through the UI or the AWS SDK. In this notebook, we use the AWS SDK to create and deploy a text-processing and sentiment-classification machine learning pipeline.

# Setup

* You need an S3 bucket and prefix for model training. Note: the dataset must be in the same region as the SageMaker job.
* The IAM role for this notebook must have access to the dataset.

In [1]:
%store -r

In [2]:
import boto3
import sagemaker
import pandas as pd

# Athena helpers, used later in this notebook to query sample reviews
from pyathena import connect
from pyathena.util import as_pandas

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
# region_name is restored by `%store -r` above
sm = boto3.Session().client(service_name='sagemaker', region_name=region_name)

# Dataset

In [3]:
print(header_train_s3_uri)

s3://sagemaker-us-east-1-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv


In [4]:
!aws s3 ls $header_train_s3_uri

2020-09-15 05:37:37   29582039 amazon_reviews_us_Digital_Software_v1_00_header.csv


# Setup the S3 Location for the Autopilot-Generated Assets 

* Jupyter Notebooks (Analysis)
* Python Scripts (Feature Engineering)
* Trained Models.

In [5]:
prefix_model_output = 'models/autopilot'

model_output_s3_uri = 's3://{}/{}'.format(job_bucket, prefix_model_output)

print(model_output_s3_uri)


s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot


In [6]:
max_candidates = 3

job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': 600,
      'MaxCandidates': max_candidates,
      'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': '{}'.format(header_train_s3_uri)
        }
      },
      'TargetAttributeName': 'star_rating'
    }
]

output_data_config = {
    'S3OutputPath': '{}'.format(model_output_s3_uri)
}

# Launch the SageMaker Autopilot job

Launch the Autopilot job with the `create_auto_ml_job` API.

In [7]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

AutoMLJobName: automl-dm-15-05-41-41


If `ProblemType` is not specified, Autopilot automatically detects whether the problem is regression or classification (binary or multi-class).

In [8]:
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=job_config,
#                      ProblemType="MulticlassClassification",  # optional; valid values: BinaryClassification, MulticlassClassification, Regression
                      RoleArn=role)

{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:automl-job/automl-dm-15-05-41-41',
 'ResponseMetadata': {'RequestId': '2bd23773-162b-4830-9d57-9ab92958ec70',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '2bd23773-162b-4830-9d57-9ab92958ec70',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '92',
   'date': 'Tue, 15 Sep 2020 05:41:43 GMT'},
  'RetryAttempts': 0}}
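
If you prefer to set the problem type explicitly rather than rely on auto-detection, the request can be assembled as a dictionary first. The sketch below is a hypothetical helper (`build_automl_job_request` is not part of boto3). It assumes, as the SageMaker Python SDK's own validation suggests, that `ProblemType` and `AutoMLJobObjective` should be supplied together or not at all.

```python
def build_automl_job_request(job_name, input_data_config, output_data_config,
                             job_config, role_arn,
                             problem_type=None, objective_metric=None):
    """Assemble keyword arguments for sm.create_auto_ml_job (illustrative helper)."""
    request = {
        'AutoMLJobName': job_name,
        'InputDataConfig': input_data_config,
        'OutputDataConfig': output_data_config,
        'AutoMLJobConfig': job_config,
        'RoleArn': role_arn,
    }
    # Assumption: the problem type and the objective metric travel together.
    if (problem_type is None) != (objective_metric is None):
        raise ValueError('Set problem_type and objective_metric together, or neither.')
    if problem_type is not None:
        request['ProblemType'] = problem_type
        request['AutoMLJobObjective'] = {'MetricName': objective_metric}
    return request


# Example: request a multi-class classifier optimized for accuracy.
request = build_automl_job_request(
    'automl-dm-example', [], {'S3OutputPath': 's3://example-bucket/out'}, {},
    'arn:aws:iam::123456789012:role/example-role',
    problem_type='MulticlassClassification', objective_metric='Accuracy')
print(sorted(request))
```

The resulting dictionary could then be passed as `sm.create_auto_ml_job(**request)`. The bucket, role ARN, and job name in the example are placeholders.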

# Tracking the progress of the Autopilot job

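At a high level, a SageMaker Autopilot job consists of the following stages.

* _Data Analysis_: summarizes and analyzes the data to determine which feature engineering techniques, hyperparameters, and models to explore.
* _Feature Engineering_: cleans, balances, and combines the data, then splits it into training and validation datasets.
* _Model Training and Tuning_: selects and trains the best-performing features, hyperparameters, and models.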
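
Each stage can be tracked with the same polling pattern, so it may help to factor it into one function. The following is a minimal sketch, not part of the original notebook: `wait_while_in_stage` is a hypothetical helper that takes any zero-argument callable returning a `describe_auto_ml_job`-style dictionary, which also makes it easy to try without an AWS account.

```python
import time

# Job-level statuses after which no further progress is expected
TERMINAL_STATUSES = ('Completed', 'Stopped', 'Failed')


def wait_while_in_stage(describe_job, stage, poll_seconds=30, sleep=time.sleep):
    """Poll until the job leaves `stage` or reaches a terminal status.

    describe_job: zero-argument callable returning a dict with the keys
    'AutoMLJobStatus' and 'AutoMLJobSecondaryStatus'.
    """
    while True:
        job = describe_job()
        status = job['AutoMLJobStatus']
        secondary = job['AutoMLJobSecondaryStatus']
        print(status, secondary)
        if status in TERMINAL_STATUSES or secondary != stage:
            return job
        sleep(poll_seconds)


# Dry run with canned responses instead of a real SageMaker client.
canned = iter([
    {'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'AnalyzingData'},
    {'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'FeatureEngineering'},
])
final = wait_while_in_stage(lambda: next(canned), 'AnalyzingData', sleep=lambda s: None)
```

With a real client, the callable would be something like `lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)`.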

# Analyzing Data

In [9]:
# Give the AutoML job above a few seconds to start
sleep(3)

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    # Poll every 30 seconds until the job leaves the data analysis stage
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Data analysis complete")

print(job)

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
{'AutoMLJobName': 'automl-dm-15-05-41-41', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:automl-job/automl-dm-15-05-41-41', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-1-322537213286/models/autopilot'}, 

In [10]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(job)

{'AutoMLJobName': 'automl-dm-15-05-41-41', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:automl-job/automl-dm-15-05-41-41', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-1-322537213286/models/autopilot'}, 'RoleArn': 'arn:aws:iam::322537213286:role/service-role/AIMLWorkshop-SageMakerIamRole-W344QRC65HN0', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 9, 15, 5, 41, 42, 51000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 9, 15, 5, 51, 36, 378000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'FeatureEngineering', 'GenerateCandidateDefiniti

# Feature Engineering

In [11]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Feature engineering complete")
    
print(job)

InProgress
FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress ModelTuning
Feature engineering complete
{'AutoMLJobName': 'automl-dm-15-05-41-41', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:automl-job/automl-dm-15-05-41-41', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': '

# Model Training and Tuning

In [12]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Model tuning complete")
    
print(job)

InProgress
ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
Completed MaxCandidatesReached
Model tuning complete
{'AutoMLJobName': 'automl-dm-15-05-41-41', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:322537213286:automl-job/automl-dm-15-05-41-41', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-1-322537213286/models/autopilot'}, 'RoleArn': 'arn:aws:iam::322537213286:role/service-role/AIMLW

<h2><span style="color:red">Please wait until the Autopilot job above has completed.</span></h2>

# View Generated Notebooks

Once data analysis completes, SageMaker Autopilot generates two notebooks:
* Data exploration
* Candidate definition

## Copy the Generated Notebooks Locally

In [13]:
# Keep everything before '/notebooks/'. str.rstrip() removes a set of characters, not a suffix.
generated_resources = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('/notebooks/', 1)[0]
generated_resources

's3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/sagemaker-automl-candidates/pr-1-5bfd335f662a4945bc40956e50bbeeca13b23f2d823541ce83dd868af6'
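
A note on deriving this prefix: `str.rstrip` treats its argument as a set of characters to remove, not as a suffix, so it can eat into the path when the preceding characters happen to be in that set. The sketch below illustrates the difference on a made-up S3 URI (the bucket and prefix are placeholders, and `strip_suffix` is a hypothetical helper).

```python
NOTEBOOK_SUFFIX = 'notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb'


def strip_suffix(text, suffix):
    """Remove `suffix` from the end of `text` only if it is really there."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


# Placeholder URI whose last folder ends in characters that also occur in the suffix
location = 's3://bucket/prefix/candidates/' + NOTEBOOK_SUFFIX

# rstrip keeps removing characters while they belong to the set, so it overshoots
overshoot = location.rstrip(NOTEBOOK_SUFFIX)
print(overshoot)   # s3://bucket/prefix/candid

# Suffix-aware removal stops exactly at the folder boundary
exact = strip_suffix(location, NOTEBOOK_SUFFIX).rstrip('/')
print(exact)       # s3://bucket/prefix/candidates
```

On Python 3.9 and later, `str.removesuffix` does the same job as `strip_suffix`.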

In [14]:
!aws s3 cp --recursive $generated_resources .

download: s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/sagemaker-automl-candidates/pr-1-5bfd335f662a4945bc40956e50bbeeca13b23f2d823541ce83dd868af6/generated_module/README.md to generated_module/README.md
download: s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/sagemaker-automl-candidates/pr-1-5bfd335f662a4945bc40956e50bbeeca13b23f2d823541ce83dd868af6/generated_module/candidate_data_processors/dpp0.py to generated_module/candidate_data_processors/dpp0.py
download: s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/sagemaker-automl-candidates/pr-1-5bfd335f662a4945bc40956e50bbeeca13b23f2d823541ce83dd868af6/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb to notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb
download: s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/sagemaker-automl-candidates/pr-1-5bfd335f662a49

## In the file view, open the following folders:
```
notebooks/
generated_module/
```
These folders contain the generated notebooks and the candidate data-processing code.

# Viewing All Candidates

Once model tuning completes, you can view all candidates explored by AutoML (pipeline evaluations with different hyperparameter combinations) and sort them by their final performance metric.

In [15]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                SortBy='FinalObjectiveMetricValue')['Candidates']
for index, candidate in enumerate(candidates):
    print(str(index) + "  " 
        + candidate['CandidateName'] + "  " 
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))

0  tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d  0.4752599895000458
1  tuning-job-1-49c2f0edb0fe4380aa-001-d85935a4  0.34101998805999756
2  tuning-job-1-49c2f0edb0fe4380aa-002-e4cc97f2  0.26423001289367676
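
If you want the ordering to be explicit rather than dependent on the API's sort settings, the candidate list can also be ranked locally. A minimal sketch, assuming each candidate dictionary has the `CandidateName` and `FinalAutoMLJobObjectiveMetric` keys shown above; `rank_candidates` is a hypothetical helper, and the sample values are copied from the output above.

```python
def rank_candidates(candidates, maximize=True):
    """Return candidates ordered from best to worst by their final objective metric."""
    return sorted(
        candidates,
        key=lambda c: c['FinalAutoMLJobObjectiveMetric']['Value'],
        reverse=maximize,  # accuracy-like metrics: higher is better
    )


# Sample data in the shape returned by list_candidates_for_auto_ml_job
sample = [
    {'CandidateName': 'tuning-job-1-49c2f0edb0fe4380aa-001-d85935a4',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.34101998805999756}},
    {'CandidateName': 'tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.4752599895000458}},
    {'CandidateName': 'tuning-job-1-49c2f0edb0fe4380aa-002-e4cc97f2',
     'FinalAutoMLJobObjectiveMetric': {'Value': 0.26423001289367676}},
]

ranked = rank_candidates(sample)
for index, candidate in enumerate(ranked):
    print(index, candidate['CandidateName'],
          candidate['FinalAutoMLJobObjectiveMetric']['Value'])
```

For metrics where lower is better (for example an error metric in a regression job), pass `maximize=False`.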


# Inspect Trials using Experiments API

SageMaker Autopilot automatically creates a new experiment and adds information about each trial to it.

In [16]:
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
        sagemaker_session=sess, 
        experiment_name=auto_ml_job_name + '-aws-auto-ml-job',
)

df = exp.dataframe()
df

Unnamed: 0,TrialComponentName,DisplayName,SourceArn,SageMaker.ImageUri,SageMaker.InstanceCount,SageMaker.InstanceType,SageMaker.VolumeSizeInGB,_tuning_objective_metric,alpha,colsample_bytree,...,code - MediaType,code - Value,input_channel_mode,job_name,label_col,max_dataset_size,SageMaker.ImageUri - MediaType,SageMaker.ImageUri - Value,ds - MediaType,ds - Value
0,tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d-a...,tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d-a...,arn:aws:sagemaker:us-east-1:322537213286:train...,683313688378.dkr.ecr.us-east-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,4e-06,0.95911,...,,,,,,,,,,
1,tuning-job-1-49c2f0edb0fe4380aa-001-d85935a4-a...,tuning-job-1-49c2f0edb0fe4380aa-001-d85935a4-a...,arn:aws:sagemaker:us-east-1:322537213286:train...,683313688378.dkr.ecr.us-east-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,1e-05,0.973225,...,,,,,,,,,,
2,tuning-job-1-49c2f0edb0fe4380aa-002-e4cc97f2-a...,tuning-job-1-49c2f0edb0fe4380aa-002-e4cc97f2-a...,arn:aws:sagemaker:us-east-1:322537213286:train...,683313688378.dkr.ecr.us-east-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,validation:accuracy,1e-05,0.973225,...,,,,,,,,,,
3,automl-dm--dpp0-rpb-1-61f5dd51becd43dcb609a37f...,automl-dm--dpp0-rpb-1-61f5dd51becd43dcb609a37f...,arn:aws:sagemaker:us-east-1:322537213286:trans...,,1.0,ml.m5.4xlarge,,,,,...,,,,,,,,,,
4,automl-dm--dpp1-csv-1-3f148d601b334e209134eadb...,automl-dm--dpp1-csv-1-3f148d601b334e209134eadb...,arn:aws:sagemaker:us-east-1:322537213286:trans...,,1.0,ml.m5.4xlarge,,,,,...,,,,,,,,,,
5,automl-dm--dpp1-1-f383abe089bb4cb788025585e4ed...,automl-dm--dpp1-1-f383abe089bb4cb788025585e4ed...,arn:aws:sagemaker:us-east-1:322537213286:train...,683313688378.dkr.ecr.us-east-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,,,,...,application/x-code,s3://sagemaker-experiments-us-east-1-322537213...,,,,,,,,
6,automl-dm--dpp0-1-712013059e1a408ca9d2e0836664...,automl-dm--dpp0-1-712013059e1a408ca9d2e0836664...,arn:aws:sagemaker:us-east-1:322537213286:train...,683313688378.dkr.ecr.us-east-1.amazonaws.com/s...,1.0,ml.m5.4xlarge,50.0,,,,...,application/x-code,s3://sagemaker-experiments-us-east-1-322537213...,,,,,,,,
7,db-1-5fbfb46d39a94024b6fa41d0df79e6de0bf48ed2f...,db-1-5fbfb46d39a94024b6fa41d0df79e6de0bf48ed2f...,arn:aws:sagemaker:us-east-1:322537213286:proce...,,1.0,ml.m5.2xlarge,250.0,,,,...,,,Pipe,automl-dm-15-05-41-41,star_rating,5.0,,120479346908.dkr.ecr.us-east-1.amazonaws.com/d...,,s3://sagemaker-experiments-us-east-1-322537213...


# Explore the Best Candidate

After the AutoML job has completed on the dataset and you have visualized the trials, you can create a model from any of the trials with a single API call. You can then deploy that model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). In this notebook, we select the best-performing trial and deploy it for inference.


In [17]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_identifier = best_candidate['CandidateName']

print("Candidate name: " + best_candidate_identifier)
print("Metric name: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("Metric value: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

Candidate name: tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d
Metric name: validation:accuracy
Metric value: 0.4752599895000458


In [18]:
best_candidate

{'CandidateName': 'tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d',
 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
  'Value': 0.4752599895000458},
 'ObjectiveStatus': 'Succeeded',
 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:322537213286:processing-job/db-1-5fbfb46d39a94024b6fa41d0df79e6de0bf48ed2f29d4a51b9b88cbb93',
   'CandidateStepName': 'db-1-5fbfb46d39a94024b6fa41d0df79e6de0bf48ed2f29d4a51b9b88cbb93'},
  {'CandidateStepType': 'AWS::SageMaker::TrainingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:322537213286:training-job/automl-dm--dpp0-1-712013059e1a408ca9d2e0836664de6407767ec4e3304',
   'CandidateStepName': 'automl-dm--dpp0-1-712013059e1a408ca9d2e0836664de6407767ec4e3304'},
  {'CandidateStepType': 'AWS::SageMaker::TransformJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:322537213286:transform-job/automl-dm--dpp0-rpb-1-61f5dd51becd43dcb609a37fbc336fa51762

The inference pipeline consists of a sequence of containers, each with its own model artifact.

In [19]:
for container in best_candidate['InferenceContainers']:
    print(container['Image'])
    print(container['ModelDataUrl'])
    print('======================')

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3
s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/data-processor-models/automl-dm--dpp0-1-712013059e1a408ca9d2e0836664de6407767ec4e3304/output/model.tar.gz
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/tuning/automl-dm--dpp0-xgb/tuning-job-1-49c2f0edb0fe4380aa-003-fe1a028d/output/model.tar.gz
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3
s3://sagemaker-experiments-us-east-1-322537213286/models/autopilot/automl-dm-15-05-41-41/data-processor-models/automl-dm--dpp0-1-712013059e1a408ca9d2e0836664de6407767ec4e3304/output/model.tar.gz


# Autopilot Chooses XGBoost as Best Candidate!


Autopilot selected hyperparameters and feature transformations that differ from the XGBoost defaults.

# Deploy the Model as a REST Endpoint

Batch transformations are also supported, but here we create a REST endpoint.

In [20]:
model_name = 'automl-dm-model-' + timestamp_suffix

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Best candidate model ARN: ', model_arn['ModelArn'])

Best candidate model ARN:  arn:aws:sagemaker:us-east-1:322537213286:model/automl-dm-model-15-05-41-41


In [21]:
# EndpointConfig name
# Reuse timestamp_suffix from the Autopilot job so that all resource names share one suffix
epc_name = 'automl-dm-epc-' + timestamp_suffix

# Endpoint name
autopilot_endpoint_name = 'automl-dm-ep-' + timestamp_suffix
variant_name = 'automl-dm-variant-' + timestamp_suffix

print(autopilot_endpoint_name)
print(variant_name)

automl-dm-ep-15-05-41-41
automl-dm-variant-15-05-41-41


In [22]:
ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m5.large',
                                                           'InitialInstanceCount': 1,
                                                           'ModelName': model_name,
                                                           'VariantName': variant_name}])


In [23]:
create_endpoint_response = sm.create_endpoint(EndpointName=autopilot_endpoint_name,
                                              EndpointConfigName=epc_name)
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-east-1:322537213286:endpoint/automl-dm-ep-15-05-41-41


# Wait for the Model to Deploy


Deploying the model takes about 5 to 10 minutes.

In [24]:
sm.get_waiter('endpoint_in_service').wait(EndpointName=autopilot_endpoint_name)
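
The waiter's polling interval and attempt limit can be tuned through boto3's `WaiterConfig` argument if the default timeout does not fit the deployment. A minimal sketch, with `wait_for_endpoint` as a hypothetical wrapper and the delay and attempt values chosen only for illustration:

```python
def wait_for_endpoint(sm_client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is in service, then return its status.

    With the illustrative defaults, the waiter gives up after roughly
    delay_seconds * max_attempts seconds and raises an exception.
    """
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={'Delay': delay_seconds, 'MaxAttempts': max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
```

In this notebook, the call would look like `wait_for_endpoint(sm, autopilot_endpoint_name)`.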


In [25]:
resp = sm.describe_endpoint(EndpointName=autopilot_endpoint_name)
status = resp['EndpointStatus']

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Arn: arn:aws:sagemaker:us-east-1:322537213286:endpoint/automl-dm-ep-15-05-41-41
Status: InService


# Test Our Model with Some Example Reviews
Let's do some ad-hoc predictions on our model.

In [26]:
sm_runtime = boto3.client('sagemaker-runtime')

In [27]:
csv_line_predict_positive = """I loved it!"""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_positive)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'5'

In [28]:
csv_line_predict_meh = """It's OK."""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_meh)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'3'

In [29]:
csv_line_predict_negative = """The worst product ever."""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_negative)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'1'
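
The three cells above repeat the same invocation, so a small wrapper keeps ad-hoc testing tidy. This is a sketch rather than part of the original notebook: `predict_star_rating` is a hypothetical helper, and replacing commas and newlines is an assumption that the endpoint expects a single CSV field per request (the Athena query later in this notebook applies the same comma replacement).

```python
def predict_star_rating(runtime_client, endpoint_name, review_text):
    """Send one review to the endpoint and return the predicted rating as a string."""
    # Assumption: the model takes one CSV column, so commas and newlines
    # inside the review would otherwise be read as field or row separators.
    body = review_text.replace(',', ' ').replace('\n', ' ')
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Accept='text/csv',
        Body=body,
    )
    return response['Body'].read().decode('utf-8').strip()
```

With the client created above, usage would be `predict_star_rating(sm_runtime, autopilot_endpoint_name, "I loved it!")`.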

# Create an Athena Table with Sample Reviews

In [30]:
table_name = 'product_reviews'

In [31]:
# Drop Table SQL Statement
statement = """
DROP TABLE IF EXISTS {}.{}
""".format(database_name, table_name)

print(statement)

cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)


DROP TABLE IF EXISTS awsdb.product_reviews



<pyathena.cursor.Cursor at 0x7fd2ccea1c18>

In [32]:
# Create Table SQL Statement
statement = """
CREATE TABLE IF NOT EXISTS {}.{} AS 
SELECT review_id, review_body , star_rating
FROM {}.{}
""".format(database_name, table_name, database_name, table_name_tsv)

print(statement)


CREATE TABLE IF NOT EXISTS awsdb.product_reviews AS 
SELECT review_id, review_body , star_rating
FROM awsdb.amazon_reviews_tsv



In [33]:
# Execute statement using connection cursor
from pyathena import connect
from pyathena.util import as_pandas

cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7fd2ccb9f5c0>

In [34]:
statement = 'SELECT * FROM {}.{} LIMIT 10'.format(database_name, table_name)
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7fd2ccb9f5c0>

In [35]:
df_show = as_pandas(cursor)
df_show

Unnamed: 0,review_id,review_body,star_rating
0,R1WYA5RZF1LTNZ,"After much research, here is my workaround for...",3
1,RH8KF6TUQSTKP,"I first purchased this game years ago on CD, l...",5
2,R2HBCGUNHZ68SZ,Pros:<br />*Visually gorgeous<br />*Puzzles no...,4
3,RLSQBMN9YVS0D,"I downloaded this game a couple days ago, and ...",5
4,R1I3Z8NJW9ACAI,"This game was cute, the premise is you are hol...",4
5,R1NLFSJQFJLFQF,"Purchased this game last night, downloaded ove...",1
6,R1A7UQGOCH0TIG,I bought this for my son and he absolutely lov...,4
7,R2TZ9ESCIQN60H,"First, a controller is a must (preferably [[AS...",3
8,R3ID738FD594O9,Great graphics and gameplay! it takes some tim...,5
9,R13TGUKBQLT0DQ,This is the next everquest expansion. It offe...,4


### Create the `AmazonAthenaPreviewFunctionality` workgroup to use the preview feature

In [36]:
import boto3
from botocore.exceptions import ClientError

client = boto3.client('athena')

try:
    response = client.create_work_group(Name='AmazonAthenaPreviewFunctionality') 
    print(response)
except ClientError as e:
    if e.response['Error']['Code'] == 'InvalidRequestException':
        print("Workgroup already exists.")
    else:
        print("Unexpected error: %s" % e)
    


Workgroup already exists.


# Create the SQL Query

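The `USING FUNCTION` clause specifies an ML with Athena function (preview), or multiple functions, that a subsequent `SELECT` statement in the query can reference. It defines the function name, the variable names, and the data types of the variables and the return value.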

In [37]:
statement = """
USING FUNCTION predict_star_rating(review_body VARCHAR) 
    RETURNS VARCHAR TYPE
    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = '{}'
)
SELECT review_id, review_body, star_rating, predict_star_rating(REPLACE(review_body, ',', ' ')) AS predicted_star_rating 
    FROM {}.{} LIMIT 10
    """.format(autopilot_endpoint_name, database_name, table_name)

print(statement)


USING FUNCTION predict_star_rating(review_body VARCHAR) 
    RETURNS VARCHAR TYPE
    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = 'automl-dm-ep-15-05-41-41'
)
SELECT review_id, review_body, star_rating, predict_star_rating(REPLACE(review_body, ',', ' ')) AS predicted_star_rating 
    FROM awsdb.product_reviews LIMIT 10
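
Because the statement is assembled with string formatting, it can be worth checking the interpolated names before sending the query to Athena. The sketch below wraps the same preview `USING FUNCTION` syntax in a hypothetical helper, `build_prediction_query`. The allowed character set is an assumption made for illustration, not an Athena rule.

```python
import re

# Assumption: endpoint, database, and table names use only these characters.
_SAFE_NAME = re.compile(r'[A-Za-z0-9_-]+')


def build_prediction_query(endpoint_name, database, table, limit=10):
    """Build the USING FUNCTION query used in this notebook for a given endpoint and table."""
    for value in (endpoint_name, database, table):
        if not _SAFE_NAME.fullmatch(value):
            raise ValueError('Unexpected characters in {!r}'.format(value))
    template = (
        "USING FUNCTION predict_star_rating(review_body VARCHAR)\n"
        "    RETURNS VARCHAR TYPE\n"
        "    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = '{endpoint}')\n"
        "SELECT review_id, review_body, star_rating,\n"
        "       predict_star_rating(REPLACE(review_body, ',', ' ')) AS predicted_star_rating\n"
        "FROM {database}.{table}\n"
        "LIMIT {limit}"
    )
    return template.format(endpoint=endpoint_name, database=database,
                           table=table, limit=int(limit))


print(build_prediction_query('automl-dm-ep-15-05-41-41', 'awsdb', 'product_reviews'))
```

The returned string can be passed to `cursor.execute` exactly like the hand-written statement.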
    


In [38]:
# Execute statement using connection cursor
cursor = connect(region_name=region_name, 
                 s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement, 
               work_group='AmazonAthenaPreviewFunctionality')

<pyathena.cursor.Cursor at 0x7fd2cc877fd0>

In [39]:
df = as_pandas(cursor)
df

Unnamed: 0,review_id,review_body,star_rating,predicted_star_rating
0,R1WYA5RZF1LTNZ,"After much research, here is my workaround for...",3,1
1,RH8KF6TUQSTKP,"I first purchased this game years ago on CD, l...",5,2
2,R2HBCGUNHZ68SZ,Pros:<br />*Visually gorgeous<br />*Puzzles no...,4,4
3,RLSQBMN9YVS0D,"I downloaded this game a couple days ago, and ...",5,2
4,R1I3Z8NJW9ACAI,"This game was cute, the premise is you are hol...",4,2
5,R1NLFSJQFJLFQF,"Purchased this game last night, downloaded ove...",1,1
6,R1A7UQGOCH0TIG,I bought this for my son and he absolutely lov...,4,3
7,R2TZ9ESCIQN60H,"First, a controller is a must (preferably [[AS...",3,3
8,R3ID738FD594O9,Great graphics and gameplay! it takes some tim...,5,4
9,R13TGUKBQLT0DQ,This is the next everquest expansion. It offe...,4,4


In [40]:
# sm.delete_endpoint(
#     EndpointName=autopilot_endpoint_name
# )
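
A running endpoint is billed until it is deleted, so cleaning up after the experiment is usually worthwhile. The commented-out cell above removes only the endpoint. The sketch below (a hypothetical helper, `delete_inference_resources`) also removes the endpoint configuration and the model, on the assumption that none of them is needed afterwards.

```python
def delete_inference_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

In this notebook, the call would be `delete_inference_resources(sm, autopilot_endpoint_name, epc_name, model_name)`.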

In [41]:
%store

Stored variables and their in-db values:
data_bucket                                  -> 'sagemaker-us-east-1-322537213286'
database_name                                -> 'awsdb'
header_train_s3_uri                          -> 's3://sagemaker-us-east-1-322537213286/data/amazon
job_bucket                                   -> 'sagemaker-experiments-us-east-1-322537213286'
max_seq_length                               -> 128
noheader_train_s3_uri                        -> 's3://sagemaker-us-east-1-322537213286/data/amazon
processed_test_data_s3_uri                   -> 's3://sagemaker-us-east-1-322537213286/sagemaker-s
processed_train_data_s3_uri                  -> 's3://sagemaker-us-east-1-322537213286/sagemaker-s
processed_validation_data_s3_uri             -> 's3://sagemaker-us-east-1-322537213286/sagemaker-s
region_name                                  -> 'us-east-1'
role                                         -> 'arn:aws:iam::322537213286:role/service-role/AIMLW
s3_destination_path