# Train a Model with SageMaker Autopilot

We use SageMaker Autopilot to predict customer review ratings.  
Autopilot takes a white-box approach to AutoML.

<img src="./img/autopilot.png" width="80%" align="left">

# Introduction

Amazon SageMaker Autopilot is a service that performs automated machine learning (AutoML) on a dataset. Autopilot is available through the UI or the AWS SDK. In this notebook we use the AWS SDK to create and deploy a text-processing and sentiment-classification machine learning pipeline.

# Setup

* You need an S3 bucket and prefix for model training. Note: the dataset must be in the same region.
* The IAM role for this notebook must have access to the dataset.

In [3]:
%store -r

In [4]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
role = sagemaker.get_execution_role()
sm = boto3.Session().client(service_name='sagemaker', region_name=region_name)

# Dataset

In [5]:
print(header_train_s3_uri)

s3://sagemaker-us-east-2-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv


In [4]:
!aws s3 ls $header_train_s3_uri

2020-07-15 07:34:50   13700502 amazon_reviews_us_Digital_Software_v1_00_header.csv


# Setup the S3 Location for the Autopilot-Generated Assets 

* Jupyter Notebooks (Analysis)
* Python Scripts (Feature Engineering)
* Trained Models.

In [5]:
prefix_model_output = 'models/autopilot'

model_output_s3_uri = 's3://{}/{}'.format(job_bucket, prefix_model_output)

print(model_output_s3_uri)


s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot


In [6]:
max_candidates = 3

job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': 600,
      'MaxCandidates': max_candidates,
      'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': '{}'.format(header_train_s3_uri)
        }
      },
      'TargetAttributeName': 'star_rating'
    }
]

output_data_config = {
    'S3OutputPath': '{}'.format(model_output_s3_uri)
}

# Launch the SageMaker Autopilot job

We launch the Autopilot job with the `create_auto_ml_job` API.

In [7]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

AutoMLJobName: automl-dm-15-07-57-45
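The job name above encodes only the day, hour, minute, and second, so names can repeat from one month to the next. A minimal sketch of a stricter name builder follows. `make_job_name` is a hypothetical helper, not part of any SDK, and the 32-character limit is an assumption taken from the `CreateAutoMLJob` API reference that is worth verifying for your environment.

```python
from time import gmtime, strftime

# Assumed upper bound for AutoMLJobName; verify against the API reference.
MAX_JOB_NAME_LENGTH = 32


def make_job_name(prefix='automl-dm', now=None):
    """Return '<prefix>-<yymmdd>-<HHMMSS>' and fail early if it is too long."""
    now = gmtime() if now is None else now
    name = '{}-{}'.format(prefix, strftime('%y%m%d-%H%M%S', now))
    if len(name) > MAX_JOB_NAME_LENGTH:
        raise ValueError(
            'job name {!r} exceeds {} characters'.format(name, MAX_JOB_NAME_LENGTH))
    return name
```

Calling `make_job_name()` in place of the string concatenation keeps the rest of the notebook unchanged, since the result is still a plain string.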


If `ProblemType` is not specified, Autopilot automatically detects whether the problem is regression or classification (binary or multiclass).

In [8]:
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=job_config,
#                      ProblemType="Classification",
                      RoleArn=role)

{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:automl-job/automl-dm-15-07-57-45',
 'ResponseMetadata': {'RequestId': '545a3e73-38f4-4bb2-b037-bada11f42cc8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '545a3e73-38f4-4bb2-b037-bada11f42cc8',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '92',
   'date': 'Wed, 15 Jul 2020 07:59:07 GMT'},
  'RetryAttempts': 0}}
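If you would rather set the problem type yourself than rely on auto-detection, the request can be assembled by a small helper. The sketch below is hypothetical (`build_job_kwargs` is not an SDK function). It assumes that `ProblemType` accepts `BinaryClassification`, `MulticlassClassification`, or `Regression`, and that it has to be supplied together with an `AutoMLJobObjective`; check both assumptions against the boto3 reference before relying on them.

```python
VALID_PROBLEM_TYPES = {'BinaryClassification', 'MulticlassClassification', 'Regression'}


def build_job_kwargs(job_name, input_cfg, output_cfg, job_cfg, role_arn,
                     problem_type=None, objective_metric=None):
    """Build keyword arguments for create_auto_ml_job.

    Assumption: problem_type and objective_metric are given together or not at all.
    """
    if (problem_type is None) != (objective_metric is None):
        raise ValueError('set problem_type and objective_metric together, or neither')
    if problem_type is not None and problem_type not in VALID_PROBLEM_TYPES:
        raise ValueError('unknown problem type: {!r}'.format(problem_type))

    kwargs = {
        'AutoMLJobName': job_name,
        'InputDataConfig': input_cfg,
        'OutputDataConfig': output_cfg,
        'AutoMLJobConfig': job_cfg,
        'RoleArn': role_arn,
    }
    if problem_type is not None:
        kwargs['ProblemType'] = problem_type
        kwargs['AutoMLJobObjective'] = {'MetricName': objective_metric}
    return kwargs
```

With the variables from this notebook, a call could look like `sm.create_auto_ml_job(**build_job_kwargs(auto_ml_job_name, input_data_config, output_data_config, job_config, role, 'MulticlassClassification', 'Accuracy'))`.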

# Tracking the progress of the Autopilot job

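At a high level, a SageMaker Autopilot job consists of the following stages:

* _Data Analysis_: the data is summarized and analyzed to decide which feature engineering techniques, hyperparameters, and models to explore.
* _Feature Engineering_: the data is cleaned, balanced, combined, and split into training and validation sets.
* _Model Training and Tuning_: the best-performing features, hyperparameters, and models are selected and trained.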
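Each of these stages is reported through the job's secondary status, so one polling routine can serve all three. The sketch below is a hypothetical helper, not an SDK function: `describe` is any zero-argument callable that returns the job description, and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time


def wait_while_in_phase(describe, phase, poll_seconds=30, sleep=time.sleep):
    """Poll describe() until the job leaves `phase` or is no longer InProgress.

    Returns the last job description that was fetched.
    """
    while True:
        job = describe()
        status = job['AutoMLJobStatus']
        secondary = job['AutoMLJobSecondaryStatus']
        print(status, secondary)
        if status != 'InProgress' or secondary != phase:
            return job
        sleep(poll_seconds)
```

In this notebook the call would be something like `job = wait_while_in_phase(lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name), 'AnalyzingData')`, repeated with `'FeatureEngineering'` and `'ModelTuning'`.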

# Analyzing Data

In [9]:
# Sleep for a bit to ensure the AutoML job above has time to start
import time
time.sleep(3)

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    # ('InProgress') is a plain string, so `in` was a substring test; compare with == instead
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Data analysis complete")
    
print(job)

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
{'AutoMLJobName': 'automl-dm-15-07-57-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:automl-job/automl-dm-15-07-57-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-2-322537213286/models/autopilot'}, 'RoleArn': 'arn:aws:iam::322537213286:role/service-role/AmazonSageMaker-ExecutionRole-20200715T134454', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 7, 15, 7, 5

In [10]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(job)

{'AutoMLJobName': 'automl-dm-15-07-57-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:automl-job/automl-dm-15-07-57-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-2-322537213286/models/autopilot'}, 'RoleArn': 'arn:aws:iam::322537213286:role/service-role/AmazonSageMaker-ExecutionRole-20200715T134454', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 7, 15, 7, 59, 7, 410000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2020, 7, 15, 8, 7, 10, 272000, tzinfo=tzlocal()), 'AutoMLJobStatus': 'InProgress', 'AutoMLJobSecondaryStatus': 'FeatureEngineering', 'GenerateCandidateDefini

# Feature Engineering

In [11]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    # ('InProgress') is a plain string, so `in` was a substring test; compare with == instead
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Feature engineering complete")
    
print(job)

InProgress
FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress ModelTuning
Feature engineering complete
{'AutoMLJobName': 'automl-dm-15-07-57-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:automl-job/automl-dm-15-07-57-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-2-322537213286/models/autopilot'}, 'RoleArn': 'arn:aws:iam::3

# Model Training and Tuning

In [12]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    # ('InProgress') is a plain string, so `in` was a substring test; compare with == instead
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Model tuning complete")
    
print(job)

InProgress
ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
Completed MaxCandidatesReached
Model tuning complete
{'AutoMLJobName': 'automl-dm-15-07-57-45', 'AutoMLJobArn': 'arn:aws:sagemaker:us-east-2:322537213286:automl-job/automl-dm-15-07-57-45', 'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-322537213286/data/amazon_reviews_us_Digital_Software_v1_00_header.csv'}}, 'TargetAttributeName': 'star_rating'}], 'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-experiments-us-east-2-322537213286/models/autopilot'}, 'RoleArn': 'arn:aws:iam::322537213286:role/service-role/AmazonSageMaker-ExecutionRole-20200715T134454', 'AutoMLJobConfig': {'CompletionCriteria': {'MaxCandidates': 3, 'MaxRuntimePerTrainingJobInSeconds': 600, 'MaxAutoMLJobRuntimeInSeconds': 3600}}, 'CreationTime': datetime.datetime(2020, 7, 15, 7, 59, 7, 410000, tzinfo=tzlocal()), 'End

<h2><span style="color:red">Please wait until the Autopilot job above has completed before continuing.</span></h2>

# View Generated Notebooks

Once data analysis is complete, SageMaker Autopilot generates two notebooks:
* Data exploration
* Candidate definition

## Copy the Generated Notebooks Locally

In [13]:
# rstrip() strips a set of characters, not a suffix, so split on the notebooks/ path instead
generated_resources = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('/notebooks/', 1)[0]
generated_resources

's3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/sagemaker-automl-candidates/pr-1-a7e85eebb333444a908818ff98305a673f35ec91c4584c2db3e03406d8'
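One detail to watch when deriving this prefix: `str.rstrip` treats its argument as a set of characters rather than a literal suffix, so it can eat into the directory name when that name happens to end in letters that also occur in the argument. The sketch below uses a made-up URI to show the difference; `strip_suffix` is a hypothetical helper.

```python
def strip_suffix(text, suffix):
    """Remove `suffix` from the end of `text` only if it is really there."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


# Made-up URI for illustration; the directory name ends in letters that
# also appear in the suffix.
uri = 's3://bucket/prefix/pr-1-books/notebooks/Nb.ipynb'
suffix = '/notebooks/Nb.ipynb'

with_rstrip = uri.rstrip(suffix)          # strips any trailing characters found in `suffix`
with_helper = strip_suffix(uri, suffix)   # strips the exact suffix once

print(with_rstrip)
print(with_helper)
```

On Python 3.9 and later, `str.removesuffix` does the same job as the helper.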

In [14]:
!aws s3 cp --recursive $generated_resources .

download: s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/sagemaker-automl-candidates/pr-1-a7e85eebb333444a908818ff98305a673f35ec91c4584c2db3e03406d8/generated_module/README.md to generated_module/README.md
download: s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/sagemaker-automl-candidates/pr-1-a7e85eebb333444a908818ff98305a673f35ec91c4584c2db3e03406d8/generated_module/MANIFEST.in to generated_module/MANIFEST.in
download: s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/sagemaker-automl-candidates/pr-1-a7e85eebb333444a908818ff98305a673f35ec91c4584c2db3e03406d8/generated_module/candidate_data_processors/dpp2.py to generated_module/candidate_data_processors/dpp2.py
download: s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/sagemaker-automl-candidates/pr-1-a7e85eebb333444a908818ff98305a673f35ec91c4584c2db3e03406d8/notebooks/sag

## In the file view, open the following folders:
```
notebooks/
generated_module/
```
These folders contain a lot of useful information about the job.

# Viewing All Candidates

Once model tuning is complete, you can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by AutoML and sort them by their final performance metric.

In [15]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                SortBy='FinalObjectiveMetricValue')['Candidates']
for index, candidate in enumerate(candidates):
    print(str(index) + "  " 
        + candidate['CandidateName'] + "  " 
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))

0  tuning-job-1-59e50d54d0db4ef385-001-6c394bb8  0.3930000066757202
1  tuning-job-1-59e50d54d0db4ef385-003-b58fcb37  0.3921999931335449
2  tuning-job-1-59e50d54d0db4ef385-002-68cf01cc  0.27834001183509827
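If you want the ordering to be explicit rather than depending on the order the API returns, the candidates can also be ranked on the client side. The sketch below is a hypothetical helper that only relies on the `CandidateName` and `FinalAutoMLJobObjectiveMetric` fields printed above.

```python
def rank_candidates(candidates, higher_is_better=True):
    """Return (name, metric value) pairs sorted from best to worst."""
    rows = [
        (c['CandidateName'], c['FinalAutoMLJobObjectiveMetric']['Value'])
        for c in candidates
    ]
    return sorted(rows, key=lambda row: row[1], reverse=higher_is_better)
```

For an accuracy-style metric such as the one in this run, `rank_candidates(candidates)[0]` is the best candidate; for an error metric such as MSE you would pass `higher_is_better=False`.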


# Inspect Trials using Experiments API

SageMaker Autopilot automatically creates a new experiment and adds information about each trial to it.

In [16]:
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=auto_ml_job_name + '-aws-auto-ml-job',
)

df = exp.dataframe()
print(df)

                                  TrialComponentName  \
0  tuning-job-1-59e50d54d0db4ef385-002-68cf01cc-a...   
1  tuning-job-1-59e50d54d0db4ef385-001-6c394bb8-a...   
2  tuning-job-1-59e50d54d0db4ef385-003-b58fcb37-a...   
3  automl-dm--dpp1-csv-1-d7658af495e342a59b460ba1...   
4  automl-dm--dpp2-rpb-1-ca4dd172a4064465a8ed8bd0...   
5  automl-dm--dpp1-1-8e121f58d9e2434eac646be3b0a6...   
6  automl-dm--dpp2-1-8983e917f293463992b38cee3ac8...   
7  db-1-a212206ba1ad4f49ab9d381bb61a6760c22e9967d...   

                                         DisplayName  \
0  tuning-job-1-59e50d54d0db4ef385-002-68cf01cc-a...   
1  tuning-job-1-59e50d54d0db4ef385-001-6c394bb8-a...   
2  tuning-job-1-59e50d54d0db4ef385-003-b58fcb37-a...   
3  automl-dm--dpp1-csv-1-d7658af495e342a59b460ba1...   
4  automl-dm--dpp2-rpb-1-ca4dd172a4064465a8ed8bd0...   
5  automl-dm--dpp1-1-8e121f58d9e2434eac646be3b0a6...   
6  automl-dm--dpp2-1-8983e917f293463992b38cee3ac8...   
7  db-1-a212206ba1ad4f49ab9d381bb61a6760c22e996

# Explore the Best Candidate

Once the AutoML job has completed on the dataset and you have inspected the trials, you can create a model from any of the trials with a single API call. You can then deploy the model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). In this notebook we select the best-performing trial and deploy it for inference.


In [17]:
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_identifier = best_candidate['CandidateName']

print("Candidate name: " + best_candidate_identifier)
print("Metric name: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("Metric value: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

Candidate name: tuning-job-1-59e50d54d0db4ef385-001-6c394bb8
Metric name: validation:accuracy
Metric value: 0.3930000066757202


In [18]:
best_candidate

{'CandidateName': 'tuning-job-1-59e50d54d0db4ef385-001-6c394bb8',
 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:accuracy',
  'Value': 0.3930000066757202},
 'ObjectiveStatus': 'Succeeded',
 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:322537213286:processing-job/db-1-a212206ba1ad4f49ab9d381bb61a6760c22e9967d3a7444e8ba34a543d',
   'CandidateStepName': 'db-1-a212206ba1ad4f49ab9d381bb61a6760c22e9967d3a7444e8ba34a543d'},
  {'CandidateStepType': 'AWS::SageMaker::TrainingJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:322537213286:training-job/automl-dm--dpp2-1-8983e917f293463992b38cee3ac85942e9397f3b4d3c4',
   'CandidateStepName': 'automl-dm--dpp2-1-8983e917f293463992b38cee3ac85942e9397f3b4d3c4'},
  {'CandidateStepType': 'AWS::SageMaker::TransformJob',
   'CandidateStepArn': 'arn:aws:sagemaker:us-east-2:322537213286:transform-job/automl-dm--dpp2-rpb-1-ca4dd172a4064465a8ed8bd06d84601e0e23

The inference pipeline consists of several containers, each with its own model artifact.

In [19]:
for container in best_candidate['InferenceContainers']:
    print(container['Image'])
    print(container['ModelDataUrl'])
    print('======================')

257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3
s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/data-processor-models/automl-dm--dpp2-1-8983e917f293463992b38cee3ac85942e9397f3b4d3c4/output/model.tar.gz
257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/tuning/automl-dm--dpp2-xgb/tuning-job-1-59e50d54d0db4ef385-001-6c394bb8/output/model.tar.gz
257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-sklearn-automl:0.1.0-cpu-py3
s3://sagemaker-experiments-us-east-2-322537213286/models/autopilot/automl-dm-15-07-57-45/data-processor-models/automl-dm--dpp2-1-8983e917f293463992b38cee3ac85942e9397f3b4d3c4/output/model.tar.gz
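The output above lists three containers: a scikit-learn container, an XGBoost container, and a second scikit-learn container, presumably for pre-processing, prediction, and post-processing. To make the stage order easier to read, the list can be condensed as in the sketch below. `summarize_containers` is a hypothetical helper that assumes image URIs of the form `<registry>/<repository>:<tag>`.

```python
def summarize_containers(containers):
    """Return (position, image repository name, model artifact URL) per container."""
    summary = []
    for position, container in enumerate(containers, start=1):
        image = container['Image']
        # '<registry>/<repository>:<tag>' -> '<repository>'
        repository = image.rsplit('/', 1)[-1].split(':', 1)[0]
        summary.append((position, repository, container['ModelDataUrl']))
    return summary
```

Calling `summarize_containers(best_candidate['InferenceContainers'])` returns one tuple per pipeline stage, in execution order.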


# Autopilot Chooses XGBoost as Best Candidate!


Autopilot chooses hyperparameters and feature transformations that differ from the XGBoost defaults.

# Deploy the Model as a REST Endpoint

Batch transform is also supported, but here we create a REST endpoint.

In [20]:
model_name = 'automl-dm-model-' + timestamp_suffix

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Best candidate model ARN: ', model_arn['ModelArn'])

Best candidate model ARN:  arn:aws:sagemaker:us-east-2:322537213286:model/automl-dm-model-15-07-57-45


In [21]:
# EndpointConfig name
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
epc_name = 'automl-dm-epc-' + timestamp_suffix

# Endpoint name
autopilot_endpoint_name = 'automl-dm-ep-' + timestamp_suffix
variant_name = 'automl-dm-variant-' + timestamp_suffix

print(autopilot_endpoint_name)
print(variant_name)

automl-dm-ep-15-08-17-49
automl-dm-variant-15-08-17-49


In [22]:
ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m5.large',
                                                           'InitialInstanceCount': 1,
                                                           'ModelName': model_name,
                                                           'VariantName': variant_name}])


In [23]:
create_endpoint_response = sm.create_endpoint(EndpointName=autopilot_endpoint_name,
                                              EndpointConfigName=epc_name)
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-east-2:322537213286:endpoint/automl-dm-ep-15-08-17-49


# Wait for the Model to Deploy


Deploying the model takes about 5 to 10 minutes.

In [24]:
sm.get_waiter('endpoint_in_service').wait(EndpointName=autopilot_endpoint_name)
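The waiter polls with its default delay and attempt count. If you want to bound the wait explicitly, boto3 waiters accept a `WaiterConfig` argument. The sketch below wraps that in a hypothetical helper; the default values are illustrative assumptions (30 seconds x 40 attempts, roughly 20 minutes), not recommendations from the original notebook.

```python
def wait_for_endpoint(sm_client, endpoint_name, delay_seconds=30, max_attempts=40):
    """Block until the endpoint is in service, then return its status string."""
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={'Delay': delay_seconds, 'MaxAttempts': max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
```

Here `sm_client` would be the `sm` client created in the setup cell, for example `wait_for_endpoint(sm, autopilot_endpoint_name)`.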


In [25]:
resp = sm.describe_endpoint(EndpointName=autopilot_endpoint_name)
status = resp['EndpointStatus']

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Arn: arn:aws:sagemaker:us-east-2:322537213286:endpoint/automl-dm-ep-15-08-17-49
Status: InService


# Test Our Model with Some Example Reviews
Let's do some ad-hoc predictions on our model.

In [26]:
sm_runtime = boto3.client('sagemaker-runtime')

In [27]:
csv_line_predict_positive = """I loved it!"""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_positive)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'5'

In [28]:
csv_line_predict_meh = """It's OK."""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_meh)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'3'

In [39]:
csv_line_predict_negative = """The worst product ever."""

response = sm_runtime.invoke_endpoint(EndpointName=autopilot_endpoint_name, ContentType='text/csv', Accept='text/csv', Body=csv_line_predict_negative)

response_body = response['Body'].read().decode('utf-8').strip()
response_body

'5'
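The three cells above repeat the same call, so a small wrapper keeps ad-hoc testing shorter. The sketch below is a hypothetical helper. It assumes the endpoint expects a single CSV column, which is why commas and line breaks are replaced with spaces, mirroring the `REPLACE(review_body, ',', ' ')` used in the Athena query later in this notebook.

```python
def predict_star_rating(runtime_client, endpoint_name, review_text):
    """Send one review to the endpoint and return the predicted rating as a string."""
    # Keep the review in a single CSV column: drop commas and collapse whitespace.
    single_column = ' '.join(review_text.replace(',', ' ').split())
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Accept='text/csv',
        Body=single_column.encode('utf-8'),
    )
    return response['Body'].read().decode('utf-8').strip()
```

With the client from this section, a call would look like `predict_star_rating(sm_runtime, autopilot_endpoint_name, "It's OK.")`.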

# Create an Athena Table with Sample Reviews

In [1]:
table_name = 'product_reviews'

In [6]:
# Create Table SQL Statement
statement = """
CREATE TABLE IF NOT EXISTS {}.{} AS 
SELECT review_id, review_body 
FROM {}.{}
""".format(database_name, table_name, database_name, table_name_tsv)

print(statement)


CREATE TABLE IF NOT EXISTS awsdb.product_reviews AS 
SELECT review_id, review_body 
FROM awsdb.amazon_reviews_tsv



In [9]:
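# Execute statement using connection cursor
# Assumes `region_name` and `s3_staging_dir` were defined earlier (e.g. restored by `%store -r`)
from pyathena import connect

cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)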

<pyathena.cursor.Cursor at 0x7f2a3ed0e518>

In [11]:
statement = 'SELECT * FROM {}.{} LIMIT 10'.format(database_name, table_name)
cursor.execute(statement)

<pyathena.cursor.Cursor at 0x7f2a3ed0e518>

In [12]:
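# In newer PyAthena releases, as_pandas lives in pyathena.pandas.util
from pyathena.util import as_pandas

df_show = as_pandas(cursor)
df_show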

| | review_id | review_body |
|---|---|---|
| 0 | RVZDS4YVDNW02 | Ok ok... DRM issues and server crashes aside (... |
| 1 | R372C7VSB80JQB | I can only get these cards when I want to buy ... |
| 2 | R13NFCN9LYFE3O | This expansion adds so much that it seems almo... |
| 3 | R3DPQ192I7OCV7 | Seriously, this game is so unique and you can ... |
| 4 | R285LPWNSGV2QX | I wanted to see what Part 3 was and this game ... |
| 5 | R2FGX5RC31P9WN | There is not a easy way to buy new credits to ... |
| 6 | RJCIGXQEY8YJ3 | I use to have Deus Ex 1 on CD-ROM It was a ver... |
| 7 | R2UBPEFGVRTAS9 | This I'd quite possibly one of the best games ... |
| 8 | R35UFJJWQDMKI3 | Very poor help screens, constant crashes. Sim... |
| 9 | R1QARQS63VPODI | This is a great game if you like slow paced, t... |


### Add the `AmazonAthenaPreviewFunctionality` Workgroup to Use the Preview Feature

In [30]:
import boto3
from botocore.exceptions import ClientError

client = boto3.client('athena')

try:
    response = client.create_work_group(Name='AmazonAthenaPreviewFunctionality') 
    print(response)
except ClientError as e:
    if e.response['Error']['Code'] == 'InvalidRequestException':
        print("Workgroup already exists.")
    else:
        print("Unexpected error: %s" % e)
    


Workgroup already exists.


# Create the SQL Query

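The `USING FUNCTION` clause specifies an ML with Athena (Preview) function, or several such functions, that the subsequent `SELECT` statement in the query can reference. You define the function name, the variable names, and the data types of the variables and of the return value.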

In [31]:
statement = """
USING FUNCTION predict_star_rating(review_body VARCHAR) 
    RETURNS VARCHAR TYPE
    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = '{}'
)
SELECT review_id, review_body, predict_star_rating(REPLACE(review_body, ',', ' ')) AS predicted_star_rating 
    FROM {}.{} LIMIT 10
    """.format(autopilot_endpoint_name, database_name, table_name)

print(statement)


USING FUNCTION predict_star_rating(review_body VARCHAR) 
    RETURNS VARCHAR TYPE
    SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = 'automl-dm-ep-15-08-17-49'
)
SELECT review_id, review_body, predict_star_rating(REPLACE(review_body, ',', ' ')) AS predicted_star_rating 
    FROM awsdb.product_reviews LIMIT 10
    


In [32]:
# Execute statement using connection cursor
cursor = connect(region_name=region_name, 
                 s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement, 
               work_group='AmazonAthenaPreviewFunctionality')  # possibly available only in us-east-1 (N. Virginia)?

ERROR:pyathena.common:Failed to execute query.
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pyathena/common.py", line 236, in _execute
    **request
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pyathena/util.py", line 344, in retry_api_call
    return retry(func, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tenacity/__init__.py", line 409, in call
    do = self.iter(retry_state=retry_state)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/tenacity/__init__.py", line 356, in iter
    return fut.result()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/ec2-user/anaconda3/envs/python3

DatabaseError: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: The given Workgroup name is a reserved Athena Workgroup Name. Please submit your query using a different Workgroup.
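The error indicates that the preview workgroup name is now reserved. To my understanding, Athena's SageMaker integration has since become generally available with a slightly different clause, `USING EXTERNAL FUNCTION ... SAGEMAKER '<endpoint>'`, that runs in an ordinary workgroup. The sketch below builds such a query. Treat the syntax as an assumption to check against the current Athena documentation; `build_prediction_query` is a hypothetical helper.

```python
def build_prediction_query(endpoint_name, database, table, limit=10):
    """Build an Athena query that calls a SageMaker endpoint per row.

    Assumes the generally available USING EXTERNAL FUNCTION syntax.
    """
    return (
        "USING EXTERNAL FUNCTION predict_star_rating(review_body VARCHAR)\n"
        "    RETURNS VARCHAR\n"
        "    SAGEMAKER '{endpoint}'\n"
        "SELECT review_id, review_body,\n"
        "       predict_star_rating(REPLACE(review_body, ',', ' ')) AS predicted_star_rating\n"
        "FROM {database}.{table}\n"
        "LIMIT {limit}"
    ).format(endpoint=endpoint_name, database=database, table=table, limit=int(limit))
```

The resulting string could then be passed to `cursor.execute(...)` without the `work_group` argument, for example `cursor.execute(build_prediction_query(autopilot_endpoint_name, database_name, table_name))`.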

In [None]:
# Use the SageMaker client (`sm`); `client` above is the Athena client
sm.delete_endpoint(
    EndpointName=autopilot_endpoint_name
)
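Deleting the endpoint stops the hosting instances, but the endpoint configuration and the model created earlier remain registered. A minimal sketch of a fuller cleanup follows. `delete_endpoint_resources` is a hypothetical helper, and it expects a SageMaker client such as `sm` rather than the Athena client.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint first, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With the names defined in this notebook, the call would be `delete_endpoint_resources(sm, autopilot_endpoint_name, epc_name, model_name)`.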

In [22]:
%store

Stored variables and their in-db values:
autopilot_endpoint_name                      -> 'automl-dm-ep-15-08-17-49'
comprehend_endpoint_arn                      -> 'arn:aws:comprehend:us-east-2:322537213286:documen
data_bucket                                  -> 'sagemaker-us-east-2-322537213286'
database_name                                -> 'awsdb'
header_train_s3_uri                          -> 's3://sagemaker-us-east-2-322537213286/data/amazon
job_bucket                                   -> 'sagemaker-experiments-us-east-2-322537213286'
max_seq_length                               -> 128
noheader_train_s3_uri                        -> 's3://sagemaker-us-east-2-322537213286/data/amazon
processed_test_data_s3_uri                   -> 's3://sagemaker-us-east-2-322537213286/sagemaker-s
processed_train_data_s3_uri                  -> 's3://sagemaker-us-east-2-322537213286/sagemaker-s
processed_validation_data_s3_uri             -> 's3://sagemaker-us-east-2-322537213286/sagemaker-s
regi