# Train a Model with SageMaker Autopilot

We will use Autopilot to predict the star rating of customer reviews. Autopilot implements a transparent approach to AutoML. 

For more details on Autopilot, see this [Amazon Science publication](https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf).

<img src="img/autopilot-transparent.png" width="80%" align="left">

# Introduction

Amazon SageMaker Autopilot is a service to perform automated machine learning (AutoML) on your datasets.  Autopilot is available through the UI or AWS SDK.  In this notebook, we will use the AWS SDK to create and deploy a text processing and star rating classification machine learning pipeline.

# Prerequisite

## Make sure the previous notebook has run fully and prepared the dataset.

# Setup

Let's start by specifying:

* The S3 bucket and prefix to use to train our model.  _Note:  This should be in the same region as this notebook._
* The IAM role that gives this notebook access to your data.

# Note:  This notebook will take some time.  Feel free to continue to the next notebooks whenever you are waiting for the current notebook to finish.
We do this throughout the entire workshop as some of these notebooks may run for a while.

In [1]:
import boto3
import sagemaker
import pandas as pd
import json

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Dataset

In [2]:
%store -r autopilot_train_s3_uri

In [3]:
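# `%store -r` leaves the name undefined when the previous notebook did not store it,
# so look it up instead of referencing it directly (which would raise a NameError).
if not globals().get('autopilot_train_s3_uri'):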
    print('****************************************************************************************')
    print('**************** PLEASE RE-RUN THE PREVIOUS DATA PREPARATION NOTEBOOK ******************')
    print('**************** THIS NOTEBOOK WILL NOT RUN PROPERLY ***********************************')
    print('****************************************************************************************')

In [4]:
print(autopilot_train_s3_uri)

s3://sagemaker-us-east-1-889926741212/data/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


In [5]:
!aws s3 ls $autopilot_train_s3_uri

2020-09-19 17:43:18   13634140 amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


## See our prepared training data which we use as input for Autopilot

In [6]:
!aws s3 cp $autopilot_train_s3_uri ./tmp/

download: s3://sagemaker-us-east-1-889926741212/data/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv to tmp/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv


In [7]:
import csv

df = pd.read_csv('./tmp/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv')
df.head()

Unnamed: 0,star_rating,review_body
0,1,I too thought that I would be able to make it ...
1,5,gets better and better every year - outstandin...
2,2,We've used Turbo Tax for over 10 years; this y...
3,4,I upgraded to Quicken 2013 about two weeks ago...
4,1,I had to turn off Kaspersky protection to inst...
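Before handing the file to Autopilot, it can help to check how the target values are distributed, since a skewed `star_rating` column affects how an accuracy score should be read. The following is a minimal sketch using only the standard library; `label_counts` is a hypothetical helper, and the three sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text, target_column="star_rating"):
    """Count how often each target value appears in CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if target_column not in (reader.fieldnames or []):
        raise ValueError(f"missing target column: {target_column}")
    return Counter(row[target_column] for row in reader)


# Hypothetical three-row sample with the same target and text columns as above.
sample = "star_rating,review_body\n5,Great product\n1,Would not install\n5,Works well\n"
print(label_counts(sample))
```

For the real file, the same function could be applied to the text read from `./tmp/`.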


# Set Up the S3 Location for the Autopilot-Generated Assets
This includes Jupyter Notebooks (Analysis), Python Scripts (Feature Engineering), and Trained Models.

In [8]:
prefix_model_output = 'models/autopilot'

model_output_s3_uri = 's3://{}/{}'.format(bucket, prefix_model_output)

print(model_output_s3_uri)


s3://sagemaker-us-east-1-889926741212/models/autopilot


In [9]:
max_candidates = 3

job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': 600,
      'MaxCandidates': max_candidates,
      'MaxAutoMLJobRuntimeInSeconds': 3600
    },
}

input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': '{}'.format(autopilot_train_s3_uri)
        }
      },
      'TargetAttributeName': 'star_rating'
    }
]

output_data_config = {
    'S3OutputPath': '{}'.format(model_output_s3_uri)
}

# Launch the SageMaker Autopilot job

We can now launch the job by calling the `create_auto_ml_job` API.

In [10]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-dm-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

AutoMLJobName: automl-dm-19-17-51-21
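Note that the `%d-%H-%M-%S` suffix contains no month or year, so the same name can recur a month later. The sketch below builds the name and validates it before the API call. It assumes the name constraints documented for `CreateAutoMLJob` (at most 32 characters, alphanumerics separated by hyphens); `make_job_name` and `JOB_NAME_PATTERN` are hypothetical names introduced here.

```python
import re
from time import gmtime, strftime

# Assumed constraint from the CreateAutoMLJob API reference:
# starts with an alphanumeric, hyphens only between alphanumerics, max 32 chars.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,31}$")
MAX_JOB_NAME_LENGTH = 32


def make_job_name(prefix="automl-dm", now=None):
    """Build a timestamped job name and fail early if it looks invalid."""
    suffix = strftime("%d-%H-%M-%S", now if now is not None else gmtime())
    name = f"{prefix}-{suffix}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid AutoML job name: {name!r}")
    return name
```

Failing locally gives a clearer message than a validation error returned by the service.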


_Note that we are not specifying the `ProblemType`.  Autopilot will automatically detect if we're using regression or classification (binary or multi-class)._

In [11]:
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      AutoMLJobConfig=job_config,
                      RoleArn=role)

{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:889926741212:automl-job/automl-dm-19-17-51-21',
 'ResponseMetadata': {'RequestId': '688ee21c-de14-499e-a25f-3bb7094d7a2e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '688ee21c-de14-499e-a25f-3bb7094d7a2e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '92',
   'date': 'Sat, 19 Sep 2020 17:51:22 GMT'},
  'RetryAttempts': 0}}
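The call above leaves `ProblemType` out and lets Autopilot infer it. If we wanted to pin the problem type instead, the request could be assembled as sketched below. `build_automl_request` is a hypothetical helper, and the rule that `ProblemType` and `AutoMLJobObjective` are set together is an assumption based on our reading of the `create_auto_ml_job` documentation.

```python
def build_automl_request(job_name, input_data_config, output_data_config,
                         job_config, role_arn,
                         problem_type=None, objective_metric=None):
    """Assemble keyword arguments for create_auto_ml_job.

    ProblemType and AutoMLJobObjective are only added when both are given
    (assumption: the API expects them to be specified together).
    """
    if (problem_type is None) != (objective_metric is None):
        raise ValueError("set problem_type and objective_metric together")
    request = {
        "AutoMLJobName": job_name,
        "InputDataConfig": input_data_config,
        "OutputDataConfig": output_data_config,
        "AutoMLJobConfig": job_config,
        "RoleArn": role_arn,
    }
    if problem_type is not None:
        request["ProblemType"] = problem_type
        request["AutoMLJobObjective"] = {"MetricName": objective_metric}
    return request
```

A call such as `sm.create_auto_ml_job(**build_automl_request(auto_ml_job_name, input_data_config, output_data_config, job_config, role, problem_type='MulticlassClassification', objective_metric='Accuracy'))` would then request multi-class classification explicitly.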

# Tracking the progress of the Autopilot job

A SageMaker Autopilot job consists of the following high-level steps:
* _Data Analysis_ where the data is summarized and analyzed to determine which feature engineering techniques, hyper-parameters, and models to explore.
* _Feature Engineering_ where the data is scrubbed, balanced, combined, and split into train and validation.
* _Model Training and Tuning_ where the top performing features, hyper-parameters, and models are selected and trained.

<img src="img/autopilot-steps.png" width="90%" align="left">

Source: [Amazon Science Publication](https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf)

# Analyzing Data

In [12]:
# Sleep for a bit to ensure the AutoML job above has time to start
import time
time.sleep(30)

job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

print(json.dumps(job, indent=4, sort_keys=True, default=str))

{
    "AutoMLJobArn": "arn:aws:sagemaker:us-east-1:889926741212:automl-job/automl-dm-19-17-51-21",
    "AutoMLJobConfig": {
        "CompletionCriteria": {
            "MaxAutoMLJobRuntimeInSeconds": 3600,
            "MaxCandidates": 3,
            "MaxRuntimePerTrainingJobInSeconds": 600
        }
    },
    "AutoMLJobName": "automl-dm-19-17-51-21",
    "AutoMLJobSecondaryStatus": "AnalyzingData",
    "AutoMLJobStatus": "InProgress",
    "CreationTime": "2020-09-19 17:51:21.184000+00:00",
    "GenerateCandidateDefinitionsOnly": false,
    "InputDataConfig": [
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://sagemaker-us-east-1-889926741212/data/amazon_reviews_us_Digital_Software_v1_00_autopilot.csv"
                }
            },
            "TargetAttributeName": "star_rating"
        }
    ],
    "LastModifiedTime": "2020-09-19 17:51:42.152000+00:00",
    "OutputDataConfig": {


### Check if the Autopilot Job started correctly.

In [13]:
if not bool(job):
    print('STOP: Autopilot Job did NOT start correctly. Please re-run the notebook from start.')
elif 'AutoMLJobStatus' not in job.keys():
    print('STOP: Autopilot Job did NOT start correctly. Please re-run the notebook from start.')
elif 'AutoMLJobSecondaryStatus' not in job.keys():
    print('STOP: Autopilot Job did NOT start correctly. Please re-run the notebook from start.')
else:
    print('OK')

OK


### Watch for two SageMaker `Processing Jobs` to start.
* The first Processing Job (Data Splitter) checks the data for sanity, performs stratified shuffling, and splits the data into training and validation sets.
* The second Processing Job (Candidate Generator) first streams through the data to compute statistics for the dataset. It then uses these statistics to identify the problem type and the possible type of every predictor column: numeric, categorical, natural language, etc.
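Besides the console, the two Processing Jobs can be checked from code. The sketch below uses the `list_processing_jobs` API with a creation-time filter, on the assumption that no unrelated Processing Jobs were started in the account after the Autopilot job. `processing_job_statuses` is a hypothetical helper, and it reads only the first page of results.

```python
def processing_job_statuses(sm_client, created_after):
    """Return {job name: status} for Processing Jobs created after a timestamp.

    Only the first page of results is read; pagination is left out for brevity.
    """
    response = sm_client.list_processing_jobs(CreationTimeAfter=created_after)
    return {
        summary["ProcessingJobName"]: summary["ProcessingJobStatus"]
        for summary in response.get("ProcessingJobSummaries", [])
    }
```

With the variables from this notebook, a call might look like `processing_job_statuses(sm, job['CreationTime'])`.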

In [14]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/">Processing Jobs</a></b>'.format(region)))


# The Next Cell Will Show `InProgress` For A Few Minutes.
_Please be patient._

In [15]:
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']

if job_status not in ('Stopped', 'Failed'):
    # Compare with ==: ('InProgress') is a plain string, so `in` would test for a substring
    while job_status == 'InProgress' and job_sec_status == 'AnalyzingData':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Data analysis complete")
    
print(json.dumps(job, indent=4, sort_keys=True, default=str))

InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress AnalyzingData
InProgress FeatureEngineering
Data analysis complete
{
    "AutoMLJobArn": "arn:aws:sagemaker:us-east-1:889926741212:automl-job/automl-dm-19-17-51-21",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba697044643f28fd5a54e8bb83874fe8f7059c6ab42538e24fb4e76/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb",
        "DataExplorationNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba69704

# View Generated Notebook Samples
Once data analysis is complete, SageMaker Autopilot generates two notebooks:
* Data exploration
* Candidate definition

### TODO: Add KeyCheck

In [16]:
if not bool(job):
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
else:
    # Split on the path segment; str.rstrip() strips a set of characters, not a suffix
    generated_resources = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('/notebooks/', 1)[0]
    pr_job_id = generated_resources.rsplit('/', 1)[-1]
    print('OK')

OK


In [17]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/{}/sagemaker-automl-candidates/{}/">S3 Generated Resources</a></b>'.format(bucket, prefix_model_output, auto_ml_job_name, pr_job_id)))


# In the Jupyter File Browser, Open the Following Folders to See Samples of the Generated Assets:
```
notebooks/
generated_module/
```

These folders contain lots of useful information.

(Optional) You can download the actual files generated for your specific Autopilot run using the following:
```
generated_resources = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation'].rsplit('/notebooks/', 1)[0]

!aws s3 cp --recursive $generated_resources .
```
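The artifact root and the candidate-generation job ID can both be derived by splitting the notebook location on its `/notebooks/` segment. This does not depend on which characters the ID happens to end with, unlike `str.rstrip`, which removes a set of characters rather than a suffix. The helper below is a hypothetical sketch, and the URI in the usage note is made up.

```python
def split_artifact_location(notebook_s3_uri):
    """Split a generated-notebook URI into (artifact root URI, job ID)."""
    root, separator, _notebook_file = notebook_s3_uri.rpartition("/notebooks/")
    if not separator:
        raise ValueError("expected a '/notebooks/' segment in the URI")
    job_id = root.rsplit("/", 1)[-1]
    return root, job_id
```

For a location such as `s3://example-bucket/models/autopilot/my-job/sagemaker-automl-candidates/pr-1-abcde/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb`, the second element of the result is `pr-1-abcde`.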

# Feature Engineering

### Watch for SageMaker `Training Jobs` and `Batch Transform Jobs` to start.

* This is the candidate exploration phase.
* Each generated Python data-processing script runs inside a SageMaker framework container as a training job, followed by a transform job.

Note that the feature preprocessing part of each pipeline has all of its hyper-parameters fixed, so it does not require tuning. The feature preprocessing step can therefore run before the hyper-parameter optimization job.

This step outputs up to 10 variants of the transformed data, and the algorithm in each pipeline is set to use its respective transformed data.

<img src="img/autopilot-steps.png" width="90%" align="left">

Source: [Amazon Science Publication](https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf)

In [18]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/">Training Jobs</a></b>'.format(region)))


In [19]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/transform-jobs/">Batch Transform Jobs</a></b>'.format(region)))


# The Next Cell Will Show `InProgress` For A Few Minutes.
_Please be patient._

In [20]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    # Compare with ==: ('InProgress') is a plain string, so `in` would test for a substring
    while job_status == 'InProgress' and job_sec_status == 'FeatureEngineering':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Feature engineering complete")
    
print(json.dumps(job, indent=4, sort_keys=True, default=str))

InProgress
FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress FeatureEngineering
InProgress ModelTuning
Feature engineering complete
{
    "AutoMLJobArn": "arn:aws:sagemaker:us-east-1:889926741212:automl-job/automl-dm-19-17-51-21",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba697044643f28fd5a54e8bb83874fe8f7059c6ab42538e24fb4e76/note

# Model Training and Tuning

### Watch for a SageMaker `Hyperparameter Tuning Job` and various `Training Jobs` to start.

* All algorithms are optimized using a SageMaker Hyperparameter Tuning job. 
* Up to 250 training jobs (based on number of candidates specified) are selectively executed to find the best candidate model.

<img src="img/autopilot-steps.png" width="90%" align="left">

Source: [Amazon Science Publication](https://assets.amazon.science/e8/8b/2366b1ab407990dec96e55ee5664/amazon-sagemaker-autopilot-a-white-box-automl-solution-at-scale.pdf)

In [21]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/">Hyperparameter Tuning Jobs</a></b>'.format(region)))


In [22]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/">Training Jobs</a></b>'.format(region)))


# The Next Cell Will Show `InProgress` For A Few Minutes.
_Please be patient._

In [23]:
job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
job_status = job['AutoMLJobStatus']
job_sec_status = job['AutoMLJobSecondaryStatus']
print(job_status)
print(job_sec_status)
if job_status not in ('Stopped', 'Failed'):
    # Compare with ==: ('InProgress') is a plain string, so `in` would test for a substring
    while job_status == 'InProgress' and job_sec_status == 'ModelTuning':
        job = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        job_status = job['AutoMLJobStatus']
        job_sec_status = job['AutoMLJobSecondaryStatus']
        print(job_status, job_sec_status)
        sleep(30)
    print("Model tuning complete")
    
print(json.dumps(job, indent=4, sort_keys=True, default=str))

InProgress
ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
InProgress ModelTuning
Completed MaxCandidatesReached
Model tuning complete
{
    "AutoMLJobArn": "arn:aws:sagemaker:us-east-1:889926741212:automl-job/automl-dm-19-17-51-21",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba697044643f28fd5a54e8bb83874fe8f7059c6ab42538e24fb4e76/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb",
        "DataExplorationNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba697044643f28fd5a54e8bb83874fe8f7059c6ab42538e24fb4e76/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb"
    },
    "AutoMLJobConfig": {
        "CompletionCriteria": {
            "MaxAutoMLJobRuntimeInSeconds"

# _Please Wait Until ^^ Autopilot ^^ Completes Above_
Make sure the status below indicates `Completed`.

In [24]:
sleep(30)
print(job_status)

if job_status != 'Completed':
    print('*******************************************************************')
    print('*************** THIS JOB DID NOT COMPLETE PROPERLY ****************')
    print('***************  REPORT THE ISSUE OR ASK FOR HELP  ****************')    
    print('*******************************************************************')

Completed
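The three polling cells above share the same shape: describe the job, print its status, and sleep while it stays in a given phase. A minimal sketch of that pattern as one reusable function follows. `wait_for_phase_to_finish` is a hypothetical helper; it takes any zero-argument callable that returns the `describe_auto_ml_job` response, so it can be tried without an AWS connection.

```python
import time

TERMINAL_STATUSES = ("Completed", "Failed", "Stopped")


def wait_for_phase_to_finish(describe_job, phase, poll_seconds=30, sleep=time.sleep):
    """Poll until the job leaves `phase` or reaches a terminal status.

    Returns the last response so the caller can inspect the final status.
    """
    while True:
        job = describe_job()
        status = job["AutoMLJobStatus"]
        secondary_status = job["AutoMLJobSecondaryStatus"]
        print(status, secondary_status)
        if status in TERMINAL_STATUSES or secondary_status != phase:
            return job
        sleep(poll_seconds)
```

In this notebook it could be called as `job = wait_for_phase_to_finish(lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name), 'ModelTuning')`. Because the function returns as soon as the job fails or stops, the caller should still check `job['AutoMLJobStatus']` afterwards.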


# Viewing All Candidates
Once model tuning is complete, you can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by AutoML and sort them by their final performance metric.

In [25]:
candidates_response = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, 
                                                         SortBy='FinalObjectiveMetricValue')

### Check that candidates is not empty

In [26]:
if not bool(candidates_response):
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
else:
    candidates = candidates_response['Candidates']
    print('OK')

OK


In [27]:
if not bool(candidates):
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
elif 'CandidateName' not in candidates[0]:
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
elif 'FinalAutoMLJobObjectiveMetric' not in candidates[0]:
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
else:
    print('OK')

OK


In [28]:
print(json.dumps(candidates, indent=4, sort_keys=True, default=str))

[
    {
        "CandidateName": "tuning-job-1-c2904d1d37f34a2688-002-6dacd775",
        "CandidateStatus": "Completed",
        "CandidateSteps": [
            {
                "CandidateStepArn": "arn:aws:sagemaker:us-east-1:889926741212:processing-job/db-1-856effeb4f424b73b29c6f61563f15fae551cfec04e74d498993cd2570",
                "CandidateStepName": "db-1-856effeb4f424b73b29c6f61563f15fae551cfec04e74d498993cd2570",
                "CandidateStepType": "AWS::SageMaker::ProcessingJob"
            },
            {
                "CandidateStepArn": "arn:aws:sagemaker:us-east-1:889926741212:training-job/automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934",
                "CandidateStepName": "automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934",
                "CandidateStepType": "AWS::SageMaker::TrainingJob"
            },
            {
                "CandidateStepArn": "arn:aws:sagemaker:us-east-1:889926741212:transform-job/automl-dm--dpp0-rpb-1-82734

In [29]:
for index, candidate in enumerate(candidates):
    print(str(index) + "  " 
        + candidate['CandidateName'] + "  " 
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))

0  tuning-job-1-c2904d1d37f34a2688-002-6dacd775  0.38413000106811523
1  tuning-job-1-c2904d1d37f34a2688-003-992ee09e  0.37703999876976013
2  tuning-job-1-c2904d1d37f34a2688-001-5d2d2534  0.29381999373435974
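The ordering above comes from the `SortBy` argument of the API call. The same ranking can be reproduced locally, which is handy when the candidate list has been saved or filtered. This is a minimal sketch; `rank_candidates` is a hypothetical helper, and the example entries reuse the candidate names and `validation:accuracy` values printed in this notebook.

```python
def rank_candidates(candidates, higher_is_better=True):
    """Return (candidate name, metric value) pairs ordered from best to worst."""
    pairs = [
        (c["CandidateName"], c["FinalAutoMLJobObjectiveMetric"]["Value"])
        for c in candidates
        if "FinalAutoMLJobObjectiveMetric" in c  # skip candidates without a metric
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=higher_is_better)


example = [
    {"CandidateName": "tuning-job-1-c2904d1d37f34a2688-001-5d2d2534",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Value": 0.29381999373435974}},
    {"CandidateName": "tuning-job-1-c2904d1d37f34a2688-002-6dacd775",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Value": 0.38413000106811523}},
    {"CandidateName": "tuning-job-1-c2904d1d37f34a2688-003-992ee09e",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Value": 0.37703999876976013}},
]
print(rank_candidates(example)[0][0])
```

For a metric that should be minimized, such as MSE in a regression job, pass `higher_is_better=False`.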


# Inspect Trials using Experiments API

SageMaker Autopilot automatically creates a new experiment, and pushes information for each trial. 

In [30]:
from sagemaker.analytics import ExperimentAnalytics, TrainingJobAnalytics

exp = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=auto_ml_job_name + '-aws-auto-ml-job',
)

df = exp.dataframe()
print(df)

                                  TrialComponentName  \
0  tuning-job-1-c2904d1d37f34a2688-001-5d2d2534-a...   
1  tuning-job-1-c2904d1d37f34a2688-003-992ee09e-a...   
2  tuning-job-1-c2904d1d37f34a2688-002-6dacd775-a...   
3  automl-dm--dpp0-rpb-1-8273462ad8e94dbcbd158148...   
4  automl-dm--dpp2-rpb-1-6778331426214a798b92e168...   
5  automl-dm--dpp1-csv-1-75a01f5dd33c4db3906e601f...   
6  automl-dm--dpp1-1-d7e89ee0091d4a9f93c4ab3e11cc...   
7  automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777...   
8  automl-dm--dpp2-1-ae1500eb68124619b8bf98c001ba...   
9  db-1-856effeb4f424b73b29c6f61563f15fae551cfec0...   

                                         DisplayName  \
0  tuning-job-1-c2904d1d37f34a2688-001-5d2d2534-a...   
1  tuning-job-1-c2904d1d37f34a2688-003-992ee09e-a...   
2  tuning-job-1-c2904d1d37f34a2688-002-6dacd775-a...   
3  automl-dm--dpp0-rpb-1-8273462ad8e94dbcbd158148...   
4  automl-dm--dpp2-rpb-1-6778331426214a798b92e168...   
5  automl-dm--dpp1-csv-1-75a01f5dd33c4db3906e60

# Explore the Best Candidate
Now that we have successfully completed the AutoML job on our dataset and visualized the trials, we can create a model from any of the trials with a single API call and then deploy that model for online or batch prediction using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). For this notebook, we deploy only the best performing trial for inference.

The best candidate is the one we're really interested in.

In [31]:
best_candidate_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)

In [32]:
if not bool(best_candidate_response):
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
else:
    best_candidate = best_candidate_response['BestCandidate']
    print('OK')

OK


In [33]:
print(json.dumps(best_candidate_response, indent=4, sort_keys=True, default=str))

{
    "AutoMLJobArn": "arn:aws:sagemaker:us-east-1:889926741212:automl-job/automl-dm-19-17-51-21",
    "AutoMLJobArtifacts": {
        "CandidateDefinitionNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba697044643f28fd5a54e8bb83874fe8f7059c6ab42538e24fb4e76/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb",
        "DataExplorationNotebookLocation": "s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/sagemaker-automl-candidates/pr-1-bcbba697044643f28fd5a54e8bb83874fe8f7059c6ab42538e24fb4e76/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb"
    },
    "AutoMLJobConfig": {
        "CompletionCriteria": {
            "MaxAutoMLJobRuntimeInSeconds": 3600,
            "MaxCandidates": 3,
            "MaxRuntimePerTrainingJobInSeconds": 600
        }
    },
    "AutoMLJobName": "automl-dm-19-17-51-21",
    "AutoMLJobSecondaryStatus": "MaxCandidatesReached",
  

In [34]:
if not bool(best_candidate):
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
elif 'CandidateName' not in best_candidate:
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
elif 'FinalAutoMLJobObjectiveMetric' not in best_candidate:
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
else:
    best_candidate_identifier = best_candidate['CandidateName']
    print("Candidate name: " + best_candidate_identifier)
    print("Metric name: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
    print("Metric value: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))


Candidate name: tuning-job-1-c2904d1d37f34a2688-002-6dacd775
Metric name: validation:accuracy
Metric value: 0.38413000106811523


In [35]:
print(json.dumps(best_candidate, indent=4, sort_keys=True, default=str))

{
    "CandidateName": "tuning-job-1-c2904d1d37f34a2688-002-6dacd775",
    "CandidateStatus": "Completed",
    "CandidateSteps": [
        {
            "CandidateStepArn": "arn:aws:sagemaker:us-east-1:889926741212:processing-job/db-1-856effeb4f424b73b29c6f61563f15fae551cfec04e74d498993cd2570",
            "CandidateStepName": "db-1-856effeb4f424b73b29c6f61563f15fae551cfec04e74d498993cd2570",
            "CandidateStepType": "AWS::SageMaker::ProcessingJob"
        },
        {
            "CandidateStepArn": "arn:aws:sagemaker:us-east-1:889926741212:training-job/automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934",
            "CandidateStepName": "automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934",
            "CandidateStepType": "AWS::SageMaker::TrainingJob"
        },
        {
            "CandidateStepArn": "arn:aws:sagemaker:us-east-1:889926741212:transform-job/automl-dm--dpp0-rpb-1-8273462ad8e94dbcbd1581483f4ac837482279b21",
            "CandidateStepN

# View Individual Autopilot Jobs

In [36]:
steps = []
if not bool(best_candidate):
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
elif 'InferenceContainers' not in best_candidate:
    print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
else:
    for step in best_candidate['CandidateSteps']:
        print('Candidate Step Type: {}'.format(step['CandidateStepType']))
        print('Candidate Step Name: {}'.format(step['CandidateStepName']))
        steps.append(step['CandidateStepName'])

Candidate Step Type: AWS::SageMaker::ProcessingJob
Candidate Step Name: db-1-856effeb4f424b73b29c6f61563f15fae551cfec04e74d498993cd2570
Candidate Step Type: AWS::SageMaker::TrainingJob
Candidate Step Name: automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934
Candidate Step Type: AWS::SageMaker::TransformJob
Candidate Step Name: automl-dm--dpp0-rpb-1-8273462ad8e94dbcbd1581483f4ac837482279b21
Candidate Step Type: AWS::SageMaker::TrainingJob
Candidate Step Name: tuning-job-1-c2904d1d37f34a2688-002-6dacd775


In [37]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(region, steps[0])))

In [38]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a></b>'.format(region, steps[1])))

In [39]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/transform-jobs/{}">Transform Job</a></b>'.format(region, steps[2])))

In [40]:
from IPython.display import display, HTML

display(HTML('<b>Review Best Candidate <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job (Tuning)</a></b>'.format(region, steps[3])))

# See the containers and models composing the Inference Pipeline

In [41]:
for container in best_candidate['InferenceContainers']:
    print(container['Image'])
    print(container['ModelDataUrl'])
    print('======================')

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3
s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/data-processor-models/automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934/output/model.tar.gz
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3
s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/tuning/automl-dm--dpp0-xgb/tuning-job-1-c2904d1d37f34a2688-002-6dacd775/output/model.tar.gz
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3
s3://sagemaker-us-east-1-889926741212/models/autopilot/automl-dm-19-17-51-21/data-processor-models/automl-dm--dpp0-1-cacbc13d79d44a4e85b75490f777abc22136b84b3a934/output/model.tar.gz
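The listing above shows three containers that run in sequence. A rough sketch for labelling each stage follows. It assumes the S3 path layout seen in this run, where `data-processor-models/` holds the fitted feature transformers and `tuning/` holds the tuned algorithm; `describe_pipeline` is a hypothetical helper, not part of any SDK.

```python
def describe_pipeline(inference_containers):
    """Return (stage role, image name) for each container, in pipeline order."""
    stages = []
    for container in inference_containers:
        image_name = container["Image"].rsplit("/", 1)[-1]
        model_url = container["ModelDataUrl"]
        # Assumption: the role can be read from the S3 path layout seen in this run.
        if "/data-processor-models/" in model_url:
            role = "data processing"
        elif "/tuning/" in model_url:
            role = "algorithm"
        else:
            role = "unknown"
        stages.append((role, image_name))
    return stages
```

Applied to `best_candidate['InferenceContainers']` from this run, it would label the first and third containers as data processing and the second as the algorithm.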


# Autopilot Chooses XGBoost as Best Candidate!

Note that Autopilot chose different hyper-parameters and feature transformations than we used in our own XGBoost model.

# Deploy the Model as a REST Endpoint
Batch transformations are also supported, but for now, we will use a REST Endpoint.

In [42]:
model_name = 'automl-dm-model-' + timestamp_suffix

model_arn = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=role)

print('Best candidate model ARN: ', model_arn['ModelArn'])

Best candidate model ARN:  arn:aws:sagemaker:us-east-1:889926741212:model/automl-dm-model-19-17-51-21


In [43]:
# EndpointConfig name
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
epc_name = 'automl-dm-epc-' + timestamp_suffix

# Endpoint name
autopilot_endpoint_name = 'automl-dm-ep-' + timestamp_suffix
variant_name = 'automl-dm-variant-' + timestamp_suffix

print(autopilot_endpoint_name)
print(variant_name)

automl-dm-ep-19-18-14-00
automl-dm-variant-19-18-14-00


In [44]:
ep_config = sm.create_endpoint_config(EndpointConfigName = epc_name,
                                      ProductionVariants=[{'InstanceType':'ml.m5.large',
                                                           'InitialInstanceCount': 1,
                                                           'ModelName': model_name,
                                                           'VariantName': variant_name}])


In [45]:
create_endpoint_response = sm.create_endpoint(EndpointName=autopilot_endpoint_name,
                                              EndpointConfigName=epc_name)
print(create_endpoint_response['EndpointArn'])

arn:aws:sagemaker:us-east-1:889926741212:endpoint/automl-dm-ep-19-18-14-00
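`create_endpoint` returns immediately, while the endpoint itself takes several minutes to reach `InService`. One way to block until it is ready is the boto3 `endpoint_in_service` waiter, sketched below. `wait_for_endpoint` is a hypothetical helper, and the delay and attempt values are arbitrary choices.

```python
def wait_for_endpoint(sm_client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is InService, then return its status.

    The waiter raises an exception if the endpoint fails or the attempts run out.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```

In this notebook the call would be `wait_for_endpoint(sm, autopilot_endpoint_name)`.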


In [46]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">SageMaker REST Endpoint</a></b>'.format(region, autopilot_endpoint_name)))
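Once the endpoint is in service, a prediction request can be sent through the SageMaker runtime client. The sketch below assumes the deployed pipeline accepts `text/csv` input with a single feature column (the review text) and returns the predicted label as text; `predict_star_rating` is a hypothetical helper, and `runtime_client` would be created with `boto3.client('sagemaker-runtime')`.

```python
import csv
import io


def predict_star_rating(runtime_client, endpoint_name, review_body):
    """Send one review as a CSV row and return the raw response text."""
    buffer = io.StringIO()
    # csv.writer quotes the field when it contains commas, quotes, or newlines.
    csv.writer(buffer).writerow([review_body])
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Accept="text/csv",
        Body=buffer.getvalue().encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8").strip()
```

Writing the row with the `csv` module rather than plain string formatting keeps reviews that contain commas in a single column.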

# Store Variables for the Next Notebooks

In [47]:
%store autopilot_endpoint_name

Stored 'autopilot_endpoint_name' (str)


In [48]:
%store

Stored variables and their in-db values:
autopilot_endpoint_name             -> 'automl-dm-ep-19-18-14-00'
autopilot_train_s3_uri              -> 's3://sagemaker-us-east-1-889926741212/data/amazon


# Summary
We used Autopilot to automatically find the best model, hyper-parameters, and feature-engineering scripts for our dataset.  

Autopilot uses a transparent approach to generate re-usable exploration Jupyter Notebooks and transformation Python scripts to continue to train and deploy our model on new data - well after this initial interaction with the Autopilot service.

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();