# Autopilot Embedding
In this notebook, we'll use SageMaker to train a model using our graph embedding as an additional feature.

## SageMaker Connection
Let's set up our SageMaker connection.

In [None]:
import sagemaker
import boto3

region = boto3.Session().region_name

session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'sagemaker/form13'

role = sagemaker.get_execution_role()

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

## Upload to Amazon S3
Now we're going to upload the training, testing, and validation data to our default SageMaker bucket.

In [None]:
train_data_s3_path = session.upload_data(path='train.csv', key_prefix=prefix + '/train')
print('Training data uploaded to: ' + train_data_s3_path)

test_data_s3_path = session.upload_data(path='test.csv', key_prefix=prefix + '/test')
print('Testing data uploaded to: ' + test_data_s3_path)

validation_data_s3_path = session.upload_data(path='validate.csv', key_prefix=prefix + '/validate')
print('Validation data uploaded to: ' + validation_data_s3_path)


## Setting up the SageMaker Autopilot Job
After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset.

In [None]:
auto_ml_job_config = {'CompletionCriteria': {'MaxCandidates': 3}}

input_data_config = [
    {
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://{}/{}/train'.format(bucket, prefix),
            }
        },
        'TargetAttributeName': 'target',
    }
]

output_data_config = {'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)}
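
The three dictionaries above can also be assembled by a small helper, which makes it easier to reuse the same setup with a different prefix or target column. This is a minimal sketch: `build_autopilot_config` is a hypothetical name, not part of the SageMaker SDK, and the bucket name in the usage lines is a placeholder.

```python
def build_autopilot_config(bucket, prefix, target, max_candidates=3):
    """Return (input_data_config, output_data_config, auto_ml_job_config).

    Hypothetical helper: it only assembles the same dictionaries shown above.
    """
    input_data_config = [
        {
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://{}/{}/train'.format(bucket, prefix),
                }
            },
            'TargetAttributeName': target,
        }
    ]
    output_data_config = {'S3OutputPath': 's3://{}/{}/output'.format(bucket, prefix)}
    auto_ml_job_config = {'CompletionCriteria': {'MaxCandidates': max_candidates}}
    return input_data_config, output_data_config, auto_ml_job_config


# 'example-bucket' is a placeholder used only for illustration.
example_input, example_output, example_job = build_autopilot_config(
    'example-bucket', 'sagemaker/form13', 'target'
)
print(example_input[0]['DataSource']['S3DataSource']['S3Uri'])
print(example_output['S3OutputPath'])
```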

## Launching the SageMaker Autopilot Job
You can now launch the Autopilot job by calling the create_auto_ml_job method.

In [None]:
from time import gmtime, strftime, sleep

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

auto_ml_job_name = 'automl-embedding-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

sm.create_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    AutoMLJobConfig=auto_ml_job_config,
    RoleArn=role,
)

## Tracking SageMaker Autopilot job progress
A SageMaker Autopilot job consists of the following high-level steps:

* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline).

In our experience, this job takes anywhere from 20 to 80 minutes to run. At this point, you might want to leave the rest of this notebook running and move on to the raw notebook, "3_autopilot_raw.ipynb".

In [None]:
print('JobStatus - Secondary Status')
print('------------------------------')

describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response['AutoMLJobStatus'] + ' - ' + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']

    print(
        describe_response['AutoMLJobStatus'] + ' - ' + describe_response['AutoMLJobSecondaryStatus']
    )
    sleep(30)
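
The loop above polls until the job reaches a terminal state, however long that takes. If you would rather give up after a fixed time, the same polling can be wrapped in a function with a timeout. This is a sketch under assumptions: `wait_for_auto_ml_job` is a hypothetical helper, the default timeout is arbitrary, and `client` is expected to be the `sm` client created earlier.

```python
import time


def wait_for_auto_ml_job(client, job_name, poll_seconds=30, timeout_seconds=3 * 60 * 60):
    """Poll describe_auto_ml_job until a terminal status or the timeout is reached."""
    terminal_statuses = ('Failed', 'Completed', 'Stopped')
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = response['AutoMLJobStatus']
        print(status + ' - ' + response.get('AutoMLJobSecondaryStatus', ''))
        if status in terminal_statuses:
            # Return immediately so no extra sleep follows the final status.
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'Job {} still {} after {} seconds'.format(job_name, status, timeout_seconds)
            )
        time.sleep(poll_seconds)
```

It could be called as `wait_for_auto_ml_job(sm, auto_ml_job_name)`. The returned response is the last `describe_auto_ml_job` result, so it can be inspected directly afterwards.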

## Results
Now use the describe_auto_ml_job API to look up the best candidate selected by the SageMaker Autopilot job.

In [None]:
import pprint

best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']

print('CandidateName: ' + best_candidate_name)
print('FinalAutoMLJobObjectiveMetricName: ' + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print('FinalAutoMLJobObjectiveMetricValue: ' + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))
print()
pprint.pprint(best_candidate)
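
If the job failed or was stopped early, the response may not contain a `BestCandidate` entry, and the lookup above would raise a `KeyError`. A defensive variant might look like the sketch below. `summarize_best_candidate` is a hypothetical helper, and the sample response is made-up data used only to show the shape of the result.

```python
def summarize_best_candidate(describe_response):
    """Return (name, metric_name, metric_value), or None if there is no best candidate."""
    best = describe_response.get('BestCandidate')
    if best is None:
        return None
    metric = best['FinalAutoMLJobObjectiveMetric']
    return best['CandidateName'], metric['MetricName'], metric['Value']


# Made-up response fragment for illustration; real values come from describe_auto_ml_job.
sample_response = {
    'BestCandidate': {
        'CandidateName': 'example-candidate',
        'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:f1', 'Value': 0.5},
    }
}
print(summarize_best_candidate(sample_response))
print(summarize_best_candidate({'AutoMLJobStatus': 'Failed'}))
```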

## Batch Inference
Now that we have completed the SageMaker Autopilot job on the dataset, let's create a model from the best candidate with Inference Pipelines.

In [None]:
model_name = 'automl-embedding-model-' + timestamp_suffix
model = sm.create_model(Containers=best_candidate['InferenceContainers'], ModelName=model_name, ExecutionRoleArn=role)
print('Model ARN corresponding to the best candidate is: {}'.format(model['ModelArn']))

We can use batch inference through Amazon SageMaker batch transform. The same model can also be deployed to perform online inference using Amazon SageMaker hosting.



In [None]:
transform_job_name = 'automl-embedding-transform-' + timestamp_suffix

transform_input = {
    'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': test_data_s3_path}},
    'ContentType': 'text/csv',
    'CompressionType': 'None',
    'SplitType': 'Line',
}

transform_output = {
    'S3OutputPath': 's3://{}/{}/inference-results'.format(bucket, prefix),
}

transform_resources = {'InstanceType': 'ml.m5.4xlarge', 'InstanceCount': 1}

sm.create_transform_job(
    TransformJobName=transform_job_name,
    ModelName=model_name,
    TransformInput=transform_input,
    TransformOutput=transform_output,
    TransformResources=transform_resources,
)

Watch the transform job for completion.

In [None]:
print('JobStatus')
print('----------')

describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print(job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print(job_run_status)
    sleep(30)


Now let's get the S3 URI of the transform job results. You can open this location in the S3 console.

In [None]:
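bucket = session.default_bucket()
# Batch transform names each output object after its input file plus '.out';
# the input uploaded earlier was test.csv.
key = '{}/inference-results/test.csv.out'.format(prefix)
url = 's3://' + bucket + '/' + key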

print(url)
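
Rather than hard-coding the result file name, the output location can be derived from the input URI and the output path. The sketch below assumes the batch transform naming convention of input file name plus `.out`; `transform_output_uri` is a hypothetical helper and the bucket name is a placeholder.

```python
import posixpath


def transform_output_uri(input_uri, output_path):
    """Build the expected result URI: <output_path>/<input file name>.out"""
    file_name = posixpath.basename(input_uri)
    return output_path.rstrip('/') + '/' + file_name + '.out'


# Placeholder URIs for illustration only.
example_uri = transform_output_uri(
    's3://example-bucket/sagemaker/form13/test/test.csv',
    's3://example-bucket/sagemaker/form13/inference-results',
)
print(example_uri)
```

In this notebook the equivalent call would be `transform_output_uri(test_data_s3_path, transform_output['S3OutputPath'])`.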

## View All Candidates
You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric.

In [None]:
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']
index = 0
for candidate in candidates:
    print(
        str(index)
        + '  '
        + candidate['CandidateName']
        + '  '
        + str(candidate['FinalAutoMLJobObjectiveMetric']['Value'])
    )
    index += 1
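
With `MaxCandidates` set to 3, a single call returns every candidate. If you raise that limit, the response can be paginated, and the remaining pages are fetched with `NextToken`. A minimal sketch, where `list_all_candidates` is a hypothetical helper and `client` is assumed to be the `sm` client from earlier:

```python
def list_all_candidates(client, job_name):
    """Collect candidates across all pages of list_candidates_for_auto_ml_job."""
    candidates = []
    kwargs = {'AutoMLJobName': job_name, 'SortBy': 'FinalObjectiveMetricValue'}
    while True:
        response = client.list_candidates_for_auto_ml_job(**kwargs)
        candidates.extend(response['Candidates'])
        token = response.get('NextToken')
        if not token:
            return candidates
        # Ask for the next page on the following iteration.
        kwargs['NextToken'] = token
```

It could be called as `list_all_candidates(sm, auto_ml_job_name)`, and the result can be printed with the same loop as above.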


## Candidate Generation Notebook
SageMaker Autopilot also auto-generates a Candidate Definitions notebook. This notebook can be used to interactively step through the various steps taken by SageMaker Autopilot to arrive at the best candidate. It can also be used to override various runtime parameters like parallelism, hardware used, algorithms explored, feature extraction scripts and more.

The notebook can be downloaded from the following Amazon S3 location:


In [None]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']

## Data Exploration Notebook
SageMaker Autopilot also auto-generates a Data Exploration notebook, which can be downloaded from the following Amazon S3 location:


In [None]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
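
Both notebook locations are returned as `s3://` URIs. To download them programmatically, the URI first has to be split into a bucket and a key. The sketch below shows one way to do that; `split_s3_uri` is a hypothetical helper and the example URI is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('Not an S3 URI: {}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


# Placeholder URI for illustration only.
example_bucket, example_key = split_s3_uri('s3://example-bucket/output/notebooks/example.ipynb')
print(example_bucket, example_key)
```

The resulting pair can then be passed to the boto3 S3 client, for example `boto3.client('s3').download_file(example_bucket, example_key, 'example.ipynb')`.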

## Cleanup
The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The code below deletes them, along with all the generated models and the auto-generated notebooks.

The lines below are commented out. Uncomment them if you'd like to run them.

In [None]:
#s3 = boto3.resource('s3')
# Use a separate name so the `bucket` string defined earlier is not overwritten.
#s3_bucket = s3.Bucket(bucket)

#job_outputs_prefix = '{}/output/{}'.format(prefix, auto_ml_job_name)
#s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()