<br />

<div style="text-align: center;">
<font size="7">Training Built-in Machine Learning Models</font>
<br /> 
<br /> 
<font size="5">XGBoost</font>
    
</div>
<br />

<div style="text-align: right;">
<font size="4">2020/11/11</font>
<br />
<font size="4">Ryutaro Hashimoto</font>
</div>

___

# Summary
- We will train a model with the XGBoost algorithm built into SageMaker, deploy it to an endpoint, and run predictions against it.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Define-Training-Job" data-toc-modified-id="Define-Training-Job-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Define Training Job</a></span><ul class="toc-item"><li><span><a href="#Get-the-container-image-to-use." data-toc-modified-id="Get-the-container-image-to-use.-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Get the container image to use.</a></span></li><li><span><a href="#学習ジョブを定義" data-toc-modified-id="学習ジョブを定義-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Define the training job</a></span></li><li><span><a href="#Set-hyperparameters" data-toc-modified-id="Set-hyperparameters-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Set hyperparameters</a></span></li><li><span><a href="#Define-data-input-and-output" data-toc-modified-id="Define-data-input-and-output-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Define data input and output</a></span></li></ul></li><li><span><a href="#Execute-Training-Job" data-toc-modified-id="Execute-Training-Job-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Execute Training Job</a></span></li><li><span><a href="#Create-endpoints-and-predict-them-with-learning-models" data-toc-modified-id="Create-endpoints-and-predict-them-with-learning-models-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create endpoints and predict them with learning models</a></span><ul class="toc-item"><li><span><a href="#Launch-endpoint" data-toc-modified-id="Launch-endpoint-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Launch endpoint</a></span></li><li><span><a href="#Predict-an-appropriate-sample" data-toc-modified-id="Predict-an-appropriate-sample-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Predict an appropriate sample</a></span></li></ul></li><li><span><a href="#Delete-endpoint" data-toc-modified-id="Delete-endpoint-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Delete endpoint</a></span></li></ul></div>

## Define Training Job

### Get the container image to use.

In [1]:
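import boto3
import sagemaker

# Region of the default boto3 session (None if no region is configured).
region = boto3.Session().region_name

# Look up the ECR image URI of the built-in XGBoost container for this region.
container = sagemaker.image_uris.retrieve(
    framework='xgboost',
    region=region,
    version='latest',
    py_version='py3',
    instance_type=None,
)
print(container)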

433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest
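The printed URI packs several pieces of information into one string. As a minimal sketch, it can be split into registry account, region, repository, and tag using only the standard library. The helper name `parse_ecr_image_uri` is hypothetical and not part of the SageMaker SDK, and the pattern assumes the standard `amazonaws.com` domain shown in the output above.

```python
import re

# Hypothetical helper, not part of the SageMaker SDK.
# Assumes a URI of the form <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>.
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_ecr_image_uri(
    "433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest"
)
print(parts)
```

Checking the `region` and `tag` fields this way is a quick sanity test that the container matches the region and version you intended to use.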


### <a id="学習ジョブを定義"></a>Define the training job

In [2]:
from sagemaker.estimator import Estimator

role_ARN = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'    # ← your IAM role ARN

xgb_estimator = Estimator(
    image_uri=container,              # built-in XGBoost image retrieved above
    role=role_ARN,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='<S3 path>',          # S3 location for the trained model artifact
    base_job_name='XGBoost',          # prefix of the generated training job name
    tags=[{"Key": "name", "Value": "name"},
          {"Key": "project", "Value": "project1"}],
)

### Set hyperparameters

In [3]:
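# 'reg:linear' is squared-error regression; newer XGBoost versions
# call this objective 'reg:squarederror'.
xgb_estimator.set_hyperparameters(objective='reg:linear',
                                  num_round=200,
                                  early_stopping_rounds=10)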
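To make the interaction between `num_round` and `early_stopping_rounds` concrete, here is a small pure-Python illustration of the general early-stopping idea: training stops once the validation error has not improved for a given number of consecutive rounds. This is a sketch of the concept, not XGBoost's actual implementation; the function name and the error values are made up for illustration.

```python
def train_with_early_stopping(validation_errors, num_round=200,
                              early_stopping_rounds=10):
    """Simulate early stopping over a sequence of per-round validation errors.

    Returns (rounds_run, best_round), where best_round is zero-based.
    """
    best_error = float("inf")
    best_round = -1
    rounds_run = 0
    # Never run more than num_round boosting rounds.
    for round_index, error in enumerate(validation_errors[:num_round]):
        rounds_run = round_index + 1
        if error < best_error:
            best_error = error
            best_round = round_index
        elif round_index - best_round >= early_stopping_rounds:
            # No improvement for early_stopping_rounds rounds: stop.
            break
    return rounds_run, best_round


# Made-up validation errors: improve for 5 rounds, then plateau.
errors = [10.0, 8.0, 6.0, 5.0, 4.5] + [4.6] * 50
rounds_run, best_round = train_with_early_stopping(errors)
print(rounds_run, best_round)
```

With these settings, a job can finish well before 200 rounds if the validation channel stops improving, which is why a validation dataset is supplied in the next step.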

### Define data input and output

In [4]:
training_data_channel = sagemaker.TrainingInput(
    s3_data='s3://sagemaker-tutorial-hashimoto/boston-housing/training_dataset.csv',
    content_type='text/csv')

validation_data_channel = sagemaker.TrainingInput(
    s3_data='s3://sagemaker-tutorial-hashimoto/boston-housing/validation_dataset.csv',
    content_type='text/csv')

xgb_data = {'train': training_data_channel, 'validation': validation_data_channel}
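The built-in XGBoost algorithm expects CSV training data with the target value in the first column and no header row. The sketch below shows one way to produce that layout with the standard library. The helper name `to_xgboost_csv`, the column names, and the records are all made up for illustration; they are not the actual contents of the datasets referenced above.

```python
import csv
import io


def to_xgboost_csv(rows, label_key, feature_keys):
    """Serialize dict records to header-less CSV with the label first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Label goes in column 0, followed by the features in a fixed order.
        writer.writerow([row[label_key]] + [row[key] for key in feature_keys])
    return buffer.getvalue()


# Made-up records; the column names are illustrative only.
records = [
    {"price": 24.0, "crim": 0.00632, "rm": 6.575},
    {"price": 21.6, "crim": 0.02731, "rm": 6.421},
]
csv_text = to_xgboost_csv(records, label_key="price",
                          feature_keys=["crim", "rm"])
print(csv_text)
```

The resulting text could then be uploaded to S3 and referenced from a `TrainingInput` with `content_type='text/csv'`, as in the cell above.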

## Execute Training Job

In [5]:
xgb_estimator.fit(xgb_data)

2021-02-05 06:02:51 Starting - Starting the training job...
2021-02-05 06:03:19 Starting - Launching requested ML instances
ProfilerReport-1612504971: InProgress
......
2021-02-05 06:04:20 Starting - Preparing the instances for training...
2021-02-05 06:05:00 Downloading - Downloading input data...
2021-02-05 06:05:31 Training - Downloading the training image..
Arguments: train
[2021-02-05:06:05:45:INFO] Running standalone xgboost training.
[2021-02-05:06:05:45:INFO] File size need to be processed in the node: 0.04mb. Available memory size in the node: 220.71mb
[2021-02-05:06:05:45:INFO] Determined delimiter of CSV input is ','
[06:05:45] S3DistributionType set as FullyReplicated
[06:05:45] 455x12 matrix with 5460 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-02-05:06:05:45:INFO] Determined delimiter of CSV input is ','
[06:05:45] S3DistributionType set as FullyReplicated
(output truncated)

## Create endpoints and predict them with learning models

### Launch endpoint

In [6]:
from time import strftime, gmtime

timestamp = strftime('%d-%H-%M-%S', gmtime())

endpoint_name = 'XGBoost-demo-' + timestamp
print(endpoint_name)

# deploy() returns a Predictor. Note that this rebinds the name
# `xgb_estimator` to that Predictor, which the cells below rely on.
xgb_estimator = xgb_estimator.deploy(endpoint_name=endpoint_name,
                                     initial_instance_count=1,
                                     instance_type='ml.t2.medium')

# Send requests as CSV and parse responses as CSV.
xgb_estimator.serializer = sagemaker.serializers.CSVSerializer()
xgb_estimator.deserializer = sagemaker.deserializers.CSVDeserializer()

XGBoost-demo-05-06-06-23
---------------!
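The timestamp above contains only the day and time, so a name generated in a later month could collide with an existing endpoint. A possible variation, sketched below, includes the full date and checks the name before use. The helper name `make_endpoint_name` is hypothetical, and the pattern and 63-character limit reflect my understanding of the SageMaker endpoint naming rules; treat them as assumptions and confirm them against the API documentation.

```python
import re
from time import gmtime, strftime

# Assumed naming rules: alphanumerics and hyphens, at most 63 characters.
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_ENDPOINT_NAME_LENGTH = 63


def make_endpoint_name(prefix, now=None):
    """Build '<prefix>-YYYY-MM-DD-HH-MM-SS' (UTC) and validate it."""
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime() if now is None else now)
    name = f"{prefix}-{timestamp}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH:
        raise ValueError(f"Endpoint name too long: {name!r}")
    if not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError(f"Endpoint name has invalid characters: {name!r}")
    return name


print(make_endpoint_name("XGBoost-demo"))
```

Passing a fixed `now` value (a `time.struct_time`) makes the helper deterministic, which is convenient in tests.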

### Predict an appropriate sample

In [7]:
test_sample = '0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98'
response = xgb_estimator.predict(test_sample)

print(response)

[['24.1808795929']]


In [8]:
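# The same endpoint can also be called through the low-level runtime client.
runtime = boto3.Session().client(service_name='sagemaker-runtime')

response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='text/csv',
                                   Body=test_sample)

# response['Body'] is a stream; read() returns the raw bytes.
print(response['Body'].read())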

b'24.1808795929'
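Unlike the predictor, the low-level client returns raw bytes, so the caller has to decode them. A minimal sketch of that step follows, assuming the body is UTF-8 text with values separated by commas or newlines; the helper name `parse_prediction_body` is hypothetical.

```python
def parse_prediction_body(body_bytes):
    """Decode a CSV response body into a list of floats."""
    text = body_bytes.decode("utf-8").strip()
    # Treat newlines like commas so both separators are handled.
    return [float(value) for value in text.replace("\n", ",").split(",") if value]


print(parse_prediction_body(b"24.1808795929"))
```

Note that the streaming body can only be read once, so store the result of `read()` in a variable if you need it again.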


In [9]:
test_samples = ['0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98',
                '0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,9.14']

response = xgb_estimator.predict(test_samples)
print(response)

[['24.1808795929', '21.5899925232']]
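The deserialized response is a nested list of strings, so a small post-processing step is usually needed before the predictions can be used numerically. The sketch below flattens the response, converts it to floats, and pairs each input sample with its prediction. The helper name `pair_samples_with_predictions` is hypothetical, and the input is the output printed above.

```python
def pair_samples_with_predictions(samples, response):
    """Return [(sample, prediction), ...] from a nested list of strings."""
    predictions = [float(value) for row in response for value in row]
    if len(predictions) != len(samples):
        raise ValueError(
            f"Got {len(predictions)} predictions for {len(samples)} samples"
        )
    return list(zip(samples, predictions))


samples = [
    '0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98',
    '0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,9.14',
]
pairs = pair_samples_with_predictions(
    samples, [['24.1808795929', '21.5899925232']]
)
for sample, prediction in pairs:
    print(prediction, '<-', sample)
```

The length check guards against silently misaligned results if the endpoint returns fewer or more values than samples sent.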


## Delete endpoint

You are charged for as long as the endpoint is running.
When you no longer need it, delete it with the following code.

In [10]:
xgb_estimator.delete_endpoint()
