# Distributed training with Amazon SageMaker built-in algorithm XGBoost 

This notebook shows usage of SageMaker Managed Spot infrastructure for XGBoost training. Below we show how Spot instances can be used for the 'algorithm mode' and 'script mode' training methods with the XGBoost container. 

[Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) uses Amazon EC2 Spot instances to run training jobs instead of on-demand instances. You can specify which training jobs use Spot instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using Amazon EC2 Spot instances.

This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

In this notebook we will perform XGBoost training as described [here](). See the original notebook for more details on the data. 

### Setup variables and define functions

In [2]:
!pip3 install -U sagemaker

Collecting sagemaker
  Downloading sagemaker-2.112.2.tar.gz (579 kB)
     579.2/579.2 kB 6.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting schema
  Using cached schema-0.7.5-py2.py3-none-any.whl (17 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.112.2-py2.py3-none-any.whl size=796129 sha256=d4ce37abe0c458d35c13e4b445d2c80b86ffe5b6a244f0aeb19ebd3ae05ccb4a
  Stored in directory: /root/.cache/pip/wheels/c9/2a/d8/0db78f00aee63d4fddc2c64edcb1e761ef8e1a502137dcbaeb
Successfully built sagemaker
Installing collected packages: schema, sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.107.0
    Uninstalling sagemaker-2.107.0:
      Successfully uninstalled sagemaker-2.107.0
Successfully installed sagemaker-2.112.2 s

In [8]:
%%time

import os
import boto3
import re
import sagemaker

# Get a SageMaker-compatible role used by this Notebook Instance.
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

### update below values appropriately ###
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/xgboost-dist-builtin'

print(region)

us-west-2
CPU times: user 136 ms, sys: 3.09 ms, total: 139 ms
Wall time: 531 ms


In [9]:
%%time

import pyarrow
import numpy as np
import pandas as pd
from sklearn.datasets import load_svmlight_file

s3 = boto3.client("s3")
# Download the dataset and load into a pandas dataframe
FILE_NAME = 'abalone.csv'
s3.download_file("sagemaker-sample-files", "datasets/tabular/uci_abalone/abalone.csv", FILE_NAME)

feature_names=['Sex', 
               'Length', 
               'Diameter', 
               'Height', 
               'Whole weight', 
               'Shucked weight', 
               'Viscera weight', 
               'Shell weight', 
               'Rings']

data = pd.read_csv(FILE_NAME, 
                   header=None, 
                   names=feature_names)
data["Sex"] = data["Sex"].astype("category").cat.codes

data.head()

CPU times: user 358 ms, sys: 53.7 ms, total: 411 ms
Wall time: 1.59 s


Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,2,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,2,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,1,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [10]:
# SageMaker XGBoost has the convention of label in the first column
data = data[feature_names[-1:] + feature_names[:-1]]
data.head()

Unnamed: 0,Rings,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight
0,15,2,0.455,0.365,0.095,0.514,0.2245,0.101,0.15
1,7,2,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07
2,9,0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21
3,10,2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155
4,7,1,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055
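
The reordering expression in the cell above is plain list slicing: `feature_names[-1:]` is a one-element list holding the label and `feature_names[:-1]` is everything before it. A minimal sketch with the same column names, independent of pandas, shows what the column order becomes:

```python
# Column names as defined earlier in this notebook; "Rings" is the label.
feature_names = [
    "Sex", "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight", "Rings",
]

# Move the last element (the label) to the front.
# The relative order of the feature columns is preserved.
label_first = feature_names[-1:] + feature_names[:-1]

print(label_first[0])   # Rings
print(label_first[1:])  # the eight feature columns, in their original order
```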


In [11]:
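# Shuffle, then split the downloaded data into 80% train / 20% validation dataframes
train, validation = np.split(data.sample(frac=1), [int(.8*len(data))])
# Split the training data into two halves so that each training instance can receive its own file
train_0, train_1 = np.split(train.sample(frac=1), [int(.5*len(train))])

# Write headerless CSV files, the format expected by the built-in algorithm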
train_0.to_csv('abalone_train_0.csv', index=False, header=False)
train_1.to_csv('abalone_train_1.csv', index=False, header=False)
validation.to_csv('abalone_validation.csv', index=False, header=False)
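
To see how the two `np.split` calls in the cell above carve up the rows, the index arithmetic can be reproduced on its own. This is a sketch only: `split_sizes` is a hypothetical helper and the row count of 1000 is an illustrative assumption, not the size of the Abalone dataset.

```python
from typing import Tuple


def split_sizes(n_rows: int) -> Tuple[int, int, int]:
    """Return (train_0, train_1, validation) row counts.

    Mirrors the split points int(.8 * len(data)) and int(.5 * len(train))
    used in the notebook cell.
    """
    n_train = int(0.8 * n_rows)        # first 80% of the shuffled rows
    n_validation = n_rows - n_train    # remaining rows go to validation
    n_train_0 = int(0.5 * n_train)     # first half of the training rows
    n_train_1 = n_train - n_train_0    # second half (one row larger if n_train is odd)
    return n_train_0, n_train_1, n_validation


print(split_sizes(1000))  # (400, 400, 200)
```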

In [12]:
%%time

sagemaker.Session().upload_data('abalone_train_0.csv', 
                                bucket=bucket, 
                                key_prefix=prefix+'/'+'train')

sagemaker.Session().upload_data('abalone_train_1.csv', 
                                bucket=bucket, 
                                key_prefix=prefix+'/'+'train')

sagemaker.Session().upload_data('abalone_validation.csv', 
                                bucket=bucket, 
                                key_prefix=prefix+'/'+'validation')

CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 387 ms


's3://sagemaker-us-west-2-240487350066/sagemaker/xgboost-dist-builtin/validation/abalone_validation.csv'

### Obtaining the latest XGBoost container
We obtain the new container by specifying the framework version (1.5-1). This version specifies the upstream XGBoost framework version (1.5) and an additional SageMaker version (1). If you have an existing XGBoost workflow based on the previous (1.0-1, 1.2-2 or 1.3-1) container, this would be the only change necessary to get the same workflow working with the new container.

In [13]:
container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
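
The version string passed to `image_uris.retrieve` has two parts, as described above. The sketch below separates them; `split_container_version` is a hypothetical helper written for illustration and is not part of the SageMaker SDK.

```python
from typing import Tuple


def split_container_version(version: str) -> Tuple[str, str]:
    """Split a tag such as '1.5-1' into (upstream XGBoost version, SageMaker version)."""
    upstream, _, sagemaker_rev = version.partition("-")
    if not upstream or not sagemaker_rev:
        raise ValueError(f"expected '<xgboost>-<sagemaker>', got {version!r}")
    return upstream, sagemaker_rev


print(split_container_version("1.5-1"))  # ('1.5', '1')
```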

### Training the XGBoost model

After setting the training parameters, we kick off training and poll for status until training is completed, which in this example takes a few minutes.

To run the built-in algorithm on SageMaker, we construct a `sagemaker.estimator.Estimator`, which accepts several constructor arguments:

* __image_uri__: The URI of the XGBoost container image retrieved above.
* __role__: Role ARN
* __hyperparameters__: A dictionary of hyperparameters passed to the training job.
* __instance_count__: The number of SageMaker instances for training. Two are used here so that training is distributed.
* __instance_type__: The type of SageMaker instances for training. __Note__: This particular mode does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.

In [14]:
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
}

instance_type = "ml.m5.2xlarge"
instance_count = 2
output_path = "s3://{}/{}/{}/output".format(bucket, prefix, "abalone-xgb")
content_type = "csv"

If Spot instances are used, the training job can be interrupted, causing it to take longer to start or finish. If a training job is interrupted, a checkpointed snapshot can be used to resume from a previously saved point and can save training time (and cost).

To enable checkpointing for Managed Spot Training using SageMaker XGBoost we need to configure three things:

1. Enable the `use_spot_instances` constructor arg - a boolean that switches the job to Spot instances.

2. Set the `max_wait` constructor arg - this is an int arg giving the total time, in seconds, you are willing to wait for the job to complete, including time spent waiting for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you are only charged for actual compute time spent once Spot instances have been successfully procured.

3. Set the `checkpoint_s3_uri` constructor arg - this arg tells SageMaker an S3 location where checkpoints are saved. While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs because Spot instances can be interrupted with short notice, and resuming from the last checkpoint means you don't lose the progress made before the interruption.

In the cell below, the Spot settings are commented out; uncomment them to try Managed Spot Training. Feel free to toggle the `use_spot_instances` variable to see the effect of running the same job using regular (a.k.a. "On Demand") infrastructure.

Note that `max_wait` can be set if and only if `use_spot_instances` is enabled and must be greater than or equal to `max_run`.
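
The constraints above can be checked before the estimator is constructed. The following is a minimal sketch, assuming the parameter names used in the estimator cell of this notebook (`use_spot_instances`, `max_run`, `max_wait`, `checkpoint_s3_uri`); `spot_training_kwargs` is a hypothetical helper and the bucket name is a placeholder.

```python
from typing import Any, Dict, Optional


def spot_training_kwargs(
    use_spot_instances: bool,
    max_run: int,
    max_wait: Optional[int] = None,
    checkpoint_s3_uri: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Spot or On-Demand training (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "use_spot_instances": use_spot_instances,
        "max_run": max_run,
    }
    if not use_spot_instances:
        # max_wait is only meaningful for Spot jobs.
        if max_wait is not None:
            raise ValueError("max_wait can only be set when use_spot_instances is True")
        return kwargs
    if max_wait is None:
        raise ValueError("max_wait must be set when use_spot_instances is True")
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    kwargs["max_wait"] = max_wait
    # Checkpointing is optional but recommended for Spot jobs.
    if checkpoint_s3_uri is not None:
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs


# Placeholder S3 location, for illustration only.
spot_kwargs = spot_training_kwargs(
    True, max_run=3600, max_wait=7200,
    checkpoint_s3_uri="s3://example-bucket/checkpoints",
)
print(spot_kwargs)
```

The resulting dictionary could then be unpacked into the estimator constructor with `**spot_kwargs`.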

In [15]:
import time
from sagemaker.inputs import TrainingInput

job_name = "DEMO-xgboost-builtin-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print("Training job", job_name)

# use_spot_instances = True
# max_run = 3600
# max_wait = 7200 if use_spot_instances else None
# checkpoint_s3_uri = (
#     "s3://{}/{}/checkpoints/{}".format(bucket, prefix, job_name) if use_spot_instances else None
# )
# print("Checkpoint path:", checkpoint_s3_uri)

xgb_estimator = sagemaker.estimator.Estimator(
    container,
    role,
    hyperparameters=hyperparameters,
    instance_count=instance_count,
    instance_type=instance_type,
    volume_size=5,  # 5 GB
    output_path=output_path,
    sagemaker_session=sagemaker.Session(),
    # use_spot_instances=use_spot_instances,
    # max_run=max_run,
    # max_wait=max_wait,
    # checkpoint_s3_uri=checkpoint_s3_uri,
)

train_input = TrainingInput(
    "s3://{}/{}/{}/".format(bucket, prefix, "train"), 
    distribution='ShardedByS3Key', 
    content_type=content_type)

validation_input = TrainingInput(
    "s3://{}/{}/{}/".format(bucket, prefix, "validation"), 
    distribution='FullyReplicated', 
    content_type=content_type)

Training job DEMO-xgboost-builtin-2022-10-13-01-15-07
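
The two `distribution` settings used above differ in which S3 objects each training instance receives: `FullyReplicated` copies every object to every instance, while `ShardedByS3Key` gives each instance only a subset of the objects. The sketch below models that difference. The round-robin assignment over sorted keys is an assumption made for illustration, not SageMaker's documented assignment logic, and `objects_per_instance` is a hypothetical helper.

```python
from typing import Dict, List


def objects_per_instance(
    keys: List[str], instance_count: int, distribution: str
) -> Dict[int, List[str]]:
    """Simplified model of which S3 objects each instance would see."""
    if distribution == "FullyReplicated":
        # Every instance gets a full copy of the channel.
        return {i: list(keys) for i in range(instance_count)}
    if distribution == "ShardedByS3Key":
        # ASSUMPTION: round-robin over sorted keys, for illustration only.
        ordered = sorted(keys)
        return {i: ordered[i::instance_count] for i in range(instance_count)}
    raise ValueError(f"unknown distribution: {distribution!r}")


train_keys = ["train/abalone_train_0.csv", "train/abalone_train_1.csv"]
validation_keys = ["validation/abalone_validation.csv"]

train_view = objects_per_instance(train_keys, 2, "ShardedByS3Key")
validation_view = objects_per_instance(validation_keys, 2, "FullyReplicated")
print(train_view)
print(validation_view)
```

With two training files and two instances, each instance ends up with one training file and a full copy of the validation file, which is why the training data was written as two separate CSV files earlier.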


In [17]:
xgb_estimator.fit({'train': train_input, 'validation': validation_input}, job_name=job_name)

2022-10-13 01:15:35 Starting - Starting the training job...
2022-10-13 01:16:00 Starting - Preparing the instances for trainingProfilerReport-1665623735: InProgress
......
2022-10-13 01:17:01 Downloading - Downloading input data...
2022-10-13 01:17:21 Training - Downloading the training image.....
[2022-10-13 01:18:10.279 ip-10-0-182-51.us-west-2.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-10-13:01:18:10:INFO] Imported framework sagemaker_xgboost_container.training
[2022-10-13:01:18:10:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2022-10-13:01:18:10:INFO] No GPUs detected (normal if no gpus installed)
[2022-10-13:01:18:10:INFO] Running XGBoost Sagemaker in algorithm mode
[2022-10-13:01:18:10:INFO] Determined delimiter of CSV input is ','
[2022-10-13:01:18:10:INFO] Determined delimiter of CSV input is ','
[2022-10-13

In [22]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

predictor = xgb_estimator.deploy(
    initial_instance_count=1, 
    instance_type="ml.m5.2xlarge",
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)

-----!

In [19]:
array = data.iloc[:5, 1:].to_numpy() 
array

array([[2.    , 0.455 , 0.365 , 0.095 , 0.514 , 0.2245, 0.101 , 0.15  ],
       [2.    , 0.35  , 0.265 , 0.09  , 0.2255, 0.0995, 0.0485, 0.07  ],
       [0.    , 0.53  , 0.42  , 0.135 , 0.677 , 0.2565, 0.1415, 0.21  ],
       [2.    , 0.44  , 0.365 , 0.125 , 0.516 , 0.2155, 0.114 , 0.155 ],
       [1.    , 0.33  , 0.255 , 0.08  , 0.205 , 0.0895, 0.0395, 0.055 ]])

In [24]:
prediction = predictor.predict(array)
prediction

[['9.30324649810791'],
 ['8.23046588897705'],
 ['11.059348106384277'],
 ['9.831652641296387'],
 ['6.647389888763428']]
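
Because the endpoint response is deserialized as CSV, each prediction arrives as a string inside a one-element list. A minimal sketch of converting the values above to floats and comparing them with the `Rings` labels of the same five rows (15, 7, 9, 10, 7, taken from the dataframe output earlier) follows. These rows may have been part of the training split, so the resulting number is a sanity check and not a validation metric.

```python
import math

# Response copied from the output above: one string per row.
prediction = [
    ["9.30324649810791"],
    ["8.23046588897705"],
    ["11.059348106384277"],
    ["9.831652641296387"],
    ["6.647389888763428"],
]

# Rings labels for the same five rows, from data.head() earlier in the notebook.
labels = [15, 7, 9, 10, 7]

# Flatten the nested list and convert each value to float.
predicted_rings = [float(row[0]) for row in prediction]

# Root-mean-square error over these five rows.
squared_errors = [(p - y) ** 2 for p, y in zip(predicted_rings, labels)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(predicted_rings)
print(f"RMSE on these five rows: {rmse:.3f}")
```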

In [25]:
predictor.delete_endpoint()