# Retain and reuse infrastructure: SageMaker Warm Pools

In this brief, optional module, we will take a look at SageMaker Warm Pools. They let you retain and reuse provisioned infrastructure after a training job completes. This reduces latency for repetitive workloads, such as iterative experimentation or consecutive jobs, by skipping the time-consuming process of resource provisioning.

In this notebook, we will use the same training job structure and dataset as in [unit 1](./unit_1.ipynb) to demonstrate SageMaker Warm Pools. The main difference is one additional `Estimator` parameter, `keep_alive_period_in_seconds`, which activates the warm pool.

## Setup

In [1]:
import sagemaker
print(f'This notebook was run with `sagemaker` v{sagemaker.__version__}')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
This notebook was run with `sagemaker` v2.219.0


In [2]:
from utils.helpers import get_secret
from utils.toy_datasets import upload_dataset_to_s3

s3_bucket_uri = get_secret('s3_bucket_uri')
s3_bucket_name = get_secret('s3_bucket_name')
role = get_secret('role_arn')
session = sagemaker.Session()

dataset_name = 'iris'
upload_dataset_to_s3(dataset_name, s3_bucket_name)

Files uploaded to S3 successfully.
Cleanup complete!


In [3]:
image_uri = sagemaker.image_uris.retrieve('xgboost', region='us-east-1', version='1.5-1')

<div style="border:2px solid #FFA500; padding: 10px; background-color: #FFF9C4; border-radius: 5px;">
    <b>IMPORTANT:</b>
    To get started, you must first request a service limit increase for SageMaker managed warm pools. <b>The default resource limit for warm pools is 0</b>. For more information about Service Quotas, click <a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html">here.</a></div><br>

![](./img/warm_pool_quota.png)

The `keep_alive_period_in_seconds` parameter is added to the `Estimator` to activate the warm pool feature, keeping the training instances warm for the specified duration. This setup allows us to quickly reuse the training instances for subsequent training jobs, significantly reducing the startup time and improving overall efficiency.

<div style="border:2px solid #42A891; padding: 10px; background-color: #E0F2F1; border-radius: 5px;">
    <b>LEARN MORE:</b> The maximum <code>keep_alive_period_in_seconds</code> for a single training job is 3600 seconds (60 minutes), and the maximum length of time that a warm pool cluster can continue running consecutive training jobs is 28 days.
Read more in the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html">official documentation.</a>
</div>
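
To avoid submitting a job with an out-of-range value, the limit quoted above can be checked locally before building the `Estimator`. The following is a minimal sketch: `validate_keep_alive` and `MAX_KEEP_ALIVE_SECONDS` are hypothetical names introduced here for illustration, not part of the `sagemaker` SDK.

```python
# Hypothetical helper, not part of the sagemaker SDK.
MAX_KEEP_ALIVE_SECONDS = 3600  # per-job maximum quoted in the note above


def validate_keep_alive(seconds: int) -> int:
    """Return `seconds` unchanged if it is a usable keep-alive value, else raise."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("keep-alive period must be an integer number of seconds")
    if not 0 <= seconds <= MAX_KEEP_ALIVE_SECONDS:
        raise ValueError(
            f"keep-alive period must be between 0 and {MAX_KEEP_ALIVE_SECONDS} seconds"
        )
    return seconds


print(validate_keep_alive(600))
```

The validated value could then be passed as `keep_alive_period_in_seconds` when creating the `Estimator`.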

In [4]:
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

s3_train = TrainingInput(
    s3_data=f's3://{s3_bucket_name}/iris_dataset/train_data.csv',
    content_type='csv'
)

s3_validate = TrainingInput(
    s3_data=f's3://{s3_bucket_name}/iris_dataset/test_data.csv',
    content_type='csv'
)

training_inputs = {'train': s3_train, 'validation': s3_validate}

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"{s3_bucket_uri}/pipelines-output",
    sagemaker_session=session,
    base_job_name="iris-warm-pool-sm-xgboost",
    keep_alive_period_in_seconds=600 # Warm Pool activation
)

Now we will train the model. Note that after the training job is completed, SageMaker lets us know that the warm pool is active with the message `Completed - Resource retained for reuse`.

In [5]:
hyperparameters = {
    "max_depth": 1,
    "objective": "multi:softmax",
    "eval_metric": "mlogloss",
    "num_class": 3,
    "num_round": 10
}

estimator.set_hyperparameters(**hyperparameters)

estimator.fit(
    training_inputs,
    logs=False,
)

INFO:sagemaker:Creating training-job with name: iris-warm-pool-sm-xgboost-2024-06-12-17-54-01-815



2024-06-12 17:54:04 Starting - Starting the training job..
2024-06-12 17:54:19 Starting - Preparing the instances for training.....
2024-06-12 17:54:49 Downloading - Downloading input data........
2024-06-12 17:55:34 Downloading - Downloading the training image..........
2024-06-12 17:56:30 Training - Training image download completed. Training in progress..
2024-06-12 17:56:40 Uploading - Uploading generated training model..
2024-06-12 17:56:56 Completed - Resource retained for reuse


## Fetching and Printing Training Job Details

After the training job completes, we can use `session.describe_training_job()` to retrieve details about the job. This includes the total duration, billable time, and warm pool status.

In [6]:
first_job_name = estimator.latest_training_job.job_name
training_job_description = session.describe_training_job(first_job_name)

In [7]:
# It takes a couple of seconds for WarmPoolStatus info to be available
t = training_job_description['TrainingEndTime'] - training_job_description['CreationTime']
# total_seconds() covers the whole interval; .seconds would drop any full days
print(f"Training job duration: {int(t.total_seconds())} seconds")
print("Billable time in seconds:", training_job_description['BillableTimeInSeconds'])
print("Warm Pool:", training_job_description['WarmPoolStatus'])
print("- Alive Period In Seconds:", training_job_description['ResourceConfig']['KeepAlivePeriodInSeconds'])

Training job duration: 171 seconds
Billable time in seconds: 124
Warm Pool: {'Status': 'Available'}
- Alive Period In Seconds: 600
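
Because `WarmPoolStatus` may not be present immediately after the job finishes, one option is to poll the description until the key appears. This is a minimal sketch, not an SDK feature: `wait_for_warm_pool_status` is a hypothetical helper, and the `describe` callable is replaced by a stand-in so the example runs without AWS access.

```python
import time
from typing import Callable


def wait_for_warm_pool_status(
    describe: Callable[[], dict],
    attempts: int = 10,
    delay_seconds: float = 2.0,
) -> dict:
    """Call `describe()` until its result contains 'WarmPoolStatus'."""
    for attempt in range(attempts):
        description = describe()
        if 'WarmPoolStatus' in description:
            return description
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before asking again
    raise TimeoutError("WarmPoolStatus was not reported in time")


# Stand-in responses (assumption): the first call lacks the key, the second has it.
responses = iter([{}, {'WarmPoolStatus': {'Status': 'Available'}}])
result = wait_for_warm_pool_status(lambda: next(responses), delay_seconds=0)
print(result['WarmPoolStatus'])
```

In the notebook, the callable would be `lambda: session.describe_training_job(first_job_name)`.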


The following timeline shows the different stages of the training job:

| Time                   | Status    | Message                                      |
|------------------------|-----------|----------------------------------------------|
| 2024-06-12 17:54:04    | Starting  | Starting the training job..                  |
| 2024-06-12 17:54:19    | Starting  | Preparing the instances for training.....    |
| 2024-06-12 17:54:49    | Downloading | Downloading input data........               |
| 2024-06-12 17:55:34    | Downloading | Downloading the training image..........     |
| 2024-06-12 17:56:30    | Training  | Training image download completed. Training in progress.. |
| 2024-06-12 17:56:40    | Uploading | Uploading generated training model..         |
| 2024-06-12 17:56:56    | Completed | Resource retained for reuse                  |

As the timeline shows, billing starts once the instance has been prepared and the download phase begins. Downloading the training image alone takes almost a minute. Warm Pools reduce this cost because the image remains ready for training. In addition, although the initial instance preparation is not billed, Warm Pools save significant wall-clock time at the start of each training job, since preparing a retained instance is much faster.
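
The per-stage durations can be derived directly from the timestamps in the timeline. The sketch below copies the log entries from the table (with shortened messages) and computes how long each stage lasted. The log timestamps are coarse, so the window from the first download to completion is only an approximation of the reported billable time.

```python
from datetime import datetime

# (timestamp, stage) pairs copied from the timeline table above
timeline = [
    ("2024-06-12 17:54:04", "Starting the training job"),
    ("2024-06-12 17:54:19", "Preparing the instances for training"),
    ("2024-06-12 17:54:49", "Downloading input data"),
    ("2024-06-12 17:55:34", "Downloading the training image"),
    ("2024-06-12 17:56:30", "Training in progress"),
    ("2024-06-12 17:56:40", "Uploading generated training model"),
    ("2024-06-12 17:56:56", "Resource retained for reuse"),
]

stamps = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _ in timeline]

# Each stage lasts until the next log entry; the final entry has no duration.
stage_seconds = {
    stage: int((end - start).total_seconds())
    for (_, stage), start, end in zip(timeline, stamps, stamps[1:])
}

for stage, seconds in stage_seconds.items():
    print(f"{seconds:>3} s  {stage}")

# Time from the first download entry to completion
billable_window = int((stamps[-1] - stamps[2]).total_seconds())
print(f"Download start to completion: {billable_window} s")
```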

## Updating the Keep Alive Period

You can update `KeepAlivePeriodInSeconds` for a training job with the `session.update_training_job` method. The maximum value is 3600 seconds (1 hour). Note that updating the training job does not restart the timer at the new value: the time that has already elapsed is subtracted from the new keep-alive period.

![](./img/warm_pool_600.png)

In [8]:
session.update_training_job(first_job_name, resource_config={"KeepAlivePeriodInSeconds":3600})

INFO:sagemaker:Updating training job with name iris-warm-pool-sm-xgboost-2024-06-12-17-54-01-815


![](./img/warm_pool_3600.png)
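
Read literally, the rule described in this section is plain subtraction. The sketch below only illustrates that arithmetic as stated in this notebook; it is not a description of confirmed service behavior. `remaining_keep_alive` is a hypothetical helper and the elapsed time of 120 seconds is an assumed value.

```python
def remaining_keep_alive(new_period_seconds: int, elapsed_seconds: int) -> int:
    """Seconds left in the warm pool after an update, never below zero."""
    return max(new_period_seconds - elapsed_seconds, 0)


# Assumed example: the pool has been idle for 120 s when the period is raised to 3600 s.
print(remaining_keep_alive(3600, 120))  # 3480
```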

## Training Multiple Models with Varying Hyperparameters

Now we will train several models, varying the `max_depth` hyperparameter. The goal is to observe how the training time is reduced due to the use of the warm pool.



In [None]:
for max_depth in [2, 3, 4, 5]:
    
    print(f"[INFO] Training XGBoost with max_depth={max_depth}", end='')
    hyperparameters["max_depth"] = max_depth
    
    estimator.set_hyperparameters(**hyperparameters)
    
    estimator.fit(
        training_inputs,
        logs=False,
    )
    
    last_job_name = estimator.latest_training_job.job_name
    training_job_description = session.describe_training_job(last_job_name)
    t = training_job_description['TrainingEndTime'] - training_job_description['CreationTime']

    print(f"[INFO] Training job duration: {int(t.total_seconds())} seconds")
    print(f"[INFO] Billable time: {training_job_description['BillableTimeInSeconds']} seconds")


INFO:sagemaker:Creating training-job with name: iris-warm-pool-sm-xgboost-2024-06-12-17-58-34-101


[INFO] Training XGBoost with max_depth=2
2024-06-12 17:58:38 Starting - Found matching resource for reuse
2024-06-12 17:58:38 Downloading - Downloading input data...
2024-06-12 17:58:58 Training - Training image download completed. Training in progress..
2024-06-12 17:59:06 Uploading - Uploading generated training model.
2024-06-12 17:59:19 Completed - Training job completed

INFO:sagemaker:Creating training-job with name: iris-warm-pool-sm-xgboost-2024-06-12-17-59-19-897



[INFO] Training job duration: 44 seconds
[INFO] Billable time: 40 seconds
[INFO] Training XGBoost with max_depth=3
2024-06-12 17:59:23 Starting - Found matching resource for reuse
2024-06-12 17:59:23 Downloading - Downloading input data...
2024-06-12 17:59:44 Training - Training image download completed. Training in progress..
2024-06-12 17:59:54 Uploading - Uploading generated training model..
2024-06-12 18:00:08 Completed - Resource retained for reuse

INFO:sagemaker:Creating training-job with name: iris-warm-pool-sm-xgboost-2024-06-12-18-00-10-720



[INFO] Training job duration: 47 seconds
[INFO] Billable time: 44 seconds
[INFO] Training XGBoost with max_depth=4
2024-06-12 18:00:10 Starting - Starting the training job.......
2024-06-12 18:00:53 Downloading - Downloading input data....
2024-06-12 18:01:19 Training - Training image download completed. Training in progress..
2024-06-12 18:01:29 Uploading - Uploading generated training model..
2024-06-12 18:01:44 Completed - Resource retained for reuse

INFO:sagemaker:Creating training-job with name: iris-warm-pool-sm-xgboost-2024-06-12-18-01-46-971



[INFO] Training job duration: 91 seconds
[INFO] Billable time: 49 seconds
[INFO] Training XGBoost with max_depth=5
2024-06-12 18:01:51 Starting - Found matching resource for reuse
2024-06-12 18:01:51 Downloading - Downloading input data....
2024-06-12 18:02:17 Downloading - Downloading the training image
2024-06-12 18:02:18 Training - Training image download completed. Training in progress..
2024-06-12 18:02:28 Uploading - Uploading generated training model.
2024-06-12 18:02:42 Completed - Resource retained for reuse
[INFO] Training job duration: 54 seconds
[INFO] Billable time: 50 seconds


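As you can see, training time dropped significantly: the first job took 171 seconds, while the jobs that reused the warm pool (`max_depth` 2, 3, and 5) finished in 44 to 54 seconds. The `max_depth=4` job took 91 seconds; its log shows `Starting the training job` instead of `Found matching resource for reuse`, so it did not pick up a retained resource.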

![](./img/warm_pool_comparison.png)
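
To put a number on the improvement, the durations printed above can be compared with the first job. This is a small sketch using only the values from this run; a job is counted as reused when its log shows `Found matching resource for reuse`.

```python
from statistics import mean

first_job_seconds = 171  # first job, before any warm pool existed

# max_depth -> (duration in seconds, whether the log reported reuse)
later_jobs = {2: (44, True), 3: (47, True), 4: (91, False), 5: (54, True)}

reused_durations = [seconds for seconds, reused in later_jobs.values() if reused]
average_reused = mean(reused_durations)

print(f"Average duration with reuse: {average_reused:.1f} s")
print(f"Time saved per job: {first_job_seconds - average_reused:.1f} s")
print(f"Speed-up: {first_job_seconds / average_reused:.1f}x")
```

With only three reused jobs on a toy dataset, these figures are indicative rather than a benchmark.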

Keep in mind that you are also billed for the time the resource is retained. Once you are done with your experiments, you can update the last training job to terminate the warm pool manually, like this:

In [10]:
last_job_name = estimator.latest_training_job.job_name
session.update_training_job(last_job_name, resource_config={"KeepAlivePeriodInSeconds":0})

INFO:sagemaker:Updating training job with name iris-warm-pool-sm-xgboost-2024-06-12-18-01-46-971


This ensures that you are not billed for unnecessary retention time. You can check the retention time of any training job through its `WarmPoolStatus`. Here we check the status of the first training job, which went from `Available` (after the first training job completed) to `Reused` (after the second training job started).

In [11]:
session.describe_training_job(first_job_name)['WarmPoolStatus']

{'Status': 'Reused',
 'ResourceRetainedBillableTimeInSeconds': 97,
 'ReusedByJob': 'iris-warm-pool-sm-xgboost-2024-06-12-17-58-34-101'}
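
To see the total retention time billed across an experiment, the `ResourceRetainedBillableTimeInSeconds` values can be summed over all job descriptions. This is a minimal sketch: `retained_billable_seconds` is a hypothetical helper, and the second description below is an invented stand-in for a job that reports no retention time.

```python
def retained_billable_seconds(descriptions: list) -> int:
    """Sum the retained billable time over training job descriptions."""
    return sum(
        d.get('WarmPoolStatus', {}).get('ResourceRetainedBillableTimeInSeconds', 0)
        for d in descriptions
    )


descriptions = [
    # WarmPoolStatus of the first job, as printed above
    {'WarmPoolStatus': {'Status': 'Reused',
                        'ResourceRetainedBillableTimeInSeconds': 97}},
    # Stand-in (assumption): a job whose status carries no retention time
    {'WarmPoolStatus': {'Status': 'Available'}},
]

print(retained_billable_seconds(descriptions))  # 97
```

In the notebook, the list could be built by calling `session.describe_training_job` for each job name collected during the loop.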