## Distributed training

To quantify the benefit of distributed training, we first establish baseline benchmarks for training time and model performance.

### XGBoost baseline

In [8]:
s3_bucket = 'rdtest-data'
s3_prefix = 'prepared_parquet4'

In [9]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
region = 'us-east-1'

In [10]:
# define the data type and paths to the training and validation datasets
content_type = "csv"
train_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')
validation_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')

In [None]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"5"}

# set an output path where the trained model will be saved
m_prefix = 'baseline'
output_path = 's3://{}/{}/{}/output'.format(s3_bucket, m_prefix, 'xgboost')

# look up the URI of the built-in XGBoost container image for this region
# change the version string ("1.2-1") to use a different container release
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.12xlarge', 
                                          volume_size=200, # EBS volume size in GB
                                          output_path=output_path)



# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

XGBoost gives us a training time of 1157 seconds and a validation RMSE of 1.33505.

### PyTorch baseline

Note that we use a simple CSV data loader that reads the entire data set into memory. If that proves infeasible, consider using a more sophisticated CSV loader or a [Redis-based loader](https://github.com/RedisAI/RedisAI).

* Move to device outside of model
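
If reading the entire data set into memory proves infeasible, one alternative is to stream the CSV files in fixed-size batches. The following is a minimal sketch using only the standard library, not the loader used in `train_pytorch.py`. The function name `iter_csv_batches`, the `label_col` parameter, and the assumption that the files have no header row with the label in the first column are all hypothetical choices for illustration.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_csv_batches(
    paths: Iterable[str], batch_size: int, label_col: int = 0
) -> Iterator[Tuple[List[List[float]], List[float]]]:
    """Yield (features, labels) batches without loading whole files into memory.

    Assumes headerless, purely numeric CSV files (an assumption, not a
    property confirmed by the notebook). Batches never span two files, so
    the last batch of each file may be smaller than batch_size.
    """
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            while True:
                # Pull at most batch_size rows from the file on each pass.
                rows = list(islice(reader, batch_size))
                if not rows:
                    break
                labels = [float(row[label_col]) for row in rows]
                features = [
                    [float(value) for i, value in enumerate(row) if i != label_col]
                    for row in rows
                ]
                yield features, labels
```

Each yielded batch can be converted to tensors and moved to the device inside the training loop, which keeps peak memory proportional to `batch_size` rather than to the size of the data set.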

In [None]:
train_instance_type = "ml.p3.16xlarge"

In [None]:
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    entry_point="train_pytorch.py",
    source_dir="code",
    role=sagemaker.get_execution_role(),
    instance_count=2,
    instance_type=train_instance_type,
    framework_version="1.6",
    py_version="py3",
    volume_size=1024
)

In [None]:
ptrain = "s3://rdtest-data/prepared_parquet4/train/part-00424-5a7f35ae-f02b-44f4-84e6-1063c134015d-c000.csv"
pvalid = "s3://rdtest-data/prepared_parquet4/validation/part-00424-ab5c94f8-354e-42ff-8e41-39c36d5e3c30-c000.csv"

In [None]:
pt_estimator.fit({'train': ptrain, 'test': pvalid})

In [None]:
pt_estimator.fit({'train': train_input, 'test': validation_input})

The PyTorch baseline gives us a training time of 6931 seconds and a validation RMSE of 3.43. Note that this run used only 5 of the files and 1 epoch.

### PyTorch with distributed library

If you want to use multiple instances, use ml.p3dn.24xlarge or ml.p4d.24xlarge.

In [11]:
train_instance_type_dist = "ml.p3.16xlarge"

In [None]:
train_input_l = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'), content_type=content_type)
validation_input_l = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type)

In [12]:
s3_prefix = 'prepared_parquet4_p'
train_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')
validation_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')

In [13]:
from sagemaker.pytorch import PyTorch
pt_dist_estimator = PyTorch(
    entry_point="train_pytorch_dist.py",
    source_dir="code",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type=train_instance_type_dist,
    framework_version="1.8.1",
    py_version="py36",
    volume_size=256,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    debugger_hook_config=False,
    disable_profiler=True
)

In [None]:
pt_dist_estimator.fit({'train': ptrain, 'test': pvalid})

In [None]:
pt_dist_estimator.fit({'train': train_input, 'test': validation_input})

2021-06-14 17:38:23 Starting - Starting the training job...
2021-06-14 17:38:25 Starting - Launching requested ML instances............
2021-06-14 17:40:25 Starting - Preparing the instances for training......
2021-06-14 17:41:50 Downloading - Downloading input data......
2021-06-14 17:42:54 Training - Downloading the training image.....................
2021-06-14 17:46:21 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-06-14 17:46:22,064 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-06-14 17:46:22,142 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-06-14 17:46:23,581 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2021-06-14 17:46:23,582 sagemaker_pytorch_container.training INFO     In

Distributed mode gives us an MSE of 0.0147 with a training time of 4258 seconds. Compared to 6931 seconds without distributed training, that is a 38.6% reduction in training time (roughly a 1.6x speedup).
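
The comparison can be checked with a few lines of arithmetic. This sketch uses only the timings and MSE reported in this section. The RMSE conversion is included only because the baselines report RMSE while the distributed run reports MSE. It assumes both metrics were computed on the same target scale, which the notebook does not confirm.

```python
import math

baseline_seconds = 6931     # PyTorch baseline training time reported above
distributed_seconds = 4258  # distributed training time reported above
distributed_mse = 0.0147    # MSE reported for the distributed run

# Fraction of the baseline training time that was saved, as a percentage.
time_reduction_pct = (baseline_seconds - distributed_seconds) / baseline_seconds * 100

# How many times faster the distributed run finished.
speedup_factor = baseline_seconds / distributed_seconds

# RMSE is the square root of MSE.
distributed_rmse = math.sqrt(distributed_mse)

print(f"Time reduction: {time_reduction_pct:.1f}%")  # 38.6%
print(f"Speedup factor: {speedup_factor:.2f}x")      # 1.63x
print(f"RMSE from MSE:  {distributed_rmse:.4f}")     # 0.1212
```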

## HPO

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
}

In [None]:
objective_metric_name = "validation:rmse"

In [None]:
estimator_hpo = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.12xlarge', 
                                          volume_size=200, # EBS volume size in GB
                                          output_path=output_path)

In [None]:
tuner = HyperparameterTuner(
    estimator_hpo, objective_metric_name, hyperparameter_ranges, max_jobs=10, max_parallel_jobs=2,
    objective_type = 'Minimize'
)

In [None]:
tuner.fit({'train': train_input, 'validation': validation_input})

In [None]:
tuning_job_name = tuner.latest_tuning_job.name

In [None]:
import pandas as pd

tuner_results = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_results.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=True)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

Best RMSE is 1.335 with max_depth = 2, alpha = 1.59, eta = 0.99, min_child_weight = 5.90.

### Experiments

In [None]:
import boto3
sm = boto3.client("sagemaker")

In [None]:
experiment_name = "example-experiment-with-tuning-jobs"
trial_name = tuning_job_name + "-trial"

print(f"Associate all training jobs created by {tuning_job_name} with trial {trial_name}")

In [None]:
!pip install sagemaker-experiments

In [None]:
from datetime import timezone

from smexperiments.experiment import Experiment
from smexperiments.search_expression import Filter, Operator, SearchExpression
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

In [None]:
# create the experiment if it doesn't exist
try:
    experiment = Experiment.load(experiment_name=experiment_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        experiment = Experiment.create(experiment_name=experiment_name)
    else:
        raise  # surface unexpected errors instead of silently continuing


# create the trial if it doesn't exist
try:
    trial = Trial.load(trial_name=trial_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)
    else:
        raise  # surface unexpected errors instead of silently continuing

In [None]:
# convert the tuning job's creation time to a UTC timestamp string for the search filter
tuning_job = tuner.describe()
creation_time = tuning_job["CreationTime"]
creation_time = creation_time.astimezone(timezone.utc)
creation_time = creation_time.strftime("%Y-%m-%dT%H:%M:%SZ")

created_after_filter = Filter(
    name="CreationTime",
    operator=Operator.GREATER_THAN_OR_EQUAL,
    value=str(creation_time),
)

# the training job names contain the tuning job name (and the training job name is in the source arn)
source_arn_filter = Filter(
    name="TrialComponentName", operator=Operator.CONTAINS, value=tuning_job_name
)
source_type_filter = Filter(
    name="Source.SourceType", operator=Operator.EQUALS, value="SageMakerTrainingJob"
)

search_expression = SearchExpression(
    filters=[created_after_filter, source_arn_filter, source_type_filter]
)

# search iterates over every page of results by default
trial_component_search_results = list(
    TrialComponent.search(search_expression=search_expression, sagemaker_boto_client=sm)
)
print(f"Found {len(trial_component_search_results)} trial components.")

In [None]:
# associate the trial components with the trial
import time
for tc in trial_component_search_results:
    print(f"Associating trial component {tc.trial_component_name} with trial {trial.trial_name}.")
    trial.add_trial_component(tc.trial_component_name)
    # sleep to avoid throttling
    time.sleep(0.5)

In [None]:
from sagemaker.analytics import ExperimentAnalytics
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = Session(sess)

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session, experiment_name=experiment_name
)
trial_comp_ds_jobs = trial_component_analytics.dataframe()
trial_comp_ds_jobs

In [None]:
trial_comp_ds_jobs = trial_comp_ds_jobs.sort_values("validation:rmse - Last", ascending=False)
trial_comp_ds_jobs[["TrialComponentName", "validation:rmse - Last", "max_depth"]]

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

trial_comp_ds_jobs["col_names"] = (
    trial_comp_ds_jobs["max_depth"].astype("str")
    + "-" + trial_comp_ds_jobs["alpha"].astype("str")
    + "-" + trial_comp_ds_jobs["eta"].astype("str")
    + "-" + trial_comp_ds_jobs["min_child_weight"].astype("str")
)

In [None]:
sns.set(style="dark")
sns.stripplot(data = trial_comp_ds_jobs, x="validation:rmse - Last", y = "col_names")

## Pipe mode

In [None]:
# set an output path where the trained model will be saved
m_prefix = 'baseline-pipe'
output_path = 's3://{}/{}/{}/output'.format(s3_bucket, m_prefix, 'xgboost')

# look up the URI of the built-in XGBoost container image for this region
# change the version string ("1.2-1") to use a different container release
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.12xlarge', 
                                          volume_size=200, # EBS volume size in GB
                                          output_path=output_path,
                                          use_spot_instances=False,
                                          input_mode="Pipe")



# execute the XGBoost training job
estimator.fit({'train': ptrain, 'validation': pvalid})

Billable time was 53 seconds with pipe mode.

## Spot

In [None]:
m_prefix = 'spot'
output_path = 's3://{}/{}/{}/output'.format(s3_bucket, m_prefix, 'xgboost')

# look up the URI of the built-in XGBoost container image for this region
# change the version string ("1.2-1") to use a different container release
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.12xlarge', 
                                          volume_size=200, # EBS volume size in GB
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=3600,
                                          max_wait=3600)



# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

Spot savings were 64.6%, as reported in the training job log:

    Managed Spot Training savings: 64.6%
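
The savings percentage compares billable time with total training time. As a minimal sketch, the helper below reproduces that calculation. The function name `spot_savings_pct` and the example durations are hypothetical, since the notebook reports only the final percentage. In practice the two inputs would come from the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields returned by the `describe_training_job` API.

```python
def spot_savings_pct(training_seconds: float, billable_seconds: float) -> float:
    """Return the percentage of training time that was not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical durations chosen only to illustrate the formula.
example_savings = spot_savings_pct(training_seconds=1000, billable_seconds=354)
print(f"Managed Spot Training savings: {example_savings:.1f}%")
```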