# Amazon SageMaker Debugger XGBoost training report for Higgs Boson Detection Challenge
## &copy;  [Omkar Mehta](mailto:omehta2@illinois.edu) ##
### Industrial and Enterprise Systems Engineering, The Grainger College of Engineering,  UIUC ###

<hr style="border:2px solid blue"> </hr>


This tutorial walks through an example of training an XGBoost model using data from the [2014 ATLAS Higgs Boson Machine Learning Challenge](http://opendata.cern.ch/record/328). The example showcases some of the new features available in SageMaker Debugger, such as deep profiling and the XGBoost training report.

The [Debugger profiling report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) displays hardware resource utilization metrics such as CPU, GPU, memory, and I/O utilization. Debugger helps you identify hardware bottlenecks and choose a right-sized instance for your training job.

The [Debugger XGBoost training report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-training-xgboost-report.html) provides a comprehensive evaluation of your model's performance to help you fine-tune and improve your model.

The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) makes it easy to train XGBoost models while working with other AWS services, such as Amazon EC2, Amazon ECR, and Amazon S3. For more information about the XGBoost model and SageMaker, see the [XGBoost Algorithm Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) and the [SageMaker Python SDK documentation](https://github.com/aws/sagemaker-python-sdk).

### Table of contents
* [Setup and imports](#setup)
* [Get and prepare data](#data)
* [Create the SageMaker XGBoost Estimator](#estimator)
* [Train XGBoost Model](#train)
* [View post training reports](#reports)

## Setup<a class="anchor" id="setup"></a>
This notebook was created and tested on an `ml.t3.medium` notebook instance.

After we've installed and imported the required packages, we'll need to specify a few variables that will be used throughout the example notebook:
- `role`: The IAM role to run SageMaker training jobs. The default SageMaker role with the SageMaker full access policy will be used.
- `sess`: The SageMaker session that interacts with different AWS services.
- `bucket`: The S3 bucket where the model's input and output data will be stored. We will use the default S3 bucket automatically paired with the SageMaker session.
- `key_prefix`: The directory in the S3 bucket where we'll store the input and output data.
- `region`: The AWS region where we operate the SageMaker training job.
- `s3`: The s3fs client that makes it easier to read data from and write data to the S3 bucket.
- `xgboost_container`: The URI for the XGBoost training container for our region.

In [6]:
#!pip install -Uqq sagemaker
!pip install -Uqq s3fs==0.4.1

In [7]:
import requests
from io import BytesIO
import pandas as pd
import s3fs
from datetime import datetime
import time
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs

from IPython.display import FileLink, FileLinks

In [8]:
# setup sagemaker variables
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
key_prefix = "higgs-boson"
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
s3 = s3fs.S3FileSystem(anon=False)

xgboost_container = image_uris.retrieve("xgboost", region, "1.2-1")

In [21]:
print(bucket)

sagemaker-us-east-2-699001202920


## Get and prepare data <a class="anchor" id="data"></a>

The data for this example notebook is provided by the European Organization for Nuclear Research (CERN). It was used in a 2014 machine learning competition in which participants had to develop an algorithm that improves the detection of Higgs boson signal events decaying into two tau particles from a sample of simulated ATLAS data. More background and details on this data set can be found in the [ATLAS Higgs Boson Machine Learning Challenge 2014](http://opendata.cern.ch/record/328) record.

In [12]:
# Load the data from a local copy of the CERN file (atlas-higgs-challenge-2014-v2.csv.gz),
# which must already be in the notebook's working directory.
# pandas decompresses the gzip archive while reading.
df = pd.read_csv("atlas-higgs-challenge-2014-v2.csv.gz", compression="gzip")

In [13]:
# We remove the columns we don't need and identify columns that will be used as features as well as the target
non_feature_cols = ["EventId", "Weight", "KaggleSet", "KaggleWeight", "Label"]
feature_cols = [col for col in df.columns if col not in non_feature_cols]
label_col = "Label"
df["Label"] = df["Label"].apply(lambda x: 1 if x == "s" else 0)

# The original competition split the data out into training and validation sets. The data includes a column that identifies which sample falls into which set
train_data = df.loc[df["KaggleSet"] == "t", [label_col, *feature_cols]]
test_data = df.loc[df["KaggleSet"] == "b", [label_col, *feature_cols]]
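
The label encoding and the `KaggleSet` split above can be checked on a tiny stand-in frame. This is a minimal sketch, not the notebook's data: the rows and values below are made up for illustration, and the vectorized comparison is an equivalent alternative to the `apply`/`lambda` encoding.

```python
import pandas as pd

# Toy stand-in for the Higgs data; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "EventId": [1, 2, 3, 4],
        "DER_mass_MMC": [138.5, 160.9, 112.4, 143.9],
        "Weight": [0.1, 0.2, 0.3, 0.4],
        "Label": ["s", "b", "b", "s"],
        "KaggleSet": ["t", "t", "b", "v"],
        "KaggleWeight": [0.1, 0.2, 0.3, 0.4],
    }
)

non_feature_cols = ["EventId", "Weight", "KaggleSet", "KaggleWeight", "Label"]
feature_cols = [col for col in toy.columns if col not in non_feature_cols]

# Vectorized equivalent of the apply/lambda encoding: "s" (signal) -> 1, otherwise 0
toy["Label"] = (toy["Label"] == "s").astype(int)

# Label column first, then features; rows selected by the KaggleSet marker
toy_train = toy.loc[toy["KaggleSet"] == "t", ["Label", *feature_cols]]
toy_test = toy.loc[toy["KaggleSet"] == "b", ["Label", *feature_cols]]

print(toy_train)
print(toy_test)
```

Rows whose `KaggleSet` value is neither `"t"` nor `"b"` (the last toy row here) end up in neither subset, which is also how the cell above behaves.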

In [14]:
# using the SageMaker session, we upload the data to S3
for name, dataset in zip(["train", "test"], [train_data, test_data]):
    sess.upload_string_as_file_body(
        body=dataset.to_csv(index=False, header=False),
        bucket=bucket,
        key=f"{key_prefix}/input/{name}.csv",
    )
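
The `index=False, header=False` arguments matter because the SageMaker built-in XGBoost algorithm expects CSV training data with the label in the first column and no header row. A small sketch with a made-up frame (the column names `f0` and `f1` are hypothetical) shows the layout that gets uploaded:

```python
import pandas as pd

# Made-up frame with the label already placed in the first column
toy = pd.DataFrame({"Label": [1, 0], "f0": [0.5, 1.5], "f1": [2.0, 3.0]})

# No index column and no header row: each line is "label,feature,feature,..."
csv_body = toy.to_csv(index=False, header=False)
rows = csv_body.splitlines()

for row in rows:
    print(row)
```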

In [15]:
# configure data inputs for SageMaker training
train_input = TrainingInput(f"s3://{bucket}/{key_prefix}/input/train.csv", content_type="text/csv")
validation_input = TrainingInput(
    f"s3://{bucket}/{key_prefix}/input/test.csv", content_type="text/csv"
)

## Create XGBoost Estimator <a class="anchor" id="estimator"></a>

Here we create a SageMaker Estimator using the XGBoost image prepared by SageMaker. We attach the SageMaker Debugger built-in `create_xgboost_report()` rule to automatically generate an XGBoost training report after the training job is complete. SageMaker Debugger also turns on the [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) rule and autogenerates a report on system resource utilization and bottleneck detection results.

In [16]:
# hyperparameters for the XGBoost model
hyperparameters = {"objective": "binary:logistic", "num_round": "100", "eval_metric": "error"}

# add a rule to generate the XGBoost Report
rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]
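
For reference, XGBoost's `error` metric is the binary classification error rate: predictions above 0.5 count as positive, and the metric is the share of wrong predictions. The sketch below recomputes it with NumPy on made-up probabilities; `binary_error` is a hypothetical helper written for illustration, not part of XGBoost or SageMaker.

```python
import numpy as np


def binary_error(y_true, y_prob, threshold=0.5):
    """Fraction of misclassified rows when probabilities above `threshold` count as 1."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred != y_true))


# Made-up labels and predicted probabilities
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.4, 0.3, 0.6]

err = binary_error(y_true, y_prob)
print(f"validation error: {err:.2f}")
```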

In [17]:
# Create SageMaker Estimator using the XGBoost image
estimator = Estimator(
    role=role,
    image_uri=xgboost_container,
    base_job_name="higgs-boson-model",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    hyperparameters=hyperparameters,
    rules=rules,
)

## Train XGBoost Model <a class="anchor" id="train"></a>
Finally, we launch a training job to train the XGBoost model.

In [18]:
estimator.fit({"train": train_input, "validation": validation_input}, wait=True)

2021-08-19 18:58:13 Starting - Starting the training job...
2021-08-19 18:58:15 Starting - Launching requested ML instancesCreateXgboostReport: InProgress
ProfilerReport-1629399493: InProgress
......
2021-08-19 18:59:42 Starting - Preparing the instances for training......
2021-08-19 19:00:43 Downloading - Downloading input data
2021-08-19 19:00:43 Training - Downloading the training image...
2021-08-19 19:01:13 Training - Training image download completed. Training in progress..
2021-08-19 19:01:43 Uploading - Uploading generated training model
[2021-08-19 19:01:15.377 ip-10-0-88-232.us-east-2.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value error to Json.
Returning the value itself
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logist

## Download SageMaker Debugger Reports <a class="anchor" id="reports"></a>
SageMaker Debugger generates the profiling and training reports through a pair of processing jobs that run concurrently with the training job. The code below downloads the outputs from the Debugger report output S3 URI to your current Jupyter working directory for easier viewing.

In [19]:
import os

# get name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]

# get name of the xgboost training report
xgb_profile_job_name = [
    rule["RuleEvaluationJobArn"].split("/")[-1]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "CreateXgboostReport" in rule["RuleConfigurationName"]
][0]

base_output_path = os.path.dirname(estimator.latest_job_debugger_artifacts_path())
rule_output_path = os.path.join(base_output_path, "rule-output/")
xgb_report_path = os.path.join(rule_output_path, "CreateXgboostReport")
profile_report_path = os.path.join(rule_output_path, profiler_report_name)
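
To make the path handling above concrete, here is a minimal sketch using placeholder strings. The bucket name, job name, and ARN are hypothetical stand-ins, not values returned by this training job. It uses `posixpath` rather than `os.path` as a design choice, since S3 keys always use forward slashes regardless of the operating system.

```python
import posixpath

# Hypothetical stand-ins for the values the SageMaker SDK would return
debugger_artifacts_path = "s3://example-bucket/example-job/debug-output"
rule_arn = "arn:aws:sagemaker:us-east-2:123456789012:processing-job/example-report-job"

# The processing job name is the last path segment of the rule evaluation job ARN
processing_job_name = rule_arn.split("/")[-1]

# Step up one level from the debug-output prefix, then into rule-output/
base_output_path = posixpath.dirname(debugger_artifacts_path)
rule_output_path = posixpath.join(base_output_path, "rule-output/")
xgb_report_path = posixpath.join(rule_output_path, "CreateXgboostReport")

print(processing_job_name)
print(xgb_report_path)
```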

In [20]:
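# Poll until the XGBoost report processing job reaches a terminal state
while True:

    xgb_job_info = sess.sagemaker_client.describe_processing_job(
        ProcessingJobName=xgb_profile_job_name
    )
    status = xgb_job_info["ProcessingJobStatus"]

    if status == "Completed":
        break
    if status in ("Failed", "Stopped"):
        # stop instead of looping forever when the report job cannot finish
        raise RuntimeError(f"XGBoost report job ended with status: {status}")
    print(f"Job Status: {status}")
    time.sleep(30)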

s3.download(xgb_report_path, "reports/xgb/", recursive=True)
s3.download(profile_report_path, "reports/profiler/", recursive=True)
display(
    "Click link below to view the profiler report which will help you identify hardware bottlenecks.",
    FileLink("reports/profiler/profiler-output/profiler-report.html"),
)
display(
    "Click link below to view the XGBoost Training reports which will help you improve your model",
    FileLink("reports/xgb/xgboost_report.html"),
)

Job Status: InProgress
Job Status: InProgress
Job Status: InProgress
Job Status: InProgress
Job Status: InProgress
Job Status: InProgress
Job Status: InProgress
Job Status: InProgress
Job Status: InProgress


'Click link below to view the profiler report which will help you identify hardware bottlenecks.'

'Click link below to view the XGBoost Training reports which will help you improve your model'
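
The status polling in the cell above can also be written as a reusable helper with an upper bound on the total wait. The following is a sketch under stated assumptions: `wait_for_status` and its parameters are hypothetical names, and the status source is a fake iterator so the example runs without AWS access. In the notebook, `get_status` would wrap the `describe_processing_job` call.

```python
import time

# Terminal values of ProcessingJobStatus; "InProgress" and "Stopping" are transient
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_status(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake status source standing in for describe_processing_job
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

Returning the terminal status, rather than only breaking on `"Completed"`, lets the caller decide whether to download the reports or to inspect a failed job.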

### Display the Debugger Profiling report

The following code opens the downloaded profiling report. For this training job, the report shows that no bottleneck issues were found.

In [22]:
import IPython

IPython.display.HTML(filename="reports/profiler/profiler-output/profiler-report.html")

| Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters |
|---|---|---|---|---|---|
| Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 0 | 0 | min_threshold:70 max_threshold:200 |
| LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 0 | threshold:0.2 patience:1000 |
| StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 0 | threshold:3 mode:None n_outliers:10 stddev:3 |
| LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 0 | 0 | threshold_p95:70 threshold_p5:10 window:500 patience:1000 |
| CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 68 | threshold:50 cpu_threshold:90 gpu_threshold:10 patience:1000 |
| MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 0 | threshold:20 |
| BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 0 | 64 | cpu_threshold_p95:70 gpu_threshold_p95:70 gpu_memory_threshold_p95:70 patience:1000 window:500 |
| IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 68 | threshold:50 io_threshold:50 gpu_threshold:10 patience:1000 |
| GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 0 | increase:5 patience:1000 window:10 |
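
To illustrate how a threshold rule such as CPUBottleneck reads, the sketch below applies the parameters listed for that rule (`cpu_threshold:90`, `gpu_threshold:10`, `threshold:50`) to made-up utilization samples. It is a simplified reading of the rule description only: `bottleneck_percent` is a hypothetical helper, `patience` is ignored, and this is not Debugger's actual implementation.

```python
import numpy as np


def bottleneck_percent(cpu_util, gpu_util, cpu_threshold=90, gpu_threshold=10):
    """Percent of samples where CPU utilization is high while GPU utilization is low."""
    cpu = np.asarray(cpu_util, dtype=float)
    gpu = np.asarray(gpu_util, dtype=float)
    flagged = (cpu > cpu_threshold) & (gpu < gpu_threshold)
    return 100.0 * float(flagged.mean())


# Made-up utilization samples in percent, one pair per measurement
cpu = [95, 97, 40, 92, 30, 20, 99, 50]
gpu = [5, 2, 80, 50, 5, 90, 1, 60]

pct = bottleneck_percent(cpu, gpu)
rule_triggers = pct > 50  # threshold:50 from the rule parameters

print(f"{pct:.1f}% of samples flagged, rule triggers: {rule_triggers}")
```

Three of the eight toy samples combine high CPU with low GPU utilization, so the flagged share stays below the 50 percent threshold and the simplified rule does not trigger.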


### Display the Debugger XGBoost training report

The following code displays the XGBoost training report. This shows how the training job made progress, such as loss values over time and statistics at `plot_step`.

In [23]:
import IPython

IPython.display.HTML(filename="reports/xgb/xgboost_report.html")