# Debugging and Profiling Training Jobs With TensorBoard

In this example, we will learn how to use TensorBoard to debug and profile a PyTorch training job. TensorBoard is an open source tool originally developed for the TensorFlow framework, but it is now available for other DL frameworks, including PyTorch. TensorBoard supports the following features for visualizing and inspecting the training process:
- Tracking scalar values (loss, accuracy, and others) over time.
- Capturing tensors such as weights, biases, and gradients and how they change over time. This is useful for verifying that weights and biases change as expected.
- Experiment tracking via a dashboard of hyperparameters.
- Projecting high-dimensional embeddings to a lower-dimensionality space.

TensorBoard also allows you to profile resource utilization and resource consumption by different parts of your training program.

In this example, we reuse the `Hymenoptera` image classification problem, but we integrate TensorBoard monitoring and profiling capabilities into the training script and review the results. We use the PyTorch framework for this. Note that the changes to a TensorFlow script would be similar.

### Prerequisites

To use TensorBoard, you need to install several Python packages locally. You can install all dependencies for this chapter by running the cell below:

In [None]:
! pip install -r requirements.txt

We start with imports and SageMaker configuration.

In [None]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/tensorboard'
print('Bucket:\n{}'.format(bucket))

Next, we download the training data locally and upload it to S3. Feel free to skip this step if you have already uploaded this dataset (in that case, update the `data_url` variable accordingly).

In [None]:
# Downloading dataset and unzipping it locally
! wget https://download.pytorch.org/tutorial/hymenoptera_data.zip
! unzip hymenoptera_data.zip
data_url = sagemaker_session.upload_data(path="./hymenoptera_data", key_prefix="hymenoptera_data")
print(f"S3 location of dataset {data_url}")

## Modifying Training Script


To use TensorBoard, we need to make minimal changes to our training script (see it here: `1_sources/train_resnet_tb.py`). In this example, we are using TensorBoard for both debugging and profiling. Below, we group the changes by purpose.

### Modifications for Debugging
The following modifications are needed to capture data for debugging and monitoring:
1. We must import and initialize TensorBoard's `SummaryWriter` object. Here, we are using the S3 location to write TensorBoard summaries:
```python
    from torch.utils.tensorboard import SummaryWriter
    tb_writer = SummaryWriter(args.tb_s3_url)
```
2. Next, we must capture training artifacts that won't change during training (in our case, the model graph). Note that we need to execute the model's forward pass on the sample data to do so:

```python
    sample_inputs, _ = next(iter(dataloaders_dict["val"]))
    tb_writer.add_graph(model, sample_inputs, verbose=False, use_strict_trace=False)
```

3. In our training loop, we capture the scalars (such as `loss` or `accuracy`) and tensors (such as `weights`) that we wish to inspect. We use the epoch number as the time dimension.
```python
    # Scalars, logged per phase with the epoch number as the step
    tb_writer.add_scalar(f"Loss/{phase}", epoch_loss, epoch)
    tb_writer.add_scalar(f"Accuracy/{phase}", epoch_accuracy, epoch)
    # Weight and gradient distributions of the first and last layers
    tb_writer.add_histogram("conv1.weight", model.conv1.weight, epoch)
    tb_writer.add_histogram("conv1.weight_grad", model.conv1.weight.grad, epoch)
    tb_writer.add_histogram("fc.weight", model.fc.weight, epoch)
    tb_writer.add_histogram("fc.weight_grad", model.fc.weight.grad, epoch)
    # metric_dict must be a dict that maps a metric name to its value
    tb_writer.add_hparams(hparam_dict=vars(args), metric_dict={"hparam/accuracy": epoch_accuracy})
```
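Note that `add_hparams` accepts only a limited set of value types (to my knowledge `int`, `float`, `str`, `bool`, and tensors), so passing `vars(args)` directly can fail if any argument holds a list, `None`, or another object. Below is a minimal sketch of a sanitizing step. The helper name `sanitize_hparams` and the example arguments are hypothetical and are not part of the original training script:

```python
import argparse

# Types assumed to be accepted by SummaryWriter.add_hparams (tensors aside).
SUPPORTED_TYPES = (int, float, str, bool)


def sanitize_hparams(args: argparse.Namespace) -> dict:
    """Return a copy of the parsed arguments that is safe to log as hparams.

    Values of unsupported types (lists, None, paths, ...) are converted to str.
    """
    clean = {}
    for name, value in vars(args).items():
        clean[name] = value if isinstance(value, SUPPORTED_TYPES) else str(value)
    return clean


# Hypothetical arguments, for illustration only.
example_args = argparse.Namespace(batch_size=16, lr=0.001, layers=[64, 32], tb_s3_url=None)
hparams = sanitize_hparams(example_args)
print(hparams)
```

The resulting dictionary can then be passed as `hparam_dict` to `add_hparams` instead of the raw `vars(args)`.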
After these modifications, TensorBoard will continuously save the requested data to the Amazon S3 location as the training progresses. Now let's review the modifications needed for profiling.

### Modifications for Profiling
TensorBoard provides out-of-the-box profiling capabilities for TensorFlow programs (including Keras). To profile PyTorch programs in TensorBoard, you can use the open source `torch_tb_profiler` plugin (we included this dependency in the `2_sources/requirements.txt` file).

To profile applications with `torch_tb_profiler`, we need to wrap our training loop with the PyTorch profiler context manager (`torch.profiler.profile`), as shown in the following code block:

```python
    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=5),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(
            os.path.join(os.environ["SM_OUTPUT_DATA_DIR"], "tb_profiler")
        ),
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
    ) as prof:
        # Rest of the training loop goes below.
```
The parameters passed to the context manager at initialization time define what profiling data is gathered and at what intervals. At the time of writing this book, the `torch_tb_profiler` plugin doesn't support writing to an S3 location. Hence, we must write the profiling data to the local output directory stored in the `SM_OUTPUT_DATA_DIR` environment variable. After training is done, SageMaker automatically archives the content of this directory and stores it in S3.
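To make the `schedule` arguments more concrete, the sketch below models which profiler steps are skipped, warmed up, and recorded for `wait=1, warmup=1, active=3, repeat=5`. It is a simplified pure-Python model written for illustration, not PyTorch's implementation: it ignores the `skip_first` option, and the names `Action` and `schedule_action` are hypothetical. It also assumes that the step counter advances each time the training loop calls `prof.step()`.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    WARMUP = "warmup"
    RECORD = "record"
    RECORD_AND_SAVE = "record_and_save"


def schedule_action(step: int, wait: int, warmup: int, active: int, repeat: int = 0) -> Action:
    """Simplified model of the action taken at a 0-based profiler step."""
    cycle_len = wait + warmup + active
    # After `repeat` full cycles the profiler stays idle (repeat=0 means cycle forever).
    if repeat > 0 and step // cycle_len >= repeat:
        return Action.NONE
    pos = step % cycle_len
    if pos < wait:
        return Action.NONE
    if pos < wait + warmup:
        return Action.WARMUP
    # The last active step of a cycle is also the point where the trace is handed over.
    return Action.RECORD_AND_SAVE if pos == cycle_len - 1 else Action.RECORD


# Same settings as in the training script: wait=1, warmup=1, active=3, repeat=5.
actions = [schedule_action(s, wait=1, warmup=1, active=3, repeat=5) for s in range(30)]
print([a.value for a in actions[:5]])
# → ['none', 'warmup', 'record', 'record', 'record_and_save']
```

Under this model, each cycle spans five steps, a trace is saved five times in total, and nothing is recorded from step 25 onward.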

If you'd like to review the full training script with all modifications, run the cell below:

In [None]:
! pygmentize 1_sources/train_resnet_tb.py

## Running Training Job


To start the SageMaker training job, we need to provide the S3 location where TensorBoard summaries will be written. We can do this by setting the `tb-s3-url` hyperparameter, as shown below. The rest of the configuration is similar to a regular SageMaker training job.

In [None]:

from sagemaker.pytorch import PyTorch

job_name = "pytorch-tb-profiling"
tb_s3_url = f"s3://{bucket}/tensorboard/{job_name}"

instance_type = 'ml.p2.xlarge'
instance_count = 1
python_version = "py38"
pytorch_version = "1.10.2"

estimator = PyTorch(
    entry_point="train_resnet_tb.py",
    source_dir="1_sources",
    role=role,
    framework_version=pytorch_version,
    py_version=python_version,
    instance_type=instance_type,
    sagemaker_session=sagemaker_session,
    instance_count=instance_count,
    hyperparameters={
        "batch-size": 16,
        "num-epochs": 5,
        "input-size": 224,
        "feature-extract": False,
        "tb-s3-url": tb_s3_url,
        "num-data-workers": 4,
    },
    disable_profiler=True,
    debugger_hook_config=False,
    base_job_name=job_name,
)

In [None]:
estimator.fit(inputs={"train":f"{data_url}/train", "val":f"{data_url}/val"})
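SageMaker passes the entries of the `hyperparameters` dictionary to the training script as command-line arguments (for example, `--tb-s3-url <value>`). When the script parses them with `argparse`, dashes become underscores, which is why the script can refer to `args.tb_s3_url`. The sketch below shows how such parsing might look. It is an assumption about the script's structure, not a copy of `train_resnet_tb.py`; the defaults, the `str2bool` helper, and the example bucket name are hypothetical.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' strings; a plain type=bool would treat 'False' as True."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--num-epochs", type=int, default=5)
    parser.add_argument("--input-size", type=int, default=224)
    parser.add_argument("--feature-extract", type=str2bool, default=False)
    parser.add_argument("--tb-s3-url", type=str, default=None)
    parser.add_argument("--num-data-workers", type=int, default=4)
    return parser


# Simulate a command line built from the hyperparameters dictionary.
args = build_parser().parse_args(
    ["--batch-size", "16", "--feature-extract", "False",
     "--tb-s3-url", "s3://example-bucket/tensorboard/job"]
)
print(args.tb_s3_url, args.feature_extract)
```

Hyperparameter values arrive as strings, so boolean flags such as `feature-extract` need explicit conversion like the one sketched in `str2bool`.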

### Reviewing Debugging Results

TensorBoard data (scalars, tensors, graphs, etc.) will be saved to the Amazon S3 location almost immediately after the training starts. To review the debugging results, start the TensorBoard application locally as shown below:

In [None]:
! tensorboard --logdir {tb_s3_url}

The TensorBoard web application should now start and redirect you to its local address (by default, `localhost:6006`). Note the following when using TensorBoard in cloud development environments:
- If you are using a SageMaker notebook instance, then TensorBoard will be available here: `https://<YOUR_NOTEBOOK_DOMAIN>/proxy/6006/`
- If you are using SageMaker Studio, then TensorBoard will be available here: `https://<YOUR_STUDIO_DOMAIN>/jupyter/default/proxy/6006/`

The TensorBoard data will be updated in near real time as the training job progresses. Please refer to the book for an overview of the visualizations in the TensorBoard application.


### Reviewing Profiling Results
Unlike debugging data, which is available in near real time, profiling data is only available after the training job completes. This is due to current limitations of the `torch_tb_profiler` plugin. Run the cell below to get the location of the TensorBoard Profiler output:

In [None]:
job_desc = estimator.latest_training_job.describe()
tb_profiler_path = f"{job_desc['OutputDataConfig']['S3OutputPath'].rstrip('/')}/{job_desc['TrainingJobName']}/output/output.tar.gz"

Then you can run the following commands to unarchive the profiler data and start TensorBoard:

In [None]:
! aws s3 cp {tb_profiler_path} .
! mkdir -p profiler_output
! tar -xf output.tar.gz -C profiler_output
! tensorboard --logdir ./profiler_output

After that, the TensorBoard application will open with the profiling results loaded. Refer to the book for an overview of the results.
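If you prefer to stay in Python rather than use shell commands, the unarchiving step can be sketched with the standard `tarfile` module. This assumes that `output.tar.gz` has already been downloaded to the working directory; the helper name `extract_profiler_output` is hypothetical. Since the training script writes traces under `tb_profiler` in the output directory, the extracted files should appear under that subfolder.

```python
import tarfile
from pathlib import Path


def extract_profiler_output(archive_path: str, target_dir: str = "profiler_output") -> list:
    """Extract the archive and return the relative paths of the extracted files (sorted)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Only extract archives you trust, such as the output of your own training job.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(str(p.relative_to(target)) for p in target.rglob("*") if p.is_file())


# Runs only if the archive was downloaded beforehand.
if Path("output.tar.gz").exists():
    for name in extract_profiler_output("output.tar.gz"):
        print(name)
```

TensorBoard can then be pointed at the target directory, as in the `tensorboard --logdir ./profiler_output` command above.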