# Getting started


<div class="alert alert-info">

<b>Warning</b>

Precision debug tools with [Nvidia-DL-Framework-Inspect](https://github.com/NVIDIA/nvidia-dlfw-inspect) for Transformer Engine are currently supported only for PyTorch.

</div>


Transformer Engine provides a set of precision debug tools that make it easy to:

- log statistics for each tensor in every GEMM,
- run selected GEMMs in higher precision,
- run current scaling (one scaling factor per tensor) on Hopper,
- test new precisions and integrate them with FP8 training,
- and more.

All of this requires only a few small code changes.

To use them, install the [Nvidia-DL-Framework-Inspect](https://github.com/NVIDIA/nvidia-dlfw-inspect) tool.
In `config.yaml`, the user specifies which features to apply to which layers, and Nvidia-DL-Framework-Inspect takes care of the rest. There are two kinds of features:

- provided by Transformer Engine, for example `DisableFP8GEMM` or `LogTensorStats`; these are listed in the [debug features API](./3_api_features.ipynb) section,
- defined by the user, for example for testing new precisions; see the [calls to nvidia-dlframework-inspect](./3_api_te_calls.ipynb) section.

<figure align="center">
<img src="./img/introduction.svg">
    <figcaption> Fig 1: Example of Nvidia-DL-Framework-Inspect affecting a training script with 3 TE Linear layers.
    The `config.yaml` file specifies which features should be used for each layer. There are feature class files: some are provided by TE,
    and one, `UserProvidedPrecision`, is implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers as described in the config.
     </figcaption>
</figure>

#### Example training script

Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using dummy data.

```python
# train.py

from transformer_engine.pytorch import TransformerLayer
import torch
import torch.nn as nn
import torch.optim as optim
import transformer_engine.pytorch as te

hidden_size = 512
num_attention_heads = 8

transformer_layer = TransformerLayer(
    hidden_size=hidden_size,
    ffn_hidden_size=hidden_size,
    num_attention_heads=num_attention_heads
).cuda()

dummy_input = torch.randn(10, 32, hidden_size).cuda()
criterion = nn.MSELoss()
optimizer = optim.Adam(transformer_layer.parameters(), lr=1e-4)
dummy_target = torch.randn(10, 32, hidden_size).cuda()

for epoch in range(5):
    transformer_layer.train()
    optimizer.zero_grad()
    with te.fp8_autocast(enabled=True):
        output = transformer_layer(dummy_input)
    loss = criterion(output, dummy_target)
    loss.backward()
    optimizer.step()
```

We will demonstrate two debug features on the code above:

1. Disabling FP8 precision for specific GEMM operations, such as the forward-propagation (fprop) GEMMs of FC1 and FC2.
2. Logging statistics for other GEMM operations, such as activation statistics within the LayerNormLinear layer.


Using the Transformer Engine debug features takes four steps:

1. Create a **config.yaml** file to configure the desired features.
2. Install, import, and initialize the Nvidia-DL-Framework-Inspect tool before initializing any Transformer Engine layers.
3. Optionally, pass `debug_name="..."` to the constructor of each TE layer to make layers easier to identify. If it is not provided, names are inferred automatically.
4. Invoke `debug_api.step()` at the end of each forward-backward pass.

#### Requirements

To use the debug features of Transformer Engine, you need to install the [Nvidia-DL-Framework-Inspect](https://github.com/NVIDIA/nvidia-dlfw-inspect) package provided by NVIDIA. You can install it by following these steps:

```
git clone [link]
cd nvidia-dlfw-inspect
pip install .
```

#### Config file

Prepare a **config.yaml** file as shown below:

```yaml
# config.yaml

fc1_fprop_to_fp8:
  enabled: True
  layers:
    layer_types: [fc1, fc2] # contains fc1 or fc2 in name
  transformer_engine:
    DisableFP8GEMM:
      enabled: True
      gemms: [fprop]

log_tensor_stats:
  enabled: True
  layers:
    layer_types: [layernorm_linear] # contains layernorm_linear in name
  transformer_engine:
    LogTensorStats:
      enabled: True
      stats: [max, min, mean, std, l1_norm]
      tensors: [activation]
      freq: 1
      start_step: 2
      end_step: 5
```

Further explanation on how to create config files is in the [next part of the documentation](./2_config_file_structure.ipynb).
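
To make the `stats` entry of the config concrete, here is a minimal pure-Python sketch of what each requested statistic means for a flat list of values. The helper `tensor_stats` is hypothetical and is not how Transformer Engine computes these values. In particular, whether `std` uses the sample or the population formula is an assumption (the sample formula is used here).

```python
import statistics


def tensor_stats(values):
    """Return the statistics named in config.yaml for a flat list of floats.

    Hypothetical helper for illustration only; not Transformer Engine code.
    """
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        # Assumption: sample standard deviation (divides by n - 1).
        "std": statistics.stdev(values),
        # L1 norm: sum of absolute values.
        "l1_norm": sum(abs(v) for v in values),
    }


example = [1.0, -2.0, 3.0, -4.0]
stats = tensor_stats(example)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Note that `l1_norm` grows with the number of elements, so it is much larger than the other statistics for big tensors.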

#### Adjusting Python file

```python
# (...)

import nvdlfw_inspect.api as debug_api
debug_api.initialize(
    config_file="./config.yaml",
    feature_dirs=["/path/to/transformer_engine/debug/features"],
    log_dir="./log",
    default_logging_enabled=True)

# Initialize the TransformerLayer only after
# debug_api.initialize(...) has been called.
transformer_layer = TransformerLayer(
    debug_name="transformer_layer",
    # ...
)

# (...)
for epoch in range(5):
    # forward and backward pass
    # ...
    debug_api.step()
```

In the modified code above, the following changes were made:

1. Added an import for `nvdlfw_inspect.api`.
2. Initialized Nvidia-DL-Framework-Inspect by calling `debug_api.initialize()` with the path to the config file, the feature directories, and the log directory.
3. Added `debug_api.step()` after each forward-backward pass.

#### Inspecting the logs

Let's look at the log files. Two files are created:

1. One for the main debug logs.
2. One for the statistics logs.

Let's look inside them!

```
# log/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log

INFO - Default logging to file enabled at ./log
INFO - Reading config from ./config.yaml.
INFO - Loaded configs for dict_keys(['fc1_fprop_to_fp8', 'log_tensor_stats']).
WARNING - > UserBuffers are not supported in debug module. Using UB optimization will not affect the debug module. 
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm fprop - FP8 quanitation
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: activation, gemm wgrad - FP8 quanitation
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm fprop - FP8 quanitation
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: weight, gemm dgrad - FP8 quanitation
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm dgrad - FP8 quanitation
INFO - transformer_layer.self_attention.layernorm_qkv: Tensor: gradient, gemm wgrad - FP8 quanitation
INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm fprop - FP8 quanitation
INFO - transformer_layer.self_attention.proj: Tensor: activation, gemm wgrad - FP8 quanitation
INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm fprop - FP8 quanitation
INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm dgrad - FP8 quanitation
INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm dgrad - FP8 quanitation
INFO - transformer_layer.self_attention.proj: Tensor: gradient, gemm wgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm wgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm dgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm dgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc1: Tensor: gradient, gemm wgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm wgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm fprop - High precision
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: weight, gemm dgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm dgrad - FP8 quanitation
INFO - transformer_layer.layernorm_mlp.fc2: Tensor: gradient, gemm wgrad - FP8 quanitation
INFO - transformer_layer.self_attention.layernorm_qkv: Feature=LogTensorStats, API=look_at_tensor_before_process: activation
....
```

The main log file contains detailed information about how each GEMM of the transformer layer is executed. You can see that the `fc1` and `fc2` fprop GEMMs run in high precision, as intended.
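
For longer logs, a small script can confirm this instead of reading line by line. The following is a minimal sketch that assumes the line format matches the excerpt above exactly. The function name `high_precision_entries` and the regular expression are hypothetical and are not part of Nvidia-DL-Framework-Inspect.

```python
import re

# Matches lines such as:
# "INFO - <layer>: Tensor: <tensor>, gemm <gemm> - <mode>"
# Assumption: this format is inferred from the log excerpt only.
LINE_RE = re.compile(
    r"INFO - (?P<layer>\S+): Tensor: (?P<tensor>\w+), "
    r"gemm (?P<gemm>\w+) - (?P<mode>.+)$"
)


def high_precision_entries(lines):
    """Return (layer, tensor, gemm) tuples reported as 'High precision'."""
    entries = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and match["mode"].strip() == "High precision":
            entries.append((match["layer"], match["tensor"], match["gemm"]))
    return entries


# Sample lines copied verbatim from the log excerpt.
sample_log = [
    "INFO - Reading config from ./config.yaml.",
    "INFO - transformer_layer.self_attention.proj: Tensor: weight, gemm fprop - FP8 quanitation",
    "INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm fprop - High precision",
    "INFO - transformer_layer.layernorm_mlp.fc1: Tensor: activation, gemm wgrad - FP8 quanitation",
    "INFO - transformer_layer.layernorm_mlp.fc1: Tensor: weight, gemm fprop - High precision",
    "INFO - transformer_layer.layernorm_mlp.fc2: Tensor: activation, gemm fprop - High precision",
]

entries = high_precision_entries(sample_log)
for layer, tensor, gemm in entries:
    print(f"{layer}: {tensor} ({gemm})")
```

To check a real run, pass the lines of `log/nvdlfw_inspect_logs/nvdlfw_inspect_globalrank-0.log` instead of `sample_log`.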

```
# log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log

INFO - transformer_layer.self_attention.layernorm_qkv_activation_max 				 iteration=000002 				 value=4.3188
INFO - transformer_layer.self_attention.layernorm_qkv_activation_min 				 iteration=000002 				 value=-4.3386
INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean 				 iteration=000002 				 value=0.0000
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std 				 iteration=000002 				 value=0.9998
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm             iteration=000002 				 value=130799.6953
INFO - transformer_layer.self_attention.layernorm_qkv_activation_max 				 iteration=000003 				 value=4.3184
INFO - transformer_layer.self_attention.layernorm_qkv_activation_min 				 iteration=000003 				 value=-4.3381
INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean 				 iteration=000003 				 value=0.0000
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std 				 iteration=000003 				 value=0.9997
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm 	         iteration=000003 				 value=130788.1016
INFO - transformer_layer.self_attention.layernorm_qkv_activation_max 				 iteration=000004 				 value=4.3181
INFO - transformer_layer.self_attention.layernorm_qkv_activation_min 				 iteration=000004 				 value=-4.3377
INFO - transformer_layer.self_attention.layernorm_qkv_activation_mean 				 iteration=000004 				 value=0.0000
INFO - transformer_layer.self_attention.layernorm_qkv_activation_std 				 iteration=000004 				 value=0.9996
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm 	         iteration=000004 				 value=130776.7969

```

The second log file (`log/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log`) contains the statistics for the tensors requested in `config.yaml`. Logging starts at iteration 2, as set by `start_step: 2`.
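
If you want to post-process these statistics, for example to plot them or compare runs, the lines can be parsed into per-statistic series. The sketch below assumes the `name iteration=... value=...` layout shown in the excerpt. `parse_stats` and its regular expression are hypothetical helpers, not an API of Nvidia-DL-Framework-Inspect.

```python
import re
from collections import defaultdict

# Assumption: every statistics line looks like
# "INFO - <name> <whitespace> iteration=<int> <whitespace> value=<float>"
STAT_RE = re.compile(
    r"INFO - (?P<name>\S+)\s+iteration=(?P<iteration>\d+)\s+value=(?P<value>\S+)"
)


def parse_stats(lines):
    """Group statistics lines into {name: [(iteration, value), ...]}."""
    series = defaultdict(list)
    for line in lines:
        match = STAT_RE.search(line)
        if match:
            point = (int(match["iteration"]), float(match["value"]))
            series[match["name"]].append(point)
    return dict(series)


# Sample lines taken from the statistics log excerpt.
prefix = "transformer_layer.self_attention.layernorm_qkv_activation"
sample = [
    f"INFO - {prefix}_max \t\t iteration=000002 \t\t value=4.3188",
    f"INFO - {prefix}_min \t\t iteration=000002 \t\t value=-4.3386",
    f"INFO - {prefix}_max \t\t iteration=000003 \t\t value=4.3184",
]

series = parse_stats(sample)
for name, points in series.items():
    print(name, points)
```

Because the function only iterates over lines, an open file handle for the statistics log can be passed in place of `sample`.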


#### Logging using TensorBoard

The precision debug tools support logging with [TensorBoard](https://www.tensorflow.org/tensorboard). To enable it, pass the `tb_writer` argument to `debug_api.initialize()`. Let's modify the `train.py` file.

```python

# (...)

from torch.utils.tensorboard import SummaryWriter
tb_writer = SummaryWriter('./tensorboard_dir/run1')

# add tb_writer to the Debug API initialization
debug_api.initialize(
    config_file="./config.yaml",
    feature_dirs=["/path/to/transformer_engine/debug/features"],
    log_dir="./log",
    tb_writer=tb_writer)

# (...)
```

Let's run the training and open TensorBoard with `tensorboard --logdir=./tensorboard_dir/run1`:

<figure align="center">
<img src="./img/tensorboard.png">
    <figcaption> Fig 2: TensorBoard with plotted stats.</figcaption>
</figure>