## Using SageMaker Debugger to monitor attentions in BERT model training

[BERT](https://arxiv.org/abs/1810.04805) is a deep bidirectional transformer model that achieves state-of-the-art results in NLP tasks such as question answering and text classification.
In this notebook we use [GluonNLP](https://gluon-nlp.mxnet.io/) to finetune a pretrained BERT model on the [Stanford Question Answering Dataset](https://web.stanford.edu/class/cs224n/reports/default/15848195.pdf), and we use [SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) to monitor model training in real time. 

The paper [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf) shows that plotting attentions and individual neurons in the query and key vectors can help to identify causes of incorrect model predictions.
With SageMaker Debugger we can retrieve those tensors and plot them in real time as training progresses, which may help us understand what the model is learning. 

The animation below shows the attention scores of the first 20 input tokens for the first 10 iterations in the training.

<img src='images/attention_scores.gif' width='350' /> 
Fig. 1: Attention scores of the first head in the 7th layer 

[1] *Visualizing Attention in Transformer-Based Language Representation Models*:  Jesse Vig, 2019, 1904.02679, arXiv

In [1]:
!pip install smdebug==0.7.2



In [2]:
import boto3
import sagemaker

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

### SageMaker training
The following code defines the SageMaker Estimator. The entry point script [train.py](entry_point/train.py) defines the model training. It downloads a BERT model from the GluonNLP model zoo and finetunes the model on the Stanford Question Answering dataset. The training script follows the official GluonNLP [example](https://github.com/dmlc/gluon-nlp/blob/v0.8.x/scripts/bert/finetune_squad.py) on finetuning BERT.

For demonstration purposes we will train only on a subset of the data (`train_dataset_size`) and perform evaluation on a single batch (`val_dataset_size`).

In [3]:
from sagemaker.mxnet import MXNet
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# S3 location where SageMaker Debugger stores the emitted tensors
s3_bucket_for_tensors = 's3://{}/sm_bert_viz/tensors'.format(bucket)

mxnet_estimator = MXNet(entry_point='train.py',
                            source_dir='entry_point',
                            role=role,
                            train_instance_type='ml.p3.2xlarge',
                            train_instance_count=1,
                            framework_version='1.6.0',
                            py_version='py3',
                            hyperparameters = {'epochs': 3, 
                                               'batch_size': 16,
                                               'learning_rate': 5e-5,
                                               'train_dataset_size': 1024,
                                               'val_dataset_size': 16},
                            debugger_hook_config = DebuggerHookConfig(
                              s3_output_path=s3_bucket_for_tensors,  
                              collection_configs=[
                                CollectionConfig(
                                    name="all",
                                    parameters={"include_regex": 
                                                ".*multiheadattentioncell0_output_1|.*key_output|.*query_output",
                                                "train.save_steps": "0",
                                                "eval.save_interval": "1"}
                                    )
                                 ]
                               )
                            )                                            

By default, SageMaker Debugger monitors collections such as gradients, weights and biases. The default `save_interval` is 100 steps. A step represents the work done by the training job for one batch (i.e. forward and backward pass). 
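To make the notion of a save interval concrete, the following sketch models which step indices a simple save rule would record. The helper `saved_steps` is hypothetical and is not part of the smdebug API. It assumes a step is saved when its index is divisible by the interval or when it is listed explicitly. The step count of 192 is derived from the hyperparameters above, assuming one step per batch.

```python
def saved_steps(total_steps, save_interval=None, save_steps=None):
    """Return the step indices that a simplified save rule would record.

    Hypothetical helper for illustration only, not the smdebug implementation.
    Assumption: a step is saved if it is listed in `save_steps` or if its
    index is divisible by `save_interval`.
    """
    explicit = set(save_steps or [])
    selected = []
    for step in range(total_steps):
        on_interval = save_interval is not None and step % save_interval == 0
        if step in explicit or on_interval:
            selected.append(step)
    return selected


# 3 epochs over 1024 samples with batch size 16 (assumed: one step per batch).
total_train_steps = 3 * (1024 // 16)

interval_steps = saved_steps(total_train_steps, save_interval=100)
first_step_only = saved_steps(total_train_steps, save_steps=[0])
every_step = saved_steps(3, save_interval=1)

print(interval_steps)   # [0, 100]
print(first_step_only)  # [0]
print(every_step)       # [0, 1, 2]
```

Under these assumptions, an interval of 100 would record only two of the 192 training steps, which is why a custom save configuration is useful when we want tensors from specific steps.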

In this example we are also interested in attention scores and in the query and key output tensors. We can emit them by defining a new [collection](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection). Here we call the collection `all` and define the corresponding regex. We save every iteration during the validation phase (`eval.save_interval`) and only the first iteration during the training phase (`train.save_steps`).
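As a quick sanity check of the regex, the sketch below applies it to tensor names of the kind this model produces for its first transformer layer. It uses Python's `re.search`; how smdebug applies the pattern internally is an assumption here, although the leading `.*` means `re.match` would select the same names.

```python
import re

# Regex from the CollectionConfig above.
include_regex = ".*multiheadattentioncell0_output_1|.*key_output|.*query_output"

# Example tensor names from the first transformer layer of the BERT encoder.
candidate_names = [
    "bertencoder0_transformer0_multiheadattentioncell0_key_bias",
    "bertencoder0_transformer0_multiheadattentioncell0_key_output_0",
    "bertencoder0_transformer0_multiheadattentioncell0_key_weight",
    "bertencoder0_transformer0_multiheadattentioncell0_output_1",
    "bertencoder0_transformer0_multiheadattentioncell0_query_bias",
    "bertencoder0_transformer0_multiheadattentioncell0_query_output_0",
    "bertencoder0_transformer0_multiheadattentioncell0_query_weight",
]

# Keep only the names the pattern selects (assumed matching semantics).
matched = [name for name in candidate_names if re.search(include_regex, name)]

for name in matched:
    print(name)
```

Only the attention-score output and the key and query outputs are selected; weights and biases of the attention cell are not matched by this pattern.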


We also add the following lines in the validation loop to record the string representation of input tokens:
```python
if hook.get_collections()['all'].save_config.should_save_step(modes.EVAL, hook.mode_steps[modes.EVAL]):  
   hook._write_raw_tensor_simple("input_tokens", input_tokens)
```

In [4]:
mxnet_estimator.fit(wait=False)

We can check the S3 location of tensors:

In [5]:
path = mxnet_estimator.latest_job_debugger_artifacts_path()
print('Tensors are stored in: {}'.format(path))

Tensors are stored in: s3://sagemaker-us-east-1-828802286385/sm_bert_viz/tensors/mxnet-training-2020-04-15-02-26-13-953/debug-output


Get the training job name:

In [6]:
job_name = mxnet_estimator.latest_training_job.name
print('Training job name: {}'.format(job_name))

client = mxnet_estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)

Training job name: mxnet-training-2020-04-15-02-26-13-953


We can access the tensors from S3 once the training job is in status Training or Completed. In the following code cell we check the job status.

In [7]:
import time

if description['TrainingJobStatus'] != 'Completed':
    while description['SecondaryStatus'] not in {'Training', 'Completed'}:
        description = client.describe_training_job(TrainingJobName=job_name)
        primary_status = description['TrainingJobStatus']
        secondary_status = description['SecondaryStatus']
        print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))
        # Stop polling if the job ended without reaching the training phase
        if primary_status in {'Failed', 'Stopped'}:
            break
        time.sleep(15)

Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Downloading]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Training]


### Get tensors and visualize BERT model training in real-time
In this section, we will retrieve the tensors of our training job and create the attention-head view and neuron view as described in [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf).

First we create the [trial](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#Trial) that points to the tensors in S3:

In [8]:
from smdebug.trials import create_trial

trial = create_trial( path )

[2020-04-15 02:28:59.977 ip-172-16-9-13:28280 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-828802286385/sm_bert_viz/tensors/mxnet-training-2020-04-15-02-26-13-953/debug-output




In [14]:
for i in trial.tensor_names():
    print(i)

[2020-04-15 02:36:59.819 ip-172-16-9-13:28280 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2020-04-15 02:37:00.841 ip-172-16-9-13:28280 INFO trial.py:210] Loaded all steps
bertencoder0_position_weight
bertencoder0_transformer0_bertpositionwiseffn0_ffn_1_bias
bertencoder0_transformer0_bertpositionwiseffn0_ffn_1_weight
bertencoder0_transformer0_bertpositionwiseffn0_ffn_2_bias
bertencoder0_transformer0_bertpositionwiseffn0_ffn_2_weight
bertencoder0_transformer0_multiheadattentioncell0_key_bias
bertencoder0_transformer0_multiheadattentioncell0_key_output_0
bertencoder0_transformer0_multiheadattentioncell0_key_weight
bertencoder0_transformer0_multiheadattentioncell0_output_1
bertencoder0_transformer0_multiheadattentioncell0_query_bias
bertencoder0_transformer0_multiheadattentioncell0_query_output_0
bertencoder0_transformer0_multiheadattentioncell0_query_weight
bertencoder0_transformer0_multiheadattentioncell0_value_bias
bertencoder0_transformer0_multiheadatt

Next we import the scripts that implement the attention-head view and the neuron view in Bokeh.

In [15]:
from utils import attention_head_view, neuron_view
from ipywidgets import interactive

We will use the tensors from the validation phase. In the next cell we check if such tensors are already available or not.

In [16]:
import numpy as np
from smdebug import modes

while True:
    if len(trial.steps(modes.EVAL)) == 0:
        print("Tensors from validation phase not available yet")
    else:
        step = trial.steps(modes.EVAL)[0]
        break
    time.sleep(15)

Once the validation phase has started, we can retrieve the tensors from S3. In particular, we are interested in the outputs of the attention cells, which give the attention scores. First we get the tensor names of the attention scores:

In [19]:
# Sorted names of all attention-score tensors (one per transformer layer)
tensor_names = sorted(trial.tensor_names(regex='.*multiheadattentioncell0_output_1'))

Next we iterate over the available tensors of the validation phase. We retrieve tensor values with `trial.tensor(tname).value(step, modes.EVAL)`. Note: if training is still in progress, not all steps will be available yet. 

In [20]:
steps = trial.steps(modes.EVAL)
tensors = {}

for step in steps:
    print("Reading tensors from step", step)
    for tname in tensor_names: 
        if tname not in tensors:
            tensors[tname]={}
        tensors[tname][step] = trial.tensor(tname).value(step, modes.EVAL)
# Axis 1 of an attention-score tensor holds the attention heads
num_heads = tensors[tname][step].shape[1]

Reading tensors from step 0
Reading tensors from step 1
Reading tensors from step 2


Next we get the query and key output tensor names:

In [21]:
layers = []
layer_names = {}

for index, (key, query) in enumerate(zip(trial.tensor_names(regex='.*key_output_'), trial.tensor_names(regex='.*query_output_'))):
    layers.append([key,query])
    layer_names[key.split('_')[1]] = index

We also retrieve the string representation of the input tokens that were input into our model during validation.

In [22]:
input_tokens = trial.tensor('input_tokens').value(0, modes.EVAL)

#### Attention Head View

The attention-head view shows the attention scores between different tokens. The thicker the line, the higher the score. For demonstration purposes, we limit the visualization to the first 20 tokens. We can select different attention heads and different layers. As training progresses, attention scores change, and we can observe that by selecting a different step. 
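The sketch below illustrates the kind of slice such a view works with. It uses synthetic data instead of tensors from the training job and assumes the attention-score layout is (batch size, number of heads, sequence length, sequence length). The variable names and sizes are chosen for illustration only.

```python
import numpy as np

# Synthetic stand-in for one attention-score tensor.
# Assumed layout: (batch size, number of heads, sequence length, sequence length).
batch_size, num_heads, seq_len = 2, 12, 32
rng = np.random.default_rng(0)
logits = rng.normal(size=(batch_size, num_heads, seq_len, seq_len))

# Softmax over the last axis so that each query's scores sum to 1.
scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
scores /= scores.sum(axis=-1, keepdims=True)

# Select one sample and one head, then keep the first n_tokens queries and keys.
n_tokens = 20
sample, head = 0, 0
head_scores = scores[sample, head, :n_tokens, :n_tokens]
print(head_scores.shape)

# For each query token, the index of the key token with the highest score.
strongest_key = head_scores.argmax(axis=-1)
print(strongest_key.shape)
```

Each row of `head_scores` corresponds to one query token, and each entry in that row would determine the thickness of one line in the view.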

**Note:** The following cells run fine in Jupyter. If you are using JupyterLab and encounter issues with the Jupyter widgets (e.g. dropdown menu not displaying), check the subsection at the end of the notebook.

In [23]:
n_tokens = 20
view = attention_head_view.AttentionHeadView(input_tokens, 
                                             tensors,  
                                             step=trial.steps(modes.EVAL)[0],
                                             layer='bertencoder0_transformer0_multiheadattentioncell0_output_1',
                                             n_tokens=n_tokens)

In [24]:
interactive(view.select_layer, layer=tensor_names)

interactive(children=(Dropdown(description='layer', options=('bertencoder0_transformer0_multiheadattentioncell…

In [25]:
interactive(view.select_head, head=np.arange(num_heads))

interactive(children=(Dropdown(description='head', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11), value=0), O…

In [26]:
interactive(view.select_step, step=trial.steps(modes.EVAL))

interactive(children=(Dropdown(description='step', options=(0, 1, 2), value=0), Output()), _dom_classes=('widg…

The following code cell updates the dictionary `tensors` with the latest tensors from the training job. Once the dictionary is updated, we can go back to the code cell above that creates `attention_head_view.AttentionHeadView` and re-execute it and the subsequent cells to plot the latest attention scores.

In [27]:
all_steps = trial.steps(modes.EVAL)
# Steps that have become available since the tensors were last read
new_steps = sorted(set(all_steps) - set(steps))

for step in new_steps: 
    for tname in tensor_names:  
        if tname not in tensors:
            tensors[tname]={}
        tensors[tname][step] = trial.tensor(tname).value(step, modes.EVAL)

#### Neuron view

To create the neuron view as described in the paper [Visualizing Attention in Transformer-Based Language Representation Models [1]](https://arxiv.org/pdf/1904.02679.pdf), we need to retrieve the queries and keys from the model. The tensors are reshaped and transposed to have the shape: *batch size, number of attention heads, sequence length, attention head size*
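The reshape and transpose can be sketched on a synthetic array as follows. The sizes are illustrative, and the input layout (batch size, sequence length, hidden size) is an assumption about the projected query and key outputs.

```python
import numpy as np

# Illustrative sizes; hidden size is split evenly across the heads.
batch_size, seq_len, num_heads, head_size = 2, 8, 12, 64
hidden_size = num_heads * head_size

# Synthetic projection output, assumed layout: (batch size, sequence length, hidden size).
projected = np.arange(batch_size * seq_len * hidden_size, dtype=np.float32)
projected = projected.reshape(batch_size, seq_len, hidden_size)

# Split the hidden dimension into heads, then move the head axis before the sequence axis.
per_head = projected.reshape((batch_size, seq_len, num_heads, -1))
per_head = per_head.transpose(0, 2, 1, 3)
print(per_head.shape)  # (2, 12, 8, 64)

# The vector of head h for token t is a contiguous slice of that token's hidden vector.
h, t = 3, 5
same = np.array_equal(per_head[0, h, t], projected[0, t, h * head_size:(h + 1) * head_size])
print(same)  # True
```

After this step, indexing `per_head[sample, head]` yields a (sequence length, head size) matrix, which is the form needed to compare individual query and key neurons.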

**Note:** The following cells run fine in Jupyter. If you are using JupyterLab and encounter issues with the Jupyter widgets (e.g. dropdown menu not displaying), check the subsection at the end of the notebook.

In [28]:
queries = {}
steps = trial.steps(modes.EVAL)

for step in steps:
    print("Reading tensors from step", step)
    
    for tname in trial.tensor_names(regex='.*query_output'):
        query = trial.tensor(tname).value(step, modes.EVAL)
        query = query.reshape((query.shape[0], query.shape[1], num_heads, -1))
        query = query.transpose(0,2,1,3)
        if tname not in queries:
            queries[tname] = {}
        queries[tname][step] = query

Reading tensors from step 0
Reading tensors from step 1
Reading tensors from step 2


Retrieve the key vectors:

In [29]:
keys = {}
steps = trial.steps(modes.EVAL)

for step in steps:
    print("Reading tensors from step", step)
    
    for tname in trial.tensor_names(regex='.*key_output'):
        key = trial.tensor(tname).value(step, modes.EVAL)
        key = key.reshape((key.shape[0], key.shape[1], num_heads, -1))
        key = key.transpose(0,2,1,3)
        if tname not in keys:
            keys[tname] = {}
        keys[tname][step] = key

Reading tensors from step 0
Reading tensors from step 1
Reading tensors from step 2


We can now select different query vectors and see how they produce different attention scores. We can also select different steps to see how attention scores, query and key vectors change as training progresses. The neuron view shows:
* Query
* Key
* Query x Key (element-wise product)
* Query * Key (dot product)
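
The four quantities listed above can be sketched in NumPy for a single attention head. The data is synthetic, and the final scaling and softmax follow standard scaled dot-product attention; whether the `neuron_view` script applies exactly these steps is an assumption.

```python
import numpy as np

# Synthetic queries and keys for one head: (sequence length, head size).
rng = np.random.default_rng(0)
seq_len, head_size = 5, 4
queries = rng.normal(size=(seq_len, head_size))
keys = rng.normal(size=(seq_len, head_size))

# Pick one query token and compare it against every key token.
query_index = 2
q = queries[query_index]

# Element-wise product: one row per key token, one column per neuron.
elementwise = q * keys

# Dot product: summing the element-wise product over the neurons.
dot = elementwise.sum(axis=-1)

# Standard scaled dot-product attention weights (assumed, for illustration).
scaled = dot / np.sqrt(head_size)
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()

print(elementwise.shape)  # (5, 4)
print(dot.shape)          # (5,)
```

The element-wise product shows which individual neurons contribute most to a high or low dot product for a given query and key pair.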


In [30]:
view = neuron_view.NeuronView(input_tokens, 
                              keys=keys, 
                              queries=queries, 
                              layers=layers, 
                              step=trial.steps(modes.EVAL)[0], 
                              n_tokens=n_tokens,
                              layer_names=layer_names)

In [31]:
interactive(view.select_query, query=np.arange(n_tokens))

interactive(children=(Dropdown(description='query', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…

In [32]:
interactive(view.select_layer, layer=layer_names.keys())

interactive(children=(Dropdown(description='layer', options=('transformer0', 'transformer10', 'transformer11',…

In [33]:
interactive(view.select_step, step=trial.steps(modes.EVAL))

interactive(children=(Dropdown(description='step', options=(0, 1, 2), value=0), Output()), _dom_classes=('widg…

#### Note: Jupyter widgets in JupyterLab

If you encounter issues with this notebook in JupyterLab, you may have to install JupyterLab extensions. You can do this by defining a SageMaker [Lifecycle configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html). A lifecycle configuration is a shell script that runs when you create a notebook instance or each time you start one. You can create a lifecycle configuration directly in the SageMaker console (more details [here](https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/)). When selecting `Start notebook`, copy and paste the following code. Once the configuration is created, attach it to your notebook instance and start the instance.

```sh
#!/bin/bash

set -e

# OVERVIEW
# This script installs a single jupyter notebook extension package in SageMaker Notebook Instance
# For more details of the example extension, see https://github.com/jupyter-widgets/ipywidgets

sudo -u ec2-user -i <<'EOF'

# PARAMETERS
PIP_PACKAGE_NAME=ipywidgets
EXTENSION_NAME=widgetsnbextension

source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv

pip install $PIP_PACKAGE_NAME
jupyter nbextension enable $EXTENSION_NAME --py --sys-prefix
jupyter labextension install @jupyter-widgets/jupyterlab-manager
# run the command in background to avoid timeout 
nohup jupyter labextension install @bokeh/jupyter_bokeh &

source /home/ec2-user/anaconda3/bin/deactivate

EOF
```