NLP Training 5: Visualizing Attention
--- 


In [1]:
%pip install bertviz



In [2]:
import os
os.chdir('..')
print(f'Setting working dir to: {os.getcwd()}')

Setting working dir to: /Users/ingomarquart/Documents/GitHub/itern-nlp-training-cases



## Visualizing Attention Layers of a standard BERT model

In this notebook, we will use the `transformers` package to load a pre-trained BERT encoder model.

The package [bertviz](https://github.com/jessevig/bertviz) provides tools for visualizing the attention mechanism discussed in the theory section.

Give it a shot!

### Exercise 1 - Load BERT

Use the Transformers package to download and initialize a BERT model (`bert-base-cased`) with `BertModel` and its `.from_pretrained()` method. Since we want the model to return its attention weights, we pass `output_attentions=True`, which is stored in the model config. You also need the matching tokenizer, which you can load with `BertTokenizer`.

In [3]:
from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_name = 'bert-base-cased'


# Add your solution here:
# ...

In [4]:
# Get the model from ModelHub
model = BertModel.from_pretrained(model_name, output_attentions=True)
# We will also need the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Exercise 2 - Encoding a Sentence

We are ready to encode a sentence.

Two things to note. First, we need to request PyTorch tensors via `return_tensors="pt"`, because Transformers also supports frameworks other than PyTorch.

Second, the tokenizer returns a dictionary that also includes `attention_mask` (all ones here) and `token_type_ids`, which mark the two segments for the next-sentence objective.    
We do not need these two tensors here, and luckily BERT works without them. Try to extract only the `input_ids`.
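
As a rough picture of the extra fields, the sketch below builds them by hand for a single unpadded sentence. It does not call the tokenizer. The token list is an assumption (a plausible WordPiece split that matches the 7 tokens reported later in this notebook), and the real `input_ids` are integer vocabulary indices rather than strings.

```python
# Hand-built illustration; this does NOT call the real tokenizer.
# Assumed split of "I play the guitar." plus BERT's special tokens.
tokens = ["[CLS]", "I", "play", "the", "guitar", ".", "[SEP]"]

# One sentence, no padding: every position is a real token, so the mask is all ones.
attention_mask = [1] * len(tokens)

# A single segment: every token belongs to the first sentence, so all ids are 0.
token_type_ids = [0] * len(tokens)

toy_output = {
    "tokens": tokens,  # the real tokenizer returns integer "input_ids" instead
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
}

for name, values in toy_output.items():
    print(f"{name}: {values}")
```

With padded batches or sentence pairs, the mask would contain zeros and the segment ids would contain ones, which is when passing them to the model starts to matter.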

In [5]:
sentence = "I play the guitar."
sentence2 = "The play is at the theater."

# Add your solution here:
# ...

In [6]:
# Tokenize the sentences
tokenizer_output = tokenizer(sentence, return_tensors='pt')
tokenizer_output2 = tokenizer(sentence2, return_tensors='pt')

# Get only the token ids
token_ids = tokenizer_output['input_ids']
token_ids2 = tokenizer_output2['input_ids']

### Exercise 3 - Forward Pass Through the Model

We are now ready to call BERT.

Again, we will receive a dict-like object giving us
- the last hidden states
- the output of the pooler

and, since we requested it
- all attention scores across all layers

Do the following:
- Pass the tokenized sentence through the model
- Extract the last hidden state from the output of the model
- Report its shape
- How many attention layers are there?
- What is the shape of one of the attention layers?


In [7]:
# Add your solution here:
# ...

In [8]:
BERT_output = model(token_ids)

# A hidden layer - 1 batch, 7 tokens, and 768 hidden dimensions
print(f"Size of last hidden layer: {BERT_output['last_hidden_state'].shape}")

# By contrast, a rather hefty collection of all attentions
# Note we have 12 layers
print(f"Number of layers: {len(BERT_output['attentions'])}")

# Each layer holds 1 batch x 12 heads x (7 x 7) token-to-token weights; index 2 is an arbitrary layer
print(f"Shape of attention vectors from one layer: {BERT_output['attentions'][2].shape}")

Size of last hidden layer: torch.Size([1, 7, 768])
Number of layers: 12
Shape of attention vectors from one layer: torch.Size([1, 12, 7, 7])


### Exercise 4 - Visualizing Attention

Since we have supplied one sentence with 7 tokens (including the special `[CLS]` and `[SEP]` tokens), the attention matrices will be 7 by 7.
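
To see where a 7 by 7 matrix comes from, here is a minimal sketch of one attention head, assuming the standard scaled dot-product formulation. The query and key vectors are random stand-ins, not BERT's learned projections. The head dimension of 64 follows from 768 hidden dimensions split across 12 heads.

```python
import math
import random

random.seed(0)

seq_len = 7    # tokens in the sentence, including [CLS] and [SEP]
head_dim = 64  # 768 hidden dimensions / 12 heads

# Random stand-ins for one head's query and key vectors (not real BERT values)
queries = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]


def softmax(row):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# One row per query token: scaled dot products with every key, then softmax
attention = []
for q in queries:
    scores = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
        for k in keys
    ]
    attention.append(softmax(scores))

print(f"Matrix size: {len(attention)} x {len(attention[0])}")
print(f"Row sums: {[round(sum(row), 6) for row in attention]}")
```

Each row is a probability distribution over the 7 tokens. The model output stacks one such matrix per head and per batch element, which gives the `[1, 12, 7, 7]` shape of a single layer.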

Let's visualize them using the `head_view` function from `bertviz`!

In [9]:
# Add your solution here:
# ...

In [10]:
# convert_ids_to_tokens expects a list of ids, whereas the model requires a tensor, hence the .tolist()
head_view(BERT_output['attentions'], tokenizer.convert_ids_to_tokens(token_ids[0].tolist()))

<IPython.core.display.Javascript object>

In the default view, we can switch between layers and heads. 
Hover over "play" to see that BERT directs much of its attention to "guitar"!
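
The same inspection can be done numerically. The helper below is a hypothetical sketch, not part of `bertviz` or Transformers: `strongest_targets` and the toy weights are made up for illustration. It assumes that row `i` of an attention matrix holds the weights token `i` places on every token.

```python
def strongest_targets(attn_matrix, tokens, query_token, top_k=3):
    """Return the top_k (token, weight) pairs that query_token attends to.

    attn_matrix is a seq_len x seq_len nested list for ONE layer and ONE head.
    If query_token occurs more than once, its first occurrence is used.
    """
    row = attn_matrix[tokens.index(query_token)]
    ranked = sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up weights for illustration only; these are NOT real BERT outputs.
tokens = ["[CLS]", "I", "play", "the", "guitar", ".", "[SEP]"]
toy_matrix = [[1 / len(tokens)] * len(tokens) for _ in tokens]
toy_matrix[2] = [0.05, 0.12, 0.08, 0.05, 0.60, 0.05, 0.05]  # row for "play"

top = strongest_targets(toy_matrix, tokens, "play", top_k=2)
print(top)
```

With the real model output, a matrix for a chosen `layer` and `head` can be obtained via `BERT_output['attentions'][layer][0, head].tolist()`, and the token list via `tokenizer.convert_ids_to_tokens(token_ids[0].tolist())`.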

Now visualize the second sentence with `head_view` as well, and find out which word is most strongly connected to "play" by attention there.

In [11]:
# Add your solution here:
# ...

In [12]:
# Run the second sentence through the model, then plot its attentions with the ids converted back to tokens
BERT_output2 = model(token_ids2)

head_view(BERT_output2['attentions'], tokenizer.convert_ids_to_tokens(token_ids2[0].tolist()))

<IPython.core.display.Javascript object>

### Exercise 5 - Visualizing the Whole Network

Similarly, we can visualize the whole network using the `model_view` function of `bertviz`.

In [13]:
model_view(BERT_output['attentions'], tokenizer.convert_ids_to_tokens(token_ids[0].tolist()))

<IPython.core.display.Javascript object>