Analysis of BERT's attention heads using HuggingFace Transformers
Unofficial PyTorch implementation of the paper "What Does BERT Look At? An Analysis of BERT's Attention", using HuggingFace Transformers.
BERT-base has 12 layers, and each layer has 12 attention heads. Each attention head produces an attention matrix of shape `(sequence_length, sequence_length)`. The attention of the whole network can therefore be captured in a four-dimensional tensor `attentions_map` of shape `(12, 12, sequence_length, sequence_length)`. `attentions_map[i][j][k][l]` gives the amount of attention the k-th token of the sequence pays to the l-th token in the i-th layer's j-th head. The paper sets out to find interesting attention patterns exhibited by each of the 144 heads.
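For illustration, here is a minimal sketch (not the repository's exact code) of how such an `attentions_map` can be built with HuggingFace Transformers; the model name and example sentence are arbitrary choices:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per layer),
# each of shape (batch_size, num_heads, seq_len, seq_len).
attentions_map = torch.stack(outputs.attentions).squeeze(1)
print(attentions_map.shape)  # (12, 12, seq_len, seq_len)
```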
The paper uses 1000 sequences, each of length 128, to extract attention maps. Due to memory constraints, I used only 150 sequences (each of length 128), but the results were identical to the paper's. The WikiText-2 dataset was used for the analysis carried out in `general_analysis.ipynb`. Each sequence was of the form `[CLS]<sequence_a>[SEP]<sequence_b>[SEP]`. An attention map is extracted for each sequence and stored in a pickle file. Each attention map is of shape `[NUM_LAYERS, NUM_HEADS, SEQ_LEN, SEQ_LEN]`, where `NUM_LAYERS = 12`, `NUM_HEADS = 12`, and `SEQ_LEN` is the length of the sequence without the padding tokens.
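A hedged sketch of that extraction step is shown below; the function name, the pickle file name, and the use of the attention mask to drop padding are my assumptions rather than the repository's exact code:

```python
import pickle
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def extract_attention_map(sequence_a, sequence_b):
    # Tokenizing a text pair yields [CLS]<sequence_a>[SEP]<sequence_b>[SEP],
    # padded/truncated to 128 tokens.
    inputs = tokenizer(sequence_a, sequence_b, padding="max_length",
                       truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    seq_len = int(inputs["attention_mask"].sum())  # length without padding
    # Stack per-layer attentions and drop the padded positions.
    attn = torch.stack(outputs.attentions).squeeze(1)   # (12, 12, 128, 128)
    return attn[:, :, :seq_len, :seq_len].numpy()       # (12, 12, seq_len, seq_len)

# Hypothetical usage: sequence_pairs is a list of (sequence_a, sequence_b) tuples.
# maps = [extract_attention_map(a, b) for a, b in sequence_pairs]
# with open("attention_maps.pkl", "wb") as f:
#     pickle.dump(maps, f)
```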
The authors investigate attention between several types of tokens: