# Attention

This notebook is part of the lecture series at the Faculty Development Programme organised by the Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, in association with ShodhGuru Innovation and Research Labs, India. It accompanies Tek Raj Chhetri's lecture, "Applications of Deep Neural Networks in Knowledge Graph Construction".





## Understanding Attention

In [6]:
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from transformers import AutoTokenizer, AutoModel, utils
from bertviz.neuron_view import show
from bertviz import head_view

In [7]:
text = 'My friends told me about attention paper and I enjoyed reading it'

In [8]:
model_type = 'bert'
model_version = 'bert-base-uncased'
do_lower_case = True
model = BertModel.from_pretrained(model_version)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

In [9]:
show(model, model_type, tokenizer, text, layer=3, head=0)

<IPython.core.display.Javascript object>

In [10]:
m = BertModel.from_pretrained(model_version, output_attentions=True)

In [11]:
import torch
# Convert the sentence to token ids, then add a batch dimension: shape (1, sequence_length)
tok = tokenizer.encode(text)
tnptrch = torch.tensor(tok)
tnptrch = tnptrch.unsqueeze(0)
tnptrch

tensor([[2026, 2814, 2409, 2033, 2055, 3086, 3259, 1998, 1045, 5632, 3752, 2009]])

In [12]:
# Index 2 of the output tuple holds the attention data, one entry per encoder layer
attention_result_original = m(tnptrch)[2]

In [13]:
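# Attention weights from one encoder layer, reduced with a mean.
# Note: [0] indexes the first encoder layer; use [-1] to select the last one.
# ['attn'][-1] picks the only item in the batch, shape (heads, query, key),
# so .mean(1) averages over the query axis; .mean(0) would average over heads.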
final_attention_result = attention_result_original[0]['attn'][-1].mean(1)
final_attention_result

tensor([[0.1009, 0.0695, 0.1173, 0.1204, 0.0795, 0.0464, 0.0527, 0.0732, 0.1078,
         0.1146, 0.0559, 0.0618],
        [0.0381, 0.0965, 0.1179, 0.0491, 0.0675, 0.1137, 0.1373, 0.0386, 0.0271,
         0.1741, 0.1074, 0.0327],
        [0.2795, 0.0716, 0.1413, 0.0256, 0.0642, 0.0286, 0.0447, 0.0791, 0.0718,
         0.0563, 0.1066, 0.0307],
        [0.0800, 0.1250, 0.1482, 0.0694, 0.0972, 0.0898, 0.0430, 0.1209, 0.0678,
         0.0938, 0.0556, 0.0094],
        [0.1149, 0.0893, 0.1212, 0.0490, 0.0960, 0.0825, 0.0560, 0.0569, 0.0399,
         0.1654, 0.0616, 0.0674],
        [0.0612, 0.1040, 0.1098, 0.0316, 0.0784, 0.1345, 0.0722, 0.0464, 0.0283,
         0.1727, 0.1247, 0.0360],
        [0.0861, 0.1070, 0.1103, 0.0693, 0.0567, 0.0858, 0.0874, 0.0279, 0.0630,
         0.1242, 0.1447, 0.0377],
        [0.1798, 0.1090, 0.0722, 0.1301, 0.0473, 0.0657, 0.0381, 0.0422, 0.0943,
         0.0672, 0.0582, 0.0959],
        [0.0454, 0.0764, 0.0730, 0.0611, 0.0601, 0.1202, 0.1176, 0.0677, 0.0730,

In [14]:
final_attention_result = final_attention_result.detach()

In [15]:
import pandas as pd
attention_df = pd.DataFrame(final_attention_result)

In [16]:
attention_df.columns = tokenizer.convert_ids_to_tokens(tok) 
attention_df.index = tokenizer.convert_ids_to_tokens(tok) 
attention_df

| | my | friends | told | me | about | attention | paper | and | i | enjoyed | reading | it |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| my | 0.100934 | 0.06947 | 0.117299 | 0.120445 | 0.079473 | 0.046395 | 0.052692 | 0.073218 | 0.107774 | 0.114604 | 0.055917 | 0.061779 |
| friends | 0.038067 | 0.096476 | 0.117915 | 0.049149 | 0.067461 | 0.113744 | 0.13734 | 0.03857 | 0.027052 | 0.174089 | 0.107412 | 0.032725 |
| told | 0.279546 | 0.071563 | 0.141285 | 0.025596 | 0.064213 | 0.028561 | 0.044704 | 0.079137 | 0.071826 | 0.056287 | 0.106594 | 0.030688 |
| me | 0.080006 | 0.124975 | 0.148152 | 0.069366 | 0.097184 | 0.089837 | 0.042953 | 0.120903 | 0.067797 | 0.093817 | 0.055603 | 0.009408 |
| about | 0.114902 | 0.089267 | 0.121241 | 0.048999 | 0.095998 | 0.082544 | 0.055986 | 0.056864 | 0.039855 | 0.165398 | 0.061551 | 0.067397 |
| attention | 0.061192 | 0.103999 | 0.109819 | 0.031648 | 0.078422 | 0.13451 | 0.072197 | 0.046429 | 0.028284 | 0.17271 | 0.124742 | 0.036049 |
| paper | 0.086109 | 0.106978 | 0.110281 | 0.069261 | 0.056677 | 0.08579 | 0.087418 | 0.027939 | 0.062958 | 0.124162 | 0.1447 | 0.037727 |
| and | 0.179758 | 0.109041 | 0.072173 | 0.130123 | 0.047328 | 0.065694 | 0.038095 | 0.042239 | 0.094282 | 0.067172 | 0.058225 | 0.095871 |
| i | 0.04542 | 0.07636 | 0.073031 | 0.061148 | 0.060051 | 0.12022 | 0.117625 | 0.067718 | 0.073002 | 0.129334 | 0.099807 | 0.076283 |
| enjoyed | 0.232717 | 0.128013 | 0.066172 | 0.089593 | 0.033107 | 0.060076 | 0.098414 | 0.029566 | 0.065224 | 0.072466 | 0.095945 | 0.028707 |
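
Each row of the table sums to 1 (up to rounding). This follows from how attention weights are produced: every row is a softmax over scaled dot products between one query and all keys, and a mean of such rows still sums to 1. The sketch below illustrates this in pure Python rather than with the notebook's model. The function names (`softmax`, `attention_weights`, `average_heads`) and the toy vectors are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one row per query token."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def average_heads(per_head_weights):
    """Element-wise mean over a list of (tokens x tokens) weight matrices."""
    n_heads = len(per_head_weights)
    n_rows = len(per_head_weights[0])
    n_cols = len(per_head_weights[0][0])
    return [
        [sum(head[r][c] for head in per_head_weights) / n_heads for c in range(n_cols)]
        for r in range(n_rows)
    ]

# Hypothetical toy data: 3 tokens, 2 heads, 2-dimensional queries and keys.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_a = attention_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], toy_keys)
head_b = attention_weights([[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]], toy_keys)
averaged = average_heads([head_a, head_b])

for row in averaged:
    print([round(w, 4) for w in row], "sum =", round(sum(row), 4))
```

In the notebook itself the same reduction is a single tensor `mean` call; the explicit loops here only make the arithmetic visible.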


## Visualisation of head

In [17]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [18]:
inputs = tokenizer.encode(text, return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1]  # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0]) 
head_view(attention, tokens)

<IPython.core.display.Javascript object>
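
The interactive head view does not render in a static export, so a plain-text summary can be a useful complement. The following is a minimal sketch, assuming `weights` is a square list of lists for a single head, for example obtained with something like `attention[layer][0, head].tolist()`. The helper name `top_attended` and the toy tokens and weights are hypothetical and not taken from the model output.

```python
def top_attended(tokens, weights, k=2):
    """For each query token, return the k key tokens with the largest weights."""
    summary = []
    for query, row in zip(tokens, weights):
        ranked = sorted(zip(tokens, row), key=lambda pair: pair[1], reverse=True)
        summary.append((query, ranked[:k]))
    return summary

# Hypothetical toy example: three tokens and made-up weights (each row sums to 1).
toy_tokens = ["[CLS]", "attention", "[SEP]"]
toy_weights = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]

toy_summary = top_attended(toy_tokens, toy_weights, k=2)
for query, ranked in toy_summary:
    print(query, "->", ranked)
```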