In [1]:
import torch
import numpy as np
import pandas as pd

Torch Example: https://github.com/Jiarong-L/BioDL/blob/main/1.%20Protein_seq_Classifier/VFDB_Transformer.ipynb

1. PositionalEncoding
2. nn.TransformerEncoderLayer + nn.

[Illustrated walkthrough](https://zhuanlan.zhihu.com/p/338817680): each row of the attention matrix corresponds to the representation of one word

## Transformer

$Attention(Q,K,V) = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$, where $d_k$ is the embedding dimension of $K$

$MultiHeadAttention = Concat(Attention_1, Attention_2, Attention_3, \ldots)\,W_H$, where $W_H$ is a learned projection that mixes the concatenated heads
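
The multi-head formula can be sketched in NumPy as below. This is a minimal illustration, not the notebook's own implementation: the sizes `d_head` and `n_heads`, the random weights, and the helper names `softmax_rows` and `attention` are assumptions made for the example. Following the original Transformer paper, the heads are concatenated and then mixed by one projection matrix, called `WH` here to match the comment in the code cell further down.

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for 2-D arrays of shape (word_num, d_head).
    d_k = K.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
word_num, word_emb, d_head, n_heads = 4, 5, 3, 2   # illustrative sizes (assumed)
X = rng.standard_normal((word_num, word_emb))      # stand-in for the word embeddings

heads = []
for _ in range(n_heads):
    # Each head has its own projections; random here, learned in a real model.
    WQ_h, WK_h, WV_h = (rng.random((word_emb, d_head)) for _ in range(3))
    heads.append(attention(X @ WQ_h, X @ WK_h, X @ WV_h))

WH = rng.random((n_heads * d_head, word_emb))      # projection over concatenated heads
multi_head = np.concatenate(heads, axis=1) @ WH    # shape (word_num, word_emb)
```

Each head works in a smaller subspace (`d_head`), so concatenating `n_heads` of them keeps the total width comparable to a single wide head.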


**Transformer**: see Figure 1 of https://arxiv.org/abs/1706.03762 

**Positional Encoding**: `pos` is the position in the sequence, `i` indexes the sine/cosine column pairs of the input matrix   
PE(pos,2i) = $\sin(pos/10000^{2i/d_{model}})$   
PE(pos,2i+1) = $\cos(pos/10000^{2i/d_{model}})$  
PE(others) = 0
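
The two formulas above can be written as a short NumPy function. This is a sketch under the assumption that `d_model` is even; the function name `positional_encoding` and the example sizes are chosen for illustration only.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))          # any column not filled stays 0
    pe[:, 0::2] = np.sin(angle)                # even columns: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                # odd columns:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
```

The result has one row per position and is typically added element-wise to the word embeddings before the first attention layer.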


In [2]:
d_model = 6
word_emb = 5
word_num = 4

WQ = np.random.rand(word_emb,d_model)
WK = np.random.rand(word_emb,d_model)    ## dk=d_model
WV = np.random.rand(word_emb,d_model)


input = pd.DataFrame({
    "<sos>": np.random.randn(word_emb),
    "<you>": np.random.randn(word_emb),
    "<like>": np.random.randn(word_emb),
    "<swimming>": np.random.randn(word_emb),
}).T


Q = input @ WQ    ## query: (word_num, d_model)
K = input @ WK    ##   key: (word_num, d_model)
V = input @ WV    ## value: (word_num, d_model)

def softmax(M):
    M = M.values
    M = M - np.max(M, axis=1, keepdims=True)
    return np.exp(M)/np.sum(np.exp(M),axis=1,keepdims=True)


Att = softmax(Q @ K.T /np.sqrt(d_model)) @ V
## Multi-head = np.concatenate([Att_1, Att_2, Att_3], axis=1) @ WH    ##--> WH is the projection that mixes the concatenated heads

Att

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2.388998 | 1.460099 | 1.954951 | 2.71526 | 1.949436 | 2.594846 |
| 1 | 2.336696 | 1.698741 | 1.801907 | 2.615653 | 2.10049 | 2.675136 |
| 2 | 2.501407 | 1.562851 | 2.035426 | 2.758198 | 2.033766 | 2.696152 |
| 3 | 2.591722 | 1.581702 | 2.120205 | 2.80373 | 2.056615 | 2.741807 |


In [3]:
## pytorch Function
## https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

torch.nn.functional.scaled_dot_product_attention(torch.tensor(Q.values), torch.tensor(K.values), torch.tensor(V.values))

tensor([[2.3890, 1.4601, 1.9550, 2.7153, 1.9494, 2.5948],
        [2.3367, 1.6987, 1.8019, 2.6157, 2.1005, 2.6751],
        [2.5014, 1.5629, 2.0354, 2.7582, 2.0338, 2.6962],
        [2.5917, 1.5817, 2.1202, 2.8037, 2.0566, 2.7418]], dtype=torch.float64)

### Mask

```

Encoder:    (pos_emb,seq_source) ------------ [attention] ---------- [Forward] ---- (feature)

Decoder:    (feature,pos_emb,seq_target) ---- [masked_attention] --- [Forward] ---- (next word)


(feature,pos_emb,'<sos>') ---> Decoder ---> 'you'
(feature,pos_emb,'<sos> you') ---> Decoder ---> 'like'
(feature,pos_emb,'<sos> you like') ---> Decoder ---> 'swimming'
(feature,pos_emb,'<sos> you like swimming') ---> Decoder ---> '<eos>'
```
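
The step-by-step decoding in the diagram above amounts to a loop that feeds the growing prefix back into the decoder until `<eos>` appears. The sketch below shows only that loop: `toy_next_token` is a hypothetical lookup table standing in for a trained decoder, and it ignores the `feature` and `pos_emb` inputs that a real decoder would also receive.

```python
def greedy_decode(next_token, max_len=10):
    """Autoregressive loop: extend the prefix one token at a time until <eos>."""
    tokens = ["<sos>"]
    while len(tokens) < max_len:
        tok = next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

# Hypothetical stand-in for a trained decoder: looks only at the last token.
toy_table = {"<sos>": "you", "you": "like", "like": "swimming", "swimming": "<eos>"}

def toy_next_token(prefix):
    return toy_table[prefix[-1]]

decoded = greedy_decode(toy_next_token)
print(decoded)
```

At inference time the loop has to run once per generated token, which is why training uses the masked, single-pass formulation described next.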

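During training the decoder already knows the full target sequence, so its attention matrix can be computed in a single pass. The positions that have not been generated yet must be masked out of the input, though: applying softmax to the masked QK scores and multiplying by V gives Masked Self-Attention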


In [4]:
QKt = Q @ K.T / np.sqrt(d_model)
QKt_tensor = torch.tensor(QKt.values)
QKt

| | `<sos>` | `<you>` | `<like>` | `<swimming>` |
|---|---|---|---|---|
| `<sos>` | 8.608641 | 3.832589 | 7.741274 | 7.718041 |
| `<you>` | 9.188842 | 3.664324 | 8.477303 | 9.482574 |
| `<like>` | 12.407758 | 4.827441 | 10.881397 | 11.59025 |
| `<swimming>` | 14.187342 | 4.906694 | 11.986032 | 13.136671 |


In [5]:
mask = torch.tril(torch.ones_like(QKt_tensor))
mask == 0

tensor([[False,  True,  True,  True],
        [False, False,  True,  True],
        [False, False, False,  True],
        [False, False, False, False]])

In [6]:
QKt_att = QKt_tensor.masked_fill(mask == 0, -torch.inf)   ## out-of-place, so QKt_tensor keeps the unmasked scores
QKt_att

tensor([[ 8.6086,    -inf,    -inf,    -inf],
        [ 9.1888,  3.6643,    -inf,    -inf],
        [12.4078,  4.8274, 10.8814,    -inf],
        [14.1873,  4.9067, 11.9860, 13.1367]], dtype=torch.float64)

In [7]:
torch.softmax(QKt_att, dim=-1)

tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [9.9603e-01, 3.9719e-03, 0.0000e+00, 0.0000e+00],
        [8.2113e-01, 4.1910e-04, 1.7845e-01, 0.0000e+00],
        [6.8472e-01, 6.3823e-05, 7.5770e-02, 2.3945e-01]], dtype=torch.float64)
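
Putting the masking steps together, a complete masked self-attention can be sketched in NumPy as follows. The function name `causal_attention` and the random `Q_demo`, `K_demo`, `V_demo` arrays are assumptions for illustration; they are not the notebook's `Q`, `K`, `V`.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: row t attends only to positions 0..t."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)     # stabilise the softmax
    weights = np.exp(scores)                                 # exp(-inf) becomes 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q_demo, K_demo, V_demo = (rng.standard_normal((4, 6)) for _ in range(3))
out, weights = causal_attention(Q_demo, K_demo, V_demo)
```

Because the first row can only attend to itself, `out[0]` equals `V_demo[0]`, and changing a later row of `V_demo` leaves the earlier output rows unchanged. PyTorch's `scaled_dot_product_attention` applies the same lower-triangular masking when called with `is_causal=True`.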