## Attention


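Attention starts from two different entities: one that attends (for example, a decoder state) and one that is attended to (for example, the encoder states).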

### Intuition

![](./resource/seq2seq.jpg)

From: https://arxiv.org/pdf/1703.03906.pdf

![](./resource/seq2seq2.gif)

From: https://github.com/google/seq2seq

### How It Is Computed

Additive attention (for example, Bahdanau attention):

$$
\operatorname{score}\left(\boldsymbol{h}_{t}, \overline{\boldsymbol{h}}_{s}\right)=\boldsymbol{v}_a^{\top} \tanh \left(\boldsymbol{W}_{\mathbf{1}} \boldsymbol{h}_t+\boldsymbol{W}_{\mathbf{2}} \overline{\boldsymbol{h}}_s\right)
$$

Multiplicative attention (for example, Luong attention):

$$
\operatorname{score}\left(\boldsymbol{h}_{t}, \overline{\boldsymbol{h}}_{s}\right)=\left\{\begin{array}{ll}
\boldsymbol{h}_{t}^{\top} \overline{\boldsymbol{h}}_{s} & \text { dot } \\
\boldsymbol{h}_{t}^{\top} \boldsymbol{W}_{a} \overline{\boldsymbol{h}}_{s} & \text { general } \\
\boldsymbol{v}_{a}^{\top} \tanh \left(\boldsymbol{W}_{a}\left[\boldsymbol{h}_{t} ; \overline{\boldsymbol{h}}_{s}\right]\right) & \text { concat }
\end{array}\right.
$$

From: https://arxiv.org/pdf/1508.04025.pdf
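As a minimal sketch of the score functions above, assuming PyTorch and randomly initialized parameters. The sizes `d`, `d_a`, `src_len` and all tensor names are illustrative assumptions, not taken from either paper's code. It also shows that Luong's "concat" score coincides with the additive form when $\boldsymbol{W}_a$ is the horizontal stack of $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$.

```python
import torch

torch.manual_seed(0)
d, d_a, src_len = 8, 16, 5           # hidden size, attention size, source length (illustrative)

h_t = torch.randn(d)                 # current target (decoder) hidden state
h_s = torch.randn(src_len, d)        # all source (encoder) hidden states

# Additive (Bahdanau-style): v_a^T tanh(W_1 h_t + W_2 h_s)
W_1, W_2, v_a = torch.randn(d_a, d), torch.randn(d_a, d), torch.randn(d_a)
additive = torch.tanh(h_t @ W_1.T + h_s @ W_2.T) @ v_a          # (src_len,)

# Luong "dot": h_t^T h_s
dot = h_s @ h_t                                                  # (src_len,)

# Luong "general": h_t^T W_a h_s
W_a = torch.randn(d, d)
general = (h_s @ W_a.T) @ h_t                                    # (src_len,)

# Luong "concat": v_a^T tanh(W [h_t ; h_s]); with W = [W_1 W_2] it equals the additive score.
W_cat = torch.cat([W_1, W_2], dim=1)                             # (d_a, 2d)
pairs = torch.cat([h_t.expand(src_len, d), h_s], dim=1)          # (src_len, 2d)
concat = torch.tanh(pairs @ W_cat.T) @ v_a                       # (src_len,)

# Any of the scores can be normalized into attention weights and a context vector.
weights = torch.softmax(dot, dim=-1)                             # sums to 1 over source positions
context = weights @ h_s                                          # (d,)
print(additive.shape, dot.shape, general.shape, context.shape)
```

Whatever the score function, the remaining steps are the same: a softmax over source positions, then a weighted sum of the source states.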

## From Attention to Self-Attention

### Self-Attention

The paper "Attention Is All You Need" introduced Multi-Head Self-Attention, which is built on Scaled Dot-Product Attention:

$$
\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
$$

From: https://arxiv.org/pdf/1706.03762.pdf
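The formula can be sketched in a few lines, assuming PyTorch. The function name and the tensor shapes are illustrative assumptions. For brevity the example uses the input itself as Q, K, and V, whereas a real self-attention layer first applies learned linear projections.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for tensors shaped (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights, weights @ v

torch.manual_seed(0)
x = torch.randn(2, 4, 8)            # (batch, seq_len, d_k)
weights, out = scaled_dot_product_attention(x, x, x)
print(weights.shape, out.shape)     # torch.Size([2, 4, 4]) torch.Size([2, 4, 8])
```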

### Scaled

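The purpose of scaling is to keep the dot products from growing too large. If they are too large, the softmax output becomes almost exactly 0 or 1 and is no longer "soft". The original paper gives the reason for the factor: if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings the variance back to 1.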

From: https://kexue.fm/archives/4765
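The effect is easy to see numerically. The following sketch uses only the Python standard library; the sizes `d_k = 512` and `n_keys = 6` and the random Gaussian vectors are arbitrary assumptions chosen for illustration.

```python
import math
import random

def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k, n_keys = 512, 6
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]   # standard deviation around sqrt(d_k)
scaled = [s / math.sqrt(d_k) for s in raw]               # standard deviation around 1

print(max(softmax(raw)))      # typically very close to 1: nearly one-hot
print(max(softmax(scaled)))   # smaller: the distribution stays "soft"
```

Dividing the scores by a constant greater than 1 can never increase the largest softmax probability, so the scaled distribution is always at least as flat as the unscaled one.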

### Multi-Head


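Multi-Head can be understood as several attention modules running in parallel, in the hope that different modules "attend" to different places, similar to the multiple kernels of a CNN.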

> Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

$$
\begin{aligned}
\operatorname{MultiHead}(Q, K, V) & =\operatorname{Concat}\left(\operatorname{head}_1, \ldots, \text { head }_{\mathrm{h}}\right) W^O \\
\text { where head }_{\mathrm{i}} & =\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)
\end{aligned}
$$



From: https://arxiv.org/pdf/1706.03762.pdf
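To make the formula concrete, here is a hedged sketch of a multi-head self-attention layer in PyTorch. It is not the `SelfAttention` module imported later in this notebook, whose source is not shown. The class name, the constructor arguments, and the choice to return `(attention, output)` are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttentionSketch(nn.Module):
    """Illustrative multi-head self-attention; not the notebook's SelfAttention module."""

    def __init__(self, hidden_dim, num_heads, head_dim, dropout=0.0):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        inner = num_heads * head_dim
        # One fused projection per role stacks the per-head W_i^Q, W_i^K, W_i^V side by side.
        self.q_proj = nn.Linear(hidden_dim, inner, bias=False)
        self.k_proj = nn.Linear(hidden_dim, inner, bias=False)
        self.v_proj = nn.Linear(hidden_dim, inner, bias=False)
        self.o_proj = nn.Linear(inner, hidden_dim, bias=False)   # W^O
        self.dropout = nn.Dropout(dropout)

    def _split(self, t):
        # (batch, seq, heads * head_dim) -> (batch, heads, seq, head_dim)
        b, s, _ = t.shape
        return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        q = self._split(self.q_proj(x))
        k = self._split(self.k_proj(x))
        v = self._split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        attn = torch.softmax(scores, dim=-1)                  # (batch, heads, seq, seq)
        heads = self.dropout(attn) @ v                        # (batch, heads, seq, head_dim)
        b, _, s, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, s, -1)      # Concat(head_1, ..., head_h)
        return attn, self.o_proj(concat)

layer = MultiHeadSelfAttentionSketch(hidden_dim=512, num_heads=16, head_dim=32).eval()
attn, out = layer(torch.randn(3, 30, 512))
print(attn.shape, out.shape)    # torch.Size([3, 16, 30, 30]) torch.Size([3, 30, 512])
```

Each head gets its own `seq x seq` attention map, which is why the attention tensor has a separate head dimension.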

## Hands-On Practice

In [None]:
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
from selfattention import SelfAttention

### Model

We use a single core SelfAttention module (it supports either Single-Head or Multi-Head) to learn and understand the attention mechanism.

In [None]:
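class Model(nn.Module):
    """Embedding -> self-attention -> mean pooling -> linear classifier."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.emb = nn.Embedding(config.vocab_size, config.hidden_dim)
        self.attn = SelfAttention(config)
        self.fc = nn.Linear(config.hidden_dim, config.num_labels)

    def forward(self, x):
        # x: (batch_size, seq_len) token ids
        batch_size, seq_len = x.shape
        h = self.emb(x)                      # (batch_size, seq_len, hidden_dim)
        attn_score, h = self.attn(h)
        # Average over the sequence dimension -> (batch_size, hidden_dim)
        h = F.avg_pool1d(h.permute(0, 2, 1), seq_len, 1)
        h = h.squeeze(-1)
        logits = self.fc(h)                  # (batch_size, num_labels)
        return attn_score, logits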
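The pooling step in `forward` averages the token representations over the whole sequence. A small check, assuming only PyTorch and random inputs, that the `avg_pool1d` call gives the same result as a plain mean over the sequence dimension:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, hidden_dim = 3, 30, 512
h = torch.randn(batch_size, seq_len, hidden_dim)

# As in Model.forward: one pooling window that spans the whole sequence.
pooled = F.avg_pool1d(h.permute(0, 2, 1), seq_len, 1).squeeze(-1)   # (batch_size, hidden_dim)

# Shorter equivalent: average over the sequence dimension.
mean = h.mean(dim=1)

print(pooled.shape, torch.allclose(pooled, mean, atol=1e-5))
```

Note that every position contributes equally to this average. If the tokenizer pads shorter texts in a batch, the padded positions are averaged in as well, since no mask is applied here.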

In [None]:
@dataclass
class Config:
    
    vocab_size: int = 5000
    hidden_dim: int = 512
    num_heads: int = 16
    head_dim: int = 32
    dropout: float = 0.1
    
    num_labels: int = 2
    
    max_seq_len: int = 512
    
    num_epochs: int = 10

In [None]:
config = Config(vocab_size=5000, hidden_dim=512, num_heads=16, head_dim=32, dropout=0.1, num_labels=2)

In [None]:
model = Model(config)

In [None]:
x = torch.randint(0, 5000, (3, 30))
x.shape

In [None]:
attn, logits = model(x)
attn.shape, logits.shape

### Data

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
file_path = "./data/ChnSentiCorp_htl_all.csv"

In [None]:
df = pd.read_csv(file_path)
df = df.dropna()
df.head(), df.shape

In [None]:
df.label.value_counts()

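The classes are imbalanced, so we apply a simple resampling: keep a random 2,500 of the label-1 reviews and all of the label-0 reviews.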

In [None]:
df = pd.concat([df[df.label==1].sample(2500), df[df.label==0]])
df.shape

In [None]:
df.label.value_counts()

In [None]:
from tokenizer import Tokenizer

In [None]:
tokenizer = Tokenizer(config.vocab_size, config.max_seq_len)

In [None]:
tokenizer.build_vocab(df.review)

In [None]:
tokenizer(["你好", "你好呀"])
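The `Tokenizer` class comes from a local module whose source is not shown here. As a rough, hypothetical sketch of what a character-level tokenizer with a similar interface might do (the class name, the special tokens, their ids, and the padding policy are all assumptions; the notebook's tokenizer evidently returns tensors, while this sketch returns plain lists):

```python
from collections import Counter

class CharTokenizerSketch:
    """Hypothetical character-level tokenizer; not the notebook's Tokenizer."""

    PAD, UNK = "<pad>", "<unk>"

    def __init__(self, vocab_size, max_seq_len):
        self.vocab_size, self.max_seq_len = vocab_size, max_seq_len
        self.freq = Counter()
        self.token_to_id = {self.PAD: 0, self.UNK: 1}

    def build_vocab(self, texts):
        for text in texts:
            self.freq.update(text)               # count individual characters
        # Keep the most frequent characters, leaving room for the two special tokens.
        for ch, _ in self.freq.most_common(self.vocab_size - 2):
            self.token_to_id[ch] = len(self.token_to_id)

    def tokenize(self, text):
        return list(text)[: self.max_seq_len]    # truncate long texts

    def __call__(self, texts):
        batch = [self.tokenize(t) for t in texts]
        width = max(len(tokens) for tokens in batch)
        unk, pad = self.token_to_id[self.UNK], self.token_to_id[self.PAD]
        # Map characters to ids and right-pad every row to the longest text in the batch.
        return [
            [self.token_to_id.get(tok, unk) for tok in tokens] + [pad] * (width - len(tokens))
            for tokens in batch
        ]

tok = CharTokenizerSketch(vocab_size=5000, max_seq_len=512)
tok.build_vocab(["你好", "你好呀"])
print(tok(["你好", "你好呀"]))   # [[2, 3, 0], [2, 3, 4]]
```

Padding to the longest text in each batch is what lets `collate_batch` stack reviews of different lengths into one rectangular input.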

In [None]:
def collate_batch(batch):
    label_list, text_list = [], []
    for v in batch:
        _label = v["label"]
        _text = v["text"]
        label_list.append(_label)
        text_list.append(_text)
    inputs = tokenizer(text_list)
    labels = torch.LongTensor(label_list)
    return inputs, labels

In [None]:
from dataset import Dataset

In [None]:
ds = Dataset()
ds.build(df, "review", "label")

In [None]:
len(ds), ds[0]

In [None]:
train_ds, test_ds = train_test_split(ds, test_size=0.2)
train_ds, valid_ds = train_test_split(train_ds, test_size=0.1)
len(train_ds), len(valid_ds), len(test_ds)

In [None]:
from torch.utils.data import DataLoader

In [None]:
BATCH_SIZE = 8

In [None]:
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(valid_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE, collate_fn=collate_batch)
len(train_dl), len(valid_dl), len(test_dl)

In [None]:
v = next(iter(train_dl))  # grab one batch to inspect

In [None]:
v[0].shape, v[1].shape, v[0].dtype, v[1].dtype

### Training

In [None]:
from trainer import train, test

In [None]:
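NUM_EPOCHS = 10
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Single-head configuration: one head whose head_dim equals hidden_dim.
config = Config(vocab_size=5000, hidden_dim=64, num_heads=1, head_dim=64,
                dropout=0.1, num_labels=2, num_epochs=NUM_EPOCHS)
model = Model(config)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
train(model, optimizer, train_dl, valid_dl, config)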

test(model, test_dl)
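The `train` and `test` helpers come from a local `trainer` module whose source is not shown. The following is only a guess at the general shape of such helpers, assuming a classifier that returns `(attention, logits)` like `Model` above. The function names, the cross-entropy loss, and the `TinyModel` stand-in with synthetic batches are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_one_epoch(model, optimizer, batches, device):
    """Run one pass over `batches` of (inputs, labels); return the mean loss."""
    model.train()
    total = 0.0
    for inputs, labels in batches:
        inputs, labels = inputs.to(device), labels.to(device)
        _, logits = model(inputs)                 # the model returns (attention, logits)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(batches)

@torch.no_grad()
def accuracy(model, batches, device):
    """Fraction of correctly classified examples."""
    model.eval()
    correct = count = 0
    for inputs, labels in batches:
        _, logits = model(inputs.to(device))
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        count += labels.numel()
    return correct / count

class TinyModel(nn.Module):
    """Stand-in with the same (attention, logits) return convention as Model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(50, 8)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return None, self.fc(self.emb(x).mean(dim=1))

torch.manual_seed(0)
device = torch.device("cpu")
# Four synthetic batches of 8 sequences with 12 token ids each, plus binary labels.
batches = [(torch.randint(0, 50, (8, 12)), torch.randint(0, 2, (8,))) for _ in range(4)]
model_sketch = TinyModel().to(device)
opt = torch.optim.Adam(model_sketch.parameters(), lr=1e-3)
loss_value = train_one_epoch(model_sketch, opt, batches, device)
acc_value = accuracy(model_sketch, batches, device)
print(loss_value, acc_value)
```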

### Inference

In [None]:
from inference import infer, plot_attention
import numpy as np

In [None]:
# Pick a random test sample of at most 20 characters (short texts are easier to plot).
sample = np.random.choice(test_ds)
while len(sample["text"]) > 20:
    sample = np.random.choice(test_ds)

print(sample)

inp = sample["text"]
inputs = tokenizer(inp)
attn, prob = infer(model, inputs.to(device))
# Attention map of the first example and the first head.
attn_prob = attn[0, 0, :, :].cpu().numpy()
tokens = tokenizer.tokenize(inp)
tokens, prob

In [None]:
plot_attention(attn_prob, tokens, tokens)

In [None]:
tokenizer.get_freq_of("不")