# 8 - Transformer
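Two years have passed since [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) was published ... [and everyone's enthusiasm for publishing papers only keeps growing :) there is no way to keep up]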

<img src='images/memes/new_models.jpg'>

## Theory
Whatever else one says, **`Attention`** is a very important milestone: a real innovation in the **structure** of deep learning models.

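In the previous two sections, our classification models consisted of a single CNN/RNN [the RNN section is still missing], because their output is very simple (a class label). For slightly more complex problems, however, such as natural language translation/generation or question answering (**seq2seq**), producing the whole output in one shot falls short.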

**Before** Attention, the most common structure was the `encoder-decoder` pattern.

<img src='images/encoder_decoder.png'>

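The encoder and the decoder can each be swapped for different models, such as LSTM, GRU, or CNN. The key is the `context vector` that connects them (labelled context vector in the figure above): as the encoder's output it distills as much information from the input sequence as possible, and it then serves as the decoder's input for generating the new output sequence.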

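This is clearly a single `linear structure`. The context vector does capture information from every element of the input sequence, but everything is stewed together in one pot. And as the output sequence grows longer, the meaning it carries keeps fading (think of the telephone game). Since the elements of the two sequences in a seq2seq problem often correspond one-to-one, could the elements talk to each other point to point instead of relying on information being handed down step by step?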

The Attention mechanism solves exactly this problem. Let us first look at its simplest application.

<img src='images/attention.png'>

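In the figure above, the encoder part (i.e. how the input sequence is processed) is unchanged. In the decoder, however, every unit has its own context vector c: besides taking in the output of the previous unit, it also includes a combination of the outputs (h) of all encoder units. The decisive factor in this combination is the weight (alpha) assigned to each element.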

Take translating "I love you" into "我爱你" as an example: ideally, for the **unit** whose output is "我", the weight given to "I" in its input should be the largest.

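For the detailed computation, see this figure ([source](https://blog.csdn.net/songbinxu/article/details/80739447)). Computing the weights amounts to measuring the similarity between the state h in the encoder and the state c in the decoder, which corresponds to the $f_{att}$ function followed by the $softmax$ function in the figure.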

<img src='images/attention_machanism.jpeg'>

In short, Attention is really a matter of assigning weights, i.e.:

$$c_{我}=\alpha_{1}*h_{I} + \alpha_{2}*h_{love} + \alpha_{3}*h_{you}$$
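
The weighted sum above can be sketched in a few lines of plain Python. This is only an illustration: the 2-d vectors are made-up numbers, the decoder state `query_wo` is hypothetical, and a plain dot product stands in for the similarity function (the figure does not pin down which function is used).

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product, used here as an assumed similarity function."""
    return sum(a * b for a, b in zip(u, v))

def attend(query, encoder_states):
    """Return (weights, context vector) for a single decoder step."""
    scores = [dot(query, h) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states))
               for d in range(dim)]
    return alphas, context

# Made-up encoder outputs for "I", "love", "you"
h_I, h_love, h_you = [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]
# Hypothetical decoder state while producing "我"
query_wo = [2.0, 0.0]

alphas, c_wo = attend(query_wo, [h_I, h_love, h_you])
print(alphas)  # the first weight (for "I") is the largest
print(c_wo)
```

Because the query points in the same direction as `h_I`, most of the weight goes to "I", which is the behaviour described for the "我" unit.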

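Another common way to think about it is as a **key-value lookup**: $c_{i}$ is treated as the query and $h_{i}$ as the key, each key being tied to some $value_{i}$ (which can be the key itself). $f(c_{i},h_{i})$ gives the weight $\alpha_{i}$, which is multiplied by the corresponding value, and the products are summed. Written as a formula: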

$$Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_{k}}})V$$
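
To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention on nested lists. The matrices `Q`, `K`, `V` are toy values chosen for illustration; a real implementation would use batched tensor operations instead of loops.

```python
import math

def softmax(row):
    """Softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # One row of QK^T / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(d_v)])
    return output

# Toy inputs (illustrative values only)
Q = [[1.0, 0.0], [0.0, 0.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

result = scaled_dot_product_attention(Q, K, V)
print(result)
```

The first query matches the first key strongly, so its output stays close to the first value row; the second query is all zeros, scores every key equally, and returns the plain average of the value rows.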

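Finally, let us look at the Transformer model from Google's paper. At first glance I hardly dared to believe this was the model I had just learned about, but on closer analysis its core is the `Attention model in the red box` (matching input to output). The yellow boxes hold two `self-Attention` blocks (reportedly used to extract long- and short-range dependencies within a sentence; I did not dig into this and simply treat them as a preprocessing step). The remaining two feed-forward networks can be any feed-forward model and further consolidate the data (patching things up so the structure looks nicer?).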

<img src='images/transformer.png'>

---
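## Code

That wraps up the theory for now. The code is much simpler than the theory (because ready-made models exist :D).

Here we use [BERT](https://arxiv.org/pdf/1810.04805.pdf) (Bidirectional Encoder Representations from Transformers).

Note that if you use a pre-trained model you have to go all the way, including the data preprocessing.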

In [1]:
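import os
import sys
import random
import pickle
import traceback

# Legacy torchtext API (moved to torchtext.legacy in 0.9 and removed in 0.12)
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset, BucketIterator
from transformers import BertTokenizer, BertModel

import torch
import torch.nn as nn
import torch.nn.functional as F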

### model
Our model: the original embedding layer is replaced by a pre-trained BERT model, whose output feeds the GRU that follows, and a fully connected layer comes last.

That is: input -> BERT -> GRU -> dropout -> FC -> output

In [2]:
class BERT_GRU(nn.Module):
    def __init__(self,BERT,config):
        super().__init__()

        self.bert = BERT
        # Size of the BERT output vectors
        embedding_dim = BERT.config.to_dict()['hidden_size']

        # GRU on top of the BERT output
        self.gru = nn.GRU(embedding_dim,
                          config.hidden_dim,
                          num_layers = config.layers_num,
                          bidirectional = config.bidirectional,
                          batch_first = True,
                          dropout = 0 if config.layers_num < 2 else config.dropout)

        self.dropout = nn.Dropout(config.dropout)
        self.fc = nn.Linear(config.hidden_dim * 2 if config.bidirectional else config.hidden_dim, config.class_num)


    def forward(self, x):
        # x -> (batch_size, seq_length)
        # The pass through BERT is excluded from backpropagation
        with torch.no_grad():
            embedded = self.bert(x)[0]
        # embedded -> (batch_size, seq_length, embed_dim)

        _, hidden = self.gru(embedded)
        # hidden -> (layers_num * directions_num, batch_size, hidden_dim)

        # dropout; the bidirectional case concatenates the last forward and backward states
        if self.gru.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        # hidden -> (batch_size, hidden_dim * directions_num)

        output = self.fc(hidden)
        # output -> (batch_size, class_num)

        return output

### train/evaluate/predict/save
This part is again much the same as before.

In [3]:
def train(model, train_iter, test_iter, config):
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)
    model.train()

    step = 0
    best_acc = 0
    last_step = 0
    for epoch in range(1,config.epochs+1):
        stop_flag = 0
        for batch in train_iter:
            step += 1
            # LabelField indices already start at 0, so no shift is needed
            feature, target = batch.text, batch.label

            # Equivalent to model.forward(feature) -> (batch_size, C)
            logit = model(feature)

            loss = F.cross_entropy(logit,target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # Progress report
            if step % config.log_interval == 0:
                # torch.max(logit,1)[1] is the argmax over dim 1, i.e. the most probable class
                # view works like reshape
                corrects = (torch.max(logit,1)[1].view(target.size()) == target).sum().item()
                accuracy = 100.0 * corrects/batch.batch_size
                # '\r' overwrites the previous output line
                sys.stdout.write(f"\rStep[{step}] - loss: {loss.item():.6f} "
                                 f"acc: {accuracy:.4f}%({corrects}/{batch.batch_size})")

            # Periodic evaluation, checkpointing, and early-stopping check
            if step % config.test_interval == 0:
                test_acc = evaluate(model,test_iter)
                model.train() # evaluate() switched the model to eval mode
                if test_acc > best_acc:
                    best_acc = test_acc
                    last_step = step
                    save(model, config.save_dir, 'best', step)
                else:
                    if step - last_step >= config.early_stop:
                        print('early stop by {} steps.'.format(config.early_stop))
                        stop_flag = 1
                        break
        if stop_flag: break

In [4]:
def evaluate(model,data_iter):
    # Sets self.training to False
    model.eval()

    corrects, avg_loss = 0,0
    for batch in data_iter: # losses are summed over all batches and averaged at the end
        feature, target = batch.text, batch.label

        with torch.no_grad(): # no gradients are needed for evaluation
            logit = model(feature)
        # reduction='sum' means the loss is not averaged within each batch
        loss = F.cross_entropy(logit,target,reduction='sum')

        avg_loss += loss.item()
        corrects += (torch.max(logit,1)[1].view(target.size())
                    == target).sum().item()

    size = len(data_iter.dataset)
    avg_loss /= size
    accuracy = 100.0 * corrects/size
    print(f"\nEvaluation - loss: {avg_loss:.6f} acc: {accuracy:.4f}%({corrects}/{size})\n")
    return accuracy
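
Why sum the loss per batch and divide by the dataset size at the end, instead of averaging the per-batch means? A small sketch with made-up per-example losses shows the difference when the last batch is smaller than the others (the helper names are hypothetical and only serve this illustration).

```python
def mean_of_batch_means(batches):
    """Average the per-batch mean losses (biased when batch sizes differ)."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def sum_then_divide(batches):
    """Sum every per-example loss, then divide by the number of examples."""
    total = sum(sum(b) for b in batches)
    count = sum(len(b) for b in batches)
    return total / count

# Made-up per-example losses: one full batch of 4 and a final batch of 1
batches = [[1.0, 1.0, 1.0, 1.0], [3.0]]

biased = mean_of_batch_means(batches)
exact = sum_then_divide(batches)
print(biased, exact)
```

The single example in the short batch counts as much as the four examples in the full batch under the first scheme, while the second scheme gives the true per-example average.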

In [5]:
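def predict(text,model,text_field,label_field):
    assert isinstance(text,str), "plz use str object as input."
    # Sets self.training to False
    model.eval()

    # Roughly equivalent to building a single example
    ids = text_field.preprocess(text)
    # [CLS]/[SEP] are only added by the Field when batching, so add them here
    ids = [text_field.init_token] + ids + [text_field.eos_token]
    x = torch.tensor(ids).unsqueeze(0) # -> (batch_size,seq_length), batch_size=1

    with torch.no_grad():
        output = model(x)
    _, pred = torch.max(output,1)
    # LabelField indices start at 0, so the prediction maps directly to itos
    print(label_field.vocab.itos[pred.item()])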

In [6]:
def save(model,save_dir,save_prefix,steps):
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    save_path = f"{save_dir}/{save_prefix}_steps_{steps}.pt"
    torch.save(model.state_dict(),save_path)

### config

In [7]:
class Config:
    bidirectional = True # de:True
    layers_num = 2 # de:2
    class_num = 2 # de:2
    hidden_dim = 128 # de:128
    dropout = 0.5 #de:0.5
    
    # training
    lr = 0.001 # de:0.001
    epochs = 5 # de:5
    batch_size = 96 # de:64
    log_interval = 1 # de:1
    test_interval = 100 # de:100
    early_stop = 1000 # de:1000
    save_dir = 'model/BERT_GRU'

### main
Note that the data preprocessing here must match that of the pre-trained model.
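
As a rough picture of what "matching" means for BERT: each sequence is truncated so that two slots remain for `[CLS]` and `[SEP]`, wrapped in those two tokens, and padded with `[PAD]`. The sketch below uses the ids that the cell below prints for `bert-base-uncased` (101, 102, 0) and the limit of 512; the token ids 11, 12, 13 and both helper names are made up for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids printed for bert-base-uncased
MAX_LEN = 512                         # maximum input length printed below

def build_input_ids(token_ids, max_len=MAX_LEN):
    """Truncate so [CLS] and [SEP] still fit, then wrap the sequence."""
    body = token_ids[:max_len - 2]
    return [CLS_ID] + body + [SEP_ID]

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(s) for s in sequences)
    return [s + [pad_id] * (longest - len(s)) for s in sequences]

# Two made-up sentences, already converted to (hypothetical) token ids
batch = pad_batch([build_input_ids([11, 12]), build_input_ids([13])])
print(batch)

# An over-long sequence ends up at exactly MAX_LEN ids
long_ids = build_input_ids(list(range(1000, 2000)))
print(len(long_ids))
```

In the notebook itself the truncation happens in `my_tokenize`, while the wrapping and padding are delegated to the `Field` via `init_token`, `eos_token`, and `pad_token`.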

In [8]:
### Data ###
config = Config()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# These special tokens need explicit handling -> [CLS],[SEP],[PAD],[UNK]
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id
print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
print(max_input_length)

def my_tokenize(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens

text_field = Field(batch_first = True,
                  use_vocab = False,
                  tokenize = my_tokenize,
                  preprocessing = tokenizer.convert_tokens_to_ids, # replaces text_field.build_vocab
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

# cross_entropy expects integer class indices, hence torch.long
label_field = LabelField(dtype = torch.long)

data_path = 'processed_data'
train_dataset, test_dataset = TabularDataset.splits(
        path=data_path, format='csv', skip_header=True,
        train='best3_train.csv', test='best3_test.csv',
        fields=[('text',text_field),('label',label_field)]) # must follow the column order

# random.seed() returns None, so seed first and pass the resulting state
random.seed(888)
train_dataset, valid_dataset = train_dataset.split(random_state = random.getstate())
print(f"Number of training examples: {len(train_dataset)}")
print(f"Number of validation examples: {len(valid_dataset)}")
print(f"Number of testing examples: {len(test_dataset)}")

label_field.build_vocab(train_dataset)
with open('model/label_field.txt','wb') as f:
    pickle.dump(label_field, f)
print(label_field.vocab.stoi)

train_iter, valid_iter, test_iter = BucketIterator.splits(
                            (train_dataset, valid_dataset, test_dataset), batch_size = config.batch_size)

101 102 0 100
512
Number of training examples: 8796
Number of validation examples: 3769
Number of testing examples: 5386
defaultdict(None, {'KIRK': 0, 'SPOCK': 1, 'MCCOY': 2})


Once, again, I, got, stuck, on, the, download.

In [13]:
### Model ###
# Build the model
# LabelField has no <unk> entry, so the vocab size equals the number of classes
config.class_num = len(label_field.vocab)
BERT = BertModel.from_pretrained('bert-base-uncased')

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/Users/inkding/anaconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 415, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location='cpu')
  File "/Users/inkding/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/Users/inkding/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 620, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 9352645 more bytes. The file might be corrupted.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/inkding/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-c121670163ba>", line 

TypeError: can only concatenate str (not "list") to str

In [1]:
bert_gru = BERT_GRU(BERT,config)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
def freeze_bert_params(model):
    for name, param in model.named_parameters():                
        if name.startswith('bert'):
            param.requires_grad = False

print(f'The model has {count_parameters(bert_gru):,} parameters all together.')
freeze_bert_params(bert_gru)
print(f'The model has {count_parameters(bert_gru):,} trainable parameters.')

In [None]:
# Train the model
# Remove checkpoints left over from previous runs
os.makedirs(config.save_dir, exist_ok=True)
for file in os.listdir(config.save_dir):
    if file.endswith('.pt'):
        os.remove(os.path.join(config.save_dir,file))
    
try:
    train(bert_gru,train_iter,valid_iter,config)
except KeyboardInterrupt:
    print('\n'+'-'*88)
    print("interrupted by keyboard, stop training...")

In [None]:
# Pick the best model: the most recently saved checkpoint in save_dir
models_path = [os.path.join(config.save_dir,x) for x in os.listdir(config.save_dir) if x.endswith('.pt')]
models_path = sorted(models_path, key=os.path.getmtime)
best_model_path = models_path[-1]
# The model class must be the same as the one used when saving
bert_gru = BERT_GRU(BERT,config)
state_dict = torch.load(best_model_path)
bert_gru.load_state_dict(state_dict)

# Evaluate the model
try:
    evaluate(bert_gru,test_iter)
except Exception:
    print(traceback.format_exc())

In [None]:
# Prediction -> interactive
with open('model/label_field.txt','rb') as f:
    label_field = pickle.load(f)
while True:
    try:
        text = input("Plz enter a sentence for prediction:\n")
        predict(text,bert_gru,text_field,label_field)
        print()
    except KeyboardInterrupt:
        print('Exiting...')
        break