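## Attention Mechanism
- People naturally weight some parts of a scene more than others. When looking at a Google web page, for example, the eye tends to go first to the content in the upper-left corner. Applying that idea to weighting gives the attention mechanism.
- It is widely used in both NLP and CV, for example the QKV self-attention in Transformers.
- Note that attention is essentially a weighted average, but a `dynamic` one. An ordinary mean gives every sample the same weight, and other conventional weighting schemes fix the weights in advance; these are `static`. A dynamic weighted average computes its weights from the input, so they change from sample to sample. As a concrete case: when we see a picture of a dog, we focus on where the dog is rather than on the rest of the image, and since the dog does not always appear at the same position in the picture, the weighting has to be dynamic.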
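
The static versus dynamic distinction above can be sketched with a few lines of NumPy. This is only an illustration under stated assumptions: the numbers are made up, and the dynamic weights come from a plain softmax of the values themselves, whereas a real attention layer would compute its scores with learned parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Two samples with three values each (made-up numbers for illustration)
x = np.array([[1.0, 2.0, 3.0],
              [3.0, 0.0, 0.0]])

# Static: the same fixed weights are applied to every sample
static_w = np.full(3, 1.0 / 3.0)
static_avg = x @ static_w

# Dynamic: the weights are computed from each sample, so they differ per row
dynamic_w = softmax(x)
dynamic_avg = np.sum(dynamic_w * x, axis=-1)

print(static_avg)
print(dynamic_w)
print(dynamic_avg)
```

In the second sample the dynamic weights concentrate on the first position, so the dynamic average is pulled toward 3, while the static mean stays at 1.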

In [3]:
import tensorflow as tf
from tensorflow.keras.layers import Layer
from tensorflow import keras

In [4]:
# class AttentionLayer(Layer):
#     """
#     Incorrect version, kept for reference: e has shape (None, 1), so the
#     softmax over axis=1 always returns 1 and the inputs pass through unchanged.
#     """
#     def __init__(self, **kwargs):
#         super(AttentionLayer, self).__init__(**kwargs)
#
#     def build(self, input_shape):
#         # Define the learnable weight parameter
#         self.w = self.add_weight(shape=(input_shape[-1], 1),
#                                  initializer='random_normal',
#                                  trainable=True)
#         super(AttentionLayer, self).build(input_shape)
#
#     def call(self, inputs):
#         # Compute the weighted sum
#         e = tf.keras.backend.dot(inputs, self.w)  # (None, 1)
#         a = tf.keras.backend.softmax(e, axis=1)   # (None, 1)
#         output = inputs * a                       # (None, 96)
#         print(e.shape, a.shape, output.shape)
#         return output

    
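class AttentionLayer2(Layer):
    """Dense layer with a softmax activation; its output is used as the attention weights."""
    def __init__(self, units=None, **kwargs):
        super(AttentionLayer2, self).__init__(**kwargs)
        self.dense = keras.layers.Dense(units=units, activation='softmax')

    def call(self, x):
        return self.dense(x)  # the output is the attention weights (each row sums to 1)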
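
Why the commented-out `AttentionLayer` fails can be reproduced outside Keras. The sketch below mirrors its `call` method in NumPy; the random `inputs` and `w` arrays are stand-ins chosen for illustration, not values from the notebook.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 96))   # stand-in for a batch of 4 feature vectors
w = rng.normal(size=(96, 1))        # stand-in for the learnable weight

e = inputs @ w                      # (4, 1): a single score per sample
a = softmax(e, axis=1)              # softmax over a single entry is always 1
output = inputs * a                 # therefore identical to inputs

print(e.shape, a.shape, output.shape)
print(np.allclose(a, 1.0), np.allclose(output, inputs))
```

Because the softmax axis has length 1, every weight equals 1 and the layer leaves its input unchanged. `AttentionLayer2` avoids this by producing one score per feature, so the softmax is taken over 96 entries.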

In [5]:
from tensorflow.keras.layers import Input, Dense, Concatenate

# Inputs
input1 = Input(shape=(10,))
input2 = Input(shape=(20,))

# Functional model
x1 = Dense(32, activation='relu')(input1)
x2 = Dense(64, activation='relu')(input2)
x = Concatenate()([x1, x2])                  # (None, 96)
# x = AttentionLayer()(x)
alpha = AttentionLayer2(units=x1.shape[-1] + x2.shape[-1], name='attention')(x)   # attention weights, (None, 96)
x = keras.layers.Dot(axes=1)([x, alpha])     # weighted sum of the features, (None, 1)
output = Dense(1, activation='sigmoid')(x)

# Build and compile
model = tf.keras.models.Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')
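
What the `Dot(axes=1)` step does to two `(batch, 96)` tensors can be checked with a NumPy equivalent. This is a sketch with random stand-in arrays (`x`, `logits`) rather than the model's actual activations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 96))          # stand-in for the concatenated features
logits = rng.normal(size=(5, 96))     # stand-in for the Dense pre-activations

# Softmax over the feature axis gives one weight per feature, per sample
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
alpha = exp_logits / exp_logits.sum(axis=1, keepdims=True)

# Per-sample dot product: multiply element-wise, then sum over the features
weighted_sum = np.sum(x * alpha, axis=1, keepdims=True)    # (5, 1)

# The same reduction written as an einsum, for comparison
via_einsum = np.einsum('bi,bi->b', x, alpha)[:, None]

print(weighted_sum.shape)
print(np.allclose(weighted_sum, via_einsum))
```

Each sample is reduced to a single number, the weighted average of its 96 features, which is what the final `Dense(1)` layer receives. If the goal were to reweight the features while keeping all 96 of them, an element-wise product would be the alternative; that would be a different design from the one used here.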


In [6]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 10)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 20)]         0           []                               
                                                                                                  
 dense (Dense)                  (None, 32)           352         ['input_1[0][0]']                
                                                                                                  
 dense_1 (Dense)                (None, 64)           1344        ['input_2[0][0]']                
                                                                                              

In [7]:

inputs = model.inputs
alpha_output = model.get_layer(name='attention').output


new_model = keras.Model(inputs=inputs, outputs=alpha_output)

In [8]:
new_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 10)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 20)]         0           []                               
                                                                                                  
 dense (Dense)                  (None, 32)           352         ['input_1[0][0]']                
                                                                                                  
 dense_1 (Dense)                (None, 64)           1344        ['input_2[0][0]']                
                                                                                            

In [9]:
import numpy as np

x_inputs = np.random.normal(size=(10000, 10))
x_input2 = np.random.normal(size=(10000, 20))

o = new_model([x_inputs, x_input2])
print(o.shape)
print(o)

(10000, 96)
tf.Tensor(
[[0.00800204 0.00719401 0.0110973  ... 0.01759386 0.00772979 0.00537172]
 [0.00269672 0.0310102  0.00881487 ... 0.01436653 0.00736911 0.0120503 ]
 [0.00683023 0.00996894 0.00976479 ... 0.03163126 0.00387621 0.01836039]
 ...
 [0.006447   0.00826287 0.0176962  ... 0.01358038 0.00488194 0.00829465]
 [0.0054411  0.00646757 0.01681057 ... 0.01833588 0.00559796 0.02084725]
 [0.00560729 0.007547   0.00612349 ... 0.01624637 0.00376094 0.00649687]], shape=(10000, 96), dtype=float32)


In [10]:
# If the attention layer is correct, the weights of any sample must sum to 1

import random

row = random.choice(o)

print(tf.reduce_sum(row))

tf.Tensor(1.0, shape=(), dtype=float32)
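
Sampling one row is a spot check. A possible way to check every row at once is sketched below; the `weights` array is a synthetic stand-in for `o`, generated with a softmax over random logits, and with the real tensor `o.numpy()` could be used in its place.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(10000, 96))

# Synthetic stand-in for the attention output `o`
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
weights = exp_logits / exp_logits.sum(axis=1, keepdims=True)

row_sums = weights.sum(axis=1)                           # one sum per sample
all_rows_ok = bool(np.allclose(row_sums, 1.0, atol=1e-5))
all_nonneg = bool((weights >= 0).all())

print(all_rows_ok, all_nonneg)
```

A tolerance is used because float32 sums are rarely exactly 1, even when a single printed row happens to show `1.0`.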
