## Masking

### padding mask
The padding mask marks every pad token in a batch of sequences so that the model does not treat padding as input. It is 1 wherever the pad value 0 appears and 0 everywhere else.

In [3]:
import tensorflow as tf

def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    
    # add extra dimensions so that we can add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]  # [batch_size, 1, 1, seq_len]

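The two `tf.newaxis` dimensions are added for the later computation: with shape `[batch_size, 1, 1, seq_len]`, the mask can broadcast when it is added to the attention logits.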

In [6]:
x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)[0,0,0,:]

<tf.Tensor: id=33, shape=(5,), dtype=float32, numpy=array([0., 0., 1., 1., 0.], dtype=float32)>
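
To see what the mask does once it reaches the attention logits, here is a minimal pure-Python sketch (no TensorFlow). It assumes the common approach of adding `mask * -1e9` to the logits before the softmax. The helper name `masked_softmax`, the constant `-1e9`, and the example logits are illustrative assumptions, not taken from the code above.

```python
import math

def masked_softmax(logits, mask, neg=-1e9):
    """Softmax over one row of logits; mask is 1.0 at positions to ignore."""
    # Assumption: masked positions are pushed towards -infinity before the softmax.
    shifted = [logit + m * neg for logit, m in zip(logits, mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.5, 1.5]     # made-up attention logits for one query
pad_mask = [0.0, 0.0, 1.0, 1.0, 0.0]   # padding mask for the row [7, 6, 0, 0, 1]
weights = masked_softmax(logits, pad_mask)
print(weights)
```

The padded positions (indices 2 and 3) end up with a weight of essentially zero, while the remaining weights still sum to 1.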

### look-ahead mask

The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used.

This means that to predict the third word, only the first and second words are used. Similarly, to predict the fourth word, only the first, second, and third words are used, and so on.

In the decoder, this hides the later words in the sequence while the next word is being predicted.

In [7]:
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask   # (seq_len, seq_len)

In [10]:
x = tf.random.uniform((1, 3))
print(x)

tf.Tensor([[0.19278264 0.38801312 0.7854575 ]], shape=(1, 3), dtype=float32)


In [11]:
temp = create_look_ahead_mask(x.shape[1])
temp

<tf.Tensor: id=65, shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]], dtype=float32)>
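
The same matrix can be written without TensorFlow, which makes the rule explicit: entry (i, j) is 1 exactly when j > i, that is, when position j lies in the future of position i. The sketch below is plain Python; `look_ahead_mask_py` and `visible` are illustrative names, not part of the notebook.

```python
def look_ahead_mask_py(size):
    """Return a size x size list of lists with 1.0 above the main diagonal."""
    return [[1.0 if j > i else 0.0 for j in range(size)] for i in range(size)]

mask = look_ahead_mask_py(3)

# For each query position, list the key positions it is still allowed to see
# (the entries where the mask is 0).
visible = [[j for j, m in enumerate(row) if m == 0.0] for row in mask]

print(mask)
print(visible)
```

Each row of `visible` contains only the current position and those before it, which matches the description of using only earlier words.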

### tf.linalg.band_part
```python
tf.linalg.band_part(
    input,
    num_lower,
    num_upper,
    name=None
)


band[i, j, k, ..., m, n] = in_band(m, n) * input[i, j, k, ..., m, n]

in_band(m, n) = (num_lower < 0 || (m-n) <= num_lower) && (num_upper < 0 || (n-m) <= num_upper)
```

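input: the input tensor.

num_lower: the number of subdiagonals to keep below the main diagonal, i.e. the lower bandwidth. If negative, the entire lower triangle is kept.

num_upper: the number of superdiagonals to keep above the main diagonal, i.e. the upper bandwidth. If negative, the entire upper triangle is kept.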


In [18]:
input_a = tf.constant([[ 0,  1,  2, 3],
                       [-1,  0,  1, 2],
                       [-2, -1,  0, 1],
                       [-3, -2, -1, 0]], dtype=tf.float32)

In [19]:
tf.linalg.band_part(input_a, 1, -1)  # keep 1 subdiagonal and the entire upper triangle

<tf.Tensor: id=80, shape=(4, 4), dtype=float32, numpy=
array([[ 0.,  1.,  2.,  3.],
       [-1.,  0.,  1.,  2.],
       [ 0., -1.,  0.,  1.],
       [ 0.,  0., -1.,  0.]], dtype=float32)>

In [20]:
tf.linalg.band_part(input_a, 2, 1)  # keep 2 subdiagonals and 1 superdiagonal

<tf.Tensor: id=84, shape=(4, 4), dtype=float32, numpy=
array([[ 0.,  1.,  0.,  0.],
       [-1.,  0.,  1.,  0.],
       [-2., -1.,  0.,  1.],
       [ 0., -2., -1.,  0.]], dtype=float32)>
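
The `in_band` predicate can be checked by hand against both outputs above. The following is a plain-Python sketch for 2-D lists only, not how TensorFlow implements the op; `band_part_py` is a hypothetical helper name.

```python
def in_band(m, n, num_lower, num_upper):
    """True if entry (m, n) lies inside the band that is kept."""
    lower_ok = num_lower < 0 or (m - n) <= num_lower
    upper_ok = num_upper < 0 or (n - m) <= num_upper
    return lower_ok and upper_ok

def band_part_py(matrix, num_lower, num_upper):
    """Zero out every entry of a 2-D list that falls outside the band."""
    return [
        [value if in_band(m, n, num_lower, num_upper) else 0
         for n, value in enumerate(row)]
        for m, row in enumerate(matrix)
    ]

input_a = [[0, 1, 2, 3],
           [-1, 0, 1, 2],
           [-2, -1, 0, 1],
           [-3, -2, -1, 0]]

result_1 = band_part_py(input_a, 1, -1)  # 1 subdiagonal, whole upper triangle
result_2 = band_part_py(input_a, 2, 1)   # 2 subdiagonals, 1 superdiagonal

# With (-1, 0) only the lower triangle survives; the look-ahead mask is 1 minus this.
lower_triangle = band_part_py([[1] * 3 for _ in range(3)], -1, 0)

print(result_1)
print(result_2)
print(lower_triangle)
```

The last call shows why `create_look_ahead_mask` uses `num_lower=-1, num_upper=0`: that keeps the main diagonal and everything below it, so subtracting the result from 1 leaves ones only at future positions.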