## Keras - Masking & Padding, [reference](https://keras.io/guides/understanding_masking_and_padding/)
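- Masking: commonly used with time-series (sequence) data to tell sequence-processing layers that certain timesteps should be treated as missing and skipped.
- Padding: when sequences have different lengths they cannot be stacked into a single batch tensor, so shorter sequences are filled with 0 at the start (`pre`) or at the end (`post`) until all lengths match.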


## Padding

In [3]:
# Packages & frameworks

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pprint import pprint

In [4]:
# Padding Sequence data


# Raw data with varying lengths: 3, 5, 6
seq_data = [
    ["Hello", "world", "!"],
    ["How", "are", "you", "doing", "today"],
    ["The", "weather", "will", "be", "nice", "tomorrow"],
]


# Tokens are usually mapped to integer ids through a vocabulary
seq_data = [
  [71, 1331, 4231],
  [73, 8, 3215, 55, 927],
  [83, 91, 1, 645, 1253, 927],
]

# Two padding modes exist, 'pre' and 'post'; 'post' is the usual choice (the function's default is 'pre')

padding_seq_data = keras.preprocessing.sequence.pad_sequences(
    sequences=seq_data,
    maxlen=6,
    padding='post',
    value=0
)

pprint(padding_seq_data)

array([[  71, 1331, 4231,    0,    0,    0],
       [  73,    8, 3215,   55,  927,    0],
       [  83,   91,    1,  645, 1253,  927]], dtype=int32)
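The difference between `pre` and `post` padding can be sketched in plain Python. The helper `pad` below is hypothetical and written only for illustration; unlike `pad_sequences`, it does not truncate sequences that are longer than `maxlen`.

```python
def pad(sequences, maxlen, padding="post", value=0):
    """Pad each list in `sequences` with `value` up to `maxlen` items.

    Illustrative only: sequences longer than `maxlen` are left untouched.
    """
    padded = []
    for seq in sequences:
        fill = [value] * (maxlen - len(seq))
        # 'post' appends the filler, anything else prepends it.
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


seq_data = [
    [71, 1331, 4231],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

post = pad(seq_data, maxlen=6, padding="post")
pre = pad(seq_data, maxlen=6, padding="pre")

print(post[0])  # [71, 1331, 4231, 0, 0, 0]
print(pre[0])   # [0, 0, 0, 71, 1331, 4231]
```

With `post`, the real tokens stay at the start of each row, which matches the padded array printed above.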


## Masking
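- Some timesteps need to be hidden from the model, either because they are padding or because they are deliberately marked as not visible.
- According to the Keras documentation, there are three ways to introduce a mask:
    1. Add a **keras.layers.Masking** layer
    2. Configure a **keras.layers.Embedding** layer with **mask_zero=True**
    3. Pass a **mask** argument manually when calling layers that support it (e.g. RNN layers)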


### Mask-generating layers: Embedding and Masking

In [16]:
# 1. Using a Masking layer

# masking layer
masking_layer = layers.Masking()

# Simulated embedding of shape (3, 6, 5000): each id is repeated along the last axis
unmasked_embedding = tf.cast(
    tf.tile(tf.expand_dims(padding_seq_data, axis=-1), [1, 1, 5000]), tf.float32
)

# Apply the Masking layer; timesteps whose features all equal 0 (the default mask_value) are masked

masked_embedding = masking_layer(unmasked_embedding)
pprint(masked_embedding._keras_mask)

<tf.Tensor: shape=(3, 6), dtype=bool, numpy=
array([[ True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False],
       [ True,  True,  True,  True,  True,  True]])>
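The rule behind this mask can be reproduced with NumPy alone: a timestep is kept when at least one of its features differs from the mask value. The sketch below is an approximation of that rule under stated assumptions, not the layer's source code; the feature width of 4 is arbitrary and chosen only to keep the array small.

```python
import numpy as np

padded = np.array([
    [71, 1331, 4231, 0, 0, 0],
    [73, 8, 3215, 55, 927, 0],
    [83, 91, 1, 645, 1253, 927],
])

# Simulated features of shape (3, 6, 4): each id repeated along the last axis.
features = np.repeat(padded[..., np.newaxis], 4, axis=-1).astype("float32")

mask_value = 0.0
# Keep a timestep if any of its features differs from mask_value.
mask = np.any(features != mask_value, axis=-1)

print(mask.shape)  # (3, 6)
print(mask)
```

The printed boolean array should have `False` exactly where the padded ids are 0, the same pattern as the `_keras_mask` shown above.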


In [14]:
# 2. Using an Embedding layer

embedding = layers.Embedding(
    input_dim=5000,
    output_dim=16,
    mask_zero=True,     # index 0 is reserved for padding and masked; it is not treated as a word
)
masked_output = embedding(padding_seq_data)

pprint(masked_output._keras_mask)       # mask output

<tf.Tensor: shape=(3, 6), dtype=bool, numpy=
array([[ True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False],
       [ True,  True,  True,  True,  True,  True]])>
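Because `mask_zero=True` reserves index 0, the vocabulary has to start at 1 and `input_dim` has to be the vocabulary size plus one. The following plain-Python sketch shows one way to arrange that; `build_vocab` and `PAD_ID` are hypothetical names introduced here for illustration.

```python
PAD_ID = 0  # reserved for padding, never assigned to a real token


def build_vocab(sentences):
    """Assign ids starting at 1 so that 0 stays free for padding."""
    vocab = {}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


sentences = [
    ["Hello", "world", "!"],
    ["How", "are", "you", "doing", "today"],
]
vocab = build_vocab(sentences)
maxlen = max(len(s) for s in sentences)

# Map tokens to ids and post-pad with PAD_ID.
ids = [[vocab[t] for t in s] + [PAD_ID] * (maxlen - len(s)) for s in sentences]
# The mask is simply "id is not the padding id".
mask = [[i != PAD_ID for i in row] for row in ids]
input_dim = len(vocab) + 1  # vocabulary size plus the reserved padding index

print(ids[0])     # [1, 2, 3, 0, 0]
print(mask[0])    # [True, True, True, False, False]
print(input_dim)  # 9
```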


### Masks propagate automatically in the Functional and Sequential APIs


In [18]:
# Sequential API

model = keras.Sequential(
    [layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True), layers.LSTM(32),]
)

In [19]:
# functional API

inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)
outputs = layers.LSTM(32)(x)

model = keras.Model(inputs, outputs)

In [25]:
# 3. Passing mask tensors directly to layers

class MyLayer(layers.Layer):
    def __init__(self, **kwargs):
        super(MyLayer, self).__init__(**kwargs)
        self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
        self.lstm = layers.LSTM(32)

    def call(self, inputs):
        x = self.embedding(inputs)
        # Note that you could also prepare a `mask` tensor manually.
        # It only needs to be a boolean tensor
        # with the right shape, i.e. (batch_size, timesteps).
        mask = self.embedding.compute_mask(inputs)
        output = self.lstm(x, mask=mask)  # The layer will ignore the masked values
        return output


layer = MyLayer()
x = np.random.random((32, 10)) * 100
x = x.astype("int32")
layer(x)

<tf.Tensor: shape=(32, 32), dtype=float32, numpy=
array([[-1.1599111e-04,  2.7151057e-03,  1.0948180e-03, ...,
         7.7614915e-03,  3.8981072e-03,  2.2541845e-04],
       [ 7.2514603e-04, -6.2782662e-03,  4.8193010e-03, ...,
         4.9857772e-03,  3.5195216e-03,  3.1076264e-04],
       [ 3.7625411e-03, -6.4970902e-04,  8.3045280e-03, ...,
        -6.3242915e-04, -3.6111430e-06, -6.5170531e-03],
       ...,
       [-1.0724877e-03,  6.2666158e-03,  1.2149852e-03, ...,
        -2.4524720e-03,  3.8965035e-03,  1.9770663e-03],
       [-7.5531029e-04, -6.1248569e-03, -1.1289645e-05, ...,
        -5.8271061e-04,  1.9159619e-03,  5.1142424e-03],
       [ 2.8349624e-03, -6.6133598e-03, -2.9771908e-03, ...,
         4.5597004e-03,  4.4241026e-03, -2.3988767e-03]], dtype=float32)>
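What "ignoring masked values" means for a recurrent computation can be illustrated with a toy recurrence. This is only a conceptual sketch under the assumption that a masked step leaves the running state unchanged; it is not how `LSTM` is implemented, and `masked_running_sum` is a hypothetical helper.

```python
def masked_running_sum(values, mask):
    """Toy recurrence: state += value, but masked steps keep the old state."""
    state = 0
    states = []
    for value, keep in zip(values, mask):
        if keep:
            state = state + value
        # When keep is False the previous state is carried over unchanged.
        states.append(state)
    return states


values = [3, 5, 2, 9, 9, 9]
mask = [True, True, True, False, False, False]

print(masked_running_sum(values, mask))  # [3, 8, 10, 10, 10, 10]
```

The last three values never influence the result, which is the effect one wants for padded timesteps.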

## Supporting masking in your custom layers
- Sometimes a layer needs to modify the current mask or generate a new one; this can be done by overriding `layer.compute_mask()`.

In [27]:
masked_embedding

<tf.Tensor: shape=(3, 6, 5000), dtype=float32, numpy=
array([[[7.100e+01, 7.100e+01, 7.100e+01, ..., 7.100e+01, 7.100e+01,
         7.100e+01],
        [1.331e+03, 1.331e+03, 1.331e+03, ..., 1.331e+03, 1.331e+03,
         1.331e+03],
        [4.231e+03, 4.231e+03, 4.231e+03, ..., 4.231e+03, 4.231e+03,
         4.231e+03],
        [0.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 0.000e+00,
         0.000e+00],
        [0.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 0.000e+00,
         0.000e+00],
        [0.000e+00, 0.000e+00, 0.000e+00, ..., 0.000e+00, 0.000e+00,
         0.000e+00]],

       [[7.300e+01, 7.300e+01, 7.300e+01, ..., 7.300e+01, 7.300e+01,
         7.300e+01],
        [8.000e+00, 8.000e+00, 8.000e+00, ..., 8.000e+00, 8.000e+00,
         8.000e+00],
        [3.215e+03, 3.215e+03, 3.215e+03, ..., 3.215e+03, 3.215e+03,
         3.215e+03],
        [5.500e+01, 5.500e+01, 5.500e+01, ..., 5.500e+01, 5.500e+01,
         5.500e+01],
        [9.270e+02, 9.270e+02, 9.270e+02, ..

In [29]:
pprint(masked_embedding._keras_mask)

<tf.Tensor: shape=(3, 6), dtype=bool, numpy=
array([[ True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False],
       [ True,  True,  True,  True,  True,  True]])>


In [28]:
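# Example

class TemporalSplit(layers.Layer):
    """Split the input tensor into 2 tensors along the time dimension."""

    def call(self, inputs):
        # The time dimension (axis 1) must be evenly divisible by 2.
        return tf.split(
            value=inputs,
            num_or_size_splits=2,       # an int splits `value` into that many equal parts along `axis`
            axis=1,
        )

    # Override compute_mask so that each output half carries its own half of the mask.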
    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        return tf.split(mask, 2, axis=1)



# masked_embedding has shape (3, 6, 5000) == (batch, timesteps, features)
f, s = TemporalSplit()(masked_embedding)
pprint(f._keras_mask)
pprint(s._keras_mask)

<tf.Tensor: shape=(3, 3), dtype=bool, numpy=
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])>
<tf.Tensor: shape=(3, 3), dtype=bool, numpy=
array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True]])>
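The two masks above are just the original `(3, 6)` mask cut in half along the time axis. A NumPy-only sketch of the same split, using the mask values printed earlier, makes that easy to verify without building a layer:

```python
import numpy as np

mask = np.array([
    [True, True, True, False, False, False],
    [True, True, True, True, True, False],
    [True, True, True, True, True, True],
])

# Split into two equal halves along axis 1 (the time axis).
first, second = np.split(mask, 2, axis=1)

print(first.shape, second.shape)  # (3, 3) (3, 3)
print(first)
print(second)
```

The first half contains only real timesteps, while the second half holds all of the padding positions, which is why only `s._keras_mask` contains `False` entries.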
