<a href="https://colab.research.google.com/github/0navarro/ml_portfolio/blob/main/MultiSegmentPacker_layer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MultiSegmentPacker Layer

The MultiSegmentPacker layer from Keras NLP combines a group of sequences into a single input suitable for models such as BERT. In this notebook I explain the different arguments of the layer.

Install required packages

In [None]:
!pip install -q --upgrade keras-nlp
!pip install -q --upgrade keras  # Upgrade to Keras 3.


Import Keras NLP package

In [None]:
import keras_nlp

The MultiSegmentPacker takes a `sequence_length` argument, which sets the length of the output tensor. We also define a `start_value` and an `end_value`, which appear at the beginning and at the end of the packed sequence, respectively. We can define a `pad_value` as well, which fills any remaining positions up to the defined sequence length.

In [None]:
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=8,
    start_value="[START]",
    end_value="[END]",
    pad_value="[PAD]",
)

We define a test sequence and pack it with our MultiSegmentPacker layer. The layer returns two tensors: the packed sequence and a tensor of sequence ids, which maps each element of the packed sequence to the input sequence it came from. Since we have only one input sequence, all the elements have the same id.
Notice how the packed sequence starts and ends with the start and end values we defined in the layer, and that it is padded to reach the defined sequence length.

In [None]:
sequence = ["a", "b", "c", "d", "e"]
packer((sequence,))

(<tf.Tensor: shape=(8,), dtype=string, numpy=
 array([b'[START]', b'a', b'b', b'c', b'd', b'e', b'[END]', b'[PAD]'],
       dtype=object)>,
 <tf.Tensor: shape=(8,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>)
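To make the single-sequence case concrete, here is a minimal pure-Python sketch that reproduces the output above with plain lists. The helper `pack_single` is hypothetical and written only for illustration; it is not how the Keras NLP layer is implemented, and the truncation of over-long inputs is an assumption about the intended behaviour.

```python
def pack_single(tokens, sequence_length,
                start="[START]", end="[END]", pad="[PAD]"):
    """Hypothetical helper: mimic packing one sequence with plain lists."""
    # Two positions are reserved for the start and end values.
    budget = sequence_length - 2
    # Assumption: an over-long input is cut to fit the budget.
    packed = [start] + list(tokens[:budget]) + [end]
    # Fill the remaining positions with the pad value.
    packed += [pad] * (sequence_length - len(packed))
    # With a single input sequence every position gets id 0.
    sequence_ids = [0] * sequence_length
    return packed, sequence_ids


packed, sequence_ids = pack_single(["a", "b", "c", "d", "e"], sequence_length=8)
print(packed)
print(sequence_ids)
```

The five input elements plus the start and end values take seven positions, so exactly one pad value is needed to reach a length of eight.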

Now we use two input sequences to explain the last two arguments: `sep_value` and `truncate`. `sep_value` defines the value that separates the input sequences in the packed sequence. `truncate` determines how the packed sequence "budget", i.e. the positions of the sequence length not taken by the start, separator, and end values, is distributed among the input sequences. The default value, `"round_robin"`, assigns one budget element to each input sequence in turn: first one element to the first sequence, then one to the second, then one to the first again, and so on until the budget is used up.

In [None]:
sequence_1 = ["a", "b", "c", "d"]
sequence_2 = ["v", "w", "x", "y"]
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=8,
    start_value="[START]",
    end_value="[END]",
    pad_value="[PAD]",
    sep_value="[SEP]",
    truncate="round_robin"
)
packer((sequence_1, sequence_2))

(<tf.Tensor: shape=(8,), dtype=string, numpy=
 array([b'[START]', b'a', b'b', b'c', b'[SEP]', b'v', b'w', b'[END]'],
       dtype=object)>,
 <tf.Tensor: shape=(8,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 1, 1, 1], dtype=int32)>)
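The round-robin split in the output above can be checked by hand. The following is a rough sketch of the allocation in plain Python, assuming the budget is the sequence length minus the start, separator, and end values (8 - 3 = 5 here). The function `round_robin_budgets` is a hypothetical name for illustration, not part of Keras NLP.

```python
def round_robin_budgets(lengths, budget):
    """Hypothetical helper: hand out one position at a time, cycling over the inputs."""
    taken = [0] * len(lengths)
    # Keep cycling while positions remain and some input still has elements left.
    while budget > 0 and any(t < n for t, n in zip(taken, lengths)):
        for i, n in enumerate(lengths):
            if budget == 0:
                break
            if taken[i] < n:
                taken[i] += 1
                budget -= 1
    return taken


# Two inputs of four elements each, five positions available for them.
budgets = round_robin_budgets([4, 4], budget=5)
print(budgets)  # [3, 2]
```

The first sequence keeps three elements (`a`, `b`, `c`) and the second keeps two (`v`, `w`), which matches the packed output.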

`truncate` also accepts the value `"waterfall"`, which assigns the sequence length budget from left to right: the first input sequence is packed in full before the second one receives any budget. Here the second input sequence gets a single element, because the first sequence uses four of the five positions left after the start, separator, and end values.

Finally, notice the output sequence ids. In this case six elements belong to sequence 0 and two elements belong to sequence 1.

In [None]:
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=8,
    start_value="[START]",
    end_value="[END]",
    pad_value="[PAD]",
    sep_value="[SEP]",
    truncate="waterfall"
)
packer((sequence_1, sequence_2))

(<tf.Tensor: shape=(8,), dtype=string, numpy=
 array([b'[START]', b'a', b'b', b'c', b'd', b'[SEP]', b'v', b'[END]'],
       dtype=object)>,
 <tf.Tensor: shape=(8,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 1, 1], dtype=int32)>)
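As a cross-check of the waterfall output above, here is a small pure-Python sketch that allocates the budget from left to right and then assembles the packed sequence and its sequence ids. Both `waterfall_budgets` and `assemble` are hypothetical helpers written for this notebook, not Keras NLP functions. The sketch assumes that each separator is counted with the sequence before it, that the end value is counted with the last sequence, and that padding gets id 0, as the outputs in this notebook suggest.

```python
def waterfall_budgets(lengths, budget):
    """Hypothetical helper: fill each input in order until the budget runs out."""
    taken = []
    for n in lengths:
        use = min(n, budget)
        taken.append(use)
        budget -= use
    return taken


def assemble(segments, budgets, sequence_length,
             start="[START]", sep="[SEP]", end="[END]", pad="[PAD]"):
    """Hypothetical helper: build the packed list and the matching sequence ids."""
    packed, ids = [start], [0]
    last = len(segments) - 1
    for i, (segment, k) in enumerate(zip(segments, budgets)):
        # Every input is closed by a separator, except the last one,
        # which is closed by the end value.
        closer = end if i == last else sep
        piece = list(segment[:k]) + [closer]
        packed += piece
        ids += [i] * len(piece)
    # Assumption: padding positions carry id 0.
    n_pad = sequence_length - len(packed)
    return packed + [pad] * n_pad, ids + [0] * n_pad


sequence_1 = ["a", "b", "c", "d"]
sequence_2 = ["v", "w", "x", "y"]
# sequence_length=8 minus [START], [SEP], [END] leaves 5 positions.
budgets = waterfall_budgets([len(sequence_1), len(sequence_2)], budget=5)
packed, sequence_ids = assemble((sequence_1, sequence_2), budgets, sequence_length=8)
print(budgets)
print(packed)
print(sequence_ids)
```

Passing the budgets `[3, 2]` to `assemble` instead gives the `"round_robin"` result, so the two truncation modes differ only in how the budget is split.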