# Overview

Text preprocessing is the end-to-end transformation of raw text into a model’s integer inputs. NLP models are often accompanied by several hundred (if not thousands of) lines of Python code for preprocessing text. Text preprocessing is often a challenge for models because of:

*      Training-serving skew
*      Efficiency and flexibility
*      Complex model interface


# Text preprocessing with *TF.Text*

Using TF.Text's text preprocessing APIs, we can construct a preprocessing function that transforms a user's text dataset into the model's integer inputs. Users can package preprocessing directly as part of their model to alleviate the above-mentioned problems.

This tutorial shows how to use TF.Text preprocessing ops to transform text data into inputs for the BERT model and inputs for the masked language modeling pretraining task described in the "Masked LM and Masking Procedure" section of *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. The process involves tokenizing text into subword units, combining sentences, trimming content to a fixed size, and extracting labels for the masked language modeling task.

## Setup

In [None]:
pip install -q -U "tensorflow-text==2.8.*"

In [None]:
import tensorflow as tf
import tensorflow_text as text

import functools

Our data contains two text features, from which we can create an example tf.data.Dataset. Our goal is to create a function that we can pass to Dataset.map() for use in training.

In [None]:
examples = {
    "text_a": [
        "Sponge bob Squarepants is an Avenger",
        "Marvel Avengers",
    ],
    "text_b": [
        "Barack Obama is the President.",
        "President is the highest office",
    ],
}

In [None]:
dataset = tf.data.Dataset.from_tensor_slices(examples)
next(iter(dataset))

## Tokenizing

Our first step is to run any string preprocessing and tokenize our dataset. This can be done using the text.BertTokenizer, which is a *text.Splitter* that can tokenize sentences into subwords or wordpieces for the BERT model given a vocabulary generated from the Wordpiece algorithm.

In [None]:
_VOCAB = [
    # Special tokens
    b"[UNK]", b"[MASK]", b"[RANDOM]", b"[CLS]", b"[SEP]",
    # Suffixes
    b"##ack", b"##ama", b"##ger", b"##gers", b"##onge", b"##pants",  b"##uare",
    b"##vel", b"##ven",
    # Whole words and word prefixes
    b"an", b"A", b"Bar", b"Hates", b"Mar", b"Ob",
    b"Patrick", b"President", b"Sp", b"Sq", b"bob", b"box", b"has", b"highest",
    b"is", b"office", b"the",
]

In [None]:
_START_TOKEN = _VOCAB.index(b"[CLS]")
_END_TOKEN = _VOCAB.index(b"[SEP]")
_MASK_TOKEN = _VOCAB.index(b"[MASK]")
_RANDOM_TOKEN = _VOCAB.index(b"[RANDOM]")
_UNK_TOKEN = _VOCAB.index(b"[UNK]")
_MAX_SEQ_LEN = 8
_MAX_PREDICTIONS_PER_BATCH = 5

_VOCAB_SIZE = len(_VOCAB)

In [None]:
lookup_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys = _VOCAB,
        key_dtype = tf.string,
        values = tf.range(
            tf.size(_VOCAB, out_type = tf.int64), dtype = tf.int64),
        value_dtype = tf.int64,
    ),
    num_oov_buckets = 1,
)

Let's construct a *text.BertTokenizer* using the above vocabulary and tokenize the text inputs into a RaggedTensor.

In [None]:
bert_tokenizer = text.BertTokenizer(lookup_table, token_out_type=tf.string)

In [None]:
bert_tokenizer.tokenize(examples["text_a"])

In [None]:
bert_tokenizer.tokenize(examples["text_b"])
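
To build intuition for how a single word is broken into wordpieces, here is a minimal pure-Python sketch of greedy longest-match-first splitting. It is not the TF.Text implementation: the function name `wordpiece_split` and the `toy_vocab` set are hypothetical, and the sketch ignores text normalization, punctuation splitting, and any per-word length limits that a real tokenizer may apply.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Split one word into wordpieces using greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the suffix marker.
                candidate = suffix_prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: a subset of the tutorial's _VOCAB, as strings.
toy_vocab = {"Sq", "##uare", "##pants", "A", "##ven", "##ger", "##gers", "is", "an"}

print(wordpiece_split("Squarepants", toy_vocab))  # ['Sq', '##uare', '##pants']
print(wordpiece_split("Avengers", toy_vocab))     # ['A', '##ven', '##gers']
print(wordpiece_split("Hulk", toy_vocab))         # ['[UNK]']
```

Because the longest candidate is tried first, "Avengers" ends in `##gers` rather than `##ger` followed by an unmatched "s".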

Text output from *text.BertTokenizer* allows us to see how the text is being tokenized, but the model requires integer IDs. We can set the token_out_type param to tf.int64 to obtain integer IDs (which are the indices into the vocabulary).

In [None]:
bert_tokenizer = text.BertTokenizer(lookup_table, token_out_type=tf.int64)

In [None]:
segment_a = bert_tokenizer.tokenize(examples["text_a"])
segment_a

In [None]:
segment_b = bert_tokenizer.tokenize(examples["text_b"])
segment_b

*text.BertTokenizer* returns a RaggedTensor with shape [batch, num_tokens, num_wordpieces]. Because we don't need the extra num_tokens dimension for our current use case, we can merge the last two dimensions to obtain a RaggedTensor with shape [batch, num_wordpieces]:

In [None]:
segment_a = segment_a.merge_dims(-2, -1)
segment_a

In [None]:
segment_b = segment_b.merge_dims(-2, -1)
segment_b
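
The effect of merging the last two dimensions can be illustrated with plain nested lists. This is only an analogy for `merge_dims(-2, -1)` on a RaggedTensor; the helper name `merge_last_two_dims` and the string wordpieces are chosen for illustration.

```python
def merge_last_two_dims(batch):
    """Flatten [batch, num_tokens, num_wordpieces] to [batch, num_wordpieces]."""
    return [[piece for token in sentence for piece in token] for sentence in batch]


# One list per sentence, one inner list per word, one string per wordpiece.
nested = [
    [["Sp", "##onge"], ["bob"], ["Sq", "##uare", "##pants"]],
    [["Mar", "##vel"], ["A", "##ven", "##gers"]],
]

flat = merge_last_two_dims(nested)
print(flat[0])  # ['Sp', '##onge', 'bob', 'Sq', '##uare', '##pants']
print(flat[1])  # ['Mar', '##vel', 'A', '##ven', '##gers']
```

The word boundaries are lost, but the order of wordpieces within each sentence is preserved, which is all the model input needs here.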

## Content Trimming

The main input to BERT is a concatenation of two sentences. However, BERT requires inputs of a fixed size and shape, and we may have content that exceeds our budget.



We can tackle this by using a *text.Trimmer* to trim our content down to a predetermined size (once concatenated along the last axis). There are different *text.Trimmer* types which select content to preserve using different algorithms. *text.RoundRobinTrimmer* for example will allocate quota equally for each segment but may trim the ends of sentences. *text.WaterfallTrimmer* will trim starting from the end of the last sentence.

For our example, we will use RoundRobinTrimmer which selects items from each segment in a left-to-right manner.

In [None]:
trimmer = text.RoundRobinTrimmer(max_seq_length = _MAX_SEQ_LEN)
trimmed = trimmer.trim([segment_a, segment_b])
trimmed

trimmed now contains the segments, cut down so that each example in the batch has at most 8 elements in total (when the segments are concatenated along axis=-1).
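
The two trimming strategies described above can be contrasted with a small pure-Python sketch that operates on one example at a time. The function names `round_robin_trim` and `waterfall_trim` are hypothetical, and the logic is a simplified reading of the behaviour described in this section rather than the library's implementation.

```python
def round_robin_trim(segments, max_seq_length):
    """Hand out the budget one item at a time, cycling over the segments."""
    kept = [0] * len(segments)
    budget = max_seq_length
    progressed = True
    while budget > 0 and progressed:
        progressed = False
        for i, segment in enumerate(segments):
            if budget > 0 and kept[i] < len(segment):
                kept[i] += 1
                budget -= 1
                progressed = True
    return [segment[:k] for segment, k in zip(segments, kept)]


def waterfall_trim(segments, max_seq_length):
    """Fill the budget segment by segment, so trimming hits the last one first."""
    trimmed = []
    budget = max_seq_length
    for segment in segments:
        keep = min(len(segment), budget)
        trimmed.append(segment[:keep])
        budget -= keep
    return trimmed


seg_a = [1, 2, 3, 4, 5, 6, 7]
seg_b = [11, 12, 13, 14, 15, 16]

print(round_robin_trim([seg_a, seg_b], 8))  # [[1, 2, 3, 4], [11, 12, 13, 14]]
print(waterfall_trim([seg_a, seg_b], 8))    # [[1, 2, 3, 4, 5, 6, 7], [11]]
```

When one segment is shorter than its share, the round-robin sketch gives the unused quota to the other segment, so the total still reaches the budget whenever enough content exists.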

## Combining segments

Now that we have segments trimmed, we can combine them together to get a single RaggedTensor. BERT uses special tokens to indicate the beginning of the sequence ([CLS]) and the end of a segment ([SEP]). We also need a RaggedTensor indicating which items in the combined Tensor belong to which segment. We can use text.combine_segments() to get both of these Tensors with the special tokens inserted.

In [None]:
segments_combined, segments_ids = text.combine_segments(
    trimmed,
    start_of_sequence_id = _START_TOKEN,
    end_of_segment_id = _END_TOKEN,
)
segments_combined, segments_ids
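
The layout produced by combining segments can be sketched for a single example with plain lists. The helper `combine_segments_sketch` is hypothetical, and the IDs 3 and 4 simply mirror the positions of [CLS] and [SEP] in this tutorial's _VOCAB.

```python
def combine_segments_sketch(segments, start_id, end_id):
    """Build [CLS] seg_0 [SEP] seg_1 [SEP] ... plus matching segment IDs."""
    combined = [start_id]
    segment_ids = [0]  # The start token is counted as part of segment 0.
    for index, segment in enumerate(segments):
        combined.extend(segment)
        combined.append(end_id)
        # Each segment's items and its closing [SEP] share the segment index.
        segment_ids.extend([index] * (len(segment) + 1))
    return combined, segment_ids


CLS_ID, SEP_ID = 3, 4  # Indices of [CLS] and [SEP] in the tutorial's _VOCAB.
combined, segment_ids = combine_segments_sketch(
    [[10, 11], [20, 21, 22]], start_id=CLS_ID, end_id=SEP_ID)

print(combined)     # [3, 10, 11, 4, 20, 21, 22, 4]
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Note that two segments add three special tokens, so the combined length is three more than the trimmed content.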

## Masked Language Model Task

Now that we have our basic inputs, we can begin to extract the inputs needed for the "Masked LM and Masking Procedure" task. It has two sub-problems for us to think about: (1) which items to select for masking and (2) which values to assign to them.

### Item Selection

Because we will select items for masking at random, we will use a *text.RandomItemSelector*.

In [None]:
random_selector = text.RandomItemSelector(
    max_selections_per_batch = _MAX_PREDICTIONS_PER_BATCH,
    selection_rate = 0.2,
    unselectable_ids = [_START_TOKEN, _END_TOKEN, _UNK_TOKEN],
)

In [None]:
selected = random_selector.get_selection_mask(
    segments_combined, axis=1,
)
selected
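
The idea behind random item selection can be sketched for one sequence in pure Python. This is an illustration under stated assumptions, not the library's algorithm: the function `select_items` is hypothetical, and the way the number of selections is derived from `selection_rate` (rounding, then capping) is an assumption made for this sketch.

```python
import random


def select_items(ids, selection_rate, max_selections, unselectable_ids, rng):
    """Return a boolean mask marking randomly selected, selectable positions."""
    candidates = [pos for pos, tok in enumerate(ids) if tok not in unselectable_ids]
    # Assumption: select round(rate * candidates), capped by max_selections.
    num_to_select = min(
        max_selections, len(candidates), round(len(candidates) * selection_rate))
    chosen = set(rng.sample(candidates, num_to_select))
    return [pos in chosen for pos in range(len(ids))]


ids = [3, 10, 11, 4, 20, 21, 22, 4]  # 3 = [CLS], 4 = [SEP] in this tutorial
mask = select_items(
    ids,
    selection_rate=0.2,
    max_selections=5,
    unselectable_ids={0, 3, 4},  # [UNK], [CLS], [SEP]
    rng=random.Random(0),
)
print(mask)
```

Special tokens never appear in the selection because they are filtered out before sampling, which mirrors the role of `unselectable_ids` above.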

### Choosing the Masked Value

The methodology described in the original BERT paper for choosing the value for masking is as follows:

*      For mask_token_rate of the time, replace the item with the [MASK] token:

"my dog is hairy" -> "my dog is [MASK]"

*      For random_token_rate of the time, replace the item with a random word:

"my dog is hairy" -> "my dog is apple"

*      For 1 - mask_token_rate - random_token_rate of the time, keep the item unchanged:

"my dog is hairy" -> "my dog is hairy"

*text.MaskValuesChooser* encapsulates this logic and can be used in our preprocessing function. Here's an example of what MaskValuesChooser returns given a mask_token_rate of 80% and the default random_token_rate:

In [None]:
mask_values_chooser = text.MaskValuesChooser(_VOCAB_SIZE, _MASK_TOKEN, 0.8)
mask_values_chooser.get_mask_values(segments_combined)
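
The three-way choice listed above can be written as a short pure-Python sketch for a single token. The function `choose_mask_value` is hypothetical; the 0.1 default for `random_token_rate` follows the 10% figure from the BERT paper and is an assumption here, not a statement about the library's default.

```python
import random


def choose_mask_value(token_id, vocab_size, mask_token, rng,
                      mask_token_rate=0.8, random_token_rate=0.1):
    """Pick the replacement for one selected token: [MASK], random, or unchanged."""
    draw = rng.random()  # Uniform in [0, 1).
    if draw < mask_token_rate:
        return mask_token
    if draw < mask_token_rate + random_token_rate:
        return rng.randrange(vocab_size)  # Any ID in the vocabulary.
    return token_id


rng = random.Random(42)
MASK_ID, VOCAB_SIZE = 1, 31  # Values matching this tutorial's _VOCAB.

draws = [choose_mask_value(17, VOCAB_SIZE, MASK_ID, rng) for _ in range(10000)]
# Fraction replaced by [MASK]; expected to be close to mask_token_rate.
print(sum(d == MASK_ID for d in draws) / len(draws))
```

A single uniform draw decides between the three branches, so the rates are mutually exclusive and the "unchanged" case takes whatever probability remains.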

### Generating Inputs for Masked Language Model Task

Now that we have a RandomItemSelector to help us select items for masking and *text.MaskValuesChooser* to assign the values, we can use *text.mask_language_model()* to assemble all the inputs of this task for our BERT model.

In [None]:
masked_token_ids, masked_pos, masked_lm_ids = text.mask_language_model(
    segments_combined,
    item_selector = random_selector,
    mask_values_chooser = mask_values_chooser,
)

Let's dive deeper and examine the outputs of *mask_language_model()*. The output of masked_token_ids is:

In [None]:
masked_token_ids

Remember that our input is encoded using a vocabulary. If we decode masked_token_ids using our vocabulary, we get:

In [None]:
tf.gather(_VOCAB, masked_token_ids)

Notice that some wordpiece tokens have been replaced with either [MASK], [RANDOM] or a different ID value. masked_pos output gives us the indices (in the respective batch) of the tokens that have been replaced.

In [None]:
masked_pos

masked_lm_ids gives us the original value of the token.

In [None]:
masked_lm_ids

We can again decode the IDs here to get human-readable values.

In [None]:
tf.gather(_VOCAB, masked_lm_ids)
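
How the three outputs relate to one another can be shown with a pure-Python sketch for one sequence. The helper `apply_masking` is hypothetical and takes the selection mask and replacement values as plain lists; it only illustrates the relationship between the outputs, not how the library computes them.

```python
def apply_masking(token_ids, selected, replacement_ids):
    """Return (masked_token_ids, masked_pos, masked_lm_ids) for one sequence."""
    masked_token_ids = list(token_ids)
    masked_pos = []
    masked_lm_ids = []
    for pos, (is_selected, new_id) in enumerate(zip(selected, replacement_ids)):
        if is_selected:
            masked_pos.append(pos)                # Where a token was replaced.
            masked_lm_ids.append(token_ids[pos])  # The original ID, i.e. the label.
            masked_token_ids[pos] = new_id        # What the model actually sees.
    return masked_token_ids, masked_pos, masked_lm_ids


token_ids = [3, 22, 9, 24, 4, 16, 5, 4]
selected = [False, False, True, False, False, True, False, False]
replacements = [1] * len(token_ids)  # Use [MASK] (ID 1) everywhere for clarity.

masked, positions, labels = apply_masking(token_ids, selected, replacements)
print(masked)     # [3, 22, 1, 24, 4, 1, 5, 4]
print(positions)  # [2, 5]
print(labels)     # [9, 16]
```

Gathering `masked` at `positions` yields the replacement values, while `labels` holds what the model should predict at those positions.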

## Padding Model Inputs

Now that we have all the inputs for our model, the last step in our preprocessing is to package them into fixed 2-dimensional Tensors with padding and also generate a mask Tensor indicating the values which are pad values. We can use *text.pad_model_inputs()* to help us with this task.

In [None]:
# Prepare and pad combined segment inputs
input_word_ids, input_mask = text.pad_model_inputs(
  masked_token_ids, max_seq_length=_MAX_SEQ_LEN)
input_type_ids, _ = text.pad_model_inputs(
  segments_ids, max_seq_length=_MAX_SEQ_LEN)

In [None]:
# Prepare and pad masking task inputs
masked_lm_positions, masked_lm_weights = text.pad_model_inputs(
  masked_pos, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
masked_lm_ids, _ = text.pad_model_inputs(
  masked_lm_ids, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)

In [None]:
model_inputs = {
    "input_word_ids": input_word_ids,
    "input_mask": input_mask,
    "input_type_ids": input_type_ids,
    "masked_lm_ids": masked_lm_ids,
    "masked_lm_positions": masked_lm_positions,
    "masked_lm_weights": masked_lm_weights,
}
model_inputs
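
The padding step can be sketched in pure Python for one sequence. The helper `pad_inputs` is hypothetical; it assumes a pad value of 0 and that sequences longer than the target length are truncated.

```python
def pad_inputs(ids, max_seq_length, pad_value=0):
    """Truncate or right-pad to max_seq_length; mask is 1 for real items, 0 for pads."""
    kept = list(ids[:max_seq_length])
    num_pad = max_seq_length - len(kept)
    padded = kept + [pad_value] * num_pad
    mask = [1] * len(kept) + [0] * num_pad
    return padded, mask


padded, mask = pad_inputs([3, 10, 11, 4], max_seq_length=8)
print(padded)  # [3, 10, 11, 4, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0, 0, 0]

# A sequence that is too long is cut to the first max_seq_length items.
long_padded, long_mask = pad_inputs(list(range(10)), max_seq_length=8)
print(long_padded)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(long_mask)    # [1, 1, 1, 1, 1, 1, 1, 1]
```

The mask is what lets the model distinguish a real token whose ID happens to be 0 from a padding position.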

# Review

Let's review what we have so far and assemble our preprocessing function. Here's what we have:

In [None]:
def bert_pretrain_preprocess(vocab_table, features):
  # Inputs are string Tensors of documents, one per text feature.
  text_a = features["text_a"]
  text_b = features["text_b"]

  # Tokenize each segment to a RaggedTensor of shape [batch, (num_wordpieces)].
  tokenizer = text.BertTokenizer(
      vocab_table,
      token_out_type=tf.int64)
  # Use a loop variable that does not shadow the `text` module.
  segments = [tokenizer.tokenize(segment_text).merge_dims(
      1, -1) for segment_text in (text_a, text_b)]

  # Truncate inputs to a maximum combined length (before special tokens are added).
  trimmer = text.RoundRobinTrimmer(max_seq_length=6)
  trimmed_segments = trimmer.trim(segments)

  # Combine segments, get segment ids and add special tokens.
  segments_combined, segment_ids = text.combine_segments(
      trimmed_segments,
      start_of_sequence_id=_START_TOKEN,
      end_of_segment_id=_END_TOKEN)

  # Apply dynamic masking task.
  masked_input_ids, masked_lm_positions, masked_lm_ids = (
      text.mask_language_model(
        segments_combined,
        random_selector,
        mask_values_chooser,
      )
  )

  # Prepare and pad combined segment inputs
  input_word_ids, input_mask = text.pad_model_inputs(
    masked_input_ids, max_seq_length=_MAX_SEQ_LEN)
  input_type_ids, _ = text.pad_model_inputs(
    segment_ids, max_seq_length=_MAX_SEQ_LEN)

  # Prepare and pad masking task inputs
  masked_lm_positions, masked_lm_weights = text.pad_model_inputs(
    masked_lm_positions, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)
  masked_lm_ids, _ = text.pad_model_inputs(
    masked_lm_ids, max_seq_length=_MAX_PREDICTIONS_PER_BATCH)

  model_inputs = {
      "input_word_ids": input_word_ids,
      "input_mask": input_mask,
      "input_type_ids": input_type_ids,
      "masked_lm_ids": masked_lm_ids,
      "masked_lm_positions": masked_lm_positions,
      "masked_lm_weights": masked_lm_weights,
  }
  return model_inputs

We previously constructed a *tf.data.Dataset* and we can now use our assembled preprocessing function *bert_pretrain_preprocess()* in *Dataset.map()*. This allows us to create an input pipeline that transforms our raw string data into integer inputs and feeds them directly into our model.

In [None]:
dataset = (
    tf.data.Dataset.from_tensors(examples)
    .map(functools.partial(bert_pretrain_preprocess, lookup_table))
)
next(iter(dataset))