The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.
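
The fallback described above can be sketched in plain Python. This is a simplified illustration of greedy longest-match-first matching, not the `tensorflow_text` implementation; the function name `wordpiece_tokenize`, the `##` continuation prefix default, and the toy vocabulary are assumptions made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"the", "search", "##ability", "s", "##ere", "##nd", "##ip", "##ity"}

print(wordpiece_tokenize("the", toy_vocab))            # ['the']
print(wordpiece_tokenize("searchability", toy_vocab))  # ['search', '##ability']
print(wordpiece_tokenize("xyz", toy_vocab))            # ['[UNK]']
```

A common word such as "the" keeps its own slot, a rarer word is covered by pieces, and a word with no matching pieces maps to the unknown token.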

# Overview

The *tensorflow_text* package includes TensorFlow implementations of many common tokenizers, including three subword-style tokenizers:
* *text.BertTokenizer* - The BertTokenizer class is a higher-level interface. It includes BERT's token splitting algorithm and a WordpieceTokenizer. It takes sentences as input and returns token IDs.
* *text.WordpieceTokenizer* - The WordpieceTokenizer class is a lower-level interface. It only implements the WordPiece algorithm. You must standardize and split the text into words before calling it. It takes words as input and returns token IDs.
* *text.SentencepieceTokenizer* - The SentencepieceTokenizer requires a more complex setup. Its initializer requires a pre-trained sentencepiece model. See the google/sentencepiece repository for instructions on how to build one of these models. It can accept sentences as input when tokenizing.
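
To make "standardize and split the text into words" concrete, here is a rough plain-Python stand-in for that step. It is only an approximation for illustration and does not reproduce BERT's actual splitting rules; the helper name `rough_split` is hypothetical.

```python
import re


def rough_split(sentence):
    """Lowercase, then keep runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


print(rough_split("Hello TensorFlow!"))  # ['hello', 'tensorflow', '!']
```

A lower-level word-piece tokenizer would then receive these words one at a time, while a sentence-level interface performs a step like this internally.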

This tutorial builds a WordPiece vocabulary in a top-down manner, starting from existing words.

# Setup

In [None]:
pip install -q -U "tensorflow-text==2.8.*"

In [None]:
pip install -q tensorflow_datasets

In [None]:
import collections
import os
import pathlib
import re
import string
import sys
import tempfile
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow_text as text
import tensorflow as tf

In [None]:
tf.get_logger().setLevel('ERROR')
pwd = pathlib.Path.cwd()


# Download the dataset

Fetch the Portuguese/English translation dataset from TensorFlow Datasets (tfds):

In [None]:
examples, metadata = tfds.load(
    'ted_hrlr_translate/pt_to_en',
    with_info = True,
    as_supervised = True,
)

In [None]:
train_examples, val_examples = examples['train'], examples['validation']

This dataset produces Portuguese/English sentence pairs:

In [None]:
for pt, en in train_examples.take(1):
  print("Portuguese: ", pt.numpy().decode('utf-8'))
  print("English:   ", en.numpy().decode('utf-8'))

In [None]:
train_en = train_examples.map(lambda pt, en: en)
train_pt = train_examples.map(lambda pt, en: pt)

# Generate the vocabulary

This section generates a wordpiece vocabulary from a dataset.

The vocabulary generation code is included in the tensorflow_text pip package. It is not imported by default, so you need to import it manually:

In [None]:
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

There are many arguments you can set to adjust its behavior. For this tutorial, you'll mostly use the defaults.

In [None]:
bert_tokenizer_params = dict(lower_case = True)
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

In [None]:
bert_vocab_args = dict(
    vocab_size = 8000, #The target vocabulary size
    reserved_tokens = reserved_tokens, #Reserved tokens that must be included in the vocabulary
    bert_tokenizer_params = bert_tokenizer_params, #Arguments for 'text.BertTokenizer'
    learn_params = {}, #Arguments for 'wordpiece_vocab_tokenizer_learner_lib.learn'
)

In [None]:
%%time
pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt.batch(1000).prefetch(2),
    **bert_vocab_args,
)

Here are some slices of the resulting vocabulary

In [None]:
print(pt_vocab[:10])
print(pt_vocab[100:110])
print(pt_vocab[1000:1010])
print(pt_vocab[-10:])

Write a vocabulary file

In [None]:
def write_vocab_file(filepath, vocab):
  with open(filepath, 'w') as f:
    for token in vocab:
      print(token, file=f)

In [None]:
write_vocab_file('pt_vocab.txt', pt_vocab)
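
The file format is one token per line, so a token's zero-based line number is its ID. The following sketch checks that round trip with a made-up vocabulary in a temporary directory; the explicit UTF-8 encoding and the names `toy_vocab` and `token_to_id` are assumptions for this example.

```python
import pathlib
import tempfile

# Hypothetical toy vocabulary, reserved tokens first.
toy_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "a", "##b"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "toy_vocab.txt"
    # Write one token per line.
    with open(path, "w", encoding="utf-8") as f:
        for token in toy_vocab:
            print(token, file=f)
    # Read the file back; line order is preserved.
    loaded = path.read_text(encoding="utf-8").splitlines()

# The position of each line is the token ID.
token_to_id = {token: i for i, token in enumerate(loaded)}

print(loaded == toy_vocab)     # True
print(token_to_id["[START]"])  # 2
```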

Use *bert_vocab_from_dataset* again to generate a vocabulary from the English data:

In [None]:
%%time
en_vocab = bert_vocab.bert_vocab_from_dataset(
    train_en.batch(1000).prefetch(2),
    **bert_vocab_args,
)

In [None]:
print(en_vocab[:10])
print(en_vocab[100:110])
print(en_vocab[1000:1010])
print(en_vocab[-10:])

Here are the two vocabulary files

In [None]:
write_vocab_file('en_vocab.txt', en_vocab)

In [None]:
ls *.txt

# Build the tokenizer

The *text.BertTokenizer* can be initialized by passing the vocabulary file's path as the first argument.

In [None]:
pt_tokenizer = text.BertTokenizer('pt_vocab.txt', **bert_tokenizer_params)
en_tokenizer = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)

Now you can use it to encode some text. Take a batch of 3 examples from the English data:

In [None]:
for pt_examples, en_examples in train_examples.batch(3).take(1):
  for ex in en_examples:
    print(ex.numpy())

Run it through the *BertTokenizer.tokenize* method. Initially, this returns a *tf.RaggedTensor* with axes (batch, word, word-piece):

In [None]:
# Tokenize the examples -> (batch, word, word-piece)
token_batch = en_tokenizer.tokenize(en_examples)

# Merge the word and word-piece axes -> (batch, tokens)
token_batch = token_batch.merge_dims(-2, -1)
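
The effect of merging the last two axes can be pictured with nested Python lists. This is only an analogy for the ragged tensor operation; the helper name `merge_last_two_axes` and the token IDs are made up for illustration.

```python
def merge_last_two_axes(batch):
    """Flatten (batch, word, word-piece) nested lists into (batch, tokens)."""
    return [[piece for word in sentence for piece in word] for sentence in batch]


# Two sentences: the first has two words, the second has one word of three pieces.
nested = [[[7], [12, 45]], [[3, 9, 9]]]
flat = merge_last_two_axes(nested)

print(flat)  # [[7, 12, 45], [3, 9, 9]]
```

Word boundaries are discarded, but each sentence keeps its own variable-length row.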

In [None]:
for ex in token_batch.to_list():
  print(ex)

If you replace the token IDs with their text representations (using *tf.gather*) you can see that in the first example the words "searchability" and "serendipity" have been decomposed into "search ##ability" and "s ##ere ##nd ##ip ##ity":

In [None]:
# Lookup each token id in the vocabulary.
txt_tokens = tf.gather(en_vocab, token_batch)

# Join with spaces.
tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1)
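
In plain Python terms, the lookup above amounts to indexing into the vocabulary list and joining each row. The vocabulary and IDs below are hypothetical.

```python
# Hypothetical toy vocabulary; the list index is the token ID.
toy_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "search", "##ability"]
token_ids = [[4, 5], [1]]

# Replace each ID with its token text, then join each row with spaces.
txt = [" ".join(toy_vocab[i] for i in row) for row in token_ids]

print(txt)  # ['search ##ability', '[UNK]']
```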

To re-assemble words from the extracted tokens, use the *BertTokenizer.detokenize* method:

In [None]:
words = en_tokenizer.detokenize(token_batch)
tf.strings.reduce_join(words, separator=' ', axis=-1)
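
Conceptually, re-assembling words means gluing each continuation piece onto the piece before it. The sketch below shows that idea on token strings; it assumes `##` marks a continuation piece and is not how `BertTokenizer.detokenize` is implemented. The helper name `merge_pieces` is hypothetical.

```python
def merge_pieces(pieces, prefix="##"):
    """Join continuation pieces onto the preceding piece to rebuild words."""
    words = []
    for piece in pieces:
        if piece.startswith(prefix) and words:
            # Strip the marker and append to the word being built.
            words[-1] += piece[len(prefix):]
        else:
            words.append(piece)
    return words


pieces = ["search", "##ability", "and", "s", "##ere", "##nd", "##ip", "##ity"]
print(merge_pieces(pieces))  # ['searchability', 'and', 'serendipity']
```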

# Customization and export

This tutorial builds the text tokenizer and detokenizer used by the Transformer tutorial. This section adds methods and processing steps to simplify that tutorial, and exports the tokenizers using *tf.saved_model* so they can be imported by the other tutorials.

## Custom tokenization

The downstream tutorials both expect the tokenized text to include [START] and [END] tokens.

The *reserved_tokens* reserve space at the beginning of the vocabulary, so [START] and [END] have the same indexes for both languages:

In [None]:
START = tf.argmax(tf.constant(reserved_tokens) == "[START]")
END = tf.argmax(tf.constant(reserved_tokens) == "[END]")

In [None]:
def add_start_end(ragged):
  count = ragged.bounding_shape()[0]
  starts = tf.fill([count,1], START)
  ends = tf.fill([count,1], END)
  return tf.concat([starts, ragged, ends], axis=1)
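
The same idea on plain Python lists, as a sketch: because the reserved tokens occupy the first vocabulary slots, a token's position in `reserved_tokens` is its ID. The helper name `add_start_end_py` and the example IDs are assumptions for illustration.

```python
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

# Position in the reserved list equals the token ID.
START = reserved_tokens.index("[START]")  # 2
END = reserved_tokens.index("[END]")      # 3


def add_start_end_py(rows):
    """Wrap every row of token IDs with the START and END IDs."""
    return [[START] + row + [END] for row in rows]


print(add_start_end_py([[10, 11], [12]]))  # [[2, 10, 11, 3], [2, 12, 3]]
```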

In [None]:
words = en_tokenizer.detokenize(add_start_end(token_batch))
tf.strings.reduce_join(words, separator=' ', axis=-1)

## Custom detokenization

Before exporting the tokenizers, there are a couple of things you can clean up for the downstream tutorials:


1.   They want to generate clean text output, so drop reserved tokens like [START], [END] and [PAD].
2.   They're interested in complete strings, so apply a string join along the words axis of the result.



In [None]:
def cleanup_text(reserved_tokens, token_txt):
  # Drop the reserved tokens, except for the "[UNK]"
  bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
  bad_token_re = "|".join(bad_tokens)

  bad_cells = tf.strings.regex_full_match(token_txt, bad_token_re)
  result = tf.ragged.boolean_mask(token_txt, ~bad_cells)

  # Join them into strings
  result = tf.strings.reduce_join(result, separator=' ', axis=-1)
  return result
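
A plain-Python sketch of the same filtering logic may help show why `re.escape` is needed: unescaped, a token such as `[PAD]` is a regex character class that matches a single character, not the literal token. The helper name `cleanup_text_py` and the sample rows are made up for illustration.

```python
import re


def cleanup_text_py(reserved_tokens, token_rows):
    """Drop reserved tokens (except [UNK]) and join each row into a string."""
    bad = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
    bad_re = re.compile("|".join(bad))
    return [
        " ".join(tok for tok in row if not bad_re.fullmatch(tok))
        for row in token_rows
    ]


reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]
rows = [["[START]", "hello", "[UNK]", "world", "[END]", "[PAD]"], ["[START]", "[END]"]]

print(cleanup_text_py(reserved_tokens, rows))  # ['hello [UNK] world', '']

# Without escaping, "[PAD]" does not match the literal token at all.
print(re.fullmatch("[PAD]", "[PAD]") is None)                 # True
print(re.fullmatch(re.escape("[PAD]"), "[PAD]") is not None)  # True
```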

In [None]:
en_examples.numpy()

In [None]:
token_batch = en_tokenizer.tokenize(en_examples).merge_dims(-2, -1)
words = en_tokenizer.detokenize(token_batch)
words

In [None]:
cleanup_text(reserved_tokens, words).numpy()

## Export

The following code block builds a *CustomTokenizer* class to contain the *text.BertTokenizer* instances, the custom logic, and the @tf.function wrappers required for export.

In [None]:
class CustomTokenizer(tf.Module):
  def __init__(self, reserved_tokens, vocab_path):
    self.tokenizer = text.BertTokenizer(vocab_path, lower_case=True)
    self._reserved_tokens = reserved_tokens
    self._vocab_path = tf.saved_model.Asset(vocab_path)

    vocab = pathlib.Path(vocab_path).read_text().splitlines()
    self.vocab = tf.Variable(vocab)

    # Create the signatures for export:

    # Include a tokenize signature for a batch of strings.
    self.tokenize.get_concrete_function(
        tf.TensorSpec(shape=[None], dtype=tf.string))

    # Include `detokenize` and `lookup` signatures for:
    #   * `Tensors` with shape [batch, tokens]
    #   * `RaggedTensors` with shape [batch, tokens]
    self.detokenize.get_concrete_function(
        tf.TensorSpec(shape=[None, None], dtype=tf.int64))
    self.detokenize.get_concrete_function(
        tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))

    self.lookup.get_concrete_function(
        tf.TensorSpec(shape=[None, None], dtype=tf.int64))
    self.lookup.get_concrete_function(
        tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64))

    # These `get_*` methods take no arguments
    self.get_vocab_size.get_concrete_function()
    self.get_vocab_path.get_concrete_function()
    self.get_reserved_tokens.get_concrete_function()

  @tf.function
  def tokenize(self, strings):
    enc = self.tokenizer.tokenize(strings)
    # Merge the `word` and `word-piece` axes.
    enc = enc.merge_dims(-2,-1)
    enc = add_start_end(enc)
    return enc

  @tf.function
  def detokenize(self, tokenized):
    words = self.tokenizer.detokenize(tokenized)
    return cleanup_text(self._reserved_tokens, words)

  @tf.function
  def lookup(self, token_ids):
    return tf.gather(self.vocab, token_ids)

  @tf.function
  def get_vocab_size(self):
    return tf.shape(self.vocab)[0]

  @tf.function
  def get_vocab_path(self):
    return self._vocab_path

  @tf.function
  def get_reserved_tokens(self):
    return tf.constant(self._reserved_tokens)

Build a CustomTokenizer for each language

In [None]:
tokenizers = tf.Module()
tokenizers.pt = CustomTokenizer(reserved_tokens, 'pt_vocab.txt')
tokenizers.en = CustomTokenizer(reserved_tokens, 'en_vocab.txt')

Export the tokenizers as a saved_model

In [None]:
model_name = 'ted_hrlr_translate_pt_en_converter'
tf.saved_model.save(tokenizers, model_name)

Reload the saved_model and test the methods

In [None]:
reloaded_tokenizers = tf.saved_model.load(model_name)
reloaded_tokenizers.en.get_vocab_size().numpy()

In [None]:
tokens = reloaded_tokenizers.en.tokenize(['Hello TensorFlow!'])
tokens.numpy()

In [None]:
text_tokens = reloaded_tokenizers.en.lookup(tokens)
text_tokens

In [None]:
round_trip = reloaded_tokenizers.en.detokenize(tokens)
print(round_trip.numpy()[0].decode('utf-8'))

Archive it for the translation tutorials:

In [None]:
!zip -r {model_name}.zip {model_name}

In [None]:
!du -h *.zip