Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

The tensorflow_text package provides a number of tokenizers for preprocessing the text required by your text-based models. By performing the tokenization in the TensorFlow graph, you do not need to worry about differences between the training and inference workflows or about managing separate preprocessing scripts.

### Setup

In [None]:
pip install -q "tensorflow-text==2.8.*"

In [None]:
import requests

import tensorflow as tf
import tensorflow_text as tf_text

### Splitter API

```python
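import abc

class Splitter(abc.ABC):
  @abc.abstractmethod
  def split(self, input):
    """Splits the input strings into pieces."""

class SplitterWithOffsets(Splitter):
  @abc.abstractmethod
  def split_with_offsets(self, input):
    """Splits the input and also returns start and end byte offsets."""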
```

```python
import abc

class Detokenizer(abc.ABC):
  @abc.abstractmethod
  def detokenize(self, input):
    """Combines tokens back into strings."""
```

# Tokenizers

Below is the suite of tokenizers provided by TensorFlow Text.

## Whole word tokenizers

These tokenizers attempt to split a string by words, which is the most intuitive way to split text.

In [None]:
SENTENCE = "What you know you can't explain, but you feel it."

### WhitespaceTokenizer

The *tf_text.WhitespaceTokenizer* is the most basic tokenizer. It splits strings on ICU-defined whitespace characters (e.g. space, tab, newline). It is often good for quickly building prototype models.

In [None]:
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize([SENTENCE])
print(tokens.to_list())

### UnicodeScriptTokenizer

The *UnicodeScriptTokenizer* splits strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values.

In practice, this is similar to the WhitespaceTokenizer. The most apparent difference is that it splits punctuation from language text and also separates text in different scripts from each other. Note that this also splits contractions into separate tokens.



In [None]:
tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize([SENTENCE])
print(tokens.to_list())
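
To make the difference from whitespace splitting concrete, here is a rough pure-Python approximation. It is not the ICU script logic that the tokenizer uses: the helper names `whitespace_split` and `word_punct_split` are hypothetical, and the regex only separates word characters from punctuation, so it does not detect script changes between languages. For plain English text it is enough to show how the contraction gets split.

```python
import re

SENTENCE = "What you know you can't explain, but you feel it."

def whitespace_split(text):
    # Hypothetical helper: split on runs of whitespace only.
    return text.split()

def word_punct_split(text):
    # Hypothetical helper: runs of word characters or runs of punctuation.
    # This only approximates script-boundary splitting for plain English text.
    return re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_split(SENTENCE))  # "can't" and "explain," stay whole
print(word_punct_split(SENTENCE))  # "can", "'", "t" and "explain", "," are separate
```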

## Subword tokenizers

Subword tokenizers can be used with a smaller vocabulary, and allow the model to have some information about novel words from the subwords that make them up.

### WordpieceTokenizer

WordPiece tokenization is a data-driven tokenization scheme which generates a set of sub-tokens. These sub-tokens may correspond to linguistic morphemes, but this is often not the case.

The WordpieceTokenizer expects the input to already be split into tokens. Because of this prerequisite, you will often want to split using the WhitespaceTokenizer or UnicodeScriptTokenizer beforehand.

In [None]:
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize([SENTENCE])

After the string is split into tokens, the WordpieceTokenizer can be used to split into subtokens.

In [None]:
url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true"
r = requests.get(url)
filepath = "vocab.txt"
with open(filepath, 'wb') as f:
  f.write(r.content)

In [None]:
subtokenizer = tf_text.WordpieceTokenizer(filepath)
subtokens = subtokenizer.tokenize(tokens)
print(subtokens.to_list())
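
For intuition about how a word is split into sub-tokens, the sketch below implements the greedy longest-match-first strategy used by the BERT reference WordPiece code, in plain Python. It is an illustration under stated assumptions, not the library's implementation: the function name `wordpiece`, its parameters, and `TOY_VOCAB` are hypothetical, and the toy vocabulary is not the contents of `vocab.txt`.

```python
def wordpiece(word, vocab, unk_token="[UNK]", suffix_indicator="##"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"explain", "feel", "know", "##ing", "##s", "un", "##know", "##n"}

print(wordpiece("feeling", TOY_VOCAB))  # ['feel', '##ing']
print(wordpiece("unknown", TOY_VOCAB))  # ['un', '##know', '##n']
print(wordpiece("xyz", TOY_VOCAB))      # ['[UNK]']
```

Because matching is greedy, the split depends entirely on the vocabulary, which is why the resulting pieces do not necessarily line up with morphemes.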

### BertTokenizer

The BertTokenizer mirrors the original implementation of tokenization from the BERT paper. This is backed by the WordpieceTokenizer, but also performs additional tasks such as normalization and tokenizing to words first.

In [None]:
tokenizer = tf_text.BertTokenizer(filepath, token_out_type = tf.string, lower_case = True)
tokens = tokenizer.tokenize([SENTENCE])
print(tokens.to_list())

### SentencepieceTokenizer

The SentencepieceTokenizer is a sub-token tokenizer that is highly configurable. This is backed by the Sentencepiece library. Like the BertTokenizer, it can include normalization and token splitting before splitting into sub-tokens.

In [None]:
url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_oss_model.model?raw=true"
sp_model = requests.get(url).content

In [None]:
tokenizer = tf_text.SentencepieceTokenizer(sp_model, out_type = tf.string)
tokens = tokenizer.tokenize([SENTENCE])
print(tokens.to_list())

# Other splitters

### UnicodeCharTokenizer

This splits a string into UTF-8 characters. It is useful for CJK languages that do not have spaces between words.

In [None]:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize([SENTENCE])
print(tokens.to_list())

The output is Unicode code points. This can also be useful for creating character n-grams, such as bigrams. To convert the code points back into UTF-8 characters, use tf.strings.unicode_encode.

In [None]:
characters = tf.strings.unicode_encode(tf.expand_dims(tokens, -1), "UTF-8")
bigrams = tf_text.ngrams(characters, 2, reduction_type=tf_text.Reduction.STRING_JOIN, string_separator='')
print(bigrams.to_list())
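
The same round trip can be sketched in plain Python, which may help when checking what the ragged tensors contain. This is only an analogue of the TensorFlow ops above: `ord` and `chr` stand in for decoding to and encoding from code points, and the short sample string is chosen for illustration.

```python
text = "What you"

# Code points are integers; chr() turns each one back into a character.
codepoints = [ord(ch) for ch in text]
characters = [chr(cp) for cp in codepoints]

# Character bigrams: join each character with its right-hand neighbor.
bigrams = [a + b for a, b in zip(characters, characters[1:])]

print(codepoints)  # [87, 104, 97, 116, 32, 121, 111, 117]
print(bigrams)     # ['Wh', 'ha', 'at', 't ', ' y', 'yo', 'ou']
```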

### HubModuleTokenizer

This is a wrapper around models deployed to TF Hub to make the calls easier since TF Hub currently does not support ragged tensors. Having a model perform tokenization is particularly useful for CJK languages when you want to split into words, but do not have spaces to provide a heuristic guide.

In [None]:
MODEL_HANDLE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = tf_text.HubModuleTokenizer(MODEL_HANDLE)
tokens = segmenter.tokenize(["新华社北京"])
print(tokens.to_list())

It may be difficult to view the results of the UTF-8 encoded byte strings. Decode the list values to make viewing easier.

In [None]:
def decode_list(x):
  # Recursively decode nested lists of UTF-8 byte strings.
  if isinstance(x, list):
    return list(map(decode_list, x))
  return x.decode("UTF-8")

def decode_utf8_tensor(x):
  return list(map(decode_list, x.to_list()))

print(decode_utf8_tensor(tokens))

### SplitMergeTokenizer

The *SplitMergeTokenizer* and *SplitMergeFromLogitsTokenizer* have the targeted purpose of splitting a string based on provided values that indicate where the string should be split. This is useful when building your own segmentation models, like the one used in the previous HubModuleTokenizer example.

For the *SplitMergeTokenizer*, a value of 0 indicates the start of a new string, and a value of 1 indicates that the character is part of the current string.

In [None]:
strings = ["新华社北京"]
labels = [[0, 1, 1, 0, 1]]

tokenizer = tf_text.SplitMergeTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
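
The labeling rule can be sketched in plain Python as follows. The helper `split_merge` is hypothetical and works on one string at a time. It is meant only to show how the 0 and 1 labels map onto characters, not how the library implements the op.

```python
def split_merge(text, labels):
    # Hypothetical helper: 0 starts a new token, 1 extends the current one.
    tokens = []
    for ch, label in zip(text, labels):
        if label == 0 or not tokens:
            tokens.append(ch)
        else:
            tokens[-1] += ch
    return tokens

print(split_merge("新华社北京", [0, 1, 1, 0, 1]))  # ['新华社', '北京']
```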

The *SplitMergeFromLogitsTokenizer* is similar, but it instead accepts logit value pairs from a neural network that predict if each character should be split into a new string or merged into the current one.

In [None]:
strings = [["新华社北京"]]
labels = [[[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]]

tokenizer = tf_text.SplitMergeFromLogitsTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
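
As a rough sketch of how logit pairs relate to split/merge labels, the snippet below assumes that the first value of each pair scores the split action and the second scores the merge action, and that the higher score wins. That ordering is an assumption here, and the helper `labels_from_logits` is hypothetical. With the logits from the cell above, it yields the same labels as the SplitMergeTokenizer example.

```python
def labels_from_logits(logits):
    # Assumption: each pair is (split_score, merge_score).
    # Label 0 (split) when the split score is higher, otherwise 1 (merge).
    return [0 if split > merge else 1 for split, merge in logits]

logits = [[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]
print(labels_from_logits(logits))  # [0, 1, 1, 0, 1]
```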

### RegexSplitter

The RegexSplitter is able to segment strings at arbitrary breakpoints defined by a provided regular expression.

In [None]:
splitter = tf_text.RegexSplitter(r"\s")  # r"\s" is the regex that matches whitespace (spaces, tabs, and newlines)
tokens = splitter.split([SENTENCE])
print(tokens.to_list())

# Offsets

When tokenizing strings, it is often useful to know where in the original string each token originated. For this reason, each tokenizer that implements *TokenizerWithOffsets* has a tokenize_with_offsets method that returns the byte offsets along with the tokens.

The start_offsets value lists the byte position in the original string at which each token starts, and the end_offsets value lists the byte position immediately after each token ends. In other words, the start offsets are inclusive and the end offsets are exclusive.

In [None]:
tokenizer = tf_text.UnicodeScriptTokenizer()
(tokens, start_off, end_off) = tokenizer.tokenize_with_offsets(['Everything not saved will be lost.'])
print(tokens.to_list())
print(start_off.to_list())
print(end_off.to_list())
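
The inclusive-start, exclusive-end convention means that slicing the UTF-8 bytes of the original string with `[start:end]` recovers each token. The plain-Python sketch below illustrates this with a hypothetical helper, `tokenize_with_byte_offsets`, that splits on ASCII whitespace only. The sample string contains non-ASCII characters to show that the offsets count bytes, not characters.

```python
import re

def tokenize_with_byte_offsets(text):
    # Hypothetical helper: whitespace-delimited tokens with UTF-8 byte offsets.
    data = text.encode("utf-8")
    tokens, starts, ends = [], [], []
    for m in re.finditer(rb"\S+", data):
        tokens.append(m.group())
        starts.append(m.start())  # inclusive
        ends.append(m.end())      # exclusive
    return data, tokens, starts, ends

data, tokens, starts, ends = tokenize_with_byte_offsets("naïve café lost.")
for tok, s, e in zip(tokens, starts, ends):
    # Slicing the bytes with [start:end] recovers the token exactly.
    assert data[s:e] == tok

print(starts)  # [0, 7, 13]
print(ends)    # [6, 12, 18]
```

Note that "naïve" has five characters but occupies six bytes, so the first end offset is 6.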

# Detokenization

Tokenizers which implement the Detokenizer interface provide a detokenize method that attempts to recombine the tokens into strings. This can be lossy, so the detokenized string may not always exactly match the original, pre-tokenized string.

In [None]:
tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize([SENTENCE])
print(tokens.to_list())

In [None]:
strings = tokenizer.detokenize(tokens)
print(strings.numpy())
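
Character-level tokenization like the example above keeps every character, so it round-trips cleanly. To see how detokenization can be lossy, consider this plain-Python analogue of a whitespace tokenizer. It is an illustration only, not the library's detokenize method: once the original separators are discarded, joining with a single space cannot restore them.

```python
original = "Never  tell me\tthe odds."

# Whitespace tokenization discards the separators (a double space and a tab).
tokens = original.split()

# "Detokenize" by joining with a single space.
restored = " ".join(tokens)

print(restored)              # Never tell me the odds.
print(restored == original)  # False
```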

# TF Data

TF Data is a powerful API for creating an input pipeline for training models. Tokenizers work as expected with the API.

In [None]:
docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = tf_text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())