# (Preprocessing) Tokenization

In this notebook, you will learn how to perform different kinds of tokenization using tensorflow-text and sentencepiece.

We start with character-based tokenization, which treats individual characters as tokens, then look at word-based tokenization and its shortcomings. Finally, we investigate sub-word tokenization (and how to do it), the most flexible approach, which works for any language because it does not require words to be separated by white space.

First we need to install tensorflow-text, which includes several tokenizers that are used in modern NLP.

In [1]:
!pip install -q -U tensorflow-text

In [2]:
import tensorflow as tf
import tensorflow_text as tf_txt

import requests

We could use regular expressions and other operations such as string splitting for the tokenization (e.g. with the nltk library). The downside is that the preprocessing then cannot be integrated with the model: it is not part of the tensorflow graph, which complicates the data pipeline and model deployment and can hurt performance.

So instead, we want to use tokenizers that can be used inside a tensorflow model. The tensorflow-text library provides tools for both basic and advanced tokenization that can be integrated with a tensorflow model.

## Character Tokenization

We define a vocabulary of all characters and then treat the text as a sequence of indices, each signifying which character is used.

Upsides:
- Relatively small vocabulary

Downsides:
- Very long sequences (vanishing gradients, memory constraints, training and inference speed)
- Quite unlike how humans read and understand text
- Usually leads to worse performance

In [3]:
tokenizer = tf_txt.UnicodeCharTokenizer()
tokenized_text = tokenizer.tokenize("Output of language models can be deceptive")
print(tokenized_text)

tf.Tensor(
[ 79 117 116 112 117 116  32 111 102  32 108  97 110 103 117  97 103 101
  32 109 111 100 101 108 115  32  99  97 110  32  98 101  32 100 101  99
 101 112 116 105 118 101], shape=(42,), dtype=int32)


In [4]:
detokenized_text = tokenizer.detokenize(tokenized_text)
print(detokenized_text)

tf.Tensor(b'Output of language models can be deceptive', shape=(), dtype=string)
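
The integer ids printed above are simply Unicode code points. As a rough illustration in plain Python (not the tensorflow-text implementation), the same mapping can be sketched with `ord` and `chr`. The helper names `char_tokenize` and `char_detokenize` are made up for this example.

```python
def char_tokenize(text):
    """Map each character to its Unicode code point."""
    return [ord(char) for char in text]


def char_detokenize(ids):
    """Map code points back to characters and join them into a string."""
    return "".join(chr(i) for i in ids)


ids = char_tokenize("Output of language models can be deceptive")
print(ids[:6], len(ids))      # first six ids and the sequence length
print(char_detokenize(ids))   # round trip back to the original text
```

Note how a seven-word sentence already becomes a sequence of 42 tokens, which is the main downside listed above.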


## Word Tokenization

We define a vocabulary of all words that come up in the text and treat the text as a sequence of indices, each signifying which word occurs.

Upsides:
- Only coherent words (that are part of the vocabulary) can be generated
- Closer to how humans read and process text compared to a purely character-based representation

Downsides:
- Words can have many different versions, so some form of stemming would be needed to reduce vocabulary size
- Large vocabulary size since there are a lot of words
- Some tokens may be strongly under-represented in the data
- Words with the same sub-words are treated as independent from each other (in German, the word "Bushaltestelle" would be independent from the noun "Bus", the verb "halten" and the noun "Stelle")

In [5]:
tf.print(tf.strings.split("this is a sentence.", sep=" "))

["this" "is" "a" "sentence."]


In [6]:
# The better way to do it: split on whitespace and Unicode script boundaries,
# which also separates punctuation from words
tokenizer = tf_txt.UnicodeScriptTokenizer()
tokenizer.tokenize(["This is some text."])

<tf.RaggedTensor [[b'This', b'is', b'some', b'text', b'.']]>
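
The tokenizers above only split the text; turning words into indices still requires a vocabulary. The following is a minimal plain-Python sketch of that step, assuming a regex-based splitter and an `<unk>` entry for words that are not in the vocabulary. The names `split_words`, `build_vocab`, `encode` and `UNK` are hypothetical and not part of tensorflow-text.

```python
import re

UNK = "<unk>"  # hypothetical name for the unknown-word token


def split_words(text):
    """Split into lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(corpus):
    """Assign an index to every distinct word; index 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for word in split_words(corpus):
        vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Replace each word by its index, falling back to the UNK index."""
    return [vocab.get(word, vocab[UNK]) for word in split_words(text)]


vocab = build_vocab("This is some text. This is more text.")
print(vocab)
print(encode("This is new text.", vocab))  # "new" is not in the vocabulary
```

Any word that was not seen when the vocabulary was built collapses to the same unknown index, which is one reason word-level vocabularies have to be large.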

## Sentence Tokenization (generally a bad idea)

We define a vocabulary of all sentences that occur in the text and treat the text as a sequence of indices, each signifying which sentence occurs.

Upsides:

- Only coherent sentences will be produced since the model can only output sentences that occurred in the training data (which is hopefully, though not necessarily, free of garbage)
- Sequences are much shorter, so more text can be produced with less compute

Downsides:
- Depending on the text corpus, the vocabulary can be extremely large (combinatorial explosion)
- Does not allow for new content
- Not at all flexible: the slightest typo or formatting difference compared to a sentence in the vocabulary prevents the model from processing (and potentially comprehending) the sentence and forces it to represent the sentence as an "unknown" token

In [7]:
text_corpus = "This is a sentence. This is another sentence. And here we have yet another sentence.\
There are so many possible sentences. You would not believe it.".lower()

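# suboptimal splitting of sentences, preferably use a method from nltk etc.
sentence_tokens = text_corpus.split(".")

# deduplicate and drop the empty string left over after the final period
# (slicing a list built from a set with [1:] would drop an arbitrary element)
vocab_list = [sentence for sentence in set(sentence_tokens) if sentence]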

vocab = {}
reverse_vocab = {}
for i, element in enumerate(vocab_list):
    vocab[element] = i
    reverse_vocab[i] = element

vocab

{'there are so many possible sentences': 0,
 ' this is another sentence': 1,
 ' and here we have yet another sentence': 2,
 ' you would not believe it': 3,
 'this is a sentence': 4}
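
To make the inflexibility concrete, here is a small plain-Python sketch, assuming sentences are split on periods and stripped of surrounding white space. The helpers `split_sentences`, `build_sentence_vocab` and `encode_sentences` are hypothetical names used only for this illustration.

```python
def split_sentences(text):
    """Naive sentence splitting on periods, dropping empty pieces."""
    return [s.strip() for s in text.lower().split(".") if s.strip()]


def build_sentence_vocab(corpus):
    """Give every distinct sentence an index; index 0 is the unknown token."""
    vocab = {"<unk>": 0}
    for sentence in split_sentences(corpus):
        vocab.setdefault(sentence, len(vocab))
    return vocab


def encode_sentences(text, vocab):
    """Look up each sentence, falling back to the unknown index."""
    return [vocab.get(s, vocab["<unk>"]) for s in split_sentences(text)]


sentence_vocab = build_sentence_vocab("This is a sentence. This is another sentence.")
# The second sentence contains a single typo ("anothr") and is therefore unknown.
print(encode_sentences("This is a sentence. This is anothr sentence.", sentence_vocab))
```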

# Tokenization with subword segmentation (superior) 

Upsides:
- Significantly smaller vocabulary size compared to word tokenization
- Compositional relations between words can be learned with shared embeddings of sub-words
- No stemming is needed to reduce vocabulary size since endings of verbs etc. are re-used for many words.
- A strong language model could potentially produce new words or names by using sub-words in a meaningful way. (Spoiler alert: this typically does not happen with current models!)
- Since there are multiple ways to segment words into subwords, this can be used for a regularizing effect on the pre-processing side.

Downsides:
- A subword vocabulary has to be learned from a large corpus, or an existing one has to be found, for the methods to work well
- Grammatical mistakes can occur during learning, similar to character based tokenization (could also be regarded as an upside, depending on the interest in the language model)
- Sequences get longer compared to word-level tokenization

## Wordpiece Tokenizer

https://paperswithcode.com/method/wordpiece

The wordpiece tokenizer expects a text file containing a list of the possible sub-words.

There is also a method called "build_fast_wordpiece_model" which supposedly allows building a wordpiece vocabulary from raw text. However, it is not documented, so we will only show how to use pre-defined vocabularies of sub-words.

**Note that the wordpiece tokenizer expects the text to already be split into words.**

First we download the pre-defined vocabulary of 7011 subwords.

Some of the word pieces start with "##", meaning that the subword continues a word rather than starting one (e.g. a suffix).

In [8]:
url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true"
r = requests.get(url)
filepath = "vocab.txt"

with open(filepath, "wb") as file:
    file.write(r.content)

Next we instantiate our wordpiece tokenizer with the vocabulary text file. We choose to transform tokens to strings but we can also directly obtain indices.

In [9]:
# instantiate the wordpiece tokenizer with the vocabulary list (list of possible subwords)
wp_tokenizer = tf_txt.WordpieceTokenizer(vocab_lookup_table='vocab.txt',
                                         # can also be set to tf.int32 to obtain indices
                                         token_out_type=tf.string) 

We pre-process some text that we want to tokenize with the word piece tokenizer. For this we split it into words.

We use a sentence that contains the word "Westerberg", which would be unlikely to be featured in a word-level vocabulary. Using sub-word tokenization, we can tokenize (and thus our model can also generate) words that are not part of the vocabulary.

In [10]:
text = "The sunflowers on Westerberg remind me of watching the sunset in my hometown.".lower()

# split text into words
tokenizer = tf_txt.UnicodeScriptTokenizer()
word_tokens = tokenizer.tokenize(text)
print(word_tokens)

tf.Tensor(
[b'the' b'sunflowers' b'on' b'westerberg' b'remind' b'me' b'of'
 b'watching' b'the' b'sunset' b'in' b'my' b'hometown' b'.'], shape=(14,), dtype=string)


Finally we tokenize the words into sub-words.

In [11]:
sub_word_tokens = wp_tokenizer.tokenize(word_tokens)

print(sub_word_tokens)

<tf.RaggedTensor [[b'the'], [b'sun', b'##f', b'##low', b'##ers'], [b'on'], [b'west', b'##er', b'##berg'], [b'remind'], [b'me'], [b'of'], [b'watching'], [b'the'], [b'sun', b'##set'], [b'in'], [b'my'], [b'hometown'], [b'.']]>


The word "westerberg" is composed of three tokens: "west", the suffix "##er", and the suffix "##berg". This type of tokenization is highly flexible, which is one of the reasons it is used in most (if not all) modern language models.
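
The splits shown above are consistent with a greedy longest-match-first lookup: at every position, take the longest vocabulary entry that fits, and prefix it with "##" if it does not start the word. The following plain-Python sketch illustrates that idea with a tiny made-up vocabulary. `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and this is not the tensorflow-text implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Tiny vocabulary, made up for this example.
toy_vocab = {"the", "sun", "##f", "##low", "##ers", "##set", "west", "##er", "##berg"}
for word in ["sunflowers", "westerberg", "sunset", "xyz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```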

You may have noticed that the resulting tensor that we've printed above is not a normal tensor but a ragged tensor containing the words as rows of differing length. Ragged tensors can have elements of varying length without the need for padding. They are especially useful for RNN models because these can process sequences of varying length.

In [12]:
# shape of a ragged tensor is None along the second dimension.
sub_word_tokens.shape

TensorShape([14, None])
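
For models that need rectangular inputs, ragged rows are usually padded to a common length (in TensorFlow, `RaggedTensor.to_tensor` serves this purpose). The idea can be sketched in plain Python as follows. `pad_rows` and the `[PAD]` token are assumptions made for this illustration.

```python
def pad_rows(rows, pad_value):
    """Pad every row on the right so that all rows have the same length."""
    max_len = max(len(row) for row in rows)
    return [row + [pad_value] * (max_len - len(row)) for row in rows]


ragged = [["the"], ["sun", "##f", "##low", "##ers"], ["on"]]
dense = pad_rows(ragged, "[PAD]")
for row in dense:
    print(row)
```

The padded version stores twelve entries for six actual tokens, which is the overhead that ragged tensors avoid.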

## Tokenization with SentencePiece (most versatile choice) 

The sentencepiece tokenization tool allows for tokenization without specifying at which level you want to tokenize. It supports byte-pair encoding (BPE), which is used in GPT models (BERT, by contrast, uses WordPiece). SentencePiece does not need words to be split before tokenization: it can be used on raw text and can therefore also tokenize languages that do not separate words with white space.

Using the sentencepiece tokenizer will give you the most versatility and customizability while allowing for the use of state-of-the-art algorithms such as BPE (https://paperswithcode.com/method/bpe) and unigram (https://paperswithcode.com/method/unigram-segmentation). Both allow for regularization techniques (e.g. through sampling) on the tokenization side of the model.
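
To give an intuition for BPE, the sketch below learns merges on a toy word-frequency table in plain Python: it repeatedly counts adjacent symbol pairs and merges the most frequent pair into a new symbol. This is a simplified illustration of the general idea, not the sentencepiece implementation. The function names and the toy frequencies are made up, and ties are broken by first occurrence.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_freqs, num_merges=3)
print(merges)
print(list(segmented))
```

Each merge adds one entry to the vocabulary, so the number of merges directly controls the vocabulary size.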

#### Training a sentencepiece model from scratch

To train state-of-the-art tokenizer models on your own text from scratch, you can use sentencepiece by Google (https://github.com/google/sentencepiece). It is recommended to train the model on a large corpus to learn useful sub-words.

You will first have to install the Python package of sentencepiece.

In [13]:
!pip install sentencepiece



In [14]:
import sentencepiece as sp

In [15]:
sp.SentencePieceTrainer.train(
    input='shakespeare_sonnets.txt', model_prefix='tokenizer_model', model_type="unigram", vocab_size=512)

sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: shakespeare_sonnets.txt
  input_format: 
  model_prefix: tokenizer_model
  model_type: UNIGRAM
  vocab_size: 512
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespa

The trained model is saved in a serialized format along with a vocabulary file. The vocabulary size here was set to 512. Effectively this means that single characters as well as sub-words and whole words are part of the vocabulary. Unlike WordPiece, the segmentation model does not distinguish between words, sub-words and collections of words, so in some cases even phrases can become a single token (by default, however, pieces do not cross white space; see `split_by_whitespace: 1` in the log above).

In [16]:
# read the serialized model file as raw bytes
with tf.io.gfile.GFile('tokenizer_model.model', "rb") as model_file:
    trained_tokenizer_model = model_file.read()

# load the model as a tokenizer that can be used inside a tensorflow model
# (nbest_size=-1 together with alpha enables sampling over all segmentations)
tokenizer = tf_txt.SentencepieceTokenizer(
    model=trained_tokenizer_model, out_type=tf.int32, nbest_size=-1, alpha=1, reverse=False,
    add_bos=False, add_eos=False, return_nbest=False, name=None
)

In [17]:
tokens = tokenizer.tokenize("thou shall not pass")
print(tokens)

tf.Tensor([ 39 148  49   3 437], shape=(5,), dtype=int32)


In [18]:
tokenizer.detokenize(tokens)

<tf.Tensor: shape=(), dtype=string, numpy=b'thou shall not pass'>

Next we can also see how many tokens the sentencepiece model trained on Shakespeare's sonnets needs to represent the word "Westerberg". Running the following code cell multiple times shows that the tokenization is stochastic (a consequence of the sampling settings `nbest_size=-1` and `alpha=1` chosen above), which is intended to help regularize the optimization of language models.

In [19]:
tokens = tokenizer.tokenize("westerberg")

for token in tokens:
    print(tokenizer.detokenize([token]))

tf.Tensor(b'we', shape=(), dtype=string)
tf.Tensor(b'st', shape=(), dtype=string)
tf.Tensor(b'er', shape=(), dtype=string)
tf.Tensor(b'b', shape=(), dtype=string)
tf.Tensor(b'er', shape=(), dtype=string)
tf.Tensor(b'g', shape=(), dtype=string)
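
The stochasticity is possible because a word usually has several valid segmentations under a given vocabulary. The plain-Python sketch below enumerates all of them for a toy piece set and samples one uniformly at random. This is a simplification: a trained unigram model weights segmentations by learned piece probabilities instead. `all_segmentations` and `toy_pieces` are hypothetical names.

```python
import random


def all_segmentations(word, pieces):
    """Enumerate every way to split `word` into pieces from the vocabulary."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in pieces:
            for tail in all_segmentations(word[end:], pieces):
                results.append([head] + tail)
    return results


# Tiny piece set, made up for this example.
toy_pieces = {"w", "e", "s", "t", "we", "st", "west"}
candidates = all_segmentations("west", toy_pieces)
print(len(candidates), "possible segmentations")

rng = random.Random(0)  # fixed seed so that repeated runs sample the same splits
for _ in range(3):
    print(rng.choice(candidates))
```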


Lastly, SentencePiece can also be used as a standalone package (e.g. when working in a different deep learning framework such as PyTorch or JAX).

In [20]:
# instantiate the SentencePieceProcessor
tokenizer = sp.SentencePieceProcessor(model_file="tokenizer_model.model", out_type=int)

In [21]:
# Encode text with the tokenizer
tokenized_text = tokenizer.encode("Thou shall not pass", out_type=int)
print(tokenized_text)

[248, 148, 49, 3, 437]


In [22]:
# Decode the list of indices with the tokenizer
tokenizer.decode(tokenized_text)

'Thou shall not pass'