<h1 align="center">Statistics for Machine Learning</h1>
<h2 align="center">Tokenization</h2>

&nbsp;

### Overview

Natural language problems deal with textual data, which computers cannot immediately understand. For this reason, words and parts of words need to be encoded using numbers. These encodings are referred to as *tokens* and can be generated in a number of different ways. The many steps throughout the tokenization pipeline are crucial to determining the success of a language model, and so this notebook dives deep into the methods used in many popular language models today.

&nbsp;


### Contents

Section 1 - Introduction to Tokenization

Section 2 - Normalization and Pre-Tokenization

Section 3 - Subword Tokenization Methods

Section 4 - Tokenizers in Python Libraries

Section 5 - Conclusion

Section 6 - Glossary

Section 7 - Further Reading

### Dependencies

In [66]:
!pip install transformers
!pip install tokenizers

<h2 align="center">Section 1 - Introduction to Tokenization</h2>

### 1.1 - Overview of Tokenizers

Natural language problems concern textual data, which cannot be immediately understood by a machine. For computers to process language, they must first convert the text into a numerical form. This process is carried out in two stages by a component of the model called the **tokenizer**. The tokenizer first takes the text and divides it into smaller pieces, be that words, parts of words, or individual characters. These smaller pieces of text are called **tokens**. The Stanford NLP Group [1] defines tokens more rigorously as:

> an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

Once the tokenizer has divided the text into tokens, each token can be assigned a number. An example of this might be that the word 'cat' is assigned the value '15', and so every 'cat' token in the input text will be represented by the number 15.

&nbsp;

### 1.2 - Different Tokenization Methods

There are several different ways to divide text into tokens, with the three most common being:

* Word-based
* Character-based
* Subword-based

The following cells give an overview of each of these methods, along with some pros and cons for each.

&nbsp;


### 1.3 - Overview of Word-Based Tokenization Methods

Word-based tokenization is perhaps the simplest of the three methods described above. In this method, the tokenizer will split a sentence into words by splitting on each space in the sentence (sometimes called 'whitespace-based tokenization'), or by a similar set of rules (such as punctuation-based tokenization, treebank tokenization, etc.) [2].

&nbsp;

For example, the sentence:

&nbsp;

`This sentence is a great, interesting sentence!`

&nbsp;

could be split on whitespace characters to give:

&nbsp;

`['This', 'sentence', 'is', 'a', 'great,', 'interesting', 'sentence!']`

&nbsp;

or split on both punctuation and spaces to give:

&nbsp;

`['This', 'sentence', 'is', 'a', 'great', ',', 'interesting', 'sentence', '!']`

&nbsp;

From this simple example, it is clear that the rules used to determine the split are important, since the first split gives the potentially rare token 'sentence!', while the second split gives the two less rare tokens 'sentence' and '!'. Care should be taken not to remove punctuation altogether, as it can carry very specific meanings. An example of this is the apostrophe, which can distinguish between the plural and possessive forms of words. For example, "book's" refers to some property of a book, as in "the book's spine is damaged", and "books" refers to many books, as in "the books are damaged".
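The two splits above can be reproduced with a few lines of standard-library Python. This is only a minimal sketch: the regular expression is an assumption chosen for this example sentence, and real punctuation-based or treebank tokenizers apply more elaborate rules.

```python
import re

sentence = "This sentence is a great, interesting sentence!"

# Whitespace-based: punctuation stays attached to the neighbouring word
whitespace_tokens = sentence.split()

# Punctuation-aware: \w+ matches a run of word characters, while [^\w\s]
# matches a single punctuation character, which becomes its own token
punctuation_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(punctuation_tokens)
```

The first list keeps 'great,' and 'sentence!' as single tokens, while the second separates the comma and the exclamation mark.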

Once the tokens have been generated, each can be assigned a number. The next time the tokenizer generates a token that it has already seen, it can simply reuse the number that was assigned to the earlier, identical token. For example, if the token 'sentence' is assigned the value 1 in the sentence above, and the tokenizer is given another sentence that contains the word 'sentence', this second instance (and all subsequent instances) of the word 'sentence' will also be assigned the value of 1 [3].

&nbsp;


### 1.4 - Pros and Cons of Word-Based Tokenization Methods

The tokens produced in the word-based method contain a high degree of information, since each token contains semantic and contextual information. However, one of the largest drawbacks with this method is that very similar words are treated as completely separate tokens. For example, the connection between 'cat' and 'cats' would be non-existent, and these would be treated as separate words. This becomes a problem in large-scale applications that contain many words, as the possible number of tokens in the model's **vocabulary** (total number of words) can grow very large. English has around 170,000 words, and so including various grammatical forms for each word can lead to what is known as the **exploding vocabulary problem**. An example of this is the TransformerXL tokenizer, which uses whitespace-based splitting and ended up with a vocabulary size of over 250,000 [4].

One way to combat this is by enforcing a hard limit on the number of tokens the model can learn (e.g. 10,000). This would classify any word outside of the 10,000 most frequent tokens as **out-of-vocabulary** (OOV), and would assign it the token value of 'UNKNOWN'. This causes performance to suffer in cases where many unknown words are present, but may be a suitable compromise if the data contains mostly common words [3].
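A vocabulary limit of this kind can be sketched as follows. The function names `build_vocabulary` and `encode`, the choice of ID 0 for 'UNKNOWN', and the tiny training text are all assumptions made for illustration, not part of any particular library.

```python
from collections import Counter

def build_vocabulary(tokens, max_size):
    """Map the max_size most frequent tokens to IDs; ID 0 is reserved for unknowns."""
    vocab = {"UNKNOWN": 0}
    for token, _ in Counter(tokens).most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its ID, falling back to the UNKNOWN ID."""
    return [vocab.get(token, vocab["UNKNOWN"]) for token in tokens]

# Hypothetical training text: 'the' appears 3 times and 'cat' twice
training_tokens = "the cat sat on the mat the cat ate".split()
vocab = build_vocabulary(training_tokens, max_size=2)

ids = encode("the cat chased the dog".split(), vocab)
print(vocab)  # {'UNKNOWN': 0, 'the': 1, 'cat': 2}
print(ids)    # [1, 2, 0, 1, 0]
```

Every occurrence of 'the' receives the same ID, while 'chased' and 'dog' fall outside the limited vocabulary and collapse onto the same unknown ID, which is where the loss of information comes from.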

&nbsp;

**Summary of Pros:**

* Simple

* High degree of information stored in each token

* Can limit vocabulary size which works well with datasets containing mostly common words


&nbsp;

**Summary of Cons:**

* Separate tokens are created for similar words (e.g. 'cat' and 'cats')

* Can result in very large vocabulary

* Limiting vocabulary can significantly degrade performance on datasets with many uncommon words

&nbsp;


### 1.5 - Overview of Character-Based Tokenization Methods

Character-based tokenization splits sentences on each character, including letters, numbers, and special characters such as punctuation. This greatly reduces the vocabulary size, to the point where the English language can be represented with a vocabulary size of around 256, instead of the roughly 170,000 needed with word-based approaches [5]. Even East Asian languages such as Chinese and Japanese can see a significant reduction in their vocabulary size, despite using a few thousand unique characters on a daily basis.

In a character-based tokenizer, the following sentence:

&nbsp;

`This sentence is a great, interesting sentence!`

&nbsp;

would be converted to:

&nbsp;

`['T', 'h', 'i', 's', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', 'i', 's', ' ', 'a', ' ', 'g', 'r', 'e', 'a', 't', ',', ' ', 'i', 'n', 't', 'e', 'r', 'e', 's', 't', 'i', 'n', 'g', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', '!']`

&nbsp;


### 1.6 - Pros and Cons of Character-Based Tokenization Methods

Character-based approaches result in a much smaller vocabulary size when compared to word-based methods, and also result in far fewer out-of-vocabulary tokens. This even allows misspelled words to be tokenized (albeit differently than the correct form of the word), rather than being removed immediately due to the frequency-based vocabulary limit.

However, there are a number of drawbacks with this approach too. Firstly, the information stored in a single token produced with a character-based method is low. This is because, unlike the tokens in the word-based method, no semantic or contextual meaning is captured (particularly in the case of western and alphabet-based languages, more so than languages with logosyllabic writing systems, which tend to store much more meaning in a single character). Secondly, the size of the tokenized input that can be fed into a language model is limited with this method, since many more numbers are needed to represent the input string than in representations created using the word-based approach.
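The trade-off between vocabulary size and sequence length can be seen with the example sentence from earlier. This is a minimal sketch using only built-in Python, with the word-level split simplified to whitespace splitting.

```python
sentence = "This sentence is a great, interesting sentence!"

# Character-based: every character (including spaces) is a token
char_tokens = list(sentence)

# Word-based (whitespace split) for comparison
word_tokens = sentence.split()

print(f'Word tokens:      {len(word_tokens)}')
print(f'Character tokens: {len(char_tokens)}')
print(f'Unique characters needed in the vocabulary: {len(set(char_tokens))}')
```

The character-based sequence is several times longer than the word-based one, which is why a model with a fixed maximum input length can read less text when a character-based tokenizer is used.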

&nbsp;

**Summary of Pros:**

* Smaller vocabulary size

* Does not remove misspelled words

&nbsp;

**Summary of Cons:**

* Low information stored in each token, little-to-no contextual or semantic meaning

* Size of input to language models is limited since the output of the tokenizer contains many more numbers than a word-based approach

&nbsp;


### 1.7 - Overview of Subword-Based Tokenization

Subword-based tokenization aims to achieve the benefits of both word-based and character-based methods, by splitting sentences within words. This means that the resulting vocabulary size is smaller than the one found in word-based methods, but larger than the one found in character-based methods. The same is also true for the amount of information stored within each token, which lies in between that of the tokens generated by the previous two methods. The subword approach uses the following two guidelines:

&nbsp;

* Frequently used words should not be split into subwords, but rather be stored as entire tokens

* Infrequently used words should be split into subwords

&nbsp;

Splitting only the infrequently used words gives conjugations, plural forms, etc. a chance to be decomposed into their constituent parts, so that the relationship between tokens is preserved. For example, 'cat' might be a very common word in the dataset, but 'cats' might be less common. For this reason, 'cats' would be split into 'cat' and 's', where 'cat' is now assigned the same value as every other 'cat' token, and 's' is assigned a different value, which can encode the meaning of plurality. Another example would be the word 'tokenization', which can be split into the root word 'token' and the suffix 'ization'. This method can therefore preserve syntactic and semantic similarity [6]. For these reasons, subword-based tokenizers are very commonly used in NLP models today.
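The two guidelines can be illustrated with a deliberately simplified sketch. The word counts, the `min_count` threshold, and the hand-written suffix list are all hypothetical; real subword tokenizers learn where to split from the data rather than relying on a fixed suffix list.

```python
from collections import Counter

def subword_split(word, word_counts, min_count, suffixes=("s", "ing", "ization")):
    """Keep frequent words whole; split rare words into a frequent root plus suffix."""
    # Guideline 1: frequently used words are stored as entire tokens
    if word_counts[word] >= min_count:
        return [word]
    # Guideline 2: infrequent words are split if a frequent root can be found
    for suffix in suffixes:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and word_counts[root] >= min_count:
            return [root, suffix]
    return [word]  # no frequent root found, so the word is left unsplit

# Hypothetical word frequencies from an imagined dataset
word_counts = Counter({"cat": 50, "cats": 3, "token": 40, "tokenization": 2})

print(subword_split("cat", word_counts, min_count=10))           # ['cat']
print(subword_split("cats", word_counts, min_count=10))          # ['cat', 's']
print(subword_split("tokenization", word_counts, min_count=10))  # ['token', 'ization']
```

Here 'cats' and 'cat' share the token 'cat', so the relationship between the two words is kept in the tokenized output.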

&nbsp;


<h2 align="center">Section 2 - Normalization and Pre-Tokenization</h2>



### 2.1 - Overview of the Tokenization Pipeline

The tokenization process requires some pre-processing and post-processing steps, which together comprise the **tokenization pipeline**. This describes the entire series of actions that are taken to convert raw text into tokens. The steps of this pipeline are:

&nbsp;

* Normalization

* Pre-tokenization

* Model

* Post-processing

&nbsp;

The tokenization method (be that subword-based, character-based, etc.) takes place in the model step [7]. This section will cover each of these steps for a tokenizer that uses a subword-based tokenization approach.

&nbsp;

**IMPORTANT NOTE:** all the steps of the tokenization pipeline are handled for the user automatically when using a tokenizer from libraries such as Hugging Face. The entire pipeline is performed by a single entity called the Tokenizer. The cells in this section dive into the inner workings of code that most users do not need to handle manually when working with NLP tasks.


&nbsp;


### 2.2 - Normalization Methods

**Normalization** is the process of *cleaning up* the text before it is split into tokens. This includes converting each character to lowercase, removing accents from characters (e.g. 'é' becomes 'e'), removing unnecessary whitespace, and so on. For example, the string `ThÍs is  áN ExaMPlé     sÉnteNCE` becomes `this is an example sentence` after normalization. Different normalizers will perform different steps, which can be useful depending on the use-case. For example, in some situations the casing or accents might need to be preserved. Depending on the normalizer chosen, different effects can be achieved at this stage.
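The cleaning steps described above can be approximated with the Python standard library alone. This is a minimal sketch rather than the implementation used by any particular tokenizer: the `normalize` function is a hypothetical helper, and it assumes that accents can be removed by decomposing characters with Unicode NFD and dropping the combining marks.

```python
import unicodedata

def normalize(text):
    """Lowercase, strip accents, and collapse runs of whitespace."""
    text = text.lower()
    # NFD splits an accented character into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # split() with no arguments discards all runs of whitespace
    return " ".join(without_accents.split())

print(normalize("ThÍs is  áN ExaMPlé     sÉnteNCE"))  # this is an example sentence
```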

The Hugging Face `tokenizers.normalizers` package contains several basic normalizers that are used by different tokenizers as part of larger models. The normalizer classes can also be imported directly to investigate how they work. The cell below shows the NFC Unicode, Lowercase, and BERT normalizers, which have the following effects on the example sentence:

&nbsp;

* **NFC:** Does not convert casing or remove accents
* **Lower:** Converts casing but does not remove accents
* **BERT:** Converts casing and removes accents

&nbsp;

In [None]:
from tokenizers.normalizers import NFC, Lowercase, BertNormalizer

# Text to normalize
example_sentence = 'ThÍs is  áN ExaMPlé     sÉnteNCE'

# Instantiate normalizer objects
NFCNorm = NFC()
LowercaseNorm = Lowercase()
BertNorm = BertNormalizer()

# Normalize the text
print(f'NFC:   {NFCNorm.normalize_str(example_sentence)}')
print(f'Lower: {LowercaseNorm.normalize_str(example_sentence)}')
print(f'BERT:  {BertNorm.normalize_str(example_sentence)}')


NFC:   ThÍs is  áN ExaMPlé     sÉnteNCE
Lower: thís is  án examplé     séntence
BERT:  this is  an example     sentence



&nbsp;

The normalizers above are used in tokenizer models which can be imported from the Hugging Face `transformers` library. The cell below shows that the normalizers can be accessed using dot notation via `Tokenizer.backend_tokenizer.normalizer`. Some comparisons are shown between the tokenizers to highlight the different normalization methods that are used. Note that in these examples, only the FNet normalizer removes unnecessary whitespace.

&nbsp;


In [None]:
from transformers import FNetTokenizerFast, CamembertTokenizerFast, BertTokenizerFast

# Text to normalize
example_sentence = 'ThÍs is  áN ExaMPlé     sÉnteNCE'

# Instantiate tokenizers
FNetTokenizer = FNetTokenizerFast.from_pretrained('google/fnet-base')
CamembertTokenizer = CamembertTokenizerFast.from_pretrained('camembert-base')
BertTokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Normalize the text
print(f'FNet Output:      {FNetTokenizer.backend_tokenizer.normalizer.normalize_str(example_sentence)}')
print(f'CamemBERT Output: {CamembertTokenizer.backend_tokenizer.normalizer.normalize_str(example_sentence)}')
print(f'BERT Output:      {BertTokenizer.backend_tokenizer.normalizer.normalize_str(example_sentence)}')

FNet Output:      ThÍs is áN ExaMPlé sÉnteNCE
CamemBERT Output: ThÍs is  áN ExaMPlé     sÉnteNCE
BERT Output:      this is  an example     sentence


### 2.3 - Pre-Tokenization Methods

The **pre-tokenization** step is the first splitting of the raw text in the tokenization pipeline. The split is performed to give an upper bound to what the final tokens could be at the end of the pipeline. That is, a sentence can be split into words in the pre-tokenization step, then in the model step some of these words may be split further according to the tokenization method (e.g. subword-based).

Just like with normalization, there are several ways that this step can be performed. For example, a sentence can be split based on every space, every space and some punctuation, or every space and every punctuation.

The cell below shows a comparison between the basic `WhitespaceSplit` pre-tokenizer and the slightly more complex `BertPreTokenizer` from the Hugging Face `tokenizers.pre_tokenizers` package. The output of the whitespace pre-tokenizer leaves the punctuation intact and still attached to the neighbouring words. For example, `includes:` is treated as a single word in this case. The BERT pre-tokenizer, on the other hand, treats punctuation as individual words [8].

&nbsp;


In [None]:
from tokenizers.pre_tokenizers import WhitespaceSplit, BertPreTokenizer

# Text to pre-tokenize - written in lowercase with no unnecessary whitespace to simulate normalization step
example_sentence = "this sentence's content includes: characters, spaces, and punctuation."

# Define helper function to display pre-tokenized output
def print_pretokenized_str(pre_tokens):
    for pre_token in pre_tokens:
        print(f'"{pre_token[0]}", ', end='')

# Instantiate pre-tokenizers
wss = WhitespaceSplit()
bpt = BertPreTokenizer()

# Pre-tokenize the text
print('Whitespace Pre-Tokenizer:')
print_pretokenized_str(wss.pre_tokenize_str(example_sentence))

print('\n\nBERT Pre-Tokenizer:')
print_pretokenized_str(bpt.pre_tokenize_str(example_sentence))


Whitespace Pre-Tokenizer:
"this", "sentence's", "content", "includes:", "characters,", "spaces,", "and", "punctuation.", 

BERT Pre-Tokenizer:
"this", "sentence", "'", "s", "content", "includes", ":", "characters", ",", "spaces", ",", "and", "punctuation", ".", 


&nbsp;

Just as with the normalization methods, you can call the pre-tokenization methods directly from common tokenizers such as the GPT-2 and ALBERT (A Lite BERT) tokenizers. These take a slightly different approach to the standard BERT pre-tokenizer shown above, in that space characters are not removed when splitting the tokens. Instead, they are replaced with special characters that represent where the space was. This has the advantage that the space characters can be ignored in further processing, while the original sentence can still be retrieved if required. The GPT-2 model uses `Ġ`, a capital 'G' with a dot above. The ALBERT model uses a special underscore character (`▁`).

&nbsp;

In [None]:
from transformers import AutoTokenizer

# Text to pre-tokenize - written in lowercase with no unnecessary whitespace to simulate normalization step
example_sentence = "this sentence's content includes: characters, spaces, and punctuation."

# Instantiate the pre-tokenizers
GPT2_PreTokenizer = AutoTokenizer.from_pretrained('gpt2').backend_tokenizer.pre_tokenizer
Albert_PreTokenizer = AutoTokenizer.from_pretrained('albert-base-v1').backend_tokenizer.pre_tokenizer

# Pre-tokenize the text
print('GPT-2 Pre-Tokenizer:')
print_pretokenized_str(GPT2_PreTokenizer.pre_tokenize_str(example_sentence))
print('\n\nALBERT Pre-Tokenizer:')
print_pretokenized_str(Albert_PreTokenizer.pre_tokenize_str(example_sentence))

GPT-2 Pre-Tokenizer:
"this", "Ġsentence", "'s", "Ġcontent", "Ġincludes", ":", "Ġcharacters", ",", "Ġspaces", ",", "Ġand", "Ġpunctuation", ".", 

ALBERT Pre-Tokenizer:
"▁this", "▁sentence's", "▁content", "▁includes:", "▁characters,", "▁spaces,", "▁and", "▁punctuation.", 


&nbsp;
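The reason for replacing spaces with a visible marker, rather than deleting them, is that the split becomes reversible. The sketch below imitates the ALBERT-style output with plain Python; the helper names `pre_tokenize_with_marker` and `restore` are hypothetical, and the real pre-tokenizers handle more cases than a simple split on single spaces.

```python
MARKER = "\u2581"  # '▁', a visible stand-in for a space

def pre_tokenize_with_marker(text):
    """Split on spaces, prefixing every word with the marker."""
    return [MARKER + word for word in text.split(" ")]

def restore(pre_tokens):
    """Undo the split: join, turn markers back into spaces, drop the leading space."""
    return "".join(pre_tokens).replace(MARKER, " ")[1:]

sentence = "this sentence's content includes: characters, spaces, and punctuation."
pre_tokens = pre_tokenize_with_marker(sentence)

print(pre_tokens)
# The original sentence can be recovered exactly from the pre-tokens
assert restore(pre_tokens) == sentence
```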

The cells above show the output of pre-tokenization in a compact format that fits nicely on the screen and omits some of the additional information generated. The cell below shows the results of a BERT pre-tokenization step on the same example sentence without any modifications. The object returned is a Python list containing tuples. Each tuple corresponds to a pre-token, where the first element is the pre-token string, and the second element is a tuple containing the index for the start and end of the string in the original input text. Note that the starting index of the string is inclusive, and the ending index is exclusive.

&nbsp;


In [None]:
from tokenizers.pre_tokenizers import WhitespaceSplit, BertPreTokenizer

# Text to pre-tokenize - written in lowercase with no unnecessary whitespace to simulate normalization step
example_sentence = "this sentence's content includes: characters, spaces, and punctuation."

# Instantiate pre-tokenizer
bpt = BertPreTokenizer()

# Pre-tokenize the text
bpt.pre_tokenize_str(example_sentence)


[('this', (0, 4)),
 ('sentence', (5, 13)),
 ("'", (13, 14)),
 ('s', (14, 15)),
 ('content', (16, 23)),
 ('includes', (24, 32)),
 (':', (32, 33)),
 ('characters', (34, 44)),
 (',', (44, 45)),
 ('spaces', (46, 52)),
 (',', (52, 53)),
 ('and', (54, 57)),
 ('punctuation', (58, 69)),
 ('.', (69, 70))]
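Because the start index is inclusive and the end index is exclusive, the offsets work directly as Python slice bounds. The sketch below builds the same kind of `(pre_token, (start, end))` list with a regular expression; the regex is only an approximation of the BERT pre-tokenizer that happens to agree with it on this sentence, and `pre_tokenize_with_offsets` is a hypothetical helper.

```python
import re

def pre_tokenize_with_offsets(text):
    """Return (pre_token, (start, end)) pairs; start is inclusive, end is exclusive."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "this sentence's content includes: characters, spaces, and punctuation."
pre_tokens = pre_tokenize_with_offsets(text)

# Slicing the original text with the offsets recovers each pre-token
for token, (start, end) in pre_tokens:
    assert text[start:end] == token

print(pre_tokens[:4])
# [('this', (0, 4)), ('sentence', (5, 13)), ("'", (13, 14)), ('s', (14, 15))]
```

Keeping the offsets makes it possible to map any token back to its exact position in the raw text, even after the spaces have been dropped.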

<h2 align="center">Section 3 - Subword Tokenization Methods</h2>



### 3.1 - Subword Tokenization Methods

The model step of the tokenization pipeline is where the tokenization method comes into play. As described earlier, the options here are: word-based, character-based, and subword-based. Subword-based methods are generally favoured, since they were designed to overcome the limitations of the word-based and character-based approaches.

For transformer models, there are three tokenizer methods that are commonly used to implement subword-based tokenization. These include:

&nbsp;

* Byte Pair Encoding (BPE)

* WordPiece

* Unigram

&nbsp;

Each of these uses slightly different techniques to split the less frequent words into smaller tokens, which are laid out in the next few cells. In addition, an implementation of these algorithms written in vanilla Python is also shown. This should help give a solid intuition for how these methods divide text into tokens, and the differences in their implementations.

&nbsp;


### 3.2 - Byte Pair Encoding (BPE) Tokenization

The BPE algorithm is a commonly used tokenizer that is found in many transformer models such as OpenAI's GPT and GPT-2 models, BART, and many others [9-10]. It was originally designed as a text compression algorithm, but has been found to work very well in tokenization tasks for language models. The BPE algorithm aims to decompose a string of text into subword units that appear frequently in a reference corpus (the text used to train the tokenization model) [11]. The BPE model is trained as follows:

&nbsp;

**Step 1) Construct the Corpus**

The input text is given to the normalization and pre-tokenization models to create clean words. The words are then given to the BPE model, which determines the frequency of each word, and stores this number alongside the word in a list called the **corpus**.

&nbsp;

**Step 2) Construct the Vocabulary**

The words from the corpus are then broken down into individual characters, which are added to an empty list called the vocabulary. The algorithm will iteratively add to this vocabulary every time it determines which character pairs can be merged together.

&nbsp;

**Step 3) Find the Frequency of Character Pairs**

The frequency of character pairs is then recorded for each word in the corpus. For example, the word 'cats' will have the character pairs 'ca', 'at', and 'ts'. All the words are examined in this way, and contribute to a global frequency counter. So any instance of 'ca' found in any of the tokens will increase the frequency counter for the 'ca' pair.

&nbsp;

**Step 4) Create a Merging Rule**

When the frequency for each character pair is known, the most frequent character pair is added to the vocabulary. The vocabulary now consists of every individual letter in the tokens, plus the most frequent character pair. This also gives a merging rule that the model can use. For example, if the model knows that 'ca' is the most frequent character pair, it has learned that all adjacent instances of 'c' and 'a' in the corpus can be merged to give 'ca'. This can now be treated as a single character 'ca' for the remainder of the steps.

&nbsp;

**Step 5) Repeat Steps 3 and 4**

Steps 3 and 4 are then repeated, finding more merging rules, and adding more character pairs to the vocabulary. This process continues until the vocabulary size reaches a target size specified at the beginning of the training.
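A single pass through steps 3 and 4 can be sketched with `collections.Counter`. The three-word corpus and its frequencies are hypothetical, and pairs are stored as tuples here purely for convenience.

```python
from collections import Counter

# Hypothetical corpus: each word is a list of symbols paired with its frequency
corpus = [(["c", "a", "t"], 5), (["c", "a", "t", "s"], 2), (["e", "a", "t"], 10)]

# Step 3: count every adjacent pair, weighted by the frequency of the word
pair_counts = Counter()
for symbols, freq in corpus:
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq

# Step 4: the most frequent pair becomes the next merge rule
best_pair = max(pair_counts, key=pair_counts.get)

print(dict(pair_counts))
# {('c', 'a'): 7, ('a', 't'): 17, ('t', 's'): 2, ('e', 'a'): 10}
print(best_pair)  # ('a', 't')
```

The pair ('a', 't') occurs in all three words, so its count is 5 + 2 + 10 = 17 and it is merged first.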

&nbsp;

Now that the BPE algorithm has been trained (i.e. now that all of the merging rules have been found), the model can be used to tokenize any text by first splitting each of the words on every character, and then merging according to the merge rules.
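Applying a trained model to new text can be sketched as below. The function `bpe_tokenize` is a hypothetical helper, and the list of merge rules is made up for illustration; in practice the rules come from the training procedure and are applied in the order in which they were learned.

```python
def bpe_tokenize(word, merge_rules):
    """Split a word into characters, then apply each merge rule in learned order."""
    symbols = list(word)
    for left, right in merge_rules:
        merged, idx = [], 0
        while idx < len(symbols):
            # Merge only when the pair matches and a right-hand neighbour exists
            if idx < len(symbols) - 1 and symbols[idx] == left and symbols[idx + 1] == right:
                merged.append(left + right)
                idx += 2
            else:
                merged.append(symbols[idx])
                idx += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, listed in the order they were learned
merge_rules = [("a", "t"), ("e", "at"), ("c", "at"), ("i", "n"), ("in", "g")]

print(bpe_tokenize("eating", merge_rules))  # ['eat', 'ing']
print(bpe_tokenize("cats", merge_rules))    # ['cat', 's']
```

The order matters: ('e', 'at') can only fire after ('a', 't') has already produced the symbol 'at'.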

&nbsp;


### 3.3 - Implementation of BPE in Python

Below shows a vanilla Python implementation of the BPE algorithm, following the steps outlined above. The next cell will show the code in action using a toy dataset.

&nbsp;


In [64]:
class TargetVocabularySizeError(Exception):
    def __init__(self, message):
        super().__init__(message)

class BPE:
    '''An implementation of the Byte Pair Encoding tokenizer.'''

    def __init__(self, words, target_vocab_size):
        self.words = words
        self.target_vocab_size = target_vocab_size
        self.corpus = self.initialise_corpus(self.words)
        self.corpus_history = [self.corpus]
        self.vocabulary = list(set(''.join(words)))
        self.vocabulary_size = len(self.vocabulary)
        self.merge_rules = []

        # The target size cannot be smaller than the initial character vocabulary
        if len(self.vocabulary) > self.target_vocab_size:
            raise TargetVocabularySizeError(
                'Error: Target vocabulary size must be at least the '
                f'initial vocabulary size ({len(self.vocabulary)})')

        else:
            # Iteratively add vocabulary until the target vocabulary size is reached
            while len(self.vocabulary) < self.target_vocab_size:
                try:
                    self.create_merge_rule(self.corpus)
                    self.corpus = self.merge(self.corpus)
                    self.corpus_history.append(self.corpus)

                # If no further merging is possible
                except ValueError:
                    print('Exiting: No further merging is possible')
                    break


    def calculate_frequency(self, words):
        ''' Calculate the frequency for each word in a list of words.

            Take in a list of words stored as strings and return a list of tuples
            where each tuple contains a string from the words list, and an
            integer representing its frequency count in the list.

            Args:
                words (list):  A list of words (strings) in any order.

            Returns:
                corpus (list[tuple(str, int)]):
                               A list of tuples where the first element is a
                               string of a word in the words list, and the
                               second element is an integer representing the
                               frequency of the word in the list.
        '''
        freq_dict = dict()

        for word in words:
            if word not in freq_dict:
                freq_dict[word] = 1
            else:
                freq_dict[word] += 1

        corpus = [(word, freq_dict[word]) for word in freq_dict.keys()]

        return corpus


    def initialise_corpus(self, words):
        ''' Split each word into characters and count the word frequency.

            Split each word in the input word list on every character. For each
            word, store the split word in a list as the first element inside a
            tuple. Store the frequency count of the word as an integer as the
            second element of the tuple. Create a tuple for every word in this
            fashion and store the tuples in a list called 'corpus', then return
            the corpus list.

            Args:
                words (list):  A list of words (strings) in any order.

            Returns:
                corpus (list[tuple(list, int)]):
                               A list of tuples where the first element is a
                               list of a word in the words list (where the
                               elements are the individual characters of the
                               word), and the second element is an integer
                               representing the frequency of the word in the
                               list.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [([*word], freq) for (word, freq) in corpus]
        return corpus


    def find_pair_frequencies(self, corpus):
        ''' Find the frequency of each character pair in the corpus.

            Loop through the corpus and calculate the frequency of each pair
            of adjacent characters across every word. Return a dictionary of
            each character pair as the keys and the corresponding frequency as
            the values.

            Args:
                corpus (list[tuple(list, int)]):
                               A list of tuples where the first element is a
                               list of a word in the words list (where the
                               elements are the individual characters (or
                               subwords in later iterations) of the
                               word), and the second element is an integer
                               representing the frequency of the word in the
                               list.

            Returns:
                pair_freq_dict (dict): A dictionary where the keys are the
                                       character pairs from the input corpus and
                                       the values are an integer representing
                                       the frequency of the pair in the corpus.
        '''
        pair_freq_dict = dict()

        for word, word_freq in corpus:
            for idx in range(len(word)-1):

                char_pair = f'{word[idx]},{word[idx+1]}'

                if char_pair not in pair_freq_dict:
                    pair_freq_dict[char_pair] = word_freq
                else:
                    pair_freq_dict[char_pair] += word_freq

        return pair_freq_dict


    def create_merge_rule(self, corpus):
        ''' Create a merge rule and add it to the self.merge_rules list.

            Args:
                corpus (list[tuple(list, int)]):
                               A list of tuples where the first element is a
                               list of a word in the words list (where the
                               elements are the individual characters (or
                               subwords in later iterations) of the
                               word), and the second element is an integer
                               representing the frequency of the word in the
                               list.

            Returns:
                None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
        merge_rule = most_frequent_pair.split(',')
        self.merge_rules.append(merge_rule)
        # Store the merged token (e.g. 'at') rather than the 'a,t' dictionary key
        self.vocabulary.append(''.join(merge_rule))


    def merge(self, corpus):
        ''' Loop through the corpus and perform the latest merge rule.

            Args:
                corpus (list[tuple(list, int)]):
                            A list of tuples where the first element is a
                            list of a word in the words list (where the
                            elements are the individual characters (or
                            subwords in later iterations) of the
                            word), and the second element is an integer
                            representing the frequency of the word in the
                            list.

            Returns:
                new_corpus (list[tuple(list, int)]):
                            A modified version of the input argument where the
                            most recent merge rule has been applied to merge
                            the most frequent adjacent characters.
        '''
        merge_rule = self.merge_rules[-1]
        new_corpus = []

        for word, word_freq in corpus:
            new_word = []
            idx = 0

            while idx < len(word):
                # If a merge pattern has been found (the bound check on idx
                # prevents reading past the end of the word)
                if (idx < len(word) - 1) and (word[idx] == merge_rule[0]) and\
                (word[idx+1] == merge_rule[1]):

                    new_word.append(word[idx]+word[idx+1])
                    idx += 2
                # If a merge pattern has not been found
                else:
                    new_word.append(word[idx])
                    idx += 1

            new_corpus.append((new_word, word_freq))

        return new_corpus

### 3.4 - Using the BPE Algorithm with a Toy Dataset

The BPE algorithm is used below with a toy dataset that contains some words about cats. The goal of the tokenizer is to determine the most useful, meaningful subunits of the words in the dataset to be used as tokens. From inspection, it is clear that units such as 'cat', 'eat' and 'ing' would be useful subunits.

Running the tokenizer with a target vocabulary size of 21 (which only requires 5 merges) is enough for the tokenizer to capture all the desired subunits mentioned above. With a larger dataset, the target vocabulary would be much higher, but this shows how powerful the BPE tokenizer can be.

&nbsp;


In [65]:
# Toy dataset of words about cats
words = ['cat', 'cat', 'cat', 'cat', 'cat',
         'cats', 'cats',
         'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat',
         'eating', 'eating', 'eating',
         'running', 'running',
         'jumping',
         'food', 'food', 'food', 'food', 'food', 'food']

# Instantiate the tokenizer
bpe = BPE(words, 21)

# Print out the corpus at each stage of the process, and the merge rule used
print(f'INITIAL CORPUS:\n{bpe.corpus_history[0]}\n')
for rule, corpus in list(zip(bpe.merge_rules, bpe.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')

INITIAL CORPUS:
[(['c', 'a', 't'], 5), (['c', 'a', 't', 's'], 2), (['e', 'a', 't'], 10), (['e', 'a', 't', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "a" and "t"
[(['c', 'at'], 5), (['c', 'at', 's'], 2), (['e', 'at'], 10), (['e', 'at', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "e" and "at"
[(['c', 'at'], 5), (['c', 'at', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "c" and "at"
[(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]

NEW MERGE RULE: Combine "i" and "n"
[(['cat'], 5), (['cat'

### 3.5 - Issues with BPE Tokenizers

BPE tokenizers can only recognise characters that have appeared in the training data. For example, in the implementation above, the training data only contained the characters needed to talk about cats, which happened to not require a 'z'. Therefore, that version of the tokenizer does not have the character 'z' in its vocabulary, and so would convert that character to an unknown token if the model was used to tokenize real data (in fact, no error handling was added to instruct the model to create unknown tokens, so it would crash, but productionised models behave this way).

The BPE tokenizers used in GPT-2 and RoBERTa do not have this issue due to a trick within the code. Instead of analysing the training data as Unicode characters, they analyse the bytes that encode each character. Because there are only 256 possible byte values, a small base vocabulary is able to tokenize any character the model might see.
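
The idea behind byte-level tokenization can be illustrated with Python's built-in UTF-8 encoding. This is only a sketch of the principle, not the GPT-2 or RoBERTa implementation: it shows that a character never seen in training, or a non-ASCII one, still decomposes into byte values between 0 and 255.

```python
# Any string can be broken down into UTF-8 bytes, each in the range 0-255
for char in ['z', 'é']:
    byte_values = list(char.encode('utf-8'))
    print(char, byte_values)
# z [122]
# é [195, 169]

# The mapping is lossless: the bytes decode back to the original text
text = 'cat eats'
byte_ids = list(text.encode('utf-8'))
restored = bytes(byte_ids).decode('utf-8')
print(restored == text)  # True
```

A byte-level BPE tokenizer starts from these 256 values as its base vocabulary and learns merges on top of them, so no input needs an unknown token.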

&nbsp;


### 3.6 - WordPiece Tokenization

WordPiece is a tokenizer method developed by Google for their seminal BERT model, and has been used in derivative models such as DistilBERT and MobileBERT. Other models ha
&nbsp;


In [None]:
# WordPiece Tokenizer


### 3.7 - Unigram Tokenization


&nbsp;


In [None]:
# Unigram Tokenizer


<h2 align="center">Section 4 - Tokenizers in Python Libraries</h2>

In [None]:
!pip install transformers datasets tokenizers segeval -q

In [None]:
# HuggingFace libraries
import datasets
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

# Other libraries
import numpy as np

### 4.2 - The CoNLL2003 Dataset

The CoNLL2003 dataset is one of the most commonly used datasets for NER tasks. It was first introduced in the paper *Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition* during the 2003 Conference on Natural Language Learning (CoNLL) [1]. The aim of the task was to advance the state of the art in NER and to develop language-independent models. As such, the dataset included data for both English and German for researchers to use in their model building.

This notebook will focus solely on the English data, which was produced by taking news articles from the Reuters corpus. This corpus consists of stories from between August 1996 and August 1997. The training set was constructed using ten days' worth of data from late August 1996, and the test data was taken from December of the same year. Preprocessed raw data is also included, which was taken from September 1996.

In the Hugging Face `datasets` library, this data is stored in a `DatasetDict` containing train, validation, and test splits of 14,041, 3,250, and 3,453 examples respectively.

In [None]:
conll2003 = datasets.load_dataset('conll2003')
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

The cell below prints out the first element of the training data, which is stored in a Python dictionary. The element contains a single sentence from the Reuters articles, along with some additional information about the sentence, stored as key-value pairs. The first key, `id`, stores the id of the element, which in this case is 0. The next key, `tokens`, stores a list of all the tokens in the sentence. In this case, a token has been taken to be a word or some punctuation, such as the '.' at the end of the sentence.

In [None]:
conll2003['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

The next key, `pos_tags`, assigns a **part-of-speech** (POS) tag to each of the tokens. Each integer is an index into the list of 47 tags shown below, which describe whether the word is a noun, adverb, etc.

In [None]:
conll2003['train'].features['pos_tags']

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)
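
The integer tags can be decoded by indexing into the list of names. The sketch below copies the names and the first training example from the outputs above so that it runs without downloading the dataset; the variable names are illustrative. (The `ClassLabel` type in `datasets` also offers an `int2str` method for this conversion.)

```python
# POS tag names, copied from the ClassLabel output above
pos_names = ['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD',
             'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN',
             'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$',
             'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG',
             'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

# First training example, copied from the output above
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British',
          'lamb', '.']
pos_tags = [22, 42, 16, 21, 35, 37, 16, 21, 7]

# Each integer tag is an index into the list of names
decoded = [(token, pos_names[tag]) for token, tag in zip(tokens, pos_tags)]
for token, name in decoded:
    print(f'{token:10}{name}')
```

Here 'EU' decodes to `NNP` (proper noun), 'rejects' to `VBZ` (third-person singular present verb), and 'German' to `JJ` (adjective).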

The `chunk_tags` are used to show whether tokens belong to a phrase within the sentence, and which phrase they belong to. The cell below shows the possible values that these tags can take, with the prefixes B, I, and O being used to show whether the token is at the beginning of a phrase, inside a phrase, or not in a phrase, respectively.

In [None]:
conll2003['train'].features['chunk_tags']

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)

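Finally, the `ner_tags` value shows which entity type has been assigned to each of the tokens. A `0` (the `O` label) means that no entity type has been assigned, and the other numbers represent the assigned entity type according to the list below. Note that the CoNLL2003 dataset only includes four entity types: person (`PER`), organization (`ORG`), location (`LOC`), and miscellaneous (`MISC`), which covers named entities that do not belong to the other three groups. As with the chunk tags, the B and I prefixes mark whether a token begins an entity or continues one.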

In [None]:
conll2003['train'].features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
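
The B and I prefixes make it possible to group consecutive tokens into whole entities. Below is a minimal sketch of that grouping using the first training example. `extract_entities` is a hypothetical helper written for this illustration, not part of the `datasets` library, and it assumes well-formed tags (an I- tag with no matching open entity is simply ignored).

```python
ner_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC',
             'B-MISC', 'I-MISC']

def extract_entities(tokens, tag_ids, names):
    """Group tokens into (entity text, entity type) pairs using B/I/O tags."""
    entities = []
    current_tokens, current_type = [], None

    for token, tag_id in zip(tokens, tag_ids):
        label = names[tag_id]
        if label.startswith('B-'):
            # A B- tag closes any open entity and starts a new one
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith('I-') and current_type == label[2:]:
            # An I- tag of the same type extends the open entity
            current_tokens.append(token)
        else:
            # An O tag (or a stray I- tag) closes any open entity
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British',
          'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

entities = extract_entities(tokens, ner_tags, ner_names)
print(entities)  # [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```

Every entity in this sentence is a single token, so only B- tags appear; a multi-token name would carry a B- tag on its first token and I- tags on the rest.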

### 4.3 - Initialise the Tokenizer

In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

element_0 = conll2003['train'][0]
tokenized_input = tokenizer(element_0['tokens'], is_split_into_words=True)
tokenized_input

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])

['[CLS]',
 'eu',
 'rejects',
 'german',
 'call',
 'to',
 'boycott',
 'british',
 'lamb',
 '.',
 '[SEP]']

# References

[x] CoNLL2003 Paper - [ArXiv](https://arxiv.org/pdf/cs/0306050v1.pdf)

[1] Token Definition - [Stanford NLP Group](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html#:~:text=A%20token%20is%20an%20instance,useful%20semantic%20unit%20for%20processing.)

[2] Word Tokenizers - [Towards Data Science](https://towardsdatascience.com/top-5-word-tokenizers-that-every-nlp-data-scientist-should-know-45cc31f8e8b9#:~:text=Tokenization%20is%20the%20process%20of,I%E2%80%9D%20and%20%E2%80%9Cwon%E2%80%9D.)

[3] Tokenizers - [Hugging Face](https://huggingface.co/docs/transformers/tokenizer_summary)

[4] TransformerXL Paper - [ArXiv](https://arxiv.org/abs/1901.02860)

[5] Word-Based, Subword, and Character-Based Tokenizers - [Towards Data Science](https://towardsdatascience.com/word-subword-and-character-based-tokenization-know-the-difference-ea0976b64e17)

[6] A Comprehensive Guide to Subword Tokenizers - [Towards Data Science](https://towardsdatascience.com/a-comprehensive-guide-to-subword-tokenisers-4bbd3bad9a7c)

[7] The Tokenization Pipeline - [Hugging Face](https://huggingface.co/docs/tokenizers/pipeline)

[8] Pre-tokenizers - [Hugging Face](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)

[9] Language Models are Unsupervised Multitask Learners - [OpenAI](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)

[10] BART Model for Text Autocompletion in NLP - [Geeks for Geeks](https://www.geeksforgeeks.org/bart-model-for-text-auto-completion-in-nlp/)

[11] Byte Pair Encoding - [Hugging Face](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt#byte-pair-encoding-tokenization)