# Notebook Objective

This notebook studies and implements tokenizers that are prominent in the literature. Two tokenizers are implemented:

- Simple Regex-Based Tokenizer
- BPE (Byte Pair Encoding) Tokenizer

## Importing Libraries

In [1]:
import os
import pathlib
import regex as re
%reload_ext autoreload
%autoreload 2

## Configuration

In [None]:
class Config:
    def __init__(self):
        self.dataset_path = pathlib.Path('./datasets')
        self.dataset_file = 'the-verdict.txt'

cfg = Config()

## Loading Data

In [3]:
with open(os.path.join(cfg.dataset_path, cfg.dataset_file), 'r') as file:
    # Read but ignore lines with only new line character
    data = [line for line in file.readlines() if line != '\n']

raw_data = ' '.join(data)
print(f'Total number of characters in the data: {len(raw_data)}')
print(f'Total number of unique characters in the data: {len(set(raw_data))}')
print(raw_data[:400])

Total number of characters in the data: 20479
Total number of unique characters in the data: 62
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)
 "The height of his glory"--that was what the women called it. I can hear


## Regex-Based Tokenizer

A regex-based tokenizer uses a regular expression to convert raw text into tokens. The usual approach is to split the raw text into words using heuristics such as splitting on spaces, tabs, newlines, and punctuation such as exclamation points. The unique words in the split text form the vocabulary, and each vocabulary entry is assigned a unique integer that maps the token to its token index.
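
As a minimal sketch of the split-then-index idea on a toy sentence (the helper name `build_vocab` and the `toy_*` variables are illustrative assumptions, and the standard-library `re` module is used here in place of `regex`):

```python
import re

def build_vocab(text, split_expression=r'(\s+)'):
    """Split text, drop whitespace-only pieces, and map each unique token to an id."""
    pieces = re.split(split_expression, text)
    tokens = [piece for piece in pieces if piece.strip()]
    # Sorting makes the id assignment deterministic.
    vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
    return tokens, vocab

toy_tokens, toy_vocab = build_vocab('the cat sat on the mat')
print(toy_tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(toy_vocab)   # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```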

It is a design-driven tokenizer: domain knowledge and the target task determine how the raw data should be split. For example, when working with a code generator, tabs and spaces cannot be discarded and need their own tokens in the vocabulary.
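
To illustrate the code-generation case, the sketch below keeps tabs, newlines, and runs of spaces as tokens instead of discarding them. The function name `tokenize_code` and the pattern are assumptions chosen for this example, not a pattern used elsewhere in this notebook.

```python
import re

def tokenize_code(source):
    """Tokenize source code while keeping tabs, newlines and runs of spaces."""
    # One alternative per token kind: newline, tab, run of spaces,
    # identifier or number, or any other single non-space character.
    # Other whitespace (for example '\r') is not matched and would be dropped.
    pattern = r'\n|\t| +|\w+|[^\w\s]'
    return re.findall(pattern, source)

snippet = 'if x:\n\treturn 1'
code_tokens = tokenize_code(snippet)
print(code_tokens)  # ['if', ' ', 'x', ':', '\n', '\t', 'return', ' ', '1']
```

Because every character of `snippet` ends up in some token, `''.join(code_tokens)` reproduces the input exactly, which is the property a code tokenizer typically needs.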

__NOTE:__ Special tokens (such as `<unk>` or `|unk|`) may also be required to handle words that are not present in the vocabulary.

### Building Vocabulary

In [4]:
def text_to_tokens(raw_text, split_expression):
    return re.split(split_expression, raw_text)

In [5]:
split_expression = r'(\s+)' # split on runs of whitespace (\s also matches tabs and newlines)
tokens = text_to_tokens(raw_data, split_expression)
print(tokens[:200])
print(f'Number of tokens: {len(tokens)}')

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that,', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory,', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting,', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow,', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera.', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'have', ' ', 'been', ' ', 'Rome', ' ', 'or', ' ', 'Florence.)', '\n ', '"The', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory"--that', ' ', 'was', ' ', 'what', ' ', 'the', ' ', 'women', ' ', 'called', ' ', 'it.', ' ', 'I', ' ', 'can', ' ', 'hear', ' ', 'Mrs.', ' ', 

In [6]:
split_expression = r'(\s+|\n)' # \n is already covered by \s, so the output matches the previous cell
tokens = text_to_tokens(raw_data, split_expression)
print(tokens[:200])
print(f'Number of tokens: {len(tokens)}')

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that,', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory,', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting,', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow,', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera.', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'have', ' ', 'been', ' ', 'Rome', ' ', 'or', ' ', 'Florence.)', '\n ', '"The', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory"--that', ' ', 'was', ' ', 'what', ' ', 'the', ' ', 'women', ' ', 'called', ' ', 'it.', ' ', 'I', ' ', 'can', ' ', 'hear', ' ', 'Mrs.', ' ', 

In [7]:
split_expression = r'(-|--|---|\s+|\n)' # split on whitespace and dashes; '-' is tried first, so '--' comes out as two '-' tokens (merged in a later cell)
tokens = text_to_tokens(raw_data, split_expression)
print(tokens[:200])
print(f'Number of tokens: {len(tokens)}')

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '-', '', '-', 'though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough', '-', '', '-', 'so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that,', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory,', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting,', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow,', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera.', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'have', ' ', 'been', ' ', 'Rome', ' ', 'or', ' ', 'Florence.)', '\n ', '"The', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory"', '-', '', '-', 'that', ' ', 'was', ' ', 'what', ' ', 'the', ' ', 'women', ' ', 'called', ' ', 'it.', ' ', 
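
The `'-', '', '-'` triples in the output above appear because regex alternation is tried left to right: the single `-` alternative matches first, so `--` is never attempted. The small comparison below (standard-library `re`; the variable names are illustrative) shows the effect of listing the longest alternative first.

```python
import re

text = 'genius--though'

# '-' is listed first, so it wins at every dash and '--' is never tried.
short_first = re.split(r'(-|--|---)', text)
print(short_first)  # ['genius', '-', '', '-', 'though']

# Longest alternative first: '--' is matched as a single delimiter.
long_first = re.split(r'(---|--|-)', text)
print(long_first)   # ['genius', '--', 'though']
```

With the longest-first ordering, a separate step to recombine dashes would likely be unnecessary, although the cells here keep the original pattern.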

In [8]:
split_expression = r'([,.:;?!_\(\)\"\']|-|--|---|\s+|\n)' # also split on punctuation
tokens = text_to_tokens(raw_data, split_expression)
print(tokens[:200])
print(f'Number of tokens: {len(tokens)}')

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '-', '', '-', 'though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough', '-', '', '-', 'so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that', ',', '', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory', ',', '', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting', ',', '', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow', ',', '', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera', '.', '', ' ', '', '(', 'Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'have', ' ', 'been', ' ', 'Rome', ' ', 'or', ' ', 'Florence', '.', '', ')', '', '\n ', '', '"', 'The', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory', '"', '', '-', '', '-', 'that', ' '

In [9]:
# Drop empty and whitespace-only tokens; punctuation tokens are kept because they add context
tokens_cleaned = [token.strip() for token in tokens if token.strip()]
print(tokens_cleaned[:50])
print(f'Number of tokens: {len(tokens_cleaned)}')

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '-', '-', 'though', 'a', 'good', 'fellow', 'enough', '-', '-', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and']
Number of tokens: 4863


In [10]:
# Merge consecutive '-' tokens back into '--' or '---'
final_tokens = []
i = 0
# NOTE: the bound len(tokens_cleaned) - 1 means the last token is never appended
while i < (len(tokens_cleaned) - 1):
    if tokens_cleaned[i] == '-' and tokens_cleaned[i+1] == '-':
        if i+2 < len(tokens_cleaned) and tokens_cleaned[i+2] == '-':
            final_tokens.append('---')
            i = i+2
        else:
            final_tokens.append('--')
            i = i+1
    else:
        final_tokens.append(tokens_cleaned[i])
    i = i+1
print(final_tokens[:50])
print(f'Number of tokens: {len(final_tokens)}')

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself']
Number of tokens: 4765
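
The merging loop above uses the bound `len(tokens_cleaned) - 1`, so the final token is never appended. A possible variant that visits every token is sketched below; `merge_dashes` is a hypothetical name and is not used elsewhere in this notebook, so the recorded outputs still reflect the original loop.

```python
def merge_dashes(tokens):
    """Merge runs of two or three consecutive '-' tokens into '--' or '---'."""
    merged = []
    i = 0
    while i < len(tokens):  # visit every token, including the last one
        if tokens[i] == '-':
            # Count consecutive '-' tokens, capped at three.
            run = 1
            while run < 3 and i + run < len(tokens) and tokens[i + run] == '-':
                run += 1
            merged.append('-' * run)
            i += run
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_dashes(['genius', '-', '-', 'though', 'Gisburn']))
# ['genius', '--', 'though', 'Gisburn']
```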


### Encoding/Decoding text to token index

In [11]:
vocab = sorted(set(final_tokens))
print(f'Number of unique tokens: {len(vocab)}')
print(vocab[-5:])
tokens_to_idx = {token:idx for idx, token in enumerate(vocab)}
tokens_to_idx['|unk|'] = len(tokens_to_idx)
idx_to_token = {idx:token for token, idx in tokens_to_idx.items()}

Number of unique tokens: 1140
['yet', 'you', 'younger', 'your', 'yourself']


In [12]:
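def post_split_processing(tokens):
    """Drop whitespace-only tokens and merge consecutive '-' tokens into '--' or '---'."""
    tokens_cleaned = [token.strip() for token in tokens if token.strip()]
    final_tokens = []
    i = 0
    # NOTE: this bound skips the final token, which is why 'Gisburn' and '.' are
    # missing from the outputs of the next two cells
    while i < (len(tokens_cleaned) - 1):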
        if tokens_cleaned[i] == '-' and tokens_cleaned[i+1] == '-':
            if i+2 < len(tokens_cleaned) and tokens_cleaned[i+2] == '-':
                final_tokens.append('---')
                i = i+2
            else:
                final_tokens.append('--')
                i = i+1
        else:
            final_tokens.append(tokens_cleaned[i])
        i = i+1
    return final_tokens

In [13]:
sentence = 'I HAD always thought Jack Gisburn' # Test Sentence
final_split_expression = r'([,.:;?!_\(\)\"\']|-|--|---|\s+|\n)' # same pattern that was used to build the vocabulary
sentence_tokens = post_split_processing(text_to_tokens(sentence, final_split_expression))
print(sentence_tokens)
sentence_tokens = [tokens_to_idx[word_token] for word_token in sentence_tokens] # raises KeyError if a token is not in the vocabulary
print(sentence_tokens)
decoded_sentence = ' '.join([idx_to_token[token_idx] for token_idx in sentence_tokens])
print(decoded_sentence)

['I', 'HAD', 'always', 'thought', 'Jack']
[54, 45, 150, 1013, 58]
I HAD always thought Jack


In [14]:
sentence = 'My name is Andrew.' # Test Sentence
final_split_expression = r'([,.:;?!_\(\)\"\']|-|--|---|\s+|\n)' # same pattern that was used to build the vocabulary
sentence_tokens = post_split_processing(text_to_tokens(sentence, final_split_expression))
print(sentence_tokens)
sentence_tokens = [tokens_to_idx.get(word_token, tokens_to_idx['|unk|']) for word_token in sentence_tokens]
print(sentence_tokens)
decoded_sentence = ' '.join([idx_to_token[token_idx] for token_idx in sentence_tokens])
print(decoded_sentence)

['My', 'name', 'is', 'Andrew']
[69, 1140, 588, 1140]
My |unk| is |unk|
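
Decoding with `' '.join(...)` puts a space before every token, so punctuation comes back as `glory ,` instead of `glory,`. One possible post-processing step is sketched below; `detokenize` is a hypothetical helper and the set of punctuation it handles is an assumption.

```python
import re

def detokenize(tokens):
    """Join tokens with spaces, then remove the space before closing punctuation."""
    text = ' '.join(tokens)
    # ' '.join puts a space before every token; remove it before , . : ; ? ! )
    # Quotes and opening brackets are not handled in this sketch.
    return re.sub(r'\s+([,.:;?!)])', r'\1', text)

print(detokenize(['in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had']))
# in the height of his glory, he had
```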


### Trying our tokenizer class functionality

In [18]:
from tokenizer.simple_tokenizer import RegexTokenizer
split_regex= r'([,.:;?!_\(\)\"\']|-|--|---|\s+|\n)'
tokenizer = RegexTokenizer(raw_data, split_regex, special_tokens={'|unk|', }, unknown_token='|unk|')

########## Build Vocabulary ##########


In [20]:
sentence = 'My name is Andrew.'
encoded_sentence = tokenizer.encode(sentence)
print(encoded_sentence)
decoded_sentence = tokenizer.decode(encoded_sentence)
print(decoded_sentence)

[69, 1140, 588, 1140]
My |unk| is |unk|
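
The source of `tokenizer.simple_tokenizer.RegexTokenizer` is not shown in this notebook. As a rough illustration of how a class with a similar `encode`/`decode` interface could be put together, here is a hypothetical stand-in; the class name, its internals, and the placement of special tokens at the end of the vocabulary are all assumptions.

```python
import re

class SketchRegexTokenizer:
    """Hypothetical stand-in; not the notebook's RegexTokenizer."""

    def __init__(self, raw_text, split_regex, special_tokens=('|unk|',), unknown_token='|unk|'):
        self.split_regex = split_regex
        # unknown_token is assumed to be one of special_tokens.
        self.unknown_token = unknown_token
        vocab = sorted(set(self._split(raw_text)))
        # Special tokens are appended after the sorted corpus tokens.
        vocab += [token for token in sorted(special_tokens) if token not in vocab]
        self.token_to_idx = {token: idx for idx, token in enumerate(vocab)}
        self.idx_to_token = {idx: token for token, idx in self.token_to_idx.items()}

    def _split(self, text):
        # Drop empty and whitespace-only pieces produced by re.split.
        return [piece.strip() for piece in re.split(self.split_regex, text) if piece.strip()]

    def encode(self, text):
        unk = self.token_to_idx[self.unknown_token]
        return [self.token_to_idx.get(token, unk) for token in self._split(text)]

    def decode(self, ids):
        return ' '.join(self.idx_to_token[idx] for idx in ids)

sketch = SketchRegexTokenizer('the cat sat.', r'([,.:;?!]|\s+)')
ids = sketch.encode('the dog sat.')
print(ids)                 # [3, 4, 2, 0]
print(sketch.decode(ids))  # the |unk| sat .
```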
