In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2", add_prefix_space=True)
tokenizer("Hello world")["input_ids"]

In [None]:
tokenizer("Hello world")

In [None]:
tokenizer("Hello worldly beings")

In [None]:
tokenizer("Hello otherworldly beings")

In [None]:
type(tokenizer)

In [None]:
tokenizer.decode([15496, 995])

In [None]:
for text in ('Hello world', 'Hello worldly beings', 'Hello otherworldly beings'):
    print(tokenizer.decode(tokenizer(text)['input_ids']))

In [None]:
for text in ('Hello world', 'Hello worldly beings', 'Hello otherworldly beings'):
    print(tokenizer.encode(text))

In [None]:
for text in ('Hello world', 'Hello worldly beings', 'Hello otherworldly beings'):
    print([tokenizer.decode(token_id) for token_id in tokenizer(text, is_split_into_words=True)['input_ids']])

In [None]:
for text in ('Hello world', 'Hello worldly beings', 'Hello otherworldly beings'):
    print(tokenizer.convert_ids_to_tokens(tokenizer(text)['input_ids']))

In [None]:
tokenizer.get_added_vocab()

In [None]:
for text in ('Hello world', 'Hello worldly beings', 'Hello otherworldly beings'):
    print(tokenizer.tokenize(text))

In [None]:
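def clean_tokenized_text(tokenized_text):
    """Join tokens into one string: 'Ġ' (word start) becomes a space, other tokens get a '#' prefix."""
    # A '#' therefore marks a split inside a word.
    words = [wd.replace('Ġ', ' ') if wd.startswith('Ġ') else '#' + wd for wd in tokenized_text]
    return ''.join(words)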
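
The `Ġ` that `clean_tokenized_text` replaces is how GPT-2's byte-level BPE displays a leading space: every byte is mapped to a printable character, and bytes that are not printable get shifted above code point 255. The following is a minimal re-implementation sketch of that mapping, not the library's own code; the function name `byte_to_char_table` is made up for this illustration.

```python
def byte_to_char_table():
    """Map each of the 256 byte values to a printable Unicode character.

    Sketch of the byte-to-character table used by GPT-2-style byte-level BPE.
    The function name is hypothetical; only the mapping idea is taken from GPT-2.
    """
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7E + 1))    # '!' .. '~'
        + list(range(0xA1, 0xAC + 1))  # inverted '!' .. not sign
        + list(range(0xAE, 0xFF + 1))  # registered sign .. y with diaeresis
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte (controls, space, ...) is shifted to 256, 257, ...
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


table = byte_to_char_table()
print(table[ord(" ")])   # Ġ (U+0120): a space byte
print(table[ord("\n")])  # Ċ (U+010A): a newline byte
```

So a token such as `Ġworld` is simply the byte sequence for `" world"`, which is why replacing `Ġ` with a space recovers readable text.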

# Pre-trained tokenizer on sample text

In [None]:
# Read the sample text; the explicit encoding avoids platform-dependent defaults
with open('28_sample_en_text.txt', encoding='utf-8') as f:
    text = f.read()

print(clean_tokenized_text(tokenizer.tokenize(text)))

It's clear that the GPT-2 tokenizer was trained with a large number of merges (its vocabulary has 50,257 entries), because barely any words get split. Still, a few do; here's an example:

In [None]:
tokenizer.tokenize('debutant')

I bet that if I train a BPE encoder on the Bible with a low number of merges, there will be many more splits. The open question is how long training would take with the maximum number of merges.
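
To make the link between the number of merges and the number of splits concrete, here is a small sketch that applies only the first `k` rules of an ordered merge list to a single word. The merge list `toy_merges` is hand-written for illustration and is not learned from any corpus; `apply_merges` is a hypothetical helper, not part of `transformers` or `word_splitting`.

```python
def apply_merges(word, merges):
    """Split `word` into characters, then apply each merge rule in order."""
    pieces = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(pieces):
            # Fuse the pair wherever it occurs as two adjacent pieces.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


# Hypothetical merge rules, in the order a trainer might have learned them.
toy_merges = [
    ("d", "e"), ("a", "n"), ("b", "u"), ("an", "t"),
    ("bu", "t"), ("but", "ant"), ("de", "butant"),
]

for k in (0, 2, 4, 6, 7):
    print(k, apply_merges("debutant", toy_merges[:k]))
# 0 ['d', 'e', 'b', 'u', 't', 'a', 'n', 't']
# 2 ['de', 'b', 'u', 't', 'an', 't']
# 4 ['de', 'bu', 't', 'ant']
# 6 ['de', 'butant']
# 7 ['debutant']
```

With fewer rules available, the word stays in more pieces; each additional rule can only fuse pieces, never split them.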

# Train a tokenizer on (a fragment of) the Bible

In [None]:
from word_splitting import train_tokenizer

In [None]:
# Treat each sentence as a mock verse: split on '.', re-append the period, then split into words
mock_verses = [(el + ' .').split() for el in text.split('.')]

In [None]:
n_merges = 4

In [None]:
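# Target vocabulary size = distinct characters + merges (note: set(text) also counts whitespace characters)
my_tokenizer = train_tokenizer(mock_verses, len(set(text)) + n_merges)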

In [None]:
' '.join(my_tokenizer.encode(text).tokens)

This is a pretty good result, although there are some unexpected splits. They might well have been merged at a later stage.

Note that, after 450 merges (that is, with `n_merges = 450` rather than the 4 set above), "debutant" is split into "de butant", unlike with the pre-trained tokenizer above. To be fair, the training data differs vastly in both quality and quantity.

# Retrieve the training history, i.e., the merge steps

In [None]:
my_tokenizer.model.save('WordSplitting/output', f'bpe_model_{n_merges}')

This saves the final vocabulary (after all merges) and the list of merges in the order in which they were learned. This is almost exactly what we want. Two items remain to be figured out:

1. How many steps do we need to run in order to complete all the merges? Or, put another way, how can we check if we have reached all merges?

2. What is the exact format that we need for the calculations that come afterwards? I need to check my old code for word-pasting and word-splitting.
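
For the first item, one possible stopping criterion is that no adjacent pair of symbols is left, meaning every distinct word has collapsed into a single token. The sketch below is a toy word-bounded BPE trainer written in plain Python to illustrate that criterion; it is not the implementation behind `train_tokenizer`, and `train_until_exhausted` is a hypothetical name. Ties between equally frequent pairs are broken alphabetically here, which a real trainer may do differently.

```python
from collections import Counter


def train_until_exhausted(words, max_merges=None):
    """Learn BPE merges inside word boundaries until no adjacent pair is left.

    Returns the ordered merge list and the final segmentation of each word.
    """
    # Each distinct word starts as a tuple of characters, with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    while max_merges is None or len(merges) < max_merges:
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is a single symbol: all merges are done
        # Most frequent pair; ties broken alphabetically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, sorted(corpus)


words = "low low lower lowest newest".split()
merges, final = train_until_exhausted(words)
upper_bound = sum(len(w) - 1 for w in set(words))
print(len(merges), "merges; upper bound", upper_bound)
print(final)
```

Each merge removes at least one symbol from some distinct word, so the number of merges can never exceed the sum of `len(word) - 1` over the distinct words. That gives a cheap upper bound to compare against before training. If `train_tokenizer` wraps a library trainer that stops early when no pairs remain (an assumption worth verifying), a final vocabulary smaller than the requested size would signal the same condition.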