In [2]:
from dotenv import dotenv_values
config = dotenv_values(".env")

## Character Analysis
English and Spanish share nearly the same alphabet. The only difference is the addition of the letter ñ (enye) in the Spanish alphabet. Spanish also uses diacritics on vowels (á, é, í, ó, ú, ü) that are not used in English. While not part of the alphabet itself, these diacritics result in different character representations. Apart from these characters, it is therefore expected that every tokenizer places the individual characters of both languages in the same region of its vocabulary.

Japanese uses a different set of characters, spread across three writing systems: kanji, hiragana, and katakana. Hiragana and katakana are both syllabaries: like the English alphabet they are phonetic, but each character represents a syllable rather than a single sound, and each script has its own set of characters. Kanji is a logographic writing system that shares many of its characters with the Chinese writing system.

### Base Byte Pair Encoding
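English and Spanish letters begin at U+0041, with the additional Spanish characters beginning at U+00C1.
Hiragana begins at U+3041, katakana at U+30A0, and kanji at U+4E00.
Byte-level byte pair encoding starts from the 256 possible byte values, so what matters is how many bytes each character takes once the text is encoded (UTF-8 for tokenizers such as GPT-2's). Characters up to U+007F, which covers the unaccented English and Spanish letters, take a single byte and are therefore single tokens when the tokenizer is initialized. Characters from U+0080 to U+07FF, including ñ and the accented vowels, take two bytes, and Japanese characters take three. For these characters, either the tokenizer needs to learn a merged token that represents the whole character, or the model needs to learn how multiple tokens combine to represent a single character. The cell below splits each code point into a high and a low byte, which shows the code point ranges but is not the UTF-8 encoding.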

In [3]:
english_characters = "abcxyzABCXYZ"
spanish_characters = "abñyzABÑYZ"
japanese_characters = "あいうえおアイウエオ上中"

example_characters = {
    "English": english_characters,
    "Spanish": spanish_characters,
    "Japanese": japanese_characters,
    }

def get_bytes(character):
    """Split a character's code point into a high and a low byte (not its UTF-8 encoding)."""
    char_number = ord(character)
    upper, lower = divmod(char_number, 0x100)
    return f"{hex(upper)} {hex(lower)}"

for language, characters in example_characters.items():
    print(language)
    count = 0
    for character in characters:
        end_char = "\n" if count%2 else "\t"
        print(f"{character}: {get_bytes(character)} ({ord(character)})", end=end_char)
        count += 1
    print("\n" if count%2 else "")

English
a: 0x0 0x61 (97)	b: 0x0 0x62 (98)
c: 0x0 0x63 (99)	x: 0x0 0x78 (120)
y: 0x0 0x79 (121)	z: 0x0 0x7a (122)
A: 0x0 0x41 (65)	B: 0x0 0x42 (66)
C: 0x0 0x43 (67)	X: 0x0 0x58 (88)
Y: 0x0 0x59 (89)	Z: 0x0 0x5a (90)

Spanish
a: 0x0 0x61 (97)	b: 0x0 0x62 (98)
ñ: 0x0 0xf1 (241)	y: 0x0 0x79 (121)
z: 0x0 0x7a (122)	A: 0x0 0x41 (65)
B: 0x0 0x42 (66)	Ñ: 0x0 0xd1 (209)
Y: 0x0 0x59 (89)	Z: 0x0 0x5a (90)

Japanese
あ: 0x30 0x42 (12354)	い: 0x30 0x44 (12356)
う: 0x30 0x46 (12358)	え: 0x30 0x48 (12360)
お: 0x30 0x4a (12362)	ア: 0x30 0xa2 (12450)
イ: 0x30 0xa4 (12452)	ウ: 0x30 0xa6 (12454)
エ: 0x30 0xa8 (12456)	オ: 0x30 0xaa (12458)
上: 0x4e 0xa (19978)	中: 0x4e 0x2d (20013)
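
The split above is of the Unicode code point, not of the bytes a byte-level tokenizer sees. As a minimal sketch, assuming the text is encoded as UTF-8, the helper below lists the actual bytes for a few of the sample characters. The name `utf8_bytes` is introduced here for illustration and is not part of the original notebook.

```python
def utf8_bytes(character):
    """Return the UTF-8 bytes of a character as a list of hex strings."""
    return [f"0x{byte:02x}" for byte in character.encode("utf-8")]

# One ASCII letter, both cases of ñ, and one character from each Japanese script.
for character in "aÑñあア上":
    encoded = utf8_bytes(character)
    print(f"{character}: U+{ord(character):04X} -> {' '.join(encoded)} (byte count: {len(encoded)})")
```

Under this encoding the ASCII letter takes one byte, ñ and Ñ take two, and each Japanese character takes three.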



In [4]:
from transformers import AutoTokenizer

def tokenizer_character_test(model):
    for language, characters in example_characters.items():
        print(language)
        count = 0
        for character in characters:
            end_char = "\n" if count%2 else "\t\t"
            tokenized = model.encode(character)
            print(f"{character}: {tokenized}", end=end_char)
            count += 1
        print("\n" if count%2 else "")



### OpenAI GPT-2 Tokenizer

GPT-2 is a language model created by OpenAI. Its tokenizer uses byte-level byte pair encoding.

In [5]:
GPT2_Tokenizer = AutoTokenizer.from_pretrained("gpt2", token=config['HUGGING'])

tokenizer_character_test(GPT2_Tokenizer)

English
a: [64]		b: [65]
c: [66]		x: [87]
y: [88]		z: [89]
A: [32]		B: [33]
C: [34]		X: [55]
Y: [56]		Z: [57]

Spanish
a: [64]		b: [65]
ñ: [12654]		y: [88]
z: [89]		A: [32]
B: [33]		Ñ: [127, 239]
Y: [56]		Z: [57]

Japanese
あ: [40948]		い: [18566]
う: [29557]		え: [2515, 230]
お: [2515, 232]		ア: [11839]
イ: [11482]		ウ: [16165]
エ: [23544]		オ: [20513]
上: [41468]		中: [40792]



Despite Ñ being a single character in the Spanish alphabet, the GPT-2 tokenizer represents it with two tokens.
This is because the GPT-2 tokenizer applies byte pair encoding to UTF-8 bytes rather than to characters. In UTF-8 only the ASCII range, up to U+007F (the first 128 characters of the Unicode standard), fits in a single byte; characters from U+0080 upward, including ñ and Ñ, take two or more. The lowercase ñ was apparently common enough in the training data to be merged into a single token (12654), while the rarer uppercase Ñ was not.
Looking at the byte pairs, we can see what the tokenizer is representing: Ñ is encoded in UTF-8 as 0xC3 0x91, and token IDs 127 and 239 are the base tokens for those two bytes.
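
As a sketch of how this can be checked offline, the snippet below rebuilds the ordering of GPT-2's 256 base byte tokens. It assumes the ordering used by the reference GPT-2 encoder (printable bytes first, all remaining bytes afterwards) and that the first 256 vocabulary IDs follow that ordering. The name `gpt2_byte_ids` is introduced here for illustration.

```python
def gpt2_byte_ids():
    """Map each byte value (0-255) to its assumed GPT-2 base token ID."""
    # Printable bytes come first: '!'..'~', then 0xA1..0xAC, then 0xAE..0xFF.
    ordered = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    # Every remaining byte (control characters, space, 0x7F..0xA0, 0xAD) follows in numeric order.
    seen = set(ordered)
    ordered += [byte for byte in range(256) if byte not in seen]
    return {byte: token_id for token_id, byte in enumerate(ordered)}

BYTE_IDS = gpt2_byte_ids()

for character in "aAÑ":
    raw = character.encode("utf-8")
    print(character, [hex(byte) for byte in raw], [BYTE_IDS[byte] for byte in raw])
```

Under these assumptions the IDs agree with the tokenizer output above: a maps to 64, A to 32, and the two UTF-8 bytes of Ñ to 127 and 239.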

The same happens with the Japanese characters, which was expected from the start since each of them takes three bytes in UTF-8. The tokenizer has learned a single merged token for many of these characters, but not all: え and お are each split into two tokens. This is likely because those characters were too rare in the training data for the final merge to be learned.

### Google T5 Transformer

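T5 is a model created by Google. Note that the checkpoint loaded below is `google/gemma-3-1b-it`, which belongs to Google's Gemma 3 model rather than T5, so the results in this section describe the Gemma 3 tokenizer. The variable is still named `T5_Tokenizer`.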

In [6]:
T5_Tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it", token=config['HUGGING'])

tokenizer_character_test(T5_Tokenizer)



English
a: [2, 236746]		b: [2, 236763]
c: [2, 236755]		x: [2, 236781]
y: [2, 236762]		z: [2, 236802]
A: [2, 236776]		B: [2, 236799]
C: [2, 236780]		X: [2, 236917]
Y: [2, 236874]		Z: [2, 236953]

Spanish
a: [2, 236746]		b: [2, 236763]
ñ: [2, 237168]		y: [2, 236762]
z: [2, 236802]		A: [2, 236776]
B: [2, 236799]		Ñ: [2, 240643]
Y: [2, 236874]		Z: [2, 236953]

Japanese
あ: [2, 237268]		い: [2, 236985]
う: [2, 237187]		え: [2, 237495]
お: [2, 237328]		ア: [2, 237254]
イ: [2, 237118]		ウ: [2, 237656]
エ: [2, 237746]		オ: [2, 237705]
上: [2, 237152]		中: [2, 237103]



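Getting two tokens for every character was originally unexpected, but it follows from how the tokenizer processes text. Token ID 2 is `<bos>`, the beginning-of-sequence token, which the tokenizer prepends to every encoded input; the second ID is the token for the character itself.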
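
The leading ID can be dropped when only the content tokens are of interest. The following is a minimal sketch that works on ID lists copied from the output above. The helper name `content_ids` is introduced here for illustration, and the default `bos_id=2` is taken from the observed output rather than read from the tokenizer's configuration.

```python
def content_ids(token_ids, bos_id=2):
    """Drop a leading beginning-of-sequence ID so only the content tokens remain."""
    if token_ids and token_ids[0] == bos_id:
        return token_ids[1:]
    return token_ids

# IDs copied from the character test output above.
observed = {"a": [2, 236746], "ñ": [2, 237168], "あ": [2, 237268]}

for character, ids in observed.items():
    print(character, content_ids(ids))
```

With the tokenizer loaded, passing `add_special_tokens=False` to `encode` is the usual way in the transformers library to skip special tokens, which should make this post-processing unnecessary.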

Additionally, the token IDs are not ordered alphabetically and are much larger for the English characters than in the GPT-2 tokenizer. This suggests the vocabulary was not initialized from the byte values of an ASCII or UTF-8 encoding the way GPT-2's was. Part of the shift comes from the special tokens at the start of the vocabulary, which push every other token to a higher ID, although a handful of special tokens cannot by itself explain IDs above 236,000. A few of these special tokens are:

In [7]:
for i in range(10):
    print(i, end="\t")
    print(T5_Tokenizer.decode([i]))

0	<pad>
1	<eos>
2	<bos>
3	<unk>
4	<mask>
5	[multimodal]
6	<unused0>
7	<unused1>
8	<unused2>
9	<unused3>


## Word Analysis
The next step for byte pair encoding is to merge multiple characters into single tokens. Because the merges depend on how often character sequences occur in each tokenizer's training data, the two tokenizers are unlikely to produce similar results.
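
The merge step can be sketched on a toy example. This is a simplified, character-level illustration of byte pair encoding training, not the procedure used by either tokenizer in this notebook. The corpus and the helper names `count_pairs` and `merge_pair` are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, frequency in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += frequency
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, frequency in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + frequency
    return merged

# Hypothetical toy corpus: word -> frequency.
corpus = {"hola": 3, "hombre": 2, "hoy": 4}
words = {tuple(word): frequency for word, frequency in corpus.items()}

merges = []
for _ in range(2):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    merges.append(best)

print(merges)       # [('h', 'o'), ('ho', 'y')]
print(list(words))  # [('ho', 'l', 'a'), ('ho', 'm', 'b', 'r', 'e'), ('hoy',)]
```

A different corpus would give different merges, which is why two tokenizers trained on different data rarely split the same word in the same way.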

In [8]:
english_words = ["Hello", "world", "John"]
spanish_words = ["Hola", "mundo", "Juan"]
japanese_words = ["おはよう", "世界", "ジョン"]

example_words = {
    "English": english_words,
    "Spanish": spanish_words,
    "Japanese": japanese_words,
}

def tokenizer_word_test(model):
    # Print the token IDs the tokenizer produces for each example word.
    for language, words in example_words.items():
        print(language)
        for word in words:
            tokenized = model.encode(word)
            print(f"{word}: {tokenized}")

### GPT-2 Tokenizer

In [9]:
tokenizer_word_test(GPT2_Tokenizer)

English
Hello: [15496]
world: [6894]
John: [7554]
Spanish
Hola: [39, 5708]
mundo: [20125, 78]
Juan: [41, 7258]
Japanese
おはよう: [2515, 232, 31676, 1792, 230, 29557]
世界: [10310, 244, 45911, 234]
ジョン: [21091, 1209, 100, 6527]


The biggest point of interest is that every Japanese word tested was split into multiple tokens, even though these are relatively common words. The Spanish words are also split into two tokens each, while the English words are single tokens. This result is not expected for newer OpenAI models such as GPT-4, since one of the major changes after GPT-2 and GPT-3 (which share this tokenizer) is a significant increase in the tokenizer's vocabulary size.
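
To put numbers on this, the token counts can be compared with the character and UTF-8 byte counts of each word. The sketch below uses token IDs copied from the GPT-2 output above, and `token_stats` is a helper name introduced here for illustration.

```python
# Token IDs copied from the GPT-2 word test output above.
gpt2_output = {
    "Hello": [15496],
    "Hola": [39, 5708],
    "おはよう": [2515, 232, 31676, 1792, 230, 29557],
    "世界": [10310, 244, 45911, 234],
    "ジョン": [21091, 1209, 100, 6527],
}

def token_stats(word, token_ids):
    """Return (characters, UTF-8 bytes, tokens) for one word."""
    return len(word), len(word.encode("utf-8")), len(token_ids)

for word, ids in gpt2_output.items():
    chars, n_bytes, tokens = token_stats(word, ids)
    print(f"{word}: {chars} chars, {n_bytes} bytes, {tokens} tokens "
          f"({tokens / chars:.2f} tokens per character)")
```

For the Japanese words the tokenizer needs more than one token per character, whereas "Hello" is five characters in a single token.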

### T5 Tokenizer

In [10]:
tokenizer_word_test(T5_Tokenizer)

English
Hello: [2, 9259]
world: [2, 12392]
John: [2, 12720]
Spanish
Hola: [2, 21529]
mundo: [2, 223428]
Juan: [2, 76777]
Japanese
おはよう: [2, 220844]
世界: [2, 12811]
ジョン: [2, 104950]
