# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [1]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [2]:
str.split?

Signature: str.split(self, /, sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits.
    -1 (the default value) means no limit.

Splitting starts at the front of the string and works to the end.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      method_descriptor
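
The note at the end of the docstring is worth seeing in action. The short sketch below applies plain `str.split()` to the same sample text (redefined here so the snippet runs on its own); it only illustrates the limitation and is not part of the tokenizer itself.

```python
text = "This is a test! Or is this not a test? Test it to be sure. :)"

# Splitting on whitespace alone leaves punctuation attached to the words,
# so "test!", "test?" and "test" would all count as different tokens.
words = text.split()
print(words)
```

Punctuation stays glued to the preceding word, which is why the regular expression module is the better fit for natural text.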

In [4]:
import re 

In [11]:
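# Split on sentence punctuation or whitespace; the capturing group keeps the delimiters.
tokens = re.split(r'([.?!]|\s)', text)
# Drop empty strings and whitespace-only items (item.split() returns an empty, falsy list for them).
tokens = [item for item in tokens if item.split()]
# Deduplicate and sort to get the unique tokens.
tokens = sorted(set(tokens))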
print(tokens)

['!', '.', ':)', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
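
To see why the filtering step is needed, it can help to look at what `re.split` returns before anything is dropped. The sketch below uses a shorter, made-up string (`sample` is not part of the original notebook) with the same pattern.

```python
import re

sample = "A test! Or not?"

# Because the pattern is wrapped in a capturing group, re.split keeps the
# delimiters (punctuation and whitespace) in its output, along with empty
# strings wherever two delimiters are adjacent or one ends the string.
raw_parts = re.split(r'([.?!]|\s)', sample)
print(raw_parts)

# Keep only the parts that contain at least one non-whitespace character.
kept = [part for part in raw_parts if part.strip()]
print(kept)
```

The first list still contains spaces and empty strings; the second holds only words and punctuation marks.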


In [12]:
vocab = {token:index for index, token in enumerate(tokens)}
print(vocab.items())

dict_items([('!', 0), ('.', 1), (':)', 2), ('?', 3), ('Or', 4), ('Test', 5), ('This', 6), ('a', 7), ('be', 8), ('is', 9), ('it', 10), ('not', 11), ('sure', 12), ('test', 13), ('this', 14), ('to', 15)])
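
A vocabulary like this is typically used to turn text into a list of integer IDs and back. The following is a minimal sketch of that idea, not a definitive implementation: the helper names `encode` and `decode` and the `inverse_vocab` dictionary are assumptions introduced here for illustration, and the vocabulary is rebuilt so the snippet is self-contained.

```python
import re

text = "This is a test! Or is this not a test? Test it to be sure. :)"

# Rebuild the token list and vocabulary the same way as the cells above.
parts = [p for p in re.split(r'([.?!]|\s)', text) if p.strip()]
vocab = {token: index for index, token in enumerate(sorted(set(parts)))}

# Reverse mapping from ID back to token (hypothetical name).
inverse_vocab = {index: token for token, index in vocab.items()}

def encode(s):
    """Map a string to token IDs; raises KeyError for tokens not in vocab."""
    pieces = [p for p in re.split(r'([.?!]|\s)', s) if p.strip()]
    return [vocab[p] for p in pieces]

def decode(ids):
    """Map token IDs back to text, joining tokens with single spaces."""
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("Test this!")
print(ids)          # [5, 14, 0]
print(decode(ids))  # Test this !
```

Note that the round trip is not exact: the original spacing is lost, so `"Test this!"` comes back as `"Test this !"`. Any word that never appeared in the sample text has no ID and raises a `KeyError`.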


In [2]:
with open("tokenization_example_story.txt", 'r') as f:
    raw = f.read()
print(raw[:100])

Ancient Egypt (Rawlinson)
by George Rawlinson
The Priest-Kings--Pinetem and Solomon
SHISHAK AND HIS 


In [6]:
tokens = re.split(r'([.?!]|\s)', raw)
# Drop empty strings and whitespace-only items.
tokens = [item for item in tokens if item.split()]
# Deduplicate and sort to get the unique tokens.
tokens = sorted(set(tokens))
print(tokens[:100])

['"', '"Administrator', '"Chief', '"Fanbearer', '"Her-Hor', '"Her-Hor,', '"King', '"Principal', '"Royal', '"cities', '"colleges,"', '"gardens,', '"pillared', '"story', '"strengthens', '"the', '(1', '(Rawlinson)', '(Wady-el-Arish)', '(ib', ',', '.', '1),', '1,', '16)', '16-19),', '20)', '21-24);', '28,', '29):', '3)', 'A', 'AND', 'According', 'After', 'Amenhotep', "Amenhotep's;", 'Ammon', 'Ammon,"', 'Ammonites,', 'Ancient', 'Architect,"', 'As', 'Assyria', 'Assyria,', 'Assyria;', 'At', 'Bedouins', 'Boaz,', 'But', 'Canaanite', 'Chron', 'Commissioner,', 'Consequently,', 'DYNASTY', 'David', "David's", 'Delta,', 'Edomites,', 'Egypt', "Egypt's", 'Egypt,', 'Egypt,"', 'Egypt--perhaps', 'Egyptian', 'Egyptians,', 'Egyptologers', 'Euphrates', 'Finally,', 'For', 'Formerly,', 'From', 'George', 'Gezer,', 'Granaries,"', 'Great', 'Gush,"', 'HIS', 'He', 'Hebrew', 'Heliopolis,', 'Her-hor', 'Herhor', "Herhor's", 'Hesi-em-Kheb,', 'High-Priest', 'His', 'Hittites', 'Holies,', 'Holy', 'Hor-pa-seb-en-sha,', 'H