# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

In [80]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [81]:
print(text.split())

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [82]:
#help(str.split)

In [83]:
#import regular expressions module
import re

In [84]:
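# The capturing group (...) makes re.split keep the delimiters in its result
tokens = re.split(r'([.?!:()]|\s)', text)
# item.split() returns an empty list (falsy) for empty or whitespace-only strings, so those are dropped
tokens = [item for item in tokens if item.split()]
# set() keeps only the unique tokens; sorted() returns them as an ordered list
tokens = sorted(set(tokens))
print(tokens)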

['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
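The parentheses in the pattern matter: when the pattern given to `re.split` contains a capturing group, the matched delimiters are returned in the result instead of being discarded. A minimal illustration, using a made-up `sample` string and a shortened punctuation set:

```python
import re

sample = "Hello, world!"

# Without a capturing group, the delimiters are thrown away.
without_group = re.split(r'[,!]|\s', sample)

# With a capturing group, each delimiter is kept as its own item.
with_group = re.split(r'([,!]|\s)', sample)

# Filtering out empty and whitespace-only items leaves words and punctuation.
filtered = [item for item in with_group if item.strip()]

print(without_group)  # ['Hello', '', 'world', '']
print(with_group)     # ['Hello', ',', '', ' ', 'world', '!', '']
print(filtered)       # ['Hello', ',', 'world', '!']
```

The empty strings appear wherever two delimiters sit next to each other or a delimiter ends the string, which is why the filtering step is needed.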


In [85]:
# enumerate() pairs each token with its position in the list; that position becomes the token's ID
vocab = {token: index for index, token in enumerate(tokens)}
print(vocab.items())
print(vocab["Test"])

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])
6


## Tokenizing the Story

In [86]:
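# Assumes the file is UTF-8 encoded (the story contains curly quotes and dashes)
with open("Amontillado.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print(raw_text[:20])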

The thousand injurie


In [87]:
raw_text.split()

['The',
 'thousand',
 'injuries',
 'of',
 'Fortunato',
 'I',
 'had',
 'borne',
 'as',
 'I',
 'best',
 'could;',
 'but',
 'when',
 'he',
 'ventured',
 'upon',
 'insult,',
 'I',
 'vowed',
 'revenge.',
 'You,',
 'who',
 'so',
 'well',
 'know',
 'the',
 'nature',
 'of',
 'my',
 'soul,',
 'will',
 'not',
 'suppose,',
 'however,',
 'that',
 'I',
 'gave',
 'utterance',
 'to',
 'a',
 'threat.',
 'At',
 'length',
 'I',
 'would',
 'be',
 'avenged;',
 'this',
 'was',
 'a',
 'point',
 'definitively',
 'settled—but',
 'the',
 'very',
 'definitiveness',
 'with',
 'which',
 'it',
 'was',
 'resolved,',
 'precluded',
 'the',
 'idea',
 'of',
 'risk.',
 'I',
 'must',
 'not',
 'only',
 'punish,',
 'but',
 'punish',
 'with',
 'impunity.',
 'A',
 'wrong',
 'is',
 'unredressed',
 'when',
 'retribution',
 'overtakes',
 'its',
 'redresser.',
 'It',
 'is',
 'equally',
 'unredressed',
 'when',
 'the',
 'avenger',
 'fails',
 'to',
 'make',
 'himself',
 'felt',
 'as',
 'such',
 'to',
 'him',
 'who',
 'has',
 'done

In [88]:
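# Split on punctuation, double hyphens, and whitespace, keeping the delimiters
token = re.split(r'([,.:;?_!"()\'“”‘’]|--|\s)', raw_text)
token = [item for item in token if item.split()]
# Add special tokens for unknown words and end of text; the count below includes these two
token.extend(["<|unk|>", "<|endoftext|>"])
print("Number of total tokens in the text:", len(token))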

Number of total tokens in the text: 2911


In [89]:
token = sorted(set(token))
print("Number of unique tokens in the text:", len(token))

Number of unique tokens in the text: 865


In [90]:
# Use a separate loop variable (tok) so it does not shadow the token list
vocab = {tok: index for index, tok in enumerate(token)}
print(vocab.items())

dict_items([('!', 0), ('"', 1), ("'", 2), (',', 3), ('.', 4), (':', 5), (';', 6), ('<|endoftext|>', 7), ('<|unk|>', 8), ('?', 9), ('A', 10), ('Against', 11), ('Amontillado', 12), ('And', 13), ('As', 14), ('At', 15), ('Austrian', 16), ('Be', 17), ('Besides', 18), ('British', 19), ('But', 20), ('Come', 21), ('De', 22), ('Drink', 23), ('Enough', 24), ('Few', 25), ('For', 26), ('Fortunato', 27), ('Fortunato—although', 28), ('From', 29), ('God', 30), ('Good', 31), ('Grâve', 32), ('Ha', 33), ('He', 34), ('Here', 35), ('His', 36), ('How', 37), ('I', 38), ('If', 39), ('Impossible', 40), ('In', 41), ('Indeed', 42), ('It', 43), ('Italian', 44), ('Italians', 45), ('Its', 46), ('Lady', 47), ('Let', 48), ('Luchesi', 49), ('Luchesi—', 50), ('Luchesi——', 51), ('Medoc', 52), ('Montresor', 53), ('Montresors', 54), ('My', 55), ('Nemo', 56), ('Nitre', 57), ('No', 58), ('Not', 59), ('Once', 60), ('Paris', 61), ('Pass', 62), ('Proceed', 63), ('Putting', 64), ('Sherry', 65), ('The', 66), ('Then', 67), ('The
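Entries such as `'Fortunato—although'` and `'Luchesi—'` end up in the vocabulary because the pattern splits on a double hyphen (`--`), while this text uses the single dash character `—`. One possible adjustment, shown only as a sketch and not applied to the cells in this notebook (it would change the token counts and IDs reported here), is to add that character to the character class. The name `DASH_PATTERN` is made up for illustration, and the sample string is adapted from the story:

```python
import re

# Same pattern as above, with the dash character added to the character class.
DASH_PATTERN = r'([,.:;?_!"()\'“”‘’—]|--|\s)'

sample = "Fortunato—although in other regards he was a man to be respected"
pieces = re.split(DASH_PATTERN, sample)
pieces = [item for item in pieces if item.strip()]

print(pieces[:3])  # ['Fortunato', '—', 'although']
```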

In [91]:
phrase = "The thousand injuries of Fortunato I had borne as I best could; but when he ventured upon insult, I vowed revenge."

In [92]:
phrase = re.split(r'([,.:;?_!"()\'“”‘’]|--|\s)',phrase)
phrase = [item for item in phrase if item.split()]
print(phrase)

['The', 'thousand', 'injuries', 'of', 'Fortunato', 'I', 'had', 'borne', 'as', 'I', 'best', 'could', ';', 'but', 'when', 'he', 'ventured', 'upon', 'insult', ',', 'I', 'vowed', 'revenge', '.']


In [93]:
# Encoding: look up each token's ID in the vocabulary
ids = [vocab[tok] for tok in phrase]
print(ids)

[66, 758, 412, 541, 27, 38, 356, 154, 124, 38, 148, 203, 6, 167, 829, 365, 805, 797, 416, 3, 38, 814, 650, 4]


In [94]:
# Decoding: invert the vocabulary so that each ID maps back to its token
reverse_vocab = {index: token for token, index in vocab.items()}
reverse_vocab.items()

dict_items([(0, '!'), (1, '"'), (2, "'"), (3, ','), (4, '.'), (5, ':'), (6, ';'), (7, '<|endoftext|>'), (8, '<|unk|>'), (9, '?'), (10, 'A'), (11, 'Against'), (12, 'Amontillado'), (13, 'And'), (14, 'As'), (15, 'At'), (16, 'Austrian'), (17, 'Be'), (18, 'Besides'), (19, 'British'), (20, 'But'), (21, 'Come'), (22, 'De'), (23, 'Drink'), (24, 'Enough'), (25, 'Few'), (26, 'For'), (27, 'Fortunato'), (28, 'Fortunato—although'), (29, 'From'), (30, 'God'), (31, 'Good'), (32, 'Grâve'), (33, 'Ha'), (34, 'He'), (35, 'Here'), (36, 'His'), (37, 'How'), (38, 'I'), (39, 'If'), (40, 'Impossible'), (41, 'In'), (42, 'Indeed'), (43, 'It'), (44, 'Italian'), (45, 'Italians'), (46, 'Its'), (47, 'Lady'), (48, 'Let'), (49, 'Luchesi'), (50, 'Luchesi—'), (51, 'Luchesi——'), (52, 'Medoc'), (53, 'Montresor'), (54, 'Montresors'), (55, 'My'), (56, 'Nemo'), (57, 'Nitre'), (58, 'No'), (59, 'Not'), (60, 'Once'), (61, 'Paris'), (62, 'Pass'), (63, 'Proceed'), (64, 'Putting'), (65, 'Sherry'), (66, 'The'), (67, 'Then'), (68, 

In [95]:
print(" ".join(reverse_vocab[i] for i in ids))

The thousand injuries of Fortunato I had borne as I best could ; but when he ventured upon insult , I vowed revenge .
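Joining with a single space leaves a space in front of every punctuation mark. One way to tidy this up is a `re.sub` call that deletes whitespace appearing directly before punctuation. The sketch below assumes a shortened punctuation set and a hand-written `joined` string:

```python
import re

joined = "I best could ; but when he ventured upon insult , I vowed revenge ."

# Remove any run of whitespace that directly precedes a punctuation mark,
# keeping the punctuation itself via the captured group \1.
cleaned = re.sub(r'\s+([,.:;?!])', r'\1', joined)

print(cleaned)  # I best could; but when he ventured upon insult, I vowed revenge.
```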


## Creating a Tokenizer Class

In [96]:
class SimpleTokenizer:
    def __init__(self, vocab):
        # vocab maps token -> ID and must contain the "<|unk|>" token
        self.str_to_int = vocab
        self.int_to_str = {index: token for token, index in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\'“”‘’]|--|\s)', text)
        # Drop empty/whitespace items and map out-of-vocabulary tokens to <|unk|>
        tokens = [item if item in self.str_to_int else "<|unk|>" for item in tokens if item.split()]
        ids = [self.str_to_int[token] for token in tokens]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace that join() leaves in front of punctuation
        text = re.sub(r'\s+([,.:;?_!"()\'“”‘’]|--|\s)', r'\1', text)
        return text
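The class expects a ready-made vocabulary. The vocabulary-building steps from the earlier cells can be gathered into a single helper, sketched below. The names `build_vocab`, `PATTERN`, and `demo_vocab` are hypothetical and not part of the notebook; the pattern and special tokens are the ones used above:

```python
import re

PATTERN = r'([,.:;?_!"()\'“”‘’]|--|\s)'

def build_vocab(text, special_tokens=("<|unk|>", "<|endoftext|>")):
    """Map each unique token in `text`, plus the special tokens, to an integer ID."""
    pieces = [item for item in re.split(PATTERN, text) if item.strip()]
    unique = sorted(set(pieces) | set(special_tokens))
    return {token: index for index, token in enumerate(unique)}

demo_vocab = build_vocab("This is a test. Is this a test?")
print(demo_vocab)
```

Because the tokens are sorted by code point, punctuation and the special tokens receive the lowest IDs, followed by capitalized words and then lowercase words, matching the ordering seen in the vocabulary printed earlier.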

## Testing my Tokenizer

In [97]:
tokenizer = SimpleTokenizer(vocab)

In [98]:
#Test from own story
phrase = "The thousand injuries of Fortunato I had borne as I best could; but when he ventured upon insult, I vowed revenge."

In [104]:
ids = tokenizer.encode(phrase)
print(ids)
text = tokenizer.decode(ids)
print(text)

[66, 758, 412, 541, 27, 38, 356, 154, 124, 38, 148, 203, 6, 167, 829, 365, 805, 797, 416, 3, 38, 814, 650, 4]
The thousand injuries of Fortunato I had borne as I best could; but when he ventured upon insult, I vowed revenge.


In [101]:
#Something with unknown words
NewPhrase = "I own the only twenty-four karat gold labubu."

In [102]:
ids = tokenizer.encode(NewPhrase)
print(ids)

text = tokenizer.decode(ids)
print(text)

[38, 560, 748, 548, 8, 8, 8, 8, 4]
I own the only <|unk|> <|unk|> <|unk|> <|unk|>.
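Every out-of-vocabulary word collapses to the same `<|unk|>` ID, so the original words cannot be recovered by decoding. A rough way to gauge how much is lost is to measure the share of IDs equal to the unknown ID. The function name `unk_fraction` is hypothetical; the IDs are copied from the output above, where `<|unk|>` has ID 8:

```python
def unk_fraction(ids, unk_id):
    """Return the fraction of token IDs equal to `unk_id` (0.0 for an empty list)."""
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)

sample_ids = [38, 560, 748, 548, 8, 8, 8, 8, 4]
print(f"{unk_fraction(sample_ids, unk_id=8):.2f}")  # 0.44
```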


## Ignore The Stuff Below, Importing Tiktoken

In [103]:
# tiktoken is a third-party package; install it first (e.g. pip install tiktoken)
import tiktoken

ModuleNotFoundError: No module named 'tiktoken'

In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
# Store the result; otherwise print(ids) shows the IDs from the earlier SimpleTokenizer run
ids = tokenizer.encode(NewPhrase)
print(ids)