# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [1]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [2]:
print(text.split())

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [3]:
#help(str.split)

In [None]:
# Import Python's regular expressions module
import re

In [7]:
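# The capturing group keeps each separator (punctuation or whitespace) in the result.
tokens = re.split(r'([.?!:()]|\s)', text)
# item.split() returns an empty list (falsy) for empty or whitespace-only strings
# and a non-empty list (truthy) otherwise, so this filter keeps only real tokens.
tokens = [item for item in tokens if item.split()]
# set() keeps one copy of each token; sorted() returns them as an ordered list.
tokens = sorted(set(tokens))
print(tokens)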

['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
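
The parentheses in the pattern above matter: `re.split` only returns the separators when the pattern contains a capturing group. The following sketch contrasts the two forms on a short made-up string (the string and the variable names are illustrative assumptions, not part of the notebook's data).

```python
import re

sample = "Is it? Yes."

# Without a capturing group, the separators are discarded.
without_group = re.split(r'[.?!:()]|\s', sample)

# With a capturing group, each separator is returned as its own item.
with_group = re.split(r'([.?!:()]|\s)', sample)

# Drop empty strings and whitespace-only items, as in the cell above.
kept = [item for item in with_group if item.strip()]

print(without_group)
print(with_group)
print(kept)
```

Only the second form keeps `?` and `.` as tokens; the first form loses them and leaves empty strings behind.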


In [9]:
# enumerate() pairs each token with its position in the sorted list;
# that position becomes the token's integer ID.
vocab = {token: index for index, token in enumerate(tokens)}
print(vocab.items())
print(vocab["Test"])

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])
6


## Tokenizing the Story

In [10]:
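# Assumes the file is saved as UTF-8 (the story contains accented letters and em-dashes).
with open("Amontillado.txt", "r", encoding="utf-8") as f: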
    raw_text = f.read()
print(raw_text[:20])

The thousand injurie


In [17]:
raw_text.split()

['The',
 'thousand',
 'injuries',
 'of',
 'Fortunato',
 'I',
 'had',
 'borne',
 'as',
 'I',
 'best',
 'could;',
 'but',
 'when',
 'he',
 'ventured',
 'upon',
 'insult,',
 'I',
 'vowed',
 'revenge.',
 'You,',
 'who',
 'so',
 'well',
 'know',
 'the',
 'nature',
 'of',
 'my',
 'soul,',
 'will',
 'not',
 'suppose,',
 'however,',
 'that',
 'I',
 'gave',
 'utterance',
 'to',
 'a',
 'threat.',
 'At',
 'length',
 'I',
 'would',
 'be',
 'avenged;',
 'this',
 'was',
 'a',
 'point',
 'definitively',
 'settled—but',
 'the',
 'very',
 'definitiveness',
 'with',
 'which',
 'it',
 'was',
 'resolved,',
 'precluded',
 'the',
 'idea',
 'of',
 'risk.',
 'I',
 'must',
 'not',
 'only',
 'punish,',
 'but',
 'punish',
 'with',
 'impunity.',
 'A',
 'wrong',
 'is',
 'unredressed',
 'when',
 'retribution',
 'overtakes',
 'its',
 'redresser.',
 'It',
 'is',
 'equally',
 'unredressed',
 'when',
 'the',
 'avenger',
 'fails',
 'to',
 'make',
 'himself',
 'felt',
 'as',
 'such',
 'to',
 'him',
 'who',
 'has',
 'done

In [28]:
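# Split on punctuation, quotes, "--", and whitespace, keeping the separators.
# Note: the pattern lists "--" but not the em-dash character (U+2014), so words
# joined by an em-dash in the story stay together as a single token.
token = re.split(r'([,.:;?_!"()\'“”‘’]|--|\s)', raw_text)
token = [item for item in token if item.split()]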
print(len(token))

2909


In [29]:
token = sorted(set(token))
print(len(token))

863


In [30]:
vocab = {tok: index for index, tok in enumerate(token)}
print(vocab.items())

dict_items([('!', 0), ('"', 1), ("'", 2), (',', 3), ('.', 4), (':', 5), (';', 6), ('?', 7), ('A', 8), ('Against', 9), ('Amontillado', 10), ('And', 11), ('As', 12), ('At', 13), ('Austrian', 14), ('Be', 15), ('Besides', 16), ('British', 17), ('But', 18), ('Come', 19), ('De', 20), ('Drink', 21), ('Enough', 22), ('Few', 23), ('For', 24), ('Fortunato', 25), ('Fortunato—although', 26), ('From', 27), ('God', 28), ('Good', 29), ('Grâve', 30), ('Ha', 31), ('He', 32), ('Here', 33), ('His', 34), ('How', 35), ('I', 36), ('If', 37), ('Impossible', 38), ('In', 39), ('Indeed', 40), ('It', 41), ('Italian', 42), ('Italians', 43), ('Its', 44), ('Lady', 45), ('Let', 46), ('Luchesi', 47), ('Luchesi—', 48), ('Luchesi——', 49), ('Medoc', 50), ('Montresor', 51), ('Montresors', 52), ('My', 53), ('Nemo', 54), ('Nitre', 55), ('No', 56), ('Not', 57), ('Once', 58), ('Paris', 59), ('Pass', 60), ('Proceed', 61), ('Putting', 62), ('Sherry', 63), ('The', 64), ('Then', 65), ('There', 66), ('These', 67), ('They', 68), (
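
The vocabulary above contains entries such as `'Fortunato—although'` because the split pattern does not include the em-dash. The sketch below shows one possible adjustment; `DASH_PATTERN`, `split_tokens`, and the sample string are hypothetical names and data introduced for illustration, not the notebook's method.

```python
import re

# Pattern used in the notebook: it lists "--" but not the em-dash (U+2014).
NOTEBOOK_PATTERN = r'([,.:;?_!"()\'“”‘’]|--|\s)'
# Hypothetical variant that also splits on runs of em-dashes.
DASH_PATTERN = r'([,.:;?_!"()\'“”‘’]|--|\u2014+|\s)'


def split_tokens(text, pattern):
    """Split text on the pattern and drop empty or whitespace-only pieces."""
    return [item for item in re.split(pattern, text) if item.strip()]


sample = "Fortunato\u2014although in other regards"
print(split_tokens(sample, NOTEBOOK_PATTERN))
print(split_tokens(sample, DASH_PATTERN))
```

Adopting a pattern like this would change the token counts and IDs shown in this notebook, so the outputs here correspond to the original pattern only.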

In [33]:
phrase = "The thousand injuries of Fortunato I had borne as I best could; but when he ventured upon insult, I vowed revenge."

In [34]:
phrase = re.split(r'([,.:;?_!"()\'“”‘’]|--|\s)',phrase)
phrase = [item for item in phrase if item.split()]
print(phrase)

['The', 'thousand', 'injuries', 'of', 'Fortunato', 'I', 'had', 'borne', 'as', 'I', 'best', 'could', ';', 'but', 'when', 'he', 'ventured', 'upon', 'insult', ',', 'I', 'vowed', 'revenge', '.']


In [None]:
# Encoding: look up the integer ID of each token in the vocabulary
ids = [vocab[token] for token in phrase]
print(ids)

[64, 756, 410, 539, 25, 36, 354, 152, 122, 36, 146, 201, 6, 165, 827, 363, 803, 795, 414, 3, 36, 812, 648, 4]
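
The lookup `vocab[token]` only works for tokens that occur in the story; any other word raises a `KeyError`. A common workaround, which this notebook does not implement, is to reserve an ID for unknown tokens. The sketch below uses a tiny made-up vocabulary and a hypothetical `<unk>` entry to show the idea.

```python
# Tiny made-up vocabulary; "<unk>" is a hypothetical placeholder token.
toy_vocab = {"I": 0, "revenge": 1, "vowed": 2}
toy_vocab["<unk>"] = len(toy_vocab)


def encode_tokens(tokens, vocab, unk_token="<unk>"):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


print(encode_tokens(["I", "vowed", "silence"], toy_vocab))
```

Here `"silence"` is not in the toy vocabulary, so it receives the ID reserved for `<unk>` instead of stopping the encoding with an error.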


In [36]:
# Decoding: invert the vocabulary so that each ID maps back to its token
reverse_vocab = {index:token for token, index in vocab.items()}
reverse_vocab.items()

dict_items([(0, '!'), (1, '"'), (2, "'"), (3, ','), (4, '.'), (5, ':'), (6, ';'), (7, '?'), (8, 'A'), (9, 'Against'), (10, 'Amontillado'), (11, 'And'), (12, 'As'), (13, 'At'), (14, 'Austrian'), (15, 'Be'), (16, 'Besides'), (17, 'British'), (18, 'But'), (19, 'Come'), (20, 'De'), (21, 'Drink'), (22, 'Enough'), (23, 'Few'), (24, 'For'), (25, 'Fortunato'), (26, 'Fortunato—although'), (27, 'From'), (28, 'God'), (29, 'Good'), (30, 'Grâve'), (31, 'Ha'), (32, 'He'), (33, 'Here'), (34, 'His'), (35, 'How'), (36, 'I'), (37, 'If'), (38, 'Impossible'), (39, 'In'), (40, 'Indeed'), (41, 'It'), (42, 'Italian'), (43, 'Italians'), (44, 'Its'), (45, 'Lady'), (46, 'Let'), (47, 'Luchesi'), (48, 'Luchesi—'), (49, 'Luchesi——'), (50, 'Medoc'), (51, 'Montresor'), (52, 'Montresors'), (53, 'My'), (54, 'Nemo'), (55, 'Nitre'), (56, 'No'), (57, 'Not'), (58, 'Once'), (59, 'Paris'), (60, 'Pass'), (61, 'Proceed'), (62, 'Putting'), (63, 'Sherry'), (64, 'The'), (65, 'Then'), (66, 'There'), (67, 'These'), (68, 'They'), (

In [49]:
# Use token_id rather than id to avoid shadowing Python's built-in id().
print(" ".join(reverse_vocab[token_id] for token_id in ids))

The thousand injuries of Fortunato I had borne as I best could ; but when he ventured upon insult , I vowed revenge .
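
Joining tokens with single spaces leaves a space before each punctuation mark, as the output above shows. One possible cleanup, sketched here as an assumption rather than as part of the notebook, removes whitespace that directly precedes common punctuation. The function name `tidy_punctuation` is hypothetical.

```python
import re


def tidy_punctuation(text):
    """Remove whitespace that directly precedes , . : ; ? or !"""
    return re.sub(r'\s+([,.:;?!])', r'\1', text)


decoded = "as I best could ; but when he ventured upon insult , I vowed revenge ."
print(tidy_punctuation(decoded))
```

Quotation marks and parentheses need more care, since an opening mark attaches to the word that follows it; this sketch leaves them untouched.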


## Creating a Class

In [46]:
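class SimpleTokenizer:
    """Word-level tokenizer wrapping the vocabulary built above."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {index: token for token, index in vocab.items()}

    def encode(self, text):
        # Same split pattern as the cells above; raises KeyError for unseen tokens.
        pieces = re.split(r'([,.:;?_!"()\'“”‘’]|--|\s)', text)
        return [self.str_to_int[item] for item in pieces if item.split()]

    def decode(self, ids):
        # Join tokens with single spaces, as in the decoding cell above.
        return " ".join(self.int_to_str[token_id] for token_id in ids)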


In [47]:
tokenizer = SimpleTokenizer(vocab)

In [48]:
tokenizer.str_to_int

{'!': 0,
 '"': 1,
 "'": 2,
 ',': 3,
 '.': 4,
 ':': 5,
 ';': 6,
 '?': 7,
 'A': 8,
 'Against': 9,
 'Amontillado': 10,
 'And': 11,
 'As': 12,
 'At': 13,
 'Austrian': 14,
 'Be': 15,
 'Besides': 16,
 'British': 17,
 'But': 18,
 'Come': 19,
 'De': 20,
 'Drink': 21,
 'Enough': 22,
 'Few': 23,
 'For': 24,
 'Fortunato': 25,
 'Fortunato—although': 26,
 'From': 27,
 'God': 28,
 'Good': 29,
 'Grâve': 30,
 'Ha': 31,
 'He': 32,
 'Here': 33,
 'His': 34,
 'How': 35,
 'I': 36,
 'If': 37,
 'Impossible': 38,
 'In': 39,
 'Indeed': 40,
 'It': 41,
 'Italian': 42,
 'Italians': 43,
 'Its': 44,
 'Lady': 45,
 'Let': 46,
 'Luchesi': 47,
 'Luchesi—': 48,
 'Luchesi——': 49,
 'Medoc': 50,
 'Montresor': 51,
 'Montresors': 52,
 'My': 53,
 'Nemo': 54,
 'Nitre': 55,
 'No': 56,
 'Not': 57,
 'Once': 58,
 'Paris': 59,
 'Pass': 60,
 'Proceed': 61,
 'Putting': 62,
 'Sherry': 63,
 'The': 64,
 'Then': 65,
 'There': 66,
 'These': 67,
 'They': 68,
 'Three': 69,
 'Throwing': 70,
 'Thus': 71,
 'To': 72,
 'True': 73,
 'True—true': 