# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [9]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [10]:
print(text.split())

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [11]:
help(str.split)

Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1) unbound builtins.str method
    Return a list of the substrings in the string, using sep as the separator string.

      sep
        The separator used to split the string.

        When set to None (the default value), will split on any whitespace
        character (including \n \r \t \f and spaces) and will discard
        empty strings from the result.
      maxsplit
        Maximum number of splits.
        -1 (the default value) means no limit.

    Splitting starts at the front of the string and works to the end.

    Note, str.split() is mainly useful for data that has been intentionally
    delimited.  With natural text that includes punctuation, consider using
    the regular expression module.



In [12]:
import re

In [13]:
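tokens = re.split(r'([.?!:()]|\s)', text)
# item.split() returns an empty list (falsy) for empty or whitespace-only strings
# and a non-empty list (truthy) otherwise, so this filter keeps only real tokens.
tokens = [item for item in tokens if item.split()]
# set() keeps only the unique elements; sorted() returns them as an ordered list.
tokens = sorted(set(tokens))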
print(tokens)

['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
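
The cell above produces the sorted set of unique tokens, which discards the original word order. As a minimal sketch (the helper name `tokenize` is hypothetical and not part of the notebook), the same regular expression can be wrapped in a function that keeps the tokens in order, which is the form needed when text is later converted to IDs:

```python
import re

def tokenize(text):
    """Split text with the same punctuation/whitespace pattern used above,
    keeping tokens in their original order (hypothetical helper)."""
    pieces = re.split(r'([.?!:()]|\s)', text)
    # Drop empty strings and whitespace-only pieces.
    return [piece for piece in pieces if piece.strip()]

text = "This is a test! Or is this not a test? Test it to be sure. :)"
ordered_tokens = tokenize(text)
print(ordered_tokens)
print(f"{len(ordered_tokens)} tokens, {len(set(ordered_tokens))} unique")
```

The unique count matches the length of the sorted list printed above; the ordered list is longer because repeated words such as `is` and `a` appear more than once.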


In [14]:
# enumerate() pairs each token with its position in the sorted list,
# so every unique token gets a unique integer ID.
vocab = {token: index for index, token in enumerate(tokens)}
print(vocab.items())

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])
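
With a vocabulary in hand, text can be mapped to integer IDs and back. The following is only a sketch under stated assumptions: the names `encode`, `decode`, and `inverse_vocab` are illustrative and do not appear in the notebook, and the vocabulary is rebuilt here so the block runs on its own.

```python
import re

text = "This is a test! Or is this not a test? Test it to be sure. :)"

# Rebuild the vocabulary the same way as in the cells above.
pieces = re.split(r'([.?!:()]|\s)', text)
tokens = sorted(set(p for p in pieces if p.strip()))
vocab = {token: index for index, token in enumerate(tokens)}

# Inverse mapping (ID -> token), needed for decoding.
inverse_vocab = {index: token for token, index in vocab.items()}

def encode(s):
    """Map a string to token IDs; raises KeyError for tokens not in vocab."""
    parts = re.split(r'([.?!:()]|\s)', s)
    return [vocab[p] for p in parts if p.strip()]

def decode(ids):
    """Map token IDs back to tokens, joined by single spaces."""
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("Or is this a test?")
print(ids)          # [5, 10, 15, 8, 14, 4]
print(decode(ids))  # Or is this a test ?
```

Note that this simple `decode` puts a space before punctuation, so it does not reproduce the original spacing exactly.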


## Tokenizing the Story

In [16]:
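# Assumes the file is UTF-8 encoded (it contains em-dashes and accented characters).
with open("Amontillado.txt", "r", encoding="utf-8") as f: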
    raw_text = f.read()
print(raw_text[:20])

The thousand injurie


In [17]:
raw_text.split()

['The',
 'thousand',
 'injuries',
 'of',
 'Fortunato',
 'I',
 'had',
 'borne',
 'as',
 'I',
 'best',
 'could;',
 'but',
 'when',
 'he',
 'ventured',
 'upon',
 'insult,',
 'I',
 'vowed',
 'revenge.',
 'You,',
 'who',
 'so',
 'well',
 'know',
 'the',
 'nature',
 'of',
 'my',
 'soul,',
 'will',
 'not',
 'suppose,',
 'however,',
 'that',
 'I',
 'gave',
 'utterance',
 'to',
 'a',
 'threat.',
 'At',
 'length',
 'I',
 'would',
 'be',
 'avenged;',
 'this',
 'was',
 'a',
 'point',
 'definitively',
 'settled—but',
 'the',
 'very',
 'definitiveness',
 'with',
 'which',
 'it',
 'was',
 'resolved,',
 'precluded',
 'the',
 'idea',
 'of',
 'risk.',
 'I',
 'must',
 'not',
 'only',
 'punish,',
 'but',
 'punish',
 'with',
 'impunity.',
 'A',
 'wrong',
 'is',
 'unredressed',
 'when',
 'retribution',
 'overtakes',
 'its',
 'redresser.',
 'It',
 'is',
 'equally',
 'unredressed',
 'when',
 'the',
 'avenger',
 'fails',
 'to',
 'make',
 'himself',
 'felt',
 'as',
 'such',
 'to',
 'him',
 'who',
 'has',
 'done

In [22]:
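# Note: "--" is listed as a separator, but the text uses em-dashes (—),
# so tokens such as 'Fortunato—although' remain joined in the output below.
toke = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
toke = [item for item in toke if item.split()]
toke = sorted(set(toke))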
print(toke)

['!', '"', "'", ',', '.', ':', ';', '?', 'A', 'Against', 'Amontillado', 'And', 'As', 'At', 'Austrian', 'Be', 'Besides', 'British', 'But', 'Come', 'De', 'Drink', 'Enough', 'Few', 'For', 'Fortunato', 'Fortunato—although', 'From', 'God', 'Good', 'Grâve', 'Ha', 'He', 'Here', 'His', 'How', 'I', 'If', 'Impossible', 'In', 'Indeed', 'It', 'Italian', 'Italians', 'Its', 'Lady', 'Let', 'Luchesi', 'Luchesi—', 'Luchesi——', 'Medoc', 'Montresor', 'Montresors', 'My', 'Nemo', 'Nitre', 'No', 'Not', 'Once', 'Paris', 'Pass', 'Proceed', 'Putting', 'Sherry', 'The', 'Then', 'There', 'These', 'They', 'Three', 'Throwing', 'Thus', 'To', 'True', 'True—true', 'Ugh', 'Unsheathing', 'We', 'When', 'Whither', 'Will', 'With', 'Withdrawing', 'Within', 'Yes', 'You', 'Your', 'a', 'about', 'above', 'absconded', 'accosted', 'account', 'admired', 'adopted', 'afflicted', 'again', 'again—', 'aid', 'aided—I', 'air', 'alarming', 'all', 'aloud—', 'am', 'among', 'an', 'and', 'another', 'answer', 'answered', 'any', 'aperture', 'ap
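
The output above still contains joined tokens such as `'Fortunato—although'` and `'aided—I'`, because the pattern splits on `--` while the story uses em-dash characters. One possible adjustment, shown here as a sketch on a short phrase taken from the story rather than on the full file, is to add the em-dash to the pattern. The `—+` alternative is an assumption about how runs of dashes (as in `'Luchesi——'`) should be treated, namely as a single token.

```python
import re

# Sample phrase taken from the split() output earlier; note the em-dash.
sample = "this was a point definitively settled—but the very definitiveness"

# Original pattern: splits on "--" but not on the em-dash character.
original = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
original = [item for item in original if item.strip()]

# Assumed variant: also treat one or more em-dashes as a separator token.
extended = re.split(r'([,.:;?_!"()\']|--|—+|\s)', sample)
extended = [item for item in extended if item.strip()]

print(original)
print(extended)
```

With the extended pattern, `settled—but` becomes three tokens (`settled`, `—`, `but`), so the vocabulary no longer needs separate entries for each dash-joined pair.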

In [23]:
# Use a distinct loop variable so it does not shadow the list name `toke`.
vocab = {token: index for index, token in enumerate(toke)}
print(vocab.items())

dict_items([('!', 0), ('"', 1), ("'", 2), (',', 3), ('.', 4), (':', 5), (';', 6), ('?', 7), ('A', 8), ('Against', 9), ('Amontillado', 10), ('And', 11), ('As', 12), ('At', 13), ('Austrian', 14), ('Be', 15), ('Besides', 16), ('British', 17), ('But', 18), ('Come', 19), ('De', 20), ('Drink', 21), ('Enough', 22), ('Few', 23), ('For', 24), ('Fortunato', 25), ('Fortunato—although', 26), ('From', 27), ('God', 28), ('Good', 29), ('Grâve', 30), ('Ha', 31), ('He', 32), ('Here', 33), ('His', 34), ('How', 35), ('I', 36), ('If', 37), ('Impossible', 38), ('In', 39), ('Indeed', 40), ('It', 41), ('Italian', 42), ('Italians', 43), ('Its', 44), ('Lady', 45), ('Let', 46), ('Luchesi', 47), ('Luchesi—', 48), ('Luchesi——', 49), ('Medoc', 50), ('Montresor', 51), ('Montresors', 52), ('My', 53), ('Nemo', 54), ('Nitre', 55), ('No', 56), ('Not', 57), ('Once', 58), ('Paris', 59), ('Pass', 60), ('Proceed', 61), ('Putting', 62), ('Sherry', 63), ('The', 64), ('Then', 65), ('There', 66), ('These', 67), ('They', 68), (