# Step 1: Creating Tokens

The code below reads the text file, then prints the total number of characters followed by the first 99 characters of the file for illustration purposes.

In [4]:
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 148208
TITLE: Alice's Adventures in Wonderland
AUTHOR: Lewis Carroll


= CHAPTER I = 
=( Down the Rabbit-H


Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters:

In [3]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
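
The parentheses in the pattern matter here. As a brief illustration (a minimal sketch reusing the example text from above; the names `without_group` and `with_group` are chosen only for this example), re.split drops the separators when the pattern has no capturing group and keeps them as separate list items when it does:

```python
import re

text = "Hello, world. This, is a test."

# No capturing group: the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# Capturing group: each whitespace separator is kept as its own item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the separators is what later lets us treat punctuation marks as tokens of their own rather than losing them during the split.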



Let's modify the regular expression so that it splits on whitespace (\s) as well as on commas and periods ([,.]):

In [7]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
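
The empty strings in this output come from adjacent separators: when a comma or period is directly followed by a space, re.split reports the zero-length text between the two matches as ''. A minimal sketch on a shorter, made-up string (`short_text` and `pieces` are illustrative names) shows the effect in isolation:

```python
import re

short_text = "a, b"

# The comma and the space are both separators and sit next to each other,
# so the split yields an empty string between them.
pieces = re.split(r'([,.]|\s)', short_text)

print(pieces)
```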


A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we can remove these redundant items as follows:


In [9]:
result = [item for item in result if item.strip()]      # keep only items that are non-empty after stripping whitespace
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double dashes (--) that appear in Lewis Carroll's text, along with additional special characters:

In [10]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [11]:
# Filter out any items that are empty or contain only whitespace (a no-op here, since the list is already clean).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
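
The split-and-filter steps can be bundled into a small helper so that the same logic is reused everywhere. This is only a sketch: the function name `tokenize` is hypothetical and not part of the notebook's code, but the pattern is the one used above.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Strip each piece and keep only the non-empty ones.
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = tokenize("Hello, world. Is this-- a test?")
print(tokens)
```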


Now that we have a basic tokenizer working, let's apply it to the entire text of Alice's Adventures in Wonderland:

In [12]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['TITLE', ':', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'AUTHOR', ':', 'Lewis', 'Carroll', '=', 'CHAPTER', 'I', '=', '=', '(', 'Down', 'the', 'Rabbit-Hole', ')', '=', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired']


In [13]:
print(len(preprocessed))


34158


In [18]:
# Optionally, numerical values could be removed from the dataset at this point, before creating the token IDs.
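
One possible way to drop numerical tokens is sketched below. The token list is a made-up sample in the style of the tokenizer output, and the rule for what counts as "numerical" (only digits and hyphens, with at least one digit) is an assumption made for this example, not something the notebook prescribes.

```python
import re

# Made-up sample in the style of the tokenizer output above.
sample_tokens = ['Alice', '1865-11-26', 'was', '500', 'tired', '--', '2', '.']

# Assumption: a token is numerical if it consists only of digits and hyphens
# and contains at least one digit, so the double dash '--' is kept.
numeric_pattern = re.compile(r'[\d-]*\d[\d-]*')

filtered_tokens = [t for t in sample_tokens if not numeric_pattern.fullmatch(t)]
print(filtered_tokens)
```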

# Step 2: Creating Token IDs

In the previous section, we tokenized the text of Alice's Adventures in Wonderland and assigned the result to a Python variable called preprocessed. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size:

In [14]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

3189


After determining that the vocabulary size is 3189 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes:

In [15]:
vocab = {token:integer for integer,token in enumerate(all_words)}   # assign an integer token ID to each unique token

In [16]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
('*', 5)
(',', 6)
('--', 7)
('.', 8)
('0', 9)
('00', 10)
('10', 11)
('124', 12)
('1865-11-26', 13)
('2', 14)
('2021-03-08', 15)
('2021-08-03', 16)
('30', 17)
('5', 18)
('500', 19)
('8', 20)
('9', 21)
(':', 22)
(';', 23)
('=', 24)
('?', 25)
('@', 26)
('A', 27)
('ALICE', 28)
('ALL', 29)
('AND', 30)
('ARE', 31)
('AT', 32)
('AUTHOR', 33)
('Ada', 34)
('Adventures', 35)
('Advice', 36)
('After', 37)
('Ah', 38)
('Alas', 39)
('Alice', 40)
('All', 41)
('Allow', 42)
('Always', 43)
('Ambition', 44)
('An', 45)
('And', 46)
('Ann', 47)
('Antipathies', 48)
('Arithmetic', 49)
('As', 50)
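
The ordering in this listing follows from how Python's sorted compares strings: by Unicode code point, so punctuation and digits come first, then uppercase letters, then lowercase letters. A minimal sketch with a handful of made-up tokens (`demo_tokens` and `demo_vocab` are illustrative names):

```python
demo_tokens = ['the', 'Alice', '!', '10', 'alice', 'AND']

# Same construction as for vocab above, applied to a tiny token list.
demo_vocab = {token: idx for idx, token in enumerate(sorted(set(demo_tokens)))}

print(demo_vocab)
```

Because the comparison is case-sensitive, 'Alice' and 'alice' would receive two different token IDs.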


Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.

For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
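
As a small sketch of the idea, using a hypothetical three-entry vocabulary rather than the real one, the inverse mapping is a dictionary comprehension that swaps keys and values:

```python
# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {'!': 0, 'Alice': 1, 'was': 2}

# Swap keys and values to map token IDs back to text tokens.
toy_inverse = {token_id: token for token, token_id in toy_vocab.items()}

ids = [1, 2, 0]
tokens = [toy_inverse[i] for i in ids]
print(tokens)
```

This works because every token has a unique ID, so no entries collide when the mapping is inverted.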

Let's now implement a complete tokenizer class. The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.

In [19]:
# Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

# Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

# Step 3: Process input text into token IDs

# Step 4: Convert token IDs back into text

# Step 5: Remove spaces before the specified punctuation

In [20]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and try it out in practice:

In [22]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"Alice's Adventures in Wonderland"""
ids = tokenizer.encode(text)
print(ids)

[1, 40, 2, 2489, 35, 1772, 447]


The code above prints the token IDs for the input text. Next, let's see if we can turn these token IDs back into text using the decode method:

In [23]:
tokenizer.decode(ids)

'" Alice\' s Adventures in Wonderland'
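
The decoded string is close to the input but not identical to it, because decode joins all tokens with single spaces and then only removes the spaces in front of certain punctuation marks. A short sketch of that join-and-substitute step, applied to the tokens from the earlier "Hello, world" example (the names `joined` and `cleaned` are illustrative):

```python
import re

tokens = ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# Step 1: join every token with a single space.
joined = " ".join(tokens)

# Step 2: remove the whitespace in front of the listed punctuation marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The double dash is not in the character class, so it keeps its surrounding spaces and the result differs slightly from the original "this-- a".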

In [24]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'
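
The KeyError is raised because the token 'Hello' is not a key in the vocabulary. One way to see in advance which tokens would fail is to compare the tokenized input against the vocabulary before looking anything up. The sketch below uses a hypothetical toy vocabulary (`toy_vocab`) in place of the real one:

```python
import re

# Hypothetical toy vocabulary; the real one is built from the book text.
toy_vocab = {',': 0, '?': 1, 'do': 2, 'like': 3, 'tea': 4, 'you': 5}

text = "Hello, do you like tea?"
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [p.strip() for p in pieces if p.strip()]

# Collect the tokens that have no entry in the vocabulary.
missing = [t for t in tokens if t not in toy_vocab]
print(missing)
```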

## ADDING SPECIAL CONTEXT TOKENS

In [None]:
# In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.

# In this section, we will modify this tokenizer to handle unknown words.

# In particular, we will modify the vocabulary and the SimpleTokenizerV1 class from the previous section
# to create SimpleTokenizerV2, which supports two new tokens, <|unk|> and <|endoftext|>

Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section:

In [25]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [27]:
len(vocab.items())

3191

In [28]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('yourself', 3186)
('youth', 3187)
('zigzag', 3188)
('<|endoftext|>', 3189)
('<|unk|>', 3190)


Step 1: Replace unknown words with <|unk|> tokens

Step 2: Remove spaces before the specified punctuation

In [29]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int           # <===#
            else "<|unk|>" for item in preprocessed   # <===#
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [30]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [31]:
tokenizer.encode(text)

[3190,
 6,
 1289,
 3182,
 1920,
 2820,
 25,
 3189,
 198,
 2848,
 3190,
 3190,
 2139,
 2848,
 3190,
 8]

In [32]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.'
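
The unknown-word fallback in encode can also be written with dict.get and a default value, which should give the same IDs in a single pass over the tokens. The sketch below is an alternative formulation on a hypothetical toy vocabulary (`toy_vocab`), not the notebook's implementation:

```python
# Hypothetical toy vocabulary that already contains the special tokens.
toy_vocab = {'do': 0, 'tea': 1, 'you': 2, '<|endoftext|>': 3, '<|unk|>': 4}

tokens = ['Hello', 'do', 'you', 'like', 'tea']

# Look up each token; fall back to the <|unk|> ID when it is missing.
unk_id = toy_vocab['<|unk|>']
ids = [toy_vocab.get(token, unk_id) for token in tokens]

print(ids)
```

Either way, the information about which word was replaced is lost, which is why the decoded text above shows <|unk|> instead of the original words.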