#Creating Tokens


In [None]:
url = "https://raw.githubusercontent.com/Satvik-jain/DeepLearning_LSTM_WordPredictor/refs/heads/main/The%20Verdict%20-%20Next%20Word%20Predictor/data/Data.txt"
# To load the text straight from GitHub, open the file on GitHub, click "Raw", and copy the URL from the address bar
import requests
r = requests.get(url)
book = r.text

In [None]:
print(len(book))  # total number of characters in the text
print(book[450:560])

20482
oring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't


#####Our goal is to tokenize this 20,482-character short story into individual words and special characters that we can then turn into embeddings for LLM training.
#####How can we best split this text to obtain a list of tokens? For this, we go on a small excursion and use Python's regular expression library re for illustration purposes. (Note that you don't have to learn or memorize any regular expression syntax since we will transition to a pre-built tokenizer later in this chapter.)

In [None]:
import re
text = "hello everyone my name is something, an I. Finding it good"
result = re.split(r'(\s)', text)
print(result)
# The capturing group (\s) keeps each whitespace character in the result as its own item instead of discarding it


['hello', ' ', 'everyone', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'something,', ' ', 'an', ' ', 'I.', ' ', 'Finding', ' ', 'it', ' ', 'good']


#####What is being done above is splitting the text on whitespace (\s). We can extend the pattern to also split on commas, periods, or any other character we want to treat as a separate token.

In [None]:
result = re.split(r'([,.]|\s)',text)
print(result)

['hello', ' ', 'everyone', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'something', ',', '', ' ', 'an', ' ', 'I', '.', '', ' ', 'Finding', ' ', 'it', ' ', 'good']


REMOVING WHITESPACES OR NOT

When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.
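
As a small illustration of this trade-off, the sketch below tokenizes the same string twice, once keeping whitespace tokens and once dropping them. The sample string and the variable names are made up for this example.

```python
import re

sample = "if x:\n    return 1"

# Keep every piece the capturing group returns, dropping only empty strings
with_ws = [tok for tok in re.split(r'(\s)', sample) if tok != ""]

# Additionally drop whitespace-only pieces, as done in this notebook
without_ws = [tok for tok in with_ws if tok.strip()]

print(with_ws)
print(without_ws)
print(len(with_ws), len(without_ws))
```

Joining `with_ws` reproduces the input exactly, indentation included, while `without_ws` is shorter but can no longer recover the original layout.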

The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes that appear in Edith Wharton's short story, along with additional special characters:

In [None]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--)', text)
print(result)
# result = [item.strip() for item in result if item.strip()]
# print(result)

['Hello', ',', ' world', '.', ' Is this', '--', ' a test', '?', '']


In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',book)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:100])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--']


In [None]:
len(preprocessed)

4690

#Creating token **IDs**

In [None]:
# Build an alphabetically sorted list of all unique tokens; its length is the vocabulary size
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)




1130


In [None]:
# enumerate(all_words)

In [None]:
vocab = {token:integer for integer,token in enumerate(all_words)}
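
To make the mapping concrete, here is a minimal sketch on a handful of made-up tokens (`toy_tokens` is not taken from the story). It builds the token-to-ID dictionary the same way as above, plus its inverse, which is what allows going from an ID back to a token.

```python
toy_tokens = ["the", "cat", ",", "the", "dog", "."]

# Unique tokens, sorted so that the IDs are reproducible
toy_words = sorted(set(toy_tokens))
toy_vocab = {token: idx for idx, token in enumerate(toy_words)}

# Inverse mapping: ID -> token
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

print(toy_vocab)
print(toy_inverse[toy_vocab["cat"]])
```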

In [None]:
all_words[1]  # vocab maps token -> ID, so look up position 1 in the sorted token list instead

'"'

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.

In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.

Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Remove spaces before the specified punctuation

In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                               # Step 1
        self.int_to_str = {i: s for s, i in vocab.items()}    # Step 2

    def encode(self, text):
        # Step 3: split on punctuation, double-dashes, and whitespace
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only pieces
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # Raises KeyError if a token is not in the vocabulary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Step 4: map IDs back to tokens and join them with spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Step 5: remove the space before the specified punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


In [None]:
tokenizer = SimpleTokenizerV1(vocab)
text = """ his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't"""
ids = tokenizer.encode(text)
print(ids)

[549, 1042, 116, 7, 1, 73, 297, 585, 2, 850, 498, 1016, 866, 988, 1059, 722, 697, 769, 2, 1083, 1051, 9, 239, 53, 359, 2, 970]


In [None]:
tokenizer.decode(ids)

# Note that this tokenizer only knows the tokens in its vocabulary; an out-of-vocabulary word makes encode() raise a KeyError
# Hence we need a tokenizer that at least does not throw an error when it cannot encode a word

'his unaccountable abdication." Of course it\' s going to send the value of my picture\' way up ; but I don\' t'
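
The limitation can be reproduced on a toy vocabulary. The sketch below uses a hypothetical three-token vocabulary and a stripped-down `toy_encode` function (an illustration, not the class above) to show that a plain dictionary lookup raises a `KeyError` for an unseen word.

```python
import re

toy_vocab = {",": 0, "hello": 1, "world": 2}  # hypothetical vocabulary

def toy_encode(text, vocab):
    # Same splitting rule as SimpleTokenizerV1.encode
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    pieces = [p.strip() for p in pieces if p.strip()]
    return [vocab[p] for p in pieces]  # raises KeyError on unknown tokens

print(toy_encode("hello, world", toy_vocab))

try:
    toy_encode("hello, moon", toy_vocab)
    failed_on = None
except KeyError as err:
    failed_on = err.args[0]
    print("Not in vocabulary:", failed_on)
```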

ADDING SPECIAL CONTEXT TOKENS
We will modify the vocabulary and the tokenizer we implemented in the previous section, SimpleTokenizerV1, into a new SimpleTokenizerV2 that supports two new tokens, <|unk|> and <|endoftext|>.

We can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary.

Furthermore, we add a token between unrelated texts.

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source. Here, that token is <|endoftext|>.


In [None]:
# preprocessed

In [None]:
# all_tokens = sorted(list(set(preprocessed)))
all_words.extend(["<|endoftext|>", "<|unk|>"])

In [None]:
len(all_words)
all_words[1130]

'<|endoftext|>'

In [None]:
vocab = {s:i for i,s in enumerate(all_words)}


In [None]:
# vocab

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only pieces
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # Replace every out-of-vocabulary token with <|unk|>
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>"
            for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space before the specified punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the india."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the india.


In [None]:
ids = tokenizer.encode(text)

In [None]:
tokenizer.decode(ids)

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

Some LLMs also use additional special tokens such as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.
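
As a rough sketch of the padding idea, the function below right-pads a batch of token-ID lists to the length of the longest one. The function name `pad_batch` and the pad ID of 0 are assumptions made for this illustration, not values taken from any particular tokenizer.

```python
def pad_batch(batch, pad_id):
    """Right-pad every list of token IDs to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

batch = [[5, 8, 2], [7], [4, 9]]      # made-up token IDs
padded = pad_batch(batch, pad_id=0)   # 0 is a hypothetical [PAD] ID
print(padded)
```

In practice the padded positions are usually masked out so that they do not contribute to the training loss.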

#Byte Pair encoding
#####Broadly, there are three types of tokenization techniques: word-based, character-based, and sub-word-based.
We have already seen word-based tokenization above. Byte pair encoding (BPE) is an algorithm for sub-word-based tokenization.
GPT models also use BPE for tokenization.
Since implementing the algorithm from scratch is fairly involved, we'll use an existing library called tiktoken for this task.
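
To give a feel for what the algorithm does, here is a minimal sketch of the core BPE training loop on a toy corpus: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. This is a simplified character-level version and not how tiktoken implements it (GPT-2's BPE works on bytes and pre-splits the text first). The names `count_pairs` and `merge_pair` and the toy corpus are made up for this example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Toy corpus: every word starts out as a list of single characters
words = [list(w) for w in ["low", "low", "lower", "newest", "newest", "widest"]]

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)   # most frequent pair (first seen wins ties)
    merges.append(best)
    words = [merge_pair(w, best) for w in words]

print(merges)
print(words)
```

After three merges the frequent word "low" has become a single symbol, while rarer words are still split into smaller pieces. That is the sub-word behaviour we want from the tokenizer.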

In [None]:
!pip3 install tiktoken



In [None]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [None]:
# Recall that at this step earlier we had loaded the raw text, built a vocabulary from it, and written the encode and decode methods ourselves
# With tiktoken, the BPE vocabulary used by GPT-2 has already been learned, so we only need to load the ready-made encoding
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [None]:
text = tokenizer.decode(integers)
print(text)

# Notice that tiktoken can encode someunknownPlace even though it is not an English word or place name: BPE breaks it into known sub-word pieces
# Also, <|endoftext|> has ID 50256, the last ID in the GPT-2 vocabulary (50,257 tokens, IDs 0 to 50256)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


In [None]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

# Any gibberish word is broken into sub-word pieces that each have a token ID, so nothing is out of vocabulary: the power of tiktoken/BPE

[33901, 86, 343, 86, 220, 959]
Akwirw ier
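
One reason no input is ever out of vocabulary is that GPT-2's BPE is byte-level: its base vocabulary covers all 256 possible byte values, and the learned merges are built on top of those. The sketch below does not call tiktoken. It only shows that any string, made-up words included, reduces to UTF-8 bytes in the range 0 to 255, which is the fallback a byte-level BPE can always rely on. The helper name `to_byte_symbols` is made up for this example.

```python
def to_byte_symbols(text):
    """Return the UTF-8 byte values of `text`; every value is in range(256)."""
    return list(text.encode("utf-8"))

for word in ["Akwirw", "ier", "naïve"]:
    symbols = to_byte_symbols(word)
    assert all(0 <= b < 256 for b in symbols)
    # Decoding the bytes restores the original string exactly
    print(word, symbols, bytes(symbols).decode("utf-8"))
```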


In [None]:
encodings = {
    "gpt2": tiktoken.get_encoding("gpt2"),
    "gpt3": tiktoken.get_encoding("p50k_base"),  # Used by text-davinci-002/003 and the Codex models
    "gpt4": tiktoken.get_encoding("cl100k_base")  # Used by GPT-4 and GPT-3.5-turbo; GPT-4o uses o200k_base
}

# Get the vocabulary size for each encoding
vocab_sizes = {model: encoding.n_vocab for model, encoding in encodings.items()}

# Print the vocabulary sizes
for model, size in vocab_sizes.items():
    print(f"The vocabulary size for {model.upper()} is: {size}")

The vocabulary size for GPT2 is: 50257
The vocabulary size for GPT3 is: 50281
The vocabulary size for GPT4 is: 100277
