### Step 1: Creating Tokens

In [1]:

# Read the whole short story into a single string
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("length of raw text", len(raw_text))
print(raw_text[:99])

length of raw text 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div class="alert alert-block alert-success">

The print statements output the total number of characters, followed by the first 99
characters of the file, for illustration purposes. </div>

<div class="alert alert-block alert-warning">

Note that it's common to process millions of articles and hundreds of thousands of
books -- many gigabytes of text -- when working with LLMs. However, for educational
purposes, it's sufficient to work with smaller text samples like a single book to
illustrate the main ideas behind the text processing steps and to make it possible to
run it in reasonable time on consumer hardware. </div>

<div class="alert alert-block alert-success">

How can we best split this text to obtain a list of tokens? For this, we take a short
detour and use Python's regular expression library, `re`, for illustration purposes. (Note
that you don't have to learn or memorize any regular expression syntax, since we will
transition to a pre-built tokenizer later in this chapter.) </div>

In [3]:
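import re
# Split on punctuation, double dashes, and whitespace; the capturing group
# keeps each delimiter as its own item in the result
result = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
print(result[:20])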

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--']
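
<div class="alert alert-block alert-info">

To see what the capturing group in the pattern above does, here is a minimal sketch on a
short, made-up sentence (the sample string and variable names are assumptions, not part of
the-verdict.txt). It compares splitting with and without the parentheses, then drops
empty and whitespace-only items.

</div>

```python
import re

# Hypothetical sample sentence, not taken from the-verdict.txt
sample = "Hello, world. Is this-- a test?"

# Without a capturing group, the delimiters are discarded
no_group = re.split(r'[,.:;?_!"()\']|--|\s', sample)

# With a capturing group, the delimiters are kept as separate items
with_group = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items
tokens = [item for item in with_group if item.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Keeping punctuation as separate tokens is a design choice: it lets the vocabulary treat
"test" and "?" as two entries instead of one entry per word-plus-punctuation combination.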


In [4]:
# Drop empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print(len(result))

4690


In [5]:
all_word=sorted(set(result))
print(len(all_word))

1130


### Step 2: Creating Token IDs

In [6]:
# Map each unique token to an integer ID, and build the reverse lookup
vocab = {}
reverse_vocab = {}
for i, item in enumerate(all_word):
    vocab[item] = i
    reverse_vocab[i] = item

<div class="alert alert-block alert-info">
The vocab dictionary maps each individual token to a unique integer label, and
reverse_vocab maps each label back to its token.
</div>

<div class="alert alert-block alert-success">

After determining that the vocabulary size is 1,130 via the above code, we create the
vocabulary and print its first 51 entries for illustration purposes:

</div>
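
<div class="alert alert-block alert-info">

A minimal sketch of such a printing loop is shown here. So that it runs on its own, it
builds a small stand-in vocabulary from a made-up sentence (`toy_text`, `toy_tokens`, and
`toy_vocab` are assumed names); in the notebook, the same loop would iterate over the
existing `vocab` dictionary instead.

</div>

```python
import re

# Toy stand-in text (an assumption) so the snippet runs on its own;
# in the notebook, iterate over the existing `vocab` instead.
toy_text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
toy_tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', toy_text) if t.strip()]
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}

# Print entries up to and including index 50 (the first 51), then stop;
# the toy vocabulary is smaller, so here every entry is printed
for i, item in enumerate(toy_vocab.items()):
    print(item)
    if i >= 50:
        break
```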

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary. 

Furthermore, we add an <|endoftext|> token between
unrelated texts. 

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>
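
<div class="alert alert-block alert-info">

As a small illustration of the separator idea, the sketch below joins two made-up,
unrelated texts with the `<|endoftext|>` token and splits the result with the same
regular expression used earlier. The sample strings and variable names are assumptions
chosen for this example.

</div>

```python
import re

# Two hypothetical, unrelated text sources
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

# Join them with the separator token so the boundary stays visible
combined = " <|endoftext|> ".join((text1, text2))
print(combined)

# The separator survives splitting as a single item, because none of its
# characters (<, |, >) appear in the delimiter pattern
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', combined) if t.strip()]
print(tokens)
```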



<div class="alert alert-block alert-success">

Let's now modify the vocabulary to include these two special tokens, <|unk|> and
<|endoftext|>, by adding these to the list of all unique words that we created in the
previous section:
</div>

In [8]:
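# Start from the unique tokens and append the two special tokens
all_tokens = sorted(all_word)
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
# Rebuild both lookups so they stay consistent with each other
vocab = {}
reverse_vocab = {}
for i, item in enumerate(all_tokens):
    vocab[item] = i
    reverse_vocab[i] = item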
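
<div class="alert alert-block alert-info">

With the two special tokens in place, one possible way to use them is sketched below.
The helper names `encode` and `decode`, the toy sentence, and the `str_to_int` and
`int_to_str` dictionaries are assumptions made for this example, not the notebook's own
tokenizer. Any token missing from the vocabulary is mapped to the ID of `<|unk|>`.

</div>

```python
import re

# Toy vocabulary (an assumption) built the same way as in the cells above
toy_text = "I thought Jack was a genius."
pattern = r'([,.:;?_!"()\']|--|\s)'
toy_tokens = sorted({t for t in re.split(pattern, toy_text) if t.strip()})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
str_to_int = {token: idx for idx, token in enumerate(toy_tokens)}
int_to_str = {idx: token for token, idx in str_to_int.items()}

def encode(text):
    """Split text into tokens and map each to its ID, using <|unk|> for unknown tokens."""
    tokens = [t for t in re.split(pattern, text) if t.strip()]
    return [str_to_int.get(t, str_to_int["<|unk|>"]) for t in tokens]

def decode(ids):
    """Map IDs back to tokens and join them with single spaces."""
    return " ".join(int_to_str[i] for i in ids)

ids = encode("Jack was a painter.")
print(ids)          # [2, 6, 3, 8, 0]
print(decode(ids))  # Jack was a <|unk|> .
```

Note that the decoded text is not identical to the input: the unknown word is replaced by
`<|unk|>`, and this simple `decode` puts a space before punctuation.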