### Step 1: Creating Tokens

In [2]:

with open("the-verdict.txt","r",encoding="utf-8") as f:
    raw_text=f.read()

print("length of raw text",len(raw_text))
print(raw_text[:99])

length of raw text 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div class="alert alert-block alert-success">

The print command prints the total number of characters followed by the first 99
characters of this file for illustration purposes. </div>

<div class="alert alert-block alert-warning">

Note that it's common to process millions of articles and hundreds of thousands of
books -- many gigabytes of text -- when working with LLMs. However, for educational
purposes, it's sufficient to work with smaller text samples like a single book to
illustrate the main ideas behind the text processing steps and to make it possible to
run it in reasonable time on consumer hardware. </div>

<div class="alert alert-block alert-success">

How can we best split this text to obtain a list of tokens? For this, we go on a small
excursion and use Python's regular expression library re for illustration purposes. (Note
that you don't have to learn or memorize any regular expression syntax since we will
transition to a pre-built tokenizer later in this chapter.) </div>
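
As a small illustration of the splitting idea, the sketch below applies the same pattern to a short made-up sentence (the sample string is an assumption for illustration and is not taken from the book file). The capturing group keeps the separators in the result, and the empty strings and whitespace-only items are filtered out afterwards.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "Hello, world. Is this-- a test?"

# The capturing group (...) makes re.split keep the separators as list items
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Remove empty strings and whitespace-only items
tokens = [p for p in pieces if p.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Whether to drop whitespace tokens is a design choice: removing them reduces memory use, while keeping them can matter for text that is sensitive to exact spacing, such as source code.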

In [3]:
import re
result = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
print(result[:20])

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--']


In [4]:
# Drop empty strings and whitespace-only items produced by the split
result=[item for item in result if item.strip()]
print(len(result))

4690


In [5]:
all_word=sorted(set(result))
print(len(all_word))

1130


### Step 2: Creating Token IDs

In [6]:
# Map each unique token to an integer ID and keep the inverse mapping for decoding
vocab={}
reverse_vocab={}
for i,item in enumerate(all_word):
    vocab[item]=i
    reverse_vocab[i]=item

<div class="alert alert-block alert-info">
The dictionary built by the code above associates each individual token with a
unique integer label.
</div>

<div class="alert alert-block alert-success">

The vocabulary has 1,130 entries, one for each unique token found in the previous step.
Printing the first few entries of the vocabulary is a quick way to inspect the mapping.

</div>
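
To show what such a token-to-ID mapping looks like, here is a minimal sketch that builds a vocabulary from a short excerpt instead of the full file and prints its first entries. The excerpt and the variable names (`excerpt`, `toy_vocab`) are assumptions chosen for illustration; the IDs differ from those of the full 1,130-entry vocabulary.

```python
import re

# Short excerpt used only for illustration (not the full text file)
excerpt = ("I HAD always thought Jack Gisburn rather a cheap genius"
           "--though a good fellow enough")

pieces = re.split(r'([,.:;?_!"()\']|--|\s)', excerpt)
tokens = [p for p in pieces if p.strip()]

# Sorting makes the token-to-ID assignment deterministic
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

print(len(toy_vocab))
for token, idx in list(toy_vocab.items())[:5]:
    print((token, idx))
# 15
# ('--', 0)
# ('Gisburn', 1)
# ('HAD', 2)
# ('I', 3)
# ('Jack', 4)
```

Punctuation and uppercase words come first because Python sorts strings by code point.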

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary.

Furthermore, we add an <|endoftext|> token between
unrelated texts.

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>
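
One detail worth checking when joining documents is the whitespace around the separator. The sketch below, which uses a hypothetical helper `split_tokens` and two made-up document strings, suggests that the marker only becomes its own token with this whitespace-based splitter when it is surrounded by spaces.

```python
import re

def split_tokens(text):
    """Split on punctuation, double dashes, and whitespace; drop empty items."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p for p in pieces if p.strip()]

# Two short stand-in "documents", assumed for illustration
docs = ["last Chicago sitter", "value of my picture"]

# No space before the marker: it is glued to the preceding word
glued = split_tokens("<|endoftext|> ".join(docs))

# Space on both sides: the marker is split off as a separate token
spaced = split_tokens(" <|endoftext|> ".join(docs))

print(glued)
# ['last', 'Chicago', 'sitter<|endoftext|>', 'value', 'of', 'my', 'picture']
print(spaced)
# ['last', 'Chicago', 'sitter', '<|endoftext|>', 'value', 'of', 'my', 'picture']
```

In the glued case, `sitter<|endoftext|>` is not a vocabulary entry, so a tokenizer with an unknown-word fallback would map it to <|unk|>.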



<div class="alert alert-block alert-success">

Let's now modify the vocabulary to include these two special tokens, <|unk|> and
<|endoftext|>, by adding these to the list of all unique words that we created in the
previous section:
</div>

In [7]:
# Rebuild both mappings with the two special tokens appended at the end
all_tokens=sorted(vocab)
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab={}
reverse_vocab={}
for i,item in enumerate(all_tokens):
    vocab[item]=i
    reverse_vocab[i]=item

<div class="alert alert-block alert-info">
    
Step 1: Pass the vocabulary to encode as an argument so it can map tokens to token IDs

Step 2: Pass the inverse vocabulary, which maps token IDs back to the original text tokens, to decode

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Join the decoded tokens with spaces (spaces before punctuation are not removed in this version)

</div>
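
A class-based variant that stores the vocabulary as an attribute, builds the inverse mapping itself, and removes the spaces before punctuation when decoding could be sketched as follows. The class name `SimpleTokenizerSketch`, the sample sentence, and the tiny vocabulary are assumptions for illustration; the notebook cells after this use plain functions instead.

```python
import re

class SimpleTokenizerSketch:
    def __init__(self, vocab):
        # Store the vocabulary and derive the inverse (ID -> token) mapping
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split into tokens, then map each token to its ID (or to <|unk|>)
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [p.strip() for p in pieces if p.strip()]
        unk_id = self.str_to_int["<|unk|>"]
        return [self.str_to_int.get(t, unk_id) for t in tokens]

    def decode(self, ids):
        # Join tokens with spaces, then remove the space before punctuation
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Tiny vocabulary built from a made-up sentence
sample = "the painter said, the picture is good."
unique = sorted({p for p in re.split(r'([,.:;?_!"()\']|--|\s)', sample) if p.strip()})
unique.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {tok: i for i, tok in enumerate(unique)}

tok = SimpleTokenizerSketch(toy_vocab)
ids = tok.encode("the picture is good, said the stranger.")
roundtrip = tok.decode(ids)
print(ids)        # [7, 5, 3, 2, 0, 6, 7, 9, 1]
print(roundtrip)  # the picture is good, said the <|unk|>.
```

The word "stranger" is not in the toy vocabulary, so it is encoded as the <|unk|> ID and cannot be recovered when decoding.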



In [8]:
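def encode(text,vocab):
    # Split on punctuation, double dashes, and whitespace, keeping the separators
    tokens=re.split(r'([,.:;?_!"()\']|--|\s)',text)
    # Drop empty strings and whitespace-only items
    tokens=[item for item in tokens if item.strip()]
    idx=[]
    for token in tokens:
        # Fall back to the <|unk|> ID for tokens missing from the vocabulary
        if token in vocab:
            idx.append(vocab[token])
        else:
            idx.append(vocab["<|unk|>"])
    return idx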
 

In [9]:
def decode(idx,reverse_vocab):
    words=[]
    for i in idx:
        if i in reverse_vocab:
            words.append(reverse_vocab[i])
    return " ".join(words) 

In [10]:
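text1="Thwing--his last Chicago sitter"
text2="value of my picture"
# The separator has no leading space, so "sitter<|endoftext|>" stays a single
# unknown token and is encoded as <|unk|>; joining with " <|endoftext|> "
# would keep the marker as its own token
text="<|endoftext|> ".join([text1,text2])
idx=encode(text,vocab)
print(idx)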

[100, 6, 549, 602, 25, 1131, 1059, 722, 697, 769]


In [11]:
decoded_text=decode(idx,reverse_vocab)
print(decoded_text)

Thwing -- his last Chicago <|unk|> value of my picture


### Byte Pair Encoding (BPE)


<div class="alert alert-block alert-success">

We implemented a simple tokenization scheme in the previous sections for illustration
purposes. 

This section covers a more sophisticated tokenization scheme based on a concept
called byte pair encoding (BPE). 

The BPE tokenizer covered in this section was used to train
LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.</div>

<div class="alert alert-block alert-warning">

Since implementing BPE can be relatively complicated, we will use an existing Python
open-source library called tiktoken (https://github.com/openai/tiktoken). 

This library implements
the BPE algorithm very efficiently based on source code in Rust.
</div>

In [12]:
%pip install tiktoken

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [13]:
import tiktoken
print(tiktoken.__version__)
tokenizer=tiktoken.get_encoding("gpt2")

0.12.0


<div class="alert alert-block alert-info">
    
The code below encodes a sample text and prints the resulting token IDs:

</div>

In [14]:
text=(
     "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
     "of someunknownPlace."
)
values=tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(values)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


<div class="alert alert-block alert-warning">

The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary
into smaller subword units or even individual characters.

This enables it to handle out-of-vocabulary words.

So, thanks to the BPE algorithm, if the tokenizer encounters an
unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or
characters.
    


</div>
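
To make the idea of subword units more concrete, here is a toy sketch of how BPE-style merges can be learned from data and then applied to an unseen word. It works on the characters of a tiny made-up word list rather than on bytes, and the helper names (`most_frequent_pair`, `merge_pair`) are hypothetical. It is not how tiktoken is implemented, only an illustration of the merge principle.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Toy training corpus (assumed for illustration), split into characters
words = ["low", "lower", "lowest", "low"]
sequences = [list(w) for w in words]

# Learn two merges: each round fuses the currently most frequent pair
merges = []
for _ in range(2):
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    merges.append(pair)
    sequences = [merge_pair(s, pair) for s in sequences]

# Apply the learned merges, in order, to a word not seen during training
unseen = list("slower")
for pair in merges:
    unseen = merge_pair(unseen, pair)

print(sequences)  # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't'], ['low']]
print(unseen)     # ['s', 'low', 'e', 'r']
```

The unseen word is never rejected: the parts covered by learned merges become larger units, and the rest falls back to single characters.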

In [15]:
values=tokenizer.encode("hello i am rashid")
print(values)
strings=tokenizer.decode(values)
print(strings)

[31373, 1312, 716, 28509, 312]
hello i am rashid


<div class="alert alert-block alert-info">
    
Executing the code below prints the first 20 token IDs followed by 5145, the total number
of tokens in the text after applying the BPE tokenizer.

</div>

In [16]:
with open("the-verdict.txt","r",encoding="utf-8") as f:
    raw_data=f.read()
enc_text=tokenizer.encode(raw_data)
print(enc_text[:20])
print(len(enc_text))

[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438]
5145


In [17]:
context_size=4
# x holds the first context_size tokens; y holds the same window shifted by one
x=enc_text[:context_size]
y=enc_text[1:context_size+1]
print(x)
print(y)

[40, 367, 2885, 1464]
[367, 2885, 1464, 1807]


<div class="alert alert-block alert-success">
Processing the inputs along with the targets, which are the inputs shifted by one position,
we can then create the next-word prediction tasks as
follows:</div>

In [18]:
for i in range(1,context_size+1):
    context=enc_text[:i]
    desired=enc_text[i]
    print(context,"------>",desired)

[40] ------> 367
[40, 367] ------> 2885
[40, 367, 2885] ------> 1464
[40, 367, 2885, 1464] ------> 1807


<div class="alert alert-block alert-info">
Everything left of the arrow (------>) refers to the input an LLM would receive, and the token
ID on the right side of the arrow represents the target token ID that the LLM is supposed to
predict.
</div>

In [19]:
for i in range(1,context_size+1):
    context=enc_text[:i]
    desired=enc_text[i]
    print(tokenizer.decode(context),"------>",tokenizer.decode([desired]))

I ------>  H
I H ------> AD
I HAD ------>  always
I HAD always ------>  thought
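
The shifted-by-one pairing shown above can be extended to cover a whole token sequence with a sliding window. The sketch below is one possible way to do this in plain Python; the function name `sliding_windows` and the `stride` parameter are assumptions for illustration, and the toy IDs are the first nine token IDs printed earlier.

```python
def sliding_windows(token_ids, context_size, stride):
    """Build (input, target) windows where each target is the input shifted by one."""
    inputs, targets = [], []
    # Stop early enough that every target window has context_size tokens
    for start in range(0, len(token_ids) - context_size, stride):
        inputs.append(token_ids[start:start + context_size])
        targets.append(token_ids[start + 1:start + context_size + 1])
    return inputs, targets

# First nine token IDs of the encoded text, hard-coded to keep this self-contained
toy_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]

inputs, targets = sliding_windows(toy_ids, context_size=4, stride=4)
for x, y in zip(inputs, targets):
    print(x, "------>", y)
# [40, 367, 2885, 1464] ------> [367, 2885, 1464, 1807]
# [1807, 3619, 402, 271] ------> [3619, 402, 271, 10899]
```

With a stride equal to the context size the input windows do not overlap; a smaller stride produces more, overlapping training examples from the same text.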
