# Chapter 2, part 1 - Tokenization for input text
To pretrain an LLM, we train the model on a large amount of text so that it learns the relationships between words.
The first step in preparing the training data is to preprocess the raw input text. This notebook covers how to convert raw text from string format into integer token IDs, since PyTorch models operate on numeric data rather than on strings.


In [24]:
print("hello world")

hello world


### Preprocess the raw text into individual words (including punctuation)
After preprocessing, the whole raw text file becomes a list of individual tokens.

In [25]:
# we should have the the-verdict.txt file ready in local env.

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# inspect the total number of characters and print a sample
print(len(raw_text))
print(raw_text[:99])

20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [26]:
# sample code to use regular expression to tokenize an input text stream
import re
sample_text = "Hello, word. This, is a test."

# this way we are splitting based on spaces, not ideal because there are punctuation characters attached to words
result = re.split(r'(\s)', sample_text)
print(result)

# this way we are splitting on whitespaces (\s), commas, and periods, it's still not ideal because an empty string or a whitespace is an element
result = re.split(r'([,.]|\s)', sample_text)
print(result)
# we can get rid of spaces with:
result = [item for item in result if item.strip()]
print(result)

# for our short story text we want to also include text like "--" when we do tokenization, so we can:
sample_text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', sample_text)
result = [item for item in result if item.strip()]
print(result)

['Hello,', ' ', 'word.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
['Hello', ',', '', ' ', 'word', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
['Hello', ',', 'word', '.', 'This', ',', 'is', 'a', 'test', '.']
['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [27]:
# let's use this RE scheme on the input text
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item.strip()]
print(len(preprocessed))

# let's inspect 30 elements, looks pretty good
print(preprocessed[:30])

4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


### Convert tokens into token IDs
The tokens are still strings; next we need to map them to integers that the model can process.

Steps for converting tokens to token IDs:
1. Collect the set of unique tokens.
2. Sort the set and label the tokens from 0 to N-1 (N is the number of unique tokens). Python sorts strings by character code, so most punctuation and uppercase letters come before lowercase ones.
3. Use these labels to map each token to its integer token ID.
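
The three steps above can be sketched on a tiny hand-made token list before applying them to the full text. The token list and the names `toy_tokens`, `token_to_id`, and `id_to_token` are made up for illustration only.

```python
# A toy token list (hypothetical, not taken from the-verdict.txt)
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Step 1 and 2: collect unique tokens and sort them (by character code)
unique_tokens = sorted(set(toy_tokens))

# Step 3: label each unique token with an integer from 0 to N-1
token_to_id = {token: idx for idx, token in enumerate(unique_tokens)}
id_to_token = {idx: token for token, idx in token_to_id.items()}

# Map the original token sequence to IDs and back again
ids = [token_to_id[token] for token in toy_tokens]
decoded = [id_to_token[i] for i in ids]

print(unique_tokens)  # ['.', 'cat', 'mat', 'on', 'sat', 'the']
print(ids)            # [5, 1, 4, 3, 5, 2, 0]
```

The repeated word "the" maps to the same ID both times, which is the point of building the vocabulary from a set.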


In [28]:
all_unique_words = sorted(set(preprocessed))
vocab_size = len(all_unique_words)
print("total unique tokens:", vocab_size)

# let's check some token IDs
vocab = {token: integer for integer, token in enumerate(all_unique_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 20:
        break

total unique tokens: 1130
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)


In [29]:
# in the above cell we have figured out how to get a mapping from token (String) to token ID (integer), now let's wrap it up in a class for token encoding & decoding.
class SimpleTokenizerV1:
    def __init__(self, vocab: dict):
        # in the constructor, we initialize the bidirectional mapping
        self.str_to_int = vocab
        self.int_to_str = {token_id: token for token, token_id in vocab.items()}
        
    # this is an encoding method used to transform text input to a series of token IDs
    def encode(self, text: str) -> list[int]: 
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    # this is a decoding method used to transform token IDs back to the string text
    def decode(self, ids: list[int]) -> str:
        text = " ".join([self.int_to_str[i] for i in ids])
        
        # this step is to remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)     
        return text
    

# now let's use this SimpleTokenizerV1 class to do some testing
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
# encode into IDs
ids = tokenizer.encode(text)
print(ids)
# decode back to string
print(tokenizer.decode(ids))
        

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


The class above gives us a tokenizer that can encode and decode text,
but if we feed it a word that is not in the vocabulary, it raises an error.

In [30]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'
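
Before encoding, it can help to see which tokens would trigger this `KeyError`. The following is a minimal sketch, assuming the same splitting pattern as the tokenizer above; `find_unknown_tokens` and `toy_vocab` are hypothetical names introduced here, and the toy vocabulary stands in for the real one.

```python
import re

def find_unknown_tokens(text: str, vocab: dict[str, int]) -> list[str]:
    """Return the tokens of `text` that have no entry in `vocab`."""
    # Same split pattern as SimpleTokenizerV1.encode
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# A tiny stand-in vocabulary (assumption: the real vocab lacks "Hello" too)
toy_vocab = {
    token: idx
    for idx, token in enumerate(sorted({"do", "you", "like", "tea", ",", "?"}))
}

missing = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```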

Besides the tokens collected from the training text, the vocabulary also needs a few special tokens that give the model extra context.
Here we add two special tokens, `<|unk|>` and `<|endoftext|>`, to represent an unknown word and the end of a text (the boundary between unrelated texts), respectively.

In [33]:
all_unique_words = sorted(list(set(preprocessed)))
all_unique_words.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_unique_words)}
print(len(vocab.items()))
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

1132
('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [37]:
# after adding the above two special contextual tokens, let's construct a tokenizer V2
class SimpleTokenizerV2:
    def __init__(self, vocab: dict):
        # in the constructor, we initialize the bidirectional mapping
        self.str_to_int = vocab
        self.int_to_str = {token_id: token for token, token_id in vocab.items()}
        
    # this is an encoding method used to transform text input to a series of token IDs
    def encode(self, text: str) -> list[int]: 
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    # this is a decoding method used to transform token IDs back to the string text
    def decode(self, ids: list[int]) -> str:
        text = " ".join([self.int_to_str[i] for i in ids])
        
        # this step is to remove spaces before the specified punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)     
        return text
    
# let's test using two unrelated sentences
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join([text1, text2])
print(text)
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))
print(tokenizer.decode(tokenizer.encode(text)))

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


### Byte pair encoding (BPE)
[BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) handles unknown words by breaking them into smaller subword units that are in the vocabulary, so no `<|unk|>` token is needed.
An example of BPE as a compression scheme:

Suppose the data to be encoded is
```
aaabdaaabac
```
The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:
```
ZabdZabac
Z=aa
```
Then the process is repeated with byte pair "ab", replacing it with "Y":
```
ZYdZYac
Y=ab
Z=aa
```
The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":
```
XdXac
X=ZY
Y=ab
Z=aa
```
This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
To decompress the data, simply perform the replacements in the reverse order.
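
The worked example above can be reproduced with a short sketch. This is a toy version of the compression scheme only, not the tokenizer that tiktoken implements. The function names `bpe_compress` and `bpe_decompress` are hypothetical, the replacement symbols are assumed not to occur in the input, and ties between equally frequent pairs are broken by taking the lexicographically larger pair, a rule chosen here only so that the output matches the example.

```python
from collections import Counter

def bpe_compress(data: str, symbols: str = "ZYX") -> tuple[str, list[tuple[str, str]]]:
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    table = []  # replacement table, in the order the merges were made
    for new_symbol in symbols:
        # Count every adjacent pair of characters (overlapping pairs included)
        pair_counts = Counter(a + b for a, b in zip(data, data[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties go to the lexicographically larger pair (assumption)
        pair, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # no pair occurs more than once, nothing left to compress
        data = data.replace(pair, new_symbol)
        table.append((new_symbol, pair))
    return data, table

def bpe_decompress(data: str, table: list[tuple[str, str]]) -> str:
    """Undo the replacements in reverse order."""
    for new_symbol, pair in reversed(table):
        data = data.replace(new_symbol, pair)
    return data

original = "aaabdaaabac"
compressed, table = bpe_compress(original)
print(compressed)  # XdXac
print(table)       # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
print(bpe_decompress(compressed, table) == original)  # True
```

With a different tie-breaking rule the intermediate table can differ (for example merging "Za" before "ab"), but decompression still recovers the original string.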


**Implementing BPE from scratch is complicated, so here we use the openai/tiktoken library.**

In [39]:
# import & check the version
import tiktoken
from importlib.metadata import version
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.7.0


In [47]:
tokenizer = tiktoken.get_encoding("gpt2")

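# note: the two adjacent string literals are joined without a space,
# so the resulting text contains "terracesof"
text = (
"Hello, do you like tea? <|endoftext|> In the sunlit terraces"
"of someunknownPlace."
)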
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

strings = tokenizer.decode(integers)
print(strings)

# try out unknown word scenario
text = "Akwirw ier"
integers = tokenizer.encode(text)
print(integers)
strings = [tokenizer.decode([element]) for element in integers]
print(strings)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.
[33901, 86, 343, 86, 220, 959]
['Ak', 'w', 'ir', 'w', ' ', 'ier']
