## Working with text data - data preparation and sampling
In this chapter, **Sebastian** describes how to prepare text, and data in general, which is fundamental for feeding the models. The preparation goes through several steps: a *splitting phase*, which gives the reader a first rough idea of the token, the unit used to handle data in the later stages; the *tokenization phase*, where text is converted into tokens by algorithms such as *byte pair encoding*; and finally the sliding-window approach and the process of obtaining the vectors that are fed into *LLMs*.

##### Embedding: the process of converting (raw) data into a vector format. Different data formats require different embedding methods. It maps a discrete format to a continuous one.

###### note: one of the best-known pretrained models is Word2Vec, which generates word embeddings by predicting the context of a word. More similar = closer.
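
To make "more similar = closer" concrete, here is a minimal sketch that compares hand-made vectors with cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are illustrative assumptions, not real Word2Vec output; actual embeddings are learned and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, the "cat" and "dog" vectors point in nearly the same direction, while "cat" and "car" do not, which is the geometric meaning of the note above.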


In [48]:
# Example text: short story by Edith Wharton
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    content = f.read()

print(f"Total number of characters: {len(content)}\n")
print(f"First row: \n {content[:20]}")

Total number of characters: 20479

First row: 
 I HAD always thought


In [49]:
# Splitting text into tokens, the units that make text data easier to handle
import re
# punctuation marks and "--" are kept as separate tokens; whitespace is dropped
text = re.split(r'([,.:;?_!"()\']|--|\s)', content)
res = [item.strip() for item in text if item.strip()]
print(len(res))
print(res[:20])

4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']


##### The vocabulary: defines how we map each unique word/character or token to a unique integer.
(Integers are easy to handle and each ID has a fixed memory size.)
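
As a small illustration of the mapping, the sketch below builds a vocabulary from a toy token list. The names `toy_tokens`, `toy_vocab` and `toy_inverse` are made up for this example; the point is that deduplicating first and sorting afterwards gives the same token-to-ID mapping on every run.

```python
# Toy token list standing in for the output of the splitting step
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# sorted(set(...)) removes duplicates first and then sorts, so the
# token -> ID mapping is identical on every run
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
# Inverse mapping, used to turn IDs back into tokens
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

toy_token_ids = [toy_vocab[token] for token in toy_tokens]
print(toy_vocab)
print(toy_token_ids)
print([toy_inverse[i] for i in toy_token_ids])
```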

In [50]:
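# Token IDs are the intermediate representation between tokens and embedding vectors

# NOTE: set() discards the order produced by sorted(), so the IDs are arbitrary
# and can differ between runs; sorted(set(res)) would give a reproducible mapping
all_words = list(set(sorted(res)))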
# special tokens to handle exceptions
all_words.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:idx for idx,token in enumerate(all_words)}
for idx, tok in enumerate(vocab):
    print(f"Token: {tok}\t TokenID: {vocab[tok]}")
    if idx >= 10:
        break

Token: wasn	 TokenID: 0
Token: mysterious	 TokenID: 1
Token: meant	 TokenID: 2
Token: portrait	 TokenID: 3
Token: along	 TokenID: 4
Token: saying	 TokenID: 5
Token: hermit	 TokenID: 6
Token: She	 TokenID: 7
Token: while	 TokenID: 8
Token: hung	 TokenID: 9
Token: because	 TokenID: 10


##### Tokenizer Class
Now we can implement a class that converts new text into token IDs using our vocabulary mapping.

In [51]:
class Tokenizer:
    def __init__(self,vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        """Convert text to tokens and then to token IDs."""
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        res = [item.strip() for item in tokens if item.strip()]
        res = [item if item in self.str_to_int else "<|unk|>" for item in res]
        ids = [self.str_to_int[s] for s in res]
        return ids

    def decode(self, ids):
        """Convert token IDs to tokens and then back to text."""
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


In [52]:
# Use a separate name for the instance so the Tokenizer class is not overwritten
simple_tokenizer = Tokenizer(vocab)
sample = "Hi, nice to meet you! This is a test to experiment the Tokenizer Class"
ids = simple_tokenizer.encode(text=sample)
print(ids)

original = simple_tokenizer.decode(ids=ids)
print(original)

[1131, 134, 1131, 24, 1131, 220, 485, 757, 718, 26, 1131, 24, 1131, 112, 1131, 1131]
<|unk|>, <|unk|> to <|unk|> you! This is a <|unk|> to <|unk|> the <|unk|> <|unk|>


##### note: the special tokens usually added to handle particular cases are the following:
 - [BOS] beginning of sequence
 - [EOS] end of sequence
 - [PAD] padding - to ensure all texts have the same length.
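
A minimal sketch of what padding does in practice, assuming a hypothetical `PAD_ID` of 0 and a made-up `pad_batch` helper; real tokenizers reserve their own ID for the padding token.

```python
# Hypothetical ID reserved for the [PAD] token; real tokenizers define their own
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three token-ID sequences of different lengths
batch = [[5, 8, 2], [7, 3], [9, 4, 6, 1]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.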

##### Tokenizer Class BPE
LLMs use more sophisticated tokenizers that can handle unknown tokens. BPE lets the model break words that are not in its predefined vocabulary into smaller subword units, or even individual characters, so it can handle out-of-vocabulary words.
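
To give a rough feel for how subword units come about, here is a simplified sketch of the BPE merge loop on a toy corpus. This is an assumption-laden illustration, not the GPT-2 tokenizer: production BPE works on bytes with a large learned merge table, and the helper names `most_frequent_pair` and `merge_pair` are invented for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

# Toy corpus: each word starts as a list of single characters
corpus = [list("low"), list("lower"), list("lowest")]
for step in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {corpus}")
```

After two merges the shared prefix "low" has become a single symbol, while the rarer endings are still split into characters. An unseen word can therefore always be expressed with known subwords and characters instead of an unknown-token placeholder.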

In [56]:
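# Import the submodule explicitly: a bare `import importlib` does not guarantee
# that importlib.metadata is loaded
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))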

tiktoken version: 0.12.0


In [57]:
tokenizer = tiktoken.get_encoding("gpt2")
# The two adjacent string literals are joined without a space, hence "terracesof" in the output
text = ("Hello, do you like tea? <|endoftext|> In the sunlit terraces" "of someunknownPlace.")
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [58]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


##### Data sampling with a sliding window
The sliding window is what separates inputs from targets: the target is the input sequence shifted one token to the right. These input-target pairs are essential for the training phase.

In [None]:
# MAIN CONCEPT

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
enc_text = tokenizer.encode(raw_text)
enc_sample = enc_text[50:]
context_size = 4
# shifting by one position to create the input-target pair
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [69]:
# at inference time the model receives only the context and predicts the next token
for i in range(1, context_size+1):
    # context
    context = enc_sample[:i]
    # desired token 
    desired = enc_sample[i]

    print(context, "---->", desired)
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

[290] ----> 4920
 and ---->  established
[290, 4920] ----> 2241
 and established ---->  himself
[290, 4920, 2241] ----> 287
 and established himself ---->  in
[290, 4920, 2241, 287] ----> 257
 and established himself in ---->  a


In [74]:
# Creating a dataset and dataloader that extract samples using a sliding window
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            # shifting by one position to create target
            target_chunk = token_ids[i + 1: i + max_length + 1]
            # creating tensors where input_ids[X] and target_ids[X] are the input-target pair
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]



In [75]:
def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    # Create dataset class instance with pair input-target
    dataset = GPTDataset(txt, tokenizer, max_length, stride)
    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [None]:
# checking if it works
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

# iterator over dataloader
data_iter = iter(dataloader)
# get first batch
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


#### Exercise 2.2 page 91
To develop more intuition for how the data loader works, try running it with different settings, such as max_length=2 and stride=2, and max_length=8 and stride=2.
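
One way to explore the exercise without loading the book text is the sketch below, which applies the same windowing loop as `GPTDataset` to a plain list of stand-in IDs. The `sliding_windows` helper and the 12-element `demo_ids` list are assumptions made for readability, so sample counts on the real text will differ.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; the target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

# Stand-in token IDs (0..11) so the window positions are easy to read
demo_ids = list(range(12))

for max_length, stride in [(2, 2), (8, 2)]:
    pairs = sliding_windows(demo_ids, max_length, stride)
    print(f"max_length={max_length}, stride={stride}: {len(pairs)} samples")
    for inputs, targets in pairs:
        print("  ", inputs, "->", targets)
```

When the stride equals max_length, consecutive inputs do not overlap; when the stride is smaller than max_length, consecutive inputs share tokens, which yields more samples from the same text at the cost of redundancy.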