# LLM Data Pipeline: Tokenization & Sliding Windows

**Purpose:** This notebook demonstrates how to tokenize a raw text file with `tiktoken`, build a sliding-window `Dataset` for autoregressive training, create small batches for inspection, and map the token IDs to token and positional embeddings.

---

### Overview

* **Tokenization** — use `tiktoken` (GPT-2 BPE) to convert text to token IDs.
* **Sliding window dataset** — produce overlapping input/target chunks for autoregressive training.
* **Batching & embeddings** — create `DataLoader` batches and map tokens to dense vectors.

### Requirements

* `Data.txt` (a UTF-8 text file) in the same directory as the notebook.
* `tiktoken` Python package (install with `pip install tiktoken`).
* `PyTorch` available for `Dataset`, `DataLoader`, and tensors.

### Table of contents

1.  Setup and Installation
2.  Data Loading & Tokenization
3.  Quick tokenization check
4.  Causal Modeling — input/target shift
5.  Growing-context illustration
6.  Sliding-window Dataset: idea & implementation
7.  DataLoader factory
8.  Sanity checks (reload & sample batches)
9.  Token embedding layer (shapes)
10. Create a small embedded batch (example)
11. Embed inputs into 256-dimension vectors
12. Create a positional embedding layer

## 1. Setup and Installation
Install the required tokenizer package if you haven't already:

In [None]:
!pip install tiktoken

## 2. Data Loading & Tokenization
Read the source text. The following cell imports `tiktoken` and reads `Data.txt` into `raw_text`; the GPT-2 tokenizer itself is created in the next section.

In [2]:
import tiktoken
with open("Data.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

## 3. Quick tokenization check
This short cell prints the first character of the file, creates the **GPT-2 BPE encoder**, tokenizes the full text, and prints the token count. It also keeps the first 50 token IDs in `enc_sample` for the examples that follow.


In [3]:
print(raw_text[:1])
tokenizer = tiktoken.get_encoding('gpt2')
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
enc_sample = enc_text[:50]

I
5147


## 4. Causal Modeling — input/target shift (x, y example)
**Autoregressive models** predict the next token given a context. Below, `x` holds the first `context_size` tokens and `y` holds the same window shifted one position to the right, so `y[i]` is the token the model should predict after seeing `x[:i+1]`.


In [4]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:     {y}")

x: [40, 367, 2885, 1464]
y:     [367, 2885, 1464, 1807]


## 5. Growing-context illustration (what the model sees)
This loop demonstrates how the context grows token-by-token and the desired next token at each step. This visualizes the fundamental prediction task of the model.

In [5]:
for i in range(1, context_size + 1):
    context = enc_sample[:i]   # tokens the model has seen so far
    desired = enc_sample[i]    # token it should predict next
    print(f"in iteration {i}, with : {tokenizer.decode(context)} ==> : {tokenizer.decode([desired])}")

in iteration 1, with : I ==> :  H
in iteration 2, with : I H ==> : AD
in iteration 3, with : I HAD ==> :  always
in iteration 4, with : I HAD always ==> :  thought
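
The growing-context loop can also be read directly off one `(x, y)` pair: position `i` of `y` is the target for the context `x[:i+1]`. The sketch below is a dependency-free illustration that hard-codes the first five token IDs printed earlier in this notebook; the helper name `prediction_tasks` is hypothetical.

```python
# First five token IDs of the sample, copied from the outputs shown above.
token_ids = [40, 367, 2885, 1464, 1807]
context_size = 4

x = token_ids[:context_size]
y = token_ids[1:context_size + 1]


def prediction_tasks(inputs, targets):
    """Return (context, next_token) pairs packed into one input/target window."""
    return [(inputs[:i + 1], targets[i]) for i in range(len(inputs))]


tasks = prediction_tasks(x, y)
for context, desired in tasks:
    print(f"{context} ==> {desired}")
```

A single window of length `context_size` therefore supplies `context_size` next-token prediction tasks.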


## 6. Sliding-window Dataset: idea & implementation
**Idea:** From a single long token sequence, we produce many overlapping input/target examples.
* For a `max_length` window, we take tokens `[i:i+max_length]` as **input**.
* We take tokens `[i+1:i+max_length+1]` as the **target**.
* The **stride** controls the amount of overlap between consecutive samples.

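**Why:** With `stride < max_length`, consecutive windows overlap, so more training samples are drawn from the same text and each token is seen at several positions within the context. With `stride == max_length`, the input windows tile the text without overlap.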
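
To make the effect of `stride` concrete, here is a minimal pure-Python sketch of the same windowing rule, applied to a toy sequence of ten integers instead of real token IDs. The function name `sliding_windows` is hypothetical and is not part of this notebook's code.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that every target window is complete.
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1]))
    return pairs


toy_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(toy_ids, max_length=4, stride=2)
tiled = sliding_windows(toy_ids, max_length=4, stride=4)

for inputs, targets in overlapping:
    print(inputs, "->", targets)
```

With `stride=2`, the toy sequence yields three input windows, and each consecutive pair shares two tokens. With `stride=4`, it yields two input windows with no overlap.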

**Implementation:** this class produces `(input_ids, target_ids)` samples, stored as `torch` tensors.

In [6]:
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDataSetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the whole text once; "<|endoftext|>" is allowed as a special token.
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Slide a window of max_length tokens over the sequence, advancing by stride.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            # Targets are the inputs shifted one token to the right.
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]


## 7. DataLoader factory
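The helper function below constructs a `GPTDataSetV1` and wraps it in a PyTorch `DataLoader`, which handles batching, optional shuffling, and background loading in worker processes (controlled by `num_workers`). With `drop_last=True`, a final batch smaller than `batch_size` is discarded. If worker processes cause problems in your environment, `num_workers=0` loads the data in the main process instead.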

In [7]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=6):
    
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDataSetV1(txt, tokenizer, max_length, stride)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader
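
As a rough check on how many samples and batches this factory yields, the arithmetic can be sketched without PyTorch. The helper names `num_samples` and `num_batches` are hypothetical. The token count 5147 is the one printed in section 3, and the calculation assumes the dataset sees the same number of tokens.

```python
def num_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))


def num_batches(samples, batch_size, drop_last=True):
    """Batches per epoch; drop_last=True discards a final incomplete batch."""
    if drop_last:
        return samples // batch_size
    return -(-samples // batch_size)  # ceiling division


tokens = 5147  # token count printed in section 3

samples_stride_1 = num_samples(tokens, max_length=4, stride=1)
samples_stride_4 = num_samples(tokens, max_length=4, stride=4)

print(samples_stride_1, num_batches(samples_stride_1, batch_size=1))
print(samples_stride_4, num_batches(samples_stride_4, batch_size=8))
```

Under these assumptions, `max_length=4` with `stride=1` gives 5143 samples, and `max_length=4` with `stride=4` gives 1286 samples, or 160 full batches of 8.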

## 8. Sanity checks (reload & sample batches)
We reload the text (optional, kept for flow) and draw two batches with `stride=1` and `max_length=4`. With `shuffle=False`, the second input window should be the first one shifted by a single token, which confirms that windowing and batching behave as expected.

In [39]:
with open("Data.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [8]:
import torch
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
second_batch = next(data_iter)
print("First Iter : ", first_batch)
print("Second Iter : ", second_batch)

First Iter :  [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second Iter :  [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
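
The printed batches can be checked mechanically: with `stride=1` and `shuffle=False`, each input window should equal the previous target window. The sketch below hard-codes the values printed above as plain lists, so it runs without PyTorch.

```python
# Values copied from the two batches printed above (batch dimension removed).
first_input, first_target = [40, 367, 2885, 1464], [367, 2885, 1464, 1807]
second_input, second_target = [367, 2885, 1464, 1807], [2885, 1464, 1807, 3619]

# Within a sample, the target is the input shifted by one token.
shift_ok = first_input[1:] == first_target[:-1] and second_input[1:] == second_target[:-1]

# With stride=1, the next input starts one token later, so it matches the previous target.
stride_ok = second_input == first_target

print(shift_ok, stride_ok)
```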


## 9. Token embedding layer (shapes)
We create a **token embedding layer** (`nn.Embedding`) to map the integer token IDs to dense, continuous vectors. This is the first step of the Transformer input process. Positional embeddings are typically added after this step.


In [9]:
vocab_size = 50257  # size of the GPT-2 BPE vocabulary
output_dim = 256    # length of each embedding vector

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# Shape note: If inputs has shape (B, T) then token_embedding_layer(inputs) => (B, T, output_dim).
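
Conceptually, an embedding layer is a trainable lookup table: row `i` of its weight matrix is the vector for token ID `i`. The dependency-free sketch below imitates that lookup with nested lists and a deliberately tiny vocabulary. The function `embed` and the sizes used are illustrative assumptions, not the notebook's actual layer.

```python
import random

random.seed(0)

toy_vocab_size = 10   # tiny stand-in for the real vocabulary size
toy_output_dim = 3    # tiny stand-in for output_dim

# One randomly initialised row per token ID, like an embedding weight matrix.
weights = [[random.gauss(0.0, 1.0) for _ in range(toy_output_dim)]
           for _ in range(toy_vocab_size)]


def embed(batch):
    """Map a (B, T) nested list of token IDs to a (B, T, D) nested list of vectors."""
    return [[weights[token_id] for token_id in row] for row in batch]


toy_inputs = [[1, 5, 2, 7], [3, 3, 0, 9]]  # shape (B=2, T=4)
embedded = embed(toy_inputs)

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)
```

The same token ID always maps to the same row. For the real layer, the table holds `vocab_size * output_dim` = 50257 × 256 = 12,865,792 trainable values.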

## 10. Create a small embedded batch (example)
We instantiate a dataloader with `batch_size=8` and `max_length=4`; because `stride=max_length`, the input windows do not overlap. The input tensor has shape $(8, 4)$, which becomes $(8, 4, 256)$ when passed through `token_embedding_layer`.

In [10]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

# Optional: uncomment to decode the first few samples and inspect the shapes.
# for i in range(5):
#     print("\nInput tokens:\n", tokenizer.decode(inputs[i].tolist()))
#     print("\nTarget tokens:\n", tokenizer.decode(targets[i].tolist()))
# print("\nInputs shape:\n", inputs.shape)
# print("\nTargets shape:\n", targets.shape)

## 11. Embed inputs into 256-dimension vectors
This step converts the token IDs into dense vectors. Positional information still has to be added before the result is passed to the Transformer blocks.

In [12]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


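## 12. Create a positional embedding layer

Token embeddings alone carry no information about token order, so we create a second embedding layer that maps each position from `0` to `context_size - 1` to a vector of the same size (`output_dim`). Next steps: add these positional embeddings to the token embeddings, then implement the self-attention mechanism and transformer blocks to build a full model.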


In [15]:
context_size = max_length
pos_embedding_layer = torch.nn.Embedding(context_size, output_dim)

In [18]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])
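
The next step, adding the positional embeddings of shape `(4, 256)` to the token embeddings of shape `(8, 4, 256)`, relies on broadcasting: in PyTorch, `token_embeddings + pos_embeddings` repeats the positional vectors across the batch dimension. The sketch below is a plain-Python model of the shape rule only (the helper `broadcast_shape` is hypothetical). It helps show why the number of positions must equal the sequence length of the batch.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result shape of two shapes, or raise ValueError."""
    result = []
    # Compare dimensions from the last axis backwards; missing axes count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {a} and {b} cannot be broadcast")
    return tuple(reversed(result))


token_shape = (8, 4, 256)  # (batch, tokens, output_dim)
pos_shape = (4, 256)       # (positions, output_dim)

input_shape = broadcast_shape(token_shape, pos_shape)
print(input_shape)
```

A positional table with a different number of positions, for example `(5, 256)`, would fail this check against a batch of four-token windows.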
