# Introduction to Text Tokenization and Embeddings for Language Modeling

[Token visualizer](https://tiktokenizer.vercel.app/)

In [2]:
with open("/content/the-verdict.txt",'r',encoding='utf-8') as f:
  data = f.read()

In [3]:
data[:100]

'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g'

In [4]:
len(data)

20479

In [5]:
import re

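# Sample sentence for reference; the split below runs on the full text in `data`.
text = "I HAD always thought Jack Gisburn rather a cheap genius--though a"
# Split on punctuation, double dashes, and whitespace; the capture group keeps the delimiters.
result = re.split(r'([,.:;?_!"()\']|--|\s)', data)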

In [6]:
result = [item for item in result if item.strip()]
preprocessed = result

In [7]:
len(preprocessed)

4690

# Converting tokens to token IDs

In [8]:
all_words = list(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


In [9]:
vocab = {token: integer for integer, token in enumerate(all_words)}

{'"lift': 0,
 'across': 1,
 'accuse': 2,
 'knew?': 3,
 'covered': 4,......}

In [10]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

**Encoding (`encode`):**
*  Splits the input text into tokens using the same regex as before.
*  Filters out empty and whitespace-only tokens.
*  Converts each token to its corresponding ID using `str_to_int`. A token that is not in the vocabulary raises a `KeyError`.
*  Returns a list of token IDs.

**Decoding (`decode`):**
*  Converts a list of token IDs back to tokens using `int_to_str`.
*  Joins the tokens with spaces to form a string.
*  Uses `re.sub` to remove the space before punctuation (e.g., `"word ,"` → `"word,"`).
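As a quick check of the two steps described above, here is a minimal sketch that runs the same split regex and the same spacing cleanup on a short sentence. The sample string is made up for illustration, and no vocabulary lookup is involved.

```python
import re

sample = "Hello, world. Is this-- a test?"  # illustrative input, not from the book

# Split on punctuation, double dashes, and whitespace, keeping the delimiters.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only pieces.
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)

# Rejoin with spaces, then remove the space before the listed punctuation marks.
joined = " ".join(tokens)
restored = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(restored)
```

The cleanup pattern does not list `--`, so the double dash comes back with spaces on both sides. The round trip is therefore close to the original text but not always identical to it.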

In [11]:
tokenizer = SimpleTokenizerV1(vocab)

In [12]:
text = "I HAD always thought Jack Gisburn rather a cheap genius--though a"
tokenizer.encode(text)

[920, 571, 747, 180, 780, 58, 666, 712, 505, 285, 401, 499, 712]

In [13]:
tokenizer.decode([675, 573, 805, 942, 433, 607, 759, 804, 939, 475, 210, 241, 804])

'go jealousy occurred absurdity is? Victor recreated charming now effects You recreated'

In [14]:
preprocessed[:10]

['I',
 'HAD',
 'always',
 'thought',
 'Jack',
 'Gisburn',
 'rather',
 'a',
 'cheap',
 'genius']

In [15]:
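# Note: `all_tokens` is not used below; the vocabulary is rebuilt from `all_words`.
all_tokens = sorted(list(set(preprocessed)))
# Append two special tokens: an end-of-text marker and a placeholder for unknown words.
all_words.extend(["<|endoftext|>","<junk|>"])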

In [16]:
vocab = {token: integer for integer, token in enumerate(all_words)}

In [17]:
len(vocab.items())

1132

In [18]:
for i,item in enumerate(list(vocab.items())[-5:]):
  print(i,item)

0 ('Gideon', 1127)
1 ('am', 1128)
2 ('secret', 1129)
3 ('<|endoftext|>', 1130)
4 ('<junk|>', 1131)


In [19]:

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.str_to_int else "<junk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [20]:
tokenizer = SimpleTokenizerV2(vocab)

In [21]:
text1 = text + " Rohit"
tokenizer.encode(text1)

[920, 571, 747, 180, 780, 58, 666, 712, 505, 285, 401, 499, 712, 1131]

In [22]:
tokenizer.decode(tokenizer.encode(text1))

'I HAD always thought Jack Gisburn rather a cheap genius -- though a <junk|>'

# Byte Pair Encoding

Using the `tiktoken` library to apply the byte pair encoding (BPE) algorithm used by GPT-2.

In [23]:
# !pip install tiktoken
import tiktoken

The `tiktoken` library provides a pre-trained BPE tokenizer for GPT-2.
BPE splits text into subword units, which keeps the vocabulary at a fixed size while still being able to encode words that never appeared in the training text.
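To make the idea of subword merging concrete, here is a toy sketch of the BPE training loop on characters. It is not how `tiktoken` is implemented (GPT-2's tokenizer works on bytes and ships with a fixed, pre-trained merge table); the helper names `most_frequent_pair` and `merge_pair` and the sample string are assumptions made for illustration.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from single characters and apply two merge steps.
symbols = list("low lower lowest")
for _ in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, symbols)
```

After two merges the frequent stem `low` has become a single symbol, while the rarer endings are still spelled out character by character. A real BPE vocabulary is the result of many thousands of such merges.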

In [24]:
tokenizer = tiktoken.get_encoding("gpt2")

### Two words produce three token IDs here, because byte pair encoding splits words into subword units

In [25]:
tokenizer.encode("Hello rohit")

[15496, 686, 17945]

In [26]:
tokenizer.decode([15496, 686, 17945])

'Hello rohit'

In [27]:
tokenizer.encode("hello <|endoftext|>",allowed_special={'<|endoftext|>'})

[31373, 220, 50256]

# Data sampling with a sliding window

In [28]:
with open("/content/the-verdict.txt",'r',encoding='utf-8') as f:
  raw_text = f.read()

In [29]:
enc_text = tokenizer.encode(raw_text)
len(enc_text)

5145

In [30]:
gpt_tokenizer = tiktoken.get_encoding("gpt2")
context_size = 4
enc_sample = enc_text[:50]
for i in range(1,context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(context," --> ",desired)
  print(gpt_tokenizer.decode(context)," --> ",gpt_tokenizer.decode([desired]))

[40]  -->  367
I  -->   H
[40, 367]  -->  2885
I H  -->  AD
[40, 367, 2885]  -->  1464
I HAD  -->   always
[40, 367, 2885, 1464]  -->  1807
I HAD always  -->   thought


In [31]:
import torch

In [32]:
 torch.__version__

'2.6.0+cu124'

**What Are Inputs and Targets?**

**Inputs:** The sequences of token IDs that the model receives during training. They are the context the model sees when making predictions.

**Targets:** The sequences of token IDs that the model is expected to predict from the inputs. They are the "correct answers": the next token at each position.

Suppose the tokenized text begins `[40, 367, 2885, 1464, 1807, ...]` (decoded as “I HAD always thought ...”), with `max_length=4` and `stride=1`. The dataset creates pairs like this:

First pair:

```
Input:  [40, 367, 2885, 1464]   (“I HAD always”)
Target: [367, 2885, 1464, 1807] (“ HAD always thought”)
```

Second pair (window slid by 1):

```
Input:  [367, 2885, 1464, 1807] (“ HAD always thought”)
Target: [2885, 1464, 1807, ...] (“AD always thought ...”)
```



**Why Shift the Target by One Token?**

The target sequence is shifted by one token because the goal of the language model is to predict the next token given the current context. For example:

If the input is `[40, 367, 2885]` (“I HAD”), the model should predict the next token `1464` (“ always”).
By providing the target sequence `[367, 2885, 1464, 1807]`, the model can compare its predictions to the actual next tokens and learn from the errors.
This is called next-token prediction, a core concept in autoregressive language models like GPT.
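The shift can be sketched with plain Python lists. The snippet below assumes the five token IDs used in the example above and shows that one input/target pair of length 4 contains four separate next-token prediction tasks.

```python
token_ids = [40, 367, 2885, 1464, 1807]

inputs = token_ids[:-1]   # everything except the last token
targets = token_ids[1:]   # the same sequence shifted left by one

# At position i the model sees inputs[:i + 1] and should predict targets[i].
for i in range(len(inputs)):
    print(inputs[: i + 1], "-->", targets[i])
```

Each printed line matches one row of the context/target listing produced earlier with `context_size = 4`.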

In [33]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [34]:

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [35]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Input chunk: A sequence of `max_length` tokens (e.g., `[40, 367, 2885, 1464]`).

Target chunk: The same sequence shifted one token forward (e.g., `[367, 2885, 1464, 1807]`).
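To see how `stride` controls the overlap between chunks, here is a small sketch that repeats the sliding-window loop from `GPTDatasetV1` on plain lists. The function name `sliding_windows` and the stand-in token IDs 0 to 9 are assumptions made for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) chunks with the same loop bounds as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

toy_ids = list(range(10))  # stand-in token IDs 0..9

for stride in (1, 2, 4):
    pairs = sliding_windows(toy_ids, max_length=4, stride=stride)
    print(f"stride={stride}: {len(pairs)} pairs, inputs start at",
          [inp[0] for inp, _ in pairs])
```

With `stride=1` consecutive inputs overlap in three of their four tokens. With `stride` equal to `max_length` the inputs do not overlap at all, which is the setting used for the batch of eight shown later.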

In [36]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [37]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


# Token embedding

In [38]:
dir(tokenizer)[-5:]

['max_token_value',
 'n_vocab',
 'name',
 'special_tokens_set',
 'token_byte_values']

In [39]:
tokenizer.n_vocab

50257

**What Is Token Embedding?**
A token embedding is a dense, continuous vector representation of a token in a vocabulary. Each token (e.g., a word, subword, or special character) is mapped to a fixed-size vector of real numbers, where the vector captures semantic information about the token. These vectors are typically learned during training and allow the model to understand relationships between tokens based on their meanings or context.

*  The token “cat” might be represented as a 256-dimensional vector like `[0.2, -0.5, 0.1, ..., 0.3]`
*  The token “dog” might have a similar but slightly different vector, reflecting their semantic similarity (both are animals).

In [40]:
input_ids = torch.tensor([1,2,3])
vocab_size = 6
output_size = 3

embedding_layer = torch.nn.Embedding(vocab_size, output_size)
embedding_layer.weight

Parameter containing:
tensor([[ 1.0981,  0.5051,  1.2166],
        [-1.0484, -0.3839,  0.7467],
        [-1.9080, -0.4061,  1.2185],
        [-0.1899,  1.0797, -0.9287],
        [ 1.9656,  0.3913,  0.2022],
        [-0.2758, -0.8194,  0.8391]], requires_grad=True)

In [41]:
output = embedding_layer(input_ids)
output

tensor([[-1.0484, -0.3839,  0.7467],
        [-1.9080, -0.4061,  1.2185],
        [-0.1899,  1.0797, -0.9287]], grad_fn=<EmbeddingBackward0>)

# Positional Word Encoding

Positional encoding is a method to represent the position of each token in a sequence as a numerical vector that can be combined with the token’s embedding. This allows the model to understand not just what the tokens are (via token embeddings) but also their order in the sequence (e.g., whether a word is the first, second, or third in a sentence).

For example, in the sentence “I love to code”:



1.   The token “love” has a different meaning or role depending on whether it’s the first word, second word, or last word.
2.   Positional encoding adds information to tell the model that “love” is the second token in this sequence.



Without positional encoding, a transformer model would treat the sentence as a “bag of words,” ignoring the order, which could lead to incorrect interpretations (e.g., `“I love to code”` might be treated the same as `“Code to love I”`).

Transformer models process all tokens in a sequence simultaneously (in parallel) using self-attention mechanisms. Unlike RNNs or LSTMs, which process tokens sequentially, **transformers don’t inherently know the order of tokens**.
Positional encoding provides this order information explicitly.

**Order Matters in Language:**

The meaning of a sentence often depends on the order of words. For example, **“The dog chased the cat”** is different from **“The cat chased the dog.”** Positional encoding ensures the model can distinguish these cases.

**Enabling Context-Aware Predictions:**

In language modeling, the model predicts the next token based on the context of previous tokens. Positional encoding helps the model understand the relative positions of tokens in the context, which is critical for accurate predictions.
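The order problem described above can be illustrated with a small sketch. It uses toy sizes (10 tokens, 4 dimensions, 3 positions) and randomly initialized layers, all chosen for illustration, and it does not run attention. It only shows that token embeddings alone are identical for a sequence and its reversal, while adding position vectors makes the same token look different at different positions.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
tok_emb = torch.nn.Embedding(10, 4)   # toy vocabulary of 10 tokens, 4-dim vectors
pos_emb = torch.nn.Embedding(3, 4)    # one vector per position 0..2

seq_a = torch.tensor([1, 2, 3])
seq_b = torch.tensor([3, 2, 1])       # same tokens, reversed order
positions = torch.arange(3)

# Token embeddings alone: reversing seq_b's vectors reproduces seq_a's exactly.
same_set = torch.equal(tok_emb(seq_a), tok_emb(seq_b).flip(0))

# With positions added, token 1 gets a different vector in each sequence,
# because it sits at position 0 in seq_a and at position 2 in seq_b.
with_pos_a = tok_emb(seq_a) + pos_emb(positions)
with_pos_b = tok_emb(seq_b) + pos_emb(positions)
token1_differs = not torch.equal(with_pos_a[0], with_pos_b[2])

print(same_set, token1_differs)
```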

In [42]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

A token embedding layer is created to convert token IDs (from the GPT-2 tokenizer’s vocabulary of `50,257` tokens) into `256-dimensional vectors`.

In [48]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
max_length = 4
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("batch size : ",8)
print("max length : ",4)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

batch size :  8
max length :  4
Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


**How Does Positional Encoding Work in Practice?**

Suppose the input batch (`inputs`) contains one sequence: `[40, 367, 2885, 1464]` (decoded as “I HAD always”).

**Token Embeddings:**
Each token ID is converted to a 256-dimensional vector (the values below are made-up examples):
*  40 (“I”) → [0.1, -0.2, ..., 0.5]
*  367 (“ H”) → [-0.3, 0.4, ..., -0.1]
*  2885 (“AD”) → [0.2, 0.1, ..., 0.3]
*  1464 (“ always”) → [0.0, -0.5, ..., 0.2]
*  Shape: [1, 4, 256] (for one sequence).

**Positional Embeddings:**
The positions `[0, 1, 2, 3]` are converted to 256-dimensional vectors:
*  Position 0 → `[0.01, -0.02, ..., 0.03]`
*  Position 1 → `[-0.01, 0.04, ..., -0.02]`
*  Position 2 → `[0.02, 0.01, ..., 0.05]`
*  Position 3 → `[0.00, -0.03, ..., 0.01]`
*  Shape: `[4, 256]`

**Token Embedding**

In [44]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


**Position Embedding**

In [45]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [49]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)
print(pos_embeddings)

torch.Size([4, 256])
tensor([[-1.9388, -0.0067,  0.8084,  ...,  0.5857, -1.1943, -0.2662],
        [-1.2178,  0.6031, -1.0661,  ...,  0.1578,  1.0267,  0.1040],
        [ 1.2363,  0.2505, -1.1435,  ...,  0.0201, -1.7183, -0.2860],
        [-1.0928, -0.9578,  1.0012,  ..., -0.5694,  0.6425,  0.8918]],
       grad_fn=<EmbeddingBackward0>)


**Combining Embeddings:**
For each token, the token embedding and positional embedding are added:
*  For “I” (position 0): `token_embedding(40) + pos_embedding(0)`
*  For “ H” (position 1): `token_embedding(367) + pos_embedding(1)`
*  And so on.

The result is a new tensor where each token’s vector encodes both its meaning and its position.

In [50]:
token_embeddings.shape

torch.Size([8, 4, 256])

In [51]:
pos_embeddings.shape

torch.Size([4, 256])

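**What Does Broadcasting Look Like?**

`token_embeddings` has shape `[8, 4, 256]` and `pos_embeddings` has shape `[4, 256]`. PyTorch aligns the shapes from the trailing dimension, treats the missing leading dimension as size 1, and reuses the same four positional vectors for every sequence in the batch.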

In [54]:
_="""
pos_embeddings = [
  [p0],  # Position 0: [0.01, -0.02, ..., 0.03] (256D vector)
  [p1],  # Position 1: [-0.01, 0.04, ..., -0.02]
  [p2],  # Position 2: [0.02, 0.01, ..., 0.05]
  [p3]   # Position 3: [0.00, -0.03, ..., 0.01]
]  # Shape: [4, 256]
"""

In [55]:
_="""
pos_embeddings_broadcasted = [
  [  # Sequence 1 in the batch
    [p0],  # Position 0
    [p1],  # Position 1
    [p2],  # Position 2
    [p3]   # Position 3
  ],
  [  # Sequence 2 in the batch
    [p0],  # Same position 0
    [p1],  # Same position 1
    [p2],
    [p3]
  ],
  ...,  # Repeated for all 8 sequences in the batch
]  # Shape: [8, 4, 256]
"""

In [52]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
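The implicit broadcast can be checked against an explicit one. The sketch below uses random tensors as stand-ins for the two embedding outputs (an assumption, so the values differ from the notebook's) and confirms that adding a leading batch axis by hand gives the same sum.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
token_embeddings = torch.randn(8, 4, 256)  # stand-in for the token embedding output
pos_embeddings = torch.randn(4, 256)       # stand-in for the positional embedding output

# Implicit broadcasting, as in the cell above.
implicit = token_embeddings + pos_embeddings

# Explicit version: add a leading batch axis, then expand it to 8 sequences.
explicit = token_embeddings + pos_embeddings.unsqueeze(0).expand(8, -1, -1)

print(implicit.shape, torch.equal(implicit, explicit))
```

`expand` returns a view rather than a copy, so the explicit version does not store eight copies of the positional vectors either.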
