# Alert! See the TODOs (for actual training)

<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Llama 3.2 From Scratch (A Standalone Notebook)

- This notebook is purposefully minimal and focuses on the code to implement the Llama 3.2 1B and 3B LLMs
- For a step-by-step guide that explains the individual components and the relationship between GPT, Llama 2, and Llama 3, please see the following companion notebooks:
  - [Converting a From-Scratch GPT Architecture to Llama 2](converting-gpt-to-llama2.ipynb)
  - [Converting Llama 2 to Llama 3.2 From Scratch](converting-llama2-to-llama3.ipynb)
  

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/llama32.webp" width="700px">
  
  
- About the code:
  - all code is my own code, mapping the Llama 3 architecture onto the model code implemented in my [Build A Large Language Model (From Scratch)](http://mng.bz/orYv) book; the code is released under a permissive open-source Apache 2.0 license (see [LICENSE.txt](https://github.com/rasbt/LLMs-from-scratch/blob/main/LICENSE.txt))
  - the tokenizer code is inspired by the original [Llama 3 tokenizer code](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py), which Meta AI used to extend the Tiktoken GPT-4 tokenizer
  - the RoPE rescaling section is inspired by the [_compute_llama3_parameters function](https://github.com/huggingface/transformers/blob/5c1027bf09717f664b579e01cbb8ec3ef5aeb140/src/transformers/modeling_rope_utils.py#L329-L347) in the `transformers` library

In [1]:
!pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt

Collecting blobfile>=3.0.0 (from -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt (line 1))
  Downloading blobfile-3.0.0-py3-none-any.whl.metadata (15 kB)
Collecting ipywidgets>=8.1.2 (from -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt (line 3))
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting pycryptodomex>=3.8 (from blobfile>=3.0.0->-r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt (line 1))
  Downloading pycryptodomex-3.21.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting comm>=0.1.3 (from ipywidgets>=8.1.2->-r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt (line 3))
  Downloading comm-0.2.2-py3-none-any.whl.metadata (3.7 kB)
Collecti

In [2]:
# # we'll use nepali tokenizer here
# !pip install tiktoken --quiet

In [3]:
from importlib.metadata import version

pkgs = [
    "blobfile",         # to download pretrained weights
    "huggingface_hub",  # to download pretrained weights
    # "tiktoken",         # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

blobfile version: 3.0.0
huggingface_hub version: 0.26.5
torch version: 2.5.1+cu121


&nbsp;
# 1. Architecture code

In [4]:
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)

    def forward(self, x):
        x_fc1 = self.fc1(x)
        x_fc2 = self.fc2(x)
        x = nn.functional.silu(x_fc1) * x_fc2
        return self.fc3(x)

In [5]:
def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096, freq_config=None):
    assert head_dim % 2 == 0, "Head dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

    # Frequency adjustments
    if freq_config is not None:
        low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
        high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]

        wavelen = 2 * torch.pi / inv_freq

        inv_freq_llama = torch.where(
            wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
        )

        smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
            freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
        )

        smoothed_inv_freq = (
            (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
        )

        is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
        inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
        inv_freq = inv_freq_llama

    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Expand angles to match the head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin


def compute_rope(x, cos, sin):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"

    # Split x into first half and second half
    x1 = x[..., : head_dim // 2]  # First half
    x2 = x[..., head_dim // 2 :]  # Second half

    # Adjust sin and cos shapes
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, seq_len, head_dim)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    # Apply the rotary transformation
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)

    return x_rotated.to(dtype=x.dtype)
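
- As a quick sanity check of the rotary transformation above, the following minimal sketch re-implements the two functions in simplified form (no frequency rescaling; the names `rope_tables` and `apply_rope` are hypothetical and not part of the notebook) and verifies two properties one would expect from a pure rotation: vector norms are preserved, and position 0 is left unchanged because its rotation angle is zero

```python
import torch


def rope_tables(head_dim, theta_base=10_000, context_length=16):
    # Hypothetical helper: simplified version of precompute_rope_params (no freq_config)
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(context_length).float()
    angles = positions[:, None] * inv_freq[None, :]   # (context_length, head_dim // 2)
    angles = torch.cat([angles, angles], dim=1)       # (context_length, head_dim)
    return torch.cos(angles), torch.sin(angles)


def apply_rope(x, cos, sin):
    # Hypothetical helper: same rotation as compute_rope
    # x: (batch_size, num_heads, seq_len, head_dim)
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    x1 = x[..., : head_dim // 2]
    x2 = x[..., head_dim // 2 :]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos[:seq_len] + rotated * sin[:seq_len]


torch.manual_seed(0)
x = torch.randn(1, 2, 5, 8)
cos, sin = rope_tables(head_dim=8)
x_rot = apply_rope(x, cos, sin)

# A rotation changes direction but not length
norm_preserved = torch.allclose(x.norm(dim=-1), x_rot.norm(dim=-1), atol=1e-5)
# At position 0 all angles are 0, so cos = 1 and sin = 0
pos0_unchanged = torch.allclose(x[:, :, 0], x_rot[:, :, 0])
print(norm_preserved, pos0_unchanged)
```

- Both checks should print `True`; if either fails after modifying the RoPE code, the pairing of the first and second half of the head dimension is a likely place to look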

In [6]:
class SharedBuffers:
    _buffers = {}

    @staticmethod
    def get_buffers(context_length, head_dim, rope_base, freq_config, dtype=torch.float32):
        key = (context_length, head_dim, rope_base, tuple(freq_config.values()) if freq_config else freq_config, dtype)

        if key not in SharedBuffers._buffers:
            # Create or fetch the buffers
            mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
            cos, sin = precompute_rope_params(head_dim, rope_base, context_length, freq_config)
            if dtype is not None:
                cos = cos.to(dtype)
                sin = sin.to(dtype)
            SharedBuffers._buffers[key] = (mask, cos, sin)

        return SharedBuffers._buffers[key]


class GroupedQueryAttention(nn.Module):
    def __init__(
            self, d_in, d_out, context_length, num_heads,
            num_kv_groups,
            rope_base=10_000,
            rope_config=None,
            dtype=None
        ):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups

        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)

        # Fetch buffers using SharedBuffers
        mask, cos, sin = SharedBuffers.get_buffers(context_length, self.head_dim, rope_base, rope_config, dtype)
        self.register_buffer("mask", mask)

        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)
        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)

        # Reshape queries, keys, and values
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)

        # Transpose keys, values, and queries
        keys = keys.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        values = values.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Apply RoPE
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)

        # Expand keys and values to match the number of heads
        # Shape: (b, num_heads, num_tokens, head_dim)
        keys = keys.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        values = values.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        # For example, before repeat_interleave along dim=1 (key-value groups):
        #   [K1, K2]
        # After repeat_interleave (each key-value group is repeated group_size times):
        #   [K1, K1, K2, K2]
        # If we used regular repeat instead of repeat_interleave, we'd get:
        #   [K1, K2, K1, K2]

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        # Shape: (b, num_heads, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        assert keys.shape[-1] == self.head_dim

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec
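
- The difference between `repeat_interleave` and `repeat` described in the comments of the attention code can be seen on a toy tensor; the sketch below assumes two key-value groups filled with the constants 1 and 2 (standing in for K1 and K2) and a `group_size` of 2

```python
import torch

# Two KV groups: K1 is filled with 1s, K2 with 2s
# Shape: (b, num_kv_groups, num_tokens, head_dim) = (1, 2, 3, 4)
keys = torch.stack([torch.full((3, 4), 1.0), torch.full((3, 4), 2.0)]).unsqueeze(0)
group_size = 2

interleaved = keys.repeat_interleave(group_size, dim=1)  # what the attention code uses
tiled = keys.repeat(1, group_size, 1, 1)                 # regular repeat, for comparison

# Read off which group each of the 4 resulting heads came from
interleaved_order = interleaved[0, :, 0, 0].tolist()
tiled_order = tiled[0, :, 0, 0].tolist()
print(interleaved_order)  # [1.0, 1.0, 2.0, 2.0]
print(tiled_order)        # [1.0, 2.0, 1.0, 2.0]
```

- With `repeat_interleave`, consecutive query heads share the same key-value group, which is the layout the grouped-query attention code above relies on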

In [7]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att =  GroupedQueryAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            num_kv_groups=cfg["n_kv_groups"],
            rope_base=cfg["rope_base"],
            rope_config=cfg["rope_freq"],
            dtype=cfg["dtype"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = nn.RMSNorm(cfg["emb_dim"], eps=1e-5)
        self.norm2 = nn.RMSNorm(cfg["emb_dim"], eps=1e-5)

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x.to(torch.bfloat16))   # Shape [batch_size, num_tokens, emb_size]
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x.to(torch.bfloat16))
        x = x + shortcut  # Add the original input back

        return x

In [8]:
class Llama3Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = nn.RMSNorm(cfg["emb_dim"], eps=1e-5)
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        tok_embeds = self.tok_emb(in_idx)
        x = tok_embeds
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x.to(torch.bfloat16))
        return logits

&nbsp;
# 2. Initialize model

- The original notebook uses the Llama 3.2 1B model; the code cells below currently end up with a tiny debug configuration instead (see the TODOs). To use the 1B or 3B variant, uncomment the corresponding configuration in the second cell

In [None]:
# TODO : use 1b config not debug config
# Debug mode
LLAMA32_CONFIG = {
    # d_out = emb_dim
    # Head dimension (d_out // num_heads) must be even


    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 10,  # Context length
    # d_in=d_out=emb_dim,
    # d_out must be divisible by num_heads
    "emb_dim": 8,            # Embedding dimension
    # (num_heads must be divisible by num_kv_groups)
    "n_heads": 4,              # Number of attention heads
    "n_layers": 2,             # Number of layers
    "hidden_dim": 16,         # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 2,           # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # RoPE frequency scaling
        "factor": 32.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}

# Llama 3.2 1B

LLAMA32_CONFIG = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # Context length
    "emb_dim": 2048,            # Embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 16,             # Number of layers
    "hidden_dim": 8192,         # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # RoPE frequency scaling
        "factor": 32.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}

# Llama 3.2 3B

# LLAMA32_CONFIG = {
#     "vocab_size": 128_256,      # Vocabulary size
#     "context_length": 131_072,  # Context length
#     "emb_dim": 3072,            # Embedding dimension
#     "n_heads": 24,              # Number of attention heads
#     "n_layers": 28,             # Number of layers
#     "hidden_dim": 8192,         # Size of the intermediate dimension in FeedForward
#     "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
#     "rope_base": 500_000.0,     # The base in RoPE's "theta"
#     "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
#     "rope_freq": {              # RoPE frequency scaling
#         "factor": 32.0,
#         "low_freq_factor": 1.0,
#         "high_freq_factor": 4.0,
#         "original_context_length": 8192,
#     }
# }

LLAMA_SIZE_STR = "1B" if LLAMA32_CONFIG["emb_dim"] == 2048 else "3B"

In [None]:
import torch
# TODO : use 1b config not debug config
# Debug mode
LLAMA32_CONFIG = {
    # d_out = emb_dim
    # Head dimension (d_out // num_heads) must be even


    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 10,  # Context length
    # d_in=d_out=emb_dim,
    # d_out must be divisible by num_heads
    "emb_dim": 8,            # Embedding dimension
    # (num_heads must be divisible by num_kv_groups)
    "n_heads": 4,              # Number of attention heads
    "n_layers": 2,             # Number of layers
    "hidden_dim": 16,         # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 2,           # Key-Value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # RoPE frequency scaling
        "factor": 32.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_context_length": 8192,
    }
}

# # Llama 3.2 1B

# LLAMA32_CONFIG = {
#     "vocab_size": 128_256,      # Vocabulary size
#     "context_length": 131_072,  # Context length
#     "emb_dim": 2048,            # Embedding dimension
#     "n_heads": 32,              # Number of attention heads
#     "n_layers": 16,             # Number of layers
#     "hidden_dim": 8192,         # Size of the intermediate dimension in FeedForward
#     "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
#     "rope_base": 500_000.0,     # The base in RoPE's "theta"
#     "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
#     "rope_freq": {              # RoPE frequency scaling
#         "factor": 32.0,
#         "low_freq_factor": 1.0,
#         "high_freq_factor": 4.0,
#         "original_context_length": 8192,
#     }
# }

# Llama 3.2 3B

# LLAMA32_CONFIG = {
#     "vocab_size": 128_256,      # Vocabulary size
#     "context_length": 131_072,  # Context length
#     "emb_dim": 3072,            # Embedding dimension
#     "n_heads": 24,              # Number of attention heads
#     "n_layers": 28,             # Number of layers
#     "hidden_dim": 8192,         # Size of the intermediate dimension in FeedForward
#     "n_kv_groups": 8,           # Key-Value groups for grouped-query attention
#     "rope_base": 500_000.0,     # The base in RoPE's "theta"
#     "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
#     "rope_freq": {              # RoPE frequency scaling
#         "factor": 32.0,
#         "low_freq_factor": 1.0,
#         "high_freq_factor": 4.0,
#         "original_context_length": 8192,
#     }
# }

LLAMA_SIZE_STR = "1B" if LLAMA32_CONFIG["emb_dim"] == 2048 else "3B"

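- The original notebook reduces the context length to 8192 here so that the model works fine on a MacBook Air; the cell below currently leaves the context length unchanged (see the TODO), so the rescaled RoPE theta equals the original value: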

In [10]:
old_context_length = LLAMA32_CONFIG["context_length"]
# TODO: use a context length of 8192 (as in Sebastian's original notebook) instead of keeping LLAMA32_CONFIG["context_length"]
LLAMA32_CONFIG["context_length"] = LLAMA32_CONFIG["context_length"]  # 8192


def rescale_theta(theta_old, context_length_old, context_length_new):
    scaling_factor = context_length_new / context_length_old
    theta_new = theta_old * scaling_factor
    return theta_new

LLAMA32_CONFIG["rope_base"] = rescale_theta(
    LLAMA32_CONFIG["rope_base"],
    old_context_length,
    LLAMA32_CONFIG["context_length"]
)

print("New RoPE theta:", LLAMA32_CONFIG["rope_base"])

New RoPE theta: 500000.0
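
- To make the effect of `rescale_theta` concrete, the sketch below repeats the function and applies it to two cases: the 1B configuration's context length of 131,072 reduced to 8192 (the value named in the TODO), and an unchanged context length as in the cell above

```python
def rescale_theta(theta_old, context_length_old, context_length_new):
    # Scale the RoPE base linearly with the change in context length
    scaling_factor = context_length_new / context_length_old
    return theta_old * scaling_factor


theta_8k = rescale_theta(500_000.0, 131_072, 8192)  # factor 8192 / 131072 = 1/16
theta_same = rescale_theta(500_000.0, 10, 10)       # factor 1: nothing changes

print(theta_8k)    # 31250.0
print(theta_same)  # 500000.0
```

- The second case explains the printed value above: as long as the old and new context lengths are equal, the RoPE base stays at 500000.0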


In [11]:
model = Llama3Model(LLAMA32_CONFIG)

- The following is expected to print True to confirm buffers are reused instead of being (wastefully) recreated:

In [12]:
# Check buffers
print(model.trf_blocks[0].att.mask is model.trf_blocks[-1].att.mask)
print(model.trf_blocks[0].att.cos is model.trf_blocks[-1].att.cos)
print(model.trf_blocks[0].att.sin is model.trf_blocks[-1].att.sin)

True
True
True
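
- The reuse confirmed above comes from caching tensors in a class-level dictionary keyed by the configuration; the following sketch shows the same idea in isolation, using a hypothetical `MaskCache` class that caches only the causal mask and is keyed only by the context length

```python
import torch


class MaskCache:
    # Hypothetical, simplified stand-in for SharedBuffers
    _cache = {}

    @staticmethod
    def get_mask(context_length):
        if context_length not in MaskCache._cache:
            # Build the upper-triangular causal mask once per context length
            MaskCache._cache[context_length] = torch.triu(
                torch.ones(context_length, context_length), diagonal=1
            )
        return MaskCache._cache[context_length]


mask_a = MaskCache.get_mask(4)
mask_b = MaskCache.get_mask(4)  # same key: the cached tensor is returned
mask_c = MaskCache.get_mask(8)  # different key: a new tensor is created

print(mask_a is mask_b)
print(mask_a is mask_c)
```

- The first comparison refers to one and the same tensor object, while the second involves two separately created tensors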


In [13]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Account for weight tying
total_params_normalized = total_params - model.tok_emb.weight.numel()
print(f"\nTotal number of unique parameters: {total_params_normalized:,}")

Total number of parameters: 2,053,288

Total number of unique parameters: 1,027,240
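
- The totals printed above can be reproduced by hand from the configuration; the sketch below uses a hypothetical helper `count_params` that sums the weight matrices defined in the architecture code (none of the linear layers has a bias) and applies it to the debug configuration and to the Llama 3.2 1B configuration

```python
def count_params(vocab_size, emb_dim, n_heads, n_kv_groups, hidden_dim, n_layers):
    # Hypothetical helper: counts parameters of the Llama3Model defined above
    head_dim = emb_dim // n_heads
    embedding = vocab_size * emb_dim                  # tok_emb (out_head has the same size)
    attention = (
        emb_dim * emb_dim                             # W_query
        + 2 * emb_dim * n_kv_groups * head_dim        # W_key and W_value
        + emb_dim * emb_dim                           # out_proj
    )
    feed_forward = 3 * emb_dim * hidden_dim           # fc1, fc2, fc3
    norms = 2 * emb_dim                               # norm1 and norm2
    per_block = attention + feed_forward + norms
    total = 2 * embedding + n_layers * per_block + emb_dim  # + final_norm
    unique = total - embedding                        # as if tok_emb and out_head were tied
    return total, unique


total_debug, unique_debug = count_params(128_256, 8, 4, 2, 16, 2)
total_1b, unique_1b = count_params(128_256, 2048, 32, 8, 8192, 16)

print(f"{total_debug:,} {unique_debug:,}")
print(f"{total_1b:,} {unique_1b:,}")
```

- For the debug configuration this reproduces the two numbers printed above; almost all of those parameters sit in the embedding and output layers because the vocabulary size is kept at 128,256 while `emb_dim` is only 8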


In [14]:
def model_memory_size(model, input_dtype=torch.float32):
    total_params = 0
    total_grads = 0
    for param in model.parameters():
        # Calculate total number of elements per parameter
        param_size = param.numel()
        total_params += param_size
        # Check if gradients are stored for this parameter
        if param.requires_grad:
            total_grads += param_size

    # Calculate buffer size (non-parameters that require memory)
    total_buffers = sum(buf.numel() for buf in model.buffers())

    # Size in bytes = (Number of elements) * (Size of each element in bytes)
    # We assume parameters and gradients are stored in the same type as input dtype
    element_size = torch.tensor(0, dtype=input_dtype).element_size()
    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size

    # Convert bytes to gigabytes
    total_memory_gb = total_memory_bytes / (1024**3)

    return total_memory_gb

print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")

float32 (PyTorch default): 0.02 GB
bfloat16: 0.01 GB
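
- The memory estimate above is plain arithmetic: the number of stored elements times the bytes per element; the sketch below uses a hypothetical helper `estimate_memory_gb` that takes the parameter count directly and, as a simplifying assumption, ignores buffers unless a count is passed in

```python
import torch


def estimate_memory_gb(num_params, num_buffer_elements=0, dtype=torch.float32, with_grads=True):
    # Hypothetical helper: bytes per element for the given dtype (4 for float32, 2 for bfloat16)
    element_size = torch.tensor(0, dtype=dtype).element_size()
    # One gradient value per parameter when gradients are stored
    num_elements = num_params * (2 if with_grads else 1) + num_buffer_elements
    return num_elements * element_size / (1024**3)


debug_params = 2_053_288  # total parameter count printed earlier for the debug configuration
fp32_gb = estimate_memory_gb(debug_params, dtype=torch.float32)
bf16_gb = estimate_memory_gb(debug_params, dtype=torch.bfloat16)
print(f"float32: {fp32_gb:.2f} GB, bfloat16: {bf16_gb:.2f} GB")
```

- Halving the bytes per element halves the estimate, which is why the bfloat16 figure is half the float32 figure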


In [15]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device);

&nbsp;
# 3. Load tokenizer

In [16]:
import os
from pathlib import Path

from transformers import PreTrainedTokenizerFast


class Tokenizer:
    def __init__(self, tokenizer):
        """
        Initialize the Tokenizer with a Hugging Face tokenizer instance.

        Args:
            tokenizer (PreTrainedTokenizerFast): Hugging Face tokenizer.
        """
        self.tokenizer = tokenizer

        special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|eot_id|>",  # end-of-turn marker that closes a chat message
        ]
        # Add special tokens to the tokenizer.
        # Before this call, "<|end_of_text|>" is split into many sub-tokens;
        # afterwards it maps to the single ID 50001.
        self.tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
        # Map each special token to its ID so that encode() can look up the bos/eos IDs
        self.special_tokens = {
            token: self.tokenizer.convert_tokens_to_ids(token) for token in special_tokens
        }

    def encode(self, text, bos=False, eos=False):
        """
        Encode a text string into token IDs.

        Args:
            text (str): The text to encode.
            bos (bool): Whether to add the beginning-of-sequence token.
            eos (bool): Whether to add the end-of-sequence token.

        Returns:
            List[int]: List of token IDs.
        """
        tokens = []
        if bos:
            tokens.append(self.special_tokens["<|begin_of_text|>"])
        tokens.extend(self.tokenizer.encode(text, add_special_tokens=False))
        if eos:
            tokens.append(self.special_tokens["<|end_of_text|>"])
        return tokens

    def decode(self, tokens):
        """
        Decode token IDs back into a text string.

        Args:
            tokens (List[int]): List of token IDs.

        Returns:
            str: The decoded text.
        """
        return self.tokenizer.decode(tokens, skip_special_tokens=True)


class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message):
        tokens = []
        tokens.extend(self.tokenizer.encode("<|start_header_id|>"))
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.extend(self.tokenizer.encode("<|end_header_id|>"))
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode(self, text):
        message = {
            "role": "प्रयोगकर्ता",  # Nepali for "user"
            "content": text
        }

        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.extend(self.tokenizer.encode("<|eot_id|>"))
        return tokens

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)


# Using the Hugging Face tokenizer
pretrained_tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Initialize the custom tokenizer
tokenizer = Tokenizer(pretrained_tokenizer)
chat_tokenizer = ChatFormat(tokenizer)


tokenizer_config.json:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.53M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.00 [00:00<?, ?B/s]

## tests

In [17]:
chat_tokenizer.tokenizer.encode("<|start_header_id|>")

[50002]

In [18]:
tokenizer.encode("<|begin_of_text|>")

[50000]

In [19]:
chat_tokenizer.encode("लामा हरु ले के खान्छन् ?")

[50002, 15542, 50003, 843, 577, 285, 903, 23750, 1, 50004]

In [20]:
chat_tokenizer.decode([50002, 15542, 50003, 843, 577, 285, 903, 23750, 1, 50004])

'प्रयोगकर्ता लामा हरु ले के खान्छन्'

In [21]:
[tokenizer.decode([token]) for token in tokenizer.encode('hello world')]

['', 'e', '', '', '', '', 'o', 'r', '', 'd']

In [22]:
[tokenizer.decode([token]) for token in tokenizer.encode("महाप्रभुको सदिच्छा पूरा गर्न यो भौतिक ")]

['महा', 'प्रभु', 'को', 'सदि', 'च्छा', 'पूरा', 'गर्न', 'यो', 'भौतिक']

## Original tokenizer code by Sebastian

In [23]:
'''import os
from pathlib import Path

import tiktoken
from tiktoken.load import load_tiktoken_bpe


class Tokenizer:
    def __init__(self, model_path):
        assert os.path.isfile(model_path), f"Model file {model_path} not found"
        mergeable_ranks = load_tiktoken_bpe(model_path)

        self.special_tokens = {
            "<|begin_of_text|>": 128000,
            "<|end_of_text|>": 128001,
            "<|start_header_id|>": 128006,
            "<|end_header_id|>": 128007,
            "<|eot_id|>": 128009,
        }
        self.special_tokens.update({
            f"<|reserved_{i}|>": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()
        })

        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens
        )


    def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):
        if bos:
            tokens = [self.special_tokens["<|begin_of_text|>"]]
        else:
            tokens = []

        tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)

        if eos:
            tokens.append(self.special_tokens["<|end_of_text|>"])
        return tokens

    def decode(self, tokens):
        return self.model.decode(tokens)


class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message):
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode(self, text):
        message = {
            "role": "user",
            "content": text
        }

        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)'''


- Please note that Meta AI requires that you accept the Llama 3.2 licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) repository to accept the terms
- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and click on "Settings"


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/settings.webp?1" width="300px">

- Then, create and copy the access token so you can copy & paste it into the next code cell

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/access-token.webp?1" width="600px">

In [24]:
'''from huggingface_hub import login

login()'''


In [25]:
'''from huggingface_hub import hf_hub_download

tokenizer_file_path = hf_hub_download(
    repo_id=f"meta-llama/Llama-3.2-{LLAMA_SIZE_STR}-Instruct",
    filename="original/tokenizer.model",
    local_dir=f"Llama-3.2-{LLAMA_SIZE_STR}-Instruct"
)'''


In [26]:
'''tokenizer = Tokenizer(tokenizer_file_path)
chat_tokenizer = ChatFormat(tokenizer)'''


## tests

In [27]:
'''print(type(chat_tokenizer.encode('hello world')))
chat_tokenizer.decode(chat_tokenizer.encode('hello world'))'''

In [28]:
'''[tokenizer.decode([token]) for token in tokenizer.encode("महाप्रभुको सदिच्छा पूरा गर्न यो भौतिक ")]'''


In [29]:
'''a=[]
a.append(tokenizer.special_tokens["<|end_header_id|>"])
a'''


&nbsp;
# 3.5. Generate text

In [30]:
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text)
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor
    # NOTE: Sebastian's original return statement was modified as shown above because this
    #   tokenizer has no special tokens such as 'start_header_id' and 'end_header_id'; it
    #   returned None for them, which in turn caused an error.
    # TODO: add the special tokens 'start_header_id' and 'end_header_id', then restore the original return statement
    # print(encoded_tensor)
    # return torch.tensor([token for token in encoded_tensor])  # TODO: use additional vocab like encoded_tensor


def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dimension
    return tokenizer.decode(flat.tolist())


def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1:]  # (batch_size, 1) so the comparison broadcasts per row
            logits = torch.where(logits < min_val, torch.tensor(float('-inf')).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, vocab_size)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        # Stop generating early if eos_id is specified and every sequence in the batch produced it
        if eos_id is not None and (idx_next == eos_id).all():
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
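
To make the filtering and scaling steps in `generate` concrete, here is a minimal pure-Python sketch that applies them to a single list of logits. It is an illustration only: the helper names `top_k_filter` and `softmax_with_temperature` and the example values are made up, and it works on plain lists rather than batched tensors.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and set all others to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    max_val = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_val) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
filtered = top_k_filter(logits, k=2)
probs = softmax_with_temperature(filtered, temperature=0.5)
print(filtered)                      # [2.0, 1.0, -inf, -inf]
print([round(p, 3) for p in probs])  # [0.881, 0.119, 0.0, 0.0]
```

Entries masked to `-inf` receive zero probability, and a temperature below 1 sharpens the distribution toward the largest logit. In `generate`, `temperature=0.0` skips sampling entirely and takes the argmax.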

In [31]:
PROMPT = "लामा हरु ले के खान्छन् ?"  # Nepali for "What do llamas eat?"

torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids(PROMPT, chat_tokenizer).to(device),
    max_new_tokens=150,
    context_size=LLAMA32_CONFIG["context_length"],
    top_k=1,
    temperature=0.
)

output_text = token_ids_to_text(token_ids, tokenizer)


def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):
    # Find the index of the first occurrence of header_end
    index = text.find(header_end)

    if index != -1:
        # Return the substring that follows header_end, without leading/trailing whitespace
        return text[index + len(header_end):].strip()
    else:
        # If header_end is not found, return the original text
        return text

print("Output text:\n", clean_text(output_text))

Output text:
 प्रयोगकर्ता लामा हरु ले के खान्छन् जनकल्यामार्थ त्रुटीउतारेको गौँबर्षाएक्लैले वट साहिविद्यालयबाट भान्ह्वाट्सएप मानवअधिकारकर्मी संविधानसम्मत कविराज क्षमा मरिहत्ते उठ्दा जवाफमा मार्गमा पराजित ीकरणमा जवाफमा स्वीकार्दै सर्लाही तिरे बिरामीहरु चारका माहोलमा पोम्पिवट साहिविद्यालयबाट भान्ह्वाट्सएप १२ त्रुटीसञ्चारमाध्यमहरूले रामेछापका चेपछि गृहमन्त्रीको केन्द्रामाहोलमा बच्नका साहिटेक्नोलोसियादेवदह क्षयरोग रह्यामारले र्था चाही व्यापारलाई नर्सकेन्द्रामाहोलमा कार्यकर्तालगानीकर्बिरामीहरु मोर्माहोलमा ट्वीटरमार्फत ट्वीटरमार्फत ट्वीटरमार्फत
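
The printed text above still begins with the prompt rather than with a reply, which suggests that `header_end` was not found in the decoded string and `clean_text` returned it unchanged. The short sketch below shows both branches of the helper on two made-up strings (hypothetical examples, not real model output); the function body is a copy of the one in the cell above so that the snippet runs on its own.

```python
def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):
    """Drop everything up to and including header_end, if it is present."""
    index = text.find(header_end)
    if index != -1:
        return text[index + len(header_end):].strip()
    return text


# Hypothetical decoded strings for illustration
with_header = "<|start_header_id|>assistant<|end_header_id|>\n\nLlamas eat grass."
without_header = "user What do llamas eat? some continuation"

print(clean_text(with_header))     # Llamas eat grass.
print(clean_text(without_header))  # returned unchanged because the marker is absent
```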


&nbsp;
# 4.5 The poor man's training loop

In [32]:
'''# The tokenizer does not seem to tokenize Nepali text well, so let's try training with English text instead

# Download data
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"

# Alternative dataset (use the raw URL; the github.com/.../blob/... page returns HTML, not the text file)
# file_path = "cleaned_bhagavad_gita_data.txt"
# url = "https://raw.githubusercontent.com/Aananda-giri/llm.np/main/4.%20LLAMA/training_loop/cleaned_bhagavad_gita_data.txt"

urllib.request.urlretrieve(url, file_path)

# Load the text
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])'''

In [34]:
import urllib.request

# Raw file URL
url = "https://raw.githubusercontent.com/Aananda-giri/llm.np/main/4.%20LLAMA/training_loop/cleaned_bhagavad_gita_data.txt"

# File path to save the file locally
file_path = "cleaned_bhagavad_gita_data.txt"

urllib.request.urlretrieve(url, file_path)


('cleaned_bhagavad_gita_data.txt', <http.client.HTTPMessage at 0x7dedee6e6470>)

In [36]:
# import tiktoken
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

#####################################
# Chapter 2
#####################################


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(tokenizer, txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader

def create_dataloaders(text_data, train_ratio, batch_size, max_length, stride, num_workers=0):
    split_idx = int(train_ratio * len(text_data))
    train_loader = create_dataloader_v1(
        tokenizer,
        text_data[:split_idx],
        batch_size=batch_size,
        max_length=max_length,
        stride=stride,
        drop_last=True,
        shuffle=True,
        num_workers=num_workers
    )
    val_loader = create_dataloader_v1(
        tokenizer,
        text_data[split_idx:],
        batch_size=batch_size,
        max_length=max_length,
        stride=stride,
        drop_last=False,
        shuffle=False,
        num_workers=num_workers
    )
    return train_loader, val_loader

def read_text_file(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()
    return text_data

text_data = read_text_file("cleaned_bhagavad_gita_data.txt") + " <|endoftext|> "



train_loader, val_loader = create_dataloaders(
    text_data,
    train_ratio=0.9,
    batch_size=2,
    max_length=LLAMA32_CONFIG["context_length"],
    stride=LLAMA32_CONFIG["context_length"],
    num_workers=0
)
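
The windowing logic in `GPTDatasetV1` is easier to see on a toy sequence. The following sketch reuses the same loop on a plain list of integers; the function name `sliding_windows` and the toy values are illustrative and not part of the notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs where each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10))  # stand-in for the tokenizer output
for inputs, targets in sliding_windows(toy_ids, max_length=4, stride=4):
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With `stride` equal to `max_length`, as in the `create_dataloaders` call above, the windows do not overlap, and trailing tokens that cannot fill a full window are dropped.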

In [None]:
import os
import time


n_epochs = 3
eval_freq = 300
print_sample_iter = 2_000
save_ckpt_freq = 2_000
eval_iter = 1

start_context = "She raised her eyebrows with a hint of"


output_dir = 'llama_debug_model'
os.makedirs(output_dir, exist_ok=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)


train_losses, val_losses, track_tokens_seen = [], [], []
tokens_seen = 0
global_step = -1
start_time = time.time()




# ----------------
# Functions:
# ----------------
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def generate_and_print_sample(PROMPT):
    # PROMPT = "What do llamas eat?"

    torch.manual_seed(123)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(PROMPT, chat_tokenizer).to(device),
        max_new_tokens=150,
        context_size=LLAMA32_CONFIG["context_length"],
        top_k=1,
        temperature=0.
    )

    output_text = token_ids_to_text(token_ids, tokenizer)

    print("Output text:\n", clean_text(output_text))

# -------------------------
# Actual training
# -------------------------
# try:
print("Training ...")
for epoch in range(n_epochs):
    model.train()
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        loss = calc_loss_batch(input_batch, target_batch, model, device)
        loss.backward()
        optimizer.step()
        tokens_seen += input_batch.numel()
        global_step += 1

        # Optional evaluation step
        if global_step % eval_freq == 0:
            train_loss, val_loss = evaluate_model(
                model, train_loader, val_loader, device, eval_iter)
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            track_tokens_seen.append(tokens_seen)
            print(f"Ep {epoch+1} (Step {global_step}): "
                  f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Generate text passage
        if global_step % print_sample_iter == 0 and global_step > 0:
            generate_and_print_sample(start_context)
            # generate_and_print_sample(
            #     model, tokenizer, device, start_context
            # )

    # NOTE: this check runs once per epoch (not once per step), so a checkpoint is only
    # written if the epoch happens to end on a multiple of save_ckpt_freq
    if global_step % save_ckpt_freq == 0:
        file_name = output_dir + '/' + f"model_pg_{global_step}.pth"
        torch.save(model.state_dict(), file_name)
        print(f"Saved {file_name}")

    # print_eta(start_time, book_start_time, index, total_files)
print(f"time_taken:{(time.time() - start_time) / 60:.2f} minutes device:{device}")
# except KeyboardInterrupt:
#     file_name = output_dir + '/' + f"model_pg_{global_step}_interrupted.pth"
#     torch.save(model.state_dict(), file_name)
#     print(f"Saved {file_name}")

Training ...
Ep 1 (Step 0): Train loss 9.125, Val loss 8.938
Ep 1 (Step 300): Train loss 9.625, Val loss 8.812
Ep 1 (Step 600): Train loss 9.188, Val loss 8.750
Ep 1 (Step 900): Train loss 8.438, Val loss 8.625
Ep 1 (Step 1200): Train loss 9.312, Val loss 8.500
Ep 1 (Step 1500): Train loss 8.688, Val loss 8.438
Ep 1 (Step 1800): Train loss 8.500, Val loss 8.438
Output text:
 प्रयोगकर्ता rsed eeerott o। । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । । ।
Ep 1 (Step 2100): Train loss 8.875, Val loss 8.375
Ep 1 (Step 2400): Train loss 7.625, Val loss 8.312
Ep 1 (Step 2700): Train loss 8.312, Val loss 8.312
Ep 1 (Step 3000): Train loss 8.188, Val loss 8.312
Ep 1 (Step 3300): Train loss 7.562, Val loss 8.312
Ep 1 (Step 3600): Train loss 8.250, Val l
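
One way to read these numbers: the cross-entropy loss is the average negative log-likelihood per token (using the natural logarithm), so `exp(loss)` gives the perplexity. The sketch below applies that conversion to two validation losses copied from the log above; the helper name `perplexity` is illustrative and not part of the notebook.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean cross-entropy loss (natural log) into perplexity."""
    return math.exp(cross_entropy_loss)


# (step, validation loss) pairs taken from the training log
for step, val_loss in [(0, 8.938), (3300, 8.312)]:
    print(f"Step {step}: val loss {val_loss:.3f} -> perplexity {perplexity(val_loss):,.0f}")
```

On this reading, the validation perplexity falls from roughly 7,600 at step 0 to roughly 4,100 at step 3300.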

&nbsp;
# What's next?

- The notebook was kept purposefully minimal; if you are interested in additional explanation about the individual components, check out the following two companion notebooks:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/gpt-and-all-llamas.webp">

  1. [Converting a From-Scratch GPT Architecture to Llama 2](converting-gpt-to-llama2.ipynb)
  2. [Converting Llama 2 to Llama 3.2 From Scratch](converting-llama2-to-llama3.ipynb)
  
- For those interested in a comprehensive guide on building a large language model from scratch and gaining a deeper understanding of its mechanics, you might like my [Build a Large Language Model (From Scratch)](http://mng.bz/orYv) book

<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>