## Notebook Introduction

Welcome to the labyrinth of "Inside Llama," where we unravel the complexities of Meta's Llama 3 model. This notebook is a testament to the power of precision, knowledge, and the relentless pursuit of perfection. It is designed for those who aspire not just to understand, but to master the inner workings of one of the most sophisticated language models in existence.


## Notebook Overview

This notebook is divided into several critical sections. Each one is a rung on the ladder to dominance over the machine learning landscape.


### Objective

Our mission is clear: to construct and train a small model with the Llama 3 architecture from scratch, employing a character-based tokenizer as our tool of choice. Inspired by the teachings of Andrej Karpathy, this tokenizer is the key to unlocking the full potential of our model. Should your preferences lean towards tradition, the original tokenizer from the Hugging Face Hub is but a switch away, courtesy of the `transformers` library.


### Architectural Blueprint

Before we embark on our journey, let us pause to appreciate the architectural elegance of Llama 3, as depicted in the following diagram:

<img src="./Llama-Architecture.png" alt="Llama Architecture" width="500">


This diagram is more than a mere image; it is a blueprint of our conquest:

1. **Input Tokens**: The journey begins with the input tokens, the raw material fed into the model.

2. **Embeddings**: These tokens are transformed into embeddings, the foundation upon which our model is built.

3. **Transformer Block**: The core of our architecture, the Transformer Block, where the magic happens:
   - **Multi-Head-Self-Attention**: This component employs Grouped-Query Attention (GQA) with a KV-Cache, a mechanism in which several query heads share one key-value head. We will not be using the KV-Cache in this notebook.
   - **RMS Norm**: Normalization is applied to ensure stability and efficiency.
   - **SwiGLU Activated MLP Layer**: A crucial layer where activation functions breathe life into our model.
   - **RMS Norm**: Another layer of normalization to maintain equilibrium.
   - **Residual Connections**: These connections ensure that information flows smoothly through the network without loss.
   
4. **Output Tokens**: The culmination of our efforts, the output tokens, are derived from the softmax probabilities and argmax operations, translating the model's predictions into tangible results.

This is the architecture that will guide us, the blueprint that will lead us to mastery. As we proceed through the notebook, each section will bring us closer to fully understanding and harnessing the power of Llama 3.

Prepare yourself for a journey of discovery, precision, and unparalleled mastery. Welcome to "Inside Llama".

## Installation of the required libraries

In [None]:
!pip install torch transformers tokenizers

In [None]:
from typing import Optional, Tuple

import math

import torch.nn.functional as F
from torch import nn
import torch

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

from utils import createPlot, createLossPlot, plot_lm_head_output, plot_probs_or_logits, plot_mask_tensor, plot_intermediate_attention, parse_parameters_from_file, visualize_parameters, save_model_parameters_to_file

### Parameters:

- **dim: int = 24 # 4096**: The core dimension of our model. We set it to 24, though it scales to 4096 in more ambitious undertakings like the Llama 3 8B model.

- **n_layers: int = 8 # 32**: The number of decoder Transformer layers. We start with 8; the only ceiling is the device we are training on. Llama 3 8B uses 32.

- **n_heads: int = 12 # 32**: The number of attention heads in our Multi-Head Attention mechanism. We begin with 12, yet can extend to Llama 3 8B's 32, each head enhancing our model's perceptive power.

- **n_kv_heads: Optional[int] = 12 # 8**: The number of key-value heads in the attention mechanism. Set at 12, equal to `n_heads`, so every query head has its own key-value head. Llama 3 8B uses 8.

- **vocab_size: int = -1**: The vocabulary size, a placeholder until the tokenizer is loaded.

- **multiple_of: int = 24 # 256**: The SwiGLU hidden layer size is rounded up to a multiple of this value. The original uses a large power of 2 (256); here it is 24 to suit our tiny model.

- **ffn_dim_multiplier: Optional[float] = None**: The multiplier for the Feed-Forward Network dimension, currently undefined, allowing for dynamic scaling as needed.

- **rms_norm_eps: float = 1e-5**: The epsilon value for RMS normalization, a fine-tuned parameter ensuring stability and precision in our model’s calculations.

- **max_batch_size: int = 6**: The maximum batch size, set to 6. A modest start, with the potential for scaling as our model’s appetite grows.

- **max_seq_len: int = 16**: The maximum sequence length, a defining parameter that caps our input sequences at 16 tokens, ensuring manageable complexity.

- **plot = False**: The Plot property, set to False for now. When we choose to visualize the values within our model, we’ll switch it on, revealing the intricate workings beneath the surface.

In [None]:
dim: int = 24
n_layers: int = 8
n_heads: int = 12
n_kv_heads: Optional[int] = 12
vocab_size: int = -1
multiple_of: int = 24
ffn_dim_multiplier: Optional[float] = None
rms_norm_eps: float = 1e-5
max_batch_size: int = 6
max_seq_len: int = 16
plot = False
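
A few of these values constrain each other, so a quick sanity check can be worthwhile before building the model. The sketch below copies `dim`, `n_heads`, and `n_kv_heads` from the cell above and derives `head_dim` and `n_rep` the same way the `Attention` class does later; the assertions themselves are an illustrative addition, not part of the notebook.

```python
# Values copied from the configuration cell above.
dim, n_heads, n_kv_heads = 24, 12, 12

head_dim = dim // n_heads      # size of each attention head
n_rep = n_heads // n_kv_heads  # number of query heads per key-value head

assert dim % n_heads == 0, "dim must be divisible by n_heads"
assert n_heads % n_kv_heads == 0, "n_heads must be a multiple of n_kv_heads"
# Rotary embeddings treat consecutive pairs of features as complex numbers.
assert head_dim % 2 == 0, "head_dim must be even for rotary embeddings"

print(f"head_dim={head_dim}, n_rep={n_rep}")
```

With these settings each head is only 2 features wide, which is one rotary pair per head.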

### Now, let's shift our attention (pun intended) to our Tokenizer and our Dataset.

In this segment, we see the elegance of our strategy unfold:

1. **Loading the Dataset**: We begin by drawing in our raw data, a fundamental step that brings us closer to the heart of our endeavor.

2. **Creating the Tokenizer**: The cells below train a WordPiece tokenizer with the `tokenizers` library (whitespace pre-tokenization, `[UNK]` and `[PAD]` special tokens) and save it to `custom_tokenizer.json`. The simpler character-level alternative works as follows:
    - **Character Analysis**: We collect the unique characters within our dataset, the building blocks of our linguistic universe.
    - **Vocab Insights**: The number of unique characters is the vocabulary size, a metric of our model’s breadth.
    - **Mapping Characters**: Two critical dictionaries are crafted:
        - **`stoi`**: Maps characters to their respective indices.
        - **`itos`**: Reverses the map, from indices back to characters.
    - **Tokenization and Detokenization**: Two lambda functions transform strings to sequences of indices and back, ensuring seamless transitions between raw text and numerical representations.

3. **Padding ID**: Finally, we look up the ID of the `[PAD]` token, a key player in managing sequences of varying lengths.
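
The character-level mapping with `stoi` and `itos` can be sketched in a few lines. This is a minimal illustration that uses a single line of text as a stand-in for the full dataset; the names `tokenize`, `detokenize`, and `char_vocab_size` are assumptions chosen for the example.

```python
# A short stand-in for the full dataset.
text = "ROMEO:\nI pay thy poverty, and not thy will."

chars = sorted(set(text))                     # unique characters, in a stable order
char_vocab_size = len(chars)                  # vocabulary size of the character tokenizer
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index
itos = {i: ch for ch, i in stoi.items()}      # index -> character

tokenize = lambda s: [stoi[c] for c in s]
detokenize = lambda ids: "".join(itos[i] for i in ids)

ids = tokenize("thy will")
print(ids)
print(detokenize(ids))
```

A tokenizer built this way can only encode characters it has seen, so it should be built from the whole dataset rather than a sample.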

In [None]:
print("... Loading Dataset")
with open("tiny-shakespear.txt", "r") as file:
    dataset = file.read()

print("... Initializing Tokenizer")
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

print("... Pre-tokenizing")
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

print("... Training custom Tokenizer")
trainer = trainers.WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[PAD]"]
)
tokenizer.train_from_iterator([dataset], trainer=trainer)

print("... Saving Tokenizer")
tokenizer.save("custom_tokenizer.json")

In [None]:
print("... Loading Tokenizer")
tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Fetching and displaying the vocab size from the tokenizer
vocab_size = tokenizer.get_vocab_size()
print(f"Vocab size: {vocab_size}")

# Assuming you want to print the ID for padding if you have added one
pad_id = tokenizer.token_to_id("[PAD]") if "[PAD]" in tokenizer.get_vocab() else None
print(f"Padding ID: {pad_id}")

In [None]:
class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight
    
# Quick demonstration of RMSNorm on a three-element vector
rms_norm_example = RMSNorm(3)
normalized_example = rms_norm_example(torch.tensor([1.0, 5.6, -8.3]))
print(f"The input numbers are [1.0, 5.6, -8.3], and after RMSNorm they are: {normalized_example.tolist()}")
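
To make the normalization in the cell above concrete, the same arithmetic can be reproduced without PyTorch: each value is divided by the root mean square of the vector (with `eps` added under the square root) and then multiplied by a learned weight, which starts at 1. The helper name `rms_norm` below is hypothetical and exists only for this sketch.

```python
import math

def rms_norm(values, weight=None, eps=1e-5):
    """Pure-Python RMS normalization of a single vector."""
    mean_sq = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    weight = weight or [1.0] * len(values)  # learned gain, initialised to ones
    return [v * scale * w for v, w in zip(values, weight)]

normed = rms_norm([1.0, 5.6, -8.3])
print([round(v, 4) for v in normed])

# After normalization the root mean square of the vector is very close to 1.
print(math.sqrt(sum(v * v for v in normed) / len(normed)))
```

Unlike LayerNorm, no mean is subtracted, so the signs and relative proportions of the inputs are preserved.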

### Let's now examine the precomputation and application of frequency components in our model.

Here, we witness the meticulous orchestration of frequency component precomputation and their application to query and key tensors in the attention mechanism:

1. **Precomputing Frequency Components (`precompute_freqs_cis`)**:
    - **Parameters**:
        - `dim`: The dimensionality of the input.
        - `end`: The sequence length.
        - `theta`: The base of the frequency schedule, defaulting to `10000.0`.
    - **Frequency Calculation**:
        - `freqs`: One frequency per pair of dimensions, computed as `1 / theta^(2k / dim)` for `k = 0, 1, ..., dim/2 - 1`.
        - Low-index pairs therefore rotate quickly, and high-index pairs rotate slowly.
    - **Time Steps**:
        - `t`: A range tensor from `0` to `end - 1`, representing positions in the sequence.
        - `freqs` is then expanded across these time steps using an outer product.
    - **Complex Frequency Representation**:
        - `freqs_cis`: Converts these frequency values into a complex form using polar coordinates, where the magnitude is `1` and the phase angle is given by `freqs`.

2. **Reshaping for Broadcast (`reshape_for_broadcast`)**:
    - **Parameters**:
        - `freqs_cis`: The precomputed complex frequency tensor.
        - `x`: The tensor to which the frequencies will be applied.
    - **Shape Adjustment**:
        - Ensures `freqs_cis` can be broadcast across `x`.
        - Constructs a shape list where only the dimensions corresponding to the sequence length and feature size are retained, others are set to `1`.
    - **Reshaping**:
        - Reshapes `freqs_cis` to match the required broadcast shape.

3. **Applying Rotary Embeddings (`apply_rotary_emb`)**:
    - **Parameters**:
        - `xq`, `xk`: Query and key tensors from the attention mechanism.
        - `freqs_cis`: The precomputed complex frequency tensor.
    - **Complex Conversion**:
        - Converts `xq` and `xk` to complex numbers by reshaping the last dimension into pairs, facilitating complex multiplication.
    - **Broadcast Adjustment**:
        - Adjusts the shape of `freqs_cis` to match `xq` and `xk` using `reshape_for_broadcast`.
    - **Rotary Embedding Application**:
        - Multiplies the complex queries and keys with the complex frequencies.
        - Converts the results back to real numbers and flattens the last two dimensions.
    - **Output**:
        - Returns the transformed `xq` and `xk` tensors in their original data type.
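
The effect of these rotations can be checked on plain Python lists. The sketch below is an illustration rather than the notebook's implementation: it uses the standard `cmath` module, and the helper names `rope_angles`, `rotate`, and `dot` are assumptions. It shows the property that makes rotary embeddings useful: the dot product of a rotated query and key depends only on the distance between their positions.

```python
import cmath

def rope_angles(head_dim, position, theta=10000.0):
    # One angle per pair of features: position * theta^(-2k / head_dim)
    return [position * theta ** (-(2 * k) / head_dim) for k in range(head_dim // 2)]

def rotate(vec, position, theta=10000.0):
    # View consecutive pairs (x0, x1), (x2, x3), ... as complex numbers and rotate each one.
    pairs = [complex(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]
    angles = rope_angles(len(vec), position, theta)
    rotated = [p * cmath.exp(1j * a) for p, a in zip(pairs, angles)]
    out = []
    for c in rotated:
        out.extend([c.real, c.imag])  # flatten back to real features
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

score_a = dot(rotate(q, 5), rotate(k, 3))    # positions 5 and 3, distance 2
score_b = dot(rotate(q, 12), rotate(k, 10))  # positions 12 and 10, distance 2
print(score_a, score_b)
```

Both scores agree up to floating-point error, because only the offset of 2 between the positions enters the result.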

---

This trio of functions performs a masterful operation, setting up the intricate dance of frequencies within our model's attention mechanism. Each step is a deliberate move, enhancing the model's ability to capture positional information and contextual relationships, ensuring our model sees not just the present, but the sequence and structure that bind the past and future.

In [None]:
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)  # type: ignore
    freqs = torch.outer(t, freqs).float()  # type: ignore
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis

def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)

def apply_rotary_emb(xq: torch.Tensor, xk: torch.Tensor, freqs_cis: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)

In [None]:
what_is_precompute_freqs_cis = precompute_freqs_cis(3, 4)
print(what_is_precompute_freqs_cis)

In [None]:
x = torch.randn(2, 3, 4)
what_is_reshape_for_broadcast = reshape_for_broadcast(torch.randn(3, 4), x)
print(x)
print(what_is_reshape_for_broadcast)

### `Attention` Class Explanation

- **Initialization (`__init__` method)**:
  - Defines the number of heads for the multi-head attention mechanism (`n_heads`).
  - Sets up key-value heads (`n_kv_heads`) and determines repetitions (`n_rep`).
  - Calculates the dimension of each head (`head_dim`).
  - Initializes linear layers for queries (`wq`), keys (`wk`), values (`wv`), and output (`wo`).

- **Forward Pass (`forward` method)**:
  - **Input Transformation**: Transforms the input tensor `x` into queries, keys, and values.
  - **Reshape for Multi-head Attention**: Reshapes queries, keys, and values into their respective head dimensions.
  - **Rotary Embedding Application**: Applies rotary embeddings to the queries and keys.
  - **Attention Calculation**: Computes attention scores using scaled dot-product, applies softmax, and multiplies by values to get the attention output.
  - **Output Projection**: Projects the concatenated attention outputs back to the model's dimensionality using `wo`.
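
The `forward` pass in this notebook does not use `n_rep`, which works because `n_kv_heads` equals `n_heads` in the configuration. If fewer key-value heads were used, each one would have to be repeated `n_rep` times before the attention scores are computed. The helper below is a sketch of one common way to do that; the name `repeat_kv` and the tensor sizes are assumptions for illustration.

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each key-value head n_rep times.

    (B, L, n_kv_heads, head_dim) -> (B, L, n_kv_heads * n_rep, head_dim)
    """
    if n_rep == 1:
        return x
    B, L, n_kv, head_dim = x.shape
    return (
        x[:, :, :, None, :]
        .expand(B, L, n_kv, n_rep, head_dim)
        .reshape(B, L, n_kv * n_rep, head_dim)
    )

keys = torch.randn(2, 5, 4, 8)       # 4 key-value heads
expanded = repeat_kv(keys, n_rep=3)  # now lines up with 12 query heads
print(expanded.shape)
```

In this sketch the call would sit after the rotary embedding step and before the transposes, applied to both keys and values.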

---

### `FeedForward` Class Explanation

- **Initialization (`__init__` method)**:
  - Sets up the feedforward network with a configurable hidden dimension (`hidden_dim`).
  - Adjusts the hidden dimension based on a multiplier (`ffn_dim_multiplier`) and ensures it aligns with a specific multiple (`multiple_of`).
  - Initializes three linear layers (`w1`, `w2`, `w3`) for the feedforward transformations.

- **Forward Pass (`forward` method)**:
  - **Linear Transformations**: Applies the first and third linear transformations (`w1` and `w3`) to the input tensor `x`.
  - **Element-wise Multiplication**: Multiplies the outputs of `w1` and `w3`.
  - **Activation Function**: Applies the SiLU (Sigmoid Linear Unit) activation function to the multiplied output.
  - **Final Transformation**: Transforms the activated output through the second linear layer (`w2`).
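
In the SwiGLU formulation used by Meta's reference Llama code, SiLU is applied to the `w1` projection first and the result is then multiplied element-wise by the `w3` projection, i.e. `w2(silu(w1(x)) * w3(x))`, which is a different order from applying SiLU to the product. The pure-Python sketch below shows that ordering on plain lists, together with the hidden-size rounding from `__init__`; the helper names `ffn_hidden_dim`, `silu`, and `swiglu_gate` are hypothetical.

```python
import math

def ffn_hidden_dim(dim, multiple_of, ffn_dim_multiplier=None):
    """Hidden size of the feed-forward layer, following the rounding rule in __init__."""
    hidden = int(2 * (4 * dim) / 3)
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def silu(v):
    return v / (1.0 + math.exp(-v))  # v * sigmoid(v)

def swiglu_gate(gate, up):
    """SiLU on the gate projection (w1), then element-wise product with the up projection (w3)."""
    return [silu(g) * u for g, u in zip(gate, up)]

print(ffn_hidden_dim(dim=24, multiple_of=24))  # 4 * 24 = 96 -> 64 -> rounded up to 72
print(swiglu_gate([0.0, 1.0], [3.0, 2.0]))
```

With the notebook's settings the hidden layer ends up 72 features wide, three times the model dimension.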

---

### `TransformerBlock` Class Explanation

- **Initialization (`__init__` method)**:
  - Sets up the transformer block with a specific layer ID.
  - Initializes RMS normalization layers for attention and feedforward networks.
  - Integrates the `Attention` and `FeedForward` modules to handle respective transformations.

- **Forward Pass (`forward` method)**:
  - **Attention Normalization**: Normalizes the input tensor `x` before passing it through the attention module.
  - **Residual Connection**: Adds the attention output back to the input tensor.
  - **Feedforward Normalization**: Normalizes the result before passing it through the feedforward network.
  - **Final Residual Connection**: Adds the feedforward output back to the tensor, completing the transformer block processing.

---

### `Transformer` Class Explanation

- **Initialization (`__init__` method)**:
  - Defines the token embedding layer using the vocabulary size and model dimensionality.
  - Creates a list of transformer blocks, iterating up to the defined number of layers (`n_layers`).
  - Initializes the final normalization layer and output linear layer.
  - Precomputes frequency components for rotary embeddings.

- **Forward Pass (`forward` method)**:
  - **Token Embedding**: Converts input tokens to embeddings.
  - **Frequency Component Preparation**: Prepares frequency components for the current sequence length.
  - **Attention Masking**: Creates an attention mask to manage sequence dependencies.
  - **Layer Processing**: Passes the embeddings through each transformer block sequentially.
  - **Final Normalization and Output Projection**: Normalizes the final layer output and projects it to the vocabulary space.

- **Text Generation (`generate` method)**:
  - **Iterative Generation**: Generates new tokens by iteratively applying the model to the most recent sequence.
  - **Probability Sampling**: Uses softmax to convert logits to probabilities and samples the next token.
  - **Sequence Extension**: Extends the input sequence with the newly generated token.
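
The difference between greedy decoding (argmax) and the probability sampling used by `generate` can be seen on a toy logit vector. This is a pure-Python sketch with made-up logits; `softmax` here is a local helper, and `random.Random(0)` is used only to make the draw repeatable.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0, 0.1]  # made-up scores for a 4-token vocabulary
probs = softmax(logits)

# Greedy decoding always picks the most likely token.
greedy_id = max(range(len(probs)), key=probs.__getitem__)

# Sampling draws a token in proportion to its probability.
rng = random.Random(0)
sampled_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(probs, greedy_id, sampled_id)
```

Sampling can return any token with non-zero probability, which is why an untrained model with nearly uniform probabilities produces essentially random text.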

In [None]:
class Attention(nn.Module):
    def __init__(self):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_heads if n_kv_heads is None else n_kv_heads
        self.n_rep = self.n_heads // self.n_kv_heads
        self.head_dim = dim // n_heads

        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        B, L, D = x.shape
        queries, keys, values = self.wq(x), self.wk(x), self.wv(x)

        with torch.no_grad():
            if plot:
                print("self.queries(x)")
                createPlot(queries.detach(), title="Queries")
                print("self.keys(x)")
                createPlot(keys.detach(), title="Keys")
                print("self.values(x)")
                createPlot(values.detach(), title="Values")

        queries = queries.view(B, L, self.n_heads, self.head_dim)
        keys = keys.view(B, L, self.n_kv_heads, self.head_dim)
        values = values.view(B, L, self.n_kv_heads, self.head_dim)

        # Rotary embeddings apply to queries and keys only; values stay unrotated.
        queries, keys = apply_rotary_emb(queries, keys, freqs_cis=freqs_cis)

        queries = queries.transpose(1, 2)  # (bs, n_heads, L, head_dim)
        keys = keys.transpose(1, 2)  # (bs, n_kv_heads, L, head_dim)
        values = values.transpose(1, 2)  # (bs, n_kv_heads, L, head_dim)
        scores = torch.matmul(queries, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        with torch.no_grad():
            if plot:
                plot_intermediate_attention(scores)
        if mask is not None:
            scores = scores + mask  # (bs, n_heads, L, cache_len + L)

        scores = F.softmax(scores.float(), dim=-1).type_as(queries)
        with torch.no_grad():
            if plot:
                plot_intermediate_attention(scores)

        output = torch.matmul(scores, values)  # (bs, n_heads, L, head_dim)
        with torch.no_grad():
            if plot:
                plot_intermediate_attention(output, title="Attention Output", xlabel="Scores", ylabel="Values")
        output = output.transpose(1, 2).contiguous().view(B, L, -1)

        wo = self.wo(output)

        with torch.no_grad():
            if plot:
                print("self.wo(output)")
                createPlot(wo.detach(), title="Output Projection")
        return wo


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float]
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        w1 = self.w1(x)
        w3 = self.w3(x)
        multiplied = w1 * w3
        activated = F.silu(multiplied)
        w2 = self.w2(activated)

        with torch.no_grad():
            if plot:
                print("self.w1(x)")
                createPlot(w1.detach(), title="w1")
                print("self.w3(x)")
                createPlot(w3.detach(), title="w3")
                print("w1 * w3")
                createPlot(multiplied.detach(), title="multiplied")
                print("F.silu(multiplied)")
                createPlot(activated.detach(), title="activated")
                print("self.w2(activated)")
                createPlot(w2.detach(), title="w2")
        return w2


class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int):
        super().__init__()
        self.layer_id = layer_id
        self.dim = dim
        self.attention_norm = RMSNorm(self.dim, eps=rms_norm_eps)
        self.attention = Attention()

        self.ffn_norm = RMSNorm(self.dim, eps=rms_norm_eps)
        self.feed_forward = FeedForward(
            dim=dim,
            hidden_dim=4 * dim,
            multiple_of=multiple_of,
            ffn_dim_multiplier=ffn_dim_multiplier,
        )

    def forward(self, x: torch.Tensor, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        if plot:
            print(f"START Transformer Block {self.layer_id}")
        attention_norm = self.attention_norm(x)

        with torch.no_grad():
            if plot:
                print("self.attention_norm(x)")
                createPlot(attention_norm.detach(), title="attention_norm")

        res1 = x + self.attention(attention_norm, freqs_cis, mask)

        with torch.no_grad():
            if plot:
                print("x + self.attention(attention_norm, freqs_cis, mask)")
                createPlot(res1.detach(), title="x + self.attention(attention_norm, freqs_cis, mask)")

        ffn_norm = self.ffn_norm(res1)

        with torch.no_grad():
            if plot:
                print("self.ffn_norm(res1)")
                createPlot(ffn_norm.detach(), title="self.ffn_norm(res1)")

        out = res1 + self.feed_forward(ffn_norm)

        with torch.no_grad():
            if plot:
                print("res1 + self.feed_forward(ffn_norm)")
                createPlot(out.detach(), title="res1 + self.feed_forward(ffn_norm)")
                print(f"END Transformer Block {self.layer_id}")
        return out


class Transformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim, padding_idx=pad_id)

        self.layers = torch.nn.ModuleList()
        for layer_id in range(n_layers):
            self.layers.append(TransformerBlock(layer_id))

        self.norm = RMSNorm(dim, eps=rms_norm_eps)
        self.output = nn.Linear(dim, vocab_size, bias=False)

        self.freqs_cis = precompute_freqs_cis(dim // n_heads, max_seq_len * 2)

    def forward(self, tokens: torch.Tensor, start_pos: int = 0, targets=None):
        B, L = tokens.shape
        h = self.tok_embeddings(tokens)

        with torch.no_grad():
            if plot:
                print("self.tok_embeddings(tokens)")
                createPlot(h.detach(), title="tok_embeddings")

        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + L]

        mask = None
        if L > 1:
            mask = torch.full((L, L), float("-inf"), device=tokens.device)
            mask = torch.triu(mask, diagonal=1)
            mask = torch.hstack([torch.zeros((L, start_pos), device=tokens.device), mask]).type_as(h)

        if plot:
            plot_mask_tensor(mask)

        for layer in self.layers:
            h = layer(h, freqs_cis, mask)

        with torch.no_grad():
            if plot:
                print("h = layer(h, freqs_cis, mask)")
                createPlot(h.detach(), title="h = layer(h, freqs_cis, mask)")

        h = self.norm(h)

        with torch.no_grad():
            if plot:
                print("self.norm(h)")
                createPlot(h.detach(), title="self.norm(h)")

        logits = self.output(h).float()

        with torch.no_grad():
            if plot:
                print("logits = self.output(h).float()")
                plot_lm_head_output(logits.detach(), title="self.output(h).float()")

        if targets is None:
            loss = None
        else:
            B, L, D = logits.shape
            logits = logits.view(B*L, D)
            targets = targets.view(B*L)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -max_seq_len:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]
            if plot:
                plot_probs_or_logits(logits, title="Logits from the Last Dimension", xlabel="Sequence Length (Tokens)", ylabel="Logit Value", label="Logits")
            probs = F.softmax(logits, dim=-1)
            if plot:
                plot_probs_or_logits(probs)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

In [None]:
model = Transformer()
print(model)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Parameters: {total_params}")

save_model_parameters_to_file(model, 'model_untrained_parameters.txt')
parameters = parse_parameters_from_file('model_untrained_parameters.txt')
visualize_parameters(parameters, 'all_untrained_parameters_visualization.png', 'individual_untrained_parameter_plots')

### Testing the Untrained Model

First, let's prepare our input and tokens, then pass them through the untrained model to observe its raw, initial output.

1. **Tokenize the Input**:
   - Take the given line from Shakespeare and convert it into a sequence of tokens using our tokenizer.

2. **Create Tensor from Tokens**:
   - Convert the sequence of tokens into a tensor format suitable for model input.

3. **Print the Tokens**:
   - Output the token tensor to ensure our input is correctly formatted.

Here's how it would unfold.

- **Line Selection**: We select a specific line from "Romeo and Juliet" to test the model's initial output. The line I chose is "ROMEO: I pay thy poverty, and not thy will."; the prompt is the opening of that line, up to and including "poverty, ".
- **Tokenization**: `tokenizer.encode` transforms the text into a sequence of integer token IDs.
- **Tensor Conversion**: The token sequence is converted into a tensor and reshaped to match the model's expected input format (batch size of 1).
- **Output Display**: We print the resulting tensor to verify that the tokenization process was successful.

In [None]:
print("First untrained testing")

line_19826 = """ROMEO:\nI pay thy poverty, """

first_tokens = tokenizer.encode(line_19826)
print(first_tokens.tokens)
first_input = torch.LongTensor(first_tokens.ids).unsqueeze(0)
print(first_input)

### Testing the Untrained Model with Plots

Here's how we can set up the test and generate the output while enabling plots for each step of the process.

1. **Plotting Enabled**:
   - Set `plot = True` to ensure that plots are generated at each stage of the model's forward pass and generation process.

2. **Generate New Token**:
   - Call the `generate` method on the model with `first_input` as input and `max_new_tokens=1` to generate one new token.
   
3. **Output Conversion**:
   - Convert the generated output tensor to a list of tokens.

4. **Detokenize**:
   - Use `tokenizer.decode` to convert the list of tokens back into text.

5. **Print Output**:
   - Print the generated character or token to observe the untrained model's response.

### Prepare for Plots:

As the model generates the output, the `plot` flag will trigger multiple plots displaying the internal states and transformations at various stages:

- **Token Embeddings**: Visualize how the input tokens are converted into embeddings.
- **Attention Mechanisms**: Observe the queries, keys, values, and attention scores.
- **Feedforward Transformations**: See the intermediate results of the feedforward network.
- **Layer Outputs**: Track the output at each layer of the transformer.

These plots will provide a comprehensive view of the untrained model's behavior, allowing us to analyze and understand its initial, unbiased responses.

Brace yourself for the insights and visualizations that follow, as they will pave the way for informed adjustments and training strategies.

In [None]:
plot = True
output = model.generate(first_input, max_new_tokens=1)
out = output[0].tolist()  # token IDs of the first (and only) sequence in the batch
generated_text_from_untrained = tokenizer.decode(out, skip_special_tokens=True)
print(f"Prompt followed by the newly generated token: {generated_text_from_untrained}")

### Let's generate more tokens

In [None]:
plot = False
output = model.generate(first_input, max_new_tokens=10)
out = output[0].tolist()
generated_text_from_untrained = tokenizer.decode(out, skip_special_tokens=True)
print(generated_text_from_untrained)

### Output Analysis:

The output you received:
```
ROMEO:
I pay thy poverty, Q-Vtwm::Hyom3yH3O'NO:AV3uJiuKuQg$nyMBla:uOH'ql?MgIG#rRH:;;lWg;aQFegX!-,p.T.V saWVGAA.jpFw:g!nOVZiBzq
```
is not ideal and quite different from the expected:
```
ROMEO:
I pay thy poverty, and not thy will.
```


### Explanation and Next Steps:

1. **Untrained Model**:
   - The model is currently untrained, meaning it hasn't learned the patterns, syntax, or semantics of the Shakespearean language.
   - The output is essentially random, reflecting the lack of training.

2. **Training the Model**:
   - To generate coherent and contextually accurate text, the model needs to be trained on a substantial amount of relevant data.
   - Training involves feeding the model many examples of Shakespearean text (or any desired text corpus) and adjusting its parameters to minimize prediction errors.

3. **Steps for Training**:
   - **Prepare Dataset**: Ensure you have a large and well-prepared dataset of Shakespearean text.
   - **Define Training Loop**: Implement a training loop that iteratively adjusts the model's weights based on the loss computed from its predictions.
   - **Evaluation and Fine-tuning**: Regularly evaluate the model's performance and fine-tune hyperparameters for better results.


### Conclusion:

The initial output serves as a baseline, highlighting the importance of training. By investing time and computational resources into training the model, it will progressively learn to generate text that aligns with your expectations.


### Training the Model

Here’s a detailed explanation of each part of the training process:

#### Data Preparation
1. **Tokenize Dataset**: Convert the entire dataset into a sequence of tokens.
2. **Train/Validation Split**: Split the tokenized data into 90% for training and 10% for validation.

#### Batch Preparation
3. **Get Batch Function**: Define a function to sample batches of data for training and validation. This function randomly selects sequences of length `max_seq_len` from the data.

#### Loss Estimation
4. **Estimate Loss Function**: Create a function to evaluate the model's performance on both training and validation sets without updating the model parameters (`@torch.no_grad()`).

#### Optimizer
5. **AdamW Optimizer**: Use AdamW, a variant of the Adam optimizer that applies weight decay directly to the weights instead of folding it into the gradient, which acts as a regularizer during training.

#### Training Loop
6. **Training Loop**: Train the model for a specified number of steps (`max_steps`). Periodically evaluate and log the training and validation losses.

#### Save and Plot
7. **Save the Model**: Save the trained model parameters for future use.
8. **Plot Loss Curves**: Plot the training and validation loss curves to visualize the training progress.

### Training Process

1. **Data Preparation**: Tokenize the dataset and split it into training and validation sets so we can measure how well the model generalizes to text it has not trained on.
2. **Batch Preparation**: Randomly sample batches for both training and evaluation so that the model sees a different slice of the data at every step.
3. **Loss Estimation**: Regularly estimate the model’s loss on training and validation data to monitor overfitting and model performance.
4. **Optimizer**: Use the AdamW optimizer, which helps in better generalization due to its weight decay property.
5. **Training Loop**: Train the model iteratively, evaluate performance periodically, and log the losses for visualization.
6. **Save and Plot**: Save the trained model for future use and plot the loss curves to visualize the training dynamics.
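
Once losses are logged, overfitting shows up as a validation loss that stops improving while the training loss keeps falling. A small sketch of how the logged lists could be inspected follows; the helper names are hypothetical, and the loss values are invented for illustration only, not results from this notebook.

```python
def best_validation_step(steps, val_losses):
    """Return the logged step with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return steps[best_index]

def generalization_gap(train_losses, val_losses):
    """Validation loss minus training loss at each logged step."""
    return [v - t for t, v in zip(train_losses, val_losses)]

# Hypothetical logged values, for illustration only.
example_steps = [0, 1000, 2000, 3000]
example_train = [4.2, 2.1, 1.6, 1.2]
example_val = [4.2, 2.2, 1.9, 2.0]

print(best_validation_step(example_steps, example_val))
print(generalization_gap(example_train, example_val))
```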

In [None]:
max_steps = 10000
eval_steps = 100
eval_interval = 1000
lr = 0.002
max_batch_size = 32

steps = []
train_losses = []
val_losses = []

print("... Loading Dataset")
with open("/Users/gokdenizgulmez/Desktop/Inside-Llama/tiny-shakespear.txt", "r") as file:
    dataset = file.read()

# Tokenizing the dataset
dataset = tokenizer.encode(dataset)
# Converting tokens to tensor and adding batch dimension
data = torch.LongTensor(dataset.ids).unsqueeze(0)

# Splitting data into training and validation sets
n = int(0.9 * len(data[0]))
train_data = data[:, :n]  # first 90% of the tokens
val_data = data[:, n:]    # remaining 10% of the tokens

print("Dataset loaded and split into training and validation sets.")

def get_batch(split):
    data = train_data if split == 'train' else val_data
    if len(data[0]) <= max_seq_len:
        raise ValueError(f"Data length ({len(data[0])}) is not sufficient for the sequence length ({max_seq_len}).")
    ix = torch.randint(len(data[0]) - max_seq_len, (max_batch_size,))
    x = torch.stack([data[0, i:i+max_seq_len] for i in ix])
    y = torch.stack([data[0, i+1:i+max_seq_len+1] for i in ix])
    return x.to("cpu"), y.to("cpu")

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_steps)
        for k in range(eval_steps):
            X, Y = get_batch(split)
            _, loss = model(X, start_pos=0, targets=Y)
            losses[k] = loss.item()
        out[split] = losses.mean().item()  # store a plain float for logging and plotting
    model.train()
    return out

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
print("Training")

for step in range(max_steps):
    # periodically evaluate the loss on the train and val sets, and once more on the final step
    if step % eval_interval == 0 or step == max_steps - 1:
        losses = estimate_loss()
        steps.append(step)
        train_losses.append(losses['train'])
        val_losses.append(losses['val'])
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

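    # forward pass: compute the logits and the loss
    logits, loss = model(xb, start_pos=0, targets=yb)
    # backward pass: clear stale gradients, backpropagate, update the weights
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()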

# Save the trained model
torch.save(model.state_dict(), "trained_llama3_model.pth")

# Plot the loss curves
createLossPlot(steps, train_losses)
createLossPlot(steps, val_losses, title="Validation")

## Let's now test the trained Llama model again

In [None]:
# Load the saved model weights
trained_model = Transformer()
trained_model.load_state_dict(torch.load("trained_llama3_model.pth"))

In [None]:
save_model_parameters_to_file(model, 'model_trained_parameters.txt')
parameters = parse_parameters_from_file('model_trained_parameters.txt')
visualize_parameters(parameters, 'all_trained_parameters_visualization.png', 'individual_trained_parameter_plots')

In [None]:
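plot = False
# Note: `model` was updated in place by the training loop above, so this output
# comes from trained weights even though the variable name says "untrained".
output = model.generate(first_input, max_new_tokens=10)
out = output[0].tolist()
generated_text_from_untrained = tokenizer.decode(out, skip_special_tokens=True)
print(generated_text_from_untrained)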

In [None]:
plot = True
output = trained_model.generate(first_input, max_new_tokens=1)
out = output[0].tolist()
generated_text = tokenizer.decode(out, skip_special_tokens=True)
print(generated_text)