## Notebook Introduction

Welcome to the labyrinth of "Inside Llama," where we unravel the complexities of Meta's Llama 3 model. This notebook is a testament to the power of precision, knowledge, and the relentless pursuit of perfection. It is designed for those who aspire not just to understand, but to master the inner workings of one of the most sophisticated language models in existence.


## Notebook Overview

This notebook is divided into several critical sections. Each one is a rung on the ladder to dominance over the machine learning landscape.


### Objective

Our mission is clear: to construct and train the Llama 3 model from scratch, employing a character-based tokenizer as our tool of choice. Inspired by the teachings of Andrej Karpathy, this tokenizer keeps the vocabulary small and every step of the pipeline visible. Should your preferences lean towards tradition, the original tokenizer from the Hugging Face Hub is but a switch away, courtesy of the `transformers` library.


### Architectural Blueprint

Before we embark on our journey, let us pause to appreciate the architectural elegance of Llama 3, as depicted in the following diagram:

<img src="./Llama-Architecture.png" alt="Llama Architecture" width="500">


This diagram is more than a mere image; it is a blueprint of our conquest:

1. **Input Tokens**: The journey begins with the input tokens, the raw material fed into the model.

2. **Embeddings**: These tokens are transformed into embeddings, the foundation upon which our model is built.

3. **Transformer Block**: The core of our architecture, the Transformer Block, where the magic happens:
   - **Multi-Head Self-Attention**: Llama 3 employs Grouped-Query Attention with a KV-Cache, a mechanism in which several query heads share each key-value head. We will not be using the KV-Cache in this notebook.
   - **RMS Norm**: Normalization is applied to ensure stability and efficiency.
   - **SwiGLU Activated MLP Layer**: The feed-forward layer, where a SiLU-gated linear unit supplies the non-linearity that breathes life into our model.
   - **RMS Norm**: Another layer of normalization to maintain equilibrium.
   - **Residual Connections**: Each sub-layer's input is added back to its output, so information and gradients flow smoothly through the network without loss.
   
4. **Output Tokens**: The culmination of our efforts, the output tokens, are derived from the softmax probabilities and argmax operations, translating the model's predictions into tangible results.
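
Two of the steps above are small enough to sketch in plain Python before we build them in PyTorch: RMS normalization and the softmax-plus-argmax step that picks an output token. This is a minimal illustration under stated assumptions, not the notebook's implementation; the helper names `rms_norm`, `softmax`, and `greedy_token` and the toy vectors are hypothetical.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root-mean-square, then scale by a learned per-feature weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Return the index of the most probable token (argmax over the softmax)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

hidden = [3.0, -4.0]        # toy activation vector (hypothetical values)
weight = [1.0, 1.0]         # learned gain, shown here at its usual initial value of ones
normed = rms_norm(hidden, weight, eps=0.0)

logits = [0.1, 2.5, -1.0]   # toy scores over a 3-token vocabulary
next_token = greedy_token(logits)
print(normed, next_token)
```

After normalization the vector has a mean square of 1, which is the property RMS Norm is meant to enforce; the greedy step simply returns the position of the largest logit.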

This is the architecture that will guide us, the blueprint that will lead us to mastery. As we proceed through the notebook, each section will bring us closer to fully understanding and harnessing the power of Llama 3.

Prepare yourself for a journey of discovery, precision, and unparalleled mastery. Welcome to "Inside Llama".

## Installation of the required libraries

In [None]:
!pip install torch transformers

In [1]:
from typing import Optional, Tuple

import math

import torch.nn.functional as F
from torch import nn
import torch

from plot import createPlot, createLossPlot, LlamaVisualizer


### Parameters:

- **dim: int = 16 # 4096**: The embedding dimension of our model, the width of every token vector. In the realm of possibilities, we set it to 16, though it scales to 4096 in more ambitious undertakings like the Llama 3 8B model.

- **n_layers: int = 6 # 32**: The number of Decoder-Transformer Layers. We start with 6, but the only ceiling is the device we are training on; Llama 3 8B uses 32.

- **n_heads: int = 8 # 32**: The number of attention heads in our Multi-Head Attention mechanism. We begin with 8, yet can extend to Llama 3 8B's 32. With `dim = 16`, each of our 8 heads works on `16 / 8 = 2` dimensions.

- **n_kv_heads: Optional[int] = 8 # 8**: The number of key-value heads in the attention mechanism. Set at 8, it matches `n_heads`, so every query head has its own key-value head. Llama 3 8B also uses 8, shared among its 32 query heads.

- **vocab_size: int = -1**: The vocabulary size. The placeholder `-1` is overwritten once the tokenizer has been built from the dataset.

- **multiple_of: int = 24**: The SwiGLU hidden layer size is rounded up to a multiple of this value. The original uses a large power of 2 (256) for strategic alignment with the hardware; our 24 is not a power of 2, but it suits a model this small.

- **ffn_dim_multiplier: Optional[float] = None**: An optional multiplier for the Feed-Forward Network hidden dimension. Left at `None`, no extra scaling is applied; set it to a float to scale the hidden layer as needed.

- **rms_norm_eps: float = 1e-5**: The epsilon value for RMS normalization, a fine-tuned parameter ensuring stability and precision in our model’s calculations.

- **max_batch_size: int = 6**: The maximum batch size, set to 6. A modest start, with the potential for scaling as our model’s appetite grows.

- **max_seq_len: int = 32**: The maximum sequence length, a defining parameter that caps our input sequences at 32, ensuring manageable complexity.

- **plot = False**: The Plot property, set to False for now. When we choose to visualize the values within our model, we’ll switch it on, revealing the intricate workings beneath the surface.
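
To make the interplay of `dim`, `n_heads`, and `n_kv_heads` concrete, here is a small worked sketch. It assumes the common convention that `head_dim = dim // n_heads` and that consecutive query heads share one key-value head; the function name `attention_layout` is hypothetical, and treating `n_kv_heads = None` as "same as `n_heads`" is likewise an assumption.

```python
def attention_layout(dim, n_heads, n_kv_heads=None):
    """Derive per-head sizes and the query-head -> kv-head mapping."""
    if n_kv_heads is None:          # assumption: None means one kv head per query head
        n_kv_heads = n_heads
    assert dim % n_heads == 0, "dim must be divisible by n_heads"
    assert n_heads % n_kv_heads == 0, "n_heads must be divisible by n_kv_heads"
    head_dim = dim // n_heads       # width of each attention head
    n_rep = n_heads // n_kv_heads   # query heads sharing each kv head
    kv_head_for_query = [q // n_rep for q in range(n_heads)]
    return head_dim, n_rep, kv_head_for_query

# Our notebook's settings: every query head has its own kv head.
small = attention_layout(dim=16, n_heads=8, n_kv_heads=8)
# Llama 3 8B's settings: groups of 4 query heads share a kv head.
large = attention_layout(dim=4096, n_heads=32, n_kv_heads=8)
print(small)
print(large[:2], large[2][:8])
```

With our settings the grouping is trivial, which is why the notebook behaves like ordinary multi-head attention; the grouping only starts to save memory once `n_kv_heads` is smaller than `n_heads`.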

In [None]:
dim: int = 16
n_layers: int = 6
n_heads: int = 8
n_kv_heads: Optional[int] = 8
vocab_size: int = -1
multiple_of: int = 24
ffn_dim_multiplier: Optional[float] = None
rms_norm_eps: float = 1e-5
max_batch_size: int = 6
max_seq_len: int = 32
plot = False
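
How `multiple_of` and `ffn_dim_multiplier` shape the SwiGLU hidden layer is easiest to see with numbers. The sketch below assumes the notebook follows the sizing rule used in Meta's reference Llama code (start from `4 * dim`, take two thirds, optionally scale, then round up to a multiple of `multiple_of`); the function name `swiglu_hidden_dim` is hypothetical.

```python
from typing import Optional

def swiglu_hidden_dim(dim: int, multiple_of: int,
                      ffn_dim_multiplier: Optional[float] = None) -> int:
    """Hidden size of the SwiGLU MLP under the assumed reference sizing rule."""
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)            # two thirds of 4 * dim
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

tiny_hidden = swiglu_hidden_dim(dim=16, multiple_of=24)
print(tiny_hidden)
```

For our parameters this works out as `4 * 16 = 64`, two thirds of which truncates to 42, rounded up to the next multiple of 24, giving a hidden size of 48.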

### Now, let's shift our attention (pun intended) to our Tokenizer and our Dataset.

In this segment, we see the elegance of our strategy unfold:

1. **Loading the Dataset**: We begin by drawing in our raw data, a fundamental step that brings us closer to the heart of our endeavor.

2. **Creating the Tokenizer**: 
    - **Character Analysis**: We meticulously analyze the unique characters within our dataset, understanding the building blocks of our linguistic universe.
    - **Vocab Insights**: We reveal the vocabulary size, a metric of our model’s breadth and comprehension.
    - **Mapping Characters**: Two critical dictionaries are crafted:
        - **`stoi`**: Maps characters to their respective indices.
        - **`itos`**: Reverses the map, from indices back to characters.
    - **Tokenization and Detokenization**: We define our lambda functions, transforming strings to sequences of indices and back, ensuring seamless transitions between raw text and numerical representations.

3. **Padding ID**: Finally, we identify our padding ID, a key player in managing sequences of varying lengths, ensuring consistency and order. It reuses the index of the character `P`, so `P` must occur in the dataset.
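
The mapping steps above can be tried on a toy string before touching the real dataset. This is a minimal sketch: `sample_text` is a hypothetical stand-in for the contents of `dataset.txt`, and plain functions are used here in place of lambdas purely for readability.

```python
sample_text = "hello llama"   # hypothetical stand-in for the dataset

# Vocabulary: the sorted set of unique characters.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> index
itos = {i: ch for i, ch in enumerate(chars)}   # index -> character

def tokenize(s):
    """Map each character of s to its index; raises KeyError on unseen characters."""
    return [stoi[c] for c in s]

def detokenize(ids):
    """Map a list of indices back to the original string."""
    return "".join(itos[i] for i in ids)

ids = tokenize("hello")
round_trip = detokenize(ids)
print(chars, ids, round_trip)
```

Because the vocabulary is learned from the text itself, any character that never appears in the dataset cannot be tokenized, a limitation worth remembering when prompting the trained model.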

In [None]:
print("... Loading Dataset")
# Read the whole corpus into memory as a single string.
with open("dataset.txt", "r") as file:
    dataset = file.read()

print("... Creating Tokenizer")
# The vocabulary is the sorted set of unique characters in the dataset.
chars = sorted(set(dataset))
print(".......................................................")
print(f"Learned Chars: {chars}")
print(".......................................................")
vocab_size = len(chars)
print(f"Vocab size: {vocab_size}")
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index
itos = {i: ch for i, ch in enumerate(chars)}  # index -> character
tokenize = lambda s: [stoi[c] for c in s]
detokenize = lambda l: ''.join([itos[i] for i in l])

# The index of "P" doubles as the padding ID; this raises KeyError if "P" is not in the dataset.
pad_id = tokenize("P")[0]
print(f"padding ID: {pad_id}")
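
How the padding ID is used is not shown in this section, so the following is only a sketch of one common approach: right-padding (and truncating) token sequences so that a batch has a uniform length. The helper names `pad_to_length` and `pad_batch`, the toy sequences, and the choice of right-padding are all assumptions.

```python
def pad_to_length(ids, length, pad_id):
    """Truncate ids to `length`, then right-pad with pad_id up to `length`."""
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))

def pad_batch(sequences, length, pad_id):
    """Apply pad_to_length to every sequence so the batch is rectangular."""
    return [pad_to_length(seq, length, pad_id) for seq in sequences]

# Toy token sequences of different lengths (hypothetical values); pad_id=0 for illustration.
batch = pad_batch([[5, 1, 2], [7], [3, 3, 3, 3, 3, 3]], length=4, pad_id=0)
print(batch)
```

In the notebook itself, `length` would typically be `max_seq_len` and `pad_id` the value computed above. Since that value is also the index of a real character, padded positions are indistinguishable from genuine `P` tokens unless they are masked separately.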