In [1]:
import math
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
from torch.nn import functional as F
from typing_extensions import Self

from utils import find_multiple

In [2]:
llama_configs = {
    "7B": dict(n_layer=32, n_head=32, n_embd=4096),
    "13B": dict(n_layer=40, n_head=40, n_embd=5120),
    "30B": dict(n_layer=60, n_head=52, n_embd=6656),
    "65B": dict(n_layer=80, n_head=64, n_embd=8192),
}

These are the different variants of the LLaMA model, keyed by parameter count.

In [3]:
@dataclass
class LLaMAConfig:
    block_size: int = 2048
    vocab_size: int = 32000
    padded_vocab_size: Optional[int] = None
    n_layer: int = 32
    n_head: int = 32
    n_embd: int = 4096

    def __post_init__(self):
        if self.padded_vocab_size is None:
            self.padded_vocab_size = find_multiple(self.vocab_size, 64)

    @classmethod
    def from_name(cls, name: str) -> Self:
        return cls(**llama_configs[name])


The `LLaMAConfig` class stores the model's hyperparameters.<br>
Let's go through each field:<br>

- `block_size` : The maximum sequence length the language model can process.
- `vocab_size` : The size of the vocabulary the language model was trained on.
- `padded_vocab_size` : The vocabulary size rounded up to a multiple of 64 (computed in `__post_init__` when it is not given).
- `n_layer` : The total number of transformer blocks.
- `n_head` : The number of attention heads in each transformer block.
- `n_embd` : The size of the embedding.

According to [this](https://twitter.com/karpathy/status/1621578354024677377/) tweet by **Andrej Karpathy**, it is important to pad your vocabulary size to the nearest multiple of 64. The tweet explains: <br>

*The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.*

You can also read more about it [HERE](https://pytorch.org/blog/accelerating-large-language-models/).



```python
def __post_init__(self):
    if self.padded_vocab_size is None:
        self.padded_vocab_size = find_multiple(self.vocab_size, 64)
```
So, this code sets `padded_vocab_size` to a multiple of 64 derived from `vocab_size`, but only if `padded_vocab_size` was not set explicitly.
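
`find_multiple` is imported from `utils` and its body is not shown in this notebook. Below is a minimal sketch of what such a helper could look like, assuming it rounds `n` up to the next multiple of `k`. The name `round_up_to_multiple` is hypothetical and only used for illustration.

```python
def round_up_to_multiple(n: int, k: int) -> int:
    """Return the smallest multiple of k that is >= n.

    Assumed behaviour of utils.find_multiple; not the actual implementation.
    """
    remainder = n % k
    if remainder == 0:
        return n
    return n + k - remainder


llama_padded = round_up_to_multiple(32000, 64)  # LLaMA vocabulary size
gpt2_padded = round_up_to_multiple(50257, 64)   # vocabulary size from the tweet
print(llama_padded, gpt2_padded)
```

Under this assumption, LLaMA's vocabulary size of 32000 is already a multiple of 64, so `padded_vocab_size` equals `vocab_size`, while 50257 is rounded up to 50304 as in the tweet.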







In [4]:
class LLaMA(nn.Module):
    def __init__(self, config: LLaMAConfig) -> None:
        super().__init__()
        assert config.padded_vocab_size is not None
        self.config = config

        self.lm_head = nn.Linear(config.n_embd, config.padded_vocab_size, bias=False)
        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.padded_vocab_size, config.n_embd),
                h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
                ln_f=RMSNorm(config.n_embd),
            )
        )

    def _init_weights(self, module: nn.Module) -> None:
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02 / math.sqrt(2 * self.config.n_layer))
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02 / math.sqrt(2 * self.config.n_layer))

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        _, t = idx.size()
        assert (
            t <= self.config.block_size
        ), f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"

        # forward the LLaMA model itself
        x = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)

        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        logits = self.lm_head(x)  # (b, t, vocab_size)

        return logits

    @classmethod
    def from_name(cls, name: str) -> Self:
        return cls(LLaMAConfig.from_name(name))

The LLaMA model provided is a PyTorch-based implementation. Below is an elaboration on the various components of the code:

1. **Initialization**:
```python
class LLaMA(nn.Module):
    def __init__(self, config: LLaMAConfig) -> None:
        super().__init__()
        assert config.padded_vocab_size is not None
        self.config = config
```
Here, the model takes a configuration object, `LLaMAConfig`, during initialization. An assertion checks that the `padded_vocab_size` attribute is not `None`.

2. **Model Architecture**:
```python
        self.lm_head = nn.Linear(config.n_embd, config.padded_vocab_size, bias=False)
        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.padded_vocab_size, config.n_embd),
                h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
                ln_f=RMSNorm(config.n_embd),
            )
        )
```
- The `lm_head` is the final linear layer of the model. It maps each embedding to a vector of `padded_vocab_size` scores (logits), one per vocabulary entry. Why do we need this? To predict the next token we need a probability distribution over the vocabulary, and applying softmax to these logits gives exactly that.

- `transformer` is a dictionary of modules, which includes:
  - `wte`: Word Token Embedding, an embedding layer for the vocabulary. Given token ids, it produces embeddings of size `config.n_embd` (4096 for the 7B model).
  - `h`: A list of blocks, with each block being a segment of the transformer architecture. The number of blocks is defined by `config.n_layer`.
  - `ln_f`: A final layer normalization, here using RMSNorm.
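
To make the shapes concrete, here is a small sketch of the embedding-to-head path with toy sizes. The sizes and variable values are made up for illustration, and the transformer blocks that sit between `wte` and `lm_head` are skipped.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, n_embd = 128, 16   # toy sizes, not the real LLaMA config
B, T = 2, 5                    # batch size, sequence length

wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

idx = torch.randint(0, vocab_size, (B, T))  # token ids, shape (B, T)
x = wte(idx)                                # embeddings, shape (B, T, n_embd)
logits = lm_head(x)                         # one score per vocabulary entry, shape (B, T, vocab_size)
probs = F.softmax(logits, dim=-1)           # probability distribution over the vocabulary

print(x.shape, logits.shape)
```

Each of the `B * T` positions ends up with its own distribution over the vocabulary; at inference time only the last position is used to pick the next token.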

3. **Weight Initialization**:
```python
    def _init_weights(self, module: nn.Module) -> None:
        ...
```
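This method draws the weights of linear and embedding layers from a normal distribution with mean 0 and standard deviation `0.02 / sqrt(2 * n_layer)`, so the initial scale shrinks as the model gets deeper.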
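
The notebook does not show where `_init_weights` is called. A common PyTorch pattern is `module.apply(fn)`, which visits every submodule. The sketch below assumes that pattern and uses a toy model to check the resulting standard deviation; `init_weights` and `toy` are hypothetical names.

```python
import math

import torch
import torch.nn as nn

n_layer = 32  # the 7B setting from llama_configs
target_std = 0.02 / math.sqrt(2 * n_layer)


def init_weights(module: nn.Module) -> None:
    # Same rule as LLaMA._init_weights, written as a free function for the sketch
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=target_std)


torch.manual_seed(0)
toy = nn.Sequential(nn.Embedding(512, 256), nn.Linear(256, 256, bias=False))
toy.apply(init_weights)  # apply() calls init_weights on every submodule

measured_std = toy[1].weight.std().item()
print(target_std, measured_std)
```

For `n_layer = 32` the target is `0.02 / 8 = 0.0025`, and the measured value should land close to it.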

4. **Forward Pass**:
```python
    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        ...
```
The forward method defines how input data is processed through the model to produce an output. It processes the input tensor, passes it through the transformer blocks, and eventually through the language model head to produce the logits.<br>

Here, `idx` has shape `(B, T)`: a batch of token ids that have not yet been converted into embeddings.<br>
`_, t = idx.size()`  : Gets the sequence length.

```python
assert (
    t <= self.config.block_size
), f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
```
This checks that the input sequence is no longer than the maximum sequence length, i.e. `self.config.block_size`.


``x = self.transformer.wte(idx) `` This converts the input of shape `(B, T)` to `(B, T, n_embd)`.
```python
for block in self.transformer.h:
    x = block(x)
x = self.transformer.ln_f(x)
```
This passes the embeddings through the `n_layer` transformer blocks and then applies the final RMSNorm. I think this is the most interesting part of the entire code.
We will dive deeper into it next.

As discussed above `logits = self.lm_head(x)  # (b, t, vocab_size)` maps from embeddings to the vocabulary size, which is used for predicting the next word/token.



5. **Load Model by Name**:
```python
    @classmethod
    def from_name(cls, name: str) -> Self:
        return cls(LLaMAConfig.from_name(name))
```
This class method allows for creating a LLaMA model instance directly using a name, assuming the `LLaMAConfig.from_name(name)` can produce the necessary configuration from the provided name.


In [5]:
class Block(nn.Module):
    def __init__(self, config: LLaMAConfig) -> None:
        super().__init__()
        self.rms_1 = RMSNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.rms_2 = RMSNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.rms_1(x))
        x = x + self.mlp(self.rms_2(x))
        return x


A transformer block typically consists of a self-attention mechanism followed by a feed-forward network. LLaMA introduces some variations, including RMSNorm applied before each sub-layer (pre-normalization).

**Forward Pass:**
- The input tensor x is first normalized using the first RMSNorm instance.
- After normalization, it is fed into the CausalSelfAttention. The result is added to the original tensor via a residual connection, a vital feature in deep networks for maintaining gradient flow.
- The tensor then undergoes the second RMSNorm normalization.
- The normalized output is processed by the MLP. As before, the result is added back to the tensor through a residual connection.
- The resulting tensor, which has the same shape as the input, is then returned.


The Block class captures the operations of a single transformer layer in LLaMA. With the role of RMSNorm already understood, it becomes clear how this block combines normalization, attention, and feed-forward operations to refine the representation at each layer. When stacked, these blocks build upon one another to give the LLaMA model its capabilities.
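
The pre-normalization plus residual pattern can be sketched with stand-in sub-layers. This is an illustration only: `ToyBlock` and `rms_norm` are hypothetical names, the two `nn.Linear` layers stand in for `CausalSelfAttention` and `MLP`, and the learnable RMSNorm scale is left out.

```python
import torch
import torch.nn as nn


def rms_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # RMSNorm without the learnable scale, to keep the sketch short
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


class ToyBlock(nn.Module):
    def __init__(self, n_embd: int) -> None:
        super().__init__()
        self.attn = nn.Linear(n_embd, n_embd, bias=False)  # stand-in for CausalSelfAttention
        self.mlp = nn.Linear(n_embd, n_embd, bias=False)   # stand-in for MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(rms_norm(x))  # normalize, transform, add back
        x = x + self.mlp(rms_norm(x))
        return x


torch.manual_seed(0)
block = ToyBlock(8)
x = torch.randn(2, 4, 8)
y = block(x)

# With both sub-layers zeroed, only the residual path remains
nn.init.zeros_(block.attn.weight)
nn.init.zeros_(block.mlp.weight)
y_identity = block(x)
print(y.shape, torch.allclose(y_identity, x))
```

Zeroing the sub-layers shows that the residual path carries the input through unchanged, which is why each block can be read as adding a refinement on top of its input.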

In [6]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config: LLaMAConfig) -> None:
        super().__init__()
        assert config.n_embd % config.n_head == 0

        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.block_size = config.block_size
        self.rope_cache: Optional[torch.Tensor] = None

        
        
        

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)

        head_size = C // self.n_head
        k = k.view(B, T, self.n_head, head_size)
        q = q.view(B, T, self.n_head, head_size)
        v = v.view(B, T, self.n_head, head_size)
        
        if self.rope_cache is None:
            # cache for future forward calls
            self.rope_cache = build_rope_cache(
                seq_len=self.block_size,
                n_elem=self.n_embd // self.n_head, 
                dtype=x.dtype,
                device=x.device,
            )

        
        q = apply_rope(q, self.rope_cache)
        k = apply_rope(k, self.rope_cache)

        k = k.transpose(1, 2)  # (B, nh, T, hs)
        q = q.transpose(1, 2)  # (B, nh, T, hs)
        v = v.transpose(1, 2)  # (B, nh, T, hs)


        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        #  att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        #  att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        #  att = F.softmax(att, dim=-1)
        #  y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)


        # efficient attention using Flash Attention CUDA kernels
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.c_proj(y)

        return y



Here comes the most interesting part of our LLM. Let's go through the code in detail.
1. **Initialization:**
- Here, we first ensure that the embedding size (n_embd) is divisible by the number of attention heads (n_head). This is necessary to equally distribute the embeddings across all heads.

- **The Key, Query, Value Projections**:<br>
    ```self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)```

    This transformation is designed to produce key, query, and value tensors, which are essential for the attention mechanism. Normally, you'd expect three separate linear transformations - one for each of key, query, and value. But here, they're combined into a single transformation for efficiency.


    Input: config.n_embd represents the embedding size of each token in the model.
    Output: 3 * config.n_embd might look a bit confusing initially, but it makes perfect sense once you understand the purpose. Since we're generating three different tensors (key, query, and value) and each has an embedding size of config.n_embd, the combined size is 3 * config.n_embd.
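
As a quick check of the claim that one fused projection is equivalent to three separate ones, the sketch below splits the fused weight matrix into three row blocks and compares the results. The sizes are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8
x = torch.randn(2, 3, n_embd)  # (B, T, n_embd)

# Fused projection, as in CausalSelfAttention
c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
q, k, v = c_attn(x).split(n_embd, dim=2)

# Three separate projections built from row blocks of the fused weight
w_q, w_k, w_v = c_attn.weight.split(n_embd, dim=0)  # each (n_embd, n_embd)
q_sep = x @ w_q.T
k_sep = x @ w_k.T
v_sep = x @ w_v.T

print(q.shape, torch.allclose(q, q_sep, atol=1e-6))
```

The fused version runs one large matrix multiplication instead of three smaller ones, which is the efficiency gain mentioned above.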


2. **Forward Pass**:

- The input tensor's dimensions are extracted, where:
  - `B` represents the batch size.
  - `T` stands for the sequence length.
  - `C` denotes the embedding dimensionality.
  
- The tensor `x` undergoes the `c_attn` transformation, splitting the result into query, key, and value tensors (`q, k, v`). 

- These tensors are then reshaped for multi-head attention. Essentially, the embedding dimensionality is divided among the number of attention heads.

- If the RoPE cache hasn't been built yet (i.e., `self.rope_cache is None`), it is constructed with the `build_rope_cache` function. As we already discussed, this cache is calculated for a single head and then applied to every head, which is why `n_elem=self.n_embd // self.n_head`: each token's embedding is split into `n_head` parts, and the cache is computed for the dimension of one head. This method is pretty much the same as the one we implemented before; we will discuss the differences later. The cache is then applied to the `q` and `k` tensors using `apply_rope`, which is also pretty similar to our previous approach.

- The `q`, `k`, and `v` tensors are transposed to align them for the attention mechanism. Can you tell why we perform this transformation?
After transposing, each tensor has shape `(B, nh, T, hs)`. If we now compute `q @ k.transpose(-2, -1)`, the key's last two dimensions are swapped, so for every batch element and head the result has shape `(T, T)`. This `(T, T)` matrix tells us, for each token, how strongly it relates to every other token. So the transpose is what lets us compute the attention matrix per head.


- The main action happens in the causal self-attention mechanism. Normally, one would compute attention scores by multiplying `q` and `k`, apply a mask for causality, then use this to weight the `v` tensor. Here, however, the code calls `F.scaled_dot_product_attention` with `is_causal=True`, which computes the same result and can dispatch to a FlashAttention kernel when one is available for the inputs. FlashAttention is an algorithm that speeds up attention and reduces its memory footprint without any approximation.
You can read more about FlashAttention [Here](https://crfm.stanford.edu/2023/07/17/flash2.html#:~:text=FlashAttention%20is%20an%20algorithm%20that,to%20linear%20in%20sequence%20length.), [Here](https://www.adept.ai/blog/flashier-attention), [Here](https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad).

- The resultant tensor `y` is reshaped and then undergoes the output projection via the `c_proj` transformation.
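
The commented-out lines in the cell above describe the manual computation. The sketch below spells it out on toy tensors and compares it with `F.scaled_dot_product_attention`, which requires PyTorch 2.0 or newer. It builds the causal mask with `torch.tril` instead of the `self.bias` buffer referenced in the comments, which is an assumption about what that buffer holds.

```python
import math

import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, nh, T, hs = 1, 2, 5, 4  # toy sizes
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Manual causal attention
att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)               # (B, nh, T, T)
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True where attention is allowed
att = att.masked_fill(~causal_mask, float("-inf"))            # hide future positions
att = F.softmax(att, dim=-1)
y_manual = att @ v                                            # (B, nh, T, hs)

# Fused kernel
y_fused = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=True)

print(torch.allclose(y_manual, y_fused, atol=1e-5))
```

After the softmax, every row of `att` puts zero weight on later positions, so the first token attends only to itself.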


In [7]:

class MLP(nn.Module):
    def __init__(self, config: LLaMAConfig) -> None:
        super().__init__()
        hidden_dim = 4 * config.n_embd
        n_hidden = int(2 * hidden_dim / 3)
        n_hidden = find_multiple(n_hidden, 256)

        self.c_fc1 = nn.Linear(config.n_embd, n_hidden, bias=False)
        self.c_fc2 = nn.Linear(config.n_embd, n_hidden, bias=False)
        self.c_proj = nn.Linear(n_hidden, config.n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.silu(self.c_fc1(x)) * self.c_fc2(x)
        x = self.c_proj(x)
        return x

In [8]:
# class RMSNorm(nn.Module):

#     def __init__(self, input_dim , eps = 1e-6) -> None:
#         super().__init__()

#         self.scale = nn.Parameter(torch.ones(input_dim))
#         self.eps = eps

#     def forward(self,x):
#         # RMS of input
#         rms = torch.rsqrt(torch.square(x).mean(dim=-1,keepdim=True) + self.eps)
#         # rescaling 
#         x  = x * rms
#         return x * self.scale


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization.

    Derived from https://github.com/bzhangGo/rmsnorm/blob/master/rmsnorm_torch.py. BSD 3-Clause License:
    https://github.com/bzhangGo/rmsnorm/blob/master/LICENSE.
    """

    def __init__(self, size: int, dim: int = -1, eps: float = 1e-5) -> None:
        super().__init__()
        self.scale = nn.Parameter(torch.ones(size))
        self.eps = eps
        self.dim = dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # NOTE: the original RMSNorm paper implementation is not equivalent
        # norm_x = x.norm(2, dim=self.dim, keepdim=True)
        # rms_x = norm_x * d_x ** (-1. / 2)
        # x_normed = x / (rms_x + self.eps)
        norm_x = torch.mean(x * x, dim=self.dim, keepdim=True)
        x_normed = x * torch.rsqrt(norm_x + self.eps)
        return self.scale * x_normed

In [9]:
def build_rope_cache(seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000) -> torch.Tensor:
    """Enhanced Transformer with Rotary Position Embedding.

    Derived from: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/
    transformers/rope/__init__.py. MIT License:
    https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license.
    """
    # $\Theta = {\theta_i = 10000^{-\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]}$
    theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=dtype, device=device) / n_elem))

    # Create position indexes `[0, 1, ..., seq_len - 1]`
    seq_idx = torch.arange(seq_len, dtype=dtype, device=device)

    # Calculate the product of position index and $\theta_i$
    idx_theta = torch.outer(seq_idx, theta).float()

    cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)

    # this is to mimic the behaviour of complex32, else we will get different results
    if dtype in (torch.float16, torch.bfloat16, torch.int8):
        cache = cache.half()
    return cache


def apply_rope(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # truncate to support variable sizes
    T = x.size(1)
    rope_cache = rope_cache[:T]

    # cast because the reference does
    xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
    # In our earlier implementation we used cos and sin as two separate tensors (e.g. computing x_rope and neg_half_x).
    rope_cache = rope_cache.view(1, xshaped.size(1), 1, xshaped.size(3), 2)
    x_out2 = torch.stack(
        [
            xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
            xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1],
        ],
        -1,
    )

    x_out2 = x_out2.flatten(3)
    return x_out2.type_as(x)



The `build_rope_cache` function is almost identical to the `build_cache` we implemented. Here the cos and sin values are calculated beforehand and stored together in the last dimension of the cache. Also, `build_rope_cache` has specific handling for certain data types: for `torch.float16`, `torch.bfloat16`, and `torch.int8` it casts the computed cache to half precision.
`build_cache` doesn't handle data types in this manner.
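
To see what the cache contains, here is a stripped-down sketch of the same computation in float32 only. The dtype and device handling of `build_rope_cache` is left out, and `toy_rope_cache` is a hypothetical name.

```python
import torch


def toy_rope_cache(seq_len: int, n_elem: int, base: int = 10000) -> torch.Tensor:
    # One frequency per pair of elements: theta_i = base^(-2i / n_elem)
    theta = 1.0 / (base ** (torch.arange(0, n_elem, 2).float() / n_elem))
    seq_idx = torch.arange(seq_len).float()   # positions 0 .. seq_len - 1
    idx_theta = torch.outer(seq_idx, theta)   # angle per (position, pair), shape (seq_len, n_elem / 2)
    # Last dimension holds (cos, sin) for each angle
    return torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)


cache = toy_rope_cache(seq_len=4, n_elem=4)
print(cache.shape)
print(cache[0])
```

The cache has shape `(seq_len, n_elem / 2, 2)`. At position 0 every angle is 0, so the stored values are `cos = 1` and `sin = 0`, which means the first token is left unrotated.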

`apply_rope` also applies the RoPE cache to the query and key, but there is a slight difference in how the transformation is applied. I'll explain what happens in this method in detail.

We have two tensors: `x` and `rope_cache`. 

Let's assume `x` is a 4D tensor with shape `(1, 4, 2, 4)`, i.e. `(B, T, n_head, head_size)`, and `rope_cache` is a 3D tensor with shape `(4, 2, 2)`, i.e. `(T, head_size / 2, 2)`.

```python
x = torch.tensor([[[[ 0,  1,  2,  3],
                    [ 4,  5,  6,  7]],

                   [[ 8,  9, 10, 11],
                    [12, 13, 14, 15]],

                   [[16, 17, 18, 19],
                    [20, 21, 22, 23]],

                   [[24, 25, 26, 27],
                    [28, 29, 30, 31]]]])
```

**Step 1 :** 

```python
T = x.size(1)

```

Here, `T` is simply the size of the second dimension of `x`, which is 4.

**Step 2 :** 

Next, we truncate `rope_cache` to the first `T` positions:

```python
rope_cache = rope_cache[:T]

```

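In this example the step changes nothing, because `rope_cache` already has size 4 in its first dimension. In the model, the cache is built for `block_size` positions, so this slice keeps only the entries for the `T` positions actually present.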

**Step 3:** 

Then, we reshape `x` so that its last dimension is split into pairs:

```python
xshaped = x.float().reshape(*x.shape[:-1], -1, 2)

```

This breaks down as:

1. Convert x into float: `x.float()`
2. Reshape it: For our tensor, this converts it from shape `(1, 4, 2, 4)` to `(1, 4, 2, 2, 2)`.

The dimensions of **`xshaped`**, with shape `(1, 4, 2, 2, 2)`, are:

- 1 batch (the outermost dimension)
- 4 sequence positions
- 2 heads
- 2 pairs per head
- 2 values in each pair (the innermost dimension)

For instance, before reshaping, the first 2x4 matrix in `x` (position 0, both heads) is:

```
0,  1,  2,  3
4,  5,  6,  7

```

After reshaping, each row of 4 values is split into 2 pairs, so the same block in `xshaped` has shape 2x2x2:

```
head 0:  (0, 1)  (2, 3)
head 1:  (4, 5)  (6, 7)

```

Next, we reshape the `rope_cache`:

```python
rope_cache = rope_cache.view(1, xshaped.size(1), 1, xshaped.size(3), 2)

```

This converts `rope_cache` from shape `(4, 2, 2)` to `(1, 4, 1, 2, 2)`. This reshaping is done to align the dimensions of `rope_cache` with `xshaped` for broadcasting during the subsequent operations.

**Step 4:** 

Then, we perform element-wise multiplication and subtraction/addition between the reshaped `x` and `rope_cache`:

```python
x_out2 = torch.stack(
    [
        xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
        xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1],
    ],
    -1,
)
```

This rotates each pair using the cosine and sine values from `rope_cache`. The resulting tensor `x_out2` has the same shape as `xshaped`, which is `(1, 4, 2, 2, 2)`; the final `x_out2.flatten(3)` merges the last two dimensions to restore the original shape `(1, 4, 2, 4)`. The arithmetic inside **`torch.stack`** is element-wise: each pair in **`xshaped`** is rotated with the cosine and sine at the corresponding position in **`rope_cache`**, which is broadcast across the batch and head dimensions.

**Breakdown:**

Given a 2D rotation matrix:

$$
R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}
$$

When you multiply this rotation matrix with a 2D vector $([x, y]^T)$, you get:

$$
R(\theta) \cdot \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x\cos(\theta) - y\sin(\theta) \\ x\sin(\theta) + y\cos(\theta) \end{bmatrix}
$$

Now, let's connect this to the operations in the code:

- The first component of the output,
$x' = x\cos(\theta) - y\sin(\theta)$, is given by:
`xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1]`
- The second component of the output,
$y' = x\sin(\theta) + y\cos(\theta)$, is given by:
`xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1]`

Where:

- `xshaped[..., 0]` corresponds to the x component (or the first value) of our vector.
- `xshaped[..., 1]` corresponds to the y component (or the second value) of our vector.
- `rope_cache[..., 0]` is the cosine of the rotation angle.
- `rope_cache[..., 1]` is the sine of the rotation angle.

The code is essentially applying this rotation to every pair of values in the tensor `xshaped` using the angles specified in `rope_cache`.

The `torch.stack(..., -1)` at the end stacks these computed values along the last dimension. After this operation, for every pair of x and y values in the original `xshaped`, you have their rotated counterparts stacked together in the resulting tensor.
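
One way to sanity-check the rotation is to compare it with complex multiplication: rotating the pair `(x, y)` by `θ` is the same as multiplying `x + iy` by `cos θ + i sin θ`. The sketch below does this on random toy tensors. The angles are arbitrary values standing in for `position * theta`, and all names are chosen for illustration.

```python
import torch

torch.manual_seed(0)
B, T, n_head, head_size = 1, 4, 2, 4
x = torch.randn(B, T, n_head, head_size)

# Arbitrary angles standing in for position * theta, one per (position, pair)
angles = torch.randn(T, head_size // 2)
cache = torch.stack([torch.cos(angles), torch.sin(angles)], dim=-1)  # (T, hs / 2, 2)

# Same steps as apply_rope
xshaped = x.reshape(B, T, n_head, head_size // 2, 2)
c = cache.view(1, T, 1, head_size // 2, 2)
rotated = torch.stack(
    [
        xshaped[..., 0] * c[..., 0] - xshaped[..., 1] * c[..., 1],
        xshaped[..., 1] * c[..., 0] + xshaped[..., 0] * c[..., 1],
    ],
    dim=-1,
)

# Reference: treat each pair as a complex number and multiply
x_complex = torch.view_as_complex(xshaped.contiguous())  # (B, T, n_head, hs / 2)
rot_complex = torch.view_as_complex(c.contiguous())      # (1, T, 1, hs / 2)
expected = torch.view_as_real(x_complex * rot_complex)   # back to (..., 2)

out = rotated.flatten(3)  # (B, T, n_head, head_size)
print(out.shape, torch.allclose(rotated, expected, atol=1e-6))
```

Because this is a pure rotation, the length of every pair is unchanged; only its direction depends on the position.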

## Inference 

For inference we will be using the pipeline provided by the lit-llama repo. It provides some helpful classes that can potentially speed up the loading and initialization of large models, especially when only parts of the model need to be accessed or when specific tensor initializations are desired. The code also seems to handle some advanced features like quantization and lazy loading of tensors.

Let's break down these classes:

1. **`EmptyInitOnDevice` class**:

   This class is a context manager that changes the behavior of tensor initialization to create tensors with uninitialized memory (or "empty tensors"). Additionally, it can set specific devices and data types for tensor initialization, and supports specific quantization modes. When this context is active, tensors are initialized without actually assigning them any initial values, making the initialization process faster in some scenarios.
   

2. **`NotYetLoadedTensor` class**:

   Represents a tensor that has not yet been loaded into memory. It is essentially a placeholder that can be transformed into an actual tensor when accessed or used in computations. This class can be especially useful when dealing with large datasets or models, as it allows for lazy loading of data, only loading tensors into memory when they're actually needed.

   
3. **`LazyLoadingUnpickler` class**:

   Custom unpickler for lazy loading. Pickling is the process of converting a Python object into a byte stream, and unpickling is the reverse operation. The idea here is to load tensors and related objects from the pickled format only when they're actually accessed or used.
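
The idea behind empty initialization can be illustrated with PyTorch's built-in `torch.nn.utils.skip_init`. This is not how lit-llama's `EmptyInitOnDevice` is implemented; it is only a sketch of the same principle, namely that random initialization is wasted work when the weights are about to be overwritten by a checkpoint.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Regular construction: allocates the weight, then runs the default random init
regular = nn.Linear(64, 64, bias=False)

# skip_init: allocates the weight but leaves its values uninitialized
empty = torch.nn.utils.skip_init(nn.Linear, 64, 64, bias=False)

# The uninitialized values are overwritten as soon as a state dict is loaded
empty.load_state_dict(regular.state_dict())

print(empty.weight.shape, torch.equal(empty.weight, regular.weight))
```

For a single small layer the saving is negligible, but for a model with billions of parameters skipping the random initialization can shorten load time noticeably.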
   


In [10]:
import sys
import time
import warnings
from pathlib import Path
from typing import Optional

import lightning as L
import torch

from tokenizer import  Tokenizer
from utils import EmptyInitOnDevice, lazy_load, llama_model_lookup


In [11]:
@torch.no_grad()
def generate(
    model: torch.nn.Module,
    idx: torch.Tensor,
    max_new_tokens: int,
    max_seq_length: int,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    eos_id: Optional[int] = None,
) -> torch.Tensor:
    """Takes a conditioning sequence (prompt) as input and continues to generate as many tokens as requested.

    The implementation of this function is modified from A. Karpathy's nanoGPT.

    Args:
        model: The model to use.
        idx: Tensor of shape (T) with indices of the prompt sequence.
        max_new_tokens: The number of new tokens to generate.
        max_seq_length: The maximum sequence length allowed.
        temperature: Scales the predicted logits by 1 / temperature
        top_k: If specified, only sample among the tokens with the k highest probabilities
        eos_id: If specified, stop generating any more token once the <eos> token is triggered
    """
    # create an empty tensor of the expected final shape and fill in the current tokens
    T = idx.size(0)
    T_new = T + max_new_tokens
    empty = torch.empty(T_new, dtype=idx.dtype, device=idx.device)
    empty[:T] = idx
    idx = empty

    # generate max_new_tokens tokens
    for t in range(T, T_new):
        # ignore the not-filled-yet tokens
        idx_cond = idx[:t]
        # if the sequence context is growing too long we must crop it at max_seq_length
        # (compare the current length t, not the prompt length T)
        idx_cond = idx_cond if t <= max_seq_length else idx_cond[-max_seq_length:]

        # forward
        logits = model(idx_cond.view(1, -1))
        logits = logits[0, -1] / temperature

        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[[-1]]] = -float("Inf")

        probs = torch.nn.functional.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)

        # concatenate the new generation
        idx[t] = idx_next

        # if <eos> token is triggered, return the output (stop generation)
        if idx_next == eos_id:
            return idx[:t + 1]  # include the EOS token

    return idx
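
The temperature and top-k steps inside `generate` can be tried in isolation on a toy logits vector. The values below are made up for illustration.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # toy scores for a 4-token vocabulary
temperature, top_k = 0.8, 2

scaled = logits / temperature                  # temperature < 1 sharpens the distribution
v, _ = torch.topk(scaled, min(top_k, scaled.size(-1)))
scaled[scaled < v[[-1]]] = -float("Inf")       # drop everything below the k-th largest logit
probs = torch.softmax(scaled, dim=-1)          # masked entries get probability 0
idx_next = torch.multinomial(probs, num_samples=1)

print(probs, idx_next)
```

With `top_k = 2`, only the two highest-scoring tokens keep a non-zero probability, so the sampled index is always one of those two.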


In [12]:
def main(
    prompt: str = "Hello, my name is",
    *,
    num_samples: int = 1,
    max_new_tokens: int = 50,
    top_k: int = 200,
    temperature: float = 0.8,
    checkpoint_path: Optional[Path] = None,
    tokenizer_path: Optional[Path] = None,
    quantize: Optional[str] = None,
) -> None:
    """Generates text samples based on a pre-trained LLaMA model and tokenizer.

    Args:
        prompt: The prompt string to use for generating the samples.
        num_samples: The number of text samples to generate.
        max_new_tokens: The number of generation steps to take.
        top_k: The number of top most probable tokens to consider in the sampling process.
        temperature: A value controlling the randomness of the sampling process. Higher values result in more random
            samples.
        checkpoint_path: The checkpoint path to load.
        tokenizer_path: The tokenizer path to load.
        quantize: Whether to quantize the model and using which method:
            ``"llm.int8"``: LLM.int8() mode,
            ``"gptq.int4"``: GPTQ 4-bit mode.
    """
    if not checkpoint_path:
        checkpoint_path = Path("./checkpoints/lit-llama/7B/lit-llama.pth")
    if not tokenizer_path:
        tokenizer_path = Path("./checkpoints/lit-llama/tokenizer.model")
    assert checkpoint_path.is_file(), checkpoint_path
    assert tokenizer_path.is_file(), tokenizer_path

    fabric = L.Fabric(devices=1)
    dtype = torch.bfloat16 if fabric.device.type == "cuda" and torch.cuda.is_bf16_supported() else torch.float32

    print("Loading model ...", file=sys.stderr)
    t0 = time.time()
    with lazy_load(checkpoint_path) as checkpoint:
        name = llama_model_lookup(checkpoint)

        with EmptyInitOnDevice(
                device=fabric.device, dtype=dtype, quantization_mode=quantize
        ):
            model = LLaMA.from_name(name)

        model.load_state_dict(checkpoint)
    print(f"Time to load model: {time.time() - t0:.02f} seconds.", file=sys.stderr)

    model.eval()
    model = fabric.setup_module(model)

    tokenizer = Tokenizer(tokenizer_path)
    encoded_prompt = tokenizer.encode(prompt, bos=True, eos=False, device=fabric.device)

    L.seed_everything(1234)
    for i in range(num_samples):
        t0 = time.perf_counter()
        y = generate(
            model,
            encoded_prompt,
            max_new_tokens,
            model.config.block_size,  # type: ignore[union-attr,arg-type]
            temperature=temperature,
            top_k=top_k,
        )
        t = time.perf_counter() - t0
        print('\n\n')
        print(tokenizer.decode(y))
        print('\n\n')
        print(f"Time for inference {i + 1}: {t:.02f} sec total, {max_new_tokens / t:.02f} tokens/sec", file=sys.stderr)
    if fabric.device.type == "cuda":
        print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB", file=sys.stderr)

In [13]:
main("Artificial Intelligence is the")

Loading model ...
Time to load model: 17.45 seconds.
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Global seed set to 1234





Artificial Intelligence is the ability of a computer to imitate intelligent behaviour without being programmed, such as learning in a self-directed way to do a specific task, and then not just repeating the task, but improving itself. This is different from Traditional Artificial Intelligence which is any





Time for inference 1: 1.41 sec total, 35.55 tokens/sec
Memory used: 13.52 GB
