# GPT

The first notebook was written from scratch; this one leverages existing libraries to experiment with actual training and inference.

I also added some improvements over the baseline model:
- Moved attention computation to the optimized `F.scaled_dot_product_attention`
- Moved `LayerNorm` to `RMSNorm`, which is the standard now
- Moved `GELU` to `SwiGLU`
- Moved positional encoding to RoPE on Q, K
- Disabled bias in every linear layer
- Fused the Q, K, V projections into a single linear layer
- Used `MUON`
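
To make the `LayerNorm` to `RMSNorm` swap concrete, here is a minimal pure-Python sketch of what RMSNorm computes on a single vector. The helper name `rms_norm` and the `eps` value are assumptions for illustration; the notebook itself relies on `nn.RMSNorm`.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps is an assumed value
    return [w * v * inv_rms for v, w in zip(x, weight)]

x = [1.0, -2.0, 3.0, -4.0]
y = rms_norm(x, weight=[1.0] * len(x))
# Unlike LayerNorm, the mean is not subtracted and there is no bias term,
# so signs are preserved and only the overall scale changes.
print(y)
```

With a unit gain, the output has a root mean square of roughly 1 regardless of the input scale.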


In [1]:
# Standard library
import csv
import math
import multiprocessing
import os
import random
import time
from datetime import datetime

# Environment config
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Third-party
import numpy as np
from datasets import concatenate_datasets, load_dataset
from rotary_embedding_torch import RotaryEmbedding
from tokenizers import Tokenizer

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

# Torch runtime config
torch.set_float32_matmul_precision("medium")
torch.cuda.empty_cache()

In [2]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, causal: bool = True, dropout: float = 0.1):
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError(f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads}).")
        
        self.causal = causal
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout_p = dropout
        
        # Fused QKV projection: 3x the output size
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        
        self.rotary_emb = RotaryEmbedding(dim=self.head_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
    
    def forward(self, x, k_v_cache=None):
        B, T, _ = x.shape
        using_cache = k_v_cache is not None and "K" in k_v_cache
    
        # 1. Single fused projection
        if using_cache:
            x_q = x[:, -1:, :]
            qkv = self.qkv_proj(x_q)  # (B, 1, 3*embed_dim)
        else:
            qkv = self.qkv_proj(x)  # (B, T, 3*embed_dim)
        
        # 2. Split into Q, K, V
        Q, K, V = qkv.chunk(3, dim=-1)  # Each is (B, T, embed_dim), or (B, 1, embed_dim) when decoding with the cache
        
        def split_heads(t):
            return t.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        
        # 3. Split heads -> (B, H, T, D_head)
        Q = split_heads(Q)
        K = split_heads(K)
        V = split_heads(V)
    
        # 4. Apply RoPE 
        if using_cache:
            past_len = k_v_cache["K"].shape[-2]
            Q = self.rotary_emb.rotate_queries_or_keys(Q, offset=past_len)
            K = self.rotary_emb.rotate_queries_or_keys(K, offset=past_len)
            
            K = torch.cat([k_v_cache["K"], K], dim=-2)
            V = torch.cat([k_v_cache["V"], V], dim=-2)
        else:
            Q = self.rotary_emb.rotate_queries_or_keys(Q)
            K = self.rotary_emb.rotate_queries_or_keys(K)
    
        # 5. Update cache
        if k_v_cache is not None:
            k_v_cache["K"] = K
            k_v_cache["V"] = V
    
        # 6. Attention
        out = F.scaled_dot_product_attention(
            query=Q,
            key=K,
            value=V,
            attn_mask=None, 
            dropout_p=self.dropout_p if self.training else 0.0,
            is_causal=self.causal and (Q.shape[-2] > 1)
        )
        
        # 7. Merge heads
        out = out.transpose(1, 2).contiguous().view(B, -1, self.embed_dim)
    
        return self.out_proj(out), k_v_cache
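
The `offset=past_len` argument matters because RoPE encodes position as a rotation angle, and the attention score then depends only on the distance between query and key positions. The sketch below shows this on a single 2-D pair with one frequency. `rotate_pair`, `dot` and the value `theta=1.0` are hypothetical names and values for illustration, not part of `rotary_embedding_torch`.

```python
import math

def rotate_pair(vec, pos, theta=1.0):
    """Rotate a 2-D (x, y) pair by the angle pos * theta, as RoPE does per frequency."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.7, 0.5)

# Same relative distance (3) at different absolute positions: same score.
score_near = dot(rotate_pair(q, 5), rotate_pair(k, 2))
score_far = dot(rotate_pair(q, 105), rotate_pair(k, 102))

# Cached decoding without an offset: the single new query would be rotated
# as position 0 while the cached key keeps position 2, so the score changes.
score_no_offset = dot(rotate_pair(q, 0), rotate_pair(k, 2))

print(score_near, score_far, score_no_offset)
```

The first two scores agree up to floating-point error, while the third differs, which is the mismatch the offset avoids.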

In [3]:
class MLP(nn.Module):
    def __init__(self, embed_dim, hidden_dim=None, dropout_prob=0.1, use_swiglu=True):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = 4 * embed_dim
        
        self.use_swiglu = use_swiglu
        
        if use_swiglu:
            # SwiGLU has three weight matrices instead of two, so scale hidden_dim by 2/3 to keep the parameter count the same
            hidden_dim = int(2 * hidden_dim / 3)
            self.gate_proj = nn.Linear(embed_dim, hidden_dim, bias=False)
            self.up_proj = nn.Linear(embed_dim, hidden_dim, bias=False)
            self.down_proj = nn.Linear(hidden_dim, embed_dim, bias=False)
        else:
            self.linear1 = nn.Linear(embed_dim, hidden_dim, bias=False)
            self.act = nn.GELU()
            self.linear2 = nn.Linear(hidden_dim, embed_dim, bias=False)
        
        self.dropout = nn.Dropout(dropout_prob)
    
    def forward(self, x):
        if self.use_swiglu:
            return self.dropout(self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x)))
        else:
            return self.dropout(self.linear2(self.act(self.linear1(x))))
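
As a quick check of the 2/3 sizing rule, the arithmetic below counts the weights of both MLP variants. `mlp_weight_count` is a hypothetical helper that mirrors the sizing logic of the cell above without building any layers.

```python
def mlp_weight_count(embed_dim, mlp_ratio=4, use_swiglu=True):
    """Count weights in the bias-free MLP, mirroring the sizing rule of the MLP class."""
    hidden_dim = mlp_ratio * embed_dim
    if use_swiglu:
        hidden_dim = int(2 * hidden_dim / 3)  # shrink to offset the third matrix
        return 3 * embed_dim * hidden_dim     # gate_proj + up_proj + down_proj
    return 2 * embed_dim * hidden_dim         # linear1 + linear2

gelu_weights = mlp_weight_count(768, use_swiglu=False)
swiglu_weights = mlp_weight_count(768, use_swiglu=True)
print(gelu_weights, swiglu_weights)
```

For `embed_dim = 768` the two counts match exactly; for sizes where `2 * hidden_dim / 3` is not an integer, the SwiGLU variant ends up slightly smaller because of the truncation.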

In [4]:
class TransformerBlock(nn.Module):
    def __init__(self,
                 embed_dim,
                 num_heads,
                 mlp_ratio=4,
                 dropout_prob=0.1,
                 causal=True,
                 use_swiglu=True,
                ): 
        """
        Initialize a complete transformer block.
        
        APPROACH:
        1. 1st normalization (pre-norm architecture)
        2. Multi-head self-attention for sequence modeling
        3. 2nd normalization
        4. MLP with specified expansion ratio
    
        TRANSFORMER BLOCK ARCHITECTURE:
        x → Norm → MultiHeadAttention → + (residual) →
            Norm → MLP → + (residual) → output
    
        NB: We use pre-norm architecture (before attention/MLP)
        """
    
        super().__init__()
        self.norm1 = nn.RMSNorm(embed_dim)
        self.mha = MultiHeadAttention(embed_dim, num_heads, causal, dropout_prob)  # causal=True masks out future tokens
        self.norm2 = nn.RMSNorm(embed_dim)
        self.mlp = MLP(embed_dim, mlp_ratio * embed_dim, dropout_prob, use_swiglu)
    
    def forward(self, x, cache=None):
        x1 = self.norm1(x)
        x2, cache = self.mha(x1, cache)  # will be used when generating tokens during inference
        x2 = x2 + x  # residual path
    
        x3 = self.norm2(x2)
        x3 = self.mlp(x3) + x2  # residual path
        return x3, cache

In [5]:
class GPT(nn.Module):
    """
    Complete GPT (Generative Pre-trained Transformer) model.

    This combines embeddings, positional encoding, multiple transformer blocks,
    and a language modeling head for text generation.
    """

    def __init__(self,
                 vocab_size,
                 embed_dim,
                 num_layers,
                 num_heads,
                 mlp_ratio=4,
                 dropout_prob=0.1,
                 use_swiglu=True,
                ):
        """
        Initialize complete GPT model.
        """
        super().__init__()

        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.mlp_ratio = mlp_ratio

        self.embedding = nn.Embedding(self.vocab_size, self.embed_dim)
        self.dropout = nn.Dropout(dropout_prob)
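        # use_swiglu must be passed by keyword: the fifth positional argument of TransformerBlock is `causal`
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout_prob, use_swiglu=use_swiglu)
            for _ in range(num_layers)
        ])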
        self.norm = nn.RMSNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
        # weight tying
        self.lm_head.weight = self.embedding.weight

        # below shamelessly borrowed from nanoGPT
        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        # (nanoGPT calls them `c_proj`; in this model they are `out_proj`, `down_proj` and `linear2`)
        for pn, p in self.named_parameters():
            if pn.endswith(('out_proj.weight', 'down_proj.weight', 'linear2.weight')):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * self.num_layers))

        # report number of parameters
        print("Number of parameters: %.2fM" % (self.get_num_params()/1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For the non-embedding count (default), the token embedding matrix is subtracted.
        Because of weight tying, that matrix is also the weight of `lm_head`, so the
        default count excludes the output projection as well. There are no position
        embeddings to subtract, since positions are handled by RoPE.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.embedding.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
       
    def forward(self, tokens):
        embeddings = self.embedding(tokens)
        x = self.dropout(embeddings)
        for b in self.blocks:
            x, _ = b(x)  # iteratively refines features from initial embeddings
        features = self.norm(x)  # normalized to stabilize training
        return self.lm_head(features)

    @property
    def device(self):
        return next(self.parameters()).device

    @torch.no_grad()
    def generate(self,
                 prompt_tokens,
                 max_new_tokens=50,
                 temperature=1.0,
                 use_cache=True,
                 use_top_k=False,
                ):
        self.eval()

        tokens_out = prompt_tokens.clone()
        current_tokens = prompt_tokens.clone()
        tokens_out = tokens_out.to(self.device)
        current_tokens = current_tokens.to(self.device)
        cache = [{} if use_cache else None for _ in range(len(self.blocks))]
        
        for _ in range(max_new_tokens):

            x = self.embedding(current_tokens)
            for i, b in enumerate(self.blocks):
                x, c_i = b(x, cache[i])
                cache[i] = c_i
            
            features = self.norm(x)
            logits = self.lm_head(features)
                    
            last_logits = logits[:, -1, :]
    
            if temperature > 0:
                scaled_logits = last_logits / temperature
                # Only sample from the top k tokens so that one garbage token does not derail the whole generation
                # We don't simply take the max-probability token, to allow some "creativity"
                if use_top_k:
                    # heuristic that is OK for a toy project:
                    # most of the probability mass is on a small number of tokens
                    k = min(max(5, int(0.01 * self.vocab_size)), 100)
                    values, indices = torch.topk(scaled_logits, k)
                    scaled_logits = torch.full_like(scaled_logits, float('-inf'))
                    scaled_logits.scatter_(1, indices, values)
                probs = torch.softmax(scaled_logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
            else:
                # Greedy decoding if temp is 0 (prevents division by zero)
                next_token = torch.argmax(last_logits, dim=-1, keepdim=True)
    
            tokens_out = torch.cat([tokens_out, next_token], dim=1)

            # If caching, we only need to feed the newest token next time, otherwise full sequence
            current_tokens = next_token if use_cache else tokens_out
       
        return tokens_out
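
The temperature and top-k logic in `generate` can be traced by hand on a tiny logit vector. The following is a simplified pure-Python sketch, not the tensor code above: `top_k_for_vocab` and `top_k_probs` are hypothetical helpers, and ties at the cutoff are handled more loosely than `torch.topk` would.

```python
import math
import random

def top_k_for_vocab(vocab_size):
    """Same heuristic as in `generate`: about 1% of the vocabulary, clamped to [5, 100]."""
    return min(max(5, int(0.01 * vocab_size)), 100)

def top_k_probs(logits, temperature, k):
    """Temperature-scale the logits, keep the k largest, and softmax the survivors."""
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    # Entries below the k-th largest value get -inf, i.e. probability 0.
    # (Exact ties at the cutoff are all kept in this simplified version.)
    masked = [s if s >= cutoff else float("-inf") for s in scaled]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_probs(logits, temperature=0.9, k=2)
next_token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(top_k_for_vocab(50257), probs, next_token)
```

With the GPT-2 vocabulary the heuristic hits its upper clamp, so sampling is restricted to the 100 most likely tokens.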

## Full Training on Wikipedia

Note: we re-use the tokenizer defined at the beginning.

In [6]:
#### CONFIG #####

# Basically GPT-2 Small
block_size = 1024  # 512 for faster convergence then 1024 to finish training
batch_size = 16
embed_dim = 768
num_layers = 12
num_heads = 12
dropout_prob = 0.1
mlp_ratio = 4  # standard 4x expansion


# Training
MAX_STEPS = 500000       # Total number of micro-batches to process
GRAD_ACCUM_STEPS = 40    # Accumulate gradients over 40 batches
LOG_INTERVAL = 500       # Log every 500 micro-batches
num_workers = 4          # For data loading
prefetch = 4
dtype = torch.bfloat16
device = "cuda"
model_path = f"gpt_model_{block_size}_final.pt"  # where the trained model is saved
print("torch.cuda.is_bf16_supported()", torch.cuda.is_bf16_supported())

torch.cuda.is_bf16_supported() True
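
To get a feel for what these settings imply, the arithmetic below derives the token budget from the config values above. The variable names on the left-hand side are introduced here for illustration only.

```python
block_size, batch_size = 1024, 16
MAX_STEPS, GRAD_ACCUM_STEPS = 500_000, 40

# Tokens processed by one forward/backward pass
tokens_per_micro_batch = batch_size * block_size
# Tokens contributing to one optimizer update (the effective batch)
tokens_per_update = tokens_per_micro_batch * GRAD_ACCUM_STEPS
# Number of optimizer updates and total tokens over the whole run
total_updates = MAX_STEPS // GRAD_ACCUM_STEPS
total_tokens = tokens_per_micro_batch * MAX_STEPS

print(f"{tokens_per_micro_batch:,} tokens per micro-batch")
print(f"{tokens_per_update:,} tokens per optimizer update")
print(f"{total_updates:,} optimizer updates, {total_tokens:,} tokens in total")
```

These totals assume `block_size` stays at 1024 for the whole run; a phase at 512 would halve the tokens per micro-batch for that phase.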


In [7]:
# from huggingface_hub import login
# login()  # faster dl


# --- 1. Setup Tokenizer ---
tokenizer = Tokenizer.from_pretrained("GPT2")
eot_id = tokenizer.token_to_id("<|endoftext|>")
assert eot_id is not None

# --- 2. Load and Normalize Datasets ---
# We need to make sure all datasets have a 'text' column and nothing else
def clean_columns(ds):
    # Some datasets name their text column 'line'; rename it to 'text'
    if 'line' in ds.column_names:
        ds = ds.rename_column('line', 'text')
    # Keep only the 'text' column to ensure concatenation works
    return ds.select_columns(['text'])

print("Loading datasets...")

# Train List
train_raw = [
    load_dataset("wikitext", "wikitext-2-raw-v1", split="train"),
    load_dataset("tiny_shakespeare", split="train"),
    load_dataset("bookcorpus", split="train[:10%]"),
    load_dataset("openwebtext", split="train[:20%]"),
]
# Clean and Concatenate
train_ds = concatenate_datasets([clean_columns(d) for d in train_raw])

# Validation List
val_raw = [
    load_dataset("tiny_shakespeare", split="validation"),
    load_dataset("wikitext", "wikitext-2-raw-v1", split="validation"),
]
val_ds = concatenate_datasets([clean_columns(d) for d in val_raw])

print(f"Raw Train Size: {len(train_ds)} rows")

# --- 3. Optimized Processing Pipeline ---

def process_batch(examples):
    """
    1. Tokenizes text.
    2. Appends EOT token to EVERY document.
    3. Flattens into a 1D stream.
    4. Chunks into block_size + 1 (to allow for shifting).
    """
    all_token_ids = []
    
    # 1. Tokenize and add EOT (Document Boundary)
    # Note: We use tokenizer.encode_batch for speed if possible, 
    # but here we loop to append EOT easily.
    for text in examples["text"]:
        # Skip empty strings
        if not text.strip():
            continue
        
        # Encode
        ids = tokenizer.encode(text).ids
        
        # Append EOT (Crucial for GPT context separation)
        ids.append(eot_id)
        all_token_ids.extend(ids)
    
    # 2. Chunking
    # We need chunks of length (block_size + 1)
    chunk_len = block_size + 1
    
    # Truncate remainder
    total_len = (len(all_token_ids) // chunk_len) * chunk_len
    
    # Reshape into list of lists
    chunks = [
        all_token_ids[i : i + chunk_len] 
        for i in range(0, total_len, chunk_len)
    ]
    
    # Return dict for HF Dataset
    return {"chunk_ids": chunks}

# Apply the processing
# We remove 'text' immediately to free up RAM.
# num_proc uses multiple CPU cores to tokenize faster.
print("Tokenizing and chunking (this may take a moment)...")
train_tokenized = train_ds.map(
    process_batch, 
    batched=True, 
    batch_size=1000, 
    num_proc=multiprocessing.cpu_count(),
    remove_columns=train_ds.column_names,
    desc="Processing Train"
)

val_tokenized = val_ds.map(
    process_batch, 
    batched=True, 
    batch_size=1000, 
    num_proc=max(1, multiprocessing.cpu_count() // 2),
    remove_columns=val_ds.column_names,
    desc="Processing Val"
)

Loading datasets...



Raw Train Size: 9039896 rows
Tokenizing and chunking (this may take a moment)...
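
The packing strategy used by `process_batch` is easier to see on toy data. The sketch below applies the same steps (append EOT, flatten, cut into `block_size + 1` chunks, drop the remainder) to already-tokenized documents. `chunk_stream` and the toy token ids are made up for illustration.

```python
def chunk_stream(docs, eot_id, block_size):
    """Concatenate token lists with an EOT separator and cut into block_size + 1 chunks."""
    stream = []
    for ids in docs:
        if not ids:  # skip empty documents
            continue
        stream.extend(ids)
        stream.append(eot_id)  # document boundary
    chunk_len = block_size + 1  # one extra token so inputs and targets can be shifted
    usable = (len(stream) // chunk_len) * chunk_len  # the tail that does not fill a chunk is dropped
    return [stream[i:i + chunk_len] for i in range(0, usable, chunk_len)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = chunk_stream(docs, eot_id=0, block_size=4)
x, y = chunks[0][:-1], chunks[0][1:]  # input and shifted target, as in the dataset class
print(chunks, x, y)
```

Note that chunks can start or end in the middle of a document, and that the remainder is dropped per batch of rows, so a small fraction of tokens never reaches training.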


In [8]:
class GPTDataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset):
        self.ds = hf_dataset

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        # Retrieve the chunk of size BLOCK_SIZE + 1
        chunk = self.ds[idx]["chunk_ids"]
        
        # Convert to tensor
        data = torch.tensor(chunk, dtype=torch.long)

        # Shift target, the model needs to learn token_t -> token_t+1
        x = data[:-1]
        y = data[1:]
        
        return x, y


# Set format to NumPy (optional, but good practice for speed)
train_tokenized.set_format(type="numpy", columns=["chunk_ids"])
val_tokenized.set_format(type="numpy", columns=["chunk_ids"])

train_dataset = GPTDataset(train_tokenized)
val_dataset = GPTDataset(val_tokenized)

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True, # Shuffle chunks
    num_workers=num_workers,
    prefetch_factor=prefetch,
    pin_memory=True
)

val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    prefetch_factor=prefetch,
    pin_memory=True
)

# --- Verification ---
print("-" * 20)
print(f"Final Train Batches: {len(train_loader)}")
x, y = next(iter(train_loader))
print(f"Input x shape: {x.shape}")  # Should be [Batch, Block_Size]
print(f"Target y shape: {y.shape}") # Should be [Batch, Block_Size]

print("\nSanity Check (Shifting):")
print(f"x[0, -5:]: {x[0, -5:].tolist()}") # End of input
print(f"y[0, -5:]: {y[0, -5:].tolist()}") # End of target
del x
del y

--------------------
Final Train Batches: 117275
Input x shape: torch.Size([16, 1024])
Target y shape: torch.Size([16, 1024])

Sanity Check (Shifting):
x[0, -5:]: [355, 326, 3290, 11, 290]
y[0, -5:]: [326, 3290, 11, 290, 6451]


In [14]:
# --- Model & Optimizer ---
from pprint import pprint
model_config = {
    "vocab_size": tokenizer.get_vocab_size(),
    "embed_dim": embed_dim,
    "num_layers": num_layers,
    "num_heads": num_heads,
    "mlp_ratio": mlp_ratio,
    "dropout_prob": dropout_prob,
    "use_swiglu": True,
}

print("Initializing model with config : ")
pprint(model_config)

model = GPT(**model_config).to(device)

Initializing model with config : 
{'dropout_prob': 0.1,
 'embed_dim': 768,
 'mlp_ratio': 4,
 'num_heads': 12,
 'num_layers': 12,
 'use_swiglu': True,
 'vocab_size': 50257}
Number of parameters: 84.95M
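
The reported figure can be reproduced by hand from the layer shapes. The sketch below is an approximation under stated assumptions: `count_params` is a hypothetical helper, it assumes SwiGLU and bias-free layers as configured above, and it ignores any small frequency parameters or buffers the rotary-embedding library may register.

```python
def count_params(vocab_size, embed_dim, num_layers, mlp_ratio=4):
    """Return (non-embedding weights, token-embedding weights) for the model above."""
    hidden = int(2 * mlp_ratio * embed_dim / 3)               # SwiGLU hidden size
    attn = 3 * embed_dim * embed_dim + embed_dim * embed_dim  # qkv_proj + out_proj
    mlp = 3 * embed_dim * hidden                              # gate_proj + up_proj + down_proj
    norms = 2 * embed_dim                                     # two RMSNorm gains per block
    body = num_layers * (attn + mlp + norms) + embed_dim      # plus the final RMSNorm
    embedding = vocab_size * embed_dim                        # shared with lm_head (weight tying)
    return body, embedding

body, embedding = count_params(vocab_size=50257, embed_dim=768, num_layers=12)
print(f"non-embedding: {body / 1e6:.2f}M, embedding: {embedding / 1e6:.2f}M")
```

The non-embedding total rounds to the 84.95M printed by the model, which confirms that the printed number leaves out the tied embedding / output matrix.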


In [15]:
model = torch.compile(model)   # can take a while
model.train()
print("Model compiled !")

Model compiled !


In [12]:
# ckpt_path = "ckpts/gpt_model_1024_final.pt"

# state_dict = torch.load(ckpt_path, map_location=device)
# model.load_state_dict(state_dict, strict=False)

_IncompatibleKeys(missing_keys=['_orig_mod.blocks.0.norm1.weight', '_orig_mod.blocks.0.mha.qkv_proj.weight', '_orig_mod.blocks.0.norm2.weight', '_orig_mod.blocks.0.mlp.gate_proj.weight', '_orig_mod.blocks.0.mlp.up_proj.weight', '_orig_mod.blocks.0.mlp.down_proj.weight', ... [output truncated]
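
The missing keys above all start with `_orig_mod.`, the prefix that appears in the `state_dict` of a model wrapped by `torch.compile`. One plausible cause (an assumption, since the truncated output does not confirm it) is that the checkpoint and the model disagree on that prefix. A minimal sketch of a workaround is to normalize the key names and load them into the uncompiled model before compiling it; `strip_compile_prefix` is a hypothetical helper, shown here on a plain dict standing in for a real checkpoint.

```python
def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict whose keys no longer carry the torch.compile prefix."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Stand-in for a checkpoint saved from a compiled model (values would be tensors)
saved = {
    "_orig_mod.blocks.0.norm1.weight": "tensor-0",
    "_orig_mod.blocks.0.mha.qkv_proj.weight": "tensor-1",
    "embedding.weight": "tensor-2",
}
cleaned = strip_compile_prefix(saved)
print(sorted(cleaned))
```

Keys without the prefix pass through unchanged, so the helper is safe to apply to checkpoints saved either way.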

In [17]:
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
# Calculate how many actual updates we will do (for the scheduler)

total_optim_steps = MAX_STEPS // GRAD_ACCUM_STEPS
print(f"Total Micro-batches: {MAX_STEPS}")
print(f"Gradient Accumulation: {GRAD_ACCUM_STEPS}")
print(f"Total Optimizer Updates: {total_optim_steps}")


# T_max is now based on actual optimizer updates, not total loops
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_optim_steps, eta_min=1e-5)

# --- CSV Logger ---
log_file = "training_log.csv"
file_exists = os.path.isfile(log_file)
with open(log_file, "a", newline="") as f:
    writer = csv.writer(f)
    if not file_exists:
        writer.writerow(["micro_step", "optim_step", "loss", "lr", "tokens_seen", "tokens_per_sec", "timestamp"])

# --- Training Loop ---
micro_step = 0      # Counts every batch seen
optim_step = 0      # Counts every weight update
tokens_seen = 0
running_loss = 0.0
start_time = time.time()
start_training = time.time()

# Initialize gradients once before starting
optimizer.zero_grad(set_to_none=True)


# Train until MAX_STEPS micro-batches have been processed (this may take several passes over train_loader)
while micro_step < MAX_STEPS:
    for x, y in train_loader:
        
        x, y = x.to(device), y.to(device)
        B, T = x.shape
        tokens_seen += B * T

        # 1. Forward
        with torch.autocast(device_type="cuda", dtype=dtype):
            logits = model(x)
            loss = loss_fn(logits.view(-1, logits.size(-1)), y.view(-1))

        # 2. Scale Loss for Backward (but keep original for logging!)
        current_loss_val = loss.item() 
        scaled_loss = loss / GRAD_ACCUM_STEPS
        
        # 3. Backward
        scaled_loss.backward()

        # 4. Step (only every GRAD_ACCUM_STEPS micro-steps)
        if (micro_step + 1) % GRAD_ACCUM_STEPS == 0:
            # avoids exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            scheduler.step()
            optim_step += 1

        # 5. Bookkeeping
        running_loss += current_loss_val
        micro_step += 1

        # 6. Logging
        if micro_step % LOG_INTERVAL == 0:
            elapsed = time.time() - start_time
            avg_loss = running_loss / LOG_INTERVAL
            tokens_per_sec = (B * T * LOG_INTERVAL) / elapsed
            current_lr = optimizer.param_groups[0]['lr']
            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

            print(
                f"step {micro_step:06d} | "
                f"opt_step {optim_step:04d} | "
                f"loss {avg_loss:.3f} | "
                f"lr {current_lr:.2e} | "
                f"{tokens_per_sec:,.0f} tok/s"
            )

            try:
                with open(log_file, "a", newline="") as f:
                    writer = csv.writer(f)
                    writer.writerow([micro_step, optim_step, f"{avg_loss:.4f}", f"{current_lr:.2e}", tokens_seen, int(tokens_per_sec), timestamp])
            except Exception as e:
                print(f"CSV Error: {e}")

            running_loss = 0.0
            start_time = time.time()

        if micro_step % 60000 == 0:
            mid_model_path = model_path.replace(".pt", f"_{micro_step}.pt")
            print(f"Saving intermediate model in {mid_model_path}")
            torch.save(model.state_dict(), mid_model_path)
        
        if micro_step >= MAX_STEPS:
            elapsed = int(time.time() - start_training)
            h = elapsed // 3600
            m = (elapsed % 3600) // 60
            s = elapsed % 60
            print(f"\nProcessed {tokens_seen:,} tokens in {h:02d}:{m:02d}:{s:02d}")
            print(f"Saving final model in {model_path}")
            torch.save(model.state_dict(), model_path)
            break

Total Micro-batches: 500000
Gradient Accumulation: 40
Total Optimizer Updates: 12500
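
With 12,500 optimizer updates, the learning rate follows a single cosine decay from 3e-4 down to 1e-5. The closed form below is the standard cosine annealing formula evaluated with the values from the cells above. `cosine_lr` is a hypothetical helper, and the sketch assumes no warmup, as none is configured in this notebook.

```python
import math

def cosine_lr(step, t_max=12_500, base_lr=3e-4, eta_min=1e-5):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

for step in (0, 3_125, 6_250, 9_375, 12_500):
    print(f"optim_step {step:>6}: lr = {cosine_lr(step):.2e}")
```

The schedule is indexed by optimizer updates, not micro-batches, which is why `T_max` is `MAX_STEPS // GRAD_ACCUM_STEPS`.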


In [30]:
x, y = val_dataset[32]

In [31]:
print(tokenizer.decode(x.tolist()))

atres . Phase I of the construction of Meridian Crossroads , a 375 @,@ 000 @-@ square @-@ foot ( 34 @,@ 800 m2 ) shopping center in the Bonita Lakes area , was completed in November 2007 , providing a major boost to retail in the area . Also , the shopping district on North Hills Street has continued to expand , and in March 2007 , additional retail and office space was opened near the Highway 19 Walmart Supercenter . 
 The area is also served by two military facilities , Naval Air Station Meridian and Key Field , which supply over 4 @,@ 000 jobs to residents of the surrounding area . NAS Meridian provides training for naval carrier pilots and other enlisted personnel . Also housed at the base is the Regional Counter @-@ Drug Training Academy ( RCTA ) , which provides narcotics training for law enforcement in many southeastern states . Containing the first local Department of Homeland Security in the state , the city is the leader in a nine county regional response team and a twenty @-

In [37]:
prompt = "The 2021 Masters (officially the 2021 Betfred Masters) was a professional non-ranking snooker tournament that took place from 10 to 17 January 2021 at the Marshall Arena in Milton Keynes, England"
x = torch.tensor(tokenizer.encode(prompt).ids)

In [38]:
model.to("cuda")
out = model.generate(
    x.unsqueeze(0).to("cuda"),
    max_new_tokens=250,
    temperature=0.9,
    use_cache=True,
    use_top_k=True,
)

print("\nOutput : ", tokenizer.decode(out[0].tolist()))


Output :  The 2021 Masters (officially the 2021 Betfred Masters) was a professional non-ranking snooker tournament that took place from 10 to 17 January 2021 at the Marshall Arena in Milton Keynes, England.

The competition was scheduled from 1 April to 12 April, with the rest of the UK qualifying for an additional round to 1 April, with the prize money divided between £10,000 and £100,000.

The £10,000 was to be split into £10,000 and £100,000, respectively.

The event was contested on 3 May-June 2021 at the Milton Keynes Stadium in Milton Keynes, England

The draw gave the North London Women a total of £858,000 to be sold, with a total prize of £2,300,000 added to the £50,000 prize money

The final draw was for the second round of the men’s tournament in Boston, England. They have been on the run since the World Championships, but have not been on the Tour.

The winner of the bid is the runner up and will be invited to attend the competition.

(Image: Getty)

The winner of the third

## TO DO : 

- [x] RoPE for Q, K
- [x] Top k sampling / Temperature
- [x] K / V cache
- [x] Add stop token / EOS handling
- [x] Training on a real problem to see how far we can push current model
- [ ] Clean up / Revisit markdown / maths
- [ ] Explore hyper connections and manifold constrained HC
- [ ] Check newer architectures / design choices (https://github.com/lucidrains is a gold mine)