# 🚀 Training Pipeline: SlimPajama (Baseline Transformer)

This notebook sets up a standard training pipeline for a language model.

**Goals:**
1. Load and preprocess the SlimPajama data.
2. Configure the default Transformer model (GPT-2) as a baseline (as requested).
3. Integrate evaluation with `Torchmetrics`.
4. Log results to TensorBoard.
5. Automatically commit code and results to GitHub.
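
The notebook itself does not implement the auto-commit step from goal 5. Below is a minimal sketch of one way to do it, assuming the notebook runs inside a git working tree with a remote and credentials already configured. The helper names `build_commit_commands` and `commit_results` and the commit message are hypothetical, not part of the original pipeline.

```python
import subprocess
from typing import Callable, List, Sequence


def build_commit_commands(paths: Sequence[str], message: str) -> List[List[str]]:
    """Return the git commands that stage, commit, and push the given paths."""
    return [
        ["git", "add", *paths],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def commit_results(
    paths: Sequence[str],
    message: str,
    run: Callable = subprocess.run,
) -> None:
    """Run each git command in order; `run` can be swapped out for a dry run."""
    for cmd in build_commit_commands(paths, message):
        # check=True raises CalledProcessError if a git command fails
        run(cmd, check=True)


# Show the commands without executing them.
for cmd in build_commit_commands(["./logs"], "Add TensorBoard logs"):
    print(" ".join(cmd))
```

Calling `commit_results(["./logs"], "Add TensorBoard logs")` after training would stage, commit, and push the log directory. `git commit` exits with a non-zero status when there is nothing to commit, which `check=True` turns into an exception. Checkpoints under `./results_slimpajama` are probably too large for a regular git push, so this sketch only targets the logs.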

In [1]:
import os
import torch
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoConfig, 
    AutoModelForCausalLM, 
    AutoTokenizer, 
    Trainer, 
    TrainingArguments,
    DataCollatorForLanguageModeling
)
import torchmetrics
from torchmetrics.text import Perplexity

## 2. Model Configuration (Default Transformer)

Per the requirement to "use the default Transformer", we use the standard GPT-2 configuration.
To run the **Holo-Transformer** later, you only need to import `HoloConfig` from the uploaded `configuration_holo.py` file.
Note that the recorded run below has `USE_HOLO_MODEL = True`, so its output and metrics come from the Holo model, not the GPT-2 baseline.

In [2]:
# --- MODEL CONFIG SELECTION ---
USE_HOLO_MODEL = True  # True: use Holo; False: use the default Transformer (GPT-2)

if USE_HOLO_MODEL:
    # Import the Holo config from the uploaded code
    import sys
    sys.path.append("./long-attention")  # Directory containing the model code
    from model.configuration_holo import HoloConfig
    from model.modeling_holo import HoloForCausalLM
    
    config = HoloConfig(
        vocab_size=50257,
        hidden_size=768,
        hd_dim=1024,
        num_hidden_layers=12
    )
    model = HoloForCausalLM(config)
    print("Using Holo-Transformer Config")
    
else:
    # Requirement: "use the default Transformer"
    # Use the GPT-2 small architecture as the baseline
    config = AutoConfig.from_pretrained("gpt2")
    # Randomly initialize the model (no pre-trained weights, so training starts from scratch)
    model = AutoModelForCausalLM.from_config(config)
    print("Using Default Transformer (GPT-2 Architecture) initialized from scratch")

# Cast the model to BFloat16 if the GPU supports it (e.g. L40S, A100)
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    model = model.to(dtype=torch.bfloat16)
    print("Model casted to BFloat16")

Using Holo-Transformer Config
Model casted to BFloat16


## 3. Data Pipeline: SlimPajama

We use `DKYoon/SlimPajama-6B` (a smaller version of SlimPajama) so the pipeline can be tested quickly. The data is streamed to save RAM.

In [3]:
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load dataset (streaming mode to avoid running out of RAM)
dataset_name = "DKYoon/SlimPajama-6B"
ds_train = load_dataset(dataset_name, split="train", streaming=True)
ds_val = load_dataset(dataset_name, split="validation", streaming=True)

# Preprocessing function
block_size = 1024  # Context window

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=block_size)

# Apply tokenization on the fly
tokenized_train = ds_train.map(tokenize_function, batched=True, remove_columns=["text", "meta"])
tokenized_val = ds_val.map(tokenize_function, batched=True, remove_columns=["text", "meta"])

# Shuffle and take a small sample to test the pipeline (optional)
train_dataset = tokenized_train.shuffle(seed=42).take(10000)  # Take 10k training samples
eval_dataset = tokenized_val.take(1000)  # Take 1k eval samples

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]


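## 4. Metric Evaluation with Torchmetrics

Requirement: "write a metric evaluation function using Torchmetrics".
The `compute_metrics` function below turns the model's predictions into an Accuracy metric. Perplexity is not computed inside it, because the logits are reduced to token indices before they reach the function; it is derived from the Trainer's `eval_loss` as PPL = exp(eval_loss).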
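
A rough estimate of why eval outputs are reduced to token ids before they are cached: `Trainer` accumulates the per-batch outputs over the whole eval set before calling `compute_metrics`, so full logits would dominate memory. The sketch below assumes float32 logits and int64 indices, and treats every sequence as a full `block_size` tokens, so the numbers are upper bounds, not measurements.

```python
# Sizes taken from this notebook.
VOCAB_SIZE = 50257     # vocab_size in the model config
SEQ_LEN = 1024         # block_size (max tokens per sample)
EVAL_SAMPLES = 1000    # number of eval samples taken

# Assumptions about storage dtypes.
BYTES_FLOAT32 = 4      # assumed dtype of cached logits
BYTES_INT64 = 8        # dtype of indices returned by argmax

# One sample of full logits: [SEQ_LEN, VOCAB_SIZE] floats.
logits_bytes = SEQ_LEN * VOCAB_SIZE * BYTES_FLOAT32
# One sample of argmax predictions: [SEQ_LEN] integers.
ids_bytes = SEQ_LEN * BYTES_INT64

print(f"logits per sample: {logits_bytes / 1e6:.1f} MB")
print(f"ids per sample:    {ids_bytes / 1e3:.1f} KB")
print(f"logits, full eval: {EVAL_SAMPLES * logits_bytes / 1e9:.1f} GB")
print(f"ids, full eval:    {EVAL_SAMPLES * ids_bytes / 1e6:.1f} MB")
```

Under these assumptions the reduction factor is `VOCAB_SIZE * 4 / 8`, roughly 25,000x, which is why the argmax is applied per batch instead of inside `compute_metrics`.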

In [4]:
# --- NEW FUNCTION TO SAVE MEMORY ---
def preprocess_logits_for_metrics(logits, labels):
    """
    Runs right after each eval forward step.
    Converts the logits (large float tensors) into predictions
    (small integer tensors) before they are cached in memory.
    """
    if isinstance(logits, tuple):
        # The Holo model may return a tuple; take the first element
        logits = logits[0]

    # Keep only the index with the highest probability.
    # Shape shrinks from [Batch, Seq, 50257] to [Batch, Seq]
    return logits.argmax(dim=-1)

def compute_metrics(eval_pred):
    # preds are no longer logits here; they are the token indices returned by
    # preprocess_logits_for_metrics above
    preds, labels = eval_pred

    # Convert to tensors
    preds = torch.tensor(preds)
    labels = torch.tensor(labels)

    # Align predictions with next-token targets.
    # The loss computation shifts internally, but the preds and labels received
    # here are unshifted, so for a causal LM the prediction at position t must be
    # compared with the label at position t+1.

    # Drop the last token of preds and the first token of labels
    shift_preds = preds[..., :-1]
    shift_labels = labels[..., 1:]

    # Compute accuracy, ignoring positions labeled -100
    accuracy_metric = torchmetrics.Accuracy(task="multiclass", num_classes=config.vocab_size, ignore_index=-100)
    acc = accuracy_metric(shift_preds.reshape(-1), shift_labels.reshape(-1))

    # Perplexity is not computed here because the logits are no longer available.
    # Read "eval_loss" from the Trainer log instead:
    # PPL = exp(eval_loss)
    
    return {
        "accuracy": acc.item()
    }
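
The shift used for accuracy can be checked on a toy sequence. This is a plain-Python sketch of the same alignment (prediction at position t against label at position t+1, skipping labels equal to `-100`); the helper `shifted_accuracy` and the token ids are made up for illustration and do not replace the torchmetrics call.

```python
IGNORE_INDEX = -100  # same ignore_index value as in compute_metrics


def shifted_accuracy(preds, labels, ignore_index=IGNORE_INDEX):
    """Accuracy of preds[t] against labels[t + 1], skipping ignored labels."""
    shift_preds = preds[:-1]    # drop the last prediction (no target for it)
    shift_labels = labels[1:]   # drop the first label (nothing predicts it)
    pairs = [(p, y) for p, y in zip(shift_preds, shift_labels) if y != ignore_index]
    if not pairs:
        return 0.0
    return sum(p == y for p, y in pairs) / len(pairs)


# Toy token ids; the last label is padding and must be ignored.
labels = [5, 7, 9, 11, -100]
# Predictions are right at positions 0 and 1, wrong at position 2.
preds = [7, 9, 3, 4, 0]

print("shifted accuracy:", shifted_accuracy(preds, labels))

# Comparing position-by-position without the shift scores the same model at zero.
unshifted_hits = sum(p == y for p, y in zip(preds, labels) if y != IGNORE_INDEX)
print("unshifted hits:", unshifted_hits)
```

Without the shift, a model that predicts every next token correctly would still look wrong, because each prediction is compared against the token it was conditioned on.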

## 5. Training Setup & TensorBoard

Configure `TrainingArguments` to write logs to TensorBoard.

In [8]:
# --- FIX MEMORY FRAGMENTATION ---
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

output_dir = "./results_slimpajama"

training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="steps",
    eval_steps=20,
    logging_steps=5,
    save_steps=200,
    save_safetensors=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1, 
    gradient_accumulation_steps=16,
    gradient_checkpointing=False,       
    learning_rate=3e-4,
    max_steps=1000,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    report_to="tensorboard",
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # --- THE MOST IMPORTANT LINE TO ADD ---
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
    # --------------------------------------
)
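
The settings above imply a training and evaluation budget that is easy to work out by hand. This is a back-of-the-envelope sketch using the values from this notebook; `NUM_DEVICES = 1` is an assumption (the notebook does not state the GPU count), and token counts are upper bounds because samples are truncated to `block_size` but not padded or packed to it.

```python
# Values copied from TrainingArguments and the data cell.
PER_DEVICE_TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 16
MAX_STEPS = 1000
EVAL_STEPS = 20
BLOCK_SIZE = 1024
TRAIN_SAMPLES = 10_000
EVAL_SAMPLES = 1_000
NUM_DEVICES = 1  # assumption: single-GPU run

# Sequences consumed per optimizer step.
sequences_per_step = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
# Upper bound on tokens per optimizer step.
max_tokens_per_step = sequences_per_step * BLOCK_SIZE
# Sequences needed to finish all optimizer steps.
sequences_needed = sequences_per_step * MAX_STEPS
# How many times the sampled training subset would be traversed.
passes_over_train = sequences_needed / TRAIN_SAMPLES
# Number of evaluation runs and their total forward passes (eval batch size 1).
num_evals = MAX_STEPS // EVAL_STEPS
eval_forward_passes = num_evals * EVAL_SAMPLES

print("sequences per optimizer step:", sequences_per_step)
print("max tokens per optimizer step:", max_tokens_per_step)
print("sequences needed for max_steps:", sequences_needed)
print("passes over the training subset:", passes_over_train)
print("evaluation runs:", num_evals)
print("eval forward passes in total:", eval_forward_passes)
```

Two things stand out under these assumptions. A full run needs more sequences than the 10k-sample subset holds, so the subset would be seen more than once. Evaluating every 20 steps on 1,000 samples also costs more forward passes than training itself, so a larger `eval_steps` or a smaller eval subset may be worth considering for quick pipeline tests.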

In [9]:
# Start training
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 20 | 119.1092 | 7.389917 | 0.090895 |
| 40 | 116.2071 | 7.359068 | 0.092387 |
| 60 | 115.4666 | 7.360652 | 0.091974 |
| 80 | 115.5847 | 7.348478 | 0.093795 |
| 100 | 115.6245 | 7.331895 | 0.09199 |
| 120 | 114.4422 | 7.282728 | 0.094743 |
| 140 | 113.0482 | 7.231789 | 0.096722 |
| 160 | 112.7335 | 7.187241 | 0.099374 |
| 180 | 109.4619 | 7.155865 | 0.101095 |
| 200 | 113.0164 | 7.12731 | 0.102683 |
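
Since perplexity is read off the validation loss as PPL = exp(eval_loss), the training log above can be converted directly. The sketch below copies the (step, validation loss) pairs from that log; the variable names are illustrative.

```python
import math

# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (20, 7.389917),
    (40, 7.359068),
    (60, 7.360652),
    (80, 7.348478),
    (100, 7.331895),
    (120, 7.282728),
    (140, 7.231789),
    (160, 7.187241),
    (180, 7.155865),
    (200, 7.12731),
]

# Perplexity is the exponential of the mean per-token cross-entropy loss.
perplexities = [math.exp(loss) for _, loss in eval_log]

for (step, loss), ppl in zip(eval_log, perplexities):
    print(f"step {step:>3}: eval_loss={loss:.4f}  ppl={ppl:.1f}")
```

Separately, the logged training loss is about 16 times the validation loss (for example 119.1092 / 16 ≈ 7.44), which matches `gradient_accumulation_steps=16`. One possible explanation is that the logged value is summed over accumulation steps instead of averaged, but the notebook does not confirm this.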


KeyboardInterrupt: 