# Person A-v2: Linear Projector (Optimized Version)

**Experiments**:
- E1-v2: Linear Projector + LoRA + optimizations

**Key changes**:
- Uses a single-layer **Linear Projector** (~4.2M params)
- LR scheduler (warmup + cosine decay)
- Gradient clipping (max_norm=1.0)
- Diversity monitoring
- Vision feature caching

**Goal**: Check whether a simple projector can be trained without mode collapse
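
The warmup + cosine decay schedule named in the list above can be sketched as a plain function. This is only an illustration of the curve's shape, assuming a linear warmup followed by a single half-cosine decay to zero; the helper name `warmup_cosine_multiplier` and the step counts are hypothetical, and the notebook itself relies on `get_cosine_schedule_with_warmup` from transformers.

```python
import math

def warmup_cosine_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier in [0, 1]: linear warmup, then a half-cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Illustrative numbers only: 100 optimizer steps with a 10% warmup.
for step in (0, 5, 10, 55, 100):
    print(step, round(warmup_cosine_multiplier(step, 10, 100), 3))
```

The actual learning rate at a step is the base rate (for example `stage1_lr`) multiplied by this value.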

---

**Supports checkpoint-based resume**

## 1. Environment Setup (always run when the runtime starts)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# ============================================
# Path settings
# ============================================
DRIVE_ROOT = "/content/drive/MyDrive/mutsa-02"

# Data path (aihub_splitted sits directly under mutsa-02)
DATA_PATH = f"{DRIVE_ROOT}/aihub_splitted"

# Results path (siglip_study/results3), used for the v2/v3 experiments
RESULTS_DIR = f"{DRIVE_ROOT}/korean_video_captioning/siglip_study/results3"

import os
os.makedirs(RESULTS_DIR, exist_ok=True)

print(f"{'='*60}")
print(f"DATA_PATH:    {DATA_PATH}")
print(f"RESULTS_DIR:  {RESULTS_DIR}")
print(f"{'='*60}")

if os.path.exists(DATA_PATH):
    train_n = len(os.listdir(f"{DATA_PATH}/train")) if os.path.exists(f"{DATA_PATH}/train") else 0
    val_n = len(os.listdir(f"{DATA_PATH}/val")) if os.path.exists(f"{DATA_PATH}/val") else 0
    print(f"Data found! Train: {train_n}, Val: {val_n}")
else:
    print(f"WARNING: Data path not found!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
DATA_PATH:    /content/drive/MyDrive/mutsa-02/aihub_splitted
RESULTS_DIR:  /content/drive/MyDrive/mutsa-02/korean_video_captioning/siglip_study/results3
Data found! Train: 5, Val: 4


In [None]:
!pip install -q "transformers>=4.40.0" accelerate bitsandbytes peft
!pip install -q torch torchvision
!pip install -q av decord opencv-python pillow
!pip install -q tqdm matplotlib pandas
!pip install -q evaluate bert_score nltk

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
import json, os, random
import numpy as np
from PIL import Image
from tqdm import tqdm
import pandas as pd
from datetime import datetime
from typing import List, Dict, Optional
import cv2

from transformers import (
    AutoModel, AutoProcessor, AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, CLIPVisionModel, CLIPImageProcessor,
    get_cosine_schedule_with_warmup,  # [OPT] LR Scheduler
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

print(f"PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} ({torch.cuda.get_device_properties(0).total_memory/1e9:.1f}GB)")

PyTorch: 2.9.0+cu126, CUDA: True
GPU: NVIDIA A100-SXM4-80GB (85.2GB)


## 2. Configuration

In [None]:
CONFIG = {
    "vision_encoder": "openai/clip-vit-large-patch14-336",
    "llm": "Qwen/Qwen3-8B",
    "siglip_model": "google/siglip2-so400m-patch14-384",
    "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05,

    # ============================================
    # [v2] Linear Projector settings
    # ============================================
    "stage1_epochs": 2,
    "stage1_lr": 1e-3,
    "stage2_epochs": 3,
    "stage2_lr": 5e-5,

    # ============================================
    # [OPT] Additional optimizations
    # ============================================
    "warmup_ratio": 0.1,
    "max_grad_norm": 1.0,

    "batch_size": 4, "gradient_accumulation": 2,
    "num_frames": 8, "max_length": 768, "seed": 42,
    "data_path": DATA_PATH,
    "results_dir": RESULTS_DIR,
    "prompt": "이 영상을 자세히 설명해주세요.",
    "max_new_tokens": 128,
    "repetition_penalty": 1.2,
}

def set_seed(seed):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
set_seed(CONFIG["seed"])
print("Config loaded (Linear Projector + Optimizations)")
print(f"  Projector: Linear (~4M params)")

Config loaded (Linear Projector + Optimizations)
  Projector: Linear (~4M params)


## 3. Core Class Definitions

In [None]:
# SigLIP2 Evaluator (Multilingual - Korean supported)
class SigLIPEvaluator:
    def __init__(self, model_name="google/siglip2-so400m-patch14-384", device="cuda"):
        print(f"Loading SigLIP2: {model_name}")
        self.model = AutoModel.from_pretrained(model_name).to(device)
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model.eval(); self.device = device

    @torch.no_grad()
    def compute_score(self, frames, caption):
        if not frames or not caption: return None
        try:
            inputs = self.processor(
                text=[caption], images=frames, return_tensors="pt", padding=True,
                truncation=True, max_length=64
            ).to(self.device)
            return torch.sigmoid(self.model(**inputs).logits_per_image).mean().item()
        except Exception as e:
            print(f"SigLIP error: {e}")
            return None

    def evaluate_batch(self, samples):
        scores, errors = [], 0
        for s in tqdm(samples, desc="SigLIP"):
            score = self.compute_score(s["frames"], s["caption"])
            if score is not None: scores.append(score)
            else: errors += 1
        return {
            "siglip_score": np.mean(scores) if scores else 0.0,
            "siglip_std": np.std(scores) if scores else 0.0,
            "num_samples": len(scores), "num_errors": errors
        }

print("SigLIPEvaluator defined.")

SigLIPEvaluator defined.
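
`compute_score` applies a sigmoid to each frame's image-text logit and averages the results over the frames. A minimal sketch of that aggregation in plain Python, using made-up logits (the helper name `mean_sigmoid` is hypothetical and not part of the notebook):

```python
import math

def mean_sigmoid(logits):
    """Average of per-frame sigmoid probabilities, mirroring sigmoid(...).mean()."""
    return sum(1.0 / (1.0 + math.exp(-x)) for x in logits) / len(logits)

# Made-up logits for an 8-frame clip; real values come from the SigLIP2 model.
frame_logits = [-2.0, -1.5, 0.0, 0.5, 1.0, -0.5, 0.0, 2.0]
print(round(mean_sigmoid(frame_logits), 4))
```

Because the sigmoid is applied before averaging, one strongly matching frame cannot push the clip score above 1, and frames with very negative logits contribute close to 0.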


In [None]:
# Text Metrics Evaluator (METEOR, BERTScore)
class TextMetricsEvaluator:
    def __init__(self):
        import evaluate
        import nltk
        nltk.download('wordnet', quiet=True)
        nltk.download('punkt', quiet=True)
        nltk.download('omw-1.4', quiet=True)
        self.meteor = evaluate.load("meteor")
        self.bertscore = evaluate.load("bertscore")
        print("TextMetricsEvaluator ready")

    def compute_scores(self, predictions: list, references: list) -> dict:
        if not predictions or not references:
            return {"meteor": 0.0, "bertscore_f1": 0.0}
        meteor_result = self.meteor.compute(predictions=predictions, references=references)
        bert_result = self.bertscore.compute(predictions=predictions, references=references, lang="ko")
        return {
            "meteor": meteor_result["meteor"],
            "bertscore_precision": np.mean(bert_result["precision"]),
            "bertscore_recall": np.mean(bert_result["recall"]),
            "bertscore_f1": np.mean(bert_result["f1"]),
        }

print("TextMetricsEvaluator defined.")

TextMetricsEvaluator defined.


In [None]:
# Linear Projector
class LinearProjector(nn.Module):
    """Simple linear-projection Projector (about 4.2M params)."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, x):
        return self.proj(x)

def create_projector(projector_type, vision_dim=1024, llm_dim=4096, config=None):
    if projector_type == "linear":
        return LinearProjector(vision_dim, llm_dim)
    raise ValueError(f"Unknown: {projector_type}")

print(f"Linear Projector params: {sum(p.numel() for p in LinearProjector().parameters()):,}")

Linear Projector params: 4,198,400
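
The printed count is the weight matrix plus the bias of one linear layer. The sketch below reproduces that arithmetic and also estimates how many vision tokens one clip contributes, assuming the 336-pixel, patch-14 encoder and the 8 frames from `CONFIG`, with the CLS token dropped as in `encode_video`. Both helper names are hypothetical.

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Weights plus bias of a single linear layer."""
    return in_dim * out_dim + out_dim

def vision_tokens(image_size: int, patch_size: int, num_frames: int) -> int:
    """Patch tokens per clip once the CLS token of every frame is dropped."""
    return (image_size // patch_size) ** 2 * num_frames

print(f"{linear_params(1024, 4096):,}")   # projector parameters
print(vision_tokens(336, 14, 8))          # vision tokens prepended to the text
```

Under these assumptions each clip adds 4,608 positions in front of the text tokens, which is worth keeping in mind when choosing `max_length` and `batch_size`.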


In [None]:
# Custom VLM (supports vision feature caching)
class CustomVLM(nn.Module):
    def __init__(self, vision_encoder, projector, llm, tokenizer):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm
        self.tokenizer = tokenizer
        for p in self.vision_encoder.parameters(): p.requires_grad = False
        self.vision_encoder.eval()

    def encode_video(self, frames):
        with torch.no_grad():
            features = self.vision_encoder(pixel_values=frames).last_hidden_state[:, 1:, :]
        return features.reshape(-1, features.size(-1))

    def forward_with_cache(self, vision_features, input_ids, attention_mask, labels=None):
        """[OPT] Forward pass with cached vision features (skips the vision encoder)."""
        batch_size, device = vision_features.size(0), vision_features.device
        all_vision = [self.projector(vision_features[i]) for i in range(batch_size)]
        text_embeds = self.llm.get_input_embeddings()(input_ids)

        combined_e, combined_a, combined_l = [], [], []
        for i in range(batch_size):
            v_len = all_vision[i].size(0)
            combined_e.append(torch.cat([all_vision[i], text_embeds[i]], dim=0))
            combined_a.append(torch.cat([torch.ones(v_len, device=device), attention_mask[i]], dim=0))
            if labels is not None:
                combined_l.append(torch.cat([torch.full((v_len,), -100, device=device, dtype=labels.dtype), labels[i]], dim=0))

        max_len = max(e.size(0) for e in combined_e)
        pad_e = torch.zeros(batch_size, max_len, combined_e[0].size(-1), device=device)
        pad_a = torch.zeros(batch_size, max_len, device=device)
        pad_l = torch.full((batch_size, max_len), -100, device=device, dtype=torch.long) if labels is not None else None

        for i in range(batch_size):
            sl = combined_e[i].size(0)
            pad_e[i, :sl], pad_a[i, :sl] = combined_e[i], combined_a[i]
            if labels is not None: pad_l[i, :sl] = combined_l[i]

        return self.llm(inputs_embeds=pad_e, attention_mask=pad_a, labels=pad_l, return_dict=True)

    def forward(self, frames, input_ids, attention_mask, labels=None):
        # Encode every clip, then reuse the cached-feature path so the padding logic lives in one place
        vision_features = torch.stack([self.encode_video(frames[i]) for i in range(frames.size(0))])
        return self.forward_with_cache(vision_features, input_ids, attention_mask, labels)

    @torch.no_grad()
    def generate(self, frames, prompt, max_new_tokens=128):
        device = frames.device
        vision_embeds = self.projector(self.encode_video(frames))
        text_inputs = self.tokenizer(prompt, return_tensors="pt").to(device)
        text_embeds = self.llm.get_input_embeddings()(text_inputs.input_ids)
        combined = torch.cat([vision_embeds.unsqueeze(0), text_embeds], dim=1)
        outputs = self.llm.generate(
            inputs_embeds=combined, max_new_tokens=max_new_tokens, do_sample=False,
            repetition_penalty=1.2,
            pad_token_id=self.tokenizer.pad_token_id, eos_token_id=self.tokenizer.eos_token_id
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

print("CustomVLM defined (with cache support).")

CustomVLM defined (with cache support).
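
The forward pass prepends the projected vision tokens to the text embeddings, extends the attention mask with ones, and fills the label positions of the vision tokens with -100 so that they do not contribute to the loss. A toy sketch of that bookkeeping with plain lists (the helper `build_sequence` and the token ids are hypothetical):

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def build_sequence(num_vision_tokens, text_mask, text_labels):
    """Prepend vision positions to one sample's attention mask and labels."""
    attention = [1] * num_vision_tokens + list(text_mask)
    labels = [IGNORE_INDEX] * num_vision_tokens + list(text_labels)
    return attention, labels

# Toy sample: 3 vision tokens, then 4 text positions (prompt, two caption tokens, padding).
attention, labels = build_sequence(3, [1, 1, 1, 0], [IGNORE_INDEX, 17, 42, IGNORE_INDEX])
print(attention)
print(labels)
```

Only the caption tokens keep real label ids; vision, prompt, and padding positions are all ignored by the loss.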


In [None]:
# Dataset with Vision Features Caching
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0: cap.release(); return []
    frames = []
    for idx in np.linspace(0, total-1, num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret: frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

class CachedVideoDataset(Dataset):
    """[OPT] Dataset that pre-computes and caches vision features.

    The vision encoder is frozen, so the same frames always produce the same output.
    Computing the features once avoids re-running the vision encoder every epoch.
    """

    def __init__(self, data_path, split, vision_encoder, image_processor, tokenizer,
                 num_frames=8, max_length=512, max_samples=None, prompt="이 영상을 자세히 설명해주세요.", device="cuda"):
        self.data_path, self.split = Path(data_path), split
        self.image_processor, self.tokenizer = image_processor, tokenizer
        self.num_frames, self.max_length = num_frames, max_length
        self.prompt = prompt
        self.device = device
        self.samples = self._load(max_samples)
        self.prompt_len = len(tokenizer(prompt, add_special_tokens=False).input_ids)

        # Vision feature cache
        self.vision_cache = {}
        self.pil_cache = {}  # PIL frames kept for evaluation
        print(f"\n[CACHE] Pre-computing vision features for {len(self.samples)} {split} samples...")
        vision_encoder.eval()
        with torch.no_grad():
            for idx in tqdm(range(len(self.samples)), desc=f"Caching {split}"):
                s = self.samples[idx]
                frames = extract_frames(s["video_path"], self.num_frames)
                if not frames:
                    frames = [np.zeros((336, 336, 3), dtype=np.uint8)] * self.num_frames
                while len(frames) < self.num_frames: frames.append(frames[-1].copy())
                frames = frames[:self.num_frames]

                pil_frames = [Image.fromarray(f) for f in frames]
                self.pil_cache[idx] = pil_frames

                pixel_values = self.image_processor(images=pil_frames, return_tensors="pt").pixel_values.to(device)
                features = vision_encoder(pixel_values=pixel_values).last_hidden_state[:, 1:, :]
                # Move to CPU to save GPU memory
                self.vision_cache[idx] = features.reshape(-1, features.size(-1)).cpu()

        print(f"[CACHE] Done! Cached {len(self.vision_cache)} samples.")
        print(f"[CACHE] Feature shape: {self.vision_cache[0].shape}")

    def _load(self, max_samples):
        samples = []
        label_dir = self.data_path / self.split / "labels"
        video_dir = self.data_path / self.split / "videos"
        if not label_dir.exists(): return samples
        for lf in sorted(label_dir.glob("*.json")):
            try:
                with open(lf, "r", encoding="utf-8") as f: label = json.load(f)
                vp = video_dir / (lf.stem + ".mp4")
                if vp.exists() and (cap := label.get("annotation", {}).get("description_kr", "")):
                    samples.append({"video_path": str(vp), "caption": cap})
            except Exception as e: print(f"[WARN] Skipping {lf.name}: {e}")
            if max_samples and len(samples) >= max_samples: break
        return samples

    def __len__(self): return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        vision_features = self.vision_cache[idx]  # use the cached features
        pil_frames = self.pil_cache[idx]

        full_text = f"{self.prompt} {s['caption']}"
        ti = self.tokenizer(full_text, max_length=self.max_length, padding="max_length", truncation=True, return_tensors="pt")

        return {
            "vision_features": vision_features,
            "input_ids": ti.input_ids.squeeze(0),
            "attention_mask": ti.attention_mask.squeeze(0),
            "caption": s["caption"],
            "pil_frames": pil_frames,
            "prompt_len": self.prompt_len
        }

def create_cached_collate_fn(pad_token_id):
    def collate_fn(batch):
        labels = []
        for b in batch:
            label = b["input_ids"].clone()
            label[label == pad_token_id] = -100
            if "prompt_len" in b:
                label[:b["prompt_len"]] = -100
            labels.append(label)
        return {
            "vision_features": torch.stack([b["vision_features"] for b in batch]),
            "input_ids": torch.stack([b["input_ids"] for b in batch]),
            "attention_mask": torch.stack([b["attention_mask"] for b in batch]),
            "labels": torch.stack(labels),
            "captions": [b["caption"] for b in batch],
            "pil_frames": [b["pil_frames"] for b in batch],
        }
    return collate_fn

print("CachedVideoDataset defined.")

CachedVideoDataset defined.
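
Two pieces of arithmetic sit behind this cell: which frames are sampled from a video, and how much host memory each cached feature tensor takes. The sketch below uses integer arithmetic to pick evenly spaced frame indices (an approximation of the `np.linspace(..., dtype=int)` call, not a drop-in replacement) and estimates the cache size assuming float32 features of shape 4608 x 1024. Both helper names are hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """Evenly spaced frame indices from the first to the last frame."""
    if num_frames == 1:
        return [0]
    return [i * (total_frames - 1) // (num_frames - 1) for i in range(num_frames)]

def cache_bytes(tokens: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of one cached feature tensor, assuming float32 storage."""
    return tokens * dim * bytes_per_value

print(uniform_frame_indices(100, 8))
print(cache_bytes(4608, 1024) / 2**20, "MiB per sample")
```

At roughly 18 MiB per sample under these assumptions, the cache grows linearly with the dataset, so it is worth checking the sample count against available RAM before caching a full split.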


In [None]:
# Checkpoint Manager
class CheckpointManager:
    def __init__(self, exp_name, results_dir):
        self.exp_dir = Path(results_dir) / exp_name
        self.exp_dir.mkdir(parents=True, exist_ok=True)
        self.ckpt_dir = self.exp_dir / "checkpoints"
        self.ckpt_dir.mkdir(exist_ok=True)
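        self.logs = []
        self.best_score = 0.0
        log_file = self.exp_dir / "training_log.csv"
        if log_file.exists():
            self.logs = pd.read_csv(log_file).to_dict('records')
            # Restore the best score on resume so best_model.pt is not overwritten by a worse epoch
            prev = [r["siglip_score"] for r in self.logs if pd.notna(r.get("siglip_score"))]
            if prev: self.best_score = max(prev)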
        print(f"CheckpointManager: {self.exp_dir}")

    def is_completed(self): return (self.exp_dir / "final_metrics.json").exists()

    def get_resume_info(self):
        for e in range(10, 0, -1):
            p = self.ckpt_dir / f"stage2_epoch{e}_checkpoint.pt"
            if p.exists(): return {"stage": 2, "epoch": e, "path": p}
        p = self.ckpt_dir / "stage1_checkpoint.pt"
        if p.exists(): return {"stage": 1, "epoch": "done", "path": p}
        return None

    def log(self, m):
        m["timestamp"] = datetime.now().isoformat()
        self.logs.append(m)
        pd.DataFrame(self.logs).to_csv(self.exp_dir / "training_log.csv", index=False)

    def save_checkpoint(self, model, stage, epoch, metrics, optimizer=None, scheduler=None):
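        ckpt = {"projector_state_dict": model.projector.state_dict(), "metrics": metrics, "stage": stage, "epoch": epoch}
        if stage == 2:
            # Also save the LoRA adapter weights; the projector alone is not enough to restore a stage-2 model
            from peft import get_peft_model_state_dict
            ckpt["lora_state_dict"] = get_peft_model_state_dict(model.llm)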
        if optimizer: ckpt["optimizer_state_dict"] = optimizer.state_dict()
        if scheduler: ckpt["scheduler_state_dict"] = scheduler.state_dict()
        path = self.ckpt_dir / ("stage1_checkpoint.pt" if stage == 1 else f"stage2_epoch{epoch}_checkpoint.pt")
        torch.save(ckpt, path)
        print(f"Saved: {path}")
        if (s := metrics.get("siglip_score", 0)) > self.best_score:
            self.best_score = s
            torch.save(ckpt, self.ckpt_dir / "best_model.pt")
            print(f"New best! Score: {s:.4f}")

    def save_final(self, m):
        with open(self.exp_dir / "final_metrics.json", "w") as f: json.dump(m, f, indent=2)
        print("Experiment completed!")

print("CheckpointManager defined.")

CheckpointManager defined.
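
The resume lookup prefers the latest stage-2 epoch checkpoint and falls back to the stage-1 checkpoint. The sketch below replays that search order against empty placeholder files in a temporary directory; `find_resume_point` is a hypothetical standalone version of `get_resume_info`, using the same file names.

```python
import tempfile
from pathlib import Path

def find_resume_point(ckpt_dir: Path, max_epoch: int = 10):
    """Return the newest stage-2 checkpoint, else the stage-1 one, else None."""
    for epoch in range(max_epoch, 0, -1):
        path = ckpt_dir / f"stage2_epoch{epoch}_checkpoint.pt"
        if path.exists():
            return {"stage": 2, "epoch": epoch, "path": path}
    path = ckpt_dir / "stage1_checkpoint.pt"
    if path.exists():
        return {"stage": 1, "epoch": "done", "path": path}
    return None

with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    empty = find_resume_point(ckpt_dir)          # nothing saved yet
    for name in ("stage1_checkpoint.pt", "stage2_epoch1_checkpoint.pt", "stage2_epoch2_checkpoint.pt"):
        (ckpt_dir / name).touch()                # placeholder files, not real checkpoints
    latest = find_resume_point(ckpt_dir)
    print(empty, latest["stage"], latest["epoch"])
```

Note that the search only looks at epochs up to `max_epoch`, so a run with more stage-2 epochs than that limit would need a larger value.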


In [None]:
# Logging helpers
import time

def get_gpu_memory():
    if torch.cuda.is_available():
        return {"allocated": torch.cuda.memory_allocated()/1e9, "total": torch.cuda.get_device_properties(0).total_memory/1e9}
    return {"allocated": 0, "total": 0}

def format_time(s):
    h, m, sec = int(s//3600), int((s%3600)//60), int(s%60)
    return f"{h}h{m}m{sec}s" if h else (f"{m}m{sec}s" if m else f"{sec}s")

def print_config(config, proj_type, train_n, val_n):
    mem = get_gpu_memory()
    print(f"\n{'-'*60}\nCONFIG: {proj_type} | Train: {train_n} | Val: {val_n}")
    print(f"Batch: {config['batch_size']}x{config['gradient_accumulation']} | Frames: {config['num_frames']} | MaxLen: {config['max_length']}")
    print(f"Stage1: {config['stage1_epochs']}ep (LR:{config['stage1_lr']}) | Stage2: {config['stage2_epochs']}ep (LR:{config['stage2_lr']})")
    print(f"Warmup: {config['warmup_ratio']*100:.0f}% | Grad Clip: {config['max_grad_norm']}")
    print(f"GPU: {mem['allocated']:.1f}GB / {mem['total']:.1f}GB\n{'-'*60}")

def compute_diversity(captions):
    """[OPT] 캡션 다양성 계산 (0.0 ~ 1.0). Mode Collapse 탐지용."""
    if not captions: return 0.0
    unique = len(set(captions))
    return unique / len(captions)

print("Helpers defined.")

Helpers defined.
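
A quick illustration of how the diversity ratio separates varied output from collapsed output. The function body repeats `compute_diversity` from the cell above so the snippet runs on its own; the captions are made up.

```python
def compute_diversity(captions):
    """Fraction of unique captions (0.0 to 1.0)."""
    if not captions:
        return 0.0
    return len(set(captions)) / len(captions)

varied = ["a dog runs", "a cat sleeps", "a man cooks", "a dog runs"]
collapsed = ["a person is standing"] * 4   # same caption for every video

print(compute_diversity(varied))
print(compute_diversity(collapsed))
```

The collapsed list falls below the 0.5 threshold that `evaluate_model` uses for its warning. The ratio only counts exact string matches, so near-duplicate captions that differ by one word still count as unique.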


In [None]:
# Training functions with optimizations
from torch.amp import autocast, GradScaler

def evaluate_model(model, val_loader, siglip_evaluator, prompt, device, max_samples=None, text_evaluator=None):
    """Validation evaluation + diversity monitoring."""
    model.eval()
    samples, num_eval = [], 0
    predictions, references = [], []
    total = len(val_loader.dataset) if not max_samples else min(max_samples, len(val_loader.dataset))
    print(f"\n[Eval] Generating {total} captions...")
    start = time.time()

    cap_lens, empty_cnt, rep_scores = [], 0, []
    all_captions = []  # [OPT] used for the diversity computation
    img_proc = None  # created once, on first use, for cached batches that carry no pixel_values

    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Eval"):
            if max_samples and num_eval >= max_samples: break
            # Cached batches carry vision_features rather than pixel_values; evaluation uses the original frames
            if "pixel_values" in batch:
                pv = batch["pixel_values"].to(device)
            else:
                # Re-process the PIL frames at evaluation time
                pv = None
            gt_captions = batch["captions"]
            pil_frames_batch = batch["pil_frames"]

            for i in range(len(gt_captions)):
                if max_samples and num_eval >= max_samples: break

                # Cached case: re-process the PIL frames
                if pv is None:
                    if img_proc is None:
                        img_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
                    frames_tensor = img_proc(images=pil_frames_batch[i], return_tensors="pt").pixel_values.to(device)
                else:
                    frames_tensor = pv[i]

                cap = model.generate(frames_tensor, prompt)

                cap_lens.append(len(cap))
                if not cap.strip(): empty_cnt += 1
                tokens = cap.split()
                if len(tokens) > 1: rep_scores.append(len(set(tokens))/len(tokens))
                all_captions.append(cap)

                if num_eval < 3:
                    print(f"\n  [Sample {num_eval+1}]")
                    print(f"    Gen: {cap[:100]}{'...' if len(cap)>100 else ''}")
                    print(f"    GT:  {gt_captions[i][:100]}{'...' if len(gt_captions[i])>100 else ''}")

                samples.append({"frames": pil_frames_batch[i], "caption": cap})
                predictions.append(cap)
                references.append(gt_captions[i])
                num_eval += 1

    # [OPT] Compute diversity
    diversity = compute_diversity(all_captions)
    print(f"\n[Eval] Done in {format_time(time.time()-start)}")
    print(f"[DEBUG] Empty:{empty_cnt}, Diversity:{diversity:.2f} ({len(set(all_captions))}/{len(all_captions)} unique)")

    # [OPT] Mode collapse warning
    if diversity < 0.5:
        print(f"\n" + "!"*60)
        print(f"  WARNING: Low diversity ({diversity:.2f}) - POSSIBLE MODE COLLAPSE!")
        print(f"!"*60 + "\n")

    print("[Eval] Computing SigLIP2...")
    metrics = siglip_evaluator.evaluate_batch(samples)
    metrics["empty_count"] = empty_cnt
    metrics["diversity"] = diversity  # [OPT] added

    if text_evaluator is not None:
        print("[Eval] Computing METEOR/BERTScore...")
        text_metrics = text_evaluator.compute_scores(predictions, references)
        metrics.update(text_metrics)
        print(f"  METEOR: {text_metrics['meteor']:.4f}, BERTScore-F1: {text_metrics['bertscore_f1']:.4f}")

    model.train()
    return metrics

def train_stage1(model, train_loader, config, ckpt_mgr, device):
    """[OPT] Applies the LR scheduler and gradient clipping."""
    print("\n" + "="*60 + "\n  STAGE 1: Linear Warm-up (with LR Scheduler)\n" + "="*60)
    for p in model.llm.parameters(): p.requires_grad = False
    for p in model.projector.parameters(): p.requires_grad = True

    proj_params = sum(p.numel() for p in model.projector.parameters() if p.requires_grad)
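    # The scheduler steps once per optimizer step, i.e. once every gradient_accumulation batches
    total_steps = max(1, len(train_loader) // config["gradient_accumulation"]) * config["stage1_epochs"]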
    warmup_steps = int(config["warmup_ratio"] * total_steps)

    print(f"  Trainable: {proj_params:,} params")
    print(f"  Total Steps: {total_steps} | Warmup: {warmup_steps} ({config['warmup_ratio']*100:.0f}%)")
    print(f"  LR: {config['stage1_lr']} | Grad Clip: {config['max_grad_norm']}")

    opt = torch.optim.AdamW(model.projector.parameters(), lr=config["stage1_lr"])
    scheduler = get_cosine_schedule_with_warmup(opt, warmup_steps, total_steps)
    scaler = GradScaler('cuda')
    stage_start = time.time()

    for epoch in range(config["stage1_epochs"]):
        epoch_start = time.time()
        model.train(); total_loss = 0
        for i, batch in enumerate(pbar := tqdm(train_loader, desc=f"S1 E{epoch+1}/{config['stage1_epochs']}")):
            with autocast('cuda', dtype=torch.bfloat16):
                # [OPT] Use the cached features
                loss = model.forward_with_cache(
                    batch["vision_features"].to(device),
                    batch["input_ids"].to(device),
                    batch["attention_mask"].to(device),
                    batch["labels"].to(device)
                ).loss

            scaler.scale(loss).backward()

            if (i+1) % config["gradient_accumulation"] == 0:
                # [OPT] Gradient Clipping
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.projector.parameters(), config["max_grad_norm"])
                scaler.step(opt)
                scaler.update()
                scheduler.step()  # [OPT] LR Scheduler
                opt.zero_grad()

            total_loss += loss.item()
            current_lr = scheduler.get_last_lr()[0]
            mem = get_gpu_memory()
            pbar.set_postfix({"loss": f"{loss.item():.4f}", "lr": f"{current_lr:.2e}", "mem": f"{mem['allocated']:.1f}G"})

        avg = total_loss / len(train_loader)
        print(f"\n[S1 E{epoch+1}] Loss: {avg:.4f} | LR: {current_lr:.2e} | Time: {format_time(time.time()-epoch_start)}")
        ckpt_mgr.log({"stage": 1, "epoch": epoch+1, "train_loss": avg, "lr": current_lr})

    print(f"\n{'='*60}\n  Stage 1 Complete! Time: {format_time(time.time()-stage_start)} | Loss: {avg:.4f}\n{'='*60}")
    ckpt_mgr.save_checkpoint(model, 1, config["stage1_epochs"], {"train_loss": avg}, opt, scheduler)

def train_stage2(model, train_loader, val_loader, siglip_evaluator, config, ckpt_mgr, device, start_epoch=0, text_evaluator=None):
    """[OPT] LR scheduler + gradient clipping + diversity monitoring."""
    remain = config["stage2_epochs"] - start_epoch
    print(f"\n" + "="*60 + f"\n  STAGE 2: Linear + LoRA (E{start_epoch+1}-{config['stage2_epochs']})\n" + "="*60)

    for p in model.projector.parameters(): p.requires_grad = True
    for name, param in model.llm.named_parameters():
        if 'lora' in name.lower(): param.requires_grad = True

    proj_p = sum(p.numel() for p in model.projector.parameters() if p.requires_grad)
    lora_p = sum(p.numel() for p in model.llm.parameters() if p.requires_grad)
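    # The scheduler steps once per optimizer step, i.e. once every gradient_accumulation batches
    total_steps = max(1, len(train_loader) // config["gradient_accumulation"]) * remain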
    warmup_steps = int(config["warmup_ratio"] * total_steps)

    print(f"  Trainable: Proj {proj_p:,} + LoRA {lora_p:,} = {proj_p+lora_p:,}")
    print(f"  Total Steps: {total_steps} | Warmup: {warmup_steps}")
    print(f"  LR: {config['stage2_lr']} | Grad Clip: {config['max_grad_norm']}")

    all_params = list(model.projector.parameters()) + [p for p in model.llm.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(all_params, lr=config["stage2_lr"])
    scheduler = get_cosine_schedule_with_warmup(opt, warmup_steps, total_steps)
    scaler = GradScaler('cuda')

    stage_start = time.time()
    best = ckpt_mgr.best_score

    for epoch in range(start_epoch, config["stage2_epochs"]):
        epoch_start = time.time()
        model.train(); total_loss = 0

        for i, batch in enumerate(pbar := tqdm(train_loader, desc=f"S2 E{epoch+1}/{config['stage2_epochs']}")):
            with autocast('cuda', dtype=torch.bfloat16):
                loss = model.forward_with_cache(
                    batch["vision_features"].to(device),
                    batch["input_ids"].to(device),
                    batch["attention_mask"].to(device),
                    batch["labels"].to(device)
                ).loss

            scaler.scale(loss).backward()

            if (i+1) % config["gradient_accumulation"] == 0:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(all_params, config["max_grad_norm"])
                scaler.step(opt)
                scaler.update()
                scheduler.step()
                opt.zero_grad()

            total_loss += loss.item()
            current_lr = scheduler.get_last_lr()[0]
            mem = get_gpu_memory()
            pbar.set_postfix({"loss": f"{loss.item():.4f}", "lr": f"{current_lr:.2e}", "mem": f"{mem['allocated']:.1f}G"})

        avg = total_loss / len(train_loader)
        print(f"\n[S2 E{epoch+1}] Train Loss: {avg:.4f} | LR: {current_lr:.2e} | Time: {format_time(time.time()-epoch_start)}")

        # Evaluation
        metrics = evaluate_model(model, val_loader, siglip_evaluator, config["prompt"], device, text_evaluator=text_evaluator)
        is_best = metrics['siglip_score'] > best
        if is_best: best = metrics['siglip_score']

        print(f"[S2 E{epoch+1}] Val SigLIP: {metrics['siglip_score']:.4f} | Diversity: {metrics.get('diversity', 0):.2f} {'(BEST!)' if is_best else ''} | Best: {best:.4f}")
        ckpt_mgr.log({"stage": 2, "epoch": epoch+1, "train_loss": avg, "siglip_score": metrics["siglip_score"], "diversity": metrics.get("diversity", 0), "lr": current_lr})
        ckpt_mgr.save_checkpoint(model, 2, epoch+1, metrics, opt, scheduler)

print("Training functions defined (with optimizations).")

Training functions defined (with optimizations).
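
In both training loops the optimizer and the scheduler step only on every `gradient_accumulation`-th batch, so the number of scheduler steps is smaller than the number of batches. A small sketch of that count and of the resulting warmup length; the helper names and the batch count of 250 are hypothetical.

```python
def optimizer_steps(num_batches: int, accumulation: int, epochs: int) -> int:
    """Optimizer/scheduler steps when stepping on every `accumulation`-th batch of each epoch."""
    return (num_batches // accumulation) * epochs

def warmup_step_count(total_steps: int, warmup_ratio: float) -> int:
    """Warmup length as a truncated fraction of the total step count."""
    return int(warmup_ratio * total_steps)

# Hypothetical run: 250 batches per epoch, accumulation 2, 2 epochs, 10% warmup.
steps = optimizer_steps(250, 2, 2)
print(steps, warmup_step_count(steps, 0.1))
print("effective batch size:", 4 * 2)  # batch_size x gradient_accumulation from CONFIG
```

If the schedule length passed to the scheduler is counted in batches instead of optimizer steps, the cosine decay covers only part of its curve by the end of training.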


In [None]:
# Model build
def build_model(config, projector_type, device, resume_path=None):
    print(f"\nBuilding {projector_type} model...")
    vision_encoder = CLIPVisionModel.from_pretrained(config["vision_encoder"]).to(device)
    image_processor = CLIPImageProcessor.from_pretrained(config["vision_encoder"])
    vision_encoder.eval()

    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)
    llm = AutoModelForCausalLM.from_pretrained(config["llm"], quantization_config=bnb, device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(config["llm"], trust_remote_code=True)
    if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
    print(f"  PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
    print(f"  EOS: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")

    projector = create_projector(projector_type, vision_encoder.config.hidden_size, llm.config.hidden_size, config).to(device)

    llm = prepare_model_for_kbit_training(llm)
    lora_cfg = LoraConfig(r=config["lora_r"], lora_alpha=config["lora_alpha"], lora_dropout=config["lora_dropout"],
                          target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
                          task_type=TaskType.CAUSAL_LM, bias="none")
    llm = get_peft_model(llm, lora_cfg)
    llm.print_trainable_parameters()
    llm.config.use_cache = False  # type: ignore[attr-defined]
    llm.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

    model = CustomVLM(vision_encoder, projector, llm, tokenizer)
    if resume_path:
        print(f"Loading checkpoint: {resume_path}")
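        # Stage-2 checkpoints written by this notebook hold NumPy scalars in "metrics",
        # so load with weights_only=False (only do this for files you trust)
        ckpt = torch.load(resume_path, map_location=device, weights_only=False)
        model.projector.load_state_dict(ckpt["projector_state_dict"])
        if "lora_state_dict" in ckpt:
            # Restore LoRA adapter weights when the checkpoint contains them
            from peft import set_peft_model_state_dict
            set_peft_model_state_dict(model.llm, ckpt["lora_state_dict"])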
        print(f"Resumed from stage {ckpt['stage']}, epoch {ckpt['epoch']}")
    print("Model built!")
    return model, vision_encoder, image_processor, tokenizer

print("Build function defined.")

Build function defined.


In [None]:
# Experiment runner (with Vision Caching)
def run_experiment(exp_name, projector_type, config):
    exp_start = time.time()
    print("\n" + "#"*60)
    print(f"#  EXPERIMENT: {exp_name}")
    print(f"#  Projector: {projector_type}")
    print(f"#  LR: stage1={config['stage1_lr']}, stage2={config['stage2_lr']}")
    print(f"#  Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("#"*60)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    ckpt_mgr = CheckpointManager(exp_name, config["results_dir"])

    if ckpt_mgr.is_completed():
        print(f"\n[SKIP] Already completed!")
        with open(ckpt_mgr.exp_dir / "final_metrics.json") as f:
            r = json.load(f)
        print(f"  Previous Score: {r.get('siglip_score', 'N/A')}")
        return r

    resume_info = ckpt_mgr.get_resume_info()
    resume_path = resume_info["path"] if resume_info else None
    if resume_info:
        print(f"\n[RESUME] Stage {resume_info['stage']}, Epoch {resume_info['epoch']}")
    else:
        print(f"\n[START] Starting from scratch")

    try:
        print(f"\n[1/5] Building model...")
        model, vision_encoder, img_proc, tok = build_model(config, projector_type, device, resume_path)

        print(f"\n[2/5] Loading datasets with Vision Caching...")
        train_ds = CachedVideoDataset(
            config["data_path"], "train", vision_encoder, img_proc, tok,
            config["num_frames"], config["max_length"], None, config["prompt"], device
        )
        val_ds = CachedVideoDataset(
            config["data_path"], "val", vision_encoder, img_proc, tok,
            config["num_frames"], config["max_length"], None, config["prompt"], device
        )
        train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=True,
                                  collate_fn=create_cached_collate_fn(tok.pad_token_id),
                                  num_workers=4, pin_memory=True, persistent_workers=True)
        val_loader = DataLoader(val_ds, batch_size=config["batch_size"], shuffle=False,
                                collate_fn=create_cached_collate_fn(tok.pad_token_id),
                                num_workers=4, pin_memory=True, persistent_workers=True)
        print_config(config, projector_type, len(train_ds), len(val_ds))

        print(f"[3/5] Loading SigLIP2 evaluator...")
        siglip_evaluator = SigLIPEvaluator(config["siglip_model"], device)

        print(f"[4/5] Loading Text Metrics evaluator...")
        text_evaluator = TextMetricsEvaluator()

        print(f"\n[5/5] Starting training...")
        if resume_info is None:
            train_stage1(model, train_loader, config, ckpt_mgr, device)
            train_stage2(model, train_loader, val_loader, siglip_evaluator, config, ckpt_mgr, device, 0, None)
        elif resume_info["stage"] == 1:
            train_stage2(model, train_loader, val_loader, siglip_evaluator, config, ckpt_mgr, device, 0, None)
        else:
            train_stage2(model, train_loader, val_loader, siglip_evaluator, config, ckpt_mgr, device, resume_info["epoch"], None)

        print(f"\n[FINAL] Running final evaluation...")
        final = evaluate_model(model, val_loader, siglip_evaluator, config["prompt"], device, None, text_evaluator)
        final["experiment"], final["projector"] = exp_name, projector_type
        final["config"] = {"stage1_lr": config["stage1_lr"], "stage2_lr": config["stage2_lr"]}
        ckpt_mgr.save_final(final)

        exp_time = time.time() - exp_start
        print("\n" + "="*60)
        print(f"  EXPERIMENT COMPLETE: {exp_name}")
        print("="*60)
        print(f"  Final SigLIP2: {final['siglip_score']:.4f}")
        print(f"  Final Diversity: {final.get('diversity', 0):.2f}")
        if 'meteor' in final:
            print(f"  Final METEOR: {final['meteor']:.4f}")
            print(f"  Final BERTScore-F1: {final['bertscore_f1']:.4f}")
        print(f"  Total Time: {format_time(exp_time)}")
        print(f"  Saved to: {ckpt_mgr.exp_dir}")
        print("="*60)

        # Release GPU memory before the next experiment.
        del model, siglip_evaluator, text_evaluator
        torch.cuda.empty_cache()
        return final
    except Exception as e:
        print(f"\n{'!'*60}\n  ERROR in {exp_name}: {e}\n{'!'*60}")
        import traceback
        traceback.print_exc()
        return {"experiment": exp_name, "error": str(e)}

print("Experiment runner defined.")

Experiment runner defined.
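
The resume logic in `run_experiment` has three branches: start from scratch, continue after a finished stage 1, or continue stage 2 from a saved epoch. The sketch below isolates that decision in a small pure function so it can be checked without loading any model. `plan_stages` and `StagePlan` are hypothetical names introduced for illustration; the `stage` and `epoch` keys follow what `get_resume_info()` returns in the runner above.

```python
from typing import Optional, TypedDict


class StagePlan(TypedDict):
    run_stage1: bool          # whether the projector warm-up stage still has to run
    stage2_start_epoch: int   # epoch index passed to train_stage2


def plan_stages(resume_info: Optional[dict]) -> StagePlan:
    """Mirror the three branches of run_experiment (illustrative helper)."""
    if resume_info is None:
        # No checkpoint: run stage 1, then stage 2 from epoch 0.
        return {"run_stage1": True, "stage2_start_epoch": 0}
    if resume_info["stage"] == 1:
        # Stage 1 checkpoint found: skip stage 1, start stage 2 from epoch 0.
        return {"run_stage1": False, "stage2_start_epoch": 0}
    # Stage 2 checkpoint found: continue stage 2 from the saved epoch.
    return {"run_stage1": False, "stage2_start_epoch": resume_info["epoch"]}


for info in (None, {"stage": 1, "epoch": 2}, {"stage": 2, "epoch": 1}):
    print(info, "->", plan_stages(info))
```

Note that a stage 1 checkpoint is always treated as a completed stage 1, regardless of its `epoch` value, which matches the `elif resume_info["stage"] == 1` branch in the runner.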


---

## 4. Running the Experiment

### Experiment E1-v2: Linear + LoRA (Linear Projector)

In [None]:
result_e1_v2 = run_experiment("E1_v2_linear_optimized", "linear", CONFIG)
print(f"\nE1-v2 Result: {result_e1_v2}")


############################################################
#  EXPERIMENT: E1_v2_linear_optimized
#  Projector: linear
#  LR: stage1=0.001, stage2=5e-05
#  Time: 2026-01-21 04:35:36
############################################################
CheckpointManager: /content/drive/MyDrive/mutsa-02/korean_video_captioning/siglip_study/results3/E1_v2_linear_optimized

[START] Starting from scratch

[1/5] Building model...

Building linear model...




Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

  PAD: <|endoftext|> (ID: 151643)
  EOS: <|im_end|> (ID: 151645)
trainable params: 43,646,976 || all params: 8,234,382,336 || trainable%: 0.5301
Model built!

[2/5] Loading datasets with Vision Caching...

[CACHE] Pre-computing vision features for 865 train samples...


Caching train: 100%|██████████| 865/865 [1:11:25<00:00,  4.95s/it]


[CACHE] Done! Cached 865 samples.
[CACHE] Feature shape: torch.Size([4608, 1024])

[CACHE] Pre-computing vision features for 97 val samples...


Caching val: 100%|██████████| 97/97 [07:59<00:00,  4.94s/it]


[CACHE] Done! Cached 97 samples.
[CACHE] Feature shape: torch.Size([4608, 1024])

------------------------------------------------------------
CONFIG: linear | Train: 865 | Val: 97
Batch: 4x2 | Frames: 8 | MaxLen: 768
Stage1: 2ep (LR:0.001) | Stage2: 3ep (LR:5e-05)
Warmup: 10% | Grad Clip: 1.0
GPU: 10.0GB / 85.2GB
------------------------------------------------------------
[3/5] Loading SigLIP2 evaluator...
Loading SigLIP2: google/siglip2-so400m-patch14-384


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


[4/5] Loading Text Metrics evaluator...


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


TextMetricsEvaluator ready

[5/5] Starting training...

  STAGE 1: Linear Warm-up (with LR Scheduler)
  Trainable: 4,198,400 params
  Total Steps: 434 | Warmup: 43 (10%)
  LR: 0.001 | Grad Clip: 1.0


S1 E1/2: 100%|██████████| 217/217 [46:59<00:00, 12.99s/it, loss=1.5438, lr=9.33e-04, mem=14.6G]



[S1 E1] Loss: 1.8293 | LR: 9.33e-04 | Time: 46m59s


S1 E2/2: 100%|██████████| 217/217 [46:53<00:00, 12.97s/it, loss=1.9720, lr=5.90e-04, mem=14.6G]



[S1 E2] Loss: 1.6065 | LR: 5.90e-04 | Time: 46m53s

  Stage 1 Complete! Time: 1h33m53s | Loss: 1.6065
Saved: /content/drive/MyDrive/mutsa-02/korean_video_captioning/siglip_study/results3/E1_v2_linear_optimized/checkpoints/stage1_checkpoint.pt

  STAGE 2: Linear + LoRA (E1-3)
  Trainable: Proj 4,198,400 + LoRA 43,646,976 = 47,845,376
  Total Steps: 651 | Warmup: 65
  LR: 5e-05 | Grad Clip: 1.0


S2 E1/3: 100%|██████████| 217/217 [47:21<00:00, 13.09s/it, loss=1.6685, lr=4.93e-05, mem=15.1G]



[S2 E1] Train Loss: 1.5239 | LR: 4.93e-05 | Time: 47m21s

[Eval] Generating 97 captions...


Eval:   0%|          | 0/25 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



  [Sample 1]
    Gen:  영상은 맑고 화창한 날씨 속에서 한 건물의 외관과 주변 풍경을 담고 있습니다. 카메라는 고정된 상태로, 중앙에 위치한 회사 건물을 중심으로 촬영하고 있으며, 이는 전체적인 구조를...
    GT:  영상은 한 도시에 있는 건축물을 담고 있습니다. 중앙에 보이는 건물의 색은 회색이며 직사각형 형태를 가지고 있습니다. 건물의 전면 부분에는 영어로 "the #"이 적혀 있고 그 밑...

  [Sample 2]
    Gen:  영상은 맑고 화창한 날씨 속에서 도심의 한 건물을 중심으로 촬영되었습니다. 카메라는 고정된 상태로, 이 건물과 그 주변을 보여줍니다. 건물 앞에는 넓은 잎사귀가 있는 나무들이 식...
    GT:  이 영상은 맑은 날씨에 촬영되었습니다. 화면 중앙에는 "한국학술진흥재단" 건물이 자리 잡고 있습니다. 이 건물은 베이지색 외벽과 수직으로 배열된 창문들이 특징이며, 5층 높이로 보...

  [Sample 3]
    Gen:  영상은 한 사람이 흰색 옷을 입고 길가에 서 있는 장면입니다. 카메라는 고정된 상태로 촬영하고 있습니다. 화면 중앙에는 흰색 상의를 착용한 남자가 보이며, 그 앞쪽으로는 검은색 ...
    GT:  이 영상은 2000년대에 촬영된 것으로, 맑은 날 낮의 한옥을 주요 배경으로 촬영된 영상입니다. 영상 전면에 보이는 건물은 한쪽이 열려 있는 밝은 갈색의 대문이 보이며, 검은색 문...


Eval: 100%|██████████| 25/25 [39:39<00:00, 95.19s/it]



[Eval] Done in 39m39s
[DEBUG] Empty:0, Diversity:1.00 (97/97 unique)
[Eval] Computing SigLIP2...


SigLIP: 100%|██████████| 97/97 [00:49<00:00,  1.96it/s]


[S2 E1] Val SigLIP: 0.1440 | Diversity: 1.00 (BEST!) | Best: 0.1440
Saved: /content/drive/MyDrive/mutsa-02/korean_video_captioning/siglip_study/results3/E1_v2_linear_optimized/checkpoints/stage2_epoch1_checkpoint.pt
New best! Score: 0.1440


S2 E2/3: 100%|██████████| 217/217 [47:19<00:00, 13.09s/it, loss=1.3886, lr=4.22e-05, mem=15.1G]



[S2 E2] Train Loss: 1.3804 | LR: 4.22e-05 | Time: 47m19s

[Eval] Generating 97 captions...


Eval:   0%|          | 0/25 [00:00<?, ?it/s]


  [Sample 1]
    Gen:  이 영상은 맑고 화창한 낮의 광경을 담고 있습니다. 카메라는 고정된 상태로 촬영하고 있으며, 도심 속 건물과 그 주변 풍경이 선명하게 보입니다. 화면 중앙에는 대형 상업 건물이 ...
    GT:  영상은 한 도시에 있는 건축물을 담고 있습니다. 중앙에 보이는 건물의 색은 회색이며 직사각형 형태를 가지고 있습니다. 건물의 전면 부분에는 영어로 "the #"이 적혀 있고 그 밑...

  [Sample 2]
    Gen:  영상은 맑고 화창한 낮의 광경을 담고 있습니다. 카메라는 고정된 상태로 촬영하고 있으며, 도심 속 건물과 그 주변 풍경이 중심입니다. 화면 중앙에는 대형 상업 건물이 자리 잡고 ...
    GT:  이 영상은 맑은 날씨에 촬영되었습니다. 화면 중앙에는 "한국학술진흥재단" 건물이 자리 잡고 있습니다. 이 건물은 베이지색 외벽과 수직으로 배열된 창문들이 특징이며, 5층 높이로 보...

  [Sample 3]
    Gen:  영상은 맑고 화창한 날씨 속에서 한 사람이 걸어가는 모습을 담고 있습니다. 카메라는 고정된 상태로 촬영되며, 사람의 움직임과 주변 풍경이 선명하게 보입니다. 화면 중앙에는 검은색...
    GT:  이 영상은 2000년대에 촬영된 것으로, 맑은 날 낮의 한옥을 주요 배경으로 촬영된 영상입니다. 영상 전면에 보이는 건물은 한쪽이 열려 있는 밝은 갈색의 대문이 보이며, 검은색 문...


Eval:  48%|████▊     | 12/25 [19:34<21:12, 97.85s/it]
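
The learning rates printed at the end of each epoch can be reproduced with a linear-warmup plus cosine-decay formula. The sketch below uses the same functional form as the cosine schedule in `transformers`, together with the step counts from the log (434 and 651 total steps, 43 and 65 warmup steps). It rests on one assumption that the log does not confirm: the scheduler advances once per two batches, read from the `Batch: 4x2` line as gradient accumulation of 2. `warmup_cosine_lr` is a hypothetical helper, not the notebook's own scheduler code.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup to peak_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


BATCHES_PER_EPOCH = 217   # from the progress bars in the log
ACCUM_STEPS = 2           # ASSUMPTION: "Batch: 4x2" means accumulation of 2
steps_per_epoch = BATCHES_PER_EPOCH // ACCUM_STEPS  # scheduler steps per epoch

# Stage 1: peak 1e-3, 434 total steps, 43 warmup steps (values from the log).
s1_e1 = warmup_cosine_lr(1 * steps_per_epoch, 434, 43, 1e-3)
s1_e2 = warmup_cosine_lr(2 * steps_per_epoch, 434, 43, 1e-3)
# Stage 2: peak 5e-5, 651 total steps, 65 warmup steps (values from the log).
s2_e1 = warmup_cosine_lr(1 * steps_per_epoch, 651, 65, 5e-5)
s2_e2 = warmup_cosine_lr(2 * steps_per_epoch, 651, 65, 5e-5)

print(f"S1 E1: {s1_e1:.2e}  (logged 9.33e-04)")
print(f"S1 E2: {s1_e2:.2e}  (logged 5.90e-04)")
print(f"S2 E1: {s2_e1:.2e}  (logged 4.93e-05)")
print(f"S2 E2: {s2_e2:.2e}  (logged 4.22e-05)")
```

If this reading is right, the schedule horizon (434 steps in stage 1) is roughly twice the number of scheduler steps actually taken (216), so stage 1 ends at about 59% of the peak rate rather than near zero. Whether that is intended cannot be determined from the log alone.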

---

## 5. Results Summary

In [None]:
print("\n" + "="*60 + "\nE1-v2 (LINEAR PROJECTOR) SUMMARY\n" + "="*60)
metrics_file = Path(RESULTS_DIR) / "E1_v2_linear_optimized" / "final_metrics.json"
if metrics_file.exists():
    with open(metrics_file) as f:
        result = json.load(f)
    summary_df = pd.DataFrame([result])
    print(summary_df.to_string(index=False))
    summary_df.to_csv(f"{RESULTS_DIR}/e1_v2_summary.csv", index=False)
else:
    print("E1-v2: Not completed")
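
The summary row includes the `diversity` value that the evaluation log reports as a unique/total ratio (for example `97/97 unique`). The sketch below shows that ratio, assuming it is computed from exact string matches, next to a stricter prefix-based variant. Both functions and the toy captions are illustrative and not taken from the notebook; the prefix variant is a hypothetical addition for spotting captions that differ only in their endings.

```python
def exact_diversity(captions: list[str]) -> float:
    """Unique captions divided by total captions (assumed log definition)."""
    if not captions:
        return 0.0
    return len(set(captions)) / len(captions)


def prefix_diversity(captions: list[str], n_chars: int = 30) -> float:
    """Stricter variant: captions sharing their first n_chars count once."""
    if not captions:
        return 0.0
    prefixes = {caption.strip()[:n_chars] for caption in captions}
    return len(prefixes) / len(captions)


# Toy captions, made up for illustration only.
toy = [
    "The video shows a building on a clear sunny day, seen from the street.",
    "The video shows a building on a clear sunny day, with trees in front.",
    "A man in a white shirt stands by the road.",
]

exact = exact_diversity(toy)
loose = prefix_diversity(toy, n_chars=30)
print(f"exact-match diversity: {exact:.2f}")
print(f"prefix diversity:      {loose:.2f}")
```

An exact-match ratio of 1.00 only rules out identical outputs. Captions that open with the same template still count as distinct, so a prefix or n-gram check can be a useful second signal when monitoring for mode collapse.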

## 6. Loading the Model (for Inference)

In [None]:
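def load_trained_model(exp_name, projector_type, config):
    """Rebuild the model and load best_model.pt for inference.

    Returns (model, vision_encoder, img_proc, tok), or four Nones if the
    checkpoint is missing. build_model restores only the projector weights.
    """
    ckpt_path = Path(config["results_dir"]) / exp_name / "checkpoints" / "best_model.pt"
    if not ckpt_path.exists():
        print(f"Not found: {ckpt_path}")
        return None, None, None, None
    model, vision_encoder, img_proc, tok = build_model(config, projector_type, "cuda", ckpt_path)
    model.eval()
    return model, vision_encoder, img_proc, tok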

# Usage: model, ve, img_proc, tok = load_trained_model("E1_v2_linear_optimized", "linear", CONFIG)