# VLM Embedding Stitching Benchmark

**Goal**: Compare how quickly three frozen vision encoders (ViT, CLIP, I-JEPA) converge when stitched into a Qwen2.5-0.5B-Instruct LLM through a trainable MLP projector plus LoRA adapters.

| Encoder | Model | Hidden Dim | Patches |
|---------|-------|-----------|----------|
| ViT-L/16 | `google/vit-large-patch16-224` | 1024 | 196 |
| CLIP ViT-L/14 | `openai/clip-vit-large-patch14` | 1024 | 256 |
| I-JEPA ViT-H/14 | `facebook/ijepa_vith14_1k` | 1280 | 256 |
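
The patch counts in the table follow from the patch size, assuming every encoder receives a 224×224 input (an assumption here; the notebook does not state the resolution explicitly). A minimal sketch of that arithmetic, using a hypothetical helper `num_patches` and ignoring any class token the encoder may prepend:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumed input resolution: 224 x 224 for all three encoders.
patch_sizes = {"ViT-L/16": 16, "CLIP ViT-L/14": 14, "I-JEPA ViT-H/14": 14}
patch_counts = {name: num_patches(224, p) for name, p in patch_sizes.items()}

for name, count in patch_counts.items():
    print(f"{name:16s} {count} patches")
```

The two `/14` encoders therefore hand the projector about 30% more visual tokens per image than ViT-L/16, which is worth keeping in mind when comparing step times.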

## 1. Setup & Install

In [1]:
# Install dependencies (required on Colab; comment out if they are already installed)
!pip install -q torch transformers peft datasets accelerate Pillow matplotlib tqdm


In [2]:
import os
import sys
import torch

# If running on Colab, clone the repo first:
# !git clone <your-repo-url> RLJ
# %cd RLJ

# Make sure project root is on the path
PROJECT_ROOT = os.path.abspath(".")
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

print(f"Project root: {PROJECT_ROOT}")
print(f"PyTorch:      {torch.__version__}")
print(f"CUDA:         {torch.cuda.is_available()} ({torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'})")

Project root: /Users/tarun/Personal/RLJ
PyTorch:      2.10.0
CUDA:         False (N/A)


## 2. Configuration

In [3]:
import torch
from configs import ExperimentConfig, get_encoder_configs

# Get pre-built configs for all three encoders
all_configs = get_encoder_configs()

# ---- Detect hardware ----
# IS_COLAB really means "CUDA is available"; any CUDA machine takes this path.
IS_COLAB = torch.cuda.is_available()
IS_MAC = (not IS_COLAB) and hasattr(torch.backends, "mps") and torch.backends.mps.is_available()

# ---- Adjust hyperparams for your hardware ----
for cfg in all_configs:
    cfg.num_steps = 500            # increase for a more thorough run
    cfg.learning_rate = 1e-4
    cfg.max_samples = 10000        # use a subset for speed (None = all)
    cfg.log_every = 10
    cfg.eval_every = 50
    cfg.save_dir = "outputs"

    if IS_COLAB:
        cfg.batch_size = 4
        cfg.gradient_accumulation_steps = 4
        cfg.dtype = "bfloat16"
    else:
        # Mac / CPU -- smaller batches, float32 (bfloat16 crashes MPS)
        cfg.batch_size = 2
        cfg.gradient_accumulation_steps = 8
        cfg.dtype = "float32"

# Quick overview
hw = "CUDA/Colab" if IS_COLAB else ("MPS/Mac" if IS_MAC else "CPU")
print(f"  Hardware: {hw}")
for c in all_configs:
    print(f"  {c.encoder_name:6s} | {c.encoder_model_id} | bs={c.batch_size} | {c.dtype}")

  Hardware: MPS/Mac
  vit    | google/vit-large-patch16-224 | bs=2 | float32
  clip   | openai/clip-vit-large-patch14 | bs=2 | float32
  ijepa  | facebook/ijepa_vith14_1k | bs=2 | float32
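
Both hardware branches in the configuration cell trade per-step batch size against gradient accumulation. The sketch below checks that they land on the same effective batch size, so the loss curves stay comparable across machines. The helper `effective_batch_size` is hypothetical, and whether `num_steps` counts optimizer steps or micro-batches depends on `train.py`, which is not shown here.

```python
def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Samples that contribute to a single optimizer update."""
    return batch_size * accumulation_steps


cuda_effective = effective_batch_size(batch_size=4, accumulation_steps=4)
fallback_effective = effective_batch_size(batch_size=2, accumulation_steps=8)
print(f"CUDA: {cuda_effective}  Mac/CPU: {fallback_effective}")

# Assumption: num_steps counts optimizer updates rather than micro-batches.
num_steps = 500
samples_seen = cuda_effective * num_steps
print(f"Samples consumed in {num_steps} steps: {samples_seen}")
```

Under that assumption a 500-step run consumes 8,000 samples, which fits inside the 10,000-sample subset without repeating data.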


## 3. Run All Experiments

In [None]:
from train import run_all_experiments

trackers = run_all_experiments(all_configs)

## 3b. (Alternative) Run a Single Encoder

Run this cell instead of the one in Section 3 to train one encoder at a time, which is useful when GPU memory is limited.

In [None]:
from train import run_experiment

# Pick which encoder to run: 0 = ViT, 1 = CLIP, 2 = I-JEPA
encoder_idx = 1  # CLIP
cfg = all_configs[encoder_idx]

tracker = run_experiment(cfg)
print(tracker.summary())

  [INFO] MPS detected -- forcing float32 (bfloat16 is unstable on MPS)
  EXPERIMENT: CLIP
  Encoder:    openai/clip-vit-large-patch14
  LLM:        Qwen/Qwen2.5-0.5B-Instruct
  Device:     mps   Dtype: torch.float32


Loading weights:   0%|          | 0/391 [00:00<?, ?it/s]

CLIPVisionModel LOAD REPORT from: openai/clip-vit-large-patch14
Key                                                          | Status     |  | 
-------------------------------------------------------------+------------+--+-
text_model.encoder.layers.{0...11}.self_attn.out_proj.weight | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.self_attn.k_proj.bias     | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.mlp.fc1.weight            | UNEXPECTED |  | 
text_model.embeddings.position_ids                           | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.self_attn.v_proj.weight   | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.layer_norm2.weight        | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.self_attn.q_proj.bias     | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.self_attn.q_proj.weight   | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.mlp.fc2.bias              | UNEXPECTED |  | 
text_model.encoder.layers.{0...11}.self_attn.v_p

Loading weights:   0%|          | 0/290 [00:00<?, ?it/s]

  Trainable params: 2,805,504  (0.35%)
  Total params:     800,018,048


CLIP:   0%|          | 0/500 [00:00<?, ?it/s]
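
The trainable share reported in the log above can be reproduced from the two parameter counts. This is only a sanity check on the printed numbers; it says nothing about how the trainable parameters are split between the projector and the LoRA adapters.

```python
# Parameter counts copied from the CLIP run log.
trainable_params = 2_805_504
total_params = 800_018_048

frozen_params = total_params - trainable_params
trainable_fraction = trainable_params / total_params

print(f"Trainable: {trainable_params:,} ({trainable_fraction:.2%})")
print(f"Frozen:    {frozen_params:,}")
```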

## 4. Plot Convergence Comparison

In [None]:
from utils import plot_convergence

fig = plot_convergence(
    trackers,
    title="Convergence: ViT vs CLIP vs I-JEPA",
    smoothing_window=20,
    save_path="outputs/convergence.png",
)
fig.show()
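
How `plot_convergence` applies `smoothing_window` is not shown in this notebook. One common choice is a trailing moving average, sketched below with a hypothetical helper `moving_average`. During the first `window - 1` steps it averages over whatever history exists, so the smoothed curve has the same length as the raw one.

```python
def moving_average(values, window):
    """Trailing moving average; early points use the available history."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


raw_losses = [1, 2, 3, 4]
print(moving_average(raw_losses, window=2))
```

A larger window gives a cleaner curve but delays the point at which a drop in loss becomes visible, which matters when the comparison is about convergence speed.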

## 5. Summary & Save Results

In [None]:
import json
import os  # used by os.makedirs below; lets this cell run on its own

results = {}
for name, tracker in trackers.items():
    s = tracker.summary()
    results[name] = s
    print(f"{name.upper():8s} | final_loss={s['final_loss']:.4f}  "
          f"min_loss={s['min_loss']:.4f}  "
          f"avg_last_50={s['avg_loss_last_50']:.4f}")

# Save summary
os.makedirs("outputs", exist_ok=True)
with open("outputs/summary.json", "w") as f:
    json.dump(results, f, indent=2)
print("\nSaved to outputs/summary.json")
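
The summary cell reads three keys from `tracker.summary()`: `final_loss`, `min_loss`, and `avg_loss_last_50`. The implementation of `LossTracker.summary()` is not shown, so the following is only a sketch of how such statistics could be derived from a plain list of per-step losses; `summarize_losses` is a hypothetical name.

```python
def summarize_losses(losses, tail=50):
    """Return final, minimum, and trailing-average loss for one run."""
    if not losses:
        raise ValueError("losses must not be empty")
    last = losses[-tail:]  # shorter than `tail` if the run was short
    return {
        "final_loss": losses[-1],
        "min_loss": min(losses),
        f"avg_loss_last_{tail}": sum(last) / len(last),
    }


example = summarize_losses([3.0, 2.0, 1.0, 2.0], tail=2)
print(example)
```

The trailing average is usually the most reliable of the three for comparing encoders, because a single step's loss is noisy at small batch sizes.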

## 6. (Optional) Load Previous Results

If you ran experiments separately, you can reload the saved trackers and re-plot.

In [None]:
# from utils import LossTracker, plot_convergence
#
# reloaded = {
#     "vit":   LossTracker.load("outputs/vit/loss_history.json"),
#     "clip":  LossTracker.load("outputs/clip/loss_history.json"),
#     "ijepa": LossTracker.load("outputs/ijepa/loss_history.json"),
# }
# fig = plot_convergence(reloaded)
# fig.show()
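
When encoders were trained in separate sessions, some history files may be missing. The sketch below lists which ones exist before anything is loaded, assuming the `outputs/<encoder>/loss_history.json` layout used in the commented cell above. The helper `find_loss_histories` is hypothetical and does not call `LossTracker`.

```python
from pathlib import Path


def find_loss_histories(save_dir="outputs", encoders=("vit", "clip", "ijepa")):
    """Map encoder name to its loss-history path, skipping missing files."""
    base = Path(save_dir)
    found = {}
    for name in encoders:
        path = base / name / "loss_history.json"
        if path.is_file():
            found[name] = path
    return found


available = find_loss_histories()
print(f"Found results for: {sorted(available) or 'none'}")
```

The returned paths can then be passed to whatever loader the project provides, so a missing run is skipped instead of raising an error.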