
# 🔍 Understanding Pretrained Models

## Loading, Inspecting, and Preparing Models for Physical AI

This tutorial demonstrates how to work with pretrained models: loading them,
inspecting their architecture, and preparing them for fine-tuning.

### Learning Objectives
1. Load pretrained VLMs, vision encoders, and robotics models
2. Inspect model architecture and understand components
3. Identify which parts to freeze vs. train
4. Run basic inference to verify models work

### Prerequisites
- Hugging Face account (for model access)
- Basic PyTorch knowledge

---



## 1. Environment Setup

In [15]:
!pip install -q torch torchvision
!pip install -q "transformers>=4.40.0"  # Quoted so the shell does not treat >= as a redirect
!pip install -q accelerate
!pip install -q peft
!pip install -q pillow
!pip install -q timm
!pip install -q huggingface_hub

# bitsandbytes for quantization (optional - requires CUDA GPU)
# This may fail on CPU-only runtimes, which is OK
!pip install -q "bitsandbytes>=0.42.0" 2>/dev/null || echo "Note: bitsandbytes not installed (requires GPU)"

print("✅ Core packages installed")

# %%
import torch
import torch.nn as nn
from PIL import Image
import numpy as np
from typing import Dict, List, Optional
import requests
from io import BytesIO

# Hugging Face
from transformers import (
    AutoProcessor,
    AutoModelForCausalLM,
    AutoModelForVision2Seq,
    AutoModel,
    AutoTokenizer,
    CLIPProcessor,
    CLIPModel,
)
from huggingface_hub import login, list_models
import timm

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

✅ Core packages installed
PyTorch version: 2.9.0+cpu
CUDA available: False
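
Because the install cell pins minimum versions, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum`, and `check_package` are hypothetical, and the parser only handles simple dotted versions (it ignores local suffixes such as `+cpu` and pre-release tags).

```python
from importlib import metadata
from itertools import takewhile


def parse_version(version: str) -> tuple:
    """Turn '4.41.2' or '2.9.0+cpu' into a tuple of ints (leading digits only)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name: str, minimum: str) -> str:
    """Report whether an installed package satisfies a minimum version."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    status = "OK" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
    return f"{name} {installed}: {status}"


# Minimum versions taken from the install cell above
for package, minimum in [("transformers", "4.40.0"), ("bitsandbytes", "0.42.0")]:
    print(check_package(package, minimum))
```

On a CPU-only runtime, `bitsandbytes` reporting "not installed" is expected and does not block the rest of the tutorial.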


## 2. Hugging Face Authentication

Some models require authentication. Get your token from https://huggingface.co/settings/tokens

In [2]:
# Log in to Hugging Face
# Only needed for gated models such as PaliGemma; skip this cell otherwise

login()  # Interactive login

# Or use token directly:
# login(token="your_token_here")

print("💡 Run login() if you need access to gated models like PaliGemma")

💡 Run login() if you need access to gated models like PaliGemma
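
For non-interactive runs, a common alternative to the login widget is to read the token from the `HF_TOKEN` environment variable. The sketch below assumes that variable name; the helpers `resolve_hf_token` and `mask_token` are hypothetical and only illustrate the lookup, without calling the Hub.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def mask_token(token: str) -> str:
    """Show only the first few characters so the token is safe to print."""
    return token[:4] + "..." if len(token) > 4 else "****"


token = resolve_hf_token()
if token is None:
    print("HF_TOKEN not set; public models still load without it")
else:
    print(f"Found token {mask_token(token)}")
    # With huggingface_hub imported as in the setup cell:
    # login(token=token)
```

Keeping the token in an environment variable or a Colab secret avoids pasting it into the notebook itself.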


## 3. Vision Encoders

Vision encoders are the "eyes" of Physical AI systems. Let's explore common options.

In [3]:
print("=" * 60)
print("COMMON VISION ENCODERS")
print("=" * 60)

vision_encoders = """
┌──────────────────────────────────────────────────────────────┐
│  Model          │ Source    │ Pretrained On    │ Best For    │
├──────────────────────────────────────────────────────────────┤
│  CLIP ViT       │ OpenAI    │ Image-text pairs │ VLM base    │
│  SigLIP         │ Google    │ Image-text pairs │ PaliGemma   │
│  DINOv2         │ Meta      │ Self-supervised  │ Dense feats │
│  ResNet         │ Various   │ ImageNet         │ Simple/fast │
│  EfficientNet   │ Google    │ ImageNet         │ Mobile/edge │
└──────────────────────────────────────────────────────────────┘
"""
print(vision_encoders)

# %%
# Load and inspect CLIP (widely used in VLMs)
print("\n📷 Loading CLIP Vision Encoder...")

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Inspect architecture
print("\nCLIP Model Structure:")
print("-" * 50)

# Vision encoder
vision_encoder = clip_model.vision_model
print(f"Vision Encoder:")
print(f"  Hidden size: {vision_encoder.config.hidden_size}")
print(f"  Num layers:  {vision_encoder.config.num_hidden_layers}")
print(f"  Num heads:   {vision_encoder.config.num_attention_heads}")
print(f"  Image size:  {vision_encoder.config.image_size}")
print(f"  Patch size:  {vision_encoder.config.patch_size}")

# Count parameters
total_params = sum(p.numel() for p in clip_model.parameters())
vision_params = sum(p.numel() for p in vision_encoder.parameters())
print(f"\nParameter counts:")
print(f"  Vision encoder: {vision_params / 1e6:.1f}M")
print(f"  Total CLIP:     {total_params / 1e6:.1f}M")

# %%
# Load and inspect DINOv2 (excellent for dense features)
print("\n📷 Loading DINOv2...")

dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2.eval()

print("\nDINOv2 ViT-S/14:")
print(f"  Parameters: {sum(p.numel() for p in dinov2.parameters()) / 1e6:.1f}M")
print(f"  Embedding dim: {dinov2.embed_dim}")  # Read from the model instead of hardcoding
print(f"  Patch size: {dinov2.patch_size}")

# %%
# Load ResNet from timm (fast, simple baseline)
print("\n📷 Loading ResNet18 from timm...")

resnet = timm.create_model('resnet18', pretrained=True, num_classes=0)  # Remove classifier
resnet.eval()

print(f"\nResNet18 (feature extractor mode):")
print(f"  Parameters: {sum(p.numel() for p in resnet.parameters()) / 1e6:.1f}M")
print(f"  Output features: {resnet.num_features}")  # Read from the model instead of hardcoding

COMMON VISION ENCODERS

┌──────────────────────────────────────────────────────────────┐
│  Model          │ Source    │ Pretrained On    │ Best For    │
├──────────────────────────────────────────────────────────────┤
│  CLIP ViT       │ OpenAI    │ Image-text pairs │ VLM base    │
│  SigLIP         │ Google    │ Image-text pairs │ PaliGemma   │
│  DINOv2         │ Meta      │ Self-supervised  │ Dense feats │
│  ResNet         │ Various   │ ImageNet         │ Simple/fast │
│  EfficientNet   │ Google    │ ImageNet         │ Mobile/edge │
└──────────────────────────────────────────────────────────────┘


📷 Loading CLIP Vision Encoder...



CLIP Model Structure:
--------------------------------------------------
Vision Encoder:
  Hidden size: 768
  Num layers:  12
  Num heads:   12
  Image size:  224
  Patch size:  32

Parameter counts:
  Vision encoder: 87.5M
  Total CLIP:     151.3M

📷 Loading DINOv2...

DINOv2 ViT-S/14:
  Parameters: 22.1M
  Embedding dim: 384
  Patch size: 14

📷 Loading ResNet18 from timm...



ResNet18 (feature extractor mode):
  Parameters: 11.2M
  Output features: 512
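
The CLIP numbers printed above follow from simple ViT arithmetic. As a rough sketch, the sequence length is the number of patches plus one class token, and the patch embedding is a convolution whose kernel size equals the patch size. The function names are hypothetical; the defaults (one extra token, no bias on the patch convolution) are assumptions that reproduce the CLIP ViT-B/32 counts in the output.

```python
def vit_token_count(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Number of tokens a ViT processes: one per patch plus extra (e.g. class) tokens."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


def patch_embed_params(patch_size: int, hidden_size: int,
                       channels: int = 3, bias: bool = False) -> int:
    """Parameters in the patch-embedding convolution (kernel = stride = patch size)."""
    weights = channels * patch_size * patch_size * hidden_size
    return weights + (hidden_size if bias else 0)


# CLIP ViT-B/32 at 224x224: 7x7 patches + 1 class token
print(vit_token_count(224, 32))        # 50
# A ViT with patch size 14 at 224x224: 16x16 patches + 1 class token
print(vit_token_count(224, 14))        # 257
# Matches the 2.36M Conv2d parameters reported for the CLIP vision encoder
print(patch_embed_params(32, 768))     # 2359296
```

Smaller patches give more tokens per image, which is why a patch-14 model yields denser features than a patch-32 model at the same resolution, at a higher compute cost.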


## 4. Vision Language Models (VLMs)

VLMs combine vision encoders with language models.

In [13]:
print("=" * 60)
print("AVAILABLE VLMs")
print("=" * 60)

vlm_models = """
┌───────────────────────────────────────────────────────────────┐
│  Model              │ Size  │ Access    │ Notes               │
├───────────────────────────────────────────────────────────────┤
│  PaliGemma          │ 3B    │ Gated     │ Great for fine-tune │
│  LLaVA-1.5          │ 7/13B │ Open      │ Popular, versatile  │
│  Florence-2         │ 0.2B  │ Open      │ Lightweight         │
│  Qwen-VL            │ 7B    │ Open      │ Strong multilingual │
│  Phi-3-Vision       │ 4B    │ Open      │ Microsoft, compact  │
│  InternVL           │ Various│ Open     │ Strong benchmarks   │
└───────────────────────────────────────────────────────────────┘
"""
print(vlm_models)

# %%
# Load Florence-2 (lightweight, open)
print("\n🔮 Loading Florence-2 (lightweight VLM)...")

from transformers import AutoProcessor, AutoModelForCausalLM

florence_processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base",
    trust_remote_code=True
)

# Note: Florence-2 requires attn_implementation="eager" for compatibility
# with newer transformers versions
florence_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    trust_remote_code=True,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    attn_implementation="eager",  # Required for Florence-2 compatibility
).to(device)

print("\nFlorence-2 Architecture:")
print(f"  Total parameters: {sum(p.numel() for p in florence_model.parameters()) / 1e6:.1f}M")

# %%
# Test Florence-2 with a sample image
print("\n🧪 Testing Florence-2 inference...")

# Create a simple test image
test_image = Image.new('RGB', (224, 224), color=(100, 150, 200))

# Run inference
prompt = "<CAPTION>"
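# Cast floating-point inputs to the model's dtype (float16 on GPU) to avoid a dtype mismatch
inputs = florence_processor(text=prompt, images=test_image, return_tensors="pt").to(device, florence_model.dtype)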

with torch.no_grad():
    # Note: Florence-2 has compatibility issues with newer transformers
    # use_cache=False and num_beams=1 help avoid these issues
    outputs = florence_model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        num_beams=1,
        use_cache=False,  # Disable KV cache for compatibility
    )

result = florence_processor.decode(outputs[0], skip_special_tokens=True)
print(f"Caption: {result}")
print("✅ Florence-2 working correctly")

AVAILABLE VLMs

┌───────────────────────────────────────────────────────────────┐
│  Model              │ Size  │ Access    │ Notes               │
├───────────────────────────────────────────────────────────────┤
│  PaliGemma          │ 3B    │ Gated     │ Great for fine-tune │
│  LLaVA-1.5          │ 7/13B │ Open      │ Popular, versatile  │
│  Florence-2         │ 0.2B  │ Open      │ Lightweight         │
│  Qwen-VL            │ 7B    │ Open      │ Strong multilingual │
│  Phi-3-Vision       │ 4B    │ Open      │ Microsoft, compact  │
│  InternVL           │ Various│ Open     │ Strong benchmarks   │
└───────────────────────────────────────────────────────────────┘


🔮 Loading Florence-2 (lightweight VLM)...

Florence-2 Architecture:
  Total parameters: 231.4M

🧪 Testing Florence-2 inference...
Caption: a blue background with a white border
✅ Florence-2 working correctly


## 5. Inspecting Model Components

Understanding which parts of a model do what helps you decide what to freeze.

In [5]:
def inspect_model_layers(model: nn.Module, name: str = "Model"):
    """
    Inspect model architecture and parameter distribution.
    """
    print(f"\n{'='*60}")
    print(f"ARCHITECTURE: {name}")
    print(f"{'='*60}\n")

    total_params = 0
    layer_info = []

    # Use a distinct loop variable so the `name` argument is not shadowed
    for layer_name, module in model.named_modules():
        # Leaf modules only; parameters registered directly on container
        # modules (e.g. a class embedding) are not counted here
        if len(list(module.children())) == 0:
            params = sum(p.numel() for p in module.parameters())
            if params > 0:
                layer_info.append((layer_name, type(module).__name__, params))
                total_params += params

    # Group by layer type
    type_counts = {}
    for _, layer_type, params in layer_info:
        if layer_type not in type_counts:
            type_counts[layer_type] = {"count": 0, "params": 0}
        type_counts[layer_type]["count"] += 1
        type_counts[layer_type]["params"] += params

    print("Layer types:")
    for layer_type, info in sorted(type_counts.items(), key=lambda x: -x[1]["params"]):
        pct = 100 * info["params"] / total_params
        print(f"  {layer_type:20} x{info['count']:4}  {info['params']/1e6:>8.2f}M  ({pct:5.1f}%)")

    print(f"\nTotal: {total_params/1e6:.2f}M parameters")

    return layer_info

# Inspect CLIP vision encoder
_ = inspect_model_layers(clip_model.vision_model, "CLIP Vision Encoder")

# %%
def show_freezing_recommendation(model_type: str):
    """Show recommended freezing patterns for different model types."""

    recommendations = {
        "VLM": {
            "Vision Encoder": ("❄️ FREEZE", "Pretrained representations work well"),
            "Vision Projection": ("🔥 TRAIN", "Adapt to your domain"),
            "LLM Backbone": ("❄️ FREEZE", "Use LoRA adapters instead"),
            "LoRA Adapters": ("🔥 TRAIN", "Efficient fine-tuning"),
            "LM Head": ("🔥 TRAIN", "Task-specific outputs"),
        },
        "World Model": {
            "Video Encoder": ("❄️ or 🔥", "Depends on domain shift"),
            "Dynamics Model": ("🔥 TRAIN", "Core learning happens here"),
            "Decoder": ("❄️ or 🔥", "Depends on output needs"),
        },
        "VAM": {
            "Vision Encoder": ("❄️ FREEZE", "Use pretrained CLIP/DINOv2"),
            "State Encoder": ("🔥 TRAIN", "Task-specific"),
            "Policy Network": ("🔥 TRAIN", "Core learning"),
            "Action Head": ("🔥 TRAIN", "Output layer"),
        },
    }

    print(f"\n{'='*60}")
    print(f"FREEZING RECOMMENDATIONS: {model_type}")
    print(f"{'='*60}\n")

    for component, (status, reason) in recommendations[model_type].items():
        print(f"  {component:20} {status:12} - {reason}")


show_freezing_recommendation("VLM")
show_freezing_recommendation("VAM")


ARCHITECTURE: CLIP Vision Encoder

Layer types:
  Linear               x  72     85.02M  ( 97.2%)
  Conv2d               x   1      2.36M  (  2.7%)
  LayerNorm            x  26      0.04M  (  0.0%)
  Embedding            x   1      0.04M  (  0.0%)

Total: 87.46M parameters

FREEZING RECOMMENDATIONS: VLM

  Vision Encoder       ❄️ FREEZE    - Pretrained representations work well
  Vision Projection    🔥 TRAIN      - Adapt to your domain
  LLM Backbone         ❄️ FREEZE    - Use LoRA adapters instead
  LoRA Adapters        🔥 TRAIN      - Efficient fine-tuning
  LM Head              🔥 TRAIN      - Task-specific outputs

FREEZING RECOMMENDATIONS: VAM

  Vision Encoder       ❄️ FREEZE    - Use pretrained CLIP/DINOv2
  State Encoder        🔥 TRAIN      - Task-specific
  Policy Network       🔥 TRAIN      - Core learning
  Action Head          🔥 TRAIN      - Output layer
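
The recommendations above can be applied by toggling `requires_grad` on the relevant parameters. Here is a minimal sketch on a tiny stand-in model; `TinyVLM` and `freeze_by_prefix` are hypothetical names, and the layer sizes are arbitrary. A real VLM will use different submodule names, which you can find with `named_parameters()`.

```python
import torch.nn as nn


class TinyVLM(nn.Module):
    """Hypothetical stand-in with the three-part layout of a VLM."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Sequential(
            nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64)
        )
        self.projection = nn.Linear(64, 16)
        self.lm_head = nn.Linear(16, 10)


def freeze_by_prefix(model: nn.Module, frozen_prefixes: list) -> None:
    """Freeze parameters whose name starts with any prefix; train the rest."""
    prefixes = tuple(frozen_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith(prefixes)


def count_params(model: nn.Module) -> tuple:
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


model = TinyVLM()
freeze_by_prefix(model, ["vision_encoder."])
trainable, total = count_params(model)
print(f"Trainable: {trainable} / {total} ({100 * trainable / total:.1f}%)")

# Only parameters that still require gradients should reach the optimizer
trainable_params = [p for p in model.parameters() if p.requires_grad]
```

Filtering the optimizer's parameter list this way keeps optimizer state from being allocated for frozen weights.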


## 6. Loading for Fine-tuning

When fine-tuning, you often need:
1. Quantization (for memory efficiency)
2. LoRA setup (for parameter efficiency)

In [16]:
from peft import LoraConfig, get_peft_model, TaskType

print("=" * 60)
print("SETTING UP FOR FINE-TUNING")
print("=" * 60)

# Example: Setting up LoRA for a model
def setup_lora_example():
    """Demonstrate LoRA configuration."""

    lora_config = LoraConfig(
        r=16,                      # LoRA rank
        lora_alpha=32,             # Scaling factor
        lora_dropout=0.05,         # Dropout for regularization
        bias="none",               # Don't train biases
        task_type=TaskType.CAUSAL_LM,
        target_modules=[           # Which layers to add LoRA to
            "q_proj",              # Query projection (attention)
            "k_proj",              # Key projection
            "v_proj",              # Value projection
            "o_proj",              # Output projection
        ],
    )

    print("\nLoRA Configuration:")
    print(f"  Rank (r):        {lora_config.r}")
    print(f"  Alpha:           {lora_config.lora_alpha}")
    print(f"  Scaling:         {lora_config.lora_alpha / lora_config.r}")
    print(f"  Dropout:         {lora_config.lora_dropout}")
    print(f"  Target modules:  {lora_config.target_modules}")

    return lora_config


lora_config = setup_lora_example()

# %%
# Demonstrate quantization config
print("\n" + "=" * 60)
print("QUANTIZATION OPTIONS")
print("=" * 60)

quantization_options = """
┌──────────────────────────────────────────────────────────────┐
│  Precision     │ Memory │ Speed  │ Quality │ Use Case        │
├──────────────────────────────────────────────────────────────┤
│  FP32 (full)   │ 4x     │ Slow   │ Best    │ Final training  │
│  FP16/BF16     │ 2x     │ Fast   │ Good    │ Standard train  │
│  INT8          │ 1x     │ Fast   │ Good    │ Inference       │
│  INT4 (QLoRA)  │ 0.5x   │ Medium │ Good    │ Fine-tuning     │
└──────────────────────────────────────────────────────────────┘
"""
print(quantization_options)

# Try to set up bitsandbytes quantization
# Note: bitsandbytes requires CUDA and may need explicit installation on Colab
try:
    from transformers import BitsAndBytesConfig

    # 4-bit quantization config for QLoRA
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",           # Normalized float 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,       # Nested quantization
    )

    print("✅ 4-bit QLoRA Config created:")
    print(f"  Quant type:      nf4 (normalized float)")
    print(f"  Compute dtype:   bfloat16")
    print(f"  Double quant:    True (saves more memory)")

except Exception as e:
    print(f"⚠️  Could not create BitsAndBytesConfig: {e}")
    print("\nTo use quantization, install bitsandbytes:")
    print("  !pip install 'bitsandbytes>=0.42.0'")
    print("\nFor now, you can still fine-tune using FP16/BF16 without quantization.")
    bnb_config = None

print("""
Note on Quantization:
─────────────────────
• Requires NVIDIA GPU with CUDA support
• bitsandbytes may need: !pip install "bitsandbytes>=0.42.0"
• If unavailable, use torch_dtype=torch.float16 instead
""")

SETTING UP FOR FINE-TUNING

LoRA Configuration:
  Rank (r):        16
  Alpha:           32
  Scaling:         2.0
  Dropout:         0.05
  Target modules:  {'q_proj', 'v_proj', 'o_proj', 'k_proj'}

QUANTIZATION OPTIONS

┌──────────────────────────────────────────────────────────────┐
│  Precision     │ Memory │ Speed  │ Quality │ Use Case        │
├──────────────────────────────────────────────────────────────┤
│  FP32 (full)   │ 4x     │ Slow   │ Best    │ Final training  │
│  FP16/BF16     │ 2x     │ Fast   │ Good    │ Standard train  │
│  INT8          │ 1x     │ Fast   │ Good    │ Inference       │
│  INT4 (QLoRA)  │ 0.5x   │ Medium │ Good    │ Fine-tuning     │
└──────────────────────────────────────────────────────────────┘

✅ 4-bit QLoRA Config created:
  Quant type:      nf4 (normalized float)
  Compute dtype:   bfloat16
  Double quant:    True (saves more memory)

Note on Quantization:
─────────────────────
• Requires NVIDIA GPU with CUDA support
• bitsandbytes may need: !pi
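
To see why LoRA is parameter-efficient, it helps to count what the configuration above would add. For each targeted linear layer of shape `d_out x d_in`, LoRA trains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below is plain arithmetic; the model dimensions (hidden size 4096, 32 layers, square projections) are hypothetical and chosen only for illustration.

```python
def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def lora_total_params(hidden_size: int, num_layers: int, num_targets: int, r: int) -> int:
    """Total LoRA parameters, assuming every target is a square hidden x hidden projection."""
    per_layer = num_targets * lora_params_per_linear(hidden_size, hidden_size, r)
    return num_layers * per_layer


# Values from the LoRA configuration above
r, alpha = 16, 32
scaling = alpha / r  # Matches the "Scaling: 2.0" line in the output

# Hypothetical model dimensions, for illustration only
hidden_size, num_layers = 4096, 32
num_targets = 4  # q_proj, k_proj, v_proj, o_proj

added = lora_total_params(hidden_size, num_layers, num_targets, r)
frozen = num_layers * num_targets * hidden_size * hidden_size
print(f"LoRA scaling:      {scaling}")
print(f"LoRA parameters:   {added / 1e6:.1f}M")
print(f"Frozen projection: {frozen / 1e6:.1f}M")
print(f"Trainable share:   {100 * added / frozen:.2f}%")
```

Doubling `r` doubles the adapter size, so the rank is the main knob for trading capacity against memory.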

## 7. Robotics-Specific Models

Specialized pretrained models exist for Vision Action Models (VAMs) and related robotics policies.

In [7]:
print("=" * 60)
print("ROBOTICS PRETRAINED MODELS")
print("=" * 60)

robotics_models = """
┌──────────────────────────────────────────────────────────────┐
│  Model       │ Type        │ Source      │ Installation      │
├──────────────────────────────────────────────────────────────┤
│  Octo        │ VAM         │ Berkeley    │ pip install octo  │
│  OpenVLA     │ VLA         │ Stanford    │ HuggingFace       │
│  LeRobot     │ VAM toolkit │ HuggingFace │ pip install lerobot│
│  RT-X        │ VAM         │ Google      │ Research only     │
└──────────────────────────────────────────────────────────────┘

LeRobot includes pretrained policies:
  • lerobot/diffusion_pusht  - Diffusion Policy on PushT
  • lerobot/act_aloha_sim   - ACT on ALOHA simulation
  • More available on HuggingFace Hub
"""
print(robotics_models)

# %%
# Show how to search for available models
print("\n🔍 Searching for robotics models on HuggingFace...")

try:
    # Search for LeRobot models
    models = list(list_models(search="lerobot", limit=5))
    print("\nLeRobot models available:")
    for m in models[:5]:
        print(f"  • {m.id}")
except Exception as e:
    print(f"Could not search: {e}")
    print("\nKnown LeRobot models:")
    print("  • lerobot/diffusion_pusht")
    print("  • lerobot/act_aloha_sim_transfer_cube_human")
    print("  • lerobot/act_aloha_sim_insertion_human")

ROBOTICS PRETRAINED MODELS

┌──────────────────────────────────────────────────────────────┐
│  Model       │ Type        │ Source      │ Installation      │
├──────────────────────────────────────────────────────────────┤
│  Octo        │ VAM         │ Berkeley    │ pip install octo  │
│  OpenVLA     │ VLA         │ Stanford    │ HuggingFace       │
│  LeRobot     │ VAM toolkit │ HuggingFace │ pip install lerobot│
│  RT-X        │ VAM         │ Google      │ Research only     │
└──────────────────────────────────────────────────────────────┘

LeRobot includes pretrained policies:
  • lerobot/diffusion_pusht  - Diffusion Policy on PushT
  • lerobot/act_aloha_sim   - ACT on ALOHA simulation
  • More available on HuggingFace Hub


🔍 Searching for robotics models on HuggingFace...

LeRobot models available:
  • lerobot/pi05_base
  • lerobot/xvla-base
  • lerobot/smolvla_base
  • lerobot/pi05_libero_finetuned
  • lerobot/xvla-google-robot


## 8. Running Inference

Let's verify our models work with example inputs.

In [8]:
# Test CLIP inference
print("=" * 60)
print("TESTING MODEL INFERENCE")
print("=" * 60)

print("\n🧪 Testing CLIP...")

# Create test image
test_image = Image.new('RGB', (224, 224), color=(180, 100, 100))

# Prepare inputs
texts = ["a red square", "a blue circle", "a photo of a cat"]
inputs = clip_processor(
    text=texts,
    images=test_image,
    return_tensors="pt",
    padding=True,
)

# Run inference
clip_model.eval()
with torch.no_grad():
    outputs = clip_model(**inputs)

# Get similarity scores
image_features = outputs.image_embeds
text_features = outputs.text_embeds

# Normalize
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarities = (image_features @ text_features.T).squeeze()

print("Image-text similarities:")
for text, sim in zip(texts, similarities):
    print(f"  '{text}': {sim.item():.3f}")

print("✅ CLIP working correctly")

# %%
# Test DINOv2 feature extraction
print("\n🧪 Testing DINOv2...")

dinov2.eval()

# Prepare input (DINOv2 expects ImageNet-normalized input; 224 is divisible by the patch size of 14)
from torchvision import transforms

dino_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img_tensor = dino_transform(test_image).unsqueeze(0)

with torch.no_grad():
    features = dinov2(img_tensor)

print(f"DINOv2 output shape: {features.shape}")
print(f"Feature dimension: {features.shape[-1]}")
print("✅ DINOv2 working correctly")

TESTING MODEL INFERENCE

🧪 Testing CLIP...
Image-text similarities:
  'a red square': 0.261
  'a blue circle': 0.222
  'a photo of a cat': 0.225
✅ CLIP working correctly

🧪 Testing DINOv2...
DINOv2 output shape: torch.Size([1, 384])
Feature dimension: 384
✅ DINOv2 working correctly
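
The raw cosine similarities above are close together, which makes them hard to read as a ranking. CLIP-style models usually multiply the similarities by a learned temperature before a softmax. The sketch below reuses the three similarity values from the CLIP output and assumes a logit scale of 100; the real value is stored in the model and may differ, and `CLIPModel` outputs also expose already-scaled `logits_per_image`.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


texts = ["a red square", "a blue circle", "a photo of a cat"]
similarities = [0.261, 0.222, 0.225]  # Cosine similarities from the CLIP output above
logit_scale = 100.0                   # Assumed temperature, not read from the model

probs = softmax([logit_scale * s for s in similarities])
for text, prob in zip(texts, probs):
    print(f"  '{text}': {prob:.3f}")

best = texts[probs.index(max(probs))]
print(f"Best match: {best}")
```

After scaling, a gap of a few hundredths in cosine similarity becomes a clear preference for one caption.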


## 9. Memory and Compute Considerations

In [9]:
print("=" * 60)
print("MEMORY REQUIREMENTS")
print("=" * 60)

memory_table = """
Model                    FP32      FP16      INT8      INT4
─────────────────────────────────────────────────────────────
CLIP ViT-B/32           0.6 GB    0.3 GB    0.15 GB   0.08 GB
CLIP ViT-L/14           1.6 GB    0.8 GB    0.4 GB    0.2 GB
DINOv2 ViT-S/14         0.1 GB    0.05 GB   0.025 GB  0.013 GB
DINOv2 ViT-L/14         1.2 GB    0.6 GB    0.3 GB    0.15 GB
Florence-2-base         0.9 GB    0.45 GB   0.23 GB   0.12 GB
PaliGemma-3B            12 GB     6 GB      3 GB      1.5 GB
LLaVA-7B                28 GB     14 GB     7 GB      3.5 GB

Training Memory (approximate, add 2-3x for optimizer states):
─────────────────────────────────────────────────────────────
Full fine-tune          ~4x model size
LoRA fine-tune          ~1.5x model size
QLoRA fine-tune         ~1.2x model size
Inference only          ~1x model size
"""
print(memory_table)

# Show current GPU memory
if torch.cuda.is_available():
    print(f"\nCurrent GPU memory:")
    print(f"  Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"  Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    print(f"  Total:     {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

MEMORY REQUIREMENTS

Model                    FP32      FP16      INT8      INT4
─────────────────────────────────────────────────────────────
CLIP ViT-B/32           0.6 GB    0.3 GB    0.15 GB   0.08 GB
CLIP ViT-L/14           1.6 GB    0.8 GB    0.4 GB    0.2 GB
DINOv2 ViT-S/14         0.1 GB    0.05 GB   0.025 GB  0.013 GB
DINOv2 ViT-L/14         1.2 GB    0.6 GB    0.3 GB    0.15 GB
Florence-2-base         0.9 GB    0.45 GB   0.23 GB   0.12 GB
PaliGemma-3B            12 GB     6 GB      3 GB      1.5 GB
LLaVA-7B                28 GB     14 GB     7 GB      3.5 GB

Training Memory (approximate, add 2-3x for optimizer states):
─────────────────────────────────────────────────────────────
Full fine-tune          ~4x model size
LoRA fine-tune          ~1.5x model size
QLoRA fine-tune         ~1.2x model size
Inference only          ~1x model size
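
The weight-memory columns in the table are parameter count times bytes per parameter. A minimal sketch of that arithmetic, using the parameter counts measured earlier in this tutorial; the function name `weight_memory_gb` is hypothetical, and the estimate covers weights only (no activations, KV cache, or optimizer state).

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Parameter counts printed earlier in this tutorial
measured = {
    "CLIP ViT-B/32": 151.3e6,
    "DINOv2 ViT-S/14": 22.1e6,
    "Florence-2-base": 231.4e6,
}

for model_name, num_params in measured.items():
    row = "  ".join(
        f"{precision}: {weight_memory_gb(num_params, precision):.2f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{model_name:18} {row}")
```

The same function gives a quick first estimate for any model once you know its parameter count.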



## 10. Choosing the Right Model

In [10]:
print("=" * 60)
print("MODEL SELECTION GUIDE")
print("=" * 60)

selection_guide = """
FOR VISION LANGUAGE TASKS:
─────────────────────────
• Limited compute (T4):     Florence-2, PaliGemma-3B (4-bit)
• Standard compute (A100):  LLaVA-7B, PaliGemma-3B (16-bit)
• Best quality:             LLaVA-13B, Qwen-VL-Chat

FOR VISION ENCODERS:
───────────────────
• General purpose:          CLIP ViT-L/14
• Dense features needed:    DINOv2 ViT-L/14
• Speed critical:           ResNet18, EfficientNet-B0
• Robotics:                 CLIP or DINOv2 (commonly used)

FOR ROBOTICS POLICIES:
─────────────────────
• Starting point:           LeRobot pretrained policies
• Cross-embodiment:         Octo
• Language-conditioned:     OpenVLA
• Custom task:              Train from scratch with BC/Diffusion

DECISION FACTORS:
────────────────
1. Available GPU memory
2. Inference speed requirements
3. Quality requirements
4. Domain similarity to pretraining data
"""
print(selection_guide)

MODEL SELECTION GUIDE

FOR VISION LANGUAGE TASKS:
─────────────────────────
• Limited compute (T4):     Florence-2, PaliGemma-3B (4-bit)
• Standard compute (A100):  LLaVA-7B, PaliGemma-3B (16-bit)
• Best quality:             LLaVA-13B, Qwen-VL-Chat

FOR VISION ENCODERS:
───────────────────
• General purpose:          CLIP ViT-L/14
• Dense features needed:    DINOv2 ViT-L/14  
• Speed critical:           ResNet18, EfficientNet-B0
• Robotics:                 CLIP or DINOv2 (commonly used)

FOR ROBOTICS POLICIES:
─────────────────────
• Starting point:           LeRobot pretrained policies
• Cross-embodiment:         Octo
• Language-conditioned:     OpenVLA
• Custom task:              Train from scratch with BC/Diffusion

DECISION FACTORS:
────────────────
1. Available GPU memory
2. Inference speed requirements
3. Quality requirements
4. Domain similarity to pretraining data
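
The first decision factor, available GPU memory, can be turned into a quick check. The sketch below picks the highest inference precision whose weights fit on a given GPU. The function name `pick_inference_precision` and the 1.2x headroom factor for activations are assumptions for illustration, not measured values.

```python
from typing import Optional

# Highest precision first; bytes per parameter
PRECISIONS = [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]


def pick_inference_precision(num_params: float, gpu_memory_gb: float,
                             headroom: float = 1.2) -> Optional[str]:
    """Return the highest precision that fits, or None if even INT4 does not."""
    for precision, bytes_per_param in PRECISIONS:
        needed_gb = num_params * bytes_per_param / 1e9 * headroom
        if needed_gb <= gpu_memory_gb:
            return precision
    return None


# Rough parameter counts from the tables in this tutorial, on a 16 GB GPU
for model_name, num_params in [("PaliGemma-3B", 3e9), ("LLaVA-7B", 7e9)]:
    choice = pick_inference_precision(num_params, gpu_memory_gb=16)
    print(f"{model_name}: {choice}")
```

Fine-tuning needs more headroom than inference, so treat the result as an upper bound on what will fit during training.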

