# Lab 4.1.1: Vision-Language Models

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how vision-language models combine visual and text understanding
- [ ] Use LLaVA to analyze and describe images
- [ ] Use CLIP for image-text similarity matching
- [ ] Build a practical image analysis pipeline
- [ ] Optimize VLM inference for DGX Spark's 128GB memory

---

## 📚 Prerequisites

- Completed: Module 3.6 (AI Agents)
- Knowledge of: Transformers, PyTorch basics
- Running in: NGC PyTorch container

---

## 🌍 Real-World Context

Vision-language models are transforming how we interact with visual content:

- **Accessibility**: Screen readers that describe images for visually impaired users
- **E-commerce**: Automated product tagging and description generation
- **Healthcare**: Analyzing medical images alongside patient records
- **Security**: Understanding surveillance footage with natural language queries
- **Creative**: AI assistants that can "see" and discuss your work

---

## 🧒 ELI5: What are Vision-Language Models?

> **Imagine you have a friend who speaks two languages fluently** - let's call them "Image" and "English." When you show them a photo, they can describe it in words. When you ask a question about the photo, they understand both the picture AND your words.
>
> Vision-language models work the same way! They have:
> 1. **An "eye"** (vision encoder) - like CLIP's vision transformer - that converts images into a language the model understands
> 2. **A "brain"** (language model) - like LLaMA - that can read, write, and reason
> 3. **A "translator"** (projection layer) - that connects the eye to the brain
>
> **In AI terms:** VLMs encode images into the same representation space as text, allowing a language model to "understand" visual content as if it were reading about it.

---

## Part 1: Environment Setup

First, let's verify our DGX Spark environment and install dependencies.

In [None]:
# Check GPU availability and memory
import torch

print("=" * 50)
print("DGX Spark Environment Check")
print("=" * 50)

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    print(f"GPU: {device.name}")
    print(f"Total Memory: {device.total_memory / 1024**3:.1f} GB")
    print(f"Compute Capability: {device.major}.{device.minor}")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("WARNING: No GPU detected! VLMs require GPU acceleration.")

# Memory status
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"\nMemory Status:")
    print(f"  Allocated: {allocated:.2f} GB")
    print(f"  Reserved: {reserved:.2f} GB")

In [None]:
# Install required packages (run once)
# Version specifiers are quoted so the shell does not treat ">" as a redirect
# !pip install "transformers>=4.45.0" "accelerate>=0.27.0" "bitsandbytes>=0.42.0" "pillow>=10.0.0" requests

In [None]:
# Import libraries
import gc
import time
import requests
from io import BytesIO
from pathlib import Path
from typing import Optional, Union, List

import torch
from PIL import Image
import matplotlib.pyplot as plt

# Set default dtype for Blackwell optimization
torch.set_default_dtype(torch.bfloat16)

print("Libraries imported successfully!")

### 🔍 What Just Happened?

We've set up our environment with:
- **torch.bfloat16**: The optimal data type for Blackwell GPUs (DGX Spark's GB10 chip)
- **Memory monitoring**: Essential when working with large models
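
To make the bfloat16 choice more concrete, here is a minimal pure-Python sketch of the format itself. bfloat16 keeps float32's sign bit and 8 exponent bits but only 7 mantissa bits, so it can be emulated by keeping the top 16 bits of a float32. The helper `to_bfloat16` is a hypothetical name, and it truncates where real hardware typically rounds, so treat it as an illustration, not as what PyTorch does internally.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Illustration only: this truncates, whereas real hardware typically rounds.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bits &= 0xFFFF0000                                    # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for value in (1.0, 3.14159, 1e38):
    print(f"{value!r:>10} -> {to_bfloat16(value)!r}")
```

Precision drops to roughly 2-3 significant decimal digits (3.14159 becomes 3.140625), but the exponent range matches float32, so a value like 1e38 stays finite where float16 would overflow.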

---

## Part 2: Understanding CLIP - The Foundation

Before diving into full VLMs, let's understand CLIP - the model that made vision-language AI practical.

### 🧒 ELI5: How CLIP Works

> **Imagine a game of matching cards.** On one side, you have image cards. On the other, you have text cards describing images.
>
> CLIP learned to play this game by looking at 400 million image-text pairs from the internet. It learned to put matching image and text cards close together in a "magic space" where similar things cluster.
>
> Now, when you give it a new image, it can find the closest text descriptions. When you give it text, it can find matching images!

In [None]:
from transformers import CLIPModel, CLIPProcessor

# Load CLIP - it's lightweight (~2GB)
print("Loading CLIP model...")
start_time = time.time()

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Move to GPU
clip_model = clip_model.to("cuda")
clip_model.eval()

print(f"Loaded in {time.time() - start_time:.1f}s")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

In [None]:
# Helper function to load images
def load_image(source: Union[str, Path]) -> Image.Image:
    """Load image from URL or local path."""
    source_str = str(source)
    
    if source_str.startswith(("http://", "https://")):
        response = requests.get(source_str, timeout=10)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(source_str)
    
    return image.convert("RGB")

# Load a sample image
sample_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
sample_image = load_image(sample_url)

# Display the image
plt.figure(figsize=(8, 6))
plt.imshow(sample_image)
plt.axis('off')
plt.title("Sample Image")
plt.show()

In [None]:
# CLIP: Image-Text Similarity
# Let's see how CLIP matches images with text descriptions

text_options = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a car",
    "a photo of a house",
]

# Process inputs
inputs = clip_processor(
    text=text_options,
    images=sample_image,
    return_tensors="pt",
    padding=True
)

# Move to GPU
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Get similarity scores
with torch.no_grad():
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image  # Image-to-text similarity
    probs = logits_per_image.softmax(dim=1)  # Convert to probabilities

# Display results
print("\n📊 CLIP Image-Text Matching Results:")
print("=" * 40)
for text, prob in zip(text_options, probs[0]):
    bar = "█" * int(prob * 40)
    print(f"{text:25} {prob:.1%} {bar}")

best_match = text_options[probs.argmax()]
print(f"\n🎯 Best match: '{best_match}'")

### 🔍 What Just Happened?

CLIP computed embeddings for the image and for each text option, then scored every image-text pair by cosine similarity (scaled by a learned temperature). The softmax turns those scores into probabilities over the text options, and the description with the highest probability is the best match!

**Key Insight**: CLIP can do "zero-shot" classification - it can recognize objects it's never been explicitly trained to classify, just by matching to text descriptions.
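
The scoring step can be sketched without any model at all. The block below is a toy under stated assumptions: the 3-dimensional embeddings are made up, the function names are hypothetical, and `logit_scale=100.0` is an assumed stand-in for CLIP's learned temperature (read the real value from the loaded model if you need it).

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_probabilities(image_emb: Sequence[float],
                        text_embs: Sequence[Sequence[float]],
                        logit_scale: float = 100.0) -> List[float]:
    """Scaled cosine similarities turned into probabilities over the texts."""
    logits = [logit_scale * cosine_similarity(image_emb, t) for t in text_embs]
    return softmax(logits)

# Made-up embeddings, for illustration only
image_emb = [0.9, 0.1, 0.0]
text_embs = {"cat": [1.0, 0.0, 0.0], "dog": [0.6, 0.8, 0.0], "car": [0.0, 0.0, 1.0]}

probs = match_probabilities(image_emb, list(text_embs.values()))
for label, p in zip(text_embs, probs):
    print(f"{label:5} {p:.1%}")
```

Because the similarities are multiplied by a large scale before the softmax, even a modest gap in cosine similarity produces a very confident probability. That is why the bar chart above tends to show one dominant option.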

---

### ✋ Try It Yourself

Modify the code above to:
1. Try a different image URL
2. Add more text options (be creative!)
3. Try more specific descriptions like "an orange tabby cat sleeping"

<details>
<summary>💡 Hint</summary>

More specific text descriptions often work better! Try:
- "a close-up photo of a cat's face"
- "a ginger/orange cat"
- "a domestic short-haired cat"
</details>
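
One way to experiment with many descriptions is to generate the text options from a template. This is a minimal sketch: `build_prompts` and its default template are hypothetical, chosen to mirror the "a photo of ..." phrasing used in the cell above.

```python
from typing import List

def build_prompts(labels: List[str], template: str = "a photo of {}") -> List[str]:
    """Fill each label into the template to produce CLIP text options."""
    return [template.format(label) for label in labels]

labels = ["a cat", "an orange tabby cat sleeping", "a close-up of a cat's face"]
text_options = build_prompts(labels)
print(text_options)
```

The resulting list can be passed as `text=` to `clip_processor` in place of the hand-written `text_options`.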

---

## Part 3: LLaVA - Visual Language Assistant

Now let's move to a full vision-language model that can have conversations about images!

### 🧒 ELI5: How LLaVA Works

> **LLaVA is like a very smart friend who can look at a photo and answer ANY question about it.**
>
> Here's how it works:
> 1. **The Eye (CLIP Vision Encoder)**: Looks at the image and creates a "summary" of what it sees
> 2. **The Translator (Projection Layer)**: Converts that visual summary into words the brain can understand
> 3. **The Brain (Vicuna, a fine-tuned LLaMA)**: A powerful language model that can reason about what it "sees"
>
> Unlike CLIP which just matches images to text, LLaVA can generate NEW text describing what it sees!
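
The three roles above can be sketched with plain Python lists. This is a toy under stated assumptions: the dimensions are tiny, the projection is a single made-up matrix (LLaVA-1.5's actual projector is a small MLP), and `project_patches` and `splice_image_embeddings` are hypothetical names, not library functions.

```python
from typing import List

Vector = List[float]

def project_patches(patch_features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map each vision feature (d_vision) to the LLM embedding size (d_llm).

    `weight` has d_llm rows, each of length d_vision.
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) for row in weight]
        for feat in patch_features
    ]

def splice_image_embeddings(tokens: List[str],
                            text_embeds: List[Vector],
                            image_embeds: List[Vector],
                            placeholder: str = "<image>") -> List[Vector]:
    """Replace the placeholder's embedding with the projected image embeddings."""
    sequence: List[Vector] = []
    for token, embed in zip(tokens, text_embeds):
        if token == placeholder:
            sequence.extend(image_embeds)  # one position per image patch
        else:
            sequence.append(embed)
    return sequence

# Toy sizes: 4 patches, d_vision = 3, d_llm = 2
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
weight = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
image_embeds = project_patches(patches, weight)

tokens = ["USER:", "<image>", "What", "is", "this?"]
text_embeds = [[0.0, 0.0] for _ in tokens]
sequence = splice_image_embeddings(tokens, text_embeds, image_embeds)
print(len(sequence))  # 4 text positions + 4 image positions
```

After the splice, the language model sees one long sequence of embeddings and cannot tell which positions came from text and which came from the image.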

In [None]:
# Clean up CLIP to free memory for LLaVA
del clip_model, clip_processor
gc.collect()              # drop unreachable Python objects first
torch.cuda.empty_cache()  # then return cached GPU blocks to the driver

print(f"Freed memory. Current usage: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

In [None]:
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load LLaVA-1.5-7B
# This model uses ~16GB of our 128GB - plenty of headroom!
print("Loading LLaVA-1.5-7B...")
print("(This may take a minute on first run as it downloads the model)")
start_time = time.time()

model_name = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Optimal for Blackwell
    device_map="auto",           # Automatically use GPU
    low_cpu_mem_usage=True,
)

print(f"\n✅ Loaded in {time.time() - start_time:.1f}s")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB / 128 GB")

In [None]:
def analyze_image(image: Image.Image, question: str, max_new_tokens: int = 256) -> str:
    """
    Analyze an image and answer a question about it.
    
    Args:
        image: PIL Image to analyze
        question: Question to answer about the image
        max_new_tokens: Maximum length of response
        
    Returns:
        Model's response
    """
    # Create the conversation prompt
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    
    # Process inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Generate response
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Deterministic for reproducibility
        )
    
    # Decode response
    response = processor.decode(output_ids[0], skip_special_tokens=True)
    
    # Extract just the assistant's response
    if "ASSISTANT:" in response:
        response = response.split("ASSISTANT:")[-1].strip()
    
    return response
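
The prompt construction and answer extraction inside `analyze_image` are plain string operations, so they can be factored out and checked without loading a model. A minimal sketch, where `build_prompt` and `extract_answer` are hypothetical helper names and the `USER:`/`ASSISTANT:` template is the one used in the function above:

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template with an image placeholder."""
    return f"USER: <image>\n{question}\nASSISTANT:"

def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Keep only the text after the last marker; fall back to the full string."""
    if marker in decoded:
        return decoded.split(marker)[-1].strip()
    return decoded.strip()

prompt = build_prompt("What color is this cat?")
# Stand-in for decoded model output: the prompt echoed back plus an answer
decoded = prompt.replace("<image>", "") + " The cat is orange."
print(extract_answer(decoded))
```

Splitting on the last marker is simple but has a caveat: if the generated answer itself contains the string `ASSISTANT:`, everything before that occurrence is dropped.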

In [None]:
# Let's test it with our cat image!
question = "Describe this image in detail. What do you see?"

print("🖼️  Image Analysis")
print("=" * 50)
print(f"Question: {question}\n")

start_time = time.time()
response = analyze_image(sample_image, question)
elapsed = time.time() - start_time

print(f"Response: {response}")
print(f"\n⏱️  Generated in {elapsed:.2f}s")

In [None]:
# Let's try different types of questions!

questions = [
    "What color is this cat?",
    "What is the cat's expression? Does it look happy, curious, or sleepy?",
    "Is this cat indoors or outdoors?",
    "What breed might this cat be?",
]

print("🔍 Multi-Question Analysis")
print("=" * 50)

for q in questions:
    print(f"\n❓ {q}")
    response = analyze_image(sample_image, q, max_new_tokens=100)
    print(f"💬 {response}")

### 🔍 What Just Happened?

LLaVA processed our questions in two stages:
1. **Visual encoding**: The image was converted to visual tokens (like "words" describing the image)
2. **Language generation**: The LLM reasoned about the visual tokens and our question to generate a response

**Notice**: The model can answer different types of questions - descriptive, emotional, spatial, and even make educated guesses about the breed!
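
The number of visual tokens follows from the vision encoder's geometry. As a worked sketch, assume the checkpoint's vision tower is a ViT with 14-pixel patches on 336×336 inputs (the configuration used by LLaVA-1.5; inspect `model.config.vision_config` to confirm it for your checkpoint). The helper `num_visual_tokens` is a hypothetical name.

```python
def num_visual_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

print(num_visual_tokens())         # 24 patches per side
print(num_visual_tokens(224, 14))  # 16 patches per side
```

Under these assumptions each image contributes 576 positions to the sequence, which is why even a one-line question produces a prompt several hundred tokens long.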

---

## Part 4: Practical Application - Image Analysis Pipeline

Let's build a more sophisticated pipeline that can analyze multiple aspects of an image.

In [None]:
def comprehensive_image_analysis(image: Image.Image) -> dict:
    """
    Perform comprehensive analysis of an image.
    
    Returns a structured analysis with multiple aspects.
    """
    analysis = {}
    
    # 1. General Description
    analysis["description"] = analyze_image(
        image,
        "Describe this image in 2-3 sentences. Focus on the main subject and setting.",
        max_new_tokens=150
    )
    
    # 2. Object Listing
    analysis["objects"] = analyze_image(
        image,
        "List all the objects you can see in this image, separated by commas.",
        max_new_tokens=100
    )
    
    # 3. Colors & Style
    analysis["colors_style"] = analyze_image(
        image,
        "Describe the main colors and visual style of this image.",
        max_new_tokens=80
    )
    
    # 4. Mood/Atmosphere
    analysis["mood"] = analyze_image(
        image,
        "What is the mood or atmosphere of this image? Use 2-3 adjectives.",
        max_new_tokens=50
    )
    
    # 5. Suggested Caption
    analysis["caption"] = analyze_image(
        image,
        "Suggest a creative caption for this image suitable for social media.",
        max_new_tokens=50
    )
    
    return analysis

# Run the analysis
print("🔬 Comprehensive Image Analysis")
print("=" * 60)

start_time = time.time()
results = comprehensive_image_analysis(sample_image)
total_time = time.time() - start_time

for key, value in results.items():
    print(f"\n📌 {key.upper().replace('_', ' ')}:")
    print(f"   {value}")

print(f"\n⏱️  Total analysis time: {total_time:.2f}s")
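
The five calls in `comprehensive_image_analysis` share one pattern, so the pipeline can also be written as data plus a loop. This is a sketch under assumptions: `ANALYSIS_SPECS`, `run_analysis`, and the injected `ask` callable are hypothetical names. In the notebook `ask` would be `analyze_image`; here a stub stands in so the block runs without a model.

```python
from typing import Any, Callable, Dict, Tuple

# aspect name -> (prompt, max_new_tokens), copied from the function above
ANALYSIS_SPECS: Dict[str, Tuple[str, int]] = {
    "description": ("Describe this image in 2-3 sentences. Focus on the main subject and setting.", 150),
    "objects": ("List all the objects you can see in this image, separated by commas.", 100),
    "colors_style": ("Describe the main colors and visual style of this image.", 80),
    "mood": ("What is the mood or atmosphere of this image? Use 2-3 adjectives.", 50),
    "caption": ("Suggest a creative caption for this image suitable for social media.", 50),
}

def run_analysis(image: Any,
                 ask: Callable[..., str],
                 specs: Dict[str, Tuple[str, int]] = ANALYSIS_SPECS) -> Dict[str, str]:
    """Ask one question per aspect and collect the answers by aspect name."""
    return {
        key: ask(image, prompt, max_new_tokens=limit)
        for key, (prompt, limit) in specs.items()
    }

def fake_ask(image: Any, question: str, max_new_tokens: int = 256) -> str:
    """Stub standing in for analyze_image so the sketch runs without a GPU."""
    return f"[{max_new_tokens}] {question[:20]}"

results = run_analysis(image=None, ask=fake_ask)
print(list(results))
```

Keeping the prompts in a dictionary makes it easy to add or remove aspects without touching the loop, and passing `ask` as a parameter lets you test the pipeline logic separately from the model.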

### ✋ Try It Yourself: Analyze Your Own Image

Try analyzing a different image! You can:
1. Use a URL to any image on the web
2. Upload a local image to your workspace

In [None]:
# Exercise: Try your own image!
# Uncomment and modify the URL below:

# your_image_url = "YOUR_IMAGE_URL_HERE"
# your_image = load_image(your_image_url)

# plt.figure(figsize=(8, 6))
# plt.imshow(your_image)
# plt.axis('off')
# plt.show()

# Ask your own question:
# your_question = "What is happening in this image?"
# response = analyze_image(your_image, your_question)
# print(response)

---

## Part 5: DGX Spark Optimization - Loading Larger Models

With 128GB of unified memory, we can run much larger models! Let's explore our options.
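
A rough way to reason about what fits is to estimate weight memory as parameter count times bytes per parameter. The sketch below uses a hypothetical helper, `estimate_weight_gib`, and nominal parameter counts. It counts weights only; activations, the KV cache, and quantization metadata add overhead, so real usage will be higher.

```python
def estimate_weight_gib(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1024**3

configs = [
    ("7B @ bf16", 7, 16),
    ("13B @ bf16", 13, 16),
    ("13B @ 4-bit", 13, 4),
    ("72B @ 4-bit", 72, 4),
]

for name, params, bits in configs:
    print(f"{name:12} ~{estimate_weight_gib(params, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is why 4-bit quantization brings models with tens of billions of parameters within reach of a single 128GB machine.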

In [None]:
# Model Size Comparison for DGX Spark
print("📊 VLM Model Sizes on DGX Spark (128GB)")
print("=" * 60)

# (model, approximate memory in GB, notes); sizes are rough estimates
models = [
    ("LLaVA-1.5-7B", 16, "BF16, fastest"),
    ("LLaVA-1.5-13B", 28, "Better quality, still fast"),
    ("Qwen2-VL-7B", 18, "Excellent for documents"),
    ("Qwen2-VL-72B (4-bit)", 45, "Fits only with quantization"),
    ("LLaVA-NeXT-34B", 70, "BF16, large model that still fits"),
]

total_memory = 128

for name, vram, notes in models:
    usage_pct = (vram / total_memory) * 100
    bar = "█" * int(usage_pct / 2) + "░" * (50 - int(usage_pct / 2))
    fits = "✅" if vram < total_memory else "❌"
    print(f"{fits} {name:25} {vram:3}GB [{bar}] {usage_pct:.0f}%")
    print(f"   └─ {notes}")

In [None]:
# Example: Loading LLaVA-13B with 4-bit quantization
# This is optional - only run if you want to try a larger model

LOAD_LARGER_MODEL = False  # Change to True to try 13B model

if LOAD_LARGER_MODEL:
    # Clean up current model
    del model, processor
    gc.collect()
    torch.cuda.empty_cache()
    
    from transformers import BitsAndBytesConfig
    
    # Configure 4-bit quantization
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    
    print("Loading LLaVA-1.5-13B with 4-bit quantization...")
    
    model_name = "llava-hf/llava-1.5-13b-hf"
    
    processor = AutoProcessor.from_pretrained(model_name)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        quantization_config=quantization_config,
        low_cpu_mem_usage=True,
    )
    
    print(f"\n✅ Loaded 13B model!")
    print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
else:
    print("Using existing 7B model. Set LOAD_LARGER_MODEL=True to try 13B.")

---

## ⚠️ Common Mistakes

### Mistake 1: Wrong Image Format
```python
# ❌ Wrong: Using RGBA image directly
image = Image.open("screenshot.png")  # May have alpha channel!
inputs = processor(images=image, ...)  # Can cause errors

# ✅ Right: Always convert to RGB
image = Image.open("screenshot.png").convert("RGB")
inputs = processor(images=image, ...)
```
**Why:** VLMs expect RGB images. RGBA (with transparency) or grayscale can cause silent errors.

---

### Mistake 2: Forgetting to Move Inputs to GPU
```python
# ❌ Wrong: Inputs on CPU, model on GPU
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs)  # Error!

# ✅ Right: Move inputs to same device as model
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs)
```
**Why:** PyTorch requires all tensors to be on the same device.

---

### Mistake 3: Running Out of Memory
```python
# ❌ Wrong: Loading multiple large models
clip_model = CLIPModel.from_pretrained(...)  # 2GB
llava_model = LlavaForConditionalGeneration.from_pretrained(...)  # 16GB
# Now trying to load another...

# ✅ Right: Clean up before loading new models
del clip_model
gc.collect()
torch.cuda.empty_cache()
# Now safe to load new model
```
**Why:** A model's weights stay in GPU memory until every reference to it is dropped. Even with 128GB, models you no longer need add up quickly and leave less room for the next one.

---

## 🎉 Checkpoint

You've learned:
- ✅ How CLIP creates a shared embedding space for images and text
- ✅ How LLaVA combines a vision encoder with a language model
- ✅ How to analyze images and ask questions about them
- ✅ How to build practical image analysis pipelines
- ✅ How to optimize VLM loading for DGX Spark

---

## 🚀 Challenge (Optional)

Build an **Image Comparison Assistant** that can:
1. Take two images as input
2. Describe similarities between them
3. Describe differences between them
4. Determine which image is "better" for a given purpose

Hint: You'll need to process both images and construct a clever prompt!

In [None]:
# Challenge: Your code here!

def compare_images(image1: Image.Image, image2: Image.Image, purpose: str = "general") -> dict:
    """
    Compare two images and return analysis.
    
    Args:
        image1: First image
        image2: Second image
        purpose: What the images are being compared for
        
    Returns:
        Dictionary with similarities, differences, and recommendation
    """
    # Your implementation here!
    pass

---

## 📖 Further Reading

- [LLaVA Paper](https://arxiv.org/abs/2304.08485) - Visual Instruction Tuning
- [CLIP Paper](https://arxiv.org/abs/2103.00020) - Learning Transferable Visual Models
- [Qwen-VL](https://arxiv.org/abs/2308.12966) - A Versatile Vision-Language Model
- [HuggingFace VLM Hub](https://huggingface.co/models?other=vision-language-model)

---

## 🧹 Cleanup

In [None]:
# Clean up GPU memory
if 'model' in dir():
    del model
if 'processor' in dir():
    del processor

gc.collect()
torch.cuda.empty_cache()

print("✅ Cleanup complete!")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

---

## Next Steps

In the next lab, we'll explore **Image Generation** with Stable Diffusion and SDXL - learning to create images from text descriptions!

➡️ Continue to [Lab 02: Image Generation](./02-image-generation.ipynb)