# Lab 4.1.1: Vision-Language Models Demo

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how vision-language models work at a high level
- [ ] Use LLaVA to analyze and describe images
- [ ] Use Qwen2-VL for advanced visual understanding
- [ ] Compare different VLM capabilities
- [ ] Build a simple image Q&A application

---

## Prerequisites

- Completed: Module 3.6 (AI Agents)
- Knowledge of: Transformers, attention mechanisms, LLMs
- Running in: NGC PyTorch container with transformers installed

---

## Real-World Context

Vision-Language Models (VLMs) are transforming how we interact with AI:

**Industry Applications:**
- **Medical Imaging**: "What abnormalities do you see in this X-ray?"
- **Accessibility**: Describing images for visually impaired users
- **E-commerce**: "Find products similar to this photo"
- **Security**: "Is there anything suspicious in this surveillance footage?"
- **Education**: "Explain the diagram in this textbook"

**Why DGX Spark?**
- Qwen2-VL-7B: ~18GB - fits easily in 128GB unified memory
- Qwen3-VL-8B: ~20GB - runs with room to spare (2025 model)
- No need for expensive cloud APIs - run everything locally!

---

## ELI5: How Do Vision-Language Models Work?

> **Imagine you're a translator who can speak two languages: "Picture" and "English".**
>
> When someone shows you a photo, you first describe what you see in the picture using your "Picture" vocabulary (that's the **vision encoder** - usually CLIP). You take mental notes: "I see a cat, it's orange, it's sitting on a blue couch..."
>
> Then, you translate these notes into English so you can talk about the picture (that's the **language model** part). But you need a special translator's notebook to connect your Picture-notes to English-notes (that's the **projection layer**).
>
> When someone asks "What color is the cat?", you check your Picture-notes, find the color information, and use your English skills to say "The cat is orange!"
>
> **In AI terms:** 
> 1. **Vision Encoder** (CLIP/SigLIP): Converts images to embeddings
> 2. **Projection Layer**: Aligns image embeddings to text embedding space
> 3. **Language Model**: Generates text responses using both image and text context

---

## Part 1: Environment Setup

Let's start by setting up our environment and checking our GPU.

In [None]:
# Install required packages (if not already installed)
# Run this only once
# !pip install "transformers>=4.45.0" accelerate bitsandbytes pillow requests -q

In [None]:
import torch
import gc
import numpy as np
from PIL import Image
import requests
from io import BytesIO
import time
import warnings
from IPython.display import display
warnings.filterwarnings('ignore')

# Check GPU availability and memory
print("=" * 50)
print("GPU Configuration")
print("=" * 50)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    allocated = torch.cuda.memory_allocated(0) / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Total Memory: {total_memory:.1f} GB")
    print(f"Currently Allocated: {allocated:.2f} GB")
    print(f"Available: {total_memory - allocated:.1f} GB")
else:
    print("No GPU available - VLMs will be very slow on CPU!")
    
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")

In [None]:
def clear_gpu_memory():
    """Clear GPU memory cache - essential when switching between models."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("GPU memory cleared!")

def get_memory_usage():
    """Get current GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1e9
        reserved = torch.cuda.memory_reserved(0) / 1e9
        return f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB"
    return "No GPU available"

def load_image_from_url(url: str) -> Image.Image:
    """Load an image from a URL."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail on HTTP errors instead of parsing an error page
    image = Image.open(BytesIO(response.content)).convert('RGB')
    return image

print("Utility functions loaded!")

---

## Part 2: Understanding VLM Architecture

### The Three Components

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Vision Encoder │────▶│ Projection Layer │────▶│ Language Model  │
│    (CLIP/SigLIP)│     │   (Linear/MLP)   │     │  (Qwen/LLaMA)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
        │                        │                       │
   Image → Tokens          Align Spaces           Generate Text
  (14x14 px patches)     (Vision → Text)        (Answer questions)
```

### Why This Matters

| Component | What It Does | Size Impact |
|-----------|--------------|-------------|
| Vision Encoder | "Sees" the image, creates embeddings | ~400M params |
| Projection | Translates vision → language space | ~10M params |
| Language Model | Reasons and generates responses | 7B-72B params |

The language model dominates the size, which is why:
- Qwen2-VL-7B ≈ 18GB
- Qwen3-VL-8B ≈ 20GB (2025 recommended model)
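
As a rough sanity check on these numbers, weight memory is approximately the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only: `estimate_weight_memory_gb` is a hypothetical helper (not part of any library), and it ignores activations, the KV cache, and framework overhead, which is one reason a loaded model takes more memory than the weight-only figure.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Weight-only memory estimate in GB (1 GB = 1e9 bytes, as in the cells above)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical usage: a 7B-parameter language model on its own.
for dtype in ("float32", "bfloat16", "int8"):
    print(f"7B params in {dtype}: {estimate_weight_memory_gb(7e9, dtype):.1f} GB")
```

The vision encoder and projection layer add their own parameters on top of the language model, so treat the result as a lower bound.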

---

## Part 3: LLaVA - General-Purpose Visual Chat

LLaVA-1.6 (also known as LLaVA-NeXT) is a widely used open VLM and a good first model to try:
- **Simple Architecture**: A CLIP vision encoder, an MLP projection layer, and the Vicuna-7B language model, matching the three components above
- **Higher-Resolution Input**: Splits large images into tiles so fine details are preserved
- **General Visual Chat**: Strong at describing scenes and answering open-ended questions

Let's load it and try it out!

In [None]:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Clear any existing models from memory
clear_gpu_memory()

print("Loading LLaVA-1.6-Vicuna-7B...")
print(f"Memory before: {get_memory_usage()}")
start_time = time.time()

# Load model and processor
model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"

llava_processor = LlavaNextProcessor.from_pretrained(model_id)

llava_model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for Blackwell
    device_map="auto",           # Automatically map to GPU
    low_cpu_mem_usage=True       # Optimize memory during loading
)

load_time = time.time() - start_time
print(f"\nModel loaded in {load_time:.1f} seconds!")
print(f"Memory after: {get_memory_usage()}")

In [None]:
# Load a sample image
# Using a royalty-free image from Unsplash
image_url = "https://images.unsplash.com/photo-1518791841217-8f162f1e1131?w=800"

try:
    sample_image = load_image_from_url(image_url)
    # Resize for display
    display_image = sample_image.copy()
    display_image.thumbnail((400, 400))
    display(display_image)
    print(f"Image size: {sample_image.size}")
except Exception as e:
    print(f"Could not load image from URL: {e}")
    print("Creating a simple test image instead...")
    # Create a simple gradient image as fallback
    import numpy as np
    arr = np.zeros((224, 224, 3), dtype=np.uint8)
    arr[:, :, 0] = np.linspace(0, 255, 224).astype(np.uint8)  # Red gradient
    arr[:, :, 2] = np.linspace(255, 0, 224).astype(np.uint8)  # Blue gradient
    sample_image = Image.fromarray(arr)
    display(sample_image)

In [None]:
def ask_llava(image: Image.Image, question: str, max_tokens: int = 300) -> str:
    """
    Ask LLaVA a question about an image.
    
    Args:
        image: PIL Image to analyze
        question: Question to ask about the image
        max_tokens: Maximum tokens in response
        
    Returns:
        Model's response as a string
    """
    # Format the conversation
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]
    
    # Apply chat template
    prompt = llava_processor.apply_chat_template(conversation, add_generation_prompt=True)
    
    # Process inputs
    inputs = llava_processor(images=image, text=prompt, return_tensors="pt")
    inputs = {k: v.to(llava_model.device) for k, v in inputs.items()}
    
    # Generate response
    start_time = time.time()
    with torch.inference_mode():
        output = llava_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9
        )
    generation_time = time.time() - start_time
    
    # Decode response
    response = llava_processor.decode(output[0], skip_special_tokens=True)
    
    # Extract just the assistant's response
    if "ASSISTANT:" in response:
        response = response.split("ASSISTANT:")[-1].strip()
    
    print(f"\n[Generated in {generation_time:.1f}s]")
    return response

print("ask_llava() function ready!")

In [None]:
# Let's ask LLaVA about our image!
question1 = "What do you see in this image? Describe it in detail."

print(f"Question: {question1}")
print("-" * 50)
response1 = ask_llava(sample_image, question1)
print(f"\nLLaVA: {response1}")

In [None]:
# Ask a follow-up question
question2 = "What mood or atmosphere does this image convey?"

print(f"Question: {question2}")
print("-" * 50)
response2 = ask_llava(sample_image, question2)
print(f"\nLLaVA: {response2}")

In [None]:
# Ask about specific details
question3 = "What colors are most prominent in this image?"

print(f"Question: {question3}")
print("-" * 50)
response3 = ask_llava(sample_image, question3)
print(f"\nLLaVA: {response3}")

### What Just Happened?

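1. **Image Processing**: The vision encoder (CLIP ViT-L/14 at 336px) converted the image into visual tokens: 576 per 336x336 view, and because LLaVA-1.6 encodes several tiles of a large image, the total can reach a few thousand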
2. **Token Alignment**: The projection layer aligned these tokens with the language model's embedding space
3. **Text Generation**: The language model (Vicuna-7B) generated a response based on both the visual tokens and your question

Notice the generation time: the timer wraps `generate()`, so it covers both the vision encoder's forward pass and text generation.
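
The 576 figure follows from the patch grid: a ViT cuts its input into fixed-size patches and emits one token per patch. Here is a minimal sketch of that arithmetic, assuming a 336x336 input and 14x14-pixel patches (the CLIP ViT-L/14 336px configuration). `count_patch_tokens` is a hypothetical helper for illustration, not a library function.

```python
def count_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


print(count_patch_tokens(336, 14))  # 24 x 24 grid -> 576
print(count_patch_tokens(224, 14))  # 16 x 16 grid -> 256
```

These visual tokens sit in the language model's context next to the text tokens, so a larger or more finely tiled image means a longer prompt and slower generation.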

---

## Try It Yourself: Image Q&A

Try asking different types of questions about the image:

1. **Descriptive**: "What objects are in this image?"
2. **Analytical**: "What might have happened just before this photo was taken?"
3. **Creative**: "Write a short story inspired by this image."
4. **Practical**: "If this were a product photo, what would be a good caption?"

<details>
<summary>Hint: Try loading your own image!</summary>

```python
# Load from file
my_image = Image.open("/path/to/your/image.jpg")

# Or from URL
my_image = load_image_from_url("https://example.com/image.jpg")

# Then ask questions
response = ask_llava(my_image, "What do you see?")
```
</details>

In [None]:
# YOUR CODE HERE
# Try different questions or load your own image!

my_question = "What would be a good title for this image if it were a painting?"

print(f"Question: {my_question}")
print("-" * 50)
my_response = ask_llava(sample_image, my_question)
print(f"\nLLaVA: {my_response}")

---

## Part 4: Qwen2-VL - Advanced Vision Understanding

Qwen2-VL is a state-of-the-art VLM from Alibaba with impressive capabilities:
- **Dynamic Resolution**: Handles images of any size without fixed cropping
- **Video Understanding**: Can process video frames
- **Multilingual**: Strong support for Chinese and English
- **Document Understanding**: Great at reading text in images

Let's compare it with LLaVA!

In [None]:
# First, let's clean up LLaVA to free memory
del llava_model
del llava_processor
clear_gpu_memory()
print(f"Memory after cleanup: {get_memory_usage()}")

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

print("Loading Qwen2-VL-7B-Instruct...")
print(f"Memory before: {get_memory_usage()}")
start_time = time.time()

qwen_model_id = "Qwen/Qwen2-VL-7B-Instruct"

qwen_processor = AutoProcessor.from_pretrained(qwen_model_id)

qwen_model = Qwen2VLForConditionalGeneration.from_pretrained(
    qwen_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True
)

load_time = time.time() - start_time
print(f"\nModel loaded in {load_time:.1f} seconds!")
print(f"Memory after: {get_memory_usage()}")

In [None]:
def ask_qwen(image: Image.Image, question: str, max_tokens: int = 300) -> str:
    """
    Ask Qwen2-VL a question about an image.
    
    Args:
        image: PIL Image to analyze
        question: Question to ask about the image
        max_tokens: Maximum tokens in response
        
    Returns:
        Model's response as a string
    """
    # Format the conversation for Qwen2-VL
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ]
        }
    ]
    
    # Apply chat template
    text = qwen_processor.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    # Process inputs
    inputs = qwen_processor(
        text=[text],
        images=[image],
        padding=True,
        return_tensors="pt"
    )
    inputs = inputs.to(qwen_model.device)
    
    # Generate response
    start_time = time.time()
    with torch.inference_mode():
        output_ids = qwen_model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9
        )
    generation_time = time.time() - start_time
    
    # Decode - get only the new tokens
    generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
    response = qwen_processor.batch_decode(
        generated_ids, 
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=False
    )[0]
    
    print(f"\n[Generated in {generation_time:.1f}s]")
    return response

print("ask_qwen() function ready!")

In [None]:
# Let's try Qwen2-VL with the same image
print(f"Question: {question1}")
print("-" * 50)
qwen_response = ask_qwen(sample_image, question1)
print(f"\nQwen2-VL: {qwen_response}")

### Qwen2-VL's Special Abilities

Qwen2-VL excels at:
1. **OCR/Text Reading**: It can read text in images very accurately
2. **Document Understanding**: Great with charts, tables, and diagrams
3. **Spatial Reasoning**: Better understanding of object positions

In [None]:
# Let's test OCR capabilities with a document-style image
# We'll create a simple image with text

from PIL import ImageDraw, ImageFont

# Create a simple document image
doc_image = Image.new('RGB', (400, 200), color='white')
draw = ImageDraw.Draw(doc_image)

# Add some text (using default font)
try:
    # Try to use a nice font if available
    font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 20)
except OSError:  # Font file missing or unreadable
    font = ImageFont.load_default()

text_content = """Meeting Notes - Dec 2024
1. Project deadline: March 15
2. Budget: $50,000
3. Team size: 5 people"""

draw.text((10, 10), text_content, fill='black', font=font)

display(doc_image)
print("\nDocument image created for OCR test")

In [None]:
# Test OCR capabilities
ocr_question = "Read all the text in this image and summarize the key information."

print(f"Question: {ocr_question}")
print("-" * 50)
ocr_response = ask_qwen(doc_image, ocr_question)
print(f"\nQwen2-VL: {ocr_response}")
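
Because the text in a synthetic test image is known in advance, the OCR result can be scored instead of judged by eye. The sketch below assumes a simple substring check is good enough. `ocr_recall` is a hypothetical helper, and the example response is written by hand for illustration, not actual model output.

```python
def ocr_recall(response: str, expected_items: list[str]) -> float:
    """Fraction of expected strings that appear in the response (case-insensitive)."""
    if not expected_items:
        raise ValueError("expected_items must not be empty")
    text = response.lower()
    hits = [item for item in expected_items if item.lower() in text]
    return len(hits) / len(expected_items)


# Hand-written stand-in for a model response (illustration only).
example_response = "Meeting notes: deadline March 15, budget $50,000."
print(ocr_recall(example_response, ["March 15", "$50,000", "5 people"]))
```

In the notebook you could pass `ocr_response` in place of `example_response`. A substring match is strict about formatting (for example "$50,000" vs "50000 dollars"), so a low score is a prompt to read the response, not proof the model misread the image.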

---

## Part 5: Comparing VLM Capabilities

Let's create a systematic comparison of different VLM tasks.

### Task Categories

| Task | Description | Best Model |
|------|-------------|------------|
| Image Description | General scene understanding | Both good |
| Object Detection | Identifying specific objects | Similar |
| OCR/Text Reading | Reading text in images | Qwen2-VL |
| Document Understanding | Charts, tables, diagrams | Qwen2-VL |
| Counting | Counting objects accurately | Challenging for both |
| Spatial Reasoning | Understanding positions | Qwen2-VL slightly better |
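
One way to make such a comparison systematic is to run the same questions through each model and collect the answers and timings in one place. The sketch below is an assumption about how you might structure that, not part of the lab's code: `compare_models` is a hypothetical helper, and the stub callables stand in for real models so the example runs without a GPU.

```python
import time
from typing import Any, Callable


def compare_models(
    ask_fns: dict[str, Callable[[Any, str], str]],
    image: Any,
    questions: dict[str, str],
) -> list[dict]:
    """Run every question through every model and record answers and timings."""
    rows = []
    for task, question in questions.items():
        for model_name, ask in ask_fns.items():
            start = time.perf_counter()
            answer = ask(image, question)
            rows.append({
                "task": task,
                "model": model_name,
                "answer": answer,
                "seconds": time.perf_counter() - start,
            })
    return rows


# Stand-in callables so the sketch runs without loading a model.
stub_fns = {
    "model_a": lambda image, q: f"A says: {q}",
    "model_b": lambda image, q: f"B says: {q}",
}
stub_questions = {"description": "What is in the image?", "counting": "How many cats?"}
for row in compare_models(stub_fns, None, stub_questions):
    print(f"{row['task']:<12} {row['model']:<8} {row['answer']}")
```

With real models, `ask_fns` could map names to `ask_llava` and `ask_qwen`. Since this notebook unloads LLaVA before loading Qwen2-VL, you would either keep both models loaded (memory permitting) or call the helper once per model and merge the rows.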

In [None]:
# Let's test with a more complex scene
# Using a street scene image
street_url = "https://images.unsplash.com/photo-1480714378408-67cf0d13bc1b?w=800"

try:
    street_image = load_image_from_url(street_url)
    display_img = street_image.copy()
    display_img.thumbnail((400, 400))
    display(display_img)
except Exception as e:
    print(f"Could not load street image: {e}")
    street_image = sample_image  # Fallback

In [None]:
# Test spatial reasoning
spatial_question = "Describe the layout of this scene from left to right. What objects are in the foreground vs background?"

print(f"Question: {spatial_question}")
print("-" * 50)
spatial_response = ask_qwen(street_image, spatial_question)
print(f"\nQwen2-VL: {spatial_response}")

---

## Part 6: Building an Image Q&A Application

Let's put it all together into a reusable application class!

In [None]:
class ImageQA:
    """
    A simple Image Question-Answering application using VLMs.
    
    Example usage:
        qa = ImageQA(model_name="qwen")
        image = Image.open("photo.jpg")
        answer = qa.ask(image, "What do you see?")
    """
    
    def __init__(self, model_name: str = "qwen"):
        """
        Initialize the Image QA system.
        
        Args:
            model_name: Either "qwen" or "llava"
        """
        self.model_name = model_name
        self.model = None
        self.processor = None
        self.conversation_history = []
        
    def load_model(self):
        """Load the selected model."""
        if self.model is not None:
            print("Model already loaded!")
            return
            
        clear_gpu_memory()
        
        if self.model_name == "qwen":
            from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
            model_id = "Qwen/Qwen2-VL-7B-Instruct"
            self.processor = AutoProcessor.from_pretrained(model_id)
            self.model = Qwen2VLForConditionalGeneration.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto"
            )
        else:
            from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
            model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
            self.processor = LlavaNextProcessor.from_pretrained(model_id)
            self.model = LlavaNextForConditionalGeneration.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto"
            )
        
        print(f"{self.model_name.upper()} model loaded!")
        
    def ask(self, image: Image.Image, question: str, max_tokens: int = 300) -> str:
        """
        Ask a question about an image.
        
        Args:
            image: PIL Image to analyze
            question: Question to ask
            max_tokens: Maximum response length
            
        Returns:
            Model's response
        """
        if self.model is None:
            self.load_model()
            
        if self.model_name == "qwen":
            return self._ask_qwen(image, question, max_tokens)
        else:
            return self._ask_llava(image, question, max_tokens)
            
    def _ask_qwen(self, image, question, max_tokens):
        messages = [{"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question}
        ]}]
        
        text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = self.processor(text=[text], images=[image], padding=True, return_tensors="pt")
        inputs = inputs.to(self.model.device)
        
        with torch.inference_mode():
            output_ids = self.model.generate(**inputs, max_new_tokens=max_tokens)
        
        generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    def _ask_llava(self, image, question, max_tokens):
        conversation = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": question}
        ]}]
        
        prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
        inputs = self.processor(images=image, text=prompt, return_tensors="pt")
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        
        with torch.inference_mode():
            output = self.model.generate(**inputs, max_new_tokens=max_tokens)
        
        response = self.processor.decode(output[0], skip_special_tokens=True)
        if "ASSISTANT:" in response:
            response = response.split("ASSISTANT:")[-1].strip()
        return response
    
    def analyze_image(self, image: Image.Image) -> dict:
        """
        Perform comprehensive analysis of an image.
        
        Returns a dictionary with:
        - description: General description
        - objects: Objects detected
        - mood: Mood/atmosphere
        - text: Any text found in the image
        """
        analysis = {}
        
        print("Analyzing image...")
        analysis['description'] = self.ask(image, "Describe this image in 2-3 sentences.")
        analysis['objects'] = self.ask(image, "List the main objects visible in this image.")
        analysis['mood'] = self.ask(image, "What mood or feeling does this image convey? Answer in one sentence.")
        analysis['text'] = self.ask(image, "Is there any text visible in this image? If yes, what does it say? If no, just say 'No text found'.")
        
        return analysis
    
    def cleanup(self):
        """Free GPU memory."""
        if self.model is not None:
            del self.model
            del self.processor
            self.model = None
            self.processor = None
            clear_gpu_memory()
            
print("ImageQA class defined!")

In [None]:
# Clean up the existing model first
del qwen_model
del qwen_processor
clear_gpu_memory()

# Test our ImageQA class
qa = ImageQA(model_name="qwen")

# Run comprehensive analysis
analysis = qa.analyze_image(sample_image)

print("\n" + "=" * 50)
print("IMAGE ANALYSIS RESULTS")
print("=" * 50)
for key, value in analysis.items():
    print(f"\n{key.upper()}:")
    print(f"  {value}")

---

## Common Mistakes

### Mistake 1: Not Clearing GPU Memory Between Models

```python
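# Wrong - loading a second model while the first is still in memory
model1 = load_llava()
model2 = load_qwen()  # Both models now occupy memory - risks an OOM error

# Right - release the first model before loading the second
model1 = load_llava()
del model1
gc.collect()              # Collect the dropped Python references first
torch.cuda.empty_cache()  # Then release PyTorch's cached GPU blocks
model2 = load_qwen()  # Works!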
```

### Mistake 2: Forgetting to Move Inputs to GPU

```python
# Wrong - inputs stay on CPU
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs)  # Error: tensors on different devices

# Right - move inputs to model's device
inputs = processor(images=image, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
output = model.generate(**inputs)  # Works!
```

### Mistake 3: Using Wrong Image Format

```python
# Wrong - passing file path instead of PIL Image
response = ask_llava("path/to/image.jpg", "What is this?")  # Error!

# Right - load as PIL Image first
image = Image.open("path/to/image.jpg").convert('RGB')
response = ask_llava(image, "What is this?")  # Works!
```

### Mistake 4: Expecting Perfect Counting

```python
# Unrealistic expectation
# VLMs are not great at precise counting!
response = ask_llava(crowd_photo, "Exactly how many people are in this image?")
# Response might be "approximately 50 people" when there are 47

# Better approach - use ranges or approximate language
response = ask_llava(crowd_photo, "Are there more or fewer than 100 people visible?")
```

---

## Checkpoint

You've learned:
- How VLMs combine vision encoders with language models
- How to use LLaVA for general image understanding
- How to use Qwen2-VL for advanced tasks including OCR
- How to build a reusable Image Q&A application
- Common pitfalls and how to avoid them

### Key Takeaways

1. **VLMs = Vision Encoder + Projection + LLM** - Understanding this architecture helps you choose the right model
2. **Model size matters** - 7B models respond faster and need less memory, but larger (13B+) models generally handle harder reasoning tasks better
3. **Memory management is crucial** - Always clear GPU memory when switching models
4. **Different strengths** - LLaVA is great for general chat, Qwen2-VL excels at documents and OCR

---

## Challenge (Optional)

### Build a Multi-Image Comparison Tool

Create a function that:
1. Takes two images as input
2. Describes each image separately
3. Compares and contrasts the two images
4. Suggests which image is better for a given purpose

**Hint**: You'll need to ask the VLM about each image, then ask it to compare based on the descriptions.

<details>
<summary>Solution Approach</summary>

1. Analyze each image separately
2. Concatenate the images side by side into one image
3. Ask the VLM to compare "the left image" vs "the right image"
4. Or use the text descriptions to prompt an LLM for comparison
</details>

In [None]:
# YOUR CHALLENGE CODE HERE

def compare_images(image1: Image.Image, image2: Image.Image, purpose: str = "social media post") -> dict:
    """
    Compare two images and recommend which is better for a given purpose.
    
    Args:
        image1: First image
        image2: Second image  
        purpose: What the images will be used for
        
    Returns:
        Dictionary with comparison results
    """
    # TODO: Implement this function
    pass

---

## Further Reading

- [LLaVA Paper: Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
- [Qwen2-VL Technical Report](https://arxiv.org/abs/2409.12191)
- [CLIP Paper: Learning Transferable Visual Models](https://arxiv.org/abs/2103.00020)
- [Hugging Face VLM Guide](https://huggingface.co/docs/transformers/main/en/tasks/visual_question_answering)

---

## Cleanup

In [None]:
# Clean up GPU memory
if 'qa' in dir():
    qa.cleanup()

clear_gpu_memory()
print(f"Final memory state: {get_memory_usage()}")
print("\nNotebook complete! Ready for the next task.")