# VLM Benchmark for Object Property Abstraction

This notebook implements a benchmark for evaluating Vision Language Models (VLMs) on object property abstraction and visual question answering (VQA) tasks. The benchmark includes three types of questions:

1. Direct Recognition
2. Property Inference
3. Counterfactual Reasoning

The images fall into three types:
- REAL
- ANIMATED
- AI GENERATED
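
The evaluation code in this notebook reads the benchmark from a JSON file and accesses `benchmark -> images`, where each image has `image_id`, `path`, `image_type`, and a list of `questions` (each with `id`, `question`, `answer`, and `property_category`). A minimal sketch of one such entry follows. Only the key names come from the notebook's code; the IDs, paths, label strings, and the `iter_questions` helper are assumptions made up for illustration.

```python
# Hypothetical benchmark entry: key names follow the notebook's code,
# all values below are invented for illustration.
example_benchmark = {
    "benchmark": {
        "images": [
            {
                "image_id": "image01",           # assumed ID format
                "path": "images/image01.jpg",    # assumed path, relative to data_dir
                "image_type": "REAL",            # assumed label string
                "questions": [
                    {
                        "id": "Q1",
                        "question": "How many cups are on the table?",
                        "answer": "2",
                        "property_category": "direct_recognition",  # assumed label
                    },
                    {
                        "id": "Q2",
                        "question": "How many objects could hold water?",
                        "answer": "3",
                        "property_category": "property_inference",  # assumed label
                    },
                ],
            }
        ]
    }
}


def iter_questions(benchmark):
    """Yield (image_id, image_type, question) for every question in the benchmark."""
    for image in benchmark["benchmark"]["images"]:
        for question in image["questions"]:
            yield image["image_id"], image["image_type"], question


for image_id, image_type, question in iter_questions(example_benchmark):
    print(image_id, image_type, question["id"], question["question"])
```

Keeping the questions nested under their image means each image only has to be loaded once for all of its questions.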

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [1]:
# Install required packages
# %pip install transformers torch Pillow tqdm bitsandbytes accelerate

In [2]:
%pip install qwen-vl-utils flash-attn #--no-build-isolation

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import required libraries
import torch
import json
from pathlib import Path
from PIL import Image
import gc
import re
from tqdm import tqdm
from typing import List, Dict, Any
from qwen_vl_utils import process_vision_info
import time

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Benchmark Tester Class

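The `BenchmarkTester` class loads the benchmark JSON, queries a model with each question for each image, parses the model's answer into a count and a list of objects, and saves the results.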
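
The models are prompted to answer in the form `<NUMBER> [<OBJECT1>, <OBJECT2>, ...]`, and the class parses that format with a regular expression, falling back to the first number found in the text. A standalone sketch of that parsing step is shown here so it can be tried without loading a model. `parse_answer` is a hypothetical name for this sketch, and this version also drops empty entries, so `0 []` yields an empty list.

```python
import re

# Matches a count followed by a bracketed, comma-separated object list.
ANSWER_PATTERN = re.compile(r"(\d+)\s*\[(.*?)\]")


def parse_answer(answer: str) -> dict:
    """Split a raw model answer into a count string and a list of objects."""
    match = ANSWER_PATTERN.search(answer)
    if match:
        objects = [obj.strip() for obj in match.group(2).split(",") if obj.strip()]
        return {"count": match.group(1), "reasoning": objects}
    # Fallback: take the first number in the text, or "0" if there is none.
    numbers = re.findall(r"\d+", answer)
    return {"count": numbers[0] if numbers else "0", "reasoning": []}


print(parse_answer("3 [cup, bottle, glass]"))
# {'count': '3', 'reasoning': ['cup', 'bottle', 'glass']}
print(parse_answer("There are 2 cups."))
# {'count': '2', 'reasoning': []}
print(parse_answer("none"))
# {'count': '0', 'reasoning': []}
```

The count is kept as a string, so any comparison against the ground truth needs to normalize both sides to the same type.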

In [4]:
# class BenchmarkTester:
#     def __init__(self, benchmark_path="/var/scratch/ave303/OP_bench/benchmark.json", data_dir="/var/scratch/ave303/OP_bench/"):
#         self.device = "cuda" if torch.cuda.is_available() else "cpu"
#         with open(benchmark_path, 'r') as f:
#             self.benchmark = json.load(f)
#         self.data_dir = data_dir
    
#     def format_question(self, question, model_name):
#         """Format a question for the model."""

#         if model_name=="blip2":
#             return f"Question: {question['question']} Answer:"
#         else:
#             return f"Question: {question['question']} Answer with a number and list of objects. Answer:"

#     def clean_answer(self, answer):
#         """Clean the model output to extract just the number."""
#         # Remove any text that's not a number
#         # import re
#         # numbers = re.findall(r'\d+', answer)
#         # if numbers:
#         #     return numbers[0]  # Return the first number found
#         # return answer
#         """Extract number and reasoning from the model's answer."""
#         # Try to extract number and reasoning using regex
#         import re
#         pattern = r'(\d+)\s*\[(.*?)\]'
#         match = re.search(pattern, answer)
        
#         if match:
#             number = match.group(1)
#             objects = [obj.strip() for obj in match.group(2).split(',')]
#             return {
#                 "count": number,
#                 "reasoning": objects
#             }
#         else:
#             # Fallback if format isn't matched
#             numbers = re.findall(r'\d+', answer)
#             return {
#                 "count": numbers[0] if numbers else "0",
#                 "reasoning": []
#             }

#     def model_generation(self, model_name, model, inputs, processor):
#         """Generate answer and decode."""
#         outputs = None  # Initialize outputs to None
        
#         if model_name=="smolVLM2":
#             outputs = model.generate(**inputs, do_sample=False, max_new_tokens=64)
#             answer = processor.batch_decode(
#                 outputs,
#                 skip_special_tokens=True,
#             )[0]
#         elif model_name=="Qwen2.5-VL":
#             outputs = model.generate(**inputs, max_new_tokens=50)
#             outputs = [
#                 out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, outputs)
#             ]
#             answer = processor.batch_decode(
#                 outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
#             )[0]
#         else:
#             print(f"Warning: Unknown model name '{model_name}' in model_generation.")
#             answer = ""  # Return an empty string

#         return answer, outputs
    
#     def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
#         results = []
#         print(f"\nEvaluating {model_name}...")
#         print(f"Using device: {self.device}")
        
#         # Force garbage collection before starting
#         gc.collect()
#         torch.cuda.empty_cache()

#         try:
#             images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
#             total_images = len(images)
            
#             for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
#                 try:
#                     print(f"\nProcessing image {idx+1}/{total_images}: {image_data['image_id']}")
#                     image_path = Path(self.data_dir)/image_data['path']
#                     if not image_path.exists():
#                         print(f"Warning: Image not found at {image_path}")
#                         continue
                    
#                     # Load and preprocess image
#                     image = Image.open(image_path).convert("RGB")
#                     image_results = []  # Store results for current image
                    
#                     for question in image_data['questions']:
#                         try:
#                             # prompt = self.format_question(question, model_name)
#                             print(f"Question: {question['question']}")

#                             messages = [
#                                 {
#                                     "role": "user",
#                                     "content": [
#                                         {"type": "image", "image": image},
#                                         # {"type": "text", "text": f"{question['question']} Answer format: total number(numerical) objects(within square brackets)"},
#                                         # {"type": "text", "text": f"{question['question']} Provide just the total count and the list of objects in the given format \n Format: number [objects]"},
#                                         # {"type": "text", "text": f"{question['question']} Answer Format: number [objects]"},
#                                         {"type": "text", "text": f"{question['question']} Your response MUST be in the following format and nothing else:\n <NUMBER> [<OBJECT1>, <OBJECT2>, <OBJECT3>, ...]"}
#                                     ]
#                                 },
#                             ]
                            
#                             # Clear cache before processing each question
#                             torch.cuda.empty_cache()
                            
#                             # Process image and text
#                             # inputs = processor(images=image, text=prompt, return_tensors="pt").to(self.device)
#                             if model_name=="smolVLM2":
#                                 inputs = processor.apply_chat_template(
#                                     messages,
#                                     add_generation_prompt=True,
#                                     tokenize=True,
#                                     return_dict=True,
#                                     return_tensors="pt",
#                                 ).to(model.device, dtype=torch.float16)
#                             else:
                                
#                                 text = processor.apply_chat_template(
#                                     messages, tokenize=False, add_generation_prompt=True
#                                 )
#                                 # image_inputs, video_inputs = process_vision_info(messages)
#                                 inputs = processor(
#                                     text=text,
#                                     images=image,
#                                     videos=None,
#                                     padding=True,
#                                     return_tensors="pt",
#                                 ).to("cuda")
                            
#                             # Generate answer with better settings
#                             with torch.no_grad():
#                                 answer, outputs = self.model_generation(model_name, model, inputs, processor)    #call for model.generate
        
#                             cleaned_answer = self.clean_answer(answer)
                            
#                             image_results.append({
#                                 "image_id": image_data["image_id"],
#                                 "image_type": image_data["image_type"],
#                                 "question_id": question["id"],
#                                 "question": question["question"],
#                                 "ground_truth": question["answer"],
#                                 "model_answer": cleaned_answer["count"],
#                                 "model_reasoning": cleaned_answer["reasoning"],
#                                 "raw_answer": answer,  # Keep raw answer for debugging
#                                 "property_category": question["property_category"]
#                             })
                            
#                             # Clear memory
#                             del outputs, inputs
#                             torch.cuda.empty_cache()
                            
#                         except Exception as e:
#                             print(f"Error processing question: {str(e)}")
#                             continue
                    
#                     # Add results from this image
#                     results.extend(image_results)
                    
#                     # Save intermediate results only every 2 images or if it's the last image
#                     if (idx + 1) % 2 == 0 or idx == total_images - 1:
#                         with open(f"{save_path}_checkpoint.json", 'w') as f:
#                             json.dump(results, f, indent=4)
                            
#                 except Exception as e:
#                     print(f"Error processing image {image_data['image_id']}: {str(e)}")
#                     continue
            
#             # Save final results
#             if results:
#                 with open(save_path, 'w') as f:
#                     json.dump(results, f, indent=4)
            
#         except Exception as e:
#             print(f"An error occurred during evaluation: {str(e)}")
#             if results:
#                 with open(f"{save_path}_error_state.json", 'w') as f:
#                     json.dump(results, f, indent=4)
        
#         return results
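
Each saved record pairs `ground_truth` with `model_answer` and carries `image_type` and `property_category`, so the results file can be scored afterwards. The sketch below assumes exact match on the stringified count as the metric, which this notebook does not state explicitly. `accuracy_by` and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by(results, key="image_type"):
    """Return exact-match accuracy per group (assumed metric, not the notebook's own)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in results:
        group = record.get(key, "unknown")
        totals[group] += 1
        # Compare as stripped strings because the parsed count is stored as a string.
        if str(record["ground_truth"]).strip() == str(record["model_answer"]).strip():
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Invented records using the same keys as the saved results.
sample_results = [
    {"image_type": "REAL", "ground_truth": "2", "model_answer": "2"},
    {"image_type": "REAL", "ground_truth": 3, "model_answer": "4"},
    {"image_type": "ANIMATED", "ground_truth": "1", "model_answer": "1"},
]

print(accuracy_by(sample_results))
# {'REAL': 0.5, 'ANIMATED': 1.0}
```

Passing `key="property_category"` groups the same records by question category instead.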

In [5]:
# import torch
# import json
# from pathlib import Path
# from PIL import Image
# import gc
# import re
# import time
# from tqdm import tqdm
# from typing import List, Dict, Any
# import psutil
# import os

# class BenchmarkTester:
#     def __init__(self, benchmark_path="/var/scratch/ave303/OP_bench/benchmark.json", data_dir="/var/scratch/ave303/OP_bench/"):
#         self.device = "cuda" if torch.cuda.is_available() else "cpu"
#         with open(benchmark_path, 'r') as f:
#             self.benchmark = json.load(f)
#         self.data_dir = data_dir
        
#         # Set memory optimization environment variables
#         os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
        
#         # Memory monitoring
#         self.max_memory_allocated = 0
#         self.memory_threshold = 0.70  # 70% of GPU memory as threshold

#     def get_gpu_memory_info(self):
#         """Get current GPU memory usage information."""
#         if torch.cuda.is_available():
#             allocated = torch.cuda.memory_allocated() / 1024**3  # GB
#             reserved = torch.cuda.memory_reserved() / 1024**3    # GB
#             max_memory = torch.cuda.max_memory_allocated() / 1024**3  # GB
#             total_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3  # GB
            
#             return {
#                 'allocated': allocated,
#                 'reserved': reserved,
#                 'max_allocated': max_memory,
#                 'total': total_memory,
#                 'free': total_memory - allocated,
#                 'usage_percent': (allocated / total_memory) * 100
#             }
#         return None

#     def aggressive_memory_cleanup(self):
#         """Perform aggressive memory cleanup - alias for ultra_aggressive_memory_cleanup."""
#         self.ultra_aggressive_memory_cleanup()

#     def ultra_aggressive_memory_cleanup(self):
#         """Perform ultra-aggressive memory cleanup including model cache clearing."""
#         # Clear Python garbage collector multiple times
#         for _ in range(5):
#             gc.collect()
        
#         if torch.cuda.is_available():
#             # Force synchronize all streams
#             torch.cuda.synchronize()
#             # Clear all cached memory
#             torch.cuda.empty_cache()
#             # Reset peak memory stats
#             torch.cuda.reset_peak_memory_stats()
#             # Second empty_cache() call via torch.cuda.memory (same function as above; it does not defragment memory)
#             torch.cuda.memory.empty_cache()
#             # Another sync to ensure completion
#             torch.cuda.synchronize()
            
#         # Force system memory cleanup
#         import ctypes
#         libc = ctypes.CDLL("libc.so.6")
#         libc.malloc_trim(0)

#     def check_available_memory_and_restart_if_needed(self):
#         """Check if we need to recommend model restart due to fragmentation."""
#         memory_info = self.get_gpu_memory_info()
#         if memory_info:
#             # If allocated is much less than reserved, we have fragmentation
#             fragmentation_ratio = memory_info['reserved'] / max(memory_info['allocated'], 0.1)
#             if fragmentation_ratio > 2.0 and memory_info['usage_percent'] > 80:
#                 print(f"⚠️  Severe memory fragmentation detected (fragmentation ratio: {fragmentation_ratio:.2f})")
#                 print("Consider restarting the Python process to defragment GPU memory")
#                 return False
#         return True

#     def resize_image_if_needed(self, image, max_size=(512, 512)):
#         """Resize image aggressively to prevent memory issues."""
#         original_size = image.size
        
#         # Always resize to max_size to ensure consistent memory usage
#         # Calculate aspect ratio preserving resize
#         ratio = min(max_size[0] / original_size[0], max_size[1] / original_size[1])
#         new_size = (int(original_size[0] * ratio), int(original_size[1] * ratio))
        
#         print(f"Resizing image from {original_size} to {new_size}")
#         # Use NEAREST for fastest processing and lowest memory
#         image = image.resize(new_size, Image.Resampling.NEAREST)
        
#         return image

#     def check_memory_before_processing(self, image_id, skip_if_high=True):
#         """Check if we have enough memory before processing with option to skip."""
#         memory_info = self.get_gpu_memory_info()
#         if memory_info and memory_info['usage_percent'] > self.memory_threshold * 100:
#             print(f"Warning: High memory usage ({memory_info['usage_percent']:.1f}%) before processing {image_id}")
#             self.ultra_aggressive_memory_cleanup()
            
#             # Check again after cleanup
#             memory_info = self.get_gpu_memory_info()
#             if memory_info['usage_percent'] > self.memory_threshold * 100:
#                 print(f"Critical: Still high memory usage ({memory_info['usage_percent']:.1f}%) after cleanup")
                
#                 # Check for fragmentation issues
#                 if not self.check_available_memory_and_restart_if_needed():
#                     return False
                    
#                 if skip_if_high:
#                     print(f"Skipping {image_id} due to insufficient memory")
#                     return False
#         return True

#     def clean_answer(self, answer):
#         """Extract number and reasoning from the model's answer."""
#         import re
#         pattern = r'(\d+)\s*\[(.*?)\]'
#         match = re.search(pattern, answer)
        
#         if match:
#             number = match.group(1)
#             objects = [obj.strip() for obj in match.group(2).split(',')]
#             return {
#                 "count": number,
#                 "reasoning": objects
#             }
#         else:
#             numbers = re.findall(r'\d+', answer)
#             return {
#                 "count": numbers[0] if numbers else "0",
#                 "reasoning": []
#             }

#     def model_generation(self, model_name, model, inputs, processor):
#         """Generate answer with memory-optimized inference."""
#         outputs = None
        
#         try:
#             if model_name == "Qwen2.5-VL":
#                 # Run generation under autocast (mixed precision)
#                 with torch.cuda.amp.autocast(enabled=True):
#                     outputs = model.generate(
#                         **inputs, 
#                         max_new_tokens=200,
#                         do_sample=False,
#                         temperature=None,
#                         top_p=None,
#                         top_k=None,
#                         num_beams=1,
#                         early_stopping=False,
#                         pad_token_id=processor.tokenizer.pad_token_id,
#                         eos_token_id=processor.tokenizer.eos_token_id,
#                         use_cache=False,  # Disable KV cache to save memory
#                     )
                
#                 outputs = [
#                     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, outputs)
#                 ]
#                 answer = processor.batch_decode(
#                     outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
#                 )[0]
#             else:
#                 print(f"Warning: Unknown model name '{model_name}' in model_generation.")
#                 answer = ""

#             return answer, outputs
            
#         except torch.cuda.OutOfMemoryError as e:
#             print(f"CUDA OOM during generation: {e}")
#             # Clean up, then re-raise; no retry is attempted here
#             self.aggressive_memory_cleanup()
#             raise e

#     def process_single_question(self, model_name, model, processor, image, question, image_id):
#         """Process a single question with extreme memory optimization."""
#         try:
#             # Ultra-aggressive pre-check
#             if not self.check_memory_before_processing(f"{image_id}_q{question['id']}", skip_if_high=False):
#                 raise RuntimeError("Insufficient GPU memory after cleanup")

#             # Create a minimal image copy to avoid references
#             image_copy = image.copy()
            
#             messages = [
#                 {
#                     "role": "user",
#                     "content": [
#                         {"type": "image", "image": image_copy},
#                         {"type": "text", "text": f"{question['question']} Your response MUST be in the following format and nothing else:\n <NUMBER> [<OBJECT1>, <OBJECT2>, <OBJECT3>, ...]"}
#                     ]
#                 },
#             ]
            
#             # Process with maximum memory optimization
#             text = processor.apply_chat_template(
#                 messages, tokenize=False, add_generation_prompt=True
#             )
            
#             # Monitor memory before tokenization
#             memory_before = self.get_gpu_memory_info()
#             if memory_before and memory_before['usage_percent'] > 75:
#                 print(f"⚠️  Memory usage high before tokenization: {memory_before['usage_percent']:.1f}%")
#                 self.ultra_aggressive_memory_cleanup()
            
#             # Process inputs with minimal memory footprint
#             inputs = processor(
#                 text=text,
#                 images=image_copy,
#                 videos=None,
#                 padding=True,
#                 return_tensors="pt",
#             )
            
#             # Move to device only when needed
#             inputs = {k: v.to(self.device) if hasattr(v, 'to') else v for k, v in inputs.items()}
            
#             # Delete image copy immediately
#             del image_copy, messages
#             self.ultra_aggressive_memory_cleanup()
            
#             # Monitor memory before generation
#             memory_before_gen = self.get_gpu_memory_info()
#             if memory_before_gen:
#                 print(f"Memory before generation: {memory_before_gen['usage_percent']:.1f}%")
#                 if memory_before_gen['usage_percent'] > 85:
#                     raise RuntimeError(f"Memory too high for generation: {memory_before_gen['usage_percent']:.1f}%")
            
#             # Generate with maximum memory efficiency
#             with torch.no_grad():
#                 with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
#                     answer, outputs = self.model_generation(model_name, model, inputs, processor)
            
#             cleaned_answer = self.clean_answer(answer)
            
#             # Immediate and thorough cleanup
#             del outputs, inputs
#             self.ultra_aggressive_memory_cleanup()
            
#             return {
#                 "question_id": question["id"],
#                 "question": question["question"],
#                 "ground_truth": question["answer"],
#                 "model_answer": cleaned_answer["count"],
#                 "model_reasoning": cleaned_answer["reasoning"],
#                 "raw_answer": answer,
#                 "property_category": question["property_category"]
#             }
            
#         except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
#             print(f"Error processing question {question['id']}: {e}")
#             self.ultra_aggressive_memory_cleanup()
#             raise e

#     def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
#         results = []
        
#         # Initialize tracking variables
#         successful_images = []
#         failed_images = []
#         total_questions_processed = 0
#         total_questions_failed = 0
        
#         print(f"\nEvaluating {model_name}...")
#         print(f"Using device: {self.device}")
        
#         # Initial memory cleanup
#         self.ultra_aggressive_memory_cleanup()
        
#         # Print initial memory status
#         memory_info = self.get_gpu_memory_info()
#         if memory_info:
#             print(f"Initial GPU memory: {memory_info['usage_percent']:.1f}% used")

#         try:
#             images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
#             total_images = len(images)
            
#             for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
#                 image_start_time = time.time()
#                 current_image_questions_failed = 0
#                 current_image_questions_total = 0
                
#                 try:
#                     image_path = Path(self.data_dir) / image_data['path']
#                     if not image_path.exists():
#                         failed_images.append({
#                             'image_id': image_data['image_id'],
#                             'image_type': image_data.get('image_type', 'unknown'),
#                             'error_type': 'file_not_found',
#                             'error_message': f'Image not found at {image_path}'
#                         })
#                         continue
                    
#                     # Load and preprocess image with size control
#                     image = Image.open(image_path).convert("RGB")
#                     print(f"Original image size: {image.size}")
                    
#                     # Resize aggressively - much smaller images
#                     image = self.resize_image_if_needed(image, max_size=(384, 384))
                    
#                     image_results = []
                    
#                     # Process questions one by one with memory monitoring
#                     for question_idx, question in enumerate(image_data['questions']):
#                         current_image_questions_total += 1
#                         total_questions_processed += 1
                        
#                         try:
#                             # Process single question
#                             question_result = self.process_single_question(
#                                 model_name, model, processor, image, question, image_data['image_id']
#                             )
                            
#                             # Add image metadata
#                             question_result.update({
#                                 "image_id": image_data["image_id"],
#                                 "image_type": image_data.get("image_type", "unknown")
#                             })
                            
#                             image_results.append(question_result)
                            
#                         except Exception as e:
#                             print(f"Failed question {question['id']}: {e}")
#                             current_image_questions_failed += 1
#                             total_questions_failed += 1
#                             continue
                    
#                     # Add results from this image
#                     results.extend(image_results)
                    
#                     # Calculate success metrics
#                     questions_succeeded = current_image_questions_total - current_image_questions_failed
                    
#                     if current_image_questions_failed == 0:
#                         successful_images.append({
#                             'image_id': image_data['image_id'],
#                             'image_type': image_data.get('image_type', 'unknown'),
#                             'questions_total': current_image_questions_total,
#                             'questions_succeeded': questions_succeeded,
#                             'processing_time': time.time() - image_start_time
#                         })
#                     else:
#                         image_success_rate = (questions_succeeded / current_image_questions_total * 100) if current_image_questions_total > 0 else 0
#                         failed_images.append({
#                             'image_id': image_data['image_id'],
#                             'image_type': image_data.get('image_type', 'unknown'),
#                             'error_type': 'partial_failure',
#                             'questions_total': current_image_questions_total,
#                             'questions_failed': current_image_questions_failed,
#                             'questions_succeeded': questions_succeeded,
#                             'success_rate': f"{image_success_rate:.1f}%"
#                         })
                    
#                     # Ultra-aggressive cleanup after each image
#                     del image
#                     self.ultra_aggressive_memory_cleanup()
                    
#                     # Save intermediate results
#                     if (idx + 1) % 2 == 0 or idx == total_images - 1:
#                         checkpoint_path = f"{save_path}_checkpoint.json"
#                         with open(checkpoint_path, 'w') as f:
#                             json.dump(results, f, indent=4)
                            
#                 except Exception as e:
#                     print(f"Complete failure for image {image_data['image_id']}: {e}")
#                     failed_images.append({
#                         'image_id': image_data['image_id'],
#                         'image_type': image_data.get('image_type', 'unknown'),
#                         'error_type': 'complete_failure',
#                         'error_message': str(e)
#                     })
                    
#                     # Cleanup even on failure
#                     self.ultra_aggressive_memory_cleanup()
#                     continue
            
#             # Save final results
#             if results:
#                 with open(save_path, 'w') as f:
#                     json.dump(results, f, indent=4)
            
#         except Exception as e:
#             print(f"Critical error during evaluation: {e}")
#             if results:
#                 error_save_path = f"{save_path}_error_state.json"
#                 with open(error_save_path, 'w') as f:
#                     json.dump(results, f, indent=4)
        
#         # Print comprehensive summary
#         self._print_evaluation_summary(
#             model_name, total_images, successful_images, failed_images, 
#             total_questions_processed, total_questions_failed, len(results)
#         )
        
#         return results
    
#     def _print_evaluation_summary(self, model_name, total_images, successful_images, 
#                                 failed_images, total_questions_processed, total_questions_failed, total_results):
#         """Print a comprehensive evaluation summary."""
#         print(f"\n{'='*60}")
#         print(f"EVALUATION SUMMARY FOR {model_name.upper()}")
#         print(f"{'='*60}")
        
#         # Image-level statistics
#         num_successful = len(successful_images)
#         num_failed = len(failed_images)
        
#         print(f"📊 IMAGE PROCESSING SUMMARY:")
#         print(f"   Total images attempted: {total_images}")
#         print(f"   Successfully processed: {num_successful} ({num_successful/total_images*100:.1f}%)")
#         print(f"   Failed images: {num_failed} ({num_failed/total_images*100:.1f}%)")
        
#         # Question-level statistics
#         questions_succeeded = total_questions_processed - total_questions_failed
#         print(f"\n📝 QUESTION PROCESSING SUMMARY:")
#         print(f"   Total questions attempted: {total_questions_processed}")
#         print(f"   Successfully processed: {questions_succeeded} ({questions_succeeded/total_questions_processed*100:.1f}%)")
#         print(f"   Failed questions: {total_questions_failed} ({total_questions_failed/total_questions_processed*100:.1f}%)")
#         print(f"   Results saved: {total_results}")
        
#         # Memory usage summary
#         memory_info = self.get_gpu_memory_info()
#         if memory_info:
#             print(f"\n🧠 FINAL MEMORY USAGE:")
#             print(f"   Current allocation: {memory_info['allocated']:.2f} GB ({memory_info['usage_percent']:.1f}%)")
#             print(f"   Peak allocation: {memory_info['max_allocated']:.2f} GB")
#             print(f"   Total GPU memory: {memory_info['total']:.2f} GB")
        
#         # Successful images details
#         if successful_images:
#             print(f"\n✅ SUCCESSFUL IMAGES ({len(successful_images)}):")
#             for img in successful_images:
#                 print(f"   • {img['image_id']} (Type: {img['image_type']}, "
#                       f"Questions: {img['questions_succeeded']}/{img['questions_total']}, "
#                       f"Time: {img['processing_time']:.1f}s)")
        
#         # Failed images details
#         if failed_images:
#             print(f"\n❌ FAILED/PROBLEMATIC IMAGES ({len(failed_images)}):")
#             for img in failed_images:
#                 if img['error_type'] == 'complete_failure':
#                     print(f"   • {img['image_id']} (Type: {img['image_type']}) - "
#                           f"COMPLETE FAILURE: {img.get('error_message', 'Unknown error')}")
#                 elif img['error_type'] == 'partial_failure':
#                     print(f"   • {img['image_id']} (Type: {img['image_type']}) - "
#                           f"PARTIAL: {img['questions_failed']}/{img['questions_total']} failed "
#                           f"({img['success_rate']} success)")
#                 elif img['error_type'] == 'file_not_found':
#                     print(f"   • {img['image_id']} (Type: {img['image_type']}) - "
#                           f"FILE NOT FOUND: {img['error_message']}")
        
#         print(f"{'='*60}\n")

In [6]:
class BenchmarkTester:
    def __init__(self, benchmark_path="/var/scratch/ave303/OP_bench/benchmark.json", data_dir="/var/scratch/ave303/OP_bench/"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        with open(benchmark_path, 'r') as f:
            self.benchmark = json.load(f)
        self.data_dir = data_dir

    def clean_answer(self, answer):
        """Extract number and reasoning from the model's answer."""
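        # Try to extract the count and the bracketed object list with a regex
        # (`re` is already imported at the top of the notebook)
        pattern = r'(\d+)\s*\[(.*?)\]'
        match = re.search(pattern, answer)
        
        if match:
            number = match.group(1)
            # Drop empty items so that an answer like "0 []" yields an empty list
            objects = [obj.strip() for obj in match.group(2).split(',') if obj.strip()]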
            return {
                "count": number,
                "reasoning": objects
            }
        else:
            # Fallback if format isn't matched
            numbers = re.findall(r'\d+', answer)
            return {
                "count": numbers[0] if numbers else "0",
                "reasoning": []
            }

    def model_generation(self, model_name, model, inputs, processor):
        """Generate answer and decode with greedy decoding."""
        outputs = None  # Initialize outputs to None
        
        if model_name == "Qwen2.5-VL":
            # Explicit greedy decoding parameters
            outputs = model.generate(
                **inputs, 
                max_new_tokens=200,
                do_sample=False,          # Disable sampling for greedy decoding
                temperature=None,         # Not used in greedy decoding
                top_p=None,              # Not used in greedy decoding  
                top_k=None,              # Not used in greedy decoding
                num_beams=1,             # Single beam for greedy decoding
                early_stopping=False,    # Let it generate until max_tokens or EOS
                pad_token_id=processor.tokenizer.pad_token_id,
                eos_token_id=processor.tokenizer.eos_token_id
            )
            outputs = [
                out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, outputs)
            ]
            answer = processor.batch_decode(
                outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
            )[0]
        else:
            print(f"Warning: Unknown model name '{model_name}' in model_generation.")
            answer = ""  # Return an empty string

        return answer, outputs
    
    def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
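        """Run every question for a slice of benchmark images and save the results.

        `batch_size` is the number of images to evaluate starting at `start_idx`,
        not a tensor batch size; questions are sent to the model one at a time.
        """
        results = []
        
        # Initialize tracking variables
        successful_images = []
        failed_images = []
        total_questions_processed = 0
        total_questions_failed = 0
        total_images = 0  # Defined up front so the summary still works if loading the image list fails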
        
        print(f"\nEvaluating {model_name}...")
        print(f"Using device: {self.device}")
        
        # Force garbage collection before starting
        gc.collect()
        torch.cuda.empty_cache()

        try:
            images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
            total_images = len(images)
            
            for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
                image_start_time = time.time()
                current_image_questions_failed = 0
                current_image_questions_total = 0
                
                try:
                    image_path = Path(self.data_dir)/image_data['path']
                    if not image_path.exists():
                        failed_images.append({
                            'image_id': image_data['image_id'],
                            'image_type': image_data.get('image_type', 'unknown'),
                            'error_type': 'file_not_found',
                            'error_message': f'Image not found at {image_path}'
                        })
                        continue
                    
                    # Load and preprocess image
                    image = Image.open(image_path).convert("RGB")
                    image_results = []  # Store results for current image
                    
                    for question_idx, question in enumerate(image_data['questions']):
                        current_image_questions_total += 1
                        total_questions_processed += 1
                        
                        try:
                            messages = [
                                {
                                    "role": "user",
                                    "content": [
                                        {"type": "image", "image": image},
                                        {"type": "text", "text": f"{question['question']} Your response MUST be in the following format and nothing else:\n <NUMBER> [<OBJECT1>, <OBJECT2>, <OBJECT3>, ...]"}
                                    ]
                                },
                            ]
                            
                            # Clear cache before processing each question
                            torch.cuda.empty_cache()
                            
                            # Process image and text for Qwen2.5-VL
                            text = processor.apply_chat_template(
                                messages, tokenize=False, add_generation_prompt=True
                            )
                            inputs = processor(
                                text=text,
                                images=image,
                                videos=None,
                                padding=True,
                                return_tensors="pt",
                            ).to(model.device)  # Follow the model's device instead of hardcoding "cuda"
                            
                            # Generate answer with greedy decoding
                            with torch.no_grad():
                                answer, outputs = self.model_generation(model_name, model, inputs, processor)
        
                            cleaned_answer = self.clean_answer(answer)
                            
                            image_results.append({
                                "image_id": image_data["image_id"],
                                "image_type": image_data.get("image_type", "unknown"),
                                "question_id": question["id"],
                                "question": question["question"],
                                "ground_truth": question["answer"],
                                "model_answer": cleaned_answer["count"],
                                "model_reasoning": cleaned_answer["reasoning"],
                                "raw_answer": answer,  # Keep raw answer for debugging
                                "property_category": question["property_category"]
                            })
                            
                            # Clear memory
                            del outputs, inputs
                            torch.cuda.empty_cache()
                            
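                        except Exception as e:
                            # Count the failure and report it instead of dropping it silently
                            current_image_questions_failed += 1
                            total_questions_failed += 1
                            tqdm.write(
                                f"Question {question.get('id', question_idx)} on image "
                                f"{image_data['image_id']} failed: {e}"
                            )
                            continue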
                    
                    # Add results from this image
                    results.extend(image_results)
                    
                    # Calculate success rate for this image
                    questions_succeeded = current_image_questions_total - current_image_questions_failed
                    
                    if current_image_questions_failed == 0:
                        # All questions succeeded
                        successful_images.append({
                            'image_id': image_data['image_id'],
                            'image_type': image_data.get('image_type', 'unknown'),
                            'questions_total': current_image_questions_total,
                            'questions_succeeded': questions_succeeded,
                            'processing_time': time.time() - image_start_time
                        })
                    else:
                        # Some questions failed
                        image_success_rate = (questions_succeeded / current_image_questions_total * 100) if current_image_questions_total > 0 else 0
                        failed_images.append({
                            'image_id': image_data['image_id'],
                            'image_type': image_data.get('image_type', 'unknown'),
                            'error_type': 'partial_failure',
                            'questions_total': current_image_questions_total,
                            'questions_failed': current_image_questions_failed,
                            'questions_succeeded': questions_succeeded,
                            'success_rate': f"{image_success_rate:.1f}%"
                        })
                    
                    # Save intermediate results only every 2 images or if it's the last image
                    if (idx + 1) % 2 == 0 or idx == total_images - 1:
                        checkpoint_path = f"{save_path}_checkpoint.json"
                        with open(checkpoint_path, 'w') as f:
                            json.dump(results, f, indent=4)
                            
                except Exception as e:
                    failed_images.append({
                        'image_id': image_data['image_id'],
                        'image_type': image_data.get('image_type', 'unknown'),
                        'error_type': 'complete_failure',
                        'error_message': str(e)
                    })
                    continue
            
            # Save final results
            if results:
                with open(save_path, 'w') as f:
                    json.dump(results, f, indent=4)
            
        except Exception as e:
            # Report the failure, then keep whatever was collected so far
            print(f"Evaluation aborted: {e}")
            if results:
                error_save_path = f"{save_path}_error_state.json"
                with open(error_save_path, 'w') as f:
                    json.dump(results, f, indent=4)
        
        # Print comprehensive summary
        self._print_evaluation_summary(
            model_name, total_images, successful_images, failed_images, 
            total_questions_processed, total_questions_failed, len(results)
        )
        
        return results
    
    def _print_evaluation_summary(self, model_name, total_images, successful_images, 
                                failed_images, total_questions_processed, total_questions_failed, total_results):
        """Print a comprehensive evaluation summary."""
        print(f"\n{'='*60}")
        print(f"EVALUATION SUMMARY FOR {model_name.upper()}")
        print(f"{'='*60}")
        
        # Image-level statistics
        num_successful = len(successful_images)
        num_failed = len(failed_images)
        # Avoid ZeroDivisionError when no images or questions were attempted
        image_denominator = max(total_images, 1)
        question_denominator = max(total_questions_processed, 1)
        
        print(f"📊 IMAGE PROCESSING SUMMARY:")
        print(f"   Total images attempted: {total_images}")
        print(f"   Successfully processed: {num_successful} ({num_successful/image_denominator*100:.1f}%)")
        print(f"   Failed images: {num_failed} ({num_failed/image_denominator*100:.1f}%)")
        
        # Question-level statistics
        questions_succeeded = total_questions_processed - total_questions_failed
        print(f"\n📝 QUESTION PROCESSING SUMMARY:")
        print(f"   Total questions attempted: {total_questions_processed}")
        print(f"   Successfully processed: {questions_succeeded} ({questions_succeeded/question_denominator*100:.1f}%)")
        print(f"   Failed questions: {total_questions_failed} ({total_questions_failed/question_denominator*100:.1f}%)")
        print(f"   Results saved: {total_results}")
        
        # Successful images details
        if successful_images:
            print(f"\n✅ SUCCESSFUL IMAGES ({len(successful_images)}):")
            for img in successful_images:
                print(f"   • {img['image_id']} (Type: {img['image_type']}, "
                      f"Questions: {img['questions_succeeded']}/{img['questions_total']}, "
                      f"Time: {img['processing_time']:.1f}s)")
        
        # Failed images details
        if failed_images:
            print(f"\n❌ FAILED/PROBLEMATIC IMAGES ({len(failed_images)}):")
            for img in failed_images:
                if img['error_type'] == 'complete_failure':
                    print(f"   • {img['image_id']} (Type: {img['image_type']}) - "
                          f"COMPLETE FAILURE: {img.get('error_message', 'Unknown error')}")
                elif img['error_type'] == 'partial_failure':
                    print(f"   • {img['image_id']} (Type: {img['image_type']}) - "
                          f"PARTIAL: {img['questions_failed']}/{img['questions_total']} failed "
                          f"({img['success_rate']} success)")
                elif img['error_type'] == 'file_not_found':
                    print(f"   • {img['image_id']} (Type: {img['image_type']}) - "
                          f"FILE NOT FOUND: {img['error_message']}")
        
        # Group failed images by type
        if failed_images:
            print(f"\n📈 FAILURE ANALYSIS BY IMAGE TYPE:")
            from collections import defaultdict
            failures_by_type = defaultdict(list)
            for img in failed_images:
                failures_by_type[img['image_type']].append(img)
            
            for img_type, failures in failures_by_type.items():
                print(f"   • {img_type}: {len(failures)} failed images")
                for failure in failures:
                    print(f"     - {failure['image_id']} ({failure['error_type']})")
        
        print(f"{'='*60}\n")
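The answer parsing in `clean_answer` is easy to try outside the class. The sketch below is a standalone copy of that logic under a hypothetical name, `parse_answer`; it also drops empty list items, so `0 []` yields an empty list. The sample strings are invented for illustration and are not model outputs.

```python
import re

# Same pattern as clean_answer: a count followed by a bracketed, comma-separated list
ANSWER_PATTERN = re.compile(r"(\d+)\s*\[(.*?)\]")


def parse_answer(answer: str) -> dict:
    """Standalone copy of the parsing logic (hypothetical helper, not part of the class)."""
    match = ANSWER_PATTERN.search(answer)
    if match:
        objects = [obj.strip() for obj in match.group(2).split(",") if obj.strip()]
        return {"count": match.group(1), "reasoning": objects}
    # Fallback: take the first number anywhere in the text, or "0" if there is none
    numbers = re.findall(r"\d+", answer)
    return {"count": numbers[0] if numbers else "0", "reasoning": []}


samples = [
    "3 [mug, kettle, bowl]",            # follows the requested format
    "There are 2 objects: [cup, jar]",  # digits are not directly before '[', so the fallback applies
    "I cannot tell.",                   # no number at all
]
for sample in samples:
    print(repr(sample), "->", parse_answer(sample))
```

Note that the fallback keeps the count but loses the object list, and an answer without any digits is recorded as a count of "0". Keeping `raw_answer` in the results makes it possible to audit such cases later.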

## Test SmolVLM Model

Let's evaluate the SmolVLM2-2.2B-Instruct model. The cell is currently commented out, so it does not run.

In [7]:
# def test_smolVLM2():
#     from transformers import AutoProcessor, AutoModelForImageTextToText

#     print("Loading smolVLM model...")
    
#     model = AutoModelForImageTextToText.from_pretrained(
#         "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
#         torch_dtype=torch.float16,
#         attn_implementation="flash_attention_2",
#         low_cpu_mem_usage=True,
#         trust_remote_code=True
#     ).to("cuda")

#     processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

#     ## Somewhat slow without flash_attention_2, which requires Ampere (or newer) GPUs. Better performance in some cases.

#     # Optional: Enable memory efficient attention
#     if hasattr(model.config, 'use_memory_efficient_attention'):
#         model.config.use_memory_efficient_attention = True

#     tester = BenchmarkTester()
#     smolVLM_results = tester.evaluate_model(
#         "smolVLM2",
#         model, 
#         processor, 
#         "smolVLM2_results_1.json", 
#         batch_size=25
#     )

#     # Clean up
#     del model, processor
#     torch.cuda.empty_cache()
#     gc.collect()
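Since `flash_attention_2` needs an Ampere-class or newer GPU (CUDA compute capability 8.0 and up) and the `flash-attn` package, the attention backend can be chosen at load time rather than hardcoded. This is a minimal sketch: `pick_attn_implementation` is a hypothetical helper, and it assumes the model class accepts `"sdpa"` as a fallback value for `attn_implementation`.

```python
from typing import Optional, Tuple


def pick_attn_implementation(capability: Optional[Tuple[int, int]]) -> str:
    """Return an `attn_implementation` value for a (major, minor) CUDA compute capability."""
    # FlashAttention-2 targets Ampere (compute capability 8.0) and newer GPUs
    if capability is not None and capability[0] >= 8:
        return "flash_attention_2"
    # Assumed fallback: PyTorch's scaled-dot-product attention
    return "sdpa"


# In the notebook the capability could come from torch, for example:
#   capability = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
print(pick_attn_implementation((8, 0)))  # A100-class GPU
print(pick_attn_implementation((7, 5)))  # T4-class GPU
print(pick_attn_implementation(None))    # no GPU available
```

The returned string would then be passed as `attn_implementation=...` to `from_pretrained`.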

## Test Qwen2.5-VL

Let's evaluate Qwen2.5-VL. The cell below loads the 3B Instruct checkpoint from a local path, because the 7B variant ran out of CUDA memory.

In [8]:
def test_Qwen2_5VL():
    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    
    # default: Load the model on the available device(s)
    # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    #     "Qwen/Qwen2.5-VL-3B-Instruct", 
    #     load_in_8bit=True, # throws error when .to() is added
    #     torch_dtype=torch.bfloat16, 
    #     device_map="auto",
    #     # attn_implementation="flash_attention_2",
    #     low_cpu_mem_usage=True
    # )
    
    # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "/var/scratch/ave303/models/qwen2-5-vl-3b",
        torch_dtype=torch.float16,
        # load_in_8bit=True,
        # attn_implementation="flash_attention_2",
        device_map="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True
    )
    
    # Default processor
    processor = AutoProcessor.from_pretrained("/var/scratch/ave303/models/qwen2-5-vl-3b")

    ### Qwen2.5-VL-7B-Instruct --> runs out of CUDA memory
    ### Qwen2.5-VL-3B-Instruct --> handles only 2 images before running out of memory, but performance is decent

    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    Qwen2_5VL_results = tester.evaluate_model(
        "Qwen2.5-VL",
        model, 
        processor, 
        "Qwen2.5-VL_3b_results.json",
        # start_idx=0,
        batch_size=360
    )

    # Clean up
    del model, processor
    torch.cuda.empty_cache()
    gc.collect()
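Because `evaluate_model` takes `start_idx` and `batch_size`, a long run can also be split into several smaller calls, which limits how much work is lost if one call runs out of memory. The sketch below only plans the slices; `chunk_ranges` and the per-chunk file names are hypothetical, and the chunk size of 100 is an arbitrary example.

```python
from typing import Iterator, Tuple


def chunk_ranges(total_images: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_idx, batch_size) pairs that cover `total_images` without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_images, chunk_size):
        yield start, min(chunk_size, total_images - start)


# Hypothetical plan: 360 images in chunks of 100
for start_idx, batch_size in chunk_ranges(360, 100):
    # Each call overwrites its save_path, so every chunk gets its own (hypothetical) file name
    save_path = f"Qwen2.5-VL_3b_results_{start_idx}.json"
    print(start_idx, batch_size, save_path)
    # tester.evaluate_model("Qwen2.5-VL", model, processor, save_path,
    #                       start_idx=start_idx, batch_size=batch_size)
```

The per-chunk JSON files would then need to be concatenated before analysis.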

## Run Evaluation

Now we can run our evaluation. The SmolVLM2 call is left commented out, so only the Qwen2.5-VL evaluation runs:

In [9]:
# test_smolVLM2()

In [10]:
test_Qwen2_5VL()

Loading checkpoint shards: 100%|██████████| 4/4 [00:15<00:00,  3.90s/it]




Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



Evaluating Qwen2.5-VL...
Using device: cuda


Processing images:   0%|          | 0/360 [00:00<?, ?it/s]


Processing images:  59%|█████▉    | 214/360 [1:10:43<08:31,  3.50s/it]

Processing images:  60%|█████▉    | 215/360 [1:10:44<06:48,  2.82s/it]

Processing images:  60%|██████    | 216/360 [1:10:46<05:57,  2.48s/it]

Processing images:  60%|██████    | 217/360 [1:10:57<12:02,  5.05s/it]

Processing images:  61%|██████    | 218/360 [1:11:00<10:31,  4.45s/it]

Processing images:  61%|██████    | 219/360 [1:11:03<08:57,  3.81s/it]

Processing images:  61%|██████    | 220/360 [1:11:04<07:09,  3.07s/it]

Processing images:  61%|██████▏   | 221/360 [1:11:08<08:08,  3.52s/it]

Processing images:  62%|██████▏   | 222/360 [1:11:10<06:56,  3.02s/it]

Processing images:  62%|██████▏   | 223/360 [1:11:13<06:34,  2.88s/it]

Processing images:  62%|██████▏   | 224/360 [1:11:15<06:10,  2.72s/it]

Processing images:  62%|██████▎   | 225/360 [1:11:17<05:47,  2.57s/it]

Processing images:  63%|██████▎   | 226/360 [1:11:19<05:06,  2.29s/it]

Processing images:  63%|██████▎   | 227/360 [1:11:20<04:27,  2.01s/it]

Processing images:  63%|██████▎   | 228/360 [1:11:22<04:02,  1.84s/it]

Processing images:  64%|██████▎   | 229/360 [1:11:27<05:53,  2.69s/it]

Processing images:  64%|██████▍   | 230/360 [1:12:57<1:02:54, 29.03s/it]

Processing images:  64%|██████▍   | 231/360 [1:14:53<1:58:28, 55.10s/it]

Processing images:  64%|██████▍   | 232/360 [1:16:23<2:20:08, 65.69s/it]

Processing images:  65%|██████▍   | 233/360 [1:18:16<2:49:05, 79.89s/it]

Processing images:  65%|██████▌   | 234/360 [1:18:42<2:13:49, 63.72s/it]

Processing images:  65%|██████▌   | 235/360 [1:20:16<2:31:30, 72.72s/it]

Processing images:  66%|██████▌   | 236/360 [1:22:08<2:54:25, 84.40s/it]

Processing images:  66%|██████▌   | 237/360 [1:23:59<3:09:39, 92.52s/it]

Processing images:  66%|██████▌   | 238/360 [1:25:31<3:07:25, 92.18s/it]

Processing images:  66%|██████▋   | 239/360 [1:27:23<3:18:22, 98.37s/it]

Processing images:  67%|██████▋   | 240/360 [1:29:16<3:25:28, 102.74s/it]

Processing images:  67%|██████▋   | 241/360 [1:29:19<2:24:26, 72.82s/it] 

Processing images:  67%|██████▋   | 242/360 [1:29:22<1:42:05, 51.91s/it]

Processing images:  68%|██████▊   | 243/360 [1:29:26<1:12:40, 37.27s/it]

Processing images:  68%|██████▊   | 244/360 [1:29:30<52:58, 27.40s/it]  

Processing images:  68%|██████▊   | 245/360 [1:29:32<38:11, 19.93s/it]

Processing images:  68%|██████▊   | 246/360 [1:29:35<28:13, 14.86s/it]

Processing images:  69%|██████▊   | 247/360 [1:29:40<21:59, 11.68s/it]

Processing images:  69%|██████▉   | 248/360 [1:29:45<17:58,  9.63s/it]

Processing images:  69%|██████▉   | 249/360 [1:29:49<14:56,  8.08s/it]

Processing images:  69%|██████▉   | 250/360 [1:29:53<12:21,  6.74s/it]

Processing images:  70%|██████▉   | 251/360 [1:29:57<10:54,  6.01s/it]

Processing images:  70%|███████   | 252/360 [1:30:01<09:33,  5.31s/it]

Processing images:  70%|███████   | 253/360 [1:30:05<08:51,  4.97s/it]

Processing images:  71%|███████   | 254/360 [1:30:07<07:11,  4.08s/it]

Processing images:  71%|███████   | 255/360 [1:30:09<06:12,  3.55s/it]

Processing images:  71%|███████   | 256/360 [1:30:11<05:32,  3.20s/it]

Processing images:  71%|███████▏  | 257/360 [1:30:14<05:04,  2.96s/it]

Processing images:  72%|███████▏  | 258/360 [1:30:19<05:53,  3.47s/it]

Processing images:  72%|███████▏  | 259/360 [1:30:21<05:11,  3.09s/it]

Processing images:  72%|███████▏  | 260/360 [1:30:23<04:35,  2.76s/it]

Processing images:  72%|███████▎  | 261/360 [1:30:25<04:23,  2.66s/it]

Processing images:  73%|███████▎  | 262/360 [1:30:27<04:05,  2.51s/it]

Processing images:  73%|███████▎  | 263/360 [1:30:32<04:55,  3.05s/it]

Processing images:  73%|███████▎  | 264/360 [1:30:34<04:32,  2.84s/it]

Processing images:  74%|███████▎  | 265/360 [1:30:36<04:01,  2.54s/it]

Processing images:  74%|███████▍  | 266/360 [1:30:40<04:51,  3.10s/it]

Processing images:  74%|███████▍  | 267/360 [1:31:14<18:53, 12.19s/it]

Processing images:  74%|███████▍  | 268/360 [1:31:16<14:08,  9.22s/it]

Processing images:  75%|███████▍  | 269/360 [1:31:18<10:33,  6.96s/it]

Processing images:  75%|███████▌  | 270/360 [1:31:20<08:10,  5.45s/it]

Processing images:  75%|███████▌  | 271/360 [1:31:21<06:19,  4.27s/it]

Processing images:  76%|███████▌  | 272/360 [1:31:23<05:05,  3.47s/it]

Processing images:  76%|███████▌  | 273/360 [1:31:25<04:31,  3.12s/it]

Processing images:  76%|███████▌  | 274/360 [1:31:27<04:09,  2.90s/it]

Processing images:  76%|███████▋  | 275/360 [1:31:29<03:42,  2.61s/it]

Processing images:  77%|███████▋  | 276/360 [1:31:31<03:23,  2.42s/it]

Processing images:  77%|███████▋  | 277/360 [1:31:33<03:09,  2.28s/it]

Processing images:  77%|███████▋  | 278/360 [1:31:35<03:05,  2.26s/it]

Processing images:  78%|███████▊  | 279/360 [1:31:50<07:59,  5.93s/it]

Processing images:  78%|███████▊  | 280/360 [1:31:52<06:30,  4.88s/it]

Processing images:  78%|███████▊  | 281/360 [1:31:54<05:18,  4.03s/it]

Processing images:  78%|███████▊  | 282/360 [1:31:56<04:28,  3.44s/it]

Processing images:  79%|███████▊  | 283/360 [1:32:30<15:58, 12.44s/it]

Processing images:  79%|███████▉  | 284/360 [1:32:33<12:16,  9.70s/it]

Processing images:  79%|███████▉  | 285/360 [1:32:36<09:29,  7.60s/it]

Processing images:  79%|███████▉  | 286/360 [1:32:38<07:27,  6.05s/it]

Processing images:  80%|███████▉  | 287/360 [1:32:41<05:57,  4.90s/it]

Processing images:  80%|████████  | 288/360 [1:32:43<04:56,  4.11s/it]

Processing images:  80%|████████  | 289/360 [1:33:16<15:14, 12.88s/it]

Processing images:  81%|████████  | 290/360 [1:33:47<21:23, 18.33s/it]

Processing images:  81%|████████  | 291/360 [1:34:21<26:16, 22.84s/it]

Processing images:  81%|████████  | 292/360 [1:34:54<29:28, 26.01s/it]

Processing images:  81%|████████▏ | 293/360 [1:35:27<31:30, 28.22s/it]

Processing images:  82%|████████▏ | 294/360 [1:36:01<32:45, 29.78s/it]

Processing images:  82%|████████▏ | 295/360 [1:36:04<23:46, 21.94s/it]

Processing images:  82%|████████▏ | 296/360 [1:36:08<17:29, 16.40s/it]

Processing images:  82%|████████▎ | 297/360 [1:36:13<13:36, 12.96s/it]

Processing images:  83%|████████▎ | 298/360 [1:36:18<10:55, 10.58s/it]

Processing images:  83%|████████▎ | 299/360 [1:36:22<08:51,  8.72s/it]

Processing images:  83%|████████▎ | 300/360 [1:36:27<07:24,  7.41s/it]

Processing images:  84%|████████▎ | 301/360 [1:37:00<14:58, 15.23s/it]

Processing images:  84%|████████▍ | 302/360 [1:37:33<19:59, 20.69s/it]

Processing images:  84%|████████▍ | 303/360 [1:38:07<23:16, 24.50s/it]

Processing images:  84%|████████▍ | 304/360 [1:38:40<25:21, 27.17s/it]

Processing images:  85%|████████▍ | 305/360 [1:38:55<21:30, 23.46s/it]

Processing images:  85%|████████▌ | 306/360 [1:39:29<23:49, 26.47s/it]

Processing images:  85%|████████▌ | 307/360 [1:40:02<25:12, 28.54s/it]

Processing images:  86%|████████▌ | 308/360 [1:40:35<26:00, 30.01s/it]

Processing images:  86%|████████▌ | 309/360 [1:41:09<26:21, 31.02s/it]

Processing images:  86%|████████▌ | 310/360 [1:41:24<21:47, 26.16s/it]

Processing images:  86%|████████▋ | 311/360 [1:41:25<15:23, 18.86s/it]

Processing images:  87%|████████▋ | 312/360 [1:41:27<11:02, 13.81s/it]

Processing images:  87%|████████▋ | 313/360 [1:41:30<08:05, 10.33s/it]

Processing images:  87%|████████▋ | 314/360 [1:41:32<05:58,  7.80s/it]

Processing images:  88%|████████▊ | 315/360 [1:41:34<04:34,  6.09s/it]

Processing images:  88%|████████▊ | 316/360 [1:41:36<03:35,  4.91s/it]

Processing images:  88%|████████▊ | 317/360 [1:41:39<03:08,  4.38s/it]

Processing images:  88%|████████▊ | 318/360 [1:41:42<02:48,  4.01s/it]

Processing images:  89%|████████▊ | 319/360 [1:41:45<02:34,  3.77s/it]

Processing images:  89%|████████▉ | 320/360 [1:41:48<02:22,  3.57s/it]

Processing images:  89%|████████▉ | 321/360 [1:41:52<02:15,  3.47s/it]

Processing images:  89%|████████▉ | 322/360 [1:41:55<02:11,  3.46s/it]

Processing images:  90%|████████▉ | 323/360 [1:42:28<07:40, 12.43s/it]

Processing images:  90%|█████████ | 324/360 [1:42:32<05:50,  9.73s/it]

Processing images:  90%|█████████ | 325/360 [1:42:35<04:30,  7.72s/it]

Processing images:  91%|█████████ | 326/360 [1:42:38<03:35,  6.35s/it]

Processing images:  91%|█████████ | 327/360 [1:43:11<07:57, 14.47s/it]

Processing images:  91%|█████████ | 328/360 [1:43:45<10:44, 20.14s/it]

Processing images:  91%|█████████▏| 329/360 [1:43:48<07:43, 14.95s/it]

Processing images:  92%|█████████▏| 330/360 [1:43:51<05:40, 11.36s/it]

Processing images:  92%|█████████▏| 331/360 [1:44:24<08:41, 17.97s/it]

Processing images:  92%|█████████▏| 332/360 [1:44:27<06:18, 13.51s/it]

Processing images:  92%|█████████▎| 333/360 [1:44:30<04:38, 10.32s/it]

Processing images:  93%|█████████▎| 334/360 [1:44:33<03:33,  8.23s/it]

Processing images:  93%|█████████▎| 335/360 [1:44:37<02:49,  6.79s/it]

Processing images:  93%|█████████▎| 336/360 [1:45:10<05:55, 14.79s/it]

Processing images:  94%|█████████▎| 337/360 [1:45:13<04:19, 11.29s/it]

Processing images:  94%|█████████▍| 338/360 [1:45:47<06:34, 17.92s/it]

Processing images:  94%|█████████▍| 339/360 [1:45:50<04:41, 13.42s/it]

Processing images:  94%|█████████▍| 340/360 [1:46:23<06:28, 19.41s/it]

Processing images:  95%|█████████▍| 341/360 [1:46:56<07:28, 23.61s/it]

Processing images:  95%|█████████▌| 342/360 [1:46:59<05:13, 17.43s/it]

Processing images:  95%|█████████▌| 343/360 [1:47:03<03:43, 13.12s/it]

Processing images:  96%|█████████▌| 344/360 [1:47:06<02:42, 10.14s/it]

Processing images:  96%|█████████▌| 345/360 [1:47:09<01:59,  7.95s/it]

Processing images:  96%|█████████▌| 346/360 [1:47:42<03:38, 15.60s/it]

Processing images:  96%|█████████▋| 347/360 [1:48:15<04:32, 20.94s/it]

Processing images:  97%|█████████▋| 348/360 [1:48:19<03:07, 15.64s/it]

Processing images:  97%|█████████▋| 349/360 [1:48:21<02:09, 11.76s/it]

Processing images:  97%|█████████▋| 350/360 [1:48:55<03:02, 18.27s/it]

Processing images:  98%|█████████▊| 351/360 [1:48:58<02:02, 13.65s/it]

Processing images:  98%|█████████▊| 352/360 [1:49:01<01:23, 10.43s/it]

Processing images:  98%|█████████▊| 353/360 [1:49:04<00:57,  8.22s/it]

Processing images:  98%|█████████▊| 354/360 [1:49:07<00:39,  6.66s/it]

Processing images:  99%|█████████▊| 355/360 [1:49:09<00:26,  5.32s/it]

Processing images:  99%|█████████▉| 356/360 [1:49:12<00:18,  4.69s/it]

Processing images:  99%|█████████▉| 357/360 [1:49:15<00:12,  4.18s/it]

Processing images:  99%|█████████▉| 358/360 [1:49:30<00:14,  7.48s/it]

Processing images: 100%|█████████▉| 359/360 [1:50:04<00:15, 15.26s/it]

Processing images: 100%|██████████| 360/360 [1:50:34<00:00, 19.82s/it]

Processing images: 100%|██████████| 360/360 [1:50:34<00:00, 18.43s/it]





EVALUATION SUMMARY FOR QWEN2.5-VL
📊 IMAGE PROCESSING SUMMARY:
   Total images attempted: 360
   Successfully processed: 360 (100.0%)
   Failed images: 0 (0.0%)

📝 QUESTION PROCESSING SUMMARY:
   Total questions attempted: 1080
   Successfully processed: 1080 (100.0%)
   Failed questions: 0 (0.0%)
   Results saved: 1080

✅ SUCCESSFUL IMAGES (360):
   • image01 (Type: REAL, Questions: 3/3, Time: 6.6s)
   • image02 (Type: REAL, Questions: 3/3, Time: 2.2s)
   • image03 (Type: REAL, Questions: 3/3, Time: 29.2s)
   • image04 (Type: REAL, Questions: 3/3, Time: 2.3s)
   • image05 (Type: REAL, Questions: 3/3, Time: 25.8s)
   • image06 (Type: REAL, Questions: 3/3, Time: 5.0s)
   • image07 (Type: REAL, Questions: 3/3, Time: 1.6s)
   • image08 (Type: REAL, Questions: 3/3, Time: 1.0s)
   • image09 (Type: REAL, Questions: 3/3, Time: 1.6s)
   • image10 (Type: REAL, Questions: 3/3, Time: 2.7s)
   • image11 (Type: REAL, Questions: 3/3, Time: 4.2s)
   • image12 (Type: REAL, Questions: 3/3, Time: 2.9s)
