# VLM Benchmark for Object Property Abstraction

This notebook implements a benchmark for evaluating Vision Language Models (VLMs) on object property abstraction and visual question answering (VQA) tasks. The benchmark includes three types of questions:

1. Direct Recognition
2. Property Inference
3. Counterfactual Reasoning

And three types of images:
- REAL
- ANIMATED
- AI GENERATED

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [1]:
import torch
import json
from pathlib import Path
from PIL import Image
import gc
import re
from tqdm import tqdm
from typing import List, Dict, Any

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Add this cell to your notebook
import sys
import torch
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU devices: {torch.cuda.device_count() if torch.cuda.is_available() else 'None'}")


Using device: cuda
Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:49:16) [MSC v.1929 64 bit (AMD64)]
PyTorch version: 2.2.1+cu118
CUDA available: True
GPU devices: 1


In [2]:
# Add to notebook
import torch
x = torch.rand(5,3)
print(x.device)  # Should be 'cpu'
if torch.cuda.is_available():
    x = x.cuda()
    print(x.device)  # Should be something like 'cuda:0'

cpu
cuda:0


In [3]:
# Add this to your notebook
import psutil
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 / 1024:.2f} MB")

Memory usage: 444.53 MB


## Benchmark Tester Class

This class handles the evaluation of models against our benchmark.

In [4]:
class BenchmarkTester:
    def __init__(self, benchmark_path: str = "benchmark.json", data_dir: str = "."):
        self.benchmark_path = Path(benchmark_path)
        self.data_dir = Path(data_dir)
        self.results = {} # []
        
        # Load benchmark data
        with open(self.benchmark_path, 'r') as f:
            self.benchmark_data = json.load(f)
    
    def format_question(self, question: str) -> str:
        """Format the question to ensure we get a numerical answer."""
        return f"{question['question']} \nPlease provide a number followed by the list of objects within square brackets as your answer."
    
    def clean_answer(self, answer: str) -> str:
        # """Extract only the first number from the model's answer."""
        # numbers = re.findall(r'\d+', answer)
        # return numbers[0] if numbers else ""
        """Extract number and reasoning from the model's answer."""
        # Try to extract number and reasoning using regex
        pattern = r'(\d+)\s*\[(.*?)\]'
        match = re.search(pattern, answer)
        
        if match:
            number = match.group(1)
            objects = [obj.strip() for obj in match.group(2).split(',')]
            return {
                "count": number,
                "reasoning": objects
            }
        else:
            # Fallback if format isn't matched
            numbers = re.findall(r'\d+', answer)
            return {
                "count": numbers[0] if numbers else "0",
                "reasoning": []
            }
    
    def evaluate_model(self, model, processor, batch_size: int = 3, start_idx: int = 0):
        """Evaluate the model on the benchmark images."""
        images = self.benchmark_data['benchmark']['images']
        total_images = len(images)

        # Force garbage collection before starting
        gc.collect()
        torch.cuda.empty_cache()
        
        # Use tqdm for the outer loop
        for i in tqdm(range(start_idx, total_images, batch_size), desc="Processing batches"):
            batch = images[i:i + batch_size]
            print(f"Processing batch {i//batch_size + 1}/{(total_images-1)//batch_size + 1}")
            
            for img_data in batch:
                img_path = Path(img_data['path'])
                if not img_path.exists():
                    print(f"Warning: Image not found at {img_path}")
                    continue
                
                image = Image.open(img_path).convert('RGB')
                
                # Use tqdm for questions within each image
                for q_idx, question in enumerate(tqdm(img_data['questions'], desc=f"Processing questions for {img_path.stem}")):
                    formatted_question = self.format_question(question)
                    print(f"Question: {question['question']}")

                    # Clear cache before processing each question
                    torch.cuda.empty_cache()
                    
                    # Process image and question
                    inputs = processor(images=image, text=formatted_question, return_tensors="pt").to(device)
                    
                    with torch.no_grad():
                        outputs = model.generate(
                            **inputs,
                            max_new_tokens=200,  
                            num_beams=1,        # probably needs to be 1
                            temperature=0.7,
                            pad_token_id=processor.tokenizer.eos_token_id
                        )
                    
                    answer = processor.decode(outputs[0], skip_special_tokens=True)
                    cleaned_answer = self.clean_answer(answer)
                    
                    # Store result
                    result_key = f"{img_path.stem}_q{q_idx}"
                    self.results[result_key] = {
                        'image_id': img_data["image_id"],
                        'image_type': img_data["image_type"],
                        'question_id': question["id"],
                        'question': question['question'],
                        'ground_truth': question['answer'],
                        'model_answer': cleaned_answer["count"],
                        'model_reasoning': cleaned_answer["reasoning"],
                        'raw_answer': answer,
                        'property_category': question["property_category"]                        
                    }
                    # self.results.append({
                    #     "image_id": img_data["image_id"],
                    #     "image_type": img_data["image_type"],
                    #     "question_id": question["id"],
                    #     "question": question['question'],
                    #     "ground_truth": question['answer'],
                    #     "model_answer": cleaned_answer["count"],
                    #     "model_reasoning": cleaned_answer["reasoning"],
                    #     "raw_answer": answer,
                    #     "property_category": question["property_category"]
                    # })
                    
                    # Clear memory
                    del inputs, outputs
                    torch.cuda.empty_cache()
                    gc.collect()
            
            # Save checkpoint after each batch
            self.save_results("checkpoint.json")

        # if self.results:
        #     self.save_results(save_path)
    
    def save_results(self, filename: str = "results.json"):
        """Save the evaluation results to a JSON file."""
        with open(filename, 'w') as f:
            json.dump(self.results, f, indent=2)

In [5]:
import torch
print(torch.version.cuda)  # Should show CUDA version if available

11.8


## Test Fuyu Model

Let's evaluate the Fuyu-8b model on our benchmark.

In [6]:
def test_fuyu():
    # from transformers import AutoModelForCausalLM, AutoTokenizer
    #from transformers import AutoProcessor, AutoModelForImageTextToText
    from transformers import FuyuProcessor, FuyuForCausalLM
    
    print("Loading Fuyu-8b model...")
    model = FuyuForCausalLM.from_pretrained(
        "adept/fuyu-8b",
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        use_cache=False
    )
    processor = FuyuProcessor.from_pretrained("adept/fuyu-8b", trust_remote_code=True)

    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    # Update generation settings
    model.config.use_cache = False  # Disable caching at config level

    print('Model loaded successfully')
    
    tester = BenchmarkTester()
    tester.evaluate_model(model, processor, batch_size=1)
    tester.save_results("fuyu_results.json")
    
    # Clean up
    del model, processor
    torch.cuda.empty_cache()
    gc.collect()

## Test BLIP-2 Model

Now let's evaluate the BLIP-2 model.

In [7]:
def test_blip2():
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    
    print("Loading BLIP-2 model...")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b",
        load_in_8bit=True,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True
    
    tester = BenchmarkTester()
    tester.evaluate_model(model, processor, batch_size=1)
    tester.save_results("blip2_results.json")
    
    # Clean up
    del model, processor
    torch.cuda.empty_cache()
    gc.collect()

## Run Evaluation

Now we can run our evaluation. Let's start with the Fuyu model:

In [8]:
test_fuyu()

  from .autonotebook import tqdm as notebook_tqdm


Loading Fuyu-8b model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.46s/it]
Some parameters are on the meta device device because they were offloaded to the disk and cpu.


Model loaded successfully


Processing batches:   0%|          | 0/25 [00:00<?, ?it/s]

Processing batch 1/25




Question: How many objects made of wood are present?


Processing questions for image01:   0%|          | 0/3 [06:07<?, ?it/s]
Processing batches:   0%|          | 0/25 [06:07<?, ?it/s]


KeyboardInterrupt: 

And then the BLIP-2 model:

In [9]:
%pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-win_amd64.whl.metadata (5.1 kB)
Downloading bitsandbytes-0.45.5-py3-none-win_amd64.whl (75.4 MB)
   ---------------------------------------- 0.0/75.4 MB ? eta -:--:--
   ---------------------------------------- 0.8/75.4 MB 5.6 MB/s eta 0:00:14
   - -------------------------------------- 2.6/75.4 MB 7.6 MB/s eta 0:00:10
   -- ------------------------------------- 4.5/75.4 MB 8.4 MB/s eta 0:00:09
   --- ------------------------------------ 6.8/75.4 MB 9.3 MB/s eta 0:00:08
   ----- ---------------------------------- 9.4/75.4 MB 10.0 MB/s eta 0:00:07
   ------ --------------------------------- 11.8/75.4 MB 10.3 MB/s eta 0:00:07
   ------- -------------------------------- 14.2/75.4 MB 10.5 MB/s eta 0:00:06
   -------- ------------------------------- 16.8/75.4 MB 10.8 MB/s eta 0:00:06
   ---------- ----------------------------- 19.1/75.4 MB 10.9 MB/s eta 0:00:06
   ----------- ---------------------------- 21.8/75.4 MB 11.0 MB

In [None]:
test_blip2()

  from .autonotebook import tqdm as notebook_tqdm


Loading BLIP-2 model...


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

: 