# VLM Benchmark for Object Property Abstraction

This notebook implements a benchmark for evaluating Vision Language Models (VLMs) on object property abstraction and visual question answering (VQA) tasks. The benchmark includes three types of questions:

1. Direct Recognition
2. Property Inference
3. Counterfactual Reasoning

It covers three types of images:
- REAL
- ANIMATED
- AI GENERATED

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [2]:
# Install required packages
!pip install transformers torch Pillow tqdm bitsandbytes accelerate

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch)
  Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch)
  Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6

In [3]:
!pip install num2words qwen-vl-utils  # optional: flash-attn --no-build-isolation

Collecting num2words
  Downloading num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting qwen-vl-utils
  Downloading qwen_vl_utils-0.0.10-py3-none-any.whl.metadata (6.3 kB)
Collecting docopt>=0.6.2 (from num2words)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting av (from qwen-vl-utils)
  Downloading av-14.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Downloading num2words-0.5.14-py3-none-any.whl (163 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.5/163.5 kB 1.8 MB/s eta 0:00:00
Downloading qwen_vl_utils-0.0.10-py3-none-any.whl (6.7 kB)
Downloading av-14.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.2/35.2 MB 35.9 MB/s eta 0:00:00
Building wheels for collected packages: docopt
  Build

In [4]:
# Import required libraries
import torch
import json
from pathlib import Path
from PIL import Image
import gc
import re
from tqdm import tqdm
from typing import List, Dict, Any
from qwen_vl_utils import process_vision_info

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Benchmark Tester Class

This class handles the evaluation of models against our benchmark.

In [8]:
class BenchmarkTester:
    def __init__(self, benchmark_path="/kaggle/input/opabenchmark/benchmark.json", data_dir="/kaggle/input/opabenchmark/data"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        with open(benchmark_path, 'r') as f:
            self.benchmark = json.load(f)
        self.data_dir = data_dir
    
    def format_question(self, question, model_name):
        """Format a question for the model."""

        if model_name=="blip2":
            return f"Question: {question['question']} Answer:"
        else:
            return f"Question: {question['question']} Answer with a number and list of objects. Answer:"

    def clean_answer(self, answer):
        """Extract the count and the list of objects from the model's answer.

        Expects output shaped like "3 [cup, plate, bowl]". If that format is
        not matched, falls back to the first number in the text (or "0").
        """
        # `re` is imported at the top of the notebook.
        pattern = r'(\d+)\s*\[(.*?)\]'
        match = re.search(pattern, answer)

        if match:
            number = match.group(1)
            objects = [obj.strip() for obj in match.group(2).split(',')]
            return {
                "count": number,
                "reasoning": objects
            }
        else:
            # Fallback if the expected format isn't matched
            numbers = re.findall(r'\d+', answer)
            return {
                "count": numbers[0] if numbers else "0",
                "reasoning": []
            }

    def model_generation(self, model_name, model, inputs, processor):
        """Generate answer and decode."""
        outputs = None  # Initialize outputs to None
        
        if model_name=="smolVLM2":
            outputs = model.generate(**inputs, do_sample=False, max_new_tokens=64)
            answer = processor.batch_decode(
                outputs,
                skip_special_tokens=True,
            )[0]
        elif model_name=="Qwen2.5-VL":
            outputs = model.generate(**inputs, max_new_tokens=128)
            outputs = [
                out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, outputs)
            ]
            answer = processor.batch_decode(
                outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
            )[0]
        else:
            print(f"Warning: Unknown model name '{model_name}' in model_generation.")
            answer = ""  # Return an empty string

        return answer, outputs
    
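    def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
        """Evaluate a model on a slice of the benchmark and save the results.

        Despite its name, `batch_size` is the number of images to evaluate,
        starting at `start_idx`; questions are still processed one at a time.
        """
        results = []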
        print(f"\nEvaluating {model_name}...")
        print(f"Using device: {self.device}")
        
        # Force garbage collection before starting
        gc.collect()
        torch.cuda.empty_cache()

        try:
            images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
            total_images = len(images)
            
            for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
                try:
                    print(f"\nProcessing image {idx+1}/{total_images}: {image_data['image_id']}")
                    image_path = Path(self.data_dir)/image_data['path']
                    if not image_path.exists():
                        print(f"Warning: Image not found at {image_path}")
                        continue
                    
                    # Load and preprocess image
                    image = Image.open(image_path).convert("RGB")
                    image_results = []  # Store results for current image
                    
                    for question in image_data['questions']:
                        try:
                            # prompt = self.format_question(question, model_name)
                            print(f"Question: {question['question']}")

                            messages = [
                                {
                                    "role": "user",
                                    "content": [
                                        {"type": "image", "image": image},
                                        {"type": "text", "text": f"{question['question']} Answer format: total number(numerical) objects(within square brackets)"},
                                    ]
                                },
                            ]
                            
                            # Clear cache before processing each question
                            torch.cuda.empty_cache()
                            
                            # Process image and text
                            # inputs = processor(images=image, text=prompt, return_tensors="pt").to(self.device)
                            if model_name=="smolVLM2":
                                inputs = processor.apply_chat_template(
                                    messages,
                                    add_generation_prompt=True,
                                    tokenize=True,
                                    return_dict=True,
                                    return_tensors="pt",
                                ).to(model.device, dtype=torch.bfloat16)
                            else:
                                text = processor.apply_chat_template(
                                    messages, tokenize=False, add_generation_prompt=True
                                )
                                # image_inputs, video_inputs = process_vision_info(messages)
                                inputs = processor(
                                    text=text,
                                    images=image,
                                    videos=None,
                                    padding=True,
                                    return_tensors="pt",
                                ).to(model.device)  # follow the model's device instead of hard-coding "cuda"
                            
                            # Generate the answer without tracking gradients
                            with torch.no_grad():
                                answer, outputs = self.model_generation(model_name, model, inputs, processor)
        
                            cleaned_answer = self.clean_answer(answer)
                            
                            image_results.append({
                                "image_id": image_data["image_id"],
                                "image_type": image_data["image_type"],
                                "question_id": question["id"],
                                "question": question["question"],
                                "ground_truth": question["answer"],
                                "model_answer": cleaned_answer["count"],
                                "model_reasoning": cleaned_answer["reasoning"],
                                "raw_answer": answer,  # Keep raw answer for debugging
                                "property_category": question["property_category"]
                            })
                            
                            # Clear memory
                            del outputs, inputs
                            torch.cuda.empty_cache()
                            
                        except Exception as e:
                            print(f"Error processing question: {str(e)}")
                            continue
                    
                    # Add results from this image
                    results.extend(image_results)
                    
                    # Save intermediate results only every 2 images or if it's the last image
                    if (idx + 1) % 2 == 0 or idx == total_images - 1:
                        with open(f"{save_path}_checkpoint.json", 'w') as f:
                            json.dump(results, f, indent=4)
                            
                except Exception as e:
                    print(f"Error processing image {image_data['image_id']}: {str(e)}")
                    continue
            
            # Save final results
            if results:
                with open(save_path, 'w') as f:
                    json.dump(results, f, indent=4)
            
        except Exception as e:
            print(f"An error occurred during evaluation: {str(e)}")
            if results:
                with open(f"{save_path}_error_state.json", 'w') as f:
                    json.dump(results, f, indent=4)
        
        return results
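
The parsing step in `clean_answer` is easy to try out on its own. Below is a minimal standalone sketch of the same logic; the function name `parse_answer` and the sample strings are made up for illustration and are not taken from a real model run.

```python
import re

# Same pattern as BenchmarkTester.clean_answer: a count followed by a bracketed list.
ANSWER_PATTERN = re.compile(r"(\d+)\s*\[(.*?)\]")


def parse_answer(answer: str) -> dict:
    """Standalone copy of the clean_answer logic, for experimentation."""
    match = ANSWER_PATTERN.search(answer)
    if match:
        objects = [obj.strip() for obj in match.group(2).split(",")]
        return {"count": match.group(1), "reasoning": objects}
    # Fallback: take the first number anywhere in the text, or "0" if there is none.
    numbers = re.findall(r"\d+", answer)
    return {"count": numbers[0] if numbers else "0", "reasoning": []}


# Hypothetical model outputs.
print(parse_answer("3 [table, chair, shelf]"))
print(parse_answer("There are 2 zebras."))
print(parse_answer("I cannot tell."))
```

Two properties of the fallback are worth keeping in mind when reading the saved results: it returns the first number in the text even if that number is not the count, and an answer with no number at all is recorded as "0", which cannot be told apart from a genuine zero. The `raw_answer` field is kept in the results for exactly this kind of check.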

In [9]:
import os
print(os.listdir("/kaggle/input/opabenchmark/"))

['data', 'benchmark.json']
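
The evaluation loop reads only a handful of fields from `benchmark.json`. Here is a sketch of the layout it appears to expect, inferred from the keys accessed in `evaluate_model`. All values are placeholders rather than real benchmark entries, and the `validate_benchmark` helper is hypothetical.

```python
# Keys read by BenchmarkTester.evaluate_model; every value below is a placeholder.
example_benchmark = {
    "benchmark": {
        "images": [
            {
                "image_id": "image01",
                "image_type": "REAL",
                "path": "real/image01.jpg",  # hypothetical path, relative to data_dir
                "questions": [
                    {
                        "id": "q1",  # hypothetical id format
                        "question": "How many objects made of wood are present?",
                        "answer": 0,  # placeholder ground truth
                        "property_category": "material",  # hypothetical category name
                    }
                ],
            }
        ]
    }
}

IMAGE_KEYS = {"image_id", "image_type", "path", "questions"}
QUESTION_KEYS = {"id", "question", "answer", "property_category"}


def validate_benchmark(benchmark: dict) -> list:
    """Return a list of problems; an empty list means all expected keys are present."""
    problems = []
    for i, image in enumerate(benchmark["benchmark"]["images"]):
        missing = IMAGE_KEYS - set(image)
        if missing:
            problems.append(f"image {i}: missing {sorted(missing)}")
        for j, question in enumerate(image.get("questions", [])):
            missing = QUESTION_KEYS - set(question)
            if missing:
                problems.append(f"image {i}, question {j}: missing {sorted(missing)}")
    return problems


print(validate_benchmark(example_benchmark))
```

Running a check like this before a long evaluation could surface a malformed entry up front instead of as a caught exception halfway through the loop.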


## Test SmolVLM2 Model

Let's evaluate the SmolVLM2-2.2B-Instruct model.

In [10]:
def test_smolVLM2():
    from transformers import AutoProcessor, AutoModelForImageTextToText

    print("Loading smolVLM model...")
    
    model = AutoModelForImageTextToText.from_pretrained(
        "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
        torch_dtype=torch.bfloat16,
        # _attn_implementation="flash_attention_2"
        low_cpu_mem_usage=True
    ).to("cuda")

    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

    # Runs somewhat slower without flash_attention_2, which requires an Ampere GPU;
    # enabling it gives better performance in some cases.

    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    smolVLM_results = tester.evaluate_model(
        "smolVLM2",
        model, 
        processor, 
        "smolVLM2_results.json", 
        batch_size=25
    )

    # Clean up
    del model, processor
    torch.cuda.empty_cache()
    gc.collect()

## Test Qwen2.5-VL

Let's evaluate the Qwen2.5-VL-3B-Instruct model. The 7B variant ran out of CUDA memory in this environment.

In [10]:
def test_Qwen2_5VL():
    from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
    
    # default: Load the model on the available device(s)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-3B-Instruct", 
        load_in_8bit=True,  # calling .to() on the 8-bit model throws an error, so placement is left to device_map
        torch_dtype=torch.bfloat16, 
        device_map="auto",
        # attn_implementation="flash_attention_2",
        low_cpu_mem_usage=True
    )
    
    # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
    # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    #     "Qwen/Qwen2.5-VL-7B-Instruct",
    #     torch_dtype=torch.bfloat16,
    #     attn_implementation="flash_attention_2",
    #     device_map="auto",
    # )
    
    # Default processor
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

    # Qwen2.5-VL-7B-Instruct runs out of CUDA memory.
    # Qwen2.5-VL-3B-Instruct handles only 2 images before running out of memory, but performs decently.

    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    Qwen2_5VL_results = tester.evaluate_model(
        "Qwen2.5-VL",
        model, 
        processor, 
        "Qwen2.5-VL_results.json", 
        batch_size=2
    )

    # Clean up
    del model, processor
    torch.cuda.empty_cache()
    gc.collect()
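
Since `evaluate_model` accepts `start_idx` and `batch_size`, a memory-limited model could in principle be run over the benchmark in small windows, one call per window. Below is a sketch of the bookkeeping, assuming each window writes to its own file (the method overwrites `save_path` on every call). The `plan_windows` helper and the file-name pattern are hypothetical, and whether windowing actually avoids the out-of-memory error has not been verified here.

```python
def plan_windows(total_images: int, window: int, prefix: str):
    """Yield (start_idx, batch_size, save_path) triples that cover every image once."""
    for start in range(0, total_images, window):
        size = min(window, total_images - start)
        yield start, size, f"{prefix}_{start:03d}.json"


for start_idx, batch_size, save_path in plan_windows(5, 2, "Qwen2.5-VL_results"):
    print(start_idx, batch_size, save_path)
    # Inside test_Qwen2_5VL, each triple would be passed on like this:
    # tester.evaluate_model("Qwen2.5-VL", model, processor, save_path,
    #                       start_idx=start_idx, batch_size=batch_size)
```

The per-window files can be concatenated afterwards, since each one is a plain JSON list of result records.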

## Run Evaluation

Now we can run our evaluation. Let's start with the SmolVLM2 model:

In [11]:
test_smolVLM2()


Loading smolVLM model...


config.json:   0%|          | 0.00/3.64k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/63.9k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.03G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/136 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/67.0 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/430 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/28.6k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.55M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/4.74k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/868 [00:00<?, ?B/s]


Evaluating smolVLM2...
Using device: cuda


Processing images:   0%|          | 0/25 [00:00<?, ?it/s]


Processing image 1/25: image01
Question: How many objects made of wood are present?
Question: Count the number of breakable items?
Question: If one of the metal objects were replaced by a wooden object, how many wooden objects would be there in the image?


Processing images:   4%|▍         | 1/25 [00:18<07:14, 18.09s/it]


Processing image 2/25: image02
Question: How many mammals are present in the image?
Question: Count the number of items that can store other items?
Question: If one of the zebra were replaced by a tree, how many mammals would be present in the image?


Processing images:   8%|▊         | 2/25 [00:34<06:33, 17.10s/it]


Processing image 3/25: image03
Question: How many objects made of rubber are present?
Question: How many objects with the primary purpose of illumination can be seen?
Question: If the person riding one of the bicycles were replaced by a pedestrian, how many objects that have handles would be present?


Processing images:  12%|█▏        | 3/25 [00:56<07:04, 19.31s/it]


Processing image 4/25: image04
Question: How many tools are visible in the image?
Question: How many cutting tools are present in this image?
Question: If the red handle were replaced by a wooden handle, how many colored artifacts would remain in the image?


Processing images:  16%|█▌        | 4/25 [01:13<06:24, 18.31s/it]


Processing image 5/25: image05
Question: How many furniture items are present that have legs?
Question: Count the number of containers that cannot hold hot liquids?
Question: If the room were transformed into an open workspace instead of a meeting room, how many privacy features would need to be removed?


Processing images:  20%|██        | 5/25 [01:30<05:59, 17.99s/it]


Processing image 6/25: image06
Question: How many reptiles are visible in this enclosure?
Question: How many reptilian couples, at maximum, are present?
Question: If all the small pebbles forming the mosaic floor were replaced with sand, how many natural elements would still be visible in the enclosure?


Processing images:  24%|██▍       | 6/25 [01:48<05:42, 18.01s/it]


Processing image 7/25: image07
Question: How many birds are visible in this image?
Question: How many objects are present that can comfortably seat a human?
Question: If the birds sitting together only on one railing were to fly away, how many birds would remain?


Processing images:  28%|██▊       | 7/25 [02:06<05:21, 17.87s/it]


Processing image 8/25: image08
Question: How many reptiles are visible in this image?
Question: How many objects are present that act as support?
Question: If one turtle slid off the log into the water, how many turtles would be in the water?


Processing images:  32%|███▏      | 8/25 [02:23<05:02, 17.81s/it]


Processing image 9/25: image09
Question: How many different types of vegetables are present in the image?
Question: How many objects are used as containers?
Question: If the bag of limes were removed and replaced with two additional avocados, how many fruits would be present in total on the table, considering avocados are fruits?


Processing images:  36%|███▌      | 9/25 [02:42<04:48, 18.03s/it]


Processing image 10/25: image10
Question: How many objects are present that are flexible?
Question: Count the number of items that are battery powered?
Question: If two phones with three camera lenses were replaced with phones having two camera lenses, how many phones with two camera lenses would be present?


Processing images:  40%|████      | 10/25 [03:02<04:38, 18.54s/it]


Processing image 11/25: image01
Question: How many mammals are present in total?
Question: How many objects are visible that can store items?
Question: If the bear were to be replaced by a tree, how many different types of mammals would be there at the zoo?


Processing images:  44%|████▍     | 11/25 [03:20<04:18, 18.44s/it]


Processing image 12/25: image02
Question: How many kitchen tools are visible in the image?
Question: Count the number of items that require electricity to operate?
Question: If blinds were installed for the windows above the sink, how many transparent objects would remain?


Processing images:  48%|████▊     | 12/25 [03:38<03:57, 18.29s/it]


Processing image 13/25: image03
Question: How many objects made of glass are present?
Question: How many tools are visible that can be used for cutting?
Question: If the worker was not wearing ear protection, how many protective items would remain?


Processing images:  52%|█████▏    | 13/25 [04:02<03:59, 19.99s/it]


Processing image 14/25: image04
Question: How many objects made of rubber are present?
Question: Excluding the drawers, how many items in the workshop serve as containers for storage?
Question: If an electric fan were placed on the workstation to provide ventilation, how many objects in the room would require electricity to operate?


Processing images:  56%|█████▌    | 14/25 [04:20<03:34, 19.48s/it]


Processing image 15/25: image05
Question: How many birds are visible in the image?
Question: How many objects are present that act as support?
Question: If the clouds were to completely cover the sky, blocking the sunlight, how many natural elements would still be visible?


Processing images:  60%|██████    | 15/25 [04:44<03:29, 20.92s/it]


Processing image 16/25: image06
Question: How many objects are present that have chimneys?
Question: How many objects are visible that are means of transportation?
Question: If the bus were replaced by a pedestrian, how many mammals would be present?


Processing images:  64%|██████▍   | 16/25 [05:08<03:16, 21.85s/it]


Processing image 17/25: image07
Question: How many objects made of glass are present?
Question: Count the number of items that can be used to carry liquid?
Question: If the waste to be disposed was color-coded to match the bins, how many objects are to be thrown in the bin on the right?


Processing images:  68%|██████▊   | 17/25 [05:32<02:59, 22.49s/it]


Processing image 18/25: image08
Question: How many objects are present that have legs?
Question: How many items are visible that are openable?
Question: If the bottle was removed from the table, how many objects are present on top of the table?


Processing images:  72%|███████▏  | 18/25 [05:51<02:30, 21.47s/it]


Processing image 19/25: image09
Question: How many objects made of wood are present?
Question: How many kitchen items are visible that can be used for cutting?
Question: If the two jars on the top shelf were removed, how many breakable items would be present in the image?


Processing images:  76%|███████▌  | 19/25 [06:10<02:04, 20.71s/it]


Processing image 20/25: image10
Question: How many objects made of plastic are visible?
Question: How many items are visible that can record audio?
Question: If the microphones were replaced with headsets for every character, how many objects in total would be present that are worn on the head?


Processing images:  80%|████████  | 20/25 [06:29<01:40, 20.02s/it]


Processing image 21/25: image01
Question: How many objects made of rubber are visible?
Question: How many objects are visible that are means of transportation?
Question: If the car in the driveway were to leave, how many objects primarily made of metal would be present?


Processing images:  84%|████████▍ | 21/25 [06:54<01:26, 21.50s/it]


Processing image 22/25: image02
Question: How many objects made of concrete are present?
Question: How many objects are visible that can be used for lifting?
Question: If the orange paint spilled all over one of the plexiglass sheets, how many objects would remain that are transparent?


Processing images:  88%|████████▊ | 22/25 [07:18<01:07, 22.45s/it]


Processing image 23/25: image03
Question: How many mammals are present in the image?
Question: How many objects are visible that are used for both meat and wool production?
Question: If the two sheep were replaced by a cow grazing in the same area, how many objects would be present in between the two fences?


Processing images:  92%|█████████▏| 23/25 [07:43<00:46, 23.08s/it]


Processing image 24/25: image04
Question: How many objects are visible that are made of paper?
Question: How many objects are present that behave as storage spaces?
Question: If the glasses were placed inside the ceramic container, and we use this container as a dividing line between the left and right sides of the bookshelf, how many objects would be on the right side?


Processing images:  96%|█████████▌| 24/25 [08:02<00:21, 21.80s/it]


Processing image 25/25: image05
Question: How many objects are visible that are made of porcelain?
Question: How many decoration items are present in the image?
Question: If the drinks were split evenly between the two humans, how many drinks would each human consume?


Processing images: 100%|██████████| 25/25 [08:20<00:00, 20.04s/it]


In [11]:
test_Qwen2_5VL()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


Evaluating Qwen2.5-VL...
Using device: cuda


Processing images:   0%|          | 0/2 [00:00<?, ?it/s]


Processing image 1/2: image01
Question: How many objects made of wood are present?
Question: Count the number of breakable items?
Question: If one of the metal objects were replaced by a wooden object, how many wooden objects would be there in the image?


Processing images:  50%|█████     | 1/2 [00:13<00:13, 13.72s/it]


Processing image 2/2: image02
Question: How many mammals are present in the image?
Question: Count the number of items that can store other items?
Question: If one of the zebra were replaced by a tree, how many mammals would be present in the image?


Processing images: 100%|██████████| 2/2 [00:33<00:00, 16.99s/it]
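
The runs above only save raw results. Below is a sketch of how the saved records could be scored by exact match on the count, grouped by any field that `evaluate_model` writes. It assumes `ground_truth` holds a count that equals `model_answer` once both are converted to strings, which the notebook does not confirm. The `accuracy_by` helper is hypothetical and the sample records are made up.

```python
from collections import defaultdict


def accuracy_by(results, key):
    """Exact-match accuracy of model_answer against ground_truth, grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in results:
        group = record[key]
        totals[group] += 1
        if str(record["model_answer"]).strip() == str(record["ground_truth"]).strip():
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Made-up records using the same keys as the saved results.
sample = [
    {"image_type": "REAL", "ground_truth": 3, "model_answer": "3"},
    {"image_type": "REAL", "ground_truth": 2, "model_answer": "4"},
    {"image_type": "ANIMATED", "ground_truth": 1, "model_answer": "1"},
]
print(accuracy_by(sample, "image_type"))  # {'REAL': 0.5, 'ANIMATED': 1.0}
```

With real output, the list would come from `json.load` on a saved file such as `smolVLM2_results.json`, and `"property_category"` could be passed as the key to get a per-property breakdown.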
