# VLM Benchmark for Object Property Abstraction

This notebook implements a benchmark for evaluating Vision Language Models (VLMs) on object property abstraction and visual question answering (VQA) tasks. The benchmark includes three types of questions:

1. Direct Recognition
2. Property Inference
3. Counterfactual Reasoning

It covers three types of images:
- REAL
- ANIMATED
- AI GENERATED

## Setup and Imports

First, let's install the required packages, import the necessary libraries, and check whether a GPU is available.

In [1]:
# Install required packages
!pip install transformers torch Pillow tqdm bitsandbytes accelerate

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch)
  Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch)
  Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6

In [2]:
!pip install num2words qwen-vl-utils  # flash-attn --no-build-isolation

Collecting num2words
  Downloading num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting qwen-vl-utils
  Downloading qwen_vl_utils-0.0.10-py3-none-any.whl.metadata (6.3 kB)
Collecting docopt>=0.6.2 (from num2words)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting av (from qwen-vl-utils)
  Downloading av-14.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Downloading num2words-0.5.14-py3-none-any.whl (163 kB)
Downloading qwen_vl_utils-0.0.10-py3-none-any.whl (6.7 kB)
Downloading av-14.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.2 MB)
Building wheels for collected packages: docopt
  Building wheel for doc

In [3]:
# Import required libraries
import torch
import json
from pathlib import Path
from PIL import Image
import gc
import re
from tqdm import tqdm
from typing import List, Dict, Any

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [4]:
# Image preprocessing utilities (PIL's Image is already imported above)
import numpy as np
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

## Benchmark Tester Class

This class handles the evaluation of models against our benchmark.
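
Before an image reaches the model, the class's `dynamic_preprocess` method picks a grid of square tiles whose shape best matches the image's aspect ratio. The sketch below re-implements only that selection step in plain Python, so it can be tried without PIL or torch. The function name `pick_grid` is made up for this illustration, and the candidate grids are sorted explicitly rather than collected in a set first.

```python
def pick_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (cols, rows) tile grid closest to the image's aspect ratio."""
    aspect_ratio = width / height
    # Every grid whose tile count lies in [min_num, max_num], smallest grids first
    candidates = sorted(
        ((cols, rows)
         for cols in range(1, max_num + 1)
         for rows in range(1, max_num + 1)
         if min_num <= cols * rows <= max_num),
        key=lambda grid: (grid[0] * grid[1], grid),
    )
    best, best_diff = (1, 1), float("inf")
    area = width * height
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels to fill it
            best = (cols, rows)
    return best


for size in [(896, 448), (1000, 1000)]:
    cols, rows = pick_grid(*size)
    print(size, "->", (cols, rows), f"({cols * rows} tiles)")
```

With the default 448-pixel tiles, an 896x448 image maps to a 2x1 grid and a 1000x1000 image to a 3x3 grid. `load_image` calls `dynamic_preprocess` with `use_thumbnail=True`, so whenever more than one tile is produced, one extra downscaled copy of the whole image is appended.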

In [5]:
class BenchmarkTester:
    def __init__(self, benchmark_path="/kaggle/input/opabenchmark/benchmark.json", data_dir="/kaggle/input/opabenchmark/data"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        with open(benchmark_path, 'r') as f:
            self.benchmark = json.load(f)
        self.data_dir = data_dir

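    def clean_answer(self, answer):
        """Extract the count and the object list from a reply of the form 'N [obj1, obj2, ...]'."""
        # Preferred format: a number followed by a bracketed, comma-separated list
        # (`re` is already imported at the top of the notebook)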
        pattern = r'(\d+)\s*\[(.*?)\]'
        match = re.search(pattern, answer)
        
        if match:
            number = match.group(1)
            objects = [obj.strip() for obj in match.group(2).split(',')]
            return {
                "count": number,
                "reasoning": objects
            }
        else:
            # Fallback if format isn't matched
            numbers = re.findall(r'\d+', answer)
            return {
                "count": numbers[0] if numbers else "0",
                "reasoning": []
            }

    IMAGENET_MEAN = (0.485, 0.456, 0.406)
    IMAGENET_STD = (0.229, 0.224, 0.225)

    def build_transform(self, input_size):
        MEAN, STD = self.IMAGENET_MEAN, self.IMAGENET_STD
        transform = T.Compose([
            T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD)
        ])
        return transform

    def find_closest_aspect_ratio(self, aspect_ratio, target_ratios, width, height, image_size):
        best_ratio_diff = float('inf')
        best_ratio = (1, 1)
        area = width * height
        for ratio in target_ratios:
            target_aspect_ratio = ratio[0] / ratio[1]
            ratio_diff = abs(aspect_ratio - target_aspect_ratio)
            if ratio_diff < best_ratio_diff:
                best_ratio_diff = ratio_diff
                best_ratio = ratio
            elif ratio_diff == best_ratio_diff:
                if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                    best_ratio = ratio
        return best_ratio

    def dynamic_preprocess(self, image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
        orig_width, orig_height = image.size
        aspect_ratio = orig_width / orig_height
    
        # enumerate candidate tile grids (cols, rows) with between min_num and max_num tiles
        target_ratios = set(
            (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
            i * j <= max_num and i * j >= min_num)
        target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    
        # pick the candidate grid whose aspect ratio is closest to the image's
        target_aspect_ratio = self.find_closest_aspect_ratio(
            aspect_ratio, target_ratios, orig_width, orig_height, image_size)

        # calculate the target width and height
        target_width = image_size * target_aspect_ratio[0]
        target_height = image_size * target_aspect_ratio[1]
        blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    
        # resize the image
        resized_img = image.resize((target_width, target_height))
        processed_images = []
        for i in range(blocks):
            box = (
                (i % (target_width // image_size)) * image_size,
                (i // (target_width // image_size)) * image_size,
                ((i % (target_width // image_size)) + 1) * image_size,
                ((i // (target_width // image_size)) + 1) * image_size
            )
            # split the image
            split_img = resized_img.crop(box)
            processed_images.append(split_img)
        assert len(processed_images) == blocks
        if use_thumbnail and len(processed_images) != 1:
            thumbnail_img = image.resize((image_size, image_size))
            processed_images.append(thumbnail_img)
        return processed_images

    def load_image(self, image_file, input_size=448, max_num=12):
        image = Image.open(image_file).convert('RGB')
        transform = self.build_transform(input_size=input_size)
        images = self.dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(image) for image in images]
        pixel_values = torch.stack(pixel_values)
        return pixel_values
    
    def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
        results = []
        print(f"\nEvaluating {model_name}...")
        print(f"Using device: {self.device}")
        
        # Force garbage collection before starting
        gc.collect()
        torch.cuda.empty_cache()

        try:
            images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
            total_images = len(images)
            
            for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
                try:
                    print(f"\nProcessing image {idx+1}/{total_images}: {image_data['image_id']}")
                    image_path = Path(self.data_dir)/image_data['path']
                    if not image_path.exists():
                        print(f"Warning: Image not found at {image_path}")
                        continue
                    
                    # Load and tile the image once; the tiles are reused for every question
                    # (`max_num` caps the number of tiles per image)
                    pixel_values = self.load_image(image_path, max_num=12).to(torch.bfloat16).to(self.device)
                    image_results = []  # Store results for current image
                    
                    for question in image_data['questions']:
                        try:
                            print(f"Question: {question['question']}")
                            
                            # Clear cache before processing each question
                            torch.cuda.empty_cache()

                            # Sampling is enabled, so answers can differ between runs
                            generation_config = dict(max_new_tokens=1024, do_sample=True)
                            
                            # Ask for the answer in the 'number [objects]' format that clean_answer parses
                            prompt = (
                                f'<image>\n {question["question"]} '
                                'Provide just the total count and the list of objects in the given format \n '
                                'Format: number [objects]'
                            )
                            answer = model.chat(processor, pixel_values, prompt, generation_config)
                            
                            cleaned_answer = self.clean_answer(answer)
                            
                            image_results.append({
                                "image_id": image_data["image_id"],
                                "image_type": image_data["image_type"],
                                "question_id": question["id"],
                                "question": question["question"],
                                "ground_truth": question["answer"],
                                "model_answer": cleaned_answer["count"],
                                "model_reasoning": cleaned_answer["reasoning"],
                                "raw_answer": answer,  # Keep raw answer for debugging
                                "property_category": question["property_category"]
                            })
                            
                            # Release cached GPU memory
                            torch.cuda.empty_cache()
                            
                        except Exception as e:
                            print(f"Error processing question: {str(e)}")
                            continue
                    
                    # Add results from this image
                    results.extend(image_results)
                    
                    # Save intermediate results only every 2 images or if it's the last image
                    if (idx + 1) % 2 == 0 or idx == total_images - 1:
                        with open(f"{save_path}_checkpoint.json", 'w') as f:
                            json.dump(results, f, indent=4)
                            
                except Exception as e:
                    print(f"Error processing image {image_data['image_id']}: {str(e)}")
                    continue
            
            # Save final results
            if results:
                with open(save_path, 'w') as f:
                    json.dump(results, f, indent=4)
            
        except Exception as e:
            print(f"An error occurred during evaluation: {str(e)}")
            if results:
                with open(f"{save_path}_error_state.json", 'w') as f:
                    json.dump(results, f, indent=4)
        
        return results
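
The prompt asks the model to reply as `number [objects]`, and `clean_answer` parses that format. The standalone sketch below uses the same regular expressions, which makes it easy to check how different replies are interpreted. The function name `parse_answer` and the sample replies are made up for illustration.

```python
import re

ANSWER_PATTERN = re.compile(r'(\d+)\s*\[(.*?)\]')


def parse_answer(answer):
    """Mirror of the parsing logic in clean_answer, outside the class."""
    match = ANSWER_PATTERN.search(answer)
    if match:
        objects = [obj.strip() for obj in match.group(2).split(',')]
        return {"count": match.group(1), "reasoning": objects}
    # Fallback: first number anywhere in the reply, or "0" when there is none
    numbers = re.findall(r'\d+', answer)
    return {"count": numbers[0] if numbers else "0", "reasoning": []}


samples = [
    "3 [chair, table, shelf]",        # expected format
    "There are 2 cups on the desk.",  # no brackets: fallback finds "2"
    "Image 1 shows 4 chairs.",        # fallback picks "1", not "4"
    "None.",                          # no digits: count defaults to "0"
]
for reply in samples:
    print(repr(reply), "->", parse_answer(reply))
```

The third sample shows a limitation of the fallback: when the brackets are missing, it takes the first number in the reply, which is not necessarily the count. Keeping `raw_answer` in each result record, as the class does, makes such cases easy to audit.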

In [6]:
import os
print(os.listdir("/kaggle/input/opabenchmark/"))

['data', 'benchmark.json']
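
`evaluate_model` reads a fixed set of keys from `benchmark.json`. The sketch below lists those keys and checks a dictionary with the same shape for missing ones. The key names come from the class above. Every value in `sample` (path, image type string, answer, property category) is a placeholder and not taken from the real dataset, and `find_missing_keys` is a hypothetical helper.

```python
IMAGE_KEYS = {"image_id", "image_type", "path", "questions"}
QUESTION_KEYS = {"id", "question", "answer", "property_category"}


def find_missing_keys(benchmark):
    """Return a list of readable problems; empty if every expected key is present."""
    problems = []
    for i, image in enumerate(benchmark["benchmark"]["images"]):
        for key in sorted(IMAGE_KEYS - set(image)):
            problems.append(f"image {i}: missing '{key}'")
        for j, question in enumerate(image.get("questions", [])):
            for key in sorted(QUESTION_KEYS - set(question)):
                problems.append(f"image {i}, question {j}: missing '{key}'")
    return problems


# Placeholder entry: all values are invented for illustration only
sample = {
    "benchmark": {
        "images": [
            {
                "image_id": "image01",
                "image_type": "REAL",
                "path": "placeholder/image01.jpg",
                "questions": [
                    {
                        "id": "q1",
                        "question": "How many objects made of wood are present?",
                        "answer": 0,
                        "property_category": "placeholder",
                    }
                ],
            }
        ]
    }
}

print(find_missing_keys(sample))
```

Running the same check on the loaded file (`tester.benchmark`) before a long evaluation would surface incomplete entries early, instead of as per-question errors in the middle of the run.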


## Test InternVL2.5
Let's evaluate the InternVL2_5-4B-MPO model.

In [7]:
def split_model(model_name):
    import math
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map
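
`split_model` treats GPU 0 as half a GPU because it also hosts the vision encoder, and it moves the last language-model layer back to GPU 0. The arithmetic can be checked without any GPU by passing the GPU count explicitly. In this sketch `layer_allocation` is a hypothetical name, only the decoder layers are covered, and unlike `split_model` it skips indices beyond the last layer.

```python
import math
from collections import Counter


def layer_allocation(num_layers, world_size):
    """Map each decoder layer index to a GPU index, following split_model's arithmetic."""
    if world_size < 1:
        raise ValueError("at least one GPU is required")
    # GPU 0 counts as half a GPU, so the layers are spread over (world_size - 0.5) units
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    placement = {}
    layer = 0
    for gpu, count in enumerate(counts):
        for _ in range(count):
            if layer < num_layers:  # deviation from split_model: ignore non-existent layers
                placement[layer] = gpu
            layer += 1
    # The final layer sits next to the output head on GPU 0
    placement[num_layers - 1] = 0
    return placement


placement = layer_allocation(num_layers=36, world_size=2)
print(Counter(placement.values()))
```

For the 36-layer 4B model on two GPUs, `ceil(36 / 1.5)` gives 24 layers per full GPU and 12 for GPU 0. After the last layer is moved back, GPU 0 holds 13 layers and GPU 1 holds 23. With a single GPU every layer lands on GPU 0.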

In [8]:
def test_InternVL2_5():
    import torch
    from transformers import AutoTokenizer, AutoModel

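    # NOTE: this map is computed but never passed to `from_pretrained` below,
    # so it has no effect on where the model is placed.
    device_map = split_model('InternVL2_5-4B')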
    
    model = AutoModel.from_pretrained(
        "OpenGVLab/InternVL2_5-4B-MPO",
        torch_dtype=torch.bfloat16,
        load_in_8bit=True,
        low_cpu_mem_usage=True,
        # use_flash_attn=True,
        trust_remote_code=True).eval()
    
    tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL2_5-4B-MPO", trust_remote_code=True, use_fast=False)

    # Note: InternVL2.5-4B performs reasonably well; its answers need slight post-processing.
    
    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    InternVL2_5_results = tester.evaluate_model(
        "InternVL2.5",
        model, 
        tokenizer, 
        "InternVL2.5_results.json", 
        batch_size=25
    )

    # Clean up
    del model, tokenizer
    torch.cuda.empty_cache()
    gc.collect()

## Run Evaluation

Now we can run our evaluation. Let's start with the InternVL2.5 model:

In [9]:
test_InternVL2_5() #8.59 #9.07

A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-4B-MPO:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-4B-MPO:
- configuration_internvl_chat.py
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-4B-MPO:
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-4B-MPO:
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-4B-MPO:
- modeling_internvl_chat.py
- modeling_intern_vit.py
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

FlashAttention2 is not installed.


Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.


Evaluating InternVL2.5...
Using device: cuda


Processing images:   0%|          | 0/25 [00:00<?, ?it/s]


Processing image 1/25: image01
Question: How many objects made of wood are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Count the number of breakable items?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If one of the metal objects were replaced by a wooden object, how many wooden objects would be there in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:   4%|▍         | 1/25 [00:15<06:04, 15.18s/it]


Processing image 2/25: image02
Question: How many mammals are present in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Count the number of items that can store other items?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If one of the zebra were replaced by a tree, how many mammals would be present in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:   8%|▊         | 2/25 [00:28<05:19, 13.89s/it]


Processing image 3/25: image03
Question: How many objects made of rubber are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects with the primary purpose of illumination can be seen?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the person riding one of the bicycles were replaced by a pedestrian, how many objects that have handles would be present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  12%|█▏        | 3/25 [00:44<05:34, 15.18s/it]


Processing image 4/25: image04
Question: How many tools are visible in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many cutting tools are present in this image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the red handle were replaced by a wooden handle, how many colored artifacts would remain in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  16%|█▌        | 4/25 [01:09<06:34, 18.80s/it]


Processing image 5/25: image05
Question: How many furniture items are present that have legs?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Count the number of containers that cannot hold hot liquids?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the room were transformed into an open workspace instead of a meeting room, how many privacy features would need to be removed?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  20%|██        | 5/25 [01:22<05:35, 16.79s/it]


Processing image 6/25: image06
Question: How many reptiles are visible in this enclosure?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many reptilian couples, at maximum, are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If all the small pebbles forming the mosaic floor were replaced with sand, how many natural elements would still be visible in the enclosure?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  24%|██▍       | 6/25 [01:35<04:53, 15.44s/it]


Processing image 7/25: image07
Question: How many birds are visible in this image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are present that can comfortably seat a human?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the birds sitting together only on one railing were to fly away, how many birds would remain?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  28%|██▊       | 7/25 [01:50<04:33, 15.22s/it]


Processing image 8/25: image08
Question: How many reptiles are visible in this image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are present that act as support?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If one turtle slid off the log into the water, how many turtles would be in the water?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  32%|███▏      | 8/25 [02:01<03:59, 14.11s/it]


Processing image 9/25: image09
Question: How many different types of vegetables are present in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are used as containers?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the bag of limes were removed and replaced with two additional avocados, how many fruits would be present in total on the table, considering avocados are fruits?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  36%|███▌      | 9/25 [02:19<04:01, 15.10s/it]


Processing image 10/25: image10
Question: How many objects are present that are flexible?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Count the number of items that are battery powered?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If two phones with three camera lenses were replaced with phones having two camera lenses, how many phones with two camera lenses would be present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  40%|████      | 10/25 [02:45<04:41, 18.74s/it]


Processing image 11/25: image01
Question: How many mammals are present in total?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are visible that can store items?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the bear were to be replaced by a tree, how many different types of mammals would be there at the zoo?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  44%|████▍     | 11/25 [03:11<04:50, 20.72s/it]


Processing image 12/25: image02
Question: How many kitchen tools are visible in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Count the number of items that require electricity to operate?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If blinds were installed for the windows above the sink, how many transparent objects would remain?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  48%|████▊     | 12/25 [03:18<03:37, 16.70s/it]


Processing image 13/25: image03
Question: How many objects made of glass are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many tools are visible that can be used for cutting?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the worker was not wearing ear protection, how many protective items would remain?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  52%|█████▏    | 13/25 [03:28<02:55, 14.59s/it]


Processing image 14/25: image04
Question: How many objects made of rubber are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Excluding the drawers, how many items in the workshop serve as containers for storage?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If an electric fan were placed on the workstation to provide ventilation, how many objects in the room would require electricity to operate?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  56%|█████▌    | 14/25 [03:40<02:32, 13.91s/it]


Processing image 15/25: image05
Question: How many birds are visible in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are present that act as support?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the clouds were to completely cover the sky, blocking the sunlight, how many natural elements would still be visible?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  60%|██████    | 15/25 [03:54<02:19, 13.95s/it]


Processing image 16/25: image06
Question: How many objects are present that have chimneys?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are visible that are means of transportation?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the bus were replaced by a pedestrian, how many mammals would be present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  64%|██████▍   | 16/25 [04:05<01:55, 12.88s/it]


Processing image 17/25: image07
Question: How many objects made of glass are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: Count the number of items that can be used to carry liquid?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the waste to be disposed was color-coded to match the bins, how many objects are to be thrown in the bin on the right?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  68%|██████▊   | 17/25 [04:14<01:34, 11.81s/it]


Processing image 18/25: image08
Question: How many objects are present that have legs?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many items are visible that are openable?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the bottle was removed from the table, how many objects are present on top of the table?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  72%|███████▏  | 18/25 [04:31<01:32, 13.27s/it]


Processing image 19/25: image09
Question: How many objects made of wood are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many kitchen items are visible that can be used for cutting?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the two jars on the top shelf were removed, how many breakable items would be present in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  76%|███████▌  | 19/25 [04:56<01:40, 16.79s/it]


Processing image 20/25: image10
Question: How many objects made of plastic are visible?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many items are visible that can record audio?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the microphones were replaced with headsets for every character, how many objects in total would be present that are worn on the head?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  80%|████████  | 20/25 [05:03<01:10, 14.10s/it]


Processing image 21/25: image01
Question: How many objects made of rubber are visible?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are visible that are means of transportation?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the car in the driveway were to leave, how many objects primarily made of metal would be present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  84%|████████▍ | 21/25 [05:22<01:01, 15.32s/it]


Processing image 22/25: image02
Question: How many objects made of concrete are present?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are visible that can be used for lifting?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the orange paint spilled all over one of the plexiglass sheets, how many objects would remain that are transparent?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  88%|████████▊ | 22/25 [05:40<00:48, 16.27s/it]


Processing image 23/25: image03
Question: How many mammals are present in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are visible that are used for both meat and wool production?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the two sheep were replaced by a cow grazing in the same area, how many objects would be present in between the two fences?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  92%|█████████▏| 23/25 [06:01<00:35, 17.54s/it]


Processing image 24/25: image04
Question: How many objects are visible that are made of paper?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many objects are present that behave as storage spaces?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the glasses were placed inside the ceramic container, and we use this container as a dividing line between the left and right sides of the bookshelf, how many objects would be on the right side?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images:  96%|█████████▌| 24/25 [06:16<00:16, 16.84s/it]


Processing image 25/25: image05
Question: How many objects are visible that are made of porcelain?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: How many decoration items are present in the image?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Question: If the drinks were split evenly between the two humans, how many drinks would each human consume?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Processing images: 100%|██████████| 25/25 [06:43<00:00, 16.13s/it]
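
The run above writes one record per question, each holding `ground_truth`, `model_answer`, `image_type`, and `property_category`. The sketch below shows one way those records could be scored. It assumes that each ground-truth answer is a single count stored as an integer or a numeric string, which the notebook does not confirm. `accuracy_by` and the `demo` records are invented for illustration.

```python
from collections import defaultdict


def accuracy_by(results, key="image_type"):
    """Exact-match accuracy of model_answer against ground_truth, grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in results:
        group = record[key]
        total[group] += 1
        # Compare as stripped strings so that 3 and "3" count as equal
        if str(record["model_answer"]).strip() == str(record["ground_truth"]).strip():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Invented records that use the same keys the evaluation loop writes
demo = [
    {"image_type": "REAL", "property_category": "material", "ground_truth": 3, "model_answer": "3"},
    {"image_type": "REAL", "property_category": "function", "ground_truth": 2, "model_answer": "4"},
    {"image_type": "ANIMATED", "property_category": "material", "ground_truth": "1", "model_answer": "1"},
]
print(accuracy_by(demo))
print(accuracy_by(demo, key="property_category"))
```

To score a real run, the list loaded with `json.load` from `InternVL2.5_results.json` can be passed in place of `demo`. Because the evaluation samples its generations (`do_sample=True`), repeated runs may give slightly different accuracies.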
