# VLM Benchmark for Object Property Abstraction

This notebook implements a benchmark for evaluating Vision Language Models (VLMs) on object property abstraction and visual question answering (VQA) tasks. The benchmark includes three types of questions:

1. Direct Recognition
2. Property Inference
3. Counterfactual Reasoning

It also covers three types of images:
- REAL
- ANIMATED
- AI GENERATED
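
The evaluation code later in this notebook reads the benchmark from a JSON file and accesses `benchmark['benchmark']['images']`, then the per-image keys `image_id`, `image_type`, `path`, `questions` and the per-question keys `id`, `question`, `answer`, `property_category`. The sketch below shows what one entry might look like under those key names. All values (the path, the question id, the answer, the category label) are invented for illustration, and `validate_benchmark` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical benchmark entry: key names mirror what the evaluation code reads,
# but every value below is made up for illustration.
example_benchmark = {
    "benchmark": {
        "images": [
            {
                "image_id": "image01",
                "image_type": "REAL",
                "path": "images/real/image01.jpg",  # assumed layout, relative to data_dir
                "questions": [
                    {
                        "id": "Q1",
                        "question": "How many objects made of wood are present?",
                        "answer": "3",  # invented ground truth
                        "property_category": "material",  # invented label
                    }
                ],
            }
        ]
    }
}

REQUIRED_IMAGE_KEYS = {"image_id", "image_type", "path", "questions"}
REQUIRED_QUESTION_KEYS = {"id", "question", "answer", "property_category"}


def validate_benchmark(benchmark: dict) -> list:
    """Return a list of problems; an empty list means every required key is present."""
    problems = []
    for i, image in enumerate(benchmark.get("benchmark", {}).get("images", [])):
        missing = REQUIRED_IMAGE_KEYS - image.keys()
        if missing:
            problems.append(f"image #{i}: missing {sorted(missing)}")
        for j, question in enumerate(image.get("questions", [])):
            missing_q = REQUIRED_QUESTION_KEYS - question.keys()
            if missing_q:
                problems.append(f"image #{i}, question #{j}: missing {sorted(missing_q)}")
    return problems


print(validate_benchmark(example_benchmark))
```

Running a check like this before a long evaluation can surface a malformed entry early, instead of as a per-question exception in the middle of a run.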

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [1]:
# Install required packages
# %pip install transformers torch Pillow tqdm bitsandbytes accelerate

In [2]:
%pip install flash-attn #--no-build-isolation 





Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import required libraries
import torch
import json
from pathlib import Path
from PIL import Image
import gc
import re
from tqdm import tqdm
from typing import Dict

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [4]:
import numpy as np
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode



## Benchmark Tester Class

This class handles the evaluation of models against our benchmark.
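
Part of the class below splits each image into a grid of square tiles whose shape is chosen to match the image's aspect ratio. The following is a minimal, dependency-free sketch of that grid selection, following the same logic as `find_closest_aspect_ratio` and the candidate enumeration in `dynamic_preprocess`. The function names `candidate_grids`, `pick_grid`, and `tile_count` are hypothetical and exist only for this illustration.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids with min_num <= columns * rows <= max_num, smallest first."""
    grids = {
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    }
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Choose the grid whose aspect ratio is closest to the image's."""
    aspect_ratio = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # On a tie, prefer the larger grid only if the image has enough pixels for it.
            if width * height > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best


def tile_count(width, height, use_thumbnail=True, **kwargs):
    """Number of tiles produced, including the extra thumbnail when there is more than one tile."""
    cols, rows = pick_grid(width, height, **kwargs)
    tiles = cols * rows
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


print(pick_grid(1344, 896))   # (3, 2): a 3:2 image maps to 3 columns by 2 rows
print(tile_count(1344, 896))  # 7: six tiles plus one thumbnail
print(pick_grid(448, 448))    # (1, 1): a small square image stays a single tile
```

The tile count matters for memory and latency, since every tile becomes one entry in the stacked `pixel_values` tensor passed to the model.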

In [5]:
class BenchmarkTester:
    def __init__(self, benchmark_path="/var/scratch/ave303/OP_bench/benchmark.json", data_dir="/var/scratch/ave303/OP_bench/"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        with open(benchmark_path, 'r') as f:
            self.benchmark = json.load(f)
        self.data_dir = data_dir

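    def clean_answer(self, answer):
        """Extract the count and the listed objects from the model's answer.

        Expects `<NUMBER> [<OBJECT1>, <OBJECT2>, ...]`. If that format is not
        matched, falls back to the first number found in the text.
        """
        # `re` is already imported at module level
        pattern = r'(\d+)\s*\[(.*?)\]'
        match = re.search(pattern, answer)
        
        if match:
            number = match.group(1)
            # Drop empty entries so that "0 []" yields [] rather than ['']
            objects = [obj.strip() for obj in match.group(2).split(',') if obj.strip()]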
            return {
                "count": number,
                "reasoning": objects
            }
        else:
            # Fallback if format isn't matched
            numbers = re.findall(r'\d+', answer)
            return {
                "count": numbers[0] if numbers else "0",
                "reasoning": []
            }

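    # Standard ImageNet statistics, kept for reference:
    # IMAGENET_MEAN = (0.485, 0.456, 0.406)
    # IMAGENET_STD = (0.229, 0.224, 0.225)

    # Despite the names, the values in use are a plain 0.5 mean/std normalization,
    # not the ImageNet statistics. Confirm which set matches the model being
    # evaluated before comparing results across models.
    IMAGENET_MEAN = (0.5, 0.5, 0.5)
    IMAGENET_STD = (0.5, 0.5, 0.5)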


    def build_transform(self, input_size):
        MEAN, STD = self.IMAGENET_MEAN, self.IMAGENET_STD
        transform = T.Compose([
            T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
            T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
            T.ToTensor(),
            T.Normalize(mean=MEAN, std=STD)
        ])
        return transform

    def find_closest_aspect_ratio(self, aspect_ratio, target_ratios, width, height, image_size):
        best_ratio_diff = float('inf')
        best_ratio = (1, 1)
        area = width * height
        for ratio in target_ratios:
            target_aspect_ratio = ratio[0] / ratio[1]
            ratio_diff = abs(aspect_ratio - target_aspect_ratio)
            if ratio_diff < best_ratio_diff:
                best_ratio_diff = ratio_diff
                best_ratio = ratio
            elif ratio_diff == best_ratio_diff:
                if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                    best_ratio = ratio
        return best_ratio

    def dynamic_preprocess(self, image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
        orig_width, orig_height = image.size
        aspect_ratio = orig_width / orig_height
    
        # enumerate candidate tile grids (columns, rows) with min_num <= columns * rows <= max_num
        target_ratios = set(
            (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
            i * j <= max_num and i * j >= min_num)
        target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    
        # pick the candidate grid whose aspect ratio is closest to the image's
        target_aspect_ratio = self.find_closest_aspect_ratio(
            aspect_ratio, target_ratios, orig_width, orig_height, image_size)

        # calculate the target width and height
        target_width = image_size * target_aspect_ratio[0]
        target_height = image_size * target_aspect_ratio[1]
        blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    
        # resize the image
        resized_img = image.resize((target_width, target_height))
        processed_images = []
        for i in range(blocks):
            box = (
                (i % (target_width // image_size)) * image_size,
                (i // (target_width // image_size)) * image_size,
                ((i % (target_width // image_size)) + 1) * image_size,
                ((i // (target_width // image_size)) + 1) * image_size
            )
            # split the image
            split_img = resized_img.crop(box)
            processed_images.append(split_img)
        assert len(processed_images) == blocks
        if use_thumbnail and len(processed_images) != 1:
            thumbnail_img = image.resize((image_size, image_size))
            processed_images.append(thumbnail_img)
        return processed_images

    def load_image(self, image_file, input_size=448, max_num=12):     #internvl -> 448, 12 || ristretto -> 384, 10
        image = Image.open(image_file).convert('RGB')
        transform = self.build_transform(input_size=input_size)
        images = self.dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(image) for image in images]
        pixel_values = torch.stack(pixel_values)
        return pixel_values
    
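    def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
        """Evaluate `model` on benchmark images [start_idx, start_idx + batch_size).

        Note: `batch_size` is the number of images evaluated in this call, not a
        tensor batch size; questions are sent to the model one at a time.
        `processor` is forwarded to `model.chat` (a tokenizer in the runs below).
        """
        results = []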
        print(f"\nEvaluating {model_name}...")
        print(f"Using device: {self.device}")
        
        # Force garbage collection before starting
        gc.collect()
        torch.cuda.empty_cache()

        try:
            images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
            total_images = len(images)
            
            for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
                try:
                    print(f"\nProcessing image {idx+1}/{total_images}: {image_data['image_id']}")
                    image_path = Path(self.data_dir)/image_data['path']
                    if not image_path.exists():
                        print(f"Warning: Image not found at {image_path}")
                        continue
                    
                    # Load and tile the image once; the tiles are reused for every question.
                    # `max_num` sets the maximum number of tiles.
                    pixel_values = self.load_image(image_path, max_num=12).to(torch.float16).to(self.device)
                    # Sampling is enabled, so answers can differ between runs.
                    generation_config = dict(max_new_tokens=1024, do_sample=True)
                    image_results = []  # Store results for current image
                    
                    for question in image_data['questions']:
                        try:
                            print(f"Question: {question['question']}")
                            
                            # Clear cache before processing each question
                            torch.cuda.empty_cache()

                            # prompt = f'<image>\n {question["question"]} Provide the total count of the objects and then list the objects, separated by commas. \n Format: <number> [<object1>, <object2>, <object3>, ...]'
                            # Answer with the total number(numerical) followed by the objects within square brackets' #Answer format: total number(numerical) objects(within square brackets)'
                            prompt = f'<image>\n {question["question"]} Your response MUST be in the following format and nothing else:\n <NUMBER> [<OBJECT1>, <OBJECT2>, <OBJECT3>, ...]'
                            # prompt = f'<image>\n {question["question"]} Answer format: total count  [list of objects]'
                            answer = model.chat(processor, pixel_values, prompt, generation_config)
                            
                            cleaned_answer = self.clean_answer(answer)
                            
                            image_results.append({
                                "image_id": image_data["image_id"],
                                "image_type": image_data["image_type"],
                                "question_id": question["id"],
                                "question": question["question"],
                                "ground_truth": question["answer"],
                                "model_answer": cleaned_answer["count"],
                                "model_reasoning": cleaned_answer["reasoning"],
                                "raw_answer": answer,  # Keep raw answer for debugging
                                "property_category": question["property_category"]
                            })
                            
                            # Clear memory
                            # del outputs, inputs
                            torch.cuda.empty_cache()
                            
                        except Exception as e:
                            print(f"Error processing question: {str(e)}")
                            continue
                    
                    # Add results from this image
                    results.extend(image_results)
                    
                    # Save intermediate results only every 2 images or if it's the last image
                    if (idx + 1) % 2 == 0 or idx == total_images - 1:
                        with open(f"{save_path}_checkpoint.json", 'w') as f:
                            json.dump(results, f, indent=4)
                            
                except Exception as e:
                    print(f"Error processing image {image_data['image_id']}: {str(e)}")
                    continue
            
            # Save final results
            if results:
                with open(save_path, 'w') as f:
                    json.dump(results, f, indent=4)
            
        except Exception as e:
            print(f"An error occurred during evaluation: {str(e)}")
            if results:
                with open(f"{save_path}_error_state.json", 'w') as f:
                    json.dump(results, f, indent=4)
        
        return results
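
The prompt in `evaluate_model` asks for answers of the form `<NUMBER> [<OBJECT1>, <OBJECT2>, ...]`, and `clean_answer` parses them with a regular expression plus a fallback. The standalone sketch below uses the same pattern and fallback so the behavior can be checked without loading a model. `parse_answer` is a hypothetical name, the sample strings are invented, and the sketch additionally drops empty list items.

```python
import re


def parse_answer(answer: str) -> dict:
    """Parse '<NUMBER> [<OBJECT1>, <OBJECT2>, ...]'; fall back to the first number in the text."""
    match = re.search(r'(\d+)\s*\[(.*?)\]', answer)
    if match:
        objects = [obj.strip() for obj in match.group(2).split(',') if obj.strip()]
        return {"count": match.group(1), "reasoning": objects}
    # Fallback: the format was not followed, so take the first number if there is one.
    numbers = re.findall(r'\d+', answer)
    return {"count": numbers[0] if numbers else "0", "reasoning": []}


print(parse_answer("3 [table, chair, shelf]"))
print(parse_answer("There are 2 wooden objects."))
print(parse_answer("I cannot tell."))
```

One consequence of the fallback is worth keeping in mind when reading results: for a free-form answer, the first number in the text is not necessarily the final count, and an answer with no digits at all is recorded as `"0"`. The `raw_answer` field saved alongside each result makes it possible to audit such cases.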

## Test InternVL2.5
Let's evaluate InternVL2.5 using the local `internvl2.5-8b` checkpoint.

In [6]:
# def split_model(model_name):
#     import math
#     device_map = {}
#     world_size = torch.cuda.device_count()
#     config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
#     num_layers = config.llm_config.num_hidden_layers
#     # num_layers = {
#     #     'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
#     #     'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
#     # Since the first GPU will be used for ViT, treat it as half a GPU.
#     num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
#     num_layers_per_gpu = [num_layers_per_gpu] * world_size
#     num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
#     layer_cnt = 0
#     for i, num_layer in enumerate(num_layers_per_gpu):
#         for j in range(num_layer):
#             device_map[f'language_model.model.layers.{layer_cnt}'] = i
#             layer_cnt += 1
#     device_map['vision_model'] = 0
#     device_map['mlp1'] = 0
#     device_map['language_model.model.tok_embeddings'] = 0
#     device_map['language_model.model.embed_tokens'] = 0
#     device_map['language_model.output'] = 0
#     device_map['language_model.model.norm'] = 0
#     device_map['language_model.model.rotary_emb'] = 0
#     device_map['language_model.lm_head'] = 0
#     device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

#     return device_map

In [7]:
def test_InternVL2_5():
    import torch
    from transformers import AutoTokenizer, AutoModel

    # device_map = split_model('InternVL2_5-4B')
    
    model = AutoModel.from_pretrained(
        "/var/scratch/ave303/models/internvl2.5-8b",
        torch_dtype=torch.float16,
        # load_in_8bit=False,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True).to('cuda').eval()
    
    tokenizer = AutoTokenizer.from_pretrained("/var/scratch/ave303/models/internvl2.5-8b", trust_remote_code=True, use_fast=False)

    ## InternVL2.5-4B --> performs decently well. slight post processing required
    
    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    InternVL2_5_results = tester.evaluate_model(
        "InternVL2.5",
        model, 
        tokenizer, 
        "InternVL2.5_results_8bMPO.json", 
        batch_size=50
    )

    # Clean up
    del model, tokenizer
    torch.cuda.empty_cache()
    gc.collect()

In [8]:
def test_InternVL3():
    import torch
    from transformers import AutoTokenizer, AutoModel

    # device_map = split_model('InternVL3-8B')
    
    model = AutoModel.from_pretrained(
        "/var/scratch/ave303/models/internvl3-8b",
        torch_dtype=torch.float16,
        load_in_8bit=False,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        # device_map=device_map,
        trust_remote_code=True).to('cuda').eval()
    
    tokenizer = AutoTokenizer.from_pretrained("/var/scratch/ave303/models/internvl3-8b", trust_remote_code=True, use_fast=False)

    ## InternVL2.5-4B --> performs decently well. slight post processing required
    
    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    InternVL3_results = tester.evaluate_model(
        "InternVL3",
        model, 
        tokenizer, 
        "InternVL3_results_8b.json", 
        batch_size=50
    )

    # Clean up
    del model, tokenizer
    torch.cuda.empty_cache()
    gc.collect()

In [9]:
# def test_Ristretto():
#     import torch
#     from transformers import AutoTokenizer, AutoModel

#     # device_map = split_model('InternVL3-8B')
    
#     model = AutoModel.from_pretrained(
#         "/var/scratch/ave303/models/ristretto-3b",
#         torch_dtype=torch.float16,
#         low_cpu_mem_usage=True,
#         # device_map=device_map,
#         trust_remote_code=True).to('cuda').eval()
    
#     tokenizer = AutoTokenizer.from_pretrained("/var/scratch/ave303/models/ristretto-3b", trust_remote_code=True, use_fast=False)

#     ## InternVL2.5-4B --> performs decently well. slight post processing required
    
#     # Optional: Enable memory efficient attention
#     if hasattr(model.config, 'use_memory_efficient_attention'):
#         model.config.use_memory_efficient_attention = True

#     tester = BenchmarkTester()
#     Ristretto_results = tester.evaluate_model(
#         "Ristretto-3b",
#         model, 
#         tokenizer, 
#         "Ristretto_3b_results.json", 
#         batch_size=25
#     )

#     # Clean up
#     del model, tokenizer
#     torch.cuda.empty_cache()
#     gc.collect()

## Run Evaluation

Now we can run our evaluation. Let's start with the InternVL2.5 model:

In [10]:
test_InternVL2_5() #8.59 #9.07

InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


Loading checkpoint shards: 100%|██████████| 7/7 [00:20<00:00,  2.86s/it]





Evaluating InternVL2.5...
Using device: cuda


Processing images:   0%|          | 0/50 [00:00<?, ?it/s]


Processing image 1/50: image01
Question: How many objects made of wood are present?


Question: Count the number of breakable items?


Question: If one of the metal objects were replaced by a wooden object, how many wooden objects would be there in the image?


Processing images:   2%|▏         | 1/50 [00:03<03:12,  3.92s/it]


Processing image 2/50: image02
Question: How many mammals are present in the image?


Question: Count the number of items that can store other items?


Question: If one of the zebra were replaced by a tree, how many mammals would be present in the image?


Processing images:   4%|▍         | 2/50 [00:06<02:37,  3.29s/it]


Processing image 3/50: image03
Question: How many objects made of rubber are present?


Question: How many objects with the primary purpose of illumination can be seen?


Question: If the person riding one of the bicycles were replaced by a pedestrian, how many objects that have handles would be present?


Processing images:   6%|▌         | 3/50 [00:10<02:37,  3.34s/it]


Processing image 4/50: image04
Question: How many tools are visible in the image?


Question: How many cutting tools are present in this image?


Question: If the red handle were replaced by a wooden handle, how many colored artifacts would remain in the image?


Processing images:   8%|▊         | 4/50 [00:14<02:47,  3.65s/it]


Processing image 5/50: image05
Question: How many furniture items are present that have legs?


Question: Count the number of containers that cannot hold hot liquids?


Question: If the room were transformed into an open workspace instead of a meeting room, how many privacy features would need to be removed?


Processing images:  10%|█         | 5/50 [00:16<02:25,  3.24s/it]


Processing image 6/50: image06
Question: How many reptiles are visible in this enclosure?


Question: How many reptilian couples, at maximum, are present?


Question: If all the small pebbles forming the mosaic floor were replaced with sand, how many natural elements would still be visible in the enclosure?


Processing images:  12%|█▏        | 6/50 [00:19<02:16,  3.10s/it]


Processing image 7/50: image07
Question: How many birds are visible in this image?


Question: How many objects are present that can comfortably seat a human?


Question: If the birds sitting together only on one railing were to fly away, how many birds would remain?


Processing images:  14%|█▍        | 7/50 [00:22<02:06,  2.95s/it]


Processing image 8/50: image08
Question: How many reptiles are visible in this image?


Question: How many objects are present that act as support?


Question: If one turtle slid off the log into the water, how many turtles would be in the water?


Processing images:  16%|█▌        | 8/50 [00:24<01:55,  2.75s/it]


Processing image 9/50: image09
Question: How many different types of vegetables are present in the image?


Question: How many objects are used as containers?


Question: If the bag of limes were removed and replaced with two additional avocados, how many fruits would be present in total on the table, considering avocados are fruits?


Processing images:  18%|█▊        | 9/50 [00:28<02:05,  3.06s/it]


Processing image 10/50: image10
Question: How many objects are present that are flexible?


Question: Count the number of items that are battery powered?


Question: If two phones with three camera lenses were replaced with phones having two camera lenses, how many phones with two camera lenses would be present?


Processing images:  20%|██        | 10/50 [00:32<02:16,  3.41s/it]


Processing image 11/50: image11
Question: How many objects made of glass are present on the table?


Question: How many objects are present at the table that can be used for sitting?


Question: If the tables in the center are removed, how many objects are visible that have legs?


Processing images:  22%|██▏       | 11/50 [00:35<02:03,  3.16s/it]


Processing image 12/50: image12
Question: How many pieces of gym equipment are visible in the image?


Question: How many objects are present that provide shade?


Question: If two of the stationary bikes were replaced by two treadmills, how many objects would be present that have pedals?


Processing images:  24%|██▍       | 12/50 [00:38<02:03,  3.24s/it]


Processing image 13/50: image13
Question: How many furniture items are present in the room?


Question: How many individual storage compartments are present in the furniture items in the room?


Question: If the two bedside lamps were removed, how many objects are present that need electricity?


Processing images:  26%|██▌       | 13/50 [00:41<01:54,  3.10s/it]


Processing image 14/50: image14
Question: How many objects are present that are transparent?


Question: How many objects are positioned for student use to place other items?


Question: If the signages were removed, how many objects would be present that hang from the ceiling?


Processing images:  28%|██▊       | 14/50 [00:43<01:44,  2.90s/it]


Processing image 15/50: image15
Question: How many objects made of rubber are present?


Question: How many objects are visible that can be used to move up?


Question: If the car on the ground is driven out of the garage, how many objects are present that is used to indicate slowing down to a stop?


Processing images:  30%|███       | 15/50 [00:47<01:53,  3.25s/it]


Processing image 16/50: image16
Question: How many objects made of rubber are present?


Question: How many objects can be used as modes of transport if fixed?


Question: If the car in the center is fixed and driven out of the garage, how many objects made of rubber would be visible in the image?


Processing images:  32%|███▏      | 16/50 [00:51<01:50,  3.24s/it]


Processing image 17/50: image17
Question: How many yellow colored objects are present?


Question: How many objects are visible that are used to protect the head?


Question: If one person leaves the cleaning group, how many mammals would remain?


Processing images:  34%|███▍      | 17/50 [00:53<01:39,  3.03s/it]


Processing image 18/50: image18
Question: How many mammals are visible in the image?


Question: How many objects are present that provide shelter?


Question: If the mammals are to all step inside the shelters, how many natural elements are visible in the image?


Processing images:  36%|███▌      | 18/50 [00:56<01:32,  2.90s/it]


Processing image 19/50: image19
Question: How many gardening tools are present that are made of metal?


Question: How many objects are present in the garden that can hold other items?


Question: If half the woven baskets are filled, how many containers would remain empty?


Processing images:  38%|███▊      | 19/50 [00:59<01:29,  2.88s/it]


Processing image 20/50: image20
Question: How many objects in the background are present that have legs?


Question: How many objects in the foreground are visible that are foldable?


Question: If the stack of books on the table in the foreground was moved to the shelf, how many objects in physical contact with the table would be present?


Processing images:  40%|████      | 20/50 [01:02<01:30,  3.03s/it]


Processing image 21/50: image01
Question: How many mammals are present in total?


Question: How many objects are visible that can store items?


Question: If the bear were to be replaced by a tree, how many different types of mammals would be there at the zoo?


Processing images:  42%|████▏     | 21/50 [01:06<01:38,  3.40s/it]


Processing image 22/50: image02
Question: How many kitchen tools are visible in the image?


Question: Count the number of items that require electricity to operate?


Question: If blinds were installed for the windows above the sink, how many transparent objects would remain?


Processing images:  44%|████▍     | 22/50 [01:08<01:21,  2.91s/it]


Processing image 23/50: image03
Question: How many objects made of glass are present?


Question: How many tools are visible that can be used for cutting?


Question: If the worker was not wearing ear protection, how many protective items would remain?


Processing images:  46%|████▌     | 23/50 [01:10<01:11,  2.66s/it]


Processing image 24/50: image04
Question: How many objects made of rubber are present?


Question: Excluding the drawers, how many items in the workshop serve as containers for storage?


Question: If an electric fan were placed on the workstation to provide ventilation, how many objects in the room would require electricity to operate?


Processing images:  48%|████▊     | 24/50 [01:12<01:07,  2.60s/it]


Processing image 25/50: image05
Question: How many birds are visible in the image?


Question: How many objects are present that act as support?


Question: If the clouds were to completely cover the sky, blocking the sunlight, how many natural elements would still be visible?


Processing images:  50%|█████     | 25/50 [01:15<01:02,  2.49s/it]


Processing image 26/50: image06
Question: How many objects are present that have chimneys?


Question: How many objects are visible that are means of transportation?


Question: If the bus were replaced by a pedestrian, how many mammals would be present?


Processing images:  52%|█████▏    | 26/50 [01:17<00:58,  2.42s/it]


Processing image 27/50: image07
Question: How many objects made of glass are present?


Question: Count the number of items that can be used to carry liquid?


Question: If the waste to be disposed was color-coded to match the bins, how many objects are to be thrown in the bin on the right?


Processing images:  54%|█████▍    | 27/50 [01:19<00:50,  2.21s/it]


Processing image 28/50: image08
Question: How many objects are present that have legs?


Question: How many items are visible that are openable?


Question: If the bottle was removed from the table, how many objects are present on top of the table?


Processing images:  56%|█████▌    | 28/50 [01:22<00:54,  2.48s/it]


Processing image 29/50: image09
Question: How many objects made of wood are present?


Question: How many kitchen items are visible that can be used for cutting?


Question: If the two jars on the top shelf were removed, how many breakable items would be present in the image?


Processing images:  58%|█████▊    | 29/50 [01:26<01:02,  2.99s/it]


Processing image 30/50: image10
Question: How many objects made of plastic are visible?


Question: How many items are visible that can record audio?


Question: If the microphones were replaced with headsets for every character, how many objects in total would be present that are worn on the head?


Processing images:  60%|██████    | 30/50 [01:28<00:52,  2.65s/it]


Processing image 31/50: image11
Question: How many different food items are present on the kitchen countertop?


Question: How many objects are visible that need electricity to operate?


Question: If all the objects on the two shelves above the counter were placed inside the cabinet, how many items that are breakable would be present on the counter?


Processing images:  62%|██████▏   | 31/50 [01:32<01:00,  3.21s/it]


Processing image 32/50: image12
Question: How many different types of plants are present?


Question: How many objects are visible that behave as containers?


Question: If all the visible plants were potted individually and placed on the stand, how many pots would be present on the stand?


Processing images:  64%|██████▍   | 32/50 [01:36<01:02,  3.47s/it]


Processing image 33/50: image13
Question: How many mammals are visible in the image?


Question: How many objects are present that can be used for sitting?


Question: If the character standing upright took a seat for themself and the huddled group are seated in pairs, that is two characters per seat. How many objects would remain that can be used for sitting?


Processing images:  66%|██████▌   | 33/50 [01:40<01:01,  3.65s/it]


Processing image 34/50: image14
Question: How many cardboard objects are visible in the image?


Question: How many objects are visible that can be used for sitting?


Question: If the bottled objects and the white cups are packed away, how many objects are present that can be used to drink out of?


Processing images:  68%|██████▊   | 34/50 [01:45<01:00,  3.78s/it]


Processing image 35/50: image15
Question: How many objects that are present have wheels?


Question: How many items are visible that can be used to hold liquids?


Question: If the car drives away, how many objects made of rubber are visible?


Processing images:  70%|███████   | 35/50 [01:48<00:53,  3.59s/it]


Processing image 36/50: image16
Question: How many objects made of glass are present?


Question: How many tools designed for gathering or sweeping are visible?


Question: If there was a flood and the water washed up the beach, completely submerging it, how many natural elements would be present in the image?


Processing images:  72%|███████▏  | 36/50 [01:51<00:47,  3.39s/it]


Processing image 37/50: image17
Question: How many objects are visible that have legs?


Question: How many objects are visible that are attached to the wall or ceiling?


Question: If the blinds are pulled over the window, how many sources of illumination would remain?


Processing images:  74%|███████▍  | 37/50 [01:54<00:44,  3.43s/it]


Processing image 38/50: image18
Question: How many objects made of rubber are visible?


Question: How many objects are present that can hold liquids?


Question: If the tools hanging on the wall were to be placed on the shelf, how many objects would be present on the shelf?


Processing images:  76%|███████▌  | 38/50 [01:58<00:42,  3.51s/it]


Processing image 39/50: image19
Question: How many different types of gym equipment are present?


Question: How many pieces of exercise equipment primarily designed for cardiovascular workouts are visible?


Question: If the blinds were pulled over the windows, how many sources of illumination would remain?


Processing images:  78%|███████▊  | 39/50 [02:01<00:36,  3.28s/it]


Processing image 40/50: image20
Question: How many objects are present that have legs?


Question: How many objects are visible that act as protection or shade?


Question: If the laptop were placed on the shelf next to the TV, how many objects would be present on the shelf?


Processing images:  80%|████████  | 40/50 [02:03<00:31,  3.13s/it]


Processing image 41/50: image01
Question: How many objects made of rubber are visible?


Question: How many objects are visible that are means of transportation?


Question: If the car in the driveway were to leave, how many objects primarily made of metal would be present?


Processing images:  82%|████████▏ | 41/50 [02:07<00:29,  3.26s/it]


Processing image 42/50: image02
Question: How many objects made of concrete are present?


Question: How many objects are visible that can be used for lifting?


Question: If the orange paint spilled all over one of the plexiglass sheets, how many objects would remain that are transparent?


Processing images:  84%|████████▍ | 42/50 [02:10<00:26,  3.29s/it]


Processing image 43/50: image03
Question: How many mammals are present in the image?


Question: How many objects are visible that are used for both meat and wool production?


Question: If the two sheep were replaced by a cow grazing in the same area, how many objects would be present in between the two fences?


Processing images:  86%|████████▌ | 43/50 [02:14<00:23,  3.33s/it]


Processing image 44/50: image04
Question: How many objects are visible that are made of paper?


Question: How many objects are present that behave as storage spaces?


Question: If the glasses were placed inside the ceramic container, and we use this container as a dividing line between the left and right sides of the bookshelf, how many objects would be on the right side?


Processing images:  88%|████████▊ | 44/50 [02:16<00:18,  3.11s/it]


Processing image 45/50: image05
Question: How many objects are visible that are made of porcelain?


Question: How many decoration items are present in the image?


Question: If the drinks were split evenly between the two humans, how many drinks would each human consume?


Processing images:  90%|█████████ | 45/50 [02:21<00:17,  3.51s/it]


Processing image 46/50: image06
Question: How many mammals are present in the image?


Question: How many objects are visible that are designed to contain liquids?


Question: If the trash bags and bottles on the sand are only thrown into the black bin, how many mammals are actively holding some other object?


Processing images:  92%|█████████▏| 46/50 [02:25<00:15,  3.82s/it]


Processing image 47/50: image07
Question: How many mammals are present in the image?


Question: How many objects are present that provide shelter?


Question: If one of the mammals douses the fire, how many objects are present that can be switched off?


Processing images:  94%|█████████▍| 47/50 [02:28<00:10,  3.49s/it]


Processing image 48/50: image08
Question: How many different types of gym equipment are present?


Question: How many objects are visible that are positioned between the row of treadmills and the bench press station?


Question: If one of the treadmills is faulty and removed from the gym, how many objects are present that convey some kind of information?


Processing images:  96%|█████████▌| 48/50 [02:31<00:06,  3.37s/it]


Processing image 49/50: image09
Question: How many objects made of rubber are visible in the image?


Question: How many objects are visible that need electricity to operate?


Question: If one of the workers took a wrench off the table, how many objects would remain in physical contact with the table?


Processing images:  98%|█████████▊| 49/50 [02:34<00:03,  3.16s/it]


Processing image 50/50: image10
Question: How many objects are visible that are made of metal?


Question: How many objects present are breakable?


Question: If the bowls with the tomatoes and the chickpeas were emptied into the steaming pot, how many containers would still have something remaining in them?


Processing images: 100%|██████████| 50/50 [02:38<00:00,  3.16s/it]
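
The run above only writes the per-question records to disk; the notebook does not define a scoring step. As one possible follow-up, the sketch below computes exact-match accuracy per image type from records shaped like those produced by `evaluate_model` (`image_type`, `model_answer`, `ground_truth`). Exact string match on the count is an assumption made here, not a metric stated in the notebook, and `accuracy_by_image_type` and the sample records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_image_type(results):
    """Exact-match accuracy of the predicted count, grouped by image_type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in results:
        image_type = record["image_type"]
        total[image_type] += 1
        # Compare as stripped strings so "3" and 3 count as equal (assumed convention).
        if str(record["model_answer"]).strip() == str(record["ground_truth"]).strip():
            correct[image_type] += 1
    return {image_type: correct[image_type] / total[image_type] for image_type in total}


# Invented records for illustration only; these are not results from the run above.
sample_results = [
    {"image_type": "REAL", "model_answer": "3", "ground_truth": "3"},
    {"image_type": "REAL", "model_answer": "2", "ground_truth": 4},
    {"image_type": "ANIMATED", "model_answer": "1", "ground_truth": 1},
]
print(accuracy_by_image_type(sample_results))
```

The same grouping could be applied to `property_category` instead of `image_type` by changing the key that is read from each record.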




In [11]:
# test_InternVL3()

In [12]:
# test_Ristretto()