# VLM Benchmark for Object Property Abstraction

This notebook implements a benchmark for evaluating Vision Language Models (VLMs) on object property abstraction and visual question answering (VQA) tasks. The benchmark includes three types of questions:

1. Direct Recognition
2. Property Inference
3. Counterfactual Reasoning

It also covers three types of images:
- REAL
- ANIMATED
- AI GENERATED

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [1]:
# Install required packages
# %pip install transformers torch Pillow tqdm bitsandbytes accelerate

In [2]:
%pip install qwen-vl-utils flash-attn #--no-build-isolation







Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import required libraries
import torch
import json
from pathlib import Path
from PIL import Image
import gc
import re
from tqdm import tqdm
from typing import List, Dict, Any
from qwen_vl_utils import process_vision_info

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")



Using device: cuda


## Benchmark Tester Class

This class handles the evaluation of models against our benchmark.
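`evaluate_model` reads `benchmark.json` through a fixed set of keys. The snippet below is a minimal sketch of that layout, reconstructed only from the fields the class accesses. The example values (IDs, path, answer, `property_category` label) and the `validate_benchmark` helper are hypothetical, and the real file may contain more keys.

```python
from typing import Any, Dict, List

# Hypothetical entry: the keys mirror what BenchmarkTester.evaluate_model reads;
# the values are made up for illustration.
EXAMPLE_BENCHMARK: Dict[str, Any] = {
    "benchmark": {
        "images": [
            {
                "image_id": "image01",
                "image_type": "REAL",
                "path": "images/image01.jpg",
                "questions": [
                    {
                        "id": "q1",
                        "question": "How many objects made of wood are present?",
                        "answer": "3",
                        "property_category": "material",
                    }
                ],
            }
        ]
    }
}

IMAGE_KEYS = {"image_id", "image_type", "path", "questions"}
QUESTION_KEYS = {"id", "question", "answer", "property_category"}


def validate_benchmark(benchmark: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every accessed key is present."""
    images = benchmark.get("benchmark", {}).get("images")
    if not isinstance(images, list):
        return ["missing benchmark['benchmark']['images'] list"]
    problems: List[str] = []
    for i, image in enumerate(images):
        missing = IMAGE_KEYS - set(image)
        if missing:
            problems.append(f"image {i}: missing {sorted(missing)}")
            continue
        for j, question in enumerate(image["questions"]):
            missing_q = QUESTION_KEYS - set(question)
            if missing_q:
                problems.append(f"image {i} question {j}: missing {sorted(missing_q)}")
    return problems


print(validate_benchmark(EXAMPLE_BENCHMARK))  # []
```

Running such a check on the real file before loading a large model surfaces a schema mismatch in seconds instead of after checkpoint loading.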

In [4]:
class BenchmarkTester:
    def __init__(self, benchmark_path="/var/scratch/ave303/OP_bench/benchmark.json", data_dir="/var/scratch/ave303/OP_bench/"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        with open(benchmark_path, 'r') as f:
            self.benchmark = json.load(f)
        self.data_dir = data_dir
    
    def format_question(self, question, model_name):
        """Format a question for the model."""

        if model_name=="blip2":
            return f"Question: {question['question']} Answer:"
        else:
            return f"Question: {question['question']} Answer with a number and list of objects. Answer:"

    def clean_answer(self, answer):
        """Extract the count and the listed objects from the model's answer.

        Expects the format requested in the prompt, e.g. "3 [chair, table, lamp]".
        If that format is not found, falls back to the first integer in the
        text, or "0" if there is none.
        """
        pattern = r'(\d+)\s*\[(.*?)\]'
        match = re.search(pattern, answer)
        
        if match:
            number = match.group(1)
            objects = [obj.strip() for obj in match.group(2).split(',')]
            return {
                "count": number,
                "reasoning": objects
            }
        else:
            # Fallback if format isn't matched
            numbers = re.findall(r'\d+', answer)
            return {
                "count": numbers[0] if numbers else "0",
                "reasoning": []
            }

    def model_generation(self, model_name, model, inputs, processor):
        """Generate answer and decode."""
        outputs = None  # Initialize outputs to None
        
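        if model_name=="smolVLM2":
            outputs = model.generate(**inputs, do_sample=False, max_new_tokens=64)
            # generate() returns the prompt followed by the new tokens; decode only the new tokens
            new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
            answer = processor.batch_decode(
                new_tokens,
                skip_special_tokens=True,
            )[0]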
        elif model_name=="Qwen2.5-VL":
            outputs = model.generate(**inputs, max_new_tokens=50)
            outputs = [
                out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, outputs)
            ]
            answer = processor.batch_decode(
                outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
            )[0]
        else:
            print(f"Warning: Unknown model name '{model_name}' in model_generation.")
            answer = ""  # Return an empty string

        return answer, outputs
    
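    def evaluate_model(self, model_name, model, processor, save_path, start_idx=0, batch_size=5):
        """Evaluate `model` on images[start_idx:start_idx + batch_size] and save the results to `save_path`.

        `batch_size` is the number of images handled in this call; each question is still sent to the model on its own.
        """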
        results = []
        print(f"\nEvaluating {model_name}...")
        print(f"Using device: {self.device}")
        
        # Force garbage collection before starting
        gc.collect()
        torch.cuda.empty_cache()

        try:
            images = self.benchmark['benchmark']['images'][start_idx:start_idx + batch_size]
            total_images = len(images)
            
            for idx, image_data in enumerate(tqdm(images, desc="Processing images")):
                try:
                    print(f"\nProcessing image {idx+1}/{total_images}: {image_data['image_id']}")
                    image_path = Path(self.data_dir)/image_data['path']
                    if not image_path.exists():
                        print(f"Warning: Image not found at {image_path}")
                        continue
                    
                    # Load and preprocess image
                    image = Image.open(image_path).convert("RGB")
                    image_results = []  # Store results for current image
                    
                    for question in image_data['questions']:
                        try:
                            # prompt = self.format_question(question, model_name)
                            print(f"Question: {question['question']}")

                            messages = [
                                {
                                    "role": "user",
                                    "content": [
                                        {"type": "image", "image": image},
                                        # {"type": "text", "text": f"{question['question']} Answer format: total number(numerical) objects(within square brackets)"},
                                        # {"type": "text", "text": f"{question['question']} Provide just the total count and the list of objects in the given format \n Format: number [objects]"},
                                        # {"type": "text", "text": f"{question['question']} Answer Format: number [objects]"},
                                        {"type": "text", "text": f"{question['question']} Your response MUST be in the following format and nothing else:\n <NUMBER> [<OBJECT1>, <OBJECT2>, <OBJECT3>, ...]"}
                                    ]
                                },
                            ]
                            
                            # Clear cache before processing each question
                            torch.cuda.empty_cache()
                            
                            # Process image and text
                            # inputs = processor(images=image, text=prompt, return_tensors="pt").to(self.device)
                            if model_name=="smolVLM2":
                                inputs = processor.apply_chat_template(
                                    messages,
                                    add_generation_prompt=True,
                                    tokenize=True,
                                    return_dict=True,
                                    return_tensors="pt",
                                ).to(model.device, dtype=torch.float16)
                            else:
                                
                                text = processor.apply_chat_template(
                                    messages, tokenize=False, add_generation_prompt=True
                                )
                                # image_inputs, video_inputs = process_vision_info(messages)
                                inputs = processor(
                                    text=text,
                                    images=image,
                                    videos=None,
                                    padding=True,
                                    return_tensors="pt",
                                ).to("cuda")
                            
                            # Generate and decode the answer; no gradients are needed for inference
                            with torch.no_grad():
                                answer, outputs = self.model_generation(model_name, model, inputs, processor)
        
                            cleaned_answer = self.clean_answer(answer)
                            
                            image_results.append({
                                "image_id": image_data["image_id"],
                                "image_type": image_data["image_type"],
                                "question_id": question["id"],
                                "question": question["question"],
                                "ground_truth": question["answer"],
                                "model_answer": cleaned_answer["count"],
                                "model_reasoning": cleaned_answer["reasoning"],
                                "raw_answer": answer,  # Keep raw answer for debugging
                                "property_category": question["property_category"]
                            })
                            
                            # Clear memory
                            del outputs, inputs
                            torch.cuda.empty_cache()
                            
                        except Exception as e:
                            print(f"Error processing question: {str(e)}")
                            continue
                    
                    # Add results from this image
                    results.extend(image_results)
                    
                    # Save intermediate results only every 2 images or if it's the last image
                    if (idx + 1) % 2 == 0 or idx == total_images - 1:
                        with open(f"{save_path}_checkpoint.json", 'w') as f:
                            json.dump(results, f, indent=4)
                            
                except Exception as e:
                    print(f"Error processing image {image_data['image_id']}: {str(e)}")
                    continue
            
            # Save final results
            if results:
                with open(save_path, 'w') as f:
                    json.dump(results, f, indent=4)
            
        except Exception as e:
            print(f"An error occurred during evaluation: {str(e)}")
            if results:
                with open(f"{save_path}_error_state.json", 'w') as f:
                    json.dump(results, f, indent=4)
        
        return results
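`clean_answer` expects the `<NUMBER> [<OBJECT1>, <OBJECT2>, ...]` format requested in the prompt and otherwise falls back to the first integer it finds. The standalone function below is a sketch that mirrors that logic outside the class, so the parser can be checked on sample strings without loading a model. The name `parse_count_answer` and the sample answers are hypothetical.

```python
import re
from typing import Dict, List, Union

# Same pattern as BenchmarkTester.clean_answer: an integer, optional whitespace,
# then a bracketed, comma-separated list of objects.
ANSWER_PATTERN = re.compile(r"(\d+)\s*\[(.*?)\]")


def parse_count_answer(answer: str) -> Dict[str, Union[str, List[str]]]:
    """Split a raw model answer into a count string and a list of object names."""
    match = ANSWER_PATTERN.search(answer)
    if match:
        objects = [obj.strip() for obj in match.group(2).split(",")]
        return {"count": match.group(1), "reasoning": objects}
    # Fallback: first integer anywhere in the text, or "0" if there is none.
    numbers = re.findall(r"\d+", answer)
    return {"count": numbers[0] if numbers else "0", "reasoning": []}


print(parse_count_answer("3 [chair, table, shelf]"))
# {'count': '3', 'reasoning': ['chair', 'table', 'shelf']}
print(parse_count_answer("There are 2 birds."))
# {'count': '2', 'reasoning': []}
print(parse_count_answer("None visible."))
# {'count': '0', 'reasoning': []}
```

Two quirks carry over from the original logic: an answer such as `0 []` yields `['']` as the object list, because `''.split(',')` returns `['']`; and an answer with no number at all is scored as `"0"`, which is indistinguishable from a genuine zero count unless `raw_answer` is also inspected.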

## Test SmolVLM Model

Let's evaluate the SmolVLM2-2.2B-Instruct model.

In [5]:
# def test_smolVLM2():
#     from transformers import AutoProcessor, AutoModelForImageTextToText

#     print("Loading smolVLM model...")
    
#     model = AutoModelForImageTextToText.from_pretrained(
#         "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
#         torch_dtype=torch.float16,
#         attn_implementation="flash_attention_2",
#         low_cpu_mem_usage=True,
#         trust_remote_code=True
#     ).to("cuda")

#     processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

#     ## Slower without flash_attention_2 (which requires an Ampere or newer GPU); better performance in some cases

#     # Optional: Enable memory efficient attention
#     if hasattr(model.config, 'use_memory_efficient_attention'):
#         model.config.use_memory_efficient_attention = True

#     tester = BenchmarkTester()
#     smolVLM_results = tester.evaluate_model(
#         "smolVLM2",
#         model, 
#         processor, 
#         "smolVLM2_results_1.json", 
#         batch_size=25
#     )

#     # Clean up
#     del model, processor
#     torch.cuda.empty_cache()
#     gc.collect()

## Test Qwen2.5-VL

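Let's evaluate Qwen2.5-VL. The cell below loads a 32B checkpoint from a local path with `device_map="auto"`; its comments note that the 7B and 3B Instruct variants ran out of CUDA memory.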
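Qwen2.5-VL is a decoder-only model, so `model.generate` returns each prompt followed by the newly generated tokens. The `Qwen2.5-VL` branch of `model_generation` therefore slices off the first `len(in_ids)` tokens of every output sequence before decoding. The sketch below shows the same slicing on plain Python lists standing in for token-ID tensors; the helper name and the token IDs are arbitrary.

```python
from typing import List, Sequence


def strip_prompt_tokens(
    input_ids: Sequence[Sequence[int]], output_ids: Sequence[Sequence[int]]
) -> List[List[int]]:
    """Drop the echoed prompt from each generated sequence.

    Mirrors `out_ids[len(in_ids):]` in BenchmarkTester.model_generation:
    row i of output_ids is assumed to start with the prompt input_ids[i].
    """
    return [list(out[len(inp):]) for inp, out in zip(input_ids, output_ids)]


# Arbitrary token IDs: a 4-token prompt followed by 3 generated tokens.
prompt = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]
print(strip_prompt_tokens(prompt, generated))  # [[42, 43, 2]]
```

Without this step, the decoded string would also contain the prompt text, including the format instructions, and the fallback in `clean_answer` could pick up a digit from the prompt instead of the model's answer.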

In [6]:
def test_Qwen2_5VL():
    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    
    # default: Load the model on the available device(s)
    # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    #     "Qwen/Qwen2.5-VL-3B-Instruct", 
    #     load_in_8bit=True, # throws error when .to() is added
    #     torch_dtype=torch.bfloat16, 
    #     device_map="auto",
    #     # attn_implementation="flash_attention_2",
    #     low_cpu_mem_usage=True
    # )
    
    # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "/var/scratch/ave303/models/qwen2-5-vl-32b",
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
        device_map="auto",
        low_cpu_mem_usage=True,
        trust_remote_code=True
    )
    
    # default processor
    processor = AutoProcessor.from_pretrained("/var/scratch/ave303/models/qwen2-5-vl-32b")

    ### Qwen2.5-VL-7B-Instruct --> goes out of CUDA memory
    ### Qwen2.5-VL-3B-Instruct --> can handle only 2 images before going out of memory but decent performance

    # Optional: Enable memory efficient attention
    if hasattr(model.config, 'use_memory_efficient_attention'):
        model.config.use_memory_efficient_attention = True

    tester = BenchmarkTester()
    Qwen2_5VL_results = tester.evaluate_model(
        "Qwen2.5-VL",
        model, 
        processor, 
        "Qwen2.5-VL_32b_results.json", 
        batch_size=50
    )

    # Clean up
    del model, processor
    torch.cuda.empty_cache()
    gc.collect()

## Run Evaluation

Now we can run the evaluation. The SmolVLM2 call is commented out, so only Qwen2.5-VL is evaluated in this run:
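`evaluate_model` only processes `images[start_idx:start_idx + batch_size]` and rewrites `<save_path>_checkpoint.json` every two images, so an interrupted run can be continued by passing a later `start_idx`. The helpers below are a hypothetical sketch for computing that index from the checkpoint records. They key images on `(image_type, image_id)`, because IDs such as `image01` appear more than once in the log below (image 1 and image 21), presumably once per image type. An image whose questions all failed, for example with CUDA OOM, leaves no records and is only retried if no later image has records.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def next_start_idx(images: List[Dict[str, Any]], records: List[Dict[str, Any]]) -> int:
    """Index just past the last benchmark image that has at least one saved record.

    Images are matched on (image_type, image_id) because image IDs alone may repeat.
    """
    done = {(r["image_type"], r["image_id"]) for r in records}
    last = -1
    for idx, image in enumerate(images):
        if (image["image_type"], image["image_id"]) in done:
            last = idx
    return last + 1


def load_checkpoint(path: str) -> List[Dict[str, Any]]:
    """Read a checkpoint written by evaluate_model; a missing file means no records."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return []
    with checkpoint.open() as f:
        return json.load(f)


# Toy data with made-up image types: the checkpoint covers the first two images.
images = [
    {"image_type": "REAL", "image_id": "image01"},
    {"image_type": "REAL", "image_id": "image02"},
    {"image_type": "ANIMATED", "image_id": "image01"},
]
records = [
    {"image_type": "REAL", "image_id": "image01"},
    {"image_type": "REAL", "image_id": "image02"},
]
print(next_start_idx(images, records))  # 2
```

With the run below, the checkpoint path would be `Qwen2.5-VL_32b_results.json_checkpoint.json`, and the image list is `tester.benchmark['benchmark']['images']`. Because `evaluate_model` opens its output files in write mode, a resumed run should use a new `save_path` so the earlier results are not overwritten.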

In [7]:
# test_smolVLM2()

In [8]:
test_Qwen2_5VL()

Loading checkpoint shards: 100%|██████████| 30/30 [01:38<00:00,  3.27s/it]




Some parameters are on the meta device device because they were offloaded to the cpu.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



Evaluating Qwen2.5-VL...
Using device: cuda


Processing images:   0%|          | 0/50 [00:00<?, ?it/s]


Processing image 1/50: image01
Question: How many objects made of wood are present?


Question: Count the number of breakable items?


Question: If one of the metal objects were replaced by a wooden object, how many wooden objects would be there in the image?


Processing images:   2%|▏         | 1/50 [02:08<1:45:17, 128.94s/it]


Processing image 2/50: image02
Question: How many mammals are present in the image?


Question: Count the number of items that can store other items?


Question: If one of the zebra were replaced by a tree, how many mammals would be present in the image?


Processing images:   4%|▍         | 2/50 [04:03<1:36:31, 120.67s/it]


Processing image 3/50: image03
Question: How many objects made of rubber are present?


Question: How many objects with the primary purpose of illumination can be seen?


Question: If the person riding one of the bicycles were replaced by a pedestrian, how many objects that have handles would be present?


Processing images:   6%|▌         | 3/50 [05:42<1:26:47, 110.79s/it]


Processing image 4/50: image04
Question: How many tools are visible in the image?


Question: How many cutting tools are present in this image?


Question: If the red handle were replaced by a wooden handle, how many colored artifacts would remain in the image?


Processing images:   8%|▊         | 4/50 [08:05<1:34:37, 123.43s/it]


Processing image 5/50: image05
Question: How many furniture items are present that have legs?


Question: Count the number of containers that cannot hold hot liquids?


Question: If the room were transformed into an open workspace instead of a meeting room, how many privacy features would need to be removed?


Processing images:  10%|█         | 5/50 [09:49<1:27:14, 116.32s/it]


Processing image 6/50: image06
Question: How many reptiles are visible in this enclosure?


Question: How many reptilian couples, at maximum, are present?


Question: If all the small pebbles forming the mosaic floor were replaced with sand, how many natural elements would still be visible in the enclosure?


Processing images:  12%|█▏        | 6/50 [12:35<1:37:37, 133.13s/it]


Processing image 7/50: image07
Question: How many birds are visible in this image?


Question: How many objects are present that can comfortably seat a human?


Question: If the birds sitting together only on one railing were to fly away, how many birds would remain?


Processing images:  14%|█▍        | 7/50 [13:38<1:19:02, 110.30s/it]


Processing image 8/50: image08
Question: How many reptiles are visible in this image?


Question: How many objects are present that act as support?


Question: If one turtle slid off the log into the water, how many turtles would be in the water?


Processing images:  16%|█▌        | 8/50 [14:23<1:02:43, 89.61s/it] 


Processing image 9/50: image09
Question: How many different types of vegetables are present in the image?


Question: How many objects are used as containers?


Question: If the bag of limes were removed and replaced with two additional avocados, how many fruits would be present in total on the table, considering avocados are fruits?


Processing images:  18%|█▊        | 9/50 [16:52<1:13:58, 108.25s/it]


Processing image 10/50: image10
Question: How many objects are present that are flexible?


Question: Count the number of items that are battery powered?


Question: If two phones with three camera lenses were replaced with phones having two camera lenses, how many phones with two camera lenses would be present?


Processing images:  20%|██        | 10/50 [19:29<1:22:04, 123.11s/it]


Processing image 11/50: image11
Question: How many objects made of glass are present on the table?


Question: How many objects are present at the table that can be used for sitting?


Question: If the tables in the center are removed, how many objects are visible that have legs?


Processing images:  22%|██▏       | 11/50 [20:37<1:09:05, 106.30s/it]


Processing image 12/50: image12
Question: How many pieces of gym equipment are visible in the image?


Question: How many objects are present that provide shade?


Question: If two of the stationary bikes were replaced by two treadmills, how many objects would be present that have pedals?


Processing images:  24%|██▍       | 12/50 [23:10<1:16:23, 120.61s/it]


Processing image 13/50: image13
Question: How many furniture items are present in the room?


Question: How many individual storage compartments are present in the furniture items in the room?


Question: If the two bedside lamps were removed, how many objects are present that need electricity?


Processing images:  26%|██▌       | 13/50 [24:50<1:10:32, 114.40s/it]


Processing image 14/50: image14
Question: How many objects are present that are transparent?


Question: How many objects are positioned for student use to place other items?


Question: If the signages were removed, how many objects would be present that hang from the ceiling?


Processing images:  28%|██▊       | 14/50 [25:32<55:21, 92.28s/it]   


Processing image 15/50: image15
Question: How many objects made of rubber are present?


Question: How many objects are visible that can be used to move up?


Question: If the car on the ground is driven out of the garage, how many objects are present that is used to indicate slowing down to a stop?


Processing images:  30%|███       | 15/50 [27:01<53:23, 91.53s/it]


Processing image 16/50: image16
Question: How many objects made of rubber are present?


Question: How many objects can be used as modes of transport if fixed?


Question: If the car in the center is fixed and driven out of the garage, how many objects made of rubber would be visible in the image?


Processing images:  32%|███▏      | 16/50 [29:08<57:48, 102.01s/it]


Processing image 17/50: image17
Question: How many yellow colored objects are present?


Question: How many objects are visible that are used to protect the head?


Question: If one person leaves the cleaning group, how many mammals would remain?


Processing images:  34%|███▍      | 17/50 [30:29<52:44, 95.91s/it] 


Processing image 18/50: image18
Question: How many mammals are visible in the image?


Question: How many objects are present that provide shelter?


Question: If the mammals are to all step inside the shelters, how many natural elements are visible in the image?


Processing images:  36%|███▌      | 18/50 [32:10<51:50, 97.19s/it]


Processing image 19/50: image19
Question: How many gardening tools are present that are made of metal?


Question: How many objects are present in the garden that can hold other items?


Question: If half the woven baskets are filled, how many containers would remain empty?


Processing images:  38%|███▊      | 19/50 [34:02<52:32, 101.70s/it]


Processing image 20/50: image20
Question: How many objects in the background are present that have legs?


Error processing question: CUDA out of memory. Tried to allocate 860.00 MiB. GPU 0 has a total capacity of 47.65 GiB of which 714.00 MiB is free. Including non-PyTorch memory, this process has 46.94 GiB memory in use. Of the allocated memory 45.16 GiB is allocated by PyTorch, and 1.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Question: How many objects in the foreground are visible that are foldable?


Error processing question: CUDA out of memory. Tried to allocate 860.00 MiB. GPU 0 has a total capacity of 47.65 GiB of which 234.00 MiB is free. Including non-PyTorch memory, this process has 47.41 GiB memory in use. Of the allocated memory 45.97 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Question: If the stack of books on the table in the foreground was moved to the shelf, how many objects in physical contact with the table would be present?


Processing images:  40%|████      | 20/50 [34:30<39:44, 79.50s/it] 

Error processing question: CUDA out of memory. Tried to allocate 860.00 MiB. GPU 0 has a total capacity of 47.65 GiB of which 234.00 MiB is free. Including non-PyTorch memory, this process has 47.41 GiB memory in use. Of the allocated memory 45.85 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Processing image 21/50: image01
Question: How many mammals are present in total?


Question: How many objects are visible that can store items?


Question: If the bear were to be replaced by a tree, how many different types of mammals would be there at the zoo?


Processing images:  42%|████▏     | 21/50 [36:07<41:03, 84.95s/it]


Processing image 22/50: image02
Question: How many kitchen tools are visible in the image?


Question: Count the number of items that require electricity to operate?


Question: If blinds were installed for the windows above the sink, how many transparent objects would remain?


Processing images:  44%|████▍     | 22/50 [37:35<40:02, 85.82s/it]


Processing image 23/50: image03
Question: How many objects made of glass are present?


Question: How many tools are visible that can be used for cutting?


Question: If the worker was not wearing ear protection, how many protective items would remain?


Processing images:  46%|████▌     | 23/50 [38:32<34:43, 77.16s/it]


Processing image 24/50: image04
Question: How many objects made of rubber are present?


Question: Excluding the drawers, how many items in the workshop serve as containers for storage?


Question: If an electric fan were placed on the workstation to provide ventilation, how many objects in the room would require electricity to operate?


Processing images:  48%|████▊     | 24/50 [39:57<34:28, 79.54s/it]


Processing image 25/50: image05
Question: How many birds are visible in the image?


Question: How many objects are present that act as support?


Question: If the clouds were to completely cover the sky, blocking the sunlight, how many natural elements would still be visible?


Processing images:  50%|█████     | 25/50 [41:36<35:36, 85.44s/it]


Processing image 26/50: image06
Question: How many objects are present that have chimneys?


Question: How many objects are visible that are means of transportation?


Question: If the bus were replaced by a pedestrian, how many mammals would be present?


Processing images:  52%|█████▏    | 26/50 [42:54<33:12, 83.02s/it]


Processing image 27/50: image07
Question: How many objects made of glass are present?


Question: Count the number of items that can be used to carry liquid?


Question: If the waste to be disposed was color-coded to match the bins, how many objects are to be thrown in the bin on the right?


Processing images:  54%|█████▍    | 27/50 [43:48<28:33, 74.49s/it]


Processing image 28/50: image08
Question: How many objects are present that have legs?


Question: How many items are visible that are openable?


Question: If the bottle was removed from the table, how many objects are present on top of the table?


Processing images:  56%|█████▌    | 28/50 [45:29<30:08, 82.19s/it]


Processing image 29/50: image09
Question: How many objects made of wood are present?


Question: How many kitchen items are visible that can be used for cutting?


Question: If the two jars on the top shelf were removed, how many breakable items would be present in the image?


Processing images:  58%|█████▊    | 29/50 [47:50<34:57, 99.88s/it]


Processing image 30/50: image10
Question: How many objects made of plastic are visible?


Question: How many items are visible that can record audio?


Question: If the microphones were replaced with headsets for every character, how many objects in total would be present that are worn on the head?


Processing images:  60%|██████    | 30/50 [50:12<37:32, 112.61s/it]


Processing image 31/50: image11
Question: How many different food items are present on the kitchen countertop?


Question: How many objects are visible that need electricity to operate?


Question: If all the objects on the two shelves above the counter were placed inside the cabinet, how many items that are breakable would be present on the counter?


Processing images:  62%|██████▏   | 31/50 [52:33<38:24, 121.28s/it]


Processing image 32/50: image12
Question: How many different types of plants are present?


Question: How many objects are visible that behave as containers?


Question: If all the visible plants were potted individually and placed on the stand, how many pots would be present on the stand?


Processing images:  64%|██████▍   | 32/50 [56:00<44:03, 146.87s/it]


Processing image 33/50: image13
Question: How many mammals are visible in the image?


Error processing question: CUDA out of memory. Tried to allocate 864.00 MiB. GPU 0 has a total capacity of 47.65 GiB of which 764.00 MiB is free. Including non-PyTorch memory, this process has 46.89 GiB memory in use. Of the allocated memory 45.18 GiB is allocated by PyTorch, and 1.39 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Question: How many objects are present that can be used for sitting?


Error processing question: CUDA out of memory. Tried to allocate 864.00 MiB. GPU 0 has a total capacity of 47.65 GiB of which 54.00 MiB is free. Including non-PyTorch memory, this process has 47.59 GiB memory in use. Of the allocated memory 45.68 GiB is allocated by PyTorch, and 1.58 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Question: If the character standing upright took a seat for themself and the huddled group are seated in pairs, that is two characters per seat. How many objects would remain that can be used for sitting?


Processing images:  66%|██████▌   | 33/50 [56:26<31:21, 110.68s/it]

Error processing question: CUDA out of memory. Tried to allocate 866.00 MiB. GPU 0 has a total capacity of 47.65 GiB of which 46.00 MiB is free. Including non-PyTorch memory, this process has 47.60 GiB memory in use. Of the allocated memory 45.63 GiB is allocated by PyTorch, and 1.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Processing image 34/50: image14
Question: How many cardboard objects are visible in the image?


Question: How many objects are visible that can be used for sitting?


Question: If the bottled objects and the white cups are packed away, how many objects are present that can be used to drink out of?


Processing images:  68%|██████▊   | 34/50 [58:26<30:15, 113.49s/it]


Processing image 35/50: image15
Question: How many objects that are present have wheels?


Question: How many items are visible that can be used to hold liquids?


Question: If the car drives away, how many objects made of rubber are visible?


Processing images:  70%|███████   | 35/50 [1:00:30<29:07, 116.50s/it]


Processing image 36/50: image16
Question: How many objects made of glass are present?


Question: How many tools designed for gathering or sweeping are visible?


Question: If there was a flood and the water washed up the beach, completely submerging it, how many natural elements would be present in the image?


Processing images:  72%|███████▏  | 36/50 [1:02:11<26:05, 111.82s/it]


Processing image 37/50: image17
Question: How many objects are visible that have legs?


Question: How many objects are visible that are attached to the wall or ceiling?


Question: If the blinds are pulled over the window, how many sources of illumination would remain?


Processing images:  74%|███████▍  | 37/50 [1:04:44<26:56, 124.38s/it]


Processing image 38/50: image18
Question: How many objects made of rubber are visible?


Question: How many objects are present that can hold liquids?


Question: If the tools hanging on the wall were to be placed on the shelf, how many objects would be present on the shelf?


Processing images:  76%|███████▌  | 38/50 [1:06:21<23:12, 116.04s/it]


Processing image 39/50: image19
Question: How many different types of gym equipment are present?


Question: How many pieces of exercise equipment primarily designed for cardiovascular workouts are visible?


Question: If the blinds were pulled over the windows, how many sources of illumination would remain?


Processing images:  78%|███████▊  | 39/50 [1:09:34<25:30, 139.12s/it]


Processing image 40/50: image20
Question: How many objects are present that have legs?


Question: How many objects are visible that act as protection or shade?


Question: If the laptop were placed on the shelf next to the TV, how many objects would be present on the shelf?


Processing images:  80%|████████  | 40/50 [1:11:22<21:37, 129.77s/it]


Processing image 41/50: image01
Question: How many objects made of rubber are visible?


Question: How many objects are visible that are means of transportation?


Question: If the car in the driveway were to leave, how many objects primarily made of metal would be present?


Processing images:  82%|████████▏ | 41/50 [1:12:55<17:48, 118.74s/it]


Processing image 42/50: image02
Question: How many objects made of concrete are present?


Question: How many objects are visible that can be used for lifting?


Question: If the orange paint spilled all over one of the plexiglass sheets, how many objects would remain that are transparent?


Processing images:  84%|████████▍ | 42/50 [1:14:18<14:23, 107.94s/it]


Processing image 43/50: image03
Question: How many mammals are present in the image?


Question: How many objects are visible that are used for both meat and wool production?


Question: If the two sheep were replaced by a cow grazing in the same area, how many objects would be present in between the two fences?


Processing images:  86%|████████▌ | 43/50 [1:16:09<12:42, 108.86s/it]


Processing image 44/50: image04
Question: How many objects are visible that are made of paper?


Question: How many objects are present that behave as storage spaces?


Question: If the glasses were placed inside the ceramic container, and we use this container as a dividing line between the left and right sides of the bookshelf, how many objects would be on the right side?


Processing images:  88%|████████▊ | 44/50 [1:18:13<11:21, 113.58s/it]


Processing image 45/50: image05
Question: How many objects are visible that are made of porcelain?


Question: How many decoration items are present in the image?


Question: If the drinks were split evenly between the two humans, how many drinks would each human consume?


Processing images:  90%|█████████ | 45/50 [1:20:40<10:16, 123.38s/it]


Processing image 46/50: image06
Question: How many mammals are present in the image?


Question: How many objects are visible that are designed to contain liquids?


Question: If the trash bags and bottles on the sand are only thrown into the black bin, how many mammals are actively holding some other object?


Processing images:  92%|█████████▏| 46/50 [1:22:33<08:02, 120.53s/it]


Processing image 47/50: image07
Question: How many mammals are present in the image?


Question: How many objects are present that provide shelter?


Question: If one of the mammals douses the fire, how many objects are present that can be switched off?


Processing images:  94%|█████████▍| 47/50 [1:23:54<05:25, 108.53s/it]


Processing image 48/50: image08
Question: How many different types of gym equipment are present?


Question: How many objects are visible that are positioned between the row of treadmills and the bench press station?


Question: If one of the treadmills is faulty and removed from the gym, how many objects are present that convey some kind of information?


Processing images:  96%|█████████▌| 48/50 [1:26:49<04:17, 128.61s/it]


Processing image 49/50: image09
Question: How many objects made of rubber are visible in the image?


Question: How many objects are visible that need electricity to operate?


Question: If one of the workers took a wrench off the table, how many objects would remain in physical contact with the table?


Processing images:  98%|█████████▊| 49/50 [1:28:10<01:54, 114.09s/it]


Processing image 50/50: image10
Question: How many objects are visible that are made of metal?


Question: How many objects present are breakable?


Question: If the bowls with the tomatoes and the chickpeas were emptied into the steaming pot, how many containers would still have something remaining in them?


Processing images: 100%|██████████| 50/50 [1:30:26<00:00, 120.80s/it]

Processing images: 100%|██████████| 50/50 [1:30:26<00:00, 108.53s/it]
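The CUDA out-of-memory errors near image 33 left those questions without an answer. Each message reports roughly 1.4 to 1.6 GiB "reserved by PyTorch but unallocated" while an allocation of about 864 MiB fails. PyTorch's own message names this fragmentation case and suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Below is a minimal sketch of applying that suggestion. It assumes the kernel is restarted so the variable is set before `import torch` initializes CUDA, since a value set afterwards is likely ignored. The helper names `configure_cuda_allocator` and `free_cuda_memory` are hypothetical and are not part of `BenchmarkTester`.

```python
import gc
import importlib.util
import os

# Allocator setting suggested verbatim by the OOM messages in the log above.
ALLOC_CONF = "expandable_segments:True"


def configure_cuda_allocator(conf: str = ALLOC_CONF) -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF unless a value is already present.

    Intended to run before torch initializes CUDA (safest: at the top of the
    notebook, before `import torch`). Returns the value now in effect.
    """
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)


def free_cuda_memory() -> None:
    """Collect unreachable Python objects, then release cached CUDA blocks."""
    gc.collect()
    if importlib.util.find_spec("torch") is None:
        return  # torch is not installed, so there is no CUDA cache to release
    import torch

    if torch.cuda.is_available():
        # Returns cached but unused blocks to the driver; live tensors are untouched.
        torch.cuda.empty_cache()


# Run once, before anything imports torch or touches the GPU.
configure_cuda_allocator()
```

One possible placement, which is an assumption because the evaluation loop is not shown in this output, is to call `free_cuda_memory()` in the branch that prints `Error processing question` and then retry that question once. This can only recover memory lost to caching and fragmentation. It cannot help if a single forward pass genuinely needs more memory than the GPU has free.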


