# Model Baselines for VQA Counting

### Options:

#### a) Small:

1. BLIP-VQA-Base: https://huggingface.co/Salesforce/blip-vqa-base
2. ViLT-finetuned-VQA: https://huggingface.co/dandelin/vilt-b32-finetuned-vqa
3. OWL-ViT (multimodal object detection): https://huggingface.co/google/owlvit-base-patch32

Rather not:

4. MoVie-ResNeXt: https://github.com/facebookresearch/mmf/tree/main/projects/movie_mcan

   => Only available in the Facebook Research MMF repo, which has a huge model zoo, so it probably takes more effort to work with.
5. RCN (TallyQA): https://github.com/manoja328/tallyqacode/tree/master

   => Also only on GitHub. Doubts: the repo is not popular, some parts (e.g. the data-related dependencies in config.py) are unclear, and the model is probably outdated.

#### b) Big:
1. PaliGemma-3b: https://huggingface.co/google/paligemma-3b-pt-224
2. InternLM-XComposer2-VL-7B: https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/tree/main
3. MiniCPM-Llama3-V-2_5 (8.54B): https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5

=> More of a general multimodal LLM; also available with 2 billion parameters.

In [1]:
from google.colab import drive

from PIL import Image
import os
import json
from tqdm import tqdm

import torch
from torch.utils.data import DataLoader, Dataset

import transformers
from transformers import AutoTokenizer, AutoProcessor, AutoModelForVisualQuestionAnswering

drive.mount('/content/drive')
# %cd /content/drive/My Drive/MAI_project

Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/1LQRkg9gduygZjO84k_knGQGpy1qEPTp6/Colab Notebooks


In [None]:
# define the data class
class HighCountVQADataset(Dataset):
  def __init__(self, data_root, json_name, transform=lambda x: x, mode='RGB'):
    with open(os.path.join(data_root, json_name), 'r') as file:
      self.data_points = json.load(file)

    self.mode = mode
    self.data_root = data_root
    self.transform = transform

  def __len__(self):
    return len(self.data_points)

  def __getitem__(self, idx):
    data_point = self.data_points[idx]
    # ToDo: Adjust folder names of images to match the data_source field
    img_path = os.path.join(self.data_root, data_point['data_source'], data_point['image'])
    img = self.transform(Image.open(img_path).convert(self.mode))
    answer = data_point['answer']
    question = data_point['question']

    return {'question': question, 'image': img, 'answer': answer}
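
The ToDo list at the end of this notebook mentions entries whose `question` value is missing, which makes `json.load` fail before the dataset is even built. A minimal sketch of a tolerant loader follows, assuming the only syntax defect is a bare `"question": ,`; the helper name `load_data_points` is hypothetical and not part of the notebook.

```python
import json
import re

# Matches a "question" key whose value is missing, e.g. `"question": ,`
_EMPTY_QUESTION = re.compile(r'("question"\s*:)\s*(?=[,}])')


def load_data_points(raw_text):
    """Parse the dataset JSON and separate usable entries from broken ones.

    Assumption: the only syntax defect is an empty "question" value.
    """
    # Insert an explicit null so that the text becomes valid JSON
    repaired = _EMPTY_QUESTION.sub(r'\1 null', raw_text)
    entries = json.loads(repaired)

    valid, broken = [], []
    for entry in entries:
        question = entry.get("question")
        if isinstance(question, str) and question.strip():
            valid.append(entry)
        else:
            broken.append(entry)
    return valid, broken
```

The `broken` list could be logged or reviewed by hand, while `valid` would take the place of `self.data_points` in the dataset class.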

In [None]:
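# Hyperparameters
batch_size = 16
device = "cuda:0"
model_name = "Salesforce/blip-vqa-base"

# keep images, questions and answers as plain lists;
# the default collate function cannot stack PIL images
def collate_fn(batch):
  return {key: [sample[key] for sample in batch] for key in batch[0]}

# declare all necessary components (here just for inference)
test_set = HighCountVQADataset(data_root="/content/drive/My Drive/MAI_project",
                               json_name='HighCountVQA.json')
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False,
                         collate_fn=collate_fn)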


# BLIP-specific loading

# processor instead of tokenizer -> bundles the BERT tokenizer and the BLIP image processor
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVisualQuestionAnswering.from_pretrained(model_name).to(device)

In [None]:
# data structures for tracking accuracies
max_obj_number = 15
total_count = 0
# keys are ints, matching the integer "answer" field in the JSON
distro_dict = {i+1: 0 for i in range(max_obj_number)}  # samples per true count
count_dict = {i+1: 0 for i in range(max_obj_number)}   # correct predictions per true count

with torch.no_grad():
  # tqdm reads the number of batches from len(test_loader)
  for data in tqdm(test_loader):
    questions = list(data['question'])
    imgs = data['image']
    answers = [int(a) for a in data['answer']]

    # padding=True because the questions in a batch differ in length
    inputs = processor(images=imgs, text=questions, padding=True, return_tensors="pt").to(device)
    # "generate" is only for inference to get the full answer instead of just the next token
    outputs = model.generate(**inputs)
    # decode every sequence in the batch, not only the first one
    predictions = processor.batch_decode(outputs, skip_special_tokens=True)

    # update accuracy counts (exact match on the digit string, so "three" does not match 3)
    for prediction, answer in zip(predictions, answers):
      is_match = prediction.strip() == str(answer)
      total_count += int(is_match)
      if answer in distro_dict:
        distro_dict[answer] += 1
        count_dict[answer] += int(is_match)

print(f'Total Accuracy: {(total_count / len(test_set))*100:.2f}%')
for i in range(max_obj_number):
  # skip counts that never occur to avoid dividing by zero
  if distro_dict[i+1] > 0:
    print(f'Accuracy for {i+1} objects: {(count_dict[i+1] / distro_dict[i+1])*100:.2f}%')
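
The evaluation above compares free-text predictions with integer ground-truth counts. A generative VQA model may answer with a number word or a short phrase instead of a digit, so some normalization might be needed before comparing. The sketch below is one possible approach; the helper `parse_count` and the supported word range (0 to 20) are assumptions, not part of the notebook.

```python
import re

# Number words a VQA model might emit instead of digits (assumed range: 0-20)
_NUMBER_WORDS = {word: value for value, word in enumerate([
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
])}


def parse_count(prediction):
    """Return the predicted count as an int, or None if none can be read."""
    text = prediction.strip().lower()

    # Prefer an explicit digit sequence such as "12"
    digits = re.search(r'\d+', text)
    if digits:
        return int(digits.group())

    # Otherwise fall back to the first known number word
    for token in re.findall(r'[a-z]+', text):
        if token in _NUMBER_WORDS:
            return _NUMBER_WORDS[token]
    return None
```

With such a helper, a match could be computed as `parse_count(prediction) == answer`, and predictions that parse to `None` would simply count as wrong.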

# ToDos
- Check the data again because of errors like: {"image": "VG_100K/2358553.jpg", "answer": 1, "data_source": "imported_genome", "question": , "image_id": 5184, "question_id": 20031300} => the question is missing, which throws an error while loading
- Adjust the folder names of the images to match the data_source field
- Split the data into train, val and test sets once it is curated, and save the splits on Drive so that the same sets are always used
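
The last item can be sketched with a seeded shuffle, so that re-running the split always yields the same sets. This is a rough sketch only: the function names, the 10% validation and test fractions, and the seed are assumptions chosen for illustration.

```python
import json
import random


def split_data(data_points, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle with a fixed seed and cut the list into train/val/test."""
    shuffled = list(data_points)           # copy, the input stays untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible order

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


def save_splits(splits, path):
    """Write all three splits into a single JSON file."""
    with open(path, "w") as file:
        json.dump(splits, file)
```

Saving the result once (for example to the project folder on Drive) and loading it afterwards avoids depending on the shuffle being repeated identically in every session.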