# Model Baselines for VQA Counting

### Options:

#### a) Small:

1. Blip-VQA-Base: https://huggingface.co/Salesforce/blip-vqa-base
2. ViLt-finetuned-vqa: https://huggingface.co/dandelin/vilt-b32-finetuned-vqa
3. OWL-ViT (multimodal object detection): https://huggingface.co/google/owlvit-base-patch32

Rather not:

4. Movie-ResNext: https://github.com/facebookresearch/mmf/tree/main/projects/movie_mcan

=> only available through the Facebook Research MMF repo, which ships a huge model zoo, so probably more effort to work with

5. RCN (TallyQA): https://github.com/manoja328/tallyqacode/tree/master

=> also only on GitHub. Doubts: the repo is not popular, some parts (e.g. the data-related dependencies in config.py) are unclear, and the model is probably outdated

#### b) Big:
1. Pali-Gemma-3b: https://huggingface.co/google/paligemma-3b-pt-224
2. InternLm-XComposer2-vl-7b: https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/tree/main
3. MiniCPM-Llama3-V-2_5 (8.54B): https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5

=> More of a general multimodal LLM; also available with 2 billion parameters

In [18]:
from google.colab import drive

from PIL import Image
import os
import json
from tqdm import tqdm
import numpy as np

import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

import transformers
from transformers import AutoTokenizer, AutoProcessor, AutoModelForVisualQuestionAnswering

drive.mount('/content/drive')
# %cd /content/drive/My Drive/MAI_project

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
# define the data class
class HighCountVQADataset(Dataset):
  def __init__(self, data_root, json_name, transform=transforms.ToTensor(), mode='RGB'):
    with open(os.path.join(data_root, json_name), 'r') as file:
      self.data_points = json.load(file)

    self.mode = mode
    self.data_root = data_root
    self.transform = transform
    self.source_map = {'generate': 'coco', 'imported_vqa': 'coco',
                       'tdiuc_templates': 'coco', 'amt': 'visual_genome',
                       'imported_genome': 'visual_genome'}

  def __len__(self):
    return len(self.data_points)

  def __getitem__(self, idx):
    data_point = self.data_points[idx]
    img_path = os.path.join(self.data_root, self.source_map[data_point['data_source']], data_point['image'])


    img = self.transform(Image.open(img_path).convert(self.mode))

    answer = data_point['answer']
    question = data_point['question']

    return {'question': question, 'image': img, 'answer': answer}

In [7]:
# config

batch_size = 1
device = "cuda:0" if torch.cuda.is_available() else 'cpu'
model_name = "Salesforce/blip-vqa-base"
data_root = "/content/drive/MyDrive/MAI_project"
json_file = "HighCountVQA.json"

In [8]:
# declare all necessary components (here just for inference)
test_set = HighCountVQADataset(data_root=data_root,
                               json_name=json_file)
test_loader = DataLoader(test_set,
                         batch_size=batch_size, shuffle=False)


# Blip specific loading

# processor instead of tokenizer -> contains BERT tokenizer + BLIP image processor
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVisualQuestionAnswering.from_pretrained(model_name).to(device)
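
With `batch_size > 1`, the default collation of the DataLoader tries to stack the image tensors into one batch tensor, which fails as soon as two images differ in size. One possible workaround, sketched here under assumptions, is a custom `collate_fn` that leaves the images in a plain Python list so that resizing can be left to the processor. The name `list_collate` and the stand-in items are hypothetical and only for illustration.

```python
def list_collate(batch):
    """Collate a list of dataset items without stacking the images."""
    return {
        'question': [item['question'] for item in batch],
        # kept as a list, so the images may have different sizes
        'image': [item['image'] for item in batch],
        'answer': [item['answer'] for item in batch],
    }


# Stand-in items for illustration; real items would come from HighCountVQADataset.
fake_batch = [
    {'question': 'How many dogs are there?', 'image': 'image of size 640x480', 'answer': 2},
    {'question': 'How many cups are there?', 'image': 'image of size 500x375', 'answer': 7},
]
collated = list_collate(fake_batch)
print(collated['answer'])
```

It would be passed as `DataLoader(test_set, batch_size=8, shuffle=False, collate_fn=list_collate)`. The Hugging Face processor accepts a list of images and, for BLIP, should resize them to the model's input resolution, but this has not been verified on this dataset.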

In [24]:
total_iterations = int(len(test_set) / batch_size)

# data structures for tracking accuracies
max_obj_number = 15
total_count = 0
acc_arr = np.zeros((2, max_obj_number))

with torch.no_grad():
  for idx, data in enumerate(tqdm(test_loader, total=total_iterations)):
    questions = data['question']
    imgs = data['image']
    answers = np.array(data['answer'])

    inputs = processor(imgs, questions, return_tensors="pt").to(device)
    # "generate" is only for inference to get the full answer instead of just the next token
    outputs = model.generate(**inputs)
    predictions = processor.batch_decode(outputs, skip_special_tokens=True)
    predictions = np.array(predictions, dtype=np.int32)

    # update accuracy counts
    true_pred = (predictions == answers)

    total_count += true_pred.sum()
    # add 1 to each ground truth occurrence
    np.add.at(acc_arr[0], answers, 1)
    # add 1 to each correct count of those
    np.add.at(acc_arr[1], answers[true_pred], 1)

single_acc = (acc_arr[1] / acc_arr[0]) * 100

print(f'Total Accuracy: {(total_count / len(test_set))*100}%')
# acc_arr is indexed by the answer value itself, so index i holds the stats for a count of i
for i in range(max_obj_number):
  print(f'Accuracy for {i} objects: {single_acc[i]}%')

  0%|          | 4/61538 [00:14<63:25:48,  3.71s/it]


ValueError: invalid literal for int() with base 10: 'one'
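
The `ValueError` is raised when the decoded answer `'one'` is converted to `np.int32`. A minimal sketch of one way around it, assuming that number words up to fifteen are enough and that anything unparseable should simply count as a wrong prediction: `parse_count`, `NUMBER_WORDS`, and the sentinel `-1` are hypothetical names and choices, not part of the notebook so far.

```python
# Hypothetical lookup table; extend it if the model emits other words.
NUMBER_WORDS = {
    'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
    'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10, 'eleven': 11,
    'twelve': 12, 'thirteen': 13, 'fourteen': 14, 'fifteen': 15,
}


def parse_count(text, invalid=-1):
    """Convert a decoded model answer to an int, or return `invalid`."""
    cleaned = text.strip().lower()
    if cleaned.isdecimal():
        return int(cleaned)
    # number words are looked up; everything else becomes the sentinel
    return NUMBER_WORDS.get(cleaned, invalid)


decoded = ['one', '3', ' Twelve ', 'many']
parsed = [parse_count(answer) for answer in decoded]
print(parsed)
```

In the evaluation loop, `np.array([parse_count(p) for p in predictions], dtype=np.int32)` could replace the direct conversion. Since `-1` never equals a ground-truth count, unparseable answers are then scored as incorrect instead of crashing the run.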

# Problems
- MANY different image sizes -> I checked the first 500 images and EVERY image has a different size

=> either use a common transforms function (but which resolution fits, given that many models expect different input sizes?) or only use a batch size of 1, since the error is only thrown when the PyTorch DataLoader tries to stack the images
- Known problem between Colab and Drive: IO errors when folders contain too many files (between 5k-15k?) -> a possible solution is to split the data again into many subfolders

- The model sometimes outputs number words such as 'one' instead of digits -> how to treat these cases?
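
For the Colab/Drive IO problem listed above, one way to split the images into subfolders is to derive the subfolder from the file name, so that the same function can be used both when moving the files and when loading them. This is only a sketch: `sharded_path`, the shard count of 64, and the three-digit folder names are assumptions, and the right number of shards depends on how many files Drive tolerates per folder.

```python
import os
import zlib


def sharded_path(root, filename, n_shards=64):
    """Return root/<shard>/<filename>, with the shard derived from the file name."""
    # crc32 is stable across Python sessions, unlike the built-in hash() for strings
    shard = zlib.crc32(filename.encode('utf-8')) % n_shards
    return os.path.join(root, f'{shard:03d}', filename)


path = sharded_path('/content/drive/MyDrive/MAI_project/visual_genome', '2358553.jpg')
print(path)
```

The dataset class would then build `img_path` with the same function instead of joining the root and file name directly.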

# ToDo's
- address problems
- Split the data into train, val, and test (once the data is curated) and save the split on Drive so that we always use the same sets
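
A minimal sketch of how a reproducible split could look, assuming a simple random split by data point index with a fixed seed. The function name `split_indices`, the 80/10/10 ratio, and the seed are placeholders, and a random split ignores that several questions may share one image, which may or may not matter here.

```python
import json
import random


def split_indices(n_items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the indices 0..n_items-1 with a fixed seed and cut them into three sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = int(n_items * val_frac)
    n_test = int(n_items * test_frac)
    return {
        'val': sorted(indices[:n_val]),
        'test': sorted(indices[n_val:n_val + n_test]),
        'train': sorted(indices[n_val + n_test:]),
    }


splits = split_indices(1000)
# writing this string to a JSON file on Drive freezes the split for later runs
serialized = json.dumps(splits)
print({name: len(idx) for name, idx in splits.items()})
```

Loading the saved JSON instead of re-running the shuffle keeps the sets identical even if the seed or the code changes later.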

# Notes:
- Data Point deleted: {"image": "VG_100K/2358553.jpg", "answer": 1, "data_source": "imported_genome", "question": , "image_id": 5184, "question_id": 20031300}
