# Performing a Model Comparison Query and Query Results Approximation in Task-Me-Anything

In this notebook, we show how to perform a “Model Comparison Query” in Task-Me-Anything. We compare `llavav1.5-7b` against the baseline model `instructblip-flant5xl` on 3,200+ task plans of the “2D sticker how many” task type, and find the task plans on which `llavav1.5-7b` performs significantly better than `instructblip-flant5xl`. We then use the `Fit` and `Active` query results approximation algorithms to approximate the query results with a budget of only 500 task plan evaluations.

## Generate tasks

This section covers task plan generation; the process is explained in more detail in the `generate` part of the demo.

In this step, we generate 3,249 “how many” task plans in 2D scenarios. Each task plan contains all the configuration and content needed to generate an image-question pair (test instance).

In [1]:
import sys
# set the working directory to the root of the project
sys.path.append("../..")
from tma.imageqa.sticker_2d import *
from tma.imageqa.metadata import Objaverse2DMetaData
from tma.task_store import TaskStore

# download the source data; skip this step if you have already downloaded it
# from huggingface_hub import snapshot_download
# path = "../taskverse-source"
# snapshot_download(repo_id="weikaih/taskverse-source", repo_type="dataset", local_dir=path)



path = '/your_path/TaskMeAnything-v1-source'
metadata = Objaverse2DMetaData('../../annotations', image_folder=f'{path}/object_images')
generator = HowManyGridTaskGenerator(metadata)


# enumerate all "how many" task plans
task_store = TaskStore(generator.schema)
generator.enumerate_task_plans(task_store)
df = task_store.return_df()


# sample a subset of all "how many" task plans by keeping every `interval`-th row
interval = len(df) // 3000
df = df.iloc[::interval, :]
df

enumerating [how many attribute 1] task: 100%|██████████| 3/3 [00:00<00:00, 8955.81it/s]
enumerating [how many attribute 2] task: 100%|██████████| 465/465 [00:01<00:00, 241.76it/s]


| index | task type | grid number | target category | count | attribute type | attribute value |
|---|---|---|---|---|---|---|
| 0 | how many | 2 | | 1 | color | gold |
| 10 | how many | 3 | | 7 | color | gold |
| 20 | how many | 3 | | 4 | color | orange |
| 30 | how many | 3 | | 1 | color | black |
| 40 | how many | 2 | | 2 | color | pink |
| ... | ... | ... | ... | ... | ... | ... |
| 32440 | how many | 2 | Q99895 | 4 | color | white |
| 32450 | how many | 3 | Q99895 | 2 | color | white |
| 32460 | how many | 3 | Q99895 | 4 | color | white |
| 32470 | how many | 3 | Q99895 | 6 | color | white |
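
The subsampling step keeps every `interval`-th row, so the subset stays spread across the whole enumeration order instead of clustering at the start. Below is a minimal sketch with a plain Python list standing in for the DataFrame. The total of 32,490 task plans is a hypothetical figure chosen for illustration; the notebook does not report the exact size of the full enumeration.

```python
# Stride sampling sketch: keep every `interval`-th item, as `df.iloc[::interval, :]` does.
# NOTE: `total_plans` is a hypothetical size used only for illustration.
total_plans = 32_490
target_size = 3_000

plans = list(range(total_plans))       # stand-in for the rows of the task-plan DataFrame
interval = len(plans) // target_size   # integer stride between kept rows
subset = plans[::interval]             # same slicing semantics as df.iloc[::interval, :]

print(interval, len(subset), subset[:3], subset[-1])
```

Because the stride is rounded down, the subset can end up somewhat larger than the 3,000 target.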


## Embed the tasks and create a VQATaskEvaluator


The task evaluator takes the models and the task plans as input, evaluates the models on the test instances generated from each task plan, and answers queries about their performance.



<!-- Because we want to fit a performance regressor, we need to embed the tasks. We will use the Cohere API to embed the tasks. First you need to set the `api_key` parameter to your Cohere API key. You can also using other embedding API or models to embed the tasks. (e.g Openai embedding API, BERT, etc.)

Then you should create a `VQATaskEvaluator` object. `VQATaskEvaluator` is a class designed to evaluate a model's performance on task. It can handle the details in evaluate the model such as create the embedding of the tasks, fit the performance regressor, etc.

Notice that `VQATaskEvaluator` can cache the embeddings to avoid redundant requests to the OpenAI API. You can change the path of the cache file by setting the `cache_path` parameter. -->

In [2]:
from tma.task_evaluator import VQATaskEvaluator

task_evaluator = VQATaskEvaluator(
    task_plan_df=df,  # DataFrame of task plans to evaluate
    task_generator=generator,  # task generator, used to generate test instances for each task plan
    embedding_name='st',  # use a sentence transformer (st) to embed the questions
    embedding_batch_size=10000,  # batch size for embedding
    n_instance_per_task=5,  # number of test instances generated per task plan
    n_trials_per_instance=3,  # number of trials per test instance
    cache_path_root=".cache",  # root path for the cache
    seed=42  # random seed
)
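
To see why a query budget matters, it helps to count model calls. The sketch below assumes that every trial of every test instance costs one call per model. That is an assumption about how `n_instance_per_task` and `n_trials_per_instance` combine, not something the notebook states, and `model_calls` is a hypothetical helper.

```python
def model_calls(n_task_plans, n_instance_per_task=5, n_trials_per_instance=3, n_models=2):
    """Rough call count, assuming one call per (task plan, instance, trial, model)."""
    return n_task_plans * n_instance_per_task * n_trials_per_instance * n_models

full = model_calls(3249)      # every sampled task plan, both models
budgeted = model_calls(500)   # only a 500-plan budget, both models

print(full, budgeted, round(budgeted / full, 3))
```

Under these assumptions, a 500-plan budget needs roughly 15% of the calls of a full evaluation.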

## Evaluating the models on all the task plans

In this step, we compute the ground truth of the query. We do not use query approximation algorithms here. Instead, we evaluate both models on all the task plans and take the task plans on which `llavav1.5-7b` outperforms the baseline by more than the threshold as the ground truth.

You can call `tma.models.qa_model.list_imageqa_models()` to list all the available image QA models.

In [3]:
from tma.models.qa_model import list_imageqa_models

# list all available models
list_imageqa_models()

['instructblip-flant5xl',
 'instructblip-flant5xxl',
 'instructblip-vicuna7b',
 'instructblip-vicuna13b',
 'blip2-flant5xxl',
 'llavav1.5-7b',
 'llavav1.5-13b',
 'llavav1.6-34b',
 'llava1.6-34b-api',
 'qwenvl',
 'qwenvl-chat',
 'internvl-chat-v1.5',
 'gpt4v',
 'gpt4o',
 'qwen-vl-plus',
 'qwen-vl-max',
 'gemini-vision-pro']

For this showcase, we use `instructblip-flant5xl` as the baseline model and `llavav1.5-7b` as the model to compare. You can substitute other models, or use multiple models on either side.

In [4]:
from tma.models.qa_model import ImageQAModel
from tma.models.qa_model import prompt
import torch

# single model
baseline_model = ImageQAModel(model_name='instructblip-flant5xl', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
model_to_compare = ImageQAModel(model_name='llavav1.5-7b', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")


# # multiple models
# # Notice: If you have multiple GPUs, you can set the torch_device for each model to avoid running out of GPU memory.
# model1 = ImageQAModel(model_name='llavav1.5-7b', torch_device=0, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model2 = ImageQAModel(model_name='llavav1.5-13b', torch_device=1, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# baseline_models = [model1, model2]


# model3 = ImageQAModel(model_name='qwenvl', torch_device=3, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model4 = ImageQAModel(model_name='qwenvl-chat', torch_device=4, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# models_to_compare = [model3, model4]

[IMPORTANT] model cache is enabled, cache path: .cache/
Loading instructblip-flant5xl...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Finish loading instructblip-flant5xl
[IMPORTANT] model cache is enabled, cache path: .cache/
Loading llavav1.5-7b...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Finish loading llavav1.5-7b


After loading the models, we can start evaluating all the task plans.

In [5]:
import numpy as np

# find the task plans on which model_to_compare outperforms baseline_model by more than the threshold (0.3)

ground_truth_results = task_evaluator.model_compare(
    x_indices=np.arange(len(df)),
    greater_than=True,
    threshold = 0.3,
    baselines=[baseline_model],
    model = model_to_compare,
    fit_function_approximator=False
)

Evaluating tasks: 100%|██████████| 3249/3249 [05:38<00:00,  9.59it/s] 
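
Conceptually, the query selects the task plans on which the compared model beats the baseline by more than the threshold. The sketch below is one plausible reading of `greater_than=True` with `threshold=0.3`, using made-up accuracies. `select_by_gap` is a hypothetical helper, not part of `tma`, and it does not cover how `model_compare` handles several baselines.

```python
def select_by_gap(model_acc, baseline_acc, threshold):
    """Indices of task plans where model accuracy exceeds baseline accuracy by more than threshold."""
    return [
        idx
        for idx, (m, b) in enumerate(zip(model_acc, baseline_acc))
        if m - b > threshold
    ]

# Made-up per-task-plan accuracies, for illustration only.
model_acc = [0.9, 0.6, 0.2, 1.0]
baseline_acc = [0.4, 0.5, 0.7, 0.6]

print(select_by_gap(model_acc, baseline_acc, threshold=0.3))
```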


In [12]:
def display_results(results):
    pattern_stats = results[0]
    # Determine the headers
    headers = ["Pattern", "Times"]
    
    # Calculate the maximum length for formatting
    max_pattern_length = max(len(str(plan[1])) for plan in pattern_stats)
    
    # Print the headers
    print(f"{headers[0]:<{max_pattern_length}} {headers[1]}")
    print("-" * (max_pattern_length + len(headers[1]) + 1))
    
    # Each entry is a (count, attributes) pair; print one pattern per row
    for count, attributes in pattern_stats:
        pattern = ', '.join(f"{attr}: {value}" for attr, value in attributes)
        print(f"{pattern:<{max_pattern_length}} {count}")
        
display_results(ground_truth_results)

Pattern                                                                        Times
------------------------------------------------------------------------------------
task type: how many                                                            534
task type: how many, grid number: 3                                            290
task type: how many, attribute type: color                                     285
task type: how many, grid number: 2                                            244
task type: how many, grid number: 3, attribute type: color                     153
task type: how many, count: 1                                                  139
task type: how many, grid number: 2, attribute type: color                     132
task type: how many, count: 3                                                  112
task type: how many, count: 1, attribute type: color                           92
task type: how many, grid number: 2, count: 3                                  92
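
The first element of the returned results appears to be a list of `(count, pattern)` pairs, where each pattern is a list of `(attribute, value)` tuples. This shape is inferred from the raw output printed by the `Active` cell later in this notebook. The sketch below ranks such pairs and renders each one as a readable line. The sample data copies three rows from the table above, and `format_patterns` is a hypothetical helper.

```python
def format_patterns(pattern_stats, top=None):
    """Turn (count, [(attr, value), ...]) pairs into ('attr: value, ...', count) rows, most frequent first."""
    ranked = sorted(pattern_stats, key=lambda item: item[0], reverse=True)
    if top is not None:
        ranked = ranked[:top]
    return [(", ".join(f"{a}: {v}" for a, v in attrs), count) for count, attrs in ranked]

sample = [
    (290, [("task type", "how many"), ("grid number", "3")]),
    (534, [("task type", "how many")]),
    (285, [("task type", "how many"), ("attribute type", "color")]),
]

for text, count in format_patterns(sample, top=2):
    print(f"{text:<45} {count}")
```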


## Apply query approximation algorithms
Query approximation algorithms evaluate the models on only a subset of the task plans and use those results to approximate performance on all the task plans.

We will use the `Fit` and `Active` algorithms to approximate the model comparison query, and compare the results of both methods with the ground truth. Each algorithm gets a budget of 500, meaning it can evaluate at most 500 task plans. Note that the approximation cells below use `threshold = 0.2`, whereas the ground truth above was computed with `threshold = 0.3`.

* In the `Fit` approach, we randomly select 500 task plans and fit the function approximator.
* In the `Active` approach, we start with 200 task plans and then gradually add more task plans to the training set based on the function approximator's predictions.
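
The two approaches can be sketched on toy data as follows. Everything here is an illustrative assumption rather than the `tma` implementation: `true_gap` stands in for real model evaluation, a k-nearest-neighbour average plays the role of the function approximator, the acquisition rule (evaluate the unseen plans with the highest predicted gap) is a guess, and the sketch applies predictions to every task plan.

```python
import random

def true_gap(x):
    """Toy stand-in for a real evaluation: accuracy gap (model - baseline) of a task plan."""
    return 0.5 if 0.6 <= x <= 0.8 else 0.0

def knn_predict(train, x, k=3):
    """Toy function approximator: average gap of the k nearest evaluated task plans."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(gap for _, gap in nearest) / len(nearest)

def evaluate(features, indices):
    """Evaluate the given task plans and return (feature, gap) training pairs."""
    return [(features[i], true_gap(features[i])) for i in indices]

def fit_query(features, budget, threshold, rng):
    """Fit: evaluate one random sample, then predict the gap of every task plan."""
    evaluated = rng.sample(range(len(features)), budget)
    train = evaluate(features, evaluated)
    selected = {i for i, x in enumerate(features) if knn_predict(train, x) > threshold}
    return selected, set(evaluated)

def active_query(features, warmup, budget, threshold, rng, step=10):
    """Active: warm up on a random sample, then repeatedly evaluate promising plans."""
    evaluated = set(rng.sample(range(len(features)), warmup))
    total = warmup + budget
    while len(evaluated) < total:
        train = evaluate(features, sorted(evaluated))
        pool = [i for i in range(len(features)) if i not in evaluated]
        # Assumed acquisition rule: pick the unseen plans with the highest predicted gap.
        pool.sort(key=lambda i: knn_predict(train, features[i]), reverse=True)
        evaluated.update(pool[: min(step, total - len(evaluated))])
    train = evaluate(features, sorted(evaluated))
    selected = {i for i, x in enumerate(features) if knn_predict(train, x) > threshold}
    return selected, evaluated

rng = random.Random(0)
features = [i / 300 for i in range(300)]   # one toy feature per task plan
fit_sel, fit_eval = fit_query(features, budget=50, threshold=0.2, rng=rng)
act_sel, act_eval = active_query(features, warmup=20, budget=30, threshold=0.2, rng=rng)
print(len(fit_eval), len(fit_sel), len(act_eval), len(act_sel))
```

Both variants spend the same total number of evaluations (50 in this toy run); they differ only in how the evaluated task plans are chosen.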

In [7]:
# functions to compare the approximation results with the ground truth
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compare_metric(gt, pred):
    # element 1 of each result holds the indices of the selected task plans
    gt_selection = gt[1]
    pred_selection = pred[1]

    # Determine the maximum index for array sizing
    max_index = max(max(gt_selection, default=0), max(pred_selection, default=0))

    # Initialize the labels based on the maximum index
    gt_label = np.zeros(max_index + 1)
    pred_label = np.zeros(max_index + 1)

    for k in gt_selection:
        gt_label[k] = 1

    for k in pred_selection:
        pred_label[k] = 1

    f1 = f1_score(gt_label, pred_label) * 100
    acc = accuracy_score(gt_label, pred_label) * 100
    precision = precision_score(gt_label, pred_label) * 100
    recall = recall_score(gt_label, pred_label) * 100

    return precision, recall, f1, acc

def print_metrics(precision, recall, f1, acc):
    print(f"{'Metric':<15} {'Value':<10}")
    print("-" * 25)
    print(f"{'Precision:':<15} {precision:.2f}%")
    print(f"{'Recall:':<15} {recall:.2f}%")
    print(f"{'F1 Score:':<15} {f1:.2f}%")
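
`compare_metric` turns the two index selections into 0/1 label vectors and passes them to scikit-learn. For the positive class, the same precision, recall, and F1 can be computed directly from set overlap, as the sketch below shows. `selection_metrics` is a hypothetical helper, and returning 0 for empty selections is a convention chosen here.

```python
def selection_metrics(gt_selection, pred_selection):
    """Precision, recall and F1 (in percent) between two collections of selected task indices."""
    gt, pred = set(gt_selection), set(pred_selection)
    hits = len(gt & pred)                                   # true positives
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision * 100, recall * 100, f1 * 100

# Toy example: 4 ground-truth plans, 3 predicted plans, 2 in common.
p, r, f1 = selection_metrics([1, 2, 3, 4], [3, 4, 9])
print(f"{p:.2f} {r:.2f} {f1:.2f}")
```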

### Use the "Fit" approximation algorithm

In [13]:
budget = 500
np.random.seed(42)
perm = np.random.permutation(len(df))
x_indices = perm[:budget]

fit_results = task_evaluator.model_compare(
    x_indices=x_indices,
    greater_than=True,
    threshold = 0.2,
    baselines=[baseline_model],
    model = model_to_compare,
    fit_function_approximator=True
)
precision, recall, f1, acc = compare_metric(ground_truth_results, fit_results)
print_metrics(precision, recall, f1, acc)
display_results(fit_results)

Evaluating tasks: 100%|██████████| 500/500 [00:03<00:00, 144.09it/s]


Metric          Value     
-------------------------
Precision:      64.23%
Recall:         14.79%
F1 Score:       24.05%
Pattern                                                                        Times
------------------------------------------------------------------------------------
task type: how many                                                            123
task type: how many, grid number: 3                                            75
task type: how many, attribute type: color                                     73
task type: how many, grid number: 2                                            48
task type: how many, grid number: 3, attribute type: color                     44
task type: how many, count: 1                                                  29
task type: how many, grid number: 2, attribute type: color                     29
task type: how many, count: 3                                                  24
task type: how many, count: 1, attribute type: colo

### Use the "Active" approximation algorithm

In [9]:
warmup_budget=200
active_results = task_evaluator.active_model_compare(
    k=10,
    warmup_budget=warmup_budget,
    budget=budget-warmup_budget,
    greater_than=True,
    threshold = 0.2,
    baselines=[baseline_model],
    model = model_to_compare,
)

precision, recall, f1, acc = compare_metric(ground_truth_results, active_results)
print_metrics(precision, recall, f1, acc)
display_results(active_results)

[WARMUP] Querying 200 tasks


Evaluating tasks: 100%|██████████| 200/200 [00:01<00:00, 146.27it/s]


[Query] Queried 300/300 tasks
[(134, [('task type', 'how many')]), (85, [('task type', 'how many'), ('attribute type', 'color')]), (79, [('task type', 'how many'), ('grid number', '3')]), (55, [('task type', 'how many'), ('grid number', '2')]), (47, [('task type', 'how many'), ('grid number', '3'), ('attribute type', 'color')]), (38, [('task type', 'how many'), ('grid number', '2'), ('attribute type', 'color')]), (36, [('task type', 'how many'), ('count', '1')]), (35, [('task type', 'how many'), ('count', '3')]), (26, [('task type', 'how many'), ('count', '1'), ('attribute type', 'color')]), (26, [('task type', 'how many'), ('grid number', '2'), ('count', '3')])]
Metric          Value     
-------------------------
Precision:      70.90%
Recall:         17.79%
F1 Score:       28.44%
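
As a sanity check, the reported percentages are consistent with the selection sizes shown in the outputs: 534 task plans in the ground truth, 123 selected by `Fit`, and 134 selected by `Active`. This assumes that the count for the pattern `task type: how many` equals the total number of selected task plans, since every plan here has that task type. The true-positive counts of 79 and 95 are not printed by the notebook; they are inferred here as the integers that reproduce the reported precision and recall.

```python
def percent(numerator, denominator):
    """Ratio expressed as a percentage."""
    return 100 * numerator / denominator

ground_truth_size = 534                          # count for "task type: how many" in the ground truth
runs = {"Fit": (123, 79), "Active": (134, 95)}   # (selected, INFERRED true positives)

for name, (selected, hits) in runs.items():
    precision = percent(hits, selected)
    recall = percent(hits, ground_truth_size)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:<7} precision={precision:.2f}% recall={recall:.2f}% f1={f1:.2f}%")
```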
