# Performing Top K Query and Query Results Approximation in Task-Me-Anything

In this notebook, we show how to perform a “Top K Query” in Task-Me-Anything. We focus on identifying the top 10 worst-performing task plans of `llavav1.5-7b` among 3,200+ task plans of the “2D sticker how many” task type. After that, we use the `Fit` and `Active` query results approximation algorithms to approximate the performance on all task plans with a budget of only 500 evaluated task plans.

## Generate tasks

The cell below generates the task plans. The generation process itself is explained in the `generate` part of the demo.

In this step, we generate 3,249 “how many” task plans in 2D scenarios. Each task plan contains all the configuration and content needed to generate an image-question pair.

In [1]:
import sys
# add the root of the project to the import path
sys.path.append("../..")
from tma.imageqa.sticker_2d import *
from tma.imageqa.metadata import Objaverse2DMetaData
from tma.task_store import TaskStore

# the code to download the source data, if you already downloaded the data, you can skip this step
# from huggingface_hub import snapshot_download
# path = "../TaskMeAnything-v1-source"
# snapshot_download(repo_id="jieyuz2/TaskMeAnything-v1-source", repo_type="dataset", local_dir=path)



path = '/your_path/TaskMeAnything-v1-source'
metadata = Objaverse2DMetaData('../../annotations', image_folder=f'{path}/object_images')
generator = HowManyGridTaskGenerator(metadata)


# enumerate all "how many" task plans
task_store = TaskStore(generator.schema)
generator.enumerate_task_plans(task_store)
df = task_store.return_df()


# sample an evenly spaced subset of all the "how many" task plans
interval = len(df) // 3000
df = df.iloc[::interval, :]
df

enumerating [how many attribute 1] task: 100%|██████████| 3/3 [00:00<00:00, 8659.95it/s]
enumerating [how many attribute 2] task: 100%|██████████| 465/465 [00:01<00:00, 259.03it/s]


|  | task type | grid number | target category | count | attribute type | attribute value |
|---|---|---|---|---|---|---|
| 0 | how many | 2 |  | 1 | color | blue |
| 10 | how many | 3 |  | 7 | color | blue |
| 20 | how many | 3 |  | 4 | color | gold |
| 30 | how many | 3 |  | 1 | color | black |
| 40 | how many | 2 |  | 2 | color | yellow |
| ... | ... | ... | ... | ... | ... | ... |
| 32440 | how many | 2 | Q99895 | 4 | color | white |
| 32450 | how many | 3 | Q99895 | 2 | color | white |
| 32460 | how many | 3 | Q99895 | 4 | color | white |
| 32470 | how many | 3 | Q99895 | 6 | color | white |
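
The stride-based subsampling in the cell above can be checked on a toy table. This is a minimal sketch: `toy_df`, `plan_id`, and `target` are hypothetical names, and the row counts are made up. It shows that slicing with `iloc[::interval]` keeps `ceil(n / interval)` rows, which is why the subset can end up somewhat larger than the number used to compute the interval.

```python
import math
import pandas as pd

# Toy stand-in for the enumerated task-plan table (35 hypothetical rows).
toy_df = pd.DataFrame({"plan_id": range(35)})

target = 10                          # desired subset size (hypothetical)
interval = len(toy_df) // target     # 35 // 10 == 3
subset = toy_df.iloc[::interval, :]  # keep rows 0, 3, 6, ...

# A stride of `interval` keeps ceil(n / interval) rows, which can exceed `target`.
expected_rows = math.ceil(len(toy_df) / interval)
print(len(subset), expected_rows)    # 12 12
```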


## Embed the tasks and create a VQATaskEvaluator


The task evaluator takes the model and the tasks as input, then evaluates and queries the model's performance on the tasks generated from the task plans.



<!-- Because we want to fit a performance regressor, we need to embed the tasks. We will use the Cohere API to embed the tasks. First you need to set the `api_key` parameter to your Cohere API key. You can also using other embedding API or models to embed the tasks. (e.g Openai embedding API, BERT, etc.)

Then you should create a `VQATaskEvaluator` object. `VQATaskEvaluator` is a class designed to evaluate a model's performance on task. It can handle the details in evaluate the model such as create the embedding of the tasks, fit the performance regressor, etc.

Notice that `VQATaskEvaluator` can cache the embeddings to avoid redundant requests to the OpenAI API. You can change the path of the cache file by setting the `cache_path` parameter. -->

In [2]:
from tma.task_evaluator import VQATaskEvaluator

task_evaluator = VQATaskEvaluator(
    task_plan_df=df,  # DataFrame of task plans to evaluate
    task_generator=generator,  # task generator, used to generate test instances for each task plan
    embedding_name='st',  # use a sentence transformer (st) to embed the questions
    embedding_batch_size=10000,  # batch size for embedding
    n_instance_per_task=5,  # number of test instances generated per task plan
    n_trials_per_instance=3,  # number of trials per test instance
    cache_path_root=".cache",  # enter your path for the cache
    seed=42  # random seed
)
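
To get a rough sense of why an evaluation budget is useful, the arithmetic below compares a full evaluation with one limited to 500 task plans. It is only a back-of-the-envelope sketch and assumes that every trial of every test instance costs one model call. The library may behave differently, for example when results are cached.

```python
# Assumption: each trial of each test instance costs one model call.
n_task_plans = 3249          # task plans kept after subsampling
n_instance_per_task = 5      # from the VQATaskEvaluator arguments above
n_trials_per_instance = 3    # from the VQATaskEvaluator arguments above
budget = 500                 # task plans an approximation algorithm may evaluate

calls_per_plan = n_instance_per_task * n_trials_per_instance
full_calls = n_task_plans * calls_per_plan
budget_calls = budget * calls_per_plan

print(calls_per_plan, full_calls, budget_calls)  # 15 48735 7500
print(f"budget covers {budget / n_task_plans:.1%} of the task plans")
```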

## Evaluating the model on all the task plans

In this step, we obtain the ground truth for the query. We do not use any query approximation algorithm here. Instead, we evaluate the model on all the tasks and take the top 10 worst-performing tasks as the ground truth.

You can call `tma.models.qa_model.list_imageqa_models()` to list all the available VQA models.

In [3]:
from tma.models.qa_model import list_imageqa_models

# list all available models
list_imageqa_models()

['instructblip-flant5xl',
 'instructblip-flant5xxl',
 'instructblip-vicuna7b',
 'instructblip-vicuna13b',
 'blip2-flant5xxl',
 'llavav1.5-7b',
 'llavav1.5-13b',
 'llavav1.6-34b',
 'llava1.6-34b-api',
 'qwenvl',
 'qwenvl-chat',
 'internvl-chat-v1.5',
 'gpt4v',
 'gpt4o',
 'qwen-vl-plus',
 'qwen-vl-max',
 'gemini-vision-pro']

We use `llavav1.5-7b` as the showcase model. You can substitute any other model you like, or pass a list of several models.

In [4]:
from tma.models.qa_model import ImageQAModel
from tma.models.qa_model import prompt
import torch

# single model
model = ImageQAModel(model_name='llavav1.5-7b', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")

# # multiple models
# # Notice: If you have multiple GPUs, you can set the torch_device for each model to avoid running out of GPU memory.
# model1 = ImageQAModel(model_name='llavav1.5-7b', torch_device=0, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model2 = ImageQAModel(model_name='qwenvl-chat', torch_device=1, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# model = [model1, model2]

[IMPORTANT] model cache is enabled, cache path: .cache/
Loading llavav1.5-7b...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Finish loading llavav1.5-7b


To evaluate our model’s performance across different grouped categories, we follow a systematic approach:

1.	Getting Score on Each Task Plan: For each task plan, we generate multiple questions. We then calculate the model’s aggregate score across these questions. This aggregate score provides a comprehensive measure of the model’s performance for each task plan.
2.	Grouping: We group different task plans by their “target category,” which is the category of the object that serves as the answer for a task plan. All test cases with the same target category are grouped together.
3.	Average Score Calculation: For each category, we calculate the average score using the scores of the task plans within that category.
4.	Ranking: Finally, we rank the categories based on their average scores and identify the top k categories where the model has the lowest average scores, indicating its weakest performance areas.

Example: Suppose there are 3 task plans where the answer to each plan is “banana.” For each task plan, we generate 3 task instances, resulting in 3 * 3 = 9 test instances. We evaluate the model on all 9 test instances and calculate the average score for each task plan. Since these 3 task plans all have “banana” as the answer, they are grouped together as a single category. The average score of the “banana” category is calculated using the aggregate scores of these 3 task plans. We then use this average score to rank the “banana” category among other categories.
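
The grouping, averaging, and ranking steps above can be sketched on a toy table. The categories and scores below are hypothetical values chosen for illustration, not real results, and `toy_plans`, `toy_scores`, and `toy_rank` are made-up names. The sketch follows the same `groupby(...).indices` and `np.argsort` pattern as the notebook cell that comes next.

```python
import numpy as np
import pandas as pd

# Hypothetical task plans and per-plan scores (not real results).
toy_plans = pd.DataFrame({
    "target category": ["banana", "banana", "banana", "apple", "apple", "pear"],
})
toy_scores = np.array([0.2, 0.4, 0.6, 0.9, 0.7, 0.1])

# Steps 2-3: group plans by category and average their scores.
groups = list(toy_plans.groupby("target category").indices.items())
category_mean = np.array([toy_scores[idx].mean() for _, idx in groups])

# Step 4: rank 0 is the category with the lowest average score.
toy_rank = {groups[i][0]: rank for rank, i in enumerate(np.argsort(category_mean))}
print(toy_rank)  # {'pear': 0, 'banana': 1, 'apple': 2}
```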

Task categories are represented by QIDs. A value starting with "Q" is the QID of the category and corresponds to a Wikidata entry (e.g., `Q11422` corresponds to https://www.wikidata.org/wiki/Q11422).
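
When inspecting results, it can help to turn a QID back into its Wikidata page. The helper below is a small sketch. `wikidata_url` is a hypothetical function, not part of Task-Me-Anything, and it only follows the URL pattern shown in the example above.

```python
def wikidata_url(qid: str) -> str:
    """Return the Wikidata page URL for a QID such as 'Q11422'."""
    # Hypothetical helper: checks the "Q" + digits shape before building the URL.
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    return f"https://www.wikidata.org/wiki/{qid}"

print(wikidata_url("Q11422"))  # https://www.wikidata.org/wiki/Q11422
```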

In [5]:
import numpy as np

groupby = "target category"
ground_truth = np.array(task_evaluator.evaluate(model=model))
indices = list(df.groupby(groupby).indices.items())
aggregate_perf = np.array([np.mean(ground_truth[i]) for k, i in indices])
category_to_rank = {indices[i][0]:rank for rank, i in enumerate(np.argsort(aggregate_perf))}

Evaluating tasks: 100%|██████████| 3249/3249 [02:21<00:00, 22.90it/s] 


In [6]:
category_to_rank

{'Q207220': 0,
 'Q161439': 1,
 'Q13202263': 2,
 'Q7220961': 3,
 'Q849813': 4,
 'Q2596997': 5,
 'Q245761': 6,
 'Q104555': 7,
 'Q172833': 8,
 'Q875696': 9,
 'Q2750929': 10,
 'Q42527': 11,
 'Q1317634': 12,
 'Q682582': 13,
 'Q11422': 14,
 'Q178': 15,
 'Q37828': 16,
 'Q35197': 17,
 'Q16917685': 18,
 'Q2248059': 19,
 'Q107444': 20,
 'Q155972': 21,
 'Q29024343': 22,
 'Q3962': 23,
 'Q207763': 24,
 'Q189299': 25,
 'Q23664': 26,
 'Q768186': 27,
 'Q2637814': 28,
 'Q170484': 29,
 'Q5936788': 30,
 'Q191851': 31,
 'Q50643': 32,
 'Q11285759': 33,
 'Q104526': 34,
 'Q19827042': 35,
 'Q127666': 36,
 'Q13681': 37,
 'Q171446': 38,
 'Q19968163': 39,
 'Q196538': 40,
 'Q3506176': 41,
 'Q188075': 42,
 'Q179904': 43,
 'Q1798603': 44,
 'Q131696': 45,
 'Q12132': 46,
 'Q4006': 47,
 'Q101674': 48,
 'Q939611': 49,
 'Q44106': 50,
 'Q729': 51,
 'Q5113': 52,
 'Q501862': 53,
 'Q11035': 54,
 'Q1129239': 55,
 'Q42177': 56,
 'Q6950796': 57,
 'Q11460': 58,
 'Q154': 59,
 'Q16836622': 60,
 'Q171495': 61,
 'Q81881': 62,
 'Q13

## Apply query approximation algorithms
A query approximation algorithm evaluates the model on only a subset of the task plans and uses those results to approximate its performance on all of them.

We will use the `Fit` and `Active` algorithms to approximate the top-k worst query and compare both against the ground truth. Each algorithm gets a budget of 500, which means it can evaluate at most 500 task plans.

* In the `Fit` approach, we randomly select 500 task plans and fit the function approximator.
* In the `Active` approach, we start with 200 task plans and then gradually add more task plans to the training set based on the function approximator's predictions.
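
The `Fit` approach described above can be sketched on synthetic data. Everything here is an assumption made for illustration: the embeddings and scores are random, a least-squares fit stands in for whatever function approximator the library uses, and keeping the measured scores for evaluated plans is a choice of this sketch rather than documented library behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 300 task plans, 8-dim embeddings, 30 categories.
n_plans, dim, n_categories, budget, k = 300, 8, 30, 60, 5
embeddings = rng.normal(size=(n_plans, dim))
categories = np.arange(n_plans) % n_categories
true_scores = embeddings @ rng.normal(size=dim) + 0.1 * rng.normal(size=n_plans)

# "Fit": evaluate a random subset of `budget` plans ...
evaluated = rng.permutation(n_plans)[:budget]

# ... fit a regressor on them (least squares with a bias term, as a stand-in) ...
X = np.hstack([embeddings, np.ones((n_plans, 1))])
weights, *_ = np.linalg.lstsq(X[evaluated], true_scores[evaluated], rcond=None)

# ... and predict every plan, keeping measured scores where we have them.
estimated = X @ weights
estimated[evaluated] = true_scores[evaluated]

# Average per category and take the k categories with the lowest estimate.
category_est = np.array([estimated[categories == c].mean() for c in range(n_categories)])
worst_k = np.argsort(category_est)[:k]
print(worst_k)
```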

The `Active` algorithm works as follows.
We first evaluate the model on 200 task plans (the `warmup_budget` in the cell below) as a warm-up phase, so that the performance regressor can roughly fit the model's behavior. Then, in each step, we use the regressor's predictions to select the top k worst task categories, evaluate the VQA model on them to obtain actual results, and add those results to the regressor's training set. In other words, the performance regressor decides which data points to evaluate next, and the VQA model supplies the actual performance data. This repeats until the budget is exhausted. The algorithm is inspired by Bayesian optimization.
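
The warm-up and selection loop can be sketched on synthetic data as well. This is an illustration under stated assumptions, not the library's implementation: the data is random, `evaluate` and `category_estimates` are hypothetical helpers, a least-squares fit stands in for the performance regressor, and the batch size per step (`k` plans) is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 300 task plans, 8-dim embeddings, 30 categories.
n_plans, dim, n_categories = 300, 8, 30
warmup_budget, budget, k = 30, 30, 5
embeddings = rng.normal(size=(n_plans, dim))
categories = np.arange(n_plans) % n_categories
true_scores = embeddings @ rng.normal(size=dim) + 0.1 * rng.normal(size=n_plans)
X = np.hstack([embeddings, np.ones((n_plans, 1))])

def evaluate(indices):
    """Stand-in for running the VQA model on the selected task plans."""
    return true_scores[indices]

def category_estimates(observed):
    """Fit on observed plans, predict the rest, average per category."""
    idx = np.array(list(observed), dtype=int)
    y = np.array(list(observed.values()))
    w, *_ = np.linalg.lstsq(X[idx], y, rcond=None)
    est = X @ w
    est[idx] = y  # keep measured scores for evaluated plans
    return np.array([est[categories == c].mean() for c in range(n_categories)])

# Warm-up: evaluate a random subset so the regressor has something to fit.
observed = {}
warmup = rng.permutation(n_plans)[:warmup_budget]
observed.update(zip(warmup.tolist(), evaluate(warmup).tolist()))

# Active phase: evaluate unseen plans from the predicted worst-k categories.
remaining = budget
while remaining > 0:
    worst = np.argsort(category_estimates(observed))[:k]
    candidates = [i for i in range(n_plans)
                  if categories[i] in worst and i not in observed]
    if not candidates:
        break
    batch = np.array(candidates[:min(remaining, k)])
    observed.update(zip(batch.tolist(), evaluate(batch).tolist()))
    remaining -= len(batch)

worst_k = np.argsort(category_estimates(observed))[:k]
print(len(observed), worst_k)
```

Compared with a purely random subset, the loop spends the post-warm-up budget on plans from the categories the regressor currently believes are worst, so those estimates rest increasingly on measured scores.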

In [7]:
# print the results of a query approximation algorithm next to the ground-truth ranks
def print_results(topk):
    hit_count = 0
    total_rank = 0

    for i, (k, v) in enumerate(topk):
        actual_rank = category_to_rank[k]
        print(f"category: {k:<10} predicted rank {i:<2} actual rank: {actual_rank:<3}")
        total_rank += actual_rank
        if actual_rank < len(topk):  # ranks are 0-indexed, so the true top k are ranks 0..k-1
            hit_count += 1

    mean_rank = total_rank / len(topk)
    hit_rate = hit_count / len(topk)

    print(f"Mean Rank: {mean_rank:.2f}")
    print(f"Hit Rate: {hit_rate:.2f}")
    
    
# set up the budget    
budget = 500

### Use "Fit" approximation algorithm

In [8]:
np.random.seed(42)
perm = np.random.permutation(len(df))
x_indices = perm[:budget]

top_k, performance_regressor = task_evaluator.top_k_query(
    k=10,
    x_indices=x_indices,
    model=model,
    reverse=True,
    by=groupby,
    fit_function_approximator=True
)
print_results(top_k)

Evaluating tasks: 100%|██████████| 500/500 [00:03<00:00, 149.56it/s]
Embedding tasks: 100%|██████████| 3249/3249 [00:01<00:00, 1682.25it/s]


category: Q161439    predicted rank 0  actual rank: 1  
category: Q11442     predicted rank 1  actual rank: 133
category: Q875696    predicted rank 2  actual rank: 9  
category: Q1317634   predicted rank 3  actual rank: 12 
category: Q101674    predicted rank 4  actual rank: 48 
category: Q265868    predicted rank 5  actual rank: 198
category: Q207763    predicted rank 6  actual rank: 24 
category: Q125356    predicted rank 7  actual rank: 87 
category: Q50643     predicted rank 8  actual rank: 32 
category: Q19827042  predicted rank 9  actual rank: 35 
Mean Rank: 57.90
Hit Rate: 0.20


### Use "Active" approximation algorithm

In [9]:
warmup_budget=200
top_k, performance_regressor = task_evaluator.active_top_k_query(
    k=10,
    warmup_budget=warmup_budget,
    budget=budget-warmup_budget,
    model=model,
    reverse=True,
    by=groupby,
)
print_results(top_k)

[WARMUP] Querying 200 tasks


Evaluating tasks: 100%|██████████| 200/200 [00:01<00:00, 148.37it/s]


[Query] Queried 300/300 tasks
category: Q161439    predicted rank 0  actual rank: 1  
category: Q245761    predicted rank 1  actual rank: 6  
category: Q172833    predicted rank 2  actual rank: 8  
category: Q682582    predicted rank 3  actual rank: 13 
category: Q2248059   predicted rank 4  actual rank: 19 
category: Q29024343  predicted rank 5  actual rank: 22 
category: Q2637814   predicted rank 6  actual rank: 28 
category: Q13681     predicted rank 7  actual rank: 37 
category: Q1798603   predicted rank 8  actual rank: 44 
category: Q4006      predicted rank 9  actual rank: 47 
Mean Rank: 22.50
Hit Rate: 0.30
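
The two summary metrics can be recomputed directly from the actual ranks printed above, which makes the comparison between `Fit` and `Active` easier to see. This is a small sketch and `summarize` is a hypothetical helper. It counts a hit when the actual rank falls in 0..k-1. None of the printed ranks equals 10, so the result is the same whether the comparison is strict or not.

```python
# Actual ranks copied from the printed results above (0 = worst-performing category).
fit_ranks = [1, 133, 9, 12, 48, 198, 24, 87, 32, 35]
active_ranks = [1, 6, 8, 13, 19, 22, 28, 37, 44, 47]

def summarize(actual_ranks):
    """Return (mean rank, hit rate) for a predicted top-k list."""
    k = len(actual_ranks)
    mean_rank = sum(actual_ranks) / k
    hit_rate = sum(r < k for r in actual_ranks) / k  # ranks 0..k-1 are the true top k
    return mean_rank, hit_rate

print(summarize(fit_ranks))     # (57.9, 0.2)
print(summarize(active_ranks))  # (22.5, 0.3)
```

With the same budget of 500 task plans, `Active` reaches a lower mean rank (22.50 versus 57.90) and a higher hit rate (0.30 versus 0.20) than `Fit` in this run.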
