# Performing Threshold Query and Query Results Approximation in Task-Me-Anything

In this notebook, we demonstrate how to perform a “Threshold Query” in Task-Me-Anything. We focus on identifying the task plans on which `llavav1.5-7b` scores below 50% accuracy, across 3,200+ task plans of the “2D sticker how many” task type. After that, we use the `Fit` and `Active` query results approximation algorithms to approximate the performance on the task plans within a budget of only 500 evaluations.













## Generate tasks

This section covers task plan generation; the `generate` part of the demo explains it in more detail.

In this step, we generate 3,249 “how many” task plans in 2D scenarios. Each task plan contains all the configuration and content needed to generate an image-question pair.

In [1]:
import sys
# set the working directory to the root of the project
sys.path.append("../..")
from tma.imageqa.sticker_2d import *
from tma.imageqa.metadata import Objaverse2DMetaData
from tma.task_store import TaskStore

# the code to download the source data, if you already downloaded the data, you can skip this step
# from huggingface_hub import snapshot_download
# path = "../taskverse-source"
# snapshot_download(repo_id="weikaih/taskverse-source", repo_type="dataset", local_dir=path)



path = '/your_path/TaskMeAnything-v1-source'
metadata = Objaverse2DMetaData('../../annotations', image_folder=f'{path}/object_images')
generator = HowManyGridTaskGenerator(metadata)


# enumerate all "how many" task plans
task_store = TaskStore(generator.schema)
generator.enumerate_task_plans(task_store)
df = task_store.return_df()


# sample a subset of all the "how many" task plans
interval = len(df) // 3000
df = df.iloc[::interval, :]
df

enumerating [how many attribute 1] task: 100%|██████████| 3/3 [00:00<00:00, 6868.40it/s]
enumerating [how many attribute 2] task: 100%|██████████| 465/465 [00:01<00:00, 268.08it/s]


| | task type | grid number | target category | count | attribute type | attribute value |
|---|---|---|---|---|---|---|
| 0 | how many | 2 | | 1 | color | gold |
| 10 | how many | 3 | | 7 | color | gold |
| 20 | how many | 3 | | 4 | color | red |
| 30 | how many | 3 | | 1 | color | green |
| 40 | how many | 2 | | 2 | color | brown |
| ... | ... | ... | ... | ... | ... | ... |
| 32440 | how many | 2 | Q99895 | 4 | color | white |
| 32450 | how many | 3 | Q99895 | 2 | color | white |
| 32460 | how many | 3 | Q99895 | 4 | color | white |
| 32470 | how many | 3 | Q99895 | 6 | color | white |
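
The strided subsampling used above can be tried on a toy table. This is a minimal sketch: `full_df` and its `plan_id` column are hypothetical stand-ins, and the size of 32,490 rows is an assumption chosen to be consistent with the 3,249 sampled task plans.

```python
import pandas as pd

# Hypothetical stand-in for the full table of enumerated task plans.
full_df = pd.DataFrame({"plan_id": range(32_490)})

# Keep every `interval`-th row so that roughly 3,000 plans remain.
interval = len(full_df) // 3000  # 32490 // 3000 == 10
subset = full_df.iloc[::interval, :]

print(interval, len(subset))  # 10 3249
```

Because of the integer division, the subset can end up slightly larger than 3,000, and the original row labels (0, 10, 20, ...) are kept as the index.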


## Embedding the tasks and creating a VQATaskEvaluator


The task evaluator takes the model and the task plans as input, then evaluates and queries the model's performance on the tasks generated from those task plans.




In [2]:
from tma.task_evaluator import VQATaskEvaluator

task_evaluator = VQATaskEvaluator(
    task_plan_df=df,  # DataFrame of task plans to evaluate
    task_generator=generator,  # task generator, used to generate test instances for each task plan
    embedding_name='st',  # use sentence transformer (st) to embed the questions
    embedding_batch_size=10000,  # batch size for embedding
    n_instance_per_task=5,  # number of test instances generated per task plan
    n_trials_per_instance=3,  # number of trials per test instance
    cache_path_root=".cache",  # enter your path for the cache
    seed=42  # random seed
)

## Evaluating the model on all the task plans

In this step, we obtain the ground truth for the query. We do not use query approximation algorithms here. Instead, we evaluate the model on all the task plans and use the resulting per-category accuracy as the ground truth.

You can call `tma.models.qa_model.list_imageqa_models()` to list all the available ImageQA models.

In [3]:
from tma.models.qa_model import list_imageqa_models

# list all available models
list_imageqa_models()

['instructblip-flant5xl',
 'instructblip-flant5xxl',
 'instructblip-vicuna7b',
 'instructblip-vicuna13b',
 'blip2-flant5xxl',
 'llavav1.5-7b',
 'llavav1.5-13b',
 'llavav1.6-34b',
 'llava1.6-34b-api',
 'qwenvl',
 'qwenvl-chat',
 'internvl-chat-v1.5',
 'gpt4v',
 'gpt4o',
 'qwen-vl-plus',
 'qwen-vl-max',
 'gemini-vision-pro']

We use `llavav1.5-7b` as the example model; you can use any other model you like, or several models at once.

In [4]:
from tma.models.qa_model import ImageQAModel
from tma.models.qa_model import prompt
import torch

# single model
model = ImageQAModel(model_name='llavav1.5-7b', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")

# # multiple models
# # Notice: If you have multiple GPUs, you can set the torch_device for each model to avoid running out of GPU memory.
# model1 = ImageQAModel(model_name='llavav1.5-7b', torch_device=0, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model2 = ImageQAModel(model_name='qwenvl-chat', torch_device=1, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# model = [model1, model2]

[IMPORTANT] model cache is enabled, cache path: .cache/
Loading llavav1.5-7b...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Finish loading llavav1.5-7b


After loading the model, we can start evaluating all the task plans.

In [5]:
import numpy as np

def results_process(ground_truth, groupby):
    # map each group (e.g. target category) to the mean accuracy of its task plans
    indices = list(df.groupby(groupby).indices.items())
    return {k: np.mean(ground_truth[i]) for k, i in indices}

groupby = "target category"
# evaluate the model on every task plan (one accuracy per task plan)
ground_truth = np.array(task_evaluator.evaluate(model=model))

category_to_accuracy = results_process(ground_truth, groupby)
threshold_accuracy = 0.5  # set the threshold accuracy to 50%


Evaluating tasks: 100%|██████████| 3249/3249 [01:34<00:00, 34.53it/s] 
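
The grouping step in `results_process` relies on `DataFrame.groupby(...).indices`, which maps each group key to integer row positions rather than index labels. The toy example below illustrates this with made-up values; `toy_df` and `toy_accuracy` are hypothetical names that are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy task-plan table with a strided index, like the subsampled `df`.
toy_df = pd.DataFrame(
    {"target category": ["Q1", "Q1", "Q2", "Q2", "Q2"]},
    index=[10, 20, 30, 40, 50],
)
# One accuracy per task plan, in row order (illustrative values).
toy_accuracy = np.array([0.25, 0.5, 0.5, 0.75, 1.0])

# `.indices` yields positions such as [0, 1] and [2, 3, 4], so they can
# index the NumPy array directly even though the labels are 10, 20, ...
positions = toy_df.groupby("target category").indices
toy_category_to_accuracy = {
    key: float(np.mean(toy_accuracy[pos])) for key, pos in positions.items()
}
print(toy_category_to_accuracy)  # {'Q1': 0.375, 'Q2': 0.75}
```

This is why the per-task accuracy array can be indexed safely after the strided sampling: the positions refer to row order, not to the original labels.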


In [6]:
def print_category_threshold_status(category_to_accuracy, threshold):
    total_categories = len(category_to_accuracy)
    below_threshold_count = sum(1 for acc in category_to_accuracy.values() if acc < threshold)
    proportion_below_threshold = below_threshold_count / total_categories
    print(f"Current Threshold: {threshold}")
    print(f"Total Categories: {total_categories}")
    print(f"Categories Below Threshold: {below_threshold_count}")
    print(f"Proportion Below Threshold: {proportion_below_threshold:.2f}")
    for category, accuracy in category_to_accuracy.items():
        status = "below" if accuracy < threshold else "above"
        print(f"Category: {category:<10} Accuracy: {accuracy:.2f} Status: {status}")
        
print_category_threshold_status(category_to_accuracy, threshold_accuracy)

Current Threshold: 0.5
Total Categories: 465
Categories Below Threshold: 390
Proportion Below Threshold: 0.84
Category: Q101674    Accuracy: 0.30 Status: below
Category: Q1021686   Accuracy: 0.43 Status: below
Category: Q102626    Accuracy: 0.37 Status: below
Category: Q10289     Accuracy: 0.53 Status: above
Category: Q104526    Accuracy: 0.28 Status: below
Category: Q104555    Accuracy: 0.20 Status: below
Category: Q104666136 Accuracy: 0.40 Status: below
Category: Q1047832   Accuracy: 0.47 Status: below
Category: Q106106    Accuracy: 0.37 Status: below
Category: Q1064858   Accuracy: 0.48 Status: below
Category: Q107126067 Accuracy: 0.44 Status: below
Category: Q107196890 Accuracy: 0.55 Status: above
Category: Q107293    Accuracy: 0.53 Status: above
Category: Q107444    Accuracy: 0.25 Status: below
Category: Q1093742   Accuracy: 0.57 Status: above
Category: Q10990     Accuracy: 0.43 Status: below
Category: Q11004     Accuracy: 0.56 Status: above
Category: Q110079    Accuracy: 0.43 Stat

## Apply query approximation algorithms
Query approximation algorithms evaluate the model on only a subset of the task plans and use those results to approximate its performance on all of them.

We will use the `Fit` algorithm and the `Active` algorithm to approximate the threshold query, and compare the results of both methods with the ground truth. Each algorithm gets a budget of 500, which means it can evaluate only 500 task plans.

* In the `Fit` approach, we randomly select 500 task plans and fit the function approximator.
* In the `Active` approach, we start with 200 task plans and then gradually add more task plans (10 per step) to the training set based on the function approximator's predictions.
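
The two strategies can be sketched on synthetic data. This is only an illustration and not the Task-Me-Anything implementation: the embeddings, the hidden accuracies, the linear least-squares approximator, and the acquisition rule (pick the tasks with the lowest predicted accuracy) are all assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical): 1,000 task plans, 8-dim embeddings,
# and a hidden accuracy per task plan that an evaluation would reveal.
n_tasks, dim = 1000, 8
X = rng.normal(size=(n_tasks, dim))
w_true = rng.normal(size=dim)
true_acc = 1.0 / (1.0 + np.exp(-(X @ w_true) / 2.0))  # values in (0, 1)
A = np.hstack([X, np.ones((n_tasks, 1))])  # embeddings plus a bias column

def fit_predict(train_idx):
    """Fit least squares on evaluated tasks, predict accuracy for all tasks."""
    coef, *_ = np.linalg.lstsq(A[train_idx], true_acc[train_idx], rcond=None)
    pred = A @ coef
    pred[train_idx] = true_acc[train_idx]  # evaluated tasks keep measured values
    return pred

def fit_strategy(budget):
    """'Fit': evaluate one random subset, then fit the approximator once."""
    train_idx = rng.permutation(n_tasks)[:budget]
    return fit_predict(train_idx), train_idx

def active_strategy(budget, warmup=200, step=10):
    """'Active': warm up on a random subset, then add `step` tasks per round."""
    train_idx = list(rng.permutation(n_tasks)[:warmup])
    while len(train_idx) < budget:
        pred = fit_predict(np.array(train_idx))
        evaluated = np.zeros(n_tasks, dtype=bool)
        evaluated[train_idx] = True
        candidates = np.flatnonzero(~evaluated)
        # Assumed acquisition rule: lowest predicted accuracy first.
        k = min(step, budget - len(train_idx))
        picked = candidates[np.argsort(pred[candidates])[:k]]
        train_idx.extend(picked.tolist())
    train_idx = np.array(train_idx)
    return fit_predict(train_idx), train_idx

threshold = 0.5
for name, strategy in [("fit", fit_strategy), ("active", active_strategy)]:
    pred, used = strategy(500)
    agreement = np.mean((pred < threshold) == (true_acc < threshold))
    print(f"{name}: evaluated {len(used)} tasks, agreement {agreement:.2f}")
```

Both strategies spend the same budget of 500 evaluations; they differ only in how the evaluated task plans are chosen.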

In [7]:
# functions that compare the approximation results against the ground truth
from sklearn.metrics import accuracy_score, precision_score, f1_score

def calculate_metrics(gt_labels, pred_labels):
    accuracy = accuracy_score(gt_labels, pred_labels)
    precision = precision_score(gt_labels, pred_labels)
    f1 = f1_score(gt_labels, pred_labels)
    return accuracy, precision, f1

def print_results(actual_categories_above_threshold, approximate_categories_above_threshold):
    gt_labels = [1 if cat in actual_categories_above_threshold else 0 for cat in approximate_categories_above_threshold]
    pred_labels = [1] * len(approximate_categories_above_threshold)  # All are predicted as 1
    accuracy, precision, f1 = calculate_metrics(gt_labels, pred_labels)
    print("Overall Metrics:")
    print(f"  Accuracy : {accuracy:.2f}")
    print(f"  Precision: {precision:.2f}")
    print(f"  F1 Score : {f1:.2f}")
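
Because `print_results` labels every returned category as 1, there are no false negatives: accuracy equals precision, recall is 1 whenever at least one returned category is in the ground-truth set, and F1 reduces to 2p / (1 + p) for precision p. The pure-Python sketch below checks this on made-up labels; the helper name is hypothetical.

```python
def metrics_when_all_predicted_positive(gt_labels):
    """Accuracy, precision, recall and F1 when every prediction is 1."""
    tp = sum(gt_labels)        # predicted 1, truly 1
    fp = len(gt_labels) - tp   # predicted 1, truly 0
    fn = 0                     # nothing is predicted 0, so no false negatives
    accuracy = tp / len(gt_labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn) if tp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return accuracy, precision, recall, f1

# 20 returned categories, 3 of which are in the ground-truth set (illustrative).
gt = [1] * 3 + [0] * 17
acc, prec, rec, f1 = metrics_when_all_predicted_positive(gt)
print(f"{acc:.2f} {prec:.2f} {rec:.2f} {f1:.2f}")  # 0.15 0.15 1.00 0.26
```

In this setup all three printed metrics are determined by a single number: the fraction of returned categories that belong to the ground-truth set.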
    
    



### Use "Fit" approximation algorithm

In [8]:
# set up the budget    
budget = 500
np.random.seed(42)
perm = np.random.permutation(len(df))
x_indices = perm[:budget]

# greater_than=False: query the categories whose accuracy is below the threshold (50%)
subset_approximation_results, performance_regressor = task_evaluator.threshold_query(
    threshold=threshold_accuracy,
    x_indices=x_indices,
    model=model,
    by=groupby,
    greater_than=False,
    fit_function_approximator=True
)
subset_approximation_categories_above_threshold = [result[0] for result in subset_approximation_results]
ground_truth_categories_above_threshold = [category for category in category_to_accuracy if category_to_accuracy[category] >= threshold_accuracy]
print_results(ground_truth_categories_above_threshold, subset_approximation_categories_above_threshold)

Evaluating tasks: 100%|██████████| 500/500 [00:01<00:00, 485.13it/s]
Embedding tasks: 100%|██████████| 3249/3249 [00:01<00:00, 1785.72it/s]


Overall Metrics:
  Accuracy : 0.15
  Precision: 0.15
  F1 Score : 0.26
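
What a threshold query returns can be pictured with a small stand-alone sketch. The helper below is hypothetical and is not the `threshold_query` implementation: it assumes strict inequalities and `(category, accuracy)` pairs as results, which matches how the code above reads `result[0]` as the category. The accuracies are taken from the ground-truth listing for illustration.

```python
def threshold_filter(category_to_accuracy, threshold, greater_than=True):
    """Return (category, accuracy) pairs on the requested side of the threshold."""
    if greater_than:
        keep = lambda acc: acc > threshold
    else:
        keep = lambda acc: acc < threshold
    return [(cat, acc) for cat, acc in category_to_accuracy.items() if keep(acc)]

approx = {"Q10289": 0.53, "Q101674": 0.30, "Q104555": 0.20}
below = threshold_filter(approx, threshold=0.5, greater_than=False)
print([cat for cat, _ in below])  # ['Q101674', 'Q104555']
```

With `greater_than=False`, only the categories on the low side of the threshold are returned, so any comparison set should be built from the same side.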


### Use "Active" approximation algorithm

In [9]:
warmup_budget=200
active_approximation_results, performance_regressor = task_evaluator.active_threshold_query(
    threshold=threshold_accuracy,
    warmup_budget=warmup_budget,
    model=model,
    by=groupby,
    budget=budget - warmup_budget,
    greater_than=False,
)
active_approximation_categories_above_threshold = [result[0] for result in active_approximation_results]
print_results(ground_truth_categories_above_threshold, active_approximation_categories_above_threshold)

[WARMUP] Querying 200 tasks


Evaluating tasks: 100%|██████████| 200/200 [00:00<00:00, 512.75it/s]


[Query] Queried 300/300 tasks
Overall Metrics:
  Accuracy : 0.13
  Precision: 0.13
  F1 Score : 0.23
