# Performing Model Debugging Query and Query Results Approximation in Task-Me-Anything


In this notebook, we show how to perform a “Model Debugging Query” in Task-Me-Anything. We debug the performance of `llavav1.5-7b` on more than 3,200 task plans of the “2D sticker how many” task type by finding the task plans whose performance is at least 30% below the average. After that, we use the `Fit` and `Active` query results approximation algorithms to approximate the same query with a budget of only 500 task plan evaluations.













## Generate tasks

This step covers task plan generation; it is explained in more detail in the `generate` part of the demo.

In this step, we generate 3,249 “how many” task plans in 2D scenarios. Each task plan contains all the configuration and content needed to generate an image-question pair.

In [1]:
import sys
# set the working directory to the root of the project
sys.path.append("../..")
from tma.imageqa.sticker_2d import *
from tma.imageqa.metadata import Objaverse2DMetaData
from tma.task_store import TaskStore

# the code to download the source data, if you already downloaded the data, you can skip this step
# from huggingface_hub import snapshot_download
# path = "../TaskMeAnything-v1-source"
# snapshot_download(repo_id="jieyuz2/TaskMeAnything-v1-source", repo_type="dataset", local_dir=path)



path = '/your_path/TaskMeAnything-v1-source'
metadata = Objaverse2DMetaData('../../annotations', image_folder=f'{path}/object_images')
generator = HowManyGridTaskGenerator(metadata)


# enumerate all "how many" task plans
task_store = TaskStore(generator.schema)
generator.enumerate_task_plans(task_store)
df = task_store.return_df()


# sample an evenly spaced subset of roughly 3,000 "how many" task plans
interval = max(1, len(df) // 3000)  # guard against a zero step when fewer than 3,000 plans exist
df = df.iloc[::interval, :]
df

enumerating [how many attribute 1] task: 100%|██████████| 3/3 [00:00<00:00, 8848.74it/s]
enumerating [how many attribute 2] task: 100%|██████████| 465/465 [00:01<00:00, 261.10it/s]


| | task type | grid number | target category | count | attribute type | attribute value |
|---|---|---|---|---|---|---|
| 0 | how many | 2 | | 1 | color | white |
| 10 | how many | 3 | | 7 | color | white |
| 20 | how many | 3 | | 4 | color | green |
| 30 | how many | 3 | | 1 | color | gray |
| 40 | how many | 2 | | 2 | color | blue |
| ... | ... | ... | ... | ... | ... | ... |
| 32440 | how many | 2 | Q99895 | 4 | color | white |
| 32450 | how many | 3 | Q99895 | 2 | color | white |
| 32460 | how many | 3 | Q99895 | 4 | color | white |
| 32470 | how many | 3 | Q99895 | 6 | color | white |
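The `df.iloc[::interval, :]` step in the cell above keeps every `interval`-th row, which is why the index of the sampled task plans jumps in steps of 10. A minimal sketch of the same idea on a toy DataFrame (the column name `plan_id` and the sizes are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the enumerated task plans (hypothetical data).
toy = pd.DataFrame({"plan_id": range(32)})

target = 10                              # roughly how many rows we want to keep
interval = max(1, len(toy) // target)    # 32 // 10 == 3
subset = toy.iloc[::interval, :]         # keep rows 0, 3, 6, ...

print(len(subset), list(subset.index))
```

Because of the integer division, the subset can end up somewhat larger than the target, which is consistent with obtaining 3,249 task plans when asking for about 3,000.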


## Embedding the tasks and creating a VQATaskEvaluator


The task evaluator takes the model and the task plans as input, evaluates the model on the tasks generated from those task plans, and answers queries about the model's performance.



<!-- Because we want to fit a performance regressor, we need to embed the tasks. We will use the Cohere API to embed the tasks. First you need to set the `api_key` parameter to your Cohere API key. You can also using other embedding API or models to embed the tasks. (e.g Openai embedding API, BERT, etc.)

Then you should create a `VQATaskEvaluator` object. `VQATaskEvaluator` is a class designed to evaluate a model's performance on task. It can handle the details in evaluate the model such as create the embedding of the tasks, fit the performance regressor, etc.

Notice that `VQATaskEvaluator` can cache the embeddings to avoid redundant requests to the OpenAI API. You can change the path of the cache file by setting the `cache_path` parameter. -->

In [2]:
from tma.task_evaluator import VQATaskEvaluator

task_evaluator = VQATaskEvaluator(
    task_plan_df=df, # task plans to evaluate
    task_generator=generator, # task generator, used to generate test instances for each task plan
    embedding_name='st',  # using sentence transformer to embed questions
    embedding_batch_size=10000,  # batch size for embedding
    n_instance_per_task=5,  # number of test instances per task plan
    n_trials_per_instance=3,  # number of trials per test instance
    cache_path_root=".cache",  # enter your path for the cache
    seed=42  # random seed
)
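
To get a feel for why approximation is attractive, the evaluation cost can be estimated from the settings above. This is a back-of-the-envelope sketch that assumes every trial of every test instance is a separate model call and that nothing is served from the cache; `n_model_calls` is a hypothetical helper, not part of `tma`.

```python
def n_model_calls(n_task_plans, n_instance_per_task=5, n_trials_per_instance=3):
    """Upper bound on model calls, assuming one call per trial and no cache hits."""
    return n_task_plans * n_instance_per_task * n_trials_per_instance

full = n_model_calls(3249)     # evaluating every task plan
budgeted = n_model_calls(500)  # evaluating only a 500-plan budget

print(full, budgeted, f"{budgeted / full:.1%}")
```

Under these assumptions, a 500-plan budget needs roughly 15% of the model calls of the full evaluation.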

## Evaluating the model on all the task plans

In this step, we obtain the ground truth for the query. We do not use any query approximation algorithm here. Instead, we evaluate the model on all the task plans and run the model debugging query on the full results.

You can call `tma.models.qa_model.list_imageqa_models()` to list all the available ImageQA models.

In [3]:
from tma.models.qa_model import list_imageqa_models

# list all available models
list_imageqa_models()

['instructblip-flant5xl',
 'instructblip-flant5xxl',
 'instructblip-vicuna7b',
 'instructblip-vicuna13b',
 'blip2-flant5xxl',
 'llavav1.5-7b',
 'llavav1.5-13b',
 'llavav1.6-34b',
 'llava1.6-34b-api',
 'qwenvl',
 'qwenvl-chat',
 'internvl-chat-v1.5',
 'gpt4v',
 'gpt4o',
 'qwen-vl-plus',
 'qwen-vl-max',
 'gemini-vision-pro']

We use `llavav1.5-7b` as an example; you can use any other model from the list, or several models at once.

In [4]:
from tma.models.qa_model import ImageQAModel
from tma.models.qa_model import prompt
import torch

# single model
model = ImageQAModel(model_name='llavav1.5-7b', precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")

# # multiple models
# # Notice: If you have multiple GPUs, you can set the torch_device for each model to avoid running out of GPU memory.
# model1 = ImageQAModel(model_name='llavav1.5-7b', torch_device=0, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
# model2 = ImageQAModel(model_name='qwenvl-chat', torch_device=1, precision=torch.bfloat16, prompt_name = "succinct_prompt", prompt_func=prompt.succinct_prompt, cache_path = ".cache/")
 
# model = [model1, model2]

[IMPORTANT] model cache is enabled, cache path: .cache/
Loading llavav1.5-7b...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Finish loading llavav1.5-7b


After loading the model, we can evaluate it on all the task plans.

In [5]:
import numpy as np

ground_truth_results = task_evaluator.model_debug(
    x_indices=np.arange(len(df)),
    greater_than=False,
    threshold = 0.3,
    model = model,
    fit_function_approximator=False
)

Evaluating tasks: 100%|██████████| 3249/3249 [01:32<00:00, 34.98it/s] 
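
The query above asks for task plans that perform well below the average (`greater_than=False`, `threshold=0.3`). The sketch below illustrates this kind of filter on made-up scores. How `tma` interprets `threshold` internally is not shown in this notebook, so both readings here (a relative margin and an absolute margin) are assumptions, and `flag_low_performers` is a hypothetical helper.

```python
import numpy as np

def flag_low_performers(scores, threshold=0.3, relative=True):
    """Return indices of scores that fall below the mean by the given margin.

    relative=True : score <= mean * (1 - threshold)
    relative=False: score <= mean - threshold
    Both readings are assumptions made for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    cutoff = mean * (1 - threshold) if relative else mean - threshold
    return np.flatnonzero(scores <= cutoff)

toy_scores = [0.9, 0.8, 0.4, 0.25]  # made-up accuracies, mean = 0.5875
relative_hits = flag_low_performers(toy_scores, relative=True)
absolute_hits = flag_low_performers(toy_scores, relative=False)
print(relative_hits, absolute_hits)
```

On this toy input the two readings select different task plans, so it is worth checking the library source before interpreting the threshold.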


In [6]:
def display_results(results):
    # results[0] holds (times, attributes) pairs: how many selected
    # task plans match each attribute pattern
    pattern_stats = results[0]
    headers = ["Pattern", "Times"]

    # Width of the pattern column (falls back to the header width if there are no patterns)
    max_pattern_length = max(
        (len(str(attributes)) for _, attributes in pattern_stats),
        default=len(headers[0]),
    )

    # Print the headers
    print(f"{headers[0]:<{max_pattern_length}} {headers[1]}")
    print("-" * (max_pattern_length + len(headers[1]) + 1))

    # Print one row per pattern, e.g. "task type: how many, grid number: 3"
    for times, attributes in pattern_stats:
        pattern = ', '.join(f"{attr[0]}: {attr[1]}" for attr in attributes)
        print(f"{pattern:<{max_pattern_length}} {times}")

display_results(ground_truth_results)

Pattern                                                                        Times
------------------------------------------------------------------------------------
task type: how many                                                            1174
task type: how many, grid number: 3                                            858
task type: how many, attribute type: color                                     562
task type: how many, grid number: 3, attribute type: color                     430
task type: how many, grid number: 2                                            316
task type: how many, grid number: 3, count: 5                                  234
task type: how many, count: 4                                                  228
task type: how many, attribute type: material                                  203
task type: how many, attribute type: shape                                     189
task type: how many, count: 2                                                  186
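
Each row of the table pairs an attribute pattern with the number of selected task plans that match it. The sketch below shows one way such counts could be derived from a list of task plans by counting every combination of attribute values. It is an illustration only, not the implementation used by `tma`; `count_patterns`, `max_size`, and the toy plans are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_patterns(plans, max_size=3):
    """Count how many plans match each combination of up to max_size attributes."""
    counts = Counter()
    for plan in plans:
        # Ignore attributes that are not set for this plan.
        items = [(key, value) for key, value in plan.items() if value is not None]
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return counts

# Made-up low-performing task plans.
toy_plans = [
    {"task type": "how many", "grid number": 3, "count": 5},
    {"task type": "how many", "grid number": 3, "count": None},
    {"task type": "how many", "grid number": 2, "count": 4},
]
pattern_counts = count_patterns(toy_plans)
for pattern, times in pattern_counts.most_common(3):
    print(", ".join(f"{k}: {v}" for k, v in pattern), times)
```

Broad patterns such as `task type: how many` match every selected plan, so they naturally appear at the top of the list.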

## Apply query approximation algorithms
A query approximation algorithm evaluates the model on only a subset of the task plans and uses those results to approximate the query result over all the task plans.

We use the `Fit` and `Active` algorithms to approximate the model debugging query and compare their results with the ground truth. Each algorithm gets a budget of 500, meaning it can evaluate the model on at most 500 task plans.

* In the `Fit` approach, we randomly select 500 task plans, evaluate the model on them, and fit the function approximator on the results.
* In the `Active` approach, we start with a warm-up set of 200 task plans and then add more task plans to the training set step by step (10 per step), chosen based on the function approximator's predictions.
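
The difference between the two strategies can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the `tma` implementation: the function approximator is replaced by a small ridge regression, the embeddings and scores are randomly generated, and the active step simply picks the unevaluated tasks with the lowest predicted score. All function names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression with a bias term; returns a predict function."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

def fit_selection(n_tasks, budget, rng):
    """'Fit'-style selection: spend the whole budget on a random sample."""
    return rng.permutation(n_tasks)[:budget]

def active_selection(X, evaluate, warmup_budget, budget, k, rng):
    """'Active'-style selection: random warm-up, then repeatedly evaluate the k
    unevaluated tasks with the lowest predicted score and refit."""
    evaluated = [int(i) for i in rng.permutation(len(X))[:warmup_budget]]
    scores = {i: evaluate(i) for i in evaluated}
    total = warmup_budget + budget
    while len(evaluated) < total:
        predict = fit_ridge(X[evaluated], np.array([scores[i] for i in evaluated]))
        pred = predict(X)
        pred[evaluated] = np.inf  # never pick an already evaluated task again
        step = min(k, total - len(evaluated))
        for i in np.argsort(pred)[:step]:
            scores[int(i)] = evaluate(int(i))
            evaluated.append(int(i))
    return np.array(evaluated)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                     # made-up task embeddings
true_score = X @ np.array([0.5, -0.2, 0.1, 0.3])   # made-up ground-truth scores

fit_idx = fit_selection(len(X), 500, rng)
active_idx = active_selection(X, lambda i: true_score[i], 200, 300, 10, rng)
print(true_score[fit_idx].mean(), true_score[active_idx[200:]].mean())
```

In this toy setting the actively chosen tasks have a much lower average score than a random sample, which is the intended effect when the query targets low-performing task plans.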

In [7]:
# functions to compare the approximation results against the ground truth
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def debug_metric(gt, pred):
    # element 1 of each result holds the indices of the selected task plans
    gt_selection = gt[1]
    pred_selection = pred[1]

    # Determine the maximum index for array sizing
    max_index = max(max(gt_selection, default=0), max(pred_selection, default=0))

    # Binary labels: 1 if the task plan was selected, 0 otherwise
    gt_label = np.zeros(max_index + 1)
    pred_label = np.zeros(max_index + 1)

    for k in gt_selection:
        gt_label[k] = 1

    for k in pred_selection:
        pred_label[k] = 1

    # Report every metric as a percentage
    f1 = f1_score(gt_label, pred_label) * 100
    acc = accuracy_score(gt_label, pred_label) * 100
    precision = precision_score(gt_label, pred_label) * 100
    recall = recall_score(gt_label, pred_label) * 100

    return precision, recall, f1, acc

def print_metrics(precision, recall, f1, acc):
    print(f"{'Metric':<15} {'Value':<10}")
    print("-" * 25)
    print(f"{'Precision:':<15} {precision:.2f}%")
    print(f"{'Recall:':<15} {recall:.2f}%")
    print(f"{'F1 Score:':<15} {f1:.2f}%")
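
A small worked example may help when reading these metrics. The sketch below computes precision, recall, and F1 directly from two sets of selected indices, which for binary selections should agree with the scikit-learn based `debug_metric` above; `selection_metrics` and the toy indices are hypothetical.

```python
def selection_metrics(gt_indices, pred_indices):
    """Precision, recall and F1 (in percent) of a predicted selection of indices."""
    gt, pred = set(gt_indices), set(pred_indices)
    tp = len(gt & pred)  # selected by both the ground truth and the prediction
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision * 100, recall * 100, f1 * 100

# The prediction finds only half of the true selection, but makes no mistakes.
p, r, f1 = selection_metrics(gt_indices=[1, 2, 3, 4], pred_indices=[2, 3])
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 100.00 50.00 66.67
```

A prediction that flags only a subset of the true selection keeps precision at 100% while recall drops, which is the pattern seen in the approximation results in this notebook.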

### Use "Fit" approximation algorithm

In [8]:
# evaluate a random sample of `budget` task plans and fit the function approximator

budget = 500
np.random.seed(42)
perm = np.random.permutation(len(df))
x_indices = perm[:budget]

fit_results = task_evaluator.model_debug(
    x_indices=x_indices,
    greater_than=False,
    threshold = 0.3,
    model = model,
    fit_function_approximator=True
)

precision, recall, f1, acc = debug_metric(ground_truth_results, fit_results)
print_metrics(precision, recall, f1, acc)
display_results(fit_results)

Evaluating tasks: 100%|██████████| 500/500 [00:01<00:00, 148.39it/s]


Metric          Value     
-------------------------
Precision:      100.00%
Recall:         16.35%
F1 Score:       28.11%
Pattern                                                                           Times
---------------------------------------------------------------------------------------
task type: how many                                                               192
task type: how many, grid number: 3                                               138
task type: how many, attribute type: color                                        83
task type: how many, grid number: 3, attribute type: color                        59
task type: how many, grid number: 2                                               54
task type: how many, attribute type: material                                     42
task type: how many, count: 4                                                     40
task type: how many, attribute type: shape                                        31
task type: how many

### Use "Active" approximation algorithm

In [9]:
# 200 warm-up evaluations + 300 active evaluations = 500 evaluations in total
warmup_budget = 200
active_results = task_evaluator.active_model_debug(
    k=10,
    warmup_budget=warmup_budget,
    budget=budget-warmup_budget,
    model=model,
    greater_than=False,
    threshold = 0.3
)

precision, recall, f1, acc = debug_metric(ground_truth_results, active_results)
print_metrics(precision, recall, f1, acc)
display_results(active_results)

[WARMUP] Querying 200 tasks


Evaluating tasks: 100%|██████████| 200/200 [00:01<00:00, 148.39it/s]


[Query] Queried 300/300 tasks
Metric          Value     
-------------------------
Precision:      100.00%
Recall:         16.87%
F1 Score:       28.86%
Pattern                                                                                Times
--------------------------------------------------------------------------------------------
task type: how many                                                                    198
task type: how many, grid number: 3                                                    155
task type: how many, attribute type: color                                             86
task type: how many, attribute type: material                                          72
task type: how many, grid number: 3, attribute type: color                             66
task type: how many, grid number: 3, attribute type: material                          57
task type: how many, attribute type: color, attribute value: white                     49
task type: how many, grid num