Tip: Check out our project page and paper for more details.
This repository contains the implementation of the LlamaTouch Evaluator. Prepare the following to run the evaluator:
- The ground-truth dataset
- Task execution traces generated by a mobile agent on AgentEnv.
- Code from this repo (see the installation guide below)

To run the code with our collected agent execution traces, follow `example/evaluate_collected_traces.md`.
```shell
conda create -n llamatouch python=3.9
conda activate llamatouch
git clone https://github.com/LlamaTouch/Evaluator.git && cd Evaluator
pip install -v -e .
```
The `DatasetHelper` class defined in `evaluator/task_trace.py` helps retrieve the dataset and annotated essential states. This class requires the path of the task metadata and the path of the LlamaTouch dataset for initialization. Check out https://github.com/LlamaTouch/LlamaTouch/tree/main/dataset to get the download link.

Example code for instantiating a `DatasetHelper` instance, with the two paths configured in `config.py`:

```python
from config import CONFIG
from evaluator.task_trace import DatasetHelper

helper = DatasetHelper(CONFIG.EPI_METADATA_PATH, CONFIG.GR_DATASET_PATH)
```
The following examples show how to use the dataset with the LlamaTouch Evaluator for UI automation task execution (e.g., how an agent ingests task descriptions from the dataset) and evaluation (i.e., how the evaluator extracts UI representations, actions, and essential states from the dataset).
Retrieve all episodes or episodes by category
```python
from typing import List

from config import CONFIG
from evaluator.task_trace import DatasetHelper

helper = DatasetHelper(CONFIG.EPI_METADATA_PATH, CONFIG.GR_DATASET_PATH)

# get all episodes
episodes: List[str] = helper.get_all_episodes()

# get episodes by category
# AITW categories: "general", "install", "googleapps", "webshopping"
# LlamaTouch category: "generated"
episodes_general: List[str] = helper.get_episodes_by_category("general")
```
Retrieve task description and UI representations for a specific episode
```python
from typing import List

from config import CONFIG
from evaluator.task_trace import (
    DatasetHelper,
    TaskTrace,
    get_all_screenshot_paths,
    get_all_vh_paths,
)

helper = DatasetHelper(CONFIG.EPI_METADATA_PATH, CONFIG.GR_DATASET_PATH)

episodes: List[str] = helper.get_all_episodes()
epi = episodes[0]

task_description: str = helper.get_task_description_by_episode(epi)
trace: TaskTrace = helper.load_groundtruth_trace_by_episode(epi)
screenshot_paths: List[str] = get_all_screenshot_paths(trace)
vhs: List[str] = get_all_vh_paths(trace)
```
`autoui.py`, `autodroid.py`, `appagent.py`, and `cocoagent.py` are four demonstrations for agent evaluation.
Overall, the evaluation process requires two instances:

- A `MobileAgent` instance representing the agent to be evaluated.
- An `Evaluator` instance representing the evaluation approach.
A `MobileAgent` class represents the agent to be evaluated. A mobile agent should inherit this class and implement its abstract methods for loading agent execution traces (generated by AgentEnv) for evaluation. For example, the `AutoUI` class inherits `MobileAgent` and implements the following two methods:
- `load_exec_trace_by_episode` takes a string-format episode as input and returns a `TaskTrace` object containing all information recorded while executing the task on AgentEnv. Agents should have their own implementation of this method, e.g., specifying the path of their execution traces.
- `load_predicted_action_by_episode` extracts the action sequence from an agent execution trace. This is used by the two baseline evaluation approaches that involve only action matching.
Code example
```python
from typing import List, Optional

from evaluator.agent import MobileAgent
# Action, Agent, and TaskTrace are assumed to be importable from evaluator.task_trace
from evaluator.task_trace import Action, Agent, TaskTrace


class AutoUI(MobileAgent):
    def __init__(self) -> None:
        super().__init__()
        self.agent = Agent.AUTOUI

    def load_exec_trace_by_episode(self, episode: str) -> Optional[TaskTrace]:
        pass

    def load_predicted_action_by_episode(self, episode: str) -> Optional[List[Action]]:
        pass
```
An `Evaluator` class represents a concrete implementation of one evaluation method.
Currently, LlamaTouch has three evaluator implementations:
- TestbedEvaluator: the essential state-powered evaluator.
- ExactMatchEvaluator: a baseline evaluation method that compares whether two action sequences are exactly matched.
- LCSMatchEvaluator: a baseline evaluation method that compares whether the action sequence of a task execution trace is a subsequence of the ground-truth action sequence.
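As a rough, self-contained sketch of the two action-matching baselines (the real evaluators operate on `Action` objects; the plain string actions and the `exact_match`/`lcs_match` helper names below are hypothetical):

```python
from typing import Sequence


def exact_match(predicted: Sequence[str], groundtruth: Sequence[str]) -> bool:
    # ExactMatch baseline: the two action sequences must be identical.
    return list(predicted) == list(groundtruth)


def lcs_match(predicted: Sequence[str], groundtruth: Sequence[str]) -> bool:
    # LCSMatch baseline (as described above): the executed action sequence
    # must appear, in order, as a subsequence of the ground-truth sequence.
    it = iter(groundtruth)
    return all(action in it for action in predicted)


print(exact_match(["tap", "type", "done"], ["tap", "type", "done"]))  # True
print(lcs_match(["tap", "done"], ["tap", "type", "done"]))            # True
print(lcs_match(["done", "tap"], ["tap", "type", "done"]))            # False
```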
Using an evaluator to evaluate agent execution results requires the following steps:

- Create a `MobileAgent` instance.
- Create a `BaseEvaluator` instance and pass the initialized agent instance as input.
- Call the `evaluator.run_evaluation()` method and `evaluator.report_stats()` to get evaluation results. Evaluation results will be dumped into the `dumped_stats` folder, in a file named `[evaluator_name]_[agent_name]_[time].csv`.
Code example for evaluating task completion rate
```python
from config import CONFIG
# TaskCategory is assumed to be importable from evaluator.task_trace
from evaluator.task_trace import TaskCategory
from evaluator.testbed_evaluator import TestbedEvaluator

# the AutoUI class is defined in the above section
agent = AutoUI()

te = TestbedEvaluator(
    agent=agent,
    # pass the metadata and dataset paths defined in config.py
    epi_metadata_path=CONFIG.EPI_METADATA_PATH,
    gr_dataset_path=CONFIG.GR_DATASET_PATH,
    # the options field is optional;
    # by default, all tasks in the metadata file will be evaluated
    options={
        # only tasks whose categories are in this list will be evaluated
        "categories": [
            TaskCategory.GENERAL,
            TaskCategory.INSTALL,
            TaskCategory.WEBSHOPPING,
            TaskCategory.GOOGLEAPPS,
            TaskCategory.GENERATED,
        ],
        # only evaluate selected tasks with the following episodes
        "episodes": [
            "epi1",
            "epi2",
            "...",
        ],
    },
)

te.run_evaluation()
te.report_stats()
```
The `evaluator.report_stats()` method has three optional parameters for easily evaluating the accuracy of different evaluation methods.

- `human_eval_path`: path of a CSV file that contains human validation results on agent execution traces. This is used for comparing results from evaluation methods against human validation. Default value: `None`, indicating that no accuracy-related metrics will be reported. The CSV file follows this format:

  ```
  episode,human
  47574332150552188748,0
  53818457957765811092,1
  88323620415993082184,0
  09598740612387056736,1
  ```

- `only_human_eval_positive`: evaluate task execution traces only when they are validated as completed in human validation. Default value: `False`.
- `suffix`: add a suffix to the dumped statistics file; the name of the dumped file will be `[evaluator_name]_[agent_name]_[time]_[suffix].csv`. Default value: `""`.
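For illustration, the agreement between evaluator verdicts and a human-validation CSV in the format above can be computed with standard Python. This is only a sketch: the `evaluator_results` dict below is a hypothetical stand-in for verdicts read from the dumped statistics file.

```python
import csv
import io

# Human-validation CSV in the format shown above (1 = task completed).
human_csv = """episode,human
47574332150552188748,0
53818457957765811092,1
"""

# Hypothetical evaluator verdicts keyed by episode (True = task completed).
evaluator_results = {
    "47574332150552188748": False,
    "53818457957765811092": True,
}

human = {
    row["episode"]: row["human"] == "1"
    for row in csv.DictReader(io.StringIO(human_csv))
}

# Accuracy: fraction of episodes where the evaluator agrees with human validation.
accuracy = sum(evaluator_results[epi] == ok for epi, ok in human.items()) / len(human)
print(accuracy)  # 1.0: both verdicts agree with human validation
```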
Usage of these parameters can be found in `autoui.py`, `autodroid.py`, `appagent.py`, and `cocoagent.py`.