Tip: Check out our project page and paper for more details.
This repository contains the implementation of the LlamaTouch Evaluator. Prepare the following to run the evaluator:
- The ground-truth dataset
- Task execution traces generated by a mobile agent on AgentEnv.
- Code from this repo (see the installation guide below)

To run the code with our collected agent execution traces, follow `example/evaluate_collected_traces.md`.
```shell
conda create -n llamatouch python=3.9
conda activate llamatouch
git clone https://github.com/LlamaTouch/Evaluator.git && cd Evaluator
pip install -v -e .
```
The `DatasetHelper` class defined in `evaluator/task_trace.py` helps retrieve the dataset and annotated essential states. This class requires the path of the task metadata and the path of the LlamaTouch dataset for initialization. Check out https://github.com/LlamaTouch/LlamaTouch/tree/main/dataset to get the download link.

Example code for instantiating a `DatasetHelper` instance, with the two paths configured in `config.py`:

```python
from config import CONFIG
from evaluator.task_trace import DatasetHelper

helper = DatasetHelper(CONFIG.EPI_METADATA_PATH, CONFIG.GR_DATASET_PATH)
```
The following examples show how to use the dataset with the LlamaTouch Evaluator for UI automation task execution (e.g., how an agent ingests task descriptions from the dataset) and evaluation (i.e., how the evaluator extracts UI representations, actions, and essential states from the dataset).
Retrieve all episodes or episodes by category
```python
from typing import List

from config import CONFIG
from evaluator.task_trace import DatasetHelper

helper = DatasetHelper(CONFIG.EPI_METADATA_PATH, CONFIG.GR_DATASET_PATH)

# get all episodes
episodes: List[str] = helper.get_all_episodes()

# get episodes by category
# AITW categories: "general", "install", "googleapps", "webshopping"
# LlamaTouch category: "generated"
episodes_general: List[str] = helper.get_episodes_by_category("general")
```
Retrieve task description and UI representations for a specific episode
```python
from typing import List

from config import CONFIG
from evaluator.task_trace import (
    DatasetHelper,
    TaskTrace,
    get_all_screenshot_paths,
    get_all_vh_paths,
)

helper = DatasetHelper(CONFIG.EPI_METADATA_PATH, CONFIG.GR_DATASET_PATH)

episodes: List[str] = helper.get_all_episodes()
epi = episodes[0]

task_description: str = helper.get_task_description_by_episode(epi)
trace: TaskTrace = helper.load_groundtruth_trace_by_episode(epi)
screenshot_paths: List[str] = get_all_screenshot_paths(trace)
vhs: List[str] = get_all_vh_paths(trace)
```
`autoui.py`, `autodroid.py`, `appagent.py`, and `cocoagent.py` are four demonstrations for agent evaluation.
Overall, the evaluation process requires two instances:

- A `MobileAgent` instance representing the agent to be evaluated.
- An `Evaluator` instance representing the evaluation approach.
A `MobileAgent` class represents the agent to be evaluated. A mobile agent should inherit this class and implement its abstract methods for loading agent execution traces (generated by AgentEnv) for evaluation. For example, the `AutoUI` class inherits `MobileAgent` and implements the following two methods:
- `load_exec_trace_by_episode` takes a string-format episode as input and returns a `TaskTrace` object containing all information recorded while executing the task on AgentEnv. Agents should have their own implementation of this method, e.g., specifying the path of their execution traces.
- `load_predicted_action_by_episode` extracts the action sequence from an agent execution trace. This is used by the two baseline evaluation approaches that involve only action matching.
Code example
```python
from typing import List, Optional

from evaluator.agent import MobileAgent
# Action, Agent, and TaskTrace are assumed to be importable from evaluator.task_trace
from evaluator.task_trace import Action, Agent, TaskTrace


class AutoUI(MobileAgent):
    def __init__(self) -> None:
        super().__init__()
        self.agent = Agent.AUTOUI

    def load_exec_trace_by_episode(self, episode: str) -> Optional[TaskTrace]:
        pass

    def load_predicted_action_by_episode(self, episode: str) -> Optional[List[Action]]:
        pass
```
An `Evaluator` class represents a concrete implementation of one evaluation method.
Currently, LlamaTouch has three evaluator implementations:
- TestbedEvaluator: the essential state-powered evaluator.
- ExactMatchEvaluator: a baseline evaluation method that compares whether two action sequences are exactly matched.
- LCSMatchEvaluator: a baseline evaluation method that compares whether the action sequence of a task execution trace is a subsequence of the ground-truth action sequence.
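As a rough, self-contained sketch of the two action-matching baselines (the real evaluators operate on `Action` objects; the plain string actions and the `exact_match`/`lcs_match` helper names below are hypothetical):

```python
from typing import Sequence


def exact_match(predicted: Sequence[str], groundtruth: Sequence[str]) -> bool:
    # ExactMatch baseline: the two action sequences must be identical.
    return list(predicted) == list(groundtruth)


def lcs_match(predicted: Sequence[str], groundtruth: Sequence[str]) -> bool:
    # LCSMatch baseline (as described above): the executed action sequence
    # must appear, in order, as a subsequence of the ground-truth sequence.
    it = iter(groundtruth)
    return all(action in it for action in predicted)


print(exact_match(["tap", "type", "done"], ["tap", "type", "done"]))  # True
print(lcs_match(["tap", "done"], ["tap", "type", "done"]))            # True
print(lcs_match(["done", "tap"], ["tap", "type", "done"]))            # False
```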
Using an evaluator to evaluate agent execution results requires the following steps:

- Create a `MobileAgent` instance.
- Create a `BaseEvaluator` instance and pass the initialized agent instance as input.
- Call the `evaluator.run_evaluation()` method and `evaluator.report_stats()` to get evaluation results. Evaluation results will be dumped into the `dumped_stats` folder, in a file named `[evaluator_name]_[agent_name]_[time].csv`.
Code example for evaluating task completion rate
```python
from config import CONFIG
# TaskCategory is assumed to be importable from evaluator.task_trace
from evaluator.task_trace import TaskCategory
from evaluator.testbed_evaluator import TestbedEvaluator

# the AutoUI class is defined in the above section
agent = AutoUI()

te = TestbedEvaluator(
    agent=agent,
    # pass the metadata and dataset paths defined in config.py
    epi_metadata_path=CONFIG.EPI_METADATA_PATH,
    gr_dataset_path=CONFIG.GR_DATASET_PATH,
    # the options field is optional;
    # by default, all tasks in the metadata file will be evaluated
    options={
        # only tasks whose categories are in this list will be evaluated
        "categories": [
            TaskCategory.GENERAL,
            TaskCategory.INSTALL,
            TaskCategory.WEBSHOPPING,
            TaskCategory.GOOGLEAPPS,
            TaskCategory.GENERATED,
        ],
        # only evaluate selected tasks with the following episodes
        "episodes": [
            "epi1",
            "epi2",
            "...",
        ],
    },
)

te.run_evaluation()
te.report_stats()
```
The `evaluator.report_stats()` method has three optional parameters for easily evaluating the accuracy of different evaluation methods.

- `human_eval_path`: path of a CSV file that contains human validation results on agent execution traces. This is used for comparing results from evaluation methods against human validation. Default value: `None`, indicating that no accuracy-related metrics will be reported. The CSV file follows this format:

  ```
  episode,human
  47574332150552188748,0
  53818457957765811092,1
  88323620415993082184,0
  09598740612387056736,1
  ```

- `only_human_eval_positive`: evaluate task execution traces only when they are validated as completed in human validation. Default value: `False`.
- `suffix`: add a suffix to the dumped statistics file; the name of the dumped file will be `[evaluator_name]_[agent_name]_[time]_[suffix].csv`. Default value: `""`.
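For illustration, the agreement between evaluator verdicts and a human-validation CSV in the format above can be computed with standard Python. This is only a sketch: the `evaluator_results` dict below is a hypothetical stand-in for verdicts read from the dumped statistics file.

```python
import csv
import io

# Human-validation CSV in the format shown above (1 = task completed).
human_csv = """episode,human
47574332150552188748,0
53818457957765811092,1
"""

# Hypothetical evaluator verdicts keyed by episode (True = task completed).
evaluator_results = {
    "47574332150552188748": False,
    "53818457957765811092": True,
}

human = {
    row["episode"]: row["human"] == "1"
    for row in csv.DictReader(io.StringIO(human_csv))
}

# Accuracy: fraction of episodes where the evaluator agrees with human validation.
accuracy = sum(evaluator_results[epi] == ok for epi, ok in human.items()) / len(human)
print(accuracy)  # 1.0: both verdicts agree with human validation
```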
Usage of these parameters can be found in `autoui.py`, `autodroid.py`, `appagent.py`, and `cocoagent.py`.