# GentPool: Eval Quickstart
### This notebook gives a quick overview of GentPool's eval methods.

In [1]:
from gentopia import AgentAssembler
from gentpool.bench.grader import GateGrader
from gentpool.bench.eval import EvalPipeline
import json

# Load API Keys
import dotenv
dotenv.load_dotenv("../.env")

True

## Single eval with agent, grader and task.

This section shows the basic usage of the GentPool/bench components.

Use *AgentAssembler* to assemble an agent, instantiate a *Grader* and provide grader inputs to get an evaluation.

In [2]:
# Assemble agent from config.
agent = AgentAssembler(file="../gentpool/pool/mathria/agent.yaml").get_agent()
# Instantiate a Grader
grader = GateGrader()
print(grader.args_schema.schema())

{'title': 'GateArgsSchema', 'type': 'object', 'properties': {'task': {'title': 'Task', 'type': 'string'}, 'ground_truth': {'title': 'Ground Truth', 'type': 'string'}, 'prediction': {'title': 'Prediction', 'type': 'string'}}, 'required': ['task', 'ground_truth', 'prediction']}
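Because the grader takes keyword arguments, a misspelled name is easy to miss. The following is a minimal sketch of a pre-flight check, not part of GentPool: `check_grader_kwargs` is a hypothetical helper, and `GATE_SCHEMA` is a hand-copied subset of the schema printed above so the snippet runs without GentPool installed.

```python
# Subset of the schema printed above, copied by hand (assumption: the live
# schema from grader.args_schema.schema() has the same two keys).
GATE_SCHEMA = {
    "properties": {
        "task": {"type": "string"},
        "ground_truth": {"type": "string"},
        "prediction": {"type": "string"},
    },
    "required": ["task", "ground_truth", "prediction"],
}


def check_grader_kwargs(schema: dict, **kwargs) -> dict:
    """Hypothetical helper: reject missing or unknown argument names."""
    required = set(schema.get("required", []))
    allowed = set(schema.get("properties", {}))
    given = set(kwargs)
    missing = sorted(required - given)
    unknown = sorted(given - allowed)
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    return kwargs
```

In the notebook, one could pass `grader.args_schema.schema()` as `schema` and unpack the returned dict into `grader.run(**checked)`.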


In [3]:
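# Load one math benchmark item. Named `sample` to avoid shadowing the stdlib `math` module.
with open("../benchmark/public/reasoning/math/math_40.json") as f:
    sample = json.load(f)
task = sample['problem']
ground_truth = sample['solution']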
prediction = agent.run(task).output

In [4]:
# Get grades (GateGrader outputs "passed" or "failed")
grader.run(task=task, ground_truth=ground_truth, prediction=prediction)

AgentOutput(output='failed', cost=0.025349999999999998, token_usage=844)
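When grading several tasks by hand, the individual gate results can be rolled up into a pass rate and a total grading cost. The sketch below assumes only the three fields visible in the output above (`output`, `cost`, `token_usage`); `GradeRecord` and `summarize_grades` are hypothetical names used for illustration, not GentPool classes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GradeRecord:
    """Stand-in mirroring the fields of the grader output shown above."""
    output: str        # "passed" or "failed"
    cost: float        # cost of the grading call
    token_usage: int   # tokens used by the grading call


def summarize_grades(grades: Iterable[GradeRecord]) -> dict:
    """Aggregate gate results into a pass rate plus grading cost and tokens."""
    grades = list(grades)
    if not grades:
        raise ValueError("no grades to summarize")
    passed = sum(1 for g in grades if g.output.strip().lower() == "passed")
    return {
        "n": len(grades),
        "pass_rate": passed / len(grades),
        "grading_cost": sum(g.cost for g in grades),
        "grading_tokens": sum(g.token_usage for g in grades),
    }
```

Any object exposing the same three attributes should work in place of `GradeRecord`.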

## Config and EvalPipeline

For a comprehensive eval over the GentPool benchmark, pass a config file to *EvalPipeline* and run it on an agent to receive an eval report. See GentPool/eval_config.yaml for an example.

**Warning**: Cost can be non-trivial.

In [5]:
# Named `pipeline` to avoid shadowing Python's built-in `eval`.
pipeline = EvalPipeline(eval_config="../config/eval_config.yaml")
pipeline.run_eval(agent)

> EVALUATING: knowledge/world_knowledge ...
>>> Running Eval 1/1 ...
> EVALUATING: knowledge/domain_specific_knowledge ...
>>> Running Eval 1/1 ...
> EVALUATING: knowledge/web_retrieval ...
>>> Running Eval 1/1 ...
> EVALUATING: reasoning/math ...
>>> Running Eval 1/1 ...
> EVALUATING: reasoning/coding ...
>>> Running Eval 1/1 ...
> EVALUATING: reasoning/planning ...
>>> Running Eval 1/1 ...
> EVALUATING: reasoning/commonsense ...
>>> Running Eval 1/1 ...
> EVALUATING: safety/integrity ...
>>> Running Eval 1/1 ...
> EVALUATING: safety/harmless ...
>>> Running Eval 1/1 ...
> EVALUATING: multilingual/translation ...
>>> Running Eval 1/1 ...
> EVALUATING: multilingual/understanding ...
>>> Running Eval 1/1 ...
> EVALUATING: robustness/consistency ...
> EVALUATING: robustness/resilience ...

### FINISHING Agent EVAL PIPELINE ### 
 (づ￣ ³￣)づ 
--------------Task Specific-------------- 
Score of knowledge/world_knowledge: 0.0 
Score of knowledge/domain_specific_knowledge: 0.0 
Score of knowled

EvalPipelineResult(eval_results={'knowledge/world_knowledge': EvalResult(score=0.0, fail_rate=0.0, avg_runtime=6.651751518249512, avg_cost=0.0018174999999999999, avg_token_usage=1128.0, eval_cost=0.013739999999999999), 'knowledge/domain_specific_knowledge': EvalResult(score=0.0, fail_rate=0.0, avg_runtime=5.33078145980835, avg_cost=0.001386, avg_token_usage=893.0, eval_cost=0.011459999999999998), 'knowledge/web_retrieval': EvalResult(score=0.0, fail_rate=0.0, avg_runtime=10.239863634109497, avg_cost=0.001128, avg_token_usage=687.0, eval_cost=0.00789), 'reasoning/math': EvalResult(score=0.0, fail_rate=0.0, avg_runtime=19.004570960998535, avg_cost=0.001885, avg_token_usage=1121.0, eval_cost=0.01923), 'reasoning/coding': EvalResult(score=0.0, fail_rate=0.0, avg_runtime=6.613209962844849, avg_cost=0.002029, avg_token_usage=1272.0, eval_cost=0.0), 'reasoning/planning': EvalResult(score=1.0, fail_rate=0.0, avg_runtime=8.17550253868103, avg_cost=0.0028815, avg_token_usage=1821.0, eval_cost=0.
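The result keys follow a `category/task` pattern, so per-task scores can be averaged per category. This is a rough sketch under the assumption that each result object exposes a `score` attribute, as the printed output suggests; `TaskResult` and `mean_score_by_category` are hypothetical names, not part of GentPool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Stand-in carrying only the field this sketch needs."""
    score: float


def mean_score_by_category(eval_results: dict) -> dict:
    """Average `score` over tasks sharing the prefix before the first '/'."""
    buckets = defaultdict(list)
    for name, result in eval_results.items():
        category = name.partition("/")[0]
        buckets[category].append(result.score)
    return {cat: sum(vals) / len(vals) for cat, vals in sorted(buckets.items())}
```

If the attribute names match, the `eval_results` mapping from the pipeline result could be passed in directly.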