14. Validator API

The validator API can be used to validate the quality of a plan for a given flow model, where quality is measured in terms of the soundness, validity, and optimality of the plan. Soundness refers to a sequence of actions that does not violate any properties of the flow, while validity refers to plans that are not only sound but also achieve the specified goal of the flow. An optimal plan, as the name suggests, is the best among the possible valid plans, given the cost structure of the flow.

In the final section on usages, we will motivate different use cases for the validator API.

Tokenizer

The input to the validator is a list of tokens. This matches the prettified plan output format produced by the NL2Flow package. If you want to test with different tokens, you can also implement your own tokenizer (and pretty printer).

For this section, we will use a toy domain with agents A, B, C, and D as the running example. The goal is to execute D. B and C are interchangeable and provide the required input of D. A produces an item of the same type that can be used to satisfy the requirements of either B or C. The flow definition is as follows.

# Imports follow the usual NL2Flow layout; exact paths may differ across versions.
from copy import deepcopy

from nl2flow.compile.operators import ClassicalOperator as Operator
from nl2flow.compile.schemas import (
    Constraint,
    GoalItem,
    GoalItems,
    Parameter,
    SignatureItem,
)

agent_a = Operator(name="agent_a")
agent_a.add_output(SignatureItem(parameters=Parameter(item_id="a_1", item_type="type_a")))

agent_b = Operator(name="agent_b")
agent_b.add_output(SignatureItem(parameters="y"))
agent_b.add_input(SignatureItem(parameters=Parameter(item_id="a", item_type="type_a")))
agent_b.add_input(
    SignatureItem(
        constraints=[
            Constraint(
                constraint="$a > 10",
            )
        ]
    )
)

agent_c = deepcopy(agent_b)
agent_c.name = "agent_c"

agent_d = Operator(name="agent_d")
agent_d.add_input(SignatureItem(parameters="y"))

goal = GoalItems(goals=GoalItem(goal_name="agent_d"))
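
These pieces still need to be registered with a flow object before they can be planned over or validated. A minimal sketch of that step, assuming the Flow class and planner interface documented in the NL2Flow README (the Kstar import path in particular is an assumption and may differ across versions):

from nl2flow.compile.flow import Flow
from nl2flow.plan.planners.kstar import Kstar  # planner choice and import path are assumptions

planner = Kstar()

flow = Flow(name="Toy Domain")
flow.add([agent_a, agent_b, agent_c, agent_d, goal])

planner_response = flow.plan_it(planner)
print(planner.pretty_print(planner_response))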

This produces the following plan (in prettified form). The starting point of the validator API is a series of tokens representing a plan like this.

a_1 = agent_a()
map(a_1, a)
confirm(a)
assert $a > 10
y = agent_b(a)
agent_d(y)

The tokenizer transforms these tokens into NL2Flow objects you are already familiar with: Operators, Constraints, Slots, Maps, Confirmations, and so on.

from nl2flow.debug.debug import BasicDebugger

tokens = [
    "a_1 = agent_a()",
    "map(a_1, a)",
    "confirm(a)",
    "assert $a > 10",
    "y = agent_b(a)",
    "agent_d(y)",
]


debugger = BasicDebugger(flow)
parsed_tokens = debugger.parse_tokens(tokens)
[
    Step(
        name="agent_a",
        parameters=[],
    ),
    Step(
        name=BasicOperations.MAPPER.value,
        parameters=["a_1", "a"],
    ),
    Step(
        name=BasicOperations.CONFIRM.value,
        parameters=["a"],
    ),
    Constraint(
        constraint="$a > 10",
        truth_value=True,
    ),
    Step(
        name="agent_b",
        parameters=["a"],
    ),
    Step(
        name="agent_d",
        parameters=["y"],
    ),
]
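
If you implement your own tokenizer, as mentioned at the top of this section, it ultimately just needs to produce this same list of objects for whatever plan format you prefer. A hypothetical sketch for a made-up "operator: arg1 arg2" format (the Step import path is an assumption and may differ across versions):

from nl2flow.plan.schemas import Step  # assumed import path


def my_tokenizer(line: str) -> Step:
    # Split a custom "operator: arg1 arg2" token into an operator name and its parameters.
    name, _, rest = line.partition(":")
    return Step(name=name.strip(), parameters=rest.split())


print(my_tokenizer("agent_b: a"))  # Step(name='agent_b', parameters=['a'])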

Plan Quality

Once you have your tokens and the flow object they represent a solution for, you can test them for various properties.

Soundness

If the tokens represent a plan that is executable in the flow, then the plan is called sound. In other words, this sequence of operations may or may not achieve the specified goal of the flow, but it represents allowed behavior in the flow.

To test this, we have taken the previous example and removed a couple of entries from the middle of the plan.

from nl2flow.debug.schemas import SolutionQuality

tokens = [
    "a_1 = agent_a()",
    "map(a_1, a)",
    "confirm(a)",
    "agent_d(y)",
]

report = debugger.debug(tokens, debug=SolutionQuality.SOUND)

This report contains several fields describing the result of this test.

Field name        Description
report_type       Denotes what type of test (soundness, validity, or optimality) was done.
determination     Denotes the result of the test, e.g. if a soundness test was done and the plan turned out to be sound, this field is True; otherwise it is False.
reference         Records the parsed input tokens.
planner_response  Records the response from the planner to the debugging task.
plan_diff_obj     Records the difference between the reference and the planner response.
plan_diff_str     Records a git-style string difference between the reference and the planner response.
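
A quick way to inspect these fields, continuing from the soundness check above:

print(report.report_type)               # which test was run (soundness, validity, or optimality)
print(report.determination)             # False for the incomplete token list above
print("\n".join(report.plan_diff_str))  # git-style diff, shown below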

In the running example, the set of tokens is incomplete and not executable: for example, without agent B or C, we cannot execute D in the final step of the token list. So the soundness check not only reports that the determination is False but also provides the following feedback on how to make the plan sound.

Plan Difference Object

[
    StepDiff(diff_type=None, step=Step(name='agent_a', parameters=[])), 
    StepDiff(diff_type=None, step=Step(name='map', parameters=['a_1', 'a'])), 
    StepDiff(diff_type=None, step=Step(name='confirm', parameters=['a'])), 
    StepDiff(diff_type=DiffAction.ADD, step=Constraint(constraint='$a > 10', truth_value=True)), 
    StepDiff(diff_type=DiffAction.ADD, step=Step(name='agent_b', parameters=['a'])), 
    StepDiff(diff_type=None, step=Step(name='agent_d', parameters=['y'])),
]

Plan Difference String

  a_1 = agent_a()
  map(a_1, a)
  confirm(a)
+ assert $a > 10
+ y = agent_b(a)
  agent_d(y)
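
The diff object can also be consumed programmatically, for example to list exactly which steps the planner had to add to make the reference sound. A minimal sketch, assuming DiffAction lives in nl2flow.debug.schemas alongside SolutionQuality (the import path is an assumption):

from nl2flow.debug.schemas import DiffAction  # assumed import path

missing_steps = [
    entry.step
    for entry in report.plan_diff_obj
    if entry.diff_type == DiffAction.ADD
]

for step in missing_steps:
    # In the running example, this prints the missing constraint check and the call to agent_b.
    print(step)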

Validity

Valid plans are sound plans that also achieve the specified goal of the flow. Let's now remove the final step from the tokens.

tokens = [
    "a_1 = agent_a()",
    "map(a_1, a)",
    "confirm(a)",
    "assert $a > 10",
    "y = agent_b(a)",
]

report = debugger.debug(tokens, debug=SolutionQuality.VALID)

As before, the validation test returns False, but it also provides a directive on how to fix the input. Note that the input is still a sound plan: even though the final step needed to achieve the goal is missing, the sequence of operations is allowed behavior in the flow.
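
A programmatic check mirrors the assertions used later on this page, and printing the diff string produces the directive shown below:

assert report.determination is False, "The goal of the flow is not achieved"
print("\n".join(report.plan_diff_str))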

  a_1 = agent_a()
  map(a_1, a)
  confirm(a)
  assert $a > 10
  y = agent_b(a)
+ agent_d(y)

Optimality

Optimal plans are the best among valid plans, per the cost model of the flow. Let's use the running example again, but this time try to directly slot-fill the input of the final step. This is both a sound and a valid plan, but we know that slot filling (asking the user) is not preferred when alternatives are available.

tokens = [
    "ask(y)",
    "agent_d(y)",
]

report = debugger.debug(tokens, debug=SolutionQuality.OPTIMAL)

- ask(y)
+ a_1 = agent_a()
+ map(a_1, a)
+ confirm(a)
+ assert $a > 10
+ y = agent_b(a)
  agent_d(y)
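
The same tokens can be checked against both criteria to confirm the claim above: the slot-filled plan passes the validity test but fails the optimality test.

valid_report = debugger.debug(tokens, debug=SolutionQuality.VALID)
assert valid_report.determination, "Slot-filling y is a valid way to reach the goal"

optimal_report = debugger.debug(tokens, debug=SolutionQuality.OPTIMAL)
assert optimal_report.determination is False, "But it is not the optimal way"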

Plan Equivalence

We mentioned previously how agents B and C are interchangeable. This is precisely to demonstrate that multiple plans can have the same properties. In the running example, let's replace B with C.

alternative_tokens = [
    "a_1 = agent_a()",
    "map(a_1, a)",
    "confirm(a)",
    "assert $a > 10",
    "y = agent_c(a)",
    "agent_d(y)",
]

report = debugger.debug(alternative_tokens, debug=SolutionQuality.OPTIMAL)
assert report.determination, "Reference plan is optimal"
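
The original plan through agent B is reported optimal as well, which is exactly what plan equivalence means here:

original_tokens = [
    "a_1 = agent_a()",
    "map(a_1, a)",
    "confirm(a)",
    "assert $a > 10",
    "y = agent_b(a)",
    "agent_d(y)",
]

report = debugger.debug(original_tokens, debug=SolutionQuality.OPTIMAL)
assert report.determination, "The agent_b plan is equally optimal"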

Garbage Tokens

Finally, in certain cases, you may not be able to guarantee the sanctity of the tokens. This is especially true if you are working with language models, either for testing or for scaffolding (more on this later). In such situations, the API will still produce the diffs while ignoring the tokens that do not correspond to a valid parse. It will also raise (non-blocking) warnings to log the errors encountered while parsing.

messed_up_tokens = [
    "a_1 = agent_aa()",  # unknown agent
    "adsfaerafea",  # random garbage
    "amap(a_1, a)",  # unknown operation
    "map(a_1, a)",
    "a = confirm(a)",  # extra output
    "assert $aa > 10",  # typo in string
    "y = agent_c(a, a)",  # wrong number of inputs
    "a, y = agent_c(a, a)",  # wrong number of outputs
    "agent_d(y)",
]

report = debugger.debug(messed_up_tokens, debug=SolutionQuality.SOUND)
assert report.determination is False, "Reference plan is not sound"

diff_string = "\n".join(report.plan_diff_str)
print(f"\n\n{diff_string}")
Error generating step predicate: 'agent_aa'
Error generating step predicate: 'amap'
Error generating constraint predicate: Unknown constant: aa
Error generating step predicate: Arity mismatch applying element has_done_agent_c/2 with arity 2 to arguments (a (type_a), a (type_a), try_level_1 (num-retries))
Error generating step predicate: Arity mismatch applying element has_done_agent_c/2 with arity 2 to arguments (a (type_a), a (type_a), try_level_2 (num-retries))
- a_1 = agent_aa()
?              -

+ a_1 = agent_a()
- adsfaerafea
- amap(a_1, a)
  map(a_1, a)
- a = confirm(a)
? ----

+ confirm(a)
- assert $aa > 10
?          -

+ assert $a > 10
- y = agent_c(a, a)
?           ^  ---

+ y = agent_b(a)
?           ^

- a, y = agent_c(a, a)
  agent_d(y)

Usages

The validator API can be deployed in multiple ways and, apart from the PDDL / plan generation API, is the most powerful part of the NL2Flow package.

Debugging with the developer

As mentioned in the stated goal of this project, we want to make deploying planners simpler. Part of that mission is to provide debugging pathways to the developers deploying the system. For example, if the system produced an unexpected behavior, the developer can query the system with a counterfactual.

The above example with the valid but suboptimal plan is a typical example of this usage pattern.

Tip

In the field of Explainable AI Planning [XAIP], such counterfactuals are called foils, and are used as a common vehicle for probing a system with questions like "Why not this other plan?".

Testing planning capacity of other models

You can use the validator to test the planning capacity of other generative models, such as large language models. More on this soon. 👀

Planning with scaffolds

As described above, the validator not only provides a yes / no answer on whether a plan is sound, valid, and optimal, but also tells you how to make it so if it is not. You can thus use the validator to fix incomplete or partial sequences. For example, you can use a large language model to generate an initial guess for a plan as scaffolding and then use this API to make it sound, valid, or optimal as desired. More on this soon. 👀
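
A minimal sketch of such a loop, where llm_propose_plan is a hypothetical stand-in for whatever language model call produces the initial guess, and the repair simply keeps the unchanged and added entries of the diff object:

from nl2flow.debug.schemas import DiffAction, SolutionQuality  # DiffAction path is an assumption

draft_tokens = llm_propose_plan(flow)  # hypothetical helper, not part of NL2Flow

report = debugger.debug(draft_tokens, debug=SolutionQuality.VALID)

if not report.determination:
    # Unchanged entries have diff_type None; added entries carry the steps the planner filled in.
    repaired_plan = [
        entry.step
        for entry in report.plan_diff_obj
        if entry.diff_type in (None, DiffAction.ADD)
    ]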

Warning

There is some evidence that scaffolding can reduce the search effort, but no evidence yet that it actually speeds things up end to end (counting the time to call the language model as well) in any real-world domain.

Note

The other motivation for using a language model to plan, apart from magically bypassing the curse of computational complexity, is that it represents an off-the-shelf world model and thus an understanding of more likely plans, i.e. relative preferences among operations that are latent in the domain and not explicitly specified.

Plan recognition

Plan recognition or plan completion is the task of filling in the blanks of a partial plan or a set of observations when the underlying domain theory is known. One of the seminal works on this topic cast it as a modified planning problem, with follow-up work dealing with situations where noisy observations must be discarded as well.

The validator API, in principle, uses a very similar modeling approach to produce the diff objects in the report. You can use the same API with tokens that represent a set of observations instead, and interpret the output as a completion or recognition of the underlying plan.
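
For instance, a partial observation sequence can be passed in as tokens, and the added entries in the resulting diff can be read as the recognized portions of the underlying plan:

observed_tokens = [
    "map(a_1, a)",
    "agent_d(y)",
]

report = debugger.debug(observed_tokens, debug=SolutionQuality.VALID)
# The DiffAction.ADD entries in report.plan_diff_obj are the steps that
# complete the observations into a valid plan.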

Execution Monitoring

Finally, the output of the tokenizer will look familiar if you have already read about execution monitoring in the NL2Flow API. This is because they share the same format -- both represent a plan. The critical difference is how they are interpreted by the API. During execution monitoring, the tokens are interpreted as steps that have already happened in the past, so the remaining flow comes only after and in response to them. During validation, this is no longer the case, and the API can interject the given plan with additions or deletions of operations, as we saw in the examples above.
