Make trajectory cache opt-in per workload (cache_safe contract) #886

@rutayan-nv

Problem

CloudAIGymEnv.get_cached_trajectory_result() (in src/cloudai/configurator/cloudai_gym.py) returns the previously recorded (reward, observation) whenever a step's action matches an earlier entry's action. The returned tuple is then written to trajectory.csv and returned to the caller as if the workload had been re-executed.

This is correct only if the workload is deterministic given the action. For a stochastic workload, a cached (reward, observation) is a single sample from a distribution; reusing it instead of re-executing silently biases the recorded trajectory and any downstream consumer (DSE analysis, offline training corpora, leaderboards).
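To make the bias concrete, here is a minimal simulation sketch. It does not use any cloudai code: `run_workload` and `record_trajectory` are hypothetical stand-ins, and the Gaussian noise model is an assumption chosen only for illustration.

```python
import random
import statistics


def run_workload(action: int, rng: random.Random) -> float:
    """Hypothetical stochastic workload: reward is noisy around 10 * action."""
    return 10.0 * action + rng.gauss(0.0, 1.0)


def record_trajectory(actions, rng, use_cache):
    """Record one reward per step, optionally reusing the first sample per action."""
    cache = {}
    rewards = []
    for action in actions:
        if use_cache and action in cache:
            reward = cache[action]  # replayed sample, not a new execution
        else:
            reward = run_workload(action, rng)
            cache[action] = reward
        rewards.append(reward)
    return rewards


actions = [1] * 200  # the same action repeated many times
cached = record_trajectory(actions, random.Random(0), use_cache=True)
fresh = record_trajectory(actions, random.Random(0), use_cache=False)

# With caching, every recorded reward equals the first draw, so the spread
# of the underlying distribution vanishes from the trajectory.
print("distinct rewards (cached):", len(set(cached)))
print("std dev (cached):", statistics.pstdev(cached))
print("std dev (re-executed):", statistics.pstdev(fresh))
```

In the cached trajectory the recorded mean is whatever the first draw happened to be, and the variance is lost entirely, which is exactly what a downstream consumer cannot detect from trajectory.csv alone.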

There is currently no way for a workload to declare that caching is unsafe for it.

Proposed change

Add a workload-level cache_safe: bool = True declaration on TestDefinition. CloudAIGymEnv.get_cached_trajectory_result() returns None whenever self.test_run.test.test_definition.cache_safe is False, forcing re-execution.

  • Default remains True to preserve current behavior for the existing deterministic workloads.
  • Stochastic workloads override it to False in their TestDefinition subclass (or in TOML).
  • No change to consumers of trajectory.csv: cache safety becomes a property of the workload, not something the consumer has to reason about.
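The proposed gate could look roughly like the sketch below. The classes are simplified stand-ins built with dataclasses, not the real cloudai types: the actual `TestDefinition` and `CloudAIGymEnv` have more fields, the real lookup goes through `self.test_run.test.test_definition`, and the trajectory entry layout `(action, reward, observation)` is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TestDefinition:
    """Stand-in for cloudai's TestDefinition; only the proposed field is modeled."""
    name: str
    cache_safe: bool = True  # determinism contract: same action, same result


@dataclass
class StochasticTestDefinition(TestDefinition):
    """Hypothetical stochastic workload that opts out of the cache."""
    cache_safe: bool = False


@dataclass
class GymEnvSketch:
    """Stand-in for CloudAIGymEnv, holding the test definition directly."""
    test_definition: TestDefinition
    # Assumed entry layout: (action, reward, observation)
    trajectory: list = field(default_factory=list)

    def get_cached_trajectory_result(self, action: dict) -> Optional[tuple]:
        # Proposed gate: never serve cached results for cache-unsafe workloads.
        if not self.test_definition.cache_safe:
            return None
        for past_action, reward, observation in self.trajectory:
            if past_action == action:
                return reward, observation
        return None


history = [({"tp": 2}, 0.9, [1.0, 2.0])]
safe_env = GymEnvSketch(TestDefinition("deterministic"), list(history))
unsafe_env = GymEnvSketch(StochasticTestDefinition("stochastic"), list(history))

print(safe_env.get_cached_trajectory_result({"tp": 2}))
print(unsafe_env.get_cached_trajectory_result({"tp": 2}))
```

Placing the check at the top of the lookup keeps the opt-out independent of trajectory contents, so a cache-unsafe workload always falls through to re-execution.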

Acceptance criteria

  • TestDefinition exposes cache_safe: bool = True.
  • get_cached_trajectory_result() returns None when cache_safe is False, regardless of trajectory contents.
  • Unit test: a TestDefinition with cache_safe=False re-executes on a duplicate action; cache_safe=True (default) returns the cached entry as today.
  • Documentation note next to cache_safe stating the determinism contract.
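The unit-test criterion above could be sketched as follows. `FakeEnv` is a hypothetical stand-in that counts real executions; its `step()` method, its constructor argument, and the synthetic rewards are assumptions for illustration and do not reflect the actual `CloudAIGymEnv` interface.

```python
import unittest


class FakeEnv:
    """Hypothetical stand-in for CloudAIGymEnv that counts workload executions."""

    def __init__(self, cache_safe: bool = True):
        self.cache_safe = cache_safe
        self.trajectory = []  # assumed entry layout: (action, reward, observation)
        self.executions = 0

    def get_cached_trajectory_result(self, action):
        if not self.cache_safe:
            return None
        for past_action, reward, observation in self.trajectory:
            if past_action == action:
                return reward, observation
        return None

    def step(self, action):
        result = self.get_cached_trajectory_result(action)
        if result is None:
            # Simulate a real run; the reward differs on every execution.
            self.executions += 1
            result = (float(self.executions), [self.executions])
        self.trajectory.append((action, *result))
        return result


class CacheSafeContractTest(unittest.TestCase):
    def test_default_reuses_cached_entry(self):
        env = FakeEnv()  # cache_safe defaults to True
        first = env.step({"tp": 2})
        second = env.step({"tp": 2})
        self.assertEqual(first, second)
        self.assertEqual(env.executions, 1)

    def test_cache_unsafe_reexecutes_on_duplicate_action(self):
        env = FakeEnv(cache_safe=False)
        first = env.step({"tp": 2})
        second = env.step({"tp": 2})
        self.assertNotEqual(first, second)
        self.assertEqual(env.executions, 2)
```

Saved to a file, the tests can be run with `python -m unittest`. A real test would construct the actual environment with a `TestDefinition` whose `cache_safe` is False and assert on whether the runner was invoked.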

Out of scope

Changing any existing workload's cache_safe value. Each workload owner decides separately.
