Failure decomposition: categorize rejections to accelerate learning #5

@poofeth

Problem

When an experiment is REJECTED, the framework logs which gates failed but doesn't diagnose WHY. This makes it hard to learn from failures at scale. The same root cause (e.g., "execution cost kills thin edges") can repeat 10+ times across branches before the orchestrator converges on a fix.

In a real deployment, 62% of experiments were rejected. Many had the same root cause repeated across different branches, but each rejection was treated as an independent failure.

Proposal

After Step 5 (Collect Results), add automatic failure decomposition for REJECTED experiments.

Failure categories

INSUFFICIENT_DATA     - n_entries < threshold
                        → broaden filter, add data sources, or relax gate

WRONG_PARAMETER_RANGE - metric improves monotonically toward search boundary
                        → extend search space in that direction

WRONG_SIGNAL_TYPE     - metric doesn't respond to any parameter variation
                        → branch hypothesis is wrong, consider exhausting

REGIME_DEPENDENT      - positive in some folds, negative in others
                        → needs regime filter or conditional activation

EXECUTION_KILLED      - positive pre-cost metric, negative post-cost
                        → switch execution mode or find larger edges

CONCENTRATION_RISK    - edge exists but concentrated in few samples/families
                        → needs diversification or larger universe

GATE_BLOCKED          - would have promoted but for one specific gate
                        → flag for gate evolution review (see issue #4)

NOISE                 - metric within 1 sigma of champion, no clear direction
                        → inconclusive, may need more data
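To make the categories concrete, here is a minimal sketch of a rule-based classifier for the subset that can be decided from a single rejected result. Everything beyond the category names is an assumption for illustration: the `RejectedResult` fields, the `min_entries` default, the order in which the rules are checked, and the reading of GATE_BLOCKED as "exactly one gate failed". WRONG_PARAMETER_RANGE, WRONG_SIGNAL_TYPE, and CONCENTRATION_RISK need sweep-level or per-sample data, so the sketch returns `None` rather than guessing.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class FailureCategory(Enum):
    INSUFFICIENT_DATA = "INSUFFICIENT_DATA"
    WRONG_PARAMETER_RANGE = "WRONG_PARAMETER_RANGE"
    WRONG_SIGNAL_TYPE = "WRONG_SIGNAL_TYPE"
    REGIME_DEPENDENT = "REGIME_DEPENDENT"
    EXECUTION_KILLED = "EXECUTION_KILLED"
    CONCENTRATION_RISK = "CONCENTRATION_RISK"
    GATE_BLOCKED = "GATE_BLOCKED"
    NOISE = "NOISE"


@dataclass
class RejectedResult:
    # Hypothetical result record; the framework's real fields may differ.
    n_entries: int
    pre_cost_metric: float
    post_cost_metric: float
    fold_metrics: Sequence[float]
    champion_metric: float
    metric_sigma: float
    failed_gates: Sequence[str]


def categorize(result: RejectedResult, min_entries: int = 30) -> Optional[FailureCategory]:
    """Return a category for single-result cases, or None if undecided."""
    if result.n_entries < min_entries:
        return FailureCategory.INSUFFICIENT_DATA
    # Positive edge before costs, negative after costs.
    if result.pre_cost_metric > 0 and result.post_cost_metric < 0:
        return FailureCategory.EXECUTION_KILLED
    # Mixed sign across folds.
    if any(m > 0 for m in result.fold_metrics) and any(m < 0 for m in result.fold_metrics):
        return FailureCategory.REGIME_DEPENDENT
    # Assumption: "blocked by one specific gate" means exactly one failed gate.
    if len(result.failed_gates) == 1:
        return FailureCategory.GATE_BLOCKED
    # Within 1 sigma of the champion.
    if abs(result.post_cost_metric - result.champion_metric) <= result.metric_sigma:
        return FailureCategory.NOISE
    return None
```

The rule order matters when several conditions hold at once; the order used here is a guess and would need to be settled by the judge's actual design.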

Implementation

  1. The judge (or a post-judge analysis step) assigns a failure category to each REJECT
  2. Categories are logged in experiment_log.jsonl under failure_category
  3. Track category distributions per branch in branch_beliefs.json
  4. When a branch accumulates 3+ failures of the same category, the orchestrator proposes the corresponding fix in the handoff
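The four steps above could be wired together roughly as follows. This is a sketch under assumptions, not the framework's actual code: the function name `record_rejection`, the `experiment_id`, `branch`, and `verdict` log fields, and the per-branch layout inside branch_beliefs.json are all hypothetical. Only the file names, the `failure_category` key, and the 3+ threshold come from the proposal.

```python
import json
from pathlib import Path

REPEAT_THRESHOLD = 3  # "3+ failures of the same category"


def record_rejection(log_path: Path, beliefs_path: Path,
                     experiment_id: str, branch: str, category: str) -> bool:
    """Log one categorized REJECT and update the branch's category counts.

    Returns True when the branch has reached the threshold for this
    category, i.e. when the orchestrator should propose a fix.
    """
    # Step 2: append one JSON object per line to the experiment log.
    entry = {
        "experiment_id": experiment_id,  # hypothetical field
        "branch": branch,                # hypothetical field
        "verdict": "REJECTED",           # hypothetical field
        "failure_category": category,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

    # Step 3: update the per-branch category distribution.
    beliefs = {}
    if beliefs_path.exists():
        beliefs = json.loads(beliefs_path.read_text(encoding="utf-8"))
    dist = beliefs.setdefault(branch, {}).setdefault("failure_distribution", {})
    dist[category] = dist.get(category, 0) + 1
    beliefs_path.write_text(json.dumps(beliefs, indent=2), encoding="utf-8")

    # Step 4: signal the orchestrator once the threshold is reached.
    return dist[category] >= REPEAT_THRESHOLD
```

The read-modify-write on branch_beliefs.json is not safe under concurrent writers; if experiments finish in parallel, some form of locking or a single writer would be needed.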

Orchestrator behavior

In synthesis (Step 5b), after collecting rejections:

"Branch {X} has {N} consecutive {EXECUTION_KILLED} failures. The signal has positive pre-cost edge but execution costs destroy it. Recommended action: switch to maker mode or increase minimum edge threshold."
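Since the message template speaks of consecutive failures, one way to produce it is to count the trailing run of identical categories in a branch's rejection history. The sketch below assumes the history is available as a list of category strings, oldest first; `trailing_streak`, `synthesis_note`, and the `ACTIONS` mapping are hypothetical names, and the action texts are copied from the category list in this issue.

```python
from typing import List, Optional, Tuple

# Recommended actions, taken from the category list in this proposal.
ACTIONS = {
    "INSUFFICIENT_DATA": "broaden filter, add data sources, or relax gate",
    "WRONG_PARAMETER_RANGE": "extend search space in that direction",
    "WRONG_SIGNAL_TYPE": "consider exhausting the branch",
    "REGIME_DEPENDENT": "add a regime filter or conditional activation",
    "EXECUTION_KILLED": "switch to maker mode or increase minimum edge threshold",
    "CONCENTRATION_RISK": "diversify or use a larger universe",
    "GATE_BLOCKED": "flag for gate evolution review",
    "NOISE": "collect more data",
}


def trailing_streak(categories: List[str]) -> Tuple[Optional[str], int]:
    """Return the most recent category and how many times it repeats at the end."""
    if not categories:
        return None, 0
    last = categories[-1]
    count = 0
    for category in reversed(categories):
        if category != last:
            break
        count += 1
    return last, count


def synthesis_note(branch: str, categories: List[str], threshold: int = 3) -> Optional[str]:
    """Build the handoff message, or None if the streak is below the threshold."""
    category, count = trailing_streak(categories)
    if category is None or count < threshold:
        return None
    return (f"Branch {branch} has {count} consecutive {category} failures. "
            f"Recommended action: {ACTIONS[category]}.")
```

For example, `synthesis_note("X", ["NOISE", "EXECUTION_KILLED", "EXECUTION_KILLED", "EXECUTION_KILLED"])` yields a message for three consecutive EXECUTION_KILLED failures, while a history ending in a different category resets the streak.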

Branch-level tracking

{
  "branch_name": {
    "failure_distribution": {
      "EXECUTION_KILLED": 4,
      "REGIME_DEPENDENT": 2,
      "NOISE": 1
    },
    "dominant_failure": "EXECUTION_KILLED",
    "recommended_action": "switch to maker mode"
  }
}
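The `dominant_failure` and `recommended_action` fields can be derived from `failure_distribution` rather than stored independently, which keeps them from drifting apart. A minimal sketch, assuming a caller-supplied mapping from category to action (the `summarize_branch` name and the `actions` parameter are hypothetical):

```python
from typing import Dict, Optional


def summarize_branch(failure_distribution: Dict[str, int],
                     actions: Dict[str, str]) -> Dict[str, object]:
    """Derive the dominant category and its action from the raw counts."""
    dominant: Optional[str] = None
    if failure_distribution:
        # On a tie, max() keeps the first category in dict insertion order.
        dominant = max(failure_distribution, key=failure_distribution.get)
    return {
        "failure_distribution": dict(failure_distribution),
        "dominant_failure": dominant,
        "recommended_action": actions.get(dominant) if dominant else None,
    }


# Reproduces the example entry above.
example = summarize_branch(
    {"EXECUTION_KILLED": 4, "REGIME_DEPENDENT": 2, "NOISE": 1},
    {"EXECUTION_KILLED": "switch to maker mode"},
)
```

Whether ties should be broken by insertion order, by recency, or by a fixed category priority is an open design choice that the proposal does not specify.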

Why this matters

Failures contain as much information as successes. A lab that treats every REJECT as an opaque "didn't work" is throwing away signal. Categorizing failures turns rejections into directed next steps. This is the difference between random search and adaptive search.

Relationship to existing features

  • Extends the synthesis step (5b) with structured failure analysis
  • Feeds into the research scout: "Branch X is stuck with REGIME_DEPENDENT failures → scout for regime detection techniques"
  • Feeds into gate evolution (issue #4, "Gate evolution: detect and flag overly restrictive scoring gates"): "Branch X is stuck with GATE_BLOCKED failures → review the blocking gate"
  • Complements diagnostics: persistent failure categories are natural triggers for diagnostic experiments
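The hand-offs listed above amount to a small routing table from a stuck branch's category to a downstream consumer. The sketch below is illustrative only: the consumer labels (`research_scout`, `gate_evolution`, `diagnostics`) and the function name are assumptions, not identifiers from the framework.

```python
from typing import Tuple


def route_stuck_branch(branch: str, category: str) -> Tuple[str, str]:
    """Pick a downstream consumer and a request for a branch stuck on one category."""
    if category == "REGIME_DEPENDENT":
        return ("research_scout",
                f"Branch {branch}: scout for regime detection techniques")
    if category == "GATE_BLOCKED":
        return ("gate_evolution",
                f"Branch {branch}: review the blocking gate")
    # Any other persistent category triggers a diagnostic experiment.
    return ("diagnostics",
            f"Branch {branch}: run a diagnostic experiment for {category}")
```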
