ACP Claude: 3 GAIA instances consistently fail with UsagePolicyRefusal #501

@simonrosenberg

Description


Summary

When running the GAIA benchmark with agent_type=acp-claude, 3 specific instances consistently fail with UsagePolicyRefusal errors. The failures are deterministic and reproduce across multiple runs.

Affected Instances

| Instance ID | Task Type | Analysis |
|---|---|---|
| `2a649bb1-795f-4a01-b3be-9a01868dae73` | Scientific research query | False positive: asks about virus-testing chemicals (SPFMV/SPCSV) |
| `983bba7c-c092-455f-b6c9-7857003d48fc` | Academic cross-reference | False positive: asks about Hafnia alvei papers |
| `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reversed text puzzle | Likely triggered by obfuscation detection |

Impact

  • Failure rate: 3/500 = 0.6% of GAIA validation set
  • Benchmarks affected: GAIA only (Commit0 has 0 failures)
  • Models affected: claude-opus-4-6 via ACP

Error Details

UsagePolicyRefusal: Internal error: API Error: Claude Code is unable to respond
to this request, which appears to violate our Usage Policy
(https://www.anthropic.com/legal/aup). Try rephrasing the request or attempting
a different approach. If you are seeing this refusal repeatedly, try running
/model claude-sonnet-4-20250514 to switch models.

Characteristics:

  • Immediate refusal (~2 seconds)
  • Zero tokens consumed
  • Not retriable (3-4 retries all fail)
  • Pre-flight rejection before model processes input
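Because these refusals are deterministic, retrying the same request wastes retry budget. One way a harness could short-circuit the retries is to classify the error by its message text. This is a minimal sketch assuming the refusal surfaces as the string shown in Error Details above; the helper name and marker-matching approach are illustrative, not an existing API:

```python
# Substring taken from the UsagePolicyRefusal message shown above.
REFUSAL_MARKER = "appears to violate our Usage Policy"

def is_usage_policy_refusal(error_message: str) -> bool:
    """Return True for pre-flight policy refusals, which are not worth retrying."""
    return REFUSAL_MARKER in error_message
```

A harness could check this predicate before its normal retry loop and fail the instance immediately instead of burning 3-4 retries.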

Task Details

Instance 1: 2a649bb1-... (Virology Research)

What are the EC numbers of the two most commonly used chemicals for the virus
testing method in the paper about SPFMV and SPCSV in the Pearl Of Africa from
2016? Return the semicolon-separated numbers in the order of the alphabetized
chemicals.
  • SPFMV = Sweet Potato Feathery Mottle Virus
  • SPCSV = Sweet Potato Chlorotic Stunt Virus
  • "Pearl of Africa" = Uganda
  • EC numbers = Enzyme Commission classification (scientific identifiers)

Instance 2: 983bba7c-... (Microbiology Cross-Reference)

What animals that were mentioned in both Ilias Lagkouvardos's and Olga Tapia's
papers on the alvei species of the genus named for Copenhagen outside the
bibliographies were also present in the 2021 article cited on the alvei species'
Wikipedia page about a multicenter, randomized, double-blind study?
  • "Genus named for Copenhagen" = Hafnia (Latin for Copenhagen)
  • Hafnia alvei = bacterium studied in gut microbiome research

Instance 3: 2d83110e-... (Reversed Text Puzzle)

.rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI

Decoded: "If you understand this sentence, write the opposite of the word 'left' as the answer."
Expected answer: "right"

Runs Verified

| Run ID | eval_limit | UsagePolicyRefusal Count |
|---|---|---|
| 22935238175 | 500 | 3 (same instances) |
| 22928852361 | 500 | 3 (same instances) |

Proposed Solutions

Short-term Workarounds

  1. Instance exclusion list: Add these 3 instance IDs to a known-failing list for ACP Claude scoring adjustments

  2. Model switching fallback: Implement auto-fallback to claude-sonnet-4-20250514 when UsagePolicyRefusal is detected:

    if isinstance(error, UsagePolicyRefusalError):
        # Retry once with the fallback model named in the refusal message,
        # instead of re-sending to the same model, which fails deterministically
        response = prompt_with_model("claude-sonnet-4-20250514", task)
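For workaround 1, the exclusion list could be a module-level constant consulted at scoring time. The instance IDs below come from the affected-instances table; the helper function and agent-type check are an illustrative sketch, not an existing harness API:

```python
# Known ACP Claude instances that deterministically hit UsagePolicyRefusal.
KNOWN_REFUSAL_INSTANCES = {
    "2a649bb1-795f-4a01-b3be-9a01868dae73",
    "983bba7c-c092-455f-b6c9-7857003d48fc",
    "2d83110e-a098-4ebb-9987-066c06fa42d0",
}

def should_exclude(instance_id: str, agent_type: str) -> bool:
    """Exclude known-failing instances, but only for the acp-claude agent."""
    return agent_type == "acp-claude" and instance_id in KNOWN_REFUSAL_INSTANCES
```

Scoping the check to agent_type keeps other agents' scores on these instances unaffected.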

Long-term

  1. Report to Claude Code team: Instances 1 & 2 are clear false positives on legitimate scientific research queries
  2. Prompt preprocessing: For reversed text instances, consider decoding before sending to Claude Code
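For instance 3, the preprocessing step is trivial because the GAIA task reverses the entire string character by character, so a single slice recovers the plain-text question. The function name is illustrative; detecting *when* to apply it (e.g. a heuristic over dictionary-word ratios) is left open:

```python
def decode_reversed(text: str) -> str:
    """Undo a whole-string character reversal."""
    return text[::-1]

prompt = '.rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI'
decoded = decode_reversed(prompt)
# → 'If you understand this sentence, write the opposite of the word "left" as the answer.'
```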

Labels

  • bug
  • acp
  • gaia
