Objective
Add a content preprocessor pipeline so LLM graders can evaluate agents that produce binary file outputs (e.g., .xlsx, .pdf, .docx). Currently, ContentFile blocks are defined in content.ts but silently ignored by the grading pipeline — the grader receives an empty candidate string.
Problem
- extractLastAssistantContent() in providers/types.ts:250 only extracts ContentText blocks; ContentFile is ignored
- LLM grader receives empty candidate when agent output is a file
- Built-in agent mode's read_file skips binary extensions
- Code grader's materializeContentForGrader() handles images but not files
Prerequisites — ContentFile Production
Before this feature is useful end-to-end, at least one provider must emit ContentFile blocks in agent output. Check whether any current providers (claude, codex, copilot) already produce ContentFile when agents write files, or whether provider-side changes are also needed. If provider work is required, it can be a parallel workstream — the preprocessor pipeline should be built to be testable with mock ContentFile blocks regardless.
Proposed Design
Add a preprocessor pipeline that converts ContentFile blocks to ContentText before graders see them.
Default behavior: read as text
Any ContentFile without a registered preprocessor is read as UTF-8 text. This covers csv, json, sql, md, yaml, html, xml, txt, and any other text-based format — no registration needed.
Preprocessors: only for formats that need transformation
Preprocessors exist only when raw text read is insufficient (binary formats, or when a text format needs restructuring before grading). Core ships no built-in preprocessors — only the registry and default text read. Converter scripts are provided as examples that users copy into their projects and customize.
Resolution order:
- User-defined preprocessor in YAML → takes priority (overrides default text read)
- Default fallback → readFile(path, 'utf-8')
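The resolution order above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the ContentFile and ContentText field names (mediaType, path, text), the toText helper, and the injected readText callback are all assumptions made so the example is self-contained.

```typescript
// Assumed block shapes; the real definitions live in content.ts and may differ.
type ContentFile = { type: "file"; mediaType: string; path: string };
type ContentText = { type: "text"; text: string };
type Preprocessor = (file: ContentFile) => ContentText;

// Hypothetical helper implementing the two-step resolution order.
function toText(
  file: ContentFile,
  registry: Map<string, Preprocessor>,
  readText: (path: string) => string, // e.g. (p) => readFileSync(p, "utf-8")
): ContentText {
  const preprocessor = registry.get(file.mediaType);
  // 1. A user-defined preprocessor takes priority.
  if (preprocessor) return preprocessor(file);
  // 2. Otherwise fall back to the default text read.
  return { type: "text", text: readText(file.path) };
}
```

Injecting readText keeps the sketch testable without touching the file system; a real implementation would likely call the file API directly.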
Core implementation
- Preprocessor registry (content-preprocessor.ts): Map<type, (ContentFile) => ContentText>, populated only by user-defined preprocessors
- Format alias map: Short aliases resolve to MIME types (xlsx → full MIME string). Unrecognized values are treated as literal MIME types. One type field.
- Pipeline integration: Run preprocessing on ContentFile blocks before candidate extraction
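As an illustration of the format alias map, here is a small sketch. The alias table contents and the resolveMediaType name are assumptions; only the behavior (short alias → MIME type, anything else passed through as a literal MIME type) comes from the description above.

```typescript
// Hypothetical alias table; the set of aliases actually shipped is not specified.
const FORMAT_ALIASES = new Map<string, string>([
  ["xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["pdf", "application/pdf"],
  ["html", "text/html"],
]);

// Resolve the single `type` field: known aliases map to their MIME type,
// unrecognized values are returned unchanged as literal MIME types.
function resolveMediaType(type: string): string {
  return FORMAT_ALIASES.get(type.trim().toLowerCase()) ?? type;
}
```

With a single type field, a config entry can use either the alias (xlsx) or the full MIME string, and both land on the same registry key.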
YAML config — scoping and syntax
Preprocessors are declared top-level in the eval file (shared by all evaluators). Per-evaluator override is possible but optional.
# Top-level: applies to all evaluators in this file
preprocessors:
  - type: xlsx
    command: ["bun", "run", "scripts/preprocessors/xlsx-to-csv.ts"]
  - type: html
    command: ["bun", "run", "scripts/preprocessors/html-to-md.ts"]

tests:
  - id: report-check
    assertions:
      - type: llm-grader        # inherits xlsx/html preprocessors
        prompt: grade-report.txt
      - type: rubrics           # also inherits
        criteria:
          - Has revenue column
  - id: special-case
    assertions:
      - type: llm-grader
        preprocessors:          # per-evaluator override
          - type: xlsx
            command: ["bun", "run", "scripts/preprocessors/xlsx-to-json.ts"]
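One way to read the scoping rules is that a per-evaluator entry replaces the top-level entry of the same type, while other top-level entries are still inherited. The sketch below assumes that per-type merge; the text above does not say whether an override replaces a single type or the whole list. The effectivePreprocessors name and PreprocessorConfig shape are hypothetical.

```typescript
type PreprocessorConfig = { type: string; command: string[] };

// Assumption: overrides are merged per `type`; unlisted types are inherited.
function effectivePreprocessors(
  topLevel: PreprocessorConfig[],
  override: PreprocessorConfig[] = [],
): PreprocessorConfig[] {
  const byType = new Map<string, PreprocessorConfig>();
  for (const p of topLevel) byType.set(p.type, p);
  for (const p of override) byType.set(p.type, p); // evaluator-level entry wins
  return Array.from(byType.values());
}
```

Under this reading, the special-case test would grade xlsx output via xlsx-to-json.ts while still using html-to-md.ts for HTML.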
Command path resolution
Preprocessor command paths follow the same resolution as code-grader: the last element of the command array is resolved relative to searchRoots (eval file directory + project root) via resolveFileReference(). This keeps preprocessor scripts at a project-level location, not mixed into eval folders:
my-project/
  scripts/preprocessors/
    xlsx-to-csv.ts
    html-to-md.ts
  evals/
    dataset.eval.yaml    # references scripts/preprocessors/xlsx-to-csv.ts
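The lookup described above might look roughly like this. It is not the project's resolveFileReference(), whose signature is not shown here; resolveCommandScript is a hypothetical stand-in. The first-match-wins order and leaving an unresolved path untouched are also assumptions.

```typescript
import { existsSync } from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: resolve the last command element against each search root.
function resolveCommandScript(
  command: string[],
  searchRoots: string[],
  exists: (p: string) => boolean = existsSync, // injectable for testing
): string[] {
  const script = command[command.length - 1];
  if (script === undefined || path.isAbsolute(script)) return command;
  for (const root of searchRoots) {
    const candidate = path.resolve(root, script);
    // First root that contains the script wins.
    if (exists(candidate)) return [...command.slice(0, -1), candidate];
  }
  return command; // assumption: leave the path as written if nothing matches
}
```

With search roots of the eval file directory followed by the project root, scripts/preprocessors/xlsx-to-csv.ts is not found under evals/ and resolves against the project root instead.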
Integration points — hybrid approach
Use both integration strategies:
- At extraction boundary (for LLM graders): Modify or wrap extractLastAssistantContent() to run preprocessors on ContentFile blocks → all LLM graders benefit automatically
- At materialization (for code graders): Extend materializeContentForGrader() to write ContentFile blocks to temp files and pass paths to code-grader scripts, since code graders may want raw file access, not just text
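To make the extraction-boundary half concrete, here is a hedged sketch of building the candidate string from mixed content blocks. The block shapes, the buildCandidate name, and the convert callback (returning null when a file cannot be converted) are assumptions; the real extractLastAssistantContent() may be structured differently.

```typescript
// Assumed block shapes; the real definitions live in content.ts and may differ.
type ContentText = { type: "text"; text: string };
type ContentFile = { type: "file"; mediaType: string; path: string };
type Content = ContentText | ContentFile;

// Hypothetical stand-in for the extraction boundary: text blocks pass through,
// file blocks are converted, and unconvertible files are recorded as notes.
function buildCandidate(
  blocks: Content[],
  convert: (file: ContentFile) => string | null,
): { candidate: string; notes: string[] } {
  const parts: string[] = [];
  const notes: string[] = [];
  for (const block of blocks) {
    if (block.type === "text") {
      parts.push(block.text);
    } else {
      const text = convert(block);
      if (text === null) notes.push(`File ${block.path} was not evaluable`);
      else parts.push(text);
    }
  }
  return { candidate: parts.join("\n"), notes };
}
```

Returning notes alongside the candidate gives the grader a place to surface skipped files as evidence instead of silently dropping them.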
Error handling
- Binary file with no preprocessor: attempt text read → if it fails (invalid UTF-8), log warning, skip the block, note in grader evidence that file content was not evaluable
- Preprocessor command fails: log stderr, skip the block, note in grader evidence
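One caveat for the first case: in Node.js, reading a file with the 'utf-8' encoding does not fail on invalid byte sequences; it substitutes U+FFFD replacement characters. Detecting a non-text file therefore needs a strict decode. A minimal sketch, assuming the file bytes are already in memory (the decodeStrictUtf8 name is hypothetical, and the behavior should be verified on the runtime in use):

```typescript
// Decode bytes as UTF-8, returning null instead of silently replacing bad bytes.
function decodeStrictUtf8(bytes: Uint8Array): string | null {
  try {
    // `fatal: true` makes the decoder throw a TypeError on invalid sequences.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null; // caller logs a warning and skips the block
  }
}
```

A caller could pass the result of a raw (no encoding) file read and treat null as "binary file with no preprocessor".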
Example converter scripts
Ship ready-to-copy converter scripts in examples/features/preprocessors/:
examples/features/preprocessors/
  scripts/preprocessors/
    xlsx-to-csv.ts       # xlsx → CSV (zero deps, uses built-in zip/XML parsing)
    html-to-md.ts        # HTML → markdown (zero deps, regex-based)
  evals/
    dataset.eval.yaml    # demonstrates top-level preprocessor config
  README.md              # usage guide
Users copy converter scripts into their project's scripts/preprocessors/ and customize as needed (e.g., pick specific xlsx sheets, filter HTML elements, change output format).
Design Latitude
- The preprocessor registry pattern is prescribed (aligns with Inspect AI Tier 1 approach from research)
- Hybrid integration (extraction + materialization) is recommended but implementer may simplify if warranted
- YAML config schema for custom preprocessors can be deferred to a follow-up if simpler to start with programmatic-only registration
- Implementation details (sync vs async, exact function signatures) are flexible
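Acceptance Signals
- ContentFile blocks in agent output are converted to text before reaching LLM graders
- preprocessors config is shared across all evaluators, with per-evaluator override
- Preprocessing is a no-op when ContentFile is absent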
Non-Goals
- Multimodal LLM grading (sending files natively to vision models) — separate concern
- Preprocessing for trace display or non-grader stages
- Streaming preprocessing
- Provider-side changes to emit ContentFile (separate issue if needed)
- Built-in preprocessors in core (converters are examples, not built-ins)
Industry Context
| Framework | Approach |
|---|---|
| Inspect AI | Structured Content union preserved end-to-end (gold standard) |
| Braintrust | Attachment → S3, AttachmentReference to scorers |
| promptfoo | output: string only, no binary support |
| deepeval | Slug injection into strings (anti-pattern) |
None of the frameworks above has a first-class preprocessor primitive; this is an industry gap. The converter registry pattern (media type → converter function) is common in adjacent domains (Apache Tika, LangChain document loaders, Unstructured.io).
Related
- Research: agentevals-research/research/findings/binary-output-preprocessing/README.md
- Multimodal content model research: agentevals-research/research/findings/multimodal-content-model/README.md