Skip to content

fix(cli): bundle MarkItDown in workspace installs + diagnose missing converter#469

Open
Jurij89 wants to merge 2 commits into
mainfrom
fix/markitdown-workspace-bundle
Open

fix(cli): bundle MarkItDown in workspace installs + diagnose missing converter#469
Jurij89 wants to merge 2 commits into
mainfrom
fix/markitdown-workspace-bundle

Conversation

@Jurij89
Copy link
Copy Markdown
Contributor

@Jurij89 Jurij89 commented May 12, 2026

Summary

  • Workspace/dev pnpm install now stages the MarkItDown binary by default so PDF/DOCX/PPTX/XLSX/CSV/HTML/EPUB/XML extraction works out of the box, matching the published-install experience. Previously the postinstall silently skipped the release-asset download for source checkouts and PDFs returned extraction.status='skipped'.
  • When the local package.json version is ahead of the latest published tag (the common pre-release case), the bundler falls back to the newest non-draft release and rewrites the meta sidecar with the local cliVersion (preserving the release tag in an effectiveTag field) so the daemon's converter check accepts the binary instead of rejecting it as a fingerprint mismatch.
  • A missing converter is now visible to operators and agents: /api/status carries extractionPipelines: string[] + warnings: [{code, message, remediation?}], the live /.well-known/skill.md adds an Unavailable extraction pipelines line, the static SKILL.md template points agents at the dkg_status tool, and the daemon startup warning tells operators exactly which command to run and that a restart is required.
  • Failure UX hardened: DKG_SKIP_MARKITDOWN_DOWNLOAD=1 opt-out, automatic silent skip in CI, per-OS install hint when python3-venv is missing from markitdown:build, and a release-time --verify-release-artifacts check in release.yml that fails the workflow if any matrix-built binary or sidecar is missing/mismatched before softprops/action-gh-release publishes.

Related

Files changed

File What
packages/cli/scripts/bundle-markitdown-binaries.mjs New bundleImplicitCurrentPlatform() workflow with workspace fallback to latest tag, fetchLatestReleaseTag() (uses /releases listing — /releases/latest returns 404 on prerelease-only repos), verifyReleaseArtifacts(), rewriteVenvError(), restart-hint logging, opt-out env vars. Removed the silent workspace-skip short-circuit.
packages/cli/src/daemon/routes/status.ts extractionPipelines and warnings[] on /api/status; unavailablePipelines passed to buildSkillMd for the live skill manifest.
packages/cli/src/daemon/manifest.ts buildSkillMd accepts unavailablePipelines, renders the diagnostic line, placeholder match updated in lockstep with the template wording.
packages/cli/src/daemon/lifecycle.ts Startup warning now includes pnpm --filter @origintrail-official/dkg run markitdown:bundle + restart-the-daemon remediation.
packages/cli/skills/dkg-node/SKILL.md §1 placeholder points agents at the dkg_status tool (matching the template's tool-first convention) instead of raw HTTP for the live pipeline list.
.github/workflows/release.yml New verification step (--verify-release-artifacts) runs before softprops/action-gh-release and fails the workflow if any matrix-built binary or sidecar is missing/mismatched.
.gitignore + packages/cli/bin/.gitkeep Defensive: keeps platform-specific MarkItDown binaries out of git while preserving the directory so the bundler can write into it.
packages/cli/test/markitdown-binaries.test.ts 14 new vitest cases covering workspace download, already-staged skip, version-ahead fallback, total-failure remediation, opt-out env vars, CI skip, venv error rewriter, and release-artifact verifier.

Test plan

  • pnpm exec vitest run test/markitdown-binaries.test.ts test/extraction-markitdown.test.ts test/document-processor-e2e.test.ts → 50 passed, 4 skipped (the skipIf'd e2e cases need a real converter; follow-up could remove the guards now that the workspace flow actually stages one).
  • pnpm exec tsc --noEmit in packages/cli/ → clean.
  • Manual smoke (Windows worktree): fresh pnpm install stages markitdown-win32-x64.exe via fallback to v10.0.0-rc.4; bin/markitdown-win32-x64.exe.meta.json reports cliVersion: "10.0.0-rc.6", effectiveTag: "v10.0.0-rc.4"; extraction-markitdown.test.ts no longer logs the "Ignoring bundled binary with incompatible metadata sidecar" warning that fires before this PR.
  • CI matrix run on Linux. Windows symlink-permission failures in slot-helpers, migration, rollback, blue-green-integration, auto-update, etc. on the dev machine are pre-existing and unrelated to these files.
  • Operator manual: with binary present, curl /api/status | jq '.extractionPipelines, .warnings' returns the full pipeline list and []. With binary deleted and daemon restarted, returns just text/markdown and a single markitdown_unavailable warning with remediation.
  • Operator manual: curl /.well-known/skill.md | grep -i 'extraction pipelines' shows the Available line when the binary is present, and BOTH an Available and an Unavailable line (listing the 9 MIME types) when it is missing.
  • PDF round-trip: import a PDF via POST /api/assertion/test-pdf/import-file → response carries extraction.status: "completed", pipelineUsed: "application/pdf", tripleCount > 0, and an mdIntermediateHash.

🤖 Generated with Claude Code

…converter

Resolves #467. Workspace/dev `pnpm install` now stages the MarkItDown
binary so PDF/DOCX/etc. extraction works out of the box, matching the
published-install experience.

Postinstall flow (packages/cli/scripts/bundle-markitdown-binaries.mjs):
- Workspace checkouts attempt the release-asset download by default
  (previously a silent skip). On HTTP 404 — common when the local
  package.json version is ahead of the latest published tag — fall
  back to the newest non-draft release and rewrite the meta sidecar
  to reflect the local cliVersion (preserving the actual release tag
  in an `effectiveTag` field) so the daemon's converter check accepts
  the binary instead of treating it as a stale fingerprint mismatch.
- Honor DKG_SKIP_MARKITDOWN_DOWNLOAD=1 opt-out and skip silently in
  CI environments to avoid burning CI quota on implicit downloads.
- Wrap the `python3 -m venv` step in `markitdown:build` so missing
  python3-venv / Python emits a per-OS install hint instead of an
  opaque `Command failed` error.
- New `--verify-release-artifacts <dir>` mode plus a release.yml step
  that fails the workflow when any matrix-built binary or its sidecar
  is missing/mismatched before `softprops/action-gh-release` publishes.

Runtime diagnostic surfaces (issue #467 acceptance criterion 5):
- /api/status now returns `extractionPipelines: string[]` and
  `warnings: [{code, message, remediation?}]`. When MarkItDown is
  missing the response carries a `markitdown_unavailable` warning
  with the remediation command. Surfaces via the existing `dkg_status`
  MCP tool so agents see the gap programmatically.
- /.well-known/skill.md gains an `Unavailable extraction pipelines`
  line when MarkItDown is absent, listing the MIME types and the
  remediation command.
- Static SKILL.md template points agents at `dkg_status` for the
  live pipeline list (templates copied to OpenClaw/Hermes home folders
  retain `(dynamic)` placeholders).
- Daemon startup warning now tells operators to run
  `pnpm --filter @origintrail-official/dkg run markitdown:bundle`
  and restart, mirroring the bundle script's own success log.

Tests: 14 new vitest cases in packages/cli/test/markitdown-binaries.test.ts
cover the implicit current-platform flow (workspace download, already-
staged skip, version-ahead fallback, total-failure remediation, opt-out
env var, CI skip), the venv error rewriter, and the new release-artifact
verifier.

Defensive: packages/cli/bin/ added to .gitignore with a .gitkeep so the
directory stays present for the bundler to write into.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
}

const result = await ensureCurrentPlatformBinary({
await bundleImplicitCurrentPlatform({
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: bundleImplicitCurrentPlatform() now turns download problems into { status: 'failed' } instead of throwing, but this caller drops the result. In --current-platform mode that means the script exits 0 even when staging failed, unless --best-effort happens to be set. Please check the returned status here and throw on failure/unexpected outcomes so explicit runs still fail fast.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc8ab93. main() now inspects bundleImplicitCurrentPlatform's return value and throws when status === 'failed' unless --best-effort is set, so explicit pnpm run markitdown:bundle (no --best-effort) exits 1. Verified locally with --release-repo this-org-does-not-exist/dkg-fake: remediation message printed, then the new wrapper error, exit code 1. Postinstall (which sets --best-effort) still exits 0 with the warning, unchanged.

message:
"MarkItDown binary not registered — PDF, DOCX, PPTX, XLSX, CSV, HTML, EPUB, and XML extraction unavailable. Imports of those types will return extraction.status='skipped' and the file will be stored as a blob.",
remediation:
"Run `pnpm --filter @origintrail-official/dkg run markitdown:bundle`, then restart the daemon.",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this remediation tells operators to run pnpm --filter @origintrail-official/dkg run markitdown:bundle, but packages/cli/package.json only defines markitdown:build today. Following the warning will hit a missing-script error and leave extraction unavailable. Either add a real markitdown:bundle script or change this and the matching log/skill text to a command that actually exists.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc8ab93. Added "markitdown:bundle": "node ./scripts/bundle-markitdown-binaries.mjs --current-platform --force" to packages/cli/package.json. pnpm --filter @origintrail-official/dkg run markitdown:bundle now resolves, downloads the binary (using the workspace fallback if the local version is ahead of the latest published tag), and exits non-zero on failure — verified locally.

}
});

it('workspace + 404 + injected latest URL stages the fallback binary', async () => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this test title/comment says the fallback binary is staged, but the assertions only verify the failure path (status === 'failed'). That leaves the new fallback-staged branch effectively untested and makes the coverage misleading. Either exercise the happy path here or rename the test/comments to match the behavior being asserted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc8ab93. Added an injectable buildReleaseUrl(version, repo) parameter to bundleImplicitCurrentPlatform so tests can redirect BOTH the primary and fallback URLs to a local mock server (previously the fallback path called releaseBaseUrl directly and hit github.com). Removed the weak-assertion variant. The new test workspace + version-tag 404 + injected URL builder stages the fallback binary with rewritten meta exercises the full happy path: 404 on version tag → fetchLatestTag returns the latest → fallback downloads → status === 'fallback-staged', binary present on disk, and (critical for issue #467) meta sidecar reports cliVersion: LOCAL with effectiveTag: LATEST_TAG so the daemon's converter check accepts it.

…n:bundle script, real fallback test

Three Codex review findings on PR #469:

1. Explicit `--current-platform` runs (without --best-effort) used to exit
   0 even when staging failed because main() dropped the result of
   bundleImplicitCurrentPlatform(). Now main() inspects the result and
   throws when status === 'failed' so the IIFE error handler surfaces a
   non-zero exit, while postinstall (which sets --best-effort) keeps its
   swallowing behavior. Verified by running with a bogus --release-repo:
   remediation message printed, exit 1.

2. Remediation messages told operators to run `markitdown:bundle` but
   packages/cli/package.json defined only `markitdown:build`. Added the
   missing script — `node ./scripts/bundle-markitdown-binaries.mjs
   --current-platform --force` — so `pnpm --filter @origintrail-official/dkg
   run markitdown:bundle` resolves and actually downloads the binary.

3. The earlier "workspace + 404 + injected latest URL stages the fallback
   binary" test only asserted the failure path (the local server could not
   redirect the fallback URL because release.mjs called github.com directly).
   Added an injectable `buildReleaseUrl` parameter on
   bundleImplicitCurrentPlatform, removed the weak-assertion variant, and
   the new test exercises the full happy path including the critical
   meta-rewrite invariant (meta.cliVersion = LOCAL version, effectiveTag =
   actual release tag — otherwise the daemon's converter check rejects the
   binary).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 6748 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MarkItDown converter is not available out of the box in workspace/dev node installs

1 participant