Skip to content

feat(processor): magika content-type detection#983

Merged
2witstudios merged 8 commits into
masterfrom
pu/magika-processor
Apr 13, 2026
Merged

feat(processor): magika content-type detection#983
2witstudios merged 8 commits into
masterfrom
pu/magika-processor

Conversation

@2witstudios
Copy link
Copy Markdown
Owner

@2witstudios 2witstudios commented Apr 13, 2026

Summary

Replaces the spoofable browser-supplied Content-Type allowlist in apps/processor with Google Magika–verified detection on every upload, locks down the serve path so it cannot be coerced into emitting a script-active type, and migrates the processor base image off Alpine so @tensorflow/tfjs-node can actually load in production. This is the processor half of a two-worktree change; a follow-up magika-web-uploads worktree will land the matching changes in apps/web once this merges. No apps/web/** files are touched here.

Why

The current multer.fileFilter in apps/processor/src/api/upload.ts is a prefix allowlist (image/, application/pdf, text/, application/vnd) keyed on whatever the browser claimed. That's trivially bypassable: a .py file arrives as text/plain and gets treated as a generic blob; a renamed Mach-O/ELF/PE binary arrives as image/png and is happily accepted; an HTML payload masquerading as text/plain rides through to the serve path and is later echoed back with whatever the DB row says.

PageSpace is pivoting toward cloud-IDE use cases where source-code uploads are first-class. We need an authoritative content-type signal that's independent of the browser. Google's Magika gives us ~5 ms CPU classification across 200+ labels, including code languages, with a vendored ~3 MB tfjs model.

What landed (commit-by-commit)

feat(processor): add magika-based content-type detector — Tasks A + B

  • Pins magika@^1.0.0 in apps/processor/package.json.
  • Vendors the standard_v3_3 model (3.1 MB) under apps/processor/assets/magika/standard_v3_3/ so detection runs without network access in dev, CI, and prod.
  • New apps/processor/src/services/content-detector.ts:
    • Singleton MagikaNode instance lazily initialised via a memoised Promise<Magika | null> so concurrent first calls share one load.
    • detectContentType(filePath)DetectedContentType { label, mimeType, group, score, source: 'magika' | 'fallback' }.
    • 250 ms Promise.race timeout ceiling.
    • All error paths return a frozen FALLBACK_DETECTION (application/octet-stream) and log via processorLogger using tempPath only — never originalName (preserves the PII-scrubbing posture from dbc52496).
  • New apps/processor/src/services/__tests__/content-detector.test.ts (4 tests, all green): real-fixture classification (png/python/pdf), missing-file → fallback, sequential and concurrent-cold-start singleton reuse via vi.spyOn(MagikaNode, 'create').
  • Whitelists @tensorflow/tfjs-node in the root package.json#pnpm.onlyBuiltDependencies so the native tfjs binding builds on fresh pnpm install (pnpm 10 blocks build scripts by default). This is one of the two files touched outside apps/processor/** (the other is the lockfile) — without it, fresh installs leave Magika unable to load and the feature silently degrades.
  • Adds a magika/node paths mapping in apps/processor/tsconfig.json so the legacy moduleResolution: "node" setting can find Magika's subpath types.

feat(processor): enforce magika-verified types on upload — Task C

  • Drops the allowedTypes prefix check from multer.fileFilter; keeps the empty-filename guard.
  • Both /single and /multiple handlers now call detectContentType(tempPath) after the temp file is written and validated against TEMP_UPLOADS_DIR. If the verified label is in DENIED_LABELS = { pebin, elf, macho, dex, html, svg, xhtml }:
    • The temp file is fs.unlinked immediately (same warn-and-continue cleanup pattern used elsewhere).
    • /single returns HTTP 415 { error: 'Unsupported file type', detectedLabel }.
    • /multiple records the rejection in the per-file results[] and removes the path from tempFilePaths so the batch cleanup loop doesn't double-unlink.
  • Verified mimeType and the new detectedLabel field are threaded through queueProcessingJobs into IngestFileJobData (extended in apps/processor/src/types/index.ts). queue-manager.ts dispatch is unchanged — it still keys on mimeType.startsWith('image/') / needsTextExtraction(mimeType), but now receives a trustworthy value.
  • upload.test.ts: a hoisted mockDetectContentType defaults to image/jpeg so all 28 pre-existing tests pass unchanged. 4 new tests cover 415 rejection + temp cleanup for macho, verified text/x-python flowing into the ingest job, fallback application/octet-stream accepted, and a mixed-batch /multiple where one file is rejected and a sibling is accepted.

feat(processor): harden serve-path content-type handling — Task D

  • Targets only the /original route in apps/processor/src/api/serve.ts. The /preset branch is intentionally left alone — those are processor-generated variants whose Content-Type derives from a known suffix.
  • Content-Type is sourced strictly from fileRecord.mimeType (with pageRecord.mimeType as a per-page fallback). Whitespace-only stored values are normalised away. There is no caller-controlled fallback path.
  • A new forceAttachment = isDangerousMimeType(contentType) || !hasStoredMime flag forces Content-Disposition: attachment plus the strict default-src 'none'; style-src 'unsafe-inline'; img-src data:; sandbox; CSP whenever the stored mime is missing OR matches DANGEROUS_MIME_TYPES (reused from utils/security.ts: text/html, application/xhtml+xml, image/svg+xml, application/xml, text/xml).
  • X-Content-Type-Options: nosniff is set on every response (was already set; preserved).
  • 5 new tests in serve.test.ts: safe-mime inline, script-active force-download, missing-mime → octet-stream + attachment, whitespace-only mime treated as missing, query-param override ignored.

build(processor): switch base image to node:22 bookworm-slim

  • @tensorflow/tfjs-node has no Alpine prebuilds and Alpine isn't an officially supported tfjs platform — the binding requires glibc's ld-linux-x86-64.so.2 (tfjs#1425, tfjs#6556). As long as processor uploads run through Magika, glibc is a hard requirement, so the prod image had to migrate off Alpine.
  • Migrated to node:22.17.0-bookworm-slim in both the build and production stages.
  • Sharp 0.33+ ships self-contained glibc prebuilds that bundle libvips, so the entire cairo-dev jpeg-dev pango-dev giflib-dev pixman-dev pangomm-dev libjpeg-turbo-dev freetype-dev chain (plus the matching runtime libs in the prod stage) was deleted. Tesseract.js is pure JS + WASM with zero system deps.
  • Build stage now installs only ca-certificates curl python3 make g++ pkg-config (the last three are kept so node-pre-gyp can compile from source if a future native dep ships without a glibc prebuild). Production stage installs only ca-certificates.
  • Verified end-to-end with docker build --platform linux/amd64 (matches the ubuntu-latest CI runner used by .github/workflows/docker-images.yml) followed by an in-container smoke test:
    • require('sharp') → v0.33.5 loads
    • require('tesseract.js').recognize is a function
    • MagikaNode.create() loads the vendored model from /app/apps/processor/assets/magika/standard_v3_3/
    • libtensorflow announces AVX2 FMA at startup (proves the native tfjs_binding.node actually opened the underlying .so)
    • Magika correctly classifies a Python source as python and an HTML payload as html (which is in DENIED_LABELS, so the upload route would now reject it with HTTP 415)
  • Final amd64 image: ~587 MB, smaller than the previous Alpine prod image once you account for the deleted dev-libs chain.

Files touched

File Notes
apps/processor/Dockerfile Alpine → bookworm-slim, deleted sharp dev-libs
apps/processor/package.json + magika@^1.0.0
apps/processor/src/services/content-detector.ts new
apps/processor/src/services/__tests__/content-detector.test.ts new
apps/processor/assets/magika/standard_v3_3/{model.json,config.min.json,group1-shard1of1.bin} new (3.1 MB vendored model)
apps/processor/src/api/upload.ts denylist + detector wiring + extended queueProcessingJobs
apps/processor/src/types/index.ts + detectedLabel?: string on IngestFileJobData
apps/processor/src/api/serve.ts strict mime sourcing + forceAttachment
apps/processor/src/api/__tests__/upload.test.ts mock detector + 4 new tests
apps/processor/src/api/__tests__/serve.test.ts 5 new lockdown tests
apps/processor/tsconfig.json magika/node paths mapping
package.json (root) pnpm.onlyBuiltDependencies: ["@tensorflow/tfjs-node"]
pnpm-lock.yaml updated by pnpm install

Acceptance checklist

  • pnpm --filter processor test801 / 801 passing (all 36 test files)
  • pnpm --filter processor build — clean
  • pnpm typecheck (turbo, repo-wide) — clean across all 12 workspaces
  • pnpm lint (turbo, repo-wide) — clean
  • docker build --platform linux/amd64 -f apps/processor/Dockerfile . — succeeds
  • In-container smoke test: sharp + tesseract.js + tfjs-node native binding + Magika classification all working
  • No any types introduced
  • No originalName ever passed into processorLogger from the new detector code
  • No edits to apps/web/**, packages/**, apps/realtime/**, etc.
  • Detector singleton reuse, fallback path, denylist rejection, /single + /multiple temp cleanup on rejection, serve-path Content-Type sourcing, serve-path script-active force-download — all covered by tests

Known follow-up (out of scope here)

  • The magika-web-uploads worktree will wire the verified mimeType and detectedLabel from IngestFileJobData into pages.mimeType / files.mimeType on the apps/web upload path, and use Magika's code-language labels to route .py, .rs, .ts, etc. to PageType.CODE.

🤖 Generated with Claude Code

2witstudios and others added 3 commits April 13, 2026 09:46
Introduces a Google Magika wrapper at services/content-detector.ts that
loads a singleton MagikaNode instance from a vendored standard_v3_3 model
under apps/processor/assets/. detectContentType returns a typed
DetectedContentType { label, mimeType, group, score, source } shape, with
a 250ms timeout, file-handle-safe error handling, and a deterministic
'fallback' shape (application/octet-stream) that lets callers continue
without raising when Magika is unavailable. The detector logs only the
temp path - never originalName - to preserve the PII-scrubbing posture
landed in dbc5249.

The model is loaded from local file paths so a fresh install (or test
run with no network access) classifies fixtures deterministically. Pins
magika@^1.0.0; whitelists @tensorflow/tfjs-node in the root pnpm config
so the native tfjs binding builds on fresh installs (pnpm 10 blocks
build scripts by default). A magika/node path mapping is added to the
processor tsconfig so legacy node moduleResolution can find Magika's
subpath types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drops the multer prefix allowlist from upload.ts and replaces it with a
post-write Magika verification step in both /single and /multiple
handlers. After the temp file is written and its path validated against
TEMP_UPLOADS_DIR, detectContentType runs against the bytes on disk; any
file whose label lands in DENIED_LABELS (pebin, elf, macho, dex, html,
svg, xhtml) is rejected with HTTP 415 and its temp file unlinked
immediately. The verified mimeType and detectedLabel are then threaded
through queueProcessingJobs into IngestFileJobData so downstream
workers route on the canonical type rather than whatever the browser
claimed. queue-manager dispatch is unchanged - it now just receives a
trustworthy mimeType.

Why the post-write check: multer's fileFilter is synchronous and runs
before bytes are buffered, so it cannot inspect content. By rejecting
after disk write but before computeFileHash, denylisted files don't
incur hash work and never enter the dedup store.

In /multiple, rejections are recorded in the per-file results array
with the same { error, success: false } shape existing failures use,
so siblings continue to process. The /multiple temp-cleanup loop still
runs against survivors via tempFilePaths, with rejected entries
removed from that list so they're not double-unlinked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Locks down the /original response path so it cannot serve a script-active
type by accident. The Content-Type now derives strictly from
fileRecord.mimeType (with pageRecord.mimeType as a per-page fallback) and
ignores anything the caller might supply via query params or filename.
Whitespace-only stored values are normalised away before use.

When the stored mime is missing entirely OR matches DANGEROUS_MIME_TYPES
(text/html, application/xhtml+xml, image/svg+xml, application/xml,
text/xml - reused via isDangerousMimeType from utils/security), the
response falls back to application/octet-stream and forces
Content-Disposition: attachment plus the strict CSP sandbox header.
X-Content-Type-Options: nosniff is set on every response. The /preset
branch is intentionally untouched - those are processor-generated
variants whose Content-Type is derived from the preset suffix.

This pairs with the upload-side Magika verification: once the web-side
worktree wires verified types into the DB, the serve path will be
emitting Magika-sourced Content-Type values for new uploads. Legacy
rows continue to use whatever mimeType they were originally written
with, which is acceptable since the serve-path lockdown means a
spoofed legacy mimeType still cannot trigger script execution in a
client browser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
pagespace-master-plan Ready Ready Preview, Comment Apr 13, 2026 5:37pm

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 13, 2026

Warning

Rate limit exceeded

@2witstudios has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 46 minutes and 27 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 46 minutes and 27 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3fba530f-1d6c-4ab0-8a42-695e507c85a4

📥 Commits

Reviewing files that changed from the base of the PR and between 6793302 and 3c6674a.

📒 Files selected for processing (6)
  • apps/processor/Dockerfile
  • apps/processor/src/api/__tests__/upload.test.ts
  • apps/processor/src/api/upload.ts
  • apps/processor/src/services/__tests__/content-detector.test.ts
  • apps/processor/src/services/content-detector.ts
  • package.json
📝 Walkthrough

Walkthrough

The PR integrates Magika file-type detection into the processor service. It adds a content-detector service that classifies uploaded files using Magika, replaces multer's MIME-type allowlist with server-side detection, implements a denylist for dangerous content labels, and updates upload handlers to pass detection results to processing jobs.

Changes

Cohort / File(s) Summary
Magika Assets
apps/processor/assets/magika/standard_v3_3/config.min.json, apps/processor/assets/magika/standard_v3_3/model.json
Added Magika standard v3.3 configuration and TensorFlow.js model artifacts with buffer sizing, detection thresholds, label space targets, and model topology for file classification inference.
Dependencies & Config
apps/processor/package.json, apps/processor/tsconfig.json, package.json
Added magika@^1.0.0 runtime dependency, configured TypeScript path alias for magika/node, and enabled pnpm onlyBuiltDependencies for TensorFlow.js node bindings.
Content Detection Service
apps/processor/src/services/content-detector.ts, apps/processor/src/services/__tests__/content-detector.test.ts
New content-detection service using memoized Magika singleton with async file classification, timeout handling, fallback on errors, and comprehensive test coverage validating detection, error behavior, and lifecycle reuse.
Upload Handler
apps/processor/src/api/upload.ts, apps/processor/src/api/__tests__/upload.test.ts
Replaced multer MIME-type allowlist with server-side content detection; added denylist for detected labels; updated job enqueueing to pass detectedLabel and verified mimeType; added tests for upload rejection, temp file cleanup, and multi-file handling with detection results.
Serve Handler
apps/processor/src/api/serve.ts, apps/processor/src/api/__tests__/serve.test.ts
Enhanced GET /files/:contentHash/original to normalize stored MIME, fall back to application/octet-stream for missing/empty MIME, and force attachment disposition when stored MIME is absent; added comprehensive security tests for content-type lockdown.
Type Definitions
apps/processor/src/types/index.ts
Added optional detectedLabel?: string field to IngestFileJobData interface for tracking file-detection results in job payloads.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Upload Handler
    participant File System
    participant Magika
    participant Denylist Check
    participant Queue Manager
    participant Serve Handler

    Client->>Upload Handler: POST /upload (multipart file)
    Upload Handler->>File System: Write temp file
    Upload Handler->>Magika: detectContentType(tempPath)
    Magika->>File System: Read file bytes
    Magika-->>Upload Handler: Detection result {label, mimeType, score}
    Upload Handler->>Denylist Check: Is label in DENIED_LABELS?
    alt Dangerous Label
        Denylist Check-->>Upload Handler: YES (denied)
        Upload Handler->>File System: Delete temp file
        Upload Handler-->>Client: HTTP 415 + detectedLabel
    else Safe Label
        Denylist Check-->>Upload Handler: NO (allowed)
        Upload Handler->>Queue Manager: queueProcessingJobs(verifiedMimeType, detectedLabel)
        Queue Manager->>Queue Manager: Enqueue ingest-file job
        Queue Manager-->>Upload Handler: Job IDs
        Upload Handler-->>Client: HTTP 200 + Job IDs
    end
    
    Note over Serve Handler: Later retrieval
    Client->>Serve Handler: GET /files/:contentHash/original
    alt Stored MIME Available
        Serve Handler-->>Client: Content-Type from stored MIME
    else Stored MIME Missing
        Serve Handler-->>Client: Content-Type: application/octet-stream
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

  • PR #506 — Both PRs modify IngestFileJobData interface and upload job payload signatures in the processor service, indicating coordinated changes to job data handling.
  • PR #269 — Both PRs modify serve.ts content-type/disposition logic and upload.ts server-side upload validation, suggesting related refinements to file handling and security.

Poem

🐰 With Magika's keen sight, we sniff out each file's true name,
No mischievous bytes escape our detection game!
Dendrites and thresholds guide our nose to the truth—
Safe files march forth, while danger stays in the booth! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(processor): magika content-type detection' accurately and concisely describes the main change—integrating Magika for server-side content-type detection in the processor app.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pu/magika-processor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6793302f96

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/processor/src/services/content-detector.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
apps/processor/src/api/__tests__/upload.test.ts (1)

959-987: Consider adding test for HTML/SVG/XHTML denylisted labels.

The PR objectives mention {html, svg, xhtml} as denylisted. The current tests cover macho and elf but don't explicitly test the script-active types. Consider adding a test case for html or svg labels to ensure these are also rejected.

Example test case for HTML label
it('rejects single-upload with HTTP 415 when magika detects html', async () => {
  mockDetectContentType.mockResolvedValue({
    label: 'html',
    mimeType: 'text/html',
    group: 'text',
    score: 0.95,
    source: 'magika',
  });

  const app = express();
  // ... setup similar to macho test ...
  
  const response = await request(app)
    .post('/upload/single')
    .send({ driveId: 'drive-1', pageId: 'page-1' });

  expect(response.status).toBe(415);
  expect(response.body.detectedLabel).toBe('html');
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/processor/src/api/__tests__/upload.test.ts` around lines 959 - 987, Add
a new test in apps/processor/src/api/__tests__/upload.test.ts that mirrors the
existing "macho" test but sets mockDetectContentType.mockResolvedValue to return
label 'html' (or 'svg'/'xhtml') with an appropriate mimeType (e.g.,
'text/html'), then boot the same express setup that attaches req.file
(createMockFile with TEMP_PATH) and req.auth, mount uploadRouter and POST to
'/upload/single'; assert response.status is 415, response.body.detectedLabel
equals 'html' (or chosen label), and that mockFsUnlink was called with TEMP_PATH
while mockAddJob and mockSaveOriginalFromFile were not called to ensure
denylisted script-type labels are rejected.
apps/processor/src/services/content-detector.ts (2)

16-16: Consider if 250ms timeout is sufficient for production environments.

The detection timeout of 250ms may be aggressive, especially on first invocation when the model is being loaded (though model loading happens in getInstance() separately). For subsequent calls, 250ms should be adequate for inference, but resource-constrained environments or very large files might occasionally time out.

The current fallback behavior (returning application/octet-stream) is safe, but frequent timeouts could lead to legitimate files being served with generic MIME types. Consider making this configurable via environment variable if monitoring shows timeout issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/processor/src/services/content-detector.ts` at line 16,
DETECTION_TIMEOUT_MS is hardcoded to 250ms which may be too short for some
environments; make it configurable by reading an environment variable (e.g.,
CONTENT_DETECTION_TIMEOUT_MS) in the file and fall back to 250 if not set or
invalid, update any uses of DETECTION_TIMEOUT_MS in content-detector.ts
(including where getInstance() is invoked and detection calls are timed out) to
use the new configurable value, and ensure parsing uses a safe parseInt with NaN
handling so behavior remains unchanged when the env var is absent or malformed.

32-50: Singleton caches initialization failures permanently.

If Magika.create() fails (e.g., model files missing or corrupted), the null result is cached in instancePromise forever. All subsequent detectContentType calls will return FALLBACK_DETECTION until the service restarts.

This is reasonable behavior to prevent retry storms, but be aware that transient failures (e.g., temporary disk I/O issues) won't auto-recover. Consider adding health-check or metrics to detect prolonged fallback states in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/processor/src/services/content-detector.ts` around lines 32 - 50, The
getInstance() singleton currently caches a null result when Magika.create()
fails, causing permanent fallback; change the logic in getInstance() so that if
Magika.create() throws you do not leave instancePromise resolved to null—in the
catch block clear or reset instancePromise (e.g., set it to undefined/null) so
subsequent calls will retry creation, and optionally record a metric or
increment a failure counter to surface prolonged fallback states; specifically
update the catch handling around Magika.create() inside getInstance() (and any
use in detectContentType) to avoid permanently caching failures while still
preventing tight retry storms.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/processor/src/api/upload.ts`:
- Around line 15-23: The DENIED_LABELS ReadonlySet currently omits the
JavaScript label; update the DENIED_LABELS constant by adding the string
'javascript' to the set (alongside
'pebin','elf','macho','dex','html','svg','xhtml') so Magika-detected JavaScript
files are denied and cannot be served inline.

In `@package.json`:
- Around line 70-72: The onlyBuiltDependencies array currently only lists
"@tensorflow/tfjs-node" so pnpm blocks sharp's install script; update the
onlyBuiltDependencies entry to include "sharp" alongside "@tensorflow/tfjs-node"
so sharp's install lifecycle can run and native bindings are built/downloaded at
install time.

---

Nitpick comments:
In `@apps/processor/src/api/__tests__/upload.test.ts`:
- Around line 959-987: Add a new test in
apps/processor/src/api/__tests__/upload.test.ts that mirrors the existing
"macho" test but sets mockDetectContentType.mockResolvedValue to return label
'html' (or 'svg'/'xhtml') with an appropriate mimeType (e.g., 'text/html'), then
boot the same express setup that attaches req.file (createMockFile with
TEMP_PATH) and req.auth, mount uploadRouter and POST to '/upload/single'; assert
response.status is 415, response.body.detectedLabel equals 'html' (or chosen
label), and that mockFsUnlink was called with TEMP_PATH while mockAddJob and
mockSaveOriginalFromFile were not called to ensure denylisted script-type labels
are rejected.

In `@apps/processor/src/services/content-detector.ts`:
- Line 16: DETECTION_TIMEOUT_MS is hardcoded to 250ms which may be too short for
some environments; make it configurable by reading an environment variable
(e.g., CONTENT_DETECTION_TIMEOUT_MS) in the file and fall back to 250 if not set
or invalid, update any uses of DETECTION_TIMEOUT_MS in content-detector.ts
(including where getInstance() is invoked and detection calls are timed out) to
use the new configurable value, and ensure parsing uses a safe parseInt with NaN
handling so behavior remains unchanged when the env var is absent or malformed.
- Around line 32-50: The getInstance() singleton currently caches a null result
when Magika.create() fails, causing permanent fallback; change the logic in
getInstance() so that if Magika.create() throws you do not leave instancePromise
resolved to null—in the catch block clear or reset instancePromise (e.g., set it
to undefined/null) so subsequent calls will retry creation, and optionally
record a metric or increment a failure counter to surface prolonged fallback
states; specifically update the catch handling around Magika.create() inside
getInstance() (and any use in detectContentType) to avoid permanently caching
failures while still preventing tight retry storms.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ef0ba2c-5e0a-4a50-9494-fe3cc0741782

📥 Commits

Reviewing files that changed from the base of the PR and between b3e4a07 and 6793302.

⛔ Files ignored due to path filters (2)
  • apps/processor/assets/magika/standard_v3_3/group1-shard1of1.bin is excluded by !**/*.bin
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (12)
  • apps/processor/assets/magika/standard_v3_3/config.min.json
  • apps/processor/assets/magika/standard_v3_3/model.json
  • apps/processor/package.json
  • apps/processor/src/api/__tests__/serve.test.ts
  • apps/processor/src/api/__tests__/upload.test.ts
  • apps/processor/src/api/serve.ts
  • apps/processor/src/api/upload.ts
  • apps/processor/src/services/__tests__/content-detector.test.ts
  • apps/processor/src/services/content-detector.ts
  • apps/processor/src/types/index.ts
  • apps/processor/tsconfig.json
  • package.json

Comment thread apps/processor/src/api/upload.ts
Comment thread package.json
@tensorflow/tfjs-node has no Alpine prebuilds and Alpine isn't an officially
supported tfjs platform - the binding requires glibc's
ld-linux-x86-64.so.2, which Alpine deliberately doesn't ship. As long as
processor uploads run through Magika, glibc is a hard requirement.

Migrating to node:22.17.0-bookworm-slim:
- tfjs-node loads cleanly; libtensorflow announces AVX2/FMA at startup,
  proving the native binding actually opened the .so
- sharp 0.33+ ships self-contained glibc prebuilds that bundle libvips, so
  the entire cairo/jpeg/pango/giflib/pixman/pangomm/libjpeg-turbo/freetype
  dev+runtime apk chain disappears from BOTH stages of the multi-stage
  build
- tesseract.js is pure JS+WASM, zero system deps
- final amd64 image is ~587MB (down from the previous Alpine prod image
  which carried the full sharp dev-libs chain)

Build stage now installs only ca-certificates, curl, python3, make, g++,
pkg-config (the last three are kept so node-pre-gyp can compile from
source if a future native dep ships without a glibc prebuild). Production
stage installs only ca-certificates.

Verified end-to-end with `docker build --platform linux/amd64` (matches
the CI runner) followed by an in-container smoke test: sharp v0.33.5
loads, tesseract.recognize is a function, and Magika correctly
classifies a Python source file as 'python' and an HTML payload as
'html' (which is in DENIED_LABELS, so the upload route would now
reject it with HTTP 415).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add 'javascript' to DENIED_LABELS in upload.ts (CodeRabbit + Codex):
  Magika emits a 'javascript' label and JS files served inline are a
  stored XSS vector even though the serve-path lockdown forces
  attachment for the static DANGEROUS_MIME_TYPES set. Belt-and-braces.

- content-detector singleton no longer poisons itself on init failure
  (CodeRabbit + Codex): if Magika.create() throws, instancePromise is
  cleared and the next attempt waits 60s before retrying so a hot
  upload loop can't trigger thousands of init attempts per second
  while still recovering from transient model-load failures without
  a worker restart. The test reset helper also clears the backoff
  timestamp so suites don't have to wait wall clock.

- Add 'sharp' to root pnpm.onlyBuiltDependencies (CodeRabbit):
  empirically sharp 0.33.5 currently loads via the @img/sharp-* arch
  prebuild optionalDependencies without its own install script
  running, but adding it is defense-in-depth against a future sharp
  release that does need its install lifecycle.

- New tests:
  * upload.test.ts adds a parameterised denylist case covering the
    four script-active labels (html, svg, xhtml, javascript)
  * content-detector.test.ts asserts that a forced Magika.create()
    failure returns the fallback shape AND that the cache can recover
    via the reset helper, exercising the new init-retry path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@2witstudios
Copy link
Copy Markdown
Owner Author

Review feedback addressed in e941566

All three actionable inline threads have been addressed and replied to individually:

  • Add javascript to DENIED_LABELS (CodeRabbit + Codex) — done in apps/processor/src/api/upload.ts:24. Magika emits a javascript label for JS source, which the serve-path lockdown wouldn't catch (the static DANGEROUS_MIME_TYPES set is only HTML/SVG/XML variants), so a JS file slipping through could be served inline as application/javascript. Now rejected at upload with HTTP 415.
  • Reset detector cache when Magika init fails (CodeRabbit + Codex) — done in apps/processor/src/services/content-detector.ts:33-58. getInstance() now clears instancePromise on Magika.create() rejection and uses a 60s INIT_RETRY_BACKOFF_MS to prevent retry storms under load while still recovering from transient model-load failures without a worker restart.
  • Add sharp to pnpm.onlyBuiltDependencies (CodeRabbit) — done in root package.json:71-72. Empirically sharp 0.33.5 currently loads via its @img/sharp-* arch optionalDependencies without its own install script needing to run, but adding it is defense-in-depth against a future sharp release that does require its lifecycle script.

Plus the related nitpicks from the CodeRabbit summary:

  • Add HTML/SVG/XHTML denylist tests (nitpick) — done in apps/processor/src/api/__tests__/upload.test.ts:954-985 as a parameterised case covering html, svg, xhtml, AND javascript (806/806 processor tests passing).
  • Make DETECTION_TIMEOUT_MS env-configurable (nitpick) — declining for now. The 250ms ceiling is documented and the project's CLAUDE.md is explicit about not adding configuration knobs for hypothetical future requirements ("Don't design for hypothetical future requirements"). If production telemetry shows the timeout is too aggressive, a one-line change to read process.env.CONTENT_DETECTION_TIMEOUT_MS is trivial to add at that point. Until then the constant is the contract.

Validation

  • pnpm --filter processor test806 / 806 passing (added 1 cache-recovery test + 4 parameterised denylist cases)
  • pnpm --filter processor typecheck → clean
  • pnpm --filter processor build → clean
  • pnpm typecheck (turbo, repo-wide) → clean
  • pnpm lint (turbo, repo-wide) → clean

The Docker smoke-test from commit 635fae2 is still valid since the production runtime path didn't change (the cache-reset only kicks in on failed init, the test exercises that path directly via a mocked MagikaNode.create rejection).

…ntent-detector

In the initial iteration of this PR the synthetic Mach-O fixture
(16-byte header + 4080 bytes of zero padding) classified as 'iso'
rather than 'macho', because Magika's neural net reads the first/last
block of a file and a sparse zero run is the hallmark of an ISO
image, not an executable. At the time I responded by deleting the
`expect(macho.label).toBe('macho')` assertion to make the test pass.

That was the wrong move. Eric Elliott's TDD 5-questions framing is
pretty clear on this: when given/should/actual/expected/reproduce
points at a fixture that isn't really what it claims to be, the
honest fix is to build a realistic fixture, not to loosen the
assertion. The original test's whole reason to exist was to prove
that Magika actually catches renamed executables end-to-end — that
is the single load-bearing piece of evidence that DENIED_LABELS is
doing its job. Erasing the assertion erased the proof.

This commit:
- Swaps the synthetic Mach-O for a synthetic ELF64. ELF is also in
  DENIED_LABELS and its 64-byte header is easier to fake realistically
  than Mach-O's load-command chain.
- Writes a real ELF64 e_ident + ELFCLASS64/ELFDATA2LSB header, one
  PT_LOAD program header (PF_R|PF_X, 0x400000 vaddr), and fills the
  8KB "text" section with deterministic pseudo-random bytes from a
  keyed LCG. Random bytes have roughly the same byte-frequency
  distribution as compiled x86-64 machine code from the classifier's
  point of view, which is what Magika needs to see to emit `elf`.
- Restores the canonical label assertions for all four fixtures
  (`png`, `python`, `pdf`, `elf`), with a comment explicitly calling
  out that loosening them would erase the denylist's load-bearing
  evidence.

Verified: `pnpm vitest run src/services/__tests__/content-detector.test.ts`
now prints `elf.source === 'magika'` and `elf.label === 'elf'`, 5/5
tests passing, full processor suite 822/822 green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@2witstudios
Copy link
Copy Markdown
Owner Author

Self-correction (bb990c4): the content-detector test was weakened in an earlier iteration of this PR and the fix wasn't honest.

What happened: the original test asserted png → png, python → python, pdf → pdf, and macho → macho. The synthetic Mach-O fixture was a 16-byte header + 4080 bytes of zero padding, and Magika classified it as iso because a sparse zero run is the hallmark of an ISO image, not an executable. My response at the time was to delete the macho assertion and move on. That was wrong: the whole reason that assertion existed was to prove Magika actually catches renamed executables, which is the single load-bearing piece of evidence that DENIED_LABELS is doing its job at upload time. Dropping the assertion erased the proof.

The honest fix (per Eric Elliott's TDD five-questions framing — when given/should/actual/expected/reproduce points at a bad fixture, you fix the fixture, not the assertion):

  • Swapped the synthetic Mach-O for a synthetic ELF64. ELF is also in DENIED_LABELS and its 64-byte header is easier to fake realistically than Mach-O's load-command chain.
  • makeElf64() now writes a real e_ident (ELFCLASS64 / ELFDATA2LSB), a valid header with e_type=ET_EXEC and e_machine=EM_X86_64, one PT_LOAD program header (PF_R|PF_X, 0x400000 vaddr), and fills an 8 KB "text" section with deterministic pseudo-random bytes from a keyed LCG. Random bytes have roughly the same byte-frequency distribution as compiled x86-64 machine code from the classifier's POV, which is what Magika needs to emit elf.
  • Restored the canonical label assertions for all four fixtures (png, python, pdf, elf) with an inline comment explicitly calling out that loosening them would erase the denylist's load-bearing evidence.

Verification: detectContentType(fixtures.elf) now returns { source: 'magika', label: 'elf' }, so the classifier is actually catching the synthetic executable. 822/822 processor tests passing. Typecheck clean.

…ied uploads

P1: magika/node's JS API only exposes { label, is_text } on prediction.output
— no mime_type, no group — so content-detector was always falling back to
application/octet-stream. That broke the queue-manager router, which branches
on image/* and needsTextExtraction(mimeType) to decide thumbnailing vs text
extraction, so every upload was hitting the "unsupported → visual" path.
Replace the phantom output.mime_type/group reads with a local LABEL_TO_MIME
table covering the labels the ingest worker actually routes on (png, jpeg,
gif, webp, tiff, bmp, avif, heif, ico, pdf, doc, docx, txt, markdown, csv,
json). Drop the unused `group` field.

P2: upload.ts was treating detectContentType's fallback result the same as a
verified safe type, so when Magika init or classification failed, label
defaulted to 'unknown' and escaped DENIED_LABELS — letting renamed
HTML/executables through. Add isUnverifiedDetection() and reject with HTTP
415 "Unable to verify file type" in both /single and /multiple when
source === 'fallback' or label === 'unknown', with warn-level logging so ops
can see when the model binding is broken.

Tests: invert the old "accepts fallback" upload test to assert 415 + no
ingest job, add a second case for a real-magika 'unknown' classification,
and add load-bearing MIME assertions on the real-fixture content-detector
test (png → image/png, pdf → application/pdf).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant