Improve MP4/M4A server cleanse metadata quality #32

Merged
ChrisAdamsdevelopment merged 1 commit into main from codex/fix-mp4/m4a-metadata-quality-issues
May 18, 2026

ChrisAdamsdevelopment (Owner) commented May 18, 2026

Motivation

  • Fix QuickTime/track/media timestamps being left as 0000:00:00 00:00:00 and ensure files show sane current export timestamps after server cleanse.
  • Remove XMP/tool-trace residue such as XMPToolkit / Image::ExifTool and stop injecting generic branding like Artist: Creator and Copyright: © 2026 Creator when user metadata exists.
  • Preserve and forward user-provided metadata (artist/producer/title/copyright/genre/description/tags/lyrics/platform) so cleansed MP4/M4A outputs look like normal exports and provide automated metadata-quality verification.

Description

  • Reworked MP4/M4A metadata rewrite to use QuickTime ItemList tags (ItemList:Title, ItemList:Artist, ItemList:Producer, ItemList:Copyright, ItemList:Keyword, ItemList:Genre, and platform-specific ItemList:Description, ItemList:Comment, ItemList:Album, ItemList:ContentCreateDate, ItemList:Lyrics) and avoid unqualified tag injection that can leave XMP traces.
  • Implemented timestamp strategy that writes the current server processing/export timestamp (UTC) to QuickTime:CreateDate, QuickTime:ModifyDate, TrackCreateDate, TrackModifyDate, MediaCreateDate, and MediaModifyDate, and collects non-fatal warnings if ExifTool cannot update specific track/media fields.
  • Stopped forcing a Creator fallback: if artist is blank, the Artist field is omitted. Copyright defaults to © <CURRENT_YEAR> <artist> only when artist is present and no explicit copyright was supplied. Producer gets a small inference (detects Triple7 in context if present).
  • Removed XMP by calling -XMP:all= after re-writing and added marker rules to detect XMPToolkit and Image::ExifTool, while treating binary MP4 structural fields (e.g., AVCConfiguration, SampleSizes, ChunkOffset, MediaData, and related track/sample fields) as benign.
  • Added metadata-quality verification (buildQualityVerification) that flags remaining XMPToolkit / Image::ExifTool traces, generic Creator injections, zeroed QuickTime/track/media timestamps, and missing expected Title/Artist/Copyright/Producer; verification findings are included in the report and influence status (clean, clean_with_notes, review_required).
  • Exposed the full forensic report to the frontend via the X-Forensic-Report response header and updated the client FormData to include title, artist, producer, copyright, genre, description, tags, lyrics, and platform.
  • Files changed: server/processor.js, server/metadataRules.js, server.js, app.tsx, and src/utils/metadata.js (browser metadata parsing updated to extract producer/copyright/lyrics).
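
The defaulting rules described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in server/processor.js: the name buildMetaToWriteSketch, the injectable `now` parameter, and the reduced tag set are assumptions made for the example, and the Triple7 producer inference is left out.

```javascript
// Hypothetical sketch; the real buildMetaToWrite in server/processor.js may differ.
function buildMetaToWriteSketch(metadata = {}, now = new Date()) {
  const trimmed = (value) => (typeof value === 'string' ? value.trim() : '');
  const artist = trimmed(metadata.artist);
  const explicitCopyright = trimmed(metadata.copyright);

  const tags = {
    'ItemList:Title': trimmed(metadata.title),
    'ItemList:Artist': artist, // no forced "Creator" fallback
    'ItemList:Producer': trimmed(metadata.producer),
    'ItemList:Genre': trimmed(metadata.genre),
    // Default copyright only when an artist exists and none was supplied.
    'ItemList:Copyright':
      explicitCopyright || (artist ? `© ${now.getUTCFullYear()} ${artist}` : ''),
  };

  // Omit blank fields entirely instead of writing placeholder values.
  return Object.fromEntries(
    Object.entries(tags).filter(([, value]) => value !== '')
  );
}
```

With no artist supplied, neither ItemList:Artist nor ItemList:Copyright appears in the result, which matches the "omit rather than brand" behaviour the description calls for.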

Testing

  • Ran static syntax checks with node --check server/processor.js && node --check server.js, which passed.
  • Built the frontend with npm run build (TypeScript + Vite) and the build completed successfully.
  • Performed an inline Node smoke test that exercised buildMetaToWrite, timestamp formatting, and buildQualityVerification, and the assertions passed.
  • Dependency install note: npm ci (default) failed in this container due to native better-sqlite3 build incompatibility with the local Node (24.x vs expected 20.x), while npm ci --ignore-scripts succeeded and was used to validate the frontend/build steps.

Summary by Sourcery

Improve MP4/M4A server-side cleansing to preserve user metadata, eliminate XMP/tool traces, and add metadata-quality verification to influence cleanse status and reporting.

New Features:

  • Forward user-supplied MP4/M4A metadata (title, artist, producer, copyright, genre, description, tags, lyrics, platform) into QuickTime ItemList tags instead of injecting generic defaults.
  • Add automated metadata-quality verification that inspects final tags for XMP/ExifTool residue, zeroed QuickTime timestamps, generic Creator injections, and missing key fields, feeding into the cleanse status and report.
  • Expose the full forensic report, including metadata-quality findings, to clients via an X-Forensic-Report response header and consume it in the frontend.
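
One plausible way for verification findings to feed the cleanse status is sketched below. The status names come from the PR description; the mapping itself (failures lead to review_required, warnings alone lead to clean_with_notes) and the function name deriveStatusSketch are assumptions, since the PR text does not spell out the exact rule.

```javascript
// Hypothetical mapping from verification findings to a report status.
function deriveStatusSketch(qualityVerification = {}) {
  const failures = qualityVerification.failures || [];
  const warnings = qualityVerification.warnings || [];
  if (failures.length > 0) return 'review_required'; // hard findings need a human look
  if (warnings.length > 0) return 'clean_with_notes'; // non-fatal notes only
  return 'clean';
}
```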

Bug Fixes:

  • Ensure QuickTime/track/media timestamps are set to the current UTC export time instead of being left zeroed, and flag fields that cannot be updated.
  • Stop writing generic Artist/Creator and © YEAR Creator copyright values when user metadata is present, reducing misleading metadata in cleansed files.
  • Treat low-level MP4 structural tags as benign so they no longer appear as suspicious provenance markers.
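
For reference, ExifTool date values use the `YYYY:MM:DD HH:MM:SS` layout, so a UTC export timestamp could be formatted along these lines. The function name is hypothetical and only stands in for the PR's formatQuickTimeTimestamp helper, whose actual body is not shown here.

```javascript
// Format a Date as an ExifTool-style UTC timestamp: "YYYY:MM:DD HH:MM:SS".
function formatQuickTimeTimestampSketch(date = new Date()) {
  const pad = (n) => String(n).padStart(2, '0');
  const day = [
    date.getUTCFullYear(),
    pad(date.getUTCMonth() + 1), // getUTCMonth() is zero-based
    pad(date.getUTCDate()),
  ].join(':');
  const time = [
    pad(date.getUTCHours()),
    pad(date.getUTCMinutes()),
    pad(date.getUTCSeconds()),
  ].join(':');
  return `${day} ${time}`;
}
```

Using the UTC getters throughout keeps the output independent of the server's local time zone.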

Enhancements:

  • Refine marker rules to treat XMPToolkit and Image::ExifTool values as provenance markers that should be removed.
  • Expand allowed injected tag handling to support producer, content creation date, and modern ItemList-based fields while normalizing tag-name detection.
  • Improve browser-side metadata extraction to pull producer, copyright, and lyrics from native containers where available.
  • Enrich the server cleanse report with export timestamps, quality verification details, and a consolidated list of verification findings.

sourcery-ai Bot commented May 18, 2026

Reviewer's Guide

Refactors MP4/M4A metadata handling to use QuickTime ItemList tags, adds precise QuickTime timestamp writing and metadata-quality verification, expands browser-side metadata extraction, and exposes a richer forensic report to the frontend and API clients.

Sequence diagram for updated MP4/M4A processing and forensic reporting

sequenceDiagram
  actor User
  participant BrowserApp as Browser_App
  participant readFileMetadata as readFileMetadata
  participant Server as API_Server
  participant processMediaFile as processMediaFile
  participant exiftool as exiftool

  User->>BrowserApp: Select MP4/M4A file
  BrowserApp->>readFileMetadata: readFileMetadata(file)
  readFileMetadata-->>BrowserApp: { title, artist, producer, copyright, genre, lyrics, detectedMarkers }

  BrowserApp->>Server: POST /api/process
  activate Server
  Server->>processMediaFile: processMediaFile({ outputPath, platform, metadata })
  activate processMediaFile

  processMediaFile->>exiftool: read(outputPath)
  exiftool-->>processMediaFile: beforeTags

  processMediaFile->>processMediaFile: buildMetaToWrite(platform, metadata)
  processMediaFile->>exiftool: write(outputPath, metaToWrite, -overwrite_original)
  processMediaFile->>exiftool: write(outputPath, {}, -XMP:all= -overwrite_original)

  processMediaFile->>processMediaFile: formatQuickTimeTimestamp()
  processMediaFile->>exiftool: writeQuickTimeTimestamps(outputPath, exportTimestamp)
  exiftool-->>processMediaFile: timestampWriteWarnings

  processMediaFile->>exiftool: read(outputPath)
  exiftool-->>processMediaFile: finalTags

  processMediaFile->>processMediaFile: detectMarkers(finalTags)
  processMediaFile->>processMediaFile: verifyFinalState(finalTags)
  processMediaFile->>processMediaFile: buildQualityVerification(finalTags, metadata, timestampWriteWarnings)
  processMediaFile-->>Server: { report }
  deactivate processMediaFile

  Server->>Server: Set X-Forensic-Report header (JSON.stringify(report))
  Server-->>BrowserApp: 200 OK + download URL
  deactivate Server

  BrowserApp->>BrowserApp: Parse X-Forensic-Report header
  BrowserApp-->>User: Show status, qualityVerification, and download link

File-Level Changes

Change, details, and files touched:
Rework server-side MP4/M4A metadata writing to use QuickTime ItemList tags, better defaults, and explicit text cleaning.
  • Replace generic Title/Artist/Copyright/Keywords/Genre fields with ItemList variants keyed to title, artist, producer, copyright, keyword, genre, description, comment, album, content create date, and lyrics.
  • Introduce cleanText helper to strip NULs, trim, and truncate values; infer producer from context (e.g., Triple7) and compute default copyright only when artist is present and no explicit copyright is provided.
  • Adjust allowedInjectedTags logic to treat ItemList prefixes uniformly and support the expanded set of descriptive tags.
server/processor.js
server/metadataRules.js
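
A text-cleaning helper of the kind described (strip NULs, trim, truncate) might look like the sketch below. The `(value, maxLength)` signature mirrors the `cleanText(metadata.title, 255)` calls visible in the review excerpt; the body is an assumption, not the merged implementation.

```javascript
// Hypothetical stand-in for cleanText: strip NUL bytes, trim, then truncate.
function cleanTextSketch(value, maxLength) {
  if (value === undefined || value === null) return '';
  const text = String(value).replace(/\0/g, '').trim();
  // slice() counts UTF-16 code units, so a trailing surrogate pair could be
  // split at the boundary; acceptable for a sketch, worth handling in production.
  return text.slice(0, maxLength);
}
```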
Introduce handling and verification of QuickTime timestamps and residual metadata quality issues in the server processor.
  • Add constants and helpers to format UTC QuickTime timestamps and write them to QuickTime, track, and media date fields, collecting non-fatal write warnings.
  • Implement buildQualityVerification to detect XMPToolkit/Image::ExifTool traces, generic Creator injections, zeroed timestamps, and missing expected title/artist/copyright/producer tags.
  • Integrate qualityVerification into processMediaFile, update status/summary logic, extend report payload, and export new helpers.
  • Ensure XMP is removed after metadata rewrite via an -XMP:all= exiftool call and treat additional binary MP4 structural tags as benign.
server/processor.js
server/metadataRules.js
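
The zeroed-timestamp check could be approximated as follows. The field list and the zero value are taken from the PR description; the finding code `timestamp_zeroed`, the function name, and the assumption that tag values are comparable as strings are all illustrative guesses.

```javascript
// Fields the PR says should carry a real export timestamp.
const TIMESTAMP_FIELDS = [
  'CreateDate',
  'ModifyDate',
  'TrackCreateDate',
  'TrackModifyDate',
  'MediaCreateDate',
  'MediaModifyDate',
];
const ZEROED_TIMESTAMP = '0000:00:00 00:00:00';

// Hypothetical check: report every timestamp field still left at the zero value.
function findZeroedTimestampsSketch(tags = {}) {
  return TIMESTAMP_FIELDS
    .filter((field) => String(tags[field] ?? '').trim() === ZEROED_TIMESTAMP)
    .map((field) => ({
      code: 'timestamp_zeroed', // illustrative code, not from the PR
      field,
      message: `${field} is still ${ZEROED_TIMESTAMP}.`,
    }));
}
```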
Extend frontend metadata parsing to capture more fields and feed them into the cleanse request.
  • Add utilities to normalize and search native container metadata for producer, copyright, and lyrics frames using regex-based ID matching.
  • Populate analysis with producer, copyright, and lyrics when present, and include these plus artist/genre in the FormData sent to /api/process.
  • Preserve and surface expanded forensic report details by reading X-Forensic-Report, parsing JSON, and storing it on the queue item.
src/utils/metadata.js
app.tsx
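
On the client side, reading the report header defensively might look like this sketch. It accepts anything with a `get(name)` method, such as the `headers` object of a fetch `Response`; the function name and the choice to return `null` on a missing or malformed header are assumptions.

```javascript
// Hypothetical client helper: parse X-Forensic-Report without letting a
// missing or malformed header break the download flow.
function parseForensicReportHeader(headers) {
  const raw = headers.get('X-Forensic-Report');
  if (!raw) return null;
  try {
    return JSON.parse(raw);
  } catch {
    return null; // fall back to the older X-Forensic-* headers
  }
}
```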
Update API layer to pass through extended metadata and expose the full forensic report to clients.
  • Extend request body handling for /api/process and /api/process-batch to accept producer and copyright, forward the richer metadata object into processMediaFile, and keep existing parameters.
  • Expose the full server-side report via X-Forensic-Report in /api/process responses while still returning traditional forensic headers and usage info.
server.js

@ChrisAdamsdevelopment ChrisAdamsdevelopment merged commit fbf3709 into main May 18, 2026
4 of 5 checks passed

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The default copyright/artist expectations in buildQualityVerification are recomputed with new Date().getUTCFullYear() instead of reusing the same year/export timestamp used in buildMetaToWrite, which risks subtle drift (e.g., at year boundaries) and makes tests brittle; consider passing the export timestamp or computed year into buildQualityVerification so both paths share a single source of truth.
  • Verification logic currently checks tags like Artist, Copyright, and Producer directly while the writer is now using ItemList:* keys; if ExifTool doesn’t always mirror these to the short names, it might miss expected fields, so it may be safer to extend readAnyTag to also look at the corresponding ItemList:* variants when validating the final metadata.
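
The first point, sharing one source of truth for the year, can be illustrated with a small sketch in which a single export Date is captured once and handed to both the write path and the verification path. The helper name and call shape are hypothetical.

```javascript
// Hypothetical shared helper used by both the writer and the verifier.
function defaultCopyrightSketch(artist, exportDate) {
  return artist ? `© ${exportDate.getUTCFullYear()} ${artist}` : '';
}

// One Date captured per request; here pinned to the last second of 2026 (UTC).
const exportDate = new Date(Date.UTC(2026, 11, 31, 23, 59, 59));
const writtenCopyright = defaultCopyrightSketch('Ada', exportDate); // write path
const expectedCopyright = defaultCopyrightSketch('Ada', exportDate); // verify path
```

Because both values derive from the same Date, they cannot disagree even if processing straddles midnight on December 31, and tests can pin the date instead of depending on the wall clock.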

Comment thread server/processor.js
Comment on lines +112 to +121
function buildQualityVerification(tags = {}, metadata = {}, timestampWriteWarnings = []) {
  const failures = [];
  const warnings = timestampWriteWarnings.map((field) => ({ code: 'timestamp_write_skipped', field, message: `${field} could not be updated safely by ExifTool.` }));
  const expected = {
    title: cleanText(metadata.title || 'Untitled', 255),
    artist: cleanText(metadata.artist, 255),
    producer: cleanText(metadata.producer, 255),
    copyright: cleanText(metadata.copyright, 500) || (cleanText(metadata.artist, 255) ? `© ${new Date().getUTCFullYear()} ${cleanText(metadata.artist, 255)}` : ''),
  };
  if (tags.XMPToolkit) failures.push({ code: 'xmp_toolkit_present', field: 'XMPToolkit', message: 'XMPToolkit remains in final output.' });

issue (bug_risk): Quality verification only checks un-namespaced tag keys and may falsely flag missing fields when only ItemList:* variants exist.

buildQualityVerification currently calls readAnyTag(tags, ['Title']), ['Artist'], etc., but buildMetaToWrite now writes ItemList:Title, ItemList:Artist, ItemList:Producer, etc. If only ItemList:* keys are present, the bare Title/Artist keys may be missing and the verification will incorrectly report expected_*_missing. Please include the ItemList variants in the read keys (e.g. ['Title', 'ItemList:Title']) or normalize tag names before verification so the checks match the write path.
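
A lookup that checks both the bare key and its ItemList variant, as suggested, might be sketched like this. The real readAnyTag is not shown in the review, so the return type (a trimmed string, empty when nothing is found) and the name readAnyTagSketch are assumptions.

```javascript
// Hypothetical variant of readAnyTag: for each requested name, try the bare
// key first and then the "ItemList:" prefixed key.
function readAnyTagSketch(tags = {}, names = []) {
  for (const name of names) {
    for (const key of [name, `ItemList:${name}`]) {
      const value = tags[key];
      if (value !== undefined && value !== null && String(value).trim() !== '') {
        return String(value).trim();
      }
    }
  }
  return '';
}
```

With this shape, callers keep passing `['Title']` or `['Artist']` and the ItemList fallback is applied in one place rather than at every call site.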

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dba2bac0a4

Comment thread server.js
res.setHeader('X-Forensic-Removed', report.removedCount);
res.setHeader('X-Forensic-Tags', JSON.stringify(report.removedTags.slice(0, 50)));
res.setHeader('X-Forensic-Status', report.status || 'Sanitized');
res.setHeader('X-Forensic-Report', JSON.stringify(report));

P1: Avoid putting JSON report with user text in response header

Setting X-Forensic-Report to JSON.stringify(report) can throw ERR_INVALID_CHAR when metadata contains non-Latin1 characters (for example Japanese, emoji, or many non-Western artist/title values), because Node validates header bytes and rejects those characters. In that case the /api/process request falls into the error path and returns a server failure instead of the cleansed file. This should be moved to the response body (or safely encoded) rather than a raw custom header.
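
If the report has to stay in a header, one "safely encoded" option is to base64-encode the UTF-8 JSON so the header value is plain ASCII. The sketch below shows a Node-side round trip; both function names are hypothetical, and moving the report into the response body remains the simpler fix for large reports, since headers also have practical size limits.

```javascript
// Encode the report as base64 of its UTF-8 JSON so the header is ASCII-only.
function encodeReportForHeader(report) {
  return Buffer.from(JSON.stringify(report), 'utf8').toString('base64');
}

// Reverse the encoding (Node side; a browser would need atob plus TextDecoder).
function decodeReportFromHeader(value) {
  return JSON.parse(Buffer.from(value, 'base64').toString('utf8'));
}
```

On the server this would be used as `res.setHeader('X-Forensic-Report', encodeReportForHeader(report))`, with the client decoding before parsing.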
