Skip to content

fix(api): v1/v1.1 normalize dispatch + stuck-row reconciler (#2870)#2873

Merged
vpetersson merged 6 commits into
masterfrom
fix-2870-v1-v1.1-normalize-dispatch
May 12, 2026
Merged

fix(api): v1/v1.1 normalize dispatch + stuck-row reconciler (#2870)#2873
vpetersson merged 6 commits into
masterfrom
fix-2870-v1-v1.1-normalize-dispatch

Conversation

@vpetersson
Copy link
Copy Markdown
Contributor

Issues Fixed

Closes #2870.

Description

Two-line defence against the v1 / v1.1 video-upload normalize gap.

1. Root cause — v1 / v1.1 never dispatched normalize.
A shared dispatch_pending_normalize(serializer, asset_id) helper now
runs after Asset.objects.create(...) in all four AssetListView.post
methods, branching on the serializer's _pending_normalize flag.
CreateAssetSerializerV1_1 was also updated to set
_pending_normalize='video' and is_processing=1 on local video
uploads — pre-fix, those rows landed in whatever codec the operator
sent (MPEG-1 in the issue's repro) and the per-board passthrough set
silently dropped them from rotation forever. Image normalisation stays
opt-in via v1.2 / v2 only; promoting it onto the legacy endpoints
would shift synchronous-availability semantics for older clients.

2. Reconciler safety-net.
New reconcile_stuck_processing celery-beat task (10-min cadence,
60-min threshold — longer than NORMALIZE_VIDEO_TIME_LIMIT_S=30min)
recovers rows whose normalize task was lost mid-flight (worker
SIGKILL, OOM, backup restore that bypassed dispatch, Redis flake
during .delay()). Ages rows via metadata.processing_started_at,
stamped by each dispatch helper. Rows missing the stamp (legacy /
restored) get stamped on first sight for the full grace window; the
next sweep applies the threshold uniformly.

Bonus: drop hurry.filesize.
Five call sites swapped to django.template.defaultfilters.filesizeformat.
Wire-format change on /api/v*/info.free_space and home-page disk
fields: "15G""15.0 GB" (NBSP separator, decimal, full unit).
The hurry.filesize pin is dropped from pyproject.toml (deps +
mypy override) and uv.lock.

Version bumped to 2026.05.1 in pyproject.toml / package.json /
uv.lock.

Checklist

  • I have performed a self-review of my own code.
  • New and existing unit tests pass locally and on CI with my changes.
  • I have done an end-to-end test for Raspberry Pi devices.
  • I have tested my changes for x86 devices.
  • I added documentation for the changes I have made (when necessary).

…uck rows (#2870)

* Wire normalize-task dispatch into v1 and v1.1 ``AssetListView.post``
  via a shared ``dispatch_pending_normalize`` helper; before this,
  v1/v1.1 video uploads landed in whatever codec the operator sent
  and the per-board passthrough set silently dropped non-h264/hevc
  files from rotation forever.
* Set ``_pending_normalize='video'`` / ``is_processing=1`` on local
  video uploads in ``CreateAssetSerializerV1_1``. Image normalisation
  stays opt-in via v1.2 / v2 — promoting it onto the legacy endpoints
  would shift synchronous availability semantics for older clients.
* Add ``reconcile_stuck_processing`` celery-beat task (10-min cadence,
  60-min threshold) to recover rows whose normalize task was lost
  (worker SIGKILL, OOM, backup restore that bypassed dispatch). Ages
  rows via ``metadata.processing_started_at`` — stamped by each
  dispatch helper, granted on first sight for legacy / restored rows.
* Drop ``hurry.filesize`` for ``django.template.defaultfilters.filesizeformat``.
  ``free_space`` on ``/api/v1/info``, ``/api/v1.1/info``, ``/api/v2/info``
  changes from ``"15G"`` → ``"15.0 GB"`` (NBSP separator, decimal,
  full unit). Same change on home-page disk fields.
* Bump version to 2026.05.1 in pyproject.toml, package.json, uv.lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vpetersson vpetersson requested a review from a team as a code owner May 12, 2026 13:12
@vpetersson vpetersson self-assigned this May 12, 2026
@vpetersson vpetersson requested a review from Copilot May 12, 2026 13:12
CI failures on the new dispatch tests:

* v1 / v1.1 video and image tests hit ``validate_uri('/data/…')`` →
  500 ``Invalid file path``. The mixin patch only covered
  ``serializers.mixins.validate_uri``; v1 / v1.1 import the same
  helper as ``serializers.v1_1.validate_uri``. Patch both binding
  sites.
* v1.2 / v2 image test expected ``dispatch_normalize_image`` once but
  got zero. The mixin's rename writes ``<asset_id><ext>``; without
  ``ext`` in the payload the file lands extensionless and
  ``needs_image_normalisation`` returns False. Pass ``'ext': '.heic'``.
* Also pass ``'ext': '.mp4'`` on the video payload — mixin uses it for
  the rename destination too; v1 / v1.1 ignore it.
* Re-run ``ruff format`` on tests/test_processing.py — pre-push run
  used ``ruff check`` only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR closes #2870 by ensuring legacy API endpoints (v1/v1.1) dispatch video normalization consistently with v1.2/v2, and by adding a periodic Celery reconciler to recover (or at least unstick) assets that remain is_processing=True after dispatch/task failures. It also removes the hurry.filesize dependency by switching to Django’s filesizeformat, and bumps the project version to 2026.05.1.

Changes:

  • Centralize “pending normalize” dispatch via dispatch_pending_normalize() and wire it into v1/v1.1/v1.2/v2 create flows.
  • Add reconcile_stuck_processing periodic task plus metadata.processing_started_at stamping to support age-based recovery.
  • Replace hurry.filesize with django.template.defaultfilters.filesizeformat and update tests + version pins/locks.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
uv.lock Bumps version and removes hurry-filesize (and related lock entries).
pyproject.toml Bumps version; drops hurry.filesize dependency and mypy ignore entry.
package.json Bumps version.
src/anthias_server/processing.py Adds processing-start stamp + shared dispatch_pending_normalize() helper; stamps on normalize dispatch.
src/anthias_server/celery_tasks.py Adds periodic reconcile_stuck_processing task and ISO timestamp parser; schedules the task.
src/anthias_server/api/views/v1.py Ensures v1 asset create dispatches pending normalization.
src/anthias_server/api/views/v1_1.py Ensures v1.1 asset create dispatches pending normalization.
src/anthias_server/api/views/v1_2.py Switches to shared pending-normalize dispatcher.
src/anthias_server/api/views/v2.py Switches to shared pending-normalize dispatcher; replaces hurry.filesize with filesizeformat.
src/anthias_server/api/views/mixins.py Replaces hurry.filesize with filesizeformat for info endpoints.
src/anthias_server/api/serializers/v1_1.py Flags local video uploads for normalization (_pending_normalize='video') and sets is_processing=1.
src/anthias_server/app/views.py Routes UI uploads through normalize dispatch helpers so stamping occurs consistently.
src/anthias_server/app/page_context.py Switches disk size formatting to Django filesizeformat.
src/anthias_common/youtube.py Stamps processing_started_at when dispatching YouTube downloads.
tests/test_processing.py Expands unit tests for stamping + pending-normalize routing behavior.
tests/test_celery_tasks.py Adds coverage for reconcile_stuck_processing behavior.
src/anthias_server/api/tests/test_assets.py Adds regression tests ensuring all API versions dispatch video normalization; guards image-normalize behavior by version.
src/anthias_server/api/tests/test_info_endpoints.py Updates mocks/expectations for filesizeformat output shape (NBSP, units).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/anthias_server/celery_tasks.py Outdated
Comment thread src/anthias_server/celery_tasks.py Outdated
Comment thread src/anthias_common/youtube.py Outdated
Addresses Copilot review on #2873:

* ``_parse_processing_started_at`` now returns ``None`` for naive
  (no-tzinfo) ISO-8601 strings. The dispatch helpers stamp via
  ``timezone.now()`` (tz-aware under ``USE_TZ=True``), so a naive
  value is by definition a hand-edit of metadata — comparing it to
  the tz-aware ``cutoff`` would raise ``TypeError`` and abort the
  sweep for every other stuck row. Route it through the
  stamp-on-first-sight branch instead.
* Wrap the reconciler body in a SETNX + Lua-compare-and-delete lock
  on ``RECONCILE_STUCK_LOCK_KEY`` — same pattern as
  ``revalidate_asset_urls``. Default Anthias runs one worker with
  embedded beat (so single-flighted today), but the lock bounds
  blast radius if a future deploy ever runs two workers with ``-B``.
* Rewrite the ``dispatch_download`` docstring: stamping
  ``processing_started_at`` does *not* let the reconciler re-download
  — it routes via mimetype='video' through ``normalize_video_asset``,
  whose ``path.isfile`` guard fails fast and ``on_failure`` writes
  ``metadata.error_message`` so the operator sees a "Failed" pill
  instead of a row stuck on "Processing".

Tests: naive-timestamp recovery + lock-prevents-overlap +
lock-released-after-clean-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vpetersson vpetersson requested a review from Copilot May 12, 2026 13:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 5 comments.

Comment thread src/anthias_server/processing.py Outdated
Comment thread src/anthias_server/processing.py Outdated
Comment thread src/anthias_server/processing.py Outdated
Comment thread src/anthias_server/celery_tasks.py Outdated
Comment thread src/anthias_server/celery_tasks.py Outdated
Addresses second Copilot pass on #2873:

* Rename ``_stamp_processing_start`` → ``stamp_processing_start``.
  The single underscore marked it as module-private but it's already
  called cross-package (anthias_common.youtube.dispatch_download,
  anthias_server.celery_tasks.reconcile_stuck_processing). Promoting
  to a public symbol matches its actual API surface; docstring now
  states the cross-module contract explicitly.
* Wrap the read-modify-write in ``transaction.atomic()``. SQLite is
  single-writer at the file level so the wrap is mostly documenting
  intent today, but on Postgres / MySQL it produces an actual
  transaction boundary against a near-simultaneous operator PATCH or
  worker landing ``error_message``.
* Fix the docstring claim of "no-op via the filter().update()
  pattern" — the code is a SELECT-then-UPDATE; the no-op semantics
  come from the ``filter().first() is None`` guard, not from the
  update query shape.
* Tighten ``_parse_processing_started_at`` return type from ``Any``
  to ``datetime | None``. Pull ``datetime`` into the module-level
  ``from datetime import …`` so the annotation resolves; drop the
  redundant local imports inside the parser and reconciler bodies.
* Fix the reconciler docstring's "10-min grace window" — the actual
  grace is ``RECONCILE_STUCK_THRESHOLD_S`` (60 min) from the
  first-sweep stamp, with the 10-min cadence just dictating how
  often the row is re-checked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 2 comments.

Comment thread src/anthias_server/processing.py Outdated
Comment thread src/anthias_server/api/views/mixins.py
Round-3 Copilot review on #2873:

* ``stamp_processing_start`` docstring: replace the implied
  "transaction.atomic() makes this safe" framing with a clear
  acknowledgement that this is a read-modify-write race against
  arbitrary other metadata writers (operator PATCH, normalize task
  landing error_message). Document why we accept it: stamps fire at
  *dispatch* time, before the worker picks the task up, so the
  practical window is microscopic — matches the same race the wider
  codebase already accepts in ``_set_processing_error`` et al.
  ``select_for_update()`` would close it on Postgres / MySQL but is
  a no-op on SQLite (Anthias's only deployed backend), so we'd pay
  complexity for zero practical safety today. The
  ``transaction.atomic()`` wrap stays — cheap savepoint that survives
  a future multi-writer-backend switch.
* ``InfoViewMixin`` OpenAPI ``example`` field: update ``free_space``
  from ``'10G'`` to ``'10.0 GB'`` so the generated docs match the
  new ``filesizeformat`` shape (NBSP separator, decimal, full unit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated 2 comments.

Comment thread src/anthias_server/celery_tasks.py Outdated
Comment thread src/anthias_server/celery_tasks.py Outdated
Round-4 Copilot review on #2873:

* ``RECONCILE_STUCK_INTERVAL_S`` comment: the previous "one SELECT,
  one UPDATE per stuck row" undercounted — each per-row branch
  itself does SELECT + UPDATE inside ``stamp_processing_start``, and
  the re-dispatch branch may do more on top. Rewrite to reflect the
  actual cost shape: the initial SELECT plus per-row stamp / dispatch
  work, with the observation that a healthy fleet has an empty
  filtered set so the tick costs one SELECT total.
* ``RECONCILE_STUCK_THRESHOLD_S`` comment: removed the claim that a
  row past the threshold "has had its worker time-limit expire" —
  rows can be stuck for reasons that never involved a worker
  picking the task up (Redis flake during ``.delay()``, container
  restart between enqueue and accept). Rewrite to call out both
  classes of failure so the threshold's role is honest: "if a row
  has carried is_processing=True for over an hour, something went
  wrong — recover it."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 18 changed files in this pull request and generated no new comments.

@vpetersson vpetersson merged commit d2ebdd5 into master May 12, 2026
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: video uploads via v1/v1.1 bypass normalize_video_asset and stick at is_processing=True

2 participants