
feat(jobs): show who requests which tasks in ML processing job logs #1238

Merged
mihow merged 4 commits into main from worktree-job-logging-and-cleanup
Apr 16, 2026

Conversation

@mihow
Collaborator

@mihow mihow commented Apr 15, 2026

What this does (plain English)

While an ML job is running, the job's log view was mostly empty during the long stretches where workers were actually doing the work. This PR fills that in: every worker poll, every result POST back from a worker, and a progress-rate line all show up in the job log now — along with which user and API token drove each request. Makes it much easier to tell at a glance whether a job is healthy, slow, or stuck, and who's doing what.


Summary

Adds structured observability log lines to the per-job logger for async_api ML jobs, so worker activity is visible from the job log view (GET /api/v2/jobs/{id}/logs.stdout).

Four additions in ami/jobs/views.py and ami/jobs/tasks.py:

  • Task-fetch lines (JobViewSet.tasks()): Tasks fetched: requested=..., delivered=..., user=..., token_id=... on every worker poll. Fills in the silence window between Created NATS stream and the first Updated job progress line — previously the job log showed nothing during this stretch even though workers were actively polling.
  • Result-POST mirror lines (JobViewSet.result()): Queued pipeline result: task_id=..., reply_subject=..., user=... mirrors the existing module-logger emission to the per-job log. Module-logger copy is retained for ops-level monitoring.
  • Throughput / ETA lines (_log_job_throughput in ami/jobs/tasks.py): Job {id} throughput: elapsed=..., processed=P/T, rate=R imgs/min, ETA=... on process/results stage progress updates. Intentionally a plain processed/elapsed division rather than a rolling-window estimator — accurate enough to distinguish stalled-vs-slow from healthy-but-throttled, cheap, easy to eyeball. Guarded by started_at != None and skipped on non-process/results stages.
  • Phantom-pull early-exit line: POST /tasks/ against a terminal job now logs Tasks requested for non-active job (status=...); returning empty. to the per-job log instead of silently returning {"tasks": []}. Restores the signal used to detect workers still polling terminal jobs.
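The throughput line's arithmetic is simple enough to sketch in a few lines. This is an illustration only — the helper name compute_throughput and its return shape are invented here; the PR's actual function is _log_job_throughput, which also formats and emits the log line:

```python
from datetime import datetime, timezone


def compute_throughput(started_at, processed, total, now=None):
    """Plain processed/elapsed rate and ETA, per the description above.

    Returns (rate_imgs_per_min, eta_seconds), or None when the guard
    conditions apply (no started_at, or too little elapsed time).
    """
    if started_at is None:
        return None  # guard: progress update arrived before the job started
    now = now or datetime.now(timezone.utc)
    elapsed = (now - started_at).total_seconds()
    if elapsed < 3:
        return None  # guard: dividing over a tiny window is meaningless
    if processed <= 0:
        return 0.0, None  # rendered as rate=0.0 / ETA=unknown
    rate = processed / (elapsed / 60.0)  # images per minute
    remaining = max(total - processed, 0)
    eta_seconds = remaining * (elapsed / processed)  # seconds left at current rate
    return rate, eta_seconds
```

One consequence of the plain division: a slow ramp-up inflates the early ETA, which then shrinks as the denominator catches up — exactly the monotonic decrease the E2E run checks for.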

Test plan

  • ami.jobs.tests.test_jobs — unit tests cover all four log lines (65/65 pass)
  • E2E on local compose stack (50-image async_api job) — all four signals observed in the live logs.stdout view; throughput line ETAs monotonically decrease and match DB counts

Summary by CodeRabbit

  • New Features

    • Job progress logging now includes throughput metrics (images per minute) and estimated time to completion for processing stages.
    • Enhanced job event logging with detailed context including user information and task counts.
  • Tests

    • Added comprehensive test coverage for job throughput logging and per-job logging behavior across job endpoints.

E2E validation

Ran against a local compose stack with one async_api ML job: 50-image random collection, pipeline quebec_vermont_moths_2023, ADC worker. Job completed successfully in ~4 minutes (50 images / 483 detections). Pulled the resulting job.logs.stdout via GET /api/v2/jobs/{id}/ and grep-counted each of the four new log lines. User/token identifiers redacted below.

Tasks fetched — 68 lines. delivered value correctly tracks the silence window (8× delivered=8 during initial pulls, 2× delivered=2 partial, 58× delivered=0 during the batch-processing stretch when no work is outstanding):

[… 18:56:28] INFO Tasks fetched: requested=8, delivered=8, user=<user>, token_id=<…>
[… 19:00:25] INFO Tasks fetched: requested=8, delivered=0, user=<user>, token_id=<…>

Queued pipeline result — 84 per-job lines observed; the module-logger copy landed 100× in the django container logs (both emissions sit in the same for loop). The per-job count is ~16% below the module-logger count — likely a Job-row save race when multiple /result/ POSTs land concurrently (job.logger.info appends to a JSONB list on the Job row, and overlapping writes can lose entries). Not a correctness issue for this feature, but worth filing as a follow-up; the module-logger copy was retained precisely so ops doesn't rely solely on the per-job view.

Throughput / ETA — 64 lines, all on process/results stages (none on collect, confirming the stage filter). First line at elapsed=2m 11s (confirming the started_at guard). processed matches SourceImageCollection.source_images_count (50). ETA shrinks monotonically once processing ramps up:

[… 18:58:39] INFO Job N throughput: elapsed=2m 11s, processed=1/50, rate=0.5 imgs/min, ETA=1h 47m 16s
[… 19:00:25] INFO Job N throughput: elapsed=3m 57s, processed=46/50, rate=11.6 imgs/min, ETA=0m 20s
[… 19:00:27] INFO Job N throughput: elapsed=3m 59s, processed=50/50, rate=12.5 imgs/min, ETA=0m 00s

Phantom-pull early-exit — after the job hit SUCCESS, posting an extra /tasks/ call returned {"tasks":[]} and wrote the expected line to logs.stdout:

[… 19:02:41] INFO Tasks requested for non-active job (status=SUCCESS); returning empty. user=<user>, token_id=<…>
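The terminal-status guard that produces that line can be sketched roughly as follows. This is a simplification against a stubbed job object — the terminal-status set and function signature are assumptions; the real code lives in JobViewSet.tasks() and returns a DRF Response:

```python
import logging

# Assumed terminal statuses; the real set comes from the project's job state model.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "CANCELED", "REVOKED"}


def pull_tasks(job, queue, requested, user_desc, token_id):
    """Hand a batch of tasks to a polling worker, or an empty batch with a
    logged explanation when the job has already reached a terminal state."""
    if job.status in TERMINAL_STATES:
        # The phantom-pull case: worker still polling a finished job.
        job.logger.info(
            "Tasks requested for non-active job (status=%s); returning empty. "
            "user=%s, token_id=%s",
            job.status, user_desc, token_id,
        )
        return {"tasks": []}
    batch = queue[:requested]  # delivered may be smaller than requested
    job.logger.info(
        "Tasks fetched: requested=%d, delivered=%d, user=%s, token_id=%s",
        requested, len(batch), user_desc, token_id,
    )
    return {"tasks": batch}
```

The requested/delivered split is what makes the log readable: delivered=0 at a steady cadence means the worker is alive but no work is outstanding, while silence means the worker stopped polling.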

mihow and others added 2 commits April 15, 2026 15:24
Async_api jobs have a silence window between "Created NATS stream" and the
first "Updated job progress" line where Django only logs results coming
back. Operators viewing the per-job log can't tell if a worker is working,
missing, slow, or hitting a wall.

Mirror the task-fetch and result-POST paths to job.logger so the per-job
log view reflects worker polling activity:

- Successful fetch: log requested/delivered batch sizes and caller info.
- Early-exit on terminal-status jobs: makes the "phantom-pull" pattern
  (workers polling against jobs whose NATS stream still exists after
  terminal state) visible from the per-job log.
- NATS-unavailable 503: mirror the existing module-logger warning.
- Result POST: mirror the existing "Queued pipeline result" line with
  user/token_id added. Module-logger copy stays for ops monitoring.

Co-Authored-By: Claude <noreply@anthropic.com>
Operators viewing the per-job log have to compute throughput by hand from
started_at + processed/total to tell stalled-vs-slow vs healthy-but-throttled
jobs apart. Add a rate line emitted from _update_job_progress on
every process/results stage update:

    Job N throughput: elapsed=Xm YYs, processed=P/T, rate=R imgs/min, ETA=...

Intentionally a plain division over total elapsed, not a rolling-window
forecast — cheap to compute, accurate enough to spot a stall, and easy to
read from one line. Skipped defensively when started_at is None, elapsed
<3s, or the stage is not process/results. Emits rate=0.0 / ETA=unknown
when processed_count is still zero.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 15, 2026 23:32
@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: 14d405e
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/69e02fc3fe9648000846b629

@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-ssec canceled.

🔨 Latest commit: 14d405e
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-ssec/deploys/69e02fc3136f3c00070e5d4d

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2417c114-7824-4cee-af70-b39444313e04

📥 Commits

Reviewing files that changed from the base of the PR and between cef5a68 and 14d405e.

📒 Files selected for processing (3)
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/views.py
📝 Walkthrough

Added helper functions to compute and emit job throughput metrics and estimated completion time. Integrated per-job logging into the task and result endpoints to track processing rates and user/token context. Added comprehensive test coverage for throughput calculations and endpoint logging behavior.

Changes

  • Job Throughput Logging — ami/jobs/tasks.py: Added _format_elapsed() to convert seconds into a human-readable Hh Mm Ss/Mm Ss format. Added _log_job_throughput() to calculate processing rate (imgs/min) and ETA and emit them via job.logger.info for process and results stages; handles missing/invalid data gracefully.
  • Endpoint Per-Job Logging — ami/jobs/views.py: Integrated per-job logging into the tasks and result actions to record user email/token context, task counts, processing warnings, and pipeline results. Logs are emitted to job.logger alongside module-level logs.
  • Throughput Logging Tests — ami/jobs/tests/test_jobs.py: Added a TestJobThroughputLogging suite to verify log formatting, throughput calculation, ETA computation, and edge cases (missing started_at, zero processed count, unsupported stages). Extended endpoint tests to assert per-job logging side effects for task fetching, result queuing, and non-active job polling.
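The Hh Mm Ss/Mm Ss rendering can be sketched from the log samples in this PR (2m 11s, 1h 47m 16s, 0m 00s). The zero-padding rules below are inferred from those samples, not copied from the PR's actual _format_elapsed:

```python
def format_elapsed(seconds: float) -> str:
    """Human-readable elapsed time: "Hh Mm Ss" above an hour, else "Mm Ss"."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes}m {secs:02d}s"
    return f"{minutes}m {secs:02d}s"
```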

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — coverage is 56.25%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — the title accurately and specifically describes the main change: per-job logging of which user and API token requests tasks during ML job processing.
  • Description check ✅ — the description is comprehensive, with clear detail on changes, testing, and validation.


Contributor

Copilot AI left a comment


Pull request overview

Adds per-job observability log lines for async_api ML jobs so worker polling, result posting, and throughput/ETA are visible in GET /api/v2/jobs/{id}/logs.stdout.

Changes:

  • Log task-fetch activity (including terminal-job early exit) to the per-job logger in JobViewSet.tasks().
  • Mirror result-POST “Queued pipeline result …” to the per-job logger in JobViewSet.result().
  • Add _log_job_throughput to emit throughput/ETA lines on process/results progress updates.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
ami/jobs/views.py Adds per-job logging for task fetches, terminal-job early exit, and result-POST queuing.
ami/jobs/tasks.py Adds throughput/ETA computation and logs it during progress updates.
ami/jobs/tests/test_jobs.py Adds unit tests asserting the new per-job log lines and throughput behavior.


Comment thread ami/jobs/views.py Outdated
Comment thread ami/jobs/views.py Outdated
Comment thread ami/jobs/tasks.py Outdated
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
ami/jobs/views.py (1)

270-271: Extract actor-context building into a shared helper to avoid drift.

The same user_desc/token_id logic appears in both actions. Centralizing it will keep formatting consistent and make future adjustments safer.

♻️ Proposed refactor
 class JobViewSet(DefaultViewSet, ProjectMixin):
+    @staticmethod
+    def _actor_log_context(request) -> tuple[str, int | None]:
+        user_desc = getattr(request.user, "email", None)
+        if not user_desc:
+            user_pk = getattr(request.user, "pk", None)
+            user_desc = f"user_id={user_pk}" if user_pk is not None else "unknown"
+        token_id = getattr(request.auth, "pk", None)
+        return user_desc, token_id
+
     @action(detail=True, methods=["post"], name="tasks")
     def tasks(self, request, pk=None):
@@
-        user_desc = getattr(request.user, "email", None) or str(request.user)
-        token_id = getattr(request.auth, "pk", None)
+        user_desc, token_id = self._actor_log_context(request)
@@
     @action(detail=True, methods=["post"], name="result")
     def result(self, request, pk=None):
@@
-        user_desc = getattr(request.user, "email", None) or str(request.user)
-        token_id = getattr(request.auth, "pk", None)
+        user_desc, token_id = self._actor_log_context(request)

Also applies to: 330-331

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/views.py` around lines 270 - 271, Extract the repeated logic that
builds user_desc and token_id into a single helper (e.g.,
get_actor_context(request) or build_actor_context(request)) that returns a dict
or tuple with user_desc and token_id computed via getattr(request.user, "email",
None) or str(request.user) and getattr(request.auth, "pk", None); replace the
duplicated lines in the two action handlers (the blocks that currently set
user_desc = ... and token_id = ...) to call this helper and use its return
value, and add a small unit/test or docstring so future changes are made in one
place.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eea1b9e3-3475-42aa-b412-51e7354759e4

📥 Commits

Reviewing files that changed from the base of the PR and between 5090057 and cef5a68.

📒 Files selected for processing (3)
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/views.py

mihow and others added 2 commits April 15, 2026 17:15
- Fix 1 (token leak): introduce _actor_log_context() helper that truncates
  Token.pk to an 8-char fingerprint before writing to job.logs.stdout.
  DRF Token.pk IS the 40-char bearer secret, so logging the full value
  exposed credentials to every project member.
- Fix 2 (noisy empty polls): tasks() now logs at DEBUG when delivered==0
  so zero-task polls stay out of the operator-facing per-job log feed;
  INFO is kept for delivered>0. Three unit tests cover the split.
- Fix 3 (throughput log inside atomic): wrap _log_job_throughput() in
  try/except so a logging failure does not roll back the progress update.
  Warning is emitted on the module logger to avoid recursion.

Co-Authored-By: Claude <noreply@anthropic.com>
Fires on every async_api task_postrun — always true under normal
operation (Celery task ends when images are queued; async workers drive
the actual stages afterward). A genuinely stuck async job shows up as
absence of progress/throughput lines, not this message. Same rationale
as the delivered=0 → DEBUG move earlier in this PR.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow merged commit 701d5f1 into main Apr 16, 2026
7 checks passed
@mihow mihow deleted the worktree-job-logging-and-cleanup branch April 16, 2026 01:20
