
feat(jobs): show who requests which tasks in ML processing job logs #1238

Merged
mihow merged 4 commits into main from worktree-job-logging-and-cleanup
Apr 16, 2026

Conversation

@mihow
Collaborator

@mihow mihow commented Apr 15, 2026

What this does (plain English)

While an ML job is running, the job's log view was mostly empty during the long stretches where workers were actually doing the work. This PR fills that in: every worker poll, every result POST back from a worker, and a progress-rate line all show up in the job log now — along with which user and API token drove each request. Makes it much easier to tell at a glance whether a job is healthy, slow, or stuck, and who's doing what.


Summary

Adds structured observability log lines to the per-job logger for async_api ML jobs, so worker activity is visible from the job log view (GET /api/v2/jobs/{id}/logs.stdout).

Four additions in ami/jobs/views.py and ami/jobs/tasks.py:

  • Task-fetch lines (JobViewSet.tasks()): Tasks fetched: requested=..., delivered=..., user=..., token_id=... on every worker poll. Fills in the silence window between Created NATS stream and the first Updated job progress line — previously the job log showed nothing during this stretch even though workers were actively polling.
  • Result-POST mirror lines (JobViewSet.result()): Queued pipeline result: task_id=..., reply_subject=..., user=... mirrors the existing module-logger emission to the per-job log. Module-logger copy is retained for ops-level monitoring.
  • Throughput / ETA lines (_log_job_throughput in ami/jobs/tasks.py): Job {id} throughput: elapsed=..., processed=P/T, rate=R imgs/min, ETA=... on process/results stage progress updates. Intentionally a plain processed/elapsed division rather than a rolling-window estimator — accurate enough to distinguish stalled-vs-slow from healthy-but-throttled, cheap, easy to eyeball. Guarded by started_at != None and skipped on non-process/results stages.
  • Phantom-pull early-exit line: POST /tasks/ against a terminal job now logs Tasks requested for non-active job (status=...); returning empty. to the per-job log instead of silently returning {"tasks": []}. Restores the signal used to detect workers still polling terminal jobs.
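The throughput line's arithmetic is simple enough to sketch in a few lines. This is an illustration only — the helper name compute_throughput and its return shape are invented here; the PR's actual function is _log_job_throughput, which also formats and emits the log line:

```python
from datetime import datetime, timezone


def compute_throughput(started_at, processed, total, now=None):
    """Plain processed/elapsed rate and ETA, per the description above.

    Returns (rate_imgs_per_min, eta_seconds), or None when the guard
    conditions apply (no started_at, or too little elapsed time).
    """
    if started_at is None:
        return None  # guard: progress update arrived before the job started
    now = now or datetime.now(timezone.utc)
    elapsed = (now - started_at).total_seconds()
    if elapsed < 3:
        return None  # guard: dividing over a tiny window is meaningless
    if processed <= 0:
        return 0.0, None  # rendered as rate=0.0 / ETA=unknown
    rate = processed / (elapsed / 60.0)  # images per minute
    remaining = max(total - processed, 0)
    eta_seconds = remaining * (elapsed / processed)  # seconds left at current rate
    return rate, eta_seconds
```

One consequence of the plain division: a slow ramp-up inflates the early ETA, which then shrinks as the denominator catches up — exactly the monotonic decrease the E2E run checks for.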

Test plan

  • ami.jobs.tests.test_jobs — unit tests cover all four log lines (65/65 pass)
  • E2E on local compose stack (50-image async_api job) — all four signals observed in the live logs.stdout view; throughput line ETAs monotonically decrease and match DB counts

Summary by CodeRabbit

  • New Features

    • Job progress logging now includes throughput metrics (images per minute) and estimated time to completion for processing stages.
    • Enhanced job event logging with detailed context including user information and task counts.
  • Tests

    • Added comprehensive test coverage for job throughput logging and per-job logging behavior across job endpoints.

E2E validation

Ran against a local compose stack with one async_api ML job: 50-image random collection, pipeline quebec_vermont_moths_2023, ADC worker. Job completed successfully in ~4 minutes (50 images / 483 detections). Pulled the resulting job.logs.stdout via GET /api/v2/jobs/{id}/ and grep-counted each of the four new log lines. User/token identifiers redacted below.

Tasks fetched — 68 lines. delivered value correctly tracks the silence window (8× delivered=8 during initial pulls, 2× delivered=2 partial, 58× delivered=0 during the batch-processing stretch when no work is outstanding):

[… 18:56:28] INFO Tasks fetched: requested=8, delivered=8, user=<user>, token_id=<…>
[… 19:00:25] INFO Tasks fetched: requested=8, delivered=0, user=<user>, token_id=<…>

Queued pipeline result — 84 per-job lines observed; the module-logger copy landed 100× in the django container logs (both emissions sit in the same for loop). The per-job count is ~16% below the module-logger count — likely a Job-row save race when multiple /result/ POSTs land concurrently (job.logger.info appends to a JSONB list on the Job row, and overlapping writes can lose entries). Not a correctness issue for this feature, but worth filing as a follow-up; the module-logger copy was retained precisely so ops doesn't rely solely on the per-job view.

Throughput / ETA — 64 lines, all on process/results stages (none on collect, confirming the stage filter). First line at elapsed=2m 11s (confirming the started_at guard). processed matches SourceImageCollection.source_images_count (50). ETA shrinks monotonically once processing ramps up:

[… 18:58:39] INFO Job N throughput: elapsed=2m 11s, processed=1/50, rate=0.5 imgs/min, ETA=1h 47m 16s
[… 19:00:25] INFO Job N throughput: elapsed=3m 57s, processed=46/50, rate=11.6 imgs/min, ETA=0m 20s
[… 19:00:27] INFO Job N throughput: elapsed=3m 59s, processed=50/50, rate=12.5 imgs/min, ETA=0m 00s

Phantom-pull early-exit — after the job hit SUCCESS, posting an extra /tasks/ call returned {"tasks":[]} and wrote the expected line to logs.stdout:

[… 19:02:41] INFO Tasks requested for non-active job (status=SUCCESS); returning empty. user=<user>, token_id=<…>
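The terminal-status guard that produces that line can be sketched roughly as follows. This is a simplification against a stubbed job object — the terminal-status set and function signature are assumptions; the real code lives in JobViewSet.tasks() and returns a DRF Response:

```python
import logging

# Assumed terminal statuses; the real set comes from the project's job state model.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "CANCELED", "REVOKED"}


def pull_tasks(job, queue, requested, user_desc, token_id):
    """Hand a batch of tasks to a polling worker, or an empty batch with a
    logged explanation when the job has already reached a terminal state."""
    if job.status in TERMINAL_STATES:
        # The phantom-pull case: worker still polling a finished job.
        job.logger.info(
            "Tasks requested for non-active job (status=%s); returning empty. "
            "user=%s, token_id=%s",
            job.status, user_desc, token_id,
        )
        return {"tasks": []}
    batch = queue[:requested]  # delivered may be smaller than requested
    job.logger.info(
        "Tasks fetched: requested=%d, delivered=%d, user=%s, token_id=%s",
        requested, len(batch), user_desc, token_id,
    )
    return {"tasks": batch}
```

The requested/delivered split is what makes the log readable: delivered=0 at a steady cadence means the worker is alive but no work is outstanding, while silence means the worker stopped polling.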

mihow and others added 2 commits April 15, 2026 15:24
Async_api jobs have a silence window between "Created NATS stream" and the
first "Updated job progress" line where Django only logs results coming
back. Operators viewing the per-job log can't tell if a worker is working,
missing, slow, or hitting a wall.

Mirror the task-fetch and result-POST paths to job.logger so the per-job
log view reflects worker polling activity:

- Successful fetch: log requested/delivered batch sizes and caller info.
- Early-exit on terminal-status jobs: makes the "phantom-pull" pattern
  (workers polling against jobs whose NATS stream still exists after
  terminal state) visible from the per-job log.
- NATS-unavailable 503: mirror the existing module-logger warning.
- Result POST: mirror the existing "Queued pipeline result" line with
  user/token_id added. Module-logger copy stays for ops monitoring.

Co-Authored-By: Claude <noreply@anthropic.com>
Operators viewing the per-job log have to compute throughput by hand from
started_at + processed/total to tell stalled-vs-slow vs healthy-but-throttled
jobs apart. Add a rate line emitted from _update_job_progress on
every process/results stage update:

    Job N throughput: elapsed=Xm YYs, processed=P/T, rate=R imgs/min, ETA=...

Intentionally a plain division over total elapsed, not a rolling-window
forecast — cheap to compute, accurate enough to spot a stall, and easy to
read from one line. Skipped defensively when started_at is None, elapsed
<3s, or the stage is not process/results. Emits rate=0.0 / ETA=unknown
when processed_count is still zero.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 15, 2026 23:32
@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: 14d405e
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/69e02fc3fe9648000846b629

@netlify

netlify bot commented Apr 15, 2026

Deploy Preview for antenna-ssec canceled.

🔨 Latest commit: 14d405e
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-ssec/deploys/69e02fc3136f3c00070e5d4d

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2417c114-7824-4cee-af70-b39444313e04

📥 Commits

Reviewing files that changed from the base of the PR and between cef5a68 and 14d405e.

📒 Files selected for processing (3)
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/views.py
📝 Walkthrough

Added helper functions to compute and emit job throughput metrics and estimated completion time. Integrated per-job logging into the task and result endpoints to track processing rates and user/token context. Added comprehensive test coverage for throughput calculations and endpoint logging behavior.

Changes

  • Job Throughput Logging — ami/jobs/tasks.py: Added _format_elapsed() to convert seconds into a human-readable Hh Mm Ss/Mm Ss format. Added _log_job_throughput() to calculate processing rate (imgs/min) and ETA and emit them via job.logger.info for process and results stages; handles missing/invalid data gracefully.
  • Endpoint Per-Job Logging — ami/jobs/views.py: Integrated per-job logging into the tasks and result actions to record user email/token context, task counts, processing warnings, and pipeline results. Logs are emitted to job.logger alongside module-level logs.
  • Throughput Logging Tests — ami/jobs/tests/test_jobs.py: Added a TestJobThroughputLogging suite to verify log formatting, throughput calculation, ETA computation, and edge cases (missing started_at, zero processed count, unsupported stages). Extended endpoint tests to assert per-job logging side effects for task fetching, result queuing, and non-active job polling.
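The Hh Mm Ss/Mm Ss rendering can be sketched from the log samples in this PR (2m 11s, 1h 47m 16s, 0m 00s). The zero-padding rules below are inferred from those samples, not copied from the PR's actual _format_elapsed:

```python
def format_elapsed(seconds: float) -> str:
    """Human-readable elapsed time: "Hh Mm Ss" above an hour, else "Mm Ss"."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes}m {secs:02d}s"
    return f"{minutes}m {secs:02d}s"
```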

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — coverage is 56.25%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — the title accurately and specifically describes the main change: per-job logging of which user and API token requests tasks during ML job processing.
  • Description check ✅ — the description is comprehensive, with clear detail on changes, testing, and validation.


Contributor

Copilot AI left a comment


Pull request overview

Adds per-job observability log lines for async_api ML jobs so worker polling, result posting, and throughput/ETA are visible in GET /api/v2/jobs/{id}/logs.stdout.

Changes:

  • Log task-fetch activity (including terminal-job early exit) to the per-job logger in JobViewSet.tasks().
  • Mirror result-POST “Queued pipeline result …” to the per-job logger in JobViewSet.result().
  • Add _log_job_throughput to emit throughput/ETA lines on process/results progress updates.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
ami/jobs/views.py Adds per-job logging for task fetches, terminal-job early exit, and result-POST queuing.
ami/jobs/tasks.py Adds throughput/ETA computation and logs it during progress updates.
ami/jobs/tests/test_jobs.py Adds unit tests asserting the new per-job log lines and throughput behavior.


Comment thread ami/jobs/views.py Outdated
Comment thread ami/jobs/views.py Outdated
Comment thread ami/jobs/tasks.py Outdated
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
ami/jobs/views.py (1)

270-271: Extract actor-context building into a shared helper to avoid drift.

The same user_desc/token_id logic appears in both actions. Centralizing it will keep formatting consistent and make future adjustments safer.

♻️ Proposed refactor
 class JobViewSet(DefaultViewSet, ProjectMixin):
+    @staticmethod
+    def _actor_log_context(request) -> tuple[str, int | None]:
+        user_desc = getattr(request.user, "email", None)
+        if not user_desc:
+            user_pk = getattr(request.user, "pk", None)
+            user_desc = f"user_id={user_pk}" if user_pk is not None else "unknown"
+        token_id = getattr(request.auth, "pk", None)
+        return user_desc, token_id
+
     @action(detail=True, methods=["post"], name="tasks")
     def tasks(self, request, pk=None):
@@
-        user_desc = getattr(request.user, "email", None) or str(request.user)
-        token_id = getattr(request.auth, "pk", None)
+        user_desc, token_id = self._actor_log_context(request)
@@
     @action(detail=True, methods=["post"], name="result")
     def result(self, request, pk=None):
@@
-        user_desc = getattr(request.user, "email", None) or str(request.user)
-        token_id = getattr(request.auth, "pk", None)
+        user_desc, token_id = self._actor_log_context(request)

Also applies to: 330-331

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/views.py` around lines 270 - 271, Extract the repeated logic that
builds user_desc and token_id into a single helper (e.g.,
get_actor_context(request) or build_actor_context(request)) that returns a dict
or tuple with user_desc and token_id computed via getattr(request.user, "email",
None) or str(request.user) and getattr(request.auth, "pk", None); replace the
duplicated lines in the two action handlers (the blocks that currently set
user_desc = ... and token_id = ...) to call this helper and use its return
value, and add a small unit/test or docstring so future changes are made in one
place.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eea1b9e3-3475-42aa-b412-51e7354759e4

📥 Commits

Reviewing files that changed from the base of the PR and between 5090057 and cef5a68.

📒 Files selected for processing (3)
  • ami/jobs/tasks.py
  • ami/jobs/tests/test_jobs.py
  • ami/jobs/views.py

mihow and others added 2 commits April 15, 2026 17:15
- Fix 1 (token leak): introduce _actor_log_context() helper that truncates
  Token.pk to an 8-char fingerprint before writing to job.logs.stdout.
  DRF Token.pk IS the 40-char bearer secret, so logging the full value
  exposed credentials to every project member.
- Fix 2 (noisy empty polls): tasks() now logs at DEBUG when delivered==0
  so zero-task polls stay out of the operator-facing per-job log feed;
  INFO is kept for delivered>0. Three unit tests cover the split.
- Fix 3 (throughput log inside atomic): wrap _log_job_throughput() in
  try/except so a logging failure does not roll back the progress update.
  Warning is emitted on the module logger to avoid recursion.

Co-Authored-By: Claude <noreply@anthropic.com>
Fires on every async_api task_postrun — always true under normal
operation (Celery task ends when images are queued; async workers drive
the actual stages afterward). A genuinely stuck async job shows up as
absence of progress/throughput lines, not this message. Same rationale
as the delivered=0 → DEBUG move earlier in this PR.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow merged commit 701d5f1 into main Apr 16, 2026
7 checks passed
@mihow mihow deleted the worktree-job-logging-and-cleanup branch April 16, 2026 01:20
