feat: collect model serving metrics artifacts by Ker102 · Pull Request #7 · Ker102/nullstate-cli

Ker102 · 2026-05-08T14:51:13Z

Closes #4

Summary

Classify endpoint type as offline, managed, self-hosted, or AMD GPU-hosted.
Scrape vLLM-compatible /metrics before and after runs and store raw .prom artifacts when available.
Capture local GPU snapshot evidence through amd-smi or rocm-smi, with graceful fallback when neither exists.
Document useful non-GPU work while DigitalOcean/AMD support is pending.
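
For illustration, a minimal sketch of how the four endpoint types could be told apart. The PR text does not show the actual rules, so everything here is an assumption: the name `classify_endpoint_sketch`, the `amd_gpu` flag, and the heuristics (no URL means offline, a loopback or private address means self-hosted, any other host means managed).

```python
from __future__ import annotations

import ipaddress
from urllib.parse import urlparse


def classify_endpoint_sketch(base_url: str | None, amd_gpu: bool = False) -> str:
    """Return one of: offline, managed, self-hosted, amd-gpu-hosted.

    Hypothetical heuristics, not the rules used in src/nullstate/metrics.py.
    """
    if not base_url:
        return "offline"  # no endpoint configured
    if amd_gpu:
        return "amd-gpu-hosted"  # assumed to be signalled explicitly by the caller
    host = urlparse(base_url).hostname or ""
    if host == "localhost":
        return "self-hosted"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a DNS name is assumed to be a managed service.
        return "managed"
    return "self-hosted" if (addr.is_private or addr.is_loopback) else "managed"
```

A real classifier would likely need an explicit configuration value rather than guessing from the URL, since a self-hosted server can sit behind a public DNS name.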

Test Plan

python -m unittest discover -s tests -v
python -m ruff check src tests
python -m mypy src
python -m pip_audit . --skip-editable
offline CLI smoke run writes metrics.json

Security Impact

No secrets added.
Metrics artifacts store endpoint hostname only, not API keys.
GPU/vLLM collection is best-effort and does not fail offline runs.
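
A minimal sketch of the hostname-only rule, assuming the base URL is parsed with the standard library. The function name `safe_host_sketch` and the example URL are hypothetical; the PR's own `_safe_host()` may differ.

```python
from urllib.parse import urlparse


def safe_host_sketch(base_url: str) -> str:
    """Return only the hostname of base_url.

    Credentials, port, path, and query string are dropped, so an API key
    embedded in the URL never reaches the metrics artifact.
    """
    return urlparse(base_url).hostname or ""
```

For example, `safe_host_sketch("https://user:sk-secret@llm.example.com:8443/v1?api_key=abc")` returns `llm.example.com`.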

Summary by CodeRabbit

Release Notes

New Features
- Added metrics collection before and after remediation runs to track model endpoint performance
- Added support for scraping vLLM Prometheus metrics from running model endpoints
- Added GPU status snapshots from AMD tools when available
Documentation
- Enhanced runbook with detailed metrics evidence collection procedures and artifact review guidance
- Expanded operational evidence documentation to include endpoint types and available metrics
- Added guidance for AMD GPU-hosted model evidence collection

coderabbitai · 2026-05-08T14:51:22Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c01d0656-3990-4fc5-8823-24fc163eb6a7

📥 Commits

Reviewing files that changed from the base of the PR and between d9a4072 and 2974a59.

📒 Files selected for processing (9)

CHANGELOG.md
README.md
docs/case-study-outline.md
docs/case-study.md
docs/cost-report.md
docs/runbook.md
src/nullstate/cli.py
src/nullstate/metrics.py
tests/test_metrics.py

Walkthrough

This PR adds end-to-end vLLM metrics collection and GPU snapshot capture to the CLI run workflow. New metrics functions classify endpoints, scrape /metrics before and after remediation, write Prometheus artifacts, and attempt GPU status snapshots. The CLI integrates these calls into the run lifecycle and stores results in metrics.json with before/after endpoint data.
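
The scrape-and-store step described above might look roughly like the following. This is a sketch under stated assumptions, not the code in `src/nullstate/metrics.py`: the function names, the artifact file name, and the choice to sum samples per metric name are all hypothetical. It relies only on the Prometheus text exposition format and on vLLM's `vllm:` metric name prefix.

```python
from __future__ import annotations

import urllib.error
import urllib.request
from pathlib import Path


def fetch_metrics_text(base_url: str, timeout: float = 5.0) -> str | None:
    """Best-effort GET of <base_url>/metrics; returns None instead of raising."""
    url = base_url.rstrip("/") + "/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError, ValueError):
        return None


def parse_vllm_samples(text: str) -> dict[str, float]:
    """Sum sample values per metric name for names starting with 'vllm:'."""
    totals: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            tail = line.rsplit("}", 1)[-1]  # value follows the label set
        else:
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if not name.startswith("vllm:") or not fields:
            continue
        try:
            value = float(fields[0])  # an optional timestamp may follow
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals


def write_prom_artifact(text: str, out_dir: Path, stage: str) -> Path:
    """Store the raw scrape; the file name pattern is a hypothetical choice."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"vllm-metrics-{stage}.prom"
    path.write_text(text, encoding="utf-8")
    return path
```

Calling the three functions once with `stage="before"` and once with `stage="after"` yields two raw artifacts plus two parsed dictionaries whose counter values can be subtracted to estimate per-run deltas.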

Changes

vLLM Metrics Collection & GPU Snapshots

| Layer / File(s) | Summary |
| --- | --- |
| Metrics Collection API `src/nullstate/metrics.py` | New functions: `classify_endpoint()` returns endpoint type (offline, managed, self-hosted, amd-gpu-hosted); `collect_run_metrics()` fetches `/metrics`, writes stage-specific `.prom` artifacts, parses vLLM counters, and captures GPU snapshots; `gpu_snapshot()` attempts `amd-smi`/`rocm-smi`; `_safe_host()` extracts hostname from base_url. |
| CLI Run Command Integration `src/nullstate/cli.py` | Calls `collect_run_metrics(..., stage="before")` after Terraform plan, and `collect_run_metrics(..., stage="after")` after remediation. Updates `metrics.json` structure to include `endpoint.before` and `endpoint.after` snapshots alongside existing `model_calls` and `notes`. |
| Metrics Unit Tests `tests/test_metrics.py` | Adds `test_classifies_offline_managed_and_amd_endpoints`, `test_collect_run_metrics_writes_vllm_snapshots`, and `test_gpu_snapshot_is_available_without_gpu_tools` to cover endpoint classification, metrics fetching with artifact writing, and GPU status fallback. |
| Documentation & Changelog `docs/runbook.md`, `README.md`, `CHANGELOG.md`, `docs/case-study.md`, `docs/case-study-outline.md`, `docs/cost-report.md` | Runbook documents metrics scraping from `NULLSTATE_LLM_BASE_URL/metrics`, `.prom` artifact generation, parsed `metrics.json` storage, and GPU tool fallback. README lists new artifacts. Case study and outline describe endpoint types and AMD GPU evidence collection. Cost report tracks inference mode per run. Changelog entry added. |
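
The GPU tool fallback summarized in the table can be sketched as below. The name `gpu_snapshot_sketch` and the shape of the returned dictionary are assumptions, not the PR's `gpu_snapshot()`. The tools are invoked with no arguments here; which subcommands or flags to pass depends on the installed version and is deliberately left out.

```python
from __future__ import annotations

import shutil
import subprocess


def gpu_snapshot_sketch(timeout: float = 10.0) -> dict[str, object]:
    """Try amd-smi, then rocm-smi; report unavailability instead of raising."""
    for tool in ("amd-smi", "rocm-smi"):
        path = shutil.which(tool)
        if path is None:
            continue  # tool not installed, try the next one
        try:
            proc = subprocess.run(
                [path],  # no arguments: subcommands are left to the reader
                capture_output=True,
                text=True,
                timeout=timeout,
                check=False,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            return {"available": False, "tool": tool, "error": str(exc)}
        return {
            "available": proc.returncode == 0,
            "tool": tool,
            "output": proc.stdout,
        }
    # Neither tool exists: return a well-formed result so offline runs proceed.
    return {"available": False, "tool": None, "error": "amd-smi and rocm-smi not found"}
```

Returning a dictionary in every branch is what lets a caller embed the snapshot in `metrics.json` without special-casing machines that have no GPU tooling.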

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Metrics gathered, snapshots clear,
GPU dust and endpoints near,
Before and after, side by side,
Evidence flows with case-study pride!


feat: collect model serving metrics artifacts

2974a59

Ker102 force-pushed the feat/metrics-artifacts branch from 28ec891 to 2974a59 May 8, 2026 15:42

Ker102 merged commit 92cba8c into main May 8, 2026
4 of 5 checks passed

Ker102 deleted the feat/metrics-artifacts branch May 8, 2026 15:44

This was referenced May 9, 2026

feat: infer scenarios and probe sandbox status #13

Merged

feat: support separate red and blue model endpoints #15

Closed

Ker102 merged 1 commit into main from feat/metrics-artifacts
