feat: collect model serving metrics artifacts #7

Merged: Ker102 merged 1 commit into main from feat/metrics-artifacts on May 8, 2026

Ker102 (Owner) commented on May 8, 2026

Closes #4

Summary

  • Classify endpoint type as offline, managed, self-hosted, or AMD GPU-hosted.
  • Scrape vLLM-compatible /metrics before and after runs and store raw .prom artifacts when available.
  • Capture local GPU snapshot evidence through amd-smi or rocm-smi, with graceful fallback when neither exists.
  • Document useful non-GPU work while DigitalOcean/AMD support is pending.
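
The endpoint classification in the first bullet could look roughly like the sketch below. This is not the code from this PR: the function name `classify_endpoint_sketch`, the `amd_gpu` flag, and the `MANAGED_SUFFIXES` list are hypothetical, and only the four type names come from the PR itself.

```python
from urllib.parse import urlparse

# Hypothetical list of hosted-provider domains; the PR's real rules are not shown here.
MANAGED_SUFFIXES = ("openai.com", "anthropic.com")


def classify_endpoint_sketch(base_url, amd_gpu=False, managed_suffixes=MANAGED_SUFFIXES):
    """Return one of: offline, managed, self-hosted, amd-gpu-hosted."""
    if not base_url:
        # No endpoint configured: treat the run as offline.
        return "offline"
    host = (urlparse(base_url).hostname or "").lower()
    if amd_gpu:
        # Assumption: the operator flags AMD GPU hosts explicitly.
        return "amd-gpu-hosted"
    # Match the domain itself or any subdomain, but not look-alike names.
    if any(host == s or host.endswith("." + s) for s in managed_suffixes):
        return "managed"
    return "self-hosted"
```

Matching on a dot-prefixed suffix keeps a host such as `notopenai.com` from being classified as managed.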

Test Plan

  • python -m unittest discover -s tests -v
  • python -m ruff check src tests
  • python -m mypy src
  • python -m pip_audit . --skip-editable
  • offline CLI smoke run writes metrics.json
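
A smoke-run check on `metrics.json` might be sketched as follows. The key names (`endpoint.before`, `endpoint.after`, `model_calls`, `notes`) are the ones named in the review walkthrough of this PR; the helper names, the value types, and the sample values are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path


def write_metrics_json(run_dir, before, after, model_calls, notes):
    """Write a metrics.json with before/after endpoint snapshots."""
    payload = {
        "endpoint": {"before": before, "after": after},
        "model_calls": model_calls,
        "notes": notes,
    }
    path = Path(run_dir) / "metrics.json"
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


def missing_metrics_keys(path):
    """List expected keys that are absent from a metrics.json file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    endpoint = data.get("endpoint", {})
    missing = [f"endpoint.{k}" for k in ("before", "after") if k not in endpoint]
    missing += [k for k in ("model_calls", "notes") if k not in data]
    return missing


# Usage: an offline run still produces a complete file (sample values are made up).
with tempfile.TemporaryDirectory() as run_dir:
    path = write_metrics_json(run_dir, {"type": "offline"}, {"type": "offline"}, 0, [])
    print(missing_metrics_keys(path))
```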

Security Impact

  • No secrets added.
  • Metrics artifacts store endpoint hostname only, not API keys.
  • GPU/vLLM collection is best-effort and does not fail offline runs.
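
Storing only the hostname can be illustrated with the standard library. The sketch below is an assumption about how such a helper might work, not the PR's `_safe_host()`; the name `safe_host_sketch` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def safe_host_sketch(base_url):
    """Return only the lowercase hostname of base_url, or None if absent."""
    if not base_url:
        return None
    # Without a scheme, urlparse treats the whole string as a path, so add "//".
    target = base_url if "://" in base_url else "//" + base_url
    # .hostname drops credentials, port, path, and query string.
    return urlparse(target).hostname
```

Because `hostname` excludes userinfo and the query string, a URL such as `https://user:sk-secret@llm.example.com:8000/v1?api_key=abc` is reduced to `llm.example.com` before it is written to an artifact.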

Summary by CodeRabbit

Release Notes

  • New Features

    • Added metrics collection before and after remediation runs to track model endpoint performance
    • Added support for scraping vLLM Prometheus metrics from running model endpoints
    • Added GPU status snapshots from AMD tools when available
  • Documentation

    • Enhanced runbook with detailed metrics evidence collection procedures and artifact review guidance
    • Expanded operational evidence documentation to include endpoint types and available metrics
    • Added guidance for AMD GPU-hosted model evidence collection

coderabbitai (Bot) commented on May 8, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c01d0656-3990-4fc5-8823-24fc163eb6a7

📥 Commits

Reviewing files that changed from the base of the PR and between d9a4072 and 2974a59.

📒 Files selected for processing (9)
  • CHANGELOG.md
  • README.md
  • docs/case-study-outline.md
  • docs/case-study.md
  • docs/cost-report.md
  • docs/runbook.md
  • src/nullstate/cli.py
  • src/nullstate/metrics.py
  • tests/test_metrics.py

📝 Walkthrough

This PR adds end-to-end vLLM metrics collection and GPU snapshot capture to the CLI run workflow. New metrics functions classify endpoints, scrape /metrics before and after remediation, write Prometheus artifacts, and attempt GPU status snapshots. The CLI integrates these calls into the run lifecycle and stores results in metrics.json with before/after endpoint data.
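
The before/after scrape can be sketched with the standard library alone. The code below is a simplified illustration, not the PR's `collect_run_metrics()`: the function names are hypothetical, the parser handles only plain `name{labels} value [timestamp]` sample lines from the Prometheus text format, and it does not write `.prom` artifacts.

```python
import http.client
import urllib.request


def scrape_metrics_text(base_url, timeout=5.0):
    """Best-effort GET of <base_url>/metrics; None on network or URL errors."""
    url = base_url.rstrip("/") + "/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # URLError is a subclass of OSError; ValueError covers malformed URLs.
        return None


def parse_prom_text(text):
    """Map each sample (name plus label set) to its float value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            end = line.rfind("}") + 1
            key, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            key, rest = parts[0], parts[1:]
        try:
            samples[key] = float(rest[0])  # rest[1], if present, is a timestamp
        except (IndexError, ValueError):
            continue  # ignore lines this simple parser cannot read
    return samples


def counter_deltas(before, after):
    """Difference (after - before) for samples present in both scrapes."""
    return {k: after[k] - before[k] for k in after if k in before}
```

For example, if a counter sample reads 100.0 in the "before" scrape and 250.0 in the "after" scrape, `counter_deltas` reports 150.0 for that sample. Returning `None` from the scrape on failure lets the caller record a missing snapshot without aborting the run.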

Changes

vLLM Metrics Collection & GPU Snapshots

| Layer / File(s) | Summary |
|---|---|
| Metrics Collection API: src/nullstate/metrics.py | New functions: classify_endpoint() returns endpoint type (offline, managed, self-hosted, amd-gpu-hosted); collect_run_metrics() fetches /metrics, writes stage-specific .prom artifacts, parses vLLM counters, and captures GPU snapshots; gpu_snapshot() attempts amd-smi/rocm-smi; _safe_host() extracts hostname from base_url. |
| CLI Run Command Integration: src/nullstate/cli.py | Calls collect_run_metrics(..., stage="before") after Terraform plan, and collect_run_metrics(..., stage="after") after remediation. Updates metrics.json structure to include endpoint.before and endpoint.after snapshots alongside existing model_calls and notes. |
| Metrics Unit Tests: tests/test_metrics.py | Adds test_classifies_offline_managed_and_amd_endpoints, test_collect_run_metrics_writes_vllm_snapshots, and test_gpu_snapshot_is_available_without_gpu_tools to cover endpoint classification, metrics fetching with artifact writing, and GPU status fallback. |
| Documentation & Changelog: docs/runbook.md, README.md, CHANGELOG.md, docs/case-study.md, docs/case-study-outline.md, docs/cost-report.md | Runbook documents metrics scraping from NULLSTATE_LLM_BASE_URL/metrics, .prom artifact generation, parsed metrics.json storage, and GPU tool fallback. README lists new artifacts. Case study and outline describe endpoint types and AMD GPU evidence collection. Cost report tracks inference mode per run. Changelog entry added. |
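
The GPU tool fallback could be approximated as below. This is a sketch under stated assumptions rather than the PR's `gpu_snapshot()`: the function name, the returned dictionary shape, and the choice to invoke each tool without arguments are all hypothetical, and real subcommands or flags are deliberately not guessed.

```python
import shutil
import subprocess

# Assumption: bare tool names; real subcommands or flags are not guessed here.
DEFAULT_TOOLS = (("amd-smi",), ("rocm-smi",))


def gpu_snapshot_sketch(candidates=DEFAULT_TOOLS, timeout=10.0):
    """Try each tool in order; never raise when none is usable."""
    for argv in candidates:
        exe = shutil.which(argv[0])
        if exe is None:
            continue  # tool not installed, try the next one
        try:
            proc = subprocess.run(
                [exe, *argv[1:]],
                capture_output=True,
                text=True,
                timeout=timeout,
                check=False,
            )
        except (OSError, subprocess.TimeoutExpired):
            continue  # tool present but unusable, try the next one
        if proc.returncode == 0:
            return {"available": True, "tool": argv[0], "output": proc.stdout}
    # Graceful fallback: report that no snapshot could be taken.
    return {"available": False, "tool": None, "output": ""}
```

Returning a plain "not available" record instead of raising keeps offline runs and machines without AMD tooling from failing.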

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Metrics gathered, snapshots clear,
GPU dust and endpoints near,
Before and after, side by side,
Evidence flows with case-study pride!


Ker102 force-pushed the feat/metrics-artifacts branch from 28ec891 to 2974a59 on May 8, 2026, 15:42
Ker102 merged commit 92cba8c into main on May 8, 2026
4 of 5 checks passed
Ker102 deleted the feat/metrics-artifacts branch on May 8, 2026, 15:44
Development

Successfully merging this pull request may close these issues.

feat: collect vLLM and AMD GPU metrics artifacts

1 participant