Merged
24 commits
0a0d9b9
gather-info: fix DCGM daemon detection, InfiniBand hints, and harden …
Banshee1221 Apr 8, 2026
2395922
gather-info: fix banner and info box rendering in narrow terminals
Banshee1221 Apr 8, 2026
2e5762a
gather-info: implement phase 2 stable identity end-to-end
Banshee1221 Apr 8, 2026
e5c040c
gather-info: complete phase 3/4 triage core hardening
Banshee1221 Apr 8, 2026
f295418
gather-info v0.2.0: fix skipped artifact placeholders, fingerprint un…
Banshee1221 Apr 8, 2026
5b5d40e
gather-info v0.2.1: post-audit triage fixes and single sudo re-exec
Banshee1221 Apr 9, 2026
4edca23
gather-info v0.2.1: final cleanup
Banshee1221 Apr 9, 2026
445e131
gather-info: make --xid-md optional in update-xid-catalog
Banshee1221 Apr 9, 2026
e1f8c1c
feat: add vm-troubleshooting-dashboard; harden gather-info; docs & gi…
Banshee1221 Apr 13, 2026
5b2195b
feat: harden dashboard ingest/API and improve gather-info journal & e…
Banshee1221 Apr 13, 2026
44c7627
feat: harden gather-info for containerized hypervisors and refresh da…
Banshee1221 Apr 16, 2026
3f4f502
feat(vm-diagnostics): HW telemetry collectors, deeper net/IB/journal,…
Banshee1221 Apr 17, 2026
a2e0660
feat(vm-troubleshooting-dashboard): triage UX, issue state, deploy stack
Banshee1221 Apr 20, 2026
5a8fd0c
feat(vm-troubleshooting-dashboard): system-log digestibility — struct…
Banshee1221 Apr 20, 2026
7c488bd
feat(vm-troubleshooting-dashboard): track B polish — stat hierarchy, …
Banshee1221 Apr 20, 2026
43131cb
fix(vm-troubleshooting-dashboard): compact-view readability — group t…
Banshee1221 Apr 20, 2026
c2b05a6
fix(vm-troubleshooting-dashboard): Top Issues grid overflow on narrow…
Banshee1221 Apr 20, 2026
c00687f
refactor(vm-troubleshooting-dashboard): consolidate issue-detail main…
Banshee1221 Apr 20, 2026
14dea88
fix(vm-troubleshooting-dashboard): hide "What happened" section when …
Banshee1221 Apr 20, 2026
32f433b
refactor(vm-troubleshooting-dashboard): polish issue-detail header, t…
Banshee1221 Apr 20, 2026
767b4e1
fix(vm-troubleshooting-dashboard): strip redundant "Title:" prefix fr…
Banshee1221 Apr 20, 2026
25951c2
refactor(vm-troubleshooting-dashboard): unify issue-detail main card …
Banshee1221 Apr 20, 2026
ab0765c
fix(vm-troubleshooting-dashboard): kill prev/next flicker on issue-de…
Banshee1221 Apr 20, 2026
94d8396
Merge pull request #2 from NexGenCloud/dashboard_mvp
Banshee1221 Apr 20, 2026
22 changes: 20 additions & 2 deletions .gitignore
@@ -3,11 +3,19 @@
*.exe
bin/

# Go (repo-wide)
coverage.out
coverage.html
*.coverprofile
*.prof
cpu.prof
mem.prof

# OS metadata
.DS_Store
Thumbs.db

# Editor/IDE
# Editor/IDE (local only; subprojects may un-ignore e.g. !.vscode/extensions.json)
*.swp
*.swo
*~
@@ -16,10 +24,20 @@ Thumbs.db
*.sublime-project
*.sublime-workspace

# Environment and secrets
# Environment and secrets (allow committed templates)
.env
.env.*
!.env.example
!.env*.example
*.pem
*.key
*.p12
*.pfx

# Local tool caches (not shared project config)
.mcp_data/
.serena/
.cursor/
.codex/
.claude/
docs/plans/
50 changes: 29 additions & 21 deletions AGENTS.md
@@ -4,6 +4,17 @@
This repository contains customer-facing support scripts and binaries for collecting diagnostics or applying narrow workarounds on customer VMs.
Assume the operator is a customer or support engineer running the tool locally on a machine we do not control and cannot access directly.

## Monorepo model
- First-class projects live under `customers/<name>/`. Each project has its own **`AGENTS.md`** with verification commands, stack-specific rules, and links to `CODEMAP.md` / `ARCHITECTURE.md` as needed.
- When changing code under a project directory, follow **that project's `AGENTS.md`** for how to build, test, and review.
- New projects may use any stack (Go, Node, Python, shell, etc.). The **root file stays policy and discovery**—not a catalog of every toolchain's commands.

## Cross-project contracts
- **`customers/vm-troubleshooting/`** (`gather-info`) **produces** diagnostic archives (manifest, report stream, triage data, schemas).
- **`customers/vm-troubleshooting-dashboard/`** **consumes** those archives for ingest and UI.
- Authoritative compatibility and versioning rules: **`customers/vm-troubleshooting/SCHEMA-COMPATIBILITY.md`**. Producer or consumer changes that affect archive shape, schema majors, or mirrored JSON files usually belong in a **coordinated** change (both sides + that doc when applicable).
- Dashboard `schemas/` mirrors collector `schemas/`; keep them aligned per each project's `AGENTS.md` checklists.

## Operating constraints
- Prefer self-contained tooling with minimal external dependencies.
- Assume common Linux distributions first: Ubuntu 20.04/22.04/24.04. Treat other distros as best-effort unless explicitly supported.
@@ -37,34 +48,31 @@ Assume the operator is a customer or support engineer running the tool locally o
- Prefer machine-readable sources when available.
- Do not silently ignore errors that affect support value; record them in output.

## Verification
Before considering work complete, run the narrowest relevant checks that exist.
## Verification (repository root only)
Before considering work complete, run the narrowest relevant checks for what you changed.

Current commands:
**Root-level assets only** (e.g. scripts in the repo root):
- Bash lint: `shellcheck nvidia-drm-disable-modeset.sh`
- Bash syntax: `bash -n nvidia-drm-disable-modeset.sh`

If a Go implementation exists:
- Format: `cd customers/vm-troubleshooting && gofmt -w .`
- Vet: `cd customers/vm-troubleshooting && go vet ./...`
- Test: `cd customers/vm-troubleshooting && go test ./...`
- Build: `cd customers/vm-troubleshooting && CGO_ENABLED=0 go build ./cmd/gather-info`
**Anything under `customers/`:** use that project's **`AGENTS.md`** verification section (Go, frontend, etc.).

## Repo structure
- `customers/`: customer-run support tooling and related assets.
- Root scripts: focused one-off support or remediation utilities.
- `customers/vm-troubleshooting/`: current Go-based diagnostics collector.
- `customers/vm-troubleshooting/CODEMAP.md`: architecture and collector map for the diagnostics collector.
- `customers/`: shipped or support-facing tools and assets; each subfolder is a project with its own `AGENTS.md`.
- Root scripts: focused one-off support or remediation utilities (verify with root-only commands above).
- `customers/vm-troubleshooting/`: diagnostics collector (`gather-info`). Maps: `CODEMAP.md`, `ARCHITECTURE.md`.
- `customers/vm-troubleshooting-dashboard/`: dashboard (Go API + UI). Map: `CODEMAP.md`.
- `docs/`: planning notes and indexes (`docs/README.md`, `docs/architecture.md` point to project-local maps).

## Doc maintenance (cross-cutting)
When architecture or boundaries change materially, update the relevant **project** orientation docs in the same change:
- Collector: `customers/vm-troubleshooting/CODEMAP.md`; `ARCHITECTURE.md` when pipeline/types/modes narratives change.
- Dashboard: `customers/vm-troubleshooting-dashboard/CODEMAP.md` when package layout or ingest/API flow changes.

For **Go-based diagnostics** behavior (modular collectors, timeouts, static builds, graceful skips), follow the collector project's `AGENTS.md`; do not duplicate those rules here.

## Diagnostics collector guidance
For the Go-based diagnostics collector:
- Preserve user-visible behavior and output structure unless there is a clear improvement.
- Keep collectors modular.
- Keep command execution timeout-bound.
- Check command availability before execution.
- Make unsupported probes report "not available" rather than fail the whole run.
- Prefer static builds for portability unless there is a strong reason not to.
- Keep `customers/vm-troubleshooting/CODEMAP.md` current when changing architecture or collector ownership.
## Optional: CI at scale
If you add continuous integration, **path-scoped** jobs (build/test only projects whose paths changed) are a practical way to keep feedback fast as `customers/` grows. This is optional operational practice, not a requirement of this repository.

## Done means
A change is not done until:
6 changes: 2 additions & 4 deletions CLAUDE.md
@@ -1,9 +1,7 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
See @AGENTS.md for repository-wide policy, safety, monorepo discovery, and cross-project contracts.

See @AGENTS.md for all project instructions, constraints, and verification commands.

For the diagnostics collector, also see `customers/vm-troubleshooting/AGENTS.md` and `customers/vm-troubleshooting/CODEMAP.md`.
This repository is a **monorepo**. Prefer the **nearest** `CLAUDE.md` or `AGENTS.md` under the tree you are editing (for example `customers/vm-troubleshooting/` or `customers/vm-troubleshooting-dashboard/`). Keep changes scoped to the active project unless the task explicitly spans multiple projects or shared contracts.

ShellCheck is pre-allowed in `.claude/settings.local.json`.
34 changes: 33 additions & 1 deletion README.md
@@ -1 +1,33 @@
This repository consists of support scripts that Hyperstack Customer Experience team would send to Hyperstack users, with instructions, to fix or workaround issues, or to gather information related to a ticket.
# support-scripts

This repository holds **customer-run support tooling**: scripts and small programs that operators use on Linux VMs to gather diagnostics or apply narrow, documented workarounds. Tools are meant to work on partially broken hosts, with minimal assumptions about network, GPU, containers, or privilege.

## What lives here

| Area | Purpose |
|------|---------|
| [`customers/`](customers/) | Packaged tools distributed with support (Go binaries, dashboards, assets). |
| [`docs/`](docs/) | Planning notes and indexes; authoritative behavior is always the code plus per-project `AGENTS.md`. |
| Root `.sh` scripts | Focused one-off utilities (when present). |

## Main projects

- **[`customers/vm-troubleshooting/`](customers/vm-troubleshooting/)** — `gather-info`: static Go binary that collects VM diagnostics into a single `.tar.gz` with manifest, report stream, and summaries.
  - Quick map: [`customers/vm-troubleshooting/CODEMAP.md`](customers/vm-troubleshooting/CODEMAP.md)
  - Extended reference: [`customers/vm-troubleshooting/ARCHITECTURE.md`](customers/vm-troubleshooting/ARCHITECTURE.md)

- **[`customers/vm-troubleshooting-dashboard/`](customers/vm-troubleshooting-dashboard/)** — Local web app (Go API + SQLite + React) to ingest and browse `gather-info` archives.
  - Map: [`customers/vm-troubleshooting-dashboard/CODEMAP.md`](customers/vm-troubleshooting-dashboard/CODEMAP.md)

## Contributing / agents

- Repo-wide policy, monorepo discovery, and **root-only** verification (e.g. root shell scripts): [`AGENTS.md`](AGENTS.md)
- **Per-project** build/test commands and stack rules: that project's [`AGENTS.md`](customers/vm-troubleshooting/AGENTS.md) (collector) or [`AGENTS.md`](customers/vm-troubleshooting-dashboard/AGENTS.md) (dashboard).

## Plans and hardening

Workstreams and follow-up hardening are described under [`docs/plans/`](docs/plans/). Those documents are **specifications and history**: always confirm behavior in the current tree and tests (for example [`docs/plans/post-audit-hardening.md`](docs/plans/post-audit-hardening.md) outlines post-audit dashboard and collector hardening goals).

## Quick verification

Use the **`AGENTS.md` inside the project** you are changing (collector or dashboard) for exact commands. The repo root [`AGENTS.md`](AGENTS.md) only lists checks for **root-level** assets (e.g. specific shell scripts).
39 changes: 39 additions & 0 deletions customers/vm-troubleshooting-dashboard/.dockerignore
@@ -0,0 +1,39 @@
.git
.gitignore
.dockerignore
Dockerfile
docker-compose.yml
docker-compose.*.yml
*.md

# Build artefacts
bin/
dashboard
frontend/dist/
frontend/node_modules/
node_modules/

# Runtime state
dashboard-data/
*.db
*.db-journal
*.db-wal
*.db-shm

# Local dev / editor
.env
.env.*
!.env.example
.idea/
.vscode/
.DS_Store
*.swp

# Test output
*.test
*.out
coverage.*
*.coverprofile

# CI / tooling
.github/
24 changes: 24 additions & 0 deletions customers/vm-troubleshooting-dashboard/.env.example
@@ -0,0 +1,24 @@
# Copy to .env and fill in. Do not commit the real .env.

# Public hostname that Caddy serves on. Must resolve to this host and
# have :80/:443 reachable from the internet for Let's Encrypt to issue.
TRIAGE_HOSTNAME=triage.ngbackend.cloud

# Address Let's Encrypt registers the ACME account under; used for
# expiry warnings. Keep it a real, monitored inbox.
ACME_EMAIL=cx-ops@example.com

# From Authentik provider (DEPLOYMENT.md §8a)
OAUTH2_PROXY_CLIENT_ID=
OAUTH2_PROXY_CLIENT_SECRET=

# Generate with: openssl rand -base64 32 | tr -- '+/' '-_'
OAUTH2_PROXY_COOKIE_SECRET=

# Optional build metadata (surfaces in /api/v1/version if wired in main.go)
TRIAGE_VERSION=dev
TRIAGE_COMMIT=unknown
TRIAGE_BUILD_DATE=unknown

# Container timezone (affects log timestamps only; DB stores UTC)
TZ=UTC
44 changes: 44 additions & 0 deletions customers/vm-troubleshooting-dashboard/.gitignore
@@ -0,0 +1,44 @@
# Build output
bin/
# Leading slash anchors to this project's root only. A bare `dashboard`
# rule would match the `frontend/src/components/dashboard/` directory too.
/dashboard

# Runtime / local data
dashboard-data/
*.db
*.db-journal
*.db-wal
*.db-shm

# Frontend
frontend/dist/
frontend/node_modules/

# Go test & coverage
*.test
*.out
coverage.out
coverage.html
*.coverprofile

# OS
.DS_Store
Thumbs.db

# Editor
*.swp
*.swo
*~
.idea/
.vscode/

# Environment (templates may be committed from repo root rules)
.env
.env.*
!.env.example

# Live oauth2-proxy + Caddy configs (hostnames are deployment-specific).
# The .example templates ARE committed.
oauth2-proxy.cfg
Caddyfile
102 changes: 102 additions & 0 deletions customers/vm-troubleshooting-dashboard/AGENTS.md
@@ -0,0 +1,102 @@
# AGENTS.md

## Scope
This file applies to everything under `customers/vm-troubleshooting-dashboard/`.
The dashboard is a Go + frontend web app that ingests `gather-info` archives
and renders them for support engineers.

See `customers/vm-troubleshooting-dashboard/CODEMAP.md` for package layout, request flow, and verification commands.

## Monorepo boundary
- This project **owns** ingest, persistence, HTTP API, and UI for archives produced by **`customers/vm-troubleshooting/`** (`gather-info`).
- Stay **forward-compatible** with supported schema majors and newer minors; authoritative rules live in `customers/vm-troubleshooting/SCHEMA-COMPATIBILITY.md`.

## Simplicity (KISS)
- Keep ingest, model, store, and API layers straightforward; avoid extra indirection unless it removes real duplication.
- Favor tolerant parsing and generic UI fallbacks over hardcoded enums for codes/tags/hints.

## Project goals
- Render any archive produced by a supported collector major version.
- Treat archives as the source of truth — never modify them on ingest.
- Be tolerant of unknown fields and unknown enum values; never block ingest on
cosmetic schema drift.

## Architecture rules
- `cmd/` is the entrypoint (HTTP server). Keep it thin.
- `internal/ingest/` parses archives into the model.
- `internal/model/` defines the in-memory shape and the schema-version gate.
- `internal/store/` persists ingested archives to SQLite.
- `frontend/` is the UI (TypeScript). Issue codes / tags / parser hints are
opaque strings — do not hardcode enums.

## Schema compatibility (READ THIS BEFORE TOUCHING SCHEMA HANDLING)
Authoritative compatibility rules live in
`customers/vm-troubleshooting/SCHEMA-COMPATIBILITY.md`. The dashboard's job is
to honor the consumer side of that contract:

- **Accept any archive whose `schema_version` major matches
`SupportedSchemaMajor`** (`internal/model/types.go`). Do not pin to a specific
minor. Do not add minor checks that would reject newer-minor archives.
- **Treat unknown fields as ignorable.** The default Go `encoding/json`
behavior already does this. Do not add `DisallowUnknownFields()` to ingest
decoders.
- **Treat issue codes, tags, parser hints, and finding codes as opaque
strings.** Never assume a fixed enum in Go or TS. Renderers should fall back
to a generic display for codes they do not recognize, not error or hide them.
- **Schema files in `schemas/` are mirrors of the collector schemas.** They
must match the collector copies byte-for-byte (modulo `$id` if intentional).
When the collector adds a new enum value or field, mirror the change here in
the same PR.
- **Never validate archives against the JSON Schema at runtime.** The schemas
are documentation/contract. Runtime parsing uses Go structs and is
deliberately permissive.
- **When extending support to a new major version**, prefer extending the
version gate to accept multiple majors (range check) over flipping it. This
preserves the ability to view historical archives.
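
As a rough illustration of the consumer-side rules above, the following sketch gates on the schema major only and decodes permissively. It is not the dashboard's actual code: the `manifest` struct, the `schemaMajor` and `checkManifest` helpers, and the value `3` for `SupportedSchemaMajor` are assumptions made for the example (the real constant lives in `internal/model/types.go`).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// SupportedSchemaMajor mirrors the constant named in the rules above.
// The value 3 is an assumption for illustration only.
const SupportedSchemaMajor = 3

// manifest is a hypothetical, minimal view of an archive manifest.
type manifest struct {
	SchemaVersion string `json:"schema_version"`
}

// schemaMajor extracts the leading integer from a "major.minor.patch" string.
func schemaMajor(v string) (int, error) {
	head, _, _ := strings.Cut(v, ".")
	major, err := strconv.Atoi(head)
	if err != nil {
		return 0, fmt.Errorf("invalid schema_version %q: %w", v, err)
	}
	return major, nil
}

// checkManifest decodes with plain json.Unmarshal, which ignores unknown
// fields by default, and then compares the major version only. It never
// inspects the minor or patch component.
func checkManifest(raw []byte) error {
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return fmt.Errorf("decode manifest: %w", err)
	}
	major, err := schemaMajor(m.SchemaVersion)
	if err != nil {
		return err
	}
	if major != SupportedSchemaMajor {
		return fmt.Errorf("unsupported schema major %d (want %d)", major, SupportedSchemaMajor)
	}
	return nil
}

func main() {
	// A newer minor with an unknown field should be accepted.
	newerMinor := []byte(`{"schema_version":"3.9.0","future_field":true}`)
	fmt.Println("newer minor:", checkManifest(newerMinor))

	// A different major should be rejected by the gate.
	otherMajor := []byte(`{"schema_version":"4.0.0"}`)
	fmt.Println("other major:", checkManifest(otherMajor))
}
```

If a second major ever needs support, the equality check in `checkManifest` could become a range check, which matches the preference stated in the last rule.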

## Forward compatibility checklist for any dashboard change
- [ ] Does this change reject archives that today's dashboard accepts? If yes,
stop and reconsider.
- [ ] Does this change hardcode an enum that the collector treats as
extensible? If yes, replace with a fallback-friendly lookup.
- [ ] If schema mirror files were edited, do they match the collector's
`customers/vm-troubleshooting/schemas/` byte-for-byte?
- [ ] Did you test ingest with both an old archive (e.g. `3.0.x` or `3.1.x`)
and a current-version archive?

## UX and rendering
- Issue / finding codes are displayed by code string with optional friendly
label lookup. Unknown codes render as the raw code, not as an error or
blank.
- Severity / confidence are constrained enums in the schema, but renderers
should still tolerate unexpected values rather than crash.
- Facts are an open `map[string]any`. The UI should render any well-formed
fact, not just a known set.

## Safety and privacy
- The dashboard ingests data the collector already sanitized. Do not
re-collect from customer systems.
- Storage paths are server-controlled. Never pass user input directly into
filesystem paths without validation.
- Do not log archive contents at info level; archives may contain hostnames
and IPs that count as customer-identifiable.

## Tests and verification
Before considering a change done:
- `gofmt -w .`
- `go vet ./...`
- `go test ./...`
- `go build ./cmd/dashboard`
- For frontend changes: `cd frontend && pnpm build`, then load a known-good archive in a browser. Type checks and unit tests do not catch all UI regressions.

When changing ingest or model code, always include a test that ingests an
archive with a slightly newer minor schema version (e.g. one with an
unknown enum value or fact key) and confirms it succeeds.

## Change management
- Preserve URL routes and API response shapes used by the frontend unless
there is a clear improvement.
- Database migrations must be additive (new columns nullable; never drop
columns in the same release that adds the replacement).
- If schema compat rules change, update both this file and
`customers/vm-troubleshooting/SCHEMA-COMPATIBILITY.md` in the same change.
1 change: 1 addition & 0 deletions customers/vm-troubleshooting-dashboard/CLAUDE.md
@@ -0,0 +1 @@
@AGENTS.md