feat(instance): add TenantState REPAIRING for online repair #1835

Open
sunilkumar-nvidia wants to merge 4 commits into NVIDIA:main from sunilkumar-nvidia:onlinerepair


@sunilkumar-nvidia sunilkumar-nvidia commented May 20, 2026

Description

Adds a tenant-visible REPAIRING instance status for hosts with an active repair health merge (repair-request or request-online-repair). This lets tenants and cloud sync distinguish “instance is up but the site is repairing it” from Ready, Updating (reprovision/firmware), Configuring, or Failed.

Repairing is shown only when the instance would otherwise be tenant-ready: InstanceState::Ready with synced configs and extension services ready. Repair merges do not override Failed, Updating, Configuring, Provisioning, or Terminating.
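The precedence described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the enums are simplified stand-ins, the function name `tenant_state` and its parameters are hypothetical, and the rule that a Ready instance with unsynced configs reports Configuring is an assumption inferred from the description.

```rust
// Hypothetical, simplified stand-ins for the real types touched by this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Provisioning,
    Ready,
    Updating,
    Terminating,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TenantState {
    Provisioning,
    Configuring,
    Ready,
    Updating,
    Terminating,
    Failed,
    Repairing,
}

/// Derive the tenant-visible state. `repair_active` is only consulted once
/// the instance is otherwise tenant-ready, so it never masks another state.
fn tenant_state(
    state: InstanceState,
    configs_synced: bool,
    extensions_ready: bool,
    repair_active: bool,
) -> TenantState {
    match state {
        InstanceState::Failed => TenantState::Failed,
        InstanceState::Terminating => TenantState::Terminating,
        InstanceState::Updating => TenantState::Updating,
        InstanceState::Provisioning => TenantState::Provisioning,
        // Assumption: a Ready instance that is not fully synced shows Configuring.
        InstanceState::Ready if !(configs_synced && extensions_ready) => TenantState::Configuring,
        InstanceState::Ready if repair_active => TenantState::Repairing,
        InstanceState::Ready => TenantState::Ready,
    }
}
```

Under these assumptions, `tenant_state(InstanceState::Ready, true, true, true)` yields `TenantState::Repairing`, while the same call with `InstanceState::Failed` still yields `TenantState::Failed`.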

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Changes

  • Proto / RPC: TenantState.REPAIRING = 10 in forge.proto and mirrored workflow protos
  • Core: Thread host health into instance status derivation; map to TenantState::Repairing in instance_status_tenant_state
  • Cloud: InstanceStatusRepairing in DB model; workflow maps TenantState_REPAIRING → cloud status
  • API: OpenAPI InstanceStatus enum includes Repairing
  • Handlers / CLI: Use REPAIR_REQUEST_MERGE_SOURCE / REQUEST_ONLINE_REPAIR_MERGE_SOURCE constants
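As a rough illustration of the proto-to-cloud mapping listed above, the sketch below decodes the wire value into a status label. The actual workflow mapping in this PR is written in Go; this Rust version only shows the shape. The value 10 and the label "Repairing" come from the description, while `CloudStatus`, `cloud_status_from_wire`, `api_label`, and the "Unknown" fallback are hypothetical names chosen for illustration.

```rust
/// Wire value for TenantState.REPAIRING, as stated in the PR description.
const TENANT_STATE_REPAIRING: i32 = 10;

/// Hypothetical cloud-side status; other variants are omitted because their
/// wire values are not given in the PR description.
#[derive(Debug, PartialEq, Eq)]
enum CloudStatus {
    Repairing,
    Unrecognized(i32),
}

/// Map a raw proto enum value to a cloud status, keeping unknown values
/// visible instead of silently collapsing them into an existing status.
fn cloud_status_from_wire(value: i32) -> CloudStatus {
    match value {
        TENANT_STATE_REPAIRING => CloudStatus::Repairing,
        other => CloudStatus::Unrecognized(other),
    }
}

/// Label exposed through the API; "Unknown" is an illustrative fallback.
fn api_label(status: &CloudStatus) -> &'static str {
    match status {
        CloudStatus::Repairing => "Repairing",
        CloudStatus::Unrecognized(_) => "Unknown",
    }
}
```

Keeping an explicit fallback arm is one way for a consumer built before this change to tolerate the new enum value rather than misreport it.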

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Summary by CodeRabbit

Release Notes

  • New Features
    • Added "Repairing" instance status to display when machines are undergoing repair operations.
    • Enhanced health report system to track and detect active repair states with improved visibility across instance lifecycle.

Add REPAIRING tenant state across proto, cloud workflow, and OpenAPI.
Surface it only when the instance is tenant-ready and a repair health
merge is active; repair_merge_active() and shared merge-source constants
preserve Failed, Updating, Configuring, and Terminating precedence.
@sunilkumar-nvidia sunilkumar-nvidia requested a review from a team as a code owner May 20, 2026 08:39
@sunilkumar-nvidia sunilkumar-nvidia self-assigned this May 20, 2026
@sunilkumar-nvidia
Contributor Author

@coderabbitai review


coderabbitai Bot commented May 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 13c8737a-f5d1-44e2-800a-7017dd888be6

📥 Commits

Reviewing files that changed from the base of the PR and between 9c89c3e and 9d0add6.

⛔ Files ignored due to path filters (2)
  • rest-api/flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (19)
  • crates/admin-cli/src/machine/health_report/cmd.rs
  • crates/api-model/src/health.rs
  • crates/api-model/src/instance/status/tenant.rs
  • crates/api-model/src/rpc_conv/instance/snapshot.rs
  • crates/api-model/src/rpc_conv/instance/status.rs
  • crates/api-model/src/rpc_conv/instance/status/tenant.rs
  • crates/api-model/src/rpc_conv/machine/mod.rs
  • crates/api/src/handlers/instance.rs
  • crates/api/src/tests/dpu_reprovisioning.rs
  • crates/api/src/tests/host_bmc_firmware_test.rs
  • crates/api/src/tests/instance.rs
  • crates/api/src/tests/machine_health.rs
  • crates/health-report/src/lib.rs
  • crates/rpc/proto/forge.proto
  • rest-api/db/pkg/db/model/instance.go
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto
  • rest-api/openapi/spec.yaml
  • rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto
  • rest-api/workflow/pkg/activity/instance/instance.go

Walkthrough

The PR introduces repair state support across the codebase by adding health report merge-source constants, defining a new TenantState::Repairing enum variant, implementing repair-merge detection via a helper method, threading health reporting data through instance status derivation, and updating handlers, CLI tools, and tests to use the new constants and state transitions consistently.

Changes

Repair State and Health Merge Integration

| Layer | File(s) | Summary |
|---|---|---|
| Health report constants and tenant state enum expansion | crates/health-report/src/lib.rs, crates/api-model/src/instance/status/tenant.rs, crates/rpc/proto/forge.proto, rest-api/db/pkg/db/model/instance.go, rest-api/flow/internal/nicoapi/nicoproto/nico.proto, rest-api/openapi/spec.yaml, rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto | Two public string constants define merge-source keys for repair and online repair scenarios. New TenantState::Repairing variant is added to Rust, proto, and Go models, indicating an instance undergoing repair while tenant-ready, with corresponding RPC and OpenAPI enum updates. |
| Repair merge active detection and tenant state logic | crates/api-model/src/health.rs, crates/api-model/src/rpc_conv/instance/status/tenant.rs | HealthReportSources::repair_merge_active() detects repair-related merge sources. instance_status_tenant_state gains a repair_active parameter and returns TenantState::Repairing only when the instance is otherwise tenant-ready, preserving precedence for other states. RPC conversion maps the new variant. |
| Instance status derivation pipeline with health report threading | crates/api-model/src/rpc_conv/instance/snapshot.rs, crates/api-model/src/rpc_conv/instance/status.rs, crates/api-model/src/rpc_conv/machine/mod.rs | instance_snapshot_derive_status and instance_status_from_config_and_observation accept host_health: &HealthReportSources and pass repair_merge_active() to tenant-state computation. Health reports flow through the conversion pipeline from snapshot to final instance status. New test validates repair-merge activation yields TenantState::Repairing. |
| Handler and CLI updates using repair constants | crates/api/src/handlers/instance.rs, crates/admin-cli/src/machine/health_report/cmd.rs | Repair override creation, detection, and removal consistently use REPAIR_REQUEST_MERGE_SOURCE constant. Admin CLI RequestRepair and RequestOnlineRepair templates populate source and target fields from the new constants instead of hardcoded strings. |
| Comprehensive test updates across all test modules | crates/api/src/tests/dpu_reprovisioning.rs, crates/api/src/tests/host_bmc_firmware_test.rs, crates/api/src/tests/instance.rs, crates/api/src/tests/machine_health.rs, rest-api/workflow/pkg/activity/instance/instance.go | All test invocations of instance_snapshot_derive_status now pass &host.health_reports. Handler and machine health tests replace hardcoded merge-source strings with constants. Tests validate repair-merge precedence and state transitions, and Go status mapping handles the new TenantState_REPAIRING variant. |
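
The repair-merge detection summarized in the walkthrough might look something like the sketch below. It is an assumption-laden illustration: the constant values are taken from the merge names mentioned in the PR description, and the `HealthReportSources` struct here is a hypothetical stand-in (a map keyed by merge source), not the real type in crates/api-model/src/health.rs.

```rust
use std::collections::BTreeMap;

// Constant names come from the PR; the string values are assumed from the
// merge names quoted in the PR description.
pub const REPAIR_REQUEST_MERGE_SOURCE: &str = "repair-request";
pub const REQUEST_ONLINE_REPAIR_MERGE_SOURCE: &str = "request-online-repair";

/// Hypothetical stand-in: merge reports keyed by their source name, with an
/// opaque string payload in place of a real health report.
#[derive(Default)]
struct HealthReportSources {
    merges: BTreeMap<String, String>,
}

impl HealthReportSources {
    /// True when either repair-related merge source is present.
    fn repair_merge_active(&self) -> bool {
        [REPAIR_REQUEST_MERGE_SOURCE, REQUEST_ONLINE_REPAIR_MERGE_SOURCE]
            .iter()
            .any(|source| self.merges.contains_key(*source))
    }
}
```

Sharing the constants between the detection helper, the handlers, and the CLI means a renamed merge source only has to change in one place.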

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 A repairing state hops into view,
Health merges checked, the pipeline through,
Constants now steady, no strings left loose,
Tenant states bloom with repair's excuse!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main objective: introducing a new TenantState enum variant REPAIRING for online repair, which is the primary change across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

