Skip to content

feat: Initial draft commit for Instance online repair#415

Draft
hwadekar-nv wants to merge 2 commits intomainfrom
feat/online-repair
Draft

feat: Initial draft commit for Instance online repair#415
hwadekar-nv wants to merge 2 commits intomainfrom
feat/online-repair

Conversation

@hwadekar-nv
Copy link
Copy Markdown
Contributor

@hwadekar-nv hwadekar-nv commented Apr 22, 2026

Description

Based on the design https://gitlab-master.nvidia.com/nvmetal/designs/-/blob/breakfix-swiftstack/designs/0066-Instance-Repair-Management.md#proposed-workflow-in-pool-repair-dedicated-ops-team, the Instance-Repair-Management implementation was created in an initial draft version.

  • Added mechanism in Instance to update or make repair status

    • Tenant: Online repair is allowed for Tenant Admin when either TargetedInstanceCreation is set or the request is only an online-repair operation; tenant must have a tenant account with the machine’s provider, and the machine’s single instance must belong to that tenant.
    • Provider: Same instance/machine checks without tenant ownership.
    • Enter: Instance must be Ready. DB transaction updates instance to Repairing, sets label onlineRepair.allowAutoInstanceDeletionOnFailure to "true"/"false", then runs Temporal workflow ApplyMachineOnlineRepairHealthOverride with a tenant-reported-issue health override matching the HLD (OnLineRepair alert + classifications).
    • Exit: Instance must be Repairing. DB updates instance back to Ready and removes the policy label, then workflow ClearMachineOnlineRepairHealthOverride removes the override for source tenant-reported-issue._
  • Added PATCH request in Machine to allow Machine to be in repair while having Instance allocated
    - Enter: requestOnlineRepair: true + machineHealthIssue + repairPolicy.allowAutoInstanceDeletionOnFailure + acknowledgments (all three booleans must be true).
    - Exit: requestOnlineRepair: false + clearOnlineRepair: true (no other online-repair payload fields)._

  • Site workflow

    • InsertHealthReportOverrideOnSite / RemoveHealthReportOverrideOnSite activities call
    • InsertHealthReportOverride / RemoveHealthReportOverride.
    • Workflows ApplyMachineOnlineRepairHealthOverride and ClearMachineOnlineRepairHealthOverride_
  • Site-agent

    • Workflow registered in site-agent/pkg/components/managers/machine/subscriber.go._

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hwadekar-nv hwadekar-nv self-assigned this Apr 22, 2026
@hwadekar-nv hwadekar-nv requested a review from a team as a code owner April 22, 2026 22:39
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 80dbce4b-1d34-40b5-83d3-3892a7467a57

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

The pull request introduces in-pool online repair functionality for machines, enabling tenant repair teams to restore machines within the instance pool. Changes include API handler authorization and transaction logic for state transitions, new request models with validation for repair intent and acknowledgments, instance status additions, Temporal workflows coordinating with Site for health report overrides, and comprehensive design documentation.

Changes

Cohort / File(s) Summary
API Handler Implementation
api/pkg/api/handler/machine.go
Extended UpdateMachineHandler to enforce tenant authorization for online repair, validate machine assignment and instance status, manage database transaction for status toggling (Ready↔Repairing), coordinate health report overrides via Temporal workflows, and handle auto-deletion label management based on repair policy.
API Handler Tests
api/pkg/api/handler/machine_test.go
Added expectations for two new Temporal workflow executions (ApplyMachineOnlineRepairHealthOverride, ClearMachineOnlineRepairHealthOverride) in both success and timeout scenarios.
API Request Models
api/pkg/api/model/machine.go
Added new types (APIMachineHealthIssue, APIOnlineRepairPolicy, APIOnlineRepairAcknowledgments) and extended APIMachineUpdateRequest with online repair fields. Introduced validation for exclusive operation modes, health issue category/length constraints, and mandatory acknowledgments. Added builder function BuildOnlineRepairHealthOverrideMessageJSON for health override payloads.
API Model Tests
api/pkg/api/model/machine_test.go
Extended TestAPIMachineUpdateRequest_Validate with three new test cases covering enter-repair success, conflict detection with maintenance mode, and exit-repair validation.
Database Instance Model
db/pkg/db/model/instance.go
Added InstanceStatusRepairing constant and updated InstanceStatusMap and GetCountByStatus to track the new status.
Database Model Tests
db/pkg/db/model/instance_test.go
Updated TestInstanceSQLDAO_GetCountByStatus expectations to include InstanceStatusRepairing count entries.
Site Workflow Activities
site-workflow/pkg/activity/machine.go
Implemented two new activities (InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite) with strict request validation and Forge RPC invocation to manage health report overrides.
Site Activity Tests
site-workflow/pkg/activity/machine_test.go
Added unit tests for both new activities covering success paths and nil request error handling.
Site Workflows
site-workflow/pkg/workflow/machine.go
Introduced two Temporal workflows (ApplyMachineOnlineRepairHealthOverride, ClearMachineOnlineRepairHealthOverride) with 2-minute timeouts and retry policies for health override coordination.
Workflow Tests
site-workflow/pkg/workflow/machine_test.go
Added test cases for both new workflows verifying successful activity integration.
Workflow gRPC Client Mock
site-workflow/pkg/grpc/client/testing.go
Added mock implementations for InsertHealthReportOverride and RemoveHealthReportOverride RPC methods.
Site Agent Registration
site-agent/pkg/components/managers/machine/subscriber.go
Registered new Temporal workflows and site activities for online repair health override management.
OpenAPI Specification
openapi/spec.yaml
Extended machine update API contract with new schemas for health issues, repair policies, and acknowledgments; added Repairing instance status; updated request descriptions for online repair semantics.
Design Documentation
db/repair.md, db/tenant-capability.md, db/tenant-capability-google-doc.html
Introduced comprehensive design documents describing in-pool online repair workflow, instance state transitions, failure handling semantics, and tenant capability scoping architecture.

Sequence Diagram

sequenceDiagram
    participant Tenant as Tenant/Client
    participant Handler as API Handler
    participant DB as Database
    participant Temporal as Temporal
    participant Activity as Site Activity
    participant Site as Site Controller

    Tenant->>Handler: PATCH /machines/{id} (requestOnlineRepair)
    Handler->>Handler: Validate machine is assigned & Instance status
    Handler->>Handler: Enforce TenantAccount authorization
    
    Handler->>DB: Begin Transaction
    Handler->>DB: Update Instance status Ready → Repairing
    Handler->>DB: Set OnlineRepair auto-deletion label
    Handler->>DB: Commit Transaction
    
    Handler->>Temporal: ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride)
    Temporal->>Activity: InsertHealthReportOverrideOnSite(machine_id, health_report)
    Activity->>Site: RPC InsertHealthReportOverride
    Site-->>Activity: Empty response
    Activity-->>Temporal: Success
    Temporal-->>Handler: Workflow complete
    
    Handler-->>Tenant: 200 OK (Instance.status=Repairing)
    
    Tenant->>Handler: PATCH /machines/{id} (clearOnlineRepair)
    Handler->>Handler: Validate Instance status is Repairing
    
    Handler->>DB: Begin Transaction
    Handler->>DB: Update Instance status Repairing → Ready
    Handler->>DB: Clear OnlineRepair label
    Handler->>DB: Commit Transaction
    
    Handler->>Temporal: ExecuteWorkflow(ClearMachineOnlineRepairHealthOverride)
    Temporal->>Activity: RemoveHealthReportOverrideOnSite(machine_id)
    Activity->>Site: RPC RemoveHealthReportOverride
    Site-->>Activity: Empty response
    Activity-->>Temporal: Success
    Temporal-->>Handler: Workflow complete
    
    Handler-->>Tenant: 200 OK (Instance.status=Ready)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The changes introduce a new feature spanning heterogeneous components—dense handler transaction logic with state management, comprehensive validation across multiple model layers, database schema extensions, Temporal workflow coordination, activity implementations, and OpenAPI contract updates. The handler requires careful scrutiny for transaction semantics, authorization gating, and Temporal error handling. Multiple interacting components demand verification of consistency across API, database, workflow, and activity layers.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description comprehensively details the implementation of an Instance-Repair-Management feature, referencing a design document and explaining role-specific authorization, state transitions, API changes, and site workflow components.
Title check ✅ Passed The PR title 'feat: Initial draft commit for Instance online repair' clearly and concisely describes the primary feature being introduced—online repair functionality for machine instances.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/online-repair

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-22 22:40:40 UTC | Commit: 72dc761

@hwadekar-nv hwadekar-nv marked this pull request as draft April 22, 2026 22:41
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
openapi/spec.yaml (2)

16949-17010: ⚠️ Potential issue | 🟠 Major

Regenerate the checked-in SDK/docs for this OpenAPI change.

CI is already failing because openapi/spec.yaml changed without the corresponding generated output under sdk/standard/ or docs/index.html. Please include the regenerated artifacts in this PR so consumers pick up Repairing and the new online-repair request shapes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openapi/spec.yaml` around lines 16949 - 17010, The OpenAPI spec added
online-repair fields (requestOnlineRepair, clearOnlineRepair) and schemas
(MachineHealthIssue, OnlineRepairPolicy, OnlineRepairAcknowledgments) and
introduced the new Instance status "Repairing", but the generated SDK/docs were
not updated; regenerate the OpenAPI client and docs and commit the outputs so CI
passes—run the repo's OpenAPI generation script or tool to update sdk/standard/
and docs/index.html (and any other generated artifacts), verify the new shapes
and the "Repairing" status appear in the generated SDK and HTML, and include
those regenerated files in this PR.

16962-17010: ⚠️ Potential issue | 🟠 Major

Encode the online-repair modes in the schema itself.

The enter/exit rules currently live only in descriptions and examples. The schema still permits invalid payloads such as requestOnlineRepair: true without acknowledgments, requestOnlineRepair: false together with machineHealthIssue, or online-repair fields mixed with the other update properties. Please model these as mutually exclusive variants with oneOf/not so generated clients and validators enforce the same contract as the handler.

As per coding guidelines, "MachineUpdateRequest supports two mutually exclusive modes for online repair: Enter: requestOnlineRepair=true + machineHealthIssue + repairPolicy + acknowledgments. Exit: requestOnlineRepair=false + clearOnlineRepair=true, with no other online-repair payload fields combined."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openapi/spec.yaml` around lines 16962 - 17010, The schema currently allows
invalid combinations because the online-repair semantics are only in docs;
update the MachineUpdateRequest schema to express three mutually exclusive
variants using oneOf/not: (1) Normal update: allows
instanceTypeId/clearInstanceType, setMaintenanceMode, maintenanceMessage, labels
and must forbid any online-repair fields (requestOnlineRepair,
clearOnlineRepair, machineHealthIssue, repairPolicy, acknowledgments); (2) Enter
online-repair mode: require requestOnlineRepair=true and require
machineHealthIssue, repairPolicy, and acknowledgments (reference
MachineHealthIssue, OnlineRepairPolicy, OnlineRepairAcknowledgments) and forbid
instanceTypeId/clearInstanceType, setMaintenanceMode, maintenanceMessage,
labels, clearOnlineRepair; (3) Exit online-repair mode: require
requestOnlineRepair=false and clearOnlineRepair=true and forbid
machineHealthIssue, repairPolicy, acknowledgments and also forbid other update
properties; implement these with oneOf plus required/const and
additionalProperties/not blocks so validators and generated clients enforce the
enter/exit rules described in the description.
🧹 Nitpick comments (8)
site-workflow/pkg/grpc/client/testing.go (1)

346-352: Mock methods lack context-based error injection support.

The new InsertHealthReportOverride and RemoveHealthReportOverride mock methods always return success, unlike other mock methods in this file that support ctx.Value("wantError") for injecting failures during tests. This limits the ability to test error handling paths in workflows and activities that invoke these RPCs.

Consider aligning with the established pattern for comprehensive test coverage:

♻️ Suggested enhancement
 func (c *MockForgeClient) InsertHealthReportOverride(ctx context.Context, in *wflows.InsertHealthReportOverrideRequest, opts ...grpc.CallOption) (*emptypb.Empty, error) {
+	if err, ok := ctx.Value("wantError").(error); ok {
+		return nil, status.Error(status.Code(err), "failed to insert health report override")
+	}
 	return new(emptypb.Empty), nil
 }

 func (c *MockForgeClient) RemoveHealthReportOverride(ctx context.Context, in *wflows.RemoveHealthReportOverrideRequest, opts ...grpc.CallOption) (*emptypb.Empty, error) {
+	if err, ok := ctx.Value("wantError").(error); ok {
+		return nil, status.Error(status.Code(err), "failed to remove health report override")
+	}
 	return new(emptypb.Empty), nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/client/testing.go` around lines 346 - 352, The two
mock RPCs InsertHealthReportOverride and RemoveHealthReportOverride on
MockForgeClient always return success; update both to follow the file's
established pattern by checking ctx.Value("wantError") (or the same key other
mocks use) and, if set, return that injected error (or wrap it) instead of
success; otherwise return new(emptypb.Empty), nil—this enables context-based
error injection for tests.
db/tenant-capability-google-doc.html (1)

1-325: Consider consolidating documentation to avoid duplication.

Both db/tenant-capability.md and db/tenant-capability-google-doc.html contain the same HLD content. Maintaining two formats creates a risk of divergence. Consider keeping only the Markdown version and generating HTML if needed, or documenting which is the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/tenant-capability-google-doc.html` around lines 1 - 325, Duplicate HLD
content exists in db/tenant-capability.md and
db/tenant-capability-google-doc.html which risks divergence; remove the
redundancy by choosing a single source-of-truth (prefer the Markdown file) and
delete or turn the HTML into a generated artifact, update references to point to
db/tenant-capability.md, and ensure the canonical doc names (e.g.
TenantSiteCapabilityAssociation, TargetedInstanceCreation) are preserved in the
retained file so handlers and reviewers can find the same content.
site-workflow/pkg/activity/machine_test.go (1)

142-180: Tests cover basic paths but lack edge case coverage.

The tests verify successful execution and nil request validation. However, the activity implementations validate multiple conditions (e.g., MachineId.Id == "", Override == nil, Override.Report == nil, Source == ""). Consider adding tests for these edge cases to ensure the validation logic is fully exercised.

♻️ Suggested additional test cases
func TestManageMachine_InsertHealthReportOverrideOnSite_ValidationErrors(t *testing.T) {
	mockCarbide := cClient.NewMockCarbideClient()
	carbideAtomicClient := cClient.NewCarbideAtomicClient(&cClient.CarbideClientConfig{})
	carbideAtomicClient.SwapClient(mockCarbide)
	mm := NewManageMachine(carbideAtomicClient)

	testCases := []struct {
		name    string
		request *cwssaws.InsertHealthReportOverrideRequest
	}{
		{"nil request", nil},
		{"nil MachineId", &cwssaws.InsertHealthReportOverrideRequest{MachineId: nil}},
		{"empty MachineId.Id", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: ""}}},
		{"nil Override", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: "m1"}, Override: nil}},
		{"nil Override.Report", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: "m1"}, Override: &cwssaws.HealthReportOverride{Report: nil}}},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			err := mm.InsertHealthReportOverrideOnSite(context.Background(), tc.request)
			assert.Error(t, err)
		})
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/activity/machine_test.go` around lines 142 - 180, Tests
only exercise success and nil-request paths; add unit tests that assert
validation errors for empty or missing fields checked in the activity. Extend
site-workflow/pkg/activity/machine_test.go to include table-driven tests calling
NewManageMachine(...).InsertHealthReportOverrideOnSite and
RemoveHealthReportOverrideOnSite covering cases: nil MachineId, empty
MachineId.Id, nil Override, nil Override.Report, and empty Source (for
RemoveHealthReportOverrideOnSite), and assert Error for each; reference the
functions InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite and
constructor NewManageMachine to locate where to call them.
site-workflow/pkg/workflow/machine_test.go (1)

187-219: Tests cover success paths; consider adding failure scenario tests.

The tests correctly verify that both workflows complete successfully when activities succeed. However, the test suite includes failure scenario tests for other workflows (e.g., Test_UpdateMachineMetadata_ActivityFails). For consistency and to ensure error propagation is verified, consider adding analogous failure tests.

♻️ Suggested failure test
func (s *MachineWorkflowTestSuite) Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails() {
	var machineManager mActivity.ManageMachine
	req := &cwssaws.InsertHealthReportOverrideRequest{
		MachineId: &cwssaws.MachineId{Id: uuid.New().String()},
		Override: &cwssaws.HealthReportOverride{
			Report: &cwssaws.HealthReport{Source: "tenant-reported-issue"},
			Mode:   cwssaws.OverrideMode_Replace,
		},
	}
	errMsg := "Site Controller communication error"

	s.env.RegisterActivity(machineManager.InsertHealthReportOverrideOnSite)
	s.env.OnActivity(machineManager.InsertHealthReportOverrideOnSite, mock.Anything, mock.Anything).Return(errors.New(errMsg))

	s.env.ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride, req)
	s.True(s.env.IsWorkflowCompleted())
	err := s.env.GetWorkflowError()
	s.Error(err)

	var applicationErr *temporal.ApplicationError
	s.True(errors.As(err, &applicationErr))
	s.Equal(errMsg, applicationErr.Error())
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/workflow/machine_test.go` around lines 187 - 219, The tests
only cover success paths for ApplyMachineOnlineRepairHealthOverride and
ClearMachineOnlineRepairHealthOverride; add failure-case unit tests that mock
the activities to return an error and assert the workflow surfaces that error.
Specifically, add tests similar to
Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails and a counterpart for
ClearMachineOnlineRepairHealthOverride where you RegisterActivity and use
env.OnActivity for mActivity.ManageMachine.InsertHealthReportOverrideOnSite (and
RemoveHealthReportOverrideOnSite) to Return(errors.New(...)),
ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride /
ClearMachineOnlineRepairHealthOverride) and then assert
env.IsWorkflowCompleted(), env.GetWorkflowError() is non-nil, the error is a
temporal.ApplicationError and its message matches the mocked error string.
db/tenant-capability.md (1)

265-272: Handler migration guidance is documented but not yet implemented.

The HLD correctly identifies that downstream handlers must replace checks using tenant.Config.TargetedInstanceCreation with effective site-scoped capability evaluation. Based on the relevant code snippets provided, the current api/pkg/api/handler/machine.go implementation (lines 323-329 and 824-845) still uses only the tenant-global check without TenantSiteCapabilityAssociation lookup.

This is acceptable for an initial draft, but consider adding a tracking item or TODO to ensure the handler migration is completed before this capability model is fully operational.

Would you like me to open an issue to track the handler migration work outlined in this HLD?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/tenant-capability.md` around lines 265 - 272, The handler migration
described in the HLD hasn't been implemented: handlers in
api/pkg/api/handler/machine.go still check
tenant.Config.TargetedInstanceCreation directly instead of using
effective(tenant, TenantSiteCapabilityAssociation, siteID) for site-scoped
requests (instance create/delete, machine APIs, site listing). Add a tracking
TODO and create a repo issue describing: replace direct TargetedInstanceCreation
checks in the machine handlers and any site-scoped handlers with the
effective(..., siteID) helper, centralize the lookup in shared helpers, and
decide site-listing behavior (narrow vs discovery). Reference
TenantSiteCapabilityAssociation, TargetedInstanceCreation, effective(...,
siteID), and the machine handler functions so the migration can be completed
before the capability model goes live.
api/pkg/api/model/machine_test.go (1)

729-778: Add missing negative cases for online-repair branch coverage.

Current additions validate happy paths plus one exclusivity error, but Validate() has more online-repair branches that are currently untested (e.g., enter with clearOnlineRepair=true, exit without clearOnlineRepair=true, exit with forbidden machineHealthIssue/repairPolicy/acknowledgments payloads). Adding these would harden regression detection.

📌 Suggested test-case additions
@@
 		{
 			name: "test valid exit online repair request",
 			fields: fields{
 				RequestOnlineRepair: cdb.GetBoolPtr(false),
 				ClearOnlineRepair:   cdb.GetBoolPtr(true),
 			},
 			wantErr: false,
 		},
+		{
+			name: "test invalid enter online repair when clearOnlineRepair is true",
+			fields: fields{
+				RequestOnlineRepair: cdb.GetBoolPtr(true),
+				ClearOnlineRepair:   cdb.GetBoolPtr(true),
+				MachineHealthIssue: &APIMachineHealthIssue{
+					Category: "STORAGE",
+					Summary:  "Disk issue",
+					Details:  "logs and ticket refs",
+				},
+				RepairPolicy: &APIOnlineRepairPolicy{
+					AllowAutoInstanceDeletionOnFailure: cdb.GetBoolPtr(false),
+				},
+				Acknowledgments: &APIOnlineRepairAcknowledgments{
+					AcceptDataCorruptionRisk:   cdb.GetBoolPtr(true),
+					AcceptRepairTeamAccess:     cdb.GetBoolPtr(true),
+					AcceptInstanceDeletionRisk: cdb.GetBoolPtr(true),
+				},
+			},
+			wantErr: true,
+		},
+		{
+			name: "test invalid exit online repair without clearOnlineRepair true",
+			fields: fields{
+				RequestOnlineRepair: cdb.GetBoolPtr(false),
+			},
+			wantErr: true,
+		},
+		{
+			name: "test invalid exit online repair when enter-only fields are provided",
+			fields: fields{
+				RequestOnlineRepair: cdb.GetBoolPtr(false),
+				ClearOnlineRepair:   cdb.GetBoolPtr(true),
+				MachineHealthIssue: &APIMachineHealthIssue{
+					Category: "OTHER",
+					Summary:  "Should not be sent on exit",
+					Details:  "Should not be sent on exit",
+				},
+			},
+			wantErr: true,
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/model/machine_test.go` around lines 729 - 778, Add unit tests
calling the Machine.Validate() cases that are currently missing: (1) an "enter
online-repair" case where RequestOnlineRepair==true but ClearOnlineRepair==true
(should error), (2) an "exit online-repair" case where
RequestOnlineRepair==false and ClearOnlineRepair!=true (should error), and (3)
one or more "exit online-repair" cases where RequestOnlineRepair==false and
ClearOnlineRepair==true but forbidden payload fields are present
(MachineHealthIssue, RepairPolicy, Acknowledgments) and should cause validation
errors. Use the same test table structure and fields (RequestOnlineRepair,
ClearOnlineRepair, MachineHealthIssue, RepairPolicy, Acknowledgments) as the
surrounding tests and assert wantErr accordingly when calling Validate().
api/pkg/api/handler/machine.go (2)

64-67: Inconsistent constant value casing.

The value "OnLineRepair" contains inconsistent capitalization. The convention for compound words like "online" should be OnlineRepair rather than OnLineRepair.

♻️ Suggested fix
 const (
 	onlineRepairHealthOverrideSource = "tenant-reported-issue"
-	onlineRepairHealthAlertID      = "OnLineRepair"
+	onlineRepairHealthAlertID        = "OnlineRepair"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 64 - 67, The constant
onlineRepairHealthAlertID has inconsistent capitalization ("OnLineRepair");
update its string value to "OnlineRepair" to follow compound-word casing
conventions, and search for any uses of onlineRepairHealthAlertID to ensure no
downstream logic or external consumers depend on the old casing and update tests
or comparisons accordingly.

846-862: Duplicated instance retrieval logic.

The instance retrieval and validation at lines 850-858 is duplicated in the main online repair block at lines 1454-1462. Consider extracting this common logic into a helper function or restructuring to retrieve the instance once and pass it through.

♻️ Suggested approach

Extract instance retrieval to a helper that returns the validated instance:

func (umh UpdateMachineHandler) getOnlineRepairInstance(ctx context.Context, machineID string, logger zerolog.Logger) (*cdbm.Instance, *cutil.APIError) {
    iDAO := cdbm.NewInstanceDAO(umh.dbSession)
    instances, ic, ierr := iDAO.GetAll(ctx, nil, cdbm.InstanceFilterInput{MachineIDs: []string{machineID}}, cdbp.PageInput{Limit: cdb.GetIntPtr(2)}, nil)
    if ierr != nil {
        logger.Error().Err(ierr).Msg("error retrieving Instance for Machine")
        return nil, cutil.NewAPIError(http.StatusInternalServerError, "Failed to retrieve Instance for Machine", nil)
    }
    if ic != 1 {
        return nil, cutil.NewAPIError(http.StatusBadRequest, "Machine must have exactly one Instance for online repair", nil)
    }
    return &instances[0], nil
}

Then call this once during authorization and reuse the result in the main online repair block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 846 - 862, The instance
retrieval/validation for online repair in UpdateMachineHandler is duplicated;
extract it into a helper (e.g., UpdateMachineHandler.getOnlineRepairInstance)
that accepts ctx, machine.ID and logger and returns (*cdbm.Instance,
*cutil.APIError) after calling cdbm.NewInstanceDAO(...).GetAll, logging errors
and translating them into cutil.NewAPIError responses (handle non-1 count and
tenant mismatch there or return the instance and let caller check tenant), then
replace both inline blocks (the pre-authorization check and the main
online-repair block) to call this helper once and reuse the returned instance
instead of repeating the GetAll logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1586-1593: The response currently returns the original machine
object (um = machine) after the online repair, which doesn't include instance
state changes; update the code so the response reflects the modified instance by
replacing um = machine with either (a) embedding the updated instance fields
into um (copy updated status/labels from the repaired instance object into the
machine struct used for response) or (b) reloading the machine+instance from the
DB (call the same loader used by getAPIMachines) after orTx.Commit() and assign
that fresh object to um so getAPIMachines returns the accurate, post-repair
instance state; ensure you reference the modified instance struct created during
the online repair and set orTxCommitted accordingly.
- Around line 1521-1528: The timeout detection block currently checks ctx.Err()
(parent context) instead of the workflow context, so replace the ctx.Err() check
with wfCtx.Err() in the error branch that inspects tp.TimeoutError and
context.DeadlineExceeded; update the conditional in the function handling err
(the block that declares var timeoutErr *tp.TimeoutError and returns via
cutil.NewAPIErrorResponse) to use wfCtx.Err() so timeouts from the workflow
context are correctly detected while keeping the existing UnwrapWorkflowError
handling and logging (common.UnwrapWorkflowError, logger.Error(), and the
subsequent cutil.NewAPIErrorResponse) unchanged.
- Around line 1574-1582: In the error-handling block for clearing the online
repair workflow (the if err != nil branch that checks timeout and unwraps
workflow errors via common.UnwrapWorkflowError), replace the incorrect ctx.Err()
check with wfCtx.Err() so the timeout/deadline check uses the workflow context;
keep the rest of the logic (using err in the timeout message and werr for the
UnwrapWorkflowError path) unchanged and ensure wfCtx is in scope where the check
occurs.
- Around line 1448-1452: Duplicate call to umh.scp.GetClientByID is unnecessary;
remove the second retrieval (the stc, err :=
umh.scp.GetClientByID(machine.SiteID) block at the later location) and reuse the
previously obtained stc from earlier in the same handler, ensuring the variable
is in scope (or pass it into the helper function if needed) and delete the
redundant error handling branch so the later logic uses the existing Temporal
client instead.

In `@db/repair.md`:
- Around line 94-95: The document uses inconsistent state names (e.g.,
OnlineRepair vs Repairing, and an unexpected Assigned alongside Ready); pick a
single canonical term for repair state (suggest using "Repairing") and normalize
every occurrence: replace "OnlineRepair" with "Repairing", update the status
table to list "Repairing" instead of "OnlineRepair", and either add "Assigned"
to the defined state list or remove the stray "Assigned" reference (ensure
"Ready/Assigned" wording is corrected to use only defined states like "Ready" or
"Assigned" consistently); verify all mentions (e.g., the sentence from the diff
and the table around line 171) use the chosen vocabulary.
- Line 184: The alert ID casing is inconsistent: change the JSON field value
"id": "OnLineRepair" to "OnlineRepair" to match the rest of the document and
ensure implementation references resolve; search for any other occurrences of
"OnLineRepair" (e.g., in examples, anchors, or references) and update them to
"OnlineRepair" so casing is consistent across the markdown and any linked
configuration or code.
- Around line 248-249: The docs state allowAutoInstanceDeletionOnFailure is
optional but the Machine PATCH API flow (requestOnlineRepair: true) requires it
on entry; update the documentation to mark allowAutoInstanceDeletionOnFailure as
required for the Machine PATCH API request when requestOnlineRepair: true (or
clarify the conditional requirement), and add a short note next to
requestOnlineRepair and the Machine PATCH API section explaining the required
inputs (machine health issue, acknowledgments, and the mandatory
allowAutoInstanceDeletionOnFailure flag) so clients won’t send invalid requests.

In `@db/tenant-capability-google-doc.html`:
- Around line 93-112: The Mermaid flowchart inside the <pre> block contains raw
">" characters (e.g., in node labels like A[PATCH Tenant Capabilities] and
arrows like A -->|"sites[], capabilityName, infrastructureProviderId, enabled"|
B) which HTMLHint flags; update the <pre> block so all literal ">" characters
are replaced with the escaped entity &gt; (and similarly escape any "<" as &lt;
if present) so the Mermaid diagram remains the same visually but the HTML is
valid; target the flowchart text within the <pre> block (nodes A, B, C, D, E, F,
G and the arrow lines) to perform the replacements.

In `@openapi/spec.yaml`:
- Around line 16884-16909: There are two duplicate components.schemas entries
named MachineHealthIssue; remove the redundant definition and keep a single
authoritative MachineHealthIssue schema under components.schemas (or merge any
differing fields into one consolidated version), ensuring all properties
(category, summary, details), enums and descriptions are preserved; update or
verify any references to MachineHealthIssue remain valid and remove the earlier
or later duplicate so the spec contains only one MachineHealthIssue schema.

---

Outside diff comments:
In `@openapi/spec.yaml`:
- Around line 16949-17010: The OpenAPI spec added online-repair fields
(requestOnlineRepair, clearOnlineRepair) and schemas (MachineHealthIssue,
OnlineRepairPolicy, OnlineRepairAcknowledgments) and introduced the new Instance
status "Repairing", but the generated SDK/docs were not updated; regenerate the
OpenAPI client and docs and commit the outputs so CI passes—run the repo's
OpenAPI generation script or tool to update sdk/standard/ and docs/index.html
(and any other generated artifacts), verify the new shapes and the "Repairing"
status appear in the generated SDK and HTML, and include those regenerated files
in this PR.
- Around line 16962-17010: The schema currently allows invalid combinations
because the online-repair semantics are only in docs; update the
MachineUpdateRequest schema to express three mutually exclusive variants using
oneOf/not: (1) Normal update: allows instanceTypeId/clearInstanceType,
setMaintenanceMode, maintenanceMessage, labels and must forbid any online-repair
fields (requestOnlineRepair, clearOnlineRepair, machineHealthIssue,
repairPolicy, acknowledgments); (2) Enter online-repair mode: require
requestOnlineRepair=true and require machineHealthIssue, repairPolicy, and
acknowledgments (reference MachineHealthIssue, OnlineRepairPolicy,
OnlineRepairAcknowledgments) and forbid instanceTypeId/clearInstanceType,
setMaintenanceMode, maintenanceMessage, labels, clearOnlineRepair; (3) Exit
online-repair mode: require requestOnlineRepair=false and clearOnlineRepair=true
and forbid machineHealthIssue, repairPolicy, acknowledgments and also forbid
other update properties; implement these with oneOf plus required/const and
additionalProperties/not blocks so validators and generated clients enforce the
enter/exit rules described in the description.

---

Nitpick comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 64-67: The constant onlineRepairHealthAlertID has inconsistent
capitalization ("OnLineRepair"); update its string value to "OnlineRepair" to
follow compound-word casing conventions, and search for any uses of
onlineRepairHealthAlertID to ensure no downstream logic or external consumers
depend on the old casing and update tests or comparisons accordingly.
- Around line 846-862: The instance retrieval/validation for online repair in
UpdateMachineHandler is duplicated; extract it into a helper (e.g.,
UpdateMachineHandler.getOnlineRepairInstance) that accepts ctx, machine.ID and
logger and returns (*cdbm.Instance, *cutil.APIError) after calling
cdbm.NewInstanceDAO(...).GetAll, logging errors and translating them into
cutil.NewAPIError responses (handle non-1 count and tenant mismatch there or
return the instance and let caller check tenant), then replace both inline
blocks (the pre-authorization check and the main online-repair block) to call
this helper once and reuse the returned instance instead of repeating the GetAll
logic.

In `@api/pkg/api/model/machine_test.go`:
- Around line 729-778: Add unit tests calling the Machine.Validate() cases that
are currently missing: (1) an "enter online-repair" case where
RequestOnlineRepair==true but ClearOnlineRepair==true (should error), (2) an
"exit online-repair" case where RequestOnlineRepair==false and
ClearOnlineRepair!=true (should error), and (3) one or more "exit online-repair"
cases where RequestOnlineRepair==false and ClearOnlineRepair==true but forbidden
payload fields are present (MachineHealthIssue, RepairPolicy, Acknowledgments)
and should cause validation errors. Use the same test table structure and fields
(RequestOnlineRepair, ClearOnlineRepair, MachineHealthIssue, RepairPolicy,
Acknowledgments) as the surrounding tests and assert wantErr accordingly when
calling Validate().

In `@db/tenant-capability-google-doc.html`:
- Around line 1-325: Duplicate HLD content exists in db/tenant-capability.md and
db/tenant-capability-google-doc.html which risks divergence; remove the
redundancy by choosing a single source-of-truth (prefer the Markdown file) and
delete or turn the HTML into a generated artifact, update references to point to
db/tenant-capability.md, and ensure the canonical doc names (e.g.
TenantSiteCapabilityAssociation, TargetedInstanceCreation) are preserved in the
retained file so handlers and reviewers can find the same content.

In `@db/tenant-capability.md`:
- Around line 265-272: The handler migration described in the HLD hasn't been
implemented: handlers in api/pkg/api/handler/machine.go still check
tenant.Config.TargetedInstanceCreation directly instead of using
effective(tenant, TenantSiteCapabilityAssociation, siteID) for site-scoped
requests (instance create/delete, machine APIs, site listing). Add a tracking
TODO and create a repo issue describing: replace direct TargetedInstanceCreation
checks in the machine handlers and any site-scoped handlers with the
effective(..., siteID) helper, centralize the lookup in shared helpers, and
decide site-listing behavior (narrow vs discovery). Reference
TenantSiteCapabilityAssociation, TargetedInstanceCreation, effective(...,
siteID), and the machine handler functions so the migration can be completed
before the capability model goes live.

In `@site-workflow/pkg/activity/machine_test.go`:
- Around line 142-180: Tests only exercise success and nil-request paths; add
unit tests that assert validation errors for empty or missing fields checked in
the activity. Extend site-workflow/pkg/activity/machine_test.go to include
table-driven tests calling
NewManageMachine(...).InsertHealthReportOverrideOnSite and
RemoveHealthReportOverrideOnSite covering cases: nil MachineId, empty
MachineId.Id, nil Override, nil Override.Report, and empty Source (for
RemoveHealthReportOverrideOnSite), and assert Error for each; reference the
functions InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite and
constructor NewManageMachine to locate where to call them.

In `@site-workflow/pkg/grpc/client/testing.go`:
- Around line 346-352: The two mock RPCs InsertHealthReportOverride and
RemoveHealthReportOverride on MockForgeClient always return success; update both
to follow the file's established pattern by checking ctx.Value("wantError") (or
the same key other mocks use) and, if set, return that injected error (or wrap
it) instead of success; otherwise return new(emptypb.Empty), nil—this enables
context-based error injection for tests.

In `@site-workflow/pkg/workflow/machine_test.go`:
- Around line 187-219: The tests only cover success paths for
ApplyMachineOnlineRepairHealthOverride and
ClearMachineOnlineRepairHealthOverride; add failure-case unit tests that mock
the activities to return an error and assert the workflow surfaces that error.
Specifically, add tests similar to
Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails and a counterpart for
ClearMachineOnlineRepairHealthOverride where you RegisterActivity and use
env.OnActivity for mActivity.ManageMachine.InsertHealthReportOverrideOnSite (and
RemoveHealthReportOverrideOnSite) to Return(errors.New(...)),
ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride /
ClearMachineOnlineRepairHealthOverride) and then assert
env.IsWorkflowCompleted(), env.GetWorkflowError() is non-nil, the error is a
temporal.ApplicationError and its message matches the mocked error string.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d5273c3b-389e-4af5-9c19-c3e0d00ed09c

📥 Commits

Reviewing files that changed from the base of the PR and between 03fbdb0 and 72dc761.

📒 Files selected for processing (16)
  • api/pkg/api/handler/machine.go
  • api/pkg/api/handler/machine_test.go
  • api/pkg/api/model/machine.go
  • api/pkg/api/model/machine_test.go
  • db/pkg/db/model/instance.go
  • db/pkg/db/model/instance_test.go
  • db/repair.md
  • db/tenant-capability-google-doc.html
  • db/tenant-capability.md
  • openapi/spec.yaml
  • site-agent/pkg/components/managers/machine/subscriber.go
  • site-workflow/pkg/activity/machine.go
  • site-workflow/pkg/activity/machine_test.go
  • site-workflow/pkg/grpc/client/testing.go
  • site-workflow/pkg/workflow/machine.go
  • site-workflow/pkg/workflow/machine_test.go

Comment on lines +1448 to +1452
stc, err := umh.scp.GetClientByID(machine.SiteID)
if err != nil {
logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")
return err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Redundant temporal client retrieval.

The site temporal client is already retrieved at line 965 (stc, err := umh.scp.GetClientByID(machine.SiteID)). This second retrieval at line 1448 is unnecessary.

♻️ Suggested fix
 	// Enter or exit in-pool online repair (Site health override + Instance status / labels in Cloud DB)
 	if apiRequest.IsMachineOnlineRepairOperation() {
 		if machine.IsMissingOnSite {
 			return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine is currently missing on Site, cannot update online repair state", nil)
 		}
-
-		stc, err := umh.scp.GetClientByID(machine.SiteID)
-		if err != nil {
-			logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")
-			return err
-		}

 		iDAO := cdbm.NewInstanceDAO(umh.dbSession)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
stc, err := umh.scp.GetClientByID(machine.SiteID)
if err != nil {
logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")
return err
}
// Enter or exit in-pool online repair (Site health override + Instance status / labels in Cloud DB)
if apiRequest.IsMachineOnlineRepairOperation() {
if machine.IsMissingOnSite {
return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine is currently missing on Site, cannot update online repair state", nil)
}
iDAO := cdbm.NewInstanceDAO(umh.dbSession)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 1448 - 1452, Duplicate call to
umh.scp.GetClientByID is unnecessary; remove the second retrieval (the stc, err
:= umh.scp.GetClientByID(machine.SiteID) block at the later location) and reuse
the previously obtained stc from earlier in the same handler, ensuring the
variable is in scope (or pass it into the helper function if needed) and delete
the redundant error handling branch so the later logic uses the existing
Temporal client instead.

Comment on lines +1521 to +1528
if err != nil {
var timeoutErr *tp.TimeoutError
if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
}
code, werr := common.UnwrapWorkflowError(err)
logger.Error().Err(werr).Msg("online repair health override workflow failed")
return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Incorrect context variable in timeout check.

At line 1523, ctx.Err() checks the parent context, but the workflow was executed with wfCtx. If the parent context hasn't timed out but wfCtx has, this condition may not evaluate correctly.

🐛 Suggested fix
 			err = we.Get(wfCtx, nil)
 			if err != nil {
 				var timeoutErr *tp.TimeoutError
-				if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
+				if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {
 					return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
 				}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err != nil {
var timeoutErr *tp.TimeoutError
if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
}
code, werr := common.UnwrapWorkflowError(err)
logger.Error().Err(werr).Msg("online repair health override workflow failed")
return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)
if err != nil {
var timeoutErr *tp.TimeoutError
if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
}
code, werr := common.UnwrapWorkflowError(err)
logger.Error().Err(werr).Msg("online repair health override workflow failed")
return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 1521 - 1528, The timeout
detection block currently checks ctx.Err() (parent context) instead of the
workflow context, so replace the ctx.Err() check with wfCtx.Err() in the error
branch that inspects tp.TimeoutError and context.DeadlineExceeded; update the
conditional in the function handling err (the block that declares var timeoutErr
*tp.TimeoutError and returns via cutil.NewAPIErrorResponse) to use wfCtx.Err()
so timeouts from the workflow context are correctly detected while keeping the
existing UnwrapWorkflowError handling and logging (common.UnwrapWorkflowError,
logger.Error(), and the subsequent cutil.NewAPIErrorResponse) unchanged.

Comment on lines +1574 to +1582
if err != nil {
var timeoutErr *tp.TimeoutError
if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
}
code, werr := common.UnwrapWorkflowError(err)
logger.Error().Err(werr).Msg("clear online repair health override workflow failed")
return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Same context variable issue in exit flow.

At line 1576, the timeout check uses ctx.Err() instead of wfCtx.Err().

🐛 Suggested fix
 			err = we.Get(wfCtx, nil)
 			if err != nil {
 				var timeoutErr *tp.TimeoutError
-				if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
+				if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {
 					return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
 				}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err != nil {
var timeoutErr *tp.TimeoutError
if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
}
code, werr := common.UnwrapWorkflowError(err)
logger.Error().Err(werr).Msg("clear online repair health override workflow failed")
return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)
}
if err != nil {
var timeoutErr *tp.TimeoutError
if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
}
code, werr := common.UnwrapWorkflowError(err)
logger.Error().Err(werr).Msg("clear online repair health override workflow failed")
return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 1574 - 1582, In the
error-handling block for clearing the online repair workflow (the if err != nil
branch that checks timeout and unwraps workflow errors via
common.UnwrapWorkflowError), replace the incorrect ctx.Err() check with
wfCtx.Err() so the timeout/deadline check uses the workflow context; keep the
rest of the logic (using err in the timeout message and werr for the
UnwrapWorkflowError path) unchanged and ensure wfCtx is in scope where the check
occurs.

Comment on lines +1586 to +1593
err = orTx.Commit()
if err != nil {
logger.Error().Err(err).Msg("error committing transaction")
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to update Machine, DB transaction error", nil)
}
orTxCommitted = true
um = machine
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Response does not reflect instance state changes.

At line 1592, um = machine assigns the original machine entity to um. However, the online repair operation modifies the associated instance's status and labels, not the machine itself. The response returned to the client via getAPIMachines (line 1600) will include the original machine data, which may not accurately reflect the instance state changes made during this operation.

Consider whether the API response should include the updated instance information to provide clients with accurate state after the operation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 1586 - 1593, The response
currently returns the original machine object (um = machine) after the online
repair, which doesn't include instance state changes; update the code so the
response reflects the modified instance by replacing um = machine with either
(a) embedding the updated instance fields into um (copy updated status/labels
from the repaired instance object into the machine struct used for response) or
(b) reloading the machine+instance from the DB (call the same loader used by
getAPIMachines) after orTx.Commit() and assign that fresh object to um so
getAPIMachines returns the accurate, post-repair instance state; ensure you
reference the modified instance struct created during the online repair and set
orTxCommitted accordingly.

Comment thread db/repair.md Outdated
Comment on lines +94 to +95
- The instance **stays in the tenant pool** (in **Ready** when active; in **OnlineRepair** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

State naming is inconsistent (Repairing vs OnlineRepair, and Assigned appears unexpectedly).

Line 94 uses OnlineRepair as a state while the status table introduces Repairing; Line 171 says Ready/Assigned even though the document’s state list does not define Assigned. Please normalize to one state vocabulary.

✏️ Suggested terminology normalization
-- The instance **stays in the tenant pool** (in **Ready** when active; in **OnlineRepair** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.
+- The instance **stays in the tenant pool** (in **Ready** when active; in **Repairing** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.
@@
-**Effect:** The Health Override for OnlineRepair is removed from the machine; the instance returns to **active (Ready/Assigned)** and is available for workload again.
+**Effect:** The Health Override for OnlineRepair is removed from the machine; the instance returns to **active (Ready)** and is available for workload again.
As per coding guidelines: `**/*.md`: "Assess the documentation for misspellings, grammatical errors, missing documentation and correctness".

Also applies to: 171-171

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/repair.md` around lines 94 - 95, The document uses inconsistent state
names (e.g., OnlineRepair vs Repairing, and an unexpected Assigned alongside
Ready); pick a single canonical term for repair state (suggest using
"Repairing") and normalize every occurrence: replace "OnlineRepair" with
"Repairing", update the status table to list "Repairing" instead of
"OnlineRepair", and either add "Assigned" to the defined state list or remove
the stray "Assigned" reference (ensure "Ready/Assigned" wording is corrected to
use only defined states like "Ready" or "Assigned" consistently); verify all
mentions (e.g., the sentence from the diff and the table around line 171) use
the chosen vocabulary.

Comment thread db/repair.md Outdated
"source": "tenant-reported-issue",
"alerts": [
{
"id": "OnLineRepair",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use consistent alert ID casing (OnlineRepair vs OnLineRepair).

Line 184 uses OnLineRepair, while the rest of the document consistently uses OnlineRepair. Standardizing this avoids ambiguity in implementation references.

✏️ Suggested casing fix
-      "id": "OnLineRepair",
+      "id": "OnlineRepair",
As per coding guidelines: `**/*.md`: "Assess the documentation for misspellings, grammatical errors, missing documentation and correctness".
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"id": "OnLineRepair",
"id": "OnlineRepair",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/repair.md` at line 184, The alert ID casing is inconsistent: change the
JSON field value "id": "OnLineRepair" to "OnlineRepair" to match the rest of the
document and ensure implementation references resolve; search for any other
occurrences of "OnLineRepair" (e.g., in examples, anchors, or references) and
update them to "OnlineRepair" so casing is consistent across the markdown and
any linked configuration or code.

Comment thread db/repair.md Outdated
Comment on lines +248 to +249
The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

allowAutoInstanceDeletionOnFailure is documented as optional but is required on enter.

Line 248 says the field is optional, which conflicts with the enter request contract in this PR flow. This will cause client confusion and invalid requests.

✏️ Suggested doc correction
-The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).
+The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and `allowAutoInstanceDeletionOnFailure`).
As per coding guidelines: `**/*.md`: "Assess the documentation for misspellings, grammatical errors, missing documentation and correctness".
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).
The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and `allowAutoInstanceDeletionOnFailure`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/repair.md` around lines 248 - 249, The docs state
allowAutoInstanceDeletionOnFailure is optional but the Machine PATCH API flow
(requestOnlineRepair: true) requires it on entry; update the documentation to
mark allowAutoInstanceDeletionOnFailure as required for the Machine PATCH API
request when requestOnlineRepair: true (or clarify the conditional requirement),
and add a short note next to requestOnlineRepair and the Machine PATCH API
section explaining the required inputs (machine health issue, acknowledgments,
and the mandatory allowAutoInstanceDeletionOnFailure flag) so clients won’t send
invalid requests.

Comment thread db/tenant-capability-google-doc.html Outdated
Comment on lines +93 to +112
<pre>flowchart TB
subgraph Client["Tenant Admin"]
A[PATCH Tenant Capabilities]
end
subgraph API["REST API"]
B[Resolve site set]
C[Upsert TenantSiteCapabilityAssociation]
end
subgraph Data["Persistence"]
D[(tenant)]
E[(tenant_site_capability_association)]
F[(site)]
G[(infrastructure_provider)]
end
A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B
B -->|"empty sites[] => all eligible sites; optional filter by provider"| C
C --> E
E --> F
E --> G
D -->|"ceiling check"| B</pre>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

HTMLHint flags unescaped special characters in Mermaid diagram.

The > characters in the Mermaid flowchart syntax within the <pre> block should technically be escaped as &gt; for valid HTML. While browsers typically render this correctly, escaping ensures consistency across parsers.

♻️ Suggested fix
-  A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B
-  B -->|"empty sites[] => all eligible sites; optional filter by provider"| C
-  C --> E
-  E --> F
-  E --> G
-  D -->|"ceiling check"| B</pre>
+  A --&gt;|"sites[], capabilityName, infrastructureProviderId, enabled"| B
+  B --&gt;|"empty sites[] =&gt; all eligible sites; optional filter by provider"| C
+  C --&gt; E
+  E --&gt; F
+  E --&gt; G
+  D --&gt;|"ceiling check"| B</pre>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<pre>flowchart TB
subgraph Client["Tenant Admin"]
A[PATCH Tenant Capabilities]
end
subgraph API["REST API"]
B[Resolve site set]
C[Upsert TenantSiteCapabilityAssociation]
end
subgraph Data["Persistence"]
D[(tenant)]
E[(tenant_site_capability_association)]
F[(site)]
G[(infrastructure_provider)]
end
A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B
B -->|"empty sites[] => all eligible sites; optional filter by provider"| C
C --> E
E --> F
E --> G
D -->|"ceiling check"| B</pre>
<pre>flowchart TB
subgraph Client["Tenant Admin"]
A[PATCH Tenant Capabilities]
end
subgraph API["REST API"]
B[Resolve site set]
C[Upsert TenantSiteCapabilityAssociation]
end
subgraph Data["Persistence"]
D[(tenant)]
E[(tenant_site_capability_association)]
F[(site)]
G[(infrastructure_provider)]
end
A --&gt;|"sites[], capabilityName, infrastructureProviderId, enabled"| B
B --&gt;|"empty sites[] =&gt; all eligible sites; optional filter by provider"| C
C --&gt; E
E --&gt; F
E --&gt; G
D --&gt;|"ceiling check"| B</pre>
🧰 Tools
🪛 HTMLHint (1.9.2)

[error] 107-107: Special characters must be escaped : [ > ].

(spec-char-escape)


[error] 108-108: Special characters must be escaped : [ > ].

(spec-char-escape)


[error] 108-108: Special characters must be escaped : [ > ].

(spec-char-escape)


[error] 109-109: Special characters must be escaped : [ > ].

(spec-char-escape)


[error] 110-110: Special characters must be escaped : [ > ].

(spec-char-escape)


[error] 111-111: Special characters must be escaped : [ > ].

(spec-char-escape)


[error] 112-112: Special characters must be escaped : [ > ].

(spec-char-escape)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/tenant-capability-google-doc.html` around lines 93 - 112, The Mermaid
flowchart inside the <pre> block contains raw ">" characters (e.g., in node
labels like A[PATCH Tenant Capabilities] and arrows like A -->|"sites[],
capabilityName, infrastructureProviderId, enabled"| B) which HTMLHint flags;
update the <pre> block so all literal ">" characters are replaced with the
escaped entity &gt; (and similarly escape any "<" as &lt; if present) so the
Mermaid diagram remains the same visually but the HTML is valid; target the
flowchart text within the <pre> block (nodes A, B, C, D, E, F, G and the arrow
lines) to perform the replacements.

Comment thread openapi/spec.yaml
Comment on lines +16884 to +16909
MachineHealthIssue:
title: MachineHealthIssue
type: object
required:
- category
- summary
- details
properties:
category:
type: string
enum:
- HARDWARE
- NETWORK
- PERFORMANCE
- STORAGE
- SOFTWARE
- OTHER
description: High-level category for the tenant-reported issue.
summary:
type: string
maxLength: 512
description: Brief description of the issue for operators.
details:
type: string
maxLength: 8192
description: Diagnostic information, logs, ticket numbers, etc.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expected result: exactly one match. More than one means the schema key is duplicated.
rg -n '^ {4}MachineHealthIssue:$' openapi/spec.yaml

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 135


Remove the duplicate MachineHealthIssue schema definition.

The YAML file contains two MachineHealthIssue entries under components.schemas (lines 15587 and 16884). In YAML, the second key silently overwrites the first, rendering the initial definition inaccessible to any consumers of the specification. This breaks schema clarity, violates the API contract, and causes linting failures. Consolidate to a single, authoritative definition.

🧰 Tools
🪛 YAMLlint (1.38.0)

[error] 16884-16884: duplication of key "MachineHealthIssue" in mapping

(key-duplicates)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openapi/spec.yaml` around lines 16884 - 16909, There are two duplicate
components.schemas entries named MachineHealthIssue; remove the redundant
definition and keep a single authoritative MachineHealthIssue schema under
components.schemas (or merge any differing fields into one consolidated
version), ensuring all properties (category, summary, details), enums and
descriptions are preserved; update or verify any references to
MachineHealthIssue remain valid and remove the earlier or later duplicate so the
spec contains only one MachineHealthIssue schema.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 22, 2026

Test Results

8 856 tests  +7   8 854 ✅ +5   7m 24s ⏱️ +38s
  146 suites +1       0 💤 ±0 
   14 files   ±0       2 ❌ +2 

For more details on these failures, see this check.

Results for commit a0cd420. ± Comparison against base commit 3616400.

♻️ This comment has been updated with latest results.

Classifications: []string{
"PreventAllocations",
"PreventDeletion",
"PreventSuperTenantAllocation",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to remove for being send PreventSuperTenantAllocation

Id: onlineRepairHealthAlertID,
Target: &tgt,
Message: msg,
TenantMessage: cdb.GetStrPtr(fmt.Sprintf("TenantReportedIssue: %s", mhi.Summary)),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add acknowledgment policy and accept from user also

@hwadekar-nv hwadekar-nv changed the title feat: Initial draft commit for Machine OnlineRepair feat: Initial draft commit for Instance online repair Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant