feat: Initial draft commit for Instance online repair by hwadekar-nv · Pull Request #415 · NVIDIA/infra-controller-rest

hwadekar-nv · 2026-04-22T22:39:49Z

Description

Based on the design https://gitlab-master.nvidia.com/nvmetal/designs/-/blob/breakfix-swiftstack/designs/0066-Instance-Repair-Management.md#proposed-workflow-in-pool-repair-dedicated-ops-team, the Instance-Repair-Management implementation was created in an initial draft version.

Added mechanism in Instance to update or make repair status
- Tenant: Online repair is allowed for Tenant Admin when either TargetedInstanceCreation is set or the request is only an online-repair operation; tenant must have a tenant account with the machine’s provider, and the machine’s single instance must belong to that tenant.
- Provider: Same instance/machine checks without tenant ownership.
- Enter: Instance must be Ready. DB transaction updates instance to Repairing, sets label onlineRepair.allowAutoInstanceDeletionOnFailure to "true"/"false", then runs Temporal workflow ApplyMachineOnlineRepairHealthOverride with a tenant-reported-issue health override matching the HLD (OnLineRepair alert + classifications).
- Exit: Instance must be Repairing. DB updates instance back to Ready and removes the policy label, then workflow ClearMachineOnlineRepairHealthOverride removes the override for source tenant-reported-issue._
Added PATCH request in Machine to allow Machine to be in repair while having Instance allocated
- Enter: requestOnlineRepair: true + machineHealthIssue + repairPolicy.allowAutoInstanceDeletionOnFailure + acknowledgments (all three booleans must be true).
- Exit: requestOnlineRepair: false + clearOnlineRepair: true (no other online-repair payload fields)._
Site workflow
- InsertHealthReportOverrideOnSite / RemoveHealthReportOverrideOnSite activities call
- InsertHealthReportOverride / RemoveHealthReportOverride.
- Workflows ApplyMachineOnlineRepairHealthOverride and ClearMachineOnlineRepairHealthOverride_
Site-agent
- Workflow registered in site-agent/pkg/components/managers/machine/subscriber.go._

Type of Change

Feature - New feature or functionality (feat:)
Fix - Bug fixes (fix:)
Chore - Modification or removal of existing functionality (chore:)
Refactor - Refactoring of existing functionality (refactor:)
Docs - Changes in documentation or OpenAPI schema (docs:)
CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
Version - Issuing a new release version (version:)

Services Affected

API - API models or endpoints updated
Workflow - Workflow service updated
DB - DB DAOs or migrations updated
Site Manager - Site Manager updated
Cert Manager - Cert Manager updated
Site Agent - Site Agent updated
RLA - RLA service updated
Powershelf Manager - Powershelf Manager updated
NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

coderabbitai · 2026-04-22T22:40:02Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 80dbce4b-1d34-40b5-83d3-3892a7467a57

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

The pull request introduces in-pool online repair functionality for machines, enabling tenant repair teams to restore machines within the instance pool. Changes include API handler authorization and transaction logic for state transitions, new request models with validation for repair intent and acknowledgments, instance status additions, Temporal workflows coordinating with Site for health report overrides, and comprehensive design documentation.

Changes

Cohort / File(s)	Summary
API Handler Implementation `api/pkg/api/handler/machine.go`	Extended `UpdateMachineHandler` to enforce tenant authorization for online repair, validate machine assignment and instance status, manage database transaction for status toggling (Ready↔Repairing), coordinate health report overrides via Temporal workflows, and handle auto-deletion label management based on repair policy.
API Handler Tests `api/pkg/api/handler/machine_test.go`	Added expectations for two new Temporal workflow executions (`ApplyMachineOnlineRepairHealthOverride`, `ClearMachineOnlineRepairHealthOverride`) in both success and timeout scenarios.
API Request Models `api/pkg/api/model/machine.go`	Added new types (`APIMachineHealthIssue`, `APIOnlineRepairPolicy`, `APIOnlineRepairAcknowledgments`) and extended `APIMachineUpdateRequest` with online repair fields. Introduced validation for exclusive operation modes, health issue category/length constraints, and mandatory acknowledgments. Added builder function `BuildOnlineRepairHealthOverrideMessageJSON` for health override payloads.
API Model Tests `api/pkg/api/model/machine_test.go`	Extended `TestAPIMachineUpdateRequest_Validate` with three new test cases covering enter-repair success, conflict detection with maintenance mode, and exit-repair validation.
Database Instance Model `db/pkg/db/model/instance.go`	Added `InstanceStatusRepairing` constant and updated `InstanceStatusMap` and `GetCountByStatus` to track the new status.
Database Model Tests `db/pkg/db/model/instance_test.go`	Updated `TestInstanceSQLDAO_GetCountByStatus` expectations to include `InstanceStatusRepairing` count entries.
Site Workflow Activities `site-workflow/pkg/activity/machine.go`	Implemented two new activities (`InsertHealthReportOverrideOnSite`, `RemoveHealthReportOverrideOnSite`) with strict request validation and Forge RPC invocation to manage health report overrides.
Site Activity Tests `site-workflow/pkg/activity/machine_test.go`	Added unit tests for both new activities covering success paths and nil request error handling.
Site Workflows `site-workflow/pkg/workflow/machine.go`	Introduced two Temporal workflows (`ApplyMachineOnlineRepairHealthOverride`, `ClearMachineOnlineRepairHealthOverride`) with 2-minute timeouts and retry policies for health override coordination.
Workflow Tests `site-workflow/pkg/workflow/machine_test.go`	Added test cases for both new workflows verifying successful activity integration.
Workflow gRPC Client Mock `site-workflow/pkg/grpc/client/testing.go`	Added mock implementations for `InsertHealthReportOverride` and `RemoveHealthReportOverride` RPC methods.
Site Agent Registration `site-agent/pkg/components/managers/machine/subscriber.go`	Registered new Temporal workflows and site activities for online repair health override management.
OpenAPI Specification `openapi/spec.yaml`	Extended machine update API contract with new schemas for health issues, repair policies, and acknowledgments; added `Repairing` instance status; updated request descriptions for online repair semantics.
Design Documentation `db/repair.md`, `db/tenant-capability.md`, `db/tenant-capability-google-doc.html`	Introduced comprehensive design documents describing in-pool online repair workflow, instance state transitions, failure handling semantics, and tenant capability scoping architecture.

Sequence Diagram

sequenceDiagram
    participant Tenant as Tenant/Client
    participant Handler as API Handler
    participant DB as Database
    participant Temporal as Temporal
    participant Activity as Site Activity
    participant Site as Site Controller

    Tenant->>Handler: PATCH /machines/{id} (requestOnlineRepair)
    Handler->>Handler: Validate machine is assigned & Instance status
    Handler->>Handler: Enforce TenantAccount authorization
    
    Handler->>DB: Begin Transaction
    Handler->>DB: Update Instance status Ready → Repairing
    Handler->>DB: Set OnlineRepair auto-deletion label
    Handler->>DB: Commit Transaction
    
    Handler->>Temporal: ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride)
    Temporal->>Activity: InsertHealthReportOverrideOnSite(machine_id, health_report)
    Activity->>Site: RPC InsertHealthReportOverride
    Site-->>Activity: Empty response
    Activity-->>Temporal: Success
    Temporal-->>Handler: Workflow complete
    
    Handler-->>Tenant: 200 OK (Instance.status=Repairing)
    
    Tenant->>Handler: PATCH /machines/{id} (clearOnlineRepair)
    Handler->>Handler: Validate Instance status is Repairing
    
    Handler->>DB: Begin Transaction
    Handler->>DB: Update Instance status Repairing → Ready
    Handler->>DB: Clear OnlineRepair label
    Handler->>DB: Commit Transaction
    
    Handler->>Temporal: ExecuteWorkflow(ClearMachineOnlineRepairHealthOverride)
    Temporal->>Activity: RemoveHealthReportOverrideOnSite(machine_id)
    Activity->>Site: RPC RemoveHealthReportOverride
    Site-->>Activity: Empty response
    Activity-->>Temporal: Success
    Temporal-->>Handler: Workflow complete
    
    Handler-->>Tenant: 200 OK (Instance.status=Ready)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The changes introduce a new feature spanning heterogeneous components—dense handler transaction logic with state management, comprehensive validation across multiple model layers, database schema extensions, Temporal workflow coordination, activity implementations, and OpenAPI contract updates. The handler requires careful scrutiny for transaction semantics, authorization gating, and Temporal error handling. Multiple interacting components demand verification of consistency across API, database, workflow, and activity layers.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description comprehensively details the implementation of an Instance-Repair-Management feature, referencing a design document and explaining role-specific authorization, state transitions, API changes, and site workflow components.
Title check	✅ Passed	The PR title 'feat: Initial draft commit for Instance online repair' clearly and concisely describes the primary feature being introduced—online repair functionality for machine instances.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/online-repair

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-22T22:40:41Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

_{🕐 Last updated: 2026-04-22 22:40:40 UTC | Commit: 72dc761}

copy-pr-bot · 2026-04-22T22:41:54Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

openapi/spec.yaml (2)
16949-17010: ⚠️ Potential issue | 🟠 Major

Regenerate the checked-in SDK/docs for this OpenAPI change.

CI is already failing because openapi/spec.yaml changed without the corresponding generated output under sdk/standard/ or docs/index.html. Please include the regenerated artifacts in this PR so consumers pick up Repairing and the new online-repair request shapes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openapi/spec.yaml` around lines 16949 - 17010, The OpenAPI spec added
online-repair fields (requestOnlineRepair, clearOnlineRepair) and schemas
(MachineHealthIssue, OnlineRepairPolicy, OnlineRepairAcknowledgments) and
introduced the new Instance status "Repairing", but the generated SDK/docs were
not updated; regenerate the OpenAPI client and docs and commit the outputs so CI
passes—run the repo's OpenAPI generation script or tool to update sdk/standard/
and docs/index.html (and any other generated artifacts), verify the new shapes
and the "Repairing" status appear in the generated SDK and HTML, and include
those regenerated files in this PR.
16962-17010: ⚠️ Potential issue | 🟠 Major

Encode the online-repair modes in the schema itself.

The enter/exit rules currently live only in descriptions and examples. The schema still permits invalid payloads such as requestOnlineRepair: true without acknowledgments, requestOnlineRepair: false together with machineHealthIssue, or online-repair fields mixed with the other update properties. Please model these as mutually exclusive variants with oneOf/not so generated clients and validators enforce the same contract as the handler.

As per coding guidelines, "MachineUpdateRequest supports two mutually exclusive modes for online repair: Enter: requestOnlineRepair=true + machineHealthIssue + repairPolicy + acknowledgments. Exit: requestOnlineRepair=false + clearOnlineRepair=true, with no other online-repair payload fields combined."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openapi/spec.yaml` around lines 16962 - 17010, The schema currently allows
invalid combinations because the online-repair semantics are only in docs;
update the MachineUpdateRequest schema to express three mutually exclusive
variants using oneOf/not: (1) Normal update: allows
instanceTypeId/clearInstanceType, setMaintenanceMode, maintenanceMessage, labels
and must forbid any online-repair fields (requestOnlineRepair,
clearOnlineRepair, machineHealthIssue, repairPolicy, acknowledgments); (2) Enter
online-repair mode: require requestOnlineRepair=true and require
machineHealthIssue, repairPolicy, and acknowledgments (reference
MachineHealthIssue, OnlineRepairPolicy, OnlineRepairAcknowledgments) and forbid
instanceTypeId/clearInstanceType, setMaintenanceMode, maintenanceMessage,
labels, clearOnlineRepair; (3) Exit online-repair mode: require
requestOnlineRepair=false and clearOnlineRepair=true and forbid
machineHealthIssue, repairPolicy, acknowledgments and also forbid other update
properties; implement these with oneOf plus required/const and
additionalProperties/not blocks so validators and generated clients enforce the
enter/exit rules described in the description.

🧹 Nitpick comments (8)

site-workflow/pkg/grpc/client/testing.go (1)

346-352: Mock methods lack context-based error injection support.

The new InsertHealthReportOverride and RemoveHealthReportOverride mock methods always return success, unlike other mock methods in this file that support ctx.Value("wantError") for injecting failures during tests. This limits the ability to test error handling paths in workflows and activities that invoke these RPCs.

Consider aligning with the established pattern for comprehensive test coverage:

♻️ Suggested enhancement

 func (c *MockForgeClient) InsertHealthReportOverride(ctx context.Context, in *wflows.InsertHealthReportOverrideRequest, opts ...grpc.CallOption) (*emptypb.Empty, error) {
+	if err, ok := ctx.Value("wantError").(error); ok {
+		return nil, status.Error(status.Code(err), "failed to insert health report override")
+	}
 	return new(emptypb.Empty), nil
 }

 func (c *MockForgeClient) RemoveHealthReportOverride(ctx context.Context, in *wflows.RemoveHealthReportOverrideRequest, opts ...grpc.CallOption) (*emptypb.Empty, error) {
+	if err, ok := ctx.Value("wantError").(error); ok {
+		return nil, status.Error(status.Code(err), "failed to remove health report override")
+	}
 	return new(emptypb.Empty), nil
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/grpc/client/testing.go` around lines 346 - 352, The two
mock RPCs InsertHealthReportOverride and RemoveHealthReportOverride on
MockForgeClient always return success; update both to follow the file's
established pattern by checking ctx.Value("wantError") (or the same key other
mocks use) and, if set, return that injected error (or wrap it) instead of
success; otherwise return new(emptypb.Empty), nil—this enables context-based
error injection for tests.

db/tenant-capability-google-doc.html (1)

1-325: Consider consolidating documentation to avoid duplication.

Both db/tenant-capability.md and db/tenant-capability-google-doc.html contain the same HLD content. Maintaining two formats creates a risk of divergence. Consider keeping only the Markdown version and generating HTML if needed, or documenting which is the source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/tenant-capability-google-doc.html` around lines 1 - 325, Duplicate HLD
content exists in db/tenant-capability.md and
db/tenant-capability-google-doc.html which risks divergence; remove the
redundancy by choosing a single source-of-truth (prefer the Markdown file) and
delete or turn the HTML into a generated artifact, update references to point to
db/tenant-capability.md, and ensure the canonical doc names (e.g.
TenantSiteCapabilityAssociation, TargetedInstanceCreation) are preserved in the
retained file so handlers and reviewers can find the same content.

site-workflow/pkg/activity/machine_test.go (1)

142-180: Tests cover basic paths but lack edge case coverage.

The tests verify successful execution and nil request validation. However, the activity implementations validate multiple conditions (e.g., MachineId.Id == "", Override == nil, Override.Report == nil, Source == ""). Consider adding tests for these edge cases to ensure the validation logic is fully exercised.

♻️ Suggested additional test cases

func TestManageMachine_InsertHealthReportOverrideOnSite_ValidationErrors(t *testing.T) {
	mockCarbide := cClient.NewMockCarbideClient()
	carbideAtomicClient := cClient.NewCarbideAtomicClient(&cClient.CarbideClientConfig{})
	carbideAtomicClient.SwapClient(mockCarbide)
	mm := NewManageMachine(carbideAtomicClient)

	testCases := []struct {
		name    string
		request *cwssaws.InsertHealthReportOverrideRequest
	}{
		{"nil request", nil},
		{"nil MachineId", &cwssaws.InsertHealthReportOverrideRequest{MachineId: nil}},
		{"empty MachineId.Id", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: ""}}},
		{"nil Override", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: "m1"}, Override: nil}},
		{"nil Override.Report", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: "m1"}, Override: &cwssaws.HealthReportOverride{Report: nil}}},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			err := mm.InsertHealthReportOverrideOnSite(context.Background(), tc.request)
			assert.Error(t, err)
		})
	}
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/activity/machine_test.go` around lines 142 - 180, Tests
only exercise success and nil-request paths; add unit tests that assert
validation errors for empty or missing fields checked in the activity. Extend
site-workflow/pkg/activity/machine_test.go to include table-driven tests calling
NewManageMachine(...).InsertHealthReportOverrideOnSite and
RemoveHealthReportOverrideOnSite covering cases: nil MachineId, empty
MachineId.Id, nil Override, nil Override.Report, and empty Source (for
RemoveHealthReportOverrideOnSite), and assert Error for each; reference the
functions InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite and
constructor NewManageMachine to locate where to call them.

site-workflow/pkg/workflow/machine_test.go (1)

187-219: Tests cover success paths; consider adding failure scenario tests.

The tests correctly verify that both workflows complete successfully when activities succeed. However, the test suite includes failure scenario tests for other workflows (e.g., Test_UpdateMachineMetadata_ActivityFails). For consistency and to ensure error propagation is verified, consider adding analogous failure tests.

♻️ Suggested failure test

func (s *MachineWorkflowTestSuite) Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails() {
	var machineManager mActivity.ManageMachine
	req := &cwssaws.InsertHealthReportOverrideRequest{
		MachineId: &cwssaws.MachineId{Id: uuid.New().String()},
		Override: &cwssaws.HealthReportOverride{
			Report: &cwssaws.HealthReport{Source: "tenant-reported-issue"},
			Mode:   cwssaws.OverrideMode_Replace,
		},
	}
	errMsg := "Site Controller communication error"

	s.env.RegisterActivity(machineManager.InsertHealthReportOverrideOnSite)
	s.env.OnActivity(machineManager.InsertHealthReportOverrideOnSite, mock.Anything, mock.Anything).Return(errors.New(errMsg))

	s.env.ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride, req)
	s.True(s.env.IsWorkflowCompleted())
	err := s.env.GetWorkflowError()
	s.Error(err)

	var applicationErr *temporal.ApplicationError
	s.True(errors.As(err, &applicationErr))
	s.Equal(errMsg, applicationErr.Error())
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@site-workflow/pkg/workflow/machine_test.go` around lines 187 - 219, The tests
only cover success paths for ApplyMachineOnlineRepairHealthOverride and
ClearMachineOnlineRepairHealthOverride; add failure-case unit tests that mock
the activities to return an error and assert the workflow surfaces that error.
Specifically, add tests similar to
Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails and a counterpart for
ClearMachineOnlineRepairHealthOverride where you RegisterActivity and use
env.OnActivity for mActivity.ManageMachine.InsertHealthReportOverrideOnSite (and
RemoveHealthReportOverrideOnSite) to Return(errors.New(...)),
ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride /
ClearMachineOnlineRepairHealthOverride) and then assert
env.IsWorkflowCompleted(), env.GetWorkflowError() is non-nil, the error is a
temporal.ApplicationError and its message matches the mocked error string.

db/tenant-capability.md (1)

265-272: Handler migration guidance is documented but not yet implemented.

The HLD correctly identifies that downstream handlers must replace checks using tenant.Config.TargetedInstanceCreation with effective site-scoped capability evaluation. Based on the relevant code snippets provided, the current api/pkg/api/handler/machine.go implementation (lines 323-329 and 824-845) still uses only the tenant-global check without TenantSiteCapabilityAssociation lookup.

This is acceptable for an initial draft, but consider adding a tracking item or TODO to ensure the handler migration is completed before this capability model is fully operational.

Would you like me to open an issue to track the handler migration work outlined in this HLD?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@db/tenant-capability.md` around lines 265 - 272, The handler migration
described in the HLD hasn't been implemented: handlers in
api/pkg/api/handler/machine.go still check
tenant.Config.TargetedInstanceCreation directly instead of using
effective(tenant, TenantSiteCapabilityAssociation, siteID) for site-scoped
requests (instance create/delete, machine APIs, site listing). Add a tracking
TODO and create a repo issue describing: replace direct TargetedInstanceCreation
checks in the machine handlers and any site-scoped handlers with the
effective(..., siteID) helper, centralize the lookup in shared helpers, and
decide site-listing behavior (narrow vs discovery). Reference
TenantSiteCapabilityAssociation, TargetedInstanceCreation, effective(...,
siteID), and the machine handler functions so the migration can be completed
before the capability model goes live.

api/pkg/api/model/machine_test.go (1)

729-778: Add missing negative cases for online-repair branch coverage.

Current additions validate happy paths plus one exclusivity error, but Validate() has more online-repair branches that are currently untested (e.g., enter with clearOnlineRepair=true, exit without clearOnlineRepair=true, exit with forbidden machineHealthIssue/repairPolicy/acknowledgments payloads). Adding these would harden regression detection.

📌 Suggested test-case additions

@@
 		{
 			name: "test valid exit online repair request",
 			fields: fields{
 				RequestOnlineRepair: cdb.GetBoolPtr(false),
 				ClearOnlineRepair:   cdb.GetBoolPtr(true),
 			},
 			wantErr: false,
 		},
+		{
+			name: "test invalid enter online repair when clearOnlineRepair is true",
+			fields: fields{
+				RequestOnlineRepair: cdb.GetBoolPtr(true),
+				ClearOnlineRepair:   cdb.GetBoolPtr(true),
+				MachineHealthIssue: &APIMachineHealthIssue{
+					Category: "STORAGE",
+					Summary:  "Disk issue",
+					Details:  "logs and ticket refs",
+				},
+				RepairPolicy: &APIOnlineRepairPolicy{
+					AllowAutoInstanceDeletionOnFailure: cdb.GetBoolPtr(false),
+				},
+				Acknowledgments: &APIOnlineRepairAcknowledgments{
+					AcceptDataCorruptionRisk:   cdb.GetBoolPtr(true),
+					AcceptRepairTeamAccess:     cdb.GetBoolPtr(true),
+					AcceptInstanceDeletionRisk: cdb.GetBoolPtr(true),
+				},
+			},
+			wantErr: true,
+		},
+		{
+			name: "test invalid exit online repair without clearOnlineRepair true",
+			fields: fields{
+				RequestOnlineRepair: cdb.GetBoolPtr(false),
+			},
+			wantErr: true,
+		},
+		{
+			name: "test invalid exit online repair when enter-only fields are provided",
+			fields: fields{
+				RequestOnlineRepair: cdb.GetBoolPtr(false),
+				ClearOnlineRepair:   cdb.GetBoolPtr(true),
+				MachineHealthIssue: &APIMachineHealthIssue{
+					Category: "OTHER",
+					Summary:  "Should not be sent on exit",
+					Details:  "Should not be sent on exit",
+				},
+			},
+			wantErr: true,
+		},

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/model/machine_test.go` around lines 729 - 778, Add unit tests
calling the Machine.Validate() cases that are currently missing: (1) an "enter
online-repair" case where RequestOnlineRepair==true but ClearOnlineRepair==true
(should error), (2) an "exit online-repair" case where
RequestOnlineRepair==false and ClearOnlineRepair!=true (should error), and (3)
one or more "exit online-repair" cases where RequestOnlineRepair==false and
ClearOnlineRepair==true but forbidden payload fields are present
(MachineHealthIssue, RepairPolicy, Acknowledgments) and should cause validation
errors. Use the same test table structure and fields (RequestOnlineRepair,
ClearOnlineRepair, MachineHealthIssue, RepairPolicy, Acknowledgments) as the
surrounding tests and assert wantErr accordingly when calling Validate().

api/pkg/api/handler/machine.go (2)

64-67: Inconsistent constant value casing.

The value "OnLineRepair" contains inconsistent capitalization. The convention for compound words like "online" should be OnlineRepair rather than OnLineRepair.

♻️ Suggested fix

 const (
 	onlineRepairHealthOverrideSource = "tenant-reported-issue"
-	onlineRepairHealthAlertID      = "OnLineRepair"
+	onlineRepairHealthAlertID        = "OnlineRepair"
 )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 64 - 67, The constant
onlineRepairHealthAlertID has inconsistent capitalization ("OnLineRepair");
update its string value to "OnlineRepair" to follow compound-word casing
conventions, and search for any uses of onlineRepairHealthAlertID to ensure no
downstream logic or external consumers depend on the old casing and update tests
or comparisons accordingly.

846-862: Duplicated instance retrieval logic.

The instance retrieval and validation at lines 850-858 is duplicated in the main online repair block at lines 1454-1462. Consider extracting this common logic into a helper function or restructuring to retrieve the instance once and pass it through.

♻️ Suggested approach

Extract instance retrieval to a helper that returns the validated instance:

func (umh UpdateMachineHandler) getOnlineRepairInstance(ctx context.Context, machineID string, logger zerolog.Logger) (*cdbm.Instance, *cutil.APIError) {
    iDAO := cdbm.NewInstanceDAO(umh.dbSession)
    instances, ic, ierr := iDAO.GetAll(ctx, nil, cdbm.InstanceFilterInput{MachineIDs: []string{machineID}}, cdbp.PageInput{Limit: cdb.GetIntPtr(2)}, nil)
    if ierr != nil {
        logger.Error().Err(ierr).Msg("error retrieving Instance for Machine")
        return nil, cutil.NewAPIError(http.StatusInternalServerError, "Failed to retrieve Instance for Machine", nil)
    }
    if ic != 1 {
        return nil, cutil.NewAPIError(http.StatusBadRequest, "Machine must have exactly one Instance for online repair", nil)
    }
    return &instances[0], nil
}

Then call this once during authorization and reuse the result in the main online repair block.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/machine.go` around lines 846 - 862, The instance
retrieval/validation for online repair in UpdateMachineHandler is duplicated;
extract it into a helper (e.g., UpdateMachineHandler.getOnlineRepairInstance)
that accepts ctx, machine.ID and logger and returns (*cdbm.Instance,
*cutil.APIError) after calling cdbm.NewInstanceDAO(...).GetAll, logging errors
and translating them into cutil.NewAPIError responses (handle non-1 count and
tenant mismatch there or return the instance and let caller check tenant), then
replace both inline blocks (the pre-authorization check and the main
online-repair block) to call this helper once and reuse the returned instance
instead of repeating the GetAll logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1586-1593: The response currently returns the original machine
object (um = machine) after the online repair, which doesn't include instance
state changes; update the code so the response reflects the modified instance by
replacing um = machine with either (a) embedding the updated instance fields
into um (copy updated status/labels from the repaired instance object into the
machine struct used for response) or (b) reloading the machine+instance from the
DB (call the same loader used by getAPIMachines) after orTx.Commit() and assign
that fresh object to um so getAPIMachines returns the accurate, post-repair
instance state; ensure you reference the modified instance struct created during
the online repair and set orTxCommitted accordingly.
- Around line 1521-1528: The timeout detection block currently checks ctx.Err()
(parent context) instead of the workflow context, so replace the ctx.Err() check
with wfCtx.Err() in the error branch that inspects tp.TimeoutError and
context.DeadlineExceeded; update the conditional in the function handling err
(the block that declares var timeoutErr *tp.TimeoutError and returns via
cutil.NewAPIErrorResponse) to use wfCtx.Err() so timeouts from the workflow
context are correctly detected while keeping the existing UnwrapWorkflowError
handling and logging (common.UnwrapWorkflowError, logger.Error(), and the
subsequent cutil.NewAPIErrorResponse) unchanged.
- Around line 1574-1582: In the error-handling block for clearing the online
repair workflow (the if err != nil branch that checks timeout and unwraps
workflow errors via common.UnwrapWorkflowError), replace the incorrect ctx.Err()
check with wfCtx.Err() so the timeout/deadline check uses the workflow context;
keep the rest of the logic (using err in the timeout message and werr for the
UnwrapWorkflowError path) unchanged and ensure wfCtx is in scope where the check
occurs.
- Around line 1448-1452: Duplicate call to umh.scp.GetClientByID is unnecessary;
remove the second retrieval (the stc, err :=
umh.scp.GetClientByID(machine.SiteID) block at the later location) and reuse the
previously obtained stc from earlier in the same handler, ensuring the variable
is in scope (or pass it into the helper function if needed) and delete the
redundant error handling branch so the later logic uses the existing Temporal
client instead.

In `@db/repair.md`:
- Around line 94-95: The document uses inconsistent state names (e.g.,
OnlineRepair vs Repairing, and an unexpected Assigned alongside Ready); pick a
single canonical term for repair state (suggest using "Repairing") and normalize
every occurrence: replace "OnlineRepair" with "Repairing", update the status
table to list "Repairing" instead of "OnlineRepair", and either add "Assigned"
to the defined state list or remove the stray "Assigned" reference (ensure
"Ready/Assigned" wording is corrected to use only defined states like "Ready" or
"Assigned" consistently); verify all mentions (e.g., the sentence from the diff
and the table around line 171) use the chosen vocabulary.
- Line 184: The alert ID casing is inconsistent: change the JSON field value
"id": "OnLineRepair" to "OnlineRepair" to match the rest of the document and
ensure implementation references resolve; search for any other occurrences of
"OnLineRepair" (e.g., in examples, anchors, or references) and update them to
"OnlineRepair" so casing is consistent across the markdown and any linked
configuration or code.
- Around line 248-249: The docs state allowAutoInstanceDeletionOnFailure is
optional but the Machine PATCH API flow (requestOnlineRepair: true) requires it
on entry; update the documentation to mark allowAutoInstanceDeletionOnFailure as
required for the Machine PATCH API request when requestOnlineRepair: true (or
clarify the conditional requirement), and add a short note next to
requestOnlineRepair and the Machine PATCH API section explaining the required
inputs (machine health issue, acknowledgments, and the mandatory
allowAutoInstanceDeletionOnFailure flag) so clients won’t send invalid requests.

In `@db/tenant-capability-google-doc.html`:
- Around line 93-112: The Mermaid flowchart inside the <pre> block contains raw
">" characters (e.g., in node labels like A[PATCH Tenant Capabilities] and
arrows like A -->|"sites[], capabilityName, infrastructureProviderId, enabled"|
B) which HTMLHint flags; update the <pre> block so all literal ">" characters
are replaced with the escaped entity &gt; (and similarly escape any "<" as &lt;
if present) so the Mermaid diagram remains the same visually but the HTML is
valid; target the flowchart text within the <pre> block (nodes A, B, C, D, E, F,
G and the arrow lines) to perform the replacements.

In `@openapi/spec.yaml`:
- Around line 16884-16909: There are two duplicate components.schemas entries
named MachineHealthIssue; remove the redundant definition and keep a single
authoritative MachineHealthIssue schema under components.schemas (or merge any
differing fields into one consolidated version), ensuring all properties
(category, summary, details), enums and descriptions are preserved; update or
verify any references to MachineHealthIssue remain valid and remove the earlier
or later duplicate so the spec contains only one MachineHealthIssue schema.

---

Outside diff comments:
In `@openapi/spec.yaml`:
- Around line 16949-17010: The OpenAPI spec added online-repair fields
(requestOnlineRepair, clearOnlineRepair) and schemas (MachineHealthIssue,
OnlineRepairPolicy, OnlineRepairAcknowledgments) and introduced the new Instance
status "Repairing", but the generated SDK/docs were not updated; regenerate the
OpenAPI client and docs and commit the outputs so CI passes—run the repo's
OpenAPI generation script or tool to update sdk/standard/ and docs/index.html
(and any other generated artifacts), verify the new shapes and the "Repairing"
status appear in the generated SDK and HTML, and include those regenerated files
in this PR.
- Around line 16962-17010: The schema currently allows invalid combinations
because the online-repair semantics are only in docs; update the
MachineUpdateRequest schema to express three mutually exclusive variants using
oneOf/not: (1) Normal update: allows instanceTypeId/clearInstanceType,
setMaintenanceMode, maintenanceMessage, labels and must forbid any online-repair
fields (requestOnlineRepair, clearOnlineRepair, machineHealthIssue,
repairPolicy, acknowledgments); (2) Enter online-repair mode: require
requestOnlineRepair=true and require machineHealthIssue, repairPolicy, and
acknowledgments (reference MachineHealthIssue, OnlineRepairPolicy,
OnlineRepairAcknowledgments) and forbid instanceTypeId/clearInstanceType,
setMaintenanceMode, maintenanceMessage, labels, clearOnlineRepair; (3) Exit
online-repair mode: require requestOnlineRepair=false and clearOnlineRepair=true
and forbid machineHealthIssue, repairPolicy, acknowledgments and also forbid
other update properties; implement these with oneOf plus required/const and
additionalProperties/not blocks so validators and generated clients enforce the
enter/exit rules described in the description.

---

Nitpick comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 64-67: The constant onlineRepairHealthAlertID has inconsistent
capitalization ("OnLineRepair"); update its string value to "OnlineRepair" to
follow compound-word casing conventions, and search for any uses of
onlineRepairHealthAlertID to ensure no downstream logic or external consumers
depend on the old casing and update tests or comparisons accordingly.
- Around line 846-862: The instance retrieval/validation for online repair in
UpdateMachineHandler is duplicated; extract it into a helper (e.g.,
UpdateMachineHandler.getOnlineRepairInstance) that accepts ctx, machine.ID and
logger and returns (*cdbm.Instance, *cutil.APIError) after calling
cdbm.NewInstanceDAO(...).GetAll, logging errors and translating them into
cutil.NewAPIError responses (handle non-1 count and tenant mismatch there or
return the instance and let caller check tenant), then replace both inline
blocks (the pre-authorization check and the main online-repair block) to call
this helper once and reuse the returned instance instead of repeating the GetAll
logic.

In `@api/pkg/api/model/machine_test.go`:
- Around line 729-778: Add unit tests calling the Machine.Validate() cases that
are currently missing: (1) an "enter online-repair" case where
RequestOnlineRepair==true but ClearOnlineRepair==true (should error), (2) an
"exit online-repair" case where RequestOnlineRepair==false and
ClearOnlineRepair!=true (should error), and (3) one or more "exit online-repair"
cases where RequestOnlineRepair==false and ClearOnlineRepair==true but forbidden
payload fields are present (MachineHealthIssue, RepairPolicy, Acknowledgments)
and should cause validation errors. Use the same test table structure and fields
(RequestOnlineRepair, ClearOnlineRepair, MachineHealthIssue, RepairPolicy,
Acknowledgments) as the surrounding tests and assert wantErr accordingly when
calling Validate().

In `@db/tenant-capability-google-doc.html`:
- Around line 1-325: Duplicate HLD content exists in db/tenant-capability.md and
db/tenant-capability-google-doc.html which risks divergence; remove the
redundancy by choosing a single source-of-truth (prefer the Markdown file) and
delete or turn the HTML into a generated artifact, update references to point to
db/tenant-capability.md, and ensure the canonical doc names (e.g.
TenantSiteCapabilityAssociation, TargetedInstanceCreation) are preserved in the
retained file so handlers and reviewers can find the same content.

In `@db/tenant-capability.md`:
- Around line 265-272: The handler migration described in the HLD hasn't been
implemented: handlers in api/pkg/api/handler/machine.go still check
tenant.Config.TargetedInstanceCreation directly instead of using
effective(tenant, TenantSiteCapabilityAssociation, siteID) for site-scoped
requests (instance create/delete, machine APIs, site listing). Add a tracking
TODO and create a repo issue describing: replace direct TargetedInstanceCreation
checks in the machine handlers and any site-scoped handlers with the
effective(..., siteID) helper, centralize the lookup in shared helpers, and
decide site-listing behavior (narrow vs discovery). Reference
TenantSiteCapabilityAssociation, TargetedInstanceCreation, effective(...,
siteID), and the machine handler functions so the migration can be completed
before the capability model goes live.

In `@site-workflow/pkg/activity/machine_test.go`:
- Around line 142-180: Tests only exercise success and nil-request paths; add
unit tests that assert validation errors for empty or missing fields checked in
the activity. Extend site-workflow/pkg/activity/machine_test.go to include
table-driven tests calling
NewManageMachine(...).InsertHealthReportOverrideOnSite and
RemoveHealthReportOverrideOnSite covering cases: nil MachineId, empty
MachineId.Id, nil Override, nil Override.Report, and empty Source (for
RemoveHealthReportOverrideOnSite), and assert Error for each; reference the
functions InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite and
constructor NewManageMachine to locate where to call them.

In `@site-workflow/pkg/grpc/client/testing.go`:
- Around line 346-352: The two mock RPCs InsertHealthReportOverride and
RemoveHealthReportOverride on MockForgeClient always return success; update both
to follow the file's established pattern by checking ctx.Value("wantError") (or
the same key other mocks use) and, if set, return that injected error (or wrap
it) instead of success; otherwise return new(emptypb.Empty), nil—this enables
context-based error injection for tests.

In `@site-workflow/pkg/workflow/machine_test.go`:
- Around line 187-219: The tests only cover success paths for
ApplyMachineOnlineRepairHealthOverride and
ClearMachineOnlineRepairHealthOverride; add failure-case unit tests that mock
the activities to return an error and assert the workflow surfaces that error.
Specifically, add tests similar to
Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails and a counterpart for
ClearMachineOnlineRepairHealthOverride where you RegisterActivity and use
env.OnActivity for mActivity.ManageMachine.InsertHealthReportOverrideOnSite (and
RemoveHealthReportOverrideOnSite) to Return(errors.New(...)),
ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride /
ClearMachineOnlineRepairHealthOverride) and then assert
env.IsWorkflowCompleted(), env.GetWorkflowError() is non-nil, the error is a
temporal.ApplicationError and its message matches the mocked error string.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d5273c3b-389e-4af5-9c19-c3e0d00ed09c

📥 Commits

Reviewing files that changed from the base of the PR and between 03fbdb0 and 72dc761.

📒 Files selected for processing (16)

api/pkg/api/handler/machine.go
api/pkg/api/handler/machine_test.go
api/pkg/api/model/machine.go
api/pkg/api/model/machine_test.go
db/pkg/db/model/instance.go
db/pkg/db/model/instance_test.go
db/repair.md
db/tenant-capability-google-doc.html
db/tenant-capability.md
openapi/spec.yaml
site-agent/pkg/components/managers/machine/subscriber.go
site-workflow/pkg/activity/machine.go
site-workflow/pkg/activity/machine_test.go
site-workflow/pkg/grpc/client/testing.go
site-workflow/pkg/workflow/machine.go
site-workflow/pkg/workflow/machine_test.go

coderabbitai · 2026-04-22T22:48:18Z

+		stc, err := umh.scp.GetClientByID(machine.SiteID)
+		if err != nil {
+			logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")
+			return err
+		}


🛠️ Refactor suggestion | 🟠 Major

Redundant temporal client retrieval.

The site temporal client is already retrieved at line 965 (stc, err := umh.scp.GetClientByID(machine.SiteID)). This second retrieval at line 1448 is unnecessary.

♻️ Suggested fix

// Enter or exit in-pool online repair (Site health override + Instance status / labels in Cloud DB) if apiRequest.IsMachineOnlineRepairOperation() { if machine.IsMissingOnSite { return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine is currently missing on Site, cannot update online repair state", nil) } - - stc, err := umh.scp.GetClientByID(machine.SiteID) - if err != nil { - logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site") - return err - } iDAO := cdbm.NewInstanceDAO(umh.dbSession)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

stc, err := umh.scp.GetClientByID(machine.SiteID)

if err != nil {

logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")

return err

}

// Enter or exit in-pool online repair (Site health override + Instance status / labels in Cloud DB)

if apiRequest.IsMachineOnlineRepairOperation() {

if machine.IsMissingOnSite {

return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine is currently missing on Site, cannot update online repair state", nil)

}

iDAO := cdbm.NewInstanceDAO(umh.dbSession)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 1448 - 1452, Duplicate call to umh.scp.GetClientByID is unnecessary; remove the second retrieval (the stc, err := umh.scp.GetClientByID(machine.SiteID) block at the later location) and reuse the previously obtained stc from earlier in the same handler, ensuring the variable is in scope (or pass it into the helper function if needed) and delete the redundant error handling branch so the later logic uses the existing Temporal client instead.

coderabbitai · 2026-04-22T22:48:18Z

+			if err != nil {
+				var timeoutErr *tp.TimeoutError
+				if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
+					return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
+				}
+				code, werr := common.UnwrapWorkflowError(err)
+				logger.Error().Err(werr).Msg("online repair health override workflow failed")
+				return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)


⚠️ Potential issue | 🟡 Minor

Incorrect context variable in timeout check.

At line 1523, ctx.Err() checks the parent context, but the workflow was executed with wfCtx. If the parent context hasn't timed out but wfCtx has, this condition may not evaluate correctly.

🐛 Suggested fix

err = we.Get(wfCtx, nil) if err != nil { var timeoutErr *tp.TimeoutError - if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil { + if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil { return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if err != nil {

var timeoutErr *tp.TimeoutError

if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {

return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)

}

code, werr := common.UnwrapWorkflowError(err)

logger.Error().Err(werr).Msg("online repair health override workflow failed")

return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)

if err != nil {

var timeoutErr *tp.TimeoutError

if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {

return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)

}

code, werr := common.UnwrapWorkflowError(err)

logger.Error().Err(werr).Msg("online repair health override workflow failed")

return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 1521 - 1528, The timeout detection block currently checks ctx.Err() (parent context) instead of the workflow context, so replace the ctx.Err() check with wfCtx.Err() in the error branch that inspects tp.TimeoutError and context.DeadlineExceeded; update the conditional in the function handling err (the block that declares var timeoutErr *tp.TimeoutError and returns via cutil.NewAPIErrorResponse) to use wfCtx.Err() so timeouts from the workflow context are correctly detected while keeping the existing UnwrapWorkflowError handling and logging (common.UnwrapWorkflowError, logger.Error(), and the subsequent cutil.NewAPIErrorResponse) unchanged.

coderabbitai · 2026-04-22T22:48:18Z

+			if err != nil {
+				var timeoutErr *tp.TimeoutError
+				if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
+					return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
+				}
+				code, werr := common.UnwrapWorkflowError(err)
+				logger.Error().Err(werr).Msg("clear online repair health override workflow failed")
+				return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)
+			}


⚠️ Potential issue | 🟡 Minor

Same context variable issue in exit flow.

At line 1576, the timeout check uses ctx.Err() instead of wfCtx.Err().

🐛 Suggested fix

err = we.Get(wfCtx, nil) if err != nil { var timeoutErr *tp.TimeoutError - if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil { + if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil { return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if err != nil {

var timeoutErr *tp.TimeoutError

if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {

return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)

}

code, werr := common.UnwrapWorkflowError(err)

logger.Error().Err(werr).Msg("clear online repair health override workflow failed")

return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)

}

if err != nil {

var timeoutErr *tp.TimeoutError

if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {

return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)

}

code, werr := common.UnwrapWorkflowError(err)

logger.Error().Err(werr).Msg("clear online repair health override workflow failed")

return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 1574 - 1582, In the error-handling block for clearing the online repair workflow (the if err != nil branch that checks timeout and unwraps workflow errors via common.UnwrapWorkflowError), replace the incorrect ctx.Err() check with wfCtx.Err() so the timeout/deadline check uses the workflow context; keep the rest of the logic (using err in the timeout message and werr for the UnwrapWorkflowError path) unchanged and ensure wfCtx is in scope where the check occurs.

coderabbitai · 2026-04-22T22:48:18Z

+		err = orTx.Commit()
+		if err != nil {
+			logger.Error().Err(err).Msg("error committing transaction")
+			return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to update Machine, DB transaction error", nil)
+		}
+		orTxCommitted = true
+		um = machine
+	}


⚠️ Potential issue | 🟡 Minor

Response does not reflect instance state changes.

At line 1592, um = machine assigns the original machine entity to um. However, the online repair operation modifies the associated instance's status and labels, not the machine itself. The response returned to the client via getAPIMachines (line 1600) will include the original machine data, which may not accurately reflect the instance state changes made during this operation.

Consider whether the API response should include the updated instance information to provide clients with accurate state after the operation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 1586 - 1593, The response currently returns the original machine object (um = machine) after the online repair, which doesn't include instance state changes; update the code so the response reflects the modified instance by replacing um = machine with either (a) embedding the updated instance fields into um (copy updated status/labels from the repaired instance object into the machine struct used for response) or (b) reloading the machine+instance from the DB (call the same loader used by getAPIMachines) after orTx.Commit() and assign that fresh object to um so getAPIMachines returns the accurate, post-repair instance state; ensure you reference the modified instance struct created during the online repair and set orTxCommitted accordingly.

coderabbitai · 2026-04-22T22:48:19Z

+- The instance **stays in the tenant pool** (in **Ready** when active; in **OnlineRepair** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.
+


⚠️ Potential issue | 🟠 Major

State naming is inconsistent (Repairing vs OnlineRepair, and Assigned appears unexpectedly).

Line 94 uses OnlineRepair as a state while the status table introduces Repairing; Line 171 says Ready/Assigned even though the document’s state list does not define Assigned. Please normalize to one state vocabulary.

✏️ Suggested terminology normalization

-- The instance **stays in the tenant pool** (in **Ready** when active; in **OnlineRepair** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted. +- The instance **stays in the tenant pool** (in **Ready** when active; in **Repairing** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted. @@ -**Effect:** The Health Override for OnlineRepair is removed from the machine; the instance returns to **active (Ready/Assigned)** and is available for workload again. +**Effect:** The Health Override for OnlineRepair is removed from the machine; the instance returns to **active (Ready)** and is available for workload again.

As per coding guidelines: `**/*.md`: "Assess the documentation for misspellings, grammatical errors, missing documentation and correctness".
Also applies to: 171-171

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@db/repair.md` around lines 94 - 95, The document uses inconsistent state names (e.g., OnlineRepair vs Repairing, and an unexpected Assigned alongside Ready); pick a single canonical term for repair state (suggest using "Repairing") and normalize every occurrence: replace "OnlineRepair" with "Repairing", update the status table to list "Repairing" instead of "OnlineRepair", and either add "Assigned" to the defined state list or remove the stray "Assigned" reference (ensure "Ready/Assigned" wording is corrected to use only defined states like "Ready" or "Assigned" consistently); verify all mentions (e.g., the sentence from the diff and the table around line 171) use the chosen vocabulary.

coderabbitai · 2026-04-22T22:48:19Z

+  "source": "tenant-reported-issue",
+  "alerts": [
+    {
+      "id": "OnLineRepair",


⚠️ Potential issue | 🟡 Minor

Use consistent alert ID casing (OnlineRepair vs OnLineRepair).

Line 184 uses OnLineRepair, while the rest of the document consistently uses OnlineRepair. Standardizing this avoids ambiguity in implementation references.

✏️ Suggested casing fix

- "id": "OnLineRepair", + "id": "OnlineRepair",

As per coding guidelines: `**/*.md`: "Assess the documentation for misspellings, grammatical errors, missing documentation and correctness".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"id": "OnLineRepair",

"id": "OnlineRepair",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@db/repair.md` at line 184, The alert ID casing is inconsistent: change the JSON field value "id": "OnLineRepair" to "OnlineRepair" to match the rest of the document and ensure implementation references resolve; search for any other occurrences of "OnLineRepair" (e.g., in examples, anchors, or references) and update them to "OnlineRepair" so casing is consistent across the markdown and any linked configuration or code.

coderabbitai · 2026-04-22T22:48:19Z

+The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).
+


⚠️ Potential issue | 🟠 Major

allowAutoInstanceDeletionOnFailure is documented as optional but is required on enter.

Line 248 says the field is optional, which conflicts with the enter request contract in this PR flow. This will cause client confusion and invalid requests.

✏️ Suggested doc correction

-The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`). +The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and `allowAutoInstanceDeletionOnFailure`).

As per coding guidelines: `**/*.md`: "Assess the documentation for misspellings, grammatical errors, missing documentation and correctness".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).

The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and `allowAutoInstanceDeletionOnFailure`).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@db/repair.md` around lines 248 - 249, The docs state allowAutoInstanceDeletionOnFailure is optional but the Machine PATCH API flow (requestOnlineRepair: true) requires it on entry; update the documentation to mark allowAutoInstanceDeletionOnFailure as required for the Machine PATCH API request when requestOnlineRepair: true (or clarify the conditional requirement), and add a short note next to requestOnlineRepair and the Machine PATCH API section explaining the required inputs (machine health issue, acknowledgments, and the mandatory allowAutoInstanceDeletionOnFailure flag) so clients won’t send invalid requests.

coderabbitai · 2026-04-22T22:48:19Z

+<pre>flowchart TB
+  subgraph Client["Tenant Admin"]
+    A[PATCH Tenant Capabilities]
+  end
+  subgraph API["REST API"]
+    B[Resolve site set]
+    C[Upsert TenantSiteCapabilityAssociation]
+  end
+  subgraph Data["Persistence"]
+    D[(tenant)]
+    E[(tenant_site_capability_association)]
+    F[(site)]
+    G[(infrastructure_provider)]
+  end
+  A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B
+  B -->|"empty sites[] => all eligible sites; optional filter by provider"| C
+  C --> E
+  E --> F
+  E --> G
+  D -->|"ceiling check"| B</pre>


⚠️ Potential issue | 🟡 Minor

HTMLHint flags unescaped special characters in Mermaid diagram.

The > characters in the Mermaid flowchart syntax within the <pre> block should technically be escaped as > for valid HTML. While browsers typically render this correctly, escaping ensures consistency across parsers.

♻️ Suggested fix

- A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B - B -->|"empty sites[] => all eligible sites; optional filter by provider"| C - C --> E - E --> F - E --> G - D -->|"ceiling check"| B</pre> + A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B + B -->|"empty sites[] => all eligible sites; optional filter by provider"| C + C --> E + E --> F + E --> G + D -->|"ceiling check"| B</pre>

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

<pre>flowchart TB

subgraph Client["Tenant Admin"]

A[PATCH Tenant Capabilities]

end

subgraph API["REST API"]

B[Resolve site set]

C[Upsert TenantSiteCapabilityAssociation]

end

subgraph Data["Persistence"]

D[(tenant)]

E[(tenant_site_capability_association)]

F[(site)]

G[(infrastructure_provider)]

end

A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B

B -->|"empty sites[] => all eligible sites; optional filter by provider"| C

C --> E

E --> F

E --> G

D -->|"ceiling check"| B</pre>

<pre>flowchart TB

subgraph Client["Tenant Admin"]

A[PATCH Tenant Capabilities]

end

subgraph API["REST API"]

B[Resolve site set]

C[Upsert TenantSiteCapabilityAssociation]

end

subgraph Data["Persistence"]

D[(tenant)]

E[(tenant_site_capability_association)]

F[(site)]

G[(infrastructure_provider)]

end

A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B

B -->|"empty sites[] => all eligible sites; optional filter by provider"| C

C --> E

E --> F

E --> G

D -->|"ceiling check"| B</pre>

🧰 Tools

🪛 HTMLHint (1.9.2)

[error] 107-107: Special characters must be escaped : [ > ].

(spec-char-escape)

[error] 108-108: Special characters must be escaped : [ > ].

(spec-char-escape)

[error] 108-108: Special characters must be escaped : [ > ].

(spec-char-escape)

[error] 109-109: Special characters must be escaped : [ > ].

(spec-char-escape)

[error] 110-110: Special characters must be escaped : [ > ].

(spec-char-escape)

[error] 111-111: Special characters must be escaped : [ > ].

(spec-char-escape)

[error] 112-112: Special characters must be escaped : [ > ].

(spec-char-escape)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@db/tenant-capability-google-doc.html` around lines 93 - 112, The Mermaid flowchart inside the <pre> block contains raw ">" characters (e.g., in node labels like A[PATCH Tenant Capabilities] and arrows like A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B) which HTMLHint flags; update the <pre> block so all literal ">" characters are replaced with the escaped entity > (and similarly escape any "<" as < if present) so the Mermaid diagram remains the same visually but the HTML is valid; target the flowchart text within the <pre> block (nodes A, B, C, D, E, F, G and the arrow lines) to perform the replacements.

coderabbitai · 2026-04-22T22:48:19Z

+    MachineHealthIssue:
+      title: MachineHealthIssue
+      type: object
+      required:
+        - category
+        - summary
+        - details
+      properties:
+        category:
+          type: string
+          enum:
+            - HARDWARE
+            - NETWORK
+            - PERFORMANCE
+            - STORAGE
+            - SOFTWARE
+            - OTHER
+          description: High-level category for the tenant-reported issue.
+        summary:
+          type: string
+          maxLength: 512
+          description: Brief description of the issue for operators.
+        details:
+          type: string
+          maxLength: 8192
+          description: Diagnostic information, logs, ticket numbers, etc.


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Expected result: exactly one match. More than one means the schema key is duplicated. rg -n '^ {4}MachineHealthIssue:$' openapi/spec.yaml

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 135

Remove the duplicate MachineHealthIssue schema definition.

The YAML file contains two MachineHealthIssue entries under components.schemas (lines 15587 and 16884). In YAML, the second key silently overwrites the first, rendering the initial definition inaccessible to any consumers of the specification. This breaks schema clarity, violates the API contract, and causes linting failures. Consolidate to a single, authoritative definition.

🧰 Tools

🪛 YAMLlint (1.38.0)

[error] 16884-16884: duplication of key "MachineHealthIssue" in mapping

(key-duplicates)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openapi/spec.yaml` around lines 16884 - 16909, There are two duplicate components.schemas entries named MachineHealthIssue; remove the redundant definition and keep a single authoritative MachineHealthIssue schema under components.schemas (or merge any differing fields into one consolidated version), ensuring all properties (category, summary, details), enums and descriptions are preserved; update or verify any references to MachineHealthIssue remain valid and remove the earlier or later duplicate so the spec contains only one MachineHealthIssue schema.

github-actions · 2026-04-22T22:49:47Z

Test Results

8 856 tests +7 8 854 ✅ +5 7m 24s ⏱️ +38s
146 suites +1 0 💤 ±0
14 files ±0 2 ❌ +2

For more details on these failures, see this check.

Results for commit a0cd420. ± Comparison against base commit 3616400.

♻️ This comment has been updated with latest results.

hwadekar-nv · 2026-04-24T06:10:51Z

+		Classifications: []string{
+			"PreventAllocations",
+			"PreventDeletion",
+			"PreventSuperTenantAllocation",


need to remove for being send PreventSuperTenantAllocation

hwadekar-nv · 2026-04-24T06:11:37Z

+		Id:              onlineRepairHealthAlertID,
+		Target:          &tgt,
+		Message:         msg,
+		TenantMessage:   cdb.GetStrPtr(fmt.Sprintf("TenantReportedIssue: %s", mhi.Summary)),


Add acknowledgment policy and accept from user also

hwadekar-nv requested review from kfelternv, pbreton, sunilkumar-nvidia and thossain-nv April 22, 2026 22:39

hwadekar-nv self-assigned this Apr 22, 2026

hwadekar-nv requested a review from a team as a code owner April 22, 2026 22:39

hwadekar-nv marked this pull request as draft April 22, 2026 22:41

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

hwadekar-nv commented Apr 24, 2026

View reviewed changes

hwadekar-nv changed the title ~~feat: Initial draft commit for Machine OnlineRepair~~ feat: Initial draft commit for Instance online repair Apr 28, 2026

feat: Initial draft commit for Machine OnlineRepair

d48bc5d

hwadekar-nv force-pushed the feat/online-repair branch from eb96939 to d48bc5d Compare April 28, 2026 22:39

Removed unwanted files

a0cd420

		- The instance stays in the tenant pool (in Ready when active; in OnlineRepair during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.

		The tenant operator (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).

Conversation

hwadekar-nv commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Services Affected

Related Issues (Optional)

Breaking Changes

Testing

Additional Notes

Uh oh!

coderabbitai Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Apr 22, 2026

🔐 TruffleHog Secret Scan

Uh oh!

copy-pr-bot Bot commented Apr 22, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

hwadekar-nv Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

hwadekar-nv Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hwadekar-nv commented Apr 22, 2026 •

edited

Loading

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading

github-actions Bot commented Apr 22, 2026 •

edited

Loading