feat: Initial draft commit for Instance online repair #415
hwadekar-nv wants to merge 2 commits into main
Conversation
Walkthrough

The pull request introduces in-pool online repair functionality for machines, enabling tenant repair teams to restore machines within the instance pool. Changes include API handler authorization and transaction logic for state transitions, new request models with validation for repair intent and acknowledgments, instance status additions, Temporal workflows coordinating with Site for health report overrides, and comprehensive design documentation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Tenant as Tenant/Client
    participant Handler as API Handler
    participant DB as Database
    participant Temporal as Temporal
    participant Activity as Site Activity
    participant Site as Site Controller

    Tenant->>Handler: PATCH /machines/{id} (requestOnlineRepair)
    Handler->>Handler: Validate machine is assigned & Instance status
    Handler->>Handler: Enforce TenantAccount authorization
    Handler->>DB: Begin Transaction
    Handler->>DB: Update Instance status Ready → Repairing
    Handler->>DB: Set OnlineRepair auto-deletion label
    Handler->>DB: Commit Transaction
    Handler->>Temporal: ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride)
    Temporal->>Activity: InsertHealthReportOverrideOnSite(machine_id, health_report)
    Activity->>Site: RPC InsertHealthReportOverride
    Site-->>Activity: Empty response
    Activity-->>Temporal: Success
    Temporal-->>Handler: Workflow complete
    Handler-->>Tenant: 200 OK (Instance.status=Repairing)

    Tenant->>Handler: PATCH /machines/{id} (clearOnlineRepair)
    Handler->>Handler: Validate Instance status is Repairing
    Handler->>DB: Begin Transaction
    Handler->>DB: Update Instance status Repairing → Ready
    Handler->>DB: Clear OnlineRepair label
    Handler->>DB: Commit Transaction
    Handler->>Temporal: ExecuteWorkflow(ClearMachineOnlineRepairHealthOverride)
    Temporal->>Activity: RemoveHealthReportOverrideOnSite(machine_id)
    Activity->>Site: RPC RemoveHealthReportOverride
    Site-->>Activity: Empty response
    Activity-->>Temporal: Success
    Temporal-->>Handler: Workflow complete
    Handler-->>Tenant: 200 OK (Instance.status=Ready)
```
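The status transitions in the diagram reduce to a two-state machine: enter repair only from `Ready`, exit only from `Repairing`. A minimal sketch of that rule (the `transition` helper is illustrative, not the handler's actual code):

```go
package main

import "fmt"

// Instance statuses used in the sequence diagram.
const (
	StatusReady     = "Ready"
	StatusRepairing = "Repairing"
)

// transition applies the online-repair state rules the handler enforces:
// enter repair only from Ready, exit repair only from Repairing.
func transition(current string, enterRepair bool) (string, error) {
	switch {
	case enterRepair && current == StatusReady:
		return StatusRepairing, nil // PATCH requestOnlineRepair=true
	case !enterRepair && current == StatusRepairing:
		return StatusReady, nil // PATCH clearOnlineRepair=true
	default:
		return current, fmt.Errorf("invalid online-repair transition from %q", current)
	}
}

func main() {
	s, _ := transition(StatusReady, true)
	fmt.Println(s) // Repairing
	s, _ = transition(s, false)
	fmt.Println(s) // Ready
}
```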
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The changes introduce a new feature spanning heterogeneous components: dense handler transaction logic with state management, comprehensive validation across multiple model layers, database schema extensions, Temporal workflow coordination, activity implementations, and OpenAPI contract updates. The handler requires careful scrutiny for transaction semantics, authorization gating, and Temporal error handling. Multiple interacting components demand verification of consistency across API, database, workflow, and activity layers.

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 (Last updated: 2026-04-22 22:40:40 UTC | Commit: 72dc761)
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
openapi/spec.yaml (2)
16949-17010: ⚠️ Potential issue | 🟠 Major: Regenerate the checked-in SDK/docs for this OpenAPI change.

CI is already failing because `openapi/spec.yaml` changed without the corresponding generated output under `sdk/standard/` or `docs/index.html`. Please include the regenerated artifacts in this PR so consumers pick up `Repairing` and the new online-repair request shapes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openapi/spec.yaml` around lines 16949 - 17010, The OpenAPI spec added online-repair fields (requestOnlineRepair, clearOnlineRepair) and schemas (MachineHealthIssue, OnlineRepairPolicy, OnlineRepairAcknowledgments) and introduced the new Instance status "Repairing", but the generated SDK/docs were not updated; regenerate the OpenAPI client and docs and commit the outputs so CI passes—run the repo's OpenAPI generation script or tool to update sdk/standard/ and docs/index.html (and any other generated artifacts), verify the new shapes and the "Repairing" status appear in the generated SDK and HTML, and include those regenerated files in this PR.
16962-17010: ⚠️ Potential issue | 🟠 Major: Encode the online-repair modes in the schema itself.

The enter/exit rules currently live only in descriptions and examples. The schema still permits invalid payloads such as `requestOnlineRepair: true` without `acknowledgments`, `requestOnlineRepair: false` together with `machineHealthIssue`, or online-repair fields mixed with the other update properties. Please model these as mutually exclusive variants with `oneOf`/`not` so generated clients and validators enforce the same contract as the handler. As per coding guidelines, "MachineUpdateRequest supports two mutually exclusive modes for online repair: Enter: requestOnlineRepair=true + machineHealthIssue + repairPolicy + acknowledgments. Exit: requestOnlineRepair=false + clearOnlineRepair=true, with no other online-repair payload fields combined."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openapi/spec.yaml` around lines 16962 - 17010, The schema currently allows invalid combinations because the online-repair semantics are only in docs; update the MachineUpdateRequest schema to express three mutually exclusive variants using oneOf/not: (1) Normal update: allows instanceTypeId/clearInstanceType, setMaintenanceMode, maintenanceMessage, labels and must forbid any online-repair fields (requestOnlineRepair, clearOnlineRepair, machineHealthIssue, repairPolicy, acknowledgments); (2) Enter online-repair mode: require requestOnlineRepair=true and require machineHealthIssue, repairPolicy, and acknowledgments (reference MachineHealthIssue, OnlineRepairPolicy, OnlineRepairAcknowledgments) and forbid instanceTypeId/clearInstanceType, setMaintenanceMode, maintenanceMessage, labels, clearOnlineRepair; (3) Exit online-repair mode: require requestOnlineRepair=false and clearOnlineRepair=true and forbid machineHealthIssue, repairPolicy, acknowledgments and also forbid other update properties; implement these with oneOf plus required/const and additionalProperties/not blocks so validators and generated clients enforce the enter/exit rules described in the description.
🧹 Nitpick comments (8)
site-workflow/pkg/grpc/client/testing.go (1)
346-352: Mock methods lack context-based error injection support.

The new `InsertHealthReportOverride` and `RemoveHealthReportOverride` mock methods always return success, unlike other mock methods in this file that support `ctx.Value("wantError")` for injecting failures during tests. This limits the ability to test error handling paths in workflows and activities that invoke these RPCs. Consider aligning with the established pattern for comprehensive test coverage:
♻️ Suggested enhancement
```diff
 func (c *MockForgeClient) InsertHealthReportOverride(ctx context.Context, in *wflows.InsertHealthReportOverrideRequest, opts ...grpc.CallOption) (*emptypb.Empty, error) {
+	if err, ok := ctx.Value("wantError").(error); ok {
+		return nil, status.Error(status.Code(err), "failed to insert health report override")
+	}
 	return new(emptypb.Empty), nil
 }

 func (c *MockForgeClient) RemoveHealthReportOverride(ctx context.Context, in *wflows.RemoveHealthReportOverrideRequest, opts ...grpc.CallOption) (*emptypb.Empty, error) {
+	if err, ok := ctx.Value("wantError").(error); ok {
+		return nil, status.Error(status.Code(err), "failed to remove health report override")
+	}
 	return new(emptypb.Empty), nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site-workflow/pkg/grpc/client/testing.go` around lines 346 - 352, The two mock RPCs InsertHealthReportOverride and RemoveHealthReportOverride on MockForgeClient always return success; update both to follow the file's established pattern by checking ctx.Value("wantError") (or the same key other mocks use) and, if set, return that injected error (or wrap it) instead of success; otherwise return new(emptypb.Empty), nil; this enables context-based error injection for tests.

db/tenant-capability-google-doc.html (1)
1-325: Consider consolidating documentation to avoid duplication.

Both `db/tenant-capability.md` and `db/tenant-capability-google-doc.html` contain the same HLD content. Maintaining two formats creates a risk of divergence. Consider keeping only the Markdown version and generating HTML if needed, or documenting which is the source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@db/tenant-capability-google-doc.html` around lines 1 - 325, Duplicate HLD content exists in db/tenant-capability.md and db/tenant-capability-google-doc.html which risks divergence; remove the redundancy by choosing a single source-of-truth (prefer the Markdown file) and delete or turn the HTML into a generated artifact, update references to point to db/tenant-capability.md, and ensure the canonical doc names (e.g. TenantSiteCapabilityAssociation, TargetedInstanceCreation) are preserved in the retained file so handlers and reviewers can find the same content.

site-workflow/pkg/activity/machine_test.go (1)
142-180: Tests cover basic paths but lack edge case coverage.

The tests verify successful execution and nil request validation. However, the activity implementations validate multiple conditions (e.g., `MachineId.Id == ""`, `Override == nil`, `Override.Report == nil`, `Source == ""`). Consider adding tests for these edge cases to ensure the validation logic is fully exercised.

♻️ Suggested additional test cases
```go
func TestManageMachine_InsertHealthReportOverrideOnSite_ValidationErrors(t *testing.T) {
	mockCarbide := cClient.NewMockCarbideClient()
	carbideAtomicClient := cClient.NewCarbideAtomicClient(&cClient.CarbideClientConfig{})
	carbideAtomicClient.SwapClient(mockCarbide)
	mm := NewManageMachine(carbideAtomicClient)

	testCases := []struct {
		name    string
		request *cwssaws.InsertHealthReportOverrideRequest
	}{
		{"nil request", nil},
		{"nil MachineId", &cwssaws.InsertHealthReportOverrideRequest{MachineId: nil}},
		{"empty MachineId.Id", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: ""}}},
		{"nil Override", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: "m1"}, Override: nil}},
		{"nil Override.Report", &cwssaws.InsertHealthReportOverrideRequest{MachineId: &cwssaws.MachineId{Id: "m1"}, Override: &cwssaws.HealthReportOverride{Report: nil}}},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			err := mm.InsertHealthReportOverrideOnSite(context.Background(), tc.request)
			assert.Error(t, err)
		})
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site-workflow/pkg/activity/machine_test.go` around lines 142 - 180, Tests only exercise success and nil-request paths; add unit tests that assert validation errors for empty or missing fields checked in the activity. Extend site-workflow/pkg/activity/machine_test.go to include table-driven tests calling NewManageMachine(...).InsertHealthReportOverrideOnSite and RemoveHealthReportOverrideOnSite covering cases: nil MachineId, empty MachineId.Id, nil Override, nil Override.Report, and empty Source (for RemoveHealthReportOverrideOnSite), and assert Error for each; reference the functions InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite and constructor NewManageMachine to locate where to call them.

site-workflow/pkg/workflow/machine_test.go (1)
187-219: Tests cover success paths; consider adding failure scenario tests.

The tests correctly verify that both workflows complete successfully when activities succeed. However, the test suite includes failure scenario tests for other workflows (e.g., `Test_UpdateMachineMetadata_ActivityFails`). For consistency and to ensure error propagation is verified, consider adding analogous failure tests.

♻️ Suggested failure test

```go
func (s *MachineWorkflowTestSuite) Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails() {
	var machineManager mActivity.ManageMachine
	req := &cwssaws.InsertHealthReportOverrideRequest{
		MachineId: &cwssaws.MachineId{Id: uuid.New().String()},
		Override: &cwssaws.HealthReportOverride{
			Report: &cwssaws.HealthReport{Source: "tenant-reported-issue"},
			Mode:   cwssaws.OverrideMode_Replace,
		},
	}
	errMsg := "Site Controller communication error"
	s.env.RegisterActivity(machineManager.InsertHealthReportOverrideOnSite)
	s.env.OnActivity(machineManager.InsertHealthReportOverrideOnSite, mock.Anything, mock.Anything).Return(errors.New(errMsg))
	s.env.ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride, req)

	s.True(s.env.IsWorkflowCompleted())
	err := s.env.GetWorkflowError()
	s.Error(err)
	var applicationErr *temporal.ApplicationError
	s.True(errors.As(err, &applicationErr))
	s.Equal(errMsg, applicationErr.Error())
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@site-workflow/pkg/workflow/machine_test.go` around lines 187 - 219, The tests only cover success paths for ApplyMachineOnlineRepairHealthOverride and ClearMachineOnlineRepairHealthOverride; add failure-case unit tests that mock the activities to return an error and assert the workflow surfaces that error. Specifically, add tests similar to Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails and a counterpart for ClearMachineOnlineRepairHealthOverride where you RegisterActivity and use env.OnActivity for mActivity.ManageMachine.InsertHealthReportOverrideOnSite (and RemoveHealthReportOverrideOnSite) to Return(errors.New(...)), ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride / ClearMachineOnlineRepairHealthOverride) and then assert env.IsWorkflowCompleted(), env.GetWorkflowError() is non-nil, the error is a temporal.ApplicationError and its message matches the mocked error string.

db/tenant-capability.md (1)
265-272: Handler migration guidance is documented but not yet implemented.

The HLD correctly identifies that downstream handlers must replace checks using `tenant.Config.TargetedInstanceCreation` with effective site-scoped capability evaluation. Based on the relevant code snippets provided, the current `api/pkg/api/handler/machine.go` implementation (lines 323-329 and 824-845) still uses only the tenant-global check without a `TenantSiteCapabilityAssociation` lookup.

This is acceptable for an initial draft, but consider adding a tracking item or TODO to ensure the handler migration is completed before this capability model is fully operational.
Would you like me to open an issue to track the handler migration work outlined in this HLD?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@db/tenant-capability.md` around lines 265 - 272, The handler migration described in the HLD hasn't been implemented: handlers in api/pkg/api/handler/machine.go still check tenant.Config.TargetedInstanceCreation directly instead of using effective(tenant, TenantSiteCapabilityAssociation, siteID) for site-scoped requests (instance create/delete, machine APIs, site listing). Add a tracking TODO and create a repo issue describing: replace direct TargetedInstanceCreation checks in the machine handlers and any site-scoped handlers with the effective(..., siteID) helper, centralize the lookup in shared helpers, and decide site-listing behavior (narrow vs discovery). Reference TenantSiteCapabilityAssociation, TargetedInstanceCreation, effective(..., siteID), and the machine handler functions so the migration can be completed before the capability model goes live.

api/pkg/api/model/machine_test.go (1)
729-778: Add missing negative cases for online-repair branch coverage.

Current additions validate happy paths plus one exclusivity error, but `Validate()` has more online-repair branches that are currently untested (e.g., enter with `clearOnlineRepair=true`, exit without `clearOnlineRepair=true`, exit with forbidden `machineHealthIssue`/`repairPolicy`/`acknowledgments` payloads). Adding these would harden regression detection.

📌 Suggested test-case additions
```diff
@@
         {
             name: "test valid exit online repair request",
             fields: fields{
                 RequestOnlineRepair: cdb.GetBoolPtr(false),
                 ClearOnlineRepair:   cdb.GetBoolPtr(true),
             },
             wantErr: false,
         },
+        {
+            name: "test invalid enter online repair when clearOnlineRepair is true",
+            fields: fields{
+                RequestOnlineRepair: cdb.GetBoolPtr(true),
+                ClearOnlineRepair:   cdb.GetBoolPtr(true),
+                MachineHealthIssue: &APIMachineHealthIssue{
+                    Category: "STORAGE",
+                    Summary:  "Disk issue",
+                    Details:  "logs and ticket refs",
+                },
+                RepairPolicy: &APIOnlineRepairPolicy{
+                    AllowAutoInstanceDeletionOnFailure: cdb.GetBoolPtr(false),
+                },
+                Acknowledgments: &APIOnlineRepairAcknowledgments{
+                    AcceptDataCorruptionRisk:   cdb.GetBoolPtr(true),
+                    AcceptRepairTeamAccess:     cdb.GetBoolPtr(true),
+                    AcceptInstanceDeletionRisk: cdb.GetBoolPtr(true),
+                },
+            },
+            wantErr: true,
+        },
+        {
+            name: "test invalid exit online repair without clearOnlineRepair true",
+            fields: fields{
+                RequestOnlineRepair: cdb.GetBoolPtr(false),
+            },
+            wantErr: true,
+        },
+        {
+            name: "test invalid exit online repair when enter-only fields are provided",
+            fields: fields{
+                RequestOnlineRepair: cdb.GetBoolPtr(false),
+                ClearOnlineRepair:   cdb.GetBoolPtr(true),
+                MachineHealthIssue: &APIMachineHealthIssue{
+                    Category: "OTHER",
+                    Summary:  "Should not be sent on exit",
+                    Details:  "Should not be sent on exit",
+                },
+            },
+            wantErr: true,
+        },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/model/machine_test.go` around lines 729 - 778, Add unit tests calling the Machine.Validate() cases that are currently missing: (1) an "enter online-repair" case where RequestOnlineRepair==true but ClearOnlineRepair==true (should error), (2) an "exit online-repair" case where RequestOnlineRepair==false and ClearOnlineRepair!=true (should error), and (3) one or more "exit online-repair" cases where RequestOnlineRepair==false and ClearOnlineRepair==true but forbidden payload fields are present (MachineHealthIssue, RepairPolicy, Acknowledgments) and should cause validation errors. Use the same test table structure and fields (RequestOnlineRepair, ClearOnlineRepair, MachineHealthIssue, RepairPolicy, Acknowledgments) as the surrounding tests and assert wantErr accordingly when calling Validate().

api/pkg/api/handler/machine.go (2)
64-67: Inconsistent constant value casing.

The value `"OnLineRepair"` contains inconsistent capitalization. The convention for compound words like "online" should be `OnlineRepair` rather than `OnLineRepair`.

♻️ Suggested fix
```diff
 const (
 	onlineRepairHealthOverrideSource = "tenant-reported-issue"
-	onlineRepairHealthAlertID        = "OnLineRepair"
+	onlineRepairHealthAlertID        = "OnlineRepair"
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 64 - 67, The constant onlineRepairHealthAlertID has inconsistent capitalization ("OnLineRepair"); update its string value to "OnlineRepair" to follow compound-word casing conventions, and search for any uses of onlineRepairHealthAlertID to ensure no downstream logic or external consumers depend on the old casing and update tests or comparisons accordingly.
846-862: Duplicated instance retrieval logic.

The instance retrieval and validation at lines 850-858 are duplicated in the main online repair block at lines 1454-1462. Consider extracting this common logic into a helper function, or restructuring to retrieve the instance once and pass it through.
♻️ Suggested approach
Extract instance retrieval to a helper that returns the validated instance:
```go
func (umh UpdateMachineHandler) getOnlineRepairInstance(ctx context.Context, machineID string, logger zerolog.Logger) (*cdbm.Instance, *cutil.APIError) {
	iDAO := cdbm.NewInstanceDAO(umh.dbSession)
	instances, ic, ierr := iDAO.GetAll(ctx, nil, cdbm.InstanceFilterInput{MachineIDs: []string{machineID}}, cdbp.PageInput{Limit: cdb.GetIntPtr(2)}, nil)
	if ierr != nil {
		logger.Error().Err(ierr).Msg("error retrieving Instance for Machine")
		return nil, cutil.NewAPIError(http.StatusInternalServerError, "Failed to retrieve Instance for Machine", nil)
	}
	if ic != 1 {
		return nil, cutil.NewAPIError(http.StatusBadRequest, "Machine must have exactly one Instance for online repair", nil)
	}
	return &instances[0], nil
}
```

Then call this once during authorization and reuse the result in the main online repair block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/pkg/api/handler/machine.go` around lines 846 - 862, The instance retrieval/validation for online repair in UpdateMachineHandler is duplicated; extract it into a helper (e.g., UpdateMachineHandler.getOnlineRepairInstance) that accepts ctx, machine.ID and logger and returns (*cdbm.Instance, *cutil.APIError) after calling cdbm.NewInstanceDAO(...).GetAll, logging errors and translating them into cutil.NewAPIError responses (handle non-1 count and tenant mismatch there or return the instance and let caller check tenant), then replace both inline blocks (the pre-authorization check and the main online-repair block) to call this helper once and reuse the returned instance instead of repeating the GetAll logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 1586-1593: The response currently returns the original machine
object (um = machine) after the online repair, which doesn't include instance
state changes; update the code so the response reflects the modified instance by
replacing um = machine with either (a) embedding the updated instance fields
into um (copy updated status/labels from the repaired instance object into the
machine struct used for response) or (b) reloading the machine+instance from the
DB (call the same loader used by getAPIMachines) after orTx.Commit() and assign
that fresh object to um so getAPIMachines returns the accurate, post-repair
instance state; ensure you reference the modified instance struct created during
the online repair and set orTxCommitted accordingly.
- Around line 1521-1528: The timeout detection block currently checks ctx.Err()
(parent context) instead of the workflow context, so replace the ctx.Err() check
with wfCtx.Err() in the error branch that inspects tp.TimeoutError and
context.DeadlineExceeded; update the conditional in the function handling err
(the block that declares var timeoutErr *tp.TimeoutError and returns via
cutil.NewAPIErrorResponse) to use wfCtx.Err() so timeouts from the workflow
context are correctly detected while keeping the existing UnwrapWorkflowError
handling and logging (common.UnwrapWorkflowError, logger.Error(), and the
subsequent cutil.NewAPIErrorResponse) unchanged.
- Around line 1574-1582: In the error-handling block for clearing the online
repair workflow (the if err != nil branch that checks timeout and unwraps
workflow errors via common.UnwrapWorkflowError), replace the incorrect ctx.Err()
check with wfCtx.Err() so the timeout/deadline check uses the workflow context;
keep the rest of the logic (using err in the timeout message and werr for the
UnwrapWorkflowError path) unchanged and ensure wfCtx is in scope where the check
occurs.
- Around line 1448-1452: Duplicate call to umh.scp.GetClientByID is unnecessary;
remove the second retrieval (the stc, err :=
umh.scp.GetClientByID(machine.SiteID) block at the later location) and reuse the
previously obtained stc from earlier in the same handler, ensuring the variable
is in scope (or pass it into the helper function if needed) and delete the
redundant error handling branch so the later logic uses the existing Temporal
client instead.
In `@db/repair.md`:
- Around line 94-95: The document uses inconsistent state names (e.g.,
OnlineRepair vs Repairing, and an unexpected Assigned alongside Ready); pick a
single canonical term for repair state (suggest using "Repairing") and normalize
every occurrence: replace "OnlineRepair" with "Repairing", update the status
table to list "Repairing" instead of "OnlineRepair", and either add "Assigned"
to the defined state list or remove the stray "Assigned" reference (ensure
"Ready/Assigned" wording is corrected to use only defined states like "Ready" or
"Assigned" consistently); verify all mentions (e.g., the sentence from the diff
and the table around line 171) use the chosen vocabulary.
- Line 184: The alert ID casing is inconsistent: change the JSON field value
"id": "OnLineRepair" to "OnlineRepair" to match the rest of the document and
ensure implementation references resolve; search for any other occurrences of
"OnLineRepair" (e.g., in examples, anchors, or references) and update them to
"OnlineRepair" so casing is consistent across the markdown and any linked
configuration or code.
- Around line 248-249: The docs state allowAutoInstanceDeletionOnFailure is
optional but the Machine PATCH API flow (requestOnlineRepair: true) requires it
on entry; update the documentation to mark allowAutoInstanceDeletionOnFailure as
required for the Machine PATCH API request when requestOnlineRepair: true (or
clarify the conditional requirement), and add a short note next to
requestOnlineRepair and the Machine PATCH API section explaining the required
inputs (machine health issue, acknowledgments, and the mandatory
allowAutoInstanceDeletionOnFailure flag) so clients won’t send invalid requests.
In `@db/tenant-capability-google-doc.html`:
- Around line 93-112: The Mermaid flowchart inside the <pre> block contains raw
">" characters (e.g., in node labels like A[PATCH Tenant Capabilities] and
arrows like A -->|"sites[], capabilityName, infrastructureProviderId, enabled"|
B) which HTMLHint flags; update the <pre> block so all literal ">" characters
are replaced with the escaped entity `&gt;` (and similarly escape any "<" as `&lt;`
if present) so the Mermaid diagram remains the same visually but the HTML is
valid; target the flowchart text within the <pre> block (nodes A, B, C, D, E, F,
G and the arrow lines) to perform the replacements.
In `@openapi/spec.yaml`:
- Around line 16884-16909: There are two duplicate components.schemas entries
named MachineHealthIssue; remove the redundant definition and keep a single
authoritative MachineHealthIssue schema under components.schemas (or merge any
differing fields into one consolidated version), ensuring all properties
(category, summary, details), enums and descriptions are preserved; update or
verify any references to MachineHealthIssue remain valid and remove the earlier
or later duplicate so the spec contains only one MachineHealthIssue schema.
---
Outside diff comments:
In `@openapi/spec.yaml`:
- Around line 16949-17010: The OpenAPI spec added online-repair fields
(requestOnlineRepair, clearOnlineRepair) and schemas (MachineHealthIssue,
OnlineRepairPolicy, OnlineRepairAcknowledgments) and introduced the new Instance
status "Repairing", but the generated SDK/docs were not updated; regenerate the
OpenAPI client and docs and commit the outputs so CI passes—run the repo's
OpenAPI generation script or tool to update sdk/standard/ and docs/index.html
(and any other generated artifacts), verify the new shapes and the "Repairing"
status appear in the generated SDK and HTML, and include those regenerated files
in this PR.
- Around line 16962-17010: The schema currently allows invalid combinations
because the online-repair semantics are only in docs; update the
MachineUpdateRequest schema to express three mutually exclusive variants using
oneOf/not: (1) Normal update: allows instanceTypeId/clearInstanceType,
setMaintenanceMode, maintenanceMessage, labels and must forbid any online-repair
fields (requestOnlineRepair, clearOnlineRepair, machineHealthIssue,
repairPolicy, acknowledgments); (2) Enter online-repair mode: require
requestOnlineRepair=true and require machineHealthIssue, repairPolicy, and
acknowledgments (reference MachineHealthIssue, OnlineRepairPolicy,
OnlineRepairAcknowledgments) and forbid instanceTypeId/clearInstanceType,
setMaintenanceMode, maintenanceMessage, labels, clearOnlineRepair; (3) Exit
online-repair mode: require requestOnlineRepair=false and clearOnlineRepair=true
and forbid machineHealthIssue, repairPolicy, acknowledgments and also forbid
other update properties; implement these with oneOf plus required/const and
additionalProperties/not blocks so validators and generated clients enforce the
enter/exit rules described in the description.
---
Nitpick comments:
In `@api/pkg/api/handler/machine.go`:
- Around line 64-67: The constant onlineRepairHealthAlertID has inconsistent
capitalization ("OnLineRepair"); update its string value to "OnlineRepair" to
follow compound-word casing conventions, and search for any uses of
onlineRepairHealthAlertID to ensure no downstream logic or external consumers
depend on the old casing and update tests or comparisons accordingly.
- Around line 846-862: The instance retrieval/validation for online repair in
UpdateMachineHandler is duplicated; extract it into a helper (e.g.,
UpdateMachineHandler.getOnlineRepairInstance) that accepts ctx, machine.ID and
logger and returns (*cdbm.Instance, *cutil.APIError) after calling
cdbm.NewInstanceDAO(...).GetAll, logging errors and translating them into
cutil.NewAPIError responses (handle non-1 count and tenant mismatch there or
return the instance and let caller check tenant), then replace both inline
blocks (the pre-authorization check and the main online-repair block) to call
this helper once and reuse the returned instance instead of repeating the GetAll
logic.
In `@api/pkg/api/model/machine_test.go`:
- Around line 729-778: Add unit tests calling the Machine.Validate() cases that
are currently missing: (1) an "enter online-repair" case where
RequestOnlineRepair==true but ClearOnlineRepair==true (should error), (2) an
"exit online-repair" case where RequestOnlineRepair==false and
ClearOnlineRepair!=true (should error), and (3) one or more "exit online-repair"
cases where RequestOnlineRepair==false and ClearOnlineRepair==true but forbidden
payload fields are present (MachineHealthIssue, RepairPolicy, Acknowledgments)
and should cause validation errors. Use the same test table structure and fields
(RequestOnlineRepair, ClearOnlineRepair, MachineHealthIssue, RepairPolicy,
Acknowledgments) as the surrounding tests and assert wantErr accordingly when
calling Validate().
In `@db/tenant-capability-google-doc.html`:
- Around line 1-325: Duplicate HLD content exists in db/tenant-capability.md and
db/tenant-capability-google-doc.html which risks divergence; remove the
redundancy by choosing a single source-of-truth (prefer the Markdown file) and
delete or turn the HTML into a generated artifact, update references to point to
db/tenant-capability.md, and ensure the canonical doc names (e.g.
TenantSiteCapabilityAssociation, TargetedInstanceCreation) are preserved in the
retained file so handlers and reviewers can find the same content.
In `@db/tenant-capability.md`:
- Around line 265-272: The handler migration described in the HLD hasn't been
implemented: handlers in api/pkg/api/handler/machine.go still check
tenant.Config.TargetedInstanceCreation directly instead of using
effective(tenant, TenantSiteCapabilityAssociation, siteID) for site-scoped
requests (instance create/delete, machine APIs, site listing). Add a tracking
TODO and create a repo issue describing: replace direct TargetedInstanceCreation
checks in the machine handlers and any site-scoped handlers with the
effective(..., siteID) helper, centralize the lookup in shared helpers, and
decide site-listing behavior (narrow vs discovery). Reference
TenantSiteCapabilityAssociation, TargetedInstanceCreation, effective(...,
siteID), and the machine handler functions so the migration can be completed
before the capability model goes live.
In `@site-workflow/pkg/activity/machine_test.go`:
- Around line 142-180: Tests only exercise success and nil-request paths; add
unit tests that assert validation errors for empty or missing fields checked in
the activity. Extend site-workflow/pkg/activity/machine_test.go to include
table-driven tests calling
NewManageMachine(...).InsertHealthReportOverrideOnSite and
RemoveHealthReportOverrideOnSite covering cases: nil MachineId, empty
MachineId.Id, nil Override, nil Override.Report, and empty Source (for
RemoveHealthReportOverrideOnSite), and assert Error for each; reference the
functions InsertHealthReportOverrideOnSite, RemoveHealthReportOverrideOnSite and
constructor NewManageMachine to locate where to call them.
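The field checks those tests should exercise can be sketched as a standalone validation function. The request shapes here are assumptions following the names in the review text, not the generated protobuf types.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical request shapes mirroring the fields named in the review.
type MachineID struct{ Id string }
type Report struct{ Source string }
type Override struct{ Report *Report }

type InsertOverrideRequest struct {
	MachineId *MachineID
	Override  *Override
}

// validateInsertOverride mirrors the kind of nil/empty-field checks the
// activity is expected to perform before calling Site; each branch is one
// table-test case.
func validateInsertOverride(req *InsertOverrideRequest) error {
	switch {
	case req == nil:
		return errors.New("request is nil")
	case req.MachineId == nil:
		return errors.New("MachineId is nil")
	case req.MachineId.Id == "":
		return errors.New("MachineId.Id is empty")
	case req.Override == nil:
		return errors.New("Override is nil")
	case req.Override.Report == nil:
		return errors.New("Override.Report is nil")
	}
	return nil
}

func main() {
	cases := []*InsertOverrideRequest{
		nil,
		{},
		{MachineId: &MachineID{}},
		{MachineId: &MachineID{Id: "m-1"}},
		{MachineId: &MachineID{Id: "m-1"}, Override: &Override{}},
		{MachineId: &MachineID{Id: "m-1"}, Override: &Override{Report: &Report{Source: "tenant"}}},
	}
	for i, c := range cases {
		fmt.Printf("case %d: err=%v\n", i, validateInsertOverride(c))
	}
}
```

A parallel table for `RemoveHealthReportOverrideOnSite` would add the empty-`Source` case.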
In `@site-workflow/pkg/grpc/client/testing.go`:
- Around line 346-352: The two mock RPCs InsertHealthReportOverride and
RemoveHealthReportOverride on MockForgeClient always return success; update both
to follow the file's established pattern by checking ctx.Value("wantError") (or
the same key other mocks use) and, if set, return that injected error (or wrap
it) instead of success; otherwise return new(emptypb.Empty), nil—this enables
context-based error injection for tests.
In `@site-workflow/pkg/workflow/machine_test.go`:
- Around line 187-219: The tests only cover success paths for
ApplyMachineOnlineRepairHealthOverride and
ClearMachineOnlineRepairHealthOverride; add failure-case unit tests that mock
the activities to return an error and assert the workflow surfaces that error.
Specifically, add tests similar to
Test_ApplyMachineOnlineRepairHealthOverride_ActivityFails and a counterpart for
ClearMachineOnlineRepairHealthOverride where you RegisterActivity and use
env.OnActivity for mActivity.ManageMachine.InsertHealthReportOverrideOnSite (and
RemoveHealthReportOverrideOnSite) to Return(errors.New(...)),
ExecuteWorkflow(ApplyMachineOnlineRepairHealthOverride /
ClearMachineOnlineRepairHealthOverride) and then assert
env.IsWorkflowCompleted(), env.GetWorkflowError() is non-nil, the error is a
temporal.ApplicationError and its message matches the mocked error string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d5273c3b-389e-4af5-9c19-c3e0d00ed09c
📒 Files selected for processing (16)
api/pkg/api/handler/machine.go
api/pkg/api/handler/machine_test.go
api/pkg/api/model/machine.go
api/pkg/api/model/machine_test.go
db/pkg/db/model/instance.go
db/pkg/db/model/instance_test.go
db/repair.md
db/tenant-capability-google-doc.html
db/tenant-capability.md
openapi/spec.yaml
site-agent/pkg/components/managers/machine/subscriber.go
site-workflow/pkg/activity/machine.go
site-workflow/pkg/activity/machine_test.go
site-workflow/pkg/grpc/client/testing.go
site-workflow/pkg/workflow/machine.go
site-workflow/pkg/workflow/machine_test.go
stc, err := umh.scp.GetClientByID(machine.SiteID)
if err != nil {
	logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")
	return err
}
🛠️ Refactor suggestion | 🟠 Major
Redundant temporal client retrieval.
The site temporal client is already retrieved at line 965 (stc, err := umh.scp.GetClientByID(machine.SiteID)). This second retrieval at line 1448 is unnecessary.
♻️ Suggested fix
// Enter or exit in-pool online repair (Site health override + Instance status / labels in Cloud DB)
if apiRequest.IsMachineOnlineRepairOperation() {
if machine.IsMissingOnSite {
return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Machine is currently missing on Site, cannot update online repair state", nil)
}
-
- stc, err := umh.scp.GetClientByID(machine.SiteID)
- if err != nil {
- logger.Error().Err(err).Msg("failed to retrieve Temporal client for Site")
- return err
- }
	iDAO := cdbm.NewInstanceDAO(umh.dbSession)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/pkg/api/handler/machine.go` around lines 1448 - 1452, Duplicate call to
umh.scp.GetClientByID is unnecessary; remove the second retrieval (the stc, err
:= umh.scp.GetClientByID(machine.SiteID) block at the later location) and reuse
the previously obtained stc from earlier in the same handler, ensuring the
variable is in scope (or pass it into the helper function if needed) and delete
the redundant error handling branch so the later logic uses the existing
Temporal client instead.
if err != nil {
	var timeoutErr *tp.TimeoutError
	if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
		return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
	}
	code, werr := common.UnwrapWorkflowError(err)
	logger.Error().Err(werr).Msg("online repair health override workflow failed")
	return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute online repair workflow on Site: %s", werr), nil)
Incorrect context variable in timeout check.
At line 1523, ctx.Err() checks the parent context, but the workflow was executed with wfCtx. If the parent context hasn't timed out but wfCtx has, this condition may not evaluate correctly.
🐛 Suggested fix
err = we.Get(wfCtx, nil)
if err != nil {
var timeoutErr *tp.TimeoutError
- if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
+ if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed online repair workflow on Site: %s", err), nil)
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/pkg/api/handler/machine.go` around lines 1521 - 1528, The timeout
detection block currently checks ctx.Err() (parent context) instead of the
workflow context, so replace the ctx.Err() check with wfCtx.Err() in the error
branch that inspects tp.TimeoutError and context.DeadlineExceeded; update the
conditional in the function handling err (the block that declares var timeoutErr
*tp.TimeoutError and returns via cutil.NewAPIErrorResponse) to use wfCtx.Err()
so timeouts from the workflow context are correctly detected while keeping the
existing UnwrapWorkflowError handling and logging (common.UnwrapWorkflowError,
logger.Error(), and the subsequent cutil.NewAPIErrorResponse) unchanged.
if err != nil {
	var timeoutErr *tp.TimeoutError
	if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
		return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
	}
	code, werr := common.UnwrapWorkflowError(err)
	logger.Error().Err(werr).Msg("clear online repair health override workflow failed")
	return cutil.NewAPIErrorResponse(c, code, fmt.Sprintf("Failed to execute clear online repair workflow on Site: %s", werr), nil)
}
Same context variable issue in exit flow.
At line 1576, the timeout check uses ctx.Err() instead of wfCtx.Err().
🐛 Suggested fix
err = we.Get(wfCtx, nil)
if err != nil {
var timeoutErr *tp.TimeoutError
- if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || ctx.Err() != nil {
+ if errors.As(err, &timeoutErr) || err == context.DeadlineExceeded || wfCtx.Err() != nil {
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, fmt.Sprintf("Failed clear online repair workflow on Site: %s", err), nil)
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/pkg/api/handler/machine.go` around lines 1574 - 1582, In the
error-handling block for clearing the online repair workflow (the if err != nil
branch that checks timeout and unwraps workflow errors via
common.UnwrapWorkflowError), replace the incorrect ctx.Err() check with
wfCtx.Err() so the timeout/deadline check uses the workflow context; keep the
rest of the logic (using err in the timeout message and werr for the
UnwrapWorkflowError path) unchanged and ensure wfCtx is in scope where the check
occurs.
err = orTx.Commit()
if err != nil {
	logger.Error().Err(err).Msg("error committing transaction")
	return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to update Machine, DB transaction error", nil)
}
orTxCommitted = true
um = machine
}
Response does not reflect instance state changes.
At line 1592, um = machine assigns the original machine entity to um. However, the online repair operation modifies the associated instance's status and labels, not the machine itself. The response returned to the client via getAPIMachines (line 1600) will include the original machine data, which may not accurately reflect the instance state changes made during this operation.
Consider whether the API response should include the updated instance information to provide clients with accurate state after the operation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/pkg/api/handler/machine.go` around lines 1586 - 1593, The response
currently returns the original machine object (um = machine) after the online
repair, which doesn't include instance state changes; update the code so the
response reflects the modified instance by replacing um = machine with either
(a) embedding the updated instance fields into um (copy updated status/labels
from the repaired instance object into the machine struct used for response) or
(b) reloading the machine+instance from the DB (call the same loader used by
getAPIMachines) after orTx.Commit() and assign that fresh object to um so
getAPIMachines returns the accurate, post-repair instance state; ensure you
reference the modified instance struct created during the online repair and set
orTxCommitted accordingly.
- The instance **stays in the tenant pool** (in **Ready** when active; in **OnlineRepair** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.
State naming is inconsistent (Repairing vs OnlineRepair, and Assigned appears unexpectedly).
Line 94 uses OnlineRepair as a state while the status table introduces Repairing; Line 171 says Ready/Assigned even though the document’s state list does not define Assigned. Please normalize to one state vocabulary.
✏️ Suggested terminology normalization
-- The instance **stays in the tenant pool** (in **Ready** when active; in **OnlineRepair** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.
+- The instance **stays in the tenant pool** (in **Ready** when active; in **Repairing** during the repair window). No deallocation or return to Carbide pool unless the instance is escalated and deleted.
@@
-**Effect:** The Health Override for OnlineRepair is removed from the machine; the instance returns to **active (Ready/Assigned)** and is available for workload again.
+**Effect:** The Health Override for OnlineRepair is removed from the machine; the instance returns to **active (Ready)** and is available for workload again.

Also applies to: 171-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/repair.md` around lines 94 - 95, The document uses inconsistent state
names (e.g., OnlineRepair vs Repairing, and an unexpected Assigned alongside
Ready); pick a single canonical term for repair state (suggest using
"Repairing") and normalize every occurrence: replace "OnlineRepair" with
"Repairing", update the status table to list "Repairing" instead of
"OnlineRepair", and either add "Assigned" to the defined state list or remove
the stray "Assigned" reference (ensure "Ready/Assigned" wording is corrected to
use only defined states like "Ready" or "Assigned" consistently); verify all
mentions (e.g., the sentence from the diff and the table around line 171) use
the chosen vocabulary.
"source": "tenant-reported-issue",
"alerts": [
  {
    "id": "OnLineRepair",
Use consistent alert ID casing (OnlineRepair vs OnLineRepair).
Line 184 uses OnLineRepair, while the rest of the document consistently uses OnlineRepair. Standardizing this avoids ambiguity in implementation references.
✏️ Suggested casing fix
- "id": "OnLineRepair",
+ "id": "OnlineRepair",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/repair.md` at line 184, The alert ID casing is inconsistent: change the
JSON field value "id": "OnLineRepair" to "OnlineRepair" to match the rest of the
document and ensure implementation references resolve; search for any other
occurrences of "OnLineRepair" (e.g., in examples, anchors, or references) and
update them to "OnlineRepair" so casing is consistent across the markdown and
any linked configuration or code.
The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).
allowAutoInstanceDeletionOnFailure is documented as optional but is required on enter.
Line 248 says the field is optional, which conflicts with the enter request contract in this PR flow. This will cause client confusion and invalid requests.
✏️ Suggested doc correction
-The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and optionally `allowAutoInstanceDeletionOnFailure`).
+The **tenant operator** (or tenant) invokes the Machine PATCH API with `requestOnlineRepair: true` and the required inputs (machine health issue, acknowledgments, and `allowAutoInstanceDeletionOnFailure`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/repair.md` around lines 248 - 249, The docs state
allowAutoInstanceDeletionOnFailure is optional but the Machine PATCH API flow
(requestOnlineRepair: true) requires it on entry; update the documentation to
mark allowAutoInstanceDeletionOnFailure as required for the Machine PATCH API
request when requestOnlineRepair: true (or clarify the conditional requirement),
and add a short note next to requestOnlineRepair and the Machine PATCH API
section explaining the required inputs (machine health issue, acknowledgments,
and the mandatory allowAutoInstanceDeletionOnFailure flag) so clients won’t send
invalid requests.
<pre>flowchart TB
  subgraph Client["Tenant Admin"]
    A[PATCH Tenant Capabilities]
  end
  subgraph API["REST API"]
    B[Resolve site set]
    C[Upsert TenantSiteCapabilityAssociation]
  end
  subgraph Data["Persistence"]
    D[(tenant)]
    E[(tenant_site_capability_association)]
    F[(site)]
    G[(infrastructure_provider)]
  end
  A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B
  B -->|"empty sites[] => all eligible sites; optional filter by provider"| C
  C --> E
  E --> F
  E --> G
  D -->|"ceiling check"| B</pre>
HTMLHint flags unescaped special characters in Mermaid diagram.
The > characters in the Mermaid flowchart syntax within the <pre> block should technically be escaped as &gt; for valid HTML. While browsers typically render this correctly, escaping ensures consistency across parsers.
♻️ Suggested fix
- A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B
- B -->|"empty sites[] => all eligible sites; optional filter by provider"| C
- C --> E
- E --> F
- E --> G
- D -->|"ceiling check"| B</pre>
+ A --&gt;|"sites[], capabilityName, infrastructureProviderId, enabled"| B
+ B --&gt;|"empty sites[] =&gt; all eligible sites; optional filter by provider"| C
+ C --&gt; E
+ E --&gt; F
+ E --&gt; G
+ D --&gt;|"ceiling check"| B</pre>
🧰 Tools
🪛 HTMLHint (1.9.2)
[error] 107-107: Special characters must be escaped : [ > ].
(spec-char-escape)
[error] 108-108: Special characters must be escaped : [ > ].
(spec-char-escape)
[error] 108-108: Special characters must be escaped : [ > ].
(spec-char-escape)
[error] 109-109: Special characters must be escaped : [ > ].
(spec-char-escape)
[error] 110-110: Special characters must be escaped : [ > ].
(spec-char-escape)
[error] 111-111: Special characters must be escaped : [ > ].
(spec-char-escape)
[error] 112-112: Special characters must be escaped : [ > ].
(spec-char-escape)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@db/tenant-capability-google-doc.html` around lines 93 - 112, The Mermaid
flowchart inside the <pre> block contains raw ">" characters (e.g., in node
labels like A[PATCH Tenant Capabilities] and arrows like A -->|"sites[],
capabilityName, infrastructureProviderId, enabled"| B) which HTMLHint flags;
update the <pre> block so all literal ">" characters are replaced with the
escaped entity &gt; (and similarly escape any "<" as &lt; if present) so the
Mermaid diagram remains the same visually but the HTML is valid; target the
flowchart text within the <pre> block (nodes A, B, C, D, E, F, G and the arrow
lines) to perform the replacements.
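If the HTML is generated from the Markdown source, the escaping can be done mechanically; a minimal sketch using the Python standard library (`quote=False` leaves the double quotes in edge labels alone and only rewrites `&`, `<`, and `>`):

```python
import html

# Escape a Mermaid arrow line before embedding it in the <pre> block.
line = 'A -->|"sites[], capabilityName, infrastructureProviderId, enabled"| B'
escaped = html.escape(line, quote=False)
print(escaped)
# A --&gt;|"sites[], capabilityName, infrastructureProviderId, enabled"| B
```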
MachineHealthIssue:
  title: MachineHealthIssue
  type: object
  required:
    - category
    - summary
    - details
  properties:
    category:
      type: string
      enum:
        - HARDWARE
        - NETWORK
        - PERFORMANCE
        - STORAGE
        - SOFTWARE
        - OTHER
      description: High-level category for the tenant-reported issue.
    summary:
      type: string
      maxLength: 512
      description: Brief description of the issue for operators.
    details:
      type: string
      maxLength: 8192
      description: Diagnostic information, logs, ticket numbers, etc.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Expected result: exactly one match. More than one means the schema key is duplicated.
rg -n '^ {4}MachineHealthIssue:$' openapi/spec.yaml

Repository: NVIDIA/ncx-infra-controller-rest
Length of output: 135
Remove the duplicate MachineHealthIssue schema definition.
The YAML file contains two MachineHealthIssue entries under components.schemas (lines 15587 and 16884). In YAML, the second key silently overwrites the first, rendering the initial definition inaccessible to any consumers of the specification. This breaks schema clarity, violates the API contract, and causes linting failures. Consolidate to a single, authoritative definition.
🧰 Tools
🪛 YAMLlint (1.38.0)
[error] 16884-16884: duplication of key "MachineHealthIssue" in mapping
(key-duplicates)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openapi/spec.yaml` around lines 16884 - 16909, There are two duplicate
components.schemas entries named MachineHealthIssue; remove the redundant
definition and keep a single authoritative MachineHealthIssue schema under
components.schemas (or merge any differing fields into one consolidated
version), ensuring all properties (category, summary, details), enums and
descriptions are preserved; update or verify any references to
MachineHealthIssue remain valid and remove the earlier or later duplicate so the
spec contains only one MachineHealthIssue schema.
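Beyond fixing this instance, a duplicate-key scan can be kept in CI. A stdlib-only sketch, assuming schema names sit exactly four spaces deep under `components.schemas` as in this file's layout:

```python
import re
from collections import Counter

# Sample spec text with an intentional duplicate, standing in for the file.
spec_text = """\
components:
  schemas:
    MachineHealthIssue:
      type: object
    RepairPolicy:
      type: object
    MachineHealthIssue:
      type: object
"""

# Keys indented exactly four spaces are direct children of components.schemas.
keys = re.findall(r"^ {4}(\w+):$", spec_text, flags=re.MULTILINE)
duplicates = [k for k, n in Counter(keys).items() if n > 1]
print(duplicates)  # ['MachineHealthIssue']
```

A YAML-aware linter (as YAMLlint does above) is more robust, but this catches the silent-overwrite case without extra dependencies.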
Test Results: 8 856 tests (+7), 8 854 ✅ (+5), 7m 24s ⏱️ (+38s). For more details on these failures, see this check. Results for commit a0cd420. ± Comparison against base commit 3616400.
♻️ This comment has been updated with latest results.
Classifications: []string{
	"PreventAllocations",
	"PreventDeletion",
	"PreventSuperTenantAllocation",
PreventSuperTenantAllocation needs to be removed from the classifications being sent.
Id:            onlineRepairHealthAlertID,
Target:        &tgt,
Message:       msg,
TenantMessage: cdb.GetStrPtr(fmt.Sprintf("TenantReportedIssue: %s", mhi.Summary)),
Add an acknowledgment policy here and accept it from the user as well.
Force-pushed: eb96939 to d48bc5d (Compare)
Description
Based on the design https://gitlab-master.nvidia.com/nvmetal/designs/-/blob/breakfix-swiftstack/designs/0066-Instance-Repair-Management.md#proposed-workflow-in-pool-repair-dedicated-ops-team, an initial draft of the Instance-Repair-Management implementation was created.
Added a mechanism in Instance to set and update the repair status
Added a PATCH request on Machine that allows a Machine to be in repair while its Instance remains allocated
- Enter: requestOnlineRepair: true + machineHealthIssue + repairPolicy.allowAutoInstanceDeletionOnFailure + acknowledgments (all three booleans must be true).
- Exit: requestOnlineRepair: false + clearOnlineRepair: true (no other online-repair payload fields).
Site workflow
Site-agent
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes