Skip to content

fix: Add IbPortDown alert for machines with down IB ports#519

Merged
ajf merged 1 commit intoNVIDIA:mainfrom
hasayesh:nvbug-5866723
Apr 29, 2026
Merged

fix: Add IbPortDown alert for machines with down IB ports#519
ajf merged 1 commit intoNVIDIA:mainfrom
hasayesh:nvbug-5866723

Conversation

@hasayesh
Copy link
Copy Markdown
Contributor

@hasayesh hasayesh commented Mar 11, 2026

Description

When the IB Fabric Monitor detects ports not in Active state, it now sets a PreventAllocations health alert on the affected machine. This prevents Carbide from attempting to allocate instances on machines with degraded IB connectivity, avoiding SRE alerts.

Port-down alerting uses a precedence model:
1. If a machine has a SKU assigned, alert on any down port not in the
SKU's inactive_devices list (hardware-level truth).
2. If no SKU but an active instance exists, alert on down ports in the
instance's IB config.
3. If neither SKU nor instance, no alert is generated.

  • Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
  • Detect ports not in Active state during IB fabric monitoring
  • Set/clear IbPortDown health alert via health report overrides
  • Update existing test to expect health alert blocking

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hasayesh hasayesh requested a review from a team as a code owner March 11, 2026 04:25
@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-11 04:27:25 UTC | Commit: eb88f65

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 11, 2026

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 72
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-17 01:41:37 UTC | Commit: e4e83f8

@hasayesh hasayesh requested a review from Matthias247 March 11, 2026 16:51
@Matthias247
Copy link
Copy Markdown
Contributor

It needs to take the SKU into account. There's hosts which intentionally have multiple ports disconnected. And if they would all have PreventAllocations set, none of them would be usable anymore.

@hasayesh hasayesh requested a review from wminckler March 11, 2026 22:22
@hasayesh
Copy link
Copy Markdown
Contributor Author

This is what I suggest:
Update the L40 machines to have 2 IB's in SKU
Then we go from most to least specific:
Then if the machine has been assigned an instance-type use that
If not use SKU
if SKU does not exist (during early deployment) no check.

Please let me know if this is reasonable.

@hasayesh
Copy link
Copy Markdown
Contributor Author

This is what I suggest: Update the L40 machines to have 2 IB's in SKU Then we go from most to least specific: Then if the machine has been assigned an instance-type use that If not use SKU if SKU does not exist (during early deployment) no check.

Please let me know if this is reasonable.

OK my bad, json shows the properly populated inactive. Will update as such. Thanks.

Comment thread crates/api/src/ib_fabric_monitor/mod.rs
Comment thread crates/api/src/ib_fabric_monitor/mod.rs Outdated
///
/// Returns:
/// - `Ok(None)` if no SKU is assigned, SKU not found, or SKU has no infiniband_devices defined
/// (in these cases, IB port down monitoring should be skipped)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we don't want to monitor if we don't know the SKU of the machine?

What if we just returned an empty set in this case? That would mean "this machine does not expect any ports to be inactive", and we'd alert if the port is down. I feel that would be a more "safe" default if we don't know about the machine SKU, than avoiding alerting altogether.

(Maybe this question is for @Matthias247 who suggested checking the SKU. How do you feel about the default behavior if the SKU isn't found?)

Copy link
Copy Markdown
Contributor Author

@hasayesh hasayesh Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the two flags:

bom_validation.enabled
allow_allocation_on_validation_failure

If the operator explicitly skips bom_validation.enabled the point is not to do it correct? To me they both should be enabled, if the strict behavior is required otherwise what is the point of these?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're talking about different things.

I'm talking about what the fallback behavior is if we see an IB port down, but we're in an environment without BOM validation. I don't think we should skip alerting if that's the case.

To me, skipping BOM validation doesn't mean you don't care about whether IB ports are down. Unless I'm missing something?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just returned an empty set in this case? That would mean "this machine does not expect any ports to be inactive", and we'd alert if the port is down.

I know of a few deployments which don't have SKUs/BOMs set up, but where not all ports are connected. For these the alarms would go off.
If we wanted to avoid that, then only taking into account SKUs if actually defined is ok.
But maybe this would also be a good way to "motivate" operators to setup SKUs?

I'd leave the decision up to @ajf and SRE teams.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a reference, we know on some sites (like AZ04,) they are not using IB, because they are not using proper instance type and no partitions are used (there are tickets for this) but security settings are set. These are single machine work loads. Just as FYI, of the production sites, 24 don't have bom validation enabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding correct one sorry: @jedimt

@ajf My understanding of what @nvcoop is suggesting (please correct me @nvcoop):

  1. Default behavior (current implementation):
    Only generate IbPortDown alerts for machines that have a SKU assigned. The SKU's inactive_devices list is used to distinguish intentionally disconnected ports from unexpectedly down ports. Machines without a SKU are not alerted on.

  2. Optional behavior (future config flag?): A new config setting (e.g. ib_port_down_alert_all_machines: bool, default false) that, when enabled, would alert on all machines with down ports regardless of whether a SKU is defined. This would be useful for operators who want strict monitoring on every machine.

  3. This PR already implements the default behavior:

    • If a machine has a SKU: alerts are generated for ports that are down and not in the SKU's inactive_devices list. For example, an L40 with 4 IB devices but inactive_devices: [1, 3] would only alert if ports 0 or 2 go down — not on ports 1 and 3 which are down by design.
    • If a machine has an active instance: only ports in the instance's IB config are monitored.
    • If a machine has no SKU assigned: no alert is generated (lenient — avoids noise on sites without BOM validation).

I could add the optional config flag for "alert all machines" as a follow-up if that's desired. @nvcoop could you confirm this matches what you had in mind?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want option 2. Global settings shouldn't override SKU definitions. If an operator wanted strict monitoring, the SKUs in use would just say all ports should be up.

If a machine has a SKU: alerts are generated for ports that are down and not in the SKU's inactive_devices list. For example, an L40 with 4 IB devices but inactive_devices: [1, 3] would only alert if ports 0 or 2 go down — not on ports 1 and 3 which are down by design

Good so far.

If a machine has an active instance: only ports in the instance's IB config are monitored.

I think that's correct behavior.

If a machine has no SKU assigned: no alert is generated (lenient — avoids noise on sites without BOM validation).

Yeah I think I'm just concerned with this case. If there's no SKU, I think we should generate an alert for down ports.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajf it's that last part that I want to be optional. Generating alerts for machines that don't have a SKU assigned.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a machine has an active instance: only ports in the instance's IB config are monitored.

I'd avoid that. We added that workaround mostly because hosts did not had SKUs assigned. But now when we get them assigned, we might as well do the actually correct thing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Matthias247 updated the flow to use SKU as the primary check if it exist and fall back on instance ib-config if not for transitioning sites.

Comment thread crates/api/src/ib_fabric_monitor/mod.rs Outdated
Copy link
Copy Markdown
Contributor

@Matthias247 Matthias247 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at the implementation in-depth. I think @kensimon already did look deeper or can support more.

One question is probably whether the health alert is set in the IB monitor or in the main state handler. E.g. we currently set the "dpu heartbeattimeout" one in the main state handler. Both will work, and I personally don't have a big favorite (but maybe @kensimon or @chet might have?).

If you implement the optimization pointed out in the other comment then the check inside the monitor might be a bit more efficient since we don't need to load SKU data in the state handler.

Comment thread crates/api/src/ib_fabric_monitor/mod.rs Outdated
@nvcoop
Copy link
Copy Markdown

nvcoop commented Mar 30, 2026

Can we have a way to dry-run this to see how many alerts would get generated at a site before this enabled?

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread crates/api/src/ib_fabric_monitor/mod.rs Outdated
Comment thread crates/api/src/ib_fabric_monitor/mod.rs Outdated
When the IB Fabric Monitor detects ports not in Active state, it now
sets a PreventAllocations health alert on the affected machine. This
prevents Carbide from attempting to allocate instances on machines with
degraded IB connectivity, avoiding SRE alerts.

Port-down alerting uses a precedence model:
1. If a machine has a SKU assigned, alert on any down port not in the
   SKU's inactive_devices list (hardware-level truth).
2. If no SKU but an active instance exists, alert on down ports in the
   instance's IB config.
3. If neither SKU nor instance, no alert is generated.

- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking

    ## Type of Change
    <!-- Check one that best describes this PR -->
    - [ ] **Add** - New feature or capability
    - [ ] **Change** - Changes in existing functionality
    - [x] **Fix** - Bug fixes
    - [ ] **Remove** - Removed features or deprecated functionality
    - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

    ## Related Issues (Optional)
    <!-- If applicable, provide GitHub Issue. -->

    ## Breaking Changes
    - [ ] This PR contains breaking changes

    <!-- If checked above, describe the breaking changes and migration steps
    -->

    ## Testing
    <!-- How was this tested? Check all that apply -->
    - [x] Unit tests added/updated
    - [x] Integration tests added/updated
    - [x] Manual testing performed
    - [ ] No testing required (docs, internal refactor, etc.)

    ## Additional Notes
    <!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Hamid Asayesh <hasayesh@nvidia.com>

Apply suggestion from @kensimon

Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>
@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Apr 29, 2026

This is waiting on @hasayesh and @nvcoop to discuss impact on notifications given there's a new health alert.

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Apr 29, 2026

This is waiting on @hasayesh and @nvcoop to discuss impact on notifications given there's a new health alert.

This is all aligned now.

@ajf ajf merged commit 4b77fcd into NVIDIA:main Apr 29, 2026
36 checks passed
andreliuNV pushed a commit to andreliuNV/ncx-infra-controller-core that referenced this pull request Apr 30, 2026
## Description

When the IB Fabric Monitor detects ports not in Active state, it now
sets a PreventAllocations health alert on the affected machine. This
prevents Carbide from attempting to allocate instances on machines with
degraded IB connectivity, avoiding SRE alerts.

   Port-down alerting uses a precedence model:
1. If a machine has a SKU assigned, alert on any down port not in the
       SKU's inactive_devices list (hardware-level truth).
2. If no SKU but an active instance exists, alert on down ports in the
       instance's IB config.
    3. If neither SKU nor instance, no alert is generated.
    
- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>
rpowers-nv pushed a commit to rpowers-nv/ncx-infra-controller-core that referenced this pull request May 5, 2026
## Description

When the IB Fabric Monitor detects ports not in Active state, it now
sets a PreventAllocations health alert on the affected machine. This
prevents Carbide from attempting to allocate instances on machines with
degraded IB connectivity, avoiding SRE alerts.

   Port-down alerting uses a precedence model:
1. If a machine has a SKU assigned, alert on any down port not in the
       SKU's inactive_devices list (hardware-level truth).
2. If no SKU but an active instance exists, alert on down ports in the
       instance's IB config.
    3. If neither SKU nor instance, no alert is generated.

- Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
- Detect ports not in Active state during IB fabric monitoring
- Set/clear IbPortDown health alert via health report overrides
- Update existing test to expect health alert blocking

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>
Signed-off-by: rpowers <rpowers@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants