Skip to content

[RFC] fix incorrect GSI numbering causing device limit on aarch64 #5283

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 8, 2025

Conversation

Swchexit
Copy link
Contributor

I’m opening this as an RFC because I would appreciate feedback on my proposed solution. Please see my detailed analysis and reasoning on the issue page #4207 . Additionally, I’m encountering timeout problems running the test suite with my current setup and would appreciate assistance verifying this patch.

Changes

Introduce two new constants (GSI_BASE and GSI_MAX) in the layout.rs files for both aarch64 and x86_64. These constants correctly adjust the GSI numbering for aarch64 VMs, aligning Firecracker’s interrupt assignments with Linux’s expectations and ensuring that the interrupts provided in the device tree and registered via irqfd are correctly numbered.

Reason

To resolve issue #4207 , where incorrect GSI numbering caused by using an offset of 32 (IRQ_BASE) resulted in an artificial limitation of 64 devices on aarch64 microVMs. This change removes that limitation, enabling aarch64 microVMs to correctly use up to 95 devices.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

On aarch64, Linux expects device tree interrupts (GSIs) to start at 0,
as it internally adds an offset of 32 for SPIs. Firecracker previously
defined IRQ_BASE as 32, unintentionally causing a double offset and
limiting the maximum number of attachable devices to 64.

This commit introduces new constants, GSI_BASE and GSI_MAX, properly
adjusted for aarch64, with GSIs starting at 1 (since GSI 0 is disallowed
by device_manager/mmio.rs). These changes ensure interrupts are numbered
correctly, resolving the issue and allowing an aarch64 VM to handle
up to 95 devices.

Additional adjustments are made to tests to reflect this correction.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
Copy link

codecov bot commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.

Project coverage is 82.91%. Comparing base (5ce383e) to head (b46706b).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/device_manager/mmio.rs 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5283      +/-   ##
==========================================
+ Coverage   82.86%   82.91%   +0.05%     
==========================================
  Files         250      250              
  Lines       26897    26897              
==========================================
+ Hits        22288    22302      +14     
+ Misses       4609     4595      -14     
Flag Coverage Δ
5.10-c5n.metal 83.35% <100.00%> (ø)
5.10-m5n.metal 83.35% <100.00%> (+<0.01%) ⬆️
5.10-m6a.metal 82.56% <100.00%> (-0.01%) ⬇️
5.10-m6g.metal 79.17% <83.33%> (ø)
5.10-m6i.metal 83.34% <100.00%> (-0.01%) ⬇️
5.10-m7a.metal-48xl 82.55% <100.00%> (?)
5.10-m7g.metal 79.17% <83.33%> (ø)
5.10-m7i.metal-24xl 83.31% <100.00%> (?)
5.10-m7i.metal-48xl 83.30% <100.00%> (?)
5.10-m8g.metal-24xl 79.17% <83.33%> (?)
5.10-m8g.metal-48xl 79.17% <83.33%> (?)
6.1-c5n.metal 83.39% <100.00%> (-0.01%) ⬇️
6.1-m5n.metal 83.40% <100.00%> (ø)
6.1-m6a.metal 82.61% <100.00%> (-0.01%) ⬇️
6.1-m6g.metal 79.16% <83.33%> (ø)
6.1-m6i.metal 83.39% <100.00%> (+<0.01%) ⬆️
6.1-m7a.metal-48xl 82.60% <100.00%> (?)
6.1-m7g.metal 79.17% <83.33%> (ø)
6.1-m7i.metal-24xl 83.40% <100.00%> (?)
6.1-m7i.metal-48xl 83.41% <100.00%> (?)
6.1-m8g.metal-24xl 79.16% <83.33%> (?)
6.1-m8g.metal-48xl 79.17% <83.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Swchexit added 2 commits July 1, 2025 09:48
Remove the documentation note regarding the 64-device limit on aarch64,
as this limitation was resolved by an earlier commit.

The limitation no longer applies, and the documentation is now updated.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
Set `GSI_BASE` to `0` for aarch64, enabling the use of the full
interrupt range and allowing up to 96 devices to be attached.

Change `MMIODeviceInfo::irq` from `Option<NonZeroU32>` to `Option<u32>`
to accommodate `0` as a valid IRQ number. Adjust relevant tests to
reflect this change and ensure correctness.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
@Manciukic
Copy link
Contributor

aarch64 complains about:

error: useless conversion to the same type: `u32`
   --> src/vmm/src/arch/aarch64/fdt.rs:365:13
    |
365 |             dev_info.irq.unwrap().into(),
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: consider removing `.into()`: `dev_info.irq.unwrap()`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_conversion
    = note: `-D clippy::useless-conversion` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::useless_conversion)]`
error: useless conversion to the same type: `u32`
   --> src/vmm/src/arch/aarch64/fdt.rs:386:13
    |
386 |             dev_info.irq.unwrap().into(),
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: consider removing `.into()`: `dev_info.irq.unwrap()`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_conversion

@Manciukic
Copy link
Contributor

We also need to update a few entries in the documentation:

  • add a changelog entry describing the fix
  • remove the 64 device limitation here:
    ## Known device limitations
    If more than 64 devices are configured for a VM in total on aarch64, only first
    64 of them are functional
    ([related issue](https://github.com/firecracker-microvm/firecracker/issues/4207)).

PS: kudos for finding the problem and fixing the issue! Thanks!

The test_max_devices test verifies the maximum number of devices that
can be attached, as permitted by the platform. It checks both that the
expected number of devices can be successfully attached and that
attaching more than the limit results in a failure.

Previously, this test was only executed on x86_64, where 18 devices are
expected. This update extends the test to run on aarch64 as well, which
supports up to 94 devices.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
After changing the type of `MMIODeviceInfo::irq` from
`Option<NonZeroU32>` to `Option<u32>`, we don't need to convert
NonZeroU32 to u32 with .into() anymore. Thus any occurance is removed.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
@Swchexit
Copy link
Contributor Author

Swchexit commented Jul 2, 2025

Hi, I've fixed the clippy error, updated the CHANGELOG, and removed the description of the 64-device limit from the docs. I will try to run the extended test as soon as I fix my settings.

I will also clean up commits if there are no further changes to be made.

roypat and others added 3 commits July 3, 2025 06:09
Establishing 94 SSH connections exhausts the default 256 MiB memory of
the `uvm_plain_machine` with `basic_config` previously used in tests
`test_max_devices` on aarch64.

Increase the VM's memory limit to 512 MiB to ensure stable and
consistent test execution.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
@roypat
Copy link
Contributor

roypat commented Jul 7, 2025

Alright, I think the last failures we can fix by just disabling the memory monitor in the "too many devices" test (pass monitor_memory=False to microvm_factory.build()).

Attaching a large number of devices (94) on aarch64 consumes significant
host memory, causing memory monitor checks to fail. Since this test
specifically validates behavior when exceeding device limits, memory
usage isn't relevant here.

Disable the memory monitor check to ensure the test passes consistently.

Signed-off-by: Sheng-Wei (Way) Chen <waychensw@gmail.com>
@roypat roypat enabled auto-merge (rebase) July 8, 2025 08:33
@roypat roypat merged commit a1a4a42 into firecracker-microvm:main Jul 8, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants