
Qemu bugfixes for numa distance and huge pfnmap alignment for 10.1 #13

Open
ankita-nv wants to merge 4 commits into NVIDIA:nvidia_stable-10.1 from
ankita-nv:nvidia_stable-10.1-ankita-bugfixes-0219

@ankita-nv

This PR addresses the following bugs:

  1. Correct setting of numa distances
    • This is under internal review
  2. [PATCH v4 0/3] hw/vfio: Enable hugepfnmap for non-power-of-2 device memory regions
    • Backported from the latest posting that is pulled into qemu.

During creation of the VM's SRAT table, the generic initiator entries
are added. Currently, the code queries the objects, which may not be
in sorted order. This results in a mismatch between the VM's view
of the PXM and the NUMA node ids.

As a fix, the patch builds a list of generic initiator objects,
sorts them, and then puts them in the VM's SRAT table.

Original (unsorted) PXM in the VM SRAT table
[152h 0338 004h]            Proximity Domain : 00000000
[17Ah 0378 004h]            Proximity Domain : 00000001
[1A4h 0420 004h]            Proximity Domain : 00000007
[1C4h 0452 004h]            Proximity Domain : 00000006
[1E4h 0484 004h]            Proximity Domain : 00000005
[204h 0516 004h]            Proximity Domain : 00000004
[224h 0548 004h]            Proximity Domain : 00000003
[244h 0580 004h]            Proximity Domain : 00000009
[264h 0612 004h]            Proximity Domain : 00000002
[284h 0644 004h]            Proximity Domain : 00000008
[2A2h 0674 004h]            Proximity Domain : 00000009

After the patch (sorted)
[152h 0338 004h]            Proximity Domain : 00000000
[17Ah 0378 004h]            Proximity Domain : 00000001
[1A4h 0420 004h]            Proximity Domain : 00000002
[1C4h 0452 004h]            Proximity Domain : 00000003
[1E4h 0484 004h]            Proximity Domain : 00000004
[204h 0516 004h]            Proximity Domain : 00000005
[224h 0548 004h]            Proximity Domain : 00000006
[244h 0580 004h]            Proximity Domain : 00000007
[264h 0612 004h]            Proximity Domain : 00000008
[284h 0644 004h]            Proximity Domain : 00000009
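
The collect-then-sort step described above can be sketched as follows. This is a minimal illustration, not the patch itself: `GiEntry`, `gi_cmp_pxm` and `gi_sort_by_pxm` are hypothetical names, and the real code works on QEMU generic initiator objects rather than a plain array.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a generic initiator object; not QEMU's type. */
typedef struct {
    uint32_t pxm;      /* proximity domain (NUMA node id) */
    const char *name;  /* label used for illustration only */
} GiEntry;

/* qsort comparator: ascending by PXM, written to avoid subtraction overflow. */
static int gi_cmp_pxm(const void *a, const void *b)
{
    const GiEntry *x = a;
    const GiEntry *y = b;
    return (x->pxm > y->pxm) - (x->pxm < y->pxm);
}

/* Sort the collected entries so the emission order follows the node ids. */
static void gi_sort_by_pxm(GiEntry *entries, size_t n)
{
    qsort(entries, n, sizeof(entries[0]), gi_cmp_pxm);
}
```

Under these assumptions, the SRAT entries would be emitted by walking the sorted array from index 0 upward.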

Fixes: 0a5b5ac ("hw/acpi: Implement the SRAT GI affinity structure")
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Sort sparse mmap regions by offset during region setup to ensure
a predictable mapping order, avoid overlaps, and properly handle
the gaps between sub-regions.

Add validation to detect overlapping sparse regions early during
setup before any mapping operations begin.

The sorting is performed on the subregion ranges during
vfio_setup_region_sparse_mmaps(). This also ensures that subsequent
mapping code can rely on subregions being in ascending offset order.
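
A rough sketch of the sort-then-validate idea, assuming a simplified `SparseArea` type with `offset` and `size` fields. The type and function names are illustrative and are not the structures used in hw/vfio, and overflow of `offset + size` is not handled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sparse mmap area; field names are illustrative. */
typedef struct {
    uint64_t offset;
    uint64_t size;
} SparseArea;

/* qsort comparator: ascending by offset. */
static int sparse_cmp_offset(const void *a, const void *b)
{
    const SparseArea *x = a;
    const SparseArea *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Sort areas by ascending offset, then reject any area that starts
 * before the previous one ends. Returns true when the layout is valid.
 */
static bool sparse_sort_and_validate(SparseArea *areas, size_t n)
{
    qsort(areas, n, sizeof(areas[0]), sparse_cmp_offset);
    for (size_t i = 1; i < n; i++) {
        uint64_t prev_end = areas[i - 1].offset + areas[i - 1].size;
        if (areas[i].offset < prev_end) {
            return false;   /* overlap detected before any mapping is done */
        }
    }
    return true;
}
```

Because the array is sorted first, checking each area against its immediate predecessor is enough to catch every overlap.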

This is preparatory work for alignment adjustments needed to support
hugepfnmap on systems where device memory (e.g., Grace-based systems)
may have non-power-of-2 sizes.

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
(backported from https://lore.kernel.org/all/20260217153010.408739-2-ankita@nvidia.com/)
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Add an Error **errp parameter to vfio_region_setup() and
vfio_setup_region_sparse_mmaps() to allow proper error handling
instead of just returning error codes.

The functions set errors via error_setg() when a failure occurs.
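
The calling convention can be illustrated with a self-contained stand-in. `DemoError`, `demo_error_set` and `demo_region_setup` below are hypothetical and only mimic the shape of an errp-style interface. They are not QEMU's Error type or error_setg(), and the zero-size check is an invented failure condition.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal stand-in for an error object; NOT QEMU's Error type. */
typedef struct DemoError {
    char msg[128];
} DemoError;

/* Stand-in for an error_setg()-style helper: fills *errp if one was passed. */
static void demo_error_set(DemoError **errp, const char *fmt, ...)
{
    if (!errp) {
        return;                 /* caller does not want error details */
    }
    DemoError *err = calloc(1, sizeof(*err));
    if (!err) {
        return;
    }
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Hypothetical setup routine: 0 on success, -1 with *errp set on failure. */
static int demo_region_setup(unsigned long size, DemoError **errp)
{
    if (size == 0) {
        demo_error_set(errp, "region size must be non-zero (got %lu)", size);
        return -1;
    }
    return 0;
}
```

The caller still gets a return code, but can now also report a descriptive message instead of guessing the cause from the code alone.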

Suggested-by: Cedric Le Goater <clg@redhat.com>
(backported from https://lore.kernel.org/all/20260217153010.408739-3-ankita@nvidia.com/)
[ankita: added the Error** param to vfio_region_setup in hw/vfio/platform.c]
[ankita: include/hw/vfio/vfio-region.h patched instead of hw/vfio/vfio-region.h]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
…ugepfnmap

On Grace-based systems such as GB200, device memory is exposed as a
BAR but the actual mappable size is not power-of-2 aligned. The
previous algorithm aligned each sparse mmap area based on its
individual size using ctz64(), which prevented efficient huge page
usage by the kernel.

Adjust VFIO region mapping alignment to use the next power-of-2 of
the total region size and place the sparse subregions at their
appropriate offsets. This makes a huge-page-aligned mapping more
likely, allowing the kernel to use larger page sizes for the VMA.

This enables the use of PMD-level huge pages which can significantly
improve memory access performance and reduce TLB pressure for large
device memory regions.

With this change, the mapping code does the following:
- Creates a single aligned base mapping for the entire region
- Bases the alignment on pow2ceil(region->size), capped at 1GiB
- Unmaps gaps between sparse regions
- Uses MAP_FIXED to overlay sparse mmap areas at their offsets

Example VMA for device memory of size 0x2F00F00000 on GB200:

Before (misaligned, no hugepfnmap):
ff88ff000000-ffb7fff00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1

After (aligned to 1GiB boundary, hugepfnmap enabled):
ff8ac0000000-ffb9c0f00000 rw-s 400000000000 00:06 727                    /dev/vfio/devices/vfio1
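
The alignment rule and the two VMA start addresses above can be checked with a little arithmetic. This is a sketch under stated assumptions: `next_pow2`, `region_alignment` and `is_aligned` are local helpers written for this example (not QEMU's pow2ceil()), and only the 1GiB cap and the addresses are taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALIGN_CAP (1ULL << 30)   /* 1GiB cap, as stated in the commit message */

/* Round up to the next power of two; v must be non-zero and at most 2^63. */
static uint64_t next_pow2(uint64_t v)
{
    uint64_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Alignment derived from the whole region size, capped at 1GiB. */
static uint64_t region_alignment(uint64_t region_size)
{
    uint64_t a = next_pow2(region_size);
    return a > ALIGN_CAP ? ALIGN_CAP : a;
}

/* True when addr is a multiple of align (align must be a power of two). */
static bool is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}
```

For the 0x2F00F00000 region, the next power of two is 0x4000000000, so the cap applies and the alignment is 1GiB. The "before" start address 0xff88ff000000 is not a multiple of 1GiB, while the "after" start address 0xff8ac0000000 is.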

Requires sparse regions to be sorted by offset (done in previous
patch) to correctly identify and handle gaps.
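
Why sorted input matters for gap handling can be seen in a small sketch. With areas in ascending offset order, a single pass with a running cursor finds every hole. `Range` and `find_gaps` are hypothetical names, and treating the space before the first area and after the last area as gaps is an assumption of this example, not something the commit message states.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t offset;
    uint64_t size;
} Range;

/*
 * Given areas sorted by ascending offset and non-overlapping, write the
 * holes inside [0, region_size) into gaps[] and return how many were found.
 * gaps[] must have room for n + 1 entries.
 */
static size_t find_gaps(const Range *areas, size_t n,
                        uint64_t region_size, Range *gaps)
{
    size_t count = 0;
    uint64_t cursor = 0;    /* end of the last mappable area seen so far */

    for (size_t i = 0; i < n; i++) {
        if (areas[i].offset > cursor) {
            gaps[count++] = (Range){ cursor, areas[i].offset - cursor };
        }
        cursor = areas[i].offset + areas[i].size;
    }
    if (cursor < region_size) {
        gaps[count++] = (Range){ cursor, region_size - cursor };
    }
    return count;
}
```

With unsorted input the cursor could move backwards, so holes would be missed or reported twice, which is why this step depends on the earlier sorting patch.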

cc: Alex Williamson <alex@shazbot.org>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
(backported from https://lore.kernel.org/all/20260217153010.408739-4-ankita@nvidia.com/)
[ankita: resolved minor conflict in vfio_region_mmap to set variables]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
@ankita-nv changed the title from "Nvidia stable 10.1 ankita bugfixes 0219" to "Qemu bugfixes for numa distance and huge pfnmap alignment" on Feb 19, 2026
@ankita-nv changed the title from "Qemu bugfixes for numa distance and huge pfnmap alignment" to "Qemu bugfixes for numa distance and huge pfnmap alignment for 10.1" on Feb 19, 2026