
VMD hotplug init from config file fails for multiple domains #3370

Open
tanabarr opened this issue May 7, 2024 · 12 comments

@tanabarr
Contributor

tanabarr commented May 7, 2024

Sighting report

Parsing the config file fails when two VMD domains are used and hotplug is enabled. With a single domain, or if hotplug is not enabled in the config, the application works as expected.

Expected Behavior

Application is expected to parse JSON config file successfully and recognise VMD-backing-device addresses specified in attach directives.

Current Behavior

The error is seen after calling spdk_subsystem_init_from_json_config() and waiting in xs_poll_completion(); cp_arg.cca_rc gets set to -1003.

The following is the application calling code:

	/* Initialize all registered subsystems: bdev, vmd, copy. */
	common_prep_arg(&cp_arg);
	spdk_subsystem_init_from_json_config(nvme_glb.bd_nvme_conf,
					     SPDK_DEFAULT_RPC_ADDR,
					     subsys_init_cb, &cp_arg,
					     true);
	rc = xs_poll_completion(ctxt, &cp_arg.cca_inflights, 0);
	D_ASSERT(rc == 0);

	if (cp_arg.cca_rc != 0) {
		rc = cp_arg.cca_rc;
		D_ERROR("failed to init bdevs, rc:%d\n", rc);
		goto out;
	}

	/* Continue poll until no more events */
	while (spdk_thread_poll(ctxt->bxc_thread, 0, 0) > 0)
		;
	D_DEBUG(DB_MGMT, "SPDK bdev initialized, tgt_id:%d", tgt_id);

The following errors are reported by SPDK:

ERROR: daos_engine:0 [2024-05-07 12:21:20.228670] nvme.c: 838:nvme_probe_internal: *ERROR*: NVMe ctrlr scan failed
ERROR: daos_engine:0 [2024-05-07 12:21:20.228703] bdev_nvme.c:4022:bdev_nvme_create: *ERROR*: No controller was found with provided trid (traddr: d70505:03:00.0)
ERROR: daos_engine:0 [2024-05-07 12:21:20.232712] json_config.c: 221:rpc_client_poller: *ERROR*: error response:
{
  "code": -19,
  "message": "No such device"
}

It seems that, for some reason, the second VMD domain can't be discovered when hotplug is enabled.

If I hack the application to run the same sequence without bdev_nvme_attach_controller calls in the config and hotplug enabled, the app proceeds past the spdk_subsystem_init_from_json_config() call (but then fails shortly after because spdk_bdev_first() returns NULL).

Konrad explained that in the VMD-hotplug case it shouldn't be necessary to specify the devices explicitly in config attach entries as any devices attached to a VMD domain bound to the vfio driver should be discoverable by the SPDK process and therefore automatically detected and attached.

Not explicitly specifying the devices in attach entries in the config doesn't really work with the application's current implementation:

  • The application uses multiple SPDK processes which need to use a specified set of VMD domains in isolation
  • The application embeds information in the bdev name which is used in internal configuration after loading the JSON file.

To clarify the problem space, experiments verify the following behaviour:

  • With attach calls and single VMD domain with enable_hotplug SET, app can start
  • With attach calls and multiple VMD domains with enable_hotplug UNSET, app can start
  • With NO attach calls, multiple VMD domains and enable_hotplug SET, app continues past config load but fails on spdk_bdev_first()

Note that the VMD domains are in different sockets (just for extra context) but I doubt that makes any difference.

As Konrad has already observed, it might be due to a race between the app and the hotplug poller.

Possible Solution

When the attach config entries are removed, the app fails on spdk_bdev_first() after SPDK has been initialised from config. To discover the devices in the VMD-hotplug case, the application would need to wait for the hotplug poller to detect, attach and init bdevs before continuing. This wouldn't be an ideal workaround, as the application currently depends on information encoded in the bdev name. Ideally the current method of explicit device specification in config would work for multiple VMD domains per process.

Steps to Reproduce

As above.

Context (Environment including OS version, SPDK version, etc.)

Rocky Linux 8.7
SPDK 22.01
DAOS 2.6 (master as of 07-May-2024)

@tanabarr
Contributor Author

tanabarr commented May 7, 2024

Debug build info including DPDK parameters on init:

Starting SPDK v22.01.2 git sha1 cbf9b3496 / DPDK 21.11.2 initialization...
DPDK EAL parameters: daos_engine --no-shconf -c 0x1 --allow=0000:5d:05.5 --allow=0000:d7:05.5
  --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --log-level=lib.eal:4
  --log-level=lib.malloc:4 --log-level=lib.ring:4 --log-level=lib.mempool:4 --log-level=lib.timer:4
  --log-level=pmd:4 --log-level=lib.hash:4 --log-level=lib.lpm:4 --log-level=lib.kni:4
  --log-level=lib.acl:4 --log-level=lib.power:4 --log-level=lib.meter:4 --log-level=lib.sched:4
  --log-level=lib.port:4 --log-level=lib.table:4 --log-level=lib.pipeline:4 --log-level=lib.mbuf:4
  --log-level=lib.cryptodev:4 --log-level=lib.efd:4 --log-level=lib.eventdev:4 --log-level=lib.gso:4
  --log-level=user1:4 --log-level=user2:4 --log-level=user3:4 --log-level=user4:4
  --log-level=user5:4 --log-level=user6:4 --log-level=user7:4 --log-level=user8:4
  --no-telemetry --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid1017303

@jimharris
Member

Hi @tanabarr, this is a pretty old SPDK, can you reproduce this on v24.01 or latest master?

@tanabarr
Contributor Author

tanabarr commented May 7, 2024

SPDK JSON config file contents:

[tanabarr@wolf-313 ~]$ cat /mnt/daos0/daos_nvme.conf
{
  "daos_data": {
    "config": [
      {
        "params": {
          "begin": 0,
          "end": 255
        },
        "method": "hotplug_busid_range"
      }
    ]
  },
  "subsystems": [
    {
      "subsystem": "bdev",
      "config": [
        {
          "params": {
            "bdev_io_pool_size": 65536,
            "bdev_io_cache_size": 256
          },
          "method": "bdev_set_options"
        },
        {
          "params": {
            "retry_count": 4,
            "timeout_us": 0,
            "nvme_adminq_poll_period_us": 100000,
            "action_on_timeout": "none",
            "nvme_ioq_poll_period_us": 0
          },
          "method": "bdev_nvme_set_options"
        },
        {
          "params": {
            "enable": true,
            "period_us": 5000000
          },
          "method": "bdev_nvme_set_hotplug"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_0_1_0",
            "traddr": "5d0505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_1_1_0",
            "traddr": "5d0505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_2_1_0",
            "traddr": "d70505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_3_1_0",
            "traddr": "d70505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        }
      ]
    },
    {
      "subsystem": "vmd",
      "config": [
        {
          "params": {},
          "method": "enable_vmd"
        }
      ]
    }
  ]
}

@tanabarr
Contributor Author

tanabarr commented May 7, 2024

@jimharris not easily, but I will try. I think we will run into various other errors upgrading to v24, but it's definitely the logical thing to do.

@jimharris
Member

> The following errors are reported by SPDK:
>
> ERROR: daos_engine:0 [2024-05-07 12:21:20.228670] nvme.c: 838:nvme_probe_internal: *ERROR*: NVMe ctrlr scan failed
> ERROR: daos_engine:0 [2024-05-07 12:21:20.228703] bdev_nvme.c:4022:bdev_nvme_create: *ERROR*: No controller was found with provided trid (traddr: d70505:03:00.0)
> ERROR: daos_engine:0 [2024-05-07 12:21:20.232712] json_config.c: 221:rpc_client_poller: *ERROR*: error response:
> {
>   "code": -19,
>   "message": "No such device"
> }
>
> It seems that, for some reason, the second VMD domain can't be discovered when hotplug is enabled.
>
> If I hack the application to run the same sequence without bdev_nvme_attach_controller calls in the config and hotplug enabled, the app proceeds past the spdk_subsystem_init_from_json_config() call (but then fails shortly after because spdk_bdev_first() returns NULL).

This is something in DAOS that is calling spdk_bdev_first()?

> Konrad explained that in the VMD-hotplug case it shouldn't be necessary to specify the devices explicitly in config attach entries, as any devices attached to a VMD domain bound to the vfio driver should be discoverable by the SPDK process and therefore automatically detected and attached.
>
> Not explicitly specifying the devices in attach entries in the config doesn't really work with the application's current implementation:
>
>   • The application uses multiple SPDK processes which need to use a specified set of VMD domains in isolation
>   • The application embeds information in the bdev name which is used in internal configuration after loading the JSON file.

But my understanding was that the hotplug_busid_range (beginning of your config file) takes care of deciding which process takes control of a newly found SSD. Normally this would be used when an SSD is hot inserted at some later time, but it should work just as well at initial start up.

> To clarify the problem space, experiments verify the following behaviour:
>
>   • With attach calls and a single VMD domain with enable_hotplug SET, the app can start
>   • With attach calls and multiple VMD domains with enable_hotplug UNSET, the app can start
>   • With NO attach calls, multiple VMD domains and enable_hotplug SET, the app continues past config load but fails on spdk_bdev_first()

I think we need to understand this call to spdk_bdev_first().

@jimharris
Member

@tanabarr can you turn on the vmd debug log flag, re-run your test, and post the log here?

@tanabarr
Contributor Author

tanabarr commented May 8, 2024

@jimharris

> This is something in DAOS that is calling spdk_bdev_first()?

Yes, after attaching when loading config, bdevs get enumerated and blobstores loaded/created.

> But my understanding was that the hotplug_busid_range (beginning of your config file) takes care of deciding which process takes control of a newly found SSD. Normally this would be used when an SSD is hot inserted at some later time, but it should work just as well at initial start up.

The range in daos_data is decoded and applied in the spdk_nvme_pcie_set_hotplug_filter() callback parameter. I haven't seen any logging from the callback during initial start-up that I have previously seen during (non-VMD mode) hotplug events. I was under the impression that for VMD the filter isn't relevant because the VMD driver is handling polling?

In terms of device selection in VMD mode, DAOS calls spdk_env_init() with the VMD domain addresses it is interested in (i.e. for the specific DAOS engine SPDK process) in an effort to prevent one engine from accessing another's assigned VMD domains. The JSON config generated by the DAOS control-plane contains the backing device addresses of the domains assigned to each engine/SPDK process. One of my concerns with the idea of removing the explicit device assignments in the config was the suggestion that automatic detection and attachment would be performed on any domain bound to the vfio driver, and in the case of multiple SPDK processes (DAOS engines) this might cause an issue. Maybe the filtering of domains applied through spdk_env_init_opts would mitigate this concern?

@jimharris
Member

> > This is something in DAOS that is calling spdk_bdev_first()?
>
> Yes, after attaching when loading config, bdevs get enumerated and blobstores loaded/created.
>
> > But my understanding was that the hotplug_busid_range (beginning of your config file) takes care of deciding which process takes control of a newly found SSD. Normally this would be used when an SSD is hot inserted at some later time, but it should work just as well at initial start up.
>
> The range in daos_data is decoded and applied in the spdk_nvme_pcie_set_hotplug_filter() callback parameter. I haven't seen any logging from the callback during initial start-up that I have previously seen during (non-VMD mode) hotplug events. I was under the impression that for VMD the filter isn't relevant because the VMD driver is handling polling?

The registered filter function only gets called when the bdev/nvme hotplug poller tries to look for new (or, more precisely, non-attached) devices. If you explicitly attach using bdev_nvme_attach_controller, that filter function will not fire.

> In terms of device selection in VMD mode, DAOS calls spdk_env_init() with the VMD domain addresses it is interested in (i.e. for the specific DAOS engine SPDK process) in an effort to prevent one engine from accessing another's assigned VMD domains. The JSON config generated by the DAOS control-plane contains the backing device addresses of the domains assigned to each engine/SPDK process. One of my concerns with the idea of removing the explicit device assignments in the config was the suggestion that automatic detection and attachment would be performed on any domain bound to the vfio driver, and in the case of multiple SPDK processes (DAOS engines) this might cause an issue. Maybe the filtering of domains applied through spdk_env_init_opts would mitigate this concern?

For each VMD PCI function that is bound to vfio-pci, you will only be able to bind it to one process. So you will need to do filtering through spdk_env_init_opts.

This is probably the problem. Without VMD, any new devices are globally accessible through the kernel's PCI subsystem, so you need that filter function to make sure only one of your hotplug-enabled processes tries to enable it. But I'm not sure how that filter function would work with VMD. It would need to pick whatever bus addresses VMD would choose for that VMD endpoint. But I don't think you even need the filter function with VMD: if the VMD device can only be attached to one SPDK process, then by default all SSDs behind that VMD endpoint need to be attached to that process. No other process will be able to access them.

@tanabarr
Contributor Author

tanabarr commented May 9, 2024

daos_engine.0.log
daos_helper.log
daos_server.log

Attached are the logs from a run with SPDK v22.01 (my attempts to build with v24 failed due to time constraints, specifically linking issues in our scons build; I will try again later). The run used the original DAOS source as described in the description, with none of the experimental workarounds I have been looking at. The only source changes are to enable a debug build and the "vmd" debug flag.

  • daos_server.log is the most relevant file and contains the SPDK VMD logging interleaved with the control-plane entries
  • daos_helper.log contains (among other things) driver-binding output from spdk/scripts/setup.sh driven from the control-plane
  • daos_engine.0.log is supplied for completeness and includes (non-SPDK) logging from the engine (SPDK) process; SPDK logging from the engine process is forwarded to daos_server.log. This log may be useful to correlate events via timestamps with daos_server.log

@jimharris @ksztyber @NiuYawei please let me know what you think and if any logs from comparison runs would be useful.

Note that I have tried some experiments with not setting hotplug poller bus-ID ranges and expanding the range to allow all. Those experiments yielded the same failure.

@ksztyber
Contributor

From the logs, it looks like the VMD driver has enumerated the device that bdev_nvme complains about. I suspect there is a race between the hotplug poller and the bdev_nvme_attach_controller calls. I think the last attach call is executed after the poller has picked up the controller, but before the controller got fully initialized, which is why the error message says that the scan failed.

If I'm right, moving the bdev_nvme_set_hotplug RPC after the attach RPCs in the config should fix this, i.e.:

    {
      "subsystem": "bdev",
      "config": [
        {
          "params": {
            "bdev_io_pool_size": 65536,
            "bdev_io_cache_size": 256
          },
          "method": "bdev_set_options"
        },
        {
          "params": {
            "retry_count": 4,
            "timeout_us": 0,
            "nvme_adminq_poll_period_us": 100000,
            "action_on_timeout": "none",
            "nvme_ioq_poll_period_us": 0
          },
          "method": "bdev_nvme_set_options"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_0_1_0",
            "traddr": "5d0505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_1_1_0",
            "traddr": "5d0505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_2_1_0",
            "traddr": "d70505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_3_1_0",
            "traddr": "d70505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "enable": true,
            "period_us": 5000000
          },
          "method": "bdev_nvme_set_hotplug"
        }
      ]
    },

@tanabarr, could you check it?

@tanabarr
Contributor Author

@ksztyber apologies for the delay in response, I will try the suggestion this week.

@tanabarr
Contributor Author

Yes, changing the RPC order as suggested works around the issue. See the attached good/bad config files.
daos_nvme.bad.conf.txt
daos_nvme.good.conf.txt
