
VMD hotplug init from config file fails for multiple domains #3370

Open
tanabarr opened this issue May 7, 2024 · 12 comments

@tanabarr
Contributor

tanabarr commented May 7, 2024

Sighting report

Parsing the config file fails when two VMD domains are used and hotplug is enabled. With a single domain, or if hotplug is not enabled in the config, the application works as expected.

Expected Behavior

Application is expected to parse JSON config file successfully and recognise VMD-backing-device addresses specified in attach directives.

Current Behavior

The error is seen after calling spdk_subsystem_init_from_json_config() and waiting in xs_poll_completion(); cp_arg.cca_rc gets set to -1003.

The following is the application calling code:

	/* Initialize all registered subsystems: bdev, vmd, copy. */
	common_prep_arg(&cp_arg);
	spdk_subsystem_init_from_json_config(nvme_glb.bd_nvme_conf,
					     SPDK_DEFAULT_RPC_ADDR,
					     subsys_init_cb, &cp_arg,
					     true);
	rc = xs_poll_completion(ctxt, &cp_arg.cca_inflights, 0);
	D_ASSERT(rc == 0);

	if (cp_arg.cca_rc != 0) {
		rc = cp_arg.cca_rc;
		D_ERROR("failed to init bdevs, rc:%d\n", rc);
		goto out;
	}

	/* Continue poll until no more events */
	while (spdk_thread_poll(ctxt->bxc_thread, 0, 0) > 0)
		;
	D_DEBUG(DB_MGMT, "SPDK bdev initialized, tgt_id:%d", tgt_id);

The following errors are reported by SPDK:

ERROR: daos_engine:0 [2024-05-07 12:21:20.228670] nvme.c: 838:nvme_probe_internal: *ERROR*: NVMe ctrlr scan failed
ERROR: daos_engine:0 [2024-05-07 12:21:20.228703] bdev_nvme.c:4022:bdev_nvme_create: *ERROR*: No controller was found with provided trid (traddr: d70505:03:00.0)
ERROR: daos_engine:0 [2024-05-07 12:21:20.232712] json_config.c: 221:rpc_client_poller: *ERROR*: error response:
{
  "code": -19,
  "message": "No such device"
}

It seems that, for some reason, the second VMD domain can't be discovered when hotplug is enabled.

If I hack the application to run the same sequence without bdev_nvme_attach_controller calls in the config and hotplug enabled, the app proceeds past the spdk_subsystem_init_from_json_config() call (but then fails shortly after because spdk_bdev_first() returns NULL).

Konrad explained that in the VMD-hotplug case it shouldn't be necessary to specify the devices explicitly in config attach entries as any devices attached to a VMD domain bound to the vfio driver should be discoverable by the SPDK process and therefore automatically detected and attached.

Not explicitly specifying the devices in attach entries in the config doesn't really work with the application's current implementation:

  • The application uses multiple SPDK processes which need to use a specified set of VMD domains in isolation
  • The application embeds information in the bdev name which is used in internal configuration after loading the JSON file.

To clarify the problem space, experiments verify the following behaviour:

  • With attach calls and single VMD domain with enable_hotplug SET, app can start
  • With attach calls and multiple VMD domains with enable_hotplug UNSET, app can start
  • With NO attach calls, multiple VMD domains and enable_hotplug SET, app continues past config load but fails on spdk_bdev_first()

Note that the VMD domains are in different sockets (just for extra context) but I doubt that makes any difference.

As Konrad has already observed, it might be due to a race between the app and the hotplug poller.

Possible Solution

When the attach config entries are removed, the app fails on spdk_bdev_first() after SPDK has been initialised from config. To discover the devices in the VMD-hotplug case, the application would need to wait for the hotplug poller to detect, attach and init bdevs before continuing. This wouldn't be an ideal workaround, as the application currently depends on information encoded in the bdev name. Ideally the current method of explicit device specification in config would work for multiple VMD domains per process.

Steps to Reproduce

As above.

Context (Environment including OS version, SPDK version, etc.)

Rocky Linux 8.7
SPDK 22.01
DAOS 2.6 (master as of 07-May-2024)

@tanabarr
Contributor Author

tanabarr commented May 7, 2024

Debug build info including DPDK parameters on init:

Starting SPDK v22.01.2 git sha1 cbf9b3496 / DPDK 21.11.2 initialization...
DPDK EAL parameters: daos_engine --no-shconf -c 0x1 --allow=0000:5d:05.5 --allow=0000:d7:05.5
  --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --log-level=lib.eal:4
  --log-level=lib.malloc:4 --log-level=lib.ring:4 --log-level=lib.mempool:4 --log-level=lib.timer:4
  --log-level=pmd:4 --log-level=lib.hash:4 --log-level=lib.lpm:4 --log-level=lib.kni:4
  --log-level=lib.acl:4 --log-level=lib.power:4 --log-level=lib.meter:4 --log-level=lib.sched:4
  --log-level=lib.port:4 --log-level=lib.table:4 --log-level=lib.pipeline:4 --log-level=lib.mbuf:4
  --log-level=lib.cryptodev:4 --log-level=lib.efd:4 --log-level=lib.eventdev:4 --log-level=lib.gso:4
  --log-level=user1:4 --log-level=user2:4 --log-level=user3:4 --log-level=user4:4
  --log-level=user5:4 --log-level=user6:4 --log-level=user7:4 --log-level=user8:4
  --no-telemetry --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid1017303

@jimharris
Member

Hi @tanabarr, this is a pretty old SPDK, can you reproduce this on v24.01 or latest master?

@tanabarr
Contributor Author

tanabarr commented May 7, 2024

SPDK JSON config file contents:

[tanabarr@wolf-313 ~]$ cat /mnt/daos0/daos_nvme.conf
{
  "daos_data": {
    "config": [
      {
        "params": {
          "begin": 0,
          "end": 255
        },
        "method": "hotplug_busid_range"
      }
    ]
  },
  "subsystems": [
    {
      "subsystem": "bdev",
      "config": [
        {
          "params": {
            "bdev_io_pool_size": 65536,
            "bdev_io_cache_size": 256
          },
          "method": "bdev_set_options"
        },
        {
          "params": {
            "retry_count": 4,
            "timeout_us": 0,
            "nvme_adminq_poll_period_us": 100000,
            "action_on_timeout": "none",
            "nvme_ioq_poll_period_us": 0
          },
          "method": "bdev_nvme_set_options"
        },
        {
          "params": {
            "enable": true,
            "period_us": 5000000
          },
          "method": "bdev_nvme_set_hotplug"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_0_1_0",
            "traddr": "5d0505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_1_1_0",
            "traddr": "5d0505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_2_1_0",
            "traddr": "d70505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_3_1_0",
            "traddr": "d70505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        }
      ]
    },
    {
      "subsystem": "vmd",
      "config": [
        {
          "params": {},
          "method": "enable_vmd"
        }
      ]
    }
  ]
}

@tanabarr
Contributor Author

tanabarr commented May 7, 2024

@jimharris not easily, but I will try. I think we will run into various other errors upgrading to v24, but it's definitely the logical thing to do.

@jimharris
Member

> The following errors are reported by SPDK:
>
> ERROR: daos_engine:0 [2024-05-07 12:21:20.228670] nvme.c: 838:nvme_probe_internal: *ERROR*: NVMe ctrlr scan failed
> ERROR: daos_engine:0 [2024-05-07 12:21:20.228703] bdev_nvme.c:4022:bdev_nvme_create: *ERROR*: No controller was found with provided trid (traddr: d70505:03:00.0)
> ERROR: daos_engine:0 [2024-05-07 12:21:20.232712] json_config.c: 221:rpc_client_poller: *ERROR*: error response:
> {
>   "code": -19,
>   "message": "No such device"
> }
>
> It seems that, for some reason, the second VMD domain can't be discovered when hotplug is enabled.
>
> If I hack the application to run the same sequence without bdev_nvme_attach_controller calls in the config and hotplug enabled, the app proceeds past the spdk_subsystem_init_from_json_config() call (but then fails shortly after because spdk_bdev_first() returns NULL).

This is something in DAOS that is calling spdk_bdev_first()?

> Konrad explained that in the VMD-hotplug case it shouldn't be necessary to specify the devices explicitly in config attach entries, as any devices attached to a VMD domain bound to the vfio driver should be discoverable by the SPDK process and therefore automatically detected and attached.
>
> Not explicitly specifying the devices in attach entries in the config doesn't really work with the application's current implementation:
>
>   • The application uses multiple SPDK processes which need to use a specified set of VMD domains in isolation
>   • The application embeds information in the bdev name which is used in internal configuration after loading the JSON file.

But my understanding was that the hotplug_busid_range (beginning of your config file) takes care of deciding which process takes control of a newly found SSD. Normally this would be used when an SSD is hot inserted at some later time, but it should work just as well at initial start up.

> To clarify the problem space, experiments verify the following behaviour:
>
>   • With attach calls and a single VMD domain with enable_hotplug SET, the app can start
>   • With attach calls and multiple VMD domains with enable_hotplug UNSET, the app can start
>   • With NO attach calls, multiple VMD domains and enable_hotplug SET, the app continues past config load but fails on spdk_bdev_first()

I think we need to understand this call to spdk_bdev_first().

@jimharris
Member

@tanabarr can you turn on the vmd debug log flag, re-run your test, and post the log here?

@tanabarr
Contributor Author

tanabarr commented May 8, 2024

@jimharris

> This is something in DAOS that is calling spdk_bdev_first()?

Yes, after attaching when loading config, bdevs get enumerated and blobstores loaded/created.

> But my understanding was that the hotplug_busid_range (beginning of your config file) takes care of deciding which process takes control of a newly found SSD. Normally this would be used when an SSD is hot inserted at some later time, but it should work just as well at initial start up.

The range in daos_data is decoded and applied in the spdk_nvme_pcie_set_hotplug_filter() callback parameter. I haven't seen any logging from the callback during initial start-up that I have previously seen during (non-VMD mode) hotplug events. I was under the impression that for VMD the filter isn't relevant because the VMD driver is handling polling?

In terms of device selection in VMD mode, DAOS calls spdk_env_init() with the VMD domain addresses it is interested in (i.e. for the specific DAOS engine SPDK process) in an effort to prevent one engine from accessing another's assigned VMD domains. The JSON config generated by the DAOS control-plane contains the backing device addresses of the domains assigned to each engine/SPDK process. One of my concerns with the idea of removing the explicit device assignments in the config was the suggestion that automatic detection and attachment would be performed on any domain bound to the vfio driver, and in the case of multiple SPDK processes (DAOS engines) this might cause an issue. Maybe the filtering of domains applied through spdk_env_init_opts would mitigate this concern?

@jimharris
Member

> > This is something in DAOS that is calling spdk_bdev_first()?
>
> Yes, after attaching when loading config, bdevs get enumerated and blobstores loaded/created.
>
> > But my understanding was that the hotplug_busid_range (beginning of your config file) takes care of deciding which process takes control of a newly found SSD. Normally this would be used when an SSD is hot inserted at some later time, but it should work just as well at initial start up.
>
> The range in daos_data is decoded and applied in the spdk_nvme_pcie_set_hotplug_filter() callback parameter. I haven't seen any logging from the callback during initial start-up that I have previously seen during (non-VMD mode) hotplug events. I was under the impression that for VMD the filter isn't relevant because the VMD driver is handling polling?

The registered filter function only gets called when the bdev/nvme hotplug poller tries to look for new (or, more precisely, non-attached) devices. If you explicitly attach using bdev_nvme_attach_controller, that filter function will not fire.

> In terms of device selection in VMD mode, DAOS calls spdk_env_init() with the VMD domain addresses it is interested in (i.e. for the specific DAOS engine SPDK process) in an effort to prevent one engine from accessing another's assigned VMD domains. The JSON config generated by the DAOS control-plane contains the backing device addresses of the domains assigned to each engine/SPDK process. One of my concerns with the idea of removing the explicit device assignments in the config was the suggestion that automatic detection and attachment would be performed on any domain bound to the vfio driver, and in the case of multiple SPDK processes (DAOS engines) this might cause an issue. Maybe the filtering of domains applied through spdk_env_init_opts would mitigate this concern?

For each VMD PCI function that is bound to vfio-pci, you will only be able to bind it to one process. So you will need to do filtering through spdk_env_init_opts.

This is probably the problem. Without VMD, any new devices are globally accessible through the kernel's PCI subsystem, so you need that filter function to make sure only one of your hotplug-enabled processes tries to enable it. But I'm not sure how that filter function would work with VMD. It would need to pick whatever bus addresses VMD would choose for that VMD endpoint. But I don't think you even need the filter function with VMD: if the VMD device can only be attached to one SPDK process, then by default all SSDs behind that VMD endpoint need to be attached to that process. No other process will be able to access them.

@tanabarr
Contributor Author

tanabarr commented May 9, 2024

daos_engine.0.log
daos_helper.log
daos_server.log

Attached are the logs from a run with SPDK v22.01 (my attempts to build with v24 failed due to time constraints, specifically linking issues in our scons build; I will try again later). The run used the original DAOS source as described in the description, with none of the experimental workarounds I have been looking at. The only source changes are to enable a debug build and the "vmd" debug flag.

  • daos_server.log is the most relevant file and contains the SPDK VMD logging interleaved with the control-plane entries
  • daos_helper.log contains (among other things) driver-binding output from spdk/scripts/setup.sh driven from the control-plane
  • daos_engine.0.log is supplied for completeness and includes (non-SPDK) logging from the engine (SPDK) process; SPDK logging from the engine process is forwarded to daos_server.log. This log may be useful to correlate events via timestamps with daos_server.log

@jimharris @ksztyber @NiuYawei please let me know what you think and if any logs from comparison runs would be useful.

Note that I have tried some experiments with not setting hotplug poller bus-ID ranges and expanding the range to allow all. Those experiments yielded the same failure.

@ksztyber
Contributor

From the logs, it looks like the VMD driver has enumerated the device that bdev_nvme complains about. I suspect there is a race between the hotplug poller and the bdev_nvme_attach_controller calls. I think the last attach call is executed after the poller has picked up the controller, but before the controller got fully initialized, which is why the error message says that the scan failed.

If I'm right, moving the bdev_nvme_set_hotplug RPC after the attach RPCs in the config should fix this, i.e.:

    {
      "subsystem": "bdev",
      "config": [
        {
          "params": {
            "bdev_io_pool_size": 65536,
            "bdev_io_cache_size": 256
          },
          "method": "bdev_set_options"
        },
        {
          "params": {
            "retry_count": 4,
            "timeout_us": 0,
            "nvme_adminq_poll_period_us": 100000,
            "action_on_timeout": "none",
            "nvme_ioq_poll_period_us": 0
          },
          "method": "bdev_nvme_set_options"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_0_1_0",
            "traddr": "5d0505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_1_1_0",
            "traddr": "5d0505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_2_1_0",
            "traddr": "d70505:01:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "trtype": "PCIe",
            "name": "Nvme_wolf-313.wolf.hpdd.intel.com_3_1_0",
            "traddr": "d70505:03:00.0"
          },
          "method": "bdev_nvme_attach_controller"
        },
        {
          "params": {
            "enable": true,
            "period_us": 5000000
          },
          "method": "bdev_nvme_set_hotplug"
        }
      ]
    },

@tanabarr, could you check it?

@tanabarr
Contributor Author

@ksztyber apologies for the delay in response, I will try the suggestion this week.

@tanabarr
Contributor Author

Yes, changing the RPC order as suggested works around the issue. See the attached good/bad config files.
daos_nvme.bad.conf.txt
daos_nvme.good.conf.txt
