Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPDK backend uses detached controller #107

Closed
LiadOz opened this issue Jun 9, 2022 · 8 comments
Closed

SPDK backend uses detached controller #107

LiadOz opened this issue Jun 9, 2022 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@LiadOz
Copy link
Contributor

LiadOz commented Jun 9, 2022

I have found the following issue when working with version 0.3.0 and using ip:port for opening devices, here is the flow:

  1. Create 5 namespaces (should happen with at least 2)
  2. Open namespaces and issue commands
  3. Close namespaces
  4. Create namespace, open and issue commands
    In step 4 I crash sporadically, when debugging the issue I noticed that the controller has garbage values at this stage.

This is what I see when debugging:

  1. _cref_insert is called 5 times with the same controller, creating 5 entries
  2. _cref_lookup is called 5 times
  3. when closing the device _cref_deref is called 5 times removing the first entry of the controller out of 5 and detaching the controller
  4. _cref_lookup is called and raises the refcount of the next entry (same controller) to 2

It seems to me that the root cause of the issue is creating the 5 entries instead of only 1. assuming _cref_deref would be called enough times to clear every entry in the table it will detach the controller 5 times which should not happen.

With these changes the I stop crashing

diff --git a/lib/xnvme_be_spdk_dev.c b/lib/xnvme_be_spdk_dev.c
index 09f2ffc..91bd247 100644
--- a/lib/xnvme_be_spdk_dev.c
+++ b/lib/xnvme_be_spdk_dev.c
@@ -69,6 +69,15 @@ _cref_insert(struct xnvme_ident *ident, struct spdk_nvme_ctrlr *ctrlr)
                XNVME_DEBUG("FAILED: !ctrlr");
                return -EINVAL;
        }
+       for (int i = 0; i < XNVME_BE_SPDK_CREFS_LEN; ++i) { // check if entry already exists
+                if (!g_cref[i].refcount) {
+                        continue;
+                }
+
+                if (strncmp(g_cref[i].uri, ident->uri, XNVME_IDENT_URI_LEN - 1)) {
+                        return 0;
+                }
+        }

        for (int i = 0; i < XNVME_BE_SPDK_CREFS_LEN; ++i) {
                if (g_cref[i].refcount) {
@safl
Copy link
Member

safl commented Jun 9, 2022

Hi @LiadOz,
Thanks for reporting this and diving into the issue. For the five namespaces, the creation of those, is that similar to the scripts here:

Would be good to know to reproduce the issue.
Thanks!

@LiadOz
Copy link
Contributor Author

LiadOz commented Jun 9, 2022

I haven't been able to reproduce the crash itself in a simple environment but I can reproduce the issue:

I have created 3 namespace using nvmf app
here is the code

#include <libxnvme.h>
#include <libxnvme_pp.h>

int
main(int argc, char *argv[])
{
        struct xnvme_opts opts1 = xnvme_opts_default();
        struct xnvme_opts opts2 = xnvme_opts_default();
        struct xnvme_opts opts3 = xnvme_opts_default();
        struct xnvme_dev *dev1;
        struct xnvme_dev *dev2;
        struct xnvme_dev *dev3;

        opts1.nsid = 1;
        opts2.nsid = 2;
        opts3.nsid = 3;
        dev1 = xnvme_dev_open("127.0.0.1:4420", &opts1);
        dev2 = xnvme_dev_open("127.0.0.1:4420", &opts2);
        xnvme_dev_close(dev1);
        xnvme_dev_close(dev2);
        dev3 = xnvme_dev_open("127.0.0.1:4420", &opts3);
        xnvme_dev_close(dev3);
        return 0;
}

And made the following changes to add logs

diff --git a/lib/xnvme_be_spdk_dev.c b/lib/xnvme_be_spdk_dev.c
index 09f2ffc..071987e 100644
--- a/lib/xnvme_be_spdk_dev.c
+++ b/lib/xnvme_be_spdk_dev.c
@@ -50,6 +49,7 @@ _cref_lookup(struct xnvme_ident *id)

                g_cref[i].refcount += 1;

+               XNVME_DEBUG("LIADOZ: lookup ctrlr %p", g_cref[i].ctrlr);
                return g_cref[i].ctrlr;
        }

@@ -75,6 +75,7 @@ _cref_insert(struct xnvme_ident *ident, struct spdk_nvme_ctrlr *ctrlr)
                        continue;
                }

+               XNVME_DEBUG("LIADOZ: inserting ctrlr %p", ctrlr);
                g_cref[i].ctrlr = ctrlr;
                g_cref[i].refcount += 1;
                strncpy(g_cref[i].uri, ident->uri, XNVME_IDENT_URI_LEN);
@@ -107,6 +108,7 @@ _cref_deref(struct spdk_nvme_ctrlr *ctrlr)

                if (g_cref[i].refcount == 0) {
                        XNVME_DEBUG("INFO: refcount: %d => detaching", g_cref[i].refcount);
+                       XNVME_DEBUG("LIADOZ: detaching ctrlr %p", ctrlr);
                        spdk_nvme_detach(ctrlr);
                        memset(&g_cref[i], 0, sizeof g_cref[i]);
                }

here are the logs I get

# DBG:xnvme_be.c:xnvme_be_factory-667: INFO: obtained backend instance be: 'spdk'
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-228: INFO: SPDK NVMe PCIe Driver registration -- BEGIN
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-233: INFO: skipping, already registered.
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-235: INFO: SPDK NVMe PCIe Driver registration -- END
[2022-06-09 13:49:53.585464] Starting SPDK v21.10 git sha1 4e4f11ff7 / DPDK 21.08.0 initialization...
[2022-06-09 13:49:53.586031] [ DPDK EAL parameters: [2022-06-09 13:49:53.586286] spdk [2022-06-09 13:49:53.586492] --no-shconf [2022-06-09 13:49:53.586687] -c 0x1 [2022-06-09 13:49:53.586876] --log-level=lib.eal:6 [2022-06-09 13:49:53.587074] --log-level=lib.cryptodev:5 [2022-06-09 13:49:53.587262] --log-level=user1:6 [2022-06-09 13:49:53.587477] --iova-mode=pa [2022-06-09 13:49:53.587671] --base-virtaddr=0x200000000000 [2022-06-09 13:49:53.587862] --match-allocations [2022-06-09 13:49:53.588058] --file-prefix=spdk_pid111382 [2022-06-09 13:49:53.588249] ]
EAL: No available 1048576 kB hugepages reported
TELEMETRY: No legacy callbacks, legacy socket not created
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-258: INFO: spdk_env_is_initialized: 1
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: PCIe, # 1 of 4
EAL: Cannot find device (0127:00:00.1)
EAL: Failed to attach device on primary process
[2022-06-09 13:49:53.730738] nvme.c: 838:nvme_probe_internal: *ERROR*: NVMe ctrlr scan failed
[2022-06-09 13:49:53.730757] nvme.c: 915:spdk_nvme_probe: *ERROR*: Create probe context failed
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-808: FAILED: probe a:0, e:-1, i:0
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: TCP, # 2 of 4
[2022-06-09 13:49:54.199415] nvme_ctrlr.c: 704:nvme_ctrlr_set_intel_support_log_pages: *WARNING*: [nqn.2016-06.io.spdk:cnode1] Intel log pages not supported on Intel drive!
# DBG:xnvme_be_spdk_dev.c:attach_cb-451: INFO: nsid: 1
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: RDMA, # 3 of 4
[2022-06-09 13:49:54.199540] nvme.c: 830:nvme_probe_internal: *ERROR*: NVMe trtype 1 not available
[2022-06-09 13:49:54.199557] nvme.c: 915:spdk_nvme_probe: *ERROR*: Create probe context failed
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-808: FAILED: probe a:1, e:-1, i:0
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: FC, # 4 of 4
# DBG:xnvme_be_spdk_dev.c:_xnvme_be_spdk_ident_to_trid-320: FAILED: unsupported trtype: FC
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-802: SKIP/FAILED: ident_to_trid()
# DBG:xnvme_be_spdk_dev.c:_cref_insert-79: LIADOZ: inserting ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-838: INFO: open() : OK
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-475: INFO: !xnvme_adm_idfy_ctrlr_csi(CSI_ZONED)
# DBG:xnvme_be.c:xnvme_be_dev_idfy-499: INFO: no positive response to idfy(ZNS)
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-510: INFO: !xnvme_adm_idfy_ctrlr_csi(CSI_FS)
# DBG:xnvme_be.c:xnvme_be_dev_idfy-538: INFO: no positive response to idfy(FS)
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-546: INFO: not csi-specific id-NVM
# DBG:xnvme_be.c:xnvme_be_dev_idfy-547: INFO: falling back to NVM assumption
# DBG:xnvme_be.c:xnvme_be_factory-672: INFO: obtained device handle
# DBG:xnvme_be.c:xnvme_be_factory-667: INFO: obtained backend instance be: 'spdk'
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-222: INFO: already initialized
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-258: INFO: spdk_env_is_initialized: 1
# DBG:xnvme_be_spdk_dev.c:_cref_lookup-53: LIADOZ: lookup ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-757: INFO: found dev->ident.uri: '127.0.0.1:4420' via cref_lookup()
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-777: INFO: re-using previously attached controller
# DBG:xnvme_be_spdk_dev.c:_cref_insert-79: LIADOZ: inserting ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-838: INFO: open() : OK
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-475: INFO: !xnvme_adm_idfy_ctrlr_csi(CSI_ZONED)
# DBG:xnvme_be.c:xnvme_be_dev_idfy-499: INFO: no positive response to idfy(ZNS)
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-510: INFO: !xnvme_adm_idfy_ctrlr_csi(CSI_FS)
# DBG:xnvme_be.c:xnvme_be_dev_idfy-538: INFO: no positive response to idfy(FS)
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-546: INFO: not csi-specific id-NVM
# DBG:xnvme_be.c:xnvme_be_dev_idfy-547: INFO: falling back to NVM assumption
# DBG:xnvme_be.c:xnvme_be_factory-672: INFO: obtained device handle
# DBG:xnvme_be_spdk_dev.c:_cref_deref-111: INFO: refcount: 0 => detaching
# DBG:xnvme_be_spdk_dev.c:_cref_deref-112: LIADOZ: detaching ctrlr 0x5575e1590860
# DBG:xnvme_be.c:xnvme_be_factory-667: INFO: obtained backend instance be: 'spdk'
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-222: INFO: already initialized
# DBG:xnvme_be_spdk_dev.c:_spdk_env_init-258: INFO: spdk_env_is_initialized: 1
# DBG:xnvme_be_spdk_dev.c:_cref_lookup-53: LIADOZ: lookup ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-757: INFO: found dev->ident.uri: '127.0.0.1:4420' via cref_lookup()
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-768: FAILED: !spdk_nvme_ns_is_active(nsid:0x3)
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: PCIe, # 1 of 4
EAL: Cannot find device (0127:00:00.1)
EAL: Failed to attach device on primary process
[2022-06-09 13:49:54.436241] nvme.c: 838:nvme_probe_internal: *ERROR*: NVMe ctrlr scan failed
[2022-06-09 13:49:54.436261] nvme.c: 915:spdk_nvme_probe: *ERROR*: Create probe context failed
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-808: FAILED: probe a:0, e:-1, i:0
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: TCP, # 2 of 4
[2022-06-09 13:49:54.907427] nvme_ctrlr.c: 704:nvme_ctrlr_set_intel_support_log_pages: *WARNING*: [nqn.2016-06.io.spdk:cnode1] Intel log pages not supported on Intel drive!
# DBG:xnvme_be_spdk_dev.c:attach_cb-451: INFO: nsid: 3
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: RDMA, # 3 of 4
[2022-06-09 13:49:54.907567] nvme.c: 830:nvme_probe_internal: *ERROR*: NVMe trtype 1 not available
[2022-06-09 13:49:54.907593] nvme.c: 915:spdk_nvme_probe: *ERROR*: Create probe context failed
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-808: FAILED: probe a:1, e:-1, i:0
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-796: ############################
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-797: INFO: trtype: FC, # 4 of 4
# DBG:xnvme_be_spdk_dev.c:_xnvme_be_spdk_ident_to_trid-320: FAILED: unsupported trtype: FC
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-802: SKIP/FAILED: ident_to_trid()
# DBG:xnvme_be_spdk_dev.c:_cref_insert-79: LIADOZ: inserting ctrlr 0x5575e158cc60
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-838: INFO: open() : OK
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-475: INFO: !xnvme_adm_idfy_ctrlr_csi(CSI_ZONED)
# DBG:xnvme_be.c:xnvme_be_dev_idfy-499: INFO: no positive response to idfy(ZNS)
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-510: INFO: !xnvme_adm_idfy_ctrlr_csi(CSI_FS)
# DBG:xnvme_be.c:xnvme_be_dev_idfy-538: INFO: no positive response to idfy(FS)
# DBG:xnvme_be_spdk_admin.c:xnvme_be_spdk_sync_cmd_admin-66: FAILED: xnvme_cmd_ctx_cpl_status()
# DBG:xnvme_be.c:xnvme_be_dev_idfy-546: INFO: not csi-specific id-NVM
# DBG:xnvme_be.c:xnvme_be_dev_idfy-547: INFO: falling back to NVM assumption
# DBG:xnvme_be.c:xnvme_be_factory-672: INFO: obtained device handle
# DBG:xnvme_be_spdk_dev.c:_cref_deref-111: INFO: refcount: 0 => detaching
# DBG:xnvme_be_spdk_dev.c:_cref_deref-112: LIADOZ: detaching ctrlr 0x5575e158cc60
trid:
  trstring: 'TCP'
  trtype: 0x3
  adrfam: 0x1
  traddr: '127.0.0.1'
  trsvcid: '4420'
  subnqn: 'nqn.2016-06.io.spdk:cnode1'
  priority: 0x0
trid:
  trstring: 'TCP'
  trtype: 0x3
  adrfam: 0x1
  traddr: '127.0.0.1'
  trsvcid: '4420'
  subnqn: 'nqn.2016-06.io.spdk:cnode1'
  priority: 0x0

Here are the important bits:

# DBG:xnvme_be_spdk_dev.c:_cref_insert-79: LIADOZ: inserting ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:_cref_lookup-53: LIADOZ: lookup ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:_cref_insert-79: LIADOZ: inserting ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:_cref_deref-112: LIADOZ: detaching ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:_cref_lookup-53: LIADOZ: lookup ctrlr 0x5575e1590860
# DBG:xnvme_be_spdk_dev.c:xnvme_be_spdk_state_init-768: FAILED: !spdk_nvme_ns_is_active(nsid:0x3)
# DBG:xnvme_be_spdk_dev.c:_cref_insert-79: LIADOZ: inserting ctrlr 0x5575e158cc60
# DBG:xnvme_be_spdk_dev.c:_cref_deref-112: LIADOZ: detaching ctrlr 0x5575e158cc60

As you can see the controller can still be fetched after it was detached, in this example no harm is done since spdk_nvme_ns_is_active fails but it can succeed since the controller is freed and any data can be inside of it.

@LiadOz
Copy link
Contributor Author

LiadOz commented Jun 13, 2022

It seems to me that xnvme_be_spdk_state_init is using the controller lookup wrong, each time a controller is reused it both updates the reference count (_cref_lookup) and creates a new entry (with _cref_insert).
We should either not insert a new entry or dereference the controller properly. @safl what do you think?

@LiadOz
Copy link
Contributor Author

LiadOz commented Jun 22, 2022

#114 seems to fail due to a similar issue, in that pull request I added a function that gets called with a detached controller and it crashes.
The test xnvme_tests_enum is run with 2 devices, it first gets the devices connected then opens each. In enumerate_attach_cb there is a _cref_insert for each namespace (refcount 2) then at the end of the function spdk_nvme_detach is called and frees the controller (the function _cref_deref is also called but it doesn't have affect becuase the refcount changes to 1).
When trying to open each device it will use _cref_lookup and find the previously attached controller (which was detached) which will cause a crash.

@safl
Copy link
Member

safl commented Jul 8, 2022

Hi @LiadOz,
The integration of changes on: #126 should also fix this, can you give it try?
Thanks!

@safl safl added the bug Something isn't working label Oct 19, 2022
@karlowich
Copy link
Collaborator

Hi @LiadOz, this should be fixed in xNVMe v0.5.0.
Can you try it out and tell me if it resolves your issue?
Thanks!

@LiadOz
Copy link
Contributor Author

LiadOz commented Oct 20, 2022

This issue seems doesn't happen to me on 0.5.0

@karlowich
Copy link
Collaborator

Great then I am closing this issue.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

Successfully merging a pull request may close this issue.

3 participants