Kernel panic while unloading v23.11 oct_ep_phc driver module from debian kernel #2

Open
cbf123 opened this issue Feb 8, 2024 · 5 comments

cbf123 commented Feb 8, 2024

We have recently seen a repeatable issue where the v23.11 oct_ep_phc driver causes a kernel panic when it is being unloaded. The versioning looks a bit odd, but this is the realtime 5.10.205 kernel imported via the Yocto project. The panic log is below; I've also attached the kernel config file in case it's useful.

[13319.084315] octeon_ep_vf: Unloading ...
[13319.084370] octeon_ep_vf: Unloading complete
[13320.222135] octeon_ep: Unloading ...
[13320.222152] octeon_ep 0000:51:00.0: Removing device.
[13320.238273] octeon_ep 0000:51:00.0 enp81s0f0: Stopping the device ...
[13320.298436] octeon_ep 0000:51:00.0 enp81s0f0: IRQs freed
[13320.298850] octeon_ep 0000:51:00.0: Disabled MSI-X
[13320.298856] octeon_ep 0000:51:00.0 enp81s0f0: Freed IOQ Vectors
[13320.299849] octeon_ep 0000:51:00.0 enp81s0f0: Device stopped !!
[13320.338150] octeon_ep 0000:51:00.0: Cleaning up Octeon Device ...
[13320.338153] octeon_ep 0000:51:00.0: Sending dev_unload msg to fw
[13320.338157] Octep ctrl mbox : Uninit successful.
[13320.338159] octeon_ep 0000:51:00.0: CNXKXX: Doing soft reset
[13320.348305] octeon_ep: Unloading complete
[13372.984738] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]: Stopping octeon device
[13372.984785] BUG: unable to handle page fault for address: 0000000000483eb0
[13376.023616] #PF: supervisor write access in kernel mode
[13376.028843] #PF: error_code(0x0002) - not-present page
[13376.033981] PGD 1ee41d067 P4D 0
[13376.037214] Oops: 0002 [#1] PREEMPT_RT SMP NOPTI
[13376.041833] CPU: 47 PID: 100726 Comm: modprobe Kdump: loaded Tainted: G S O 5.10.0-6-rt-amd64 #1 Debian 5.10.205-1.stx.76
[13376.053906] Hardware name: Dell Inc. PowerEdge XR11/07GGDG, BIOS 1.10.2 03/03/2023
[13376.061472] RIP: 0010:native_queued_spin_lock_slowpath+0x19f/0x1e0
[13376.067651] Code: ff ff ff c6 47 01 00 e9 1d ff ff ff c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 c0 01 03 00 48 03 04 f5 e0 36 9d 86 <48> ...
[13376.086397] RSP: 0018:ff55572f07ccbd18 EFLAGS: 00010002
[13376.091623] RAX: 0000000000483eb0 RBX: ff1b649cb058a818 RCX: 0000000000c00000
[13376.098756] RDX: ff1b64baff1f01c0 RSI: 0000000000000c5b RDI: ff1b649cb058a818
[13376.105888] RBP: ff1b649cb058a818 R08: 0000000000c00000 R09: ff1b649c80033e40
[13376.113022] R10: 0000000000000247 R11: fffffffffff9ca84 R12: 0000000000000246
[13376.120155] R13: 0000000000000000 R14: 0000000000000001 R15: ffffffffc0e97210
[13376.127287] FS: 00007facb457f540(0000) GS:ff1b64baff1c0000(0000) knlGS:0000000000000000
[13376.135371] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13376.141118] CR2: 0000000000483eb0 CR3: 00000001d2656006 CR4: 0000000000771ee0
[13376.148253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13376.155386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[13376.162518] PKRU: 55555554
[13376.165229] Call Trace:
[13376.167686] ? __die+0x5d/0xa0
[13376.170742] ? no_context+0x189/0x3a0
[13376.174409] ? exc_page_fault+0x2a1/0x530
[13376.178421] ? dequeue_entity+0xc6/0x4a0
[13376.182348] ? newidle_balance+0x3fd/0x480
[13376.186448] ? asm_exc_page_fault+0x1e/0x30
[13376.190633] ? native_queued_spin_lock_slowpath+0x19f/0x1e0
[13376.196204] _raw_spin_lock_irqsave+0x30/0x40
[13376.200563] rt_spin_lock_slowlock+0x40/0x80
[13376.204837] ? wait_for_completion+0xa0/0xe0
[13376.209109] rt_spin_lock+0x2a/0x40
[13376.212604] __wake_up_common_lock+0x60/0xb0
[13376.216876] ptp_clock_unregister+0x2b/0x70
[13376.221062] octeon_ep_phc_remove+0xcc/0x134 [oct_ep_phc]
[13376.226461] ? pci_device_remove+0x38/0xa0
[13376.230560] ? __device_release_driver+0x17b/0x250
[13376.235353] ? driver_detach+0xc9/0x110
[13376.239192] ? bus_remove_driver+0x5b/0xe0
[13376.243293] ? pci_unregister_driver+0x2a/0xb0
[13376.247736] ? __do_sys_delete_module.constprop.0+0x171/0x2c0
[13376.253484] ? vtime_user_exit+0x1c/0x70
[13376.257410] ? __context_tracking_exit+0x56/0xe0
[13376.262028] ? do_syscall_64+0x30/0x40
[13376.265782] ? entry_SYSCALL_64_after_hwframe+0x62/0xc7
[13376.271008] Modules linked in: nft_counter xt_addrtype nft_compat nf_tables br_netfilter ...
[13376.271054] acpi_power_meter [last unloaded: octeon_ep]
[13376.363222] CR2: 0000000000483eb0
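
The "#PF: error_code(0x0002)" line in the log above can be decoded bit by bit. The following is a minimal userspace sketch, assuming the architectural x86 page-fault error code layout (bit 0: protection violation vs. not-present page, bit 1: write, bit 2: user mode). The struct and function names are hypothetical and are not taken from the kernel or the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural layout). */
#define PF_PROT  (1u << 0) /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1) /* 0: read access,      1: write access         */
#define PF_USER  (1u << 2) /* 0: supervisor mode,  1: user mode            */
#define PF_RSVD  (1u << 3) /* reserved bit set in a paging structure       */
#define PF_INSTR (1u << 4) /* fault was an instruction fetch               */

/* Hypothetical helper type: one flag per decoded bit. */
struct pf_info {
    bool protection_violation;
    bool write;
    bool user;
    bool reserved_bit;
    bool instruction_fetch;
};

/* Split a raw page-fault error code into its individual flags. */
static struct pf_info decode_pf_error_code(uint32_t code)
{
    struct pf_info info = {
        .protection_violation = (code & PF_PROT) != 0,
        .write                = (code & PF_WRITE) != 0,
        .user                 = (code & PF_USER) != 0,
        .reserved_bit         = (code & PF_RSVD) != 0,
        .instruction_fetch    = (code & PF_INSTR) != 0,
    };
    return info;
}
```

For 0x0002 only the write bit is set, which agrees with the two "#PF" lines in the log: a supervisor-mode write to a not-present page. The faulting address in CR2 (0000000000483eb0) also matches RAX and is far too small to be a kernel pointer, which is consistent with the lock code following a garbage pointer.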

config-5.10.0-6-rt-amd64.txt

@cbf123 cbf123 changed the title Kernel panic unloading the v23.11 VF module from debian kernel Kernel panic in v23.11 oct_ep_phc driver while unloading the VF module from debian kernel Feb 8, 2024
@cbf123 cbf123 changed the title Kernel panic in v23.11 oct_ep_phc driver while unloading the VF module from debian kernel Kernel panic in v23.11 oct_ep_phc driver while unloading the octeon_ep_vf module from debian kernel Feb 8, 2024
GaniHaseeb (Contributor) commented Feb 8, 2024 via email

cbf123 (Author) commented Feb 8, 2024

I didn't do the test personally, but the developer who did the testing says that the oct_ep_phc driver was also built from the v23.11 tagged release.

The changes in a0dd5f4 move the kfree() and octeon_free_device_mem() calls down into the "before_exit" clause, but all of that occurs after the call to ptp_clock_unregister() which is implicated in the panic. As such, I don't think that change would have any impact on the panic that was seen.

@cbf123 cbf123 changed the title Kernel panic in v23.11 oct_ep_phc driver while unloading the octeon_ep_vf module from debian kernel Kernel panic while unloading v23.11 oct_ep_phc driver module from debian kernel Feb 8, 2024
GaniHaseeb (Contributor) commented

Yes, you are right. The issue seems to be caused by a page fault while holding a spin_lock. One more point: I assume you see this issue only with the RT patch, right?

m-v-b commented Feb 10, 2024

@GaniHaseeb, sorry for the delayed response. I am the person that @cbf123 was referring to with respect to carrying out the driver's testing.

While the bug report is valid, it appears to be a corner case involving the incomplete initialization of the driver when an unsupported device ID is encountered. I apologize for not realizing this before @cbf123 reported this issue.

Please see the initialization logs of the driver below, from the same session that ended with the kernel crash whose logs @cbf123 provided above:

[   13.078037] systemd[1]: Starting Coldplug All udev Devices...
[   13.093017] systemd[1]: Started Journal Service.
[   13.101485] svcrdma: svcrdma is obsoleted, loading rpcrdma instead
[   13.102785] xprtrdma: xprtrdma is obsoleted, loading rpcrdma instead
[   13.108995] Octeon EP PHC 0000:51:00.1: OCT_PHC: Loading PHC driver
[   13.109291] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]:fw ready status 1
[   13.109349] Octeon EP PHC 0000:51:00.1: OCT_PHC: Unknown device found (subsystem_id: 741028)
[   13.109350] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]: Freeing PCI mapped regions for Bar0
[   13.109352] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]: Freeing PCI mapped regions for Bar1
[   13.109353] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]: Freeing PCI mapped regions for Bar2
[   13.109354] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]: Chip specific setup failed
[   13.109355] Octeon EP PHC 0000:51:00.1: OCT_PHC[0]: ERROR: Octeon driver failed to load.
[   13.117400] sctp: Hash tables configured (bind 512/585)
[   13.119575] VFIO - User Level meta-driver version: 0.3

Cross-referencing this with the source code, I observe that when the call to the "octeon_device_init" function fails, the "octeon_device_init_work" function returns (bails out) early, at the point where it logs the following message:

dev_info(&oct_dev->pci_dev->dev, "OCT_PHC[%d]: ERROR: Octeon driver failed to load.\n",

This in turn causes the following line to not be executed in such cases:

oct_dev->oct_ep_ptp_clock->ptp_clock = ptp_clock_register(&oct_dev->oct_ep_ptp_clock->caps, NULL);

Because (along with that line) the rest of the "octeon_device_init_work" function is not executed, parts of the octeon_device_t data structure will not be initialized/populated either.

This in turn causes the crash discussed in this issue, as during driver module removal time, the "ptp_clock_unregister" function called by "octeon_ep_phc_remove" will be working with uninitialized memory.
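
One way to picture a fix is a remove path that only tears down what the init path actually set up. The sketch below is a userspace model, not the driver's code: every type and function in it is a hypothetical stand-in, and it assumes the clock pointer is reliably NULL when registration never ran (for example, because the structure was zero-allocated), which the driver may not currently guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; NOT the driver's definitions. */
struct ptp_clock {
    int registered;
};

struct oct_ep_ptp_clock_model {
    struct ptp_clock *ptp_clock; /* assumed NULL if init bailed out early */
};

struct octeon_device_model {
    struct oct_ep_ptp_clock_model *oct_ep_ptp_clock;
};

/* Counts teardown calls so the guard's effect can be observed. */
static int unregister_calls;

/* Stand-in for ptp_clock_unregister(): it dereferences its argument,
 * so passing a bogus pointer would fault, as in the reported panic. */
static void model_ptp_clock_unregister(struct ptp_clock *ptp)
{
    ptp->registered = 0;
    unregister_calls++;
}

/* Guarded remove path: skip the unregister step when no clock was
 * registered. Returns true only if a clock was actually torn down. */
static bool model_remove(struct octeon_device_model *oct)
{
    if (oct == NULL || oct->oct_ep_ptp_clock == NULL)
        return false;
    if (oct->oct_ep_ptp_clock->ptp_clock == NULL)
        return false;

    model_ptp_clock_unregister(oct->oct_ep_ptp_clock->ptp_clock);
    oct->oct_ep_ptp_clock->ptp_clock = NULL; /* makes a second call harmless */
    return true;
}
```

In the kernel itself, ptp_clock_register() can return an error pointer on failure or NULL when PTP clock support is not built in, so a real guard would more likely use IS_ERR_OR_NULL() than a plain NULL comparison. An alternative design is for the probe path to fail outright when initialization fails, so that the remove callback never runs for a half-initialized device.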

Please note that I did not try using a standard (i.e., non-PREEMPT_RT) kernel, but the nature of the issue suggests that it is not specific to the PREEMPT_RT kernel.

I hope that this helps!

GaniHaseeb (Contributor) commented

Thanks for reporting the issue. We will fix the error path in an upcoming release.

starlingx-github pushed a commit to starlingx/kernel that referenced this issue Mar 6, 2024
This commit uprevisions the octeon_ep, octeon_ep_vf and oct_ep_phc
drivers from v23.04 to v23.11 to enable use cases that utilize the Dell
Open RAN Accelerator (DORA) card based on Marvell's Octeon
system-on-chip (SoC).

As the driver source code available on Sourceforge does not appear to be
kept up-to-date, the build system configuration files are updated to
acquire the driver source code from a Marvell-maintained git repository
on GitHub.

This commit also accommodates the minor differences between the
directory structures of the source code tar archive on Sourceforge and
the git repository on GitHub by modifying the debian/rules file.

We also block the automatic loading of the oct_ep_phc driver via a
modprobe.d configuration entry, for two reasons:

1) The oct_ep_phc driver does not appear to be needed by the major user
   whose use cases are enabled by this driver uprevision.
2) The oct_ep_phc driver triggers a kernel crash when being unloaded,
   due to an initialization error handling bug related to the DORA card,
   as reported at:
   MarvellEmbeddedProcessors/pcie_ep_octeon_host#2

The two patches applied to the driver package as part of StarlingX are
refreshed and adapted to apply cleanly onto the newer driver package
version acquired from GitHub.

The modprobe configuration file is renamed to octeon-ep.conf to adhere
to inclusive language guidelines.

Finally, the "debian/copyright" file is updated to adhere to Debian's
formatting guidelines published at [1], to update the name of the source
package, to note that most files are licensed under the GPL-2 license
and that the "apps" directory is licensed under the Apache-2.0 license.
Also, please note that the Makefile in the source code package acquired
from GitHub does not have a specific/different license, unlike the
package acquired from Sourceforge, so the special case for that file is
removed.

[1] https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/

== Additional Information ==

We would like to note that the DORA card is a bit special with respect
to its configuration interface. In summary, the octeon_ep physical
function (PF) driver instantiates a network interface managed by the
kernel, which acts as the configuration interface to the accelerator
card. The card sends DHCP discovery requests via the network interface.
If a DHCP server is listening on the network interface, then the card
acquires an IP address and a firmware download can be carried out via
the interface to fully initialize the accelerator card.

Unfortunately, we do not have access to the firmware images and the
software packages necessary to test the accelerator card end-to-end, so
our verification has been limited to ensuring that the DHCP discovery
requests are observed on the network interface created by the PF driver
after the driver is loaded.

The accelerator also has a serial console that can be attached to the
host via a USB-to-serial adapter, but our understanding is that the labs
we have been using for verification did not have this serial connection
set up.

== Verification ==

* An ISO image can be built with this commit applied to a repo project
  of a StarlingX-based distribution tracking StarlingX's master branch.

* The ISO image can be installed to a Dell XR11 server with a DORA card,
  and the system is successfully Ansible-bootstrapped.

* The octeon_ep, octeon_ep_vf and oct_ep_phc drivers are observed to not
  be automatically loaded.

* The octeon_ep driver can be loaded manually with modprobe, and a PF
  interface is instantiated by the kernel. Once the PF interface is
  brought up with the "ip" command, DHCP discovery packets are observed
  on the interface by running (as root):

  tcpdump -i <pf_iface> -nn -e 'udp port 67 or udp port 68'

* Virtual function (VF) interfaces can be instantiated by loading the
  octeon_ep_vf driver with modprobe and then writing (for example) the
  string "2" to the magic sysfs file at:
  /sys/class/net/<pf_iface>/device/sriov_numvfs

* The VF interfaces can be brought up with the "ip" command.

Story: 2010047
Task: 49651

Change-Id: I11965bf1be278030934b4b517860bc28683a6673
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
starlingx-github pushed a commit to starlingx/kernel that referenced this issue Mar 27, 2024
(cherry picked from commit b015aa8)
Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>