nvidia_p2p_get_pages(): Fix double-free in register-callback error path #557

Status: Open (1 commit, base: main)


BrendanCunningham

A double-free in the rm_p2p_register_callback() error path of nv_p2p_get_pages() causes memory corruption that leads to a kernel panic.

Fix this by adding a separate goto label for this error path that skips freeing the already-freed memory.
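
The separate-label approach can be illustrated with a small userspace sketch. It is not the driver's code: `struct resource`, `release()`, `register_callback()`, the `fail_mode` enum, and the `get_pages_*()` functions are hypothetical stand-ins, and frees are counted instead of performed so the buggy variant can be observed safely. The sketch assumes the register step releases `info` itself when it fails, which is the kind of ownership hand-off that makes a shared cleanup label free the same object twice.

```c
#include <stddef.h>

/* Hypothetical tracked object: counts how many times it was "freed". */
struct resource {
    int frees;
};

/* Stand-in for kfree(); only records the call so a double free is visible. */
static void release(struct resource *r)
{
    r->frees++;
}

/* Which step of the hypothetical get-pages sequence should fail. */
enum fail_mode { FAIL_NONE, FAIL_LOOKUP, FAIL_REGISTER };

/*
 * Stand-in for the register-callback step. ASSUMPTION: on failure it
 * releases `info` itself before returning an error.
 */
static int register_callback(struct resource *info, enum fail_mode mode)
{
    if (mode == FAIL_REGISTER) {
        release(info);
        return -1;
    }
    return 0;
}

/* Buggy shape: every error path shares one label that frees everything. */
static int get_pages_buggy(struct resource *table, struct resource *info,
                           enum fail_mode mode)
{
    int rc = 0;

    if (mode == FAIL_LOOKUP) {
        rc = -1;
        goto failed;
    }

    rc = register_callback(info, mode);
    if (rc != 0)
        goto failed;            /* info was already released by the callee */

    return 0;

failed:
    release(info);              /* second release on the register error path */
    release(table);
    return rc;
}

/* Fixed shape: the register error path jumps past the release of info. */
static int get_pages_fixed(struct resource *table, struct resource *info,
                           enum fail_mode mode)
{
    int rc = 0;

    if (mode == FAIL_LOOKUP) {
        rc = -1;
        goto failed;
    }

    rc = register_callback(info, mode);
    if (rc != 0)
        goto register_failed;   /* skip the release of info below */

    return 0;

failed:
    release(info);
register_failed:
    release(table);
    return rc;
}
```

With `FAIL_REGISTER`, `get_pages_buggy()` leaves `info.frees` at 2 while `get_pages_fixed()` leaves it at 1. The `FAIL_LOOKUP` path still releases both objects exactly once in either variant, because the labels fall through in order.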

The double-free can be reproduced by calling nvidia_p2p_get_pages() on one CPU while simultaneously freeing the GPU virtual address range passed into nvidia_p2p_get_pages() on another CPU. Reproducing it is timing-dependent and may require multiple attempts.

Booting with the 'slub_debug=FZ' kernel parameter (SLUB sanity checks plus red zoning) exposes the double-free:

[ 239.115091] =============================================================================
[ 239.124659] BUG kmalloc-16 (Tainted: G OE ): Object already free
[ 239.133011] -----------------------------------------------------------------------------

[ 239.144491] Slab 0xfffffa8bc4434140 objects=85 used=82 fp=0xffff9a3dd0d05910 flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 239.158997] Object 0xffff9a3dd0d05670 @offset=1648 fp=0x0000000000000000

[ 239.168766] Redzone ffff9a3dd0d05660: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 239.179633] Object ffff9a3dd0d05670: 10 00 00 00 00 00 00 00 e5 04 3f 13 96 18 8e 47 ..........?....G
[ 239.190641] Redzone ffff9a3dd0d05680: bb bb bb bb bb bb bb bb ........
[ 239.200739] Padding ffff9a3dd0d05688: 84 80 0e 00 00 00 00 00 ........
[ 239.210938] CPU: 0 PID: 3150 Comm: hfi-sdma-test Kdump: loaded Tainted: G OE 6.5.0-rc1+ #1
[ 239.221911] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
[ 239.233948] Call Trace:
[ 239.236992] <TASK>
[ 239.239608] dump_stack_lvl+0x33/0x50
[ 239.244010] object_err+0x3a/0x80
[ 239.248014] free_debug_processing+0x265/0x360
[ 239.253392] ? nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.259399] free_to_partial_list+0x80/0x280
[ 239.264478] ? nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.270426] nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.276303] ? __pfx_remove_nvidia_pages+0x10/0x10 [hfi1]
[ 239.282692] nvidia_p2p_get_pages+0x25/0x40 [nvidia]
[ 239.288601] ? __pfx_remove_nvidia_pages+0x10/0x10 [hfi1]
...
[ 239.498990] </TASK>
[ 239.501662] Disabling lock debugging due to kernel taint
[ 239.507828] FIX kmalloc-16: Object at 0xffff9a3dd0d05670 not freed

Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
aritger (Collaborator) commented Oct 3, 2023

Thanks for identifying this and the proposed fix, @BrendanCunningham. A variation of this fix will be included in a future release. I'll leave this open until then. Thanks.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
