Infinite assertion failures and/or hang with metadata_csum=1 and replica_metadata=0 #30

Closed
stevenjswanson opened this issue Jul 5, 2017 · 6 comments

Comments

@stevenjswanson
Member

stevenjswanson commented Jul 5, 2017

With these configuration options, I get a whole-system hang, sometimes with and sometimes without a stream of assertion failures in dmesg.

Seen on 84d3e6a, but it has been present since at least 10142f3.

This happens with

measure_timing=0
inplace_data_updates=0
wprotect=0 
mmap_cow=1      
unsafe_metadata=0     
replica_metadata=0 
metadata_csum=1 
dram_struct_csum=1      
data_csum=1 
data_parity=1

but not

measure_timing=0
inplace_data_updates=0
wprotect=0
mmap_cow=1
unsafe_metadata=0
replica_metadata=0 
metadata_csum=0 ***
dram_struct_csum=1 
data_csum=1 
data_parity=1

# sudo umount /mnt/ramdisk
# sudo rmmod nova
# sudo modprobe nova measure_timing=0 inplace_data_updates=0 wprotect=0 mmap_cow=1 unsafe_metadata=0 replica_metadata=0 metadata_csum=1 dram_struct_csum=1 data_csum=1 data_parity=1
# sudo mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk
# echo 1 | sudo tee /proc/fs/NOVA/pmem0/create_snapshot
# cat /proc/fs/NOVA/pmem0/snapshots
<hangs->
# dmesg
[ 4064.180232] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181171] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181208] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181229] assertion failed fs/nova/rebuild.c:673: 0
[ 4064.181348] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181386] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181442] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181457] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
[ 4064.181508] assertion failed fs/nova/rebuild.c:673: 0
[ 4064.181541] assertion failed fs/nova/rebuild.c:673: 0
[ 4064.181556] nova: nova_rebuild_dir_inode_tree: unknown type 195, 0x3eee000
<...forever...>
@luzh
Contributor

luzh commented Jul 8, 2017

metadata_csum shouldn't be used without replica_metadata. We rely on the tick-tock scheme to maintain consistency, because some metadata structures need at least two fields updated together: the field being changed plus its checksum. Some code paths may therefore not handle this configuration well. I think we can fix it at mount time by setting replica_metadata = 1 whenever metadata_csum = 1.
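
For illustration only, the mount-time fix proposed above could look roughly like the sketch below. It is plain userspace C rather than NOVA code: `struct nova_opts_sketch` and `nova_opts_normalize` are hypothetical names, and the real module keeps these settings as module parameters and may handle them differently.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for the two options discussed in this issue. */
struct nova_opts_sketch {
    int metadata_csum;     /* 1 = checksum metadata structures */
    int replica_metadata;  /* 1 = keep a replica of metadata   */
};

/*
 * Assumed policy: metadata checksums require metadata replication.
 * If checksums are on and replication is off, turn replication on.
 * Returns true when an option had to be changed.
 */
bool nova_opts_normalize(struct nova_opts_sketch *opts)
{
    if (opts->metadata_csum && !opts->replica_metadata) {
        fprintf(stderr,
                "sketch: metadata_csum=1 needs replica_metadata=1, enabling it\n");
        opts->replica_metadata = 1;
        return true;
    }
    return false;
}
```

Called once before any metadata is touched, a check like this would turn the failing configuration from this report (metadata_csum=1, replica_metadata=0) into a supported one, and would leave the other three combinations unchanged.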

@Andiry
Contributor

Andiry commented Jul 8, 2017 via email

@stevenjswanson
Member Author

stevenjswanson commented Jul 8, 2017 via email

@luzh
Contributor

luzh commented Jul 8, 2017

A quick fix just to make them equal: a5f89a2

@stevenjswanson
Member Author

stevenjswanson commented Jul 8, 2017 via email

@luzh
Contributor

luzh commented Jul 11, 2017

Moved to #35

@luzh luzh closed this as completed Jul 11, 2017
juno-kim pushed a commit that referenced this issue Oct 15, 2018
The crash dump shows the following backtrace:

crash> bt
PID: 0      TASK: ffffffffbe412480  CPU: 0   COMMAND: "swapper/0"
 #0 [ffff891ee0003868] machine_kexec at ffffffffbd063ef1
 #1 [ffff891ee00038c8] __crash_kexec at ffffffffbd12b6f2
 #2 [ffff891ee0003998] crash_kexec at ffffffffbd12c84c
 #3 [ffff891ee00039b8] oops_end at ffffffffbd030f0a
 #4 [ffff891ee00039e0] no_context at ffffffffbd074643
 #5 [ffff891ee0003a40] __bad_area_nosemaphore at ffffffffbd07496e
 #6 [ffff891ee0003a90] bad_area_nosemaphore at ffffffffbd074a64
 #7 [ffff891ee0003aa0] __do_page_fault at ffffffffbd074b0a
 #8 [ffff891ee0003b18] do_page_fault at ffffffffbd074fc8
 #9 [ffff891ee0003b50] page_fault at ffffffffbda01925
    [exception RIP: qlt_schedule_sess_for_deletion+15]
    RIP: ffffffffc02e526f  RSP: ffff891ee0003c08  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: ffffffffc0307847
    RDX: 00000000000020e6  RSI: ffff891edbc377c8  RDI: 0000000000000000
    RBP: ffff891ee0003c18   R8: ffffffffc02f0b20   R9: 0000000000000250
    R10: 0000000000000258  R11: 000000000000b780  R12: ffff891ed9b43000
    R13: 00000000000000f0  R14: 0000000000000006  R15: ffff891edbc377c8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #10 [ffff891ee0003c20] qla2x00_fcport_event_handler at ffffffffc02853d3 [qla2xxx]
 #11 [ffff891ee0003cf0] __dta_qla24xx_async_gnl_sp_done_333 at ffffffffc0285a1d [qla2xxx]
 #12 [ffff891ee0003de8] qla24xx_process_response_queue at ffffffffc02a2eb5 [qla2xxx]
 #13 [ffff891ee0003e88] qla24xx_msix_rsp_q at ffffffffc02a5403 [qla2xxx]
 #14 [ffff891ee0003ec0] __handle_irq_event_percpu at ffffffffbd0f4c59
 #15 [ffff891ee0003f10] handle_irq_event_percpu at ffffffffbd0f4e02
 #16 [ffff891ee0003f40] handle_irq_event at ffffffffbd0f4e90
 #17 [ffff891ee0003f68] handle_edge_irq at ffffffffbd0f8984
 #18 [ffff891ee0003f88] handle_irq at ffffffffbd0305d5
 #19 [ffff891ee0003fb8] do_IRQ at ffffffffbda02a18
 --- <IRQ stack> ---
 #20 [ffffffffbe403d30] ret_from_intr at ffffffffbda0094e
    [exception RIP: unknown or invalid address]
    RIP: 000000000000001f  RSP: 0000000000000000  RFLAGS: fff3b8c2091ebb3f
    RAX: ffffbba5a0000200  RBX: 0000be8cdfa8f9fa  RCX: 0000000000000018
    RDX: 0000000000000101  RSI: 000000000000015d  RDI: 0000000000000193
    RBP: 0000000000000083   R8: ffffffffbe403e38   R9: 0000000000000002
    R10: 0000000000000000  R11: ffffffffbe56b820  R12: ffff891ee001cf00
    R13: ffffffffbd11c0a4  R14: ffffffffbe403d60  R15: 0000000000000001
    ORIG_RAX: ffff891ee0022ac0  CS: 0000  SS: ffffffffffffffb9
 bt: WARNING: possibly bogus exception frame
 #21 [ffffffffbe403dd8] cpuidle_enter_state at ffffffffbd67c6fd
 #22 [ffffffffbe403e40] cpuidle_enter at ffffffffbd67c907
 #23 [ffffffffbe403e50] call_cpuidle at ffffffffbd0d98f3
 #24 [ffffffffbe403e60] do_idle at ffffffffbd0d9b42
 #25 [ffffffffbe403e98] cpu_startup_entry at ffffffffbd0d9da3
 #26 [ffffffffbe403ec0] rest_init at ffffffffbd81d4aa
 #27 [ffffffffbe403ed0] start_kernel at ffffffffbe67d2ca
 #28 [ffffffffbe403f28] x86_64_start_reservations at ffffffffbe67c675
 #29 [ffffffffbe403f38] x86_64_start_kernel at ffffffffbe67c6eb
 #30 [ffffffffbe403f50] secondary_startup_64 at ffffffffbd0000d5

Fixes: 040036b ("scsi: qla2xxx: Delay loop id allocation at login")
Cc: <stable@vger.kernel.org> # v4.17+
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
yh-raphael pushed a commit to yh-raphael/linux-nova that referenced this issue Aug 25, 2021
3 participants