tikv panic repeatedly after this tikv recover from io hang #15292

Open
Lily2025 opened this issue Aug 8, 2023 · 8 comments · May be fixed by #16992

Comments


Lily2025 commented Aug 8, 2023

Bug Report

What version of TiKV are you using?

[2023/08/07 18:32:41.764 +08:00] [INFO] [lib.rs:88] ["Welcome to TiKV"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Release Version: 7.4.0-alpha"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Edition: Community"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Git Commit Hash: 1e47e9a"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Git Commit Branch: heads/refs/tags/v7.4.0-alpha"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["UTC Build Time: Unknown (env var does not exist when building)"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Enable Features: pprof-fp jemallo

What operating system and CPU are you using?

8 cores / 32 GB memory

Steps to reproduce

1. Run TPC-C with 1000 warehouses and 10 threads.
2. Inject an I/O hang on one TiKV node, lasting 20 minutes.

What did you expect?

no panic

What happened?

TiKV panics repeatedly after it recovers from the I/O hang.

[2023/08/07 18:32:36.398 +08:00] [FATAL] [lib.rs:510] ["[region 11581] 11584 applying snapshot failed"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:509:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2032:9\n std::panicking::rust_panic_with_hook\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:692:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:579:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:137:18\n 4: rust_begin_unwind\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:575:5\n 5: core::panicking::panic_fmt\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:65:14\n 6: raftstore::store::peer_storage::PeerStorage<EK,ER>::check_applying_snap\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer_storage.rs:797:21\n 7: raftstore::store::peer::Peer<EK,ER>::check_snap_status\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:2305:15\n raftstore::store::peer::Peer<EK,ER>::handle_raft_ready_append\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:2425:13\n 8: raftstore::store::fsm::peer::PeerFsmDelegate<EK,ER,T>::collect_ready\n at 
home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:1998:19\n <raftstore::store::fsm::store::RaftPoller<EK,ER,T> as batch_system::batch::PollHandler<raftstore::store::fsm::peer::PeerFsm<EK,ER>,raftstore::store::fsm::store::StoreFsm>>::handle_normal\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:1023:13\n 9: batch_system::batch::Poller<N,C,Handler>::poll\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:380:27\n 10: batch_system::batch::BatchSystem<N,C>::start_poller::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:550:17\n <std::thread::Builder as tikv_util::sys::thread::StdThreadBuildWrapper>::spawn_wrapper::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/sys/thread.rs:438:13\n std::sys_common::backtrace::rust_begin_short_backtrace\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121:18\n 11: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:551:17\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:483:40\n std::panicking::try\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:447:19\n std::panic::catch_unwind\n at 
rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:550:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513:5\n 12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n std::sys::unix::thread::Thread::new::thread_start\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108:17\n 13: start_thread\n 14: __clone\n"] [location=/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer_storage.rs:797] [thread_name=raftstore-18-0]
[2023/08/07 18:32:41.764 +08:00] [INFO] [lib.rs:88] ["Welcome to TiKV"]

Lily2025 added the type/bug label Aug 8, 2023

Lily2025 commented Aug 8, 2023

/type bug
/severity critical


Lily2025 commented Aug 8, 2023

/remove-severity critical
/severity major

@tonyxuqqi
Contributor

Can the snapshot file be validated immediately after it is received? If it is invalid, we could treat it the same way as a failed receive. @overvenus
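A minimal sketch of what such a post-receive check could look like, assuming the expected checksum is a plain CRC-32 (IEEE) over the whole file. The names `validate_received_file` and `ReceiveCheck` are hypothetical and are not TiKV APIs; the real snapshot code may compute and store checksums differently.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Feeds `data` into a running CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// The caller starts with `!0` and inverts the final value.
fn crc32_update(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// CRC-32 of a complete byte slice.
fn crc32(data: &[u8]) -> u32 {
    !crc32_update(!0, data)
}

/// Hypothetical result of checking a file right after it was received.
#[derive(Debug, PartialEq)]
enum ReceiveCheck {
    Valid,
    ChecksumMismatch { actual: u32, expected: u32 },
}

/// Streams the file through CRC-32 and compares it with the expected value.
/// A mismatch is reported to the caller instead of being discovered later
/// at apply time.
fn validate_received_file(path: &Path, expected: u32) -> io::Result<ReceiveCheck> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 8192];
    let mut crc = !0u32;
    loop {
        let n = match file.read(&mut buf) {
            Ok(0) => break,
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        crc = crc32_update(crc, &buf[..n]);
    }
    let actual = !crc;
    Ok(if actual == expected {
        ReceiveCheck::Valid
    } else {
        ReceiveCheck::ChecksumMismatch { actual, expected }
    })
}
```

In this sketch the receiver would handle `ChecksumMismatch` exactly like an I/O error during transfer: discard the file and report the receive as failed, so the peer never enters the applying state with a bad file.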

@AndreMouche
Member

It seems that this issue may occur when TiKV restarts abnormally. The root cause seems to be as follows:

After TiKV restarts, some region peers are in PeerState::Applying and start applying their snapshots, only to find that the corresponding snapshot has already been cleaned up (because TiKV received a GC snapshot message before restarting). This triggers the panic.
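The scenario described above can be modeled as a small decision table. This is only an illustration of the hypothesis in this comment, not TiKV code: the `PeerState` enum is reduced to three variants, and `RestartOutcome` and `outcome_after_restart` are hypothetical names.

```rust
/// Simplified persisted peer state (the real enum has more variants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PeerState {
    Normal,
    Applying,
    Tombstone,
}

/// Hypothetical summary of what happens to a peer once TiKV is back up.
#[derive(Debug, PartialEq)]
enum RestartOutcome {
    /// Nothing snapshot-related to do.
    Continue,
    /// The persisted state says "applying" and the snapshot is still on disk.
    ResumeApply,
    /// The persisted state says "applying" but the snapshot is gone.
    Panic,
}

/// Models the described behavior: a peer persisted as `Applying` whose
/// snapshot file was already garbage-collected cannot make progress.
fn outcome_after_restart(state: PeerState, snapshot_on_disk: bool) -> RestartOutcome {
    match (state, snapshot_on_disk) {
        (PeerState::Applying, true) => RestartOutcome::ResumeApply,
        (PeerState::Applying, false) => RestartOutcome::Panic,
        (PeerState::Normal, _) | (PeerState::Tombstone, _) => RestartOutcome::Continue,
    }
}
```

Because the `Applying` state is persisted while the snapshot file is not guaranteed to be, the `(Applying, false)` row repeats on every restart, which would match the repeated panics in the report.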

Workaround:
On this TiKV, forcibly set the region named in the panic log to tombstone: https://docs.pingcap.com/tidb/v6.5/tikv-control#set-a-region-to-tombstone

tikv-ctl --data-dir /path/to/tikv tombstone -p 127.0.0.1:2379 -r <region_id>,<region_id> --force 
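To find the region IDs to pass to `-r`, the panic messages can be scanned for the `[region <id>] <id> applying snapshot failed` pattern seen in the FATAL log above. The helper below is a hypothetical sketch: it assumes the second number is the peer ID and that the message format stays as shown in this issue.

```rust
use std::collections::BTreeSet;

/// Extracts `(region_id, peer_id)` from a log line containing a message like
/// `[region 11581] 11584 applying snapshot failed`. Returns `None` for any
/// line that does not match.
fn parse_failed_snapshot_region(line: &str) -> Option<(u64, u64)> {
    const PREFIX: &str = "[region ";
    const SUFFIX: &str = " applying snapshot failed";
    let start = line.find(PREFIX)? + PREFIX.len();
    let rest = &line[start..];
    let close = rest.find(']')?;
    let region_id = rest[..close].trim().parse().ok()?;
    let after = rest[close + 1..].trim_start();
    let end = after.find(SUFFIX)?;
    let peer_id = after[..end].trim().parse().ok()?;
    Some((region_id, peer_id))
}

/// Builds the comma-separated, de-duplicated region list for the `-r` flag.
fn tombstone_region_arg<'a>(lines: impl IntoIterator<Item = &'a str>) -> String {
    let ids: BTreeSet<u64> = lines
        .into_iter()
        .filter_map(parse_failed_snapshot_region)
        .map(|(region_id, _peer_id)| region_id)
        .collect();
    ids.iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(",")
}
```

Feeding the lines of the TiKV log into `tombstone_region_arg` yields a string that can be substituted for `<region_id>,<region_id>` in the command above. Check the resulting list by hand before running a forced tombstone.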

cc @overvenus

@AndreMouche
Member

AndreMouche commented Mar 13, 2024

Meanwhile, I believe that this PR (https://github.com/tikv/tikv/pull/11782) does not fundamentally solve the problem. We also need to handle the case where TiKV restarts abnormally.

@tonyxuqqi
Contributor

@AndreMouche I think this case is not about the file being deleted but about the file being corrupted. If the snapshot has not been applied, it should not be deleted.
If the snapshot file is corrupted, there is currently no workaround other than tombstoning that region.

@Lily2025 Could you upload more logs about the error?
Is the snapshot available but corrupted?

@tonyxuqqi
Contributor

No logs are available for troubleshooting. Closing this for now.

tonyxuqqi reopened this Mar 28, 2024
@tonyxuqqi
Contributor

[2023/08/07 18:32:35.900 +08:00] [ERROR] [region.rs:540] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other("[components/raftstore/src/store/snap.rs:1122]: \"[components/raftstore/src/store/snap.rs:277]: invalid checksum 1869881899 for snapshot cf file /var/lib/tikv/data/snap/rev_11581_6_2548383_write.sst, expected 2347348753\"")"]
