-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tikv panic repeatedly after this tikv recover from io hang #15292
Comments
/type bug |
/remove-severity critical |
Can the snapshot file be validated immediately after the reception? And if it's invalid, then behave just like the receive file failed @overvenus |
It seems that this issue may occur when TiKV restarts abnormally. The root cause seems as followers: After TiKV restarts, some region peer states are in workaround:
cc @overvenus |
Meanwhile, I believe that this PR(https://github.com/tikv/tikv/pull/11782) does not fundamentally solve the problem. We need to consider the situation of TiKV abnormal restart. |
@AndreMouche I think this case is not about file is deleted, but the file is corrupted? If the snapshot is not applied, it should not be deleted. @Lily2025 Could you upload more logs about the error? |
no log to troubleshotting. close it now. |
[2023/08/07 18:32:35.900 +08:00] [ERROR] [region.rs:540] ["failed to apply snap!!!"] [err_code=KV:Raftstore:SnapUnknown] [err="Other("[components/raftstore/src/store/snap.rs:1122]: \"[components/raftstore/src/store/snap.rs:277]: invalid checksum 1869881899 for snapshot cf file /var/lib/tikv/data/snap/rev_11581_6_2548383_write.sst, expected 2347348753\"")"] |
Bug Report
What version of TiKV are you using?
[2023/08/07 18:32:41.764 +08:00] [INFO] [lib.rs:88] ["Welcome to TiKV"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Release Version: 7.4.0-alpha"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Edition: Community"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Git Commit Hash: 1e47e9a"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Git Commit Branch: heads/refs/tags/v7.4.0-alpha"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["UTC Build Time: Unknown (env var does not exist when building)"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Rust Version: rustc 1.67.0-nightly (96ddd32c4 2022-11-14)"]
[2023/08/07 18:32:41.765 +08:00] [INFO] [lib.rs:93] ["Enable Features: pprof-fp jemallo
What operating system and CPU are you using?
8c/32g
Steps to reproduce
1、run tpcc with 1000 warehouse and 10 thread
2、inject one of tikv io hang last for 20m
What did you expect?
no panic
What did happened?
tikv panic repeatedly after this tikv recover from io hang
[2023/08/07 18:32:36.398 +08:00] [FATAL] [lib.rs:510] ["[region 11581] 11584 applying snapshot failed"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:509:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2032:9\n std::panicking::rust_panic_with_hook\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:692:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:579:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:137:18\n 4: rust_begin_unwind\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:575:5\n 5: core::panicking::panic_fmt\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:65:14\n 6: raftstore::store::peer_storage::PeerStorage<EK,ER>::check_applying_snap\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer_storage.rs:797:21\n 7: raftstore::store::peer::Peer<EK,ER>::check_snap_status\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:2305:15\n raftstore::store::peer::Peer<EK,ER>::handle_raft_ready_append\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer.rs:2425:13\n 8: raftstore::store::fsm::peer::PeerFsmDelegate<EK,ER,T>::collect_ready\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/peer.rs:1998:19\n <raftstore::store::fsm::store::RaftPoller<EK,ER,T> as batch_system::batch::PollHandler<raftstore::store::fsm::peer::PeerFsm<EK,ER>,raftstore::store::fsm::store::StoreFsm>>::handle_normal\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:1023:13\n 9: batch_system::batch::Poller<N,C,Handler>::poll\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:380:27\n 10: batch_system::batch::BatchSystem<N,C>::start_poller::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:550:17\n <std::thread::Builder as tikv_util::sys::thread::StdThreadBuildWrapper>::spawn_wrapper::{{closure}}\n at home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/sys/thread.rs:438:13\n std::sys_common::backtrace::rust_begin_short_backtrace\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121:18\n 11: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:551:17\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:483:40\n std::panicking::try\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:447:19\n std::panic::catch_unwind\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:550:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513:5\n 12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n std::sys::unix::thread::Thread::new::thread_start\n at rust/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108:17\n 13: start_thread\n 14: __clone\n"] [location=/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/peer_storage.rs:797] [thread_name=raftstore-18-0]
[2023/08/07 18:32:41.764 +08:00] [INFO] [lib.rs:88] ["Welcome to TiKV"]
The text was updated successfully, but these errors were encountered: