Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keydb crash suddenly #132

Closed
ronen-cedato opened this issue Jan 12, 2020 · 4 comments
Closed

Keydb crash suddenly #132

ronen-cedato opened this issue Jan 12, 2020 · 4 comments

Comments

@ronen-cedato
Copy link

Describe the bug
After running for a month or so with active-replica enabled it crashed suddenly.
The server was in active-replica with another node. All writes were done to the other node only.

** Log Files **
These should be KeyDB logs, not syslogs or logs from your container manager. If you are reporting a crash there will be a line in your log stating:
"=== KEYDB BUG REPORT START: Cut & paste starting from here ==="

Please copy everything after this line.
22795:S 09 Jan 2020 08:24:19.048 * 1 changes in 60 seconds. Saving...
22795:S 09 Jan 2020 08:24:19.048 * Background saving started by pid 12405
12405:C 09 Jan 2020 08:24:19.053 * DB saved on disk
12405:C 09 Jan 2020 08:24:19.054 * RDB: 0 MB of memory used by copy-on-write
22795:S 09 Jan 2020 08:24:19.148 * Background saving terminated with success

=== KEYDB BUG REPORT START: Cut & paste starting from here ===
22795:S 09 Jan 2020 08:25:06.102 # === ASSERTION FAILED OBJECT CONTEXT ===
22795:S 09 Jan 2020 08:25:06.102 # Object type: 0
22795:S 09 Jan 2020 08:25:06.102 # Object encoding: 0
22795:S 09 Jan 2020 08:25:06.102 # Object refcount: 1
22795:S 09 Jan 2020 08:25:06.102 # Object raw string len: 56
22795:S 09 Jan 2020 08:25:06.103 # Object raw string content: "hb::userid::bidwon::e984685b-17e7-442e-81cb-d4335a020448"
22795:S 09 Jan 2020 08:25:06.103 # === ASSERTION FAILED ===
22795:S 09 Jan 2020 08:25:06.103 # ==> db.cpp:226 'retval == DICT_OK' is not true
22795:S 09 Jan 2020 08:25:06.103 # (forcing SIGSEGV to print the bug report.)
22795:S 09 Jan 2020 08:25:06.103 # KeyDB 5.1.1 crashed by signal: 11
22795:S 09 Jan 2020 08:25:06.103 # Crashed running the instruction at: 0x48b8a8
22795:S 09 Jan 2020 08:25:06.103 # Accessing address: 0xffffffffffffffff
22795:S 09 Jan 2020 08:25:06.103 # Failed assertion: retval == DICT_OK (db.cpp:226)

------ STACK TRACE ------
EIP:
/usr/local/bin/keydb-server 0.0.0.0:6393(_serverAssert+0x78) [0x48b8a8]

Backtrace:
/usr/local/bin/keydb-server 0.0.0.0:6393(logStackTrace(ucontext*)+0x39) [0x48cdd9]
/usr/local/bin/keydb-server 0.0.0.0:6393(sigsegvHandler(int, siginfo_t*, void*)+0xae) [0x48d45e]
/lib64/libpthread.so.0(+0xf5f0) [0x7fc3056e05f0]
/usr/local/bin/keydb-server 0.0.0.0:6393(_serverAssert+0x78) [0x48b8a8]
/usr/local/bin/keydb-server 0.0.0.0:6393(dbAdd(redisDb*, redisObject*, redisObject*)+0x29) [0x456fc9]
/usr/local/bin/keydb-server 0.0.0.0:6393(incrDecrCommand(client*, long long)+0x184) [0x46e774]
/usr/local/bin/keydb-server 0.0.0.0:6393(call(client*, int)+0xb5) [0x43c065]
/usr/local/bin/keydb-server 0.0.0.0:6393(processCommand(client*, int)+0x7a8) [0x4402f8]
/usr/local/bin/keydb-server 0.0.0.0:6393(processCommandAndResetClient(client*, int)+0x21) [0x449571]
/usr/local/bin/keydb-server 0.0.0.0:6393(processInputBuffer(client*, int)+0x7a) [0x44da7a]
/usr/local/bin/keydb-server 0.0.0.0:6393(replicaReplayCommand(client*)+0x26e) [0x4644fe]
/usr/local/bin/keydb-server 0.0.0.0:6393(call(client*, int)+0xb5) [0x43c065]
/usr/local/bin/keydb-server 0.0.0.0:6393(processCommand(client*, int)+0x7a8) [0x4402f8]
/usr/local/bin/keydb-server 0.0.0.0:6393(processCommandAndResetClient(client*, int)+0x21) [0x449571]
/usr/local/bin/keydb-server 0.0.0.0:6393(processInputBuffer(client*, int)+0x7a) [0x44da7a]
/usr/local/bin/keydb-server 0.0.0.0:6393(processInputBufferAndReplicate(client*)+0x2d) [0x44dc4d]
/usr/local/bin/keydb-server 0.0.0.0:6393(readQueryFromClient(aeEventLoop*, int, void*, int)+0x350) [0x4503b0]
/usr/local/bin/keydb-server 0.0.0.0:6393(ProcessEventCore+0xe9) [0x434bb9]
/usr/local/bin/keydb-server 0.0.0.0:6393(aeProcessEvents+0x14d) [0x434e4d]
/usr/local/bin/keydb-server 0.0.0.0:6393(aeMain+0x2d) [0x4351bd]
/usr/local/bin/keydb-server 0.0.0.0:6393(workerThreadMain(void*)+0x8f) [0x43934f]
/lib64/libpthread.so.0(+0x7e65) [0x7fc3056d8e65]
/lib64/libc.so.6(clone+0x6d) [0x7fc30540188d]

------ INFO OUTPUT ------

Server

redis_version:5.1.1
redis_git_sha1:a91f58b8
redis_git_dirty:0
redis_build_id:2fdba36391352f16
redis_mode:standalone
os:Linux 3.10.0-1062.4.1.el7.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:7.3.1
process_id:22795
run_id:c68f15d6a3af64733f64a85e3d401336a15b3318
tcp_port:6393
uptime_in_seconds:3199814
uptime_in_days:37
hz:10
configured_hz:10
lru_clock:1518002
executable:/usr/local/bin/keydb-server
config_file:/opt/redis/conf.d/hb.conf

Clients

connected_clients:1
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0
current_client_thread:0
thread_0_clients:2
thread_1_clients:0

Memory

used_memory:10579608
used_memory_human:10.09M
used_memory_rss:12558336
used_memory_rss_human:11.98M
used_memory_peak:15112376
used_memory_peak_human:14.41M
used_memory_peak_perc:70.01%
used_memory_overhead:10287624
used_memory_startup:9105264
used_memory_dataset:291984
used_memory_dataset_perc:19.80%
allocator_allocated:10839920
allocator_active:12070912
allocator_resident:21979136
total_system_memory:134924840960
total_system_memory_human:125.66G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.11
allocator_frag_bytes:1230992
allocator_rss_ratio:1.82
allocator_rss_bytes:9908224
rss_overhead_ratio:0.57
rss_overhead_bytes:-9420800
mem_fragmentation_ratio:1.19
mem_fragmentation_bytes:2001824
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:17026
mem_clients_normal:49798
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

Persistence

loading:0
rdb_changes_since_last_save:5858
rdb_bgsave_in_progress:0
rdb_last_save_time:1578576259
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:585728
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

Stats

total_connections_received:51463602
total_commands_processed:123898766
instantaneous_ops_per_sec:333
total_net_input_bytes:3745240000
total_net_output_bytes:91215203724
instantaneous_input_kbps:23.07
instantaneous_output_kbps:23.74
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:167576
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:3002
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:587
migrate_cached_sockets:0
slave_expires_tracked_keys:245
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

Replication

role:active-replica
Master 0:
master_host:hb-01.wdc.cedato.local
master_port:6393
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:2133842024
slave_priority:100
slave_read_only:0
connected_slaves:1
slave0:ip=10.211.2.82,port=6393,state=online,offset=2934293792,lag=1
master_replid:490a5759e84681bef7fedfd84c09bfb914f7df7c
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:2934293792
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:2933245217
repl_backlog_histlen:1048576

CPU

used_cpu_sys:7001.193934
used_cpu_user:6231.054494
used_cpu_sys_children:25.178589
used_cpu_user_children:21.646321
server_threads:2
long_lock_waits:4

Commandstats

cmdstat_expire:calls=34006,usec=70621,usec_per_call=2.08
cmdstat_replconf:calls=3192764,usec=4602695,usec_per_call=1.44
cmdstat_psync:calls=1,usec=740,usec_per_call=740.00
cmdstat_get:calls=406,usec=2709,usec_per_call=6.67
cmdstat_rreplay:calls=9721690,usec=93641375,usec_per_call=9.63
cmdstat_auth:calls=50263775,usec=51889607,usec_per_call=1.03
cmdstat_ping:calls=19797585,usec=9271793,usec_per_call=0.47
cmdstat_slowlog:calls=1251754,usec=2252355,usec_per_call=1.80
cmdstat_hset:calls=6839644,usec=26648944,usec_per_call=3.90
cmdstat_info:calls=19479634,usec=1334568225,usec_per_call=68.51
cmdstat_set:calls=2208113,usec=5583995,usec_per_call=2.53
cmdstat_del:calls=166377,usec=1057453,usec_per_call=6.36
cmdstat_select:calls=729382,usec=246980,usec_per_call=0.34
cmdstat_config:calls=10052755,usec=144199483,usec_per_call=14.34
cmdstat_command:calls=5,usec=11739,usec_per_call=2347.80
cmdstat_keys:calls=6,usec=180,usec_per_call=30.00
cmdstat_incr:calls=158271,usec=982118,usec_per_call=6.21
cmdstat_flushall:calls=2,usec=7805,usec_per_call=3902.50
cmdstat_hget:calls=2596,usec=2416,usec_per_call=0.93

Cluster

cluster_enabled:0

Keyspace

db0:keys=743,expires=130,avg_ttl=27207403
db1:keys=185,expires=129,avg_ttl=26733981

------ CLIENT LIST OUTPUT ------
id=4386 addr=10.211.2.82:29691 fd=86 name= age=3199779 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=replconf
id=4387 addr=10.211.2.82:6393 fd=88 name= age=3199779 idle=0 flags=M db=1 sub=0 psub=0 multi=-1 qbuf=1096 qbuf-free=31672 obl=0 oll=0 omem=0 events=r cmd=rreplay

------ CURRENT CLIENT INFO ------
id=71397756 addr=?:0 fd=-1 name= age=0 idle=0 flags=N db=1 sub=0 psub=0 multi=-1 qbuf=77 qbuf-free=77 obl=0 oll=0 omem=0 events= cmd=incr
argv[0]: 'incr'
argv[1]: 'hb::userid::bidwon::e984685b-17e7-442e-81cb-d4335a020448'
22795:S 09 Jan 2020 08:25:06.105 # key 'hb::userid::bidwon::e984685b-17e7-442e-81cb-d4335a020448' found in DB containing the following object:
22795:S 09 Jan 2020 08:25:06.105 # Object type: 0
22795:S 09 Jan 2020 08:25:06.105 # Object encoding: 1
22795:S 09 Jan 2020 08:25:06.105 # Object refcount: 1

------ REGISTERS ------
22795:S 09 Jan 2020 08:25:06.105 #
RAX:0000000000000000 RBX:00000000000000e2
RCX:0000000000001000 RDX:fffffffffffffdc0
RDI:00007fc2fc000000 RSI:00007fc2fc001a10
RBP:0000000000574336 RSP:00007fc3024fa200
R8 :0000000000000001 R9 :000000000001f5f0
R10:7562206568742074 R11:0000000000000206
R12:00000000005714ea R13:00000000007c9270
R14:00007fc300fe2800 R15:0000000000000000
RIP:000000000048b8a8 EFL:0000000000010206
CSGSFS:0000000000000033
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa20f) -> 0000000000000000
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa20e) -> 0000000000000000
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa20d) -> 00000000000016e2
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa20c) -> 00007fc304e0e060
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa20b) -> 000000000043c065
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa20a) -> 00059bb4eff9dd33
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa209) -> 0000000000000003
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa208) -> 00007fc300fe28b0
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa207) -> 0000000000000001
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa206) -> 00007fc300fe28b0
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa205) -> 000000000046e774
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa204) -> 00007fc300fe2800
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa203) -> 0000000000456fc9
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa202) -> 0000000000000001
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa201) -> 00007fc304e4c398
22795:S 09 Jan 2020 08:25:06.105 # (00007fc3024fa200) -> 00007fc300f4f4c8

------ FAST MEMORY TEST ------
22795:S 09 Jan 2020 08:25:06.105 # Bio thread for job type #0 terminated
22795:S 09 Jan 2020 08:25:06.105 # Bio thread for job type #1 terminated
22795:S 09 Jan 2020 08:25:06.105 # Bio thread for job type #2 terminated
*** Preparing to test memory region 7cf000 (2256896 bytes)
*** Preparing to test memory region c4d000 (135168 bytes)
*** Preparing to test memory region 7fc2f4000000 (135168 bytes)
*** Preparing to test memory region 7fc2fab80000 (17301504 bytes)
*** Preparing to test memory region 7fc2fbc80000 (3805184 bytes)
*** Preparing to test memory region 7fc300000000 (12582912 bytes)
*** Preparing to test memory region 7fc300ded000 (2097152 bytes)
*** Preparing to test memory region 7fc3011fa000 (24576 bytes)
*** Preparing to test memory region 7fc301200000 (2097152 bytes)
*** Preparing to test memory region 7fc3014fc000 (8388608 bytes)
*** Preparing to test memory region 7fc301cfd000 (8388608 bytes)
*** Preparing to test memory region 7fc3024fd000 (4194304 bytes)
*** Preparing to test memory region 7fc3028fe000 (8388608 bytes)
*** Preparing to test memory region 7fc3030ff000 (8388608 bytes)
*** Preparing to test memory region 7fc303900000 (8388608 bytes)
*** Preparing to test memory region 7fc304100000 (17825792 bytes)
*** Preparing to test memory region 7fc3056cc000 (20480 bytes)
*** Preparing to test memory region 7fc3058e9000 (16384 bytes)
*** Preparing to test memory region 7fc3060f7000 (86016 bytes)
*** Preparing to test memory region 7fc306928000 (36864 bytes)
*** Preparing to test memory region 7fc30693c000 (4096 bytes)
*** Preparing to test memory region 7fc30693d000 (4096 bytes)
*** Preparing to test memory region 7fc306940000 (4096 bytes)
.O.O.O.O.O.O.O.O.O.O.

To Reproduce
Do you know how to reproduce this? If so please provide repro steps.
No.

@JohnSully
Copy link
Collaborator

I believe this is a dupe of issue #83 where expires can cause us to enter an invalid state when used with Active Replication. This was fixed in v5.2. I recommend updating to the latest version which will fix your issue.

@pavelnemirovsky
Copy link

@JohnSully thx but the issue still happens in v5.3.0

@JohnSully
Copy link
Collaborator

Reopened.

JohnSully added a commit that referenced this issue Feb 7, 2020
JohnSully added a commit that referenced this issue Feb 7, 2020
@JohnSully
Copy link
Collaborator

JohnSully commented Feb 7, 2020

Sorry for the premature closure of this issue. We had a similar one specific only to Active Replication which I thought this was a dupe of. Detailed notes are written in issue #137 but I will point out here that the issue applies to both traditional and active replication.


Notes from #137:
The issue was we were using the result of expireIfNeeded to determine if lookupKeyWrite should return an object. This results in a situation where a key is in the underlying database but the high level logic doesn't see it. When the high level command tries to readd the key it fails because it already exists.

Commit 49d8674 adds tests to detect this
Commit 050d580 fixes the actual issue.

This only repros on replicas, not masters because expireIfNeeded will delete the actual key on a master (replicas wait for the master to send a DEL).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants