
Crossed wires with follower-proxy #26

Closed
aphyr opened this issue Mar 14, 2020 · 9 comments

aphyr commented Mar 14, 2020

It looks like clusters with follower-proxy, network partitions, and crashes (or possibly fewer faults; I'm trying to narrow it down) can occasionally get into states where one specific Redis node gets mixed up and sends inappropriate responses to clients. I think what we're seeing is responses intended for other clients getting dispatched to the wrong place.
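For context on why a single stray reply is so destructive: RESP (the Redis wire protocol) carries no request IDs, so a client can only match replies to requests by arrival order on the connection. Here's a toy sketch (plain Python, not the actual Clojure/Carmine code) of how one leaked reply shifts everything after it:

```python
from collections import deque

def pair_replies(requests, replies):
    """Pair each request with the next unread reply, strictly FIFO --
    this is all a RESP client can do, since replies carry no request ID."""
    q = deque(replies)
    return [(req, q.popleft() if q else None) for req in requests]

# Hypothetical commands and replies, loosely modeled on the test's workload.
requests = ["LRANGE 46 0 -1", "RPUSH 46 215", "LRANGE 47 0 -1"]
correct  = [["211", "212"], 3, ["111"]]

# Correctly paired: the LRANGE gets its list back.
assert pair_replies(requests, correct)[0] == ("LRANGE 46 0 -1", ["211", "212"])

# If the server (or a proxy) leaks one reply meant for another client,
# every subsequent reply shifts by one and is mis-attributed:
crossed = pair_replies(requests, ["211"] + correct)
print(crossed[0])  # ('LRANGE 46 0 -1', '211') -- a bare string, not a list
```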

lein run test-all --raft-version dfd91d4 -w append --time-limit 600 --nemesis-interval 1 --nemesis standard --test-count 10 --concurrency 5n --follower-proxy

20200313T233858.000-0400.zip

We resolve a network partition:

2020-03-13 23:41:36,284{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition nil

The first sign of weirdness comes from worker 5, executing process 130. It tries to execute a MULTI transaction, opens a fresh connection, and gets a single "211" instead of a vector of results from an LRANGE command.

2020-03-13 23:41:41,025{GMT}    INFO    [jepsen worker 5] jepsen.util: 130      :invoke :txn    [[:r 47 nil] [:r 47 nil] [:append 46 215]]
2020-03-13 23:41:41,026{GMT}    INFO    [jepsen worker 5] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x5c701a81 Socket[addr=n1/192.168.122.11,port=6379,localport=41566]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x73be0bac java.io.DataInputStream@73be0bac], :out #object[java.io.BufferedOutputStream 0x6900a5e8 java.io.BufferedOutputStream@6900a5e8]}}, :in-txn? #object[clojure.lang.Atom 0x7407179f {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,117{GMT}    WARN    [jepsen worker 5] jepsen.core: Process 130 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 47, :value "211"}

Concurrently, worker 15/process 140 does the same thing:

2020-03-13 23:41:41,045{GMT}    INFO    [jepsen worker 15] jepsen.util: 140     :invoke :txn    [[:r 46 nil] [:append 45 226] [:append 46 216] [:append 46 217]]
2020-03-13 23:41:41,046{GMT}    INFO    [jepsen worker 15] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x2925d9b7 Socket[addr=n1/192.168.122.11,port=6379,localport=41580]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x12c9c95b java.io.DataInputStream@12c9c95b], :out #object[java.io.BufferedOutputStream 0x15deb95b java.io.BufferedOutputStream@15deb95b]}}, :in-txn? #object[clojure.lang.Atom 0x747af9c6 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,123{GMT}    WARN    [jepsen worker 15] jepsen.core: Process 140 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 46, :value "211"}

It's particularly weird that both of them got "211" here.

Concurrently, worker 20 (process 145) does a NON-MULTI read--just a regular old single LRANGE by itself, on a fresh connection--and gets [2 3 ["226"]], which is... VERY weird.

2020-03-13 23:41:41,107{GMT}    INFO    [jepsen worker 20] jepsen.util: 145     :invoke :txn    [[:r 46 nil]]
2020-03-13 23:41:41,108{GMT}    INFO    [jepsen worker 20] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x6ec17aa1 Socket[addr=n1/192.168.122.11,port=6379,localport=41584]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x530146a8 java.io.DataInputStream@530146a8], :out #object[java.io.BufferedOutputStream 0xcdfc07b java.io.BufferedOutputStream@cdfc07b]}}, :in-txn? #object[clojure.lang.Atom 0x3c20d3b9 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,257{GMT}    WARN    [jepsen worker 20] jepsen.core: Process 145 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 46, :value [2 3 ["226"]]}

The same thing happens to worker 10's request with process 160: a fresh connection, a MULTI transaction, and it gets the vector [2 3]--it is a vector, but of integers, not the strings we expected.

2020-03-13 23:41:40,927{GMT}    INFO    [jepsen worker 10] jepsen.util: 160     :invoke :txn    [[:r 47 nil] [:append 46 211] [:append 47 111] [:append 46 212]]
2020-03-13 23:41:40,928{GMT}    INFO    [jepsen worker 10] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x274a045 Socket[addr=n1/192.168.122.11,port=6379,localport=41548]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x657968b9 java.io.DataInputStream@657968b9], :out #object[java.io.BufferedOutputStream 0x6f11d2be java.io.BufferedOutputStream@6f11d2be]}}, :in-txn? #object[clojure.lang.Atom 0x29e75eed {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,308{GMT}    WARN    [jepsen worker 10] jepsen.core: Process 160 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 45, :value [2 3]}

Immediately after this, we detect the completion of a node remove operation that we'd been waiting on for a while: node n1 finishes removing n5 from the cluster.

2020-03-13 23:41:41,314{GMT}    INFO    [clojure-agent-send-off-pool-10] jepsen.redis.db: :done-waiting-for-node-removal {:raft
 {:role :follower,
  :num_voting_nodes "2",
  :leader_id 813301272,
  :is_voting "yes",
  :node_id 1597309644,
  :num_nodes 2,
  :state :up,
  :nodes
  ({:id 813301272,
    :state "connected",
    :voting "yes",
    :addr "192.168.122.12",
    :port 6379,
    :last_conn_secs 94,
    :conn_errors 0,
    :conn_oks 1}),
  :current_term 114},
 :log
 {:log_entries 1243,
  :current_index 1243,
  :commit_index 1243,
  :last_applied_index 1243,
  :file_size 173406,
  :cache_memory_size 136972,
  :cache_entries 1090},
 :snapshot {:snapshot_in_progress "no"},
 :clients {:clients_in_multi_state 0}}

Node n2 starts removing n3:

2020-03-13 23:41:41,349{GMT}    INFO    [jepsen node n2] jepsen.redis.db: n2 :removing n3 (id: 859783830)

Worker 5, process 180, hits a type error as well--it expects a list of elements from an LRANGE, but gets a single Long instead.

2020-03-13 23:41:41,322{GMT}    INFO    [jepsen worker 5] jepsen.util: 180      :invoke :txn    [[:r 46 nil] [:append 47 135] [:append 47 136] [:r 46 nil]]
2020-03-13 23:41:41,323{GMT}    INFO    [jepsen worker 5] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x7a37e654 Socket[addr=n1/192.168.122.11,port=6379,localport=41632]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x1fcb0d1f java.io.DataInputStream@1fcb0d1f], :out #object[java.io.BufferedOutputStream 0x24bb596f java.io.BufferedOutputStream@24bb596f]}}, :in-txn? #object[clojure.lang.Atom 0x335f2349 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,365{GMT}    WARN    [jepsen worker 5] jepsen.core: Process 180 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long
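"Don't know how to create ISeq from: java.lang.Long" is Clojure's way of saying the code tried to treat an integer as a sequence--exactly what happens when a list-shaped LRANGE result is replaced by a stray integer reply. A Python analogue of the failure mode (hypothetical helper; the real client is Clojure/Carmine):

```python
# Python analogue of the Clojure ISeq crash: code written for a list
# reply (LRANGE) receives a stray integer reply instead (e.g. an RPUSH
# length destined for another client) and blows up on iteration.
def parse_read(reply):
    # Hypothetical helper: an LRANGE reply is a list of string elements.
    return [int(x) for x in reply]

assert parse_read(["1", "2"]) == [1, 2]   # normal LRANGE reply
try:
    parse_read(5)                          # crossed-wires integer reply
except TypeError:
    print("type error: int is not iterable")
```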

Worker 0, executing process 200, goes to perform a MULTI transaction. It opens a new connection to do so...

2020-03-13 23:41:42,021{GMT}    INFO    [jepsen worker 0] jepsen.util: 200      :invoke :txn    [[:r 49 nil] [:append 48 21]]
2020-03-13 23:41:42,021{GMT}    INFO    [jepsen worker 0] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x25b8659b Socket[addr=n1/192.168.122.11,port=6379,localport=41722]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x55c08913 java.io.DataInputStream@55c08913], :out #object[java.io.BufferedOutputStream 0x4e607e5b java.io.BufferedOutputStream@4e607e5b]}}, :in-txn? #object[clojure.lang.Atom 0x60177bc0 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}

That request succeeds with what looks like normal results.

2020-03-13 23:41:42,121{GMT}    INFO    [jepsen worker 0] jepsen.util: 200      :ok     :txn    [[:r 49 [111 130 131 132 133 135 136 137 138 153 154 156 166 169 170 175 176 179 180 181]] [:append 48 21]]

Weirdly, this conflicts with other reads of key 49, so I'm not sure if we got the right thing here or not.

Worker 0 then tries to perform a single read of 48. This doesn't involve an EXEC, so it should have returned a single value. Instead it gets a vector of vectors, which... what?

2020-03-13 23:41:42,257{GMT}    INFO    [jepsen worker 0] jepsen.util: 200      :invoke :txn    [[:r 48 nil]]
...
2020-03-13 23:41:42,373{GMT}    WARN    [jepsen worker 0] jepsen.core: Process 200 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 48, :value [["2" "4" "5" "6" "8" "9" "14" "15" "17" "21"] ["111" "130" "131" "132" "133" "135" "136" "137" "138" "153" "154" "156" "166" "169" "170" "175" "176" "179" "180" "181" "194"] ["3" "11" "12" "13" "21"]]}

Eyeballing the vectors it returned, it looks like this is a read of keys [? 49 ?] respectively.

Then worker 0 (logical process 225) goes to perform a transaction:

2020-03-13 23:41:42,446{GMT}    INFO    [jepsen worker 0] jepsen.util: 225      :invoke :txn    [[:r 49 nil] [:r 48 nil] [:r 47 nil] [:append 47 211]]

225 is a new process, so it opens a fresh connection to perform its first request. Just to confirm, yes, this is a new port and connection object!

2020-03-13 23:41:42,447{GMT}    INFO    [jepsen worker 0] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x417158d7 Socket[addr=n1/192.168.122.11,port=6379,localport=41762]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6d4fbc73 java.io.DataInputStream@6d4fbc73], :out #object[java.io.BufferedOutputStream 0x5c7b23f7 java.io.BufferedOutputStream@5c7b23f7]}}, :in-txn? #object[clojure.lang.Atom 0x59d46b31 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}

It performs a MULTI, then a read (LRANGE) of key 49, read of 48, read of 47, and appends (RPUSH) 211 to key 47. It EXECs the transaction, and gets back a vector of responses. The first response should have been a list of values for key 49, but instead we obtained "2", which... what?

2020-03-13 23:41:42,552{GMT}    WARN    [jepsen worker 0] jepsen.core: Process 225 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 49, :value "2"}

Later, we'll read key 49 again and get a weird response:

2020-03-13 23:41:41,979{GMT}    INFO    [jepsen worker 20] jepsen.util: 270     :ok     :txn    [[:append 49 5] [:r 47 [2 4 5 6]] [:r 49 [3 11 12]] [:append 49 6]]

This read is clearly messed up: we just appended 5 to key 49 and didn't observe it. Also, the append of 12 to key 49 happened after this read, and failed. This definitely can't be a read of key 49.

I feel like this points to wires getting crossed either inside Redis or the Carmine library--or maybe I'm somehow STILL using the library wrong. I think... what I should do next is get a Wireshark dump of the traffic back and forth, and try to figure out what's going on at a protocol level.


yossigo commented Mar 15, 2020

The wires definitely seem crossed. So far we've not been able to reproduce this or explain it by looking at the code... I don't want to be too quick to assign blame, but it might indeed be Carmine. The thing that doesn't make sense here is the mix-up happening on a fresh connection--unless the connection is only "fresh" in the sense of coming back from a connection pool, and not really new. Any idea whether that may be the case?

The current MULTI/EXEC error-handling issues add to that, and I'd suggest getting those fixed and out of the way first, because Carmine could be making some (correct) assumptions about how MULTI/EXEC errors are handled, and that could be the culprit.


aphyr commented Mar 15, 2020

Yeah, looking at the network connections, they're separate socket objects--I'm pretty darn confident it's not re-using them! That said... Carmine is weird internally, so I can't guarantee it. I'm aiming to get a packet capture so I can tell for sure what's going on.

As for MULTI/EXEC issues... as far as I know, we're doing that correctly now. I think the errors we're seeing around MULTI are caused by this mixed-up connection state problem.


aphyr commented Mar 15, 2020

Oh--and to give you a little more context on connection pools--I wrote a connection pool specifically for Carmine to avoid this issue. We construct a fresh pool that wraps a single connection for each client, and don't share that with anyone else.
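For what it's worth, the pool's whole contract is tiny. Sketched here in Python with hypothetical names (the real SingleConnectionPool is a few lines of Clojure), the point is that each client owns exactly one connection and the pool never hands it to anyone else:

```python
class SingleConnectionPool:
    """Python analogue (sketch, not the actual Clojure code) of a
    one-connection-per-client pool: it always hands back the same
    connection object, so 'fresh from the pool' always means 'this
    client's own socket', never another client's."""
    def __init__(self, conn):
        self.conn = conn
    def get_connection(self):
        return self.conn   # always the same object, never shared
    def release(self, conn):
        pass               # nothing to return to a shared pool

a = SingleConnectionPool(object())
b = SingleConnectionPool(object())
assert a.get_connection() is a.get_connection()      # stable per client
assert a.get_connection() is not b.get_connection()  # never shared
```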


yossigo commented Mar 15, 2020

I wasn't clear: I was referring to the MULTI/EXEC (or, more specifically, DISCARD) anomaly on the server side. If you don't think it contributes to this in any way, we'll leave it for a bit later.


aphyr commented Mar 15, 2020

Ah, yeah, that could be the case! It might be that MULTI triggers some kind of behavior that leaks between connections. Or maybe MULTI messes up the client somehow--whatever mutable state the library uses to aggregate results might be getting confused. Or maybe the MULTI issues are themselves caused by some kind of state leakage. It could go either way!


aphyr commented Mar 16, 2020

OK, two things. One: this looks to be associated with follower-proxy. Two: let's talk about this test run: 20200315T150856.000-0400.zip.

Worker 16 (process 116) tries to perform a single read (LRANGE) of key 33. It opens a fresh connection to do so, and gets back a totally weird result in response to its request: [3 4 ["2" "4" "21" "22"]]. This looks like a response to some other MULTI transaction--maybe two RPUSH commands followed by an LRANGE.

2020-03-15 15:10:15,857{GMT}    INFO    [jepsen worker 16] jepsen.util: 116     :invoke :txn    [[:r 33 nil]]
2020-03-15 15:10:15,858{GMT}    INFO    [jepsen worker 16] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x20e563fb Socket[addr=n2/192.168.122.12,port=6379,localport=48886]], :spec {:host n2, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6c69ce28 java.io.DataInputStream@6c69ce28], :out #object[java.io.BufferedOutputStream 0x501ddbda java.io.BufferedOutputStream@501ddbda]}}, :in-txn? #object[clojure.lang.Atom 0x486dbd67 {:status :ready, :val false}], :spec {:host n2, :port 6379, :timeout-ms 10000}}
2020-03-15 15:10:16,115{GMT}    WARN    [jepsen worker 16] jepsen.core: Process 116 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 33, :value [3 4 ["2" "4" "21" "22"]]}
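To see why [3 4 ["2" "4" "21" "22"]] reads as an EXEC reply: RPUSH returns the list's new length as an integer reply, and LRANGE returns an array of bulk strings, so a MULTI of two RPUSHes followed by an LRANGE yields exactly this shape. A minimal RESP2 decoder sketch (the wire bytes below are hypothetical, not taken from the actual capture):

```python
def parse(data):
    """Minimal RESP2 parser sketch: returns (value, remaining bytes)."""
    head, _, rest = data.partition(b"\r\n")
    kind, payload = head[:1], head[1:]
    if kind == b":":                       # integer reply (e.g. RPUSH)
        return int(payload), rest
    if kind == b"$":                       # bulk string (LRANGE element)
        n = int(payload)
        return rest[:n].decode(), rest[n + 2:]
    if kind == b"*":                       # array (LRANGE / EXEC reply)
        items = []
        for _ in range(int(payload)):
            item, rest = parse(rest)
            items.append(item)
        return items, rest
    raise ValueError(kind)

# Hypothetical EXEC reply for: MULTI; RPUSH k 21; RPUSH k 22;
# LRANGE k 0 -1; EXEC -- where k already held ["2", "4"].
wire = (b"*3\r\n:3\r\n:4\r\n"
        b"*4\r\n$1\r\n2\r\n$1\r\n4\r\n$2\r\n21\r\n$2\r\n22\r\n")
print(parse(wire)[0])   # [3, 4, ['2', '4', '21', '22']]
```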

Here's the corresponding TCP stream (see n2's tcpdump file, in the zip, if you want to check for yourself):

Screenshot from 2020-03-16 12-57-30

There are exactly two Redis messages in this stream--a request from the Jepsen client to n2 for LRANGE 33 0 -1:

Screenshot from 2020-03-16 12-57-49

... and a MULTI-BULK reply from Redis of [3 4 [2 4 21 22]]:

Screenshot from 2020-03-16 12-57-59

Like the other cases I've seen, this issue happens in a burst, and appears limited to one node. In this test, it started on n2 at 2020-03-15 15:10:15.011:

2020-03-15 15:10:15,011{GMT}    WARN    [jepsen worker 6] jepsen.core: Process 56 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long

What was going on in the cluster leading up to this point?

n1 removed n3:

2020-03-15 15:09:11,247{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  "n3"
2020-03-15 15:09:11,675{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  ["n3" {"n1" "OK"}]

We isolated n1 n2 n3 | n4 n5:

2020-03-15 15:09:13,676{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        :majority
2020-03-15 15:09:13,780{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        [:isolated {"n5" #{"n2" "n1" "n3"}, "n4" #{"n2" "n1" "n3"}, "n2" #{"n5" "n4"}, "n1" #{"n5" "n4"}, "n3" #{"n5" "n4"}}]

Paused n1, n4, n5:

2020-03-15 15:09:15,780{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :pause  :majority
2020-03-15 15:09:15,884{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :pause  {"n5" "", "n4" "", "n1" ""}

Resolved the partition:

2020-03-15 15:09:24,320{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition nil
2020-03-15 15:09:24,525{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition :network-healed

Resumed all nodes:

2020-03-15 15:09:39,388{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :resume :all
2020-03-15 15:09:39,493{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :resume {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}

n2 declared itself a leader immediately thereafter:

113869:15 Mar 12:09:39.617 <raftlib> becoming candidate
...
113869:15 Mar 12:09:39.741 <raftlib> becoming leader term:36

Created a network partition where every node could see a majority, but nobody agreed on what that majority was:

2020-03-15 15:09:46,556{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        :majorities-ring
2020-03-15 15:09:46,661{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        [:isolated {"n2" #{"n1" "n3"}, "n4" #{"n5" "n1"}, "n3" #{"n2" "n5"}, "n1" #{"n2" "n4"}, "n5" #{"n4" "n3"}}]

Killed n3, n4, n5:

2020-03-15 15:09:51,194{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   :majority
2020-03-15 15:09:53,400{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   {"n5" "", "n4" "", "n3" ""}

Started all nodes:

2020-03-15 15:09:55,401{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  :all
2020-03-15 15:09:55,607{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}

We had n2 remove n4:

2020-03-15 15:09:57,713{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  "n4"
2020-03-15 15:09:58,141{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  ["n4" {"n2" "OK"}]

We had n2 remove n5:

2020-03-15 15:10:00,246{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  "n5"
2020-03-15 15:10:00,675{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  ["n5" {"n2" "OK"}]

We resolved a network partition about 13 seconds before Things Went Bad:

2020-03-15 15:10:02,675{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition nil
2020-03-15 15:10:02,881{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition :network-healed

We tried to ask n2 to remove a node, but the request was killed by its one-second timeout wrapper: redis-cli exited with status 124, leaving the operation indeterminate:

2020-03-15 15:10:11,905{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :leave  "n2"    indeterminate: Command exited with non-zero status 124 on node n2:
sudo -S -u root bash -c "cd /; timeout 1s /opt/redis/redis-cli RAFT.NODE REMOVE 404091929"
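Exit status 124 is (at least with GNU coreutils) how timeout(1) reports that it had to kill the wrapped command, so this isn't redis-cli crashing so much as RAFT.NODE REMOVE failing to return within its one-second budget. A quick sanity check of that convention:

```shell
# timeout(1) exits with 124 when the wrapped command outlives its budget;
# here sleep stands in for a redis-cli call that never returns in time.
timeout 0.2s sleep 2
echo "exit status: $?"   # prints "exit status: 124"
```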

n2 became a follower, after receiving an appendentries failure:

113869:15 Mar 12:10:09.325 node:1756104758: invalid RAFT.REQUESTVOTE reply
113869:15 Mar 12:10:09.325 node:1756104758: invalid RAFT.REQUESTVOTE reply
113869:15 Mar 12:10:09.326 node:1756104758: invalid RAFT.REQUESTVOTE reply
113869:15 Mar 12:10:09.326 node:1756104758: raft.c:444: <raftlib> received appendentries response fail ci:1675 rci:1672 msgid:2
113869:15 Mar 12:10:12.062 <raftlib> becoming follower

Went through a few elections:

113869:15 Mar 12:10:12.765 <raftlib> election starting: 760 800, term: 65 ci: 1675
113869:15 Mar 12:10:12.765 <raftlib> becoming candidate
113869:15 Mar 12:10:13.017 <raftlib> randomize election timeout to 939
113869:15 Mar 12:10:13.017 node:1756104758: raft.c:444: <raftlib> sending requestvote to: -960431056
113869:15 Mar 12:10:13.069 <raftlib> becoming follower
113869:15 Mar 12:10:13.708 <raftlib> node requested vote: 0 replying: unknown
113869:15 Mar 12:10:13.770 <raftlib> election starting: 794 800, term: 74 ci: 1675
113869:15 Mar 12:10:13.770 <raftlib> becoming candidate
113869:15 Mar 12:10:13.872 <raftlib> randomize election timeout to 726
113869:15 Mar 12:10:13.872 node:1756104758: raft.c:444: <raftlib> sending requestvote to: -960431056
113869:15 Mar 12:10:13.899 node:1756104758: invalid RAFT.AE reply
113869:15 Mar 12:10:13.899 node:1756104758: raft.c:444: <raftlib> node requested vote: -960431056 replying: not granted
113869:15 Mar 12:10:13.899 <raftlib> node requested vote: 0 replying: unknown
113869:15 Mar 12:10:13.952 <raftlib> node requested vote: 0 replying: unknown
113869:15 Mar 12:10:14.528 <raftlib> becoming follower
113869:15 Mar 12:10:14.528 <raftlib> randomize election timeout to 691

And then the madness started:

2020-03-15 15:10:15,857{GMT}    INFO    [jepsen worker 16] jepsen.util: 116     :invoke :txn    [[:r 33 nil]]
2020-03-15 15:10:15,858{GMT}    INFO    [jepsen worker 16] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x20e563fb Socket[addr=n2/192.168.122.12,port=6379,localport=48886]], :spec {:host n2, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6c69ce28 java.io.DataInputStream@6c69ce28], :out #object[java.io.BufferedOutputStream 0x501ddbda java.io.BufferedOutputStream@501ddbda]}}, :in-txn? #object[clojure.lang.Atom 0x486dbd67 {:status :ready, :val false}], :spec {:host n2, :port 6379, :timeout-ms 10000}}
2020-03-15 15:10:16,115{GMT}    WARN    [jepsen worker 16] jepsen.core: Process 116 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 33, :value [3 4 ["2" "4" "21" "22"]]}

During this time frame, n2's logs were:

113869:15 Mar 12:10:15.807 node:1756104758: raft.c:444: <raftlib> recvd appendentries t:76 ci:1691 lc:1691 pli:1691 plt:76 #1
113869:15 Mar 12:10:15.886 <raftlib> applying log: 1692, id: 170879493 size: 75
113869:15 Mar 12:10:15.899 node:1756104758: raft.c:444: <raftlib> recvd appendentries t:76 ci:1692 lc:1692 pli:1692 plt:76 #1
113869:15 Mar 12:10:16.049 node:1756104758: raft.c:444: <raftlib> recvd appendentries t:76 ci:1693 lc:1693 pli:1693 plt:76 #1
113869:15 Mar 12:10:16.088 <raftlib> applying log: 1693, id: 1983936574 size: 93
113869:15 Mar 12:10:16.099 node:1756104758: raft.c:444: <raftlib> recvd appendentries t:76 ci:1694 lc:1694 pli:1694 plt:76 #1
113869:15 Mar 12:10:16.188 <raftlib> applying log: 1694, id: 933018922 size: 94

aphyr changed the title from "Unexpectedly received "2" from an LRANGE call which should have returned a list" to "Crossed wires with follower-proxy" on Mar 16, 2020

aphyr commented Mar 16, 2020

Update: we don't need process pauses to trigger this; crashes, membership changes, and partitions are sufficient.


aphyr commented Mar 17, 2020

Got a reproduction case with just crashes and partitions; it looks like membership changes aren't necessary.
20200317T015737.000-0400.zip

This one took a while, so it's not clear what fault caused it:

2020-03-17 02:06:30,695{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
...
2020-03-17 02:06:32,695{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        :primaries
...
2020-03-17 02:06:32,901{GMT}    INFO    [jepsen worker 7] jepsen.util: 857      :invoke :txn    [[:r 111 nil] [:append 111 34]]
2020-03-17 02:06:32,901{GMT}    INFO    [jepsen worker 7] jepsen.redis.client: :multi-discarding
2020-03-17 02:06:32,902{GMT}    INFO    [jepsen worker 7] jepsen.redis.client: :multi-discarded
2020-03-17 02:06:32,902{GMT}    INFO    [jepsen worker 7] jepsen.redis.client: :multi-starting
2020-03-17 02:06:32,902{GMT}    INFO    [jepsen worker 7] jepsen.redis.client: :multi-started
2020-03-17 02:06:32,903{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        [:isolated {"n5" #{"n2"}, "n1" #{"n2"}, "n4" #{"n2"}, "n3" #{"n2"}, "n2" #{"n5" "n1" "n4" "n3"}}]
2020-03-17 02:06:32,903{GMT}    INFO    [jepsen worker 7] jepsen.redis.append: :multi-exec
...
2020-03-17 02:06:34,904{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition nil
...
2020-03-17 02:06:35,110{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition :network-healed
...
2020-03-17 02:06:37,111{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        :majority
...
2020-03-17 02:06:37,215{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        [:isolated {"n5" #{"n2" "n1" "n3"}, "n4" #{"n2" "n1" "n3"}, "n2" #{"n5" "n4"}, "n1" #{"n5" "n4"}, "n3" #{"n5" "n4"}}]
...
2020-03-17 02:06:39,215{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition nil
...
2020-03-17 02:06:39,420{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition :network-healed
...
2020-03-17 02:06:41,421{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   :primaries
...
2020-03-17 02:06:41,577{GMT}    INFO    [jepsen worker 7] jepsen.redis.append: :multi-execed
2020-03-17 02:06:41,583{GMT}    WARN    [jepsen worker 7] jepsen.core: Process 857 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long
...
2020-03-17 02:06:43,731{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   {"n4" ""}

n3, the node that had the issue, was a follower before this time frame, became a leader during it, then stepped down:

39380:16 Mar 23:06:34.579 <raftlib> becoming candidate
...
39380:16 Mar 23:06:34.604 <raftlib> becoming leader term:345
...
39380:16 Mar 23:06:40.352 <raftlib> becoming follower

Which suggests that, like the previous cases, this might have something to do with stepdown behavior?


aphyr commented Mar 18, 2020

This no longer looks reproducible with e657423. Nice work!

aphyr closed this as completed Mar 18, 2020
sjpotter pushed a commit to sjpotter/redisraft that referenced this issue Oct 7, 2021
added initial cmake support, fixed coverage usage