Crossed wires with follower-proxy #26
Wires definitely seem crossed. So far we've not been able to reproduce this or explain it by looking at the code... I don't want to be too quick to blame, but it might be Carmine indeed. The thing that doesn't make sense here is having this mixed up on a fresh connection -- unless it's only fresh back from a connection pool and not really a new connection. Any idea if that may be the case? The current
Yeah, looking at the network connections, they're separate socket objects--I'm pretty darn confident it's not re-using them! That said... Carmine is weird internally, so I can't guarantee. I'm aiming to get a packet capture so I can tell for sure what's going on. As for MULTI/EXEC issues... as far as I know, we're doing that correctly now. I think the errors we're seeing around MULTI are caused by this mixed-up connection state problem.
Oh--and to give you a little more context on connection pools--I wrote a connection pool specifically for Carmine to avoid this issue. We construct a fresh pool that wraps a single connection for each client, and don't share that with anyone else.
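Roughly, the shape of that pool is this--a minimal sketch assuming Carmine's internal IConnectionPool protocol (get-conn/release-conn); the actual jepsen.redis.client code may differ in details:

```clj
;; Minimal sketch of the single-connection "pool". Assumes Carmine's internal
;; IConnectionPool protocol (get-conn / release-conn) from
;; taoensso.carmine.connections; the real jepsen.redis.client code may differ.
(ns jepsen.redis.client-sketch
  (:require [taoensso.carmine.connections :as conns]))

(defrecord SingleConnectionPool [conn]
  conns/IConnectionPool
  ;; Always hand back the single wrapped connection; nothing is shared.
  (get-conn     [_ spec] conn)
  ;; Releasing is a no-op: the connection lives until the client closes it.
  (release-conn [_ c] nil)
  (release-conn [_ c ex] nil))

(defn single-conn-pool
  "Hypothetical helper: wraps one already-open Carmine connection in its own
  pool, so no two Jepsen workers can ever interleave requests on one socket."
  [conn]
  (SingleConnectionPool. conn))
```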
I wasn't clear: I was referring to the MULTI/EXEC (or more specifically DISCARD) anomaly on the server side. If you don't think it contributes to this in any way, we'll leave that for a bit later.
Ah, yeah, that could be the case! Maybe MULTI triggers some kind of behavior that leaks between connections, or maybe MULTI confuses the client somehow--whatever mutable state the library uses to aggregate results might be getting mixed up. Or maybe the MULTI issues are themselves caused by some kind of state leakage. Could go either way!
OK, two things. One, this looks to be associated with follower-proxy. Two: let's talk about this test run: 20200315T150856.000-0400.zip.

Process 16 tries to perform a single read of key 33:

```clj
2020-03-15 15:10:15,857{GMT} INFO [jepsen worker 16] jepsen.util: 116 :invoke :txn [[:r 33 nil]]
2020-03-15 15:10:15,858{GMT} INFO [jepsen worker 16] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x20e563fb Socket[addr=n2/192.168.122.12,port=6379,localport=48886]], :spec {:host n2, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6c69ce28 java.io.DataInputStream@6c69ce28], :out #object[java.io.BufferedOutputStream 0x501ddbda java.io.BufferedOutputStream@501ddbda]}}, :in-txn? #object[clojure.lang.Atom 0x486dbd67 {:status :ready, :val false}], :spec {:host n2, :port 6379, :timeout-ms 10000}}
2020-03-15 15:10:16,115{GMT} WARN [jepsen worker 16] jepsen.core: Process 116 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 33, :value [3 4 ["2" "4" "21" "22"]]}
```

Here's the corresponding TCP stream (see n2's tcpdump file, in the zip, if you want to check for yourself). There are exactly two Redis messages in this stream--a request from Jepsen to n2 for ... and a

Like the other cases I've seen, this issue happens in a burst, and appears limited to one node. In this test, it started on n2 at 2020-03-15 15:10:15.011:

```clj
2020-03-15 15:10:15,011{GMT} WARN [jepsen worker 6] jepsen.core: Process 56 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long
```

What was going on in the cluster leading up to this point? n1 removed n3:

```clj
2020-03-15 15:09:11,247{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n3"
2020-03-15 15:09:11,675{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n3" {"n1" "OK"}]
```

We isolated n1 n2 n3 | n4 n5:

```clj
2020-03-15 15:09:13,676{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition :majority
2020-03-15 15:09:13,780{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition [:isolated {"n5" #{"n2" "n1" "n3"}, "n4" #{"n2" "n1" "n3"}, "n2" #{"n5" "n4"}, "n1" #{"n5" "n4"}, "n3" #{"n5" "n4"}}]
```

Paused n1, n4, n5:

```clj
2020-03-15 15:09:15,780{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :pause :majority
2020-03-15 15:09:15,884{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :pause {"n5" "", "n4" "", "n1" ""}
```

Resolved the partition:

```clj
2020-03-15 15:09:24,320{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
2020-03-15 15:09:24,525{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition :network-healed
```

Resumed all nodes:

```clj
2020-03-15 15:09:39,388{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :resume :all
2020-03-15 15:09:39,493{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :resume {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
```

n2 declared itself a leader immediately thereafter.

Created a network partition where every node could see a majority, but nobody agreed on what that majority was:

```clj
2020-03-15 15:09:46,556{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition :majorities-ring
2020-03-15 15:09:46,661{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition [:isolated {"n2" #{"n1" "n3"}, "n4" #{"n5" "n1"}, "n3" #{"n2" "n5"}, "n1" #{"n2" "n4"}, "n5" #{"n4" "n3"}}]
```

Killed n3, n4, n5:

```clj
2020-03-15 15:09:51,194{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :kill :majority
2020-03-15 15:09:53,400{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :kill {"n5" "", "n4" "", "n3" ""}
```
Started all nodes:

```clj
2020-03-15 15:09:55,401{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start :all
2020-03-15 15:09:55,607{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
```

We had n2 remove n4:

```clj
2020-03-15 15:09:57,713{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n4"
2020-03-15 15:09:58,141{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n4" {"n2" "OK"}]
```

We had n2 remove n5:

```clj
2020-03-15 15:10:00,246{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n5"
2020-03-15 15:10:00,675{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n5" {"n2" "OK"}]
```

We resolved a network partition about 13 seconds before Things Went Bad:

```clj
2020-03-15 15:10:02,675{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
2020-03-15 15:10:02,881{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition :network-healed
```

We tried to ask n2 to remove a node, but the command timed out, so the result was indeterminate:

```clj
2020-03-15 15:10:11,905{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n2"
indeterminate: Command exited with non-zero status 124 on node n2:
sudo -S -u root bash -c "cd /; timeout 1s /opt/redis/redis-cli RAFT.NODE REMOVE 404091929"
```

n2 became a follower after receiving an appendentries failure, then went through a few elections.

And then the madness started:

```clj
2020-03-15 15:10:15,857{GMT} INFO [jepsen worker 16] jepsen.util: 116 :invoke :txn [[:r 33 nil]]
2020-03-15 15:10:15,858{GMT} INFO [jepsen worker 16] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x20e563fb Socket[addr=n2/192.168.122.12,port=6379,localport=48886]], :spec {:host n2, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6c69ce28 java.io.DataInputStream@6c69ce28], :out #object[java.io.BufferedOutputStream 0x501ddbda java.io.BufferedOutputStream@501ddbda]}}, :in-txn? #object[clojure.lang.Atom 0x486dbd67 {:status :ready, :val false}], :spec {:host n2, :port 6379, :timeout-ms 10000}}
2020-03-15 15:10:16,115{GMT} WARN [jepsen worker 16] jepsen.core: Process 116 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 33, :value [3 4 ["2" "4" "21" "22"]]}
```

During this time frame, n2's logs were:
Update: we don't need process pauses to trigger this; crashes, membership, and partitions are sufficient.
Got a reproduction case with just crashes and partitions; it looks like membership changes aren't necessary. This one took a while, so it's not clear what fault caused it:

```clj
2020-03-17 02:06:30,695{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
...
2020-03-17 02:06:32,695{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition :primaries
...
2020-03-17 02:06:32,901{GMT} INFO [jepsen worker 7] jepsen.util: 857 :invoke :txn [[:r 111 nil] [:append 111 34]]
2020-03-17 02:06:32,901{GMT} INFO [jepsen worker 7] jepsen.redis.client: :multi-discarding
2020-03-17 02:06:32,902{GMT} INFO [jepsen worker 7] jepsen.redis.client: :multi-discarded
2020-03-17 02:06:32,902{GMT} INFO [jepsen worker 7] jepsen.redis.client: :multi-starting
2020-03-17 02:06:32,902{GMT} INFO [jepsen worker 7] jepsen.redis.client: :multi-started
2020-03-17 02:06:32,903{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition [:isolated {"n5" #{"n2"}, "n1" #{"n2"}, "n4" #{"n2"}, "n3" #{"n2"}, "n2" #{"n5" "n1" "n4" "n3"}}]
2020-03-17 02:06:32,903{GMT} INFO [jepsen worker 7] jepsen.redis.append: :multi-exec
...
2020-03-17 02:06:34,904{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
...
2020-03-17 02:06:35,110{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition :network-healed
...
2020-03-17 02:06:37,111{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition :majority
...
2020-03-17 02:06:37,215{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition [:isolated {"n5" #{"n2" "n1" "n3"}, "n4" #{"n2" "n1" "n3"}, "n2" #{"n5" "n4"}, "n1" #{"n5" "n4"}, "n3" #{"n5" "n4"}}]
...
2020-03-17 02:06:39,215{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
...
2020-03-17 02:06:39,420{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition :network-healed
...
2020-03-17 02:06:41,421{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :kill :primaries
...
2020-03-17 02:06:41,577{GMT} INFO [jepsen worker 7] jepsen.redis.append: :multi-execed
2020-03-17 02:06:41,583{GMT} WARN [jepsen worker 7] jepsen.core: Process 857 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long
...
2020-03-17 02:06:43,731{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :kill {"n4" ""}
```

n3, the node that had the issue, was a follower before this time frame, became a leader during it, then stepped down:

```
39380:16 Mar 23:06:34.579 <raftlib> becoming candidate
...
39380:16 Mar 23:06:34.604 <raftlib> becoming leader term:345
...
39380:16 Mar 23:06:40.352 <raftlib> becoming follower
```

Which suggests that, like previous cases, this might have something to do with stepdown behavior?
This no longer looks reproducible with e657423. Nice work!
It looks like clusters with follower-proxy, network partitions, and crashes (or possibly fewer faults; I'm trying to narrow it down) can occasionally get into states where one specific Redis node gets mixed up and sends inappropriate responses to clients. I think what we're seeing is responses intended for other clients getting dispatched to the wrong place.
20200313T233858.000-0400.zip
We resolve a network partition:
The first sign of weirdness comes from worker 5, executing process 130. It tries to execute a MULTI transaction, opens a fresh connection, and gets a single "211" instead of a vector of results from an LRANGE command.
Concurrently, worker 15/process 140 does the same thing:
It's particularly weird that both of them got "211" here.
Concurrently, worker 20/process 145 does a non-MULTI read--just a regular old single LRANGE by itself, on a fresh connection--and gets [2 3 ["226"]], which is... VERY weird.
Same thing happens to worker 10's request with process 160: a fresh connection, a MULTI transaction, and it gets the vector [2 3]--it is a vector, but of integers, not the strings we expected.
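For reference, a non-MULTI read here should just be a single LRANGE through Carmine's wcar, whose reply is a flat vector of strings--never integers or nested vectors like the ones above. A hypothetical sketch, not the exact test code:

```clj
;; Hypothetical sketch of the non-transactional read path (not the literal
;; jepsen.redis.append code): one LRANGE over this client's connection.
(require '[taoensso.carmine :as car])

(defn read-list
  "Reads the whole list stored at key k. Expected reply shape: a vector of
  strings, e.g. [\"2\" \"4\" \"21\"], so [2 3 [\"226\"]] is clearly wrong."
  [conn-opts k]                     ; conn-opts = this client's {:pool ... :spec ...}
  (car/wcar conn-opts
    (car/lrange k 0 -1)))
```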
Immediately after this, we detect the completion of a node remove operation that we'd been waiting on for a while: node n1 finishes removing n5 from the cluster.
Node n2 starts removing n3:
Worker 5, process 180, hits a type error as well--it expects a list of elements from an LRANGE, but gets a single Long instead.
Worker 0, executing process 200, goes to perform a MULTI transaction. It opens a new connection to do so...
That request succeeds with what looks like normal results.
Weirdly, this conflicts with other reads of key 49, so I'm not sure if we got the right thing here or not.
Worker 0 then tries to perform a single read of 48. This doesn't involve an EXEC, so it should have returned a single value. Instead it gets a vector of vectors, which... what?
Eyeballing the vectors it returned, it looks like this is a read of keys [? 49 ?], respectively.

Then worker 0 (logical process 225) goes to perform a transaction:
225 is a new process, so it opens a fresh connection to perform its first request. Just to confirm, yes, this is a new port and connection object!
It performs a MULTI, then a read (LRANGE) of key 49, read of 48, read of 47, and appends (RPUSH) 211 to key 47. It EXECs the transaction, and gets back a vector of responses. The first response should have been a list of values for key 49, but instead we obtained "2", which... what?
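To be concrete, that transaction looks roughly like this on the client side--a sketch with plain Carmine calls, not the literal test code, where conn-opts stands in for this client's single-connection pool and spec:

```clj
;; Sketch of the transaction process 225 issued, written with plain Carmine
;; calls. EXEC's reply should contain one entry per queued command: three
;; vectors of strings for the LRANGEs, then an integer (new length) for RPUSH.
(require '[taoensso.carmine :as car])

(car/wcar conn-opts               ; conn-opts: this client's {:pool ... :spec ...}
  (car/multi)
  (car/lrange 49 0 -1)            ; first EXEC entry: the list at key 49...
  (car/lrange 48 0 -1)
  (car/lrange 47 0 -1)
  (car/rpush 47 211)
  (car/exec))                     ; ...but here it came back as just "2"
```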
Later, we'll read key 49 again and get a weird response:
This read is clearly messed up because we just appended 5 to key 49, and didn't observe it. Also, the append of 12 to key 49 happened after this, and failed. It definitely can't be reading 49.
I feel like this points to wires getting crossed either inside Redis or the Carmine library--or maybe I'm somehow STILL using the library wrong. I think... what I should do next is get a Wireshark dump of the traffic back and forth, and try to figure out what's going on at a protocol level.
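For the capture itself, something along these lines on the node should be enough to grab the raw RESP traffic for Wireshark; the interface and output path here are just placeholders for this setup:

```sh
# Capture all traffic on the Redis port for later analysis in Wireshark/tshark.
# Interface and output path are assumptions; adjust for the actual cluster.
tcpdump -i any -w /tmp/redis-6379.pcap 'tcp port 6379'
```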