
Timeout hang bug #168

Merged: jpittis merged 4 commits into master on Apr 4, 2017

Conversation

@jpittis (Contributor) commented Mar 28, 2017

This is a proposed fix to the API hang bug from issue #159.

Overview of what this PR does:

TL;DR of the bug (described in issue #159):

  1. Toxics that occur before the timeout toxic attempt to forward TCP data to the timeout toxic.
  2. Because the timeout toxic does not read incoming data, these forwarding toxics block forever.
  3. When an API call deletes a toxic, it attempts to interrupt the toxic but this interrupt hangs forever because the toxic is blocked attempting to forward data to the timeout toxic.

TL;DR of the fix (described in a comment on issue #159):

We make the timeout toxic read incoming data so that writers don't block forever; the timeout toxic then ignores this data ("drops it on the ground"). Because that data is dropped, the TCP stream is corrupted, so the connection needs to be closed when the timeout toxic is deleted.
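For context, a minimal sketch of what the fixed Pipe loop might look like. The ToxicStub field and method names used here (Input, Interrupt, Close) are assumed from the discussion; the authoritative version is in this PR's diff.

// Pipe drains and discards incoming data until the timeout fires or the
// toxic is interrupted, so toxics earlier in the chain never block on us.
func (t *TimeoutToxic) Pipe(stub *ToxicStub) {
	timeout := time.Duration(t.Timeout) * time.Millisecond
	if timeout > 0 {
		deadline := time.After(timeout)
		for {
			select {
			case <-deadline:
				stub.Close()
				return
			case <-stub.Interrupt:
				return
			case <-stub.Input:
				// Drop the data on the ground.
			}
		}
	} else {
		// A timeout of 0 never fires; just keep dropping data until interrupted.
		for {
			select {
			case <-stub.Interrupt:
				return
			case <-stub.Input:
				// Drop the data on the ground.
			}
		}
	}
}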

Does this break backwards compatibility?

Yes, but not in realistic use cases. If you remove a timeout toxic, connections to this proxy will be closed. Before this change, data blocked by the timeout toxic would be forwarded after its deletion (assuming the hang bug was not encountered). IMO the new behaviour is the correct behaviour, because a timeout implies the connection will be closed. If you want long delays, use a large value on a latency toxic.

@jpittis requested a review from sirupsen March 28, 2017 22:01
@sirupsen (Contributor) left a comment

Couple of questions, but basically LGTM! 🎉

@@ -26,6 +26,11 @@ type Toxic interface {
	Pipe(*ToxicStub)
}

type CleanupToxic interface {
sirupsen (Contributor):

This is a clean approach. I like it.
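For readers following the thread, here is the interface from the diff context above, together with how the timeout toxic presumably uses it. The Cleanup body below is a sketch of the behaviour described in the PR description (close the connection because dropped data has corrupted the stream), not a verbatim copy of the diff.

// CleanupToxic lets a toxic run extra logic right before it is removed
// from a link.
type CleanupToxic interface {
	Cleanup(*ToxicStub)
}

// The timeout toxic closes the connection on removal: any data it received
// was dropped on the ground, so the stream can no longer be trusted.
func (t *TimeoutToxic) Cleanup(stub *ToxicStub) {
	stub.Close()
}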

				// Drop the data on the ground.
			}
		}
	} else {
sirupsen (Contributor):

Why would there not be a timeout? This is essentially a "drop data" proxy in that case, right? Perhaps it should be created as a different proxy and have a different name? We could also just alias the name.

jpittis (Contributor, Author):

This keeps consistent with the previous implementation of the timeout toxic, where a 0 timeout value immediately stops all data.

I agree that the name timeout might be confusing. I would propose we open an issue unless you are confident in how this should be renamed / changed.

proxy.Toxics.AddToxicJson(ToxicToJson(t, "to_delete", "timeout", "upstream", &toxics.TimeoutToxic{Timeout: 0}))

serverConnRecv := make(chan net.Conn)
go func() {
sirupsen (Contributor):

It almost seems at this point we need a helper for the pattern of starting a server and accepting connections, as it's somewhat common and should be encouraged.

jpittis (Contributor, Author):

I agree. I'm going to refactor this before shipping.
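To make the suggestion concrete, a helper along these lines could capture the pattern. The name StartTestServer and its signature are hypothetical, invented for illustration rather than taken from the refactor that eventually landed.

package testhelper

import (
	"net"
	"testing"
)

// StartTestServer listens on an ephemeral local port and delivers the first
// accepted connection on a channel, replacing the listen/accept goroutine
// boilerplate that several tests currently repeat.
func StartTestServer(t *testing.T) (net.Listener, <-chan net.Conn) {
	ln, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		t.Fatal("Failed to start test server:", err)
	}
	conns := make(chan net.Conn, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		conns <- conn
	}()
	return ln, conns
}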

	t.Fatal("Unable to write to proxy", err)
}

time.Sleep(1 * time.Second) // Shitty sync waiting for the latency toxic to get data.
sirupsen (Contributor):

Why is it necessary to wait for the latency toxic to get data before shutting it down? Isn't this a race that could have problems?

jpittis (Contributor, Author):

This frustrates me.

Our bug only occurs when the latency toxic is blocked attempting to send data to the timeout toxic. Without the sleep, the test removes the latency toxic before it receives data, which means the bug does not occur because the latency toxic is not blocking.

I can't think of a non-racy way to wait for the latency toxic to receive input data. We can't wait for the upstream to receive the data, because the timeout toxic is blocking data from reaching the upstream.


link_test.go (Outdated)
n, err = link.output.Read(buf)
if n != 0 || err != io.EOF {
	t.Fatalf("Read did not get EOF: %d %v", n, err)
done := make(chan struct{})
jpittis (Contributor, Author):

I'm also going to refactor this into a timeout test helper.

@jpittis (Contributor, Author) commented Apr 4, 2017

@sirupsen: d9e6d26 and 6b5d43d add test helpers. I decided to stick the TimeoutAfter helper into a generalized testhelper package because I find that the timeout pattern comes up all the time when testing concurrent Go code.
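Based on that description, TimeoutAfter presumably looks roughly like the sketch below (not the exact code from those commits). A test wraps the call that used to hang and fails if it doesn't return within the given duration.

package testhelper

import (
	"fmt"
	"time"
)

// TimeoutAfter runs f in a goroutine and returns an error if it does not
// finish within the given duration.
func TimeoutAfter(after time.Duration, f func()) error {
	done := make(chan struct{})
	go func() {
		defer close(done)
		f()
	}()
	select {
	case <-done:
		return nil
	case <-time.After(after):
		return fmt.Errorf("timed out after %v", after)
	}
}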

@jpittis requested a review from sirupsen April 4, 2017 20:15
@jpittis mentioned this pull request Apr 4, 2017
@jpittis merged commit 476c3aa into master Apr 4, 2017