
Timeout hang bug #168

Merged: jpittis merged 4 commits into master on Apr 4, 2017

Conversation

@jpittis (Contributor) commented Mar 28, 2017

This is a proposed fix to the API hang bug from issue #159.

Overview of what this PR does:

TL;DR of the bug (described in issue #159):

  1. Toxics that occur before the timeout toxic attempt to forward TCP data to the timeout toxic.
  2. Because the timeout toxic does not read incoming data, these forwarding toxics block forever.
  3. When an API call deletes a toxic, it attempts to interrupt the toxic but this interrupt hangs forever because the toxic is blocked attempting to forward data to the timeout toxic.

TL;DR of the fix (described in a comment on issue #159):

We make the timeout toxic read incoming data so that writers don't block forever; the timeout toxic then ignores this data ("drops it on the ground"). Because that data is dropped, the TCP stream is corrupted, so the connection needs to be closed when the timeout toxic is deleted.
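For context, a minimal sketch of what the fixed Pipe loop might look like. The ToxicStub field and method names used here (Input, Interrupt, Close) are assumed from the discussion; the authoritative version is in this PR's diff.

// Pipe drains and discards incoming data until the timeout fires or the
// toxic is interrupted, so toxics earlier in the chain never block on us.
func (t *TimeoutToxic) Pipe(stub *ToxicStub) {
	timeout := time.Duration(t.Timeout) * time.Millisecond
	if timeout > 0 {
		deadline := time.After(timeout)
		for {
			select {
			case <-deadline:
				stub.Close()
				return
			case <-stub.Interrupt:
				return
			case <-stub.Input:
				// Drop the data on the ground.
			}
		}
	} else {
		// A timeout of 0 never fires; just keep dropping data until interrupted.
		for {
			select {
			case <-stub.Interrupt:
				return
			case <-stub.Input:
				// Drop the data on the ground.
			}
		}
	}
}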

Does this break backwards compatibility?

Yes, but not in realistic use cases. If you remove a timeout toxic, connections to this proxy will be closed. Before this change, data blocked by the timeout toxic would be forwarded after its deletion (assuming the hang bug was not encountered). IMO the new behaviour is the correct behaviour, because a timeout implies the connection will be closed. If you want long delays, use a large value on a latency toxic.

@jpittis requested a review from sirupsen March 28, 2017 22:01
@sirupsen (Contributor) left a comment

Couple of questions, but basically LGTM! 🎉

@@ -26,6 +26,11 @@ type Toxic interface {
	Pipe(*ToxicStub)
}

type CleanupToxic interface {
sirupsen (Contributor):

This is a clean approach. I like it.
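For readers following the thread, here is the interface from the diff context above, together with how the timeout toxic presumably uses it. The Cleanup body below is a sketch of the behaviour described in the PR description (close the connection because dropped data has corrupted the stream), not a verbatim copy of the diff.

// CleanupToxic lets a toxic run extra logic right before it is removed
// from a link.
type CleanupToxic interface {
	Cleanup(*ToxicStub)
}

// The timeout toxic closes the connection on removal: any data it received
// was dropped on the ground, so the stream can no longer be trusted.
func (t *TimeoutToxic) Cleanup(stub *ToxicStub) {
	stub.Close()
}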

				// Drop the data on the ground.
			}
		}
	} else {
sirupsen (Contributor):

Why would there not be a timeout? This is essentially a "drop data" proxy in that case, right? Perhaps it should be created as a different proxy and have a different name? We could also just alias the name.

jpittis (Contributor, Author):

This keeps consistent with the previous implementation of the timeout toxic, where a 0 timeout value immediately stops all data.

I agree that the name timeout might be confusing. I would propose we open an issue unless you are confident in how this should be renamed / changed.

proxy.Toxics.AddToxicJson(ToxicToJson(t, "to_delete", "timeout", "upstream", &toxics.TimeoutToxic{Timeout: 0}))

serverConnRecv := make(chan net.Conn)
go func() {
sirupsen (Contributor):

It almost seems at this point we need a helper for the pattern of starting a server and accepting connections, as it's somewhat common and should be encouraged.

jpittis (Contributor, Author):

I agree. I'm going to refactor this before shipping.
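To make the suggestion concrete, a helper along these lines could capture the pattern. The name StartTestServer and its signature are hypothetical, invented for illustration rather than taken from the refactor that eventually landed.

package testhelper

import (
	"net"
	"testing"
)

// StartTestServer listens on an ephemeral local port and delivers the first
// accepted connection on a channel, replacing the listen/accept goroutine
// boilerplate that several tests currently repeat.
func StartTestServer(t *testing.T) (net.Listener, <-chan net.Conn) {
	ln, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		t.Fatal("Failed to start test server:", err)
	}
	conns := make(chan net.Conn, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		conns <- conn
	}()
	return ln, conns
}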

	t.Fatal("Unable to write to proxy", err)
}

time.Sleep(1 * time.Second) // Shitty sync waiting for the latency toxic to get data.
sirupsen (Contributor):

Why is it necessary to wait for the latency toxic to get data before shutting it down? Isn't this a race that could have problems?

jpittis (Contributor, Author):

This frustrates me.

Our bug only occurs when the latency toxic is blocked attempting to send data to the timeout toxic. Without the sleep, the test removes the latency toxic before it receives data, which means the bug does not occur because the latency toxic is not blocking.

I can't think of a non-racy way to wait for the latency toxic to receive input data. We can't wait for the upstream to receive the data, because the timeout toxic is blocking data from reaching the upstream.


link_test.go (Outdated)
n, err = link.output.Read(buf)
if n != 0 || err != io.EOF {
	t.Fatalf("Read did not get EOF: %d %v", n, err)
done := make(chan struct{})
jpittis (Contributor, Author):

I'm also going to refactor this into a timeout test helper.

@jpittis (Contributor, Author) commented Apr 4, 2017

@sirupsen: d9e6d26 and 6b5d43d add test helpers. I decided to stick the TimeoutAfter helper into a generalized testhelper package because I find that the timeout pattern comes up all the time when testing concurrent Go code.
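Based on that description, TimeoutAfter presumably looks roughly like the sketch below (not the exact code from those commits). A test wraps the call that used to hang and fails if it doesn't return within the given duration.

package testhelper

import (
	"fmt"
	"time"
)

// TimeoutAfter runs f in a goroutine and returns an error if it does not
// finish within the given duration.
func TimeoutAfter(after time.Duration, f func()) error {
	done := make(chan struct{})
	go func() {
		defer close(done)
		f()
	}()
	select {
	case <-done:
		return nil
	case <-time.After(after):
		return fmt.Errorf("timed out after %v", after)
	}
}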

@jpittis requested a review from sirupsen April 4, 2017 20:15
@jpittis mentioned this pull request Apr 4, 2017
@jpittis merged commit 476c3aa into master Apr 4, 2017