Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

toxiproxy api hangs when trying to do anything after a long wait, run from inside testcontainers #254

Closed
astafev opened this issue Jun 19, 2019 · 18 comments
Assignees

Comments

@astafev
Copy link

astafev commented Jun 19, 2019

Hello, I'm facing a weird behaviour: toxirpoxy doesn't respond to API calls, the connection just hangs and nothing happens.

Here's the setup I'm currently using toxyproxi in:
There're few containers run inside docker with testcontiners in a single docker network. Toxiproxy is used to create troubles communicating between components. So, basically, it's between all components. And when the following scenario is used:

  1. cut connection to the postgresql db for one of the components
  2. wait for 60 seconds
  3. restore the connection (or reset proxy or perform any other action by API, including GET requests).

Toxiproxy is run inside docker, tested with images shopify/toxiproxy:2.1.4 and shopify/toxiproxy:2.1.0. Below are the logs from the docker container. Please let me know if I can provide more information.

time="2019-06-19T12:09:40Z" level="info" msg="API HTTP server starting" host="0.0.0.0" port="8474" version="2.1.4"
time="2019-06-19T12:09:40Z" level="info" msg="Started proxy" name="postgres:5432" proxy="[::]:8666" upstream="postgres:5432"
time="2019-06-19T12:09:55Z" level="info" msg="Started proxy" proxy="[::]:8667" upstream="kafka:9093" name="kafka:9093"
time="2019-06-19T12:09:55Z" level="info" msg="Started proxy" name="kafka:9092" proxy="[::]:8668" upstream="kafka:9092"
time="2019-06-19T12:09:55Z" level="info" msg="Started proxy" name="zookeeper:2181" proxy="[::]:8669" upstream="zookeeper:2181"
time="2019-06-19T12:10:03Z" level="info" msg="Accepted client" client=192.168.16.5:38948 proxy="[::]:8669" upstream="zookeeper:2181" name="zookeeper:2181"
time="2019-06-19T12:10:03Z" level="warning" msg="Source terminated" err=read tcp 192.168.16.2:8669->192.168.16.5:38948: use of closed network connection name="zookeeper:2181" bytes=61
time="2019-06-19T12:10:04Z" level="info" msg="Accepted client" proxy="[::]:8669" upstream="zookeeper:2181" name="zookeeper:2181" client=192.168.16.5:38952
time="2019-06-19T12:10:06Z" level="info" msg="Accepted client" client=192.168.16.5:47974 proxy="[::]:8668" upstream="kafka:9092" name="kafka:9092"
time="2019-06-19T12:10:07Z" level="info" msg="Accepted client" upstream="kafka:9092" name="kafka:9092" client=192.168.16.5:47978 proxy="[::]:8668"
time="2019-06-19T12:10:07Z" level="info" msg="Accepted client" upstream="kafka:9092" name="kafka:9092" client=192.168.16.5:47982 proxy="[::]:8668"
time="2019-06-19T12:10:07Z" level="warning" msg="Source terminated" bytes=605 err=read tcp 192.168.16.2:53148->192.168.16.5:9092: use of closed network connection name="kafka:9092"
time="2019-06-19T12:10:07Z" level="warning" msg="Source terminated" name="kafka:9092" bytes=387 err=read tcp 192.168.16.2:53144->192.168.16.5:9092: use of closed network connection
time="2019-06-19T12:10:11Z" level="info" msg="Started proxy" upstream="schema-registry:8081" name="schema-registry:8081" proxy="[::]:8670"
time="2019-06-19T12:10:18Z" level="info" msg="Accepted client" proxy="[::]:8669" upstream="zookeeper:2181" name="zookeeper:2181" client=192.168.16.6:44368
time="2019-06-19T12:10:18Z" level="warning" msg="Source terminated" name="zookeeper:2181" bytes=61 err=read tcp 192.168.16.2:43454->192.168.16.4:2181: use of closed network connection
time="2019-06-19T12:10:19Z" level="info" msg="Accepted client" upstream="zookeeper:2181" name="zookeeper:2181" client=192.168.16.6:44372 proxy="[::]:8669"
time="2019-06-19T12:10:19Z" level="warning" msg="Source terminated" name="zookeeper:2181" bytes=61 err=read tcp 192.168.16.2:43458->192.168.16.4:2181: use of closed network connection
time="2019-06-19T12:10:19Z" level="info" msg="Accepted client" name="zookeeper:2181" client=192.168.16.6:44376 proxy="[::]:8669" upstream="zookeeper:2181"
time="2019-06-19T12:10:19Z" level="warning" msg="Source terminated" name="zookeeper:2181" bytes=625 err=read tcp 192.168.16.2:43462->192.168.16.4:2181: use of closed network connection
time="2019-06-19T12:10:19Z" level="info" msg="Accepted client" upstream="kafka:9092" name="kafka:9092" client=192.168.16.6:56448 proxy="[::]:8668"
time="2019-06-19T12:10:19Z" level="info" msg="Accepted client" name="kafka:9092" client=192.168.16.6:56452 proxy="[::]:8668" upstream="kafka:9092"
time="2019-06-19T12:10:20Z" level="warning" msg="Source terminated" name="kafka:9092" bytes=351 err=read tcp 192.168.16.2:53172->192.168.16.5:9092: use of closed network connection
time="2019-06-19T12:10:20Z" level="warning" msg="Source terminated" err=read tcp 192.168.16.2:53168->192.168.16.5:9092: use of closed network connection name="kafka:9092" bytes=351
time="2019-06-19T12:10:21Z" level="info" msg="Accepted client" name="zookeeper:2181" client=192.168.16.6:44388 proxy="[::]:8669" upstream="zookeeper:2181"
time="2019-06-19T12:10:22Z" level="warning" msg="Source terminated" name="zookeeper:2181" bytes=209 err=read tcp 192.168.16.2:43474->192.168.16.4:2181: use of closed network connection
time="2019-06-19T12:10:22Z" level="info" msg="Accepted client" client=192.168.16.6:44392 proxy="[::]:8669" upstream="zookeeper:2181" name="zookeeper:2181"
time="2019-06-19T12:10:22Z" level="info" msg="Accepted client" upstream="zookeeper:2181" name="zookeeper:2181" client=192.168.16.6:44396 proxy="[::]:8669"
time="2019-06-19T12:10:22Z" level="warning" msg="Source terminated" bytes=440 err=read tcp 192.168.16.2:43482->192.168.16.4:2181: use of closed network connection name="zookeeper:2181"
time="2019-06-19T12:10:22Z" level="info" msg="Accepted client" client=192.168.16.6:56468 proxy="[::]:8668" upstream="kafka:9092" name="kafka:9092"
time="2019-06-19T12:10:22Z" level="info" msg="Accepted client" name="kafka:9092" client=192.168.16.6:56472 proxy="[::]:8668" upstream="kafka:9092"
time="2019-06-19T12:10:22Z" level="warning" msg="Source terminated" bytes=520 err=read tcp 192.168.16.2:53192->192.168.16.5:9092: use of closed network connection name="kafka:9092"
time="2019-06-19T12:10:22Z" level="warning" msg="Source terminated" err=read tcp 192.168.16.2:53188->192.168.16.5:9092: use of closed network connection name="kafka:9092" bytes=351
time="2019-06-19T12:10:23Z" level="info" msg="Accepted client" upstream="kafka:9092" name="kafka:9092" client=192.168.16.6:56480 proxy="[::]:8668"
time="2019-06-19T12:10:23Z" level="info" msg="Accepted client" name="kafka:9092" client=192.168.16.6:56488 proxy="[::]:8668" upstream="kafka:9092"
time="2019-06-19T12:10:23Z" level="info" msg="Accepted client" name="kafka:9092" client=192.168.16.6:56492 proxy="[::]:8668" upstream="kafka:9092"
time="2019-06-19T12:10:23Z" level="info" msg="Accepted client" name="kafka:9092" client=192.168.16.6:56496 proxy="[::]:8668" upstream="kafka:9092"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" upstream="postgres:5432" name="postgres:5432" client=192.168.16.1:54766 proxy="[::]:8666"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" name="postgres:5432" client=192.168.16.1:54770 proxy="[::]:8666" upstream="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" client=192.168.16.1:54774 proxy="[::]:8666" upstream="postgres:5432" name="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" proxy="[::]:8666" upstream="postgres:5432" name="postgres:5432" client=192.168.16.1:54778
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" name="postgres:5432" client=192.168.16.1:54782 proxy="[::]:8666" upstream="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" client=192.168.16.1:54786 proxy="[::]:8666" upstream="postgres:5432" name="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" client=192.168.16.1:54790 proxy="[::]:8666" upstream="postgres:5432" name="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" name="postgres:5432" client=192.168.16.1:54794 proxy="[::]:8666" upstream="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" name="postgres:5432" client=192.168.16.1:54798 proxy="[::]:8666" upstream="postgres:5432"
time="2019-06-19T12:10:26Z" level="info" msg="Accepted client" upstream="postgres:5432" name="postgres:5432" client=192.168.16.1:54802 proxy="[::]:8666"
time="2019-06-19T12:10:29Z" level="info" msg="Accepted client" name="kafka:9093" client=192.168.16.1:33124 proxy="[::]:8667" upstream="kafka:9093"
time="2019-06-19T12:10:29Z" level="info" msg="Accepted client" name="kafka:9093" client=192.168.16.1:33128 proxy="[::]:8667" upstream="kafka:9093"
time="2019-06-19T12:10:29Z" level="warning" msg="Source terminated" name="kafka:9093" bytes=618 err=read tcp 192.168.16.2:45516->192.168.16.5:9093: use of closed network connection
time="2019-06-19T12:10:29Z" level="warning" msg="Source terminated" name="kafka:9093" bytes=351 err=read tcp 192.168.16.2:45512->192.168.16.5:9093: use of closed network connection
time="2019-06-19T12:10:29Z" level="info" msg="Accepted client" client=192.168.16.1:54814 proxy="[::]:8666" upstream="postgres:5432" name="postgres:5432"
time="2019-06-19T12:10:29Z" level="info" msg="Accepted client" proxy="[::]:8666" upstream="postgres:5432" name="postgres:5432" client=192.168.16.1:54818
time="2019-06-19T12:11:31Z" level="warning" msg="Source terminated" bytes=557 err=read tcp 192.168.16.2:8666->192.168.16.1:54814: use of closed network connection name="postgres:5432"
time="2019-06-19T12:11:31Z" level="warning" msg="Destination terminated" name="postgres:5432" bytes=479 err=readfrom tcp 192.168.16.2:45194->192.168.16.3:5432: write tcp 192.168.16.2:45194->192.168.16.3:5432: write: broken pipe
@jpittis
Copy link
Contributor

jpittis commented Jun 21, 2019

Hi @astafev! I don't have time to help you debug this, but would you be able to share a bit more information in case someone else can help?

Would you be able to add:

  • Which OS you're running on.
  • The exact set of Toxiproxy CLI commands you're using to produce this hang. (Starting with the Toxiproxy boot until the final API command you run which hangs.)
  • It would also be useful to know the state of the hung connection. Did it reach the Toxiproxy server and hang somewhere inside the server logic, or is it some kind of network problem. You may or may not know how to gather this information so no big deal if you don't.

If someone has time to help debug, I'd suggest:

  • Trying to re-produce @astafev's hang using the exact set of Toxiproxy CLI commands he's using. (Likely we can sub the postgres part for another client like Redis or netcat.)
  • If the hang seems to originate from Toxiproxy server, then use printf debugging and pprof goroutine dumps to figure out why.
  • If the hang doesn't seem to originate from Toxiproxy server, this will likely be very hard to debug because it would likely depend on the host environment.

Thanks!

@astafev
Copy link
Author

astafev commented Jul 15, 2019

hi, @jpittis. I'm using testcontainers on windows. So, toxiproxy is run in docker using image shopify/toxiproxy:2.1.0. I don't know how to track the exact set of CLI commands on Toxiproxy side, but in general I'm doing the following:

  1. create a bunch of proxies,
  2. cut the connection of one of them (if I'm not mistaken it doesn'),
  3. wait for 60 seconds (just wait, no commands during that period),
  4. try to do anything, I want to restore connection but even just listing proxies causes the connection to hang.
    If I wait for 59 seconds, everything works fine...
    As for connection state, I'm pretty sure that it reaches the server, because I was executing the last commands from cli also and there was nothing in between, the behaviour was the same (works fine until 60 seconds, then just hangs).

Please let me know if there's some debug mode for toxiproxy so I could give you enough info from the output. I'm not a go developer, so it'd be difficult for me to make the modifications myself.

@adambair
Copy link

adambair commented Dec 6, 2019

I'm also having this same issue. After creating a simple toxic, running traffic through it, upon attempting to delete the toxic -- the toxiproxy api hangs and I see this in the log. I can't list proxies or toxics using curl or do any kind of interaction with the api until I restart toxiproxy. It just hangs.

I'm using Arch Linux and the shopify/toxiproxy image. Toxiproxy is operating alongside a pile of other docker containers.

It may be worth noting - I'm using https://github.com/mageddo/dns-proxy-server for local docker dns resolution.

That said, via curl (on the host machine) I can add/remove proxies and toxics without issue until I make a simple http DELETE request to toxiproxy's api from within Python using the requests lib (using the same url). Python is running in a separate container.

toxiproxy logs:

toxiproxy         | time="2019-12-06T21:00:41Z" level="info" msg="Accepted client" upstream="kafka-2:29092" name="kafka-2" client=172.19.0.8:42768 proxy="[::]:29092"
toxiproxy         | time="2019-12-06T21:00:41Z" level="info" msg="Accepted client" proxy="[::]:39092" upstream="kafka-3:39092" name="kafka-3" client=172.19.0.8:34244
toxiproxy         | time="2019-12-06T21:00:42Z" level="info" msg="Accepted client" name="kafka-1" client=172.19.0.8:60418 proxy="[::]:19092" upstream="kafka-1:19092"
toxiproxy         | time="2019-12-06T21:00:58Z" level="warning" msg="Source terminated" name="kafka-2" bytes=3843 err=read tcp 172.19.0.7:59734->172.19.0.9:29092: use of closed network connection
toxiproxy         | time="2019-12-06T21:00:58Z" level="warning" msg="Source terminated" name="kafka-1" bytes=2534 err=read tcp 172.19.0.7:41328->172.19.0.12:19092: use of closed network connection
toxiproxy         | time="2019-12-06T21:00:58Z" level="warning" msg="Source terminated" name="kafka-3" bytes=2534 err=read tcp 172.19.0.7:34216->172.19.0.11:39092: use of closed network connection
toxiproxy         | time="2019-12-06T21:00:58Z" level="info" msg="Accepted client" client=172.19.0.8:42788 proxy="[::]:29092" upstream="kafka-2:29092" name="kafka-2"
toxiproxy         | time="2019-12-06T21:00:58Z" level="info" msg="Accepted client" upstream="kafka-1:19092" name="kafka-1" client=172.19.0.8:60432 proxy="[::]:19092"
toxiproxy         | time="2019-12-06T21:00:58Z" level="info" msg="Accepted client" name="kafka-3" client=172.19.0.8:34264 proxy="[::]:39092" upstream="kafka-3:39092"
toxiproxy         | time="2019-12-06T21:00:58Z" level="info" msg="Accepted client" upstream="kafka-2:29092" name="kafka-2" client=172.19.0.8:42800 proxy="[::]:29092"
toxiproxy         | time="2019-12-06T21:01:03Z" level="warning" msg="Source terminated" name="kafka-2" bytes=1941 err=read tcp 172.19.0.7:59766->172.19.0.9:29092: use of closed network connection
toxiproxy         | time="2019-12-06T21:01:03Z" level="warning" msg="Source terminated" name="kafka-3" bytes=4423540 err=read tcp 172.19.0.7:34236->172.19.0.11:39092: use of closed network connection
toxiproxy         | time="2019-12-06T21:01:03Z" level="warning" msg="Source terminated" name="kafka-1" bytes=4423286 err=read tcp 172.19.0.7:41346->172.19.0.12:19092: use of closed network connection
toxiproxy         | time="2019-12-06T21:01:03Z" level="warning" msg="Source terminated" name="kafka-2" bytes=5737187 err=read tcp 172.19.0.7:59754->172.19.0.9:29092: use of closed network connection
toxiproxy         | time="2019-12-06T21:01:03Z" level="warning" msg="Destination terminated" err=readfrom tcp 172.19.0.7:19092->172.19.0.8:60432: write tcp 172.19.0.7:19092->172.19.0.8:60432: write: broken pipe name="kafka-1" bytes=4423219

toxic:

{
  "name": "kafka-1",
  "listen": "[::]:19092",
  "upstream": "kafka-1:19092",
  "enabled": true,
  "toxics": [
    {
      "attributes": {
        "latency": 500,
        "jitter": 0
      },
      "name": "kafka-1 downstream latency",
      "type": "latency",
      "stream": "downstream",
      "toxicity": 1
    }
  ]
}

@adambair
Copy link

adambair commented Dec 6, 2019

Figured it out.

Seems like something might be wonky inside toxiproxy regarding url parsing and spaces when deleting a toxic...

I changed the name from kafka-1 downstream latency to downstream_latency and everything works great now. No hangs, everything working as expected. And yes, I was sending the name url encoded when making the DELETE request.

Hopefully someone finds this useful in their Googling and saves them a few days of their life ;)

@kwart
Copy link

kwart commented Jul 29, 2022

We also hit this problem in one of Hazelcast enterprise tests.
When we are deleting a toxic it hangs. (We are running the toxiproxy through the testcontainers.)

I spent some time investigating the issue and it seems as a race condition between
https://github.com/Shopify/toxiproxy/blob/v2.4.0/link.go#L231

and https://github.com/Shopify/toxiproxy/blob/v2.4.0/toxics/toxic.go#L93-L99 (which is called from the https://github.com/Shopify/toxiproxy/blob/v2.4.0/link.go#L201)

kwart added a commit to kwart/test-app that referenced this issue Aug 3, 2022
kwart added a commit to kwart/test-app that referenced this issue Aug 3, 2022
@kwart
Copy link

kwart commented Aug 3, 2022

Adding a reproducer from this branch:

Run it with

mvn clean install

Prerequisities: Java 8+, Maven, Docker

Expected behavior
The AppTest finishes in about 5 minutes (if the Docker images are already pulled). The log will contain:

cutting connections
connections were cut
removing a toxic
removing a toxic
removing a toxic
all toxics were removed

Actual behavior
The AppTest hangs during the "removing a toxic" part. You won't see the "all toxics were removed".

@Wrent
Copy link

Wrent commented Aug 24, 2022

Experiencing the same problem as mentioned by @kwart with "org.testcontainers:toxiproxy:1.17.3" and "ghcr.io/shopify/toxiproxy:2.4.0" - when trying to uncut a connection it hangs on this line: https://github.com/testcontainers/testcontainers-java/blob/7e364891eed1eda2fdc1cdfea7e5facaccc7b025/modules/toxiproxy/src/main/java/org/testcontainers/containers/ToxiproxyContainer.java#L180

@mikael-carlstedt
Copy link

I am experiencing the same issue with toxiproxy-server 2.3 running on a CentOS Linux ec2 instance, so no docker networking involved. Quite a lot of traffic going over that proxy at the time when the toxicity is removed, so that race condition identified by @kwart looks like a very promising lead.

@miry miry added this to the 2.5.0 milestone Aug 28, 2022
@miry
Copy link
Contributor

miry commented Aug 28, 2022

Hi.

I tried to reproduce the problem with command line or golang tests.
Example of my sample test: test
Could not find any problems with blocking requests.
I am still loooking to the code and possible problems.

Mean time I am working on providing more o11y to the API server and removing toxic code.
In current master there are changes that would allow to log API HTTP requests.

UPDATE 2022-08-28:
I reproduced some bad behavior.
In my experiments my go client got error:

ERROR: proxy[pg_2]: could not remove toxic pg_2 latency_upstream97: Delete "http://localhost:8474/proxies/pg_2/toxics/pg_2%20latency_upstream97": read tcp 127.0.0.1:55928->127.0.0.1:8474: read: connection reset by peer

@mikael-carlstedt
Copy link

Seeing this warning again in the toxiproxy server log at the same time as submitting the toxicity delete that hangs:
Aug 30 09:16:23 us1tst-cnxms002.int.clxnetworks.net toxiproxy-server[1135]: time="2022-08-30T09:16:23Z" level=warning msg="Destination terminated" bytes=105348152 err="readfrom tcp 10.96.147.83:9094->10.96.140.179:51832: write tcp 10.96.147.83:9094->10.96.140.179:51832: write: broken pipe" name="10.96.147.83:9094_us1tst-cnxms002.int.clxnetworks.net"
This is related to traffic going over that proxy, and I'm not seeing this warning at the time when I successfully delete a toxicity on another toxiproxy server with the same setup (but I do see the warning at other times).

@mikael-carlstedt
Copy link

I got a stack trace of the hanging toxiproxy server.
toxiproxy-stacktrace.txt

@kwart
Copy link

kwart commented Aug 30, 2022

Here is a shell+docker version of my reproducer: reproducer.zip

@miry
Copy link
Contributor

miry commented Sep 2, 2022

Thank you @mikael-carlstedt and @kwart.
It helped me to reproduce the problem.

@miry miry pinned this issue Sep 2, 2022
@miry
Copy link
Contributor

miry commented Sep 3, 2022

Hi.

I want to share some status.
I have mittigated the problem of deadlock when destination is not available and still has some packets in buffer.

Made changes in my branch: locking-problem.

It contains a lot of cosmetic changes and debugging statements.

Next steps:

  1. Split changes to small PRs.
  2. Test changes in production environment.
  3. Release a new version.

@miry miry mentioned this issue Sep 4, 2022
13 tasks
@miry
Copy link
Contributor

miry commented Sep 7, 2022

Here is fix #436

@miry
Copy link
Contributor

miry commented Sep 10, 2022

@mikael-carlstedt @kwart @Wrent I have released a new version https://github.com/Shopify/toxiproxy/releases/tag/v2.5.0

Let me know if it works for you.

@miry miry added the question label Sep 10, 2022
@kwart
Copy link

kwart commented Sep 10, 2022

It works in Hazelcast's tests. Thank you for the fix, @miry.

@miry miry closed this as completed Sep 11, 2022
@miry miry unpinned this issue Sep 11, 2022
@mikael-carlstedt
Copy link

Works for me too! Many thanks for the quick resolution @miry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants