Integration tests fail due to docker-compose pull timeout #56108
I don't understand how this is possible in https://s3.amazonaws.com/clickhouse-test-reports/56030/331f661322ee0b12ec41cec8cba36b9973a6aa5a/integration_tests__asan__analyzer__[1_6]/integration_run_parallel3_0.log
Every image should be ready in the pulling stage: https://github.com/ClickHouse/ClickHouse/actions/runs/6657903083/job/18096027453#step:6:102. I think it's the same in every place, like in #56082.
Do we run …?
We currently don't pre-pull clickhouse-server images, which AFAIK are the ones timing out.
Let's generate a compose file with all the images used in tests? Then they would be pre-pulled too.
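A hedged sketch of how that generation could look, as a shell one-off; the scanned path and service names are illustrative assumptions, not the actual CI layout:

```sh
# collect every image referenced by the test compose files and emit a compose
# file whose only job is to be pre-pulled; the scanned path is an assumption
grep -rhoP 'image:\s*\K\S+' tests/integration/ | sort -u |
  awk 'BEGIN { print "services:" }
       { printf "  prepull%d:\n    image: %s\n", NR, $0 }' > docker-compose.prepull.yml
docker compose -f docker-compose.prepull.yml pull
```

`docker compose pull` fetches the image of every service in the file, so adding this generated file to the pre-pull stage would cover every image the tests reference.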
I caught an issue that looks exactly like that. It's on one of the docker proxies.

Neither playing with docker nor with the container helped. Now the proxy works again. Next time I'll try to remove it from the balancer and see what's going on in detail.
Everything for Nov 15 is cancelled jobs.
Another big bunch of failures is related to the same host as last time. Now I've detached the failed node and taken a look at different metrics. It's interesting: we can actually see the problem. Not sure yet what the actual issue is, but setting up a notification based on the reset rate is the next thing I'll do.
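A minimal sketch of what such a reset-rate probe could look like, assuming a cron-driven shell check with `nstat`; the threshold and the notification hook are illustrative:

```sh
# nstat reports counter deltas since its previous invocation, so running this
# from cron yields a per-interval reset rate; 1000 is an arbitrary threshold
resets=$(nstat -z TcpOutRsts | awk '$1 == "TcpOutRsts" { print $2 }')
if [ "${resets:-0}" -gt 1000 ]; then
    echo "TCP reset rate is high: $resets resets since the last check" >&2
    # wire the real notification (Slack webhook, PagerDuty, ...) in here
fi
```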
@Felixoid, integrate aggressive timeouts into docker pull. For example, if it doesn't succeed in five minutes, interrupt and start over again, with up to five retries. You can use the …
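The tool reference above was cut off; one plain way to express the idea is coreutils `timeout` in a retry loop (image name and limits are illustrative):

```sh
# give each pull five minutes; on timeout or failure, start over, up to five times
pull_with_retries() {
    local image=$1 attempt
    for attempt in 1 2 3 4 5; do
        timeout 300 docker pull "$image" && return 0
        echo "pull of $image failed or timed out (attempt $attempt)" >&2
    done
    return 1
}

pull_with_retries clickhouse/integration-test   # illustrative image name
```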
It doesn't make sense to retry from the same host in this case. Nothing will help if the docker proxy is slow.
So, yesterday I set up the notifications for reset metrics; that's the first thing to do. Second, this looks like something correlated with long uptime, so this week I'll think about how to identify a safe time to reboot the nodes once a week.
The flaky check does not have pre-pull: ClickHouse/tests/integration/ci-runner.py, lines 849 to 857 at 2150308.
I guess the reason is that it does not need to pre-pull all images. Well, there is code that pulls images: ClickHouse/tests/ci/integration_test_check.py, line 218 at 2150308.
@mkaynov it's a good point, we shouldn't pull images in the CI script, only the image … Regarding the rest, yes, that's the point. We would spend too much time and money on completely unnecessary downloads. I don't think pytest could provide the list of necessary test images in advance.
@Felixoid, it always makes sense to retry. The software, and the network infrastructure as well, can be not just uniformly slow but also bug-ridden. For example, the Docker proxy can accept a connection and then go into a deadlock on that connection while happily processing another one. Whatever the Docker proxy is, I have no reason to trust it.
That's why we already retry enough. But we've reached the network bandwidth throttle on EC2 instances, and retries in this case are 146% useless, because there's only one poor weak host behind the load balancer, doing its best at 128 Mbps to serve huge integration test images to our runners. And it fails, fails drastically! And it's very sad. So let's not retry even harder and upset it even more. Ahem, jokes aside, there are two points: …
Another action I've taken is to set … It was done 20 minutes ago. Let's see if it helps avoid the blob storage issue. The issue this morning was another big wave of resets. Maybe I'll take a look at eBPF hooks.
I am desperate enough to try this solution: rpardini/docker-registry-proxy@master...coreweave:docker-registry-proxy:coreweave. Looks simple, so it could actually work nicely.

update: Yeah, aha... no way it will work for us. If the proxy is down, docker gets stuck completely, refusing to even try to bypass it. Not an option for us.
Together with support, we narrowed down the issue to the OS. And there are the following lines in the syslog:
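The syslog excerpt itself was lost here; judging by the tcp_mem article referenced below, it was presumably the kernel's TCP memory-pressure warning, which can be checked for like this:

```sh
# look for the TCP memory-pressure warning in the kernel log; the message,
# inferred from the tcp_mem article referenced below, typically reads:
#   TCP: out of memory -- consider tuning tcp_mem
dmesg --ctime | grep -i 'tcp: out of memory'
```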
Based on the Linux manual and some random pages, I'm trying the following configuration to mitigate it:
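The configuration itself was also lost; a minimal sketch of the kind of tuning the referenced pages describe, with illustrative values rather than the ones originally applied:

```sh
# net.ipv4.tcp_mem takes three thresholds in pages: low, pressure, high.
# Raising them gives the stack more headroom before it starts pruning
# buffers and resetting connections. The numbers below are illustrative.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-tcp-mem.conf
net.ipv4.tcp_mem = 262144 524288 1048576
EOF
sudo sysctl --system   # apply without a reboot
```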
References: https://dzone.com/articles/tcp-out-of-memory-consider-tuning-tcp-mem and https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html

update: I caught another resets case and fixed it by …
Sometimes, 900 seconds is not enough for docker to pull the images, so: …

Another option is to enable `--debug` mode for `dockerd` (maybe it will have more logs, like on retries or something), but I doubt that this is a good idea, since otherwise the tests will take even longer. @Felixoid, what do you think?
Examples: …
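For the `--debug` part, a hedged sketch: `dockerd` can also read the flag from `daemon.json`, and it re-reads that option on SIGHUP, so it can be toggled without restarting the containers (paths are the stock defaults):

```sh
# enable daemon debug logging via the config file instead of the CLI flag;
# note: this overwrites daemon.json, on a real host merge the key instead
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json

# dockerd reloads a subset of its options, "debug" included, on SIGHUP
sudo kill -HUP "$(pidof dockerd)"
journalctl -u docker --since '10 min ago'   # the daemon log should be far chattier now
```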