Integration tests fail due to docker-compose pull timeout #56108

Closed · azat opened this issue Oct 29, 2023 · 20 comments · Fixed by #57744
Labels: comp-ci Continuous integration

@azat (Collaborator) commented Oct 29, 2023

Sometimes, 900 seconds is not enough for docker to pull the images, so:

  • maybe there are some problems with http://dockerhub-proxy.dockerhub-proxy-zone:5000/?
  • or just with the CI infrastructure?
  • or maybe it is worth simply increasing this timeout, and enabling --debug mode for dockerd (it might log more, e.g. about retries)? But I doubt this is a good idea, since the tests would then take even longer.

@Felixoid what do you think?

Examples:

@Felixoid (Member) commented:

I don't understand how this is possible in https://s3.amazonaws.com/clickhouse-test-reports/56030/331f661322ee0b12ec41cec8cba36b9973a6aa5a/integration_tests__asan__analyzer__[1_6]/integration_run_parallel3_0.log

2023-10-26 21:13:56 [ 538 ] INFO : Got exception pulling images: Command '['docker-compose', '--env-file', '/ClickHouse/tests/integration/test_version_update/_instances_0/.env', '--project-name', 'roottestversionupdate', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node1/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node2/docker-compose.yml', '--file', '/compose/docker_compose_keeper.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node3/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node4/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node5/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node6/docker-compose.yml', 'pull']' timed out after 300 seconds (cluster.py:2668, start)

Every image should already be ready from the pulling stage https://github.com/ClickHouse/ClickHouse/actions/runs/6657903083/job/18096027453#step:6:102. I think it's the same in every place, like in #56082.
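For context, a rough sketch of the pattern that produces the log line above (not the literal cluster.py code; the argument names are illustrative): the harness runs docker-compose pull as a subprocess with a hard timeout and logs the exception.

# Rough sketch, not the literal cluster.py code. The pull is a subprocess call
# with a hard timeout; TimeoutExpired is logged as "Got exception pulling images".
import logging
import subprocess

def pull_images(env_file, project_name, compose_files, timeout=300):
    cmd = ["docker-compose", "--env-file", env_file, "--project-name", project_name]
    for compose_file in compose_files:
        cmd += ["--file", compose_file]
    cmd.append("pull")
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except Exception as exc:
        logging.info("Got exception pulling images: %s", exc)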

@qoega (Member) commented Oct 31, 2023

Do we explicitly run pull inside the tests? I think it checks for updated images and pulls them again in that case. If we have pre-pull enabled, we need to disable docker-compose pull inside the tests.
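A minimal sketch of that idea; the environment variable and helper below are hypothetical, not the actual cluster.py API.

# Hypothetical sketch: skip the in-test `docker-compose pull` when the images were
# already pre-pulled on the host. The variable name is made up for illustration.
import os
import subprocess

def maybe_pull(compose_cmd, timeout=300):
    if os.environ.get("CLICKHOUSE_TESTS_IMAGES_PREPULLED") == "1":
        return  # images are already local; a second pull only burns time and bandwidth
    subprocess.run(compose_cmd + ["pull"], check=True, timeout=timeout)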

@Algunenano (Member) commented:

> If we have pre-pull enabled, we need to disable docker-compose pull inside the tests.

We currently don't pre-pull clickhouse-server images, which AFAIK are the ones timing out.

@Felixoid (Member) commented Nov 1, 2023

What about generating a compose file with all images used in the tests? Then they would be pre-pulled too.
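A minimal sketch of what such a generator could look like, assuming the compose files follow the docker_compose_*.yml naming seen in the log above; the paths and the output file name are illustrative, and templated image references (e.g. ${...}) would still need resolving.

# Sketch under assumptions: compose files follow the docker_compose_*.yml naming,
# and templated images (e.g. ${...}) would still need to be resolved.
import re
from pathlib import Path

def collect_images(root: Path) -> set:
    """Collect every `image:` reference from compose files under `root`."""
    images = set()
    for compose in root.rglob("docker_compose_*.yml"):
        for line in compose.read_text().splitlines():
            match = re.match(r"\s*image:\s*(\S+)", line)
            if match:
                images.add(match.group(1))
    return images

def write_prepull_compose(images: set, target: Path) -> None:
    """Write a synthetic compose file whose only purpose is to be pulled."""
    services = "\n".join(
        f"  prepull_{i}:\n    image: {image}\n    entrypoint: ['true']"
        for i, image in enumerate(sorted(images))
    )
    target.write_text("version: '2.3'\nservices:\n" + services + "\n")

if __name__ == "__main__":
    images = collect_images(Path("tests/integration"))
    write_prepull_compose(images, Path("docker_compose_prepull.yml"))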

@Felixoid (Member) commented Nov 2, 2023

I can't find recent failures with this query. But in #56214 I am adding evidence of the infrastructure failures.

@Felixoid (Member) commented Nov 3, 2023

The query I'm monitoring

updated one

@Felixoid (Member) commented Nov 10, 2023

I caught an issue that looks exactly like this. It's on one of the docker proxies:

ubuntu@ip-172-31-85-118:~$ docker pull mysql:5.7
5.7: Pulling from library/mysql
9ad776bc3934: Pull complete
a280ac4a8665: Pull complete
4047a3b08336: Pull complete
435611dd4999: Pull complete
f84f2572cb0b: Pull complete
ef893e58839b: Pull complete
42897f531783: Downloading [=>                                                 ]  519.6kB/25.53MB
8a8aad27e96b: Download complete
6b2751f26202: Download complete
b0e9b86ed64c: Download complete
bfef93045c96: Download complete

@Felixoid (Member) commented Nov 10, 2023

Playing with neither the docker container registry nor the storage helped. Only a host restart did.

Now the proxy works again. Next time I'll try to remove it from the balancer and look at what's going on in detail.

@Felixoid (Member) commented:

Everything for Nov 15 is cancelled jobs

@Felixoid (Member) commented Nov 19, 2023

Another big bunch of failures is related to the same host as last time. I've now detached the failed node and taken a look at different metrics.

[image: metrics of the detached node]

It's interesting that we can actually see the problem. Not sure yet what the actual issue is, but setting up a notification based on the reset rate is the next thing I'll do.

@alexey-milovidov (Member) commented:

@Felixoid, integrate aggressive timeouts into docker pull. For example, if it doesn't succeed within five minutes, interrupt it and start over, with up to five retries. You can use timeout and a for loop in bash.
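The suggestion above is bash timeout plus a for loop; here is the same retry-with-timeout idea sketched in Python to match the runner code. The image name and the limits are illustrative.

# Sketch of the suggested retry-with-timeout pattern. subprocess.run kills the
# child process and raises TimeoutExpired once the timeout elapses.
import subprocess

def pull_with_retries(image, attempts=5, timeout=300):
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["docker", "pull", image], check=True, timeout=timeout)
            return
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            print(f"pull attempt {attempt}/{attempts} failed: {exc}")
    raise RuntimeError(f"could not pull {image} after {attempts} attempts")

pull_with_retries("mysql:5.7")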

@Felixoid (Member) commented Nov 27, 2023

It doesn't make sense to retry from the same host in this case. Nothing will help if the docker proxy itself is slow.

@Felixoid (Member) commented:

So, yesterday I set up the notifications for the reset metrics. That's the first thing done.

Second, this seems to be something that strikes hosts with long uptime, so this week I'll think about how to identify a safe time to reboot the nodes once a week.

@azat (Collaborator, Author) commented Dec 3, 2023

The flaky check does not have pre-pull:

if self.flaky_check or self.bugfix_validate_check:
    return self.run_flaky_check(
        repo_path, build_path, should_fail=self.bugfix_validate_check
    )
self._install_clickhouse(build_path)
logging.info("Pulling images")
runner._pre_pull_images(repo_path)

I guess the reason is that it does not need to pre-pull all images. Well, there is code that pulls images:

images = get_images_with_versions(reports_path, IMAGES)
(yes, it tries to pull images), but it is done on the host.

Failed CI - https://s3.amazonaws.com/clickhouse-test-reports/42826/b94a7be3b2415643715050322a098f612fb78100/integration_tests_flaky_check__asan_.html

@Felixoid (Member) commented Dec 6, 2023

@mkaynov it's a good point; we shouldn't pull images in the CI script, only the "clickhouse/integration-tests-runner" image.

Regarding the rest, yes, that's the point. We would spend too much time and money on completely unnecessary downloads. I don't think pytest could provide the list of necessary tests.

@alexey-milovidov (Member) commented Dec 6, 2023

@Felixoid, it always makes sense to retry. The software, and also the network infrastructure, can be not just uniformly slow but also bug-ridden. For example, the Docker proxy can accept a connection and then deadlock on that connection while happily processing another one. Whatever the Docker proxy is, I have no reason to trust it.

@Felixoid (Member) commented Dec 6, 2023

> it always makes sense to retry

That's why we already retry enough. But we've hit the network bandwidth throttle on the EC2 instances, and retries in this case are 146% useless, because there's only one poor weak host behind the load balancer, doing its best at 128 Mbps to serve huge integration-test images to our runners. And it fails, fails drastically! It's very sad. So let's not retry even harder and upset it even more.

Ahem, jokes aside, there are two points:

  • The instance types with the best price/bandwidth ratio are already in use:
    type         baseline bandwidth (Gbps)  max bandwidth (Gbps)  price ($/h)
    c6gn.medium  1.6                        16.0                  $0.0432
    c7gn.medium  3.125                      25.0                  $0.0624
  • I've asked AWS whether we can identify the moment when the bandwidth is being throttled, so we can react to it (one way to check this from the host is sketched below).
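As an aside, not from the thread: ENA-based EC2 instances expose allowance-exceeded counters (e.g. bw_out_allowance_exceeded) through ethtool -S, and they increment when the instance hits its network allowances; reading them is one way to spot the throttling moment from the host. The interface name below is illustrative.

# Aside: dump the ENA allowance-exceeded counters for an interface.
import subprocess

def allowance_exceeded(iface="eth0"):
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    stats = {}
    for line in out.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and "allowance_exceeded" in name:
            stats[name.strip()] = int(value)
    return stats

print(allowance_exceeded())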

@Felixoid (Member) commented Dec 7, 2023

Another action I've taken is to set -e REGISTRY_STORAGE_CACHE='' as suggested in distribution/distribution#3722 (comment)

It was done 20 minutes ago. Let's see if it helps avoid the blob storage issue.

This morning's issue was another big number of resets, seen with sudo tcpdump -ni any 'src port 5000 and tcp[tcpflags] & (tcp-rst) !=0'

Maybe I'll take a look at eBPF hooks.

@Felixoid (Member) commented Dec 7, 2023

I'm desperate enough to try this solution: rpardini/docker-registry-proxy@master...coreweave:docker-registry-proxy:coreweave

It looks simple, so it could actually work nicely.

Update

Yeah, right... there's no way it will work for us. If the proxy is down, docker gets stuck completely and refuses to even try to bypass it. Not an option for us.

@Felixoid (Member) commented Dec 8, 2023

Together with support, we narrowed the issue down to the OS. There are the following lines in the syslog:

Dec 7 18:17:07 i-00b68f90e176e90ce CRON[3107]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 7 18:27:39 i-00b68f90e176e90ce kernel: [29603.038581] TCP: out of memory -- consider tuning tcp_mem
Dec 7 18:27:41 i-00b68f90e176e90ce kernel: [29604.715673] TCP: out of memory -- consider tuning tcp_mem
Dec 7 18:27:43 i-00b68f90e176e90ce kernel: [29606.806509] TCP: out of memory -- consider tuning tcp_mem
Dec 7 18:27:43 i-00b68f90e176e90ce kernel: [29607.391224] TCP: out of memory -- consider tuning tcp_mem

Based on the Linux manual and some random pages, I'm trying the following configuration to mitigate it:

net.core.netdev_max_backlog=2000
net.core.rmem_max=1048576
net.core.wmem_max=1048576
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_rmem=4096 131072  16777216
net.ipv4.tcp_wmem=4096 87380   16777216
net.ipv4.tcp_mem=4096 131072  16777216

References: https://dzone.com/articles/tcp-out-of-memory-consider-tuning-tcp-mem and https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html

Update: I caught another case of resets and fixed it with net.ipv4.tcp_mem=4096 131072 16777216. I will apply it everywhere in a moment.
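A minimal sketch of applying those values persistently, assuming the hosts are configured with ad-hoc scripts; the drop-in file name is illustrative.

# Sketch: write the sysctl values above into an /etc/sysctl.d drop-in so they
# survive reboots, then apply them immediately with `sysctl --system`.
import subprocess

TCP_TUNING = {
    "net.core.netdev_max_backlog": "2000",
    "net.core.rmem_max": "1048576",
    "net.core.wmem_max": "1048576",
    "net.ipv4.tcp_max_syn_backlog": "1024",
    "net.ipv4.tcp_rmem": "4096 131072 16777216",
    "net.ipv4.tcp_wmem": "4096 87380 16777216",
    "net.ipv4.tcp_mem": "4096 131072 16777216",
}

with open("/etc/sysctl.d/99-tcp-mem.conf", "w") as conf:
    conf.write("".join(f"{key}={value}\n" for key, value in TCP_TUNING.items()))
subprocess.run(["sysctl", "--system"], check=True)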
