Integration tests fail due to docker-compose pull timeout #56108

Closed · azat opened this issue Oct 29, 2023 · 20 comments · Fixed by #57744
Labels: comp-ci Continuous integration

@azat (Collaborator) commented Oct 29, 2023

Sometimes, 900 seconds is not enough for docker to pull the images, so:

  • maybe there are some problems with http://dockerhub-proxy.dockerhub-proxy-zone:5000/?
  • or just with the CI infrastructure?
  • or maybe it is worth simply increasing this timeout, and enabling --debug mode for dockerd (it might log more, e.g. about retries)? But I doubt this is a good idea, since the tests would then take even longer.

@Felixoid what do you think?

Examples:

@Felixoid (Member) commented:

I don't understand how this is possible in https://s3.amazonaws.com/clickhouse-test-reports/56030/331f661322ee0b12ec41cec8cba36b9973a6aa5a/integration_tests__asan__analyzer__[1_6]/integration_run_parallel3_0.log

2023-10-26 21:13:56 [ 538 ] INFO : Got exception pulling images: Command '['docker-compose', '--env-file', '/ClickHouse/tests/integration/test_version_update/_instances_0/.env', '--project-name', 'roottestversionupdate', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node1/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node2/docker-compose.yml', '--file', '/compose/docker_compose_keeper.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node3/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node4/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node5/docker-compose.yml', '--file', '/ClickHouse/tests/integration/test_version_update/_instances_0/node6/docker-compose.yml', 'pull']' timed out after 300 seconds (cluster.py:2668, start)

Every image should already be ready from the pulling stage https://github.com/ClickHouse/ClickHouse/actions/runs/6657903083/job/18096027453#step:6:102. I think it's the same in every place, like in #56082.
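For context, a rough sketch of the pattern that produces the log line above (not the literal cluster.py code; the argument names are illustrative): the harness runs docker-compose pull as a subprocess with a hard timeout and logs the exception.

# Rough sketch, not the literal cluster.py code. The pull is a subprocess call
# with a hard timeout; TimeoutExpired is logged as "Got exception pulling images".
import logging
import subprocess

def pull_images(env_file, project_name, compose_files, timeout=300):
    cmd = ["docker-compose", "--env-file", env_file, "--project-name", project_name]
    for compose_file in compose_files:
        cmd += ["--file", compose_file]
    cmd.append("pull")
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except Exception as exc:
        logging.info("Got exception pulling images: %s", exc)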

@qoega (Member) commented Oct 31, 2023

Do we explicitly run pull inside the tests? I think it checks for updated images and pulls them again in that case. If we have pre-pull enabled, we need to disable docker-compose pull inside the tests.
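A minimal sketch of that idea; the environment variable and helper below are hypothetical, not the actual cluster.py API.

# Hypothetical sketch: skip the in-test `docker-compose pull` when the images were
# already pre-pulled on the host. The variable name is made up for illustration.
import os
import subprocess

def maybe_pull(compose_cmd, timeout=300):
    if os.environ.get("CLICKHOUSE_TESTS_IMAGES_PREPULLED") == "1":
        return  # images are already local; a second pull only burns time and bandwidth
    subprocess.run(compose_cmd + ["pull"], check=True, timeout=timeout)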

@Algunenano (Member) commented:

> If we have pre-pull enabled, we need to disable docker-compose pull inside the tests.

We currently don't pre-pull clickhouse-server images, which AFAIK are the ones timing out.

@Felixoid (Member) commented Nov 1, 2023

What about generating a compose file with all images used in the tests? Then they would be pre-pulled too.
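A minimal sketch of what such a generator could look like, assuming the compose files follow the docker_compose_*.yml naming seen in the log above; the paths and the output file name are illustrative, and templated image references (e.g. ${...}) would still need resolving.

# Sketch under assumptions: compose files follow the docker_compose_*.yml naming,
# and templated images (e.g. ${...}) would still need to be resolved.
import re
from pathlib import Path

def collect_images(root: Path) -> set:
    """Collect every `image:` reference from compose files under `root`."""
    images = set()
    for compose in root.rglob("docker_compose_*.yml"):
        for line in compose.read_text().splitlines():
            match = re.match(r"\s*image:\s*(\S+)", line)
            if match:
                images.add(match.group(1))
    return images

def write_prepull_compose(images: set, target: Path) -> None:
    """Write a synthetic compose file whose only purpose is to be pulled."""
    services = "\n".join(
        f"  prepull_{i}:\n    image: {image}\n    entrypoint: ['true']"
        for i, image in enumerate(sorted(images))
    )
    target.write_text("version: '2.3'\nservices:\n" + services + "\n")

if __name__ == "__main__":
    images = collect_images(Path("tests/integration"))
    write_prepull_compose(images, Path("docker_compose_prepull.yml"))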

@Felixoid (Member) commented Nov 2, 2023

I can't find recent failures with this query. But in #56214 I am adding evidence of the infrastructure failures.

@Felixoid (Member) commented Nov 3, 2023

The query I'm monitoring

updated one

@Felixoid (Member) commented Nov 10, 2023

I caught an issue that looks exactly like this. It's on one of the docker proxies:

ubuntu@ip-172-31-85-118:~$ docker pull mysql:5.7
5.7: Pulling from library/mysql
9ad776bc3934: Pull complete
a280ac4a8665: Pull complete
4047a3b08336: Pull complete
435611dd4999: Pull complete
f84f2572cb0b: Pull complete
ef893e58839b: Pull complete
42897f531783: Downloading [=>                                                 ]  519.6kB/25.53MB
8a8aad27e96b: Download complete
6b2751f26202: Download complete
b0e9b86ed64c: Download complete
bfef93045c96: Download complete

@Felixoid (Member) commented Nov 10, 2023

Playing with neither the docker container registry nor the storage helped. Only a host restart did.

Now the proxy works again. Next time I'll try to remove it from the balancer and look at what's going on in detail.

@Felixoid (Member) commented:

Everything for Nov 15 is cancelled jobs

@Felixoid (Member) commented Nov 19, 2023

Another big bunch of failures is related to the same host as last time. I've now detached the failed node and taken a look at different metrics.

[image: metrics of the detached node]

It's interesting that we can actually see the problem. Not sure yet what the actual issue is, but setting up a notification based on the reset rate is the next thing I'll do.

@alexey-milovidov (Member) commented:

@Felixoid, integrate aggressive timeouts into docker pull. For example, if it doesn't succeed within five minutes, interrupt it and start over, with up to five retries. You can use timeout and a for loop in bash.
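The suggestion above is bash timeout plus a for loop; here is the same retry-with-timeout idea sketched in Python to match the runner code. The image name and the limits are illustrative.

# Sketch of the suggested retry-with-timeout pattern. subprocess.run kills the
# child process and raises TimeoutExpired once the timeout elapses.
import subprocess

def pull_with_retries(image, attempts=5, timeout=300):
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["docker", "pull", image], check=True, timeout=timeout)
            return
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            print(f"pull attempt {attempt}/{attempts} failed: {exc}")
    raise RuntimeError(f"could not pull {image} after {attempts} attempts")

pull_with_retries("mysql:5.7")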

@Felixoid (Member) commented Nov 27, 2023

It doesn't make sense to retry from the same host in this case. Nothing will help if the docker proxy itself is slow.

@Felixoid (Member) commented:

So, yesterday I set up the notifications for the reset metrics. That's the first thing done.

Second, this seems to be something that strikes hosts with long uptime, so this week I'll think about how to identify a safe time to reboot the nodes once a week.

@azat (Collaborator, Author) commented Dec 3, 2023

The flaky check does not have pre-pull:

if self.flaky_check or self.bugfix_validate_check:
    return self.run_flaky_check(
        repo_path, build_path, should_fail=self.bugfix_validate_check
    )
self._install_clickhouse(build_path)
logging.info("Pulling images")
runner._pre_pull_images(repo_path)

I guess the reason is that it does not need to pre-pull all images. Well, there is code that pulls images:

images = get_images_with_versions(reports_path, IMAGES)
(yes, it tries to pull images), but it is done on the host.

Failed CI - https://s3.amazonaws.com/clickhouse-test-reports/42826/b94a7be3b2415643715050322a098f612fb78100/integration_tests_flaky_check__asan_.html

@Felixoid (Member) commented Dec 6, 2023

@mkaynov it's a good point; we shouldn't pull images in the CI script, only the "clickhouse/integration-tests-runner" image.

Regarding the rest, yes, that's the point. We would spend too much time and money on completely unnecessary downloads. I don't think pytest could provide the list of necessary tests.

@alexey-milovidov (Member) commented Dec 6, 2023

@Felixoid, it always makes sense to retry. The software, and also the network infrastructure, can be not just uniformly slow but also bug-ridden. For example, the Docker proxy can accept a connection and then deadlock on that connection while happily processing another one. Whatever the Docker proxy is, I have no reason to trust it.

@Felixoid (Member) commented Dec 6, 2023

> it always makes sense to retry

That's why we already retry enough. But we've hit the network bandwidth throttle on the EC2 instances, and retries in this case are 146% useless, because there's only one poor weak host behind the load balancer, doing its best at 128 Mbps to serve huge integration-test images to our runners. And it fails, fails drastically! It's very sad. So let's not retry even harder and upset it even more.

Ahem, jokes aside, there are two points:

  • The instance types with the best price/bandwidth ratio are already in use:
    type         baseline bandwidth (Gbps)  max bandwidth (Gbps)  price ($/h)
    c6gn.medium  1.6                        16.0                  $0.0432
    c7gn.medium  3.125                      25.0                  $0.0624
  • I've asked AWS whether we can identify the moment when the bandwidth is being throttled, so we can react to it (one way to check this from the host is sketched below).
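As an aside, not from the thread: ENA-based EC2 instances expose allowance-exceeded counters (e.g. bw_out_allowance_exceeded) through ethtool -S, and they increment when the instance hits its network allowances; reading them is one way to spot the throttling moment from the host. The interface name below is illustrative.

# Aside: dump the ENA allowance-exceeded counters for an interface.
import subprocess

def allowance_exceeded(iface="eth0"):
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    stats = {}
    for line in out.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and "allowance_exceeded" in name:
            stats[name.strip()] = int(value)
    return stats

print(allowance_exceeded())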

@Felixoid (Member) commented Dec 7, 2023

Another action I've taken is to set -e REGISTRY_STORAGE_CACHE='' as suggested in distribution/distribution#3722 (comment)

It was done 20 minutes ago. Let's see if it helps avoid the blob storage issue.

This morning's issue was another big number of resets, seen with sudo tcpdump -ni any 'src port 5000 and tcp[tcpflags] & (tcp-rst) !=0'

Maybe I'll take a look at eBPF hooks.

@Felixoid (Member) commented Dec 7, 2023

I'm desperate enough to try this solution: rpardini/docker-registry-proxy@master...coreweave:docker-registry-proxy:coreweave

It looks simple, so it could actually work nicely.

Update

Yeah, right... there's no way it will work for us. If the proxy is down, docker gets stuck completely and refuses to even try to bypass it. Not an option for us.

@Felixoid (Member) commented Dec 8, 2023

Together with support, we narrowed the issue down to the OS. There are the following lines in the syslog:

Dec 7 18:17:07 i-00b68f90e176e90ce CRON[3107]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 7 18:27:39 i-00b68f90e176e90ce kernel: [29603.038581] TCP: out of memory -- consider tuning tcp_mem
Dec 7 18:27:41 i-00b68f90e176e90ce kernel: [29604.715673] TCP: out of memory -- consider tuning tcp_mem
Dec 7 18:27:43 i-00b68f90e176e90ce kernel: [29606.806509] TCP: out of memory -- consider tuning tcp_mem
Dec 7 18:27:43 i-00b68f90e176e90ce kernel: [29607.391224] TCP: out of memory -- consider tuning tcp_mem

Based on the Linux manual and some random pages, I'm trying the following configuration to mitigate it:

net.core.netdev_max_backlog=2000
net.core.rmem_max=1048576
net.core.wmem_max=1048576
net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_rmem=4096 131072  16777216
net.ipv4.tcp_wmem=4096 87380   16777216
net.ipv4.tcp_mem=4096 131072  16777216

References: https://dzone.com/articles/tcp-out-of-memory-consider-tuning-tcp-mem and https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html

Update: I caught another case of resets and fixed it with net.ipv4.tcp_mem=4096 131072 16777216. I will apply it everywhere in a moment.
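A minimal sketch of applying those values persistently, assuming the hosts are configured with ad-hoc scripts; the drop-in file name is illustrative.

# Sketch: write the sysctl values above into an /etc/sysctl.d drop-in so they
# survive reboots, then apply them immediately with `sysctl --system`.
import subprocess

TCP_TUNING = {
    "net.core.netdev_max_backlog": "2000",
    "net.core.rmem_max": "1048576",
    "net.core.wmem_max": "1048576",
    "net.ipv4.tcp_max_syn_backlog": "1024",
    "net.ipv4.tcp_rmem": "4096 131072 16777216",
    "net.ipv4.tcp_wmem": "4096 87380 16777216",
    "net.ipv4.tcp_mem": "4096 131072 16777216",
}

with open("/etc/sysctl.d/99-tcp-mem.conf", "w") as conf:
    conf.write("".join(f"{key}={value}\n" for key, value in TCP_TUNING.items()))
subprocess.run(["sysctl", "--system"], check=True)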
