
CI backlog #51682

Open
28 of 42 tasks
Felixoid opened this issue Jun 30, 2023 · 27 comments
Labels
comp-ci Continuous integration

Comments

@Felixoid
Member

Felixoid commented Jun 30, 2023

Here are the collected wishes from email, Slack, and #34694

The highest prio: CI costs

High prio

Low prio

  • Add a watch inactive label and close issues carrying it after one month of inactivity
  • Patch run_check.py:should_run_ci_for_pr to check if the PR is created from the same fork; if so, continue (see the sketch after this list)
  • Rewrite performance tests: move the logic of finding and downloading binaries into the Python scripts in tests/ci
  • [discussable] There are quite exotic images, like mcr.microsoft.com/dotnet/sdk and sequenceiq/hadoop-docker:2.7.0, which it makes sense to move into our own Docker organization
  • Improve Mergeable Check to show Pending: check_name instead of Failed: check_name accordingly
  • Get rid of os.path.join in tests/ci where possible in favor of pathlib.Path (Get rid of the most of os.path stuff #55028)
  • Add a comment when canceling a PR with the reasoning for it (Quick fail undocumented features #53413)
  • Clean out Coverity builds (Remove Coverity from workflows, but leave in the code #51842)
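
For the should_run_ci_for_pr item above, a minimal sketch of what the check could look like; the pr_info attribute names and the label fallback are assumptions, not the existing tests/ci interface:

```python
# Hypothetical sketch for tests/ci/run_check.py; attribute names are assumptions.
def should_run_ci_for_pr(pr_info) -> bool:
    # If the PR branch lives in the same repository as the target branch
    # (i.e. it was not opened from a fork), let CI continue right away.
    if pr_info.head_repo_full_name == pr_info.base_repo_full_name:
        return True
    # Otherwise fall back to the existing checks, e.g. the "can be tested" label.
    return "can be tested" in pr_info.labels
```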

Add wishes in comments

@Felixoid Felixoid added the comp-ci Continuous integration label Jun 30, 2023
@alexey-milovidov
Member

I'd like to copy all Docker images into our own organization - either in ECR or in the form of tarballs in an S3 bucket.
This is needed to avoid supply-chain attacks.

@qoega
Member

qoega commented Jun 30, 2023

  1. Get rid of :latest for images and use explicit versions.
  2. Rewrite the AST fuzzer so it doesn't download binaries inside a bash script (a sketch of a shared download helper follows below).
  3. Rewrite the stateless check to use binaries and the repository directly instead of taking tests from the deb package.
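
For points 2-3 (and the performance-test item from the backlog), a rough sketch of a shared download helper in tests/ci; the bucket name and URL layout here are assumptions for illustration, not the actual CI storage layout:

```python
# Sketch of a shared binary-download helper for tests/ci.
# The bucket name and path layout are assumptions.
from pathlib import Path
import urllib.request

S3_URL = "https://clickhouse-builds.s3.amazonaws.com"  # assumed bucket

def download_build(commit_sha: str, build_name: str, dest: Path) -> Path:
    """Download a build artifact once in Python instead of in every bash script."""
    dest.mkdir(parents=True, exist_ok=True)
    url = f"{S3_URL}/{commit_sha}/{build_name}/clickhouse"
    target = dest / "clickhouse"
    urllib.request.urlretrieve(url, target)
    target.chmod(0o755)
    return target
```

Checks like the AST fuzzer and the stateless runner could then receive the path to the downloaded binary instead of re-implementing the download in bash.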

@tavplubix
Member

There are quite exotic images

Also #43182

@alexey-milovidov
Member

Avoid using the public Rust repository because it sometimes does not respond.

@Felixoid
Member Author

Felixoid commented Jul 3, 2023

I'd like to copy all Docker images into our own organization - either in ECR or in the form of tarballs in an S3 bucket.
This is needed to avoid supply-chain attacks.

This type of discussion comes up from time to time. There are two different approaches, and I disagree that we must blindly use one of them.
Right now we use ubuntu:$release widely, and it's meaningless to have it in our organization. Why? Because nobody will care about updating it. On the other hand, all our images depending on ubuntu are currently updated on a daily basis. Does it break from time to time? Sure, and that's exactly the reason to keep it updated.
Otherwise, we won't know for ages that something is broken. The same applies to things like nginx:alpine and python:3. For example, we had php:8.0.18-cli. Why that exact version and not php:8-cli? This has already been updated.
There are some quite exotic images, like mcr.microsoft.com/dotnet/sdk and sequenceiq/hadoop-docker:2.7.0, which do make sense to move under our own organization. But things like centos, nginx, ubuntu or alpine? When they are removed from Docker Hub, the world will have much worse issues than problems with our builds.

@alexey-milovidov
Member

@Felixoid Various images like Hadoop, PHP, and Nginx are only needed for tests, and pinning them is safer and saner than allowing updates at unpredictable times from unreliable sources. Also, it's only a matter of time before Docker Hub is pwned. For example, you would not trust SourceForge today. And trash images without version pinning are vulnerable to malicious takeover. Say, tomorrow someone takes over Hadoop and uploads version 99.9 with malware.

@Felixoid
Member Author

Felixoid commented Jul 4, 2023

We need to make sure we are on the same page. Right now I don't understand what exactly you are advocating. Are we still speaking about every image, or just about the images for integration tests?

Various images like Hadoop, PHP, and Nginx are only needed for tests, and pinning them is safer and saner than allowing updates at unpredictable times from unreliable sources.

Testing the current state of the world and compatibility with modern versions is better than having stale tests for eight-year-old versions like Hadoop 2.7.0. What is this test actually about? When was it last touched?

Also, it's only a matter of time before Docker Hub is pwned.

This kind of concern is usually addressed by something like Artifactory with snapshots. At my previous company, the in-house infrastructure to address a similar problem was built by a team of ~15 over a few years. It wasn't their only task, but once built it did require continuous polishing.

And trash images without version pinning are vulnerable to malicious takeover.

Define trash images, please. Images without a leading user_or_company/ in the name are built from https://github.com/docker-library/official-images/ with manual review and scans. If we don't trust them, well, see the point about a proper tool. And the changes should be reviewed manually, like in banks. Otherwise, are we really addressing the issue, or just pretending to?

Say, tomorrow someone takes over Hadoop and uploads version 99.9 with malware.

Is it about sequenceiq/hadoop-docker? Nothing will happen, because we use the pinned tag sequenceiq/hadoop-docker:2.7.0.

And once again, images like sequenceiq/hadoop-docker:2.7.0 are actually better moved to our clickhouse/ Docker Hub scope, as I've stated above:

There are some quite exotic images, like mcr.microsoft.com/dotnet/sdk and sequenceiq/hadoop-docker:2.7.0, which do make sense to move under our own organization

@alexey-milovidov
Member

@Felixoid See #51769

@alexey-milovidov
Member

@Felixoid

Are we still speaking about every image, or just about the images for integration tests?

Every image. They should be pinned to specific versions, and bytewise identity has to be validated.

Testing the current state of the world and compatibility with modern versions is better

Not better.

This kind of concern is usually addressed by something like Artifactory with snapshots. At my previous company, the in-house infrastructure to address a similar problem was built by a team of ~15 over a few years. It wasn't their only task, but once built it did require continuous polishing.

No problem, if you don't like this - don't do it.

@Felixoid
Member Author

Felixoid commented Jul 4, 2023

@qoega proposed a great idea that should cover both your concern about uncontrolled updates and my wish to have everything updated automatically on a daily basis.

Here it is: right now we have a nightly update-everything task that simply does the update. The proposal is to replace it with nightly automatically created PRs, so CI finishes against the updated images before they are used.

It sounds ∞ times better than the current state, and I hope it looks good to you too.
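
A hedged sketch of what the nightly job could do instead of pushing updated images directly: commit the bumped image references to a branch and open a PR with the gh CLI, so the images get a full CI run first (branch naming and commit message are illustrative):

```python
# Illustrative sketch of a nightly "open a PR with updated docker images" job.
import datetime as dt
import subprocess

def open_nightly_image_update_pr() -> None:
    branch = f"auto/docker-images-{dt.date.today():%Y-%m-%d}"

    def run(*cmd: str) -> None:
        subprocess.run(cmd, check=True)

    run("git", "checkout", "-b", branch)
    # ... regenerate / bump the image references here ...
    run("git", "commit", "-am", "Automatic docker images update")
    run("git", "push", "origin", branch)
    run("gh", "pr", "create",
        "--title", "Nightly docker images update",
        "--body", "Automated update; merge only if CI is green.")
```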

@alexey-milovidov
Member

This is good for me.

@alexey-milovidov
Member

I've removed Coverity: #52775

@Felixoid
Member Author

Felixoid commented Aug 11, 2023

From @alexey-milovidov

  1. If only functional tests were modified, get the latest build from master and run only the functional tests.
  2. If only integration tests were modified, get the latest build from master and run only the integration tests.
  3. If functional tests or the fuzzer have failed, immediately stop all other checks.
  4. If there are more than two consecutive merges with master, remove the "can be tested" label. - ❗️ no longer relevant, we removed the update-branch button
  5. Rent a few machines in Hetzner, put agents there, and see how it goes. They will use the same AWS S3.
  6. If the difference to the previous commits is only in comments or documentation, reuse the previous build and tests. - ❓ it's complex to work on: we will face issues with force-pushes and unfinished builds
  7. Put symbols depending on the build date, commit sha, etc. into a separate section of the binary, to allow calculating and comparing hashes of the rest of the code (see the sketch after this list).
  8. For every PR, calculate a risk score based on the diff. PRs with a lower risk score will be checked by a random subset of sanitizers instead of every one.
  9. As we cannot merge PRs with any failures, it makes sense to stop all checks on the first failure. - ❓ this statement is outdated as well
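
For point 7, a minimal sketch of the comparison, assuming the build date/commit metadata is moved into a dedicated ELF section; the section name .clickhouse.build_info is hypothetical:

```python
# Sketch: hash a binary with the (hypothetical) build-info section stripped,
# so builds of identical code but different commit sha / build date compare equal.
import hashlib
import subprocess
import tempfile
from pathlib import Path

def code_hash(binary: Path) -> str:
    with tempfile.NamedTemporaryFile() as tmp:
        subprocess.run(
            ["objcopy", "--remove-section=.clickhouse.build_info",
             str(binary), tmp.name],
            check=True,
        )
        return hashlib.sha256(Path(tmp.name).read_bytes()).hexdigest()
```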

@Felixoid
Member Author

Felixoid commented Aug 11, 2023

Here are my opinions regarding the action points from the previous comment:

  • 1-3 and 6 are pretty good ideas; they should be implemented in PRInfo and used in the scripts (see the sketch below).
  • 4 is rather an administrative action and will simply be worked around, so I don't see the point.
  • 5 won't affect costs; it was already reviewed a year ago, and our computing costs are pretty well optimized. The main goal is to reduce running and rerunning.
  • 7 looks promising if it allows us to reduce testing runs.
  • 8 looks good from the idea perspective.
  • 9 looks very controversial. It's better to look at the "Reorder the PR jobs' tree to fail early and not launch heavy/long tests" action point at the top.
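
A sketch of how points 1-3 and 6 could be expressed on top of the PR's changed-files list; the directory prefixes and category names here are assumptions about how the logic would look in PRInfo:

```python
# Sketch: categorize a PR by its changed files to decide which CI parts to run.
# Prefixes and category names are illustrative assumptions.
DOCS_PREFIXES = ("docs/",)
FUNC_TEST_PREFIXES = ("tests/queries/",)
INTEGRATION_PREFIXES = ("tests/integration/",)

def pr_category(changed_files: list) -> str:
    if all(f.startswith(DOCS_PREFIXES) for f in changed_files):
        return "docs-only"              # reuse the previous build and tests
    if all(f.startswith(FUNC_TEST_PREFIXES) for f in changed_files):
        return "functional-tests-only"  # take the latest master build
    if all(f.startswith(INTEGRATION_PREFIXES) for f in changed_files):
        return "integration-tests-only" # take the latest master build
    return "full-ci"
```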

@alexey-milovidov
Member

If a check has failed in the previous commits, run it and its dependencies first, and stop the runs if it has failed again.

@Felixoid
Member Author

If a check has failed in the previous commits, run it and its dependencies first, and stop the runs if it has failed again.

I assume you're speaking about workflow jobs. If so, we have zero control over the queue; there is nothing we can do here while we use GitHub Actions. And the tree is a static dict described in the .github/workflows/*.yaml files.

@alexey-milovidov
Member

Good point to get rid of GitHub Actions.

@alexey-milovidov
Member

This is an example of why GitHub Actions is inefficient: #53697

@Felixoid
Member Author

I agree that they are terrible, but what exactly is the issue there?

@alexey-milovidov
Member

This PR only modified a functional test - we can limit the work to running this functional test against the latest build from master.

@Felixoid
Member Author

Felixoid commented Aug 22, 2023

We cannot skip the launch completely, that's right. But we can use the code to skip all the builds and all kinds of other tests. It would just be done at the init stage.

It's described in the tasks already.

@alexey-milovidov
Member

To lower the cost of builds, we should use Graviton machines for every build type (both x86_64 and arm), because we are always cross-compiling. The only remaining part is grpc-compiler, CC @rschu1ze

@alexey-milovidov
Member

Obtain the test results for interrupted runs from the CI database, and if there are any failures, generate a report and don't run the check again. If there are no failures, run only the remaining tests on the next try.

@Felixoid
Member Author

Felixoid commented Sep 4, 2023

Obtain the test results for interrupted runs from the CI database, and if there are any failures, generate a report and don't run the check again. If there are no failures, run only the remaining tests on the next try.

Updated the list. It was already partially there, except for the idea to fail even earlier.
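
A hedged sketch of the lookup against the CI database; the table and column names (checks, commit_sha, check_name, test_name, test_status) are assumptions about the schema, and the host is a placeholder:

```python
# Sketch: fetch results of a previously interrupted run from the CI database.
import clickhouse_connect  # assumed to be available in tests/ci

def previous_results(commit_sha: str, check_name: str):
    client = clickhouse_connect.get_client(host="ci-db.example.com")  # placeholder host
    rows = client.query(
        "SELECT test_name, test_status FROM checks "
        "WHERE commit_sha = {sha:String} AND check_name = {check:String}",
        parameters={"sha": commit_sha, "check": check_name},
    ).result_rows
    failures = [name for name, status in rows if status == "FAIL"]
    passed = {name for name, status in rows if status == "OK"}
    return failures, passed
```

If failures is non-empty, the check could generate the report straight from these rows and skip the rerun; otherwise only the tests missing from passed would be executed.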

@alexey-milovidov
Member

@Felixoid, it is again very relevant. We should always pin all the dependencies.
Not doing it is insane.

@Felixoid
Member Author

Felixoid commented Mar 5, 2024

Comments about the automatic release process.

The system is designed to have a pair of eyes on the workflows launched after the release is created. This is stated explicitly in the logs:

INFO:root:To verify all actions are running good visit the following links:
  https://github.com/ClickHouse/ClickHouse/actions/workflows/release.yml
  https://github.com/ClickHouse/ClickHouse/actions/workflows/tags_stable.yml
INFO:root:To rollback the action run the following commands:
  git push git@github.com:ClickHouse/ClickHouse.git +b6.....:23.8
  gh release delete --yes --repo ClickHouse/ClickHouse 'v23.8.10.43-lts'
  git push -d git@github.com:ClickHouse/ClickHouse.git v23.8.10.43-lts
  git tag -d 'v23.8.10.43-lts'

A script to do it automatically doesn't make the whole system resilient. And lately we've had too many issues with it to discuss moving forward here. Either we lose the Release ready status completely, or it lies about the actual state of the binaries.

Until the workflows themselves are stable and reach some final state, the autorelease process is out of the question.

Another point: it's mentioned that the releases should be daily. That would overload two places at the same time: both git and the package repos would be flooded with unnecessary packages and tags. Currently the task should be executed once a week, so it is better to preserve the current behavior.

@Felixoid
Member Author

I would like to let you know about the docker images' rebuild.

After extensive work by @maxknv, we have idempotent docker tags based on the repo's state. Every docker image:tag is built only once per change and stays immutable.

Then, the only job that runs every night is tagging the current digest as latest. It completely solves the issue with untested updated images.
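
For context, a simplified sketch of how an idempotent tag can be derived from the repository state; the directory layout and tag length are illustrative, not the exact implementation:

```python
# Sketch: derive a deterministic docker tag from the image directory contents,
# so the tag changes only when the image definition itself changes.
import hashlib
from pathlib import Path

def image_tag(image_dir: Path) -> str:
    digest = hashlib.sha256()
    for path in sorted(image_dir.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(image_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

# e.g. clickhouse/stateless-test:<image_tag(Path("docker/test/stateless"))>
```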
