
CI backlog #51682

Open
28 of 42 tasks
Felixoid opened this issue Jun 30, 2023 · 27 comments
Labels
comp-ci Continuous integration

Comments

@Felixoid
Member

Felixoid commented Jun 30, 2023

Here are the collected wishes from email, Slack, and #34694

The highest prio: CI costs

High prio

Low prio

  • Add a watch inactive label and close issues carrying it after one month of inactivity
  • Patch run_check.py:should_run_ci_for_pr to check if the PR is created from the same fork; if so, continue (see the sketch after this list)
  • Rewrite performance tests: move the logic of finding and downloading binaries into the Python scripts in tests/ci
  • [discussable] There are quite exotic images, like mcr.microsoft.com/dotnet/sdk and sequenceiq/hadoop-docker:2.7.0, which it makes sense to move into our own Docker organization
  • Improve Mergeable Check to show Pending: check_name instead of Failed: check_name accordingly
  • Get rid of os.path.join in tests/ci where possible in favor of pathlib.Path (Get rid of the most of os.path stuff #55028)
  • Add a comment when canceling a PR with the reasoning for it (Quick fail undocumented features #53413)
  • Clean out Coverity builds (Remove Coverity from workflows, but leave in the code #51842)
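
For the should_run_ci_for_pr item above, a minimal sketch of what the check could look like; the pr_info attribute names and the label fallback are assumptions, not the existing tests/ci interface:

```python
# Hypothetical sketch for tests/ci/run_check.py; attribute names are assumptions.
def should_run_ci_for_pr(pr_info) -> bool:
    # If the PR branch lives in the same repository as the target branch
    # (i.e. it was not opened from a fork), let CI continue right away.
    if pr_info.head_repo_full_name == pr_info.base_repo_full_name:
        return True
    # Otherwise fall back to the existing checks, e.g. the "can be tested" label.
    return "can be tested" in pr_info.labels
```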

Add wishes in comments

@Felixoid Felixoid added the comp-ci Continuous integration label Jun 30, 2023
@alexey-milovidov
Member

I'd like to copy all Docker images into our own organization - either in ECR or in the form of tarballs in an S3 bucket.
This is needed to avoid supply-chain attacks.

@qoega
Member

qoega commented Jun 30, 2023

  1. Get rid of :latest for images and use explicit versions.
  2. Rewrite the AST fuzzer so it doesn't download binaries inside a bash script (a sketch of a shared download helper follows below).
  3. Rewrite the stateless check to use binaries and the repository directly instead of taking tests from the deb package.
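
For points 2-3 (and the performance-test item from the backlog), a rough sketch of a shared download helper in tests/ci; the bucket name and URL layout here are assumptions for illustration, not the actual CI storage layout:

```python
# Sketch of a shared binary-download helper for tests/ci.
# The bucket name and path layout are assumptions.
from pathlib import Path
import urllib.request

S3_URL = "https://clickhouse-builds.s3.amazonaws.com"  # assumed bucket

def download_build(commit_sha: str, build_name: str, dest: Path) -> Path:
    """Download a build artifact once in Python instead of in every bash script."""
    dest.mkdir(parents=True, exist_ok=True)
    url = f"{S3_URL}/{commit_sha}/{build_name}/clickhouse"
    target = dest / "clickhouse"
    urllib.request.urlretrieve(url, target)
    target.chmod(0o755)
    return target
```

Checks like the AST fuzzer and the stateless runner could then receive the path to the downloaded binary instead of re-implementing the download in bash.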

@tavplubix
Member

There are quite exotic images

Also #43182

@alexey-milovidov
Member

Avoid using the public Rust repository because it sometimes does not respond.

@Felixoid
Member Author

Felixoid commented Jul 3, 2023

I'd like to copy all Docker images into our own organization - either in ECR or in the form of tarballs in an S3 bucket.
This is needed to avoid supply-chain attacks.

This type of discussion comes up from time to time. There are two different approaches, and I disagree that we must blindly use one of them.
Right now we use ubuntu:$release widely, and it's meaningless to have it in our organization. Why? Because nobody will care about updating it. On the other hand, all our images depending on ubuntu are currently updated on a daily basis. Does it break from time to time? Sure, and that's exactly the reason to keep it updated.
Otherwise, we won't know for ages that something is broken. The same applies to things like nginx:alpine and python:3. For example, we had php:8.0.18-cli. Why that exact version and not php:8-cli? This has already been updated.
There are some quite exotic images, like mcr.microsoft.com/dotnet/sdk and sequenceiq/hadoop-docker:2.7.0, which do make sense to move under our own organization. But things like centos, nginx, ubuntu or alpine? When they are removed from Docker Hub, the world will have much worse issues than problems with our builds.

@alexey-milovidov
Member

@Felixoid Various images like Hadoop, PHP, and Nginx are only needed for tests, and pinning them is safer and saner than allowing updates at unpredictable times from unreliable sources. Also, it's only a matter of time before Docker Hub is pwned. For example, you would not trust SourceForge today. And trash images without version pinning are vulnerable to malicious takeover. Say, tomorrow someone takes over Hadoop and uploads version 99.9 with malware.

@Felixoid
Member Author

Felixoid commented Jul 4, 2023

We need to make sure we are on the same page. Right now I don't understand what exactly you are advocating. Are we still speaking about every image, or just about the images for integration tests?

Various images like Hadoop, PHP, and Nginx are only needed for tests, and pinning them is safer and saner than allowing updates at unpredictable times from unreliable sources.

Testing the current state of the world and compatibility with modern versions is better than having stale tests for eight-year-old versions like Hadoop 2.7.0. What is this test actually about? When was it last touched?

Also, it's only a matter of time before Docker Hub is pwned.

This kind of concern is usually addressed by something like Artifactory with snapshots. At my previous company, the in-house infrastructure to address a similar problem was built by a team of ~15 over a few years. It wasn't their only task, but once built it did require continuous polishing.

And trash images without version pinning are vulnerable to malicious takeover.

Define trash images, please. Images without a leading user_or_company/ in the name are built from https://github.com/docker-library/official-images/ with manual review and scans. If we don't trust them, well, see the point about a proper tool. And the changes should be reviewed manually, like in banks. Otherwise, are we really addressing the issue, or just pretending to?

Say, tomorrow someone takes over Hadoop and uploads version 99.9 with malware.

Is it about sequenceiq/hadoop-docker? Nothing will happen, because we use the pinned tag sequenceiq/hadoop-docker:2.7.0.

And once again, images like sequenceiq/hadoop-docker:2.7.0 are actually better moved to our clickhouse/ Docker Hub scope, as I've stated above:

There are some quite exotic images, like mcr.microsoft.com/dotnet/sdk and sequenceiq/hadoop-docker:2.7.0, which do make sense to move under our own organization

@alexey-milovidov
Member

@Felixoid See #51769

@alexey-milovidov
Member

@Felixoid

Are we still speaking about every image, or just about the images for integration tests?

Every image. They should be pinned to specific versions, and bytewise identity has to be validated.

Testing the current state of the world and compatibility with modern versions is better

Not better.

This kind of concern is usually addressed by something like Artifactory with snapshots. At my previous company, the in-house infrastructure to address a similar problem was built by a team of ~15 over a few years. It wasn't their only task, but once built it did require continuous polishing.

No problem, if you don't like this - don't do it.

@Felixoid
Member Author

Felixoid commented Jul 4, 2023

@qoega proposed a great idea that should cover both your concern about uncontrolled updates and my wish to have everything updated automatically on a daily basis.

Here it is: right now we have a nightly update-everything task that simply does the update. The proposal is to replace it with nightly automatically created PRs, so CI finishes against the updated images before they are used.

It sounds ∞ times better than the current state, and I hope it looks good to you too.
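
A hedged sketch of what the nightly job could do instead of pushing updated images directly: commit the bumped image references to a branch and open a PR with the gh CLI, so the images get a full CI run first (branch naming and commit message are illustrative):

```python
# Illustrative sketch of a nightly "open a PR with updated docker images" job.
import datetime as dt
import subprocess

def open_nightly_image_update_pr() -> None:
    branch = f"auto/docker-images-{dt.date.today():%Y-%m-%d}"

    def run(*cmd: str) -> None:
        subprocess.run(cmd, check=True)

    run("git", "checkout", "-b", branch)
    # ... regenerate / bump the image references here ...
    run("git", "commit", "-am", "Automatic docker images update")
    run("git", "push", "origin", branch)
    run("gh", "pr", "create",
        "--title", "Nightly docker images update",
        "--body", "Automated update; merge only if CI is green.")
```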

@alexey-milovidov
Member

This is good for me.

@alexey-milovidov
Member

I've removed Coverity: #52775

@Felixoid
Member Author

Felixoid commented Aug 11, 2023

From @alexey-milovidov

  1. If only functional tests were modified, get the latest build from master and run only the functional tests.
  2. If only integration tests were modified, get the latest build from master and run only the integration tests.
  3. If functional tests or the fuzzer have failed, immediately stop all other checks.
  4. If there are more than two consecutive merges with master, remove the "can be tested" label. - ❗️ no longer relevant, we removed the update-branch button
  5. Rent a few machines in Hetzner, put agents there, and see how it goes. They will use the same AWS S3.
  6. If the difference to the previous commits is only in comments or documentation, reuse the previous build and tests. - ❓ it's complex to work on: we will face issues with force-pushes and unfinished builds
  7. Put symbols depending on the build date, commit sha, etc. into a separate section of the binary, to allow calculating and comparing hashes of the rest of the code (see the sketch after this list).
  8. For every PR, calculate a risk score based on the diff. PRs with a lower risk score will be checked by a random subset of sanitizers instead of every one.
  9. As we cannot merge PRs with any failures, it makes sense to stop all checks on the first failure. - ❓ this statement is outdated as well
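
For point 7, a minimal sketch of the comparison, assuming the build date/commit metadata is moved into a dedicated ELF section; the section name .clickhouse.build_info is hypothetical:

```python
# Sketch: hash a binary with the (hypothetical) build-info section stripped,
# so builds of identical code but different commit sha / build date compare equal.
import hashlib
import subprocess
import tempfile
from pathlib import Path

def code_hash(binary: Path) -> str:
    with tempfile.NamedTemporaryFile() as tmp:
        subprocess.run(
            ["objcopy", "--remove-section=.clickhouse.build_info",
             str(binary), tmp.name],
            check=True,
        )
        return hashlib.sha256(Path(tmp.name).read_bytes()).hexdigest()
```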

@Felixoid
Member Author

Felixoid commented Aug 11, 2023

Here are my opinions regarding the action points from the previous comment:

  • 1-3 and 6 are pretty good ideas; they should be implemented in PRInfo and used in the scripts (see the sketch below).
  • 4 is rather an administrative action and will simply be worked around, so I don't see the point.
  • 5 won't affect costs; it was already reviewed a year ago, and our computing costs are pretty well optimized. The main goal is to reduce running and rerunning.
  • 7 looks promising if it allows us to reduce testing runs.
  • 8 looks good from the idea perspective.
  • 9 looks very controversial. It's better to look at the "Reorder the PR jobs' tree to fail early and not launch heavy/long tests" action point at the top.
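
A sketch of how points 1-3 and 6 could be expressed on top of the PR's changed-files list; the directory prefixes and category names here are assumptions about how the logic would look in PRInfo:

```python
# Sketch: categorize a PR by its changed files to decide which CI parts to run.
# Prefixes and category names are illustrative assumptions.
DOCS_PREFIXES = ("docs/",)
FUNC_TEST_PREFIXES = ("tests/queries/",)
INTEGRATION_PREFIXES = ("tests/integration/",)

def pr_category(changed_files: list) -> str:
    if all(f.startswith(DOCS_PREFIXES) for f in changed_files):
        return "docs-only"              # reuse the previous build and tests
    if all(f.startswith(FUNC_TEST_PREFIXES) for f in changed_files):
        return "functional-tests-only"  # take the latest master build
    if all(f.startswith(INTEGRATION_PREFIXES) for f in changed_files):
        return "integration-tests-only" # take the latest master build
    return "full-ci"
```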

@alexey-milovidov
Member

If a check has failed in the previous commits, run it and its dependencies first, and stop the runs if it has failed again.

@Felixoid
Member Author

If a check has failed in the previous commits, run it and its dependencies first, and stop the runs if it has failed again.

I assume you're speaking about workflow jobs. If so, we have zero control over the queue; there is nothing we can do here while we use GitHub Actions. And the tree is a static dict described in the .github/workflows/*.yaml files.

@alexey-milovidov
Member

Good point to get rid of GitHub Actions.

@alexey-milovidov
Member

This is an example of why GitHub Actions is inefficient: #53697

@Felixoid
Member Author

I agree that they are terrible, but what exactly is the issue there?

@alexey-milovidov
Member

This PR only modified a functional test - we can limit the work to running this functional test against the latest build from master.

@Felixoid
Member Author

Felixoid commented Aug 22, 2023

We cannot skip the launch completely, that's right. But we can use the code to skip all the builds and all kinds of other tests. It would just be done at the init stage.

It's described in the tasks already.

@alexey-milovidov
Member

To lower the cost of builds, we should use Graviton machines for every build type (both x86_64 and arm), because we are always cross-compiling. The only remaining part is grpc-compiler, CC @rschu1ze

@alexey-milovidov
Member

Obtain the test results for interrupted runs from the CI database, and if there are any failures, generate a report and don't run the check again. If there are no failures, run only the remaining tests on the next try.

@Felixoid
Member Author

Felixoid commented Sep 4, 2023

Obtain the test results for interrupted runs from the CI database, and if there are any failures, generate a report and don't run the check again. If there are no failures, run only the remaining tests on the next try.

Updated the list. It was already partially there, except for the idea to fail even earlier.
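
A hedged sketch of the lookup against the CI database; the table and column names (checks, commit_sha, check_name, test_name, test_status) are assumptions about the schema, and the host is a placeholder:

```python
# Sketch: fetch results of a previously interrupted run from the CI database.
import clickhouse_connect  # assumed to be available in tests/ci

def previous_results(commit_sha: str, check_name: str):
    client = clickhouse_connect.get_client(host="ci-db.example.com")  # placeholder host
    rows = client.query(
        "SELECT test_name, test_status FROM checks "
        "WHERE commit_sha = {sha:String} AND check_name = {check:String}",
        parameters={"sha": commit_sha, "check": check_name},
    ).result_rows
    failures = [name for name, status in rows if status == "FAIL"]
    passed = {name for name, status in rows if status == "OK"}
    return failures, passed
```

If failures is non-empty, the check could generate the report straight from these rows and skip the rerun; otherwise only the tests missing from passed would be executed.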

@alexey-milovidov
Member

@Felixoid, it is again very relevant. We should always pin all the dependencies.
Not doing it is insane.

@Felixoid
Member Author

Felixoid commented Mar 5, 2024

Comments about the automatic release process.

The system is designed to have a pair of eyes on the workflows launched after the release is created. This is stated explicitly in the logs:

INFO:root:To verify all actions are running good visit the following links:
  https://github.com/ClickHouse/ClickHouse/actions/workflows/release.yml
  https://github.com/ClickHouse/ClickHouse/actions/workflows/tags_stable.yml
INFO:root:To rollback the action run the following commands:
  git push git@github.com:ClickHouse/ClickHouse.git +b6.....:23.8
  gh release delete --yes --repo ClickHouse/ClickHouse 'v23.8.10.43-lts'
  git push -d git@github.com:ClickHouse/ClickHouse.git v23.8.10.43-lts
  git tag -d 'v23.8.10.43-lts'

A script to do it automatically doesn't make the whole system resilient. And lately we've had too many issues with it to discuss moving forward here. Either we lose the Release ready status completely, or it lies about the actual state of the binaries.

Until the workflows themselves are stable and reach some final state, the autorelease process is out of the question.

Another point: it's mentioned that the releases should be daily. That would overload two places at the same time: both git and the package repos would be flooded with unnecessary packages and tags. Currently the task should be executed once a week, so it is better to preserve the current behavior.

@Felixoid
Member Author

I would like to let you know about the docker images' rebuild.

After extensive work by @maxknv, we have idempotent docker tags based on the repo's state. Every docker image:tag is built only once per change and stays immutable.

Then, the only job that runs every night is tagging the current digest as latest. It completely solves the issue with untested updated images.
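
For context, a simplified sketch of how an idempotent tag can be derived from the repository state; the directory layout and tag length are illustrative, not the exact implementation:

```python
# Sketch: derive a deterministic docker tag from the image directory contents,
# so the tag changes only when the image definition itself changes.
import hashlib
from pathlib import Path

def image_tag(image_dir: Path) -> str:
    digest = hashlib.sha256()
    for path in sorted(image_dir.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(image_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

# e.g. clickhouse/stateless-test:<image_tag(Path("docker/test/stateless"))>
```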
