Staging system improvement proposal #257

nuclearcat · 2023-09-20T06:49:21Z

At the moment, staging for the legacy system is deployed on a fixed schedule, and the same deployment also deploys the API/pipeline. This setup was designed for the legacy system, on the assumption that deployment is very heavy and fragile due to the presence of various third-party components, such as Jenkins.
For example, even if a trivial PR is submitted, we need to wait for the next 8-hour deploy cycle to see whether it works. This considerably slows down development. We also have a "trusted users" mechanism, which is a bit cumbersome to manage if we have many infrequent contributors.
With the flexibility of the new API/pipeline, we can afford more frequent deployments.

This document describes a proposal for a new deployment system.

Terms:

GitHub workflow deploy: each project has its own testing via GitHub workflows. For example, kernelci-api does a full Docker deploy, including the database, Redis, etc., and runs e2e tests. This is not related to the staging deploy, but in some cases it might be sufficient for testing without triggering a staging deploy.
Partial staging deploy cycle: only deploy the API and pipeline, but do not run full "test" kernel builds and tests.
Full staging deploy cycle: deploy the API and pipeline, and run full "test" kernel builds and tests.

Remark: A full staging deploy cycle takes significant time and computing resources, so it should not be run too often. Let's define a minimum of 6 hours between full staging deploy cycles.

Decouple
- The API/pipeline staging script should be independent of the main kernelci-deploy scripts. We might still use tooling like pending.py, but it might also be a good time to improve such tools or even rewrite them over time, after all steps are completed.
Rewrite
- The API/pipeline staging script would be better implemented in Python. This allows for its own scheduling, an embedded webhook server, and direct interaction with the API itself, for example to properly drain the build queue before a staging redeploy.
- Also implement partial and full staging deploy cycles.
Event-driven
- Have the API/pipeline staging script accept webhooks from GitHub. This should significantly improve development speed, as we don't want to wait until the next workday for staging results.
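As a rough illustration of the embedded webhook server, here is a minimal sketch using only the Python standard library. It assumes the GitHub webhook is configured with a shared secret and the `application/json` content type. GitHub signs each delivery with an HMAC-SHA256 of the raw body in the `X-Hub-Signature-256` header and names the event in `X-GitHub-Event`. The names `WEBHOOK_SECRET`, `request_deploy`, and `main`, the port, and the choice of events are hypothetical and not part of the proposal.

```python
import hashlib
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the same secret is entered in the GitHub webhook settings.
WEBHOOK_SECRET = b"change-me"


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Verify GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), (header or "").encode())


def request_deploy(event: str, payload: dict) -> None:
    """Hypothetical hook: hand the event over to the staging scheduler."""
    print(f"staging deploy requested by a '{event}' event")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        signature = self.headers.get("X-Hub-Signature-256", "")
        if not signature_is_valid(WEBHOOK_SECRET, body, signature):
            self.send_response(403)  # reject unsigned or forged deliveries
            self.end_headers()
            return
        event = self.headers.get("X-GitHub-Event", "")
        if event in ("push", "pull_request"):
            request_deploy(event, json.loads(body))
        self.send_response(204)
        self.end_headers()


def main() -> None:
    # Hypothetical bind address and port.
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```

Calling `main()` starts the listener. Keeping signature verification in a separate function makes it easy to test without opening a socket.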
Proposal of logic:
- Do not run a full staging deploy cycle more often than every 6 hours. If a deploy is triggered by a webhook and the last full deployment was less than 6 hours ago, run only a partial staging deploy cycle.
- If the last deployment was more than 6 hours ago, check for updates and redeploy.
- If a deployment is in progress, restart it from the beginning. This helps when one developer pushed an update slightly earlier than another, so that both get accurate results. Do not restart more than once, and do not interrupt a full staging deploy cycle.
- If a webhook is received and the last deployment completed less than 30 minutes ago, schedule the deployment for when this 30-minute window expires to avoid redeploying too often. This is a throttling mechanism to avoid too many deployments when there is high activity on a PR.
- If a webhook is received and the last deployment completed more than 30 minutes ago, initiate deployment after a 5-minute "settle" time, allowing the developer to make "last-minute" PR fixups.
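The rules above can be sketched as a pure decision function, which keeps the timing logic testable without a real deployment. This is only one possible reading of the proposal: the names `StagingState`, `Decision`, and `on_webhook` are hypothetical, and the "defer" outcome (leave the change for the next cycle when a restart is not allowed) is an assumption, since the proposal does not say what happens to such events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

FULL_CYCLE_MIN_INTERVAL = timedelta(hours=6)  # minimum gap between full cycles
THROTTLE_WINDOW = timedelta(minutes=30)       # quiet period after a deployment
SETTLE_TIME = timedelta(minutes=5)            # room for "last-minute" fixups


@dataclass
class StagingState:
    last_full_deploy: Optional[datetime] = None  # when the last full cycle completed
    last_deploy_done: Optional[datetime] = None  # when any deployment last completed
    in_progress: Optional[str] = None            # "partial", "full" or None
    restarts: int = 0                            # restarts of the running deployment


@dataclass
class Decision:
    action: str                          # "start", "restart" or "defer"
    kind: Optional[str] = None           # "partial" or "full"
    start_at: Optional[datetime] = None  # when the deployment should begin


def choose_kind(state: StagingState, now: datetime) -> str:
    """Allow a full cycle only if none completed within the last 6 hours."""
    last = state.last_full_deploy
    if last is None or now - last >= FULL_CYCLE_MIN_INTERVAL:
        return "full"
    return "partial"


def on_webhook(state: StagingState, now: datetime) -> Decision:
    """Decide what to do when a GitHub webhook arrives at time `now`."""
    if state.in_progress is not None:
        # Never interrupt a full cycle, and restart at most once.
        if state.in_progress == "full" or state.restarts >= 1:
            return Decision("defer")  # assumption: handled by the next cycle
        return Decision("restart", "partial", now)
    kind = choose_kind(state, now)
    done = state.last_deploy_done
    if done is not None and now - done < THROTTLE_WINDOW:
        # Throttle: wait until the 30-minute window has expired.
        return Decision("start", kind, done + THROTTLE_WINDOW)
    return Decision("start", kind, now + SETTLE_TIME)
```

For simplicity the sketch picks the cycle kind when the webhook arrives, not when the deployment actually starts. A real scheduler might re-evaluate it at start time, and would also need a timer path for the "more than 6 hours ago" rule.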
Flexibility/Configurability
- Improve workflows
- Implement partial and configurable redeployment. When fetching a PR, check which files it touches. For example, if an update only affects test lists, there may be no need to rebuild Docker images for compilers, etc.
- Set triggers in PRs: for example, if one of the maintainers includes the phrase "@kcibot deploy" in a PR from a new (not verified) contributor, trigger testing without adding the contributor to trusted users.
- Advanced configurations can include triggers based on specific keywords in PRs, such as "@kcibot deploy" from admins triggering an immediate deployment with no artificial "test" kernel build or tests submitted (i.e., a short redeploy), while "@kcibot trigger" triggers both the deploy and a "test" kernel build with reference tests.
- Notify the user (in the PR?) whether their PR was tested fully or partially.
- Notify if a deployment failed, try to identify the "faulty" PR, and exclude it automatically.
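Two of the ideas above, choosing deploy steps from the files a PR touches and reacting to "@kcibot" comments, can be sketched as follows. The path patterns, step names, and the functions `steps_for` and `parse_command` are hypothetical. Mapping "deploy" to a partial cycle and "trigger" to a full cycle is one interpretation of the proposal, not a settled design.

```python
import re
from fnmatch import fnmatch

# Hypothetical rules: the first matching pattern decides which steps a file needs.
# Note that fnmatch's "*" also matches "/" in paths.
STEP_RULES = [
    ("docker/*", {"rebuild_images", "redeploy"}),
    ("config/*.yaml", {"reload_config"}),
    ("*", {"redeploy"}),
]


def steps_for(changed_files):
    """Return the union of deploy steps needed for a list of changed paths."""
    steps = set()
    for path in changed_files:
        for pattern, needed in STEP_RULES:
            if fnmatch(path, pattern):
                steps |= needed
                break  # first matching rule wins for this file
    return steps


COMMAND_RE = re.compile(r"@kcibot\s+(deploy|trigger)\b")


def parse_command(comment_body, author, maintainers):
    """Map a PR comment to a cycle kind, honouring only maintainers' commands."""
    match = COMMAND_RE.search(comment_body)
    if match is None or author not in maintainers:
        return None
    # "deploy": short redeploy only; "trigger": deploy plus "test" kernel build.
    return "partial" if match.group(1) == "deploy" else "full"
```

For example, `steps_for(["config/tests.yaml"])` yields only `{"reload_config"}`, so compiler images would not be rebuilt, and `parse_command("@kcibot deploy", "alice", {"alice"})` returns `"partial"`.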
