Mechanism to re-trigger a test run? (The idea and the specifications) #382

Closed
pypingou opened this issue Sep 27, 2017 · 14 comments

@pypingou
Contributor

Sometimes the CI pipeline fails due to a bug in the pipeline, an infrastructure issue, or an outage in one of the systems it depends on (like the small outage that dist-git had in Fedora 2 days ago).

When this happens, if gating is enabled, the update will end up being blocked in bodhi because its tests failed. Currently the only way around this would be to "waive" the test results in waiverdb. However, this should be a rel-eng only action, and it goes somewhat against the idea of CI (since we would be waiving the results not because the tests failed but because we failed to run them).

A better solution would be to have a way to re-trigger a pipeline run.

There is already a discussion that started on fedora-infra/bodhi#1779 to add a 'Re-run test' button to bodhi for taskotron tests. I believe we should try to come up with a solution that satisfies both taskotron and the Atomic CI pipeline.

My proposal on that PR was to have a dedicated service that would receive API calls and emit a fedmsg message with the information needed to re-run the tests.
(Having it as a separate service would allow integrating it in multiple places, such as bodhi and pagure.)
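
A rough sketch of how such a service could be decoupled from its callers. Everything here is an assumption for illustration: the `RerunService` class, the `ci.re-run` topic, the `requested_by` key, and the `publish` callable, which stands in for whatever actually sends the fedmsg message.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical topic; no name has been agreed on at this point.
RERUN_TOPIC = "ci.re-run"


class RerunService:
    """Accepts re-run requests from any caller and emits one message per request."""

    def __init__(self, publish: Callable[[str, Dict[str, str]], None]) -> None:
        # `publish` is injected so the service does not care how messages are sent.
        self._publish = publish

    def request_rerun(self, requested_by: str, details: Dict[str, str]) -> Dict[str, str]:
        # Copy the caller's details and record which front end asked
        # (for example "bodhi" or "pagure").
        body = dict(details)
        body["requested_by"] = requested_by
        self._publish(RERUN_TOPIC, body)
        return body


# Usage: collect messages in a list instead of sending them on a real bus.
sent: List[Tuple[str, Dict[str, str]]] = []
service = RerunService(lambda topic, body: sent.append((topic, body)))
service.request_rerun("bodhi", {"repo": "example-package", "branch": "f27"})
service.request_rerun("pagure", {"repo": "example-package", "branch": "master"})
```

Because the publisher is passed in, bodhi and pagure would both go through the same entry point and neither needs to know how the message is delivered.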

What do you think of this idea?

Could you describe what information you would need in a fedmsg message to re-trigger a run? (Keeping in mind that bodhi does not know the commit hash corresponding to a build, so we may only be able to provide the combo repository/branch).

@AdamWill Could you also describe what information you would like to have to re-trigger a test in taskotron?

Thanks :)

@arilivigni
Member

@pypingou I think this is possible if bodhi can send messages with the details from the failure (as it is doing for taskotron in fedora-infra/bodhi#1779). I assume that is essentially what the button for taskotron is doing, and we would want the same. The rev (commit), repo (package), and branch are probably what is needed at a minimum.

On our side we may have to add the CI_MESSAGE to our messages; it contains all of the above and more. Then bodhi can just resend that message and we can run it once again as we did the first time.
I am going to schedule a breakout session to discuss some of these types of issues now that the pipeline is actively being used.

Thanks for the issue, I think this helps the pipeline grow!

@pypingou
Contributor Author

> bodhi can just resend that message

I was thinking about this this morning and I am not sure relying on an existing message is wise.
Let's take the worst-case scenario: a build failed (or passed) and for some reason its corresponding fedmsg message got lost.
The update is now stuck at the gate because we do not know if the tests passed or failed. The only way to unstick it is to re-run the tests, i.e. to click on the "re-run" button.
However, we can't rely on an existing message for this, since the original message got lost and that is what got us into this situation to start with.

I think a potentially good solution would be to have a new message, with a topic something like org.fedoraproject.prod.ci.re-run.atomic-ci, containing the information bodhi knows: namespace, name, NVR and branch (we could include a git hash field, though it will be empty until https://pagure.io/koji/issue/550 is solved).
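
To make that concrete, here is a minimal sketch of such a message. The topic and the list of fields come from the paragraph above; the key names, the `build_rerun_message` helper, and the example package values are assumptions made up for illustration.

```python
from typing import Dict, Tuple

# Topic suggested above; the final name is still open.
RERUN_TOPIC = "org.fedoraproject.prod.ci.re-run.atomic-ci"


def build_rerun_message(
    namespace: str,
    name: str,
    nvr: str,
    branch: str,
    git_hash: str = "",
) -> Tuple[str, Dict[str, str]]:
    """Return the (topic, body) pair for a re-run request.

    git_hash defaults to an empty string because bodhi cannot currently
    map a build back to a commit.
    """
    body = {
        "namespace": namespace,
        "name": name,
        "nvr": nvr,
        "branch": branch,
        "git_hash": git_hash,
    }
    return RERUN_TOPIC, body


# Example values are invented; only the shape of the message matters here.
topic, body = build_rerun_message(
    "rpms", "example-package", "example-package-1.0-1.fc27", "f27"
)
```

Since the message is built from scratch, it does not depend on the original (possibly lost) message still being available.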

Food for more thoughts :)

@arilivigni
Member

That sounds reasonable. We just want to make sure we have the fields that are submitted as part of the dist-git message that you outlined: namespace, repo, rev. This is where I disagree that original_nvr_spec is critical it should be able to rerun with namespace, repo, rev. This makes it unique, which is why we store it as part of the nvr, which contains part of the rev (commit id).
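
A small sketch of the check this implies on the pipeline side, assuming the re-run message is a flat dictionary keyed by the three field names above. The `missing_rerun_fields` helper, the dictionary shape, and the example values are made up for illustration and are not the pipeline's actual code.

```python
from typing import Dict, List

# The three fields named above as the minimum needed to re-run.
REQUIRED_FIELDS = ("namespace", "repo", "rev")


def missing_rerun_fields(msg: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or empty in the message."""
    return [field for field in REQUIRED_FIELDS if not msg.get(field)]


# A message with all three fields can be re-run.
complete = {"namespace": "rpms", "repo": "example-package", "rev": "abc1234"}

# A message from a sender that cannot supply the commit leaves rev empty.
without_rev = {"namespace": "rpms", "repo": "example-package", "rev": ""}

complete_missing = missing_rerun_fields(complete)
without_rev_missing = missing_rerun_fields(without_rev)
```

With this check, a request that lacks the commit would be rejected rather than guessed at.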

@pypingou
Contributor Author

pypingou commented Oct 2, 2017

> This is where I disagree that original_nvr_spec is critical it should be able to rerun with namespace, repo, rev.

I don't think we disagree, but I just can't give you rev with the current tooling in place :)

@arilivigni
Member

Yes I believe we are! I think that is a requirement for this to work seamlessly IMHO. This way if there is a failure at any point a message can be sent to re-trigger things even with the separate topic as you mentioned.

@pypingou
Contributor Author

pypingou commented Oct 2, 2017

I very much agree with you that this is the ideal situation, but it relies on something that I cannot do today: we cannot go from a build to a git hash. We could if the koji ticket I linked to above was fixed, but atm it isn't :( So I'm trying to find the next-best thing, which is definitely less ideal :(

@arilivigni
Member

@pypingou Well, the answer can't be that we hack and do workarounds on the pipeline side. If this is an issue and needs to be added to Fedora infrastructure, then that avenue should be pushed and raised as a blocker.
So far it seems like the pipeline code is doing workarounds to make up for Fedora infrastructure limitations, and we won't be able to accommodate every request.

@pypingou
Contributor Author

pypingou commented Oct 2, 2017

Then I think we do have a blocker :)

@arilivigni
Member

@pypingou Can you add the issue here on the Fedora Infra side so we know it is being tracked?

@arilivigni
Member

This is blocked on https://pagure.io/koji/issue/550, which was requested 2 months ago, and the pipeline is not responsible for a workaround. Other related PR: fedora-infra/bodhi#1847

@pypingou
Contributor Author

pypingou commented Oct 4, 2017

Why is this closed? Is it fixed?

@arilivigni
Member

arilivigni commented Oct 4, 2017

Because there is nothing to do on our end. As I stated before, we need the changes in Fedora. Then a new message can be sent via a button in bodhi, like you are doing for taskotron. I am not sure what work you can do besides a hack. In that case I see no work for us to do atm.

@pypingou
Contributor Author

pypingou commented Oct 4, 2017 via email

@arilivigni
Member

arilivigni commented Oct 4, 2017

I didn't say you did, but something I have definitely stated is that issues in this repo are for issues with the pipeline.
If you already have a ticket somewhere else in pagure, that should be where the communication happens. I see no reason to track it in both places. Do you? This should also be raised at the leads meeting Dominik participates in. If there are tasks that we need to do, they will be addressed. I see no reason to use this GitHub issue on the ci-pipeline as a communication channel for Fedora infra limitations that need to be addressed.
