Mechanism to re-trigger a test run? (The idea and the specifications) #382

pypingou · 2017-09-27T11:12:35Z

Sometimes the CI pipeline fails due to a bug in the pipeline, an infrastructure issue, or an outage in one of the systems it depends on (like the small outage that dist-git had in Fedora 2 days ago).

When this happens, if gating is enabled, the update will end up being blocked in bodhi because its tests failed. Currently the only way around this would be to "waive" the test results in waiverdb. However, this should be a rel-eng only action and it goes a little against the idea of CI (since we would be waiving the results not because the tests failed but because we failed to run them).

A better solution would be to have a way to re-trigger a pipeline run.

There is already a discussion that started on fedora-infra/bodhi#1779 to add a 'Re-run test' button to bodhi for taskotron tests. I believe we should try to come up with a solution that satisfies both taskotron and the Atomic CI pipeline.

My proposal on that PR was to have a dedicated service that would receive API calls and emit a fedmsg message with the information needed to re-run the tests.
(Having it as a separate service would allow integrating it in multiple places, such as bodhi and pagure.)
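
As a rough illustration of the proposal above, the core of such a service could look something like the sketch below. Everything in it is hypothetical: the function name `handle_rerun_request`, the request fields, and the `ci.re-run` topic are assumptions made for the example, and the `publish` callable only stands in for whatever would actually emit the fedmsg message.

```python
import json

# Hypothetical topic; no name had been agreed on at this point.
RERUN_TOPIC = "ci.re-run"


def handle_rerun_request(request_body, publish):
    """Turn an API call into a re-run message.

    `request_body` is the JSON string a caller (e.g. bodhi or pagure) sends.
    `publish` is any callable taking (topic, msg); it stands in for the real
    message bus so that the sketch stays self-contained.
    """
    data = json.loads(request_body)
    # Assumed minimal fields: which repository and which branch to re-test.
    missing = [key for key in ("repo", "branch") if not data.get(key)]
    if missing:
        return {"status": 400, "error": "missing fields: " + ", ".join(missing)}
    msg = {"repo": data["repo"], "branch": data["branch"]}
    publish(RERUN_TOPIC, msg)
    return {"status": 200, "msg": msg}
```

Keeping the publisher as a parameter is just one way to let bodhi, pagure, or a test harness share the same entry point.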

What do you think of this idea?

Could you describe what information you would need in a fedmsg message to re-trigger a run? (Keeping in mind that bodhi does not know the commit hash corresponding to a build, so we may only be able to provide the combo repository/branch).

@AdamWill Could you also describe what information you would like to have to re-trigger a test in taskotron?

Thanks :)

arilivigni · 2017-09-27T12:58:17Z

@pypingou I think this is possible if bodhi can send the messages with the details from the failure (as it is doing for taskotron in fedora-infra/bodhi#1779). I assume that is essentially what the button for taskotron is doing and we would want the same. The rev (commit), repo (package) and branch are probably what is needed at a minimum.

On our side we may have to add the CI_MESSAGE to our messages; it contains everything I mentioned above and more. Then bodhi can just resend that message and we can run it once again as we did the first time.
I am going to schedule a breakout session to discuss some of these types of issues now that the pipeline is actively being used.

Thanks for the issue, I think this helps the pipeline grow!

pypingou · 2017-09-30T06:34:56Z

> bodhi can just resend that message

I was thinking about this this morning and I am not sure relying on an existing message is wise.
Let's take the worst-case scenario: a build failed (or passed) and for some reason its corresponding fedmsg message got lost.
The update is now stuck at the gate because we do not know if the tests passed or failed. The only way to unstick it is to re-run the tests, i.e. by clicking on the "re-run" button.
However, we can't rely on an existing message for this since the original message got lost and it's what got us in that situation to start with.

I think a potentially good solution would be to have a new message, with a topic something like org.fedoraproject.prod.ci.re-run.atomic-ci or so, containing the information bodhi knows: namespace, name, NVR and branch (we could include a git hash field, though it will be empty until https://pagure.io/koji/issue/550 is solved).
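
To make the shape of such a message concrete, here is a minimal sketch that only assembles the payload. The topic is the one floated above; the field names (`nvr`, `git_hash`, and so on), the helper `build_rerun_message`, and the example values are assumptions for illustration, not an agreed format.

```python
import json

# Topic suggested in the comment above ("or so", i.e. not final).
TOPIC = "org.fedoraproject.prod.ci.re-run.atomic-ci"


def build_rerun_message(namespace, name, nvr, branch, git_hash=None):
    """Assemble the message from what bodhi knows about an update."""
    return {
        "topic": TOPIC,
        "msg": {
            "namespace": namespace,
            "name": name,
            "nvr": nvr,
            "branch": branch,
            # Stays empty until a build can be mapped back to a commit.
            "git_hash": git_hash,
        },
    }


# Made-up example values, for illustration only.
message = build_rerun_message("rpms", "vim", "vim-8.0.1171-1.fc27", "f27")
print(json.dumps(message, indent=2, sort_keys=True))
```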

Food for more thoughts :)

arilivigni · 2017-10-02T13:55:49Z

That sounds reasonable. We just want to make sure we have the fields that are submitted as part of the dist-git message that you outlined: namespace, repo, rev. This is where I disagree that original_nvr_spec is critical it should be able to rerun with namespace, repo, rev. This makes it unique, which is why we store it as part of the nvr, which contains part of the rev (commit id).
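
A small sketch of what that requirement would mean for an incoming re-run message, assuming the field names are exactly `namespace`, `repo` and `rev` (the helper `missing_fields` is hypothetical and not part of the pipeline code):

```python
# Fields the pipeline is said to need in order to re-run a test.
REQUIRED_FIELDS = ("namespace", "repo", "rev")


def missing_fields(msg):
    """Return the required fields that are absent or empty in `msg`."""
    return [field for field in REQUIRED_FIELDS if not msg.get(field)]


# A message built only from what bodhi knows would lack the rev.
print(missing_fields({"namespace": "rpms", "repo": "vim", "rev": ""}))
```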

pypingou · 2017-10-02T14:07:48Z

> This is where I disagree that original_nvr_spec is critical it should be able to rerun with namespace, repo, rev.

I don't think we disagree, but I just can't give you rev with the current tooling in place :)

arilivigni · 2017-10-02T14:10:15Z

Yes I believe we are! I think that is a requirement for this to work seamlessly IMHO. This way if there is a failure at any point a message can be sent to re-trigger things even with the separate topic as you mentioned.

pypingou · 2017-10-02T17:55:56Z

I very much agree with you that this is the ideal situation, but it relies on something that I cannot do today (we cannot go from a build to a git hash; we could if the koji ticket I linked to above were fixed, but atm it isn't :( ), so I'm trying to find the next-best thing, which is definitely less ideal :(

arilivigni · 2017-10-02T17:58:37Z

@pypingou well, this can't be solved by us hacking and doing workarounds on the pipeline side. If this is an issue and needs to be added to Fedora infrastructure, then that avenue should be pushed and raised as a blocker.
So far it seems like the pipeline code is doing workarounds to make up for Fedora infrastructure limitations and we won't be able to accommodate every request.

pypingou · 2017-10-02T18:00:24Z

Then I think we do have a blocker :)

arilivigni · 2017-10-02T20:40:05Z

@pypingou Can you add the issue you have on the Fedora Infra side here so we know it is being tracked?

arilivigni · 2017-10-03T13:34:16Z

This is blocked on https://pagure.io/koji/issue/550, which was requested 2 months ago, and the pipeline is not responsible for a workaround. Other related PR: fedora-infra/bodhi#1847

pypingou · 2017-10-04T19:28:21Z

Why is this closed? Is it fixed?

arilivigni · 2017-10-04T19:30:18Z

Because there is nothing to do on our end; as I stated before, we need the changes in Fedora. Then a new message can be sent via a button in bodhi like you are doing for taskotron. I am not sure what work you can do besides a hack. In that case I see no work for us to do atm.

pypingou · 2017-10-04T19:58:33Z

On Wed, Oct 04, 2017 at 12:30:18PM -0700, Ari LiVigni wrote:
> Because there is nothing to do on our end as I stated before we need the changes in Fedora. then a new message can be sent via a button we are NOT working on this as I already stated

I don't think I ever asked you to work on this, did I? This ticket was opened as a communication channel to help brainstorm and come to an agreement about what to do and how to do it. If you think we have such an agreement then I guess it can be closed, and once the blockers are lifted we can open a new one. I know I would have left it open so that once the blockers are lifted the conversation can start again with the original context still present, but we all have our ways to manage projects, so as you prefer :)

arilivigni · 2017-10-04T20:32:44Z

I didn't say you did, but something I have definitely stated is that issues in this repo are for issues with the pipeline.
If you already have a ticket somewhere else in pagure, that should be where the communication happens. I see no reason to track it in both places. Do you? This should also be raised at the leads meeting Dominik participates in. If there are tasks that we need to do, they will be addressed. I see no reason to use this issue in github on the ci-pipeline as a communication channel for Fedora infra limitations that need to be addressed.

arilivigni closed this as completed Oct 4, 2017
