ci: models: avoid polling jobs waiting more than a week on the backend #1012

Open
wants to merge 1 commit into master

Conversation

@chaws (Collaborator) commented Dec 6, 2021

SQUAD has no way to tell whether a TestJob has actually been worked on by its backend. It might be that the device is out of service or that the backend is undergoing an unusually long maintenance. Over time, jobs in this scenario start clogging up the fetch queue, delaying other jobs waiting to be fetched.

I hardcoded the cutoff to one week, because that is the usual pattern I have observed, but it could be made a backend setting as well if requested.

Signed-off-by: Charles Oliveira <charles.oliveira@linaro.org>
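
For illustration, a minimal sketch of the cutoff described above. The function and field names are hypothetical, not SQUAD's actual code; only the hardcoded one-week threshold comes from this PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hardcoded one-week threshold, as described in the PR text.
MAX_WAIT = timedelta(days=7)

def should_keep_polling(submitted_at: datetime,
                        now: Optional[datetime] = None) -> bool:
    """Return False once a job has been waiting on its backend for over a week."""
    now = now or datetime.now(timezone.utc)
    return (now - submitted_at) <= MAX_WAIT

# Example: a job submitted 8 days ago would be dropped from the polling queue.
old_job = datetime.now(timezone.utc) - timedelta(days=8)
assert should_keep_polling(old_job) is False
```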
@chaws (Collaborator, Author) commented Dec 9, 2021

@mrchapp There are a few jobs in NXP that have been waiting for over a week in their LAVA instance (like this one: https://lavalab.nxp.com/scheduler/job/744848), and this PR acts exactly on this kind of job. Especially on NXP, there are old hanging jobs that take ~10 seconds just to get a response from the LAVA instance.

@mrchapp (Contributor) commented Dec 9, 2021

I'm thinking that we will eventually want to get those lagged results, even if only for data-mining purposes.

Can we ping the LAVA server first and decide, based on that, whether a round of fetching should be initiated? I guess what we want to avoid is the continual timeouts from an unresponsive server.
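
A rough sketch of that "ping first" idea: a plain HTTP GET with a short timeout is used here purely as an illustration, a real check might go through LAVA's API instead, and all names are hypothetical.

```python
import requests

def backend_is_responsive(base_url: str, timeout: float = 5.0) -> bool:
    """Quick reachability check so an unresponsive server does not stall a fetch round."""
    try:
        response = requests.get(base_url, timeout=timeout)
        return response.status_code < 500
    except requests.RequestException:
        return False

# e.g. only poll NXP jobs if backend_is_responsive("https://lavalab.nxp.com/")
```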

@chaws (Collaborator, Author) commented Dec 10, 2021

You have a good point. I think I will revisit the LAVA/SQUAD signals and have LAVA tell SQUAD when a job is ready for fetching. Sometimes jobs have a Submitted status, but sometimes they don't, and I don't fully understand why.

By default, SQUAD attempts to fetch jobs regardless of whatever signal LAVA sent about them.

One solution is to have SQUAD avoid polling jobs with status="Submitted" (like this NXP job). Then, whenever LAVA signals SQUAD that the job is ready, it will be queued and then fetched. The downside is that if the LAVA lab fails to notify SQUAD, the job will never be fetched.
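
A rough sketch of that approach: "Submitted" comes from the discussion above, but every other name here (job attributes, the notification hook, the enqueue_fetch callback) is hypothetical.

```python
SKIP_STATUSES = {"Submitted"}

def jobs_to_poll(jobs):
    """Keep only jobs whose last known status suggests the backend has started them."""
    return [job for job in jobs if job.job_status not in SKIP_STATUSES]

def on_lava_notification(job, new_status, enqueue_fetch):
    """Queue a fetch when LAVA signals that the job is ready.

    If the lab never sends this notification, the job is never fetched --
    the downside mentioned above.
    """
    if new_status != "Submitted":
        enqueue_fetch(job)
```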
