Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cupsd sometimes enters a infinite loop when timeout variable gets lower than current_time #604

Closed
zdohnal opened this issue Jan 30, 2023 · 8 comments · Fixed by #767
Closed
Assignees
Labels
bug Something isn't working priority-low
Milestone

Comments

@zdohnal
Copy link
Member

zdohnal commented Jan 30, 2023

Describe the bug
cupsd is in infinite loop, causing doing a certain operation (expiring subscriptions/deleting temp printers/removing job files) every second until the service is restarted. The trigger of the issue is unknown, but it happens when a timeout variable (f.e. expire_time or local_timeout) is lower than current_time, causing cupsd to run an operation every second.

To Reproduce
Steps to reproduce the behavior:
Unknown

Expected behavior
No infinite loop after a certain event

System Information:

  • seen in Fedora 36/37, NixOS
  • cupsd
  • CUPS version 2.4.2

Additional context
The additional data here #578 , however the complete error_log is difficult to provide, since we don't know what triggers the issue and once the issue finally shows itself, a ton of logs are provided and it is difficult to track what triggered it. I'll keep my eyes on this.

@zdohnal
Copy link
Member Author

zdohnal commented Jan 30, 2023

I've lowered the priority, since it is not hit every time you run cupsd and there is a workaround.

@clefru
Copy link

clefru commented Feb 4, 2023

TL:DR; Old job files are not cleaned up, keeping the scheduler awake with unsuccessful cleanup attempts.


I traced this a bit deeper with gdb and found the following:

First, select_timeout always returns 1. That is because, the JobHistoryUpdate variable is smaller than timeout, so timeout=JobHistoryUpdate gets executed in the following block:

  if (JobHistoryUpdate && timeout > JobHistoryUpdate)
  {
    timeout = JobHistoryUpdate;
    why     = "update job history";
  }

Note: JobHistoryUpdate is stuck and will never get updated, but more on that later. At the end of the function, the value of timeout is much smaller than now and the following clipping logic kicks in:

  timeout = timeout - now + 1;

  if (timeout < 1)
    timeout = 1;

fixing timeout to 1.

Then in the main scheduler loop, we find another check for JobHistoryUpdate and we call out to cupsdCleanJobs. We do so every second. This function is the real culprit not "Expiring subscriptions."

    if (JobHistoryUpdate && current_time >= JobHistoryUpdate)
      cupsdCleanJobs();

Let's talk about my suspected culprit. cupsdCleanJobs iterates over all jobs and recomputes JobHistoryUpdate to the history_time or file_time of the oldest job. It subsequently seems to contain logic to clean up old jobs. Except, this logic never triggers for me.

I've set a breakpoint on cupsdCleanJobs (to see the function entry, which triggers nicely) and the cleanup functions cupsdDeleteJob and remove_job_files. Neither of them is ever called. The following conditional guards the cleanup logic:

    if (job->state_value >= IPP_JOB_CANCELED && !job->printer)

The conjunction with !job->printer looks odd. Is this supposed to mean "when the job has no printer associated anymore, e.g. the job's printer has been deleted"? Because that's rarely going to evaluate to true as most old jobs will still have a printer associated with it. (I rarely delete my printers...)

Browsing around in my web UI shows that most of my completed jobs still have printers "associated" with them. At least there is an href pointing to the printer. My oldest job is from 2020 and I did delete a printer since then, and because jobs from those deleted printers don't exist anymore, I assume that only for those jobs, the cleanup logic worked.

Conclusion: It feels like dropping && !job->printer from the conditional is the appropriate fix. I don't see why an old job should be exempt from cleanup, just because the printer, it was printed to, still exists.

@michaelrsweet
Copy link
Member

@clefru The !job->printer means "not active on a printer". The dest member provides the name of the destination, but the printer member provides a pointer to the printer that is assigned the job (think jobs sent to classes).

@yurkobb
Copy link

yurkobb commented Apr 10, 2023

I've lowered the priority, since it is not hit every time you run cupsd and there is a workaround.

What is the workaround? (sorry for asking).

@clefru
Copy link

clefru commented Apr 29, 2023

Without making cupsdCleanJobs() actually clean old jobs, CUPS will continue to wake up every second to "clean jobs", but not actually do anything, only to repeat the same process a second later. That's a waste of CPU leading to battery drains on laptops by unproductive wakeups.

Has anyone looked into cupsCleanJobs a bit more in detail to see if there actually a bug in there? Or is the JobHistoryUpdate variable miscomputed somehow?

@zdohnal
Copy link
Member Author

zdohnal commented May 18, 2023

I've lowered the priority, since it is not hit every time you run cupsd and there is a workaround.

What is the workaround? (sorry for asking).

Usually restarting the service helps. If it does not, I clean up /var/spool/cups of d and c files.

@michaelrsweet michaelrsweet removed the unable-to-reproduce Unable to reproduce label Jun 2, 2023
@michaelrsweet michaelrsweet self-assigned this Jun 2, 2023
@michaelrsweet michaelrsweet modified the milestones: v2.5, v2.4.x Jun 2, 2023
zdohnal added a commit to zdohnal/cups that referenced this issue Jul 27, 2023
Based on currently unknown trigger scheduler sometimes sets
JobHistoryUpdate into past, which causes `select()` to timeout after one
second.

It happens when `job->file_time` of a job without files to remove gets
assigned to `JobHistoryUpdate`. If we check for `job->num_files` and
assign the `job->file_time` only if there are any, we will fix extensive
logging (and unneeded cupsd execution) in various places, e.g. cleaning
jobs, expiring subscriptions, deleting temporary queues...

Fixes OpenPrinting#604
@zdohnal
Copy link
Member Author

zdohnal commented Jul 27, 2023

@clefru thanks for the investigation! I've found time to look into it too (or rather I got annoyed that I saw it again :) ) - I think we have to check job->num_files when we assign time into JobHistoryUpdate in cupsdCleanJobs() - because we want to run the function when there is a file which is going to be removed here.

I'll create PR for it.

zdohnal added a commit that referenced this issue Jul 27, 2023
Based on currently unknown trigger scheduler sometimes sets JobHistoryUpdate into past, which causes select() to timeout after one second.

It happens when job->file_time of a job without files to remove gets assigned to JobHistoryUpdate. If we check for job->num_files and assign the job->file_time only if there are any, we will fix extensive logging (and unneeded cupsd execution) in various places, e.g. cleaning jobs, expiring subscriptions, deleting temporary queues...

Fixes #604
@clefru
Copy link

clefru commented Jul 30, 2023

@zdohnal Thank you for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority-low
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants