Skip to content

Fixed unreliable DNS-SD discovery with ippfind -T 0#1211

Closed
alexpevzner wants to merge 1 commit intoOpenPrinting:2.4.xfrom
alexpevzner:2.4.x-ippfind-not-reliable
Closed

Fixed unreliable DNS-SD discovery with ippfind -T 0#1211
alexpevzner wants to merge 1 commit intoOpenPrinting:2.4.xfrom
alexpevzner:2.4.x-ippfind-not-reliable

Conversation

@alexpevzner
Copy link
Copy Markdown
Member

The driverless backend uses the ippfind tool for DNS-SD discovery. When invoking ippfind, it uses the -T 0 option to maximize speed.

In this mode, ippfind follows a heuristic: if no new events are received from Avahi for 500 milliseconds and all discovered services have already been resolved, it assumes discovery is complete and exits.

However, Avahi sometimes responds too slowly, meaning a 500 ms silence does not necessarily indicate that no more services will be discovered shortly. As a result, ippfind may exit prematurely, increasing the likelihood of missing devices.

This issue occurs particularly often when only the loopback interface is active, and the only available devices are those published by the ipp-usb daemon.

To address this, the fix increases the minimum search time to 2500 milliseconds. This change has minimal impact on normal operations, since discovery typically takes 2000–2500 ms when devices are found, while eliminating unreliability caused by early termination.

The driverless backend uses the ippfind tool for DNS-SD discovery. When
invoking ippfind, it uses the -T 0 option to maximize speed.

In this mode, ippfind follows a heuristic: if no new events are received
from Avahi for 500 milliseconds and all discovered services have already
been resolved, it assumes discovery is complete and exits.

However, Avahi sometimes responds too slowly, meaning a 500 ms silence
does not necessarily indicate that no more services will be discovered
shortly. As a result, ippfind may exit prematurely, increasing the
likelihood of missing devices.

This issue occurs particularly often when only the loopback interface is
active, and the only available devices are those published by the
ipp-usb daemon.

To address this, the fix increases the minimum search time to 2500
milliseconds. This change has minimal impact on normal operations,
since discovery typically takes 2000–2500 ms when devices are found,
while eliminating unreliability caused by early termination.
@alexpevzner
Copy link
Copy Markdown
Member Author

The initial problem was that system-config-printer began to work VERY unreliable.

Sometimes it doesn't see network printers at all (or ipp-usb exported printers). Sometimes the printer is found, but driverless driver is not offered.

It happens on Fedora 41, it happens on our ROSA Linux, probably it happens everywhere.

I cannot figure out what has caused this change. My guess, something has changed in Avahi. This patch helps.

@michaelrsweet michaelrsweet self-assigned this Apr 1, 2025
@michaelrsweet michaelrsweet added the investigating Investigating the issue label Apr 1, 2025
@michaelrsweet
Copy link
Copy Markdown
Member

I'm not sure this is the correct fix, and will do some investigation.

That said, the "driverless" backend shouldn't be running ippfind - either use Avahi's C API to browse or (for CUPS 2.5 and later) use the corresponding cupsDNSSD APIs, which is what ippfind does. And that assumes that we want to continue using the driverless backend which is convoluting a bunch of things that cupsd already supports in CUPS 2.4...

@alexpevzner
Copy link
Copy Markdown
Member Author

This PR modifies ippfind to perform discovery in a manner more similar to how sane-airscan handles it.

Please note that Avahi's search completion indication is highly unreliable. If a search takes too long, Avahi may send an AVAHI_BROWSER_FAILURE event, which merely indicates that the planned time interval has ended—only to follow it shortly with an AVAHI_BROWSER_NEW event.

Additionally, the real semantics of AVAHI_BROWSER_ALL_FOR_NOW remain unclear for me. Based on observation, it seems Avahi only uses this event to indicate that it hasn't detected any new events in the past ~1 second. However, some printers may take longer than that to respond.

From my experience, the most reliable method for DNS-SD discovery with Avahi is to allocate sufficient time (with an extra 1–2 seconds as a buffer if the last event arrives just before the initial timeout expires).

In sane-airscan, I use a 2.5-second delay for this. However, this is already a bare minimum—if many SANE backends are enabled on the system, their initialization takes annoyingly long before sane-airscan can begin its own. Ideally, I would prefer a 5-second search timeout for better reliability, but the resulting delay becomes too intrusive for users.

CUPS doesn’t suffer from this issue, so a 5-second timeout might be a better tradeoff between search speed and reliability. That said, based on sane-airscan's experience, 2.5 seconds is usually sufficient.

@tillkamppeter
Copy link
Copy Markdown
Member

driverless is running ippfind to use its filtering capabilities, to tell apart whether an IPP service is actually driverless or whether it is some legacy service. It worked well for long time. running ippfind alone probably shows the same problems (@alexpevzner did you test that?).

@alexpevzner
Copy link
Copy Markdown
Member Author

running ippfind alone probably shows the same problems (@alexpevzner did you test that?).

Yes, sure.

Тhat functionality used to work reliably for a long time. However, something broke recently, and I haven’t been able to pinpoint the exact change responsible. The issue reproduces on Fedora 41 with all updates installed—surprisingly, especially when the machine is offline (with ipp-usb). The first response from Avahi arrives too late: ippfind completes quickly but returns no discovery results.

That said, Avahi never guaranteed that 500 ms of silence should be treated as the end of discovery—relying on that is inherently unreliable.

For example, the same device that system-config-printer often fails to detect (sometimes requiring 5 retries or so) is consistently discovered by sane-airscan, which waits for 2500 ms.

The PR I’m proposing is simple, low-risk, and easy to review. It addresses a real, observable issue, so I’d appreciate it if it could be accepted.

@michaelrsweet
Copy link
Copy Markdown
Member

@alexpevzner

The PR I’m proposing is simple, low-risk, and easy to review. It addresses a real, observable issue, so I’d appreciate it if it could be accepted.

With respect, simple timing related fixes often mask a bigger problem. Let's do a little investigation to make sure we are actually fixing things.

Can you try (on the same system) building the current CUPS master (2.5b1) code and run "tools/ippfind-static -T 0" to see if it reproduces with the new DNS-SD code?

@zdohnal
Copy link
Copy Markdown
Member

zdohnal commented Apr 2, 2025

IMO this is related/dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1954469 and #620 and avahi/avahi#442 .

tl, dr: There is no reliable marker for "we're done" in mDNS, especially with Avahi. Additionally, Avahi dBUS API really generates two dBUS messages for both answers, but "sometimes" the answers get bundled into one syscall, where the first answer is empty, and the next part contains answer for the other Avahi call. Usually it happened for me since my printer at home uses IPP only (no TLS), and since ipp-usb registers the services on plain IPP too, I imagine it behaves the same - though I did not see it often, since I use CUPS temporary queues and if I have to create permanent queue, I take URL from lpstat -l -e and use it for lpadmin, so I do not use driverless from cups-filters that much.

My plan was to find some workaround for this glibc behavior, but now I got an idea - what about driverless would separate the requests? First run would be for IPPS service search, if not found, run IPP service search?

@alexpevzner
Copy link
Copy Markdown
Member Author

Let's do a little investigation to make sure we are actually fixing things.

Agree.

I've investigated things a little bit. I need to say that initial problem (driverless unreliably) is the very floating error. Yesterday it didn't work at all, today it seems to work well. Nothing has changed in the system configuration.

Can you try (on the same system) building the current CUPS master (2.5b1) code and run "tools/ippfind-static -T 0"

First of all, it is broken. It sends "_ipp._tcp.local" to avahi_service_browser_new and Avahi doesn't accept it. The error is "Unable to create DNS-SD browse request: Invalid service type".

After hot-fixing that with replacing the typename with "_ipp._tcp", it seems to work reliable and surprisingly very fast (2-3 times faster that the old ippfind). I definitely need to steal some ideas from that to airscan :-)

After that experiment, I've instrumented the old ippfind by adding the following printf lines:

In the poll_callback:

+  if (val > 0) {
+    printf("poll: got data; avahi_got_data=1\n");
     avahi_got_data = 1;
+  }

In event loop in the main function:

 #elif defined(HAVE_AVAHI)
+    printf("poll: want data; avahi_got_data=0\n");
     avahi_got_data = 0;
 
     if (avahi_simple_poll_iterate(avahi_poll, 500) > 0)

And in the browse_callback, in the AVAHI_BROWSER_NEW handler:

        service = get_service((cups_array_t *)context, name, type, domain);
 
+printf("AVAHI_BROWSER_NEW, avahi_got_data=%d\n", avahi_got_data);
+
        if (flags & AVAHI_LOOKUP_RESULT_LOCAL)
          service->is_local = 1;

And with these changes applied and with the disconnected network, I see the following trace:

poll: want data; avahi_got_data=0
poll: got data; avahi_got_data=1
poll: want data; avahi_got_data=0
poll: got data; avahi_got_data=1                     <<<< poll_callback; some data arrived
poll: want data; avahi_got_data=0                  <<<< more data wanted, avahi_got_data reset
AVAHI_BROWSER_NEW, avahi_got_data=0    <<<< browser_callback
poll: want data; avahi_got_data=0
poll: got data; avahi_got_data=1
poll: want data; avahi_got_data=0
poll: got data; avahi_got_data=1
poll: want data; avahi_got_data=0
poll: got data; avahi_got_data=1
poll: want data; avahi_got_data=0
poll: got data; avahi_got_data=1
poll: want data; avahi_got_data=0
ipp://misa.local:60000/ipp/print
poll: want data; avahi_got_data=0

Please notice, the AVAHI_BROWSER_NEW event comes with the zeroed avahi_got_data flag. It is not always happens this way, sometiles avahi_got_data=1 when captured from the browser_callback.

So looks like the positive poll() system call response is not a reliable indication that there are some events has been received from Avahi; events may come without that.

To understand how it could happen, the deeper lock into the libavahi implementation is required, but I didn't do that.

@alexpevzner
Copy link
Copy Markdown
Member Author

alexpevzner commented Apr 2, 2025

So I think for now:

  1. avahi_got_data = 1 needs to be moved from the poll_callback into the browse_callback, into the AVAHI_BROWSER_NEW handler.

  2. My patch that sets the lower bound of the discovery time is still needed, because it covers cases when printer firmware is too slow to respond or packet was lost on a network (especially wifi).

Let me know if some more investigations are required.

@zdohnal
Copy link
Copy Markdown
Member

zdohnal commented Apr 2, 2025

@alexpevzner as I wrote in my comment, IMO the problem is there is no really reliable milestone/status in mDNS where we can say we have the whole answer - see the investigation in the tickets I've shared.

@alexpevzner
Copy link
Copy Markdown
Member Author

@zdohnal,

I've read your comment and links you've shared. This is very interesting and definitely related to the problem we are discussing here.

After some thinking/analysis I've came to conclusion that actually we have two separate problems there. They look similar by their symptoms, but these problems are actually different.

One problem is that when searching the real network, we don't have the reliable "end of discovery" indication. It happens not because of Avahi bugs, but because there is no such thing in the underlying DNS-SD protocol. Only some time-based approach can work here, giving some level of accuracy. Avahi tries to help us with the AVAHI_BROWSER_ALL_FOR_NOW/AVAHI_BROWSER_FAILURE events, but they are timer-based and mean nothing more that chosen by Avahi guys timeout has expired.

The second problem that we loose even the locally-published (through the ipp-usb) devices. At this case we barely read the Avahi cache; no actual network discovery is performed. The AVAHI_BROWSER_CACHE_EXHAUSTED event is the reliable indication that all the relevant cache entries has been received.

Our problem here is that we use to the poll() system call output as the indication of incoming events. Actually, Avahi has never promised us in their public interface that the poll() can be used this way. We just should not be doing that.

So @michaelrsweet is correct here in his answer: my patch mitigates the first problem (it cannot be solved completely, but 500 ms for network discovery is not enough) and only masks the second problem.

However, taking in account that in the CUPS 2.5 this code is totally reworked and poll() output is not used anymore as the end-of-discovery indication, may be it makes sense to apply my patch for the 2.4 as a temporary solution of the problem that looks to be solved "normally" in 2.5?

@michaelrsweet
Copy link
Copy Markdown
Member

@alexpevzner Actually, I was thinking to fix the underlying Avahi code in 2.4.x, based on the 2.5/3.0 Avahi/DNS-SD code we have and know is working.

@alexpevzner
Copy link
Copy Markdown
Member Author

@michaelrsweet,

I understand you.

I still believe that setting avahi_got_data to 1 when AVAHI_BROWSER_NEW is received (rather than waiting for poll() to return >0, as it currently does) in addition to the changes proposed in this PR would be a correct and simple fix.

Your new implementation looks very promising, but it has not been widely deployed yet. As a result, it’s hard to predict what issues might arise once it’s in broader use.

Additionally, the new implementation seems too fast. In my setup, it completes network printer discovery in less than a second, whereas the old implementation takes 2–2.5 seconds for successful discovery.

I’m concerned that not all printers—especially those on Wi-Fi, where packet loss can occur—will respond that quickly.

When time permits, I’ll examine your new code in more detail. For sane-airscan, fast discovery is essential, and I’m curious how you achieved such rapid responses. However, for printing, speed is less critical.

The worst-case scenario would be if a printer appears in the list of available network printers but fails to be rediscovered later. This could lead the system to omit the driverless driver option, directing users to manually select from a list of available drivers—where they might pick an incorrect one (though this may never happen if Avahi cache reads are always reliable in the new implementation).

michaelrsweet added a commit that referenced this pull request Apr 4, 2025
… (Issue #1211)

- Export _cupsGetClock private API.
- Use _cupsGetClock in ippfind.
- Drop "avahi_have_data" and adopt CUPS 2.5/3.0 processing strategy.
@michaelrsweet
Copy link
Copy Markdown
Member

@alexpevzner Try the latest changes:

[2.4.x 8304d6b] Fix "unreliable" discovery with ippfind as used by driverless backend (Issue #1211)

Still not as fast as 2.5/3.0 for some reason, but seems to work reliably...

@michaelrsweet
Copy link
Copy Markdown
Member

Actually, I think the difference in performance is that 2.5/3.0 process DNS-SD updates in a separate thread. Just pushed another change to remove the extra usleep...

@alexpevzner
Copy link
Copy Markdown
Member Author

@michaelrsweet

Try the latest changes:

It is not reliable. I've enabled the network, connected printer to the ipp-usb (so printer was simultaneously available via the IPP over USB and via the normal network) restarted Avahi (to purge the cache) and tried.

For the first time, it was running for 0.3 s and has found the ipp-usb version of printer only. For the second time it was running 1.6 seconds and found both. Third and subsequent runs executed quickly and returns both versions of the printers, but here cache is returned.

So I think the minimal search time needs to be enforced (and to be 2.5 seconds at least).

@zdohnal
Copy link
Copy Markdown
Member

zdohnal commented Apr 7, 2025

IIUC this is no longer valid, the current work is tracked here - #1214 . If I don't get it right, I can reopen the PR again.

@zdohnal zdohnal closed this Apr 7, 2025
@alexpevzner
Copy link
Copy Markdown
Member Author

I’d probably prefer to keep it open while #1214 is still ongoing, just to maintain visibility—since the first half of the discussion is here...

@zdohnal
Copy link
Copy Markdown
Member

zdohnal commented Apr 7, 2025

There are links to this PR in the newer one, so the discussion won't go anywhere. I just take it if the code is not going to be merged (AFAIK the newer one will be the final MR for the problem), then it should be closed. Probably we hit a difference in meaning between issue x merge request.

@michaelrsweet
Copy link
Copy Markdown
Member

[2.4.x 3c710ad] Tweak end-of-scan condition (Issue #1214, Issue #1211)

@michaelrsweet michaelrsweet added bug Something isn't working priority-medium and removed investigating Investigating the issue labels Apr 7, 2025
@michaelrsweet michaelrsweet added this to the v2.4.x milestone Apr 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working priority-medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants