Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

endless poll loop in http_write when POLLHUP is returned #827

Closed
vliaskov opened this issue Nov 17, 2023 · 22 comments
Closed

endless poll loop in http_write when POLLHUP is returned #827

vliaskov opened this issue Nov 17, 2023 · 22 comments
Labels
bug Something isn't working priority-low unable-to-reproduce Unable to reproduce waiting for reporter There are data requested from the reporter
Milestone

Comments

@vliaskov
Copy link
Contributor

Describe the bug

Customer observes that cupsd occasionally gets stuck in a poll() loop. strace shows:

poll([{fd=30, events=POLLOUT}], 1, 10000) = 1 ([{fd=30, revents=POLLOUT|POLLHUP}])
poll([{fd=30, events=POLLOUT}], 1, 10000) = 1 ([{fd=30, revents=POLLOUT|POLLHUP}])
poll([{fd=30, events=POLLOUT}], 1, 10000) = 1 ([{fd=30, revents=POLLOUT|POLLHUP}])

If POLLHUP is received (i.e. the connection's other end was closed?), should we continue polling? Should POLLHUP be checked in the revents to break the loop? I haven't yet found an obvious upstream fix for this, so I am opening the bug.

Backtrace of the cupsd process:

(gdb) bt
#0  0x00007f828cb92963 in ?? () at ../sysdeps/wordsize-64/../../io/fts.c:1062 from /lib64/libc.so.6
#1  0x00007f828cf2d903 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7ffeabd1c3d8) at /usr/include/bits/poll2.h:46
#2  http_write (http=http@entry=0x557afca93d00, buffer=buffer@entry=0x7ffeabd1c420 "20b\r\n", length=5) at http.c:4690
#3  0x00007f828cf2daa4 in http_write_chunk (http=0x557afca93d00, 
    buffer=0x557afca964f8 "/printers/bamp0027ls\">bamp0027ls</A></TD><TD>bamp0027ls</TD><TD>BAM</TD><TD>LaserSoft printer</TD><TD>bampadm - retry-current-job</TD><TD>Idle</TD></TR>\n\n<TR><TD><A HREF=\"/printers/bamp0028\">bamp0028<"..., length=523) at http.c:4820
#4  0x00007f828cf2de35 in httpFlushWrite (http=0x557afca93d00) at http.c:704
#5  0x0000557aea021a79 in cupsdWriteClient (con=0x557afcd38560) at client.c:2727
#6  0x0000557aea04adfb in cupsdDoSelect (timeout=timeout@entry=1) at select.c:489
#7  0x0000557aea019b5e in main (argc=<optimized out>, argv=<optimized out>) at main.c:831

To Reproduce

Unfortunately, I am still not clear on how to reproduce this. Probably clients need to be closed while cupsd is sending data to them. I will update if I have an exact reproducer from the customer.

If there are ideas on how to stress test this let me know. Is there an existing testcase that would stress this scenario?

Expected behavior
Cups should not keep polling indefinitely if the other side has hung up.

System Information:

  • OS and its version: SLE15-SP4, x86_64
  • CUPS version 2.2.7 (2.2.7-150000.3.46.1)
@zdohnal
Copy link
Member

zdohnal commented Nov 20, 2023

Hi @vliaskov ,

actually we have a RHEL customer experiencing issues with CUPS due the same reason - poll() gets POLLOUT|POLLHUP from kernel (in their case CUPS API function fails to connect to local cupsd after the first successful job) - I have a patch verified by the customer and I will send PR this week.
@michaelrsweet the core of the patch is basically similar as we did in #156 for Solaris - check the socket option SO_ERROR in case we get combinations POLLOUT|POLLHUP or POLLIN|POLLHUP and if it is fine, remove the POLLHUP from returned poll events and proceed with clear POLLOUT/POLLIN. I've looked whether such event combinations are expected to come from kernel, and though POLLIN|POLLHUP are quite common and valid combinations, POLLOUT|POLLHUP was not allowed in older versions of poll() man pages, but it is not mentioned in the current man page anymore, so it looks like a new allowed combination as well.
I got this idea how to handle such issues from more senior colleague, but he is not from Kernel team - I can try to reach out Kernel team if you think it is a kernel issue.

@zdohnal zdohnal added bug Something isn't working priority-medium unable-to-reproduce Unable to reproduce labels Nov 20, 2023
@michaelrsweet
Copy link
Member

@zdohnal If the POLLHUP is showing up in revents before all data has been read from the socket, IMHO that would be a bug or at least inconsistent with other Unix operating systems. Happy to review whatever fix you come up with...

@zdohnal
Copy link
Member

zdohnal commented Nov 22, 2023

@vliaskov actually the revents are the same, but it happens in a different place for you - can you compile CUPS with debug-printfs, set cupsd to use debug printfs (by passing CUPS_DEBUG_LOG (location of log file), CUPS_DEBUG_FILTER (which messages to get - .*) and CUPS_DEBUG_LEVEL (0-9 - set it to 9) env variables to cupsd), enable debug logging via LogLevel in cupsd.conf and strace the cupsd when it happens?
In the end please attach strace log and file with debug printfs. Thank you!

@zdohnal zdohnal added the waiting for reporter There are data requested from the reporter label Nov 22, 2023
@zdohnal
Copy link
Member

zdohnal commented Nov 22, 2023

From the logs you showed it looks poll() returns nfds, so we shouldn't iterate in http_write with poll() callings... other polls can happen in send() and seeing debug_printfs can help.

@vliaskov
Copy link
Contributor Author

thank you for the input and suggestions @zdohnal . Sorry for the delay, I was on vacation. I will ask the customer for logs (they already had some but I think that the CUPS_DEBUG_FILTER .* will provide more info). Unfortunately the customer has a massive amount of queues and they claim that even CUPS_DEBUG_LEVEL=8 (debug) is not feasible for them without creating huge files, let alone CUPS_DEBUG_LEVEL=9 (debug2) . CUPS_DEBUG_LEVEL=7 (info) will be possible, not sure about the higher levels. I am also not sure what the file size limit is for github attachments.

By the way, if there is a draft of the patch mentioned in #827 (comment) that can be used for testing, please point me to it. Although as you say the poll hangs happen at a different place in this case, so the patch is perhaps not covering this case...

@zdohnal
Copy link
Member

zdohnal commented Nov 29, 2023

@vliaskov hmm... can you patch the debug_printfs in http_write() to be used in info level? I'm primary interested in those messages, so I would like to see their output - and then enable debug printfs for cupsd for CUPS_DEBUG_LEVEL=7?

Additionally, just to be sure it would be great to add new debug_printfs at info level in http_write to get more idea about the execution.
Please a debug printfs in similar way (pseudocode, I haven't compile it, but it demonstrate where to put the printfs - line numbers are from current upstream master):

http_write():
+              int i = 1;
4253       do
4254       {
4255         pfd.fd     = http->fd;
4256         pfd.events = POLLOUT;
4257 
+                int k = 1;
4258         do
4259         {
4260           nfds = poll(&pfd, 1, http->wait_value);
+                  debug_printf("7http_write: in poll loop - run %d.%d, nfds %d, errno %s", i, k++, nfds, strerror(errno));
4261         }
4262 #  ifdef _WIN32
4263         while (nfds < 0 && (WSAGetLastError() == WSAEINTR || WSAGetLastError() == WSAEWOULDBLOCK));
4264 #  else
4265         while (nfds < 0 && (errno == EINTR || errno == EAGAIN));
4266 #  endif // _WIN32
4267 
4268         if (nfds < 0)
4269         {
+                  debug_printf("7http_write: %d nfds %d, errno %s", i, nfds, strerror(errno));
4270           http->error = errno;
4271           return (-1);
4272         }
4273         else if (nfds == 0 && (!http->timeout_cb || !(*http->timeout_cb)(http, http->timeout_data)))
4274         {
4275 #ifdef _WIN32
4276           http->error = WSAEWOULDBLOCK;
4277 #else
4278           http->error = EWOULDBLOCK;
4279 #endif // _WIN32
+                  debug_printf("7http_write: %d nfds %d, errno %s", i, nfds, strerror(http->error));
4280           return (-1);
4281         }
+                debug_printf("7http_write: end of loop %d for getting nfds", i++);
4282       }
4283       while (nfds <= 0);
4284     }
4285 
4286     if (http->tls)
+            {
+              debug_printf("7http_write: send via TLS");
4287       bytes = _httpTLSWrite(http, buffer, (int)length);
+            }
4288     else
+            {
+              debug_printf("7http_write: send unencrypted");
4289       bytes = send(http->fd, buffer, length, 0);
+             }
...
4310         continue;
4311       }
4312 
4313 #else
+              debug_printf("7http_write: error writing data (%s).", strerror(errno));
4314       if (errno == EINTR)
4315       {
4316         continue;
4317       }

Additionally, just to be sure in case if a TLS library is in play, add/change debug printf in _httpTLSWrite() CUPS implementation for your TLS library - it would be interesting to see the result and errno coming from TLS function such as gnutls_record_send() (but this depend on what TLS library you use).

After that, enable cupsd logging at debug2 (LogLevel debug2 in /etc/cups/cupsd.conf), restart it and attach strace (strace -p <cupsd pid> -v -f -tt -T -yy -Y -o cupsd.strace -s 4092).

It would be great to see all those files, because it will give us idea what's going on - IMO either glibc (send()) or TLS library (depends on what library you use to compile with CUPS in SUSE - it is GNUTLS in RHEL) does not take kindly to a descriptor with POLLHUP.

Ad the patch - the core of the patch is to check SO_ERROR socket option on the descriptor (the same way as we do in cups/http-addrlist.c, httpAddrConnect2() for __sun), and if there is no error there, remove the POLLHUP from revents, which is later checked by the function. But we don't check revents in http_write(), so such patch is useless IMHO...

@vliaskov
Copy link
Contributor Author

Thank you. Btw, do you mean to use debug_printf or DEBUG_printf in the patch above? Just making sure we use the desired debug function .

We also use GNUTLS, there are already DEBUG_printf calls for the errno, do you want changes below?

int                                     /* O - Bytes written */
_httpTLSWrite(http_t     *http,         /* I - Connection to server */
              const char *buf,          /* I - Buffer holding data */
              int        len)           /* I - Length of buffer */
{
  ssize_t       result;                 /* Return value */


  DEBUG_printf(("2http_write_ssl(http=%p, buf=%p, len=%d)", http, buf, len));

  result = gnutls_record_send(http->tls, buf, (size_t)len);

  if (result < 0 && !errno)
  {
   /*
    * Convert GNU TLS error to errno value...
    */

    switch (result)
    {
      case GNUTLS_E_INTERRUPTED :
          errno = EINTR;
          break;

      case GNUTLS_E_AGAIN :
          errno = EAGAIN;
          break;

      default :
          errno = EPIPE;
          break;
    }

    result = -1;
  }

  DEBUG_printf(("3http_write_ssl: Returning %d.", (int)result));

  return ((int)result);
}

@zdohnal
Copy link
Member

zdohnal commented Nov 29, 2023

DEBUG_printf(), because that's the one which is available for whole library. I found debug_printf() only in snmp backend.

Ad gnutls - it would be great to see result and errno which come straight from gnutls call, so put the debug printf right after the command and print result and strerror(errno).

@vliaskov
Copy link
Contributor Author

DEBUG_printf(), because that's the one which is available for whole library. I found debug_printf() only in snmp backend.

Ad gnutls - it would be great to see result and errno which come straight from gnutls call, so put the debug printf right after the command and print result and strerror(errno).

understood, thanks.

@vliaskov
Copy link
Contributor Author

vliaskov commented Dec 1, 2023

Customer is having issues with huge log files (and the issue takes at least a few hours to trigger).

@zdohnal can you suggest a more restrictive CUPS_DEBUG_FILTER setting (not the capture-all setting ".*") that would still provide useful information to you for debugging, while avoiding capturing everything? Or is that not possible?

@zdohnal
Copy link
Member

zdohnal commented Dec 4, 2023

IMO you can try filter it by 'http_write' - it should give us the more narrowed data.

@vliaskov
Copy link
Contributor Author

vliaskov commented Dec 4, 2023

Thank you @zdohnal .
While we are still waiting for more debugging: Do you have the PR/patch mentioned in #827 (comment)?

Could the POLLOUT | POLLER this be caused by a half-closed TCP connection e.g. from a shutdown() on the client side? I am trying to think of ways to reproduce the issue, as I am currently unable to.

What would be the proper way to handle the POLLHUP?

If POLLHUP is detected while waiting for POLLOUT, we could simply clear the returned POLLHUP event, and proceed normally e.g.:.

@@ -4688,9 +4687,17 @@ http_write(http_t     *http,             /* I - HTT
        pfd.events = POLLOUT;
 
        while ((nfds = poll(&pfd, 1, http->wait_value)) < 0 &&
-              (errno == EINTR || errno == EAGAIN))
+              (errno == EINTR || errno == EAGAIN) &&
+              !(pfd.revents & POLLHUP))
          /* do nothing */;
 
+        if (pfd.revents & POLLHUP)
+       {
+          DEBUG_printf(("3http_write: poll returned pdf.revents %d POLLHUP)", pfd.revents));
+         // Clear POLLHUP event
+         pdf.revents &= ~POLLHUP; 
+       }
+

I am not sure if the write should proceed though (or perhaps annulled with setting length=0), or if an error should be returned instead.

@vliaskov
Copy link
Contributor Author

vliaskov commented Dec 5, 2023

Another possibility, using getsockopt() and SO_ERROR. If the socket contains an error, we return that.

+        if (pfd.revents & POLLHUP)
+	{
 
+          int            sres, serr;
+          socklen_t      slen = sizeof(serr);
+          sres = getsockopt(pfd.fd, SOL_SOCKET, SO_ERROR, &serr, &slen);
+          DEBUG_printf(("1http_write: poll returned pfd.revents %d POLLHUP getsockopt returned: %d with error: %s (errno %s)", pfd.revents, sres, strerror(serr), strerror(errno)));
+
+	  if (sres || serr) {
+          	DEBUG_printf(("1http_write: poll returned pfd.revents %d POLLHUP getsockopt returned: %d with error: %s (errno %s), aborting", pfd.revents, sres, strerror(serr), strerror(errno)));
+	  	pfd.revents |= POLLERR;
+	  	http->error = serr;
+	  	return (-1);
+          }
+	}

@zdohnal
Copy link
Member

zdohnal commented Dec 6, 2023

@vliaskov see the last paragraph in #827 (comment) - your last comment is basically the same I described and did in our case.
Has this helped your customer?

@vliaskov
Copy link
Contributor Author

vliaskov commented Dec 6, 2023

@zdohnal my idea (derived from the http-addrlist.c __sun fix) was:

  • check if POLLHUP is returned in revents
  • if yes, check socket state SO_ERROR with getsockopt().
    • If an error is there, add POLLERR to revents (maybe not needed), and return the socket error, thus breaking the possible poll loop
    • If no error is there, continue normally.

The original suggestion:

the core of the patch is to check SO_ERROR socket option on the descriptor, and if there is no error there, remove the POLLHUP from revents, which is later checked by the function.

If POLLHUP was returned but no SO_ERROR is found in socket: shouldn't that be a bug somewhere e.g. kernel? Or do you expect that POLLHUP would not be accompanied by with a socket error?

If the disconnected socket does not have an SO_ERROR: can we return something appropriate anyway to break the loop e.g. ECONNRESET, EPIPE?

It seems if we do not return any error, the poll loop will continue, as http_write will keep trying to write length amount of bytes to a closed socket.

 But we don't check revents in http_write(), so such patch is useless IMHO...

And yes it's true that http_write does not currently check revents for POLLERR or POLLHUP, so I don't think setting or clearing them will make a difference. Also, why would we remove the POLLHUP from revents?

Customer has issues testing at the moment, so no feedback yet...

@zdohnal
Copy link
Member

zdohnal commented Dec 6, 2023

You forgot the patch I had mentioned is for a different issue with similar symptoms (as I mentioned previously, so that's why I asked for more data and didn't share the patch).

I meant to remove POLLHUP only if it is accompanied by POLLIN or POLLOUT, as we see in strace - and tbh, I got the getsockopt() advice from our tech-list, so I expected SO_ERROR would reflect if there is really an error on socket, and if not, remove the event, because we check for it later.

But that still about the similar issue we have CentOS Stream/RHEL, yours in different place - please get me at least part of strace logs with cups debug logs, so I can see where the daemon iterates when doing several polls.

@vliaskov
Copy link
Contributor Author

vliaskov commented Dec 7, 2023

Thanks for the explanation, I understand now what you mean, since your patch is for a similar but different issue in httpAddrConnect2 .

I will update once I have logs.

zdohnal added a commit to zdohnal/cups that referenced this issue Dec 12, 2023
Some Linux kernel versions put POLLOUT|POLLHUP into revents when client tries to connect with httpAddrConnect2(), which makes the connection fail.
Let's check the option SO_ERROR before scratching the attempt - if there is no error, remove POLLHUP from revents.

I've re-purposed previously Solaris-only code to be used everywhere if the conditions are met - this should prevent bigger delays than necessary.

Slightly different issue than OpenPrinting#827, but with similar symptoms (kernel sending POLLOUT|POLLHUP).
zdohnal added a commit to zdohnal/cups that referenced this issue Dec 14, 2023
Some Linux kernel versions put POLLOUT|POLLHUP into revents when client tries to connect with httpAddrConnect2(), which makes the connection fail.
Let's check the option SO_ERROR before scratching the attempt - if there is no error, remove POLLHUP from revents.

I've re-purposed previously Solaris-only code to be used everywhere if the conditions are met - this should prevent bigger delays than necessary.

Slightly different issue than OpenPrinting#827, but with similar symptoms (kernel sending POLLOUT|POLLHUP).
zdohnal added a commit that referenced this issue Dec 14, 2023
Some Linux kernel versions put POLLOUT|POLLHUP into revents when client tries to connect with httpAddrConnect2(), which makes the connection fail.
Let's check the option SO_ERROR before scratching the attempt - if there is no error, remove POLLHUP from revents.

I've re-purposed previously Solaris-only code to be used everywhere if the conditions are met - this should prevent bigger delays than necessary.

Slightly different issue than #827, but with similar symptoms (kernel sending POLLOUT|POLLHUP).
@vliaskov
Copy link
Contributor Author

cups-7427-loglevel9.log.zip
cupsd.strace.7427.zip

@zdohnal I don't have customer traces/logs yet. However, I can trigger POLLOUT|POLLHUP or POLLOUT|POLLHUP|POLLERR
by issuing a lot of curl connections to the cups server in parallel (port 631, with TLS/gnutls enabled), and abruptly killing the client curl processes. E.g.:

7427  00:47:43.252393 poll([{fd=20<TCP:[192.168.122.209:631->192.168.122.1:46576]>, events=POLLOUT}], 1, 10000) = 1 ([{fd=20, revents=POLLOUT|POLLHUP}]) <0.000011>
7427  00:47:43.252996 poll([{fd=20<TCP:[192.168.122.209:631->192.168.122.1:46576]>, events=POLLOUT}], 1, 10000) = 1 ([{fd=20, revents=POLLOUT|POLLHUP}]) <0.000013>

strace, debugging log attached. Debugging patch used is pasted further below.

I see the POLLOUT|POLLHUP triggered for a lot of small buffer http_writes.

There is no EINTR, EAGAIN or nfds < 0 logged in the poll while loop, as far as I see.

Most of the errors result in EPIPE (either in getsockopt, or copied later in errno):

When POLLOUT|POLLHUP revents are returned, almost all buffers get handled in 2 iterations of the http_write poll loop:

  • in the first iteration, http->errno is set to the error (EPIPE usually but also EOPNOTSUPP)
  • in the second iteration, the poll loop exits.

However, the same fd keeps getting polled with new http_write requests, usually for 4-5 byte buffers. e.g.:

T001 23:47:43.172  http_write: poll returned pfd.revents 28 POLLHUP for fd 20 getsockopt returned: 0 with error: Broken pipe (errno Broken pipe) buffer 0x55e18b3d7948 length 40
T001 23:47:43.173  http_write: poll returned pfd.revents 20 POLLHUP for fd 20 getsockopt returned: 0 with error: Success (errno Broken pipe) buffer 0x55e18b3d7948 length 40
T001 23:47:43.252  http_write: poll returned pfd.revents 20 POLLHUP for fd 20 getsockopt returned: 0 with error: Success (errno Broken pipe) buffer 0x7ffe26682750 length 5
T001 23:47:43.253  http_write: poll returned pfd.revents 20 POLLHUP for fd 20 getsockopt returned: 0 with error: Success (errno Broken pipe) buffer 0x7ffe26682750 length 5
T001 23:47:43.346  http_write: poll returned pfd.revents 20 POLLHUP for fd 20 getsockopt returned: 0 with error: Success (errno Broken pipe) buffer 0x7ffe26682750 length 5
T001 23:47:43.348  http_write: poll returned pfd.revents 20 POLLHUP for fd 20 getsockopt returned: 0 with error: Success (errno Broken pipe) buffer 0x7ffe26682750 length 5
...

Note: timestamps are off by an hour in strace versus debugging logs, but they match.

This probably does not replicate the exact customer issue : the trace excerpt in original issue description only shows polling calls in an endless loop, but in these logs the loop breaks.

But let me know if it gives any insights or ideas.

I can also retake logs, or add more debugging output if needed. I can also increase GNUTLS_DEBUG_LEVEL.

gnuTLS used is 3.7.3 (without kernel offload, i.e. no kTLS)
gnutls_recv_send returns usually EPIPE, with gnutls error being mostly GNUTLS_E_INVALID_SESSION (-10) or GNUTLS_E_PUSH_ERROR (-53).

Debugging patch used for the current logs also pasted below:

--- cups-2.2.7.orig/cups/http.c
+++ cups-2.2.7/cups/http.c
@@ -4663,13 +4663,13 @@ http_write(http_t     *http,		/* I - HTT
 		bytes;			/* Bytes sent */
 
 
-  DEBUG_printf(("2http_write(http=%p, buffer=%p, length=" CUPS_LLFMT ")", (void *)http, (void *)buffer, CUPS_LLCAST length));
+  DEBUG_printf(("1http_write(http=%p, buffer=%p, timeout=%f length=" CUPS_LLFMT ")", (void *)http, (void *)buffer, http->timeout_value, CUPS_LLCAST length));
   http->error = 0;
   tbytes      = 0;
 
   while (length > 0)
   {
-    DEBUG_printf(("3http_write: About to write %d bytes.", (int)length));
+    DEBUG_printf(("1http_write: About to write %d bytes.", (int)length));
 
     if (http->timeout_value > 0.0)
     {
@@ -4680,17 +4680,32 @@ http_write(http_t     *http,		/* I - HTT
       struct timeval	timeout;	/* Timeout value */
 #endif /* HAVE_POLL */
       int		nfds;		/* Result from select()/poll() */
-
+      int 		i = 1;
       do
       {
 #ifdef HAVE_POLL
 	pfd.fd     = http->fd;
 	pfd.events = POLLOUT;
+        int 		k = 1;
 
 	while ((nfds = poll(&pfd, 1, http->wait_value)) < 0 &&
 	       (errno == EINTR || errno == EAGAIN))
-	  /* do nothing */;
+		DEBUG_printf(("1http_write: in poll loop - run %d.%d, nfds %d pfd.revents %d, errno %s", i, k++, nfds, pfd.revents, strerror(errno)));
+
+        if (pfd.revents & POLLHUP)
+	{
 
+          int            sres, serr;
+          socklen_t      slen = sizeof(serr);
+          sres = getsockopt(pfd.fd, SOL_SOCKET, SO_ERROR, &serr, &slen);
+          DEBUG_printf(("1http_write: poll returned pfd.revents %d POLLHUP for fd %d getsockopt returned: %d with error: %s (errno %s) buffer %p length %d", pfd.revents, pfd.fd, sres, strerror(serr), strerror(errno), buffer, length));
+	  if (sres || serr || errno) {
+          	DEBUG_printf(("1http_write: poll returned pdf.revents %d POLLHUP for fd %d getsockopt returned: %d with error: %s (errno %s), aborting", pfd.revents, pfd.fd, sres, strerror(serr), strerror(errno)));
+          }
+	}
 #else
 	do
 	{
@@ -4712,6 +4727,7 @@ http_write(http_t     *http,		/* I - HTT
 
         if (nfds < 0)
 	{
+	  DEBUG_printf(("1http_write: %d nfds %d, errno %s", i, nfds, strerror(errno)));	
 	  http->error = errno;
 	  return (-1);
 	}
@@ -4724,18 +4740,25 @@ http_write(http_t     *http,		/* I - HTT
 #endif /* WIN32 */
 	  return (-1);
 	}
+	DEBUG_printf(("1http_write: %d nfds %d, errno %s", i, nfds, strerror(http->error)));
       }
       while (nfds <= 0);
     }
 
 #ifdef HAVE_SSL
     if (http->tls)
+    {
+      DEBUG_printf(("1http_write: send via TLS"));	    
       bytes = _httpTLSWrite(http, buffer, (int)length);
+    }  
     else
 #endif /* HAVE_SSL */
+    {
+      DEBUG_printf(("1http_write: send unencrypted"));	    
     bytes = send(http->fd, buffer, length, 0);
+    }
 
-    DEBUG_printf(("3http_write: Write of " CUPS_LLFMT " bytes returned "
+    DEBUG_printf(("1http_write: Write of " CUPS_LLFMT " bytes returned "
                   CUPS_LLFMT ".", CUPS_LLCAST length, CUPS_LLCAST bytes));
 
     if (bytes < 0)
@@ -4758,6 +4781,7 @@ http_write(http_t     *http,		/* I - HTT
       }
 
 #else
+      DEBUG_printf(("1http_write: error writing data (%s).", strerror(errno)));
       if (errno == EINTR)
         continue;
       else if (errno == EWOULDBLOCK || errno == EAGAIN)
@@ -4776,7 +4800,7 @@ http_write(http_t     *http,		/* I - HTT
       }
 #endif /* WIN32 */
 
-      DEBUG_printf(("3http_write: error writing data (%s).",
+      DEBUG_printf(("1http_write: error writing data (%s).",
                     strerror(http->error)));
 
       return (-1);
@@ -4791,7 +4815,7 @@ http_write(http_t     *http,		/* I - HTT
   http_debug_hex("http_write", buffer - tbytes, (int)tbytes);
 #endif /* DEBUG */
 
-  DEBUG_printf(("3http_write: Returning " CUPS_LLFMT ".", CUPS_LLCAST tbytes));
+  DEBUG_printf(("1http_write: Returning " CUPS_LLFMT ".", CUPS_LLCAST tbytes));
 
   return (tbytes);
 }
Index: cups-2.2.7/cups/tls-gnutls.c
===================================================================
--- cups-2.2.7.orig/cups/tls-gnutls.c
+++ cups-2.2.7/cups/tls-gnutls.c
@@ -1110,7 +1110,7 @@ http_gnutls_read(
   }
 
   bytes = recv(http->fd, data, length, 0);
-  DEBUG_printf(("6http_gnutls_read: bytes=%d", (int)bytes));
+  DEBUG_printf(("http_gnutls_read: bytes=%d", (int)bytes));
   return (bytes);
 }
 
@@ -1681,6 +1681,7 @@ _httpTLSWrite(http_t     *http,		/* I -
   DEBUG_printf(("2http_write_ssl(http=%p, buf=%p, len=%d)", http, buf, len));
 
   result = gnutls_record_send(http->tls, buf, (size_t)len);
+  DEBUG_printf(("1http_write_ssl: gnutls_record_send returning %d, errno %s", result, strerror(errno)));
 
   if (result < 0 && !errno)
   {
@@ -1706,7 +1707,7 @@ _httpTLSWrite(http_t     *http,		/* I -
     result = -1;
   }
 
-  DEBUG_printf(("3http_write_ssl: Returning %d.", (int)result));
+  DEBUG_printf(("1http_write_ssl: Returning %d.", (int)result));
 
   return ((int)result);
 }

Another possibility I thought of was that the customer is using kTLS and getting stuck in a sendmsg() loop due to this gnutls patch not applied and keep returning EAGAIN/EINTR: https://gitlab.com/redhat/centos-stream/rpms/gnutls/-/commit/e99bcaff789941a94416996daeb423df295d5443
but customer is not using kernel module tls from what i can see, so that is probably irrelevant. I also have not reproduced this EINTR/EAGAIN behavior. Furthermore I believe cups sets its own tls push/pull functions.

zdohnal added a commit that referenced this issue Dec 15, 2023
Some Linux kernel versions put POLLOUT|POLLHUP into revents when client tries to connect with httpAddrConnect2(), which makes the connection fail.
Let's check the option SO_ERROR before scratching the attempt - if there is no error, remove POLLHUP from revents.

I've re-purposed previously Solaris-only code to be used everywhere if the conditions are met - this should prevent bigger delays than necessary.

Slightly different issue than #827, but with similar symptoms (kernel sending POLLOUT|POLLHUP).
@vliaskov
Copy link
Contributor Author

I forgot to say the results/traces of my last commit are with 2.4.2 + debugging patch + the patch for the similar issue (POLLOUT||POLLHUP but seen in different code path in http-addrlist.c connect()).

Producing the POLLOUT|POLLHUP revents in http_write is possible e.g. with a simple runcurl.sh:

#!/bin/bash
host=$1
for i in `seq 1 40`;
do
for j in `seq 1 40`;
do
curl -k https://$host:631 &
curl -k https://$host:631/jobs &
curl -k https://$host:631/printers &
curl -k https://$host:631/jobs &
curl -k https://$host:631/admin &
curl -k https://$host:631 &
curl -k https://$host:631/jobs &
curl -k https://$host:631/printers &
curl -k https://$host:631/jobs &
curl -k https://$host:631/admin &
sleep 0.1
done
done

and issuing killall runcurl.sh; killall -9 curl while the script is running.

@zdohnal
Copy link
Member

zdohnal commented Jan 12, 2024

Hi @vliaskov ,

do you have data from customer? I see you were able to get POLLOUT|POLLHUP from poll(), but are you certain it is the action which happens for customer?

I've checked your data - it really seems you missed

+ debug_printf("7http_write: in poll loop - run %d.%d, nfds %d, errno %s", i, k++, nfds, strerror(errno));

right after the poll() - I see only:

http_write: poll returned pfd.revents 20 POLLHUP for fd 23 getsockopt returned: 0 with error: Success (errno Broken pipe) buffer 0x7ffe26682750 length 5

which is called after getsockopt() - it is good to know, but we have to know what errno is at least after poll(), and ideally even before poll() - we have to find out where errno is set, because GNUTLS sending function fails due invalid session.

======================================

But first and foremost please find whether your reproducer is what your customer experience - otherwise we will spend time on getting fix which won't help your customer in the end.

@michaelrsweet
Copy link
Member

@vliaskov So if you are seeing this in groups of 3, this is expected since the chunk writing does everything it can to avoid blocking indefinitely. After the poll call is a send/write which will return an error if the other side is really shut down, so I'm not sure why POLLHUP is set in this case...

Ultimately we need to figure out how to reproduce this, otherwise we are completely in the dark on a solution...

@michaelrsweet michaelrsweet added this to the Future milestone Apr 5, 2024
@michaelrsweet
Copy link
Member

Closing for now; if you find a way to reproduce, please let us know and we'll re-open...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority-low unable-to-reproduce Unable to reproduce waiting for reporter There are data requested from the reporter
Projects
None yet
Development

No branches or pull requests

3 participants