
In some circumstances, TCP receive queue fills up despite sockets being closed #2519

Open
kubrickfr opened this issue Apr 4, 2024 · 12 comments
Labels
status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug.

Comments

@kubrickfr

kubrickfr commented Apr 4, 2024

[EDIT: previous version had typo 2.6.9 -> 2.9.6]

Detailed Description of the Problem

In some circumstances that I have not been able to establish clearly, HAProxy stops accepting new connections (timeouts).
We run HAProxy in a number of clusters, all configured exactly the same, yet only a couple of them keep having this recurring issue. It could be related to client behaviour, as the different clusters are used by different clients.

$ ss -ltnup 'sport = :443'
Netid                            State                             Recv-Q                            Send-Q                                                         Local Address:Port                                                         Peer Address:Port                            Process                            
tcp                              LISTEN                            0                                 80000                                                                0.0.0.0:443                                                               0.0.0.0:*                                                                  
tcp                              LISTEN                            80001                             80000                                                                0.0.0.0:443                                                               0.0.0.0:*                                                                  

As you can see here, the receive queue is full; maxconn is set to 80000 for that particular frontend, and according to "show info", CurrConns is 5378.

Other debug information:

$ cat /proc/net/sockstat
sockets: used 86061
TCP: inuse 36613 orphan 165 tw 1338 alloc 85897 mem 82726
UDP: inuse 9 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

This bug happens with 2.8.3 on Amazon Linux 2023, and this bug report is based on haproxy-next 2.9.6, with patches up to c788ce33af85a28fa66f591cb65a7ea6c0f92007, which I have tried to see if it fixed the issue.

$ echo show info | socat UNIX-CONNECT:/var/lib/haproxy/stats,connect-timeout=2 stdio
Name: HAProxy
Version: 2.9.6-9eafce5
Release_date: 2024/02/26
Nbthread: 2
Nbproc: 1
Process_num: 1
Pid: 1816
Uptime: 0d 1h35m14s
Uptime_sec: 5714
Memmax_MB: 0
PoolAlloc_MB: 25
PoolUsed_MB: 25
PoolFailed: 0
Ulimit-n: 2000033
Maxsock: 2000033
Maxconn: 1000000
Hard_maxconn: 1000000
CurrConns: 5406
CumConns: 2206804
CumReq: 21588755
MaxSslConns: 0
CurrSslConns: 5405
CumSslConns: 2202347
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 446
ConnRateLimit: 0
MaxConnRate: 3001
SessRate: 446
SessRateLimit: 0
MaxSessRate: 3001
SslRate: 424
SslRateLimit: 3000
MaxSslRate: 3001
SslFrontendKeyRate: 382
SslFrontendMaxKeyRate: 2683
SslFrontendSessionReuse_pct: 10
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 315
SslCacheMisses: 98
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
Tasks: 6289
Run_queue: 7
Idle_pct: 71
node: ip-10-3-37-5.eu-west-1.compute.internal
Stopping: 0
Jobs: 5415
Unstoppable Jobs: 1
Listeners: 7
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 0
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 71763612797
TotalSplicedBytesOut: 0
BytesOutRate: 19710848
DebugCommandsIssued: 0
CumRecvLogs: 0
Build info: 2.9.6-9eafce5
Memmax_bytes: 0
PoolAlloc_bytes: 26276632
PoolUsed_bytes: 26276632
Start_time_sec: 1712215221
Tainted: 0
TotalWarnings: 49
MaxconnReached: 0
BootTime_ms: 183
Niced_tasks: 0

Expected Behavior

If CurrConns < maxconn, haproxy should keep accepting new connections.

Steps to Reproduce the Behavior

In the case of the problematic clusters, there doesn't seem to be any particular trigger...

Do you have any idea what may have caused this?

No

Do you have an idea how to solve the issue?

No

What is your configuration?

haproxy.cfg

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1:514 local0 warning

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     1000000
    user        haproxy
    group       haproxy
    daemon
    maxsslrate  3000

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats mode 660 user ec2-user level admin

    # utilize system-wide crypto-policies
    # ssl-default-server-ciphers PROFILE=SYSTEM
    # or enable TLS < 1.2
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:AES128-GCM-SHA256:AES128-SHA256:AES128-SHA:AES256-GCM-SHA384:AES256-SHA256:AES256-SHA:DHE-DSS-AES128-SHA:DES-CBC3-SHA:@SECLEVEL=0 
    tune.ssl.cachesize 1000000

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option forwardfor       except 127.0.0.0/8
    no option               redispatch
    retries                 0

    # it is important to keep timeout http-request low, especially with h2,
    # otherwise we can end up in a situation where the system runs out of
    # available sockets
    timeout http-request    2s

    timeout queue           10ms
    timeout connect         10s

    # Quote from the documentation:
    # "it is highly recommended that the client timeout remains equal
    # to the server timeout in order to avoid complex situations to debug"
    # https://docs.haproxy.org/2.8/configuration.html#timeout%20client%20(Alphabetically%20sorted%20keywords%20reference) 
    timeout client          19s
    timeout server          19s

    timeout http-keep-alive 1m
    timeout check           10s
    maxconn                 1000000

conf.d/app.cfg

frontend www_ssl
  bind :443 ssl crt /etc/ssl/app.pem shards 4 alpn h2,http/1.1 ssl-min-ver TLSv1.0
  default_backend app
  monitor-uri /NLBcheckstatus
  maxconn 80000

frontend www
  bind :80 shards 4
  default_backend app
  monitor-uri /NLBcheckstatus
  maxconn 80000

conf.d/stats.cfg

listen stats # Define a listen section called "stats"
  bind 127.0.0.1:9000 # Listen on localhost:9000
  mode http
  http-request use-service prometheus-exporter if { path /metrics }
  stats enable  # Enable stats page
  stats hide-version  # Hide HAProxy version
  stats realm Haproxy\ Statistics  # Title text for popup window
  stats uri /admin  # Stats URI
  stats auth datadog:#########  # Authentication credentials

conf.d/app_be.cfg

backend app
  option httpchk
  balance leastconn
  http-reuse safe
  http-check send meth GET  uri /url

/etc/sysctl.d/10-haproxy.conf

# Limit the per-socket default receive/send buffers to limit memory usage
# when running with a lot of concurrent connections. Values are in bytes
# and represent minimum, default and maximum. Defaults: 4096 87380 4194304
#
net.ipv4.tcp_rmem            = 4096 16060 262144
net.ipv4.tcp_wmem            = 4096 16384 262144

# Allow early reuse of a same source port for outgoing connections. It is
# required above a few hundred connections per second. Defaults: 0
#
net.ipv4.tcp_tw_reuse        = 1

# Extend the source port range for outgoing TCP connections. This limits early
# port reuse and makes use of 64000 source ports. Defaults: 32768 61000
#
net.ipv4.ip_local_port_range = 1024 65023

# Increase the TCP SYN backlog size. This is generally required to support very
# high connection rates as well as to resist SYN flood attacks. Setting it too
# high will delay SYN cookie usage though. Defaults: 1024
#
net.ipv4.tcp_max_syn_backlog = 45000

# Timeout in seconds for the TCP FIN_WAIT state. Lowering it speeds up release
# of dead connections, though it will cause issues below 25-30 seconds. It is
# preferable not to change it if possible. Default: 60
#
net.ipv4.tcp_fin_timeout     = 30

# Limit the number of outgoing SYN-ACK retries. This value is a direct
# amplification factor of SYN floods, so it is important to keep it reasonably
# low. However, too low will prevent clients on lossy networks from connecting.
# Using 3 as a default value gives good results (4 SYN-ACK total) and lowering
# it to 1 under SYN flood attack can save a lot of bandwidth. Default: 5
#
net.ipv4.tcp_synack_retries  = 3

# Set this to one to allow local processes to bind to an IP which is not yet
# present on the system. This is typically what happens with a shared VRRP
# address, where you want both primary and backup to be started even though the
# IP is not yet present. Always leave it to 1. Default: 0
#
net.ipv4.ip_nonlocal_bind    = 1

# Serves as a higher bound for all of the system's SYN backlogs. Put it at
# least as high as tcp_max_syn_backlog, otherwise clients may experience
# difficulties to connect at high rates or under SYN attacks. Default: 128
#
net.core.somaxconn           = 90000


Output of `haproxy -vv`

HAProxy version 2.9.6-9eafce5 2024/02/26 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.9.6.html
Running on: Linux 6.1.79-99.167.amzn2023.aarch64 #1 SMP Tue Mar 12 18:15:29 UTC 2024 aarch64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment -DMAX_SESS_STKCTR=12
  OPTIONS = USE_LINUX_TPROXY=1 USE_CRYPT_H=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_SYSTEMD=1 USE_PROMEX=1 USE_PCRE2=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=2).
Built with OpenSSL version : OpenSSL 3.0.8 7 Feb 2023
Running on OpenSSL version : OpenSSL 3.0.8 7 Feb 2023
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with Lua version : Lua 5.4.4
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.40 2022-04-14
PCRE2 library supports JIT : no (USE_PCRE2_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.1 20230605 (Red Hat 11.4.1-2)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=

Available services : prometheus-exporter
Available filters :
	[BWLIM] bwlim-in
	[BWLIM] bwlim-out
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace

Last Outputs and Backtraces

No response

Additional Information

Any local patches applied
haproxy-next 2.9.6, with patches up to c788ce33af85a28fa66f591cb65a7ea6c0f92007

Environment specificities
Arm64, ECDSA certificate

No backend servers are defined in the configuration; backends are added and removed via the socket using

        haproxyUpdateCommand += `add server app/${instanceId} ${privateIpAddress}:${healthCheckPort} check inter 10s  fall 3  rise 1 weight ${weight} pool-max-conn -1 pool-purge-delay ${backendDrainingTimeSeconds}s\n`;
        haproxyUpdateCommand += `enable server app/${instanceId}\n`;
        haproxyUpdateCommand += `set server app/${instanceId} health down\n`;
        haproxyUpdateCommand += `enable health app/${instanceId}\n`;

Weights are recalculated every minute, and set using

        haproxyUpdateCommand += `set weight app/${instanceId} ${weight}\n`;
@kubrickfr kubrickfr added status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug. labels Apr 4, 2024
@wtarreau
Member

wtarreau commented Apr 4, 2024

Very strange, I've never heard of any similar report, even on the other versions you mention. Was the "show info" above produced while the problem was happening? Or can you no longer connect to the stats socket when the problem occurs? Does it recover only by restarting haproxy? If you're able to connect to the stats socket, sending a "show stat" and a "show fd" could help.

What I suspect could be related to the size of the backlog: I'm seeing a sessrate of 446 in your "show info" output, which indicates what connection rate is acceptable with SSL negotiation. Let's assume your server can deal with 2k sessions/s including SSL etc. If you receive an attack with more than that, the accept queue will fill up. It will then take 40s to process the last entry in the queue at 2k/s, and by then the client will have aborted but there's no way to know, so it costs a handshake calculation for nothing. In such a case, an approach can be to limit the backlog to a much lower value via the "backlog" keyword on the "bind" line. This way during an attack, you won't be accumulating connections that users gave up, and the recovery can be much faster. Just set that to 2-3x the max rate you can accept so that users don't needlessly wait more than 2-3s before getting an error.
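Applied to the frontend from this report, that suggestion would look something like the sketch below. The figure 8000 is purely illustrative (roughly 2-3x the ~3000/s SSL rate limit used in this configuration), not a value taken from the thread:

```
frontend www_ssl
  # "backlog" on the bind line caps the kernel accept queue for this listener;
  # the 8000 value here is an illustrative assumption (~2-3x the SSL rate limit)
  bind :443 ssl crt /etc/ssl/app.pem shards 4 alpn h2,http/1.1 ssl-min-ver TLSv1.0 backlog 8000
  default_backend app
  monitor-uri /NLBcheckstatus
  maxconn 80000
```

With this in place, clients arriving during an overload get a fast connection error once the small queue fills, instead of waiting tens of seconds in a deep queue and timing out anyway.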

@kubrickfr
Author

Very strange, I've never heard about any similar report even on the other versions you mention. Was the "show info" above produced when the problem was happening?

Yes

Or can't you connect anymore to the stats socket when the problem is happening ?

Nope, connecting to the socket is fine

Does it recover only by restarting haproxy ?

That is correct

If you're able to connect to the stats socket, sending a "show stat" and a "show fd" could help.

Will do as soon as I can identify a host that has the issue again

What I suspect could be related to the size of the backlog: I'm seeing a sessrate of 446 in your "show info" output, which indicates what connection rate is acceptable with SSL negotiation. Let's assume your server can deal with 2k sessions/s including SSL etc. If you receive an attack with more than that, the accept queue will fill up. It will then take 40s to process the last entry in the queue at 2k/s, and by then the client will have aborted but there's no way to know, so it costs a handshake calculation for nothing.

We set a max SSL rate of 3k sessions/s (in fact the limits are dynamic: we set maxsslrate $(( $(nproc) * 1500 )) and maxconn $(( $(nproc) * 40000 ))). We've benchmarked it and it works fine for us (<100% CPU/network/memory).
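The per-host limits described above scale with the core count; a minimal shell sketch of that computation (CORES is overridable here for illustration, otherwise it falls back to nproc):

```shell
#!/bin/sh
# Sketch of the dynamic limit computation described above.
CORES="${CORES:-$(nproc)}"
echo "maxsslrate $(( CORES * 1500 ))"
echo "maxconn $(( CORES * 40000 ))"
```

On the 2-thread host from the original report this gives maxsslrate 3000 and maxconn 80000, matching the SslRateLimit and frontend maxconn values shown earlier.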

In such a case, an approach can be to limit the backlog to a much lower value via the "backlog" keyword on the "bind" line. This way during an attack, you won't be accumulating connections that users gave up, and the recovery can be much faster. Just set that to 2-3x the max rate you can accept so that users don't needlessly wait more than 2-3s before getting an error.

Ok I will try tinkering with backlog

@kubrickfr
Author

kubrickfr commented Apr 4, 2024

OK, so on another host with 4 cores, and therefore with our dynamic maxconn set to 160000, we get this:

$ ss -ltnup 'sport = :443'
Netid                            State                             Recv-Q                            Send-Q                                                         Local Address:Port                                                         Peer Address:Port                            Process                            
tcp                              LISTEN                            0                                 90000                                                                0.0.0.0:443                                                               0.0.0.0:*                                                                  
tcp                              LISTEN                            0                                 90000                                                                0.0.0.0:443                                                               0.0.0.0:*                                                                  
tcp                              LISTEN                            90001                             90000                                                                0.0.0.0:443                                                               0.0.0.0:*                                                                  
tcp                              LISTEN                            90001                             90000                                                                0.0.0.0:443                                                               0.0.0.0:*                                                                  

This time we hit the non-dynamic limit of net.core.somaxconn = 90000 (which I realise should really be calculated as well).
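Following that remark, net.core.somaxconn could be derived at provisioning time from the same per-core formula as the dynamic frontend maxconn. A hypothetical sketch (the 10000 headroom is an assumption, not a value from the thread):

```shell
#!/bin/sh
# Hypothetical: size the kernel-wide listen backlog cap from the same
# per-core formula as the dynamic frontend maxconn, plus some headroom,
# so the static sysctl is never the first limit to fill up.
CORES="${CORES:-$(nproc)}"
SOMAXCONN=$(( CORES * 40000 + 10000 ))
echo "net.core.somaxconn = $SOMAXCONN"
# sysctl -w net.core.somaxconn="$SOMAXCONN"   # would require root; shown for illustration
```

On the 4-core host above this would yield 170000, comfortably above the dynamic maxconn of 160000, instead of the fixed 90000 that was exceeded.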

$ echo show info | socat UNIX-CONNECT:/var/lib/haproxy/stats,connect-timeout=2 stdio
Name: HAProxy
Version: 2.9.6-9eafce5
Release_date: 2024/02/26
Nbthread: 4
Nbproc: 1
Process_num: 1
Pid: 2003
Uptime: 2d 5h33m21s
Uptime_sec: 192801
Memmax_MB: 0
PoolAlloc_MB: 6
PoolUsed_MB: 6
PoolFailed: 0
Ulimit-n: 2000043
Maxsock: 2000043
Maxconn: 1000000
Hard_maxconn: 1000000
CurrConns: 2208
CumConns: 8298281
CumReq: 4057092995
MaxSslConns: 0
CurrSslConns: 2207
CumSslConns: 8191837
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 3
ConnRateLimit: 0
MaxConnRate: 6002
SessRate: 3
SessRateLimit: 0
MaxSessRate: 6002
SslRate: 3
SslRateLimit: 6000
MaxSslRate: 6001
SslFrontendKeyRate: 4
SslFrontendMaxKeyRate: 7755
SslFrontendSessionReuse_pct: 0
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 40
SslCacheMisses: 40
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
Tasks: 3644
Run_queue: 0
Idle_pct: 71
node: ip-10-2-32-176.ap-southeast-1.compute.internal
Stopping: 0
Jobs: 2221
Unstoppable Jobs: 1
Listeners: 11
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 0
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 8706499242217
TotalSplicedBytesOut: 0
BytesOutRate: 47007488
DebugCommandsIssued: 0
CumRecvLogs: 0
Build info: 2.9.6-9eafce5
Memmax_bytes: 0
PoolAlloc_bytes: 6503920
PoolUsed_bytes: 6503920
Start_time_sec: 1712051025
Tainted: 0
TotalWarnings: 76
MaxconnReached: 0
BootTime_ms: 156
Niced_tasks: .
$ cat /proc/net/sockstat
sockets: used 183355
TCP: inuse 28965 orphan 0 tw 392 alloc 183185 mem 76212
UDP: inuse 11 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
$ uptime 
 15:17:47 up 2 days,  5:34,  1 user,  load average: 1.40, 1.50, 1.67
$ echo show stat | socat UNIX-CONNECT:/var/lib/haproxy/stats,connect-timeout=2 stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,wrew,connect,reuse,cache_lookups,cache_hits,srv_icur,src_ilim,qtime_max,ctime_max,rtime_max,ttime_max,eint,idle_conn_cur,safe_conn_cur,used_conn_cur,need_conn_est,uweight,agg_server_status,agg_server_check_status,agg_check_status,srid,sess_other,h1sess,h2sess,h3sess,req_other,h1req,h2req,h3req,proto,-,ssl_sess,ssl_reused_sess,ssl_failed_handshake,h2_headers_rcvd,h2_data_rcvd,h2_settings_rcvd,h2_rst_stream_rcvd,h2_goaway_rcvd,h2_detected_conn_protocol_errors,h2_detected_strm_protocol_errors,h2_rst_stream_resp,h2_goaway_resp,h2_open_connections,h2_backend_open_streams,h2_total_connections,h2_backend_total_streams,h1_open_connections,h1_open_streams,h1_total_connections,h1_total_streams,h1_bytes_in,h1_bytes_out,h1_spliced_bytes_in,h1_spliced_bytes_out,
www_ssl,FRONTEND,,,2365,66164,160000,2674547,7364640673982,1069039835706,0,0,1290,,,,,OPEN,,,,,,,,,1,2,0,,,,0,1,0,7755,,,,0,4056045898,0,1658,288043,32,,22701,38582,4056375550,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,1,6001,8191793,0,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,2674547,0,0,0,4056375550,0,0,,-,2674555,0,5439614,0,0,0,0,0,0,0,0,0,0,0,0,0,2365,346,2674547,4056374288,8412490505427,1068762896456,0,0,
www,FRONTEND,,,1,5,160000,3466,248130,593027,0,0,37,,,,,OPEN,,,,,,,,,1,3,0,,,,0,0,0,13,,,,0,3234,0,251,3,0,,0,13,3488,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,13,3466,3234,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,3466,1,0,0,3488,0,0,,-,0,0,0,0,0,3,0,1,0,0,0,0,0,0,1,0,1,0,3466,3475,296944,596122,0,0,
app,i-0ed07def403d51d04,0,0,1,33537,,5427241,9726848390,1385133631,,0,,33849,271,0,0,UP,1,1,0,5,2,18405,100,,1,4,9,,5427241,,2,89,,9419,L7OK,200,5,0,5393176,0,0,0,0,,,,5393176,15172,25,,,,,0,,,0,0,17,107,,,,Layer7 check passed,,1,3,3,,,,10.2.37.170:8080,,http,,,,,,,,0,60210,5367031,,,1,,0,7226,14244,60170,0,0,1,1,2,1,,,,3,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,i-0f3070f2d8b7b1d0d,0,0,173,14884,,298538885,535274006840,77726797767,,0,,45294,43248,0,0,UP,256,1,0,4,2,18399,150,,1,4,10,,298538885,,2,10902,,20046,L7OK,200,0,0,298466900,0,6,0,0,,,,298466906,370418,14205,,,,,0,,,0,0,17,63,,,,Layer7 check passed,,1,3,3,,,,10.2.34.206:8080,,http,,,,,,,,0,839598,297699287,,,398,,0,7280,17522,60525,0,0,398,173,571,256,,,,3,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,i-07c3394790ff363ff,0,0,172,14882,,296034911,530763867149,77021383861,,0,,50914,39210,0,0,UP,256,1,0,5,2,18418,100,,1,4,11,,296034911,,2,11475,,19980,L7OK,200,3,0,295959262,0,20,0,0,,,,295959282,372788,12903,,,,,0,,,0,0,18,67,,,,Layer7 check passed,,1,3,3,,,,10.2.35.181:8080,,http,,,,,,,,0,847203,295187708,,,399,,0,7248,18999,60953,0,0,399,172,571,256,,,,3,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,i-0541d9d428440e383,0,0,1,225,,3966656,7115916553,1026978140,,0,,0,0,0,0,UP,1,1,0,0,1,18377,73,,1,4,1,,3966656,,2,46,,4191,L7OK,200,1,0,3966647,0,0,0,0,,,,3966647,10625,0,,,,,0,,,0,0,17,86,,,,Layer7 check passed,,1,3,3,,,,10.2.42.23:8080,,http,,,,,,,,0,20316,3946340,,,1,,0,11,1502,60198,0,0,1,1,2,1,,,,4,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
app,BACKEND,0,0,347,38700,32000,4056374503,7364640718370,1069039943633,0,0,,221932,169328,0,0,UP,514,4,0,,5,18418,31,,1,4,0,,4056373077,,1,22702,,38582,,,,0,4056045898,0,607,288046,32,,,,4056334583,823043,59015,0,0,0,0,0,,,0,0,17,59,,,,,,,,,,,,,,http,leastconn,,,,,,,0,3975617,4052397460,0,0,,,0,7280,18999,60953,0,,,,,514,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1146,347,4052868,4056450328,1063974567729,8696536936083,0,0,
stats,FRONTEND,,,0,3,1000000,12851,1696332,928968202,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,1,,,,0,12851,0,0,0,0,,0,1,12851,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,0,1,12851,12851,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,12851,0,0,0,12851,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12851,12851,2017607,929508480,0,0,
stats,BACKEND,0,0,0,0,100000,0,1696332,928968202,0,0,,0,0,0,0,UP,0,0,0,,0,192772,,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,15,,,0,0,1,2,,,,,,,,,,,,,,http,roundrobin,,,,,,,0,0,0,0,0,,,0,0,4,224,0,,,,,0,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

And the result of show fd:
2519.fds.txt

Maybe what these two instances have in common is that they both hit their maxsslrate limit at some point in the past.

@wtarreau
Member

wtarreau commented Apr 4, 2024

I had not noticed at first that you were using maxsslrate. Pretty interesting. Maybe you're facing a race condition that prevents it from properly recovering when the limit is met. That's something reasonably easy to try to reproduce on our side by setting a lower limit. At least your "show fd" shows the listener is active (thus not disabled) in the poller. Thanks for these, we'll need a bit of time to analyse it deeper now.

@Darlelet
Contributor

Darlelet commented Apr 5, 2024

Could it be related to #2476 then?

@capflam
Member

capflam commented Apr 5, 2024

I checked. At first glance it seems similar, but I doubt it is related, especially because here the listeners don't seem to be limited when the issue occurred.

@capflam
Member

capflam commented Apr 5, 2024

However, the fix was backported, thus it can be tested.

@kubrickfr
Author

kubrickfr commented Apr 5, 2024

[EDIT: previous version had typo 2.6.9 -> 2.9.6]
As mentioned in the bug description, the present issue happens with "haproxy-next 2.9.6, with patches up to c788ce33af85a28fa66f591cb65a7ea6c0f92007" which includes BUG/MINOR: listener: Wake proxy's mngmt task up if necessary on session release from #2476

@kubrickfr
Author

kubrickfr commented Apr 5, 2024

Ha! Now that I re-read my message, I realise there is a typo: it's 2.9.6, not 2.6.9! This applies to my last comment as well; adding an [EDIT] note.

@capflam
Member

capflam commented Apr 5, 2024

Thanks for the confirmation :) So it is indeed another issue.

@kubrickfr
Author

Just FYI, we have switched from setting maxsslrate to using backlog, and the problem hasn't reappeared, so far.

Did you manage to reproduce it on your end?

@wtarreau
Member

Hi François. Thanks for the update. No repro on our side for now. What surprises me is that the code used to deal with maxsslrate is exactly the same (and uses the same code paths) as the one dealing with the global rate limit. So if something is broken there (and it's quite possible that a race remains), it should affect all limits, not just SSL.
