status check for zombie proxy #579

Closed
motonorin opened this Issue Apr 4, 2014 · 5 comments

@motonorin

I'm using FreeRADIUS server 2.2.3 for eduroam and set "status_check = status-server" so that a proxy server which does not respond immediately is not frequently marked as dead.
The configuration with "status_check = status-server" mostly works well. But I observed a situation where a proxy server was marked as dead even though it responded to a status check probe.
I added some debugging output to the source code and found that a proxy server already marked as zombie is marked as dead immediately, even if the proxy responds to a probe. The decision is based on the time elapsed since "zombie_period_start", and that value is copied not from the time of the last packet sent to the proxy, but from the time of the last packet received from the proxy.

I applied the following patch and it has worked well for a couple of weeks.
It seems that the same issue exists in V3.

$ diff -u freeradius-server-2.2.3/src/main/event.c.org freeradius-server-2.2.3/src/main/event.c
--- freeradius-server-2.2.3/src/main/event.c.org        2013-12-12 05:10:12.000000000 +0900
+++ freeradius-server-2.2.3/src/main/event.c    2014-03-20 12:02:37.000000000 +0900
@@ -704,8 +704,17 @@
        home = request->home_server;
        home->num_received_pings++;

+#if 0
        radlog(L_PROXY, "Received response to status check %d (%d in current sequence)",
               request->number, home->num_received_pings);
+#else
+       radlog(L_PROXY, "Received response to status check %d (%d in current sequence) for home server %s port %d",
+              request->number, home->num_received_pings,
+              inet_ntop(request->proxy->dst_ipaddr.af,
+                        &request->proxy->dst_ipaddr.ipaddr,
+                        buffer, sizeof(buffer)),
+              request->proxy->dst_port);
+#endif

        /*
         *      Remove the request from any hashes
@@ -1121,6 +1130,13 @@
                return;
        }

+#if 1
+       radlog(L_PROXY, "server: %s:%d, now: %d, last packet: %d, zperiod: %d",
+              inet_ntop(home->ipaddr.af, &home->ipaddr.ipaddr,
+                        buffer, sizeof(buffer)),
+              home->port,
+              now.tv_sec, home->last_packet, home->zombie_period);
+#endif
        /*
         *      We've received a real packet recently.  Don't mark the
         *      server as zombie until we've received NO packets for a
@@ -1142,8 +1158,12 @@
         */
        home->state = HOME_STATE_ZOMBIE;

+#if 1
+       home->zombie_period_start = now;
+#else
        home->zombie_period_start.tv_sec = home->last_packet;
        home->zombie_period_start.tv_usec = USEC / 2;
+#endif

        fr_event_delete(el, &home->ev);
        home->currently_outstanding = 0;
@alanbuxey
Member

just checking how this interacts with the 'wait for 3 replies' or whatever
the admin has configured for status-check - you might not want to bring
up the link just on one ping being responded to (heavy congestion or big
delays....)

@alandekok
Member

We want to maintain a balance between failing too quickly, and not failing quickly enough. The requirement to have 3 status_server responses is a good one, and I don't think it needs to change.

The current code has gone through a lot of testing, and I'm not inclined to make modifications which change the meaning of zombie_period. Doing that will change people's existing systems, potentially breaking something else.

A simpler patch would be just to change the limit for num_answers_to_alive, so that the lower limit is one. That way servers which respond will immediately get marked as alive.

But for this issue, the server is behaving as expected. Once a home server is marked zombie, it has to respond to 3 status-server checks before it's marked alive. And that's what it's doing.
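
For readers following along, the counting rule described above can be sketched roughly as follows. This is a simplified illustration, not the server's actual code: the struct `alive_sketch` and the function `on_status_reply` are hypothetical, and only the names `num_received_pings` and `num_answers_to_alive` come from this thread.

```c
/* Hypothetical, simplified stand-in for a home server record. */
struct alive_sketch {
    int num_received_pings;   /* replies seen in the current sequence */
    int num_answers_to_alive; /* configured number of replies required */
    int alive;                /* 1 once the server counts as alive again */
};

/*
 * Count one status-check reply. The server is only treated as alive
 * once enough replies have arrived; a single reply is not sufficient
 * unless num_answers_to_alive is 1.
 * Returns 1 if the server is now considered alive, 0 otherwise.
 */
static int on_status_reply(struct alive_sketch *home)
{
    home->num_received_pings++;
    if (home->num_received_pings >= home->num_answers_to_alive) {
        home->alive = 1;
    }
    return home->alive;
}
```

With `num_answers_to_alive` set to 3, the first two replies leave the server in its current state and the third one marks it alive.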

@alandekok alandekok closed this Apr 4, 2014
@motonorin

num_answers_to_alive seems to be a parameter that controls how readily the status changes from DEAD to ALIVE. However, my issue is the status change from ALIVE/ZOMBIE to DEAD.

Here are the log messages from the original server (without the patch):
Wed Mar 19 11:14:10 2014 : Proxy: Marking home server XX.XX.XX.XX port 1812 as zombie (it looks like it is dead).
Wed Mar 19 11:14:10 2014 : Proxy: Received response to status check 431881 (1 in current sequence)
Wed Mar 19 11:14:10 2014 : Proxy: Marking home server XX.XX.XX.XX port 1812 as dead.

As you can see, even though the proxy server responded to a probe, its status was changed to DEAD immediately after it was marked as ZOMBIE.
"zombie period" is the minimum time a server stays in the ZOMBIE state before changing to DEAD, isn't it?

@motonorin

Thank you for the fix. One more concern: the status of the proxy will still change immediately from ZOMBIE to DEAD, without waiting for zombie_period, when "status_check = status-server" is not set.

The following are log messages produced with the debugging patch included in the first report:
Thu Mar 20 09:32:19 2014 : Proxy: server: XX.XX.XX.XX:1812, now: 1395275539, last packet: 1395274579, zperiod: 40
Thu Mar 20 09:32:19 2014 : Proxy: Marking home server XX.XX.XX.XX port 1812 as zombie (it looks like it is dead).
Thu Mar 20 09:32:19 2014 : Proxy: Received response to status check 4615 (1 in current sequence) for home server XX.XX.XX.XX port 1812
Thu Mar 20 09:32:19 2014 : Proxy: Marking home server XX.XX.XX.XX port 1812 as dead.

As you can see, zombie_period_start is set 960 seconds (= 1395275539 - 1395274579, i.e. 16 minutes) in the past in this case, which is already more than zombie_period (40).
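
The arithmetic can be checked with a small sketch. It assumes the ZOMBIE to DEAD decision is a simple "elapsed time >= zombie_period" comparison, which is my reading of the behaviour and not a copy of the server code; the helper names `zombie_elapsed` and `zombie_expired` are hypothetical.

```c
#include <time.h>

/* Seconds elapsed since the recorded start of the zombie period. */
static long zombie_elapsed(time_t now, time_t zombie_period_start)
{
    return (long)(now - zombie_period_start);
}

/*
 * Assumed expiry rule: the zombie period is over once at least
 * zombie_period seconds have passed since zombie_period_start.
 * Returns 1 if expired, 0 otherwise.
 */
static int zombie_expired(time_t now, time_t zombie_period_start,
                          int zombie_period)
{
    return zombie_elapsed(now, zombie_period_start) >= zombie_period;
}
```

With the logged values, starting the period at last_packet (1395274579) gives 960 elapsed seconds at now (1395275539), so a 40 second zombie_period has expired the moment the server becomes a zombie. Starting the period at now gives 0 elapsed seconds, so it has not.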

@alandekok alandekok added a commit that referenced this issue Apr 6, 2014
@alandekok alandekok Limit zombie period start. Fixes #579
If we've received a packet in the last 1/4 zombie period, don't
go to zombie.  If the last packet was earlier than that, set
the zombie period start to that time.

We don't set it to home->last_packet, because that could have
been minutes or hours in the past
ec79173
@alandekok alandekok added a commit that referenced this issue Apr 6, 2014
@alandekok alandekok Limit zombie period start. Fixes #579
If we've received a packet in the last 1/4 zombie period, don't
go to zombie.  If the last packet was earlier than that, set
the zombie period start to that time.

We don't set it to home->last_packet, because that could have
been minutes or hours in the past
3efcbbd
@alandekok alandekok added a commit that referenced this issue Apr 6, 2014
@alandekok alandekok Limit zombie period start. Fixes #579
If we've received a packet in the last 1/4 zombie period, don't
go to zombie.  If the last packet was earlier than that, set
the zombie period start to that time.

We don't set it to home->last_packet, because that could have
been minutes or hours in the past
4367280
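
One possible reading of the commit message above, as a rough sketch only. This is not the code from the commits: the struct `zombie_sketch` and the function `maybe_mark_zombie` are hypothetical, and it assumes "that time" in the message means the point one quarter of zombie_period before now.

```c
#include <time.h>

/* Hypothetical, simplified stand-in for a home server record. */
struct zombie_sketch {
    time_t last_packet;         /* time of the last packet received */
    time_t zombie_period_start; /* when the zombie period is taken to begin */
    int zombie_period;          /* configured zombie period, in seconds */
    int is_zombie;              /* 1 once the server is marked zombie */
};

/*
 * Assumed rule: if a packet arrived within the last quarter of a
 * zombie period, leave the server alone. Otherwise mark it zombie and
 * start the period at (now - zombie_period / 4), never at last_packet,
 * which could be minutes or hours in the past.
 * Returns 1 if the server was marked zombie, 0 otherwise.
 */
static int maybe_mark_zombie(struct zombie_sketch *home, time_t now)
{
    time_t threshold = now - home->zombie_period / 4;

    if (home->last_packet >= threshold) {
        return 0;
    }
    home->is_zombie = 1;
    home->zombie_period_start = threshold;
    return 1;
}
```

Under this reading, the case logged earlier (last packet 960 seconds ago, zombie_period of 40) would start the zombie period 10 seconds in the past, leaving time for status checks to be answered before the server is marked dead.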
@motonorin

Thanks a lot. I'll try the latest code.
