Add graceful-restart support #2386
Note to discuss:
it seems it fails, and the speaker/frr-reloader retries to apply it.
The above might be related to FRRouting/frr#8403
In my test I achieved zero downtime if I have graceful restart either in global (test: at runtime I delete the speaker and see whether the client loses connectivity). The second command, which enables the F-bit, is only available in global mode, not per neighbor. Therefore a solution in which we add graceful-restart at the neighbor level and the F-bit globally (because it is not safe to just have it there by default) seems suboptimal. @fedepaol How much do you object to having it global (one option to enable graceful restart, one to turn the F-bit on/off), with the option to opt out per neighbor? Any other ideas?
I created FRRouting/frr#15880 upstream for this, even though we are not really blocked on it (we will get an error, but it does not look like it has an impact).
Won't the global F-bit affect only peers with GR enabled? If so, it doesn't sound too terrible to always enable it.
My preference would be to make it configurable, to be on the safe side. I cannot find an argument against always having it (or having it on by default), but I have no idea for which case this option exists. Should we always have it in the config? On the same topic, shouldn't we allow tuning the timer? The default is 120 sec, which seems long (ideally we need < holdtime).
I'd try to keep the UX as good as we can, which means not having to set an extra parameter. We can either set the global option when at least one peer needs GR, or just always enable it. I am leaning towards the second, as we'll have to deal with peers without GR and the global option anyway, so if it doesn't work we'll have a problem.
You mean the graceful restart timer? That should be meaningful only for the helper side, I guess (but we may need it in frr-k8s, where we might be helpers).
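To make the first option concrete ("set the global when at least one peer needs GR"), here is a minimal Go sketch of how a config renderer could do it. The `neighbor` type and `renderBGP` function are hypothetical and are not taken from this PR. The FRR command names (`bgp graceful-restart preserve-fw-state`, `neighbor <addr> graceful-restart`) are written as I understand the FRR documentation and should be checked against the FRR version in use.

```go
package main

import (
	"fmt"
	"strings"
)

// neighbor is a hypothetical, simplified peer description.
type neighbor struct {
	Addr            string
	ASN             uint32
	GracefulRestart bool
}

// renderBGP emits an FRR-style "router bgp" stanza. The global
// preserve-fw-state line (the F-bit) is written only when at least one
// neighbor has graceful restart enabled.
func renderBGP(localASN uint32, neighbors []neighbor) string {
	var b strings.Builder
	fmt.Fprintf(&b, "router bgp %d\n", localASN)

	anyGR := false
	for _, n := range neighbors {
		if n.GracefulRestart {
			anyGR = true
			break
		}
	}
	if anyGR {
		b.WriteString(" bgp graceful-restart preserve-fw-state\n")
	}

	for _, n := range neighbors {
		fmt.Fprintf(&b, " neighbor %s remote-as %d\n", n.Addr, n.ASN)
		if n.GracefulRestart {
			fmt.Fprintf(&b, " neighbor %s graceful-restart\n", n.Addr)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderBGP(64512, []neighbor{
		{Addr: "10.0.0.1", ASN: 64513, GracefulRestart: true},
		{Addr: "10.0.0.2", ASN: 64514},
	}))
}
```

The second option discussed above (always enable the global F-bit) would simply drop the `anyGR` check.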
cc @oribon
e2etest/pkg/metallb/metallb.go
Outdated
	ret := true
	for _, p := range pods {
		ret = ret && k8s.PodIsReady(p)
can you just exit early if the pod is not ready?
done
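For reference, the early exit requested above could look roughly like this. It is a sketch only: the `pod` type and `podIsReady` function are hypothetical stand-ins for the Kubernetes pod object and `k8s.PodIsReady`, and `allReady` is not a function from this PR.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for a Kubernetes pod object.
type pod struct {
	Name  string
	Ready bool
}

// podIsReady is a placeholder for a readiness check such as k8s.PodIsReady.
func podIsReady(p pod) bool { return p.Ready }

// allReady returns false as soon as one pod is not ready, instead of
// accumulating the result over the whole slice.
func allReady(pods []pod) bool {
	for _, p := range pods {
		if !podIsReady(p) {
			return false
		}
	}
	return true
}

func main() {
	pods := []pod{
		{Name: "speaker-a", Ready: true},
		{Name: "speaker-b", Ready: false},
	}
	fmt.Println(allReady(pods))
}
```

As with the accumulated `ret` variable, an empty slice still yields `true`.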
e2etest/pkg/metallb/metallb.go
Outdated
		}
	}

	f := func(context.Context) (bool, error) {
this can be embedded as anonymous function in the call below. No need to assign it to a variable.
done, but is there a drawback to having it assigned to a variable?
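A small sketch of the inline form the reviewer asked for is shown here. `pollUntil` and `runDemo` are hypothetical helpers written for this example; the actual polling helper used in the e2e tests is not shown in this thread.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil is a hypothetical helper: it calls cond every interval until cond
// returns true, returns an error, or ctx is done.
func pollUntil(ctx context.Context, interval time.Duration, cond func(context.Context) (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// runDemo passes the condition inline as an anonymous function rather than
// binding it to a variable first.
func runDemo() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	attempts := 0
	err := pollUntil(ctx, 10*time.Millisecond, func(context.Context) (bool, error) {
		attempts++
		return attempts >= 3, nil
	})
	return attempts, err
}

func main() {
	fmt.Println(runDemo())
}
```

Both forms behave the same at runtime; the inline form only keeps the condition next to its single call site.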
e2etest/bgptests/bgp.go
Outdated
	}

	if err := c(); err != nil {
		if !errors.Is(err, ErrStaleRoute) {
why is staleroute skipped?
During the control-plane reboot (when the peer has started its graceful restart timers), the routes are stale, and that is okay (the happy path), so we should ignore stale routes during that time. We should NOT ignore them when we generally check for the svc to be validated.
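The distinction described above can be sketched with a sentinel error and `errors.Is`. The definition of `ErrStaleRoute` here, and the `validate` and `staleCheck` helpers, are assumptions made for illustration; only the name `ErrStaleRoute` comes from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStaleRoute mirrors the sentinel error named in the diff; its definition
// here is an assumption.
var ErrStaleRoute = errors.New("stale route")

// validate runs check and, when tolerateStale is set, treats ErrStaleRoute
// as success. Any other error is always returned.
func validate(check func() error, tolerateStale bool) error {
	err := check()
	if err == nil {
		return nil
	}
	if tolerateStale && errors.Is(err, ErrStaleRoute) {
		return nil
	}
	return err
}

// staleCheck is a hypothetical check that reports a wrapped stale-route error.
func staleCheck() error {
	return fmt.Errorf("route 192.0.2.0/24: %w", ErrStaleRoute)
}

func main() {
	fmt.Println(validate(staleCheck, true))  // during the restart window
	fmt.Println(validate(staleCheck, false)) // regular service validation
}
```

Because the check uses `errors.Is`, the stale-route case is still recognized when the sentinel is wrapped with extra context.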
BGP graceful restart, as defined in RFC 4724, specifies the mechanisms that allow a BGP speaker to continue forwarding data packets along known routes while the routing protocol information is being restored. This allows the DaemonSet (DS) to be updated without routes being retracted on the peer side.

If GR is enabled, we set the F-bit (preserve-fw-state) by default.

We make gracefulRestart immutable according to https://kubernetes.io/blog/2022/09/29/enforce-immutability-using-cel/#immutablility-after-first-modification

Signed-off-by: karampok <karampok@gmail.com>
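As an illustration of CEL-based immutability, a kubebuilder `XValidation` marker with the transition rule `self == oldSelf` is the most basic form. The struct and field names below are hypothetical and the exact rule used in this PR is not shown in the thread; the linked blog post describes further variants.

```go
package main

import "fmt"

// BGPPeerSpec is a trimmed-down, hypothetical spec used only for illustration.
type BGPPeerSpec struct {
	// EnableGracefulRestart turns on BGP graceful restart for this peer.
	// The transition rule in the marker below rejects any update that
	// changes the value once it has been set.
	// +optional
	// +kubebuilder:validation:XValidation:message="EnableGracefulRestart cannot be changed after creation",rule="self == oldSelf"
	EnableGracefulRestart bool `json:"enableGracefulRestart,omitempty"`
}

func main() {
	fmt.Printf("%+v\n", BGPPeerSpec{EnableGracefulRestart: true})
}
```

The marker only takes effect once the CRD manifests are regenerated from the Go types; the Go code itself does not enforce anything.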
@fedepaol I have completed the last commit by adding the e2e test (plus the reverse case, to see a failure without graceful restart). I think I addressed all your comments. Let me know if I missed something. Thanks
Pushed another version that makes the function not depend on an external variable.
@@ -25,7 +25,7 @@ func Do(address string, exc executor.Executor) error {

 	// Retry loop to handle wget NetworkFailure errors
 	for {
-		out, err = exc.Exec("wget", "-O-", "-q", address, "-T", "60")
+		out, err = exc.Exec("wget", "-O-", "-q", address, "-T", "5")
is this change related?
Yes, a timeout of 60 seconds could block the serial test and hide a potential downtime.
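The effect described above can be reproduced in a small simulation. Everything in this sketch is hypothetical: `probe` stands in for the `wget` call, and the durations are scaled down to milliseconds. A request that hangs for a while and then succeeds counts as a success under a long timeout, so the outage goes unnoticed, whereas a short timeout turns the same hang into visible failures.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// probe is a hypothetical stand-in for the wget call: it hangs for hang and
// then succeeds, unless the per-request timeout fires first.
func probe(ctx context.Context, hang time.Duration) error {
	select {
	case <-time.After(hang):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// countFailures probes sequentially for the given window, giving each request
// its own timeout, and returns how many requests failed.
func countFailures(window, timeout, hang time.Duration) int {
	failures := 0
	deadline := time.Now().Add(window)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := probe(ctx, hang)
		cancel()
		if err != nil {
			failures++
		}
	}
	return failures
}

// demo compares a short and a long per-request timeout against a service
// that stalls every request for 200ms.
func demo() (short, long int) {
	short = countFailures(300*time.Millisecond, 50*time.Millisecond, 200*time.Millisecond)
	long = countFailures(300*time.Millisecond, time.Second, 200*time.Millisecond)
	return short, long
}

func main() {
	short, long := demo()
	fmt.Println("failures with short timeout:", short)
	fmt.Println("failures with long timeout:", long)
}
```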
e2etest/pkg/frr/provider/provider.go
Outdated
 	if !ok {
-		return nil, fmt.Errorf("speakers %s/%s not found in known speakers %v", ns, name, f.speakers)
+		return nil, fmt.Errorf("speakers %s/%s not found in known speakers %v", ns, name, speakers)
nit: speaker %s not found in..
done
@@ -415,7 +415,7 @@ func (j *TestJig) UpdateService(ctx context.Context, update func(*v1.Service)) (

 // WaitForLoadBalancer waits the given service to have a LoadBalancer, or returns an error after the given timeout
 func (j *TestJig) WaitForLoadBalancer(ctx context.Context, timeout time.Duration) (*v1.Service, error) {
-	ginkgo.GinkgoWriter.Printf("Waiting up to %v for service %q to have a LoadBalancer", timeout, j.Name)
+	ginkgo.GinkgoWriter.Printf("\tWaiting up to %v for service %q to have a LoadBalancer\n", timeout, j.Name)
is this related? If not but it makes sense, please add it in a separate commit
done
e2etest/pkg/metallb/metallb.go
Outdated
		}
		return true, nil
	})

nit: unnecessary line
done
A few nits, almost there!
- Restart the speaker pods in a non-blocking way
- Wait for the speakers to be ready and, at the same time, monitor for downtime

The BDD description looks like:
BGP GracefulRestart, when speakers restart and when GR enabled dataplane should keep working
BGP GracefulRestart, when speakers restart when GR disabled dataplane should have a downtime

Signed-off-by: karampok <karampok@gmail.com>
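The test flow in the commit message above (restart without blocking, then monitor for downtime until the restart completes) can be sketched with a goroutine and a ticker. `monitorWhile` and `demo` are hypothetical names, and the restart and probe functions are simulated; this is not the e2e code from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// monitorWhile runs restart in a goroutine and keeps calling probe at every
// interval until restart returns. It reports the number of failed probes and
// the error returned by restart.
func monitorWhile(restart func() error, probe func() error, interval time.Duration) (int, error) {
	done := make(chan error, 1)
	go func() { done <- restart() }()

	failures := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return failures, err
		case <-ticker.C:
			if probe() != nil {
				failures++
			}
		}
	}
}

// demo simulates both scenarios: the dataplane keeps working (as expected
// with GR enabled) and the dataplane drops traffic (as expected without GR).
func demo() (withGR, withoutGR int) {
	restart := func() error {
		time.Sleep(100 * time.Millisecond) // simulated speaker restart
		return nil
	}
	healthy := func() error { return nil }
	broken := func() error { return errors.New("connection refused") }

	withGR, _ = monitorWhile(restart, healthy, 10*time.Millisecond)
	withoutGR, _ = monitorWhile(restart, broken, 10*time.Millisecond)
	return withGR, withoutGR
}

func main() {
	withGR, withoutGR := demo()
	fmt.Println("failed probes with GR:", withGR)
	fmt.Println("failed probes without GR:", withoutGR)
}
```

The two scenarios map onto the two BDD cases: zero failed probes for "dataplane should keep working", and at least one for "dataplane should have a downtime".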
What about now :) let me know if you want to rebase. Thanks
Rebase is not necessary, as we use merge queues.
/kind feature
What this PR does / why we need it:
Issue covers #2368
Special notes for your reviewer:
Release note: