New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support simple ping mesh in agent. #6120
base: main
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as resolved.
This comment was marked as resolved.
I0408 19:16:11.801016 1 monitor.go:249] "Connection status" Connection={"FromIP":"10.244.2.1","ToIP":"10.244.0.1","Latency":0,"Status":false,"LastUpdated":"2024-04-08T19:16:11.800804713Z","CreatedAt":"0001-01-01T00:00:00Z"}
I0408 19:16:11.801040 1 monitor.go:249] "Connection status" Connection={"FromIP":"10.244.2.1","ToIP":"10.244.2.1","Latency":97881,"Status":true,"LastUpdated":"2024-04-08T19:16:11.80070513Z","CreatedAt":"0001-01-01T00:00:00Z"}
I0408 19:16:11.801057 1 monitor.go:249] "Connection status" Connection={"FromIP":"10.244.2.1","ToIP":"10.244.1.1","Latency":0,"Status":false,"LastUpdated":"2024-04-08T19:16:11.800928351Z","CreatedAt":"0001-01-01T00:00:00Z"} |
310d69b
to
b866203
Compare
pkg/apis/crd/v1alpha2/types.go
Outdated
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object | ||
type NodeLatencyMonitorList struct { | ||
metav1.TypeMeta `json:",inline"` | ||
// +optional | ||
metav1.ListMeta `json:"metadata,omitempty"` | ||
|
||
Items []NodeLatencyMonitor `json:"items"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For that kind of "singleton" CRD, does it make sense to support the List operation? Maybe we could do without?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antoninbas My previous implementation didn't add a List, but it seems to have failed to generate it in make codegen
, suggesting that the type of List needs to be supplied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can explicitly tell client-gen to only generate for get/watch etc., or to skip list https://gist.github.com/liangrog/1fc1cfbbfb0ef555729781c73caaef0b#file-k8s-code-gen-tags-L5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't add a list, we can't use the informer interface directly to listen to CRD resources, and if we use watch directly, we will probably handle the events manually in the same way as cmd/antrea-agent-simulator/simulator.go
.
It may be necessary to get the crd resource via get for the first time when the node is online, and then listen to the resource changes via watch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a comment for this NodeLatencyMonitorList
struct.
I will update my code later. |
pkg/apis/crd/v1alpha2/types.go
Outdated
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object | ||
type NodeLatencyMonitorList struct { | ||
metav1.TypeMeta `json:",inline"` | ||
// +optional | ||
metav1.ListMeta `json:"metadata,omitempty"` | ||
|
||
Items []NodeLatencyMonitor `json:"items"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can explicitly tell client-gen to only generate for get/watch etc., or to skip list https://gist.github.com/liangrog/1fc1cfbbfb0ef555729781c73caaef0b#file-k8s-code-gen-tags-L5
Latest log: I0422 18:01:41.173972 1 monitor.go:315] "Recv ICMP message" IP="10.244.0.1" SeqID=6 echo={"ID":1,"Seq":6,"Data":"MjAyNC0wNC0yMlQxODowMTo0MS4xNzM4MDI0MTZa"}
I0422 18:01:41.174009 1 monitor.go:366] "NodeIPLatency status" Key="10.244.0.1" Entry={"SeqID":6,"LastSendTime":"2024-04-22T18:01:41.173802416Z","LastRecvTime":"2024-04-22T18:01:26.173511212Z","LastMeasuredRTT":451565}
I0422 18:01:41.174042 1 monitor.go:366] "NodeIPLatency status" Key="10.244.1.1" Entry={"SeqID":4,"LastSendTime":"2024-04-22T18:01:41.173511596Z","LastRecvTime":"2024-04-22T18:01:26.177130478Z","LastMeasuredRTT":3727957}
I0422 18:01:41.174064 1 monitor.go:366] "NodeIPLatency status" Key="10.244.2.1" Entry={"SeqID":5,"LastSendTime":"2024-04-22T18:01:41.17371097Z","LastRecvTime":"2024-04-22T18:01:26.177815251Z","LastMeasuredRTT":4256801} |
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: IRONICBo <boironic@gmail.com>
f6c7714
to
6915393
Compare
Hi @antoninbas , do you know why the codecov bot is not running for this PR? @IRONICBo was asking about the codecov requirements for merging a PR |
I don't think there has been a successful CI run yet? |
pkg/agent/monitortool/monitor.go
Outdated
} | ||
|
||
// Convert to NodeIPLatencyStats | ||
nodeIPLatencyStats := &cpv1beta.NodeIPLatencyStat{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One node has dual IPs?
One node only contains a key(name?) to store latencies in map in agg.
map[nodename]NodeIPLatencyEntry
Put a nodename to NodeLatencyEntry struct.
142eef7
to
cea5be0
Compare
Signed-off-by: IRONICBo <boironic@gmail.com>
cea5be0
to
3c1cfef
Compare
pkg/agent/monitortool/monitor.go
Outdated
// Notify the latency config changed | ||
m.latencyConfigChanged <- struct{}{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may not be ideal if there is potential for blocking, as event handlers should not block
we can revisit later and potentially use a workqueue, no need to address this for this PR unless someone feels strongly about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, using the workqueue approach would be better than using the channel directly, I'll discuss and address this in other PRs.
pkg/agent/monitortool/monitor.go
Outdated
case <-stopCh: | ||
return | ||
default: | ||
readBuffer := make([]byte, 1500) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did you determine the right buffer size here? There should be a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I have added the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is not accurate. The maximum size of an Ethernet frame is not 1500, that depends on the MTU.
We do however only expect small packets, because we only expect ICMP echo replies messages, and the size of the ICMP echo request messages we send is small. I am pretty sure that we don't expect messages larger than 128 bytes, so I would use a buffer with that size. If we get a larger message, the extra data will be discarded, which is fine.
Additionally, in our case it should be possible to reuse the same buffer for each loop iteration, as opposed to allocating a new buffer every time.
Signed-off-by: Asklv <boironic@gmail.com>
Signed-off-by: Asklv <boironic@gmail.com>
be5aa62
to
b44469b
Compare
I have fixed the unit test error in my local machine. antrea/pkg/agent/monitortool# go test -race ./...
ok antrea.io/antrea/pkg/agent/monitortool 1.243s |
Signed-off-by: Asklv <boironic@gmail.com>
b44469b
to
49074e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general comment, avoid obvious code comments that are just a one sentence description of what the next line of code is doing, when it's already obvious from the code. It just makes everything more verbose and doesn't help visibility.
pkg/agent/monitortool/monitor.go
Outdated
// getICMPSeq returns the next sequence number as uint32, | ||
// wrapping around to 0 after reaching the maximum value of uint32. | ||
func getICMPSeq() uint32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that this should return a uint16, given that the sequence number is a 2-byte value. You should do an explicit uint16
cast before returning newVal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I changed the return type.
pkg/agent/monitortool/monitor.go
Outdated
latencyConfig *LatencyConfig | ||
// latencyConfigChanged is the channel to notify the latency config changed. | ||
latencyConfigChanged chan struct{} | ||
// isIPv4Enabled is the flag to indicate if the IPv4 is enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// isIPv4Enabled is the flag to indicate if the IPv4 is enabled. | |
// isIPv4Enabled is the flag to indicate whether IPv4 is enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have fixed it.
pkg/agent/monitortool/monitor.go
Outdated
latencyConfigChanged chan struct{} | ||
// isIPv4Enabled is the flag to indicate if the IPv4 is enabled. | ||
isIPv4Enabled bool | ||
// isIPv6Enabled is the flag to indicate if the IPv6 is enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// isIPv6Enabled is the flag to indicate if the IPv6 is enabled. | |
// isIPv6Enabled is the flag to indicate whether IPv6 is enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have fixed it.
pkg/agent/monitortool/monitor.go
Outdated
// The informer of Nodes, it will changed by Node watcher | ||
nodeInformer coreinformers.NodeInformer | ||
// nodeLatencyMonitorInformer is the informer for the NodeLatencyMonitor CRD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 2 comments are not really helpful. Unexpected fields don't always benefit from a field comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have fixed it.
pkg/agent/monitortool/monitor.go
Outdated
nodeLatencyMonitorInformer: nlmInformer, | ||
} | ||
|
||
// Get the IPv4/IPv6 enabled status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a useful comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have fixed it.
getNodeIPs, err := l.getNodeIPs(node) | ||
if err != nil { | ||
klog.ErrorS(err, "Failed to get IPs for Node", "nodeName", node.Name) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be much better not to rely on this for deletion. Ideally you should do your cleanup simply based on the Node name and your internal state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not addressed, we should not need to call l.getNodeIPs(node)
in the deletion handler. We should do a lookup internally based on the Node name, and figure out what we need to clean up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I use nodeTargetIPsMap
to retrieve target IPs and remove them.
if len(podCIDRStrs) == 0 { | ||
// Skip the Node if it does not have a PodCIDR. | ||
err := errors.New("node does not have a PodCIDR") | ||
klog.ErrorS(err, "Node does not have a PodCIDR", "Node name", node.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a rule, it is somewhat bad practice to log an error AND also return it, as it easily leads to duplicate error logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have removed all of these logs.
} | ||
|
||
if node.Spec.PodCIDR == "" { | ||
// Does not help to return an error and trigger controller retries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was copied from node_route_controller.go, but it doesn't make sense in your case because you don't have a workqueue, so there is no retry mechanism. You should remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have removed them.
} | ||
|
||
// CleanUp cleans up the latency store. | ||
func (l *LatencyStore) CleanUp() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point me to where this is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LatencyStore is cleaned up when crd is deleted, I have updated the code.
cmd/antrea-agent/agent.go
Outdated
// Start the node latency monitor. | ||
if features.DefaultFeatureGate.Enabled(features.NodeLatencyMonitor) { | ||
nodeLatencyMonitor := monitortool.NewNodeLatencyMonitor( | ||
nodeInformer, | ||
nodeLatencyMonitorInformer, | ||
nodeConfig, | ||
networkConfig.TrafficEncapMode, | ||
) | ||
go nodeLatencyMonitor.Run(stopCh) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took some time to think about it, and this should probably be gated by if o.nodeType == config.K8sNode
. Check out the rest of this file for reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I add the nodeType
check.
Signed-off-by: Asklv <boironic@gmail.com>
815f399
to
527a0b0
Compare
We need to implement a simple ping-mesh monitoring tool to get nodes connectivity latency.
Design doc(TODO): https://docs.google.com/document/d/1EdKJ8iQ3KwVBQAHaPisqHP7cgpq4RW8mD9KPWtYETbE
Issue: #5514