Support simple ping mesh in agent. #6120

Open
wants to merge 15 commits into
base: main

IRONICBo

We need to implement a simple ping-mesh monitoring tool to measure connectivity latency between Nodes.

Design doc (TODO): https://docs.google.com/document/d/1EdKJ8iQ3KwVBQAHaPisqHP7cgpq4RW8mD9KPWtYETbE

Issue: #5514

@IRONICBo IRONICBo changed the title from "feat: support simple ping mesh in agent." to "Support simple ping mesh in agent." Mar 19, 2024

@IRONICBo
Author

IRONICBo commented Apr 8, 2024

  • Support getting gw0/podCIDR from the Node spec.
  • Support a singleton CRD in nodelatencymonitor.

I may need to handle NetworkPolicyOnly mode by using the transport address; for now it only supports taking the IP from the first IP of the PodCIDR.

I0408 19:16:11.801016       1 monitor.go:249] "Connection status" Connection={"FromIP":"10.244.2.1","ToIP":"10.244.0.1","Latency":0,"Status":false,"LastUpdated":"2024-04-08T19:16:11.800804713Z","CreatedAt":"0001-01-01T00:00:00Z"}
I0408 19:16:11.801040       1 monitor.go:249] "Connection status" Connection={"FromIP":"10.244.2.1","ToIP":"10.244.2.1","Latency":97881,"Status":true,"LastUpdated":"2024-04-08T19:16:11.80070513Z","CreatedAt":"0001-01-01T00:00:00Z"}
I0408 19:16:11.801057       1 monitor.go:249] "Connection status" Connection={"FromIP":"10.244.2.1","ToIP":"10.244.1.1","Latency":0,"Status":false,"LastUpdated":"2024-04-08T19:16:11.800928351Z","CreatedAt":"0001-01-01T00:00:00Z"}

@IRONICBo IRONICBo force-pushed the 5514-ping-mesh-monitoring-tool branch from 310d69b to b866203 Compare April 9, 2024 06:25
@IRONICBo IRONICBo requested a review from Dyanngg April 9, 2024 18:47
Comment on lines 582 to 589
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
type NodeLatencyMonitorList struct {
	metav1.TypeMeta `json:",inline"`
	// +optional
	metav1.ListMeta `json:"metadata,omitempty"`

	Items []NodeLatencyMonitor `json:"items"`
}
Contributor

For that kind of "singleton" CRD, does it make sense to support the List operation? Maybe we could do without?

Author

@antoninbas My previous implementation didn't include a List type, but make codegen then failed, which suggests that the List type needs to be provided.

Contributor

You can explicitly tell client-gen to only generate for get/watch etc., or to skip list https://gist.github.com/liangrog/1fc1cfbbfb0ef555729781c73caaef0b#file-k8s-code-gen-tags-L5

Author

If we don't add a List type, we can't use the informer interface directly to watch the CRD resource. If we use watch directly, we will probably have to handle the events manually, the same way cmd/antrea-agent-simulator/simulator.go does.

We might need to fetch the CRD resource with get when the Node first comes online, and then follow changes to it with watch.

Author

I have added a comment for this NodeLatencyMonitorList struct.
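
For reference, the client-gen tags discussed in this thread are written as comments on the resource type itself. The sketch below is an illustration under assumptions, not the PR's code: it assumes the resource is cluster-scoped and that only get and watch are wanted, the TypeMeta and ObjectMeta structs are hypothetical stand-ins for the metav1 types so that the snippet compiles on its own, and the PingIntervalSeconds field is invented.

```go
package main

import "fmt"

// TypeMeta and ObjectMeta are hypothetical stand-ins for the metav1 types,
// used only so this sketch compiles without k8s.io/apimachinery.
type TypeMeta struct {
	Kind       string `json:"kind,omitempty"`
	APIVersion string `json:"apiVersion,omitempty"`
}

type ObjectMeta struct {
	Name string `json:"name,omitempty"`
}

// The tags below ask client-gen for a cluster-scoped typed client that
// exposes only Get and Watch. "+genclient:skipVerbs=list" is the inverse
// form: generate every verb except the ones listed.

// +genclient
// +genclient:nonNamespaced
// +genclient:onlyVerbs=get,watch
type NodeLatencyMonitor struct {
	TypeMeta   `json:",inline"`
	ObjectMeta `json:"metadata,omitempty"`

	// PingIntervalSeconds is an illustrative field, not taken from the PR.
	PingIntervalSeconds int32 `json:"pingIntervalSeconds"`
}

func main() {
	nlm := NodeLatencyMonitor{ObjectMeta: ObjectMeta{Name: "default"}, PingIntervalSeconds: 30}
	fmt.Println(nlm.Name, nlm.PingIntervalSeconds)
}
```

Generated informers and listers still rely on List and Watch, so restricting the verbs this way only helps if the agent consumes the resource through Get and Watch directly, which is the trade-off raised in the reply above.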

@IRONICBo
Author

I will update my code later.

@IRONICBo
Author

Latest log:

I0422 18:01:41.173972       1 monitor.go:315] "Recv ICMP message" IP="10.244.0.1" SeqID=6 echo={"ID":1,"Seq":6,"Data":"MjAyNC0wNC0yMlQxODowMTo0MS4xNzM4MDI0MTZa"}
I0422 18:01:41.174009       1 monitor.go:366] "NodeIPLatency status" Key="10.244.0.1" Entry={"SeqID":6,"LastSendTime":"2024-04-22T18:01:41.173802416Z","LastRecvTime":"2024-04-22T18:01:26.173511212Z","LastMeasuredRTT":451565}
I0422 18:01:41.174042       1 monitor.go:366] "NodeIPLatency status" Key="10.244.1.1" Entry={"SeqID":4,"LastSendTime":"2024-04-22T18:01:41.173511596Z","LastRecvTime":"2024-04-22T18:01:26.177130478Z","LastMeasuredRTT":3727957}
I0422 18:01:41.174064       1 monitor.go:366] "NodeIPLatency status" Key="10.244.2.1" Entry={"SeqID":5,"LastSendTime":"2024-04-22T18:01:41.17371097Z","LastRecvTime":"2024-04-22T18:01:26.177815251Z","LastMeasuredRTT":4256801}

Signed-off-by: IRONICBo <boironic@gmail.com>
@IRONICBo IRONICBo force-pushed the 5514-ping-mesh-monitoring-tool branch from f6c7714 to 6915393 Compare April 22, 2024 18:19
@heanlan
Contributor

heanlan commented Apr 24, 2024

Hi @antoninbas , do you know why the codecov bot is not running for this PR? @IRONICBo was asking about the codecov requirements for merging a PR

@antoninbas
Contributor

> Hi @antoninbas , do you know why the codecov bot is not running for this PR? @IRONICBo was asking about the codecov requirements for merging a PR

I don't think there has been a successful CI run yet?
For example, even the unit tests are currently failing.

}

// Convert to NodeIPLatencyStats
nodeIPLatencyStats := &cpv1beta.NodeIPLatencyStat{
Author

Can one Node have dual IPs?

Each Node should have only one key (its name?) under which latencies are stored in the map in agg:
map[nodename]NodeIPLatencyEntry

Add the Node name to the NodeLatencyEntry struct.

@IRONICBo IRONICBo force-pushed the 5514-ping-mesh-monitoring-tool branch from 142eef7 to cea5be0 Compare April 26, 2024 23:21
Signed-off-by: IRONICBo <boironic@gmail.com>
Comment on lines 204 to 205
// Notify the latency config changed
m.latencyConfigChanged <- struct{}{}
Contributor

This may not be ideal if there is potential for blocking, as event handlers should not block.

We can revisit this later and potentially use a workqueue. There is no need to address it in this PR unless someone feels strongly about it.

Author

Yes, using a workqueue would be better than using the channel directly. I'll discuss and address this in another PR.
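
Short of the workqueue mentioned in this thread, one common way to keep an event handler from blocking is a buffered channel combined with a non-blocking send. The sketch below is an illustration, not the PR's code; notifier and its methods are hypothetical names.

```go
package main

import "fmt"

// notifier coalesces "config changed" signals so that senders never block.
type notifier struct {
	changed chan struct{}
}

func newNotifier() *notifier {
	// Capacity 1: one pending signal is enough, because the consumer
	// re-reads the latest config when it wakes up.
	return &notifier{changed: make(chan struct{}, 1)}
}

// notify reports whether a new signal was queued. If one is already
// pending, the send is skipped instead of blocking the event handler.
func (n *notifier) notify() bool {
	select {
	case n.changed <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	n := newNotifier()
	fmt.Println(n.notify()) // true: signal queued
	fmt.Println(n.notify()) // false: already pending, skipped without blocking
	<-n.changed             // the consumer drains the signal
	fmt.Println(n.notify()) // true again
}
```

Dropping a duplicate signal is safe here only because the signal carries no data; a workqueue would additionally provide retries and rate limiting.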

case <-stopCh:
return
default:
readBuffer := make([]byte, 1500)
Contributor

How did you determine the right buffer size here? There should be a comment.

Author

Ok, I have added the comments.

Contributor

The comment is not accurate. The maximum size of an Ethernet frame is not 1500, that depends on the MTU.

We do, however, only expect small packets, because we only expect ICMP echo reply messages, and the ICMP echo request messages we send are small. I am pretty sure that we don't expect messages larger than 128 bytes, so I would use a buffer of that size. If we get a larger message, the extra data will be discarded, which is fine.

Additionally, in our case it should be possible to reuse the same buffer for each loop iteration, as opposed to allocating a new buffer every time.

Signed-off-by: Asklv <boironic@gmail.com>
Signed-off-by: Asklv <boironic@gmail.com>
@IRONICBo IRONICBo requested a review from antoninbas May 5, 2024 10:52
@IRONICBo IRONICBo force-pushed the 5514-ping-mesh-monitoring-tool branch 2 times, most recently from be5aa62 to b44469b Compare May 7, 2024 16:57
@IRONICBo IRONICBo requested a review from Dyanngg May 7, 2024 16:57
@IRONICBo
Author

IRONICBo commented May 8, 2024

I have fixed the unit test error on my local machine.

antrea/pkg/agent/monitortool# go test -race ./...
ok      antrea.io/antrea/pkg/agent/monitortool  1.243s

Signed-off-by: Asklv <boironic@gmail.com>
@IRONICBo IRONICBo force-pushed the 5514-ping-mesh-monitoring-tool branch from b44469b to 49074e0 Compare May 8, 2024 02:55
Contributor

@antoninbas antoninbas left a comment

As a general comment, avoid obvious code comments that are just a one sentence description of what the next line of code is doing, when it's already obvious from the code. It just makes everything more verbose and doesn't help visibility.

Comment on lines 52 to 54
// getICMPSeq returns the next sequence number as uint32,
// wrapping around to 0 after reaching the maximum value of uint32.
func getICMPSeq() uint32 {
Contributor

I still think that this should return a uint16, given that the sequence number is a 2-byte value. You should do an explicit uint16 cast before returning newVal.

Author

OK, I changed the return type.

latencyConfig *LatencyConfig
// latencyConfigChanged is the channel to notify the latency config changed.
latencyConfigChanged chan struct{}
// isIPv4Enabled is the flag to indicate if the IPv4 is enabled.
Contributor

Suggested change
- // isIPv4Enabled is the flag to indicate if the IPv4 is enabled.
+ // isIPv4Enabled is the flag to indicate whether IPv4 is enabled.

Author

@IRONICBo IRONICBo May 9, 2024

OK, I have fixed it.

latencyConfigChanged chan struct{}
// isIPv4Enabled is the flag to indicate if the IPv4 is enabled.
isIPv4Enabled bool
// isIPv6Enabled is the flag to indicate if the IPv6 is enabled.
Contributor

Suggested change
- // isIPv6Enabled is the flag to indicate if the IPv6 is enabled.
+ // isIPv6Enabled is the flag to indicate whether IPv6 is enabled.

Author

@IRONICBo IRONICBo May 9, 2024

OK, I have fixed it.

Comment on lines 76 to 78
// The informer of Nodes, it will changed by Node watcher
nodeInformer coreinformers.NodeInformer
// nodeLatencyMonitorInformer is the informer for the NodeLatencyMonitor CRD.
Contributor

These 2 comments are not really helpful. Unexported fields don't always benefit from a field comment.

Author

OK, I have fixed it.

nodeLatencyMonitorInformer: nlmInformer,
}

// Get the IPv4/IPv6 enabled status
Contributor

not a useful comment

Author

OK, I have fixed it.

Comment on lines 124 to 128
getNodeIPs, err := l.getNodeIPs(node)
if err != nil {
	klog.ErrorS(err, "Failed to get IPs for Node", "nodeName", node.Name)
	return
}
Contributor

It would be much better not to rely on this for deletion. Ideally you should do your cleanup simply based on the Node name and your internal state.

Contributor

Not addressed: we should not need to call l.getNodeIPs(node) in the deletion handler. We should do a lookup internally based on the Node name, and figure out what we need to clean up.

Author

OK, I now use nodeTargetIPsMap to retrieve the target IPs and remove them.

if len(podCIDRStrs) == 0 {
// Skip the Node if it does not have a PodCIDR.
err := errors.New("node does not have a PodCIDR")
klog.ErrorS(err, "Node does not have a PodCIDR", "Node name", node.Name)
Contributor

As a rule, it is somewhat bad practice to log an error AND also return it, as it easily leads to duplicate error logs.

Author

@IRONICBo IRONICBo May 9, 2024

OK, I have removed all of these logs.

}

if node.Spec.PodCIDR == "" {
// Does not help to return an error and trigger controller retries.
Contributor

This comment was copied from node_route_controller.go, but it doesn't make sense in your case because you don't have a workqueue, so there is no retry mechanism. You should remove it.

Author

Thanks, I have removed them.

}

// CleanUp cleans up the latency store.
func (l *LatencyStore) CleanUp() {
Contributor

Can you point me to where this is called?

Author

LatencyStore is cleaned up when the CRD is deleted; I have updated the code.

Comment on lines 927 to 936
// Start the node latency monitor.
if features.DefaultFeatureGate.Enabled(features.NodeLatencyMonitor) {
	nodeLatencyMonitor := monitortool.NewNodeLatencyMonitor(
		nodeInformer,
		nodeLatencyMonitorInformer,
		nodeConfig,
		networkConfig.TrafficEncapMode,
	)
	go nodeLatencyMonitor.Run(stopCh)
}
Contributor

I took some time to think about it, and this should probably be gated by if o.nodeType == config.K8sNode. Check out the rest of this file for reference.

Author

OK, I have added the nodeType check.

@IRONICBo IRONICBo requested a review from antoninbas May 9, 2024 02:02
Signed-off-by: Asklv <boironic@gmail.com>
@IRONICBo IRONICBo force-pushed the 5514-ping-mesh-monitoring-tool branch from 815f399 to 527a0b0 Compare May 9, 2024 17:23
4 participants