[Windows] Use IP and MAC to find virtual management adapter #3641

Merged
merged 1 commit into from Apr 18, 2022

Conversation

wenyingd
Contributor

@wenyingd wenyingd commented Apr 14, 2022

  1. After an HNSNetwork is created, the Windows host creates a virtual
    management network adapter that takes over the uplink's IP and MAC.
    Previously, the virtual adapter was looked up by a name of the form
    "vEthernet ($uplink_name)", but that lookup can fail when the name is
    already taken by another adapter. This change uses the uplink's IP and
    MAC to find the adapter, with the prefix "vEthernet" as a filter.
  2. Remove the virtual adapter name from the list of names used to look up
    the Windows Node transport interface's IP configuration when the agent
    restarts. The IP is ultimately moved to the OVS bridge interface, which
    is renamed from the virtual network adapter, so in the restart case a
    virtual network adapter named "vEthernet ($uplink_name)" should not
    exist.

Fixes #3636

Signed-off-by: wenyingd <wenyingd@vmware.com>

@wenyingd
Contributor Author

/test-windows-all
/test-all
/skip-ipv6-all
/skip-ipv6-only-all

@codecov-commenter

codecov-commenter commented Apr 14, 2022

Codecov Report

Merging #3641 (381fb96) into main (d7b1eed) will decrease coverage by 13.16%.
The diff coverage is n/a.

❗ Current head 381fb96 differs from pull request most recent head 185e0c2. Consider uploading reports for the commit 185e0c2 to get more accurate results


@@             Coverage Diff             @@
##             main    #3641       +/-   ##
===========================================
- Coverage   63.35%   50.18%   -13.17%     
===========================================
  Files         278      248       -30     
  Lines       39367    35664     -3703     
===========================================
- Hits        24941    17899     -7042     
- Misses      12472    15968     +3496     
+ Partials     1954     1797      -157     
Flag Coverage Δ
e2e-tests 50.18% <ø> (?)
kind-e2e-tests ?
unit-tests ?


Impacted Files Coverage Δ
pkg/controller/networkpolicy/endpoint_querier.go 4.58% <0.00%> (-88.08%) ⬇️
pkg/agent/util/iptables/lock.go 0.00% <0.00%> (-80.00%) ⬇️
pkg/cni/client.go 0.00% <0.00%> (-77.78%) ⬇️
pkg/controller/networkpolicy/crd_utils.go 14.48% <0.00%> (-77.25%) ⬇️
...lowaggregator/clickhouseclient/clickhouseclient.go 0.00% <0.00%> (-76.62%) ⬇️
pkg/controller/externalippool/validate.go 0.00% <0.00%> (-75.87%) ⬇️
pkg/apiserver/handlers/featuregates/handler.go 1.63% <0.00%> (-73.78%) ⬇️
.../registry/networkpolicy/clustergroupmember/rest.go 11.11% <0.00%> (-73.62%) ⬇️
pkg/controller/networkpolicy/clustergroup.go 3.50% <0.00%> (-73.57%) ⬇️
...kg/agent/flowexporter/connections/conntrack_ovs.go 0.00% <0.00%> (-70.91%) ⬇️
... and 146 more

@wenyingd
Contributor Author

/test-windows-all
/test-all
/skip-ipv6-all
/skip-ipv6-only-all

Contributor

@antoninbas antoninbas left a comment


Thanks for the quick fix @wenyingd. This LGTM.

I tried it on my existing Windows Node first. I got the following logs from the agent:

ubuntu@ip-10-0-0-25:~$ kubectl  -n kube-system logs antrea-agent-windows-fbx9l -f

    Directory: C:\host\k\antrea

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----           4/14/2022  8:33 PM                bin
I0414 20:34:34.151157    6448 log_file.go:99] Set log file max size to 104857600
I0414 20:34:34.225900    6448 agent.go:84] Starting Antrea agent (version v1.7.0-dev-185e0c2.dirty)
I0414 20:34:34.225900    6448 client.go:81] No kubeconfig file was specified. Falling back to in-cluster config
W0414 20:34:34.233898    6448 env.go:83] Environment variable POD_NAMESPACE not found
W0414 20:34:34.235898    6448 env.go:121] Failed to get Pod Namespace from environment. Using "kube-system" as the Antrea Service Namespace
I0414 20:34:34.235898    6448 prometheus.go:171] Initializing prometheus metrics
I0414 20:34:34.235898    6448 ovs_client.go:68] Connecting to OVSDB at address \\.\pipe\C:openvswitchvarrunopenvswitchdb.sock
I0414 20:34:34.236899    6448 agent.go:331] Setting up node network
I0414 20:34:43.286042    6448 agent.go:837] "Setting Node MTU" MTU=8951
I0414 20:34:48.793827    6448 net_windows.go:386] "Creating HNSNetwork" name="antrea-hnsnetwork" subnet="192.168.3.0/24" nodeIP="10.0.0.189/24" adapter=&{Index:11 MTU:9001 Name:Ethernet HardwareAddr:06:5e:47:7f:7f:93 Flags:up|broadcast|multicast}
I0414 20:34:50.430779    6448 net_windows.go:408] "Moving uplink configuration to the management virtual network adapter" adapter="vEthernet (Ethernet) 3"
I0414 20:35:02.896840    6448 net_windows.go:431] "Moved uplink configuration to the management virtual network adapter" adapter="vEthernet (Ethernet) 3"



^C
ubuntu@ip-10-0-0-25:~$ ping 10.0.0.189
PING 10.0.0.189 (10.0.0.189) 56(84) bytes of data.
^C
--- 10.0.0.189 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4084ms

ubuntu@ip-10-0-0-25:~$

After that, connectivity was lost to the instance and I had to force reboot from the AWS console to recover connectivity. I had the same issue after rebooting the Antrea Agent.

However, I tried on a fresh Windows instance, and I didn't observe the issue:

ubuntu@ip-10-0-0-25:~$ kubectl  -n kube-system logs antrea-agent-windows-g62fc -f

    Directory: C:\host\k\antrea

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----           4/14/2022  8:49 PM                bin
I0414 20:49:37.801997    7584 log_file.go:99] Set log file max size to 104857600
I0414 20:49:37.866647    7584 agent.go:84] Starting Antrea agent (version v1.7.0-dev-185e0c2.dirty)
I0414 20:49:37.867650    7584 client.go:81] No kubeconfig file was specified. Falling back to in-cluster config
W0414 20:49:37.875656    7584 env.go:83] Environment variable POD_NAMESPACE not found
W0414 20:49:37.877655    7584 env.go:121] Failed to get Pod Namespace from environment. Using "kube-system" as the Antrea Service Namespace
I0414 20:49:37.878663    7584 prometheus.go:171] Initializing prometheus metrics
I0414 20:49:37.878663    7584 ovs_client.go:68] Connecting to OVSDB at address \\.\pipe\C:openvswitchvarrunopenvswitchdb.sock
I0414 20:49:37.879668    7584 agent.go:331] Setting up node network
I0414 20:49:37.920852    7584 agent.go:837] "Setting Node MTU" MTU=8951
I0414 20:49:43.122357    7584 net_windows.go:386] "Creating HNSNetwork" name="antrea-hnsnetwork" subnet="192.168.4.0/24" nodeIP="10.0.0.10/24" adapter=&{Index:9 MTU:9001 Name:Ethernet HardwareAddr:06:37:40:9b:4a:09 Flags:up|broadcast|multicast}
I0414 20:49:58.292344    7584 net_windows.go:514] Enabled Receive Segment Coalescing (RSC) for vSwitch antrea-hnsnetwork
I0414 20:49:58.292480    7584 net_windows.go:453] "Created HNSNetwork" name="antrea-hnsnetwork" id="8918EBD5-E86A-4B3F-B6F6-46C485DB0806"
I0414 20:49:58.293621    7584 ovs_client.go:119] Created bridge: 0d1b7b88-5b32-4db1-8d36-1e9e98e73819
...

I don't know if the error in the first instance is something we need to worry about. I know that this corresponds to a different code path in PrepareHNSNetwork, but I don't know enough about it.

@wenyingd
Contributor Author

/test-windows-conformance

@wenyingd
Contributor Author

After that, connectivity was lost to the instance and I had to force reboot from the AWS console to recover connectivity. I had the same issue after rebooting the Antrea Agent.

I think we should focus on the issue on the first instance. @antoninbas Could you help dump the IP/route configuration from the console after the network is lost? This logic runs when the HNS network does not move the IP to the virtual management adapter (which is not expected), in which case the agent tries to move the configuration itself.

@wenyingd
Contributor Author

@antoninbas I have another question: is OVS working correctly on the Windows Node in your first instance? To verify it, maybe you could try with Antrea 1.4?

@antoninbas
Contributor

@wenyingd Unfortunately I deleted that instance yesterday after I got the new instance working, so I can't collect the information you are asking for :/

Could you help dump the IP/route configurations from console after the network is lost?

I only have RDP access to the instance, so I don't think I could have done that... Once the network goes down, I don't have any access to the instance anymore.

Let me know if we can merge this PR.

@wenyingd
Contributor Author

wenyingd commented Apr 18, 2022

@wenyingd Unfortunately I deleted that instance yesterday after I got the new instance working, so I can't collect the information you are asking for :/

Could you help dump the IP/route configurations from console after the network is lost?

I only have RDP access to the instance, so I don't think I could have done that... Once the network goes down, I don't have any access to the instance anymore.

Let me know if we can merge this PR.

Then maybe we could merge this PR first? In my opinion, the loss of network connectivity is probably a different issue, and we can handle it once it is reproduced and enough information has been collected. What do you think @antoninbas?

@antoninbas
Contributor

@wenyingd sounds good to me

@antoninbas antoninbas added area/OS/windows Issues or PRs related to the Windows operating system. kind/bug Categorizes issue or PR as related to a bug. action/release-note Indicates a PR that should be included in release notes. labels Apr 18, 2022
@antoninbas
Contributor

@wenyingd could you backport this as needed?

@antoninbas antoninbas merged commit ec06feb into antrea-io:main Apr 18, 2022
@wenyingd wenyingd deleted the issue_3636 branch August 15, 2022 03:31
Successfully merging this pull request may close these issues.

Antrea Agent failing to start on Windows