EFLOW is not working on some devices. Believed to be networking issue. #279

Closed
bhjertaas opened this issue Nov 15, 2023 · 5 comments

Comments

@bhjertaas

bhjertaas commented Nov 15, 2023

Question Summary
We are experiencing sudden problems with EFLOW, often after a reboot of the PC. EFLOW/Edge runs fine for days, then suddenly disconnects from IoT Hub, and we can no longer run Connect-EflowVm, Start-EflowVm, or similar commands.

Detailed background
We have about 15 Windows PCs running EFLOW. Over the last few days we have experienced problems on some of them, as explained in the summary.

EFLOW is configured with the Default Switch, like this:
Deploy-Eflow -cpuCount 1 -memoryInMB 2048 -vmDataSize 16 -acceptEula Yes -acceptOptionalTelemetry No

We could really use some help figuring out what is causing this.

We've run the commands mentioned on the troubleshooting pages.

PS C:\WINDOWS\system32> Connect-EflowVm
ERROR: 0x80072741
ssh_exchange_identification: Connection closed by remote host

===================================

PS C:\WINDOWS\system32> Get-Service -Name WSSDAgent
Status   Name               DisplayName
------   ----               -----------
Running  WSSDAgent          WSSD Agent Service

=================

PS C:\WINDOWS\system32> Get-CIMInstance Win32_OperatingSystem | Select FreePhysicalMemory
FreePhysicalMemory
------------------
          13057796           (yes this is more than the allocated amount)

==================================

PS C:\WINDOWS\system32> Get-EflowVm
      [11/15/2023 15:09:48] Exception caught!!!
      
       - ssh  failed to execute [ERROR: 0x80072741 ssh_exchange_identification: Connection closed by remote host] at line 4349
      
      [11/15/2023 15:09:48] Exception caught!!!
      
       - Cannot index into a null array. at line 1815
      
      
      VmConfiguration    : @{ID=1be44d111d6c133; name=DESKTOP-LS0RRP4-EFLOW; properties=; tags=}
      VmPowerState       : Running
      EdgeRuntimeVersion : {}
      EdgeRuntimeStatus  : {}
      SystemStatistics   : {}

==================================

PS C:\WINDOWS\system32> (Get-EflowVm).EdgeRuntimeStatus.SystemCtlStatus | Format-List
          [11/15/2023 15:11:31] Exception caught!!!
          
           - ssh  failed to execute [ERROR: 0x80072741 ssh_exchange_identification: Connection closed by remote host] at line 4349
          
          [11/15/2023 15:11:31] Exception caught!!!
          
           - Cannot index into a null array. at line 1815

==================================

PS C:\WINDOWS\system32> hcsdiag list
PS C:\WINDOWS\system32>

==================================

PS C:\WINDOWS\system32> ssh -V
OpenSSH_for_Windows_7.7p1, LibreSSL 2.6.5

==================================

PS C:\WINDOWS\system32> Get-EflowVmAddr

[11/15/2023 15:15:27] Querying IP and MAC addresses from virtual machine (DESKTOP-LS0RRP4-EFLOW)

 - Virtual machine MAC: 00:15:5d:16:86:57
 - Failed to acquire IP address for virtual machine
00:15:5d:16:86:57

==================================

I also ran Get-EflowLogs, but many of the files are empty. The zip file is attached.
eflowlogs.zip

Azure IoT Edge LTS (the EFLOW program on Windows) is version 1.4.10.25103.
No new Windows updates have been applied.
It is running on a Windows 10 Enterprise (version 1809) 64-bit PC with 16 GB RAM and an i7-9700E CPU.

@bhjertaas
Author

bhjertaas commented Nov 23, 2023

We are still waiting for assistance on this issue. Help would be much appreciated.
Since the last post we have updated Windows, which we initially thought had fixed the problem because the device ran fine for almost two days. But after a reboot of the PC, we are seeing the same problem again. It happens mainly on one machine, while another identical PC (same hardware and software) runs without any issues. The affected PC has rebooted without problems a handful of times, but we believe this latest reboot did something to HCS or the Default Switch.

Why is it that hcsdiag list returns an empty list when Get-Service -Name WSSDAgent doesn't?

Here are extracts from the wssdagent.log file:

[LOG: 11-23 08:17:0000058 TraceID: hcn.go:36 c52bcfa9135705b6b76e46e308a3b541] HCS%!!(MISSING)(EXTRA string=CreateVirtualNetworkInterface, string=DESKTOP-LS0RRP4-EFLOWInterface)

Failed to write to log, write /dev/stderr: The handle is invalid.

[SPAN_END: 11-23 08:29:0000049 TraceID: b67096bd3d0abc0e1873db2176e67475] Name: NewClient reconcile Virtualnetwork Duration: 5.04s
raiseNotification - [entityName:"Default Switch" operation:UPDATE ]
VirtualNetwork AddToMonitor Error[Network Name Default Switch not found]
raiseNotification - [entityName:"DESKTOP-LS0RRP4-EFLOWInterface" operation:UPDATE ]

[LOG: 11-23 08:29:0000049 TraceID: hcn.go:54 f43fb105fc5e454e632806685645f874] CreateVirtualNetworkInterface vnicName=DESKTOP-LS0RRP4-EFLOWInterface vnicId=5f76cf3c22e27d9 vmName=DESKTOP-LS0RRP4-EFLOW
[LOG: 11-23 08:29:0000049 TraceID: hcn.go:59 f43fb105fc5e454e632806685645f874] [NetworkInterface][Create] Unable to get the endpoint config for the specified network interface, error: Network Name Default Switch not found
[LOG: 11-23 08:29:0000049 TraceID: hcn.go:468 f43fb105fc5e454e632806685645f874] [NetworkInterface][Reconcile] Unable to create network interface with Id 5f76cf3c22e27d9, error: Network Name Default Switch not found
[LOG: 11-23 08:29:0000049 TraceID: client.go:548 f43fb105fc5e454e632806685645f874] Error Reconciling Virtual NetworkInterface [DESKTOP-LS0RRP4-EFLOWInterface]: Network Name Default Switch not found

Why is this happening? Is there a process in Windows during startup that corrupts HCS and Hyper-V networking?

There are plenty of similarities between our problem and the reportedly fixed issue #105; the OS is also the same.

@jagadishmurugan

@bhjertaas it looks to me as if, after a reboot, the WssdAgent service starts and tries to recreate the VM, but the VM creation fails because the agent cannot see "Default Switch". Can you confirm with Get-VmSwitch that the switch exists?
If "Default Switch" does exist, restarting the WssdAgent service should likely fix the issue. Can you try that and confirm?
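
The check and restart described above could be scripted roughly as follows. This is only a sketch, assuming an elevated PowerShell session on the EFLOW host with the Hyper-V PowerShell module installed. The switch name 'Default Switch' comes from the log extracts in this thread, and the agent may need some time after the restart before the VM shows up again.

```powershell
# Run from an elevated PowerShell session on the EFLOW host.
# Assumption: the Hyper-V PowerShell module (Get-VMSwitch) is available.

# 1. Check that the switch named in wssdagent.log is visible to Hyper-V.
$switch = Get-VMSwitch -Name 'Default Switch' -ErrorAction SilentlyContinue

if ($null -eq $switch) {
    Write-Warning "'Default Switch' was not found; restarting WSSDAgent alone is unlikely to help."
}
else {
    # 2. Restart the agent so that it tries to reconcile the VM and its network interface again.
    Restart-Service -Name WSSDAgent -Force

    # 3. Confirm the service is running again and check whether HCS now lists the VM.
    Get-Service -Name WSSDAgent
    hcsdiag list
}
```

If hcsdiag list still prints nothing after the restart, the VM was not recreated, and wssdagent.log is the place to look for a new "Default Switch not found" entry.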

Regards,

@bhjertaas
Author

Yes, the Default Switch does indeed exist.

Get-VmSwitch
Name           SwitchType NetAdapterInterfaceDescription
----           ---------- ------------------------------
Default Switch Internal

However, WssdAgent was already in the Running state. I restarted it anyway, but the EFLOW errors are still the same: I am unable to run Start-EflowVm, and hcsdiag list still returns an empty result.

@bhjertaas
Author

This is working better with recent releases from the Edge and EFLOW teams. Closing for now.

@Nilas1994

I fixed it by running Stop-EflowVM and then Start-EflowVM.
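
As a sketch, that sequence looks like the following, run from an elevated PowerShell session. The Get-EflowVmAddr call at the end is an addition here, used only to check whether the VM obtained an IP address afterwards. Note that earlier in this thread Start-EflowVm itself was reported as failing, so this may not help in every case.

```powershell
# Stop and then start the EFLOW VM (elevated PowerShell session).
Stop-EflowVm
Start-EflowVm

# Added check (not part of the original comment): confirm the VM has an IP address again.
Get-EflowVmAddr
```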
