Make tests resilient to Windows service manager errors #5608

Merged on Oct 1, 2021 (6 commits)

@damonbarry damonbarry commented Sep 30, 2021

In a failed end-to-end test run on Windows earlier today, I noticed that all test failures looked like this:

  Error Message:
   System.InvalidOperationException : Cannot stop iotedge service on computer '.'.
  ----> System.ComponentModel.Win32Exception : The service cannot accept control messages at this time.
  Stack Trace:
     at System.ServiceProcess.ServiceController.Stop()
   at Microsoft.Azure.Devices.Edge.Test.Common.Windows.EdgeDaemon.InternalStopAsync(CancellationToken token) in D:\a\_work\1\s\test\Microsoft.Azure.Devices.Edge.Test.Common\windows\EdgeDaemon.cs:line 107
   at Microsoft.Azure.Devices.Edge.Test.Common.Windows.EdgeDaemon.ConfigureAsync(Func`2 config, CancellationToken token, Boolean restart) in D:\a\_work\1\s\test\Microsoft.Azure.Devices.Edge.Test.Common\windows\EdgeDaemon.cs:line 65
   ...

I've seen this before. Sometimes the Windows service manager isn't ready to accept a stop request, but waiting a little while and trying again tends to resolve it (e.g., see this). We weren't retrying when we got this exception, so this change adds the retry logic.

The error is rare, so I was unable to confirm that my change works around it. However, I ran the Windows jobs in the pipeline 7 times (with several jobs running in parallel), and the stop logic still works, so at least I know the change doesn't make things worse. I added a verbose log so that we can gather more data if we see this error again.
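The retry-with-logging idea described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR (the real change lives in the C# test helper `EdgeDaemon.cs`). The names `ServiceNotReadyError` and `stop_with_retry`, the attempt count, and the delay are all assumptions made for the example.

```python
import logging
import time

log = logging.getLogger("edge_daemon_sketch")


class ServiceNotReadyError(Exception):
    """Hypothetical stand-in for 'The service cannot accept control messages at this time.'"""


def stop_with_retry(stop, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call stop(), retrying when the service manager is not ready.

    Returns the number of the attempt that succeeded. The attempt count and
    delay are illustrative defaults, not values taken from the PR.
    """
    for attempt in range(1, attempts + 1):
        try:
            stop()
            return attempt
        except ServiceNotReadyError as error:
            # Verbose log so that recurring failures leave a trail to analyze.
            log.debug("stop attempt %d/%d failed: %s", attempt, attempts, error)
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(delay_seconds)
```

Passing `sleep` as a parameter is only there so the sketch can be exercised without real delays.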

@damonbarry changed the title from "[DRAFT] Make end-to-end tests more resilient to service failures in Windows" to "Make tests resilient to Windows servicemgr errors" on Oct 1, 2021
@damonbarry changed the title from "Make tests resilient to Windows servicemgr errors" to "Make tests resilient to Windows service manager errors" on Oct 1, 2021
@damonbarry marked this pull request as ready for review on October 1, 2021 01:15
@kodiakhq bot merged commit e17bfa7 into Azure:release/1.1 on Oct 1, 2021
@damonbarry deleted the retry-service-stop branch on October 1, 2021 20:48
kodiakhq bot pushed a commit that referenced this pull request Oct 20, 2021
A recent PR (#5608) added retry logic to the end-to-end tests on Windows when they try to stop the IoT Edge service but the service manager isn't ready. This PR expands that one to include another case: when the tests try to stop the IoT Edge service but the service is already stopped.

```
  X QuickstartCerts [7s 488ms]
  Error Message:
   System.InvalidOperationException : Cannot stop iotedge service on computer '.'.
  ----> System.ComponentModel.Win32Exception : The service has not been started.
```

The code path that stops the service first checks its status and issues the stop command only if the service isn't already stopped. However, checking the status and then stopping the service is not an atomic operation, so there is a small window in which "stop" can be called on an already-stopped service. This change handles that window by checking the service status on every retry, not just on the first attempt.
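A rough sketch of the check-on-every-retry pattern is shown below, in Python for illustration only; the actual change is in the C# test code. The names `ServiceControlError` and `ensure_stopped`, the `"stopped"` status string, and the retry parameters are hypothetical.

```python
import time


class ServiceControlError(Exception):
    """Hypothetical stand-in for a failed stop request (e.g. 'The service has not been started.')."""


def ensure_stopped(get_status, stop, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Stop a service, re-reading its status before every attempt.

    Because the status check and the stop request are not atomic, the service
    may stop between the two. Re-checking at the top of each iteration turns
    that race into a normal exit instead of an error.
    """
    last_error = None
    for _ in range(attempts):
        if get_status() == "stopped":
            return  # already stopped, possibly since the previous attempt
        try:
            stop()
        except ServiceControlError as error:
            last_error = error  # remember it, then re-check status next time
        sleep(delay_seconds)
    if get_status() != "stopped":
        raise RuntimeError("service did not stop") from last_error
```

If the stop request fails because the service stopped in the meantime, the next iteration sees the `"stopped"` status and returns without issuing another request.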

I was unable to get the condition to repro again after several runs in the pipeline, but I at least confirmed that these changes don't disrupt the happy path.

## Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

### General Guidelines and Best Practices
- [x] I have read the [contribution guidelines](https://github.com/azure/iotedge#contributing).
- [x] Title of the pull request is clear and informative.
- [x] Description of the pull request includes a concise summary of the enhancement or bug fix.

### Testing Guidelines
- [x] Pull request includes test coverage for the included changes.
- Description of the pull request includes 
	- [x] concise summary of tests added/modified
	- [x] local testing done.  

### Draft PRs
- Open the PR in `Draft` mode if it is:
	- Work in progress or not intended to be merged.
	- Encountering multiple pipeline failures and working on fixes.

_Note: We use the kodiakhq bot to merge PRs once the necessary checks and approvals are in place. When it merges a PR, kodiakhq converts the PR title to the commit title, the PR description to the commit description, and squashes all the commits in the PR into a single commit. The net effect is that the entire PR becomes a single commit. Please follow the best practices mentioned [here](https://chris.beams.io/posts/git-commit/#:~:text=The%20seven%20rules%20of%20a%20great%20Git%20commit,what%20and%20why%20vs.%20how%20For%20example%3A%20) for the PR title and description._

3 participants