New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OZW goes into 100% CPU load if the Aeon USB Stick 2 is unplugged (only restarts helps) #111

Open
GoogleCodeExporter opened this Issue Mar 14, 2015 · 21 comments

Comments

Projects
None yet
7 participants
@GoogleCodeExporter

GoogleCodeExporter commented Mar 14, 2015

What steps will reproduce the problem?
1. Start OZW e.g. with MinOZW and get the devices discovered
2. Unplug the USB Stick
3. Check with "top" and the MinOZW is consuming 100% CPU load

What is the expected output? What do you see instead?
Not 100% CPU load

What version of the product are you using? On what operating system?
Unbuntu 10.04
Open Z-Wave r556

Please provide any additional information below.


Original issue reported on code.google.com by uAle...@gmail.com on 3 Nov 2012 at 11:37

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Peter made a Linux only patch that I modularized so it could be stubbed out for 
the other hardware platforms (Mac, Windows). I never committed it because not 
many people have expressed a need to pull their controller from the USB port 
and it wasn't a very clean patch. Seems like a good time to revisit this issue.

Original comment by glsatz on 3 Nov 2012 at 3:35

GoogleCodeExporter commented Mar 14, 2015

Peter made a Linux only patch that I modularized so it could be stubbed out for 
the other hardware platforms (Mac, Windows). I never committed it because not 
many people have expressed a need to pull their controller from the USB port 
and it wasn't a very clean patch. Seems like a good time to revisit this issue.

Original comment by glsatz on 3 Nov 2012 at 3:35

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Ok it would be nice to include that patch in a revision of OZW, maybe we can 
also have a look to serial port handling if we unplug and replug it again? I 
like to call these enhancements more "stability" fixes (now OZW is getting more 
mature - then we need such code more i think).

Original comment by uAle...@gmail.com on 3 Nov 2012 at 4:05

GoogleCodeExporter commented Mar 14, 2015

Ok it would be nice to include that patch in a revision of OZW, maybe we can 
also have a look to serial port handling if we unplug and replug it again? I 
like to call these enhancements more "stability" fixes (now OZW is getting more 
mature - then we need such code more i think).

Original comment by uAle...@gmail.com on 3 Nov 2012 at 4:05

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Ok. I will need someone to verify a windows (stub) patch. 

Original comment by glsatz on 3 Nov 2012 at 4:57

GoogleCodeExporter commented Mar 14, 2015

Ok. I will need someone to verify a windows (stub) patch. 

Original comment by glsatz on 3 Nov 2012 at 4:57

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Peter's original email about his patch: 
https://groups.google.com/forum/#!topic/openzwave/2VdGhj77i8Q/discussion

Note this adds a new dependency to the Linux build, libudev.

Original comment by glsatz on 4 Nov 2012 at 7:14

GoogleCodeExporter commented Mar 14, 2015

Peter's original email about his patch: 
https://groups.google.com/forum/#!topic/openzwave/2VdGhj77i8Q/discussion

Note this adds a new dependency to the Linux build, libudev.

Original comment by glsatz on 4 Nov 2012 at 7:14

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Adding the libudev dependency i don't see as a (major) issue, because we get 
more stability back for it.

Other question, can we add 2 types of notification to allow the main program to 
known the serial port is unavailable (when e.g. disconnect) and available 
again? Currently i don't have any good way to allow my main program to know the 
Open Z-Wave library is working fine or not.

Original comment by uAle...@gmail.com on 4 Nov 2012 at 8:35

GoogleCodeExporter commented Mar 14, 2015

Adding the libudev dependency i don't see as a (major) issue, because we get 
more stability back for it.

Other question, can we add 2 types of notification to allow the main program to 
known the serial port is unavailable (when e.g. disconnect) and available 
again? Currently i don't have any good way to allow my main program to know the 
Open Z-Wave library is working fine or not.

Original comment by uAle...@gmail.com on 4 Nov 2012 at 8:35

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Here is the patch Peter did updated so the platform specific code lives in 
different files instead of using #ifdefs. Not sure this patch is very useful as 
it only checks for the presence of the USB controller at open time. Pulling the 
USB controller still needs to be detected elsewhere. I am not planning on 
committing this patch until we have some better understanding on where and how 
USB controller presence should be managed. Peter reported that he wound up 
moving his detection code into his user program and use the manager object 
calls to open and close the driver. More discussion is needed.

Original comment by glsatz on 5 Nov 2012 at 5:34

  • Changed state: Accepted

Attachments:

GoogleCodeExporter commented Mar 14, 2015

Here is the patch Peter did updated so the platform specific code lives in 
different files instead of using #ifdefs. Not sure this patch is very useful as 
it only checks for the presence of the USB controller at open time. Pulling the 
USB controller still needs to be detected elsewhere. I am not planning on 
committing this patch until we have some better understanding on where and how 
USB controller presence should be managed. Peter reported that he wound up 
moving his detection code into his user program and use the manager object 
calls to open and close the driver. More discussion is needed.

Original comment by glsatz on 5 Nov 2012 at 5:34

  • Changed state: Accepted

Attachments:

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Commets from Peter on how he did this under Linux:


Both the original and the current code I have are essentially stock example udev
code.  See the monitor stuff:

http://www.signal11.us/oss/udev/

There are a few changes because it's a serial port, but those are already
in the code you have.

In my main loop, I have a select loop anyway, so this is just another
file descriptor to monitor.  It's very easy.

Original comment by glsatz on 13 Nov 2012 at 8:26

GoogleCodeExporter commented Mar 14, 2015

Commets from Peter on how he did this under Linux:


Both the original and the current code I have are essentially stock example udev
code.  See the monitor stuff:

http://www.signal11.us/oss/udev/

There are a few changes because it's a serial port, but those are already
in the code you have.

In my main loop, I have a select loop anyway, so this is just another
file descriptor to monitor.  It's very easy.

Original comment by glsatz on 13 Nov 2012 at 8:26

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Going down the USB route is overkill if you ask me. 

The 100% CPU spin was caused in the Read Member function of the SerialImpl 
class. (not tried/looked at the HID or Windows version). Its root cause was 
that we were not testing for the read system call returning 0 (by the look of 
the code, the original author assumed read returning 0 indicated no data. In 
fact as we are in blocking mode, read returning 0 indicates EOF)

The second problem is that there is no error path upto the Driver (or 
eventually) the Manager class to indicate errors. Thus the driver would happy 
keep trying to read/write etc and the manager (and thus the application) had no 
idea there was a problem)

Attached is a WIP patch, that adds some error handling to at least get the 
Error upto the driver class. I'm still trying to propogate that error to the 
Manager class so we can unload the driver and at least have the driver die 
gracefully, but there is still a loop somewhere in the WriteMsg functions etc. 

I'll revist this patch after the Security Class has settled down a bit. 

Original comment by jus...@dynam.ac on 12 Jun 2014 at 11:53

Attachments:

GoogleCodeExporter commented Mar 14, 2015

Going down the USB route is overkill if you ask me. 

The 100% CPU spin was caused in the Read Member function of the SerialImpl 
class. (not tried/looked at the HID or Windows version). Its root cause was 
that we were not testing for the read system call returning 0 (by the look of 
the code, the original author assumed read returning 0 indicated no data. In 
fact as we are in blocking mode, read returning 0 indicates EOF)

The second problem is that there is no error path upto the Driver (or 
eventually) the Manager class to indicate errors. Thus the driver would happy 
keep trying to read/write etc and the manager (and thus the application) had no 
idea there was a problem)

Attached is a WIP patch, that adds some error handling to at least get the 
Error upto the driver class. I'm still trying to propogate that error to the 
Manager class so we can unload the driver and at least have the driver die 
gracefully, but there is still a loop somewhere in the WriteMsg functions etc. 

I'll revist this patch after the Security Class has settled down a bit. 

Original comment by jus...@dynam.ac on 12 Jun 2014 at 11:53

Attachments:

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Thanks, i will apply the patch on my system and give it a try. Yes, telling via 
a notification, that the device is gone, is a valuable addition to open-zwave 
(but the Security Class first :-)).

Original comment by uAle...@gmail.com on 12 Jun 2014 at 11:59

GoogleCodeExporter commented Mar 14, 2015

Thanks, i will apply the patch on my system and give it a try. Yes, telling via 
a notification, that the device is gone, is a valuable addition to open-zwave 
(but the Security Class first :-)).

Original comment by uAle...@gmail.com on 12 Jun 2014 at 11:59

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Hi, as mentioned, the patch isn't complete. It's just moved the bug to 
somewhere around WriteMsg in the driver class. Feel free to improve it though!

Original comment by jus...@dynam.ac on 12 Jun 2014 at 4:17

GoogleCodeExporter commented Mar 14, 2015

Hi, as mentioned, the patch isn't complete. It's just moved the bug to 
somewhere around WriteMsg in the driver class. Feel free to improve it though!

Original comment by jus...@dynam.ac on 12 Jun 2014 at 4:17

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

I tested the patch and it doesn't generate a 100% cpu anymore on the 
disconnect, but the driver is now looping in itself as you mentioned (and then 
again 100% cpu again) ;-)

Original comment by uAle...@gmail.com on 21 Jun 2014 at 11:23

GoogleCodeExporter commented Mar 14, 2015

I tested the patch and it doesn't generate a 100% cpu anymore on the 
disconnect, but the driver is now looping in itself as you mentioned (and then 
again 100% cpu again) ;-)

Original comment by uAle...@gmail.com on 21 Jun 2014 at 11:23

@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Original comment by jus...@dynam.ac on 15 Oct 2014 at 2:56

  • Now blocking: #189

GoogleCodeExporter commented Mar 14, 2015

Original comment by jus...@dynam.ac on 15 Oct 2014 at 2:56

  • Now blocking: #189
@GoogleCodeExporter

This comment has been minimized.

Show comment
Hide comment
@GoogleCodeExporter

GoogleCodeExporter Mar 14, 2015

Original comment by jus...@dynam.ac on 15 Oct 2014 at 2:58

  • Added labels: Milestone-1.4

GoogleCodeExporter commented Mar 14, 2015

Original comment by jus...@dynam.ac on 15 Oct 2014 at 2:58

  • Added labels: Milestone-1.4
@dsoulayrol

This comment has been minimized.

Show comment
Hide comment
@dsoulayrol

dsoulayrol May 9, 2016

Hi. It seems this issue was abandoned half-fixed. As reported earlier, there are two problems. First, a 100% Cpu spin in SerialImpl, and then the absence of an event towards the user of the Manager object.

I need to address both those difficulties, so I'd like to know if anyone has made progress upon the proposed patches. Thanks.

dsoulayrol commented May 9, 2016

Hi. It seems this issue was abandoned half-fixed. As reported earlier, there are two problems. First, a 100% Cpu spin in SerialImpl, and then the absence of an event towards the user of the Manager object.

I need to address both those difficulties, so I'd like to know if anyone has made progress upon the proposed patches. Thanks.

@dsoulayrol

This comment has been minimized.

Show comment
Hide comment
@dsoulayrol

dsoulayrol May 23, 2016

Hello.

Here is a refreshed version of the patch proposed on 2015 March, the 14th. It still cannot be considered complete because I do not use the recover capabilities of the Driver and the Controller for now. When the stream is broken, the Driver emits a DriverFailure notification, and my program stops upon reception of this event.

However, I think there is not much to do so as to get the Driver up again. I experience no more CPU spins, but I had some problems on reconnection and I suspect some purge is missing before commands can be transmitted correctly again.

dsoulayrol commented May 23, 2016

Hello.

Here is a refreshed version of the patch proposed on 2015 March, the 14th. It still cannot be considered complete because I do not use the recover capabilities of the Driver and the Controller for now. When the stream is broken, the Driver emits a DriverFailure notification, and my program stops upon reception of this event.

However, I think there is not much to do so as to get the Driver up again. I experience no more CPU spins, but I had some problems on reconnection and I suspect some purge is missing before commands can be transmitted correctly again.

@Fishwaldo

This comment has been minimized.

Show comment
Hide comment
@Fishwaldo

Fishwaldo Jun 1, 2016

Member

Thanks. I'll look over it soon.

I think trying to recover from a "failed" serial port is too hard. I'll probably just try to tweek it a bit, so we also unload the Driver as well...

Member

Fishwaldo commented Jun 1, 2016

Thanks. I'll look over it soon.

I think trying to recover from a "failed" serial port is too hard. I'll probably just try to tweek it a bit, so we also unload the Driver as well...

@julienw

This comment has been minimized.

Show comment
Hide comment
@julienw

julienw Jun 6, 2016

Contributor

I'm very interested by this issue, please tell me if I can help at something.

Contributor

julienw commented Jun 6, 2016

I'm very interested by this issue, please tell me if I can help at something.

@julienw

This comment has been minimized.

Show comment
Hide comment
@julienw

julienw Jun 6, 2016

Contributor

note: the link to the issue in https://github.com/OpenZWave/open-zwave/wiki/FAQ points incorrectly to google code.

Contributor

julienw commented Jun 6, 2016

note: the link to the issue in https://github.com/OpenZWave/open-zwave/wiki/FAQ points incorrectly to google code.

@dhylands

This comment has been minimized.

Show comment
Hide comment
@dhylands

dhylands Jun 6, 2016

The real secret is that the serial port must be closed once it has "gone away". I've been using udev in other programs to detect serial ports being added and removed, and then just close the handle when the close is detected. This will typically cause any reads/writes which happen after the close to fail, and if timeouts are used, then in-progress reads/writes will timeout.

Under linux, if the serial port is still open when the dongle is plugged back in, then it will skip a number (i.e. go from /dev/ttyUSB0 to /dev/ttyUSB1).

dhylands commented Jun 6, 2016

The real secret is that the serial port must be closed once it has "gone away". I've been using udev in other programs to detect serial ports being added and removed, and then just close the handle when the close is detected. This will typically cause any reads/writes which happen after the close to fail, and if timeouts are used, then in-progress reads/writes will timeout.

Under linux, if the serial port is still open when the dongle is plugged back in, then it will skip a number (i.e. go from /dev/ttyUSB0 to /dev/ttyUSB1).

@mattw-db

This comment has been minimized.

Show comment
Hide comment
@mattw-db

mattw-db Jun 6, 2017

Does anyone have a full or work-in-progress solution to this, beyond the Linux-only partial discussed above?

The 100% CPU usage on controller removal appears to exist still on Windows also. I was able work around that symptom in a similar manner as what was done in the proposed patch for Linux, but gracefully and reliably removing the driver (so that it can later be re-added by the application) seems like a more substantial task. To be able to safely call NotifyWatchers and stop the driver thread on the way down, specifically, would appear to need some restructure to avoid deadlocks.

I'm new to the code, so not keen on hacking things up too much, but if someone's got a starting point or a clever idea on how it should be done, I'd be interested in playing with it since this is actually a fairly important use-case for me.

Thanks!

mattw-db commented Jun 6, 2017

Does anyone have a full or work-in-progress solution to this, beyond the Linux-only partial discussed above?

The 100% CPU usage on controller removal appears to exist still on Windows also. I was able work around that symptom in a similar manner as what was done in the proposed patch for Linux, but gracefully and reliably removing the driver (so that it can later be re-added by the application) seems like a more substantial task. To be able to safely call NotifyWatchers and stop the driver thread on the way down, specifically, would appear to need some restructure to avoid deadlocks.

I'm new to the code, so not keen on hacking things up too much, but if someone's got a starting point or a clever idea on how it should be done, I'd be interested in playing with it since this is actually a fairly important use-case for me.

Thanks!

@DvdGiessen

This comment has been minimized.

Show comment
Hide comment
@DvdGiessen

DvdGiessen Jun 19, 2018

Contributor

Just spend some time testing the patch from #111 (comment) and that does properly notify the application of the failure instead of just spinning to 100% and not doing much else.

@Fishwaldo Is there any reason not to merge this patch already? Sure, we want to implement a better solution for the problem, but that's still something we can do in the future. By merging this patch applications can at least be made aware of the issue without any external hacks to monitor the underlying devices; I don't think anybody is happier with 100% CPU, so merging this patch will improve the situation for now until we can implement an even better solution.

Contributor

DvdGiessen commented Jun 19, 2018

Just spend some time testing the patch from #111 (comment) and that does properly notify the application of the failure instead of just spinning to 100% and not doing much else.

@Fishwaldo Is there any reason not to merge this patch already? Sure, we want to implement a better solution for the problem, but that's still something we can do in the future. By merging this patch applications can at least be made aware of the issue without any external hacks to monitor the underlying devices; I don't think anybody is happier with 100% CPU, so merging this patch will improve the situation for now until we can implement an even better solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment