-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write failure after a number of commands #1188
Comments
I added an explicit counter to see exactly how many commands were being sent+drained when this happened. Ran 5 tests. The first two had exactly the same circumstances, and died in the exactly the same way. The fourth I reset the bot position full manual, and ended up with more commands before hitting the bug. The last two were completely different art at a much lower, then much higher speed. Test results:
I take from this:
|
Surrounding write callback next command execute in |
Well, this is stupid 🗡:rage4: port.write(command + "\r", 'ascii', function() {
port.drain(function() {
port.flush(callback);
});
}); Write, then drain, then flush. One write timeout (failed send with no callback triggered, ~ cmd 3244), but flushing after every single command seems to make it work. What the hell does flush do anyways?! Or rather, WHEN should it actually be used? I suspect there's a subtle and low-level bug here, going to attempt flush less often and see if that helps. |
Flushing every 500 commands is not good enough, and I suspect there's a segfault or something else very bad happening when on a write timeout you flush and try to write the same command again. |
Flushing on every command for v5 Beta 3 appears to be required in my use case. Will follow-through with testing on other systems, though I may feel a little safer putting a little fence around it and making it windows specific. |
Running in this way (write->drain->flush->callback) on Mac seems to cause the issue after ~9k commands. Very odd. |
After 18k commands on Mac with no flush, very similar failure, but with write callback working just fine. Very strange. |
These are great experiments @techninja, thanks for sharing. I've been struggling to find a stable combination that works reliably on both OSX and windows via electron. With small bursts of data things work reliably, but the second I start moving large chunks of binary over the wire I also see missed write callbacks and other serial port errors in windows. I hope to have some time soon to take a closer look at the native bindings. Have you done any tests with the SerialPort v4 release? |
I have not tested 4 much with this, as I was upgrading to 5 from What I'd LIKE to do, is replace the native binding layer with something else, and retain the API. This is technically possible, but.. far easier said than done. I've mused over making a |
I've pretty much come to the same realization. I'm currently experimenting with my own solution by writing custom bindings for libserialport (http://sigrok.org/api/libserialport/unstable/index.html) Arduino uses this library so I imagine it's pretty stable. The interface probably won't match node-serialport exactly, but it should come pretty close. |
I did some more tests on OSX, and the results weren't good. The entire electron process that serialport runs in dies when whatever is happening here happens, after some 5-15k commands are sent. A terrible thought: I could rewrite my command streamer to watch the process for whenever it dies and restart it... though it does hold on to volatile data, I'd have to do a bunch of rewrites but it could "solve" around the problem 😖 I was supposed to make a release months ago! This is awful. |
OSX testing continues outside of Electron: With the Windows required flush, running via node (my app runs in Electron or node), the process streaming the serialport commands exits after ~1-2k commands with the following:
Without the flush on every command drain.... it seems to simply exit after ~1-2k commands with an error code of... get this.... |
On average how many bytes are each one of your commands? Also, how many different machines have you tried? My electron app works wonderfully on my personal MBP running both OSX and windows 7/10 in a virtual machine. I keep getting user bug reports though about failed data transfers on other people's machines so I know this code isn't stable. That's the worst part though, I can't seem to recreate the problems reliably on any of my machines so I'm forced to find an alternative solution that has already been battle tested across platforms... |
Average bytes is.. likely around 10-12bytes sent as ascii strings terminated by readlines. I do not have another Mac to test on, and have only tested on one Windows machine. With some effort I could likely get another to test on, but at this rate the bug is basically guaranteed in my app on mac. Used to be Windows was the odd man out, but with the "flush fix" Windows doesn't have the problem at all, and is easily the most stable! |
I just released 5.0.0-beta4 which includes a windows fix. We rewrote reading on windows so I'm glad it's working out. The bindings layer in 5x has a pretty good api and if you wanted to write a different binding (as opposed to hacking on the existing c++ ones) I'd be all for it. My goal is that we can move them into their own projects to support the wide range of platforms and approaches. (that continue to fragment Eg, there are now 2 different ways to do serial port access on windows) |
Thanks @reconbot , glad to see that fix landed. I know this must be a hell of a project to maintain. Has any effort been made in the past to use libserialport as the underlying driver? It provides a consistent API at the native level so it seems like it would be a much easier base to build bindings on for a unified node API. I'm about half way through my own initial implementation, but my c++ skills are lacking and I'm largely unfamiliar with v8 and nan. If there is a good reason libserialport hasn't been used for this purpose then I should probably stop wasting my time... |
If there was one area of the shared c++ bindings that I think might be the cause of our issues it would be the c++ write queuing system. It was needed before the streams rewrite but it is no longer as the JS will ensure only one write is happening at a time. |
@jayalfredprufrock there isn't a good reason. Possibly the requirement of having a 3rd party library? Since we ship precompiled binaries - it would be possible to work around that hurdle for most people. I'd love to both get feedback on the binding layer, and have some alternative bindings available. I fully support your experiment. |
I pulled out the write queue stuff, so writes should be more strait forward now with less code. 5.0.0-beta5 |
Ok, I wrote this as a pared down version of how I manage batched writes. I've had this die twice now on Mac with latest Beta5, first was a segfault, exit code 11, second said
This may or may not be specific to writing to the EBB (might be related to its data writes back to the machine?), otherwise pretty much every command given prompts the serial device to return |
This looks related to reads and being notified of the reads on nix. I assume if you pipe an |
Just ran my test code (no edits) on node v6.2.1 Ubuntu x64, died early with EDIT: With the flush in place on linux... still dies. 😕 |
Should be in the cwd or /
…---
Francis Gulotta
wizard@roborooter.com
On Wed, Jun 21, 2017 at 1:19 PM, James T ***@***.***> wrote:
Just ran my test code (no edits) on node v6.2.1 Ubuntu x64, died early
with Segmentation Fault (core dumped)... have been utterly unable to
*find* said dump of course. Tried again on node v8.1.2 latest, same deal.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1188 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABlbphIDKHTNtjazp9OkaO9i9Yz1ImSks5sGVCGgaJpZM4N2L4a>
.
|
Dump file was not in the working dir, the executable dir, OR in the root (running as standard user, not |
I am able to reproduce this. I wrote a test script from your test script that uses the arduino echo program at 9600 baud. It should take about 76 seconds to send the 89k at 9600 bps (I might be wrong here) Node 6, on my macbook pro, with an arduino uno. Test 1
Results
[2] 511 abort node test/manual-testing/many-writes.js
node test/manual-testing/many-writes.js 21.91s user 80.01s system 103% cpu 1:38.00 total
// and a different time
Assertion failed: (!(handle->flags & (UV_CLOSING | UV_CLOSED))), function uv_poll_start, file ../deps/uv/src/unix/poll.c, line 98.
[2] 6433 abort node test/manual-testing/many-writes.js Test 2
Results
I can't get test 2 to reliably error but the message gives me a little to go on... and of course some other edge cases to handle. Test 3
Results
What did I learn? Not a ton but there are some leads around closing the port and the read poller. Also that cpu usage... |
Wahoo! Awesome work @reconbot!!! It fills me with joy when I can actually be helpful in finding code that baby steps closer to fixing a scary bug in the wild. I know exactly how it feels to hear something you're responsible for isn't working as expected, but having no way to reproduce is the hardest thing to try and deal with. My next hope of course is your background knowledge and debugging skills can be leveraged to find what deep pocket of weirdness lurk within these results (and the code) to find something soon. I'm currently being forced to sell my house over the course of the next 4 weeks, but I will keep up to date multiple times per day to test anything you've got 😸. Good luck sir!! |
I have an alpha ready that is in much better shape. The branch is called write-tests and currently only works for OSX. It probably also has a small memory with closing ports. (TODO items) It's able to run all 3 tests with 0-2% cpu, and close the port without crashing. @indutny helped me write a better poller which was the answer we needed. |
Alright! Testing : OSX - Electron v1.6.10
Battery died before I could fully test node directly, will try again and report back soon! |
Please try In addition
|
@reconbot Testing this now, but I ave to say with physical disconnect detection working great(!!), and my initial tests outside of electron going incredibly well, I say this is totally release worthy. I've yet to get it to fail, and will continue testing but I suspect with the incredible work done overhauling various systems, many things are in far better shape. Thank you so much for your hard work!!! |
SerialPort Version: v5 beta3 tag
NodeJS Version: node 6, electron 1.6.10
Operating System and Hardware Platform: Win 10, x64 (possibly other win)
Have you checked the right version of the api docs?: 👍
Are you using Electron and have you checked the Electron Docs?: Yup. Fails inside and outside electron
Summary of Problem
While doing extensive testing for the latest version of my Electron based software for streaming serial data to drawing robots, I find that on Windows only, my data write callback loop to move through my array buffered command set will suddenly... stop cold. Somewhere around 3k-5k commands in. Debugging the process in Electron (running inside its own dedicated webview process) shows that the last serial write is triggered, but the write callback is not. Implementing a timeout to catch this does work, but re-initiating when this happens.. does not. Running flush seems to get the last command to write (the robot moves), but running writes after this does not work.
Steps and Code to Reproduce the Issue
I do not have a good test case apart from my description above. If prompted I could likely try to make a short form test for reproduction, but without understanding the problem more, the best I can say is that it appears to very low level, not a problem with the destination hardware, and probably related to the number of commands sent. Maybe call stack limit? Who knows.
The text was updated successfully, but these errors were encountered: