Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write failure after a number of commands #1188

Closed
techninja opened this issue Jun 10, 2017 · 31 comments
Closed

Write failure after a number of commands #1188

techninja opened this issue Jun 10, 2017 · 31 comments

Comments

@techninja
Copy link

  • SerialPort Version: v5 beta3 tag

  • NodeJS Version: node 6, electron 1.6.10

  • Operating System and Hardware Platform: Win 10, x64 (possibly other win)

  • Have you checked the right version of the api docs?: 👍

  • Are you using Electron and have you checked the Electron Docs?: Yup. Fails inside and outside electron

Summary of Problem

While doing extensive testing for the latest version of my Electron based software for streaming serial data to drawing robots, I find that on Windows only, my data write callback loop to move through my array buffered command set will suddenly... stop cold. Somewhere around 3k-5k commands in. Debugging the process in Electron (running inside its own dedicated webview process) shows that the last serial write is triggered, but the write callback is not. Implementing a timeout to catch this does work, but re-initiating when this happens.. does not. Running flush seems to get the last command to write (the robot moves), but running writes after this does not work.

Steps and Code to Reproduce the Issue

I do not have a good test case apart from my description above. If prompted I could likely try to make a short form test for reproduction, but without understanding the problem more, the best I can say is that it appears to very low level, not a problem with the destination hardware, and probably related to the number of commands sent. Maybe call stack limit? Who knows.

@techninja
Copy link
Author

I added an explicit counter to see exactly how many commands were being sent+drained when this happened. Ran 5 tests. The first two had exactly the same circumstances, and died in the exactly the same way. The fourth I reset the bot position full manual, and ended up with more commands before hitting the bug. The last two were completely different art at a much lower, then much higher speed.

Test results:

  1. Firefox @ 35% speed, Restart (15), then reset position (19), dies at (3067)
  2. Firefox @ 35% speed, Restart (15), then reset position (19), dies at (3067)
  3. Firefox @ 35% speed, Restart (15), dies at 3074
  4. Bird at 15% speed, Restart (15, 19), dies at (3011)
  5. Bird at 75% speed Restart (15), dies at (3018)

I take from this:

  • Drain/send speed doesn't seem to effect it (each movement command has a timer on it that restricts drain event until after execution on the firmware, so for faster speeds, movements avg ~1-10ms instead of 50-100ms)
  • The number of commands is definitely a factor, and a lot less than I thought
  • The loop that calls this may be a factor, so will retest with a rethinking of how this works

@techninja
Copy link
Author

Surrounding write callback next command execute in process.nextTick() or setTimeout({callback}, 0) (to theoretically break out of command call stack limitations) results in _exactly the same call count both times at 3018. No improvement. 😭 😡

@techninja
Copy link
Author

Well, this is stupid 🗡:rage4:

port.write(command + "\r", 'ascii', function() {
  port.drain(function() {
    port.flush(callback);
  });
});

Write, then drain, then flush. One write timeout (failed send with no callback triggered, ~ cmd 3244), but flushing after every single command seems to make it work. What the hell does flush do anyways?! Or rather, WHEN should it actually be used?

I suspect there's a subtle and low-level bug here, going to attempt flush less often and see if that helps.

@techninja
Copy link
Author

Flushing every 500 commands is not good enough, and I suspect there's a segfault or something else very bad happening when on a write timeout you flush and try to write the same command again.

@techninja
Copy link
Author

Flushing on every command for v5 Beta 3 appears to be required in my use case. Will follow-through with testing on other systems, though I may feel a little safer putting a little fence around it and making it windows specific.

@techninja techninja changed the title Windows 10 x64 v5 beta 3: Write callback never fires Windows 10 x64 v5 beta 3: Write callback never fires after a number of commands Jun 12, 2017
@techninja
Copy link
Author

techninja commented Jun 12, 2017

Running in this way (write->drain->flush->callback) on Mac seems to cause the issue after ~9k commands. Very odd.

@techninja
Copy link
Author

After 18k commands on Mac with no flush, very similar failure, but with write callback working just fine. Very strange.

@techninja techninja changed the title Windows 10 x64 v5 beta 3: Write callback never fires after a number of commands Write failure after a number of commands Jun 12, 2017
@jayalfredprufrock
Copy link

These are great experiments @techninja, thanks for sharing. I've been struggling to find a stable combination that works reliably on both OSX and windows via electron. With small bursts of data things work reliably, but the second I start moving large chunks of binary over the wire I also see missed write callbacks and other serial port errors in windows. I hope to have some time soon to take a closer look at the native bindings. Have you done any tests with the SerialPort v4 release?

@techninja
Copy link
Author

I have not tested 4 much with this, as I was upgrading to 5 from 4.0.1 specifically because of the "dropped data bug" in Windows which was more evil in that commands were still being sent, but for a robot that only send relative commands, a missed movement means a potentially crashing offset from where it should be. I have more tests to run on mac and hopefully can come up with a solution as I don't believe we're going to get much help from the maintainers as they're all very busy on other things.

What I'd LIKE to do, is replace the native binding layer with something else, and retain the API. This is technically possible, but.. far easier said than done. I've mused over making a pyserial wrapper as the project itself is incredibly stable and has relatively huge support/backing, but I'd hate to get started if a solution can be found for the native "supported" binding.

@jayalfredprufrock
Copy link

I've pretty much come to the same realization. I'm currently experimenting with my own solution by writing custom bindings for libserialport (http://sigrok.org/api/libserialport/unstable/index.html) Arduino uses this library so I imagine it's pretty stable. The interface probably won't match node-serialport exactly, but it should come pretty close.

@techninja
Copy link
Author

I did some more tests on OSX, and the results weren't good. The entire electron process that serialport runs in dies when whatever is happening here happens, after some 5-15k commands are sent.

A terrible thought: I could rewrite my command streamer to watch the process for whenever it dies and restart it... though it does hold on to volatile data, I'd have to do a bunch of rewrites but it could "solve" around the problem 😖

I was supposed to make a release months ago! This is awful.

@techninja
Copy link
Author

OSX testing continues outside of Electron: With the Windows required flush, running via node (my app runs in Electron or node), the process streaming the serialport commands exits after ~1-2k commands with the following:

node(2441,0x7fff774a7000) malloc: *** error for object 0x102005e38: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug

Without the flush on every command drain.... it seems to simply exit after ~1-2k commands with an error code of... get this.... null! 😡

@jayalfredprufrock
Copy link

On average how many bytes are each one of your commands?

Also, how many different machines have you tried? My electron app works wonderfully on my personal MBP running both OSX and windows 7/10 in a virtual machine. I keep getting user bug reports though about failed data transfers on other people's machines so I know this code isn't stable. That's the worst part though, I can't seem to recreate the problems reliably on any of my machines so I'm forced to find an alternative solution that has already been battle tested across platforms...

@techninja
Copy link
Author

Average bytes is.. likely around 10-12bytes sent as ascii strings terminated by readlines.

I do not have another Mac to test on, and have only tested on one Windows machine. With some effort I could likely get another to test on, but at this rate the bug is basically guaranteed in my app on mac. Used to be Windows was the odd man out, but with the "flush fix" Windows doesn't have the problem at all, and is easily the most stable!

@reconbot
Copy link
Member

I just released 5.0.0-beta4 which includes a windows fix. We rewrote reading on windows so I'm glad it's working out. The bindings layer in 5x has a pretty good api and if you wanted to write a different binding (as opposed to hacking on the existing c++ ones) I'd be all for it.

My goal is that we can move them into their own projects to support the wide range of platforms and approaches. (that continue to fragment Eg, there are now 2 different ways to do serial port access on windows)

@jayalfredprufrock
Copy link

Thanks @reconbot , glad to see that fix landed. I know this must be a hell of a project to maintain. Has any effort been made in the past to use libserialport as the underlying driver? It provides a consistent API at the native level so it seems like it would be a much easier base to build bindings on for a unified node API. I'm about half way through my own initial implementation, but my c++ skills are lacking and I'm largely unfamiliar with v8 and nan. If there is a good reason libserialport hasn't been used for this purpose then I should probably stop wasting my time...

@reconbot
Copy link
Member

If there was one area of the shared c++ bindings that I think might be the cause of our issues it would be the c++ write queuing system. It was needed before the streams rewrite but it is no longer as the JS will ensure only one write is happening at a time.

@matmis
Copy link

matmis commented Jun 17, 2017

@reconbot
Copy link
Member

@jayalfredprufrock there isn't a good reason. Possibly the requirement of having a 3rd party library? Since we ship precompiled binaries - it would be possible to work around that hurdle for most people.

I'd love to both get feedback on the binding layer, and have some alternative bindings available. I fully support your experiment.

@reconbot
Copy link
Member

reconbot commented Jun 20, 2017

I pulled out the write queue stuff, so writes should be more strait forward now with less code. 5.0.0-beta5

@techninja
Copy link
Author

Ok, I wrote this as a pared down version of how I manage batched writes. I've had this die twice now on Mac with latest Beta5, first was a segfault, exit code 11, second said

Assertion failed: (loop->watchers[w->fd] == w), function uv__io_stop, file ../deps/uv/src/unix/core.c, line 888.
Abort trap: 6

This may or may not be specific to writing to the EBB (might be related to its data writes back to the machine?), otherwise pretty much every command given prompts the serial device to return OK. run like this: node deathtest.js /path/to/port.

deathtest.js.txt

baddata.txt

@reconbot
Copy link
Member

This looks related to reads and being notified of the reads on nix. I assume if you pipe an fs.readStream to the port you get a similar behavior? Haven't been able to test yet.

@techninja
Copy link
Author

techninja commented Jun 21, 2017

Just ran my test code (no edits) on node v6.2.1 Ubuntu x64, died early with Segmentation Fault (core dumped)... have been utterly unable to find said dump of course. Tried again on node v8.1.2 latest, same deal.

EDIT: With the flush in place on linux... still dies. 😕

@reconbot
Copy link
Member

reconbot commented Jun 21, 2017 via email

@techninja
Copy link
Author

Dump file was not in the working dir, the executable dir, OR in the root (running as standard user, not root, I wouldn't have been able to write there). My research shows these crash dumps are thrown to apport, which then figures out if they should be sent on to some home base, and if it can't figure that out it puts them.... somewhere else. I'll keep looking and will attempt to retrieve it, but won't be able to do so till tomorrow.

@reconbot
Copy link
Member

I am able to reproduce this. I wrote a test script from your test script that uses the arduino echo program at 9600 baud. It should take about 76 seconds to send the 89k at 9600 bps (I might be wrong here)

Node 6, on my macbook pro, with an arduino uno.

Test 1 writeAllCommands

  • Open port, wait 3 seconds for Arduino
  • Log on each received command from the Arduino
  • Write each command with it's own call to write in a sync loop, log when each write finishes
  • close the port when the last command is received

Results

  • No errors during writes
  • write callbacks all happened at once!? Maybe writev combined the whole file into one write
  • 100% cpu!?
  • 98 seconds to write the 89k
  • Crashed when the port closed
[2]    511 abort      node test/manual-testing/many-writes.js
node test/manual-testing/many-writes.js  21.91s user 80.01s system 103% cpu 1:38.00 total

// and a different time
Assertion failed: (!(handle->flags & (UV_CLOSING | UV_CLOSED))), function uv_poll_start, file ../deps/uv/src/unix/poll.c, line 98.
[2]    6433 abort      node test/manual-testing/many-writes.js

Test 2 writeOneCommandAtATime

  • Open port, wait 3 seconds for arduino
  • Log on each received command from the arduino
  • Write each command and wait until the callback before writing the next command
  • close the port when the last command has finished writing

Results

  • Got it to crash once with the same error you've been seeing
Assertion failed: (loop->watchers[w->fd] == w), function uv__io_stop, file ../deps/uv/src/unix/core.c, line 888.
[1]    99358 abort      node test/manual-testing/many-writes.js
  • Data and Write callbacks seem to happen in chunks of 30 at a time, must be something to do with OS buffering
  • 100% cpu
  • Finished in 93 seconds, closed the port without issues

I can't get test 2 to reliably error but the message gives me a little to go on... and of course some other edge cases to handle.

Test 3 writeAndDrain

  • Open port, wait 3 seconds for arduino
  • Log on each received command from the arduino
  • Write each command and wait until the callback and a call to drain has a callback before writing the next command
  • close the port when the last command has finished writing

Results

  • finished in 97 seconds and closed port without issues

What did I learn? Not a ton but there are some leads around closing the port and the read poller. Also that cpu usage...

@techninja
Copy link
Author

Wahoo! Awesome work @reconbot!!! It fills me with joy when I can actually be helpful in finding code that baby steps closer to fixing a scary bug in the wild. I know exactly how it feels to hear something you're responsible for isn't working as expected, but having no way to reproduce is the hardest thing to try and deal with.

My next hope of course is your background knowledge and debugging skills can be leveraged to find what deep pocket of weirdness lurk within these results (and the code) to find something soon. I'm currently being forced to sell my house over the course of the next 4 weeks, but I will keep up to date multiple times per day to test anything you've got 😸. Good luck sir!!

@reconbot
Copy link
Member

I have an alpha ready that is in much better shape.

The branch is called write-tests and currently only works for OSX. It probably also has a small memory with closing ports. (TODO items)

It's able to run all 3 tests with 0-2% cpu, and close the port without crashing. @indutny helped me write a better poller which was the answer we needed.

@techninja
Copy link
Author

Alright!

Testing : EmergingTechnologyAdvisors/node-serialport#write-tests, @ 5c0ee67

OSX - Electron v1.6.10

  • Write->Drain->Callback, likely died of segfault at ~16k commands.
  • Write->Drain->Flush->Callback, likely died of segfault at ~8.9k commands.

Battery died before I could fully test node directly, will try again and report back soon!

@reconbot
Copy link
Member

reconbot commented Jul 5, 2017

Please try 5.0.0-beta6 this should have read and write support on unix.

In addition

  • disconnect events don't fire when you're closing the port
  • Close during writes causes an error, close during reads does not

@techninja
Copy link
Author

techninja commented Jul 8, 2017

@reconbot Testing this now, but I ave to say with physical disconnect detection working great(!!), and my initial tests outside of electron going incredibly well, I say this is totally release worthy. I've yet to get it to fail, and will continue testing but I suspect with the incredible work done overhauling various systems, many things are in far better shape.

Thank you so much for your hard work!!!

@lock lock bot locked and limited conversation to collaborators Feb 5, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

No branches or pull requests

4 participants