Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LPC1768 & DUE USB serial can trigger watchdog #11715

Closed
gloomyandy opened this issue Sep 4, 2018 · 20 comments
Closed

LPC1768 & DUE USB serial can trigger watchdog #11715

gloomyandy opened this issue Sep 4, 2018 · 20 comments
Labels
T: HAL & APIs Topic related to the HAL and internal APIs.

Comments

@gloomyandy
Copy link
Contributor

I've been doing a little work on the current LPC1768 USB code and noticed an issue which also seems to exist with the DUE code. Basically in both cases when writing to the USB based serial port the code will spin in a loop waiting for space in the transmit buffer (or for the connection to go away). This means that if for some reason the PC program does not read the port for a period then a watchdog timeout can trigger. This does not seem to be ideal, but I'm really not sure if there is a better solution, obviously it would be good if the PC did not stop reading, but if possible I'd like to make the Marlin side more robust if this does happen.

Does anyone have any thoughts on the best way to handle this? Perhaps some sort of timeout after which we just dump the serial output? Should we trigger some sort of error or warning (does Marlin have a good way to do this?) or is it best just to let the watchdog happen (assuming it is enabled)? Any thoughts @ejtagle @thinkyhead @p3p

@thinkyhead
Copy link
Member

Good CC'ing…

I'm stumped, but we'll look closer and see. Does it help to mess around with the size of the TX buffer?

@thinkyhead thinkyhead added the T: HAL & APIs Topic related to the HAL and internal APIs. label Sep 4, 2018
@gloomyandy
Copy link
Contributor Author

gloomyandy commented Sep 4, 2018

A larger TX buffer will probably not help very much (at least not in the worst case of the PC just not reading, it may help if it is just slow), if a host program has used M155 to request automatic temperature updates and then stops reading them, they just pile up until you get a watchdog timeout. I'm not sure it is a big deal, just that since I was touching the code anyway I thought I'd raise it to see if there was a better way.

It is probably not helped by the fact that I'm using what is basically a smootieboard (an MKS SBASE with a smothie bootloader installed), which I think interferes with the "normal" Marlin handling of a watchdog timeout. I think the bootloader is detecting the timeout and throwing the board into some sort of download mode rather than booting into Marlin and letting that handle it. Probably another thing to sort out at some point for good Marlin support of that class of boards.

@AnHardt
Copy link
Member

AnHardt commented Sep 4, 2018

We could simulate the not handshaked 2 line serial between the processor and the usb chip on the Arduino Mega (Uno, ...). Information gets lost if not received. Not nice but consistent.

@forkoz
Copy link
Contributor

forkoz commented Sep 4, 2018

This is probably how the locks trigger the watchdog. USB serial gets disconnected and the temperature updates pile up. I notice using your SPI sharing that sometimes after printing from SD I can't re-connect to the serial port until I reboot the printer. So both may be intertwined somehow or could be related to my motherboards' usb drivers which is why it was so hard to reproduce before. Good to know that my board isn't broken and that I'm not crazy.

@ejtagle
Copy link
Contributor

ejtagle commented Sep 4, 2018

On DUE, the USB CDC will just discard characters to be sent if no program has opened the emulated serial port or USB is disconnected.
It is assumed that once a pc program opens the serial port, Marlin goes into the "slave mode" and all responsability for reading responses and sending commands is a problem of the PC itself (it makes pretty much sense!)
There is no fix for lack of reading on the PC side... I would even say this is a behaviour we want! ... We use that feature of the USB stack to implement "forced" handshake, so it is impossible to lose any TXd or RXd character from any side of the communication bridge.

On MKS, the watchdog reset could be handled in a very similar way as the DUE can handle it (if configured)... You can reprogram the watchdog to trigger an interrupt with maximum priority, and handle it. creating a traceback and then restarting the firmware, instead of letting it reset the board...

@p3p
Copy link
Member

p3p commented Sep 9, 2018

There isn't much we can do other than drop the data if the host is not reading from the open connection, we could perhaps retry a few times or leave that to the upstream function, the serial api should just return the amount that was successfully buffered to what ever Marlin process tried to send and it can decide whether to retry or throw it away, as it stands Marlin functions do not interpret the return values so will default to discard.

It is probably not helped by the fact that I'm using what is basically a smootieboard (an MKS SBASE with a smothie bootloader installed), which I think interferes with the "normal" Marlin handling of a watchdog timeout.

The smoothie bootloader does go into DFU mode on watchdog reset, but needing human interaction (hard reset) on fatal error is best practice with dangerous hardware anyway so I don't see it as a problem, if we did fork the bootloader I'm not sure we should change that.

@gloomyandy
Copy link
Contributor Author

I agree that it should probably have a manual reset following a watchdog timeout, but it would be nice if some indication was given that it was a watchdog error that caused the system to halt. At the moment it just looks like the printer has locked up.

I have the traceback code mentioned above working on my SBase board and will create a PR for those changes as soon as I've finished up the work on USB/SD card/SPI. Having the ability to display the traceback and add code into the handler to print out other system state is very handy and has helped me a lot to figure out what was going on. I'm not sure if it should be included as standard, but it is really handy when developing!

@gjdodd
Copy link
Contributor

gjdodd commented Oct 9, 2018

I do not know if it helps, but I have been playing with an mks sbase and have seen this problem of the halting consistently, I thought I had a bad board at first. The issue shows up on using either Octoprint or Pronterface (any host will do). Once the M155 has been started, it will at some point, normally after about 5 minutes, end up halting, even if the usb is still listening. I can make it do this consistently by starting the connection then just removing the USB cable and normally in about 5 minutes or so it will have halted. This does not seem to happen when using the other serial port instead, that seems to carry on fine, it is only when using the USB connection. As it stands at the moment the USB is not usable as the printer will halt at some point if the USB is plugged in, and will just cause failed prints.

If you want me to try anything out or test anything please do give me a shout.

@p3p
Copy link
Member

p3p commented Oct 9, 2018

I'm dragging my feet, ( and probably annoying @gloomyandy because of it) at putting a PR in with his USB fixes that have been added to the new LPC176x framework, you could try my experimental branch (https://github.com/p3p/Marlin/tree/pr_bf2_pulloutlpcframework) and see if it solves any of your issues,

@gjdodd
Copy link
Contributor

gjdodd commented Oct 9, 2018

I am just trying out the branch now, is there a branch or change set with just the USB fixes included, just so I could isolate the fix if required. Not to worry if there is not

@p3p
Copy link
Member

p3p commented Oct 9, 2018

There isn't an up to date branch with just the USB changes I don't think, if you do have any problems you can usually catch me on the Marlin discord.

@gjdodd
Copy link
Contributor

gjdodd commented Oct 9, 2018

It looks to be working at the moment

@thinkyhead
Copy link
Member

@gloomyandy — Do you also find that it's working now?

@gloomyandy
Copy link
Contributor Author

I've been using the new framework for a while with no problems. I did a fair bit of testing of the USB and SD card code. But there is new code in it now that I've not tested as I'm away at the moment. The new code is for PWM and servos I think. I'll be updating to that and trying it as soon as I get back.

Note that it is still possible to get a watchdog timeout if the USB connection is open but the PC side code does not read from the port (as discussed above). If we want to fix this then we really need to either just have the USB serial code drop characters or change the higher level Marlin code. The various problems that could cause the Marlin side of the connection to lock up should now be fixed.

@gjdodd
Copy link
Contributor

gjdodd commented Oct 16, 2018

I have just noticed a merged pull request (commit 3d5be2e) into 2.0 bugfix, with what appears to be these changes included. I am right in my thinking? If so I can switch over to the official bugfix code base instead of the custom branch.

@p3p
Copy link
Member

p3p commented Oct 16, 2018

Indeed Marlin is on the new LPC176x framework now with all the USB fixes it included.

@boelle
Copy link
Contributor

boelle commented Feb 20, 2019

@gloomyandy had any problem since?

@gloomyandy
Copy link
Contributor Author

This issue has been resolved on the LPC176x (by dropping data if still not sent after a short timeout), but it is still potentially an issue on the DUE. But it is probably a matter of opinion as to if it needs to be fixed and what that fix should be. So for the sake of reducing the issue queue I'm closing it!

@marcio-ao
Copy link
Contributor

I agree that it should probably have a manual reset following a watchdog timeout, but it would be nice if some indication was given that it was a watchdog error that caused the system to halt.

How about calling "kill()" if the host is no longer accepting characters, but before the watchdog triggers? That way there would be a kill message on the LCD for the user to understand what caused the fault.

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T: HAL & APIs Topic related to the HAL and internal APIs.
Projects
None yet
Development

No branches or pull requests

9 participants