New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arduino_handler reads empty messages from the arduino and never recovers #331

Closed
goruck opened this Issue Aug 13, 2017 · 16 comments

Comments

Projects
None yet
2 participants
@goruck
Contributor

goruck commented Aug 13, 2017

I saw an incidence of the arduino_handler reading empty messages from the arduino, here's the trace:

TRACE> arduino_handler serial read: >0,84.20,26.30,462,26.50,0,0,3.90,0.00<
[rosout][DEBUG] 2017-08-11 20:52:37,093: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-11 20:52:37,352: 
TRACE> arduino_handler serial read: >0,84.20,26.30,462,26.50,0,0,3.90,0.00<
[rosout][DEBUG] 2017-08-11 20:52:38,093: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-11 20:52:40,347: 
TRACE> arduino_handler serial read: ><
[rosout][DEBUG] 2017-08-11 20:52:40,373: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-11 20:52:42,627: 
TRACE> arduino_handler serial read: ><

It never again read a non-empty message from the arduino. Everything was fine again after I killed openag brain and restarted it. This was seen on the latest develop branch code.

This is similar to the behavior described in #305 but the arduino looks to be still alive based on the serial TX and RX signal activity measured by my scope. In #305 the arduino would crash. Still trying to verify this, though if I can catch another occurrence which I'm trying now.

The fixes that were backed out in #329 may actually mitigate this problem since it seems like resetting the serial connection after detecting a few blank lines can handle the condition. So it might be worthwhile to put them back in.

@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 14, 2017

I still see this same condition after 5 hours running on one of our PFCs. The fix I have in #329 does not seem to have any effect on the issue described here. (#329 specifically fixes a different error that was seen only on one bot, due to the serial port switching).

@rbaynes rbaynes self-assigned this Aug 14, 2017

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 14, 2017

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 15, 2017

I saw another incident of this issue after about 30 hours. (Probably just coincidence, the issue occured right after I stopped a 'rostopic echo /arduino_status' command that had been running in another terminal. But still very strange).

As far as I can tell from the serial activity I'm monitoring with the scope, the Arduino is still alive after this issue occurs. So its probably another case of the serial port getting into a weird state.

I'm now running the latest develop branch code but with the following changes in arduino_handler.py (MAX_EMPTY_READ is set to 4 which allows for the cases the Arduino receives bad messages which do cause empty reads but does not cause the system to lockup).

# Blocks until one line is read or times out
buf = serial_connection.readline()
empty_read_count += 1 if buf == "" else 0 # Number of sequential empty reads
trace('arduino_handler empty read count: %d', empty_read_count)
if empty_read_count > MAX_EMPTY_READS:
    empty_read_count = 0
    raise RuntimeError('serial_connection.readline() socket connection broken')

and

except RuntimeError as e3:
    rospy.logwarn(e3)
    serial_connection = connect_serial()
@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 15, 2017

Thank you for the continued work on this issue!

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 17, 2017

Here's an update.

I've hit the issue again, it looks like the USB port dies completely since the output of lsusb shows nothing there for the Arduino:

pi@raspberrypi:~/usbreset $ lsusb
Bus 001 Device 005: ID 05a3:9520 ARC International 
Bus 001 Device 003: ID 0424:ec00 Standard Microsystems Corp. SMSC9512/9514 Fast Ethernet Adapter
Bus 001 Device 002: ID 0424:9514 Standard Microsystems Corp. 
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

And dmesg shows strange things going on with the USB:

[133742.936612] usb 1-1.5: New USB device found, idVendor=2341, idProduct=0042
[133742.936621] usb 1-1.5: New USB device strings: Mfr=1, Product=2, SerialNumber=220
[133742.936625] usb 1-1.5: Manufacturer: Arduino (www.arduino.cc)
[133742.936630] usb 1-1.5: SerialNumber: 55639313133351C0D0A0
[133742.937432] cdc_acm 1-1.5:1.0: ttyACM1: USB ACM device
[133744.546986] usb 1-1.5: USB disconnect, device number 11
[133744.547318] cdc_acm 1-1.5:1.0: failed to set dtr/rts
[133744.862758] usb 1-1.5: new full-speed USB device number 12 using dwc_otg
[133744.962761] usb 1-1.5: device descriptor read/64, error -32
[133745.182794] usb 1-1.5: device descriptor read/64, error -32
[133745.402811] usb 1-1.5: new full-speed USB device number 13 using dwc_otg
[133745.502819] usb 1-1.5: device descriptor read/64, error -32
[133745.722952] usb 1-1.5: device descriptor read/64, error -32
[133745.952777] usb 1-1.5: new full-speed USB device number 14 using dwc_otg
[133746.392852] usb 1-1.5: device not accepting address 14, error -32
[133746.492784] usb 1-1.5: new full-speed USB device number 15 using dwc_otg
[133746.932771] usb 1-1.5: device not accepting address 15, error -32
[133746.932863] usb 1-1-port5: unable to enumerate USB device
[167868.358648] usb 1-1.4: new full-speed USB device number 16 using dwc_otg
[167868.458586] usb 1-1.4: device descriptor read/64, error -32
[167868.678554] usb 1-1.4: device descriptor read/64, error -32
[167868.898651] usb 1-1.4: new full-speed USB device number 17 using dwc_otg
[167869.008554] usb 1-1.4: device descriptor read/64, error -32
[167869.228608] usb 1-1.4: device descriptor read/64, error -32
[167869.448606] usb 1-1.4: new full-speed USB device number 18 using dwc_otg
[167869.888842] usb 1-1.4: device not accepting address 18, error -32
[167869.988572] usb 1-1.4: new full-speed USB device number 19 using dwc_otg
[167870.428611] usb 1-1.4: device not accepting address 19, error -32
[167870.428802] usb 1-1-port4: unable to enumerate USB device
[167904.458856] usb 1-1.5: new full-speed USB device number 20 using dwc_otg
[167904.559018] usb 1-1.5: device descriptor read/64, error -32
[167904.778856] usb 1-1.5: device descriptor read/64, error -32
[167904.998845] usb 1-1.5: new full-speed USB device number 21 using dwc_otg
[167905.098887] usb 1-1.5: device descriptor read/64, error -32
[167905.318859] usb 1-1.5: device descriptor read/64, error -32
[167905.538838] usb 1-1.5: new full-speed USB device number 22 using dwc_otg
[167905.978853] usb 1-1.5: device not accepting address 22, error -32
[167906.088844] usb 1-1.5: new full-speed USB device number 23 using dwc_otg
[167906.528910] usb 1-1.5: device not accepting address 23, error -32
[167906.529047] usb 1-1-port5: unable to enumerate USB device

The arduino_handler tries to reconnect but of course fails because the USB port is gone:

TRACE> arduino_handler empty read count: 0
[rosout][DEBUG] 2017-08-16 09:43:12,351: 
TRACE> arduino_handler serial read: >0,71.50,25.70,500,23.25,0,0,3.95,0.00<
[rosout][DEBUG] 2017-08-16 09:43:13,088: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-16 09:43:15,342: 
TRACE> arduino_handler empty read count: 1
[rosout][DEBUG] 2017-08-16 09:43:15,344: 
TRACE> arduino_handler serial read: ><
[rosout][DEBUG] 2017-08-16 09:43:15,357: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-16 09:43:17,612: 
TRACE> arduino_handler empty read count: 2
[rosout][DEBUG] 2017-08-16 09:43:17,614: 
TRACE> arduino_handler serial read: ><
[rosout][DEBUG] 2017-08-16 09:43:17,624: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-16 09:43:19,879: 
TRACE> arduino_handler empty read count: 3
[rosout][DEBUG] 2017-08-16 09:43:19,881: 
TRACE> arduino_handler serial read: ><
[rosout][DEBUG] 2017-08-16 09:43:19,888: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][DEBUG] 2017-08-16 09:43:22,141: 
TRACE> arduino_handler empty read count: 4
[rosout][DEBUG] 2017-08-16 09:43:22,144: 
TRACE> arduino_handler serial read: ><
[rosout][DEBUG] 2017-08-16 09:43:22,160: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][WARNING] 2017-08-16 09:43:22,416: device reports readiness to read but returned no data (device disconnected?)
[rosout][WARNING] 2017-08-16 09:43:22,541: No serial device found on system in /dev/serial/by-id
[rosout][WARNING] 2017-08-16 09:43:22,747: No serial device found on system in /dev/serial/by-id
[rosout][DEBUG] 2017-08-16 09:43:22,953: 
TRACE> arduino_handler serial read: ><
[rosout][DEBUG] 2017-08-16 09:43:23,161: 
TRACE> arduino_handler serial write 89 bytes: >0,0.0,0.0,False,False,False,False,False,False,1.0,True,True,True,0.0,0.0,0.0,False,False<
[rosout][WARNING] 2017-08-16 09:43:24,357: device reports readiness to read but returned no data (device disconnected?)
[rosout][WARNING] 2017-08-16 09:43:24,370: No serial device found on system in /dev/serial/by-id
[rosout][WARNING] 2017-08-16 09:43:24,573: No serial device found on system in /dev/serial/by-id
[rosout][WARNING] 2017-08-16 09:43:24,775: No serial device found on system in /dev/serial/by-id
[rosout][WARNING] 2017-08-16 09:43:24,977: No serial device found on system in /dev/serial/by-id
[rosout][WARNING] 2017-08-16 09:43:25,180: No serial device found on system in /dev/serial/by-id

I checked the USB cable from the rPi to the Arduino, looks Ok. Disconnecting and reconnecting the cable does no good.

Not sure how to fix this case and maybe this a one-off with my setup.

Anyone see this happening?

@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 17, 2017

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 21, 2017

I found problem. The Arduino's USB to serial processor is locking up. I've been able to restart the communications between the Arduino and the rPi by manually resetting that processor which can be done by grounding its reset pin (assessable on the openag signal board). As you know, an ATmega16U2 (ATmega 8U2 on the revision 1 and revision 2 boards) serves as the USB to serial processor and is not reset by the DTS initiated software reset like is the case for the main ATmega2560 processor.

I'm not sure why its locking up. It could be a marginally bad part but if @rbaynes is seeing it too then probably not. My guess is noise on the serial line or in the system in general is somehow causing it. This noise may be different from system to system and build quality of the system (connections, wiring, etc) could be a factor.

Now how to mitigate? Some thoughts:

  1. Try replacing the Arduino and see if the problem goes away.
  2. Implement a watchdog timer on the ATmega*U2.
  3. Implement a timer on the ATmega2560 that causes a GPIO that toggles the ATmega*U2's reset line if it doesn't get a message within some time interval.

Note: Just connecting the ATmega*U2 reset line to the ATmega2560 reset line won't work.

@rbaynes can you try resetting the ATmega*U2 the next time you see this?

@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 21, 2017

Hi @goruck,
Yes I will try resetting that chip. Can you tell me how you reset the ATmega16U2 ? (I will also search for it). Looks like this is what I need:
https://www.arduino.cc/en/Hacking/DFUProgramming8U2

I now have two PFCs (out of 3) that are showing this issue. Unfortunately one is my long term grow test PFC.
Thanks!

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 21, 2017

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 22, 2017

Here's an update.

I found out that the openag signal board ties together the ATmega*U2 and ATmega2560 resets. So actually the reset button on the openag signal board when pressed will reset both processors. Sorry for the confusion here.

The more important finding is that the DTS line was not getting pulled back high when the serial connection is reconnected in the arduino_handler code which meant that the Arduino was never getting reset. I changed the arduino_handler code to close the serial connection before a reconnection attempt. Closing the serial connection causes the DTS line to go high and then the reconnection causes it to go low which triggers the reset circuit on the Arduino board. I've had this running for a while and it looks good. I've also been monitoring the Arduino reset line with a scope to verify that a serial reconnect does indeed reset the Arduino now.

I'm still not clear why the Arduino locks up occasionally and still suspect the USB to serial processor but this may be an acceptable mitigation if it holds.

I'll submit a PR with these changes so @rbaynes can try it on his similarly failing PFCs.

goruck added a commit to goruck/openag_brain that referenced this issue Aug 22, 2017

@goruck goruck referenced this issue Aug 22, 2017

Merged

fix for #331 #334

@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 22, 2017

@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 22, 2017

So far, so awesome. My flakey PFC has been running 2.5 hours (a new record) and it looks like your fix @goruck works!

TRACE> arduino_handler empty read count: 1
TRACE> arduino_handler empty read count: 2
TRACE> arduino_handler empty read count: 3
TRACE> arduino_handler empty read count: 4
TRACE> arduino_handler empty read count: 5
[rosout][WARNING] 2017-08-22 12:19:15,949: arduino_handler: serial_connection.readline() broken
[rosout][WARNING] 2017-08-22 12:19:16,923: arduino_handler WARN>  MHZ16 #1: Initializing
...
continues to work!
@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 23, 2017

Still working after 8 hours. 12 'broken' resets of the serial port and all is well.

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 23, 2017

My PFC is still working after about 30 hours. Similar number (proportional to time) of broken resets healed.

@rbaynes

This comment has been minimized.

Member

rbaynes commented Aug 23, 2017

@goruck

This comment has been minimized.

Contributor

goruck commented Aug 24, 2017

I've been wondering why the Arduino is locking up in the first place and to that end I've noticed (via the scope) that the reset line at the openag signal board level glitches just before a lockup and about the same time the Nutrient A pump delivers a dose. I can actually cause the Arduino to lock up in a similar way by pressing the reset on the openag signal board very quickly.

I'm not sure what this means exactly but wanted to note it here for future reference. It could be a clue later on to actually find the root cause of the lockup but hopefully the mitigation will hold and allow us to move on. I hate not knowing a root cause.

rbaynes added a commit that referenced this issue Aug 24, 2017

@rbaynes rbaynes closed this Aug 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment