Heavy traffic will trigger TEF interrupt (CAN-HAT) #7
Comments
I'm having the same problem, but just with normal writes - It seems to happen after around 5-10 minutes of testing (writing to the canbus every 5 seconds). The error reported is:
Restarting the interface (ip link down and then up) restores functionality for a similar duration, but eventually the same error as above happens. |
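The down/up workaround described above can be scripted. The following is a minimal sketch in Python, not a fix for the underlying fault. The helper names `restart_commands` and `restart_interface` are made up for illustration, and the interface name is whatever your setup uses (`can0` below is an assumption).

```python
import subprocess


def restart_commands(ifname: str) -> list:
    """Return the two `ip link` invocations that cycle a network interface."""
    return [
        ["ip", "link", "set", ifname, "down"],
        ["ip", "link", "set", ifname, "up"],
    ]


def restart_interface(ifname: str, run=subprocess.run) -> None:
    """Take the interface down and bring it up again (needs root privileges)."""
    for cmd in restart_commands(ifname):
        # check=True raises CalledProcessError if `ip` exits with an error.
        run(cmd, check=True)
```

Calling `restart_interface("can0")` as root runs both commands in order. The `run` parameter exists only so the function can be exercised without touching a real interface.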
It seems to be the new driver's fault. In August and September I ran my project with 8 messages at 100 Hz, and it was fine during a 3+ hour test. The hardware was the same. |
I just tested with an older version of the kernel: |
I've tested another board with an MCP2515 (driver version 2017) at a message rate of 1300 fps and a bitrate of 500kbps for 17 hours, and it's OK. |
I've tried the driver here: https://github.com/GBert/misc/tree/master/RPi-MCP2517 (changing the oscillator setting to 40 MHz), and that worked for around 15 hours before failing hard. After that, nothing but a full power cycle would bring it online again. At least the driver in this repository works after a software reset, if only for a short while. |
Same issue with the CAN-HAT MCP2517 on the latest Raspbian 4.19.75-v7l, on a Raspberry Pi 4. It works for a while, and then suddenly stops. It seems that the higher the load, the higher the likelihood of getting the error. |
It seems related to the fact that the MCP2517FD has now been discontinued due to a quite severe silicon bug, described in this errata document: They specifically mention this to be related to slower Linux systems (Raspberry Pi) and too long a delay between SPI bytes:
|
We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error. |
Note, this appears to be this issue: msperl/linux-rpi#6 |
The Raspberry Pi kernel bug is being handled by its maintainer and should be fixed soon. We will also upgrade this product to the MCP2518FD in the future. |
I'm bitten by this too, a real show stopper. @Pillar1989 do you know who might resolve the PI kernel bug (if that is indeed the cause of this issue) and more importantly do you know if whoever might resolve the PI kernel bug is aware of the bug? |
@jbakuwel msperl/linux-rpi#6 Here are some updates; I don't know if they are what you want. Besides, we are experiencing the Chinese New Year, and we will not have enough energy to devote to this matter until February 1, 2020. |
Hi Baozhu,
Thanks for the prompt response. I did read all the posts on GitHub and even tried reverting the TEF optimisations, as some people seem to think these might have introduced the problem, but so far I have had no luck getting the hardware to behave.
Wish you a great festive season and hope to hear from you re this in Feb.
kind regards,
Jan
|
Update: reverting the optimization commits as suggested by @irrwisch1 here does seem to resolve the problem for me too on a Raspberry Pi 4, at least so far. I also had to change MCP25XXFD_SCLK_DIVIDER to 4 in mcp25xxfd_regs.h. There are remaining issues:
test kernel: [ 1538.699553] mcp25xxfd spi1.0: ECC double bit error at c00
test kernel: [ 2222.497103] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1118
test kernel: [ 666.922505] mcp25xxfd spi1.0 can1: CAN Bus error experienced
Then there's this:
test kernel: [ 1913.359886] mcp25xxfd spi0.0: found IVMIF situation not supported by driver - bdiag = [0x00008400, 0x00005ed7]
which was resolved by applying the changes suggested by @jrm06c here. It would be great to get this driver stable; happy to help in any way I can. |
Further debug info. I've written a simple program that forwards packets from can0 (called b) to can1 (called v) and vice versa. This runs fine for about 7 minutes (having processed ~70K frames), then single bit ECC errors start showing up. These are (most likely) corrected and everything continues to run fine. See my previous post for the changes I made to the driver. Any suggestions?
20200126-142108: Frames received on can0 (b) will be forwarded to can1 (v) |
@jbakuwel can you send me your test program? |
@marckleinebudde, yeah sure no worries: cantxrx.py.txt. Note tabstop=3 makes for easy reading. |
@jbakuwel I'm on a rapi3, using a PiCAN FD Duo, which has two
Can you improve your Python test so that it doesn't loop back the error frames that are received on an interface:
|
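One way to avoid looping back error frames is to check the error flag in the CAN identifier before forwarding. The sketch below is not the cantxrx.py script from this thread. It assumes raw classic SocketCAN frames (16 bytes, layout `=IB3x8s`) and uses the `CAN_ERR_FLAG` value from `linux/can.h`. The function names are illustrative. With python-can, the comparable check would be the `is_error_frame` attribute of a received message.

```python
import struct

CAN_FRAME_FMT = "=IB3x8s"   # can_id, length, 3 padding bytes, 8 data bytes
CAN_ERR_FLAG = 0x20000000   # set in can_id when the frame reports a bus error


def is_error_frame(raw: bytes) -> bool:
    """Return True if a raw 16-byte SocketCAN frame is an error frame."""
    can_id, _length, _data = struct.unpack(CAN_FRAME_FMT, raw)
    return bool(can_id & CAN_ERR_FLAG)


def frames_to_forward(frames):
    """Yield only the frames that should be passed to the other interface."""
    for raw in frames:
        if not is_error_frame(raw):
            yield raw
```

A forwarding loop would then send only what `frames_to_forward` yields, so error frames seen on one interface never reach the other bus.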
@marckleinebudde I'm using a bitrate of 500kbps. I've been doing two tests.
For the first test I connect can0 to can1 with two wires and use cangen and candump. This test appears to run fine, at least for a day or so. Hence the need for another test that reliably reproduces the issue(s).
This other test uses my Python script on a live CAN bus with plenty of traffic (@ 500kbps) that normally connects two nodes. For my test, both nodes are connected to the Pi HAT instead of directly to each other. The Pi running my Python script functions as the wires mentioned in the first test. If this can run for an extended period of time without issues showing up in the Pi kernel log, in my Python script or in the live CAN system, we can hopefully conclude that the issues with the driver have been resolved.
I've added a test for error frames to the code so they are not passed on, but I do not see them appear (i.e. it makes no difference). Maybe these are not passed on by the Python CAN library.
The Seeed Studio Pi HAT also has two mcp2517fd's, but those are connected to spi0 and spi1. After a while (sometimes 15 minutes, sometimes an hour), I see:
[47007.453669] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1118
This suggests a timing issue / bug in the driver causing data corruption from which the driver does not recover. I can't restart it either with:
ip link set can1 type can restart
but have to reload the kernel module. I'm bringing the devices up at boot with:
set_can () { |
Hi folks, Back at testing the driver after having to rebuild the test system due to a failing SD card. Kernel log filling up with heaps of these (possibly one message for each forwarded frame while running my 2nd test see above): Feb 12 08:40:28 mitm kernel: [43531.353724] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo |
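When the kernel log fills up with messages like these, it helps to count them per SPI device and per fault type. The following is a rough sketch under stated assumptions: the substrings are copied from kernel messages quoted in this thread, while the category names and the `count_faults` helper are made up for illustration.

```python
import re
from collections import Counter

# Substrings copied from kernel messages quoted in this thread.
SIGNATURES = {
    "tef": "we got a TEF interrupt",
    "ecc_double": "ECC double bit error",
    "unknown_irq": "unidentified system interrupt",
    "mode_switch": "Controller unexpectedly switched from mode",
}

# Matches e.g. "mcp25xxfd spi0.0 can0: ..." or "mcp25xxfd spi1.0: ...".
DEVICE_RE = re.compile(r"mcp25xxfd (spi\d+\.\d+)")


def count_faults(lines):
    """Count (spi device, fault kind) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match is None:
            continue  # not a message from this driver
        for kind, text in SIGNATURES.items():
            if text in line:
                counts[(match.group(1), kind)] += 1
    return counts
```

The function takes any iterable of strings, for example the lines of a saved `dmesg` dump, and returns a `Counter` keyed by device and fault kind.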
Hey @jbakuwel, good to see you back! |
Hey @marckleinebudde, glad to be back :-) |
Interesting point to note: despite the kernel messages mentioned above: |
@jbakuwel The driver is based on the latest v4.19-rpi kernel and includes some backports for the spi-aux driver. Using the latest stable kernel is always a good idea. So compile the whole kernel. |
Update (still on current Seeed Studio driver and 4.19.97-v7l+ kernel): Feb 12 09:16:13 test kernel: [45675.865861] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6 and after increasing the traffic a bit: Otherwise everything still running... |
As expected, the driver didn't recover after the first double bit ECC error. The "mode switches" are new... haven't seen those before. Kernel 4.19.97-v7l+ with current Seeed Studio driver. @marckleinebudde, will now compile your kernel + driver and give that a go. Feb 12 09:16:13 test kernel: [45675.865861] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6 Feb 12 15:00:23 test kernel: [66326.870662] mcp25xxfd spi1.0: ECC double bit error at 000 |
@jbakuwel Due to the errata the mcp2517fd can experience TX MAB underflows. With this error it will switch from CAN-2.0 Mode (6) to Restricted Mode (7). Then the driver brings the device back into CAN-2.0 Mode. However, from the driver's point of view the transition from Restricted to CAN-2.0 (7 -> 6) is unexpected. For yet unknown reasons, in contradiction to the errata, the device will sometimes switch to Listen Only Mode (3) instead of Restricted. This (6 -> 3) and the recovery (3 -> 6) are also not expected by the driver. A debug version of my driver read the mode value several times and it seems the device sometimes changes first into Listen Only and then into Restricted: I cannot explain the 7 -> 3 transition. It might be a recovery 7 -> 6, followed immediately by a TX MAB error, where the device changes into Listen Only. The interrupt handler's delay might be so high that the device has already changed to Listen Only. Long story short, the driver doesn't take all these mode changes into account.
|
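The mode transitions described above can be summarised as a small lookup. This is only an illustrative sketch, not code from the driver. The mode numbers (3, 6, 7) are the ones given in the comment, while the `describe_transition` helper and its wording are assumptions.

```python
# Mode numbers as given in the comment above.
MODE_NAMES = {3: "Listen Only", 6: "CAN-2.0", 7: "Restricted"}

NORMAL_MODE = 6          # the mode the driver wants the controller to be in
FAULT_MODES = (3, 7)     # modes the controller was observed to fall into


def describe_transition(old: int, new: int) -> str:
    """Label a mode change as fault, recovery, or unexplained."""
    old_name = MODE_NAMES.get(old, f"mode {old}")
    new_name = MODE_NAMES.get(new, f"mode {new}")
    if old == NORMAL_MODE and new in FAULT_MODES:
        kind = "fault (possible TX MAB underflow)"
    elif old in FAULT_MODES and new == NORMAL_MODE:
        kind = "recovery"
    else:
        kind = "unexplained"
    return f"{old_name} -> {new_name}: {kind}"
```

For example, `describe_transition(7, 6)` labels the "switched from mode 7 to 6" kernel message as a recovery, while `describe_transition(7, 3)` comes out as unexplained, matching the open question in the comment.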
@johandc Building the entire kernel might take some computer time but not much of your time :-)
Now reboot and you're running the kernel with the required/recommended SPI patches as well as Marc's driver. |
@marckleinebudde Is half-duplex mode enabled or disabled by default? In other words, would I see any difference in behaviour between -17 and -27 in that sense? |
@marckleinebudde Driver -27 fails to load on the RPi4 with: |
Hey @jbakuwel, thanks for the report, the problem is fixed. Meanwhile I've rebased to the latest raspi v4.19 kernel (note: the branch has a new timestamp) and backported some SPI improvements, reducing the overall time of a TX message. https://github.com/marckleinebudde/linux/tree/v4.19-rpi/mcp25xxfd-20200429-35 |
@marckleinebudde Somehow you lost bcm2711_defconfig in this tree: |
Hey @jbakuwel, handling too many trees right now, the raspi tree is: https://github.com/marckleinebudde/linux/tree/v4.19-rpi/mcp25xxfd-20200429-35 The above wrong link was for the Sorry for the noise :) |
Hey @marckleinebudde, I guess I could have spotted that. Too quick with my copy / paste ;-) Thanks for all your work on this! |
Hello, some updates for you:
|
@marckleinebudde Thank you. We will update it as soon as possible, so that users can use your latest software directly. Let's test and perfect it together. |
@marckleinebudde I tested -36 for a bit now and it looks really stable - I didn't see any issues in either classic or FD mode. Will continue testing. |
Hi @marckleinebudde. I wasn't sure if it was appropriate to comment here or on your GitHub, but seeing as this is active I will comment here. I am currently using the MCP2515 on an Allwinner V3S (essentially a single core H3, identical SPI driver). Currently we are working on a revision, and the new board design accepts either the MCP2515 or the MCP2518FD, with the hope of utilizing a higher SPI frequency (to save time for other devices on the SPI bus, e.g. a display). From reading your notes above, if I have a 40 MHz clock on the MCP2518FD, the appropriate SPI frequency would be ~17.64 MHz? What are the increments on the H3 driver, and what is the limitation here? I am currently using buildroot with mainline 5.3.13. Is it possible to package your driver as an out-of-tree driver (e.g. build a patch against mainline /drivers)? Is there anything that I need to be aware of here? I guess I need to use the sunxi branch in your repository? Any other advice? Thank you |
Hey @kjngineering, I hijacked this issue while writing my driver, so better ask on my github next time :) There are two limitations on the usable SPI clock speed. One is a yet unknown issue with the mcp25xxfd: it doesn't properly work with Yes, you can package my driver as an external one, or port the patches to your branch. Note: you'll need the sun6i spi patches, too. Drop me a note if you need commercial support for this. |
Hey @marckleinebudde, we have a problem when using your version of the kernel:
Any advice on this? We are using an RPi 3B+ and Seeed Studio. |
Hey @kowalikm, which kernel are you exactly using? |
@marckleinebudde I have used this branch:
|
Use something more recent, like https://github.com/marckleinebudde/linux/tree/v5.7-rpi/mcp25xxfd-2020618-44 The error message |
@marckleinebudde Thanks for the reply. I've also checked However with v21 there is a problem: when the CAN receiver is not connected and we send frames (~1000/sec), the RPi hangs after some time (20-30 seconds). Do you have any ideas what we can check? Also, could you provide some instructions on how to compile kernel v5.7? (Unfortunately, the instructions that work for kernel 4.19 do not work here.) |
@marckleinebudde: I tried the latest sunxi -46 branch but still can't get it to run reliably on my hardware. On the other hand, msperl's original driver runs just fine (with his queue optimizations removed). Any idea what I can do to diagnose the issue further? |
Hey @irrwisch1, are you using the proper IRQ type? The correct one is probably |
Hey @kowalikm, the driver fails with CRC error, you can disable CRC check by removing: |
@marckleinebudde: Seems like this did the trick. I still had this set on falling edge. Thanks! |
I am using @marckleinebudde's -47 release that was pushed to linux-can-next for mainline: After a few patching issues due to the driver being broken up and some changed flags in device tree I thought I was getting some success:
However that was short lived; I couldn't receive a single message (candump 500k) without getting a crc error:
I used the advice and disabled the CRC flags which got me some results (I was able to load-test once at around 15% bus utilisation at 500k), however now I am getting persistent errors after a few packets:
It seems horribly inconsistent, sometimes high busload is ok, sometimes anything more than 250ms in-between will fault the driver. This seems related to ECC, next build I think I might try and disable this and see if I have any more luck. Anyone got any more thoughts? |
Just to follow up: building the driver without ECC support didn't help. I still get RX tail errors. |
I'm using https://github.com/marckleinebudde/linux/tree/v5.4-rpi/mcp251xfd-20201022-55, and the CAN bus fails under heavy load (a few hundred messages/s). This is what I read in dmesg:
Should I use some other version of the driver? Did anyone manage to get it working reliably on some other version? Could I be missing something really stupid? (I am not familiar at all with low level stuff) |
Hey @RaphyJake, meanwhile you can use the official rpi-5.4.y branch https://github.com/raspberrypi/linux/tree/rpi-5.4.y. The official overlay is a bit different now, but the driver is the same. Please describe your test setup in more detail: the CAN controllers and Raspberry Pis used, who is sender/receiver, and the messages sent. Also please check the cabling of the CAN bus. Keep in mind that you cannot send the same message from two CAN controllers over the same bus (at the same time); if you do, you'll see these strange messages. |
I got my hands on a brand new Raspberry Pi 4 and a brand new Seeed CAN shield, and after a clean install everything works perfectly! |
Hello @marckleinebudde, I'm using an RPi 4 with the 5.4.140-rt64 kernel and a 2-channel CAN FD shield,
and I followed this document. After heavy traffic, error frames are detected (PCAN-View), and the motor driver detects something wrong (bus off). Here is my dmesg:
What could be the problem? |
This is unrelated to the original problem. Consider opening a new issue. Have a look at the error frames with
You can try to configure a bigger
sudo ip link set can0 type can bitrate 1000000
ip -details link show can0
Have a look at the
sudo ip link set can0 up type can bitrate 1000000 sjw 5 |
During a stress test, write() to can0 gets stuck, and "dmesg | grep spi" reports:
[13378.996137] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[13378.996181] mcp25xxfd spi0.0 can0: tefif: fifo 6 not pending - tef data: id: 00000301 flags: 00000c08, ts: d7649027 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
Could you kindly suggest how to recover from this situation?