Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heavy traffic will trigger TEF interrupt (CAN-HAT) #7

Open
sume-cn opened this issue Nov 9, 2019 · 200 comments
Open

Heavy traffic will trigger TEF interrupt (CAN-HAT) #7

sume-cn opened this issue Nov 9, 2019 · 200 comments

Comments

@sume-cn
Copy link

sume-cn commented Nov 9, 2019

when pressure test was performed, write() to can0 will stuck, and the "dmesg | grep spi" will report:

[13378.996137] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[13378.996181] mcp25xxfd spi0.0 can0: tefif: fifo 6 not pending - tef data: id: 00000301 flags: 00000c08, ts: d7649027 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced

will you kindly give a solution how to recover from this situation?

@sume-cn sume-cn changed the title Heavy traffic will Heavy traffic will trigger TEF interrupt Nov 9, 2019
@sume-cn sume-cn changed the title Heavy traffic will trigger TEF interrupt Heavy traffic will trigger TEF interrupt (CAN-HAT) Nov 9, 2019
@jenshz
Copy link

jenshz commented Nov 11, 2019

I'm having the same problem, but just with normal writes - It seems to happen after around 5-10 minutes of testing (writing to the canbus every 5 seconds). The error reported is:

[ 1909.422685] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
[ 1909.422758] mcp25xxfd spi0.0 can0: tefif: fifo 4 not pending - tef data: id: 0013d803 flags: 00000817, ts: 6991dd58 - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced
[ 1914.423354] mcp25xxfd spi0.0 can0: tefif: fifo 1 not pending - tef data: id: 0013e803 flags: 00000217, ts: 6991eb0f - this may be a problem with spi signal quality- try reducing spi-clock speed if this can get reproduced

Restarting the interface (ip link down and then up) restores functionality for a similar duration, but eventually the same error as above happens.

@sume-cn
Copy link
Author

sume-cn commented Nov 12, 2019

It seems to be the new driver's fault, in Aug and Sept I used to run my project with 100 Hz multiple 8 messages, it was good during 3 hour+ test. The hardware was same.
How can I roll back to the old driver?

@jenshz
Copy link

jenshz commented Nov 12, 2019

I just tested with an older version of the kernel: Linux raspberrypi 4.19.57-v7+ #1244 SMP Thu Jul 4 18:45:25 BST 2019 armv7l GNU/Linux from the previous release of Raspbian, but the problem is also present there. I think it might be some kind of bug in the mcp25xxfd driver itself. Bringing the network interface (can0 / can1) down and then up is a workaround for now, but not ideal.

@sume-cn
Copy link
Author

sume-cn commented Nov 13, 2019

I've tested another board mounting MCP2515(driver version 2017) with 1300 fps message rate, 500kbps baud rate, for 17 hours, and it's OK.

@jenshz
Copy link

jenshz commented Nov 14, 2019

I've tried the driver here: https://github.com/GBert/misc/tree/master/RPi-MCP2517 (changing the oscillator setting to 40 MHz), and that worked for around 15 hours before failing hard. After that, nothing but a full power cycle would bring it online again. At least the driver in this repository works after a software reset, if only for a short while.

@johandc
Copy link

johandc commented Nov 15, 2019

Same issue with the CAN-HAT MCP2517 on latest raspbian 4.19.75-v7l, on Raspberry Pi4. It works for a while, and the suddenly stops. It seems the higher the load, the higher likelihood of getting the error.

@johandc
Copy link

johandc commented Nov 15, 2019

It seems related to the fact that the MCP2517FD have now been discontinued due to a quite severe silicon bug, described in this errata document:
http://ww1.microchip.com/downloads/en/DeviceDoc/MCP2517FD-Data-Sheet-Errata-DS80000792B.pdf

They specifically mention this to be related to slower linux (raspberry pi) and a too long delay between spi bytes:

TX MAB underflow/RX MAB overflow due to
long delays between SPI bytes:

The SPI Interface can block the CAN FD Controller
module from accessing RAM in between SPI bytes
and between the last byte and the rising edge of
the nCS line during an SPI READ or SPI READ
CRC instruction while accessing RAM.
If the CAN FD Controller module is blocked for
more than TSPIMAXDLY, a TX MAB underflow or an
RX MAB overflow can occur.

Work around:

Keep the delay between two SPI bytes and
between the last SPI byte and the rising edge of
nCS shorter than TSPIMAXDLY; see Figure 1.
The maximum allowed delay between two bytes
depends on which CAN message frame is
transmitted and on the selected Nominal Bit Time
(NBT) and Data Bit Time (DBT). Table 1 lists
TSPIMAXDLY for the worst-case scenarios.
For example: TSPIMAXDLY is 8.5 µs for a CAN FD
frame at 500 kbps/2 Mbps. In comparison, an SPI
byte takes 0.67 µs at 12 MHz SCK. A delay of 10
times the duration of one SPI byte could cause a
TX MAB underflow. It is highly unlikely for an MCU
application to introduce such a long delay, but this
error could occur when running an operation
system like Linux on a slower MPU.
In case of a TX MAB underflow, the device will
notify the application by setting SERRIF and
MODIF and by transitioning to Restricted
Operation or Listen Only mode (depending on
CiCON.SERR2LOM). After the application
requests Normal mode, the CAN FD Controller
module will automatically attempt to retransmit the
message that caused the TX MAB underflow. It is
not necessary to reset the device.
In case of an RX MAB overflow, the device will
notify the application by setting SERRIF. The
device will remain in Normal mode. The message
that caused the RX MAB overflow will be
discarded.

@johandc
Copy link

johandc commented Nov 19, 2019

We tested with a board using the MCP2518FD where the errata of the MCP2517 have been fixed. This did not solve the issue with the "Something went wrong - we got a TEF..." error.

@omdathetkan
Copy link

Note, this appears to be this issue: msperl/linux-rpi#6

@Pillar1989
Copy link
Member

Pillar1989 commented Dec 6, 2019

The raspberry PI kernel bug, maintained by someone, should be fixed soon. We will also upgrade this product to MCP2518FD in the future

@jbakuwel
Copy link

jbakuwel commented Jan 12, 2020

I'm bitten by this too, a real show stopper. @Pillar1989 do you know who might resolve the PI kernel bug (if that is indeed the cause of this issue) and more importantly do you know if whoever might resolve the PI kernel bug is aware of the bug?

@Pillar1989
Copy link
Member

@jbakuwel msperl/linux-rpi#6 Here are some updates; I don't know if they are what you want. Besides, we are experiencing the Chinese New Year, and we will not have enough energy to devote to this matter until February 1, 2020.

@jbakuwel
Copy link

jbakuwel commented Jan 14, 2020 via email

@jbakuwel
Copy link

jbakuwel commented Jan 26, 2020

Update: reverting the optimization commits as suggested by @irrwisch1 here does seem to resolve the problem for me too on a Raspberry Pi 4; at least so far. I also had to change MCP25XXFD_SCLK_DIVIDER to 4 in mcp25xxfd_regs.h.

There are remaining issues:

test kernel: [ 1538.699553] mcp25xxfd spi1.0: ECC double bit error at c00
test kernel: [ 1538.699695] mcp25xxfd spi1.0: unidentified system interrupt - intf = 391a1108

test kernel: [ 2222.497103] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1118
test kernel: [ 2222.497225] mcp25xxfd spi1.0: ECC double bit error at 800

test kernel: [ 666.922505] mcp25xxfd spi1.0 can1: CAN Bus error experienced
test kernel: [ 666.923487] mcp25xxfd spi1.0 can1: CAN Bus error experienced
test kernel: [ 667.119274] mcp25xxfd spi1.0 can1: CAN Bus error experienced
test /testd: 20200126-115813: Processing v queue: 1738 messages
test /testd: 20200126-115815: Processing b queue: 534 messages
test kernel: [ 730.263659] mcp25xxfd spi1.0: ECC single bit error at 800
test /testd: 20200126-115931: Processing v queue: 16365 messages
test /testd: 20200126-115945: Processing b queue: 5602 messages
test kernel: [ 821.931424] mcp25xxfd spi1.0: ECC double bit error at c00
test kernel: [ 821.931872] mcp25xxfd spi1.0: ECC double bit error at c00
test kernel: [ 821.932135] mcp25xxfd spi1.0: ECC double bit error at c00
test /testd: 20200126-120040: CanError('Transmit buffer full')

Then there's this:

test kernel: [ 1913.359886] mcp25xxfd spi0.0: found IVMIF situation not supported by driver - bdiag = [0x00008400, 0x00005ed7]

which was resolved by applying the changes suggested by @jrm06c here

It would be great to get this driver stable; happy to help in anyway I can.

@jbakuwel
Copy link

Further debug info. I've written a simple program that forwards packets from can0 (called b) to can1 (called v) and vice versa. This runs fine for about 7 minutes (having processed ~70K frames), then single bit ECC errors are showing up. These are (most likely) corrected and all continues to run fine.
Then after ~150K frames, I start to see double ECC errors with subsequent transmit errors.

See my previous post for the changes I made to the driver.

Any suggestions?

20200126-142108: Frames received on can0 (b) will be forwarded to can1 (v)
20200126-142108: Frames received on can1 (v) will be forwarded to can0 (b)
20200126-142108: Starting notifiers
20200126-142108: Starting test
20200126-142142: Processing v queue: 25 messages
20200126-142142: Processing b queue: 8 messages
20200126-142336: Saving statistics to /data/logs/20200126-142336-test-v-stats.log
20200126-142336: Saving statistics to /data/logs/20200126-142336-test-b-stats.log
20200126-142336: Processing v queue: 24579 messages
20200126-142340: Processing b queue: 2493 messages
20200126-142707: Saving statistics to /data/logs/20200126-142707-test-v-stats.log
20200126-142707: Saving statistics to /data/logs/20200126-142707-test-b-stats.log
20200126-142707: Processing v queue: 44220 messages
20200126-142717: Processing b queue: 6007 messages
kernel: [ 7050.997314] mcp25xxfd spi1.0: ECC single bit error at c00
20200126-143028: Saving statistics to /data/logs/20200126-143028-test-v-stats.log
20200126-143028: Saving statistics to /data/logs/20200126-143028-test-b-stats.log
20200126-143028: Processing v queue: 40758 messages
20200126-143029: Processing b queue: 914 messages
kernel: [ 7297.112660] mcp25xxfd spi1.0: ECC single bit error at 800
kernel: [ 7304.042530] mcp25xxfd spi1.0: ECC single bit error at 800
20200126-143404: Saving statistics to /data/logs/20200126-143404-test-v-stats.log
20200126-143404: Saving statistics to /data/logs/20200126-143404-test-b-stats.log
20200126-143404: Processing v queue: 45941 messages
20200126-143405: Processing b queue: 643 messages
kernel: [ 7446.778260] mcp25xxfd spi1.0: ECC double bit error at c00
kernel: [ 7446.778557] mcp25xxfd spi1.0: ECC double bit error at c00
20200126-143504: b->v: CanError('Transmit buffer full')

@marckleinebudde
Copy link

@jbakuwel can you send me your test program?

@jbakuwel
Copy link

@marckleinebudde, yeah sure no worries: cantxrx.py.txt. Note tabstop=3 makes for easy reading.
I see you're working with Victron equipment too :-) ?

@marckleinebudde
Copy link

@jbakuwel I'm on a rapi3, using a PiCAN FD Duo, which has two mcp2517fd, both connected to spi0 and is probably similar to the Seed Studio's. Both lines are connected and terminated.

  • What bitrate are you using in your test?
  • How do you inject CAN messages into the loop?

Can you improve your python test, that it doesn't loopback the error frames that are received on a interface:

 (2020-01-29 16:46:00.992598)  can0  RX - -  002   [1]  00                                                                                                                                                                                                                                
 (2020-01-29 16:46:00.993560)  can1  RX - -  20000004   [8]  00 08 00 00 00 00 00 00   ERRORFRAME                                                                                                                                                                                         
        controller-problem{tx-error-warning}                                                                                                                                                                                                                                              
 (2020-01-29 16:46:00.993705)  can0  RX - -  20000004   [8]  00 08 00 00 00 00 00 00   ERRORFRAME                                                                                                                                                                                         
        controller-problem{tx-error-warning}                                                                                                                                                                                                                                              
 (2020-01-29 16:46:00.993868)  can1  RX - -  20000004   [8]  00 20 00 00 00 00 00 00   ERRORFRAME                                                                                                                                                                                         
        controller-problem{tx-error-passive}                                                                                                                                                                                                                                              
 (2020-01-29 16:46:00.993899)  can0  RX - -  20000004   [8]  00 20 00 00 00 00 00 00   ERRORFRAME                                                                                                                                                                                         
        controller-problem{tx-error-passive}                                                                                                                                                                                                                                              
 (2020-01-29 16:46:00.994153)  can0  RX - -  002   [1]  08                                                                                                                                                                                                                                
 (2020-01-29 16:46:00.994184)  can1  TX - -  002   [1]  08                                                                                                                                                                                                                                
[...]
 (2020-01-29 16:46:01.001755)  can0  TX - -  002   [1]  08
 (2020-01-29 16:46:01.002159)  can1  RX - -  002   [1]  09
 (2020-01-29 16:46:01.002306)  can0  TX - -  002   [1]  09
 (2020-01-29 16:46:01.002798)  can1  RX - -  004   [8]  00 08 00 00 00 00 00 00                                                              

@jbakuwel
Copy link

jbakuwel commented Jan 29, 2020

@marckleinebudde I'm using a bitrate of 500kbps. I've been doing two tests. For the first test I connect can0 to can1 with two wires and use cangen and candump. This test appears to run fine, at least for a day or so. Hence the need for another test that reliably reproduces the issue(s).

This other test uses my Python script on a live CAN bus with plenty of traffic (@ 500kbps) that normally connects two nodes. For my test, both nodes are connected to the Pi HAT instead of directly to each other. The Pi running my Python script functions as the wires mentioned in the first test. If this can run for an extended period of time without causing issues either showing up in the Pi kernel log or my Python script or in the live CAN system, we can hopefully conclude that the issues with the driver have been resolved.

I've added a test for error frames so they are not passed on in the code but I do not see them appear (ie. makes no difference). Maybe these are not passed on by the Python CAN library.

The Seeed Studio Pi Hat also has two mcp2517fd's, but those are connected to spi0 and spi1.

After a while (sometimes 15 minutes, sometimes an hour), I see:

[47007.453669] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1118
[47007.453792] mcp25xxfd spi1.0: ECC double bit error at 800
20200130-113800: b->v: CanError('Transmit buffer full')

This suggests a timing issue / bug in the driver causing a data corruption.

from which the driver does not recover. I can't restart it either:

ip link set can1 type can restart
RTNETLINK answers: Invalid argument

but have to reload the kernel module.

I'm bringing the devices up at boot with:

set_can () {
ip link set $1 up type can bitrate 500000 restart-ms 100 fd off
ifconfig $1 txqueuelen 65536
ifconfig $1 up
}

@jbakuwel
Copy link

Hi folks,

Back at testing the driver after having to rebuild the test system due to a failing SD card.
I've started at the beginning; ie with the driver as currently published by Seeed Studio.
Now getting a slightly different message after this commit.

Kernel log filling up with heaps of these (possibly one message for each forwarded frame while running my 2nd test see above):

Feb 12 08:40:28 mitm kernel: [43531.353724] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
Feb 12 08:40:28 mitm kernel: [43531.353831] mcp25xxfd spi1.0 can1: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
Feb 12 08:40:28 mitm kernel: [43531.354085] mcp25xxfd spi1.0 can1: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
Feb 12 08:40:28 mitm kernel: [43531.354318] mcp25xxfd spi1.0 can1: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
Feb 12 08:40:28 mitm kernel: [43531.354561] mcp25xxfd spi1.0 can1: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
Feb 12 08:40:28 mitm kernel: [43531.354695] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo

@marckleinebudde
Copy link

Hey @jbakuwel, good to see you back!
In the mean time, I've squashed my driver into a nice patch stack, which is available here:
https://github.com/marckleinebudde/linux/tree/v4.19-rpi/mcp25xxfd-20200211-2
I'd really appreciate if you give it a go.

@jbakuwel
Copy link

Hey @marckleinebudde, glad to be back :-)
Still very busy with other stuff so working on this in between other things. Do you want me to test the complete kernel with driver from your repo or just the driver, ie. are there changes to other parts of the kernel that are relevant as well?

@jbakuwel
Copy link

Interesting point to note: despite the kernel messages mentioned above:
Feb 12 08:40:28 mitm kernel: [43531.353724] mcp25xxfd spi0.0 can0: Something is wrong - we got a TEF interrupt but we were not able to detect a finished fifo
the live CAN system (so far) seems to be happy. I'll let it run and expect to see ECC errors showing up after a while; after all nothing has changed yet in that department.

@marckleinebudde
Copy link

@jbakuwel The driver is based on the latest v4.19-rpi kernel and includes some backports for the spi-aux driver. Using the latest stable kernel is always a good idea. So compile the whole kernel.

@jbakuwel
Copy link

Update (still on current Seeed Studio driver and 4.19.97-v7l+ kernel):

Feb 12 09:16:13 test kernel: [45675.865861] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:03:43 test kernel: [59326.835049] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 13:03:43 test kernel: [59326.835426] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 3 to 6
Feb 12 13:14:35 test kernel: [59978.328018] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:18:49 test kernel: [60232.023712] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:19:32 test kernel: [60275.729571] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:20:58 test kernel: [60361.052802] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:21:16 test kernel: [60379.242129] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:26:37 test kernel: [60700.705055] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 13:26:37 test kernel: [60700.705152] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 3 to 6
Feb 12 13:32:31 test kernel: [61054.898096] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 13:32:31 test kernel: [61054.898234] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6

and after increasing the traffic a bit:
Feb 12 13:53:01 test kernel: [62284.237189] mcp25xxfd spi1.0: ECC single bit error at 000

Otherwise everything still running...

@jbakuwel
Copy link

As expected, the driver didn't recover after the first double bit ECC error. The "mode switches" are new... haven't seen those before. Kernel 4.19.97-v7l+ with current Seeed Studio driver.

@marckleinebudde, will now compile your kernel + driver and give that a go.

Feb 12 09:16:13 test kernel: [45675.865861] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:03:43 test kernel: [59326.835049] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 13:03:43 test kernel: [59326.835426] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 3 to 6
Feb 12 13:14:35 test kernel: [59978.328018] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:18:49 test kernel: [60232.023712] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:19:32 test kernel: [60275.729571] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:20:58 test kernel: [60361.052802] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:21:16 test kernel: [60379.242129] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 7 to 6
Feb 12 13:26:37 test kernel: [60700.705055] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 13:26:37 test kernel: [60700.705152] mcp25xxfd spi0.0: Controller unexpectedly switched from mode 3 to 6
Feb 12 13:32:31 test kernel: [61054.898096] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 13:32:31 test kernel: [61054.898234] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 3 to 6
Feb 12 13:53:01 test kernel: [62284.237189] mcp25xxfd spi1.0: ECC single bit error at 000

Feb 12 15:00:23 test kernel: [66326.870662] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.870799] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a1108
Feb 12 15:00:23 test kernel: [66326.870819] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 6 to 3
Feb 12 15:00:23 test kernel: [66326.870839] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.870979] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.871023] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.871043] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.871186] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.871230] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.871250] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.871389] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.871433] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.871452] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.871591] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.871635] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.871655] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.871793] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.871837] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.871857] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.871996] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.872040] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.872059] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.872198] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.872242] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.872261] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.872402] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.872446] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.872466] mcp25xxfd spi1.0: ECC double bit error at 000
Feb 12 15:00:23 test kernel: [66326.872605] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.872650] mcp25xxfd spi1.0: unidentified system interrupt - intf = b91a9100
Feb 12 15:00:23 test kernel: [66326.872805] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.872997] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.873190] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.873382] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.873574] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.873766] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.873958] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.874276] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3
Feb 12 15:00:23 test kernel: [66326.874603] mcp25xxfd spi1.0: Controller unexpectedly switched from mode 7 to 3

@marckleinebudde
Copy link

@jbakuwel Due to the errata the mcp2517fd can experience TX MAB underflows. With this error it will switch from CAN-2.0 Mode (6) to Restricted Mode (7). Then the driver brings the device back into CAN-2.0 Mode. However from the driver's point of view the transition from Restricted to CAN-2.0 (6 -> 7) is unexpected.

For yet unknown reasons, in contradiction to the errata the devive will sometimes switch to Listen Only Mode (3) instead of Restricted. This (6 -> 3) and the recovery (3 -> 6) are also not expected by the driver.

A debug version of my driver read the mode value several times and it seems the device sometimes changes first into Listen Only and then into Restricted:
CAN-2.0 ---<TX MAB Underflow>---> Listen Only --> Restricted
I don't know if the Listen Only -> Restricted change creates an interrupt.

I cannot explain the 7 -> 3 transition. Might be a recovery 7 -> 6, followed immediately by a TX MAB error, where the device changes into Listen Only. The interrupt handler's delay might be so high, that the device already changed to Listen Only.

Long story short, the driver doesn't take all these mode changes into account.

#define MCP25XXFD_CAN_CON_MODE_LISTENONLY 3
#define MCP25XXFD_CAN_CON_MODE_CAN2_0 6
#define MCP25XXFD_CAN_CON_MODE_RESTRICTED 7

@jbakuwel
Copy link

@johandc Building the entire kernel might take some computer time but not much of your time :-)

mkdir /opt/kernel
git clone --depth=1 --branch v4.19-rpi/mcp25xxfd-20200211-27 https://github.com/marckleinebudde/linux.git
cd linux
KERNEL=kernel7l
make bcm2711_defconfig
make -j4 zImage modules dtbs
sudo make modules_install
sudo cp arch/arm/boot/dts/.dtb /boot/
sudo cp arch/arm/boot/dts/overlays/
.dtb* /boot/overlays/
sudo cp arch/arm/boot/dts/overlays/README /boot/overlays/
sudo cp arch/arm/boot/zImage /boot/$KERNEL.img

Now reboot and you're running the kernel with the required/recommended SPI patches as well as Marc's driver.

@jbakuwel
Copy link

@marckleinebudde Is half-duplex mode enabled or disabled by default? In other words, would I see any difference in behaviour between -17 and -27 in that sense?

@jbakuwel
Copy link

@marckleinebudde Driver -27 fails to load on the RPi4 with:
Apr 30 11:15:53 test kernel: [ 131.464702] mcp25xxfd: probe of spi1.0 failed with error -22
Apr 30 11:15:53 test kernel: [ 131.504698] mcp25xxfd: probe of spi0.0 failed with error -22
Reverted back to -21, which works fine.

@marckleinebudde
Copy link

marckleinebudde commented May 1, 2020

Hey @jbakuwel,

thanks for the report, the problem is fixed. Meanwhile I've rebased to the latest raspi v4.19 kernel (note: the branch has a new timestamp) and backported some SPI improvements, reducing the overall time of a TX message.

https://github.com/marckleinebudde/linux/tree/v4.19-sunxi/mcp25xxfd-20200429-34

https://github.com/marckleinebudde/linux/tree/v4.19-rpi/mcp25xxfd-20200429-35

@jbakuwel
Copy link

jbakuwel commented May 3, 2020

@marckleinebudde Somehow you lost bcm2711_defconfig in this tree:
*** Can't find default configuration "arch/arm/configs/bcm2711_defconfig"!
No problem; I copied it from the previous version but you may want to add it back in :-)

@marckleinebudde
Copy link

Hey @jbakuwel,

handling too many trees right now, the raspi tree is:

https://github.com/marckleinebudde/linux/tree/v4.19-rpi/mcp25xxfd-20200429-35

The above wrong link was for the sunxi platform.

Sorry for the noise :)

@jbakuwel
Copy link

jbakuwel commented May 4, 2020

Hey @marckleinebudde,

I guess I could have spotted that. Too quick with my copy / paste ;-)
-35 up & running now.

Thanks for all your work on this!

@marckleinebudde
Copy link

marckleinebudde commented May 11, 2020

@Pillar1989
Copy link
Member

@marckleinebudde Thank you. We will update it as soon as possible, so that users can use your latest software directly. Let's test and perfect it together.

@t-kopp
Copy link

t-kopp commented May 13, 2020

@marckleinebudde I tested -36 for a bit now and it looks really stable - I didn't see any issues in either classic or FD mode. Will continue testing.

@kjngineering
Copy link

Hi @marckleinebudde . I wasn't sure if it was appropriate to comment here or on your github, but seeing as this is active I will comment here. I am currently using the MCP2515 on a Allwinner V3S (essentially a single core H3, identical SPI driver). Currently we are working on a revision and the new board design accepts either MCP2515 or MCP2518FD - with the hope of utilizing a higher SPI frequency (to save time for other devices on the SPI bus. e.g. display)

From reading your notes above, if I have a 40Mhz clock on the MCP2518FD, the appropriate SPI frequency would be ~17.64MHz? What are the increments on the H3 driver, and what is the limitation here?

I am currently using buildroot with mainline 5.3.13, its it possible I can package your driver as an out of tree driver (e.g. build patch against mainline /drivers)? Is there anything that I need to be aware of here, I guess I need to use the sunxi branch in your repository? any other advice?

Thankyou

@marckleinebudde
Copy link

Hey @kjngineering,

I hijacked this issue while writing my driver, so better ask on my github next time :)

There are two limitations on the usable SPI clock speed. One is a yet unknown issue with the mcp25xxfd: it doesn't properly work with SYSCLOCK / 2 speed. Which would be 40 MHz / 2 = 20 MHz in your case. The other is the limitation of the sun6i SPI IP core and the H3's clock tree. The clock tree can provide 600 MHz and you can use multiple-of-two as a divider. With experiments I figured out that 600 MHz / 32 = 18.75 MHz is unstable and 600 MHz / 34 = 17.65 MHz seems to be stable.

Yes, you can package my driver as an external one, or port the patches to your branch. Note: you'll need the sun6i spi patches, too. Drop me a note if you need commercial support for this.

@kowalikm
Copy link

Hey @marckleinebudde,

We have problem when using your version of kernel:

[    6.799566] mcp25xxfd spi0.0 (unnamed net_device) (uninitialized): CRC read error at address 0x0000, length 4.
[    6.799693] mcp25xxfd spi0.0 (unnamed net_device) (uninitialized): CRC read error at address 0x0000, length 4.
[    6.866262] mcp25xxfd: probe of spi0.0 failed with error -74
[    6.877261] mcp25xxfd spi1.0 can0: MCP2517FD rev0.0 (-RX_INT +MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD m:20.00MHz r:18.50MHz e:17.85MHz) successfully initialized.
[    6.894772] mcp25xxfd spi1.0 can1: renamed from can0

Any advice on this? We are using RPi 3B+ and SeedStudio.

@marckleinebudde
Copy link

Hey @kowalikm,

which kernel are you exactly using?

@kowalikm
Copy link

kowalikm commented Jun 30, 2020

@marckleinebudde I have used this branch: v4.19-rpi/mcp25xxfd-20200429-36

uname -a:

Linux rpi 4.19.118-v7+ #1 SMP Tue Jun 30 13:00:09 CEST 2020 armv7l GNU/Linux

@marckleinebudde
Copy link

marckleinebudde commented Jun 30, 2020

Use something more recent, like https://github.com/marckleinebudde/linux/tree/v5.7-rpi/mcp25xxfd-2020618-44

The error message probe of spi0.0 failed with error -74 indicates that there is a communication problem with one of the mcp2517fd (the one that's attached to spi0.0). Please check the electrical connection.

@kowalikm
Copy link

kowalikm commented Jul 3, 2020

@marckleinebudde Thanks for reply.

I've also checked v4.19-rpi/mcp25xxfd-20200429-41 but there was the same error as above. And this is no electrical connection problem because in version v4.19-rpi/mcp25xxfd-20200211-21 it works good.

However with v21 there is a problem that when CAN receiver is not connected and we send frames (~1000/sec) RPi hangs after some time (20-30seconds). Do you have any ideas what can we check?

Also could you provide some instructions how can I compile kernel v5.7? (unfortunately instructions that works for kernel 4.19 does not work here)

@irrwisch1
Copy link

@marckleinebudde: I tried the latest sunxi -46 branch but still can't get it to run reliable on my hardware.
The strange thing is, it works fine for a few minutes (with a few thousand frames received and transmitted) and then the driver suddenly stops receiving and transmitting frames.
Even with debug printing enabled and also the additional logging option in Kconfig I don't see anything in dmesg.
And also no error counters going up in ifconfig.

On the other hand, msperls original driver runs just fine (with his queue optimizations removed).

Any idea what I can do to diagnose the issue further?

@marckleinebudde
Copy link

Hey @irrwisch1,

are you using the proper IRQ type? The correct one is probably IRQ_TYPE_LEVEL_LOW. See:
marckleinebudde/linux@fdce9f5#diff-b0098f47cd6958a4951df22fb1074161R60

@marckleinebudde
Copy link

marckleinebudde commented Jul 13, 2020

Hey @kowalikm,

the driver fails with CRC error, you can disable CRC check by removing:
MCP25XXFD_QUIRK_CRC_REG | MCP25XXFD_QUIRK_CRC_RX | MCP25XXFD_QUIRK_CRC_TX for all three mcp25xxfd_devtype_data_mcp2517fd mcp25xxfd_devtype_data_mcp2518fd and mcp25xxfd_devtype_data_mcp25xxfd here:

https://github.com/marckleinebudde/linux/blob/v4.19-rpi/mcp25xxfd-20200429-46/drivers/net/can/spi/mcp25xxfd/mcp25xxfd-core.c#L30

@irrwisch1
Copy link

@marckleinebudde: Seems like this did the trick. I still had this set on falling edge. Thanks!

@kjngineering
Copy link

kjngineering commented Sep 8, 2020

I am using @marckleinebudde's -47 release that was pushed to linux-can-next for mainline:
https://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git/log/?h=mcp25xxfd-47

After a few patching issues due to the driver being broken up and some changed flags in device tree I thought I was getting some success:

[ 1.255123] CAN device driver interface
[ 1.259309] spi_master spi0: will run message pump with realtime priority
[ 1.304566] mcp25xxfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD m:8.82MHz r:8.82MHz e:8.33MHz) successfully initialized.

However that was short lived; I couldn't receive a single message (candump 500k) without getting a crc error:

[ 48.469759] mcp25xxfd spi0.1 can0: CRC read error at address 0x001c (length=4, data=00 00 1a 3f, CRC=0x1e7c).
[ 48.479730] mcp25xxfd spi0.1 can0: IRQ handler returned -74 (intf=0x3f1a0002).

I used the advice and disabled the CRC flags which got me some results (I was able to load-test once at around 15% bus utilisation at 500k), however now I am getting persistent errors after a few packets:

[ 382.629705] mcp25xxfd spi0.1 can0: RX tail of chip (26) and ours (27) inconsistent.
[ 382.637422] mcp25xxfd spi0.1 can0: IRQ handler mcp25xxfd_handle_rxif() returned -84.
[ 382.645189] mcp25xxfd spi0.1 can0: IRQ handler returned -84 (intf=0x3e1a0002).

It seems horribly inconsistent, sometimes high busload is ok, sometimes anything more than 250ms in-between will fault the driver. This seems related to ECC, next build I think I might try and disable this and see if I have any more luck.

Anyone got any more thoughts?

@kjngineering
Copy link

Anyone got any more thoughts?

Just to follow up: building the driver without EEC support didn't help. Still get RX tail errors.
again, wildly inconsistent and an ip link down / up might only keep it going for a few more frames.

@RaphyJake
Copy link

RaphyJake commented Nov 26, 2020

I'm using https://github.com/marckleinebudde/linux/tree/v5.4-rpi/mcp251xfd-20201022-55, and the canbus fails under heavy load (some hundreads messages/s ). This is what I read on dmesg:

[ 2338.525710] mcp251xfd spi1.0 can1: CRC error
[ 2338.525728] mcp251xfd spi1.0 can1: Stuff error
[ 2345.150009] mcp251xfd spi1.0 can1: CRC error
[ 2345.150028] mcp251xfd spi1.0 can1: Format error
[ 2347.707524] mcp251xfd spi1.0 can1: Stuff error
[ 2361.197609] mcp251xfd spi1.0 can1: Bit0 error
[ 2361.197768] mcp251xfd spi1.0 can1: New error state: 2
[ 2361.198138] mcp251xfd spi1.0 can1: Bit0 error
[ 2361.198526] mcp251xfd spi1.0 can1: Bit0 error
[ 2361.198910] mcp251xfd spi1.0 can1: Bit0 error
[ 2361.199060] mcp251xfd spi1.0 can1: New error state: 3
[ 2361.199384] mcp251xfd spi1.0 can1: bus-off
pi@raspberrypi:~/canvi/linux $ ip link show can1
4: can1: <NO-CARRIER,NOARP,UP,ECHO> mtu 16 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 100
    link/can 

Should I use some other version of the driver? Did anyone manage to get it working reliably on some other version? Could I be missing something really stupid? (I am not familiar at all with low level stuff)

@marckleinebudde
Copy link

Hey @RaphyJake,

meanwhile you can use the official rpi-5.4.y branch https://github.com/raspberrypi/linux/tree/rpi-5.4.y, however the official overlay is a bit different now. However the driver is the same.

Please describe your test setup, used CAN controllers and raspberries, who is sender/reciever and send messages more closely. Also please check the cabling of the CAN bus. Keep in mind that you cannot send the same message from two CAN controllers over the same bus (at the same time), you'll see these strange messages.

@RaphyJake
Copy link

I got my hands on a brand new raspberry pi 4, a brand new seed can shield and after a clean install everything works perfectly!
I guess the problem was on how I configured my older raspberry, which unfortunately I don't have access to at the moment. Thank you for your patience.

@ingjae
Copy link

ingjae commented Feb 6, 2023

Hello , @marckleinebudde

I'm using rpi4 and 5.4.140-rt64 kernel + 2ch CAN bus FD shield

Linux ubuntu 5.4.140-rt64 #1 SMP PREEMPT_RT Sat Oct 16 17:57:52 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

and I follow this document

after heavy traffic , error frames are detected(PCAN-view) , and motor driver detect something wrong (bus off)
when I use PCAN(Peak CAN) with socketCAN I can get Hz (control and get feedback) I want (Under 1Mbps),
But using 2ch CAN bus FD shield, It has frame drops(low Hz) and Eventually, bus off motor with the lowest priority (same code)

here is my dmesg

[    0.000000] Linux version 5.4.140-rt64 (runner@fv-az42-775) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #1 SMP PREEMPT_RT Sat Oct 16 17:57:52 UTC 2021 (Ubuntu 5.4.0-1044.48-raspi 5.4.140)
[    3.225849] spi-bcm2835 fe204000.spi: could not get clk: -517
[   10.937930] spi_master spi0: will run message pump with realtime priority
[   11.022588] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:0.00MHz) successfully initialized.
[   11.110721] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:0.00MHz) successfully initialized.
[  763.834108] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 90 49 f0, CRC=0x5d94) retrying.
[ 1014.408947] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 e1 b1 45, CRC=0xc8f1) retrying.
[ 1145.472032] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=00 24 2b 7e, CRC=0x9b19) retrying.
[ 1162.026867] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 3f a3 a5, CRC=0x282a) retrying.
[ 1162.045392] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=00 8d ae a5, CRC=0x0ffe) retrying.
[ 1162.062180] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=00 cc b8 a5, CRC=0xfee9) retrying.
[ 1162.085613] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 1a c7 a5, CRC=0xf2eb) retrying.
[ 1162.095969] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 6c cd a5, CRC=0x4850) retrying.
[ 1162.096003] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 72 cd a5, CRC=0x49c8) retrying.
[ 1162.108837] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=00 46 d5 a5, CRC=0x1a64) retrying.
[ 1162.183280] mcp251xfd spi0.1 can0: CRC read error at address 0x0010 (length=4, data=80 b6 02 a6, CRC=0xe49e) retrying.

what could be the problem ?

@marckleinebudde
Copy link

marckleinebudde commented Feb 6, 2023

This is unrelated to the original problem. Consider opening a new issue.

Have a look at the error frames with candump any,0~0,#FFFFFFFF -cexdtA

You can try to configure a bigger SJW:

sudo ip link set can0 type can bitrate 1000000
ip -details link show can0

Have a look at the phase-seg2 value. Take use phase-seg2/2 for sjw:

sudo ip link set can0 up type can bitrate 1000000 sjw 5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests