toolboards falling off the can bus #84
Comments
Indeed this looks like an arbitration issue. Let's look at the UUIDs in the initial response:
Notice the pattern: the UUID with the first unique 0 always wins. Arbitration is working as intended. What I don't understand is why the nodes go unresponsive after successive queries. It's possible that they are dropping off due to too many errors. This issue should affect Klipper as well. If you were to flash each of these nodes with Klipper and run either With regard to debugging, I looked at these EBB36/42 devices thinking I could order one and take a look. Unfortunately they don't break out the SWD pins, so debugging these particular devices is probably out. |
Hi, thanks for getting back to me :) is it correct that for swd pins swdio, swclk, 3v3 and gnd are necessary ? When looking in datasheet for the g0b1 https://www.st.com/resource/en/datasheet/stm32g0b1cc.pdf I think that SWDIO and SWCLK are on PA13 and PA14 respectively. (pls correct me if i'm wrong) And the schematic of this board does mention those pins: https://github.com/bigtreetech/EBB/blob/master/EBB%20CAN%20V1.1%20(STM32G0B1)/EBB42%20CAN%20V1.1/Hardware/BIGTREETECH%20EBB42%20CAN%20V1.1-SCH.pdf In the same schema I do see BOOT0 going to another block, namely a button. But the other pin SWDIO is not marked as not-connected. Also, on the bottom of the board there are four pads next to each other that are not in the document, i will measure them now to see if they're connected to the pins on the mcu :) |
I looked wrong, the swdio goes to an led... still will search for something undocumented |
I probed every pad but cannot find another place where the PA13 and PA14 are exposed. Thinking of a couple of ways forward:
The issue indeed also happens in klipper. I'm also wondering what is the best way forward to create a ticket for klipper, since it seems that they are closed in github and only directing to Discord ? thanks |
Still wondering about the arbitration. I see indeed that the 0 bits all the time win and therefore the sequence. But is this how CAN is supposed to work ? I understood from some online references that the ID field should be used for arbitration (unless extended bit is set then also the extended id - but that is not set in this case). When looking at a pcap of the query unassigned (see attachment - this one is from when klipper is running), the id is all the time 0x3f1 (which is also what linux side listens on for the answer - well I don't need to tell you that :) haha) but still, the "zero-fallthrough" is happening in the UUID which is part of the payload. Does CAN work in such a way, that if the ID is the same, that arbitration just goes on for consecutive bits - even into the payload - until it gets a 1 and then goes quiet (losing) ? |
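The electrical side of this question can be illustrated with a small simulation. This is a minimal sketch, not firmware behavior: it assumes every node starts transmitting at the same instant, models the bus as a wired-AND (a dominant 0 overrides a recessive 1), and leaves out the control bits, bit stuffing, CRC and ACK of a real frame. The helper names and the one-byte payloads are hypothetical stand-ins for two different UUIDs. It only shows where a node first reads back a level it did not send; how a controller classifies that mismatch outside the arbitration field (identifier plus RTR bit in a standard frame) is deliberately not modeled.

```python
def to_bits(can_id, payload):
    """Return the 11 ID bits followed by the payload bits, MSB first."""
    bits = [(can_id >> i) & 1 for i in range(10, -1, -1)]
    for byte in payload:
        bits += [(byte >> i) & 1 for i in range(7, -1, -1)]
    return bits


def simulate_bus(frames):
    """Simulate nodes that start transmitting at the same instant.

    frames maps a node name to its bit list (0 = dominant, 1 = recessive).
    Returns (nodes still transmitting at the end,
             {node: index of the first bit it read back differently}).
    """
    active = set(frames)
    dropped = {}
    length = min(len(bits) for bits in frames.values())
    for i in range(length):
        # Wired-AND bus: a single dominant (0) bit pulls the line low.
        bus = min(frames[name][i] for name in active)
        losers = {name for name in active if frames[name][i] != bus}
        for name in losers:
            dropped[name] = i
        active -= losers
    return sorted(active), dropped


# Hypothetical payloads; both nodes use the same ID, 0x3F1.
frames = {
    "node_a": to_bits(0x3F1, bytes([0x12])),
    "node_b": to_bits(0x3F1, bytes([0x34])),
}
winners, dropped = simulate_bus(frames)
print(winners, dropped)
```

With identical IDs, the first 11 bits never differ, so the first mismatch falls in the payload: the node whose payload has the earlier 0 keeps going and the other one reads back a dominant bit it did not send.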
the udelay isn't foolproof .. still having devices becoming unresponsive, just not so quickly. was to be expected ofc, need something better |
I got openocd to work, able to see what's going on with connected gdb !! :) (wiring it up was not easy i can tell :>) I have total of five devices connected on the bus, reverted above change where it does the random sleep (reverted in order to easily reproduce the issue). The device that's connected to the st-link v2 has a high uuid, so i am able to let it go silent. Then checking gdb and seeing things are still running, often spidev_transfer. Next up is setting a breakpoint in the can_process_query_unassigned ? exactly at place where i had the sleep earlier. Thinking that if the breakpoint hits, that I know at least that the request was interpreted and that something goes wrong with sending .. |
My understanding is that arbitration occurs during the entire frame. After each attempt to send a "1" the node checks the line; if it's driven low then another device has the bus. I took a brief look at the reference manual, and it sounds like the nodes are going in to a As far as I can tell in |
thanks for helping :) I will create ticket on klipper git, would like to firstly get this BO state from register printed so i can include the info in there too. But I cannot get it to be printed out. Things that I try:
can you give me a tip on how i can read out the BO state in this register ? |
I had following conversation with chatgpt :) what do you think of it, could it really not be that only the ID should be used for arbitration and that having multiple devices using the same id result in bit errors and therefore incrementing the TEC (transmit error counter) resulting the device going into bus_off state ?
GPT-4
User
ChatGPT
Here are some reasons for a CAN node to enter the Bus-Off state:
Transmit Error Counter Exceeds Limit: The CAN standard specifies that if the Transmit Error Counter (TEC) exceeds 255, the device will enter a Bus-Off state. This usually happens when a node consistently detects errors while trying to send messages, indicating that there might be an issue with its own transmissions.
Persistent Errors on the Network: If a node consistently observes errors (either in its own transmissions or in others), it might imply network issues like noise, incorrect bit timing settings, or a malfunctioning node disrupting communication. Continuous errors can cause the error counters to increase rapidly.
Incorrect Bit Timing Configuration: If the CAN bit timing parameters are not configured correctly for the given network speed or do not match with the other nodes in the network, this can lead to consistent frame errors and eventually a Bus-Off state.
Physical Layer Issues: Problems like incorrect termination, broken cables, or connectors, and noise interference can lead to persistent errors on the bus. A common issue is the lack of proper termination resistors (typically 120 Ohms) at both ends of the CAN bus.
Software Bugs: Firmware in a CAN node might inadvertently send malformed messages or respond incorrectly to received messages, leading to errors. For example, a node sending a message with an incorrect CRC value will cause error frame generation.
Higher Layer Protocol Violations: If you're using higher-layer protocols like CANopen or J1939 on top of the base CAN protocol, not adhering to their specific rules can also trigger error conditions.
Electrical Disturbances: Transients, spikes, or other electrical disturbances can introduce errors on the CAN bus. Proper grounding and shielding can mitigate such issues.
When the STM32 enters the Bus-Off state, it will stop participating in bus activities and won't send or acknowledge any message. To recover from this state, the software often has to initiate a recovery process, which involves waiting for a specific number of bus idle time (11 consecutive recessive bits) and then resetting the CAN controller or its error counters.
To diagnose and solve Bus-Off problems:
Monitoring: Use a CAN bus monitoring tool to inspect the traffic and see if there are nodes sending erroneous frames.
User
ChatGPT
Here are the reasons for which the TEC can be incremented:
Bit Errors: A bit error occurs when a transmitting node detects a difference between the transmitted bit and the monitored bit on the bus. For example, if the node sends a dominant bit (logic 0) but reads a recessive bit (logic 1) or vice versa, it would count as a bit error. The TEC is incremented by 8 for a bit error during transmission.
Form Errors: A form error is detected by a transmitting node when a fixed-form bit field contains illegal bits. This typically occurs in fields like the CRC delimiter, acknowledgment delimiter, or end of frame (EOF). For a form error, the TEC is incremented by 8.
Stuff Errors: In CAN, a bit-stuffing mechanism is used to ensure synchronization. If five consecutive bits of the same level (either all dominant or all recessive) are detected on the bus, a node will insert a bit of the opposite polarity. If the transmitting node detects an absence of this "stuffed" bit, it signifies a stuff error. For a stuff error, the TEC is incremented by 8.
CRC Errors: A transmitting node calculates and sends a CRC value at the end of every CAN message. If the transmitted CRC doesn't match the CRC calculated by a receiving node, a CRC error is flagged by the transmitting node when it detects the error flag sent by the receiving node. The TEC is incremented by 8 for a CRC error.
Acknowledgment Errors: After sending a message, the transmitting node waits for an acknowledgment. If no other node on the network acknowledges the transmitted message by sending a dominant bit during the ACK slot, the transmitting node detects an acknowledgment error. For an acknowledgment error, the TEC is incremented by 8.
Recessive Bit During Dominant Phase: If a node detects a recessive bit during a time when it expects all nodes to send a dominant bit, it's an error. This condition increments the TEC but is generally included in the other error types listed above.
In general, whenever the CAN protocol detects an error during transmission, it attempts to signal other nodes about the error using an error flag. This flag consists of six dominant bits in a sequence. If nodes detect the error flag, they will also start transmitting their own error flags. This process ensures synchronization of error handling across the network.
To maintain bus reliability, when the TEC exceeds 127, the CAN controller enters an "Error Passive" state. In this state, the controller no longer sends active error flags if it detects errors but still functions. If the TEC exceeds 255, the controller enters the "Bus Off" state, where it is effectively disconnected from the network until a recovery process is initiated.
User
ChatGPT
In the CAN protocol, the arbitration process determines the message priority based solely on the ID: lower ID values have higher priority. When two nodes begin to transmit at the same time, they will start monitoring the bus while they send. They continue transmitting as long as what they send matches what they see on the bus. The first discrepancy between what's being sent and what's being monitored on the bus will occur when one node is sending a recessive bit while the other is sending a dominant bit (because dominant bits will override recessive bits). At this point, the node sending the recessive bit recognizes that what it sent doesn't match what's on the bus and stops transmitting, leaving the other message (with the more dominant ID) to continue.
However, if two nodes with the same ID start transmitting their messages simultaneously, the arbitration process won't determine a winner based on the ID, since the IDs are identical. The arbitration would then proceed to the next field. If the data or any subsequent field in the CAN message is different between the two nodes, one node might be transmitting a dominant bit while the other transmits a recessive bit. The node transmitting the recessive bit will detect a mismatch between what it sent and what it reads from the bus, resulting in a bit error. The node detecting the bit error will then abort its transmission and signal an error frame.
If by chance both nodes with the same ID also have identical data and other fields (making the whole message identical), they might both successfully transmit without detecting an error. However, this behavior is unpredictable and depends on exact timings, making it an unreliable scenario.
In conclusion, having multiple devices with the same ID on a CAN bus is not advisable. It can lead to unpredictable behavior, potential bit errors, and data collisions, compromising the integrity and reliability of communication on the bus. |
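The counter thresholds quoted above can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch under stated assumptions: every failed transmission adds 8 to the TEC, no successful transmission happens in between (in CAN, a success would lower the counter by 1), and the special cases the protocol defines for error-passive nodes are ignored. The function names are hypothetical.

```python
ERROR_PASSIVE_ABOVE = 127  # TEC above this value: error passive
BUS_OFF_ABOVE = 255        # TEC above this value: bus off
STATES = ("error-active", "error-passive", "bus-off")


def state_for(tec):
    """Map a transmit error counter value to a fault-confinement state."""
    if tec > BUS_OFF_ABOVE:
        return "bus-off"
    if tec > ERROR_PASSIVE_ABOVE:
        return "error-passive"
    return "error-active"


def failures_until(target_state, tec=0):
    """Count back-to-back failed transmissions (+8 each) to reach a state."""
    if target_state not in STATES:
        raise ValueError(f"unknown state: {target_state}")
    failures = 0
    while STATES.index(state_for(tec)) < STATES.index(target_state):
        tec += 8
        failures += 1
    return failures


print(failures_until("error-passive"))
print(failures_until("bus-off"))
```

Under these assumptions a node needs 16 consecutive failed transmissions to become error passive and 32 to go bus off, so a node that keeps losing a repeated same-ID contest could plausibly get there within a handful of queries if each query triggers several failed retries.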
i don't see an error frame being emitted by the toolboard though... |
The response says an error frame is "signaled", that may just mean internally. That sounds correct. If so, the solution is rather simple:
|
I already commented out the part in my local flash_can.py script so that it doesn't do the query before flashing (just for debugging this issue), but still in my setup i need to do multiple of these query_unassigned_id requests (i'm using custom software on linux side to talk to multiple toolboards) - is not in a printer but whole other usecase where i want to control many stepper motors :) But I think it would be useful if the query works on multiple can toolboards in the normal usecase of a printer (with multiple print heads?) as well.. Wouldn't a better solution be to make sure that devices don't use the same id ? I mean it's not proven yet that this causes it and I understand that this would be a protocol change which is probably not what people are waiting for :) Also I have to admit that i didn't measure resistance on the bus earlier (stupid). But now I did and noticed 30-something-ohms, apparently the cable that came with the peak usb<>can transceiver has two 120ohms in there, one on each side. So I had 3x 120ohm on the bus. Still I had the same issue with the STM32 (candlelight) transceiver which does have only one 120ohms (while i have one terminator myself on other end), so I think this is not the reason. Anyway I will remove one resistor from that cable and test it again. If it still reproduces I will open the ticket on klipper repo. |
This might help for checking the physical layer. |
I have just now removed one resistor, measured, only seeing 58.6ohms when everything is connected, after that re-did the query test and saw devices going silent, so it wasn't that. |
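For comparison with the measured values, the ideal resistance between CAN-H and CAN-L is simply the terminators in parallel. A minimal sketch (the function name is hypothetical, and it ignores the transceivers and wiring, which is why a real measurement can come out slightly lower):

```python
def parallel(resistances_ohm):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances_ohm)


print(round(parallel([120, 120]), 1))       # two terminators
print(round(parallel([120, 120, 120]), 1))  # three terminators
```

Two 120 ohm terminators give 60 ohms and three give 40 ohms, which fits the readings before and after removing the extra resistor.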
Hi @NAPCAL, thanks. Yes, I think the rest is in order (I don't use twisted pair at the moment, but the total wire length is quite short, about one meter I think). Other communication also seems to go fine, although I still have to check for other situations where toolboards use the same ID at the same time. |
The purpose of the query is to get a device UUID so the application (be it Klipper or flashtool.py) can assign a unique node ID. The query is the only command that can transmit a response to a duplicate node ID. Once you know the device UUIDs it shouldn't be necessary to query them again unless you lose them. As you mention, an attempt to pre-assign node IDs would require a change in protocol. Assignment and ID conflict resolution would have to be moved to the MCU. I'm not open to deviating from Klipper's protocol due to the maintenance burden, so this proposal would need to be discussed and implemented in Klipper to make it in to Katapult. |
@ldr On my setup, I am using four BTT EBB42, one PITB (dual 5160HV), and BTT EBB SB2240 on a dual CAN hat (SeeedStudio). FYI the UUID’s come from the CAN bus transceivers so if you have a conflict, change the transceiver or put them on different CAN bus’ |
afaik the uuid's come from a hashing function on the serial of the mcu, it's not the uuid's that are conflicting, but I believe it's because of the same id (can id != uuid) - the can id is also programmatically set |
The UUID is generated from the MCU's unique serial number. Indeed, it's hashed. The node ID is set by the application and initialized to 0. Applications must know the UUID to set the node ID. |
Some random site where it is mentioned that having multiple devices with same can id may result in bus-off state:
|
It seems clear that the nodes are going bus off even if it hasn't been verified by the debugger. The fact that they recover after a reset is the giveaway. The issue with the query command is that all of the nodes transmit their response to the |
hey i want to thank you Arksine for the help so far, and Napcal too, tnx. I understand best place is to create a post on discourse explaining things and try to come up with a proposal :) therefore closing this ticket - i'll be back if there is a change that may be backported from klipper ! |
For reference, followup on Klipper Discourse: https://klipper.discourse.group/t/can-toolboards-falling-off-of-bus-after-query-unassigned/10161/1 |
Hi. I'm not on discourse, so I figured I'd post here regarding this issue. I found this closed issue because I was facing the same problem. My u2c (and ebb42) with canboot would drop off and the interface would disappear. Here's what I came to find out: When queried by the Canboot flash script using the uuid the board is reset into bootloader mode and thus drops off. From there it will show up in the serial devices and can be reflashed using
This worked for me multiple times. I hope the information helps someone else also. |
Hi, we spoke this afternoon on Discord (@Arksine), but thinking this may be a better place to continue discussion about this issue.
I have the issue described below with Klipper, with an earlier version of Canboot, and with the latest version of Katapult (master when I tried it, at git rev abd1545).
I have a setup where there are 14 BTT EBB42 (STM32G0B1) devices connected to the CAN bus. When I run the flash_can.py -q tool, I see the following output:
Candump shows:
If I then run the same flash_can.py -q a second time, I see fewer devices on the bus:
And on candump:
Every time the query for unassigned devices is done, the number of responding toolboards goes down. This goes on until two or sometimes one device is left and the rest no longer answer.
I can then only bring back the missing devices by resetting them (button press on the toolboard).
--
Notice that the devices are always listed from low to high uuid.
This low-to-high made me think of the bus arbitration, where the lowest value always gets precedence. But still I thought that was based on the id, which is 3F1 here all the time. But then again, how can bus arbitration work if the id's are always the same ?
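One way to check the low-to-high observation against a capture is to parse the candump text and compare payloads. This is a minimal sketch that assumes the default candump line layout (interface, hex ID, length in brackets, hex data bytes); the function names and the sample frames are made up for illustration and are not taken from the capture in this issue.

```python
def parse_candump_line(line):
    """Parse one line of default candump output into (iface, id, data)."""
    fields = line.split()
    iface = fields[0]
    can_id = int(fields[1], 16)
    length = int(fields[2].strip("[]"))
    data = bytes(int(b, 16) for b in fields[3:3 + length])
    return iface, can_id, data


def payloads_ascending(lines, can_id=0x3F1):
    """True if frames with the given ID appear in ascending payload order."""
    payloads = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the capture
        _, frame_id, data = parse_candump_line(line)
        if frame_id == can_id:
            payloads.append(data)
    # bytes compare lexicographically, i.e. most significant byte first
    return payloads == sorted(payloads)


# Invented sample frames, not real device responses.
sample = [
    "  can0  3F0   [1]  00",
    "  can0  3F1   [8]  20 01 02 03 04 05 06 00",
    "  can0  3F1   [8]  20 01 02 0A 04 05 06 00",
    "  can0  3F1   [8]  20 7F 02 03 04 05 06 00",
]
print(payloads_ascending(sample))
```

Byte-wise comparison matches the order in which the bits reach the wire, so an ascending result is what one would expect if dominant bits in the payload decide who keeps transmitting.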
I was given advice to try a debugger, but I have no experience with STM32 programming or debugging yet.
I did wonder what happens if I make sure that the devices do not answer at the same time, so I put a random sleep before replying at the following location:
https://github.com/Arksine/katapult/blob/master/src/generic/canserial.c#L154
When using this, all the devices remain working, now the flash_can.py -q shows:
And I can repeat that many times, all devices stay on the bus.
Notice that sequence is now also randomized.
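Why a random delay helps, and why it may not help every time, can be estimated with a toy model. This is a sketch under loud assumptions, none of which come from the firmware: each node becomes ready at a uniformly random time within a hypothetical delay window, a node that becomes ready while the bus is busy waits for idle, and a "contended start" is any moment where more than one node begins transmitting together. The 130 microsecond frame time is a rough guess for an 8-byte frame at 1 Mbit/s, and retransmissions after a contended start are ignored.

```python
import random


def contended_starts(ready_times_us, frame_time_us):
    """Count frame starts at which more than one node begins transmitting."""
    pending = sorted(ready_times_us)
    bus_free_at = 0.0
    contended = 0
    i = 0
    while i < len(pending):
        # A node that becomes ready while the bus is busy waits for idle.
        start = max(pending[i], bus_free_at)
        j = i
        while j < len(pending) and pending[j] <= start:
            j += 1
        if j - i > 1:
            contended += 1
        bus_free_at = start + frame_time_us
        i = j  # toy model: the whole group is treated as resolved
    return contended


def contention_probability(nodes, max_delay_us, frame_time_us=130.0,
                           trials=2000, seed=0):
    """Estimate the chance that one query causes a contended start."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ready = [rng.uniform(0.0, max_delay_us) for _ in range(nodes)]
        if contended_starts(ready, frame_time_us) > 0:
            hits += 1
    return hits / trials


for window_us in (1_000.0, 10_000.0, 100_000.0):
    print(window_us, contention_probability(14, window_us))
```

In this model a wider delay window lowers the chance of a contended start but never brings it to zero, which would be consistent with devices still dropping off occasionally, just less quickly.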
--
I think the solution I concocted is somewhat ugly, but for now it gets me a bit further, seeing that other traffic after short node-id assignment uses a unique id.
Could someone help me out making a nice solution for this issue ?
Thank you,
Leon