toolboards falling off the can bus #84

Closed
ldr opened this issue Aug 25, 2023 · 25 comments

Comments

ldr commented Aug 25, 2023

Hi, we spoke this afternoon on Discord (@Arksine), but thinking this may be a better place to continue discussion about this issue.

I see the issue described below with Klipper, an earlier version of CanBoot, and the latest version of Katapult (master at git rev abd1545 when I tried it).

I have a setup where there are 14 BTT EBB42 (STM32G0B1) devices connected to the CAN bus. When I run the flash_can.py -q tool, I see the following output:

Resetting all bootloader node IDs...
Checking for Katapult nodes...
Detected UUID: 00f22d4144ed, Application: Katapult
Detected UUID: 1317977e2691, Application: Katapult
Detected UUID: 1938ab445cd6, Application: Katapult
Detected UUID: 1c48bb32084e, Application: Katapult
Detected UUID: 2f8cda1c080d, Application: Katapult
Detected UUID: 524c7d22ceb7, Application: Katapult
Detected UUID: 5cdbc4430eb7, Application: Katapult
Detected UUID: 5ea9f1aad1ad, Application: Katapult
Detected UUID: 63da04283cb8, Application: Katapult
Detected UUID: 7ebe07793a00, Application: Katapult
Detected UUID: 863872fb3e8c, Application: Katapult
Detected UUID: a561e958aa96, Application: Katapult
Detected UUID: c2645270504d, Application: Katapult
Detected UUID: e80c6d2a5c83, Application: Katapult
Query Complete

Candump shows:

  can0  3F0   [1]  12
  can0  3F0   [1]  00
  can0  3F1   [8]  20 00 F2 2D 41 44 ED 11
  can0  3F1   [8]  20 13 17 97 7E 26 91 11
  can0  3F1   [8]  20 19 38 AB 44 5C D6 11
  can0  3F1   [8]  20 1C 48 BB 32 08 4E 11
  can0  3F1   [8]  20 2F 8C DA 1C 08 0D 11
  can0  3F1   [8]  20 52 4C 7D 22 CE B7 11
  can0  3F1   [8]  20 5C DB C4 43 0E B7 11
  can0  3F1   [8]  20 5E A9 F1 AA D1 AD 11
  can0  3F1   [8]  20 63 DA 04 28 3C B8 11
  can0  3F1   [8]  20 7E BE 07 79 3A 00 11
  can0  3F1   [8]  20 86 38 72 FB 3E 8C 11
  can0  3F1   [8]  20 A5 61 E9 58 AA 96 11
  can0  3F1   [8]  20 C2 64 52 70 50 4D 11
  can0  3F1   [8]  20 E8 0C 6D 2A 5C 83 11
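
For readers who want to pull the UUIDs out of a capture like the one above, here is a minimal sketch. It assumes the layout visible in the candump and in the firmware source quoted later in this post: responses arrive on ID 3F1, byte 0 is 0x20 (CANBUS_RESP_NEED_NODEID), bytes 1 to 6 carry the UUID, and byte 7 is 0x11 (CANBUS_CMD_SET_CANBOOT_NODEID). The function names are made up for illustration and are not part of Katapult.

```python
import re

ADMIN_RESP_ID = 0x3F1      # response ID seen in the candump above
RESP_NEED_NODEID = 0x20    # first data byte of every response in the capture

# Matches lines such as "  can0  3F1   [8]  20 00 F2 2D 41 44 ED 11"
LINE_RE = re.compile(r"^\s*(\S+)\s+([0-9A-Fa-f]+)\s+\[(\d+)\]\s*(.*)$")


def parse_candump_line(line):
    """Return (interface, can_id, data) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    iface, can_id, dlc, payload = m.groups()
    data = bytes.fromhex(payload.replace(" ", ""))
    if len(data) != int(dlc):
        return None
    return iface, int(can_id, 16), data


def extract_uuid(line):
    """Return the 6-byte UUID as lowercase hex, or None for other frames."""
    parsed = parse_candump_line(line)
    if parsed is None:
        return None
    _, can_id, data = parsed
    if can_id != ADMIN_RESP_ID or len(data) != 8 or data[0] != RESP_NEED_NODEID:
        return None
    return data[1:7].hex()
```

Feeding the first response line of the capture to `extract_uuid` yields `00f22d4144ed`, the same value flash_can.py prints; the two 3F0 request frames are skipped.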

If I then run the same flash_can.py -q a second time, I see fewer devices on the bus:

Resetting all bootloader node IDs...
Checking for Katapult nodes...
Detected UUID: 00f22d4144ed, Application: Katapult
Detected UUID: 1317977e2691, Application: Katapult
Detected UUID: 1938ab445cd6, Application: Katapult
Detected UUID: 1c48bb32084e, Application: Katapult
Detected UUID: 2f8cda1c080d, Application: Katapult
Detected UUID: 524c7d22ceb7, Application: Katapult
Detected UUID: 5cdbc4430eb7, Application: Katapult
Detected UUID: 5ea9f1aad1ad, Application: Katapult
Detected UUID: 63da04283cb8, Application: Katapult
Query Complete

And on candump:

  can0  3F0   [1]  12
  can0  3F0   [1]  00
  can0  3F1   [8]  20 00 F2 2D 41 44 ED 11
  can0  3F1   [8]  20 13 17 97 7E 26 91 11
  can0  3F1   [8]  20 19 38 AB 44 5C D6 11
  can0  3F1   [8]  20 1C 48 BB 32 08 4E 11
  can0  3F1   [8]  20 2F 8C DA 1C 08 0D 11
  can0  3F1   [8]  20 52 4C 7D 22 CE B7 11
  can0  3F1   [8]  20 5C DB C4 43 0E B7 11
  can0  3F1   [8]  20 5E A9 F1 AA D1 AD 11
  can0  3F1   [8]  20 63 DA 04 28 3C B8 11

Every time the query for unassigned devices is run, fewer toolboards respond. This goes on until only two, or sometimes one, device is left and the rest don't answer anymore.

I can then only bring back the missing devices by resetting them (button press on the toolboard).

--

Notice that the devices are always listed from low to high uuid.

This low-to-high order made me think of bus arbitration, where the lowest value always gets precedence. But I thought that was based on the ID, which is 3F1 here all the time. Then again, how can bus arbitration work if the IDs are always the same?

I was given advice to try a debugger, but I have no experience with STM32 programming or debugging yet.

I did wonder what happens if I make sure that the devices do not answer at the same time, so I put a random sleep before replying at the following location:

https://github.com/Arksine/katapult/blob/master/src/generic/canserial.c#L154

#include "board/armcm_timer.h" // udelay(uint32_t) and timer_read_time(void)

// ...

static void
can_process_query_unassigned(struct canbus_msg *msg)
{
    if (CanData.assigned_id)
        return;
    struct canbus_msg send;
    send.id = CANBUS_ID_ADMIN_RESP;
    send.dlc = 8;
    send.data[0] = CANBUS_RESP_NEED_NODEID;
    memcpy(&send.data[1], CanData.uuid, sizeof(CanData.uuid));
    send.data[7] = CANBUS_CMD_SET_CANBOOT_NODEID;

    // Hack: introduce a pseudo-random delay of 0 up to about 0.5 sec by
    // keeping the low 19 bits of the current time (mask 0x7FFFF, max 524287)
    udelay(timer_read_time() & 0x7FFFF);

    // Send with retry
    for (;;) {
        int ret = canbus_send(&send);
        if (ret >= 0)
            return;
    }
}

When using this, all the devices remain working, now the flash_can.py -q shows:

Resetting all bootloader node IDs...
Checking for Katapult nodes...
Detected UUID: 1938ab445cd6, Application: Katapult
Detected UUID: 1c48bb32084e, Application: Katapult
Detected UUID: e80c6d2a5c83, Application: Katapult
Detected UUID: 863872fb3e8c, Application: Katapult
Detected UUID: a561e958aa96, Application: Katapult
Detected UUID: 5ea9f1aad1ad, Application: Katapult
Detected UUID: 1317977e2691, Application: Katapult
Detected UUID: 524c7d22ceb7, Application: Katapult
Detected UUID: 7ebe07793a00, Application: Katapult
Detected UUID: c2645270504d, Application: Katapult
Detected UUID: 2f8cda1c080d, Application: Katapult
Detected UUID: 5cdbc4430eb7, Application: Katapult
Detected UUID: 00f22d4144ed, Application: Katapult
Detected UUID: 63da04283cb8, Application: Katapult
Query Complete

And I can repeat that many times, all devices stay on the bus.

Notice that the sequence is now also randomized.

--

I think the solution I concocted is somewhat ugly, but for now it gets me a bit further, seeing that other traffic after short node-id assignment uses a unique id.

Could someone help me out making a nice solution for this issue ?

Thank you,
Leon

Owner

Arksine commented Aug 25, 2023

Indeed, this looks like an arbitration issue. Let's look at the UUIDs in the initial response:

Detected UUID: 00f22d4144ed - 000000001111001000101101010000010100010011101101
Detected UUID: 1317977e2691 - 000100110001011110010111011111100010011010010001
Detected UUID: 1938ab445cd6 - 000110010011100010101011010001000101110011010110
Detected UUID: 1c48bb32084e - 000111000100100010111011001100100000100001001110
Detected UUID: 2f8cda1c080d - 001011111000110011011010000111000000100000001101
Detected UUID: 524c7d22ceb7 - 010100100100110001111101001000101100111010110111
Detected UUID: 5cdbc4430eb7 - 010111001101101111000100010000110000111010110111
Detected UUID: 5ea9f1aad1ad - 010111101010100111110001101010101101000110101101
Detected UUID: 63da04283cb8 - 011000111101101000000100001010000011110010111000
Detected UUID: 7ebe07793a00 - 011111101011111000000111011110010011101000000000
Detected UUID: 863872fb3e8c - 100001100011100001110010111110110011111010001100
Detected UUID: a561e958aa96 - 101001010110000111101001010110001010101010010110
Detected UUID: c2645270504d - 110000100110010001010010011100000101000001001101
Detected UUID: e80c6d2a5c83 - 111010000000110001101101001010100101110010000011

Notice the pattern: the UUID with the first unique 0 always wins. Arbitration is working as intended. What I don't understand is why the nodes go unresponsive after successive queries. It's possible that they are dropping off due to too many errors.
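
The "first unique 0 wins" pattern can be reproduced with a small sketch. It models only the wired-AND behavior of the bus, where a dominant 0 overrides a recessive 1 and a sender that reads back something other than what it sent stops transmitting. It does not model error counters or retransmission, and the function names are illustrative only.

```python
def to_bits(uuid_hex):
    """48-bit, most-significant-bit-first string for a 12-hex-digit UUID."""
    return format(int(uuid_hex, 16), "048b")


def wired_and_winner(bit_strings):
    """Index of the sender that survives bit-by-bit contention."""
    contenders = list(range(len(bit_strings)))
    for pos in range(len(bit_strings[0])):
        # The bus carries '0' if any contender drives '0' at this position.
        bus = min(bit_strings[i][pos] for i in contenders)
        contenders = [i for i in contenders if bit_strings[i][pos] == bus]
        if len(contenders) == 1:
            break
    return contenders[0]


def response_order(uuids):
    """Order in which responses get through if all senders start together."""
    remaining = list(uuids)
    order = []
    while remaining:
        bits = [to_bits(u) for u in remaining]
        order.append(remaining.pop(wired_and_winner(bits)))
    return order
```

For the UUIDs in this thread, `response_order` returns them sorted from low to high regardless of the input order, which matches what flash_can.py -q printed.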

This issue should affect Klipper as well. If you were to flash each of these nodes with Klipper and run either flashtool.py -q or Klipper's canbus_query.py I suspect this same thing will occur. This is something that probably should be first addressed there, then merged in Katapult.

With regard to debugging, I looked at these EBB36/42 devices thinking I could order one and take a look. Unfortunately they don't break out the SWD pins, so debugging these particular devices is probably out.

Author

ldr commented Aug 26, 2023

Hi, thanks for getting back to me :) is it correct that for swd pins swdio, swclk, 3v3 and gnd are necessary ? When looking in datasheet for the g0b1 https://www.st.com/resource/en/datasheet/stm32g0b1cc.pdf I think that SWDIO and SWCLK are on PA13 and PA14 respectively. (pls correct me if i'm wrong)

And the schematic of this board does mention those pins: https://github.com/bigtreetech/EBB/blob/master/EBB%20CAN%20V1.1%20(STM32G0B1)/EBB42%20CAN%20V1.1/Hardware/BIGTREETECH%20EBB42%20CAN%20V1.1-SCH.pdf

In the same schematic I do see BOOT0 going to another block, namely a button. But the other pin, SWDIO, is not marked as not-connected. Also, on the bottom of the board there are four pads next to each other that are not in the document; i will measure them now to see if they're connected to the pins on the mcu :)

Author

ldr commented Aug 26, 2023

I looked wrong, the swdio goes to an led... still will search for something undocumented

Author

ldr commented Aug 26, 2023

I probed every pad but cannot find another place where the PA13 and PA14 are exposed.

Thinking of a couple of ways forward:

  • try to debug by soldering onto the two pins PA13 and PA14 on the mcu itself (steady hand necessary) :> luckily they're on the corner
  • there is an stm32 nucleo board with the same mcu - i got one some time ago and yesterday ordered a couple of can transceivers, but my skills are still lacking to debug (would like to learn though). i could order one for you if you want
  • try to reproduce with another mcu. I have a bunch of flysht boards (similar to the btt ebb42) but they have an STM32F072 - not sure if that would help ?

The issue indeed also happens in klipper.

I'm also wondering what is the best way forward to create a ticket for klipper, since it seems that they are closed in github and only directing to Discord ?

thanks

Author

ldr commented Aug 26, 2023

Still wondering about the arbitration. I see indeed that the 0 bits all the time win and therefore the sequence. But is this how CAN is supposed to work ? I understood from some online references that the ID field should be used for arbitration (unless extended bit is set then also the extended id - but that is not set in this case).

When looking at a pcap of the query unassigned (see attachment - this one is from when klipper is running), the id is all the time 0x3f1 (which is also what linux side listens on for the answer - well I don't need to tell you that :) haha) but still, the "zero-fallthrough" is happening in the UUID which is part of the payload. Does CAN work in such a way, that if the ID is the same, that arbitration just goes on for consecutive bits - even into the payload - until it gets a 1 and then goes quiet (losing) ?

can_query_unassigned_klipper.pcap.zip

Author

ldr commented Aug 27, 2023

the udelay isn't foolproof .. still having devices becoming unresponsive, just not so quickly. was to be expected ofc, need something better

Author

ldr commented Aug 27, 2023

I got openocd to work, able to see what's going on with connected gdb !! :) (wiring it up was not easy i can tell :>)

I have total of five devices connected on the bus, reverted above change where it does the random sleep (reverted in order to easily reproduce the issue).

The device that's connected to the st-link v2 has a high uuid, so i am able to let it go silent.

Then checking gdb and seeing things are still running, often spidev_transfer. Next up is setting a breakpoint in the can_process_query_unassigned ? exactly at place where i had the sleep earlier. Thinking that if the breakpoint hits, that I know at least that the request was interpreted and that something goes wrong with sending ..

Owner

Arksine commented Aug 27, 2023

My understanding is that arbitration occurs during the entire frame. After each attempt to send a "1" the node checks the line; if it's driven low then another device has the bus.

I took a brief look at the reference manual, and it sounds like the nodes are going in to a bus_off state after experiencing a number of arbitration errors. Since you have a debugger working you should be able to break/pause a node that has fallen off the bus and check the CAN registers. The PSR register provides a bitfield containing the bus_off status.

As far as I can tell in stm32/fdcan.c there is no method to recover from a bus_off state. A fix would need to be proposed, implemented, and merged in Klipper, then Katapult can pull in the changes.

Author

ldr commented Aug 28, 2023

thanks for helping :)

I will create ticket on klipper git, would like to firstly get this BO state from register printed so i can include the info in there too.

But I cannot get it to be printed out.

Things that I try:

  • when setting a breakpoint, in order to stop the software going, i tried for example b sched.c:240 since it seems that that gets hit all the time, but i don't see the program stopping.. (also not sure, but think it's not resetting due to watchdog?)
  • then tried to just read registers, but not seeing the FDCAN_PSR, also cannot find it mentioned in the source.
  • then, seeing in the manual you pointed to (section 36.4.13) that BO (bus off) is at bit 7 of the register and that the register is at offset 0x0044, i tried reading the register with p /t *(uint32_t*)0x0044 and then tried getting the seventh bit like so: p /t (*(uint32_t*)0x0044 & (1<<6)) >> 6 but that keeps returning value 1, even after i see the device becoming unresponsive

can you give me a tip on how i can read out the BO state in this register ?
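
Two details in the attempt above look worth checking. 0x0044 is described as an offset, so it presumably has to be added to the base address of the FDCAN peripheral in use rather than read as an absolute address. And if bit numbering starts at 0, bit 7 is selected with 1<<7, while 1<<6 selects the bit below it. The sketch below decodes a raw PSR value once it has been read. The BO position and the offset come from the comment above; the other field positions are my reading of the register layout and should be verified against the reference manual. `fdcan_base` is a placeholder, not a value taken from this thread.

```python
PSR_OFFSET = 0x44  # register offset quoted above


def psr_address(fdcan_base):
    """Absolute address of PSR for a given peripheral base address."""
    return fdcan_base + PSR_OFFSET


def decode_psr(value):
    """Split a raw 32-bit PSR value into its low status fields."""
    return {
        "LEC": value & 0x7,            # last error code, bits 2:0 (assumed)
        "ACT": (value >> 3) & 0x3,     # activity, bits 4:3 (assumed)
        "EP": bool(value & (1 << 5)),  # error passive (assumed position)
        "EW": bool(value & (1 << 6)),  # error warning (assumed position)
        "BO": bool(value & (1 << 7)),  # bus off, bit 7 as stated above
    }
```

In gdb the raw value could then be read with something like p /x *(uint32_t*)(BASE + 0x44), with BASE replaced by the peripheral base address from the reference manual, and passed to `decode_psr`.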

Author

ldr commented Aug 28, 2023

I had following conversation with chatgpt :) what do you think of it, could it really not be that only the ID should be used for arbitration and that having multiple devices using the same id result in bit errors and therefore incrementing the TEC (transmit error counter) resulting the device going into bus_off state ?


GPT-4

User
on stm32 programming what can be the reasons for the can bus to go into error bus_off state ?

ChatGPT
The CAN (Controller Area Network) bus on an STM32 microcontroller can enter a Bus-Off state due to specific error conditions. When a device on the CAN network detects an error in a transmitted or received message, it will increment its error counters. If these error counters exceed specific thresholds, the device can be transitioned into an Error Passive state or, in more severe cases, a Bus-Off state.

Here are some reasons for a CAN node to enter the Bus-Off state:

Transmit Error Counter Exceeds Limit: The CAN standard specifies that if the Transmit Error Counter (TEC) exceeds 255, the device will enter a Bus-Off state. This usually happens when a node consistently detects errors while trying to send messages, indicating that there might be an issue with its own transmissions.

Persistent Errors on the Network: If a node consistently observes errors (either in its own transmissions or in others), it might imply network issues like noise, incorrect bit timing settings, or a malfunctioning node disrupting communication. Continuous errors can cause the error counters to increase rapidly.

Incorrect Bit Timing Configuration: If the CAN bit timing parameters are not configured correctly for the given network speed or do not match with the other nodes in the network, this can lead to consistent frame errors and eventually a Bus-Off state.

Physical Layer Issues: Problems like incorrect termination, broken cables, or connectors, and noise interference can lead to persistent errors on the bus. A common issue is the lack of proper termination resistors (typically 120 Ohms) at both ends of the CAN bus.

Software Bugs: Firmware in a CAN node might inadvertently send malformed messages or respond incorrectly to received messages, leading to errors. For example, a node sending a message with an incorrect CRC value will cause error frame generation.

Higher Layer Protocol Violations: If you're using higher-layer protocols like CANopen or J1939 on top of the base CAN protocol, not adhering to their specific rules can also trigger error conditions.

Electrical Disturbances: Transients, spikes, or other electrical disturbances can introduce errors on the CAN bus. Proper grounding and shielding can mitigate such issues.

When the STM32 enters the Bus-Off state, it will stop participating in bus activities and won't send or acknowledge any message. To recover from this state, the software often has to initiate a recovery process, which involves waiting for a specific number of bus idle time (11 consecutive recessive bits) and then resetting the CAN controller or its error counters.

To diagnose and solve Bus-Off problems:

Monitoring: Use a CAN bus monitoring tool to inspect the traffic and see if there are nodes sending erroneous frames.
Physical Inspection: Check for proper bus termination, cable quality, and connector integrity.
Configuration Verification: Ensure that all nodes have matching bit timing configurations and are operating at the same baud rate.
Isolation: Disconnect nodes one by one to identify if a particular node is causing issues.
Software Review: Check the software for potential issues in message formation or handling.
Remember that understanding the root cause is crucial to ensuring reliable and robust CAN communication.


User
for which reasons can the TEC transmit error counter be incremented ?

ChatGPT
The Transmit Error Counter (TEC) in a CAN (Controller Area Network) node keeps track of the errors related to transmissions by that particular node. The counter is incremented under various error scenarios, and its behavior is defined in the CAN specification (ISO 11898).

Here are the reasons for which the TEC can be incremented:

Bit Errors: A bit error occurs when a transmitting node detects a difference between the transmitted bit and the monitored bit on the bus. For example, if the node sends a dominant bit (logic 0) but reads a recessive bit (logic 1) or vice versa, it would count as a bit error. The TEC is incremented by 8 for a bit error during transmission.

Form Errors: A form error is detected by a transmitting node when a fixed-form bit field contains illegal bits. This typically occurs in fields like the CRC delimiter, acknowledgment delimiter, or end of frame (EOF). For a form error, the TEC is incremented by 8.

Stuff Errors: In CAN, a bit-stuffing mechanism is used to ensure synchronization. If five consecutive bits of the same level (either all dominant or all recessive) are detected on the bus, a node will insert a bit of the opposite polarity. If the transmitting node detects an absence of this "stuffed" bit, it signifies a stuff error. For a stuff error, the TEC is incremented by 8.

CRC Errors: A transmitting node calculates and sends a CRC value at the end of every CAN message. If the transmitted CRC doesn't match the CRC calculated by a receiving node, a CRC error is flagged by the transmitting node when it detects the error flag sent by the receiving node. The TEC is incremented by 8 for a CRC error.

Acknowledgment Errors: After sending a message, the transmitting node waits for an acknowledgment. If no other node on the network acknowledges the transmitted message by sending a dominant bit during the ACK slot, the transmitting node detects an acknowledgment error. For an acknowledgment error, the TEC is incremented by 8.

Recessive Bit During Dominant Phase: If a node detects a recessive bit during a time when it expects all nodes to send a dominant bit, it's an error. This condition increments the TEC but is generally included in the other error types listed above.

In general, whenever the CAN protocol detects an error during transmission, it attempts to signal other nodes about the error using an error flag. This flag consists of six dominant bits in a sequence. If nodes detect the error flag, they will also start transmitting their own error flags. This process ensures synchronization of error handling across the network.

To maintain bus reliability, when the TEC exceeds 127, the CAN controller enters an "Error Passive" state. In this state, the controller no longer sends active error flags if it detects errors but still functions. If the TEC exceeds 255, the controller enters the "Bus Off" state, where it is effectively disconnected from the network until a recovery process is initiated.


User
can bit errors occur when multiple devices use the same id ?

ChatGPT
Yes, bit errors can occur when multiple devices use the same identifier (ID) and try to transmit messages simultaneously. Here's how:

In the CAN protocol, the arbitration process determines the message priority based solely on the ID: lower ID values have higher priority. When two nodes begin to transmit at the same time, they will start monitoring the bus while they send. They continue transmitting as long as what they send matches what they see on the bus.

The first discrepancy between what's being sent and what's being monitored on the bus will occur when one node is sending a recessive bit while the other is sending a dominant bit (because dominant bits will override recessive bits). At this point, the node sending the recessive bit recognizes that what it sent doesn't match what's on the bus and stops transmitting, leaving the other message (with the more dominant ID) to continue.

However, if two nodes with the same ID start transmitting their messages simultaneously, the arbitration process won't determine a winner based on the ID, since the IDs are identical. The arbitration would then proceed to the next field. If the data or any subsequent field in the CAN message is different between the two nodes, one node might be transmitting a dominant bit while the other transmits a recessive bit. The node transmitting the recessive bit will detect a mismatch between what it sent and what it reads from the bus, resulting in a bit error. The node detecting the bit error will then abort its transmission and signal an error frame.

If by chance both nodes with the same ID also have identical data and other fields (making the whole message identical), they might both successfully transmit without detecting an error. However, this behavior is unpredictable and depends on exact timings, making it an unreliable scenario.

In conclusion, having multiple devices with the same ID on a CAN bus is not advisable. It can lead to unpredictable behavior, potential bit errors, and data collisions, compromising the integrity and reliability of communication on the bus.
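
The counter rules quoted above can be sketched as a tiny state machine. This is a simplified model, not the controller's actual logic: it takes the thresholds and the +8 per transmit error from the quoted text, adds the standard decrement of 1 per successful transmission, and ignores the exceptions the CAN specification defines. The class and function names are illustrative.

```python
ERROR_PASSIVE_ABOVE = 127  # TEC above this value: error passive
BUS_OFF_ABOVE = 255        # TEC above this value: bus off


class TransmitErrorCounter:
    """Simplified transmit error counter with the three fault states."""

    def __init__(self):
        self.tec = 0

    @property
    def state(self):
        if self.tec > BUS_OFF_ABOVE:
            return "bus_off"
        if self.tec > ERROR_PASSIVE_ABOVE:
            return "error_passive"
        return "error_active"

    def on_transmit_error(self):
        # A node that is bus off no longer transmits, so it stops counting.
        if self.state != "bus_off":
            self.tec += 8

    def on_transmit_success(self):
        if self.state != "bus_off" and self.tec > 0:
            self.tec -= 1


def errors_until_bus_off():
    """Number of back-to-back transmit errors that take a fresh node bus off."""
    counter = TransmitErrorCounter()
    count = 0
    while counter.state != "bus_off":
        counter.on_transmit_error()
        count += 1
    return count
```

Under this model a node needs 32 consecutive failed transmissions to go bus off, and each success only gives back 1 of the 8 points an error costs, so a node that keeps losing on a shared ID can accumulate errors across several queries.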

Author

ldr commented Aug 28, 2023

i don't see an error frame being emitted by the toolboard though...

Owner

Arksine commented Aug 28, 2023

The response says an error frame is "signaled", that may just mean internally. That sounds correct. If so, the solution is rather simple:

  1. Document that the query is only intended to identify the UUID, and should be done with a single node on the network.
  2. I need to remove the UUID verification step in flashtool.py. This removes the query before setting the node ID and connecting to the bootloader. If no matching UUID is on the network there would be a timeout error when the tool attempts to connect to the bootloader.

Author

ldr commented Aug 28, 2023

I already commented out the part in my local flash_can.py script so that it doesn't do the query before flashing (just for debugging this issue), but still in my setup i need to do multiple of these query_unassigned_id requests (i'm using custom software on linux side to talk to multiple toolboards) - is not in a printer but whole other usecase where i want to control many stepper motors :) But I think it would be useful if the query works on multiple can toolboards in the normal usecase of a printer (with multiple print heads?) as well..

Wouldn't a better solution be to make sure that devices don't use the same id ? I mean it's not proven yet that this causes it and I understand that this would be a protocol change which is probably not what people are waiting for :)

Also I have to admit that i didn't measure resistance on the bus earlier (stupid). But now I did and noticed 30-something-ohms, apparently the cable that came with the peak usb<>can transceiver has two 120ohms in there, one on each side. So I had 3x 120ohm on the bus. Still I had the same issue with the STM32 (candlelight) transceiver which does have only one 120ohms (while i have one terminator myself on other end), so I think this is not the reason. Anyway I will remove one resistor from that cable and test it again. If it still reproduces I will open the ticket on klipper repo.


NAPCAL commented Aug 28, 2023

Author

ldr commented Aug 28, 2023

I have just now removed one resistor, measured, only seeing 58.6ohms when everything is connected, after that re-did the query test and saw devices going silent, so it wasn't that.
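
As a quick sanity check on the termination readings, the nominal resistance of several terminators in parallel can be computed as below. This is plain parallel-resistor arithmetic and ignores wiring and transceiver loading; the function name is illustrative.

```python
def parallel_resistance(resistors_ohm):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


two_terminators = parallel_resistance([120, 120])         # nominal 60 ohm
three_terminators = parallel_resistance([120, 120, 120])  # nominal 40 ohm
```

The 58.6 ohm measured after removing one resistor is close to the nominal value for two 120 ohm terminators.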

Author

ldr commented Aug 28, 2023

hi @NAPCAL thanks, yes, i think the rest is in order (i'm not using twisted pair at the moment, but the total wire length is quite short, i think one meter). Also other communication goes fine it seems, although i still have to check for other situations where toolboards use the same id at the same time.

Owner

Arksine commented Aug 28, 2023

The purpose of the query is to get a device UUID so the application (be it Klipper or flashtool.py) can assign a unique node ID. The query is the only command that can transmit a response to a duplicate node ID. Once you know the device UUIDs it shouldn't be necessary to query them again unless you lose them.

As you mention, an attempt to pre-assign node IDs would require a change in protocol. Assignment and ID conflict resolution would have to be moved to the MCU. I'm not open to deviating from Klipper's protocol due to the maintenance burden, so this proposal would need to be discussed and implemented in Klipper to make it in to Katapult.


NAPCAL commented Aug 28, 2023

@ldr On my setup, I am using four BTT EBB42, one PITB (dual 5160HV), and BTT EBB SB2240 on a dual CAN hat (SeeedStudio).

FYI the UUIDs come from the CAN bus transceivers, so if you have a conflict, change the transceiver or put them on different CAN buses.

Author

ldr commented Aug 28, 2023

afaik the UUIDs come from a hashing function on the serial of the mcu. it's not the UUIDs that are conflicting; I believe it's because of the same id (can id != uuid) - the can id is also programmatically set

Owner

Arksine commented Aug 28, 2023

The UUID is generated from the MCU's unique serial number. Indeed, it's hashed. The node ID is set by the application and initialized to 0. Applications must know the UUID to set the node ID.

Author

ldr commented Aug 28, 2023

Some random site where it is mentioned that having multiple devices with same can id may result in bus-off state:

Controller Area Network (CAN) is a communication protocol that utilizes differential pair communication for reliability and noise reduction, and has a fixed bit arrangement so that each packet can contain the maximum of 8 bytes. The physical protocol of the CAN bus is designed so that if there are non-conflicting IDs, one is guaranteed to win if two nodes attempt to communicate at the same time. If two of the nodes attempt to communicate on the CAN bus with identical IDs at the same time, it will quickly lead to a condition known as bus-off, where one or both of the nodes essentially stop responding to packets until a fixed correct number of packets is received.

Owner

Arksine commented Aug 28, 2023

It seems clear that the nodes are going bus off even if it hasn't been verified by the debugger. The fact that they recover after a reset is the giveaway.

The issue with the query command is that all of the nodes transmit their response to the admin id. So the problem isn't really that they haven't been assigned a unique node ID, but that there is no way for multiple nodes to broadcast to the same host ID without running into transmission errors. When Klipper/flashtool assign node IDs, they do it in pairs, one for the node and one for the host.
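
To make the "pairs" idea concrete, here is a rough sketch of how a node ID could map to two CAN IDs. The admin IDs 3F0 and 3F1 are taken from the candump earlier in the thread. The mapping formula is an assumption based on my reading of the Klipper host code and should be checked against the source before relying on it; `node_can_ids` is a hypothetical helper.

```python
ADMIN_ID = 0x3F0        # host to all nodes, seen in the candump
ADMIN_RESP_ID = 0x3F1   # shared response ID used by the query


def node_can_ids(nodeid):
    """Return (host_to_node_id, node_to_host_id) for an assigned node ID.

    Assumed mapping: each node ID gets two consecutive CAN IDs starting
    at 0x100, the even one for the host and the odd one for the node.
    """
    host_to_node = nodeid * 2 + 0x100
    return host_to_node, host_to_node + 1


# Different node IDs never share a transmit ID under this mapping,
# unlike the query response, where every node answers on ADMIN_RESP_ID.
pairs = [node_can_ids(n) for n in range(4, 18)]
```

Under this assumption, once node IDs are assigned every transmitter on the bus uses a distinct CAN ID, so arbitration is always resolved within the ID field.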

Author

ldr commented Aug 28, 2023

hey i want to thank you Arksine for the help so far, and Napcal too, tnx. I understand best place is to create a post on discourse explaining things and try to come up with a proposal :) therefore closing this ticket - i'll be back if there is a change that may be backported from klipper !

@ldr ldr closed this as completed Aug 28, 2023
Author

ldr commented Aug 28, 2023

@AtomicWerks

Hi. I'm not on discourse, so I figured I'd post here regarding this issue.

I found this closed issue because I was facing the same problem. My u2c (and ebb42) with canboot would drop off and the interface would disappear.

Here's what I came to find out:

When queried by the Canboot flash script using the uuid the board is reset into bootloader mode and thus drops off. From there it will show up in the serial devices and can be reflashed using

python3 flash_can.py -i can0 -f firmware-path.bin -d serial-device-path

This worked for me multiple times. I hope the information helps someone else also.
