
Standardize unused bytes in CAN-FD #36

Closed
thirtytwobits opened this issue Jun 12, 2018 · 17 comments

Comments

@thirtytwobits
Contributor

thirtytwobits commented Jun 12, 2018

One of three standards should be adopted:

  1. Unused bytes must contain the value 0xAA – for FD we want the unused byte padding to be 0xAA to avoid extra bit stuffing in the unused bytes.
  2. Unused byte values are unspecified but are included in the transport CRC – two implementations that fill the unused bytes differently remain compatible, and any bit errors in those bytes that are not caught by the CAN CRC check are still caught by the transport CRC check.
  3. Unused bytes are ignored – bit errors that occur in ignored bytes, and which are not caught by the CAN CRC checks, go undetected.


@thirtytwobits
Contributor Author

After some experimentation, option 3 seems to allow the cleanest separation between the transport and data serialization layers in libuavcan, which suggests it is the better design in general. Option 3 allows implementations to use 0xAA (or 0x55) for unused bytes but doesn't require it.

@thirtytwobits
Contributor Author

Eh, now I'm disagreeing with myself. Ignoring the unused data means the transport layer has to collaborate with the data layer. This is becoming more evident as I learn more about this codebase. I'm thinking 2 is the correct answer now.

@pavel-kirienko
Member

pavel-kirienko commented Jun 15, 2018

For the sake of extra clarity please allow me to be redundant here and state the problem again. So the problem of CRC verification is that at the transport layer we don't know where the useful data ends and the padding begins; this is a problem because the CRC (if we were to move it to the end as discussed on the mailing list) is supposed to be allocated at the very end of the useful sequence of bytes.

One of the unpleasant side effects of simple CRC algorithms is that once the remainder converges to zero, subsequent zero input bytes do not affect its value, making the algorithm insensitive to extra zero data at the end of the data stream (or at the beginning, if the initial value is zero). Advanced CRC algorithms counter this problem by applying bitwise inversion at the output. Thankfully, our CRC16-CCITT is not one of those, so its remainder stays stable at zero.

The second interesting thing about our CRC16-CCITT is that its residual value is zero, meaning that if the algorithm is fed the correct checksum of the preceding data in big-endian byte order, it converges to zero.

From the above it is plain to see that the problem of CRC verification is solved by simply moving the CRC bytes in the big-endian data order to the end of the data stream and padding the extra bytes with zeroes.

Proof:

>>> crc16_from_bytes('123456789')
10673
>>> hex(crc16_from_bytes('123456789'))
'0x29b1'
>>> hex(crc16_from_bytes('123456789\x29\xb1'))
'0x0'
>>> hex(crc16_from_bytes('123456789\x29\xb1\x00\x00'))
'0x0'
>>> hex(crc16_from_bytes('123456789\x29\xb1\x00\x00\x00'))
'0x0'

Comments welcome.

@thirtytwobits
Contributor Author

I like the idea of moving the transfer CRC to the last frame (specifically, to the two bytes immediately preceding the tail byte), but we should allow and actually implement 0x55 empty bytes. It's just good practice not to send a long run of identical bits over the wire (0x55 is better than 0xAA since it has the same value whether interpreted as uint8 or int8).

The big question is one for you to answer: are you okay with a breaking change in the protocol? To date I've been proposing changes that only affect FD frames. This would be the first change that would make this version of UAVCAN incompatible with older versions.

@pavel-kirienko
Member

I like the idea of moving the transfer CRC to the last frame (specifically to be the previous two bytes before the tail byte)

That won't work. If the CRC were placed immediately before the tail byte, that would require the presentation layer (the one that performs data serialization/deserialization) to know about DLC and the number of frames emitted per transfer. This is a bad leaky abstraction, let's avoid that. Instead, I am proposing to place CRC immediately after the useful data. The transport layer then will be able to pad the transfer with zero bytes. When receiving the transfer, conforming implementations will always arrive at a correct CRC residual of zero regardless of the number of padding bytes used.

Padding values other than zero can't be used with the current CRC algorithm because the algorithm maintains the stable residual of zero only when fed zero bytes.
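To illustrate (reusing the hypothetical crc16_from_bytes sketch from above): only zero padding after the CRC preserves the zero residual, whereas 0x55 padding destroys it.

assert crc16_from_bytes('123456789\x29\xb1\x00\x00') == 0x0000  # zero padding: residual preserved
assert crc16_from_bytes('123456789\x29\xb1\x55\x55') != 0x0000  # 0x55 padding: residual destroyed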

@pavel-kirienko
Member

The big question is one for you to answer: are you okay with a breaking change in the protocol?

This is highly unpleasant but seems to be acceptable for the greater good. We don't seem to be able to introduce the change without breaking backward compatibility.

@kjetilkjeka
Contributor

That won't work. If the CRC were placed immediately before the tail byte, that would require the presentation layer (the one that performs data serialization/deserialization) to know about DLC and the number of frames emitted per transfer.

I do not understand what you mean by this. I interpret it as saying that CRC checking is performed at the presentation layer? This seems weird to me, as CRC calculation can be performed without knowing anything about serialization/deserialization or data types. Please explain.

The transport layer then will be able to pad the transfer with zero bytes. When receiving the transfer, conforming implementations will always arrive at a correct CRC residual of zero regardless of the number of padding bytes used.

Again, my understanding is that the transport layer would receive data without CRC or tail bytes (pure payload) and add CRC and tail byte. Is this wrong?

From my point of view it seems like putting the CRC directly before the tail byte would be more consistent with the abstraction layer model. On reception, the payload including padding will be fed to the deserializer until the data structure is completely deserialized. This is analogous to what currently happens for "padding bits".

@pavel-kirienko
Member

Okay, gentlemen, I am back in the office, let's continue. Let me first address this comment that Scott (@thirtytwobits) posted on the mailing list:

This is how I have it implemented in my libuavcan-FD PR and it works well that the transport layer can count backwards from the DLC boundary to find things. If you put it after the end of data then you need to know the data-length to find it which seems like a layering violation to me.

Nope, when decoding, implementations don't need to care where the CRC field is located. They just feed the bytes into the CRC algorithm, including the padding bytes (if any), and then verify that the final residual has the expected value. In the case of the standard CRC-16-CCITT, the residual is zero; it also requires the padding bytes to be zero. As Kjetil said, this bears the disadvantage of the extra stuff bits in the message, which we don't want.

Kjetil, to reply to your previous comment: the presentation layer does not need to know anything about CRC, DLC, CAN FD, or framing in general.

CRC calculation can be performed without knowing anything about serialization/deserialization or datatypes.

Correct.

my understanding is that the transport layer would receive data without CRC or tail bytes (pure payload) and add CRC and tail byte. Is this wrong?

This is correct.

From my point of view it seems like putting the CRC directly before the tail byte would be more consistent with the abstraction layer model. On reception, the payload including padding will be fed to the deserializer until the data structure is completely deserialized. This is analogous to what currently happens for "padding bits".

This is suboptimal. If we did this, we'd be mixing logic between the transport layer and the presentation layer: the latter will have to report to the former how much useful data there is, and how many padding bytes to ignore.

Now let's start from scratch.


As I see it, the approach of the zero pad bytes would be perfect, if not for the pesky stuff bits. Luckily, there seems to be a solution to that problem, suggested by Kjetil on the mailing list:

I'm not familiar with how these CRC algorithms work on anything other than a shallow level, but would it be possible (as a best of both worlds) to find a CRC algorithm that has this property for a value that does not contain more than 4 repeating bits?

So I added a simple xor against 0x55 at the input and the output of CRC-16-CCITT, and as a result, the algorithm became invariant to trailing 0x55 instead of trailing zeros. The residual is different now: instead of zero it now equals 0x5555, which is perfectly acceptable though. I made a post on Stackoverflow to let the hivemind verify my assumptions: https://stackoverflow.com/questions/51020246/a-cyclic-redundancy-check-algorithm-that-is-invariant-to-the-number-of-trailing

So my proposition is as follows: use the modified CRC16 algorithm. Arrange data in the following order:

  1. Useful payload.
  2. CRC16.
  3. Padding bytes = 0x55, if any.
  4. Tail byte.

When encoding the transfer, compute the CRC and put the remainder immediately after the useful data, then pad the extra bytes with 0x55.

When decoding the transfer, simply compute the CRC over the entire byte sequence up until the tail byte (from 1 to 3, inclusive), and then verify that the residual equals 0x5555. If it does, the data is valid; if it has any other value, the packet is corrupted and shall be discarded.
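A minimal Python sketch of the modified algorithm and of the proposed layout (the function name is mine; it assumes the same CRC-16-CCITT core as above, with every input byte XORed with 0x55 and the result XORed with 0x5555):

def crc16_ccitt_0x55(data):
    # CRC-16-CCITT (poly 0x1021, init 0xFFFF) with 0x55 applied at the input and 0x5555 at the output.
    crc = 0xFFFF
    for byte in data:
        crc ^= (byte ^ 0x55) << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc ^ 0x5555

# Encoding: useful payload, then CRC16 (big-endian), then any number of 0x55 padding bytes.
payload = b'123456789'
crc = crc16_ccitt_0x55(payload)
for padding in range(4):
    block = payload + crc.to_bytes(2, 'big') + b'\x55' * padding
    # Decoding: the residual over items 1..3 is 0x5555 regardless of the amount of padding.
    assert crc16_ccitt_0x55(block) == 0x5555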

Let me re-emphasize here that implementations shall not care where the CRC is located in the packet. There is no coupling between the transport layer and the presentation layer, and no layering violation seems to take place.

Please let me know what you think.

@kjetilkjeka
Contributor

This improves our situation greatly. Since we can use 0x55 as padding no matter where we put the CRC we're free to choose the correct/nice solution instead of compromising on practicalities like bit stuffing.

However, I'm still stuck on your argument about why it will not break the abstraction. Let me give an example. In Uavcan.rs, receiving a message goes through two stages: deframing and deserializing. When I receive a CAN frame I feed the CAN data to the deframer; the deframer reads the tail byte and pulls out the transfer payload. The transfer payload is then sent to the deserializer; it may contain padding bits.

Currently, the (de)framer knows exactly where to find things such as the CRC and the tail byte, since they are always located at a fixed byte index (counted either from the start or from the end). If we put the CRC right before the tail byte this would still be true. We would of course need to include the padding bytes in our CRC, but we already do this for padding bits. No communication between the (de)serializer and the (de)framer is required other than passing the transfer payload.

To me it seems like the solution you're explaining would achieve the same, but with two minor disadvantages:

  1. The CRC can only be verified "indirectly". The check can be performed, but if it fails we cannot report what the CRC was and what it was supposed to be (without communicating with the (de)serializer).
  2. The CRC is also sent to the (de)serializer as padding, instead of only the padding bytes.

@thirtytwobits
Contributor Author

Also remember that the transport layer has to add the padding bytes itself, so in that regard it must know about the DLC. You'll see in my PR here that I had to put this logic in the CanFrame. Putting the CRC before the tail byte is trivial once you have the DLC.

But yes, your proposal, @pavel-kirienko, seems feasible, even if I haven't had time to get cozy with the math.

I'd still like to discuss a meta-issue: why is this CRC necessary? CAN itself is reliable. How much complexity and extra data are we adding to the protocol, and for how much error-detection benefit? It certainly seems like we should be able to document such a benefit in the specification if we are going to have it.

@kjetilkjeka
Contributor

Also remember that the transport layer has to add the padding bytes itself, so in that regard it must know about the DLC. You'll see in my PR here that I had to put this logic in the CanFrame. Putting the CRC before the tail byte is trivial once you have the DLC.

As mentioned, this is the exact same conclusion I've come to in uavcan.rs.

I'd still like to discuss a meta-issue: why is this CRC necessary? CAN itself is reliable. How much complexity and extra data are we adding to the protocol, and for how much error-detection benefit? It certainly seems like we should be able to document such a benefit in the specification if we are going to have it.

If we're only relying on the toggle bit to assemble frames in the correct order, a fault is bound to happen at some point, and when it does it will not be detected because there is no CRC. This can change the meaning of the message. For the CRC to be removed, I think it must be replaced with a different robust mechanism to verify that the frames of a message are assembled in the correct order.

@pavel-kirienko
Member

The CRC can only be verified "indirectly". The check can be performed, but if it fails we cannot report what the CRC was and what it was supposed to be (without communicating with the (de)serializer).

This information is unobtainable either way: if the CRC doesn't match, you don't know whether it is caused by corrupted data (in our case it can be caused by improper deframing only, since every CAN frame is CRC-protected individually; this also answers the question Scott raised about why we need the CRC at all) or by a corrupted CRC. So I don't think this difference has any practical consequences.

The CRC is also sent to the (de)serializer as padding, instead of only the padding bytes.

Let's review the data flow here for a general case, irrespective of any particular existing implementations. I haven't been developing UAVCAN in a while (only using it, mostly through libuavcan), so it also helps me to recall things.

In the case of data emission, the sequence would be as follows:

  1. The application constructs and populates a data structure (either a message, a service request, or a service response).
  2. The data structure is sent to the serializer. The serializer outputs a raw block of bytes.
  3. The serialized block of bytes is delivered into the framer. The framer outputs an ordered set of CAN frames.

Seeing as the integrity check is not concerned with serialization, and, furthermore, is dependent on framing (because single-frame transfers do not require additional integrity checks beyond that which is provided by the CAN transport), we can see that the related logic should be implemented at step 3, in the framer. In the case of multi-frame transfer emission, the CRC could be located either immediately after the serialized data (followed by arbitrary padding, if necessary) or after the padding, before the tail byte - both approaches seem equivalent.

Now consider data reception, which is always a lot harder, because data aggregation information is not provided explicitly by the upper layers as in the case of data emission; instead, it has to be carefully reconstructed from the transport.

  1. Received frames are consumed by matching deframers (filtered by src/dst node ID, data type ID, and transfer type). De-framers, as per the specification, reset their state when they receive a matching frame with the Start-Of-Transfer flag set, and assume the received transfer to be complete upon reception of a frame with the End-Of-Transfer flag set. Every once in a while the deframer will emit a block of bytes ready for deserialization.
  2. The deserialization-ready block is fed into the deserializer. The deserializer extracts useful data from the block, constructing an application-level data object from it.
  3. The object is fed into the application.

Similar to the simpler case of data emission, it is the job of the transport layer to ensure data integrity before passing the block to the higher level for deserialization. I am now observing that the CRC could be placed either before the padding bytes or after them, both approaches being architecturally clean with no obvious leaks.

But, consider this. One of the possible ways of DSDL incompatibility detection is to see whether all of the data was consumed by the deserializer after deserialization is finished. If there is less data than expected by the deserializer, or if there is more data, that means that the DSDL definitions used by the sender and by the receiver are different, and so the message cannot be processed (or, possibly, a different version of the same message could be used instead - this is just an example, the topic of DSDL versioning is actually extremely complex and warrants a dedicated discussion which is already underway over at OpenCyphal/public_regulated_data_types#35). With CAN 2.0, a one-byte granularity level is always guaranteed; however, with CAN FD the deserializer can run out of data and fail to notice that if there are padding bytes at the end, i.e. the effective granularity increases to 15 bytes in the worst case.

If we were to put CRC after the padding bytes, the above granularity limitation would have to be tolerated. But if the CRC field were to follow the payload immediately (and the following padding were to be in the residual-preserving pattern, which is currently 0x55), then implementations would have the option (this is not mandatory) to attempt decoding first, and then check if the two bytes where the deserializer stopped consuming data contain a correct CRC. This is a leak, but it could be optional, and it allows implementations to provide stronger DSDL compatibility guarantees, which could be handy for the DSDL versioning issue.

Does that make sense?

@kjetilkjeka
Contributor

Great, it looks like we're all pretty much on the same page now.

I am now observing that the CRC could be placed either before the padding bytes or after them, both approaches being architecturally clean with no obvious leaks.

I would argue that it is marginally cleaner to keep it after the padding bytes, for the following reasons:

  • The framer knows where to find the CRC. Practically it's not very useful except for maybe reporting to the user what it expected the CRC to be (in apps like uavcan_gui_tool) or in unit tests.
  • We currently feed padding bits to the (de)serializer, so feeding padding bytes makes little difference. Feeding the CRC to the deserializer feels like a violation.
  • It allows for leaving the value of the padding bytes unspecified.

And from the practical perspective:

  • It allows us to keep the exact same CRC scheme as in the current version.

@pavel-kirienko
Member

It allows us to keep the exact same CRC scheme as in the current version.

Yes, it is best to stay away from unconventional math tricks if we can help it. Makes the standard somewhat more approachable.

So let's see if we've reached consensus (@thirtytwobits, we need your input here). Is everyone okay with all of the following statements:

  • The CRC field is to be allocated immediately before the last tail byte of the transfer.
  • The padding bytes, if any, are to be inserted between the last byte of the payload and the CRC.
  • The padding bytes, if any, are recommended to be set to 0x55 when emitting and must be ignored when receiving.

This is essentially the option 2 proposed by Scott in the first post.
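To make the agreed layout concrete, here is a small Python sketch of how a framer might build the last frame of a multi-frame transfer under the three statements above (the helper names, the big-endian CRC byte order, and padding to the next valid CAN FD data length are my assumptions; tail byte encoding is out of scope here):

CANFD_LENGTHS = (0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 32, 48, 64)

def round_up_to_canfd_length(n):
    # Smallest data length encodable by a CAN FD DLC that can hold n bytes (n must not exceed 64).
    return next(l for l in CANFD_LENGTHS if l >= n)

def build_last_frame(last_payload_bytes, transfer_crc, tail_byte):
    # Layout: payload | padding (0x55, if any) | CRC16 | tail byte.
    needed = len(last_payload_bytes) + 2 + 1                      # payload + CRC + tail byte
    padding = round_up_to_canfd_length(needed) - needed
    return (last_payload_bytes
            + b'\x55' * padding
            + transfer_crc.to_bytes(2, 'big')
            + bytes([tail_byte]))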

@thirtytwobits
Contributor Author

thirtytwobits commented Jun 28, 2018

I'm behind a bit on this conversation. To catch up let me just try to close on my question about the need for the multi-frame CRC:

So the assertion is that the CRC is needed because only the toggle bit guarantees the proper ordering of the frames? Is this because of the multi-bit error vulnerability in CAN 2.0? If so, do we only need the CRC for CAN 2.0, and can we elide this mechanism when using CAN FD, where this defect has been fixed?

I'm still struggling to see how the transport CRC does anything except double-check the reliability of the bus. For FD this seems redundant given that the standard guarantees a Hamming distance of 6 for all message lengths. Remember that for 64-byte FD frames, of which every multi-frame transfer will have at least one, the CRC is expanded from 17 bits to 21 bits, so these transfers already add 4 more bits of overhead to guarantee reliability. Are we adding enough value to justify the additional 16 bits of overhead per transfer?

CAN FD and the CRC issue

@pavel-kirienko
Member

We need CRC to verify the correctness of de-framing on the receiving side. Improper deframing can be caused by:

  • Frame reordering on the emitting side.
  • Transfer ID being reset or mishandled on the emitting side (e.g. as a result of an unexpected restart of the emitting node).
  • Node ID conflict on the bus.

This is not related to bit-level distortions whatsoever.

@kjetilkjeka
Contributor

Just as a reminder: even though we've reached a conclusion on the padding, the proposal in #41 also affects the transfer CRC.
