Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDO block download logic aspects (sequence breaks, losts etc.) #205

Closed
lukegluke opened this issue Jul 1, 2020 · 9 comments
Closed

SDO block download logic aspects (sequence breaks, losts etc.) #205

lukegluke opened this issue Jul 1, 2020 · 9 comments

Comments

@lukegluke
Copy link
Contributor

@CANopenNode, I would like to discuss some aspects with SDO block download I've run into. Especially while you are working on renewed SDO server.

  1. CanOpen 301 spec not well defined under which conditions the server has to send an ACK (section 7.2.4.2.10) related to sequence breaks.
  2. For now CANopenNode sends ACK with seq of last correctly received segment on every (in theory) received messages with not expected sequence number
    if(seqno == (SDO->sequence + 1U))
    except case when we are waiting new subblock (or receive same segment)
    else if((seqno == SDO->sequence) || (SDO->sequence == 0U))
    By "in theory" I mean that messages are received at high rate, while SDO process is done more rare, so in fact driver do not manage to send ACK on each message with wrong seq.
  3. Spec says that

If ACK number does not correspond with the sequence number of the last segment sent by the client during this block transfer the client shall retransmit all segments discarded by the server within the next block transfer.

  1. SDO client in our case send data by subblocks, it can't stop in the middle (and looks like spec also expects this kind of SDO client behavior).

Keeping this in mind.

I think we shouldn't send ACK on every wrong seq if received seqno is less than expected (SDO->sequence + 1U).
If SDO client start resend subblock from very first segment (discarding for now how we got into this situation), CANopenNode periodically (form SDO_process) sends ACK back.
изображение

After client completes, it handled queued received ACK from CANopenNode SDO server, and start resend "missed" data.
CANopenNode SDO server for that time received whole subblock, ACK it [I will write ACK=127 for simplicity] and waits first segment of new block ignoring anything else.
If client SDO doesn't squash received ACKs and simply handles them just one by one, pulling them from received queue - it will restart subblock transfers quite several times for each previously received ACK (starting 107 seq in pictured case). In theory as many times as it saves segments before, but in fact much fewer.

Another issue. What if ACK of last segment of subblock gets lost?
CANopenNode SDO server will move internal sequence to 0, expecting 1 segment of new subblock. But client SDO missed ACK and is still waiting ACK. After timeout CANopenNode will send ACKseq=0, it force client to retransmit whole previous subblock, but CANopenNode will write it as a new one - data get corrupted.
It looks like in this particular case - if there is a timeout after sending ACK of last segment of subblock - we should repeat ACK=127, not the ACKseq=0 of new subblock [the case when whole new subblock get totally lost, but client received our ACK=127 is unbelievable and it is ok to get corrupted data in such unusual unavoidable case].

This issues combined with some particular aspects of SDO client could lead to infinite loop resending blocks again and again (I've encountered this).

To reduce useless retransmissions we should avoid by maximum sending duplicated ack on seq break:
изображение

I will add missed checks for CO_SDO_ST_DOWNLOAD_BL_SUB_RESP and CO_SDO_ST_DOWNLOAD_BL_SUB_RESP_2 in CO_SDO_receive - to ignore new received messages in this states.

Probably we could add some timeout var (or state?) to not resend last ACK if we already send one, but still receiving seq breaks.

p.s. I do not use CANopenNode SDO client and don't know nothing about it, probably it should be inspected for this aspects too.

@CANopenNode
Copy link
Owner

I agree, we should avoid sending duplicated ack on seq break.

It is long time, since I was working on SDO server. I didn't know about the problem of duplicated acks. I think, there should be exactly one ack. Server should then ignore all segments, and wait for segment no. 1. This is how I understand block transfer. But I'm not sure, how exactly it is implemented in current SDO server.

To explain a little more:

  • SDO client downloads block of data, 127 segments of subblock is currently being transferred.
  • SDO client correctly sends 106 segment, then makes a mistake and sends wrong sequence number
  • SDO server detects last correct segment and sends ackseq=106
  • SDO client sends some more segments (or all as it ittended). Then reads ack and realizes, it has to re-transmit all data after segment 106.
  • SDO client starts new subblock with seqno=1 and data after previous segment 106.
  • SDO server detected new subblock (sequno=1) and starts reception.

I have recently renewed SDO client. Internal states are now much more clearly defined. I will do similar for SDO server. I thing, curent SDO server is quite a mess. Please take a look into doxigen documentation here.

@lukegluke
Copy link
Contributor Author

Server should then ignore all segments, and wait for segment no. 1.

If we are talking about ACK on sequence break, it is rather correctly to say: "server should then ignore all segments, and wait for segment with seq next to correctly received last".
And current SDO server is already waiting right seq and continue fill data right, but it does not ignore other segments (rest after braked subblock) and try to ack every not expected msg.

Just in case, my first picture is unusual and specific case, please do not take in mind how I got there and why SDO client starts resend subblock from beginning while SDO server ACK=106. I just want to illustrate argument for not sending ACK on every wrong seq if received seqno is less than expected.
And looks like this just a particular case

I think, current SDO server is quite a mess.

There are several place to optimize and clear, but I wouldn't be so categorical about this :)

@CANopenNode
Copy link
Owner

  1. Server should then ignore all segments, and wait for segment no. 1.
  1. Server should then ignore all segments, and wait for segment with seq next to correctly received last.

That is a big difference.

As you cite the standard, it says: "If this number (ackseq, sent by server) does not correspond with the sequence number of the last segment sent by the client during this block transfer the client shall retransmit all segments discarded by the server within the next block transfer."

And above that, standard says: "The block data is transferred to the server by a sequence of segments. Each segment consists of the data and a sequence number starting by 1, which is increased for each segment by 1 up to blksize."

There is some more about ackseq: "sequence number of last segment that was received successfully during the last block download. If ackseq is set to 0 the server indicates the client that the segment with the sequence number 1 was not received correctly and all segments shall be retransmitted by the client."

So the client does not retransmit data, which was already accepted and sequence always start with 1. I think, that is also the most optimal and reliable approach for SDO block transfer.

@lukegluke
Copy link
Contributor Author

lukegluke commented Jul 2, 2020

I just realized how you apparently interpret written in spec, now I can see this logic in spec cites too.
I illustrate it with picture, please correct if I'm wrong:
изображение

Is this right? It's important change in concept and SDO client must support this logic!

By the way, this aspect I mentioned "What if ACK of last segment of subblock gets lost?" still important in new logic too and should be taken into account in implementation.

Update: I've just checked out SDO server upload logic, I didn't look at it previously. And, yes, it use logic from right side of the picture: always starting seq from 1 in each block transfer (preliminarily filling buffer with data discarded previously).

@lukegluke
Copy link
Contributor Author

lukegluke commented Jul 2, 2020

@CANopenNode, just in case, once again, please absolutely don't look on my first picture in the first comment to remind yourself about current logic of SDO server block download! Current CANopenNode SDO server and external SDO client worked according to "current logic" I posted last, the first picture in the first comment is just a unsuccessfully cut fragment and has history above and some issues with client, that's why it start sending from 1 - it's an unusual specific bug case for current logic.

@CANopenNode
Copy link
Owner

I think, the "new logic" is correct and according to standard. This is also, how SDO client from CANopenNode is written. As I said, I'm not absolutely sure for SDO server, but I think, it was intended to work according to "new logic" from the beginning.

What if ACK of last segment of subblock gets lost?

I think, CAN bus itself has very, very high reliability to deliver each message successfully. (I read interesting comparison some time ago, but i don't remember what exactly it says.)
Anyway, if there is missing ack message (normal or SDO block transfer), communication is aborted by timeout.

@lukegluke
Copy link
Contributor Author

lukegluke commented Jul 3, 2020

I think, the "new logic" is correct and according to standard.

Yes, today I rethink this and totally agree that "new logic" = the logic that standard suggest, and current one is wrong.

I'm not absolutely sure for SDO server, but I think, it was intended to work according to "new logic" from the beginning.

As I said it is true only for block uploading, but not for downloading.
Just fresh example with seq break (I've got plenty of them):
изображение

Renewed SDO server in new master should be corrected indeed. As suggestion to give me a motivation to move to new master :), we can not fixed this in old master, just add a note.

I think, CAN bus itself has very, very high reliability to deliver each message successfully. (I read interesting comparison some time ago, but i don't remember what exactly it says.)

I agree, but there also could be software aspects on other devices. In the end I'm talking about just one particular case. Of course we can let it go and probably never encountered it.

Anyway, if there is missing ack message (normal or SDO block transfer), communication is aborted by timeout.

In current implementation, on block downloading communication is aborted only on second timeout, on the first timeout SDO server resend last ACK.
изображение

I just suggest to check one case - when we are wait for seq 1 (confirm whole sub-block ACK=blksize in current wrong logic / or every time after sending ACK in correct standard logic), didn't receive anything lesser than last received seq and get timeout it is not good thing to send ACK=0 (as it will be done in current implementation), because SDO client will start resend old sub-block, but SDO server will write it as the data of new sub-block.

@martinwag
Copy link
Collaborator

I think, CAN bus itself has very, very high reliability to deliver each message successfully. (I read interesting comparison some time ago, but i don't remember what exactly it says.) Anyway, if there is missing ack message (normal or SDO block transfer), communication is aborted by timeout.

Thats correct. Either message is correctly placed on bus or your hardware will go into error state (see https://assets.vector.com/cms/content/know-how/can/Graphics/CAN_FD_Poster_V2.2.jpg). Your driver should then generate a Bus Off Emergency (that it most likely can't send on the bus because of bus off condition...).

@CANopenNode
Copy link
Owner

I'm not quite sure, how is with this issue now. Currently we have:

  • new Object Dictionary approach, rewritten and tested SDO server in CANopenNode version 4.0
  • rewritten and tested SDO client in CANopenNode version 2.0 and above
  • no additional information for old Object Dictionary and old SDO server.
    I'm closing this issue for now. If there will be similar problems, new issue should be opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants