
Citing Experimental RFCs as if being a part of CUBIC #90

Closed
goelvidhi opened this issue Aug 31, 2021 · 45 comments · Fixed by #111 or #136

@goelvidhi
Collaborator

@goelvidhi goelvidhi commented Aug 31, 2021

Markku Kojo said,

The draft says that CUBIC MAY implement DSACK [RFC3708], limited slow
start [RFC 3742], [RFC7661] and hybrid slow start [cites a paper].
Aren't the first three down references? Not sure if it is appropriate
for a Stds Track document to cite experimental work or a paper like
this even though it's a MAY.

@larseggert
Member

@larseggert larseggert commented Aug 31, 2021

These references are all informative, which means they are not DOWNREFs by definition (see RFC3967).

Propose to close with no action. Markku?

@larseggert larseggert self-assigned this Aug 31, 2021
@larseggert
Member

@larseggert larseggert commented Sep 15, 2021

No word from @markkukojo, closing this. @markkukojo, please reopen if you disagree.

@larseggert larseggert closed this Sep 15, 2021
@markkukojo
Collaborator

@markkukojo markkukojo commented Sep 16, 2021

Right, the references are currently informative, but I think the issue is more subtle: how some of the experimental RFCs are now cited puts them into a "grey area".

Sec 4.10:
In sec 4.10 the draft says "CUBIC MUST employ slow-start" and then continues by allowing CUBIC to use RFC 3742 or [HR08] as the slow start algorithm for fast and long-distance networks by using the keyword MAY. So, an implementer MUST implement slow-start, and if he/she chooses to implement RFC 3742, he/she must read and understand RFC 3742 in order to implement it. In my view this would require RFC 3742 to become a normative reference and thereby a down ref, or at least puts it in a grey area.

A solution to this could be simply to say: "CUBIC MUST employ a slow-start algorithm [RFC 5681]."
And it does not need to say anything like "MAY use x" because RFC 3742 (and [HR08]) is less aggressive than regular slow start so use of it is allowed as any other experimental RFC (for experimental purposes).

I understand that the problem here is that if one decides to use [HR08], one needs to take that into account when setting W_max , etc. as described in the draft. However, I'm not sure whether it's reasonable to allow (by using MAY) employing an algorithm in a research paper (or in an Exp RFC) as part of a Stds Track RFC? My interpretation of using MAY in this way is that it sends a message that "while the cited algorithm is experimental (or research) only and should not be used in the public Internet other than for experimental purposes, it is, however, ALL OK to deploy it when used with CUBIC".

So, instead of using MAY, the draft could simply say that "if CUBIC uses [HR08], it may exit ..."
This suggests dropping the second sentence (starting: "Among ...") of the first para as unnecessary and potentially confusing.

Sec 4.9:

Here also the use of MAY is quite unnecessary and potentially results in similar confusion about the status of RFC 3708.
There is no reason to separately allow (by using MAY) CUBIC to employ RFC 5682 or RFC 3522 as they are Stds Track, nor to separately allow employing RFC 3708 as it can be employed with same provisions as any Experimental algorithm.

Again, it would be just enough to note that if any of these are used, the mentioned CUBIC variables need to be restored as described. However, the authors should carefully consider what RFC 4015 says and what limitations there are for a congestion control response when a spurious RTO is detected. RFC 4015 is currently the only Stds Track CC response algorithm for spurious RTOs. There is no Stds Track CC response algorithm for false Fast Retransmits; this topic was studied around two decades ago and there was no consensus on whether restoring CC state is a good idea in all scenarios. E.g., in mobile networks, a make-before-break handoff from a high RTT path to a low RTT path may result in packet reordering when the available network capacity actually decreases and restoring the old state in such a case is not a sane action. Therefore, I think that this draft is not in position to provide a reasonable suggestion to unconditionally restore CC state after spurious congestion events nor to provide evidence that it is always the correct action. Moreover, handling spurious retransmissions is not a key part of CUBIC but it is a more general issue where other CCs have the same challenges as CUBIC and therefore I think the CC response for spurious rexmits should be handled separately and should follow common principles for all CCs.

Sec 4.6:

The draft allows using RFC 7661 as described in the RFC. However, if I recall it correctly, the equations in RFC 7661 assume that cwnd increase and decrease use standard methods, i.e., alpha=1 and beta=0.5. If the equations in RFC 7661 are directly applied with CUBIC, the outcome might be neither expected nor correct. I didn't have time to check this, so I might be wrong, but this is good to double-check with Gorry and/or other authors of RFC 7661 who have better insight into the details of RFC 7661.

@larseggert
Member

@larseggert larseggert commented Sep 16, 2021

Please check #90, which intends to address your first two comments.

About RFC7661, would @gorryfair please chime in?

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Sep 21, 2021

@markkukojo

Therefore, I think that this draft is not in position to provide a reasonable suggestion to unconditionally restore CC state after spurious congestion events nor to provide evidence that it is always the correct action. Moreover, handling spurious retransmissions is not a key part of CUBIC but it is a more general issue where other CCs have the same challenges as CUBIC and therefore I think the CC response for spurious rexmits should be handled separately and should follow common principles for all CCs.

Response to spurious events due to Fast retransmits was newly added to the 8312-bis draft, as it made sense to cover this scenario, which is likely more common than spurious RTOs, and folks who implemented QUIC noticed a great throughput improvement when implementing this. I agree that this is common to all CCs, but as of today, I don't think any RFC covers it. For completeness, it made sense to add it to CUBIC.

There is no Stds Track CC response algorithm for false Fast Retransmits; this topic was studied around two decades ago and there was no consensus on whether restoring CC state is a good idea in all scenarios. E.g., in mobile networks, a make-before-break handoff from a high RTT path to a low RTT path may result in packet reordering when the available network capacity actually decreases and restoring the old state in such a case is not a sane action.

How common is the scenario you are describing above? And if we are switching to low RTT, then the sender will soon-after reduce the congestion window if restoring to old state was a mistake. (i.e. if the old state cwnd > capacity for low RTT path)

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Sep 21, 2021

The draft allows using RFC 7661 as described in the RFC. However, if I recall it correctly, the equations in RFC 7661 assume that cwnd increase and decrease use standard methods, i.e., alpha=1 and beta=0.5. If the equations in RFC 7661 are directly applied with CUBIC, the outcome might be neither expected nor correct. I didn't have time to check this, so I might be wrong, but this is good to double-check with Gorry and/or other authors of RFC 7661 who have better insight into the details of RFC 7661.

Regarding RFC 7661, there are two things to consider,

  1. Threshold for validated vs non-validated phase - this threshold has been set by 7661 based on slow-start increase and is valid for CUBIC too.

A standard TCP sender in slow-start is permitted to double its FlightSize from one RTT to the next. This motivated the choice of a threshold value of 1/2. This threshold ensures a sender does not further increase the cwnd as long as the FlightSize is less than (1/2*cwnd)

  2. Congestion response - RFC 7661 uses 1/2 as the reduction factor which is probably taken from New Reno. I think this is what we need to update in CUBIC. Something like this,
CUBIC updates the response to congestion when RFC 7661 is used to validate the congestion window.
It updates the congestion window for loss recovery as below,

cwnd = (Max(pipeACK,LossFlightSize)) * Beta_cubic

It also updates the congestion window at the end of the recovery phase to,
cwnd = (Max(pipeACK,LossFlightSize) - R) * Beta_cubic

I don't know how to best document this, should we re-write these equations and explain all the variables or is there a way to reference RFC 7661 and just say, If you use RFC 7661, CUBIC updates Section 4.4.1 of RFC 7661 to replace the 0.5 reduction factor by Beta_cubic?
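
To make the proposed change concrete, here is a minimal sketch of the two window computations with the reduction factor left as a parameter, so that the 1/2 used by RFC 7661 and Beta_cubic can be compared side by side. The function names and byte units are hypothetical, the value 0.7 for Beta_cubic is assumed from RFC 8312, and the validated-phase check uses pipeACK against half of cwnd as an assumption about how the threshold quoted above is applied. It is an illustration, not text for the draft.

```python
BETA_CUBIC = 0.7  # assumed value of Beta_cubic (as given in RFC 8312)


def is_validated(pipe_ack: float, cwnd: float) -> bool:
    """Assumed check: the sender is in the validated phase when
    pipeACK is at least half of cwnd (the 1/2 threshold quoted above)."""
    return pipe_ack >= cwnd / 2


def cwnd_on_loss(pipe_ack: float, loss_flight_size: float,
                 factor: float = BETA_CUBIC) -> float:
    """cwnd = Max(pipeACK, LossFlightSize) * factor."""
    return max(pipe_ack, loss_flight_size) * factor


def cwnd_after_recovery(pipe_ack: float, loss_flight_size: float,
                        retransmitted: float,
                        factor: float = BETA_CUBIC) -> float:
    """cwnd = (Max(pipeACK, LossFlightSize) - R) * factor,
    where R is passed in as `retransmitted`."""
    return (max(pipe_ack, loss_flight_size) - retransmitted) * factor
```

Passing `factor=0.5` reproduces the reduction the comment attributes to RFC 7661, which makes the difference between the two variants easy to compare for a given pipeACK and LossFlightSize.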

@larseggert
Member

@larseggert larseggert commented Sep 21, 2021

The latter seems easier to me, but I don't have a strong preference. Waiting to hear from @markkukojo and/or @gorryfair.

@larseggert
Member

@larseggert larseggert commented Sep 29, 2021

I'd like to merge #111 soon. Is there anything that needs to be changed in it to address this issue?

@markkukojo
Collaborator

@markkukojo markkukojo commented Sep 29, 2021

I'll take a look at it later today.

@markkukojo
Collaborator

@markkukojo markkukojo commented Sep 30, 2021

@markkukojo

Therefore, I think that this draft is not in position to provide a reasonable suggestion to unconditionally restore CC state after spurious congestion events nor to provide evidence that it is always the correct action. Moreover, handling spurious retransmissions is not a key part of CUBIC but it is a more general issue where other CCs have the same challenges as CUBIC and therefore I think the CC response for spurious rexmits should be handled separately and should follow common principles for all CCs.

Response to spurious events due to Fast retransmits was newly added to the 8312-bis draft, as it made sense to cover this scenario, which is likely more common than spurious RTOs, and folks who implemented QUIC noticed a great throughput improvement when implementing this. I agree that this is common to all CCs, but as of today, I don't think any RFC covers it. For completeness, it made sense to add it to CUBIC.

Let me try again. There are several reasons why the existence of the current text is problematic:

  1. Spurious rexmits are not part of CUBIC nor specific to CUBIC. I think this draft should focus on CUBIC and not include various features present in Linux or any other stack, no matter how useful they are particularly when the problem space is orthogonal to CUBIC. UNDO of congestion control state was there in Linux well before CUBIC was introduced and before the cited RFCs were published.
  2. As I said, such an UNDO comes with caveats. Certainly it provides benefits in certain scenarios, but may also be harmful in some other scenarios, particularly if various safeguards are not implemented.
  3. Why is there no RFC to cover the response to false fast retransmits? Because there are cases when one should not revert the CC state as it would result in undesired behaviour, such as incorrectly bursting uncontrolled data in case one should actually reduce cwnd. This is also the reason why all these detection algorithms (RFC 5682, RFC 3522, and RFC 3708) were separated from the response.
  4. The way that the current text proposes for reverting the congestion control state in case of a spurious RTO is in conflict with RFC 4015, which is Stds Track. What's wrong with the steps specified in RFC 4015 such that one should not follow it but instead follow this draft, which in turn would require an update to RFC 4015?
    If this draft needs to say anything, it should cite RFC 4015 for the response to detected spurious RTOs and indicate the additional state information to store and restore in case of CUBIC.

There is no Stds Track CC response algorithm for false Fast Retransmits; this topic was studied around two decades ago and there was no consensus on whether restoring CC state is a good idea in all scenarios. E.g., in mobile networks, a make-before-break handoff from a high RTT path to a low RTT path may result in packet reordering when the available network capacity actually decreases and restoring the old state in such a case is not a sane action.

How common is the scenario you are describing above? And if we are switching to low RTT, then the sender will soon-after reduce the congestion window if restoring to old state was a mistake. (i.e. if the old state cwnd > capacity for low RTT path)

The scenario above is just one example. I don't have an answer how common it might be today. AFAIK at least cellular networks used to attempt delivering data in order up until 3G. Don't know that well what is the exact situation with 4G and 5G (or 6G). But if one would like to propose reverting CC state in case of false fast retransmits one should provide evidence that such scenarios are non-existing today and unlikely to (re)appear in the near future. E.g., is there evidence that route fluttering does not exist anymore? Moreover, there have recently been proposals which suggest that link layers should not anymore deliver data in order. E.g. Bob has been advocating it. How about if it gets realised anytime soon with some technologies?

Sure the sender will reduce the congestion window if restoring to old state was a mistake, but the new available capacity might be only a small fraction of the old and it would possibly require reducing cwnd a notable number of times before cwnd decreases enough. And such a sender behaviour would be detrimental to other flows from which it would steal the bandwidth unfairly.
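
As a rough worked example of "a notable number of times", the sketch below counts how many multiplicative decreases it takes for a restored cwnd to fall to a smaller target. It assumes each congestion event multiplies cwnd by a fixed factor with no window growth in between, which is a simplification; the function name is hypothetical and the factor 0.7 is assumed from RFC 8312.

```python
def reductions_needed(cwnd: float, capacity: float, beta: float = 0.7) -> int:
    """Count multiplicative decreases until cwnd no longer exceeds capacity.

    Simplifying assumption: no window growth happens between reductions.
    """
    if capacity <= 0 or not 0 < beta < 1:
        raise ValueError("capacity must be positive and beta must be in (0, 1)")
    count = 0
    while cwnd > capacity:
        cwnd *= beta
        count += 1
    return count


# Restored window of 1000 segments, new path supports only 100.
print(reductions_needed(1000, 100))            # prints 7
print(reductions_needed(1000, 100, beta=0.5))  # prints 4
```

Under these assumptions, a path whose capacity dropped to a tenth of the restored window needs seven separate congestion events with a factor of 0.7, compared with four when halving.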

@markkukojo
Collaborator

@markkukojo markkukojo commented Sep 30, 2021

  1. Congestion response - RFC 7661 uses 1/2 as the reduction factor which is probably taken from New Reno. I think this is what we need to update in CUBIC. Something like this,
CUBIC updates the response to congestion when RFC 7661 is used to validate the congestion window.
It updates the congestion window for loss recovery as below,

cwnd = (Max(pipeACK,LossFlightSize)) * Beta_cubic

It also updates the congestion window at the end of the recovery phase to,
cwnd = (Max(pipeACK,LossFlightSize) - R) * Beta_cubic

Has this been implemented for CUBIC and experimented with so we know it works properly? And do we have data to show this is the case?

In other words, would this be mature enough to become Stds Track and just for CUBIC, while RFC 7661 is experimental and the problem space once again is not specific to CUBIC but common for all CCs?

Anyway, I'd like to hear from those who have worked with RFC 7661 and do have a lot of experience on it. And would definitely have much better insight in any potential caveats that might be buried in there.

I don't know how to best document this, should we re-write these equations and explain all the variables or is there a way to reference RFC 7661 and just say, If you use RFC 7661, CUBIC updates Section 4.4.1 of RFC 7661 to replace the 0.5 reduction factor by Beta_cubic?

IMO this problem space is common to all CCs and still experimental, so the proposal is not mature enough and best handled in a separate document, either RFC7661bis or a draft of its own that may potentially be incorporated in RFC7661bis later.

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Sep 30, 2021

@huitema @nealcardwell In this issue #23, we discussed and added a new section for Spurious detection and response for CUBIC but @markkukojo doesn't think we should do that due to any adverse effects of reverting to old_cwnd in case there is route change that has much lower capacity than old_cwnd state.

Do you think citing RFC 4015 for spurious RTOs is enough? This would mean we wouldn't provide any guidance for spurious detection of false fast retransmits.

@huitema

@huitema huitema commented Oct 1, 2021

It has been experimented -- was the default option for using cubic in picoquic, before switching the default to BBR. No negative feedback. There were also extensive simulations done by the MS Quic team, as part of their systematic unit tests. We should ask them, but I believe these simulations are based on scenarios they have encountered in real deployments. Implementing spurious loss recovery in QUIC-Cubic was key to achieving parity with TCP-Cubic in these scenarios.

Links are not perfect, spurious reordering does happen -- that's quite visible on Wi-Fi and LTE links. I also saw some networks aggressively bundling the small UDP packets that carry QUIC ACK. Some networks will send small and large packets with different priorities. All that means that the standard recovery algorithm sometimes results in spurious loss detection. Cubic is not performing well in practice without spurious loss recovery. If this is not specified, implementations will probably just move to use BBR instead of Cubic. There is a perverse bad angel whispering to me that this would be a good thing, since BBR does result in smaller queues all over the Internet. Which means I am a bit conflicted...

If you are concerned about standardization status, one possibility would be to move that text to a non normative appendix.

@nealcardwell
Collaborator

@nealcardwell nealcardwell commented Oct 1, 2021

@huitema @nealcardwell In this issue #23, we discussed and added a new section for Spurious detection and response for CUBIC but @markkukojo doesn't think we should do that due to any adverse effects of reverting to old_cwnd in case there is route change that has much lower capacity than old_cwnd state.

I would strongly urge the draft to keep the "Spurious Congestion Events" section, since reordering and spurious retransmits are common and getting more common, so undo of CC reductions is critical and becoming more so.

In the scenario mentioned that has a route change that causes a reordering event and reduction in available bandwidth, it's still fine to undo based on detecting the spurious retransmits. If there are real losses at any point, those will be detected and the CC will react. The likely order in this scenario would be:

  • route change to path with lower RTT and bandwidth
  • RTT decreases, sender receives SACKs for reordered packets
  • enter spurious fast recovery -> spurious retransmit and CUBIC reduces ssthresh and cwnd
  • spurious retransmit detected -> CUBIC undo, restoring parameters from before spurious recovery
  • real packet losses happen due to lower bandwidth -> enter well-founded fast recovery -> CUBIC reduces ssthresh and cwnd
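
The order of events above can be traced with a small sketch of a sender that snapshots its state before each reduction and restores it on undo. The class and field names are hypothetical, only cwnd, ssthresh and W_max are tracked (a real CUBIC implementation would need to save further variables), and the factor 0.7 is assumed from RFC 8312. It illustrates the save/restore idea, not the draft's exact procedure.

```python
import copy
from dataclasses import dataclass
from typing import Optional

BETA_CUBIC = 0.7  # assumed multiplicative decrease factor


@dataclass
class CubicState:
    cwnd: float
    ssthresh: float
    w_max: float


class CubicSender:
    """Hypothetical sender that can undo one congestion response."""

    def __init__(self, cwnd: float, ssthresh: float) -> None:
        self.state = CubicState(cwnd=cwnd, ssthresh=ssthresh, w_max=cwnd)
        self.saved: Optional[CubicState] = None

    def on_congestion_event(self) -> None:
        # Snapshot the state first so the reduction can be undone later.
        self.saved = copy.copy(self.state)
        self.state.w_max = self.state.cwnd
        self.state.ssthresh = self.state.cwnd * BETA_CUBIC
        self.state.cwnd = self.state.ssthresh

    def on_spurious_detected(self) -> None:
        # Undo: restore the snapshot taken before the spurious reduction.
        if self.saved is not None:
            self.state = self.saved
            self.saved = None


sender = CubicSender(cwnd=100.0, ssthresh=1000.0)
sender.on_congestion_event()   # spurious fast recovery: window reduced
sender.on_spurious_detected()  # undo: window back to 100
sender.on_congestion_event()   # real loss on the new path: reduced again
```

In this trace the window returns to its pre-recovery value after the undo and is reduced again once real losses are detected, which is the sequence described in the list.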

Do you think citing RFC 4015 for spurious RTOs is enough? This would mean we wouldn't provide any guidance for spurious detection of false fast retransmits.

I would suggest citing all the major mechanisms that major TCP stacks use in the real world to detect spurious retransmits, namely something like: "CUBIC implementations MAY use any of the following documented mechanisms for detecting spurious congestion events: Eifel [rfc3522] [rfc4015], F-RTO [rfc5682], DSACK [rfc3708]."

I believe their statuses are:

  • Eifel rfc3522 (exp), rfc4015 (std)
  • F-RTO: rfc5682 (std)
  • DSACK: rfc3708 (exp)

Really TCP stacks SHOULD use these, but if there's a problem with a stds track doc saying a stack SHOULD use an experimental RFC then perhaps MAY is OK?

@larseggert
Member

@larseggert larseggert commented Oct 2, 2021

If we want to say "MUST implement" for something that is not on the standards track, maybe that's a reason to republish that document as a PS. Which documents specifically would this be the case for?

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Oct 2, 2021

I was wondering why Eifel detection (RFC 3522) is Experimental while the response on detection (RFC 4015) is Stds Track.
F-RTO is a Stds Track detection of spurious RTOs which does not require timestamps to be enabled (unlike RFC 3522).

Should we say,
MUST implement F-RTO detection and RFC 4015 response algorithm. MAY implement other algorithms for detection, i.e. RFC 3522 and RFC 3708

@larseggert
Member

@larseggert larseggert commented Oct 4, 2021

IIRC, RFC3522 has an IPR disclosure that makes it difficult to implement in open source or otherwise. RFC4015 has no such IPR. The community at the time felt strongly that the IPR meant RFC3522 shouldn't be standards track. Pasi developed F-RTO in part because of the IPR in RFC3522.

@huitema

@huitema huitema commented Oct 4, 2021

Note that as far as QUIC is concerned, detecting spurious RTO is completely straightforward. QUIC packets have unique numbers, any data that need to be repeated is carried in new packets, with numbers different from the initial transmission. The sender just has to record an ack of a packet that was considered lost, there is no requirement to implement F-RTO or other algorithms.
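
A minimal sketch of that bookkeeping, assuming packet numbers are never reused: the sender remembers which packet numbers it declared lost and flags a spurious loss when one of them is later acknowledged. The class and method names are hypothetical and do not come from any particular QUIC stack.

```python
class LossTracker:
    """Hypothetical tracker for spurious loss detection by packet number."""

    def __init__(self) -> None:
        self.declared_lost: set[int] = set()

    def on_packet_lost(self, packet_number: int) -> None:
        # Remember that loss detection gave up on this packet number.
        self.declared_lost.add(packet_number)

    def on_ack(self, packet_number: int) -> bool:
        """Return True if this ACK shows an earlier loss declaration was spurious."""
        if packet_number in self.declared_lost:
            self.declared_lost.discard(packet_number)
            return True
        return False


tracker = LossTracker()
tracker.on_packet_lost(7)      # packet 7 declared lost; its data goes out in a new packet
late_ack = tracker.on_ack(7)   # the original packet 7 is acknowledged after all
```

Because retransmitted data is carried under a new packet number, an acknowledgment for a number in the lost set can only refer to the original transmission, so no ambiguity needs to be resolved.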

@larseggert
Member

@larseggert larseggert commented Oct 11, 2021

Heads up: I plan to merge #111 soon, unless I hear objections. If you do object, please make a concrete rephrasing proposal (on the PR).

@markkukojo
Collaborator

@markkukojo markkukojo commented Oct 14, 2021

It has been experimented -- was the default option for using cubic in picoquic, before switching the default to BBR. No negative feedback. There were also extensive simulations done by the MS Quic team, as part of their systematic unit tests. We should ask them, but I believe these simulations are based on scenarios they have encountered in real deployments. Implementing spurious loss recovery in QUIC-Cubic was key to achieving parity with TCP-Cubic in these scenarios.

Thanks Christian for the data point. Just wondering whether route fluttering was included in the test cases?

Links are not perfect, spurious reordering does happen -- that's quite visible on Wi-Fi and LTE links. I also saw some networks aggressively bundling the small UDP packets that carry QUIC ACK. Some networks will send small and large packets with different priorities. All that means that the standard recovery algorithm sometimes results in spurious loss detection. Cubic is not performing well in practice without spurious loss recovery. If this is not specified, implementations will probably just move to use BBR instead of Cubic. There is a perverse bad angel whispering to me that this would be a good thing, since BBR does result in smaller queues all over the Internet. Which means I am a bit conflicted...

Sure, links are not perfect. That's why these algos were brought to IETF some two decades ago. At that time maybe the hardest thing for us was to convince that spurious rexmits are a real problem ;)

There were quite a few lessons learnt in that exercise in those days and it would be good not to ignore them all. Much is documented in those RFCs but not all, e.g., response to spurious fast rexmits never materialised.
That's why I think it is not the best option from the process point of view to embed a response to spurious rexmits in this CUBIC spec, as the problem area and solution are common to all CCs. Having it in a specific response draft would also allow/invite those with expertise in this topic to join in reviewing and contributing, unlike now, when it pretty much went unnoticed by many because no one expected it to be a part of the CUBIC algo (me included).

A correct response (UNDO) should not simply restore the previous state but consider various issues like adjusting DupThresh (or RACK reorder wnd) to avoid further spurious rexmits, controlling possible bursts due to UNDO, and other safeguards (see more in a later comment). In addition, all this requires experiments in a range of environments to make it an experimental RFC, not to mention Stds Track.

@markkukojo
Collaborator

@markkukojo markkukojo commented Oct 14, 2021

Do you think citing RFC 4015 for spurious RTOs is enough? This would mean we wouldn't provide any guidance for spurious detection of false fast retransmits.

I would suggest citing all the major mechanisms that major TCP stacks use in the real world to detect spurious retransmits, namely something like: "CUBIC implementations MAY use any of the following documented mechanisms for detecting spurious congestion events: Eifel [rfc3522] [rfc4015], F-RTO [rfc5682], DSACK [rfc3708]."

I believe their statuses are:

  • Eifel rfc3522 (exp), rfc4015 (std)
  • F-RTO: rfc5682 (std)
  • DSACK: rfc3708 (exp)

Please note that RFC 4015 is a stds track response algorithm for spurious RTOs. The question was why the advice in RFC 4015 is wrong and why this draft would need to modify it, as the current text does, thereby possibly requiring an update to RFC 4015, instead of pointing to RFC 4015 as the response algo for spurious RTOs and just giving the extra state that needs to be stored/restored?

@larseggert
Member

@larseggert larseggert commented Oct 14, 2021

@markkukojo could you explain what changes you would like to see in #111 or what is missing from it?

@markkukojo
Collaborator

@markkukojo markkukojo commented Oct 14, 2021

IIRC, RFC3522 has an IPR disclosure that makes it difficult to implement in open source or otherwise. RFC4015 has no such IPR. The community at the time felt strongly that the IPR meant RFC3522 shouldn't be standards track. Pasi developed F-RTO in part because of the IPR in RFC3522.

Yes, I think the IPR issue is one reason but not the only one. There is much in those RFCs, so my apologies for not opening it all up when I said there are caveats. AFAIK, another reason why RFC 3522 (and RFC 3708) is not Stds Track is that it is vulnerable to lying receivers that can easily make a fast rexmit appear as a spurious rexmit. There is a mitigation proposed for Eifel detection but it calls for experimentation that we do not have AFAIK. As Ack suppression is deemed to be quite common today, such experimentation would be even more crucial to show that it works appropriately. Without a safe detection for fast rexmits it is hard to propose a response.

As Christian noted, QUIC is much more straightforward in detecting spurious rexmits, so should not suffer from similar problems.

@markkukojo
Collaborator

@markkukojo markkukojo commented Oct 14, 2021

@markkukojo could you explain what changes you would like to see in #111 or what is missing from it?

Text for Slow Start section:

CUBIC MUST employ a slow-start algorithm, when *cwnd* is no more than
*ssthresh* {{!RFC5681}}.

If CUBIC uses HyStart++ {{!I-D.ietf-tcpm-hystartplusplus}} for early
exit from the initial slow start, it may exit the slow start without
incurring any packet loss  ...

Reasoning:

  • HyStart++ is only for the initial slow start. Slow start is used elsewhere as well and HyStart++ is not applicable then. That is, HyStart++ cannot be a general alternative for the std slow start; in particular, "in the rare cases when HyStart++ is not suitable" does not make much sense.
  • HyStart++ is essentially the same as the initial slow start in RFC 5681. It essentially specifies an additional algorithm to decide on an early exit from slow start. (That would be good to be reflected in the Title of HyStart draft as well: HyStart++: An algorithm for Early Exit from TCP Initial Slow Start)
  • I asked but did not get an answer as to why CUBIC would be dependent on HyStart++ (use SHOULD), i.e., what are the full implications that must be understood and carefully weighed before choosing not to implement HyStart++ with CUBIC? I don't see any such reasons, maybe I am missing something?
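
The second point, that an early-exit algorithm sits on top of ordinary slow start rather than replacing it, can be sketched as follows. The per-ACK increase follows the slow-start rule of RFC 5681 (grow by at most one SMSS per ACK). The `early_exit_signal` flag is a hypothetical stand-in for whatever an early-exit algorithm decides, and modelling the exit as `ssthresh = cwnd` is a simplification, not a statement of what HyStart++ specifies. The sketch also treats slow start as applying only while cwnd is strictly below ssthresh, which is an assumption made to keep the exit unambiguous.

```python
def slow_start_on_ack(cwnd: int, ssthresh: int, acked_bytes: int, smss: int,
                      early_exit_signal: bool = False) -> tuple[int, int]:
    """Return the updated (cwnd, ssthresh) after one ACK.

    Only the slow-start phase is modelled; congestion avoidance
    (CUBIC's window growth) is deliberately left out.
    """
    if cwnd >= ssthresh:
        # Not in slow start: nothing to do in this sketch.
        return cwnd, ssthresh
    if early_exit_signal:
        # Hypothetical early exit: leave slow start without any packet loss
        # by lowering ssthresh to the current window.
        return cwnd, cwnd
    # Standard slow start: grow by at most one SMSS per ACK.
    return cwnd + min(acked_bytes, smss), ssthresh
```

With the flag left at its default, the function is plain slow start, which is the sense in which the early-exit decision is an addition rather than an alternative.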

For spurious rexmits:

I cannot provide text as I have expressed my strong opinion:
  • This draft must not provide an alternative stds track specification for response to spurious RTOs; we already have it in RFC 4015. There has been no discussion in the tcpm wg, no data provided, nor any justification provided to change RFC 4015.
  • This draft is not mature enough to provide a new stds track specification for response to spurious fast rexmits. There has been no discussion in the tcpm wg on such algorithm considerations, and no experiments or data have been provided. There are two experimental algorithms for detecting false fast rexmits but they both have security implications which have not been considered for the proposed response.
  • A response to false fast rexmits should be a draft of its own because the solution is common to all congestion control algorithms, not in any way specific to CUBIC, and the solution deserves much more attention and more careful design than what this draft currently provides.

I'd suggest to move the text in sec 4.9 to sec 5.3 (Difficult Environments):
The first para extended with a brief intro for environments where and why spurious RTOs and false rexmits may occur. Pointers to results where the implications are shown. And pointing to detection algos and RFC 4015 for response to spurious RTOs + list what additional state in addition to ssthresh to store/restore.

@nealcardwell
Collaborator

@nealcardwell nealcardwell commented Oct 15, 2021

Re: "Much is documented in those RFCs but not all, e.g., response to spurious fast rexmits never materialised."...

Even if RFCs for responding to spurious fast retransmits never happened, the code to do that has been widely deployed and used in the Internet with good results, including at least in Linux TCP for over a decade. Since implementations are doing it, we may as well tell them how to do it well. :-)

Re: "That's why I think it is not the best option from the process point of view to embed a response to spurious rexmits in this CUBIC spec as the problem area and solution is common to all CCs."...

The solution is not common to all CCs, because the state variables for each CC are different. It is important for the CC spec to specify which variables need to be saved and restored if there is an undo event. That's why it's great that 8312bis covers those details.

I agree the question of whether/when to undo a fast recovery is a separate question. But that's why this 8312bis is referencing other docs to cover that.

@larseggert
Member

@larseggert larseggert commented Oct 15, 2021

Text for Slow Start section:

...

@markkukojo, so that entire paragraph should look like the following?

CUBIC MUST employ a slow-start algorithm, when *cwnd* is no more than
*ssthresh* {{!RFC5681}}. If CUBIC uses HyStart++
{{!I-D.ietf-tcpm-hystartplusplus}} for early exit from the initial
slow start, it may exit the slow start without incurring any packet
loss. Experimental alternatives to HyStart++ include hybrid slow
start {{HR08}}, a predecessor to HyStart++ that some CUBIC
implementations have used as the default for the last decade, and
limited slow start {{?RFC3742}}.

  • I asked but did not get an answer as to why CUBIC would be dependent on HyStart++ (use SHOULD), i.e., what are the full implications that must be understood and carefully weighed before choosing not to implement HyStart++ with CUBIC? I don't see any such reasons, maybe I am missing something?

IIRC one pragmatic reason is that there is no other standards-track mechanism.

I cannot provide text as I have expressed my strong opinion: -this draft must not provide an alternative stds track specification for response from spurious RTOs; we already have it in RFC 4015.

This sounds like you are arguing that text should be taken out of the document - which bits?

  • A response to false fast rexmits should be a draft of its own because the solution is common to all congestion control algorithms, not in any way specific to CUBIC, and the solution deserves much more attention and more careful design than what this draft currently provides.

Again, this sounds like you are arguing that text should be taken out of the document - which bits?

I'd suggest to move the text in sec 4.9 to sec 5.3 (Difficult Environments): The first para extended with a brief intro for environments where and why spurious RTOs and false rexmits may occur. Pointers to results where the implications are shown. And pointing to detection algos and RFC 4015 for response to spurious RTOs + list what additional state in addition to ssthresh to store/restore.

First, I thought spurious RTOs and false RTXs are not uncommon on general Internet paths. So I'm not sure that moving that text into a section called "difficult environments" is appropriate.

Second, it sounds like you don't just want to move the existing text, you are outlining a complete rewrite of what we have. It would be more actionable if you could propose a draft of the concrete text you think should be there (ideally, as a PR.)

larseggert added a commit that referenced this issue Oct 18, 2021
* Rephrase text around algorithmic alternatives.

Fixes #90.

* Add HyStart++

* Fix spacing

* or -> and

* Update draft-ietf-tcpm-rfc8312bis.md

Co-authored-by: Lars Eggert <lars@eggert.org>

* Update draft-ietf-tcpm-rfc8312bis.md

* Incorporate suggestion from @bbriscoe, and strengthen rec for HyStart++.

* Update draft-ietf-tcpm-rfc8312bis.md

Co-authored-by: Vidhi Goel <goel.vidhi07@gmail.com>

* Update draft-ietf-tcpm-rfc8312bis.md

* Update draft-ietf-tcpm-rfc8312bis.md

Co-authored-by: Vidhi Goel <goel.vidhi07@gmail.com>
@markkukojo
Collaborator

@markkukojo markkukojo commented Oct 18, 2021

Text for Slow Start section:

...

@markkukojo, so that entire paragraph should look like the following?

CUBIC MUST employ a slow-start algorithm, when *cwnd* is no more than
*ssthresh* {{!RFC5681}}. If CUBIC uses HyStart++
{{!I-D.ietf-tcpm-hystartplusplus}} for early exit from the initial
slow start, it may exit the slow start without incurring any packet
loss. Experimental alternatives to HyStart++ include hybrid slow
start {{HR08}}, a predecessor to HyStart++ that some CUBIC
implementations have used as the default for the last decade, and
limited slow start {{?RFC3742}}.

No. Sorry, I meant with the three dots (...) that I modified the beginning of that para and the rest of the para may remain untouched. Please, see below for the full text.

CUBIC MUST employ a slow-start algorithm, when cwnd is no more than
ssthresh {{!RFC5681}}.

If CUBIC uses HyStart++ {{!I-D.ietf-tcpm-hystartplusplus}} for early
exit from the initial slow start, it may exit the slow start without
incurring any packet loss and thus Wmax is undefined.
In this special case, CUBIC
switches to congestion avoidance and increases its congestion window
size using {{eq1}}, where t is the elapsed time since the beginning
of the current congestion avoidance, K is set to 0,
and Wmax is set to the congestion window size at the
beginning of the current congestion avoidance stage.
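The special case in the proposed text can be illustrated with a minimal sketch. It assumes that {{eq1}} is the cubic window function W_cubic(t) = C*(t - K)^3 + Wmax with C = 0.4, as in RFC 8312; the function names are hypothetical and the window is counted in segments.

```python
C = 0.4  # CUBIC scaling constant; value assumed from RFC 8312


def w_cubic(t, k, w_max, c=C):
    """Cubic window (in segments) t seconds into the congestion avoidance stage."""
    return c * (t - k) ** 3 + w_max


def window_after_lossless_exit(cwnd_at_ca_start, t):
    """Window growth when slow start was exited without any packet loss.

    Wmax is undefined because no loss has occurred, so K is set to 0 and
    Wmax is set to the window at the beginning of congestion avoidance.
    """
    return w_cubic(t, k=0.0, w_max=cwnd_at_ca_start)
```

With these assumptions the window starts exactly at the slow-start exit value and grows along the convex part of the cubic curve from there.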

I also tried to explain why the draft cannot say "in general SHOULD use HyStart++ ...". HyStart++ is specified for the initial slow start only! You cannot require it to be used e.g., after a timeout when CUBIC like any other TCP CC algo must use RFC 5681 slow start as there is no other slow start specified for that purpose. Same holds for the other cases where cwnd < ssthresh.

Furthermore, the current text effectively says (claims):
"... MAY use the Reno TCP slow start algorithm {{!RFC5681}} in the rare cases (of RTO) when
HyStart++ is not suitable."

  • I asked but did not get an answer why CUBIC would be dependent on HyStart++ (use SHOULD), i.e., what are the full implications that must be understood and carefully weighed before choosing not to implement HyStart++ with CUBIC? I don't see any such reasons, maybe I am missing something?

IIRC one pragmatic reason is that there is no other standards-track mechanism.

My apologies but I don't understand this reasoning. If something is stds track, it is not a reason to make it automatically a SHOULD. For SHOULD (and MUST) there needs to be a reason why it is required and it needs to be explained. What breaks if CUBIC does not use HyStart++?
There is currently no justification provided for the SHOULD in the draft.

I cannot provide text as I have expressed my strong opinion: -this draft must not provide an alternative stds track specification for response from spurious RTOs; we already have it in RFC 4015.

This sounds like you are arguing that text should be taken out of the document - which bits?

  • A response to false fast rexmits should be a draft of its own because the solution is common to all congestion control algorithms, not in any way specific to CUBIC, and the solution deserves much more attention and more careful design than what this draft currently provides.

Again, this sounds like you are arguing that text should be taken out of the document - which bits?

I'd suggest to move the text in sec 4.9 to sec 5.3 (Difficult Environments): The first para extended with a brief intro for environments where and why spurious RTOs and false rexmits may occur. Pointers to results where the implications are shown. And pointing to detection algos and RFC 4015 for response to spurious RTOs + list what additional state in addition to ssthresh to store/restore.

First, I thought spurious RTOs and false RTXs are not uncommon on general Internet paths. So I'm not sure that moving that text into a section called "difficult environments" is appropriate.

Sure, spurious RTOs and false RTXs occur. But how often, in what conditions, what are the reasons, what else may happen at the same time? We need evidence (well analysed measurement results) that it is common and that we understand the reasons for such behaviour and hence are able to justify a correct way to do the response. Currently there is nothing in this draft about these. Moreover, what could be done to avoid such spurious rexmits, what actions the IETF has taken/should take to advise how to avoid them. Is the transport protocol the only actor that should be prepared and react? What we need to take into account if we undo cc state and how the state is safely restored? All these questions are specific to the problem area of spurious rexmits but not specific to CUBIC. Undo involves a congestion control decision and as far as I remember IETF has never accepted a Stds Track RFC for congestion control actions without any evidence that demonstrates that the proposed cc actions are safe and correct. Why should this draft make an exception? Please see my reply to Neal concerning whether cc actions as currently proposed are safe and correct.

If the problem is shown to be common, then the text may better belong to sec 5.4 (Investigating a Range of Environments) than a section about difficult environments.

Second, it sounds like you don't just want to move the existing text, you are outlining a complete rewrite of what we have. It would be more actionable if you could propose a draft of the concrete text you think should be there (ideally, as a PR.)

Yes, it requires much more consideration and analysis than the current piece of text in this draft, in particular if one wants to specify a new mechanism to restore state or a mechanism that deviates from the one for which we have a stds track specification (RFC 4015).
I may help in writing the text for sec 5.3/5.4 but I don't have any measurement results that demonstrate the problem and how common the problem is.

@markkukojo
Collaborator

@markkukojo markkukojo commented Oct 18, 2021

Re: "Much is documented in those RFCs but not all, e.g., response to spurious fast rexmits never materialised."...

Even if RFCs for responding to spurious fast retransmits never happened, the code to do that has been widely deployed and used in the Internet with good results, including at least in Linux TCP for over a decade. Since implementations are doing it, we may as well tell them how to do it well. :-)

Yes, UNDO has been there in Linux well before CUBIC was invented. But UNDO being implemented in Linux does not alone provide any evidence that doing UNDO like rfc8312bis proposes is the right way to do it. ;)

I fully agree that it is important to tell how to restore cc state "well" but the problem is exactly that this draft does not do it now. In other words, you are saying and supporting that it is all ok to restore cc state such that, after a detection algo figures that it was a false fast rexmit, the TCP sender

a) blasts a full cwnd worth data at line speed,
b) does not consider adjusting dupack threshold to avoid further false rexmits,
c) does not need to be prepared for a lying receiver that can easily make a fast rexmit to appear as a spurious rexmit, i.e., it is fine to make it easy to fool the sender not to react to congestion at all?

Re: "That's why I think it is not the best option from the process point of view to embed a response to spurious rexmits in this CUBIC spec as the problem area and solution is common to all CCs."...

The solution is not common to all CCs, because the state variables for each CC are different. It is important for the CC spec to specify which variables need to be saved and restored if there is an undo event. That's why it's great that 8312bis covers those details.

Sure, I am not objecting to specifying which variables need to be saved and restored in addition to those that the existing RFCs already do and what is common to all CCs. But what I try to say is that this is the quite trivial part of the cc response to spurious rexmit.

I'm objecting that this draft modifies RFC 4015 response to spurious RTOs such that after the undo

a) the TCP sender blasts a full restored cwnd worth of unnecessary rexmits at the bottleneck line speed,
b) does not adjust RTO and RTO calculation to avoid further spurious RTOs,
c) executes UNDO even if it received the acceptable Ack with ECE,
d) and more ...

It would be helpful for the discussion if everyone would read RFC 4015 and the related RFCs and compare what they say to what RFC8312bis says.
I'm not saying that the existing RFCs are perfect but they contain useful advice that this draft rfc8312bis totally misses.

@larseggert larseggert reopened this Oct 18, 2021
@larseggert
Member

@larseggert larseggert commented Oct 18, 2021

Could you propose a PR that would address your concerns? I think we'll keep missing things otherwise.

@nealcardwell
Collaborator

@nealcardwell nealcardwell commented Oct 18, 2021

a) blasts a full cwnd worth data at line speed,

That's a separate problem. Restart from idle does that as well. TCP implementations should use pacing to solve this problem of blasting full cwnds. Pacing is the only complete and general solution to blasting full cwnds, and it's orthogonal to CUBIC.
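As a rough illustration of what pacing means here, the following sketch spreads one cwnd of data evenly over one smoothed RTT instead of sending it back to back. The function name and the simple rate formula (cwnd divided by SRTT, with no extra scaling factor) are assumptions made for illustration, not taken from any particular stack.

```python
def pacing_interval(cwnd_bytes, srtt_s, packet_bytes):
    """Seconds to wait between packets so that one cwnd is spread over one SRTT."""
    pacing_rate = cwnd_bytes / srtt_s  # bytes per second
    return packet_bytes / pacing_rate
```

For example, a restored cwnd of 100 packets of 1500 bytes with a 100 ms SRTT would be released at roughly one packet per millisecond rather than as a single line-rate burst.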

b) does not consider adjusting dupack threshold to avoid further false rexmits,

That's an orthogonal loss recovery issue, and belongs in a separate document focused on loss recovery. For example, the RACK RFC covers this kind of issue. It doesn't make sense to put this in a CUBIC document, because the CUBIC document doesn't know whether

c) does not need to be prepared for a lying receiver that can easily make a fast rexmit to appear as a spurious rexmit, i.e., it is fine to make it easy to fool the sender not to react to congestion at all?

Interactions between lying receivers and the detection of spurious loss recovery episodes are a loss recovery issue, not a congestion control issue, and belong in a loss recovery document.

Sure, I am not objecting to specifying which variables need to be saved and
restored in addition to those that the existing RFCs already do and what
is common to all CCs. But what I try to say is that this is the quite
trivial part of the cc response to spurious rexmit.

Most of the issues you are bringing up are loss recovery issues, not congestion control issues, and should be in a separate document, IMHO. IMHO the 8312bis is a CC document, and should effectively say something like: "if your loss detection algorithm detects that a previous loss inference was spurious, then the CUBIC CC response is X".

a) the TCP sender blasts a full restored cwnd worth of unnecessary rexmits
at the bottleneck line speed,

If the loss recovery has been undone, the sender is not in a loss recovery state, and thus would not retransmit anything.

b) does not adjust RTO and RTO calculation to avoid further spurious RTOs,

That's an orthogonal loss recovery issue about how to adjust the RTO calculation based on an undo event, and belongs in a separate document focused on loss recovery.

c) executes UNDO even if it received the acceptable Ack with ECE,

That's an orthogonal loss recovery issue about whether undo should happen, and belongs in a separate document focused on loss recovery.

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Oct 19, 2021

I read RFC 4015 and agree that it does cover all the safety requirements for restoring / undoing the congestion control state and how to update the RTO to further avoid spurious timeouts. So, it covers both updates to loss recovery and congestion control.

As a developer for QUIC spurious retransmit detection, congestion window restoring and changing the packet (dup ACK) and time threshold for loss detection, handling of special ACKs (CE counts), I do agree if all the responses to a spurious retransmit were in a single document, it would have made my life a bit easier.

So, it might be a better idea to remove this section (section 4.9) completely.

@larseggert
Member

@larseggert larseggert commented Oct 19, 2021

If we do this,

  1. would this address @markkukojo's issues and
  2. would we need to cite RFC4015 and say something about it?

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Oct 19, 2021

No, I don't think we need to cite anything. Folks who would want to implement Eifel response would follow RFC 4015.

@larseggert
Member

@larseggert larseggert commented Oct 19, 2021

We added that section in response to #23, and there was quite a bit of discussion and support for it. It seems odd to go back on this now?

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Oct 28, 2021

I know but what we added is trivial (reset the state to previous state).
It needs more work and probably can be better placed in a new RFC that tackles spurious retransmits. Reverting the state to previous needs to be more methodical than what we proposed. Esp, as the link condition could have changed from previously restored state to current state.

@larseggert
Member

@larseggert larseggert commented Oct 28, 2021

OK. Could we get some others to chime in on removing this section? If there is consensus, I'll do a PR.

@nealcardwell
Collaborator

@nealcardwell nealcardwell commented Oct 28, 2021

I strongly advocate to keep Section 4.9, "Spurious Congestion Events".

Some responses to Vidhi's comments:

I know but what we added is trivial (reset the state to previous state).

It seems trivial once you write it down, but it's not obvious to implementors. The Linux TCP CUBIC code has had undo for a decade, but the undo of CUBIC is still incomplete and buggy because it only restores cwnd and ssthresh, not the CUBIC-specific values. Yuchung and I noticed this years ago, but didn't have development resources to fix it given that our team had moved on to a different congestion control algorithm. So one of the most widely used CUBIC implementations is still getting undo wrong. Yuchung and I were planning to resurrect and finish our in-progress patch to fix this to match 8312bis when 8312bis settled down. But if 8312bis omits this then Linux CUBIC undo and other implementations will lose momentum and/or fail to get it right.

Even just specifying the set of variables to restore is hugely valuable, and will save the world from having more buggy implementations like Linux TCP CUBIC.

It needs more work and probably can be better placed in a new RFC that tackles spurious retransmits.

The 8312bis section on undo doesn't need more work, IMHO, because this draft doesn't need to specify everything that a TCP stack needs to do to handle undo events; it only needs to cover what CUBIC's response to an undo event should be.

Even if someone later writes a new RFC that tackles spurious retransmits, it doesn't make sense for that future spurious retransmit RFC to specify all the variables settings that every IETF-documented CC algorithm (Reno, CUBIC, DCTCP, etc) should make upon undo events.

From a protocol specification and software engineering standpoint, we should try to strictly separate loss detection/recovery (including detecting spurious retransmission events) from congestion control (including how to undo a congestion response, if the loss detection module decides to undo). In software engineering terms, loss recovery drafts should cover whether to call cc->handle_loss() and cc->handle_undo_loss(), and this rfc8312bis should cover what cc->handle_loss() and cc->handle_undo_loss() do.
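The separation described above might look roughly like the following sketch. The class and field names are hypothetical, the loss response is deliberately simplified (it is not the draft's full set of rules), and the constants 0.7 and 0.4 are assumed from RFC 8312. The point is only that the congestion control module snapshots all of its own state, not just cwnd and ssthresh, and restores it when the loss detection side reports an undo.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class CubicState:
    cwnd: float
    ssthresh: float
    w_max: float
    k: float
    epoch_start: float
    w_est: float


class CubicCC:
    BETA = 0.7  # multiplicative decrease factor (assumed from RFC 8312)
    C = 0.4     # cubic scaling constant (assumed from RFC 8312)

    def __init__(self, state: CubicState):
        self.state = state
        self._saved: Optional[CubicState] = None

    def handle_loss(self, now: float) -> None:
        """Called by loss recovery when it infers a loss."""
        # Snapshot every CUBIC variable before reducing.
        self._saved = replace(self.state)
        s = self.state
        s.w_max = s.cwnd
        s.ssthresh = max(s.cwnd * self.BETA, 2.0)
        s.cwnd = s.ssthresh
        s.k = (s.w_max * (1 - self.BETA) / self.C) ** (1 / 3)
        s.epoch_start = now
        s.w_est = s.cwnd

    def handle_undo_loss(self) -> None:
        """Called by loss recovery when it decides the loss inference was spurious."""
        if self._saved is not None:
            self.state = self._saved
            self._saved = None
```

Whether and when to call handle_undo_loss() stays entirely outside this module, which is the modularity argument being made.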

Reverting the state to previous needs to be more methodical than what we proposed.

What is proposed is quite methodical and sufficient as far as CUBIC CC goes. Loss detection/recovery should make its own specification for what the loss detection/recovery algorithm needs to do upon detecting spurious recovery events, which the RACK RFC already does (spurious loss recovery events raise the reordering window).

Esp, as the link condition could have changed from previously restored state to current state.

All the existing algorithms for detecting spurious loss recoveries (FRTO, DSACKs, Eifel TS-based undo) detect spurious loss recoveries in 1-2 round trips, so in practice the link conditions change very little. Just restoring the CC algorithm state to what it was before the spurious loss episode is excellent in practice, in our experience looking at real-world traces on YouTube, google.com, and internal RPC traffic: it's simple and robust, and far better than not undoing at all, given that reordering is not rare, and will likely become more common over time.

If link conditions have changed and available bandwidth has decreased, then in practice the flow will see loss/ECN soon, and slow down. But in practice when there are undo events this is usually due to radio layer delay fluctuations or reordering, not due to sustained reductions in available bandwidth.

@lisongxu
Collaborator

@lisongxu lisongxu commented Oct 28, 2021

I agree with @nealcardwell that it is better to keep this subsection, as we have noticed that several issues/bugs (not just cubic) are/were related to the undo component. We do not need to specify when to undo, but instead specify which variables to undo.

@larseggert
Copy link
Member

@larseggert larseggert commented Oct 28, 2021

Thanks. Then @markkukojo needs to be a bit more specific about what changes he would want to see in the section to address his issue.

@markkukojo
Collaborator

@markkukojo markkukojo commented Nov 11, 2021

My sincere apologies again for not being able to respond earlier (unfortunate force majeure for me).

About the slow start text in Sec 4.10:

I've already suggested text to replace the current one and explained the rationale.
I try to repeat briefly. The current text is impossible; CUBIC or any other CC cannot replace RFC 5681 slow start by HyStart++ in general because HyStart++ is applicable to initial slow start only. It cannot be used during the slow start in RTO recovery!

Please see draft-ietf-tcpm-hystartplusplus, Sec 4.3:
An implementation SHOULD use HyStart++ only for the initial slow
start (when ssthresh is at its initial value of arbitrarily high per
[RFC5681]) and fall back to using traditional slow start for the
remainder of the connection lifetime.
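A minimal sketch of that rule, assuming the initial ssthresh is represented as infinity and using hypothetical names for the returned modes. The "cwnd is no more than ssthresh" condition follows the wording proposed earlier in this thread.

```python
INITIAL_SSTHRESH = float("inf")  # stands in for the "arbitrarily high" initial value


def pick_window_growth(cwnd, ssthresh, hystart_enabled=True):
    """Choose which algorithm grows cwnd for the given state."""
    if cwnd > ssthresh:
        return "congestion_avoidance"
    if hystart_enabled and ssthresh == INITIAL_SSTHRESH:
        # Initial slow start: ssthresh has never been reduced.
        return "hystart++"
    # Any later slow start, e.g. after an RTO, uses RFC 5681 slow start.
    return "rfc5681_slow_start"
```

After an RTO sets ssthresh to a finite value, this selector never returns HyStart++ again for the connection, which is why a blanket requirement to use HyStart++ for slow start cannot be met.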

The problem with overshoot and the MD factor to use is present in the RTO recovery as well because a pkt loss may be detected before cwnd reaches ssthresh (even though HyStart++ suggests differently).

Please see shortly my reply to Bob in #85 where I clarify this further.

@markkukojo
Collaborator

@markkukojo markkukojo commented Nov 11, 2021

About spurious retransmissions and response to such events (Sec 4.9):

I agree with Vidhi that the issues and the response to spurious rexmits are not that trivial. I also agree with Neal that the issues related to response I listed belong to another document.

However, Neal suggests for some reason that those issues are loss recovery issues, which sounds very weird to me. We are discussing spurious rexmit detection and response when such an event is detected; when a spurious rexmit is detected there is no loss recovery by definition because detecting a spurious rexmit avoids/terminates loss recovery.

Instead, this topic was extensively studied and considered at the time of writing RFC 3522, RFC 5682, RFC 3708, and RFC 4015. The IETF decided to handle spurious rexmits by separating the detection and response in separate documents and it is the current IETF consensus as documented in these RFCs. The decision was to not include any response actions in the detection documents but to put all actions related to response in the response documents. These actions include congestion control actions and any other actions that were considered important for a sender to take. This of course is not the only possible way to organize RFCs for handling the issues, but it is what IETF currently has agreed on. We may change this but it requires someone to volunteer rewriting all these documents.

The response actions are currently documented in RFC 4015 for detected spurious RTOs. AFAIK IETF has never accepted a standards track RFC that modifies existing congestion control and/or related important actions without appropriate experimental data to support such a change. AFAIK nobody has presented any data for discussion in TCPM wg that would support ignoring/modifying RFC 4015 response. Maybe chairs can clarify whether this is the case?

IMO the only thing that this draft is subject to present w.r.t. response to a spurious RTO is the additional congestion control variables that need to be stored/restored for CUBIC, that is, W_max, K, epoch_start, and W_est.

I am happy to suggest text for this, if we can agree on this.

For spurious fast rexmits we do not have a response RFC for various reasons that I have tried to explain and do try to clarify here further.

First of all, again, it is not as trivial as this draft suggests and Neal seems to think. When a false rexmit is detected there are various different scenarios that require different handling and the response also depends on the detection method used. I believe that Neal mainly considers using DSACK-based detection? If DSACK is used, the fast recovery is over and all related ACKs have been received when the decision is made, i.e., there should not be any data segments in flight that were injected during the fast recovery. In this case the response is much more trivial and closer to what this draft suggests. However, using RFC 3708 to detect false rexmits and applying an UNDO CC response to it is not ready for standards track for security reasons! There is no working solution for the RFC 3708 algorithm against lying receivers, meaning that it is very easy to devise a TCP receiver that avoids cwnd reductions in case of pkt loss! TCPM chairs would need to address this in their writeup and I would be surprised if such a draft would pass SECDIR review, or actually I would expect TCPM not to decide forwarding such a proposal for publication.

If Eifel (RFC 3522) is used to detect false rexmits, the response must be more complex, because there might be half a window worth of spuriously rexmitted data segments in flight when Eifel detects the spurious fast rexmit. Consider first the trivial case where the reordering extent (distance) is small (just enough to trigger a false fast rexmit) and maybe only one or a few segments got reordered (delayed). In such a case, a new ACK arrives almost immediately after the false fast rexmit and maybe it is the only segment unnecessarily rexmitted. However, if segments are delayed such that they arrive at the receiver as the last segments in the current window and there is a relatively large chunk of segments reordered (maybe half of the window) and SACK is in use, the sender unnecessarily rexmits half a window of segments before Eifel detects that the fast rexmit was spurious. And, when Eifel detects it, it does not know what has happened to the rest of the window, i.e., whether it was unnecessarily rexmitted or if actually those segments were lost (or some of them). Therefore, the response algorithm must take this into account and it must not blindly restore the old value of cwnd. Unfortunately, the original Eifel paper did not properly cover different possible reordering cases (nor did we when Eifel RFCs were prepared if I recall it correctly). So, we do not have a proper response algorithm for Eifel-detected false fast rexmits nor has anyone presented any data on how UNDO works with Eifel detection in various different scenarios.
Moreover, AFAIK there is no data ever gathered and presented on how the safe version of Eifel performs. Therefore, IMHO a response for spurious fast rexmits is definitely not ready for standards track.

I think we all agree that the problem of spurious rexmits is important to solve and it should be done properly. Therefore, IMO we should handle these in separate documents common to all CCs as we already did for spurious RTOs (which actually needs an update), not in this draft without any evidence presented to the wg for making the decision.

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Nov 12, 2021

@markkukojo I presented your objections on this topic at the TCPM IETF meeting today (still Nov 11th in PST) and I see two problems that are a bit tricky to solve.

  1. The response described in RFC 4015 for setting cwnd is very conservative. It sets cwnd to flight size + min(IW, bytes_acked), which could be very low if the spurious detection happens at the tail of the window.
  2. Spurious RTOs are less common than spurious fast retransmits in TCP. For QUIC, there are no spurious RTOs as the only way to detect losses is after receiving an ACK, which is equivalent to fast retransmits. So, QUIC implementations wouldn't be able to take advantage of spurious detection.
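To make the first point concrete, here is a small sketch of the cwnd computation quoted above. The function name is hypothetical, and the numbers in the usage note (1460-byte segments, IW of 10 segments) are assumptions chosen for illustration.

```python
def rfc4015_restored_cwnd(flight_size, bytes_acked, iw):
    """cwnd after a detected spurious timeout: flight size + min(IW, bytes_acked)."""
    return flight_size + min(iw, bytes_acked)
```

For example, with 1460-byte segments, IW = 14600 bytes, two segments still in flight (2920 bytes) and one segment newly acknowledged, the restored cwnd is 4380 bytes. If the window before the spurious event was 100 segments (146000 bytes), a plain undo would bring back that full value instead, which is the gap between the two approaches.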

AFAIK IETF has never accepted a standards track RFC that modifies existing congestion control and/or related important actions without appropriate experimental data to support such a change.

I have not attended all IETF meetings in the past, so I can't say what data may or may not have been presented. But I can say for sure that there are many QUIC implementations (including ours) that are doing UNDO on detection of spurious retransmits and the improvement in throughput was significant for us to keep using it. If you'd like I can get an exact number of how many implementations that are widely deployed are doing this.

So, I am starting to lean towards this approach. How about we document how CUBIC should respond to spurious events due to
a. RTOs and
b. fast retransmits.

We can use RFC 4015's response for a. and we can do the UNDO method for b. (backed by multiple QUIC deployments).

@nealcardwell
Collaborator

@nealcardwell nealcardwell commented Nov 12, 2021

@goelvidhi

How about we document how CUBIC should respond to spurious events due to,
a. RTOs and
b. fast retransmits.

This sounds good to me. Thanks, Vidhi!

@markkukojo said:

I believe that Neal mainly considers using DSACK-based detection?

As I noted, I'm referring to several different detection mechanisms:
"(FRTO, DSACKs, Eifel TS-based undo)". :-)

There is no working solution for RFC 3708 algorithm against lying receivers

If we're going to stop work based on lying receivers, this is going to hold
up all TCP congestion control and loss recovery work unreasonably. As
has been clear since at least 1999 ("TCP Congestion Control with a
Misbehaving Receiver"), TCP has no crypto/nonce mechanism for
receivers to prove what they have received, so receivers can always
lie and claim they have received all packets, and thus cause the
sender to send as fast as the sender hardware and application will
allow, defeating just about any existing TCP congestion control or loss
recovery mechanism.

So, we do not have a proper response algorithm for Eifel-detected false fast rexmits

I believe we do have such an algorithm, even if it's not documented
yet in an IETF document: Linux TCP has a proper response algorithm for
Eifel-detected false fast rexmits, which has properly handled the case you
mention since 2013 (patch was by @yuchungcheng):

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=7026b912f97d912476dff5465ed9a127be094208

I think we all agree that the problem of spurious rexmits is important
to solve and it should be done properly. Therefore, IMO we should handle
these in separate documents common to all CCs as we already did for
spurious RTOs (which actually needs an update), not in this draft
without any evidence presented to the wg for making the decision.

I think this passage is missing a key distinction. I think it is
useful to make a distinction between:

(a) How the TCP connection as a whole, and loss detection, and
detection of spurious retransmits should decide when to undo a
spurious loss detection event.

(b) What the congestion control module should do when the logic in (a)
decides that a loss detection event was spurious.

I continue to argue that it is useful and best for 8312bis to document
the (b) aspect of CUBIC. Partly because integrating the (b) aspect of
CUBIC into some future draft about (a) would be something of a
modularity violation, in that it would needlessly drag the details of
CUBIC state variables into what ideally should be a largely
congestion-control-agnostic document about (a).

@larseggert
Member

@larseggert larseggert commented Dec 13, 2021

Could we drive this issue to closure, please?

@goelvidhi
Collaborator Author

@goelvidhi goelvidhi commented Dec 14, 2021

I will send out a PR in a day or two.

6 participants