Overly aggressive window increase #14
@lisongxu Agreed. We also include this change in the list of changes for this new revision. |
@lisongxu Below is my response to the three cases you mentioned:
Having said that, your suggestion is a safe option regardless. |
@larseggert I can do this change. Feel free to assign :-) |
What I proposed is a simple fix mainly for other implementations. For bug 2, Google already proposed a fix that has been implemented in Linux and adopted into the CUBIC RFC (Section 5.8). Thanks |
Would be good to give this issue a more descriptive title. |
@larseggert done |
@yuchungcheng Case 2 is already fixed by Google in function bictcp_cwnd_event. How about case 3? Does bictcp_cwnd_event also fix case 3? Thanks |
I believe bictcp_cwnd_event does fix case 3; see specifically the comment
" We were application limited (idle) for a while..."
This commit message has more details:
commit 30927520dbae297182990bb21d08762bcc35ce1d
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Sep 9 21:55:07 2015 -0700
tcp_cubic: better follow cubic curve after idle period
Jana Iyengar found an interesting issue on CUBIC :
The epoch is only updated/reset initially and when experiencing losses.
The delta "t" of now - epoch_start can be arbitrary large after app idle
as well as the bic_target. Consequentially the slope (inverse of
ca->cnt) would be really large, and eventually ca->cnt would be
lower-bounded in the end to 2 to have delayed-ACK slow-start behavior.
This particularly shows up when slow_start_after_idle is disabled
as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle
time.
Jana initial fix was to reset epoch_start if app limited,
but Neal pointed out it would ask the CUBIC algorithm to recalculate the
curve so that we again start growing steeply upward from where cwnd is
now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth
curve to be the same shape, just shifted later in time by the amount of
the idle period.
Reported-by: Jana Iyengar <jri@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Sangtae Ha <sangtae.ha@gmail.com>
Cc: Lawrence Brakmo <lawrence@brakmo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
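For context, here is a paraphrased sketch of the logic that commit added to bictcp_cwnd_event in net/ipv4/tcp_cubic.c (not the verbatim kernel source; details may differ across kernel versions):

```c
/* Paraphrase of the idle-shift fix in bictcp_cwnd_event()
 * (net/ipv4/tcp_cubic.c). On the first transmit after an idle
 * period, epoch_start is moved forward by the idle time, so the
 * "t" in W_cubic(t) excludes the idle gap and the growth curve
 * resumes where it left off instead of jumping far up the curve.
 */
static void bictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
{
	if (event == CA_EVENT_TX_START) {
		struct bictcp *ca = inet_csk_ca(sk);
		u32 now = tcp_jiffies32;
		s32 delta;

		/* Time since the last packet was sent (the idle gap). */
		delta = now - tcp_sk(sk)->lsndtime;

		/* We were application limited (idle) for a while.
		 * Shift epoch_start to keep cwnd growth to cubic curve.
		 */
		if (ca->epoch_start && delta > 0) {
			ca->epoch_start += delta;
			if (after(ca->epoch_start, now))
				ca->epoch_start = now;
		}
	}
}
```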
|
Could I ask that we do individual PRs to address individual issues? It sometimes causes a little rebasing effort, but it's much easier to review such PRs compared to ones that contain changes for multiple issues. |
@yuchungcheng Thanks. The reason I am not sure about case 3 is that Cubic still sends out packets, just at a low rate limited by the application instead of by cwnd. If Cubic remains in case 3 for a long time, cwnd will not be increased for a long time, yet the connection is not idle. As a result, I guess bictcp_cwnd_event does not fix case 3, because bictcp_cwnd_event only detects the idle period? Am I missing something? Thanks |
Sorry about that but the fix in #3 is incomplete without the bounds on target. Will try to keep them separate as much as possible. |
@lisongxu I see your point now -- however, it gets tricky what the right action is.
For example, say cwnd is 100, and the application keeps using only 99 pkts for 5 minutes w/o experiencing any losses (so it's "application-limited"), then starts using cwnd fully 1 second later. Should t = 5m or 1s? 1s seems over-conservative; after all, cwnd was nearly fully used the whole time.
An obvious opposite example is the application using the minimal cwnd of 1 for a long time.
My sense is that most application-limited traffic eventually goes idle. Addressing the idle period may be good enough (if we want to keep it simple).
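As a rough illustration of the stakes, take the spec's C = 0.4 and beta = 0.7 with a hypothetical W_max of 100 segments (illustrative numbers, not from this thread). Letting t include a 5-minute application-limited period gives:

```latex
% Illustrative numbers only: C = 0.4, \beta = 0.7, W_{max} = 100 segments
K = \sqrt[3]{W_{max}(1 - \beta)/C} = \sqrt[3]{100 \times 0.3 / 0.4} \approx 4.2\,\mathrm{s}
W_{cubic}(t) = C\,(t - K)^3 + W_{max}
W_{cubic}(300\,\mathrm{s}) \approx 0.4\,(300 - 4.2)^3 + 100 \approx 1.0 \times 10^{7}\ \text{segments}
```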
|
@yuchungcheng Doesn't https://tools.ietf.org/html/rfc7661 solve the different app-limited scenarios? |
@yuchungcheng Thank you! Yes, we should avoid being overly conservative (e.g., 1s in your example) and overly aggressive (e.g., 5m in your example). This is why I am suggesting a simple fix: set a lower bound (cwnd) and an upper bound (2*cwnd) on the target cwnd for the next RTT; the detailed discussion can be found in issue #1 . Thanks |
@goelvidhi Thank you. That (RFC 7661) requires a major change to CUBIC, and I believe its basic idea is consistent with what we have been discussing (i.e., avoiding being overly conservative or overly aggressive while remaining responsive to congestion). Thanks |
AFAICT the effect of the proposed fix in the first message in this thread -- #14 (comment) -- would be to bound the rate of increase of the cwnd to at most doubling each round trip time. Is that the intent? The Linux TCP CUBIC implementation has already always bounded the rate of increase of the cwnd to at most 1.5x per round trip time. Initially it did this implicitly (the logic only allowed cwnd increases on at most every alternate ACK), and then when the stretch-ACK fixes were put in place the bound became explicit. See:
Given that separate pre-existing bound of 1.5x per round trip, AFAICT the proposed fix of bounding to less than 2x per round trip would be a NOP? Or perhaps I misunderstand the proposal.
By the way, for YouTube we found this implicit bound of 1.5x per round trip was important for keeping losses at a reasonable level. This has a large impact on behavior, and presumably big implications for fairness between CUBIC implementations. So this may be important to document in the RFC, if it is not already (I couldn't find it, but I may have just missed it). |
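For reference, the explicit bound described above lives in bictcp_update in net/ipv4/tcp_cubic.c; a paraphrased one-line sketch (not the verbatim source):

```c
/* Paraphrase of the explicit bound in bictcp_update()
 * (net/ipv4/tcp_cubic.c). ca->cnt is the number of ACKed packets
 * required per 1-packet cwnd increase, so forcing cnt >= 2 caps
 * growth at cwnd/2 per RTT, i.e. at most 1.5x per round trip.
 */
ca->cnt = max(ca->cnt, 2U);
```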
@nealcardwell Yes, you are right that Linux has already implemented the 1.5x upper bound. I am fine to change the upper bound from 2.0x to 1.5x. What is important is to add an upper bound to RFC to make sure that all other implementations also implement something similar. Thanks |
Closed by #3 |
Since we are revising this RFC, I guess it is a good time to fix some Cubic bugs reported in our NSDI 2019 paper.
This RFC sets W_cubic(t + RTT) as the target window size in the next RTT. However, this target size may be too high, possibly even higher than 2 * cwnd (i.e., more aggressive than slow start), in the following special cases.
case 1: RTT is extremely long. An extremely long RTT is very likely an indication of network congestion; in such an environment it is dangerous to set a very high target.
case 2: after a long idle period (i.e., a big increase of t). This is a bug reported and fixed by Google.
case 3: after a long application rate-limited period (i.e., a big increase of t). Similar to case 2.
To be safer, we may change Equation (1) as follows to fix all the above bugs:
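Roughly, the idea is to clamp the target for the next RTT between cwnd and 2 * cwnd; a sketch of the bounded equation (my paraphrase of the bounds discussed in this thread, not necessarily the exact form proposed):

```latex
% Sketch: clamp the CUBIC target for the next RTT between the
% current cwnd and twice the current cwnd.
target = \max\bigl(cwnd,\ \min\bigl(W_{cubic}(t + RTT),\ 2 \times cwnd\bigr)\bigr)
```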
Note that Linux Cubic already does something similar (line 328) by limiting target to be no more than 1.5 * cwnd.
Thanks