
Use FlightSize instead of cwnd #114

Closed
goelvidhi opened this issue Sep 21, 2021 · 47 comments · Fixed by #126

@goelvidhi
Contributor

goelvidhi commented Sep 21, 2021

@markkukojo said,

More importantly. The draft says in Sec. 4.2:
"It does not make any changes to the TCP Fast Recovery and Fast
Retransmit algorithms [RFC6582][RFC6675]."

But it does modify the RFC 6582, RFC 6675, and RFC 5681 Fast Retransmit & Fast Recovery algorithms by applying beta_cubic to cwnd rather than FlightSize. This is incorrect and would bring back an old design and implementation bug that was corrected more than two decades ago when RFC 2581 was published (a correction subsequently carried into RFC 6582 and RFC 6675). Please see the implementation note attached to Equation (4) of RFC 5681: "An easy mistake to make is to simply use cwnd, rather than FlightSize, which in some implementations may incidentally increase well beyond rwnd."

Using cwnd in the equation will also produce an incorrect outcome in some other cases, for example when the sender is application-limited.

@goelvidhi goelvidhi mentioned this issue Sep 21, 2021
@nealcardwell
Contributor

Applying beta_cubic to cwnd describes correctly what Linux TCP CUBIC does, and IIRC what Windows TCP CUBIC does as well (from discussions with the MS TCP team). The Linux and Windows CUBIC behavior is very much intentional, since reducing cwnd based on FlightSize is not robust; it causes very poor performance in the very common application-limited case.

I would suggest keeping the text here as-is (cwnd = cwnd * beta_cubic), since this reflects the long-deployed behavior, and the more robust of the two options.
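
For concreteness, here is a minimal sketch of the two reductions being debated. This is illustrative C, not code from any stack; beta_cubic = 0.7 follows the draft, the 2*SMSS floor follows RFC 5681's equation (4), and all names are made up:

```c
#include <stdint.h>

#define SMSS 1460u        /* assumed sender MSS in bytes */
#define BETA_CUBIC 0.7    /* multiplicative decrease factor from the draft */

/* RFC 5681/6675-style reduction: base the new ssthresh on FlightSize. */
static uint32_t ssthresh_from_flightsize(uint32_t flight_size)
{
    uint32_t s = (uint32_t)(flight_size * BETA_CUBIC);
    return s > 2 * SMSS ? s : 2 * SMSS;
}

/* Deployed Linux/Windows CUBIC behavior (per this thread): base it on cwnd. */
static uint32_t ssthresh_from_cwnd(uint32_t cwnd)
{
    uint32_t s = (uint32_t)(cwnd * BETA_CUBIC);
    return s > 2 * SMSS ? s : 2 * SMSS;
}
```

When the sender is application-limited, flight_size can be far smaller than cwnd, which is exactly where the two variants diverge.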

@goelvidhi
Contributor Author

@nealcardwell thanks Neal and I agree with you.

@markkukojo do you agree with Neal's comment?

@goelvidhi goelvidhi self-assigned this Sep 24, 2021
@juhamatk

If you leave the formula as it is, you'll need to change the claim "It does not make any changes to the TCP Fast Recovery and Fast Retransmit algorithms [RFC6582][RFC6675]." to something more descriptive of CUBIC's behaviour, right?

@nsdyoshi

In my understanding, Equation (4) of RFC 5681 is meant for the cases where the RTO expires. But it seems to me these cases don't need to be covered in the CUBIC draft. Or am I missing something?

@nealcardwell
Contributor

Equation (4) of RFC 5681 is meant for RTO cases, but it is also meant for Fast Recovery. See section 3.2, "Fast Retransmit/Fast Recovery":
https://datatracker.ietf.org/doc/html/rfc5681#section-3.2
which says:
When the third duplicate ACK is received, a TCP MUST set ssthresh
to no more than the value given in equation (4)
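(For reference, equation (4) of RFC 5681 is: ssthresh = max (FlightSize / 2, 2*SMSS).)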

@markkukojo

I'll reply to all comments in this one message.

Yoshi:
Equation (4) in RFC 5681 is used both when RTO expires and when 3rd DupAck arrives: see Sec 3.2, Step 2.
Equation (4) appears only once in RFC 5681, next to the RTO discussion, purely for presentational reasons: Slow Start & RTO are discussed first in Sec 3.1, and the discussion of Fast Retransmit & Fast Recovery follows later in Sec 3.2.

This is also consistent with NewReno and SACK-based Fast Retransmit & Fast Recovery; see RFC 6582, Sec 3.2, Step 2 (which points to step 2 of Section 3.2 of RFC 5681) and
RFC 6675, Sec 5, Step 4.2. Please see more in my reply to Neal below.

Juha-Matti: you are right, but it is more than that. Please see my reply to Neal below.

Neal, all:
The use of FlightSize is intentional and consistently used in all the current Standards Track RFCs mentioned above. It is based on IETF consensus going back more than two decades and is intended to ensure that a TCP sender responds to congestion even when it is rwnd-limited (or application-limited) and has increased its cwnd to a (very) high value beyond the actual flight size. AFAIK, this is one of the major reasons why FlightSize was introduced in RFC 2581 and later adopted in all the above-mentioned RFCs. It protects the network from (re)transmission of uncontrolled data in various scenarios.

This said, admittedly the use of FlightSize has its shortcomings, because in some scenarios it may result in a very small cwnd after a congestion signal. I believe that's what Neal refers to by saying it is not robust? In addition, FlightSize does not provide any protection in cases where an application-limited or rwnd-limited TCP sender stops being limited and ends up transmitting at a (notably) higher rate.

However, if we leave the text as is, we bring back the decades-old problem because there are no safeguards. In addition, this draft text would then be in conflict with the current Standards Track RFCs. Such conflicting information is clearly not ideal, as it would create inconsistency and confusion in the RFC series.

So, what would be the best way forward? These problems with application-limited or rwnd-limited TCP senders are not specific to CUBIC; they are the same regardless of which congestion control mechanism is in use. We need an update to the existing RFCs at some point as well. Therefore, I think the proper way to solve these problems is to solve them for all involved RFCs at once, in a consistent way. We have RFC 7661 as Experimental. Aren't the algorithms there appropriate enough for all known application-limited scenarios (assuming the equations are adjusted to apply to CUBIC as well)?

@lisongxu
Contributor

Thank you for the discussions on this important issue! I agree that we need to change "It does not make any changes to the TCP Fast Recovery and Fast Retransmit algorithms [RFC6582][RFC6675]", which is not correct/accurate.

Based on the comments of @markkukojo, it seems that it is safer to update ssthresh using FlightSize.

In addition to how ssthresh is calculated, a related issue is that currently the CUBIC cwnd is not updated according to the fast retransmit/fast recovery algorithms in RFC 5681/6582. For example, Step 6 in Section 3.2 of RFC 5681 says: "When the next ACK arrives that acknowledges previously unacknowledged data, a TCP MUST set cwnd to ssthresh (the value set in step 2)". However, currently Linux CUBIC sets cwnd to ssthresh at the beginning of fast recovery instead of at the end of fast recovery. @nealcardwell, right? Thanks

@nealcardwell
Contributor

Linux TCP CUBIC follows PRR and sets cwnd to ssthresh at the end of fast recovery (see tcp_end_cwnd_reduction() where it sets cwnd=ssthresh).
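
A minimal sketch of that end-of-recovery step (illustrative only; the actual logic lives in the kernel's tcp_end_cwnd_reduction(), and the struct and field names here are made up):

```c
#include <stdint.h>

struct cc_state {
    uint32_t cwnd;      /* congestion window, in segments */
    uint32_t ssthresh;  /* slow-start threshold chosen when recovery began */
};

/* On exiting fast recovery, complete the PRR-style reduction by
 * snapping cwnd down to the ssthresh picked at the start of recovery. */
static void end_cwnd_reduction(struct cc_state *cc)
{
    if (cc->cwnd > cc->ssthresh)
        cc->cwnd = cc->ssthresh;
}
```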

FWIW it sounds good to me to have rfc8312bis document setting ssthresh using FlightSize, given that this issue is mostly orthogonal to CUBIC (both Reno and CUBIC have the same question about setting ssthresh using FlightSize vs cwnd).

@nsdyoshi

Neal, are you suggesting that we might need to update 5681? Just curious.

@larseggert
Contributor

FWIW, the current editor's copy already updates 5681: https://tools.ietf.org/rfcdiff?url1=https://tools.ietf.org/id/draft-ietf-tcpm-rfc8312bis.txt&url2=https://NTAP.github.io/rfc8312bis/draft-ietf-tcpm-rfc8312bis.txt

So if we need to add other reasons for doing so, that is not an issue.

@nealcardwell
Contributor

Personally I think the Reno and CUBIC RFCs should use:
ssthresh = beta * cwnd
(This is what Linux TCP has done since 2013).

Or if the receiver window case mentioned by @markkukojo is a big concern, then perhaps something like:
ssthresh = beta * min(cwnd, rwnd)

But I don't think we need to hold up rfc8312bis if this is a controversial point, since it is somewhat orthogonal to CUBIC.
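
For what it's worth, a sketch of the rwnd-capped variant mentioned above (illustrative; rwnd here is assumed to be the peer's most recently advertised receive window, and the function name is made up):

```c
#include <stdint.h>

/* Cap the base of the reduction at the advertised receive window, so a
 * cwnd that has grown far beyond rwnd cannot inflate the new ssthresh. */
static uint32_t ssthresh_rwnd_capped(uint32_t cwnd, uint32_t rwnd, double beta)
{
    uint32_t base = cwnd < rwnd ? cwnd : rwnd;
    return (uint32_t)(base * beta);
}
```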

@lisongxu
Contributor

Thanks, @nealcardwell. You are right that Linux CUBIC sets cwnd to ssthresh at the end of fast recovery.

Because Reno (RFC 5681), NewReno (RFC 6582), and SACK (RFC 6675) all set ssthresh using FlightSize, and because CUBIC does not intend to modify the fast retransmit and fast recovery algorithms of Reno/NewReno/SACK except for the multiplicative decrease factor, it would be better and more consistent for this rfc8312bis to set ssthresh using FlightSize.

We may describe what is implemented in Linux/Windows and why ("since reducing cwnd based on FlightSize is not robust; it causes very poor performance in the very common application-limited case").

@nealcardwell
Contributor

Sounds good to me. Thanks!

@yuchungcheng

I completely disagree with the compromise to use FlightSize.

Using FlightSize has a detrimental performance effect when losses are encountered in the application-limited case, which is prevalent in today's structured Internet application traffic. Consider a simple case where the application bursts and a drop occurs toward the end of the burst. At that time FlightSize is likely very small, so CUBIC reduces the window based on an underestimate of the volume in flight during the congestion.
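
To make that concrete with purely illustrative numbers: suppose the application writes a burst of 100 packets and the drop hits near the tail of the burst. By the time the loss is detected, most of the burst has been ACKed and the application has gone quiet, so FlightSize may be only a handful of packets; applying beta_cubic to that tiny FlightSize collapses the window even though roughly 100 packets were in flight when the queue actually overflowed.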

Ever since CUBIC was invented, it has been using cwnd. Shouldn't this RFC reflect the Internet-wide, decades-long running implementation with the better algorithm?

Now, on the "Implementation Note: An easy mistake to make is to simply use cwnd, rather than FlightSize, which in some implementations may incidentally increase well beyond rwnd." in RFC 5681. First, why is cwnd going beyond rwnd an issue when the effective window is the min of the two? Second, can someone come up with an example to illustrate such an incident, because I can't tell.

I have raised this exact issue with Jana, so I am glad RFC 9002 (QUIC) does not repeat the mistake of using FlightSize
(https://www.rfc-editor.org/rfc/rfc9002.html#name-recovery). I really want to stress the point: using FlightSize simply makes the TCP spec worse and out of sync with modern TCP implementations. It's really a mistake.

@goelvidhi
Contributor Author

@yuchungcheng thank you for chiming in. We have been going back and forth about using cwnd vs flight size for ssthresh and therefore cwnd (at the end of recovery). There are two alternatives from this discussion:

  1. Continue to use cwnd and then update the text with,
    Cubic updates the fast retransmit and fast recovery algorithms of Reno/NewReno/Sack to use cwnd to do multiplicative decrease after a congestion event as described in Section 4.6. This has proven to be more robust and provides better throughput and link utilization as seen with most modern implementations that have been deployed globally.
  2. Change to using flightsize in Section 4.6 and add a caveat for lower throughput,
    Since reducing congestion window based on FlightSize is not robust and can cause low throughput in the very common application-limited scenarios, most modern implementations use cwnd instead for better link utilization.

To keep it more conservative and aligned with previous Standards Track RFCs, I am fine with option 2. We need to converge on this soon to unblock CUBIC from this orthogonal problem.

@larseggert
Contributor

Which option do the major stacks use? IMO that is what the spec should recommend. We can discuss the implications of that choice and mention the other alternative, too.

@goelvidhi
Contributor Author

Which option do the major stacks use? IMO that is what the spec should recommend. We can discuss the implications of that choice and mention the other alternative, too.

I don't know if this is the right question. I think for a Standards Track RFC, a more important question is which option the stacks should use, because the current implementations optimize for better throughput and that may or may not be the right recommendation.

@larseggert
Contributor

larseggert commented Oct 1, 2021

In my mind, a main reason for doing 8312 and now 8312bis was to describe what the currently deployed CUBIC looks like, because the versions described in the papers and used in various simulators were really different.

I'd therefore really like to describe what variant of CUBIC is currently deployed - it seems to work well enough with few enough downsides, right? If that is the case, why would we want to recommend something else for the spec?

@nealcardwell
Contributor

Which option do the major stacks use?

Linux TCP CUBIC calculates ssthresh based on cwnd. Not sure about the other major stacks.

I agree with Lars that it's far preferable to describe the CUBIC behavior that is currently deployed. Hopefully MS/Apple/Linux stacks agree on this point and we can have the RFC match that.

I support something like Vidhi's option number 1 above ("Continue to use cwnd and then update the text..." ).

@yuchungcheng

yuchungcheng commented Oct 1, 2021 via email

@pravb

pravb commented Oct 1, 2021

Both Windows TCP and MsQuic apply beta_cubic to cwnd and not FlightSize. We have had intense internal debates on this, but we are erring towards higher performance in app-limited cases. We also briefly enabled restarting cwnd upon idle by default for datacenter connections but had to revert due to application regressions.

Additionally, Windows TCP limits cwnd growth to not exceed the largest receive window advertised by the peer during the connection lifetime, and MsQuic limits cwnd growth to not exceed 2*MaxFlightSize, i.e., twice the maximum in flight during the connection lifetime.
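
A rough sketch of those two growth caps (illustrative only, not the actual Windows TCP or MsQuic code; the running maxima and names are assumptions):

```c
#include <stdint.h>

/* Windows TCP-style cap: never let cwnd exceed the largest receive
 * window the peer has advertised over the connection lifetime. */
static uint32_t cap_cwnd_by_max_rwnd(uint32_t cwnd, uint32_t max_rwnd_seen)
{
    return cwnd < max_rwnd_seen ? cwnd : max_rwnd_seen;
}

/* MsQuic-style cap: never let cwnd exceed twice the maximum amount
 * of data that has ever been in flight on this connection. */
static uint64_t cap_cwnd_by_max_flight(uint64_t cwnd, uint64_t max_flight_seen)
{
    uint64_t cap = 2 * max_flight_seen;
    return cwnd < cap ? cwnd : cap;
}
```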

@pravb

pravb commented Oct 1, 2021

I too think Option 1 is better, with added justification for the deviation from prior Standards Track RFCs.

@goelvidhi
Contributor Author

goelvidhi commented Oct 1, 2021

Apple's stacks implement RFC 7661 and apply beta_cubic to max(pipeACK, LossFlightSize), which is like option 2.

I am happy to recommend option 1 considering the widespread deployment of Linux on servers, but I am sure there are some folks who will object to that. Hoping to hear from them soon.

@goelvidhi
Contributor Author

goelvidhi commented Oct 1, 2021

I want to note that using FlightSize is good for persistent congestion (as we would act quickly to reduce the sending rate based on the congestion experienced by the bytes in the network),
whereas
using cwnd is good for transient congestion and for drops unrelated to congestion, because we might recover without significantly reducing the congestion window based on the bytes in the network, and instead continue to use a higher cwnd on the assumption that the drop was not due to persistent congestion.

@yuchungcheng

yuchungcheng commented Oct 2, 2021 via email

@markkukojo

IMO what the Apple stack does is the right thing to do. If I understood it correctly, it simply follows what we currently have in the RFC series:
it applies FlightSize (= LossFlightSize) but also implements RFC 7661, which is meant to mitigate exactly the problem scenario that Yuchung described. RFC 7661 even mentions such a scenario explicitly.
That's why I asked whether RFC 7661 would be enough.
In Yuchung's toy example pipeACK would be at least 89 (or 92 if SACK is in use), so the effect would be quite close to what it would be if FlightSize were recorded on a per-packet basis (but without high storage requirements).

The problems with application-limited traffic are not new. It is a well-known issue, and RFC 2861, the predecessor of RFC 7661, provides a nice overview of the problem space in its Intro (Sec 2). FlightSize was selected as the way to protect against the use of invalid cwnd values for MD in the RFC series. It is not ideal, but it is the current way.

I hope we are not selecting simply between the two options because neither of those is ideal.

Option 1 is simply wrong, as it would bring back an old, well-known problem currently avoided as documented in Standards Track RFCs. If the problem is not addressed, it easily results in behaviour where a rwnd-limited or application-limited TCP sender does not react to congestion, because cwnd does not correctly reflect the currently available capacity but may have an arbitrarily high value. That is obviously bad for the Internet and neither safe nor robust.
Limiting cwnd growth as Windows TCP does helps, but cwnd may still be badly off in some usage scenarios.

Option 2 is safe although it is known to be suboptimal for the performance of application-limited traffic. However, if RFC 7661 is adapted for CUBIC, we should have the same working solution for CUBIC as we have for the other CCs in the RFC series, right?

I'd suggest the following way forward:
Use option 2 and add a caveat, maybe something along the following lines:

Note that a rate-limited application with idle periods, or periods when it is unable to send at the full rate permitted by the congestion window, may easily encounter notable variations in the volume of data sent from one RTT to another, resulting in a FlightSize that is significantly less than the congestion window at a congestion event. This may result in decreasing the congestion window to a much lower value than necessary. To avoid suboptimal performance with such applications, the mechanisms described in RFC 7661 [RFC7661] can be adapted to be employed with CUBIC. Appendix X provides guidelines for the necessary modifications to the mechanisms in RFC 7661 when used with CUBIC.

As RFC 7661 is Experimental, it might be time to advance it to the Standards Track, particularly since application-limited scenarios seem important to address and it seems possible to collect a good amount of experimental data to verify whether it works as intended and/or to tweak it as necessary.

@stewrtrs

stewrtrs commented Oct 6, 2021

FreeBSD/NewReno uses cwnd, not flightsize. I have not looked at the other congestion control modules, just NewReno, so Cubic or others may use something else. I do think that cwnd is the right thing to use here. I can go look at the other modules if there is interest... a quick look at cc_cubic shows it too uses cwnd. So that's 2 out of 6 ;)

@nealcardwell
Contributor

IMHO saying that basing ssthresh on cwnd "easily results in behaviour where a rwnd or application-limited TCP sender does not react to congestion" overstates the issue. Basing ssthresh on cwnd still reacts to congestion; it's just that it will take longer for the cwnd to converge down to the available capacity of the path.

But basing ssthresh on FlightSize can have the same issue (slow convergence down to the available capacity of the path); just to a different degree.

If we base ssthresh on cwnd, the connection will tend to take about log_2(original_cwnd / available_volume) window reductions to converge to match the current available_volume.

If we base ssthresh on FlightSize, the connection will tend to take about log_2(original_flightsize / available_volume) window reductions to converge to match the current available_volume.

Because log_2 turns multiplicative ratios into linear differences, there may not be much difference between the two approaches, depending on how far apart original_cwnd and original_flightsize and available_volume are.
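
A hypothetical worked example (numbers chosen only for illustration): with original_cwnd = 800 packets, original_flightsize = 200 packets, and available_volume = 100 packets, the cwnd-based reduction takes about log_2(800/100) = 3 congestion events to converge, versus log_2(200/100) = 1 for the FlightSize-based one; the gap is just the log of the ratio between cwnd and FlightSize at the time of the loss.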

Given that both approaches are only guaranteed to converge eventually, IMHO we should carefully weigh the costs and benefits of over-reaction vs under-reaction, and also weigh the benefits of having the RFC document "rough consensus and running code".

@markkukojo

FreeBSD/NewReno uses cwnd, not flightsize. I have not looked at the other congestion control modules, just NewReno, so Cubic or others may use something else. I do think that cwnd is the right thing to use here. I can go look at the other modules if there is interest... a quick look at cc_cubic shows it too uses cwnd. So that's 2 out of 6 ;)

Thanks for the data point.

@stewrtrs does this mean that FreeBSD/NewReno no longer has any safeguards preventing cwnd from increasing to an arbitrarily high value beyond rwnd or above the actually used capacity (FlightSize)?
AFAIK, safeguards against increasing cwnd beyond rwnd were implemented around the time RFC 5681 was published. Note also that I pointed to RFC 2861 for the problem-space description, not as a solution (CWV as per RFC 2861). RFC 2861 is known to not perform well, and RFC 7661 has obsoleted RFC 2861, providing a more appropriate method for application (rate) limited traffic.

@larseggert
Contributor

I'd like to come to a conclusion on this - what should the draft say?

@gorryfair
Contributor

My take here is that we have an interesting point. I'm keen to see documentation of what is being deployed, and it's clear to me that something other than flight size is needed for an app-limited flow for performance. I am really not keen on publishing a standards track RFC that reintroduces a well-known problem and that sits poorly with other RFCs.

For a standards track document, I'd expect the text not to be based on cwnd, given what RFC 7661 says about cwnd in data-limited cases. That said, I would also expect any text to note that when flightsize << cwnd you might do something else, which is where RFC 7661 is relevant (or its bis, if we decide to do that).

RFC 7661 was published as EXP a little while ago now! When published, there was limited deployment experience, and no real experience of including this in stacks, but I understood there was support for the problem it sought to address. If experience has changed, a change in RFC status seems helpful (with whatever that now means for calculating a "safe" operating point between flightsize and cwnd), and I’d love to see a co-editor (or two) to bring this up to date and make it a standard.

@mattmathis

Sorry I am late to the conversation. I was present when the concept of setting ssthresh from flight_size was first introduced in RFC 2581 (specifically draft-ietf-tcpimpl-cong-control-01.txt), 23 years ago. I failed to kill it then. Today I understand the problem better, and can explain why it was wrong then and is still wrong everywhere that language has been carried forward.

First let me define some terms:
RTTscale: things that happen on about the same timescale as the RTT. For example, traditional slow start often causes RTT-scale "lumpiness" in the data.

CCscale: things that happen on about the same time scale as the Congestion Control sawtooth (more generally the reciprocal of the CC's natural frequency.)

SEGscale: Things like TSO, that happen at scales similar to inter-segment times.

At about the time RFC 2581 was in draft, I happened to be working on a project to copy a large quantity of data over the vBNS from the Pittsburgh Supercomputing Center to mag tape at the San Diego Supercomputer Center. The path was approximately 70 ms x 125 Mb/s (I have forgotten the precise numbers). Due to the details of the tape system, the flow control (rwin) introduced application stalls that had dimensions similar to the network. The tape buffer was a bit larger than the network BDP and the pauses between socket reads were about one RTT. By default this system had pathologically bad performance. The read at the far end triggered a full-window burst at the sender (750+ segments). If cwnd was large enough, the very end of the burst would overflow a queue. At the point where the sender detected the loss, TCP had already nearly filled the receiver's socket buffer, so TCP was announcing a tiny rwin, and the flight_size had to be tiny as well. Given that this was classic Reno, setting cwnd to flightsize/2 would result in it taking nearly a minute for cwnd to recover (CCscale for Reno on this path was more than 20 seconds).

The point: if the application is bursty on a scale that is similar to the network dimensions, the flight size will be large when the segments at the ends of the bursts are sent, but small when the ACKs for the same segments are received. A loss indicates that the flight size at the time of transmission was too large. But the flight size at the time the ACK was received tells you nothing, except that the application stopped keeping the network full after the lost segment was initially sent.

Hypothetically you could record flight size at the time segments were sent, and use it in the CC computation, and we did look at that idea a bit 30 years ago.

Today, we have a better approach: Use pacing to recover the self clock after application pauses, and don't ever pull down cwnd or ssthresh because the application failed to fill the network (also don't raise cwnd except following RTTs in which the application successfully kept the network full). More specifically avoid setting cwnd from flight_size (even indirectly), under nearly all conditions.

Algorithms that violate this principle are likely to fail industrial scale regression testing, because some specific application traffic pattern will trigger pathological behaviors over some network paths.

I make a stronger (unproven) conjecture: for every patch introducing such an algorithm, and for every bursty application, there exists some path which will exhibit persistent pathological behavior, and extreme performance regressions. To the extent this is true, such a patch will never pass code review.

An easier experiment: For a real physical path with CCscale > 1sec, and RTTscale << CCscale, design a bursty application that triggers queue overflows near the ends of the bursts.

When you survey what the various OS's do, consider weighing the results by the deployed population.

@goelvidhi
Contributor Author

Adding Markku's comment about relocating RFC 7661 when we do a PR for this.
#122 (comment)

@stewrtrs

stewrtrs commented Oct 12, 2021 via email

@larseggert
Contributor

I merged #122. What is left to close this issue?

@markkukojo

IMHO saying that basing ssthresh on cwnd "easily results in behaviour where a rwnd or application-limited TCP sender does not react to congestion" overstates the issue. Basing ssthresh on cwnd still reacts to congestion; it's just that it will take longer for the cwnd to converge down to the available capacity of the path.

My bad to be inaccurate. It should have read:
"easily results in behaviour where a rwnd or application-limited TCP sender does not react to
a congestion event or a significant number of consecutive congestion events"

Sure, the cwnd will at some point converge down. Otherwise it would result in congestion collapse, which is not the issue here. The issue is the negative impact on other flows and unfairness.

If an application-limited sender is sending (or the receiver receiving) at a more or less constant rate for a long period of time (alone on the bottleneck), the cwnd may after a while be arbitrarily high, meaning that the sender does not react to congestion for a notably long period of time if new flows join (this is against one of the major cc principles we just agreed on, "must react to a congestion event immediately"). Note that if we have a relatively stable path with a tail-drop bottleneck, then according to the packet conservation principle a sender sending at a constant rate is not subject to regular losses because it does not seek more capacity; other flows need to "push" a bit with good timing, or enough variation in packet delivery is needed, to cause a drop. So, it may take much longer than some linear number of CA epochs per your log-scale equations before such a sender encounters enough drops to get cwnd down to the level of its FlightSize.

@markkukojo

When you survey what the various OS's do, consider weighing the results by the deployed population.

Matt, thanks for participating.

The deployed major stacks do it all differently. One uses FlightSize together with RFC 7661 to address app & rwnd limited case. Others use cwnd with no or varying measures to prevent cwnd from growing on RTTs when the network is not kept full.

OTOH, a significant number of RFCs in the RFC series use FlightSize and we should come up with an RFC that is compatible (and readable) with the other RFCs. And no one is proposing a resolution which just uses FlightSize alone like RFC 5681.

But to the actual point wrt your interesting mag tape example, other experiments, and the "better approach": the mag tape example seems to have essentially the same behaviour as Yuchung's application-limited toy example (but caused by the receiver via rwnd).
We have RFC 7661 with a simple algo that tracks the amount of data ACKed during recent RTTs (at least the 3 most recent RTTs) and maintains a variable PipeACK that stores the highest per-RTT value among the recent RTTs. When entering loss recovery it uses max(PipeACK, FlightSize) to calculate the new reduced value for cwnd. Doesn't that essentially provide the same outcome?
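
A minimal sketch of that bookkeeping, assuming one sample per RTT (illustrative only; RFC 7661 specifies the actual sampling period, validity window, and the reduction itself):

```c
#include <stdint.h>

#define PIPEACK_SAMPLES 3               /* assumed: keep the 3 most recent RTTs */

struct pipeack_state {
    uint32_t samples[PIPEACK_SAMPLES];  /* bytes ACKed in each recent RTT */
    unsigned next;
};

/* Record how much data was ACKed during the RTT that just ended. */
static void pipeack_record_rtt(struct pipeack_state *p, uint32_t acked_bytes)
{
    p->samples[p->next] = acked_bytes;
    p->next = (p->next + 1) % PIPEACK_SAMPLES;
}

/* PipeACK: the highest per-RTT ACKed volume among the recent samples. */
static uint32_t pipeack_value(const struct pipeack_state *p)
{
    uint32_t m = 0;
    for (unsigned i = 0; i < PIPEACK_SAMPLES; i++)
        if (p->samples[i] > m)
            m = p->samples[i];
    return m;
}

/* On entering loss recovery, base the multiplicative decrease on
 * max(PipeACK, FlightSize) rather than FlightSize alone. */
static uint32_t loss_reduction_base(const struct pipeack_state *p,
                                    uint32_t flight_size)
{
    uint32_t pa = pipeack_value(p);
    return pa > flight_size ? pa : flight_size;
}
```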

RFC 7661 differs from your "better approach", though, in the sense that it doesn't keep the non-validated cwnd forever but is required to adjust cwnd after 5 minutes (as currently specified). Is there any reason why one should keep the non-validated cwnd for longer than that?
It does not sound feasible to allow a sender that has first been alone on a path with a bufferbloated bottleneck and acquired a huge cwnd (much more than the bottleneck link capacity), and then gone idle for hours, to restart with the same cwnd, even if it would pace the cwnd's worth of data over one RTT; things may have changed and only a tiny fraction of the capacity may still be available. Even with pacing these packets are uncontrolled data with no ACK clock, causing significant negative impact on other flows, i.e., it is like allowing an enormous initial window with pacing.

@mattmathis

mattmathis commented Oct 15, 2021

the cwnd may after a while be arbitrarily high

No, this is a different bug. Don't raise cwnd unless flight size has reached cwnd at least once since the last cwnd increment, so that you know that the path can actually deliver a cwnd-sized window. Then a multiplicative cwnd reduction does the right thing, even if it happens while inflight is tiny. The case that you mention is not a problem: not only are there no reductions, there are no increases either, which is correct: we have zero new information and the estimate of the network capacity remains constant (and larger than what the application is using).
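
A minimal sketch of that rule (illustrative only; names and structure are made up, not Linux's):

```c
#include <stdint.h>

struct growth_state {
    uint32_t cwnd;
    int cwnd_was_full;   /* set once flight size reaches cwnd */
};

/* Called whenever flight size is updated. */
static void note_flight_size(struct growth_state *g, uint32_t flight_size)
{
    if (flight_size >= g->cwnd)
        g->cwnd_was_full = 1;
}

/* Only grow cwnd if the path has actually delivered a full cwnd of data
 * since the last increase; otherwise keep the estimate frozen. */
static void maybe_increase_cwnd(struct growth_state *g, uint32_t increment)
{
    if (!g->cwnd_was_full)
        return;
    g->cwnd += increment;
    g->cwnd_was_full = 0;   /* require the pipe to fill again first */
}
```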

I was prototyping this type of algorithm when 2581 came out: I dropped a whole pile of work in the dumpster and switched my focus to socket buffer autotuning and TCP instrumentation (web100/ESTATS-MIB). I have to admit I stopped paying much attention to derivative RFCs.

BTW Linux does this correctly - I know nothing about other stacks.

I believe PipeACK is (nearly?) isomorphic to simply not raising cwnd when it isn't controlling the flight size.

As for decaying estimates: I have never been able to make sense of this idea. It isn't the right model for the most common case, which is a bunch of low duty cycle on/off transactional applications sharing a fixed link. Most of the time the old estimate is correct forever, however every so often transactions collide, and (assuming a paced restart and short queues) you get 50% or 66% loss. My preference would be to make the collisions converge in one RTT, by reducing cwnd by the losses (or the pacing rate by losses/RTT). This idea is present in PRR, but is not uniformly applied everywhere it could be. I remain open minded about decaying estimates: I just have never seen an approach that seems right.

Although backwards compatibility with existing standards is admirable, standards that include persistent pathological behaviors under straightforward conditions are unlikely to be deployed at scale. If you want standards to be relevant, you must not specify such behaviors. If a strict implementation of a standard does not work well, then the standard will be doomed to be only partially implemented.

@larseggert
Contributor

Does anyone believe anything else in the current text needs to change before this issue can be closed?

@gorryfair
Contributor

RFC 5681 was clear: ssthresh MUST be set to no more than the value given in equation (4):
ssthresh = max (FlightSize / 2, 2*SMSS)

RFC 2581 is obsolete. RFC7661 provides some rationale and some limits to how a flow might use cwnd here.

Since this is proposed as PS, the present text in this ID certainly appears to make a substantive change to the RFC series, and I think it does not yet provide a justification and analysis of how that would compete with RFC 5681 traffic. If the text remains in this form, I do think this needs to be discussed in a TCPM meeting.

@larseggert
Contributor

Would a way forward be to use @gorryfair's phrasing but add a comment that Linux (and other stacks? which?) update based on cwnd?

@lisongxu
Contributor

Injong, Sangtae and I discussed this issue yesterday. We feel that this issue is orthogonal to CUBIC because CUBIC does not modify fast retransmit and recovery. Thus, we would like to follow the current RFCs (that is, set ssthresh using FlightSize). The choice between cwnd and FlightSize should be discussed in a different context that deals with the modification of fast retransmit and recovery. But we will add the following note (based on @markkukojo's note): "Note that a rate-limited application with idle periods, or periods when it is unable to send at the full rate permitted by the congestion window, may easily encounter notable variations in the volume of data sent from one RTT to another, resulting in a FlightSize that is significantly less than the congestion window at a congestion event. This may decrease the congestion window to a much lower value than necessary. To avoid suboptimal performance with such applications, some implementations of CUBIC use cwnd instead of FlightSize. Alternatively, the mechanisms described in RFC 7661 [RFC7661] may also be adopted to mitigate this issue." @Injongrhee @sangtaeha

@yuchungcheng

yuchungcheng commented Oct 19, 2021 via email

@nealcardwell
Contributor

I agree with Yuchung's remark and feel that, overall, the text from "the following note" in
#114 (comment)
sounds like the best compromise to allow a path forward.

@mattmathis

Here is another possible way to fix the text. Instead of flightsize, use "any recent flight size that was successfully delivered without loss or congestion marks". I believe that this (or something like it) covers all implementations and the prior RFCs. BTW, for true bulk transfers, this has no effect on fairness. It only alters the penalty caused by application pauses.

@goelvidhi
Contributor Author

@lisongxu the text in #114 (comment) looks good to me. Do you want to write a PR? Thanks.

@lisongxu
Contributor

lisongxu commented Oct 20, 2021 via email

@goelvidhi goelvidhi linked a pull request Oct 21, 2021 that will close this issue
larseggert added a commit that referenced this issue Oct 25, 2021
* Fix conflicts

* typo

* add a reference to the ssthresh equation

* Fix subscript

Co-authored-by: Vidhi Goel <goel.vidhi07@gmail.com>

* Refer to equation

Co-authored-by: Vidhi Goel <goel.vidhi07@gmail.com>

* Wrap line

* Remove sentence, as suggested by @goelvidhi

* Resolve conflicts

Co-authored-by: Lars Eggert <lars@eggert.org>
Co-authored-by: Vidhi Goel <goel.vidhi07@gmail.com>