
RFC 8511 handling of ECE signal (slow start vs congestion avoidance) #85

Closed
goelvidhi opened this issue Aug 31, 2021 · 16 comments · Fixed by #111 or #125
@goelvidhi
Contributor

goelvidhi commented Aug 31, 2021

Markku Kojo said,

ABE (RFC 8511) is currently the only experimental RFC to modify
the TCP-sender response to ECE. ABE allows modifying multiplicative
decrease factor only for AIMD TCP and only when ECE arrives in
congestion avoidance, that is, not when the sender is in slow-start.

Applying a decrease factor of 0.7 (or higher) when a congestion
signal arrives and ends the initial slow start would be
inconsiderate because it extends the convergence time from
the slow-start overshoot. ABE has found that using a larger decrease
factor yields performance improvement when applied in congestion
avoidance, but not otherwise. Do we have data that would support
different findings with CUBIC?

@goelvidhi goelvidhi self-assigned this Sep 1, 2021
@bbriscoe
Contributor

Markku raises a good question about how large the first reduction should be to come out of the flow-startup phase. Most CUBIC implementations resolve this by using a start-up algo that intends to overshoot by less than 2x, so that a reduction of 30% will usually be sufficient without another round.
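
To spell out the arithmetic (roughly): a single 30% reduction fully compensates the overshoot only when start-up exits with cwnd at most about 1.43x the available capacity $C$; beyond that, one reduction still removes most of it.

$$0.7 \cdot cwnd_{exit} \le C \quad\Longleftrightarrow\quad cwnd_{exit} \le \frac{C}{0.7} \approx 1.43\,C$$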

§ 4.10 of the rfc8312bis draft mentions LSS, Hystart and standard RFC5681 SS.

If the draft did not mention RFC5681 slow-start as an option, I think this would resolve the matter (given I believe no CUBIC implementations use it). It also ought to list the Hystart++ draft as the first option, given it is currently adopted for the stds track, so it has potentially higher IETF 'status' than LSS (EXP) or Hystart (research paper).

@goelvidhi
Contributor Author

goelvidhi commented Sep 15, 2021

@bbriscoe Has Hystart++ been adopted (stds track)? I thought this was only a verbal consensus at the last IETF (111).

I can rearrange the order starting with Hystart++, then LSS. As Hystart++ is an update to the Hystart paper, I think we should mention it in some way - not sure how.
Regarding mentioning 5681, I think there are some implementations that still use it; that doesn't mean it's the right thing to put in an RFC. But currently, 5681 is the only approved standards track RFC. So, maybe we can keep that and mention the overshoot issue that comes with it.

@larseggert
Contributor

https://datatracker.ietf.org/doc/draft-ietf-tcpm-hystartplusplus/ exists and says "Intended status: Standards Track" in the document. But the datatracker doesn't have that reflected (yet).

@larseggert larseggert linked a pull request Sep 16, 2021 that will close this issue
@larseggert
Contributor

Please check #111, which intends to address this issue and #90

@goelvidhi
Contributor Author

Yes, this is a dup of #90

@goelvidhi goelvidhi reopened this Sep 21, 2021
@goelvidhi
Contributor Author

Leaving this open until PR #111 gets merged.

@markkukojo

Markku raises a good question about how large the first reduction should be to come out of the flow-startup phase. Most CUBIC implementations resolve this by using a start-up algo that intends to overshoot by less than 2x, so that a reduction of 30% will usually be sufficient without another round.

This logic above seems to assume that a start-up algo (e.g., Hystart++) never fails, that is, that it always exits slow start before the first congestion signal (drop or ECE), resulting in no drops or at least fewer of them. If a TCP sender is still in slow start when congestion is signalled, it means that the overshoot detection has failed and everything occurred just as it would without such an algo.
At least in such a case the TCP sender should not apply a reduction other than 0.5. I cannot speak for Van Jacobson, but I believe that the selection of the MD factor (0.5) was no coincidence; it was carefully selected to match the maximum cwnd increase during slow start (double cwnd in each RTT), as otherwise it would result in intentionally overloading the bottleneck immediately after the congestion event!

So, even if HyStart++ or any such algo is in use, we should not use a decrease factor larger than 0.5 if the TCP sender is still in slow start when the congestion event occurs. And, even though I didn't mention it, this should always be applied with slow start, not only the initial slow start.
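
In other words, something like the following (an illustrative sketch only; the names are arbitrary and this is not from any draft text):

```python
def ssthresh_on_congestion(cwnd: int, in_slow_start: bool) -> int:
    """Multiplicative decrease on loss/ECE, differentiated by phase.

    Keep CUBIC's beta of 0.7 only in congestion avoidance; fall back
    to 0.5 whenever the congestion signal arrives while the sender is
    still in slow start (any slow start, not only the initial one).
    """
    if in_slow_start:
        return max(cwnd // 2, 2)       # MD = 0.5 matches the up-to-2x
                                       # overshoot of per-RTT doubling
    return max(int(cwnd * 0.7), 2)     # MD = 0.7 (beta_cubic) in CA
```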

With Hystart++ we may distinguish between 3 major cases:

  1. The algo succeeds ideally and exits slow start early enough to avoid any pkt losses (or ECE signal).
    This case does not need MD, so everything is fine.

  2. The algo fails to detect the overshoot and exit slow start before a pkt loss is detected (or ECE arrives), i.e., the TCP sender is in slow start when congestion is signalled. In this case, an MD factor larger than 0.5 should not be used.

  3. The algo detects the overshoot and exits slow start before the first pkt loss is detected (or ECE arrives) but cannot avoid losses (or ECE), i.e., it mitigates the impact of the overshoot by reducing the number of losses, which may vary from a single loss up to one less than the number of losses in case 2. The experimental results in the Hystart++ draft suggest that this might be the common case, as there is a ~50% reduction in losses. Because RTOs are still quite common, this may suggest that case 2 is not that uncommon, but more details would be needed to make any reliable conclusions.

Bob's suggestion that a reduction of 30% will usually be sufficient does not sound well advised, because it would not be enough for the average case 3 where half of the losses are avoided (i.e., 1.5x overshoot); it would still result in a cwnd value that is too high, exceeding the actually available capacity. Note that an overshoot by only 1 segment (one drop) basically calls for the same reaction as in normal CA where just one segment is dropped per cycle.

The presence of case 3 also suggests that it may be possible and reasonable to have a heuristic with Hystart++ that varies the MD factor between 0.5 and beta (0.7) depending on how much cwnd exceeded the actual congestion (saturation) point, that is, on how many losses the Hystart++ early exit was able to avoid: if only one or a few losses occur, apply an MD factor of 0.7, and decrease the MD factor towards 0.5 for the cases where the early exit avoided only one or a few losses. A sketch of this idea follows.
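
A rough sketch of what I mean (purely illustrative; the linear interpolation is just one possible choice, not a proposal for this draft):

```python
def md_factor_from_overshoot(cwnd_at_loss: int, est_capacity: int) -> float:
    """Hypothetical future Hystart++ enhancement: scale the MD factor
    between 0.7 (barely overshot, like a single per-cycle CA loss) and
    0.5 (full 2x slow-start overshoot, i.e., the early exit avoided
    nothing).
    """
    overshoot = cwnd_at_loss / est_capacity     # ~1.0 .. 2.0 in theory
    overshoot = min(max(overshoot, 1.0), 2.0)   # clamp to that range
    # Linear interpolation: overshoot 1.0 -> 0.7, overshoot 2.0 -> 0.5.
    return 0.7 - 0.2 * (overshoot - 1.0)
```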

§ 4.10 of the rfc8312bis draft mentions LSS, Hystart and standard RFC5681 SS.

If the draft did not mention RFC5681 slow-start as an option, I think this would resolve the matter (given I believe no Cubic implementations use it). It also ought to list the Hystart++ draft as the first option, given it is currently adopted for the stds track, so it has potentially higher IETF 'status' than LSS (EXP) or Hystart (research paper).

@nealcardwell
Contributor

Yes, IIRC from discussions years ago with Van, the 0.5 MD in Reno was motivated by the expected 2x overshoot from slow-start.

And I definitely have seen many traces where CUBIC requires two consecutive fast recoveries after slow start to converge cwnd to the available BDP + buffer space, due to this issue (two because 0.7^2 = 0.49).
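
That is, for an available capacity $C$ and a full 2x overshoot:

$$0.7 \times 2C = 1.4\,C > C, \qquad 0.7^2 \times 2C = 0.98\,C < C$$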

I agree using a 0.5 MD rather than 0.7 after slow start makes sense, and would probably show some nice benefits.

But this is a pretty significant change, and my sense is that it would be best to get some real world test data from at least one implementation to back up our intuitions before codifying a change this big.

@bbriscoe
Contributor

bbriscoe commented Oct 1, 2021

Need to be careful about mission-creep here.
If we start adding untried stuff, we shouldn't move that to stds track.

Having said that, how to make the reduction match the overshoot is an open question, and we can't get away from the fact that this is an important part of a CC algo.

This is why I said it's good enough to reduce by roughly the average overshoot of the start-up algo, rather than aiming for perfection.

@markkukojo

We have over three decades of experience and consensus that after a slow-start overshoot a TCP sender must do an MD with a factor of 0.5 to match the overshoot. And, as Neal pointed out, this is how it was designed, in order not to intentionally continue with an over-aggressive sending rate during fast recovery, which would be against the congestion control principles.
To change this, a lot of evidence (= well-analysed experimental data) should be provided to back up any claims proposing that it is safe. Right now we have none, as far as I know.

This draft intends to modify MD with the justification that AI is modified to be less aggressive than the current std AI, such that it compensates for the use of a larger MD factor of 0.7. Such justification covers not slow start but only the operation in steady state, i.e., MD when in CA; this draft does not provide any compensation for using an MD factor of 0.7 when in slow start. To do so, it should propose increasing cwnd by at most a factor of 1.428 per RTT in slow start. Otherwise, it must continue using the MD factor of 0.5 when in slow start, because that is known to be safe. And, as I pointed out, using HyStart++ does not solve this, because it is not guaranteed to always work and we do not have a good understanding of how often and how badly it fails.
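
To spell out where the 1.428 comes from: a single reduction by $\beta$ can compensate a per-RTT growth factor $g$ only if

$$g \cdot \beta \le 1 \quad\Longrightarrow\quad g \le \frac{1}{\beta} = \frac{1}{0.7} \approx 1.428$$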

My intention was not to suggest that this draft should seek an MD factor in between 0.5 and 0.7 depending on when HyStart++ exited slow start (i.e., how much it overshot). I just threw out an idea of how HyStart++ could possibly be enhanced in the future to allow an MD factor larger than 0.5 in slow start. Bob's suggestion "to reduce by roughly the average overshoot" is definitely not well advised. It is like suggesting a solution that works roughly half the time and fails otherwise.

@larseggert
Contributor

Please review #111 and make suggestions there if you all believe anything more needs to be said in the document about any of this. I'd like to close this issue soon.

@bbriscoe
Contributor

I don't believe the current text takes account of Markku's valid point, so this issue should not have been closed. If the draft wants to move to stds track, I suggest something like the following added at the end of the first para of §4.10:

Whichever start-up algorithm is used, work might be needed to ensure that the end of slow start and the first multiplicative decrease of congestion avoidance work well together.


One response to Markku below, 'for the record', but it's not relevant to the draft wording...

IMO, it is not ill-advised to aim for the MD to match the average overshoot. When the MD is not exactly equal to the overshoot, it hasn't "failed"; it's just not been perfect.

Equally, with standard Reno, where there's an AQM, the overshoot relative to the AQM's intended target averages around 2x, with lots of randomness on either side. On the Internet there are all sorts of different buffer sizes, and types of AQM with different heuristics that allow for slow-start bursts. So, in practice, at the instant the start-up ends, it's likely to be increasingly imperfect the further the environment is from the average / expected position.

So all those years of experience with an MD of 0.5 matching an expected overshoot of 2x can be said to have supported the idea of aiming for the expectation of the overshoot (relative to the bottleneck's intended target).

If we advised anything other than trying to match the MD to the average overshoot, how would the two parts of the system know what the other was most likely to do? The only approach the network can take is to try to bring the queue back to its target, and it can only assume (if it needs to assume anything) that the CC will also be trying to match its reduction to its expectation of the overshoot.

@larseggert larseggert reopened this Oct 19, 2021
larseggert added a commit that referenced this issue Oct 19, 2021
@larseggert
Contributor

Thanks for the suggestion, @bbriscoe. This is now a proposal in #125.

@markkukojo

I don't believe the current text takes account of Markku's valid point, so this issue should not have been closed. If the draft wants to move to stds track, I suggest something like the following added at the end of the first para of §4.10:

Whichever start-up algorithm is used, work might be needed to ensure that the end of slow start and the first multiplicative decrease of congestion avoidance work well together.

First, while this is a valid sentence, it is not very useful for a standards track document because it provides neither an actual specification nor concrete advice. It might be OK for an experimental RFC, but for a standards track RFC we must know better how to implement this. I think it would be very difficult for a random implementer to understand what to do based on this sentence.

But then to the actual point. Bob seems to focus on ECN (meaning there is an AQM), perhaps because of the title of this issue? Remember that #85 and #86 were folded into one issue, as suggested by Bob.

I'm discussing this for loss-based CC in the first place, and in particular for tail drop, which is still the majority of the Internet and must be properly addressed.

The original design of MD by Van Jacobson rightly handled MD separately depending on whether a pkt loss was detected during slow start or in congestion avoidance (see Appendix C of [Jacob88]).
When a pkt is dropped at a tail-drop queue, half of the current cwnd is guaranteed to work without drops, i.e., this is the available capacity for the flow, as correctly explained in [Jacob88].

Therefore, cwnd must be at least halved; otherwise loss recovery is entered with a cwnd that is overaggressive, and any excess pkts are guaranteed to be dropped. This is very bad because in a slow-start overshoot half of the pkts are dropped by default, meaning that a TCP sender must rexmit half a window of dropped pkts during fast recovery. If we add extra packets on top of that (42% extra pkts with an MD of 0.7), it is more or less guaranteed that some of the rexmits are dropped, resulting in an RTO and very bad performance for a lone flow at a tail-drop bottleneck. If there are competing flows, such an overaggressive sender will have a significant bad impact on the other flows, likely forcing several unnecessary pkt drops on them and thereby gaining an unjustified benefit at the expense of the other traffic (in the worst scenario all dropped pkts may belong to other flows, yielding a significant benefit for the overaggressive flow).

Nothing related to the slow-start overshoot behaviour has been separately studied, nor has data been presented to the wg, in order to justify this behaviour. A very simple study focusing on the initial slow-start behaviour would easily provide more insight into this issue.

Such behaviour is also against the congestion control principles as it is guaranteed to result in sending undelivered packets (see RFC 2914) that unnecessarily eat capacity from other flows sharing the path before the bottleneck.

Moreover, it represents behaviour that RFC 2914 considers very bad:
"Others may deliberately be implemented
with congestion avoidance algorithms that are more aggressive in
their use of bandwidth than other TCP implementations; this would
allow a vendor to claim to have a "faster TCP". The logical
consequence of such implementations would be a spiral of increasingly
aggressive TCP implementations, or increasingly aggressive transport
protocols, leading back to the point where there is effectively no
congestion avoidance and the Internet is chronically congested."

HyStart++ is now considered to be helping here, but it is only intended to be used during the initial slow start! If a sender is in RTO recovery and a pkt is dropped before cwnd reaches ssthresh, it is likely to indicate that there is heavy congestion, and the sender is required to be extremely careful in reducing cwnd appropriately and quickly. Using an overaggressive MD of 0.7 in such a case guarantees an inadequate cwnd decrease and further pkt drops during the loss recovery.

In addition, even when using HyStart++ it may occur during the initial slow start that HyStart++ does not detect the overshoot in time and the sender is still in slow start when a pkt drop occurs. This is the case I pointed to by saying that HyStart++ failed, because it did not have any effect on behaviour, and it is guaranteed that cwnd has a value that is double the available capacity when a packet is dropped. That is why the rule must be that cwnd is halved when a pkt drop occurs during slow start.

When an AQM is in use, things are somewhat different, particularly when the AQM uses probabilistic drop. It is possible that a drop also occurs earlier, such that cwnd does not yet have a value that is double the available capacity. This, however, is unlikely to be the common case, as classic AQMs are known to be sloppy with slow start. They are designed and configured to work "well" with senders in CA. In slow start the cwnd increase rate is so much faster that (most?) AQMs often react too late, i.e., after the average, rather than too early.

Anyway, the behaviour with AQMs (with or without ECN) has not been studied and results have not been presented to the wg, so I find it hard to make anything other than a conservative decision without any evidence when we are considering a stds track document. Experimental would be a quite different decision, and a lot could be left for later experimentation.

[Jacob88] V. Jacobson, Congestion avoidance and control, SIGCOMM '88.

@lisongxu
Contributor

Another way to explain why MD is set to 0.7 is to view it as a binary search for the target = available BDP + buffer space.

Consider the following example,

RTT n: cwnd = x, no packet loss for those packets
RTT n+1: cwnd = 2x, packet loss
Therefore, the target should be between x and 2x.

By setting MD=0.7, CUBIC searches for the target starting from the middle point, that is, approximately equal to 2x * 0.7 = 1.4x.

Setting MD=0.5 is guaranteed to be safe but overly conservative. Setting MD close to 1 is overly aggressive.
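
Spelling out the analogy: the midpoint of the uncertainty interval $[x, 2x]$ is $1.5x$, and an MD of 0.7 starts the search approximately there:

$$\frac{x + 2x}{2} = 1.5\,x, \qquad 0.7 \times 2x = 1.4\,x$$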

@markkukojo

Thanks @lisongxu for your reply.

Another way to explain why MD is set to 0.7 is to view it as a binary search for the target = available BDP + buffer space.

Just to be sure that the terminology I used is clear: I called this the (actual) available (buffering) capacity, i.e., where BDP = base_e2e_RTT * bottleneck bit-rate, meaning that any buffer at the bottleneck adds to the pure network buffering capacity. In other words, the available buffering capacity equals the max cwnd that a flow may have at the tip of the normal congestion avoidance cycle (sawtooth).

Consider the following example,

RTT n: cwnd = x, no packet loss for those packets. RTT n+1: cwnd = 2x, packet loss. Therefore, the target should be between x and 2x.

Well, that is incorrect. x should be the target if a pkt loss is signalled when cwnd = 2x and the sender is still in slow start, because x is the max available (buffering) capacity for the flow that slow start just determined. See the explanation with a simple example below (see also [Jacob88]).

For simplicity, let's consider a single-flow case. Let's take an example network setting and begin with the regular congestion avoidance cycle.
Let's assume a stable network with a tail-drop bottleneck:
BDP = 100 pkts
Bottleneck buffer space: 100 pkts = BDP

When a flow has reached steady state, it drops one pkt during each CA cycle once cwnd exceeds 200 pkts. With Reno CC, cwnd is decreased to ~100, and with CUBIC cwnd is decreased to ~140 and W_max is set to 200 (or to 201 if we are accurate). In both cases the bottleneck link is fully utilised: a Reno flow oscillates between 100 and 200, and a CUBIC flow between 140 and 200. So, the max cwnd for either flow is 200. Is that what you call the target?
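
In numbers: the max cwnd is $BDP + \text{buffer} = 200$, so

$$\text{Reno: } 0.5 \times 200 = 100, \qquad \text{CUBIC: } 0.7 \times 200 = 140,\quad W_{max} = 200$$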

Now, let's consider the initial slow start in the same example environment:

The TCP sender starts with whatever initial cwnd << 100. At some point, after some number of RTTs, cwnd = 100 (the most recent RTT is not necessarily a full RTT). From this point on the sender is fully utilising the bottleneck, and the incoming ACKs for these 100 in-flight segments arrive evenly spread over one RTT. Thereby, during the next RTT (= RTT n in your example) the sender injects 200 segments that are nicely ACK-clocked, and no pkt loss yet occurs. When the ACK for the 1st of these 200 segments arrives, it starts the next RTT (= RTT n+1 in your example) and cwnd becomes 201, making the sender inject two segments, one of which gets dropped. Each subsequent ACK for these 200 segments also increases cwnd by 1 and results in the sender injecting 2 segments. So, during this RTT (RTT n+1) 400 segments are injected into the network and every second segment gets dropped at the bottleneck, because the available capacity is 200. When the first dupack arrives, cwnd = 400, that is, double the available capacity. If the sender applies an MD of 0.5 as correctly designed by Van Jacobson, the resulting cwnd (=200) equals the max cwnd available for the sender (the target?). If the network conditions are stable, there are no drops during the subsequent fast recovery, because the sender is allowed to transmit at most 200 segments per RTT during the fast recovery.

If an MD factor > 0.5 is used, it means that the sender enters fast recovery with cwnd > 200, which guarantees pkt drop(s) during fast recovery. Even worse, when SACK is in use, it is possible that all 200 dropped segments are retransmitted over the next RTT (RTT n+2); if the sender injects more than 200 segments over that RTT, it is likely that some of the retransmitted segments get dropped and an RTO is required!
If an MD of 0.7 is applied, the sender injects 280 segments in RTT n+2 (in fast recovery), resulting in 80 segments being dropped and meaning that very likely at least one of these dropped segments is a rexmitted segment (and quite likely more than one rexmitted segment is dropped).
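
For anyone who wants to check the numbers, here is a toy version of the example (assuming an ideal ACK clock, no delayed ACKs, and a pipe plus buffer that hold 200 pkts):

```python
CAPACITY = 200            # BDP (100 pkts) + bottleneck buffer (100 pkts)

cwnd = 25                 # any initial window well below CAPACITY
while cwnd <= CAPACITY:   # no drops while per-RTT in-flight <= CAPACITY
    cwnd *= 2             # classic slow start: cwnd doubles every RTT

print(cwnd)               # 400: cwnd when the first drop is detected
print(cwnd // 2)          # 200: MD of 0.5 lands exactly on the capacity
print(int(cwnd * 0.7))    # 280: MD of 0.7 leaves 80 excess pkts per RTT
```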

If there are other competing flows sharing the bottleneck, a starting flow using an MD of 0.7, unlike a flow using an MD of 0.5, is overaggressive in sending excess pkts and very likely forces other flows to drop extra pkts during its fast recovery (and the subsequent CA cycle). For each pkt loss imposed on a competing flow, the overaggressive flow gets one extra pkt of its own through and likely forces the other flow(s) to do an MD which they could possibly otherwise have avoided, thereby unfairly stealing capacity from the other flows.

By setting MD=0.7, CUBIC searches for the target starting from the middle point, that is, approximately equal to 2x * 0.7 = 1.4x.

Setting MD=0.5 is guaranteed to be safe but overly conservative. Setting MD close to 1 is overly aggressive.

Setting MD=0.5 when in slow start is not conservative; it is very optimistic, because it allows continuing with a cwnd that equals the slow-start-determined available capacity, leaving no extra room, as we saw in the example above. Therefore, a slow-start overshoot with MD=0.5 may also sometimes result in the loss of a rexmit during fast recovery, because network conditions are not necessarily stable. Note that RFC 5681 slow start does not allow doubling cwnd every RTT when delayed ACKs are in use. This slow-start increase by roughly a factor of 1.5 per RTT leaves quite a good amount of extra room with MD=0.5 and is unlikely to be the reason for any pkt losses during the fast recovery that follows slow start. Someone may argue (quite rightly) that increasing cwnd by roughly a factor of 1.5 per RTT during slow start is too conservative, but that's another discussion; let's not start debating it here.
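
In numbers: with a growth factor of ~1.5x per RTT, the drop is detected at roughly $1.5\,C$, and halving leaves headroom:

$$0.5 \times 1.5\,C = 0.75\,C < C$$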

As I already indicated, selecting the MD is a separate decision for a pkt loss occurring in slow start and for a pkt loss occurring in CA, as discussed in [Jacob88]; MD=0.5 just happened conveniently to become the MD in both cases in the original design, and this MD value has been inherited in the congestion control RFCs without any distinction.

It seems that the above fact was left unnoticed in the CUBIC design, and no experimental data has been gathered to back up the decision to use an MD other than 0.5 when the sender is in slow start.

Moreover, I didn't notice it earlier, but this also means that setting W_max should be done differently if the sender is in slow start when a pkt loss (congestion) is detected and cwnd is reduced.
If CUBIC sets W_max to the size of cwnd just before cwnd was reduced in slow start, just as it does in CA, the result is not only starting the first CA epoch after fast recovery with an overaggressive cwnd (using MD=0.7) above the actual saturation point, but also starting the subsequent CA in the concave phase and continuing in the concave phase until cwnd reaches a value that is double the saturation point. This makes CUBIC (when not in the Reno-friendly region) much more aggressive than intended, because it operates in the concave phase even though the cwnd size actually already exceeds the previous saturation point (which in the above example is 200, not 400).
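
Illustratively (this is just the shape of the fix, not proposed code; the names are arbitrary):

```python
def on_congestion_event(cwnd: int, in_slow_start: bool):
    """Returns (new_cwnd, W_max). After a slow-start overshoot, the
    saturation point the flow just discovered is ~cwnd/2, not the
    pre-reduction cwnd, so W_max should record cwnd/2 there; in CA,
    the usual CUBIC behaviour applies.
    """
    if in_slow_start:
        w_max = cwnd // 2            # e.g., 200 in the example, not 400
        return w_max, w_max          # MD = 0.5; the next concave phase
                                     # then aims at the real saturation point
    return int(cwnd * 0.7), cwnd     # standard CA: W_max = pre-MD cwnd
```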

IMO this issue of which MD factor to use when in slow start is not resolved. Please also add the problem with setting W_max when in slow start to the set of open issues.

@larseggert larseggert mentioned this issue Nov 23, 2021