RFC 8511 handling of ECE signal (slow start vs congestion avoidance) #85
Markku raises a good question about how large the first reduction should be when coming out of the flow-startup phase. Most CUBIC implementations resolve this by using a start-up algo that intends to overshoot by less than 2x, so that a reduction of 30% will usually be sufficient without another round. §4.10 of the rfc8312bis draft mentions LSS, Hystart and standard RFC 5681 slow start. If the draft did not mention RFC 5681 slow start as an option, I think this would resolve the matter (given that, I believe, no CUBIC implementations use it). It also ought to list the Hystart++ draft as the first option, given it is currently adopted for the stds track, so it potentially has higher IETF 'status' than LSS (EXP) or Hystart (research paper).
@bbriscoe Has Hystart++ been adopted (stds track)? I thought this was only a verbal consensus at the last IETF (111). I can rearrange the order to start with Hystart++, then LSS. As Hystart++ is an update to the Hystart paper, I think we should mention it in some way; not sure how.
https://datatracker.ietf.org/doc/draft-ietf-tcpm-hystartplusplus/ exists and says
Yes, this is a dup of #90
Leaving this open until PR #111 gets merged.
The logic above seems to assume that a start-up algo (e.g., Hystart++) never fails. That is, that it always exits slow start before the first congestion signal (drop or ECE), resulting in no drops or only some small number of drops. If a TCP sender is still in slow start when congestion is signalled, it means that the overshoot detection has failed and everything occurred just as it would without such an algo. So, even if HyStart++ or any such algo is in use, we should not use a decrease factor larger than 0.5 if the TCP sender is still in slow start when the congestion event occurs. And, even though I didn't mention it, this should always be applied in slow start, not only in the initial slow start. With Hystart++ we may distinguish between 3 major cases:
Bob's suggestion that a reduction of 30% will usually be sufficient does not sound well advised, because it would not be enough for the average case 3 where half of the losses are avoided (i.e., a 1.5x overshoot): it would still leave cwnd exceeding the actually available capacity. Note that an overshoot of only 1 segment (one drop) basically calls for the same reaction as in normal CA, where just one segment is dropped per cycle. The presence of case 3 also suggests that it may be possible and reasonable to have a heuristic with Hystart++ that varies the MD factor between 0.5 and beta (0.7) depending on how much cwnd exceeded the actual congestion (saturation) point, that is, on how many losses the Hystart++ early exit was able to avoid: if only one or a few losses occur, apply an MD factor of 0.7, and decrease the MD factor towards 0.5 as the overshoot grows, applying 0.5 when the algo avoided only one or a few of the losses.
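This varying-MD idea is not part of any draft; a rough sketch of it could look like the following, where the function name, the capacity estimate and the linear interpolation are my own illustrative assumptions:

```python
def md_factor(cwnd_at_loss, est_capacity, beta=0.7):
    """Hypothetical heuristic (my sketch of the idea above): scale the
    multiplicative-decrease factor between beta and 0.5 according to
    how far cwnd overshot the estimated saturation point."""
    overshoot = cwnd_at_loss / est_capacity
    overshoot = min(max(overshoot, 1.0), 2.0)   # clamp to [1x, 2x]
    # linear interpolation: no overshoot -> beta, full 2x overshoot -> 0.5
    return beta + (0.5 - beta) * (overshoot - 1.0)

print(md_factor(400, 200))  # ~0.5 at a full 2x overshoot
print(md_factor(210, 200))  # ~0.69, close to beta, for a small overshoot
```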
Yes, IIRC from discussions years ago with Van, the 0.5 MD in Reno was motivated by the expected 2x overshoot from slow start. And I have definitely seen many traces where CUBIC requires two consecutive fast recoveries after slow start to converge cwnd to the available BDP + buffer space, due to this issue (two because 0.7^2 = 0.49). I agree that using a 0.5 MD rather than 0.7 after slow start makes sense, and would probably show some nice benefits. But this is a pretty significant change, and my sense is that it would be best to get some real-world test data from at least one implementation to back up our intuitions before codifying a change this big.
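The "two consecutive fast recoveries" point can be checked with a couple of lines of arithmetic (the 200-segment capacity is an illustrative assumption):

```python
# Illustrative numbers: after a 2x slow-start overshoot, one CUBIC
# reduction (beta = 0.7) leaves cwnd well above the available
# capacity, and a second fast recovery is needed because
# 0.7 ** 2 = 0.49 < 0.5.
capacity = 200               # BDP + buffer space, in segments (assumed)
cwnd = 2 * capacity          # cwnd at loss detection after slow start

after_one = cwnd * 0.7       # 280.0 -> still 80 segments over capacity
after_two = after_one * 0.7  # ~196  -> finally below capacity
print(after_one, after_two)
```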
Need to be careful about mission creep here. Having said that, how to make the reduction match the overshoot is an open question, and we can't get away from the fact that this is an important part of a CC algo. This is why I said it's good enough to reduce by roughly the average overshoot of the start-up algo, rather than aiming for perfection.
We have over three decades of experience and consensus that after a slow-start overshoot a TCP sender must do MD with a factor of 0.5 to match the overshoot. And, as Neal pointed out, this is how it was designed in order not to intentionally continue with an over-aggressive sending rate during fast recovery, which would be against the congestion control principles. This draft intends to modify MD with the justification that AI is modified to be less aggressive than the current std AI, such that it compensates for the use of a larger MD factor of 0.7. That justification covers only the operation in steady state, i.e., MD when in CA, not slow start; this draft does not provide any compensation for using an MD factor of 0.7 when in slow start. To do so, it would have to propose increasing cwnd by at most a factor of 1.428 per RTT in slow start. Otherwise, it must continue using the MD factor of 0.5 when in slow start, because that is known to be safe.

And, as I pointed out, using HyStart++ does not solve this because it is not guaranteed to always work, and we do not have a good understanding of how often and how badly it fails. My intention was not to suggest that this draft should seek an MD factor between 0.5 and 0.7 depending on when HyStart++ exited slow start (i.e., how much it overshot); I just threw out an idea for how HyStart++ could possibly be enhanced in the future to allow an MD factor larger than 0.5 in slow start. Bob's suggestion "to reduce by roughly the average overshoot" is definitely not well advised. It is like suggesting a solution that works roughly half the time and fails otherwise.
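The 1.428 figure is just the reciprocal of the proposed MD factor; a quick check, under the assumption that a per-RTT growth factor g must satisfy g × beta ≤ 1 for the post-reduction window to stay within the just-probed capacity:

```python
# 1 / 0.7 is the largest per-RTT slow-start growth factor g for which
# applying an MD of beta = 0.7 right after the overshoot still brings
# cwnd back to (or below) the capacity probed in the previous RTT.
beta = 0.7
max_growth = 1 / beta
print(round(max_growth, 3))  # 1.429, i.e. the ~1.428 factor in the text
```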
Please review #111 and make suggestions there if you all believe anything more needs to be said in the document about any of this. I'd like to close this issue soon.
I don't believe the current text takes account of Markku's valid point, so this issue should not have been closed. If the draft wants to move to the stds track, I suggest something like the following added at the end of the first para of §4.10:
One response to Markku below, 'for the record', but it's not relevant to the draft wording... IMO, it is not ill-advised to aim for the MD to match the average overshoot. When the MD is not exactly equal to the overshoot, it hasn't "failed"; it's just not been perfect. Equally with standard Reno, where there's an AQM, the overshoot relative to the AQM's intended target averages around 2, with lots of randomness either side. There are all sorts of different buffer sizes, and types of AQM with different heuristics that allow for slow-start bursts on the Internet. So, in practice, at the instant start-up ends, the MD is likely to be increasingly imperfect the further the environment is from the average/expected position. So all those years of experience with an MD of 0.5 matching an expected overshoot of 2x can be said to have supported the idea of aiming for the expectation of the overshoot (relative to the bottleneck's intended target). If we advised anything other than trying to match the MD to the average overshoot, how would the two parts of the system know what the other was most likely to do? The only approach the network can take is to try to bring the queue back to its target, and it can only assume (if it needs to assume anything) that the CC will also be trying to match its reduction to its expectation of the overshoot.
First, while this is a valid sentence, it is not very useful for a standards track document because it provides neither an actual specification nor concrete advice. It might be OK for an experimental RFC, but for a standards track RFC we must know better how to implement this. I think it would be very difficult for a random implementer to understand what to do based on this sentence.

But then to the actual point. Bob seems to focus on ECN (meaning there is AQM) because of the title of this issue? Remember that #85 and #86 were folded into one issue as suggested by Bob. I'm discussing this for loss-based CC in the first place, and in particular for tail drop, which is still the majority of the Internet and must be properly addressed. The original design of MD by Van Jacobson rightly handled MD separately depending on whether a pkt loss was detected during slow start or in congestion avoidance (see Appendix C of [Jacob88]). Therefore, cwnd must be at least halved; otherwise loss recovery is entered with a cwnd that is over-aggressive, and any excess pkts are guaranteed to be dropped. This is very bad because in a slow-start overshoot half of the pkts are dropped by default, meaning that a TCP sender must rexmit half a window of dropped pkts during fast recovery. If we add extra packets on top of that (42% extra pkts with an MD of 0.7), it is more or less guaranteed that some of the rexmits are dropped, resulting in an RTO and very bad performance for a lone flow at a tail-drop bottleneck. If there are competing flows, such an over-aggressive sender will have a significantly bad impact on the other flows, likely forcing several unnecessary pkt drops on them and thereby achieving an unjustified benefit for itself at the expense of the other traffic (in the worst scenario all dropped pkts may belong to other flows, yielding a significant benefit to the over-aggressive flow).
Nothing related to the slow-start overshoot behaviour has been separately studied, and no data has been presented to the wg to justify this behaviour. A very simple study focusing on the initial slow-start behaviour would easily provide more insight into this issue. Such behaviour is also against the congestion control principles, as it is guaranteed to result in sending undelivered packets (see RFC 2914) that unnecessarily eat capacity from other flows sharing the path before the bottleneck. Moreover, it represents behaviour that RFC 2119 considers very bad.

HyStart++ is now considered to be helping here, but it is only intended to be used during the initial slow start! If a sender is in RTO recovery and a pkt is dropped before cwnd reaches ssthresh, it is likely to indicate heavy congestion, and the sender is required to be extremely careful in reducing cwnd appropriately and quickly. Using an over-aggressive MD of 0.7 in such a case guarantees an inadequate cwnd decrease and further pkt drops during loss recovery. In addition, even when using HyStart++ it may occur during the initial slow start that HyStart++ does not detect the overshoot in time and the sender is still in slow start when a pkt drop occurs. This is the case I pointed to by saying that HyStart++ failed: it did not have any effect on the behaviour, and it is guaranteed that cwnd is double the available capacity when a packet is dropped. That is why the rule must be that cwnd is halved when a pkt drop occurs during slow start.

When AQM is in use, things are somewhat different, particularly when the AQM uses probabilistic drop. It is possible that a drop occurs earlier, such that cwnd is not yet double the available capacity. This, however, is unlikely to be the common case, as classic AQMs are known to be sloppy with slow start. They are designed and configured to work "well" with senders in CA. In slow start the cwnd increase rate is so much faster that (most?) AQMs often react too late, i.e., after the average, rather than too early. Anyway, the behaviour with AQMs (with or without ECN) has not been studied and results have not been presented to the wg, so I find it hard to make anything other than a conservative decision without any evidence when we are considering a stds track document. Experimental would be quite a different decision, and a lot could be left for later experimentation.

[Jacob88] V. Jacobson, Congestion Avoidance and Control, SIGCOMM '88.
Another way to explain why MD is set to 0.7 is as a binary search for the target = available BDP + buffer space. Consider the following example: in RTT n, cwnd = x and there is no packet loss for those packets; in the next RTT cwnd reaches roughly 2x and losses occur, so the target lies somewhere between x and 2x. By setting MD = 0.7, CUBIC searches for the target starting from the middle point, approximately equal to 2x × 0.7 = 1.4x. Setting MD = 0.5 is guaranteed to be safe but overly conservative. Setting MD close to 1 is overly aggressive.
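A tiny numeric sketch of this binary-search view (illustrative values; x = 100 is my own assumption, not from the comment):

```python
# A loss at cwnd = 2x means the target lies somewhere in [x, 2x];
# MD = 0.7 resumes the search near the midpoint of that range,
# while MD = 0.5 resumes at the known-safe lower bound.
x = 100
cwnd_at_loss = 2 * x
print(cwnd_at_loss * 0.7)  # 140.0, roughly the midpoint of [100, 200]
print(cwnd_at_loss * 0.5)  # 100.0, the conservative lower bound
```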
Thanks @lisongxu for your reply.
Just to be sure that the terminology I used is clear: I called this the (actual) available (buffering) capacity, where BDP = base_e2e_RTT × bottleneck bit-rate, meaning that any buffer at the bottleneck adds to the pure network buffering capacity. In other words, the available buffering capacity equals the max cwnd that a flow may have at the tip of the normal congestion avoidance cycle (saw tooth).
Well, that is incorrect. x should be the target if a pkt loss is signalled when cwnd = 2x and the sender is still in slow start, because x is the max available (buffering) capacity for the flow that slow start just determined. See the explanation with a simple example below (see also [Jacob88]). For simplicity, let's consider a single-flow case.

Let's take an example network setting and begin with the regular congestion avoidance cycle. When a flow has reached steady state, it drops one pkt during each CA cycle once cwnd exceeds 200 pkts. With Reno CC, cwnd is decreased to ~100, and with CUBIC cwnd is decreased to ~140 and W_max is set to 200 (or to 201 if we are accurate). In both cases the bottleneck link is fully utilised: a Reno flow oscillates between 100 and 200, and a CUBIC flow between 140 and 200. So the max cwnd for either flow is 200. Is that what you call the target?

Now, let's consider the initial slow start in the same example environment. The TCP sender starts with some initial cwnd << 100. After some number of RTTs, cwnd = 100 (the most recent RTT is not necessarily a full RTT). From this point on the sender is fully utilising the bottleneck, and the incoming ACKs for those 100 in-flight segments arrive evenly spread over one RTT. Thereby, during the next RTT (= RTT n in your example) the sender injects 200 segments that are nicely ACK-clocked, and no pkt loss occurs yet. When the ACK for the first of these 200 segments arrives, it starts the next RTT (= RTT n+1 in your example) and cwnd becomes 201, making the sender inject two segments, one of which gets dropped. Each subsequent ACK for these 200 segments also increases cwnd by 1 and results in the sender injecting 2 segments. So, during this RTT (RTT n+1) 400 segments are injected into the network, and every second segment gets dropped at the bottleneck because the available capacity is 200.
When the first dupack arrives, cwnd = 400, that is, double the available capacity. If the sender applies an MD of 0.5, as correctly designed by Van Jacobson, the resulting cwnd (= 200) equals the max cwnd available to the sender (the target?). If the network conditions are stable, there are no drops during the subsequent fast recovery, because the sender is allowed to transmit at most 200 segments per RTT during fast recovery. If an MD factor > 0.5 is used, the sender enters fast recovery with cwnd > 200, which guarantees pkt drop(s) during fast recovery. Even worse, when SACK is in use, it is possible that all 200 dropped segments are retransmitted over the next RTT (RTT n+2); if the sender injects more than 200 segments over that RTT, it is likely that some of the retransmitted segments get dropped and an RTO is required! If there are other competing flows sharing the bottleneck, a starting flow using an MD of 0.7, unlike a flow using an MD of 0.5, is over-aggressive in sending excess pkts and will very likely force other flows to drop extra pkts during its fast recovery (and the subsequent CA cycle). For each pkt loss imposed on a competing flow, the over-aggressive flow gets one extra pkt of its own through and is likely to force the other flow(s) to do an MD which they could possibly have otherwise avoided, thereby stealing capacity from the other flows unfairly.
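The arithmetic of this example condenses into a few lines (the variable names are mine; the numbers come from the example itself):

```python
# Single-flow example: the path plus buffer holds 200 segments, and
# the first dupack after the slow-start overshoot arrives with
# cwnd at double that capacity.
CAPACITY = 200               # max in-flight segments (BDP + buffer)
cwnd_at_loss = 2 * CAPACITY  # cwnd = 400 when the first dupack arrives

results = {}
for beta in (0.5, 0.7):
    new_cwnd = round(cwnd_at_loss * beta)
    results[beta] = new_cwnd
    print(f"beta={beta}: cwnd after MD = {new_cwnd}, "
          f"excess over capacity = {new_cwnd - CAPACITY}")
# beta=0.5 lands exactly on the 200-segment capacity; beta=0.7 enters
# fast recovery 80 segments over it, guaranteeing further drops.
```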
Setting MD=0.5 when in slow start is not conservative; it is very optimistic, because it allows continuing with a cwnd that equals the slow-start-determined available capacity, not leaving any extra room, as we saw in the example above. Therefore, a slow-start overshoot with MD=0.5 may also sometimes result in the loss of a rexmit during fast recovery, because network conditions are not necessarily stable. Note that RFC 5681 slow start does not allow doubling cwnd every RTT when delayed ACKs are in use. This slow-start increase of roughly a factor of 1.5 per RTT leaves quite a good amount of extra room with MD=0.5 and is unlikely to be the reason for any pkt losses during the fast recovery that follows slow start. Someone may argue (quite rightly) that increasing cwnd by roughly a factor of 1.5 per RTT during slow start is too conservative, but that's another discussion; let's not start debating it here.

As I already indicated, selecting MD is a separate decision for a pkt loss occurring in slow start and for a pkt loss occurring in CA, as discussed in [Jacob88]; MD=0.5 just happened, conveniently, to become the MD in both cases by the original design, and this MD value has been inherited in the congestion control RFCs without any distinction. It seems that the above fact was left unnoticed in the CUBIC design, and no experimental data has been gathered to back up the decision on which MD factor to use when the sender is in slow start? Moreover, I didn't notice it earlier, but this also means that setting W_max should be different if the sender is in slow start when a pkt loss (congestion) is detected and cwnd is reduced. IMO this issue of which MD factor to use when in slow start is not resolved. Please also add the problem of setting W_max when in slow start to the set of open issues.
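As a sketch only (this is my reading of the W_max remark, not proposed draft text), the slow-start/CA distinction for both the reduction and W_max could look like:

```python
# Hypothetical sketch: if congestion is detected while still in slow
# start, the capacity just discovered is roughly cwnd / 2, so both the
# reduction and the recorded W_max would use that value rather than
# the raw cwnd at loss detection.
def on_congestion(cwnd, in_slow_start, beta=0.7):
    """Return (w_max, new_cwnd) for a loss/ECE signal (hypothetical)."""
    if in_slow_start:
        return cwnd / 2, cwnd * 0.5   # halve: slow start doubled past capacity
    return cwnd, cwnd * beta          # normal congestion-avoidance reduction

print(on_congestion(400, True))   # (200.0, 200.0)
```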