Development: Latency

Pierre Ossman (Work account) edited this page Dec 8, 2016 · 1 revision

This page describes the issues the RFB protocol has with high latency environments and ideas for improving the situation.

Problem Description

RFB was designed to be used with very simple clients and to have a low complexity. As part of this, it uses a very simple flow control system where all screen updates are only sent when the client explicitly asks for them.

The system also forces you to only have a single update request pending at any given time as updates can be aggregated and there is no synchronisation mechanism to determine if format changing commands have gone into effect or not.

The result of this is that the protocol is limited to one update per round trip. So with a network latency of 100 ms (RTT), you can only have 10 updates per second, which is not a smooth experience.

Note: An important concept here is the Bandwidth Delay Product (BDP) which is the amount of data that can be in flight between the server and client. This data is essentially "stored" by the network itself, on wires and fibers and in router and switch buffers. You must be able to provide at least twice this amount of data in order to fully utilise the available bandwidth on the network. The reason for "twice" is that data needs to continue to flow whilst the acknowledgment for the first byte needs to travel back across the network, which doubles the effective latency.

Related Concerns

The major component in the solutions to this problem is flow control, and solving that can cross over into a related issue; automatic encoder handling. To be able to select the best compression settings at a given moment requires knowledge about available CPU time and bandwidth. It is quite possible that any flow control mechanism used to solve the latency issue can also be used as part of the solution to that problem. Such concerns should be considered when designing a new system.

Proposed Solutions

To make RFB less sensitive to latency, the protocol must be changed so that we can have many updates being transmitted concurrently on the wire. There are several ways to achieve this though.

Continuous Request Stream

A simple solution, that doesn't require any protocol changes, is to simply send many update requests to the server, making sure updates are allowed to be sent out at a reasonable rate.

The problem with this solution is the mentioned lack of synchronisation. As soon as you have more than one request pending, then you can no longer send any command to the server that change the format of the updates as you don't know if a received update is in the old or new format.

Another issue is that you will constantly be sending a steady stream of update requests to the server, even if there are no updates that need to be sent. This is waste of bandwidth and can also prevent power saving modes in modern systems.

Lastly, flow control gets very difficult here as there is no clear condition on when to pause the updates requests.

Unsolicited Updates

A variation of the above is to add an extension that tells the server to start sending updates at its leisure, without any update requests. The TightVNC project has experimented with this system in their "Continuous updates" extension.

This method solves the issue of spamming the network with lots of update requests. It still suffers from the synchronisation issues though. TightVNC's solution is to turn the unsolicited updates on and off. It isn't entirely clear how this relates to "normal" updates during transition though.

Flow control becomes a bit more apparent here than in the previous method. The RFB specification states that the protocol can work over any reliable transport, but for the sake of this analysis we will assume it is standard TCP.

The main feedback mechanism we have here is that the socket will start blocking writes. For an overloaded client, this means that:

  • The client receive buffers will contain at least BDP worth of data.
  • The server send buffers will most likely also contain at least BDP worth of data as the client has ACKed all data it has in its buffers and the send buffers are therefore free for new data.

In this state, the next update we want to send will have to wait for not only the network latency, but also for the send buffer to drain, which means doubling the network latency. We also have to wait for the client to process two BDP:s worth of data. This also assumes a single TCP connection between the server and the client, which might not be the case if there are proxies inbetween (like a SSH tunnel). Such proxies will have their own buffers that will fill up and add latency.

In other words, this method can add a lot of extra latency on top of the network delay, which is rather counter-productive in the goal of reducing latency effects.

Advanced Flow Control

Sending a steady stream of [unsolicted] updates from the server is most likely the best way forward, but we need a better feedback mechanism to make sure we do not introduce more latency than necessary. We also need to take care of the synchronisation issue to avoid regressions compared to the current one-update-at-a-time model.

In a perfect world we could rely on the flow control mechanisms of the underlying transport protocol, but unfortunately that information is not available to applications. Instead we'll have to try and infer that information and possibly duplicate some of the work the transport layer does.

Existing Algorithms: TCP

As a source of inspiration, we should look at one of the most commonly implemented flow control systems out there; namely TCP. Basic flow control in TCP is handled using the receiver window. This is something that's fairly easy to handle and not a concern for us at this stage. The difficult portion is congestion control, and a lot of research and testing has been done in this area. The inputs TCP uses for congestion control are:

  • RTT
  • Packet loss

We could fairly easily add an extension to monitor RTT in RFB protocol, using a scheme similar to TCP. Packet loss is something we cannot detect though as all traffic goes on top of a reliable transport. The closest approximation we have is noticing that the send buffer fills up. This only happens after quite a bit of packet loss though, and can also be caused by other things, which makes it difficult to use.

The more popular TCP congestion control algorithms, like Reno, CUBIC, and CTCP, all use packet loss as a primary source of information. Fortunately there are purely RTT based algorithms, like Vegas and derivatives, that are a better match to our conditions.

The basic principle of RTT based algorithms is that there is a minimum RTT, and as long as we have an RTT higher than that, then we are putting too much data on the wire. There are some variations on how to determine the minimum RTT, but they are all based on assuming that the lowest RTT seen in the current connection so far is the minimum.

Protocol Extensions

In order to measure RTT we need to have the client echo something back to the server so that time between sending and receiving can be measured. TCP has to keep track of every packet (for retransmission), so it just piggybacks on that to monitor RTT. We need to add something new as there isn't anything existing that has suitable behaviour.

Simple Messages

The simplest solution is that we add a new message pair. The server will now and then send a request to the client, and the client will send a message back when it sees this. We could also add a payload to make it easier to avoid keeping extra state on the server. Things would work more or less like an ICMP "ping".

Complications with this solution is to define how this command relates to other commands (like screen updates). A client that process things serially might not be an issue, but a more advanced client might read in the update and process it in the background. Should a "ping" following the update be handled once it is read, or not until the screen updated has been fully processed?

Fence

A generalisation of the above would be a fence mechanism. The server message would consist of an id, and flags indicating one or more of:

  • All messages before this should finish processing before sending a response.
  • All messages following this should not start processing until a response has been sent.

This would handle things in a more clear manner (especially when parallel clients are involved), as well as give a flexible mechanism that could be used for other synchronisation in future extensions.

Encoding

Instead of messages, we could do one of the above variants but as a rectangle in a screen update. This would allow us to do measurements at a more fine grained level, which could be useful in larger updates. The downside would be added complexity for parallel clients that would have to synchronise things at a rectangle level instead of at a message level.

We would still need a message for the response back to the server though.

Server Algorithm

With a system to measure RTT to place, the question is how the server algorithm should be implemented. This is something that will probably have to evolve over time, but it should not affect the protocol so it is an area where experimentation is fairly easy.

Compared to TCP, there are a few issues to be aware of:

  • "Packets" will generally be much larger. We cannot send partial updates, so granularity will be very coarse. This will most likely affect convergence behaviours, as well as make startup difficult.
  • Related to the previous point, it will be difficult to have a backlog of data ready to be sent as that backlog will contribute to latency.
  • We might not have a steady state very often. Updates are far more likely to be "bursty" than continous, which might mean that we are in a perpetual slow start scenario.
  • TCP reverts to a small congestion window whenever the connection is idle, further complicating the "bursty" behaviour.
  • We have at least one more lower layer of congestion control, making the conditions we work under vary constantly.

Through a lot of experimentation, it was possibly to find a system that works reasonable well under these conditions. The basic principle is that of TCP Vegas, working constantly in congestion avoidance mode (i.e. no slow start). The second principle is to make a system that reacts slowly and in fairly large steps, to allow congestion control in lower layers to do their magic properly.

The system as implemented in the prototype works as follows:

  • Latency is measured before and after each framebuffer update.
  • The initial window (IW) is set to 16 KiB.
  • An update is allowed to go out as long as the window has any room (a single byte is enough). This means we will often overfill the network. We keep track of the overtaxing and compensate for it when calculating latency.
  • We update the congestion window every other RTT, or 100 ms (whichever is the lowest).
  • We adjust the window by 4 KiB each time, trying to keep 25-50 ms of buffering latency. The lowest RTT seen during the last period is used, as we want to allow bursts, but avoid continuous buffer build up.
  • If there has been no traffic for 2 * "Base RTT", then we revert to a restart window (RW), that's the smallest of the last congestion window or the initial window (IW).

This seems to work well in finding a stable bandwidth that the network and system can support. It does have some problems growing quickly enough on LFN networks, so the growth function probably needs to be revised.

Update

The algorithm above could run into problems with the granularity. When most of the updates were larger than the BDP (and larger than the congestion window), the effect was that only one update could be in flight at any given time. In extreme circumstances this meant using just 50% of the available bandwidth as the wire was idle as we were waiting for the client to process the update, and the fence response to return after that.

The solution was to try to behave like the older implementation when we hit this case. This is achieved by keeping track of how many outstanding fences there are. If there is just one, that means the fence before the update has reached the client and the response has gotten back. The single remaining fence we are waiting at this point is the one after the single update we have on the wire. For the old implementation, we would have gotten a new FramebufferUpdateRequest at this point, so we just emulate that behaviour by allowing another update.

The end result is that we might worsen the latency situation a bit, but we will use the available bandwidth better and give a better update rate. On the whole, this subjectively works better.

Update 2

The growth function was augmented to increase the window by 8 KiB if there was less than 5 ms buffering latency. This made it aggressive enough to better keep up with Linux' CUBIC.

Synchronisation

All of the things so far are about providing the server with more information. Unfortunately that means that the issue of format change synchronisation is still unsolved (as that is about giving the client more information about what's going on). We need some more work to solve this issue.

A simple approach is to extend the fence message (mentioned above) to allow the client to send fences to the server. We would also need another flag:

  • The message right after this should finish processing before sending a response.

The client could then send out a sequence of:

  1. Fence
  2. Disruptive change

Anything the client then sees before the fence response must be unaffected by the change, and everything after the fence response must be affected.

Related Concerns

The RTT measurement allows us to measure both bandwidth and total round trip. This can give some insight into the CPU load of the client as RTT will start increasing if the client is overworked. The server will respond by backing off a bit and the CPU load will go down.

Unfortunately the server cannot tell the difference at this point between lack of bandwidth and lack of CPU power. More information would have to be exported from the client to determine this, e.g. CPU time spent processing an update.

The big question is if such extra information needs (should?) to be part of solving the latency issue, or if it works just as well as a completely independent extension?

References