[net, ktcp] Important bug fixes and enhancements #211

Merged
Mellvik merged 2 commits into master from networking on Dec 7, 2025
Conversation

Owner

@Mellvik Mellvik commented Dec 6, 2025

This PR includes the following fixes and updates to ktcp:

  • Fix for a problem appearing when a recipient acknowledges several packets in the same ACK - originally reported and subsequently discussed in [inet] telnetd seems to hang in a "TCP Retransmit" situation on ELKS. This would cause the inflight counter, part of the SlowStart/CongestionAvoidance mechanism, to get out of sync and eventually hang the connection. More details below.
  • The code includes a commented out but easily accessible mechanism to induce outgoing packet loss in ktcp - for future debugging (in deveth.c).
  • Fix for a problem appearing when ktcp is running low on memory for retrans packets. More details below.
  • Add the new get_time() mechanism to reliably get system time (jiffies) replacing direct access to the kernel jiffies variable in ktcp, as discussed in [ktcp] A work-in-progress checkin, lots of enhancements & fixes #189 (comment) and Investigation into kernel jiffies variable access and data corruption ghaerr/elks#2456.
  • More elaborate comments have been added to the code in several places to help understand the code, the choices made, and not least those NOT made.
  • Revert a former 'optimization' which sent an outgoing packet to the driver level before adding it to the retrans queue. It turns out (thank you @ghaerr) that metadata updates in add_to_retrans are required before sending.
  • Some rearrangement of (still voluminous) debug code

The first commit includes some old and/or unused code for reference. This is deleted in the second commit.

<TL;DR>

The inflight counter problem was discovered more or less by accident (by @swausd) and the root cause turned out to be hardware. The bug was real enough though, and brought up some interesting questions. Like: should so-called DUP ACKs - repeated ACKs of a previously acknowledged packet - be ignored or reacted upon? DUP ACKs appear when a recipient either discovers a missing packet or gets impatient - a receive timeout. Both are rare. Losing a packet is extremely rare in today's switched networks - even between very fast systems. Being the slowest kid on the block, losing a packet from ktcp is close to impossible.

Impatience (timeouts) is a different story. Two ktcp-based systems - say, real XT systems - are likely to see timeouts every once in a while, quite often on a busy system: a recipient times out and ACKs the previous packet again (DUP ACK), while most likely the recipient - being as slow as the sender - has received the packet but hasn't processed it yet. Or the sender hasn't had time to send it yet.
In both cases - and this is the point - doing nothing (patience) is the simple way to handle the 'problem'. In the first case, the sender's retrans timer will kick in and the un-ACKed packet will be retransmitted. In the second case, the recipient will discover the 'missing' packet in its in-queue and be happy.

So what's the problem here? A simple bug. The inflight counter, which keeps track of the number of 'packets in transit' - packets sent but not ACKed - wrongly assumed one ACK per packet. Which, amazingly often, is the case. When a telnet connection to a ktcp-based system sends a long listing (like from the hd command) to the connected 'terminal', the sender often gets ahead of the recipient (many small packets), but the recipient still ACKs every single packet. However, if a packet stays missing (i.e. lost), the following happens: the recipient keeps sending DUP ACKs until the sender finally (re)sends the missing packet. By then the recipient has received many, maybe tens of packets ahead of the missing one, so it acknowledges all of them in one single ACK. Which means our inflight counter is suddenly far off, which eventually will screw up the connection.
Simple problem, simple fix, serious consequences.

BTW - the 'simple fix' is simple, but not perfect. The inflight counter will still be slightly off at times, but minimally so, and it will automagically resync thanks to the initial if-statement in the fix:

            if (cb->inflight == 1 || cb->send_una == cb->send_nxt)
                cb->inflight = 0;       /* all outstanding packets ACKed */

The 100% alternative would be to run through the retrans list and account for packets with sequence numbers lower than the last ACK. I tried that (code included in the first commit), but even that turns into a nightmare because on a fast system, we would end up scanning the retrans buffer several times between its regular updates and thus counting the same packets more than once. In order to get this right we would have to yank every counted packet from the buffer, which would be fine except we don't really want to do this kind of work here, where fast processing of an ACK packet is the focus. It's the task of a timed retrans loop - as it should be.

Then the obvious question: should ktcp do things differently, like let the DUP ACK trigger a retransmission? That would be technically simple - and fast - but there are two reasons not to do it: 1) Lost or dropped packets are extremely rare, so the delay introduced by doing nothing is not really a problem. And 2) acting immediately on a DUP ACK would disturb the most common case, the slow-systems situation described above: it would trigger unnecessary retransmissions, which is detrimental to efficiency. So if we were to do something, it would have to be on the 2nd or 3rd DUP ACK, which in turn would mean a per-connection counter, more complexity, etc. Not worth it.

The 'out of retrans memory' fix is just a matter of deleting code that most likely was put in there when ktcp was very different from what it is today. Every time ktcp sends a TCP packet it saves a copy and puts it into the retrans list. Memory is allocated on the fly and freed when the packet has been ACKed or has timed out in the buffer - or the connection closed. If there isn't enough space in the retrans buffer, ktcp will hold back and tell the sending application to come back later, typically in 100ms. This way, overflowing or overextending the retrans buffer is impossible. It will grow slightly over the limit (set in config.h) at times, but minimally so.

The point is that checking for overflow elsewhere is superfluous and useless because there is nothing to do. The sender(s) have already been put on hold as described before. In this particular (now deleted) case, the entire retransbuffer was purged. Admittedly, no one would notice unless a retransmission was required on an active connection. Which is what happened as I was heavy-testing the 'inflight-fix' above with artificially induced packet loss.
A bug, OK, but serious? Depends on the setting. When slow systems are communicating there will be occasional retransmissions because of timeouts - without any real packet loss - and the connections would hang if the retransbuffer got slightly over the limit. Hang, with a not very informative message.

</TL;DR>

Why this long rant about something seemingly innocuous? It's educational. To me, as I have to think everything through again, and for you who actually resisted the TL;DR temptation at the beginning.


ghaerr commented Dec 6, 2025

Thank you for the detailed explanation of the fix with the inflight counter, I was wondering about that.

Should ktcp do things differently, like let the DUP ACK trigger a retransmission?

Again, thanks for the detailed discussion from your perspective. While I had previously thought implementing DUP ACK support would be good because it might speed up recovery from some errors, I now agree with you that it's not worth it. A bigger reason I see for not trying to "fix" it is that for networking, reliability is far more important than minor efficiency improvements. The work and testing you've put in to get both slow and fast systems working reliably is far more important than saving a few seconds in a somewhat rare hardware-induced situation. Thus, I'm firmly in your camp now about waiting on "improvements" until they heavily affect a user's work, rather than taking the chance of decreasing reliability for a minor gain.

Why this long rant about something seemingly innocuous? It's educational.

I couldn't agree more. Thanks for the long descriptions!!

My plan is to wait a bit for things to settle down from users in both TLVC and ELKS, then again compare changes in this and/or subsequent PRs, and bring them over to ELKS, to keep the majority of ktcp in sync. I've left out most of the experimental and/or commented-out code, but everything else remains.

On another note, I've had a full epiphany about kernel interrupt processing, and now realize that, even though interrupts are enabled during interrupt handling routines, actual new hardware interrupts (of lower priority) are always disabled. In particular, of course, all interrupts are disabled during the IRQ 0 timer handler execution, which computes all sorts of stuff (like CPU usage calcs, the disk I/O spin wheel, serial pump routines, etc). This all happens before the IRQ 0 EOI, which means way too much time is spent with interrupts effectively disabled - certainly for fast serial input, as well as network packets, etc.

Thus - I'm going to implement Linux-style "bottom-half" interrupts into the kernel. While initially optional for device drivers, this effectively divides an interrupt handler into two parts, where the top half does the minimum required to read/write hardware data, and then "schedules" the bottom half to run after the EOI. This means that other interrupts can still take place without being blocked, and it all works because the EOI remains at the end of the top half. I think with this improvement we can get continuous 38400 baud serial input with no drops, ever, and we might see a gain with network packets not being lost (need more info on that from you). More discussion coming in some preliminary PRs.


ghaerr commented Dec 7, 2025

The 'out of retrans memory' fix is just a matter of deleting code that most likely was put in there when ktcp was very different than today.

Yes, I added that code in the early days to prevent ktcp from running out of memory. Glad to hear it's no longer needed.

Owner Author

Mellvik commented Dec 7, 2025

On another note, I've had a full epiphany about kernel interrupt processing [...] I'm going to implement Linux-style "bottom-half" interrupts into the kernel. [...] More discussion coming in some preliminary PRs.

Holy cr... @ghaerr - you're right! I've been pondering this since your post arrived - this is actually big. It is certain to change the response time of ELKS/TLVC significantly in many/most situations, and improve not only serial, but overall system response.
Thanks, I will be following your work on this closely!
