Race condition in eventloop #36

romanalexander · 2023-11-21T21:12:47Z

I have geyser running inside a k8s pod (containerd runtime). There's been a persistent bug where clients can't connect half the time in a weird pattern like

Connect Attempt #1 works.
Connect Attempt #2 fails.
Connect Attempt #3 works.
... repeat ...

I isolated it down to RakServerOnlineInitialHandler.channelRead0 channel.eventLoop() responsible for RakChildChannel.setActive is somehow happening too slowly. When this happens, datagrams are read while the channel is still inactive. This causes any UpstreamSession.sendPacketImmediately to silently(!!) fail writing the packet out. This behavior is caused solely by RequestNetworkSettingsPacket trying to immediately flush NetworkSettingsPacket that will never actually be sent to the underlying socket.

I think the best solution is to hold all EncapsulatedPackets in a buffer until the channel becomes active. (Also I have no clue why eventLoop is running like this. Maybe the guest OS being stuck with low resolution timers.)

You can create a reproducible test case with:

                channel.eventLoop().execute(() -> {
                    Thread.sleep(5000);
                    channel.setActive(true); // make sure this is set on same thread
                    channel.pipeline().fireChannelActive();
                });

I attached this as a PR with example solution code that fixes the problem for me. Obviously this isn't an ideal solution. Hopefully someone more well versed in netty can write a better solution.

Additional observations:

JRE graalvm-ce-java17-linux-amd64-22.2.0
AMD Ryzen 7 5800X 8-Core Processor
Host Ubuntu Jellyfish, guest Ubuntu Jammy
Tested the problem with TransportMethod.EPOLL and TransportMethod.NIO with the same situation.
This occurs only inside a containerd pod. Running the same hardware, same NIC, same jar, same JVM in the host OS works without issue.
This has been an ongoing issue for a few months.

Alemiz112 · 2023-11-22T11:00:15Z

Hi, thanks for opening the issue. So what I understand the actual issue is that sendPacketImmediately() sends bedrock packet before the channel is active?

Alemiz112 · 2023-11-22T11:02:00Z

Have you tried enabling trace logging? If so, do you see received message from inactive channel or similar messages when this happens?

romanalexander · 2023-11-22T11:38:53Z

Hi, thanks for opening the issue. So what I understand the actual issue is that sendPacketImmediately() sends bedrock packet before the channel is active?

Yes this is accurate. Due to how Geyser works if you delay sending the packet until after the channelActive it'll crash the client because compression is turned on at the same time that packet is sent. Ideally the incoming datagrams are buffered until channelActive or we modify the API for ChannelFuture access and enable compression only after that packet is placed in the outbound buffer.

Have you tried enabling trace logging? If so, do you see received message from inactive channel or similar messages when this happens?

I tried trace logging in org.cloudburst io.netty org.geyser and the only traces look like this (slightly different due to my own debugging lines)

[16:22:03 INFO] [geyser]: /*.*.*.*:61631 tried to connect! In PacketId 5. notBanned? true
[16:22:04 INFO] [geyser]: /*.*.*.*:61631 tried to connect! In PacketId 5. notBanned? true
[16:22:04 TRACE] [org.cloudburstmc.protocol.bedrock.BedrockSession]: Inbound /*.*.*.*:616310: RequestNetworkSettingsPacket(protocolVersion=622)
[16:22:04 INFO] [geyser]: sendPacketImmediately: isClosed? false Sending NetworkSettingsPacket packet to /*.*.*.*:61631
[16:22:04 TRACE] [org.cloudburstmc.protocol.bedrock.BedrockSession]: Outbound /*.*.*.*:616310: NetworkSettingsPacket(compressionThreshold=512, compressionAlgorithm=ZLIB, clientThrottleEnabled=false, clientThrottleThreshold=0, clientThrottleScalar=0.0)

Nothing relevant following until you press CANCEL in minecraft client.

(edit: redact PII)

Alemiz112 · 2023-12-05T12:36:10Z

Hi @romanalexander , can you please test if you are able to reproduce this issue with fix/early-messages branch?

romanalexander · 2023-12-05T13:20:20Z

Hi @romanalexander , can you please test if you are able to reproduce this issue with fix/early-messages branch?

👍 fix/early-messages seems to work great, thanks!

bump netty due CVE-2022-41881 Changes in cloudburst libs: - uses netty 4.1.101.Final - bumps netty-transport-raknet which fixes CloudburstMC/Network#36, an issue where some connection attempts did not work

* bump cloudburst's netty-transport-raknet * bump cloudburst protocol/codec/connection bump bump netty due CVE-2022-41881 Changes in cloudburst libs: - uses netty 4.1.101.Final - bumps netty-transport-raknet which fixes CloudburstMC/Network#36, an issue where some connection attempts did not work

Fix race condition in eventLoop

befcbc4

romanalexander closed this Dec 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition in eventloop #36

Race condition in eventloop #36

romanalexander commented Nov 21, 2023

Alemiz112 commented Nov 22, 2023

Alemiz112 commented Nov 22, 2023

romanalexander commented Nov 22, 2023 •

edited

Loading

Alemiz112 commented Dec 5, 2023

romanalexander commented Dec 5, 2023

Race condition in eventloop #36

Race condition in eventloop #36

Conversation

romanalexander commented Nov 21, 2023

Alemiz112 commented Nov 22, 2023

Alemiz112 commented Nov 22, 2023

romanalexander commented Nov 22, 2023 • edited Loading

Alemiz112 commented Dec 5, 2023

romanalexander commented Dec 5, 2023

romanalexander commented Nov 22, 2023 •

edited

Loading